Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-16 Thread Neil Brown
On Wednesday July 15, dan.j.willi...@intel.com wrote:
> [ Cc: Neil ]
> 
> On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen wrote:
> > On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> >> Currently the udev rules use incremental assembly like this:
> >> mdadm -I /dev/mdraid-member
> >>
> >> There are 2 problems with this:
> >> 1) When doing this for native mdraid metadata arrays, if only
> >>     one disk is present the set never gets activated
> >> 2) When doing this for imsm metadata arrays, as soon as the
> >>     first disk is incrementally added, the set gets activated
> >>     in degraded mode and stays that way, the second disk
> >>     will get added to the container, but not to the actual
> >>     sets in the container
> >
> > FWIW, this incremental assembly business in mdadm is actually not a very
> > good idea. At least not the current implementation. I'm not sure whether
> > it's still a Fedora-ism or whether it's something that's in upstream
> > mdadm yet. I'm talking about this udev rule
> >
> >  /lib/udev/rules.d/65-md-incremental.rules:
> >  # This file causes block devices with Linux RAID (mdadm) signatures to
> >  # automatically cause mdadm to be run.
> >  # See udev(8) for syntax
> >
> >  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
> >        IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
> >        RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
> >
> > For example, if the user plugs in a random old disk that happens to
> > contain half of a RAID1 mirror, then the incremental assembly bits set
> > up an inert md device and the user is left to his own devices to
> > sort this out when partitioning tools etc. tell him that the disk
> > (or a partition of it) he just plugged in is "busy" (it is claimed by
> > the inert md node).
> >
> > I actually had to add some extra code to the GNOME Disk Utility bits to
> > handle such things (stop inert md devices) - makes the user experience
> > quite a bit worse since there's now an extra state to worry about. And
> > most current users don't use the UI bits yet for this so they get extra
> > confused when trying to use e.g. parted(8) or fdisk(8) on the device.
> >
> > FWIW, I'd wish people would stop playing games like this. If you want to
> > do auto-assembly at the system-level, at the very least don't leave the
> > system in a state like this. For example, one way to do auto-assembly
> > without such bugs would be to use libudev to enumerate all md component
> > devices with the same MD_UUID. Then you count the number of components
> > and only start the array if the number of components equals MD_DEVICES.
> > That's much better than incrementally adding to an md device node that
> > might never get used.

Yes:  auto-assembly is hard, and easy to get wrong.

While I don't claim that the current scheme is at all perfect, I don't
think your suggestion is a clear improvement.
The whole point of RAID is to survive drive failure, and that includes
drives being missing.
So I don't think "completely ignore the array if not all expected
drives are present" is the correct answer.

It is very easy to remove unwanted raid metadata 
(mdadm --zero-superblock), and making that easily accessible from a
GUI would probably be a good and useful thing, and might solve some
problems for some people.

One thing that I have contemplated is for md to not claim exclusive
ownership of drives until the array is activated and switched to
read-write.  That would address the 'my drive was stolen by md'
problem, but it may well create other problems in its place.

My general goal at present is to make mdadm sufficiently flexible that
a distro can choose a suitable policy and implement it.  If someone
comes up with a policy that works convincingly well, I could then make
that the default approach that mdadm takes.
There is certainly still room for improvement and I am happy to
discuss possibilities.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe initramfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-16 Thread Neil Brown
On Wednesday July 15, dan.j.willi...@intel.com wrote:
> 
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2].  I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
> 
> Thoughts?

This is exactly the functionality that the 'AUTO' line provides.
Arrays that are listed in mdadm.conf will always be assembled when they
are found.
Arrays that are not listed may or may not be assembled depending on
their metadata type and the setting of 'AUTO'.

So
  AUTO -all

will disable all auto-assemble of arrays that are not listed in
mdadm.conf.
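
As a minimal sketch of how an initramfs generator might use this,
assuming a host-specific image that should assemble only the root
array: the shell function below prints such an mdadm.conf.  The
function name, the default device name /dev/md0 and the UUID in the
usage line are hypothetical; only the ARRAY keyword and the
'AUTO -all' line are mdadm.conf syntax.

```shell
# Print a minimal mdadm.conf that lists only the root array.
# $1: array UUID (the value used in the usage line is made up)
# $2: device name, defaults to /dev/md0 (an assumption for this sketch)
root_only_mdadm_conf() {
    printf 'ARRAY %s UUID=%s\n' "${2:-/dev/md0}" "$1"
    # Refuse auto-assembly of every array that is not listed above.
    printf 'AUTO -all\n'
}
```

For example, 'root_only_mdadm_conf 3aaa0122:29827cfa:5331ad66:ca767371'
prints one ARRAY line followed by 'AUTO -all'.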

NeilBrown


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-16 Thread Harald Hoyer

On 07/14/2009 05:00 PM, David Zeuthen wrote:

On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:

On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:

Hi,
On 07/14/2009 03:39 PM, Doug Ledford wrote:

On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member

Hmmm...does dracut use udev during initramfs time?

Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.

I'm not sure I like or agree with that philosophy.  I absolutely
*don't* want my / filesystem or raid device treated like some plug in,
temporary, roaming raid device.  They *aren't* the same, not in terms
of importance to the running of the machine and not in terms of
reliability requirements.  By using mdadm -A in the mkinitrd calls, I
was able to put in an mdadm.conf file and limit what arrays get
started to arrays found non-ambiguously in that mdadm.conf file and
identified by UUID.  When you switch to incremental assembly for root,
you risk the possibility of name space collisions and non-
deterministic bring up of your / array.


I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].

Anyway, my point with all this is that maybe we are going about things
wrong in the initramfs. My understanding is that dracut roughly works
this way (please let me know if this is wrong)

  1. when generating the initramfs image, we leave information in
     the kernel command-line about the root filesystem - typically
     the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344

  2. when the initramfs starts, we trigger all uevents and wait for
     things to settle

  3. Autoassembly / magic:

     - If we see e.g. md components, we activate them via udev rules
     - If we see e.g. LUKS devices, we unlock them (by interacting with
       the user asking for the passphrase) via udev rules.
     - Ditto for e.g. LVM

  4. if we see the rootfs (matching on e.g. the UUID passed on the
     kernel command line) we create the /dev/root symlink

  5. when the system has settled (e.g. no more uevents) we mount
     /dev/root and transition to non-early user space. If there
     is no /dev/root link, we bail out

Now, my beef is 3. above. I think it is way too optimistic to just
auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
work not related to the rootfs that is better done in non-early user
space.

Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.

  ROOTFS=UUID=1234  \ # this is the UUID of my rootfs
  MD_ASSEMBLE=UUID=4567 \ # assemble MD array with UUID 4567
  LUKS_UNLOCK=UUID=89ab   # unlock LUKS device with UUID 89ab

which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.

Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here

http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html

since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.

Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. This is a good thing, as it leaves one less
config file to worry about.

Thanks for considering, and sorry for the long mail,
David

[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged in. FWIW, the
complaints range from both nonsensical (irritated users: "these
desktop k

Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-16 Thread Victor Lowther
On Wed, Jul 15, 2009 at 7:16 PM, Jeremy Katz wrote:
> On Wednesday, July 15 2009, Dan Williams said:
>> mdadm-3.0 has facilities to prevent assembly of certain metadata types
>> [1] or arrays with certain uuids [2].  I wonder if we also need a
>> facility to prevent auto-assembly of arrays *not* listed in
>> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
>> only identify the root array and all other randomly identified md
>> devices would be ignored (rather than assembled with a foreign name).
>>
>> Thoughts?
>
> There is no mdadm.conf in the initramfs -- in fact, the initramfs may
> not even be generated on the system that you're booting and instead be
> "generic" for the kernel in question

Still, it sounds like a good feature to be added for a --hostonly initramfs.

(especially on systems that are attaching to iscsi and/or fibre channel luns)

> Jeremy


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-15 Thread Jeremy Katz
On Wednesday, July 15 2009, Dan Williams said:
> mdadm-3.0 has facilities to prevent assembly of certain metadata types
> [1] or arrays with certain uuids [2].  I wonder if we also need a
> facility to prevent auto-assembly of arrays *not* listed in
> mdadm.conf?  So the mdadm.conf file installed in the initramfs would
> only identify the root array and all other randomly identified md
> devices would be ignored (rather than assembled with a foreign name).
> 
> Thoughts?

There is no mdadm.conf in the initramfs -- in fact, the initramfs may
not even be generated on the system that you're booting and instead be
"generic" for the kernel in question

Jeremy


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-15 Thread Dan Williams
[ Cc: Neil ]

On Tue, Jul 14, 2009 at 7:30 AM, David Zeuthen wrote:
> On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
>> Currently the udev rules use incremental assembly like this:
>> mdadm -I /dev/mdraid-member
>>
>> There are 2 problems with this:
>> 1) When doing this for native mdraid metadata arrays, if only
>>     one disk is present the set never gets activated
>> 2) When doing this for imsm metadata arrays, as soon as the
>>     first disk is incrementally added, the set gets activated
>>     in degraded mode and stays that way, the second disk
>>     will get added to the container, but not to the actual
>>     sets in the container
>
> FWIW, this incremental assembly business in mdadm is actually not a very
> good idea. At least not the current implementation. I'm not sure whether
> it's still a Fedora-ism or whether it's something that's in upstream
> mdadm yet. I'm talking about this udev rule
>
>  /lib/udev/rules.d/65-md-incremental.rules:
>  # This file causes block devices with Linux RAID (mdadm) signatures to
>  # automatically cause mdadm to be run.
>  # See udev(8) for syntax
>
>  SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
>        IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
>        RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"
>
> For example, if the user plugs in a random old disk that happens to
> contain half of a RAID1 mirror, then the incremental assembly bits set
> up an inert md device and the user is left to his own devices to
> sort this out when partitioning tools etc. tell him that the disk
> (or a partition of it) he just plugged in is "busy" (it is claimed by
> the inert md node).
>
> I actually had to add some extra code to the GNOME Disk Utility bits to
> handle such things (stop inert md devices) - makes the user experience
> quite a bit worse since there's now an extra state to worry about. And
> most current users don't use the UI bits yet for this so they get extra
> confused when trying to use e.g. parted(8) or fdisk(8) on the device.
>
> FWIW, I'd wish people would stop playing games like this. If you want to
> do auto-assembly at the system-level, at the very least don't leave the
> system in a state like this. For example, one way to do auto-assembly
> without such bugs would be to use libudev to enumerate all md component
> devices with the same MD_UUID. Then you count the number of components
> and only start the array if the number of components equals MD_DEVICES.
> That's much better than incrementally adding to an md device node that
> might never get used.
>
> I've complained to Doug about this already for Fedora but, since it's
> still broken and, AFAICT, on its way to upstream mdadm, it's worth
> reiterating the complaint.
>
> Thanks,
> David
>
> [1] : And, except for booting, it's not clear to me that you want to
> have policy like auto-assembling RAID arrays at the system. I'd leave
> such policy to desktop bits where the user can control it and the
> software can actually interact with the user. And where it's easy to
> turn off features like this.
>

mdadm-3.0 has facilities to prevent assembly of certain metadata types
[1] or arrays with certain uuids [2].  I wonder if we also need a
facility to prevent auto-assembly of arrays *not* listed in
mdadm.conf?  So the mdadm.conf file installed in the initramfs would
only identify the root array and all other randomly identified md
devices would be ignored (rather than assembled with a foreign name).

Thoughts?

Thanks,
Dan

[1]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=31015d57
[2]: http://neil.brown.name/git?p=mdadm;a=commitdiff;h=112cace6


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread David Zeuthen
On Tue, 2009-07-14 at 12:59 +0200, Hans de Goede wrote:
> Currently the udev rules use incremental assembly like this:
> mdadm -I /dev/mdraid-member
> 
> There are 2 problems with this:
> 1) When doing this for native mdraid metadata arrays, if only
> one disk is present the set never gets activated
> 2) When doing this for imsm metadata arrays, as soon as the
> first disk is incrementally added, the set gets activated
> in degraded mode and stays that way, the second disk
> will get added to the container, but not to the actual
> sets in the container

FWIW, this incremental assembly business in mdadm is actually not a very
good idea. At least not the current implementation. I'm not sure whether
it's still a Fedora-ism or whether it's something that's in upstream
mdadm yet. I'm talking about this udev rule

 /lib/udev/rules.d/65-md-incremental.rules:
 # This file causes block devices with Linux RAID (mdadm) signatures to
 # automatically cause mdadm to be run.
 # See udev(8) for syntax
 
 SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
       IMPORT{program}="/sbin/mdadm --examine --export $tempnode", \
       RUN+="/bin/bash -c '[ ! -f /dev/.in_sysinit ] && mdadm -I $env{DEVNAME}'"

For example, if the user plugs in a random old disk that happens to
contain half of a RAID1 mirror, then the incremental assembly bits set
up an inert md device and the user is left to his own devices to
sort this out when partitioning tools etc. tell him that the disk
(or a partition of it) he just plugged in is "busy" (it is claimed by
the inert md node).

I actually had to add some extra code to the GNOME Disk Utility bits to
handle such things (stop inert md devices) - makes the user experience
quite a bit worse since there's now an extra state to worry about. And
most current users don't use the UI bits yet for this so they get extra
confused when trying to use e.g. parted(8) or fdisk(8) on the device.

FWIW, I'd wish people would stop playing games like this. If you want to
do auto-assembly at the system-level, at the very least don't leave the
system in a state like this. For example, one way to do auto-assembly
without such bugs would be to use libudev to enumerate all md component
devices with the same MD_UUID. Then you count the number of components
and only start the array if the number of components equals MD_DEVICES.
That's much better than incrementally adding to an md device node that
might never get used.
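
The counting rule can be sketched in shell, purely as an illustration.
It assumes that something else (hypothetically, a loop running
"mdadm --examine --export" on each component) has already written one
"MD_UUID MD_DEVICES" pair per discovered member to standard input. The
function name complete_arrays is made up for this example, and it uses
awk rather than libudev.

```shell
# Read one "MD_UUID MD_DEVICES" pair per discovered member on stdin and
# print every UUID whose member count equals its MD_DEVICES value.
complete_arrays() {
    awk '{ seen[$1]++; want[$1] = $2 }
         END { for (u in seen) if (seen[u] == want[u]) print u }' | sort
}
```

Under this rule an array with a missing member never shows up in the
output, so a degraded array would not be started.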

I've complained to Doug about this already for Fedora but, since it's
still broken and, AFAICT, on its way to upstream mdadm, it's worth
reiterating the complaint.

Thanks,
David

[1] : And, except for booting, it's not clear to me that you want to
have policy like auto-assembling RAID arrays at the system. I'd leave
such policy to desktop bits where the user can control it and the
software can actually interact with the user. And where it's easy to
turn off features like this.




Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread David Zeuthen
On Tue, 2009-07-14 at 10:14 -0400, Doug Ledford wrote:
> On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:
> > Hi,
> > On 07/14/2009 03:39 PM, Doug Ledford wrote:
> >> On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:
> >>> Hi,
> >>>
> >>> As you probably know I'm working on making Fedora 12 use mdraid
> >>> instead of dmraid for Intel BIOS-RAID setups.
> >>>
> >>> The installer (anaconda) part is mostly done (needs more testing)
> >>> and now I'm looking at implementing support for this in dracut
> >>> (the new mkinitrd for Fedora 12).
> >>>
> >>> So I've been testing how this works for both imsm mdraid sets
> >>> and native mdraid metadata sets, in both cases using a 2 disk
> >>> mirror, so that the set can also be brought up in degraded mode.
> >>>
> >>> Currently the udev rules use incremental assembly like this:
> >>> mdadm -I /dev/mdraid-member
> >>
> >> Hmmm...does dracut use udev during initramfs time?
> >
> > Yes, it uses udev for everything, making discovery of / consistent
> > with the discovery of other storage devices.
> 
> I'm not sure I like or agree with that philosophy.  I absolutely  
> *don't* want my / filesystem or raid device treated like some plug in,  
> temporary, roaming raid device.  They *aren't* the same, not in terms  
> of importance to the running of the machine and not in terms of  
> reliability requirements.  By using mdadm -A in the mkinitrd calls, I  
> was able to put in an mdadm.conf file and limit what arrays get  
> started to arrays found non-ambiguously in that mdadm.conf file and  
> identified by UUID.  When you switch to incremental assembly for root,  
> you risk the possibility of name space collisions and non- 
> deterministic bring up of your / array.

I'm concerned about this too. To be more specific, I'm concerned about
both automatically assembling things like RAID arrays / LVM logical
volumes and also automounting devices [1].

Anyway, my point with all this is that maybe we are going about things
wrong in the initramfs. My understanding is that dracut roughly works
this way (please let me know if this is wrong)

 1. when generating the initramfs image, we leave information in
    the kernel command-line about the root filesystem - typically
    the UUID - e.g. root=UUID=786263c4-5e28-4cdc-97b8-1ab6e221c344

 2. when the initramfs starts, we trigger all uevents and wait for
    things to settle

 3. Autoassembly / magic:

    - If we see e.g. md components, we activate them via udev rules
    - If we see e.g. LUKS devices, we unlock them (by interacting with
      the user asking for the passphrase) via udev rules.
    - Ditto for e.g. LVM

 4. if we see the rootfs (matching on e.g. the UUID passed on the
    kernel command line) we create the /dev/root symlink

 5. when the system has settled (e.g. no more uevents) we mount
    /dev/root and transition to non-early user space. If there
    is no /dev/root link, we bail out

Now, my beef is 3. above. I think it is way too optimistic to just
auto-assemble / unlock etc. everything. E.g. we end up doing a lot of
work not related to the rootfs that is better done in non-early user
space.

Instead, just like we specify the UUID for rootfs on the command-line,
we need to leave some instructions to the initramfs logic on _exactly_
what things should be autoassembled / unlocked / etc. in order to find
the rootfs. So the kernel command-line wouldn't really be "just" the
UUID of rootfs; it would be a whole recipe of actions to do. E.g.

 ROOTFS=UUID=1234  \ # this is the UUID of my rootfs
 MD_ASSEMBLE=UUID=4567 \ # assemble MD array with UUID 4567
 LUKS_UNLOCK=UUID=89ab   # unlock LUKS device with UUID 89ab

which would work for e.g. cases where rootfs is on a LUKS device which
is on a MD array. In other words, we'd need a whole "recipe" passed to
the initramfs (the mkinitrd tool would generate this recipe), not just
the UUID of the rootfs.
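
A rough sketch of how initramfs logic might read such a recipe, assuming
the keys are passed as whitespace-separated KEY=VALUE words exactly as
in the example above. The function name recipe_value is hypothetical,
and a real implementation would read the words from /proc/cmdline
instead of taking them as an argument.

```shell
# Print the value of one recipe key from a kernel-command-line style string.
# $1: key, e.g. MD_ASSEMBLE
# $2: the command line text (whitespace-separated KEY=VALUE words)
recipe_value() {
    for word in $2; do
        case "$word" in
            # Strip only the leading "KEY=", so "UUID=4567" stays intact.
            "$1"=*) printf '%s\n' "${word#"$1"=}" ;;
        esac
    done
}
```

With the example recipe, 'recipe_value MD_ASSEMBLE "$cmdline"' would
yield UUID=4567, which the initramfs could then use to pick the one
array to assemble.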

Coincidentally, if we had something like this and the format of the
"recipe" was documented somewhere, it would be easy to e.g. implement
"rescue" functionality as described here

http://www.redhat.com/archives/fedora-desktop-list/2009-July/msg00019.html

since graphical disk utilities would just find /etc/grub.conf (or
similar), read the recipe and then start assembling/unlocking bits and
mount them as appropriate in /mnt/rescue/.

Actually this is very close to what Doug is asking for when he says
(paraphrased) "just include mdadm.conf instead of this magic". The key
difference, however, is that the user _won't_ have to use mdadm.conf or
care about config files - it's all taken care of by the mkinitrd binary
when building the recipe. This is a good thing, as it leaves one less
config file to worry about.

Thanks for considering, and sorry for the long mail,
David

[1] : As some background information, I've spent a good chunk of my
life, five years or so, dealing with end users complaining about how
plain block devices got automounted when they were plugged

Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread Doug Ledford

On Jul 14, 2009, at 11:02 AM, Hans de Goede wrote:

Hi,
On 07/14/2009 03:39 PM, Doug Ledford wrote:

On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member


Hmmm...does dracut use udev during initramfs time?


Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.


I'm not sure I like or agree with that philosophy.  I absolutely  
*don't* want my / filesystem or raid device treated like some plug in,  
temporary, roaming raid device.  They *aren't* the same, not in terms  
of importance to the running of the machine and not in terms of  
reliability requirements.  By using mdadm -A in the mkinitrd calls, I  
was able to put in an mdadm.conf file and limit what arrays get  
started to arrays found non-ambiguously in that mdadm.conf file and  
identified by UUID.  When you switch to incremental assembly for root,  
you risk the possibility of name space collisions and non- 
deterministic bring up of your / array.






Are we going to be totally changing this with
dracut and F12? This method very nicely resolves the issues you  
posted.




Yes.

Regards,

Hans



--

Doug Ledford 

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband








Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread Hans de Goede

Hi,

On 07/14/2009 03:39 PM, Doug Ledford wrote:

On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member


Hmmm...does dracut use udev during initramfs time?


Yes, it uses udev for everything, making discovery of / consistent
with the discovery of other storage devices.




Are we going to be totally changing this with
dracut and F12? This method very nicely resolves the issues you posted.



Yes.

Regards,

Hans


Re: RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread Doug Ledford

On Jul 14, 2009, at 6:59 AM, Hans de Goede wrote:

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member


Hmmm...does dracut use udev during initramfs time?  mkinitrd didn't,  
so this would be a change.  In particular, I didn't have these  
problems with mkinitrd because I didn't use udev rules in the initrd,  
I ran mdadm -A instead.  In fact, the F11 method of bringup of raid  
devices is as such:


initrd: use mdadm -As --run in /etc/mdadm.conf>
rc.sysinit: use mdadm -As --run (no md device name, which means all  
arrays listed in mdadm.conf will get brought up, plus extra arrays not  
listed in mdadm.conf but which can be found and identified by metadata)
udev: in 65-md-incremental.rules use mdadm -I (but only
if /dev/.in.rcsysinit does not exist, so the udev incremental rules do
not run until after the system is up and running; that is, they only
apply to hot plugged devices.  In particular, the udev rule never runs
on a device that was present at boot; the previous two calls catch
those devices, and those calls will run degraded arrays.  This lets me
safely refuse to run degraded arrays in the udev rules file without
risking a failed boot.  A degraded hot plugged array needs minor
manual intervention, but the system will be fully up and operational
no matter what)


I find this setup to be a rather safe, conservative way of handling md  
raid array hot plug.  Are we going to be totally changing this with  
dracut and F12?  This method very nicely resolves the issues you posted.



There are 2 problems with this:
1) When doing this for native mdraid metadata arrays, if only
  one disk is present the set never gets activated
2) When doing this for imsm metadata arrays, as soon as the
  first disk is incrementally added, the set gets activated
  in degraded mode and stays that way, the second disk
  will get added to the container, but not to the actual
  sets in the container

And these 2 problems have 2 different solutions:
1) An incomplete, but potentially activatable in degraded mode
  set can be activated using mdadm --run /dev/md#
2) One can stop this problem by using:
  mdadm -I --no-degraded /dev/mdraid-member
  instead (this does not change anything for
  native mdraid metadata format sets)
  But if that is done, the sets in the container never get
  activated, this can be fixed by running
  mdadm -I /dev/md# on the container device

So my proposed solution for this is when udev is done scanning
(when the event queue is empty, detected using the same mechanism as
dracut is using for dmraid), do the following:

For each /dev/md#
 run mdadm --export --detail, and get the MD_LEVEL
 if MD_LEVEL == "container":
   mdadm -I /dev/md#
 else
   mdadm --run /dev/md#

This will:
1) Bring up raid sets inside containers (such as imsm raidsets)
2) Bring up incomplete raid sets in degraded mode where possible

I'll post a patch implementing this later today.

Regards,

Hans



--

Doug Ledford 

GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband








RFC: mdadm and bringing up raid sets from initrd (dracut)

2009-07-14 Thread Hans de Goede

Hi,

As you probably know I'm working on making Fedora 12 use mdraid
instead of dmraid for Intel BIOS-RAID setups.

The installer (anaconda) part is mostly done (needs more testing)
and now I'm looking at implementing support for this in dracut
(the new mkinitrd for Fedora 12).

So I've been testing how this works for both imsm mdraid sets
and native mdraid metadata sets, in both cases using a 2 disk
mirror, so that the set can also be brought up in degraded mode.

Currently the udev rules use incremental assembly like this:
mdadm -I /dev/mdraid-member

There are 2 problems with this:
1) When doing this for native mdraid metadata arrays, if only
   one disk is present the set never gets activated
2) When doing this for imsm metadata arrays, as soon as the
   first disk is incrementally added, the set gets activated
   in degraded mode and stays that way, the second disk
   will get added to the container, but not to the actual
   sets in the container

And these 2 problems have 2 different solutions:
1) An incomplete, but potentially activatable in degraded mode
   set can be activated using mdadm --run /dev/md#
2) One can stop this problem by using:
   mdadm -I --no-degraded /dev/mdraid-member
   instead (this does not change anything for
   native mdraid metadata format sets)
   But if that is done, the sets in the container never get
   activated, this can be fixed by running
   mdadm -I /dev/md# on the container device

So my proposed solution for this is when udev is done scanning
(when the event queue is empty, detected using the same mechanism as
dracut is using for dmraid), do the following:

For each /dev/md#
  run mdadm --export --detail, and get the MD_LEVEL
  if MD_LEVEL == "container":
    mdadm -I /dev/md#
  else
    mdadm --run /dev/md#

This will:
1) Bring up raid sets inside containers (such as imsm raidsets)
2) Bring up incomplete raid sets in degraded mode where possible
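
The pseudocode above could be sketched in shell roughly as follows. This
is an illustration under stated assumptions, not the patch itself: the
function names are made up, the MDADM variable exists only so the
command can be swapped out for testing, and the /dev/md[0-9]* glob is
an assumed naming scheme for the md device nodes.

```shell
# Decide what to do with one md device once udev has settled.
# $1: device node; $2: MD_LEVEL as reported by "mdadm --detail --export".
finish_md_device() {
    if [ "$2" = "container" ]; then
        # Container (e.g. imsm): assemble the sets inside it.
        ${MDADM:-mdadm} -I "$1"
    else
        # Native array: start it even if it is incomplete.
        ${MDADM:-mdadm} --run "$1"
    fi
}

# Walk all md device nodes and finish each one according to its level.
finish_all_md_devices() {
    for dev in /dev/md[0-9]*; do
        [ -b "$dev" ] || continue
        level=$(${MDADM:-mdadm} --detail --export "$dev" |
                sed -n 's/^MD_LEVEL=//p')
        finish_md_device "$dev" "$level"
    done
}
```

finish_all_md_devices would be called once, after the udev event queue
has drained.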

I'll post a patch implementing this later today.

Regards,

Hans