Re: Time to deprecate old RAID formats?

2007-11-02 Thread Doug Ledford
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
 Doug Ledford wrote:
 
  I would argue that ext[234] should be clearing those 512 bytes.  Why
  aren't they cleared  
  
  Actually, I didn't think msdos used the first 512 bytes for the same
  reason ext3 doesn't: space for a boot sector.
  
 
 The creators of MS-DOS put the superblock in the bootsector, so that the 
 BIOS loads them both.  It made sense in some diseased Microsoft 
 programmer's mind.
 
 Either way, for RAID-1 booting, the boot sector really should be part of 
 the protected area (and go through the MD stack.)

It depends on what you are calling the protected area.  If by that you
mean outside the filesystem itself, and in a non-replicated area like
where the superblock and internal bitmaps go, then yes, that would be
ideal.  If you mean in the file system proper, then that depends on the
boot loader.

   The bootloader should 
 deal with the offset problem by storing partition/filesystem-relative 
 pointers, not absolute ones.

Grub2 is on the way to this, but it isn't there yet.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Friday October 26, [EMAIL PROTECTED] wrote:
  
Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?





Those names seem good to me.  I wonder if it is safe to generate them
in -Eb output

  
If you agree that they are better, using them in the obvious places 
would be better now than later. Are you going to put them in the 
metadata options as well? Let me know; looking at the documentation is
on my list for next week, and I could include some text.

Maybe the key confusion here is between version numbers and
revision numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. Here is my version of what happened, now
let's hear yours.
When you have multiple revisions, you do assume ongoing improvement.

v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which
itself is a revision of the v0...
  


Like kernel releases, people assume that the first number means *big* 
changes, the second incremental change.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-30 Thread Doug Ledford
On Tue, 2007-10-30 at 07:55 +0100, Luca Berra wrote:

 Well it might be a matter of personal preference, but i would prefer
 an initrd doing just the minimum necessary to mount the root filesystem
 (and/or activating resume from a swap device), and leaving all the rest
 to initscripts, than an initrd that tries to do everything.

The initrd does exactly that.  The rescan for superblocks does not
happen in the initrd or mkinitrd; it must be done manually.  The code in
mkinitrd uses the mdadm.conf file as it stands, but in the initrd image
it doesn't start all the arrays, just the ones needed to get booted
into your / partition.
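
A minimal sketch of what that amounts to inside the initrd (not the actual
mkinitrd output; the UUID here is made up):

  # mdadm.conf as copied into the initrd image
  DEVICE partitions
  ARRAY /dev/md0 uuid=84788b67:3bb99ff4:049d5fbb:5719985b

  # assemble only the array holding /, using the config above
  mdadm --assemble /dev/md0

Any other arrays in the real /etc/mdadm.conf are left for the normal
initscripts to start.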

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Sun, Oct 28, 2007 at 01:47:55PM -0400, Doug Ledford wrote:

On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:

On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
It was only because I wasn't using mdadm in the initrd and specifying
uuids that it found the right devices to start and ignored the whole
disk devices.  But, when I later made some more devices and went to
update the mdadm.conf file using mdadm -Eb, it found the devices and
added it to the mdadm.conf.  If I hadn't checked it before remaking my
initrd, it would have hosed the system.  And it would have passed all
the above is not clear to me, afair redhat initrd still uses
raidautorun,


RHEL does, but this is on a personal machine I installed Fedora on, and the
latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
starts the needed devices using the UUID.  My first sentence above
should have read that I *was* using mdadm.

ah, ok i should look again at fedora's mkinitrd, last one i checked was
6.0.9-1 and i see mdadm was added in 6.0.9-2


 which iirc does not work with recent superblocks,
so you used uuids on kernel command line?
or you use something else for initrd?
why would remaking the initrd break it?


Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk devices and its UUID.  Therein would
have been the problem.

yes, i read the patch, i don't like that code, as i don't like most of
what has been put in mkinitrd from 5.0 onward.
Imho the correct thing here would not have been copying the existing
mdadm.conf but generating a safe one from output of mdadm -D (note -D,
not -E)


the tests you can throw at it.  Quite simply, there is no way to tell
the difference between those two situations with 100% certainty.  Mdadm
tries to be smart and start the newest devices, but Luca's original
suggestion of skip the partition scanning in the kernel and figure it
out from user space would not have shown mdadm the new devices and would
have gotten it wrong every time.
yes, in this particular case it would have; congratulations, you found a new
creative way of shooting yourself in the foot.


Creative, not so much.  I just backed out of what I started and tried
something else.  Lots of people do that.


maybe mdadm should do checks when creating a device to prevent this kind
of mistake.
i.e.
if creating an array on a partition, check the whole device for a
superblock and refuse in case it finds one

if creating an array on a whole device that has a partition table,
either require --force, or check for superblocks in every possible
partition.


What happens if you add the partition table *after* you make the whole
disk device and there are stale superblocks in the partitions?  This
still isn't infallible.

It depends on what you do with that partitioned device *after* having
created the partition table.
- If you try again to run mdadm on it (and the above is implemented) it
would fail, and you will be given a chance to wipe the stale sb.
- If you don't, and use them as plain devices _and_ leave the line in
mdadm.conf, you will suffer a lot of pain. Since the problem is known and
since fdisk/sfdisk/parted already do a lot of checks on the device, this
could be another useful one.
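
A sketch of doing that check and cleanup by hand (/dev/sda3 is only an
example device):

  # see whether a leftover md superblock is still present on the partition
  mdadm --examine /dev/sda3

  # if it is stale, erase it (only on a device that is not in a running array)
  mdadm --zero-superblock /dev/sda3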

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:

 Remaking the initrd installs the new mdadm.conf file, which would have
 then contained the whole disk devices and its UUID.  Therein would
 have been the problem.
 yes, i read the patch, i don't like that code, as i don't like most of
 what has been put in mkinitrd from 5.0 onward.
 Imho the correct thing here would not have been copying the existing
 mdadm.conf but generating a safe one from output of mdadm -D (note -D,
 not -E)

I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:

On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:


Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk devices and its UUID.  Therein would
have been the problem.
yes, i read the patch, i don't like that code, as i don't like most of
what has been put in mkinitrd from 5.0 onward.

in case you wonder i am referring to things like

emit dm create $1 $UUID $(/sbin/dmsetup table $1)


Imho the correct thing here would not have been copying the existing
mdadm.conf but generating a safe one from output of mdadm -D (note -D,
not -E)


I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?


mdadm -D /dev/mdX works on an active md device, so i strongly doubt the
information gathered from there would be stale,
while mdadm -Es will scan disk devices for md superblocks, thus
possibly even finding stale superblocks or leftovers.
I would strongly recommend against blindly doing mdadm -Es >
/etc/mdadm.conf and not supervising the result.
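
The two approaches being contrasted, as a sketch; either way the output
should be reviewed before any of it goes into /etc/mdadm.conf:

  # from the currently active arrays only, i.e. what -D sees
  mdadm --detail --scan

  # from the on-disk superblocks, i.e. what -E sees: this also covers arrays
  # that are not assembled, but may pick up stale or leftover superblocks
  mdadm --examine --scan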

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote:
 On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
 On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
 
  Remaking the initrd installs the new mdadm.conf file, which would have
  then contained the whole disk devices and its UUID.  Therein would
  have been the problem.
  yes, i read the patch, i don't like that code, as i don't like most of
  what has been put in mkinitrd from 5.0 onward.
 in case you wonder i am referring to things like
 
 emit dm create $1 $UUID $(/sbin/dmsetup table $1)

I make no judgments on the dm setup stuff, I know too little about the
dm stack to be qualified.

  Imho the correct thing here would not have been copying the existing
  mdadm.conf but generating a safe one from output of mdadm -D (note -D,
  not -E)
 
 I'm not sure I'd want that.  Besides, what makes you say -D is safer
 than -E?
 
 mdadm -D /dev/mdX works on an active md device, so i strongly doubt the
 information gathered from there would be stale,
 while mdadm -Es will scan disk devices for md superblocks, thus
 possibly even finding stale superblocks or leftovers.
 I would strongly recommend against blindly doing mdadm -Es >
 /etc/mdadm.conf and not supervising the result.

Well, I agree that blindly doing mdadm -Esb > mdadm.conf would be bad,
but that's not what mkinitrd is doing; it's using the mdadm.conf that's
in place, so you can update the mdadm.conf whenever you find it
appropriate.

And I agree -D has less chance of finding a stale superblock, but it's
also true that it has no chance of finding non-stale superblocks on
devices that aren't even started.  So, as a method of getting all the
right information in the event of system failure and rescuecd boot, it
leaves something to be desired ;-)  In other words, I'd rather use a
mode that finds everything and lets me remove the stale than a mode that
might miss something.  But, that's a matter of personal choice.
Considering that we only ever update mdadm.conf automatically during
installs, and after that the user makes manual mdadm.conf changes
themselves, they are free to use whichever they prefer.

The one thing I *do* like about mdadm -E over -D is that it includes the
superblock format in its output.  The one thing I don't like is that it
almost universally gets the name wrong.  What I really want is a brief
query format that gives me both the right name (-D) and the superblock
format (-E).
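
Lacking such a mode, a rough two-step approximation (the device names are
only examples):

  # the array name and membership, from the active device
  mdadm --detail --brief /dev/md0

  # the superblock format, from one of its member devices
  mdadm --examine /dev/sda1 | grep -i version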

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
 
 Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
 beginning? Isn't hindsight wonderful?
 

Those names seem good to me.  I wonder if it is safe to generate them
in -Eb output

Maybe the key confusion here is between version numbers and
revision numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. Here is my version of what happened, now
let's hear yours.
When you have multiple revisions, you do assume ongoing improvement.

v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which
itself is a revision of the v0...

NeilBrown


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Mon, Oct 29, 2007 at 07:05:42PM -0400, Doug Ledford wrote:

And I agree -D has less chance of finding a stale superblock, but it's
also true that it has no chance of finding non-stale superblocks on

Well it might be a matter of personal preference, but i would prefer
an initrd doing just the minimum necessary to mount the root filesystem
(and/or activating resume from a swap device), and leaving all the rest
to initscripts, than an initrd that tries to do everything.


devices that aren't even started.  So, as a method of getting all the
right information in the event of system failure and rescuecd boot, it
leaves something to be desired ;-)  In other words, I'd rather use a
mode that finds everything and lets me remove the stale than a mode that
might miss something.  But, that's a matter of personal choice.

In case of a rescuecd boot, you will probably not have any md devices
activated, and you will probably run mdadm -Es to check what md arrays are
available; the data should still be on the disk, else you would be hosed
anyway.

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Luca Berra

On Sat, Oct 27, 2007 at 04:09:03PM -0400, Doug Ledford wrote:

On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:

On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
 On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
 just apply some rules, so if you find a partition table _AND_ an md
 superblock at the end, read both and you can tell if it is an md on a
 partition or a partitioned md raid1 device.

In fact, no you can't.  I know, because I've created a device that had
both but wasn't a raid device.  And its matching partner still existed
too.  What you are talking about would have misrecognized this
situation, guaranteed.
then just ignore the device and log a warning, instead of doing a random
choice.
L.


It also happened to be my OS drive pair.  Ignoring it would have
rendered the machine unusable.


I wonder what would have happened if it got it wrong

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Luca Berra

On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:

On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:

On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:

 In fact, no you can't.  I know, because I've created a device that had
 both but wasn't a raid device.  And its matching partner still existed
 too.  What you are talking about would have misrecognized this
 situation, guaranteed.

Maybe we need a 2.0 superblock that contains the physical size of every
component, not just the logical size that is used for RAID. That way if
the size read from the superblock does not match the size of the device,
you know that this device should be ignored.


In my case that wouldn't have helped.  What actually happened was I
created a two disk raid1 device using whole devices and a version 1.0
superblock.  I know a version 1.1 wouldn't work because it would be
where the boot sector needed to be, and wasn't sure if a 1.2 would work
either.  Then I tried to make the whole disk raid device a partitioned
device.  This obviously put a partition table right where the BIOS and
the kernel would look for it whether the raid was up or not.  I also

the only reason i can think of for the above setup not working is udev
mucking with your device too early.


tried doing an lvm setup to split the raid up into chunks and that
didn't work either.  So, then I redid the partition table and created
individual raid devices from the partitions.  But, I didn't think to
zero the old whole disk superblock.  When I made the individual raid
devices, I used all 1.1 superblocks.  So, when it was all said and done,
I had a bunch of partitions that looked like a valid set of partitions
for the whole disk raid device and a whole disk raid superblock, but I
also had superblocks in each partition with their own bitmaps and so on.

OK


It was only because I wasn't using mdadm in the initrd and specifying
uuids that it found the right devices to start and ignored the whole
disk devices.  But, when I later made some more devices and went to
update the mdadm.conf file using mdadm -Eb, it found the devices and
added it to the mdadm.conf.  If I hadn't checked it before remaking my
initrd, it would have hosed the system.  And it would have passed all

the above is not clear to me, afair redhat initrd still uses
raidautorun, which iirc does not work with recent superblocks,
so you used uuids on kernel command line?
or you use something else for initrd?
why would remaking the initrd break it?


the tests you can throw at it.  Quite simply, there is no way to tell
the difference between those two situations with 100% certainty.  Mdadm
tries to be smart and start the newest devices, but Luca's original
suggestion of skip the partition scanning in the kernel and figure it
out from user space would not have shown mdadm the new devices and would
have gotten it wrong every time.

yes, in this particular case it would have; congratulations, you found a new
creative way of shooting yourself in the foot.

maybe mdadm should do checks when creating a device to prevent this kind
of mistake.
i.e.
if creating an array on a partition, check the whole device for a
superblock and refuse in case it finds one

if creating an array on a whole device that has a partition table,
either require --force, or check for superblocks in every possible
partition.
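
A manual approximation of those checks from the shell (/dev/sda and its
partitions are only examples):

  # before creating an array on /dev/sda1, look for a superblock on the whole disk
  mdadm --examine /dev/sda && echo "whole-disk superblock present, not creating"

  # before using the whole disk, check its partition table and every partition
  sfdisk -l /dev/sda
  for p in /dev/sda[0-9]*; do mdadm --examine "$p"; done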

L.
--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Doug Ledford
On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
 On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
 On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
  On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
  
   In fact, no you can't.  I know, because I've created a device that had
   both but wasn't a raid device.  And its matching partner still existed
   too.  What you are talking about would have misrecognized this
   situation, guaranteed.
  
  Maybe we need a 2.0 superblock that contains the physical size of every
  component, not just the logical size that is used for RAID. That way if
  the size read from the superblock does not match the size of the device,
  you know that this device should be ignored.
 
 In my case that wouldn't have helped.  What actually happened was I
 created a two disk raid1 device using whole devices and a version 1.0
 superblock.  I know a version 1.1 wouldn't work because it would be
 where the boot sector needed to be, and wasn't sure if a 1.2 would work
 either.  Then I tried to make the whole disk raid device a partitioned
 device.  This obviously put a partition table right where the BIOS and
 the kernel would look for it whether the raid was up or not.  I also
 the only reason i can think for the above setup not working is udev
 mucking with your device too early.

It was a combination of boot loader issues and an inability to get this
device partitioned up the way I needed.  I went with a totally different
setup in the end because I essentially started out with a two drive
raid1 for the OS and another 2 drive raid1 for data, but I wanted to
span them and I was attempting to do so with a mixture of md raid and
lvm physical volume striping.  Didn't work.

 tried doing an lvm setup to split the raid up into chunks and that
 didn't work either.  So, then I redid the partition table and created
 individual raid devices from the partitions.  But, I didn't think to
 zero the old whole disk superblock.  When I made the individual raid
 devices, I used all 1.1 superblocks.  So, when it was all said and done,
 I had a bunch of partitions that looked like a valid set of partitions
 for the whole disk raid device and a whole disk raid superblock, but I
 also had superblocks in each partition with their own bitmaps and so on.
 OK
 
 It was only because I wasn't using mdadm in the initrd and specifying
 uuids that it found the right devices to start and ignored the whole
 disk devices.  But, when I later made some more devices and went to
 update the mdadm.conf file using mdadm -Eb, it found the devices and
 added it to the mdadm.conf.  If I hadn't checked it before remaking my
 initrd, it would have hosed the system.  And it would have passed all
 the above is not clear to me, afair redhat initrd still uses
 raidautorun,

RHEL does, but this is on a personal machine I installed Fedora on, and the
latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
starts the needed devices using the UUID.  My first sentence above
should have read that I *was* using mdadm.

  which iirc does not work with recent superblocks,
 so you used uuids on kernel command line?
 or you use something else for initrd?
 why would remaking the initrd break it?

Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk devices and its UUID.  Therein would
have been the problem.

 the tests you can throw at it.  Quite simply, there is no way to tell
 the difference between those two situations with 100% certainty.  Mdadm
 tries to be smart and start the newest devices, but Luca's original
 suggestion of skip the partition scanning in the kernel and figure it
 out from user space would not have shown mdadm the new devices and would
 have gotten it wrong every time.
 yes, in this particular case it would have; congratulations, you found a new
 creative way of shooting yourself in the foot.

Creative, not so much.  I just backed out of what I started and tried
something else.  Lots of people do that.

 maybe mdadm should do checks when creating a device to prevent this kind
 of mistake.
 i.e.
 if creating an array on a partition, check the whole device for a
 superblock and refuse in case it finds one
 
 if creating an array on a whole device that has a partition table,
 either require --force, or check for superblocks in every possible
 partition.

What happens if you add the partition table *after* you make the whole
disk device and there are stale superblocks in the partitions?  This
still isn't infallible.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-28 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
  

Actually, after doing some research, here's what I've found:


I should note that both the lvm code and raid code are simplistic at the
moment.  For example, the raid5 mapping only supports the default raid5
layout.  If you use any other layout, game over.  Getting it to work
with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
getting it to the point where it handles all the relevant setups
properly would require a reasonable amount of coding.

  
My first thought is that after the /boot partition is read (assuming you 
use one) restrictions go away. Performance of /boot is not much of an 
issue, for me at least, but more complex setups are sometimes needed for 
the rest of the system.


Thanks for the research.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-27 Thread Luca Berra

On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:

On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:

On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
just apply some rules, so if you find a partition table _AND_ an md
superblock at the end, read both and you can tell if it is an md on a
partition or a partitioned md raid1 device.


In fact, no you can't.  I know, because I've created a device that had
both but wasn't a raid device.  And its matching partner still existed
too.  What you are talking about would have misrecognized this
situation, guaranteed.

then just ignore the device and log a warning, instead of doing a random
choice.
L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Luca Berra

On Fri, Oct 26, 2007 at 07:06:46PM +0200, Gabor Gombas wrote:

On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:


You got the ordering wrong. You should get userspace support ready and
accepted _first_, and then you can start the
flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
configurable.

sorry, i did not intend to start a flamewar.


Oh wait that is possible even today. So you can build your own kernel
without any partition table format support - problem solved.

yes, i can build my own, i just thought it could be useful for someone
other than myself. maybe even Doug's enterprise customers

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Luca Berra

On Sat, Oct 27, 2007 at 12:20:12AM +0200, Gabor Gombas wrote:

On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:


* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.


Huh? I have several machines that boot with LILO and the root is on
RAID1. All install LILO to the boot sector of the mdX device (having
boot=/dev/mdX in lilo.conf), while the MBR is installed by
install-mbr. Since install-mbr has its own prompt that is displayed
before LILO's prompt on boot, I can be pretty sure that LILO did not
write anything to the MBR...


the behaviour is documented in the lilo man page, for the
raid-extra-boot option.


--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
  
[___snip___]
  



Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.
  


I'm reasonably sure that's wrong. I used to set up dual boot machines by 
putting LILO in the partition and making that the boot partition; by 
changing the active partition flag I could just have the machine boot 
Windows, to keep people from getting confused.

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions be in exactly the
same logical block address across all devices.  This limitation can be
an extremely hazardous limitation in the event a drive dies and you have
to replace it with a new drive as newer drives may not share the older
drive's geometry and will require starting your boot partition in an odd
location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).
  


That sounds like a good reason to avoid grub2, frankly. Software which 
decides that it knows what to do better than the user isn't my 
preference. If I wanted software which forces me to do things their way 
I'd be running Windows.

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
  


Sounds right, although it may have other uses for clever people.

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.
  


If the patches were available, "doesn't work with existing raid formats" 
would probably qualify as a bug.

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the MBR and the start of the first partition
to embed the remainder of the boot loader.  When you point grub2 at an
md device, they automatically only use the second method of boot loader
installation.  This gives them the freedom to be able to modify the
second stage boot loader on a boot disk by boot disk basis.  The
downside to this is that they need lots of room after the MBR and before
the first partition in order to put their core.img file in place.  I
*think*, and I'll know for sure later today, that the core.img file is
generated during grub install from the list of optional modules you
specify during setup.  Eg., the pc module gives partition table support,
the lvm module lvm support, etc.  You list the modules you need, and
grub then builds a core.img out of all those modules.  The normal amount
of space between the MBR and the first partition is (sectors_per_track -
1).  For standard disk geometries, that basically leaves 254 sectors, or
127k of space.  This might not be enough for your particular needs if
you have a complex boot environment.  In that case, you would need to
bump at least the starting track of your first partition to make room
for your boot loader.  Unfortunately, how is a person to know how much
room their setup needs until after they've installed and it's too late
to bump the partition table start?  They can't.  So, that's another
thing I think I will check out today, what the maximum size of grub2
might be with all modules included, and what a common size might be.

  
Based on your description, it sounds as if grub2 may not have given 
adequate thought to what users other than the authors might need (that 
may be a premature conclusion). I have multiple installs on several of 
my machines, and I assume that the grub2 for 32 and 64 bit will be 
different. Thanks for the research.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
 On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
 On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
  On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
  just apply some rules, so if you find a partition table _AND_ an md
  superblock at the end, read both and you can tell if it is an md on a
  partition or a partitioned md raid1 device.
 
 In fact, no you can't.  I know, because I've created a device that had
 both but wasn't a raid device.  And its matching partner still existed
 too.  What you are talking about would have misrecognized this
 situation, guaranteed.
 then just ignore the device and log a warning, instead of doing a random
 choice.
 L.

It also happened to be my OS drive pair.  Ignoring it would have
rendered the machine unusable.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
 Actually, after doing some research, here's what I've found:

 * When using grub2, there is supposedly already support for raid/lvm
 devices.  However, I do not know if this includes version 1.0, 1.1, or
 1.2 superblocks.  I intend to find that out today.

It does not include support for any version 1 superblocks.  It's noted
in the code that it should, but doesn't yet.  However, the interesting
bit is that they rearchitected grub so that any reads from a device
during boot are filtered through the stack that provides the device.
So, when you tell grub2 to set root=md0, then all reads from md0 are
filtered through the raid module, and the raid module then calls the
reads from the IO module, which then does the actual int 13 call.  This
allows the raid module to read superblocks, detect the raid level and
layout, and actually attempt to work on raid0/1/5 devices (at the
moment).  It also means that all the calls from the ext2 module when it
attempts to read from the md device are filtered through the md module
and therefore it would be simple for it to implement an offset into the
real device to get past the version 1.1/1.2 superblocks.

In terms of resilience, the raid module actually tries to utilize the
raid itself during any failure.  On raid1 devices, if it gets a read
failure on any block it attempts to read, then it goes to the next
device in the raid1 array and attempts to read from it.  So, in the
event that your normal boot disk suffers a sector failure in your actual
kernel image, but the raid disk is otherwise fine, grub2 should be able
to boot from the kernel image on the next raid device.  Similarly, on
raid5 it will attempt to recover from a block read failure by using the
parity to generate the missing data unless the array is already in
degraded mode at which point it will bail on any read failure.

The lvm module attempts to properly map extents to physical volumes and
allows you to have your bootable files in an lvm logical volume.  In that
case you set root=logical-volume-name-as-it-appears-in-/dev/mapper and
the lvm module then figures out what physical volumes contain that
logical volume and where the extents are mapped and goes from there.

I should note that both the lvm code and raid code are simplistic at the
moment.  For example, the raid5 mapping only supports the default raid5
layout.  If you use any other layout, game over.  Getting it to work
with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
getting it to the point where it handles all the relevant setups
properly would require a reasonable amount of coding.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote:
  * When using lilo to boot from a raid device, it automatically installs
  itself to the mbr, not to the partition.  This can not be changed.  Only
  0.90 and 1.0 superblock types are supported because lilo doesn't
  understand the offset to the beginning of the fs otherwise.

 
 I'm reasonably sure that's wrong. I used to set up dual boot machines by 
 putting LILO in the partition and making that the boot partition; by 
 changing the active partition flag I could just have the machine boot 
 Windows, to keep people from getting confused.

Yeah, someone else pointed this out too.  The original patch to lilo
*did* do as I suggest, so they must have improved on the patch later.

  * When using grub to boot from a raid device, only 0.90 and 1.0
  superblocks are supported[1] (because grub is ignorant of the raid and
  it requires the fs to start at the start of the partition).  You can use
  either MBR or partition based installs of grub.  However, partition
  based installs require that all bootable partitions be in exactly the
  same logical block address across all devices.  This limitation can be
  an extremely hazardous limitation in the event a drive dies and you have
  to replace it with a new drive as newer drives may not share the older
  drive's geometry and will require starting your boot partition in an odd
  location to make the logical block addresses match.
 
  * When using grub2, there is supposedly already support for raid/lvm
  devices.  However, I do not know if this includes version 1.0, 1.1, or
  1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
  install to an md device, it searches out all constituent devices and
  installs to the MBR on each device[2].  This can't be changed (at least
  right now, probably not ever though).

 
 That sounds like a good reason to avoid grub2, frankly. Software which 
 decides that it knows what to do better than the user isn't my 
 preference. If I wanted software which forces me to do things their way 
 I'd be running Windows.

It's not really all that unreasonable of a restriction.  Most people
aren't aware that when you put a boot sector at the beginning of a
partition, you only have 512 bytes of space, so the boot loader that you
put there is basically nothing more than code to read the remainder of
the boot loader from the file system space.  Now, traditionally, most
boot loaders have had to hard code the block addresses of certain key
components into these second stage boot loaders.  If a user isn't aware
of the fact that the boot loader does this at install time (or at kernel
selection update time in the case of lilo), then they aren't aware that
the files must reside at exactly the same logical block address on all
devices.  Without that knowledge, they can easily create an unbootable
setup by having the various boot partitions in slightly different
locations on the disks.  And intelligent partition editors like parted
can compound the problem because as they insulate the user from having
to pick which partition number is used for what partition, etc., they
can end up placing the various boot partitions in different areas of
different drives.  The requirement above is a means of making sure that
users aren't surprised by a non-working setup.  The whole element of
least surprise thing.  Of course, if they keep that requirement, then I
would expect it to be well documented so that people know this going
into putting the boot loader in place, but I would argue that this is at
least better than finding out when a drive dies that your system isn't
bootable.

  So, given the above situations, really, superblock format 1.2 is likely
  to never be needed.  None of the shipping boot loaders work with 1.2
  regardless, and the boot loader under development won't install to the
  partition in the event of an md device and therefore doesn't need that
  4k buffer that 1.2 provides.

 
 Sounds right, although it may have other uses for clever people.
  [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
  person could probably hack it to work, but since grub development has
  stopped in preference to the still under development grub2, they won't
  take the patches upstream unless they are bug fixes, not new features.

 
 If the patches were available, "doesn't work with existing raid formats" 
 would probably qualify as a bug.

Possibly.  I'm a bit overbooked on other work at the moment, but I may
try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2
superblocks.

  [2] There are two ways to install to a master boot record.  The first is
  to use the first 512 bytes *only* and hardcode the location of the
  remainder of the boot loader into those 512 bytes.  The second way is to
  use the free space between the MBR and the start of the first partition
  to embed the remainder of the boot loader.  When you point grub2 at an
  md 

Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
 On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
 
  In fact, no you can't.  I know, because I've created a device that had
  both but wasn't a raid device.  And its matching partner still existed
  too.  What you are talking about would have misrecognized this
  situation, guaranteed.
 
 Maybe we need a 2.0 superblock that contains the physical size of every
 component, not just the logical size that is used for RAID. That way if
 the size read from the superblock does not match the size of the device,
 you know that this device should be ignored.

In my case that wouldn't have helped.  What actually happened was I
created a two disk raid1 device using whole devices and a version 1.0
superblock.  I know a version 1.1 wouldn't work because it would be
where the boot sector needed to be, and wasn't sure if a 1.2 would work
either.  Then I tried to make the whole disk raid device a partitioned
device.  This obviously put a partition table right where the BIOS and
the kernel would look for it whether the raid was up or not.  I also
tried doing an lvm setup to split the raid up into chunks and that
didn't work either.  So, then I redid the partition table and created
individual raid devices from the partitions.  But, I didn't think to
zero the old whole disk superblock.  When I made the individual raid
devices, I used all 1.1 superblocks.  So, when it was all said and done,
I had a bunch of partitions that looked like a valid set of partitions
for the whole disk raid device and a whole disk raid superblock, but I
also had superblocks in each partition with their own bitmaps and so on.
It was only because I wasn't using mdadm in the initrd and specifying
uuids that it found the right devices to start and ignored the whole
disk devices.  But, when I later made some more devices and went to
update the mdadm.conf file using mdadm -Eb, it found the devices and
added it to the mdadm.conf.  If I hadn't checked it before remaking my
initrd, it would have hosed the system.  And it would have passed all
the tests you can throw at it.  Quite simply, there is no way to tell
the difference between those two situations with 100% certainty.  Mdadm
tries to be smart and start the newest devices, but Luca's original
suggestion of skip the partition scanning in the kernel and figure it
out from user space would not have shown mdadm the new devices and would
have gotten it wrong every time.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-26 Thread Neil Brown
On Thursday October 25, [EMAIL PROTECTED] wrote:
 
 I didn't get a reply to my suggestion of separating the data and location...

No. Sorry.

 
 ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
 format (0.9 vs 1.0) and a location (end,start,offset4k)?
 
 This would certainly make things a lot clearer to new (and old!) users:
 
 mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
 or
 mdadm --create /dev/md0 --metadata 1.0 --meta-location start
 or
 mdadm --create /dev/md0 --metadata 1.0 --meta-location end

I'm happy to support synonyms.  How about

   --metadata 1-end
   --metadata 1-start

??

 
 resulting in:
 mdadm --detail /dev/md0
 
 /dev/md0:
 Version : 01.0
   Metadata-locn : End-of-device

It already lists the superblock location as a sector offset, but I
don't have a problem with reporting:

  Version : 1.0 (metadata at end of device)
  Version : 1.1 (metadata at start of device)

Would that help?


   Creation Time : Fri Aug  4 23:05:02 2006
  Raid Level : raid0
 
 You provide rational defaults for mortals and this approach allows people like
 Doug to do wacky HA things explicitly.
 
 I'm not sure you need any changes to the kernel code - probably just the docs
 and mdadm.

True.

 
  It is conceivable that I could change the default, though that would
  require a decision as to what the new default would be.  I think it
  would have to be 1.0 or it would cause too much confusion.
  
  A newer default would be nice.
 
 I also suspect that a *lot* of people will assume that the highest superblock
 version is the best and should be used for new installs etc.

Grumble... why can't people expect what I want them to expect?

 
 So if you make 1.0 the default then how many users will try 'the bleeding 
 edge'
 and use 1.2? So then you have 1.3 which is the same as 1.0? H? So to quote
 from an old Soap: Confused, you  will be...
:-)

NeilBrown


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Luca Berra

On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:

On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:


Honestly, I don't see how a properly configured system would start
looking at the physical device by mistake. I suppose it's possible, but
I didn't have this issue.


Mount by label support scans all devices in /proc/partitions looking for
the filesystem superblock that has the label you are trying to mount.

it could probably be smarter, but in any case there is no point in
mounting an md device by label.

LVM (unless told not to) scans all devices in /proc/partitions looking

yes, but lvm, unless told otherwise, will ignore devices having a valid md
superblock.

for valid LVM superblocks.  In fact, you can't build a linux system that
is resilient to device name changes without doing that.

i dislike labels, especially for devices that contain the os. we should
take great care that these are identified correctly, and
mount-by-label does not (usb drives that migrate from one system to
another are so common that you can't ignore them)

you forgot udev ;)

but the fix is easy.
remove the partition detection code from the kernel and start working on
a smart userspace replacement for device detection. we already have
vol_id from udev and blkid from e2fsprogs, which support detection of many
device formats.
just apply some rules, so if you find a partition table _AND_ an md
superblock at the end, read both and you can tell if it is an md on a
partition or a partitioned md raid1 device.
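
A sketch of how a userspace rule could gather both facts (/dev/sda is only
an example; deciding between the two layouts from that information is
exactly the ambiguity Doug raises elsewhere in the thread):

  # is there a partition table on the device?
  sfdisk -l /dev/sda

  # is there an md superblock on the whole device?  (exit status 0 means yes)
  mdadm --examine /dev/sda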


And you can with superblock at the front.  You can create a new single
disk raid1 over the existing superblock or you can munge the partition
table to have it point at the start of your data.  There are options,

Please don't do that,
use device-mapper to set the device up, without mucking with partition
tables.

L.


--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/\
\ / ASCII RIBBON CAMPAIGN
 XAGAINST HTML MAIL
/ \


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Bill Davidsen

Neil Brown wrote:

On Thursday October 25, [EMAIL PROTECTED] wrote:
  

I didn't get a reply to my suggestion of separating the data and location...



No. Sorry.

  

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new (and old!) users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location end



I'm happy to support synonyms.  How about

   --metadata 1-end
   --metadata 1-start

??
  
Offset? Do you like 1-offset4k or maybe 1-start4k or even 
1-start+4k for that? The last is most intuitive but I don't know how 
you feel about the + in there.
  

resulting in:
mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device



It already lists the superblock location as a sector offset, but I
don't have a problem with reporting:

  Version : 1.0 (metadata at end of device)
  Version : 1.1 (metadata at start of device)

Would that help?

  

Same comments on the reporting, metadata at block 4k or something.
  

  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0

You provide rational defaults for mortals and this approach allows people like
Doug to do wacky HA things explicitly.

I'm not sure you need any changes to the kernel code - probably just the docs
and mdadm.



True.

  

It is conceivable that I could change the default, though that would
require a decision as to what the new default would be.  I think it
would have to be 1.0 or it would cause too much confusion.


A newer default would be nice.
  

I also suspect that a *lot* of people will assume that the highest superblock
version is the best and should be used for new installs etc.



Grumble... why can't people expect what I want them to expect?

  
I confess that I thought 1.x was a series of solutions reflecting your 
evolving opinion on what was best, so maybe in retrospect you made a 
non-intuitive choice of nomenclature. Or bluntly, you picked confusing 
names for this and confused people. If 1.0 meant start, 1.1 meant 4k, 
and 1.2 meant end, at least it would be easy to remember for people who 
only create a new array a few times a year, or once in the lifetime of a 
new computer.

So if you make 1.0 the default then how many users will try 'the bleeding edge'
and use 1.2? So then you have 1.3 which is the same as 1.0? H? So to quote
from an old Soap: Confused, you  will be...



Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 11:54:18AM +0200, Luca Berra wrote:

 but the fix is easy.
 remove the partition detection code from the kernel and start working on
 a smart userspace replacement for device detection. we already have
 vol_id from udev and blkid from ext3 which support detection of many
 device formats.

You got the ordering wrong. You should get userspace support ready and
accepted _first_, and then you can start the
flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
configurable. But even if you have the perfect userspace solution ready
today, removing partitioning support from the kernel is a pretty
invasive ABI change so it will take many years if it ever happens at
all.

I saw the "let's move partition detection to user space" argument
several times on l-k in the past years but it never gained support...
So if you want to make it happen, stop talking and start coding, and
persuade all major distros to accept your changes. _Then_ you can start
arguing to remove partition detection from the kernel, and even then it
won't be easy.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:

 You got the ordering wrong. You should get userspace support ready and
 accepted _first_, and then you can start the
 flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
 configurable.

Oh wait that is possible even today. So you can build your own kernel
without any partition table format support - problem solved.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
 Neil Brown wrote:
  On Thursday October 25, [EMAIL PROTECTED] wrote:

  I didn't get a reply to my suggestion of separating the data and 
  location...
  
 
  No. Sorry.
 

  ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
  format (0.9 vs 1.0) and a location (end,start,offset4k)?
 
  This would certainly make things a lot clearer to new (and old!) users:
 
  mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
  or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location start
  or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location end
  
 
  I'm happy to support synonyms.  How about
 
 --metadata 1-end
 --metadata 1-start
 
  ??

 Offset? Do you like 1-offset4k or maybe 1-start4k or even 
 1-start+4k for that? The last is most intuitive but I don't know how 
 you feel about the + in there.

Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions be in exactly the
same logical block address across all devices.  This limitation can be
an extremely hazardous limitation in the event a drive dies and you have
to replace it with a new drive as newer drives may not share the older
drive's geometry and will require starting your boot partition in an odd
location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
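
For anyone wanting to stay within what the shipping boot loaders handle, a
minimal sketch of a bootable mirror would therefore use 1.0 metadata (device
names illustrative):

mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1
# superblock at the end; the filesystem starts at the start of the partition,
# so grub/lilo can find it without knowing about the raid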

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the MBR and the start of the first partition
to embed the remainder of the boot loader.  When you point grub2 at an
md device, they automatically only use the second method of boot loader
installation.  This gives them the freedom to be able to modify the
second stage boot loader on a boot disk by boot disk basis.  The
downside to this is that they need lots of room after the MBR and before
the first partition in order to put their core.img file in place.  I
*think*, and I'll know for sure later today, that the core.img file is
generated during grub install from the list of optional modules you
specify during setup.  Eg., the pc module gives partition table support,
the lvm module lvm support, etc.  You list the modules you need, and
grub then builds a core.img out of all those modules.  The normal amount
of space between the MBR and the first partition is (sectors_per_track -
1).  For standard disk geometries, that basically leaves 254 sectors, or
127k of space.  This might not be enough for your particular needs if
you have a complex boot environment.  In that case, you would need to
bump at least the starting track of your first partition to make room
for your boot loader.  Unfortunately, how is a person to know how much
room their setup needs until after they've installed and it's too late
to bump the partition table start?  They can't.  So, that's another
thing I think I will check out today, what the maximum size of grub2
might be with all modules included, and what a common size might be.
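
One rough way to sanity-check that gap on an existing disk is to compare the
first partition's start sector against the size of the generated core image
(a sketch; the device name and core.img path are illustrative):

fdisk -lu /dev/sda | head              # note the start sector of the first partition
# usable gap in KiB = (first_partition_start_sector - 1) * 512 / 1024
ls -l /boot/grub/core.img              # compare against what grub-mkimage produced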

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
 On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
 just apply some rules, so if you find a partition table _AND_ an md
 superblock at the end, read both and you can tell if it is an md on a
 partition or a partitioned md raid1 device.

In fact, no you can't.  I know, because I've created a device that had
both but wasn't a raid device.  And it's matching partner still existed
too.  What you are talking about would have misrecognized this
situation, guaranteed.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:

 * When using lilo to boot from a raid device, it automatically installs
 itself to the mbr, not to the partition.  This can not be changed.  Only
 0.90 and 1.0 superblock types are supported because lilo doesn't
 understand the offset to the beginning of the fs otherwise.

Huh? I have several machines that boot with LILO and the root is on
RAID1. All install LILO to the boot sector of the mdX device (having
boot=/dev/mdX in lilo.conf), while the MBR is installed by
install-mbr. Since install-mbr has its own prompt that is displayed
before LILO's prompt on boot, I can be pretty sure that LILO did not
write anything to the MBR...
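
For reference, the relevant part of such a lilo.conf looks roughly like this
(a sketch; paths and device names are illustrative):

boot=/dev/md0          # write the LILO boot sector to the md device, not the MBR
root=/dev/md0
image=/boot/vmlinuz
        label=linux
        read-only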

What you say is only true for skewed RAID setups, but I always
considered such a setup too risky for anything critical (not because of
LILO, but because of the increased administrative complexity).

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:

 In fact, no you can't.  I know, because I've created a device that had
 both but wasn't a raid device.  And it's matching partner still existed
 too.  What you are talking about would have misrecognized this
 situation, guaranteed.

Maybe we need a 2.0 superblock that contains the physical size of every
component, not just the logical size that is used for RAID. That way if
the size read from the superblock does not match the size of the device,
you know that this device should be ignored.
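
Until something like that exists, the closest approximation is to compare the
component's real size against what the existing superblock records (a sketch;
the device name is illustrative and the field names vary by mdadm version):

blockdev --getsize64 /dev/sda1                  # actual size of the component device
mdadm --examine /dev/sda1 | grep -i 'dev size'  # size the superblock believes it occupies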

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Doug Ledford
On Thu, 2007-10-25 at 09:55 +1000, Neil Brown wrote:

 As for where the metadata should be placed, it is interesting to
 observe that the SNIA's DDFv1.2 puts it at the end of the device.
 And as DDF is an industry standard sponsored by multiple companies it
 must be ..
 Sorry.  I had intended to say correct, but when it came to it, my
 fingers refused to type that word in that context.
 
 DDF is in a somewhat different situation though.  It assumes that the
 components are whole devices, and that the controller has exclusive
 access - there is no way another controller could interpret the
 devices differently before the DDF controller has a chance.

Putting a superblock at the end of a device works around OS
compatibility issues and other things related to transitioning the
device from part of an array to not, etc.  But, it works if and only if
you have the guarantee you mention.  Long, long ago I tinkered with the
idea of md multipath devices using an end of device superblock on the
whole device to allow reliable multipath detection and autostart,
failover of all partitions on a device when a command to any partition
failed, ability to use standard partition tables, etc. while being 100%
transparent to the rest of the OS.  The second you considered FC
connected devices and multi-OS access, that fell apart in a big way.
Very analogous.

So, I wouldn't necessarily call it wrong, but it's fragile.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-25 Thread David Greaves
Jeff Garzik wrote:
 Neil Brown wrote:
 As for where the metadata should be placed, it is interesting to
 observe that the SNIA's DDFv1.2 puts it at the end of the device.
 And as DDF is an industry standard sponsored by multiple companies it
 must be ..
 Sorry.  I had intended to say correct, but when it came to it, my
 fingers refused to type that word in that context.

 For the record, I have no intention of deprecating any of the metadata
 formats, not even 0.90.
 
 strongly agreed

I didn't get a reply to my suggestion of separating the data and location...

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new (and old!) users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location end

resulting in:
mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0

You provide rational defaults for mortals and this approach allows people like
Doug to do wacky HA things explicitly.

I'm not sure you need any changes to the kernel code - probably just the docs
and mdadm.

 It is conceivable that I could change the default, though that would
 require a decision as to what the new default would be.  I think it
 would have to be 1.0 or it would cause too much confusion.
 
 A newer default would be nice.

I also suspect that a *lot* of people will assume that the highest superblock
version is the best and should be used for new installs etc.

So if you make 1.0 the default then how many users will try 'the bleeding edge'
and use 1.2? So then you have 1.3 which is the same as 1.0? Hmm? So to quote
from an old Soap: Confused, you will be...

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Bill Davidsen

Neil Brown wrote:

I certainly accept that the documentation is probably less than
perfect (by a large margin).  I am more than happy to accept patches
or concrete suggestions on how to improve that.  I always think it is
best if a non-developer writes documentation (and a developer reviews
it) as then it is more likely to address the issues that a
non-developer will want to read about, and in a way that will make
sense to a non-developer. (i.e. I'm too close to the subject to write
good doco).


Patches against what's in 2.6.4 I assume? I can't promise to write 
anything which pleases even me, but I will take a look at it.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-25 Thread David Greaves
Bill Davidsen wrote:
 Neil Brown wrote:
 I certainly accept that the documentation is probably less than
 perfect (by a large margin).  I am more than happy to accept patches
 or concrete suggestions on how to improve that.  I always think it is
 best if a non-developer writes documentation (and a developer reviews
 it) as then it is more likely to address the issues that a
 non-developer will want to read about, and in a way that will make
 sense to a non-developer. (i.e. I'm too close to the subject to write
 good doco).
 
 Patches against what's in 2.6.4 I assume? I can't promise to write
 anything which pleases even me, but I will take a look at it.
 

The man page is a great place for describing, eg, the superblock location; but
don't forget we have
  http://linux-raid.osdl.org/index.php/Main_Page
which is probably a better place for *discussions* (or essays) about the
superblock location (eg the LVM / v1.1 comment Janek picked up on)

In fact I was going to take some of the writings from this thread and put them
up there.

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Doug Ledford
On Wed, 2007-10-24 at 16:22 -0400, Bill Davidsen wrote:
 Doug Ledford wrote:
  On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
 

  I don't agree completely.  I think the superblock location is a key
  issue, because if you have a superblock location which moves depending
  the filesystem or LVM you use to look at the partition (or full disk)
  then you need to be even more careful about how to poke at things.
  
 
  This is the heart of the matter.  When you consider that each file
  system and each volume management stack has a superblock, and they some
  store their superblocks at the end of devices and some at the beginning,
  and they can be stacked, then it becomes next to impossible to make sure
  a stacked setup is never recognized incorrectly under any circumstance.
  It might be possible if you use static device names, but our users
  *long* ago complained very loudly when adding a new disk or removing a
  bad disk caused their setup to fail to boot.  So, along came mount by
  label and auto scans for superblocks.  Once you do that, you *really*
  need all the superblocks at the same end of a device so when you stack
  things, it always works properly.
 Let me be devil's advocate, I noted in another post that location might 
 be raid level dependent. For raid-1 putting the superblock at the end 
 allows the BIOS to treat a single partition as a bootable unit.

This is true for both the 1.0 and 1.2 superblock formats.  The BIOS
couldn't care less if there is an offset to the filesystem because it
doesn't try to read from the filesystem.  It just jumps to the first 512
byte sector and that's it.  Grub/Lilo are the ones that have to know
about the offset, and they would be made aware of the offset at install
time.

So, we are back to the exact same thing I was talking about.  With the
superblock at the beginning of the device, you don't hinder bootability
with or without the raid working; the raid would be bootable regardless,
as long as you made it bootable.  It only hinders accessing the
filesystem via a running linux installation without bringing up the
raid.

  For all 
 other arrangements the end location puts the superblock where it is 
 slightly more likely to be overwritten, and where it must be moved if 
 the partition grows or whatever.
 
 There really may be no right answer.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Neil Brown
On Thursday October 25, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  I certainly accept that the documentation is probably less than
  perfect (by a large margin).  I am more than happy to accept patches
  or concrete suggestions on how to improve that.  I always think it is
  best if a non-developer writes documentation (and a developer reviews
  it) as then it is more likely to address the issues that a
  non-developer will want to read about, and in a way that will make
  sense to a non-developer. (i.e. I'm too close to the subject to write
  good doco).
 
 Patches against what's in 2.6.4 I assume? I can't promise to write 
 anything which pleases even me, but I will take a look at it.

Any text at all would be welcome, but yes; patches against 2.6.4 would
be easiest.

Thanks
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread David Greaves
Doug Ledford wrote:
 On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
 
 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.
 
 This is the heart of the matter.  When you consider that each file
 system and each volume management stack has a superblock, and some
 store their superblocks at the end of devices and some at the beginning,
 and they can be stacked, then it becomes next to impossible to make sure
 a stacked setup is never recognized incorrectly under any circumstance.

I wonder if we should not really be talking about superblock versions 1.0, 1.1,
1.2 etc but a data format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k


mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0


And there you have the deprecation... only two superblock versions and no real
changes to code etc

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread John Stoffel
 Bill == Bill Davidsen [EMAIL PROTECTED] writes:

Bill John Stoffel wrote:
 Why do we have three different positions for storing the superblock?  

Bill Why do you suggest changing anything until you get the answer to
Bill this question? If you don't understand why there are three
Bill locations, perhaps that would be a good initial investigation.

Because I've asked this question before and not gotten an answer, nor
is it answered in the man page for mdadm on why we have this setup. 

Bill Clearly the short answer is that they reflect three stages of
Bill Neil's thinking on the topic, and I would bet that he had a good
Bill reason for moving the superblock when he did it.

So let's hear Neil's thinking about all this?  Or should I just work
up a patch to do what I suggest and see how that flies? 

Bill Since you have to support all of them or break existing arrays,
Bill and they all use the same format so there's no saving of code
Bill size to mention, why even bring this up?

Because of the confusion factor.  Again, since no one has been able to
articulate a reason why we have three different versions of the 1.x
superblock, nor have I seen any good reasons for why we should have
them, I'm going by the KISS principle to reduce the options to the
best one.

And no, I'm not advocating getting rid of legacy support, but I AM
advocating that we settle on ONE standard format going forward as the
default for all new RAID superblocks.

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Mike Snitzer
On 10/24/07, John Stoffel [EMAIL PROTECTED] wrote:
  Bill == Bill Davidsen [EMAIL PROTECTED] writes:

 Bill John Stoffel wrote:
  Why do we have three different positions for storing the superblock?

 Bill Why do you suggest changing anything until you get the answer to
 Bill this question? If you don't understand why there are three
 Bill locations, perhaps that would be a good initial investigation.

 Because I've asked this question before and not gotten an answer, nor
 is it answered in the man page for mdadm on why we have this setup.

 Bill Clearly the short answer is that they reflect three stages of
 Bill Neil's thinking on the topic, and I would bet that he had a good
 Bill reason for moving the superblock when he did it.

 So let's hear Neil's thinking about all this?  Or should I just work
 up a patch to do what I suggest and see how that flies?

 Bill Since you have to support all of them or break existing arrays,
 Bill and they all use the same format so there's no saving of code
 Bill size to mention, why even bring this up?

 Because of the confusion factor.  Again, since no one has been able to
 articulate a reason why we have three different versions of the 1.x
 superblock, nor have I seen any good reasons for why we should have
 them, I'm going by the KISS principle to reduce the options to the
 best one.

 And no, I'm not advocating getting rid of legacy support, but I AM
 advocating that we settle on ONE standard format going forward as the
 default for all new RAID superblocks.

Why exactly are you on this crusade to find the one best v1
superblock location?  Giving people the freedom to place the
superblock where they choose isn't a bad thing.  Would adding
something like "If in doubt, 1.1 is the safest choice." to the mdadm
man page give you the KISS warm-fuzzies you're pining for?

The fact that, after you read the manpage, you didn't even know that
the only difference between the v1.x variants is the location that the
superblock is placed indicates that you're not in a position to be so
tremendously evangelical about effecting code changes that limit
existing options.

Mike
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Bill Davidsen

John Stoffel wrote:

Bill == Bill Davidsen [EMAIL PROTECTED] writes:



Bill John Stoffel wrote:
  
Why do we have three different positions for storing the superblock?  
  


Bill Why do you suggest changing anything until you get the answer to
Bill this question? If you don't understand why there are three
Bill locations, perhaps that would be a good initial investigation.

Because I've asked this question before and not gotten an answer, nor
is it answered in the man page for mdadm on why we have this setup. 


Bill Clearly the short answer is that they reflect three stages of
Bill Neil's thinking on the topic, and I would bet that he had a good
Bill reason for moving the superblock when he did it.

So let's hear Neil's thinking about all this?  Or should I just work
up a patch to do what I suggest and see how that flies? 
  


If you are only going to change the default, I think you're done, since 
people report problems with bootloaders starting versions other than 
0.90. And until I hear Neil's thinking on this, I'm not sure that I know 
what the default location and type should be. In fact, reading the 
discussion I suspect it should be different for RAID-1 (should be at the 
end) and all other types (should be near the front). That retains the 
ability to mount one part of the mirror as a single partition, while 
minimizing the possibility of bad applications seeing something which 
looks like a filesystem at the start of a partition and trying to run 
fsck on it.

Bill Since you have to support all of them or break existing arrays,
Bill and they all use the same format so there's no saving of code
Bill size to mention, why even bring this up?

Because of the confusion factor.  Again, since no one has been able to
articulate a reason why we have three different versions of the 1.x
superblock, nor have I seen any good reasons for why we should have
them, I'm going by the KISS principle to reduce the options to the
best one.

And no, I'm not advocating getting rid of legacy support, but I AM
advocating that we settle on ONE standard format going forward as the
default for all new RAID superblocks.
  


Unfortunately the solution can't be any simpler than the problem, and 
that's why I'm dubious that anything but the documentation should be 
changed, or an additional metadata target added per the discussion 
above, perhaps "best1" for the best 1.x format based on the raid level.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Bill Davidsen

Doug Ledford wrote:

On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:

  

I don't agree completely.  I think the superblock location is a key
issue, because if you have a superblock location which moves depending
the filesystem or LVM you use to look at the partition (or full disk)
then you need to be even more careful about how to poke at things.



This is the heart of the matter.  When you consider that each file
system and each volume management stack has a superblock, and some
store their superblocks at the end of devices and some at the beginning,
and they can be stacked, then it becomes next to impossible to make sure
a stacked setup is never recognized incorrectly under any circumstance.
It might be possible if you use static device names, but our users
*long* ago complained very loudly when adding a new disk or removing a
bad disk caused their setup to fail to boot.  So, along came mount by
label and auto scans for superblocks.  Once you do that, you *really*
need all the superblocks at the same end of a device so when you stack
things, it always works properly.
Let me be devil's advocate, I noted in another post that location might 
be raid level dependent. For raid-1 putting the superblock at the end 
allows the BIOS to treat a single partition as a bootable unit. For all 
other arrangements the end location puts the superblock where it is 
slightly more likely to be overwritten, and where it must be moved if 
the partition grows or whatever.


There really may be no right answer.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Neil Brown
On Tuesday October 23, [EMAIL PROTECTED] wrote:
 On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
  John Stoffel wrote:
   Why do we have three different positions for storing the superblock?  
 
  Why do you suggest changing anything until you get the answer to this 
  question? If you don't understand why there are three locations, perhaps 
  that would be a good initial investigation.
  
  Clearly the short answer is that they reflect three stages of Neil's 
  thinking on the topic, and I would bet that he had a good reason for 
  moving the superblock when he did it.
 
 I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
 the device) is to satisfy people that want to get at their raid1 data
 without bringing up the device or using a loop mount with an offset.
 Version 1.1, at the beginning of the device, is to prevent accidental
 access to a device when the raid array doesn't come up.  And version 1.2
 (4k from the beginning of the device) would be suitable for those times
 when you want to embed a boot sector at the very beginning of the device
 (which really only needs 512 bytes, but a 4k offset is as easy to deal
 with as anything else).  From the standpoint of wanting to make sure an
 array is suitable for embedding a boot sector, the 1.2 superblock may be
 the best default.
 

Exactly correct.

Another perspective is that I chickened out of making a decision and
chose to support all the credible possibilities that I could think of.
And showed that I didn't have enough imagination.  The other
possibility that I should have included (as has been suggested in this
conversation, and previously on this list) is to store the superblock
both at the beginning and the end for redundancy.  However I cannot
decide whether to combine the 1.0 and 1.1 locations, or the 1.0 and
1.2.  And I don't think I want to support both (maybe I've learned my
lesson).

As for where the metadata should be placed, it is interesting to
observe that the SNIA's DDFv1.2 puts it at the end of the device.
And as DDF is an industry standard sponsored by multiple companies it
must be ..
Sorry.  I had intended to say correct, but when it came to it, my
fingers refused to type that word in that context.

DDF is in a somewhat different situation though.  It assumes that the
components are whole devices, and that the controller has exclusive
access - there is no way another controller could interpret the
devices differently before the DDF controller has a chance.

DDF is also interesting in that it uses 512 byte alignment for
metadata.  The 'anchor' block is in the last sector of the device.
This contrasts with current md metadata which is all 4K aligned.
Given that the drive manufacturers seem to be telling us that 4096 is
the new 512, I think 4K alignment was a good idea.
It could be that DDF actually specifies the anchor to reside in the
last block rather than the last sector, and it could be that the
spec allows for block size to be device specific - I'd have to hunt
through the spec again to be sure.

For the record, I have no intention of deprecating any of the metadata
formats, not even 0.90.
It is conceivable that I could change the default, though that would
require a decision as to what the new default would be.  I think it
would have to be 1.0 or it would cause too much confusion.

I think it would be entirely appropriate for a distro (especially an
'enterprise' distro) to choose a format and location that it was going
to standardise on and support, and make that the default on that
distro (by using a CREATE line in mdadm.conf).  Debian has already
done this by making 1.0 the default.
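
For example, a distro could ship a default along these lines in mdadm.conf
(a sketch; the values are only illustrative):

CREATE metadata=1.0 owner=root group=disk mode=0660 auto=yes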

I certainly accept that the documentation is probably less than
perfect (by a large margin).  I am more than happy to accept patches
or concrete suggestions on how to improve that.  I always think it is
best if a non-developer writes documentation (and a developer reviews
it) as then it is more likely to address the issues that a
non-developer will want to read about, and in a way that will make
sense to a non-developer. (i.e. I'm too close to the subject to write
good doco).

NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Jeff Garzik

Neil Brown wrote:

On Tuesday October 23, [EMAIL PROTECTED] wrote:
As for where the metadata should be placed, it is interesting to
observe that the SNIA's DDFv1.2 puts it at the end of the device.
And as DDF is an industry standard sponsored by multiple companies it
must be ..
Sorry.  I had intended to say correct, but when it came to it, my
fingers refused to type that word in that context.

DDF is in a somewhat different situation though.  It assumes that the
components are whole devices, and that the controller has exclusive
access - there is no way another controller could interpret the
devices differently before the DDF controller has a chance.


grin agreed.



DDF is also interesting in that it uses 512 byte alignment for
metadata.  The 'anchor' block is in the last sector of the device.
This contrasts with current md metadata which is all 4K aligned.
Given that the drive manufacturers seem to be telling us that 4096 is
the new 512, I think 4K alignment was a good idea.
It could be that DDF actually specifies the anchor to reside in the
last block rather than the last sector, and it could be that the
spec allows for block size to be device specific - I'd have to hunt
through the spec again to be sure.


It's a bit of a mess.

Yes, with 1K and 4K sector devices starting to appear, as long as the 
underlying partitioning gets the initial partition alignment correct, 
this /should/ continue functioning as normal.


If for whatever reason you wind up with an odd-aligned 1K sector device 
and your data winds up aligned to even numbered [hard] sectors, 
performance will definitely suffer.


Mostly this is out of MD's hands, and up to the sysadmin and 
partitioning tools to get hard-sector alignment right.




For the record, I have no intention of deprecating any of the metadata
formats, not even 0.90.


strongly agreed



It is conceivable that I could change the default, though that would
require a decision as to what the new default would be.  I think it
would have to be 1.0 or it would cause too much confusion.


A newer default would be nice.

Jeff


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

John Stoffel wrote:
Why do we have three different positions for storing the superblock?  
  
Why do you suggest changing anything until you get the answer to this 
question? If you don't understand why there are three locations, perhaps 
that would be a good initial investigation.


Clearly the short answer is that they reflect three stages of Neil's 
thinking on the topic, and I would bet that he had a good reason for 
moving the superblock when he did it.


Since you have to support all of them or break existing arrays, and they 
all use the same format so there's no saving of code size to mention, 
why even bring this up?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
  

On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:


And if putting the superblock at the end is problematic, why is it the
default?  Shouldn't version 1.1 be the default?  
  

In my opinion, having the superblock *only* at the end (e.g. the 0.90
format) is the best option.

It allows one to mount the disk separately (in case of RAID 1), if the
MD superblock is corrupt or you just want to get easily at the raw data.



Bad reasoning.  It's the reason that the default is at the end of the
device, but that was a bad decision made by Ingo long, long ago in a
galaxy far, far away.

The simple fact of the matter is there are only two types of raid devices
for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
and those that don't (raid1, linear).

For the purposes of this issue, there are only two states we care about:
the raid array works or doesn't work.

If the raid array works, then you *only* want the system to access the
data via the raid array.  If the raid array doesn't work, then for the
fragmented case you *never* want the system to see any of the data from
the raid array (such as an ext3 superblock) or a subsequent fsck could
see a valid superblock and actually start a filesystem scan on the raw
device, and end up hosing the filesystem beyond all repair after it hits
the first chunk size break (although in practice this is usually a
situation where fsck declares the filesystem so corrupt that it refuses
to touch it, that's leaving an awful lot to chance, you really don't
want fsck to *ever* see that superblock).

If the raid array is raid1, then the raid array should *never* fail to
start unless all disks are missing (in which case there is no raw device
to access anyway).  The very few failure types that will cause the raid
array to not start automatically *and* still have an intact copy of the
data usually happen when the raid array is perfectly healthy, in which
case automatically finding a constituent device when the raid array
failed to start is exactly the *wrong* thing to do (for instance, you
enable SELinux on a machine and it hasn't been relabeled and the raid
array fails to start because /dev/mdblah can't be created because of
an SELinux denial...all the raid1 members are still there, but if you
touch a single one of them, then you run the risk of creating silent
data corruption).

It really boils down to this: for any reason that a raid array might
fail to start, you *never* want to touch the underlying data until
someone has taken manual measures to figure out why it didn't start and
corrected the problem.  Putting the superblock in front of the data does
not prevent manual measures (such as recreating superblocks) from
getting at the data.  But, putting superblocks at the end leaves the
door open for accidental access via constituent devices when you
*really* don't want that to happen.
  


You didn't mention some ill-behaved application using the raw device 
(i.e. a database) writing just a little more than it should and destroying 
the superblock.

So, no, the default should *not* be at the end of the device.

  

You make a convincing argument.

As to the people who complained exactly because of this feature, LVM has
two mechanisms to protect from accessing PVs on the raw disks (the
ignore raid components option and the filter - I always set filters when
using LVM ontop of MD).

regards,
iustin
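
For reference, the two protections Iustin mentions look roughly like this in
lvm.conf (the filter pattern is only illustrative):

devices {
    md_component_detection = 1              # skip devices that carry an md superblock
    filter = [ "a|^/dev/md|", "r|.*|" ]     # only scan md devices for PVs, reject everything else
}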




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

John Stoffel wrote:

Michael == Michael Tokarev [EMAIL PROTECTED] writes:



Michael Doug Ledford wrote:
Michael []
  

1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worse
case situations, I've seen lvm2 find a superblock on one RAID1 array
member when the RAID1 array was down, the system came up, you used the
system, the two copies of the raid array were made drastically
inconsistent, then at the next reboot, the situation that prevented the
RAID1 from starting was resolved, and it never know it failed to start
last time, and the two inconsistent members we put back into a clean
array).  So, deprecating any of these is not really helpful.  And you
need to keep the old 0.90 format around for back compatibility with
thousands of existing raid arrays.
  


Michael Well, I strongly, completely disagree.  You described a
Michael real-world situation, and that's unfortunate, BUT: for at
Michael least raid1, there ARE cases, pretty valid ones, when one
Michael NEEDS to mount the filesystem without bringing up raid.
Michael Raid1 allows that.

Please describe one such case please.  There have certainly been hacks
of various RAID systems on other OSes such as Solaris where the VxVM
and/or Solstice DiskSuite allowed you to encapsulate an existing
partition into a RAID array.  


But in my experience (and I'm a professional sysadm... :-) it's not
really all that useful, and can lead to problems liks those described
by Doug.  


If you are going to mirror an existing filesystem, then by definition
you have a second disk or partition available for the purpose.  So you
would merely setup the new RAID1, in degraded mode, using the new
partition as the base.  Then you copy the data over to the new RAID1
device, change your boot setup, and reboot.

Once that is done, you can then add the original partition into the
RAID1 array.  


As Doug says, and I agree strongly, you DO NOT want to have the
possibility of confusion and data loss, especially on bootup.  And
this leads to the heart of my initial post on this matter, that the
confusion of having four different variations of RAID superblocks is
bad.  We should deprecate them down to just two, the old 0.90 format,
and the new 1.x format at the start of the RAID volume.
  


Perhaps I am misreading you here: when you say deprecate them down, do 
you mean the Adrian Bunk method of putting in a printk scolding the 
administrator and then removing the feature a version later, or did you 
mean deprecate all but two, which clearly doesn't suggest removing the 
capability at all?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

Justin Piszcz wrote:



On Fri, 19 Oct 2007, John Stoffel wrote:


Justin == Justin Piszcz [EMAIL PROTECTED] writes:


Justin On Fri, 19 Oct 2007, John Stoffel wrote:



So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe 
linux-raid in

the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin anything else!

Are you sure?  I find that GRUB is much easier to use and setup than
LILO these days.  But hey, just dropping down to support 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John



I am sure, I submitted a bug report to the LILO developer, he 
acknowledged the bug but I don't know if it was fixed.


I have not tried GRUB with a RAID1 setup yet.


Works fine.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
 John Stoffel wrote:
  Why do we have three different positions for storing the superblock?  

 Why do you suggest changing anything until you get the answer to this 
 question? If you don't understand why there are three locations, perhaps 
 that would be a good initial investigation.
 
 Clearly the short answer is that they reflect three stages of Neil's 
 thinking on the topic, and I would bet that he had a good reason for 
 moving the superblock when he did it.

I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
the device) is to satisfy people that want to get at their raid1 data
without bringing up the device or using a loop mount with an offset.
Version 1.1, at the beginning of the device, is to prevent accidental
access to a device when the raid array doesn't come up.  And version 1.2
(4k from the beginning of the device) would be suitable for those times
when you want to embed a boot sector at the very beginning of the device
(which really only needs 512 bytes, but a 4k offset is as easy to deal
with as anything else).  From the standpoint of wanting to make sure an
array is suitable for embedding a boot sector, the 1.2 superblock may be
the best default.
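
In mdadm terms the three variants are simply chosen at create time (a sketch;
device names illustrative):

mdadm --create /dev/md0 -l1 -n2 --metadata=1.0 /dev/sda1 /dev/sdb1   # superblock at the end
mdadm --create /dev/md0 -l1 -n2 --metadata=1.1 /dev/sda1 /dev/sdb1   # superblock at the start
mdadm --create /dev/md0 -l1 -n2 --metadata=1.2 /dev/sda1 /dev/sdb1   # superblock 4k from the start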

 Since you have to support all of them or break existing arrays, and they 
 all use the same format so there's no saving of code size to mention, 
 why even bring this up?
 
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote:
 Doug Ledford wrote:
  
  Well, first I was thinking of files in the few hundreds of megabytes
  each to gigabytes each, and when they are streamed, they are streamed at
  a rate much lower than the full speed of the array, but still at a fast
  rate.  How parallel the reads are then would tend to be a function of
  chunk size versus streaming rate. 
 
 Ahh, I see now. Thanks for explanation.
 
 I wonder though, if setting large readahead would help, if you used larger 
 chunk size. Assuming other options are not possible - i.e. streaming from 
 larger buffer, while reading to it in a full stripe width at least.

Probably not.  All my trial and error in the past with raid5 arrays and
various situations that would cause pathological worst case behavior
showed that once reads themselves reach 16k in size, and are sequential
in nature, then the disk firmware's read ahead kicks in and your
performance stays about the same regardless of increasing your OS read
ahead.  In a nutshell, once you've convinced the disk firmware that you
are going to be reading some data sequentially, it does the rest.  With
a large stripe size (say 256k+), you'll trigger this firmware read ahead
fairly early on in reading any given stripe, so you really don't buy
much by reading the next stripe before you need it, and in fact can end
up wasting a lot of RAM trying to do so, hurting overall performance.
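
If someone does want to experiment with the OS-level knob anyway, it is just
the read-ahead setting on the array (device name and value illustrative):

blockdev --getra /dev/md0        # current read-ahead, in 512-byte sectors
blockdev --setra 4096 /dev/md0   # e.g. bump it to 2 MiB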

  
  I'm not familiar with the benchmark you are referring to.
  
 
 I was thinking about 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html
 
 with small discussion that happend after that.
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote:
 John Stoffel wrote:
  Michael == Michael Tokarev [EMAIL PROTECTED] writes:

  As Doug says, and I agree strongly, you DO NOT want to have the
  possibility of confusion and data loss, especially on bootup.  And
 
 There are different point of views, and different settings etc.

Indeed, there are different points of view.  And with that in mind, I'll
just point out that my point of view is that of an engineer who is
responsible for all the legitimate md bugs in our products once tech
support has weeded out the "you tried to do what?" cases.  From that
point of view, I deal with *every* user's preferred use case, not any
single use case.

 For example, I once dealt with a linux user who was unable to
 use his disk partition, because his system (it was RedHat if I
 remember correctly) recognized some LVM volume on his disk (it
 was previously used with Windows) and tried to automatically
 activate it, thus making it busy.

Yep, that can still happen today under certain circumstances.

   What I'm talking about here
 is that any automatic activation of anything should be done with
 extreme care, using smart logic in the startup scripts if at
 all.

We do.  Unfortunately, there is no logic smart enough to recognize all
the possible user use cases that we've seen given the way things are
created now.

 The Doug's example - in my opinion anyway - shows wrong tools
 or bad logic in the startup sequence, not a general flaw in
 superblock location.

Well, one of the problems is that you can both use an md device as an
LVM physical volume and use an LVM logical volume as an md constituent
device.  Users have done both.

 For example, when one drive was almost dead, and mdadm tried
 to bring the array up, machine just hanged for unknown amount
 of time.  An unexpirienced operator was there.  Instead of
 trying to teach him how to pass parameter to the initramfs
 to stop trying to assemble root array and next assembling
 it manually, I told him to pass root=/dev/sda1 to the
 kernel.  Root mounts read-only, so it should be a safe thing
 to do - I only needed root fs and minimal set of services
 (which are even in initramfs) just for it to boot up to SOME
 state where I can log in remotely and fix things later.

Umm, no.  Generally speaking (I can't speak for other distros) but both
Fedora and RHEL remount root rw even when coming up in single user mode.
The only time the fs is left in ro mode is when it drops to a shell
during rc.sysinit as a result of a failed fs check.  And if you are
using an ext3 filesystem and things didn't go down clean, then you also
get a journal replay.  So, then what happens when you think you've fixed
things, and you reboot, and then due to random chance, the ext3 fs check
gets the journal off the drive that wasn't mounted and replays things
again?  Will this overwrite your fixes possibly?  Yep.  Could do all
sorts of bad things.  In fact, unless you do a full binary compare of
your constituent devices, you could have silent data corruption and just
never know about it.  You may get off lucky and never *see* the
corruption, but it could well be there.  The only safe way to
reintegrate your raid after doing what you suggest is to kick the
unmounted drive out of the array before rebooting by using mdadm to zero
its superblock, boot up with a degraded raid1 array, and readd the
kicked device back in.
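
Concretely, that safe sequence is roughly (a sketch; device names
illustrative):

mdadm --zero-superblock /dev/sdb1            # forget the member that was never mounted
mdadm --assemble --run /dev/md0 /dev/sda1    # bring the array up degraded from the good copy
mdadm /dev/md0 --add /dev/sdb1               # re-add the kicked member and let it resync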

So, while you list several more examples of times when it was convenient
to do as you suggest, these times can be handled in other ways (although
it may mean keeping a rescue CD handy at each location just for
situations like this) that are far safer IMO.

Now, putting all this back into the point of view I have to take, which
is what's the best default action to take for my customers, I'm sure you
can understand how a default setup and recommendation of use that leaves
silent data corruption is simply a non-starter for me.  If someone wants
to do this manually, then go right ahead.  But as for what we do by
default when the user asks us to create a raid array, we really need to
be on superblock 1.1 or 1.2 (although we aren't yet, we've waited for
the version 1 superblock issues to iron out and will do so in a future
release).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:

 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.

This is the heart of the matter.  When you consider that each file
system and each volume management stack has a superblock, and some
store their superblocks at the end of devices and some at the beginning,
and they can be stacked, then it becomes next to impossible to make sure
a stacked setup is never recognized incorrectly under any circumstance.
It might be possible if you use static device names, but our users
*long* ago complained very loudly when adding a new disk or removing a
bad disk caused their setup to fail to boot.  So, along came mount by
label and auto scans for superblocks.  Once you do that, you *really*
need all the superblocks at the same end of a device so when you stack
things, it always works properly.

 Michael Another example is ext[234]fs - it does not touch first 512
 Michael bytes of the device, so if there was an msdos filesystem
 Michael there before, it will be recognized as such by many tools,
 Michael and an attempt to mount it automatically will lead to at
 Michael least scary output and nothing mounted, or in fsck doing
 Michael fatal things to it in worst scenario.  Sure thing the first
 Michael 512 bytes should be just cleared.. but that's another topic.
 
 I would argue that ext[234] should be clearing those 512 bytes.  Why
 aren't they cleared?

Actually, I didn't think msdos used the first 512 bytes for the same
reason ext3 doesn't: space for a boot sector.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-22 Thread John Stoffel

[ I was going to reply to this earlier, but the Red Sox and good
weather got in the way this weekend.  ;-]

 Michael == Michael Tokarev [EMAIL PROTECTED] writes:

Michael I'm doing a sysadmin work for about 15 or 20 years.

Welcome to the club!  It's a fun career, always something new to
learn. 

 If you are going to mirror an existing filesystem, then by definition
 you have a second disk or partition available for the purpose.  So you
 would merely setup the new RAID1, in degraded mode, using the new
 partition as the base.  Then you copy the data over to the new RAID1
 device, change your boot setup, and reboot.

Michael And you have to copy the data twice as a result, instead of
Michael copying it only once to the second disk.

So?  Why is this such a big deal?  As I see it, there are two separate
ways to setup a RAID1 setup, on an OS.

1.  The mirror is built ahead of time and you install onto the
mirror.  And twice as much data gets written, half to each disk.
*grin* 

2.  You are encapsulating an existing OS install and you need to do a
reboot from the un-mirrored OS to the mirrored setup.  So yes, you
do have to copy the data from the orig to the mirror, reboot, then
resync back onto the original disk which has been added into the
RAID set.  

Neither case is really that big a deal.  And with the RAID superblock
at the front of the disk, you don't have to worry about mixing up
which disk is which.  It's not fun when you boot one disk, thinking
it's the RAID disk, but end up booting the original disk.  

 As Doug says, and I agree strongly, you DO NOT want to have the
 possibility of confusion and data loss, especially on bootup.  And

Michael There are different point of views, and different settings
Michael etc.  For example, I once dealt with a linux user who was
Michael unable to use his disk partition, because his system (it was
Michael RedHat if I remember correctly) recognized some LVM volume on
Michael his disk (it was previously used with Windows) and tried to
Michael automatically activate it, thus making it busy.  What I'm
Michael talking about here is that any automatic activation of
Michael anything should be done with extreme care, using smart logic
Michael in the startup scripts if at all.

Ah... but you can also deactivate LVM partitions if you like.

 Michael Doug's example - in my opinion anyway - shows wrong tools
Michael or bad logic in the startup sequence, not a general flaw in
Michael superblock location.

I don't agree completely.  I think the superblock location is a key
issue, because if you have a superblock location which moves depending
on the filesystem or LVM you use to look at the partition (or full disk)
then you need to be even more careful about how to poke at things.

This is really true when you use the full disk for the mirror, because
then you don't have the partition table to base some initial
guestimates on.  Since there is an explicit Linux RAID partition type,
as well as an explicit linux filesystem (filesystem is then decoded
from the first Nk of the partition), you have a modicum of safety.

If ext3 has its superblock in the first 4k of the disk, but you've
set up the disk to use RAID1 with the MD superblock at the end of the
disk, you now need to be careful about how the disk is detected and
then mounted.

To the ext3 detection logic, it looks like an ext3 filesystem; to the
MD detection logic, it looks like a RAID member.  Which is correct?  Which is wrong?
How do you tell programmatically?  

That's why I think all superblocks should be in the SAME
location on the disk and/or partitions, if used.  It keeps down
problems like this.  

Michael Another example is ext[234]fs - it does not touch first 512
Michael bytes of the device, so if there was an msdos filesystem
Michael there before, it will be recognized as such by many tools,
Michael and an attempt to mount it automatically will lead to at
Michael least scary output and nothing mounted, or in fsck doing
Michael fatal things to it in worst scenario.  Sure thing the first
Michael 512 bytes should be just cleared.. but that's another topic.

I would argue that ext[234] should be clearing those 512 bytes.  Why
aren't they cleared?

Michael Speaking of cases where it was really helpful to have an
Michael ability to mount individual raid components directly without
Michael the raid level - most of them was due to one or another
Michael operator errors, usually together with bugs and/or omissions
 Michael in software.  I don't remember the exact scenarios anymore (last
Michael time it was more than 2 years ago).  Most of the time it was
Michael one or another sort of system recovery.

In this case, you're only talking about RAID1 mirrors; no other RAID
configuration fits this scenario.  And while this might look to be
helpful, I would strongly argue that it's not, because it's a special
case of the RAID code and can lead to all kinds of bugs and problems
if it's not 

Re: Time to deprecate old RAID formats?

2007-10-22 Thread Michael Tokarev
John Stoffel wrote:

 Michael == Michael Tokarev [EMAIL PROTECTED] writes:

 If you are going to mirror an existing filesystem, then by definition
 you have a second disk or partition available for the purpose.  So you
 would merely setup the new RAID1, in degraded mode, using the new
 partition as the base.  Then you copy the data over to the new RAID1
 device, change your boot setup, and reboot.
 
 Michael And you have to copy the data twice as a result, instead of
 Michael copying it only once to the second disk.
 
 So?  Why is this such a big deal?  As I see it, there are two separate
 ways to set up RAID1 on an OS.
[..]
That was just a tiny nitpick, so to speak, about a particular way to
convert an existing system to raid1 - not something that's done every
day anyway.  Still, doubling the time to copy your terabyte-sized
drive is something to consider.

[]
 Michael automatically activate it, thus making it busy.  What I'm
 Michael talking about here is that any automatic activation of
 Michael anything should be done with extreme care, using smart logic
 Michael in the startup scripts if at all.
 
 Ah... but you can also deactivate LVM partitions if you like.

Yes, especially for a newbie user who has just installed linux on his PC
only to find that he can't use his disk.. ;)  That was a real situation -
I was helping someone who had never heard of LVM and had done very little
with filesystems/disks before.

 Michael Doug's example - in my opinion anyway - shows wrong tools
 Michael or bad logic in the startup sequence, not a general flaw in
 Michael superblock location.
 
 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 on the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.

The superblock location does not depend on the filesystem.  Raid exports
only the inside space, excluding the superblock, to the next level
(filesystem or whatever else).

 This is really true when you use the full disk for the mirror, because
 then you don't have the partition table to base some initial
 guestimates on.  Since there is an explicit Linux RAID partition type,
 as well as an explicit linux filesystem (filesystem is then decoded
 from the first Nk of the partition), you have a modicum of safety.

Speaking of whole disks - first, don't do that (for reasons suitable
for another topic), and second, using the whole disk or partitions
makes no real difference whatsoever to the topic being discussed.

There's just no need for the guesswork, except for the first install
(to automatically recognize existing devices, and to use them after
confirmation), and maybe for rescue systems, which again is a different
topic.

In any case, for a tool that does guesswork (like libvolume-id, to
create /dev/ symlinks), it's as easy to look at the end of the device
as at the beginning or at any other fixed place - since the tool has
to know the superblock format, it knows the superblock location as well.

Maybe manual guesswork, based on a hexdump of the first several kilobytes
of data, is a bit more difficult in the case where the superblock is
located at the end.  But if one has to analyze a hexdump, he doesn't care
about raid anymore.
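
For illustration, a rough sketch of such a probe in Python (the offsets
follow my reading of the md/mdadm sources, so treat them as illustrative
rather than authoritative; the 0.90 superblock is written in host byte
order, and little-endian is assumed below):

    import os, struct

    MD_MAGIC = 0xa92b4efc        # same magic value for 0.90 and v1 superblocks

    def md_sb_offsets(dev_size):
        """Candidate superblock offsets (bytes) for a device of dev_size bytes."""
        sb090 = (dev_size & ~(65536 - 1)) - 65536     # 0.90: last aligned 64 KiB block
        sb10 = ((dev_size // 512 - 16) & ~7) * 512    # 1.0: >= 8 KiB from the end, 4 KiB aligned
        return {"0.90": sb090, "1.0": sb10, "1.1": 0, "1.2": 4096}

    def probe(path):
        with open(path, "rb") as f:
            size = f.seek(0, os.SEEK_END)
            for name, off in md_sb_offsets(size).items():
                f.seek(off)
                data = f.read(4)
                if len(data) == 4 and struct.unpack("<I", data)[0] == MD_MAGIC:
                    print("%s: possible %s superblock at offset %d" % (path, name, off))

Either way the tool has to know the format, so a fixed offset at the end
is no harder for it than a fixed offset at the beginning.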

 If ext3 has its superblock in the first 4k of the disk, but you've
 set up the disk to use RAID1 with the MD superblock at the end of the
 disk, you now need to be careful about how the disk is detected and
 then mounted.

See above.  For tools, it's trivial to distinguish a component of a
raid volume from the volume itself, by looking for the superblock at
whatever location it uses.  That includes stuff like mkfs, which - like
mdadm - may warn about previous filesystem/volume information on the
device in question.

 Michael Speaking of cases where it was really helpful to have an
 Michael ability to mount individual raid components directly without
 Michael the raid level - most of them was due to one or another
 Michael operator errors, usually together with bugs and/or omissions
 Michael in software.  I don't remember the exact scenarios anymore (last
 Michael time it was more than 2 years ago).  Most of the time it was
 Michael one or another sort of system recovery.
 
 In this case, you're only talking about RAID1 mirrors, no other RAID
 configuration fits this scenario.  And while this might look to be

Definitely.  However, linear - to some extent - can be used partially,
though with much less usefulness.

However, raid1 is a much more common setup than anything else - IMHO anyway.
It's the cheapest and most reliable thing for an average user - it's
cheaper to get 2 large drives than, say, 3 somewhat smaller drives.
Yes, raid1 wastes 1/2 of the space, compared with, say, raid5 on top of 3
drives (only 1/3 wasted), but 3 smallish drives still cost more than
2 larger drives.

 helpful, I would strongly argue that it's not, because it's a special
 

Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:

 Honestly, I don't see how a properly configured system would start
 looking at the physical device by mistake. I suppose it's possible, but
 I didn't have this issue.

Mount by label support scans all devices in /proc/partitions looking for
the filesystem superblock that has the label you are trying to mount.
LVM (unless told not to) scans all devices in /proc/partitions looking
for valid LVM superblocks.  In fact, you can't build a linux system that
is resilient to device name changes without doing that.
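
For the sake of illustration, a bare-bones version of that scan in Python
(ext2/3/4 only: the superblock lives at byte 1024, the 0xEF53 magic at
offset 56 within it, and a 16-byte volume label at offset 120; the label
and device names are just examples):

    import struct

    def ext_label(devpath):
        try:
            with open(devpath, "rb") as f:
                f.seek(1024)
                sb = f.read(1024)
        except OSError:
            return None
        if len(sb) < 136 or struct.unpack_from("<H", sb, 56)[0] != 0xEF53:
            return None
        return sb[120:136].rstrip(b"\0").decode("ascii", "replace")

    def find_by_label(wanted):
        with open("/proc/partitions") as f:
            lines = f.readlines()[2:]            # skip the header lines
        for line in lines:
            fields = line.split()
            if len(fields) < 4:
                continue
            dev = "/dev/" + fields[3]
            if ext_label(dev) == wanted:
                return dev                       # first raw device carrying that label

    print(find_by_label("root"))

Note that nothing in that loop knows or cares whether the device it
returns is a plain partition or a member of a raid1 array whose assembly
failed - which is exactly the problem being described.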

 It's not only about the activation of the array. I'm mostly talking
 about RAID1, but the fact that migrating between RAID1 and plain disk is
 just a few hundred K at the end increases the flexibility very much.

Flexibility, no.  Convenience, yes.  You can do all the things with the
superblock at the front that you can with it at the end; it just takes a
little more effort.

 Also, sometime you want to recover as much as possible from a not intact
 copy of the data...

And you can with the superblock at the front.  You can create a new
single-disk raid1 over the existing superblock, or you can munge the
partition table to have it point at the start of your data.  There are
options; they just require manual intervention.  But if you are trying to
rescue data off of a seriously broken device, you are already doing manual
intervention anyway.
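
For completeness, a hedged sketch of that kind of manual intervention in
Python (device names, the mount point, and the metadata version are
placeholders; re-creating with the wrong version - or, for v1.x, the
wrong data offset - can destroy the data, so this is an illustration,
not a recipe):

    import subprocess

    member = "/dev/sdb1"      # hypothetical surviving raid1 member
    subprocess.check_call([
        "mdadm", "--create", "/dev/md0",
        "--level=1", "--raid-devices=2",
        "--metadata=1.1",     # must match what the member was originally created with
        "--assume-clean",     # no resync - only one disk is present anyway
        member, "missing",
    ])
    # mdadm will notice the existing superblock and ask for confirmation.
    subprocess.check_call(["mount", "-o", "ro", "/dev/md0", "/mnt/rescue"])

If the offsets line up, re-creating the superblock in place leaves the
data itself untouched and brings it back up through the md layer, which
is where you want to be reading it from anyway.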

 Of course, different people have different priorities, but as I said, I
 like that this conversion is possible, and I never had the case of a
 tool saying hmm, /dev/mdsomething is not there, let's look at
 /dev/sdc instead.

mount, pvscan.

 thanks,
 iustin
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 00:43 +0200, Michal Soltys wrote:
 Doug Ledford wrote:
  course, this comes at the expense of peak throughput on the device.
  Let's say you were building a mondo movie server, where you were
  streaming out digital movie files.  In that case, you very well may care
  more about throughput than seek performance since I suspect you wouldn't
  have many small, random reads.  Then I would use a small chunk size,
  sacrifice the seek performance, and get the throughput bonus of parallel
  reads from the same stripe on multiple disks.  On the other hand, if I
  
 
 Out of curiosity though - why wouldn't a large chunk work well here?  If you
 stream video (I assume large files, so like a good few MBs at least), the
 reads are parallel either way.

Well, first I was thinking of files in the few hundreds of megabytes
each to gigabytes each, and when they are streamed, they are streamed at
a rate much lower than the full speed of the array, but still at a fast
rate.  How parallel the reads are then would tend to be a function of
chunk size versus streaming rate.  I guess I should clarify what I'm
talking about anyway.  To me, a large chunk size is 1 to 2MB or so, a
small chunk size is in the 64k to 256k range.  If you have a 10 disk
raid5 array with a 2mb chunk size, and you aren't just copying files
around, then it's hard to ever get that to do full speed parallel reads
because you simply won't access the data fast enough.
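
Some back-of-the-envelope numbers to make that concrete (my own
arithmetic with an assumed 8 MiB/s stream - not a benchmark):

    data_disks = 9                  # a 10-disk raid5 has 9 data members per stripe
    chunk = 2 * 1024 ** 2           # 2 MiB chunk
    stream_rate = 8 * 1024 ** 2     # one 8 MiB/s video stream

    print(chunk / stream_rate)                  # 0.25 s of playback per chunk (one disk busy)
    print(data_disks * chunk / stream_rate)     # 2.25 s of playback per full stripe
    print(64 * 1024 / stream_rate * 1000)       # ~8 ms per 64 KiB chunk

With 2 MiB chunks a single stream sits on one disk for a quarter of a
second at a time, so the members mostly take turns; with 64 KiB chunks
the stream crosses a disk boundary every few milliseconds and ordinary
readahead keeps several spindles busy at once.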

 Yes, the amount of data read from each of the disks will be in less perfect
 proportion than in the small chunk size scenario, but that's pretty negligible.
 Benchmarks I've seen (like Justin's) seem not to care much about chunk
 size in sequential read/write scenarios (and often favor larger chunks).
 Some of my own tests from a few months ago confirmed that as well.

I'm not familiar with the benchmark you are referring to.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
Doug Ledford wrote:
[]
 1.0, 1.1, and 1.2 are the same format, just in different positions on
 the disk.  Of the three, the 1.1 format is the safest to use since it
 won't allow you to accidentally have some sort of metadata between the
 beginning of the disk and the raid superblock (such as an lvm2
 superblock), and hence whenever the raid array isn't up, you won't be
 able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
 worst-case situations, I've seen lvm2 find a superblock on one RAID1
 array member when the RAID1 array was down, the system came up, you used
 the system, the two copies of the raid array were made drastically
 inconsistent, then at the next reboot, the situation that prevented the
 RAID1 from starting was resolved, it never knew it had failed to start
 last time, and the two inconsistent members were put back into a clean
 array).  So, deprecating any of these is not really helpful.  And you
 need to keep the old 0.90 format around for back compatibility with
 thousands of existing raid arrays.

Well, I strongly, completely disagree.  You described a real-world
situation, and that's unfortunate, BUT: for at least raid1, there ARE
cases, pretty valid ones, when one NEEDS to mount the filesystem without
bringing up raid.  Raid1 allows that.

/mjt
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-20 Thread John Stoffel
 Michael == Michael Tokarev [EMAIL PROTECTED] writes:

Michael Doug Ledford wrote:
Michael []
 1.0, 1.1, and 1.2 are the same format, just in different positions on
 the disk.  Of the three, the 1.1 format is the safest to use since it
 won't allow you to accidentally have some sort of metadata between the
 beginning of the disk and the raid superblock (such as an lvm2
 superblock), and hence whenever the raid array isn't up, you won't be
 able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
 worst-case situations, I've seen lvm2 find a superblock on one RAID1
 array member when the RAID1 array was down, the system came up, you used
 the system, the two copies of the raid array were made drastically
 inconsistent, then at the next reboot, the situation that prevented the
 RAID1 from starting was resolved, it never knew it had failed to start
 last time, and the two inconsistent members were put back into a clean
 array).  So, deprecating any of these is not really helpful.  And you
 need to keep the old 0.90 format around for back compatibility with
 thousands of existing raid arrays.

Michael Well, I strongly, completely disagree.  You described a
Michael real-world situation, and that's unfortunate, BUT: for at
Michael least raid1, there ARE cases, pretty valid ones, when one
Michael NEEDS to mount the filesystem without bringing up raid.
Michael Raid1 allows that.

Please describe one such case.  There have certainly been hacks
of various RAID systems on other OSes such as Solaris where the VxVM
and/or Solstice DiskSuite allowed you to encapsulate an existing
partition into a RAID array.  

But in my experience (and I'm a professional sysadm... :-) it's not
really all that useful, and can lead to problems like those described
by Doug.  

If you are going to mirror an existing filesystem, then by definition
you have a second disk or partition available for the purpose.  So you
would merely setup the new RAID1, in degraded mode, using the new
partition as the base.  Then you copy the data over to the new RAID1
device, change your boot setup, and reboot.

Once that is done, you can then add the original partition into the
RAID1 array.  

As Doug says, and I agree strongly, you DO NOT want to have the
possibility of confusion and data loss, especially on bootup.  And
this leads to the heart of my initial post on this matter, that the
confusion of having four different variations of RAID superblocks is
bad.  We should deprecate them down to just two, the old 0.90 format,
and the new 1.x format at the start of the RAID volume.

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Iustin Pop
On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
 Michael Well, I strongly, completely disagree.  You described a
 Michael real-world situation, and that's unfortunate, BUT: for at
 Michael least raid1, there ARE cases, pretty valid ones, when one
 Michael NEEDS to mount the filesystem without bringing up raid.
 Michael Raid1 allows that.
 
 Please describe one such case please.

Boot from a raid1 array, such that everything - including the partition
table itself - is mirrored.

iustin
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 17:07 +0200, Iustin Pop wrote:
 On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
  Michael Well, I strongly, completely disagree.  You described a
  Michael real-world situation, and that's unfortunate, BUT: for at
  Michael least raid1, there ARE cases, pretty valid ones, when one
  Michael NEEDS to mount the filesystem without bringing up raid.
  Michael Raid1 allows that.
  
  Please describe one such case please.
 
 Boot from a raid1 array, such that everything - including the partition
 table itself - is mirrored.

That's a *really* bad idea.  If you want to subpartition a raid array,
you really should either run lvm on top of raid or use a partitionable
raid array embedded in a raid partition.  If you don't, there are a
whole slew of failure cases that would result in the same sort of
accidental access and data corruption that I talked about.  For
instance, if you ever ran fdisk on the disk itself instead of the raid
array, fdisk would happily create a partition that runs off the end of
the raid device and into the superblock area.  The raid subsystem
autodetect only works on partitions labeled as type 0xfd, so it would
never search for a raid superblock at the end of the actual device, and
that means that if you boot from a rescue CD that doesn't contain an
mdadm.conf file that specifies the whole disk device as a search device,
then it is guaranteed to not start the device and possibly try and
modify the underlying constituent devices.  All around, it's just a
*really* bad idea.
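
As a quick illustration of the fdisk hazard, a partition on the raw disk
is safe only if it ends before the array's usable size.  The check below
uses the 0.90 rule of thumb (superblock in the last aligned 64 KiB
block), so treat the numbers as illustrative:

    def overlaps_superblock(disk_size, part_start, part_len, reserved=64 * 1024):
        # Array data ends where the end-of-device superblock area begins.
        usable = (disk_size & ~(reserved - 1)) - reserved
        return part_start + part_len > usable

    size = 500 * 10 ** 9                                  # a 500 GB disk
    print(overlaps_superblock(size, 0, size))             # True: a full-disk partition tramples it
    print(overlaps_superblock(size, 0, size - 2 ** 20))   # False: stopping 1 MiB short is fine

Nothing in fdisk performs that check for you, which is the point.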

I've heard several descriptions of things you *could* do with the
superblock at the end, but as of yet, not one of them is a good idea if
you really care about your data.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
John Stoffel wrote:
 Michael == Michael Tokarev [EMAIL PROTECTED] writes:
[]
 Michael Well, I strongly, completely disagree.  You described a
 Michael real-world situation, and that's unfortunate, BUT: for at
 Michael least raid1, there ARE cases, pretty valid ones, when one
 Michael NEEDS to mount the filesystem without bringing up raid.
 Michael Raid1 allows that.
 
 Please describe one such case please.  There have certainly been hacks
 of various RAID systems on other OSes such as Solaris where the VxVM
 and/or Solstice DiskSuite allowed you to encapsulate an existing
 partition into a RAID array.  
 
 But in my experience (and I'm a professional sysadm... :-) it's not
 really all that useful, and can lead to problems like those described
 by Doug.  

I've been doing sysadmin work for about 15 or 20 years.

 If you are going to mirror an existing filesystem, then by definition
 you have a second disk or partition available for the purpose.  So you
 would merely setup the new RAID1, in degraded mode, using the new
 partition as the base.  Then you copy the data over to the new RAID1
 device, change your boot setup, and reboot.
[...]

And you have to copy the data twice as a result, instead of copying
it only once to the second disk.

 As Doug says, and I agree strongly, you DO NOT want to have the
 possibility of confusion and data loss, especially on bootup.  And

There are different points of view, and different settings, etc.
For example, I once dealt with a linux user who was unable to
use his disk partition, because his system (it was RedHat if I
remember correctly) recognized some LVM volume on his disk (it
had previously been used with Windows) and tried to automatically
activate it, thus making it busy.  What I'm talking about here
is that any automatic activation of anything should be done with
extreme care, using smart logic in the startup scripts, if at
all.

Doug's example - in my opinion anyway - shows wrong tools
or bad logic in the startup sequence, not a general flaw in
superblock location.

Another example is ext[234]fs - it does not touch the first 512
bytes of the device, so if there was an msdos filesystem there
before, it will be recognized as such by many tools, and an
attempt to mount it automatically will lead to at least scary
output and nothing mounted, or, in the worst case, to fsck doing
fatal things to it.  Of course the first 512 bytes should just
be cleared.. but that's another topic.

Speaking of cases where it was really helpful to have the ability
to mount individual raid components directly, without the raid
layer - most of them were due to one or another operator error,
usually together with bugs and/or omissions in software.  I don't
remember the exact scenarios anymore (the last time was more than 2
years ago).  Most of the time it was one or another sort of
system recovery.

In almost all machines I maintain, there's a raid1 for the root
filesystem built of all the drives (be it 2 or 4 or even 6 of
them) - the key point is to be able to boot off any of them
in case some cable/drive/controller rearrangement has to be
done.  The root filesystem is quite small (256 or 512 MB here),
and it's not too dynamic either -- so it's not a big deal to
waste space on it.

Problems occur - obviously - when something goes wrong.
And most of the issues we had happened at a remote site,
where there was no experienced operator/sysadmin handy.

For example, once when one drive was almost dead and mdadm tried
to bring the array up, the machine just hung for an unknown amount
of time.  An inexperienced operator was there.  Instead of
trying to teach him how to pass a parameter to the initramfs
to stop it from trying to assemble the root array and then
how to assemble it manually, I told him to pass root=/dev/sda1
to the kernel.  Root mounts read-only, so it should be a safe
thing to do - I only needed the root fs and a minimal set of
services (which are even in the initramfs), just for it to boot
up to SOME state where I could log in remotely and fix things
later.  (No, I didn't want to remove the drive yet; I wanted to
examine it first, and that turned out to be a good idea because
the hang was happening only at the beginning of the disk, and
while we were trying to install the replacement and fill it up
with data, an unreadable sector was found on another drive, so
this old but not-yet-removed drive was really handy.)

Another situation - after some weird crash I had to examine
the filesystems found on both components - I wanted to look
at the filesystems and compare them, WITHOUT messing with the
raid superblocks (later on I wrote a tiny program to
save/restore 0.90 superblocks), and without attempting any
reconstruction.  In fact, this very case - examining
the contents - is something I've done many times for
one reason or another.  There's just no need to involve the
raid layer here at all, though it doesn't disturb things either
(in some cases anyway).

Yet another - many times we had to copy an old system to
a new one - new machine boots with 3 drives 

Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
Justin Piszcz wrote:
 
 On Fri, 19 Oct 2007, Doug Ledford wrote:
 
 On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
[]
 Got it, so for RAID1 it would make sense if LILO supported it (the
 later versions of the md superblock)

 Lilo doesn't know anything about the superblock format, however, lilo
 expects the raid1 device to start at the beginning of the physical
 partition.  In otherwords, format 1.0 would work with lilo.
 Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
 it worked fine.

There are different 1.x versions - and the difference is exactly this --
the location of the superblock.  In 1.0, the superblock is located at the
end, just like with 0.90, and lilo works just fine with it.  It gets
confused somehow (though I don't really see how, because it uses bmap()
to get a list of physical blocks for the files it wants to access - those
should be in absolute numbers, regardless of the superblock location)
when the superblock is at the beginning (v 1.1 or 1.2).

/mjt
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 22:38 +0400, Michael Tokarev wrote:
 Justin Piszcz wrote:
  
  On Fri, 19 Oct 2007, Doug Ledford wrote:
  
  On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
 []
  Got it, so for RAID1 it would make sense if LILO supported it (the
  later versions of the md superblock)
 
  Lilo doesn't know anything about the superblock format, however, lilo
  expects the raid1 device to start at the beginning of the physical
  partition.  In otherwords, format 1.0 would work with lilo.
  Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
  it worked fine.
 
 There are different 1.x - and the difference is exactly this -- location
 of the superblock.  In 1.0, superblock is located at the end, just like
 with 0.90, and lilo works just fine with it.  It gets confused somehow
 (however I don't see how really, because it uses bmap() to get a list
 of physical blocks for the files it wants to access - those should be
 in absolute numbers, regardless of the superblock location) when the
 superblock is at the beginning (v 1.1 or 1.2).
 
 /mjt

It's been a *long* time since I looked at the lilo raid1 support (I
wrote the original patch that Red Hat used, I have no idea if that's
what the lilo maintainer integrated though).  However, IIRC, it uses
bmap on the file, which implies it's via the filesystem mounted on the
raid device.  And I don't think the numbers are absolute except with
respect to the file system.  So, I think the situation could be made to
work if you just taught lilo that on version 1.1 or version 1.2
superblock raids it should add the data offset of the raid to the
bmap numbers (which I think are already added to the partition offset
numbers).
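
The arithmetic would be something like the following (my reading of it,
not lilo's actual code; the data offset value is just an assumed
example):

    def abs_sector(fs_block, fs_block_size, part_start_sector, md_data_offset_sectors):
        # bmap() returns a block number relative to the filesystem; the boot
        # loader has to add both the partition offset and, for 1.1/1.2
        # superblocks, the md data offset before handing it to the BIOS.
        return part_start_sector + md_data_offset_sectors + fs_block * (fs_block_size // 512)

    # 4 KiB block 1000 of /boot/vmlinuz, partition at sector 63, assumed
    # data offset of 2048 sectors:
    print(abs_sector(1000, 4096, 63, 2048))      # 63 + 2048 + 8000 = 10111

For 0.90 and 1.0 the data offset term is simply zero, which is why those
formats keep working unmodified.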

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Time to deprecate old RAID formats?

2007-10-19 Thread John Stoffel

So, 

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, Doug Ledford wrote:


On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:


I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
never split up by a chunk size for stripes.  A 2mb read is a single
read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a
series of stripes across all disks.  That means that on raid1 arrays,
total disk seeks are no more than total reads/writes, whereas on a
raid4/5/6, total disk seeks are usually greater than total reads/writes.
That in turn implies that
in a raid1 setup, disk seek time is important to performance, but not
necessarily paramount.  For raid456, disk seek time is paramount because
of how many more seeks that format uses.  When you then use an internal
bitmap, you are adding writes to every member of the raid456 array,
which adds more seeks.  The same is true for raid1, but since raid1
doesn't have the same level of dependency on seek rates that raid456
has, it doesn't show the same performance hit that raid456 does.



Got it, so for RAID1 it would make sense if LILO supported it (the
later versions of the md superblock)


Lilo doesn't know anything about the superblock format, however, lilo
expects the raid1 device to start at the beginning of the physical
partition.  In otherwords, format 1.0 would work with lilo.
Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it 
worked fine.





 (for those who use LILO) but for
RAID4/5/6, keep the bitmaps away :)


I still use an internal bitmap regardless ;-)  To help mitigate the cost
of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB
or somewhere in that range).  As long as you can get 90%+ of your
reads/writes to fall into the space of a single chunk, then you start
performing more like a raid1 device without the extra seek overhead.  Of
course, this comes at the expense of peak throughput on the device.
Let's say you were building a mondo movie server, where you were
streaming out digital movie files.  In that case, you very well may care
more about throughput than seek performance since I suspect you wouldn't
have many small, random reads.  Then I would use a small chunk size,
sacrifice the seek performance, and get the throughput bonus of parallel
reads from the same stripe on multiple disks.  On the other hand, if I
was setting up a mail server then I would go with a large chunk size
because the filesystem activities themselves are going to produce lots
of random seeks, and you don't want your raid setup to make that problem
worse.  Plus, most mail doesn't come in or go out at any sort of massive
streaming speed, so you don't need the parallel reads from multiple
disks to perform well.  It all depends on your particular use scenario.

--
Doug Ledford [EMAIL PROTECTED]
 GPG KeyID: CFBFF194
 http://people.redhat.com/dledford

Infiniband specific RPMs available at
 http://people.redhat.com/dledford/Infiniband


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Iustin Pop
On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
 And if putting the superblock at the end is problematic, why is it the
 default?  Shouldn't version 1.1 be the default?  

In my opinion, having the superblock *only* at the end (e.g. the 0.90
format) is the best option.

It allows one to mount the disk separately (in case of RAID 1), if the
MD superblock is corrupt or you just want to get easily at the raw data.

As to the people who complained exactly because of this feature, LVM has
two mechanisms to protect from accessing PVs on the raw disks (the
ignore raid components option and the filter - I always set filters when
using LVM on top of MD).

regards,
iustin
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
 On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
  And if putting the superblock at the end is problematic, why is it the
  default?  Shouldn't version 1.1 be the default?  
 
 In my opinion, having the superblock *only* at the end (e.g. the 0.90
 format) is the best option.
 
 It allows one to mount the disk separately (in case of RAID 1), if the
 MD superblock is corrupt or you just want to get easily at the raw data.

Bad reasoning.  It's the reason that the default is at the end of the
device, but that was a bad decision made by Ingo long, long ago in a
galaxy far, far away.

The simple fact of the matter is there are only two types of raid devices
for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
and those that don't (raid1, linear).

For the purposes of this issue, there are only two states we care about:
the raid array works or doesn't work.

If the raid array works, then you *only* want the system to access the
data via the raid array.  If the raid array doesn't work, then for the
fragmented case you *never* want the system to see any of the data from
the raid array (such as an ext3 superblock) or a subsequent fsck could
see a valid superblock and actually start a filesystem scan on the raw
device, and end up hosing the filesystem beyond all repair after it hits
the first chunk size break (although in practice this is usually a
situation where fsck declares the filesystem so corrupt that it refuses
to touch it, but that's leaving an awful lot to chance; you really don't
want fsck to *ever* see that superblock).
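
A toy illustration of why that is (plain raid0 mapping used for
simplicity; the chunk size and member count are arbitrary):

    def raid0_member(off, chunk=64 * 1024, ndisks=4):
        # Map an array-logical byte offset to (member index, offset on that member).
        stripe, within = divmod(off, chunk * ndisks)
        disk, chunk_off = divmod(within, chunk)
        return disk, stripe * chunk + chunk_off

    print(raid0_member(1024))        # (0, 1024): the ext3 superblock lands whole on member 0
    print(raid0_member(64 * 1024))   # (1, 0): the very next chunk lives on a different disk

So the raw first member carries a perfectly plausible ext3 superblock at
the usual offset, while everything past the first chunk boundary belongs
to other disks - exactly the bait you don't want fsck to take.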

If the raid array is raid1, then the raid array should *never* fail to
start unless all disks are missing (in which case there is no raw device
to access anyway).  The very few failure types that will cause the raid
array to not start automatically *and* still have an intact copy of the
data usually happen when the raid array is perfectly healthy, in which
case automatically finding a constituent device when the raid array
failed to start is exactly the *wrong* thing to do (for instance, you
enable SELinux on a machine and it hasn't been relabeled and the raid
array fails to start because /dev/mdblah can't be created because of
an SELinux denial...all the raid1 members are still there, but if you
touch a single one of them, then you run the risk of creating silent
data corruption).

It really boils down to this: for any reason that a raid array might
fail to start, you *never* want to touch the underlying data until
someone has taken manual measures to figure out why it didn't start and
corrected the problem.  Putting the superblock in front of the data does
not prevent manual measures (such as recreating superblocks) from
getting at the data.  But, putting superblocks at the end leaves the
door open for accidental access via constituent devices when you
*really* don't want that to happen.

So, no, the default should *not* be at the end of the device.

 As to the people who complained exactly because of this feature, LVM has
 two mechanisms to protect from accessing PVs on the raw disks (the
 ignore raid components option and the filter - I always set filters when
 using LVM on top of MD).
 
 regards,
 iustin
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote:


 1, 1.0, 1.1, 1.2
 
   Use the new version-1 format superblock.  This has few restrictions.
   The different sub-versions store the superblock at different locations
   on the device, either at the end (for 1.0), at the start (for 1.1) or
   4K from the start (for 1.2).
 
 
 It looks to me that the 1.1, combined with the 1.0 should be what we
 use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

You're somewhat misreading the man page.  You *can't* combine 1.0 with
1.1.  All of the above options: 1, 1.0, 1.1, 1.2; specifically mean to
use a version 1 superblock.  1.0 means use a version 1 superblock at the
end of the disk.  1.1 means version 1 superblock at beginning of disk.
1.2 means version 1 at a 4k offset from the beginning of the disk.  There
really is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the
version *only* means where to put the version 1 superblock on the disk.
If you just say version 1, then it goes to the default location for
version 1 superblocks, and last I checked that was the end of disk (aka,
1.0).
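
In other words, if you care which variant you get, say so explicitly at
create time and then check where the superblock actually ended up (the
device names below are placeholders):

    import subprocess

    # Put the version 1 superblock at the start of each member (1.1).
    subprocess.check_call([
        "mdadm", "--create", "/dev/md1", "--level=1", "--raid-devices=2",
        "--metadata=1.1", "/dev/sdc1", "/dev/sdd1",
    ])
    # --examine reports the superblock version and offsets actually in use.
    print(subprocess.check_output(["mdadm", "--examine", "/dev/sdc1"]).decode())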

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, Doug Ledford wrote:


On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote:


On Fri, 19 Oct 2007, John Stoffel wrote:


Justin == Justin Piszcz [EMAIL PROTECTED] writes:


Justin Is a bitmap created by default with 1.x?  I remember seeing
Justin reports of 15-30% performance degradation using a bitmap on a
Justin RAID5 with 1.x.

Not according to the mdadm man page.  I'd probably give up that
performance if it meant that re-syncing an array went much faster
after a crash.

I certainly use it on my RAID1 setup on my home machine.

John



The performance AFTER a crash, yes, but in general usage I remember seeing
someone here doing benchmarks where it had a negative effect on performance.


I'm sure an internal bitmap would.  On RAID1 arrays, reads/writes are
never split up by a chunk size for stripes.  A 2mb read is a single
read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a
series of stripes across all disks.  That means that on raid1 arrays,
total disk seeks are no more than total reads/writes, whereas on a
raid4/5/6, total disk seeks are usually greater than total reads/writes.
That in turn implies that
in a raid1 setup, disk seek time is important to performance, but not
necessarily paramount.  For raid456, disk seek time is paramount because
of how many more seeks that format uses.  When you then use an internal
bitmap, you are adding writes to every member of the raid456 array,
which adds more seeks.  The same is true for raid1, but since raid1
doesn't have the same level of dependency on seek rates that raid456
has, it doesn't show the same performance hit that raid456 does.



Justin.

--
Doug Ledford [EMAIL PROTECTED]
 GPG KeyID: CFBFF194
 http://people.redhat.com/dledford

Infiniband specific RPMs available at
 http://people.redhat.com/dledford/Infiniband



Got it, so for RAID1 it would make sense if LILO supported it (the 
later versions of the md superblock) (for those who use LILO) but for

RAID4/5/6, keep the bitmaps away :)

Justin.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread John Stoffel
 Justin == Justin Piszcz [EMAIL PROTECTED] writes:

Justin Is a bitmap created by default with 1.x?  I remember seeing
Justin reports of 15-30% performance degradation using a bitmap on a
Justin RAID5 with 1.x.

Not according to the mdadm man page.  I'd probably give up that
performance if it meant that re-syncing an array went much faster
after a crash.

I certainly use it on my RAID1 setup on my home machine.  

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, John Stoffel wrote:


Doug == Doug Ledford [EMAIL PROTECTED] writes:


Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:

Justin == Justin Piszcz [EMAIL PROTECTED] writes:



Justin On Fri, 19 Oct 2007, John Stoffel wrote:




So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?


Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on
Doug the disk.  Of the three, the 1.1 format is the safest to use since it
Doug won't allow you to accidentally have some sort of metadata between the
Doug beginning of the disk and the raid superblock (such as an lvm2
Doug superblock), and hence whenever the raid array isn't up, you won't be
Doug able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
Doug worst-case situations, I've seen lvm2 find a superblock on one RAID1
Doug array member when the RAID1 array was down, the system came up, you
Doug used the system, the two copies of the raid array were made drastically
Doug inconsistent, then at the next reboot, the situation that prevented the
Doug RAID1 from starting was resolved, it never knew it had failed to start
Doug last time, and the two inconsistent members were put back into a clean
Doug array).  So, deprecating any of these is not really helpful.  And you
Doug need to keep the old 0.90 format around for back compatibility with
Doug thousands of existing raid arrays.

This is a great case for making the 1.1 format be the default.  So
what are the advantages of the 1.0 and 1.2 formats then?  Or should
we be thinking about making two copies of the data on each RAID member,
one at the beginning and one at the end, for resiliency?

I just hate seeing this in the man page:

   Declare the style of superblock (raid metadata) to be used.
   The default is 0.90 for --create, and to guess for other operations.
   The default can be overridden by setting the metadata value for the
   CREATE keyword in mdadm.conf.

   Options are:

   0, 0.90, default

 Use the original 0.90 format superblock.  This format limits arrays to
 28 component devices and limits component devices of levels 1 and
 greater to 2 terabytes.

   1, 1.0, 1.1, 1.2

 Use the new version-1 format superblock.  This has few restrictions.
 The different sub-versions store the superblock at different locations
 on the device, either at the end (for 1.0), at the start (for 1.1) or
 4K from the start (for 1.2).


It looks to me that the 1.1, combined with the 1.0 should be what we
use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

So at this point I'm not arguing to get rid of the 0.9 format, though
I think it should NOT be the default any more, we should be using the
1.1 combined with 1.0 format.


Is a bitmap created by default with 1.x?  I remember seeing reports of 
15-30% performance degradation using a bitmap on a RAID5 with 1.x.




John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, Doug Ledford wrote:


On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:

Justin == Justin Piszcz [EMAIL PROTECTED] writes:


Justin On Fri, 19 Oct 2007, John Stoffel wrote:



So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?


1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
worst-case situations, I've seen lvm2 find a superblock on one RAID1
array member when the RAID1 array was down, the system came up, you used
the system, the two copies of the raid array were made drastically
inconsistent, then at the next reboot, the situation that prevented the
RAID1 from starting was resolved, it never knew it had failed to start
last time, and the two inconsistent members were put back into a clean
array).  So, deprecating any of these is not really helpful.  And you
need to keep the old 0.90 format around for back compatibility with
thousands of existing raid arrays.


Agree, what is the benefit in deprecating them?  Is there that much old 
code or?





It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin anything else!

Are you sure?  I find that GRUB is much easier to use and setup than
LILO these days.  But hey, just dropping down to support 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Doug Ledford [EMAIL PROTECTED]
 GPG KeyID: CFBFF194
 http://people.redhat.com/dledford

Infiniband specific RPMs available at
 http://people.redhat.com/dledford/Infiniband


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, John Stoffel wrote:


Justin == Justin Piszcz [EMAIL PROTECTED] writes:


Justin On Fri, 19 Oct 2007, John Stoffel wrote:



So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin anything else!

Are you sure?  I find that GRUB is much easier to use and setup than
LILO these days.  But hey, just dropping down to support 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John



I am sure; I submitted a bug report to the LILO developer, and he acknowledged
the bug, but I don't know if it was fixed.


I have not tried GRUB with a RAID1 setup yet.

Justin.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread John Stoffel
 Justin == Justin Piszcz [EMAIL PROTECTED] writes:

Justin On Fri, 19 Oct 2007, John Stoffel wrote:

 
 So,
 
 Is it time to start thinking about deprecating the old 0.9, 1.0 and
 1.1 formats to just standardize on the 1.2 format?  What are the
 issues surrounding this?
 
 It's certainly easy enough to change mdadm to default to the 1.2
 format and to require a --force switch to  allow use of the older
 formats.
 
 I keep seeing that we support these old formats, and it's never been
 clear to me why we have four different ones available?  Why can't we
 start defining the canonical format for Linux RAID metadata?
 
 Thanks,
 John
 [EMAIL PROTECTED]
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin anything else!

Are you sure?  I find that GRUB is much easier to use and setup than
LILO these days.  But hey, just dropping down to support 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, John Stoffel wrote:



So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available?  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



I hope 00.90.03 is not deprecated, LILO cannot boot off of anything else!


Justin.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread John Stoffel
 Doug == Doug Ledford [EMAIL PROTECTED] writes:

Doug On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote:
 1, 1.0, 1.1, 1.2
 
 Use the new version-1 format superblock.  This has few restrictions.
 The different sub-versions store the superblock at different locations
 on the device, either at the end (for 1.0), at the start (for 1.1) or
 4K from the start (for 1.2).
 
 
 It looks to me that the 1.1, combined with the 1.0 should be what we
 use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

Doug You're somewhat misreading the man page. 

The man page is somewhat misleading then.  It's not clear from reading
it that the version 1 RAID superblock can be in one of three different
positions in the volume.  

Doug You *can't* combine 1.0 with 1.1.  All of the above options: 1,
Doug 1.0, 1.1, 1.2; specifically mean to use a version 1 superblock.
Doug 1.0 means use a version 1 superblock at the end of the disk.
Doug 1.1 means version 1 superblock at beginning of disk.  `1.2 means
Doug version 1 at 4k offset from beginning of the disk.  There really
Doug is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the
Doug version *only* means where to put the version 1 superblock on
Doug the disk.  If you just say version 1, then it goes to the
Doug default location for version 1 superblocks, and last I checked
Doug that was the end of disk (aka, 1.0).

So why not get rid of (deprecate) the version 1.0 and version 1.2
blocks, and only support the 1.1 version?  

Why do we have three different positions for storing the superblock?  

And if putting the superblock at the end is problematic, why is it the
default?  Shouldn't version 1.1 be the default?  

Or, alternatively, update the code so that we support RAID superblocks
at BOTH the beginning and end 4k of the disk, for maximum redundancy.

I guess I need to go and read the code to figure out the placement of
0.90 and 1.0 blocks to see how they are different.  It's just not
clear to me why we have such a muddle of 1.x formats to choose from
and what the advantages and tradeoffs are between them.

John


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Justin Piszcz



On Fri, 19 Oct 2007, John Stoffel wrote:


Justin == Justin Piszcz [EMAIL PROTECTED] writes:


Justin Is a bitmap created by default with 1.x?  I remember seeing
Justin reports of 15-30% performance degradation using a bitmap on a
Justin RAID5 with 1.x.

Not according to the mdadm man page.  I'd probably give up that
performance if it meant that re-syncing an array went much faster
after a crash.

I certainly use it on my RAID1 setup on my home machine.

John



The performance AFTER a crash, yes, but in general usage I remember seeing
someone here doing benchmarks where it had a negative effect on performance.


Justin.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread John Stoffel
 Doug == Doug Ledford [EMAIL PROTECTED] writes:

Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
  Justin == Justin Piszcz [EMAIL PROTECTED] writes:
 
Justin On Fri, 19 Oct 2007, John Stoffel wrote:
 
  
  So,
  
  Is it time to start thinking about deprecating the old 0.9, 1.0 and
  1.1 formats to just standardize on the 1.2 format?  What are the
  issues surrounding this?

Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on
Doug the disk.  Of the three, the 1.1 format is the safest to use since it
Doug won't allow you to accidentally have some sort of metadata between the
Doug beginning of the disk and the raid superblock (such as an lvm2
Doug superblock), and hence whenever the raid array isn't up, you won't be
Doug able to accidentally mount the lvm2 volumes, filesystem, etc.  (In
Doug worst-case situations, I've seen lvm2 find a superblock on one RAID1
Doug array member when the RAID1 array was down, the system came up, you
Doug used the system, the two copies of the raid array were made drastically
Doug inconsistent, then at the next reboot, the situation that prevented the
Doug RAID1 from starting was resolved, it never knew it had failed to start
Doug last time, and the two inconsistent members were put back into a clean
Doug array).  So, deprecating any of these is not really helpful.  And you
Doug need to keep the old 0.90 format around for back compatibility with
Doug thousands of existing raid arrays.

This is a great case for making the 1.1 format be the default.  So
what are the advantages of the 1.0 and 1.2 formats then?  Or should
we be thinking about making two copies of the data on each RAID member,
one at the beginning and one at the end, for resiliency?  

I just hate seeing this in the man page:

Declare the style of superblock (raid metadata) to be used.
The default is 0.90 for --create, and to guess for other operations.
The default can be overridden by setting the metadata value for the
CREATE keyword in mdadm.conf.

Options are:

0, 0.90, default

  Use the original 0.90 format superblock.  This format limits arrays to
  28 component devices and limits component devices of levels 1 and
  greater to 2 terabytes.

1, 1.0, 1.1, 1.2

  Use the new version-1 format superblock.  This has few restrictions.
  The different sub-versions store the superblock at different locations
  on the device, either at the end (for 1.0), at the start (for 1.1) or
  4K from the start (for 1.2).


It looks to me like the 1.1 format, combined with the 1.0, should be what we
use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

So at this point I'm not arguing to get rid of the 0.9 format, though
I think it should NOT be the default any more; we should be using the
1.1 format, combined with 1.0.
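
For what it's worth, a site that wants 1.1 as its default today doesn't
have to wait for mdadm to change: the man page excerpt above already
mentions the CREATE keyword.  A hedged sketch of the relevant mdadm.conf
line (the path /etc/mdadm.conf is an assumption; some distributions use
/etc/mdadm/mdadm.conf instead):

  # /etc/mdadm.conf: make newly created arrays default to 1.1 metadata
  CREATE metadata=1.1

  # after that, a plain create picks up the new default, e.g.
  #   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1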

John
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
  Justin == Justin Piszcz [EMAIL PROTECTED] writes:
 
 Justin On Fri, 19 Oct 2007, John Stoffel wrote:
 
  
  So,
  
  Is it time to start thinking about deprecating the old 0.9, 1.0 and
  1.1 formats to just standardize on the 1.2 format?  What are the
  issues surrounding this?

1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worst
case situations, I've seen lvm2 find a superblock on one RAID1 array
member when the RAID1 array was down, the system came up, you used the
system, the two copies of the raid array were made drastically
inconsistent, then at the next reboot, the situation that prevented the
RAID1 from starting was resolved, and it never knew it had failed to start
last time, and the two inconsistent members were put back into a clean
array).  So, deprecating any of these is not really helpful.  And you
need to keep the old 0.90 format around for back compatibility with
thousands of existing raid arrays.
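
A minimal sketch of what that looks like in practice: create an array
whose superblock sits at the very start of each member, then check what
was actually written.  Device names here are placeholders, and the exact
--examine output varies between mdadm versions:

  # RAID1 with version-1.1 metadata, i.e. superblock at offset 0
  mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

  # confirm which superblock version and position each member carries
  mdadm --examine /dev/sda1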

  It's certainly easy enough to change mdadm to default to the 1.2
  format and to require a --force switch to  allow use of the older
  formats.
  
  I keep seeing that we support these old formats, and it's never been
  clear to me why we have four different ones available?  Why can't we
  start defining the canonical format for Linux RAID metadata?
  
  Thanks,
  John
  [EMAIL PROTECTED]
  -
  To unsubscribe from this list: send the line unsubscribe linux-raid in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
 
 Justin I hope 00.90.03 is not deprecated; LILO cannot boot off of
 Justin anything else!
 
 Are you sure?  I find that GRUB is much easier to use and set up than
 LILO these days.  But hey, just dropping down to support 00.90.03 and
 1.2 formats would be fine too.  Let's just lessen the confusion if at
 all possible.
 
 John
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
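
The usual compromise when the bootloader insists on a filesystem that
starts at the beginning of the partition is to give just the /boot array
an end-of-device superblock (1.0, or 0.90 for old tools) and use 1.1
everywhere else.  A hedged sketch with placeholder device names; whether
a particular LILO or GRUB build actually copes with 1.0 should be
verified before relying on it:

  # small RAID1 for /boot with the superblock at the end of each member,
  # so the filesystem itself still begins at sector 0 of the partition
  mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1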
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-19 Thread Michal Soltys

Doug Ledford wrote:

course, this comes at the expense of peak throughput on the device.
Let's say you were building a mondo movie server, where you were
streaming out digital movie files.  In that case, you very well may care
more about throughput than seek performance since I suspect you wouldn't
have many small, random reads.  Then I would use a small chunk size,
sacrifice the seek performance, and get the throughput bonus of parallel
reads from the same stripe on multiple disks.  On the other hand, if I



Out of curiosity though - why wouldn't a large chunk size work well here?  If you 
stream video (I assume large files, so a good few MBs at least), the 
reads are parallel either way.


Yes, the amount of data read from each of the disks will be in less even 
proportion than in the small chunk size scenario, but that's pretty negligible. 
Benchmarks I've seen (like Justin's) seem not to care much about chunk 
size in sequential read/write scenarios (and often favor larger chunks). 
Some of my own tests from a few months ago confirmed that as well.
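
For anyone who wants to reproduce that kind of comparison, the chunk size
is fixed at creation time, so the test boils down to building the same
array twice with different --chunk values and running an identical
sequential workload against each.  A rough sketch with placeholder device
names (recreating an array destroys its contents, and the initial resync
should be allowed to finish, as shown by /proc/mdstat, before measuring):

  # 4-disk RAID5 with a small (64 KiB) chunk size
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # sequential read straight off the array, bypassing the page cache
  dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct

  # tear it down and repeat with a large (1024 KiB) chunk size
  mdadm --stop /dev/md0
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=1024 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1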

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html