Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Mon, Oct 29, 2007 at 07:05:42PM -0400, Doug Ledford wrote:

> And I agree -D has less chance of finding a stale superblock, but it's
> also true that it has no chance of finding non-stale superblocks on

Well it might be a matter of personal preference, but i would prefer
an initrd doing just the minimum necessary to mount the root filesystem
(and/or activating resume from a swap device), and leaving all the rest
to initscripts, than an initrd that tries to do everything.


> devices that aren't even started.  So, as a method of getting all the
> right information in the event of system failure and rescuecd boot, it
> leaves something to be desired ;-)  In other words, I'd rather use a
> mode that finds everything and lets me remove the stale than a mode that
> might miss something.  But, that's a matter of personal choice.

In case of a rescuecd boot, you will probably not have any md devices
activated, and you will probably run "mdadm -Es" to check which md arrays
are available; the data should still be on the disk, else you would be
hosed anyway.

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Bad drive discovered during raid5 reshape

2007-10-29 Thread Neil Brown
On Monday October 29, [EMAIL PROTECTED] wrote:
> Hi,
> I bought two new hard drives to expand my raid array today and
> unfortunately one of them appears to be bad. The problem didn't arise
> until after I attempted to grow the raid array. I was trying to expand
> the array from 6 to 8 drives. I added both drives using mdadm --add
> /dev/md1 /dev/sdb1 which completed, then mdadm --add /dev/md1 /dev/sdc1
> which also completed. I then ran mdadm --grow /dev/md1 --raid-devices=8.
> It passed the critical section, then began the grow process.
> 
> After a few minutes I started to hear unusual sounds from within the
> case. Fearing the worst I tried to cat /proc/mdstat which resulted in no
> output so I checked dmesg which showed that /dev/sdb1 was not working
> correctly. After several minutes dmesg indicated that mdadm gave up and
> the grow process stopped. After googling around I tried the solutions
> that seemed most likely to work, including removing the new drives with
> mdadm --remove --force /dev/md1 /dev/sd[bc]1 and rebooting after which I
> ran mdadm -Af /dev/md1. The grow process restarted then failed almost
> immediately. Trying to mount the drive gives me a reiserfs replay
> failure and suggests running fsck. I don't dare fsck the array since
> I've already messed it up so badly. Is there any way to go back to the
> original working 6 disc configuration with minimal data loss? Here's
> where I'm at right now, please let me know if I need to include any
> additional information.

Looks like you are in real trouble.  Both the drives seem bad in some
way.  If it was just sdc that was failing it would have picked up
after the "-Af", but when it tried, sdb gave errors.

Having two failed devices in a RAID5 is not good!

Your best bet goes like this:

  The reshape has started and got up to some point.  The data
  before that point is spread over 8 drives.  The data after is over
  6.
  We need to restripe the 8-drive data back to 6 drives.  This can be
  done with the test_stripe tool that can be built from the mdadm
  source. 

  1/ Find out how far the reshape progressed, by using "mdadm -E" on
 one of the devices.
  2/ use something like
test_stripe save /some/file 8 $chunksize 5 2 0 $length  /dev/..

 If you get all the args right, this should copy the data from
 the array into /some/file.
 You could possibly do the same thing by assembling the array 
 read-only (set /sys/module/md_mod/parameters/start_ro to 1)
 and 'dd' from the array.  It might be worth doing both and
 checking you get the same result.

  3/ use something like
test_stripe restore /some/file 6 ..
 to restore the data to just 6 devices.

  4/ use "mdadm -C" to create the array a-new on the 6 devices.  Make
 sure the order and the chunksize etc is preserved.

 Once you have done this, the start of the array should (again)
 look like the content of /some/file.  It wouldn't hurt to check.

   Then your data would be as much back together as possible.
   You will probably still need to do an fsck, but I think you did the
   right thing in holding off.  Don't do an fsck until you are sure
   the array is writable.

You can probably do the above without using test_stripe by using dd to
make a copy of the array before you recreate it, then using dd to put
the same data back.  Using test_stripe as well might give you extra
confidence.
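
As a very rough sketch only (everything in capitals is a placeholder: set
D1..D8 to the eight members in RaidDevice order, CHUNK_BYTES/CHUNK_KB to the
array chunk size, and LENGTH to how much data the reshape already converted;
the test_stripe argument meanings are simply mirrored from Neil's example
above, so double-check them against the mdadm source before running anything
that writes):

  # 1. find how far the reshape progressed
  mdadm -E "$D1"

  # 2. save the region that is already striped over 8 devices
  test_stripe save /some/file 8 "$CHUNK_BYTES" 5 2 0 "$LENGTH" \
      "$D1" "$D2" "$D3" "$D4" "$D5" "$D6" "$D7" "$D8"

  # 3. restripe that same data back over the original 6 members
  test_stripe restore /some/file 6 "$CHUNK_BYTES" 5 2 0 "$LENGTH" \
      "$D1" "$D2" "$D3" "$D4" "$D5" "$D6"

  # 4. recreate the array with the original geometry (same member order,
  #    chunk size and layout), then compare its start against /some/file
  mdadm -C /dev/md1 --level=5 --raid-devices=6 --chunk="$CHUNK_KB" \
      "$D1" "$D2" "$D3" "$D4" "$D5" "$D6"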

Feel free to ask questions

NeilBrown


Re: Implementing low level timeouts within MD

2007-10-29 Thread Alberto Alonso
On Sat, 2007-10-27 at 12:33 +0200, Samuel Tardieu wrote:
> I agree with Doug: nothing prevents you from using md above very slow
> drivers (such as remote disks or even a filesystem implemented over a
> tape device to make it extreme). Only the low-level drivers know when
> it is appropriate to timeout or fail.
> 
>   Sam

The problem is when some of these drivers are just not smart
enough to keep themselves out of trouble. Unfortunately I've
been bitten by apparently too many of them.

I'll repeat my plea one more time. Is there a published list
of tested combinations that respond well to hardware failures
and fully signal the md code so that nothing hangs?

If not, I would like to see what people who have experienced
hardware failures and survived them are using, so that such
a list can be compiled.

Alberto




Re: Implementing low level timeouts within MD

2007-10-29 Thread Alberto Alonso
On Mon, 2007-10-29 at 13:22 -0400, Doug Ledford wrote:

> OK, these you don't get to count.  If you run raid over USB...well...you
> get what you get.  IDE never really was a proper server interface, and
> SATA is much better, but USB was never anything other than a means to
> connect simple devices without having to put a card in your PC, it was
> never intended to be a raid transport.

I still count them ;-) I guess I just would have hoped for software raid
to really not care about the lower layers.
> 
> > * Internal serverworks PATA controller on a netengine server. The
> >   server is off waiting to get picked up, so I can't get the important
> >   details.
> 
> 1 PATA failure.

I was surprised by this one; I did have good luck with PATA in
the past. The kernel is whatever came standard in Fedora Core 2.

> 
> > * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 
> >   disks each. (only one drive on one array went bad)
> > 
> > * VIA VT6420 built into the MB with RAID1 across 2 SATA drives.
> > 
> > * And the most complex is this week's server with 4 PCI/PCI-X cards.
> >   But the one that hanged the server was a 4 disk RAID5 array on a
> >   RocketRAID1540 card.
> 
> And 3 SATA failures, right?  I'm assuming the Supermicro is SATA or else
> it has more PATA ports than I've ever seen.
> 
> Was the RocketRAID card in hardware or software raid mode?  It sounds
> like it could be a combination of both, something like hardware on the
> card, and software across the different cards or something like that.
> 
> What kernels were these under?


Yes, these 3 were all SATA. The kernels (in the same order as above) 
are:

* 2.4.21-4.ELsmp #1 (Basically RHEL v3)
* 2.6.18-4-686 #1 SMP on a Fedora Core release 2
* 2.6.17.13 (compiled from vanilla sources)

The RocketRAID was configured for all drives as legacy/normal and
software RAID5 across all drives. I wasn't using hardware raid on
the last described system when it crashed.

Alberto




Re: Implementing low level timeouts within MD

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
> I've been asking on my other posts but haven't seen
> a direct reply to this question:
> 
> Can MD implement timeouts so that it detects problems when
> drivers don't come back?

No.
However it is possible that we will start sending the BIO_RW_FAILFAST
flag down on some or all requests.  That might make drivers fail more
promptly, which might be a good thing.  However it won't fix bugs in
drivers and - as has been said elsewhere on this thread - that is the
real problem.

NeilBrown



Re: Superblocks

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
> Can someone help me understand superblocks and MD a little bit?
> 
> I've got a raid5 array with 3 disks - sdb1, sdc1, sdd1.
> 
> --examine on these 3 drives shows correct information.
> 
> 
> However, if I also examine the raw disk devices, sdb and sdd, they
> also appear to have superblocks with some semi valid looking
> information. sdc has no superblock.

If a partition starts at a multiple of 64K from the start of the device,
and ends within about 64K of the end of the device, then a superblock on
the partition will also look like a superblock on the whole device.
This is one of the shortcomings of v0.90 superblocks.  v1.0 doesn't
have this problem.
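
As a quick illustration (device names are just examples), comparing the two
superblocks shows whether they really are the same one seen twice:

  # if both report the same UUID and event count, the "whole disk"
  # superblock is just the partition's superblock showing through the
  # 64K alignment coincidence described above
  mdadm -E /dev/sdb1 | grep -E 'UUID|Events'
  mdadm -E /dev/sdb  | grep -E 'UUID|Events'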

> 
> How can I clear these? If I unmount my raid, stop md0, it won't clear it.

mdadm --zero-superblock <device name>

is the best way to remove an unwanted superblock.  Of course, in the
above described case, removing the unwanted superblock will remove the
wanted one as well.


> 
> [EMAIL PROTECTED] ~]# mdadm --zero-superblock /dev/hdd
> mdadm: Couldn't open /dev/hdd for write - not zeroing

As I think someone else pointed out "/dev/hdd" is not "/dev/sdd".

NeilBrown


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Neil Brown
On Monday October 29, [EMAIL PROTECTED] wrote:
> 
> The one thing I *do* like about mdadm -E over -D is that it includes the
> superblock format in its output.  The one thing I don't like is that it
> almost universally gets the name wrong.  What I really want is a brief
> query format that both gives me the right name (-D) and the superblock
> format (-E).
> 


You need only ask :-)

The following patch will be in the next release.  Thanks for the suggestion.

NeilBrown


### Diffstat output
 ./Detail.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff .prev/Detail.c ./Detail.c
--- .prev/Detail.c  2007-10-30 14:04:25.0 +1100
+++ ./Detail.c  2007-10-30 14:08:28.0 +1100
@@ -143,7 +143,10 @@ int Detail(char *dev, int brief, int exp
}
 
if (brief)
-   printf("ARRAY %s level=%s num-devices=%d", dev, 
c?c:"-unknown-",array.raid_disks );
+   printf("ARRAY %s level=%s metadata=%d.%d num-devices=%d", dev,
+  c?c:"-unknown-",
+  array.major_version, array.minor_version,
+  array.raid_disks );
else {
mdu_bitmap_file_t bmf;
unsigned long long larray_size;
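
With that change a brief query shows both pieces of information; something
like the following (the invocation is real, but the output line is only
mocked up from the printf above, so treat the values as an example):

  mdadm -Db /dev/md0
  # ARRAY /dev/md0 level=raid5 metadata=0.90 num-devices=6 UUID=...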



Re: Time to deprecate old RAID formats?

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
> 
> Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
> beginning? Isn't hindsight wonderful?
> 

Those names seem good to me.  I wonder if it is safe to generate them
in "-Eb" output

Maybe the key confusion here is between "version" numbers and
"revision" numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. "Here is my version of what happened, now
let's hear yours".
When you have multiple revisions, you do assume ongoing improvement.

v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which
itself is a revision of the v0...

NeilBrown


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Daniel L. Miller

Doug Ledford wrote:

> Nah.  Even if we had concluded that udev was to blame here, I'm not
> entirely certain that we hadn't left Daniel with the impression that we
> suspected it versus blamed it, so reiterating it doesn't hurt.  And I'm
> sure no one has given him a fix for the problem (although Neil did
> request a change that will give debug output, but not solve the
> problem), so not dropping it entirely would seem appropriate as well.
  
I've opened a bug report on Ubuntu's Launchpad.net.  Scott James Remnant 
asked me to cc him on Neil's incremental reference - we'll see what 
happens from here.


Thanks for the help guys.  At the moment, I've changed my mdadm.conf to 
explicitly list the drives, instead of the auto=partition parameter.  
We'll see what happens on the next reboot.


I don't know if it means anything, but I'm using a self-compiled 2.6.22 
kernel - with initrd.  At least I THINK I'm using initrd - I have an 
image, but I don't see an initrd line in my grub config.  Hmm...I'm 
going to add a stanza that includes the initrd and see what happens also.


--
Daniel


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 22:29 +0100, Luca Berra wrote:
> At which point he found that
> >the udev scripts in ubuntu are being stupid, and from the looks of it
> >are the cause of the problem.  So, I've considered the initial issue
> >root caused for a bit now.
> It seems i made an idiot of myself by missing half of the thread, and i
> even knew ubuntu was braindead in their use of udev at startup, since a
> similar discussion came up on the lvm or the dm-devel mailing list (that
> time iirc it was about lvm over multipath)

Nah.  Even if we had concluded that udev was to blame here, I'm not
entirely certain that we hadn't left Daniel with the impression that we
suspected it versus blamed it, so reiterating it doesn't hurt.  And I'm
sure no one has given him a fix for the problem (although Neil did
request a change that will give debug output, but not solve the
problem), so not dropping it entirely would seem appropriate as well.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote:
> On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
> >On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
> >
> >> >Remaking the initrd installs the new mdadm.conf file, which would have
> >> >then contained the whole disk devices and its UUID.  Therein would
> >> >have been the problem.
> >> yes, i read the patch, i don't like that code, as i don't like most of
> >> what has been put in mkinitrd from 5.0 onward.
> in case you wonder i am referring to things like
> 
> emit dm create "$1" $UUID $(/sbin/dmsetup table "$1")

I make no judgments on the dm setup stuff, I know too little about the
dm stack to be qualified.

> >> Imho the correct thing here would not have been copying the existing
> >> mdadm.conf but generating a safe one from output of mdadm -D (note -D,
> >> not -E)
> >
> >I'm not sure I'd want that.  Besides, what makes you say -D is safer
> >than -E?
> 
> "mdadm -D  /dev/mdX" works on an active md device, so i strongly doubt the 
> information
> gathered from there would be stale
> while "mdadm -Es" will scan disk devices for md superblock, thus
> possibly even finding stale superblocks or leftovers.
> I would strongly recommend against blindly doing "mdadm -Es >>
> /etc/mdadm.conf" and not supervising the result.

Well, I agree that blindly doing mdadm -Esb >> mdadm.conf would be bad,
but that's not what mkinitrd is doing; it's using the mdadm.conf that's
in place, so you can update the mdadm.conf whenever you find it
appropriate.

And I agree -D has less chance of finding a stale superblock, but it's
also true that it has no chance of finding non-stale superblocks on
devices that aren't even started.  So, as a method of getting all the
right information in the event of system failure and rescuecd boot, it
leaves something to be desired ;-)  In other words, I'd rather use a
mode that finds everything and lets me remove the stale than a mode that
might miss something.  But, that's a matter of personal choice.
Considering that we only ever update mdadm.conf automatically during
installs, and that after that the user makes manual mdadm.conf changes
themselves, they are free to use whichever they prefer.
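
Either mode can draft the ARRAY lines; the important part is reviewing the
result instead of appending it blindly (sketch only, paths are examples):

  # from currently running arrays (least likely to pick up stale superblocks)
  mdadm -Ds  > /tmp/mdadm.conf.new
  # or by scanning all devices (also finds arrays that aren't started,
  # but may include leftovers)
  mdadm -Es >> /tmp/mdadm.conf.new
  # inspect and prune before installing anything by hand
  diff /etc/mdadm.conf /tmp/mdadm.conf.new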

The one thing I *do* like about mdadm -E over -D is that it includes the
superblock format in its output.  The one thing I don't like is that it
almost universally gets the name wrong.  What I really want is a brief
query format that both gives me the right name (-D) and the superblock
format (-E).

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:

On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:


>Remaking the initrd installs the new mdadm.conf file, which would have
>then contained the whole disk devices and its UUID.  Therein would
>have been the problem.
yes, i read the patch, i don't like that code, as i don't like most of
what has been put in mkinitrd from 5.0 onward.

in case you wonder i am referring to things like

emit dm create "$1" $UUID $(/sbin/dmsetup table "$1")


Imho the correct thing here would not have been copying the existing
mdadm.conf but generating a safe one from output of mdadm -D (note -D,
not -E)


I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?


"mdadm -D  /dev/mdX" works on an active md device, so i strongly doubt the 
information
gathered from there would be stale
while "mdadm -Es" will scan disk devices for md superblock, thus
possibly even finding stale superblocks or leftovers.
I would strongly recommend against blindly doing "mdadm -Es >>
/etc/mdadm.conf" and not supervising the result.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Luca Berra

On Mon, Oct 29, 2007 at 11:47:19AM -0400, Doug Ledford wrote:

On Mon, 2007-10-29 at 09:18 +0100, Luca Berra wrote:

On Sun, Oct 28, 2007 at 10:59:01PM -0700, Daniel L. Miller wrote:
>Doug Ledford wrote:
>>Anyway, I happen to *like* the idea of using full disk devices, but the
>>reality is that the md subsystem doesn't have exclusive ownership of the
>>disks at all times, and without that it really needs to stake a claim on
>>the space instead of leaving things to chance IMO.
>>   
>I've been re-reading this post numerous times - trying to ignore the 
>burgeoning flame war :) - and this last sentence finally clicked with me.

>
I am sorry Daniel, when i read Doug and Bill, stating that your issue
was not having a partition table, i immediately took the bait and forgot
about your original issue.


I never said *his* issue was lack of partition table, I just said I
don't recommend that because it's flaky.  The last statement I made

maybe i misread you but Bill was quite clear.


about his issue was to ask about whether the problem was happening
during initrd time or sysinit time to try and identify if it was failing
before or after / was mounted to try and determine where the issue might
lay.  Then we got off on the tangent about partitions, and at the same
time Neil started asking about udev, at which point it came out that
he's running ubuntu, and as much as I would like to help, the fact of
the matter is that I've never touched ubuntu and wouldn't have the
faintest clue, so I let Neil handle it.  At which point he found that
the udev scripts in ubuntu are being stupid, and from the looks of it
are the cause of the problem.  So, I've considered the initial issue
root caused for a bit now.

It seems i made an idiot of myself by missing half of the thread, and i
even knew ubuntu was braindead in their use of udev at startup, since a
similar discussion came up on the lvm or the dm-devel mailing list (that
time iirc it was about lvm over multipath)


like udev/hal that believes it knows better than you about what you have
on your disks.
but _NEITHER OF THESE IS YOUR PROBLEM_ imho


Actually, it looks like udev *is* the problem, but not because of
partition tables.

you are right.

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Richard Scobie

Daniel L. Miller wrote:

> Nothing in the documentation (that I read - granted I don't always read
> everything) stated that partitioning prior to md creation was necessary
> - in fact references were provided on how to use complete disks.  Is
> there an "official" position on, "To Partition, or Not To Partition"?
> Particularly for my application - dedicated Linux server, RAID-10
> configuration, identical drives.


My simplistic reason for always making one partition on md drives, about 
100MB smaller than the full space, has been as insurance to allow use of 
a replacement drive from another manufacturer which, while nominally 
marked as the same size as the originals, is in fact slightly smaller.
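
Something along these lines does it (sfdisk syntax; the device name and the
exact amount of headroom are only placeholders):

  # one partition of type "fd", stopping roughly 100MB (204800 sectors)
  # short of the end of the disk
  SECTORS=$(blockdev --getsize /dev/sdX)
  echo "63,$(( SECTORS - 63 - 204800 )),fd" | sfdisk -uS /dev/sdX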


Regards,

Richard


Re: Implementing low level timeouts within MD

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 01:27 -0500, Alberto Alonso wrote:
> On Sat, 2007-10-27 at 19:55 -0400, Doug Ledford wrote:
> > On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote:
> > > Regardless of the fact that it is not MD's fault, it does make
> > > software raid an invalid choice when combined with those drivers. A
> > > single disk failure within a RAID5 array bringing a file server down
> > > is not a valid option under most situations.
> > 
> > Without knowing the exact controller you have and driver you use, I
> > certainly can't tell the situation.  However, I will note that there are
> > times when no matter how well the driver is written, the wrong type of
> > drive failure *will* take down the entire machine.  For example, on an
> > SPI SCSI bus, a single drive failure that involves a blown terminator
> > will cause the electrical signaling on the bus to go dead no matter what
> > the driver does to try and work around it.
> 
> Sorry I thought I copied the list with the info that I sent to Richard.
> Here is the main hardware combinations.
> 
> --- Excerpt Start 
> Certainly. The times when I had good results (ie. failed drives
> with properly degraded arrays have been with old PATA based IDE 
> controllers built in the motherboard and the Highpoint PATA
> cards). The failures (ie. single disk failure bringing the whole
> server down) have been with the following:
> 
> * External disks on USB enclosures, both RAID1 and RAID5 (two different
>   systems) Don't know the actual controller for these. I assume it is
>   related to usb-storage, but can probably research the actual chipset,
>   if it is needed.

OK, these you don't get to count.  If you run raid over USB...well...you
get what you get.  IDE never really was a proper server interface, and
SATA is much better, but USB was never anything other than a means to
connect simple devices without having to put a card in your PC, it was
never intended to be a raid transport.

> * Internal serverworks PATA controller on a netengine server. The
> >   server is off waiting to get picked up, so I can't get the important
>   details.

1 PATA failure.

> * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 
>   disks each. (only one drive on one array went bad)
> 
> * VIA VT6420 built into the MB with RAID1 across 2 SATA drives.
> 
> * And the most complex is this week's server with 4 PCI/PCI-X cards.
>   But the one that hanged the server was a 4 disk RAID5 array on a
>   RocketRAID1540 card.

And 3 SATA failures, right?  I'm assuming the Supermicro is SATA or else
it has more PATA ports than I've ever seen.

Was the RocketRAID card in hardware or software raid mode?  It sounds
like it could be a combination of both, something like hardware on the
card, and software across the different cards or something like that.

What kernels were these under?

> --- Excerpt End 
> 
> > 
> > > I wasn't even asking as to whether or not it should, I was asking if
> > > it could.
> > 
> > It could, but without careful control of timeouts for differing types of
> > devices, you could end up making the software raid less reliable instead
> > of more reliable overall.
> 
> Even if the default timeout was really long (ie. 1 minute) and then
> configurable on a per device (or class) via /proc it would really help.

It's a band-aid.  It's working around other bugs in the kernel instead
of fixing the real problem.

> > Generally speaking, most modern drivers will work well.  It's easier to
> > maintain a list of known bad drivers than known good drivers.
> 
> That's what has been so frustrating. The old PATA IDE hardware always
> worked and the new stuff is what has crashed.

In all fairness, the SATA core is still relatively young.  IDE was
around for eons, whereas Jeff started the SATA code just a few years
back.  In that time I know he's had to deal with both software bugs and
hardware bugs that would lock a SATA port up solid with no return.  What
it sounds like to me is you found some of those.

> > Be careful which hardware raid you choose, as in the past several brands
> > have been known to have the exact same problem you are having with
> > software raid, so you may not end up buying yourself anything.  (I'm not
> > naming names because it's been long enough since I paid attention to
> > hardware raid driver issues that the issues I knew of could have been
> > solved by now and I don't want to improperly accuse a currently well
> > working driver of being broken)
> 
> I have settled for 3ware. All my tests showed that it performed quite
> well and kicked drives out when needed. Of course, I haven't had a
> bad drive on a 3ware production server yet, so I may end up
> pulling the little bit of hair I have left.
> 
> I am now rushing the RocketRAID 2220 into production without testing
> due to it being the only thing I could get my hands on. I'll report
> any experiences as they happen.
> 
> Thanks for all the info,
> 
> Alberto
> 
-- 
Doug Ledford

Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 22:59 -0700, Daniel L. Miller wrote:
> Doug Ledford wrote:
> > Anyway, I happen to *like* the idea of using full disk devices, but the
> > reality is that the md subsystem doesn't have exclusive ownership of the
> > disks at all times, and without that it really needs to stake a claim on
> > the space instead of leaving things to chance IMO.
> >   
> I've been re-reading this post numerous times - trying to ignore the 
> burgeoning flame war :) - and this last sentence finally clicked with me.
> 
> As I'm a novice Linux user - and not involved in development at all - 
> bear with me if I'm stating something obvious.  And if I'm wrong - 
> please be gentle!
> 
> 1.  md devices are not "native" to the kernel - they are 
> created/assembled/activated/whatever by a userspace program.

My real point was that md doesn't own the disks, meaning that during
startup, and at other points in time, software other than the md stack
can attempt to use the disk directly.  That software may be the linux
file system code, linux lvm code, or in some case entirely different OS
software.  Given that these situations can arise, using a partition
table to mark the space as in use by linux is what I meant by staking a
claim.  It doesn't keep the linux kernel from using it because it thinks
it owns it, but it does stop other software from attempting to use it.

> 2.  Because md devices are "non-native" devices, and are composed of 
> "native" devices, the kernel may try to use those components directly 
> without going through md.

In the case of superblocks at the end, yes.  The kernel may see the
underlying file system or lvm disk label even if the md device is not
started.

> 3.  Creating a partition table somehow (I'm still not clear how/why) 
> reduces the chance the kernel will access the drive directly without md.

The partition table is more to tell other software that linux owns the
space and to avoid mistakes where someone runs fdisk on a disk
accidentally and wipes out your array because they added a partition
table on what they thought was a new disk (more likely when you have
large arrays of disks attached via fiber channel or such than in a
single system).  Putting the superblock at the beginning of the md
device is the main thing that guarantees the kernel will never try to
use what's inside the md device without the md device running.

> These concepts suddenly have me terrified over my data integrity.  Is 
> the md system so delicate that BOOT sequence can corrupt it?

If you have your superblocks at the end of the devices, then there are
certain failure modes that can cause data inconsistencies.  Generally
speaking they won't harm the array itself, it's just that the different
disks in a raid1 array might contain different data.  If you don't use
partitions, then the majority of failure scenarios involve things like
accidental use of fdisk on the unpartitioned device, access of the
device by other OSes, that sort of thing.

>   How is it 
> more reliable AFTER the completed boot sequence?

Once the array is up and running, the constituent disks are marked as
busy in the operating system, which prevents other portions of the linux
kernel and other software in general from getting at the md owned disks.

> Nothing in the documentation (that I read - granted I don't always read 
> everything) stated that partitioning prior to md creation was necessary 
> - in fact references were provided on how to use complete disks.  Is 
> there an "official" position on, "To Partition, or Not To Partition"?  
> Particularly for my application - dedicated Linux server, RAID-10 
> configuration, identical drives.
> 
> And if partitioning is the answer - what do I need to do with my live 
> dataset?  Drop one drive, partition, then add the partition as a new 
> drive to the set - and repeat for each drive after the rebuild finishes?

You *probably*, and I emphasize probably, don't need to do anything.  I
emphasize it because I don't know enough about your situation to say so
with 100% certainty.  If I'm wrong, it's not my fault.

Now, that said, here's the gist of the situation.  There are specific
failure cases that can corrupt data in an md raid1 array mainly related
to superblocks at the end of devices.  There are specific failure cases
where an unpartitioned device can be accidentally partitioned or where a
partitioned md array in combination with superblocks at the end and
using a whole disk device can be misrecognized as a partitioned normal
drive.  There are, on the other hand, cases where it's perfectly safe to
use unpartitioned devices, or superblocks at the end of devices.  My
recommendation when someone asks what to do is to use partitions, and to
use superblocks at the beginning of the devices (except for /boot since
that isn't supported at the moment).  The reason I give that advice is
that I assume if a person knows enough to know when it's safe to use
unpartitioned devices, like Luca, then they w

Requesting migrate device options for raid5/6

2007-10-29 Thread Goswin von Brederlow
Hi,

I would welcome if someone could work on a new feature for raid5/6
that would allow replacing a disk in a raid5/6 with a new one without
having to degrade the array.

Consider the following situation:

raid5 md0 : sda sdb sdc

Now sda gives a "SMART - failure imminent" warning and you want to
replace it with sdd.

% mdadm --fail /dev/md0 /dev/sda
% mdadm --remove /dev/md0 /dev/sda
% mdadm --add /dev/md0 /dev/sdd

Further consider that drive sdb will give an I/O error during resync
of the array or fail completely. The array is in degraded mode, so you
experience data loss.


But that is completely avoidable and some hardware raids support disk
migration too. Loosely speaking the kernel should do the following:

raid5 md0 : sda sdb sdc
-> create internal raid1 or dm-mirror
raid1 mdT : sda
raid5 md0 : mdT sdb sdc
-> hot add sdd to mdT
raid1 mdT : sda sdd
raid5 md0 : mdT sdb sdc
-> resync and then drop sda
raid1 mdT : sdd
raid5 md0 : mdT sdb sdc
-> remove internal mirror
raid5 md0 : sdd sdb sdc 
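
Until md can do that internally, a rough manual approximation might look
like the following (sketch only: /dev/md9 stands in for the mdT above, the
array has to be stopped while the mirror is inserted, and it assumes your
mdadm accepts the "missing" keyword in build mode):

  mdadm --stop /dev/md0
  # superblock-less raid1 over the failing disk, then attach the new one
  mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sda missing
  mdadm /dev/md9 --add /dev/sdd
  # reassemble the raid5 with the mirror standing in for sda
  mdadm --assemble /dev/md0 /dev/md9 /dev/sdb /dev/sdc
  # once the mirror has resynced, drop the failing disk from it
  mdadm /dev/md9 --fail /dev/sda --remove /dev/sda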


Thoughts?

MfG
Goswin


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Gabor Gombas
On Mon, Oct 29, 2007 at 08:41:39AM +0100, Luca Berra wrote:

> consider a storage with 64 spt, an io size of 4k and partition starting
> at sector 63.
> first io request will require two ios from the storage (1 for sector 63,
> and one for sectors 64 to 70)
> the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
> on the same track
> the 8th will again require to be split, and so on.
> this causes the storage to do 1 unnecessary io every 8. YMMV.

That's only true for random reads. If the OS does sufficient read-ahead
then sequential reads are affected much less. But the killers are the
misaligned random writes since then (considering RAID5/6 for simplicity)
the stripe has to be read from all component disks before it can be
written back.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:18 +0100, Luca Berra wrote:
> On Sun, Oct 28, 2007 at 10:59:01PM -0700, Daniel L. Miller wrote:
> >Doug Ledford wrote:
> >>Anyway, I happen to *like* the idea of using full disk devices, but the
> >>reality is that the md subsystem doesn't have exclusive ownership of the
> >>disks at all times, and without that it really needs to stake a claim on
> >>the space instead of leaving things to chance IMO.
> >>   
> >I've been re-reading this post numerous times - trying to ignore the 
> >burgeoning flame war :) - and this last sentence finally clicked with me.
> >
> I am sorry Daniel, when i read Doug and Bill, stating that your issue
> was not having a partition table, i immediately took the bait and forgot
> about your original issue.

I never said *his* issue was lack of partition table, I just said I
don't recommend that because it's flaky.  The last statement I made
about his issue was to ask about whether the problem was happening
during initrd time or sysinit time to try and identify if it was failing
before or after / was mounted to try and determine where the issue might
lay.  Then we got off on the tangent about partitions, and at the same
time Neil started asking about udev, at which point it came out that
he's running ubuntu, and as much as I would like to help, the fact of
the matter is that I've never touched ubuntu and wouldn't have the
faintest clue, so I let Neil handle it.  At which point he found that
the udev scripts in ubuntu are being stupid, and from the looks of it
are the cause of the problem.  So, I've considered the initial issue
root caused for a bit now.


> like udev/hal that believes it knows better than you about what you have
> on your disks.
> but _NEITHER OF THESE IS YOUR PROBLEM_ imho

Actually, it looks like udev *is* the problem, but not because of
partition tables.

> I am also sorry to say that i fail to identify what the source of your
> problem is, we should try harder instead of flaming between us.

We can do both, or at least I can :-P

> Is it possible to reproduce it on the live system
> e.g. unmount, stop array, start it again and mount.
> I bet it will work flawlessly in this case.
> then i would disable starting this array at boot, and start it manually
> when the system is up (stracing mdadm, so we can see what it does)
> 
> I am also wondering about this:
> md: md0: raid array is not clean -- starting background reconstruction
> does your system shut down properly?
> do you see the message about stopping md at the very end of the
> reboot/halt process?

The root cause is that as udev adds his sata devices one at a time, on
each add of the sata device it invokes mdadm to see if there is an array
to start, and it doesn't use incremental mode on mdadm.  As a result, as
soon as there are 3 out of the 4 disks present, mdadm starts the array
in degraded mode.  It's probably a race between the mdadm started on the
third disk and mdadm started on the fourth disk that results in the
message about being unable to set the array info.  The one losing the
race gets the error as the other one has already manipulated the array
(for example, the 4th disk mdadm could be trying to add the first disk
to the array, but it's already there, so it gets this error and bails).
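
For reference, the race goes away when each hotplug event only feeds its own
device to mdadm and the array is started once it is complete, which is what
incremental mode is for (the rule below is only an illustration, not
Ubuntu's actual udev file):

  # roughly what a udev rule would run for each new block device:
  #   SUBSYSTEM=="block", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
  # the same thing done by hand for a single member:
  mdadm --incremental /dev/sdd1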

So, as much as you might dislike mkinitrd since 5.0 Luca, it doesn't
have this particular problem ;-)  In the initrd we produce, it loads all
the SCSI/SATA/etc drivers first, then calls mkblkdevs which forces all
of the devices to appear in /dev, and only then does it start the
mdadm/lvm configuration.  Daniel, I make no promises whatsoever that
this will even work at all as it may fail to load modules or all other
sorts of weirdness, but if you want to test the theory, you can download
the latest mkinitrd from fedoraproject.org, then use it to create an
initrd image under some other name than your default image name, then
manually edit your boot to have an extra stanza that uses the mkinitrd
generated initrd image instead of the ubuntu image, and then just see if
it brings the md device up cleanly instead of in degraded mode.  That
should be a fairly quick and easy way to test if Neil's analysis of the
udev script was right.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:

> >Remaking the initrd installs the new mdadm.conf file, which would have
> >then contained the whole disk devices and its UUID.  Therein would
> >have been the problem.
> yes, i read the patch, i don't like that code, as i don't like most of
> what has been put in mkinitrd from 5.0 onward.
> Imho the correct thing here would not have been copying the existing
> mdadm.conf but generating a safe one from output of mdadm -D (note -D,
> not -E)

I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:22 -0400, Bill Davidsen wrote:

> > consider a storage with 64 spt, an io size of 4k and partition starting
> > at sector 63.
> > first io request will require two ios from the storage (1 for sector 63,
> > and one for sectors 64 to 70)
> > the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
> > on the same track
> > the 8th will again require to be split, and so on.
> > this causes the storage to do 1 unnecessary io every 8. YMMV.
> No one makes drives with fixed spt any more. Your assumptions are a 
> decade out of date.

You're missing the point; it's not about drive tracks, it's about array
tracks, aka chunks.  A 64k write, that should write to one and only one
chunk, ends up spanning two.  That increases the amount of writing the
array has to do and the number of disks it busies for a typical single
I/O operation.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 20:21 -0400, Bill Davidsen wrote:
> Doug Ledford wrote:
> > On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote:
> >   
> >> On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote:
> >> 
> >>> The partition table is the single, (mostly) universally recognized
> >>> arbiter of what possible data might be on the disk.  Having a partition
> >>> table may not make mdadm recognize the md superblock any better, but it
> >>> keeps all that other stuff from even trying to access data that it
> >>> doesn't have a need to access and prevents random luck from turning your
> >>> day bad.
> >>>   
> >> on a pc maybe, but that is 20 years old design.
> >> 
> >
> > So?  Unix is 35+ year old design, I suppose you want to switch to Vista
> > then?
> >
> >   
> >> partition table design is limited because it is still based on C/H/S,
> >> which do not exist anymore.
> >> Put a partition table on a big storage, say a DMX, and enjoy a 20%
> >> performance decrease.
> >> 
> >
> > Because you didn't stripe align the partition, your bad.
> >   
> Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID 
> you're about to create), or ??? I don't notice my FC6 or FC7 install 
> programs using any special partition location to start, I have only run 
> (tried to run) FC8-test3 for the live CD, so I can't say what it might 
> do. CentOS4 didn't do anything obvious, either, so unless I really 
> misunderstand your position at redhat, that would be your bad.  ;-)
> 
> If you mean start a partition on a pseudo-CHS boundary, fdisk seems to 
> use what it thinks are cylinders for that.
> 
> Please clarify what alignment provides a performance benefit.

Luca was specifically talking about the big multi-terabyte to petabyte
hardware arrays on the market.  DMX, DDN, and others.  When they export
a volume to the OS, there is an underlying stripe layout to that volume.
If you don't use any partition table at all, you are automatically
aligned with their stripes.  However, if you do, then you have to align
your partition on a chunk boundary or else performance drops pretty
dramatically as a result of more writes than not crossing chunk
boundaries unnecessarily.  It's only relevant when you are talking about
a raid device that shows the OS a single logical disk made from lots of
other disks.
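
As an illustration (the 64K figure and device name are placeholders -- use
whatever chunk size the array vendor documents), the fix is simply to start
the partition on a chunk boundary instead of fdisk's traditional sector 63:

  # first partition starting at sector 128 (64K), using the rest of the volume
  echo "128,," | sfdisk -uS --force /dev/sdX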


-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Bill Davidsen

Luca Berra wrote:

On Sun, Oct 28, 2007 at 08:21:34PM -0400, Bill Davidsen wrote:

Because you didn't stripe align the partition, your bad.
  
Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID 

the real stripe (track) size of the storage, you must read the manual
and/or bug technical support for that info.


That's my point, there *is* no "real stripe (track) size of the storage" 
because modern drives use zone bit recording, and sectors per track 
depends on track, and changes within a partition. See

 http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
 http://www.storagereview.com/guide2000/ref/hdd/op/mediaTracks.html
you're about to create), or ??? I don't notice my FC6 or FC7 install 
programs using any special partition location to start, I have only 
run (tried to run) FC8-test3 for the live CD, so I can't say what it 
might do. CentOS4 didn't do anything obvious, either, so unless I 
really misunderstand your position at redhat, that would be your 
bad.  ;-)


If you mean start a partition on a pseudo-CHS boundary, fdisk seems 
to use what it thinks are cylinders for that.
Yes, fdisk will create partitions at sector 63 (due to CHS being braindead,
other than fictional: 63 sectors-per-track).
most arrays use 64 or 128 spt, and array caches are aligned accordingly.
So 63 is almost always the wrong choice.


As the above links show, there's no right choice.


for the default choice you must consider what spt your array uses, iirc
(this is from memory, so double check these figures)
IBM 64 spt (i think)
EMC DMX 64
EMC CX 128???
HDS (and HP XP) except OPEN-V 96
HDS (and HP XP) OPEN-V 128
HP EVA 4/6/8 with XCS 5.x state that no alignment is needed even if i
never found a technical explanation about that.
previous HP EVA versions did (maybe 64).
you might then want to consider how data is laid out on the storage, but
i believe the storage cache is enough to deal with that issue.

Please note that "0" is always well aligned.

Note to people who is now wondering WTH i am talking about.

consider a storage with 64 spt, an io size of 4k and partition starting
at sector 63.
first io request will require two ios from the storage (1 for sector 63,
and one for sectors 64 to 70)
the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
on the same track
the 8th will again require to be split, and so on.
this causes the storage to do 1 unnecessary io every 8. YMMV.
No one makes drives with fixed spt any more. Your assumptions are a 
decade out of date.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-29 Thread BERTRAND Joël

Ming Zhang wrote:

> off topic, could you resubmit the alignment issue patch to the list and
> see if tomof accepts it? he needs the patch inlined in the email. it was
> found and fixed by you, so it would be better if you post it (instead of
> me). thx.


diff -u kernel.old/iscsi.c kernel/iscsi.c
--- kernel.old/iscsi.c  2007-10-29 09:49:16.0 +0100
+++ kernel/iscsi.c  2007-10-17 11:19:14.0 +0200
@@ -726,13 +726,26 @@
case READ_10:
case WRITE_10:
case WRITE_VERIFY:
-   *off = be32_to_cpu(*(u32 *)&cmd[2]);
+   *off = be32_to_cpu((((u32) cmd[2]) << 24) |
+   (((u32) cmd[3]) << 16) |
+   (((u32) cmd[4]) << 8) |
+   cmd[5]);
*len = (cmd[7] << 8) + cmd[8];
break;
case READ_16:
case WRITE_16:
-   *off = be64_to_cpu(*(u64 *)&cmd[2]);
-   *len = be32_to_cpu(*(u32 *)&cmd[10]);
+   *off = be32_to_cpu((((u64) cmd[2]) << 56) |
+   (((u64) cmd[3]) << 48) |
+   (((u64) cmd[4]) << 40) |
+   (((u64) cmd[5]) << 32) |
+   (((u64) cmd[6]) << 24) |
+   (((u64) cmd[7]) << 16) |
+   (((u64) cmd[8]) << 8) |
+   cmd[9]);
+   *len = be32_to_cpu((((u32) cmd[10]) << 24) |
+   (((u32) cmd[11]) << 16) |
+   (((u32) cmd[12]) << 8) |
+   cmd[13]);
break;
default:
BUG();
diff -u kernel.old/target_disk.c kernel/target_disk.c
--- kernel.old/target_disk.c2007-10-29 09:49:16.0 +0100
+++ kernel/target_disk.c2007-10-17 16:04:06.0 +0200
@@ -66,13 +66,15 @@
	unsigned char geo_m_pg[] = {0x04, 0x16, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00,
	                            0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	                            0x00, 0x00, 0x00, 0x00, 0x3a, 0x98, 0x00, 0x00};
-   u32 ncyl, *p;
+   u32 ncyl;
+   u32 n;

/* assume 0xff heads, 15krpm. */
memcpy(ptr, geo_m_pg, sizeof(geo_m_pg));
ncyl = sec >> 14; /* 256 * 64 */
-   p = (u32 *)(ptr + 1);
-   *p = *p | cpu_to_be32(ncyl);
+   memcpy(&n,ptr+1,sizeof(u32));
+   n = n | cpu_to_be32(ncyl);
+   memcpy(ptr+1, &n, sizeof(u32));
return sizeof(geo_m_pg);
 }

@@ -249,7 +251,10 @@
struct iet_volume *lun;
int rest, idx = 0;

-   size = be32_to_cpu(*(u32 *)&req->scb[6]);
+   size = be32_to_cpu((((u32) req->scb[6]) << 24) |
+   (((u32) req->scb[7]) << 16) |
+   (((u32) req->scb[8]) << 8) |
+   req->scb[9]);
if (size < 16)
return -1;

Regards,

JKB


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Luca Berra

On Sun, Oct 28, 2007 at 01:47:55PM -0400, Doug Ledford wrote:

On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:

On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
>It was only because I wasn't using mdadm in the initrd and specifying
>uuids that it found the right devices to start and ignored the whole
>disk devices.  But, when I later made some more devices and went to
>update the mdadm.conf file using mdadm -Eb, it found the devices and
>added it to the mdadm.conf.  If I hadn't checked it before remaking my
>initrd, it would have hosed the system.  And it would have passed all
the above is not clear to me, afair redhat initrd still uses
raidautorun,


RHEL does, but this is on a personal machine I installed Fedora an and
latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
starts the needed devices using the UUID.  My first sentence above
should have read that I *was* using mdadm.

ah, ok i should look again at fedora's mkinitrd, last one i checked was
6.0.9-1 and i see mdadm was added in 6.0.9-2


 which iirc does not works with recent superblocks,
so you used uuids on kernel command line?
or you use something else for initrd?
why would remaking the initrd break it?


Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk devices and its UUID.  Therein would
have been the problem.

yes, i read the patch, i don't like that code, as i don't like most of
what has been put in mkinitrd from 5.0 onward.
Imho the correct thing here would not have been copying the existing
mdadm.conf but generating a safe one from output of mdadm -D (note -D,
not -E)


>the tests you can throw at it.  Quite simply, there is no way to tell
>the difference between those two situations with 100% certainty.  Mdadm
>tries to be smart and start the newest devices, but Luca's original
>suggestion of skip the partition scanning in the kernel and figure it
>out from user space would not have shown mdadm the new devices and would
>have gotten it wrong every time.
yes, in this particular case it would have; congratulations, you found a new
creative way of shooting yourself in the feet.


Creative, not so much.  I just backed out of what I started and tried
something else.  Lots of people do that.


maybe mdadm should do checks when creating a device to prevent these kinds
of mistakes.
i.e.
if creating an array on a partition, check the whole device for a
superblock and refuse in case it finds one

if creating an array on a whole device that has a partition table,
either require --force, or check for superblocks in every possible
partition.


What happens if you add the partition table *after* you make the whole
disk device and there are stale superblocks in the partitions?  This
still isn't infallible.

It depends on what you do with that partitioned device *after* having
created the partition table.
- If you try again to run mdadm on it (and the above is implemented) it
would fail, and you will be given a chance to wipe the stale sb.
- If you don't, and use them as plain devices _and_ leave the line in
mdadm.conf, you will suffer a lot of pain. Since the problem is known and
since fdisk/sfdisk/parted already do a lot of checks on the device, this
could be another useful one.

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Luca Berra

On Sun, Oct 28, 2007 at 10:59:01PM -0700, Daniel L. Miller wrote:

Doug Ledford wrote:

Anyway, I happen to *like* the idea of using full disk devices, but the
reality is that the md subsystem doesn't have exclusive ownership of the
disks at all times, and without that it really needs to stake a claim on
the space instead of leaving things to chance IMO.
  
I've been re-reading this post numerous times - trying to ignore the 
burgeoning flame war :) - and this last sentence finally clicked with me.



I am sorry Daniel, when i read Doug and Bill, stating that your issue
was not having a partition table, i immediately took the bait and forgot
about your original issue.
I have no reason to believe your problem is due to not having a
partition table on your devices.


sda: unknown partition table
sdb: unknown partition table
sdc: unknown partition table
sdd: unknown partition table

the above clearly shows that the kernel does not see a partition table
where there is none, which happens in some cases and bit Doug so hard.
Note, it does not happen at random, it should happen only if you use a
partitioned md device with a superblock at the end. Or if you configure
it wrongly as Doug did. (i am not accusing Doug of being stupid at all,
it is a fairly common mistake to make and we should try to prevent this
in mdadm as much as we can)
Again, having the kernel find a partition table where there is none,
should not pose a problem at all unless there is some badly designed software
like udev/hal that believes it knows better than you about what you have
on your disks.
but _NEITHER OF THESE IS YOUR PROBLEM_ imho

I am also sorry to say that i fail to identify what the source of your
problem is, we should try harder instead of flaming between us.

Is it possible to reproduce it on the live system
e.g. unmount, stop array, start it again and mount.
I bet it will work flawlessly in this case.
then i would disable starting this array at boot, and start it manually
when the system is up (stracing mdadm, so we can see what it does)

I am also wondering about this:
md: md0: raid array is not clean -- starting background reconstruction
does your system shut down properly?
do you see the message about stopping md at the very end of the
reboot/halt process?

L.


--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
/"\
\ / ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \