Re: transferring RAID-1 drives via sneakernet

2008-02-13 Thread David Greaves
Jeff Breidenbach wrote:
 It's not a RAID issue, but make sure you don't have any duplicate volume
 names.  According to Murphy's Law, if there are two / volumes, the wrong
 one will be chosen upon your next reboot.
 
 Thanks for the tip. Since I'm not using volumes or LVM at all, I should be
 safe from this particular problem.

"Volumes" is being used as a generic term here.

You would be safest if, for the disks/partitions you are transferring, you made
the partition type 0x83 (Linux) instead of 0xfd to prevent the kernel from
autodetecting them.

Otherwise there is a risk that /dev/md0 and /dev/md1 will be transposed.

Having done that you can manually assemble the array and then configure
mdadm.conf to associate the UUID with the correct md device.
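
Roughly, once the disks are in the new box (device names below are only
placeholders):

  mdadm --assemble /dev/md1 /dev/sdc1 /dev/sdd1
  mdadm --detail --scan

and copy the resulting ARRAY line (it includes the UUID) into mdadm.conf.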

David


Re: transferring RAID-1 drives via sneakernet

2008-02-12 Thread David Greaves
Jeff Breidenbach wrote:
 I'm planning to take some RAID-1 drives out of an old machine
 and plop them into a new machine. Hoping that mdadm assemble
 will magically work. There's no reason it shouldn't work. Right?
 
 old  [ mdadm v1.9.0 / kernel 2.6.17 / Debian Etch / x86-64 ]
 new  [ mdadm v2.6.2 / kernel 2.6.22 / Ubuntu 7.10 server ]

I've done it several times.

Does the new machine have a RAID array already?

David



Re: [PATCH] Use new sb type

2008-02-11 Thread David Greaves
Jan Engelhardt wrote:
 Feel free to argue that the manpage is clear on this - but as we know, not
 everyone reads the manpages in depth...
 
 That is indeed suboptimal (but I would not care since I know the
 implications of an SB at the front)

Neil cares even less and probably doesn't even need mdadm - heck, he probably
just echoes the raw superblock into place via dd...

http://xkcd.com/378/

:D

David


Re: howto and faq

2008-02-10 Thread David Greaves
Keld Jørn Simonsen wrote:
 I am trying to get some order to linux raid info.
Help appreciated :)

 The list description at
 http://vger.kernel.org/vger-lists.html#linux-raid
 does list a FAQ, http://www.linuxdoc.org/FAQ/
Yes, that should be amended. Drop them a line about the FAQ too

 So our FAQ info is pretty out of date. I think it would be nice to have
 a wiki like we have for the Howto. This would mean that we have much
 better means to let new people make their mark, and avoid the problem
 that we have today with really outdated info.

There seems to be no point in having separate wikis for the FAQ and HOWTO
elements of the documentation - especially since a lot of FAQs are "How do I...?"
questions; by definition the answer is a HOWTO.


 So can we put up a wiki somewhere for this, or should we just extend the
 wiki howto pages to also include a faq section?
So just extend the existing wiki.

 For the howto, I have asked the VGER people to add info to our list
 description, that we have a wiki howto at http://linux-raid.osdl.org/
ta.


I set the wiki up at osdl to ensure that if a bus hit me then Neil or others
would have a rational and responsive organisation to go to in order to change
ownership.

I've been writing to some of the other FAQ/Doc organisations sporadically for
over a year now and had no response from any of them. It's a very poor aspect of
OSS...

David




Re: [PATCH] Use new sb type

2008-02-10 Thread David Greaves
Jan Engelhardt wrote:
 On Jan 29 2008 18:08, Bill Davidsen wrote:
 
 IIRC there was a discussion a while back on renaming mdadm options
 (google Time to deprecate old RAID formats?) and the superblocks
 to emphasise the location and data structure. Would it be good to
 introduce the new names at the same time as changing the default
 format/on-disk-location?
 Yes, I suggested some layout names, as did a few other people, and
 a few changes to separate metadata type and position were
 discussed. BUT, changing the default layout, no matter how better
 it seems, is trumped by breaks existing setups and user practice.
 
 Layout names are a different matter from what the default sb type should 
 be.
Indeed they are. Or rather should be.

However the current default sb includes a layout element. If the default sb is
changed then it seems like an opportunity to detach the data format from the
on-disk location.

David



Re: when is a disk non-fresh?

2008-02-10 Thread David Greaves
Dexter Filmore wrote:
 On Friday 08 February 2008 00:22:36 Neil Brown wrote:
 On Thursday February 7, [EMAIL PROTECTED] wrote:
 On Tuesday 05 February 2008 03:02:00 Neil Brown wrote:
 On Monday February 4, [EMAIL PROTECTED] wrote:
 Seems the other topic wasn't quite clear...
 not necessarily.  sometimes it helps to repeat your question.  there
 is a lot of noise on the internet and sometimes important things get
 missed... :-)

 Occasionally a disk is kicked for being non-fresh - what does this
 mean and what causes it?
 The 'event' count is too small.
 Every event that happens on an array causes the event count to be
 incremented.
 An 'event' here is any atomic action? Like write byte there or calc
 XOR?
 An 'event' is
- switch from clean to dirty
- switch from dirty to clean
- a device fails
- a spare finishes recovery
 things like that.
 
 Is there a glossary that explains dirty and such in detail?

Not yet.

http://linux-raid.osdl.org/index.php?title=Glossary

David


Re: [PATCH] Use new sb type

2008-02-10 Thread David Greaves
Jan Engelhardt wrote:
 On Feb 10 2008 10:34, David Greaves wrote:
 Jan Engelhardt wrote:
 On Jan 29 2008 18:08, Bill Davidsen wrote:

 IIRC there was a discussion a while back on renaming mdadm options
 (google Time to deprecate old RAID formats?) and the superblocks
 to emphasise the location and data structure. Would it be good to
 introduce the new names at the same time as changing the default
 format/on-disk-location?
 Yes, I suggested some layout names, as did a few other people, and
 a few changes to separate metadata type and position were
 discussed. BUT, changing the default layout, no matter how better
 it seems, is trumped by breaks existing setups and user practice.
 Layout names are a different matter from what the default sb type should 
 be.
 Indeed they are. Or rather should be.

 However the current default sb includes a layout element. If the default sb 
 is
 changed then it seems like an opportunity to detach the data format from the
 on-disk location.
 
 I do not see anything wrong by specifying the SB location as a metadata
 version. Why should not location be an element of the raid type?
 It's fine the way it is IMHO. (Just the default is not :)

There was quite a discussion about it.

For me the main argument is that for most people seeing superblock versions
(even the manpage terminology is version and subversion) will correlate
incremental versions with improvement.
They will therefore see v1.2 as 'the latest and best'.
We had our first 'in the wild' example just a few days ago.

Feel free to argue that the manpage is clear on this - but as we know, not
everyone reads the manpages in depth...

It's misleading and I would submit that *if* Neil decides to change the default
then changing the terminology at the same time would mean a single change that
ushers in broader benefit.

I acknowledge that I am only talking semantics - OTOH I think semantics can be a
very important aspect of communication.

David
PS I would love to send a patch to mdadm in - I am currently being heavily
nagged to sort out our house electrics and get lunch. It may happen though :)






Re: howto and faq

2008-02-10 Thread David Greaves
Keld Jørn Simonsen wrote:
 I would then like that to be reflected in the main page.
 I would rather that this be called Howto and FAQ - Linux raid
 than Main Page - Linux Raid. Is that possible?

Just like C has a main(), wikis have a Main Page :)

I guess it could be changed but I think it involves editing the Mediawiki config
- maybe next time I'm in there...


 And then, how do we structure the pages? I think we need a new section
 for the FAQ.

By all means create an FAQ page and link to answers or other relevant sections
of the wiki. Bear in mind that this is a reference work and whilst it may
contain tutorials the idea is that it contains (reasonably) authoritative
information about the linux raid subsystem (linking to the source, kernel docs
or man pages if that's more appropriate).

 And then I would like a clearer statement on the relation between the
 linux-raid mailing list and the pages, right in the top of the main page.
The relationship is loose - the statement as it stands describes the current
state of affairs. If Neil feels that he could or would like to help the case by
declaring a more official relationship then that's his call. To be fair I work
on these pages on and off as the mood takes me :). If I were Neil I'd be keeping
an eye on it and waiting for the right level of community involvement.

 I have had a look at other search engines, yahoo and msn.
 Our pages do show up within the 10 first hits for linux raid.
 So that is not that bad. Still, Google has the
 http://linux-raid.osdl.org/ page as number 127. That is very bad.
 Maybe something about it being referenced from wikipedia?

I'm not an expert at gaming the search engines - more than happy to do rational
things like linking from Wikipedia and other reference sites.

I am sad that I've had such a poor response from the other linux documentation
sites... maybe a Slashdot article not so much about doc-rot but about the
difficulty of combating doc-rot would help...

Maybe they'd take more notice if I said "the linux raid subsystem maintainer
says..." - dunno.

David



Re: Deleting mdadm RAID arrays

2008-02-06 Thread David Greaves
Marcin Krol wrote:
 Hello everyone,
 
 I have had a problem with RAID array (udev messed up disk names, I've had 
 RAID on
 disks only, without raid partitions)

Do you mean that you originally used /dev/sdb for the RAID array? And now you
are using /dev/sdb1?

Given the system seems confused I wonder if this may be relevant?

David



Re: RAID 1 and grub

2008-01-31 Thread David Greaves
Richard Scobie wrote:
 David Rees wrote:
 
 FWIW, this step is clearly marked in the Software-RAID HOWTO under
 Booting on RAID:
 http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html#ss7.3
 
 The one place I didn't look...

Good - I hope you'll both look here instead:

http://linux-raid.osdl.org/index.php/Tweaking%2C_tuning_and_troubleshooting#Booting_on_RAID

David



Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread David Greaves
Peter Rabbitson wrote:
 I guess I will sit down tonight and craft some patches to the existing
 md* man pages. Some things are indeed left unsaid.
If you want to be more verbose than a man page allows then there's always the
wiki/FAQ...

http://linux-raid.osdl.org/

Keld Jørn Simonsen wrote:
 Is there an official web page for mdadm?
 And maybe the raid faq could be updated?

That *is* the linux-raid FAQ brought up to date (with the consent of the
original authors)

Of course being a wiki means it is now a shared, community responsibility - and
to all present and future readers: that means you too ;)

David



Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread David Greaves
On 26 Oct 2007, Neil Brown wrote:
On Thursday October 25, [EMAIL PROTECTED] wrote:
 I also suspect that a *lot* of people will assume that the highest superblock
 version is the best and should be used for new installs etc.

 Grumble... why can't people expect what I want them to expect?


Moshe Yudkowsky wrote:
 I expect it's because I used 1.2 superblocks (why
 not use the latest, I said, foolishly...) and therefore the RAID10 --

Aha - an 'in the wild' example of why we should deprecate '0.9, 1.0, 1.1, 1.2' and
rename the superblocks to data-version + on-disk-location :)


David







Re: linux raid faq

2008-01-30 Thread David Greaves
Keld Jørn Simonsen wrote:
 Hmm, I read the Linux raid faq on
 http://www.faqs.org/contrib/linux-raid/x37.html
 
 It looks pretty outdated, referring to how to patch 2.2 kernels and
 not mentioning new mdadm, nor raid10. It was not dated. 
 It seemed to be related to the linux-raid list, telling where to find
 archives of the list.
 
 Maybe time for an update? or is this not the right place to write stuff?

http://linux-raid.osdl.org/index.php/Main_Page

I have written to faqs.org but got no reply. I'll try again...

 
 If I searched on google for raid faq, the first say 5-7 items did not
 mention raid10.

Until people link to and use the new wiki, Google won't find it.


 Maybe wikipedia is the way to go? I did contribute myself a little
 there.
 
 The software raid howto is dated v. 1.1 3rd of June 2004,
 http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html
 also pretty old.

FYI
http://linux-raid.osdl.org/index.php/Credits

David



Re: WRONG INFO (was Re: In this partition scheme, grub does not find md information?)

2008-01-30 Thread David Greaves
Peter Rabbitson wrote:
 Moshe Yudkowsky wrote:
 over the other. For example, I've now learned that if I want to set up
 a RAID1 /boot, it must actually be 1.2 or grub won't be able to read
 it. (I would therefore argue that if the new version ever becomes
 default, then the default sub-version ought to be 1.2.)
 
 In the discussion yesterday I myself made a serious typo, that should
 not spread. The only superblock version that will work with current GRUB
 is 1.0 _not_ 1.2.

Ah, the joys of consolidated and yet editable documentation - like a wiki

David


Re: [PATCH] Use new sb type

2008-01-30 Thread David Greaves
Bill Davidsen wrote:
 David Greaves wrote:
 Jan Engelhardt wrote:
  
 This makes 1.0 the default sb type for new arrays.

 

 IIRC there was a discussion a while back on renaming mdadm options
 (google Time
 to  deprecate old RAID formats?) and the superblocks to emphasise the
 location
 and data structure. Would it be good to introduce the new names at the
 same time
 as changing the default format/on-disk-location?
   
 
 Yes, I suggested some layout names, as did a few other people, and a few
 changes to separate metadata type and position were discussed. BUT,
 changing the default layout, no matter how better it seems, is trumped
 by breaks existing setups and user practice. For all of the reasons
 something else is preferable, 1.0 *works*.

It wasn't my intention to change anything other than the naming.

If the default layout was being updated to 1.0 then I thought it would be a good
time to introduce 1-start, 1-4k and 1-end names and actually announce a default
of 1-end and not 1.0.

Although I still prefer a full separation:
  mdadm --create /dev/md0 --metadata 1 --meta-location start

David



Re: RAID 1 and grub

2008-01-30 Thread David Rees
On Jan 30, 2008 2:06 PM, Richard Scobie [EMAIL PROTECTED] wrote:
 hda has failed and after spending some time with a rescue disk mounting
 hdc's /boot partition (hdc1) and changing the grub.conf device
 parameters, I have no success in booting off it.

 I then set them back to the original (hd0,0) and moved hdc into hda's
 position.

 Booting from there brings up the message: GRUB hard disk error

Have you tried re-running grub-install after booting from a rescue disk?

-Dave


Re: RAID 1 and grub

2008-01-30 Thread David Rees
On Jan 30, 2008 6:33 PM, Richard Scobie [EMAIL PROTECTED] wrote:
 I found this document very useful:
 http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html

 After modifying my grub.conf to refer to (hd0,0), reinstalling grub on
 hdc with:

 grub> device (hd0) /dev/hdc
 grub> root (hd0,0)
 grub> (hd0)

 and rebooting with the bios set to boot off hdc, everything burst back
 into life.

FWIW, this step is clearly marked in the Software-RAID HOWTO under
Booting on RAID:
http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html#ss7.3

If it appears that Fedora isn't doing this when installing on a
Software RAID 1 boot device, I suggest you open a bug.

BTW, I suspect you are missing the "setup" command from your 3rd
command above; it should be:

# grub
grub> device (hd0) /dev/hdc
grub> root (hd0,0)
grub> setup (hd0)

 I shall now be checking all my Fedora/Centos RAID1 installs for grub
 installed on both drives.

Good idea. Whenever setting up a RAID1 device to boot from, I perform
the above 3 steps. I also suggest using labels to identify partitions
and testing both failure modes - make sure you are able to boot with
either drive disconnected.

-Dave


Re: [PATCH] Use new sb type

2008-01-28 Thread David Greaves
Jan Engelhardt wrote:
 This makes 1.0 the default sb type for new arrays.
 

IIRC there was a discussion a while back on renaming mdadm options (google Time
to  deprecate old RAID formats?) and the superblocks to emphasise the location
and data structure. Would it be good to introduce the new names at the same time
as changing the default format/on-disk-location?

David



Re: [PATCH] Use new sb type

2008-01-28 Thread David Greaves
Peter Rabbitson wrote:
 David Greaves wrote:
 Jan Engelhardt wrote:
 This makes 1.0 the default sb type for new arrays.


 IIRC there was a discussion a while back on renaming mdadm options
 (google Time
 to  deprecate old RAID formats?) and the superblocks to emphasise the
 location
 and data structure. Would it be good to introduce the new names at the
 same time
 as changing the default format/on-disk-location?

 David
 
 Also wasn't the concession to make 1.1 default instead of 1.0 ?
 
IIRC Doug Ledford did some digging wrt lilo + grub and found that 1.1 and 1.2
wouldn't work with them. I'd have to review the thread though...

David


Re: identifying failed disk/s in an array.

2008-01-23 Thread David Greaves
Tomasz Chmielewski wrote:
 Michael Harris schrieb:
 i have a disk fail say HDC for example, i wont know which disk HDC is
 as it could be any of the 5 disks in the PC. Is there anyway to make
 it easier to identify which disk is which?.
 
 If the drives have any LEDs, the most reliable way would be:
 
 dd if=/dev/drive of=/dev/null
 
 Then look which LED is the one which blinks the most.

And/or use smartctl to look up the make/model/serial number and look at the
drive label. I always do this to make sure I'm pulling the right drive (also
useful to RMA the drive)
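
For example (sdc is just an illustrative device name):

  smartctl -i /dev/sdc

prints the model and serial number that you can match against the sticker on
the drive.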


David


Re: how to create a degraded raid1 with only 1 of 2 drives ??

2008-01-20 Thread David Greaves
Mitchell Laks wrote:
 I think my error was that maybe I did not
 do write the fdisk changes to the drive with 
 fdisk w

No - your problem was that you needed to use the literal word missing
like you did this time:
 mdadm -C /dev/md0 --level=2 -n2 /dev/sda1 missing

[however, this time you also asked for a RAID2 (--level=2) which I doubt would 
work]
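
A two-disk degraded RAID1 create would be something like (using sda1 as in
your mail):

  mdadm -C /dev/md0 --level=1 -n2 /dev/sda1 missing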

David


Re: Last ditch plea on remote double raid5 disk failure

2008-01-01 Thread David Rees
On Dec 31, 2007 2:39 AM, Marc MERLIN [EMAIL PROTECTED] wrote:
 new years eve :(  I was wondering if I can tell the kernel not to kick
 a drive out of an array if it sees a block error and just return the
 block error upstream, but continue otherwise (all my partitions are on
 a raid5 array, with lvm on top, so even if I were to lose a partition,
 I would still be likely to get the other ones back up if I can stop
 the auto kicking-out and killing the md array feature).

Best bet is to get a new drive into the machine that is at least the
same size as the bad-sector disk, use dd_rescue[1] to copy as much of
the bad-sector disk to the new one.
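
Something like this - device names are purely illustrative (failing disk first,
new disk second; don't get them reversed):

  dd_rescue /dev/sd_old /dev/sd_new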

Remove the bad-sector disk, reboot and hopefully you'll have a
functioning raid array with a bit of bad data on it somewhere.

I'm probably missing a step somewhere but you get the general idea...

-Dave

[1] http://www.garloff.de/kurt/linux/ddrescue/


Re: Few questions

2007-12-08 Thread David Greaves
Guy Watkins wrote:
 man md
 man mdadm
and
http://linux-raid.osdl.org/index.php/Main_Page

:)



/proc/mdstat docs (was Re: Few questions)

2007-12-08 Thread David Greaves
Michael Makuch wrote:
 So my questions are:
...
 - Is this a.o.k for a raid5 array?

So I realised that /proc/mdstat isn't documented too well anywhere...

http://linux-raid.osdl.org/index.php/Mdstat

Comments welcome...

David


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread David Rees
On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:
 On Wed, 5 Dec 2007, Jon Nelson wrote:

  I saw something really similar while moving some very large (300MB to
  4GB) files.
  I was really surprised to see actual disk I/O (as measured by dstat)
  be really horrible.

 Any work-arounds, or just don't perform heavy reads the same time as
 writes?

What kernel are you using? (Did I miss it in your OP?)

The per-device write throttling in 2.6.24 should help significantly,
have you tried the latest -rc and compared to your current kernel?

-Dave


Re: assemble vs create an array.......

2007-12-06 Thread David Chinner
On Thu, Dec 06, 2007 at 07:39:28PM +0300, Michael Tokarev wrote:
 What to do is to give repairfs a try for each permutation,
 but again without letting it to actually fix anything.
 Just run it in read-only mode and see which combination
 of drives gives less errors, or no fatal errors (there
 may be several similar combinations, with the same order
 of drives but with different drive missing).

Ugggh. 

 It's sad that xfs refuses mount when structure needs
 cleaning - the best way here is to actually mount it
 and see how it looks like, instead of trying repair
 tools. 

It's self-protection - if you try to write to a corrupted filesystem,
you'll only make the corruption worse. Mounting involves log
recovery, which writes to the filesystem.

 Is there some option to force-mount it still
 (in readonly mode, knowing it may OOPs kernel etc)?

Sure you can: mount -o ro,norecovery dev mtpt

But if you hit corruption it will still shut down on you. If
the machine oopses then that is a bug.

 thread prompted me to think.  If I can't force-mount it
 (or browse it using other ways) as I can almost always
 do with (somewhat?) broken ext[23] just to examine things,
 maybe I'm trying it before it's mature enough? ;)

Hehe ;)

For maximum uber-XFS-guru points, learn to browse your filesystem
with xfs_db. :P

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: assemble vs create an array.......

2007-12-05 Thread David Greaves
Dragos wrote:
 Thank you for your very fast answers.
 
 First I tried 'fsck -n' on the existing array. The answer was that If I
 wanted to check a XFS partition I should use 'xfs_check'. That seems to
 say that my array was partitioned with xfs, not reiserfs. Am I correct?
 
 Then I tried the different permutations:
 mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sda1
 /dev/sdb1
 mount /dev/md0 temp
 mdadm --stop --scan
 
 mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sda1 missing
 /dev/sdb1
 mount /dev/md0 temp
 mdadm --stop --scan
 
[etc]

 
 With some arrays mount reported:
   mount: you must specify the filesystem type
 and with others:
   mount: Structure needs cleaning
 
 No choice seems to have been successful.

OK, not as good as you could have hoped for.

Make sure you have the latest xfs tools.

You may want to try xfs_repair; you can use the -n option (I think - check the
man page).

You may need to force it to ignore the log
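
i.e. a read-only check first, something like:

  xfs_repair -n /dev/md0

(-L is the force-log-zeroing option if it comes to that - see the man page).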

David





Re: assemble vs create an array.......

2007-11-30 Thread David Greaves
Neil Brown wrote:
 On Thursday November 29, [EMAIL PROTECTED] wrote:
 2. Do you know of any way to recover from this mistake? Or at least what 
 filesystem it was formated with.
It may not have been lost - yet.


 If you created the same array with the same devices and layout etc,
 the data will still be there, untouched.
 Try to assemble the array and use fsck on it.
To be safe I'd use fsck -n (check the man page as this is odd for reiserfs)


 When you create a RAID5 array, all that is changed is the metadata (at
 the end of the device) and one drive is changed to be the xor of all
 the others.
In other words, one of your 3 drives has just been erased.
Unless you know the *exact* command you used and have the dmesg output to hand
then we won't know which one.

Now what you need to do is to try all the permutations of creating a degraded
array using 2 of the drives and specify the 3rd as 'missing':

So something like:
mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sdb1 /dev/sdc1
mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 missing /dev/sdc1
mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 /dev/sdc1 missing
mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sdb1 /dev/sdd1
mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 missing /dev/sdd1
mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sdb1 /dev/sdd1 missing
etc etc

It is important to create the array using a 'missing' device so the xor data
isn't written.

There is a program here: http://linux-raid.osdl.org/index.php/Permute_array.pl
that may help...

David




Re: RAID5 Recovery

2007-11-14 Thread David Greaves
 localhost kernel: [17179584.18] md: md driver
hdc is kicked too (again)

 Nov 13 07:30:24 localhost kernel: [17179584.184000] md: raid5
 personality registered as nr 4
Another reboot...

 Nov 13 07:30:24 localhost kernel: [17179585.068000] md: syncing RAID array md0
Now (I guess) hdg is being restored using hdc data:

 Nov 13 07:30:24 localhost kernel: [17179684.16] ReiserFS: md0:
 warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md0
But Reiser is confused.

 Nov 13 08:57:11 localhost kernel: [17184895.816000] md: md0: sync done.
hdg is back up to speed:


So hdc looks faulty.
Your only hope (IMO) is to use reiserfs recovery tools.
You may want to replace hdc to avoid an hdc failure interrupting any rebuild.

I think what happened is that hdg failed prior to 2am and you didn't notice
(mdadm --monitor is your friend). Then hdc had a real failure - at that point
you had data loss (not enough good disks). I don't know why md rebuilt using hdc
- I would expect it to have found hdc and hdg stale. If this is a newish kernel
then maybe Neil should take a look...

David



Re: Fwd: RAID5 Recovery

2007-11-14 Thread David Greaves
Neil Cavan wrote:
 Thanks for taking a look, David.
No problem.

 Kernel:
 2.6.15-27-k7, stock for Ubuntu 6.06 LTS
 
 mdadm:
 mdadm - v1.12.0 - 14 June 2005
OK - fairly old then. Not really worth trying to figure out why hdc got re-added
when things had gone wrong.

 You're right, earlier in /var/log/messages there's a notice that hdg
 dropped, I missed it before. I use mdadm --monitor, but I recently
 changed the target email address - I guess it didn't take properly.
 
 As for replacing hdc, thanks for the diagnosis but it won't help: the
 drive is actually fine, as is hdg. I've replaced hdc before, only to
 have the brand new hdc show the same behaviour, and SMART says the
 drive is A-OK. There's something flaky about these PCI IDE
 controllers. I think it's new system time.
Any excuse eh? :)


 Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a
 file system superblock. Is --rebuild-sb the way to go here?
No idea, sorry. I only ever tried Reiser once and it failed. It was very hard to
get recovered so I swapped back to XFS.

Good luck on the fscking

David


Re: Raid5 assemble after dual sata port failure

2007-11-11 Thread David Greaves
Chris Eddington wrote:
 Hi,
 
 Thanks for the pointer on xfs_repair -n , it actually tells me something
 (some listed below) but I'm not sure what it means but there seems to be
 a lot of data loss.  One complication is I see an error message in ata6,
 so I moved the disks around thinking it was a flaky sata port, but I see
 the error again on ata4 so it seems to follow the disk.  But it happens
 exactly at the same time during xfs_repair sequence, so I don't think it
 is a flaky disk.
Does dmesg have any info/sata errors?

xfs_repair will have problems if the disk is bad. You may want to image the disk
(possibly onto the 'spare'?) if it is bad.

  I'll go to the xfs mailing list on this.
Very good idea :)

 Is there a way to be sure the disk order is right? 
The order looks right to me.
xfs_repair wouldn't recognise it as well as it does if the order was wrong.

 not way out of wack since I'm seeing so much from xfs_repair.  Also
 since I've been moving the disks around, I want to be sure I have the
 right order.

Bear in mind that -n stops the repair from fixing problems. Then as the 'repair'
proceeds it becomes very confused by problems that should have been fixed.

This is evident in the superblock issue (which also probably explains the failed
mount).


 
 Is there a way to try restoring using the other disk?
No - the event count was far too out of date.




Re: Raid5 assemble after dual sata port failure

2007-11-11 Thread David Greaves
Chris Eddington wrote:
 Yes, there is some kind of media error message in dmesg, below.  It is
 not random, it happens at exactly the same moments in each xfs_repair -n
 run.
 Nov 11 09:48:25 altair kernel: [37043.300691]  res
 51/40:00:01:00:00/00:00:00:00:00/e1 Emask 0x9 (media error)
 Nov 11 09:48:25 altair kernel: [37043.304326] ata4.00: ata_hpa_resize 1:
 sectors = 976773168, hpa_sectors = 976773168
 Nov 11 09:48:25 altair kernel: [37043.307672] ata4.00: ata_hpa_resize 1:
 sectors = 976773168, hpa_sectors = 976773168

I'm not sure what an ata_hpa_resize error is...

It probably explains the problems you've been having with the raid not 'just
recovering' though.

I saw this:
http://www.linuxquestions.org/questions/linux-kernel-70/sata-issues-568894/


What does smartctl say about your drive?

IMO the spare drive is no longer useful for data recovery - you may want to use
ddrescue to try and copy this drive to the spare drive.

David
PS Don't get the ddrescue parameters the wrong way round if you go that route...
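With GNU ddrescue that would be roughly (device names are only examples; the
failing disk is the first argument, the spare the second):

  ddrescue /dev/sd_failing /dev/sd_spare /root/rescue.log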


Re: Raid5 assemble after dual sata port failure

2007-11-10 Thread David Greaves
Ok - it looks like the raid array is up. There will have been an event count
mismatch which is why you needed --force. This may well have caused some
(hopefully minor) corruption.

FWIW, xfs_check is almost never worth running :) (It runs out of memory easily).
xfs_repair -n is much better.

What does the end of dmesg say after trying to mount the fs?

Also try:
xfs_repair -n -L

I think you then have 2 options:
* xfs_repair -L
This may well lose data that was being written as the drives crashed.
* contact the xfs mailing list

David

Chris Eddington wrote:
 Hi David,
 
 I ran xfs_check and get this:
 ERROR: The filesystem has valuable metadata changes in a log which needs to
 be replayed.  Mount the filesystem to replay the log, and unmount it before
 re-running xfs_check.  If you are unable to mount the filesystem, then use
 the xfs_repair -L option to destroy the log and attempt a repair.
 Note that destroying the log may cause corruption -- please attempt a mount
 of the filesystem before doing this.
 
 After mounting (which fails) and re-running xfs_check it gives the same
 message.
 
 The array info details are below and seems it is running correctly ??  I
 interpret the message above as actually a good sign - seems that
 xfs_check sees the filesystem but the log file and maybe the most
 currently written data is corrupted or will be lost.  But I'd like to
 hear some advice/guidance before doing anything permanent with
 xfs_repair.  I also would like to confirm somehow that the array is in
 the right order, etc.  Appreciate your feedback.
 
 Thks,
 Chris
 
 
 
 
 cat /etc/mdadm/mdadm.conf
 DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
 ARRAY /dev/md0 level=raid5 num-devices=4
 UUID=bc74c21c:9655c1c6:ba6cc37a:df870496
 MAILADDR root
 
 cat /proc/mdstat
 Personalities : [raid6] [raid5] [raid4]
 md0 : active raid5 sda1[0] sdd1[2] sdb1[1]
  1465151808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
 unused devices: none
 
 mdadm -D /dev/md0
 /dev/md0:
Version : 00.90.03
  Creation Time : Sun Nov  5 14:25:01 2006
 Raid Level : raid5
 Array Size : 1465151808 (1397.28 GiB 1500.32 GB)
Device Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 3
 Preferred Minor : 0
Persistence : Superblock is persistent
 
Update Time : Fri Nov  9 16:26:31 2007
  State : clean, degraded
 Active Devices : 3
 Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
 
 Layout : left-symmetric
 Chunk Size : 64K
 
   UUID : bc74c21c:9655c1c6:ba6cc37a:df870496
 Events : 0.4880384
 
Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       49        2      active sync   /dev/sdd1
       3       0        0        3      removed
 
 
 
 Chris Eddington wrote:
 Thanks David.

 I've had cable/port failures in the past and after re-adding the
 drive, the order changed - I'm not sure why, but I noticed it sometime
 ago but don't remember the exact order.

 My initial attempt to assemble, it came up with only two drives in the
 array.  Then I tried assembling with --force and that brought up 3 of
 the drives.  At that point I thought I was good, so I tried mount
 /dev/md0 and it failed.  Would that have written to the disk?  I'm
 using XFS.

 After that, I tried assembling with different drive orders on the
 command line, i.e. mdadm -Av --force /dev/md0 /dev/sda1, ... thinking
 that the order might not be right.

 At the moment I can't access the machine, but I'll try fsck -n and
 send you the other info later this evening.

 Many thanks,
 Chris




Re: Raid5 assemble after dual sata port failure

2007-11-08 Thread David Greaves
Chris Eddington wrote:
 
 Hi,
Hi
 
 While on vacation I had one SATA port/cable fail, and then four hours
 later a second one fail.  After fixing/moving the SATA ports, I can
 reboot and all drives seem to be OK now, but when assembled it won't
 recognize the filesystem.

That's unusual - if the array comes back then you should be OK.
In general if two devices fail then there is a real data loss risk.
However if the drives are good and there was just a cable glitch, then unless
you're unlucky it's usually fsck fixable.

I see
mdadm: /dev/md0 has been started with 3 drives (out of 4).

which means it's now up and running.

And:
sda1Events : 0.4880374
sdb1Events : 0.4880374
sdc1Events : 0.4857597
sdd1Events : 0.4880374

so sdc1 is way out of date... we'll add/resync that when everything else is 
working.

but:
  After futzing around with assemble options
 like --force and disk order I couldn't get it to work.

Let me check... what commands did you use? Just 'assemble' - which doesn't care
about disk order - or did you try to re-'create' the array - which does care
about disk order and leads us down a different path...
err, scratch that:
  Creation Time : Sun Nov  5 14:25:01 2006
OK, it was created a year ago... so you did use assemble.


It is slightly odd to see that the drive order is:
/dev/mapper/sda1
/dev/mapper/sdb1
/dev/mapper/sdd1
/dev/mapper/sdc1
Usually people just create them in order.


Have you done any fsck's that involve a write?

What filesystem are you running? What does your 'fsck -n' (readonly) report?

Also, please report the results of:
 cat /proc/mdstat
 mdadm -D /dev/md0
 cat /etc/mdadm.conf


David


Re: Kernel Module - Raid

2007-11-05 Thread David Greaves
Paul VanGundy wrote:
 All,
 
 Hello. I don't know if this is the right place to post this issue but it
 does deal with RAID so I thought I would try.
It deals primarily with linux *software* raid.
But stick with it - you may end up doing that...

What hardware/distro etc are you using?
Is this an expensive (hundreds of £) card? Or an onboard/motherboard chipset?

Once you answer this then it may be worth suggesting using sw-raid (in which
case we can help out) or pointing you elsewhere...

 I successfully built a new kernel and am able to boot from it.
 However, I need to incorporate a specific RAID driver (adpahci.ko) so we
 can use the on-board RAID.
I think this is the adaptec proprietary code - in which case you may need a very
specific kernel to run it. You may find others on here who can help but you'll
probably need an Adaptec forum/list.

 I have the adpahci.ko and am unable to get
 it to compile against any other kernel because I don't have the original
 kernel module (adpahci.c I assume is what I need). Is there any way I
 can view the adpahci.ko and copy the contents to make a adpahci.c?
No

 Is it
 possible to get the kernel object to compile with another kernel only
 using the adpahci.ko?
No

 Am I making sense?  :)
Yes

That's one of the big reasons proprietary drivers suck on linux.

David



Re: Kernel Module - Raid

2007-11-05 Thread David Greaves
Paul VanGundy wrote:
 Thanks for the prompt reply, David. Below are the answers to your questions:
 
 What hardware/distro etc are you using?
 Is this an expensive (hundreds of £) card? Or an onboard/motherboard chipset?
 The distro is Suse 10.1.
As a bit of trivia, Neil (who wrote and maintains linux RAID) works for Suse.

 It is an onboard chipset.
In which case it's not likely to be hardware RAID.
See: http://linux-ata.org/faq-sata-raid.html

 Once you answer this then it may be worth suggesting using sw-raid (in which
 case we can help out) or pointing you elsewhere...
You should probably configure the BIOS to use the controller in plain SATA
(non-RAID) mode and let Linux software RAID do the work.


 That's one of the big reasons proprietary drivers suck on linux.
 
 Ok. So this chipset has the ability to use an Intel based RAID. Would
 that be better?
mmm, see the link above...

In almost any case where you are considering 'onboard' raid, linux software raid
(using md and mdadm) is a better choice.

Start here: http://linux-raid.osdl.org/index.php/Main_Page
(feel free to correct it or ask here for clarification)

Also essential reading is the mdadm man page.

David



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread David Greaves
Michael Tokarev wrote:
 Justin Piszcz wrote:
 On Sun, 4 Nov 2007, Michael Tokarev wrote:
 []
 The next time you come across something like that, do a SysRq-T dump and
 post that.  It shows a stack trace of all processes - and in particular,
 where exactly each task is stuck.
 
  Yes I got it before I rebooted, ran that and then dmesg > file.

 Here it is:

 [1172609.665902]  80747dc0 80747dc0 80747dc0 
 80744d80
 [1172609.668768]  80747dc0 81015c3aa918 810091c899b4 
 810091c899a8
 
 That's only partial list.  All the kernel threads - which are most important
 in this context - aren't shown.  You ran out of dmesg buffer, and the most
 interesting entries was at the beginning.  If your /var/log partition is
 working, the stuff should be in /var/log/kern.log or equivalent.  If it's
 not working, there is a way to capture the info still, by stopping syslogd,
 cat'ing /proc/kmsg to some tmpfs file and scp'ing it elsewhere.

or netconsole is actually pretty easy and incredibly useful in this kind of
situation even if there's no disk at all :)
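
A rough sketch (ports, addresses, interface and MAC are all placeholders - see
Documentation/networking/netconsole.txt for the parameter format):

  modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/00:11:22:33:44:55

and on the receiving box something like: netcat -u -l -p 6666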

David



Re: Implementing low level timeouts within MD

2007-11-02 Thread David Greaves
Alberto Alonso wrote:
 On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
 Not in the older kernel versions you were running, no.
 
 These old versions (specially the RHEL) are supposed to be
 the official versions supported by Redhat and the hardware 
 vendors, as they were very specific as to what versions of 
 Linux were supported. Of all people, I would think you would
 appreciate that. Sorry if I sound frustrated and upset, but 
 it is clearly a result of what supported and tested really 
 means in this case. I don't want to go into a discussion of
 commercial distros, which are supported as this is nor the
 time nor the place but I don't want to open the door to the
 excuse of "it's an old kernel" - it wasn't when it got installed.

It may be worth noting that the context of this email is the upstream linux-raid
 list. In my time watching the list it is mainly focused on 'current' code and
development (but hugely supportive of older environments).
In general discussions in this context will have a certain mindset - and it's
not going to be the same as that which you'd find in an enterprise product
support list.

 Outside of the rejected suggestion, I just want to figure out 
 when software raid works and when it doesn't. With SATA, my 
 experience is that it doesn't.

SATA, or more precisely, error handling in SATA has recently been significantly
overhauled by Tejun Heo (IIRC). We're talking post 2.6.18 though (again IIRC) -
so as far as SATA EH goes, older kernels bear no relation to the new ones.

And the initial SATA EH code was, of course, beta :)

David
PS I can't really contribute to your list - I'm only using cheap desktop 
hardware.


Re: Time to deprecate old RAID formats?

2007-10-25 Thread David Greaves
Jeff Garzik wrote:
 Neil Brown wrote:
 As for where the metadata should be placed, it is interesting to
 observe that the SNIA's DDFv1.2 puts it at the end of the device.
 And as DDF is an industry standard sponsored by multiple companies it
 must be ..
 Sorry.  I had intended to say correct, but when it came to it, my
 fingers refused to type that word in that context.

 For the record, I have no intention of deprecating any of the metadata
 formats, not even 0.90.
 
 strongly agreed

I didn't get a reply to my suggestion of separating the data and location...

ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new (and old!) users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location start
or
mdadm --create /dev/md0 --metadata 1.0 --meta-location end

resulting in:
mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0

You provide rational defaults for mortals and this approach allows people like
Doug to do wacky HA things explicitly.

I'm not sure you need any changes to the kernel code - probably just the docs
and mdadm.

 It is conceivable that I could change the default, though that would
 require a decision as to what the new default would be.  I think it
 would have to be 1.0 or it would cause too much confusion.
 
 A newer default would be nice.

I also suspect that a *lot* of people will assume that the highest superblock
version is the best and should be used for new installs etc.

So if you make 1.0 the default then how many users will try 'the bleeding edge'
and use 1.2? So then you have 1.3 which is the same as 1.0? H? So to quote
from an old Soap: Confused, you  will be...

David


Re: deleting mdadm array?

2007-10-25 Thread David Greaves
Janek Kozicki wrote:
 Hello,
 
 I just created a new array /dev/md1 like this:
 
 mdadm --create --verbose /dev/md1 --chunk=64 --level=raid5 \
--metadata=1.1  --bitmap=internal \
--raid-devices=3 /dev/hdc2 /dev/sda2 missing
 
 
 But later I changed my mind, and I wanted to use chunk 128. Do I need
 to delete this array somehow first, or can I just create an array
 again (overwriting the current one)?

How much later? This will, of course, destroy any data on the array (!) and
you'll need to mkfs again...


To answer the question though: just run mdadm again to create a new array with
new parameters.
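
i.e. just re-run your create command with the new chunk size:

  mdadm --create --verbose /dev/md1 --chunk=128 --level=raid5 \
     --metadata=1.1  --bitmap=internal \
     --raid-devices=3 /dev/hdc2 /dev/sda2 missing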


I think the only time you need to 'delete' an array before creating a new one is
if you change the superblock version: since different versions are quietly
written to different disk locations, you may end up with 2 superblocks on the
disk and then you get confusion :)
(I'm not sure if mdadm is clever about this though...)
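
(If you do want to wipe the old superblocks first, something like

  mdadm --zero-superblock /dev/hdc2

run against each member device should do it - but check the man page.)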

Also, if you don't mind me asking: why did you choose version 1.1 for the
metadata/superblock version?

David



Re: Time to deprecate old RAID formats?

2007-10-25 Thread David Greaves
Bill Davidsen wrote:
 Neil Brown wrote:
 I certainly accept that the documentation is probably less that
 perfect (by a large margin).  I am more than happy to accept patches
 or concrete suggestions on how to improve that.  I always think it is
 best if a non-developer writes documentation (and a developer reviews
 it) as then it is more likely to address the issues that a
 non-developer will want to read about, and in a way that will make
 sense to a non-developer. (i.e. I'm to close to the subject to write
 good doco).
 
 Patches against what's in 2.6.4 I assume? I can't promise to write
 anything which pleases even me, but I will take a look at it.
 

The man page is a great place for describing, eg, the superblock location; but
don't forget we have
  http://linux-raid.osdl.org/index.php/Main_Page
which is probably a better place for *discussions* (or essays) about the
superblock location (eg the LVM / v1.1 comment Janek picked up on)

In fact I was going to take some of the writings from this thread and put them
up there.

David


Re: Time to deprecate old RAID formats?

2007-10-24 Thread David Greaves
Doug Ledford wrote:
 On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
 
 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.
 
 This is the heart of the matter.  When you consider that each file
 system and each volume management stack has a superblock, and they some
 store their superblocks at the end of devices and some at the beginning,
 and they can be stacked, then it becomes next to impossible to make sure
 a stacked setup is never recognized incorrectly under any circumstance.

I wonder if we should not really be talking about superblock versions 1.0, 1.1,
1.2 etc but a data format (0.9 vs 1.0) and a location (end,start,offset4k)?

This would certainly make things a lot clearer to new users:

mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k


mdadm --detail /dev/md0

/dev/md0:
Version : 01.0
  Metadata-locn : End-of-device
  Creation Time : Fri Aug  4 23:05:02 2006
 Raid Level : raid0


And there you have the deprecation... only two superblock versions and no real
changes to code etc

David


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-24 Thread David Miller
From: Dan Williams [EMAIL PROTECTED]
Date: Wed, 24 Oct 2007 16:49:28 -0700

 Hopefully it is as painless to run on sparc as it is on IA:
 
 opcontrol --start --vmlinux=/path/to/vmlinux
 wait
 opcontrol --stop
 opreport --image-path=/lib/modules/`uname -r` -l

It is painless, I use it all the time.

The only caveat is to make sure the /path/to/vmlinux is
the pre-stripped kernel image.  The images installed
under /boot/ are usually stripped and thus not suitable
for profiling.


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-22 Thread Louis-David Mitterrand
On Tue, Oct 09, 2007 at 01:48:50PM +0400, Michael Tokarev wrote:
 
 There still is - at least for ext[23].  Even offline resizers
 can't do resizes from any to any size, extfs developers recommend
 to recreate filesystem anyway if size changes significantly.
 I'm too lazy to find a reference now, it has been mentioned here
 on linux-raid at least this year.  It's sorta like fat (yea, that
 ms-dog filesystem) - when you resize it from, say, 501Mb to 999Mb,
 everything is ok, but if you want to go from 501Mb to 1Gb+1, you
 have to recreate almost all data structures because sizes of
 all internal fields changes - and here it's much safer to just
 re-create it from scratch than trying to modify it in place.
 Sure it's much better for extfs, but the point is still the same.

I'll just mention that I once resized a multi-terabyte ext3 filesystem and
it took 8+ hours; a comparable XFS online resize lasted all of 10
seconds!


flaky controller or disk error?

2007-10-22 Thread Louis-David Mitterrand
Hi,

[using kernel 2.6.23 and mdadm 2.6.3+20070929]

I have a rather flaky sata controller with which I am trying to resync a raid5
array. It usually starts failing after 40% of the resync is done. Short of
changing the controller (which I will do later this week), is there a way to
have mdadm resume the resync where it left off at reboot time?

Here is the error I am seeing in the syslog. Can this actually be a disk 
error?

Oct 18 11:54:34 sylla kernel: ata1.00: exception Emask 0x10 SAct 0x0 
SErr 0x1 action 0x2 frozen
Oct 18 11:54:34 sylla kernel: ata1.00: irq_stat 0x0040, PHY RDY 
changed
Oct 18 11:54:34 sylla kernel: ata1.00: cmd 
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 
Oct 18 11:54:34 sylla kernel: res 40/00:00:19:26:33/00:00:3a:00:00/40 
Emask 0x10 (ATA bus error)
Oct 18 11:54:35 sylla kernel: ata1: soft resetting port
Oct 18 11:54:40 sylla kernel: ata1: failed to reset engine (errno=-95)
ata1: port is slow to respond, please be patient (Status 0xd0)
Oct 18 11:54:45 sylla kernel: ata1: softreset failed (device not ready)
Oct 18 11:54:45 sylla kernel: ata1: hard resetting port
Oct 18 11:54:46 sylla kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 
SControl 300)
Oct 18 11:54:46 sylla kernel: ata1.00: configured for UDMA/133
Oct 18 11:54:46 sylla kernel: ata1: EH complete
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] 976773168 512-byte 
hardware sectors (500108 MB)
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write Protect is off
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write cache: enabled, 
read cache: enabled, doesn't support DPO or FUA


Thanks,
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID 5 performance issue.

2007-10-03 Thread David Rees
On 10/3/07, Andrew Clayton [EMAIL PROTECTED] wrote:
 On Wed, 3 Oct 2007 12:43:24 -0400 (EDT), Justin Piszcz wrote:
  Have you checked fragmentation?

 You know, that never even occurred to me. I've gotten into the mind set
 that it's generally not a problem under Linux.

It's probably not the root cause, but certainly doesn't help things.
At least with XFS you have an easy way to defrag the filesystem
without even taking it offline.

 # xfs_db -c frag -f /dev/md0
 actual 1828276, ideal 1708782, fragmentation factor 6.54%

 Good or bad?

Not bad, but not that good, either. Try running xfs_fsr from a nightly
cronjob. By default, it will defrag mounted xfs filesystems for up to
2 hours. Typically this is enough to keep fragmentation well below 1%.
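
Something along these lines in cron should do it - the path and schedule are 
assumptions, adjust to taste:

# /etc/cron.d/xfs_fsr
30 2 * * *   root   /usr/sbin/xfs_fsr -t 7200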

-Dave
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reducing the number of disks a RAID1 expects

2007-09-15 Thread J. David Beutel

Neil Brown wrote:

2.6.12 does support reducing the number of drives in a raid1, but it
will only remove drives from the end of the list. e.g. if the
state was

  58604992 blocks [3/2] [UU_]

then it would work.  But as it is

  58604992 blocks [3/2] [_UU]

it won't.  You could fail the last drive (hdc8) and then add it back
in again.  This would move it to the first slot, but it would cause a
full resync which is a bit of a waste.
  


Thanks for your help!  That's the route I took.  It worked ([2/2] 
[UU]).  The only hiccup was that when I rebooted, hdd2 was back in the 
first slot by itself ([3/1] [U__]).  I guess there was some contention 
in discovery.  But all I had to do was physically remove hdd and the 
remaining two were back to [2/2] [UU].
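
For the archives, the whole cycle was roughly this (hdc8 being the drive that 
gets cycled into the first slot; kernel support for --grow -n is assumed):

mdadm /dev/md5 --fail /dev/hdc8 --remove /dev/hdc8
mdadm /dev/md5 --add /dev/hdc8           # comes back in slot 0, full resync
mdadm --grow /dev/md5 --raid-devices=2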



Since commit 6ea9c07c6c6d1c14d9757dd8470dc4c85bbe9f28 (about
2.6.13-rc4) raid1 will repack the devices to the start of the
list when trying to change the number of devices.
  


I couldn't find a newer kernel RPM for FC3, and I was nervous about 
building a new kernel myself and screwing up my system, so I went the 
slot rotate route instead.  It only took about 20 minutes to resync (a 
lot faster than trying to build a new kernel).


My main concern was that it would discover an unreadable sector while 
resyncing from the last remaining drive and I would lose the whole 
array.  (That didn't happen, though.)  I looked for some mdadm command 
to check the remaining drive before I failed the last one, to help avoid 
that worst case scenario, but couldn't find any.  Is there some way to 
do that, for future reference?


Cheers,
11011011
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID6 mdadm --grow bug?

2007-09-13 Thread David Miller

Neil,

On RHEL5 the kernel is 2.6.18-8.1.8. On Ubuntu 7.04 the kernel is  
2.6.20-16. Someone on the Arstechnica forums wrote they see the same  
thing in Debian etch running kernel 2.6.18. Below is a messages log  
from the RHEL5 system. I have only included the section for creating  
the RAID6, adding a spare and trying to grow it. There is a one line  
error when I do the mdadm --grow command. It is md: couldn't  
update array info. -22.


md: bind<loop1>
md: bind<loop2>
md: bind<loop3>
md: bind<loop4>
md: md0: raid array is not clean -- starting background reconstruction
raid5: device loop4 operational as raid disk 3
raid5: device loop3 operational as raid disk 2
raid5: device loop2 operational as raid disk 1
raid5: device loop1 operational as raid disk 0
raid5: allocated 4204kB for md0
raid5: raid level 6 set md0 active with 4 out of 4 devices, algorithm 2
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, o:1, dev:loop1
 disk 1, o:1, dev:loop2
 disk 2, o:1, dev:loop3
 disk 3, o:1, dev:loop4
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than  
20 KB/sec) for reconstruction.

md: using 128k window, over a total of 102336 blocks.
md: md0: sync done.
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, o:1, dev:loop1
 disk 1, o:1, dev:loop2
 disk 2, o:1, dev:loop3
 disk 3, o:1, dev:loop4
md: bind<loop5>
md: couldn't update array info. -22

David.



On Sep 13, 2007, at 3:52 AM, Neil Brown wrote:


On Wednesday September 12, [EMAIL PROTECTED] wrote:



Problem:

The mdadm --grow command fails when trying to add disk to a RAID6.


..


So far I have replicated this problem on RHEL5 and Ubuntu 7.04
running the latest official updates and patches. I have even tried it
with the most latest version of mdadm 2.6.3 under RHEL5. RHEL5 uses
version 2.5.4.


You don't say what kernel version you are using (as I don't use RHEL5
or Ubuntu, I don't know what 'latest' means).

If it is 2.6.23-rcX, then it is a known problem that should be fixed
in the next -rc.  If it is something else... I need details.

Also, any kernel message (run 'dmesg') might be helpful.

NeilBrown


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RAID6 mdadm --grow bug?

2007-09-12 Thread David Miller



Problem:

The mdadm --grow command fails when trying to add disk to a RAID6.

The man page says it can do this.

GROW MODE
   The GROW mode is used for changing the size or shape of an  
active array.  For this to  work,
   the  kernel must support the necessary change.  Various types  
of growth are being added dur-
   ing 2.6 development, including restructuring a raid5 array to  
have more active devices.


   Currently the only support available is to

   ·   change the size attribute for RAID1, RAID5 and RAID6.

   ·   increase the raid-disks attribute of RAID1, RAID5, and  
RAID6.


   ·   add a write-intent bitmap to any array which supports  
these bitmaps, or remove a  write-

   intent bitmap from such an array.


So far I have replicated this problem on RHEL5 and Ubuntu 7.04  
running the latest official updates and patches. I have even tried it  
with the latest version of mdadm, 2.6.3, under RHEL5. RHEL5 uses  
version 2.5.4.


How to replicate the problem:

You can either use real physical disks or use the loopback device to  
create fake disks.


Here are the steps using the loopback method as root.

cd /tmp
dd if=/dev/zero of=rd1 bs=10240 count=10240
cp rd1 rd2;cp rd1 rd3;cp rd1 rd4;cp rd1 rd5
losetup /dev/loop1 rd1;losetup /dev/loop2 rd2;losetup /dev/loop3  
rd3;losetup /dev/loop4 rd4;losetup /dev/loop5 rd5
mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 /dev/ 
loop1 /dev/loop2 /dev/loop3 /dev/loop4


At this point wait a minute while the raid is being built.

mdadm --add /dev/md0 /dev/loop5
mdadm --grow /dev/md0 --raid-devices=5

You should get the following error

mdadm: Need to backup 384K of critical section..
mdadm: Cannot set device size/shape for /dev/md0: Invalid argument

How to clean up

mdadm --stop /dev/md0
mdadm --remove /dev/md0
losetup -d /dev/loop1;losetup -d /dev/loop2;losetup -d /dev/ 
loop3;losetup -d /dev/loop4;losetup -d /dev/loop5

rm rd1 rd2 rd3 rd4 rd5

David.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reducing the number of disks a RAID1 expects

2007-09-10 Thread J. David Beutel

Richard Scobie wrote:

Have a look at the Grow Mode section of the mdadm man page.


Thanks!  I overlooked that, although I did look at the man page before 
posting.


It looks as though you should just need to use the same command you 
used to grow it to 3 drives, except specify only 2 this time.


I think I hot-added it.  Anyway, --grow looks like what I need, but I'm 
having some difficulty with it.  The man page says, Change the size or 
shape of an active array.  But I got:


[EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2
mdadm: Cannot set device size/shape for /dev/md5: Device or resource busy
[EMAIL PROTECTED] ~]# umount /dev/md5
[EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2
mdadm: Cannot set device size/shape for /dev/md5: Device or resource busy

So I tried stopping it, but got:

[EMAIL PROTECTED] ~]# mdadm --stop /dev/md5
[EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2
mdadm: Cannot get array information for /dev/md5: No such device
[EMAIL PROTECTED] ~]# mdadm --query /dev/md5 --scan
/dev/md5: is an md device which is not active
/dev/md5: is too small to be an md component.
[EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 --scan -n2
mdadm: option s not valid in grow mode

Am I trying the right thing, but running into some limitation of my 
version of mdadm or the kernel?  Or am I overlooking something 
fundamental yet again?  md5 looked like this in /proc/mdstat before I 
stopped it:


md5 : active raid1 hdc8[2] hdg8[1]
 58604992 blocks [3/2] [_UU]

For -n the man page says, This  number can only be changed using --grow 
for RAID1 arrays, and only on kernels which provide necessary support.


Grow mode says, Various types of growth may be added during 2.6  
development, possibly  including  restructuring  a  raid5 array to have 
more active devices. Currently the only support available is to change 
the size attribute for  arrays  with  redundancy,  and  the raid-disks 
attribute of RAID1 arrays.  ...  When  reducing the number of devices in 
a RAID1 array, the slots which are to be removed from the array must 
already be vacant.  That is, the devices that which were in those slots 
must be failed and removed.


I don't know how I overlooked all that the first time, but I can't see 
what I'm overlooking now.


mdadm - v1.6.0 - 4 June 2004
Linux 2.6.12-1.1381_FC3 #1 Fri Oct 21 03:46:55 EDT 2005 i686 athlon i386 
GNU/Linux


Cheers,
11011011
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


reducing the number of disks a RAID1 expects

2007-09-09 Thread J. David Beutel

My /dev/hdd started failing its SMART check, so I removed it from a RAID1:

# mdadm /dev/md5 -f /dev/hdd2 -r /dev/hdd2

Now when I boot it looks like this in /proc/mdstat:

md5 : active raid1 hdc8[2] hdg8[1]
 58604992 blocks [3/2] [_UU]

and I get a DegradedArray event on /dev/md5 email on every boot from 
mdadm monitoring.  I only need 2 disks in md5 now.  How can I stop it 
from being considered degraded?  I added a 3rd disk a while ago just 
because I got a new disk with plenty of space, and little /dev/hdd was 
getting old.


mdadm - v1.6.0 - 4 June 2004
Linux 2.6.12-1.1381_FC3 #1 Fri Oct 21 03:46:55 EDT 2005 i686 athlon i386 
GNU/Linux


Cheers,
11011011

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays: reasoning and future plans

2007-08-27 Thread David Greaves

Dan Williams wrote:

On 8/26/07, Abe Skolnik [EMAIL PROTECTED] wrote:

Because you can rely on the configuration file to be certain about
which disks to pull in and which to ignore.  Without the config file
the auto-detect routine may not always do the right thing because it
will need to make assumptions.

But kernel parameters can provide the same data, no?  After all, it is



Yes, you can get a similar effect to the config file by adding
parameters to the kernel command line.  My only point is that if the
initramfs update tools were as simple as:
mkinitrd root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1
...then using an initramfs becomes the same amount of work as editing
/etc/grub.conf.




I'm not sure if you're aware of this so in the spirit of being helpful I thought 
I'd point out that this has been discussed quite extensively in the past. You 
may want to take a look at the list archives.


It's quite complex and whilst I too would like autodetect to continue to work 
(I've been 'bitten' by the new superblocks not autodetecting) I accept the 
arguments of those wiser than me.


I agree that the problem is one of old dogs and the new (!) initram trick.
Look at it as an opportunity to learn more about a cool capability.

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4 Port eSATA RAID5/JBOD PCI-E 8x Controller

2007-08-21 Thread David Greaves

Richard Scobie wrote:

This looks like a potentially good, cheap candidate for md use.

Although Linux support is not explicitly mentioned, SiI 3124 is used.

http://www.addonics.com/products/host_controller/ADSA3GPX8-4e.asp


Thanks Richard. FWIW I find this kind of info useful.

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWAP file on a RAID-10 array possible?

2007-08-15 Thread David Greaves

Tomas France wrote:

Hi everyone,

I apologize for asking such a fundamental question on the Linux-RAID 
list but the answers I found elsewhere have been contradicting one another.


So, is it possible to have a swap file on a RAID-10 array?

yes.

mkswap /dev/mdX
swapon /dev/mdX
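
And to get it back after a reboot, a line like this in /etc/fstab does the trick 
(the md device name and priority are just examples):

/dev/md3   none   swap   sw,pri=1   0   0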

Should you use RAID-10 for swap? That's philosophy :)

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWAP file on a RAID-10 array possible?

2007-08-15 Thread David Greaves

Tomas France wrote:

Thanks for the answer, David!

you're welcome

By the way, does anyone know if there is a comprehensive how-to on 
software RAID with mdadm available somewhere? I mean a website where I 
could get answers to questions like How to convert your system from no 
RAID to RAID-1, from RAID-1 to RAID-5/10, how to setup LILO/GRUB to boot 
from a RAID-1 array etc. Don't take me wrong, I have done my homework 
and found a lot of info on the topic but a lot of it is several years 
old and many things have changed since then. And it's quite scattered too..


Yes, I got to thinking that. About a year ago I got a copy of the old raid FAQ 
from the authors and have started to modify it when I have time and it bubbles 
up my list.


Try
http://linux-raid.osdl.org/

It's a community wiki - welcome to the community :) Please feel free to edit...
Also feel free to post questions/suggestions here if you're not sure.

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Moving RAID distro

2007-08-15 Thread David Greaves

Richard Grundy wrote:

Hello,

I was just wondering if it's possible to move my RAID5 array to another
distro, same machine just a different flavor of Linux.

Yes.

The only problem will be if it is the root filesystem (unlikely).



Would it just be a case of running:

sudo mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5
/dev/sdb /dev/sdc etc


No
That's kinda dangerous since it overwrites the superblocks (it prompts first). 
If you make the slightest error in, eg, the order of the devices (which may be 
different under a new kernel since modules may load in a different order etc) 
then you'll end up back here asking about recovering from a broken array (which 
you can do but you don't want to.)




Or do I need to run some other command.


yes
You want to use mdadm --assemble

It's quite possible that your distro will include mdadm support scripts that 
will find and assemble the array automatically.
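
If it doesn't, something along these lines usually does it (the config file path 
varies between distros):

mdadm --examine --scan >> /etc/mdadm/mdadm.conf   # record the array definition
mdadm --assemble --scan                           # assemble everything listed there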


David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread David Greaves

Paul Clements wrote:
Well, if people would like to see a timeout option, I actually coded up 
a patch a couple of years ago to do just that, but I never got it into 
mainline because you can do almost as well by doing a check at 
user-level (I basically ping the nbd connection periodically and if it 
fails, I kill -9 the nbd-client).



Yes please.

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread david

On Mon, 13 Aug 2007, David Greaves wrote:


[EMAIL PROTECTED] wrote:

 per the message below MD (or DM) would need to be modified to work
 reasonably well with one of the disk components being over an unreliable
 link (like a network link)

 are the MD/DM maintainers interested in extending their code in this
 direction? or would they prefer to keep it simpler by being able to
 continue to assume that the raid components are connected over a highly
 reliable connection?

 if they are interested in adding (and maintaining) this functionality then
 there is a real possibility that NBD+MD/DM could eliminate the need for
 DRBD. however if they are not interested in adding all the code to deal
 with the network type issues, then the argument that DRBD should not be
 merged because you can do the same thing with MD/DM + NBD is invalid and
 can be dropped/ignored

 David Lang


As a user I'd like to see md/nbd be extended to cope with unreliable links.
I think md could be better in handling link exceptions. My unreliable memory 
recalls sporadic issues with hot-plug leaving md hanging and certain lower 
level errors (or even very high latency) causing unsatisfactory behaviour in 
what is supposed to be a fault 'tolerant' subsystem.



Would this just be relevant to network devices or would it improve support 
for jostled usb and sata hot-plugging I wonder?


good question, I suspect that some of the error handling would be similar 
(for devices that are unreachable not hanging the system for example), but 
a lot of the rest would be different (do you really want to try to 
auto-resync to a drive that you _think_ just reappeared, what if it's a 
different drive? how can you be sure?) the error rate of a network is going 
to be significantly higher than for USB or SATA drives (although I suppose 
iscsi would be similar)


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread David Greaves

[EMAIL PROTECTED] wrote:
Would this just be relevant to network devices or would it improve 
support for jostled usb and sata hot-plugging I wonder?


good question, I suspect that some of the error handling would be 
similar (for devices that are unreachable not hanging the system for 
example), but a lot of the rest would be different (do you really want 
to try to auto-resync to a drive that you _think_ just reappeared,
Well, omit 'think' and the answer may be yes. A lot of systems are quite 
simple and RAID is common on the desktop now. If jostled USB fits into this 
category - then yes.


what 
if it's a different drive? how can you be sure?
And that's the key isn't it. We have the RAID device UUID and the superblock 
info. Isn't that enough? If not then given the work involved an extended 
superblock wouldn't be unreasonable.
And I suspect the capability of devices would need recording in the superblock 
too? eg 'retry-on-fail'
I can see how md would fail a device but might then periodically retry it. If a 
retry shows that it's back then it would validate it (UUID) and then resync it.
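
For what it's worth, the manual version of that last step exists today - with a 
write-intent bitmap a re-added member only resyncs the blocks that changed. The 
names here are just examples:

mdadm /dev/md0 --re-add /dev/sdb1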


) the error rate of a 
network is going to be significantly higher than for USB or SATA drives 
(although I suppose iscsi would be similar)


I do agree - I was looking for value-add for the existing subsystem. If this 
benefits existing RAID users then it's more likely to be attractive.


David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david

On Sun, 12 Aug 2007, Jan Engelhardt wrote:


On Aug 12 2007 13:35, Al Boldi wrote:

Lars Ellenberg wrote:

meanwhile, please, anyone interessted,
the drbd paper for LinuxConf Eu 2007 is finalized.
http://www.drbd.org/fileadmin/drbd/publications/
drbd8.linux-conf.eu.2007.pdf

but it does give a good overview about what DRBD actually is,
what exact problems it tries to solve,
and what developments to expect in the near future.

so you can make up your mind about
 Do we need it?, and
 Why DRBD? Why not NBD + MD-RAID?


I may have made a mistake when asking for how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on a DRBD .vs. GFS2 on a DAS SAN?


GFS is a distributed filesystem, DRBD is a replicated block device. You 
wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc.


DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things 
that I would question about the NBD+MD option


1. when the remote machine is down, how does MD deal with it for reads and 
writes?


2. MD over local drive will alternate reads between mirrors (or so I've 
been told), doing so over the network is wrong.


3. when writing, will MD wait for the network I/O to get the data saved on 
the backup before returning from the syscall? or can it sync the data out 
lazily



Now, shared remote block access should theoretically be handled, as DRBD
does, by a block layer driver, but realistically it may be more appropriate
to let it be handled by the combining end user, like OCFS or GFS.


there are times when you want to replicate at the block layer, and there 
are times when you want to have a filesystem do the work. don't force a 
filesystem on use-cases where a block device is the right answer.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david
per the message below MD (or DM) would need to be modified to work 
reasonably well with one of the disk components being over an unreliable 
link (like a network link)


are the MD/DM maintainers interested in extending their code in this 
direction? or would they prefer to keep it simpler by being able to 
continue to assume that the raid components are connected over a highly 
reliable connection?


if they are interested in adding (and maintaining) this functionality then 
there is a real possibility that NBD+MD/DM could eliminate the need for 
DRBD. however if they are not interested in adding all the code to deal 
with the network type issues, then the argument that DRBD should not be 
merged because you can do the same thing with MD/DM + NBD is invalid and 
can be dropped/ignored


David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:


Iustin Pop wrote:

 On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
  On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
    now, I am not an expert on either option, but there are a couple of 
    things that I

    would question about the NBD+MD option
  
   1. when the remote machine is down, how does MD deal with it for reads 
   and

   writes?
  I suppose it kicks the drive and you'd have to re-add it by hand unless 
  done by

  a cronjob.


Yes, and with a bitmap configured on the raid1, you just resync the blocks 
that have been written while the connection was down.




From my tests, since NBD doesn't have a timeout option, MD hangs in the
 write to that mirror indefinitely, somewhat like when dealing with a
 broken IDE driver/chipset/disk.


Well, if people would like to see a timeout option, I actually coded up a 
patch a couple of years ago to do just that, but I never got it into mainline 
because you can do almost as well by doing a check at user-level (I basically 
ping the nbd connection periodically and if it fails, I kill -9 the 
nbd-client).



   2. MD over local drive will alternate reads between mirrors (or so 
   I've been

   told), doing so over the network is wrong.
  Certainly. In which case you set write_mostly (or even write_only, not 
  sure

  of its name) on the raid component that is nbd.
 
   3. when writing, will MD wait for the network I/O to get the data 
   saved on the
   backup before returning from the syscall? or can it sync the data out 
   lazily

  Can't answer this one - ask Neil :)

 MD has the write-mostly/write-behind options - which help in this case
 but only up to a certain amount.


You can configure write_behind (aka, asynchronous writes) to buffer as much 
data as you have RAM to hold. At a certain point, presumably, you'd want to 
just break the mirror and take the hit of doing a resync once your network 
leg falls too far behind.
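
Just to make that concrete, a raid1 with a local disk plus an nbd mirror marked 
write-mostly, an internal bitmap and write-behind would look something like this 
(device names and the 256-request figure are arbitrary examples):

mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 \
      /dev/sda1 --write-mostly /dev/nbd0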


--
Paul


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid array is not automatically detected.

2007-07-18 Thread David Greaves

dean gaudet wrote:


On Mon, 16 Jul 2007, David Greaves wrote:


Bryan Christ wrote:

I do have the type set to 0xfd.  Others have said that auto-assemble only
works on RAID 0 and 1, but just as Justin mentioned, I too have another box
with RAID5 that gets auto assembled by the kernel (also no initrd).  I
expected the same behavior when I built this array--again using mdadm
instead of raidtools.

Any md arrays with partition type 0xfd using a 0.9 superblock should be
auto-assembled by a standard kernel.


no... debian (and probably ubuntu) do not build md into the kernel, they 
build it as a module, and the module does not auto-detect 0xfd.  i don't 
know anything about slackware, but i just felt it worth commenting that a 
standard kernel is not really descriptive enough.


Good point - I should have mentioned the non-module bit!

http://linux-raid.osdl.org/index.php/Autodetect

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible data corruption sata_sil24?

2007-07-18 Thread David Shaw
On Wed, Jul 18, 2007 at 05:53:39PM +0900, Tejun Heo wrote:
 David Shaw wrote:
  It fails whether I use a raw /dev/sdd or partition it into one large
  /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
  work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
  corruption.
  H Can you reproduce the corruption by accessing both devices
  simultaneously without using dm?  Considering ich5 does fine, it looks
  like hardware and/or driver problem and I really wanna rule out dm.
  
  I think I wasn't clear enough before.  The corruption happens when I
  use dm to create two dm mappings that both reside on the same real
  device.  Using two different devices, or two different partitions on
  the same physical device works properly.  ich5 does fine with these 3
  tests, but sata_sil24 fails:
  
   * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
 dm devices == corruption
  
   * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
 those partitions == no corruption
  
   * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
 mappings on /dev/sdd1, mke2fs and use those dm devices ==
 corruption
 
 I'm not sure whether this is problem of sata_sil24 or dm layer.  Cc'ing
 linux-raid for help.  How much memory do you have?  One big difference
 between ata_piix and sata_sil24 is that sil24 can handle 64bit DMA.
 Maybe dma mapping or something interacts weirdly with dm there?

The machine has 640 megs of RAM.  FWIW, I tried this with 512 megs of
RAM with the same results.  Running Memtest86+ shows the memory is
good.

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid array is not automatically detected.

2007-07-18 Thread David Greaves

Bryan Christ wrote:

I'm now very confused...

It's all that top-posting...


When I run mdadm --examine /dev/md0 I get the error message:  No 
superblock detected on /dev/md0


However, when I run mdadm -D /dev/md0 the report clearly states 
Superblock is persistent



David Greaves wrote:

* are the superblocks version 0.9? (mdadm --examine /dev/component)


See where it says 'component' ? :)

I wish mdadm --detail and --examine were just aliases and the output varied 
according to whether you looked at a component (eg /dev/sda1) or an md device 
(/dev/md0)


I get that wrong *all* the time...

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3ware 9650 tips

2007-07-16 Thread David Chinner
On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote:
 On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
  On Fri, 13 Jul 2007, Jon Collette wrote:
  
  Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance?
http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% 
  drop according to this article
  
  His 500GB WD drives are 7200RPM compared to the Raptors 10K.  So his 
  numbers will be slower. 
  Justin what file system do you have running on the Raptors?  I think thats 
  an interesting point made by Joshua.
  
  I use XFS:
 
 When it comes to bandwidth, there is good reason for that.
 
  Trying to stick with a supported config as much as possible, I need to 
  run ext3.  As per usual, though, initial ext3 numbers are less than 
  impressive. Using bonnie++ to get a baseline, I get (after doing 
  'blockdev --setra 65536' on the device):
  Write: 136MB/s
  Read:  384MB/s
  
  Proving it's not the hardware, with XFS the numbers look like:
  Write: 333MB/s
  Read:  465MB/s
  
 
 Those are pretty typical numbers. In my experience, ext3 is limited to about
 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g.
 on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was 
 doing
 250MB/s.
 
 http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
 
 If you've got any sort of serious disk array, ext3 is not the filesystem
 to use

To show what the difference is, I used blktrace and Chris Mason's
seekwatcher script on a simple, single threaded dd command on
a 12 disk dm RAID0 stripe:

# dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync

http://oss.sgi.com/~dgc/writes/ext3_write.png
http://oss.sgi.com/~dgc/writes/xfs_write.png

You can see from the ext3 graph that it comes to a screeching halt
every 5s (probably when pdflush runs) and at all other times the
seek rate is 10,000 seeks/s. That's pretty bad for a brand new,
empty filesystem and the only way it is sustained is the fact that
the disks have their write caches turned on. ext4 will probably show
better results, but I haven't got any of the tools installed to be
able to test it

The XFS pattern shows consistently an order of magnitude less seeks
and consistent throughput above 600MB/s. To put the number of seeks
in context, XFS is doing 512k I/Os at about 1200-1300 per second. The
number of seeks? A bit above 10^3 per second or roughly 1 seek per
I/O which is pretty much optimal.

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid array is not automatically detected.

2007-07-16 Thread David Greaves

Bryan Christ wrote:
I do have the type set to 0xfd.  Others have said that auto-assemble 
only works on RAID 0 and 1, but just as Justin mentioned, I too have 
another box with RAID5 that gets auto assembled by the kernel (also no 
initrd).  I expected the same behavior when I built this array--again 
using mdadm instead of raidtools.


Any md arrays with partition type 0xfd using a 0.9 superblock should be 
auto-assembled by a standard kernel.


If you want to boot from them you must ensure the kernel image is on a partition 
that the bootloader can read - ie RAID 1. This is nothing to do with auto-assembly.


So some questions:
* are the partitions 0xfd ? yes.
* is the kernel standard?
* are the superblocks version 0.9? (mdadm --examine /dev/component)
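
A quick way to check the last two (the config file path and device name are 
examples):

mdadm --examine /dev/sda1 | grep Version          # 00.90.xx is the autodetect-friendly one
grep CONFIG_BLK_DEV_MD /boot/config-$(uname -r)   # =y means md is built in, =m means module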

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3ware 9650 tips

2007-07-16 Thread David Chinner
On Mon, Jul 16, 2007 at 10:50:34AM -0500, Eric Sandeen wrote:
 David Chinner wrote:
  On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote:
  On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
 ...
  If you've got any sort of serious disk array, ext3 is not the filesystem
  to use
  
  To show what the difference is, I used blktrace and Chris Mason's
  seekwatcher script on a simple, single threaded dd command on
  a 12 disk dm RAID0 stripe:
  
  # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync
  
  http://oss.sgi.com/~dgc/writes/ext3_write.png
  http://oss.sgi.com/~dgc/writes/xfs_write.png
 
 Were those all with default mkfs  mount options?  ext3 in writeback
 mode might be an interesting comparison too.

Defaults. i.e.

# mkfs.ext3 /dev/mapper/dm0

# mkfs.xfs /dev/mapper/dm0

The mkfs.xfs picked up sunit/swidth correctly from the dm volume.

Last time I checked, writeback made little difference to ext3 throughput;
maybe 5-10% at most. I'll run it again later today...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3ware 9650 tips

2007-07-15 Thread David Chinner
On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
 On Fri, 13 Jul 2007, Jon Collette wrote:
 
 Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance?
   http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% 
 drop according to this article
 
 His 500GB WD drives are 7200RPM compared to the Raptors 10K.  So his 
 numbers will be slower. 
 Justin what file system do you have running on the Raptors?  I think thats 
 an interesting point made by Joshua.
 
 I use XFS:

When it comes to bandwidth, there is good reason for that.

 Trying to stick with a supported config as much as possible, I need to 
 run ext3.  As per usual, though, initial ext3 numbers are less than 
 impressive. Using bonnie++ to get a baseline, I get (after doing 
 'blockdev --setra 65536' on the device):
 Write: 136MB/s
 Read:  384MB/s
 
 Proving it's not the hardware, with XFS the numbers look like:
 Write: 333MB/s
 Read:  465MB/s
 

Those are pretty typical numbers. In my experience, ext3 is limited to about
250MB/s buffered write speed. It's not disk limited, it's design limited. e.g.
on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing
250MB/s.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

If you've got any sort of serious disk array, ext3 is not the filesystem
to use

 How many folks are using these?  Any tuning tips?

Make sure you tell XFS the correct sunit/swidth. For hardware
raid5/6, sunit = per-disk chunksize, swidth = number of *data* disks in
array.
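
For example (the geometry here is made up - a 64k per-disk chunk across an 
8-drive RAID6, i.e. 6 data disks):

mkfs.xfs -d su=64k,sw=6 /dev/sdb1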

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm create to existing raid5

2007-07-13 Thread David Greaves

Guy Watkins wrote:

} [EMAIL PROTECTED] On Behalf Of Jon Collette
} I wasn't thinking and did a mdadm --create to my existing raid5 instead
} of --assemble.  The syncing process ran and now its not mountable.  Is
} there anyway to recover from this?
Maybe.  Not really sure.  But don't do anything until someone that really
knows answers!

I agree - Yes, maybe.



What I think...
If you did a create with the exact same parameters the data should not have
changed.  But you can't mount so you must have used different parameters.

I'd agree.



Only 1 disk was written to during the create.

Yep.


 Only that disk was changed.

Yep.


If you remove the 1 disk and do another create with the original parameters
and put missing for the 1 disk your array will be back to normal, but
degraded.  Once you confirm this you can add back the 1 disk.

Yep.
**WARNING**
**WARNING**
**WARNING**
At this point you are relatively safe (!) but as soon as you do an 'add' and 
initiate another resync then if you got it wrong you will have toasted your data 
completely!!!

**WARNING**
**WARNING**
**WARNING**


 You must be
able to determine which disk was written to.  I don't know how to do that
unless you have the output from mdadm -D during the create/syncing.


Do you know the *exact* command you issued when you did the initial --create?
Do you know the *exact* command you issued when you did the bogus --create?

And what version of mdadm you are using?

Neil said that it's mdadm, not the kernel, that determines which device is 
initially degraded during a create. We can look at the code and your command 
line and guess which device mdadm chose. (Getting this wrong won't matter, but 
getting it right may make recovery quicker.)


assuming you have a 4 device raid using /dev/sda1, /dev/sdb1, /dev/sdc1, 
/dev/sdd1

you'll then do something like:
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 
/dev/sdc1 missing

try a mount
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 missing 
/dev/sdc1 /dev/sdb1

try a mount
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sda1 
/dev/sdc1 missing

try a mount
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdc1 /dev/sdb1 
/dev/sda1 missing

try a mount

etc etc,

So you'll still need to do a trial and error assemble
For a simple 4 device array there are 24 permutations - doable by hand; if you 
have 5 devices then it's 120, 6 is 720 - getting tricky ;)


I'm bored so I'm going to write a script based on something like this:
http://www.unix.org.ua/orelly/perl/cookbook/ch04_20.htm

Feel free to beat me to it ...

The critical thing is that you *must* use 'missing' when doing these trial 
--create calls.
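
A rough shell sketch of that brute-force loop - device names, array geometry and 
mount point are assumptions, and it only ever creates degraded arrays thanks to 
the 'missing' check:

#!/bin/sh
# Try every ordering of three of the four devices plus 'missing' in a
# 4-slot raid5 and report the first ordering that yields a mountable fs.
DEVS="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1"
mkdir -p /mnt/test
for a in $DEVS missing; do
 for b in $DEVS missing; do
  for c in $DEVS missing; do
   for d in $DEVS missing; do
    list="$a $b $c $d"
    # skip orderings that use the same device twice
    [ -n "$(echo $list | tr ' ' '\n' | sort | uniq -d)" ] && continue
    # insist on exactly one 'missing' so the array is always degraded
    echo $list | grep -qw missing || continue
    echo "Trying: $list"
    mdadm --stop /dev/md0 2>/dev/null
    mdadm --create /dev/md0 --level=5 --raid-devices=4 --run $list || continue
    if mount -o ro /dev/md0 /mnt/test 2>/dev/null; then
      echo "Order found: $list"
      umount /mnt/test
      exit 0
    fi
   done
  done
 done
done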


If we've not explained something very well and you don't understand then please 
ask before trying it out...


David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm create to existing raid5

2007-07-13 Thread David Greaves

David Greaves wrote:
For a simple 4 device array there are 24 permutations - doable by 
hand, if you have 5 devices then it's 120, 6 is 720 - getting tricky ;)


Oh, wait, for 4 devices there are 24 permutations - and you need to do it 4 
times, substituting 'missing' for each device - so 96 trials.


4320 trials for a 6 device array.

Hmm. I've got a 7 device raid 6 - I think I'll go and make a note of how it's put 
together... <grin>



Have a look at this section and the linked script.
I can't test it until later

http://linux-raid.osdl.org/index.php/RAID_Recovery

http://linux-raid.osdl.org/index.php/Permute_array.pl


David


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposed enhancement to mdadm: Allow --write-behind= to be done in grow mode.

2007-07-03 Thread David Greaves

Ian Dall wrote:

There doesn't seem to be any designated place to send bug reports and
feature requests to mdadm, so I hope I am doing the right thing by
sending it here.

I have a small patch to mdadm which allows the write-behind amount to be
set at array grow time (instead of currently only at build or create
time). I have tested this fairly extensively on some arrays built out of
loopback devices, and once on a real live array. I haven't lost any data
and it seems to work OK, though it is possible I am missing something.


Sounds like a useful feature...

Did you test the bitmap cases you mentioned?

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-29 Thread David Greaves

David Chinner wrote:

On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote:

There are two solutions possible, IMO.  One would be to make these workqueues
freezable, which is possible, but hacky and Oleg didn't like that very much.
The second would be to freeze XFS from within the hibernation code path,
using freeze_bdev().


The second is much more likely to work reliably. If freezing the
filesystem leaves something in an inconsistent state, then it's
something I can reproduce and debug without needing to
suspend/resume.

FWIW, don't forget you need to thaw the filesystem on resume.


I've been a little distracted recently - sorry. I'll re-read the thread and see 
if there are any test actions I need to complete.


I do know that the corruption problems I've been having:
a) only happen after hibernate/resume
b) only ever happen on one of 2 XFS filesystems
c) happen even when the script does xfs_freeze;sync;hibernate;xfs_thaw

What happens if a filesystem is frozen and I hibernate?
Will it be thawed when I resume?

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread David Greaves

David Chinner wrote:

On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote:

For drives with 16MB of cache (in this case, raptors).


That's four (4) drives, right?


I'm pretty sure he's using 10 - email a few days back...

Justin Piszcz wrote:
Running test with 10 RAPTOR 150 hard drives, expect it to take 
awhile until I get the results, avg them etc. :)



If so, how do you get a block read rate of 578MB/s from
4 drives? That's 145MB/s per drive


Which gives a far more reasonable 60MB/s per drive...

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm usage: creating arrays with helpful names?

2007-06-28 Thread David Greaves
(back on list for google's benefit ;) and because there are some good questions 
and I don't know all the answers... )


Oh, and Neil 'cos there may be a bug ...

Richard Michael wrote:

On Wed, Jun 27, 2007 at 08:49:22AM +0100, David Greaves wrote:

http://linux-raid.osdl.org/index.php/Partitionable



Thanks.  I didn't know this site existed (Googling even just 'mdadm'
doesn't yield it in the first 100 results), and it's helpful.
Good ... I got permission to wikify the 'official' linux raid FAQ but it takes 
time (and motivation!) to update it :)

Hopefully it will snowball as people who use it then contribute back <hint> ;)

As it becomes more valuable to people then more links will be created and Google 
will notice...




What if I don't want a partitioned array?  I simply want the name to be
nicer than the /dev/mdX or /dev/md/XX style.  (p1 still gives me
/dev/nicename /dev/nicename0, as your page indicates.)

--auto md

mdadm --create /dev/strawberry --auto md ...
[EMAIL PROTECTED]:/tmp # mdadm --detail /dev/strawberry
/dev/strawberry:
Version : 00.90.03
  Creation Time : Thu Jun 28 08:25:06 2007
 Raid Level : raid4





Also, when I use --create /dev/nicename --auto=p1 (for example), I
also see /dev/md_d126 created.  Why?  There is then a /sys/block/md_d126
entry (presumably created by the md driver), but no /sys/block/nicename
entry.  Why?

Not sure who creates this, mdadm or udev
The code isn't that hard to read and you sound like you'd follow it if you 
fancied a skim-read...


I too would expect that there should be a /sys/block/nicename - is this a bug 
Neil?

These options don't see a lot of use - I recently came across a bug in the 
--auto pX option...



Finally --stop /dev/nicename doesn't remove any of the aforementioned
/dev or /sys entries.  I don't suppose that it should, but an mdadm
command to do this would be helpful.  So, how do I remove the oddly
named /sys entries? (I removed the /dev entries with rm.)  man mdadm
indicates --stop releases all resources, but it doesn't (and probably
shouldn't).

rm !

'--stop' with mdadm  does release the 'resources', ie the components you used. 
It doesn't remove the array. There is no delete - I guess since an rm is just as 
effective unless you use a nicename...



[I think there should be a symmetry to the mdadm options
--create/--delete and --start/--stop.  It's *convenient* --create
also starts the array, but this conflates the issue a bit..]

I want to stop and completely remove all trace of the array.
(Especially as I'm experimenting with this over loopback, and stuff
hanging around irritates the lo driver.)

You're possibly mixing two things up here...

Releasing the resources with a --stop would let you re-use a lo device in 
another array. You don't _need_ --delete (or rm).
However md does write superblocks to the components and *mdadm* warns you that 
the loopback has a valid superblock..


mdadm: /dev/loop1 appears to be part of a raid array:
level=raid4 devices=6 ctime=Thu Jun 21 09:46:27 2007

[hmm, I can see why you may think it's part of an 'active' array]

You could do mdadm --zero-superblock to clean the component or just say yes 
when mdadm asks you to continue.


see:
# mdadm --create /dev/strawberry --auto md --level=4 -n 6 /dev/loop1 /dev/loop2 
/dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6

mdadm: /dev/loop1 appears to be part of a raid array:
level=raid4 devices=6 ctime=Thu Jun 28 08:25:06 2007
blah
Continue creating array? yes
mdadm: array /dev/strawberry started.

# mdadm --stop /dev/strawberry
mdadm: stopped /dev/strawberry

# mdadm --create /dev/strawberry --auto md --level=4 -n 6 /dev/loop1 /dev/loop2 
/dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6

mdadm: /dev/loop1 appears to be part of a raid array:
level=raid4 devices=6 ctime=Thu Jun 28 09:07:29 2007
blah
Continue creating array? yes
mdadm: array /dev/strawberry started.

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread David Chinner
On Thu, Jun 28, 2007 at 04:27:15AM -0400, Justin Piszcz wrote:
 
 
 On Thu, 28 Jun 2007, Peter Rabbitson wrote:
 
 Justin Piszcz wrote:
 mdadm --create \
   --verbose /dev/md3 \
   --level=5 \
   --raid-devices=10 \
   --chunk=1024 \
   --force \
   --run
   /dev/sd[cdefghijkl]1
 
 Justin.
 
 Interesting, I came up with the same results (1M chunk being superior) 
 with a completely different raid set with XFS on top:
 
 mdadm--create \
  --level=10 \
  --chunk=1024 \
  --raid-devices=4 \
  --layout=f3 \
  ...
 
 Could it be attributed to XFS itself?

More likely it's related to the I/O size being sent to the disks. The larger
the chunk size, the larger the I/O hitting each disk. I think the maximum I/O
size is 512k ATM on x86(_64), so a chunk of 1MB will guarantee that there are
maximally sized I/Os being sent to the disks
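
You can see what the block layer will actually allow per request with something 
like this (the device name is an example):

cat /sys/block/sda/queue/max_hw_sectors_kb   # hardware limit
cat /sys/block/sda/queue/max_sectors_kb      # current per-request limit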

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-28 Thread David Chinner
On Wed, Jun 27, 2007 at 08:49:24PM +, Pavel Machek wrote:
 Hi!
 
  FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS
  filesystem for a suspend/resume to work safely and have argued that the only
 
 Hmm, so XFS writes to disk even when its threads are frozen?

They issue async I/O before they sleep and expect
processing to be done on I/O completion via workqueues.

  safe thing to do is freeze the filesystem before suspend and thaw it after
  resume. This is why I originally asked you to test that with the other 
  problem
 
 Could you add that to the XFS threads if it is really required? They
 do know that they are being frozen for suspend.

We don't suspend the threads on a filesystem freeze - they continue to
run. A filesystem freeze guarantees the filesystem is clean and that
the in-memory state matches what is on disk. It is not possible for
the filesystem to issue I/O or have outstanding I/O when it is in the
frozen state, so the state of the threads and/or workqueues does not
matter because they will be idle.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-28 Thread David Chinner
On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote:
 There are two solutions possible, IMO.  One would be to make these workqueues
 freezable, which is possible, but hacky and Oleg didn't like that very much.
 The second would be to freeze XFS from within the hibernation code path,
 using freeze_bdev().

The second is much more likely to work reliably. If freezing the
filesystem leaves something in an inconsistent state, then it's
something I can reproduce and debug without needing to
suspend/resume.

FWIW, don't forget you need to thaw the filesystem on resume.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm usage: creating arrays with helpful names?

2007-06-27 Thread David Greaves

Richard Michael wrote:

How do I create an array with a helpful name? i.e. /dev/md/storage?

The mdadm man page hints at this in the discussion of the --auto option
in the ASSEMBLE MODE section, but doesn't clearly indicate how it's done.

Must I create the device nodes by hand first using MAKEDEV?



Does this help?

http://linux-raid.osdl.org/index.php/Partitionable

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-27 Thread David Chinner
On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote:
 For drives with 16MB of cache (in this case, raptors).

That's four (4) drives, right?

If so, how do you get a block read rate of 578MB/s from
4 drives? That's 145MB/s per drive

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - and 
then consider the impact and exceptions when you do, eg hardware 
acceleration? md information fed up to the fs layer for xfs? simple long term 
maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to just say no :)


In this case I think a higher level system knowing what block sizes are 
efficient for writes/reads can potentially be a HUGE 
advantage.


if the upper levels know that you have a 6 disk raid 6 array with a 64K 
chunk size then reads and writes in 256k chunks (aligned) should be able 
to be done at basically the speed of a 4 disk raid 0 array.


what's even more impressive is that this could be done even if the array 
is degraded (if you know which drives have failed you don't even try to read 
from them and you only have to reconstruct the missing info once per 
stripe)


the current approach doesn't give the upper levels any chance to operate 
in this mode, they just don't have enough information to do so.


the part about wanting to know raid 0 chunk size so that the upper layers 
can be sure that data that's supposed to be redundant is on separate 
drives is also possible


storage technology is headed in the direction of having the system do more 
and more of the layout decisions, and re-stripe the array as conditions 
change (similar to what md can already do with enlarging raid5/6 arrays) 
but unless you want to eventually put all that decision logic into the md 
layer you should make it possible for other layers to make queries to find 
out what's what and then they can give directions for what they want to 
have happen.


so for several reasons I don't see this as something that's deserving of 
an automatic 'no'


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-22 Thread David Greaves

Bill Davidsen wrote:

David Greaves wrote:

[EMAIL PROTECTED] wrote:

On Fri, 22 Jun 2007, David Greaves wrote:
If you end up 'fiddling' in md because someone specified 
--assume-clean on a raid5 [in this case just to save a few minutes 
*testing time* on system with a heavily choked bus!] then that adds 
*even more* complexity and exception cases into all the stuff you 
described.


A few minutes? Are you reading the times people are seeing with 
multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB... three days. 

Yes. But we are talking initial creation here.

And as soon as you believe that the array is actually usable you cut 
that rebuild rate, perhaps in half, and get dog-slow performance from 
the array. It's usable in the sense that reads and writes work, but for 
useful work it's pretty painful. You either fail to understand the 
magnitude of the problem or wish to trivialize it for some reason.

I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking about things rather than jumping in to 
say oh, we can code up a clever algorithm that keeps track of which stripes 
have valid parity and which don't, optimise the read/copy/write for valid 
stripes, use the raid6-type read-all/write-all for invalid stripes, and then 
write a bit extra in the check code to set the bitmaps...


Phew - and that lets us run the array at semi-degraded performance (raid6-like) 
for 3 days rather than either waiting before we put it into production or 
running it very slowly.

Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT?

What happens in those 3 years when we have a disk fail? The solution doesn't 
apply then - it's 3 days to rebuild - like it or not.


By delaying parity computation until the first write to a stripe only 
the growth of a filesystem is slowed, and all data are protected without 
waiting for the lengthy check. The rebuild speed can be set very low, 
because on-demand rebuild will do most of the work.

I am not saying you are wrong.
I ask merely if the balance of benefit outweighs the balance of complexity.

If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs 
- very useful indeed.


I'm very much for the fs layer reading the lower block structure so I 
don't have to fiddle with arcane tuning parameters - yes, *please* 
help make xfs self-tuning!


Keeping life as straightforward as possible low down makes the upwards 
interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple 
because it rests on a simple device, even if the simple device is 
provided by LVM or md. And LVM and md can stay simple because they rest 
on simple devices, even if they are provided by PATA, SATA, nbd, etc. 
Independent layers make each layer more robust. If you want to 
compromise the layer separation, some approach like ZFS with full 
integration would seem to be promising. Note that layers allow 
specialized features at each point, trading integration for flexibility.


That's a simplistic summary.
You *can* loosely couple the layers. But you can enrich the interface and 
tightly couple them too - XFS is capable (I guess) of understanding md more 
fully than say ext2.
XFS would still work on a less 'talkative' block device where performance wasn't 
as important (USB flash maybe, dunno).



My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you 
need to handle changes in those details, which would seem to make layers 
more complex.

Agreed.

What I'm looking for here is better performance in one 
particular layer, the md RAID5 layer. I like to avoid unnecessary 
complexity, but I feel that the current performance suggests room for 
improvement.


I agree there is room for improvement.
I suggest that it may be more fruitful to write a tool called raid5prepare 
that writes zeroes/ones as appropriate to all the component devices; then you 
can use --assume-clean without concern. It could look to see if the devices are 
scsi or whatever and take advantage of the hyperfast block writes that can be done.
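
(the by-hand equivalent of that idea, as a rough sketch only - no such tool 
exists yet, the device and array names are made up, and zeroing obviously 
destroys anything on the components:

  # zero every component so the parity (XOR of zeroes) is already consistent
  for d in /dev/sd[b-e]1; do dd if=/dev/zero of=$d bs=1M; done
  # the stripes really are clean now, so skipping the initial resync is safe
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
        --assume-clean /dev/sd[b-e]1
)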


David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, Bill Davidsen wrote:

By delaying parity computation until the first write to a stripe only the 
growth of a filesystem is slowed, and all data are protected without waiting 
for the lengthy check. The rebuild speed can be set very low, because 
on-demand rebuild will do most of the work.


 I'm very much for the fs layer reading the lower block structure so I
 don't have to fiddle with arcane tuning parameters - yes, *please* help
 make xfs self-tuning!

 Keeping life as straightforward as possible low down makes the upwards
 interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple because it 
rests on a simple device, even if the simple device is provided by LVM or 
md. And LVM and md can stay simple because they rest on simple devices, even 
if they are provided by PATA, SATA, nbd, etc. Independent layers make each 
layer more robust. If you want to compromise the layer separation, some 
approach like ZFS with full integration would seem to be promising. Note that 
layers allow specialized features at each point, trading integration for 
flexibility.


My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you need to 
handle changes in those details, which would seem to make layers more 
complex. What I'm looking for here is better performance in one particular 
layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel 
that the current performance suggests room for improvement.


they both have benefits, but it shouldn't have to be either-or

if you build the separate layers and provide ways for the upper 
layers to query the lower layers to find what's efficient, then you can 
have some upper layers that don't care about this and treat the lower 
layer as a simple block device, while other upper layers find out what 
sort of things are more efficient to do and use the same lower layer in a 
more complex manner


the alternative is to duplicate effort (and code) and have two codebases 
that try to do the same thing, one stand-alone and one as part of an 
integrated solution (and it gets even worse if there end up being multiple 
integrated solutions)


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-21 Thread David Greaves

Neil Brown wrote:


This isn't quite right.

Thanks :)


Firstly, it is mdadm which decided to make one drive a 'spare' for
raid5, not the kernel.
Secondly, it only applies to raid5, not raid6 or raid1 or raid10.

For raid6, the initial resync (just like the resync after an unclean
shutdown) reads all the data blocks, and writes all the P and Q
blocks.
raid5 can do that, but it is faster to read all but one disk, and
write to that one disk.


How about this:

Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity 
is what's called the initial resync.


Raid level 0 doesn't have any redundancy so there is no initial resync.

For raid levels 1,4,6 and 10 mdadm creates the array and starts a resync. The 
raid algorithm then reads the data blocks and writes the appropriate 
parity/mirror (P+Q) blocks across all the relevant disks. There is some sample 
output in a section below...


For raid5 there is an optimisation: mdadm takes one of the disks and marks it as 
'spare'; it then creates the array in degraded mode. The kernel marks the spare 
disk as 'rebuilding', reads from the 'good' disks, calculates the parity to 
determine what should be on the spare disk, and then just writes it out.


Once all this is done the array is clean and all disks are active.

This can take quite a time and the array is not fully resilient whilst this is 
happening (it is however fully usable).
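
(For example - an illustrative command sequence only, not part of the proposed 
wiki text; device names are made up and no sample output is reproduced here:

  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]1
  cat /proc/mdstat    # one disk shows as spare/rebuilding until the initial
                      # build finishes, then all disks become active
)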






Also is raid4 like raid5 or raid6 in this respect?
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 recover after a 2 disk failure

2007-06-19 Thread David Greaves

Frank Jenkins wrote:

So here's the /proc/mdstat prior to the array failure:


I'll take a look through this and see if I can see any problems, Frank. Bit busy 
now - give me a few minutes.


David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Greaves

David Greaves wrote:

I'm going to have to do some more testing...

done



David Chinner wrote:

On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:

David Greaves wrote:
So doing:
xfs_freeze -f /scratch
sync
echo platform  /sys/power/disk
echo disk  /sys/power/state
# resume
xfs_freeze -u /scratch

Works (for now - more usage testing tonight)


Verrry interesting.

Good :)

Now, not so good :)



What you were seeing was an XFS shutdown occurring because the free space
btree was corrupted. IOWs, the process of suspend/resume has resulted
in either bad data being written to disk, the correct data not being
written to disk or the cached block being corrupted in memory.

That's the kind of thing I was suspecting, yes.

If you run xfs_check on the filesystem after it has shut down after a 
resume,
can you tell us if it reports on-disk corruption? Note: do not run 
xfs_repair
to check this - it does not check the free space btrees; instead it 
simply
rebuilds them from scratch. If xfs_check reports an error, then run 
xfs_repair

to fix it up.

OK, I can try this tonight...



This is on 2.6.22-rc5

So I hibernated last night and resumed this morning.
Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry Dave)

Here are some photos of the screen during resume. This is not 100% reproducible 
- it seems to occur only if the system has been shut down for 30mins or so.


Tejun, I wonder if error handling during resume is problematic? I got the same 
errors in 2.6.21. I have never seen these (or any other libata) errors other 
than during resume.


http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
(hard to read, here's one from 2.6.21
http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg

I _think_ I've only seen the xfs problem when a resume shows these errors.


Ok, to try and cause a problem I ran a make and got this back at once:
make: stat: Makefile: Input/output error
make: stat: clean: Input/output error
make: *** No rule to make target `clean'.  Stop.
make: stat: GNUmakefile: Input/output error
make: stat: makefile: Input/output error


I caught the first dmesg this time:

Filesystem dm-0: XFS internal error xfs_btree_check_sblock at line 334 of file 
fs/xfs/xfs_btree.c.  Caller 0xc01b58e1

 [c0104f6a] show_trace_log_lvl+0x1a/0x30
 [c0105c52] show_trace+0x12/0x20
 [c0105d15] dump_stack+0x15/0x20
 [c01daddf] xfs_error_report+0x4f/0x60
 [c01cd736] xfs_btree_check_sblock+0x56/0xd0
 [c01b58e1] xfs_alloc_lookup+0x181/0x390
 [c01b5b06] xfs_alloc_lookup_le+0x16/0x20
 [c01b30c1] xfs_free_ag_extent+0x51/0x690
 [c01b4ea4] xfs_free_extent+0xa4/0xc0
 [c01bf739] xfs_bmap_finish+0x119/0x170
 [c01e3f4a] xfs_itruncate_finish+0x23a/0x3a0
 [c02046a2] xfs_inactive+0x482/0x500
 [c0210ad4] xfs_fs_clear_inode+0x34/0xa0
 [c017d777] clear_inode+0x57/0xe0
 [c017d8e5] generic_delete_inode+0xe5/0x110
 [c017da77] generic_drop_inode+0x167/0x1b0
 [c017cedf] iput+0x5f/0x70
 [c01735cf] do_unlinkat+0xdf/0x140
 [c0173640] sys_unlink+0x10/0x20
 [c01040a4] syscall_call+0x7/0xb
 ===
xfs_force_shutdown(dm-0,0x8) called from line 4258 of file fs/xfs/xfs_bmap.c. 
Return address = 0xc021101e
Filesystem dm-0: Corruption of in-memory data detected.  Shutting down 
filesystem: dm-0

Please umount the filesystem, and rectify the problem(s)

so I cd'ed out of /scratch and umounted.

I then tried the xfs_check.

haze:~# xfs_check /dev/video_vg/video_lv
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
haze:~# mount /scratch/
haze:~# umount /scratch/
haze:~# xfs_check /dev/video_vg/video_lv

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'xfs_db'

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767bc0 flags:0x80010008 mapping: mapcount:-64 
count:0

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'syslogd'

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767cc0 flags:0x80010008 mapping: mapcount:-64 
count:0

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:

ugh. Try again
haze:~# xfs_check /dev/video_vg/video_lv
haze:~#

whilst it was running, top reported this as roughly the peak memory usage:
 8759 root

Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Greaves

Rafael J. Wysocki wrote:

This is on 2.6.22-rc5


Is the Tejun's patch

http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch

applied on top of that?


2.6.22-rc5 includes it.

(but, when I was testing rc4, I did apply this patch)

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:

yes, I'm using promise drive shelves, I have them configured to export
the 15 drives as 15 LUNs on a single ID.

I'm going to be using this as a huge circular buffer that will just be
overwritten eventually 99% of the time, but once in a while I will need to
go back into the buffer and extract and process the data.


I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).


I fully realize that the channel will be the bottleneck, I just didn't 
understand what /proc/mdstat was telling me. I thought that it was telling 
me that the resync was processing 5M/sec, not that it was writing 5M/sec 
on each of the two parity locations.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-18 Thread David Greaves

David Greaves wrote:

David Robinson wrote:

David Greaves wrote:

This isn't a regression.

I was seeing these problems on 2.6.21 (but 22 was in -rc so I waited 
to try it).
I tried 2.6.22-rc4 (with Tejun's patches) to see if it had improved - 
no.


Note this is a different (desktop) machine to that involved my recent 
bugs.


The machine will work for days (continually powered up) without a 
problem and then exhibits a filesystem failure within minutes of a 
resume.


snip


OK, that gave me an idea.

Freeze the filesystem
md5sum the lvm
hibernate
resume
md5sum the lvm

snip

So the lvm and below looks OK...

I'll see how it behaves now the filesystem has been frozen/thawed over 
the hibernate...



And it appears to behave well. (A few hours compile/clean cycling kernel builds 
on that filesystem were OK).



Historically I've done:
sync
echo platform  /sys/power/disk
echo disk  /sys/power/state
# resume

and had filesystem corruption (only on this machine, my other hibernating xfs 
machines don't have this problem)


So doing:
xfs_freeze -f /scratch
sync
echo platform  /sys/power/disk
echo disk  /sys/power/state
# resume
xfs_freeze -u /scratch

Works (for now - more usage testing tonight)

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS Tunables for High Speed Linux SW RAID5 Systems?

2007-06-18 Thread David Greaves

David Chinner wrote:

On Fri, Jun 15, 2007 at 04:36:07PM -0400, Justin Piszcz wrote:

Hi,

I was wondering if the XFS folks can recommend any optimizations for high 
speed disk arrays using RAID5?


[sysctls snipped]

None of those options will make much difference to performance.
mkfs parameters are the big ticket item here


Is there anywhere you can point to that expands on this?

Is there anything raid specific that would be worth including in the Wiki?

David

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-18 Thread David Chinner
On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
 David Greaves wrote:
 OK, that gave me an idea.
 
 Freeze the filesystem
 md5sum the lvm
 hibernate
 resume
 md5sum the lvm
 snip
 So the lvm and below looks OK...
 
 I'll see how it behaves now the filesystem has been frozen/thawed over 
 the hibernate...
 
 
 And it appears to behave well. (A few hours compile/clean cycling kernel 
 builds on that filesystem were OK).
 
 
 Historically I've done:
 sync
 echo platform  /sys/power/disk
 echo disk  /sys/power/state
 # resume
 
 and had filesystem corruption (only on this machine, my other hibernating 
 xfs machines don't have this problem)
 
 So doing:
 xfs_freeze -f /scratch
 sync
 echo platform  /sys/power/disk
 echo disk  /sys/power/state
 # resume
 xfs_freeze -u /scratch

 Works (for now - more usage testing tonight)

Verrry interesting.

What you were seeing was an XFS shutdown occurring because the free space
btree was corrupted. IOWs, the process of suspend/resume has resulted
in either bad data being written to disk, the correct data not being
written to disk or the cached block being corrupted in memory.

If you run xfs_check on the filesystem after it has shut down after a resume,
can you tell us if it reports on-disk corruption? Note: do not run xfs_repair
to check this - it does not check the free space btrees; instead it simply
rebuilds them from scratch. If xfs_check reports an error, then run xfs_repair
to fix it up.

FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS
filesystem for a suspend/resume to work safely and have argued that the only
safe thing to do is freeze the filesystem before suspend and thaw it after
resume. This is why I originally asked you to test that with the other problem
that you reported. Up until this point in time, there's been no evidence to
prove either side of the argument..

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: resync to last 27h - usually 3. what's this?

2007-06-18 Thread David Greaves

Dexter Filmore wrote:
1661 minutes is *way* too long. It's a 4x250GiB SATA array and usually takes 3 
hours to resync or check, for that matter.


So, what's this? 

kernel, mdadm versions?

I seem to recall a long-since-fixed ETA calculation bug some time back...

David
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 in my case it takes 2+ days to resync the array before I can do any
 performance testing with it. for some reason it's only doing the rebuild
 at ~5M/sec (even though I've increased the min and max rebuild speeds and
 a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus somewhere 
along the line.  If you're using scsi, for instance, it's very easy for a 
long chain of drives to overwhelm a channel.  You might also want to consider 
some other RAID layouts like 1+0 or 5+0 depending upon your space vs. 
reliability needs.


I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct, how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the 
reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, which would seem 
to prove that it's not the bus that's saturated.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct, how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, which would seem
to prove that it's not the bus that's saturated.


dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.

What kind of setup is the drives connected to?


simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write 
speed that was taking place; I thought it was the total data rate (reads 
+ writes). the next time this message gets changed it would be a good 
thing to clarify this.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 I plan to test the different configurations.

 however, if I was saturating the bus with the reconstruct, how can I fire
 off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
 reconstruct to ~4M/sec?

 I'm putting 10x as much data through the bus at that point, which would seem
 to prove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you looked at a running iostat while 
all this is going on?  Try it out- add up the kb/s from each drive and see 
how close you are to your maximum theoretical IO.


I didn't try iostat, but I did look at vmstat, and there the numbers look even 
worse: the bo column is ~500 for the resync by itself, but with the dd 
it's ~50,000. when I get access to the box again I'll try iostat to get 
more details



Also, how's your CPU utilization?


~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync 
thread


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:

simple ultra-wide SCSI to a single controller.


Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.


yes, sorry, ultra 320 wide.


I didn't realize that the rate reported by /proc/mdstat was the write
speed that was taking place; I thought it was the total data rate (reads
+ writes). the next time this message gets changed it would be a good
thing to clarify this.


Well I suppose it could make sense to show the rate of rebuild, which you can
then compare against the total size of the raid, or you can show the rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.


as I read it right now what happens is the worst of the options: you show 
the total size of the array for the amount of work that needs to be done, 
but then show only the write speed for the rate of progress being made 
through the job.


total rebuild time was estimated at ~3200 min

David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.

yes, I realize that there will be bottlenecks with this; the large capacity 
is to handle a longer history (it's going to be a 30TB circular buffer being 
fed by a pair of OC-12 links)


it appears that my big mistake was not understanding what /proc/mdstat is 
telling me.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-17 Thread David Chinner
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
 Combining these thoughts, it would make a lot of sense for the
filesystem to be able to say to the block device That block looks
 wrong - can you find me another copy to try?.  That is an example of
 the sort of closer integration between filesystem and RAID that would
 make sense.

I think that this would only be useful on devices that store
discrete copies of the blocks on different devices i.e. mirrors. If
it's an XOR based RAID, you don't have another copy you can
retrieve

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: limits on raid

2007-06-16 Thread david

On Sat, 16 Jun 2007, Neil Brown wrote:


It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write.  But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway?  The initial resync doesn't stop
you from using the array.  I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync but is that really
likely?


in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the rebuild 
at ~5M/sec (even though I've increased the min and max rebuild speeds and 
a dd to the array seems to be ~44M/sec, even during the rebuild)
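
(for reference, the knobs I mean - illustrative values only, in KB/s:

  echo 50000  > /proc/sys/dev/raid/speed_limit_min
  echo 200000 > /proc/sys/dev/raid/speed_limit_max
)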


I want to test several configurations, from a 45 disk raid6 to a 45 disk 
raid0. at 2-3 days per test (or longer, depending on the tests) this 
becomes a very slow process.


also, when a rebuild is slow enough (and has enough of a performance 
impact) it's not uncommon to want to operate in degraded mode just long 
enough to get to a maintenance window and then recreate the array and 
reload from backup.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
  
  1 - guaranteeing I/O ordering
  2 - guaranteeing blocks are on persistent storage.
  
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
  
   if I am understanding it correctly, the big win for barriers is that you 
   do NOT have to stop and wait until the data is on persistent media before 
   you can continue.
  
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
  
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.


true, but a real barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it, a barrier write 
can't)


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

