Re: ATA cables and drives
Hi everyone, thanks for the information so far! Greatly appreciated.

I've just found this:
http://home-tj.org/wiki/index.php/Sil_m15w#Message:_Re:_SiI_3112_.26_Seagate_drivers

which mentions, in particular, that Silicon Image controllers and Seagate drives don't work too well together, and that neither Silicon Image nor Seagate wants to know about or do anything about the problem. Hmm.
Re: ATA cables and drives
Jeff Garzik wrote:
> Molle Bestefich wrote:
> > I've just found this:
> > http://home-tj.org/wiki/index.php/Sil_m15w#Message:_Re:_SiI_3112_.26_Seagate_drivers
> > which mentions, in particular, that Silicon Image controllers and Seagate drives don't work too well together, and that neither Silicon Image nor Seagate wants to know about or do anything about the problem.
> Not really true...

It's not? That's how I read it. Which part of it is wrong?
Re: USB disks for RAID storage (was Re: Please help me save my data)
Martin Kihlgren writes:
> And no, nothing hangs except the disk access to the device in question when a disk fails.

Sounds good! +1 for USB...

> My Seagate disks DO generate too much heat if I stack them on top of each other, which their form factor suggests they would accept.

Starts to take up a lot of space if you need to lay them out like that. (Just for reference, I've had external USB thingys fail with just two drives stacked. They both failed.)

> My RAID5 + LVM + dm_crypt + XFS setup allows for a very extendable system.

That does give you a cool feature set :-). With LVM and dm_crypt in there, it does sound like you're running with beta-quality software to my ears, however. If it works, great.

> And as long as I treat the entire disk set as one device, the bandwidth will not be an issue, since I will never demand more bandwidth from the entire array than from a single USB drive anyway.

Fair enough. It would be cool to get the extra bandwidth, though. One solution is to use external SATA (eSATA) enclosures instead of USB enclosures. That would both raise bandwidth and fix the transaction latency issue mentioned by Daniel Pittman. There are a couple of single-disk enclosures out there that allow you to connect disks via either eSATA or USB2. None of them seems to come with cooling, though :-/.
Re: Please help me save my data
Patrick Hoover wrote:
> Is anyone else having issues with USB interfaced disks to implement RAID? Any thoughts on Pros / Cons for doing this?

Sounds like a very good stress test for MD. I often find servers completely hung when a disk fails; this usually happens in the IDE layer. If using USB disks circumvents the IDE layer enough, it might get rid of these hangs. Would be nice at least. Maybe I'm just dreaming.

For end users, USB might remove the need to take special care of cooling in your cabinet. OTOH, most USB disk enclosures have horrible thermal properties.

USB would make it a lot easier to add new disks (beyond your cabinet's capacity) and to remove old disks when/if they're no longer needed. Users might run into a bandwidth issue at some point..
ATA cables and drives
I'm looking for new harddrives. This is my experience so far.

SATA cables:
============
I have zero good experiences with any SATA cables. They've all been crap so far.

3.5" ATA harddrives buyable where I live:
=========================================
(All drives are 7200 rpm, for some reason.)

Hitachi DeskStar               500 GB / 16 MB / 8.5 ms / SATA or PATA
Maxtor DiamondMax 11           500 GB / 16 MB / 8.5 ms / SATA or PATA
Maxtor MaXLine Pro             500 GB / 16 MB / 8.5 ms / SATA or PATA
Seagate Barracuda 7200.10      500 GB / 16 MB / ?      / SATA or PATA
Seagate Barracuda 7200.10      750 GB / 16 MB / ?      / SATA or PATA
Seagate Barracuda 7200.9       500 GB / 16 MB / 11 ms  / SATA or PATA
Seagate Barracuda 7200.9       500 GB /  8 MB / 11 ms  / SATA or PATA
Seagate Barracuda ES           500 GB / 16 MB / 8.5 ms / SATA
Seagate Barracuda ES           750 GB / 16 MB / 8.5 ms / SATA
Seagate ESATA                  500 GB / 16 MB / ?      / SATA (external)
Seagate NL35.2 ST3500641NS     500 GB / 16 MB / 8 ms / ? / SATA
Seagate NL35.2 ST3500841NS     500 GB /  8 MB / 8 ms / ? / SATA
Western Digital SE16 WD5000KS  500 GB / 16 MB / 8.9 ms / SATA
Western Digital RE2 WD5000YS   500 GB / 16 MB / 8.7 ms / SATA

I've tried Maxtor and IBM (now Hitachi) harddrives. Both makes have failed on me, but most of the time due to horrible packaging.

I don't care a split-second whether one kind is marginally faster than the other, so all the reviews on AnandTech etc. are utterly useless to me. There are infinitely more effective ways to get better performance than buying a slightly faster harddrive.

I DO care about quality, namely:
* How often the drives have catastrophic failures,
* How they handle heat (dissipation acceptance - how hot before it fails?),
* How big the spare area is,
* How often they have single-sector failures,
* How long the manufacturer warranty lasts,
* How easy the manufacturer is to work with wrt. warranty.

I haven't been able to figure out the spare area size, heat properties, etc. for any drives. Thus my only criterion so far has been manufacturer warranty: how much bitching do I get when I tell them my drive doesn't work.

My main experience is with Maxtor. Maxtor has been nothing less than superb wrt. warranty! Download an ISO with a diag tool, burn the CD, boot the CD, type the fault code it prints into Maxtor's site, and a day or two later you've got a new drive in the mail, plus packaging to ship the old one back in. If something odd happens, call them up and they're extremely helpful. Unfortunately, I lack thorough experience with the other brands.

Questions:
==========
A.) Does anyone have experience with returning Hitachi, Seagate or WD drives to the manufacturer? Do they have a manufacturer warranty at all? How much/little trouble did you have with Hitachi, Seagate or WD?

B.) Can anyone *prove* (to a reasonable degree) that drives from manufacturer H, M, S or WD are of better quality? Has anyone seen a review that heat/shock/stress tests drives?

C.) Do good SATA cables exist? E.g. cables that lock on to the drives, or backplanes which lock the entire disk in place?

Thanks for reading, and thanks in advance for answers (if any) :-).
Re: remark and RFC
Peter T. Breuer wrote:
> 1) I would like raid request retries to be done with exponential delays, so that we get a chance to overcome network brownouts. I presume the former will either not be objectionable

You want to hurt performance for every single MD user out there, just because things don't work optimally under enbd, which is after all a rather rare use case compared to using MD on top of real disks. Uuuuh.. yeah, no objections there.

Besides, it seems a rather pointless exercise to try and hide the fact from MD that the device is gone, since it *is* in fact missing. Seems wrong at the least.

> 2) I would like some channel of communication to be available with raid that devices can use to say that they are OK and would they please be reinserted in the array. The latter is the RFC thing

It would be reasonable for MD to know the difference between
- device has (temporarily, perhaps) gone missing, and
- device has physical errors when reading/writing blocks,
because if MD knew that, then it would be trivial to automatically hot-add the missing device once it's available again, whereas the faulty one would need the administrator to get off his couch. This would help in other areas too, like when a disk controller dies, or a cable comes (completely) loose. Even if the IDE drivers are not mature enough to tell us which kind of error it is, MD could still implement such a feature just to help enbd.

I don't think a comm-channel is the right answer, though. I think the type=(missing/faulty) information should be embedded in the I/O error message from the block layer (enbd in your case) instead, to avoid race conditions and allow MD to make good decisions as early as possible. The comm channel and "hey, I'm OK" message you propose doesn't seem that different from just hot-adding the disks from a shell script using 'mdadm' (rough sketch below).

> When the device felt good (or ill) it notified the raid arrays it knew it was in via another ioctl (really just hot-add or hot-remove), and the raid layer would do the appropriate catchup (or start bitmapping for it).

No point in bitmapping. Since with the network down and all the devices underlying the RAID missing, there's nowhere to store data. Right? Some more factual data about your setup would maybe be good..

> all I can do is make the enbd device block on network timeouts. But that's totally unsatisfactory, since real network outages then cause permanent blocks on anything touching a file system mounted remotely. People don't like that.

If it's just this that you want to fix, you could write a DM module which returns an I/O error if the request to the underlying device takes more than 10 seconds. Layer that module on top of the RAID, and make your enbd device block on network timeouts. Now the RAID array doesn't see missing disks on network outages, and users get near-instant errors when the array isn't responsive due to a network outage.
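For illustration, a minimal (untested) sketch of the hot-add-from-a-script idea; /dev/md0 and /dev/nda0 are made-up names, so substitute your own array and enbd device:

    while true; do
        # Has the device been marked faulty (F) in /proc/mdstat?
        if grep -q 'nda0\[[0-9]*\](F)' /proc/mdstat; then
            # Probe it; if it answers again, drop the stale entry and re-add it.
            if dd if=/dev/nda0 of=/dev/null bs=512 count=1 2>/dev/null; then
                mdadm /dev/md0 --remove /dev/nda0
                mdadm /dev/md0 --add /dev/nda0
            fi
        fi
        sleep 10
    done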
Re: remark and RFC
Peter T. Breuer wrote:
> > You want to hurt performance for every single MD user out there, just
> There's no performance drop! Exponentially staged retries on failure are standard in all network protocols ... it is the appropriate reaction in general, since stuffing the pipe full of immediate retries doesn't allow the would-be successful transactions to even get a look in against that competition.

That's assuming that there even is a pipe, which is something specific to ENBD / networked block devices, not something that the MD driver should in general care about.

> > because things don't work optimally under enbd, which is after all a rather rare use case compared to using MD on top of real disks.
> Strawman.

Quah?

> > Besides, it seems a rather pointless exercise to try and hide the fact from MD that the device is gone, since it *is* in fact missing.
> Well, we don't really know that for sure. As you know, it is impossible to tell in general if the net has gone awol or is simply heavily overloaded (with retry requests).

From MD's point of view, if we're unable to complete a request to the device, then it's either missing or faulty. If a call to the device blocks, then it's just very slow. I don't think it's wise to pollute these simple mechanics with a "maybe it's sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection. Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users. Missing vs. faulty is OTOH a pretty simple interface, which maps fine to both real disks and NBDs.

> The retry on error is a good thing. I am simply suggesting that if the first retry also fails, we do some back-off before trying again, since it is now likely (lacking more knowledge) that the device is having trouble and may well take some time to recover. I would suspect that an interval of 0 1 5 10 30 60s would be appropriate for retries.

Only for networked block devices. Not for real disks - there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken.

> One can cycle that twice for luck before giving up for good, if you like. The general idea in such backoff protocols is that it avoids filling a fixed bandwidth channel with retries (the sum of a constant times 1 + 1/2 + 1/4 + ... is a finite proportion of the channel bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also there is an _additional_ assumption that the net is likely to have brownouts, and so we _ought_ to retry at intervals, since retrying immediately will definitely almost always do no good.

Since the knowledge that the block device is on a network resides in ENBD, I think the most reasonable thing to do would be to implement the backoff in ENBD. It should be relatively simple to catch MD retries in ENBD and block for 0 1 5 10 30 60 seconds. That would keep the network backoff algorithm in a more appropriate place, namely the place that knows the device is on a network.

> In normal failures there is zero delay anyway.

Since the first retry would succeed, or? I'm not sure what this "normal failure" is, btw.

> And further, the bitmap takes care of delayed responses in the normal course of events.

Mebbe. Does it?
> > It would be reasonable for MD to know the difference between
> > - device has (temporarily, perhaps) gone missing, and
> > - device has physical errors when reading/writing blocks,
> I agree. The problem is that we can't really tell what's happening (even in the lower level device) across a net that is not responding.

In the case where requests can't be delivered over the network (or a SATA cable, whatever), it's a clear case of a missing device.

> > because if MD knew that, then it would be trivial to automatically hot-add the missing device once available again. Whereas the faulty one would need the administrator to get off his couch.
> Yes. The idea is that across the net approximately ALL failures are temporary ones, to a value of something like 99.99%. The cleaning lady is usually dusting the on-off switch on the router.

> > This would help in other areas too, like when a disk controller dies, or a cable comes (completely) loose. Even if the IDE drivers are not mature enough to tell us which kind of error it is, MD could still implement such a feature just to help enbd. I don't think a comm-channel is the right answer, though. I think the type=(missing/faulty) information should be embedded in the I/O error message from the block layer (enbd in your case) instead, to avoid race conditions and allow MD to make good decisions as early as possible.
> That's a possibility. I certainly get two types of error back in the enbd driver .. remote error or network error. Remote error is when we
Re: remark and RFC
Peter T. Breuer wrote:
> > We can't do a HOT_REMOVE while requests are outstanding, as far as I know. Actually, I'm not quite sure which kind of requests you are talking about.
> Only one kind. Kernel requests :). They come in read and write flavours (let's forget about the third race for the moment).

I was wondering whether you were talking about requests from eg. userspace to MD, or from MD to the raw device. I guess it's not that important really; that's why I asked you off-list. Just getting in too deep, and being curious.

> Pipe refers to a channel of fixed bandwidth. Every communication channel is one. The pipe for a local disk is composed of the bus, disk architecture, controller, and also the kernel architecture layers. [snip] See above. The problem is generic to fixed bandwidth transmission channels, which, in the abstract, is everything. As soon as one does retransmits, one has a kind of obligation to keep retransmissions down to a fixed maximum percentage of the potential traffic, which is generally accomplished via exponential backoff (a time-wise solution, in other words, deliberately smearing retransmits out along the time axis in order to prevent spikes).

Right, so with the bandwidth to local disks being, say, 150 MB/s, an appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can agree on that pretty fast.. right? ;-).

> The md layers now can generate retries by at least one mechanism that I know of .. a failed disk _read_ (maybe of existing data or parity data as part of an exterior write attempt) will generate a disk _write_ of the missed data (as reconstituted via redundancy info). I believe a failed disk _write_ may also generate a retry,

Can't see any reason why MD would try to fix a failed write, since the retry is not likely to succeed anyway.

> Such delays may in themselves cause timeouts in md - I don't know. My RFC (maybe RFD) is aimed at raising a flag saying that something is going on here that needs better control.

I'm still not convinced MD does retries at all..

> What the upper layer, md, ought to do is back off.

I think it should just kick the disk.

> > I don't think it's wise to pollute these simple mechanics with a "maybe it's sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection.
> I've already suggested a simple mechanism above .. back off on the retries, already. It does no harm to local disk devices.

Except if the code path gets taken, and the user has to wait 10+20+30+60s for each failed I/O request.

> If you like, the constant of backoff can be based on how long it took the underlying device to signal the io request as failed. So a local disk that replies "failed" immediately can get its range of retries run through in a couple of hop, skip and millijiffies. A network device that took 10s to report a timeout can get its next retry back again in 10s. That should give it time to recover.

That sounds saner to me.

> > Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users.
> Uh, the net is everywhere. When you have 10PB of storage in your intelligent house's video image file system, the parts of that array are connected by networking room to room. Supercomputers used to have simple networking between each computing node.

Heck, clusters still do :). Please keep your special-case code out of the kernel :-). Uhm.

> > Missing vs. Faulty is OTOH a pretty simple interface, which maps fine to both real disks and NBDs.
> It may well be a solution. I think we're still at the stage of precisely trying to identify the problem, too! At the moment, most of what I can say is: definitely, there is something wrong with the way the md layer reacts, or can be controlled, with respect to networking brown-outs and NBDs.

> > Not for real disks, there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken.
> We are discussing _error_ semantics. There is no bad effect at all on normal working!

In the past, I've had MD run a box to a grinding halt more times than I like. It always results in one thing: the user pushing the big red switch. That's not acceptable for a RAID solution. It should keep working, without blocking all I/O from userspace for 5 minutes just because it thinks it's a good idea to hold up all I/O requests to underlying disks for 60s each, waiting to retry them.

> The effect on normal working should even be _good_ when errors occur, because now max bandwidth devoted to error retries is limited, leaving more max bandwidth for normal requests.

Assuming you use your RAID component device as a regular device also, and that the underlying device is not able to satisfy the requests as fast as you
Re: trying to brute-force my RAID 5...
Sevrin Robstad wrote:
> I created the RAID when I installed Fedora Core 3 some time ago, didn't do anything special, so the chunks should be 64 kbyte and parity should be left-symmetric?

I have no idea what's default on FC3, sorry.

> Any idea?

I missed that you were trying to fdisk -l /dev/md0.. As others have suggested, search for filesystems using fsck, or mount, or what not ;-).
Re: trying to brute-force my RAID 5...
Sevrin Robstad wrote:
> I got a friend of mine to make a list of all the 6^6 combinations of dev 1 2 3 4 5 missing, shouldn't this work???

Only if you get the layout and chunk size right. And make sure that you know whether you were using partitions (eg. sda1) or whole drives (eg. sda - bad idea). A loop like the sketch below could automate the testing.
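To illustrate, an untested sketch of such a brute-force loop. Assumptions: 64k chunks and left-symmetric parity (the FC3 guesses from earlier), a permutations.txt holding one device ordering per line (e.g. "/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing"), and the usual warning that every --create rewrites the superblocks, so only do this when there's nothing left to lose:

    while read -r order; do
        mdadm --stop /dev/md0 2>/dev/null
        # --assume-clean avoids a resync; --run skips the "really create?" prompt
        mdadm --create /dev/md0 --run --assume-clean -l 5 -n 6 -c 64 \
              --parity=left-symmetric $order
        # read-only probe: does a recognisable filesystem appear?
        if fsck -n /dev/md0 >/dev/null 2>&1; then
            echo "candidate ordering: $order"
        fi
    done < permutations.txt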
Re: only 4 spares and no access to my data
Karl Voit wrote:
> if (super == NULL) {
>     fprintf(stderr, Name ": No suitable drives found for %s\n", mddev);
> [...]
> Well I guess the message will be shown if the superblock is not found.

Yes. No clue why; my best guess is that you've already zeroed the superblock. What does mdadm --query / --examine say about /dev/sd[abcd] - are there superblocks?

> st = guess_super(fd);
> if (st == NULL) {
>     if (!quiet)
>         fprintf(stderr, Name ": Unrecognised md component device - %s\n", dev);
> Again: this seems to be the case when the superblock is empty.

Yes, it looks like it can't find any usable superblocks. Maybe you've accidentally zeroed the superblocks on sd[abcd]1 also? If you fdisk -l /dev/sd[abcd], do the partition tables look like they should / like they used to? What does mdadm --query / --examine /dev/sd[abcd]1 tell you - any superblocks?

> Since my miserable failure I am probably too careful *g* The problem is also that, without deeper background knowledge, I cannot predict whether this or that permanently affects the real data on the disks.

My best guess is that it's OK and you won't lose data if you run --zero-superblock on /dev/sd[abcd] and then create an array on /dev/sd[abcd]1, but I do find it odd that it suddenly can't find superblocks on /dev/sd[abcd]1.

> Maybe a person like me starts to think that sw-raid tools like mdadm should warn users before permanent changes are executed. If mdadm is to be used by ordinary users (in addition to raid-geeks like you), it might be a good idea to prevent data loss. (Meant as a suggestion.)

Perhaps. Or perhaps mdadm should just tell you that you're doing something stupid if you try to manipulate arrays on a block device which seems to contain a partition table. It's not like it's even remotely useful to create an MD array spanning the whole disk rather than spanning a partition which spans the whole disk, anyway.
Re: only 4 spares and no access to my data
Karl Voit wrote:
> 443: root at ned ~ # mdadm --examine /dev/sd[abcd]
> shows that all 4 devices are ACTIVE SYNC. Please note that there is no 1 behind sda up to sdd!

Yes, you're right. It seems you've created an array/superblocks both on sd[abcd] (line 443 onwards) and on sd[abcd]1 (line 66 and onwards). I'm unsure why 'pvscan' says there is an LVM PV on sda1 (lines 118/119). Probably it's a misfeature in LVM, causing it to find the PV inside the MD volume if the array has not been started (since it says that the PV is ~700 GB).

> > Running zero-superblock on sd[abcd] and then assembling the array from sd[abcd]_1_ sounds odd to me.
> Well, this is because of the false(?) superblocks of sda-sdd in comparison to sda1 to sdd1.

OK, I missed that part of the story. In that case it sounds sane to zero the superblocks on sd[abcd], seeing that 'pvscan' and 'lvscan' find live data - which you could back up - on the array consisting of sd[abcd]1.

> [EMAIL PROTECTED] ~ # mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> mdadm: cannot open device /dev/sda1: Device or resource busy
> mdadm: /dev/sda1 has no superblock - assembly aborted

Odd message. Does lsof | grep sda show anything using /dev/sda(1)?

> I should have mentioned that I did not use the whole hard drive space for sd[abcd]1. I thought that if I have to replace one of my Samsungs with another drive that doesn't have the very same capacity, I'd better use exactly 250 GB partitions and forget the last approx. 49 MB of the drives.

Good idea.

> The problem seems to be the superblocks.

Which ones - those on sd[abcd]1? You've probably destroyed them by syncing the array consisting of sd[abcd].

> Can I repair them?

No, but you can recreate them without touching your data. I think the suggestion from Andreas Gredler sounds sane. I'm unsure whether hot-adding a device will recreate a superblock on it; therefore I'd probably run --create on all four devices and use sysfs to force a repair, instead of (as Andreas suggests) creating the array with one 'missing' device. Do remember to zero the superblocks on sd[abcd] first, to prevent mishaps...
Re: only 4 spares and no access to my data
Henrik Holst wrote:
> Is sda1 occupying the entire disk? Since the superblock is the /last/ 128Kb (I'm assuming 128*1024 bytes), the superblocks should be one and the same.

Ack, never considered that. Ugly!!!
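For the curious, the 0.90 superblock offset is easy to compute by hand. A sketch, assuming the standard 64K reservation at the end of the device:

    SECT=$(blockdev --getsz /dev/sda)    # device size in 512-byte sectors
    SB=$(( (SECT / 128) * 128 - 128 ))   # round down to a 64K boundary, back off one 64K block
    echo "0.90 superblock starts at sector $SB"

If sda1 runs to the end of sda, the two calculations can land on the same physical sectors - hence "one and the same" superblock.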
Re: only 4 spares and no access to my data
Karl Voit wrote:
> OK, I upgraded my kernel and mdadm:
> uname -a: Linux ned 2.6.13-grml #1 Tue Oct 4 18:24:46 CEST 2005 i686 GNU/Linux

That release is 10 months old. The newest release is 2.6.17. You can see the changes to MD since 2.6.13 here:
http://www.kernel.org/git/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.17.y.git;a=search;s=md%3A
Anything from 2005-09-09 and further up the list is something that's in 2.6.17 but not in 2.6.13. For example, your MD does not have sysfs support, it seems...

> dpkg --list mdadm -- 2.4.1-6

The newest release is 2.5.2. 2.4.1 is 3 months old.

> Is it true that I should try the following lines?
> mdadm --stop /dev/md0
> mdadm --zero-superblock /dev/sda
> mdadm --zero-superblock /dev/sdb
> mdadm --zero-superblock /dev/sdc
> mdadm --zero-superblock /dev/sdd
> mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 --force
> (check if it worked - probably not - and if not, try the following line)
> mdadm --create -n 4 -l 5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

I don't have a unix box right here, but yes, that looks correct to me. You can make certain that the ordering of the devices is correct by looking in your paste bin, lines 12-15. The other RAID parameters (raid level, # of devices, persistence, layout, chunk size) can be seen on lines 212-231.

> Did you mean something like
> echo repair > /sys/block/md0/md/sync_action

Exactly. (Gee, I hope someone stops me if I'm giving out bad advice. Heh ;-).) You can also assemble the array read-only after recreating the superblocks, and you can use "check" as a sync_action... But only if your kernel has MD with sysfs support ;-).
Re: only 4 spares and no access to my data
Karl Voit wrote:
> I published the whole story (as much as I could log during my reboots and so on) on the web: http://paste.debian.net/8779

From the paste bin:

443: [EMAIL PROTECTED] ~ # mdadm --examine /dev/sd[abcd]

shows that all 4 devices are ACTIVE SYNC. The next command,

563: [EMAIL PROTECTED] ~ # mdadm --assemble --update=summaries /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/md0 assembled from 0 drives and 4 spares - not enough to start the array.

and then

568: [EMAIL PROTECTED] ~ # mdadm --examine /dev/sd[abcd]1

suddenly shows all 4 devices as SPARE? What the heck happened in between? Did you do anything evil, or is it an MD bug, or what?

> mdadm-version: 1.12.0-1
> uname: Linux ned 2.6.13-grml

You should probably upgrade at some point; there's always a better chance that devels will look at your problem if you're running the version that they're sitting with..

> Andreas Gredler suggested the following lines as a last attempt, but with a risk of losing data, which I want to avoid:
> mdadm --stop /dev/md0
> mdadm --zero-superblock /dev/sda
> mdadm --zero-superblock /dev/sdb
> mdadm --zero-superblock /dev/sdc
> mdadm --zero-superblock /dev/sdd
> mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 --force
> mdadm --create -n 4 -l 5 /dev/md0 missing /dev/sdb1 /dev/sdc1 /dev/sdd1

Running zero-superblock on sd[abcd] and then assembling the array from sd[abcd]_1_ sounds odd to me.
Re: Cutting power without breaking RAID
Tim wrote:
> That would probably be ideal: issue the power-off command with something like a 30 second timeout, which would give the system time to power off cleanly first.

I don't think that's ideal. Many systems restore power to the last known state, so powering off cleanly would result in the machine not coming back up after the power cycle. Some machines can be changed in the BIOS to always power on. If the server administrator remembers to do so, that is.
Re: Ok to go ahead with this setup?
Christian Pernegger wrote:
> Intel SE7230NH1-E mainboard
> Pentium D 930

HPA recently said that x86_64 CPUs have better RAID5 performance.

> Promise Ultra133 TX2 (2ch PATA) - 2x Maxtor 6B300R0 (300 GB, DiamondMax 10) in RAID1
> Onboard Intel ICH7R (4ch SATA) - 4x Western Digital WD5000YS (500 GB, Caviar RE2) in RAID5

Is it a NAS kind of device? In that case, drop the 2x 300 GB disks and get 6x 500 GB instead. You can partition those so that you have a RAID1 spanning the first 10 GB of all 6 drives for use as the system partition, and use the rest in a RAID5 (sketch below).

> * Does this hardware work flawlessly with Linux?

No clue.

> * Is it advisable to boot from the mirror?

Should work. Would the box still boot with only one of the disks? If you configure things correctly, it should - but better test it.

> * Can I use EVMS as a frontend? Does it even use md, or is EVMS's RAID something else entirely?

Yes. EVMS uses a lot of underlying software, MD being one component.

> * Should I use the 300s as a single mirror, or span multiple ones over the two disks?

What would the purpose be?

> * Am I even correct in assuming that I could stick an array in another box and have it work?

Work for what?

> Comments welcome

Get gigabit NICs, in case you want to fiddle with iSCSI? :-).
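Roughly like this, assuming the six drives show up as sda..sdf, each with a small first partition (sdX1) and the rest in sdX2, both tagged 0xfd (untested sketch):

    mdadm --create /dev/md0 -l 1 -n 6 /dev/sd[a-f]1   # ~10 GB system mirror
    mdadm --create /dev/md1 -l 5 -n 6 /dev/sd[a-f]2   # the remainder as RAID5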
Re: RAID tuning?
Nix wrote:
> Adam Talbot wrote:
> > Can any one give me more info on this error? Pulled from /var/log/messages.
> > raid6: read error corrected!!
> The message is pretty easy to figure out, and the code (in drivers/md/raid6main.c) is clear enough.

But the message could be clearer; for instance, it would be a very good service to the user if the message showed which block had been corrected. Also, using two !! in place of one ! makes the message seem a little dumb :-).
Re: Horrific Raid 5 crash... help!
David M. Strang wrote:
> Well today, during this illustrious rebuild... it appears I actually DID have a disk fail. So, I have 26 disks... 1 partially rebuilt, and 1 failed.

A common scenario, it seems.

> Hoping and praying that a rebuild didn't actually wipe the disk and maybe just synced things up -- I did a create with the 26 disks + 1 partially rebuilt and 1 'missing' disk

I've always loathed that approach. It seems *so* wrong to force MD to nuke all the information that is in the superblocks. I know this is the recommended approach, and I wish it would be changed. Better sooner than later, too :-).

> well, the array came up but I get access denied on a zillion things, and the filesystem is freaking out.

Do you have something like mdadm's printout of the superblocks from before and after you did the mdadm --create? Without that information, it's going to be hard to tell what's going on. My best guess is that you've assembled the array using the halfway-rebuilt disk, that it was kicked by MD long ago on the basis of a single bad block somewhere, and that the disk happens to contain a bunch of old data which confuses the filesystem. But it's a *very* wild guess.

> Before I proceed any further... what are my options? Do I have any options?

Hard to say unless you can tell us which disks failed, and when, and which disks you used to assemble now, etc. etc. The more information, the better.

> I could run a fsck... but I held off, fearing it could just make things worse.

Good thinking.. I've often seen ext3 fsck do a lot more harm than good. (I believe reiser's fsck is pretty good, but I haven't got any experience with it.) Until you're sure you've got everything right, I wouldn't run fsck. In fact, assemble the array read-only and mount the filesystem read-only (something like the sketch below). If you need to make modifications and you're not sure you're doing it right, you can make a duplicate of the individual disks in the RAID and perform experiments on those. Of course, the price tag depends on how many GB your array is..
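Something along these lines, perhaps (untested; device names invented, and check that your mdadm version supports read-only assembly):

    # keep MD from writing anything while you look around
    mdadm --assemble --readonly /dev/md0 /dev/sd[b-z]1   # your member list here
    mount -o ro /dev/md0 /mnt/recovery

    # or experiment on a copy of a member instead of the real thing
    dd if=/dev/sdb of=/dev/sdz bs=1M conv=noerror,sync   # sdz = scratch disk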
Re: md: Change ENOTSUPP to EOPNOTSUPP
Ric Wheeler wrote:
> You are absolutely right - if you do not have a validated, working barrier for your low level devices (or a high end, battery backed array or JBOD), you should disable the write cache on your RAIDed partitions and on your normal file systems ;-) There is working support for SCSI (or libata S-ATA) barrier operations in mainline, but they conflict with queue-enabled targets, which ends up leaving queueing on and disabling the barriers.

Thank you very much for the information! How can I check that I have a validated, working barrier with my particular kernel version etc.? (Do I just assume that since it's not SCSI, it doesn't work?)

I find it, hmm... stupefying? horrendous? completely brain dead? I don't know.. that no one warns users about this. I bet there's a million people out there happily using MD (probably installed and initialized with Fedora Core / anaconda) and thinking their data is safe, while in fact it is anything but. Damn, this is not a good situation.. (Any suggestions for a good place to fix this? Better really, really, really late than never...)
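For what it's worth, actually turning the write cache off would presumably look something like this (hdparm for (S)ATA drives, sdparm for SCSI; double-check against your distro's tools):

    hdparm -W0 /dev/sda           # switch the drive's write cache off
    hdparm -W /dev/sda            # query the current write-caching setting
    sdparm --clear=WCE /dev/sda   # SCSI equivalent: clear the Write Cache Enable bit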
Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
NeilBrown wrote:
> Change ENOTSUPP to EOPNOTSUPP
> Because that is what you get if a BIO_RW_BARRIER isn't supported!

Dumb question, hope someone can answer it :). Does this mean that any version of MD up till now won't know that SATA disks do not support barriers, and therefore won't flush SATA disks, and therefore I need to disable the disks' write cache if I want to be 100% sure that raid arrays are not corrupted? Or am I way off :-).
Re: to be or not to be...
gelma wrote:
> first run: lots of strange errors reported about impossible i_size values, duplicated blocks, and so on

You mention only filesystem errors, no block-device related errors. In this case, I'd say that it's more likely that dm-crypt is to blame rather than MD. I think you should try the dm-devel mailing list. Posting a complete log of everything that has happened would probably be a good thing.

I have no experience with dm-crypt, but I do have experience with another dm target (dm-snapshot), which is very good at destroying my data.

If you want a stable solution for encrypting your files, I can recommend loop-aes. loop-aes has very well-thought-through security, the docs are concise but have wide coverage, and it has good backwards compatibility - probably not your biggest concern right now, but it is *very* nice to know that your data will be accessible in the future as well as now - etc.. I've been using it for a couple of years now, since the 2.2 or 2.4 days (can't remember), and I've had nothing short of an absolutely *brilliant* experience with it. Enough propaganda for now; hope you get your problem solved :-).
Re: data recovery on raid5
Sam Hopkins wrote:
> mdadm -C /dev/md0 -n 4 -l 5 missing /dev/etherd/e0.[023]

While it should work, it's a bit drastic, perhaps? I'd start with mdadm --assemble --force. With --force, mdadm will pull the event counter of the most recently failed drive up to current status, which should give you a readable array.

After that, you could try running a check by echo'ing "check" into sync_action (rough command sequence below). If the check succeeds, fine: hot-add the last drive to your array and MD will start resyncing. If the check fails because of a bad block, you'll have to make a decision: live with the lost blocks, or try to reconstruct from the first-kicked disk. I posted a patch this week that will allow you to forcefully get the array started with all of the disks - but beware: MD wasn't made with this in mind, and it will probably be confused and sometimes pick data from the first-kicked drive over data from the other drives. Only forcefully start the array with all drives if you absolutely have to...

Oh, and I'm not an expert by any means, so take everything I say with a grain of salt :-).
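Spelled out, roughly (untested, and the member names are just my guess at your setup):

    mdadm --assemble --force /dev/md0 /dev/etherd/e0.0 /dev/etherd/e0.2 /dev/etherd/e0.3
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                        # wait here until the check finishes
    cat /sys/block/md0/md/mismatch_cnt      # 0 = the check was clean
    mdadm /dev/md0 --add /dev/etherd/e0.1   # then hot-add the kicked drive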
Re: Problem with 5disk RAID5 array - two drives lost
Tim Bostrom wrote:
> It appears that /dev/hdf1 failed this past week and /dev/hdh1 failed back in February.

An obvious question would be: how much have you been altering the contents of the array since February?

> I tried a mdadm --assemble --force and was able to get the following:
> ==
> mdadm: forcing event count in /dev/hdf1(1) from 777532 upto 777535
> mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/hdf1
> raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
> mdadm: /dev/md0 has been started with 4 drives (out of 5).
> ==

Looks good.

> I then tried to mount /dev/md0

A bit premature, I'd say.

> raid5: Disk failure on hdf1, disabling device.

MD doesn't like to find errors when it's rebuilding. It will kick that disk off the array, which will cause MD to return crap (instead of stopping the array and removing the device - I wonder why), in turn causing 'mount' etc. to fail. Quite unfortunate for you, since you have absolutely no redundancy with 4/5 drives, and you really can't afford to have the 4th disk kicked just because there's a bad block on it. This is something that MD could probably handle much better than it does now.

In your case, you probably want to try and reconstruct from all 5 disks, but without losing the information in their event counters - you want MD to use as much data as it can from the 4 fresh disks (assuming that they're at least 99% readable), and only when there's a rare bad block on one of them should it use data from the 5th. Seeing as

1) MD doesn't automatically check your array unless you ask it to, and
2) modern disks have a habit of developing lots of bad blocks,

it would be very nice if MD could help out in these kinds of situations. Unfortunately, implementation is tricky as I see it, and currently MD can do no such thing.

> spurious 8259A interrupt: IRQ7.

Oops. I'd look into that; I think it's a known bug. (Then again, maybe it's just the IDE drivers - I've experienced really bad IRQ handling both with old-style IDE and with libata.)

> hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6720

Hey, it's telling you where your data used to be. Cute.

> raid5: Disk failure on hdf1, disabling device. Operation continuing on 3 devices

Haha! Real bright there, MD, continuing raid5 operation with 3/5 devices. Still not a bug, eh? :-) *poke, poke*

> I'm guessing /dev/hdf is shot.

Actually, there are a lot of sequential sector numbers in the output you posted, and I think it's unusual for a drive to develop that many bad blocks in a row. I could be wrong, and it could be a head crash or something (have you been moving the system around much?). But if I had to guess, I'd say there's a real likelihood that it's a loose cable, a controller problem or a driver issue. Could you try and run:

# dd if=/dev/hdf of=/dev/null bs=1M count=100 skip=1234567

You can play around with different random numbers instead of 1234567. If it craps out *immediately*, then I'd say it's a cable problem or so, and not a problem with what's on the platters.

> I haven't tried an fsck though. Would this be advisable?

No - get the array running first, then fix the filesystem. You can initiate array checks and repairs like this:

# cd /sys/block/md0/md/
# echo check > sync_action

or

# echo repair > sync_action

Or something like that.

> Is there a way that I can try and build the array again with /dev/hdh instead of /dev/hdf, with some possible data corruption on files that were added since Feb?

Let's first see if we can't get hdf online.
Re: data recovery on raid5
Jonathan wrote:
> # mdadm -C /dev/md0 -n 4 -l 5 missing /dev/etherd/e0.[023]

I think you should have tried mdadm --assemble --force first, as I proposed earlier. By doing the above, you have effectively replaced your version 0.9.0 superblocks with version 0.9.2 ones. I don't know if version 0.9.2 superblocks are larger than 0.9.0; Neil hasn't responded on that yet. Potentially hazardous, who knows.

Anyway. This is from your old superblock, as described by Sam Hopkins:

/dev/etherd/blah:
  Chunk Size : 32K

This is from what you've just posted:

/dev/etherd/blah:
  Chunk Size : 64K

If I were you, I'd recreate your superblocks now, but with the correct chunk size (use -c).

> We'll be happy to pay you for your services.

I'll be modest and charge you a penny per byte of data recovered, ho hum.
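(By the way, you can double-check the current superblock values straight off any member at any time, e.g.:

    mdadm --examine /dev/etherd/e0.2 | egrep 'Raid Level|Chunk Size|Layout'

- the device name here is just an example.)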
Re: data recovery on raid5
Jonathan wrote:
> I was already terrified of screwing things up, now I'm afraid of making things worse

Adrenalin... makes life worth living there for a sec, doesn't it ;o)

> based on what was posted before, is this a sensible thing to try?
> mdadm -C /dev/md0 -c 32 -n 4 -l 5 missing /dev/etherd/e0.[023]

Yes, looks exactly right.

> Is what I've done to the superblock size recoverable?

I don't think you've done anything at all. I just *don't know* if you have, that's all. I was just trying to say that it wasn't super-cautious of you to begin with :-).

> I don't understand how mdadm --assemble would know what to do, which is why I didn't try it initially.

By giving it --force, you tell it to forcefully assemble the array even though it might be damaged. That means including some disks (the freshest ones) that are out of sync. Does that help?
Re: data recovery on raid5
Jonathan wrote:
> Well, the block sizes are back to 32k now, but I still had no luck mounting /dev/md0 once I created the array.

Ahem, I missed something. Sorry, the 'a' was hard to spot. Your array used "Layout : left-asymmetric", while the superblock you've just created has "Layout : left-symmetric". Try again, but add the option --parity=left-asymmetric.
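Putting the pieces from this thread together, that would presumably be:

    mdadm -C /dev/md0 -c 32 --parity=left-asymmetric -n 4 -l 5 missing /dev/etherd/e0.[023]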
Re: data recovery on raid5
Jonathan wrote:
> how safe should the following be?
> mdadm --assemble /dev/md0 --uuid=8fe1fe85:eeb90460:c525faab:cdaab792 /dev/etherd/e0.[01234]

You can hardly do --assemble anymore. After you have recreated superblocks on some of the devices, those are conceptually part of a different raid array - at least as seen by MD.

> I am *really* not interested in making my situation worse.

We'll keep going till you've got your data back.. Recreating superblocks again on e0.{0,2,3} can't hurt, since you've already done this and thereby nuked the old superblocks.

You can shake your own hand and thank yourself now (oh, and Sam too) for posting all the debug output you have. Otherwise we would probably never have spotted nor known about the parity/chunk size differences :o).
libata retry - disable?
Does anyone know of a way to disable libata's 5-time retry when a read fails? It has the effect of causing every failed sector read to take 6 seconds before it fails, making raid5 rebuilds go awfully slow. It's generally undesirable too, when you've got RAID on top that can write replacement data onto the failed sectors..

Log showing a failed sector:

Apr 18 09:49:53 linux kernel: end_request: I/O error, dev sda, sector 131124407
Apr 18 09:49:55 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:55 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:55 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:56 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:56 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:56 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:57 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:57 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:57 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:59 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:59 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:59 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:00 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:50:00 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:50:00 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:01 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:50:01 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:50:01 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:01 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x802
Apr 18 09:50:01 linux kernel: sda: Current: sense key: Medium Error
Apr 18 09:50:01 linux kernel: Additional sense: Unrecovered read error - auto reallocate failed
Apr 18 09:50:01 linux kernel: end_request: I/O error, dev sda, sector 131124415
Apr 18 09:50:02 linux kernel: raid5: read error corrected!!
Apr 18 09:50:02 linux last message repeated 112 times
mdadm -C / 0.90?
Hi Neil, list,

You wrote:
> mdadm -C /dev/md1 --assume-clean /dev/sd{a,b,c,d,e,f}1

Will the above destroy data by overwriting the on-disk v0.9 superblock with a larger v1 superblock?

--assume-clean is not documented in 'mdadm --create --help', by the way - what does it do?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Molle Bestefich wrote:
> Neil Brown wrote:
> > > How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
> > For that, you have to recreate the array.
> Scary. And hairy. How much do I have to bribe you to make this work:
> # mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1
> Not now, that is, but for the sake of future generations? :)

I crossed my fingers and dived into the code. The mdadm code was super well written - thanks, Neil :-) - so conjuring up a patch took less than 5 minutes. Patch attached - does it look acceptable for inclusion in mainline?

With the patch, I can do this:

# mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1
mdadm: forcing event count in /dev/sdb1(1) from 160264 upto 163368
mdadm: clearing FAULTY flag for device 1 in /dev/md1 for /dev/sdb1
mdadm: /dev/md1 has been started with 6 drives.

and have the array started with all 6 devices (as I've asked it to).

--- Assemble.c.ok	2006-04-18 02:47:11.0 +0200
+++ Assemble.c	2006-04-18 01:18:38.0 +0200
@@ -119,6 +119,7 @@
 	struct mdinfo info;
 	char *avail;
 	int nextspare = 0;
+	int devs_on_cmdline = devlist!=NULL;
 
 	vers = md_get_version(mdfd);
 	if (vers <= 0) {
@@ -407,9 +408,10 @@
 			sparecnt++;
 		}
 	}
-	while (force && !enough(info.array.level, info.array.raid_disks,
+	while (force && ((!enough(info.array.level, info.array.raid_disks,
 				info.array.layout,
-				avail, okcnt)) {
+				avail, okcnt)) ||
+			 (devs_on_cmdline && (num_devs > okcnt)))) {
 		/* Choose the newest best drive which is
 		 * not up-to-date, update the superblock
 		 * and add it.
help wanted - 6-disk raid5 borked: _ _ U U U U
A system with 6 disks; it was UUUUUU a moment ago, but after read errors on a file it now looks like this:

/proc/mdstat:
md1 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[6](F) sda1[7](F)
      level 5, 64k chunk, algorithm 2 [6/4] [__UUUU]

uname: linux 2.6.11-gentoo-r4

What's the recommended approach? Compile 2.6.16.5 and mdadm 2.4.1, install both, reboot, then use mdadm assemble --force (after the automatic mount probably fails)?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> You shouldn't need to upgrade kernel

Ok. I had a crazy idea that 2 devices down in a RAID5 was an MD bug. I didn't expect MD to kick that last disk - I would have thought that it would just pass on the read error in that situation. If you've got the time to explain, I'd like to be wiser - why not?

> But yes, use --assemble --force and be aware that there could be data corruption (without knowing the history it is hard to say how likely). An 'fsck' at least would be recommended.

Thanks a tera! (Probably no corruption, since I was strictly reading from the array when the disks were kicked.)
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> It is arguable that for a read error on a degraded raid5, that may not be the best thing to do, but I'm not completely convinced. A read error will mean that a write to the same stripe will have to fail, so at the very least we would want to switch the array read-only.

That would be much nicer for me as a user, because:
* I would know which disks are the freshest (the ones marked U).
* My data wouldn't be abruptly pulled offline - right now I'm getting *weird* errors from the systems on top of the array.
* I wouldn't have to try and guess the correct 'mdadm' command to stop/start the array (including pointing at the right disks).

2 cents ;). Thanks for the explanation.
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> use --assemble --force

# mdadm --assemble --force /dev/md1
mdadm: forcing event count in /dev/sda1(0) from 163362 upto 163368
mdadm: /dev/md1 has been started with 5 drives (out of 6).

Oops, only 5 drives - but I know the data is OK on all 6 drives. I also know that there are bad blocks on more than one drive, so I want MD to recover from the other drives in those cases, which I won't be able to do with only 5 drives. In other words, checking/repairing with only 5 drives will lead to data corruption.

I'll stop and try again, listing all 6 drives by hand:

# mdadm --stop /dev/md1
# mdadm --assemble /dev/md1 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
mdadm: /dev/md1 has been started with 5 drives (out of 6).

Ugh. Didn't work. Bug?

How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> > How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
> For that, you have to recreate the array.

Scary. And hairy. How much do I have to bribe you to make this work:

# mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1

Not now, that is, but for the sake of future generations? :)

> Make sure you get the chunksize, parity algorithm, and order correct, but something like
> mdadm -C /dev/md1 --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1 \
>       /dev/sdd1 /dev/sde1 /dev/sdf1

How do I back up the MD superblocks first, so I can 'dd'-restore them in case the above command totally wrecks the array because I got something wrong? It's an old array, so it has version 00.90.00 superblocks - does that make a difference?

> and then
> echo check > /sys/block/md1/md/sync_action
> and see what
> cat /sys/block/md1/md/mismatch_cnt
> reports at the end. Then maybe 'echo repair >'

Super, thanks as always for your kind help.

(After examining the syslog, I think that there might be a bug, or not - I'll start a new thread about that.)
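For the backup question, a sketch (untested; assumes 0.90 superblocks, which live in the last 64K-aligned 64K block of each member):

    for d in /dev/sd{a,b,c,d,e,f}1; do
        SECT=$(blockdev --getsz $d)              # member size in 512-byte sectors
        OFF=$(( (SECT / 128) * 128 - 128 ))      # 0.90 superblock offset, in sectors
        dd if=$d of=sb-$(basename $d).bak bs=512 skip=$OFF count=128
    done
    # restore one later (recompute the same OFF) with something like:
    #   dd if=sb-sda1.bak of=/dev/sda1 bs=512 seek=$OFF count=128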
[bug?] MD doesn't stop failed array
May I offer the point of view that this is a bug: MD apparently tries to keep a raid5 array up by using 4 out of 6 disks. Here's the event chain, from start to now: == 1.) Array assembled automatically with 6/6 devices. 2.) Read error, MD kicks sdb1. 3.) Read error, MD kicks sda1, doesn't seem to stop array. 4.) ext3 and loop0 devices run amok, probably writes crazy things to disk? Here's the syslog contents corresponding to the above: == Apr 13 16:50:38 linux kernel: md: adding sdf1 ... Apr 13 16:50:38 linux kernel: md: adding sde1 ... Apr 13 16:50:38 linux kernel: md: adding sdd1 ... Apr 13 16:50:38 linux kernel: md: adding sdc1 ... Apr 13 16:50:38 linux kernel: md: adding sdb1 ... Apr 13 16:50:38 linux kernel: md: adding sda1 ... Apr 13 16:50:38 linux kernel: md: created md1 Apr 13 16:50:38 linux kernel: md: bindsda1 Apr 13 16:50:38 linux kernel: md: bindsdb1 Apr 13 16:50:38 linux kernel: md: bindsdc1 Apr 13 16:50:38 linux kernel: md: bindsdd1 Apr 13 16:50:38 linux kernel: md: bindsde1 Apr 13 16:50:38 linux kernel: md: bindsdf1 Apr 13 16:50:38 linux kernel: md: running: sdf1sde1sdd1sdc1sdb1sda1 Apr 13 16:50:38 linux kernel: raid5: device sdf1 operational as raid disk 5 Apr 13 16:50:38 linux kernel: raid5: device sde1 operational as raid disk 4 Apr 13 16:50:38 linux kernel: raid5: device sdd1 operational as raid disk 3 Apr 13 16:50:38 linux kernel: raid5: device sdc1 operational as raid disk 2 Apr 13 16:50:38 linux kernel: raid5: device sdb1 operational as raid disk 1 Apr 13 16:50:38 linux kernel: raid5: device sda1 operational as raid disk 0 Apr 13 16:50:38 linux kernel: raid5: allocated 6290kB for md1 Apr 13 16:50:38 linux kernel: raid5: raid level 5 set md1 active with 6 out of 6 devices, algorithm 2 Apr 13 16:50:38 linux kernel: RAID5 conf printout: Apr 13 16:50:39 linux kernel: --- rd:6 wd:6 fd:0 Apr 13 16:50:39 linux kernel: disk 0, o:1, dev:sda1 Apr 13 16:50:39 linux kernel: disk 1, o:1, dev:sdb1 Apr 13 16:50:39 linux kernel: disk 2, o:1, dev:sdc1 Apr 13 16:50:39 linux kernel: disk 3, o:1, dev:sdd1 Apr 13 16:50:39 linux kernel: disk 4, o:1, dev:sde1 Apr 13 16:50:39 linux kernel: disk 5, o:1, dev:sdf1 [snip irrelevant] Apr 13 16:54:06 linux kernel: ata2: status=0x51 { DriveReady SeekComplete Error } Apr 13 16:54:06 linux kernel: ata2: error=0x04 { DriveStatusError } [11 repetitions of above 2 lines snipped] Apr 13 16:54:06 linux kernel: SCSI error : 1 0 0 0 return code = 0x802 Apr 13 16:54:06 linux kernel: sdb: Current: sense key: Aborted Command Apr 13 16:54:06 linux kernel: Additional sense: No additional sense information Apr 13 16:54:06 linux kernel: end_request: I/O error, dev sdb, sector 119 Apr 13 16:54:06 linux kernel: raid5: Disk failure on sdb1, disabling device. 
Operation continuing on 5 devices
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 1, o:0, dev:sdb1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:06 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:06 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:07 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:07 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Apr 13 16:54:06 linux kernel: ata2: error=0x04 { DriveStatusError }
[11 repetitions of above 2 lines snipped]
Apr 13 16:54:06 linux kernel: SCSI error : <1 0 0 0> return code = 0x802
Apr 13 16:54:06 linux kernel: sdb: Current: sense key: Aborted Command
Apr 13 16:54:06 linux kernel: Additional sense: No additional sense information
Apr 13 16:54:06 linux kernel: end_request: I/O error, dev sdb, sector 119
Apr 13 16:54:06 linux kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 5 devices
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 1, o:0, dev:sdb1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:06 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:06 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:07 linux kernel: disk 4, o:1,
Re: raid 5 corruption
Todd [EMAIL PROTECTED] wrote: The strangest thing happened the other day. I booted my machine and the permissions were all messed up. I couldn't access many files as root which were owned by root. I couldn't run common programs as root or a standard user. Odd, have you found out why? What was the first error you saw? So I restarted and it wouldn't mount my raid drive (raid 5, 5 disks). I tried doing it manually from the livecd, and it's telling me it can't mount with only 2 disks. Is that because the kernel found only 2/5 physical disks, or because MD thinks that they're out-of-date? I tried to force with four drives and it claims there's no superblock for sda3. Try mdadm --assemble --force again, but exclude sda3 and assemble the array using the 4 other drives instead? You might want to run mdadm to query the superblock on each device. You can post the output to this list so others will be able to see which of your drives are considered 'freshest' by MD etc. (A concrete example follows at the end of this message.) There's nothing wrong with my disks. I can mount the boot partition. One doesn't imply the other. And since you don't tell where the boot partition resides, it hardly seems relevant to your RAID devices.. It's fine as far as I can tell. Does anyone know what's going on? Has anyone else experienced this? I have had problems in the past with other machines. One time a Red Hat machine locked up in X. Yeah, I've had X lock up on me quite a lot. I don't know if it was just X or the kernel. Probably the graphics driver. I restarted and it couldn't find the root i-node. It may have been correctable, but I just reinstalled. It seems strange that Windows can crash on me every day and it still starts right back up. (I still have 98.) But Linux seems to have more fragile file systems. Windows' flushing policy is a LOT more sane than Linux's. That's probably why you'll rarely get corrupted filesystems with Windows, and often with Linux. Like you, I've had filesystem corruption after system crashes happen to me with Linux quite a lot, and never (even though it crashes much more often) with Windows. My guess is that the Linux kernel folks are more concerned with a .01% improvement in performance than with your data and that's why the policy is as it is.. But I could easily be wrong, so take it with a grain of salt. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
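Something along these lines would show the superblocks; the member names here are my guess, substitute your real ones:
  for d in /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3; do
      echo "== $d"
      mdadm --examine $d | egrep 'UUID|Events|Update Time|State'
  done
The drive(s) with the highest Events count are the ones MD considers freshest.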
Re: block level vs. file level
Bill Davidsen wrote: Molle Bestefich wrote: it wrote: Ouch. How does hardware raid deal with this? Does it? Hardware RAID controllers deal with this by rounding the size of participant devices down to the nearest GB, on the assumption that no drive manufacturers would have the guts to actually sell eg. a 250 GB drive with less than exactly 250.000.000.000 bytes of space on it. (It would be nice if the various flavors of Linux fdisk had an option to do this. It would be very nice if anaconda had an option to do this.) I guess if you care you specify the size of the partition instead of using it all. I use fdisk usually, cfdisk when installing, both let me set size, fdisk lets me set starting track and even play with the partition table's idea of geometry. What kind of an option did you have in mind? I don't know. Examples good enough? a.) Do not use space beyond highest GB b.) Do not use last cylinder Help texts could be: a.) Helps ensure that you can replace e.g. a 300GB drive from one manufacturer with a 300GB drive from another (sketch of the arithmetic below). b.) Leave an area that Windows might use for disk metadata alone. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
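For what it's worth, the arithmetic behind option (a) is easy to do by hand today - a sketch, device name made up:
  bytes=$(blockdev --getsize64 /dev/sdb)
  gb=$(( bytes / 1000000000 ))           # round down to whole decimal gigabytes
  sectors=$(( gb * 1000000000 / 512 ))
  echo "cap the RAID partition at $sectors sectors"
fdisk/cfdisk would then be given that size instead of 'use it all'.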
Re: block level vs. file level
it wrote: Ouch. How does hardware raid deal with this? Does it? Hardware RAID controllers deal with this by rounding the size of participant devices down to the nearest GB, on the assumption that no drive manufacturers would have the guts to actually sell eg. a 250 GB drive with less than exactly 250.000.000.000 bytes of space on it. (It would be nice if the various flavors of Linux fdisk had an option to do this. It would be very nice if anaconda had an option to do this.) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.15: mdrun, udev -- who creates nodes?
[EMAIL PROTECTED] wrote: Not only that, the raid developers themselves consider autoassembly deprecated. http://article.gmane.org/gmane.linux.kernel/373620 Hmm. My knee-jerk, didn't-stop-to-think-about-it reaction is that this is one of the finest features of linux raid, so why remove it? I *think* that the raid developers may be, for once, choosing words not-so-wisely when talking about deprecating autoassembly. Last time I heard that I choked as well, only to find out later that Neil's notion of what auto-assembly is differed substantially from my own. Isn't there a faq/wiki somewhere where the official opinion on autoassembly deprecation and exactly what that means can go? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Silent Corruption on RAID5
Michael Barnwell wrote: I'm experiencing silent data corruption on my RAID 5 set of four 400GB SATA disks. I have roughly the same hardware:
* AMD Opteron 250
* Silicon Image 3114
* 300 GB Maxtor SATA
Just to add a data point, I've run your test on my RAID 1 (not RAID 5 !) without problems.
localhost ~ # dd bs=1024 count=1k if=/dev/zero of=./10GB.tst
1024+0 records in
1024+0 records out
localhost ~ # od -t x1 ./10GB.tst
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1161
localhost ~ # uname -a
Linux localhost 2.6.12.6-xen #6 SMP Fri Jan 6 06:49:53 CET 2006 x86_64 AMD Opteron(tm) Processor 250 AuthenticAMD GNU/Linux
- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 006 of 7] md: Checkpoint and allow restart of raid5 reshape
NeilBrown wrote: We allow the superblock to record an 'old' and a 'new' geometry, and a position where any conversion is up to. When starting an array we check for an incomplete reshape and restart the reshape process if needed. *Super* cool! - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: blog entry on RAID limitation
Rik Herrin wrote: Wouldn't connecting a UPS + using a stable kernel version remove 90% or so of the RAID-5 write hole problem? There are some RAID systems that you'd rather not have redundant power on. Think encryption. As long as a system is online, it's normal for it to have encryption keys in memory and its disk systems mounted through the decryption system. You wouldn't want someone to be able to steal your server along with the UPS and stuff it in a van with a power inverter :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux RAID Enterprise-Level Capabilities and If It Supports Raid Level Migration and Online Capacity Expansion
Rik Herrin wrote: I was interested in Linux's RAID capabilities and read that mdadm was the tool of choice. We are currently comparing software RAID with hardware RAID. MD is far superior to most of the hardware RAID solutions I've touched. In short, it seems MD is developed with the goal of keeping your data safe, not selling hardware. I've had problems both with MD and with hardware RAID. With hardware RAID, once things go bad, they really go bad. With MD, there's usually a straightforward way to rescue things. And when there's not, Neil's a real nice guy who always stands up to help and fix bugs. I would trust my data with MD over any hardware RAID solution, including professional server RAID solutions from eg. Compaq or IBM. MD is a little more difficult to set up and also lacks in that it doesn't integrate with BIOS level stuff and boot loaders (maybe there's minimal MD RAID 1 support in Lilo, not sure). Depending on your choice of hardware, you might also get more features than MD can currently offer. 1) OCE: Online Capacity Expansion: From the latest version of mdadm (v2.2), it seems that there is support for it with the -G option. How well tested is this? New feature, so obviously not tested very well. Neil said at one point that he was going to release this to the general public when it's stable and when it can recover an interrupted resize process. Sounds like a very reasonable and sane goal to me, I hope that this is still the case. Otherwise, it's easy to work around - you can just create a new RAID array on your new disks / extra disk space and then join it to the end of the old array using MD's linear personality or DM. Never tried it, but it should work just fine (rough sketch at the end of this message). Also, in the Readme / Man page, it mentions: This usage causes mdadm to attempt to reconfigure a running array. This is only possible if the kernel being used supports a particular reconfiguration. How can I know if the kernel I am using supports this reconfiguration? What if I'm compiling the kernel by hand? What options would I have to enable? Just the usual MD stuff I think. You'll probably need a quite new kernel where Neil's bitmap patches have been applied. Hopefully MD will detect whether the kernel is new enough or not, but I haven't tried myself ;-). 2) RAID Level Migration: Does mdadm currently support this feature? I don't think so, but it sounds like RAID5 -> RAID6 is planned. Check back in a year or so ;-). Or choose the RAID level you *really* want to begin with (duh). Since you say 'we', I assume you're part of a very large corporation and thus intend to RAID a whole bunch of disks. Go with RAID6 + a couple of spares for that. If you intend to use really many disks, make multiple arrays. (Not sure whether you can share spares across arrays, but I think you can.) 3) Performance issues: I'm currently thinking of using either RAID 10 or LVM2 with RAID 5 to serve as a RAID server. The machine will be running either an AMD 64 processor or a dual-core AMD 64 processor, so I don't think the CPU will be a bottleneck. In fact, it should easily pass the speed of most hardware based RAID systems. I think there are two issues to cover: * Throughput * Seek times And of course they're not entirely separate issues - throughput will be lower when you're doing random access (seeking) and seek times will be higher when you're pulling lots of data out. I've seen lots of MD tests, but none that covered profiling MD's random access performance.
So I suppose that most hardware solutions will do a lot better than MD here since they have been profiled with this in mind. Throughput-wise, I think MD is probably very good. But I can't back that up with factual data, sorry. 4) Would anyone recommend a certain hotswap enclosure? I would, but can't remember their name, sorry :-) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
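Since I mentioned the join-at-the-end workaround above, here's roughly what I had in mind - untested, device names made up (the linear personality doesn't need a superblock, so --build fits):
  mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdg1 /dev/sdh1 /dev/sdi1
  mdadm --build /dev/md3 --level=linear --raid-devices=2 /dev/md1 /dev/md2
  # then grow the filesystem that lived on /dev/md1 onto /dev/md3
Treat it as a starting point, not a recipe.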
Re: Linux RAID Enterprise-Level Capabilities and If It Supports Raid Level Migration and Online Capacity Expansion
Lajber Zoltan wrote: I have done some simple tests with bonnie++; the sw raid is superior to hw raid, except for big-name storage systems. http://zeus.gau.hu/~lajbi/diskbenchmarks.txt Cool. But what do gep, tip, diskvez, iras, olvasas and atlag mean? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm 2.1: command line option parsing bug?
I found myself typing IMHO after writing up just about each comment. I've dropped that and you'll just have to know that all this is IMHO and not an attack on your ways if they happen to be different ^_^. Neil Brown wrote: I like the suggestion of adding one-line descriptions to this. How about: I'll first comment each command/description (general feedback later): Usage: mdadm --create device options... Create a new array from unused devices. mdadm --assemble device options... reassemble a previously created array I like 'em! mdadm --build device options... create or assemble an array without metadata Oh, that's what it's for, operating without metadata. Good to know. It would help a lot (for me) if the above also had a brief note about why I would ever want to use --build. Otherwise I'll just be confused about when I should use assemble and when I should use build and why it even exists. What's the logic behind having split create/assemble commands but a joined command for creating/assembling when there's no metadata? (I'm sure there is one, I'm just confused as always.) mdadm --manage device options... Make changes to an active array Nice. mdadm --misc options... devices report information or perform miscellaneous tasks. Get rid of the misc section or rename it to something meaningful.. mdadm --monitor options... monitor one or more arrays and report any changes in status Nice. mdadm device options... same as --manage Oh, so that's what it does. A bit confusing for a newbie like me that we're not sticking to _1_ syntax. Since the above is a side note about a convenient syntax hack, I think that (in the event that you find that it is not confusing [which I do :-] and decide to keep it), there should be at least a blank line between the very important --cmd descriptions and this rarely relevant note. General stuff: The note: * use --help in combination with --cmd for further help is missing. I think it would be good to retain it. I find it very unhelpful to have a --misc section. Every time I'm looking for some command, besides from having to guess in which section it's located, I have to check --misc also, since misc can cover anything. Yes, I can see how that is confusing. The difference between 'manage' and 'misc' is that manage will only apply to a single array, while misc can apply to multiple arrays. This is really a distinction that is mostly relevant in the implementation. I should try to hide it in the documentation. That would be good :-). Renaming --misc to something which relates to the fact that this is about multi md device commands and giving it an appropriate description line (I don't think your --misc description was any good, sorry, hehe) would also be *a lot* better.. And btw, why mention device and options for each and every section/command set/command when it's obvious that you need more parameters to do something fruitful? In my opinion it would be better to rid the general --help screen of them and instead specify in brief what functionality the sections are meant to cover. well. it is device options for all but misc, which has options ... devices... Yes, okay, I think that would be explained better by a description line for misc saying that it's about multiple devices (by the way, are we talking multiple MD or multiple component devices?) but I agree that it is probably unnecessary repetition. Also completely useless since there's no mention of what any of the options /are/, no? Just adds to confusion, so better to snip it. 
I keep typing mdadm --help --assemble which unfortunately gets me nowhere. A lot of the times it's because the general help text has scrolled out of view because I've just done an '--xxx --help' for some other command. Maybe it's just me, but a minor improvement could be to allow '--help --xxx'. That's a fair comment. I currently print the help message as soon as I see the --help option. But that could change to: if I see '--help', set a flag, then for every option, print the appropriate help. That would be really swell! Then mdadm --help --size could even give something useful... Yup :-). Hmm. Not bad at all. Would be good for the extreme newbie and confused ppl like me ;). ... snip ... For general help on options use mdadm --help-options ... snip ... Like misc, I think this is an odd section. Let's explore its contents: $ mdadm --help-options Any parameter that does not start with '-' is treated as a device name ... snip ... To me, the above is very interesting information about mdadm's syntax and thus should be presented at the start of the general --help screen. ... snip ... The first such name is often the name of an md device. Subsequent names are
Re: mdadm 2.1: command line option parsing bug?
mdadm's command line arguments seem arcane and cryptic and unintuitive. It's difficult to grasp what combinations will actually do something worthwhile and what combinations will just yield a 'you cannot do that' output. I find myself spending 20 minutes with mdadm --help and experimenting with different commands (which shouldn't be the case when doing RAID stuff) just to do simple things like create an array or make MD assemble the devices that compose an array. That's strange. They seem very regular and intuitive to me (but I find it very hard to be objective). Maybe you have a mental model of what the task involves that differs from how md actually does things. Can you say anything more about the sort of mistakes you find yourself making? That might help either improve the help pages or the error messages (revamping all the diagnostic messages to make them more helpful is slowly climbing to the top of my todo list). I can, but I suck at putting these things down in writing, which is why my initial description was intentionally vague and shallow :-). Now, I'll try anyway. It'll be imprecise and I'll miss some things and show up late with major points etc., but at least I will have tried =). Ok.. Starting with the usage output: $ mdadm Usage: mdadm --help for help Nothing wrong with that. Good, concise stuff. Great. The --help output, however...: $ mdadm --help Usage: mdadm --create device options... mdadm --assemble device options... mdadm --build device options... ... snip ... ... I find confusing. Try and read the words create, assemble and build as an outsider or a child would read them, as regular English words, not with MD development in mind. All three words mean just about the same thing in plain English (something like taking smaller parts and constructing something bigger with them or some such). That confuses me. To make things worse, I now have to type in 3 commands (--help --cmd) and compare the output of each just to get a grasp on what each of the 3 individual commands do. The process is laborious. Well, I'll live, but it's annoying to have to scroll up and down perhaps 5-6 screens of text whilst comparing options just to figure out which command I would like to use. I would much prefer if the 'areas of utility' that mdadm commands are divided in were self explanatory. I'm not saying that these 'areas of utility' (--create etc.) are not grouped in a logical fashion. Just that it's not always easy to comprehend what they cover. Perhaps it's enough just to add a small description per line of 'mdadm --help' output. After 'mdadm --create' it would e.g. say 'creates a new RAID array based on multiple devices.' or so. Just a one-liner. Moving right along: ... snip ... mdadm --manage device options... mdadm --misc options... devices mdadm --monitor options... ... snip ... I find it very unhelpful to have a --misc section. Every time I'm looking for some command, besides from having to guess in which section it's located, I have to check --misc also, since misc can cover anything. ... snip ... mdadm device options... ... snip ... Where does that syntax suddenly come from? Is it a special no-command mode? What does it do? Hmm. Confusing. And btw, why mention device and options for each and every section/command set/command when it's obvious that you need more parameters to do something fruitful? In my opinion it would be better to rid the general --help screen of them and instead specify in brief what functionality the sections are meant to cover. ... snip ...
mdadm is used for building, managing, and monitoring Linux md devices (aka RAID arrays) ... snip ... Probably belongs in the top of the output, but that's a small nit... ... snip ... For detailed help on the above major modes use --help after the mode e.g. mdadm --assemble --help ... snip ... I keep typing mdadm --help --assemble which unfortunately gets me nowhere. A lot of the times it's because the general help text has scrolled out of view because I've just done an '--xxx --help' for some other command. Maybe it's just me, but a minor improvement could be to allow '--help --xxx'. ... snip ... For general help on options use mdadm --help-options ... snip ... Like misc, I think this is an odd section. Let's explore its contents: $ mdadm --help-options Any parameter that does not start with '-' is treated as a device name ... snip ... To me, the above is very interesting information about mdadm's syntax and thus should be presented at the start of the general --help screen. ... snip ... The first such name is often the name of an md device. Subsequent names are often names of component devices. ... snip ... This too. What's with the 'often', btw? It is somewhat more helpful to me if
Re: mdadm 2.1: command line option parsing bug?
Neil Brown wrote: I would like it to take an argument in contexts where --bitmap was meaningful (Create, Assemble, Grow) and not where --brief is meaningful (Examine, Detail). but I don't know if getopt_long will allow the 'short_opt' string to be changed half way through processing... Here's an honest opinion from a regular user. mdadm's command line arguments seem arcane and cryptic and unintuitive. It's difficult to grasp what combinations will actually do something worthwhile and what combinations will just yield a 'you cannot do that' output. I find myself spending 20 minutes with mdadm --help and experimenting with different commands (which shouldn't be the case when doing RAID stuff) just to do simple things like create an array or make MD assemble the devices that compose an array. I know. Not very constructive, but a POV anyway. Maybe I just do not use MD enough, and so I shouldn't complain, because the interface is really not designed for the absolute newbie. If so, then I apologize. I don't have any constructive suggestions, except to say that the way the classic Cisco interface does things works very nicely. A lot of other manufacturers have also started doing things the Cisco way. If you don't have a Cisco router available, you can e.g. use a Windows XP box. Type 'netsh' in a command prompt, then 'help'. Or alternatively 'netsh help'. You get the idea :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: /boot on RAID5 with GRUB
Spencer Tuttle wrote: Is it possible to have /boot on /dev/md_d0p1 in a RAID5 configuration and boot with GRUB? Only if you get yourself a PCI card with a RAID BIOS on it and attach the disks to that. The RAID BIOS hooks interrupt 13 and allows GRUB (or DOS or LILO for that matter) to see the RAID5 array instead of the individual disks. Obviously any RAID card will do, it doesn't matter whether it has a CPU to do the RAID calculations or not (known as fake-RAID and ataraid if it doesn't). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: questions about ext3, raid-5, small files and wasted disk space
On Saturday November 12, Neil Brown wrote: On Saturday November 12, Kyle Wong wrote: I understand that if I store a 224KB file into the RAID5, the file will be divided into 7 parts x 32KB, plus 32KB parity. (Am I correct in this?) Sort of ... if the filesystem happens to lay it out like that. But this isn't a useful way to think about it. The filesystem writes the data in 4K blocks. The raid5 layer worries about how to create the parity block. Well, there IS some optimization to be done here that we're all missing out on, if the filesystem does not take this into account, isn't there? Is it reasonable to assume that Linux filesystems always start the 'data block area' (whatever) at exactly x * (fs block size) kB into the device they're laid on? Doesn't seem *entirely* unreasonable that they'd do that, if not for optimization then just because their authors happened to think that it would be neat code-wise. If the filesystem does that, then an optimization would be to just make sure that the filesystem block size exactly equals the RAID chunk size. Things become slightly harder if you start partitioning your RAID device. FDISK needs to make sure that partitions are on cylinder boundaries, but luckily FDISK is rarely used to partition MD RAID devices - LVM or EVMS is. Both of those systems are technically free to move the partition data area a few kB back or forth within the RAID device so that the partition is aligned on a RAID chunk. Wouldn't that be great? It would of course give you nothing, unless the filesystem also aligns its blocks (does it do that?). (Then there's the fakeraid / ataraid people. They're screwed, as far as optimizations in this area go. Maybe they can go and get someone to make a raid bios that understands MD metadata :-).) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
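On the 'does the filesystem also align its blocks' front: ext2/3 at least can be told about the chunk size at mkfs time. A hedged example for 32 KiB chunks and 4 KiB blocks, so stride = 32/4 = 8 (older e2fsprogs spell the option -R stride=8):
  mke2fs -j -b 4096 -E stride=8 /dev/md1
That spreads the block and inode bitmaps across the member disks rather than piling them on one; it doesn't by itself fix the partition alignment issue above.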
Flappy hotswap disks?
If - a disk is part of a MD RAID 1 array, and - the disk is 'flapping', eg. going online and offline repeatedly in a hotswap system, and - a *write* occurs to the MD array at a time when the disk happens to be offline, will MD handle this correctly? Eg. will it increase the event counters on the other disks /even/ when no reboot or stop-start has been performed, so that when the flappy disk flaps back online, it will be (perhaps partially) resynced? Apologies if it's a dumb question. Hope someone's got a minute to answer it :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
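In the meantime the scenario is easy to rehearse by hand on a throwaway array - a sketch with made-up loop devices standing in for the flappy disk:
  mdadm /dev/md0 --fail /dev/loop1           # disk 'flaps' offline
  mdadm --examine /dev/loop0 | grep Events   # event counters on the survivors move on
  mdadm /dev/md0 --remove /dev/loop1
  mdadm /dev/md0 --add /dev/loop1            # flaps back online, gets resynced
If the Events counts differ, MD knows the returning disk is stale.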
Re: Flappy hotswap disks?
Mario 'BitKoenig' Holbe wrote: Molle Bestefich wrote: Eg. will it increase the event counters on the other disks /even/ when no reboot or stop-start has been performed, so that when the flappy Event counters are increased immediately when an event occurs. A device failure is an event as well as start and stop of a RAID are. Ok. Thanks! Hope it's race-free and all :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Drive fails raid6 array is not self rebuild .
Mr. James W. Laferriere wrote: Is there a documented procedure to follow during creation or after that will get a raid6 array to self rebuild ? MD will rebuild your array automatically, given that it has a spare disk to use. raid5: Disk failure on sde, disabling device. Operation continuing on 35 devices Seems like a raid5, not raid6.. [UU_U] No need to do any rebuilding on the remaining devices, since the data on them are fine. You've lost redundancy however, so you should add a new disk to the array ASAP. With 35 disks, I'd recommend that you at least use raid6 in place of raid5.. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
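For completeness, handing MD that spare is a one-liner (device name made up):
  mdadm /dev/md0 --add /dev/sdX
or reserve one at creation time with --spare-devices=1; either way the rebuild then starts by itself.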
Re: Accelerating Linux software raid
Dan Williams wrote: The first question is whether a solution along these lines would be valued by the community? The effort is non-trivial. I don't represent the community, but I think the idea is great. When will it be finished and where can I buy the hardware? :-) And if you don't mind terribly, could you also add hardware acceleration support to loop-aes now that you're at it? :-) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] proactive raid5 disk replacement for 2.6.11
Pallai Roland wrote: Molle Bestefich wrote: Claas Hilbrecht wrote: Pallai Roland schrieb: this is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), After my experience with a broken raid5 (read the list) I think the partially failed disks feature you describe is really useful. I agree with you that this kind of error is rather common. Horrible idea. Once you have a bad block on one disk, you have definitively lost your data redundancy. That's bad. Hm, I think you don't understand the point, yes, that drive should be replaced as soon as you can, but the good sectors of that drive can be useful if some bad sectors are discovered on another drive during the rebuilding. We must keep that drive in sync to keep those sectors useful; this is what the badblock tolerance is for. Ok, I misunderstood you. Sorry, and thanks for the explanation. It is the common error if you've got a lot of disks and can't do daily media checks because of the IO load. Agreed. What should be done about bad blocks instead of your suggestion is to try and write the data back to the bad block before kicking the disk. If this succeeds, and the data can then be read from the failed block, the disk has automatically reassigned the sector to the spare sector area. You have redundancy again and the bad sector is fixed. If you're having a lot of problems with disks getting kicked because of bad blocks, then you need to diagnose some more to find out what the actual problem is. My best guess would be that either you're using an old version of MD that won't try to write to bad blocks, or the spare area on your disk is full, in which case it should be replaced. You can check the status of spare areas on disks with 'smartctl' or similar. Which version of md tries to rewrite bad blocks in raid5? Haven't followed the discussions closely, but I sure hope that the newest version does. (After all, spare areas are a somewhat old feature in hard drives..) I've a problem with hidden bad blocks (never mind if that's repairable or not); the rewrite can't help, because you don't know they're there until you try to rebuild the array from degraded state to a replaced disk. I want to avoid the rebuilding from degraded state; this is what the 'proactive replacement' feature is for. Got it now. Super. Sounds good ;-). (I hope that you're simply rebuilding to a spare before kicking the drive, not doing something funky like remapping sectors or some such..) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
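Checking the spare-area status mentioned above is quick. A sketch - attribute names vary somewhat between vendors, and SATA disks behind libata may need -d ata on older smartctl versions:
  smartctl -A /dev/sdb | egrep -i 'Reallocated_Sector|Current_Pending'
A growing Reallocated_Sector_Ct means remapping is still working; a non-zero Current_Pending_Sector is the 'hidden bad blocks' case Pallai describes.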
Re: [PATCH] proactive raid5 disk replacement for 2.6.11
Claas Hilbrecht wrote: Pallai Roland schrieb: this is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), After my experience with a broken raid5 (read the list) I think the partially failed disks feature you describe is really useful. I agree with you that this kind of error is rather common. Horrible idea. Once you have a bad block on one disk, you have definitively lost your data redundancy. That's bad. What should be done about bad blocks instead of your suggestion is to try and write the data back to the bad block before kicking the disk. If this succeeds, and the data can then be read from the failed block, the disk has automatically reassigned the sector to the spare sector area. You have redundancy again and the bad sector is fixed. If you're having a lot of problems with disks getting kicked because of bad blocks, then you need to diagnose some more to find out what the actual problem is. My best guess would be that either you're using an old version of MD that won't try to write to bad blocks, or the spare area on your disk is full, in which case it should be replaced. You can check the status of spare areas on disks with 'smartctl' or similar. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID on Windows using Embedded Linux?
Ewan Grantham wrote: I know, this is borderline, but figure this is the group of folks who will know. I do a lot of audio and video stuff for myself and my family. I also have a rather unusual networking setup. Long story short, when I try to run Linux as my primary OS, I usually end up reinstalling Windows after a couple weeks because there are still holes in what I can do. That isn't the fault of Linux as much as folks who write device drivers or have video codecs that require DirectShow. However, the big thing I really miss from Linux that keeps me trying to find a way to convert is the support for Software RAID 5. It occurred to me yesterday that perhaps the trick would be to use QEMU to run Knoppix or Damn Small Linux under Windows, and then set up a RAID 5 array under one of those. Not to mention then having access to Linux for some other fun stuff. I'm not sure if that's even possible, and if it is, how much trouble I would have moving files around to and from the RAID array if it's set up that way. So I'm wondering if anyone on the list has ever tried this? Not too much trouble, you should be able to just set up Samba in the Knoppix/DSL and access your files through QEMU's network emulation. It will be damn slow, however :-). Perhaps buy a NAS device like Linksys NSLU2? It's cheap and has low power consumption - but you'll need to hack it to add disks (via USB). http://peter.korsgaard.com/articles/debian-nslu2.php or http://www.tomsnetworking.com/Sections-article85-page3.php explains how to get into the Linux guts of the box. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID on Windows using Embedded Linux?
On 7/24/05, Ewan Grantham [EMAIL PROTECTED] wrote: On 7/24/05, Molle Bestefich [EMAIL PROTECTED] wrote: Ewan Grantham wrote: I know, this is borderline, but figure this is the group of folks who will know. I do a lot of audio and video stuff for myself and my family. I also have a rather unusual networking setup. Long story short, when I try to run Linux as my primary OS, I usually end up reinstalling Windows after a couple weeks because there are still holes in what I can do. That isn't the fault of Linux as much as folks who write device drivers or have video codecs that require DirectShow. However, the big thing I really miss from Linux that keeps me trying to find a way to convert is the support for Software RAID 5. It occurred to me yesterday that perhaps the trick would be to use QEMU to run Knoppix or Damn Small Linux under Windows, and then set up a RAID 5 array under one of those. Not to mention then having access to Linux for some other fun stuff. I'm not sure if that's even possible, and if it is, how much trouble I would have moving files around to and from the RAID array if it's set up that way. So I'm wondering if anyone on the list has ever tried this? Not too much trouble, you should be able to just set up Samba in the Knoppix/DSL and access your files through QEMU's network emulation. It will be damn slow, however :-). Perhaps buy a NAS device like Linksys NSLU2? It's cheap and has low power consumption - but you'll need to hack it to add disks (via USB). http://peter.korsgaard.com/articles/debian-nslu2.php or http://www.tomsnetworking.com/Sections-article85-page3.php explains how to get into the Linux guts of the box. Interesting idea. However, I note that it is rather slow as well - since you're then accessing disks over 100 Mbps rather than at full USB2. And it appears you can only attach 2 USB disks to it. You should be able to hack it to accept more than 2 disks.. You're right about the 100 mbps network interface.. Which makes me wonder just how slow you mean when you say that the QEMU version would be slow? Slow compared to native access? Slow compared to the NSLU2 solution discussed above? Or slow compared to a snail being attacked by one of those Hawaiian Caterpillars? :-) Slow as in a passing tortoise would take that system by surprise =).. Based on QEMU being a CPU emulator, not a virtualization system, thus penalizing performance by a factor of 10 or so. Actually, just checked their web site, and they now have a non-open-source alternative that uses virtualization, so if you're going to use that it just might work ok. If you're willing to live with the fact that MD will at times suck up some CPU. Looking forward to hearing whether you can live with the solution or not (if you go that way) :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 assembly requires manual mdadm --run
Neil Brown wrote: On Friday July 8, [EMAIL PROTECTED] wrote: So a clean RAID1 with a disk missing should start without --run, just like a clean RAID5 with a disk missing? Note that with /dev/loop3 not functioning, mdadm --assemble --scan will still work. Super! That was exactly the point of the test. You're correct, when I try again with a DEVICE line and --assemble --scan, everything works perfectly. (Not sure why Mitchell Laks saw something else happen.) Thanks a lot for taking the time to answer questions, the information is much appreciated! - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted
On Friday July 8, [EMAIL PROTECTED] wrote: On 8 Jul 2005, Molle Bestefich wrote: On 8 Jul 2005, Melinda Taylor wrote: We have a computer based at the South Pole which has a degraded raid 5 array across 4 disks. One of the 4 HDD's mechanically failed but we have brought the majority of the system back online except for the raid5 array. I am pretty sure that data on the remaining 3 partitions that made up the raid5 array is intact - just confused. The reason I know this is that just before we took the system down, the raid5 array (mounted as /home) was still readable and writable even though /proc/mdstat said: On 7/8/05, Daniel Pittman wrote: What you want to do is start the array as degraded, using *only* the devices that were part of the disk set. Substitute 'missing' for the last device if needed but, IIRC, you should be able to say just: ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5 Don't forget to fsck the filesystem thoroughly at this point. :) At this point, before adding the new disk, I'd suggest making *very* sure that the event counters match on the three existing disks. Because if they don't, MD will add the new disk with an event counter matching the freshest disk in the array. That will cause it to start synchronizing onto one of the good disks instead of onto the newly added disk. Happened to me once, gah. Ack! I didn't know that. If the event counters don't match up, what can you do to correct the problem? Daniel Pittman wrote: Ack! I didn't know that. If the event counters don't match up, what can you do to correct the problem? In the 2.4 days, I think I used to plug cables in and out of the disks, rebooting the system again and again until the counters were aligned. Neil Brown wrote: The --assemble --force should result in all the event counters of the named drives being the same. Then it should be perfectly safe to add the new drive. Sounds like a better option! I cannot quite imagine a situation as described by Molle. Fair enough, the situation just struck me as something I had seen before, and it doesn't hurt to be sure.. If it was at all reproducible I'd love to hear more details. I'd rather not reproduce it :-). It's happened a couple of times on a production system.. Once back when it was running 2.4 and an old version of MD, and once while I was in the process of upgrading the box to 2.6 (so it might have been while it was booted into 2.4.. not sure). The box used to have two disks failing from time to time, one due to a semi-bad disk and one due to a flaky SATA cable. That's about all I can remember off the top of my head. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
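For the record, eyeballing the counters before the --add is just (0.90 metadata assumed):
  mdadm --examine /dev/hd[abd]5 | grep Events
Three identical Events values is what you want to see.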
RAID1 assembly requires manual mdadm --run
Mitchell Laks wrote: However I think that raids should boot as long as they are intact, as a matter of policy. Otherwise we lose our ability to rely upon them for remote servers... It does seem wrong that a RAID 5 starts OK with a disk missing, but a RAID 1 fails. Perhaps MD is unable to tell which disk in the RAID 1 is the freshest and therefore refuses to assemble any RAID 1's with disks missing? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MD bug or me being stupid?
Hmm, I think the information in /var/log/messages is actually interesting for MD debugging. Seems there was a bad sector somewhere in the middle of all this, which might have triggered something? Attached (gzipped - sorry for the inconvenience, but it's 5 kB vs. 250 kB!) I've cut out a lot of irrelevant cruft (kernel messages) and added comments about what I did when. linux-22apr-messages.gz Description: GNU Zip compressed data
Re: Bug in MDADM or just crappy computer?
Phantazm wrote: And the kernel log is filled up with this.
Feb 20 08:43:13 [kernel] md: md0: sync done.
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Feb 20 08:43:13 [kernel] md: md0: sync done.
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
No sync done here? What is it doing, multiple syncs in parallel?
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Again (below line)?
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Feb 20 08:43:13 [kernel] md: using maximum available idle IO bandwith (but not more than 15000 KB/sec) for reconstruction.
Feb 20 08:43:13 [kernel] md: using 128k window, over a total of 199141632 blocks.
Feb 20 08:43:13 [kernel] md: md0: sync done.
There we go, a sync done (above)
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
But then again (below)..
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again?!
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
[snip]
Odd! Ain't got a single clue what it could be. I'm seeing something that looks like the same, so let me know if you found out what happened. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions about software RAID
David Greaves wrote: Guy wrote: Well, I agree with KISS, but from the operator's point of view! I want... [snip] Fair enough. [snip] should the LED control code be built into mdadm? Obviously not. But currently, a LED control app would have to pull information from /proc/mdstat, right? mdstat is a crappy place to derive any state from. It currently seems to have a dual purpose: - being a simple textual representation of RAID state for the user. - providing MD state information for userspace apps. That's not good. There seems to be an obvious lack of a properly thought out interface to notify userspace applications of MD events (disk failed -- go light a LED, etc). Please correct me if I'm on the wrong track, in which case the rest of this posting will be bogus. Maybe there are IOCTLs or such that I'm not aware of. I'm not sure how a proper interface could be done (so I'm basically just blabbering). ACPI has some sort of event system, but the MD one would need to be more flexible. For instance userspace apps have to pick up on MD events such as disk failures, even if the userspace app happens to not be running at the exact moment that the event occurs (due to system restart, daemon restart or what not). So the system that ACPI uses is probably unsuited. Perhaps a simple logfile would do. Its focus should be machine-readability (vs. human readability for mdstat). A userspace app could follow MD's state from the beginning (bootup, no devices discovered, logfile cleared), through device discovery and RAID assembly and to failing devices. By adding up the information in all the log lines, a userspace app could derive the current state of MD (which disks are dead..). Just a thought. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions about software RAID
Hervé Eychenne wrote: Molle Bestefich wrote: There seems to be an obvious lack of a properly thought out interface to notify userspace applications of MD events (disk failed -- go light a LED, etc). I'm not sure how a proper interface could be done (so I'm basically just blabbering). ACPI has some sort of event system, but the MD one would need to be more flexible. For instance userspace apps have to pick up on MD events such as disk failures, even if the userspace app happens to not be running at the exact moment that the event occurs (due to system restart, daemon restart or what not). So the system that ACPI uses is probably unsuited. Perhaps a simple logfile would do. Its focus should be machine-readability (vs. human readability for mdstat). A userspace app could follow MD's state from the beginning (bootup, no devices discovered, logfile cleared), through device discovery and RAID assembly and to failing devices. By adding up the information in all the log lines, a userspace app could derive the current state of MD (which disks are dead..). No, as it requires active polling. No it doesn't. Just tail -f the logfile (or /proc/ or /sys/ file), and your app will receive due notice exactly when something happens. Or use inotify. I think something like a netlink device would be more accurate, but I'm not a kernel guru. No idea how that works :-). If by accurate you mean you'll get a faster reaction, that's wrong as per above explanation. And I'll try to explain why a logfile in other respects is actually _more_ accurate. I can see why a logfile _seems_ wrong at first sight. But the idea that it allows you to (*also*!) see historic MD events instead of just the current status this instant seems compelling. - You can be sure that you haven't missed or lost any MD events. If your monitoring app crashes or restarts, just look in the log. (If you're unsure whether you've notified the admin on some event or not; I'm sure MD could log the disk's event counters. The monitoring app could keep its own 'how far have I gotten' event counter [on disk], so the app knows its own status.) - If the log resides in eg. /proc/whatever, you can pipe it to an actual file. It could be pretty useful for debugging MD (attach your MD log, send a mail asking what happened, and it'll be clear to the super-md-dude at first sight). - Seems more convincing to enterprise customers that you can actually see MD's every move in the log. Makes it seem much more robust and reliable. - Really useful for debugging the monitoring app - Probably other advantages. Haven't really thought it through that well :-). The problem, as I see it, is if it's worth the implementation trouble (is it any harder than to implement a netlink / what not interface? No idea!) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
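To illustrate what I mean by a consumer of such a log - pure sketch, the path is invented since no such file exists today:
  tail -F /var/log/md-events | while read line; do
      case "$line" in
          *fail*) logger -t md-watch "$line" ;;   # light a LED, page the admin, ...
      esac
  done
If the watcher dies and restarts, the history is still in the file - which is the whole point.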
Re: waiting for recovery to complete
David Greaves wrote: Does everyone really type cat /proc/mdstat from time to time?? How clumsy... And yes, I do :) You're not alone.. *gah...* - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
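For the equally lazy, this at least saves the retyping:
  watch -n 10 cat /proc/mdstat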
Re: interesting failure scenario
Michael Tokarev wrote: I just came across an interesting situation, here's the scenario. [snip] Now we have an interesting situation. Both superblocks in d1 and d2 are identical, event counts are the same, both are clean. Things which are different: utime - on d1 it is more recent (provided we haven't touched the system clock of course); on d1, d2 is marked as faulty; on d2, d1 is marked as faulty. Neither of the conditions is checked by mdadm. So, mdadm just starts a clean RAID1 array composed of two drives with different data on them. And no one notices this fact (fsck which is reading from one disk goes ok), until some time later when some app reports data corruption (reading from another disk); you go check what's going on, notice there's no data corruption (reading from 1st disk), suspect memory and.. it's quite a long list of possible bad stuff which can go on here... ;) The above scenario is just a theory, but a theory with some quite non-null probability. Instead of hotplugging the disks, one can do a reboot having flaky ide/scsi cables or whatnot, so that disks will be detected on/off randomly... Probably it is a good idea to test utime too, in addition to event counters, in mdadm's Assemble.c (as the comments say but the code disagrees). Humn, please don't. I rely on MD assembling arrays if their event counters match but the utimes don't all the time. Happens quite often that a controller fails or something like that and you accidentally lose 2 disks in a raid5. I still want to be able to force the array to be assembled in these cases. I'm still on 2.4 btw, don't know if there's a better way to do it in 2.6 than manipulating the event counters. (Thinking about it, it would be perfect if the array would instantly go into read-only mode whenever it is degraded to a non-redundant state. That way there's a higher chance of assembling a working array afterwards?) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
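The instant-read-only idea can almost be lashed together from userspace today. A sketch - the handler path is made up, and note that mdadm -o will be refused while a filesystem on the array is mounted read-write:
  mdadm --monitor --scan --program=/usr/local/sbin/md-degraded &
with /usr/local/sbin/md-degraded being roughly:
  #!/bin/sh
  # args from mdadm --monitor: event, md device, [component device]
  [ "$1" = "Fail" ] && mdadm --readonly "$2"
In-kernel support would obviously be cleaner (and race-free).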
Re: AW: RAID1 and data safety?
Does this sound reasonable? Does to me. Great example! Thanks for painting the pretty picture :-). Seeing as you're clearly the superior thinker, I'll address your brain instead of wasting wattage on my own. Let's say that MD had the feature to read from both disks in a mirror and perform a comparison on read. Let's say that I had that feature turned on for 2 mirror arrays (4 disks). I want to get a bit of performance back though, so I stripe the two mirrored arrays. Do you see any problem in this scenario? Are we back to corruption could happen then or are we still OK? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
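For concreteness, the layout I mean, with made-up device names:
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
The hypothetical compare-on-read would run inside md0 and md1, with md2 just striping across them.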
Re: [PATCH md ] md: allow degraded raid1 array to resync after an unclean shutdown.
The following is (I think) appropriate for 2.4.30. The bug it fixes can result in data corruption in a fairly unusual circumstance (having a 3 drive raid1 array running in degraded mode, and suffering a system crash). What's unusual? Having a 3 drive raid1 array? It's not unusual for a system to crash after a RAID array gets sent to degraded mode. Happens a lot on a system I administer. Probably caused by a linux-si3112-ide bug which first results in read errors, then (after md has been told to resync) results in a complete system crash... Another topic: Just noticed MD usage in a screenshot: http://linuxdevices.com/files/misc/ravehd_screenshot.png From this article: http://linuxdevices.com/news/NS8217660071.html Just in case anybody cares :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 and data safety?
Neil Brown wrote: Is there any way to tell MD to do verify-on-write and read-from-all-disks on a RAID1 array? No. I would have thought that modern disk drives did some sort of verify-on-write, else how would they detect write errors, and they are certainly in the best place to do verify-on-write. Really? My guess was that they wouldn't, because it would lead to lower performance. And that's why read errors crop up at read time. Doing it at the md level would be problematic as you would have to ensure that you really were reading from the media and not from some cache somewhere in the data path. I doubt it would be a mechanism that would actually increase confidence in the safety of the data. Hmm. Could hack it by reading / writing blocks larger than the cache. Ugly. Imagine a filesystem that could access multiple devices, and where, when it kept index information, it didn't just keep one block address, but rather kept two block addresses, each on different devices, and a strong checksum of the data block. This would allow much the same robustness as read-from-all-drives and much lower overhead. As in, if the checksum fails, try loading the data blocks [again] from the other device? Not sure why a checksum of X data blocks should be cheaper performance-wise than a comparison between X data blocks, but I can see the point in that you only have to load the data once and check the checksum. Not quite the same security, but almost. In summary: - you cannot do it now. - I don't think md is at the right level to solve these sorts of problems. I think a filesystem could do it much better. (I'm working on a filesystem slowly...) - read-from-all-disks might get implemented one day. verify-on-write is much less likely. Apologies if the answer is in the docs. It isn't. But it is in the list archives now Thanks! :-) (Guess I'll drop the idea for the time being...) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RAID1 and data safety?
Just wondering: is there any way to tell MD to do verify-on-write and read-from-all-disks on a RAID1 array?

I was thinking of setting up a couple of RAID1s with maximum data safety in mind. I'd like a verify after each write to a disk, plus I'd like reads to go to all disks with a data comparison whenever something is read. I'd then run a RAID0 over the RAID1 arrays to regain some of the speed lost to all the extra checking.

Just wondering if it can be done :-). Apologies if the answer is in the docs.
Re: Spare disk could not sleep / standby
Tobias wrote:
> [...] I just found your mail on this list, where I have been lurking for some weeks now to get acquainted with RAID, but I fear my mail would be almost OT there:

Think so? It's about RAID on Linux, isn't it? I'm gonna CC the list anyway, hope that's okay :-).

I was just curious about the workings of MD in 2.6, since it sounded a bit like it wasn't possible to put a RAID array to sleep. I'm about to upgrade a server to 2.6 which needs to spin down when idle.

> Which is exactly what I am planning to do at my home - currently, I have [...] Thus my question: Would you have a link to info on the net concerning safely powering down an unused/idle RAID?

No, but I can tell you what I did. I stuffed a bunch of cheap SATA disks and crappy controllers into an old system (and replaced the power supply with one that has enough power on the 12V rail). It's running 2.4, and since the disks show up as IDE disks, I just call 'hdparm -Swhatever' in rc.local, which tells the disks to go into standby whenever they've been idle for 10 minutes. Works like a charm so far; it's been running for a couple of years.

There do not seem to be any issues with MD and timing caused by the disks taking 5 seconds or so to spin up. MD happily waits for them, and no corruption or wrong behaviour has stemmed from putting the disks in standby.

There have been a couple of annoyances, though. One is that MD reads from the disks sequentially, thus spinning up the disks one by one; the more disks you have, the longer you wait for the entire array to come up :-/. It would have been beautiful if MD issued those requests in parallel.

Another is that you need to keep your root partition outside the array. The reason is that some fancy feature in your favorite distro is guaranteed to periodically write something to disk, which will make the array spin up constantly. (Incidentally, this also makes using Linux as a desktop system a PITA, since the disks are noisy as hell if you leave them spinning.) I'm currently using two old disks in RAID1 for the root filesystem, but I'm thinking there's probably a better solution. Perhaps the root filesystem can be shifted to a ramdisk during startup. Or you could boot from a custom-made CD, which would also be extremely handy as a rescue disk.
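For reference, the rc.local bit is just a couple of lines along these lines, with '-S120' being one concrete choice for the 'whatever' above (standby timeout values 1-240 mean that many 5-second units, so 120 gives the 10 minutes I mentioned; device names are examples):

    # Spin each array member down after 10 minutes of idle time:
    hdparm -S120 /dev/hda
    hdparm -S120 /dev/hdb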
Re: strange drive behaviour.
Max Waterman wrote:
> Can I just make it a slave device? How will that affect performance?

AFAIR (CMIIW):
- The standard does not allow a slave without a master.
- The master has a role to play in that it coordinates something (commands, perhaps?) between the slave drive and the controller.

But on the other hand, I've seen ATAPI cdrom drives working in slave-only configurations for years. Hm. It shouldn't cause performance degradation, but it's a quirky setup which you should probably trust a bit less than a master-only setup.

If it's not the CABLE SELECT thing, it could be that the firmware on the misbehaving drive is different from the firmware on the other drives. Check versions?
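Checking the firmware revisions is easy enough with hdparm, something like this (device names are examples):

    # The FwRev field in the identify info is the drive's firmware revision:
    hdparm -i /dev/hda | grep -i fwrev
    hdparm -i /dev/hdb | grep -i fwrev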
Re: kernel panic??
John McMonagle wrote:
> All panics seem to be associated with accessing a bad spot on sdb. It seems really strange that one can get a panic from a drive problem.

<sarcasm>Wow, yeah, never seen that happen with Linux before!</sarcasm>

Just for the fun of it, try digging up a disk which has a bad spot somewhere, preferably on track 0; otherwise point an extended partition table at the bad spot. Then try to get your Linux box to even boot with the disk in the system :-D.

God, Linux is so stable.. Bwahaha...
Re: kernel panic??
Molle Bestefich wrote:
> <sarcasm>Wow, yeah, never seen that happen with Linux before!</sarcasm>

Wait a minute, that wasn't a very productive comment. Never mind; I'm probably just ridden with faulty hardware.
Re: Spare disk could not sleep / standby
Neil Brown wrote:
> It is writes, but don't be scared. It is just superblock updates. In 2.6, the superblock is marked 'clean' whenever there is a period of about 20ms of no write activity. This increases the chance that a resync won't be needed after a crash. Unfortunately, the superblocks on the spares need to be updated too.

Ack. One of the cool things a Linux md array can do that others can't is, imho, that the disks can spin down when inactive. Granted, it's mostly for home users who want their desktop RAID to be quiet when it's not in use, and their basement multi-terabyte facility to use a minimum of power when idling, but anyway:

Is there any particular reason to keep updating the superblocks every 20 msecs when they're already marked clean?
Re: Spare disk could not sleep / standby
Neil Brown wrote:
> > Is my perception of the situation correct?
>
> No. Writing the superblock does not cause the array to be marked active. If the array is idle, the individual drives will be idle.

Ok, thank you for the clarification. It seems like a design flaw to me, but then again, I'm biased towards hating this behaviour since I really like being able to put inactive RAIDs to sleep..

> Hmmm... maybe I misunderstood your problem. I thought you were just talking about a spare not being idle when you thought it should be. Are you saying that your whole array is idle, but you are still seeing writes? That would have to be something non-md-specific, I think.

No, the confusion is my bad. That was the original problem posted by Peter Evertz, which you provided a workaround for. _I_ was just curious about the workings of MD in 2.6, since it sounded a bit like it wasn't possible to put a RAID array to sleep. I'm about to upgrade a server to 2.6 which needs to spin down when idle. Got a bit worried for a moment there =).

Thanks again.
Re: Joys of spare disks!
Guy [EMAIL PROTECTED] wrote:

I generally agree with you, so I'm just gonna cite / reply to the points where we don't :-).

> This sounded like Neil's current plan. But if I understand the plan, the drive would be kicked out of the array.

Yeah, sounds bad. Although the array should at least be marked degraded in mdstat, since there's basically no redundancy until the failed blocks have been reassigned somehow.

> And 1000 bad blocks! I have never had 2 on the same disk at the same time. AFAIK. I would agree that 1000 would put a strain on the system!

Well, it happened to me on a Windows system, so I don't think it's far-fetched. This was a desktop system with the case open, so it got bounced about a lot. Every time the disk reached one of the faulty areas, it recalibrated the head and then moved it back out to try the read again, retrying the operation 5 times before giving up. While this was going on, Windows was frozen. It took at least 3 seconds each time I hit a bad area, and I think even more.

If MD could keep reading from a disk while a similar scenario unfolded, and just mark the bad blocks for rewriting in some bad-block-rewrite bitmap or whatever, a system hang could be avoided. Trying to rewrite every failed sector sequentially in the same code path that reads the data would incur a system hang. That's what I tried to say originally, though I probably didn't do a good job (I know little of Linux md; guess it shows =)). Of course, the disks would, in the case of IDE, probably have to _not_ be in master/slave configurations, since the disk with failing blocks could perhaps hog the bus. Then again, I know as little about ATA/IDE as I do about Linux MD, so I'm basically just guessing here ;-).

> Sometime in the past I have said there should be a threshold on the number of bad blocks allowed. Once the threshold is reached, the disk should be assumed bad, or at least failing, and should be replaced.

Hm. Why? If a rewrite of the block succeeds and a subsequent read returns the correct data, the block has been fixed. I can see your point for old disks, where it might be a magnetic problem causing the sector to fail, but on a modern disk the sector has probably been relocated to the spare area. I think the disk should simply be failed when a rewrite-and-verify cycle still fails. The threshold suggestion adds complexity and user configurability (error-prone) in an area where it's not really needed, doesn't it?

Another note: I'd like to see MD support a user-specifiable bad block relocation area, just like modern disks have. It could use this when a disk's spare area fills up. I even thought up a use case at one time that wasn't insane, something like "my disk is really beginning to show a lot of failures now, but I think I'll keep it running a bit longer", but I can't quite remember what it was.

Does anyone know how many spare blocks are on a disk? It probably varies? I.e. crappy disks probably have a much too small area ;-). In that case it would be very cute if MD had an option to specify its own relocation area (and perhaps even a recommendation for the user on how to size it for specific hard disks). But OTOH, it sucks to implement features in MD that would be much easier to solve in the disks by just expanding the spare area (when present).

> My worst disk has 28 relocated bad blocks.

Doesn't sound bad. Isn't there a SMART value that shows how much of the spare area is used (0-255)?
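There is, more or less: smartmontools can dump the attributes, e.g. like this (device name is an example, and how trustworthy the numbers are depends on the drive):

    # Attribute 5 (Reallocated_Sector_Ct) is the number of remapped sectors;
    # the normalized value counts down towards the failure threshold as the
    # spare area gets consumed:
    smartctl -A /dev/hda | grep -i reallocated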
Re: Joys of spare disks!
Robin Bowes wrote:
> I envisage something like:
> - md attempts a read
> - one disk/partition fails with a bad block
> - md re-calculates the correct data from the other disks
> - md writes the correct data to the bad disk - the disk will re-locate the bad block

Probably not quite that simple, since sometimes multiple blocks will go bad at once, and you wouldn't want the entire system to come to a screeching halt whenever that happens. A more consistent and risk-free way of doing it would probably be to run the above partial resync in a background thread or so?..
Re: 2TB ?
No email [EMAIL PROTECTED] wrote:
> Forgive me, as this is probably a silly question and one that has been answered many times. I have tried to search for the answers but have ended up more confused than when I started, so I thought maybe I could ask the community to put me out of my misery. Is there a version of MD that can create larger than 2TB raid sets?

I have a couple of terabyte software RAID 1+0 arrays under Linux. No size problems with MD as of yet.

But the filesystem is a different affair, and I think this is where you should watch out. Linux filesystems seem to stink real bad when they span multiple terabytes, at least in my personal experience. I've tried both ext3 and reiserfs; even simple operations such as deleting files suddenly take on the order of 10-20 minutes. I haven't got a ready explanation for why ext3 and reiser can't handle TB sizes, but I'd definitely advise you against a multi-TB setup using Linux (at least until you find someone who has a working setup..)
XFS or JFS? (Was: 2TB ?)
Carlos Knowlton wrote:
> > Linux filesystems seem to stink real bad when they span multiple terabytes, at least in my personal experience. I've tried both ext3 and reiserfs; even simple operations such as deleting files suddenly take on the order of 10-20 minutes.
>
> I'm running some 3TB software arrays (12 * 250GB RAID5) with no trouble. I've opted for XFS over ext3 or reiserfs, and I see no trouble in accessing or deleting files.

Is there anybody out there with a qualified opinion on what is best suited for TB arrays, XFS or JFS?

> not a problem.

Well, as far as software RAID goes, anyway. I wish it handled trivial media errors more gracefully (i.e., without dropping disks).

> You should always back-up your data.

Second that. MD is not the brightest thing around. I particularly dislike the game where you have a failed disk, accidentally yank the cable of another disk, and MD increases the usage counter on the remaining (unusable) disks in the array. Plugging disks in and out while rebooting Linux, to see if you can get the counters to match and MD to assemble the array again, does give that nice adrenaline surge, but I still prefer the more relaxing desktop games that come with Linux.
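PS: in case it saves the original poster a search, putting XFS on an array is about a one-liner, roughly like this (device name and mount point are made up for the example):

    # Create and mount an XFS filesystem on the array device:
    mkfs.xfs /dev/md2
    mount -t xfs /dev/md2 /mnt/array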