Re: How many drives are bad?

2008-02-20 Thread Neil Brown
On Tuesday February 19, [EMAIL PROTECTED] wrote:
 So I had my first failure today, when I got a report that one drive
 (/dev/sdam) failed. I've attached the output of mdadm --detail. It
 appears that two drives are listed as removed, but the array is
 still functioning. What does this mean? How many drives actually
 failed?

The array is configured for 8 devices, but only 6 are active.  So you
have lost data.
Of the two missing devices, one is still in the array and is marked as
faulty.  One is simply not present at all.
Hence "Failed Devices : 1", i.e. there is one failed device in the
array.

It looks like you have been running a degraded array for a while
(maybe not a long while), and another device has now failed.

mdadm --monitor

will send you mail if you have a degraded array.
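
For example, something like the following (the mail address and delay are
only placeholders; the options themselves are standard monitor-mode options):

   mdadm --monitor --scan --daemonise --mail=root --delay=1800

watches every array in /proc/mdstat and sends mail when a device fails or
an array is found to be degraded.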

NeilBrown

 
 This is all a test system, so I can dink around as much as necessary.
 Thanks for any advice!
 
 Norman Elton
 
 == OUTPUT OF MDADM =
 
 Version : 00.90.03
   Creation Time : Fri Jan 18 13:17:33 2008
  Raid Level : raid5
  Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
 Device Size : 976759936 (931.51 GiB 1000.20 GB)
Raid Devices : 8
   Total Devices : 7
 Preferred Minor : 4
 Persistence : Superblock is persistent
 
 Update Time : Mon Feb 18 11:49:13 2008
   State : clean, degraded
  Active Devices : 6
 Working Devices : 6
  Failed Devices : 1
   Spare Devices : 0
 
  Layout : left-symmetric
  Chunk Size : 64K
 
UUID : b16bdcaf:a20192fb:39c74cb8:e5e60b20
  Events : 0.110
 
 Number   Major   Minor   RaidDevice State
    0      66        1        0      active sync   /dev/sdag1
    1      66       17        1      active sync   /dev/sdah1
    2      66       33        2      active sync   /dev/sdai1
    3      66       49        3      active sync   /dev/sdaj1
    4      66       65        4      active sync   /dev/sdak1
    5       0        0        5      removed
    6       0        0        6      removed
    7      66      113        7      active sync   /dev/sdan1

    8      66       97        -      faulty spare   /dev/sdam1


Re: suns raid-z / zfs

2008-02-18 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
 On Mon, Feb 18, 2008 at 03:07:44PM +1100, Neil Brown wrote:
  On Sunday February 17, [EMAIL PROTECTED] wrote:
   Hi
   
  
   It seems like a good way to avoid the performance problems of raid-5
   /raid-6
  
  I think there are better ways.
 
 Interesting! What do you have in mind?

A Log Structured Filesystem always does large contiguous writes.
Aligning these to the raid5 stripes wouldn't be too hard and then you
would never have to do any pre-reading.
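
As a rough worked example (the numbers are only illustrative): on a 4-drive
raid5 with a 64KB chunk, a full stripe holds 3 x 64KB = 192KB of data plus
64KB of parity.  A filesystem that only issues aligned 192KB (or larger)
writes lets md compute the parity entirely from the data being written, so
no old data or old parity ever has to be read first.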

 
 and what are the problems with zfs?

Recovery after a failed drive would not be an easy operation, and I
cannot imagine it being even close to the raw speed of the device.

 
   
   But does it stripe? One could think that rewriting stripes
   other places would damage the striping effects.
  
  I'm not sure what you mean exactly.  But I suspect your concerns here
  are unjustified.
 
 More precisely. I understand that zfs always write the data anew.
 That would mean at other blocks on the partitions, for the logical blocks
 of the file in question. So the blocks on the partitions will not be
 adjacent. And striping will not be possible, generally.

The important part of striping is that a write is spread out over
multiple devices, isn't it?

If ZFS can choose where to put each block that it writes, it can
easily choose to write a series of blocks to a collection of different
devices, thus getting the major benefit of striping.


NeilBrown


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
 On Sun, 17 Feb 2008 14:31:22 +0100
 Janek Kozicki [EMAIL PROTECTED] wrote:
 
  oh, right - Sevrin Robstad has a good idea to solve your problem -
  create raid6 with one missing member. And add this member, when you
  have it, next year or such.
  
 
 I thought I read that would involve a huge performance hit, since
 then everything would require parity calculations.  Or would that
 just be w/ 2 missing drives?

A raid6 with one missing drive would have a little bit of a
performance hit over raid5.

Partly there is a CPU hit to calculate the Q block which is slower
than calculating normal parity.

Partly there is the fact that raid6 never does read-modify-write
cycles, so to update one block in a stripe, it has to read all the
other data blocks.
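
A rough worked example (illustrative only): with 6 drives, raid5 has 5 data
chunks plus P per stripe, while raid6 has 4 data chunks plus P and Q.  To
update a single chunk:

   raid5 read-modify-write:  read old data + old P            = 2 reads, 2 writes
   raid6 reconstruct-write:  read the 3 untouched data chunks = 3 reads, 3 writes
                             (then write new data, P and Q)

and the raid6 read count grows with the stripe width, while the raid5
read-modify-write cost stays constant.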

But the worst aspect of doing this is that if you have a system crash,
you could get hidden data corruption.
After a system crash you cannot trust parity data (as it may have been
in the process of being updated) so you have to regenerate it from
known good data.  But if your array is degraded, you don't have all
the known good data, so you lose.

It is really best to avoid degraded raid4/5/6 arrays when at all
possible.

NeilBrown


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Saturday February 16, [EMAIL PROTECTED] wrote:
 found was a few months old.  Is it likely that RAID5 to RAID6
 reshaping will be implemented in the next 12 to 18 months (my rough

Certainly possible.

I won't say it is likely until it is actually done.  And by then it
will be definite :-)

i.e. no concrete plans.
It is always best to base your decisions on what is available today.


NeilBrown


Re: suns raid-z / zfs

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
 Hi
 
 any opinions on suns zfs/raid-z?

It's vaguely interesting.  I'm not sold on the idea though.

 It seems like a good way to avoid the performance problems of raid-5
 /raid-6

I think there are better ways.

 
 But does it stripe? One could think that rewriting stripes
 other places would damage the striping effects.

I'm not sure what you mean exactly.  But I suspect your concerns here
are unjustified.

 
 Or is the performance only meant to be good for random read/write?

I suspect it is meant to be good for everything.  But you would have to
ask Sun about that.

 
 Can the code be lifted to Linux? I understand that it is already in
 freebsd. Does Suns licence prevent this?

My understanding is that the Sun license prevents it.

However raid-z only makes sense in the context of a specific
filesystem such as ZFS.  It isn't something that you could just layer
any filesystem on top of.

 
 And could something like this be built into existing file systems like
 ext3 and xfs? They could have a multipartition layer in their code, and
 then the heuristics to optimize block access could also apply to stripe
 access.

I doubt it, but I haven't thought deeply enough about it to see if
there might be some relatively non-intrusive way.

NeilBrown

 
 best regards
 keld


Re: Create Raid6 with 1 missing member fails

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
 I tried to create a raid6 with one missing member, but it fails.
 It works fine to create a raid6 with two missing members. Is it supposed 
 to be like that ?

No, it isn't supposed to be like that, but currently it is.

The easiest approach is to create it with 2 drives missing, and add the
extra drive immediately.
This is essentially what mdadm will do when I fix it.

Alternately you can use --assume-clean to tell it that the array is
clean.  It is actually a lie, but it is a harmless lie. Whenever any
data is written to the array, that little part of the array will get
cleaned. (Note that this isn't true of raid5, only of raid6).
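
As a sketch of those two workarounds (device names and array size are
purely illustrative):

   # create with two members missing, then add the real device straight away
   mdadm --create /dev/md0 --level=6 --raid-devices=5 \
         /dev/sdb1 /dev/sdc1 /dev/sdd1 missing missing
   mdadm /dev/md0 --add /dev/sde1

   # or create directly with one member missing, telling mdadm not to resync
   mdadm --create /dev/md0 --level=6 --raid-devices=5 --assume-clean \
         /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing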

NeilBrown


Re: raid5: two writing algorithms

2008-02-07 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
 As I understand it, there are 2 valid algoritms for writing in raid5.
 
 1. calculate the parity data by XOR'ing all data of the relevant data
 chunks.
 
 2. calculate the parity data by kind of XOR-subtracting the old data to
 be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
 is actually the same).
 
 There are situations where method 1 is the fastest, and situations where
 method 2 is the fastest.
 
 My idea is then that the raid5 code in the kernel can calculate which
 method is the faster. 
 
 method 1 is faster, if all data is already available. I understand that
 this method is employed in the current kernel. This would eg be the case
 with sequential writes.
 
 Method 2 is faster, if no data is available in core. It would require
 2 reads and two writes, which always will be faster than n reads and 1
 write, possibly except for n=2. method 2 is thus faster normally for
 random writes.
 
 I think that method 2 is not used in the kernel today. Maybe I am wrong,
 but I did have a look in the kernel code.

It is very odd that you would think something about the behaviour of
the kernel without actually having looked.

It also seems a little arrogant to have a clever idea and assume that
no one else has thought of it before.

 
 So I hereby give the idea for inspiration to kernel hackers.

and I hereby invite you to read the code ;-)

Code reading is a good first step to being a
 
 Yoyr kernel hacker wannabe
   ^

NeilBrown


 keld


Re: when is a disk non-fresh?

2008-02-07 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
 On Tuesday 05 February 2008 03:02:00 Neil Brown wrote:
  On Monday February 4, [EMAIL PROTECTED] wrote:
   Seems the other topic wasn't quite clear...
 
  not necessarily.  sometimes it helps to repeat your question.  there
  is a lot of noise on the internet and sometimes important things get
  missed... :-)
 
   Occasionally a disk is kicked for being non-fresh - what does this mean
   and what causes it?
 
  The 'event' count is too small.
  Every event that happens on an array causes the event count to be
  incremented.
 
 An 'event' here is any atomic action? Like write byte there or calc XOR?

An 'event' is
   - switch from clean to dirty
   - switch from dirty to clean
   - a device fails
   - a spare finishes recovery
things like that.

 
 
  If the event counts on different devices differ by more than 1, then
  the smaller number is 'non-fresh'.
 
  You need to look to the kernel logs of when the array was previously
  shut down to figure out why it is now non-fresh.
 
 The kernel logs show absolutely nothing. Log's fine, next time I boot up, one 
 disk is kicked, I got no clue why, badblocks is fine, smartctl is fine, selft 
 test fine, dmesg and /var/log/messages show nothing apart from that news that 
 the disk was kicked and mdadm -E doesn't say anything suspicious either.

Can you get mdadm -E on all devices *before* attempting to assemble
the array?

 
 Question: what events occured on the 3 other disks that didn't occur on the 
 last? It only happens after reboots, not while the machine is up so the 
 closest assumption is that the array is not properly shut down somehow during 
 system shutdown - only I wouldn't know why.

Yes, most likely is that the array didn't shut down properly.

 Box is Slackware 11.0, 11 doesn't come with raid script of its own so I 
 hacked 
 them into the boot scripts myself and carefully watched that everything 
 accessing the array is down before mdadm --stop --scan is issued.
 No NFS, no Samba, no other funny daemons, disks are synced and so on.
 
 I could write some failsafe inot it by checking if the event count is the 
 same 
 on all disks before --stop, but even if it wasn't, I really wouldn't know 
 what to do about it.
 
 (btw mdadm -E gives me: Events : 0.1149316 - what's with the 0. ?)
 

The events count is a 64-bit number and for historical reasons it is
printed as two 32-bit numbers.  I agree this is ugly.
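
As an illustration only, using the value quoted above, the two halves are
simply the high and low 32 bits of the counter:

   events=1149316
   printf '%d.%d\n' $((events >> 32)) $((events & 0xffffffff))   # prints 0.1149316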

NeilBrown


Re: raid5: two writing algorithms

2008-02-07 Thread Neil Brown
On Friday February 8, [EMAIL PROTECTED] wrote:
 On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
  On Thursday February 7, [EMAIL PROTECTED] wrote:
 
   So I hereby give the idea for inspiration to kernel hackers.
  
  and I hereby invite you to read the code ;-)
 
 I did some reading.  Is there somewhere a description of it, especially
 the raid code, or are the comments and the code the best documentation?

No.  If a description was written (and various people have tried to
describe various parts) it would be out of date within a few months :-(

Look for READ_MODIFY_WRITE and RECONSTRUCT_WRITE... no.  That
only applies to the raid6 code now.
Look instead for the 'rcw' and 'rmw' counters, and then at
'handle_write_operations5'  which does different things based on the
'rcw' variable.
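
For reference, the two strategies those counters refer to can be written as
(standard raid5 parity algebra, not a quote from the code):

   rmw (read-modify-write):  P_new = P_old xor D_old xor D_new
                             pre-reads: the old data block and the old parity
   rcw (reconstruct-write):  P_new = xor of all data blocks in the stripe
                             pre-reads: every data block not being overwritten

Roughly speaking, the code counts how many pre-reads each strategy would
need for a given stripe and goes with the cheaper one.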

It used to be a lot clearer before we implemented xor-offload.  The
xor-offload stuff is good, but it does make the code more complex.


 
 Do you say that this is already implemented?

Yes.

 
 I am sorry if you think I am mailing too much on the list.

You aren't.

 But I happen to think it is fun.

Good.

 And I do try to give something back.

We'll look forward to that.

 
  Code reading is a good first step to being a
   
   Yoyr kernel hacker wannabe
 ^
  
  NeilBrown
 
 Well, I do have a hack in mind, on the raid10,f2.
 I need to investigate some more, and possibly test out
 what really happens. But maybe the code already does what I want it to.
 You are possibly the one that knows the code best, so maybe you can tell
 me if raid10,f2 always does its reading in the first part of the disks?

Yes, I know the code best.

No, raid10,f2 doesn't always use the first part of the disk.  Getting
it to do that would be a fairly small change in 'read_balance' in
md/raid10.c.

I'm not at all convinced that the read balancing code in raid10 (or
raid1) really does the best thing.  So any improvements - backed up
with broad testing - would be most welcome.

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
  Maybe the kernel has  been told to forget about the partitions of
  /dev/sdb.
 
 But fdisk/cfdisk has no problem whatsoever finding the partitions .

It is looking at the partition table on disk.  Not at the kernel's
idea of partitions, which is initialised from that table...

What does

  cat /proc/partitions

say?

 
  mdadm will sometimes tell it to do that, but only if you try to
  assemble arrays out of whole components.
 
  If that is the problem, then
 blockdev --rereadpt /dev/sdb
 
 I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.
 
 % blockdev --rereadpt /dev/sdf
 BLKRRPART: Device or resource busy
 

Implies that some partition is in use.

 % mdadm /dev/md2 --fail /dev/sdf1
 mdadm: set /dev/sdf1 faulty in /dev/md2
 
 % blockdev --rereadpt /dev/sdf
 BLKRRPART: Device or resource busy
 
 % mdadm /dev/md2 --remove /dev/sdf1
 mdadm: hot remove failed for /dev/sdf1: Device or resource busy

OK, that's weird.  If sdf1 is faulty, then you should be able to
remove it.  What does
  cat /proc/mdstat
  dmesg | tail

say at this point?

NeilBrown


Re: raid10 on three discs - few questions.

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 

  4. Would it be possible to later '--grow' the array to use 4 discs in
 raid10 ? Even with far=2 ?
 
  
 
  No.
 
  Well if by later you mean in five years, then maybe.  But the
  code doesn't currently exist.

 
 That's a reason to avoid raid10 for certain applications, then, and go 
 with a more manual 1+0 or similar.

Not really.  You cannot reshape a raid0 either.

 
 Can you create a raid10 with one drive missing and add it later? I 
 know, I should try it when I get a machine free... but I'm being lazy today.

Yes, but then the array would be degraded and a single failure could
destroy your data.

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
 % cat /proc/partitions
 major minor  #blocks  name

    8     0  390711384 sda
    8     1  390708801 sda1
    8    16  390711384 sdb
    8    17  390708801 sdb1
    8    32  390711384 sdc
    8    33  390708801 sdc1
    8    48  390710327 sdd
    8    49  390708801 sdd1
    8    64  390711384 sde
    8    65  390708801 sde1
    8    80  390711384 sdf
    8    81  390708801 sdf1
    3    64   78150744 hdb
    3    65    1951866 hdb1
    3    66    7815622 hdb2
    3    67    4883760 hdb3
    3    68          1 hdb4
    3    69     979933 hdb5
    3    70     979933 hdb6
    3    71   61536951 hdb7
    9     1  781417472 md1
    9     0  781417472 md0

So all the expected partitions are known to the kernel - good.

 
 /etc/udev/rules.d % cat /proc/mdstat
 Personalities : [raid1] [raid6] [raid5] [raid4]
 md0 : active(auto-read-only) raid5 sdc1[0] sde1[3](S) sdd1[1]
   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
 
 md1 : active(auto-read-only) raid5 sdf1[0] sdb1[3](S) sda1[1]
   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
 
 md0 consists of sdc1, sde1 and sdd1 even though when creating I asked it to 
 use d_1, d_2 and d_3 (this is probably written on the particular 
 disk/partition itself,
 but I have no idea how to clean this up - mdadm --zero-superblock /dev/d_1
 again produces mdadm: Couldn't open /dev/d_1 for write - not zeroing)
 

I suspect it is related to the (auto-read-only).
The array is degraded and has a spare, so it wants to do a recovery to
the spare.  But it won't start the recovery until the array is not
read-only.

But the recovery process has partly started (you'll see an md1_resync
thread) so it won't let go of any failed devices at the moment.
If you 
  mdadm -w /dev/md0

the recovery will start.
Then
  mdadm /dev/md0 -f /dev/d_1

will fail d_1, abort the recovery, and release d_1.

Then
  mdadm --zero-superblock /dev/d_1

should work.

It is currently failing with EBUSY - --zero-superblock opens the
device with O_EXCL to ensure that it isn't currently in use, and as
long as it is part of an md array, O_EXCL will fail.
I should make that more explicit in the error message.

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Thanks for the link!

quote
The second improvement is to remove a memory copy that is internal to the MD 
driver. The MD
driver stages strip data ready to be written next to the I/O controller in a 
page size pre-
allocated buffer. It is possible to bypass this memory copy for sequential 
writes thereby saving
SDRAM access cycles.
/quote

I sure hope you've checked that the filesystem never (ever) changes a
buffer while it is being written out.  Otherwise the data written to
disk might be different from the data used in the parity calculation
:-)

And what are the "Second memcpy" and "First memcpy" in the graph?
I assume one is the memcpy mentioned above, but what is the other?

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 Keld Jørn Simonsen wrote:
  Hi
 
  I am looking at revising our howto. I see a number of places where a
  chunk size of 32 kiB is recommended, and even recommendations on
  maybe using sizes of 4 kiB. 
 

 Depending on the raid level, a write smaller than the chunk size causes 
 the chunk to be read, altered, and rewritten, vs. just written if the 
 write is a multiple of chunk size. Many filesystems by default use a 4k 
 page size and writes. I believe this is the reasoning behind the 
 suggestion of small chunk sizes. Sequential vs. random and raid level 
 are important here, there's no one size to work best in all cases.

Not in md/raid.

RAID4/5/6 will do a read-modify-write if you are writing less than one
*page*, but then they often do a read-modify-write anyway for parity
updates.

No level will ever read a whole chunk just because it is a chunk.

To answer the original question:  The only way to be sure is to test
your hardware with your workload with different chunk sizes.
But I suspect that around 256K is good on current hardware.
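
A very rough test sketch, assuming four scratch partitions whose contents
you do not care about (device names, and dd as the stand-in workload, are
only placeholders -- a real test should use your real workload):

   for c in 64 128 256 512; do
       mdadm --create /dev/md9 --run --level=5 --raid-devices=4 --chunk=$c \
             /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
       # wait for the initial resync/recovery to finish before measuring
       while grep -qE 'resync|recovery' /proc/mdstat; do sleep 10; done
       dd if=/dev/zero of=/dev/md9 bs=1M count=4096 oflag=direct
       mdadm --stop /dev/md9
       mdadm --zero-superblock /dev/sd[b-e]1
   done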

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
 
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?


Are you serious?

A high end 15000RPM enterprise grade drive such as the Seagate
Cheetah® 15K.6 only delivers 164MB/sec.

The SATA Bus might be able to deliver 300MB/s, but an individual drive
would be around 80MB/s unless it is really expensive.

(or was that yesterday?  I'm having trouble keeping up with the pace
 of improvement :-)

NeilBrown


Re: Re[2]: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
 Feb  5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at 
 virtual address 001cd901

This looks like some sort of memory corruption.

 Feb  5 11:56:12 raid01 kernel: EIP is at md_do_sync+0x629/0xa32

This tells us what code is executing.

 Feb  5 11:56:12 raid01 kernel: Code: 54 24 48 0f 87 a4 01 00 00 72 0a 3b 44 
 24 44 0f 87 98 01 00 00 3b 7c 24 40 75 0a 3b 74 24 3c 0f 84 88 01 00 00 0b 85 
 30 01 00 00 88 08 0f 85 90 01 00 00 8b 85 30 01 00 00 a8 04 0f 85 82 01 00

This tells us what the actual bytes of code were.
If I feed this line (from Code: onwards) into ksymoops I get 

   0:   54                      push   %esp
   1:   24 48                   and    $0x48,%al
   3:   0f 87 a4 01 00 00       ja     1ad <_EIP+0x1ad>
   9:   72 0a                   jb     15 <_EIP+0x15>
   b:   3b 44 24 44             cmp    0x44(%esp),%eax
   f:   0f 87 98 01 00 00       ja     1ad <_EIP+0x1ad>
  15:   3b 7c 24 40             cmp    0x40(%esp),%edi
  19:   75 0a                   jne    25 <_EIP+0x25>
  1b:   3b 74 24 3c             cmp    0x3c(%esp),%esi
  1f:   0f 84 88 01 00 00       je     1ad <_EIP+0x1ad>
  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)
  2d:   0f 85 90 01 00 00       jne    1c3 <_EIP+0x1c3>
  33:   8b 85 30 01 00 00       mov    0x130(%ebp),%eax
  39:   a8 04                   test   $0x4,%al
  3b:   0f                      .byte 0xf
  3c:   85                      .byte 0x85
  3d:   82                      (bad)
  3e:   01 00                   add    %eax,(%eax)


I removed the Code;... lines as they are just noise, except for the
one that points to the current instruction in the middle.
Note that it is dereferencing %eax, after just 'or'ing some value into
it, which is rather unusual.

Now get the md-mod.ko for the kernel you are running.
run
   gdb md-mod.ko

and give the command

   disassemble md_do_sync

and look for code at offset 0x629, which is 1577 in decimal.

I found a similar kernel to what you are running, and the matching code
is 

0x55c0 <md_do_sync+1485>:   cmp    0x30(%esp),%eax
0x55c4 <md_do_sync+1489>:   ja     0x5749 <md_do_sync+1878>
0x55ca <md_do_sync+1495>:   cmp    0x2c(%esp),%edi
0x55ce <md_do_sync+1499>:   jne    0x55da <md_do_sync+1511>
0x55d0 <md_do_sync+1501>:   cmp    0x28(%esp),%esi
0x55d4 <md_do_sync+1505>:   je     0x5749 <md_do_sync+1878>
0x55da <md_do_sync+1511>:   mov    0x130(%ebp),%eax
0x55e0 <md_do_sync+1517>:   test   $0x8,%al
0x55e2 <md_do_sync+1519>:   jne    0x575f <md_do_sync+1900>
0x55e8 <md_do_sync+1525>:   mov    0x130(%ebp),%eax
0x55ee <md_do_sync+1531>:   test   $0x4,%al
0x55f0 <md_do_sync+1533>:   jne    0x575f <md_do_sync+1900>
0x55f6 <md_do_sync+1539>:   mov    0x38(%esp),%ecx
0x55fa <md_do_sync+1543>:   mov    0x0,%eax
-

Note the sequence cmp, ja, cmp, jne, cmp, je
where the cmp arguments are consecutive 4byte values on the stack
(%esp).
In the code from your oops, the offsets are 0x44 0x40 0x3c.
In the kernel I found they are 0x30 0x2c 0x28.  The difference is some
subtle difference in the kernel, possibly a different compiler or
something.

Anyway, your code crashed at 


  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)

The matching code in the kernel I found is 

0x55da <md_do_sync+1511>:   mov    0x130(%ebp),%eax
0x55e0 <md_do_sync+1517>:   test   $0x8,%al

Note that you have an 'or', the kernel I found has 'mov'.

If we look at the actual bytes of code for those two instructions,
the code that crashed shows the bytes above:

0b 85 30 01 00 00
88 08

if I get the same bytes with gdb:

(gdb) x/8b 0x55da
0x55da <md_do_sync+1511>:   0x8b    0x85    0x30    0x01    0x00    0x00    0xa8    0x08
(gdb) 

So what should be 8b has become 0b, and what should be a8 has
become 08.

If you look for the same data in your md-mod.ko, you might find
slightly different details but it is clear to me that the code in
memory is bad.

Possibly you have bad memory, or a bad CPU, or you are overclocking
the CPU, or it is getting hot, or something.


But you clearly have a hardware error.

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
 
 % mdadm --zero-superblock /dev/sdb1
 mdadm: Couldn't open /dev/sdb1 for write - not zeroing

That's weird.
Why can't it open it?

Maybe you aren't running as root (The '%' prompt is suspicious).
Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.
mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.

If that is the problem, then
   blockdev --rereadpt /dev/sdb

will fix it.

NeilBrown


Re: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-04 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
 
 [EMAIL PROTECTED]:/# cat /proc/mdstat
 Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
 [multipath] [faulty]
 md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
   1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] 
 [_]
 
 unused devices: none
 
 ##
 But how i can see the status of reshaping ?
 Is it reshaped realy ? or may be just hang up ? or may be mdadm nothing do 
 not give in
 general ?
 How long wait when reshaping will finish ?
 ##
 

The reshape hasn't restarted.

Did you do that mdadm -w /dev/md1 like I suggested?  If so, what
happened?

Possibly you tried mounting the filesystem before trying the mdadm
-w.  There seems to be a bug such that doing this would cause the
reshape not to restart, and mdadm -w would not help any more.

I suggest you:

  echo 0 > /sys/module/md_mod/parameters/start_ro

stop the array 
  mdadm -S /dev/md1
(after unmounting if necessary).

Then assemble the array again.
Then
  mdadm -w /dev/md1

just to be sure.

If this doesn't work, please report exactly what you did, exactly what
message you got and exactly where message appeared in the kernel log.

NeilBrown


Re: when is a disk non-fresh?

2008-02-04 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
 Seems the other topic wasn't quite clear...

not necessarily.  sometimes it helps to repeat your question.  there
is a lot of noise on the internet and sometimes important things get
missed... :-)

 Occasionally a disk is kicked for being non-fresh - what does this mean and 
 what causes it?

The 'event' count is too small.  
Every event that happens on an array causes the event count to be
incremented.
If the event counts on different devices differ by more than 1, then
the smaller number is 'non-fresh'.
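
A quick way to compare the counts (device names are only an example):

   mdadm -E /dev/sd[abcd]1 | grep -E '^/dev|Events'

The device whose Events value lags the others by more than 1 is the one
that gets treated as non-fresh.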

You need to look to the kernel logs of when the array was previously
shut down to figure out why it is now non-fresh.

NeilBrown


 
 Dex
 
 
 
 -- 
 -BEGIN GEEK CODE BLOCK-
 Version: 3.12
 GCS d--(+)@ s-:+ a- C UL++ P+++ L+++ E-- W++ N o? K-
 w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ 
 b++(+++) DI+++ D- G++ e* h++ r* y?
 --END GEEK CODE BLOCK--
 
 http://www.vorratsdatenspeicherung.de


Re: raid10 on three discs - few questions.

2008-02-03 Thread Neil Brown
On Sunday February 3, [EMAIL PROTECTED] wrote:
 Hi,
 
 Maybe I'll buy three HDDs to put a raid10 on them. And get the total
 capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible
 and should work.
 
 I'm wondering - how a single disc failure is handled in such configuration?
 
 1. does the array continue to work in a degraded state?

Yes.

 
 2. after the failure I can disconnect faulty drive, connect a new one,
start the computer, add disc to array and it will sync automatically?
 

Yes.

 
 Question seems a bit obvious, but the configuration is, at least for
 me, a bit unusual. This is why I'm asking. Anybody here tested such
 configuration, has some experience?
 
 
 3. Another thing - would raid10,far=2 work when three drives are used?
Would it increase the read performance?

Yes.

 
 4. Would it be possible to later '--grow' the array to use 4 discs in
raid10 ? Even with far=2 ?
 

No.

Well if by later you mean in five years, then maybe.  But the
code doesn't currently exist.

NeilBrown


Re: problem with spare, acive device, clean degrated, reshaip RADI5, anybody can help ?

2008-02-03 Thread Neil Brown
On Thursday January 31, [EMAIL PROTECTED] wrote:
 Hello linux-raid.
 
 i have DEBIAN.
 
 raid01:/# mdadm -V
 mdadm - v2.6.4 - 19th October 2007
 
 raid01:/# mdadm -D /dev/md1
 /dev/md1:
 Version : 00.91.03
   Creation Time : Tue Nov 13 18:42:36 2007
  Raid Level : raid5

   Delta Devices : 1, (4-5)

So the array is in the middle of a reshape.

It should automatically complete...  Presumably it isn't doing that?

What does
   cat /proc/mdstat
say?

What kernel log messages do you get when you assemble the array?


The spare device will not be added to the array until the reshape has
finished.

Hopefully you aren't using a 2.6.23 kernel?
That kernel had a bug which corrupted data when reshaping a degraded
raid5 array.

NeilBrown


Re: /dev/sdb has different metadata to chosen array /dev/md1 0.91 0.90.

2008-02-03 Thread Neil Brown
On Saturday February 2, [EMAIL PROTECTED] wrote:
 Hello, linux-raid.
 
 Help please, How i can to fight THIS :
 
 [EMAIL PROTECTED]:~# mdadm -I /dev/sdb
 mdadm: /dev/sdb has different metadata to chosen array /dev/md1 0.91 0.90.
 

Apparently mdadm -I doesn't work with arrays that are in the middle
of a reshape.  I'll try to fix that for the next release.

Thanks for the report.

NeilBrown


Re: Re[2]: problem with spare, acive device, clean degrated, reshaip RADI5, anybody can help ?

2008-02-03 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
 
 raid01:/etc# cat /proc/mdstat
 Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
 [multipath] [faulty]
 md1 : active(auto-read-only) raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  ^^^
   1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] 
 [_]
 
 unused devices: none

That explains it.  The array is still 'read-only' and won't write
anything until you allow it to.
The easiest way is
  mdadm -w /dev/md1

That should restart the reshape.

NeilBrown


Re: Linux md and iscsi problems

2008-02-02 Thread Neil Brown
On Friday February 1, [EMAIL PROTECTED] wrote:
 
 
 Summarizing, I have two questions about the behavior of Linux md with  
 slow devices:
 
 1. Is it possible to modify some kind of time-out parameter on the  
 mdadm tool so the slow device wouldn't be marked as faulty because of  
 its slow performance.

No.  md doesn't do timeouts at all.  The underlying device does.
So if you are getting time out errors from the iscsi initiator, then
you need to change the timeout value used by the iscsi initiator.  md
has no part to play in this.  It just sends a request and eventually
gets either 'success' or 'fail'.

 
 2. Is it possible to control the buffer size of the RAID?, in other  
 words, can I control the amount of data I can write to the local disc  
 before I receive an acknowledgment from the slow device when I am  
 using the write-behind option.

No.  md/raid1 simply calls 'kmalloc' to get space to buffer each write
as the write arrives.  If the allocation succeeds, it is used to
perform the write lazily.  If the allocation fails, the write is
performed synchronously.

What did you hope to achieve by such tuning?  It can probably be
added if it is generally useful.

NeilBrown


Re: raid problem: after every reboot /dev/sdb1 is removed?

2008-02-01 Thread Neil Brown
On Friday February 1, [EMAIL PROTECTED] wrote:
 Hi!
 
 I have the following problem with my softraid (raid 1). I'm running
 Ubuntu 7.10 64bit with kernel 2.6.22-14-generic.
 
 After every reboot my first boot partition in md0 is not synchron. One
 of the disks (the sdb1) is removed. 
 After a resynch every partition is synching. But after a reboot the
 state is removed. 

Please send boot logs (e.g. dmesg > afile).

NeilBrown


Re: BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
 Hello,
 
 It seems that mdadm/md do not perform proper sanity checks before adding a 
 component to a degraded array. If the size of the new component is just 
 right, 
 the superblock information will overlap with the data area. This will happen 
 without any error indications in the syslog or otherwise.

I thought I fixed that...  What versions of Linux kernel and mdadm are
you using for your tests?

Thanks,
NeilBrown


Re: BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
 Hello,
 
 It seems that mdadm/md do not perform proper sanity checks before adding a 
 component to a degraded array. If the size of the new component is just 
 right, 
 the superblock information will overlap with the data area. This will happen 
 without any error indications in the syslog or otherwise.
 
 I came up with a reproducible scenario which I am attaching to this email 
 alongside with the entire test script. I have not tested it for other raid 
 levels, or other types of superblocks, but I suspect the same problem will 
 occur for many other configurations.
 
 I am willing to test patches, however the attached script is non-intrusive 
 enough to be executed anywhere.

Thanks for the report and the test script.

This patch for mdadm should fix this problem.  I hate the fact that
we sometimes use K and sometimes use sectors for
sizes/offsets... groan.

I'll probably get a test in the kernel as well to guard against this.

Thanks,
NeilBrown


### Diffstat output
 ./Manage.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/Manage.c ./Manage.c
--- .prev/Manage.c  2008-01-29 11:15:54.0 +1100
+++ ./Manage.c  2008-01-29 11:16:15.0 +1100
@@ -337,7 +337,7 @@ int Manage_subdevs(char *devname, int fd
 
 			/* Make sure device is large enough */
 			if (tst->ss->avail_size(tst, ldsize/512) <
-			    array.size) {
+			    array.size*2) {
 				fprintf(stderr, Name ": %s not large enough to join array\n",
 					dv->devname);
 				return 1;


Re: In this partition scheme, grub does not find md information?

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
 
 Perhaps I'm mistaken but I though it was possible to do boot from 
 /dev/md/all1.

It is my understanding that grub cannot boot from RAID.
You can boot from raid1 by the expedient of booting from one of the
halves.
A common approach is to make a small raid1 which contains /boot and
boot from that.  Then use the rest of your devices for raid10 or raid5
or whatever.
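
A sketch of that layout (partition names are only an example; --metadata=0.90
keeps the superblock at the end of each partition, so the boot loader sees an
ordinary filesystem on either half of the /boot mirror):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=0.90 \
         /dev/sda1 /dev/sdb1                  # small /boot mirror
   mdadm --create /dev/md1 --level=5 --raid-devices=3 \
         /dev/sda2 /dev/sdb2 /dev/sdc2        # the rest of the space

grub is then installed on each disk individually and reads /boot from
whichever half it happens to boot from.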
 
 Am I trying to do something that's basically impossible?

I believe so.

NeilBrown


Re: write-intent bitmaps

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
 http://lists.debian.org/debian-devel/2008/01/msg00921.html
 
 Are they regarded as a stable feature?  If so I'd like to see distributions 
 supporting them by default.  I've started a discussion in Debian on this 
 topic, see the above URL for details.

Yes, it is regarded as stable.

However it can be expected to reduce write throughput.  A reduction of
several percent would not be surprising, and depending on workload it
could probably be much higher.

It is quite easy to add or remove a bitmap on an active array, so
making it a default would probably be fine, providing it was easy for
an admin to find out about it and remove the bitmap if they wanted the
extra performance.
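
For reference, adding and removing an internal bitmap on a running array is
a single command each way (the array name is just an example):

   mdadm --grow /dev/md0 --bitmap=internal
   mdadm --grow /dev/md0 --bitmap=none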

NeilBrown


Re: striping of a 4 drive raid10

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
 Hi
 
 I have tried to make a striping raid out of my new 4 x 1 TB
 SATA-2 disks. I tried raid10,f2 in several ways:
 
 1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
 of md0+md1
 
 2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
 of md0+md1
 
 3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize of 
 md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
 
 4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
 of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
 
 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1

Try
  6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1

Also try raid10,o2 with a largeish chunksize (256KB is probably big
enough).
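
A sketch of those two layouts (device names and chunk size are placeholders;
the offset layout needs a newer kernel and mdadm, as noted later in this
thread):

   mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

   mdadm --create /dev/md0 --level=10 --layout=o2 --chunk=256 --raid-devices=4 \
         /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1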

NeilBrown


 
 My new disks give a transfer rate of about 80 MB/s, so I expected
 to have something like 320 MB/s for the whole raid, but I did not get
 more than about 180 MB/s.
 
 I think it may be something with the layout, that in effect 
 the drives should be something like:
 
    sda1  sdb1  sdc1  sdd1
      0     1     2     3
      4     5     6     7
 
 And this was not really doable for the combination of raids,
 because thet combinations give different block layouts.
 
 How can it be done? Do we need a new raid type?
 
 Best regards
 keld


Re: striping of a 4 drive raid10

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
 On Mon, Jan 28, 2008 at 07:13:30AM +1100, Neil Brown wrote:
  On Sunday January 27, [EMAIL PROTECTED] wrote:
   Hi
   
   I have tried to make a striping raid out of my new 4 x 1 TB
   SATA-2 disks. I tried raid10,f2 in several ways:
   
   1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
   of md0+md1
   
   2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
   of md0+md1
   
   3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize 
   of 
   md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
   
   4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
   of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
   
   5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
  
  Try
6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1
 
 That I already tried, (and I wrongly stated that I used f4 in stead of
 f2). I had two times a thruput of about 300 MB/s but since then I could
 not reproduce the behaviour. Are there errors on this that has been
 corrected in newer kernels?

No, I don't think any performance related changes have been made to
raid10 lately.

You could try increasing the read-ahead size.  For a 4-drive raid10 it
defaults to 4 times the read-ahead setting of a single drive, but
increasing it substantially (e.g. 64 times) seems to increase the speed of
dd reading a gigabyte.
Whether that will actually affect your target workload is a different question.
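
For example (the value is only illustrative; the unit is 512-byte sectors):

   blockdev --getra /dev/md0             # current read-ahead
   blockdev --setra 65536 /dev/md0       # 32MB of read-ahead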

 
 
  Also try raid10,o2 with a largeish chunksize (256KB is probably big
  enough).
 
 I tried that too, but my mdadm did not allow me to use the o flag.
 
 My kernel is 2.6.12  and mdadm is v1.12.0 - 14 June 2005.
 can I upgrade the mdadm alone to a newer version, and then which is
 recommendable?

You would need a newer kernel and a newer mdadm to get raid10 offset
mode.

NeilBrown


Re: Fwd: Error on /dev/sda, but takes down RAID-1

2008-01-23 Thread Neil Brown
On Wednesday January 23, [EMAIL PROTECTED] wrote:
 Hi, 
 
 I'm not sure this is completely linux-raid related, but I can't figure out 
 where to start: 
 
 A few days ago, my server died. I was able to log in and salvage this content 
 of dmesg: 
 http://pastebin.com/m4af616df 

At line 194:

   end_request: I/O error, dev sdb, sector 80324865

then at line 384

   end_request: I/O error, dev sda, sector 80324865

 
 I talked to my hosting-people and they said it was an io-error on /dev/sda, 
 and replaced that drive. 
 After this, I was able to boot into a PXE-image and re-build the two RAID-1 
 devices with no problems - indicating that sdb was fine. 
 
 I expected RAID-1 to be able to stomach exactly this kind of error - one 
 drive dying. What did I do wrong? 

Trouble is it wasn't one drive dying.  You got errors from two
drives, at almost exactly the same time.  So maybe the controller
died.  Or maybe when one drive died, the controller or the driver got
confused and couldn't work with the other drive any more.

Certainly the "blk: request botched" messages (line 233 onwards)
suggest some confusion in the driver.

Maybe post to [EMAIL PROTECTED] - that is where issues with
SATA drivers and controllers can be discussed.

NeilBrown




Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-23 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
 
 This message describes the details about md-RAID1 issue found by
 testing the md RAID1 using the SCSI fault injection framework.
 
 Abstract:
 Both the error handler for md RAID1 and write access request to the md RAID1
 use raid1d kernel thread. The nr_pending flag could cause a race condition
 in raid1d, results in a raid1d deadlock.

Thanks for finding and reporting this.

I believe the following patch should fix the deadlock.

If you are able to repeat your test and confirm this I would
appreciate it.

Thanks,
NeilBrown



Fix deadlock in md/raid1 when handling a read error.

When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the
retry queue - these are counted in ->nr_queued and will stay there during
a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock as raid1 is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to K.Tanaka [EMAIL PROTECTED] for finding and reporting
this problem.

Cc: K.Tanaka [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid1.c |   66 ++-
 1 file changed, 45 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c    2008-01-18 11:19:09.0 +1100
+++ ./drivers/md/raid1.c    2008-01-24 14:21:55.0 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 * We return 1 if any requests were actually submitted.
+	 */
+	int rv = 0;
+
+	spin_lock_irq(&conf->device_lock);
+
+	if (conf->pending_bio_list.head) {
+		struct bio *bio;
+		bio = bio_list_get(&conf->pending_bio_list);
+		blk_remove_plug(conf->mddev->queue);
+		spin_unlock_irq(&conf->device_lock);
+		/* flush any pending bitmap writes to
+		 * disk before proceeding w/ I/O */
+		bitmap_unplug(conf->mddev->bitmap);
+
+		while (bio) { /* submit pending writes */
+			struct bio *next = bio->bi_next;
+			bio->bi_next = NULL;
+			generic_make_request(bio);
+			bio = next;
+		}
+		rv = 1;
+	} else
+		spin_unlock_irq(&conf->device_lock);
+	return rv;
+}
+
 /* Barriers
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -678,10 +709,14 @@ static void freeze_array(conf_t *conf)
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
+	spin_unlock_irq(&conf->resync_lock);
+
+	spin_lock_irq(&conf->resync_lock);
 	wait_event_lock_irq(conf->wait_barrier,
 			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
 			    conf->resync_lock,
-			    raid1_unplug(conf->mddev->queue));
+			    ({ flush_pending_writes(conf);
+			       raid1_unplug(conf->mddev->queue); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +942,9 @@ static int make_request(struct request_q
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	/* In case raid1d snuck into freeze_array */
+	wake_up(&conf->wait_barrier);
+
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
 #if 0
@@ -1473,28 +1511,14 @@ static void raid1d(mddev_t *mddev)
 
 	for (;;) {
 		char b[BDEVNAME_SIZE];
-		spin_lock_irqsave(&conf->device_lock, flags);
-
-		if (conf->pending_bio_list.head) {
-			bio = bio_list_get(&conf->pending_bio_list);
-			blk_remove_plug(mddev->queue);
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-			bitmap_unplug(mddev->bitmap);
-
-			while (bio) { /* submit pending writes */
-				struct bio *next = bio->bi_next;
-				bio->bi_next = NULL;
-				generic_make_request(bio);
-				bio = next

Re: idle array consuming cpu ??!!

2008-01-23 Thread Neil Brown
On Tuesday January 22, [EMAIL PROTECTED] wrote:
 Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15:
  On Sunday January 20, [EMAIL PROTECTED] wrote:
   A raid6 array with a spare and bitmap is idle: not mounted and with no
   IO to it or any of its disks (obviously), as shown by iostat. However
   it's consuming cpu: since reboot it used about 11min in 24h, which is 
 quite
   a lot even for a busy array (the cpus are fast). The array was cleanly
   shutdown so there's been no reconstruction/check or anything else.
   
   How can this be? Kernel is 2.6.22.16 with the two patches for the
   deadlock ([PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
   FIX) and the previous one.
  
  Maybe the bitmap code is waking up regularly to do nothing.
  
  Would you be happy to experiment?  Remove the bitmap with
 mdadm --grow /dev/mdX --bitmap=none
  
  and see how that affects cpu usage?
 
 Confirmed, removing the bitmap stopped cpu consumption.

Thanks.

This patch should substantially reduce cpu consumption on an idle
bitmap.

NeilBrown

--
Reduce CPU wastage on idle md array with a write-intent bitmap.

On an md array with a write-intent bitmap, a thread wakes up every few
seconds and scans the bitmap looking for work to do.  If the
array is idle, there will be no work to do, but a lot of scanning is
done to discover this.

So cache the fact that the bitmap is completely clean, and avoid
scanning the whole bitmap when the cache is known to be clean.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/bitmap.c |   19 +--
 ./include/linux/raid/bitmap.h |2 ++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2008-01-24 15:53:45.0 +1100
+++ ./drivers/md/bitmap.c   2008-01-24 15:54:29.0 +1100
@@ -1047,6 +1047,11 @@ void bitmap_daemon_work(struct bitmap *b
 	if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
 		return;
 	bitmap->daemon_lastrun = jiffies;
+	if (bitmap->allclean) {
+		bitmap->mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+		return;
+	}
+	bitmap->allclean = 1;
 
 	for (j = 0; j < bitmap->chunks; j++) {
 		bitmap_counter_t *bmc;
@@ -1068,8 +1073,10 @@ void bitmap_daemon_work(struct bitmap *b
 			clear_page_attr(bitmap, page,
 					BITMAP_PAGE_NEEDWRITE);
 
 			spin_unlock_irqrestore(&bitmap->lock, flags);
-			if (need_write)
+			if (need_write) {
 				write_page(bitmap, page, 0);
+				bitmap->allclean = 0;
+			}
 			continue;
 		}
 
@@ -1098,6 +1105,9 @@ void bitmap_daemon_work(struct bitmap *b
 /*
   if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
 */
+		if (*bmc)
+			bitmap->allclean = 0;
+
 		if (*bmc == 2) {
 			*bmc=1; /* maybe clear the bit next time */
 			set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
@@ -1132,6 +1142,8 @@ void bitmap_daemon_work(struct bitmap *b
 		}
 	}
 
+	if (bitmap->allclean == 0)
+		bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
 }
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
@@ -1226,6 +1238,7 @@ int bitmap_startwrite(struct bitmap *bit
 			sectors -= blocks;
 		else sectors = 0;
 	}
+	bitmap->allclean = 0;
 	return 0;
 }
 
@@ -1296,6 +1309,7 @@ int bitmap_start_sync(struct bitmap *bit
 		}
 	}
 	spin_unlock_irq(&bitmap->lock);
+	bitmap->allclean = 0;
 	return rv;
 }
 
@@ -1332,6 +1346,7 @@ void bitmap_end_sync(struct bitmap *bitm
 	}
  unlock:
 	spin_unlock_irqrestore(&bitmap->lock, flags);
+	bitmap->allclean = 0;
 }
 
 void bitmap_close_sync(struct bitmap *bitmap)
@@ -1399,7 +1414,7 @@ static void bitmap_set_memory_bits(struc
 		set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 	}
 	spin_unlock_irq(&bitmap->lock);
-
+	bitmap->allclean = 0;
 }
 
 /* dirty the memory and file bits for bitmap chunks s to e */

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h   2008-01-24 15:53:45.0 +1100
+++ ./include/linux/raid/bitmap.h   2008-01-24 15:54:29.0 +1100
@@ -235,6 +235,8 @@ struct bitmap {
 
 	unsigned long flags;
 
+	int allclean;
+
 	unsigned long max_write_behind; /* write-behind mode */
 	atomic_t behind_writes;
 

Re: array doesn't run even with --force

2008-01-20 Thread Neil Brown
On Sunday January 20, [EMAIL PROTECTED] wrote:
 I've got a raid5 array with 5 disks where 2 failed. The failures are
 occasional and only on a few sectors so I tried to assemble it with 4
 disks anyway:
 
 # mdadm -A -f -R /dev/mdnumber /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4
 
 However mdadm complains that one of the disks has an out-of-date
 superblock and kicks it out, and then it cannot run the array with
 only 3 disks.
 
 Shouldn't it adjust the superblock and assemble-run it anyway? That's
 what -f is for, no? This is with kernel 2.6.22.16 and mdadm 2.6.4.

Please provide actual commands and actual output.
Also add --verbose to the assemble command
Also provide --examine for all devices.
Also provide any kernel log messages.

Thanks,
NeilBrown


Re: idle array consuming cpu ??!!

2008-01-20 Thread Neil Brown
On Sunday January 20, [EMAIL PROTECTED] wrote:
 A raid6 array with a spare and bitmap is idle: not mounted and with no
 IO to it or any of its disks (obviously), as shown by iostat. However
 it's consuming cpu: since reboot it used about 11min in 24h, which is quite
 a lot even for a busy array (the cpus are fast). The array was cleanly
 shutdown so there's been no reconstruction/check or anything else.
 
 How can this be? Kernel is 2.6.22.16 with the two patches for the
 deadlock ([PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
 FIX) and the previous one.

Maybe the bitmap code is waking up regularly to do nothing.

Would you be happy to experiment?  Remove the bitmap with
   mdadm --grow /dev/mdX --bitmap=none

and see how that affects cpu usage?

Thanks,
NeilBrown


Re: array doesn't run even with --force

2008-01-20 Thread Neil Brown
On Monday January 21, [EMAIL PROTECTED] wrote:
 
 The command is
 
 mdadm -A --verbose -f -R /dev/md3 /dev/sda4 /dev/sdc4 /dev/sde4 /dev/sdd4
 
 The failed areas are sdb4 (which I didn't include above) and sdd4. I
 did a dd if=/dev/sdb4 of=/dev/hda4 bs=512 conv=noerror and it
 complained about roughly 10 bad sectors. I did dd if=/dev/sdd4
 of=/dev/hdc4 bs=512 conv=noerror and there were no errors, that's why
 I used sdd4 above. I tried to substitute hdc4 for sdd4, and hda4 for
 sdb4, to no avail.
 
 I don't have kernel logs because the failed area has /home and /var.
 The double fault occurred during the holidays, so I don't know which
 happened first. Below are the output of the command above and of
 --examine.
 
 mdadm: looking for devices for /dev/md3
 mdadm: /dev/sda4 is identified as a member of /dev/md3, slot 0.
 mdadm: /dev/sdc4 is identified as a member of /dev/md3, slot 2.
 mdadm: /dev/sde4 is identified as a member of /dev/md3, slot 4.
 mdadm: /dev/sdd4 is identified as a member of /dev/md3, slot 5.
 mdadm: no uptodate device for slot 1 of /dev/md3
 mdadm: added /dev/sdc4 to /dev/md3 as 2
 mdadm: no uptodate device for slot 3 of /dev/md3
 mdadm: added /dev/sde4 to /dev/md3 as 4
 mdadm: added /dev/sdd4 to /dev/md3 as 5
 mdadm: added /dev/sda4 to /dev/md3 as 0
 mdadm: failed to RUN_ARRAY /dev/md3: Input/output error
 mdadm: Not enough devices to start the array.

So no device claims to be member '1' or '3' of the array, and as you
cannot start an array with 2 devices missing, there is nothing that
mdadm can do.  It has no way of knowing what should go in as '1' or
'3'.

As you note, sda4 says that it thinks slot 1 is still active/sync, but
it doesn't seem to know which device should go there either.
However that does indicate that slot 3 failed first and slot 1 failed
later.  So if we have candidates for both, slot 1 is probably more
uptodate.

You need to tell mdadm what goes where by creating the array.
e.g. if you think that sdb4 is adequately reliable and that it was in
slot 1, then

 mdadm -C /dev/md3 -l5 -n5 -c 128 /dev/sda4 /dev/sdb4 /dev/sdc4 missing 
/dev/sde4

alternately if you think it best to use sdd, and it was in slot 3,
then

 mdadm -C /dev/md3 -l5 -n5 -c 128 /dev/sda4 missing /dev/sdc4 /dev/sdd4 
/dev/sde4

would be the command to use.

Note that this command will not touch any data.  It will just
overwrite the superblock and assemble the array.
You can then 'fsck' or whatever to confirm that the data looks good.

good luck.

NeilBrown


Re: do_md_run returned -22 [Was: 2.6.24-rc8-mm1]

2008-01-17 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
 On Thu, 17 Jan 2008 16:23:30 +0100 Jiri Slaby [EMAIL PROTECTED] wrote:
 
  On 01/17/2008 11:35 AM, Andrew Morton wrote:
   ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
  
  still the same md issue (do_md_run returns -22=EINVAL) as in -rc6-mm1 
  reported 
  by Thorsten here:
  http://lkml.org/lkml/2007/12/27/45
 
 hm, I must have been asleep when that was reported.  Neil, did you see it?

No, even though it was Cc:ed to me - sorry.
Maybe a revised subject line would have helped... maybe not.

 
  Is there around any fix for this?
 
 Well, we could bitbucket md-allow-devices-to-be-shared-between-md-arrays.patch

Yeah, do that.  I'll send you something new.
I'll move that chunk into a different patch and add the extra bits
needed to make that test correct in *all* cases rather than just the
ones I was thinking about at the time.
My test suite does try in-kernel-autodetect (the problem case) but it
didn't catch this bug due to another bug.  I'll fix that too.

Thanks,
NeilBrown


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-16 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
 On Wed, 16 Jan 2008 00:09:31 -0700 Dan Williams [EMAIL PROTECTED] wrote:
 
   heheh.
  
   it's really easy to reproduce the hang without the patch -- i could
   hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
   i'll try with ext3... Dan's experiences suggest it won't happen with ext3
   (or is even more rare), which would explain why this has is overall a
   rare problem.
  
  
  Hmmm... how rare?
  
  http://marc.info/?l=linux-kernel&m=119461747005776&w=2
  
  There is nothing specific that prevents other filesystems from hitting
  it, perhaps XFS is just better at submitting large i/o's.  -stable
  should get some kind of treatment.  I'll take altered performance over
  a hung system.
 
 We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
 wimpy.

I am feeling wimpy.  There've been a few too many raid5 breakages
recently and it is very hard to really judge the performance impact of
this change.  I even have a small uncertainty of correctness - could
it still hang in some other way?  I don't think so, but this is
complex code...

If it were really common I would have expected more noise on the
mailing list.  Sure, there has been some, but not much.  However maybe
people are searching the archives and finding the "increase stripe
cache size" trick, and not reporting anything ... seems unlikely
though.

How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Any one (or any distro) that really
needs it can of course grab the patch them selves...

??

NeilBrown


Re: How do I get rid of old device?

2008-01-16 Thread Neil Brown
On Wednesday January 16, [EMAIL PROTECTED] wrote:
 p34:~# mdadm /dev/md3 --zero-superblock
 p34:~# mdadm --examine --scan
 ARRAY /dev/md0 level=raid1 num-devices=2 
 UUID=f463057c:9a696419:3bcb794a:7aaa12b2
 ARRAY /dev/md1 level=raid1 num-devices=2 
 UUID=98e4948c:c6685f82:e082fd95:e7f45529
 ARRAY /dev/md2 level=raid1 num-devices=2 
 UUID=330c9879:73af7d3e:57f4c139:f9191788
 ARRAY /dev/md3 level=raid0 num-devices=10 
 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
 p34:~#
 
 I cannot seem to get rid of /dev/md3, its almost as if there is a piece of 
 it on the root (2) disks or reference to it?
 
 I also dd'd the other 10 disks (non-root) and /dev/md3 persists.

You don't zero the superblock on the array device, because the array
device does not have a superblock.  The component devices have the
superblock.

So
  mdadm --zero-superblock /dev/sd*
or whatever.
Maybe
  mdadm --examine --scan -v

then get the list of devices it found for the array you want to kill,
and  --zero-superblock that list.

NeilBrown
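
A minimal sketch of that sequence, assuming the stale array is /dev/md3 and
that --examine reports its members as /dev/sdc1 ... /dev/sdl1 (hypothetical
names, not taken from the report):

   # list every array mdadm can find, with the devices it found for each
   mdadm --examine --scan -v

   # zero the superblock on each component that belonged to /dev/md3
   for d in /dev/sd[c-l]1 ; do mdadm --zero-superblock $d ; done

   # confirm the array no longer shows up
   mdadm --examine --scan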


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
 On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
  
  Due to possible deadlock issues we need to use a schedule work to
  kobject_del an 'rdev' object from a different thread.
  
  A recent change means that kobject_add no longer gets a refernce, and
  kobject_del doesn't put a reference.  Consequently, we need to
  explicitly hold a reference to ensure that the last reference isn't
  dropped before the scheduled work get a chance to call kobject_del.
  
  Also, rename delayed_delete to md_delayed_delete to that it is more
  obvious in a stack trace which code is to blame.
 
 I don't know...  You still get kobject_del() and export_rdev()
 in unpredictable order; sure, it won't be freed under you, but...

I cannot see that that would matter.
kobject_del deletes the object from the kobj tree and frees sysfs.
export_rdev disconnects the objects from md structures and releases
the connection with the device.  They are quite independent.

 
 What is that deadlock problem, anyway?  I don't see anything that
 would look like an obvious candidate in the stuff you are delaying...

Maybe it isn't there any more

Once upon a time, when I 
   echo remove > /sys/block/mdX/md/dev-YYY/state

sysfs_write_file would hold buffer->sem while calling my store
handler.
When my store handler tried to delete the relevant kobject, it would
eventually call orphan_all_buffers which would try to take buf->sem
and deadlock.

orphan_all_buffers doesn't exist any more, so maybe the deadlock is
gone too.
However the comment at the top of sysfs_schedule_callback in
sysfs/file.c says:

 *
 * sysfs attribute methods must not unregister themselves or their parent
 * kobject (which would amount to the same thing).  Attempts to do so will
 * deadlock, since unregistration is mutually exclusive with driver
 * callbacks.
 *

so I'm inclined to leave the code as it is... of course the comment
could be well out of date.

NeilBrown


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
 On Mon, Jan 14, 2008 at 02:21:45PM +1100, Neil Brown wrote:
 
  Maybe it isn't there any more
  
  Once upon a time, when I 
 echo remove > /sys/block/mdX/md/dev-YYY/state
 
 Egads.  And just what will protect you from parallel callers
 of state_store()?  buffer->mutex does *not* do that - it only
 gives you exclusion on given struct file.  Run the command
 above from several shells and you've got independent open
 from each redirect = different struct file *and* different
 buffer for each = no exclusion whatsoever.

well in -mm, rdev_attr_store gets a lock on
rdev->mddev->reconfig_mutex.
It doesn't test if rdev->mddev is NULL though, so if the write happens
after unbind_rdev_from_array, we lose.
A test for NULL would be easy enough.  And I think that the mddev
won't actually disappear until the rdevs are all gone (your subsequent
comment about kobject_del ordering seems to confirm that) so a simple test
for NULL should be sufficient.

 
 And _that_ is present right in the mainline tree - it's unrelated
 to -mm kobject changes.
 
 BTW, yes, you do have a deadlock there - kobject_del() will try to evict
 children, which will include waiting for currently running ->store()
 to finish, which will include the caller since .../state *is* a child of
 that sucker.
 
 The real problem is the lack of any kind of exclusion considerations in
 md.c itself, AFAICS.  Fun with ordering is secondary (BTW, yes, it is
 a problem - will sysfs ->store() to attribute between export_rdev() and
 kobject_del() work correctly?)

Probably not.  The possibility that rdev->mddev could be NULL would
break a lot of these.  Maybe I should delay setting rdev->mddev to
NULL until after the kobject_del.  Then audit them all.

Thanks.  I'll see what I can come up with.

NeilBrown


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
 
 Thanks.  I'll see what I can come up with.

How about this, against current -mm

On both the read and write path for an rdev attribute, we
call mddev_lock, first checking that mddev is not NULL.
Once we get the lock, we check again.
If rdev->mddev is not NULL, we know it will stay that way as it only
gets cleared under the same lock.

While in the rdev show/store routines, we know that the mddev cannot
get freed, due to the kobject relationships.

rdev_size_store is awkward because it has to drop the lock.  So we
take a copy of rdev->mddev before the drop, and we are safe...

Comments?

NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |   35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-14 12:26:15.0 +1100
+++ ./drivers/md/md.c   2008-01-14 17:05:53.0 +1100
@@ -1998,9 +1998,11 @@ rdev_size_store(mdk_rdev_t *rdev, const 
	char *e;
	unsigned long long size = simple_strtoull(buf, &e, 10);
	unsigned long long oldsize = rdev->size;
+	mddev_t *my_mddev = rdev->mddev;
+
	if (e==buf || (*e && *e != '\n'))
		return -EINVAL;
-	if (rdev->mddev->pers)
+	if (my_mddev->pers)
		return -EBUSY;
	rdev->size = size;
	if (size > oldsize && rdev->mddev->external) {
@@ -2013,7 +2015,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
		int overlap = 0;
		struct list_head *tmp, *tmp2;
 
-		mddev_unlock(rdev->mddev);
+		mddev_unlock(my_mddev);
		for_each_mddev(mddev, tmp) {
			mdk_rdev_t *rdev2;
 
@@ -2033,7 +2035,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
				break;
			}
		}
-		mddev_lock(rdev->mddev);
+		mddev_lock(my_mddev);
		if (overlap) {
			/* Someone else could have slipped in a size
			 * change here, but doing so is just silly.
@@ -2045,8 +2047,8 @@ rdev_size_store(mdk_rdev_t *rdev, const 
			return -EBUSY;
		}
	}
-	if (size < rdev->mddev->size || rdev->mddev->size == 0)
-		rdev->mddev->size = size;
+	if (size < my_mddev->size || my_mddev->size == 0)
+		my_mddev->size = size;
	return len;
 }
 
@@ -2067,10 +2069,21 @@ rdev_attr_show(struct kobject *kobj, str
 {
	struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+	mddev_t *mddev = rdev->mddev;
+	ssize_t rv;
 
	if (!entry->show)
		return -EIO;
-	return entry->show(rdev, page);
+
+	rv = mddev ? mddev_lock(mddev) : -EBUSY;
+	if (!rv) {
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->show(rdev, page);
+		mddev_unlock(mddev);
+	}
+	return rv;
 }
 
 static ssize_t
@@ -2079,15 +2092,19 @@ rdev_attr_store(struct kobject *kobj, st
 {
	struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
-	int rv;
+	ssize_t rv;
+	mddev_t *mddev = rdev->mddev;
 
	if (!entry->store)
		return -EIO;
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;
-	rv = mddev_lock(rdev->mddev);
+	rv = mddev ? mddev_lock(mddev): -EBUSY;
	if (!rv) {
-		rv = entry->store(rdev, page, length);
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->store(rdev, page, length);
		mddev_unlock(rdev->mddev);
	}
	return rv;


Re: raid5 stuck in degraded, inactive and dirty mode

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 On Wed, Jan 09, 2008 at 07:16:34PM +1100, CaT wrote:
   But I suspect that --assemble --force would do the right thing.
   Without more details, it is hard to say for sure.
  
  I suspect so aswell but throwing caution into the wind erks me wrt this
  raid array. :)
 
 Sorry. Not to be a pain but considering the previous email with all the
 examine dumps, etc would the above be the way to go? I just don't want
 to have missed something and bugger the array up totally.

Yes, definitely.

The superblocks look perfectly normal for a single drive failure
followed by a crash.  So --assemble --force is the way to go.

Technically you could have some data corruption if a write was under
way at the time of the crash.  In that case the parity block of that
stripe could be wrong, so the recovered data for the missing device
could be wrong.
This is why you are required to use --force - to confirm that you
are aware that there could be a problem.

It would be worth running fsck just to be sure that nothing critical
has been corrupted.  Also if you have a recent backup, I wouldn't
recycle it until I was fairly sure that all your data was really safe.

But in my experience the chance of actual data corruption in this
situation is fairly low.

NeilBrown
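
A sketch of that sequence, with hypothetical device names standing in for the
real members:

   # force assembly from the surviving members plus the failed-then-crashed one
   mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

   # read-only check before trusting the data
   fsck -n /dev/md0

   # then add a replacement for the failed drive to start recovery
   mdadm /dev/md0 --add /dev/sdf1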


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 
 It looks to me like md inspects and attempts to assemble after each 
 drive controller is scanned (from dmesg, there appears to be a failed 
 bind on the first three devices after they are scanned, and then again 
 when the second controller is scanned).  Would the scan order cause a 
 spare to be swapped in?
 

This suggests that "mdadm --incremental" is being used to assemble the
arrays.  Every time udev finds a new device, it gets added to
whichever array it should be in.
If it is called as "mdadm --incremental --run", then it will get
started as soon as possible, even if it is degraded.  Without the
--run, it will wait until all devices are available.

Even with mdadm --incremental --run, you shouldn't get a resync if
the last device is added before the array is written to.

What distro are you running?
What does
   grep -R mdadm /etc/udev

show?

NeilBrown
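
For reference, a rough sketch of the two incremental modes (the device name is
hypothetical):

   # array is started only once every expected member has been seen
   mdadm --incremental /dev/sdb1

   # array is started as soon as it can run, even if still degraded
   mdadm --incremental --run /dev/sdb1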


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 distro: Ubuntu 7.10
 
 Two files show up...
 
 85-mdadm.rules:
 # This file causes block devices with Linux RAID (mdadm) signatures to
 # automatically cause mdadm to be run.
 # See udev(8) for syntax
 
 SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
 RUN+="watershed /sbin/mdadm --assemble --scan --no-degraded"

 
 I see.  So udev is invoking the assemble command as soon as it detects 
 the devices.  So is it possible that the spare is not the last drive to 
 be detected and mdadm assembles too soon?

The '--no-degraded' should stop it from assembling until all expected
devices have been found.  It could assemble before the spare is found,
but should not assemble before all the data devices have been found.

The dmesg trace you included in your first mail doesn't actually
show anything wrong - it never starts an incomplete array.
Can you try again and get a trace where there definitely is a rebuild
happening.

And please don't drop linux-raid from the 'cc' list.

NeilBrown


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 One quick question about those rules.  The 65-mdadm rule looks like it 
 checks ACTIVE arrays for filesystems, and the 85 rule assembles arrays.  
 Shouldn't they run in the other order?
 

They are fine.  The '65' rule applies to arrays.  I.e. it fires on an
array device once it has been started.
The '85' rule applies to component devices.

They are quite independent.

NeilBrown


 
 
 
 distro: Ubuntu 7.10
 
 Two files show up...
 
 85-mdadm.rules:
 # This file causes block devices with Linux RAID (mdadm) signatures to
 # automatically cause mdadm to be run.
 # See udev(8) for syntax
 
 SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
 RUN+="watershed /sbin/mdadm --assemble --scan --no-degraded"
 
 
 
 65-mdadm.vol_id.rules:
 # This file causes Linux RAID (mdadm) block devices to be checked for
 # further filesystems if the array is active.
 # See udev(8) for syntax
 
 SUBSYSTEM!="block", GOTO="mdadm_end"
 KERNEL!="md[0-9]*", GOTO="mdadm_end"
 ACTION!="add|change", GOTO="mdadm_end"
 
 # Check array status
 ATTR{md/array_state}=="|clear|inactive", GOTO="mdadm_end"
 
 # Obtain array information
 IMPORT{program}="/sbin/mdadm --detail --export $tempnode"
 ENV{MD_NAME}=="?*", SYMLINK+="disk/by-id/md-name-$env{MD_NAME}"
 ENV{MD_UUID}=="?*", SYMLINK+="disk/by-id/md-uuid-$env{MD_UUID}"
 
 # by-uuid and by-label symlinks
 IMPORT{program}="vol_id --export $tempnode"
 OPTIONS="link_priority=-100"
 ENV{ID_FS_USAGE}=="filesystem|other|crypto", ENV{ID_FS_UUID_ENC}=="?*", \
 SYMLINK+="disk/by-uuid/$env{ID_FS_UUID_ENC}"
 ENV{ID_FS_USAGE}=="filesystem|other", ENV{ID_FS_LABEL_ENC}=="?*", \
 SYMLINK+="disk/by-label/$env{ID_FS_LABEL_ENC}"
 
 
 I see.  So udev is invoking the assemble command as soon as it detects 
 the devices.  So is it possible that the spare is not the last drive to 
 be detected and mdadm assembles too soon?
 
 
 
 Neil Brown wrote:
  On Thursday January 10, [EMAIL PROTECTED] wrote:

  It looks to me like md inspects and attempts to assemble after each 
  drive controller is scanned (from dmesg, there appears to be a failed 
  bind on the first three devices after they are scanned, and then again 
  when the second controller is scanned).  Would the scan order cause a 
  spare to be swapped in?
 
  
 
  This suggests that mdadm --incremental is being used to assemble the
  arrays.  Every time udev finds a new device, it gets added to
  whichever array it should be in.
  If it is called as "mdadm --incremental --run", then it will get
  started as soon as possible, even if it is degraded.  Without the
  --run, it will wait until all devices are available.
 
  Even with mdadm --incremental --run, you shouldn't get a resync if
  the last device is added before the array is written to.
 
  What distro are you running?
  What does
 grep -R mdadm /etc/udev
 
  show?
 
  NeilBrown
 



Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 (Sorry- yes it looks like I posted an incorrect dmesg extract)

This still doesn't seem to match your description.
I see:

 [   41.247389] md: bind<sdf1>
 [   41.247584] md: bind<sdb1>
 [   41.247787] md: bind<sda1>
 [   41.247971] md: bind<sdc1>
 [   41.248151] md: bind<sdg1>
 [   41.248325] md: bind<sde1>
 [   41.256718] raid5: device sde1 operational as raid disk 0
 [   41.256771] raid5: device sdc1 operational as raid disk 4
 [   41.256821] raid5: device sda1 operational as raid disk 3
 [   41.256870] raid5: device sdb1 operational as raid disk 2
 [   41.256919] raid5: device sdf1 operational as raid disk 1
 [   41.257426] raid5: allocated 5245kB for md0
 [   41.257476] raid5: raid level 5 set md0 active with 5 out of 5 
 devices, algorithm 2

which looks like 'md0' started with 5 of 5 drives, plus g1 is there as
a spare.  And

 [   41.312250] md: bind<sdf2>
 [   41.312476] md: bind<sdb2>
 [   41.312711] md: bind<sdg2>
 [   41.312922] md: bind<sdc2>
 [   41.313138] md: bind<sda2>
 [   41.313343] md: bind<sde2>
 [   41.313452] md: md1: raid array is not clean -- starting background 
 reconstruction
 [   41.322189] raid5: device sde2 operational as raid disk 0
 [   41.322243] raid5: device sdc2 operational as raid disk 4
 [   41.322292] raid5: device sdg2 operational as raid disk 3
 [   41.322342] raid5: device sdb2 operational as raid disk 2
 [   41.322391] raid5: device sdf2 operational as raid disk 1
 [   41.322823] raid5: allocated 5245kB for md1
 [   41.322872] raid5: raid level 5 set md1 active with 5 out of 5 
 devices, algorithm 2

md1 also assembled with 5/5 drives and sda2 as a spare.  
This one was not shut down cleanly so it started a resync.  But there
is no evidence of anything starting degraded.



NeilBrown


Re: The effects of multiple layers of block drivers

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 Hello,
 
 I am starting to dig into the Block subsystem to try and uncover the
 reason for some data I lost recently.  My situation is that I have
 multiple block drivers on top of each other and am wondering how the
 effectss of a raid 5 rebuild would affect the block devices above it.

It should just work - no surprises.  raid5 is just a block device
like any other.  When doing a rebuild it might be a bit slower, but
that is all.

 
 The layers are raid 5 -> lvm -> cryptoloop.  It seems that after the
 raid 5 device was rebuilt by adding in a new disk, that the cryptoloop
 doesn't have a valid ext3 partition on it.

There was a difference of opinion between raid5 and dm-crypt which
could cause some corruption.
What kernel version are you using, and are you using dm-crypt or loop
(e.g. losetup) with encryption?


 
 As a raid device re-builds is there ant rearranging of sectors or
 corresponding blocks that would effect another block device on top of it?

No.

NeilBrown


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote:
  w.r.t. dan's cfq comments -- i really don't know the details, but does
  this mean cfq will misattribute the IO to the wrong user/process?  or is
  it just a concern that CPU time will be spent on someone's IO?  the latter
  is fine to me... the former seems sucky because with today's multicore
  systems CPU time seems cheap compared to IO.
 
 
 I do not see this affecting the time slicing feature of cfq, because
 as Neil says the work has to get done at some point.   If I give up
 some of my slice working on someone else's I/O chances are the favor
 will be returned in kind since the code does not discriminate.  The
 io-priority capability of cfq currently does not work as advertised
 with current MD since the priority is tied to the current thread and
 the thread that actually submits the i/o on a stripe is
 non-deterministic.  So I do not see this change making the situation
 any worse.  In fact, it may make it a bit better since there is a
 higher chance for the thread submitting i/o to MD to do its own i/o to
 the backing disks.
 
 Reviewed-by: Dan Williams [EMAIL PROTECTED]

Thanks.
But I suspect you didn't test it with a bitmap :-)
I ran the mdadm test suite and it hit a problem - easy enough to fix.

I'll look out for any other possible related problem (due to raid5d
running in different processes) and then submit it.

Thanks,
NeilBrown


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
 On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
  i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
  which was Neil's change in 2.6.22 for deferring generic_make_request 
  until there's enough stack space for it.
  
 
 Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
 by preventing recursive calls to generic_make_request.  However the
 following conditions can cause raid5 to hang until 'stripe_cache_size' is
 increased:
 

Thanks for pursuing this guys.  That explanation certainly sounds very
credible.

The generic_make_request_immed is a good way to confirm that we have
found the bug,  but I don't like it as a long term solution, as it
just reintroduced the problem that we were trying to solve with the
problematic commit.

As you say, we could arrange that all request submission happens in
raid5d and I think this is the right way to proceed.  However we can
still take some of the work into the thread that is submitting the
IO by calling raid5d() at the end of make_request, like this.

Can you test it please?  Does it seem reasonable?

Thanks,
NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c|2 +-
 ./drivers/md/raid5.c |4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/md.c   2008-01-10 11:08:02.0 +1100
@@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
	if (mddev->ro)
		return;
 
-	if (signal_pending(current)) {
+	if (current == mddev->thread->tsk && signal_pending(current)) {
		if (mddev->pers->sync_request) {
			printk(KERN_INFO "md: %s in immediate safe mode\n",
			       mdname(mddev));

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/raid5.c2008-01-10 11:06:54.0 +1100
@@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
	}
 }
 
+static void raid5d (mddev_t *mddev);
 
 static int make_request(struct request_queue *q, struct bio * bi)
 {
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
				goto retry;
			}
			finish_wait(&conf->wait_for_overlap, &w);
-			handle_stripe(sh, NULL);
+			set_bit(STRIPE_HANDLE, &sh->state);
			release_stripe(sh);
		} else {
			/* cannot get stripe for read-ahead, just give-up */
@@ -3569,6 +3570,7 @@ static int make_request(struct request_q
			  test_bit(BIO_UPTODATE, &bi->bi_flags)
				? 0 : -EIO);
	}
+	raid5d(mddev);
return 0;
 }
 


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
 On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote:
  On Wednesday January 9, [EMAIL PROTECTED] wrote:
 
  Can you test it please?
 
 This passes my failure case.

Thanks!

 
  Does it seem reasonable?
 
 What do you think about limiting the number of stripes the submitting
 thread handles to be equal to what it submitted?  If I'm a stripe that
 only submits 1 stripe worth of work should I get stuck handling the
 rest of the cache?

Dunno
Someone has to do the work, and leaving it all to raid5d means that it
all gets done on one CPU.
I expect that most of the time the queue of ready stripes is empty so
make_request will mostly only handle its own stripes anyway.
The times that it handles other threads' stripes will probably balance
out with the times that other threads handle this thread's stripes.

So I'm inclined to leave it as "do as much work as is available to be
done" as that is simplest.  But I can probably be talked out of it
with a convincing argument

NeilBrown


Re: raid5 stuck in degraded, inactive and dirty mode

2008-01-08 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
 
 I'd provide data dumps of --examine and friends but I'm in a situation
 where transferring the data would be a right pain. I'll do it if need
 be, though.
 
 So, what can I do? 

Well, providing the output of --examine would help a lot.

But I suspect that --assemble --force would do the right thing.
Without more details, it is hard to say for sure.

NeilBrown


Re: Raid 1, can't get the second disk added back in.

2008-01-07 Thread Neil Brown
On Monday January 7, [EMAIL PROTECTED] wrote:
 Problem is not raid, or at least not obviously raid related.  The 
 problem is that the whole disk, /dev/hdb is unavailable. 

Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?

good luck :-)

NeilBrown


Re: Why mdadm --monitor --program sometimes only gives 2 command-line arguments to the program?

2008-01-07 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
 
 Hi all,
 
 I need to monitor my RAID and if it fails, I'd like to call my-script to
 deal with the failure.
 
 I did: 
 mdadm --monitor --program my-script --delay 60 /dev/md1
 
 And then, I simulate a failure with
 mdadm --manage --set-faulty /dev/md1 /dev/sda2
 mdadm /dev/md1 --remove /dev/sda2
 
 I hope the mdadm monitor function can pass all three command-line
 arguments to my-script, including the name of the event, the name of the
 md device and the name of a related device if relevant.
 
 But my-script doesn't get the third one, which should be /dev/sda2. Is
 this not relevant?
 
 If I really need to know it's /dev/sda2 that fails, what can I do?

What version of mdadm are you using?
I'm guessing 2.6, 2.6.1, or 2.6.2.
There was a bug introduced in 2.6 that was fixed in 2.6.3 that would
have this effect.

NeilBrown


Re: Raid 1, new disk can't be added after replacing faulty disk

2008-01-07 Thread Neil Brown
On Monday January 7, [EMAIL PROTECTED] wrote:
 On Jan 7, 2008 6:44 AM, Radu Rendec [EMAIL PROTECTED] wrote:
  I'm experiencing trouble when trying to add a new disk to a raid 1 array
  after having replaced a faulty disk.
 
 [..]
  # mdadm --version
  mdadm - v2.6.2 - 21st May 2007
 
 [..]
  However, this happens with both mdadm 2.6.2 and 2.6.4. I downgraded to
  2.5.4 and it works like a charm.
 
 Looks like you are running into the issue described here:
 http://marc.info/?l=linux-raid&m=119892098129022&w=2

I cannot easily reproduce this.  I suspect it is sensitive to the
exact size of the devices involved.

Please test this patch and see if it fixes the problem.
If not, please tell me the exact sizes of the partition being used
(e.g. cat /proc/partitions) and I will try harder to reproduce it.

Thanks,
NeilBrown



diff --git a/super1.c b/super1.c
index 2b096d3..9eec460 100644
--- a/super1.c
+++ b/super1.c
@@ -903,7 +903,7 @@ static int write_init_super1(struct supertype *st, void 
*sbv,
 * for a bitmap.
 */
	array_size = __le64_to_cpu(sb->size);
-	/* work out how much space we left of a bitmap */
+	/* work out how much space we left for a bitmap */
	bm_space = choose_bm_space(array_size);
 
	switch(st->minor_version) {
@@ -913,6 +913,8 @@ static int write_init_super1(struct supertype *st, void 
*sbv,
		sb_offset &= ~(4*2-1);
		sb->super_offset = __cpu_to_le64(sb_offset);
		sb->data_offset = __cpu_to_le64(0);
+		if (sb_offset - bm_space < array_size)
+			bm_space = sb_offset - array_size;
		sb->data_size = __cpu_to_le64(sb_offset - bm_space);
break;
case 1:


Re: Raid 1, can't get the second disk added back in.

2008-01-06 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
 
 Since /dev/hdb5 has been part of this array before you should use  
 --re-add instead of --add.
 Kind regards,
 Alex.

That is not correct.

--re-add is only needed for arrays without metadata, for which you use
--build to start them.

NeilBrown


Re: Raid 1, can't get the second disk added back in.

2008-01-06 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED]:~# mdadm /dev/md0 --add /dev/hdb5
 mdadm: Cannot open /dev/hdb5: Device or resource busy
 
 All the solutions I've been able to google fail with the busy.  There is 
 nothing that I can find that might be  using /dev/hdb5 except the raid 
 device and it appears it's not either.

Very odd. But something must be using it.

What does
   ls -l /sys/block/hdb/hdb5/holders
show?
What about
   cat /proc/mounts
   cat /proc/swaps
   lsof /dev/hdb5
  
??
NeilBrown


Re: stopped array, but /sys/block/mdN still exists.

2008-01-03 Thread Neil Brown
On Thursday January 3, [EMAIL PROTECTED] wrote:
 
 So what happens if I try to _use_ that /sys entry? For instance run a 
 script which reads data, or sets the stripe_cache_size higher, or 
 whatever? Do I get back status, ignored, or system issues?

Try it:-)

The stripe_cache_size attributes will disappear (it is easy to remove
attributes, and stripe_cache_size is only meaningful for certain raid
levels).
Other attributes will return 0 or some equivalent, though I think
chunk_size will have the old value.

NeilBrown
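
A quick way to see this, assuming the stopped array was /dev/md2:

   mdadm --stop /dev/md2
   cat /sys/block/md2/md/array_state        # shows the array as inactive/clear
   cat /sys/block/md2/md/chunk_size         # probably still the old value
   cat /sys/block/md2/md/stripe_cache_size  # attribute disappears for a stopped array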



Re: PROBLEM: RAID5 reshape data corruption

2008-01-03 Thread Neil Brown
On Monday December 31, [EMAIL PROTECTED] wrote:
 Ok, since my previous thread didn't seem to attract much attention,
 let me try again.

Thank you for your report and your patience.

 An interrupted RAID5 reshape will cause the md device in question to
 contain one corrupt chunk per stripe if resumed in the wrong manner.
 A testcase can be found at http://www.nagilum.de/md/ .
 The first testcase can be initialized with start.sh the real test
 can then be run with test.sh. The first testcase also uses dm-crypt
 and xfs to show the corruption.

It looks like this can be fixed with the patch:

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2008-01-04 09:20:54.0 +1100
+++ ./drivers/md/raid5.c2008-01-04 09:21:05.0 +1100
@@ -2865,7 +2865,7 @@ static void handle_stripe5(struct stripe
		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
	}
 
-	if (s.expanding && s.locked == 0)
+	if (s.expanding && s.locked == 0 && s.req_compute == 0)
		handle_stripe_expansion(conf, sh, NULL);
 
	if (sh->ops.count)


With this patch in place, the v2 test only reports errors after the end
of the original array, as you would expect (the new space is
initialised to 0).

 I'm not just interested in a simple behaviour fix I'm also interested
 in what actually happens and if possible a repair program for that
 kind of data corruption.

What happens is that when reshape happens while a device is missing,
the data on that device should be computed from the other data devices
and parity.  However because of the above bug, the data is copied into
the new layout before the compute is complete.  This means that the
data that was on that device is really lost beyond recovery.

I'm really sorry about that, but there is nothing that can be done to
recover the lost data.

NeilBrown


Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Neil Brown
On Thursday January 3, [EMAIL PROTECTED] wrote:
 
 On closer look the safer test is:
 
   !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending).
 
 The 'req_compute' field only indicates that a 'compute_block' operation
 was requested during this pass through handle_stripe so that we can
 issue a linked chain of asynchronous operations.
 
 ---
 
 From: Neil Brown [EMAIL PROTECTED]

Technically that should probably be
  From: Dan Williams [EMAIL PROTECTED]

now, and then I add
  Acked-by: NeilBrown [EMAIL PROTECTED]

because I completely agree with your improvement.

We should keep an eye out for when Andrew commits this and make sure
the right patch goes in...

Thanks,
NeilBrown



Re: stopped array, but /sys/block/mdN still exists.

2008-01-02 Thread Neil Brown
On Wednesday January 2, [EMAIL PROTECTED] wrote:
 This isn't a high priority issue or anything, but I'm curious:
 
 I --stop(ped) an array but /sys/block/md2 remained largely populated.
 Is that intentional?

It is expected.
Because of the way that md devices are created (just open the
device-special file), it is very hard to make them disappear in a
race-free manner.  I tried once and failed.  It is probably getting
close to trying again, but as you say: it isn't a high priority issue.

NeilBrown


Re: Last ditch plea on remote double raid5 disk failure

2007-12-31 Thread Neil Brown
On Monday December 31, [EMAIL PROTECTED] wrote:
 
 I'm hoping that if I can get raid5 to continue despite the errors, I
 can bring back up enough of the server to continue, a bit like the
 remount-ro option in ext2/ext3.
 
 If not, oh well...

Sorry, but it is oh well.

I could probably make it behave a bit better in this situation, but
not in time for you.


NeilBrown


Re: mdadm --stop goes off and never comes back?

2007-12-22 Thread Neil Brown
On Wednesday December 19, [EMAIL PROTECTED] wrote:
 On 12/19/07, Jon Nelson [EMAIL PROTECTED] wrote:
  On 12/19/07, Neil Brown [EMAIL PROTECTED] wrote:
   On Tuesday December 18, [EMAIL PROTECTED] wrote:
   
I tried to stop the array:
   
mdadm --stop /dev/md2
   
and mdadm never came back. It's off in the kernel somewhere. :-(

Looking at your stack traces, you have the mdadm -S holding
an md lock and trying to get a sysfs lock as part of tearing down the
array, and 'hald' is trying to read some attribute in
   /sys/block/md
and is holding the sysfs lock and trying to get the md lock.
A classic AB-BA deadlock.

 
 NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.
 

It is fixed in mainline with some substantial changes to sysfs.
I don't imagine they are likely to get back ported to openSUSE, but
you could try logging a bugzilla if you like.

The 'hald' process is interruptible and killing it would release the
deadlock.

I suspect you have to be fairly unlucky to lose the race but it is
obviously quite possible.

I don't think there is anything I can do on the md side to avoid the
bug.

NeilBrown


Re: raid5 resizing

2007-12-19 Thread Neil Brown
On Wednesday December 19, [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm thinking of slowly replacing disks in my raid5 array with bigger
 disks and then resize the array to fill up the new disks. Is this
 possible? Basically I would like to go from:
 
 3 x 500gig RAID5 to 3 x 1tb RAID5, thereby going from 1tb to 2tb of
 storage.
 
 It seems like it should be, but... :)

Yes.

mdadm --grow /dev/mdX --size=max

NeilBrown
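
A fuller sketch of the whole replace-then-grow procedure, assuming the array
is /dev/md0 and the member names below are hypothetical:

   # replace each 500GB member with a 1TB disk, one at a time
   mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
   mdadm /dev/md0 --add /dev/sdd1     # the new, larger disk
   watch cat /proc/mdstat             # wait for the rebuild before the next disk

   # once all members are 1TB, grow the array into the extra space
   mdadm --grow /dev/md0 --size=max

   # finally grow the filesystem (ext3 example)
   resize2fs /dev/md0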


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Neil Brown
On Tuesday December 18, [EMAIL PROTECTED] wrote:
 This just happened to me.
 Create raid with:
 
 mdadm --create /dev/md2 --level=raid10 --raid-devices=3
 --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
 
 cat /proc/mdstat
 
 md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
   5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
   [==..]  resync = 14.6% (859968/5855424)
 finish=1.3min speed=61426K/sec
 
 Some log messages:
 
 Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
 starting background reconstruction
 Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
 of 3 devices
 Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
 Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
 KB/sec/disk.
 Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
 bandwidth (but not more than 20 KB/sec) for resync.
 Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
 5855424 blocks.
 Dec 18 15:03:36 turnip kernel: md: md2: resync done.
 Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
 
 I tried to stop the array:
 
 mdadm --stop /dev/md2
 
 and mdadm never came back. It's off in the kernel somewhere. :-(
 
 kill, of course, has no effect.
 The machine still runs fine, the rest of the raids (md0 and md1) work
 fine (same disks).
 
 The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
 
 Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
 3943  1 (NOTLB)
 Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
  0092
 Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
 80617800 80617800
 Dec 18 15:09:13 turnip kernel:  8061d210 80617800
 80617800 
 Dec 18 15:09:13 turnip kernel: Call Trace:
 Dec 18 15:09:13 turnip kernel:  [803fac96]
 __mutex_lock_interruptible_slowpath+0x8b/0xca
 Dec 18 15:09:13 turnip kernel:  [802acccb] do_open+0x222/0x2a5
 Dec 18 15:09:13 turnip kernel:  [8038705d] md_seq_show+0x127/0x6c1
 Dec 18 15:09:13 turnip kernel:  [80275597] vma_merge+0x141/0x1ee
 Dec 18 15:09:13 turnip kernel:  [802a2aa0] seq_read+0x1bf/0x28b
 Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
 Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
 Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83
 
 
 
 What happened? Is there any debug info I can provide before I reboot?

Don't know... very odd.

The rest of the 'sysrq' output would possibly help.

NeilBrown


Re: Raid over 48 disks

2007-12-18 Thread Neil Brown
On Tuesday December 18, [EMAIL PROTECTED] wrote:
 We're investigating the possibility of running Linux (RHEL) on top of  
 Sun's X4500 Thumper box:
 
 http://www.sun.com/servers/x64/x4500/
 
 Basically, it's a server with 48 SATA hard drives. No hardware RAID.  
 It's designed for Sun's ZFS filesystem.
 
 So... we're curious how Linux will handle such a beast. Has anyone run  
 MD software RAID over so many disks? Then piled LVM/ext3 on top of  
 that? Any suggestions?
 
 Are we crazy to think this is even possible?

Certainly possible.
The default metadata is limited to 28 devices, but with
--metadata=1

you can easily use all 48 drives or more in the one array.  I'm not
sure if you would want to though.

If you just wanted an enormous scratch space and were happy to lose
all your data on a drive failure, then you could make a raid0 across
all the drives which should work perfectly and give you lots of
space.  But that probably isn't what you want.

I wouldn't create a raid5 or raid6 on all 48 devices.
RAID5 only survives a single device failure and with that many
devices, the chance of a second failure before you recover becomes
appreciable.

RAID6 would be much more reliable, but probably much slower.  RAID6
always needs to read or write every block in a stripe (i.e. it always
uses reconstruct-write to generate the P and Q blocks.  It never does
a read-modify-write like raid5 does).  This means that every write
touches every device so you have less possibility for parallelism
among your many drives.
It might be instructive to try it out though.

RAID10 would be a good option if you are happy with 24 drives' worth of
space.  I would probably choose a largish chunk size (256K) and use
the 'offset' layout.

Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use RAID0 to
combine them together.  This would give you adequate reliability and
performance and still a large amount of storage space.

Have fun!!!

NeilBrown
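
Two of the suggested layouts, sketched with hypothetical device names (the 48
drives assumed to be /dev/sdb ... /dev/sdaw):

   # RAID10 across all 48 drives, offset layout, 256K chunks
   mdadm --create /dev/md0 --metadata=1 --level=10 --layout=o2 --chunk=256 \
         --raid-devices=48 /dev/sd[b-z] /dev/sda[a-w]

   # or: six 8-drive RAID6 sets tied together with RAID0
   mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[b-i]
   #   ... repeat for /dev/md2 .. /dev/md6 with the remaining drives ...
   mdadm --create /dev/md10 --level=0 --raid-devices=6 /dev/md[1-6]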



Re: Cannot re-assemble Degraded RAID6 after crash

2007-12-17 Thread Neil Brown
On Monday December 17, [EMAIL PROTECTED] wrote:
 My system has crashed a couple of times, each time the two drives have
 dropped off of the RAID.
 
 Previously I simply did the following, which would take all night:
 
 mdadm -a --re-add /dev/md2 /dev/sde3
 mdadm -a --re-add /dev/md2 /dev/sdf3
 mdadm -a --re-add /dev/md3 /dev/sde5
 mdadm -a --re-add /dev/md3 /dev/sde5
 
 When I woke up in the morning, everything was happy...until it crashed
 again yesterday. This time, I get a message: /dev/md3 assembled from
 4 drives - not enough to start the array while not clean - consider
 --force.
 
 I can re-assemble /dev/md3 (sda5, sdb5, sdc5, sdd5, sde5 and sdf5) if
 I use -f, although all the other sets seem fine. I cannot --re-add
 the other partitions. 

What happens when you try to re-add those devices?
How about just --add.  --re-add is only needed for arrays without
metadata, in your case it should behave the same as --add.

NeilBrown


Re: Please Help!!! Raid 5 reshape failed!

2007-12-16 Thread Neil Brown
On Friday December 14, [EMAIL PROTECTED] wrote:
 
 gentoofs ~#mdadm --assemble /dev/md1 /dev/sdc /dev/sdd /dev/sdf
 mdadm: /dev/md1 assembled from 2 drives - not enough to start the array

Try adding --run.  or maybe --force.

NeilBrown


Re: [PATCH 007 of 7] md: Get name for block device in sysfs

2007-12-16 Thread Neil Brown
On Saturday December 15, [EMAIL PROTECTED] wrote:
 On Dec 14, 2007 7:26 AM, NeilBrown [EMAIL PROTECTED] wrote:
 
  Given an fd on a block device, returns a string like
 
  /block/sda/sda1
 
  which can be used to find related information in /sys.

 
 As pointed out to when you came up with the idea, we can't do this. A devpath
 is a path to the device and will not necessarily start with /block for block
 devices. It may start with /devices and can be much longer than
 BDEVNAME_SIZE*2  + 10.


When you say "will not necessarily" can I take that to mean that it
currently does, but it might (will) change??
In that case can we have the patch as it stands and when the path to
block devices in /sys changes, the ioctl can be changed at the same
time to match?

Or are you saying that as the kernel is today, some block devices
appear under /devices/..., in which case could you please give an
example?

Thanks,
NeilBrown


Re: mdadm break / restore soft mirror

2007-12-14 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:
  What you could do is set the number of devices in the array to 3 so
  they it always appears to be degraded, then rotate your backup drives
  through the array.  The number of dirty bits in the bitmap will
  steadily grow and so resyncs will take longer.  Once it crosses some
  threshold you set the array back to having 2 devices to that it looks
  non-degraded and clean the bitmap.  Then each device will need a full
  resync after which you will get away with partial resyncs for a while.
 
 I don't undertand why clearing the bitmap causes a rebuild of
 all devices. I think I have a conceptual misunderstanding.  Consider
 a RAID-1 and three physical disks involved, A,B,C
 
 1) A and B are in the RAID, everything is synced
 2) Create a bitmap on the array
 3) Fail + remove B
 4) Hot add C, wait for C to sync
 5) Fail + remove C
 6) Hot add B, wait for B to resync
 7) Goto step 3
 
 I understand that after a while we might want to clean the bitmap
 and that would trigger a full resync for drives B and C. I don't
 understand why it would ever cause a resync for drive A.

You are exactly correct.  That is what I meant, though I probably
didn't express it very clearly.

After you clean out the bitmap, any devices that are not in the array
at that time will need a full resync to come back in to the array.

NeilBrown
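
A sketch of one rotation of that cycle: drive A stays in the mirror, B and C
are the rotating backup drives (device names hypothetical):

   # one-off: add a write-intent bitmap to the array
   mdadm --grow /dev/md0 --bitmap=internal

   # rotate B out and C in
   mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
   mdadm /dev/md0 --add /dev/sdc1
   # wait for /proc/mdstat to show the sync has finished, then take B offsite;
   # the next rotation swaps the roles of B and C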


Re: Auto assembly errors with mdadm and 64K aligned partitions.

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:
 Good morning to Neil and everyone on the list, hope your respective
 days are going well.
 
 Quick overview.  We've isolated what appears to be a failure mode with
 mdadm assembling RAID1 (and presumably other level) volumes which
 kernel based RAID autostart is able to do correctly.
 
 We picked up on the problem with OES based systems with SAN attached
 volumes.  I am able to reproduce the problem under 2.6.23.9 UML with
 version 2.6.4 of mdadm.
 
 The problem occurs when partitions are aligned on a 64K boundary.  Any
 64K boundary seems to work, ie 128, 256 and 512 sector offsets.
 
 Block devices look like the following:
 
 ---
 cat /proc/partions:
 
 major minor  #blocks  name
 
   98 0 262144 ubda
   9816  10240 ubdb
   9817  10176 ubdb1
   9832  10240 ubdc
   9833  10176 ubdc1
 ---
 
 
 A RAID1 device was created and started consisting of the /dev/ubdb1
 and /dev/ubdc1 partitions.  An /etc/mdadm.conf file was generated
 which contains the following:
 
 ---
 DEVICE partitions
 ARRAY /dev/md0 level=raid1 num-devices=2 
 UUID=e604c49e:d3a948fd:13d9bc11:dbc82862
 ---
 
 
 The RAID1 device was shutdown.  The following assembly command yielded:
 
 ---
 mdadm -As
 
 mdadm: WARNING /dev/ubdc1 and /dev/ubdc appear to have very similar 
 superblocks.  If they are really different, please --zero the superblock 
 on one
   If they are the same or overlap, please remove one from the
   DEVICE list in mdadm.conf.
 ---

Yes.  This is one of the problems with v0.90 metadata, and with
DEVICE partitions.

As the partitions start on a 64K alignment, and the metadata is 64K
aligned, the metadata appears to look right for both the whole device
and for the last partition on the device, and mdadm cannot tell the
difference.

With v1.x metadata, we store the superblock offset which allows us to
tell if we have mis-identified a superblock that was meant to be part
of a partition or of the whole device.

If you make your DEVICE line a little more restrictive, e.g.

 DEVICE /dev/ubd?1

then it will also work.

Or just don't use partitions.  Make the array from /dev/ubdb and
/dev/ubdc.
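For example, the mdadm.conf for this UML setup might be tightened to the
following (a sketch, reusing the UUID from the listing above):

   DEVICE /dev/ubd?1
   ARRAY /dev/md0 level=raid1 num-devices=2 UUID=e604c49e:d3a948fd:13d9bc11:dbc82862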


NeilBrown


RE: mdadm break / restore soft mirror

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:

 How do I create the internal bitmap?  man mdadm didn't shed any
 light and my brief excursion into google wasn't much more helpful. 

  mdadm --grow --bitmap=internal /dev/mdX
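Once created, the bitmap can be checked with, for instance (illustrative
commands; /dev/md0 and /dev/sda1 are assumed names):

   cat /proc/mdstat                  # shows a "bitmap: N/M pages" line for the array
   mdadm --examine-bitmap /dev/sda1  # dumps the bitmap superblock from a member device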

 
 The version I have installed is mdadm-1.12.0-5.i386 from RedHat
 which would appear to be way out of date! 

WAY!  mdadm 2.0 would be an absolute minimum, and Linux 2.6.13 an
absolute minimum for the kernel; probably something closer to 2.6.20
would be a good idea.

NeilBRown


Re: mdadm break / restore soft mirror

2007-12-12 Thread Neil Brown
On Wednesday December 12, [EMAIL PROTECTED] wrote:
 Hi, 
 
   Question for you guys. 
 
   A brief history: 
   RHEL 4 AS 
   I have a partition with way too many small files on it (usually around a couple 
 of million) that needs to be backed up; standard
 
   methods mean that a restore is impossibly slow due to the sheer volume of 
 files. 
   Solution, raw backup /restore of the device.  However the partition is 
 permanently being accessed. 
 
   Proposed solution is to use a software raid mirror.  Before backup starts, 
 break the soft mirror, unmount and back up the partition,
 
   then restore the soft mirror and let it resync / rebuild itself. 
 
   Would the above intentional break/fix of the mirror cause any problems? 

No, it should work fine.

If you can be certain that the device that you break out of the mirror
is never altered, then you could add an internal bitmap while the
array is split and the rebuild will go much faster.
However even mounting a device readonly will sometimes alter the
content (e.g. if ext3 needs to replay the journal) so you need to be
very careful.
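Purely as an illustration (device and path names are assumptions), one backup
cycle along those lines could look like:

   mdadm --grow /dev/md0 --bitmap=internal      # add the write-intent bitmap first

   mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
   dd if=/dev/sdb1 of=/backup/home.img bs=1M    # raw copy of the split half
   mdadm /dev/md0 --re-add /dev/sdb1            # bitmap-based partial resync

   # only safe if nothing wrote to /dev/sdb1 while it was out of the array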

NeilBrown


Re: [PATCH] (2nd try) force parallel resync

2007-12-06 Thread Neil Brown
On Thursday December 6, [EMAIL PROTECTED] wrote:
 Hello,
 
 here is the second version of the patch. With this version the sync_thread
 is also woken up on setting /sys/block/*/md/sync_force_parallel.
 Though, I still don't understand why md_wakeup_thread() is not working.

Could you give a little more detail on why you want this?  When do you
want multiple arrays on the same device to sync at the same time?
What exactly is the hardware like?

md threads generally run for a little while to perform some task, then
stop and wait to be needed again.  md_wakeup_thread says you are
needed again.

The resync/recovery thread is a bit different.  It just runs md_do_sync
once.  md_wakeup_thread is not really meaningful in that context.

What you want is:
wake_up(&resync_wait);

that will get any thread that is waiting for some other array to
resync to wake up and see if something needs to be done.

NeilBrown


Re: RAID mapper device size wrong after replacing drives

2007-12-06 Thread Neil Brown

I think you would have more luck posting this to
[EMAIL PROTECTED] - I think that is where support for device mapper
happens.

NeilBrown


On Thursday December 6, [EMAIL PROTECTED] wrote:
 
 Hi,
 
 I have a problem with my RAID array under Linux after upgrading to larger
 drives. I have a machine with Windows and Linux dual-boot which had a pair
 of 160GB drives in a RAID-1 mirror with 3 partitions: partition 1 = Windows
 boot partition (FAT32), partition 2 = Linux /boot (ext3), partition 3 =
 Windows system (NTFS). The Linux /root is on a separate physical drive. The
 dual boot is via Grub installed on the /boot partition, and this was all
 working fine.
 
 But I just upgraded the drives in the RAID pair, replacing them with 500GB
 drives. I did this by replacing one of the 160s with a new 500 and letting
 the RAID copy the drive, splitting the drives out of the RAID array and
 increasing the size of the last partition of the 500 (which I did under
 Windows since it's the Windows partition) then replacing the last 160 with the
 other 500 and having the RAID controller create a new array with the two
 500s, copying the drive that I'd copied from the 160. This worked great for
 Windows, and that now boots and sees a 500GB RAID drive with all the data
 intact.
 
 However, Linux has a problem and will not now boot all the way. It reports
 that the RAID /dev/mapper volume failed - the partition is beyond the
 boundaries of the disk. Running fdisk shows that it is seeing the larger
 partiton, but still sees the size of the RAID /dev/mapper drive as 160GB.
 Here is the fdisk output for one of the physical drives and for the RAID
 mapper drive:
 
 Disk /dev/sda: 500.1 GB, 500107862016 bytes
 255 heads, 63 sectors/track, 60801 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
 
Device Boot  Start End  Blocks   Id  System
 /dev/sda1   1 625 5018624b  W95 FAT32
 Partition 1 does not end on cylinder boundary.
 /dev/sda2 626 637   96390   83  Linux
 /dev/sda3   * 638   60802   4832645127  HPFS/NTFS
 
 
 Disk /dev/mapper/isw_bcifcijdi_Raid-0: 163.9 GB, 163925983232 bytes
 255 heads, 63 sectors/track, 19929 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
 
 Device Boot  Start End  Blocks  
 Id  System
 /dev/mapper/isw_bcifcijdi_Raid-0p1   1 625 5018624   
 b  W95 FAT32
 Partition 1 does not end on cylinder boundary.
 /dev/mapper/isw_bcifcijdi_Raid-0p2 626 637   96390  
 83  Linux
 /dev/mapper/isw_bcifcijdi_Raid-0p3   * 638   60802   483264512   
 7  HPFS/NTFS
 
 
 They differ only in the drive capacity and number of cylinders.
 
 I started to try to run a Linux reinstall, but it reports that the partition
 table on the mapper drive is invalid, giving an option to re-initialize it
 but saying that doing so will lose all the data on the drive.
 
 So questions:
 
 1. Where is the drive size information for the RAID mapper drive kept, and
 is there some way to patch it?
 
 2. Is there some way to re-initialize the RAID mapper drive without
 destroying the data on the drive?
 
 Thanks,
 Ian
 -- 
 View this message in context: 
 http://www.nabble.com/RAID-mapper-device-size-wrong-after-replacing-drives-tf4958354.html#a14200241
 Sent from the linux-raid mailing list archive at Nabble.com.
 


Re: Spontaneous rebuild

2007-12-02 Thread Neil Brown
On Sunday December 2, [EMAIL PROTECTED] wrote:
 
 Anyway, the problems are back: To test my theory that everything is
 alright with the CPU running within its specs, I removed one of the
 drives while copying some large files yesterday. Initially, everything
 seemed to work out nicely, and by the morning, the rebuild had finished.
 Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
 ran without gripes for some hours, but just now I saw md had started to
 rebuild the array again out of the blue:
 
 Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
 using ehci_hcd and address 4
 Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
  ^^
 Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
 KB/sec/disk.
 Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
 bandwidth (but not more than 20 KB/sec) for data-check.
  ^^
 Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
 488383936 blocks.
 Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
 using ehci_hcd and address 4
 

This isn't a resync, it is a data check.  Dec  2 is the first Sunday
of the month.  You probably have a crontab entry that does
   echo check > /sys/block/mdX/md/sync_action

early on the first Sunday of the month.  I know that Debian does this.

It is good to do this occasionally to catch sleeping bad blocks.
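A sketch of such a cron job (Debian ships something similar in
/etc/cron.d/mdadm; the array name and times here are made up):

   # /etc/cron.d/mdcheck - data check of md0 early on the first Sunday of the month
   30 1 * * Sun  root  [ "$(date +\%d)" -le 7 ] && echo check > /sys/block/md0/md/sync_action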

NeilBrown


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-02 Thread Neil Brown
On Sunday December 2, [EMAIL PROTECTED] wrote:
 
 I was curious: when running 10 dd's (which are writing to the RAID 5) 
 fine with no issues, why do they suddenly all go into D-state and give a read 
 100% priority?

So are you saying that the writes completely stalled while the read
was progressing?  How exactly did you measure that?

What kernel version are you running?

 
 Is this normal?

It shouldn't be.

NeilBrown


Re: assemble vs create an array.......

2007-11-29 Thread Neil Brown
On Thursday November 29, [EMAIL PROTECTED] wrote:
 Hello,
 I had created a raid 5 array on 3 232GB SATA drives. I had created one 
 partition (for /home) formatted with either xfs or reiserfs (I do not 
 recall).
 Last week I reinstalled my box from scratch with Ubuntu 7.10, with mdadm 
 v. 2.6.2-1ubuntu2.
 Then I made a rookie mistake: I --create instead of --assemble. The 
 recovery completed. I then stopped the array, realizing the mistake.
 
 1. Please make the warning more descriptive: ALL DATA WILL BE LOST, when 
 attempting to create an array over an existing one.

No matter how loud the warning is, people will get it wrong... unless
I make it actually impossible to corrupt data (which may not be
possible) in which case it will inconvenience many more people.

 2. Do you know of any way to recover from this mistake? Or at least what 
 filesystem it was formated with.

If you created the same array with the same devices and layout etc,
the data will still be there, untouched.
Try to assemble the array and use fsck on it.

When you create a RAID5 array, all that is changed is the metadata (at
the end of the device) and one drive is changed to be the xor of all
the others.
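A cautious way to attempt that recovery (a sketch; device names are
assumptions):

   mdadm --stop /dev/md0
   mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1

   file -s /dev/md0              # should identify the filesystem (xfs, reiserfs, ...)
   fsck -n /dev/md0              # read-only check; for xfs use xfs_repair -n instead
   mount -o ro /dev/md0 /mnt     # then copy the irreplaceable data off first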

 
 Any help would be greatly appreciated. I have hundreds of family digital 
 pictures and videos that are irreplaceable.

You have probably heard it before, but RAID is no replacement for
backups. 
My photos are on two separate computers, one with RAID.  And I will
be backing them up to DVD any day now ... really!!  Or maybe next
year, if I remember :-)

NeilBrown


Re: raid5 reshape/resync

2007-11-28 Thread Neil Brown
On Sunday November 25, [EMAIL PROTECTED] wrote:
 - Message from [EMAIL PROTECTED] -
  Date: Sat, 24 Nov 2007 12:02:09 +0100
  From: Nagilum [EMAIL PROTECTED]
 Reply-To: Nagilum [EMAIL PROTECTED]
   Subject: raid5 reshape/resync
To: linux-raid@vger.kernel.org
 
  Hi,
  I'm running 2.6.23.8 x86_64 using mdadm v2.6.4.
  I was adding a disk (/dev/sdf) to an existing raid5 (/dev/sd[a-e] - md0)
  During that reshape (at around 4%) /dev/sdd reported read errors and
  went offline.

Sad.

  I replaced /dev/sdd with a new drive and tried to reassemble the array
  (/dev/sdd was shown as removed and now as spare).

There must be a step missing here.
Just because one drive goes offline, that  doesn't mean that you need
to reassemble the array.  It should just continue with the reshape
until that is finished.  Did you shut the machine down, or did it crash,
or what?

  Assembly worked but it would not run unless I use --force.

That suggests an unclean shutdown.  Maybe it did crash?


  Since I'm always reluctant to use force I put the bad disk back in,
  this time as /dev/sdg . I re-added the drive and could run the array.
  The array started to resync (since the disk can be read until 4%) and
  then I marked the disk as failed. Now the array is active, degraded,
  recovering:

It should have restarted the reshape from wherever it was up to, so
it should have hit the read error almost immediately.  Do you remember
where it started the reshape from?  If it restarted from the beginning
that would be bad.

Did you just --assemble all the drives or did you do something else?

 
  What I find somewhat confusing/disturbing is that it does not appear to
  utilize /dev/sdd. What I see here could be explained by md doing a
  RAID5 resync from the 4 drives sd[a-c,e] to sd[a-c,e,f] but I would
  have expected it to use the new spare sdd for that. Also the speed is

md cannot recover to a spare while a reshape is happening.  It
completes the reshape, then does the recovery (as you discovered).

  unusually low which seems to indicate a lot of seeking as if two
  operations are happening at the same time.

Well reshape is always slow as it has to read from one part of the
drive and write to another part of the drive.

  Also when I look at the data rates it looks more like the reshape is
  continuing even though one drive is missing (possible but risky).

Yes, that is happening.

  Can someone relieve my doubts as to whether md does the right thing here?
  Thanks,

I believe it is doing the right thing.

 
 - End message from [EMAIL PROTECTED] -
 
 Ok, so the reshape tried to continue without the failed drive and  
 after that resynced to the new spare.

As I would expect.

 Unfortunately the result is a mess. On top of the Raid5 I have  

Hmm.  This I would not expect.

 dm-crypt and LVM.
 Although dmcrypt and LVM dont appear to have a problem the filesystems  
 on top are a mess now.

Can you be more specific about what sort of mess they are in?

NeilBrown


 I still have the failed drive, I can read the superblock from that  
 drive and up to 4% from the beginning and probably backwards from the  
 end towards that point.
 So in theory it could be possible to reorder the stripe blocks which  
 appears to have been messed up.(?)
 Unfortunately I'm not sure what exactly went wrong or what I did  
 wrong. Can someone please give me hint?
 Thanks,
 Alex.
 
 
 #_  __  _ __ http://www.nagilum.org/ \n icq://69646724 #
 #   / |/ /__  _(_) /_  _  [EMAIL PROTECTED] \n +491776461165 #
 #  // _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
 # /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
 #   /___/ x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
 
 
 
 
 cakebox.homeunix.net - all the machine one needs..
 


Re: raid6 check/repair

2007-11-28 Thread Neil Brown
On Thursday November 22, [EMAIL PROTECTED] wrote:
 Dear Neil,
 
 thank you very much for your detailed answer.
 
 Neil Brown wrote:
  While it is possible to use the RAID6 P+Q information to deduce which
  data block is wrong if it is known that either 0 or 1 datablocks is 
  wrong, it is *not* possible to deduce which block or blocks are wrong
  if it is possible that more than 1 data block is wrong.
 
 If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
 it *is* possible, to distinguish three cases:
 a) exactly zero bad blocks
 b) exactly one bad block
 c) more than one bad block
 
 Of course, it is only possible to recover from b), but one *can* tell,
 whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken.

On page 9 of 
  http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
at the end of section 4 it says:

  Finally, as a word of caution it should be noted that RAID-6 by
  itself cannot even detect, never mind recover from, dual-disk
  corruption. If two disks are corrupt in the same byte positions,
  the above algorithm will in general introduce additional data
  corruption by corrupting a third drive.

 
 The point that I'm trying to make is, that there does exist a specific
 case, in which recovery is possible, and that implementing recovery for
 that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific
conditions would lead to one drive having corrupt data, and would
correcting it on an occasional 'repair' pass be an appropriate
response?

Does the value justify the cost of extra code complexity?

 
  RAID is not designed to protect against bad RAM, bad cables, chipset 
  bugs, driver bugs, etc.  It is only designed to protect against drive 
  failure, where the drive failure is apparent.  i.e. a read must 
  return either the same data that was last written, or a failure 
  indication. Anything else is beyond the design parameters for RAID.
 
 I'm taking a more pragmatic approach here.  In my opinion, RAID should
 just protect my data, against drive failure, yes, of course, but if it
 can help me in case of occasional data corruption, I'd happily take
 that, too, especially if it doesn't cost extra... ;-)

Everything costs extra.  Code uses bytes of memory, requires
maintenance, and possibly introduces new bugs.  I'm not convinced the
failure mode that you are considering actually happens with a
meaningful frequency.

NeilBrown



Re: raid6 check/repair

2007-11-28 Thread Neil Brown
On Tuesday November 27, [EMAIL PROTECTED] wrote:
 Thiemo Nagel wrote:
  Dear Neil,
 
  thank you very much for your detailed answer.
 
  Neil Brown wrote:
  While it is possible to use the RAID6 P+Q information to deduce which
  data block is wrong if it is known that either 0 or 1 datablocks is 
  wrong, it is *not* possible to deduce which block or blocks are wrong
  if it is possible that more than 1 data block is wrong.
 
  If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
  it *is* possible, to distinguish three cases:
  a) exactly zero bad blocks
  b) exactly one bad block
  c) more than one bad block
 
  Of course, it is only possible to recover from b), but one *can* tell,
  whether the situation is a) or b) or c) and act accordingly.
 I was waiting for a response before saying "me too", but that's exactly 
 the case, there is a class of failures other than power failure or total 
 device failure which result in just the one identifiable bad sector 
 result. Given that the data needs to be read to realize that it is bad, 
 why not go the extra inch and fix it properly instead of redoing the p+q 
 which just makes the problem invisible rather than fixing it.
 
 Obviously this is a subset of all the things which can go wrong, but I 
 suspect it's a sizable subset.

Why do you think that it is a sizable subset?  Disk drives have internal
checksums which are designed to prevent corrupted data being returned.

If the data is getting corrupted on some bus between the CPU and the
media, then I suspect that your problem is big enough that RAID cannot
meaningfully solve it, and new hardware plus possibly a restore from
backup would be the only credible option.

NeilBrown


Re: [PATCH] Skip bio copy in full-stripe write ops

2007-11-23 Thread Neil Brown
On Friday November 23, [EMAIL PROTECTED] wrote:
 
  Hello all,
 
 Here is a patch which allows skipping the intermediate data copying between
 the bio requested to write and the disk cache in sh when a full-stripe write
 operation is under way.
 
 This improves the performance of write operations for some dedicated cases
 when big chunks of data are being sequentially written to a RAID array, but in
 general eliminating the disk cache slows the performance down.

There is a subtlety here that we need to be careful not to miss.
The stripe cache has an important 'correctness' aspect that you might be
losing.

When a write request is passed to generic_make_request, it is entirely
possible for the data in the buffer to be changing while the write is
being processed.  This can happen particularly with memory mapped
files, but also in other cases.
If we perform the XOR operation against the data in the buffer, and
then later DMA that data out to the storage device, the data could
have changed in the mean time.  The net result will be that the
parity block is wrong.
That is one reason why we currently copy the data before doing the XOR
(though copying at the same time as doing the XOR would be a suitable
alternative).

I can see two possible approaches where it could be safe to XOR out of
the provided buffer.

 1/ If we can be certain that the data in the buffer will not change
until the write completes.  I think this would require the
filesystem to explicitly promise not to change the data, possibly by
setting some flag in the BIO.  The filesystem would then need its
own internal interlock mechanisms to be able to keep the promise,
and we would only be able to convince filesystems to do this if
there were significant performance gains.

 2/ We allow the parity to be wrong for a little while (it happens
anyway) but make sure that:
a/ future writes to the same stripe use reconstruct_write, not
  read_modify_write, as the parity block might be wrong.
b/ We don't mark the array or (with bitmaps) region 'clean' until
  we have good reason to believe that it is.  i.e. somehow we
  would need to check that the last page written to each device
  were still clean when the write completed.

I think '2' is probably too complex.  Part 'a' makes it particularly
difficult to achieve efficiently.

I think that '1' might be possible for some limited cases, and it
could be that those limited cases form 99% for all potential
stripe-wide writes.
e.g. If someone was building a dedicated NAS device and wanted this
performance improvement, they could work with the particular
filesystem that they choose, and ensure that - for the applications
that they use on top of it - the filesystem does not update in-flight
data.


But without the above issues being considered and addressed, we cannot
proceed with this patch...
  

 
  The performance results obtained on the ppc440spe-based board using the
 PPC440SPE ADMA driver, Xdd benchmark, and the RAID-5 of 4 disks are as
 follows:
 
  SKIP_BIO_SET = 'N': 40 MBps;
  SKIP_BIO_SET = 'Y': 70 MBps.


...which is a shame, because that is a very significant performance
increase.  I wonder if that comes from simply avoiding the copy, or
whether there are some scheduling improvements that account for some
of it...  After all, a CPU can copy data around at much more than
30MB/s.

Thanks,
NeilBrown


Re: md RAID 10 on Linux 2.6.20?

2007-11-22 Thread Neil Brown
On Thursday November 22, [EMAIL PROTECTED] wrote:
 Hi all,
 
 I am running a home-grown Linux 2.6.20.11 SMP 64-bit build, and I am 
 wondering if there is indeed a RAID 10 personality defined in md that 
 can be implemented using mdadm. If so, is it available in 2.6.20.11, or 
 is it in a later kernel version? In the past, to create RAID 10, I 
 created RAID 1's and a RAID 0, so an 8 drive RAID 10 would actually 
 consist of 5 md devices (four RAID 1's and one RAID 0). But if I could 
 just use RAID 10 natively, and simply create one RAID 10, that would of 
 course be better both in terms of management and probably performance I 
 would guess. Is this possible?

Why don't you try it and see, or check the documentation?

But yes, there is native RAID10 in 2.6.20.
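For example (a sketch with assumed device names), an 8-drive RAID10 with the
default near-2 layout is created in one step:

   mdadm --create /dev/md0 --level=10 --raid-devices=8 --layout=n2 /dev/sd[b-i]1
   cat /proc/mdstat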

NeilBrown


Re: raid6 check/repair

2007-11-21 Thread Neil Brown
On Wednesday November 21, [EMAIL PROTECTED] wrote:
 Dear Neal,
 
  I have been looking a bit at the check/repair functionality in the
  raid6 personality.
  
  It seems that if an inconsistent stripe is found during repair, md
  does not try to determine which block is corrupt (using e.g. the
  method in section 4 of HPA's raid6 paper), but just recomputes the
  parity blocks - i.e. the same way as inconsistent raid5 stripes are
  handled.
  
  Correct?
  
  Correct!
  
   The most likely cause of parity being incorrect is if a write to
  data + P + Q was interrupted when one or two of those had been
  written, but the other had not.
  
   No matter which was or was not written, recalculating P and Q will produce
   a 'correct' result, and it is simple.  I really don't see any
  justification for being more clever.
 
 My opinion about that is quite different.  Speaking just for myself:
 
 a) When I put my data on a RAID running on Linux, I'd expect the 
 software to do everything which is possible to protect and when 
 necessary to restore data integrity.  (This expectation was one of the 
 reasons why I chose software RAID with Linux.)

Yes, of course.  "Possible" is an important aspect of this.

 
 b) As a consequence of a):  When I'm using a RAID level that has extra 
 redundancy, I'd expect Linux to make use of that extra redundancy during 
 a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
 it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two
drive failures.  Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is
wrong, it is *not* possible to deduce which block or blocks are wrong
if it is possible that more than 1 data block is wrong.
As it is quite possible for a write to be aborted in the middle
(during unexpected power down) with an unknown number of blocks in a
given stripe updated but others not, we do not know how many blocks
might be wrong so we cannot try to recover some wrong block.  Doing
so would quite possibly corrupt a block that is not wrong.
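For reference, the identification step being discussed (section 4 of the RAID-6
paper) amounts to the following; this is a sketch of the algebra only, not
something the md repair code does.  With data blocks D_i, stored parity P and Q,
and parity recomputed from the on-disk data over GF(2^8):

   P' = \bigoplus_i D_i, \qquad Q' = \bigoplus_i g^i D_i,
   \qquad \Delta P = P \oplus P', \qquad \Delta Q = Q \oplus Q'

If \Delta P = \Delta Q = 0 the stripe is consistent.  If exactly one of them is
non-zero, that parity block itself is the corrupt one.  If both are non-zero and
exactly one data block D_z is corrupt, then \Delta Q = g^z \cdot \Delta P, so
z = \log_g(\Delta Q / \Delta P) and the original value is the on-disk block
D_z \oplus \Delta P.  Any other outcome (e.g. z not a valid disk index) means
more than one block is bad, which is exactly the case described here where no
safe correction exists.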

The repair process repairs the parity (redundancy information).
It does not repair the data.  It cannot.

The only possible scenario that md/raid recognises for the parity
information being wrong is the case of an unexpected shutdown in the
middle of a stripe write, where some blocks have been written and some
have not.
Further (for raid 4/5/6), it only supports this case when your array
is not degraded.  If you have a degraded array, then an unexpected
shutdown is potentially fatal to your data (the chances of it actually
being fatal is actually quite small, but the potential is still there).
There is nothing RAID can do about this.  It is not designed to
protect against power failure.  It is designed to protect against drive
failure.  It does that quite well.

If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.

The best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the
data block, and store the checksums in the indexing information.  This
provides detection, not recovery of course.

 
 c) Why should 'repair' be implemented in a way that only works in most 
 cases when there exists a solution that works in all cases?  (After all, 
 possibilities for corruption are many, e.g. bad RAM, bad cables, chipset 
 bugs, driver bugs, last but not least human mistake.  From all these 
 errors I'd like to be able to recover gracefully without putting the 
 array at risk by removing and readding a component device.)

As I said above - there is no solution that works in all cases.  If
more than one block is corrupt, and you don't know which ones, then
you lose, and there is no way around that.
RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs, etc.  It is only designed to protect against drive
failure, where the drive failure is apparent.  i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.
It might be possible to design a data storage system that was
resilient to these sorts of errors.  It would be much more
sophisticated than RAID though.

NeilBrown


 
 Bottom line:  So far I was talking about *my* expectations, is it 
 reasonable to assume that it is shared by others?  Are there any 
 arguments that I'm not aware of speaking against an improved 
 implementation of 'repair'?
 
 BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
 corrupt a sector in the first device of a set of 16, 'repair' copies the 
 corrupted data to the 15 remaining devices instead of restoring the 
 correct sector from one of the other fifteen devices to the first.
 
 Thank you for your time.
 

Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)

2007-11-21 Thread Neil Brown
On Tuesday November 20, [EMAIL PROTECTED] wrote:
 
 My personal (wild) guess for this problem is that there is a global 
 lock somewhere, preventing all other CPUs from doing anything. At 100% (at 80 MB/s) 
 there's probably no time frame left to wake up the other CPUs, or it's 
 sufficiently small to only allow high-priority kernel threads to do 
 something.
 When I limit the sync to 40MB/s each resync-CPU has to wait sufficiently long 
 to allow the other CPUs to wake up.
 
 

md doesn't hold any locks that would prevent other parts of the
kernel from working.

I cannot imagine what would be causing your problems.  The resync
thread makes a point of calling cond_resched() periodically so that it
will let other processes run even if it constantly has work to do.

If you have nothing that could write to the RAID6 arrays, then I
cannot see how the resync could affect the rest of the system except
to reduce the amount of available CPU time.  And as CPU is normally
much faster than drives, you wouldn't expect that effect to be very
great.

Very strange.

Can you do 'alt-sysrq-T' when it is frozen and get the process traces
from the kernel logs?

Can you send me "cat /proc/mdstat" output after the resync has started, but
before the system has locked up?
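If no console is handy when it locks up, the same information can usually be
captured like this (illustrative commands; requires the sysrq interface):

   echo 1 > /proc/sys/kernel/sysrq      # enable magic sysrq if it isn't already
   cat /proc/mdstat > /tmp/mdstat.txt   # while the resync is still running
   echo t > /proc/sysrq-trigger         # same as alt-sysrq-T; traces go to the kernel log
   dmesg > /tmp/tasks.txt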

I'm sorry that I cannot suggest anything more useful.

NeilBrown



Re: raid6 check/repair

2007-11-15 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have been looking a bit at the check/repair functionality in the
 raid6 personality.
 
 It seems that if an inconsistent stripe is found during repair, md
 does not try to determine which block is corrupt (using e.g. the
 method in section 4 of HPA's raid6 paper), but just recomputes the
 parity blocks - i.e. the same way as inconsistent raid5 stripes are
 handled.
 
 Correct?

Correct!

The most likely cause of parity being incorrect is if a write to
data + P + Q was interrupted when one or two of those had been
written, but the other had not.

No matter which was or was not written, recalculating P and Q will produce
a 'correct' result, and it is simple.  I really don't see any
justification for being more clever.


NeilBrown


Re: Changing partition types of RAID array members

2007-11-15 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
 
 Hi.  I have two RAID5 arrays on an opensuse 10.3 system.  They are used
  together in a large LVM volume that contains a lot of data I'd rather
  not have to try and backup/recreate.
 
 md1 comes up fine and is detected by the OS on boot and assembled
  automatically.  md0 however, doesn't, and needs to be brought up manually,
  followed by a manual start of lvm.  This is a real pain of course.  The
  issue I think is that md0 was created through EVMS, which I have
  stopped using some time ago since its support seems to have been deprecated.
   EVMS created the array fine, but using partitions that were not 0xFD
  (Linux RAID), but rather 0x83 (linux native).  Since stopping the use
  of EVMS on boot, the array has not come up automatically.
 
 I have tried failing one of the array members, recreating the partition
  as Linux RAID through the yast partition manager, and then trying to
  add it, but I get a mdadm: Cannot open /dev/sdb1: Device or resource
  busy error.  If the partition is type 0x83 (linux native) and formatted
  with a filesystem first, then re-adding it is no problem at all, and the 
 array rebuilds
  fine.

You don't need to fail a device just to change the partition type.
Just use cfdisk to change all the partition types to 'fd', then
reboot and see what happens.
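A non-interactive alternative, if your sfdisk supports changing the Id (a
sketch; the partition numbers are assumptions):

   sfdisk -l /dev/sdb          # confirm which partitions the array members are
   sfdisk --id /dev/sdb 1 fd   # set partition 1 of /dev/sdb to 0xfd (Linux raid autodetect)
   sfdisk --id /dev/sdc 1 fd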

NeilBrown

 
 In googling the topic I can't seem to find out why I get the error
  message, and how to fix this.  I'd really like to get this problem
  resolved.  Does anyone out there know how to fix this, so I can get 
 partitions
  correctly flagged as Linux RAID and the array autodetected at start?
 
 Sorry if I missed something obvious.
 
 Thanks,
 Mike
 
 
 
 
 
 
   
 


Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-14 Thread Neil Brown
On Tuesday November 13, [EMAIL PROTECTED] wrote:
 
 raid5-fix-unending-write-sequence.patch is in -mm and I believe is
 waiting on an Acked-by from Neil?
 

It seems to have just been sent on to Linus, so it probably will go in
without:

   Acked-By: NeilBrown [EMAIL PROTECTED]

I'm beginning to think that I really should sit down and make sure I
understand exactly how those STRIPE_OP_ flags are used.  They
generally make sense but there seem to be a number of corner cases
where they aren't quite handled properly.  Maybe they are all found
now, or maybe...

NeilBrown


Re: Proposal: non-striping RAID4

2007-11-14 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
 
 Neil: any comments on whether this would be desirable / useful / feasible?

1/ Having a raid4 variant which arranges the data like 'linear' is
   something I am planning to do eventually.  If your filesystem knows
   about the geometry of the array, then it can distribute the data
   across the drives and can make up for a lot of the benefits of
   striping.  The big advantage of such an arrangement is that it is
   trivial to add a drive - just zero it and make it part of the
   array.  No need to re-arrange what is currently there.
   However I was not thinking of supporting different sized devices in
   such a configuration.

2/ Having an array with redundancy where drives are of different sizes
   is awkward, primarily because if there was a spare that was not as
   large as the largest device, you may or may not be able to rebuild
   in that situation.   Certainly I could code up those decisions, but
   I'm not sure the scenario is worth the complexity.
   If you have drives of different sizes, use raid0 to combine pairs
   of smaller ones to match larger ones, and do raid5 across devices
   that look like the same size (see the sketch below this list).

3/ If you really want to use exactly what you have, you can partition
   them into bits and make a variety of raid5 arrays as you suggest.
   md will notice and will resync in series so that you don't kill
   performance.
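The sketch referred to in 2/ above, with made-up sizes and device names: two
250GB drives stand in for a single 500GB one, and raid5 is then built over
devices that all look the same size:

   mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd1 /dev/sde1          # 2 x 250GB pair
   mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/md1 # 2 x 500GB + the pair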

NeilBrown


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-12 Thread Neil Brown
On Monday November 12, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 
  However there is value in regularly updating the bitmap, so add code
  to periodically pause while all pending sync requests complete, then
  update the bitmap.  Doing this only every few seconds (the same as the
  bitmap update time) does not noticeably affect resync performance.

 
 I wonder if a minimum time and minimum number of stripes would be 
 better. If a resync is going slowly because it's going over a slow link 
 to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just 
 writing the updated bitmap may represent as much data as has been 
 resynced in the time slice.
 
 Not a suggestion, but a request for your thoughts on that.

Thanks for your thoughts.
Choosing how often to update the bitmap during a sync is certainly not
trivial.   In different situations, different requirements might rule.

I chose to base it on time, and particularly on the time we already
have for how soon to write back clean bits to the bitmap because it
is fairly easy to users to understand the implications (if I set the
time to 30 seconds, then I might have to repeat 30second of resync)
and it is already configurable (via the --delay option to --create
--bitmap).

Presumably if someone has a very slow system and wanted to use
bitmaps, they would set --delay relatively large to reduce the cost
and still provide significant benefits.  This would affect both normal
clean-bit writeback and during-resync clean-bit-writeback.
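So on such a slow system one might choose a larger delay when adding the
bitmap, e.g. (illustrative; assumes your mdadm accepts --delay together with
--grow --bitmap):

   mdadm --grow /dev/md0 --bitmap=none                  # only if a bitmap already exists
   mdadm --grow /dev/md0 --bitmap=internal --delay=30   # write clean bits back at most every 30s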

Hope that clarifies my approach.

Thanks,
NeilBrown


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-11 Thread Neil Brown
On Thursday November 8, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have created a new raid6:
 
 md0 : active raid6 sdb1[0] sdl1[5] sdj1[4] sdh1[3] sdf1[2] sdd1[1]
   6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
   []  resync = 21.5% (368216964/1708717056) 
 finish=448.5min speed=49808K/sec
   bitmap: 204/204 pages [816KB], 4096KB chunk
 
 The raid is totally idle, not mounted and nothing.
 
  So why does the "bitmap: 204/204" not shrink? I would expect it to clear
  bits as it resyncs so it should count slowly down to 0. As a side
  effect of the bitmap being all dirty the resync will restart from the
  beginning when the system is hard reset. As you can imagine that is
  pretty annoying.
 
 On the other hand on a clean shutdown it seems the bitmap gets updated
 before stopping the array:
 
 md3 : active raid6 sdc1[0] sdm1[5] sdk1[4] sdi1[3] sdg1[2] sde1[1]
   6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
   [===.]  resync = 38.4% (656155264/1708717056) 
 finish=17846.4min speed=982K/sec
   bitmap: 187/204 pages [748KB], 4096KB chunk
 
 Consequently the rebuild did restart and is already further along.
 

Thanks for the report.

 
 Any ideas why that is so?

Yes.  The following patch should explain (a bit tersely) why this was
so, and should also fix it so it will no longer be so.  Test reports
always welcome.

NeilBrown

Status: ok

Update md bitmap during resync.

Currently an md array with a write-intent bitmap does not update
that bitmap to reflect successful partial resync.  Rather the entire
bitmap is updated when the resync completes.

This is because there is no guarantee that resync requests will
complete in order, and tracking each request individually is
unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code
to periodically pause while all pending sync requests complete, then
update the bitmap.  Doing this only every few seconds (the same as the
bitmap update time) does not noticeably affect resync performance.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/bitmap.c |   34 +-
 ./drivers/md/raid1.c  |1 +
 ./drivers/md/raid10.c |2 ++
 ./drivers/md/raid5.c  |3 +++
 ./include/linux/raid/bitmap.h |3 +++
 5 files changed, 38 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2007-10-22 16:55:52.0 +1000
+++ ./drivers/md/bitmap.c   2007-11-12 16:36:30.0 +1100
@@ -1349,14 +1349,38 @@ void bitmap_close_sync(struct bitmap *bi
 */
sector_t sector = 0;
int blocks;
-   if (!bitmap) return;
+   if (!bitmap)
+   return;
	while (sector < bitmap->mddev->resync_max_sectors) {
		bitmap_end_sync(bitmap, sector, &blocks, 0);
-/*
-		if (sector < 500) printk("bitmap_close_sync: sec %llu blks %d\n",
-			 (unsigned long long)sector, blocks);
-*/ sector += blocks;
+		sector += blocks;
+   }
+}
+
+void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
+{
+   sector_t s = 0;
+   int blocks;
+
+   if (!bitmap)
+   return;
+   if (sector == 0) {
+		bitmap->last_end_sync = jiffies;
+		return;
+	}
+	if (time_before(jiffies, (bitmap->last_end_sync
+				  + bitmap->daemon_sleep * HZ)))
+		return;
+	wait_event(bitmap->mddev->recovery_wait,
+		   atomic_read(&bitmap->mddev->recovery_active) == 0);
+
+	sector &= ~((1ULL << CHUNK_BLOCK_SHIFT(bitmap)) - 1);
+	s = 0;
+	while (s < sector && s < bitmap->mddev->resync_max_sectors) {
+		bitmap_end_sync(bitmap, s, &blocks, 0);
+		s += blocks;
 	}
+	bitmap->last_end_sync = jiffies;
 }
 
static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int needed)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2007-10-30 13:50:45.0 +1100
+++ ./drivers/md/raid10.c   2007-11-12 16:06:39.0 +1100
@@ -1671,6 +1671,8 @@ static sector_t sync_request(mddev_t *md
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
 
+	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+
/* Again, very different code for resync and recovery.
 * Both must result in an r10bio with a list of bios that
 * have bi_end_io, bi_sector, bi_bdev set,

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2007-10-30 13:50:45.0 +1100
+++ ./drivers/md/raid1.c2007-11-12 16:06:12.0 +1100
@@ -1685,6 +1685,7 @@ static sector_t sync_request(mddev_t *md
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Neil Brown
On Sunday November 4, [EMAIL PROTECTED] wrote:
 # ps auxww | grep D
 USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
 root   273  0.0  0.0  0 0 ?DOct21  14:40 [pdflush]
 root   274  0.0  0.0  0 0 ?DOct21  13:00 [pdflush]
 
 After several days/weeks, this is the second time this has happened, while 
 doing regular file I/O (decompressing a file), everything on the device 
 went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2


NeilBrown

Fix misapplied patch in raid5.c

commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-11-02 12:10:49.0 +1100
+++ ./drivers/md/raid5.c2007-11-02 12:25:31.0 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
struct bio *return_bi = NULL;
struct stripe_head_state s;
struct r5dev *dev;
unsigned long pending = 0;
 
	memset(&s, 0, sizeof(s));
	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
		atomic_read(&sh->count), sh->pd_idx,
		sh->ops.pending, sh->ops.ack, sh->ops.complete);

	spin_lock(&sh->lock);
	clear_bit(STRIPE_HANDLE, &sh->state);
	clear_bit(STRIPE_DELAYED, &sh->state);

	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
	rcu_read_lock();
	for (i=disks; i--; ) {
		mdk_rdev_t *rdev;
		struct r5dev *dev = &sh->dev[i];
		clear_bit(R5_Insync, &dev->flags);

		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
			"written %p\n", i, dev->flags, dev->toread, dev->read,
			dev->towrite, dev->written);

		/* maybe we can request a biofill operation
		 *
		 * new wantfill requests are only permitted while
		 * STRIPE_OP_BIOFILL is clear
		 */
		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
		    !test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
			set_bit(R5_Wantfill, &dev->flags);
 
/* now count some things */
@@ -2880,47 +2887,40 @@ static void handle_stripe6(struct stripe
struct stripe_head_state s;
struct r6_state r6s;
struct r5dev *dev, *pdev, *qdev;
 
	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
		"pd_idx=%d, qd_idx=%d\n",
	       (unsigned long long)sh->sector, sh->state,
	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
	memset(&s, 0, sizeof(s));

	spin_lock(&sh->lock);
	clear_bit(STRIPE_HANDLE, &sh->state);
	clear_bit(STRIPE_DELAYED, &sh->state);

	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
 
-	/* clean-up completed biofill operations */
-	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
-	}
-
	rcu_read_lock();
	for (i=disks; i--; ) {
		mdk_rdev_t *rdev;
		dev = &sh->dev[i];
		clear_bit(R5_Insync, &dev->flags);

		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
			i, dev->flags, dev->toread, dev->towrite, dev->written);
/* maybe we can reply to a read

Re: Very small internal bitmap after recreate

2007-11-02 Thread Neil Brown
On Friday November 2, [EMAIL PROTECTED] wrote:
 
 Am 02.11.2007 um 10:22 schrieb Neil Brown:
 
  On Friday November 2, [EMAIL PROTECTED] wrote:
  I have a 5 disk version 1.0 superblock RAID5 which had an internal
  bitmap that has been reported to have a size of 299 pages in /proc/
  mdstat. For whatever reason I removed this bitmap (mdadm --grow --
  bitmap=none) and recreated it afterwards (mdadm --grow --
  bitmap=internal). Now it has a reported size of 10 pages.
 
  Do I have a problem?
 
  Not a big problem, but possibly a small problem.
  Can you send
 mdadm -E /dev/sdg1
  as well?
 
 Sure:
 
 # mdadm -E /dev/sdg1
 /dev/sdg1:
Magic : a92b4efc
  Version : 01
  Feature Map : 0x1
   Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
 Name : 1
Creation Time : Wed Oct 31 14:30:55 2007
   Raid Level : raid5
 Raid Devices : 5
 
Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
   Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
Used Size : 625136896 (298.09 GiB 320.07 GB)
 Super Offset : 625137264 sectors

So there are 256 sectors before the superblock where a bitmap could go,
or about 6 sectors afterwards.

State : clean
  Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
 
 Internal Bitmap : 2 sectors from superblock

And the '6 sectors afterwards' was chosen.
6 sectors has room for 5*512*8 = 20480 bits,
and from your previous email:
   Bitmap : 19078 bits (chunks), 0 dirty (0.0%)
you have 19078 bits, which is about right (as the bitmap chunk size
must be a power of 2).

So the problem is that mdadm -G is putting the bitmap after the
superblock rather than considering the space before
(checks code)

Ahh, I remember now.  There is currently no interface to tell the
kernel where to put the bitmap when creating one on an active array,
so it always puts it in the 'safe' place.  Another enhancement waiting
for time.

For now, you will have to live with a smallish bitmap, which probably
isn't a real problem.  With 19078 bits, you will still get a
several-thousand-fold increase in resync speed after a crash
(i.e. hours become seconds) and to some extent, fewer bits are better
and you have to update them less.

I haven't made any measurements to see what size bitmap is
ideal... maybe someone should :-)

  Update Time : Fri Nov  2 07:46:38 2007
 Checksum : 4ee307b3 - correct
   Events : 408088
 
   Layout : left-symmetric
   Chunk Size : 128K
 
  Array Slot : 3 (0, 1, failed, 2, 3, 4)
 Array State : uuUuu 1 failed
 
 This time I'm getting nervous - Array State failed doesn't sound good!

This is nothing to worry about - just a bad message from mdadm.

The superblock has recorded that there was once a device in position 2
which is now failed (See the list in Array Slot).
This is summarised as "1 failed" in Array State.

But the array is definitely working OK now.

NeilBrown


Re: stride / stripe alignment on LVM ?

2007-11-01 Thread Neil Brown
On Thursday November 1, [EMAIL PROTECTED] wrote:
 Hello,
 
 I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have
 created LVM volume called 'raid5', and finally a logical volume
 'backup'.
 
 Then I formatted it with command:
 
mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup
 
 And because LVM is putting its own metadata on /dev/md1, the ext3
 partition is shifted by some (unknown to me) amount of bytes from
 the beginning of /dev/md1.
 
 I was wondering, how big is the shift, and would it hurt the
 performance/safety if the `ext3 stride=32` didn't align perfectly
 with the physical stripes on HDD?

It is probably better to ask this question on an ext3 list as people
there might know exactly what 'stride' does.

I *think* it causes the inode tables to be offset in different
block-groups so that they are not all on the same drive.  If that is
the case, then an offset causes by LVM isn't going to make any
difference at all.

NeilBrown


 
 PS: the resize option is to make sure that I can grow this fs
 in the future.
 
 PSS: I looked in the archive but didn't find this question asked
 before. I'm sorry if it really was asked.

Thanks for trying!


Re: Superblocks

2007-11-01 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 Which is the default type of superblock? 0.90 or 1.0?

The default default is 0.90.
However a local default can be set in mdadm.conf with e.g.
   CREATE metadata=1.0

NeilBrown


Re: Bad drive discovered during raid5 reshape

2007-10-30 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Monday October 29, [EMAIL PROTECTED] wrote:
  Hi,
  I bought two new hard drives to expand my raid array today and
  unfortunately one of them appears to be bad. The problem didn't arise
 
  Looks like you are in real trouble.  Both the drives seem bad in some
  way.  If it was just sdc that was failing it would have picked up
  after the -Af, but when it tried, sdb gave errors.
 
 Humble enquiry :)
 
 I'm not sure that's right?
 He *removed* sdb and sdc when the failure occurred so sdc would indeed be 
 non-fresh.

I'm not sure what point you are making here.
In any case, removing two drives from a raid5 is always a bad thing.
Part of the array was striped over 8 drives by this time.  With only
six still in the array, some data will be missing.

 
 The key question I think is: will md continue to grow an array even if it 
 enters
 degraded mode during the grow?
 ie grow from a 6 drive array to a 7-of-8 degraded array?
 
 Technically I guess it should be able to.

Yes, md can grow to a degraded array.  If you get a single failure I
would expect it to abort the growth process, then restart where it
left off (after checking that that made sense).

 
 In which case should he be able to re-add /dev/sdc and allow md to retry the
 grow? (possibly losing some data due to the sdc staleness)

He only needs one of the two drives in there.  I got the impression
that both sdc and sdb had reported errors.  If not, and sdc really
seems OK, then --assemble --force listing all drives except sdb
should make it all work again.
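Concretely, something along these lines (the member names other than sdb are
assumptions):

   mdadm --stop /dev/md0
   mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
   cat /proc/mdstat    # the interrupted reshape should pick up where it left off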

NeilBrown


Re: Time to deprecate old RAID formats?

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
 
 Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
 beginning? Isn't hindsight wonderful?
 

Those names seem good to me.  I wonder if it is safe to generate them
in -Eb output

Maybe the key confusion here is between version numbers and
revision numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another: "Here is my version of what happened, now
let's hear yours."
When you have multiple revisions, you do assume ongoing improvement.

v1.0  v1.1 and v1.2 are different version of the v1 superblock, which
itself is a revision of the v0...

NeilBrown


Re: Superblocks

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
 Can someone help me understand superblocks and MD a little bit?
 
 I've got a raid5 array with 3 disks - sdb1, sdc1, sdd1.
 
 --examine on these 3 drives shows correct information.
 
 
 However, if I also examine the raw disk devices, sdb and sdd, they
 also appear to have superblocks with some semi valid looking
 information. sdc has no superblock.

If a partition starts at a multiple of 64K from the start of the device,
and ends within about 64K of the end of the device, then a superblock on
the partition will also look like a superblock on the whole device.
This is one of the shortcomings of v0.90 superblocks.  v1.0 doesn't
have this problem.

 
 How can I clear these? If I unmount my raid, stop md0, it won't clear it.

mdadm --zero-superblock <device name>

is the best way to remove an unwanted superblock.  Of course in the
above described case, removing the unwanted superblock will remove the
wanted one as well.
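For completeness, a typical sequence when a superblock really is unwanted (a
sketch; /dev/sdb is an assumed name, and the member must not be part of a
running array):

   mdadm --stop /dev/md0
   mdadm --examine /dev/sdb           # confirm what is about to be erased
   mdadm --zero-superblock /dev/sdb
   mdadm --examine /dev/sdb           # should now report no md superblock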


 
 [EMAIL PROTECTED] ~]# mdadm --zero-superblock /dev/hdd
 mdadm: Couldn't open /dev/hdd for write - not zeroing

As I think someone else pointed out /dev/hdd is not /dev/sdd.

NeilBrown


Re: Implementing low level timeouts within MD

2007-10-29 Thread Neil Brown
On Friday October 26, [EMAIL PROTECTED] wrote:
 I've been asking on my other posts but haven't seen
 a direct reply to this question:
 
 Can MD implement timeouts so that it detects problems when
 drivers don't come back?

No.
However it is possible that we will start sending the BIO_RW_FAILFAST
flag down on some or all requests.  That might make drivers fail more
promptly, which might be  good thing.  However it won't fix bugs in
drivers and - as has been said elsewhere on this thread - that is the
real problem.

NeilBrown


