Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-02 Thread Tejun Heo
[resending.  my mail service was down for more than a week and this
message didn't get delivered.]

[EMAIL PROTECTED] wrote:
> > Anyway, what's annoying is that I can't figure out how to bring the
> > drive back on line without resetting the box.  It's in a hot-swap
> > enclosure, but power cycling the drive doesn't seem to help.  I thought
> > libata hotplug was working?  (SiI3132 card, using the sil24 driver.)

Yeah, it's working, but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts), so the port is muted after that.
The plan is to handle this with polling hotplug, such that libata tries
to revive the port if a PHY status change is detected by polling.
Patches are available, but they need other things to be resolved before
they can get integrated.  I think it'll happen before the summer.

Anyway, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).

> > (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> > Something is up with that drive.)

Yeap, seems like a broken drive to me.

Thanks.

-- 
tejun


bio too big device md1 (16 > 8)

2007-04-02 Thread syrius . ml

Hi,

I'm using 2.6.21-rc5-git9 +
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-merge-max_hw_sector.patch
(I've been testing with and without it, and first encountered the problem
on 2.6.18-debian.)

I've set up a RAID1 array, md1 (it was created in degraded mode using
the Debian installer).
(md0 is also a small RAID1 array created in degraded mode, but I did
not have any issue with it.)

md1 holds an LVM physical volume holding a VG and several LVs.

mdadm -D /dev/md1:
/dev/md1:
Version : 00.90.03
  Creation Time : Sun Mar 25 16:34:42 2007
 Raid Level : raid1
 Array Size : 290607744 (277.15 GiB 297.58 GB)
Device Size : 290607744 (277.15 GiB 297.58 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Tue Apr  3 01:37:23 2007
  State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

   UUID : af8d2807:e573935d:04be1e12:bc7defbb
 Events : 0.422096

Number   Major   Minor   RaidDevice State
   0       3        3        0      active sync   /dev/hda3
   1       0        0        1      removed


The problem I'm encountering occurs when I add /dev/md2 to /dev/md1.

mdadm -D /dev/md2
/dev/md2:
Version : 00.90.03
  Creation Time : Sun Apr  1 15:06:43 2007
 Raid Level : linear
 Array Size : 290607808 (277.15 GiB 297.58 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Sun Apr  1 15:06:43 2007
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

   Rounding : 64K

   UUID : 887ecdeb:5f205eb6:4cd470d6:4cbda83c (local to host odo)
 Events : 0.1

Number   Major   Minor   RaidDevice State
   0      34        4        0      active sync   /dev/hdg4
   1      57        2        1      active sync   /dev/hdk2
   2      91        3        2      active sync   /dev/hds3
   3      89        2        3      active sync   /dev/hdo2

I use mdadm --manage --add /dev/md1 /dev/md2.
When I do so, here is what happens:
md: bind<md2>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:hda3
 disk 1, wo:1, o:1, dev:md2
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 290607744 blocks.
bio too big device md1 (16 > 8)
Device dm-7, XFS metadata write error block 0x243ec0 in dm-7
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1b5b6550   
("xfs_trans_read_buf") error 5 buf count 8192
bio too big device md1 (16 > 8)
I/O error in filesystem ("dm-8") meta-data dev dm-8 block 0x1fb3b00   
("xfs_trans_read_buf") error 5 buf count 8192

Every filesystem on md1 gets corrupted.
I manually fail md2 and then reboot, after which I can mount the
filesystems again (but md1 is still degraded).

Any idea?
I can provide more information if needed.  (The only weird thing is
/dev/hdo, which doesn't seem to be lba48-ready, but I guess that
shouldn't be a geometry issue.)



Re: mismatch_cnt worries

2007-04-02 Thread Neil Brown
On Monday April 2, [EMAIL PROTECTED] wrote:
> 
> Neil's post here suggests either this is all normal or I'm seriously up the
> creek.
>   http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07349.html
> 
> My questions:
> 
> 1. Should I be worried or is this normal?  If so can you explain why the
>number is non-zero?

Probably not too worried.
Is it normal?  I'm not really sure what 'normal' is.  I'm beginning to
think that it is 'normal' to get strange errors from disk drives, but
maybe I have a jaded perspective.
If you have a swap-partition or a swap-file on the device then you
should consider it normal.  If not, then it is much less likely but
still possible.

> 2. Should I repair, fsck, replace a disk, something else?

'repair' is probably a good idea.
'fsck' certainly wouldn't hurt and might show something, though I
suspect it will find the filesystem to be structurally sound.
I wouldn't replace the disk on the basis of a single difference report
from mismatch_cnt.  I don't know what the SMART message means, so I
don't know whether it suggests that the drive needs to be replaced.

> 3. Can someone explain how this quote can be true:
>"Though it is less likely, a regular filesystem could still (I think)
> genuinely write different data to different devices in a raid1/10."
>when I thought the point of RAID1 was that the data should be the same on
>both disks.

Suppose I memory-map a file and often modify the mapped memory.
The system will at some point decide to write that block of the file
to the device.  It will send a request to raid1, which will send one
request each to two different devices.  They will each DMA the data
out of that memory to the controller at different times so they could
quite possibly get different data (if I changed the mapped memory
between those two DMA requests).  So the data on the two drives in a
mirror can easily be different.  If a 'check' happens at exactly this
time it will notice.
Normally that block will be written out again (as it is still 'dirty')
and again and again if necessary as long as I keep writing to the
memory.  Once I stop writing to the memory (e.g. close the file,
unmount the filesystem) a final write will be made with the same data
going to both devices.  During this time we will never read that block
from the filesystem, so the filesystem will never be able to see any
difference between the two devices in a raid1.

So: if you are actively writing to a file while 'check' is running on
a raid1, it could show up as a difference in mismatch_cnt.  But you
have to get the timing just right (or wrong).
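
For illustration, here is a minimal userspace sketch of that access
pattern (the file name and sizes are arbitrary, and nothing here is
md-specific; it simply keeps a file-backed page dirty while writeback
can run at any moment):

/* mismatch-demo.c: keep a mapped, file-backed page dirty while the kernel
 * writes it back.  Each mirror leg DMAs the page at its own moment, so the
 * two on-disk copies can transiently differ.  Illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("mismatch-demo.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 4096)) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Keep modifying the mapped page; writeback (or the msync hint
         * below) may pick it up between any two modifications. */
        for (unsigned long i = 0; i < 100000000UL; i++) {
                memcpy(p, &i, sizeof(i));
                if (i % 1000000 == 0)
                        msync(p, 4096, MS_ASYNC);       /* nudge writeback */
        }

        munmap(p, 4096);
        close(fd);
        return 0;
}

Running a 'check' on the raid1 that holds such a file while this loop is
going is exactly the timing window described above.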

I think it is possible in the above scenario to truncate the file
while a write is underway but with new data in memory.  If you do
this, the system might not write out that last 'new' data, so the last
write to the particular block on storage may have written different
data to the two different drives, and this difference will not be
corrected by the filesystem, e.g. on unmount.  Note that the inconsistent
data will never be read by the filesystem (the file has been
truncated, remember) so there is no risk of data corruption.
In this case the difference could remain for some time until later
when a 'check' or 'repair' notices it.

Does that help explain the above quote?

It is still the case that:
  filesystem corruption won't happen in normal operation
  a small mismatch_cnt does not necessarily imply a problem.

NeilBrown


Question about using dmraid45 patch

2007-04-02 Thread Wood, Brian J
Hello, I'm new to the dm-devel mail list, so hopefully my question won't
get flamed too badly :)

I need some help with a setup issue in dmraid, specifically raid5. I'm
trying to use Heinz Mauelshagen's patch for dmraid45 and having problems
getting it to compile in the kernel (or as a module). It's been a number
of years since I've worked with Linux, so I might be missing something
blatantly easy. 

I have downloaded the source for both 2.6.18.1 (which the patch
specifically applies to) and 2.6.18.8, which RHEL5 uses. When I used the
2.6.18.8 kernel I applied the patch, copied the .config over from the
kernel directory, and used "menu gconfig" to activate (as a module) the
dmraid45 option in device mapper under device drivers. I compiled the
kernel, copied over the files, edited grub, and rebooted into the new
kernel...all seemed to go fine. I rebooted again, went into the OPROM, and
set up a raid5 volume (I have the system loaded onto a raid0 volume; the
system has 5 SATA disks total). After booting back into the new kernel I
ran the command "dmraid -tay" to list my active tables (I hope that's
the right way to describe it) and got:

[EMAIL PROTECTED] ~]# dmraid -tay
isw_dfbhbdaedb_Volume0: 0 312592896 striped 2 256 /dev/sda 0 /dev/sdb 0
isw_bffiiabjc_Volume1: 0 312592896 raid45 core 2 65536 nosync raid5_la 1
128 3 -1 /dev/sdc 0 /dev/sdd 0 /dev/sde 0
isw_dfbhbdaedb_Volume01: 0 1108422 linear
/dev/mapper/isw_dfbhbdaedb_Volume0 63
isw_dfbhbdaedb_Volume02: 0 2104515 linear
/dev/mapper/isw_dfbhbdaedb_Volume0 1108485
isw_dfbhbdaedb_Volume03: 0 309379770 linear
/dev/mapper/isw_dfbhbdaedb_Volume0 3213000

I then tried to activate the volumes and got:

[EMAIL PROTECTED] ~]# dmraid -ay
RAID set "isw_dfbhbdaedb_Volume0" already active
ERROR: device-mapper target type "raid45" not in kernel

I then used lsmod to see if my module was active and saw this:
[EMAIL PROTECTED] ~]# lsmod | grep -e "dm"
dm_snapshot            48824  0
dm_zero                35200  0
dm_mirror              46976  0
dm_log                 42496  1 dm_mirror
dm_mod                 92880  17 dm_snapshot,dm_zero,dm_mirror,dm_log

nothing to do with raid5.

I ran insmod and got this error:
[EMAIL PROTECTED] ~]# insmod
/lib/modules/2.6.18.8/kernel/drivers/md/dm-raid4-5.ko 
insmod: error inserting 'drivers/md/dm-raid4-5.ko': -1 Unknown symbol in
module

I wanted to do a test at this point to see if it might be the kernel
.config file I had from the RHEL5 2.6.18.8 release, so I used a fresh
drop of kernel 2.6.18.1 (the one listed inside the patch code), applied
the dmraid45 patch, and used "make gconfig" to configure the kernel. I
went into device drivers and unselected everything except device mapper
support and the new subcategory of RAID4/5 target to be inserted as part
of the kernel. I even unselected Loadable Module Support just to make
sure :)
I then set make to use -d to get a little more info on the error; here's
what it showed when trying to compile:


Successfully remade target file `drivers/md/xor.o'.
Pruning file `FORCE'.
   Finished prerequisites of target file `drivers/md/built-in.o'.
  Must remake target `drivers/md/built-in.o'.
Putting child 0x006bc870 (drivers/md/built-in.o) PID 30056 on the chain.
Live child 0x006bc870 (drivers/md/built-in.o) PID 30056 
  LD  drivers/md/built-in.o
drivers/md/dm-mem-cache.o: In function `pl_elem':
/usr/src/kernels/linux-2.6.18.1/drivers/md/dm-mem-cache.h:18: multiple
definition of `pl_elem'
drivers/md/dm-raid4-5.o:/usr/src/kernels/linux-2.6.18.1/drivers/md/dm-me
m-cache.h:18: first defined here
Reaping losing child 0x006bc870 PID 30056 
make[2]: *** [drivers/md/built-in.o] Error 1
Removing child 0x006bc870 PID 30056 from chain.
Reaping losing child 0x00697ed0 PID 29856 
make[1]: *** [drivers/md] Error 2
Removing child 0x00697ed0 PID 29856 from chain.
Reaping losing child 0x006c4de0 PID 26812 
make: *** [drivers] Error 2
Removing child 0x006c4de0 PID 26812 from chain.


Is there something I'm not setting in the build environment to build
this as part of the kernel or as a module? I looked in the raid4-5
readme.txt and it just says to apply the patch and have device mapper
installed (which RHEL5 does). I'm using GNU make version 3.81, gcc
version 4.1.1, and the version of RHEL5 is x86_64 (I've also tried this
same operation with i386). I've also tried the entire procedure with
OpenSuSE 10.2 for x86_64 and i386.
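
For reference, a "multiple definition" failure at the built-in.o link
step is the classic symptom of a function being defined (not just
declared) in a header that more than one .c file includes; here the
error points at dm-mem-cache.h:18 being pulled in by both dm-mem-cache.c
and dm-raid4-5.c.  The usual shape of the problem and its fix looks like
this (hedged sketch only; "demo_helper" is made up and stands in for
pl_elem, this is not the actual dm-mem-cache.h code):

/* In the shared header, a plain definition such as
 *
 *      int demo_helper(int n) { return n + 1; }
 *
 * is emitted into every object that includes the header, so linking them
 * together fails with "multiple definition of `demo_helper'".  Making the
 * header copy static inline keeps it private to each translation unit: */
static inline int demo_helper(int n)
{
        return n + 1;
}

Whether that is what is actually wrong in this particular patch can't be
said without the source at hand, but it is the first thing to check.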



Thank you for the help, 

Brian Wood
Intel Corporation 
Digital Enterprise Group
Manageability & Platform Software Division
[EMAIL PROTECTED]


mismatch_cnt worries

2007-04-02 Thread Gavin McCullagh
Hi,

I've relatively recently started using md, having had some bad experiences
with hardware raid controllers.  I've had some really good experiences
(stepwise upgrading an 800GB raid5 array to a 1.5TB one by exchanging disks
and using mdadm --grow), but I am in the middle of a more worrying one.  I have
read previous recent threads about mismatch_cnt and am still a little unclear
how to interpret this.  I'm seeing this issue on a couple of machines, but I'll
just talk about one for now.

I ran a check on the three RAID1 arrays in a machine I'm managing.  The check
finished without error.  I then had a look at the mismatch_cnt and one of them
is non-zero (128), specifically the one which holds the root filesystem.

The Gentoo Wiki on the subject seems to be more or less saying I need to
format the partition to be sure of anything.  Needless to say, that's not
desirable.

Stupidly, I have not been running SMART monitoring until now, but I have
installed and configured it and run long and short tests manually.  The most
interesting part of the smartctl output on the disks is below; only ECC fast
errors are shown.

All of the event logs look like this, so I guess there's only partial support
for SMART:

  Error event 19:
:Sense Key  06h Unit Attention  :Add Sense Code 29h :Add Sense Code Qualif  
02h :Hardware Status  00h :CCHSS Valid   :CC  h :H No.  00h :SS No. 00

Neil's post here suggests either this is all normal or I'm seriously up the
creek.
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07349.html

My questions:

1. Should I be worried or is this normal?  If so can you explain why the
   number is non-zero?
2. Should I repair, fsck, replace a disk, something else?
3. Can someone explain how this quote can be true:
   "Though it is less likely, a regular filesystem could still (I think)
genuinely write different data to different devices in a raid1/10."
   when I thought the point of RAID1 was that the data should be the same on
   both disks.

Many thanks for any help/comfort,

Gavin

SDA:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   8878773        0         0    8878773          0        437.620           0
write:        0        0         0          0          0        277.228           0

SDB:
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   5077782        0         0    5077782          0        455.871           0
write:        0        0         0          0          0        263.680           0



Re: raid5 write performance

2007-04-02 Thread Raz Ben-Jehuda(caro)

On 4/2/07, Dan Williams <[EMAIL PROTECTED]> wrote:

On 3/30/07, Raz Ben-Jehuda(caro) <[EMAIL PROTECTED]> wrote:
> Please see below.
>
> On 8/28/06, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Sunday August 13, [EMAIL PROTECTED] wrote:
> > > well ... me again
> > >
> > > Following your advice
> > >
> > > I added a deadline for every WRITE stripe head when it is created.
> > > In raid5_activate_delayed I check whether the deadline has expired, and
> > > if not I set the sh to preread-active mode.
> > >
> > > This small fix (and a few other places in the code) reduced the amount
> > > of reads to zero with dd, but with no improvement in throughput.  But
> > > with random access to the raid (buffers aligned to the stripe width and
> > > of the same size as the stripe width) there is an improvement of at
> > > least 20%.
> > >
> > > The problem is that a user must know what he is doing, else there would
> > > be a reduction in performance if the deadline is too long (say 100 ms).
> >
> > So if I understand you correctly, you are delaying write requests to
> > partial stripes slightly (your 'deadline') and this is sometimes
> > giving you a 20% improvement ?
> >
> > I'm not surprised that you could get some improvement.  20% is quite
> > surprising.  It would be worth following through with this to make
> > that improvement generally available.
> >
> > As you say, picking a time in milliseconds is very error prone.  We
> > really need to come up with something more natural.
> > I had hoped that the 'unplug' infrastructure would provide the right
> > thing, but apparently not.  Maybe unplug is just being called too
> > often.
> >
> > I'll see if I can duplicate this myself and find out what is really
> > going on.
> >
> > Thanks for the report.
> >
> > NeilBrown
> >
>
> Hello Neil. I am sorry for the long interval; I was abruptly assigned to
> a different project.
>
> 1.
>   I took another look at the raid5 delay patch I wrote a while
> ago. I ported it to 2.6.17 and tested it. It seems to work, and
> when used correctly it eliminates the read penalty.
>
> 2. Benchmarks .
> configuration:
>  I am testing a raid5 x 3 disks with 1MB chunk size.  IOs are
> synchronous and non-buffered (O_DIRECT), 2 MB in size, and always
> aligned to the beginning of a stripe. The kernel is 2.6.17. The
> stripe_delay was set to 10ms.
>
>  Attached is the simple_write code.
>
>  command:
>    simple_write /dev/md1 2048 0 1000
>  simple_write does raw O_DIRECT writes sequentially,
> starting from offset zero, 2048 kilobytes at a time, 1000 times.
>
> Benchmark Before patch
>
> sda    1848.00     8384.00    50992.00      8384     50992
> sdb    1995.00    12424.00    51008.00     12424     51008
> sdc    1698.00     8160.00    51000.00      8160     51000
> sdd       0.00        0.00        0.00         0         0
> md0       0.00        0.00        0.00         0         0
> md1     450.00        0.00   102400.00         0    102400
>
>
> Benchmark After patch
>
> sda     389.11        0.00   128530.69         0    129816
> sdb     381.19        0.00   129354.46         0    130648
> sdc     383.17        0.00   128530.69         0    129816
> sdd       0.00        0.00        0.00         0         0
> md0       0.00        0.00        0.00         0         0
> md1    1140.59        0.00   259548.51         0    262144
>
> As one can see, no additional reads were done. One can actually
> calculate the raid's utilization: (n-1)/n * (single-disk throughput
> with 1M writes).
>
>
>   3.  The patch code.
>   The kernel tested above was 2.6.17. The patch is against 2.6.20.2,
> because I have noticed big code differences between 2.6.17 and 2.6.20.x.
> This patch was not tested on 2.6.20.2, but it is essentially the same. I
> have not (yet) tested degraded mode or any other non-common paths.
>
This is along the same lines as what I am working on (new cache
policies for raid5/6), so I want to give it a try as well.
Unfortunately gmail has mangled your patch.  Can you resend it as an
attachment?

patch:  malformed patch at line 10:
(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))

Thanks,
Dan



Hello Dan.
Attached are the patches. Also, I have added another test unit: random_writev.
It is not much code, but it does the job. It tests writing a
vector; it shows the same results as writing with a single buffer.
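
For readers following along, below is a rough sketch of the kind of
stripe-aligned O_DIRECT write test that simple_write and random_writev
perform (argument handling, sizes, and the device path are assumptions,
not the actual test code):

/* simple_write-style sketch: sequential, stripe-sized O_DIRECT writes.
 * For a 3-disk raid5 with a 1 MB chunk, 2048 KB is one full stripe of
 * data.  Illustrative only. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 4) {
                fprintf(stderr, "usage: %s <device> <kbytes> <count>\n", argv[0]);
                return 1;
        }
        size_t len = (size_t)atol(argv[2]) * 1024;      /* e.g. 2048 KB */
        long count = atol(argv[3]);

        int fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* O_DIRECT needs aligned buffers; page alignment is always enough. */
        if (posix_memalign(&buf, 4096, len)) { perror("posix_memalign"); return 1; }
        memset(buf, 0xab, len);

        for (long i = 0; i < count; i++) {
                if (write(fd, buf, len) != (ssize_t)len) {      /* sequential */
                        perror("write");
                        break;
                }
        }

        free(buf);
        close(fd);
        return 0;
}

With writes sized and aligned to a full stripe, raid5 can compute parity
without reading the other chunks back, which is what the zero read
columns in the "after patch" benchmark reflect.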

What are the new cache policies?

Please note!
I haven't indented the patch or followed the instructions in the
SubmittingPatches document. If Neil approves this patch or parts
of it, I will do so.

# Benchmark 3:  Testing an 8-disk raid5.

Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the
controller is a Promise in JBOD mode.

raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] s

Re: [PATCH] md: Avoid a deadlock when removing a device from an md array via sysfs.

2007-04-02 Thread Neil Brown
On Monday April 2, [EMAIL PROTECTED] wrote:
> 
> What guarantees that *rdev is still valid when delayed_delete() runs?

Because that is how kobjects and krefs work.  There is an embedded
refcount etc etc..
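
Roughly, the pattern being relied on looks like this (a hedged sketch;
the names mirror md's but this is illustrative, not md code): the
structure that embeds the kobject is freed only from the kobject's
release callback, which the kobject core calls when the last reference
is dropped, so the rdev is still valid while the scheduled
delayed_delete() runs.

/* Illustrative sketch of the embedded-kobject lifetime rule. */
#include <linux/kobject.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct demo_rdev {                      /* stand-in for mdk_rdev_t */
        struct kobject kobj;
        struct work_struct del_work;
};

/* Runs only after the last kobject_put(); until then the containing
 * demo_rdev stays allocated, so delayed work can safely touch it. */
static void demo_rdev_release(struct kobject *ko)
{
        kfree(container_of(ko, struct demo_rdev, kobj));
}

static struct kobj_type demo_rdev_ktype = {
        .release = demo_rdev_release,
};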

> 
> And what guarantees that the md module hasn't been rmmodded when
> delayed_delete() tries to run?

Good point.  Nothing.  Maybe this patch is needed.

Thanks,
NeilBrown

---
Avoid a deadlock when removing a device from an md array via sysfs. - fix

Make sure any delayed_delete calls finish before module unload.
For simplicity, flush the queue when we stop the array.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |3 +++
 1 file changed, 3 insertions(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-04-02 17:38:46.0 +1000
+++ ./drivers/md/md.c   2007-04-02 18:49:24.0 +1000
@@ -3410,6 +3410,9 @@ static int do_md_stop(mddev_t * mddev, i
sysfs_remove_link(&mddev->kobj, nm);
}
 
+   /* make sure all delayed_delete calls have finished */
+   flush_scheduled_work();
+
export_array(mddev);
 
mddev->array_size = 0;


Re: [PATCH] md: Avoid a deadlock when removing a device from an md array via sysfs.

2007-04-02 Thread Andrew Morton
On Mon, 2 Apr 2007 17:44:17 +1000 NeilBrown <[EMAIL PROTECTED]> wrote:

> (This patch should go in 2.6.21 as it fixes a recent regression - NB)
> 
> A device can be removed from an md array via e.g.
>   echo remove > /sys/block/md3/md/dev-sde/state
> 
> This will try to remove the 'dev-sde' subtree which will deadlock
> since
>   commit e7b0d26a86943370c04d6833c6edba2a72a6e240
> 
> With this patch we run the kobject_del via schedule_work so as to
> avoid the deadlock.
> 
> Cc: Alan Stern <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
> ### Diffstat output
>  ./drivers/md/md.c   |   13 -
>  ./include/linux/raid/md_k.h |1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff .prev/drivers/md/md.c ./drivers/md/md.c
> --- .prev/drivers/md/md.c 2007-04-02 17:43:03.0 +1000
> +++ ./drivers/md/md.c 2007-04-02 17:38:46.0 +1000
> @@ -1389,6 +1389,12 @@ static int bind_rdev_to_array(mdk_rdev_t
>   return err;
>  }
>  
> +static void delayed_delete(struct work_struct *ws)
> +{
> + mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
> + kobject_del(&rdev->kobj);
> +}
> +
>  static void unbind_rdev_from_array(mdk_rdev_t * rdev)
>  {
>   char b[BDEVNAME_SIZE];
> @@ -1401,7 +1407,12 @@ static void unbind_rdev_from_array(mdk_r
>   printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
>   rdev->mddev = NULL;
>   sysfs_remove_link(&rdev->kobj, "block");
> - kobject_del(&rdev->kobj);
> +
> + /* We need to delay this, otherwise we can deadlock when
> +  * writing to 'remove' to "dev/state"
> +  */
> + INIT_WORK(&rdev->del_work, delayed_delete);
> + schedule_work(&rdev->del_work);
>  }
>  
>  /*
> 
> diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
> --- .prev/include/linux/raid/md_k.h   2007-04-02 17:43:03.0 +1000
> +++ ./include/linux/raid/md_k.h   2007-04-02 17:36:32.0 +1000
> @@ -104,6 +104,7 @@ struct mdk_rdev_s
>  * for reporting to userspace and 
> storing
>  * in superblock.
>  */
> + struct work_struct del_work;/* used for delayed sysfs removal */
>  };
>  

What guarantees that *rdev is still valid when delayed_delete() runs?

And what guarantees that the md module hasn't been rmmodded when
delayed_delete() tries to run?



[PATCH] md: Avoid a deadlock when removing a device from an md array via sysfs.

2007-04-02 Thread NeilBrown
(This patch should go in 2.6.21 as it fixes a recent regression - NB)

A device can be removed from an md array via e.g.
  echo remove > /sys/block/md3/md/dev-sde/state

This will try to remove the 'dev-sde' subtree which will deadlock
since
  commit e7b0d26a86943370c04d6833c6edba2a72a6e240

With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.

Cc: Alan Stern <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c   |   13 -
 ./include/linux/raid/md_k.h |1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-04-02 17:43:03.0 +1000
+++ ./drivers/md/md.c   2007-04-02 17:38:46.0 +1000
@@ -1389,6 +1389,12 @@ static int bind_rdev_to_array(mdk_rdev_t
return err;
 }
 
+static void delayed_delete(struct work_struct *ws)
+{
+   mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
+   kobject_del(&rdev->kobj);
+}
+
 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
 {
char b[BDEVNAME_SIZE];
@@ -1401,7 +1407,12 @@ static void unbind_rdev_from_array(mdk_r
printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
rdev->mddev = NULL;
sysfs_remove_link(&rdev->kobj, "block");
-   kobject_del(&rdev->kobj);
+
+   /* We need to delay this, otherwise we can deadlock when
+* writing to 'remove' to "dev/state"
+*/
+   INIT_WORK(&rdev->del_work, delayed_delete);
+   schedule_work(&rdev->del_work);
 }
 
 /*

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2007-04-02 17:43:03.0 +1000
+++ ./include/linux/raid/md_k.h 2007-04-02 17:36:32.0 +1000
@@ -104,6 +104,7 @@ struct mdk_rdev_s
   * for reporting to userspace and 
storing
   * in superblock.
   */
+   struct work_struct del_work;/* used for delayed sysfs removal */
 };
 
 struct mddev_s