RE: Could bio_vec be referenced any time?

2007-02-07 Thread Yu-Chen Wu
Hi NeilBrown,
Thank you for your help and for introducing ksymoops to me.
I think you are right.
The BIO is passed from MD by putting it into a shared kfifo.
THX : )
-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Neil Brown
Sent: Wednesday, February 07, 2007 3:46 AM
To: Yu-Chen Wu
Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: Re: Could bio_vec be referenced any time?

On Tuesday February 6, [EMAIL PROTECTED] wrote:
> Hi all,
>   I wrote a module that creates a kernel thread to show the BIOs coming
> from the MD modules.
>   The kernel thread calls show_bio() when md passes a BIO to my
> module, and sleeps otherwise.
>   Sometimes show_bio() keeps working successfully, but sometimes it
> causes a general protection fault.
>   show_bio() always works when I comment out the
> bio_for_each_segment loop.
>   Is the section I commented out the cause of the fault?
>   As above, I consider it the main problem. I would also really like to
> hear your opinions. Thank you for your help.
>
>   THX

Without seeing how the bio gets to show_bio it is hard to be certain,
but my guess would be that by the time show_bio tries to inspect the
bio, the IO request involving it has already completed and the bio has
been freed, so you are accessing freed memory.

> Feb  6 22:00:28 RAID-SUSE kernel: Code: 8b 00 f6 c4 08 74 0e 48 c7 c7 14 9c
> 45 88 31 c0 e8 b5 bf e2

If you feed this line into ksymoops you get:

Code;   Before first symbol
   0:   8b 00                   mov    (%rax),%eax
...

so it is trying to dereference %rax.

> Feb  6 22:00:28 RAID-SUSE kernel: RAX: 6b6b6b6b6b6b6b6b RBX:
> 810037f52668 RCX: 0004

RAX contains 6b6b6b6b6b6b6b6b, which is lots of copies of 'POISON_FREE'
(defined in include/linux/poison.h), and that makes it really look like
the memory has already been freed.
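
If the bio really is handed to your thread through a kfifo, one way to
avoid this is to hold a reference on the bio for as long as the thread
may look at it.  A rough sketch - bio_get()/bio_put() are the real
refcounting calls, the queue/dequeue helpers are just placeholders for
however you use your kfifo:

	/* producer side, where md hands the bio to your module */
	bio_get(bio);            /* take a reference before queueing it */
	my_queue_bio(bio);       /* placeholder: push the pointer onto the kfifo */

	/* consumer side, in your kernel thread */
	bio = my_dequeue_bio();  /* placeholder: pop the pointer off the kfifo */
	show_bio(bio);           /* bio_for_each_segment is safe while the
	                          * reference is held */
	bio_put(bio);            /* drop the reference when done */

Note this only keeps the bio structure itself allocated; the data pages
behind each bio_vec may still have been reused once the request completes.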

NeilBrown


Move superblock on partition resize?

2007-02-07 Thread Rob Bray
I am trying to grow a raid5 volume in-place. I would like to expand the
partition boundaries, then grow raid5 into the newly-expanded partitions.
I was wondering if there is a way to move the superblock from the end of
the old partition to the end of the new partition. I've tried dd
if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256
skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks -
256) unsuccessfully. Also, copying the last 128KB (256 blocks) of the old
partition before the table modification to a file, and placing that data
at the tail of the new partition also yields no beans. I can drop one
drive at a time from the group, change the partition table, then hot-add
it, but a resync times 7 drives is a lot of juggling. Any ideas?

Thanks,
Rob



Re: Move superblock on partition resize?

2007-02-07 Thread Michael Tokarev
Rob Bray wrote:
> I am trying to grow a raid5 volume in-place. I would like to expand the
> partition boundaries, then grow raid5 into the newly-expanded partitions.
> I was wondering if there is a way to move the superblock from the end of
> the old partition to the end of the new partition. I've tried dd
> if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256
> skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks -
> 256) unsuccessfully. Also, copying the last 128KB (256 blocks) of the old
> partition before the table modification to a file, and placing that data
> at the tail of the new partition also yields no beans. I can drop one
> drive at a time from the group, change the partition table, then hot-add
> it, but a resync times 7 drives is a lot of juggling. Any ideas?

The superblock location is somewhat tricky to calculate correctly.
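
For reference (quoting include/linux/raid/md_p.h from memory, so check the
definitions in your tree), the 0.90 superblock offset comes from:

	#define MD_RESERVED_BYTES       (64 * 1024)
	#define MD_RESERVED_SECTORS     (MD_RESERVED_BYTES / 512)   /* 128 */
	#define MD_NEW_SIZE_SECTORS(x)  ((x & ~(MD_RESERVED_SECTORS - 1)) \
	                                 - MD_RESERVED_SECTORS)

i.e. the superblock sits in the last 64K-aligned 64K chunk of the device,
which is why copying a fixed "last 128KB" region usually misses it.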

I've used a tiny program (attached) for exactly this purpose.

/mjt
/* mdsuper: read or write a linux software raid superblock (version 0.90)
 * from or to a given device.
 * 
 * GPL.
 * Written by Michael Tokarev ([EMAIL PROTECTED])
 */

#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/raid/md_p.h>
#include <linux/fs.h>

int main(int argc, char **argv) {
  unsigned long long dsize;
  unsigned long long offset;
  int mdfd;
  int n;
  mdp_super_t super;
  const char *dev;

  if (argc != 3) {
    fprintf(stderr, "mdsuper: usage: mdsuper {read|write} mddev\n");
    return 1;
  }

  if (strcmp(argv[1], "read") == 0)
    n = O_RDONLY;
  else if (strcmp(argv[1], "write") == 0)
    n = O_WRONLY;
  else {
    fprintf(stderr, "mdsuper: read or write arg required, not \"%s\"\n",
            argv[1]);
    return 1;
  }

  dev = argv[2];
  mdfd = open(dev, n, 0);
  if (mdfd < 0) {
    perror(dev);
    return 1;
  }

  if (ioctl(mdfd, BLKGETSIZE64, &dsize) < 0) {
    perror(dev);
    return 1;
  }

  if (dsize < MD_RESERVED_SECTORS*2) {
    fprintf(stderr, "mdsuper: %s is too small\n", dev);
    return 1;
  }

  offset = MD_NEW_SIZE_SECTORS(dsize>>9);

  fprintf(stderr, "size=%Lu (%Lu sect), offset=%Lu (%Lu sect)\n",
          dsize, dsize>>9, offset * 512, offset);
  offset *= 512;

  if (n == O_RDONLY) {
    if (pread64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
    if (super.md_magic != MD_SB_MAGIC) {
      fprintf(stderr, "%s: bad magic (0x%08x, should be 0x%08x)\n",
              dev, super.md_magic, MD_SB_MAGIC);
      return 1;
    }
    if (write(1, &super, sizeof(super)) != sizeof(super)) {
      perror("write");
      return 1;
    }
  }
  else {
    if (read(0, &super, sizeof(super)) != sizeof(super)) {
      perror("read");
      return 1;
    }
    if (pwrite64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
  }

  return 0;
}
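
A possible way to use it for the resize case (this is just my reading of
the program above - it recomputes the offset from the current device size
each time, so verify the printed offsets before writing anything):

	./mdsuper read /dev/sdX1 > sb-sdX1.bin    # old partition table in place
	# enlarge the partition and re-read the partition table, then:
	./mdsuper write /dev/sdX1 < sb-sdX1.bin   # written at the new offset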


[PATCH] md: Avoid possible BUG_ON in md bitmap handling.

2007-02-07 Thread Neil Brown

[[This patch is against 2.6.20 rather than -mm as the new plugging stuff
  in -mm breaks md/raid1/bitmap so I couldn't test it there...
  It is probably appropriate for -stable though I expect the failure
  case is fairly uncommon (raid1 over multipath) but what would I
  know about how common things are :-?
]]


md/bitmap tracks how many active write requests are pending on blocks
associated with each bit in the bitmap, so that it knows when it can
clear the bit (when count hits zero).

The counter has 14 bits of space, so if there are ever more than 16383,
we cannot cope.
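
For context, the counter layout (quoting include/linux/raid/bitmap.h from
memory, so double-check the exact definitions in your tree) is roughly:

	typedef __u16 bitmap_counter_t;
	#define COUNTER_BITS 16
	#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
	#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
	#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)

so the top two bits are flags and the low 14 bits count in-flight writes,
which is where the 16383 limit comes from.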

Currently the code just calls BUG_ON, as all drivers have request queue
limits much smaller than this.

However it seems that some don't.  Apparently some multipath configurations
can allow more than 16383 concurrent write requests.

So, in this unlikely situation, instead of calling BUG_ON we now wait
for the count to drop down a bit.  This requires a new wait_queue_head,
some waiting code, and a wakeup call.

Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
in that case...).

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/bitmap.c |   22 +-
 ./include/linux/raid/bitmap.h |1 +
 2 files changed, 22 insertions(+), 1 deletion(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2007-02-07 13:03:56.000000000 +1100
+++ ./drivers/md/bitmap.c	2007-02-07 21:34:47.000000000 +1100
@@ -1160,6 +1160,22 @@ int bitmap_startwrite(struct bitmap *bit
 			return 0;
 		}
 
+		if (unlikely((*bmc & COUNTER_MAX) == COUNTER_MAX)) {
+			DEFINE_WAIT(__wait);
+			/* note that it is safe to do the prepare_to_wait
+			 * after the test as long as we do it before dropping
+			 * the spinlock.
+			 */
+			prepare_to_wait(&bitmap->overflow_wait, &__wait,
+					TASK_UNINTERRUPTIBLE);
+			spin_unlock_irq(&bitmap->lock);
+			bitmap->mddev->queue
+				->unplug_fn(bitmap->mddev->queue);
+			schedule();
+			finish_wait(&bitmap->overflow_wait, &__wait);
+			continue;
+		}
+
 		switch(*bmc) {
 		case 0:
 			bitmap_file_set_bit(bitmap, offset);
@@ -1169,7 +1185,7 @@ int bitmap_startwrite(struct bitmap *bit
 		case 1:
 			*bmc = 2;
 		}
-		BUG_ON((*bmc & COUNTER_MAX) == COUNTER_MAX);
+
 		(*bmc)++;
 
 		spin_unlock_irq(&bitmap->lock);
@@ -1207,6 +1223,9 @@ void bitmap_endwrite(struct bitmap *bitm
 		if (!success && ! (*bmc & NEEDED_MASK))
 			*bmc |= NEEDED_MASK;
 
+		if ((*bmc & COUNTER_MAX) == COUNTER_MAX)
+			wake_up(&bitmap->overflow_wait);
+
 		(*bmc)--;
 		if (*bmc <= 2) {
 			set_page_attr(bitmap,
@@ -1431,6 +1450,7 @@ int bitmap_create(mddev_t *mddev)
 	spin_lock_init(&bitmap->lock);
 	atomic_set(&bitmap->pending_writes, 0);
 	init_waitqueue_head(&bitmap->write_wait);
+	init_waitqueue_head(&bitmap->overflow_wait);
 
 	bitmap->mddev = mddev;
 

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h	2007-02-07 13:03:56.000000000 +1100
+++ ./include/linux/raid/bitmap.h	2007-02-07 20:57:57.000000000 +1100
@@ -247,6 +247,7 @@ struct bitmap {
 
 	atomic_t pending_writes; /* pending writes to the bitmap file */
 	wait_queue_head_t write_wait;
+	wait_queue_head_t overflow_wait;
 
 };
 


Re: [RFC][PATCH 00/12] md raid acceleration and performance analysis

2007-02-07 Thread Dan Williams

On 2/6/07, Leech, Christopher [EMAIL PROTECTED] wrote:

> Hi Dan,
>
> I've been looking over how your patches change the ioatdma driver.  I
> like the idea of removing the multiple entry points for virtual address
> vs. page struct arguments, and just using dma_addr_t for the driver
> interfaces.
>
> But, I don't think having both ioatdma and iop-adma implement map_page,
> map_single, unmap_page, and unmap_single entry points is much better.
> Do you see a reason why it wouldn't work to expose the generic device
> for a DMA channel, and replace instances of
>
> dma_device->map_single(dma_chan, src, len, DMA_TO_DEVICE)
>
> with
>
> dma_map_single(dma_device->dev, src, len, DMA_TO_DEVICE)



I was initially concerned about a case where dma_map_single was not
equivalent to pci_map_single.  Looking now, it appears that case would
be a bug, so I will integrate this change.


> I am a little concerned about having the DMA mapping happen outside of
> the driver, but the unmapping is still in the driver cleanup routine.
> I'm not sure if it's really a problem, or how I'd change it though.
>
> - Chris
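
For illustration, the pairing being discussed looks roughly like this with
the generic DMA API (the dma_device->dev expression is taken from the
example above; the surrounding driver code is elided):

	/* mapping done by the client, outside the DMA driver */
	dma_addr_t dma_src = dma_map_single(dma_device->dev, src, len,
					    DMA_TO_DEVICE);
	/* ... descriptor using dma_src is submitted to the channel ... */

	/* unmapping currently lives in the driver's cleanup path */
	dma_unmap_single(dma_device->dev, dma_src, len, DMA_TO_DEVICE);

The concern is that the map and unmap end up in different components,
which both have to agree on the device, length and direction.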


Thanks,
Dan


mdadm RAID5 array failure

2007-02-07 Thread jahammonds prost
I'm running an FC4 system. I was copying some files on to the server this 
weekend, and the server locked up hard, and I had to power off. I rebooted the 
server, and the array came up fine, but when I tried to fsck the filesystem, 
fsck just locked up at about 40%. I left it sitting there for 12 hours, hoping 
it was going to come back, but I had to power off the server again. When I now 
reboot the server, it is failing to mount my raid5 array..
 
  mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start 
the array.
 
I've added the output from the various files/commands at the bottom...
I am a little confused by the output. According to /dev/hd[cgh], there is only 
1 failed disk in the array, so why does it think that there are 3 failed disks 
in the array? It looks like there is only 1 failed disk – I got an error from 
SMARTD about it when I got the server back into multiuser mode, so I know there 
is an issue with the disk (Device: /dev/hde, 8 Offline uncorrectable sectors), 
but there are still enough disks to bring up the array, and for the spare disk 
to start rebuilding.
 
I've spent the last couple of days googling around, and I can't seem to find 
much on how to recover a failed md array. Is there any way to get the array 
back and working? Unfortunately I don't have a back up of this array, and I'd 
really like to try and get the data back (there are 3 LVM logical volumes on 
it).
 
Thanks very much for any help.
 
 
Graham
 
 
 
My /etc/mdadm.conf looks like this
 
]# cat /etc/mdadm.conf
DEVICE /dev/hd*[a-z]
ARRAY /dev/md0 level=raid5 num-devices=6 
UUID=96c7d78a:2113ea58:9dc237f1:79a60ddf
  
devices=/dev/hdh,/dev/hdg,/dev/hdf,/dev/hde,/dev/hdd,/dev/hdc,/dev/hdb
 
 
Looking at /proc/mdstat, I am getting this output
 
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive hdc[0] hdb[6] hdh[5] hdg[4] hdf[3] hde[2] hdd[1]
  137832 blocks super non-persistent
 
 
 
 
Here's the output when run on the device that some of the other superblocks think has failed.
 
# mdadm -E /dev/hde
/dev/hde:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 96c7d78a:2113ea58:9dc237f1:79a60ddf
  Creation Time : Wed Feb  1 17:10:39 2006
     Raid Level : raid5
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

    Update Time : Sun Feb  4 17:29:53 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : dcab70d - correct
         Events : 0.840944

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     2      33        0        2      active sync   /dev/hde

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
Running an mdadm -E on /dev/hd[bcgh] gives this,

      Number   Major   Minor   RaidDevice State
this     6       3       64        6      spare   /dev/hdb

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2       0        0        2      faulty removed
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
 
And running mdadm -E on /dev/hd[def]

      Number   Major   Minor   RaidDevice State
this     3      33       64        3      active sync   /dev/hdf

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb
 
 
Looking at /var/log/messages shows the following
 
Feb  6 12:36:42 file01bert kernel: md: bind<hdd>
Feb  6 12:36:42 file01bert kernel: md: bind<hde>
Feb  6 12:36:42 file01bert kernel: md: bind<hdf>
Feb  6 12:36:42 file01bert kernel: md: bind<hdg>
Feb  6 12:36:42 file01bert kernel: md: bind<hdh>
Feb  6 12:36:42 file01bert kernel: md: bind<hdb>
Feb  6 12:36:42 file01bert kernel: md: bind<hdc>
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hdf from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hdf>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hdf)
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hde from array!
Feb  6 12:36:42 

Re: mdadm RAID5 array failure

2007-02-07 Thread Neil Brown
On Thursday February 8, [EMAIL PROTECTED] wrote:

> I'm running an FC4 system. I was copying some files on to the server
> this weekend, and the server locked up hard, and I had to power
> off. I rebooted the server, and the array came up fine, but when I
> tried to fsck the filesystem, fsck just locked up at about 40%. I
> left it sitting there for 12 hours, hoping it was going to come
> back, but I had to power off the server again. When I now reboot the
> server, it is failing to mount my raid5 array..
>
>   mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to
> start the array.

mdadm -Af /dev/md0
should get it back for you.  But you really want to find out why it
died.
Were there any kernel messages at the time of the first failure?
What kernel version are you running?

  
> I've added the output from the various files/commands at the bottom...
> I am a little confused by the output. According to /dev/hd[cgh],
> there is only 1 failed disk in the array, so why does it think that
> there are 3 failed disks in the array?

You need to look at the 'Event' count.  md will look for the device
with the highest event count and reject anything with an event count 2
or more less than that.
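
A quick way to compare is to look at the Events line that mdadm -E prints
for each member, along the lines of (adjust the device list to taste):

	for d in /dev/hd[b-h]; do echo $d; mdadm -E $d | grep Events; done

Members whose count is 2 or more behind the highest one are the ones that
will be kicked out unless you force the assembly.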

NeilBrown