RE: Could bio_vec be referenced any time?
Hi NeilBrown,

Thank you for your help and for introducing ksymoops to me. I think you are right: the BIO is handed to my module by MD putting it into a shared kfifo, so by the time my thread inspects it, the request may already have completed. THX :)

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Neil Brown
Sent: Wednesday, February 07, 2007 3:46 AM
To: Yu-Chen Wu
Cc: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: Re: Could bio_vec be referenced any time?

On Tuesday February 6, [EMAIL PROTECTED] wrote:
> Hi all,
> I wrote a module that creates a kernel thread to show the BIOs coming from the MD module. The kernel thread calls show_bio() when md passes a BIO to my module, and sleeps otherwise. Sometimes show_bio() keeps working successfully, but sometimes it triggers a general protection fault. show_bio() always works when I comment out the bio_for_each_segment loop. Is the part I commented out the cause of the fault? I believe it is the main problem, but I would like to hear your opinions. Thank you for your help. THX

Without seeing how the bio gets to show_bio it is hard to be certain, but my guess would be that by the time show_bio tries to inspect the bio, the IO request involving it has already completed and the bio has been freed, so you are accessing freed memory.

> Feb  6 22:00:28 RAID-SUSE kernel: Code: 8b 00 f6 c4 08 74 0e 48 c7 c7 14 9c 45 88 31 c0 e8 b5 bf e2

If you feed this line into ksymoops you get:

Code;  Before first symbol
   0:   8b 00                     mov    (%rax),%eax

... so it is trying to dereference $rax.

> Feb  6 22:00:28 RAID-SUSE kernel: RAX: 6b6b6b6b6b6b6b6b RBX: 810037f52668 RCX: 0004

RAX contains 6b6b6b6b6b6b6b6b, which is lots of copies of POISON_FREE (defined in include/linux/poison.h), which makes it really look like that memory has already been freed.

NeilBrown
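To make the failure mode concrete, here is a minimal sketch (not the original poster's module; queue_bio_for_show() and consume_bio() are made-up names) of pinning a reference with bio_get()/bio_put() so a worker thread can safely walk the segments later. It assumes a 2.6.20-era kernel, where bio_for_each_segment() takes (bvec, bio, index).

#include <linux/kernel.h>
#include <linux/bio.h>

static void show_bio(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	/* walk the segments; safe only while we hold a reference on the bio */
	bio_for_each_segment(bvec, bio, i)
		printk(KERN_INFO "seg %d: page %p len %u off %u\n",
		       i, bvec->bv_page, bvec->bv_len, bvec->bv_offset);
}

/* producer side (called from the md path): pin the bio, then queue it */
static void queue_bio_for_show(struct bio *bio)
{
	bio_get(bio);		/* keep the bio (and its bio_vec array) alive */
	/* ... push 'bio' into the shared kfifo here ... */
}

/* consumer side (the kernel thread): inspect, then drop the reference */
static void consume_bio(struct bio *bio)
{
	show_bio(bio);
	bio_put(bio);		/* may free the bio if this was the last reference */
}

Without the extra reference, the bio can be freed as soon as the I/O completes, which matches the POISON_FREE pattern seen in RAX above.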
Move superblock on partition resize?
I am trying to grow a raid5 volume in place. I would like to expand the partition boundaries, then grow the raid5 into the newly expanded partitions. I was wondering if there is a way to move the superblock from the end of the old partition to the end of the new partition.

I've tried

  dd if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256 skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks - 256)

without success. Copying the last 128KB (256 blocks) of the old partition to a file before changing the partition table, then placing that data at the tail of the new partition, also got me nowhere.

I can drop one drive at a time from the array, change the partition table, then hot-add it, but a resync times 7 drives is a lot of juggling. Any ideas?

Thanks,
Rob
Re: Move superblock on partition resize?
Rob Bray wrote:
> I am trying to grow a raid5 volume in place. I would like to expand the partition boundaries, then grow the raid5 into the newly expanded partitions. I was wondering if there is a way to move the superblock from the end of the old partition to the end of the new partition.
> I've tried
>   dd if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256 skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks - 256)
> without success. Copying the last 128KB (256 blocks) of the old partition to a file before changing the partition table, then placing that data at the tail of the new partition, also got me nowhere. I can drop one drive at a time from the array, change the partition table, then hot-add it, but a resync times 7 drives is a lot of juggling. Any ideas?

The superblock location is somewhat tricky to calculate correctly. I've used a tiny program (attached) for exactly this purpose.

/mjt

/* mdsuper: read or write a linux software raid superblock (version 0.90)
 * from or to a given device.
 *
 * GPL.
 * Written by Michael Tokarev ([EMAIL PROTECTED])
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/raid/md_p.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	unsigned long long dsize;
	unsigned long long offset;
	int mdfd;
	int n;
	mdp_super_t super;
	const char *dev;

	if (argc != 3) {
		fprintf(stderr, "mdsuper: usage: mdsuper {read|write} mddev\n");
		return 1;
	}
	if (strcmp(argv[1], "read") == 0)
		n = O_RDONLY;
	else if (strcmp(argv[1], "write") == 0)
		n = O_WRONLY;
	else {
		fprintf(stderr, "mdsuper: read or write arg required, not \"%s\"\n",
			argv[1]);
		return 1;
	}
	dev = argv[2];
	mdfd = open(dev, n, 0);
	if (mdfd < 0) {
		perror(dev);
		return 1;
	}
	if (ioctl(mdfd, BLKGETSIZE64, &dsize) < 0) {
		perror(dev);
		return 1;
	}
	if (dsize < MD_RESERVED_SECTORS*2) {
		fprintf(stderr, "mdsuper: %s is too small\n", dev);
		return 1;
	}
	offset = MD_NEW_SIZE_SECTORS(dsize>>9);
	fprintf(stderr, "size=%Lu (%Lu sect), offset=%Lu (%Lu sect)\n",
		dsize, dsize>>9, offset * 512, offset);
	offset *= 512;
	if (n == O_RDONLY) {
		if (pread64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
			perror(dev);
			return 1;
		}
		if (super.md_magic != MD_SB_MAGIC) {
			fprintf(stderr, "%s: bad magic (0x%08x, should be 0x%08x)\n",
				dev, super.md_magic, MD_SB_MAGIC);
			return 1;
		}
		if (write(1, &super, sizeof(super)) != sizeof(super)) {
			perror("write");
			return 1;
		}
	} else {
		if (read(0, &super, sizeof(super)) != sizeof(super)) {
			perror("read");
			return 1;
		}
		if (pwrite64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
			perror(dev);
			return 1;
		}
	}
	return 0;
}
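For orientation, here is a rough sketch of where the version-0.90 superblock actually sits and how the program above might be used. The arithmetic follows MD_NEW_SIZE_SECTORS() with MD_RESERVED_SECTORS = 128, as in kernels of this era; the device name and the ./mdsuper binary name are placeholders, and the commands are illustrative rather than tested.

# The 0.90 superblock lives at a 64 KiB-aligned offset near the end of
# the device:  offset_bytes = ((sectors & ~127) - 128) * 512
# which is why copying "the last 256 blocks" usually misses it.
sectors=$(blockdev --getsz /dev/sdX1)          # size in 512-byte sectors
echo $(( ((sectors & ~127) - 128) * 512 ))     # byte offset of the superblock

# Using the attached program: save the superblock, enlarge the partition,
# then write it back; mdsuper recomputes the offset from the new size.
./mdsuper read /dev/sdX1 > sb.bin
# ... repartition /dev/sdX1 to its new, larger size ...
./mdsuper write /dev/sdX1 < sb.bin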
[PATCH] md: Avoid possible BUG_ON in md bitmap handling.
[[ This patch is against 2.6.20 rather than -mm, as the new plugging stuff in -mm breaks md/raid1/bitmap so I couldn't test it there... It is probably appropriate for -stable, though I expect the failure case is fairly uncommon (raid1 over multipath), but what would I know about how common things are :-? ]]

md/bitmap tracks how many active write requests are pending on blocks associated with each bit in the bitmap, so that it knows when it can clear the bit (when the count hits zero).

The counter has 14 bits of space, so if there are ever more than 16383 pending writes, we cannot cope. Currently the code just calls BUG_ON, as all drivers have request queue limits much smaller than this. However it seems that some don't: apparently some multipath configurations can allow more than 16383 concurrent write requests.

So, in this unlikely situation, instead of calling BUG_ON we now wait for the count to drop down a bit. This requires a new wait_queue_head, some waiting code, and a wakeup call.

Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower in that case...).

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/bitmap.c         |   22 +++++++++++++++++++++-
 ./include/linux/raid/bitmap.h |    1 +
 2 files changed, 22 insertions(+), 1 deletion(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2007-02-07 13:03:56.000000000 +1100
+++ ./drivers/md/bitmap.c	2007-02-07 21:34:47.000000000 +1100
@@ -1160,6 +1160,22 @@ int bitmap_startwrite(struct bitmap *bit
 			return 0;
 		}
 
+		if (unlikely((*bmc & COUNTER_MAX) == COUNTER_MAX)) {
+			DEFINE_WAIT(__wait);
+			/* note that it is safe to do the prepare_to_wait
+			 * after the test as long as we do it before dropping
+			 * the spinlock.
+			 */
+			prepare_to_wait(&bitmap->overflow_wait, &__wait,
+					TASK_UNINTERRUPTIBLE);
+			spin_unlock_irq(&bitmap->lock);
+			bitmap->mddev->queue
+				->unplug_fn(bitmap->mddev->queue);
+			schedule();
+			finish_wait(&bitmap->overflow_wait, &__wait);
+			continue;
+		}
+
 		switch(*bmc) {
 		case 0:
 			bitmap_file_set_bit(bitmap, offset);
@@ -1169,7 +1185,7 @@ int bitmap_startwrite(struct bitmap *bit
 		case 1:
 			*bmc = 2;
 		}
-		BUG_ON((*bmc & COUNTER_MAX) == COUNTER_MAX);
+
 		(*bmc)++;
 
 		spin_unlock_irq(&bitmap->lock);
@@ -1207,6 +1223,9 @@ void bitmap_endwrite(struct bitmap *bitm
 		if (!success && ! (*bmc & NEEDED_MASK))
 			*bmc |= NEEDED_MASK;
 
+		if ((*bmc & COUNTER_MAX) == COUNTER_MAX)
+			wake_up(&bitmap->overflow_wait);
+
 		(*bmc)--;
 		if (*bmc <= 2) {
 			set_page_attr(bitmap,
@@ -1431,6 +1450,7 @@ int bitmap_create(mddev_t *mddev)
 	spin_lock_init(&bitmap->lock);
 	atomic_set(&bitmap->pending_writes, 0);
 	init_waitqueue_head(&bitmap->write_wait);
+	init_waitqueue_head(&bitmap->overflow_wait);
 
 	bitmap->mddev = mddev;

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h	2007-02-07 13:03:56.000000000 +1100
+++ ./include/linux/raid/bitmap.h	2007-02-07 20:57:57.000000000 +1100
@@ -247,6 +247,7 @@ struct bitmap {
 
 	atomic_t pending_writes; /* pending writes to the bitmap file */
 	wait_queue_head_t write_wait;
+	wait_queue_head_t overflow_wait;
 
 };
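For readers who do not have the header in front of them, this is the counter layout being described; it is reproduced from include/linux/raid/bitmap.h of roughly this kernel version, so double-check against your own tree. Each bitmap chunk gets a 16-bit counter whose top two bits are flags, leaving 14 bits for the pending-write count and hence the 16383 limit.

typedef __u16 bitmap_counter_t;

#define COUNTER_BITS 16
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1))) /* flag: 0x8000 */
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) /* flag: 0x4000 */
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)           /* 0x3fff = 16383 */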
Re: [RFC][PATCH 00/12] md raid acceleration and performance analysis
On 2/6/07, Leech, Christopher [EMAIL PROTECTED] wrote:
> Hi Dan,
>
> I've been looking over how your patches change the ioatdma driver. I like the idea of removing the multiple entry points for virtual address vs. page struct arguments, and just using dma_addr_t for the driver interfaces. But I don't think having both ioatdma and iop-adma implement map_page, map_single, unmap_page, and unmap_single entry points is much better.
>
> Do you see a reason why it wouldn't work to expose the generic device for a DMA channel, and replace instances of
>
>   dma_device->map_single(dma_chan, src, len, DMA_TO_DEVICE)
>
> with
>
>   dma_map_single(dma_device->dev, src, len, DMA_TO_DEVICE)

I was initially concerned about a case where dma_map_single was not equivalent to pci_map_single. Looking now, it appears that case would be a bug, so I will integrate this change.

> I am a little concerned about having the DMA mapping happen outside of the driver, but the unmapping is still in the driver cleanup routine. I'm not sure if it's really a problem, or how I'd change it though.
>
> - Chris

Thanks,
Dan
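As a small illustration of the substitution being discussed (a sketch only: it presumes the proposed struct device pointer is exposed on struct dma_device as 'dev', and map_for_copy/unmap_after_copy are made-up helper names), the driver-private entry point is replaced by the generic DMA mapping API called against the channel's underlying device:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

static dma_addr_t map_for_copy(struct dma_chan *chan, void *buf, size_t len)
{
	/* before: chan->device->map_single(chan, buf, len, DMA_TO_DEVICE) */
	return dma_map_single(chan->device->dev, buf, len, DMA_TO_DEVICE);
}

/* the matching unmap; Chris's point is that today this half still happens
 * inside the driver's cleanup path rather than next to the map */
static void unmap_after_copy(struct dma_chan *chan, dma_addr_t addr, size_t len)
{
	dma_unmap_single(chan->device->dev, addr, len, DMA_TO_DEVICE);
}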
mdadm RAID5 array failure
I'm running an FC4 system. I was copying some files onto the server this weekend when the server locked up hard, and I had to power off. I rebooted the server and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours hoping it would come back, but I had to power off the server again.

When I now reboot the server, it fails to start my raid5 array:

mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

I've added the output from the various files/commands at the bottom. I am a little confused by the output: according to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks?

It looks like there is only 1 failed disk; I got an error from SMARTD about it when I got the server back into multiuser mode, so I know there is an issue with that disk (Device: /dev/hde, 8 Offline uncorrectable sectors), but there are still enough disks to bring up the array and for the spare disk to start rebuilding.

I've spent the last couple of days googling around, and I can't seem to find much on how to recover a failed md array. Is there any way to get the array back and working? Unfortunately I don't have a backup of this array, and I'd really like to try to get the data back (there are 3 LVM logical volumes on it).

Thanks very much for any help.

Graham

My /etc/mdadm.conf looks like this:

# cat /etc/mdadm.conf
DEVICE /dev/hd*[a-z]
ARRAY /dev/md0 level=raid5 num-devices=6 UUID=96c7d78a:2113ea58:9dc237f1:79a60ddf
   devices=/dev/hdh,/dev/hdg,/dev/hdf,/dev/hde,/dev/hdd,/dev/hdc,/dev/hdb

Looking at /proc/mdstat, I get this output:

# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : inactive hdc[0] hdb[6] hdh[5] hdg[4] hdf[3] hde[2] hdd[1]
      137832 blocks super non-persistent

Here's the output when run on the device that is thought to have failed.
# mdadm -E /dev/hde
/dev/hde:
          Magic : a92b4efc
        Version : 00.90.02
           UUID : 96c7d78a:2113ea58:9dc237f1:79a60ddf
  Creation Time : Wed Feb  1 17:10:39 2006
     Raid Level : raid5
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0

    Update Time : Sun Feb  4 17:29:53 2007
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1
       Checksum : dcab70d - correct
         Events : 0.840944

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     2      33        0        2      active sync   /dev/hde

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

Running mdadm -E on /dev/hd[bcgh] gives this:

      Number   Major   Minor   RaidDevice State
this     6       3       64        6      spare   /dev/hdb

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2       0        0        2      faulty removed
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

And running mdadm -E on /dev/hd[def]:

      Number   Major   Minor   RaidDevice State
this     3      33       64        3      active sync   /dev/hdf

   0     0      22        0        0      active sync   /dev/hdc
   1     1      22       64        1      active sync   /dev/hdd
   2     2      33        0        2      active sync   /dev/hde
   3     3      33       64        3      active sync   /dev/hdf
   4     4      34        0        4      active sync   /dev/hdg
   5     5      34       64        5      active sync   /dev/hdh
   6     6       3       64        6      spare   /dev/hdb

Looking at /var/log/messages shows the following:

Feb  6 12:36:42 file01bert kernel: md: bind<hdd>
Feb  6 12:36:42 file01bert kernel: md: bind<hde>
Feb  6 12:36:42 file01bert kernel: md: bind<hdf>
Feb  6 12:36:42 file01bert kernel: md: bind<hdg>
Feb  6 12:36:42 file01bert kernel: md: bind<hdh>
Feb  6 12:36:42 file01bert kernel: md: bind<hdb>
Feb  6 12:36:42 file01bert kernel: md: bind<hdc>
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hdf from array!
Feb  6 12:36:42 file01bert kernel: md: unbind<hdf>
Feb  6 12:36:42 file01bert kernel: md: export_rdev(hdf)
Feb  6 12:36:42 file01bert kernel: md: kicking non-fresh hde from array!
Feb  6 12:36:42
Re: mdadm RAID5 array failure
On Thursday February 8, [EMAIL PROTECTED] wrote:
> I'm running an FC4 system. I was copying some files onto the server this weekend when the server locked up hard, and I had to power off. I rebooted the server and the array came up fine, but when I tried to fsck the filesystem, fsck just locked up at about 40%. I left it sitting there for 12 hours hoping it would come back, but I had to power off the server again.
>
> When I now reboot the server, it fails to start my raid5 array:
>
> mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array.

mdadm -Af /dev/md0

should get it back for you. But you really want to find out why it died. Were there any kernel messages at the time of the first failure? What kernel version are you running?

> I've added the output from the various files/commands at the bottom. I am a little confused by the output: according to /dev/hd[cgh], there is only 1 failed disk in the array, so why does it think that there are 3 failed disks?

You need to look at the 'Events' count. md will look for the device with the highest event count and reject anything with an event count 2 or more below that.

NeilBrown
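To make that suggestion concrete, here is a sketch of the recovery sequence, using the device and array names from the report above; it is illustrative only, and it is worth re-checking the Events counts with mdadm -E on each member before forcing anything.

# the array is currently sitting half-assembled and inactive, so stop it first
mdadm --stop /dev/md0
# -Af (--assemble --force) lets mdadm accept the freshest of the "non-fresh"
# members; with enough devices the array starts and the spare begins rebuilding
mdadm -Af /dev/md0
cat /proc/mdstat        # watch the resync progress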