[PATCH 006 of 6] md: Make new function stripe_to_pdidx static.

2006-03-16 Thread NeilBrown


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 18:18:48.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 18:18:50.0 +1100
@@ -1037,7 +1037,7 @@ static int add_stripe_bio(struct stripe_
 
 static void end_reshape(raid5_conf_t *conf);
 
-int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 {
int sectors_per_chunk = conf->chunk_size >> 9;
sector_t x = stripe;


[PATCH 004 of 6] md: Improve comments about locking situation in raid5 make_request

2006-03-16 Thread NeilBrown


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |   15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 18:18:35.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 18:18:44.0 +1100
@@ -1766,6 +1766,14 @@ static int make_request(request_queue_t 
if (likely(conf->expand_progress == MaxSector))
disks = conf->raid_disks;
else {
+   /* spinlock is needed as expand_progress may be
+* 64bit on a 32bit platform, and so it might be
+* possible to see a half-updated value
+* Of course expand_progress could change after
+* the lock is dropped, so once we get a reference
+* to the stripe that we think it is, we will have
+* to check again.
+*/
spin_lock_irq(&conf->device_lock);
disks = conf->raid_disks;
if (logical_sector >= conf->expand_progress)
@@ -1789,7 +1797,12 @@ static int make_request(request_queue_t 
if (sh) {
if (unlikely(conf->expand_progress != MaxSector)) {
/* expansion might have moved on while waiting for a
-* stripe, so we much do the range check again.
+* stripe, so we must do the range check again.
+* Expansion could still move past after this
+* test, but as we are holding a reference to
+* 'sh', we know that if that happens,
+*  STRIPE_EXPANDING will get set and the expansion
+* won't proceed until we finish with the stripe.
 */
int must_retry = 0;
spin_lock_irq(&conf->device_lock);


[PATCH 000 of 6] md: Introduction - patching those patches.

2006-03-16 Thread NeilBrown
This is the "Andrew Morton: Awesome code reviewer" patch series, which fixes
up issues identified in my recent series of md patches.

NeilBrown


 [PATCH 001 of 6] md: INIT_LIST_HEAD to LIST_HEAD conversions.
 [PATCH 002 of 6] md: Documentation and tidy up for resize_stripes
 [PATCH 003 of 6] md: Remove an unused variable.
 [PATCH 004 of 6] md: Improve comments about locking situation in raid5 
make_request
 [PATCH 005 of 6] md: Remove some stray semi-colons after functions called in 
macro
 [PATCH 006 of 6] md: Make new function stripe_to_pdidx static.


[PATCH 001 of 6] md: INIT_LIST_HEAD to LIST_HEAD conversions.

2006-03-16 Thread NeilBrown

A couple of places we call INIT_LIST_HEAD on a locally declared
variable.  This can be changed to a LIST_HEAD declaration.
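For instance, the two forms are equivalent for a local, empty list;
a minimal sketch (not part of the patch):

    static void example(void)
    {
            struct list_head a;     /* two steps: declare ... */
            LIST_HEAD(b);           /* one step: declare and initialise */

            INIT_LIST_HEAD(&a);     /* ... then initialise at runtime */

            /* 'a' and 'b' are now both valid, empty list heads */
    }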

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c|2 +-
 ./drivers/md/raid5.c |6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-03-17 18:17:56.0 +1100
+++ ./drivers/md/md.c   2006-03-17 18:18:19.0 +1100
@@ -2895,7 +2895,6 @@ static void autorun_array(mddev_t *mddev
  */
 static void autorun_devices(int part)
 {
-   struct list_head candidates;
struct list_head *tmp;
mdk_rdev_t *rdev0, *rdev;
mddev_t *mddev;
@@ -2904,6 +2903,7 @@ static void autorun_devices(int part)
printk(KERN_INFO "md: autorun ...\n");
while (!list_empty(&pending_raid_disks)) {
dev_t dev;
+   LIST_HEAD(candidates);
rdev0 = list_entry(pending_raid_disks.next,
 mdk_rdev_t, same_set);
 

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 18:17:57.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 18:18:19.0 +1100
@@ -345,7 +345,8 @@ static int resize_stripes(raid5_conf_t *
 * at some points in this operation.
 */
struct stripe_head *osh, *nsh;
-   struct list_head newstripes, oldstripes;
+   LIST_HEAD(newstripes);
+   LIST_HEAD(oldstripes);
struct disk_info *ndisks;
int err = 0;
kmem_cache_t *sc;
@@ -359,7 +360,7 @@ static int resize_stripes(raid5_conf_t *
   0, 0, NULL, NULL);
if (!sc)
return -ENOMEM;
-   INIT_LIST_HEAD(&newstripes);
+
for (i = conf->max_nr_stripes; i; i--) {
nsh = kmem_cache_alloc(sc, GFP_NOIO);
if (!nsh)
@@ -385,7 +386,6 @@ static int resize_stripes(raid5_conf_t *
/* OK, we have enough stripes, start collecting inactive
 * stripes and copying them over
 */
-   INIT_LIST_HEAD(&oldstripes);
list_for_each_entry(nsh, &newstripes, lru) {
spin_lock_irq(&conf->device_lock);
wait_event_lock_irq(conf->wait_for_stripe,


Re: [PATCH 007 of 13] md: Core of raid5 resize process

2006-03-16 Thread Neil Brown
On Thursday March 16, [EMAIL PROTECTED] wrote:
> NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> > @@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
> >  */
> > max_sectors = mddev->resync_max_sectors;
> > mddev->resync_mismatches = 0;
> >  -  } else
> >  +  } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
> >  +  max_sectors = mddev->size << 1;
> >  +  else
> > /* recovery follows the physical size of devices */
> > max_sectors = mddev->size << 1;
> >   
> 
> This change is a no-op.   Intentional?

Uhmm... sort of.
A later patch adds stuff to the later branch but not the middle one.
This comes from creating a patch to fix a bug, then merging it back
into the wrong original patch...

NeilBrown


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Neil Brown
On Thursday March 16, [EMAIL PROTECTED] wrote:
> NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> > +   /* Got them all.
> >  +   * Return the new ones and free the old ones.
> >  +   * At this point, we are holding all the stripes so the array
> >  +   * is completely stalled, so now is a good time to resize
> >  +   * conf->disks.
> >  +   */
> >  +  ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
> >  +  if (ndisks) {
> >  +  for (i=0; i<conf->raid_disks; i++)
> >  +  ndisks[i] = conf->disks[i];
> >  +  kfree(conf->disks);
> >  +  conf->disks = ndisks;
> >  +  } else
> >  +  err = -ENOMEM;
> >  +  while(!list_empty(&newstripes)) {
> >  +  nsh = list_entry(newstripes.next, struct stripe_head, lru);
> >  +  list_del_init(&nsh->lru);
> >  +  for (i=conf->raid_disks; i < newsize; i++)
> >  +  if (nsh->dev[i].page == NULL) {
> >  +  struct page *p = alloc_page(GFP_NOIO);
> >  +  nsh->dev[i].page = p;
> >  +  if (!p)
> >  +  err = -ENOMEM;
> >  +  }
> >  +  release_stripe(nsh);
> >  +  }
> >  +  while(!list_empty(&oldstripes)) {
> >  +  osh = list_entry(oldstripes.next, struct stripe_head, lru);
> >  +  list_del(&osh->lru);
> >  +  kmem_cache_free(conf->slab_cache, osh);
> >  +  }
> >  +  kmem_cache_destroy(conf->slab_cache);
> >  +  conf->slab_cache = sc;
> >  +  conf->active_name = 1-conf->active_name;
> >  +  conf->pool_size = newsize;
> >  +  return err;
> >  +}
> 
> Are you sure the -ENOMEM handling here is solid?  It
> looks strange.

The philosophy of the -ENOMEM handling is (awkwardly?) embodied in the
comment
 * Finally we add new pages.  This could fail, but we leave
 * the stripe cache at its new size, just with some pages empty.

at the top of the function.  The core function here is making some
data structures bigger.  In each case, having a bigger data structure
than required is no big deal.  So we try to increase the size of each
of them (the stripe_head cache, the 'disks' array, and the pages
allocated to each stripe).
If any of these fails we return -ENOMEM, but may allow the others to
succeed.
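As a rough sketch of that pattern (illustrative only; the helper names
are made up, not the real code):

    static int grow_everything(raid5_conf_t *conf, int newsize)
    {
            int err = 0;

            /* each grow step is independent; a failure is recorded
             * but does not undo or block the other steps, because a
             * partially-grown cache is still perfectly usable */
            if (grow_disks_array(conf, newsize) < 0)
                    err = -ENOMEM;
            if (grow_stripe_pages(conf, newsize) < 0)
                    err = -ENOMEM;

            return err;     /* -ENOMEM if any step failed */
    }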

Does that help?

NeilBrown


Re: [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
>  --- ./drivers/md/raid5.c~current~2006-03-17 11:48:58.0 +1100
>  +++ ./drivers/md/raid5.c 2006-03-17 11:48:58.0 +1100
>  @@ -1747,8 +1747,9 @@ static int make_request(request_queue_t 

That's a fairly complex function..

I wonder about this:

spin_lock_irq(&conf->device_lock);
if (--bi->bi_phys_segments == 0) {
int bytes = bi->bi_size;

if ( bio_data_dir(bi) == WRITE )
md_write_end(mddev);
bi->bi_size = 0;
bi->bi_end_io(bi, bytes, 0);
}
spin_unlock_irq(&conf->device_lock);

bi_end_io() can be somewhat expensive.  Does it need to happen under the lock?


Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.

2006-03-16 Thread Neil Brown
On Thursday March 16, [EMAIL PROTECTED] wrote:
> NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> >  -  retry:
> > prepare_to_wait(&conf->wait_for_overlap, &w, 
> > TASK_UNINTERRUPTIBLE);
> >  -  sh = get_active_stripe(conf, new_sector, pd_idx, 
> > (bi->bi_rw&RWA_MASK));
> >  +  sh = get_active_stripe(conf, new_sector, disks, pd_idx, 
> > (bi->bi_rw&RWA_MASK));
> > if (sh) {
> >  -  if (!add_stripe_bio(sh, bi, dd_idx, 
> > (bi->bi_rw&RW_MASK))) {
> >  -  /* Add failed due to overlap.  Flush everything
> >  +  if (unlikely(conf->expand_progress != MaxSector)) {
> >  +  /* expansion might have moved on while waiting 
> > for a
> >  +   * stripe, so we much do the range check again.
> >  +   */
> >  +  int must_retry = 0;
> >  +  spin_lock_irq(&conf->device_lock);
> >  +  if (logical_sector <  conf->expand_progress &&
> >  +  disks == conf->previous_raid_disks)
> >  +  /* mismatch, need to try again */
> >  +  must_retry = 1;
> >  +  spin_unlock_irq(&conf->device_lock);
> >  +  if (must_retry) {
> >  +  release_stripe(sh);
> >  +  goto retry;
> >  +  }
> >  +  }
> 
> The locking in here looks strange.  We take the lock, do some arithmetic
> and some tests and then drop the lock again.  Is it not possible that the
> result of those tests now becomes invalid?

Obviously another comment missing.
 conf->expand_progress is a sector_t and so could be 64 bits on a 32-bit
 platform, so I cannot be sure it is updated atomically.  So I
 always access it within a lock (unless I am comparing for equality with ~0).
 
 Yes, the result can become invalid, but only in one direction:  As
 expand_progress always increases, it is possible that it will pass
 logical_sector.  When that happens, STRIPE_EXPANDING gets set on the
 stripe_head at logical_sector.
 So because we took a reference to the stripe at logical_sector *before*
 this test, and check for STRIPE_EXPANDING *after* the test, we can easily catch
 that transition.

 Putting it another way, this test is to catch cases where
 logical_sector is a long way from expand_progress, the subsequent
 test of STRIPE_EXPANDING catches cases where they are close together,
 and the ordering wrt get_active_stripe ensures there are no holes.
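
In outline, the ordering is (an illustrative sketch of the pattern, not
the actual patch):

    sh = get_active_stripe(conf, new_sector, ...);  /* reference taken first */

    spin_lock_irq(&conf->device_lock);              /* safe 64-bit read */
    stale = logical_sector < conf->expand_progress &&
            disks == conf->previous_raid_disks;     /* catches "far apart" */
    spin_unlock_irq(&conf->device_lock);

    if (stale || test_bit(STRIPE_EXPANDING, &sh->state)) {
            /* catches "close together": if expansion reached this stripe
             * after the locked test, STRIPE_EXPANDING is already set, and
             * expansion cannot proceed past 'sh' while we hold it */
            release_stripe(sh);
            goto retry;
    }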

Now to put that into a few short-but-clear comments.

Thanks again!

NeilBrown


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Neil Brown
On Thursday March 16, [EMAIL PROTECTED] wrote:
> NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> > +static int resize_stripes(raid5_conf_t *conf, int newsize)
> >  +{
> >  +  /* make all the stripes able to hold 'newsize' devices.
> >  +   * New slots in each stripe get 'page' set to a new page.
> >  +   * We allocate all the new stripes first, then if that succeeds,
> >  +   * copy everything across.
> >  +   * Finally we add new pages.  This could fail, but we leave
> >  + * the stripe cache at its new size, just with some pages empty.
> >  +   *
> >  +   * We use GFP_NOIO allocations as IO to the raid5 is blocked
> >  +   * at some points in this operation.
> >  +   */
> >  +  struct stripe_head *osh, *nsh;
> >  +  struct list_head newstripes, oldstripes;
> 
> You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().
> 

I guess.
I have to have the declaration "miles" from where I use the variable.
Do I have to have the initialisation equally far?  Ok, I'll do that..


> 
> >  +  struct disk_info *ndisks;
> >  +  int err = 0;
> >  +  kmem_cache_t *sc;
> >  +  int i;
> >  +
> >  +  if (newsize <= conf->pool_size)
> >  +  return 0; /* never bother to shrink */
> >  +
> >  +  sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
> >  + sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
> >  + 0, 0, NULL, NULL);
> 
> kmem_cache_create() internally does a GFP_KERNEL allocation.
> 
> >  +  if (!sc)
> >  +  return -ENOMEM;
> >  +  INIT_LIST_HEAD(&newstripes);
> >  +  for (i = conf->max_nr_stripes; i; i--) {
> >  +  nsh = kmem_cache_alloc(sc, GFP_NOIO);
> 
> So either this can use GFP_KERNEL, or we have a problem.

Good point.  Maybe the comment about GFP_NOIO just needs to be
improved. 
We cannot risk waiting on IO after the 
/* OK, we have enough stripes, start collecting inactive
 * stripes and copying them over
 */

comment, up to the second-last while loop, which starts
while(!list_empty(&newstripes)) {

Before that comment, which is where the kmem_cache_create is, GFP_KERNEL is OK.
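
The underlying rule, roughly: GFP_KERNEL may wait on I/O (possibly I/O
to this very array) during reclaim, so it is only safe while the array
can still make progress.  Schematically (illustrative, not the patch
text):

    sc  = kmem_cache_create(...);           /* GFP_KERNEL inside: fine, the
                                             * array is not yet stalled */
    nsh = kmem_cache_alloc(sc, GFP_NOIO);   /* could arguably be GFP_KERNEL
                                             * here too, per the review */

    /* ... once we start pinning every stripe_head, I/O to the array
     * stalls, so from here allocations must not wait on I/O: */
    p = alloc_page(GFP_NOIO);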

Thanks!

NeilBrown



Re: [PATCH 007 of 13] md: Core of raid5 resize process

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> @@ -4539,7 +4543,9 @@ static void md_do_sync(mddev_t *mddev)
>*/
>   max_sectors = mddev->resync_max_sectors;
>   mddev->resync_mismatches = 0;
>  -} else
>  +} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
>  +max_sectors = mddev->size << 1;
>  +else
>   /* recovery follows the physical size of devices */
>   max_sectors = mddev->size << 1;
>   

This change is a no-op.   Intentional?


Re: [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
>  -retry:
>   prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
>  -sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
>  +sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
>   if (sh) {
>  -if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
>  -/* Add failed due to overlap.  Flush everything
>  +if (unlikely(conf->expand_progress != MaxSector)) {
>  +/* expansion might have moved on while waiting for a
>  + * stripe, so we much do the range check again.
>  + */
>  +int must_retry = 0;
>  +spin_lock_irq(&conf->device_lock);
>  +if (logical_sector <  conf->expand_progress &&
>  +disks == conf->previous_raid_disks)
>  +/* mismatch, need to try again */
>  +must_retry = 1;
>  +spin_unlock_irq(&conf->device_lock);
>  +if (must_retry) {
>  +release_stripe(sh);
>  +goto retry;
>  +}
>  +}

The locking in here looks strange.  We take the lock, do some arithmetic
and some tests and then drop the lock again.  Is it not possible that the
result of those tests now becomes invalid?



Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> + /* Got them all.
>  + * Return the new ones and free the old ones.
>  + * At this point, we are holding all the stripes so the array
>  + * is completely stalled, so now is a good time to resize
>  + * conf->disks.
>  + */
>  +ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_NOIO);
>  +if (ndisks) {
>  +for (i=0; i<conf->raid_disks; i++)
>  +ndisks[i] = conf->disks[i];
>  +kfree(conf->disks);
>  +conf->disks = ndisks;
>  +} else
>  +err = -ENOMEM;
>  +while(!list_empty(&newstripes)) {
>  +nsh = list_entry(newstripes.next, struct stripe_head, lru);
>  +list_del_init(&nsh->lru);
>  +for (i=conf->raid_disks; i < newsize; i++)
>  +if (nsh->dev[i].page == NULL) {
>  +struct page *p = alloc_page(GFP_NOIO);
>  +nsh->dev[i].page = p;
>  +if (!p)
>  +err = -ENOMEM;
>  +}
>  +release_stripe(nsh);
>  +}
>  +while(!list_empty(&oldstripes)) {
>  +osh = list_entry(oldstripes.next, struct stripe_head, lru);
>  +list_del(&osh->lru);
>  +kmem_cache_free(conf->slab_cache, osh);
>  +}
>  +kmem_cache_destroy(conf->slab_cache);
>  +conf->slab_cache = sc;
>  +conf->active_name = 1-conf->active_name;
>  +conf->pool_size = newsize;
>  +return err;
>  +}

Are you sure the -ENOMEM handling here is solid?  It looks strange.

There are a few more GFP_NOIOs in this function, which can possibly become
GFP_KERNEL.


Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> + wait_event_lock_irq(conf->wait_for_stripe,
>  +!list_empty(&conf->inactive_list),
>  +conf->device_lock,
>  +unplug_slaves(conf->mddev);
>  +);

Boy, that's an ugly-looking thing, isn't it?

__wait_event_lock_irq() already puts a semicolon after `cmd' so I think the
one here isn't needed, which would make it a bit less of a surprise to look
at.
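
The surprise comes from how `cmd' gets spliced in; simplified shape
(not the real header, just the gist):

    #define __wait_event_lock_irq(wq, condition, lock, cmd)  \
    do {                                                     \
            /* ... drop lock ... */                          \
            cmd;                    /* note the ';' here */  \
            /* ... retake lock ... */                        \
    } while (0)

    /* so passing "unplug_slaves(conf->mddev);" as cmd expands to
     * "unplug_slaves(conf->mddev);;" -- legal C, just surprising */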



Re: [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread Andrew Morton
NeilBrown <[EMAIL PROTECTED]> wrote:
>
> +static int resize_stripes(raid5_conf_t *conf, int newsize)
>  +{
>  +/* make all the stripes able to hold 'newsize' devices.
>  + * New slots in each stripe get 'page' set to a new page.
>  + * We allocate all the new stripes first, then if that succeeds,
>  + * copy everything across.
>  + * Finally we add new pages.  This could fail, but we leave
>  + * the stripe cache at its new size, just with some pages empty.
>  + *
>  + * We use GFP_NOIO allocations as IO to the raid5 is blocked
>  + * at some points in this operation.
>  + */
>  +struct stripe_head *osh, *nsh;
>  +struct list_head newstripes, oldstripes;

You can use LIST_HEAD() here, avoid the separate INIT_LIST_HEAD().


>  +struct disk_info *ndisks;
>  +int err = 0;
>  +kmem_cache_t *sc;
>  +int i;
>  +
>  +if (newsize <= conf->pool_size)
>  +return 0; /* never bother to shrink */
>  +
>  +sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
>  +   sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
>  +   0, 0, NULL, NULL);

kmem_cache_create() internally does a GFP_KERNEL allocation.

>  +if (!sc)
>  +return -ENOMEM;
>  +INIT_LIST_HEAD(&newstripes);
>  +for (i = conf->max_nr_stripes; i; i--) {
>  +nsh = kmem_cache_alloc(sc, GFP_NOIO);

So either this can use GFP_KERNEL, or we have a problem.


[PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while array is expanding.

2006-03-16 Thread NeilBrown

We need to allow different stripes to have different effective sizes,
and to use the appropriate size for each.
Also, when a stripe is being expanded, we must block any IO attempts
until the stripe is stable again.

Key elements in this change are:
 - each stripe_head gets a 'disks' field which is part of the key,
   thus there can sometimes be two stripe heads of the same area of
   the array, but covering different numbers of devices.  One of these
   will be marked STRIPE_EXPANDING and so won't accept new requests.
 - conf->expand_progress tracks how the expansion is progressing and
   is used to determine whether the target part of the array has been
   expanded yet or not.
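
Schematically, expand_progress divides the array into a restriped and a
not-yet-restriped region; a sketch of the resulting lookup (paraphrasing
the make_request logic):

    /*   sectors [0, expand_progress)   : already restriped -> new geometry
     *   sectors [expand_progress, end) : untouched so far  -> old geometry
     */
    spin_lock_irq(&conf->device_lock);
    if (logical_sector >= conf->expand_progress)
            disks = conf->previous_raid_disks;      /* old, smaller set */
    else
            disks = conf->raid_disks;               /* new, larger set */
    spin_unlock_irq(&conf->device_lock);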


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |   88 ---
 ./include/linux/raid/raid5.h |6 ++
 2 files changed, 64 insertions(+), 30 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 11:48:56.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 11:48:56.0 +1100
@@ -178,10 +178,10 @@ static int grow_buffers(struct stripe_he
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
+static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
raid5_conf_t *conf = sh->raid_conf;
-   int disks = conf->raid_disks, i;
+   int i;
 
if (atomic_read(&sh->count) != 0)
BUG();
@@ -198,7 +198,9 @@ static void init_stripe(struct stripe_he
sh->pd_idx = pd_idx;
sh->state = 0;
 
-   for (i=disks; i--; ) {
+   sh->disks = disks;
+
+   for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
 
if (dev->toread || dev->towrite || dev->written ||
@@ -215,7 +217,7 @@ static void init_stripe(struct stripe_he
insert_hash(conf, sh);
 }
 
-static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector)
+static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, int disks)
 {
struct stripe_head *sh;
struct hlist_node *hn;
@@ -223,7 +225,7 @@ static struct stripe_head *__find_stripe
CHECK_DEVLOCK();
PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-   if (sh->sector == sector)
+   if (sh->sector == sector && sh->disks == disks)
return sh;
PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
@@ -232,8 +234,8 @@ static struct stripe_head *__find_stripe
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector,
-int pd_idx, int noblock)
+static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
+int pd_idx, int noblock)
 {
struct stripe_head *sh;
 
@@ -245,7 +247,7 @@ static struct stripe_head *get_active_st
wait_event_lock_irq(conf->wait_for_stripe,
conf->quiesce == 0,
conf->device_lock, /* nothing */);
-   sh = __find_stripe(conf, sector);
+   sh = __find_stripe(conf, sector, disks);
if (!sh) {
if (!conf->inactive_blocked)
sh = get_free_stripe(conf);
@@ -263,7 +265,7 @@ static struct stripe_head *get_active_st
);
conf->inactive_blocked = 0;
} else
-   init_stripe(sh, sector, pd_idx);
+   init_stripe(sh, sector, pd_idx, disks);
} else {
if (atomic_read(&sh->count)) {
if (!list_empty(&sh->lru))
@@ -300,6 +302,7 @@ static int grow_one_stripe(raid5_conf_t 
kmem_cache_free(conf->slab_cache, sh);
return 0;
}
+   sh->disks = conf->raid_disks;
/* we just created an active stripe so... */
atomic_set(&sh->count, 1);
atomic_inc(&conf->active_stripes);
@@ -470,7 +473,7 @@ static int raid5_end_read_request(struct
 {
struct stripe_head *sh = bi->bi_private;
raid5_conf_t *conf = sh->raid_conf;
-   int disks = conf->raid_disks, i;
+   int disks = sh->disks, i;
int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
if (bi->bi_size)
@@ -568,7 +571,7 @@ static int raid5_end_write_request (stru
 {
struct stripe_head *sh = bi->bi_private;
raid5_conf_t *conf = sh->raid

[PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape.

2006-03-16 Thread NeilBrown

check_reshape checks validity and does things that can be done
instantly - like adding devices to raid1.
start_reshape initiates a restriping process to convert the whole array.
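
The resulting call sequence in md.c is roughly this (a condensed
paraphrase of the patch, not a verbatim quote):

    static int update_raid_disks(mddev_t *mddev, int raid_disks)
    {
            if (mddev->pers->check_reshape == NULL)
                    return -EINVAL;         /* personality can't reshape */
            if (mddev->sync_thread || mddev->reshape_position != MaxSector)
                    return -EBUSY;

            mddev->delta_disks = raid_disks - mddev->raid_disks;

            /* check_reshape validates, and either applies the change at
             * once (raid1) or goes on to start a background restripe
             * (raid5, via start_reshape) */
            return mddev->pers->check_reshape(mddev);
    }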


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c   |   10 ---
 ./drivers/md/raid1.c|   19 +++--
 ./drivers/md/raid5.c|   60 
 ./include/linux/raid/md_k.h |3 +-
 4 files changed, 58 insertions(+), 34 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-03-17 11:48:58.0 +1100
+++ ./drivers/md/md.c   2006-03-17 11:48:59.0 +1100
@@ -2589,7 +2589,7 @@ static int do_md_run(mddev_t * mddev)
strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
 
if (mddev->reshape_position != MaxSector &&
-   pers->reshape == NULL) {
+   pers->start_reshape == NULL) {
/* This personality cannot handle reshaping... */
mddev->pers = NULL;
module_put(pers->owner);
@@ -3551,14 +3551,16 @@ static int update_raid_disks(mddev_t *md
 {
int rv;
/* change the number of raid disks */
-   if (mddev->pers->reshape == NULL)
+   if (mddev->pers->check_reshape == NULL)
return -EINVAL;
if (raid_disks <= 0 ||
raid_disks >= mddev->max_disks)
return -EINVAL;
-   if (mddev->sync_thread)
+   if (mddev->sync_thread || mddev->reshape_position != MaxSector)
return -EBUSY;
-   rv = mddev->pers->reshape(mddev, raid_disks);
+   mddev->delta_disks = raid_disks - mddev->raid_disks;
+
+   rv = mddev->pers->check_reshape(mddev);
return rv;
 }
 

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~   2006-03-17 11:48:58.0 +1100
+++ ./drivers/md/raid1.c2006-03-17 11:48:59.0 +1100
@@ -1976,7 +1976,7 @@ static int raid1_resize(mddev_t *mddev, 
return 0;
 }
 
-static int raid1_reshape(mddev_t *mddev, int raid_disks)
+static int raid1_reshape(mddev_t *mddev)
 {
/* We need to:
 * 1/ resize the r1bio_pool
@@ -1993,10 +1993,22 @@ static int raid1_reshape(mddev_t *mddev,
struct pool_info *newpoolinfo;
mirror_info_t *newmirrors;
conf_t *conf = mddev_to_conf(mddev);
-   int cnt;
+   int cnt, raid_disks;
 
int d, d2;
 
+   /* Cannot change chunk_size, layout, or level */
+   if (mddev->chunk_size != mddev->new_chunk ||
+   mddev->layout != mddev->new_layout ||
+   mddev->level != mddev->new_level) {
+   mddev->new_chunk = mddev->chunk_size;
+   mddev->new_layout = mddev->layout;
+   mddev->new_level = mddev->level;
+   return -EINVAL;
+   }
+
+   raid_disks = mddev->raid_disks + mddev->delta_disks;
+
if (raid_disks < conf->raid_disks) {
cnt=0;
for (d= 0; d < conf->raid_disks; d++)
@@ -2043,6 +2055,7 @@ static int raid1_reshape(mddev_t *mddev,
 
mddev->degraded += (raid_disks - conf->raid_disks);
conf->raid_disks = mddev->raid_disks = raid_disks;
+   mddev->delta_disks = 0;
 
conf->last_used = 0; /* just make sure it is in-range */
lower_barrier(conf);
@@ -2084,7 +2097,7 @@ static struct mdk_personality raid1_pers
.spare_active   = raid1_spare_active,
.sync_request   = sync_request,
.resize = raid1_resize,
-   .reshape= raid1_reshape,
+   .check_reshape  = raid1_reshape,
.quiesce= raid1_quiesce,
 };
 

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 11:48:58.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 11:48:59.0 +1100
@@ -2574,21 +2574,15 @@ static int raid5_resize(mddev_t *mddev, 
return 0;
 }
 
-static int raid5_reshape(mddev_t *mddev, int raid_disks)
+static int raid5_check_reshape(mddev_t *mddev)
 {
raid5_conf_t *conf = mddev_to_conf(mddev);
int err;
-   mdk_rdev_t *rdev;
-   struct list_head *rtmp;
-   int spares = 0;
-   int added_devices = 0;
 
-   if (mddev->degraded ||
-   test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-   return -EBUSY;
-   if (conf->raid_disks > raid_disks)
-   return -EINVAL; /* Cannot shrink array yet */
-   if (conf->raid_disks == raid_disks)
+   if (mddev->delta_disks < 0 ||
+   mddev->new_level != mddev->level)
+   return -EINVAL; /* Cannot shrink array or change level yet */
+   if (mddev->delta_disks == 0)
return 0; /* nothing to do */
 
/* Can only proceed if there are plenty of stripe_heads.
@@ -2599,30 +2593,48 @@ static int raid5_reshape(mddev_t *mddev,
 * If the chunk size is greater, user-space should request

[PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is easier to grow.

2006-03-16 Thread NeilBrown

Previously the array of disk information was included in the
raid5 'conf' structure which was allocated to an appropriate size.
This makes it awkward to change the size of that array.
So we split it off into a separate kmalloced array which will
require a little extra indexing, but is much easier to grow.
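
In miniature, the change is (illustrative, fields elided):

    struct raid5_private_data {
            /* ... */
    #if 0   /* before: tail allocation, sized once when 'conf' is kzalloced */
            struct disk_info disks[0];
    #else   /* after: separately kzalloced, so it can later be swapped
             * for a bigger array without touching 'conf' itself */
            struct disk_info *disks;
    #endif
    };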


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |   10 +++---
 ./drivers/md/raid6main.c |   10 +++---
 ./include/linux/raid/raid5.h |2 +-
 3 files changed, 15 insertions(+), 7 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 11:45:43.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 11:48:55.0 +1100
@@ -1822,11 +1822,13 @@ static int run(mddev_t *mddev)
return -EIO;
}
 
-   mddev->private = kzalloc(sizeof (raid5_conf_t)
-+ mddev->raid_disks * sizeof(struct disk_info),
-GFP_KERNEL);
+   mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
if ((conf = mddev->private) == NULL)
goto abort;
+   conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+ GFP_KERNEL);
+   if (!conf->disks)
+   goto abort;
 
conf->mddev = mddev;
 
@@ -1966,6 +1968,7 @@ static int run(mddev_t *mddev)
 abort:
if (conf) {
print_raid5_conf(conf);
+   kfree(conf->disks);
kfree(conf->stripe_hashtbl);
kfree(conf);
}
@@ -1986,6 +1989,7 @@ static int stop(mddev_t *mddev)
kfree(conf->stripe_hashtbl);
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
+   kfree(conf->disks);
kfree(conf);
mddev->private = NULL;
return 0;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~   2006-03-17 11:45:43.0 +1100
+++ ./drivers/md/raid6main.c2006-03-17 11:48:55.0 +1100
@@ -2006,11 +2006,14 @@ static int run(mddev_t *mddev)
return -EIO;
}
 
-   mddev->private = kzalloc(sizeof (raid6_conf_t)
-+ mddev->raid_disks * sizeof(struct disk_info),
-GFP_KERNEL);
+   mddev->private = kzalloc(sizeof (raid6_conf_t), GFP_KERNEL);
if ((conf = mddev->private) == NULL)
goto abort;
+   conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+GFP_KERNEL);
+   if (!conf->disks)
+   goto abort;
+
conf->mddev = mddev;
 
if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
@@ -2158,6 +2161,7 @@ abort:
print_raid6_conf(conf);
safe_put_page(conf->spare_page);
kfree(conf->stripe_hashtbl);
+   kfree(conf->disks);
kfree(conf);
}
mddev->private = NULL;

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~   2006-03-17 11:45:43.0 
+1100
+++ ./include/linux/raid/raid5.h2006-03-17 11:48:55.0 +1100
@@ -240,7 +240,7 @@ struct raid5_private_data {
 * waiting for 25% to be free
 */
spinlock_t  device_lock;
-   struct disk_infodisks[0];
+   struct disk_info*disks;
 };
 
 typedef struct raid5_private_data raid5_conf_t;


[PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape

2006-03-16 Thread NeilBrown

We allow the superblock to record an 'old' and a 'new'
geometry, and a position where any conversion is up to.
The geometry allows for changing chunksize, layout and
level as well as number of devices.

When using a version-0.90 superblock, we convert the version
to 0.91 while the conversion is happening so that an old
kernel will refuse to assemble the array.
For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape
and restart the reshape process if needed.
If the reshape stopped at an awkward time (like when updating
the first stripe) we refuse to assemble the array, and
let user-space worry about it.




Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c|   69 -
 ./drivers/md/raid1.c |5 +
 ./drivers/md/raid5.c |  140 ---
 ./include/linux/raid/md.h|2 
 ./include/linux/raid/md_k.h  |8 ++
 ./include/linux/raid/md_p.h  |   32 -
 ./include/linux/raid/raid5.h |1 
 7 files changed, 231 insertions(+), 26 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-03-17 11:48:57.0 +1100
+++ ./drivers/md/md.c   2006-03-17 11:48:58.0 +1100
@@ -659,7 +659,8 @@ static int super_90_load(mdk_rdev_t *rde
}
 
if (sb->major_version != 0 ||
-   sb->minor_version != 90) {
+   sb->minor_version < 90 ||
+   sb->minor_version > 91) {
printk(KERN_WARNING "Bad version number %d.%d on %s\n",
sb->major_version, sb->minor_version,
b);
@@ -744,6 +745,20 @@ static int super_90_validate(mddev_t *md
mddev->bitmap_offset = 0;
mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
 
+   if (mddev->minor_version >= 91) {
+   mddev->reshape_position = sb->reshape_position;
+   mddev->delta_disks = sb->delta_disks;
+   mddev->new_level = sb->new_level;
+   mddev->new_layout = sb->new_layout;
+   mddev->new_chunk = sb->new_chunk;
+   } else {
+   mddev->reshape_position = MaxSector;
+   mddev->delta_disks = 0;
+   mddev->new_level = mddev->level;
+   mddev->new_layout = mddev->layout;
+   mddev->new_chunk = mddev->chunk_size;
+   }
+
if (sb->state & (1md_magic = MD_SB_MAGIC;
sb->major_version = mddev->major_version;
-   sb->minor_version = mddev->minor_version;
sb->patch_version = mddev->patch_version;
sb->gvalid_words  = 0; /* ignored */
memcpy(&sb->set_uuid0, mddev->uuid+0, 4);
@@ -857,6 +871,17 @@ static void super_90_sync(mddev_t *mddev
sb->events_hi = (mddev->events>>32);
sb->events_lo = (u32)mddev->events;
 
+   if (mddev->reshape_position == MaxSector)
+   sb->minor_version = 90;
+   else {
+   sb->minor_version = 91;
+   sb->reshape_position = mddev->reshape_position;
+   sb->new_level = mddev->new_level;
+   sb->delta_disks = mddev->delta_disks;
+   sb->new_layout = mddev->new_layout;
+   sb->new_chunk = mddev->new_chunk;
+   }
+   mddev->minor_version = sb->minor_version;
if (mddev->in_sync)
{
sb->recovery_cp = mddev->recovery_cp;
@@ -1101,6 +1126,20 @@ static int super_1_validate(mddev_t *mdd
}
mddev->bitmap_offset = (__s32)le32_to_cpu(sb->bitmap_offset);
}
+   if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RESHAPE_ACTIVE)) {
+   mddev->reshape_position = le64_to_cpu(sb->reshape_position);
+   mddev->delta_disks = le32_to_cpu(sb->delta_disks);
+   mddev->new_level = le32_to_cpu(sb->new_level);
+   mddev->new_layout = le32_to_cpu(sb->new_layout);
+   mddev->new_chunk = le32_to_cpu(sb->new_chunk)<<9;
+   } else {
+   mddev->reshape_position = MaxSector;
+   mddev->delta_disks = 0;
+   mddev->new_level = mddev->level;
+   mddev->new_layout = mddev->layout;
+   mddev->new_chunk = mddev->chunk_size;
+   }
+
} else if (mddev->pers == NULL) {
/* Insist of good event counter while assembling */
__u64 ev1 = le64_to_cpu(sb->events);
@@ -1172,6 +1211,14 @@ static void super_1_sync(mddev_t *mddev,
sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_of

[PATCH 003 of 13] md: Update status_resync to handle LARGE devices.

2006-03-16 Thread NeilBrown

status_resync - used by /proc/mdstat to report the status of a resync,
assumes that device sizes will always fit into an 'unsigned long'.  This
is no longer the case...
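
The fix scales both operands down before dividing so that the per-mille
arithmetic fits sector_div's constraints; a worked example with made-up
numbers (assuming 64-bit sector_t):

    /* say max_blocks = 3 * 2^40 and the resync is half done */
    u64 max_blocks = 3ULL << 40, resync = max_blocks / 2;
    int scale = 10;                 /* minimum so (resync>>scale)*1000
                                     * still fits in a sector_t */

    while (max_blocks / 2 > (1ULL << (scale + 32)))
            scale++;                /* now max_blocks>>scale fits a u32 */

    u64 res = (resync >> scale) * 1000;         /* 1610612736000 */
    u32 div = (u32)((max_blocks >> scale) + 1); /* 3221225473    */
    /* sector_div(res, div) leaves about 499 in res, i.e. 49.9% */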

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |   30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-03-17 11:48:08.0 +1100
+++ ./drivers/md/md.c   2006-03-17 11:48:30.0 +1100
@@ -4040,7 +4040,10 @@ static void status_unused(struct seq_fil
 
 static void status_resync(struct seq_file *seq, mddev_t * mddev)
 {
-   unsigned long max_blocks, resync, res, dt, db, rt;
+   sector_t max_blocks, resync, res;
+   unsigned long dt, db, rt;
+   int scale;
+   unsigned int per_milli;
 
resync = (mddev->curr_resync - atomic_read(&mddev->recovery_active))/2;
 
@@ -4056,9 +4059,22 @@ static void status_resync(struct seq_fil
MD_BUG();
return;
}
-   res = (resync/1024)*1000/(max_blocks/1024 + 1);
+   /* Pick 'scale' such that (resync>>scale)*1000 will fit
+* in a sector_t, and (max_blocks>>scale) will fit in a
+* u32, as those are the requirements for sector_div.
+* Thus 'scale' must be at least 10
+*/
+   scale = 10;
+   if (sizeof(sector_t) > sizeof(unsigned long)) {
+   while ( max_blocks/2 > (1ULL<<(scale+32)))
+   scale++;
+   }
+   res = (resync>>scale)*1000;
+   sector_div(res, (u32)((max_blocks>>scale)+1));
+
+   per_milli = res;
{
-   int i, x = res/50, y = 20-x;
+   int i, x = per_milli/50, y = 20-x;
seq_printf(seq, "[");
for (i = 0; i < x; i++)
seq_printf(seq, "=");
@@ -4067,10 +4083,12 @@ static void status_resync(struct seq_fil
seq_printf(seq, ".");
seq_printf(seq, "] ");
}
-   seq_printf(seq, " %s =%3lu.%lu%% (%lu/%lu)",
+   seq_printf(seq, " %s =%3u.%u%% (%llu/%llu)",
  (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?
   "resync" : "recovery"),
- res/10, res % 10, resync, max_blocks);
+ per_milli/10, per_milli % 10,
+  (unsigned long long) resync,
+  (unsigned long long) max_blocks);
 
/*
 * We do not want to overflow, so the order of operands and
@@ -4084,7 +4102,7 @@ static void status_resync(struct seq_fil
dt = ((jiffies - mddev->resync_mark) / HZ);
if (!dt) dt++;
db = resync - (mddev->resync_mark_cnt/2);
-   rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;
+   rt = (dt * ((unsigned long)(max_blocks-resync) / (db/100+1)))/100;
 
seq_printf(seq, " finish=%lu.%lumin", rt / 60, (rt % 60)/6);
 


[PATCH 005 of 13] md: Allow stripes to be expanded in preparation for expanding an array.

2006-03-16 Thread NeilBrown

Before a RAID-5 can be expanded, we need to be able to expand the
stripe-cache data structure.  
This requires allocating new stripes in a new kmem_cache.
If this succeeds, we copy cache pages over and release the old
stripes and kmem_cache.
We then allocate new pages.  If that fails, we leave the stripe
cache at its new size.  It isn't worth the effort to shrink
it back again.

Unfortunately this means we need two kmem_cache names as, for a
short period of time, we have two kmem_caches.  So they are
raid5/%s and raid5/%s-alt
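
In outline the function runs in four steps (a compressed paraphrase; the
helper names below are invented for illustration):

    /* 1: all-or-nothing allocation of the bigger stripe_heads */
    if (alloc_new_stripes(&newstripes, newsize) < 0)
            return -ENOMEM;         /* clean failure, nothing changed */

    /* 2: drain the inactive list, moving pages old -> new; while we
     *    hold every stripe the array is effectively quiesced */
    move_pages_across(&oldstripes, &newstripes);

    /* 3: free old stripes and the old kmem_cache; flip active_name so
     *    "raid5/%s" and "raid5/%s-alt" alternate between resizes */
    retire_old_cache(conf, &oldstripes);

    /* 4: best-effort page allocation for the new slots; a failure here
     *    just leaves some pages empty and returns -ENOMEM */
    return fill_new_slots(conf, &newstripes, newsize);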


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |  118 +--
 ./drivers/md/raid6main.c |4 -
 ./include/linux/raid/raid5.h |9 ++-
 3 files changed, 123 insertions(+), 8 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 11:48:55.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 11:48:56.0 +1100
@@ -313,20 +313,130 @@ static int grow_stripes(raid5_conf_t *co
kmem_cache_t *sc;
int devs = conf->raid_disks;
 
-   sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
-
-   sc = kmem_cache_create(conf->cache_name, 
+   sprintf(conf->cache_name[0], "raid5/%s", mdname(conf->mddev));
+   sprintf(conf->cache_name[1], "raid5/%s-alt", mdname(conf->mddev));
+   conf->active_name = 0;
+   sc = kmem_cache_create(conf->cache_name[conf->active_name],
   sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
   0, 0, NULL, NULL);
if (!sc)
return 1;
conf->slab_cache = sc;
+   conf->pool_size = devs;
while (num--) {
if (!grow_one_stripe(conf))
return 1;
}
return 0;
 }
+static int resize_stripes(raid5_conf_t *conf, int newsize)
+{
+   /* make all the stripes able to hold 'newsize' devices.
+* New slots in each stripe get 'page' set to a new page.
+* We allocate all the new stripes first, then if that succeeds,
+* copy everything across.
+* Finally we add new pages.  This could fail, but we leave
+* the stripe cache at its new size, just with some pages empty.
+*
+* We use GFP_NOIO allocations as IO to the raid5 is blocked
+* at some points in this operation.
+*/
+   struct stripe_head *osh, *nsh;
+   struct list_head newstripes, oldstripes;
+   struct disk_info *ndisks;
+   int err = 0;
+   kmem_cache_t *sc;
+   int i;
+
+   if (newsize <= conf->pool_size)
+   return 0; /* never bother to shrink */
+
+   sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+  sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
+  0, 0, NULL, NULL);
+   if (!sc)
+   return -ENOMEM;
+   INIT_LIST_HEAD(&newstripes);
+   for (i = conf->max_nr_stripes; i; i--) {
+   nsh = kmem_cache_alloc(sc, GFP_NOIO);
+   if (!nsh)
+   break;
+
+   memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
+
+   nsh->raid_conf = conf;
+   spin_lock_init(&nsh->lock);
+
+   list_add(&nsh->lru, &newstripes);
+   }
+   if (i) {
+   /* didn't get enough, give up */
+   while (!list_empty(&newstripes)) {
+   nsh = list_entry(newstripes.next, struct stripe_head, lru);
+   list_del(&nsh->lru);
+   kmem_cache_free(sc, nsh);
+   }
+   kmem_cache_destroy(sc);
+   return -ENOMEM;
+   }
+   /* OK, we have enough stripes, start collecting inactive
+* stripes and copying them over
+*/
+   INIT_LIST_HEAD(&oldstripes);
+   list_for_each_entry(nsh, &newstripes, lru) {
+   spin_lock_irq(&conf->device_lock);
+   wait_event_lock_irq(conf->wait_for_stripe,
+   !list_empty(&conf->inactive_list),
+   conf->device_lock,
+   unplug_slaves(conf->mddev);
+   );
+   osh = get_free_stripe(conf);
+   spin_unlock_irq(&conf->device_lock);
+   atomic_set(&nsh->count, 1);
+   for(i=0; i<conf->pool_size; i++)
+   nsh->dev[i].page = osh->dev[i].page;
+   for( ; i<newsize; i++)
+   nsh->dev[i].page = NULL;
+   list_add(&osh->lru, &oldstripes);
+   }
+   /* Got them all.
+* Return the new ones and free the old ones.
+* At this point, we are holding all the stripes so the array
+* is completely stalled, so now is a good time to resize
+* conf->disks.
+*/
+   ndisks = kza

[PATCH 008 of 13] md: Final stages of raid5 expand code.

2006-03-16 Thread NeilBrown

This patch adds raid5_reshape and end_reshape which will
start and finish the reshape processes.

raid5_reshape is only enabled if CONFIG_MD_RAID5_RESHAPE is set,
to discourage accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry,
and make sure that you have backups, just in case.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/Kconfig  |   26 ++
 ./drivers/md/md.c |7 +-
 ./drivers/md/raid5.c  |  117 ++
 ./include/linux/raid/md.h |3 -
 4 files changed, 149 insertions(+), 4 deletions(-)

diff ./drivers/md/Kconfig~current~ ./drivers/md/Kconfig
--- ./drivers/md/Kconfig~current~   2006-03-17 11:45:43.0 +1100
+++ ./drivers/md/Kconfig2006-03-17 11:48:57.0 +1100
@@ -127,6 +127,32 @@ config MD_RAID5
 
  If unsure, say Y.
 
+config MD_RAID5_RESHAPE
+   bool "Support adding drives to a raid-5 array (experimental)"
+   depends on MD_RAID5 && EXPERIMENTAL
+   ---help---
+ A RAID-5 set can be expanded by adding extra drives. This
+ requires "restriping" the array which means (almost) every
+ block must be written to a different place.
+
+  This option allows such restriping to be done while the array
+ is online.  However it is still EXPERIMENTAL code.  It should
+ work, but please be sure that you have backups.
+
+ You will need a version of mdadm newer than 2.3.1.   During the
+ early stage of reshape there is a critical section where live data
+ is being over-written.  A crash during this time needs extra care
+ for recovery.  The newer mdadm takes a copy of the data in the
+ critical section and will restore it, if necessary, after a crash.
+
+ The mdadm usage is e.g.
+  mdadm --grow /dev/md1 --raid-disks=6
+ to grow '/dev/md1' to having 6 disks.
+
+ Note: The array can only be expanded, not contracted.
+ There should be enough spares already present to make the new
+ array workable.
+
 config MD_RAID6
tristate "RAID-6 mode"
depends on BLK_DEV_MD

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-03-17 11:48:57.0 +1100
+++ ./drivers/md/md.c   2006-03-17 11:48:57.0 +1100
@@ -159,12 +159,12 @@ static int start_readonly;
  */
 static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
 static atomic_t md_event_count;
-static void md_new_event(mddev_t *mddev)
+void md_new_event(mddev_t *mddev)
 {
atomic_inc(&md_event_count);
wake_up(&md_event_waiters);
 }
-
+EXPORT_SYMBOL_GPL(md_new_event);
 /*
  * Enables to iterate over all existing md arrays
  * all_mddevs_lock protects this list.
@@ -4463,7 +4463,7 @@ static DECLARE_WAIT_QUEUE_HEAD(resync_wa
 
 #define SYNC_MARKS 10
 #defineSYNC_MARK_STEP  (3*HZ)
-static void md_do_sync(mddev_t *mddev)
+void md_do_sync(mddev_t *mddev)
 {
mddev_t *mddev2;
unsigned int currspeed = 0,
@@ -4700,6 +4700,7 @@ static void md_do_sync(mddev_t *mddev)
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
md_wakeup_thread(mddev->thread);
 }
+EXPORT_SYMBOL_GPL(md_do_sync);
 
 
 /*

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-03-17 11:48:57.0 +1100
+++ ./drivers/md/raid5.c2006-03-17 11:48:57.0 +1100
@@ -1021,6 +1021,8 @@ static int add_stripe_bio(struct stripe_
return 0;
 }
 
+static void end_reshape(raid5_conf_t *conf);
+
 int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 {
int sectors_per_chunk = conf->chunk_size >> 9;
@@ -1831,6 +1833,10 @@ static sector_t sync_request(mddev_t *md
if (sector_nr >= max_sector) {
/* just being told to finish up .. nothing much to do */
unplug_slaves(mddev);
+   if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+   end_reshape(conf);
+   return 0;
+   }
 
if (mddev->curr_resync < max_sector) /* aborted */
bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
@@ -2451,6 +2457,114 @@ static int raid5_resize(mddev_t *mddev, 
return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+   raid5_conf_t *conf = mddev_to_conf(mddev);
+   int err;
+   mdk_rdev_t *rdev;
+   struct list_head *rtmp;
+   int spares = 0;
+   int added_devices = 0;
+
+   if (mddev->degraded ||
+   test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+   return -EBUSY;
+   if (conf->raid_disks > raid_disks)
+   return -EINVAL; /* Cannot shrink array yet */
+   if (conf->raid_disks == raid_disks)
+   return 0; /* nothing to do */
+
+   /* Can only proceed if there are plenty of stripe_heads.
+* We need a m

[PATCH 000 of 13] md: Introduction

2006-03-16 Thread NeilBrown

Following are 13 patches for md in 2.6.last (created against 2.6.16-rc6-mm1).
They are NOT appropriate for 2.6.16 but should be OK for .17.  Please include
them in -mm.

The first three are assorted bug fixes, none really serious

The remainder implement raid5 reshaping.  Currently the only shape change
that is supported is adding a device, but it is envisioned that
changing the chunksize and layout will also be supported, as well
as changing the level (e.g. 1->5, 5->6).

The reshape process naturally has to move all of the data in the
array, and so should be used with caution.  It is believed to work,
and some testing does support this, but wider testing would be great
for increasing my confidence.

You will need a version of mdadm newer than 2.3.1 to make use of raid5
growth.  This is because mdadm needs to take a copy of a 'critical
section' at the start of the array in case there is a crash at an
awkward moment.  On restart, mdadm will restore the critical section
and allow reshape to continue.

I hope to release a 2.4-pre by early next week - it still needs a
little more polishing.

NeilBrown


 [PATCH 001 of 13] md: Add '4' to the list of levels for which bitmaps are 
supported.
 [PATCH 002 of 13] md: Fix the 'failed' count for version-0 superblocks.
 [PATCH 003 of 13] md: Update status_resync to handle LARGE devices.

 [PATCH 004 of 13] md: Split disks array out of raid5 conf structure so it is 
easier to grow.
 [PATCH 005 of 13] md: Allow stripes to be expanded in preparation for 
expanding an array.
 [PATCH 006 of 13] md: Infrastructure to allow normal IO to continue while 
array is expanding.
 [PATCH 007 of 13] md: Core of raid5 resize process
 [PATCH 008 of 13] md: Final stages of raid5 expand code.
 [PATCH 009 of 13] md: Checkpoint and allow restart of raid5 reshape
 [PATCH 010 of 13] md: Only checkpoint expansion progress occasionally.
 [PATCH 011 of 13] md: Split reshape handler in check_reshape and start_reshape.
 [PATCH 012 of 13] md: Make 'reshape' a possible sync_action action.
 [PATCH 013 of 13] md: Support suspending of IO to regions of an md array.


Re: A random initramfs script

2006-03-16 Thread Andre Noll
On 00:41, Nix wrote:

> > So I downloaded iproute2-2.4.7-now-ss020116-try.tar.gz, but there
> > seems to be a problem with errno.h:
> 
> Holy meatballs that's ancient.

It is the most recent version on the ftp server mentioned in the HOWTO.

> Try 
> 
> for a rather newer and more capable copy. :)

Much better. Thanks. This version works fine for me, just like busybox
ip does.

> Yes, the initial population of /dev is done by firing messages at udevd
> *from a shell script*. It's gone all the way from devfs's kernel-space
> hardwiring to something sufficiently extensible that a shell script can
> do all the neessary stuff to populate /dev :)

Yeah, Linux rulez :)

> [uClibc]
> 
> Alternatively, just suck down GCC from, say, 
> svn://gcc.gnu.org/svn/gcc/tags/gcc_3_4_5_release,
> or ftp.gnu.org, or somewhere, and point buildroot at that.

Yep, there's a 'dl' directory which contains all downloads. One can
download the tarballs from anywhere else to that directory. Seems to
work now.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe


Re: A random initramfs script

2006-03-16 Thread Andre Noll
On 15:23, Neil Brown wrote:

> You shouldn't need portmap to mount an NFS filesystem unless you
> enable locking,

That's news to me, thanks for pointing it out. But I do need portmap
for mounting an NFS filesystem read-only (/usr, which contains
portmap). Is that correct?

> > He likes to compare the situation with /etc/fstab. Nobody complains
> > about having to edit /etc/fstab, so why keep people complaining about
> > having to edit /etc/mdadm.conf?
> 
> Indeed!  And if you plug in some devices off another machine for
> disaster recovery, you don't want another disaster because you
> assembled the wrong arrays.

How is such a disaster possible, given each md device contains an
ID for the array it belongs to? But yes, it is certainly a good
idea to double-check everything before assembling the array in such
a recovery situation.

> I would like an md superblock to be able to contain some indication of
> the 'name' of the machine which is meant to host the array, so that
> once a machine knows its own name, it can automatically find and mount
> its own arrays, but that isn't near the top of my list of priorities
> yet.

How about a user-defined name?

mdadm --create --name the_extra_noisy_array /dev/md0 --level...

would use some fixed algorithm to compute a usual UUID for the new
array from the string "the_extra_noisy_array", and

mdadm --assemble /dev/md0 --name the_extra_noisy_array

could use the same algorithm and take into account only those devices
which have a UUID equal to the computed one.
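
Such a fixed algorithm could be as simple as hashing the name down to
128 bits; a hypothetical userspace sketch (mdadm does not actually do
this):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical: derive a 128-bit "UUID" from an array name using
     * FNV-1a with four per-word seeds.  Deterministic, so --create and
     * --assemble would compute the same value from the same name. */
    static void name_to_uuid(const char *name, uint32_t uuid[4])
    {
            int i;
            for (i = 0; i < 4; i++) {
                    uint32_t h = 2166136261u + i;   /* per-word seed */
                    const char *p;
                    for (p = name; *p; p++) {
                            h ^= (unsigned char)*p;
                            h *= 16777619u;
                    }
                    uuid[i] = h;
            }
    }

    int main(void)
    {
            uint32_t u[4];
            name_to_uuid("the_extra_noisy_array", u);
            printf("%08x:%08x:%08x:%08x\n", u[0], u[1], u[2], u[3]);
            return 0;
    }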

Just a thought.
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe


Re: A random initramfs script

2006-03-16 Thread Nix
On Fri, 17 Mar 2006, Andre Noll stated:
> On 07:50, Nix wrote:
>> If / was a ramfs (as rootfs is), you'd run out of memory...
> 
> Yes, it's an additional piece of rope, and I already used it to shoot
> myself in the foot by doing a backup with "rsync -a /home /mnt" without
> mounting /mnt. First the machine went slow, then the OOM killer kicked
> in and killed everything. Finally the system was totally unresponsive
> and I had to use the "so everything is unusual - boot" thing.

Oops!

> But only root can write to /mnt, and there are much simpler ways for
> root to halt the system..

True. I guess I'm a traditionalist: I'd like / to be a real filesystem
if possible. Both ways work: TMTOWTDI :)

>> Well, there's some extra stuff, but it's mostly on the iptables side:
>> the advanced routing has mostly been stable since not just 2.4 but 2.2!
> 
> So I downloaded iproute2-2.4.7-now-ss020116-try.tar.gz, but there
> seems to be a problem with errno.h:

Holy meatballs that's ancient.

Try 

for a rather newer and more capable copy. :)

>> You don't need an mdev.conf at all; by default mdev creates a /dev with
>> the KERNEL= names. All it's needed for is putting things in strange
>> places or fiddling permissions, and that's not necessary for a boot
>> initramfs :)
> 
> Nice, and works like a charm. I just removed udev* from the initramfs.

For those places that are still using udev, if you're running 2.6.15+,
you can soon ditch udevstart, as distros are doing the moral equivalent
of this nowadays:

,
| udevd --daemon
|
| list=$(echo /sys/bus/*/devices/*/uevent /sys/class/*/*/uevent)
| list=$(echo $list /sys/block/*/uevent /sys/block/*/*/uevent)
| 
| for file in $list; do
|   case "$file" in
| */device/uevent) ;;# skip followed device symlinks
| */\*/*) ;;
| *" "*) ;;
| 
| */class/mem/*) # for /dev/null
|   first="$first $file" ;;
| 
| */block/md[0-9]*)
|   last="$last $file" ;;
| 
| *)
|   default="$default $file" ;;
|   esac
| done
| 
| for file in $first $default $last; do
|   echo 'add' > $file
| done
`

Yes, the initial population of /dev is done by firing messages at udevd
*from a shell script*. It's gone all the way from devfs's kernel-space
hardwiring to something sufficiently extensible that a shell script can
do all the necessary stuff to populate /dev :)

(This is also much faster than udevstart; <1s to populate a crowded /dev
on my P233...)

>> (I'd recommend managing the *real* /dev with udev, still; it's vastly
>> more flexible... 
> 
> Yes, and it's needed for hotplugable devices anyway.

mdev can handle hotplugging: leave the -s off. But yes, udev is probably
preferable on non-space-constrained systems, for much the same reason
that glibc is preferable to uClibc in those situations.

>> but of course it's also about fifty times larger at a
>> horrifying 50K plus 70K of rules...
>  
> No need for such a huge rules file:
> 
> # find  /etc/udev/ -type f -printf '%f %s\n'
> udev.conf 768
> udev.rules 5200
> scsi-model.sh 1326
> ide-model.sh 1201

Er, true. Actually I goofed: I du -s'd the directory and forgot that I
was keeping it in subversion... I'm actually using 10K of rules. Not
a very accurate estimate on my part, that.

[uClibc]
>> Yep. Of course the SVN release has broken binary compatibility, so you
>> need to rebuild everything that depends on it (probably the
>> cross-toolchain too, for safety). I scripted this long ago, of course,
>> because it's a bit annoying otherwise...
> 
> I tried to build the cross-compilation toolchain with Buildroot,
> but it didn't even start building because it couldn't download gcc
> from mirrors.rcn.net which appears to be down ATM. Isn't it possible
> to change the gcc mirror? I did not find a config option for that.

I wouldn't know: I don't use buildroot. Bugging the buildroot author
(who also happens to be the uClibc author) might be a good idea.

Alternatively, just suck down GCC from, say, 
svn://gcc.gnu.org/svn/gcc/tags/gcc_3_4_5_release,
or ftp.gnu.org, or somewhere, and point buildroot at that. (I hope
you can do that: like I say, I've never done more than suck buildroot
down to pull the uClibc patches out of it and apply them to the
dev trees of binutils and GCC that I already had...)
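
That is, something like:

  svn co svn://gcc.gnu.org/svn/gcc/tags/gcc_3_4_5_release gcc-3.4.5

and then feed the result to buildroot in whatever form it expects its
sources; I can't vouch for the details, not being a buildroot user.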

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'


Re: A random initramfs script

2006-03-16 Thread Andre Noll
On 07:50, Nix wrote:

> I suppose that if *every* filesystem hanging off / is its own
> fs, using rootfs as your / is not inefficient because there's
> still nothing in it.
> 
> But it still makes me worry: what if some mad script makes a
> huge file in /? It's happened to me a couple of times, and
> because /var was on a different fs, all that happened was that
> / filled up and nothing bad resulted.

> If / was a ramfs (as rootfs is), you'd run out of memory...

Yes, it's an additional piece of rope, and I already used it to shoot
myself in the foot by doing a backup with "rsync -a /home /mnt" without
mounting /mnt. First the machine went slow, then the OOM killer kicked
in and killed everything. Finally the system was totally unresponsive
and I had to use the "so everything is unusual - boot" SysRq sequence
(Alt-SysRq S, E, I, U, B).

But only root can write to /mnt, and there are much simpler ways for
root to halt the system..

> [ip(8)]
> >>  describes many of its myriad extra features in more
> >> detail.
> > 
> > All that stuff seems to be fairly old, linux-2.6. isn't mentioned at
> > all and the cvs server doesn't work. Is it still up to date?
> 
> Well, there's some extra stuff, but it's mostly on the iptables side:
> the advanced routing has mostly been stable since not just 2.4 but 2.2!

So I downloaded iproute2-2.4.7-now-ss020116-try.tar.gz, but there
seems to be a problem with errno.h:

make[1]: Entering directory `/home/work/install/src/iproute2/lib'
gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -g -I../include-glibc -I/usr/include/db3 -include ../include-glibc/glibc-bugs.h -I/home/install/w/linux/stable/include -I../include -DRESOLVE_HOSTNAMES -c -o libnetlink.o libnetlink.c
distcc[13445] ERROR: compile /home/install/w/var/ccache/libnetlink.tmp.p133.13441.i on p133 failed
libnetlink.c: In function `rtnl_dump_filter':
libnetlink.c:149: error: `EINTR' undeclared (first use in this function)
libnetlink.c:149: error: (Each undeclared identifier is reported only once
libnetlink.c:149: error: for each function it appears in.)
libnetlink.c: In function `rtnl_talk':
libnetlink.c:248: error: `EINTR' undeclared (first use in this function)
libnetlink.c: In function `rtnl_listen':
libnetlink.c:350: error: `EINTR' undeclared (first use in this function)
libnetlink.c: In function `rtnl_from_file':
libnetlink.c:416: error: `EINTR' undeclared (first use in this function)
make[1]: *** [libnetlink.o] Error 1
make[1]: Leaving directory `/home/work/install/src/iproute2/lib'
make: *** [all] Error 2

> >> >> mdev is `micro-udev', a 255-line tiny replacement for udev. It's part of
> >> >> busybox.
> >> > 
> >> > Cool. Guess I'll have to update busybox..
> > 
> > done. The new busybox (from SVN) seems to work fine, just like the
> > old one did. The init script doesn't use mdev yet, but from a first
> > reading this is just a matter of translating /etc/udev/udev.conf
> > to the mdev syntax.
> 
> You don't need an mdev.conf at all; by default mdev creates a /dev with
> the KERNEL= names. All it's needed for is putting things in strange
> places or fiddling permissions, and that's not necessary for a boot
> initramfs :)

Nice, and works like a charm. I just removed udev* from the initramfs.
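
FWIW, the interesting part of the init script now boils down to
roughly this (a trimmed sketch, not the verbatim script; the array
and root device names are specific to my box):

  #!/bin/sh
  mount -t proc none /proc
  mount -t sysfs none /sys
  mdev -s                          # populate /dev from /sys
  mdadm --assemble --scan          # bring up the md arrays
  mount -o ro /dev/md0 /newroot
  umount /sys /proc
  exec switch_root /newroot /sbin/init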

> (I'd recommend managing the *real* /dev with udev, still; it's vastly
> more flexible... 

Yes, and it's needed for hotplugable devices anyway.

> but of course it's also about fifty times larger at a
> horrifying 50K plus 70K of rules...
 
No need for such a huge rules file:

# find  /etc/udev/ -type f -printf '%f %s\n'
udev.conf 768
udev.rules 5200
scsi-model.sh 1326
ide-model.sh 1201

> >> You need SVN uClibc too (if you're using uClibc rather than glibc);
> >> older versions don't maintain the d_type field in the struct dirent, so
> >> mdev's scanning of /sys gets very confused and you end up with an empty
> >> /dev.
> > 
> > Damn. I just compiled 0.9.28. Guess this one is too old.
> 
> Yep. Of course the SVN release has broken binary compatibility, so you
> need to rebuild everything that depends on it (probably the
> cross-toolchain too, for safety). I scripted this long ago, of course,
> because it's a bit annoying otherwise...

I tried to build the cross-compilation toolchain with Buildroot,
but it didn't even start building because it couldn't download gcc
from mirrors.rcn.net which appears to be down ATM. Isn't it possible
to change the gcc mirror? I did not find a config option for that.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe




Re: Two disk failure in RAID5 during resync, wrong superblocks

2006-03-16 Thread Frank Blendinger
Hi again,

I've just seen that I still had a "wrong superblock" in the subject of my
mail. Please just ignore it; I fixed that while writing the last mail and
forgot to remove it. :)


Greets,
Frank




Re: Neil, where are your Patches !

2006-03-16 Thread Neil Brown
On Wednesday March 15, [EMAIL PROTECTED] wrote:
> Hi All,
> 
> I dug the net over and over, but no luck. I could not find any place
> where current patches to md are stored, like the raid5 resize stuff...
> Yes, I know, I could copy/paste them from the mail archives, but that's
> crappy (especially with gmail), and I'm sure I'd skip some...
> 
> So, where are they hidding ? I can't believe that an efficient
> organization such as http://cgi.cse.unsw.edu.au/~neilb/patches/ is not
> available anymore...

Hmmm... I thought they appeared somewhere automatically, but they
don't seem to, do they... I'll try to sort something out later today.

> 
> Or maybe there's a git tree somewhere ?

No, I'm not a git convert yet.  I have a little script for managing
patches which works quite nicely and does all that I need (except
export them...)

NeilBrown


Re: No syncing after crash. Is this a software raid bug?

2006-03-16 Thread Heinz Mauelshagen
On Thu, Mar 16, 2006 at 08:24:52AM +0100, Kasper Dupont wrote:
> On 10/03/06 08.49, Kasper Dupont wrote:
> > On 10/03/06 08.43, Heinz Mauelshagen wrote:
> > > On Tue, Mar 07, 2006 at 12:18:35PM +0100, Kasper Dupont wrote:
> > > > 
> > > > OK, I'll do some testing with raid5 then. I want to know if it
> > > > behaves differently from raid1 in this respect.
> > > 
> > > Have you got results yet ?
> > 
> > No, not yet. I'll tell you as soon as I have some.
> 
> OK now I have some results. I could not reproduce the symptoms
> with raid5.

Good and as expected.

> As long as this only shows up with raid1, there is
> not much reason to worry.

Alright.

Regards,
Heinz

> 
> -- 
> Kasper Dupont -- Rigtige mænd skriver deres egne backupprogrammer
> #define _(_)"d.%.4s%."_"2s" /* This is my new email address */
> char*_="@2kaspner"_()"%03"_("4s%.")"t\n";printf(_+11,_+6,_,6,_+2,_+7,_+6);

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer   Am Sonnenhang 11
Storage Development   56242 Marienrachdorf
  Germany
[EMAIL PROTECTED]PHONE +49  171 7803392
  FAX   +49 2626 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


Two disk failure in RAID5 during resync, wrong superblocks

2006-03-16 Thread Frank Blendinger
Hi all,

I have just added the missing fourth disk to my RAID5 and waited for the
resync to finish. This morning I had to see this in my /proc/mdstat:

md2 : active raid5 hde1[4] hdg1[5](F) hdk1[2] hdi1[1]
  730948992 blocks level 5, 64k chunk, algorithm 2 [4/2] [_UU_]

hde is the added fourth disk the array was syncing to and hdg seems to
have failed during this. From my syslog from yesterday:

Mar 14 21:09:17 localhost kernel: [  717.345236] md: bind
Mar 14 21:09:17 localhost kernel: [  717.633915] raid5: device hdg1 operational as raid disk 0
Mar 14 21:09:17 localhost kernel: [  717.687884]  disk 0, o:1, dev:hdg1
Mar 14 22:29:19 localhost kernel: [ 5529.025214] hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 14 22:29:19 localhost kernel: [ 5529.025242] hdg: dma_intr: error=0x84 { DriveStatusError BadCRC }
Mar 14 22:29:19 localhost kernel: [ 5529.180163] hdg: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 14 22:29:19 localhost kernel: [ 5529.180187] hdg: dma_intr: error=0x84 { DriveStatusError BadCRC }
[...]
Mar 14 23:27:15 localhost kernel: [ 9006.977466] PDC202XX: Secondary channel reset.
Mar 14 23:27:15 localhost kernel: [ 9009.060757] PDC202XX: Primary channel reset.
Mar 14 23:27:15 localhost kernel: [ 9009.061102] ide3: reset: master: error (0x00?)

I guess the disk was then kicked out of the array.


So I'm left with two working disks (hdk and hdi), one probably broken
disk (hdg) with valuable data on it and one disk (hde) with not enough
information on it to assemble the array.

I think that leaves me two options:

1) I'll try to reboot and force the array to be assembled with the
   broken hdg, add hde and pray that a resync will finish.

2) I'll dd_rescue hdg to hde and create the array with hde, hdk and hdi.
   Then add hdg and see if a resync works.
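
In mdadm terms I suppose those would look roughly like this (untested,
and the device order in 2) is only my guess from the syslog, where
hdg1 was raid disk 0):

  # option 1: force assembly with the failed disk, then re-add hde
  mdadm --assemble --force /dev/md2 /dev/hdg1 /dev/hdi1 /dev/hdk1
  mdadm /dev/md2 --add /dev/hde1

  # option 2: clone hdg onto hde, re-create the array around the clone
  dd_rescue /dev/hdg1 /dev/hde1
  mdadm --create /dev/md2 --level=5 --chunk=64 --raid-devices=4 \
        /dev/hde1 /dev/hdi1 /dev/hdk1 missing
  mdadm /dev/md2 --add /dev/hdg1

Getting the order or chunk size wrong in 2) would of course destroy
the data for good, which is why I'd rather ask first.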

What would you suggest me to do? Is there maybe a better approach that I
have missed? Any hints on how to force mdadm to assemble the array with
the faulty hdg?


Thanks in advance,
Frank




Re: Bitmaps & Kernel Versions

2006-03-16 Thread Laurent CARON

Luca Berra wrote:
> On Wed, Mar 15, 2006 at 09:08:17PM +1100, Neil Brown wrote:
>> On Wednesday March 15, [EMAIL PROTECTED] wrote:
>>> Hi,
>>>
>>> I'm planning to use bitmaps on some of our RAID1 arrays.
>>>
>>> I'm wondering how bitmaps are handled by older kernels.
>>>
>>> Eg: I create a raid array with a bitmap under a 2.6.15 kernel.
>>> I now want to boot under 2.6.12, or even 2.4.
>>>
>>> How is it handled?
>>> Will it work even if this is my / partition?
>>
>> An older kernel will not notice the bitmap and will behave
>> 'normally'.
>
> Strange; last time I tried, an older kernel would refuse to activate
> an md with a bitmap on it. I am far from home on a business trip and
> I don't have kernel sources at hand, but I seem to remember that the
> kernel was very strict about the bitmap feature bit in the
> superblock.
>
> L.

I experienced the same strange behavior.

The bitmap was created on 2.6.15; I tried to boot 2.6.14 and /dev/md0
was not started :$.

Strange.
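
For what it's worth, the workaround this suggests is to drop the
bitmap before booting the older kernel and re-create it afterwards.
A hedged sketch with mdadm 2.x, assuming /dev/md0 carries an internal
bitmap:

  mdadm --grow --bitmap=none /dev/md0       # remove the bitmap first
  # ... boot the older kernel, do what needs doing, come back ...
  mdadm --grow --bitmap=internal /dev/md0   # re-create it under 2.6.15

That way the superblock never advertises a feature the old kernel
cannot parse.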