Re: optimal IO scheduler choice?

2007-12-13 Thread Justin Piszcz



On Thu, 13 Dec 2007, Louis-David Mitterrand wrote:


Hi,

after reading some interesting suggestions on kernel tuning at:

http://hep.kbfi.ee/index.php/IT/KernelTuning

I am wondering whether 'deadline' is indeed the best IO scheduler (vs.
anticipatory and cfq) for a soft raid5/6 partition on a server?

What is the common wisdom on the subject among linux-raid users and
developers?

Thanks,
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



I have found anticipatory to be the fastest.

http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline_vs_noop.html

Sequential:
Output of CFQ (horrid): 311,683 KiB/s
Output of AS:           443,103 KiB/s

For input, CFQ is a little faster.

It depends on your workload I suppose.
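As a quick sanity check on those numbers (plain Python; the figures are copied from the benchmark page above):

```python
# Sequential-output throughput from the benchmark above (KiB/s).
cfq = 311_683
anticipatory = 443_103

# Relative speedup of the anticipatory scheduler over CFQ.
speedup = (anticipatory - cfq) / cfq
print(f"AS is {speedup:.0%} faster than CFQ for sequential output")
```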

Justin.



Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-13 Thread Bill Davidsen

Tejun Heo wrote:

Bill Davidsen wrote:
  

Jan Engelhardt wrote:


On Dec 1 2007 06:26, Justin Piszcz wrote:
  

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)

  

Do you not test your drives for minimum functionality before using them?



I personally don't.

  

Also, if you have the tools to check for relocated sectors before and
after doing this, that's a good idea as well. S.M.A.R.T is your friend.
And when writing /dev/zero to a drive, if it craps out you have less
emotional attachment to the data.



Writing all zeros isn't too useful though.  A drive failing reallocation on
write is a catastrophic failure: it means the drive wants to relocate a
sector but can't, because it has used up all its spare space, which usually
indicates something else is seriously wrong with the drive.  The drive will
have to go to the trash can.  This is all serious and bad, but the catch is
that in such cases the problem usually sticks out like a sore thumb, so
either the vendor doesn't ship such a drive or you'll find the failure very
early.  I personally haven't seen any such failure yet.  Maybe I'm lucky.
  


The problem is usually not with what the vendor ships, but what the 
carrier delivers. Bad handling does happen, drop ship can have several 
meanings, and I have received shipments with the G sensor in the case 
triggered. Zero is a handy source of data, but the important thing is to 
look at the relocated sector count before and after the write. If there 
are a lot of bad sectors initially, the drive is probably a poor choice 
for anything critical.

Most data loss occurs when the drive fails to read what it thought it
wrote successfully, not the opposite - reading and dumping the whole disk
to /dev/null periodically is probably much better than writing zeros, as
it allows the drive to detect a deteriorating sector early, while it's
still readable, and relocate it.  But then again, I think that's overkill.

Writing zeros to sectors is more useful as a cure than as prevention.
If your drive fails to read a sector, write any value to that sector.
The drive will forget about the data on the damaged sector, reallocate
it, and write the new data there.  Of course, you lose the data which was
originally on the sector.
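The rewrite-the-bad-sector cure can be sketched like this (a hedged illustration: the sector number is invented, and an ordinary file stands in for the real block device, which would be something like /dev/sdX opened with O_DIRECT):

```python
import os

SECTOR_SIZE = 512
bad_lba = 12345  # hypothetical unreadable sector number

# Stand-in for the block device; on real hardware use the device node.
dev = os.open("disk.img", os.O_RDWR | os.O_CREAT)

# Overwriting the unreadable sector makes the drive forget the old data
# and transparently reallocate the sector if it needs to.
os.lseek(dev, bad_lba * SECTOR_SIZE, os.SEEK_SET)
os.write(dev, b"\x00" * SECTOR_SIZE)
os.fsync(dev)
os.close(dev)
```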

I personally think it's enough to just throw in an extra disk and make
it RAID1 or 5, and rebuild the array if a read fails on one of the disks.
If writes fail or read failures continue, replace the disk.  Of course, if
you wanna be extra cautious, good for you.  :-)

  

--
Bill Davidsen [EMAIL PROTECTED]
 "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." - Otto von Bismarck





Re: mdadm break / restore soft mirror

2007-12-13 Thread Bill Davidsen

Brett Maton wrote:
Hi,

  Question for you guys.

  A brief history:
  RHEL 4 AS
  I have a partition with way too many small files on it (usually around a
  couple of million) that needs to be backed up; standard methods mean that
  a restore is impossibly slow due to the sheer volume of files.
  Solution: raw backup/restore of the device.  However, the partition is
  permanently being accessed.

  Proposed solution is to use a software RAID mirror.  Before the backup
  starts, break the soft mirror, unmount, and back up the partition; then
  restore the soft mirror and let it resync/rebuild itself.

  Would the above intentional break/fix of the mirror cause any problems?
  


Probably. If by "accessed" you mean read-only, you can do this, but if
the data is changing you have a serious problem: the data on the disk
plus the data still queued in memory may leave the on-disk part an
inconsistent data set. Short of stopping the updates in a known valid
state, I've not seen a method of backing up a changing set of files
that really works in all cases.


DM has some snapshot capabilities, but in fact they have the same
limitation: the data on a partition can be backed up, but unless you can
ensure that the data is in a consistent state when it's frozen, your
backup will have some small possibility of failure. Database programs
have ways to freeze their data to do backups, but if an application
doesn't have a means to force the on-disk data into a valid state, it
will only be a pretty good backup.


I suggest looking at things like rsync, which will not solve the
changing-data problem, but may do the backup quickly enough to be as
useful as what you propose. Of course, a full backup is likely to take a
long time however you do it.


--
Bill Davidsen [EMAIL PROTECTED]
 "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." - Otto von Bismarck





Re: Auto assembly errors with mdadm and 64K aligned partitions.

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:
 Good morning to Neil and everyone on the list, hope your respective
 days are going well.
 
 Quick overview.  We've isolated what appears to be a failure mode with
 mdadm assembling RAID1 (and presumably other-level) volumes which
 kernel-based RAID autostart is able to assemble correctly.
 
 We picked up on the problem with OES based systems with SAN attached
 volumes.  I am able to reproduce the problem under 2.6.23.9 UML with
 version 2.6.4 of mdadm.
 
 The problem occurs when partitions are aligned on a 64K boundary.  Any
 64K boundary seems to trigger it, i.e. 128, 256 and 512 sector offsets.
 
 Block devices look like the following:
 
 ---
 cat /proc/partitions:
 
 major minor  #blocks  name
 
    98     0     262144 ubda
    98    16      10240 ubdb
    98    17      10176 ubdb1
    98    32      10240 ubdc
    98    33      10176 ubdc1
 ---
 
 
 A RAID1 device was created and started consisting of the /dev/ubdb1
 and /dev/ubdc1 partitions.  An /etc/mdadm.conf file was generated
 which contains the following:
 
 ---
 DEVICE partitions
 ARRAY /dev/md0 level=raid1 num-devices=2 
 UUID=e604c49e:d3a948fd:13d9bc11:dbc82862
 ---
 
 
 The RAID1 device was shutdown.  The following assembly command yielded:
 
 ---
 mdadm -As
 
 mdadm: WARNING /dev/ubdc1 and /dev/ubdc appear to have very similar 
 superblocks.  If they are really different, please --zero the superblock 
 on one
   If they are the same or overlap, please remove one from the
   DEVICE list in mdadm.conf.
 ---

Yes.  This is one of the problems with v0.90 metadata, and with
DEVICE partitions.

As the partitions start on a 64K alignment, and the metadata is 64K
aligned, the metadata appears to be right for both the whole device
and for the last partition on the device, and mdadm cannot tell the
difference.

With v1.x metadata, we store the superblock offset which allows us to
tell if we have mis-identified a superblock that was meant to be part
of a partition or of the whole device.

If you make your DEVICE line a little more restrictive, e.g.

 DEVICE /dev/ubd?1

then it will also work.

Or just don't use partitions.  Make the array from /dev/ubdb and
/dev/ubdc.
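A quick illustration of why the two superblocks coincide (a sketch based on the v0.90 rule that the superblock sits in the last 64 KiB-aligned 64 KiB of the device; the sizes here are invented):

```python
RESERVED = 128  # v0.90 superblock reservation, in 512-byte sectors (64 KiB)

def sb_offset(size_sectors):
    """Sector of a v0.90 superblock within a device of the given size."""
    return (size_sectors & ~(RESERVED - 1)) - RESERVED

disk = 20480        # whole-device size in sectors (10 MiB, invented)
part_start = 128    # last partition starts on a 64 KiB boundary

# Superblock position computed for the whole disk ...
whole = sb_offset(disk)
# ... and for the last partition, translated to an absolute sector.
part = part_start + sb_offset(disk - part_start)

# Because part_start is a multiple of 128 sectors, both land on the same
# sector, so mdadm cannot tell which device the superblock belongs to.
print(whole, part, whole == part)  # prints: 20352 20352 True
```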


NeilBrown


RE: mdadm break / restore soft mirror

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:

 How do I create the internal bitmap?  man mdadm didn't shed any
 light and my brief excursion into google wasn't much more helpful. 

  mdadm --grow --bitmap=internal /dev/mdX

 
 The version I have installed is mdadm-1.12.0-5.i386 from RedHat
 which would appear to be way out of date! 

WAY!  mdadm 2.0 would be an absolute minimum, and Linux 2.6.13 likewise;
probably something closer to 2.6.20 would be a good idea.

NeilBrown


[PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE

2007-12-13 Thread NeilBrown
Following are 7 md-related patches suitable for the next -mm
and maybe for 2.6.25.

They move towards giving user-space programs more fine control of an
array so that we can add support for more complex metadata formats
(e.g. DDF) without bothering the kernel with such things.

The last patch isn't strictly md-related.  It adds an ioctl which
allows mapping from an open file descriptor on a block device to
a name in /sys.  This makes finding the names of things in /sys more
practical.  As I put this in block-layer code, I have Cc:ed Jens Axboe.

 [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
 [PATCH 002 of 7] md: Give userspace control over removing failed devices when 
external metadata in use
 [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing.
 [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
 [PATCH 005 of 7] md: Lock address when changing attributes of component 
devices.
 [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has 
external metadata.
 [PATCH 007 of 7] md: Get name for block device in sysfs


[PATCH 001 of 7] md: Support 'external' metadata for md arrays.

2007-12-13 Thread NeilBrown

- Add a state flag 'external' to indicate that the metadata is managed
  externally (by user-space), so important changes need to be
  left to user-space to handle.
  Alternatives are non-persistent ('none'), where there is no stable metadata -
  after the array is stopped there is no record of its status - and
  internal, which can be version 0.90 or version 1.x.
  These are selected by writing to the 'metadata' attribute.



- move the updating of superblocks (sync_sbs) to after we have checked if
  there are any superblocks or not.

- New array state 'write_pending'.  This means that the metadata records
  the array as 'clean', but a write has been requested, so the metadata has
  to be updated to record a 'dirty' array before the write can continue.
  This change is reported to md by writing 'active' to the array_state
  attribute.

- tidy up marking of sb_dirty:
   - don't set sb_dirty when resync finishes as md_check_recovery
 calls md_update_sb when the sync thread finishes anyway.
   - Don't set sb_dirty in multipath_run as the array might not be dirty.
   - don't mark superblock dirty when switching to 'clean' if there
 is no internal superblock (if external, userspace can choose to
 update the superblock whenever it chooses to).

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c   |   77 +---
 ./include/linux/raid/md_k.h |3 +
 2 files changed, 61 insertions(+), 19 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:07:51.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:08:28.0 +1100
@@ -778,7 +778,8 @@ static int super_90_validate(mddev_t *md
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
 		mddev->patch_version = sb->patch_version;
-		mddev->persistent = ! sb->not_persistent;
+		mddev->persistent = 1;
+		mddev->external = 0;
 		mddev->chunk_size = sb->chunk_size;
 		mddev->ctime = sb->ctime;
 		mddev->utime = sb->utime;
@@ -904,7 +905,7 @@ static void super_90_sync(mddev_t *mddev
 	sb->size  = mddev->size;
 	sb->raid_disks = mddev->raid_disks;
 	sb->md_minor = mddev->md_minor;
-	sb->not_persistent = !mddev->persistent;
+	sb->not_persistent = 0;
 	sb->utime = mddev->utime;
 	sb->state = 0;
 	sb->events_hi = (mddev->events>>32);
@@ -1158,6 +1159,7 @@ static int super_1_validate(mddev_t *mdd
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
 		mddev->persistent = 1;
+		mddev->external = 0;
 		mddev->chunk_size = le32_to_cpu(sb->chunksize) << 9;
 		mddev->ctime = le64_to_cpu(sb->ctime) & ((1ULL << 32)-1);
 		mddev->utime = le64_to_cpu(sb->utime) & ((1ULL << 32)-1);
@@ -1699,18 +1701,20 @@ repeat:
 		MD_BUG();
 		mddev->events --;
 	}
-	sync_sbs(mddev, nospares);
 
 	/*
 	 * do not write anything to disk if using
 	 * nonpersistent superblocks
 	 */
 	if (!mddev->persistent) {
-		clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+		if (!mddev->external)
+			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+
 		spin_unlock_irq(&mddev->write_lock);
 		wake_up(&mddev->sb_wait);
 		return;
 	}
+	sync_sbs(mddev, nospares);
 	spin_unlock_irq(&mddev->write_lock);
 
 	dprintk(KERN_INFO
@@ -2430,6 +2434,8 @@ array_state_show(mddev_t *mddev, char *p
 	case 0:
 		if (mddev->in_sync)
 			st = clean;
+		else if (test_bit(MD_CHANGE_CLEAN, &mddev->flags))
+			st = write_pending;
 		else if (mddev->safemode)
 			st = active_idle;
 		else
@@ -2460,11 +2466,9 @@ array_state_store(mddev_t *mddev, const 
 		break;
 	case clear:
 		/* stopping an active array */
-		if (mddev->pers) {
-			if (atomic_read(&mddev->active) > 1)
-				return -EBUSY;
-			err = do_md_stop(mddev, 0);
-		}
+		if (atomic_read(&mddev->active) > 1)
+			return -EBUSY;
+		err = do_md_stop(mddev, 0);
 		break;
 	case inactive:
 		/* stopping an active array */
@@ -2472,7 +2476,8 @@ array_state_store(mddev_t *mddev, const 
 			if (atomic_read(&mddev->active) > 1)
 				return -EBUSY;
 			err = do_md_stop(mddev, 2);
-		}
+		} else
+			err = 0; /* already inactive */
 		break;
 	case suspended:

[PATCH 002 of 7] md: Give userspace control over removing failed devices when external metadata in use

2007-12-13 Thread NeilBrown

When a device fails, we must not allow any further writes to the array
until the device failure has been recorded in the array metadata.
When metadata is managed externally, this requires some synchronisation...

Allow/require userspace to explicitly remove failed devices
from active service in the array by writing 'none' to the 'slot'
attribute.  If this reduces the number of failed devices to 0,
the write block will automatically be lowered.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |   43 ++-
 1 file changed, 34 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:08:28.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:08:52.0 +1100
@@ -1894,20 +1894,44 @@ static ssize_t
 slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
+	int err;
+	char nm[20];
 	int slot = simple_strtoul(buf, &e, 10);
 	if (strncmp(buf, "none", 4)==0)
 		slot = -1;
 	else if (e==buf || (*e && *e!= '\n'))
 		return -EINVAL;
-	if (rdev->mddev->pers)
-		/* Cannot set slot in active array (yet) */
-		return -EBUSY;
-	if (slot >= rdev->mddev->raid_disks)
-		return -ENOSPC;
-	rdev->raid_disk = slot;
-	/* assume it is working */
-	rdev->flags = 0;
-	set_bit(In_sync, &rdev->flags);
+	if (rdev->mddev->pers) {
+		/* Setting 'slot' on an active array requires also
+		 * updating the 'rd%d' link, and communicating
+		 * with the personality with ->hot_*_disk.
+		 * For now we only support removing
+		 * failed/spare devices.  This normally happens automatically,
+		 * but not when the metadata is externally managed.
+		 */
+		if (slot != -1)
+			return -EBUSY;
+		if (rdev->raid_disk == -1)
+			return -EEXIST;
+		/* personality does all needed checks */
+		if (rdev->mddev->pers->hot_add_disk == NULL)
+			return -EINVAL;
+		err = rdev->mddev->pers->
+			hot_remove_disk(rdev->mddev, rdev->raid_disk);
+		if (err)
+			return err;
+		sprintf(nm, "rd%d", rdev->raid_disk);
+		sysfs_remove_link(&rdev->mddev->kobj, nm);
+		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+		md_wakeup_thread(rdev->mddev->thread);
+	} else {
+		if (slot >= rdev->mddev->raid_disks)
+			return -ENOSPC;
+		rdev->raid_disk = slot;
+		/* assume it is working */
+		rdev->flags = 0;
+		set_bit(In_sync, &rdev->flags);
+	}
 	return len;
 }
 
@@ -5551,6 +5575,7 @@ static int remove_and_add_spares(mddev_t
 
 	ITERATE_RDEV(mddev,rdev,rtmp)
 		if (rdev->raid_disk >= 0 &&
+		    !mddev->external &&
 		    (test_bit(Faulty, &rdev->flags) ||
 		     ! test_bit(In_sync, &rdev->flags)) &&
 		    atomic_read(&rdev->nr_pending)==0) {


[PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing.

2007-12-13 Thread NeilBrown

This allows userspace to control resync/reshape progress and
synchronise it with other activities, such as shared access in a SAN,
or backing up critical sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk
size if such is meaningful) causes a resync to pause when it
gets to that point.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./Documentation/md.txt  |   10 +
 ./drivers/md/md.c   |   75 ++--
 ./drivers/md/raid1.c|2 +
 ./drivers/md/raid10.c   |3 +
 ./drivers/md/raid5.c|   25 ++
 ./include/linux/raid/md_k.h |2 +
 6 files changed, 107 insertions(+), 10 deletions(-)

diff .prev/Documentation/md.txt ./Documentation/md.txt
--- .prev/Documentation/md.txt  2007-12-14 16:07:50.0 +1100
+++ ./Documentation/md.txt  2007-12-14 16:08:57.0 +1100
@@ -416,6 +416,16 @@ also have
  sectors in total that could need to be processed.  The two
  numbers are separated by a '/'  thus effectively showing one
  value, a fraction of the process that is complete.
+ A 'select' on this attribute will return when resync completes,
+ when it reaches the current sync_max (below) and possibly at
+ other times.
+
+   sync_max
+ This is a number of sectors at which point a resync/recovery
+ process will pause.  When a resync is active, the value can
+ only ever be increased, never decreased.  The value of 'max'
+ effectively disables the limit.
+
 
sync_speed
  This shows the current actual speed, in K/sec, of the current

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:08:52.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:08:57.0 +1100
@@ -275,6 +275,7 @@ static mddev_t * mddev_find(dev_t unit)
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
 	new->reshape_position = MaxSector;
+	new->resync_max = MaxSector;
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -2926,6 +2927,43 @@ sync_completed_show(mddev_t *mddev, char
 static struct md_sysfs_entry md_sync_completed = __ATTR_RO(sync_completed);
 
 static ssize_t
+max_sync_show(mddev_t *mddev, char *page)
+{
+	if (mddev->resync_max == MaxSector)
+		return sprintf(page, "max\n");
+	else
+		return sprintf(page, "%llu\n",
+			       (unsigned long long)mddev->resync_max);
+}
+static ssize_t
+max_sync_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	if (strncmp(buf, "max", 3) == 0)
+		mddev->resync_max = MaxSector;
+	else {
+		char *ep;
+		unsigned long long max = simple_strtoull(buf, &ep, 10);
+		if (ep == buf || (*ep != 0 && *ep != '\n'))
+			return -EINVAL;
+		if (max < mddev->resync_max &&
+		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+			return -EBUSY;
+
+		/* Must be a multiple of chunk_size */
+		if (mddev->chunk_size) {
+			if (max & (sector_t)((mddev->chunk_size>>9)-1))
+				return -EINVAL;
+		}
+		mddev->resync_max = max;
+	}
+	wake_up(&mddev->recovery_wait);
+	return len;
+}
+
+static struct md_sysfs_entry md_max_sync =
+__ATTR(sync_max, S_IRUGO|S_IWUSR, max_sync_show, max_sync_store);
+
+static ssize_t
 suspend_lo_show(mddev_t *mddev, char *page)
 {
 	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo);
@@ -3035,6 +3073,7 @@ static struct attribute *md_redundancy_a
 	&md_sync_max.attr,
 	&md_sync_speed.attr,
 	&md_sync_completed.attr,
+	&md_max_sync.attr,
 	&md_suspend_lo.attr,
 	&md_suspend_hi.attr,
 	&md_bitmap.attr,
@@ -3582,6 +3621,7 @@ static int do_md_stop(mddev_t * mddev, i
 	mddev->size = 0;
 	mddev->raid_disks = 0;
 	mddev->recovery_cp = 0;
+	mddev->resync_max = MaxSector;
 	mddev->reshape_position = MaxSector;
 	mddev->external = 0;
@@ -5445,8 +5485,16 @@ void md_do_sync(mddev_t *mddev)
 		sector_t sectors;
 
 		skipped = 0;
+		if (j >= mddev->resync_max) {
+			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+			wait_event(mddev->recovery_wait,
+				   mddev->resync_max > j
+				   || kthread_should_stop());
+		}
+		if (kthread_should_stop())
+			goto interrupted;
 		sectors = mddev->pers->sync_request(mddev, j, &skipped,
-						currspeed < speed_min(mddev));
+					currspeed < speed_min(mddev));
 		if (sectors ==

[PATCH 004 of 7] md: Allow devices to be shared between md arrays.

2007-12-13 Thread NeilBrown

Currently, a given device is claimed by a particular array so
that it cannot be used by other arrays.

This is not ideal for DDF and other metadata schemes which have
their own partitioning concept.

So for externally managed metadata, just claim the device for
md in general, require that offset and size are set
properly for each device, and make sure that if a device is
included in different arrays then the active sections do
not overlap.

This involves adding another flag to the rdev, which makes it awkward
to set ->flags = 0 to clear certain flags.  So now clear flags
explicitly by name when we want to clear things.
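The non-overlap rule described above is plain interval intersection; as a sketch in Python (mirroring the overlaps() helper the patch adds to md.c):

```python
def overlaps(s1, l1, s2, l2):
    """True if [s1, s1+l1) and [s2, s2+l2) intersect (start/length pairs)."""
    return not (s1 + l1 <= s2 or s2 + l2 <= s1)

# Two md arrays sharing one disk: their active sections must not overlap.
assert not overlaps(0, 100, 100, 50)   # adjacent sections are fine
assert overlaps(0, 100, 99, 50)        # a one-sector overlap is rejected
```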

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c   |   93 ++--
 ./include/linux/raid/md_k.h |2 
 2 files changed, 84 insertions(+), 11 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:08:57.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:09:01.0 +1100
@@ -774,7 +774,11 @@ static int super_90_validate(mddev_t *md
 	__u64 ev1 = md_event(sb);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
@@ -1155,7 +1159,11 @@ static int super_1_validate(mddev_t *mdd
 	__u64 ev1 = le64_to_cpu(sb->events);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
@@ -1407,7 +1415,7 @@ static int bind_rdev_to_array(mdk_rdev_t
 		goto fail;
 	}
 	list_add(&rdev->same_set, &mddev->disks);
-	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
+	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
  fail:
@@ -1447,7 +1455,7 @@ static void unbind_rdev_from_array(mdk_r
  * otherwise reused by a RAID array (or any other kernel
  * subsystem), by bd_claiming the device.
  */
-static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
+static int lock_rdev(mdk_rdev_t *rdev, dev_t dev, int shared)
 {
 	int err = 0;
 	struct block_device *bdev;
@@ -1459,13 +1467,15 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 			__bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
-	err = bd_claim(bdev, rdev);
+	err = bd_claim(bdev, shared ? (mdk_rdev_t *)lock_rdev : rdev);
 	if (err) {
 		printk(KERN_ERR "md: could not bd_claim %s.\n",
 			bdevname(bdev, b));
 		blkdev_put(bdev);
 		return err;
 	}
+	if (!shared)
+		set_bit(AllReserved, &rdev->flags);
 	rdev->bdev = bdev;
 	return err;
 }
@@ -1930,7 +1940,8 @@ slot_store(mdk_rdev_t *rdev, const char 
 			return -ENOSPC;
 		rdev->raid_disk = slot;
 		/* assume it is working */
-		rdev->flags = 0;
+		clear_bit(Faulty, &rdev->flags);
+		clear_bit(WriteMostly, &rdev->flags);
 		set_bit(In_sync, &rdev->flags);
 	}
 	return len;
@@ -1955,6 +1966,10 @@ offset_store(mdk_rdev_t *rdev, const cha
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
+	if (rdev->size && rdev->mddev->external)
+		/* Must set offset before size, so overlap checks
+		 * can be sane */
+		return -EBUSY;
 	rdev->data_offset = offset;
 	return len;
 }
@@ -1968,16 +1983,69 @@ rdev_size_show(mdk_rdev_t *rdev, char *p
 	return sprintf(page, "%llu\n", (unsigned long long)rdev->size);
 }
 
+static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
+{
+	/* check if two start/length pairs overlap */
+	if (s1+l1 <= s2)
+		return 0;
+	if (s2+l2 <= s1)
+		return 0;
+	return 1;
+}
+
 static ssize_t
 rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
+	unsigned long long oldsize = rdev->size;
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
+	if (size > oldsize && rdev->mddev->external) {
+		/* need to check that all other rdevs with the same ->bdev
+		 * do not overlap.  We need to unlock the mddev to avoid
+		 * a deadlock.  We have already changed rdev->size, and if
+   

[PATCH 005 of 7] md: Lock address when changing attributes of component devices.

2007-12-13 Thread NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:09:01.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:09:03.0 +1100
@@ -2080,12 +2080,18 @@ rdev_attr_store(struct kobject *kobj, st
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+	int rv;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
-	return entry->store(rdev, page, length);
+	rv = mddev_lock(rdev->mddev);
+	if (!rv) {
+		rv = entry->store(rdev, page, length);
+		mddev_unlock(rdev->mddev);
+	}
+	return rv;
 }
 
 static void rdev_free(struct kobject *ko)
 
 static void rdev_free(struct kobject *ko)


[PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata.

2007-12-13 Thread NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-12-14 16:09:03.0 +1100
+++ ./drivers/md/md.c   2007-12-14 16:09:09.0 +1100
@@ -4650,9 +4650,10 @@ static int md_ioctl(struct inode *inode,
 	 */
 	/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
 	 * RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
-	if (!mddev->raid_disks && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
-	    && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
-	    && cmd != GET_BITMAP_FILE) {
+	if ((!mddev->raid_disks && !mddev->external)
+	    && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
+	    && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
+	    && cmd != GET_BITMAP_FILE) {
 		err = -ENODEV;
 		goto abort_unlock;
 	}


[PATCH 007 of 7] md: Get name for block device in sysfs

2007-12-13 Thread NeilBrown

Given an fd on a block device, returns a string like

/block/sda/sda1

which can be used to find related information in /sys.

Ideally we should have an ioctl that works on char devices as well,
but that seems far from trivial, so it seems reasonable to have
this until the latter can be implemented.

Cc: Jens Axboe [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./block/ioctl.c  |   13 +
 ./include/linux/fs.h |2 ++
 2 files changed, 15 insertions(+)

diff .prev/block/ioctl.c ./block/ioctl.c
--- .prev/block/ioctl.c 2007-12-14 17:18:50.0 +1100
+++ ./block/ioctl.c 2007-12-14 16:15:41.0 +1100
@@ -227,8 +227,21 @@ int blkdev_ioctl(struct inode *inode, st
struct block_device *bdev = inode-i_bdev;
struct gendisk *disk = bdev-bd_disk;
int ret, n;
+   char b[BDEVNAME_SIZE*2  + 10];
 
switch(cmd) {
+   case BLKGETNAME:
+   strcpy(b, /block/);
+   bdevname(bdev-bd_contains, b+7);
+   if (bdev-bd_contains != bdev) {
+   char *e = b + strlen(b);
+   *e++ = '/';
+   bdevname(bdev, e);
+   }
+   if (copy_to_user((char __user *)arg, b, strlen(b)+1))
+   return -EFAULT;
+   return 0;
+
case BLKFLSBUF:
if (!capable(CAP_SYS_ADMIN))
return -EACCES;

diff .prev/include/linux/fs.h ./include/linux/fs.h
--- .prev/include/linux/fs.h	2007-12-14 17:18:50.0 +1100
+++ ./include/linux/fs.h	2007-12-14 16:13:03.0 +1100
@@ -218,6 +218,8 @@ extern int dir_notify_enable;
 #define BLKTRACESTOP _IO(0x12,117)
 #define BLKTRACETEARDOWN _IO(0x12,118)
 
+#define BLKGETNAME _IOR(0x12, 119, char [1024])
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */


Re: mdadm break / restore soft mirror

2007-12-13 Thread Jeff Breidenbach
 What you could do is set the number of devices in the array to 3 so
 that it always appears to be degraded, then rotate your backup drives
 through the array.  The number of dirty bits in the bitmap will
 steadily grow, and so resyncs will take longer.  Once it crosses some
 threshold, you set the array back to having 2 devices so that it looks
 non-degraded, and clean the bitmap.  Then each device will need a full
 resync, after which you will get away with partial resyncs for a while.

I don't understand why clearing the bitmap causes a rebuild of
all devices. I think I have a conceptual misunderstanding.  Consider
a RAID-1 and three physical disks involved, A, B, C:

1) A and B are in the RAID, everything is synced
2) Create a bitmap on the array
3) Fail + remove B
4) Hot add C, wait for C to sync
5) Fail + remove C
6) Hot add B, wait for B to resync
7) Goto step 3

I understand that after a while we might want to clean the bitmap
and that would trigger a full resync for drives B and C. I don't
understand why it would ever cause a resync for drive A.
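As I understand the mechanism, it can be pictured with a toy model (a sketch of the concept only, not mdadm's real bitmap format): A stays in the array the whole time, so it is always the up-to-date reference; the bitmap only records chunks written while a mirror was absent, and a returning mirror copies just those chunks from A.

```python
bitmap = set()   # indices of chunks dirtied while a mirror was out

def write_while_degraded(chunks):
    """Writes that land while B or C is removed dirty the bitmap."""
    bitmap.update(chunks)

def re_add_and_resync():
    """The returning mirror copies only the dirty chunks from A,
    then the bitmap is cleared for the next rotation."""
    copied = sorted(bitmap)
    bitmap.clear()
    return copied

write_while_degraded({3, 7})
print(re_add_and_resync())   # partial resync: prints [3, 7], not all 16 chunks
```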