Re: [BUG] OOPS 2.6.24.2 raid5 write with ioatdma

2008-02-15 Thread Dan Williams
On Fri, Feb 15, 2008 at 9:19 AM, Laurent CORBES
[EMAIL PROTECTED] wrote:
 Hi all,

  I got a raid5 oops when trying to write on a raid 5 array, with ioatdma loaded
  and without DCA activated in bios:


At first glance I believe the attached patch may fix the issue, I'll
try to reproduce this locally.

Regards,
Dan
ioat: fix 'ack' handling, driver must ensure that 'ack' is zero

From: Dan Williams [EMAIL PROTECTED]

Initialize 'ack' to zero in case the descriptor has been recycled.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/dma/ioat_dma.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)


diff --git a/drivers/dma/ioat_dma.c b/drivers/dma/ioat_dma.c
index 45e7b46..8cf542b 100644
--- a/drivers/dma/ioat_dma.c
+++ b/drivers/dma/ioat_dma.c
@@ -726,6 +726,7 @@ static struct dma_async_tx_descriptor *ioat1_dma_prep_memcpy(
 
 	if (new) {
 		new->len = len;
+		new->async_tx.ack = 0;
 		return &new->async_tx;
 	} else
 		return NULL;
@@ -749,6 +750,7 @@ static struct dma_async_tx_descriptor *ioat2_dma_prep_memcpy(
 
 	if (new) {
 		new->len = len;
+		new->async_tx.ack = 0;
 		return &new->async_tx;
 	} else
 		return NULL;


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread Dan Williams
 heheh.

 it's really easy to reproduce the hang without the patch -- i could
 hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
 i'll try with ext3... Dan's experiences suggest it won't happen with ext3
 (or is even more rare), which would explain why this is overall a
 rare problem.


Hmmm... how rare?

http://marc.info/?l=linux-kernel&m=119461747005776&w=2

There is nothing specific that prevents other filesystems from hitting
it, perhaps XFS is just better at submitting large i/o's.  -stable
should get some kind of treatment.  I'll take altered performance over
a hung system.

--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Dan Williams
On Jan 10, 2008 12:13 AM, dean gaudet [EMAIL PROTECTED] wrote:
 w.r.t. dan's cfq comments -- i really don't know the details, but does
 this mean cfq will misattribute the IO to the wrong user/process?  or is
 it just a concern that CPU time will be spent on someone's IO?  the latter
 is fine to me... the former seems sucky because with today's multicore
 systems CPU time seems cheap compared to IO.


I do not see this affecting the time-slicing feature of cfq, because
as Neil says the work has to get done at some point.  If I give up
some of my slice working on someone else's I/O, chances are the favor
will be returned in kind, since the code does not discriminate.  The
io-priority capability of cfq does not work as advertised with current
MD, since the priority is tied to the current thread and the thread
that actually submits the i/o on a stripe is non-deterministic.  So I
do not see this change making the situation any worse.  In fact, it
may make it a bit better, since there is a higher chance for the
thread submitting i/o to MD to do its own i/o to the backing disks.
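
For illustration (a hypothetical command, not from the original thread):

  ionice -c1 -n0 dd if=/dev/zero of=/mnt/md0/file bs=1M count=1024

marks the dd thread realtime for cfq, but with raid5 the writes to the
backing disks are often issued by raid5d, which carries its own default
priority -- so the requested boost is largely lost today.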

Reviewed-by: Dan Williams [EMAIL PROTECTED]


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
 On Sat, 29 Dec 2007, Dan Williams wrote:
 
  On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote: 
   On Sat, 29 Dec 2007, Dan Williams wrote: 
   
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: 
 hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
 the same 64k chunk array and had raised the stripe_cache_size to 1024...
 and got a hang.  this time i grabbed stripe_cache_active before bumping
 the size again -- it was only 905 active.  as i recall the bug we were
 debugging a year+ ago the active was at the size when it would hang.  so
 this is probably something new.

I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state,
i.e. > 3/4 of stripes active.  This state should automatically
clear...
   
   cool, glad you can reproduce it :) 
   
   i have a bit more data... i'm seeing the same problem on debian's 
   2.6.22-3-amd64 kernel, so it's not new in 2.6.24. 
   
  
  This is just brainstorming at this point, but it looks like xfs can 
  submit more requests in the bi_end_io path such that it can lock 
  itself out of the RAID array.  The sequence that concerns me is: 
  
  return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang
   
  
  I need to verify whether this path is actually triggering, but if we are
  in an inactive_blocked condition this new request will be put on a 
  wait queue and we'll never get to the release_stripe() call after 
  return_io().  It would be interesting to see if this is new XFS 
  behavior in recent kernels.
 
 
 i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
 
 http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
 
 which was Neil's change in 2.6.22 for deferring generic_make_request 
 until there's enough stack space for it.
 
 with my git tree sync'd to that commit my test cases fail in under 20 
 minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous 
 to it i've got 8h of run-time now without the problem.
 
 this isn't definitive of course since it does seem to be timing
 dependent, but since all failures have occurred much earlier than that
 for me so far i think this indicates this change is either the cause of
 the problem or exacerbates an existing raid5 problem.
 
 given that this problem looks like a very rare problem i saw with 2.6.18 
 (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an 
 existing problem... not that i have evidence either way.
 
 i've attached a new kernel log with a hang at d89d87965d... and the 
 reduced config file i was using for the bisect.  hopefully the hang 
 looks the same as what we were seeing at 2.6.24-rc6.  let me know.
 

Dean could you try the below patch to see if it fixes your failure
scenario?  It passes my test case.

Thanks,
Dan

---
md: add generic_make_request_immed to prevent raid5 hang

From: Dan Williams [EMAIL PROTECTED]

Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
by preventing recursive calls to generic_make_request.  However the
following conditions can cause raid5 to hang until 'stripe_cache_size' is
increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit
   (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing,
   i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to
process stripes N through M.  This means we never return from make_request.
All other raid5 users pile up in get_active_stripe.  Increasing
stripe_cache_size temporarily resolves the blockage by allowing the blocked
make_request to return to generic_make_request.
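
For illustration (hypothetical numbers): with stripe_cache_size = 256
the 'inactive_blocked' limit is 3/4 * 256 = 192 active stripes.  If
stripe_cache_active is 190 (N = 2) and a bio needs M = 8 stripes, the
first 2 are acquired, the limit is reached, and the thread then sleeps
in get_active_stripe() waiting for stripes it can only release by
returning -- which it never does.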

Another way to solve this is to move all i/o submission to raid5d context.

Thanks to Dean Gaudet for bisecting this down to d89d8796.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 block/ll_rw_blk.c      |   16 +++++++++++++---
 drivers/md/raid5.c |4 ++--
 include/linux/blkdev.h |1 +
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 8b91994..bff40c2 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3287,16 +3287,26 @@ end_io:
 }
 
 /*
- * We only want one ->make_request_fn to be active at a time,
- * else stack usage with stacked devices could be a problem.
+ * In the general case we only want one ->make_request_fn

Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Jan 9, 2008 5:09 PM, Neil Brown [EMAIL PROTECTED] wrote:
 On Wednesday January 9, [EMAIL PROTECTED] wrote:
  On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
   i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
  
   which was Neil's change in 2.6.22 for deferring generic_make_request
   until there's enough stack space for it.
  
 
  Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
  by preventing recursive calls to generic_make_request.  However the
  following conditions can cause raid5 to hang until 'stripe_cache_size' is
  increased:
 

 Thanks for pursuing this guys.  That explanation certainly sounds very
 credible.

 The generic_make_request_immed is a good way to confirm that we have
 found the bug,  but I don't like it as a long term solution, as it
 just reintroduced the problem that we were trying to solve with the
 problematic commit.

 As you say, we could arrange that all request submission happens in
 raid5d and I think this is the right way to proceed.  However we can
 still take some of the work into the thread that is submitting the
 IO by calling raid5d() at the end of make_request, like this.

 Can you test it please?

This passes my failure case.

However, my test is different from Dean's in that I am using tiobench
and the latest rev of my 'get_priority_stripe' patch. I believe the
failure mechanism is the same, but it would be good to get
confirmation from Dean.  get_priority_stripe has the effect of
increasing the frequency of
make_request->handle_stripe->generic_make_request sequences.

 Does it seem reasonable?

What do you think about limiting the number of stripes the submitting
thread handles to be equal to what it submitted?  If I'm a thread that
only submits 1 stripe worth of work, should I get stuck handling the
rest of the cache?

Regards,
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote:
 So I'm inclined to leave it as do as much work as is available to be
 done, as that is simplest.  But I can probably be talked out of it
 with a convincing argument.

Well, in an age of CFS and CFQ it smacks of 'unfairness'.  But does that
trump KISS...? Probably not.

--
Dan



Re: Raid 1, new disk can't be added after replacing faulty disk

2008-01-07 Thread Dan Williams
On Jan 7, 2008 6:44 AM, Radu Rendec [EMAIL PROTECTED] wrote:
 I'm experiencing trouble when trying to add a new disk to a raid 1 array
 after having replaced a faulty disk.

[..]
 # mdadm --version
 mdadm - v2.6.2 - 21st May 2007

[..]
 However, this happens with both mdadm 2.6.2 and 2.6.4. I downgraded to
 2.5.4 and it works like a charm.

Looks like you are running into the issue described here:
http://marc.info/?l=linux-raid&m=119892098129022&w=2


Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Dan Williams
On Thu, 2008-01-03 at 15:46 -0700, NeilBrown wrote:
 This patch fixes a fairly serious bug in md/raid5 in 2.6.23 and 24-rc.
 It would be great if it could get into 2.6.23.13 and 2.6.24-final.
 Thanks.
 NeilBrown
 
 ### Comments for Changeset
 
 We currently do not wait for the block from the missing device
 to be computed from parity before copying data to the new stripe
 layout.
 
 The change in the raid6 code is not technically needed as we
 don't delay data block recovery in the same way for raid6 yet.
 But making the change now is safer long-term.
 
 This bug exists in 2.6.23 and 2.6.24-rc
 
 Cc: [EMAIL PROTECTED]
 Cc: Dan Williams [EMAIL PROTECTED]
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
Acked-by: Dan Williams [EMAIL PROTECTED]





Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Dan Williams
On Thu, 2008-01-03 at 16:00 -0700, Williams, Dan J wrote:
 On Thu, 2008-01-03 at 15:46 -0700, NeilBrown wrote:
  This patch fixes a fairly serious bug in md/raid5 in 2.6.23 and
 24-rc.
  It would be great if it could get into 2.6.23.13 and 2.6.24-final.
  Thanks.
  NeilBrown
 
  ### Comments for Changeset
 
  We currently do not wait for the block from the missing device
  to be computed from parity before copying data to the new stripe
  layout.
 
  The change in the raid6 code is not technically needed as we
  don't delay data block recovery in the same way for raid6 yet.
  But making the change now is safer long-term.
 
  This bug exists in 2.6.23 and 2.6.24-rc
 
  Cc: [EMAIL PROTECTED]
  Cc: Dan Williams [EMAIL PROTECTED]
  Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 Acked-by: Dan Williams [EMAIL PROTECTED]
 

On closer look the safer test is:

!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending).

The 'req_compute' field only indicates that a 'compute_block' operation
was requested during this pass through handle_stripe so that we can
issue a linked chain of asynchronous operations.

---

From: Neil Brown [EMAIL PROTECTED]

md: Fix data corruption when a degraded raid5 array is reshaped.

We currently do not wait for the block from the missing device
to be computed from parity before copying data to the new stripe
layout.

The change in the raid6 code is not technically needed as we
don't delay data block recovery in the same way for raid6 yet.
But making the change now is safer long-term.

This bug exists in 2.6.23 and 2.6.24-rc

Cc: [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a5aad8c..e8c8157 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2865,7 +2865,8 @@ static void handle_stripe5(struct stripe_head *sh)
 			md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (s.expanding && s.locked == 0)
+	if (s.expanding && s.locked == 0 &&
+	    !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
 		handle_stripe_expansion(conf, sh, NULL);
 
 	if (sh->ops.count)
@@ -3067,7 +3068,8 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (s.expanding && s.locked == 0)
+	if (s.expanding && s.locked == 0 &&
+	    !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
 		handle_stripe_expansion(conf, sh, r6s);
 
 	spin_unlock(&sh->lock);



mdadm: unable to add a disk to degraded raid1 array

2007-12-29 Thread Dan Williams
In case someone else happens upon this I have found that mdadm >=
v2.6.2 cannot add a disk to a degraded raid1 array created with mdadm
< 2.6.2.

I bisected the problem down to mdadm git commit
2fb749d1b7588985b1834e43de4ec5685d0b8d26 which appears to make an
incompatible change to the super block's 'data_size' field.

--- sdb1-sb-good.hex	2007-12-12 14:31:42.000000000 +0000
+++ sdb1-sb-bad.hex	2007-12-12 14:31:36.000000000 +0000
@@ -6,12 +6,12 @@
 050 60d8 0077 0000 0000 0000 0000 0004 0000
 060 0000 0000 0000 0000 0000 0000 0000 0000
 *
-080 0000 0000 0000 0000 60d8 0077 0000 0000
+080 0000 0000 0000 0000 60d0 0077 0000 0000

Which trips up the "if (rdev->size < le64_to_cpu(sb->data_size)/2)"
check in super_1_load [1], resulting in:

mdadm: add new device failed for /dev/sdb1 as 4: Invalid argument

--
Dan

[1] http://lxr.linux.no/linux/drivers/md/md.c#L1148


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
 hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
 the same 64k chunk array and had raised the stripe_cache_size to 1024...
 and got a hang.  this time i grabbed stripe_cache_active before bumping
 the size again -- it was only 905 active.  as i recall the bug we were
 debugging a year+ ago the active was at the size when it would hang.  so
 this is probably something new.

I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state,
i.e. > 3/4 of stripes active.  This state should automatically
clear...


 anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
 hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that
stripe_cache_active < 3/4 * stripe_cache_size, things will start
flowing again.
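
For example (the array name is illustrative):

  echo 2048 > /sys/block/md0/md/stripe_cache_size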


 btw what units are stripe_cache_size/active in?  is the memory consumed
 equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
 raid_disks * stripe_cache_active)?


memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
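
Worked example (hypothetical values): with 4 KiB pages, raid_disks = 8
and stripe_cache_size = 1024, that is 4096 * 8 * 1024 bytes = 32 MiB;
note it scales with PAGE_SIZE rather than chunk_size.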


 -dean


--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
 On Sat, 29 Dec 2007, Dan Williams wrote:

  On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
   hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
   the same 64k chunk array and had raised the stripe_cache_size to 1024...
   and got a hang.  this time i grabbed stripe_cache_active before bumping
   the size again -- it was only 905 active.  as i recall the bug we were
   debugging a year+ ago the active was at the size when it would hang.  so
   this is probably something new.
 
  I believe I am seeing the same issue and am trying to track down
  whether XFS is doing something unexpected, i.e. I have not been able
  to reproduce the problem with EXT3.  MD tries to increase throughput
  by letting some stripe work build up in batches.  It looks like every
   time your system has hung it has been in the 'inactive_blocked' state,
   i.e. > 3/4 of stripes active.  This state should automatically
   clear...

 cool, glad you can reproduce it :)

 i have a bit more data... i'm seeing the same problem on debian's
 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.


This is just brainstorming at this point, but it looks like xfs can
submit more requests in the bi_end_io path such that it can lock
itself out of the RAID array.  The sequence that concerns me is:

return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->hang

I need to verify whether this path is actually triggering, but if we are
in an inactive_blocked condition this new request will be put on a
wait queue and we'll never get to the release_stripe() call after
return_io().  It would be interesting to see if this is new XFS
behavior in recent kernels.
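
For reference, the wait in question looks roughly like this (paraphrased
from 2.6.23-era get_active_stripe() in drivers/md/raid5.c; details vary
by release):

	wait_event_lock_irq(conf->wait_for_stripe,
			    !list_empty(&conf->inactive_list) &&
			    (atomic_read(&conf->active_stripes)
			     < (conf->max_nr_stripes * 3 / 4)
			     || !conf->inactive_blocked),
			    conf->device_lock, /* ... */);

Once 'inactive_blocked' is set a new caller sleeps here until active
stripes drop below 3/4 of the cache, which never happens if the only
thread that could release stripes is the one sleeping.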

--
Dan


Re: HELP! New disks being dropped from RAID 6 array on every reboot

2007-11-23 Thread Dan Williams
On Nov 23, 2007 11:19 AM, Joshua Johnson [EMAIL PROTECTED] wrote:
 Greetings, long time listener, first time caller.

 I recently replaced a disk in my existing 8 disk RAID 6 array.
 Previously, all disks were PATA drives connected to the motherboard
 IDE and 3 promise Ultra 100/133 controllers.  I replaced one of the
 Promise controllers with a Via 64xx based controller, which has 2 SATA
 ports and one PATA port.  I connected a new SATA drive to the new
 card, partitioned the drive and added it to the array.  After 5 or 6
 hours the resyncing process finished and the array showed up complete.
  Upon rebooting I discovered that the new drive had not been added to
 the array when it was assembled on boot.   I resynced it and tried
 again -- still would not persist after a reboot.  I moved one of the
 existing PATA drives to the new controller (so I could have the slot
 for network), rebooted and rebuilt the array.  Now when I reboot BOTH
 disks are missing from the array (sda and sdb).  Upon examining the
 disks it appears they think they are part of the array, but for some
 reason they are not being added when the array is being assembled.
 For example, this is a disk on the new controller which was not added
 to the array after rebooting:

 # mdadm --examine /dev/sda1
 /dev/sda1:
   Magic : a92b4efc
 Version : 00.90.03
UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
   Creation Time : Thu Sep 21 23:52:19 2006
  Raid Level : raid6
 Device Size : 191157248 (182.30 GiB 195.75 GB)
  Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
Raid Devices : 8
   Total Devices : 8
 Preferred Minor : 0

 Update Time : Fri Nov 23 10:22:57 2007
   State : clean
  Active Devices : 8
 Working Devices : 8
  Failed Devices : 0
   Spare Devices : 0
Checksum : 50df590e - correct
  Events : 0.96419878

  Chunk Size : 256K

       Number   Major   Minor   RaidDevice State
 this     6       8        1        6      active sync   /dev/sda1

    0     0       3        2        0      active sync   /dev/hda2
    1     1      57        2        1      active sync   /dev/hdk2
    2     2      33        2        2      active sync   /dev/hde2
    3     3      34        2        3      active sync   /dev/hdg2
    4     4      22        2        4      active sync   /dev/hdc2
    5     5      56        2        5      active sync   /dev/hdi2
    6     6       8        1        6      active sync   /dev/sda1
    7     7       8       17        7      active sync   /dev/sdb1


 Everything there seems to be correct and current up to the last
 shutdown.  But the disk is not being added on boot.  Examining a disk
 that is currently running in the array shows:

 # mdadm --examine /dev/hdc2
 /dev/hdc2:
   Magic : a92b4efc
 Version : 00.90.03
UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
   Creation Time : Thu Sep 21 23:52:19 2006
  Raid Level : raid6
 Device Size : 191157248 (182.30 GiB 195.75 GB)
  Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
Raid Devices : 8
   Total Devices : 6
 Preferred Minor : 0

 Update Time : Fri Nov 23 10:23:52 2007
   State : clean
  Active Devices : 6
 Working Devices : 6
  Failed Devices : 2
   Spare Devices : 0
Checksum : 50df5934 - correct
  Events : 0.96419880

  Chunk Size : 256K

       Number   Major   Minor   RaidDevice State
 this     4      22        2        4      active sync   /dev/hdc2

    0     0       3        2        0      active sync   /dev/hda2
    1     1      57        2        1      active sync   /dev/hdk2
    2     2      33        2        2      active sync   /dev/hde2
    3     3      34        2        3      active sync   /dev/hdg2
    4     4      22        2        4      active sync   /dev/hdc2
    5     5      56        2        5      active sync   /dev/hdi2
    6     6       0        0        6      faulty removed
    7     7       0        0        7      faulty removed


 Here is my /etc/mdadm/mdadm.conf:

 DEVICE partitions
 PROGRAM /bin/echo
 MAILADDR redacted
 ARRAY /dev/md0 level=raid6 num-devices=8
 UUID=63ee7d14:a0ac6a6e:aef6fe14:50e047a5


 Can anyone see anything that is glaringly wrong here?  Has anybody
 experienced similar behavior?  I am running Debian using kernel
 2.6.23.8.  All partitions are set to type 0xFD and it appears the
 superblocks on the sd* disks were written, why wouldn't they be added
 to the array on boot?  Any help is greatly appreciated!

I wonder if you are running into a driver load order problem where the
ide driver and md are coming up before the sata driver.  You can let
userspace do the assembly after everything is up and running.  Specify
'raid=noautodetect' on the kernel command line and then let Debian's
'/etc/init.d/mdadm-raid' initscript take care of the assembly based on
your configuration file, or just run 'mdadm --assemble --scan' by
hand.
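
A sketch of that approach (device names copied from the --examine output
above; adjust as needed):

  # kernel command line addition:
  #   raid=noautodetect
  # then, from userspace, either:
  mdadm --assemble --scan        # uses /etc/mdadm/mdadm.conf
  # or assemble explicitly:
  mdadm --assemble /dev/md0 /dev/hda2 /dev/hdk2 /dev/hde2 /dev/hdg2 \
                   /dev/hdc2 /dev/hdi2 /dev/sda1 /dev/sdb1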

--
Dan

Re: PROBLEM: raid5 hangs

2007-11-14 Thread Dan Williams
On Nov 14, 2007 5:05 PM, Justin Piszcz [EMAIL PROTECTED] wrote:
 On Wed, 14 Nov 2007, Bill Davidsen wrote:
  Justin Piszcz wrote:
  This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5
  bio* patches are applied.
 
  Note below he's running 2.6.22.3 which doesn't have the bug unless -STABLE
  added it. So it should not really be in 2.6.22.anything. I assume you're
  talking about the endless write or bio issue?
 The bio issue is the root cause of the bug, yes?

Not if this is a 2.6.22 issue.  Neither of the bugs fixed by "raid5:
fix clearing of biofill operations" or "raid5: fix unending write
sequence" existed prior to 2.6.23.


Re: kernel panic (2.6.23.1-fc7) in drivers/md/raid5.c:144

2007-11-13 Thread Dan Williams
[ Adding Neil, stable@, DaveJ, and GregKH to the cc ]

On Nov 13, 2007 11:20 AM, Peter [EMAIL PROTECTED] wrote:
 Hi

 I had a 3 disc raid5 array running fine with Fedora 7 (32bit) kernel 
 2.6.23.1-fc7 on an old Athlon XP using a two sata_sil cards.

 I replaced the hardware with an Athlon64 X2 and using the onboard sata_nv, 
 after I modified the initrd I was able to boot up from my old system drive. 
 However when it brought up the raid array it died with a kernel panic. I used 
 a rescue CD, commented out the array in mdadm.conf and booted up. I could 
 assemble the array manually (it kicked out one of the three drives for some 
 reason?) but when I used mdadm --examine /dev/md0 I got the kernel panic 
 again. I don't have remote debugging but I managed to take some pictures:

 http://img132.imageshack.us/img132/1697/kernel1sh3.jpg
 http://img132.imageshack.us/img132/3538/kernel2eu2.jpg

 From what I understand it should be possible to do this hardware upgrade with 
 using software raid? Any ideas?

 Thanks
 Peter


There are two bug fix patches pending for 2.6.23.2:
raid5: fix clearing of biofill operations
http://marc.info/?l=linux-raid&m=119303750132068&w=2
raid5: fix unending write sequence
http://marc.info/?l=linux-raid&m=119453934805607&w=2

You are hitting the bug that was fixed by "raid5: fix clearing of
biofill operations".

Heads up for the stable@ team: "raid5: fix clearing of biofill
operations" was originally misapplied for 2.6.24-rc:
md: Fix misapplied patch in raid5.c
http://marc.info/?l=linux-raid&m=119396783332081&w=2


Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-13 Thread Dan Williams
On Nov 13, 2007 5:23 PM, Greg KH [EMAIL PROTECTED] wrote:
 On Tue, Nov 13, 2007 at 04:22:14PM -0800, Greg KH wrote:
  On Mon, Oct 22, 2007 at 05:15:27PM +1000, NeilBrown wrote:
  
   It appears that a couple of bugs slipped in to md for 2.6.23.
   These two patches fix them and are appropriate for 2.6.23.y as well
   as 2.6.24-rcX
  
   Thanks,
   NeilBrown
  
[PATCH 001 of 2] md: Fix an unsigned compare to allow creation of 
   bitmaps with v1.0 metadata.
[PATCH 002 of 2] md: raid5: fix clearing of biofill operations
 
  I don't see these patches in 2.6.24-rcX, are they there under some other
  subject?

 Oh nevermind, I found them, sorry for the noise...


Careful, it looks like you cherry picked commit 4ae3f847 "md: raid5:
fix clearing of biofill operations" which ended up misapplied in
Linus' tree.  You should either also pick up def6ae26 "md: fix
misapplied patch in raid5.c" or I can resend the original "raid5: fix
clearing of biofill operations".

The other patch for -stable, "raid5: fix unending write sequence", is
currently in -mm.

 greg k-h

Regards,
Dan


Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-13 Thread Dan Williams
On Nov 13, 2007 8:43 PM, Greg KH [EMAIL PROTECTED] wrote:
 
  Careful, it looks like you cherry picked commit 4ae3f847 md: raid5:
  fix clearing of biofill operations which ended up misapplied in
  Linus' tree,  You should either also pick up def6ae26 md: fix
  misapplied patch in raid5.c or I can resend the original raid5: fix
  clearing of biofill operations.
 
  The other patch for -stable raid5: fix unending write sequence is
  currently in -mm.

 Hm, I've attached the two patches that I have right now in the -stable
 tree so far (still have over 100 patches to go, so I might not have
 gotten to them yet if you have sent them).  These were sent to me by
 Andrew on their way to Linus.  if I should drop either one, or add
 another one, please let me know.


Drop md-raid5-fix-clearing-of-biofill-operations.patch and replace it
with the attached
md-raid5-not-raid6-fix-clearing-of-biofill-operations.patch (the
original sent to Neil).

The critical difference is that the replacement patch touches
handle_stripe5, not handle_stripe6.  Diffing the patches shows the
changes for hunk #3:

-@@ -2903,6 +2907,13 @@ static void handle_stripe6(struct stripe
+@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)

raid5-fix-unending-write-sequence.patch is in -mm and I believe it is
waiting on an Acked-by from Neil?

 thanks,

 greg k-h

Thanks,
Dan
raid5: fix clearing of biofill operations

From: Dan Williams [EMAIL PROTECTED]

ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Tested-by: Joel Bertrand [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..3808f52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -377,7 +377,12 @@ static unsigned long get_stripe_work(struct stripe_head *sh)
 		ack++;
 
 	sh->ops.count -= ack;
-	BUG_ON(sh->ops.count < 0);
+	if (unlikely(sh->ops.count < 0)) {
+		printk(KERN_ERR "pending: %#lx ops.pending: %#lx ops.ack: %#lx "
+			"ops.complete: %#lx\n", pending, sh->ops.pending,
+			sh->ops.ack, sh->ops.complete);
+		BUG();
+	}
 
 	return pending;
 }
@@ -551,8 +556,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			}
 		}
 	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 
 	return_io(return_bi);
 
@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
raid5: fix unending write sequence

From: Dan Williams [EMAIL PROTECTED]

<debug output from Joel's system>
handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ffcffcc0 written 0000000000000000
check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fdd4e360 written 0000000000000000
check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ff517e40 written 0000000000000000
check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fd4cae60 written 0000000000000000
locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
for sector 7629696, rmw=0 rcw=0
</debug>

These blocks were prepared to be written out, but were never handled in
ops_run_biodrain(), so they remain locked forever.  The operations flags
are all clear which means handle_stripe() thinks nothing else needs to be
done.

This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it
should not have been.  This patch cleans up cases where the code looks at
sh->ops.pending when it should be looking at the consistent stack-based
snapshot of the operations flags.

Report from Joel:
	Resync done. Patch fix this bug.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Tested-by: Joel Bertrand [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3808f52

[PATCH] raid5: fix unending write sequence

2007-11-08 Thread Dan Williams
From: Dan Williams [EMAIL PROTECTED]

<debug output from Joël's system>
handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ffcffcc0 written 0000000000000000
check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fdd4e360 written 0000000000000000
check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ff517e40 written 0000000000000000
check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fd4cae60 written 0000000000000000
locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
for sector 7629696, rmw=0 rcw=0
</debug>

These blocks were prepared to be written out, but were never handled in
ops_run_biodrain(), so they remain locked forever.  The operations flags
are all clear which means handle_stripe() thinks nothing else needs to be
done.

This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it
should not have been.  This patch cleans up cases where the code looks at
sh->ops.pending when it should be looking at the consistent stack-based
snapshot of the operations flags.

Report from Joël:
Resync done. Patch fix this bug.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Tested-by: Joël Bertrand [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3808f52..e86cacb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -689,7 +689,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		 unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -697,7 +698,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -774,7 +775,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -782,7 +784,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
@@ -809,7 +811,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -897,12 +899,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Dan Williams
On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
 Done. Here is the obtained output:

Much appreciated.
 
 [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
 [ 1260.980606] check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ffcffcc0 written 0000000000000000
 [ 1260.994808] check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fdd4e360 written 0000000000000000
 [ 1261.009325] check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
 [ 1261.244478] check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000
 [ 1261.270821] check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write f800ff517e40 written 0000000000000000
 [ 1261.312320] check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write f800fd4cae60 written 0000000000000000
 [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
 [ 1261.443120] for sector 7629696, rmw=0 rcw=0
[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.


---

 drivers/md/raid5.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		 unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
@@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);

raid5: fix unending write sequence

From: Dan Williams [EMAIL PROTECTED]


---

 drivers/md/raid5.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/4/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Mon, 5 Nov 2007, Neil Brown wrote:

  On Sunday November 4, [EMAIL PROTECTED] wrote:
  # ps auxww | grep D
  USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
  root   273  0.0  0.0  0 0 ?DOct21  14:40 [pdflush]
  root   274  0.0  0.0  0 0 ?DOct21  13:00 [pdflush]
 
  After several days/weeks, this is the second time this has happened, while
  doing regular file I/O (decompressing a file), everything on the device
  went into D-state.
 
  At a guess (I haven't looked closely) I'd say it is the bug that was
  meant to be fixed by
 
  commit 4ae3f847e49e3787eca91bced31f8fd328d50496
 
  except that patch applied badly and needed to be fixed with
  the following patch (not in git yet).
  These have been sent to stable@ and should be in the queue for 2.6.23.2
 

 Ah, thanks Neil, will be updating as soon as it is released, thanks.


Are you seeing the same "md thread takes 100% of the CPU" behavior
that Joël is reporting?


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/5/07, Justin Piszcz [EMAIL PROTECTED] wrote:
[..]
  Are you seeing the same md thread takes 100% of the CPU that Joël is
  reporting?
 

 Yes, in another e-mail I posted the top output with md3_raid5 at 100%.


This seems too similar to Joël's situation for them not to be
correlated, and it shows that iscsi is not a necessary component of
the failure.

The attached patch allows the debug statements in MD to be enabled via
sysfs.  Joël, since it is easier for you to reproduce can you capture
the kernel log output after the raid thread goes into the spin?  It
will help if you have CONFIG_PRINTK_TIME=y set in your kernel
configuration.

After the failure run:

echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; \
echo 0 > /sys/block/md_d0/md/debug_print_enable

...to enable the print messages for a few seconds.  Please send the
output in a private message if it proves too big for the mailing list.


raid5-debug-print-enable.patch
Description: Binary data


Re: Bug in processing dependencies by async_tx_submit() ?

2007-11-01 Thread Dan Williams
On 11/1/07, Yuri Tikhonov [EMAIL PROTECTED] wrote:

  Hi Dan,

    Honestly I tried to fix this quickly using an approach similar to the one
  proposed by you, with one addition though (in fact, deletion of BUG_ON(chan ==
  tx->chan) in async_tx_run_dependencies()). And this led to kernel stack
  overflow. This happened because of the recursive calling of async_tx_submit()
  from async_trigger_callback() and vice versa.


I had a feeling the fix could not be that easy...

   So, then I made the interrupt scheduling in async_tx_submit() only for the
  cases when it is really needed: i.e. when dependent operations are to be run
  on different channels.

   The resulted kernel locked-up during processing of the mkfs command on the
  top of the RAID-array. The place where it is spinning is the dma_sync_wait()
  function.

   This is happened because of the specific implementation of
  dma_wait_for_async_tx().

So I take it you are not implementing interrupt based callbacks in your driver?

    The iter we are finally waiting for there corresponds to the last allocated
  but not-yet-submitted descriptor. But if the iter we are waiting for is
  dependent on another descriptor which has cookie > 0, but is not yet
  submitted to the h/w channel because the threshold has not been reached
  by this moment, then we may wait in dma_wait_for_async_tx()
  infinitely. I think that it makes more sense to get the first descriptor
  which was submitted to the channel but probably is not put into the h/w
  chain, i.e. with cookie > 0, and do dma_sync_wait() on this descriptor.

    When I modified dma_wait_for_async_tx() in such a way, the kernel
  lock-up disappeared. But nevertheless the mkfs process hangs up after
  some time. So, it looks like something is still missing in support of the
  chaining dependencies feature...


I am preparing a new patch that replaces ASYNC_TX_DEP_ACK with
ASYNC_TX_CHAIN_ACK.  The plan is to make the entire chain of
dependencies available up until the last transaction is submitted.
This allows the entire dependency chain to be walked at
async_tx_submit time so that we can properly handle these multiple
dependency cases.  I'll send it out when it passes my internal
tests...

--
Dan


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-27 Thread Dan Williams
On 10/27/07, BERTRAND Joël [EMAIL PROTECTED] wrote:
 Dan Williams wrote:
  Can you collect some oprofile data, as Ming suggested, so we can maybe
  see what md_d0_raid5 and istd1 are fighting about?  Hopefully it is as
  painless to run on sparc as it is on IA:
 
  opcontrol --start --vmlinux=/path/to/vmlinux
  wait
  opcontrol --stop
  opreport --image-path=/lib/modules/`uname -r` -l

 Done.


[..]


 Is it enough ?

I would expect md_d0_raid5 and istd1 to show up pretty high in the
list if they are constantly pegged at a 100% CPU utilization like you
showed in the failure case.  Maybe this was captured after the target
has disconnected?


Re: MD driver document

2007-10-24 Thread Dan Williams
On 10/24/07, tirumalareddy marri [EMAIL PROTECTED] wrote:

  Hi,
    I am looking for the best way of understanding the MD
 driver (including raid5/6) architecture. I am
 developing a driver for one of the PPC based SOCs. I have
 done some code reading and tried to use a HW debugger to
 walk through the code. But it was not much help.

   If you have any pointers or documents, I will
 greatly appreciate it if you can share them.


I started out with include/linux/raid/raid5.h.  Also, running it with
the debug print statements turned on will get you familiar with the
code flow.
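
One way to turn those on (a sketch: raid5.c's trace statements use
pr_debug(), which only compiles to printk(KERN_DEBUG ...) when DEBUG is
defined for the file):

  # add to drivers/md/Makefile before rebuilding:
  CFLAGS_raid5.o := -DDEBUG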

Lastly, I wrote the following paper which is already becoming outdated:
http://downloads.sourceforge.net/xscaleiop/ols_paper_2006.pdf

 Thanks and regards,
  Marri


--
Dan


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-24 Thread Dan Williams
On 10/24/07, BERTRAND Joël [EMAIL PROTECTED] wrote:
 Hello,

  Any news about this trouble? Any ideas? I'm trying to fix it, but I
  don't see any specific interaction between raid5 and istd. Has anyone
  tried to reproduce this bug on another arch than sparc64? I only use
  sparc32 and 64 servers and I cannot test on other archs. Of course, I
  have a laptop, but I cannot create a raid5 array on its internal HD to
  test this configuration ;-)


Can you collect some oprofile data, as Ming suggested, so we can maybe
see what md_d0_raid5 and istd1 are fighting about?  Hopefully it is as
painless to run on sparc as it is on IA:

opcontrol --start --vmlinux=/path/to/vmlinux
wait
opcontrol --stop
opreport --image-path=/lib/modules/`uname -r` -l

--
Dan


Re: async_tx: get best channel

2007-10-23 Thread Dan Williams
On Fri, 2007-10-19 at 05:23 -0700, Yuri Tikhonov wrote:
 
  Hello Dan,

Hi Yuri, sorry it has taken me so long to get back to you...
 
  I have a suggestion regarding the async_tx_find_channel() procedure.
 
   First, a little introduction. Some processors (e.g. ppc440spe) have several
  DMA engines (say DMA1 and DMA2) which are capable of performing the same type
  of operation, say XOR. The DMA2 engine may process the XOR operation faster
  than the DMA1 engine, but DMA2 (which is faster) has some restrictions for
  the source operand addresses, whereas there are no such restrictions for
  DMA1 (which is slower). So the question is, how may ASYNC_TX select the DMA
  engine which will be the most effective for the given tx operation?

   In the example just described this means: if the faster engine, DMA2, may
  process the tx operation with the given source operand addresses, then we
  select DMA2; if the given source operand addresses cannot be processed with
  DMA2, then we select the slower engine, DMA1.

   I see the following way for introducing such functionality.

   We may introduce an additional method in struct dma_device (let's call it
  device_estimate()) which would take the following as the arguments:
  --- the list of sources to be processed during the given tx,
  --- the type of operation (XOR, COPY, ...),
  --- perhaps something else,
  and then estimate the effectiveness of processing this tx on the given
  channel.
   The async_tx_find_channel() function should call the device_estimate()
  method for each registered dma channel and then select the most effective
  one.
   The architecture specific ADMA driver will be responsible for returning the
  greatest value from the device_estimate() method for the channel which will
  be the most effective for this given tx.

   What are your thoughts regarding this? Do you see any other effective ways
  for enhancing ASYNC_TX with such functionality?

The problem with moving this test to async_tx_find_channel() is that it
imposes extra overhead in the fast path.  It would be best if we could
keep all these decisions in the slow path, or at least hide it from
architectures that do not need to implement it.  The thing that makes
this tricky is the fact that the speed is based on the source address...

One question: what are the source address restrictions, are they around
high-memory?  My thought is MD usually only operates on GFP_KERNEL
memory but sometimes sees high-memory when copying data into and out of
the cache.  You might be able to achieve your use case by disabling
(hiding) the XOR capability on the channels used for copying.  This will
cause async_tx to switch the operation from the high memory capable copy
channel to the fast low memory XOR channel.

Another way to approach this would be to implement architecture specific
definitions of dma_channel_add_remove() and async_tx_rebalance().  This
will bypass the default allocation scheme and allow you to assign the
fastest channel to an operation, but it still does not allow for dynamic
selection based on source/destination address...
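
For concreteness, a hypothetical sketch of the hook being proposed (none
of this exists in the mainline dmaengine API; the name and signature are
invented for illustration):

	struct dma_device {
		/* ... existing fields and operations ... */

		/* return a score for running an operation of 'type' on
		 * this device with the given sources; higher is better,
		 * <= 0 means the addresses cannot be handled at all */
		int (*device_estimate)(struct dma_chan *chan,
				       enum dma_transaction_type type,
				       dma_addr_t *src, int src_cnt);
	};

async_tx_find_channel() would call it for every registered channel and
pick the highest score -- which is exactly the fast-path overhead
concern raised above.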

 
  Regards, Yuri
 
Regards,
Dan


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-19 Thread Dan Williams
On Fri, 2007-10-19 at 14:04 -0700, BERTRAND Joël wrote:
 
 	Sorry for this last mail. I have found another mistake, but I don't
 know if this bug comes from iscsi-target or raid5 itself. iSCSI target
 is disconnected because istd1 and md_d0_raid5 kernel threads use 100%
 of CPU each!
 
 Tasks: 235 total,   6 running, 227 sleeping,   0 stopped,   2 zombie
 Cpu(s):  0.1%us, 12.5%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:   4139032k total,   218424k used,  3920608k free,    10136k buffers
 Swap:  7815536k total,        0k used,  7815536k free,    64808k cached
 
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  5824 root      15  -5     0    0    0 R  100  0.0  10:34.25 istd1
  5599 root      15  -5     0    0    0 R  100  0.0   7:25.43 md_d0_raid5

What is the output of:
cat /proc/5824/wchan
cat /proc/5599/wchan

Thanks,
Dan


Re: [BUG] Raid5 trouble

2007-10-19 Thread Dan Williams
On Fri, 2007-10-19 at 01:04 -0700, BERTRAND Joël wrote:
 	I never see any oops with this patch. But I cannot create a RAID1
 array with a local RAID5 volume and a foreign RAID5 array exported by
 iSCSI. iSCSI seems to work fine, but RAID1 creation randomly aborts due
 to an unknown SCSI task on the target side.

For now I am going to forward this patch to Neil for inclusion in
-stable and 2.6.24-rc.  I will add a Tested-by: Joël Bertrand
[EMAIL PROTECTED] unless you have an objection.

 	I have stressed the iSCSI target with some simultaneous I/O without
 any trouble (nullio, fileio and blockio), thus I suspect another bug in
 the raid code (or an arch specific bug). The last two days, I have made
 some tests to isolate and reproduce this bug:
 1/ iSCSI target and initiator seem to work when I export a raid5 array
 with iSCSI;
 2/ raid1 and raid5 seem to work with local disks;
 3/ iSCSI target is disconnected only when I create a raid1 volume over
 iSCSI (blockio _and_ fileio) with the following message:
 
 Oct 18 10:43:52 poulenc kernel: iscsi_trgt: cmnd_abort(1156) 29 1 0 42
 57344 0 0
 Oct 18 10:43:52 poulenc kernel: iscsi_trgt: Abort Task (01) issued on
 tid:1 lun:0 by sid:630024457682948 (Unknown Task)
 
 	I ran some dd's for 12 hours (read and write in nullio) between the
 initiator and target without any disconnection. Thus the iSCSI code
 seems to be robust. Both initiator and target are alone on a single
 gigabit ethernet link (without any switch). I'm investigating...

Can you reproduce on 2.6.22?

Also, I do not think this is the cause of your failure, but you have
CONFIG_DMA_ENGINE=y in your config.  Setting this to 'n' will compile
out the unneeded checks for offload engines in async_memcpy and
async_xor.
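
Concretely, the .config line to aim for (normally set via 'make
menuconfig' rather than edited by hand) is:

  # CONFIG_DMA_ENGINE is not set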
 
 Regards,
 JKB

Regards,
Dan


Re: [BUG] Raid5 trouble

2007-10-17 Thread Dan Williams
On 10/17/07, Dan Williams [EMAIL PROTECTED] wrote:
 On 10/17/07, BERTRAND Joël [EMAIL PROTECTED] wrote:
  BERTRAND Joël wrote:
   Hello,
  
   I run 2.6.23 linux kernel on two T1000 (sparc64) servers. Each
   server has a partitionable raid5 array (/dev/md/d0) and I have to
   synchronize both raid5 volumes by raid1. Thus, I have tried to build a
   raid1 volume between /dev/md/d0p1 and /dev/sdi1 (exported by iscsi from
   the second server) and I obtain a BUG :
  
   Root gershwin:[/usr/scripts]  mdadm -C /dev/md7 -l1 -n2 /dev/md/d0p1
   /dev/sdi1
   ...
 
  Hello,
 
  I have fixed iscsi-target, and I have tested it. It works now 
  without
  any trouble. Patches were posted on iscsi-target mailing list. When I
  use iSCSI to access to foreign raid5 volume, it works fine. I can format
  foreign volume, copy large files on it... But when I tried to create a
  new raid1 volume with a local raid5 volume and a foreign raid5 volume, I
  receive my well known Oops. You can find my dmesg after Oops :
 

 Can you send your .config and your bootup dmesg?


I found a problem which may lead to the operations count dropping
below zero.  If ops_complete_biofill() gets preempted in between the
following calls:

raid5.c:554 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
raid5.c:555 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);

...then get_stripe_work() can recount/re-acknowledge STRIPE_OP_BIOFILL
causing the assertion.  In fact, the 'pending' bit should always be
cleared first, but the other cases are protected by
spin_lock(&sh->lock).  Patch attached.
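
For reference, the problematic interleaving looks roughly like this (a
sketch of the assumed failure mode, not an actual trace):

	/*
	 * tasklet: ops_complete_biofill()    raid5d: get_stripe_work()
	 *
	 * clear_bit(BIOFILL, &sh->ops.ack);
	 *           <-- preempted -->
	 *                                    BIOFILL still set in ops.pending
	 *                                    but clear in ops.ack, so it is
	 *                                    acked and counted a second time
	 * clear_bit(BIOFILL, &sh->ops.pending);
	 *
	 * The double count later drives sh->ops.count negative and trips
	 * the BUG_ON().
	 */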

--
Dan


fix-biofill-clear.patch
Description: Binary data


Re: [BUG] Raid5 trouble

2007-10-17 Thread Dan Williams
On Wed, 2007-10-17 at 09:44 -0700, BERTRAND Joël wrote:
 Dan,
 
 I have modified get_stripe_work like this :
 
 static unsigned long get_stripe_work(struct stripe_head *sh)
 {
  unsigned long pending;
  int ack = 0;
  int a,b,c,d,e,f,g;
 
  pending = sh->ops.pending;
 
  test_and_ack_op(STRIPE_OP_BIOFILL, pending);
  a = ack;
  test_and_ack_op(STRIPE_OP_COMPUTE_BLK, pending);
  b = ack;
  test_and_ack_op(STRIPE_OP_PREXOR, pending);
  c = ack;
  test_and_ack_op(STRIPE_OP_BIODRAIN, pending);
  d = ack;
  test_and_ack_op(STRIPE_OP_POSTXOR, pending);
  e = ack;
  test_and_ack_op(STRIPE_OP_CHECK, pending);
  f = ack;
  if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
  ack++;
  g = ack;
 
  sh->ops.count -= ack;
 
  if (sh->ops.count < 0) printk("%d %d %d %d %d %d %d\n",
 a, b, c, d, e, f, g);
  BUG_ON(sh->ops.count < 0);
 
  return pending;
  }
 
 and I obtain on console :
 
   1 1 1 1 1 2
 kernel BUG at drivers/md/raid5.c:390!
\|/  \|/
@'/ .. \`@
/_| \__/ |_\
   \__U_/
 md7_resync(5409): Kernel bad sw trap 5 [#1]
 
 If that can help you...
 
 JKB

This gives more evidence that it is probably mishandling of
STRIPE_OP_BIOFILL.  The attached patch (replacing the previous) moves
the clearing of these bits into handle_stripe5 and adds some debug
information.

--
Dan
raid5: fix clearing of biofill operations (try2)

From: Dan Williams [EMAIL PROTECTED]

ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..3808f52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -377,7 +377,12 @@ static unsigned long get_stripe_work(struct stripe_head *sh)
 		ack++;
 
 	sh->ops.count -= ack;
-	BUG_ON(sh->ops.count < 0);
+	if (unlikely(sh->ops.count < 0)) {
+		printk(KERN_ERR "pending: %#lx ops.pending: %#lx ops.ack: %#lx "
+			"ops.complete: %#lx\n", pending, sh->ops.pending,
+			sh->ops.ack, sh->ops.complete);
+		BUG();
+	}
 
 	return pending;
 }
@@ -551,8 +556,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			}
 		}
 	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 
 	return_io(return_bi);
 
@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;


Re: mdadm: /dev/sda1 is too small: 0K

2007-10-13 Thread Dan Williams
On 10/13/07, Hod Greeley [EMAIL PROTECTED] wrote:
 Hello,

 I tried to create a raid device starting with

 foo:~ 1032% mdadm --create -l1 -n2 /dev/md1 /dev/sda1 missing
 mdadm: /dev/sda1 is too small: 0K
 mdadm: create aborted


Quick sanity check, is /dev/sda1 still a block device node with major
number 8 and minor number 1, i.e. does the following fix the issue?:

mknod /dev/sda1 b 8 1

snip

 Thanks,
 Hod Greeley


--
Dan


Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-09 Thread Dan Williams
On Mon, 2007-10-08 at 23:21 -0700, Neil Brown wrote:
 On Saturday October 6, [EMAIL PROTECTED] wrote:
  Neil,
 
  Here is the latest spin of the 'stripe_queue' implementation.
 Thanks to
  raid6+bitmap testing done by Mr. James W. Laferriere there have been
  several cleanups and fixes since the last release.  Also, the
 changes
  are now spread over 4 patches to isolate one conceptual change per
  patch.  The most significant cleanup is removing the stripe_head
 back
  pointer from stripe_queue.  This effectively makes the queuing layer
  independent from the caching layer.
 
 Thanks Dan, and sorry that it has taken such a time for me to take a
 serious look at this.

Not a problem, I've actually only recently had some cycles to look at
these patches again myself.

 The results seem impressive.  I'll try to do some testing myself, but
 firstly: some questions.

 1/ Can you explain why this improves the performance more than simply
   doubling the size of the stripe cache?

Before I answer here are some quick numbers to quantify the difference
versus simply doubling the size of the stripe cache:

Test Configuration:
mdadm --create /dev/md0 /dev/sd[bcdefghi] -n 8 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Average rate taken for 2.6.23-rc9 (1), 2.6.23-rc9 with stripe_cache_size
= 512 (2), 2.6.23-rc9+stripe_queue (3), 2.6.23-rc9+stripe_queue with
stripe_cache_size = 512 (4).

(1): 181MB/s
(2): 252MB/s (+41%)
(3): 330MB/s (+82%)
(4): 352MB/s (+94%)

   The core of what it is doing seems to be to give priority to writing
   full stripes.  We already do that by delaying incomplete stripes.
   Maybe we just need to tune that mechanism a bit?  Maybe release
   fewer partial stripes at a time?

   It seems that the whole point of the stripe_queue structure is to
   allow requests to gather before they are processed so the more
   deserving can be processed first, but I cannot see why you need a
   data structure separate from the list_head.

   You could argue that simply doubling the size of the stripe cache
   would be a waste of memory as we only want to use half of it to
   handle active requests - the other half is for requests being built
   up.
   In that case, I don't see a problem with having a pool of pages
   which is smaller that would be needed for the full stripe cache, and
   allocating them to stripe_heads as they become free.

I believe the additional performance is coming from the fact that
delayed stripes are no longer consuming cache space while they wait for
their delay condition to clear, *and* that full stripe writes are
explicitly detected and moved to the front of the line.  This
effectively makes delayed stripes wait longer in some cases which is the
overall goal.

 2/ I thought I understood from your descriptions that
raid456_cache_arbiter would normally be waiting for a free stripe,
that during this time full stripes could get promoted to io_hi, and
so when raid456_cache_arbiter finally got a free stripe, it would
attach it to the most deserving stripe_queue.  However it doesn't
quite do that.  It chooses the deserving stripe_queue *before*
waiting for a free stripe_head.  This seems slightly less than
optimal?

I see, get the stripe first and then go look at io_hi versus io_lo.
Yes, that would prevent some unnecessary io_lo requests from sneaking
into the cache.
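
In other words the arbiter would be restructured along these lines (a
rough pseudo-C sketch; the helper names are invented for illustration):

	for (;;) {
		/* block until the cache can actually take new work */
		sh = wait_for_inactive_stripe(conf);

		/* only now pick a queue, so a stripe that was promoted
		 * to io_hi while we slept wins over io_lo work
		 */
		sq = pick_most_deserving_queue(conf);
		attach_queue(sh, sq);
		handle_stripe(sh);
	}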
 
 3/ Why create a new workqueue for raid456_cache_arbiter rather than
use raid5d.  It should be possible to do a non-blocking wait for a
free stripe_head, in which cache the find a stripe head and attach
the most deserving stripe_queue would fit well into raid5d.

It seemed necessary to have at least one thread doing a blocking wait on
the stripe cache... but moving this all under raid5d seems possible.
And, it might fix the deadlock condition that Jim is able to create in
his testing with bitmaps.  I have sent him a patch, off-list, to move
all bitmap handling to the stripe_queue which seems to improve
bitmap-write performance, but he still sees cases where raid5d() and
raid456_cache_arbiter() are staring blankly at each other while bonnie++
patiently waits in D state.  A kick
to /sys/block/md3/md/stripe_cache_size gets things going again.

 4/ Why do you use an rbtree rather than a hash table to index the
   'stripe_queue' objects?  I seem to recall a discussion about this
   where it was useful to find adjacent requests or something like
   that, but I cannot see that in the current code.
   But maybe rbtrees are a better fit, in which case, should we use
   them for stripe_heads as well???

If you are referring to the following:
http://marc.info/?l=linux-kernelm=117740314031101w=2
...then no, I am not caching the leftmost or rightmost entry to speed
lookups.

I initially did not know how many queues would need to be in play versus
stripe_heads to yield a performance advantage so I picked the rbtree
mainly because it was being 

Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-07 Thread Dan Williams
On 10/6/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Sat, 6 Oct 2007, Dan Williams wrote:

  Neil,
 
  Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
  raid6+bitmap testing done by Mr. James W. Laferriere there have been
  several cleanups and fixes since the last release.  Also, the changes
  are now spread over 4 patches to isolate one conceptual change per
  patch.  The most significant cleanup is removing the stripe_head back
  pointer from stripe_queue.  This effectively makes the queuing layer
  independent from the caching layer.
 
  Expansion support needs more testing.
 
  See the individual patch changelogs for details.  Patch 1 contains
  updated performance numbers.
 
  Andrew,
 
  These are updated in the git-md-accel tree, but I will work the
  finalized versions through Neil's 'Signed-off-by' path.
 
  Dan Williams (4):
   raid5: add the stripe_queue object for tracking raid io requests (rev3)
   raid5: split allocation of stripe_heads and stripe_queues
   raid5: convert add_stripe_bio to add_queue_bio
   raid5: use stripe_queues to prioritize the most deserving requests 
  (rev7)
 
  drivers/md/raid5.c | 1560 
  
  include/linux/raid/raid5.h |   88 ++-
  2 files changed, 1200 insertions(+), 448 deletions(-)
 
  --
  Dan

 These patches & data look very impressive. Do we have an ETA for when
 they will be merged into mainline?

 Justin.

The short answer is when they are ready.  Jim reported that he is
seeing bonnie++ get stuck in D state on his platform, so the debug is
ongoing.  Additional testing is always welcome...

--
Dan


[PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-06 Thread Dan Williams
Neil,

Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
raid6+bitmap testing done by Mr. James W. Laferriere there have been
several cleanups and fixes since the last release.  Also, the changes
are now spread over 4 patches to isolate one conceptual change per
patch.  The most significant cleanup is removing the stripe_head back
pointer from stripe_queue.  This effectively makes the queuing layer
independent from the caching layer.

Expansion support needs more testing.

See the individual patch changelogs for details.  Patch 1 contains
updated performance numbers.

Andrew,

These are updated in the git-md-accel tree, but I will work the
finalized versions through Neil's 'Signed-off-by' path.

Dan Williams (4):
  raid5: add the stripe_queue object for tracking raid io requests (rev3)
  raid5: split allocation of stripe_heads and stripe_queues
  raid5: convert add_stripe_bio to add_queue_bio
  raid5: use stripe_queues to prioritize the most deserving requests 
(rev7)

 drivers/md/raid5.c | 1560 
 include/linux/raid/raid5.h |   88 ++-
 2 files changed, 1200 insertions(+), 448 deletions(-)

--
Dan


[PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3)

2007-10-06 Thread Dan Williams
The raid5 stripe cache object, struct stripe_head, serves two purposes:
1/ front-end: queuing incoming requests
2/ back-end: transitioning requests through the cache state machine
   to the backing devices
The problem with this model is that queuing decisions are directly tied to
cache availability.  There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given time.

This patch separates the object members needed for queuing from the object
members used for caching.  The stripe_queue object takes over the incoming
bio lists, the io completion bio lists, and the parameters needed for
expansion.

The following fields are moved from struct stripe_head to struct
stripe_queue:
raid5_private_data *raid_conf
int pd_idx
spinlock_t lock
int disks

The following fields are moved from struct r5dev to struct r5_queue_dev:
sector_t sector
struct bio *toread, *read, *towrite, *written

This (first) commit just moves fields around; subsequent commits take
advantage of the split for performance gains.
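
Schematically, the split looks like this (a sketch assembled from the
field list above; the real structures carry additional members):

	/* per-device queuing state, split out of struct r5dev */
	struct r5_queue_dev {
		sector_t sector;
		struct bio *toread, *read, *towrite, *written;
	};

	/* front-end request tracking, split out of struct stripe_head */
	struct stripe_queue {
		raid5_private_data *raid_conf;
		int pd_idx;
		spinlock_t lock;
		int disks;
		struct r5_queue_dev dev[1];	/* sized per array width */
	};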

--- Performance Data ---
Platform: SMP x4 IA, sata_vsc, 7200RPM SATA Drives x4
Test1: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid
=pre-patch=
Sequential Writes:
File   Blk      Num  Avg
Size   Size     Thr  Rate (MiB/s)
------ -------- ---- ------------
2048   131072   1    72.02
2048   131072   8    41.51

=post-patch=
Sequential Writes:
File   Blk      Num  Avg
Size   Size     Thr  Rate (MiB/s)
------ -------- ---- ------------
2048   131072   1    140.86 (+96%)
2048   131072   8    50.18 (+21%)

Test2: blktrace of: dd if=/dev/zero of=/dev/md0 bs=1024k count=1024
=pre-patch=
Total (sdd):
 Reads Queued:      1,383,     5,532KiB  Writes Queued:    80,186,  320,744KiB
 Reads Completed:     276,     4,888KiB  Writes Completed: 12,677,  294,324KiB
 IO unplugs:             0     Timer unplugs:        0

=post-patch=
Total (sdd):
 Reads Queued:         61,       244KiB  Writes Queued:    66,330,  265,320KiB
 Reads Completed:       4,       112KiB  Writes Completed:  3,562,  285,912KiB
 IO unplugs:            16     Timer unplugs:       17

Platform: SMP x4 IA, mptsas, 15000RPM SAS Drives x4
Test: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid
=pre-patch=
Sequential Writes:
File   Blk      Num  Avg
Size   Size     Thr  Rate (MiB/s)
------ -------- ---- ------------
2048   131072   1    132.51
2048   131072   8    86.92

=post-patch=
Sequential Writes:
File   Blk      Num  Avg
Size   Size     Thr  Rate (MiB/s)
------ -------- ---- ------------
2048   131072   1    172.26 (+30%)
2048   131072   8    114.82 (+32%)

Changes in rev2:
* leave the flags with the buffers, prevents a data corruption issue
  whereby stale buffer state flags are attached to newly initialized
  buffers

Changes in rev3:
* move bm_seq back into the stripe_head, since the bitmap sequencing
  matters at write-out time (after cache attach).  Thanks to Mr. James W.
  Laferriere for his bug reports and testing of bitmap support.
* move 'int disks' into stripe_queue since expansion details are recorded
  at make_request() time (i.e. pre stripe_head availability)
* move dev->read and dev->written to dev_q->read and dev_q->written.  This
  allows the sq->sh back references to be killed and removes the need to
  handle sh details in add_queue_bio

Tested-by: Mr. James W. Laferriere [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  564 +++-
 include/linux/raid/raid5.h |   28 +-
 2 files changed, 364 insertions(+), 228 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..a13de7d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
-	sh->disks = disks;
+	sh->sq->disks = disks;
 
-	for (i = sh->disks; i--; ) {
+	for (i = disks; i--;) {
struct r5dev

[PATCH -mm 2/4] raid5: split allocation of stripe_heads and stripe_queues

2007-10-06 Thread Dan Williams
Provide separate routines for allocating stripe_head and stripe_queue
objects and introduce 'io_weight' bitmaps to struct stripe_queue.

The io_weight bitmaps add an efficient way to determine what is pending in
a stripe_queue using 'hweight' in comparison to a 'for' loop.
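
For example, with these bitmaps a full-stripe write can be detected with
two bitmap weights instead of a walk over every r5dev (a sketch of the
check that patch 4 in this series performs when routing queues):

	int disks = sq->disks;
	int data_disks = disks - conf->max_degraded;

	/* every data block is about to be overwritten: no prereads
	 * are needed, so this queue deserves io_hi priority
	 */
	if (io_weight(sq->to_write, disks) &&
	    io_weight(sq->overwrite, disks) == data_disks)
		list_add_tail(&sq->list_node, &conf->io_hi_q_list);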

Tested-by: Mr. James W. Laferriere [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  316 
 include/linux/raid/raid5.h |   11 +-
 2 files changed, 239 insertions(+), 88 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a13de7d..7bc206c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -65,6 +65,7 @@
 #define IO_THRESHOLD		1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
+#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -78,6 +79,8 @@
  * of the current stripe+device
  */
 #define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
+#define r5_io_weight_size(devs) (sizeof(unsigned long) * \
+				 (ALIGN(devs, BITS_PER_LONG) / BITS_PER_LONG))
 /*
  * The following can be used to debug the driver
  */
@@ -120,6 +123,21 @@ static void return_io(struct bio *return_bi)
}
 }
 
+#if BITS_PER_LONG == 32
+#define hweight hweight32
+#else
+#define hweight hweight64
+#endif
+static unsigned long io_weight(unsigned long *bitmap, int disks)
+{
+   unsigned long weight = hweight(*bitmap);
+
+	for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++)
+   weight += hweight(*bitmap);
+
+   return weight;
+}
+
 static void print_raid5_conf (raid5_conf_t *conf);
 
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
@@ -236,36 +254,37 @@ static int grow_buffers(struct stripe_head *sh, int num)
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
+static void init_queue(struct stripe_queue *sq, sector_t sector,
+		int disks, int pd_idx);
+
+static void
+init_stripe(struct stripe_head *sh, struct stripe_queue *sq,
+	    sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i;
 
+	pr_debug("init_stripe called, stripe %llu\n",
+		(unsigned long long)sector);
+
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+	init_queue(sh->sq, sector, disks, pd_idx);
 
 	CHECK_DEVLOCK();
-	pr_debug("init_stripe called, stripe %llu\n",
-		(unsigned long long)sh->sector);
 
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
-	sh->sq->disks = disks;
-
 	for (i = disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
-		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev_q->toread || dev_q->read || dev_q->towrite ||
-		    dev_q->written || test_bit(R5_LOCKED, &dev->flags)) {
-			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev_q->toread,
-			       dev_q->read, dev_q->towrite, dev_q->written,
+		if (test_bit(R5_LOCKED, &dev->flags)) {
+			printk(KERN_ERR "sector=%llx i=%d %d\n",
+			       (unsigned long long)sector, i,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -283,7 +302,7 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	CHECK_DEVLOCK();
 	pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-		if (sh->sector == sector && sh->sq->disks == disks)
+		if (sh->sector == sector && disks == disks)
 			return sh;
 	pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
@@ -326,7 +345,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 			);
 			conf->inactive_blocked = 0;
 		} else
-			init_stripe(sh, sector, pd_idx, disks);
+			init_stripe(sh, sh->sq, sector, pd_idx, disks);
 	} else {
 		if (atomic_read(&sh->count)) {
 			BUG_ON(!list_empty(&sh->lru));
@@ -348,6

[PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio

2007-10-06 Thread Dan Williams
The stripe_queue object collects i/o requests before they are handled by
the stripe-cache (via the stripe_head object).  add_stripe_bio currently
looks at the state of the stripe-cache to implement bitmap support,
reimplement this using stripe_queue attributes.

Introduce the STRIPE_QUEUE_FIRSTWRITE flag to track when a stripe is first
written.  When a stripe_head is available record the bitmap batch sequence
number and set STRIPE_BIT_DELAY.  For now a stripe_head will always be
available at 'add_queue_bio' time, going forward the 'sh' field of the
stripe_queue will indicate whether a stripe_head is attached.

Tested-by: Mr. James W. Laferriere [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   53 
 include/linux/raid/raid5.h |6 +
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7bc206c..d566fc9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,8 +31,10 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
- * the number of the batch it will be in. This is bm_flush+1.
+ * (in add_queue_bio) we update the in-memory bitmap and record in the
+ * stripe_queue that a bitmap write was started.  Then, in handle_stripe when
+ * we have a stripe_head available, we update sh->bm_seq to record the
+ * sequence number (target batch number) of this request.  This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
  * When an unplug happens, we increment bm_flush, thus closing the current
@@ -360,8 +362,14 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 		}
 	} while (sh == NULL);
 
-	if (sh)
+	if (sh) {
 		atomic_inc(&sh->count);
+		if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE,
+				       &sh->sq->state)) {
+			sh->bm_seq = conf->seq_flush+1;
+			set_bit(STRIPE_BIT_DELAY, &sh->state);
+		}
+	}
 
 	spin_unlock_irq(&conf->device_lock);
 	return sh;
@@ -1991,26 +1999,34 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
+			 int forwrite)
 {
 	struct bio **bip;
-	struct stripe_queue *sq = sh->sq;
 	raid5_conf_t *conf = sq->raid_conf;
 	int firstwrite = 0;
 
-	pr_debug("adding bh b#%llu to stripe s#%llu\n",
+	pr_debug("adding bio (%llu) to queue (%llu)\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector);
-
+		(unsigned long long)sq->sector);
 
 	spin_lock(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	if (forwrite) {
 		bip = &sq->dev[dd_idx].towrite;
-		if (*bip == NULL && sq->dev[dd_idx].written == NULL)
+		set_bit(dd_idx, sq->to_write);
+		if (*bip == NULL && sq->dev[dd_idx].written == NULL) {
+			/* flag the queue to be assigned a bitmap
+			 * sequence number
+			 */
+			set_bit(STRIPE_QUEUE_FIRSTWRITE, &sq->state);
 			firstwrite = 1;
-	} else
+		}
+	} else {
 		bip = &sq->dev[dd_idx].toread;
+		set_bit(dd_idx, sq->to_read);
+	}
+
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2024,19 +2040,17 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 	bi->bi_next = *bip;
 	*bip = bi;
 	bi->bi_phys_segments++;
+
 	spin_unlock_irq(&conf->device_lock);
 	spin_unlock(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+		(unsigned long long)sq->sector, dd_idx);
 
-	if (conf->mddev->bitmap && firstwrite) {
-		bitmap_startwrite(conf->mddev->bitmap, sh->sector,
+	if (conf->mddev->bitmap && firstwrite)
+		bitmap_startwrite(conf->mddev->bitmap, sq->sector,
 				  STRIPE_SECTORS, 0);
-		sh->bm_seq = conf->seq_flush+1;
-		set_bit(STRIPE_BIT_DELAY, &sh->state);
-	}
 
 	if (forwrite

[PATCH -mm 4/4] raid5: use stripe_queues to prioritize the most deserving requests (rev7)

2007-10-06 Thread Dan Williams
 on its stripe_head
  after handle_queue, otherwise we deadlock on drive removal.  To make this
  more obvious handle_queue no longer implicitly releases the stripe_queue.
* kill wait_for_attach
* Fix up stripe_queue documentation

Changes in rev7
* split out the 'add_queue_bio' and object allocation changes into separate
  patches
* fix release_stripe/release_queue ordering
* refactor handle_queue and release_queue to remove STRIPE_QUEUE_HANDLE and
  sq-sh back references
* kill init_sh and allocate init_sq on the stack

Tested-by: Mr. James W. Laferriere [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  843 +---
 include/linux/raid/raid5.h |   45 ++
 2 files changed, 666 insertions(+), 222 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d566fc9..eb7fd10 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -67,7 +67,7 @@
 #define IO_THRESHOLD		1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
-#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
+#define STRIPE_QUEUE_SIZE 2 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -142,16 +142,66 @@ static unsigned long io_weight(unsigned long *bitmap, int disks)
 
 static void print_raid5_conf (raid5_conf_t *conf);
 
+/* __release_queue - route the stripe_queue based on pending i/o's.  The
+ * queue object is allowed to bounce around between 4 lists up until
+ * it is attached to a stripe_head.  The lists in order of priority are:
+ * 1/ overwrite: all data blocks are set to be overwritten, no prereads
+ * 2/ unaligned_read: read requests that get past chunk_aligned_read
+ * 3/ subwidth_write: write requests that require prereading
+ * 4/ delayed_q: write requests pending activation
+ */
+static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
+{
+	if (atomic_dec_and_test(&sq->count)) {
+		int disks = sq->disks;
+		int data_disks = disks - conf->max_degraded;
+		int to_write = io_weight(sq->to_write, disks);
+
+		BUG_ON(!list_empty(&sq->list_node));
+		BUG_ON(atomic_read(&conf->active_queues) == 0);
+
+		if (to_write &&
+		    io_weight(sq->overwrite, disks) == data_disks) {
+			list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (io_weight(sq->to_read, disks)) {
+			list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (to_write &&
+			   test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state)) {
+			list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (to_write) {
+			list_add_tail(&sq->list_node, &conf->delayed_q_list);
+			blk_plug_device(conf->mddev->queue);
+		} else {
+			atomic_dec(&conf->active_queues);
+			if (test_and_clear_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+					       &sq->state)) {
+				atomic_dec(&conf->preread_active_queues);
+				if (atomic_read(&conf->preread_active_queues) <
+				    IO_THRESHOLD)
+					queue_work(conf->workqueue,
+						   &conf->stripe_queue_work);
+			}
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
+				list_add_tail(&sq->list_node,
+					      &conf->inactive_q_list);
+				wake_up(&conf->wait_for_queue);
+			}
+		}
+	}
+}
+
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	struct stripe_queue *sq = sh->sq;
+
 	if (atomic_dec_and_test(&sh->count)) {
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes) == 0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->delayed_list);
-				blk_plug_device(conf->mddev->queue);
-			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list

[GIT PULL] async-tx/md-accel fixes and documentation for 2.6.23

2007-09-24 Thread Dan Williams
Linus, please pull from:

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

to receive:

Dan Williams (3):
  async_tx: usage documentation and developer notes (v2)
  async_tx: fix dma_wait_for_async_tx
  raid5: fix 2 bugs in ops_complete_biofill

The raid5 change has been reviewed with Neil, and the documentation
received some fixups from Randy Dunlap and Shannon Nelson.

Documentation/crypto/async-tx-api.txt |  219 +
crypto/async_tx/async_tx.c|   12 ++-
drivers/md/raid5.c|   17 +--
3 files changed, 236 insertions(+), 12 deletions(-)

---

diff --git a/Documentation/crypto/async-tx-api.txt 
b/Documentation/crypto/async-tx-api.txt
new file mode 100644
index 000..c1e9545
--- /dev/null
+++ b/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,219 @@
+Asynchronous Transfers/Transforms API
+
+1 INTRODUCTION
+
+2 GENEALOGY
+
+3 USAGE
+3.1 General format of the API
+3.2 Supported operations
+3.3 Descriptor management
+3.4 When does the operation execute?
+3.5 When does the operation complete?
+3.6 Constraints
+3.7 Example
+
+4 DRIVER DEVELOPER NOTES
+4.1 Conformance points
+4.2 My application needs finer control of hardware channels
+
+5 SOURCE
+
+---
+
+1 INTRODUCTION
+
+The async_tx API provides methods for describing a chain of asynchronous
+bulk memory transfers/transforms with support for inter-transactional
+dependencies.  It is implemented as a dmaengine client that smooths over
+the details of different hardware offload engine implementations.  Code
+that is written to the API can optimize for asynchronous operation and
+the API will fit the chain of operations to the available offload
+resources.
+
+2 GENEALOGY
+
+The API was initially designed to offload the memory copy and
+xor-parity-calculations of the md-raid5 driver using the offload engines
+present in the Intel(R) Xscale series of I/O processors.  It also built
+on the 'dmaengine' layer developed for offloading memory copies in the
+network stack using Intel(R) I/OAT engines.  The following design
+features surfaced as a result:
+1/ implicit synchronous path: users of the API do not need to know if
+   the platform they are running on has offload capabilities.  The
+   operation will be offloaded when an engine is available and carried out
+   in software otherwise.
+2/ cross channel dependency chains: the API allows a chain of dependent
+   operations to be submitted, like xor-copy-xor in the raid5 case.  The
+   API automatically handles cases where the transition from one operation
+   to another implies a hardware channel switch.
+3/ dmaengine extensions to support multiple clients and operation types
+   beyond 'memcpy'
+
+3 USAGE
+
+3.1 General format of the API:
+struct dma_async_tx_descriptor *
+async_operation(op specific parameters,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *dependency,
+ dma_async_tx_callback callback_routine,
+ void *callback_parameter);
+
+3.2 Supported operations:
+memcpy   - memory copy between a source and a destination buffer
+memset   - fill a destination buffer with a byte value
+xor  - xor a series of source buffers and write the result to a
+  destination buffer
+xor_zero_sum - xor a series of source buffers and set a flag if the
+  result is zero.  The implementation attempts to prevent
+  writes to memory
+
+3.3 Descriptor management:
+The return value is non-NULL and points to a 'descriptor' when the operation
+has been queued to execute asynchronously.  Descriptors are recycled
+resources, under control of the offload engine driver, to be reused as
+operations complete.  When an application needs to submit a chain of
+operations it must guarantee that the descriptor is not automatically recycled
+before the dependency is submitted.  This requires that all descriptors be
+acknowledged by the application before the offload engine driver is allowed to
+recycle (or free) the descriptor.  A descriptor can be acked by one of the
+following methods:
+1/ setting the ASYNC_TX_ACK flag if no child operations are to be submitted
+2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
+   descriptor of a new operation.
+3/ calling async_tx_ack() on the descriptor.
+
+3.4 When does the operation execute?
+Operations do not immediately issue after return from the
+async_operation call.  Offload engine drivers batch operations to
+improve performance by reducing the number of mmio cycles needed to
+manage the channel.  Once a driver-specific threshold is met the driver
+automatically issues pending operations.  An application can force this
+event by calling async_tx_issue_pending_all().  This operates on all
+channels since the application has no knowledge of channel to operation
+mapping.
+
+3.5 When does the operation complete?
+There are two methods
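
As a rough illustration of the descriptor-management rules above, a
dependent xor->copy chain might look like the following sketch (built on
the general format in section 3.1; the buffer variables, flag choice and
callback names are illustrative only):

	struct dma_async_tx_descriptor *tx;

	/* xor the sources into a zeroed destination; do not ack yet,
	 * a dependent operation is about to be submitted
	 */
	tx = async_xor(xor_dest, srcs, 0, src_cnt, len,
		       ASYNC_TX_XOR_ZERO_DST, NULL, NULL, NULL);

	/* copy the xor result elsewhere; ack the parent (DEP_ACK) and
	 * this final descriptor (ACK) so both can be recycled
	 */
	tx = async_memcpy(copy_dest, xor_dest, 0, 0, len,
			  ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, tx,
			  complete_cb, cb_arg);

	/* flush any batched operations out to the engines */
	async_tx_issue_pending_all();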

[PATCH 2.6.23-rc7 0/3] async_tx and md-accel fixes for 2.6.23

2007-09-20 Thread Dan Williams
Fix a couple bugs and provide documentation for the async_tx api.

Neil, please 'ack' patch #3.

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

Dan Williams (3):
  async_tx: usage documentation and developer notes
  async_tx: fix dma_wait_for_async_tx
  raid5: fix ops_complete_biofill

Documentation/crypto/async-tx-api.txt |  217 +
crypto/async_tx/async_tx.c|   12 ++-
drivers/md/raid5.c|   90 +++---
3 files changed, 273 insertions(+), 46 deletions(-)

--
Dan


[PATCH 2.6.23-rc7 1/3] async_tx: usage documentation and developer notes

2007-09-20 Thread Dan Williams
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 Documentation/crypto/async-tx-api.txt |  217 +
 1 files changed, 217 insertions(+), 0 deletions(-)

diff --git a/Documentation/crypto/async-tx-api.txt 
b/Documentation/crypto/async-tx-api.txt
new file mode 100644
index 000..48d685a
--- /dev/null
+++ b/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,217 @@
+Asynchronous Transfers/Transforms API
+
+1 INTRODUCTION
+
+2 GENEALOGY
+
+3 USAGE
+3.1 General format of the API
+3.2 Supported operations
+3.2 Descriptor management
+3.3 When does the operation execute?
+3.4 When does the operation complete?
+3.5 Constraints
+3.6 Example
+
+4 DRIVER DEVELOPER NOTES
+4.1 Conformance points
+4.2 My application needs finer control of hardware channels
+
+5 SOURCE
+
+---
+
+1 INTRODUCTION
+
+The async_tx api provides methods for describing a chain of asynchronous
+bulk memory transfers/transforms with support for inter-transactional
+dependencies.  It is implemented as a dmaengine client that smooths over
+the details of different hardware offload engine implementations.  Code
+that is written to the api can optimize for asynchronous operation and
+the api will fit the chain of operations to the available offload
+resources.
+
+2 GENEALOGY
+
+The api was initially designed to offload the memory copy and
+xor-parity-calculations of the md-raid5 driver using the offload engines
+present in the Intel(R) Xscale series of I/O processors.  It also built
+on the 'dmaengine' layer developed for offloading memory copies in the
+network stack using Intel(R) I/OAT engines.  The following design
+features surfaced as a result:
+1/ implicit synchronous path: users of the API do not need to know if
+   the platform they are running on has offload capabilities.  The
+   operation will be offloaded when an engine is available and carried out
+   in software otherwise.
+2/ cross channel dependency chains: the API allows a chain of dependent
+   operations to be submitted, like xor-copy-xor in the raid5 case.  The
+   API automatically handles cases where the transition from one operation
+   to another implies a hardware channel switch.
+3/ dmaengine extensions to support multiple clients and operation types
+   beyond 'memcpy'
+
+3 USAGE
+
+3.1 General format of the API:
+struct dma_async_tx_descriptor *
+async_operation(op specific parameters,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *dependency,
+ dma_async_tx_callback callback_routine,
+ void *callback_parameter);
+
+3.2 Supported operations:
+memcpy   - memory copy between a source and a destination buffer
+memset   - fill a destination buffer with a byte value
+xor - xor a series of source buffers and write the result to a
+  destination buffer
+xor_zero_sum - xor a series of source buffers and set a flag if the
+  result is zero.  The implementation attempts to prevent
+  writes to memory
+
+3.2 Descriptor management:
+The return value is non-NULL and points to a 'descriptor' when the operation
+has been queued to execute asynchronously.  Descriptors are recycled
+resources, under control of the offload engine driver, to be reused as
+operations complete.  When an application needs to submit a chain of
+operations it must guarantee that the descriptor is not automatically recycled
+before the dependency is submitted.  This requires that all descriptors be
+acknowledged by the application before the offload engine driver is allowed to
+recycle (or free) the descriptor.  A descriptor can be acked by:
+1/ setting the ASYNC_TX_ACK flag if no operations are to be submitted
+2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
+   descriptor of a new operation.
+3/ calling async_tx_ack() on the descriptor.
+
+3.3 When does the operation execute?:
+Operations do not immediately issue after return from the
+async_operation call.  Offload engine drivers batch operations to
+improve performance by reducing the number of mmio cycles needed to
+manage the channel.  Once a driver specific threshold is met the driver
+automatically issues pending operations.  An application can force this
+event by calling async_tx_issue_pending_all().  This operates on all
+channels since the application has no knowledge of channel to operation
+mapping.
+
+3.4 When does the operation complete?:
+There are two methods for an application to learn about the completion
+of an operation.
+1/ Call dma_wait_for_async_tx().  This call causes the cpu to spin while
+   it polls for the completion of the operation.  It handles dependency
+   chains and issuing pending operations.
+2/ Specify a completion callback.  The callback routine runs in tasklet
+   context if the offload engine driver supports interrupts, or it is
+   called in application context if the operation is carried out
+   synchronously in software

[PATCH 2.6.23-rc7 2/3] async_tx: fix dma_wait_for_async_tx

2007-09-20 Thread Dan Williams
Fix dma_wait_for_async_tx to not loop forever in the case where a
dependency chain is longer than two entries.  This condition will not
happen with current in-kernel drivers, but fix it for future drivers.

Found-by: Saeed Bishara [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 crypto/async_tx/async_tx.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 0350071..bc18cbb 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -80,6 +80,7 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 {
enum dma_status status;
struct dma_async_tx_descriptor *iter;
+   struct dma_async_tx_descriptor *parent;
 
if (!tx)
return DMA_SUCCESS;
@@ -87,8 +88,15 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 	/* poll through the dependency chain, return when tx is complete */
 	do {
 		iter = tx;
-		while (iter->cookie == -EBUSY)
-			iter = iter->parent;
+
+		/* find the root of the unsubmitted dependency chain */
+		while (iter->cookie == -EBUSY) {
+			parent = iter->parent;
+			if (parent && parent->cookie == -EBUSY)
+				iter = iter->parent;
+			else
+				break;
+		}
 
 		status = dma_sync_wait(iter->chan, iter->cookie);
 	} while (status == DMA_IN_PROGRESS || (iter != tx));


[PATCH 2.6.23-rc7 3/3] raid5: fix ops_complete_biofill

2007-09-20 Thread Dan Williams
ops_complete_biofill tried to avoid calling handle_stripe since all the
state necessary to return read completions is available.  However the
process of determining whether more read requests are pending requires
locking the stripe (to block add_stripe_bio from updating dev->toread).
ops_complete_biofill can run in tasklet context, so rather than upgrading
all the stripe locks from spin_lock to spin_lock_bh this patch just moves
read completion handling back into handle_stripe.

Found-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   90 +++-
 1 files changed, 46 insertions(+), 44 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4d63773..38c8893 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -512,54 +512,12 @@ async_copy_data(int frombio, struct bio *bio, struct page *page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, more_to_read = 0;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
-	/* clear completed biofills */
-	for (i = sh->disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
-		/* check if this stripe has new incoming reads */
-		if (dev->toread)
-			more_to_read++;
-
-		/* acknowledge completion of a biofill operation */
-		/* and check if we need to reply to a read request
-		*/
-		if (test_bit(R5_Wantfill, &dev->flags) && !dev->toread) {
-			struct bio *rbi, *rbi2;
-			clear_bit(R5_Wantfill, &dev->flags);
-
-			/* The access to dev->read is outside of the
-			 * spin_lock_irq(&conf->device_lock), but is protected
-			 * by the STRIPE_OP_BIOFILL pending bit
-			 */
-			BUG_ON(!dev->read);
-			rbi = dev->read;
-			dev->read = NULL;
-			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
-		}
-	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-
-	return_io(return_bi);
-
-	if (more_to_read)
-		set_bit(STRIPE_HANDLE, &sh->state);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
 }
 
@@ -2112,6 +2070,42 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 }
 
 
+/* handle_completed_read_requests - return completion for reads and allow
+ * new read operations to be submitted to the stripe.
+ */
+static void handle_completed_read_requests(raid5_conf_t *conf,
+			struct stripe_head *sh,
+			struct bio **return_bi)
+{
+	int i;
+
+	pr_debug("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	/* check if we need to reply to a read request */
+	for (i = sh->disks; i--; ) {
+		struct r5dev *dev = &sh->dev[i];
+
+		if (test_and_clear_bit(R5_Wantfill, &dev->flags)) {
+			struct bio *rbi, *rbi2;
+
+			rbi = dev->read;
+			dev->read = NULL;
+			while (rbi && rbi->bi_sector <
+				dev->sector + STRIPE_SECTORS) {
+				rbi2 = r5_next_bio(rbi, dev->sector);
+				spin_lock_irq(&conf->device_lock);
+				if (--rbi->bi_phys_segments == 0) {
+					rbi->bi_next = *return_bi;
+					*return_bi = rbi;
+				}
+				spin_unlock_irq(&conf->device_lock);
+				rbi = rbi2;
+			}
+		}
+	}
+}
+
 /* handle_completed_write_requests
  * any written block on an uptodate or failed drive can be returned.
  * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
@@ -2633,6 +2627,14 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded

Re: md raid acceleration and the async_tx api

2007-09-13 Thread Dan Williams
On 9/13/07, Yuri Tikhonov [EMAIL PROTECTED] wrote:

  Hi Dan,

 On Friday 07 September 2007 20:02, you wrote:
  You need to fetch from the 'md-for-linus' tree.  But I have attached
  them as well.
 
  git fetch git://lost.foo-projects.org/~dwillia2/git/iop
  md-for-linus:md-for-linus

  Thanks.

  Unrelated question. Comparing the drivers/md/raid5.c file in Linus's
 2.6.23-rc6 tree and in your md-for-linus one, I found the following
 difference in the expand-related part of the handle_stripe5() function:
 
 -   s.locked += handle_write_operations5(sh, 1, 1);
 +   s.locked += handle_write_operations5(sh, 0, 1);
 
  That is, in your case we are passing rcw=0, whereas in Linus's case
 handle_write_operations5() is called with rcw=1. Which code is correct?

There was a recent bug discovered in my changes to the expansion code.
 The fix has now gone into Linus's tree through Andrew's tree.  I kept
the fix out of my 'md-for-linus' tree to prevent it getting dropped
from -mm due to automatic git-tree merge-detection.  I have now
rebased my git tree so everything is in sync.

However, after talking with Neil at LCE we came to the conclusion that
it would be best if I just sent patches since git tree updates tend to
not get enough review, and because the patch sets will be more
manageable now that the big pieces of the acceleration infrastructure
have been merged.

  Regards, Yuri

Thanks,
Dan


Re: md raid acceleration and the async_tx api

2007-08-30 Thread Dan Williams
On 8/30/07, Yuri Tikhonov [EMAIL PROTECTED] wrote:

  Hi Dan,

 On Monday 27 August 2007 23:12, you wrote:
  This still looks racy...  I think the complete fix is to make the
  R5_Wantfill and dev_q-toread accesses atomic.  Please test the
  following patch (also attached) and let me know if it fixes what you are
  seeing:

  Your approach doesn't help; the Bonnie++ utility hangs during the
 ReWriting stage.

Looking at it again I see that what I added would not affect the
failure you are seeing.  However I noticed that you are using a broken
version of the stripe-queue and cache_arbiter patches.  In the current
revisions the dev_q-flags field has been moved back to dev-flags
which fixes a data corruption issue and could potentially address the
hang you are seeing.  The latest revisions are:
raid5: add the stripe_queue object for tracking raid io requests (rev2)
raid5: use stripe_queues to prioritize the most deserving requests (rev6)

  Note that before applying your patch I rolled my fix in the
 ops_complete_biofill() function back. Do I understand it right that your
 patch should be used *instead* of my one rather than *with* it ?

You understood correctly.  The attached patch integrates your change
to keep R5_Wantfill set while also protecting the 'more_to_read' case.
 Please try it on top of the latest stripe-queue changes [1] (instead
of the other proposed patches) .

  Regards, Yuri

Thanks,
Dan

[1] git fetch -f git://lost.foo-projects.org/~dwillia2/git/iop
md-for-linus:refs/heads/md-for-linus
raid5: fix the 'more_to_read' case in ops_complete_biofill

From: Dan Williams [EMAIL PROTECTED]

Prevent ops_complete_biofill from running concurrently with add_queue_bio

---

 drivers/md/raid5.c |   33 +++--
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2f9022d..1c591d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -828,22 +828,19 @@ static void ops_complete_biofill(void *stripe_head_ref)
 		struct r5dev *dev = &sh->dev[i];
 		struct r5_queue_dev *dev_q = &sq->dev[i];
 
-		/* check if this stripe has new incoming reads */
+		/* 1/ acknowledge completion of a biofill operation
+		 * 2/ check if we need to reply to a read request.
+		 * 3/ check if we need to reschedule handle_stripe
+		 */
 		if (dev_q->toread)
 			more_to_read++;
 
-		/* acknowledge completion of a biofill operation */
-		/* and check if we need to reply to a read request
-		*/
-		if (test_bit(R5_Wantfill, &dev->flags) && !dev_q->toread) {
+		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi, *rbi2;
-			clear_bit(R5_Wantfill, &dev->flags);
 
-			/* The access to dev->read is outside of the
-			 * spin_lock_irq(&conf->device_lock), but is protected
-			 * by the STRIPE_OP_BIOFILL pending bit
-			 */
-			BUG_ON(!dev->read);
+			if (!dev_q->toread)
+				clear_bit(R5_Wantfill, &dev->flags);
+
 			rbi = dev->read;
 			dev->read = NULL;
 			while (rbi && rbi->bi_sector <
@@ -899,8 +896,15 @@ static void ops_run_biofill(struct stripe_head *sh)
 	}
 
 	atomic_inc(&sh->count);
+
+	/* spin_lock prevents ops_complete_biofill from running concurrently
+	 * with add_queue_bio in the synchronous case
+	 */
+	spin_lock(&sq->lock);
 	async_trigger_callback(ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, tx,
 		ops_complete_biofill, sh);
+	spin_unlock(&sq->lock);
+
 }
 
 static void ops_complete_compute5(void *stripe_head_ref)
@@ -2279,7 +2283,8 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		(unsigned long long)bi->bi_sector,
 		(unsigned long long)sq->sector);
 
-	spin_lock(&sq->lock);
+	/* prevent asynchronous completions from running */
+	spin_lock_bh(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	sh = sq->sh;
 	if (forwrite) {
@@ -2306,7 +2311,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 	*bip = bi;
 	bi->bi_phys_segments++;
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sq->lock);
+	spin_unlock_bh(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
@@ -2339,7 +2344,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sq->lock);
+	spin_unlock_bh(&sq->lock);
 	return 0;
 }
 


Re: [md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-08-30 Thread Dan Williams
On 8/30/07, saeed bishara [EMAIL PROTECTED] wrote:
 you are right, I've another question regarding the function
 dma_wait_for_async_tx from async_tx.c, here is the body of the code:
 /* poll through the dependency chain, return when tx is complete */
 1.do {
 2. iter = tx;
 3. while (iter->cookie == -EBUSY)
 4. iter = iter->parent;
 5.
 6. status = dma_sync_wait(iter->chan, iter->cookie);
 7. } while (status == DMA_IN_PROGRESS || (iter != tx));

 assume that:
 - The interrupt capability is not provided.
 - Request A was sent to chan 0
 - Request B that depends on A is sent to chan 1
 - Request C that depends on B is send to chan 2.
 - Also, assume that when C is handled by async_tx_submit(), B is still
 not queued to the dmaengine (cookie equals to -EBUSY).

 In this case, dma_wait_for_async_tx will be called for C, now, it
 looks for me that the do while will loop forever, even when A gets
 completed. this is because the iter will point to B after line 4, thus
 the iter != tx (C) will always become true.

You are right.  There are no drivers in the tree that can hit this,
but it needs to be fixed up.

I'll submit the following change:

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 0350071..bc18cbb 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -80,6 +80,7 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 {
enum dma_status status;
struct dma_async_tx_descriptor *iter;
+   struct dma_async_tx_descriptor *parent;

if (!tx)
return DMA_SUCCESS;
@@ -87,8 +88,15 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 	/* poll through the dependency chain, return when tx is complete */
 	do {
 		iter = tx;
-		while (iter->cookie == -EBUSY)
-			iter = iter->parent;
+
+		/* find the root of the unsubmitted dependency chain */
+		while (iter->cookie == -EBUSY) {
+			parent = iter->parent;
+			if (parent && parent->cookie == -EBUSY)
+				iter = iter->parent;
+			else
+				break;
+		}

 saeed

Regards,
Dan


Re: raid5:md3: kernel BUG , followed by , Silent halt .

2007-08-27 Thread Dan Williams
On 8/25/07, Mr. James W. Laferriere [EMAIL PROTECTED] wrote:
 Hello Dan ,

 On Mon, 20 Aug 2007, Dan Williams wrote:
  On 8/18/07, Mr. James W. Laferriere [EMAIL PROTECTED] wrote:
  Hello All ,  Here we go again .  Again attempting to do bonnie++ 
  testing
  on a small array .
  Kernel 2.6.22.1
  Patches involved ,
  IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
  (stripe-queue) ,  Dan Williams [EMAIL PROTECTED]
 
  Hello James,
 
  Thanks for the report.
 
  I tried to reproduce this on my system, no luck.
 Possibly because there are significant hardware differences?
 See 'lspci -v' below .sig .

  However it looks
  like their is a potential race between 'handle_queue' and
  'add_queue_bio'.  The attached patch moves these critical sections
  under spin_lock(sq-lock), and adds some debugging output if this BUG
  triggers.  It also includes a fix for retry_aligned_read which is
  unrelated to this debug.
  --
  Dan
 Applied your patch .  The same 'kernel BUG at 
 drivers/md/raid5.c:3689!'
 messages appear (see attached) .  The system is still responsive with your
 patch ,  the kernel crashed last time .  Tho the bonnie++ run is stuck in 'D' 
 .
 And doing a ' /md3/asdf'  stays hung even after passing the parent process a
 'kill -9' .
 Any further info You can think of I can/should ,  I will try to 
 acquire
 .  But I'll have to repeat these steps to attempt to get the same results .
 I'll be shutting the system down after sending this off .
 Fyi ,  the previous 'BUG without your patch was quite repeatable .
 I might have time over the next couple of weeks to be able to see if 
 it
 is as repatable as the last one .

 Contents of /proc/mdstat for md3 .

 md3 : active raid6 sdx1[3] sdw1[2] sdv1[1] sdu1[0] sdt1[7](S) sds1[6] sdr1[5] 
 sdq1[4]
    717378560 blocks level 6, 1024k chunk, algorithm 2 [7/7] [UUUUUUU]
bitmap: 2/137 pages [8KB], 512KB chunk

 Commands I ran that lead to the 'BUG' .

 bonniemd3() { /root/bonnie++-1.03a/bonnie++  -u0:0  -d /md3  -s 131072  -f; }
 bonniemd3 > 131072MB-bonnie++-run-md3-xfs.log-20070825 2>&1 &

Ok, the 'bitmap' and 'raid6' details were the missing pieces of my
testing.  I can now reproduce this bug in handle_queue.  I'll keep you
posted on what I find.

Thank you for tracking this.

Regards,
Dan


Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays

2007-08-26 Thread Dan Williams
On 8/26/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Sun, 26 Aug 2007, Abe Skolnik wrote:

  Dear Mr./Dr./Prof. Brown et al,
 
  I recently had the unpleasant experience of creating an MD array for
  the purpose of booting off it and then not being able to do so.  Since
  I had already made changes to the array's contents relative to that
  which I cloned it from, I did not want to reformat the array and
  re-clone it just to bring it down to the old 0.90 metadata format so
  that I would be able to boot off it, so I searched for a solution, and
  I found it.
 
  First I tried the patch (written by Neil Brown) which can be seen at...
   http://www.issociate.de/board/post/277868/
 
  That patch did not work as-is, but with some more hacking, I got it
  working.  I then cleaned up my work and added relevant comments.
 
  I know that Mr./Dr./Prof. Brown is against in-kernel boot-time MD
  assembly and prefers init[rd/ramfs], but I prefer in-kernel assembly,
  and I think several other people do too.  Since this patch does not
  (AFAIK) disable the init[rd/ramfs] way of bringing up MDs in boot-time,
  I hope that this patch will be accepted and submitted up-stream for
  future inclusion in the mainline kernel.org kernel distribution.
  This way kernel users can choose their MD assembly strategy at will
  without having to restrict themselves to the old metadata format.
 
  I hope that this message finds all those who read it doing well and
  feeling fine.
 
  Sincerely,
 
  Abe Skolnik
 
  P.S.  Mr./Dr./Prof. Brown, in case you read this:  thanks!
   And if you want your name removed from the code, just say so.
 

  but I prefer in-kernel assembly,
  and I think several other people do too.
 I concur with this statement, why go through the hassle of init[rd/ramfs]
 if we can just have it done in the kernel?


Because you can rely on the configuration file to be certain about
which disks to pull in and which to ignore.  Without the config file
the auto-detect routine may not always do the right thing because it
will need to make assumptions.

So I turn the question around, why go through the exercise of trying
to improve an auto-detect routine which can never be perfect when the
explicit configuration can be specified by a config file?
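
For illustration, the explicit configuration amounts to a couple of lines
(a hypothetical /etc/mdadm.conf; the device names and UUID below are made
up):

    DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1
    ARRAY /dev/md0 level=raid5 num-devices=3
          UUID=3aaa0122:29827cfa:5331ad66:ca767371

With that in place the assembly tool pulls in exactly the members listed,
independent of metadata version.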

I believe the real issue is the need to improve the distributions'
initramfs build-scripts and relieve the hassle of handling MD details.

 Justin.

Regards,
Dan


Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays: reasoning and future plans

2007-08-26 Thread Dan Williams
On 8/26/07, Abe Skolnik [EMAIL PROTECTED] wrote:
 Dear Mr./Dr. Williams,

Just Dan is fine :-)

  Because you can rely on the configuration file to be certain about
  which disks to pull in and which to ignore.  Without the config file
  the auto-detect routine may not always do the right thing because it
  will need to make assumptions.

 But kernel parameters can provide the same data, no?  After all, it is
 not the file nature of the config file that we are after,
 but rather the configuration data itself.  My now-working setup uses a
 line in my grub.conf (AKA menu.lst) file in my boot partition that
 says something like...
   root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1.
 This works just fine, and will not go bad unless a drive fails or I
 rearrange the SCSI bus.  Even in a case like that, the worst that will
 happen (AFAIK) is that my root-RAID will not come up and I will have to
 boot the PC from e.g. Knoppix in order to fix the problem
 (that, and maybe also replace a broken drive or re-plug a loose power
 connector or whatever).  Another MD should not be corrupted, since they
 are (supposedly) protected from that by supposedly-unique array UUIDs.

Yes, you can get a similar effect of the config file by adding
parameters to the kernel command line.  My only point is that if the
initramfs update tools were as simple as:
mkinitrd root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1
...then using an initramfs becomes the same amount of work as editing
/etc/grub.conf.

  So I turn the question around, why go through the exercise of trying
  to improve an auto-detect routine which can never be perfect when the
  explicit configuration can be specified by a config file?

 I will turn your turning of my question back around at you; I hope it's
 not rude to do so.  Why make root-RAID (based on MD, not hardware RAID)
 require an initrd/initramfs, especially since (a) that's another system
 component to manage, (b) that's another thing that can go wrong,
 (c) many systems (the new one I just finished building included) do not
 need an initrd/initramfs for any other reason, so why trigger the
 need just out of laziness of maintaining some boot code?  Especially
 since a patch has already been written and tested as working. ;-)

Understood.  It comes down to a question of how much mdadm
functionality should be duplicated in the kernel?  With an initramfs
you get the full functionality and only one codebase to maintain
(mdadm).

[snip]

 Sincerely,

 Abe


Regards,
Dan


Re: raid5:md3: kernel BUG , followed by , Silent halt .

2007-08-20 Thread Dan Williams
On 8/18/07, Mr. James W. Laferriere [EMAIL PROTECTED] wrote:
 Hello All ,  Here we go again .  Again attempting to do bonnie++ 
 testing
 on a small array .
 Kernel 2.6.22.1
 Patches involved ,
 IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
 (stripe-queue) ,  Dan Williams [EMAIL PROTECTED]

Hello James,

Thanks for the report.

I tried to reproduce this on my system, no luck.  However it looks
like there is a potential race between 'handle_queue' and
'add_queue_bio'.  The attached patch moves these critical sections
under spin_lock(&sq->lock), and adds some debugging output if this BUG
triggers.  It also includes a fix for retry_aligned_read which is
unrelated to this debug.

--
Dan
---
raid5-fix-sq-locking.patch
---
raid5: address potential sq-to_write race

From: Dan Williams [EMAIL PROTECTED]

synchronize reads and writes to the sq->to_write bit

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 02e313b..688b8d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2289,10 +2289,14 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 	sh = sq->sh;
 	if (forwrite) {
 		bip = &sq->dev[dd_idx].towrite;
+		set_bit(dd_idx, sq->to_write);
 		if (*bip == NULL && (!sh || (sh && !sh->dev[dd_idx].written)))
 			firstwrite = 1;
-	} else
+	} else {
 		bip = &sq->dev[dd_idx].toread;
+		set_bit(dd_idx, sq->to_read);
+	}
+
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2324,7 +2328,6 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		/* check if page is covered */
 		sector_t sector = sq->dev[dd_idx].sector;
 
-		set_bit(dd_idx, sq->to_write);
 		for (bi = sq->dev[dd_idx].towrite;
 		     sector < sq->dev[dd_idx].sector + STRIPE_SECTORS &&
			 bi && bi->bi_sector <= sector;
@@ -2334,8 +2337,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		}
 		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(dd_idx, sq->overwrite);
-	} else
-		set_bit(dd_idx, sq->to_read);
+	}
 
 	return 1;
 
@@ -3656,6 +3658,7 @@ static void handle_queue(struct stripe_queue *sq, int disks, int data_disks)
 	struct stripe_head *sh = NULL;
 
 	/* continue to process i/o while the stripe is cached */
+	spin_lock(&sq->lock);
 	if (test_bit(STRIPE_QUEUE_HANDLE, &sq->state)) {
 		if (io_weight(sq->overwrite, disks) == data_disks) {
 			set_bit(STRIPE_QUEUE_IO_HI, &sq->state);
@@ -3678,6 +3681,7 @@ static void handle_queue(struct stripe_queue *sq, int disks, int data_disks)
 		 */
 		BUG_ON(!(sq->sh && sq->sh == sh));
 	}
+	spin_unlock(&sq->lock);
 
 	release_queue(sq);
 	if (sh) {
---
raid5-debug-init_queue-bugs.patch
---
raid5: printk instead of BUG in init_queue

From: Dan Williams [EMAIL PROTECTED]

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   19 +--
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 688b8d3..7164011 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -557,12 +557,19 @@ static void init_queue(struct stripe_queue *sq, sector_t sector,
 		__FUNCTION__, (unsigned long long) sq->sector,
 		(unsigned long long) sector, sq);
 
-	BUG_ON(atomic_read(&sq->count) != 0);
-	BUG_ON(io_weight(sq->to_read, disks));
-	BUG_ON(io_weight(sq->to_write, disks));
-	BUG_ON(io_weight(sq->overwrite, disks));
-	BUG_ON(test_bit(STRIPE_QUEUE_HANDLE, &sq->state));
-	BUG_ON(sq->sh);
+	if ((atomic_read(&sq->count) != 0) || io_weight(sq->to_read, disks) ||
+	    io_weight(sq->to_write, disks) || io_weight(sq->overwrite, disks) ||
+	    test_bit(STRIPE_QUEUE_HANDLE, &sq->state) || sq->sh) {
+		printk(KERN_ERR "%s: sector=%llx count: %d to_read: %lu "
+			"to_write: %lu overwrite: %lu state: %lx "
+			"sq->sh: %p\n", __FUNCTION__,
+			(unsigned long long) sq->sector,
+			atomic_read(&sq->count),
+			io_weight(sq->to_read, disks),
+			io_weight(sq->to_write, disks),
+			io_weight(sq->overwrite, disks),
+			sq->state, sq->sh);
+	}
 
 	sq->state = (1 << STRIPE_QUEUE_HANDLE);
 	sq->sector = sector;
---
raid5-fix-get_active_queue-bug.patch
---
raid5: fix get_active_queue bug in retry_aligned_read

From: Dan Williams [EMAIL PROTECTED]

Check for a potential null return from get_active_queue

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |3 ++-
 1

Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Dan Williams
[trimmed all but linux-raid from the cc]

On 7/30/07, Justin Piszcz [EMAIL PROTECTED] wrote:
 CONFIG:

 Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
 Kernel was 2.6.21 or 2.6.22, did these awhile ago.
Can you give 2.6.22.1-iop1 a try to see what effect it has on
sequential write performance?

Download:
http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Unpack into your 2.6.22.1 source tree.  Install the x86 series file:
"cp patches/series.x86 patches/series".  Apply the series with quilt:
"quilt push -a".

I recommend trying the default chunk size and default
stripe_cache_size as my tests have shown improvement without needing
to perform any tuning.

Thanks,
Dan


[GIT PATCH 0/2] stripe-queue for 2.6.23 consideration

2007-07-22 Thread Dan Williams
Andrew, Neil,

The stripe-queue patches are showing solid performance improvement.

git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

 drivers/md/raid5.c | 1484 
 include/linux/raid/raid5.h |   87 +++-
 2 files changed, 1164 insertions(+), 407 deletions(-)

Dan Williams (2):
  raid5: add the stripe_queue object for tracking raid io requests (take2)
  raid5: use stripe_queues to prioritize the most deserving requests 
(take4)

I initially considered them 2.6.24 material but after fixing the sync+io
data corruption regression, fixing the large 'stripe_cache_size' values
performance regression, and seeing how well it performed on my IA
platform I would like them to be considered for 2.6.23.  That being said
I have not yet tested expand operations or raid6.

Without any tuning a 4 disk (SATA) RAID5 array can reach 190MB/s.  Previously
performance was around 90MB/s.  Blktrace data confirms that fewer reads are
occurring and more writes are being merged.

$ mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5 --assume-clean
$ blktrace /dev/sd[abcd] &
$ for i in `seq 1 3`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done
$ fg ^C
$ blkparse /dev/sda /dev/sdb /dev/sdc /dev/sdd

=pre-patch=
Total (sda):
 Reads Queued:   3,136,   12,544KiB  Writes Queued: 187,068,  748,272KiB
 Read Dispatches:  676,   12,384KiB  Write Dispatches:   30,949,  737,052KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  662,   12,080KiB  Writes Completed:   30,630,  736,964KiB
 Read Merges:2,452,9,808KiB  Write Merges:  155,885,  623,540KiB
 IO unplugs: 1   Timer unplugs:   1

Total (sdb):
 Reads Queued:   1,541,6,164KiB  Writes Queued:  91,224,  364,896KiB
 Read Dispatches:  323,6,184KiB  Write Dispatches:   14,603,  335,528KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  303,6,124KiB  Writes Completed:   13,650,  328,520KiB
 Read Merges:1,209,4,836KiB  Write Merges:   76,080,  304,320KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdc):
 Reads Queued:   1,372,5,488KiB  Writes Queued:  82,995,  331,980KiB
 Read Dispatches:  297,5,280KiB  Write Dispatches:   13,258,  304,020KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  268,4,948KiB  Writes Completed:   12,320,  298,668KiB
 Read Merges:1,067,4,268KiB  Write Merges:   69,154,  276,616KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdd):
 Reads Queued:   1,383,5,532KiB  Writes Queued:  80,186,  320,744KiB
 Read Dispatches:  307,5,008KiB  Write Dispatches:   13,241,  298,400KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  276,4,888KiB  Writes Completed:   12,677,  294,324KiB
 Read Merges:1,050,4,200KiB  Write Merges:   66,772,  267,088KiB
 IO unplugs: 0   Timer unplugs:   0


=post-patch=
Total (sda):
 Reads Queued: 117,  468KiB  Writes Queued:  71,511,  286,044KiB
 Read Dispatches:   17,  308KiB  Write Dispatches:8,412,  699,204KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:6,   96KiB  Writes Completed:3,704,  321,552KiB
 Read Merges:   96,  384KiB  Write Merges:   67,880,  271,520KiB
 IO unplugs:14   Timer unplugs:  15

Total (sdb):
 Reads Queued:  88,  352KiB  Writes Queued:  56,687,  226,748KiB
 Read Dispatches:   11,  288KiB  Write Dispatches:8,142,  686,412KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:8,  184KiB  Writes Completed:2,770,  257,740KiB
 Read Merges:   76,  304KiB  Write Merges:   54,005,  216,020KiB
 IO unplugs:16   Timer unplugs:  17

Total (sdc):
 Reads Queued:  60,  240KiB  Writes Queued:  61,863,  247,452KiB
 Read Dispatches:7,  248KiB  Write Dispatches:8,302,  699,832KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:5,  144KiB  Writes Completed:2,907,  258,900KiB
 Read Merges:   50,  200KiB  Write Merges:   58,926,  235,704KiB
 IO unplugs:20   Timer unplugs:  23

Total (sdd):
 Reads Queued:  61,  244KiB  Writes Queued:  66,330,  265,320KiB
 Read Dispatches:   10,  180KiB  Write Dispatches:9,326,  694,012KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:4,  112KiB  Writes Completed:3,562,  285,912KiB
 Read Merges:   47,  188KiB  Write

[GIT PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)

2007-07-22 Thread Dan Williams
 to newly initialized
  buffers

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  474 +++-
 include/linux/raid/raid5.h |   29 ++-
 2 files changed, 316 insertions(+), 187 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d90ee14..f5ee4a7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
+ * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq
  * the number of the batch it will be in. This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
@@ -132,7 +132,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 			list_add_tail(&sh->lru, &conf->delayed_list);
 			blk_plug_device(conf->mddev->queue);
 		} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-			   sh->bm_seq - conf->seq_write > 0) {
+			   sh->sq->bm_seq - conf->seq_write > 0) {
 			list_add_tail(&sh->lru, &conf->bitmap_list);
 			blk_plug_device(conf->mddev->queue);
 		} else {
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
 	sh->disks = disks;
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev->toread || dev->read || dev->towrite || dev->written ||
-		    test_bit(R5_LOCKED, &dev->flags)) {
+		if (dev_q->toread || dev->read || dev_q->towrite ||
+		    dev->written || test_bit(R5_LOCKED, &dev->flags)) {
 			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->read, dev->towrite, dev->written,
+			       (unsigned long long)sh->sector, i, dev_q->toread,
+			       dev->read, dev_q->towrite, dev->written,
			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -288,6 +289,9 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+				sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
@@ -389,12 +393,13 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, disks = sh->disks;
 
 	might_sleep();
 
-	for (i = disks; i--; ) {
+	for (i = disks; i--;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -513,7 +518,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, more_to_read = 0;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -522,14 +528,16 @@ static void ops_complete_biofill(void *stripe_head_ref)
 	/* clear completed biofills */
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
 		/* check if this stripe has new incoming reads */
-		if (dev->toread)
+		if (dev_q->toread

[RFT] 2.6.22.1-iop1 for improved sequential write performance (stripe-queue)

2007-07-19 Thread Dan Williams

Per Bill Davidsen's request I have made available a 2.6.22.1 based
kernel with the current raid5 performance changes I have been working
on:
1/ Offload engine acceleration (recently merged for the 2.6.23
development cycle)
2/ Stripe-queue, an evolutionary change to the raid5 queuing model (take4)

The offload engine work only helps platforms with offload engines and
should not affect performance otherwise.  The stripe-queue work is an
attempt to increase sequential write performance and should benefit
most platforms.

The patch series is available on the Xscale(r) IOP SourceForge page:

http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Use quilt to apply the series on top of a fresh 2.6.22.1 source tree:
$ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.22.1.tar.bz2
$ tar xjf linux-2.6.22.1.tar.bz2
$ cd linux-2.6.22.1
$ tar xzvf patches-2.6.22.1-iop1.tar.gz
$ cp patches/series.x86 patches/series
$ quilt push -a

Configure and build the kernel as normal, there are no configuration
options for stripe-queue.

Any feedback, bug report, fix, or suggestion is welcome.

Thanks,
Dan


[GIT PULL] ioat fixes, raid5 acceleration, and the async_tx api

2007-07-13 Thread Dan Williams

Linus, please pull from

git://lost.foo-projects.org/~dwillia2/git/iop ioat-md-accel-for-linus

to receive:

1/ I/OAT performance tweaks and simple fixups.  These patches have been
   in -mm for a few kernel releases as git-ioat.patch
2/ RAID5 acceleration and the async_tx api.  These patches have also
   been in -mm for a few kernel releases as git-md-accel.patch.  In
   addition, they have received field testing as a part of the -iop kernel
   released via SourceForge[1] since 2.6.18-rc6.

The raid acceleration work can further be subdivided into three logical
areas:
- API -
The async_tx api provides methods for describing a chain of
asynchronous bulk memory transfers/transforms with support for
inter-transactional dependencies.  It is implemented as a dmaengine
client that smooths over the details of different hardware offload
engine implementations.  Code that is written to the api can optimize
for asynchronous operation and the api will fit the chain of operations
to the available offload resources. 
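
As a concrete illustration, a client chains two operations by passing the
descriptor returned by the first as the dependency of the second (a
minimal sketch against the async_memcpy/async_xor entry points in this
series; the page variables and the callback are placeholders):

	struct dma_async_tx_descriptor *tx;

	/* copy incoming data into the cache page ... */
	tx = async_memcpy(cache_page, bio_page, 0, 0, PAGE_SIZE,
			  ASYNC_TX_ACK, NULL, NULL, NULL);

	/* ... then xor the data blocks into the parity block; passing
	 * 'tx' guarantees the copy finishes first, whether both
	 * operations run on an engine, in software, or one of each
	 */
	tx = async_xor(parity_page, blocks, 0, count, PAGE_SIZE,
		       ASYNC_TX_XOR_DROP_DST | ASYNC_TX_ACK, tx,
		       complete_fn, complete_arg);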

- Implementation -
When the raid acceleration work was proposed, Neil laid out the
following attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api
(async_tx) and the stripe_operations member of a stripe_head to carry
out xor and copy operations asynchronously, outside the lock.

- Driver -
The Intel(R) Xscale IOP series of I/O processors integrate an Xscale
core with raid acceleration engines.  The iop-adma driver supports the
copy and xor capabilities of the 3 IOP architectures iop32x, iop33x, and
iop34x.

All the MD changes have been acked-by Neil Brown.  For the changes made
to net/ I have received David Miller's acked-by.  Shannon Nelson has
tested the I/OAT changes (due to async_tx support) in his environment
and has added his signed-off-by.  Herbert Xu has agreed to let the
async_tx api be housed under crypto/ with the intent to coordinate
efforts as support for transforms like crc32c and raid6-p+q are
developed.

To be clear Shannon Nelson is the I/OAT maintainer, but we agreed that I
should coordinate this release to simplify the merge process.  Going
forward I will be the iop-adma maintainer.  For the common bits,
dmaengine core and the async_tx api, Shannon and I will coordinate as
co-maintainers.

- Credits -
I cannot thank Neil Brown enough for his advice and patience as this
code was developed.

Jeff Garzik is credited with helping the dmaengine core and async_tx
become sane apis.  You are credited with the general premise that users
of an asynchronous offload engine api should not know or care if an
operation is carried out asynchronously or synchronously in software.
Andrew Morton is credited with corralling these conflicting git trees in
-mm and more importantly imparting encouragement at OLS 2006.

Per Andrew's request the md-accel changelogs were fleshed out and the
patch set was posted for a final review a few weeks ago[2].  To my
knowledge there are no pending review items.  This tree is based on
2.6.22.

Thank you,
Dan

[1] http://sourceforge.net/projects/xscaleiop
[2] http://marc.info/?l=linux-raid&w=2&r=1&s=md-accel&q=b

Andrew Morton (1):
  I/OAT: warning fix

Chris Leech (5):
  ioatdma: Push pending transactions to hardware more frequently
  ioatdma: Remove the wrappers around read(bwl)/write(bwl) in ioatdma
  ioatdma: Remove the use of writeq from the ioatdma driver
  I/OAT: Add documentation for the tcp_dma_copybreak sysctl
  I/OAT: Only offload copies for TCP when there will be a context switch

Dan Aloni (1):
  I/OAT: fix I/OAT for kexec

Dan Williams (20):
  dmaengine: refactor dmaengine around dma_async_tx_descriptor
  dmaengine: make clients responsible for managing channels
  xor: make 'xor_blocks' a library routine for use with async_tx
  async_tx: add the async_tx api
  raid5: refactor handle_stripe5 and handle_stripe6 (v3)
  raid5: replace custom debug PRINTKs with standard pr_debug
  md: raid5_run_ops - run stripe operations outside sh->lock
  md: common infrastructure for running operations with raid5_run_ops
  md: handle_stripe5 - add request/completion logic for async write ops
  md: handle_stripe5 - add request/completion logic for async compute ops
  md: handle_stripe5 - add request/completion logic for async check ops
  md: handle_stripe5 - add request/completion logic for async read ops
  md: handle_stripe5 - add request/completion logic for async expand ops
  md: handle_stripe5 - request io processing in raid5_run_ops
  md: remove raid5 compute_block and compute_parity5
  dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
  iop13xx: surface the iop13xx adma units to the iop-adma driver
  iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver
  ARM: Add drivers/dma to arch/arm/Kconfig

[-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3

2007-07-13 Thread Dan Williams
Neil, Andrew,

The following patches replace the stripe-queue patches currently in -mm.
Following your suggestion, Neil, I gathered blktrace data on the number
of reads generated by sequential write stimulus.  It turns out that
reduced pre-reading is not the cause of the performance increase, but
rather increased write merging.  The data, in patch #1, shows a 74%
decrease in the number of dispatched writes.  I can only assume that
this is the explanation for the 65% throughput improvement, because the
occurrence of reads actually increased with these patches applied.

This take also fixes observed data corruption while running i/o to a
synching array (it was wrong to move the flags parameter from r5dev to
r5_queue_dev as things could get out of sync... reverted).  Next step is
to test reshape under this new queuing model.

Regards,
Dan


[-mm PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)

2007-07-13 Thread Dan Williams
 flags are attached to newly initialized buffers

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  475 
 include/linux/raid/raid5.h |   29 ++-
 2 files changed, 317 insertions(+), 187 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0b66afe..b653c2b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
+ * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq
  * the number of the batch it will be in. This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
@@ -132,7 +132,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 			list_add_tail(&sh->lru, &conf->delayed_list);
 			blk_plug_device(conf->mddev->queue);
 		} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-			   sh->bm_seq - conf->seq_write > 0) {
+			   sh->sq->bm_seq - conf->seq_write > 0) {
 			list_add_tail(&sh->lru, &conf->bitmap_list);
 			blk_plug_device(conf->mddev->queue);
 		} else {
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
 	sh->disks = disks;
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev->toread || dev->read || dev->towrite || dev->written ||
-		    test_bit(R5_LOCKED, &dev->flags)) {
+		if (dev_q->toread || dev->read || dev_q->towrite ||
+		    dev->written || test_bit(R5_LOCKED, &dev->flags)) {
 			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->read, dev->towrite, dev->written,
+			       (unsigned long long)sh->sector, i, dev_q->toread,
+			       dev->read, dev_q->towrite, dev->written,
			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -288,6 +289,9 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+				sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
@@ -389,12 +393,13 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, disks = sh->disks;
 
 	might_sleep();
 
-	for (i = disks; i--; ) {
+	for (i = disks; i--;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -513,7 +518,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, more_to_read = 0;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -522,14 +528,16 @@ static void ops_complete_biofill(void *stripe_head_ref)
 	/* clear completed biofills */
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
 		/* check if this stripe has new incoming reads */
-		if (dev->toread)
+		if (dev_q->toread

[RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2

2007-07-03 Thread Dan Williams
The first take of the stripe-queue implementation[1] had a performance
limiting bug in __wait_for_inactive_queue.  Fixing that issue
drastically changed the performance characteristics.  The following data
from tiobench shows the relative performance difference of the
stripe-queue patchset.

Unit information

File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%  = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per cpu load

Configuration
=
Platform: 1200Mhz iop348 with 4-disk sata_vsc array
mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
FileBlk Num Avg Maximum CPU
Identifier  SizeSizeThr Rate(CPU%)  Eff
--- --  -   --- --  --  -
2.6.22-rc7-iop1 204840961   0%  4%  -3%
2.6.22-rc7-iop1 204840962   -38%-33%-8%
2.6.22-rc7-iop1 204840964   -35%-30%-8%
2.6.22-rc7-iop1 204840968   -14%-11%-3%
2.6.22-rc7-iop1 204813107   1   2%  1%  2%
2.6.22-rc7-iop1 204813107   2   -11%-10%-2%
2.6.22-rc7-iop1 204813107   4   -7% -6% -1%
2.6.22-rc7-iop1 204813107   8   -9% -6% -4%

Random  Reads
FileBlk Num Avg Maximum CPU
Identifier  SizeSizeThr Rate(CPU%)  Eff
--- --  -   --- --  --  -
2.6.22-rc7-iop1 204840961   -9% 15% -21%
2.6.22-rc7-iop1 204840962   -1% -30%42%
2.6.22-rc7-iop1 204840964   -14%-22%10%
2.6.22-rc7-iop1 204840968   -21%-28%9%
2.6.22-rc7-iop1 204813107   1   -8% -4% -4%
2.6.22-rc7-iop1 204813107   2   -13%-13%0%
2.6.22-rc7-iop1 204813107   4   -15%-15%0%
2.6.22-rc7-iop1 204813107   8   -13%-13%0%

Sequential Writes
FileBlk Num Avg Maximum CPU
Identifier  SizeSizeThr Rate(CPU%)  Eff
--- --  -   --- --  --  -
2.6.22-rc7-iop1 204840961   25% 11% 12%
2.6.22-rc7-iop1 204840962   41% 42% -1%
2.6.22-rc7-iop1 204840964   40% 18% 19%
2.6.22-rc7-iop1 204840968   15% -5% 21%
2.6.22-rc7-iop1 204813107   1   65% 57% 4%
2.6.22-rc7-iop1 204813107   2   46% 36% 8%
2.6.22-rc7-iop1 204813107   4   24% -7% 34%
2.6.22-rc7-iop1 204813107   8   28% -15%51%

Random  Writes
FileBlk Num Avg Maximum CPU
Identifier  SizeSizeThr Rate(CPU%)  Eff
--- --  -   --- --  --  -
2.6.22-rc7-iop1 204840961   2%  -8% 11%
2.6.22-rc7-iop1 204840962   -1% -19%21%
2.6.22-rc7-iop1 204840964   2%  2%  0%
2.6.22-rc7-iop1 204840968   -1% -28%37%
2.6.22-rc7-iop1 204813107   1   2%  -3% 5%
2.6.22-rc7-iop1 204813107   2   3%  -4% 7%
2.6.22-rc7-iop1 204813107   4   4%  -3% 8%
2.6.22-rc7-iop1 204813107   8   5%  -9% 15%

The write performance numbers are better than I expected and would seem
to address the concerns raised in the thread "Odd (slow) RAID
performance"[2].  The read performance drop was not expected.  However,
the numbers suggest some additional changes to be made to the queuing
model.  Where read performance is dropping there appears to be an equal
drop in CPU utilization, which seems to suggest that pure read requests
be handled immediately without a trip to the stripe-queue workqueue.

Although it is not shown in the above data, another positive aspect is that
increasing the cache size past a certain point causes the write performance
gains to erode.  In other words negative returns in contrast to diminishing
returns.  The stripe-queue can only carry out optimizations while the cache is
busy.  When the cache is large requests can be handled without waiting, and
performance approaches the original 1:1 (queue-to-stripe-head) model.  CPU
speed dictates the maximum effective cache size.  Once the CPU can no longer
keep the stripe-queue saturated performance falls off from the peak.  This is
a positive change because it shows that the new queuing model can produce higher
performance with less resources, but it does require more care when changing
'stripe_cache_size.'  The above numbers were taken with the default cache size
of 256.

Changes since take1:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename 

Re: [md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with async_tx

2007-06-27 Thread Dan Williams

[ trimmed the cc ]

On 6/26/07, Satyam Sharma [EMAIL PROTECTED] wrote:

Hi Dan,

[ Minor thing ... ]


Not a problem, thanks for taking a look...


On 6/27/07, Dan Williams [EMAIL PROTECTED] wrote:
 The async_tx api tries to use a dma engine for an operation, but will fall
 back to an optimized software routine otherwise.  Xor support is
 implemented using the raid5 xor routines.  For organizational purposes this
 routine is moved to a common area.

This isn't quite crypto code and isn't connected to or through the cryptoapi
(at least not in this patchset), so I somehow find it misplaced in the crypto/
directory. If all its users are in drivers/md/ then that would be a
better place.
If it is something kernel-global, lib/ sounds more appropriate?



True, it isn't quite crypto code, but I gravitated to this location because:
1/ the models are similar, both are general purpose apis with a driver component
2/ there are already some algorithms in the crypto layer that are not
strictly cryptographic like crc32c, and other checksums
3/ having the code under that directory is a reminder to consider
closer integration when adding support for more complex algorithms
like raid6 p+q (at what point does a 'dma-offload' engine become a
'crypto' engine?  at some point they converge)

The hope is that other subsystems beyond md could benefit from offload
engines.  For example, the crc32c calculations in btrfs might be a
good candidate, and kcopyd integration has crossed my mind.


Satyam


Dan


[RFC PATCH 0/2] An evolutionary change to the raid456 queuing model

2007-06-27 Thread Dan Williams
Raz's stripe-deadline patch illuminated the fact that the current
queuing model leaves write performance on the table in some cases.  The
following patches introduce a new queuing model which attempts to
recover this performance.

On an ARM based iop13xx platform I see an average 14.7% increase in
throughput as reported by the following simple test:

for i in `seq 1 10`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=512; 
done

This was performed on a default configured 4-disk SATA array (chunksize=64k
stripe_cache_size=256)

However, a test with an ext2 filesystem and tiobench showed negligible
changes.  My suspicion is that the new queuing model, as it currently stands,
can extract some more performance when full stripe writes are present, but in
the tiobench case there is not enough to take advantage of the queue's preread
prevention logic (improving this case is the goal of releasing this version of
the patches).

These patches are on top of the md-accel series.  I will spin a complete
set based on 2.6.22 once it goes final (2.6.22-iop1 to be released on
SourceForge).  Until then the complete series is available via git:

git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

Not to be confused with the acceleration patch set released yesterday which is
targeted at 2.6.23.  These queuing changes need some more time to cook.

-- 
Dan


[md-accel PATCH 01/19] dmaengine: refactor dmaengine around dma_async_tx_descriptor

2007-06-26 Thread Dan Williams
The current dmaengine interface defines multiple routines per operation,
i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc.  Adding
more operation types (xor, crc, etc) to this model would result in an
unmanageable number of method permutations.

"Are we really going to add a set of hooks for each DMA engine
whizbang feature?"
	- Jeff Garzik

The descriptor creation process is refactored using the new common
dma_async_tx_descriptor structure.  Instead of per driver
do_<operation>_<dest>_to_<src> methods, drivers integrate
dma_async_tx_descriptor into their private software descriptor and then
define a 'prep' routine per operation.  The prep routine allocates a
descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit routines
are valid.  Descriptor creation and submission becomes:

struct dma_device *dev;
struct dma_chan *chan;
struct dma_async_tx_descriptor *tx;

tx = dev->device_prep_dma_<operation>(chan, len, int_flag)
tx->tx_set_src(dma_addr_t, tx, index /* for multi-source ops */)
tx->tx_set_dest(dma_addr_t, tx, index)
tx->tx_submit(tx)
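
On the driver side the integration amounts to embedding the generic
descriptor in the driver's private one and filling in the methods from
the prep routine, roughly (a sketch; the foo_* names are hypothetical,
not from any in-tree driver):

	struct foo_desc {
		struct dma_async_tx_descriptor txd; /* generic descriptor */
		void *hw;			    /* engine-specific part */
		struct list_head node;
	};

	static struct dma_async_tx_descriptor *
	foo_prep_dma_memcpy(struct dma_chan *chan, size_t len, int int_en)
	{
		struct foo_desc *d = foo_alloc_desc(chan, len, int_en);

		if (!d)
			return NULL;
		/* hook up the generic methods before handing it out */
		d->txd.tx_set_src = foo_set_src;
		d->txd.tx_set_dest = foo_set_dest;
		d->txd.tx_submit = foo_tx_submit;
		return &d->txd;
	}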

In addition to the refactoring, dma_async_tx_descriptor also lays the
groundwork for defining cross-channel-operation dependencies, and a
callback facility for asynchronous notification of operation completion.

Changelog:
* drop dma mapping methods, suggested by Chris Leech
* fix ioat_dma_dependency_added, also caught by Andrew Morton
* fix dma_sync_wait, change from Andrew Morton
* uninline large functions, change from Andrew Morton
* add tx-callback = NULL to dmaengine calls to interoperate with async_tx
  calls
* hookup ioat_tx_submit
* convert channel capabilities to a 'cpumask_t like' bitmap
* removed DMA_TX_ARRAY_INIT, no longer needed
* checkpatch.pl fixes
* make set_src, set_dest, and tx_submit descriptor specific methods

Cc: Jeff Garzik [EMAIL PROTECTED]
Cc: Chris Leech [EMAIL PROTECTED]
Cc: Shannon Nelson [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/dma/dmaengine.c   |  182 ++
 drivers/dma/ioatdma.c |  277 -
 drivers/dma/ioatdma.h |8 +
 include/linux/dmaengine.h |  230 +++--
 4 files changed, 455 insertions(+), 242 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 322ee29..379809f 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -59,6 +59,7 @@
 
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 #include <linux/device.h>
 #include <linux/dmaengine.h>
 #include <linux/hardirq.h>
@@ -66,6 +67,7 @@
 #include <linux/percpu.h>
 #include <linux/rcupdate.h>
 #include <linux/mutex.h>
+#include <linux/jiffies.h>
 
 static DEFINE_MUTEX(dma_list_mutex);
 static LIST_HEAD(dma_device_list);
@@ -165,6 +167,24 @@ static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
 	return NULL;
 }
 
+enum dma_status dma_sync_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{
+	enum dma_status status;
+	unsigned long dma_sync_wait_timeout = jiffies + msecs_to_jiffies(5000);
+
+	dma_async_issue_pending(chan);
+	do {
+		status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
+		if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
+			printk(KERN_ERR "dma_sync_wait_timeout!\n");
+			return DMA_ERROR;
+		}
+	} while (status == DMA_IN_PROGRESS);
+
+	return status;
+}
+EXPORT_SYMBOL(dma_sync_wait);
+
 /**
  * dma_chan_cleanup - release a DMA channel's resources
  * @kref: kernel reference structure that contains the DMA channel device
@@ -322,6 +342,25 @@ int dma_async_device_register(struct dma_device *device)
 	if (!device)
 		return -ENODEV;
 
+	/* validate device routines */
+	BUG_ON(dma_has_cap(DMA_MEMCPY, device->cap_mask) &&
+		!device->device_prep_dma_memcpy);
+	BUG_ON(dma_has_cap(DMA_XOR, device->cap_mask) &&
+		!device->device_prep_dma_xor);
+	BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) &&
+		!device->device_prep_dma_zero_sum);
+	BUG_ON(dma_has_cap(DMA_MEMSET, device->cap_mask) &&
+		!device->device_prep_dma_memset);
+	BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) &&
+		!device->device_prep_dma_interrupt);
+
+	BUG_ON(!device->device_alloc_chan_resources);
+	BUG_ON(!device->device_free_chan_resources);
+	BUG_ON(!device->device_dependency_added);
+	BUG_ON(!device->device_is_tx_complete);
+	BUG_ON(!device->device_issue_pending);
+	BUG_ON(!device->dev);
+
 	init_completion(&device->done);
 	kref_init(&device->refcount);
 	device->dev_id = id++;
@@ -397,6 +436,149 @@ void dma_async_device_unregister(struct dma_device *device)
 }
 EXPORT_SYMBOL(dma_async_device_unregister);
 
+/**
+ * dma_async_memcpy_buf_to_buf - offloaded copy between

[md-accel PATCH 08/19] md: common infrastructure for running operations with raid5_run_ops

2007-06-26 Thread Dan Williams
All the handle_stripe operations that are to be transitioned to use
raid5_run_ops need a method to coherently gather work under the stripe-lock
and hand that work off to raid5_run_ops.  The 'get_stripe_work' routine
runs under the lock to read all the bits in sh-ops.pending that do not
have the corresponding bit set in sh-ops.ack.  This modified 'pending'
bitmap is then passed to raid5_run_ops for processing.

The transition from 'ack' to 'completion' does not need similar protection
as the existing release_stripe infrastructure will guarantee that
handle_stripe will run again after a completion bit is set, and
handle_stripe can tolerate a sh-ops.completed bit being set while the lock
is held.

A call to async_tx_issue_pending_all() is added to raid5d to kick the
offload engines once all pending stripe operations work has been submitted.
This enables batching of the submission and completion of operations.
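
In code terms the batching point is at the bottom of the raid5d loop
(a sketch of the idea, not the literal hunk):

	/* all queued stripes handled and their operations submitted;
	 * flush the descriptors to the offload engines in one go
	 * rather than per operation
	 */
	async_tx_issue_pending_all();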

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   67 +---
 1 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34fcda0..7c688f6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -124,6 +124,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 			}
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
+			BUG_ON(sh->ops.pending);
 			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 				atomic_dec(&conf->preread_active_stripes);
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
@@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
-	
+	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+
 	CHECK_DEVLOCK();
 	pr_debug("init_stripe called, stripe %llu\n",
 		(unsigned long long)sh->sector);
@@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
-		if (dev->toread || dev->towrite || dev->written ||
+		if (dev->toread || dev->read || dev->towrite || dev->written ||
 		    test_bit(R5_LOCKED, &dev->flags)) {
-			printk("sector=%llx i=%d %p %p %p %d\n",
+			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
 			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->towrite, dev->written,
+			       dev->read, dev->towrite, dev->written,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -325,6 +327,44 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 	return sh;
 }
 
+/* test_and_ack_op() ensures that we only dequeue an operation once */
+#define test_and_ack_op(op, pend) \
+do {						\
+	if (test_bit(op, &sh->ops.pending) &&	\
+	    !test_bit(op, &sh->ops.complete)) {	\
+		if (test_and_set_bit(op, &sh->ops.ack)) \
+			clear_bit(op, &pend);	\
+		else				\
+			ack++;			\
+	} else					\
+		clear_bit(op, &pend);		\
+} while (0)
+
+/* find new work to run, do not resubmit work that is already
+ * in flight
+ */
+static unsigned long get_stripe_work(struct stripe_head *sh)
+{
+	unsigned long pending;
+	int ack = 0;
+
+	pending = sh->ops.pending;
+
+	test_and_ack_op(STRIPE_OP_BIOFILL, pending);
+	test_and_ack_op(STRIPE_OP_COMPUTE_BLK, pending);
+	test_and_ack_op(STRIPE_OP_PREXOR, pending);
+	test_and_ack_op(STRIPE_OP_BIODRAIN, pending);
+	test_and_ack_op(STRIPE_OP_POSTXOR, pending);
+	test_and_ack_op(STRIPE_OP_CHECK, pending);
+	if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
+		ack++;
+
+	sh->ops.count -= ack;
+	BUG_ON(sh->ops.count < 0);
+
+	return pending;
+}
+
 static int
 raid5_end_read_request(struct bio *bi, unsigned int bytes_done, int error);
 static int
@@ -2487,7 +2527,6 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
  *    schedule a write of some buffers
  *    return confirmation of parity correctness
  *
- * Parity calculations are done inside the stripe lock
  * buffers are taken off read_list or write_list, and bh_cache buffers
  * get BH_Lock set before the stripe lock is released

[md-accel PATCH 10/19] md: handle_stripe5 - add request/completion logic for async compute ops

2007-06-26 Thread Dan Williams
handle_stripe will compute a block when a backing disk has failed, or when
it determines it can save a disk read by computing the block from all the
other up-to-date blocks.

Previously a block would be computed under the lock and subsequent logic in
handle_stripe could use the newly up-to-date block.  With the raid5_run_ops
implementation the compute operation is carried out a later time outside
the lock.  To preserve the old functionality we take advantage of the
dependency chain feature of async_tx to flag the block as R5_Wantcompute
and then let other parts of handle_stripe operate on the block as if it
were up-to-date.  raid5_run_ops guarantees that the block will be ready
before it is used in another operation.

However, this only works in cases where the compute and the dependent
operation are scheduled at the same time.  If a previous call to
handle_stripe sets the R5_Wantcompute flag there is no facility to pass the
async_tx dependency chain across successive calls to raid5_run_ops.  The
req_compute variable protects against this case.

Changelog:
* remove the req_compute BUG_ON

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  149 ++--
 include/linux/raid/raid5.h |2 -
 2 files changed, 115 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b2e88fe..38b8167 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2070,36 +2070,101 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 
 }
 
+/* __handle_issuing_new_read_requests5 - returns 0 if there are no more disks
+ * to process
+ */
+static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
+			struct stripe_head_state *s, int disk_idx, int disks)
+{
+	struct r5dev *dev = &sh->dev[disk_idx];
+	struct r5dev *failed_dev = &sh->dev[s->failed_num];
+
+	/* don't schedule compute operations or reads on the parity block while
+	 * a check is in flight
+	 */
+	if ((disk_idx == sh->pd_idx) &&
+	     test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
+		return ~0;
+
+	/* is the data in this block needed, and can we get it? */
+	if (!test_bit(R5_LOCKED, &dev->flags) &&
+	    !test_bit(R5_UPTODATE, &dev->flags) && (dev->toread ||
+	    (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+	     s->syncing || s->expanding || (s->failed &&
+	     (failed_dev->toread || (failed_dev->towrite &&
+	      !test_bit(R5_OVERWRITE, &failed_dev->flags)
+	      ))))) {
+		/* 1/ We would like to get this block, possibly by computing it,
+		 * but we might not be able to.
+		 *
+		 * 2/ Since parity check operations potentially make the parity
+		 * block !uptodate it will need to be refreshed before any
+		 * compute operations on data disks are scheduled.
+		 *
+		 * 3/ We hold off parity block re-reads until check operations
+		 * have quiesced.
+		 */
+		if ((s->uptodate == disks - 1) &&
+		    !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+			set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
+			set_bit(R5_Wantcompute, &dev->flags);
+			sh->ops.target = disk_idx;
+			s->req_compute = 1;
+			sh->ops.count++;
+			/* Careful: from this point on 'uptodate' is in the eye
+			 * of raid5_run_ops which services 'compute' operations
+			 * before writes. R5_Wantcompute flags a block that will
+			 * be R5_UPTODATE by the time it is needed for a
+			 * subsequent operation.
+			 */
+			s->uptodate++;
+			return 0; /* uptodate + compute == disks */
+		} else if ((s->uptodate < disks - 1) &&
+			test_bit(R5_Insync, &dev->flags)) {
+			/* Note: we hold off compute operations while checks are
+			 * in flight, but we still prefer 'compute' over 'read'
+			 * hence we only read if (uptodate < disks-1)
+			 */
+			set_bit(R5_LOCKED, &dev->flags);
+			set_bit(R5_Wantread, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
+			s->locked++;
+			pr_debug("Reading block %d (sync=%d)\n", disk_idx,
+				s->syncing);
+		}
+	}
+
+	return ~0;
+}
+
 static void handle_issuing_new_read_requests5(struct stripe_head *sh,
 			struct stripe_head_state *s, int disks)
 {
int i

[md-accel PATCH 11/19] md: handle_stripe5 - add request/completion logic for async check ops

2007-06-26 Thread Dan Williams
Check operations are scheduled when the array is being resynced or an
explicit 'check/repair' command was sent to the array.  Previously check
operations would destroy the parity block in the cache such that even if
parity turned out to be correct the parity block would be marked
!R5_UPTODATE at the completion of the check.  When the operation can be
carried out by a dma engine the assumption is that it can check parity as a
read-only operation.  If raid5_run_ops notices that the check was handled
by hardware it will preserve the R5_UPTODATE status of the parity disk.

When a check operation determines that the parity needs to be repaired we
reuse the existing compute block infrastructure to carry out the operation.
Repair operations imply an immediate write back of the data, so to
differentiate a repair from a normal compute operation the
STRIPE_OP_MOD_REPAIR_PD flag is added.

Changelog:
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   84 
 1 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 38b8167..89d3890 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2464,26 +2464,67 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 		struct stripe_head_state *s, int disks)
 {
 	set_bit(STRIPE_HANDLE, &sh->state);
-	if (s->failed == 0) {
-		BUG_ON(s->uptodate != disks);
-		compute_parity5(sh, CHECK_PARITY);
-		s->uptodate--;
-		if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-			/* parity is correct (on disc, not in buffer any more)
-			 */
-			set_bit(STRIPE_INSYNC, &sh->state);
-		} else {
-			conf->mddev->resync_mismatches += STRIPE_SECTORS;
-			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-				/* don't try to repair!! */
+	/* Take one of the following actions:
+	 * 1/ start a check parity operation if (uptodate == disks)
+	 * 2/ finish a check parity operation and act on the result
+	 * 3/ skip to the writeback section if we previously
+	 *    initiated a recovery operation
+	 */
+	if (s->failed == 0 &&
+	    !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+		if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+			BUG_ON(s->uptodate != disks);
+			clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+			sh->ops.count++;
+			s->uptodate--;
+		} else if (
+		       test_and_clear_bit(STRIPE_OP_CHECK, &sh->ops.complete)) {
+			clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
+			clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
+
+			if (sh->ops.zero_sum_result == 0)
+				/* parity is correct (on disc,
+				 * not in buffer any more)
+				 */
 				set_bit(STRIPE_INSYNC, &sh->state);
 			else {
-				compute_block(sh, sh->pd_idx);
-				s->uptodate++;
+				conf->mddev->resync_mismatches +=
+					STRIPE_SECTORS;
+				if (test_bit(
+				     MD_RECOVERY_CHECK, &conf->mddev->recovery))
+					/* don't try to repair!! */
+					set_bit(STRIPE_INSYNC, &sh->state);
+				else {
+					set_bit(STRIPE_OP_COMPUTE_BLK,
+						&sh->ops.pending);
+					set_bit(STRIPE_OP_MOD_REPAIR_PD,
+						&sh->ops.pending);
+					set_bit(R5_Wantcompute,
+						&sh->dev[sh->pd_idx].flags);
+					sh->ops.target = sh->pd_idx;
+					sh->ops.count++;
+					s->uptodate++;
+				}
 			}
 		}
 	}
-	if (!test_bit(STRIPE_INSYNC, &sh->state)) {
+
+	/* check if we can clear a parity disk reconstruct */
+	if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
+	    test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+
+		clear_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending);
+		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
+		clear_bit

[md-accel PATCH 12/19] md: handle_stripe5 - add request/completion logic for async read ops

2007-06-26 Thread Dan Williams
When a read bio is attached to the stripe and the corresponding block is
marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy
the data from the stripe cache to the bio buffer.  handle_stripe flags the
blocks to be operated on with the R5_Wantfill flag.  If new read requests
arrive while raid5_run_ops is running they will not be handled until
handle_stripe is scheduled to run again.

Changelog:
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   53 +---
 include/linux/raid/raid5.h |2 +-
 2 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 89d3890..3d0dca9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2042,9 +2042,12 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 			bi = bi2;
 		}
 
-		/* fail any reads if this device is non-operational */
-		if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
-		    test_bit(R5_ReadError, &sh->dev[i].flags)) {
+		/* fail any reads if this device is non-operational and
+		 * the data has not reached the cache yet.
+		 */
+		if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
+		    (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
 			bi = sh->dev[i].toread;
 			sh->dev[i].toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -2733,37 +2736,27 @@ static void handle_stripe5(struct stripe_head *sh)
 		struct r5dev *dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
-		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
-		/* maybe we can reply to a read */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-			struct bio *rbi, *rbi2;
-			pr_debug("Return read for disc %d\n", i);
-			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&conf->wait_for_overlap);
-			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
-		}
+		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
+			"written %p\n", i, dev->flags, dev->toread, dev->read,
+			dev->towrite, dev->written);
+
+		/* maybe we can request a biofill operation
+		 *
+		 * new wantfill requests are only permitted while
+		 * STRIPE_OP_BIOFILL is clear
+		 */
+		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+			!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
 		if (test_bit(R5_Wantcompute, &dev->flags)) s.compute++;
 
-		if (dev->toread)
+		if (test_bit(R5_Wantfill, &dev->flags))
+			s.to_fill++;
+		else if (dev->toread)
 			s.to_read++;
 		if (dev->towrite) {
 			s.to_write++;
@@ -2786,6 +2779,10 @@ static void handle_stripe5(struct stripe_head *sh)
 			set_bit(R5_Insync, &dev->flags);
 		}
 	rcu_read_unlock();
+
+	if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+		sh->ops.count++;
+
 	pr_debug("locked=%d uptodate=%d to_read=%d"
 		" to_write=%d failed=%d failed_num=%d\n",
 		s.locked, s.uptodate, s.to_read, s.to_write,
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 2d45eba..e9dfb2d 100644

[md-accel PATCH 13/19] md: handle_stripe5 - add request/completion logic for async expand ops

2007-06-26 Thread Dan Williams
When a stripe is being expanded bulk copying takes place to move the data
from the old stripe to the new.  Since raid5_run_ops only operates on one
stripe at a time these bulk copies are handled in-line under the stripe
lock.  In the dma offload case we poll for the completion of the operation.

After the data has been copied into the new stripe the parity needs to be
recalculated across the new disks.  We reuse the existing postxor
functionality to carry out this calculation.  By setting STRIPE_OP_POSTXOR
without setting STRIPE_OP_BIODRAIN the completion path in handle stripe
can differentiate expand operations from normal write operations.
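
Stripped of the stripe bookkeeping, the copy loop reduces to a chain of
async_memcpy() calls in which each descriptor depends on its predecessor,
with an explicit ack and wait after the final submit.  A sketch, not the
verbatim hunk below:

	struct dma_async_tx_descriptor *tx = NULL;

	for (i = 0; i < sh->disks; i++) {
		/* chaining on tx keeps every copy on one ordered channel */
		tx = async_memcpy(sh2->dev[dd_idx].page, sh->dev[i].page,
				0, 0, STRIPE_SIZE, ASYNC_TX_DEP_ACK, tx,
				NULL, NULL);
	}

	async_tx_ack(tx);		/* no further dependent operations */
	dma_wait_for_async_tx(tx);	/* expansion is rare, polling is ok */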

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   50 ++
 1 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3d0dca9..e0ae26d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2646,6 +2646,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, 
struct stripe_head *sh,
/* We have read all the blocks in this stripe and now we need to
 * copy some of them into a target stripe for expand.
 */
+   struct dma_async_tx_descriptor *tx = NULL;
clear_bit(STRIPE_EXPAND_SOURCE, sh-state);
for (i = 0; i  sh-disks; i++)
if (i != sh-pd_idx  (r6s  i != r6s-qd_idx)) {
@@ -2671,9 +2672,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, 
struct stripe_head *sh,
release_stripe(sh2);
continue;
}
-   memcpy(page_address(sh2-dev[dd_idx].page),
-  page_address(sh-dev[i].page),
-  STRIPE_SIZE);
+
+   /* place all the copies on one channel */
+   tx = async_memcpy(sh2-dev[dd_idx].page,
+   sh-dev[i].page, 0, 0, STRIPE_SIZE,
+   ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
set_bit(R5_Expanded, sh2-dev[dd_idx].flags);
set_bit(R5_UPTODATE, sh2-dev[dd_idx].flags);
for (j = 0; j  conf-raid_disks; j++)
@@ -2686,6 +2690,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, 
struct stripe_head *sh,
set_bit(STRIPE_HANDLE, sh2-state);
}
release_stripe(sh2);
+
+   /* done submitting copies, wait for them to complete */
+   if (i + 1 = sh-disks) {
+   async_tx_ack(tx);
+   dma_wait_for_async_tx(tx);
+   }
}
 }
 
@@ -2924,18 +2934,34 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 	}
 
-	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
-		/* Need to write out all blocks after computing parity */
-		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
-			conf->raid_disks);
-		compute_parity5(sh, RECONSTRUCT_WRITE);
+	/* Finish postxor operations initiated by the expansion
+	 * process
+	 */
+	if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
+	    !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
 		for (i = conf->raid_disks; i--; ) {
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			s.locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 		}
-		clear_bit(STRIPE_EXPANDING, &sh->state);
-	} else if (s.expanded) {
+	}
+
+	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	    !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+		/* Need to write out all blocks after computing parity */
+		sh->disks = conf->raid_disks;
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+			conf->raid_disks);
+		s.locked += handle_write_operations5(sh, 0, 1);
+	} else if (s.expanded &&
+		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);

[md-accel PATCH 15/19] md: remove raid5 compute_block and compute_parity5

2007-06-26 Thread Dan Williams
replaced by raid5_run_ops

Signed-off-by: Dan Williams [EMAIL PROTECTED]
Acked-By: NeilBrown [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  124 
 1 files changed, 0 insertions(+), 124 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a09bc5f..0579d1f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1509,130 +1509,6 @@ static void copy_data(int frombio, struct bio *bio,
 		}			\
 	} while(0)
 
-
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-	int i, count, disks = sh->disks;
-	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-
-	pr_debug("compute_block, stripe %llu, idx %d\n",
-		(unsigned long long)sh->sector, dd_idx);
-
-	dest = page_address(sh->dev[dd_idx].page);
-	memset(dest, 0, STRIPE_SIZE);
-	count = 0;
-	for (i = disks ; i--; ) {
-		if (i == dd_idx)
-			continue;
-		p = page_address(sh->dev[i].page);
-		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-			ptr[count++] = p;
-		else
-			printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-				" not present\n", dd_idx,
-				(unsigned long long)sh->sector, i);
-
-		check_xor();
-	}
-	if (count)
-		xor_blocks(count, STRIPE_SIZE, dest, ptr);
-	set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-	void *ptr[MAX_XOR_BLOCKS], *dest;
-	struct bio *chosen;
-
-	pr_debug("compute_parity5, stripe %llu, method %d\n",
-		(unsigned long long)sh->sector, method);
-
-	count = 0;
-	dest = page_address(sh->dev[pd_idx].page);
-	switch(method) {
-	case READ_MODIFY_WRITE:
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-		for (i=disks ; i-- ;) {
-			if (i==pd_idx)
-				continue;
-			if (sh->dev[i].towrite &&
-			    test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-				check_xor();
-			}
-		}
-		break;
-	case RECONSTRUCT_WRITE:
-		memset(dest, 0, STRIPE_SIZE);
-		for (i= disks; i-- ;)
-			if (i!=pd_idx && sh->dev[i].towrite) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-			}
-		break;
-	case CHECK_PARITY:
-		break;
-	}
-	if (count) {
-		xor_blocks(count, STRIPE_SIZE, dest, ptr);
-		count = 0;
-	}
-
-	for (i = disks; i--;)
-		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
-			struct bio *wbi = sh->dev[i].written;
-			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-				copy_data(1, wbi, sh->dev[i].page, sector);
-				wbi = r5_next_bio(wbi, sector);
-			}
-
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-		}
-
-	switch(method) {
-	case RECONSTRUCT_WRITE:
-	case CHECK_PARITY:
-		for (i=disks; i--;)
-			if (i != pd_idx) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-		break;
-	case READ_MODIFY_WRITE:
-		for (i = disks; i--;)
-			if (sh->dev[i].written) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor

[md-accel PATCH 05/19] raid5: refactor handle_stripe5 and handle_stripe6 (v2)

2007-06-26 Thread Dan Williams
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that 
previously
stood alone in handle_stripe5,6.  'struct r6_state' contains the handle_stripe6
specific variables like p_failed and q_failed.

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

handle_completed_write_requests
handle_requests_to_failed_array
handle_stripe_expansion

Changes in v2:
* fixed 'conf-raid_disk-1' for the raid6 'handle_stripe_expansion' path
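
For reference, the per-pass counters gathered into 'struct
stripe_head_state' look roughly like the sketch below; the field list is
inferred from the uses in this series rather than copied from the header:

	struct stripe_head_state {
		int syncing, expanding, expanded;
		int locked, uptodate, to_read, to_write, failed, written;
		int failed_num;
	};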

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c | 1488 +---
 include/linux/raid/raid5.h |   16 
 2 files changed, 737 insertions(+), 767 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f51dfa..94e0920 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1326,6 +1326,608 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
return pd_idx;
 }
 
+static void
+handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
+				struct stripe_head_state *s, int disks,
+				struct bio **return_bi)
+{
+	int i;
+	for (i = disks; i--; ) {
+		struct bio *bi;
+		int bitmap_end = 0;
+
+		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			mdk_rdev_t *rdev;
+			rcu_read_lock();
+			rdev = rcu_dereference(conf->disks[i].rdev);
+			if (rdev && test_bit(In_sync, &rdev->flags))
+				/* multiple read failures in one stripe */
+				md_error(conf->mddev, rdev);
+			rcu_read_unlock();
+		}
+		spin_lock_irq(&conf->device_lock);
+		/* fail all writes first */
+		bi = sh->dev[i].towrite;
+		sh->dev[i].towrite = NULL;
+		if (bi) {
+			s->to_write--;
+			bitmap_end = 1;
+		}
+
+		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+			wake_up(&conf->wait_for_overlap);
+
+		while (bi && bi->bi_sector <
+			sh->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			if (--bi->bi_phys_segments == 0) {
+				md_write_end(conf->mddev);
+				bi->bi_next = *return_bi;
+				*return_bi = bi;
+			}
+			bi = nextbi;
+		}
+		/* and fail all 'written' */
+		bi = sh->dev[i].written;
+		sh->dev[i].written = NULL;
+		if (bi) bitmap_end = 1;
+		while (bi && bi->bi_sector <
+		       sh->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			if (--bi->bi_phys_segments == 0) {
+				md_write_end(conf->mddev);
+				bi->bi_next = *return_bi;
+				*return_bi = bi;
+			}
+			bi = bi2;
+		}
+
+		/* fail any reads if this device is non-operational */
+		if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+		    test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			bi = sh->dev[i].toread;
+			sh->dev[i].toread = NULL;
+			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				wake_up(&conf->wait_for_overlap);
+			if (bi) s->to_read--;
+			while (bi && bi->bi_sector <
+			       sh->dev[i].sector + STRIPE_SECTORS) {
+				struct bio *nextbi =
+					r5_next_bio(bi, sh->dev[i].sector);
+				clear_bit(BIO_UPTODATE, &bi->bi_flags);
+				if (--bi->bi_phys_segments == 0) {
+					bi->bi_next = *return_bi;
+					*return_bi = bi;
+				}
+				bi = nextbi;
+			}
+		}
+		spin_unlock_irq(&conf->device_lock

[md-accel PATCH 17/19] iop13xx: surface the iop13xx adma units to the iop-adma driver

2007-06-26 Thread Dan Williams
Adds the platform device definitions and the architecture specific
support routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* added 'descriptor pool size' to the platform data
* add base support for buffer sizes larger than 16MB (hw max)
* build error fix from Kirill A. Shutemov
* rebase for async_tx changes
* add interrupt support
* do not call platform register macros in driver code
* remove unnecessary ARM assembly statement
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/mach-iop13xx/setup.c  |  217 +
 include/asm-arm/arch-iop13xx/adma.h|  544 
 include/asm-arm/arch-iop13xx/iop13xx.h |   38 +-
 3 files changed, 774 insertions(+), 25 deletions(-)

diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c
index bc48715..bfe0c87 100644
--- a/arch/arm/mach-iop13xx/setup.c
+++ b/arch/arm/mach-iop13xx/setup.c
@@ -25,6 +25,7 @@
 #include asm/hardware.h
 #include asm/irq.h
 #include asm/io.h
+#include asm/hardware/iop_adma.h
 
 #define IOP13XX_UART_XTAL 4000
 #define IOP13XX_SETUP_DEBUG 0
@@ -236,19 +237,143 @@ static unsigned long iq8134x_probe_flash_size(void)
 }
 #endif
 
+/* ADMA Channels */
+static struct resource iop13xx_adma_0_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(0),
+   .end = IOP13XX_ADMA_UPPER_PA(0),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA0_EOT,
+   .end = IRQ_IOP13XX_ADMA0_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA0_EOC,
+   .end = IRQ_IOP13XX_ADMA0_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA0_ERR,
+   .end = IRQ_IOP13XX_ADMA0_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_1_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(1),
+   .end = IOP13XX_ADMA_UPPER_PA(1),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA1_EOT,
+   .end = IRQ_IOP13XX_ADMA1_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA1_EOC,
+   .end = IRQ_IOP13XX_ADMA1_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA1_ERR,
+   .end = IRQ_IOP13XX_ADMA1_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_2_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(2),
+   .end = IOP13XX_ADMA_UPPER_PA(2),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA2_EOT,
+   .end = IRQ_IOP13XX_ADMA2_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA2_EOC,
+   .end = IRQ_IOP13XX_ADMA2_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA2_ERR,
+   .end = IRQ_IOP13XX_ADMA2_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static u64 iop13xx_adma_dmamask = DMA_64BIT_MASK;
+static struct iop_adma_platform_data iop13xx_adma_0_data = {
+   .hw_id = 0,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_1_data = {
+   .hw_id = 1,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_2_data = {
+   .hw_id = 2,
+   .pool_size = PAGE_SIZE,
+};
+
+/* The ids are fixed up later in iop13xx_platform_init */
+static struct platform_device iop13xx_adma_0_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_0_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop13xx_adma_0_data,
+	},
+};
+
+static struct platform_device iop13xx_adma_1_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_1_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop13xx_adma_1_data,
+	},
+};
+
+static struct platform_device iop13xx_adma_2_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_2_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK

[md-accel PATCH 18/19] iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver

2007-06-26 Thread Dan Williams
Adds the platform device definitions and the architecture specific support
routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for > 1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous, only
  hardware descriptors are contiguous for up to a PAGE_SIZE buffer size
* convert to async_tx
* add interrupt support
* add platform devices for 80219 boards
* do not call platform register macros in driver code
* remove switch() statements for compatible register offsets/layouts
* change over to bitmap based capabilities
* remove unnecessary ARM assembly statement
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/mach-iop32x/glantank.c|2 
 arch/arm/mach-iop32x/iq31244.c |5 
 arch/arm/mach-iop32x/iq80321.c |3 
 arch/arm/mach-iop32x/n2100.c   |2 
 arch/arm/mach-iop33x/iq80331.c |3 
 arch/arm/mach-iop33x/iq80332.c |3 
 arch/arm/plat-iop/Makefile |2 
 arch/arm/plat-iop/adma.c   |  209 
 include/asm-arm/arch-iop32x/adma.h |5 
 include/asm-arm/arch-iop33x/adma.h |5 
 include/asm-arm/hardware/iop3xx-adma.h |  891 
 include/asm-arm/hardware/iop3xx.h  |   68 --
 12 files changed, 1138 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c
index 5776fd8..2b086ab 100644
--- a/arch/arm/mach-iop32x/glantank.c
+++ b/arch/arm/mach-iop32x/glantank.c
@@ -180,6 +180,8 @@ static void __init glantank_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&glantank_flash_device);
 	platform_device_register(&glantank_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
 	pm_power_off = glantank_power_off;
 }
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index d4eefbe..98cfa1c 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -298,9 +298,14 @@ static void __init iq31244_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&iq31244_flash_device);
 	platform_device_register(&iq31244_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
 	if (is_ep80219())
 		pm_power_off = ep80219_power_off;
+
+	if (!is_80219())
+		platform_device_register(&iop3xx_aau_channel);
 }
 
 static int __init force_ep80219_setup(char *str)
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index 8d9f491..18ad29f 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -181,6 +181,9 @@ static void __init iq80321_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&iq80321_flash_device);
 	platform_device_register(&iq80321_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80321, "Intel IQ80321")
diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c
index d55005d..390a97d 100644
--- a/arch/arm/mach-iop32x/n2100.c
+++ b/arch/arm/mach-iop32x/n2100.c
@@ -245,6 +245,8 @@ static void __init n2100_init_machine(void)
 	platform_device_register(&iop3xx_i2c0_device);
 	platform_device_register(&n2100_flash_device);
 	platform_device_register(&n2100_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
 	pm_power_off = n2100_power_off;
 
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 2b06318..433188e 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -136,6 +136,9 @@ static void __init iq80331_init_machine(void)
 	platform_device_register(&iop33x_uart0_device);
 	platform_device_register(&iop33x_uart1_device);
 	platform_device_register(&iq80331_flash_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80331, "Intel IQ80331")
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index 7889ce3..416c095 100644
--- a/arch/arm/mach-iop33x/iq80332.c
+++ b/arch/arm/mach-iop33x/iq80332.c
@@ -136,6 +136,9 @@ static void __init iq80332_init_machine

[md-accel PATCH 19/19] ARM: Add drivers/dma to arch/arm/Kconfig

2007-06-26 Thread Dan Williams
Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/Kconfig |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 50d9f3e..0cb2d4f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1034,6 +1034,8 @@ source "drivers/mmc/Kconfig"
 
 source "drivers/rtc/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
 
 source "fs/Kconfig"


Re: [md-accel PATCH 00/19] md raid acceleration and the async_tx api

2007-06-26 Thread Dan Williams

On 6/26/07, Mr. James W. Laferriere [EMAIL PROTECTED] wrote:

Hello Dan ,

On Tue, 26 Jun 2007, Dan Williams wrote:
 Greetings,

 Per Andrew's suggestion this is the md raid5 acceleration patch set
 updated with more thorough changelogs to lower the barrier to entry for
 reviewers.  To get started with the code I would suggest the following
 order:
 [md-accel PATCH 01/19] dmaengine: refactor dmaengine around 
dma_async_tx_descriptor
 [md-accel PATCH 04/19] async_tx: add the async_tx api
 [md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside 
sh-lock
 [md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx 
raid engines
  ...snip...

Can you please tell me against what linus kernel version these will
apply against ?  Or at least tell me against which version they were diff'd ?


This patch set is against 2.6.22-rc6.  The git tree is periodically
rebased to track Linus' latest.

--
Dan


Re: stripe_cache_size and performance

2007-06-25 Thread Dan Williams

7. And now, the question: the best absolute 'write' performance comes
with a stripe_cache_size value of 4096 (for my setup). However, any
value of stripe_cache_size above 384 really, really hurts 'check' (and
rebuild, one can assume) performance.  Why?


Question:
After performance goes bad does it go back up if you reduce the size
back down to 384?


--
Jon Nelson [EMAIL PROTECTED]


Dan


[PATCH git-md-accel 0/2] raid5 refactor, and pr_debug cleanup

2007-06-18 Thread Dan Williams
Neil,

The following two patches are the respin of the changes you suggested to
raid5: coding style cleanup / refactor.  I have added them to the
git-md-accel tree for a 2.6.23-rc1 pull.  The full, rebased, raid
acceleration patchset will be sent for another round of review once I
address Andrew's concerns about the commit messages.

Dan Williams (2):
  raid5: refactor handle_stripe5 and handle_stripe6
  raid5: replace custom debug print with standard pr_debug


[PATCH git-md-accel 1/2] raid5: refactor handle_stripe5 and handle_stripe6

2007-06-18 Thread Dan Williams
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that 
previously
stood alone in handle_stripe5,6.  'struct r6_state' contains the handle_stripe6
specific variables like p_failed and q_failed.

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

handle_completed_write_requests
handle_requests_to_failed_array
handle_stripe_expansion

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c | 1484 +---
 include/linux/raid/raid5.h |   16 
 2 files changed, 733 insertions(+), 767 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f51dfa..68834d2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1326,6 +1326,604 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
return pd_idx;
 }
 
+static void
+handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
+				struct stripe_head_state *s, int disks,
+				struct bio **return_bi)
+{
+	int i;
+	for (i = disks; i--;) {
+		struct bio *bi;
+		int bitmap_end = 0;
+
+		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			mdk_rdev_t *rdev;
+			rcu_read_lock();
+			rdev = rcu_dereference(conf->disks[i].rdev);
+			if (rdev && test_bit(In_sync, &rdev->flags))
+				/* multiple read failures in one stripe */
+				md_error(conf->mddev, rdev);
+			rcu_read_unlock();
+		}
+		spin_lock_irq(&conf->device_lock);
+		/* fail all writes first */
+		bi = sh->dev[i].towrite;
+		sh->dev[i].towrite = NULL;
+		if (bi) {
+			s->to_write--;
+			bitmap_end = 1;
+		}
+
+		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+			wake_up(&conf->wait_for_overlap);
+
+		while (bi && bi->bi_sector <
+			sh->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			if (--bi->bi_phys_segments == 0) {
+				md_write_end(conf->mddev);
+				bi->bi_next = *return_bi;
+				*return_bi = bi;
+			}
+			bi = nextbi;
+		}
+		/* and fail all 'written' */
+		bi = sh->dev[i].written;
+		sh->dev[i].written = NULL;
+		if (bi) bitmap_end = 1;
+		while (bi && bi->bi_sector <
+		       sh->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+			clear_bit(BIO_UPTODATE, &bi->bi_flags);
+			if (--bi->bi_phys_segments == 0) {
+				md_write_end(conf->mddev);
+				bi->bi_next = *return_bi;
+				*return_bi = bi;
+			}
+			bi = bi2;
+		}
+
+		/* fail any reads if this device is non-operational */
+		if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+		    test_bit(R5_ReadError, &sh->dev[i].flags)) {
+			bi = sh->dev[i].toread;
+			sh->dev[i].toread = NULL;
+			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				wake_up(&conf->wait_for_overlap);
+			if (bi) s->to_read--;
+			while (bi && bi->bi_sector <
+			       sh->dev[i].sector + STRIPE_SECTORS) {
+				struct bio *nextbi =
+					r5_next_bio(bi, sh->dev[i].sector);
+				clear_bit(BIO_UPTODATE, &bi->bi_flags);
+				if (--bi->bi_phys_segments == 0) {
+					bi->bi_next = *return_bi;
+					*return_bi = bi;
+				}
+				bi = nextbi;
+			}
+		}
+		spin_unlock_irq(&conf->device_lock);
+		if (bitmap_end)
+			bitmap_endwrite(conf->mddev->bitmap, sh

[PATCH git-md-accel 2/2] raid5: replace custom debug print with standard pr_debug

2007-06-18 Thread Dan Williams
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in
favor of the global DEBUG definition.  To get local debug messages just add
'#define DEBUG' to the top of the file.
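
With this change a per-file debug build is just (illustrative):

	/* top of drivers/md/raid5.c, before the #includes */
	#define DEBUG

so that pr_debug() expands to a real printk(KERN_DEBUG ...) for this file
only, instead of toggling a driver-private RAID5_DEBUG constant.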

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  116 ++--
 1 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 68834d2..fa562e7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -80,7 +80,6 @@
 /*
  * The following can be used to debug the driver
  */
-#define RAID5_DEBUG	0
 #define RAID5_PARANOIA	1
 #if RAID5_PARANOIA && defined(CONFIG_SMP)
 # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
@@ -88,8 +87,7 @@
 # define CHECK_DEVLOCK()
 #endif
 
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
-#if RAID5_DEBUG
+#ifdef DEBUG
 #define inline
 #define __inline__
 #endif
@@ -152,7 +150,8 @@ static void release_stripe(struct stripe_head *sh)
 
 static inline void remove_hash(struct stripe_head *sh)
 {
-	PRINTK("remove_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+	pr_debug("remove_hash(), stripe %llu\n",
+		(unsigned long long)sh->sector);
 
 	hlist_del_init(&sh->hash);
 }
@@ -161,7 +160,8 @@ static inline void insert_hash(raid5_conf_t *conf, struct stripe_head *sh)
 {
 	struct hlist_head *hp = stripe_hash(conf, sh->sector);
 
-	PRINTK("insert_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+	pr_debug("insert_hash(), stripe %llu\n",
+		(unsigned long long)sh->sector);
 
 	CHECK_DEVLOCK();
 	hlist_add_head(&sh->hash, hp);
@@ -226,7 +226,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 
 	CHECK_DEVLOCK();
-	PRINTK("init_stripe called, stripe %llu\n", 
+	pr_debug("init_stripe called, stripe %llu\n",
 		(unsigned long long)sh->sector);
 
 	remove_hash(sh);
@@ -260,11 +260,11 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	struct hlist_node *hn;
 
 	CHECK_DEVLOCK();
-	PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
+	pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
 		if (sh->sector == sector && sh->disks == disks)
 			return sh;
-	PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
+	pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
 }
 
@@ -276,7 +276,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 {
 	struct stripe_head *sh;
 
-	PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector);
+	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
 	spin_lock_irq(&conf->device_lock);
 
@@ -537,8 +537,8 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 	if (bi == &sh->dev[i].req)
 		break;
 
-	PRINTK("end_read_request %llu/%d, count: %d, uptodate %d.\n", 
-		(unsigned long long)sh->sector, i, atomic_read(&sh->count), 
+	pr_debug("end_read_request %llu/%d, count: %d, uptodate %d.\n",
+		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		uptodate);
 	if (i == disks) {
 		BUG();
@@ -613,7 +613,7 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 	if (bi == &sh->dev[i].req)
 		break;
 
-	PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", 
+	pr_debug("end_write_request %llu/%d, count %d, uptodate: %d.\n",
 		(unsigned long long)sh->sector, i, atomic_read(&sh->count),
 		uptodate);
 	if (i == disks) {
@@ -658,7 +658,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
 {
 	char b[BDEVNAME_SIZE];
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
-	PRINTK("raid5: error called\n");
+	pr_debug("raid5: error called\n");
 
 	if (!test_bit(Faulty, &rdev->flags)) {
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -929,7 +929,7 @@ static void compute_block(struct stripe_head *sh, int dd_idx)
 	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
 
-	PRINTK("compute_block, stripe %llu, idx %d\n", 
+	pr_debug("compute_block, stripe %llu, idx %d\n",
 		(unsigned long long)sh->sector, dd_idx);
 
 	dest = page_address(sh->dev[dd_idx].page);
@@ -960,7 +960,7 @@ static void compute_parity5(struct stripe_head *sh, int method)
 	void *ptr[MAX_XOR_BLOCKS], *dest;
 	struct bio *chosen;
 
-	PRINTK("compute_parity5, stripe %llu, method %d\n",
+	pr_debug("compute_parity5, stripe %llu, method %d\n

[PATCH] md: comment add_stripe_bio

2007-06-05 Thread Dan Williams
From: Dan Williams [EMAIL PROTECTED]

Document the overloading of struct bio fields.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

[ drop this if you think it is too much commenting/unnecessary, but I figured I 
would leave some
  breadcrumbs for the next guy. ]

 drivers/md/raid5.c |   26 ++
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 061375e..065b8dc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1231,10 +1231,13 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 
 
 
-/*
- * Each stripe/dev can have one or more bion attached.
- * toread/towrite point to the first in a chain.
- * The bi_next chain must be in order.
+/* add_stripe_bio - attach a bio to the toread/towrite list for an
+ * rdev in the given stripe.  This routine assumes that the toread/towrite
+ * lists are in submission order
+ * @sh: stripe targeted by bi->bi_sector
+ * @bi: bio to add (this routine assumes ownership of the bi->bi_next field)
+ * @dd_idx: data disk determined from the logical sector
+ * @forwrite: read/write flag
  */
 static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
 {
@@ -1249,12 +1252,18 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 
 	spin_lock(&sh->lock);
 	spin_lock_irq(&conf->device_lock);
+
+	/* pick the list to manipulate */
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
 			firstwrite = 1;
 	} else
 		bip = &sh->dev[dd_idx].toread;
+
+	/* scroll through the list to see if this bio overlaps with a
+	 * pending request
+	 */
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -1263,10 +1272,19 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 	if (*bip && (*bip)->bi_sector < bi->bi_sector + ((bi->bi_size)>>9))
 		goto overlap;
 
+	/* add this bio into the chain and make sure we are not dropping
+	 * a link that was previously established
+	 */
 	BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
 	if (*bip)
 		bi->bi_next = *bip;
+
+	/* attach the bio to the end of the list, if this is the first bio
+	 * added then we are directly manipulating toread/towrite
+	 */
 	*bip = bi;
+
+	/* keep count of the number of stripes that this bio touches */
 	bi->bi_phys_segments++;
 	spin_unlock_irq(&conf->device_lock);
 	spin_unlock(&sh->lock);


[PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22

2007-05-02 Thread Dan Williams
I am pleased to release this latest spin of the raid acceleration
patches for merge consideration.  This release aims to address all
pending review items including MD bug fixes and async_tx api changes
from Neil, and concerns on channel management from Chris and others.

Data integrity tests using home grown scripts and 'iozone -V' are
passing.  I am open to suggestions for additional testing criteria.  I
have also verified that git bisect is not broken by this set.

The short log below highlights the most recent changes.  The patches
will be sent as a reply to this message, and they are also available via
git:

git pull git://lost.foo-projects.org/~dwillia2/git/iop md-accel-linus

Additional comments and feedback welcome.

Thanks,
Dan

--
01/16: dmaengine: add base support for the async_tx api
* convert channel capabilities to a 'cpumask_t like' bitmap
02/16: dmaengine: move channel management to the client
* this patch is new to this series
03/16: ARM: Add drivers/dma to arch/arm/Kconfig
04/16: dmaengine: add the async_tx api
* remove the per operation type list, and distribute operation
  capabilities evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path
05/16: md: add raid5_run_ops and support routines
* explicitly handle the 2-disk raid5 case (xor becomes memcpy)
* fix race between async engines and bi_end_io call for reads,
  Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling, Neil Brown
06/16: md: use raid5_run_ops for stripe cache operations
07/16: md: move write operations to raid5_run_ops
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
08/16: md: move raid5 compute block operations to raid5_run_ops
* remove the req_compute BUG_ON
09/16: md: move raid5 parity checks to raid5_run_ops
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
10/16: md: satisfy raid5 read requests via raid5_run_ops
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache
11/16: md: use async_tx and raid5_run_ops for raid5 expansion operations
12/16: md: move raid5 io requests to raid5_run_ops
13/16: md: remove raid5 compute_block and compute_parity5
14/16: dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
* fix locking bug in iop_adma_alloc_chan_resources, Benjamin
  Herrenschmidt
* convert capabilities over to dma_cap_mask_t
15/16: iop13xx: Surface the iop13xx adma units to the iop-adma driver
16/16: iop3xx: Surface the iop3xx DMA and AAU units to the iop-adma driver

(previous release: http://marc.info/?l=linux-raid&m=117463257423193&w=2)


[PATCH 01/16] dmaengine: add base support for the async_tx api

2007-05-02 Thread Dan Williams
In preparation for the async_tx (dmaengine client) API this patch:
1/ introduces struct dma_async_tx_descriptor as a common field for all
   dmaengine software descriptors.  The primary role of this structure
   is to enable callbacks at transaction completion time, and support
   transaction chains that span multiple channels
2/ converts the device_memcpy_* methods into separate prep, set
   src/dest, and submit stages
3/ adds support for capabilities beyond memcpy (xor, memset, xor zero
   sum, completion interrupts).  place holders for future capabilities
   are also included
4/ converts ioatdma to the new semantics

Changelog:
* drop dma mapping methods, suggested by Chris Leech
* fix ioat_dma_dependency_added, also caught by Andrew Morton
* fix dma_sync_wait, change from Andrew Morton
* uninline large functions, change from Andrew Morton
* add tx-callback = NULL to dmaengine calls to interoperate with async_tx
  calls
* hookup ioat_tx_submit
* convert channel capabilities to a 'cpumask_t like' bitmap
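
With the split in 2/, a raw memcpy offload reads roughly like the sketch
below; the method names come from the validation checks this patch adds to
dma_async_device_register(), while the argument order of the
set_src/set_dest stages is an assumption:

	tx = dev->device_prep_dma_memcpy(chan, len, 0);	/* 1: prep */
	if (!tx)
		return -ENOMEM;
	dev->device_set_src(src_dma, tx, 0);		/* 2: addresses */
	dev->device_set_dest(dst_dma, tx, 0);
	tx->callback = NULL;				/* no callback */
	cookie = dev->device_tx_submit(tx);		/* 3: submit */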

Cc: Chris Leech [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/dma/dmaengine.c   |  182 +
 drivers/dma/ioatdma.c |  248 -
 drivers/dma/ioatdma.h |8 +
 include/linux/dmaengine.h |  245 
 4 files changed, 454 insertions(+), 229 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 322ee29..8a49103 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -59,6 +59,7 @@
 
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 #include <linux/device.h>
 #include <linux/dmaengine.h>
 #include <linux/hardirq.h>
@@ -66,6 +67,7 @@
 #include <linux/percpu.h>
 #include <linux/rcupdate.h>
 #include <linux/mutex.h>
+#include <linux/jiffies.h>
 
 static DEFINE_MUTEX(dma_list_mutex);
 static LIST_HEAD(dma_device_list);
@@ -165,6 +167,24 @@ static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
 	return NULL;
 }
 
+enum dma_status dma_sync_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{
+	enum dma_status status;
+	unsigned long dma_sync_wait_timeout = jiffies + msecs_to_jiffies(5000);
+
+	dma_async_issue_pending(chan);
+	do {
+		status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
+		if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
+			printk(KERN_ERR "dma_sync_wait_timeout!\n");
+			return DMA_ERROR;
+		}
+	} while (status == DMA_IN_PROGRESS);
+
+	return status;
+}
+EXPORT_SYMBOL(dma_sync_wait);
+
 /**
  * dma_chan_cleanup - release a DMA channel's resources
  * @kref: kernel reference structure that contains the DMA channel device
@@ -322,6 +342,28 @@ int dma_async_device_register(struct dma_device *device)
 	if (!device)
 		return -ENODEV;
 
+	/* validate device routines */
+	BUG_ON(dma_has_cap(DMA_MEMCPY, device->cap_mask) &&
+		!device->device_prep_dma_memcpy);
+	BUG_ON(dma_has_cap(DMA_XOR, device->cap_mask) &&
+		!device->device_prep_dma_xor);
+	BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) &&
+		!device->device_prep_dma_zero_sum);
+	BUG_ON(dma_has_cap(DMA_MEMSET, device->cap_mask) &&
+		!device->device_prep_dma_memset);
+	BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) &&
+		!device->device_prep_dma_interrupt);
+
+	BUG_ON(!device->device_alloc_chan_resources);
+	BUG_ON(!device->device_free_chan_resources);
+	BUG_ON(!device->device_tx_submit);
+	BUG_ON(!device->device_set_dest);
+	BUG_ON(!device->device_set_src);
+	BUG_ON(!device->device_dependency_added);
+	BUG_ON(!device->device_is_tx_complete);
+	BUG_ON(!device->device_issue_pending);
+	BUG_ON(!device->dev);
+
 	init_completion(&device->done);
 	kref_init(&device->refcount);
 	device->dev_id = id++;
@@ -397,6 +439,146 @@ void dma_async_device_unregister(struct dma_device *device)
 }
 EXPORT_SYMBOL(dma_async_device_unregister);
 
+/**
+ * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
+ * @chan: DMA channel to offload copy to
+ * @dest: destination address (virtual)
+ * @src: source address (virtual)
+ * @len: length
+ *
+ * Both @dest and @src must be mappable to a bus address according to the
+ * DMA mapping API rules for streaming mappings.
+ * Both @dest and @src must stay memory resident (kernel memory or locked
+ * user space pages).
+ */
+dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
+			void *dest, void *src, size_t len)
+{
+	struct dma_device *dev = chan->device;
+	struct dma_async_tx_descriptor *tx;
+	dma_addr_t addr;
+	dma_cookie_t cookie;
+	int cpu;
+
+	tx = dev->device_prep_dma_memcpy(chan, len, 0);
+	if (!tx)
+		return -ENOMEM

[PATCH 02/16] dmaengine: move channel management to the client

2007-05-02 Thread Dan Williams
This effectively makes channels a shared resource rather than tying them
to a specific client.  dmaengine now assumes that clients will internally
track how many channels they need and dmaengine will learn if the client cares 
about
a channel at dma_event_callback time.  This also enables a client to ignore
a channel if it does not meet extra client specific constraints beyond
simple base capabilities.

This patch also fixes up the NET_DMA client to use the new mechanism.
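
Under this scheme a client advertises a capability mask and takes or drops
channel references from its event callback; a hedged sketch (the event
names and the callback signature here are assumptions, only
dma_chan_get/dma_chan_put and cap_mask come from the patch):

	static void my_event_callback(struct dma_client *client,
			struct dma_chan *chan, enum dma_event event)
	{
		switch (event) {
		case DMA_RESOURCE_ADDED:	/* assumed event name */
			dma_chan_get(chan);	/* we want this channel */
			break;
		case DMA_RESOURCE_REMOVED:	/* assumed event name */
			dma_chan_put(chan);
			break;
		}
	}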

Cc: Chris Leech [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/dma/dmaengine.c   |  206 ++---
 drivers/dma/ioatdma.c |1 
 drivers/dma/ioatdma.h |3 -
 include/linux/dmaengine.h |   46 +-
 net/core/dev.c|  106 ---
 5 files changed, 198 insertions(+), 164 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 8a49103..1a26ce3 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -37,8 +37,8 @@
  * Each device has a channels list, which runs unlocked but is never modified
  * once the device is registered, it's just setup by the driver.
  *
- * Each client has a channels list, it's only modified under the client->lock
- * and in an RCU callback, so it's safe to read under rcu_read_lock().
+ * Each client is responsible for keeping track of the channels it uses.  See
+ * the definition of dma_event_callback in dmaengine.h.
  *
  * Each device has a kref, which is initialized to 1 when the device is
  * registered. A kref_put is done for each class_device registered.  When the
@@ -51,10 +51,12 @@
  * references to finish.
  *
  * Each channel has an open-coded implementation of Rusty Russell's bigref,
- * with a kref and a per_cpu local_t.  A single reference is set when on an
- * ADDED event, and removed with a REMOVE event.  Net DMA client takes an
- * extra reference per outstanding transaction.  The release function does a
- * kref_put on the device. -ChrisL
+ * with a kref and a per_cpu local_t.  A dma_chan_get is called when a client
+ * signals that it wants to use a channel, and dma_chan_put is called when
+ * a channel is removed or a client using it is unregistered.  A client can
+ * take extra references per outstanding transaction, as is the case with
+ * the NET DMA client.  The release function does a kref_put on the device.
+ * -ChrisL, DanW
  */
 
 #include <linux/init.h>
@@ -102,8 +104,18 @@ static ssize_t show_bytes_transferred(struct class_device *cd, char *buf)
 static ssize_t show_in_use(struct class_device *cd, char *buf)
 {
 	struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+	int in_use = 0;
+
+	if (unlikely(chan->slow_ref) && atomic_read(&chan->refcount.refcount) > 1)
+		in_use = 1;
+	else {
+		if (local_read(&(per_cpu_ptr(chan->local,
+			get_cpu())->refcount)) > 0)
+			in_use = 1;
+		put_cpu();
+	}
 
-	return sprintf(buf, "%d\n", (chan->client ? 1 : 0));
+	return sprintf(buf, "%d\n", in_use);
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
@@ -129,42 +141,50 @@ static struct class dma_devclass = {
 
 /* --- client and device registration --- */
 
+#define dma_async_chan_satisfies_mask(chan, mask) __dma_async_chan_satisfies_mask((chan), &(mask))
+static int __dma_async_chan_satisfies_mask(struct dma_chan *chan, dma_cap_mask_t *want)
+{
+	dma_cap_mask_t has;
+
+	bitmap_and(has.bits, want->bits, chan->device->cap_mask.bits, DMA_TX_TYPE_END);
+	return bitmap_equal(want->bits, has.bits, DMA_TX_TYPE_END);
+}
+
 /**
- * dma_client_chan_alloc - try to allocate a channel to a client
+ * dma_client_chan_alloc - try to allocate channels to a client
 * @client: dma_client
 *
 * Called with dma_list_mutex held.
 */
-static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
+static void dma_client_chan_alloc(struct dma_client *client)
 {
 	struct dma_device *device;
 	struct dma_chan *chan;
-	unsigned long flags;
 	int desc;	/* allocated descriptor count */
+	int ack;	/* client has taken a reference to this channel */
 
-	/* Find a channel, any DMA engine will do */
-	list_for_each_entry(device, &dma_device_list, global_node) {
+	/* Find a channel */
+	list_for_each_entry(device, &dma_device_list, global_node)
 		list_for_each_entry(chan, &device->channels, device_node) {
-			if (chan->client)
+			if (!dma_async_chan_satisfies_mask(chan,
+					client->cap_mask))
 				continue;
 
 			desc = chan->device->device_alloc_chan_resources(chan);
 			if (desc >= 0) {
-				kref_get(&device->refcount);
-				kref_init(&chan->refcount);
-				chan->slow_ref = 0
[PATCH 03/16] ARM: Add drivers/dma to arch/arm/Kconfig

2007-05-02 Thread Dan Williams
Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/Kconfig |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e7baca2..74077e3 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -997,6 +997,8 @@ source "drivers/mmc/Kconfig"
 
 source "drivers/rtc/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
 
 source "fs/Kconfig"


[PATCH 04/16] dmaengine: add the async_tx api

2007-05-02 Thread Dan Williams
The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies.  It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations.  Code
that is written to the api can optimize for asynchronous operation and
the api will fit the chain of operations to the available offload
resources. 
 
Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api.  A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided.  With the
iop-adma driver and async_tx, raid456 is able to offload copy, xor, and
xor-zero-sum operations to hardware engines.
 
On iop342 tiobench showed higher throughput for sequential writes (20 -
30% improvement) and sequential reads to a degraded array (40 - 55%
improvement).  For the other cases performance was roughly equal, +/- a
few percentage points.  On a x86-smp platform the performance of the
async_tx implementation (in synchronous mode) was also +/- a few
percentage points of the original implementation.  According to 'top'
CPU utilization was positively affected in the offload case, but exact
measurements have yet to be taken.
 
The tiobench command line used for testing was: tiobench --size 2048
--block 4096 --block 131072 --dir /mnt/raid --numruns 5 (iop342 had 1GB
of memory available).

Xor operations are handled by async_tx, to this end xor.c is moved into
drivers/dma and is changed to take an explicit destination address and a
series of sources to match the hardware engine implementation.

When CONFIG_DMA_ENGINE is not set the asynchronous path is compiled
away.
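
A dependency chain boils down to handing the previous descriptor to the
next call as its depend_tx argument; a minimal sketch using the
async_memcpy() signature from the raid5 expansion patch (the callback
plumbing shown is illustrative):

	struct dma_async_tx_descriptor *tx;

	/* the second copy must not start until the first completes */
	tx = async_memcpy(dest1, src, 0, 0, len, 0, NULL, NULL, NULL);
	tx = async_memcpy(dest2, dest1, 0, 0, len, ASYNC_TX_DEP_ACK, tx,
			my_done_callback, my_context);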

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts = 1
* implicitly handle hardware concerns like channel switching and
  interrupts, Neil Brown
* remove the per operation type list, and distribute operation capabilities
  evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/Makefile |1 
 drivers/dma/Kconfig  |   15 +
 drivers/dma/Makefile |1 
 drivers/dma/async_tx.c   |  889 ++
 drivers/dma/xor.c|  153 
 drivers/md/Kconfig   |3 
 drivers/md/Makefile  |6 
 drivers/md/raid5.c   |   52 +--
 drivers/md/xor.c |  154 
 include/linux/async_tx.h |  173 +
 include/linux/raid/xor.h |5 
 11 files changed, 1263 insertions(+), 189 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 3a718f5..2e8de9e 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_I2C) += i2c/
 obj-$(CONFIG_W1)   += w1/
 obj-$(CONFIG_HWMON)+= hwmon/
 obj-$(CONFIG_PHONE)+= telephony/
+obj-$(CONFIG_ASYNC_TX_DMA) += dma/
 obj-$(CONFIG_MD)   += md/
 obj-$(CONFIG_BT)   += bluetooth/
 obj-$(CONFIG_ISDN) += isdn/
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 30d021d..292ddad 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -7,8 +7,8 @@ menu "DMA Engine support"
 config DMA_ENGINE
 	bool "Support for DMA engines"
 	---help---
-	  DMA engines offload copy operations from the CPU to dedicated
-	  hardware, allowing the copies to happen asynchronously.
+	  DMA engines offload bulk memory operations from the CPU to dedicated
+	  hardware, allowing the operations to happen asynchronously.
 
 comment "DMA Clients"
 
@@ -22,6 +22,16 @@ config NET_DMA
 	  Since this is the main user of the DMA engine, it should be enabled;
 	  say Y here.
 
+config ASYNC_TX_DMA
+	tristate "Asynchronous Bulk Memory Transfers/Transforms API"
+	---help---
+	  This enables the async_tx management layer for dma engines.
+	  Subsystems coded to this API will use offload engines for bulk
+	  memory operations where present.  Software implementations are
+	  called when a dma engine is not present or fails to allocate
+	  memory to carry out the transaction.
+	  Current subsystems ported to async_tx: MD_RAID4,5
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
@@ -30,5 +40,4 @@ config INTEL_IOATDMA
 	default m
 	---help---
 	  Enable support for the Intel(R) I/OAT DMA engine.
-
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index bdcfdbd..6a99341 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o

[PATCH 05/16] md: add raid5_run_ops and support routines

2007-05-02 Thread Dan Williams
Prepare the raid5 implementation to use async_tx for running stripe
operations:
* biofill (copy data into request buffers to satisfy a read request)
* compute block (generate a missing block in the cache from the other
blocks)
* prexor (subtract existing data as part of the read-modify-write process)
* biodrain (copy data out of request buffers to satisfy a write request)
* postxor (recalculate parity for new data that has entered the cache)
* check (verify that the parity is correct)
* io (submit i/o to the member disks)
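
Dispatch over these operations is a linear pass per stripe; the sketch
below shows the shape of raid5_run_ops implied by the list above (the
ops_run_*() helper names mirror the operation names and are otherwise an
assumption):

	static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
	{
		struct dma_async_tx_descriptor *tx = NULL;

		if (test_bit(STRIPE_OP_BIOFILL, &pending))
			ops_run_biofill(sh);
		if (test_bit(STRIPE_OP_COMPUTE_BLK, &pending))
			tx = ops_run_compute5(sh, pending);
		if (test_bit(STRIPE_OP_PREXOR, &pending))
			tx = ops_run_prexor(sh, tx);
		if (test_bit(STRIPE_OP_BIODRAIN, &pending))
			tx = ops_run_biodrain(sh, tx);
		if (test_bit(STRIPE_OP_POSTXOR, &pending))
			ops_run_postxor(sh, tx);
		if (test_bit(STRIPE_OP_CHECK, &pending))
			ops_run_check(sh);
		if (test_bit(STRIPE_OP_IO, &pending))
			ops_run_io(sh);
	}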

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the workqueue
* call bi_end_io for reads in ops_complete_biofill
* explicitly handle the 2-disk raid5 case (xor becomes memcpy)
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  539 
 include/linux/raid/raid5.h |   63 +
 2 files changed, 599 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ab8702d..0251bca 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -52,6 +52,7 @@
 #include raid6.h
 
 #include linux/raid/bitmap.h
+#include linux/async_tx.h
 
 /*
  * Stripe cache
@@ -324,6 +325,544 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
return sh;
 }
 
+static int
+raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
+static int
+raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
+
+static void ops_run_io(struct stripe_head *sh)
+{
+	raid5_conf_t *conf = sh->raid_conf;
+	int i, disks = sh->disks;
+
+	might_sleep();
+
+	for (i = disks; i--; ) {
+		int rw;
+		struct bio *bi;
+		mdk_rdev_t *rdev;
+		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+			rw = WRITE;
+		else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+			rw = READ;
+		else
+			continue;
+
+		bi = &sh->dev[i].req;
+
+		bi->bi_rw = rw;
+		if (rw == WRITE)
+			bi->bi_end_io = raid5_end_write_request;
+		else
+			bi->bi_end_io = raid5_end_read_request;
+
+		rcu_read_lock();
+		rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = NULL;
+		if (rdev)
+			atomic_inc(&rdev->nr_pending);
+		rcu_read_unlock();
+
+		if (rdev) {
+			if (test_bit(STRIPE_SYNCING, &sh->state) ||
+			    test_bit(STRIPE_EXPAND_SOURCE, &sh->state) ||
+			    test_bit(STRIPE_EXPAND_READY, &sh->state))
+				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
+
+			bi->bi_bdev = rdev->bdev;
+			PRINTK("%s: for %llu schedule op %ld on disc %d\n",
+				__FUNCTION__, (unsigned long long)sh->sector,
+				bi->bi_rw, i);
+			atomic_inc(&sh->count);
+			bi->bi_sector = sh->sector + rdev->data_offset;
+			bi->bi_flags = 1 << BIO_UPTODATE;
+			bi->bi_vcnt = 1;
+			bi->bi_max_vecs = 1;
+			bi->bi_idx = 0;
+			bi->bi_io_vec = &sh->dev[i].vec;
+			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
+			bi->bi_io_vec[0].bv_offset = 0;
+			bi->bi_size = STRIPE_SIZE;
+			bi->bi_next = NULL;
+			if (rw == WRITE &&
+			    test_bit(R5_ReWrite, &sh->dev[i].flags))
+				atomic_add(STRIPE_SECTORS,
+					&rdev->corrected_errors);
+			generic_make_request(bi);
+		} else {
+			if (rw == WRITE)
+				set_bit(STRIPE_DEGRADED, &sh->state);
+			PRINTK("skip op %ld on disc %d for sector %llu\n",
+				bi->bi_rw, i, (unsigned long long)sh->sector);
+			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			set_bit(STRIPE_HANDLE, &sh->state);
+		}
+	}
+}
+
+static struct dma_async_tx_descriptor *
+async_copy_data(int frombio, struct bio *bio, struct page *page,
+	sector_t sector, struct dma_async_tx_descriptor *tx)
+{
+	struct bio_vec *bvl;
+	struct page *bio_page;
+	int i;
+	int page_offset;
+
+	if (bio->bi_sector >= sector

[PATCH 06/16] md: use raid5_run_ops for stripe cache operations

2007-05-02 Thread Dan Williams
Each stripe has three flag variables to reflect the state of operations
(pending, ack, and complete).
-pending: set to request servicing in raid5_run_ops
-ack: set to reflect that raid5_runs_ops has seen this request
-complete: set when the operation is complete and it is ok for handle_stripe5
to clear 'pending' and 'ack'.
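
Sketched as code, the intended lifecycle of one operation bit (names as
in the patch):

	/* 1) request: handle_stripe5, under sh->lock */
	set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
	sh->ops.count++;

	/* 2) acknowledge: get_stripe_work() marks the request as seen */
	test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);

	/* 3) complete: the async callback flags the finished work */
	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);

	/* 4) retire: handle_stripe5 clears all three states together */
	clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
	clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
	clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);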

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   65 +---
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0251bca..14e9f6a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -126,6 +126,7 @@ static void __release_stripe(raid5_conf_t *conf, struct 
stripe_head *sh)
 		}
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
+		BUG_ON(sh->ops.pending);
 		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 			atomic_dec(&conf->preread_active_stripes);
 			if (atomic_read(&conf->preread_active_stripes) <
 			    IO_THRESHOLD)
@@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t 
sector, int pd_idx, int
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
-
+	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+
 	CHECK_DEVLOCK();
 	PRINTK("init_stripe called, stripe %llu\n",
 		(unsigned long long)sh->sector);
@@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
-		if (dev->toread || dev->towrite || dev->written ||
+		if (dev->toread || dev->read || dev->towrite || dev->written ||
 		    test_bit(R5_LOCKED, &dev->flags)) {
-			printk("sector=%llx i=%d %p %p %p %d\n",
+			printk("sector=%llx i=%d %p %p %p %p %d\n",
 			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->towrite, dev->written,
+			       dev->read, dev->towrite, dev->written,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -325,6 +327,43 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
return sh;
 }
 
+/* check_op() ensures that we only dequeue an operation once */
+#define check_op(op) do {\
+	if (test_bit(op, &sh->ops.pending) &&\
+	    !test_bit(op, &sh->ops.complete)) {\
+		if (test_and_set_bit(op, &sh->ops.ack))\
+			clear_bit(op, &pending);\
+		else\
+			ack++;\
+	} else\
+		clear_bit(op, &pending);\
+} while(0)
+
+/* find new work to run, do not resubmit work that is already
+ * in flight
+ */
+static unsigned long get_stripe_work(struct stripe_head *sh)
+{
+	unsigned long pending;
+	int ack = 0;
+
+	pending = sh->ops.pending;
+
+	check_op(STRIPE_OP_BIOFILL);
+	check_op(STRIPE_OP_COMPUTE_BLK);
+	check_op(STRIPE_OP_PREXOR);
+	check_op(STRIPE_OP_BIODRAIN);
+	check_op(STRIPE_OP_POSTXOR);
+	check_op(STRIPE_OP_CHECK);
+	if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
+		ack++;
+
+	sh->ops.count -= ack;
+	BUG_ON(sh->ops.count < 0);
+
+	return pending;
+}
+
 static int
 raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
 static int
@@ -1878,7 +1917,6 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t 
*conf, int disks)
  *schedule a write of some buffers
  *return confirmation of parity correctness
  *
- * Parity calculations are done inside the stripe lock
  * buffers are taken off read_list or write_list, and bh_cache buffers
  * get BH_Lock set before the stripe lock is released.
  *
@@ -1896,10 +1934,11 @@ static void handle_stripe5(struct stripe_head *sh)
int non_overwrite = 0;
int failed_num=0;
struct r5dev *dev;
+   unsigned long pending=0;
 
-	PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count),
-		sh->pd_idx);
+	PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d ops=%lx:%lx:%lx\n",
+	       (unsigned long long)sh->sector, sh->state,
+	       atomic_read(&sh->count), sh->pd_idx,
+	       sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
@@ -2349,8 +2388,14 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 	}
 
+	if (sh->ops.count)
+		pending = get_stripe_work(sh);
+
 	spin_unlock(&sh->lock);
 
+	if (pending)
+		raid5_run_ops(sh, pending);
+
+
while

[PATCH 07/16] md: move write operations to raid5_run_ops

2007-05-02 Thread Dan Williams
handle_stripe sets STRIPE_OP_PREXOR, STRIPE_OP_BIODRAIN, STRIPE_OP_POSTXOR
to request a write to the stripe cache.  raid5_run_ops is triggered to run
and executes the request outside the stripe lock.

Changelog:
* make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil
  Brown
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  151 +---
 1 files changed, 130 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 14e9f6a..03a435d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1807,7 +1807,74 @@ static void compute_block_2(struct stripe_head *sh, int 
dd_idx1, int dd_idx2)
}
 }
 
+static int handle_write_operations5(struct stripe_head *sh, int rcw,
+		int expand)
+{
+	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	int locked = 0;
+
+	if (rcw) {
+		/* skip the drain operation on an expand */
+		if (!expand) {
+			set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+			sh->ops.count++;
+		}
+
+		set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+		sh->ops.count++;
+
+		for (i = disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
+
+			if (dev->towrite) {
+				set_bit(R5_LOCKED, &dev->flags);
+				if (!expand)
+					clear_bit(R5_UPTODATE, &dev->flags);
+				locked++;
+			}
+		}
+	} else {
+		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
+			test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+
+		set_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+		set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+		set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+
+		sh->ops.count += 3;
+
+		for (i = disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
+			if (i == pd_idx)
+				continue;
+
+			/* For a read-modify write there may be blocks that are
+			 * locked for reading while others are ready to be
+			 * written so we distinguish these blocks by the
+			 * R5_Wantprexor bit
+			 */
+			if (dev->towrite &&
+			    (test_bit(R5_UPTODATE, &dev->flags) ||
+			     test_bit(R5_Wantcompute, &dev->flags))) {
+				set_bit(R5_Wantprexor, &dev->flags);
+				set_bit(R5_LOCKED, &dev->flags);
+				clear_bit(R5_UPTODATE, &dev->flags);
+				locked++;
+			}
+		}
+	}
+
+	/* keep the parity disk locked while asynchronous operations
+	 * are in flight
+	 */
+	set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
+	clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+	locked++;
+
+	PRINTK("%s: stripe %llu locked: %d pending: %lx\n",
+		__FUNCTION__, (unsigned long long)sh->sector,
+		locked, sh->ops.pending);
+
+	return locked;
+}
 
 /*
  * Each stripe/dev can have one or more bion attached.
@@ -2170,8 +2237,67 @@ static void handle_stripe5(struct stripe_head *sh)
 		set_bit(STRIPE_HANDLE, &sh->state);
}
 
-   /* now to consider writing and what else, if anything should be read */
-   if (to_write) {
+   /* Now we check to see if any write operations have recently
+* completed
+*/
+
+   /* leave prexor set until postxor is done, allows us to distinguish
+* a rmw from a rcw during biodrain
+*/
+	if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) &&
+	    test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+
+		for (i = disks; i--; )
+			clear_bit(R5_Wantprexor, &sh->dev[i].flags);
+	}
+
+	/* if only POSTXOR is set then this is an 'expand' postxor */
+	if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) &&
+	    test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+
+   /* All the 'written' buffers

[PATCH 08/16] md: move raid5 compute block operations to raid5_run_ops

2007-05-02 Thread Dan Williams
handle_stripe sets STRIPE_OP_COMPUTE_BLK to request servicing from
raid5_run_ops.  It also sets a flag for the block being computed to let
other parts of handle_stripe submit dependent operations.  raid5_run_ops
guarantees that the compute operation completes before any dependent
operation starts.
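
That ordering guarantee is simply the async_tx dependency chain: the
descriptor returned by the compute operation is passed as the dependency
of whatever consumes the block.  A hedged sketch of the idiom (the
buffer names and source count here are illustrative):

	struct dma_async_tx_descriptor *tx;

	/* compute the missing block by xor-ing the surviving blocks */
	tx = async_xor(dest, srcs, 0, count, STRIPE_SIZE,
		       ASYNC_TX_XOR_ZERO_DST, NULL, NULL, NULL);

	/* this copy is not started until the xor above has completed */
	tx = async_memcpy(bio_page, dest, 0, 0, STRIPE_SIZE,
			  ASYNC_TX_DEP_ACK, tx, NULL, NULL);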

Changelog:
* remove the req_compute BUG_ON

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  126 +++-
 1 files changed, 94 insertions(+), 32 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 03a435d..844bd9b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1998,7 +1998,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int non_overwrite = 0;
+   int compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2050,8 +2050,8 @@ static void handle_stripe5(struct stripe_head *sh)
/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
 
-		
 		if (dev->toread) to_read++;
 		if (dev->towrite) {
 			to_write++;
@@ -2206,31 +2206,83 @@ static void handle_stripe5(struct stripe_head *sh)
 * parity, or to satisfy requests
 * or to load a block that is being partially written.
 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks)) ||
-	    expanding) {
-		for (i = disks; i--; ) {
-			dev = &sh->dev[i];
-			if (!test_bit(R5_LOCKED, &dev->flags) &&
-			    !test_bit(R5_UPTODATE, &dev->flags) &&
-			    (dev->toread ||
-			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-			     syncing ||
-			     expanding ||
-			     (failed && (sh->dev[failed_num].toread ||
-					 (sh->dev[failed_num].towrite &&
-					  !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
-			    )
-			   ) {
-				/* we would like to get this block, possibly
-				 * by computing it, but we might not be able to
+	if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) ||
+	    expanding || test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
+
+		/* Clear completed compute operations.  Parity recovery
+		 * (STRIPE_OP_MOD_REPAIR_PD) implies a write-back which is handled
+		 * later on in this routine
+		 */
+		if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
+		    !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack);
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
+		}
+
+		/* look for blocks to read/compute, skip this if a compute
+		 * is already in flight, or if the stripe contents are in the
+		 * midst of changing due to a write
+		 */
+		if (!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+		    !test_bit(STRIPE_OP_PREXOR, &sh->ops.pending) &&
+		    !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+			for (i = disks; i--; ) {
+				dev = &sh->dev[i];
+
+				/* don't schedule compute operations or reads on
+				 * the parity block while a check is in flight
 				 */
-				if (uptodate == disks-1) {
-					PRINTK("Computing block %d\n", i);
-					compute_block(sh, i);
-					uptodate++;
-				} else if (test_bit(R5_Insync, &dev->flags)) {
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					locked++;
-					PRINTK("Reading block %d (sync=%d)\n",
-						i, syncing);
+				if ((i == sh->pd_idx) &&
+				    test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
+					continue;
+
+				if (!test_bit(R5_LOCKED, &dev->flags) &&
+				    !test_bit(R5_UPTODATE, &dev->flags

[PATCH 09/16] md: move raid5 parity checks to raid5_run_ops

2007-05-02 Thread Dan Williams
handle_stripe sets STRIPE_OP_CHECK to request a check operation in
raid5_run_ops.  If raid5_run_ops is able to perform the check with a
dma engine the parity will be preserved in memory removing the need to
re-read it from disk, as is necessary in the synchronous case.

'Repair' operations re-use the same logic as compute block, with the caveat
that the results of the compute block are immediately written back to the
parity disk.  To differentiate these operations the STRIPE_OP_MOD_REPAIR_PD
flag is added.
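
Roughly, the dma-backed check is a zero-sum operation over all blocks
including parity, so the parity page itself is never modified; a sketch
against the async_tx API of this series (source list and completion
handling simplified):

	/* xor data and parity together; a result of zero means the
	 * parity is consistent with the data
	 */
	tx = async_xor_zero_sum(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
				&sh->ops.zero_sum_result, 0, NULL,
				ops_complete_check, sh);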

Changelog:
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   80 
 1 files changed, 61 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 844bd9b..f8a4522 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2430,32 +2430,74 @@ static void handle_stripe5(struct stripe_head *sh)
locked += handle_write_operations5(sh, rcw == 0, 0);
}
 
-	/* maybe we need to check and possibly fix the parity for this stripe
-	 * Any reads will already have been scheduled, so we just see if enough
-	 * data is available
+	/* 1/ Maybe we need to check and possibly fix the parity for this stripe.
+	 *    Any reads will already have been scheduled, so we just see if
+	 *    enough data is available.
+	 * 2/ Hold off parity checks while parity dependent operations are in
+	 *    flight (conflicting writes are protected by the 'locked' variable)
 	 */
-	if (syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state)) {
+	if ((syncing && locked == 0 &&
+	     !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+	     !test_bit(STRIPE_INSYNC, &sh->state)) ||
+	    test_bit(STRIPE_OP_CHECK, &sh->ops.pending) ||
+	    test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+
 		set_bit(STRIPE_HANDLE, &sh->state);
-		if (failed == 0) {
-			BUG_ON(uptodate != disks);
-			compute_parity5(sh, CHECK_PARITY);
-			uptodate--;
-			if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-				/* parity is correct (on disc, not in buffer any more) */
-				set_bit(STRIPE_INSYNC, &sh->state);
-			} else {
-				conf->mddev->resync_mismatches += STRIPE_SECTORS;
-				if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-					/* don't try to repair!! */
-					set_bit(STRIPE_INSYNC, &sh->state);
+		/* Take one of the following actions:
+		 * 1/ start a check parity operation if (uptodate == disks)
+		 * 2/ finish a check parity operation and act on the result
+		 * 3/ skip to the writeback section if we previously
+		 *    initiated a recovery operation
+		 */
+		if (failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+			if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+				BUG_ON(uptodate != disks);
+				clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+				sh->ops.count++;
+				uptodate--;
+			} else if (test_and_clear_bit(STRIPE_OP_CHECK, &sh->ops.complete)) {
+				clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
+				clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
+
+				if (sh->ops.zero_sum_result == 0)
+					/* parity is correct (on disc, not in buffer any more) */
+					set_bit(STRIPE_INSYNC, &sh->state);
 				else {
-					compute_block(sh, sh->pd_idx);
-					uptodate++;
+					conf->mddev->resync_mismatches += STRIPE_SECTORS;
+					if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+						/* don't try to repair!! */
+						set_bit(STRIPE_INSYNC, &sh->state);
+					else {
+						set_bit(STRIPE_OP_COMPUTE_BLK,
+							&sh->ops.pending);
+						set_bit(STRIPE_OP_MOD_REPAIR_PD,
+							&sh->ops.pending);
+						set_bit(R5_Wantcompute,
+							&sh->dev[sh->pd_idx].flags

[PATCH 10/16] md: satisfy raid5 read requests via raid5_run_ops

2007-05-02 Thread Dan Williams
Use raid5_run_ops to carry out the memory copies for a raid5 read request.

Changelog:
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache
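
In outline the biofill pass reuses the async_copy_data() helper from
patch 5 to move cached data out to the waiting bios; a simplified sketch
of ops_run_biofill (locking and the dev->read bookkeeping omitted):

	for (i = disks; i--; ) {
		struct r5dev *dev = &sh->dev[i];
		if (test_bit(R5_Wantfill, &dev->flags)) {
			struct bio *rbi = dev->toread;
			dev->toread = NULL;
			while (rbi && rbi->bi_sector <
					dev->sector + STRIPE_SECTORS) {
				tx = async_copy_data(0, rbi, dev->page,
						dev->sector, tx);
				rbi = r5_next_bio(rbi, dev->sector);
			}
		}
	}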

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   61 ++--
 1 files changed, 30 insertions(+), 31 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8a4522..6bde174 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1998,7 +1998,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int compute=0, req_compute=0, non_overwrite=0;
+   int to_fill=0, compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2022,37 +2022,29 @@ static void handle_stripe5(struct stripe_head *sh)
dev = sh-dev[i];
clear_bit(R5_Insync, dev-flags);
 
-		PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
-		/* maybe we can reply to a read */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-			struct bio *rbi, *rbi2;
-			PRINTK("Return read for disc %d\n", i);
-			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&conf->wait_for_overlap);
-			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
-		}
+		PRINTK("check %d: state 0x%lx toread %p read %p write %p written %p\n",
+			i, dev->flags, dev->toread, dev->read, dev->towrite,
+			dev->written);
+
+		/* maybe we can request a biofill operation
+		 *
+		 * new wantfill requests are only permitted while
+		 * STRIPE_OP_BIOFILL is clear
+		 */
+		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+		    !test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+
+		if (test_bit(R5_Wantfill, &dev->flags))
+			to_fill++;
+		else if (dev->toread)
+			to_read++;
+
 		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
 
-		if (dev->toread) to_read++;
 		if (dev->towrite) {
 			to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
@@ -2073,9 +2065,13 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(R5_Insync, dev-flags);
}
rcu_read_unlock();
+
+	if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+		sh->ops.count++;
+
 	PRINTK("locked=%d uptodate=%d to_read=%d"
-		" to_write=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, failed, failed_num);
+		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
+		locked, uptodate, to_read, to_write, to_fill, failed,
+		failed_num);
/* check if the array has lost two devices and, if so, some requests 
might
 * need to be failed
 */
@@ -2127,9 +2123,12 @@ static void handle_stripe5(struct stripe_head *sh)
bi = bi2;
}
 
-		/* fail any reads if this device is non-operational */
-		if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
-		    test_bit(R5_ReadError, &sh->dev[i].flags)) {
+		/* fail any reads if this device is non-operational and
+		 * the data has not reached the cache yet.
+		 */
+		if (!test_bit(R5_Wantfill, &sh->dev[i].flags

[PATCH 11/16] md: use async_tx and raid5_run_ops for raid5 expansion operations

2007-05-02 Thread Dan Williams
The parity calculation for an expansion operation is the same as the
calculation performed at the end of a write with the caveat that all blocks
in the stripe are scheduled to be written.  An expansion operation is
identified as a stripe with the POSTXOR flag set and the BIODRAIN flag not
set.

The bulk copy operation to the new stripe is handled inline by async_tx.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   48 
 1 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6bde174..1966713 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2538,18 +2538,32 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
-	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
-		/* Need to write out all blocks after computing parity */
-		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
-			conf->raid_disks);
-		compute_parity5(sh, RECONSTRUCT_WRITE);
+	/* Finish postxor operations initiated by the expansion
+	 * process
+	 */
+	if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
+	    !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
 		for (i = conf->raid_disks; i--; ) {
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 		}
-		clear_bit(STRIPE_EXPANDING, &sh->state);
-	} else if (expanded) {
+	}
+
+	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	    !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+		/* Need to write out all blocks after computing parity */
+		sh->disks = conf->raid_disks;
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+			conf->raid_disks);
+		locked += handle_write_operations5(sh, 0, 1);
+	} else if (expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
clear_bit(STRIPE_EXPAND_READY, sh-state);
atomic_dec(conf-reshape_stripes);
wake_up(conf-wait_for_overlap);
@@ -2560,6 +2574,7 @@ static void handle_stripe5(struct stripe_head *sh)
/* We have read all the blocks in this stripe and now we need to
 * copy some of them into a target stripe for expand.
 */
+   struct dma_async_tx_descriptor *tx = NULL;
 		clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		for (i = 0; i < sh->disks; i++)
 			if (i != sh->pd_idx) {
@@ -2583,9 +2598,12 @@ static void handle_stripe5(struct stripe_head *sh)
release_stripe(sh2);
continue;
}
-				memcpy(page_address(sh2->dev[dd_idx].page),
-				       page_address(sh->dev[i].page),
-				       STRIPE_SIZE);
+
+				/* place all the copies on one channel */
+				tx = async_memcpy(sh2->dev[dd_idx].page,
+					sh->dev[i].page, 0, 0, STRIPE_SIZE,
+					ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
 				set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
 				set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 				for (j = 0; j < conf->raid_disks; j++)
@@ -2597,6 +2615,12 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(STRIPE_HANDLE, sh2-state);
}
release_stripe(sh2);
+
+				/* done submitting copies, wait for them to complete */
+				if (i + 1 >= sh->disks) {
+					async_tx_ack(tx);
+					dma_wait_for_async_tx(tx);
+				}
}
}
 
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 12/16] md: move raid5 io requests to raid5_run_ops

2007-05-02 Thread Dan Williams
handle_stripe now only updates the state of stripes.  All execution of
operations is moved to raid5_run_ops.

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   68 
 1 files changed, 10 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1966713..c9b91e3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2388,6 +2388,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					PRINTK("Read_old block %d for r-m-w\n", i);
 					set_bit(R5_LOCKED, &dev->flags);
 					set_bit(R5_Wantread, &dev->flags);
+					if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+						sh->ops.count++;
 					locked++;
 				} else {
 					set_bit(STRIPE_DELAYED, &sh->state);
@@ -2408,6 +2410,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					PRINTK("Read_old block %d for Reconstruct\n", i);
 					set_bit(R5_LOCKED, &dev->flags);
 					set_bit(R5_Wantread, &dev->flags);
+					if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+						sh->ops.count++;
 					locked++;
 				} else {
 					set_bit(STRIPE_DELAYED, &sh->state);
@@ -2506,6 +2510,8 @@ static void handle_stripe5(struct stripe_head *sh)
 
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			clear_bit(STRIPE_DEGRADED, &sh->state);
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
@@ -2527,12 +2533,16 @@ static void handle_stripe5(struct stripe_head *sh)
 		dev = &sh->dev[failed_num];
 		if (!test_bit(R5_ReWrite, &dev->flags)) {
 			set_bit(R5_Wantwrite, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			set_bit(R5_ReWrite, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
 			locked++;
 		} else {
 			/* let's read it back */
 			set_bit(R5_Wantread, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			set_bit(R5_LOCKED, &dev->flags);
 			locked++;
 		}
@@ -2642,64 +2652,6 @@ static void handle_stripe5(struct stripe_head *sh)
 			  test_bit(BIO_UPTODATE, &bi->bi_flags)
 				? 0 : -EIO);
 	}
-	for (i = disks; i--; ) {
-		int rw;
-		struct bio *bi;
-		mdk_rdev_t *rdev;
-		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
-			rw = WRITE;
-		else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
-			rw = READ;
-		else
-			continue;
-
-		bi = &sh->dev[i].req;
-
-		bi->bi_rw = rw;
-		if (rw == WRITE)
-			bi->bi_end_io = raid5_end_write_request;
-		else
-			bi->bi_end_io = raid5_end_read_request;
-
-		rcu_read_lock();
-		rdev = rcu_dereference(conf->disks[i].rdev);
-		if (rdev && test_bit(Faulty, &rdev->flags))
-			rdev = NULL;
-		if (rdev)
-			atomic_inc(&rdev->nr_pending);
-		rcu_read_unlock();
-
-		if (rdev) {
-			if (syncing || expanding || expanded)
-				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
-
-			bi->bi_bdev = rdev->bdev;
-			PRINTK("for %llu schedule op %ld on disc %d\n",
-				(unsigned long long)sh->sector, bi->bi_rw, i);
-			atomic_inc(&sh->count);
-			bi->bi_sector = sh->sector + rdev->data_offset;
-			bi->bi_flags = 1 << BIO_UPTODATE;
-			bi->bi_vcnt = 1;
-			bi->bi_max_vecs = 1;
-			bi->bi_idx = 0;
-			bi->bi_io_vec = &sh->dev

[PATCH 13/16] md: remove raid5 compute_block and compute_parity5

2007-05-02 Thread Dan Williams
replaced by raid5_run_ops

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  124 
 1 files changed, 0 insertions(+), 124 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c9b91e3..74ce354 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1501,130 +1501,6 @@ static void copy_data(int frombio, struct bio *bio,
   }   \
} while(0)
 
-
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-	int i, count, disks = sh->disks;
-	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-
-	PRINTK("compute_block, stripe %llu, idx %d\n",
-		(unsigned long long)sh->sector, dd_idx);
-
-	dest = page_address(sh->dev[dd_idx].page);
-	memset(dest, 0, STRIPE_SIZE);
-	count = 0;
-	for (i = disks; i--; ) {
-		if (i == dd_idx)
-			continue;
-		p = page_address(sh->dev[i].page);
-		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-			ptr[count++] = p;
-		else
-			printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-				" not present\n", dd_idx,
-				(unsigned long long)sh->sector, i);
-
-		check_xor();
-	}
-	if (count)
-		xor_block(count, STRIPE_SIZE, dest, ptr);
-	set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-	void *ptr[MAX_XOR_BLOCKS], *dest;
-	struct bio *chosen;
-
-	PRINTK("compute_parity5, stripe %llu, method %d\n",
-		(unsigned long long)sh->sector, method);
-
-	count = 0;
-	dest = page_address(sh->dev[pd_idx].page);
-	switch(method) {
-	case READ_MODIFY_WRITE:
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-		for (i = disks; i--; ) {
-			if (i == pd_idx)
-				continue;
-			if (sh->dev[i].towrite &&
-			    test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-				check_xor();
-			}
-		}
-		break;
-	case RECONSTRUCT_WRITE:
-		memset(dest, 0, STRIPE_SIZE);
-		for (i = disks; i--; )
-			if (i != pd_idx && sh->dev[i].towrite) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
-			}
-		break;
-	case CHECK_PARITY:
-		break;
-	}
-	if (count) {
-		xor_block(count, STRIPE_SIZE, dest, ptr);
-		count = 0;
-	}
-
-	for (i = disks; i--; )
-		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
-			struct bio *wbi = sh->dev[i].written;
-			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-				copy_data(1, wbi, sh->dev[i].page, sector);
-				wbi = r5_next_bio(wbi, sector);
-			}
-
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-		}
-
-	switch(method) {
-	case RECONSTRUCT_WRITE:
-	case CHECK_PARITY:
-		for (i = disks; i--; )
-			if (i != pd_idx) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-		break;
-	case READ_MODIFY_WRITE:
-		for (i = disks; i--; )
-			if (sh->dev[i].written) {
-				ptr[count++] = page_address(sh->dev[i].page);
-				check_xor();
-			}
-	}
-	if (count)
-		xor_block(count

[PATCH 14/16] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-05-02 Thread Dan Williams
This is a driver for the iop DMA/AAU/ADMA units which are capable of pq_xor,
pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy
operations.

Changelog:
* fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
slots to be requested eventually leading to data corruption
* enabled the slot allocation routine to attempt to free slots before
returning -ENOMEM
* switched the cleanup routine to solely use the software chain and the
status register to determine if a descriptor is complete.  This is
necessary to support other IOP engines that do not have status writeback
capability
* make the driver iop generic
* modified the allocation routines to understand allocating a group of
slots for a single operation
* added a null xor initialization operation for the xor only channel on
iop3xx
* support xor operations on buffers larger than the hardware maximum
* split the do_* routines into separate prep, src/dest set, submit stages
* added async_tx support (dependent operations initiation at cleanup time)
* simplified group handling
* added interrupt support (callbacks via tasklets)
* brought the pending depth inline with ioat (i.e. 4 descriptors)
* drop dma mapping methods, suggested by Chris Leech
* don't use inline in C files, Adrian Bunk
* remove static tasklet declarations
* make iop_adma_alloc_slots easier to read and remove chances for a
corrupted descriptor chain
* fix locking bug in iop_adma_alloc_chan_resources, Benjamin Herrenschmidt
* convert capabilities over to dma_cap_mask_t
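
For orientation, the bitmap-based capabilities mentioned in the last
changelog entry are advertised to dmaengine at probe time roughly like
this (a sketch; the dma_dev variable name is assumed rather than quoted
from the patch):

	/* announce what this channel can do, then register it */
	dma_cap_set(DMA_MEMCPY, dma_dev->cap_mask);
	dma_cap_set(DMA_XOR, dma_dev->cap_mask);
	dma_cap_set(DMA_ZERO_SUM, dma_dev->cap_mask);
	dma_async_device_register(dma_dev);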

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/dma/Kconfig |8 
 drivers/dma/Makefile|1 
 drivers/dma/iop-adma.c  | 1464 +++
 include/asm-arm/hardware/iop_adma.h |  121 +++
 4 files changed, 1594 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 292ddad..1c2ae4e 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -40,4 +40,12 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
+
+config INTEL_IOP_ADMA
+	tristate "Intel IOP ADMA support"
+	depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX)
+	default m
+	---help---
+	  Enable support for the Intel(R) IOP Series RAID engines.
+
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 6a99341..8ebf10d 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
new file mode 100644
index 000..0d85f12
--- /dev/null
+++ b/drivers/dma/iop-adma.c
@@ -0,0 +1,1464 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This driver supports the asynchronous DMA copy and RAID engines available
+ * on the Intel Xscale(R) family of I/O Processors (IOP 32x, 33x, 134x)
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/async_tx.h>
+#include <linux/delay.h>
+#include <linux/dma-mapping.h>
+#include <linux/spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/platform_device.h>
+#include <asm/arch/adma.h>
+#include <asm/memory.h>
+
+#define to_iop_adma_chan(chan) \
+	container_of(chan, struct iop_adma_chan, common)
+#define to_iop_adma_device(dev) \
+	container_of(dev, struct iop_adma_device, common)
+#define tx_to_iop_adma_slot(tx) \
+	container_of(tx, struct iop_adma_desc_slot, async_tx)
+
+/**
+ * iop_adma_free_slots - flags descriptor slots for reuse
+ * @slot: Slot to free
+ * Caller must hold iop_chan-lock while calling this function
+ */
+static void iop_adma_free_slots(struct iop_adma_desc_slot *slot)
+{
+   int stride = slot-slots_per_op;
+
+   while (stride--) {
+   slot-slots_per_op = 0;
+   slot = list_entry(slot-slot_node.next,
+   struct iop_adma_desc_slot

[PATCH 15/16] iop13xx: Surface the iop13xx adma units to the iop-adma driver

2007-05-02 Thread Dan Williams
Adds the platform device definitions and the architecture specific
support routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* added 'descriptor pool size' to the platform data
* add base support for buffer sizes larger than 16MB (hw max)
* build error fix from Kirill A. Shutemov
* rebase for async_tx changes
* add interrupt support
* do not call platform register macros in driver code

Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/mach-iop13xx/setup.c  |  208 
 include/asm-arm/arch-iop13xx/adma.h|  545 
 include/asm-arm/arch-iop13xx/iop13xx.h |   34 +-
 3 files changed, 766 insertions(+), 21 deletions(-)

diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c
index 9a46bcd..662d1e2 100644
--- a/arch/arm/mach-iop13xx/setup.c
+++ b/arch/arm/mach-iop13xx/setup.c
@@ -25,6 +25,7 @@
 #include asm/hardware.h
 #include asm/irq.h
 #include asm/io.h
+#include asm/hardware/iop_adma.h
 
 #define IOP13XX_UART_XTAL 4000
 #define IOP13XX_SETUP_DEBUG 0
@@ -236,6 +237,129 @@ static unsigned long iq8134x_probe_flash_size(void)
 }
 #endif
 
+/* ADMA Channels */
+static struct resource iop13xx_adma_0_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(0),
+   .end = IOP13XX_ADMA_UPPER_PA(0),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA0_EOT,
+   .end = IRQ_IOP13XX_ADMA0_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA0_EOC,
+   .end = IRQ_IOP13XX_ADMA0_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA0_ERR,
+   .end = IRQ_IOP13XX_ADMA0_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_1_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(1),
+   .end = IOP13XX_ADMA_UPPER_PA(1),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA1_EOT,
+   .end = IRQ_IOP13XX_ADMA1_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA1_EOC,
+   .end = IRQ_IOP13XX_ADMA1_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA1_ERR,
+   .end = IRQ_IOP13XX_ADMA1_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_2_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(2),
+   .end = IOP13XX_ADMA_UPPER_PA(2),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA2_EOT,
+   .end = IRQ_IOP13XX_ADMA2_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA2_EOC,
+   .end = IRQ_IOP13XX_ADMA2_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA2_ERR,
+   .end = IRQ_IOP13XX_ADMA2_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static u64 iop13xx_adma_dmamask = DMA_64BIT_MASK;
+static struct iop_adma_platform_data iop13xx_adma_0_data = {
+   .hw_id = 0,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_1_data = {
+   .hw_id = 1,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_2_data = {
+   .hw_id = 2,
+   .pool_size = PAGE_SIZE,
+};
+
+/* The ids are fixed up later in iop13xx_platform_init */
+static struct platform_device iop13xx_adma_0_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_0_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop13xx_adma_0_data,
+	},
+};
+
+static struct platform_device iop13xx_adma_1_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_1_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop13xx_adma_1_data,
+	},
+};
+
+static struct platform_device iop13xx_adma_2_channel = {
+	.name = "iop-adma",
+	.id = 0,
+	.num_resources = 4,
+	.resource = iop13xx_adma_2_resources,
+	.dev = {
+		.dma_mask = &iop13xx_adma_dmamask,
+		.coherent_dma_mask = DMA_64BIT_MASK,
+		.platform_data = (void *) &iop13xx_adma_2_data,
+	},
+};
+
 void __init iop13xx_map_io

[PATCH 16/16] iop3xx: Surface the iop3xx DMA and AAU units to the iop-adma driver

2007-05-02 Thread Dan Williams
Adds the platform device definitions and the architecture specific support
routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for  1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous, only
hardware descriptors are contiguous
for up to a PAGE_SIZE buffer size
* convert to async_tx
* add interrupt support
* add platform devices for 80219 boards
* do not call platform register macros in driver code
* remove switch() statements for compatible register offsets/layouts
* change over to bitmap based capabilities

Cc: Russell King [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 arch/arm/mach-iop32x/glantank.c|2 
 arch/arm/mach-iop32x/iq31244.c |5 
 arch/arm/mach-iop32x/iq80321.c |3 
 arch/arm/mach-iop32x/n2100.c   |2 
 arch/arm/mach-iop33x/iq80331.c |3 
 arch/arm/mach-iop33x/iq80332.c |3 
 arch/arm/plat-iop/Makefile |2 
 arch/arm/plat-iop/adma.c   |  216 
 include/asm-arm/arch-iop32x/adma.h |5 
 include/asm-arm/arch-iop33x/adma.h |5 
 include/asm-arm/hardware/iop3xx-adma.h |  893 
 include/asm-arm/hardware/iop3xx.h  |   68 --
 12 files changed, 1147 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c
index 45f4f13..2e0099b 100644
--- a/arch/arm/mach-iop32x/glantank.c
+++ b/arch/arm/mach-iop32x/glantank.c
@@ -180,6 +180,8 @@ static void __init glantank_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&glantank_flash_device);
 	platform_device_register(&glantank_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
 	pm_power_off = glantank_power_off;
 }
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index 60e7430..c0d077c 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -295,9 +295,14 @@ static void __init iq31244_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&iq31244_flash_device);
 	platform_device_register(&iq31244_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
 	if (is_ep80219())
 		pm_power_off = ep80219_power_off;
+
+	if (!is_80219())
+		platform_device_register(&iop3xx_aau_channel);
 }
 
 static int __init force_ep80219_setup(char *str)
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index 361c70c..474ec2a 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -180,6 +180,9 @@ static void __init iq80321_init_machine(void)
 	platform_device_register(&iop3xx_i2c1_device);
 	platform_device_register(&iq80321_flash_device);
 	platform_device_register(&iq80321_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80321, Intel IQ80321)
diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c
index 5f07344..8e6fe13 100644
--- a/arch/arm/mach-iop32x/n2100.c
+++ b/arch/arm/mach-iop32x/n2100.c
@@ -245,6 +245,8 @@ static void __init n2100_init_machine(void)
 	platform_device_register(&iop3xx_i2c0_device);
 	platform_device_register(&n2100_flash_device);
 	platform_device_register(&n2100_serial_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = n2100_power_off;
 
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 1a9e361..b4d12bf 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -135,6 +135,9 @@ static void __init iq80331_init_machine(void)
 	platform_device_register(&iop33x_uart0_device);
 	platform_device_register(&iop33x_uart1_device);
 	platform_device_register(&iq80331_flash_device);
+	platform_device_register(&iop3xx_dma_0_channel);
+	platform_device_register(&iop3xx_dma_1_channel);
+	platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80331, Intel IQ80331)
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index 96d6f0f..2abb2d8 100644
--- a/arch/arm/mach-iop33x/iq80332.c
+++ b/arch/arm/mach-iop33x/iq80332.c
@@ -135,6 +135,9 @@ static void __init iq80332_init_machine(void)
 	platform_device_register(&iop33x_uart0_device

Re: RAID rebuild on Create

2007-04-30 Thread Dan Williams

On 4/30/07, Jan Engelhardt [EMAIL PROTECTED] wrote:

Hi list,


when a user does `mdadm -C /dev/md0 -l <any level> -n <whatever fits>
<devices>`, the array gets rebuilt for at least RAID1 and RAID5, even if
the disk contents are most likely not of importance (otherwise we would
not be creating a raid array right now). Could not this needless resync
be skipped - what do you think?


If you want this behavior you can always create the array with a
'missing' device to hold off the resync process.  Otherwise, if all
disks are available, why not let the array make forward progress to a
protected state?
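
For example, something like (device names illustrative):

  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 missing

creates the array degraded, so no initial resync runs; adding the last
disk later triggers a normal rebuild.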

Also, the resync thread automatically yields to new data coming into
the array, so you can effectively sync an array by writing to all the
blocks.



Jan


--
Dan
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 0/4] raid5: write-back caching policy and write performance

2007-04-11 Thread Dan Williams
These patches are presented to further the discussion of raid5 write
performance and are not yet meant for mainline or -mm inclusion.  Raz's
delayed-activation patch showed interesting results so it has been
ported/included in this series.  The question to be answered is whether
the sequential write performance of a raid5 array, out of the box, can
approach that of a similarly configured raid0 array (minus one disk).
Currently, on an iop13xx platform, tiobench is reporting a 2x advantage
for the N-1 raid0 array, so it seems there is room for improvement.

The third patch in the series adds a write-back caching capability to md
to investigate the raw throughput to the stripe cache.  Since battery
backed memory is not being used this patch makes the system markedly
less safe, so only use it with data that can be thrown away.  Initial
testing with dd shows the performance of this policy can be ~1.8x that
of the default write-through policy.  That is, when the data set is
smaller than the cache size.  Once cache pressure begins to force the
writes to disk performance drops well below the write-through case.  So
work remains to be done to see how the write-through case achieves
better sustained throughput numbers.

I am interested in the performance of these patches on other
platforms/configurations and comments on the implementation.

[ based on 2.6.21-rc6 + git-md-accel.patch from -mm ]
  md: introduce struct stripe_head_state
  md: refactor raid5 cache policy code using 'struct stripe_cache_policy'
  md: writeback caching policy for raid5 [experimental]
  md: delayed stripe activation

The patches can also be pulled via git:
git pull git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

--
Dan
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]

2007-04-11 Thread Dan Williams
In write-through mode bi_end_io is called once writes to the data disk(s)
and the parity disk have completed.

In write-back mode bi_end_io is called immediately after data has been
copied into the stripe cache, which also causes the stripe to be marked
dirty.  The STRIPE_DIRTY state implies that parity will need to be
reconstructed at eviction time.  In other words, the read-modify-write case
implemented for write-through mode is not supported; all writes are
reconstruct-writes.  An eviction brings the backing disks up to date with
data in the cache.  A dirty stripe is set for eviction when a new stripe
needs to be activated and there are no stripes on the inactive list.  All
dirty stripes are evicted when the array is being shutdown.

In its current implementation write-back mode acknowledges writes before
they have reached non-volatile media.  Unclean shutdowns will result in
filesystem corruption.
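
A rough sketch of the eviction trigger described above (the flag names
follow the patch, the surrounding logic is condensed and hypothetical):

	/* under cache pressure, push one dirty stripe toward disk;
	 * write-back mode always reconstructs parity at eviction
	 */
	if (test_bit(STRIPE_DIRTY, &sh->state) &&
	    !test_and_set_bit(STRIPE_EVICT, &sh->state))
		set_bit(STRIPE_HANDLE, &sh->state);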

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/Kconfig |   13 ++
 drivers/md/md.c|2 
 drivers/md/raid5.c |  354 
 include/linux/raid/md_k.h  |2 
 include/linux/raid/raid5.h |   31 
 5 files changed, 400 insertions(+), 2 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 79a361e..7ab6c55 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -138,6 +138,19 @@ config MD_RAID456
 
  If unsure, say Y.
 
+config RAID5_CACHE_POLICY_WRITE_BACK
+	bool "EXPERIMENTAL: Set the raid cache policy to write-back"
+	default n
+	depends on EXPERIMENTAL && MD_RAID456
+	---help---
+	  Enable this feature if you want to test this experimental
+	  caching policy instead of the default write-through.
+	  Do not enable this on a system with data that you care
+	  about.  Filesystem corruption will occur if an array in write-back
+	  mode is not shutdown cleanly.
+
+	  If unsure, say N.
+
 config MD_RAID5_RESHAPE
 	bool "Support adding drives to a raid-5 array"
 	depends on MD_RAID456
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 509171c..b83f434 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3344,6 +3344,8 @@ static int do_md_stop(mddev_t * mddev, int mode)
break;
case 0: /* disassemble */
case 2: /* stop */
+		if (mddev->pers->cache_flush)
+			mddev->pers->cache_flush(mddev);
bitmap_flush(mddev);
md_super_wait(mddev);
if (mddev-ro)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3b32a19..1a2d6b5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -267,6 +267,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
 int pd_idx, int noblock)
 {
struct stripe_head *sh;
+	struct stripe_cache_policy *cp = conf->cache_policy;
 
 	PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector);
 
@@ -280,6 +281,8 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
+			if (!sh && cp->try_to_free_stripe)
+				cp->try_to_free_stripe(conf, 0);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
@@ -299,7 +302,8 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
 		if (atomic_read(&sh->count)) {
 			BUG_ON(!list_empty(&sh->lru));
 		} else {
-			if (!test_bit(STRIPE_HANDLE, &sh->state))
+			if (!test_bit(STRIPE_HANDLE, &sh->state) &&
+			    !test_bit(STRIPE_EVICT, &sh->state))
 				atomic_inc(&conf->active_stripes);
 			if (list_empty(&sh->lru) &&
 			    !test_bit(STRIPE_EXPANDING, &sh->state))
@@ -668,6 +672,8 @@ ops_run_biodrain(struct stripe_head *sh, struct 
dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
+	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_cache_policy *cp = conf->cache_policy;
 
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
@@ -688,7 +694,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 			towrite = 1;
 		} else { /* rcw */
 			if (i != pd_idx && dev->towrite &&
-			    test_bit(R5_LOCKED, &dev->flags))
+			    (test_bit

[PATCH RFC 1/4] md: introduce struct stripe_head_state

2007-04-11 Thread Dan Williams
struct stripe_head_state collects all the dynamic stripe-state information
that is calculated/tracked during calls to handle_stripe.  This enables a
mechanism for handle_stripe functionality to be broken off into
subroutines.
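
From the conversions in the diff, the new structure gathers roughly
these per-call counters and flags (a sketch; the authoritative
definition lands in include/linux/raid/raid5.h):

	struct stripe_head_state {
		int syncing, expanding, expanded;
		int locked, uptodate, to_read, to_write, failed, written;
		int to_fill, compute, req_compute, non_overwrite;
		int failed_num;
	};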

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  280 ++--
 include/linux/raid/raid5.h |   11 ++
 2 files changed, 153 insertions(+), 138 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 74ce354..684552a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1872,12 +1872,14 @@ static void handle_stripe5(struct stripe_head *sh)
struct bio *return_bi= NULL;
struct bio *bi;
int i;
-   int syncing, expanding, expanded;
-   int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int to_fill=0, compute=0, req_compute=0, non_overwrite=0;
-   int failed_num=0;
+   struct stripe_head_state s = {
+   .locked=0, .uptodate=0, .to_read=0, .to_write=0, .failed=0,
+   .written=0, .to_fill=0, .compute=0, .req_compute=0,
+   .non_overwrite=0,
+   };
struct r5dev *dev;
unsigned long pending=0;
+   s.failed_num=0;
 
PRINTK(handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d 
ops=%lx:%lx:%lx\n,
   (unsigned long long)sh-sector, sh-state, 
atomic_read(sh-count),
@@ -1887,9 +1889,9 @@ static void handle_stripe5(struct stripe_head *sh)
clear_bit(STRIPE_HANDLE, sh-state);
clear_bit(STRIPE_DELAYED, sh-state);
 
-	syncing = test_bit(STRIPE_SYNCING, &sh->state);
-	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
+	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
 
rcu_read_lock();
@@ -1911,22 +1913,22 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(R5_Wantfill, dev-flags);
 
/* now count some things */
-		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
-		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
+		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
 
 		if (test_bit(R5_Wantfill, &dev->flags))
-			to_fill++;
+			s.to_fill++;
 		else if (dev->toread)
-			to_read++;
+			s.to_read++;
 
-		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
+		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++s.compute > 1);
 
 		if (dev->towrite) {
-			to_write++;
+			s.to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
-				non_overwrite++;
+				s.non_overwrite++;
 		}
-		if (dev->written) written++;
+		if (dev->written) s.written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
/* The ReadError flag will just be confusing now */
@@ -1935,23 +1937,24 @@ static void handle_stripe5(struct stripe_head *sh)
}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
 		    || test_bit(R5_ReadError, &dev->flags)) {
-			failed++;
-			failed_num = i;
+			s.failed++;
+			s.failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
 	rcu_read_unlock();
 
-	if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+	if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 		sh->ops.count++;
 
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, to_fill, failed, failed_num);
+		s.locked, s.uptodate, s.to_read, s.to_write, s.to_fill,
+		s.failed, s.failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
-	if (failed > 1 && to_read+to_write+written) {
+	if (s.failed > 1 && s.to_read+s.to_write+s.written) {
for (i=disks; i--; ) {
int bitmap_end = 0;
 
@@ -1969,7 +1972,7 @@ static void handle_stripe5(struct stripe_head *sh)
/* fail all writes first */
bi = sh-dev[i].towrite;
sh-dev[i].towrite = NULL

[PATCH RFC 2/4] md: refactor raid5 cache policy code using 'struct stripe_cache_policy'

2007-04-11 Thread Dan Williams
struct stripe_cache_policy is introduced as an interface to enable multiple
caching policies.  It adds several methods to be called when cache events
occur.  See the definition of stripe_cache_policy in
include/linux/raid/raid5.h.  This patch does not add any new caching
policies, it just moves the current code to a new location and calls it by
a struct stripe_cache_policy method.
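
Pieced together from the call sites in the diffs, the interface looks
roughly like this (a sketch, not the verbatim header; note that the
cache_flush hook hangs off the md personality in the md.c hunk rather
than off this struct):

	struct stripe_cache_policy {
		/* returns non-zero if it queued the stripe itself */
		int (*release_stripe)(raid5_conf_t *conf,
				struct stripe_head *sh, int handle);
		/* called when get_free_stripe() comes up empty */
		void (*try_to_free_stripe)(raid5_conf_t *conf, int flags);
		/* preread hold-off used by the delayed-activation patch */
		atomic_t deadline_ms;
	};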

Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |  644 +---
 include/linux/raid/raid5.h |   82 +-
 2 files changed, 446 insertions(+), 280 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 684552a..3b32a19 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -112,11 +112,12 @@ static void __release_stripe(raid5_conf_t *conf, struct 
stripe_head *sh)
 	if (atomic_dec_and_test(&sh->count)) {
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
+		if (conf->cache_policy->release_stripe(conf, sh,
+				test_bit(STRIPE_HANDLE, &sh->state)))
+			return; /* stripe was moved to a cache policy specific queue */
+
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->delayed_list);
-				blk_plug_device(conf->mddev->queue);
-			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 			   sh->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 				blk_plug_device(conf->mddev->queue);
@@ -125,23 +126,11 @@ static void __release_stripe(raid5_conf_t *conf, struct 
stripe_head *sh)
list_add_tail(sh-lru, conf-handle_list);
}
md_wakeup_thread(conf-mddev-thread);
-   } else {
-   BUG_ON(sh-ops.pending);
-   if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, 
sh-state)) {
-   atomic_dec(conf-preread_active_stripes);
-   if (atomic_read(conf-preread_active_stripes) 
 IO_THRESHOLD)
-   md_wakeup_thread(conf-mddev-thread);
-   }
-   atomic_dec(conf-active_stripes);
-   if (!test_bit(STRIPE_EXPANDING, sh-state)) {
-   list_add_tail(sh-lru, conf-inactive_list);
-   wake_up(conf-wait_for_stripe);
-   if (conf-retry_read_aligned)
-   md_wakeup_thread(conf-mddev-thread);
-   }
-   }
+   } else
+   BUG();
}
 }
+
 static void release_stripe(struct stripe_head *sh)
 {
raid5_conf_t *conf = sh-raid_conf;
@@ -724,39 +713,6 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	return tx;
 }
 
-static void ops_complete_postxor(void *stripe_head_ref)
-{
-	struct stripe_head *sh = stripe_head_ref;
-
-	PRINTK("%s: stripe %llu\n", __FUNCTION__,
-		(unsigned long long)sh->sector);
-
-	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
-	set_bit(STRIPE_HANDLE, &sh->state);
-	release_stripe(sh);
-}
-
-static void ops_complete_write(void *stripe_head_ref)
-{
-	struct stripe_head *sh = stripe_head_ref;
-	int disks = sh->disks, i, pd_idx = sh->pd_idx;
-
-	PRINTK("%s: stripe %llu\n", __FUNCTION__,
-		(unsigned long long)sh->sector);
-
-	for (i=disks ; i-- ;) {
-		struct r5dev *dev = &sh->dev[i];
-		if (dev->written || i == pd_idx)
-			set_bit(R5_UPTODATE, &dev->flags);
-	}
-
-	set_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
-	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
-
-	set_bit(STRIPE_HANDLE, &sh->state);
-	release_stripe(sh);
-}
-
 static void
 ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
@@ -764,6 +720,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	int disks = sh->disks;
 	struct page *xor_srcs[disks];
 
+	raid5_conf_t *conf = sh->raid_conf;
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
@@ -792,9 +749,8 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 		}
 	}
 
-	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
-		ops_complete_write : ops_complete_postxor;

[PATCH RFC 4/4] md: delayed stripe activation

2007-04-11 Thread Dan Williams
based on a patch by: Raz Ben-Jehuda(caro) [EMAIL PROTECTED]
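
The idea, as a minimal sketch (here "cp" stands for conf->cache_policy, and
deadline_ms is the knob visible in the init_stripe hunk below): every stripe
records a preread deadline when it is initialized, and the write-through
policy refuses to activate a delayed stripe for preread until that deadline
has passed:

	/* at stripe init: push the preread deadline out by the
	 * policy's configurable grace period */
	sh->active_preread_jiffies =
		msecs_to_jiffies(atomic_read(&cp->deadline_ms)) + jiffies;

	/* when considering a delayed stripe for activation */
	if (time_before(jiffies, sh->active_preread_jiffies))
		return sh;	/* too early: hand it back and retry later */

A new read in add_stripe_bio() and a completed write in
raid5_end_write_request() reset the deadline to the current jiffies, so only
stripes seeing pure write traffic are held back.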
---

 drivers/md/raid5.c         |   92 +---
 include/linux/raid/raid5.h |    5 ++
 2 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1a2d6b5..1b3db16 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -226,6 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	sh->active_preread_jiffies = msecs_to_jiffies(
+		atomic_read(&conf->cache_policy->deadline_ms)) + jiffies;
 
 	sh->disks = disks;
 
@@ -1172,6 +1174,7 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
+	sh->active_preread_jiffies = jiffies;
 	release_stripe(sh);
 	return 0;
 }
@@ -1741,8 +1744,10 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
 			firstwrite = 1;
-	} else
+	} else {
 		bip = &sh->dev[dd_idx].toread;
+		sh->active_preread_jiffies = jiffies;
+	}
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2160,7 +2165,7 @@ raid5_wt_cache_handle_new_writes(struct stripe_head *sh, struct stripe_head_stat
 	}
 }
 
-static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head *raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 {
 	struct stripe_cache_policy *cp = conf->cache_policy;
 	if (atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD) {
@@ -2168,6 +2173,20 @@ static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 		struct list_head *l = cp->delayed_list.next;
 		struct stripe_head *sh;
 		sh = list_entry(l, struct stripe_head, lru);
+
+		if (time_before(jiffies, sh->active_preread_jiffies)) {
+			PRINTK("deadline: no expire sec=%lld %8u %8u\n",
+				(unsigned long long)sh->sector,
+				jiffies_to_msecs(sh->active_preread_jiffies),
+				jiffies_to_msecs(jiffies));
+			return sh;
+		} else {
+			PRINTK("deadline: expire: sec=%lld %8u %8u\n",
+				(unsigned long long)sh->sector,
+				jiffies_to_msecs(sh->active_preread_jiffies),
+				jiffies_to_msecs(jiffies));
+		}
+
 		list_del_init(l);
 		clear_bit(STRIPE_DELAYED, &sh->state);
 		if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2175,9 +2194,11 @@ static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 			list_add_tail(&sh->lru, &conf->handle_list);
 		}
 	}
+
+	return NULL;
 }
 
-static void raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
+static struct stripe_head *raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
 {
 	struct stripe_cache_policy *cp = conf->cache_policy;
 
@@ -2185,7 +2206,9 @@ static void raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
 	    atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD &&
 	    !blk_queue_plugged(mddev->queue) &&
 	    !list_empty(&cp->delayed_list))
-		raid5_wt_cache_activate_delayed(conf);
+		return raid5_wt_cache_activate_delayed(conf);
+
+	return NULL;
 }
 
 static void raid5_wt_cache_init(raid5_conf_t *conf)
@@ -4339,7 +4362,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
  */
 static void raid5d (mddev_t *mddev)
 {
-	struct stripe_head *sh;
+	struct stripe_head *sh, *delayed_sh = NULL;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	int handled;
 
@@ -4363,7 +4386,10 @@ static void raid5d (mddev_t *mddev)
 		}
 
 		if (conf->cache_policy->raid5d)
-			conf->cache_policy->raid5d(mddev, conf);
+			delayed_sh = conf->cache_policy->raid5d(mddev, conf);
+
+		if (delayed_sh)
+			break;
 
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
@@ -4401,8 +4427,60 @@ static void raid5d (mddev_t *mddev)
 	unplug_slaves(mddev);
 
 	PRINTK("--- raid5d inactive\n");
+
+	if (delayed_sh) {
+		unsigned long local_jiffies =
