Re: [BUG] OOPS 2.6.24.2 raid5 write with ioatdma

2008-02-15 Thread Dan Williams
On Fri, Feb 15, 2008 at 9:19 AM, Laurent CORBES
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
>  I got a raid5 oops when trying to write on a raid 5 array, with ioatdma
>  loaded and without DCA activated in the BIOS:
>

At first glance I believe the attached patch may fix the issue; I'll
try to reproduce this locally.

Regards,
Dan
ioat: fix 'ack' handling, driver must ensure that 'ack' is zero

From: Dan Williams <[EMAIL PROTECTED]>

Initialize 'ack' to zero in case the descriptor has been recycled.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/ioat_dma.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)


diff --git a/drivers/dma/ioat_dma.c b/drivers/dma/ioat_dma.c
index 45e7b46..8cf542b 100644
--- a/drivers/dma/ioat_dma.c
+++ b/drivers/dma/ioat_dma.c
@@ -726,6 +726,7 @@ static struct dma_async_tx_descriptor *ioat1_dma_prep_memcpy(
 
 	if (new) {
 		new->len = len;
+		new->async_tx.ack = 0;
 		return &new->async_tx;
 	} else
 		return NULL;
@@ -749,6 +750,7 @@ static struct dma_async_tx_descriptor *ioat2_dma_prep_memcpy(
 
 	if (new) {
 		new->len = len;
+		new->async_tx.ack = 0;
 		return &new->async_tx;
 	} else
 		return NULL;


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-15 Thread Dan Williams
> heheh.
>
> it's really easy to reproduce the hang without the patch -- i could
> hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> (or is even more rare), which would explain why this is overall a
> rare problem.
>

Hmmm... how rare?

http://marc.info/?l=linux-kernel&m=119461747005776&w=2

There is nothing specific that prevents other filesystems from hitting
it; perhaps XFS is just better at submitting large i/o's.  -stable
should get some kind of treatment.  I'll take altered performance over
a hung system.

--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Dan Williams
On Jan 10, 2008 12:13 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> w.r.t. dan's cfq comments -- i really don't know the details, but does
> this mean cfq will misattribute the IO to the wrong user/process?  or is
> it just a concern that CPU time will be spent on someone's IO?  the latter
> is fine to me... the former seems sucky because with today's multicore
> systems CPU time seems cheap compared to IO.
>

I do not see this affecting the time slicing feature of cfq, because
as Neil says the work has to get done at some point.  If I give up
some of my slice working on someone else's I/O, chances are the favor
will be returned in kind since the code does not discriminate.  The
io-priority capability of cfq does not work as advertised with
current MD since the priority is tied to the current thread and
the thread that actually submits the i/o on a stripe is
non-deterministic.  So I do not see this change making the situation
any worse.  In fact, it may make it a bit better since there is a
higher chance for the thread submitting i/o to MD to do its own i/o to
the backing disks.

Reviewed-by: Dan Williams <[EMAIL PROTECTED]>


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Wed, 2008-01-09 at 20:57 -0700, Neil Brown wrote:
> So I'm inclined to leave it as "do as much work as is available to be
> done" as that is simplest.  But I can probably be talked out of it
> with a convincing argument.

Well, in an age of CFS and CFQ it smacks of 'unfairness'.  But does that
trump KISS...? Probably not.

--
Dan



Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Jan 9, 2008 5:09 PM, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Wednesday January 9, [EMAIL PROTECTED] wrote:
> > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > which was Neil's change in 2.6.22 for deferring generic_make_request
> > > until there's enough stack space for it.
> > >
> >
> > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
> > by preventing recursive calls to generic_make_request.  However the
> > following conditions can cause raid5 to hang until 'stripe_cache_size' is
> > increased:
> >
>
> Thanks for pursuing this guys.  That explanation certainly sounds very
> credible.
>
> The generic_make_request_immed is a good way to confirm that we have
> found the bug,  but I don't like it as a long term solution, as it
> just reintroduced the problem that we were trying to solve with the
> problematic commit.
>
> As you say, we could arrange that all request submission happens in
> raid5d and I think this is the right way to proceed.  However we can
> still take some of the work into the thread that is submitting the
> IO by calling "raid5d()" at the end of make_request, like this.
>
> Can you test it please?

This passes my failure case.

However, my test is different from Dean's in that I am using tiobench
and the latest rev of my 'get_priority_stripe' patch. I believe the
failure mechanism is the same, but it would be good to get
confirmation from Dean.  get_priority_stripe has the effect of
increasing the frequency of
make_request->handle_stripe->generic_make_request sequences.

> Does it seem reasonable?

What do you think about limiting the number of stripes the submitting
thread handles to be equal to what it submitted?  If I'm a thread that
only submits 1 stripe worth of work, should I get stuck handling the
rest of the cache?

Regards,
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Dan Williams
On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> On Sat, 29 Dec 2007, Dan Williams wrote:
> 
> > On Dec 29, 2007 1:58 PM, dean gaudet <[EMAIL PROTECTED]> wrote: 
> > > On Sat, 29 Dec 2007, Dan Williams wrote: 
> > > 
> > > > On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote: 
> > > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another 
> > > > > box) on 
> > > > > the same 64k chunk array and had raised the stripe_cache_size to 
> > > > > 1024... 
> > > > > and got a hang.  this time i grabbed stripe_cache_active before 
> > > > > bumping 
> > > > > the size again -- it was only 905 active.  as i recall the bug we 
> > > > > were 
> > > > > debugging a year+ ago the active was at the size when it would hang.  
> > > > > so 
> > > > > this is probably something new. 
> > > > 
> > > > I believe I am seeing the same issue and am trying to track down 
> > > > whether XFS is doing something unexpected, i.e. I have not been able 
> > > > to reproduce the problem with EXT3.  MD tries to increase throughput 
> > > > by letting some stripe work build up in batches.  It looks like every 
> > > > time your system has hung it has been in the 'inactive_blocked' state 
> > > > i.e. > 3/4 of stripes active.  This state should automatically 
> > > > clear... 
> > > 
> > > cool, glad you can reproduce it :) 
> > > 
> > > i have a bit more data... i'm seeing the same problem on debian's 
> > > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. 
> > > 
> > 
> > This is just brainstorming at this point, but it looks like xfs can 
> > submit more requests in the bi_end_io path such that it can lock 
> > itself out of the RAID array.  The sequence that concerns me is: 
> > 
> > return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->
> >  
> > 
> > I need to verify whether this path is actually triggering, but if we are 
> > in an inactive_blocked condition this new request will be put on a 
> > wait queue and we'll never get to the release_stripe() call after 
> > return_io().  It would be interesting to see if this is new XFS 
> > behavior in recent kernels.
> 
> 
> i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> 
> which was Neil's change in 2.6.22 for deferring generic_make_request 
> until there's enough stack space for it.
> 
> with my git tree sync'd to that commit my test cases fail in under 20 
> minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous 
> to it i've got 8h of run-time now without the problem.
> 
> this isn't definitive of course since it does seem to be timing 
> dependent, but since all failures have occurred much earlier than that 
> for me so far i think this indicates this change is either the cause of 
> the problem or exacerbates an existing raid5 problem.
> 
> given that this problem looks like a very rare problem i saw with 2.6.18 
> (raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an 
> existing problem... not that i have evidence either way.
> 
> i've attached a new kernel log with a hang at d89d87965d... and the 
> reduced config file i was using for the bisect.  hopefully the hang 
> looks the same as what we were seeing at 2.6.24-rc6.  let me know.
> 

Dean, could you try the patch below to see if it fixes your failure
scenario?  It passes my test case.

Thanks,
Dan

--->
md: add generic_make_request_immed to prevent raid5 hang

From: Dan Williams <[EMAIL PROTECTED]>

Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
by preventing recursive calls to generic_make_request.  However the
following conditions can cause raid5 to hang until 'stripe_cache_size' is
increased:

1/ stripe_cache_active is N stripes away from the 'inactive_blocked' limit
   (3/4 * stripe_cache_size)
2/ a bio is submitted that requires M stripes to be processed where M > N
3/ stripes 1 through N are up-to-date and ready for immediate processing,
   i.e. no trip through raid5d required

This results in the calling thread hanging while waiting for resources to
process stripes N through M.  This means we never return from make_request.
All other raid5 users pile up in get_active_stripe.  Increasing
'stripe_cache_size' such that stripe_cache_active falls back below the 3/4
limit gets things flowing again.
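For reference, the blocking logic in question, paraphrased from
get_active_stripe() in drivers/md/raid5.c of this era (an abbreviated
sketch, not a literal quote):

	/* no free stripe: flag the cache blocked and sleep until a stripe
	 * frees up and the active count drops below 3/4 of the cache size */
	conf->inactive_blocked = 1;
	wait_event_lock_irq(conf->wait_for_stripe,
			    !list_empty(&conf->inactive_list) &&
			    (atomic_read(&conf->active_stripes)
			     < (conf->max_nr_stripes * 3/4)
			     || !conf->inactive_blocked),
			    conf->device_lock, /* unplug */);
	conf->inactive_blocked = 0;

In the scenario described above the submitting thread sleeps here inside
make_request while the work that would release stripes has not yet been
handed to anyone else, hence the hang.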

Re: Raid 1, new disk can't be added after replacing faulty disk

2008-01-07 Thread Dan Williams
On Jan 7, 2008 6:44 AM, Radu Rendec <[EMAIL PROTECTED]> wrote:
> I'm experiencing trouble when trying to add a new disk to a raid 1 array
> after having replaced a faulty disk.
>
[..]
> # mdadm --version
> mdadm - v2.6.2 - 21st May 2007
>
[..]
> However, this happens with both mdadm 2.6.2 and 2.6.4. I downgraded to
> 2.5.4 and it works like a charm.

Looks like you are running into the issue described here:
http://marc.info/?l=linux-raid&m=119892098129022&w=2


Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Dan Williams
On Thu, 2008-01-03 at 16:00 -0700, Williams, Dan J wrote:
> On Thu, 2008-01-03 at 15:46 -0700, NeilBrown wrote:
> > This patch fixes a fairly serious bug in md/raid5 in 2.6.23 and
> 24-rc.
> > It would be great if it could get into 23.13 and 24.final.
> > Thanks.
> > NeilBrown
> >
> > ### Comments for Changeset
> >
> > We currently do not wait for the block from the missing device
> > to be computed from parity before copying data to the new stripe
> > layout.
> >
> > The change in the raid6 code is not technically needed as we
> > don't delay data block recovery in the same way for raid6 yet.
> > But making the change now is safer long-term.
> >
> > This bug exists in 2.6.23 and 2.6.24-rc
> >
> > Cc: [EMAIL PROTECTED]
> > Cc: Dan Williams <[EMAIL PROTECTED]>
> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> >
> Acked-by: Dan Williams <[EMAIL PROTECTED]>
> 

On closer look the safer test is:

!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending).

The 'req_compute' field only indicates that a 'compute_block' operation
was requested during this pass through handle_stripe so that we can
issue a linked chain of asynchronous operations.

---

From: Neil Brown <[EMAIL PROTECTED]>

md: Fix data corruption when a degraded raid5 array is reshaped.

We currently do not wait for the block from the missing device
to be computed from parity before copying data to the new stripe
layout.

The change in the raid6 code is not technically needed as we
don't delay data block recovery in the same way for raid6 yet.
But making the change now is safer long-term.

This bug exists in 2.6.23 and 2.6.24-rc

Cc: [EMAIL PROTECTED]
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a5aad8c..e8c8157 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2865,7 +2865,8 @@ static void handle_stripe5(struct stripe_head *sh)
 			md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (s.expanding && s.locked == 0)
+	if (s.expanding && s.locked == 0 &&
+	    !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
 		handle_stripe_expansion(conf, sh, NULL);
 
 	if (sh->ops.count)
@@ -3067,7 +3068,8 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (s.expanding && s.locked == 0)
+	if (s.expanding && s.locked == 0 &&
+	    !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending))
 		handle_stripe_expansion(conf, sh, &r6s);
 
 	spin_unlock(&sh->lock);



Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Dan Williams
On Thu, 2008-01-03 at 15:46 -0700, NeilBrown wrote:
> This patch fixes a fairly serious bug in md/raid5 in 2.6.23 and 24-rc.
> It would be great if it could get into 23.13 and 24.final.
> Thanks.
> NeilBrown
> 
> ### Comments for Changeset
> 
> We currently do not wait for the block from the missing device
> to be computed from parity before copying data to the new stripe
> layout.
> 
> The change in the raid6 code is not technically needed as we
> don't delay data block recovery in the same way for raid6 yet.
> But making the change now is safer long-term.
> 
> This bug exists in 2.6.23 and 2.6.24-rc
> 
> Cc: [EMAIL PROTECTED]
> Cc: Dan Williams <[EMAIL PROTECTED]>
> Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> 
Acked-by: Dan Williams <[EMAIL PROTECTED]>





Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 1:58 PM, dean gaudet <[EMAIL PROTECTED]> wrote:
> On Sat, 29 Dec 2007, Dan Williams wrote:
>
> > On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > and got a hang.  this time i grabbed stripe_cache_active before bumping
> > > the size again -- it was only 905 active.  as i recall the bug we were
> > > debugging a year+ ago the active was at the size when it would hang.  so
> > > this is probably something new.
> >
> > I believe I am seeing the same issue and am trying to track down
> > whether XFS is doing something unexpected, i.e. I have not been able
> > to reproduce the problem with EXT3.  MD tries to increase throughput
> > by letting some stripe work build up in batches.  It looks like every
> > time your system has hung it has been in the 'inactive_blocked' state
> > i.e. > 3/4 of stripes active.  This state should automatically
> > clear...
>
> cool, glad you can reproduce it :)
>
> i have a bit more data... i'm seeing the same problem on debian's
> 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
>

This is just brainstorming at this point, but it looks like xfs can
submit more requests in the bi_end_io path such that it can lock
itself out of the RAID array.  The sequence that concerns me is:

return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request->

I need to verify whether this path is actually triggering, but if we are
in an inactive_blocked condition this new request will be put on a
wait queue and we'll never get to the release_stripe() call after
return_io().  It would be interesting to see if this is new XFS
behavior in recent kernels.

--
Dan


Re: 2.6.24-rc6 reproducible raid5 hang

2007-12-29 Thread Dan Williams
On Dec 29, 2007 9:48 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> the same 64k chunk array and had raised the stripe_cache_size to 1024...
> and got a hang.  this time i grabbed stripe_cache_active before bumping
> the size again -- it was only 905 active.  as i recall the bug we were
> debugging a year+ ago the active was at the size when it would hang.  so
> this is probably something new.

I believe I am seeing the same issue and am trying to track down
whether XFS is doing something unexpected, i.e. I have not been able
to reproduce the problem with EXT3.  MD tries to increase throughput
by letting some stripe work build up in batches.  It looks like every
time your system has hung it has been in the 'inactive_blocked' state
i.e. > 3/4 of stripes active.  This state should automatically
clear...

>
> anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
> hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that
stripe_cache_active < 3/4 * stripe_cache_size, things will start
flowing again.
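For example (sysfs paths as in 2.6.23; assuming the array is md0):

	# see how close the cache is to the 3/4 'inactive_blocked' limit
	cat /sys/block/md0/md/stripe_cache_active
	cat /sys/block/md0/md/stripe_cache_size
	# raise the limit so that active < 3/4 * size again
	echo 2048 > /sys/block/md0/md/stripe_cache_size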

>
> btw what units are stripe_cache_size/active in?  is the memory consumed
> equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
> raid_disks * stripe_cache_active)?
>

memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size
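As a worked example, with 4 KiB pages, an 8-disk array, and
stripe_cache_size set to 1024 (assumed values): 4096 * 8 * 1024 =
33554432 bytes, i.e. 32 MiB.  Note the cost follows stripe_cache_size,
not stripe_cache_active.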

>
> -dean
>

--
Dan


mdadm: unable to add a disk to degraded raid1 array

2007-12-29 Thread Dan Williams
In case someone else happens upon this: I have found that mdadm >=
v2.6.2 cannot add a disk to a degraded raid1 array created with mdadm
< 2.6.2.

I bisected the problem down to mdadm git commit
2fb749d1b7588985b1834e43de4ec5685d0b8d26 which appears to make an
incompatible change to the super block's 'data_size' field.

--- sdb1-sb-good.hex	2007-12-12 14:31:42.0 +
+++ sdb1-sb-bad.hex	2007-12-12 14:31:36.0 +
@@ -6,12 +6,12 @@
 050 60d8 0077     0004 
 060        
 *
-080     60d8 0077  
+080     60d0 0077  

This trips up the "if (rdev->size < le64_to_cpu(sb->data_size)/2)"
check in super_1_load [1], resulting in:

mdadm: add new device failed for /dev/sdb1 as 4: Invalid argument
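(A dump like the above can be captured with something along these
lines -- a hypothetical invocation, adjust the device and length as
needed for where the superblock lives:

	dd if=/dev/sdb1 bs=64k count=1 2>/dev/null | hexdump > sdb1-sb.hex

then diff the dumps taken from a good and a bad member.)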

--
Dan

[1] http://lxr.linux.no/linux/drivers/md/md.c#L1148


Re: HELP! New disks being dropped from RAID 6 array on every reboot

2007-11-23 Thread Dan Williams
On Nov 23, 2007 11:19 AM, Joshua Johnson <[EMAIL PROTECTED]> wrote:
> Greetings, long time listener, first time caller.
>
> I recently replaced a disk in my existing 8 disk RAID 6 array.
> Previously, all disks were PATA drives connected to the motherboard
> IDE and 3 promise Ultra 100/133 controllers.  I replaced one of the
> Promise controllers with a Via 64xx based controller, which has 2 SATA
> ports and one PATA port.  I connected a new SATA drive to the new
> card, partitioned the drive and added it to the array.  After 5 or 6
> hours the resyncing process finished and the array showed up complete.
>  Upon rebooting I discovered that the new drive had not been added to
> the array when it was assembled on boot.   I resynced it and tried
> again -- still would not persist after a reboot.  I moved one of the
> existing PATA drives to the new controller (so I could have the slot
> for network), rebooted and rebuilt the array.  Now when I reboot BOTH
> disks are missing from the array (sda and sdb).  Upon examining the
> disks it appears they think they are part of the array, but for some
> reason they are not being added when the array is being assembled.
> For example, this is a disk on the new controller which was not added
> to the array after rebooting:
>
> # mdadm --examine /dev/sda1
> /dev/sda1:
>   Magic : a92b4efc
> Version : 00.90.03
>UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
>   Creation Time : Thu Sep 21 23:52:19 2006
>  Raid Level : raid6
> Device Size : 191157248 (182.30 GiB 195.75 GB)
>  Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
>Raid Devices : 8
>   Total Devices : 8
> Preferred Minor : 0
>
> Update Time : Fri Nov 23 10:22:57 2007
>   State : clean
>  Active Devices : 8
> Working Devices : 8
>  Failed Devices : 0
>   Spare Devices : 0
>Checksum : 50df590e - correct
>  Events : 0.96419878
>
>  Chunk Size : 256K
>
>   Number   Major   Minor   RaidDevice State
> this     6       8        1        6      active sync   /dev/sda1
>
>    0     0       3        2        0      active sync   /dev/hda2
>    1     1      57        2        1      active sync   /dev/hdk2
>    2     2      33        2        2      active sync   /dev/hde2
>    3     3      34        2        3      active sync   /dev/hdg2
>    4     4      22        2        4      active sync   /dev/hdc2
>    5     5      56        2        5      active sync   /dev/hdi2
>    6     6       8        1        6      active sync   /dev/sda1
>    7     7       8       17        7      active sync   /dev/sdb1
>
>
> Everything there seems to be correct and current up to the last
> shutdown.  But the disk is not being added on boot.  Examining a disk
> that is currently running in the array shows:
>
> # mdadm --examine /dev/hdc2
> /dev/hdc2:
>   Magic : a92b4efc
> Version : 00.90.03
>UUID : 63ee7d14:a0ac6a6e:aef6fe14:50e047a5
>   Creation Time : Thu Sep 21 23:52:19 2006
>  Raid Level : raid6
> Device Size : 191157248 (182.30 GiB 195.75 GB)
>  Array Size : 1146943488 (1093.81 GiB 1174.47 GB)
>Raid Devices : 8
>   Total Devices : 6
> Preferred Minor : 0
>
> Update Time : Fri Nov 23 10:23:52 2007
>   State : clean
>  Active Devices : 6
> Working Devices : 6
>  Failed Devices : 2
>   Spare Devices : 0
>Checksum : 50df5934 - correct
>  Events : 0.96419880
>
>  Chunk Size : 256K
>
>   Number   Major   Minor   RaidDevice State
> this     4      22        2        4      active sync   /dev/hdc2
>
>    0     0       3        2        0      active sync   /dev/hda2
>    1     1      57        2        1      active sync   /dev/hdk2
>    2     2      33        2        2      active sync   /dev/hde2
>    3     3      34        2        3      active sync   /dev/hdg2
>    4     4      22        2        4      active sync   /dev/hdc2
>    5     5      56        2        5      active sync   /dev/hdi2
>    6     6       0        0        6      faulty removed
>    7     7       0        0        7      faulty removed
>
>
> Here is my /etc/mdadm/mdadm.conf:
>
> DEVICE partitions
> PROGRAM /bin/echo
> MAILADDR 
> ARRAY /dev/md0 level=raid6 num-devices=8
> UUID=63ee7d14:a0ac6a6e:aef6fe14:50e047a5
>
>
> Can anyone see anything that is glaringly wrong here?  Has anybody
> experienced similar behavior?  I am running Debian using kernel
> 2.6.23.8.  All partitions are set to type 0xFD and it appears the
> superblocks on the sd* disks were written, why wouldn't they be added
> to the array on boot?  Any help is greatly appreciated!

I wonder if you are running into a driver load order problem where the
ide driver and md are coming up before the sata driver.  You can let
userspace do the assembly after everything is up and running.  Specify
'raid=noautodetect' on the kernel command line and then let Debian's
'/etc/init.d/mdadm-raid' initscript take care of the assembly based on
your configuration file (/etc/mdadm/mdadm.conf).
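A sketch of that approach, using the UUID from the mdadm.conf above:
append

	raid=noautodetect

to the kernel command line so the kernel leaves the 0xFD partitions
alone, then assemble from userspace once all controller drivers are
loaded, e.g.:

	mdadm --assemble --scan --uuid=63ee7d14:a0ac6a6e:aef6fe14:50e047a5

(the Debian initscript does essentially 'mdadm --assemble --scan' for
you, based on mdadm.conf).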

Re: PROBLEM: raid5 hangs

2007-11-14 Thread Dan Williams
On Nov 14, 2007 5:05 PM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> On Wed, 14 Nov 2007, Bill Davidsen wrote:
> > Justin Piszcz wrote:
> >> This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5
> >> bio* patches are applied.
> >
> > Note below he's running 2.6.22.3 which doesn't have the bug unless -STABLE
> > added it. So it should not really be in 2.6.22.anything. I assume you're
> > talking about the endless write or bio issue?
> The bio issue is the root cause of the bug, yes?

Not if this is a 2.6.22 issue.  Neither of the bugs fixed by "raid5:
fix clearing of biofill operations" or "raid5: fix unending write
sequence" existed prior to 2.6.23.


Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-13 Thread Dan Williams
On Nov 13, 2007 8:43 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> >
> > Careful, it looks like you cherry-picked commit 4ae3f847 "md: raid5:
> > fix clearing of biofill operations" which ended up misapplied in
> > Linus' tree.  You should either also pick up def6ae26 "md: fix
> > misapplied patch in raid5.c" or I can resend the original "raid5: fix
> > clearing of biofill operations."
> >
> > The other patch for -stable "raid5: fix unending write sequence" is
> > currently in -mm.
>
> Hm, I've attached the two patches that I have right now in the -stable
> tree so far (still have over 100 patches to go, so I might not have
> gotten to them yet if you have sent them).  These were sent to me by
> Andrew on their way to Linus.  if I should drop either one, or add
> another one, please let me know.
>

Drop md-raid5-fix-clearing-of-biofill-operations.patch and replace it
with the attached
md-raid5-not-raid6-fix-clearing-of-biofill-operations.patch (the
original sent to Neil).

The critical difference is that the replacement patch touches
handle_stripe5, not handle_stripe6.  Diffing the patches shows the
changes for hunk #3:

-@@ -2903,6 +2907,13 @@ static void handle_stripe6(struct stripe
+@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
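(To double-check which version a tree picked up -- a sketch using the
commit id above:

	git show 4ae3f847 | grep '^@@'

if the third hunk header names handle_stripe6 rather than
handle_stripe5, that tree has the misapplied version and needs
def6ae26 as well.)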

raid5-fix-unending-write-sequence.patch is in -mm and I believe it is
waiting on an Acked-by from Neil?

> thanks,
>
> greg k-h

Thanks,
Dan
raid5: fix clearing of biofill operations

From: Dan Williams <[EMAIL PROTECTED]>

ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Tested-by: Joël Bertrand <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..3808f52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -377,7 +377,12 @@ static unsigned long get_stripe_work(struct stripe_head *sh)
 		ack++;
 
 	sh->ops.count -= ack;
-	BUG_ON(sh->ops.count < 0);
+	if (unlikely(sh->ops.count < 0)) {
+		printk(KERN_ERR "pending: %#lx ops.pending: %#lx ops.ack: %#lx "
+			"ops.complete: %#lx\n", pending, sh->ops.pending,
+			sh->ops.ack, sh->ops.complete);
+		BUG();
+	}
 
 	return pending;
 }
@@ -551,8 +556,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			}
 		}
 	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 
 	return_io(return_bi);
 
@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
raid5: fix unending write sequence

From: Dan Williams <[EMAIL PROTECTED]>


handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
check 5: state 0x6 toread  read  write f800ffcffcc0 written 
check 4: state 0x6 toread  read  write f800fdd4e360 written 
check 3: state 0x1 toread  read  write  written 
check 2: state 0x1 toread  read  write  written 
check 1: state 0x6 toread  read  write f800ff517e40 written 
check 0: state 0x6 toread  read  write f800fd4cae60 written 
locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
for sector 7629696, rmw=0 rcw=0


These blocks were prepared to be written out, but were never handled in
ops_run_biodrain(), so they remain locked forever.  The operations flags
are all clear which means handle_stripe() thinks nothing else needs to be
done.

This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it
should not have been.  This patch cleans up cases where the code looks at
sh->ops.pending when it should be looking at the consistent stack-based
snapshot of the operations flags.

Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-13 Thread Dan Williams
On Nov 13, 2007 5:23 PM, Greg KH <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 13, 2007 at 04:22:14PM -0800, Greg KH wrote:
> > On Mon, Oct 22, 2007 at 05:15:27PM +1000, NeilBrown wrote:
> > >
> > > It appears that a couple of bugs slipped in to md for 2.6.23.
> > > These two patches fix them and are appropriate for 2.6.23.y as well
> > > as 2.6.24-rcX
> > >
> > > Thanks,
> > > NeilBrown
> > >
> > >  [PATCH 001 of 2] md: Fix an unsigned compare to allow creation of 
> > > bitmaps with v1.0 metadata.
> > >  [PATCH 002 of 2] md: raid5: fix clearing of biofill operations
> >
> > I don't see these patches in 2.6.24-rcX, are they there under some other
> > subject?
>
> Oh nevermind, I found them, sorry for the noise...
>

Careful, it looks like you cherry-picked commit 4ae3f847 "md: raid5:
fix clearing of biofill operations" which ended up misapplied in
Linus' tree.  You should either also pick up def6ae26 "md: fix
misapplied patch in raid5.c" or I can resend the original "raid5: fix
clearing of biofill operations."

The other patch for -stable "raid5: fix unending write sequence" is
currently in -mm.

> greg k-h

Regards,
Dan


Re: kernel panic (2.6.23.1-fc7) in drivers/md/raid5.c:144

2007-11-13 Thread Dan Williams
[ Adding Neil, stable@, DaveJ, and GregKH to the cc ]

On Nov 13, 2007 11:20 AM, Peter <[EMAIL PROTECTED]> wrote:
> Hi
>
> I had a 3 disc raid5 array running fine with Fedora 7 (32bit) kernel 
> 2.6.23.1-fc7 on an old Athlon XP using two sata_sil cards.
>
> I replaced the hardware with an Athlon64 X2 and using the onboard sata_nv, 
> after I modified the initrd I was able to boot up from my old system drive. 
> However when it brought up the raid array it died with a kernel panic. I used 
> a rescue CD, commented out the array in mdadm.conf and booted up. I could 
> assemble the array manually (it kicked out one of the three drives for some 
> reason?) but when I used mdadm --examine /dev/md0 I got the kernel panic 
> again. I don't have remote debugging but I managed to take some pictures:
>
> http://img132.imageshack.us/img132/1697/kernel1sh3.jpg
> http://img132.imageshack.us/img132/3538/kernel2eu2.jpg
>
> From what I understand it should be possible to do this hardware upgrade
> while using software raid? Any ideas?
>
> Thanks
> Peter
>

There are two bug fix patches pending for 2.6.23.2:
"raid5: fix clearing of biofill operations"
http://marc.info/?l=linux-raid&m=119303750132068&w=2
"raid5: fix unending write sequence"
http://marc.info/?l=linux-raid&m=119453934805607&w=2

You are hitting the bug that was fixed by: "raid5: fix clearing of
biofill operations"

Heads up for the stable@ team: "raid5: fix clearing of biofill
operations" was originally misapplied for 2.6.24-rc:
"md: Fix misapplied patch in raid5.c"
http://marc.info/?l=linux-raid&m=119396783332081&w=2


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread Dan Williams
On 11/8/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> Jeff Lessem wrote:
> > Dan Williams wrote:
> > > The following patch, also attached, cleans up cases where the code looks
> > > at sh->ops.pending when it should be looking at the consistent
> > > stack-based snapshot of the operations flags.
> >
> > I tried this patch (against a stock 2.6.23), and it did not work for
> > me.  Not only did I/O to the effected RAID5 & XFS partition stop, but
> > also I/O to all other disks.  I was not able to capture any debugging
> > information, but I should be able to do that tomorrow when I can hook
> > a serial console to the machine.
>
> That can't be good! This is worrisome because Joël is giddy with joy
> because it fixes his iSCSI problems. I was going to try it with nbd, but
> perhaps I'll wait a week or so and see if others have more information.
> Applying patches before a holiday weekend is a good way to avoid time
> off. :-(

We need to see more information on the failure that Jeff is seeing,
and whether it goes away with the two known patches applied.  He
applied this most recent patch against stock 2.6.23 which means that
the platform was still open to the first biofill flags issue.

--
Dan


[PATCH] raid5: fix unending write sequence

2007-11-08 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>


handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
check 5: state 0x6 toread  read  write f800ffcffcc0 written 
check 4: state 0x6 toread  read  write f800fdd4e360 written 
check 3: state 0x1 toread  read  write  written 
check 2: state 0x1 toread  read  write  written 
check 1: state 0x6 toread  read  write f800ff517e40 written 
check 0: state 0x6 toread  read  write f800fd4cae60 written 
locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
for sector 7629696, rmw=0 rcw=0


These blocks were prepared to be written out, but were never handled in
ops_run_biodrain(), so they remain locked forever.  The operations flags
are all clear which means handle_stripe() thinks nothing else needs to be
done.

This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it
should not have been.  This patch cleans up cases where the code looks at
sh->ops.pending when it should be looking at the consistent stack-based
snapshot of the operations flags.

Report from Joël:
Resync done. Patch fix this bug.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Tested-by: Joël Bertrand <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3808f52..e86cacb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -689,7 +689,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		 unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -697,7 +698,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -774,7 +775,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -782,7 +784,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
@@ -809,7 +811,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -897,12 +899,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Dan Williams
On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
> Done. Here is the obtained output:

Much appreciated.
> 
> [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
> [ 1260.980606] check 5: state 0x6 toread  read  write f800ffcffcc0 written 
> [ 1260.994808] check 4: state 0x6 toread  read  write f800fdd4e360 written 
> [ 1261.009325] check 3: state 0x1 toread  read  write  written 
> [ 1261.244478] check 2: state 0x1 toread  read  write  written 
> [ 1261.270821] check 1: state 0x6 toread  read  write f800ff517e40 written 
> [ 1261.312320] check 0: state 0x6 toread  read  write f800fd4cae60 written 
> [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
> [ 1261.443120] for sector 7629696, rmw=0 rcw=0
[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.


---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		 unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
@@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/5/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
[..]
> > Are you seeing the same "md thread takes 100% of the CPU" that Joël is
> > reporting?
> >
>
> Yes, in another e-mail I posted the top output with md3_raid5 at 100%.
>

This seems too similar to Joël's situation for them not to be
correlated, and it shows that iscsi is not a necessary component of
the failure.

The attached patch allows the debug statements in MD to be enabled via
sysfs.  Joël, since it is easier for you to reproduce can you capture
the kernel log output after the raid thread goes into the spin?  It
will help if you have CONFIG_PRINTK_TIME=y set in your kernel
configuration.

After the failure run:

echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; echo 0 > /sys/block/md_d0/md/debug_print_enable

...to enable the print messages for a few seconds.  Please send the
output in a private message if it proves too big for the mailing list.


raid5-debug-print-enable.patch
Description: Binary data


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/4/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
> On Mon, 5 Nov 2007, Neil Brown wrote:
>
> > On Sunday November 4, [EMAIL PROTECTED] wrote:
> >> # ps auxww | grep D
> >> USER   PID %CPU %MEM   VSZ   RSS TTY  STAT START   TIME COMMAND
> >> root   273  0.0  0.0     0     0 ?    D    Oct21  14:40 [pdflush]
> >> root   274  0.0  0.0     0     0 ?    D    Oct21  13:00 [pdflush]
> >>
> >> After several days/weeks, this is the second time this has happened, while
> >> doing regular file I/O (decompressing a file), everything on the device
> >> went into D-state.
> >
> > At a guess (I haven't looked closely) I'd say it is the bug that was
> > meant to be fixed by
> >
> > commit 4ae3f847e49e3787eca91bced31f8fd328d50496
> >
> > except that patch applied badly and needed to be fixed with
> > the following patch (not in git yet).
> > These have been sent to stable@ and should be in the queue for 2.6.23.2
> >
>
> Ah, thanks Neil, will be updating as soon as it is released, thanks.
>

Are you seeing the same "md thread takes 100% of the CPU" that Joël is
reporting?


Re: Bug in processing dependencies by async_tx_submit() ?

2007-11-01 Thread Dan Williams
On 11/1/07, Yuri Tikhonov <[EMAIL PROTECTED]> wrote:
>
>  Hi Dan,
>
>   Honestly I tried to fix this quickly using an approach similar to the one
>  proposed by you, with one addition though (in fact, deletion of BUG_ON(chan ==
>  tx->chan) in async_tx_run_dependencies()). And this led to a "Kernel stack
>  overflow". This happened because of the recursive calling of async_tx_submit()
>  from async_trigger_callback() and vice versa.
>

I had a feeling the fix could not be that easy...

>   So, then I made the interrupt scheduling in async_tx_submit() only for the
>  cases when it is really needed: i.e. when dependent operations are to be run
>  on different channels.
>
>   The resulting kernel locked up during processing of the mkfs command on
>  top of the RAID array. The place where it is spinning is the dma_sync_wait()
>  function.
>
>   This is happened because of the specific implementation of
>  dma_wait_for_async_tx().

So I take it you are not implementing interrupt based callbacks in your driver?

>   The "iter", we finally waiting for there, corresponds to the last allocated
>  but not-yet-submitted descriptor. But if the "iter" we are waiting for is
>  dependent from another descriptor which has cookie > 0, but is not yet
>  submitted to the h/w channel because of the fact that threshold is not
>  achieved to this moment, then we may wait in dma_wait_for_async_tx()
>  infinitely. I think that it makes more sense to get the first descriptor
>  which was submitted to the channel but probably is not put into the h/w
>  chain, i.e. with cookie > 0 and do dma_sync_wait() of this descriptor.
>
>   When I modified dma_wait_for_async_tx() in this way, the kernel
>  lock-up disappeared. But nevertheless the mkfs process hangs up after
>  some time. So, it looks like something is still missing in support of the
>  chaining dependencies feature...
>

I am preparing a new patch that replaces ASYNC_TX_DEP_ACK with
ASYNC_TX_CHAIN_ACK.  The plan is to make the entire chain of
dependencies available up until the last transaction is submitted.
This allows the entire dependency chain to be walked at
async_tx_submit time so that we can properly handle these multiple
dependency cases.  I'll send it out when it passes my internal
tests...

--
Dan


Re: Bug in processing dependencies by async_tx_submit() ?

2007-10-31 Thread Dan Williams
On Wed, 2007-10-31 at 09:21 -0700, Yuri Tikhonov wrote:
> 
>  Hello Dan,
> 
>  I've run into a problem with the h/w accelerated RAID-5 driver (on the
> ppc440spe-based board). After some investigations I've come to conclusion
> that the issue is with the async_tx_submit() implementation in ASYNC_TX.
> 
Unfortunately this is correct: async_tx_submit() will let the third
operation pass the second in the scenario you describe.  I propose the
fix (untested) below.  I'll test this out tomorrow when I am back in the
office.

---
async_tx: fix successive dependent operation submission

From: Dan Williams <[EMAIL PROTECTED]>

async_tx_submit() tried to use the hardware descriptor chain to maintain
transaction ordering.  However before falling back to hardware-channel
dependency ordering async_tx_submit() must first check if the entire chain
is waiting on another channel.

OP1 (DMA0) <--- OP2 (DMA1) <--- OP3 (DMA1)

OP3 must be submitted as an OP2 dependency if it is submitted before OP1
completes.  Otherwise if OP1 is complete, OP3 can use the natural sequence
of DMA1's hardware chain to satisfy that it runs after OP2.

The fix is to check if the ->parent field of the dependency is non-NULL.
This also requires that the parent field be cleared at dependency
submission time.

Found-by: Yuri Tikhonov <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 crypto/async_tx/async_tx.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index bc18cbb..eb1afb9 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -125,6 +125,9 @@ async_tx_run_dependencies(struct dma_async_tx_descriptor *tx)
 		list_del(&dep_tx->depend_node);
 		tx->tx_submit(dep_tx);
 
+		/* we no longer have a parent */
+		tx->parent = NULL;
+
 		/* we need to poke the engine as client code does not
 		 * know about dependency submission events
 		 */
@@ -409,8 +412,9 @@ async_tx_submit(struct dma_chan *chan, struct dma_async_tx_descriptor *tx,
 	/* set this new tx to run after depend_tx if:
 	 * 1/ a dependency exists (depend_tx is !NULL)
 	 * 2/ the tx can not be submitted to the current channel
+	 * 3/ the depend_tx has a parent
 	 */
-	if (depend_tx && depend_tx->chan != chan) {
+	if (depend_tx && (depend_tx->chan != chan || depend_tx->parent)) {
 		/* if ack is already set then we cannot be sure
 		 * we are referring to the correct operation
 		 */



Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-27 Thread Dan Williams
On 10/27/07, BERTRAND Joël <[EMAIL PROTECTED]> wrote:
> Dan Williams wrote:
> > Can you collect some oprofile data, as Ming suggested, so we can maybe
> > see what md_d0_raid5 and istd1 are fighting about?  Hopefully it is as
> > painless to run on sparc as it is on IA:
> >
> > opcontrol --start --vmlinux=/path/to/vmlinux
> > 
> > opcontrol --stop
> > opreport --image-path=/lib/modules/`uname -r` -l
>
> Done.
>

[..]

>
> Is it enough ?

I would expect md_d0_raid5 and istd1 to show up pretty high in the
list if they are constantly pegged at 100% CPU utilization like you
showed in the failure case.  Maybe this was captured after the target
had disconnected?


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-24 Thread Dan Williams
On 10/24/07, BERTRAND Joël <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Any news about this trouble ? Any idea ? I'm trying to fix it, but I
> don't see any specific interaction between raid5 and istd. Has anyone
> tried to reproduce this bug on an arch other than sparc64? I only use
> sparc32 and 64 servers and I cannot test on other archs. Of course, I
> have a laptop, but I cannot create a raid5 array on its internal HD to
> test this configuration ;-)
>

Can you collect some oprofile data, as Ming suggested, so we can maybe
see what md_d0_raid5 and istd1 are fighting about?  Hopefully it is as
painless to run on sparc as it is on IA:

opcontrol --start --vmlinux=/path/to/vmlinux

opcontrol --stop
opreport --image-path=/lib/modules/`uname -r` -l

--
Dan


Re: MD driver document

2007-10-24 Thread Dan Williams
On 10/24/07, tirumalareddy marri <[EMAIL PROTECTED]> wrote:
>
>  Hi,
> I am looking for the best way of understanding the MD
> driver (including raid5/6) architecture. I am
> developing a driver for a PPC-based SOC. I have
> done some code reading and tried to use a HW debugger to
> walk through the code. But it was not much help.
>
>   If you have any pointers or documents, I will
> greatly appreciate if you can share it.
>

I started out with include/linux/raid/raid5.h.  Also, running it with
the debug print statements turned on will get you familiar with the
code flow.

Lastly, I wrote the following paper which is already becoming outdated:
http://downloads.sourceforge.net/xscaleiop/ols_paper_2006.pdf
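(For reference, the debug print statements referred to are the pr_debug()
calls in drivers/md/raid5.c; pr_debug() compiles to a no-op unless DEBUG
is defined, so a minimal way to enable them is:

	/* at the very top of drivers/md/raid5.c, before any #include,
	 * so pr_debug() expands to printk(KERN_DEBUG ...) */
	#define DEBUG 1

then rebuild the raid456 module.)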

> Thanks and regards,
>  Marri
>

--
Dan


Re: async_tx: get best channel

2007-10-23 Thread Dan Williams
On Fri, 2007-10-19 at 05:23 -0700, Yuri Tikhonov wrote:
> 
>  Hello Dan,

Hi Yuri, sorry it has taken me so long to get back to you...
> 
>  I have a suggestion regarding the async_tx_find_channel() procedure.
> 
>  First, a little introduction. Some processors (e.g. ppc440spe) have several
> DMA engines (say DMA1 and DMA2) which are capable of performing the same type
> of operation, say XOR. The DMA2 engine may process the XOR operation faster
> than the DMA1 engine, but DMA2 (which is faster) has some restrictions for
> the source operand addresses, whereas there are no such restrictions for
> DMA1 (which is slower). So the question is, how may ASYNC_TX select the DMA
> engine which will be the most effective for the given tx operation?
> 
>  In the example just described this means: if the faster engine, DMA2, can
> process the tx operation with the given source operand addresses, then we
> select DMA2; if the given source operand addresses cannot be processed with
> DMA2, then we select the slower engine, DMA1.
> 
>  I see the following way of introducing such functionality.
> 
>  We may introduce an additional method in struct dma_device (let's call it
> device_estimate()) which would take the following as arguments:
> --- the list of sources to be processed during the given tx,
> --- the type of operation (XOR, COPY, ...),
> --- perhaps something else,
>  and then estimate the effectiveness of processing this tx on the given
> channel.
>  The async_tx_find_channel() function should call the device_estimate()
> method for each registered dma channel and then select the most effective
> one.
>  The architecture-specific ADMA driver will be responsible for returning
> the greatest value from the device_estimate() method for the channel which
> will be the most effective for this given tx.
> 
>  What are your thoughts regarding this? Do you see any other effective
> ways of enhancing ASYNC_TX with such functionality?

The problem with moving this test to async_tx_find_channel() is that it
imposes extra overhead in the fast path.  It would be best if we could
keep all these decisions in the slow path, or at least hide it from
architectures that do not need to implement it.  The thing that makes
this tricky is the fact that the speed is based on the source address...

One question: what are the source address restrictions, are they around
high memory?  My thought is MD usually only operates on GFP_KERNEL
memory but sometimes sees high-memory when copying data into and out of
the cache.  You might be able to achieve your use case by disabling
(hiding) the XOR capability on the channels used for copying.  This will
cause async_tx to switch the operation from the high memory capable copy
channel to the fast low memory XOR channel.

Another way to approach this would be to implement architecture specific
definitions of dma_channel_add_remove() and async_tx_rebalance().  This
will bypass the default allocation scheme and allow you to assign the
fastest channel to an operation, but it still does not allow for dynamic
selection based on source/destination address...
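For concreteness, the proposal under discussion amounts to something like
the following (a purely illustrative sketch; no such member exists in
mainline dmaengine):

	/* candidate addition to struct dma_device: score how well this
	 * channel could execute the operation on these sources;
	 * async_tx_find_channel() would call it for each registered
	 * channel and pick the highest score */
	int (*device_estimate)(struct dma_chan *chan,
			       enum dma_transaction_type type,
			       struct page **src_list, int src_cnt);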

> 
>  Regards, Yuri
> 
Regards,
Dan


Re: [BUG] Raid1/5 over iSCSI trouble

2007-10-19 Thread Dan Williams
On Fri, 2007-10-19 at 14:04 -0700, BERTRAND Joël wrote:
> 
> 	Sorry for this last mail. I have found another mistake, but I don't
> know if this bug comes from iscsi-target or raid5 itself. iSCSI target
> is disconnected because istd1 and md_d0_raid5 kernel threads use 100%
> of CPU each!
> 
> Tasks: 235 total,   6 running, 227 sleeping,   0 stopped,   2 zombie
> Cpu(s):  0.1%us, 12.5%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   4139032k total,   218424k used,  3920608k free,    10136k buffers
> Swap:  7815536k total,        0k used,  7815536k free,    64808k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5824 root      15  -5     0    0    0 R  100  0.0  10:34.25 istd1
>  5599 root      15  -5     0    0    0 R  100  0.0   7:25.43 md_d0_raid5

What is the output of:
cat /proc/5824/wchan
cat /proc/5599/wchan

Thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -stable, 2.6.24-rc] raid5: fix clearing of biofill operations (try2)

2007-10-19 Thread Dan Williams
Neil,

The following patch fixes a case where STRIPE_OP_BIOFILL can be counted
twice causing sh->ops.count to get out of sync thus triggering a BUG()
statement.  The patch applies cleanly on top of 2.6.23 and 2.6.24-rc.

--
Dan

---snip---

raid5: fix clearing of biofill operations (try2)

From: Dan Williams <[EMAIL PROTECTED]>

ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Tested-by: Joël Bertrand <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..3808f52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -377,7 +377,12 @@ static unsigned long get_stripe_work(struct stripe_head *sh)
 		ack++;
 
 	sh->ops.count -= ack;
-	BUG_ON(sh->ops.count < 0);
+	if (unlikely(sh->ops.count < 0)) {
+		printk(KERN_ERR "pending: %#lx ops.pending: %#lx ops.ack: %#lx "
+			"ops.complete: %#lx\n", pending, sh->ops.pending,
+			sh->ops.ack, sh->ops.complete);
+		BUG();
+	}
 
 	return pending;
 }
@@ -551,8 +556,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			}
 		}
 	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 
 	return_io(return_bi);
 
@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Raid5 trouble

2007-10-19 Thread Dan Williams
On Fri, 2007-10-19 at 01:04 -0700, BERTRAND Joël wrote:
> 	I never see any oops with this patch. But I cannot create a RAID1
> array with a local RAID5 volume and a foreign RAID5 array exported by
> iSCSI.  iSCSI seems to work fine, but RAID1 creation randomly aborts due
> to an unknown SCSI task on the target side.

For now I am going to forward this patch to Neil for inclusion in
-stable and 2.6.24-rc.  I will add a "Tested-by: Joël Bertrand
<[EMAIL PROTECTED]>" unless you have an objection.

> 	I have stressed the iSCSI target with some simultaneous I/O
> (nullio, fileio and blockio) without any trouble, thus I suspect another
> bug in the raid code (or an arch specific bug). Over the last two days,
> I have made some tests to isolate and reproduce this bug:
> 1/ iSCSI target and initiator seem to work when I export a raid5 array
> with iSCSI;
> 2/ raid1 and raid5 seem to work with local disks;
> 3/ the iSCSI target is disconnected only when I create a raid1 volume
> over iSCSI (blockio _and_ fileio) with the following message:
> 
> Oct 18 10:43:52 poulenc kernel: iscsi_trgt: cmnd_abort(1156) 29 1 0 42
> 57344 0 0
> Oct 18 10:43:52 poulenc kernel: iscsi_trgt: Abort Task (01) issued on
> tid:1 lun:0 by sid:630024457682948 (Unknown Task)
> 
> 	I ran some dd's (read and write in nullio) between initiator and
> target for 12 hours without any disconnection, thus the iSCSI code seems
> to be robust. Both initiator and target are alone on a single gigabit
> ethernet link (without any switch). I'm investigating...

Can you reproduce on 2.6.22?

Also, I do not think this is the cause of your failure, but you have
CONFIG_DMA_ENGINE=y in your config.  Setting this to 'n' will compile
out the unneeded checks for offload engines in async_memcpy and
async_xor.
> 
> Regards,
> JKB

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] Raid5 trouble

2007-10-17 Thread Dan Williams
On Wed, 2007-10-17 at 09:44 -0700, BERTRAND Joël wrote:
> Dan,
> 
> I have modified get_stripe_work like this :
> 
> static unsigned long get_stripe_work(struct stripe_head *sh)
> {
>  unsigned long pending;
>  int ack = 0;
>  int a,b,c,d,e,f,g;
> 
>  pending = sh->ops.pending;
> 
>  test_and_ack_op(STRIPE_OP_BIOFILL, pending);
>  a=ack;
>  test_and_ack_op(STRIPE_OP_COMPUTE_BLK, pending);
>  b=ack;
>  test_and_ack_op(STRIPE_OP_PREXOR, pending);
>  c=ack;
>  test_and_ack_op(STRIPE_OP_BIODRAIN, pending);
>  d=ack;
>  test_and_ack_op(STRIPE_OP_POSTXOR, pending);
>  e=ack;
>  test_and_ack_op(STRIPE_OP_CHECK, pending);
>  f=ack;
>  if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
>  ack++;
>  g=ack;
> 
>  sh->ops.count -= ack;
> 
>  if (sh->ops.count<0) printk("%d %d %d %d %d %d %d\n",
> a,b,c,d,e,f,g);
>  BUG_ON(sh->ops.count < 0);
> 
>  return pending;
> }
> 
> and I obtain on the console:
> 
>   1 1 1 1 1 2
> kernel BUG at drivers/md/raid5.c:390!
>\|/  \|/
>"@'/ .. \`@"
>/_| \__/ |_\
>   \__U_/
> md7_resync(5409): Kernel bad sw trap 5 [#1]
> 
> If that can help you...
> 
> JKB

This gives more evidence that it is probably mishandling of
STRIPE_OP_BIOFILL.  The attached patch (replacing the previous) moves
the clearing of these bits into handle_stripe5 and adds some debug
information.

--
Dan
raid5: fix clearing of biofill operations (try2)

From: Dan Williams <[EMAIL PROTECTED]>

ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the
'pending' and 'ack' bits.  Since the test_and_ack_op() macro only checks
against 'complete' it can get an inconsistent snapshot of pending work.

Move the clearing of these bits to handle_stripe5(), under the lock.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..3808f52 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -377,7 +377,12 @@ static unsigned long get_stripe_work(struct stripe_head *sh)
 		ack++;
 
 	sh->ops.count -= ack;
-	BUG_ON(sh->ops.count < 0);
+	if (unlikely(sh->ops.count < 0)) {
+		printk(KERN_ERR "pending: %#lx ops.pending: %#lx ops.ack: %#lx "
+			"ops.complete: %#lx\n", pending, sh->ops.pending,
+			sh->ops.ack, sh->ops.complete);
+		BUG();
+	}
 
 	return pending;
 }
@@ -551,8 +556,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			}
 		}
 	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 
 	return_io(return_bi);
 
@@ -2630,6 +2634,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;


Re: [BUG] Raid5 trouble

2007-10-17 Thread Dan Williams
On 10/17/07, Dan Williams <[EMAIL PROTECTED]> wrote:
> On 10/17/07, BERTRAND Joël <[EMAIL PROTECTED]> wrote:
> > BERTRAND Joël wrote:
> > > Hello,
> > >
> > > I run 2.6.23 linux kernel on two T1000 (sparc64) servers. Each
> > > server has a partitionable raid5 array (/dev/md/d0) and I have to
> > > synchronize both raid5 volumes by raid1. Thus, I have tried to build a
> > > raid1 volume between /dev/md/d0p1 and /dev/sdi1 (exported by iscsi from
> > > the second server) and I obtain a BUG :
> > >
> > > Root gershwin:[/usr/scripts] > mdadm -C /dev/md7 -l1 -n2 /dev/md/d0p1
> > > /dev/sdi1
> > > ...
> >
> > Hello,
> >
> > I have fixed iscsi-target, and I have tested it. It works now without
> > any trouble. Patches were posted on iscsi-target mailing list. When I
> > use iSCSI to access to foreign raid5 volume, it works fine. I can format
> > foreign volume, copy large files on it... But when I tried to create a
> > new raid1 volume with a local raid5 volume and a foreign raid5 volume, I
> > receive my well known Oops. You can find my dmesg after Oops :
> >
>
> Can you send your .config and your bootup dmesg?
>

I found a problem which may lead to the operations count dropping
below zero.  If ops_complete_biofill() gets preempted in between the
following calls:

raid5.c:554> clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
raid5.c:555> clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);

...then get_stripe_work() can recount/re-acknowledge STRIPE_OP_BIOFILL
causing the assertion.  In fact, the 'pending' bit should always be
cleared first, but the other cases are protected by
spin_lock(&sh->lock).  Patch attached.
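
To make the window concrete, one interleaving that trips the BUG_ON
looks like this (an illustrative timeline only, not new code):

	/* CPU0: ops_complete_biofill()     CPU1: get_stripe_work()
	 *
	 * clear_bit(STRIPE_OP_BIOFILL,
	 *           &sh->ops.ack);
	 *                                  sees 'pending' still set and
	 *                                  'ack' now clear, so BIOFILL is
	 *                                  re-acknowledged: ack++ and
	 *                                  sh->ops.count is decremented
	 *                                  a second time
	 * clear_bit(STRIPE_OP_BIOFILL,
	 *           &sh->ops.pending);     sh->ops.count has gone negative
	 */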

--
Dan


fix-biofill-clear.patch
Description: Binary data


Re: [BUG] Raid5 trouble

2007-10-17 Thread Dan Williams
On 10/17/07, BERTRAND Joël <[EMAIL PROTECTED]> wrote:
> BERTRAND Joël wrote:
> > Hello,
> >
> > I run 2.6.23 linux kernel on two T1000 (sparc64) servers. Each
> > server has a partitionable raid5 array (/dev/md/d0) and I have to
> > synchronize both raid5 volumes by raid1. Thus, I have tried to build a
> > raid1 volume between /dev/md/d0p1 and /dev/sdi1 (exported by iscsi from
> > the second server) and I obtain a BUG :
> >
> > Root gershwin:[/usr/scripts] > mdadm -C /dev/md7 -l1 -n2 /dev/md/d0p1
> > /dev/sdi1
> > ...
>
> Hello,
>
> I have fixed iscsi-target, and I have tested it. It works now without
> any trouble. Patches were posted on iscsi-target mailing list. When I
> use iSCSI to access to foreign raid5 volume, it works fine. I can format
> foreign volume, copy large files on it... But when I tried to create a
> new raid1 volume with a local raid5 volume and a foreign raid5 volume, I
> receive my well known Oops. You can find my dmesg after Oops :
>

Can you send your .config and your bootup dmesg?

Thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: experiences with raid5: stripe_queue patches

2007-10-16 Thread Dan Williams
On Mon, 2007-10-15 at 08:03 -0700, Bernd Schubert wrote:
> Hi,
> 
> in order to tune raid performance I did some benchmarks with and
> without the
> stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.

Thanks for testing!

> It seems there is a regression with these patch regarding the re-write
> performance, as you can see its almost 50% of what it should be.
> 
> write  re-write   read   re-read
> 480844.26  448723.48  707927.55  706075.02 (2.6.22 w/o SQ patches)
> 487069.47  232574.30  709038.28  707595.09 (2.6.23 with SQ patches)
> 469865.75  438649.88  711211.92  703229.00 (2.6.23 without SQ patches)

A quick way to verify that it is a fairness issue is to simply not
promote full stripe writes to their own list, debug patch follows:

---

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index eb7fd10..755aafb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -162,7 +162,7 @@ static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
 
 	if (to_write &&
 	    io_weight(sq->overwrite, disks) == data_disks) {
-		list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+		list_add_tail(&sq->list_node, &conf->io_lo_q_list);
 		queue_work(conf->workqueue, &conf->stripe_queue_work);
 	} else if (io_weight(sq->to_read, disks)) {
 		list_add_tail(&sq->list_node, &conf->io_lo_q_list);


---




> 
> An interesting effect to notice: without these patches the pdflush
> daemons will take a lot of CPU time; with these patches, pdflush almost
> doesn't appear in the 'top' list.
> 
> Actually we would prefer one single raid5 array, but then one single
> raid5 thread will run at 100% CPU time leaving 7 CPUs in idle state,
> the status of the hardware raid says its utilization is only at about
> 50%, and we only see writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the
> hardware raid systems is the bottleneck.
> 
> Is there any chance to parallelize the raid5 code? I think almost
> everything is done in raid5.c make_request(), but the main loop there
> is spin_locked by prepare_to_wait(). Would it be possible not to lock
> this entire loop?

I made a rough attempt at multi-threading raid5[1] a while back.
However, this configuration only helps affinity; it does not address the
cases where the load needs to be further rebalanced between cpus.
> 
> 
> Thanks,
> Bernd
> 

[1] http://marc.info/?l=linux-raid&m=117262977831208&w=2
Note this implementation incorrectly handles the raid6 spare_page, we
would need a spare_page per cpu.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm: /dev/sda1 is too small: 0K

2007-10-13 Thread Dan Williams
On 10/13/07, Hod Greeley <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I tried to create a raid device starting with
>
> foo:~ 1032% mdadm --create -l1 -n2 /dev/md1 /dev/sda1 missing
> mdadm: /dev/sda1 is too small: 0K
> mdadm: create aborted
>

Quick sanity check: is /dev/sda1 still a block device node with major
number 8 and minor number 1, i.e., does the following fix the issue?

mknod /dev/sda1 b 8 1



> Thanks,
> Hod Greeley
>

--
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-09 Thread Dan Williams
On Mon, 2007-10-08 at 23:21 -0700, Neil Brown wrote:
> On Saturday October 6, [EMAIL PROTECTED] wrote:
> > Neil,
> >
> > Here is the latest spin of the 'stripe_queue' implementation.  Thanks
> > to raid6+bitmap testing done by Mr. James W. Laferriere there have
> > been several cleanups and fixes since the last release.  Also, the
> > changes are now spread over 4 patches to isolate one conceptual change
> > per patch.  The most significant cleanup is removing the stripe_head
> > back pointer from stripe_queue.  This effectively makes the queuing
> > layer independent from the caching layer.
> 
> Thanks Dan, and sorry that it has taken such a time for me to take a
> serious look at this.

Not a problem, I've actually only recently had some cycles to look at
these patches again myself.

> The results seem impressive.  I'll try to do some testing myself, but
> firstly: some questions.

> 1/ Can you explain why this improves the performance more than simply
>   doubling the size of the stripe cache?

Before I answer, here are some quick numbers to quantify the difference
versus simply doubling the size of the stripe cache:

Test Configuration:
mdadm --create /dev/md0 /dev/sd[bcdefghi] -n 8 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Average rate taken for 2.6.23-rc9 (1), 2.6.23-rc9 with stripe_cache_size
= 512 (2), 2.6.23-rc9+stripe_queue (3), 2.6.23-rc9+stripe_queue with
stripe_cache_size = 512 (4).

(1): 181MB/s
(2): 252MB/s (+41%)
(3): 330MB/s (+82%)
(4): 352MB/s (+94%)

>   The core of what it is doing seems to be to give priority to writing
>   full stripes.  We already do that by delaying incomplete stripes.
>   Maybe we just need to tune that mechanism a bit?  Maybe release
>   fewer partial stripes at a time?

>   It seems that the whole point of the stripe_queue structure is to
>   allow requests to gather before they are processed so the more
>   "deserving" can be processed first, but I cannot see why you need a
>   data structure separate from the list_head.

>   You could argue that simply doubling the size of the stripe cache
>   would be a waste of memory as we only want to use half of it to
>   handle active requests - the other half is for requests being built
>   up.
>   In that case, I don't see a problem with having a pool of pages
>   which is smaller that would be needed for the full stripe cache, and
>   allocating them to stripe_heads as they become free.

I believe the additional performance is coming from the fact that
delayed stripes are no longer consuming cache space while they wait for
their delay condition to clear, *and* that full stripe writes are
explicitly detected and moved to the front of the line.  This
effectively makes delayed stripes wait longer in some cases which is the
overall goal.

> 2/ I thought I understood from your descriptions that
>raid456_cache_arbiter would normally be waiting for a free stripe,
>that during this time full stripes could get promoted to io_hi, and
>so when raid456_cache_arbiter finally got a free stripe, it would
>attach it to the most deserving stripe_queue.  However it doesn't
>quite do that.  It chooses the deserving stripe_queue *before*
>waiting for a free stripe_head.  This seems slightly less than
>optimal?

I see, get the stripe first and then go look at io_hi versus io_lo.
Yes, that would prevent some unnecessary io_lo requests from sneaking
into the cache.
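
Roughly, the arbiter loop would become something like this (a sketch
only -- the helper names are invented for illustration and are not the
functions in the posted patches):

	struct stripe_head *sh;
	struct stripe_queue *sq;

	for (;;) {
		/* block for a free stripe_head *first*, then pick the
		 * most deserving queue, so a full stripe write that
		 * arrives while we sleep can still jump to the front
		 */
		sh = wait_for_inactive_stripe(conf);	/* blocking (assumed) */
		sq = dequeue_io_hi_else_io_lo(conf);	/* assumed helper */
		if (sq)
			attach_queue_to_stripe(conf, sh, sq);
	}
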
> 
> 3/ Why create a new workqueue for raid456_cache_arbiter rather than
>use raid5d.  It should be possible to do a non-blocking wait for a
>free stripe_head, in which cache the "find a stripe head and attach
>the most deserving stripe_queue" would fit well into raid5d.

It seemed necessary to have at least one thread doing a blocking wait on
the stripe cache... but moving this all under raid5d seems possible.
And, it might fix the deadlock condition that Jim is able to create in
his testing with bitmaps.  I have sent him a patch, off-list, to move
all bitmap handling to the stripe_queue which seems to improve
bitmap-write performance, but he still sees cases where raid5d() and
raid456_cache_arbiter() are staring blankly at each other while bonnie++
patiently waits in D state.  A kick
to /sys/block/md3/md/stripe_cache_size gets things going again.

> 4/ Why do you use an rbtree rather than a hash table to index the
>   'stripe_queue' objects?  I seem to recall a discussion about this
>   where it was useful to find adjacent requests or something like
>   that, but I cannot see that in the current code.
>   But maybe rbtrees are a better fit, in which case, should we use
>   them for stripe_heads as well???

If you are referring to the following:
http://marc.info/?l=linux-kernel&m=117740314031101&w=2
...then no, I am not caching the leftmost or rightmost entry to speed
lookups.

I initially did not know how many queues would need to be in play versus
stripe_heads to yield a performa

Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-07 Thread Dan Williams
On 10/6/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
> On Sat, 6 Oct 2007, Dan Williams wrote:
>
> > Neil,
> >
> > Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
> > raid6+bitmap testing done by Mr. James W. Laferriere there have been
> > several cleanups and fixes since the last release.  Also, the changes
> > are now spread over 4 patches to isolate one conceptual change per
> > patch.  The most significant cleanup is removing the stripe_head back
> > pointer from stripe_queue.  This effectively makes the queuing layer
> > independent from the caching layer.
> >
> > Expansion support needs more testing.
> >
> > See the individual patch changelogs for details.  Patch 1 contains
> > updated performance numbers.
> >
> > Andrew,
> >
> > These are updated in the git-md-accel tree, but I will work the
> > finalized versions through Neil's 'Signed-off-by' path.
> >
> > Dan Williams (4):
> >  raid5: add the stripe_queue object for tracking raid io requests (rev3)
> >  raid5: split allocation of stripe_heads and stripe_queues
> >  raid5: convert add_stripe_bio to add_queue_bio
> >  raid5: use stripe_queues to prioritize the "most deserving" requests 
> > (rev7)
> >
> > drivers/md/raid5.c | 1560 
> > 
> > include/linux/raid/raid5.h |   88 ++-
> > 2 files changed, 1200 insertions(+), 448 deletions(-)
> >
> > --
> > Dan
>
> These patches & data look very impressive, do we have an ETA of when they
> will be merged into mainline?
>
> Justin.

The short answer is "when they are ready."  Jim reported that he is
seeing bonnie++ get stuck in D state on his platform, so debugging is
ongoing.  Additional testing is always welcome...

--
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -mm 4/4] raid5: use stripe_queues to prioritize the "most deserving" requests (rev7)

2007-10-06 Thread Dan Williams
orming
  atomic_dec_and_test(&conf->active_aligned_reads)
* Fix: retry_aligned_read needs to call release_stripe on its stripe_head
  after handle_queue, otherwise we deadlock on drive removal.  To make this
  more obvious handle_queue no longer implicitly releases the stripe_queue.
* kill wait_for_attach
* Fix up stripe_queue documentation

Changes in rev7
* split out the 'add_queue_bio' and object allocation changes into separate
  patches
* fix release_stripe/release_queue ordering
* refactor handle_queue and release_queue to remove STRIPE_QUEUE_HANDLE and
  sq->sh back references
* kill init_sh and allocate init_sq on the stack

Tested-by: Mr. James W. Laferriere <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  843 +---
 include/linux/raid/raid5.h |   45 ++
 2 files changed, 666 insertions(+), 222 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d566fc9..eb7fd10 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -67,7 +67,7 @@
 #defineIO_THRESHOLD1
 #define NR_HASH(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK  (NR_HASH - 1)
-#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
+#define STRIPE_QUEUE_SIZE 2 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -142,16 +142,66 @@ static unsigned long io_weight(unsigned long *bitmap, int disks)
 
 static void print_raid5_conf (raid5_conf_t *conf);
 
+/* __release_queue - route the stripe_queue based on pending i/o's.  The
+ * queue object is allowed to bounce around between 4 lists up until
+ * it is attached to a stripe_head.  The lists in order of priority are:
+ * 1/ overwrite: all data blocks are set to be overwritten, no prereads
+ * 2/ unaligned_read: read requests that get past chunk_aligned_read
+ * 3/ subwidth_write: write requests that require prereading
+ * 4/ delayed_q: write requests pending activation
+ */
+static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
+{
+   if (atomic_dec_and_test(&sq->count)) {
+   int disks = sq->disks;
+   int data_disks = disks - conf->max_degraded;
+   int to_write = io_weight(sq->to_write, disks);
+
+   BUG_ON(!list_empty(&sq->list_node));
+   BUG_ON(atomic_read(&conf->active_queues) == 0);
+
+   if (to_write &&
+   io_weight(sq->overwrite, disks) == data_disks) {
+   list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+   queue_work(conf->workqueue, &conf->stripe_queue_work);
+   } else if (io_weight(sq->to_read, disks)) {
+   list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+   queue_work(conf->workqueue, &conf->stripe_queue_work);
+   } else if (to_write &&
+  test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state)) {
+   list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+   queue_work(conf->workqueue, &conf->stripe_queue_work);
+   } else if (to_write) {
+   list_add_tail(&sq->list_node, &conf->delayed_q_list);
+   blk_plug_device(conf->mddev->queue);
+   } else {
+   atomic_dec(&conf->active_queues);
+   if (test_and_clear_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+  &sq->state)) {
+   atomic_dec(&conf->preread_active_queues);
+   if (atomic_read(&conf->preread_active_queues) <
+   IO_THRESHOLD)
+   queue_work(conf->workqueue,
+  &conf->stripe_queue_work);
+   }
+   if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
+   list_add_tail(&sq->list_node,
+ &conf->inactive_q_list);
+   wake_up(&conf->wait_for_queue);
+   }
+   }
+   }
+}
+
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+   struct stripe_queue *sq = sh->sq;
+
if (atomic_dec_and_test(&sh->count)) {
BUG_ON(!list_empty(&sh->lru));
BUG_ON(atomic_read(&conf->active_stripes)==0);
if (test_bit(STRIPE_HANDLE, &sh->state)) {
-

[PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio

2007-10-06 Thread Dan Williams
The stripe_queue object collects i/o requests before they are handled by
the stripe-cache (via the stripe_head object).  add_stripe_bio currently
looks at the state of the stripe-cache to implement bitmap support;
reimplement this using stripe_queue attributes.

Introduce the STRIPE_QUEUE_FIRSTWRITE flag to track when a stripe is first
written.  When a stripe_head is available, record the bitmap batch sequence
number and set STRIPE_BIT_DELAY.  For now a stripe_head will always be
available at 'add_queue_bio' time; going forward, the 'sh' field of the
stripe_queue will indicate whether a stripe_head is attached.

Tested-by: Mr. James W. Laferriere <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   53 
 include/linux/raid/raid5.h |6 +
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7bc206c..d566fc9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,8 +31,10 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
- * the number of the batch it will be in. This is bm_flush+1.
+ * (in add_queue_bio) we update the in-memory bitmap and record in the
+ * stripe_queue that a bitmap write was started.  Then, in handle_stripe when
+ * we have a stripe_head available, we update sh->bm_seq to record the
+ * sequence number (target batch number) of this request.  This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
  * When an unplug happens, we increment bm_flush, thus closing the current
@@ -360,8 +362,14 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
}
} while (sh == NULL);
 
-   if (sh)
+   if (sh) {
atomic_inc(&sh->count);
+   if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE,
+   &sh->sq->state)) {
+   sh->bm_seq = conf->seq_flush+1;
+   set_bit(STRIPE_BIT_DELAY, &sh->state);
+   }
+   }
 
spin_unlock_irq(&conf->device_lock);
return sh;
@@ -1991,26 +1999,34 @@ handle_write_operations5(struct stripe_head *sh, int 
rcw, int expand)
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
+ int forwrite)
 {
struct bio **bip;
-   struct stripe_queue *sq = sh->sq;
raid5_conf_t *conf = sq->raid_conf;
int firstwrite=0;
 
-   pr_debug("adding bh b#%llu to stripe s#%llu\n",
+   pr_debug("adding bio (%llu) to queue (%llu)\n",
(unsigned long long)bi->bi_sector,
-   (unsigned long long)sh->sector);
-
+   (unsigned long long)sq->sector);
 
spin_lock(&sq->lock);
spin_lock_irq(&conf->device_lock);
if (forwrite) {
bip = &sq->dev[dd_idx].towrite;
-   if (*bip == NULL && sq->dev[dd_idx].written == NULL)
+   set_bit(dd_idx, sq->to_write);
+   if (*bip == NULL && sq->dev[dd_idx].written == NULL) {
+   /* flag the queue to be assigned a bitmap
+* sequence number
+*/
+   set_bit(STRIPE_QUEUE_FIRSTWRITE, &sq->state);
firstwrite = 1;
-   } else
+   }
+   } else {
bip = &sq->dev[dd_idx].toread;
+   set_bit(dd_idx, sq->to_read);
+   }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2024,19 +2040,17 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
bi->bi_next = *bip;
*bip = bi;
bi->bi_phys_segments ++;
+
spin_unlock_irq(&conf->device_lock);
spin_unlock(&sq->lock);
 
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)bi->bi_sector,
-   (unsigned long long)sh->sector, dd_idx);
+   (unsigned long long)sq->sector, dd_idx);
 
-   if (conf->mddev->bitmap && firstwrite) {
-   

[PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3)

2007-10-06 Thread Dan Williams
The raid5 stripe cache object, struct stripe_head, serves two purposes:
1/ front-end: queuing incoming requests
2/ back-end: transitioning requests through the cache state machine
   to the backing devices
The problem with this model is that queuing decisions are directly tied to
cache availability.  There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given time.

This patch separates the object members needed for queuing from the object
members used for caching.  The stripe_queue object takes over the incoming
bio lists, the io completion bio lists, and the parameters needed for
expansion.

The following fields are moved from struct stripe_head to struct
stripe_queue:
raid5_private_data *raid_conf
int pd_idx
spinlock_t lock
int disks

The following fields are moved from struct r5dev to struct r5_queue_dev:
sector_t sector
struct bio *toread, *read, *towrite, *written

This (first) commit just moves fields around; subsequent commits take
advantage of the split for performance gains.
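
For orientation, the resulting split looks roughly like this (an abridged
sketch showing only the moved fields, not the complete definitions):

struct r5_queue_dev {
	sector_t sector;			/* moved from r5dev */
	struct bio *toread, *read, *towrite, *written;
};

struct stripe_queue {
	struct raid5_private_data *raid_conf;	/* moved from stripe_head */
	int pd_idx;				/* parity disk index */
	spinlock_t lock;
	int disks;
	struct r5_queue_dev dev[1];		/* sized by the disk count */
};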

--- Performance Data ---
Platform: SMP x4 IA, sata_vsc, 7200RPM SATA Drives x4
Test1: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir 
/mnt/raid
=pre-patch=
Sequential Writes:
File   Blk     Num  Avg
Size   Size    Thr  Rate (MiB/s)
-----  ------  ---  ------------
2048   131072  1    72.02
2048   131072  8    41.51

=post-patch=
Sequential Writes:
File   Blk     Num  Avg
Size   Size    Thr  Rate (MiB/s)
-----  ------  ---  ------------
2048   131072  1    140.86 (+96%)
2048   131072  8    50.18 (+21%)

Test2: blktrace of: dd if=/dev/zero of=/dev/md0 bs=1024k count=1024
=pre-patch=
Total (sdd):
 Reads Queued:      1,383,    5,532KiB  Writes Queued:    80,186,  320,744KiB
 Reads Completed:     276,    4,888KiB  Writes Completed: 12,677,  294,324KiB
 IO unplugs: 0   Timer unplugs:   0

=post-patch=
Total (sdd):
 Reads Queued:         61,      244KiB  Writes Queued:    66,330,  265,320KiB
 Reads Completed:       4,      112KiB  Writes Completed:  3,562,  285,912KiB
 IO unplugs:           16   Timer unplugs:      17

Platform: SMP x4 IA, mptsas, 15000RPM SAS Drives x4
Test: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir 
/mnt/raid
=pre-patch=
Sequential Writes:
File   Blk     Num  Avg
Size   Size    Thr  Rate (MiB/s)
-----  ------  ---  ------------
2048   131072  1    132.51
2048   131072  8    86.92

=post-patch=
Sequential Writes:
File   Blk     Num  Avg
Size   Size    Thr  Rate (MiB/s)
-----  ------  ---  ------------
2048   131072  1    172.26 (+30%)
2048   131072  8    114.82 (+32%)

Changes in rev2:
* leave the flags with the buffers, prevents a data corruption issue
  whereby stale buffer state flags are attached to newly initialized
  buffers

Changes in rev3:
* move bm_seq back into the stripe_head, since the bitmap sequencing
  matters at write-out time (after cache attach).  Thanks to Mr. James W.
  Laferriere for his bug reports and testing of bitmap support.
* move 'int disks' into stripe_queue since expansion details are recorded
  at make_request() time (i.e. pre stripe_head availability)
* move dev->read, dev->written to dev_q->read and dev_q->written.  Allow
  sq->sh back references to be killed, and remove the need to handle sh
  details in add_queue_bio

Tested-by: Mr. James W. Laferriere <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  564 +++-
 include/linux/raid/raid5.h |   28 +-
 2 files changed, 364 insertions(+), 228 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..a13de7d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
unsigned long flags;
 
spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
int i;
 
BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
remove_hash(sh);
 
sh->sector = sector;
-   sh->pd_idx = pd_idx;
+   sh->sq->pd_idx = pd_idx;
sh->state = 0;
 
-   sh->disks = disks

[PATCH -mm 2/4] raid5: split allocation of stripe_heads and stripe_queues

2007-10-06 Thread Dan Williams
Provide separate routines for allocating stripe_head and stripe_queue
objects and introduce 'io_weight' bitmaps to struct stripe_queue.

The io_weight bitmaps add an efficient way to determine what is pending in
a stripe_queue using 'hweight' in comparison to a 'for' loop.
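
For example, the full-stripe-write test that the queue-routing patch
builds on top of these bitmaps reduces to a single weight comparison --
a sketch of the check as used in patch 4/4, with 'sq' and 'conf' as in
the surrounding raid5 code and 'promote_to_io_hi' a hypothetical helper:

	/* every data disk set in 'overwrite' means a full stripe write */
	int data_disks = sq->disks - conf->max_degraded;

	if (io_weight(sq->overwrite, sq->disks) == data_disks)
		promote_to_io_hi(sq);	/* hypothetical helper */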

Tested-by: Mr. James W. Laferriere <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  316 
 include/linux/raid/raid5.h |   11 +-
 2 files changed, 239 insertions(+), 88 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a13de7d..7bc206c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -65,6 +65,7 @@
 #defineIO_THRESHOLD1
 #define NR_HASH(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK  (NR_HASH - 1)
+#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -78,6 +79,8 @@
  * of the current stripe+device
  */
 #define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
+#define r5_io_weight_size(devs) (sizeof(unsigned long) * \
+ (ALIGN(devs, BITS_PER_LONG) / BITS_PER_LONG))
 /*
  * The following can be used to debug the driver
  */
@@ -120,6 +123,21 @@ static void return_io(struct bio *return_bi)
}
 }
 
+#if BITS_PER_LONG == 32
+#define hweight hweight32
+#else
+#define hweight hweight64
+#endif
+static unsigned long io_weight(unsigned long *bitmap, int disks)
+{
+   unsigned long weight = hweight(*bitmap);
+
+   for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++)
+   weight += hweight(*bitmap);
+
+   return weight;
+}
+
 static void print_raid5_conf (raid5_conf_t *conf);
 
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
@@ -236,36 +254,37 @@ static int grow_buffers(struct stripe_head *sh, int num)
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
+static void init_queue(struct stripe_queue *sq, sector_t sector,
+   int disks, int pd_idx);
+
+static void
+init_stripe(struct stripe_head *sh, struct stripe_queue *sq,
+sector_t sector, int pd_idx, int disks)
 {
-   raid5_conf_t *conf = sh->sq->raid_conf;
+   raid5_conf_t *conf = sq->raid_conf;
int i;
 
+   pr_debug("init_stripe called, stripe %llu\n",
+   (unsigned long long)sector);
+
BUG_ON(atomic_read(&sh->count) != 0);
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+   init_queue(sh->sq, sector, disks, pd_idx);
 
CHECK_DEVLOCK();
-   pr_debug("init_stripe called, stripe %llu\n",
-   (unsigned long long)sh->sector);
 
remove_hash(sh);
 
sh->sector = sector;
-   sh->sq->pd_idx = pd_idx;
sh->state = 0;
 
-   sh->sq->disks = disks;
-
for (i = disks; i--;) {
struct r5dev *dev = &sh->dev[i];
-   struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-   if (dev_q->toread || dev_q->read || dev_q->towrite ||
-   dev_q->written || test_bit(R5_LOCKED, &dev->flags)) {
-   printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-  (unsigned long long)sh->sector, i, dev_q->toread,
-  dev_q->read, dev_q->towrite, dev_q->written,
+   if (test_bit(R5_LOCKED, &dev->flags)) {
+   printk(KERN_ERR "sector=%llx i=%d %d\n",
+  (unsigned long long)sector, i,
   test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -283,7 +302,7 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, int disks)
CHECK_DEVLOCK();
pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-   if (sh->sector == sector && sh->sq->disks == disks)
+   if (sh->sector == sector && disks == disks)
return sh;
pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
@@ -326,7 +345,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
);
  

[PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-06 Thread Dan Williams
Neil,

Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
raid6+bitmap testing done by Mr. James W. Laferriere there have been
several cleanups and fixes since the last release.  Also, the changes
are now spread over 4 patches to isolate one conceptual change per
patch.  The most significant cleanup is removing the stripe_head back
pointer from stripe_queue.  This effectively makes the queuing layer
independent from the caching layer.

Expansion support needs more testing.

See the individual patch changelogs for details.  Patch 1 contains
updated performance numbers.

Andrew,

These are updated in the git-md-accel tree, but I will work the
finalized versions through Neil's 'Signed-off-by' path.

Dan Williams (4):
  raid5: add the stripe_queue object for tracking raid io requests (rev3)
  raid5: split allocation of stripe_heads and stripe_queues
  raid5: convert add_stripe_bio to add_queue_bio
  raid5: use stripe_queues to prioritize the "most deserving" requests 
(rev7)

 drivers/md/raid5.c | 1560 
 include/linux/raid/raid5.h |   88 ++-
 2 files changed, 1200 insertions(+), 448 deletions(-)

--
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] async-tx/md-accel fixes and documentation for 2.6.23

2007-09-24 Thread Dan Williams
Linus, please pull from:

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

to receive:

Dan Williams (3):
  async_tx: usage documentation and developer notes (v2)
  async_tx: fix dma_wait_for_async_tx
  raid5: fix 2 bugs in ops_complete_biofill

The raid5 change has been reviewed with Neil, and the documentation
received some fixups from Randy Dunlap and Shannon Nelson.

Documentation/crypto/async-tx-api.txt |  219 +
crypto/async_tx/async_tx.c|   12 ++-
drivers/md/raid5.c|   17 +--
3 files changed, 236 insertions(+), 12 deletions(-)

---

diff --git a/Documentation/crypto/async-tx-api.txt 
b/Documentation/crypto/async-tx-api.txt
new file mode 100644
index 000..c1e9545
--- /dev/null
+++ b/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,219 @@
+Asynchronous Transfers/Transforms API
+
+1 INTRODUCTION
+
+2 GENEALOGY
+
+3 USAGE
+3.1 General format of the API
+3.2 Supported operations
+3.3 Descriptor management
+3.4 When does the operation execute?
+3.5 When does the operation complete?
+3.6 Constraints
+3.7 Example
+
+4 DRIVER DEVELOPER NOTES
+4.1 Conformance points
+4.2 "My application needs finer control of hardware channels"
+
+5 SOURCE
+
+---
+
+1 INTRODUCTION
+
+The async_tx API provides methods for describing a chain of asynchronous
+bulk memory transfers/transforms with support for inter-transactional
+dependencies.  It is implemented as a dmaengine client that smooths over
+the details of different hardware offload engine implementations.  Code
+that is written to the API can optimize for asynchronous operation and
+the API will fit the chain of operations to the available offload
+resources.
+
+2 GENEALOGY
+
+The API was initially designed to offload the memory copy and
+xor-parity-calculations of the md-raid5 driver using the offload engines
+present in the Intel(R) Xscale series of I/O processors.  It also built
+on the 'dmaengine' layer developed for offloading memory copies in the
+network stack using Intel(R) I/OAT engines.  The following design
+features surfaced as a result:
+1/ implicit synchronous path: users of the API do not need to know if
+   the platform they are running on has offload capabilities.  The
+   operation will be offloaded when an engine is available and carried out
+   in software otherwise.
+2/ cross channel dependency chains: the API allows a chain of dependent
+   operations to be submitted, like xor->copy->xor in the raid5 case.  The
+   API automatically handles cases where the transition from one operation
+   to another implies a hardware channel switch.
+3/ dmaengine extensions to support multiple clients and operation types
+   beyond 'memcpy'
+
+3 USAGE
+
+3.1 General format of the API:
+struct dma_async_tx_descriptor *
+async_<operation>(<op specific parameters>,
+		  enum async_tx_flags flags,
+		  struct dma_async_tx_descriptor *dependency,
+		  dma_async_tx_callback callback_routine,
+		  void *callback_parameter);
+
+3.2 Supported operations:
+memcpy   - memory copy between a source and a destination buffer
+memset   - fill a destination buffer with a byte value
+xor  - xor a series of source buffers and write the result to a
+  destination buffer
+xor_zero_sum - xor a series of source buffers and set a flag if the
+  result is zero.  The implementation attempts to prevent
+  writes to memory
+
+3.3 Descriptor management:
+The return value is non-NULL and points to a 'descriptor' when the operation
+has been queued to execute asynchronously.  Descriptors are recycled
+resources, under control of the offload engine driver, to be reused as
+operations complete.  When an application needs to submit a chain of
+operations it must guarantee that the descriptor is not automatically recycled
+before the dependency is submitted.  This requires that all descriptors be
+acknowledged by the application before the offload engine driver is allowed to
+recycle (or free) the descriptor.  A descriptor can be acked by one of the
+following methods:
+1/ setting the ASYNC_TX_ACK flag if no child operations are to be submitted
+2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
+   descriptor of a new operation.
+3/ calling async_tx_ack() on the descriptor.
+
+3.4 When does the operation execute?
+Operations do not immediately issue after return from the
+async_<operation> call.  Offload engine drivers batch operations to
+improve performance by reducing the number of mmio cycles needed to
+manage the channel.  Once a driver-specific threshold is met the driver
+automatically issues pending operations.  An application can force this
+event by calling async_tx_issue_pending_all().  This operates on all
+channels since the application has no knowledge of channel to operation
+mapping.
+
+3.5 When does the operation complete?
+There are two

[PATCH 2.6.23-rc7 2/3] async_tx: fix dma_wait_for_async_tx

2007-09-20 Thread Dan Williams
Fix dma_wait_for_async_tx to not loop forever in the case where a
dependency chain is longer than two entries.  This condition will not
happen with current in-kernel drivers, but fix it for future drivers.
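
The failing case, as reported, is a dependency chain of three or more
descriptors (an illustration of the report, not new code):

	/* A <- B <- C, with B and C unsubmitted (cookie == -EBUSY).
	 * The old loop walked from C up to the first submitted ancestor
	 * (A), waited for A, then restarted from C; since B's submission
	 * was never driven forward the loop never terminated.  The fix
	 * stops the walk at B, the root of the unsubmitted chain, and
	 * waits there instead.
	 */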

Found-by: Saeed Bishara <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 crypto/async_tx/async_tx.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 0350071..bc18cbb 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -80,6 +80,7 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 {
enum dma_status status;
struct dma_async_tx_descriptor *iter;
+   struct dma_async_tx_descriptor *parent;
 
if (!tx)
return DMA_SUCCESS;
@@ -87,8 +88,15 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
/* poll through the dependency chain, return when tx is complete */
do {
iter = tx;
-   while (iter->cookie == -EBUSY)
-   iter = iter->parent;
+
+   /* find the root of the unsubmitted dependency chain */
+   while (iter->cookie == -EBUSY) {
+   parent = iter->parent;
+   if (parent && parent->cookie == -EBUSY)
+   iter = iter->parent;
+   else
+   break;
+   }
 
status = dma_sync_wait(iter->chan, iter->cookie);
} while (status == DMA_IN_PROGRESS || (iter != tx));
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2.6.23-rc7 3/3] raid5: fix ops_complete_biofill

2007-09-20 Thread Dan Williams
ops_complete_biofill tried to avoid calling handle_stripe since all the
state necessary to return read completions is available.  However, the
process of determining whether more read requests are pending requires
locking the stripe (to block add_stripe_bio from updating dev->toread).
ops_complete_biofill can run in tasklet context, so rather than upgrading
all the stripe locks from spin_lock to spin_lock_bh, this patch just moves
read completion handling back into handle_stripe.

Found-by: Yuri Tikhonov <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   90 +++-
 1 files changed, 46 insertions(+), 44 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4d63773..38c8893 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -512,54 +512,12 @@ async_copy_data(int frombio, struct bio *bio, struct page *page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
struct stripe_head *sh = stripe_head_ref;
-   struct bio *return_bi = NULL;
-   raid5_conf_t *conf = sh->raid_conf;
-   int i, more_to_read = 0;
 
pr_debug("%s: stripe %llu\n", __FUNCTION__,
(unsigned long long)sh->sector);
 
-   /* clear completed biofills */
-   for (i = sh->disks; i--; ) {
-   struct r5dev *dev = &sh->dev[i];
-   /* check if this stripe has new incoming reads */
-   if (dev->toread)
-   more_to_read++;
-
-   /* acknowledge completion of a biofill operation */
-   /* and check if we need to reply to a read request
-   */
-   if (test_bit(R5_Wantfill, &dev->flags) && !dev->toread) {
-   struct bio *rbi, *rbi2;
-   clear_bit(R5_Wantfill, &dev->flags);
-
-   /* The access to dev->read is outside of the
-* spin_lock_irq(&conf->device_lock), but is protected
-* by the STRIPE_OP_BIOFILL pending bit
-*/
-   BUG_ON(!dev->read);
-   rbi = dev->read;
-   dev->read = NULL;
-   while (rbi && rbi->bi_sector <
-   dev->sector + STRIPE_SECTORS) {
-   rbi2 = r5_next_bio(rbi, dev->sector);
-   spin_lock_irq(&conf->device_lock);
-   if (--rbi->bi_phys_segments == 0) {
-   rbi->bi_next = return_bi;
-   return_bi = rbi;
-   }
-   spin_unlock_irq(&conf->device_lock);
-   rbi = rbi2;
-   }
-   }
-   }
-   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-
-   return_io(return_bi);
-
-   if (more_to_read)
-   set_bit(STRIPE_HANDLE, &sh->state);
+   set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+   set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
 }
 
@@ -2112,6 +2070,42 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 }
 
 
+/* handle_completed_read_requests - return completion for reads and allow
+ * new read operations to be submitted to the stripe.
+ */
+static void handle_completed_read_requests(raid5_conf_t *conf,
+   struct stripe_head *sh,
+   struct bio **return_bi)
+{
+   int i;
+
+   pr_debug("%s: stripe %llu\n", __FUNCTION__,
+   (unsigned long long)sh->sector);
+
+   /* check if we need to reply to a read request */
+   for (i = sh->disks; i--; ) {
+   struct r5dev *dev = &sh->dev[i];
+
+   if (test_and_clear_bit(R5_Wantfill, &dev->flags)) {
+   struct bio *rbi, *rbi2;
+
+   rbi = dev->read;
+   dev->read = NULL;
+   while (rbi && rbi->bi_sector <
+   dev->sector + STRIPE_SECTORS) {
+   rbi2 = r5_next_bio(rbi, dev->sector);
+   spin_lock_irq(&conf->device_lock);
+   if (--rbi->bi_phys_segments == 0) {
+   rbi->bi_next = *return_bi;
+   *return_bi = rbi;
+   }
+   spin_unlock_irq(&conf->device_lock);
+   rbi = rbi2;
+

[PATCH 2.6.23-rc7 1/3] async_tx: usage documentation and developer notes

2007-09-20 Thread Dan Williams
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 Documentation/crypto/async-tx-api.txt |  217 +
 1 files changed, 217 insertions(+), 0 deletions(-)

diff --git a/Documentation/crypto/async-tx-api.txt 
b/Documentation/crypto/async-tx-api.txt
new file mode 100644
index 000..48d685a
--- /dev/null
+++ b/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,217 @@
+Asynchronous Transfers/Transforms API
+
+1 INTRODUCTION
+
+2 GENEALOGY
+
+3 USAGE
+3.1 General format of the API
+3.2 Supported operations
+3.2 Descriptor management
+3.3 When does the operation execute?
+3.4 When does the operation complete?
+3.5 Constraints
+3.6 Example
+
+4 DRIVER DEVELOPER NOTES
+4.1 Conformance points
+4.2 "My application needs finer control of hardware channels"
+
+5 SOURCE
+
+---
+
+1 INTRODUCTION
+
+The async_tx api provides methods for describing a chain of asynchronous
+bulk memory transfers/transforms with support for inter-transactional
+dependencies.  It is implemented as a dmaengine client that smooths over
+the details of different hardware offload engine implementations.  Code
+that is written to the api can optimize for asynchronous operation and
+the api will fit the chain of operations to the available offload
+resources.
+
+2 GENEALOGY
+
+The api was initially designed to offload the memory copy and
+xor-parity-calculations of the md-raid5 driver using the offload engines
+present in the Intel(R) Xscale series of I/O processors.  It also built
+on the 'dmaengine' layer developed for offloading memory copies in the
+network stack using Intel(R) I/OAT engines.  The following design
+features surfaced as a result:
+1/ implicit synchronous path: users of the API do not need to know if
+   the platform they are running on has offload capabilities.  The
+   operation will be offloaded when an engine is available and carried out
+   in software otherwise.
+2/ cross channel dependency chains: the API allows a chain of dependent
+   operations to be submitted, like xor->copy->xor in the raid5 case.  The
+   API automatically handles cases where the transition from one operation
+   to another implies a hardware channel switch.
+3/ dmaengine extensions to support multiple clients and operation types
+   beyond 'memcpy'
+
+3 USAGE
+
+3.1 General format of the API:
+struct dma_async_tx_descriptor *
+async_<operation>(<op specific parameters>,
+		  enum async_tx_flags flags,
+		  struct dma_async_tx_descriptor *dependency,
+		  dma_async_tx_callback callback_routine,
+		  void *callback_parameter);
+
+3.2 Supported operations:
+memcpy   - memory copy between a source and a destination buffer
+memset   - fill a destination buffer with a byte value
+xor - xor a series of source buffers and write the result to a
+  destination buffer
+xor_zero_sum - xor a series of source buffers and set a flag if the
+  result is zero.  The implementation attempts to prevent
+  writes to memory
+
+3.2 Descriptor management:
+The return value is non-NULL and points to a 'descriptor' when the operation
+has been queued to execute asynchronously.  Descriptors are recycled
+resources, under control of the offload engine driver, to be reused as
+operations complete.  When an application needs to submit a chain of
+operations it must guarantee that the descriptor is not automatically recycled
+before the dependency is submitted.  This requires that all descriptors be
+acknowledged by the application before the offload engine driver is allowed to
+recycle (or free) the descriptor.  A descriptor can be acked by:
+1/ setting the ASYNC_TX_ACK flag if no operations are to be submitted
+2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
+   descriptor of a new operation.
+3/ calling async_tx_ack() on the descriptor.
+
+3.3 When does the operation execute?:
+Operations do not immediately issue after return from the
+async_<operation> call.  Offload engine drivers batch operations to
+improve performance by reducing the number of mmio cycles needed to
+manage the channel.  Once a driver specific threshold is met the driver
+automatically issues pending operations.  An application can force this
+event by calling async_tx_issue_pending_all().  This operates on all
+channels since the application has no knowledge of channel to operation
+mapping.
+
+3.4 When does the operation complete?:
+There are two methods for an application to learn about the completion
+of an operation.
+1/ Call dma_wait_for_async_tx().  This call causes the cpu to spin while
+   it polls for the completion of the operation.  It handles dependency
+   chains and issuing pending operations.
+2/ Specify a completion callback.  The callback routine runs in tasklet
+   context if the offload engine driver supports interrupts, or it is
+   called in application context if the operation is carried out
+   synchronously 

[PATCH 2.6.23-rc7 0/3] async_tx and md-accel fixes for 2.6.23

2007-09-20 Thread Dan Williams
Fix a couple bugs and provide documentation for the async_tx api.

Neil, please 'ack' patch #3.

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

Dan Williams (3):
  async_tx: usage documentation and developer notes
  async_tx: fix dma_wait_for_async_tx
  raid5: fix ops_complete_biofill

Documentation/crypto/async-tx-api.txt |  217 +
crypto/async_tx/async_tx.c|   12 ++-
drivers/md/raid5.c|   90 +++---
3 files changed, 273 insertions(+), 46 deletions(-)

--
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md raid acceleration and the async_tx api

2007-09-13 Thread Dan Williams
On 9/13/07, Yuri Tikhonov <[EMAIL PROTECTED]> wrote:
>
>  Hi Dan,
>
> On Friday 07 September 2007 20:02, you wrote:
> > You need to fetch from the 'md-for-linus' tree.  But I have attached
> > them as well.
> >
> > git fetch git://lost.foo-projects.org/~dwillia2/git/iop
> > md-for-linus:md-for-linus
>
>  Thanks.
>
> Unrelated question. Comparing the drivers/md/raid5.c file in Linus's
> 2.6.23-rc6 tree and in your md-for-linus one, I found the following
> difference in the expand-related part of the handle_stripe5() function:
>
> -   s.locked += handle_write_operations5(sh, 1, 1);
> +   s.locked += handle_write_operations5(sh, 0, 1);
>
>  That is, in your case we are passing rcw=0, whereas in Linus's case
> handle_write_operations5() is called with rcw=1. Which code is correct?
>
There was a recent bug discovered in my changes to the expansion code.
The fix has now gone into Linus's tree through Andrew's tree.  I kept
the fix out of my 'md-for-linus' tree to prevent it getting dropped
from -mm due to automatic git-tree merge-detection.  I have now
rebased my git tree so everything is in sync.

However, after talking with Neil at LCE we came to the conclusion that
it would be best if I just sent patches since git tree updates tend to
not get enough review, and because the patch sets will be more
manageable now that the big pieces of the acceleration infrastructure
have been merged.

>  Regards, Yuri

Thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-08-30 Thread Dan Williams
On 8/30/07, saeed bishara <[EMAIL PROTECTED]> wrote:
> you are right, I've another question regarding the function
> dma_wait_for_async_tx from async_tx.c, here is the body of the code:
>/* poll through the dependency chain, return when tx is complete */
> 1.do {
> 2. iter = tx;
> 3. while (iter->cookie == -EBUSY)
> 4. iter = iter->parent;
> 5.
> 6. status = dma_sync_wait(iter->chan, iter->cookie);
> 7. } while (status == DMA_IN_PROGRESS || (iter != tx));
>
> assume that:
> - The interrupt capability is not provided.
> - Request A was sent to chan 0
> - Request B that depends on A is sent to chan 1
> - Request C that depends on B is send to chan 2.
> - Also, assume that when C is handled by async_tx_submit(), B is still
> not queued to the dmaengine (cookie equals to -EBUSY).
>
> In this case, dma_wait_for_async_tx will be called for C, now, it
> looks for me that the do while will loop forever, even when A gets
> completed. this is because the iter will point to B after line 4, thus
> the iter != tx (C) will always become true.
>
You are right.  There are no drivers in the tree that can hit this,
but it needs to be fixed up.

I'll submit the following change:

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 0350071..bc18cbb 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -80,6 +80,7 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 {
enum dma_status status;
struct dma_async_tx_descriptor *iter;
+   struct dma_async_tx_descriptor *parent;

if (!tx)
return DMA_SUCCESS;
@@ -87,8 +88,15 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
/* poll through the dependency chain, return when tx is complete */
do {
iter = tx;
-   while (iter->cookie == -EBUSY)
-   iter = iter->parent;
+
+   /* find the root of the unsubmitted dependency chain */
+   while (iter->cookie == -EBUSY) {
+   parent = iter->parent;
+   if (parent && parent->cookie == -EBUSY)
+   iter = iter->parent;
+   else
+   break;
+   }

> saeed

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md raid acceleration and the async_tx api

2007-08-30 Thread Dan Williams
On 8/30/07, Yuri Tikhonov <[EMAIL PROTECTED]> wrote:
>
>  Hi Dan,
>
> On Monday 27 August 2007 23:12, you wrote:
> > This still looks racy...  I think the complete fix is to make the
> > R5_Wantfill and dev_q->toread accesses atomic.  Please test the
> > following patch (also attached) and let me know if it fixes what you are
> > seeing:
>
>  Your approach doesn't help, the Bonnie++ utility hangs up during the
> ReWriting stage.
>
Looking at it again I see that what I added would not affect the
failure you are seeing.  However I noticed that you are using a broken
version of the stripe-queue and cache_arbiter patches.  In the current
revisions the dev_q->flags field has been moved back to dev->flags
which fixes a data corruption issue and could potentially address the
hang you are seeing.  The latest revisions are:
raid5: add the stripe_queue object for tracking raid io requests (rev2)
raid5: use stripe_queues to prioritize the "most deserving" requests (rev6)

>  Note that before applying your patch I rolled my fix in the
> ops_complete_biofill() function back. Do I understand it right that your
> patch should be used *instead* of my one rather than *with* it ?
>
You understood correctly.  The attached patch integrates your change
to keep R5_Wantfill set while also protecting the 'more_to_read' case.
Please try it on top of the latest stripe-queue changes [1] (instead
of the other proposed patches).

>  Regards, Yuri

Thanks,
Dan

[1] git fetch -f git://lost.foo-projects.org/~dwillia2/git/iop
md-for-linus:refs/heads/md-for-linus
raid5: fix the 'more_to_read' case in ops_complete_biofill

From: Dan Williams <[EMAIL PROTECTED]>

Prevent ops_complete_biofill from running concurrently with add_queue_bio

---

 drivers/md/raid5.c |   33 +++--
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2f9022d..1c591d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -828,22 +828,19 @@ static void ops_complete_biofill(void *stripe_head_ref)
 		struct r5dev *dev = &sh->dev[i];
 		struct r5_queue_dev *dev_q = &sq->dev[i];
 
-		/* check if this stripe has new incoming reads */
+		/* 1/ acknowledge completion of a biofill operation
+		 * 2/ check if we need to reply to a read request.
+		 * 3/ check if we need to reschedule handle_stripe
+		 */
 		if (dev_q->toread)
 			more_to_read++;
 
-		/* acknowledge completion of a biofill operation */
-		/* and check if we need to reply to a read request
-		*/
-		if (test_bit(R5_Wantfill, &dev->flags) && !dev_q->toread) {
+		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi, *rbi2;
-			clear_bit(R5_Wantfill, &dev->flags);
 
-			/* The access to dev->read is outside of the
-			 * spin_lock_irq(&conf->device_lock), but is protected
-			 * by the STRIPE_OP_BIOFILL pending bit
-			 */
-			BUG_ON(!dev->read);
+			if (!dev_q->toread)
+				clear_bit(R5_Wantfill, &dev->flags);
+
 			rbi = dev->read;
 			dev->read = NULL;
 			while (rbi && rbi->bi_sector <
@@ -899,8 +896,15 @@ static void ops_run_biofill(struct stripe_head *sh)
 	}
 
 	atomic_inc(&sh->count);
+
+	/* spin_lock prevents ops_complete_biofill from running concurrently
+	 * with add_queue_bio in the synchronous case
+	 */
+	spin_lock(&sq->lock);
 	async_trigger_callback(ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, tx,
 		ops_complete_biofill, sh);
+	spin_unlock(&sq->lock);
+
 }
 
 static void ops_complete_compute5(void *stripe_head_ref)
@@ -2279,7 +2283,8 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		(unsigned long long)bi->bi_sector,
 		(unsigned long long)sq->sector);
 
-	spin_lock(&sq->lock);
+	/* prevent asynchronous completions from running */
+	spin_lock_bh(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	sh = sq->sh;
 	if (forwrite) {
@@ -2306,7 +2311,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 	*bip = bi;
 	bi->bi_phys_segments ++;
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sq->lock);
+	spin_unlock_bh(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
@@ -2339,7 +2344,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sq->lock);
+	spin_unlock_bh(&sq->lock);
 	return 0;
 }
 


Re: raid5:md3: kernel BUG , followed by , Silent halt .

2007-08-27 Thread Dan Williams
On 8/25/07, Mr. James W. Laferriere <[EMAIL PROTECTED]> wrote:
> Hello Dan ,
>
> On Mon, 20 Aug 2007, Dan Williams wrote:
> > On 8/18/07, Mr. James W. Laferriere <[EMAIL PROTECTED]> wrote:
> >> Hello All ,  Here we go again .  Again attempting to do bonnie++ 
> >> testing
> >> on a small array .
> >> Kernel 2.6.22.1
> >> Patches involved ,
> >> IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
> >> (stripe-queue) ,  Dan Williams <[EMAIL PROTECTED]>
> >
> > Hello James,
> >
> > Thanks for the report.
> >
> > I tried to reproduce this on my system, no luck.
> Possibly because there are significant hardware differences ?
> See 'lspci -v' below .sig .
>
> > However it looks
> > like there is a potential race between 'handle_queue' and
> > 'add_queue_bio'.  The attached patch moves these critical sections
> > under spin_lock(&sq->lock), and adds some debugging output if this BUG
> > triggers.  It also includes a fix for retry_aligned_read which is
> > unrelated to this debug.
> > --
> > Dan
> Applied your patch .  The same 'kernel BUG at 
> drivers/md/raid5.c:3689!'
> messages appear (see attached) .  The system is still responsive with your
> patch ,  the kernel crashed last time .  Tho the bonnie++ run is stuck in 'D' 
> .
> And doing a '> /md3/asdf'  stays hung even after passing the parent process a
> 'kill -9' .
> Any further info You can think of I can/should ,  I will try to 
> acquire
> .  But I'll have to repeat these steps to attempt to get the same results .
> I'll be shutting the system down after sending this off .
> Fyi ,  the previous 'BUG' without your patch was quite repeatable .
> I might have time over the next couple of weeks to be able to see if it
> is as repeatable as the last one .
>
> Contents of /proc/mdstat for md3 .
>
> md3 : active raid6 sdx1[3] sdw1[2] sdv1[1] sdu1[0] sdt1[7](S) sds1[6] sdr1[5] 
> sdq1[4]
>717378560 blocks level 6, 1024k chunk, algorithm 2 [7/7] [UUUUUUU]
>bitmap: 2/137 pages [8KB], 512KB chunk
>
> Commands I ran that lead to the 'BUG' .
>
> bonniemd3() { /root/bonnie++-1.03a/bonnie++  -u0:0  -d /md3  -s 131072  -f; }
> bonniemd3 > 131072MB-bonnie++-run-md3-xfs.log-20070825 2>&1 &
>
Ok, the 'bitmap' and 'raid6' details were the missing pieces of my
testing.  I can now reproduce this bug in handle_queue.  I'll keep you
posted on what I find.

Thank you for tracking this.

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays: reasoning and future plans

2007-08-26 Thread Dan Williams
On 8/26/07, Abe Skolnik <[EMAIL PROTECTED]> wrote:
> Dear Mr./Dr. Williams,
>
Just "Dan" is fine :-)

> > Because you can rely on the configuration file to be certain about
> > which disks to pull in and which to ignore.  Without the config file
> > the auto-detect routine may not always do the right thing because it
> > will need to make assumptions.
>
> But kernel parameters can provide the same data, no?  After all, it is
> not the "file nature" of the "config file" that we are after,
> but rather the configuration data itself.  My now-working setup uses a
> line in my "grub.conf" (AKA "menu.lst") file in my boot partition that
> says something like...
>   "root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1".
> This works just fine, and will not go bad unless a drive fails or I
> rearrange the SCSI bus.  Even in a case like that, the worst that will
> happen (AFAIK) is that my root-RAID will not come up and I will have to
> boot the PC from e.g. Knoppix in order to fix the problem
> (that, and maybe also replace a broken drive or re-plug a loose power
> connector or whatever).  Another MD should not be corrupted, since they
> are (supposedly) protected from that by supposedly-unique array UUIDs.
>
Yes, you can get a similar effect of the config file by adding
parameters to the kernel command line.  My only point is that if the
initramfs update tools were as simple as:
mkinitrd "root=/dev/md0 md=0,v1,/dev/sda1,/dev/sdb1,/dev/sdc1"
...then using an initramfs becomes the same amount of work as editing
/etc/grub.conf.

> > So I turn the question around, why go through the exercise of trying
> > to improve an auto-detect routine which can never be perfect when the
> > explicit configuration can be specified by a config file?
>
> I will turn your turning of my question back around at you; I hope it's
> not rude to do so.  Why make root-RAID (based on MD, not hardware RAID)
> require an initrd/initramfs, especially since (a) that's another system
> component to manage, (b) that's another thing that can go wrong,
> (c) many systems (the new one I just finished building included) do not
> need an initrd/initramfs for any other reason, so why "trigger" the
> need just out of laziness of maintaining some boot code?  Especially
> since a patch has already been written and tested as working. ;-)
>
Understood.  It comes down to a question of how much mdadm
functionality should be duplicated in the kernel?  With an initramfs
you get the full functionality and only one codebase to maintain
(mdadm).

[snip]

> Sincerely,
>
> Abe
>

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patch for boot-time assembly of v1.x-metadata-based soft (MD) arrays

2007-08-26 Thread Dan Williams
On 8/26/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
> On Sun, 26 Aug 2007, Abe Skolnik wrote:
>
> > Dear Mr./Dr./Prof. Brown et al,
> >
> > I recently had the unpleasant experience of creating an MD array for
> > the purpose of booting off it and then not being able to do so.  Since
> > I had already made changes to the array's contents relative to that
> > which I cloned it from, I did not want to reformat the array and
> > re-clone it just to bring it down to the old 0.90 metadata format so
> > that I would be able to boot off it, so I searched for a solution, and
> > I found it.
> >
> > First I tried the patch (written by Neil Brown) which can be seen at...
> >  
> >
> > That patch did not work as-is, but with some more hacking, I got it
> > working.  I then cleaned up my work and added relevant comments.
> >
> > I know that Mr./Dr./Prof. Brown is against in-kernel boot-time MD
> > assembly and prefers init[rd/ramfs], but I prefer in-kernel assembly,
> > and I think several other people do too.  Since this patch does not
> > (AFAIK) disable the init[rd/ramfs] way of bringing up MDs in boot-time,
> > I hope that this patch will be accepted and submitted up-stream for
> > future inclusion in the mainline kernel.org kernel distribution.
> > This way kernel users can choose their MD assembly strategy at will
> > without having to restrict themselves to the old metadata format.
> >
> > I hope that this message finds all those who read it doing well and
> > feeling fine.
> >
> > Sincerely,
> >
> > Abe Skolnik
> >
> > P.S.  Mr./Dr./Prof. Brown, in case you read this:  thanks!
> >  And if you want your name removed from the code, just say so.
> >
>
> > but I prefer in-kernel assembly,
> > and I think several other people do too.
> I concur with this statement, why go through the hassle of init[rd/ramfs]
> if we can just have it done in the kernel?
>

Because you can rely on the configuration file to be certain about
which disks to pull in and which to ignore.  Without the config file
the auto-detect routine may not always do the right thing because it
will need to make assumptions.

So I turn the question around, why go through the exercise of trying
to improve an auto-detect routine which can never be perfect when the
explicit configuration can be specified by a config file?

I believe the real issue is the need to improve the distributions'
initramfs build-scripts and relieve the hassle of handling MD details.
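
[For illustration, the explicit configuration in question is only a couple
of lines in /etc/mdadm.conf; the UUID below is a made-up placeholder:]

DEVICE partitions
ARRAY /dev/md0 level=raid5 num-devices=3
      UUID=89817dd1:fe0d0858:b8a1a64f:3b967b4b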

> Justin.

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5:md3: kernel BUG , followed by , Silent halt .

2007-08-20 Thread Dan Williams
On 8/18/07, Mr. James W. Laferriere <[EMAIL PROTECTED]> wrote:
> Hello All ,  Here we go again .  Again attempting to do bonnie++ 
> testing
> on a small array .
> Kernel 2.6.22.1
> Patches involved ,
> IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
> (stripe-queue) ,  Dan Williams <[EMAIL PROTECTED]>

Hello James,

Thanks for the report.

I tried to reproduce this on my system, no luck.  However it looks
like there is a potential race between 'handle_queue' and
'add_queue_bio'.  The attached patch moves these critical sections
under spin_lock(&sq->lock), and adds some debugging output if this BUG
triggers.  It also includes a fix for retry_aligned_read which is
unrelated to this debug.

--
Dan
---
raid5-fix-sq-locking.patch
-------
raid5: address potential sq->to_write race

From: Dan Williams <[EMAIL PROTECTED]>

synchronize reads and writes to the sq->to_write bit

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   12 
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 02e313b..688b8d3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2289,10 +2289,14 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 	sh = sq->sh;
 	if (forwrite) {
 		bip = &sq->dev[dd_idx].towrite;
+		set_bit(dd_idx, sq->to_write);
 		if (*bip == NULL && (!sh || (sh && !sh->dev[dd_idx].written)))
 			firstwrite = 1;
-	} else
+	} else {
 		bip = &sq->dev[dd_idx].toread;
+		set_bit(dd_idx, sq->to_read);
+	}
+
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2324,7 +2328,6 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		/* check if page is covered */
 		sector_t sector = sq->dev[dd_idx].sector;
 
-		set_bit(dd_idx, sq->to_write);
 		for (bi = sq->dev[dd_idx].towrite;
 		 sector < sq->dev[dd_idx].sector + STRIPE_SECTORS &&
 			 bi && bi->bi_sector <= sector;
@@ -2334,8 +2337,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 		}
 		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(dd_idx, sq->overwrite);
-	} else
-		set_bit(dd_idx, sq->to_read);
+	}
 
 	return 1;
 
@@ -3656,6 +3658,7 @@ static void handle_queue(struct stripe_queue *sq, int disks, int data_disks)
 	struct stripe_head *sh = NULL;
 
 	/* continue to process i/o while the stripe is cached */
+	spin_lock(&sq->lock);
 	if (test_bit(STRIPE_QUEUE_HANDLE, &sq->state)) {
 		if (io_weight(sq->overwrite, disks) == data_disks) {
 			set_bit(STRIPE_QUEUE_IO_HI, &sq->state);
@@ -3678,6 +3681,7 @@ static void handle_queue(struct stripe_queue *sq, int disks, int data_disks)
 		 */
 		BUG_ON(!(sq->sh && sq->sh == sh));
 	}
+	spin_unlock(&sq->lock);
 
 	release_queue(sq);
 	if (sh) {
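
[A side note on io_weight(), which the test in handle_queue above keys off
of -- e.g. io_weight(sq->overwrite, disks) == data_disks detects a full
stripe of overwrites.  It is a bit-count over the per-disk request bitmaps;
a minimal equivalent, hypothetical rather than the patch's exact
implementation, would be:]

/* count how many disks have a request of the given type pending */
static int io_weight(unsigned long *bitmap, int disks)
{
	return bitmap_weight(bitmap, disks);	/* population count over 'disks' bits */
}
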
-----------
raid5-debug-init_queue-bugs.patch
---
raid5: printk instead of BUG in init_queue

From: Dan Williams <[EMAIL PROTECTED]>

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   19 +--
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 688b8d3..7164011 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -557,12 +557,19 @@ static void init_queue(struct stripe_queue *sq, sector_t sector,
 		__FUNCTION__, (unsigned long long) sq->sector,
 		(unsigned long long) sector, sq);
 
-	BUG_ON(atomic_read(&sq->count) != 0);
-	BUG_ON(io_weight(sq->to_read, disks));
-	BUG_ON(io_weight(sq->to_write, disks));
-	BUG_ON(io_weight(sq->overwrite, disks));
-	BUG_ON(test_bit(STRIPE_QUEUE_HANDLE, &sq->state));
-	BUG_ON(sq->sh);
+	if ((atomic_read(&sq->count) != 0) || io_weight(sq->to_read, disks) ||
+	io_weight(sq->to_write, disks) || io_weight(sq->overwrite, disks) ||
+	test_bit(STRIPE_QUEUE_HANDLE, &sq->state) || sq->sh) {
+		printk(KERN_ERR "%s: sector=%llx count: %d to_read: %lu "
+		       "to_write: %lu overwrite: %lu state: %lx "
+		       "sq->sh: %p\n", __FUNCTION__,
+		       (unsigned long long) sq->sector,
+		       atomic_read(&sq->count),
+		       io_weight(sq->to_read, disks),
+		       io_weight(sq->to_write, disks),
+		       io_weight(sq->overwrite, disks),
+		       sq->state, sq->sh);
+	}
 
 	sq->state =

Re: [RFT] 2.6.22.1-iop1 for improved sequential write performance (stripe-queue)

2007-08-06 Thread Dan Williams
On 8/4/07, Mr. James W. Laferriere <[EMAIL PROTECTED]> wrote:
> Hello Dan ,
>
> On Thu, 19 Jul 2007, Dan Williams wrote:
> > Per Bill Davidsen's request I have made available a 2.6.22.1 based
> > kernel with the current raid5 performance changes I have been working
> > on:
> > 1/ Offload engine acceleration (recently merged for the 2.6.23
> > development cycle)
> > 2/ Stripe-queue, an evolutionary change to the raid5 queuing model (take4)
> >
> > The offload engine work only helps platforms with offload engines and
> > should not affect performance otherwise.  The stripe-queue work is an
> > attempt to increase sequential write performance and should benefit
> > most platforms.
> >
> > The patch series is available on the Xscale(r) IOP SourceForge page:
> >   
> > http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz
> >
> > Use quilt to apply the series on top of a fresh 2.6.22.1 source tree:
> > $ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.22.1.tar.bz2
> > $ tar xjf linux-2.6.22.1.tar.bz2
> > $ cd linux-2.6.22.1
> > $ tar xzvf patches-2.6.22.1-iop1.tar.gz
> > $ cp patches/series.x86 patches/series
> > $ quilt push -a
> >
> > Configure and build the kernel as normal, there are no configuration
> > options for stripe-queue.
> > Any feedback, bug report, fix, or suggestion is welcome.
>
> Following these instructions with 2.6.22.1 gets me the below(*) .
> Undoubtedly I have missed some message that said to use a different
> patchset or something along those lines .
I believe you inadvertently used the series file with the xscale-specific
patches.  The patch that is causing your compile failures is
"iop13xx-pci-sysfs-mmap.patch".  Try 'quilt pop -a', then verify that
your series file looks like the following:

# This series applies on GIT commit 7dcca30a32aadb0520417521b0c44f42d09fe05c
iop13xx-add-iop-attr-for-tpmi.patch
iop-section-mismatch.patch
has-dma-compilation-fix.patch
# comment out the following patch for an x86 build
#iop13xx-workaround-uart-iir-errata.patch
arm-allow-dma-none.patch
ioatdma-push-pending-4.patch
ioatdma-sysfs-errors.patch
ioatdma-kill-register-access-wrappers.patch
ioatdma-killwriteq.patch
ioatdma-document-tcp_dma_copybreak.patch
ioatdma-only-offload-at-context-switch.patch
ioatdma-warning-fix.patch
ioat-kexec-fix.patch
ioat-unisys-pciid.patch
dmaengine-dma_async_tx_descriptor-refactor.patch
dmaengine-client-channel-management.patch
xor-move.patch
add-the-async_tx-api.patch
raid5-refactor.patch
raid5-pr_debug.patch
md-add-raid5_run_ops.patch
md-enable-raid5_run_ops.patch
md-async-write-operations.patch
md-async-compute-operations.patch
md-async-check-operations.patch
md-async-read-operations.patch
md-async-expand-operations.patch
md-move-io-to-raid5_run_ops.patch
md-remove-compute_block-and-compute_parity5.patch
iop-adma-device-driver.patch
iop13xx-adma-support.patch
iop3xx-adma-support.patch
arm-add-drivers-dma.patch
iop-watchdog.patch
i2c-m41st85w.patch
iop13xx-imu.patch
iop13xx-imu-scsi.patch
iop13xx-imu-directed-core-reset.patch
iop13xx-xsc3-oprofile.patch
# comment out the following patch for an x86 build
#iop13xx-pci-sysfs-mmap.patch
iop13xx-pmon-oprofile.patch
arm-udivdi3.patch
pbi_compact_flash.patch
# comment out the following patch for an x86 build
#dma-copy-clear-page.patch
# comment out the following patch for an x86 build
#try-dma-memcpy.patch
# comment out the following patch for an x86 build
#try-dma-copy-to-user.patch
# comment out the following patch for an x86 build
#try-dma-copy-from-user.patch
isc813xx.patch
iop-defconfig-update.patch
# comment out the following patch for an x86 build
#iop-cross-compile-default.patch
raid5-stripe-queue-intro.patch
raid5-stripe-queue-tree.patch
v2.6.22.1-iop1.patch

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Dan Williams
[trimmed all but linux-raid from the cc]

On 7/30/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> CONFIG:
>
> Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
> Kernel was 2.6.21 or 2.6.22, did these awhile ago.
Can you give 2.6.22.1-iop1 a try to see what effect it has on
sequential write performance?

Download:
http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Unpack into your 2.6.22.1 source tree.  Install the x86 series file
"cp patches/series.x86 patches/series".  Apply the series with quilt
"quilt push -a".

I recommend trying the default chunk size and default
stripe_cache_size as my tests have shown improvement without needing
to perform any tuning.

Thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)

2007-07-22 Thread Dan Williams
ttached to newly initialized
  buffers

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c         |  474 +++-
 include/linux/raid/raid5.h |   29 ++-
 2 files changed, 316 insertions(+), 187 deletions(-)
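
[For orientation while reading the hunks below, the new object looks
roughly like this.  It is a sketch reconstructed from the fields the patch
touches, not the verbatim definition from include/linux/raid/raid5.h, and
the field types are approximate:]

struct stripe_queue {
	sector_t sector;		/* stripe address */
	spinlock_t lock;		/* guards the per-disk request state */
	atomic_t count;
	unsigned long state;		/* STRIPE_QUEUE_* flags */
	int pd_idx;			/* parity disk index */
	int bm_seq;			/* bitmap batch number (was sh->bm_seq) */
	raid5_conf_t *raid_conf;
	struct stripe_head *sh;		/* attached stripe-cache entry, if any */
	unsigned long *to_read;		/* per-disk pending-request bitmaps, */
	unsigned long *to_write;	/* summarized with io_weight() */
	unsigned long *overwrite;
	struct r5_queue_dev {
		sector_t sector;
		struct bio *toread, *towrite;
	} dev[1];			/* allocated with one entry per disk */
};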

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d90ee14..f5ee4a7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
+ * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq
  * the number of the batch it will be in. This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
@@ -132,7 +132,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
list_add_tail(&sh->lru, &conf->delayed_list);
blk_plug_device(conf->mddev->queue);
} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-  sh->bm_seq - conf->seq_write > 0) {
+  sh->sq->bm_seq - conf->seq_write > 0) {
list_add_tail(&sh->lru, &conf->bitmap_list);
blk_plug_device(conf->mddev->queue);
} else {
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
unsigned long flags;
 
spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
int i;
 
BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
remove_hash(sh);
 
sh->sector = sector;
-   sh->pd_idx = pd_idx;
+   sh->sq->pd_idx = pd_idx;
sh->state = 0;
 
sh->disks = disks;
 
for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
+   struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-   if (dev->toread || dev->read || dev->towrite || dev->written ||
-   test_bit(R5_LOCKED, &dev->flags)) {
+   if (dev_q->toread || dev->read || dev_q->towrite ||
+   dev->written || test_bit(R5_LOCKED, &dev->flags)) {
printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-  (unsigned long long)sh->sector, i, dev->toread,
-  dev->read, dev->towrite, dev->written,
+  (unsigned long long)sh->sector, i, dev_q->toread,
+  dev->read, dev_q->towrite, dev->written,
   test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -288,6 +289,9 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+   sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
@@ -389,12 +393,13 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   struct stripe_queue *sq = sh->sq;
+   raid5_conf_t *conf = sq->raid_conf;
int i, disks = sh->disks;
 
might_sleep();
 
-   for (i = disks; i--; ) {
+   for (i = disks; i--;) {
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
@@ -513,7 +518,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
struct stripe_head *sh = stripe_head_ref;
struct bio *return_bi = NULL;
-   raid5_conf_t *conf = sh->raid_conf;
+   struct stripe_queue *sq = sh->sq;
+   raid5_conf_t *conf = sq->raid_conf;
int i, more_to_read = 0;
 
pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -522,14 +528,16 @@ static void

[GIT PATCH 0/2] stripe-queue for 2.6.23 consideration

2007-07-22 Thread Dan Williams
Andrew, Neil,

The stripe-queue patches are showing solid performance improvement.

git://lost.foo-projects.org/~dwillia2/git/iop md-for-linus

 drivers/md/raid5.c | 1484 
 include/linux/raid/raid5.h |   87 +++-
 2 files changed, 1164 insertions(+), 407 deletions(-)

Dan Williams (2):
  raid5: add the stripe_queue object for tracking raid io requests (take2)
  raid5: use stripe_queues to prioritize the "most deserving" requests (take4)

I initially considered them 2.6.24 material, but after fixing the sync+io
data corruption regression, fixing the performance regression with large
'stripe_cache_size' values, and seeing how well it performed on my IA
platform, I would like them to be considered for 2.6.23.  That being said,
I have not yet tested expand operations or raid6.

Without any tuning a 4-disk (SATA) RAID5 array can reach 190MB/s.  Previously
performance was around 90MB/s.  Blktrace data confirms that fewer reads are
occurring and more writes are being merged.

$ mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5 --assume-clean
$ blktrace /dev/sd[abcd] &
$ for i in `seq 1 3`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done
$ fg ^C
$ blkparse /dev/sda /dev/sdb /dev/sdc /dev/sdd

=pre-patch=
Total (sda):
 Reads Queued:   3,136,   12,544KiB  Writes Queued: 187,068,  748,272KiB
 Read Dispatches:  676,   12,384KiB  Write Dispatches:   30,949,  737,052KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  662,   12,080KiB  Writes Completed:   30,630,  736,964KiB
 Read Merges:2,452,9,808KiB  Write Merges:  155,885,  623,540KiB
 IO unplugs: 1   Timer unplugs:   1

Total (sdb):
 Reads Queued:   1,541,6,164KiB  Writes Queued:  91,224,  364,896KiB
 Read Dispatches:  323,6,184KiB  Write Dispatches:   14,603,  335,528KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  303,6,124KiB  Writes Completed:   13,650,  328,520KiB
 Read Merges:1,209,4,836KiB  Write Merges:   76,080,  304,320KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdc):
 Reads Queued:   1,372,5,488KiB  Writes Queued:  82,995,  331,980KiB
 Read Dispatches:  297,5,280KiB  Write Dispatches:   13,258,  304,020KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  268,4,948KiB  Writes Completed:   12,320,  298,668KiB
 Read Merges:1,067,4,268KiB  Write Merges:   69,154,  276,616KiB
 IO unplugs: 0   Timer unplugs:   0

Total (sdd):
 Reads Queued:   1,383,5,532KiB  Writes Queued:  80,186,  320,744KiB
 Read Dispatches:  307,5,008KiB  Write Dispatches:   13,241,  298,400KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:  276,4,888KiB  Writes Completed:   12,677,  294,324KiB
 Read Merges:1,050,4,200KiB  Write Merges:   66,772,  267,088KiB
 IO unplugs: 0   Timer unplugs:   0


=post-patch=
Total (sda):
 Reads Queued: 117,  468KiB  Writes Queued:  71,511,  286,044KiB
 Read Dispatches:   17,  308KiB  Write Dispatches:8,412,  699,204KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:6,   96KiB  Writes Completed:3,704,  321,552KiB
 Read Merges:   96,  384KiB  Write Merges:   67,880,  271,520KiB
 IO unplugs:14   Timer unplugs:  15

Total (sdb):
 Reads Queued:  88,  352KiB  Writes Queued:  56,687,  226,748KiB
 Read Dispatches:   11,  288KiB  Write Dispatches:8,142,  686,412KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:8,  184KiB  Writes Completed:2,770,  257,740KiB
 Read Merges:   76,  304KiB  Write Merges:   54,005,  216,020KiB
 IO unplugs:16   Timer unplugs:  17

Total (sdc):
 Reads Queued:  60,  240KiB  Writes Queued:  61,863,  247,452KiB
 Read Dispatches:7,  248KiB  Write Dispatches:8,302,  699,832KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:5,  144KiB  Writes Completed:2,907,  258,900KiB
 Read Merges:   50,  200KiB  Write Merges:   58,926,  235,704KiB
 IO unplugs:20   Timer unplugs:  23

Total (sdd):
 Reads Queued:  61,  244KiB  Writes Queued:  66,330,  265,320KiB
 Read Dispatches:   10,  180KiB  Write Dispatches:9,326,  694,012KiB
 Reads Requeued: 0   Writes Requeued: 0
 Reads Completed:4,  112KiB  Writes Completed:3,562,  285,912KiB
 Read Merges:  

[RFT] 2.6.22.1-iop1 for improved sequential write performance (stripe-queue)

2007-07-19 Thread Dan Williams

Per Bill Davidsen's request I have made available a 2.6.22.1 based
kernel with the current raid5 performance changes I have been working
on:
1/ Offload engine acceleration (recently merged for the 2.6.23
development cycle)
2/ Stripe-queue, an evolutionary change to the raid5 queuing model (take4)

The offload engine work only helps platforms with offload engines and
should not affect performance otherwise.  The stripe-queue work is an
attempt to increase sequential write performance and should benefit
most platforms.

The patch series is available on the Xscale(r) IOP SourceForge page:

http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Use quilt to apply the series on top of a fresh 2.6.22.1 source tree:
$ wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.22.1.tar.bz2
$ tar xjf linux-2.6.22.1.tar.bz2
$ cd linux-2.6.22.1
$ tar xzvf patches-2.6.22.1-iop1.tar.gz
$ cp patches/series.x86 patches/series
$ quilt push -a

Configure and build the kernel as normal, there are no configuration
options for stripe-queue.

Any feedback, bug report, fix, or suggestion is welcome.

Thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-mm PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests (take2)

2007-07-13 Thread Dan Williams
ffer state flags are attached to newly initialized buffers

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c         |  475 
 include/linux/raid/raid5.h |   29 ++-
 2 files changed, 317 insertions(+), 187 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0b66afe..b653c2b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
+ * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq
  * the number of the batch it will be in. This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
@@ -132,7 +132,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
list_add_tail(&sh->lru, &conf->delayed_list);
blk_plug_device(conf->mddev->queue);
} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-  sh->bm_seq - conf->seq_write > 0) {
+  sh->sq->bm_seq - conf->seq_write > 0) {
list_add_tail(&sh->lru, &conf->bitmap_list);
blk_plug_device(conf->mddev->queue);
} else {
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
unsigned long flags;
 
spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   raid5_conf_t *conf = sh->sq->raid_conf;
int i;
 
BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
remove_hash(sh);
 
sh->sector = sector;
-   sh->pd_idx = pd_idx;
+   sh->sq->pd_idx = pd_idx;
sh->state = 0;
 
sh->disks = disks;
 
for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
+   struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-   if (dev->toread || dev->read || dev->towrite || dev->written ||
-   test_bit(R5_LOCKED, &dev->flags)) {
+   if (dev_q->toread || dev->read || dev_q->towrite ||
+   dev->written || test_bit(R5_LOCKED, &dev->flags)) {
printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-  (unsigned long long)sh->sector, i, dev->toread,
-  dev->read, dev->towrite, dev->written,
+  (unsigned long long)sh->sector, i, dev_q->toread,
+  dev->read, dev_q->towrite, dev->written,
   test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -288,6 +289,9 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+   sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
@@ -389,12 +393,13 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-   raid5_conf_t *conf = sh->raid_conf;
+   struct stripe_queue *sq = sh->sq;
+   raid5_conf_t *conf = sq->raid_conf;
int i, disks = sh->disks;
 
might_sleep();
 
-   for (i = disks; i--; ) {
+   for (i = disks; i--;) {
int rw;
struct bio *bi;
mdk_rdev_t *rdev;
@@ -513,7 +518,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
struct stripe_head *sh = stripe_head_ref;
struct bio *return_bi = NULL;
-   raid5_conf_t *conf = sh->raid_conf;
+   struct stripe_queue *sq = sh->sq;
+   raid5_conf_t *conf = sq->raid_conf;
int i, more_to_read = 0;
 
pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -522,14 +528,16

[-mm PATCH 0/2] 74% decrease in dispatched writes, stripe-queue take3

2007-07-13 Thread Dan Williams
Neil, Andrew,

The following patches replace the stripe-queue patches currently in -mm.
Following your suggestion, Neil, I gathered blktrace data on the number
of reads generated by sequential write stimulus.  It turns out that
reduced pre-reading is not the cause of the performance increase, but
rather increased write merging.  The data, in patch #1, shows a 74%
decrease in the number of dispatched writes.  I can only assume that
this is the explanation for the 65% throughput improvement, because the
occurrence of reads actually increased with these patches applied.

This take also fixes observed data corruption while running i/o to a
synching array (it was wrong to move the flags parameter from r5dev to
r5_queue_dev as things could get out of sync... reverted).  Next step is
to test reshape under this new queuing model.

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] ioat fixes, raid5 acceleration, and the async_tx api

2007-07-13 Thread Dan Williams

Linus, please pull from

git://lost.foo-projects.org/~dwillia2/git/iop ioat-md-accel-for-linus

to receive:

1/ I/OAT performance tweaks and simple fixups.  These patches have been
   in -mm for a few kernel releases as git-ioat.patch
2/ RAID5 acceleration and the async_tx api.  These patches have also
   been in -mm for a few kernel releases as git-md-accel.patch.  In
   addition, they have received field testing as a part of the -iop kernel
   released via SourceForge[1] since 2.6.18-rc6.

The raid acceleration work can further be subdivided into three logical
areas:
- API -
The async_tx api provides methods for describing a chain of
asynchronous bulk memory transfers/transforms with support for
inter-transactional dependencies.  It is implemented as a dmaengine
client that smooths over the details of different hardware offload
engine implementations.  Code that is written to the api can optimize
for asynchronous operation and the api will fit the chain of operations
to the available offload resources. 

- Implementation -
When the raid acceleration work was proposed, Neil laid out the
following attack plan:
1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api
The raid5_run_ops routine uses the asynchronous offload api
(async_tx) and the stripe_operations member of a stripe_head to carry
out xor and copy operations asynchronously, outside the lock.

- Driver -
The Intel(R) Xscale IOP series of I/O processors integrate an Xscale
core with raid acceleration engines.  The iop-adma driver supports the
copy and xor capabilities of the 3 IOP architectures iop32x, iop33x, and
iop34x.

All the MD changes have been acked-by Neil Brown.  For the changes made
to net/ I have received David Miller's acked-by.  Shannon Nelson has
tested the I/OAT changes (due to async_tx support) in his environment
and has added his signed-off-by.  Herbert Xu has agreed to let the
async_tx api be housed under crypto/ with the intent to coordinate
efforts as support for transforms like crc32c and raid6-p+q are
developed.

To be clear Shannon Nelson is the I/OAT maintainer, but we agreed that I
should coordinate this release to simplify the merge process.  Going
forward I will be the iop-adma maintainer.  For the common bits,
dmaengine core and the async_tx api, Shannon and I will coordinate as
co-maintainers.

- Credits -
I cannot thank Neil Brown enough for his advice and patience as this
code was developed.

Jeff Garzik is credited with helping the dmaengine core and async_tx
become sane apis.  You are credited with the general premise that users
of an asynchronous offload engine api should not know or care if an
operation is carried out asynchronously or synchronously in software.
Andrew Morton is credited with corralling these conflicting git trees in
-mm and more importantly imparting encouragement at OLS 2006.

Per Andrew's request the md-accel changelogs were fleshed out and the
patch set was posted for a final review a few weeks ago[2].  To my
knowledge there are no pending review items.  This tree is based on
2.6.22.

Thank you,
Dan

[1] http://sourceforge.net/projects/xscaleiop
[2] http://marc.info/?l=linux-raid&w=2&r=1&s=md-accel&q=b

Andrew Morton (1):
  I/OAT: warning fix

Chris Leech (5):
  ioatdma: Push pending transactions to hardware more frequently
  ioatdma: Remove the wrappers around read(bwl)/write(bwl) in ioatdma
  ioatdma: Remove the use of writeq from the ioatdma driver
  I/OAT: Add documentation for the tcp_dma_copybreak sysctl
  I/OAT: Only offload copies for TCP when there will be a context switch

Dan Aloni (1):
  I/OAT: fix I/OAT for kexec

Dan Williams (20):
  dmaengine: refactor dmaengine around dma_async_tx_descriptor
  dmaengine: make clients responsible for managing channels
  xor: make 'xor_blocks' a library routine for use with async_tx
  async_tx: add the async_tx api
  raid5: refactor handle_stripe5 and handle_stripe6 (v3)
  raid5: replace custom debug PRINTKs with standard pr_debug
  md: raid5_run_ops - run stripe operations outside sh->lock
  md: common infrastructure for running operations with raid5_run_ops
  md: handle_stripe5 - add request/completion logic for async write ops
  md: handle_stripe5 - add request/completion logic for async compute ops
  md: handle_stripe5 - add request/completion logic for async check ops
  md: handle_stripe5 - add request/completion logic for async read ops
  md: handle_stripe5 - add request/completion logic for async expand ops
  md: handle_stripe5 - request io processing in raid5_run_ops
  md: remove raid5 compute_block and compute_parity5
  dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
  iop13xx: surface the iop13xx adma units to the iop-adma driver
  iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver

Re: [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2

2007-07-05 Thread Dan Williams

On 04 Jul 2007 13:41:26 +0200, Andi Kleen <[EMAIL PROTECTED]> wrote:

> Dan Williams <[EMAIL PROTECTED]> writes:
>
> > The write performance numbers are better than I expected and would seem
> > to address the concerns raised in the thread "Odd (slow) RAID
> > performance"[2].  The read performance drop was not expected.  However,
> > the numbers suggest some additional changes to be made to the queuing
> > model.
>
> Have you considered supporting copy-xor in MD for non accelerated
> RAID? I've been looking at fixing the old dubious slow crufty x86 SSE
> XOR functions.

Copy-xor is something that Neil suggested at the beginning of the
acceleration work.  It was put on the back-burner, but now that the
implementation has settled it can be revisited.

> One thing I discovered is that it seems fairly
> pointless to make them slower with cache avoidance when most of the data is
> copied before anyways. I think much more advantage could be gotten by
> supporting copy-xor because XORing during a copy should be nearly
> free.

Yes, it does not make sense to have cache-avoidance mismatched copy
and xor operations in MD.  However, I think the memcpy should be
changed to a cache-avoiding memcpy rather than caching the xor data.
Then a copy-xor implementation will have a greater effect, or do you
see it differently?

> On the other hand ext3 write() also uses a cache avoiding copy now
> and for the XOR it would need to load the data from memory again.
> Perhaps this could be also optimized somehow (e.g. setting a flag
> somewhere and using a normal copy for the RAID-5 case)

The incoming async_memcpy call has a flags parameter where this could go...

One possible way to implement support for copy-xor (and xor-copy-xor
for that matter) would be to write a soft-dmaengine driver.  When a
memcpy is submitted it can hold off processing it to see if an xor
operation is attached to the chain.  Once the xor descriptor is
attached the implementation will know the location of all the incoming
data, all the existing stripe data and the destination for the xor.
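
[To illustrate the point, a hypothetical inner loop -- not code from MD or
from these patches: once the source word has been loaded for the copy, the
xor costs little more than one extra read-modify-write.]

static void copy_xor(unsigned long *dest, unsigned long *xor_dest,
		     const unsigned long *src, size_t words)
{
	size_t i;

	for (i = 0; i < words; i++) {
		unsigned long v = src[i];	/* source is read once... */

		dest[i] = v;			/* ...for the copy */
		xor_dest[i] ^= v;		/* ...and for the xor */
	}
}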


> -Andi


Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2

2007-07-03 Thread Dan Williams
The first take of the stripe-queue implementation[1] had a
performance-limiting bug in __wait_for_inactive_queue.  Fixing that issue
drastically changed the performance characteristics.  The following data
from tiobench shows the relative performance difference of the
stripe-queue patchset.

Unit information

File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%  = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per cpu load

Configuration
=
Platform: 1200MHz iop348 with 4-disk sata_vsc array
mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
                File    Blk     Num     Avg     Maximum CPU
Identifier      Size    Size    Thr     Rate    (CPU%)  Eff
--------------- ------- ------- ------- ------- ------- -------
2.6.22-rc7-iop1 2048    4096    1       0%      4%      -3%
2.6.22-rc7-iop1 2048    4096    2       -38%    -33%    -8%
2.6.22-rc7-iop1 2048    4096    4       -35%    -30%    -8%
2.6.22-rc7-iop1 2048    4096    8       -14%    -11%    -3%
2.6.22-rc7-iop1 2048    131072  1       2%      1%      2%
2.6.22-rc7-iop1 2048    131072  2       -11%    -10%    -2%
2.6.22-rc7-iop1 2048    131072  4       -7%     -6%     -1%
2.6.22-rc7-iop1 2048    131072  8       -9%     -6%     -4%

Random Reads
                File    Blk     Num     Avg     Maximum CPU
Identifier      Size    Size    Thr     Rate    (CPU%)  Eff
--------------- ------- ------- ------- ------- ------- -------
2.6.22-rc7-iop1 2048    4096    1       -9%     15%     -21%
2.6.22-rc7-iop1 2048    4096    2       -1%     -30%    42%
2.6.22-rc7-iop1 2048    4096    4       -14%    -22%    10%
2.6.22-rc7-iop1 2048    4096    8       -21%    -28%    9%
2.6.22-rc7-iop1 2048    131072  1       -8%     -4%     -4%
2.6.22-rc7-iop1 2048    131072  2       -13%    -13%    0%
2.6.22-rc7-iop1 2048    131072  4       -15%    -15%    0%
2.6.22-rc7-iop1 2048    131072  8       -13%    -13%    0%

Sequential Writes
                File    Blk     Num     Avg     Maximum CPU
Identifier      Size    Size    Thr     Rate    (CPU%)  Eff
--------------- ------- ------- ------- ------- ------- -------
2.6.22-rc7-iop1 2048    4096    1       25%     11%     12%
2.6.22-rc7-iop1 2048    4096    2       41%     42%     -1%
2.6.22-rc7-iop1 2048    4096    4       40%     18%     19%
2.6.22-rc7-iop1 2048    4096    8       15%     -5%     21%
2.6.22-rc7-iop1 2048    131072  1       65%     57%     4%
2.6.22-rc7-iop1 2048    131072  2       46%     36%     8%
2.6.22-rc7-iop1 2048    131072  4       24%     -7%     34%
2.6.22-rc7-iop1 2048    131072  8       28%     -15%    51%

Random Writes
                File    Blk     Num     Avg     Maximum CPU
Identifier      Size    Size    Thr     Rate    (CPU%)  Eff
--------------- ------- ------- ------- ------- ------- -------
2.6.22-rc7-iop1 2048    4096    1       2%      -8%     11%
2.6.22-rc7-iop1 2048    4096    2       -1%     -19%    21%
2.6.22-rc7-iop1 2048    4096    4       2%      2%      0%
2.6.22-rc7-iop1 2048    4096    8       -1%     -28%    37%
2.6.22-rc7-iop1 2048    131072  1       2%      -3%     5%
2.6.22-rc7-iop1 2048    131072  2       3%      -4%     7%
2.6.22-rc7-iop1 2048    131072  4       4%      -3%     8%
2.6.22-rc7-iop1 2048    131072  8       5%      -9%     15%

The write performance numbers are better than I expected and would seem
to address the concerns raised in the thread "Odd (slow) RAID
performance"[2].  The read performance drop was not expected.  However,
the numbers suggest some additional changes to be made to the queuing
model.  Where read performance is dropping there appears to be an equal
drop in CPU utilization, which seems to suggest that pure read requests
be handled immediately without a trip to the the stripe-queue workqueue.

Although it is not shown in the above data, another positive aspect is that
increasing the cache size past a certain point causes the write performance
gains to erode.  In other words negative returns in contrast to diminishing
returns.  The stripe-queue can only carry out optimizations while the cache is
busy.  When the cache is large requests can be handled without waiting, and
performance approaches the original 1:1 (queue-to-stripe-head) model.  CPU
speed dictates the maximum effective cache size.  Once the CPU can no longer
keep the stripe-queue saturated performance falls off from the peak.  This is
a positive change because it shows that the new queuing model can produce higher
performance with less resources, but it does require more care when changing
'stripe_cache_size.'  The above numbers were taken with the default cache size
of 256.

Changes since take1:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename 

[RFC PATCH 0/2] An evolutionary change to the raid456 queuing model

2007-06-27 Thread Dan Williams
Raz's stripe-deadline patch illuminated the fact that the current
queuing model leaves write performance on the table in some cases.  The
following patches introduce a new queuing model which attempts to
recover this performance.

On an ARM-based iop13xx platform I see an average 14.7% increase in
throughput as reported by the following simple test:

for i in `seq 1 10`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=512; 
done

This was performed on a default-configured 4-disk SATA array (chunksize=64k,
stripe_cache_size=256)

However, a test with an ext2 filesystem and tiobench showed negligible
changes.  My suspicion is that the new queuing model, as it currently stands,
can extract some more performance when full stripe writes are present, but in
the tiobench case there is not enough to take advantage of the queue's preread
prevention logic (improving this case is the goal of releasing this version of
the patches).

These patches are on top of the md-accel series.  I will spin a complete
set based on 2.6.22 once it goes final (2.6.22-iop1 to be released on
SourceForge).  Until then the complete series is available via git:

git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

Not to be confused with the acceleration patch set released yesterday which is
targeted at 2.6.23.  These queuing changes need some more time to cook.

-- 
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with async_tx

2007-06-27 Thread Dan Williams

[ trimmed the cc ]

On 6/26/07, Satyam Sharma <[EMAIL PROTECTED]> wrote:

> Hi Dan,
>
> [ Minor thing ... ]

Not a problem, thanks for taking a look...

> On 6/27/07, Dan Williams <[EMAIL PROTECTED]> wrote:
> > The async_tx api tries to use a dma engine for an operation, but will fall
> > back to an optimized software routine otherwise.  Xor support is
> > implemented using the raid5 xor routines.  For organizational purposes this
> > routine is moved to a common area.
>
> This isn't quite crypto code and isn't connected to or through the cryptoapi
> (at least not in this patchset), so I somehow find it misplaced in the crypto/
> directory. If all its users are in drivers/md/ then that would be a
> better place.
> If it is something kernel-global, lib/ sounds more appropriate?



True, it isn't quite crypto code, but I gravitated to this location because:
1/ the models are similar, both are general purpose apis with a driver component
2/ there are already some algorithms in the crypto layer that are not
strictly cryptographic like crc32c, and other checksums
3/ having the code under that directory is a reminder to consider
closer integration when adding support for more complex algorithms
like raid6 p+q (at what point does a 'dma-offload' engine become a
'crypto' engine?  at some point they converge)

The hope is that other subsystems beyond md could benefit from offload
engines.  For example, the crc32c calculations in btrfs might be a
good candidate, and kcopyd integration has crossed my mind.


> Satyam


Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [md-accel PATCH 00/19] md raid acceleration and the async_tx api

2007-06-26 Thread Dan Williams

On 6/26/07, Mr. James W. Laferriere <[EMAIL PROTECTED]> wrote:

> Hello Dan ,
>
> On Tue, 26 Jun 2007, Dan Williams wrote:
> > Greetings,
> >
> > Per Andrew's suggestion this is the md raid5 acceleration patch set
> > updated with more thorough changelogs to lower the barrier to entry for
> > reviewers.  To get started with the code I would suggest the following
> > order:
> > [md-accel PATCH 01/19] dmaengine: refactor dmaengine around dma_async_tx_descriptor
> > [md-accel PATCH 04/19] async_tx: add the async_tx api
> > [md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside sh->lock
> > [md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
>   ...snip...
>
> Can you please tell me what Linus kernel version these will apply
> against ?  Or at least tell me against which version they were diff'd ?


This patch set is against 2.6.22-rc6.  The git tree is periodically
rebased to track Linus' latest.

--
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[md-accel PATCH 19/19] ARM: Add drivers/dma to arch/arm/Kconfig

2007-06-26 Thread Dan Williams
Cc: Russell King <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/Kconfig |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 50d9f3e..0cb2d4f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1034,6 +1034,8 @@ source "drivers/mmc/Kconfig"
 
 source "drivers/rtc/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
 
 source "fs/Kconfig"
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[md-accel PATCH 17/19] iop13xx: surface the iop13xx adma units to the iop-adma driver

2007-06-26 Thread Dan Williams
Adds the platform device definitions and the architecture specific
support routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* added 'descriptor pool size' to the platform data
* add base support for buffer sizes larger than 16MB (hw max)
* build error fix from Kirill A. Shutemov
* rebase for async_tx changes
* add interrupt support
* do not call platform register macros in driver code
* remove unnecessary ARM assembly statement
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Russell King <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/mach-iop13xx/setup.c  |  217 +
 include/asm-arm/arch-iop13xx/adma.h|  544 
 include/asm-arm/arch-iop13xx/iop13xx.h |   38 +-
 3 files changed, 774 insertions(+), 25 deletions(-)

diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c
index bc48715..bfe0c87 100644
--- a/arch/arm/mach-iop13xx/setup.c
+++ b/arch/arm/mach-iop13xx/setup.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IOP13XX_UART_XTAL 4000
 #define IOP13XX_SETUP_DEBUG 0
@@ -236,19 +237,143 @@ static unsigned long iq8134x_probe_flash_size(void)
 }
 #endif
 
+/* ADMA Channels */
+static struct resource iop13xx_adma_0_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(0),
+   .end = IOP13XX_ADMA_UPPER_PA(0),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA0_EOT,
+   .end = IRQ_IOP13XX_ADMA0_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA0_EOC,
+   .end = IRQ_IOP13XX_ADMA0_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA0_ERR,
+   .end = IRQ_IOP13XX_ADMA0_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_1_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(1),
+   .end = IOP13XX_ADMA_UPPER_PA(1),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA1_EOT,
+   .end = IRQ_IOP13XX_ADMA1_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA1_EOC,
+   .end = IRQ_IOP13XX_ADMA1_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA1_ERR,
+   .end = IRQ_IOP13XX_ADMA1_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_2_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(2),
+   .end = IOP13XX_ADMA_UPPER_PA(2),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA2_EOT,
+   .end = IRQ_IOP13XX_ADMA2_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA2_EOC,
+   .end = IRQ_IOP13XX_ADMA2_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA2_ERR,
+   .end = IRQ_IOP13XX_ADMA2_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static u64 iop13xx_adma_dmamask = DMA_64BIT_MASK;
+static struct iop_adma_platform_data iop13xx_adma_0_data = {
+   .hw_id = 0,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_1_data = {
+   .hw_id = 1,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_2_data = {
+   .hw_id = 2,
+   .pool_size = PAGE_SIZE,
+};
+
+/* The ids are fixed up later in iop13xx_platform_init */
+static struct platform_device iop13xx_adma_0_channel = {
+   .name = "iop-adma",
+   .id = 0,
+   .num_resources = 4,
+   .resource = iop13xx_adma_0_resources,
+   .dev = {
+   .dma_mask = &iop13xx_adma_dmamask,
+   .coherent_dma_mask = DMA_64BIT_MASK,
+   .platform_data = (void *) &iop13xx_adma_0_data,
+   },
+};
+
+static struct platform_device iop13xx_adma_1_channel = {
+   .name = "iop-adma",
+   .id = 0,
+   .num_resources = 4,
+   .resource = iop13xx_adma_1_resources,
+   .dev = {
+   .dma_mask = &iop13xx_adma_dmamask,
+   .coherent_dma_mask = DMA_64BIT_MASK,
+   .platform_data = (void *) &iop13xx_adma_1_data,
+   },
+};
+
+static struct platform_device iop13xx_adma_2_channel = {
+   .name = "iop-adma",
+   .id = 0,
+   .num_resources = 4,
+   .resource = iop13xx_adma_2_resources,
+   .dev = {
+   .dma_mask = &iop13xx_adma_dmamask,
+   .coherent_dma_mask = DMA_64BIT_MASK,
+   .platform_data = (void *) &iop13xx_adma_2_data,
+   },
+};

[md-accel PATCH 18/19] iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver

2007-06-26 Thread Dan Williams
Adds the platform device definitions and the architecture specific support
routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for > 1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous, only
  hardware descriptors are contiguous for up to a PAGE_SIZE buffer size
* convert to async_tx
* add interrupt support
* add platform devices for 80219 boards
* do not call platform register macros in driver code
* remove switch() statements for compatible register offsets/layouts
* change over to bitmap based capabilities
* remove unnecessary ARM assembly statement
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Russell King <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/mach-iop32x/glantank.c|2 
 arch/arm/mach-iop32x/iq31244.c |5 
 arch/arm/mach-iop32x/iq80321.c |3 
 arch/arm/mach-iop32x/n2100.c   |2 
 arch/arm/mach-iop33x/iq80331.c |3 
 arch/arm/mach-iop33x/iq80332.c |3 
 arch/arm/plat-iop/Makefile |2 
 arch/arm/plat-iop/adma.c   |  209 
 include/asm-arm/arch-iop32x/adma.h |5 
 include/asm-arm/arch-iop33x/adma.h |5 
 include/asm-arm/hardware/iop3xx-adma.h |  891 
 include/asm-arm/hardware/iop3xx.h  |   68 --
 12 files changed, 1138 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c
index 5776fd8..2b086ab 100644
--- a/arch/arm/mach-iop32x/glantank.c
+++ b/arch/arm/mach-iop32x/glantank.c
@@ -180,6 +180,8 @@ static void __init glantank_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&glantank_flash_device);
platform_device_register(&glantank_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = glantank_power_off;
 }
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index d4eefbe..98cfa1c 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -298,9 +298,14 @@ static void __init iq31244_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq31244_flash_device);
platform_device_register(&iq31244_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
if (is_ep80219())
pm_power_off = ep80219_power_off;
+
+   if (!is_80219())
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 static int __init force_ep80219_setup(char *str)
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index 8d9f491..18ad29f 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -181,6 +181,9 @@ static void __init iq80321_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq80321_flash_device);
platform_device_register(&iq80321_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80321, "Intel IQ80321")
diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c
index d55005d..390a97d 100644
--- a/arch/arm/mach-iop32x/n2100.c
+++ b/arch/arm/mach-iop32x/n2100.c
@@ -245,6 +245,8 @@ static void __init n2100_init_machine(void)
platform_device_register(&iop3xx_i2c0_device);
platform_device_register(&n2100_flash_device);
platform_device_register(&n2100_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = n2100_power_off;
 
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 2b06318..433188e 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -136,6 +136,9 @@ static void __init iq80331_init_machine(void)
platform_device_register(&iop33x_uart0_device);
platform_device_register(&iop33x_uart1_device);
platform_device_register(&iq80331_flash_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80331, "Intel IQ80331")
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c

[md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-06-26 Thread Dan Williams
The Intel(R) IOP series of i/o processors integrates an Xscale core with
raid acceleration engines.  The capabilities per platform are:

iop219:
 (2) copy engines
iop321:
 (2) copy engines
 (1) xor and block fill engine
iop33x:
 (2) copy and crc32c engines
 (1) xor, xor zero sum, pq, pq zero sum, and block fill engine
iop13xx:
 (2) copy, crc32c, xor, xor zero sum, and block fill engines
 (1) copy, crc32c, xor, xor zero sum, pq, pq zero sum, and block fill engine

The driver supports the features of the async_tx api:
* asynchronous notification of operation completion
* implicit (interrupt triggered) handling of inter-channel transaction
  dependencies

The driver adapts to the platform it is running on by two methods.
1/ #include  which defines the hardware specific
   iop_chan_* and iop_desc_* routines as a series of static inline
   functions
2/ The private platform data attached to the platform_device defines the
   capabilities of the channels
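
To make method 1/ concrete, here is a minimal sketch of the pattern (the
function body and the register offset are illustrative, not the driver's
actual layout):

        /* per-platform header, chosen by the arch include path */
        static inline u32 iop_chan_get_status(struct iop_adma_chan *chan)
        {
                return __raw_readl(chan->mmr_base + 0x4);
        }

        /* generic driver code: the call inlines to a platform-specific
         * register read, with no indirection at runtime */
        static void iop_adma_poll(struct iop_adma_chan *chan)
        {
                if (iop_chan_get_status(chan))
                        /* ... handle completion ... */;
        }

Because the header is selected by the architecture include path, one driver
source compiles against the iop32x, iop33x, or iop13xx register layouts.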

20070626: Callbacks are run in a tasklet.  Given the recent discussion on
LKML about killing tasklets in favor of workqueues I did a quick conversion
of the driver.  Raid5 resync performance dropped from 50MB/s to 30MB/s, so
the tasklet implementation remains until a generic softirq interface is
available.

Changelog:
* fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
slots to be requested eventually leading to data corruption
* enabled the slot allocation routine to attempt to free slots before
returning -ENOMEM
* switched the cleanup routine to solely use the software chain and the
status register to determine if a descriptor is complete.  This is
necessary to support other IOP engines that do not have status writeback
capability
* make the driver iop generic
* modified the allocation routines to understand allocating a group of
slots for a single operation
* added a null xor initialization operation for the xor only channel on
iop3xx
* support xor operations on buffers larger than the hardware maximum
* split the do_* routines into separate prep, src/dest set, submit stages
* added async_tx support (dependent operations initiation at cleanup time)
* simplified group handling
* added interrupt support (callbacks via tasklets)
* brought the pending depth inline with ioat (i.e. 4 descriptors)
* drop dma mapping methods, suggested by Chris Leech
* don't use inline in C files, Adrian Bunk
* remove static tasklet declarations
* make iop_adma_alloc_slots easier to read and remove chances for a
corrupted descriptor chain
* fix locking bug in iop_adma_alloc_chan_resources, Benjamin Herrenschmidt
* convert capabilities over to dma_cap_mask_t
* fixup sparse warnings
* add descriptor flush before iop_chan_enable
* checkpatch.pl fixes
* gpl v2 only correction
* move set_src, set_dest, submit to async_tx methods

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/Kconfig |8 
 drivers/dma/Makefile|1 
 drivers/dma/iop-adma.c  | 1465 +++
 include/asm-arm/hardware/iop_adma.h |  120 +++
 4 files changed, 1594 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 492aa08..f27f5c7 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -31,4 +31,12 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
+
+config INTEL_IOP_ADMA
+	tristate "Intel IOP ADMA support"
+	depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX)
+	default m
+	---help---
+	  Enable support for the Intel(R) IOP Series RAID engines.
+
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index bdcfdbd..b3839b6 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
new file mode 100644
index 000..3db12d6
--- /dev/null
+++ b/drivers/dma/iop-adma.c
@@ -0,0 +1,1465 @@
+/*
+ * offload engine driver for the Intel Xscale series of i/o processors
+ * Copyright © 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+

[md-accel PATCH 14/19] md: handle_stripe5 - request io processing in raid5_run_ops

2007-06-26 Thread Dan Williams
I/O submission requests were already handled outside of the stripe lock in
handle_stripe.  Now that handle_stripe is only tasked with finding work,
this logic belongs in raid5_run_ops.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   71 ++--
 1 files changed, 13 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e0ae26d..a09bc5f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2319,6 +2319,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
"%d for r-m-w\n", i);
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(
+   STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
s->locked++;
} else {
set_bit(STRIPE_DELAYED, &sh->state);
@@ -2342,6 +2345,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
"%d for Reconstruct\n", i);
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(
+   STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
s->locked++;
} else {
set_bit(STRIPE_DELAYED, &sh->state);
@@ -2538,6 +2544,9 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
+
clear_bit(STRIPE_DEGRADED, &sh->state);
s->locked++;
set_bit(STRIPE_INSYNC, &sh->state);
@@ -2923,12 +2932,16 @@ static void handle_stripe5(struct stripe_head *sh)
dev = &sh->dev[s.failed_num];
if (!test_bit(R5_ReWrite, &dev->flags)) {
set_bit(R5_Wantwrite, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
s.locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
set_bit(R5_LOCKED, &dev->flags);
s.locked++;
}
@@ -2989,64 +3002,6 @@ static void handle_stripe5(struct stripe_head *sh)
  test_bit(BIO_UPTODATE, &bi->bi_flags)
? 0 : -EIO);
}
-   for (i=disks; i-- ;) {
-   int rw;
-   struct bio *bi;
-   mdk_rdev_t *rdev;
-   if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
-   rw = WRITE;
-   else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
-   rw = READ;
-   else
-   continue;
- 
-   bi = &sh->dev[i].req;
- 
-   bi->bi_rw = rw;
-   if (rw == WRITE)
-   bi->bi_end_io = raid5_end_write_request;
-   else
-   bi->bi_end_io = raid5_end_read_request;
- 
-   rcu_read_lock();
-   rdev = rcu_dereference(conf->disks[i].rdev);
-   if (rdev && test_bit(Faulty, &rdev->flags))
-   rdev = NULL;
-   if (rdev)
-   atomic_inc(&rdev->nr_pending);
-   rcu_read_unlock();
- 
-   if (rdev) {
-   if (s.syncing || s.expanding || s.expanded)
-   md_sync_acct(rdev->bdev, STRIPE_SECTORS);
-
-   bi->bi_bdev = rdev->bdev;
-   pr_debug("for 

[md-accel PATCH 15/19] md: remove raid5 compute_block and compute_parity5

2007-06-26 Thread Dan Williams
replaced by raid5_run_ops

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  124 
 1 files changed, 0 insertions(+), 124 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a09bc5f..0579d1f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1509,130 +1509,6 @@ static void copy_data(int frombio, struct bio *bio,
   }  \
} while(0)
 
-
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-   int i, count, disks = sh->disks;
-   void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-
-   pr_debug("compute_block, stripe %llu, idx %d\n",
-   (unsigned long long)sh->sector, dd_idx);
-
-   dest = page_address(sh->dev[dd_idx].page);
-   memset(dest, 0, STRIPE_SIZE);
-   count = 0;
-   for (i = disks ; i--; ) {
-   if (i == dd_idx)
-   continue;
-   p = page_address(sh->dev[i].page);
-   if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-   ptr[count++] = p;
-   else
-   printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-   " not present\n", dd_idx,
-   (unsigned long long)sh->sector, i);
-
-   check_xor();
-   }
-   if (count)
-   xor_blocks(count, STRIPE_SIZE, dest, ptr);
-   set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-   raid5_conf_t *conf = sh->raid_conf;
-   int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-   void *ptr[MAX_XOR_BLOCKS], *dest;
-   struct bio *chosen;
-
-   pr_debug("compute_parity5, stripe %llu, method %d\n",
-   (unsigned long long)sh->sector, method);
-
-   count = 0;
-   dest = page_address(sh->dev[pd_idx].page);
-   switch(method) {
-   case READ_MODIFY_WRITE:
-   BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-   for (i=disks ; i-- ;) {
-   if (i==pd_idx)
-   continue;
-   if (sh->dev[i].towrite &&
-   test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-   wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   check_xor();
-   }
-   }
-   break;
-   case RECONSTRUCT_WRITE:
-   memset(dest, 0, STRIPE_SIZE);
-   for (i= disks; i-- ;)
-   if (i!=pd_idx && sh->dev[i].towrite) {
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-   wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   }
-   break;
-   case CHECK_PARITY:
-   break;
-   }
-   if (count) {
-   xor_blocks(count, STRIPE_SIZE, dest, ptr);
-   count = 0;
-   }
-   
-   for (i = disks; i--;)
-   if (sh->dev[i].written) {
-   sector_t sector = sh->dev[i].sector;
-   struct bio *wbi = sh->dev[i].written;
-   while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-   copy_data(1, wbi, sh->dev[i].page, sector);
-   wbi = r5_next_bio(wbi, sector);
-   }
-
-   set_bit(R5_LOCKED, &sh->dev[i].flags);
-   set_bit(R5_UPTODATE, &sh->dev[i].flags);
-   }
-
-   switch(method) {
-   case RECONSTRUCT_WRITE:
-   case CHECK_PARITY:
-   for (i=disks; i--;)
-   if (i != pd_idx) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   check_xor();
- 

[md-accel PATCH 12/19] md: handle_stripe5 - add request/completion logic for async read ops

2007-06-26 Thread Dan Williams
When a read bio is attached to the stripe and the corresponding block is
marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy
the data from the stripe cache to the bio buffer.  handle_stripe flags the
blocks to be operated on with the R5_Wantfill flag.  If new read requests
arrive while raid5_run_ops is running they will not be handled until
handle_stripe is scheduled to run again.
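
As a hedged sketch of what a biofill amounts to (variable names below are
illustrative and the offset accounting is simplified; completion is handled
by ops_complete_biofill per patch 07's changelog), the operation walks the
segments of a queued read bio and issues one dependent copy per segment out
of the cached stripe page:

        struct dma_async_tx_descriptor *tx = NULL;
        struct bio_vec *bvl;
        int i, page_offset = 0;

        /* chaining on 'tx' keeps the copies ordered on one channel */
        bio_for_each_segment(bvl, rbi, i) {
                tx = async_memcpy(bvl->bv_page, dev->page, bvl->bv_offset,
                                  page_offset, bvl->bv_len,
                                  ASYNC_TX_DEP_ACK, tx, NULL, NULL);
                page_offset += bvl->bv_len;
        }

where 'rbi' is a bio taken from dev->toread.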

Changelog:
* cleanup to_read and to_fill accounting
* do not fail reads that have reached the cache

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   53 +---
 include/linux/raid/raid5.h |2 +-
 2 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 89d3890..3d0dca9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2042,9 +2042,12 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
bi = bi2;
}
 
-   /* fail any reads if this device is non-operational */
-   if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
-   test_bit(R5_ReadError, &sh->dev[i].flags)) {
+   /* fail any reads if this device is non-operational and
+* the data has not reached the cache yet.
+*/
+   if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
+   (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+ test_bit(R5_ReadError, &sh->dev[i].flags))) {
bi = sh->dev[i].toread;
sh->dev[i].toread = NULL;
if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
@@ -2733,37 +2736,27 @@ static void handle_stripe5(struct stripe_head *sh)
struct r5dev *dev = &sh->dev[i];
clear_bit(R5_Insync, &dev->flags);
 
-   pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
-   i, dev->flags, dev->toread, dev->towrite, dev->written);
-   /* maybe we can reply to a read */
-   if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-   struct bio *rbi, *rbi2;
-   pr_debug("Return read for disc %d\n", i);
-   spin_lock_irq(&conf->device_lock);
-   rbi = dev->toread;
-   dev->toread = NULL;
-   if (test_and_clear_bit(R5_Overlap, &dev->flags))
-   wake_up(&conf->wait_for_overlap);
-   spin_unlock_irq(&conf->device_lock);
-   while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-   copy_data(0, rbi, dev->page, dev->sector);
-   rbi2 = r5_next_bio(rbi, dev->sector);
-   spin_lock_irq(&conf->device_lock);
-   if (--rbi->bi_phys_segments == 0) {
-   rbi->bi_next = return_bi;
-   return_bi = rbi;
-   }
-   spin_unlock_irq(&conf->device_lock);
-   rbi = rbi2;
-   }
-   }
+   pr_debug("check %d: state 0x%lx toread %p read %p write %p "
+   "written %p\n", i, dev->flags, dev->toread, dev->read,
+   dev->towrite, dev->written);
+
+   /* maybe we can request a biofill operation
+*
+* new wantfill requests are only permitted while
+* STRIPE_OP_BIOFILL is clear
+*/
+   if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+   !test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+   set_bit(R5_Wantfill, &dev->flags);
 
/* now count some things */
if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
if (test_bit(R5_Wantcompute, &dev->flags)) s.compute++;
 
-   if (dev->toread)
+   if (test_bit(R5_Wantfill, &dev->flags))
+   s.to_fill++;
+   else if (dev->toread)
s.to_read++;
if (dev->towrite) {
s.to_write++;
@@ -2786,6 +2779,10 @@ static void handle_stripe5(struct stripe_head *sh)
   set_bit(R5_Insync, &dev->flags);

[md-accel PATCH 13/19] md: handle_stripe5 - add request/completion logic for async expand ops

2007-06-26 Thread Dan Williams
When a stripe is being expanded bulk copying takes place to move the data
from the old stripe to the new.  Since raid5_run_ops only operates on one
stripe at a time these bulk copies are handled in-line under the stripe
lock.  In the dma offload case we poll for the completion of the operation.

After the data has been copied into the new stripe the parity needs to be
recalculated across the new disks.  We reuse the existing postxor
functionality to carry out this calculation.  By setting STRIPE_OP_POSTXOR
without setting STRIPE_OP_BIODRAIN the completion path in handle stripe
can differentiate expand operations from normal write operations.
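
The hunk below uses the async_tx dependency-chaining idiom for the bulk
copies; as a standalone sketch (dest_pages, src_pages, and npages are
hypothetical stand-ins for the stripe devices):

        struct dma_async_tx_descriptor *tx = NULL;
        int i;

        /* each copy depends on the previous one, keeping them ordered
         * on a single channel */
        for (i = 0; i < npages; i++)
                tx = async_memcpy(dest_pages[i], src_pages[i], 0, 0,
                                  PAGE_SIZE, ASYNC_TX_DEP_ACK, tx,
                                  NULL, NULL);

        /* no callback was registered, so ack the final descriptor and
         * poll until the whole chain retires */
        async_tx_ack(tx);
        dma_wait_for_async_tx(tx);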

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   50 ++
 1 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3d0dca9..e0ae26d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2646,6 +2646,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
/* We have read all the blocks in this stripe and now we need to
 * copy some of them into a target stripe for expand.
 */
+   struct dma_async_tx_descriptor *tx = NULL;
clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
for (i = 0; i < sh->disks; i++)
if (i != sh->pd_idx && (r6s && i != r6s->qd_idx)) {
@@ -2671,9 +2672,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
release_stripe(sh2);
continue;
}
-   memcpy(page_address(sh2->dev[dd_idx].page),
-  page_address(sh->dev[i].page),
-  STRIPE_SIZE);
+
+   /* place all the copies on one channel */
+   tx = async_memcpy(sh2->dev[dd_idx].page,
+   sh->dev[i].page, 0, 0, STRIPE_SIZE,
+   ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
for (j = 0; j < conf->raid_disks; j++)
@@ -2686,6 +2690,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
set_bit(STRIPE_HANDLE, &sh2->state);
}
release_stripe(sh2);
+
+   /* done submitting copies, wait for them to complete */
+   if (i + 1 >= sh->disks) {
+   async_tx_ack(tx);
+   dma_wait_for_async_tx(tx);
+   }
}
 }
 
@@ -2924,18 +2934,34 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
-   if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
-   /* Need to write out all blocks after computing parity */
-   sh->disks = conf->raid_disks;
-   sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
-   compute_parity5(sh, RECONSTRUCT_WRITE);
+   /* Finish postxor operations initiated by the expansion
+* process
+*/
+   if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
+   !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+
+   clear_bit(STRIPE_EXPANDING, &sh->state);
+
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
for (i = conf->raid_disks; i--; ) {
-   set_bit(R5_LOCKED, &sh->dev[i].flags);
-   s.locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
}
-   clear_bit(STRIPE_EXPANDING, &sh->state);
-   } else if (s.expanded) {
+   }
+
+   if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+   !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+   /* Need to write out all blocks after computing parity */
+   sh->disks = conf->raid_disks;
+   sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+   conf->raid_disks);
+   s.locked += handle_write_operations5(sh, 0, 1);
+   } else if (s.expanded &&
+   !test_bit(ST

[md-accel PATCH 10/19] md: handle_stripe5 - add request/completion logic for async compute ops

2007-06-26 Thread Dan Williams
handle_stripe will compute a block when a backing disk has failed, or when
it determines it can save a disk read by computing the block from all the
other up-to-date blocks.

Previously a block would be computed under the lock and subsequent logic in
handle_stripe could use the newly up-to-date block.  With the raid5_run_ops
implementation the compute operation is carried out a later time outside
the lock.  To preserve the old functionality we take advantage of the
dependency chain feature of async_tx to flag the block as R5_Wantcompute
and then let other parts of handle_stripe operate on the block as if it
were up-to-date.  raid5_run_ops guarantees that the block will be ready
before it is used in another operation.

However, this only works in cases where the compute and the dependent
operation are scheduled at the same time.  If a previous call to
handle_stripe sets the R5_Wantcompute flag there is no facility to pass the
async_tx dependency chain across successive calls to raid5_run_ops.  The
req_compute variable protects against this case.

Changelog:
* remove the req_compute BUG_ON

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  149 ++--
 include/linux/raid/raid5.h |2 -
 2 files changed, 115 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b2e88fe..38b8167 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2070,36 +2070,101 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 
 }
 
+/* __handle_issuing_new_read_requests5 - returns 0 if there are no more disks
+ * to process
+ */
+static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
+   struct stripe_head_state *s, int disk_idx, int disks)
+{
+   struct r5dev *dev = &sh->dev[disk_idx];
+   struct r5dev *failed_dev = &sh->dev[s->failed_num];
+
+   /* don't schedule compute operations or reads on the parity block while
+* a check is in flight
+*/
+   if ((disk_idx == sh->pd_idx) &&
+test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
+   return ~0;
+
+   /* is the data in this block needed, and can we get it? */
+   if (!test_bit(R5_LOCKED, &dev->flags) &&
+   !test_bit(R5_UPTODATE, &dev->flags) && (dev->toread ||
+   (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+s->syncing || s->expanding || (s->failed &&
+(failed_dev->toread || (failed_dev->towrite &&
+!test_bit(R5_OVERWRITE, &failed_dev->flags)
+) {
+   /* 1/ We would like to get this block, possibly by computing it,
+* but we might not be able to.
+*
+* 2/ Since parity check operations potentially make the parity
+* block !uptodate it will need to be refreshed before any
+* compute operations on data disks are scheduled.
+*
+* 3/ We hold off parity block re-reads until check operations
+* have quiesced.
+*/
+   if ((s->uptodate == disks - 1) &&
+   !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+   set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
+   set_bit(R5_Wantcompute, &dev->flags);
+   sh->ops.target = disk_idx;
+   s->req_compute = 1;
+   sh->ops.count++;
+   /* Careful: from this point on 'uptodate' is in the eye
+* of raid5_run_ops which services 'compute' operations
+* before writes. R5_Wantcompute flags a block that will
+* be R5_UPTODATE by the time it is needed for a
+* subsequent operation.
+*/
+   s->uptodate++;
+   return 0; /* uptodate + compute == disks */
+   } else if ((s->uptodate < disks - 1) &&
+   test_bit(R5_Insync, &dev->flags)) {
+   /* Note: we hold off compute operations while checks are
+* in flight, but we still prefer 'compute' over 'read'
+* hence we only read if (uptodate < * disks-1)
+*/
+   set_bit(R5_LOCKED, &dev->flags);
+   set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
+  

[md-accel PATCH 11/19] md: handle_stripe5 - add request/completion logic for async check ops

2007-06-26 Thread Dan Williams
Check operations are scheduled when the array is being resynced or an
explicit 'check/repair' command was sent to the array.  Previously check
operations would destroy the parity block in the cache such that even if
parity turned out to be correct the parity block would be marked
!R5_UPTODATE at the completion of the check.  When the operation can be
carried out by a dma engine the assumption is that it can check parity as a
read-only operation.  If raid5_run_ops notices that the check was handled
by hardware it will preserve the R5_UPTODATE status of the parity disk.

When a check operation determines that the parity needs to be repaired we
reuse the existing compute block infrastructure to carry out the operation.
Repair operations imply an immediate write back of the data, so to
differentiate a repair from a normal compute operation the
STRIPE_OP_MOD_REPAIR_PD flag is added.

Changelog:
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   84 
 1 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 38b8167..89d3890 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2464,26 +2464,67 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks)
 {
set_bit(STRIPE_HANDLE, &sh->state);
-   if (s->failed == 0) {
-   BUG_ON(s->uptodate != disks);
-   compute_parity5(sh, CHECK_PARITY);
-   s->uptodate--;
-   if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-   /* parity is correct (on disc, not in buffer any more)
-*/
-   set_bit(STRIPE_INSYNC, &sh->state);
-   } else {
-   conf->mddev->resync_mismatches += STRIPE_SECTORS;
-   if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-   /* don't try to repair!! */
+   /* Take one of the following actions:
+* 1/ start a check parity operation if (uptodate == disks)
+* 2/ finish a check parity operation and act on the result
+* 3/ skip to the writeback section if we previously
+*initiated a recovery operation
+*/
+   if (s->failed == 0 &&
+   !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+   if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+   BUG_ON(s->uptodate != disks);
+   clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+   sh->ops.count++;
+   s->uptodate--;
+   } else if (
+  test_and_clear_bit(STRIPE_OP_CHECK, &sh->ops.complete)) {
+   clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
+   clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
+
+   if (sh->ops.zero_sum_result == 0)
+   /* parity is correct (on disc,
+* not in buffer any more)
+*/
set_bit(STRIPE_INSYNC, &sh->state);
else {
-   compute_block(sh, sh->pd_idx);
-   s->uptodate++;
+   conf->mddev->resync_mismatches +=
+   STRIPE_SECTORS;
+   if (test_bit(
+MD_RECOVERY_CHECK, &conf->mddev->recovery))
+   /* don't try to repair!! */
+   set_bit(STRIPE_INSYNC, &sh->state);
+   else {
+   set_bit(STRIPE_OP_COMPUTE_BLK,
+   &sh->ops.pending);
+   set_bit(STRIPE_OP_MOD_REPAIR_PD,
+   &sh->ops.pending);
+   set_bit(R5_Wantcompute,
+   &sh->dev[sh->pd_idx].flags);
+   sh->ops.target = sh->pd_idx;
+   sh->ops.count++;
+   s->uptodate++;
+   }
}
}
}
-   if (!test_bit(STRIPE_INSYNC, &sh->state)) {
+
+   /* check if we can clear a parity disk reconstruct */
+   if

[md-accel PATCH 08/19] md: common infrastructure for running operations with raid5_run_ops

2007-06-26 Thread Dan Williams
All the handle_stripe operations that are to be transitioned to use
raid5_run_ops need a method to coherently gather work under the stripe-lock
and hand that work off to raid5_run_ops.  The 'get_stripe_work' routine
runs under the lock to read all the bits in sh->ops.pending that do not
have the corresponding bit set in sh->ops.ack.  This modified 'pending'
bitmap is then passed to raid5_run_ops for processing.

The transition from 'ack' to 'completion' does not need similar protection
as the existing release_stripe infrastructure will guarantee that
handle_stripe will run again after a completion bit is set, and
handle_stripe can tolerate a sh->ops.completed bit being set while the lock
is held.

A call to async_tx_issue_pending_all() is added to raid5d to kick the
offload engines once all pending stripe operations work has been submitted.
This enables batching of the submission and completion of operations.
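
A rough sketch of the batching described above (get_next_stripe() is a
hypothetical stand-in for raid5d's actual queue handling):

        /* raid5d main loop, simplified */
        while ((sh = get_next_stripe(conf)))
                handle_stripe(sh);      /* may queue async_tx descriptors */

        /* one doorbell per batch instead of one per operation */
        async_tx_issue_pending_all();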

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   67 +---
 1 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34fcda0..7c688f6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -124,6 +124,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
}
md_wakeup_thread(conf->mddev->thread);
} else {
+   BUG_ON(sh->ops.pending);
   if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
atomic_dec(&conf->preread_active_stripes);
   if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
@@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 
BUG_ON(atomic_read(&sh->count) != 0);
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
-   
+   BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+
CHECK_DEVLOCK();
pr_debug("init_stripe called, stripe %llu\n",
(unsigned long long)sh->sector);
@@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
 
-   if (dev->toread || dev->towrite || dev->written ||
+   if (dev->toread || dev->read || dev->towrite || dev->written ||
test_bit(R5_LOCKED, &dev->flags)) {
-   printk("sector=%llx i=%d %p %p %p %d\n",
+   printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
   (unsigned long long)sh->sector, i, dev->toread,
-  dev->towrite, dev->written,
+  dev->read, dev->towrite, dev->written,
   test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -325,6 +327,44 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
return sh;
 }
 
+/* test_and_ack_op() ensures that we only dequeue an operation once */
+#define test_and_ack_op(op, pend) \
+do {   \
+   if (test_bit(op, &sh->ops.pending) &&   \
+   !test_bit(op, &sh->ops.complete)) { \
+   if (test_and_set_bit(op, &sh->ops.ack)) \
+   clear_bit(op, &pend);   \
+   else\
+   ack++;  \
+   } else  \
+   clear_bit(op, &pend);   \
+} while (0)
+
+/* find new work to run, do not resubmit work that is already
+ * in flight
+ */
+static unsigned long get_stripe_work(struct stripe_head *sh)
+{
+   unsigned long pending;
+   int ack = 0;
+
+   pending = sh->ops.pending;
+
+   test_and_ack_op(STRIPE_OP_BIOFILL, pending);
+   test_and_ack_op(STRIPE_OP_COMPUTE_BLK, pending);
+   test_and_ack_op(STRIPE_OP_PREXOR, pending);
+   test_and_ack_op(STRIPE_OP_BIODRAIN, pending);
+   test_and_ack_op(STRIPE_OP_POSTXOR, pending);
+   test_and_ack_op(STRIPE_OP_CHECK, pending);
+   if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
+   ack++;
+
+   sh->ops.count -= ack;
+   BUG_ON(sh->ops.count < 0);
+
+   return pending;
+}
+
 static int
 raid5_end_read_request(struct bio *bi, unsigned int bytes_done, int error);
 static int
@@ -2487,7 +2527,6 @@ static void handle_st

[md-accel PATCH 09/19] md: handle_stripe5 - add request/completion logic for async write ops

2007-06-26 Thread Dan Williams
After handle_stripe5 decides whether it wants to perform a
read-modify-write, or a reconstruct write it calls
handle_write_operations5.  A read-modify-write operation will perform an
xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the
new data into the stripe (biodrain) and perform a postxor operation across
all up-to-date blocks to generate the new parity.  A reconstruct write is run
when all blocks are already up-to-date in the cache so all that is needed
is a biodrain and postxor.
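
In synchronous terms a read-modify-write computes P' = P ^ D_old ^ D_new,
relying on xor being its own inverse.  A sketch using the xor_blocks()
helper from patch 03 (buffer names hypothetical, one data block shown):

        void *src[1];

        src[0] = old_data;
        xor_blocks(1, STRIPE_SIZE, parity, src);  /* prexor: P ^= D_old */
        memcpy(old_data, new_data, STRIPE_SIZE);  /* biodrain into cache */
        src[0] = old_data;                        /* now holds D_new */
        xor_blocks(1, STRIPE_SIZE, parity, src);  /* postxor: P ^= D_new */

The driver issues these stages through async_tx rather than calling them
directly, but the arithmetic is the same.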

On the completion path STRIPE_OP_PREXOR will be set if the operation was a
read-modify-write.  The STRIPE_OP_BIODRAIN flag is used in the completion
path to differentiate write-initiated postxor operations versus
expansion-initiated postxor operations.  Completion of a write triggers i/o
to the drives.

Changelog:
* make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  161 +---
 1 files changed, 138 insertions(+), 23 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7c688f6..b2e88fe 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1815,7 +1815,79 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
}
 }
 
+static int
+handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
+{
+   int i, pd_idx = sh->pd_idx, disks = sh->disks;
+   int locked = 0;
+
+   if (rcw) {
+   /* if we are not expanding this is a proper write request, and
+* there will be bios with new data to be drained into the
+* stripe cache
+*/
+   if (!expand) {
+   set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+   sh->ops.count++;
+   }
+
+   set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+   sh->ops.count++;
+
+   for (i = disks; i--; ) {
+   struct r5dev *dev = &sh->dev[i];
+
+   if (dev->towrite) {
+   set_bit(R5_LOCKED, &dev->flags);
+   if (!expand)
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   } else {
+   BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
+   test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+
+   set_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+   set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+   set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+
+   sh->ops.count += 3;
+
+   for (i = disks; i--; ) {
+   struct r5dev *dev = &sh->dev[i];
+   if (i == pd_idx)
+   continue;
+
+   /* For a read-modify write there may be blocks that are
+* locked for reading while others are ready to be
+* written so we distinguish these blocks by the
+* R5_Wantprexor bit
+*/
+   if (dev->towrite &&
+   (test_bit(R5_UPTODATE, &dev->flags) ||
+   test_bit(R5_Wantcompute, &dev->flags))) {
+   set_bit(R5_Wantprexor, &dev->flags);
+   set_bit(R5_LOCKED, &dev->flags);
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   }
+
+   /* keep the parity disk locked while asynchronous operations
+* are in flight
+*/
+   set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
+   clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+   locked++;
 
+   pr_debug("%s: stripe %llu locked: %d pending: %lx\n",
+   __FUNCTION__, (unsigned long long)sh->sector,
+   locked, sh->ops.pending);
+
+   return locked;
+}
 
 /*
  * Each stripe/dev can have one or more bion attached.
@@ -2210,27 +2282,8 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 * we can start a write request
 */
if (s->locked == 0 && (rcw == 0 || rmw == 0) &&
-   !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
-   pr_debug("Computing parity...\n");
-   compute_parity5(sh, rcw == 0 ?
-   

[md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside sh->lock

2007-06-26 Thread Dan Williams
When the raid acceleration work was proposed, Neil laid out the following
attack plan:

1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api

The raid5_run_ops routine uses the asynchronous offload api (async_tx) and
the stripe_operations member of a stripe_head to carry out xor+copy
operations asynchronously, outside the lock.

To perform operations outside the lock a new set of state flags is needed
to track new requests, in-flight requests, and completed requests.  In this
new model handle_stripe is tasked with scanning the stripe_head for work,
updating the stripe_operations structure, and finally dropping the lock and
calling raid5_run_ops for processing.  The following flags outline the
requests that handle_stripe can make of raid5_run_ops:

STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
 - subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct
STRIPE_OP_IO
 - submit i/o to the member disks (note this was already performed outside
   the stripe lock, but it made sense to add it as an operation type

The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the
   operation to the async_tx api
3/ async_tx triggers the completion callback routine to set
   sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit
   new operations that were previously blocked
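
Spelled out against the sh->ops bitmaps this patch introduces (a simplified
sketch; ops_run_postxor is named here by convention, and the real
transitions live in get_stripe_work() and the completion callbacks):

        /* 1/ handle_stripe, under sh->lock: request a postxor */
        set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
        sh->ops.count++;

        /* 2/ raid5_run_ops, outside the lock: submit work that is
         * pending but not yet acked */
        if (test_bit(STRIPE_OP_POSTXOR, &pending))
                tx = ops_run_postxor(sh, tx);

        /* 3/ async_tx completion callback */
        set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
        release_stripe(sh);     /* re-queues handle_stripe */

        /* 4/ handle_stripe, next pass: retire the request */
        if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
                clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
                clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
                clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
        }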

Note this patch just defines raid5_run_ops, subsequent commits (one per
major operation type) modify handle_stripe to take advantage of this
routine.

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
  ops_complete_write.
* removed the raid5_run_ops workqueue
* call bi_end_io for reads in ops_complete_biofill, saves a call to
  handle_stripe
* explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown
* fix race between async engines and bi_end_io call for reads, Neil Brown
* remove unnecessary spin_lock from ops_complete_biofill
* remove test_and_set/test_and_clear BUG_ONs, Neil Brown
* remove explicit interrupt handling for channel switching, this feature
  was absorbed (i.e. it is now implicit) by the async_tx api

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  546 
 include/linux/raid/raid5.h |   81 ++-
 2 files changed, 624 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d21fa7a..34fcda0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -52,6 +52,7 @@
 #include "raid6.h"
 
 #include 
+#include 
 
 /*
  * Stripe cache
@@ -324,6 +325,551 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
return sh;
 }
 
+static int
+raid5_end_read_request(struct bio *bi, unsigned int bytes_done, int error);
+static int
+raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
+
+static void ops_run_io(struct stripe_head *sh)
+{
+   raid5_conf_t *conf = sh->raid_conf;
+   int i, disks = sh->disks;
+
+   might_sleep();
+
+   for (i = disks; i--; ) {
+   int rw;
+   struct bio *bi;
+   mdk_rdev_t *rdev;
+   if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+   rw = WRITE;
+   else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+   rw = READ;
+   else
+   continue;
+
+   bi = &sh->dev[i].req;
+
+   bi->bi_rw = rw;
+   if (rw == WRITE)
+   bi->bi_end_io = raid5_end_write_request;
+   else
+   bi->bi_end_io = raid5_end_read_request;
+
+   rcu_read_lock();
+   rdev = rcu_dereference(conf->disks[i].rdev);
+   if (rdev && test_bit(Faulty, &rdev->flags))
+   rdev = NULL;
+   if (rdev)
+   atomic_inc(&rdev->nr_pending);
+   rcu_read_unlock();
+
+   if (rdev) {
+   if (test_bit(STRIPE_SYNCING, &sh->state) ||
+   test_bit(STRIPE_EXPAND_SOURCE, &sh->state) ||
+   test_bit(STRIPE_EXPAND_READY, &sh->state))
+   md_sync_acct(rdev->bdev, STRIPE_SECTORS);
+

[md-accel PATCH 05/19] raid5: refactor handle_stripe5 and handle_stripe6 (v2)

2007-06-26 Thread Dan Williams
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that
previously stood alone in handle_stripe5,6.  'struct r6_state' contains the
handle_stripe6 specific variables like p_failed and q_failed.

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

handle_completed_write_requests
handle_requests_to_failed_array
handle_stripe_expansion
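
A sketch of the accounting object, with fields inferred from their uses in
this patch set (the authoritative definition lands in
include/linux/raid/raid5.h):

        struct stripe_head_state {
                int syncing, expanding, expanded;
                int locked, uptodate, to_read, to_write, failed;
                int to_fill, compute, req_compute;
                int failed_num;
        };

Passing one struct instead of a dozen automatic variables is what lets the
per-case logic move into shared sub-routines.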

Changes in v2:
* fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c | 1488 +---
 include/linux/raid/raid5.h |   16 
 2 files changed, 737 insertions(+), 767 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f51dfa..94e0920 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1326,6 +1326,608 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
return pd_idx;
 }
 
+static void
+handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
+   struct stripe_head_state *s, int disks,
+   struct bio **return_bi)
+{
+   int i;
+   for (i = disks; i--; ) {
+   struct bio *bi;
+   int bitmap_end = 0;
+
+   if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+   mdk_rdev_t *rdev;
+   rcu_read_lock();
+   rdev = rcu_dereference(conf->disks[i].rdev);
+   if (rdev && test_bit(In_sync, &rdev->flags))
+   /* multiple read failures in one stripe */
+   md_error(conf->mddev, rdev);
+   rcu_read_unlock();
+   }
+   spin_lock_irq(&conf->device_lock);
+   /* fail all writes first */
+   bi = sh->dev[i].towrite;
+   sh->dev[i].towrite = NULL;
+   if (bi) {
+   s->to_write--;
+   bitmap_end = 1;
+   }
+
+   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+   wake_up(&conf->wait_for_overlap);
+
+   while (bi && bi->bi_sector <
+   sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &bi->bi_flags);
+   if (--bi->bi_phys_segments == 0) {
+   md_write_end(conf->mddev);
+   bi->bi_next = *return_bi;
+   *return_bi = bi;
+   }
+   bi = nextbi;
+   }
+   /* and fail all 'written' */
+   bi = sh->dev[i].written;
+   sh->dev[i].written = NULL;
+   if (bi) bitmap_end = 1;
+   while (bi && bi->bi_sector <
+  sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &bi->bi_flags);
+   if (--bi->bi_phys_segments == 0) {
+   md_write_end(conf->mddev);
+   bi->bi_next = *return_bi;
+   *return_bi = bi;
+   }
+   bi = bi2;
+   }
+
+   /* fail any reads if this device is non-operational */
+   if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+   test_bit(R5_ReadError, &sh->dev[i].flags)) {
+   bi = sh->dev[i].toread;
+   sh->dev[i].toread = NULL;
+   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+   wake_up(&conf->wait_for_overlap);
+   if (bi) s->to_read--;
+   while (bi && bi->bi_sector <
+  sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *nextbi =
+   r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &

[md-accel PATCH 06/19] raid5: replace custom debug PRINTKs with standard pr_debug

2007-06-26 Thread Dan Williams
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in
favor of the global DEBUG definition.  To get local debug messages just add
'#define DEBUG' to the top of the file.
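
For example (per-file enablement; the define must precede the includes so
that pr_debug() expands to a real printk):

        #define DEBUG
        #include <linux/kernel.h>

        static void example(sector_t sector)
        {
                pr_debug("stripe %llu\n", (unsigned long long)sector);
        }

Without DEBUG, pr_debug() compiles away to nothing, so the messages cost
nothing in production builds.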

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  116 ++--
 1 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 94e0920..d21fa7a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -80,7 +80,6 @@
 /*
  * The following can be used to debug the driver
  */
-#define RAID5_DEBUG0
 #define RAID5_PARANOIA 1
 #if RAID5_PARANOIA && defined(CONFIG_SMP)
 # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
@@ -88,8 +87,7 @@
 # define CHECK_DEVLOCK()
 #endif
 
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
-#if RAID5_DEBUG
+#ifdef DEBUG
 #define inline
 #define __inline__
 #endif
@@ -152,7 +150,8 @@ static void release_stripe(struct stripe_head *sh)
 
 static inline void remove_hash(struct stripe_head *sh)
 {
-   PRINTK("remove_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+   pr_debug("remove_hash(), stripe %llu\n",
+   (unsigned long long)sh->sector);
 
hlist_del_init(&sh->hash);
 }
@@ -161,7 +160,8 @@ static inline void insert_hash(raid5_conf_t *conf, struct stripe_head *sh)
 {
struct hlist_head *hp = stripe_hash(conf, sh->sector);
 
-   PRINTK("insert_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+   pr_debug("insert_hash(), stripe %llu\n",
+   (unsigned long long)sh->sector);
 
CHECK_DEVLOCK();
hlist_add_head(&sh->hash, hp);
@@ -226,7 +226,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));

CHECK_DEVLOCK();
-   PRINTK("init_stripe called, stripe %llu\n", 
+   pr_debug("init_stripe called, stripe %llu\n",
(unsigned long long)sh->sector);
 
remove_hash(sh);
@@ -260,11 +260,11 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
struct hlist_node *hn;
 
CHECK_DEVLOCK();
-   PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
+   pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
if (sh->sector == sector && sh->disks == disks)
return sh;
-   PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
+   pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
 }
 
@@ -276,7 +276,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 {
struct stripe_head *sh;
 
-   PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector);
+   pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
spin_lock_irq(&conf->device_lock);
 
@@ -537,8 +537,8 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
if (bi == &sh->dev[i].req)
break;
 
-   PRINTK("end_read_request %llu/%d, count: %d, uptodate %d.\n", 
-   (unsigned long long)sh->sector, i, atomic_read(&sh->count), 
+   pr_debug("end_read_request %llu/%d, count: %d, uptodate %d.\n",
+   (unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
if (i == disks) {
BUG();
@@ -613,7 +613,7 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
if (bi == &sh->dev[i].req)
break;
 
-   PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", 
+   pr_debug("end_write_request %llu/%d, count %d, uptodate: %d.\n",
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
if (i == disks) {
@@ -658,7 +658,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
 {
char b[BDEVNAME_SIZE];
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
-   PRINTK("raid5: error called\n");
+   pr_debug("raid5: error called\n");
 
if (!test_bit(Faulty, &rdev->flags)) {
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -929,7 +929,7 @@ static void compute_block(struct stripe_head *sh, int dd_idx)
int i, count, disks = sh->disks;
void *ptr[MAX_XOR_BLOCKS], *dest, *p;
 
-   PRINTK("compute_block, stripe %llu, idx %d\n", 
+ 

[md-accel PATCH 04/19] async_tx: add the async_tx api

2007-06-26 Thread Dan Williams
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts <= 1
* implicitly handle hardware concerns like channel switching and
  interrupts, Neil Brown
* remove the per operation type list, and distribute operation capabilities
  evenly amongst the available channels
* simplify async_tx_find_channel to optimize the fast path
* introduce the channel_table_initialized flag to prevent early calls to
  the api
* reorganize the code to mimic crypto
* include mm.h as not all archs include it in dma-mapping.h
* make the Kconfig options non-user visible, Adrian Bunk
* move async_tx under crypto since it is meant as 'core' functionality, and
  the two may share algorithms in the future
* move large inline functions into c files
* checkpatch.pl fixes
* gpl v2 only correction

Cc: Herbert Xu <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
Acked-By: NeilBrown <[EMAIL PROTECTED]>
---

 crypto/Kconfig |6 
 crypto/Makefile|2 
 crypto/async_tx/Kconfig|   16 +
 crypto/async_tx/Makefile   |4 
 crypto/async_tx/async_memcpy.c |  131 +++
 crypto/async_tx/async_memset.c |  109 +
 crypto/async_tx/async_tx.c |  497 
 crypto/async_tx/async_xor.c|  327 ++
 crypto/xor.c   |   29 +-
 drivers/dma/Kconfig|5 
 drivers/md/Kconfig |3 
 drivers/md/raid5.c |   54 ++--
 include/linux/async_tx.h   |  156 +
 include/linux/raid/xor.h   |5 
 14 files changed, 1294 insertions(+), 50 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index b749a1a..07090e9 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -5,9 +5,13 @@ config XOR_BLOCKS
tristate
 
 #
-# Cryptographic API Configuration
+# async_tx api: hardware offloaded memory transfer/transform support
 #
+source "crypto/async_tx/Kconfig"
 
+#
+# Cryptographic API Configuration
+#
 menu "Cryptographic options"
 
 config CRYPTO
diff --git a/crypto/Makefile b/crypto/Makefile
index 68e934b..0cf17f1 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -55,4 +55,4 @@ obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
 # generic algorithms and the async_tx api
 #
 obj-$(CONFIG_XOR_BLOCKS) += xor.o
-
+obj-$(CONFIG_ASYNC_CORE) += async_tx/
diff --git a/crypto/async_tx/Kconfig b/crypto/async_tx/Kconfig
new file mode 100644
index 0000000..d8fb391
--- /dev/null
+++ b/crypto/async_tx/Kconfig
@@ -0,0 +1,16 @@
+config ASYNC_CORE
+   tristate
+
+config ASYNC_MEMCPY
+   tristate
+   select ASYNC_CORE
+
+config ASYNC_XOR
+   tristate
+   select ASYNC_CORE
+   select XOR_BLOCKS
+
+config ASYNC_MEMSET
+   tristate
+   select ASYNC_CORE
+
diff --git a/crypto/async_tx/Makefile b/crypto/async_tx/Makefile
new file mode 100644
index 0000000..27baa7d
--- /dev/null
+++ b/crypto/async_tx/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_ASYNC_CORE) += async_tx.o
+obj-$(CONFIG_ASYNC_MEMCPY) += async_memcpy.o
+obj-$(CONFIG_ASYNC_MEMSET) += async_memset.o
+obj-$(CONFIG_ASYNC_XOR) += async_xor.o
diff --git a/crypto/async_tx/async_memcpy.c b/crypto/async_tx/async_memcpy.c
new file mode 100644
index 0000000..a973f4e
--- /dev/null
+++ b/crypto/async_tx/async_memcpy.c
@@ -0,0 +1,131 @@
+/*
+ * copy offload engine support
+ *
+ * Copyright © 2006, Intel Corporation.
+ *
+ *  Dan Williams <[EMAIL PROTECTED]>
+ *
+ *  with architecture considerations by:
+ *  Neil Brown <[EMAIL PROTECTED]>
+ *  Jeff Garzik <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+#include <linux/async_tx.h>
+
+/**
+ * async_memcpy - attempt to copy memory with a dma engine.
+ * @dest: destination page
+ * @src: src page
+ * @offset: offset in pages to start transaction
+ * @len: length in bytes
+ * @flags: ASYNC_TX_ASSUME_COHERENT, ASYNC_TX_ACK, ASYNC_TX_DEP_ACK,
+ * ASYNC_TX_KMAP_SRC, ASYNC_TX_KMAP_DST
+ * @depend_tx: memcpy depends on the result of this transaction
+ * @cb_fn: function to call when the memcpy completes
+ * @cb_param: parameter to pass to the callback routine
+ */
+struct dma_async_tx_descriptor *
+async_memcpy(struct page *dest, struct page *src, unsigned 
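
For illustration, a minimal caller sketch matching the kernel-doc above;
dest_page, src_page, complete_fn and complete_arg are hypothetical names,
not part of the patch:

   /* hypothetical: copy one page, offloaded if an engine is available */
   struct dma_async_tx_descriptor *tx;

   tx = async_memcpy(dest_page, src_page, 0, PAGE_SIZE,
                     ASYNC_TX_ACK, NULL, complete_fn, complete_arg);
   /* a NULL return is not a failure here: per the fallback behavior
    * described in patch 03's changelog, it means the copy already ran
    * synchronously in software; otherwise complete_fn runs when the
    * engine finishes */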

[md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with async_tx

2007-06-26 Thread Dan Williams
The async_tx api tries to use a dma engine for an operation, but will fall
back to an optimized software routine otherwise.  Xor support is
implemented using the raid5 xor routines.  For organizational purposes this
routine is moved to a common area.

The following fixes are also made:
* rename xor_block => xor_blocks, suggested by Adrian Bunk
* ensure that xor.o initializes before md.o in the built-in case
* checkpatch.pl fixes
* mark calibrate_xor_blocks __init, Adrian Bunk

Cc: Adrian Bunk <[EMAIL PROTECTED]>
Cc: NeilBrown <[EMAIL PROTECTED]>
Cc: Herbert Xu <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 crypto/Kconfig   |6 ++
 crypto/Makefile  |6 ++
 crypto/xor.c |  156 ++
 drivers/md/Kconfig   |1 
 drivers/md/Makefile  |4 +
 drivers/md/md.c  |2 -
 drivers/md/raid5.c   |   10 +--
 drivers/md/xor.c |  154 -
 include/linux/raid/xor.h |2 -
 9 files changed, 178 insertions(+), 163 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 4ca0ab3..b749a1a 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1,4 +1,10 @@
 #
+# Generic algorithms support
+#
+config XOR_BLOCKS
+   tristate
+
+#
 # Cryptographic API Configuration
 #
 
diff --git a/crypto/Makefile b/crypto/Makefile
index cce46a1..68e934b 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -50,3 +50,9 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
 obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o
 
 obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
+
+#
+# generic algorithms and the async_tx api
+#
+obj-$(CONFIG_XOR_BLOCKS) += xor.o
+
diff --git a/crypto/xor.c b/crypto/xor.c
new file mode 100644
index 0000000..8281ac5
--- /dev/null
+++ b/crypto/xor.c
@@ -0,0 +1,156 @@
+/*
+ * xor.c : Multiple Devices driver for Linux
+ *
+ * Copyright (C) 1996, 1997, 1998, 1999, 2000,
+ * Ingo Molnar, Matti Aarnio, Jakub Jelinek, Richard Henderson.
+ *
+ * Dispatch optimized RAID-5 checksumming functions.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * You should have received a copy of the GNU General Public License
+ * (for example /usr/src/linux/COPYING); if not, write to the Free
+ * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#define BH_TRACE 0
+#include <linux/module.h>
+#include <linux/raid/md.h>
+#include <linux/raid/xor.h>
+#include <asm/xor.h>
+
+/* The xor routines to use.  */
+static struct xor_block_template *active_template;
+
+void
+xor_blocks(unsigned int count, unsigned int bytes, void **ptr)
+{
+   unsigned long *p0, *p1, *p2, *p3, *p4;
+
+   p0 = (unsigned long *) ptr[0];
+   p1 = (unsigned long *) ptr[1];
+   if (count == 2) {
+   active_template->do_2(bytes, p0, p1);
+   return;
+   }
+
+   p2 = (unsigned long *) ptr[2];
+   if (count == 3) {
+   active_template->do_3(bytes, p0, p1, p2);
+   return;
+   }
+
+   p3 = (unsigned long *) ptr[3];
+   if (count == 4) {
+   active_template->do_4(bytes, p0, p1, p2, p3);
+   return;
+   }
+
+   p4 = (unsigned long *) ptr[4];
+   active_template->do_5(bytes, p0, p1, p2, p3, p4);
+}
+EXPORT_SYMBOL(xor_blocks);
+
+/* Set of all registered templates.  */
+static struct xor_block_template *template_list;
+
+#define BENCH_SIZE (PAGE_SIZE)
+
+static void
+do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
+{
+   int speed;
+   unsigned long now;
+   int i, count, max;
+
+   tmpl->next = template_list;
+   template_list = tmpl;
+
+   /*
+* Count the number of XORs done during a whole jiffy, and use
+* this to calculate the speed of checksumming.  We use a 2-page
+* allocation to have guaranteed color L1-cache layout.
+*/
+   max = 0;
+   for (i = 0; i < 5; i++) {
+   now = jiffies;
+   count = 0;
+   while (jiffies == now) {
+   mb(); /* prevent loop optimization */
+   tmpl->do_2(BENCH_SIZE, b1, b2);
+   mb();
+   count++;
+   mb();
+   }
+   if (count > max)
+   max = count;
+   }
+
+   speed = max * (HZ * BENCH_SIZE / 1024);
+   tmpl->speed = speed;
+
+   printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
+  speed / 1000, speed % 1000);
+}
+
+static int __init
+calibrate_xor_blocks(void)
+{
+   void *b1, *b2;
+   struct xor_block_template *f, *fastest;
+
+   b1 = (void *) __get_free_pages(GFP_KERNEL, 2);
+   if (!b1) {
+   printk(KERN_WAR
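
For illustration, a hedged caller sketch for the routine exported above;
dest and src are hypothetical page-sized buffers, and ptr[0] doubles as
source and destination, matching the do_2() dispatch:

   /* hypothetical: dest ^= src over one page */
   void *ptrs[2] = { dest, src };

   xor_blocks(2, PAGE_SIZE, ptrs);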

[md-accel PATCH 01/19] dmaengine: refactor dmaengine around dma_async_tx_descriptor

2007-06-26 Thread Dan Williams
The current dmaengine interface defines multiple routines per operation,
i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc.  Adding
more operation types (xor, crc, etc) to this model would result in an
unmanageable number of method permutations.

Are we really going to add a set of hooks for each DMA engine
whizbang feature?
- Jeff Garzik

The descriptor creation process is refactored using the new common
dma_async_tx_descriptor structure.  Instead of per driver
do_<operation>_<src>_to_<dest> methods, drivers integrate
dma_async_tx_descriptor into their private software descriptor and then
define a 'prep' routine per operation.  The prep routine allocates a
descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit routines
are valid.  Descriptor creation and submission becomes:

struct dma_device *dev;
struct dma_chan *chan;
struct dma_async_tx_descriptor *tx;

tx = dev->device_prep_dma_<operation>(chan, len, int_flag)
tx->tx_set_src(dma_addr_t, tx, index /* for multi-source ops */)
tx->tx_set_dest(dma_addr_t, tx, index)
tx->tx_submit(tx)

In addition to the refactoring, dma_async_tx_descriptor also lays the
groundwork for defining cross-channel-operation dependencies, and a
callback facility for asynchronous notification of operation completion.
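
As a concrete, illustrative instance of the sequence above (a memcpy with
a single source; src_phys and dest_phys are assumed to be already-mapped
dma_addr_t values, and 'cookie' tracks completion):

   struct dma_async_tx_descriptor *tx;
   dma_cookie_t cookie;

   tx = dev->device_prep_dma_memcpy(chan, len, 0 /* no interrupt */);
   tx->tx_set_src(src_phys, tx, 0);   /* index 0: the only source */
   tx->tx_set_dest(dest_phys, tx, 0);
   tx->callback = NULL;               /* no completion callback */
   cookie = tx->tx_submit(tx);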

Changelog:
* drop dma mapping methods, suggested by Chris Leech
* fix ioat_dma_dependency_added, also caught by Andrew Morton
* fix dma_sync_wait, change from Andrew Morton
* uninline large functions, change from Andrew Morton
* add tx->callback = NULL to dmaengine calls to interoperate with async_tx
  calls
* hookup ioat_tx_submit
* convert channel capabilities to a 'cpumask_t like' bitmap
* removed DMA_TX_ARRAY_INIT, no longer needed
* checkpatch.pl fixes
* make set_src, set_dest, and tx_submit descriptor specific methods

Cc: Jeff Garzik <[EMAIL PROTECTED]>
Cc: Chris Leech <[EMAIL PROTECTED]>
Cc: Shannon Nelson <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/dmaengine.c   |  182 ++
 drivers/dma/ioatdma.c |  277 -
 drivers/dma/ioatdma.h |8 +
 include/linux/dmaengine.h |  230 +++--
 4 files changed, 455 insertions(+), 242 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 322ee29..379809f 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -59,6 +59,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -66,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_MUTEX(dma_list_mutex);
 static LIST_HEAD(dma_device_list);
@@ -165,6 +167,24 @@ static struct dma_chan *dma_client_chan_alloc(struct 
dma_client *client)
return NULL;
 }
 
+enum dma_status dma_sync_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{
+   enum dma_status status;
+   unsigned long dma_sync_wait_timeout = jiffies + msecs_to_jiffies(5000);
+
+   dma_async_issue_pending(chan);
+   do {
+   status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
+   if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
+   printk(KERN_ERR "dma_sync_wait_timeout!\n");
+   return DMA_ERROR;
+   }
+   } while (status == DMA_IN_PROGRESS);
+
+   return status;
+}
+EXPORT_SYMBOL(dma_sync_wait);
+
 /**
  * dma_chan_cleanup - release a DMA channel's resources
  * @kref: kernel reference structure that contains the DMA channel device
@@ -322,6 +342,25 @@ int dma_async_device_register(struct dma_device *device)
if (!device)
return -ENODEV;
 
+   /* validate device routines */
+   BUG_ON(dma_has_cap(DMA_MEMCPY, device->cap_mask) &&
+   !device->device_prep_dma_memcpy);
+   BUG_ON(dma_has_cap(DMA_XOR, device->cap_mask) &&
+   !device->device_prep_dma_xor);
+   BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) &&
+   !device->device_prep_dma_zero_sum);
+   BUG_ON(dma_has_cap(DMA_MEMSET, device->cap_mask) &&
+   !device->device_prep_dma_memset);
+   BUG_ON(dma_has_cap(DMA_INTERRUPT, device->cap_mask) &&
+   !device->device_prep_dma_interrupt);
+
+   BUG_ON(!device->device_alloc_chan_resources);
+   BUG_ON(!device->device_free_chan_resources);
+   BUG_ON(!device->device_dependency_added);
+   BUG_ON(!device->device_is_tx_complete);
+   BUG_ON(!device->device_issue_pending);
+   BUG_ON(!device->dev);
+
init_completion(&device->done);
kref_init(&device->refcount);
device->dev_id = id++;
@@ -397,6 +436,149 @@ void dma_async_device_unregister(struct dma_device 
*device)
 }
 EXPORT_SYMBOL(dma_async_device_unregister);
 
+/**
+ *

[md-accel PATCH 02/19] dmaengine: make clients responsible for managing channels

2007-06-26 Thread Dan Williams
The current implementation assumes that a channel will only be used by one
client at a time.  In order to enable channel sharing the dmaengine core is
changed to a model where clients subscribe to channel-available-events.
Instead of tracking how many channels a client wants and how many it has
received the core just broadcasts the available channels and lets the
clients optionally take a reference.  The core learns about the clients'
needs at dma_event_callback time.

In support of multiple operation types, clients can specify a capability
mask to only be notified of channels that satisfy a certain set of
capabilities.
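
A hedged sketch of what building such a capability mask could look like,
assuming cpumask-style helpers (dma_cap_zero/dma_cap_set) to go with the
dma_cap_mask_t bitmap introduced in patch 01:

   dma_cap_mask_t mask;

   dma_cap_zero(mask);
   dma_cap_set(DMA_MEMCPY, mask);  /* only notify about memcpy channels */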

Changelog:
* removed DMA_TX_ARRAY_INIT, no longer needed
* dma_client_chan_free -> dma_chan_release: switch to global reference
  counting only at device unregistration time, before it was also happening
  at client unregistration time
* clients now return dma_state_client to dmaengine (ack, dup, nak)
* checkpatch.pl fixes

Cc: Chris Leech <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/dmaengine.c   |  217 +++--
 drivers/dma/ioatdma.c |1 
 drivers/dma/ioatdma.h |3 -
 include/linux/dmaengine.h |   58 +++-
 net/core/dev.c|  112 ---
 5 files changed, 224 insertions(+), 167 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 379809f..5c5378e 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -37,11 +37,11 @@
  * Each device has a channels list, which runs unlocked but is never modified
  * once the device is registered, it's just setup by the driver.
  *
- * Each client has a channels list, it's only modified under the client->lock
- * and in an RCU callback, so it's safe to read under rcu_read_lock().
+ * Each client is responsible for keeping track of the channels it uses.  See
+ * the definition of dma_event_callback in dmaengine.h.
  *
  * Each device has a kref, which is initialized to 1 when the device is
- * registered. A kref_put is done for each class_device registered.  When the
+ * registered. A kref_get is done for each class_device registered.  When the
 * class_device is released, the corresponding kref_put is done in the release
 * method. Every time one of the device's channels is allocated to a client,
 * a kref_get occurs.  When the channel is freed, the corresponding kref_put
@@ -51,10 +51,12 @@
  * references to finish.
  *
  * Each channel has an open-coded implementation of Rusty Russell's "bigref,"
- * with a kref and a per_cpu local_t.  A single reference is set when on an
- * ADDED event, and removed with a REMOVE event.  Net DMA client takes an
- * extra reference per outstanding transaction.  The relase function does a
- * kref_put on the device. -ChrisL
+ * with a kref and a per_cpu local_t.  A dma_chan_get is called when a client
+ * signals that it wants to use a channel, and dma_chan_put is called when
+ * a channel is removed or a client using it is unregistered.  A client can
+ * take extra references per outstanding transaction, as is the case with
+ * the NET DMA client.  The release function does a kref_put on the device.
+ * -ChrisL, DanW
  */
 
 #include 
@@ -102,8 +104,19 @@ static ssize_t show_bytes_transferred(struct class_device 
*cd, char *buf)
 static ssize_t show_in_use(struct class_device *cd, char *buf)
 {
struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev);
+   int in_use = 0;
+
+   if (unlikely(chan->slow_ref) &&
+   atomic_read(&chan->refcount.refcount) > 1)
+   in_use = 1;
+   else {
+   if (local_read(&(per_cpu_ptr(chan->local,
+   get_cpu())->refcount)) > 0)
+   in_use = 1;
+   put_cpu();
+   }
 
-   return sprintf(buf, "%d\n", (chan->client ? 1 : 0));
+   return sprintf(buf, "%d\n", in_use);
 }
 
 static struct class_device_attribute dma_class_attrs[] = {
@@ -129,42 +142,53 @@ static struct class dma_devclass = {
 
 /* --- client and device registration --- */
 
+#define dma_chan_satisfies_mask(chan, mask) \
+   __dma_chan_satisfies_mask((chan), &(mask))
+static int
+__dma_chan_satisfies_mask(struct dma_chan *chan, dma_cap_mask_t *want)
+{
+   dma_cap_mask_t has;
+
+   bitmap_and(has.bits, want->bits, chan->device->cap_mask.bits,
+   DMA_TX_TYPE_END);
+   return bitmap_equal(want->bits, has.bits, DMA_TX_TYPE_END);
+}
+
 /**
- * dma_client_chan_alloc - try to allocate a channel to a client
+ * dma_client_chan_alloc - try to allocate channels to a client
  * @client: &dma_client
  *
  * Called with dma_list_mutex held.
  */
-static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
+static void dma_client_chan_alloc(struct dma_client *client)

[md-accel PATCH 00/19] md raid acceleration and the async_tx api

2007-06-26 Thread Dan Williams
Greetings,

Per Andrew's suggestion this is the md raid5 acceleration patch set
updated with more thorough changelogs to lower the barrier to entry for
reviewers.  To get started with the code I would suggest the following
order:
[md-accel PATCH 01/19] dmaengine: refactor dmaengine around 
dma_async_tx_descriptor
[md-accel PATCH 04/19] async_tx: add the async_tx api
[md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside 
sh->lock
[md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx 
raid engines

The patch set can be broken down into three main categories:
1/ API (async_tx: patches 1 - 4)
2/ implementation (md changes: patches 5 - 15)
3/ driver (iop-adma: patches 16 - 19)

I have worked with Neil to get approval of the category 2 changes.
However for the category 1 and 3 changes there was no obvious
merge-path/maintainer to work through.  I have thus far extrapolated
Neil's comments about 2 out to 1 and 3, Jeff gave some direction on a
early revision about the scalability of the API, and the patch set has
picked up various fixes and suggestions from being in -mm for a few
releases.  Please help me ensure that this code is ready for Linus to
pull for 2.6.23.

git://lost.foo-projects.org/~dwillia2/git/iop md-accel-linus

Dan Williams (19):
  dmaengine: refactor dmaengine around dma_async_tx_descriptor
  dmaengine: make clients responsible for managing channels
  xor: make 'xor_blocks' a library routine for use with async_tx
  async_tx: add the async_tx api
  raid5: refactor handle_stripe5 and handle_stripe6 (v2)
  raid5: replace custom debug PRINTKs with standard pr_debug
  md: raid5_run_ops - run stripe operations outside sh->lock
  md: common infrastructure for running operations with raid5_run_ops
  md: handle_stripe5 - add request/completion logic for async write ops
  md: handle_stripe5 - add request/completion logic for async compute ops
  md: handle_stripe5 - add request/completion logic for async check ops
  md: handle_stripe5 - add request/completion logic for async read ops
  md: handle_stripe5 - add request/completion logic for async expand ops
  md: handle_stripe5 - request io processing in raid5_run_ops
  md: remove raid5 compute_block and compute_parity5
  dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
  iop13xx: surface the iop13xx adma units to the iop-adma driver
  iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver
  ARM: Add drivers/dma to arch/arm/Kconfig

Administrivia:
This patch set contains three new patches compared to the previous
release they are:
[md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with 
async_tx
[md-accel PATCH 05/19] raid5: refactor handle_stripe5 and handle_stripe6 (v2) 
[md-accel PATCH 06/19] raid5: replace custom debug PRINTKs with standard 
pr_debug

net/core/dev.c is touched by the following:
[md-accel PATCH 02/19] dmaengine: make clients responsible for managing channels


Re: stripe_cache_size and performance

2007-06-25 Thread Dan Williams

> 7. And now, the question: the best absolute 'write' performance comes
> with a stripe_cache_size value of 4096 (for my setup). However, any
> value of stripe_cache_size above 384 really, really hurts 'check' (and
> rebuild, one can assume) performance.  Why?

Question:
After performance goes "bad" does it go back up if you reduce the size
back down to 384?

> --
> Jon Nelson <[EMAIL PROTECTED]>

Dan


Re: [PATCH git-md-accel 1/2] raid5: refactor handle_stripe5 and handle_stripe6

2007-06-18 Thread Dan Williams

On 6/18/07, Dan Williams <[EMAIL PROTECTED]> wrote:
...

> +static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> +   struct r6_state *r6s)
> +{
> +   int i;
> +
> +   /* We have read all the blocks in this stripe and now we need to
> +* copy some of them into a target stripe for expand.
> +*/
> +   clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> +   for (i = 0; i < sh->disks; i++)
> +   if (i != sh->pd_idx && (r6s && i != r6s->qd_idx)) {
> +   int dd_idx, pd_idx, j;
> +   struct stripe_head *sh2;
> +
> +   sector_t bn = compute_blocknr(sh, i);
> +   sector_t s = raid5_compute_sector(bn, conf->raid_disks,
> +   conf->raid_disks-1, &dd_idx,
> +   &pd_idx, conf);

this bug made it through the regression test:

'conf->raid_disks-1' should be 'conf->raid_disks - conf->max_degraded'

--
Dan


[PATCH git-md-accel 2/2] raid5: replace custom debug print with standard pr_debug

2007-06-18 Thread Dan Williams
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in
favor of the global DEBUG definition.  To get local debug messages just add
'#define DEBUG' to the top of the file.
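
For example, a minimal sketch of the mechanism (not part of the patch
itself):

   /* at the top of drivers/md/raid5.c, before the #includes: */
   #define DEBUG

   /* with DEBUG defined, pr_debug(fmt, args...) expands to
    * printk(KERN_DEBUG fmt, ##args); without it, the call compiles
    * away to nothing */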

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  116 ++--
 1 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 68834d2..fa562e7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -80,7 +80,6 @@
 /*
  * The following can be used to debug the driver
  */
-#define RAID5_DEBUG 0
 #define RAID5_PARANOIA 1
 #if RAID5_PARANOIA && defined(CONFIG_SMP)
 # define CHECK_DEVLOCK() assert_spin_locked(&conf->device_lock)
@@ -88,8 +87,7 @@
 # define CHECK_DEVLOCK()
 #endif
 
-#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x)))
-#if RAID5_DEBUG
+#ifdef DEBUG
 #define inline
 #define __inline__
 #endif
@@ -152,7 +150,8 @@ static void release_stripe(struct stripe_head *sh)
 
 static inline void remove_hash(struct stripe_head *sh)
 {
-   PRINTK("remove_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+   pr_debug("remove_hash(), stripe %llu\n",
+   (unsigned long long)sh->sector);
 
hlist_del_init(&sh->hash);
 }
@@ -161,7 +160,8 @@ static inline void insert_hash(raid5_conf_t *conf, struct 
stripe_head *sh)
 {
struct hlist_head *hp = stripe_hash(conf, sh->sector);
 
-   PRINTK("insert_hash(), stripe %llu\n", (unsigned long long)sh->sector);
+   pr_debug("insert_hash(), stripe %llu\n",
+   (unsigned long long)sh->sector);
 
CHECK_DEVLOCK();
hlist_add_head(&sh->hash, hp);
@@ -226,7 +226,7 @@ static void init_stripe(struct stripe_head *sh, sector_t 
sector, int pd_idx, int
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));

CHECK_DEVLOCK();
-   PRINTK("init_stripe called, stripe %llu\n", 
+   pr_debug("init_stripe called, stripe %llu\n",
(unsigned long long)sh->sector);
 
remove_hash(sh);
@@ -260,11 +260,11 @@ static struct stripe_head *__find_stripe(raid5_conf_t 
*conf, sector_t sector, in
struct hlist_node *hn;
 
CHECK_DEVLOCK();
-   PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
+   pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
if (sh->sector == sector && sh->disks == disks)
return sh;
-   PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
+   pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
return NULL;
 }
 
@@ -276,7 +276,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
 {
struct stripe_head *sh;
 
-   PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector);
+   pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
spin_lock_irq(&conf->device_lock);
 
@@ -537,8 +537,8 @@ static int raid5_end_read_request(struct bio * bi, unsigned 
int bytes_done,
if (bi == &sh->dev[i].req)
break;
 
-   PRINTK("end_read_request %llu/%d, count: %d, uptodate %d.\n", 
-   (unsigned long long)sh->sector, i, atomic_read(&sh->count), 
+   pr_debug("end_read_request %llu/%d, count: %d, uptodate %d.\n",
+   (unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
if (i == disks) {
BUG();
@@ -613,7 +613,7 @@ static int raid5_end_write_request (struct bio *bi, 
unsigned int bytes_done,
if (bi == &sh->dev[i].req)
break;
 
-   PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", 
+   pr_debug("end_write_request %llu/%d, count %d, uptodate: %d.\n",
(unsigned long long)sh->sector, i, atomic_read(&sh->count),
uptodate);
if (i == disks) {
@@ -658,7 +658,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
 {
char b[BDEVNAME_SIZE];
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
-   PRINTK("raid5: error called\n");
+   pr_debug("raid5: error called\n");
 
if (!test_bit(Faulty, &rdev->flags)) {
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -929,7 +929,7 @@ static void compute_block(struct stripe_head *sh, int 
dd_idx)
int i, count, disks = sh->disks;
void *ptr[MAX_XOR_BLOCKS], *dest, *p;
 
-   PRINTK("compute_block, stripe %llu, idx %d\n", 
+ 

[PATCH git-md-accel 1/2] raid5: refactor handle_stripe5 and handle_stripe6

2007-06-18 Thread Dan Williams
handle_stripe5 and handle_stripe6 have very deep logic paths handling the
various states of a stripe_head.  By introducing the 'stripe_head_state'
and 'r6_state' objects, large portions of the logic can be moved to
sub-routines.

'struct stripe_head_state' consumes all of the automatic variables that 
previously
stood alone in handle_stripe5,6.  'struct r6_state' contains the handle_stripe6
specific variables like p_failed and q_failed.
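
A sketch of the shape of these objects, restricted to fields actually named
in this mail and in the diff below (the real definitions carry more
counters):

   struct stripe_head_state {
           int to_read, to_write;  /* pending request counts, see diff */
           /* further per-stripe counters omitted here */
   };

   struct r6_state {
           int p_failed, q_failed; /* raid6 P/Q failure state */
           int qd_idx;             /* Q disk index */
   };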

One of the nice side effects of the 'stripe_head_state' change is that it
allows for further reductions in code duplication between raid5 and raid6.
The following new routines are shared between raid5 and raid6:

handle_completed_write_requests
handle_requests_to_failed_array
    handle_stripe_expansion

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c | 1484 +---
 include/linux/raid/raid5.h |   16 
 2 files changed, 733 insertions(+), 767 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4f51dfa..68834d2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1326,6 +1326,604 @@ static int stripe_to_pdidx(sector_t stripe, 
raid5_conf_t *conf, int disks)
return pd_idx;
 }
 
+static void
+handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
+   struct stripe_head_state *s, int disks,
+   struct bio **return_bi)
+{
+   int i;
+   for (i = disks; i--;) {
+   struct bio *bi;
+   int bitmap_end = 0;
+
+   if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+   mdk_rdev_t *rdev;
+   rcu_read_lock();
+   rdev = rcu_dereference(conf->disks[i].rdev);
+   if (rdev && test_bit(In_sync, &rdev->flags))
+   /* multiple read failures in one stripe */
+   md_error(conf->mddev, rdev);
+   rcu_read_unlock();
+   }
+   spin_lock_irq(&conf->device_lock);
+   /* fail all writes first */
+   bi = sh->dev[i].towrite;
+   sh->dev[i].towrite = NULL;
+   if (bi) {
+   s->to_write--;
+   bitmap_end = 1;
+   }
+
+   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+   wake_up(&conf->wait_for_overlap);
+
+   while (bi && bi->bi_sector <
+   sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &bi->bi_flags);
+   if (--bi->bi_phys_segments == 0) {
+   md_write_end(conf->mddev);
+   bi->bi_next = *return_bi;
+   *return_bi = bi;
+   }
+   bi = nextbi;
+   }
+   /* and fail all 'written' */
+   bi = sh->dev[i].written;
+   sh->dev[i].written = NULL;
+   if (bi) bitmap_end = 1;
+   while (bi && bi->bi_sector <
+  sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &bi->bi_flags);
+   if (--bi->bi_phys_segments == 0) {
+   md_write_end(conf->mddev);
+   bi->bi_next = *return_bi;
+   *return_bi = bi;
+   }
+   bi = bi2;
+   }
+
+   /* fail any reads if this device is non-operational */
+   if (!test_bit(R5_Insync, &sh->dev[i].flags) ||
+   test_bit(R5_ReadError, &sh->dev[i].flags)) {
+   bi = sh->dev[i].toread;
+   sh->dev[i].toread = NULL;
+   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+   wake_up(&conf->wait_for_overlap);
+   if (bi) s->to_read--;
+   while (bi && bi->bi_sector <
+  sh->dev[i].sector + STRIPE_SECTORS) {
+   struct bio *nextbi =
+   r5_next_bio(bi, sh->dev[i].sector);
+   clear_bit(BIO_UPTODATE, &bi->bi_flags);
+   if (--bi->bi_phys_segments == 0) {
+   bi->bi_next = *re

[PATCH git-md-accel 0/2] raid5 refactor, and pr_debug cleanup

2007-06-18 Thread Dan Williams
Neil,

The following two patches are the respin of the changes you suggested to
"raid5: coding style cleanup / refactor".  I have added them to the
git-md-accel tree for a 2.6.23-rc1 pull.  The full, rebased, raid
acceleration patchset will be sent for a another round of review once I
address Andrew's concerns about the commit messages.

Dan Williams (2):
  raid5: refactor handle_stripe5 and handle_stripe6
  raid5: replace custom debug print with standard pr_debug


Re: raid5: coding style cleanup / refactor

2007-06-15 Thread Dan Williams

On 6/15/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> Good idea...  Am I asking too much to have separate things in separate
> patches?  It makes review easier.

...yeah I got a little bit carried away after the refactoring.  I will
spin the refactoring out into a separate patch and handle the coding
style violations as you suggested.

> NeilBrown

Thanks,
Dan


Re: raid5: coding style cleanup / refactor

2007-06-14 Thread Dan Williams

On 6/14/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> When you are ready for wider testing, if you have a patch against a
> released kernel it makes testing easy, characteristics are pretty well
> known already.

Thanks I went ahead and put a separate snapshot up on SourceForge:
http://downloads.sourceforge.net/xscaleiop/md-accel-2.6.22-rc4-20070614.patch
[ It's actually based on current git but should apply cleanly to 2.6.22-rc4. ]

If you are so inclined the most up-to-date version is available via git.
git pull git://lost.foo-projects.org/~dwillia2/git/iop md-accel-linus

It should perform identically to vanilla 2.6.22-rc4 MD.

> --
> bill davidsen <[EMAIL PROTECTED]>

Regards,
Dan


Re: raid5: coding style cleanup / refactor

2007-06-13 Thread Dan Williams

> In other words, it seemed like a good idea at the time, but I am open
> to suggestions.



I went ahead and added the cleanup patch to the front of the
git-md-accel.patch series.  A few more whitespace cleanups, but no
major changes from what I posted earlier.  The new rebased series is
still passing my tests and Neil's tests in mdadm.

--
Dan


Re: raid5: coding style cleanup / refactor

2007-06-12 Thread Dan Williams

> I assume that you're prepared to repair all that damage to your tree, but
> it seems a bit masochistic?

It's either this or have an inconsistent coding style throughout
raid5.c.  I figure it is worth it to have reduced code duplication
between raid5 and raid6, and it makes it easier to add new cache
features going forward.  I have a few more cleanups to add for a rev2 of
this patch, but I will hold that off until the rebase is done.



In other words, it seemed like a good idea at the time, but I am open
to suggestions.


Re: Paranoid read mode for raid5/6

2007-06-11 Thread Dan Williams

On 6/11/07, Mattias Wadenstein <[EMAIL PROTECTED]> wrote:

> Hi *,
>
> Is there a way to tell md to be paranoid and verify the parity for raid5/6
> on every read? I guess this would come with a (significant) performance
> hit, but sometimes that's not a big deal (unlike disks scrambling your
> data).
>
> Also, regarding data paranoia, for check/repair of a raid6, is the effort
> made to figure out which is the misbehaving participant device, or is
> parity just blindly recalculated?


This is something that is not available today, but is something I am
looking to implement.  Especially for the raid6 case where hpa showed
that not only can corruption be detected, it can be corrected.
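
(For reference, the result alluded to, from hpa's "The mathematics of
RAID-6", summarized here rather than quoted: with syndromes

   P = D_0 + D_1 + ... + D_{n-1}
   Q = g^0 * D_0 + g^1 * D_1 + ... + g^{n-1} * D_{n-1}

over GF(2^8), where + is XOR, a single corrupt data block D_z shifts the
freshly recomputed syndromes P' and Q' by P + P' = E and Q + Q' = g^z * E,
so z = log_g((Q + Q') / (P + P')) locates the bad block and XORing E into
it repairs the damage.)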


> /Mattias Wadenstein


Dan


[PATCH] md: comment add_stripe_bio

2007-06-05 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

Document the overloading of struct bio fields.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

[ drop this if you think it is too much commenting/unnecessary, but I figured I 
would leave some
  breadcrumbs for the next guy. ]

 drivers/md/raid5.c |   26 ++
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 061375e..065b8dc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1231,10 +1231,13 @@ static void compute_block_2(struct stripe_head *sh, int 
dd_idx1, int dd_idx2)
 
 
 
-/*
- * Each stripe/dev can have one or more bion attached.
- * toread/towrite point to the first in a chain.
- * The bi_next chain must be in order.
+/* add_stripe_bio - attach a bio to the toread/towrite list for an
+ * rdev in the given stripe.  This routine assumes that the toread/towrite
+ * lists are in submission order
+ * @sh: stripe targeted by bi->bi_sector
+ * @bi: bio to add (this routine assumes ownership of the bi->bi_next field)
+ * @dd_idx: data disk determined from the logical sector
+ * @forwrite: read/write flag
  */
 static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, 
int forwrite)
 {
@@ -1249,12 +1252,18 @@ static int add_stripe_bio(struct stripe_head *sh, 
struct bio *bi, int dd_idx, in
 
spin_lock(&sh->lock);
spin_lock_irq(&conf->device_lock);
+   
+   /* pick the list to manipulate */
if (forwrite) {
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
} else
bip = &sh->dev[dd_idx].toread;
+
+   /* scroll through the list to see if this bio overlaps with a
+* pending request
+*/
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -1263,10 +1272,19 @@ static int add_stripe_bio(struct stripe_head *sh, 
struct bio *bi, int dd_idx, in
if (*bip && (*bip)->bi_sector < bi->bi_sector + ((bi->bi_size)>>9))
goto overlap;
 
+   /* add this bio into the chain and make sure we are not dropping
+* a link that was previously established
+*/
BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
if (*bip)
bi->bi_next = *bip;
+
+   /* attach the bio to the end of the list, if this is the first bio
+* added then we are directly manipulating toread/towrite
+*/
*bip = bi;
+
+   /* keep count of the number of stripes that this bio touches */
bi->bi_phys_segments ++;
spin_unlock_irq(&conf->device_lock);
spin_unlock(&sh->lock);


[PATCH 16/16] iop3xx: Surface the iop3xx DMA and AAU units to the iop-adma driver

2007-05-01 Thread Dan Williams
Adds the platform device definitions and the architecture specific support
routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for > 1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous, only
hardware descriptors are contiguous
for up to a PAGE_SIZE buffer size
* convert to async_tx
* add interrupt support
* add platform devices for 80219 boards
* do not call platform register macros in driver code
* remove switch() statements for compatible register offsets/layouts
* change over to bitmap based capabilities

Cc: Russell King <[EMAIL PROTECTED]>
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/mach-iop32x/glantank.c|2 
 arch/arm/mach-iop32x/iq31244.c |5 
 arch/arm/mach-iop32x/iq80321.c |3 
 arch/arm/mach-iop32x/n2100.c   |2 
 arch/arm/mach-iop33x/iq80331.c |3 
 arch/arm/mach-iop33x/iq80332.c |3 
 arch/arm/plat-iop/Makefile |2 
 arch/arm/plat-iop/adma.c   |  216 
 include/asm-arm/arch-iop32x/adma.h |5 
 include/asm-arm/arch-iop33x/adma.h |5 
 include/asm-arm/hardware/iop3xx-adma.h |  893 
 include/asm-arm/hardware/iop3xx.h  |   68 --
 12 files changed, 1147 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c
index 45f4f13..2e0099b 100644
--- a/arch/arm/mach-iop32x/glantank.c
+++ b/arch/arm/mach-iop32x/glantank.c
@@ -180,6 +180,8 @@ static void __init glantank_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&glantank_flash_device);
platform_device_register(&glantank_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = glantank_power_off;
 }
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index 60e7430..c0d077c 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -295,9 +295,14 @@ static void __init iq31244_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq31244_flash_device);
platform_device_register(&iq31244_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
if (is_ep80219())
pm_power_off = ep80219_power_off;
+
+   if (!is_80219())
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 static int __init force_ep80219_setup(char *str)
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index 361c70c..474ec2a 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -180,6 +180,9 @@ static void __init iq80321_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq80321_flash_device);
platform_device_register(&iq80321_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80321, "Intel IQ80321")
diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c
index 5f07344..8e6fe13 100644
--- a/arch/arm/mach-iop32x/n2100.c
+++ b/arch/arm/mach-iop32x/n2100.c
@@ -245,6 +245,8 @@ static void __init n2100_init_machine(void)
platform_device_register(&iop3xx_i2c0_device);
platform_device_register(&n2100_flash_device);
platform_device_register(&n2100_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = n2100_power_off;
 
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 1a9e361..b4d12bf 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -135,6 +135,9 @@ static void __init iq80331_init_machine(void)
platform_device_register(&iop33x_uart0_device);
platform_device_register(&iop33x_uart1_device);
platform_device_register(&iq80331_flash_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80331, "Intel IQ80331")
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index 96d6f0f..2abb2d8 100644
--- a/arch/arm/mach-iop33
