Re: mke2fs stuck in D state while creating filesystem on md*

2007-09-20 Thread Wiesner Thomas
If/when you experience the hang again please get a trace of all  
processes with:

echo t > /proc/sysrq-trigger

Of particular interest is the mke2fs trace, as well as any md threads.
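For reference, one way to capture and filter the resulting dump (the
thread-name patterns below are only guesses at what is relevant):

  # dump all task states into the kernel log, then keep the traces
  # of the processes of interest
  echo t > /proc/sysrq-trigger
  dmesg | grep -A 20 -E 'mke2fs|md0_raid|loop[0-9]' > sysrq-traces.txt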



OK, I've played around a bit. I didn't get the long hangs I described in
my initial mail, only smaller ones. I used loopback devices again. When
everything is working correctly, creating the filesystem on the md device
takes about 0.3s (if the data is still cached) to about 4s (if there is
some disk activity).

I noticed the following odd behaviour: the hangs usually occur right after
the md device has been created. Subsequent re-formattings don't seem to
trigger it (or at least only very infrequently, I don't know).
But directly after creation it happens quite often. Most of the time it
takes only a few seconds longer than expected, but once it took 18s and
another time 38s to complete. All in all I needed to recreate and reformat
the md device approximately 20 to 30 times to get those 2 long hangs.

The long hangs don't seem to be that frequent.

The test system is unchanged (same as in my first mail), but I had to
recompile the kernel for SysRq support.
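For reference, a minimal sketch of the kind of loop-device setup involved
(file sizes, device names and the RAID level are assumptions, the original
mail doesn't spell them out):

  # create backing files and attach them to loop devices
  for i in 0 1 2; do
      dd if=/dev/zero of=/tmp/loopfile$i bs=1M count=512
      losetup /dev/loop$i /tmp/loopfile$i
  done

  # build an md array on the loop devices and time the format
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/loop0 /dev/loop1 /dev/loop2
  time mke2fs -q /dev/md0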


mke2fs took 18s while I took this trace:
Sep 20 11:30:04 hazard kernel: loop0 S F7440320 0  1520  2  
(L-TLB)
Sep 20 11:30:04 hazard kernel:f7573f7c 0046 edfbd9a0 f7440320  
c0180370 f8a4450c  f89f1240
Sep 20 11:30:04 hazard kernel:f74a9e00 edfbd9a0 edfbd9a0 c1d0b050  
c1d510b0 c1d511bc 02cd d1518e89
Sep 20 11:30:04 hazard kernel:0059 f7d51f08 f7d51e00 f7573fa4  
f7573fb0 f89f1c57  c1d510b0

Sep 20 11:30:04 hazard kernel: Call Trace:
Sep 20 11:30:04 hazard kernel:  [c0180370] bio_put+0x20/0x30
Sep 20 11:30:04 hazard kernel:  [f8a4450c] super_written+0x7c/0xd0  
[md_mod]
Sep 20 11:30:04 hazard kernel:  [f89f1240] do_lo_send_aops+0x0/0x220  
[loop]

Sep 20 11:30:04 hazard kernel:  [f89f1c57] loop_thread+0x147/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50

Sep 20 11:30:04 hazard kernel:  [c0117a77] __wake_up_common+0x37/0x60
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50

Sep 20 11:30:04 hazard kernel:  [f89f1b10] loop_thread+0x0/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d00a] kthread+0x6a/0x70
Sep 20 11:30:04 hazard kernel:  [c012cfa0] kthread+0x0/0x70
Sep 20 11:30:04 hazard kernel:  [c0104bcf] kernel_thread_helper+0x7/0x38
Sep 20 11:30:04 hazard kernel:  ===
Sep 20 11:30:04 hazard kernel: loop1 S F74404A0 0  1522  2  
(L-TLB)
Sep 20 11:30:04 hazard kernel:f7525f7c 0046 edfbdba0 f74404a0  
c0180370 f8a4450c  f89f1240
Sep 20 11:30:04 hazard kernel:f74a9e00 edfbdba0 edfbdba0 c1d0b050  
c1d49590 c1d4969c 02f1 d1517029
Sep 20 11:30:04 hazard kernel:0059 f7d51108 f7d51000 f7525fa4  
f7525fb0 f89f1c57  c1d49590

Sep 20 11:30:04 hazard kernel: Call Trace:
Sep 20 11:30:04 hazard kernel:  [c0180370] bio_put+0x20/0x30
Sep 20 11:30:04 hazard kernel:  [f8a4450c] super_written+0x7c/0xd0  
[md_mod]
Sep 20 11:30:04 hazard kernel:  [f89f1240] do_lo_send_aops+0x0/0x220  
[loop]

Sep 20 11:30:04 hazard kernel:  [f89f1c57] loop_thread+0x147/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50

Sep 20 11:30:04 hazard kernel:  [f89f1b10] loop_thread+0x0/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d00a] kthread+0x6a/0x70
Sep 20 11:30:04 hazard kernel:  [c012cfa0] kthread+0x0/0x70
Sep 20 11:30:04 hazard kernel:  [c0104bcf] kernel_thread_helper+0x7/0x38
Sep 20 11:30:04 hazard kernel:  ===
Sep 20 11:30:04 hazard kernel: loop2 S F74403E0 0  1524  2  
(L-TLB)
Sep 20 11:30:04 hazard kernel:f7577f7c 0046 edfbd9a0 f74403e0  
c0180370 f8a4450c  f89f1240
Sep 20 11:30:04 hazard kernel:f74a9e00 edfbd9a0 edfbd9a0 c1d0b050  
c1d49a90 c1d49b9c 0399 d1514f81
Sep 20 11:30:04 hazard kernel:0059 f7d51308 f7d51200 f7577fa4  
f7577fb0 f89f1c57  c1d49a90

Sep 20 11:30:04 hazard kernel: Call Trace:
Sep 20 11:30:04 hazard kernel:  [c0180370] bio_put+0x20/0x30
Sep 20 11:30:04 hazard kernel:  [f8a4450c] super_written+0x7c/0xd0  
[md_mod]
Sep 20 11:30:04 hazard kernel:  [f89f1240] do_lo_send_aops+0x0/0x220  
[loop]

Sep 20 11:30:04 hazard kernel:  [f89f1c57] loop_thread+0x147/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50

Sep 20 11:30:04 hazard kernel:  [c0117a77] __wake_up_common+0x37/0x60
Sep 20 11:30:04 hazard kernel:  [c012d550]  
autoremove_wake_function+0x0/0x50

Sep 20 11:30:04 hazard kernel:  [f89f1b10] loop_thread+0x0/0x170 [loop]
Sep 20 11:30:04 hazard kernel:  [c012d00a] kthread+0x6a/0x70
Sep 20 11:30:04 hazard kernel:  [c012cfa0] kthread+0x0/0x70
Sep 20 11:30:04 hazard kernel:  [c0104bcf] kernel_thread_helper+0x7/0x38

Re: MD devices renaming or re-ordering question

2007-09-20 Thread Bill Davidsen

Maurice Hilarius wrote:

Hi to all.

I wonder if somebody would care to help me to solve a problem?

I have some servers.
They are running CentOS5
This OS has a limitation where the maximum filesystem size is 8TB.

Each server currently has an AMCC/3ware 16-port SATA controller, for a
total of 16 ports / drives.
I am using 750GB drives.

I am exporting the drives as single disks, NOT as hardware RAID.
That is due to the filesystem and controller limitations, among other
reasons.

Each server currently has 16 disks attached to the one controller

I want to add a 2nd controller, and, for now, 4 more disks on it.

I want to have the boot disk as a plain disk, as presently configured as
sda1,2,3
  


I'm not clear on what you mean by a plain disk followed by a list of 
partitions. If that means putting all your initial data on a single disk 
without RAID protection, that's a far worse idea in my experience than 
splitting arrays across controllers.

The remaining 15 disks are configured as :
sdb1 through sde1 as md0 ( 4 devices/partitions)
sdf1 through sdp1 as md1 (10 devices/partitions)
I want to add a 2nd controller, and 4 more drives, to the md0 device.

But, I do not want md0 to be split across the 2 controllers this way.
I prefer to do the split on md1
  


Move the md0 drives to the 2nd controller, add more.

Other than starting from scratch, the best solution would be to add the
disks to md0, then to magically turn md0 into md1, and md1 into md0
  


Unless you want to practice doing critical config changes, why? Moving 
the drives won't affect their names, at least not unless you have done 
something like configuring by physical partition name instead of UUID. 
Doing that for more than a few drives is a learning experience waiting 
to happen. If that's the case, back up your mdadm.conf file and 
reconfigure using UUID, then start moving things around.
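A hedged sketch of that reconfiguration (the config path is the usual
CentOS default, not something taken from this thread):

  # keep a copy of the current config before touching anything
  cp /etc/mdadm.conf /etc/mdadm.conf.bak

  # append ARRAY lines keyed by UUID for every array mdadm can find,
  # so assembly no longer depends on names like /dev/sdb1
  mdadm --examine --scan >> /etc/mdadm.conf

After that the arrays should assemble by UUID no matter which controller
the drives end up on (edit out any duplicate ARRAY lines by hand).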

So, the question:
How does one make md1 into md0, and vice versa?
Without losing the data on these md's ?

Thanks in advance for any suggestions.


I would start by making sure the computer is doing the work, not the 
administrator: use UUID for assembly. Then move the drives for md0 and 
grow it.
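A rough sketch of that grow step, assuming md0 is a four-disk RAID-5 and
the new drives show up as sdq1 through sdt1 (level and device names are
guesses, not from the original post):

  # add the four new disks, then reshape the array to use all eight;
  # the reshape runs in the background and can take many hours
  mdadm --add /dev/md0 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1
  mdadm --grow /dev/md0 --raid-devices=8

  # watch the reshape progress
  cat /proc/mdstat

The filesystem on md0 would still have to be grown separately afterwards.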


Then consider the performance vs. reliability issues of having all 
drives on a single controller. Multiple controllers give you more points 
of failure unless you are mirroring across them, but better peak 
performance. Note, I'm only suggesting evaluating what you are doing; it 
may be fine, this just avoids "didn't think about that" events.


Well, you asked for suggestions...  ;-)

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Help: very slow software RAID 5.

2007-09-20 Thread Michal Soltys

Dean S. Messing wrote:


Also (as I asked) what is the downside?  From what I have read, random
access reads will take a hit.  Is this correct?

Thanks very much for your help!

Dean



Besides bonnie++ you should probably check iozone. It will allow you to test 
very specific settings quite thoroughly, although with current 
multi-gigabyte memory systems the test runs may take a bit of time.


http://www.iozone.org/

There's a nice introduction to the program there, along with some example 
graph results.
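For example, a hedged invocation along those lines (mount point and sizes
are placeholders; the maximum file size should be well beyond RAM so the
page cache doesn't dominate the numbers):

  # automatic mode, limited to write/rewrite, read/reread and random
  # read/write, on files large enough to defeat the page cache
  iozone -a -i 0 -i 1 -i 2 -g 8g -f /mnt/raid/iozone.tmp -b results.xls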





Re: Help: very slow software RAID 5.

2007-09-20 Thread Dean S. Messing

Bill Davidsen wrote:
: Dean S. Messing wrote:
   
 <snip>

: Do you want to tune it to work well now or work well in the final 
: configuration? There is no magic tuning which is best for every use, if 
: there was it would be locked in and you couldn't change it.

I want it to work well in the final config. I'm just now learning
about benchmarking with tools like bonnie++.  In my naivety, I thought
that `hdparm -t' told all, at least for reads.

:  Aside: I have found RAID quite frustrating.  With the original two
:  disks I was getting 120-130 MB/s in RAID 0.  I would think that for
:  the investment of a 3rd drive I ought to get the modicum of
:  redundancy I expect and keep the speed (at least on reads) w/o
:  sacrificing anything.  But it appears I actually lost something for my
:  investment.  I'm back to the speed of single drives with the modicum
:  of redundancy that RAID 5 gives.  Not a very good deal.
: RAID-5 and RAID-1 performance are a popular topic; reading the archives 
: may shed more light on that.

So I'm seeing.  I just finished wading through a long April 07
discussion on write-through vs. write-back for RAID 5.

:  After you get to LVM you can do read ahead 
: tuning on individual areas, which will allow you to do faster random 
: access on one part and faster sequential on another. *But* when you run 
: both types of access on the same physical device one or the other will 
: suffer, and with careful tuning both can be slow.

This is why simply bumping up the read-ahead parameter, as has been
suggested to me, seems suspect.  If this were the right fix, it seems
that it would be getting set automatically by the default installation
of mdadm.
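For what it's worth, the parameter is easy to inspect and change on the
fly, so it can be tested before being made permanent (device name is a
placeholder; values are in 512-byte sectors):

  # current read-ahead of the array, in 512-byte sectors
  blockdev --getra /dev/md0

  # raise it temporarily; the setting does not survive a reboot,
  # so it is easy to back out again
  blockdev --setra 16384 /dev/md0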


: When you get to the point where you know exactly what you are going to 
: do and how you are going to do it (layout) you can ask a better question 
: about tuning.

Well (in my extreme naivete) I had hoped that I could (just)
-- buy the extra SATA drive,
-- configure RAID 5 on all three drives, 
-- have it present itself as a single device with the speed
 of RAID 0 (on two drives), and the safety net of RAID 5, 
-- install Fedora 7 on the array,
-- use LVM to partition as I liked,
-- and forget about it.
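In mdadm/LVM terms that plan corresponds to something like the following
sketch (device and volume names are placeholders, not taken from this
thread):

  # RAID 5 across the three drives
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sda2 /dev/sdb2 /dev/sdc2

  # LVM on top of the array, carved up as needed
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate --size 20G --name root vg0
  lvcreate --size 100G --name home vg0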

Instead, this has turned into a many-hour exercise in futility.  This
is my research machine (for signal/image processing) and the machine I
live on.  It does many different things.  What I really need is more
disk speed (but I can't afford very high speed drives).  That's what
attracted me to RAID 0 --- which seems to have no downside EXCEPT
safety :-).

So I'm not sure I'll ever figure out the right tuning.  I'm at the
point of abandoning RAID entirely and just putting the three disks
together as a big LV and being done with it.  (I don't have quite the
moxy to define a RAID 0 array underneath it. :-)


: PS: adding another drive and going to RAID-10 with the 'far' configuration 
: will give you speed and reliability, at the cost of capacity. Aren't 
: choices fun?

I don't know what the 'far' configuration is, though I understand basically
what RAID-10 is.

Having that much wasted space is too costly, and besides, the machine
can't take more than three drives internally.  If I wished to add a 4th I'd
need to buy a SATA controller.  I had thought RAID 5 did exactly what
I wanted.  Unfortunately ...

Which suggests a question for you, David.  If I were to invest in a
true hardware RAID SATA controller (is there such a thing?) would
RAID 5 across the three drives behave just like RAID 0 on two drives +
1 disk redundancy?  In other words, just abandon Linux software RAID?
At this point I would be willing to spring for such a card if
it were not too expensive, and if I could find a slot to put it in.
(My system is slot-challenged).

Thanks for your remarks, David.  I wish I had the time to learn how to
do all this properly with multiple LVs, different read-aheads and
write-through/write-back settings on different logical devices, but my
head is swimming and my time is short.

Dean





Re: Help: very slow software RAID 5.

2007-09-20 Thread Dean S. Messing

Michal Soltys writes:
: Dean S. Messing wrote:
:  
:  Also (as I asked) what is the downside?  From what I have read, random
:  access reads will take a hit.  Is this correct?
:  
:  Thanks very much for your help!
:  
:  Dean
:  
: 
: Besides bonnie++ you should probably check iozone. It will allow you to test 
: very specific settings quite thoroughly, although with current 
: multi-gigabyte memory systems the test runs may take a bit of time.
: 
: http://www.iozone.org/
: 
: There's a nice introduction to the program there, along with some example 
: graph results.

Thanks very much, Michal. I'll have a look.

Dean


Re: Help: very slow software RAID 5.

2007-09-20 Thread Michael Tokarev
Dean S. Messing wrote:
[]
 []  That's what
 attracted me to RAID 0 --- which seems to have no downside EXCEPT
 safety :-).
 
 So I'm not sure I'll ever figure out the right tuning.  I'm at the
 point of abandoning RAID entirely and just putting the three disks
 together as a big LV and being done with it.  (I don't have quite the
 moxy to define a RAID 0 array underneath it. :-)

Putting three disks together as a big LV - that's exactly what the
linear md module does.  It's almost as unsafe as raid0, but with
linear read/write speed equal to that of a single drive...
Note also that the more drives you add to a raid0-like config,
the more chances of failure you'll have - because raid0 fails
when ANY drive fails.  Ditto - to a certain extent - for the linear
md module and for one big LV, which is basically the same thing.

By the way, before abandoning the R in RAID, I'd check whether
the resulting speed with raid5 (after at least read-ahead tuning)
is acceptable, and use that if yes.  If not, maybe raid10 over
the same 3 drives will give better results.
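A hedged sketch of what that might look like; Linux md raid10, unlike a
classic nested RAID 1+0, accepts an odd number of members (partition names
are placeholders):

  # raid10 in the "far 2" layout over three drives: two copies of every
  # block, kept far apart on disk, which keeps sequential reads fast
  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
      /dev/sda2 /dev/sdb2 /dev/sdc2

Usable capacity is roughly 1.5 drives' worth.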

/mjt


Re: Help: very slow software RAID 5.

2007-09-20 Thread Dean S. Messing

Michael Tokarev writes:
: Dean S. Messing wrote:
: []
:  []  That's what
:  attracted me to RAID 0 --- which seems to have no downside EXCEPT
:  safety :-).
:  
:  So I'm not sure I'll ever figure out the right tuning.  I'm at the
:  point of abandoning RAID entirely and just putting the three disks
:  together as a big LV and being done with it.  (I don't have quite the
:  moxy to define a RAID 0 array underneath it. :-)
: 
: Putting three disks together as a big LV - that's exactly what the
: linear md module does.
: It's almost as unsafe as raid0, but with
: linear read/write speed equal to that of a single drive...

I understand I only get the speed of a single drive, but I was not
aware of the safety factor.  I had intended to snapshot off
to a cheap USB drive each evening.  Will that not keep me safe to within a
day's worth of data change?  I only learned about snapshots yesterday.
I'm utterly new to the disk array/LVM game.

For that matter why not run a RAID-0 + LVM  across two of the three drives
and snapshot to the third?
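Purely to illustrate the snapshot idea (volume, mount point and sizes are
invented, and this assumes the data sits on LVM so a snapshot can be
taken), a nightly copy could look roughly like this:

  # short-lived snapshot so the copy comes from a consistent image
  lvcreate --snapshot --size 4G --name nightly_snap /dev/vg0/home
  mkdir -p /mnt/snap
  mount -o ro /dev/vg0/nightly_snap /mnt/snap

  # copy to the USB drive, then drop the snapshot
  rsync -a --delete /mnt/snap/ /media/usb-backup/home/
  umount /mnt/snap
  lvremove -f /dev/vg0/nightly_snap

Note the snapshot itself lives on the same array; it's the copy on the USB
drive that protects against a drive failure.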

: Note also that the more drives you add to a raid0-like config,
: the more chances of failure you'll have - because raid0 fails
: when ANY drive fails.  Ditto - to a certain extent - for the linear
: md module and for one big LV, which is basically the same thing.

I understand the probability increases for additional drives.

: By the way, before abandoning the R in RAID, I'd check whether
: the resulting speed with raid5 (after at least read-ahead tuning)
: is acceptable, and use that if yes.

My problem is not quite knowing what "acceptable" is.  I bought a Dell
Precision 490 with two relatively fast SATA II drives.  With RAID 0 I
attain speeds of nearly 140 MB/s (using 2 drives) for reads and writes,
and the system is very snappy for everything, from processing 4Kx2K
video to building a 'locate' database, to searching my very large mail
archives for technical info.

When I see the speed loss of software RAID 5 (writes are at 55 MB/s and
random reads are at 54 MB/s) for everything but sequential reads (and that
only if I increase read-ahead from 512 to 16384 to get read speeds of
about 110 MB/s), I lose heart, especially since I don't know the other
consequences of increasing read-ahead by so much.

: If not, maybe raid10 over
: the same 3 drives will give better results.

Does RAID10 work on three drives?  I thought one needed 4 drives,
with striping across a pair of mirrored pairs.

Dean


[PATCH 2.6.23-rc7 0/3] async_tx and md-accel fixes for 2.6.23

2007-09-20 Thread Dan Williams
Fix a couple of bugs and provide documentation for the async_tx api.

Neil, please 'ack' patch #3.

git://lost.foo-projects.org/~dwillia2/git/iop async-tx-fixes-for-linus

Dan Williams (3):
  async_tx: usage documentation and developer notes
  async_tx: fix dma_wait_for_async_tx
  raid5: fix ops_complete_biofill

Documentation/crypto/async-tx-api.txt |  217 +
crypto/async_tx/async_tx.c|   12 ++-
drivers/md/raid5.c|   90 +++---
3 files changed, 273 insertions(+), 46 deletions(-)

--
Dan


[PATCH 2.6.23-rc7 1/3] async_tx: usage documentation and developer notes

2007-09-20 Thread Dan Williams
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 Documentation/crypto/async-tx-api.txt |  217 +
 1 files changed, 217 insertions(+), 0 deletions(-)

diff --git a/Documentation/crypto/async-tx-api.txt 
b/Documentation/crypto/async-tx-api.txt
new file mode 100644
index 000..48d685a
--- /dev/null
+++ b/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,217 @@
+Asynchronous Transfers/Transforms API
+
+1 INTRODUCTION
+
+2 GENEALOGY
+
+3 USAGE
+3.1 General format of the API
+3.2 Supported operations
+3.2 Descriptor management
+3.3 When does the operation execute?
+3.4 When does the operation complete?
+3.5 Constraints
+3.6 Example
+
+4 DRIVER DEVELOPER NOTES
+4.1 Conformance points
+4.2 My application needs finer control of hardware channels
+
+5 SOURCE
+
+---
+
+1 INTRODUCTION
+
+The async_tx api provides methods for describing a chain of asynchronous
+bulk memory transfers/transforms with support for inter-transactional
+dependencies.  It is implemented as a dmaengine client that smooths over
+the details of different hardware offload engine implementations.  Code
+that is written to the api can optimize for asynchronous operation and
+the api will fit the chain of operations to the available offload
+resources.
+
+2 GENEALOGY
+
+The api was initially designed to offload the memory copy and
+xor-parity-calculations of the md-raid5 driver using the offload engines
+present in the Intel(R) Xscale series of I/O processors.  It also built
+on the 'dmaengine' layer developed for offloading memory copies in the
+network stack using Intel(R) I/OAT engines.  The following design
+features surfaced as a result:
+1/ implicit synchronous path: users of the API do not need to know if
+   the platform they are running on has offload capabilities.  The
+   operation will be offloaded when an engine is available and carried out
+   in software otherwise.
+2/ cross channel dependency chains: the API allows a chain of dependent
+   operations to be submitted, like xor-copy-xor in the raid5 case.  The
+   API automatically handles cases where the transition from one operation
+   to another implies a hardware channel switch.
+3/ dmaengine extensions to support multiple clients and operation types
+   beyond 'memcpy'
+
+3 USAGE
+
+3.1 General format of the API:
+struct dma_async_tx_descriptor *
+async_operation(op specific parameters,
+ enum async_tx_flags flags,
+ struct dma_async_tx_descriptor *dependency,
+ dma_async_tx_callback callback_routine,
+ void *callback_parameter);
+
+3.2 Supported operations:
+memcpy   - memory copy between a source and a destination buffer
+memset   - fill a destination buffer with a byte value
+xor - xor a series of source buffers and write the result to a
+  destination buffer
+xor_zero_sum - xor a series of source buffers and set a flag if the
+  result is zero.  The implementation attempts to prevent
+  writes to memory
+
+3.2 Descriptor management:
+The return value is non-NULL and points to a 'descriptor' when the operation
+has been queued to execute asynchronously.  Descriptors are recycled
+resources, under control of the offload engine driver, to be reused as
+operations complete.  When an application needs to submit a chain of
+operations it must guarantee that the descriptor is not automatically recycled
+before the dependency is submitted.  This requires that all descriptors be
+acknowledged by the application before the offload engine driver is allowed to
+recycle (or free) the descriptor.  A descriptor can be acked by:
+1/ setting the ASYNC_TX_ACK flag if no operations are to be submitted
+2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
+   descriptor of a new operation.
+3/ calling async_tx_ack() on the descriptor.
+
+3.3 When does the operation execute?:
+Operations do not immediately issue after return from the
+async_operation call.  Offload engine drivers batch operations to
+improve performance by reducing the number of mmio cycles needed to
+manage the channel.  Once a driver specific threshold is met the driver
+automatically issues pending operations.  An application can force this
+event by calling async_tx_issue_pending_all().  This operates on all
+channels since the application has no knowledge of channel to operation
+mapping.
+
+3.4 When does the operation complete?:
+There are two methods for an application to learn about the completion
+of an operation.
+1/ Call dma_wait_for_async_tx().  This call causes the cpu to spin while
+   it polls for the completion of the operation.  It handles dependency
+   chains and issuing pending operations.
+2/ Specify a completion callback.  The callback routine runs in tasklet
+   context if the offload engine driver supports interrupts, or it is
+   called in application context if the operation is carried out
+   synchronously in software.  The 

[PATCH 2.6.23-rc7 2/3] async_tx: fix dma_wait_for_async_tx

2007-09-20 Thread Dan Williams
Fix dma_wait_for_async_tx to not loop forever in the case where a
dependency chain is longer than two entries.  This condition will not
happen with current in-kernel drivers, but fix it for future drivers.

Found-by: Saeed Bishara [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 crypto/async_tx/async_tx.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 0350071..bc18cbb 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -80,6 +80,7 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 {
 	enum dma_status status;
 	struct dma_async_tx_descriptor *iter;
+	struct dma_async_tx_descriptor *parent;
 
 	if (!tx)
 		return DMA_SUCCESS;
@@ -87,8 +88,15 @@ dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx)
 	/* poll through the dependency chain, return when tx is complete */
 	do {
 		iter = tx;
-		while (iter->cookie == -EBUSY)
-			iter = iter->parent;
+
+		/* find the root of the unsubmitted dependency chain */
+		while (iter->cookie == -EBUSY) {
+			parent = iter->parent;
+			if (parent && parent->cookie == -EBUSY)
+				iter = iter->parent;
+			else
+				break;
+		}
 
 		status = dma_sync_wait(iter->chan, iter->cookie);
 	} while (status == DMA_IN_PROGRESS || (iter != tx));


[PATCH 2.6.23-rc7 3/3] raid5: fix ops_complete_biofill

2007-09-20 Thread Dan Williams
ops_complete_biofill tried to avoid calling handle_stripe since all the
state necessary to return read completions is available.  However the
process of determining whether more read requests are pending requires
locking the stripe (to block add_stripe_bio from updating dev-toead).
ops_complete_biofill can run in tasklet context, so rather than upgrading
all the stripe locks from spin_lock to spin_lock_bh this patch just moves
read completion handling back into handle_stripe.

Found-by: Yuri Tikhonov [EMAIL PROTECTED]
Signed-off-by: Dan Williams [EMAIL PROTECTED]
---

 drivers/md/raid5.c |   90 +++-
 1 files changed, 46 insertions(+), 44 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4d63773..38c8893 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -512,54 +512,12 @@ async_copy_data(int frombio, struct bio *bio, struct page *page,
 static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, more_to_read = 0;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
-	/* clear completed biofills */
-	for (i = sh->disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
-		/* check if this stripe has new incoming reads */
-		if (dev->toread)
-			more_to_read++;
-
-		/* acknowledge completion of a biofill operation */
-		/* and check if we need to reply to a read request
-		*/
-		if (test_bit(R5_Wantfill, &dev->flags) && !dev->toread) {
-			struct bio *rbi, *rbi2;
-			clear_bit(R5_Wantfill, &dev->flags);
-
-			/* The access to dev->read is outside of the
-			 * spin_lock_irq(&conf->device_lock), but is protected
-			 * by the STRIPE_OP_BIOFILL pending bit
-			 */
-			BUG_ON(!dev->read);
-			rbi = dev->read;
-			dev->read = NULL;
-			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
-		}
-	}
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-
-	return_io(return_bi);
-
-	if (more_to_read)
-		set_bit(STRIPE_HANDLE, &sh->state);
+	set_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
 }
 
@@ -2112,6 +2070,42 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 }
 
 
+/* handle_completed_read_requests - return completion for reads and allow
+ * new read operations to be submitted to the stripe.
+ */
+static void handle_completed_read_requests(raid5_conf_t *conf,
+			struct stripe_head *sh,
+			struct bio **return_bi)
+{
+	int i;
+
+	pr_debug("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	/* check if we need to reply to a read request */
+	for (i = sh->disks; i--; ) {
+		struct r5dev *dev = &sh->dev[i];
+
+		if (test_and_clear_bit(R5_Wantfill, &dev->flags)) {
+			struct bio *rbi, *rbi2;
+
+			rbi = dev->read;
+			dev->read = NULL;
+			while (rbi && rbi->bi_sector <
+				dev->sector + STRIPE_SECTORS) {
+				rbi2 = r5_next_bio(rbi, dev->sector);
+				spin_lock_irq(&conf->device_lock);
+				if (--rbi->bi_phys_segments == 0) {
+					rbi->bi_next = *return_bi;
+					*return_bi = rbi;
+				}
+				spin_unlock_irq(&conf->device_lock);
+				rbi = rbi2;
+			}
+		}
+	}
+}
+
 /* handle_completed_write_requests
  * any written block on an uptodate or failed drive can be returned.
  * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but
@@ -2633,6 +2627,14 @@ static void handle_stripe5(struct stripe_head *sh)
 	s.expanded 

Re: [PATCH 2.6.23-rc7 1/3] async_tx: usage documentation and developer notes

2007-09-20 Thread Randy Dunlap
On Thu, 20 Sep 2007 18:27:40 -0700 Dan Williams wrote:

 Signed-off-by: Dan Williams [EMAIL PROTECTED]
 ---

Hi Dan,

Looks pretty good and informative.  Thanks.

(nits below :)


  Documentation/crypto/async-tx-api.txt |  217 
 +
  1 files changed, 217 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/crypto/async-tx-api.txt 
 b/Documentation/crypto/async-tx-api.txt
 new file mode 100644
 index 000..48d685a
 --- /dev/null
 +++ b/Documentation/crypto/async-tx-api.txt
 @@ -0,0 +1,217 @@
 +  Asynchronous Transfers/Transforms API
 +
 +1 INTRODUCTION
 +
 +2 GENEALOGY
 +
 +3 USAGE
 +3.1 General format of the API
 +3.2 Supported operations
 +3.2 Descriptor management

duplicate 3.2

 +3.3 When does the operation execute?
 +3.4 When does the operation complete?
 +3.5 Constraints
 +3.6 Example
 +
 +4 DRIVER DEVELOPER NOTES
 +4.1 Conformance points
 +4.2 My application needs finer control of hardware channels
 +
 +5 SOURCE
 +
 +---
 +
 +1 INTRODUCTION
 +
 +The async_tx api provides methods for describing a chain of asynchronous
 +bulk memory transfers/transforms with support for inter-transactional
 +dependencies.  It is implemented as a dmaengine client that smooths over
 +the details of different hardware offload engine implementations.  Code
 +that is written to the api can optimize for asynchronous operation and
 +the api will fit the chain of operations to the available offload
 +resources.
 +

I would s/api/API/g .

 +2 GENEALOGY
 +
[snip]

 +
 +3 USAGE
 +
 +3.1 General format of the API:
 +struct dma_async_tx_descriptor *
 +async_operation(op specific parameters,
 +   enum async_tx_flags flags,
 +   struct dma_async_tx_descriptor *dependency,
 +   dma_async_tx_callback callback_routine,
 +   void *callback_parameter);
 +
 +3.2 Supported operations:
 +memcpy   - memory copy between a source and a destination buffer
 +memset   - fill a destination buffer with a byte value
 +xor   - xor a series of source buffers and write the result to a
 +destination buffer
 +xor_zero_sum - xor a series of source buffers and set a flag if the
 +result is zero.  The implementation attempts to prevent
 +writes to memory
 +
 +3.2 Descriptor management:

duplicate 3.2

 +The return value is non-NULL and points to a 'descriptor' when the operation
 +has been queued to execute asynchronously.  Descriptors are recycled
 +resources, under control of the offload engine driver, to be reused as
 +operations complete.  When an application needs to submit a chain of
 +operations it must guarantee that the descriptor is not automatically 
 recycled
 +before the dependency is submitted.  This requires that all descriptors be
 +acknowledged by the application before the offload engine driver is allowed 
 to
 +recycle (or free) the descriptor.  A descriptor can be acked by:

can be acked by any of:   (?)

 +1/ setting the ASYNC_TX_ACK flag if no operations are to be submitted
 +2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
 +   descriptor of a new operation.
 +3/ calling async_tx_ack() on the descriptor.
 +
 +3.3 When does the operation execute?:

Drop ':'

 +Operations do not immediately issue after return from the
 +async_operation call.  Offload engine drivers batch operations to
 +improve performance by reducing the number of mmio cycles needed to
 +manage the channel.  Once a driver specific threshold is met the driver

   driver-specific

 +automatically issues pending operations.  An application can force this
 +event by calling async_tx_issue_pending_all().  This operates on all
 +channels since the application has no knowledge of channel to operation
 +mapping.
 +
 +3.4 When does the operation complete?:

drop ':'

 +There are two methods for an application to learn about the completion
 +of an operation.
 +1/ Call dma_wait_for_async_tx().  This call causes the cpu to spin while

s/cpu/CPU/g

 +   it polls for the completion of the operation.  It handles dependency
 +   chains and issuing pending operations.
 +2/ Specify a completion callback.  The callback routine runs in tasklet
 +   context if the offload engine driver supports interrupts, or it is
 +   called in application context if the operation is carried out
 +   synchronously in software.  The callback can be set in the call to
 +   async_operation, or when the application needs to submit a chain of
 +   unknown length it can use the async_trigger_callback() routine to set a
 +   completion interrupt/callback at the end of the chain.
 +
 +3.5 Constraints:
 +1/ Calls to async_operation are not permitted in irq context.  Other

s/irq/IRQ/g

 +   contexts are permitted provided constraint #2 is not violated.
 +2/ Completion callback routines can not submit new operations.  This

   cannot

 +   results in recursion in the synchronous case and