Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-18 Thread Kent Overstreet
On Tue, Dec 18, 2012 at 11:16 AM, Kent Overstreet
<koverstr...@google.com> wrote:
> Or maybe just getting rid of the ringbuffer is that awesome. Gonna try
> and work on combining our optimizations so I can see what that looks
> like :)


Yes, yes it is. Combined our aio/dio patches and got 50% better
throughput than I'd seen before.


Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-18 Thread Kent Overstreet
On Sat, Dec 15, 2012 at 02:16:48PM +0100, Jens Axboe wrote:
> On 2012-12-15 11:36, Kent Overstreet wrote:
> >> Knock yourself out - I already took a quick look at it, and conversion
> >> should be pretty simple. It's the mtip32xx driver, it's in the kernel. I
> >> would suggest getting rid of the ->async_callback() (since it's always
> >> bio_endio()) since that'll make it cleaner.
> > 
> > Just pushed my conversion - it's untested, but it's pretty
> > straightforward.
> 
> You forgot a batch_complete_init(). With that, it works. Single device
> is ~1050K now, so still slower than jaio without batching (which was
> ~1220K).  But it's an improvement over kaio-dio, which was roughly ~930K
> IOPS.

Curious... if the device is delivering a reasonable number of
completions per interrupt, I would've expected that to help more (it
made a huge difference for me). Now I'm really curious where the
difference is coming from.

It's possible something I did introduced a performance regression you're
uncovering (i.e. I reordered stuff in struct kiocb to shrink it, not
sure if you were testing with those changes). It sounds like the
mtip32xx driver is better/more efficient than anything I can test with,
so if so it's entirely possible you're seeing it due to less noise there.

Or maybe just getting rid of the ringbuffer is that awesome. Gonna try
and work on combining our optimizations so I can see what that looks
like :)




Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-15 Thread Jens Axboe
On 2012-12-15 11:36, Kent Overstreet wrote:
>> Knock yourself out - I already took a quick look at it, and conversion
>> should be pretty simple. It's the mtip32xx driver, it's in the kernel. I
>> would suggest getting rid of the ->async_callback() (since it's always
>> bio_endio()) since that'll make it cleaner.
> 
> Just pushed my conversion - it's untested, but it's pretty
> straightforward.

You forgot a batch_complete_init(). With that, it works. Single device
is ~1050K now, so still slower than jaio without batching (which was
~1220K).  But it's an improvement over kaio-dio, which was roughly ~930K
IOPS.

-- 
Jens Axboe



Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-15 Thread Jens Axboe
On 2012-12-15 11:36, Kent Overstreet wrote:
> On Sat, Dec 15, 2012 at 10:46:32AM +0100, Jens Axboe wrote:
>> On 2012-12-15 10:25, Kent Overstreet wrote:
>>> Cool, thanks for the numbers!
>>>
>>> I suspect the difference is due to contention on the ringbuffer,
>>> completion side. You didn't enable my batched completion stuff, did you?
>>
>> No, haven't tried the batching yet.
>>
>>> I suspect the numbers would look quite a bit different with that,
>>> based on my own profiling. If the driver for the device you're testing
>>> on is open source, I'd be happy to do the conversion (it's a 5 minute
>>> job).
>>
>> Knock yourself out - I already took a quick look at it, and conversion
>> should be pretty simple. It's the mtip32xx driver, it's in the kernel. I
>> would suggest getting rid of the ->async_callback() (since it's always
>> bio_endio()) since that'll make it cleaner.
> 
> Just pushed my conversion - it's untested, but it's pretty
> straightforward.

Let me give it a whirl. It'll likely improve the situation, since we are
CPU limited at this point. I will definitely add the batching on my side
too, it's a clear win for the (usual) case of reaping multiple
completion events per IRQ.

>>> since I looked at your patch but you're getting rid of the aio
>>> ringbuffer and using a linked list instead, right? My batched completion
>>> stuff should still benefit that case.
>>
>> Yes, I make the ring interface optional. Basically you tell aio to use
>> the ring or not at io_queue_init() time. If you don't care about the
>> ring, we can use a lockless list for the completions.
> 
> Yeah, it is a good idea - I'm certainly not attached to the current
> ringbuffer implementation (though a ringbuffer isn't a terrible idea if
> we had one that was implemented correctly).

I agree, ringbuffer isn't necessarily a bad interface. But the fact is
that:

1) Current ringbuffer is broken
2) And nobody uses it

So as an interface, I think it's dead.
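
For concreteness, the interface being pronounced dead here is the
completion ring the kernel maps into userspace, which in principle lets
a process reap events without syscalls. Its layout is roughly the
following, per fs/aio.c of this era (reproduced from memory - verify
against the tree before relying on any detail):

	#define AIO_RING_MAGIC	0xa10a10a1

	struct aio_ring {
		unsigned	id;	/* kernel internal index number */
		unsigned	nr;	/* number of io_events */
		unsigned	head;	/* consumer position */
		unsigned	tail;	/* producer position */
		unsigned	magic;
		unsigned	compat_features;
		unsigned	incompat_features;
		unsigned	header_length;	/* size of aio_ring */
		struct io_event	io_events[0];	/* events follow the header */
	};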

>> You completely remove the cancel, I just make it optional for the gadget
>> case. I'm fine with either of them, though I did not look at your usb
>> change in detail. If it's clean, I suspect we should just kill cancel
>> completion as you did.
> 
> We (Zach and I) actually made it optional too, more or less - I haven't
> looked at how you did it yet, but in my tree the linked list is there,
> but the kiocb isn't added to the kioctx's list until something sets a
> cancel function.

Sounds like the same approach I took - list is still there, and whether
the node is empty or not signifies whether we need to lock and remove
the entry on completion.
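
In code, that completion-side check is roughly the sketch below (field
names assumed for illustration, not taken verbatim from either
patchset):

	#include <linux/list.h>
	#include <linux/spinlock.h>

	/* Only kiocbs that ever set a cancel function were linked into
	 * ctx->active_reqs, so the common path never takes ctx_lock. */
	static void kiocb_unlink_cancel(struct kioctx *ctx, struct kiocb *req)
	{
		unsigned long flags;

		if (!list_empty_careful(&req->ki_list)) {
			spin_lock_irqsave(&ctx->ctx_lock, flags);
			list_del_init(&req->ki_list);
			spin_unlock_irqrestore(&ctx->ctx_lock, flags);
		}
	}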

>>> Though - hrm, I'd have expected getting rid of the cancellation linked
>>> list to make a bigger difference and both our patchsets do that.
>>
>> The machine in question runs out of oomph, which is hampering the
>> results. I should have it beefed up next week. It's running E5-2630
>> right now, will move to E5-2690. I think that should make the results
>> clearer.
> 
> Well, the fact that it's cpu bound just means the throughput numbers for
> our various patches are different... and what I'm really interested in
> is the profiles, I can't think of any reason cpu speed would affect that
> much.

Sure, that's a given, since they have the same horsepower available and
the setup is identical (same process pinning, irq pinning, etc).

-- 
Jens Axboe



Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-15 Thread Kent Overstreet
On Sat, Dec 15, 2012 at 10:46:32AM +0100, Jens Axboe wrote:
> On 2012-12-15 10:25, Kent Overstreet wrote:
> > Cool, thanks for the numbers!
> > 
> > I suspect the difference is due to contention on the ringbuffer,
> > completion side. You didn't enable my batched completion stuff, did you?
> 
> No, haven't tried the batching yet.
> 
> > I suspect the numbers would look quite a bit different with that,
> > based on my own profiling. If the driver for the device you're testing
> > on is open source, I'd be happy to do the conversion (it's a 5 minute
> > job).
> 
> Knock yourself out - I already took a quick look at it, and conversion
> should be pretty simple. It's the mtip32xx driver, it's in the kernel. I
> would suggest getting rid of the ->async_callback() (since it's always
> bio_endio()) since that'll make it cleaner.

Just pushed my conversion - it's untested, but it's pretty
straightforward.
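
For anyone wanting to do the same conversion elsewhere, the driver-side
shape is roughly this - the batch_complete_*()/bio_endio_batch() names
are from the batched-completion patches as best I recall them, the other
helpers are invented for illustration; a sketch, not the actual
mtip32xx diff:

	/* Collect every bio finished in one interrupt, then complete
	 * them all at once, amortizing the completion-side work. */
	static void driver_handle_irq(struct driver_port *port)
	{
		struct batch_complete batch;
		struct bio *bio;

		batch_complete_init(&batch);

		/* driver_next_completed_bio() is an assumed helper that
		 * walks the hardware completion queue. */
		while ((bio = driver_next_completed_bio(port)))
			bio_endio_batch(bio, 0, &batch);

		batch_complete(&batch);	/* complete kiocbs in one batch */
	}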

> 
> > Also, I don't think our approaches really conflict - it's been awhile
> 
> Completely agree. I split my patches up a bit yesterday, and then I took
> a look at your series. There's a bit of overlap between the two, but
> really most of it would be useful together. You can see the (bit more)
> split series here:
> 
> http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-dio

Cool, taking a look at it now.

> 
> > since I looked at your patch but you're getting rid of the aio
> > ringbuffer and using a linked list instead, right? My batched completion
> > stuff should still benefit that case.
> 
> Yes, I make the ring interface optional. Basically you tell aio to use
> the ring or not at io_queue_init() time. If you don't care about the
> ring, we can use a lockless list for the completions.

Yeah, it is a good idea - I'm certainly not attached to the current
ringbuffer implementation (though a ringbuffer isn't a terrible idea if
we had one that was implemented correctly).
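
For anyone benchmarking along at home, the userspace side under
discussion is plain libaio; a minimal round trip looks like this (stock
API only - the "ring or not" knob Jens describes would be a new option
at io_queue_init() time, and since its exact form isn't shown in this
thread, none is used here):

	#define _GNU_SOURCE
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		io_context_t ctx;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		void *buf;
		int fd;

		memset(&ctx, 0, sizeof(ctx));
		if (io_queue_init(128, &ctx))	/* sets up the kioctx + ring */
			return 1;

		fd = open("testfile", O_RDONLY | O_DIRECT);
		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;

		io_prep_pread(&cb, fd, buf, 4096, 0);	/* one 4k read at offset 0 */
		if (io_submit(ctx, 1, cbs) != 1)
			return 1;
		if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)	/* reap it */
			return 1;

		io_queue_release(ctx);
		return 0;
	}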

> You completely remove the cancel, I just make it optional for the gadget
> case. I'm fine with either of them, though I did not look at your usb
> change in detail. If it's clean, I suspect we should just kill cancel
> completion as you did.

We (Zach and I) actually made it optional too, more or less - I haven't
looked at how you did it yet, but in my tree the linked list is there,
but the kiocb isn't added to the kioctx's list until something sets a
cancel function.
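
Roughly, the lazy hook looks like this (sketched from memory, field
names illustrative): until a driver calls it, the kiocb never touches
the kioctx's list or lock.

	void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
	{
		struct kioctx *ctx = req->ki_ctx;
		unsigned long flags;

		spin_lock_irqsave(&ctx->ctx_lock, flags);

		/* first cancel fn for this kiocb: join the cancel list */
		if (!req->ki_list.next)
			list_add(&req->ki_list, &ctx->active_reqs);

		req->ki_cancel = cancel;

		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
	}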

> > Though - hrm, I'd have expected getting rid of the cancellation linked
> > list to make a bigger difference and both our patchsets do that.
> 
> The machine in question runs out of oomph, which is hampering the
> results. I should have it beefed up next week. It's running E5-2630
> right now, will move to E5-2690. I think that should make the results
> clearer.

Well, the fact that it's cpu bound just means the throughput numbers for
our various patches are different... and what I'm really interested in
is the profiles, I can't think of any reason cpu speed would affect that
much.

> > What device are you testing on, and what's your fio script? I may just
> > have to buy some hardware so I can test this myself.
> 
> Pretty basic script, it's attached. Probably could eke more out of the
> system, but it's been fine for just basic apples-to-apples comparison.
> I'm using 6x p320h for this test case.

Thanks - I'll have to see if I can get something set up that roughly
matches your setup.


Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-15 Thread Jens Axboe
On 2012-12-15 10:25, Kent Overstreet wrote:
> On Fri, Dec 14, 2012 at 08:35:53AM +0100, Jens Axboe wrote:
>> On 2012-12-14 03:26, Jack Wang wrote:
>>> 2012/12/14 Jens Axboe <jax...@fusionio.com>:
 On Mon, Dec 03 2012, Kent Overstreet wrote:
> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
>
> Changes since the last posting should all be noted in the individual
> patch descriptions.
>
>  * Zach pointed out the aio_read_evt() patch was calling functions that
>could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
>  * Ben pointed out some synchronize_rcu() usage was problematic,
>converted it to call_rcu()
>  * The flush_dcache_page() patch is new
>  * Changed the "use cancellation list lazily" patch so as to remove
>ki_flags from struct kiocb.

 Kent, I ran a few tests, and the below patches still don't seem as fast
 as the approach I took. To keep it fair, I used your aio branch and
 applied my dio speedups too. As a sanity check, I ran with your branch
 alone as well. The quick results below - kaio is kent-aio, just your
 branch. kaio-dio is with the direct IO speedups too. jaio is my branch,
 which already has the dio changes too.

 Devices  Branch    IOPS
 1        kaio      ~915K
 1        kaio-dio  ~930K
 1        jaio      ~1220K
 6        kaio      ~3050K
 6        kaio-dio  ~3080K
 6        jaio      ~3500K

 The box runs out of CPU driving power, which is why it doesn't scale
 linearly, otherwise I know that jaio at least does. It's basically
 completion limited for the 6 device test at the moment.

 I'll run some profiling tomorrow morning and get you some better
 results. Just thought I'd share these at least.

 --
 Jens Axboe

>>>
>>> A really good performance, woo.
>>>
>>> I think the device tested is a really fast PCIe SSD built by fusionio
>>> with fusionio's in-house block driver?
>>
>> It is pci-e flash storage, but it is not fusion-io.
>>
>>> any comparison numbers with current mainline?
>>
>> Sure, I should have included that. Here's the table again, this time
>> with mainline as well.
>>
>> Devices  Branch    IOPS
>> 1        mainline  ~870K
>> 1        kaio      ~915K
>> 1        kaio-dio  ~930K
>> 1        jaio      ~1220K
>> 6        kaio      ~3050K
>> 6        kaio-dio  ~3080K
>> 6        jaio      ~3500K
>> 6        mainline  ~2850K
> 
> Cool, thanks for the numbers!
> 
> I suspect the difference is due to contention on the ringbuffer,
> completion side. You didn't enable my batched completion stuff, did you?

No, haven't tried the batching yet.

> I suspect the numbers would look quite a bit different with that,
> based on my own profiling. If the driver for the device you're testing
> on is open source, I'd be happy to do the conversion (it's a 5 minute
> job).

Knock yourself out - I already took a quick look at it, and conversion
should be pretty simple. It's the mtip32xx driver, it's in the kernel. I
would suggest getting rid of the ->async_callback() (since it's always
bio_endio()) since that'll make it cleaner.

> Also, I don't think our approaches really conflict - it's been awhile

Completely agree. I split my patches up a bit yesterday, and then I took
a look at your series. There's a bit of overlap between the two, but
really most of it would be useful together. You can see the (bit more)
split series here:

http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-dio

> since I looked at your patch but you're getting rid of the aio
> ringbuffer and using a linked list instead, right? My batched completion
> stuff should still benefit that case.

Yes, I make the ring interface optional. Basically you tell aio to use
the ring or not at io_queue_init() time. If you don't care about the
ring, we can use a lockless list for the completions.

You completely remove the cancel, I just make it optional for the gadget
case. I'm fine with either of them, though I did not look at your usb
change in detail. If it's clean, I suspect we should just kill cancel
completion as you did.

> Though - hrm, I'd have expected getting rid of the cancellation linked
> list to make a bigger difference and both our patchsets do that.

The machine in question runs out of oomph, which is hampering the
results. I should have it beefed up next week. It's running E5-2630
right now, will move to E5-2690. I think that should make the results
clearer.

> What device are you testing on, and what's your fio script? I may just
> have to buy some hardware so I can test this myself.

Pretty basic script, it's attached. Probably could eke more out of the
system, but it's been fine for just basic apples-to-apples comparison.
I'm using 6x p320h for this test case.

-- 
Jens Axboe

[global]
bs=4k
direct=1
ioengine=libaio
iodepth=42
numjobs=5
rwmixread=100
rw=randrw
iodepth_batch=8
iodepth_batch_submit=4
iodepth_batch_complete=4
random_generator=lfsr
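
The attachment shows only the [global] section, so the per-device job
definitions (or equivalent --filename flags on the command line) have
to be assumed. A hypothetical section for a single device - mtip32xx
disks typically show up as /dev/rssd<x>, but the path here is an
assumption, not from the thread:

	[dev0]
	filename=/dev/rssda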

Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-15 Thread Kent Overstreet
On Fri, Dec 14, 2012 at 08:35:53AM +0100, Jens Axboe wrote:
> On 2012-12-14 03:26, Jack Wang wrote:
> > 2012/12/14 Jens Axboe :
> >> On Mon, Dec 03 2012, Kent Overstreet wrote:
> >>> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
> >>>
> >>> Changes since the last posting should all be noted in the individual
> >>> patch descriptions.
> >>>
> >>>  * Zach pointed out the aio_read_evt() patch was calling functions that
> >>>could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
> >>>  * Ben pointed out some synchronize_rcu() usage was problematic,
> >>>converted it to call_rcu()
> >>>  * The flush_dcache_page() patch is new
> >>>  * Changed the "use cancellation list lazily" patch so as to remove
> >>>ki_flags from struct kiocb.
> >>
> >> Kent, I ran a few tests, and the below patches still don't seem as fast
> >> as the approach I took. To keep it fair, I used your aio branch and
> >> applied my dio speedups too. As a sanity check, I ran with your branch
> >> alone as well. The quick results below - kaio is kent-aio, just your
> >> branch. kaio-dio is with the direct IO speedups too. jaio is my branch,
> >> which already has the dio changes too.
> >>
> >> Devices  Branch    IOPS
> >> 1        kaio      ~915K
> >> 1        kaio-dio  ~930K
> >> 1        jaio      ~1220K
> >> 6        kaio      ~3050K
> >> 6        kaio-dio  ~3080K
> >> 6        jaio      ~3500K
> >>
> >> The box runs out of CPU driving power, which is why it doesn't scale
> >> linearly, otherwise I know that jaio at least does. It's basically
> >> completion limited for the 6 device test at the moment.
> >>
> >> I'll run some profiling tomorrow morning and get you some better
> >> results. Just thought I'd share these at least.
> >>
> >> --
> >> Jens Axboe
> >>
> > 
> > A really good performance, woo.
> > 
> > I think the device tested is a really fast PCIe SSD built by fusionio
> > with fusionio's in-house block driver?
> 
> It is pci-e flash storage, but it is not fusion-io.
> 
> > any comparison numbers with current mainline?
> 
> Sure, I should have included that. Here's the table again, this time
> with mainline as well.
> 
> Devices  Branch    IOPS
> 1        mainline  ~870K
> 1        kaio      ~915K
> 1        kaio-dio  ~930K
> 1        jaio      ~1220K
> 6        kaio      ~3050K
> 6        kaio-dio  ~3080K
> 6        jaio      ~3500K
> 6        mainline  ~2850K

Cool, thanks for the numbers!

I suspect the difference is due to contention on the ringbuffer,
completion side. You didn't enable my batched completion stuff, did you?

I suspect the numbers would look quite a bit different with that,
based on my own profiling. If the driver for the device you're testing
on is open source, I'd be happy to do the conversion (it's a 5 minute
job).

Also, I don't think our approaches really conflict - it's been awhile
since I looked at your patch but you're getting rid of the aio
ringbuffer and using a linked list instead, right? My batched completion
stuff should still benefit that case.

Though - hrm, I'd have expected getting rid of the cancellation linked
list to make a bigger difference and both our patchsets do that.

What device are you testing on, and what's your fio script? I may just
have to buy some hardware so I can test this myself.




Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-13 Thread Jens Axboe
On 2012-12-14 03:26, Jack Wang wrote:
> 2012/12/14 Jens Axboe <jax...@fusionio.com>:
>> On Mon, Dec 03 2012, Kent Overstreet wrote:
>>> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
>>>
>>> Changes since the last posting should all be noted in the individual
>>> patch descriptions.
>>>
>>>  * Zach pointed out the aio_read_evt() patch was calling functions that
>>>could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
>>>  * Ben pointed out some synchronize_rcu() usage was problematic,
>>>converted it to call_rcu()
>>>  * The flush_dcache_page() patch is new
>>>  * Changed the "use cancellation list lazily" patch so as to remove
>>>ki_flags from struct kiocb.
>>
>> Kent, I ran a few tests, and the below patches still don't seem as fast
>> as the approach I took. To keep it fair, I used your aio branch and
>> applied by dio speedups too. As a sanity check, I ran with your branch
>> alone as well. The quick results below - kaio is kent-aio, just your
>> branch. kaio-dio is with the direct IO speedups too. jaio is my branch,
>> which already has the dio changes too.
>>
>> Devices  Branch    IOPS
>> 1        kaio      ~915K
>> 1        kaio-dio  ~930K
>> 1        jaio      ~1220K
>> 6        kaio      ~3050K
>> 6        kaio-dio  ~3080K
>> 6        jaio      ~3500K
>>
>> The box runs out of CPU driving power, which is why it doesn't scale
>> linearly, otherwise I know that jaio at least does. It's basically
>> completion limited for the 6 device test at the moment.
>>
>> I'll run some profiling tomorrow morning and get you some better
>> results. Just thought I'd share these at least.
>>
>> --
>> Jens Axboe
>>
> 
> A really good performance, woo.
> 
> I think the device tested is a really fast PCIe SSD built by fusionio
> with fusionio's in-house block driver?

It is pci-e flash storage, but it is not fusion-io.

> any comparison numbers with current mainline?

Sure, I should have included that. Here's the table again, this time
with mainline as well.

Devices  Branch    IOPS
1        mainline  ~870K
1        kaio      ~915K
1        kaio-dio  ~930K
1        jaio      ~1220K
6        kaio      ~3050K
6        kaio-dio  ~3080K
6        jaio      ~3500K
6        mainline  ~2850K


-- 
Jens Axboe



Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-13 Thread Jack Wang
2012/12/14 Jens Axboe <jax...@fusionio.com>:
> On Mon, Dec 03 2012, Kent Overstreet wrote:
>> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
>>
>> Changes since the last posting should all be noted in the individual
>> patch descriptions.
>>
>>  * Zach pointed out the aio_read_evt() patch was calling functions that
>>could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
>>  * Ben pointed out some synchronize_rcu() usage was problematic,
>>converted it to call_rcu()
>>  * The flush_dcache_page() patch is new
>>  * Changed the "use cancellation list lazily" patch so as to remove
>>ki_flags from struct kiocb.
>
> Kent, I ran a few tests, and the below patches still don't seem as fast
> as the approach I took. To keep it fair, I used your aio branch and
> applied my dio speedups too. As a sanity check, I ran with your branch
> alone as well. The quick results below - kaio is kent-aio, just your
> branch. kaio-dio is with the direct IO speedups too. jaio is my branch,
> which already has the dio changes too.
>
> Devices  Branch    IOPS
> 1        kaio      ~915K
> 1        kaio-dio  ~930K
> 1        jaio      ~1220K
> 6        kaio      ~3050K
> 6        kaio-dio  ~3080K
> 6        jaio      ~3500K
>
> The box runs out of CPU driving power, which is why it doesn't scale
> linearly, otherwise I know that jaio at least does. It's basically
> completion limited for the 6 device test at the moment.
>
> I'll run some profiling tomorrow morning and get you some better
> results. Just thought I'd share these at least.
>
> --
> Jens Axboe
>

A really good performance, woo.

I think the device tested is a really fast PCIe SSD built by fusionio
with fusionio's in-house block driver?

any comparison numbers with current mainline?

Jack


Re: [PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-13 Thread Jens Axboe
On Mon, Dec 03 2012, Kent Overstreet wrote:
> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
> 
> Changes since the last posting should all be noted in the individual
> patch descriptions.
> 
>  * Zach pointed out the aio_read_evt() patch was calling functions that
>could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
>  * Ben pointed out some synchronize_rcu() usage was problematic,
>converted it to call_rcu()
>  * The flush_dcache_page() patch is new
>  * Changed the "use cancellation list lazily" patch so as to remove
>ki_flags from struct kiocb.

Kent, I ran a few tests, and the below patches still don't seem as fast
as the approach I took. To keep it fair, I used your aio branch and
applied my dio speedups too. As a sanity check, I ran with your branch
alone as well. The quick results below - kaio is kent-aio, just your
branch. kaio-dio is with the direct IO speedups too. jaio is my branch,
which already has the dio changes too.

Devices  Branch    IOPS
1        kaio      ~915K
1        kaio-dio  ~930K
1        jaio      ~1220K
6        kaio      ~3050K
6        kaio-dio  ~3080K
6        jaio      ~3500K

The box runs out of CPU driving power, which is why it doesn't scale
linearly, otherwise I know that jaio at least does. It's basically
completion limited for the 6 device test at the moment.

I'll run some profiling tomorrow morning and get you some better
results. Just thought I'd share these at least.

-- 
Jens Axboe





[PATCH 00/26] AIO performance improvements/cleanups, v2

2012-12-03 Thread Kent Overstreet
Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169

Changes since the last posting should all be noted in the individual
patch descriptions.

 * Zach pointed out the aio_read_evt() patch was calling functions that
   could sleep in TASK_INTERRUPTIBLE state, that patch is rewritten.
 * Ben pointed out some synchronize_rcu() usage was problematic,
   converted it to call_rcu() (see the sketch after this list)
 * The flush_dcache_page() patch is new
 * Changed the "use cancellation list lazily" patch so as to remove
   ki_flags from struct kiocb.
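
The RCU item deserves a quick illustration; this is the general shape
of a synchronize_rcu() -> call_rcu() conversion, with invented struct
and function names rather than the actual aio diff:

	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		struct rcu_head	rcu;
		/* ... payload, read under rcu_read_lock() ... */
	};

	static void foo_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	static void foo_release(struct foo *f)
	{
		/*
		 * Before: synchronize_rcu(); kfree(f);
		 * That blocks the caller for a full grace period and
		 * can't be done in atomic context. call_rcu() defers
		 * the free until after the grace period and returns
		 * immediately.
		 */
		call_rcu(&f->rcu, foo_free_rcu);
	}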

Kent Overstreet (21):
  aio: Kill return value of aio_complete()
  aio: kiocb_cancel()
  aio: Move private stuff out of aio.h
  aio: dprintk() -> pr_debug()
  aio: do fget() after aio_get_req()
  aio: Make aio_put_req() lockless
  aio: Refcounting cleanup
  aio: Convert read_events() to hrtimers
  aio: Make aio_read_evt() more efficient
  aio: Use flush_dcache_page()
  aio: Use cancellation list lazily
  aio: Change reqs_active to include unreaped completions
  aio: Kill batch allocation
  aio: Kill struct aio_ring_info
  aio: Give shared kioctx fields their own cachelines
  aio: reqs_active -> reqs_available
  aio: percpu reqs_available
  Generic dynamic per cpu refcounting
  aio: Percpu ioctx refcount
  aio: use xchg() instead of completion_lock
  aio: Don't include aio.h in sched.h

Zach Brown (5):
  mm: remove old aio use_mm() comment
  aio: remove dead code from aio.h
  gadget: remove only user of aio retry
  aio: remove retry-based AIO
  char: add aio_{read,write} to /dev/{null,zero}
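
(The "Generic dynamic per cpu refcounting" and "aio: Percpu ioctx
refcount" patches above boil down to the pattern below. API names per
the percpu-refcount patch in this series, reproduced from memory, so
treat the signatures as approximate.)

	#include <linux/percpu-refcount.h>
	#include <linux/slab.h>

	struct my_ctx {
		struct percpu_ref	users;
	};

	/* Called once the ref has been killed and every get is put. */
	static void my_ctx_free(struct percpu_ref *ref)
	{
		kfree(container_of(ref, struct my_ctx, users));
	}

	static int my_ctx_create(struct my_ctx *ctx)
	{
		return percpu_ref_init(&ctx->users, my_ctx_free);
	}

	/* Hot path: a per-cpu counter bump, with none of the shared
	 * cacheline bouncing that atomic_inc()/atomic_dec() cause. */
	static void my_ctx_use(struct my_ctx *ctx)
	{
		percpu_ref_get(&ctx->users);
		/* ... do work ... */
		percpu_ref_put(&ctx->users);
	}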

 arch/s390/hypfs/inode.c  |1 +
 block/scsi_ioctl.c   |1 +
 drivers/char/mem.c   |   36 +
 drivers/infiniband/hw/ipath/ipath_file_ops.c |1 +
 drivers/infiniband/hw/qib/qib_file_ops.c |2 +-
 drivers/staging/android/logger.c |1 +
 drivers/usb/gadget/inode.c   |   42 +-
 fs/9p/vfs_addr.c |1 +
 fs/afs/write.c   |1 +
 fs/aio.c | 1432 ++
 fs/block_dev.c   |1 +
 fs/btrfs/file.c  |1 +
 fs/btrfs/inode.c |1 +
 fs/ceph/file.c   |1 +
 fs/compat.c  |1 +
 fs/direct-io.c   |1 +
 fs/ecryptfs/file.c   |1 +
 fs/ext2/inode.c  |1 +
 fs/ext3/inode.c  |1 +
 fs/ext4/file.c   |1 +
 fs/ext4/indirect.c   |1 +
 fs/ext4/inode.c  |1 +
 fs/ext4/page-io.c|1 +
 fs/fat/inode.c   |1 +
 fs/fuse/dev.c|1 +
 fs/fuse/file.c   |1 +
 fs/gfs2/aops.c   |1 +
 fs/gfs2/file.c   |1 +
 fs/hfs/inode.c   |1 +
 fs/hfsplus/inode.c   |1 +
 fs/jfs/inode.c   |1 +
 fs/nilfs2/inode.c|2 +-
 fs/ntfs/file.c   |1 +
 fs/ntfs/inode.c  |1 +
 fs/ocfs2/aops.h  |2 +
 fs/ocfs2/dlmglue.c   |2 +-
 fs/ocfs2/inode.h |2 +
 fs/pipe.c|1 +
 fs/read_write.c  |   35 +-
 fs/reiserfs/inode.c  |1 +
 fs/ubifs/file.c  |1 +
 fs/udf/inode.c   |1 +
 fs/xfs/xfs_aops.c|1 +
 fs/xfs/xfs_file.c|1 +
 include/linux/aio.h  |  135 +--
 include/linux/cgroup.h   |1 +
 include/linux/errno.h|1 -
 include/linux/percpu-refcount.h  |   29 +
 include/linux/sched.h|2 -
 kernel/fork.c|1 +
 kernel/printk.c  |1 +
 kernel/ptrace.c  |1 +
 lib/Makefile |2 +-
 lib/percpu-refcount.c|  164 +++
 mm/mmu_context.c |3 -
 mm/page_io.c |1 +
 mm/shmem.c   |1 +
 mm/swap.c|1 +
 security/keys/internal.h |2 +
 security/keys/keyctl.c   |1 +
 sound/core/pcm_native.c  |2 +-
 61 files changed, 868 insertions(+), 1070 deletions(-)
 
