Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-30 Thread Jason Wang



On 2016年05月30日 23:37, Michael S. Tsirkin wrote:
> On Mon, May 30, 2016 at 05:59:33PM +0800, Jason Wang wrote:
> >
> > On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
> > > This is in response to the proposal by Jason to make tun
> > > rx packet queue lockless using a circular buffer.
> > > My testing seems to show that at least for the common use case
> > > in networking, which isn't lockless, a circular buffer
> > > with indices does not perform that well, because
> > > each index access causes a cache line to bounce between
> > > CPUs, and index access causes stalls due to the dependency.
> >
> > I changed tun to use skb array; it looks about 5% faster
> > than skb ring.
>
> OK, and skb ring is 9% faster than the linked list, so together
> this is a 14% speedup?

Right.

> > And we usually don't need to touch BHs during consume and produce
> > (e.g. for the case of tun).
> >
> > Thanks
>
> Maybe I'll drop it in v6 then ...
> Could you post the full tun patchset please?

Since it doesn't need the bh versions of produce/consume, maybe you can
post v6 first, and then I can post the tun patches?


Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-30 Thread Michael S. Tsirkin
On Mon, May 30, 2016 at 05:59:33PM +0800, Jason Wang wrote:
> 
> 
> On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
> >This is in response to the proposal by Jason to make tun
> >rx packet queue lockless using a circular buffer.
> >My testing seems to show that at least for the common use case
> >in networking, which isn't lockless, a circular buffer
> >with indices does not perform that well, because
> >each index access causes a cache line to bounce between
> >CPUs, and index access causes stalls due to the dependency.
> 
> I changed tun to use skb array; it looks about 5% faster
> than skb ring.

OK, and skb ring is 9% faster than the linked list, so together
this is a 14% speedup?

> And we usually don't need to touch BHs during consume and produce (e.g. for
> the case of tun).
> 
> Thanks

Maybe I'll drop it in v6 then ...
Could you post the full tun patchset please?

-- 
MST


Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-30 Thread Jason Wang



On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
> This is in response to the proposal by Jason to make tun
> rx packet queue lockless using a circular buffer.
> My testing seems to show that at least for the common use case
> in networking, which isn't lockless, a circular buffer
> with indices does not perform that well, because
> each index access causes a cache line to bounce between
> CPUs, and index access causes stalls due to the dependency.
>
> By comparison, an array of pointers where NULL means invalid
> and !NULL means valid can be updated without messing up barriers
> at all and does not have this issue.
>
> On the flip side, cache pressure may be caused by using large queues.
> tun has a queue of 1000 entries by default and that's 8K.
> At this point I'm not sure this can be solved efficiently.
> The correct solution might be sizing the queues appropriately.
>
> Here's an implementation of this idea: it can be used more
> or less wherever sk_buff_head can be used, except you need
> to know the queue size in advance.
>
> It uses skb pointers, but switching to void * would be easy at the cost
> of type safety; though it appears that people want lockless push
> etc., so I'm not sure of the value.
>
> I didn't implement resizing, but it's possible by holding
> both consumer and producer locks.
>
> I think this code works fine without any extra memory barriers since we
> always read and write the same location, so the accesses can not be
> reordered.
> Multiple writes of the same value into memory would mess things up
> for us; I don't think compilers would do that, though.
> But if people feel it's better to be safe wrt compiler optimizations,
> specifying the queue as volatile would probably do it in a cleaner way
> than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
>
> The only issue is with calls within a loop using the __skb_array_XXX
> accessors - in theory the compiler could hoist accesses out of the loop.
>
> Following volatile-considered-harmful.txt I merely
> documented that callers that busy-poll should invoke cpu_relax().
> Most people will use the external skb_array_XXX APIs with a spinlock,
> so this should not be an issue for them.
>
> changes since v4 (v3 was never posted):
>  documentation
>  dropped SKB_ARRAY_MIN_SIZE heuristic
>  unit test (in userspace, included as patch 2)
>
> changes since v2:
>  fixed integer overflow pointed out by Eric.
>  added some comments.
>
> changes since v1:
>  fixed bug pointed out by Eric.
>
>
> Michael S. Tsirkin (2):
>    skb_array: array based FIFO for skbs
>    skb_array: ring test
>
>   include/linux/skb_array.h | 127 +
>   tools/virtio/ringtest/skb_array.c | 167 ++
>   tools/virtio/ringtest/Makefile    |   4 +-
>   3 files changed, 297 insertions(+), 1 deletion(-)
>   create mode 100644 include/linux/skb_array.h
>   create mode 100644 tools/virtio/ringtest/skb_array.c

I changed tun to use skb array; it looks about 5% faster than skb ring.

And we usually don't need to touch BHs during consume and produce
(e.g. for the case of tun).

Thanks


Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-23 Thread Michael S. Tsirkin
On Mon, May 23, 2016 at 06:31:46AM -0700, Eric Dumazet wrote:
> On Mon, 2016-05-23 at 13:43 +0300, Michael S. Tsirkin wrote:
> > This is in response to the proposal by Jason to make tun
> > rx packet queue lockless using a circular buffer.
> > My testing seems to show that at least for the common use case
> > in networking, which isn't lockless, a circular buffer
> > with indices does not perform that well, because
> > each index access causes a cache line to bounce between
> > CPUs, and index access causes stalls due to the dependency.
> > 
> > By comparison, an array of pointers where NULL means invalid
> > and !NULL means valid, can be updated without messing up barriers
> > at all and does not have this issue.
> 
> Note that both consumers and producers write to the array, so under light
> load (like TCP_RR) there are 2 cache lines used by the producers and 2
> cache lines used by the consumers, with potential bouncing.

The shared part is RO for both producer and consumer,
so it's not bouncing - it can be shared in both caches.

Clearly the memory footprint of this data structure is bigger,
so it might cause more misses.

> On the other hand, the traditional sk_buff_head has one cache line,
> holding the spinlock and the list head/tail.
>
> We might use the 'shared cache line' :
> 
> +   /* Shared consumer/producer data */
> +   /* max entries in queue */
> +   int size cacheline_aligned_in_smp;
> +   struct sk_buff **queue;
> 
> 
> To put here some fast path involving a single cache line access when
> queue has 0 or 1 item.
> 

I will try to experiment with it, but please note that
this cache line is currently RO for both producer and consumer;
if we make it writable it will bounce.

-- 
MST


Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-23 Thread Eric Dumazet
On Mon, 2016-05-23 at 13:43 +0300, Michael S. Tsirkin wrote:
> This is in response to the proposal by Jason to make tun
> rx packet queue lockless using a circular buffer.
> My testing seems to show that at least for the common use case
> in networking, which isn't lockless, a circular buffer
> with indices does not perform that well, because
> each index access causes a cache line to bounce between
> CPUs, and index access causes stalls due to the dependency.
> 
> By comparison, an array of pointers where NULL means invalid
> and !NULL means valid, can be updated without messing up barriers
> at all and does not have this issue.

Note that both consumers and producers write to the array, so under light
load (like TCP_RR) there are 2 cache lines used by the producers and 2
cache lines used by the consumers, with potential bouncing.

On the other hand, the traditional sk_buff_head has one cache line,
holding the spinlock and the list head/tail.

We might use the 'shared cache line' :

+   /* Shared consumer/producer data */
+   /* max entries in queue */
+   int size cacheline_aligned_in_smp;
+   struct sk_buff **queue;


We could put here some fast path involving a single cache line access
when the queue has 0 or 1 item.
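For illustration, a sketch of that layout (names and the 64-byte line size are assumptions, not code from the patch): producer state, consumer state, and the read-mostly shared fields each get their own cache line, so on the fast path the shared line can sit read-only in both CPUs' caches.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size; the kernel's SMP_CACHE_BYTES varies by arch. */
#define SMP_CACHE_BYTES 64
#define cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))

struct sk_buff; /* opaque here; a pointer is all the sketch needs */

/* Hypothetical layout sketch, not the actual skb_array definition. */
struct skb_array_sketch {
	/* Written only by the producer. */
	int producer cacheline_aligned_in_smp;
	/* Written only by the consumer. */
	int consumer cacheline_aligned_in_smp;
	/* Shared consumer/producer data, read-only on the fast path. */
	int size cacheline_aligned_in_smp; /* max entries in queue */
	struct sk_buff **queue;
};
```

The alignment attributes guarantee that producer, consumer, and the shared fields never share a cache line, which is exactly the property the trade-off discussion above hinges on.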





[PATCH v5 0/2] skb_array: array based FIFO for skbs

2016-05-23 Thread Michael S. Tsirkin
This is in response to the proposal by Jason to make tun
rx packet queue lockless using a circular buffer.
My testing seems to show that at least for the common use case
in networking, which isn't lockless, a circular buffer
with indices does not perform that well, because
each index access causes a cache line to bounce between
CPUs, and index access causes stalls due to the dependency.

By comparison, an array of pointers where NULL means invalid
and !NULL means valid, can be updated without messing up barriers
at all and does not have this issue.
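The idea can be sketched in a few lines of userspace C (names like `ring_produce` are illustrative, not the skb_array API; locking and barriers are omitted): producer and consumer each advance a private index, and the slot's NULL/!NULL state alone signals full/empty, so neither side ever writes the other's index.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16 /* illustrative; tun's default would be 1000 */

struct ptr_ring_sketch {
	void *queue[RING_SIZE];
	int producer;	/* touched only by the producer */
	int consumer;	/* touched only by the consumer */
};

/* Returns 0 on success, -1 if the ring is full. */
static int ring_produce(struct ptr_ring_sketch *r, void *ptr)
{
	if (r->queue[r->producer])	/* non-NULL slot: ring is full */
		return -1;
	r->queue[r->producer] = ptr;
	if (++r->producer >= RING_SIZE)
		r->producer = 0;
	return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *ring_consume(struct ptr_ring_sketch *r)
{
	void *ptr = r->queue[r->consumer];

	if (!ptr)			/* NULL slot: ring is empty */
		return NULL;
	r->queue[r->consumer] = NULL;	/* hand the slot back to the producer */
	if (++r->consumer >= RING_SIZE)
		r->consumer = 0;
	return ptr;
}
```

Note that each side reads and writes only the slot itself plus its own private index, which is what avoids the index cache-line bouncing described above.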

On the flip side, cache pressure may be caused by using large queues.
tun has a queue of 1000 entries by default and that's 8K.
At this point I'm not sure this can be solved efficiently.
The correct solution might be sizing the queues appropriately.

Here's an implementation of this idea: it can be used more
or less wherever sk_buff_head can be used, except you need
to know the queue size in advance.

It uses skb pointers, but switching to void * would be easy at the cost
of type safety; though it appears that people want lockless push
etc., so I'm not sure of the value.

I didn't implement resizing but it's possible by holding
both consumer and producer locks.
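A resize under both locks could look roughly like this (a sketch under the assumption that both the producer and consumer locks are already held and that the new size fits all queued entries; nothing here is code from the patch):

```c
#include <assert.h>
#include <stdlib.h>

struct ring {
	void **queue;
	int size;
	int producer;
	int consumer;
};

/* Caller must hold both producer and consumer locks, so no entry can be
 * added or removed while we repack.  Assumes new_size is at least the
 * number of queued entries.  Returns 0, or -1 on allocation failure. */
static int ring_resize(struct ring *r, int new_size)
{
	void **new_queue = calloc(new_size, sizeof(*new_queue));
	int i, n = 0;

	if (!new_queue)
		return -1;
	/* Walk old entries in consumption order, repacking from slot 0. */
	for (i = 0; i < r->size && n < new_size; i++) {
		int idx = (r->consumer + i) % r->size;

		if (r->queue[idx])
			new_queue[n++] = r->queue[idx];
	}
	free(r->queue);
	r->queue = new_queue;
	r->size = new_size;
	r->consumer = 0;
	r->producer = n % new_size;
	return 0;
}
```

Because both locks serialize against all producers and consumers, the repack needs no barriers of its own.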

I think this code works fine without any extra memory barriers since we
always read and write the same location, so the accesses can not be
reordered.
Multiple writes of the same value into memory would mess things up
for us; I don't think compilers would do that, though.
But if people feel it's better to be safe wrt compiler optimizations,
specifying the queue as volatile would probably do it in a cleaner way
than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
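For reference, userspace approximations of those macros (the kernel's versions lived in include/linux/compiler.h at the time; these are simplified stand-ins): the volatile access forces the compiler to emit exactly one load or store per use, so it can neither duplicate a write nor cache a value across loop iterations.

```c
#include <assert.h>

/* Simplified userspace stand-ins for the kernel's READ_ONCE/WRITE_ONCE.
 * The cast to a volatile lvalue makes each access a single, un-cacheable
 * (from the compiler's point of view) load or store. */
#define READ_ONCE(x)		(*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))
```

This gets the safety without marking the whole queue volatile, at the cost of annotating every access site.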

The only issue is with calls within a loop using the __skb_array_XXX
accessors - in theory the compiler could hoist accesses out of the loop.

Following volatile-considered-harmful.txt I merely
documented that callers that busy-poll should invoke cpu_relax().
Most people will use the external skb_array_XXX APIs with a spinlock,
so this should not be an issue for them.
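The documented busy-poll pattern amounts to something like the following (a userspace sketch: `busy_poll_consume` and `toy_consume` are made-up names, and cpu_relax() is reduced to a compiler barrier, which is also what prevents the hoisting just described):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for cpu_relax(): just a compiler barrier here,
 * which also forces the load to be redone on every iteration. */
#define cpu_relax() __asm__ __volatile__("" ::: "memory")

/* The busy-poll calling convention for a lockless consume accessor:
 * retry with cpu_relax() between attempts.  consume_fn stands in for
 * a __skb_array_XXX style accessor. */
static void *busy_poll_consume(void *(*consume_fn)(void *), void *ctx)
{
	void *entry;

	while (!(entry = consume_fn(ctx)))
		cpu_relax();
	return entry;
}

/* Toy consumer for demonstration: "empty" for two polls, then ready. */
static void *toy_consume(void *ctx)
{
	static int polls;

	return ++polls >= 3 ? ctx : NULL;
}
```

Callers using the external skb_array_XXX APIs under a spinlock never spin like this, which is why the cover letter treats it as a documentation issue only.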

changes since v4 (v3 was never posted):
 documentation
 dropped SKB_ARRAY_MIN_SIZE heuristic
 unit test (in userspace, included as patch 2)

changes since v2:
fixed integer overflow pointed out by Eric.
added some comments.

changes since v1:
fixed bug pointed out by Eric.


Michael S. Tsirkin (2):
  skb_array: array based FIFO for skbs
  skb_array: ring test

 include/linux/skb_array.h | 127 +
 tools/virtio/ringtest/skb_array.c | 167 ++
 tools/virtio/ringtest/Makefile    |   4 +-
 3 files changed, 297 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/skb_array.h
 create mode 100644 tools/virtio/ringtest/skb_array.c

-- 
MST