Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-10 Thread Michael S. Tsirkin
On Wed, May 10, 2017 at 11:30:42AM +0800, Jason Wang wrote:
> 
> 
> On 2017年05月09日 21:33, Michael S. Tsirkin wrote:
> > > I love this idea.  Reviewed and discussed the idea in-person with MST
> > > during netdevconf[1] at this laptop.  I promised I will also run it
> > > through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> > > used in network stack as skb_array).
> > I'm merging this through my tree. Any objections?
> > 
> 
> Batch dequeuing series depends on this, maybe it's better to have this in
> that series. Let me post a V4 series with this.
> 
> Thanks

FYI I'm including 1/3 in the pull request.

-- 
MST



Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-10 Thread Michael S. Tsirkin
On Wed, May 10, 2017 at 11:18:13AM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 9 May 2017 16:33:14 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> > > On Fri, 7 Apr 2017 08:49:57 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > A known weakness in ptr_ring design is that it does not handle well the
> > > > situation when ring is almost full: as entries are consumed they are
> > > > immediately used again by the producer, so consumer and producer are
> > > > writing to a shared cache line.
> > > > 
> > > > To fix this, add batching to consume calls: as entries are
> > > > consumed do not write NULL into the ring until we get
> > > > a multiple (in current implementation 2x) of cache lines
> > > > away from the producer. At that point, write them all out.
> > > > 
> > > > We do the write out in the reverse order to keep
> > > > producer from sharing cache with consumer for as long
> > > > as possible.
> > > > 
> > > > Writeout also triggers when ring wraps around - there's
> > > > no special reason to do this but it helps keep the code
> > > > a bit simpler.
> > > > 
> > > > What should we do if getting away from producer by 2 cache lines
> > > > would mean we are keeping the ring more than half empty?
> > > > Maybe we should reduce the batching in this case,
> > > > current patch simply reduces the batching.
> > > > 
> > > > Notes:
> > > > - it is no longer true that a call to consume guarantees
> > > >   that the following call to produce will succeed.
> > > >   No users seem to assume that.
> > > > - batching can also in theory reduce the signalling rate:
> > > >   users that would previously send interrupts to the producer
> > > >   to wake it up after consuming each entry would now only
> > > >   need to do this once in a batch.
> > > >   Doing this would be easy by returning a flag to the caller.
> > > >   No users seem to do signalling on consume yet so this was not
> > > >   implemented yet.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin 
> > > > ---
> > > > 
> > > > Jason, I am curious whether the following gives you some of
> > > > the performance boost that you see with vhost batching
> > > > patches. Is vhost batching on top still helpful?
> > > > 
> > > >  include/linux/ptr_ring.h | 63 +---
> > > >  1 file changed, 54 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 6c70444..6b2e0dd 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -34,11 +34,13 @@
> > > >  struct ptr_ring {
> > > > int producer cacheline_aligned_in_smp;
> > > > spinlock_t producer_lock;
> > > > -   int consumer cacheline_aligned_in_smp;
> > > > +   int consumer_head cacheline_aligned_in_smp; /* next valid entry */
> > > > +   int consumer_tail; /* next entry to invalidate */
> > > > spinlock_t consumer_lock;
> > > > /* Shared consumer/producer data */
> > > > /* Read-only by both the producer and the consumer */
> > > > int size cacheline_aligned_in_smp; /* max entries in queue */
> > > > +   int batch; /* number of entries to consume in a batch */
> > > > void **queue;
> > > >  };
> > > >  
> > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> > > >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > > >  {
> > > > if (likely(r->size))
> > > > -   return r->queue[r->consumer];
> > > > +   return r->queue[r->consumer_head];
> > > > return NULL;
> > > >  }
> > > >  
> > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > > >  /* Must only be called after __ptr_ring_peek returned !NULL */
> > > >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > > >  {
> > > > -   r->queue[r->consumer++] = NULL;
> > > > -   if (unlikely(r->consumer >= r->size))
> > > > -   r->consumer = 0;
> > > > +   /* Fundamentally, what we want to do is update consumer
> > > > +* index and zero out the entry so producer can reuse it.
> > > > +* Doing it naively at each consume would be as simple as:
> > > > +*   r->queue[r->consumer++] = NULL;
> > > > +*   if (unlikely(r->consumer >= r->size))
> > > > +*   r->consumer = 0;
> > > > +* but that is suboptimal when the ring is full as producer is writing
> > > > +* out new entries in the same cache line.  Defer these updates until a
> > > > +* batch of entries has been consumed.
> > > > +*/
> > > > +   int head = r->consumer_head++;
> > > > +
> > > > +   /* Once we have processed enough entries 

Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-10 Thread Jesper Dangaard Brouer
On Tue, 9 May 2017 16:33:14 +0300
"Michael S. Tsirkin"  wrote:

> On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 7 Apr 2017 08:49:57 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case,
> > > current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >   that the following call to produce will succeed.
> > >   No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >   users that would previously send interrupts to the producer
> > >   to wake it up after consuming each entry would now only
> > >   need to do this once in a batch.
> > >   Doing this would be easy by returning a flag to the caller.
> > >   No users seem to do signalling on consume yet so this was not
> > >   implemented yet.
> > > 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > > 
> > >  include/linux/ptr_ring.h | 63 +---
> > >  1 file changed, 54 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > index 6c70444..6b2e0dd 100644
> > > --- a/include/linux/ptr_ring.h
> > > +++ b/include/linux/ptr_ring.h
> > > @@ -34,11 +34,13 @@
> > >  struct ptr_ring {
> > >   int producer cacheline_aligned_in_smp;
> > >   spinlock_t producer_lock;
> > > - int consumer cacheline_aligned_in_smp;
> > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */
> > > + int consumer_tail; /* next entry to invalidate */
> > >   spinlock_t consumer_lock;
> > >   /* Shared consumer/producer data */
> > >   /* Read-only by both the producer and the consumer */
> > >   int size cacheline_aligned_in_smp; /* max entries in queue */
> > > + int batch; /* number of entries to consume in a batch */
> > >   void **queue;
> > >  };
> > >  
> > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> > >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > >  {
> > >   if (likely(r->size))
> > > - return r->queue[r->consumer];
> > > + return r->queue[r->consumer_head];
> > >   return NULL;
> > >  }
> > >  
> > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > >  /* Must only be called after __ptr_ring_peek returned !NULL */
> > >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > >  {
> > > - r->queue[r->consumer++] = NULL;
> > > - if (unlikely(r->consumer >= r->size))
> > > - r->consumer = 0;
> > > + /* Fundamentally, what we want to do is update consumer
> > > +  * index and zero out the entry so producer can reuse it.
> > > +  * Doing it naively at each consume would be as simple as:
> > > +  *   r->queue[r->consumer++] = NULL;
> > > +  *   if (unlikely(r->consumer >= r->size))
> > > +  *   r->consumer = 0;
> > > +  * but that is suboptimal when the ring is full as producer is writing
> > > +  * out new entries in the same cache line.  Defer these updates until a
> > > +  * batch of entries has been consumed.
> > > +  */
> > > + int head = r->consumer_head++;
> > > +
> > > + /* Once we have processed enough entries invalidate them in
> > > +  * the ring all at once so producer can reuse their space in the ring.
> > > +  * We also do this when we reach end of the ring - not mandatory
> > > +  * but helps keep the implementation simple.
> > > +  */
> > > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > > +  r->consumer_head >= r->size)) {
> > > + /* Zero out entries in the reverse order: this way we touch the
> > > +  * cache line that producer might currently be 

Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-09 Thread Jason Wang



On 2017年05月09日 21:33, Michael S. Tsirkin wrote:

I love this idea.  Reviewed and discussed the idea in-person with MST
during netdevconf[1] at this laptop.  I promised I will also run it
through my micro-benchmarking[2] once I return home (hint ptr_ring gets
used in network stack as skb_array).

I'm merging this through my tree. Any objections?



Batch dequeuing series depends on this, maybe it's better to have this 
in that series. Let me post a V4 series with this.


Thanks


Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-05-09 Thread Michael S. Tsirkin
On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 7 Apr 2017 08:49:57 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > A known weakness in ptr_ring design is that it does not handle well the
> > situation when ring is almost full: as entries are consumed they are
> > immediately used again by the producer, so consumer and producer are
> > writing to a shared cache line.
> > 
> > To fix this, add batching to consume calls: as entries are
> > consumed do not write NULL into the ring until we get
> > a multiple (in current implementation 2x) of cache lines
> > away from the producer. At that point, write them all out.
> > 
> > We do the write out in the reverse order to keep
> > producer from sharing cache with consumer for as long
> > as possible.
> > 
> > Writeout also triggers when ring wraps around - there's
> > no special reason to do this but it helps keep the code
> > a bit simpler.
> > 
> > What should we do if getting away from producer by 2 cache lines
> > would mean we are keeping the ring more than half empty?
> > Maybe we should reduce the batching in this case,
> > current patch simply reduces the batching.
> > 
> > Notes:
> > - it is no longer true that a call to consume guarantees
> >   that the following call to produce will succeed.
> >   No users seem to assume that.
> > - batching can also in theory reduce the signalling rate:
> >   users that would previously send interrupts to the producer
> >   to wake it up after consuming each entry would now only
> >   need to do this once in a batch.
> >   Doing this would be easy by returning a flag to the caller.
> >   No users seem to do signalling on consume yet so this was not
> >   implemented yet.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > Jason, I am curious whether the following gives you some of
> > the performance boost that you see with vhost batching
> > patches. Is vhost batching on top still helpful?
> > 
> >  include/linux/ptr_ring.h | 63 +---
> >  1 file changed, 54 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 6c70444..6b2e0dd 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -34,11 +34,13 @@
> >  struct ptr_ring {
> > int producer cacheline_aligned_in_smp;
> > spinlock_t producer_lock;
> > -   int consumer cacheline_aligned_in_smp;
> > +   int consumer_head cacheline_aligned_in_smp; /* next valid entry */
> > +   int consumer_tail; /* next entry to invalidate */
> > spinlock_t consumer_lock;
> > /* Shared consumer/producer data */
> > /* Read-only by both the producer and the consumer */
> > int size cacheline_aligned_in_smp; /* max entries in queue */
> > +   int batch; /* number of entries to consume in a batch */
> > void **queue;
> >  };
> >  
> > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> >  {
> > if (likely(r->size))
> > -   return r->queue[r->consumer];
> > +   return r->queue[r->consumer_head];
> > return NULL;
> >  }
> >  
> > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> >  /* Must only be called after __ptr_ring_peek returned !NULL */
> >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >  {
> > -   r->queue[r->consumer++] = NULL;
> > -   if (unlikely(r->consumer >= r->size))
> > -   r->consumer = 0;
> > +   /* Fundamentally, what we want to do is update consumer
> > +* index and zero out the entry so producer can reuse it.
> > +* Doing it naively at each consume would be as simple as:
> > +*   r->queue[r->consumer++] = NULL;
> > +*   if (unlikely(r->consumer >= r->size))
> > +*   r->consumer = 0;
> > +* but that is suboptimal when the ring is full as producer is writing
> > +* out new entries in the same cache line.  Defer these updates until a
> > +* batch of entries has been consumed.
> > +*/
> > +   int head = r->consumer_head++;
> > +
> > +   /* Once we have processed enough entries invalidate them in
> > +* the ring all at once so producer can reuse their space in the ring.
> > +* We also do this when we reach end of the ring - not mandatory
> > +* but helps keep the implementation simple.
> > +*/
> > +   if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > +r->consumer_head >= r->size)) {
> > +   /* Zero out entries in the reverse order: this way we touch the
> > +* cache line that producer might currently be reading the last;
> > +* producer won't make progress and touch other cache lines
> > +* besides the first one until we write out all entries.
> > +*/
> > +   while (likely(head 

Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-17 Thread Jason Wang



On 2017年04月15日 06:50, Michael S. Tsirkin wrote:

On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:


On 2017年04月12日 16:03, Jason Wang wrote:


On 2017年04月07日 13:49, Michael S. Tsirkin wrote:

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case,
current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also in theory reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once in a batch.
Doing this would be easy by returning a flag to the caller.
No users seem to do signalling on consume yet so this was not
implemented yet.

Signed-off-by: Michael S. Tsirkin
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?

The patch looks good to me, will have a test for vhost batching patches.

Thanks

Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps

Just a thought: could you test dropping the consumer spinlock
completely?  Just around the peek?


A 2% improvement for dropping the spinlock around peeking, and 2% more
for dropping the spinlock around consuming.




As I said previously, the perf c2c tool should be helpful
to locate sources of latency related to the cache.



perf c2c indeed shows that some false sharing was reduced by this patch.
But it does not show an obvious difference with batch dequeuing on top.


Thanks


Acked-by: Jason Wang 

Thanks




Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-17 Thread Jason Wang



On 2017-04-15 05:00, Michael S. Tsirkin wrote:

On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:


On 2017-04-12 16:03, Jason Wang wrote:


On 2017-04-07 13:49, Michael S. Tsirkin wrote:

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case,
current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also in theory reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once in a batch.
Doing this would be easy by returning a flag to the caller.
No users seem to do signalling on consume yet so this was not
implemented yet.

Signed-off-by: Michael S. Tsirkin
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?

The patch looks good to me, will have a test for vhost batching patches.

Thanks

Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps

Acked-by: Jason Wang 

Thanks

Fascinating. How do we explain the gain with batch dequeue?


I counted the drop rate (pktgen on tun, counting tun tx) and maybe it can
explain it more or less:


Before this patch:      Tx xmit 1.8Mpps,  Tx dropped 0.23Mpps, Tx total 2.04Mpps (11% dropped)
After this patch:       Tx xmit 1.95Mpps, Tx dropped 0.33Mpps, Tx total 2.28Mpps (14% dropped)
With batched dequeuing: Tx xmit 2.24Mpps, Tx dropped 0.01Mpps, Tx total 2.26Mpps (~0% dropped)


With this patch, we remove the cache contention by more or less blocking
the producer. With batch dequeuing, the ring is not full in 99% of the
cases, which probably means the producer is not blocked most of the time.



Is it just the lock overhead?


I removed the spinlocks for peeking and dequeuing on top of this patch.
The tx pps increased from ~2Mpps to ~2.08Mpps, so it was not only the
lock overhead.



Can you pls try to replace
the lock with a simple non-fair atomic and see what happens?



Not sure I get the idea; we hit the fast path of the spinlock for sure,
which is just a cmpxchg().


Thanks




Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-14 Thread Michael S. Tsirkin
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
> 
> 
> > On 2017-04-12 16:03, Jason Wang wrote:
> > 
> > 
> > > On 2017-04-07 13:49, Michael S. Tsirkin wrote:
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case,
> > > current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >that the following call to produce will succeed.
> > >No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >    users that would previously send interrupts to the producer
> > >to wake it up after consuming each entry would now only
> > >need to do this once in a batch.
> > >Doing this would be easy by returning a flag to the caller.
> > >No users seem to do signalling on consume yet so this was not
> > >implemented yet.
> > > 
> > > Signed-off-by: Michael S. Tsirkin
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > 
> > The patch looks good to me, will have a test for vhost batching patches.
> > 
> > Thanks
> 
> Still helpful:
> 
> before this patch: 1.84Mpps
> with this patch: 2.00Mpps
> with batch dequeuing: 2.30Mpps

Just a thought: could you test dropping the consumer spinlock
completely?  Just around the peek?

As I said previously, the perf c2c tool should be helpful
to locate sources of latency related to the cache.

> Acked-by: Jason Wang 
> 
> Thanks


Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-14 Thread Michael S. Tsirkin
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
> 
> 
> > On 2017-04-12 16:03, Jason Wang wrote:
> > 
> > 
> > > On 2017-04-07 13:49, Michael S. Tsirkin wrote:
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case,
> > > current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >that the following call to produce will succeed.
> > >No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >    users that would previously send interrupts to the producer
> > >to wake it up after consuming each entry would now only
> > >need to do this once in a batch.
> > >Doing this would be easy by returning a flag to the caller.
> > >No users seem to do signalling on consume yet so this was not
> > >implemented yet.
> > > 
> > > Signed-off-by: Michael S. Tsirkin
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > 
> > The patch looks good to me, will have a test for vhost batching patches.
> > 
> > Thanks
> 
> Still helpful:
> 
> before this patch: 1.84Mpps
> with this patch: 2.00Mpps
> with batch dequeuing: 2.30Mpps
> 
> Acked-by: Jason Wang 
> 
> Thanks

Fascinating. How do we explain the gain with batch dequeue?
Is it just the lock overhead? Can you pls try to replace
the lock with a simple non-fair atomic and see what happens?

-- 
MST


Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-14 Thread Jason Wang



On 2017-04-12 16:03, Jason Wang wrote:



On 2017-04-07 13:49, Michael S. Tsirkin wrote:

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case,
current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
   that the following call to produce will succeed.
   No users seem to assume that.
- batching can also in theory reduce the signalling rate:
   users that would previously send interrupts to the producer
   to wake it up after consuming each entry would now only
   need to do this once in a batch.
   Doing this would be easy by returning a flag to the caller.
   No users seem to do signalling on consume yet so this was not
   implemented yet.

Signed-off-by: Michael S. Tsirkin
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?


The patch looks good to me, will have a test for vhost batching patches.

Thanks


Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps

Acked-by: Jason Wang 

Thanks


Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-12 Thread Jason Wang



On 2017-04-07 13:49, Michael S. Tsirkin wrote:

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case,
current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
   that the following call to produce will succeed.
   No users seem to assume that.
- batching can also in theory reduce the signalling rate:
   users that would previously send interrupts to the producer
   to wake it up after consuming each entry would now only
   need to do this once in a batch.
   Doing this would be easy by returning a flag to the caller.
   No users seem to do signalling on consume yet so this was not
   implemented yet.

Signed-off-by: Michael S. Tsirkin
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?


The patch looks good to me, will have a test for vhost batching patches.

Thanks


Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-08 Thread Jesper Dangaard Brouer
On Fri, 7 Apr 2017 08:49:57 +0300
"Michael S. Tsirkin"  wrote:

> A known weakness in ptr_ring design is that it does not handle well the
> situation when ring is almost full: as entries are consumed they are
> immediately used again by the producer, so consumer and producer are
> writing to a shared cache line.
> 
> To fix this, add batching to consume calls: as entries are
> consumed do not write NULL into the ring until we get
> a multiple (in current implementation 2x) of cache lines
> away from the producer. At that point, write them all out.
> 
> We do the write out in the reverse order to keep
> producer from sharing cache with consumer for as long
> as possible.
> 
> Writeout also triggers when ring wraps around - there's
> no special reason to do this but it helps keep the code
> a bit simpler.
> 
> What should we do if getting away from producer by 2 cache lines
> would mean we are keeping the ring more than half empty?
> Maybe we should reduce the batching in this case,
> current patch simply reduces the batching.
> 
> Notes:
> - it is no longer true that a call to consume guarantees
>   that the following call to produce will succeed.
>   No users seem to assume that.
> - batching can also in theory reduce the signalling rate:
>   users that would previously send interrupts to the producer
>   to wake it up after consuming each entry would now only
>   need to do this once in a batch.
>   Doing this would be easy by returning a flag to the caller.
>   No users seem to do signalling on consume yet so this was not
>   implemented yet.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> Jason, I am curious whether the following gives you some of
> the performance boost that you see with vhost batching
> patches. Is vhost batching on top still helpful?
> 
>  include/linux/ptr_ring.h | 63 +---
>  1 file changed, 54 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 6c70444..6b2e0dd 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -34,11 +34,13 @@
>  struct ptr_ring {
>   int producer cacheline_aligned_in_smp;
>   spinlock_t producer_lock;
> - int consumer cacheline_aligned_in_smp;
> + int consumer_head cacheline_aligned_in_smp; /* next valid entry */
> + int consumer_tail; /* next entry to invalidate */
>   spinlock_t consumer_lock;
>   /* Shared consumer/producer data */
>   /* Read-only by both the producer and the consumer */
>   int size cacheline_aligned_in_smp; /* max entries in queue */
> + int batch; /* number of entries to consume in a batch */
>   void **queue;
>  };
>  
> @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
>  static inline void *__ptr_ring_peek(struct ptr_ring *r)
>  {
>   if (likely(r->size))
> - return r->queue[r->consumer];
> + return r->queue[r->consumer_head];
>   return NULL;
>  }
>  
> @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
>  /* Must only be called after __ptr_ring_peek returned !NULL */
>  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  {
> - r->queue[r->consumer++] = NULL;
> - if (unlikely(r->consumer >= r->size))
> - r->consumer = 0;
> + /* Fundamentally, what we want to do is update consumer
> +  * index and zero out the entry so producer can reuse it.
> +  * Doing it naively at each consume would be as simple as:
> +  *   r->queue[r->consumer++] = NULL;
> +  *   if (unlikely(r->consumer >= r->size))
> +  *   r->consumer = 0;
> +  * but that is suboptimal when the ring is full as producer is writing
> +  * out new entries in the same cache line.  Defer these updates until a
> +  * batch of entries has been consumed.
> +  */
> + int head = r->consumer_head++;
> +
> + /* Once we have processed enough entries invalidate them in
> +  * the ring all at once so producer can reuse their space in the ring.
> +  * We also do this when we reach end of the ring - not mandatory
> +  * but helps keep the implementation simple.
> +  */
> + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> +  r->consumer_head >= r->size)) {
> + /* Zero out entries in the reverse order: this way we touch the
> +  * cache line that producer might currently be reading the last;
> +  * producer won't make progress and touch other cache lines
> +  * besides the first one until we write out all entries.
> +  */
> + while (likely(head >= r->consumer_tail))
> + r->queue[head--] = NULL;
> + r->consumer_tail = r->consumer_head;
> + }
> + if (unlikely(r->consumer_head >= r->size)) {
> + r->consumer_head = 0;
> + r->consumer_tail = 0;
> + }

Re: [PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-08 Thread Jesper Dangaard Brouer
On Fri, 7 Apr 2017 08:49:57 +0300
"Michael S. Tsirkin"  wrote:
