Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Wed, May 10, 2017 at 11:30:42AM +0800, Jason Wang wrote: > > > On 2017年05月09日 21:33, Michael S. Tsirkin wrote: > > > I love this idea. Reviewed and discussed the idea in-person with MST > > > during netdevconf[1] at this laptop. I promised I will also run it > > > through my micro-benchmarking[2] once I return home (hint ptr_ring gets > > > used in network stack as skb_array). > > I'm merging this through my tree. Any objections? > > > > Batch dequeuing series depends on this, maybe it's better to have this in > that series. Let me post a V4 series with this. > > Thanks FYI I'm including 1/3 in the pull request. -- MST
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Wed, May 10, 2017 at 11:30:42AM +0800, Jason Wang wrote: > > > On 2017年05月09日 21:33, Michael S. Tsirkin wrote: > > > I love this idea. Reviewed and discussed the idea in-person with MST > > > during netdevconf[1] at this laptop. I promised I will also run it > > > through my micro-benchmarking[2] once I return home (hint ptr_ring gets > > > used in network stack as skb_array). > > I'm merging this through my tree. Any objections? > > > > Batch dequeuing series depends on this, maybe it's better to have this in > that series. Let me post a V4 series with this. > > Thanks FYI I'm including 1/3 in the pull request. -- MST
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Wed, May 10, 2017 at 11:18:13AM +0200, Jesper Dangaard Brouer wrote: > On Tue, 9 May 2017 16:33:14 +0300 > "Michael S. Tsirkin"wrote: > > > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > > > On Fri, 7 Apr 2017 08:49:57 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > A known weakness in ptr_ring design is that it does not handle well the > > > > situation when ring is almost full: as entries are consumed they are > > > > immediately used again by the producer, so consumer and producer are > > > > writing to a shared cache line. > > > > > > > > To fix this, add batching to consume calls: as entries are > > > > consumed do not write NULL into the ring until we get > > > > a multiple (in current implementation 2x) of cache lines > > > > away from the producer. At that point, write them all out. > > > > > > > > We do the write out in the reverse order to keep > > > > producer from sharing cache with consumer for as long > > > > as possible. > > > > > > > > Writeout also triggers when ring wraps around - there's > > > > no special reason to do this but it helps keep the code > > > > a bit simpler. > > > > > > > > What should we do if getting away from producer by 2 cache lines > > > > would mean we are keeping the ring moe than half empty? > > > > Maybe we should reduce the batching in this case, > > > > current patch simply reduces the batching. > > > > > > > > Notes: > > > > - it is no longer true that a call to consume guarantees > > > > that the following call to produce will succeed. > > > > No users seem to assume that. > > > > - batching can also in theory reduce the signalling rate: > > > > users that would previously send interrups to the producer > > > > to wake it up after consuming each entry would now only > > > > need to do this once in a batch. > > > > Doing this would be easy by returning a flag to the caller. > > > > No users seem to do signalling on consume yet so this was not > > > > implemented yet. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > > > > > Jason, I am curious whether the following gives you some of > > > > the performance boost that you see with vhost batching > > > > patches. Is vhost batching on top still helpful? > > > > > > > > include/linux/ptr_ring.h | 63 > > > > +--- > > > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > > > index 6c70444..6b2e0dd 100644 > > > > --- a/include/linux/ptr_ring.h > > > > +++ b/include/linux/ptr_ring.h > > > > @@ -34,11 +34,13 @@ > > > > struct ptr_ring { > > > > int producer cacheline_aligned_in_smp; > > > > spinlock_t producer_lock; > > > > - int consumer cacheline_aligned_in_smp; > > > > + int consumer_head cacheline_aligned_in_smp; /* next valid > > > > entry */ > > > > + int consumer_tail; /* next entry to invalidate */ > > > > spinlock_t consumer_lock; > > > > /* Shared consumer/producer data */ > > > > /* Read-only by both the producer and the consumer */ > > > > int size cacheline_aligned_in_smp; /* max entries in queue > > > > */ > > > > + int batch; /* number of entries to consume in a batch */ > > > > void **queue; > > > > }; > > > > > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct > > > > ptr_ring *r, void *ptr) > > > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > > > { > > > > if (likely(r->size)) > > > > - return r->queue[r->consumer]; > > > > + return r->queue[r->consumer_head]; > > > > return NULL; > > > > } > > > > > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct > > > > ptr_ring *r) > > > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > > > { > > > > - r->queue[r->consumer++] = NULL; > > > > - if (unlikely(r->consumer >= r->size)) > > > > - r->consumer = 0; > > > > + /* Fundamentally, what we want to do is update consumer > > > > +* index and zero out the entry so producer can reuse it. > > > > +* Doing it naively at each consume would be as simple as: > > > > +* r->queue[r->consumer++] = NULL; > > > > +* if (unlikely(r->consumer >= r->size)) > > > > +* r->consumer = 0; > > > > +* but that is suboptimal when the ring is full as producer is > > > > writing > > > > +* out new entries in the same cache line. Defer these updates > > > > until a > > > > +* batch of entries has been consumed. > > > > +*/ > > > > + int head = r->consumer_head++; > > > > + > > > > + /* Once we have processed enough entries
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Wed, May 10, 2017 at 11:18:13AM +0200, Jesper Dangaard Brouer wrote: > On Tue, 9 May 2017 16:33:14 +0300 > "Michael S. Tsirkin" wrote: > > > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > > > On Fri, 7 Apr 2017 08:49:57 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > A known weakness in ptr_ring design is that it does not handle well the > > > > situation when ring is almost full: as entries are consumed they are > > > > immediately used again by the producer, so consumer and producer are > > > > writing to a shared cache line. > > > > > > > > To fix this, add batching to consume calls: as entries are > > > > consumed do not write NULL into the ring until we get > > > > a multiple (in current implementation 2x) of cache lines > > > > away from the producer. At that point, write them all out. > > > > > > > > We do the write out in the reverse order to keep > > > > producer from sharing cache with consumer for as long > > > > as possible. > > > > > > > > Writeout also triggers when ring wraps around - there's > > > > no special reason to do this but it helps keep the code > > > > a bit simpler. > > > > > > > > What should we do if getting away from producer by 2 cache lines > > > > would mean we are keeping the ring moe than half empty? > > > > Maybe we should reduce the batching in this case, > > > > current patch simply reduces the batching. > > > > > > > > Notes: > > > > - it is no longer true that a call to consume guarantees > > > > that the following call to produce will succeed. > > > > No users seem to assume that. > > > > - batching can also in theory reduce the signalling rate: > > > > users that would previously send interrups to the producer > > > > to wake it up after consuming each entry would now only > > > > need to do this once in a batch. > > > > Doing this would be easy by returning a flag to the caller. > > > > No users seem to do signalling on consume yet so this was not > > > > implemented yet. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > > > > > Jason, I am curious whether the following gives you some of > > > > the performance boost that you see with vhost batching > > > > patches. Is vhost batching on top still helpful? > > > > > > > > include/linux/ptr_ring.h | 63 > > > > +--- > > > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > > > index 6c70444..6b2e0dd 100644 > > > > --- a/include/linux/ptr_ring.h > > > > +++ b/include/linux/ptr_ring.h > > > > @@ -34,11 +34,13 @@ > > > > struct ptr_ring { > > > > int producer cacheline_aligned_in_smp; > > > > spinlock_t producer_lock; > > > > - int consumer cacheline_aligned_in_smp; > > > > + int consumer_head cacheline_aligned_in_smp; /* next valid > > > > entry */ > > > > + int consumer_tail; /* next entry to invalidate */ > > > > spinlock_t consumer_lock; > > > > /* Shared consumer/producer data */ > > > > /* Read-only by both the producer and the consumer */ > > > > int size cacheline_aligned_in_smp; /* max entries in queue > > > > */ > > > > + int batch; /* number of entries to consume in a batch */ > > > > void **queue; > > > > }; > > > > > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct > > > > ptr_ring *r, void *ptr) > > > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > > > { > > > > if (likely(r->size)) > > > > - return r->queue[r->consumer]; > > > > + return r->queue[r->consumer_head]; > > > > return NULL; > > > > } > > > > > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct > > > > ptr_ring *r) > > > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > > > { > > > > - r->queue[r->consumer++] = NULL; > > > > - if (unlikely(r->consumer >= r->size)) > > > > - r->consumer = 0; > > > > + /* Fundamentally, what we want to do is update consumer > > > > +* index and zero out the entry so producer can reuse it. > > > > +* Doing it naively at each consume would be as simple as: > > > > +* r->queue[r->consumer++] = NULL; > > > > +* if (unlikely(r->consumer >= r->size)) > > > > +* r->consumer = 0; > > > > +* but that is suboptimal when the ring is full as producer is > > > > writing > > > > +* out new entries in the same cache line. Defer these updates > > > > until a > > > > +* batch of entries has been consumed. > > > > +*/ > > > > + int head = r->consumer_head++; > > > > + > > > > + /* Once we have processed enough entries invalidate them in > > > > +* the ring all at once
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Tue, 9 May 2017 16:33:14 +0300 "Michael S. Tsirkin"wrote: > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > > On Fri, 7 Apr 2017 08:49:57 +0300 > > "Michael S. Tsirkin" wrote: > > > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > > that the following call to produce will succeed. > > > No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > > users that would previously send interrups to the producer > > > to wake it up after consuming each entry would now only > > > need to do this once in a batch. > > > Doing this would be easy by returning a flag to the caller. > > > No users seem to do signalling on consume yet so this was not > > > implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > > > include/linux/ptr_ring.h | 63 > > > +--- > > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > > index 6c70444..6b2e0dd 100644 > > > --- a/include/linux/ptr_ring.h > > > +++ b/include/linux/ptr_ring.h > > > @@ -34,11 +34,13 @@ > > > struct ptr_ring { > > > int producer cacheline_aligned_in_smp; > > > spinlock_t producer_lock; > > > - int consumer cacheline_aligned_in_smp; > > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > > > + int consumer_tail; /* next entry to invalidate */ > > > spinlock_t consumer_lock; > > > /* Shared consumer/producer data */ > > > /* Read-only by both the producer and the consumer */ > > > int size cacheline_aligned_in_smp; /* max entries in queue */ > > > + int batch; /* number of entries to consume in a batch */ > > > void **queue; > > > }; > > > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring > > > *r, void *ptr) > > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > > { > > > if (likely(r->size)) > > > - return r->queue[r->consumer]; > > > + return r->queue[r->consumer_head]; > > > return NULL; > > > } > > > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring > > > *r) > > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > > { > > > - r->queue[r->consumer++] = NULL; > > > - if (unlikely(r->consumer >= r->size)) > > > - r->consumer = 0; > > > + /* Fundamentally, what we want to do is update consumer > > > + * index and zero out the entry so producer can reuse it. > > > + * Doing it naively at each consume would be as simple as: > > > + * r->queue[r->consumer++] = NULL; > > > + * if (unlikely(r->consumer >= r->size)) > > > + * r->consumer = 0; > > > + * but that is suboptimal when the ring is full as producer is writing > > > + * out new entries in the same cache line. Defer these updates until a > > > + * batch of entries has been consumed. > > > + */ > > > + int head = r->consumer_head++; > > > + > > > + /* Once we have processed enough entries invalidate them in > > > + * the ring all at once so producer can reuse their space in the ring. > > > + * We also do this when we reach end of the ring - not mandatory > > > + * but helps keep the implementation simple. > > > + */ > > > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > > > + r->consumer_head >= r->size)) { > > > + /* Zero out entries in the reverse order: this way we touch the > > > + * cache line that producer might currently be
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Tue, 9 May 2017 16:33:14 +0300 "Michael S. Tsirkin" wrote: > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > > On Fri, 7 Apr 2017 08:49:57 +0300 > > "Michael S. Tsirkin" wrote: > > > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > > that the following call to produce will succeed. > > > No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > > users that would previously send interrups to the producer > > > to wake it up after consuming each entry would now only > > > need to do this once in a batch. > > > Doing this would be easy by returning a flag to the caller. > > > No users seem to do signalling on consume yet so this was not > > > implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > > > include/linux/ptr_ring.h | 63 > > > +--- > > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > > index 6c70444..6b2e0dd 100644 > > > --- a/include/linux/ptr_ring.h > > > +++ b/include/linux/ptr_ring.h > > > @@ -34,11 +34,13 @@ > > > struct ptr_ring { > > > int producer cacheline_aligned_in_smp; > > > spinlock_t producer_lock; > > > - int consumer cacheline_aligned_in_smp; > > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > > > + int consumer_tail; /* next entry to invalidate */ > > > spinlock_t consumer_lock; > > > /* Shared consumer/producer data */ > > > /* Read-only by both the producer and the consumer */ > > > int size cacheline_aligned_in_smp; /* max entries in queue */ > > > + int batch; /* number of entries to consume in a batch */ > > > void **queue; > > > }; > > > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring > > > *r, void *ptr) > > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > > { > > > if (likely(r->size)) > > > - return r->queue[r->consumer]; > > > + return r->queue[r->consumer_head]; > > > return NULL; > > > } > > > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring > > > *r) > > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > > { > > > - r->queue[r->consumer++] = NULL; > > > - if (unlikely(r->consumer >= r->size)) > > > - r->consumer = 0; > > > + /* Fundamentally, what we want to do is update consumer > > > + * index and zero out the entry so producer can reuse it. > > > + * Doing it naively at each consume would be as simple as: > > > + * r->queue[r->consumer++] = NULL; > > > + * if (unlikely(r->consumer >= r->size)) > > > + * r->consumer = 0; > > > + * but that is suboptimal when the ring is full as producer is writing > > > + * out new entries in the same cache line. Defer these updates until a > > > + * batch of entries has been consumed. > > > + */ > > > + int head = r->consumer_head++; > > > + > > > + /* Once we have processed enough entries invalidate them in > > > + * the ring all at once so producer can reuse their space in the ring. > > > + * We also do this when we reach end of the ring - not mandatory > > > + * but helps keep the implementation simple. > > > + */ > > > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > > > + r->consumer_head >= r->size)) { > > > + /* Zero out entries in the reverse order: this way we touch the > > > + * cache line that producer might currently be reading the last; > > > + * producer
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年05月09日 21:33, Michael S. Tsirkin wrote: I love this idea. Reviewed and discussed the idea in-person with MST during netdevconf[1] at this laptop. I promised I will also run it through my micro-benchmarking[2] once I return home (hint ptr_ring gets used in network stack as skb_array). I'm merging this through my tree. Any objections? Batch dequeuing series depends on this, maybe it's better to have this in that series. Let me post a V4 series with this. Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年05月09日 21:33, Michael S. Tsirkin wrote: I love this idea. Reviewed and discussed the idea in-person with MST during netdevconf[1] at this laptop. I promised I will also run it through my micro-benchmarking[2] once I return home (hint ptr_ring gets used in network stack as skb_array). I'm merging this through my tree. Any objections? Batch dequeuing series depends on this, maybe it's better to have this in that series. Let me post a V4 series with this. Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > On Fri, 7 Apr 2017 08:49:57 +0300 > "Michael S. Tsirkin"wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > situation when ring is almost full: as entries are consumed they are > > immediately used again by the producer, so consumer and producer are > > writing to a shared cache line. > > > > To fix this, add batching to consume calls: as entries are > > consumed do not write NULL into the ring until we get > > a multiple (in current implementation 2x) of cache lines > > away from the producer. At that point, write them all out. > > > > We do the write out in the reverse order to keep > > producer from sharing cache with consumer for as long > > as possible. > > > > Writeout also triggers when ring wraps around - there's > > no special reason to do this but it helps keep the code > > a bit simpler. > > > > What should we do if getting away from producer by 2 cache lines > > would mean we are keeping the ring moe than half empty? > > Maybe we should reduce the batching in this case, > > current patch simply reduces the batching. > > > > Notes: > > - it is no longer true that a call to consume guarantees > > that the following call to produce will succeed. > > No users seem to assume that. > > - batching can also in theory reduce the signalling rate: > > users that would previously send interrups to the producer > > to wake it up after consuming each entry would now only > > need to do this once in a batch. > > Doing this would be easy by returning a flag to the caller. > > No users seem to do signalling on consume yet so this was not > > implemented yet. > > > > Signed-off-by: Michael S. Tsirkin > > --- > > > > Jason, I am curious whether the following gives you some of > > the performance boost that you see with vhost batching > > patches. Is vhost batching on top still helpful? > > > > include/linux/ptr_ring.h | 63 > > +--- > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > index 6c70444..6b2e0dd 100644 > > --- a/include/linux/ptr_ring.h > > +++ b/include/linux/ptr_ring.h > > @@ -34,11 +34,13 @@ > > struct ptr_ring { > > int producer cacheline_aligned_in_smp; > > spinlock_t producer_lock; > > - int consumer cacheline_aligned_in_smp; > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > > + int consumer_tail; /* next entry to invalidate */ > > spinlock_t consumer_lock; > > /* Shared consumer/producer data */ > > /* Read-only by both the producer and the consumer */ > > int size cacheline_aligned_in_smp; /* max entries in queue */ > > + int batch; /* number of entries to consume in a batch */ > > void **queue; > > }; > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring > > *r, void *ptr) > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > { > > if (likely(r->size)) > > - return r->queue[r->consumer]; > > + return r->queue[r->consumer_head]; > > return NULL; > > } > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring > > *r) > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > { > > - r->queue[r->consumer++] = NULL; > > - if (unlikely(r->consumer >= r->size)) > > - r->consumer = 0; > > + /* Fundamentally, what we want to do is update consumer > > +* index and zero out the entry so producer can reuse it. > > +* Doing it naively at each consume would be as simple as: > > +* r->queue[r->consumer++] = NULL; > > +* if (unlikely(r->consumer >= r->size)) > > +* r->consumer = 0; > > +* but that is suboptimal when the ring is full as producer is writing > > +* out new entries in the same cache line. Defer these updates until a > > +* batch of entries has been consumed. > > +*/ > > + int head = r->consumer_head++; > > + > > + /* Once we have processed enough entries invalidate them in > > +* the ring all at once so producer can reuse their space in the ring. > > +* We also do this when we reach end of the ring - not mandatory > > +* but helps keep the implementation simple. > > +*/ > > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > > +r->consumer_head >= r->size)) { > > + /* Zero out entries in the reverse order: this way we touch the > > +* cache line that producer might currently be reading the last; > > +* producer won't make progress and touch other cache lines > > +* besides the first one until we write out all entries. > > +*/ > > + while (likely(head
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote: > On Fri, 7 Apr 2017 08:49:57 +0300 > "Michael S. Tsirkin" wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > situation when ring is almost full: as entries are consumed they are > > immediately used again by the producer, so consumer and producer are > > writing to a shared cache line. > > > > To fix this, add batching to consume calls: as entries are > > consumed do not write NULL into the ring until we get > > a multiple (in current implementation 2x) of cache lines > > away from the producer. At that point, write them all out. > > > > We do the write out in the reverse order to keep > > producer from sharing cache with consumer for as long > > as possible. > > > > Writeout also triggers when ring wraps around - there's > > no special reason to do this but it helps keep the code > > a bit simpler. > > > > What should we do if getting away from producer by 2 cache lines > > would mean we are keeping the ring moe than half empty? > > Maybe we should reduce the batching in this case, > > current patch simply reduces the batching. > > > > Notes: > > - it is no longer true that a call to consume guarantees > > that the following call to produce will succeed. > > No users seem to assume that. > > - batching can also in theory reduce the signalling rate: > > users that would previously send interrups to the producer > > to wake it up after consuming each entry would now only > > need to do this once in a batch. > > Doing this would be easy by returning a flag to the caller. > > No users seem to do signalling on consume yet so this was not > > implemented yet. > > > > Signed-off-by: Michael S. Tsirkin > > --- > > > > Jason, I am curious whether the following gives you some of > > the performance boost that you see with vhost batching > > patches. Is vhost batching on top still helpful? > > > > include/linux/ptr_ring.h | 63 > > +--- > > 1 file changed, 54 insertions(+), 9 deletions(-) > > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > > index 6c70444..6b2e0dd 100644 > > --- a/include/linux/ptr_ring.h > > +++ b/include/linux/ptr_ring.h > > @@ -34,11 +34,13 @@ > > struct ptr_ring { > > int producer cacheline_aligned_in_smp; > > spinlock_t producer_lock; > > - int consumer cacheline_aligned_in_smp; > > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > > + int consumer_tail; /* next entry to invalidate */ > > spinlock_t consumer_lock; > > /* Shared consumer/producer data */ > > /* Read-only by both the producer and the consumer */ > > int size cacheline_aligned_in_smp; /* max entries in queue */ > > + int batch; /* number of entries to consume in a batch */ > > void **queue; > > }; > > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring > > *r, void *ptr) > > static inline void *__ptr_ring_peek(struct ptr_ring *r) > > { > > if (likely(r->size)) > > - return r->queue[r->consumer]; > > + return r->queue[r->consumer_head]; > > return NULL; > > } > > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring > > *r) > > /* Must only be called after __ptr_ring_peek returned !NULL */ > > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > > { > > - r->queue[r->consumer++] = NULL; > > - if (unlikely(r->consumer >= r->size)) > > - r->consumer = 0; > > + /* Fundamentally, what we want to do is update consumer > > +* index and zero out the entry so producer can reuse it. > > +* Doing it naively at each consume would be as simple as: > > +* r->queue[r->consumer++] = NULL; > > +* if (unlikely(r->consumer >= r->size)) > > +* r->consumer = 0; > > +* but that is suboptimal when the ring is full as producer is writing > > +* out new entries in the same cache line. Defer these updates until a > > +* batch of entries has been consumed. > > +*/ > > + int head = r->consumer_head++; > > + > > + /* Once we have processed enough entries invalidate them in > > +* the ring all at once so producer can reuse their space in the ring. > > +* We also do this when we reach end of the ring - not mandatory > > +* but helps keep the implementation simple. > > +*/ > > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > > +r->consumer_head >= r->size)) { > > + /* Zero out entries in the reverse order: this way we touch the > > +* cache line that producer might currently be reading the last; > > +* producer won't make progress and touch other cache lines > > +* besides the first one until we write out all entries. > > +*/ > > + while (likely(head >= r->consumer_tail)) > > +
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月15日 06:50, Michael S. Tsirkin wrote: On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin--- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Just a thought: could you test dropping the consumer spinlock completely? Just around the peek? 2% improvement for dropping spinlock around peeking, 2% more for dropping spinlock for consuming. As I said previously, perf c2c tool should be helpful to locate sources latency related to cache. perf c2c indeeds shows some false sharing were reduced by this patch. But it does not show obvious different with batch dequeuing on top. Thanks Acked-by: Jason Wang Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月15日 06:50, Michael S. Tsirkin wrote: On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin --- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Just a thought: could you test dropping the consumer spinlock completely? Just around the peek? 2% improvement for dropping spinlock around peeking, 2% more for dropping spinlock for consuming. As I said previously, perf c2c tool should be helpful to locate sources latency related to cache. perf c2c indeeds shows some false sharing were reduced by this patch. But it does not show obvious different with batch dequeuing on top. Thanks Acked-by: Jason Wang Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月15日 05:00, Michael S. Tsirkin wrote: On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin--- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Acked-by: Jason Wang Thanks Fascinating. How do we explain the gain with batch dequeue? I count the drop rate (pktgen on tun and count tun tx) and maybe it can explain more or less: Before this patch: TX xmit 1.8Mpps Tx dropped 0.23Mpps Tx total 2.04Mpps 11% dropped After this patch: Tx xmit 1.95Mpps Tx dropped 0.33Mpps Tx total 2.28Mpps 14% dropped With batched dequeuing: Tx xmit 2.24Mpps Tx dropped 0.01Mpps Tx total 2.26Mpps ~0% dropped With this patch, we remove the cache contention by blocking the producer more or less. With batch dequeuing, the ring is not full in 99% of the cases which probably means the producer is not blocked for most of the time. Is it just the lock overhead? I remove the spinlocks for peeking and dequeuing on top of this patch. The tx pps were increased from ~2Mpps to ~2.08Mpps. So it was not only the lock overhead. Can you pls try to replace the lock with a simple non-fair atomic and see what happens? Not sure I get the idea, we are going for fast path of spinlocks for sure which is just a cmpxchg(). Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月15日 05:00, Michael S. Tsirkin wrote: On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin --- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Acked-by: Jason Wang Thanks Fascinating. How do we explain the gain with batch dequeue? I count the drop rate (pktgen on tun and count tun tx) and maybe it can explain more or less: Before this patch: TX xmit 1.8Mpps Tx dropped 0.23Mpps Tx total 2.04Mpps 11% dropped After this patch: Tx xmit 1.95Mpps Tx dropped 0.33Mpps Tx total 2.28Mpps 14% dropped With batched dequeuing: Tx xmit 2.24Mpps Tx dropped 0.01Mpps Tx total 2.26Mpps ~0% dropped With this patch, we remove the cache contention by blocking the producer more or less. With batch dequeuing, the ring is not full in 99% of the cases which probably means the producer is not blocked for most of the time. Is it just the lock overhead? I remove the spinlocks for peeking and dequeuing on top of this patch. The tx pps were increased from ~2Mpps to ~2.08Mpps. So it was not only the lock overhead. Can you pls try to replace the lock with a simple non-fair atomic and see what happens? Not sure I get the idea, we are going for fast path of spinlocks for sure which is just a cmpxchg(). Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: > > > On 2017年04月12日 16:03, Jason Wang wrote: > > > > > > On 2017年04月07日 13:49, Michael S. Tsirkin wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > >that the following call to produce will succeed. > > >No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > >users that would previously send interrups to the producer > > >to wake it up after consuming each entry would now only > > >need to do this once in a batch. > > >Doing this would be easy by returning a flag to the caller. > > >No users seem to do signalling on consume yet so this was not > > >implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin> > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > The patch looks good to me, will have a test for vhost batching patches. > > > > Thanks > > Still helpful: > > before this patch: 1.84Mpps > with this patch: 2.00Mpps > with batch dequeuing: 2.30Mpps Just a thought: could you test dropping the consumer spinlock completely? Just around the peek? As I said previously, perf c2c tool should be helpful to locate sources latency related to cache. > Acked-by: Jason Wang > > Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: > > > On 2017年04月12日 16:03, Jason Wang wrote: > > > > > > On 2017年04月07日 13:49, Michael S. Tsirkin wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > >that the following call to produce will succeed. > > >No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > >users that would previously send interrups to the producer > > >to wake it up after consuming each entry would now only > > >need to do this once in a batch. > > >Doing this would be easy by returning a flag to the caller. > > >No users seem to do signalling on consume yet so this was not > > >implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > The patch looks good to me, will have a test for vhost batching patches. > > > > Thanks > > Still helpful: > > before this patch: 1.84Mpps > with this patch: 2.00Mpps > with batch dequeuing: 2.30Mpps Just a thought: could you test dropping the consumer spinlock completely? Just around the peek? As I said previously, perf c2c tool should be helpful to locate sources latency related to cache. > Acked-by: Jason Wang > > Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: > > > On 2017年04月12日 16:03, Jason Wang wrote: > > > > > > On 2017年04月07日 13:49, Michael S. Tsirkin wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > >that the following call to produce will succeed. > > >No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > >users that would previously send interrups to the producer > > >to wake it up after consuming each entry would now only > > >need to do this once in a batch. > > >Doing this would be easy by returning a flag to the caller. > > >No users seem to do signalling on consume yet so this was not > > >implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin> > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > The patch looks good to me, will have a test for vhost batching patches. > > > > Thanks > > Still helpful: > > before this patch: 1.84Mpps > with this patch: 2.00Mpps > with batch dequeuing: 2.30Mpps > > Acked-by: Jason Wang > > Thanks Fascinating. How do we explain the gain with batch dequeue? Is it just the lock overhead? Can you pls try to replace the lock with a simple non-fair atomic and see what happens? -- MST
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote: > > > On 2017年04月12日 16:03, Jason Wang wrote: > > > > > > On 2017年04月07日 13:49, Michael S. Tsirkin wrote: > > > A known weakness in ptr_ring design is that it does not handle well the > > > situation when ring is almost full: as entries are consumed they are > > > immediately used again by the producer, so consumer and producer are > > > writing to a shared cache line. > > > > > > To fix this, add batching to consume calls: as entries are > > > consumed do not write NULL into the ring until we get > > > a multiple (in current implementation 2x) of cache lines > > > away from the producer. At that point, write them all out. > > > > > > We do the write out in the reverse order to keep > > > producer from sharing cache with consumer for as long > > > as possible. > > > > > > Writeout also triggers when ring wraps around - there's > > > no special reason to do this but it helps keep the code > > > a bit simpler. > > > > > > What should we do if getting away from producer by 2 cache lines > > > would mean we are keeping the ring moe than half empty? > > > Maybe we should reduce the batching in this case, > > > current patch simply reduces the batching. > > > > > > Notes: > > > - it is no longer true that a call to consume guarantees > > >that the following call to produce will succeed. > > >No users seem to assume that. > > > - batching can also in theory reduce the signalling rate: > > >users that would previously send interrups to the producer > > >to wake it up after consuming each entry would now only > > >need to do this once in a batch. > > >Doing this would be easy by returning a flag to the caller. > > >No users seem to do signalling on consume yet so this was not > > >implemented yet. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > > > > Jason, I am curious whether the following gives you some of > > > the performance boost that you see with vhost batching > > > patches. Is vhost batching on top still helpful? > > > > The patch looks good to me, will have a test for vhost batching patches. > > > > Thanks > > Still helpful: > > before this patch: 1.84Mpps > with this patch: 2.00Mpps > with batch dequeuing: 2.30Mpps > > Acked-by: Jason Wang > > Thanks Fascinating. How do we explain the gain with batch dequeue? Is it just the lock overhead? Can you pls try to replace the lock with a simple non-fair atomic and see what happens? -- MST
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin--- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Acked-by: Jason Wang Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月12日 16:03, Jason Wang wrote: On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin --- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks Still helpful: before this patch: 1.84Mpps with this patch: 2.00Mpps with batch dequeuing: 2.30Mpps Acked-by: Jason Wang Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin--- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On 2017年04月07日 13:49, Michael S. Tsirkin wrote: A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: Michael S. Tsirkin --- Jason, I am curious whether the following gives you some of the performance boost that you see with vhost batching patches. Is vhost batching on top still helpful? The patch looks good to me, will have a test for vhost batching patches. Thanks
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, 7 Apr 2017 08:49:57 +0300 "Michael S. Tsirkin"wrote: > A known weakness in ptr_ring design is that it does not handle well the > situation when ring is almost full: as entries are consumed they are > immediately used again by the producer, so consumer and producer are > writing to a shared cache line. > > To fix this, add batching to consume calls: as entries are > consumed do not write NULL into the ring until we get > a multiple (in current implementation 2x) of cache lines > away from the producer. At that point, write them all out. > > We do the write out in the reverse order to keep > producer from sharing cache with consumer for as long > as possible. > > Writeout also triggers when ring wraps around - there's > no special reason to do this but it helps keep the code > a bit simpler. > > What should we do if getting away from producer by 2 cache lines > would mean we are keeping the ring moe than half empty? > Maybe we should reduce the batching in this case, > current patch simply reduces the batching. > > Notes: > - it is no longer true that a call to consume guarantees > that the following call to produce will succeed. > No users seem to assume that. > - batching can also in theory reduce the signalling rate: > users that would previously send interrups to the producer > to wake it up after consuming each entry would now only > need to do this once in a batch. > Doing this would be easy by returning a flag to the caller. > No users seem to do signalling on consume yet so this was not > implemented yet. > > Signed-off-by: Michael S. Tsirkin > --- > > Jason, I am curious whether the following gives you some of > the performance boost that you see with vhost batching > patches. Is vhost batching on top still helpful? > > include/linux/ptr_ring.h | 63 > +--- > 1 file changed, 54 insertions(+), 9 deletions(-) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index 6c70444..6b2e0dd 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -34,11 +34,13 @@ > struct ptr_ring { > int producer cacheline_aligned_in_smp; > spinlock_t producer_lock; > - int consumer cacheline_aligned_in_smp; > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > + int consumer_tail; /* next entry to invalidate */ > spinlock_t consumer_lock; > /* Shared consumer/producer data */ > /* Read-only by both the producer and the consumer */ > int size cacheline_aligned_in_smp; /* max entries in queue */ > + int batch; /* number of entries to consume in a batch */ > void **queue; > }; > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, > void *ptr) > static inline void *__ptr_ring_peek(struct ptr_ring *r) > { > if (likely(r->size)) > - return r->queue[r->consumer]; > + return r->queue[r->consumer_head]; > return NULL; > } > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r) > /* Must only be called after __ptr_ring_peek returned !NULL */ > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > { > - r->queue[r->consumer++] = NULL; > - if (unlikely(r->consumer >= r->size)) > - r->consumer = 0; > + /* Fundamentally, what we want to do is update consumer > + * index and zero out the entry so producer can reuse it. > + * Doing it naively at each consume would be as simple as: > + * r->queue[r->consumer++] = NULL; > + * if (unlikely(r->consumer >= r->size)) > + * r->consumer = 0; > + * but that is suboptimal when the ring is full as producer is writing > + * out new entries in the same cache line. Defer these updates until a > + * batch of entries has been consumed. > + */ > + int head = r->consumer_head++; > + > + /* Once we have processed enough entries invalidate them in > + * the ring all at once so producer can reuse their space in the ring. > + * We also do this when we reach end of the ring - not mandatory > + * but helps keep the implementation simple. > + */ > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > + r->consumer_head >= r->size)) { > + /* Zero out entries in the reverse order: this way we touch the > + * cache line that producer might currently be reading the last; > + * producer won't make progress and touch other cache lines > + * besides the first one until we write out all entries. > + */ > + while (likely(head >= r->consumer_tail)) > + r->queue[head--] = NULL; > + r->consumer_tail = r->consumer_head; > + } > + if (unlikely(r->consumer_head >= r->size)) { > +
Re: [PATCH 1/3] ptr_ring: batch ring zeroing
On Fri, 7 Apr 2017 08:49:57 +0300 "Michael S. Tsirkin" wrote: > A known weakness in ptr_ring design is that it does not handle well the > situation when ring is almost full: as entries are consumed they are > immediately used again by the producer, so consumer and producer are > writing to a shared cache line. > > To fix this, add batching to consume calls: as entries are > consumed do not write NULL into the ring until we get > a multiple (in current implementation 2x) of cache lines > away from the producer. At that point, write them all out. > > We do the write out in the reverse order to keep > producer from sharing cache with consumer for as long > as possible. > > Writeout also triggers when ring wraps around - there's > no special reason to do this but it helps keep the code > a bit simpler. > > What should we do if getting away from producer by 2 cache lines > would mean we are keeping the ring moe than half empty? > Maybe we should reduce the batching in this case, > current patch simply reduces the batching. > > Notes: > - it is no longer true that a call to consume guarantees > that the following call to produce will succeed. > No users seem to assume that. > - batching can also in theory reduce the signalling rate: > users that would previously send interrups to the producer > to wake it up after consuming each entry would now only > need to do this once in a batch. > Doing this would be easy by returning a flag to the caller. > No users seem to do signalling on consume yet so this was not > implemented yet. > > Signed-off-by: Michael S. Tsirkin > --- > > Jason, I am curious whether the following gives you some of > the performance boost that you see with vhost batching > patches. Is vhost batching on top still helpful? > > include/linux/ptr_ring.h | 63 > +--- > 1 file changed, 54 insertions(+), 9 deletions(-) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index 6c70444..6b2e0dd 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -34,11 +34,13 @@ > struct ptr_ring { > int producer cacheline_aligned_in_smp; > spinlock_t producer_lock; > - int consumer cacheline_aligned_in_smp; > + int consumer_head cacheline_aligned_in_smp; /* next valid entry */ > + int consumer_tail; /* next entry to invalidate */ > spinlock_t consumer_lock; > /* Shared consumer/producer data */ > /* Read-only by both the producer and the consumer */ > int size cacheline_aligned_in_smp; /* max entries in queue */ > + int batch; /* number of entries to consume in a batch */ > void **queue; > }; > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, > void *ptr) > static inline void *__ptr_ring_peek(struct ptr_ring *r) > { > if (likely(r->size)) > - return r->queue[r->consumer]; > + return r->queue[r->consumer_head]; > return NULL; > } > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r) > /* Must only be called after __ptr_ring_peek returned !NULL */ > static inline void __ptr_ring_discard_one(struct ptr_ring *r) > { > - r->queue[r->consumer++] = NULL; > - if (unlikely(r->consumer >= r->size)) > - r->consumer = 0; > + /* Fundamentally, what we want to do is update consumer > + * index and zero out the entry so producer can reuse it. > + * Doing it naively at each consume would be as simple as: > + * r->queue[r->consumer++] = NULL; > + * if (unlikely(r->consumer >= r->size)) > + * r->consumer = 0; > + * but that is suboptimal when the ring is full as producer is writing > + * out new entries in the same cache line. Defer these updates until a > + * batch of entries has been consumed. > + */ > + int head = r->consumer_head++; > + > + /* Once we have processed enough entries invalidate them in > + * the ring all at once so producer can reuse their space in the ring. > + * We also do this when we reach end of the ring - not mandatory > + * but helps keep the implementation simple. > + */ > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch || > + r->consumer_head >= r->size)) { > + /* Zero out entries in the reverse order: this way we touch the > + * cache line that producer might currently be reading the last; > + * producer won't make progress and touch other cache lines > + * besides the first one until we write out all entries. > + */ > + while (likely(head >= r->consumer_tail)) > + r->queue[head--] = NULL; > + r->consumer_tail = r->consumer_head; > + } > + if (unlikely(r->consumer_head >= r->size)) { > + r->consumer_head = 0; > +