Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-10 Thread Eric Wong
Jason Baron  wrote:
> On 02/09/2015 11:49 PM, Eric Wong wrote:
> > Do you have a userland use case to share?
> 
> I've been trying to describe the use case, maybe I haven't been doing a good
> job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project which I'll hopefully be
able to share in a few weeks which might be a good public candidate for
epoll performance testing.

> > Did you try my suggestion of using a dedicated thread (or thread pool)
> > which does nothing but loop on accept() + EPOLL_CTL_ADD?
> >
> > Those dedicated threads could do its own round-robin in userland to pick
> > a different epollfd to call EPOLL_CTL_ADD on.
> 
> Thanks for your suggestion! I'm not actively working on the user-space
> code here, but I will pass it along.
> 
> I would prefer though not to have to context switch the 'accept' thread
> on and off the cpu every time there is a new connection. So the approach
> suggested here essentially moves this dedicated thread (threads), down
> into the kernel and avoids the creation of these threads entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
similar behavior to what used to be infinite in TCP_DEFER_ACCEPT in
Linux <= 2.6.31
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-10 Thread Jason Baron
On 02/09/2015 11:49 PM, Eric Wong wrote:
> Jason Baron  wrote:
>> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
>>> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron  wrote:
>>>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>>>> Epoll file descriptors that are added to a shared wakeup source are
>>>>>> always added in a non-exclusive manner. That means that when we have
>>>>>> multiple epoll fds attached to a shared wakeup source they are all
>>>>>> woken up. This can lead to excessive cpu usage and uneven load
>>>>>> distribution.
>>>>>>
>>>>>> This patch introduces two new 'events' flags that are intended to be
>>>>>> used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd
>>>>>> to the event source in an exclusive manner such that the minimum
>>>>>> number of threads are woken. EPOLLROUNDROBIN, which depends on
>>>>>> EPOLLEXCLUSIVE also being set, can also be added to the 'events'
>>>>>> flag, such that we round robin around the set of waiting threads.
>>>>>>
>>>>>> An implementation note is that in the epoll wakeup routine,
>>>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
>>>>>> successful wakeup, only when there are current waiters. The idea is
>>>>>> to use this additional heuristic in order minimize wakeup latencies.
>>>>> I don't understand what this is intended to do.
>>>>>
>>>>> If an event has EPOLLONESHOT, then this only one thread should be woken
>>>>> regardless, right?  If not, isn't that just a bug that should be fixed?
>>>>>
>>>> hmm...so with EPOLLONESHOT you basically get notified once about an event.
>>>> If i have multiple epoll fds (say 1 per-thread) attached to a single
>>>> source in EPOLLONESHOT, then all threads will potentially get woken up
>>>> once per event. Then, I would have to re-arm all of them. So I don't think
>>>> this addresses this particular usecase...what I am trying to avoid is this
>>>> mass wakeup or thundering herd for a shared event source.
>>> Now I understand.  Why are you using multiple epollfds?
>>>
>>> --Andy
>> So the multiple epollfds is really a way to partition the set of
>> events. Otherwise, I have all the threads contending on all the events
>> that are being generated. So I'm not sure if that is scalable.
> I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
> maxevents value is sufficient for you.  All events would be shared, so
> they can migrate between threads(*).  Each thread takes a largish set of
> events on every epoll_wait call and doesn't call epoll_wait again until
> it's done with the whole set it got.
>
> You'll hit more contention on EPOLL_CTL_MOD with shared events and a
> single epoll, but I think it's a better goal to make that lock-free.

It's not just EPOLL_CTL_MOD; there will also be contention on epoll add
and remove, since there is only one epoll fd in this case. I'm also
concerned about the balancing of the workload among threads in the
single-queue case. I think it's quite reasonable to have user space
partition the set of events among threads as it sees fit using multiple
epoll fds.

However, this multiple-epoll-fd scheme currently does not handle events
from a shared event source well. As I mentioned, there is a thundering
herd wakeup in this case, and the wakeups are unbalanced. In fact, we have
an application that currently does EPOLL_CTL_DELs followed by
EPOLL_CTL_ADDs periodically against a shared wakeup source in order to
re-balance the wakeup queues. This solves the balancing issues to an
extent, but not the thundering herd. I'd like to move this logic down into
the kernel with this patch set.
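
To make the shared-source case concrete, here is a minimal user-space
sketch (it is not taken from the application mentioned above). It registers
one shared fd in several epoll instances and counts how many of them report
the same event. A pipe stands in for the listen socket, and
count_ready_epfds() and MAX_EPFDS are hypothetical names. It only shows
that every epoll fd sees the event; the wakeup cost comes from a thread
blocked on each of those epoll fds being woken for it.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EPFDS 64	/* arbitrary upper bound for this sketch */

/*
 * Register one shared fd (the read end of a pipe) in n separate epoll
 * instances, make it readable once, and count how many instances
 * report it. Returns that count, or -1 on error.
 */
int count_ready_epfds(int n)
{
	int pipefd[2], epfds[MAX_EPFDS];
	int i, made = 0, ready = 0, ret = -1;
	struct epoll_event ev = { .events = EPOLLIN }, out;

	assert(n >= 1 && n <= MAX_EPFDS);
	if (pipe(pipefd) < 0)
		return -1;
	ev.data.fd = pipefd[0];

	/* One epoll fd per (imaginary) worker thread, all watching the same fd. */
	for (i = 0; i < n; i++) {
		epfds[i] = epoll_create1(0);
		if (epfds[i] < 0)
			goto done;
		made++;
		if (epoll_ctl(epfds[i], EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
			goto done;
	}

	/* A single event on the shared source. */
	if (write(pipefd[1], "x", 1) != 1)
		goto done;

	/* Poll every epoll instance without blocking. */
	for (i = 0; i < n; i++)
		if (epoll_wait(epfds[i], &out, 1, 0) == 1)
			ready++;
	ret = ready;
done:
	for (i = 0; i < made; i++)
		close(epfds[i]);
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```

With the existing non-exclusive registration, count_ready_epfds(4) should
return 4: one event, and all four epoll fds report it.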

> (*) Too large a maxevents will lead to head-of-line blocking, but from
> what I'm inferring, you already risk that with multiple epollfds and
> separate threads working on them.
>
> Do you have a userland use case to share?

I've been trying to describe the use case, maybe I haven't been doing a good
job :(

>> In the use-case I'm trying to describe, I've partitioned a large set
>> of the events, but there may still be some event sources that we wish
>> to share among all of the threads (or even subsets of them), so as not
>> to overload any one in particular.
>  
>> More specifically, in the case of a single listen socket, its natural
>> to call accept() on the thread that has been woken up, but without
>> doing round robin, you quickly get into a very unbalanced load, and in
>> addition you waste a lot of cpu doing unnecessary wakeups. There are
>> other approaches to solve this, specifically using SO_REUSEPORT, which
>> creates a separate socket per-thread and gets one back to the
>> separately partitioned events case previously described. However,
>> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
>> does not have knowledge of the threads that are actively waiting as
>> the epoll code does.
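> Did you try my suggestion of using a dedicated thread (or thread pool)
> which does nothing but loop on accept() + EPOLL_CTL_ADD?
>
> Those dedicated threads could do its own round-robin in userland to pick
> a different epollfd to call EPOLL_CTL_ADD on.

Thanks for your suggestion! I'm not actively working on the user-space
code here, but I will pass it along.

I would prefer though not to have to context switch the 'accept' thread
on and off the cpu every time there is a new connection. So the approach
suggested here essentially moves this dedicated thread (threads), down
into the kernel and avoids the creation of these threads entirely.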

Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Eric Wong
Jason Baron  wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron  wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
> >>>> Epoll file descriptors that are added to a shared wakeup source are
> >>>> always added in a non-exclusive manner. That means that when we have
> >>>> multiple epoll fds attached to a shared wakeup source they are all
> >>>> woken up. This can lead to excessive cpu usage and uneven load
> >>>> distribution.
> >>>>
> >>>> This patch introduces two new 'events' flags that are intended to be
> >>>> used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd
> >>>> to the event source in an exclusive manner such that the minimum
> >>>> number of threads are woken. EPOLLROUNDROBIN, which depends on
> >>>> EPOLLEXCLUSIVE also being set, can also be added to the 'events'
> >>>> flag, such that we round robin around the set of waiting threads.
> >>>>
> >>>> An implementation note is that in the epoll wakeup routine,
> >>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
> >>>> successful wakeup, only when there are current waiters. The idea is
> >>>> to use this additional heuristic in order minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then this only one thread should be woken 
> >>> regardless, right?  If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an event. 
> >> If i have multiple epoll fds (say 1 per-thread) attached to a single 
> >> source in EPOLLONESHOT, then all threads will potentially get woken up 
> >> once per event. Then, I would have to re-arm all of them. So I don't think 
> >> this addresses this particular usecase...what I am trying to avoid is this 
> >> mass wakeup or thundering herd for a shared event source.
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
> 
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.
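
A minimal sketch of one worker iteration under that scheme, assuming a
single shared epoll fd. BATCH, rearm(), drain_one_byte() and
worker_iteration() are made-up names, and the batch size is an arbitrary
choice:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

#define BATCH 64	/* the "largish" maxevents; the value is an assumption */

/* Re-enable a oneshot fd once the worker has finished with it. */
int rearm(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };

	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Example handler: consume a single byte from the fd. */
void drain_one_byte(int fd)
{
	char c;

	if (read(fd, &c, 1) < 0)
		return;		/* errors are ignored in this sketch */
}

/*
 * Take up to BATCH events in one epoll_wait() call, handle all of them,
 * re-arm each fd, and only then return so the caller can loop back into
 * epoll_wait(). Returns the number of events handled, or -1 on error.
 */
int worker_iteration(int epfd, int timeout_ms, void (*handle)(int fd))
{
	struct epoll_event evs[BATCH];
	int i, n;

	assert(handle != NULL);
	n = epoll_wait(epfd, evs, BATCH, timeout_ms);
	for (i = 0; i < n; i++) {
		handle(evs[i].data.fd);
		if (rearm(epfd, evs[i].data.fd) < 0)
			return -1;
	}
	return n;
}
```

Each fd is disabled from the moment it is returned until rearm() runs, so
no other thread sharing the epoll fd is handed the same fd in between. The
EPOLL_CTL_MOD call per event is where the contention mentioned above would
show up.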

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
>
> More specifically, in the case of a single listen socket, its natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do their own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
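
A rough sketch of that idea, assuming one epoll fd per worker thread.
struct rr_dispatch, rr_dispatch_add() and accept_loop() are made-up names,
and EPOLLIN | EPOLLONESHOT is only one possible choice of event mask:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Round-robin state owned by the dedicated accept thread. */
struct rr_dispatch {
	const int *epfds;	/* one epoll fd per worker thread */
	unsigned int nepfds;
	unsigned int next;	/* slot that gets the next connection */
};

/*
 * Hand fd to the next worker's epoll fd. Returns the slot used, or -1
 * if EPOLL_CTL_ADD failed (the slot is not advanced in that case).
 */
int rr_dispatch_add(struct rr_dispatch *rr, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
	unsigned int slot = rr->next;

	assert(rr->nepfds > 0);
	ev.data.fd = fd;
	if (epoll_ctl(rr->epfds[slot], EPOLL_CTL_ADD, fd, &ev) < 0)
		return -1;
	rr->next = (slot + 1) % rr->nepfds;
	return (int)slot;
}

/* The dedicated thread: nothing but accept() + EPOLL_CTL_ADD. */
void accept_loop(int listen_fd, struct rr_dispatch *rr)
{
	for (;;) {
		int fd = accept(listen_fd, NULL, NULL);

		if (fd < 0) {
			if (errno == EINTR)
				continue;
			break;		/* give up on any other error */
		}
		if (rr_dispatch_add(rr, fd) < 0)
			close(fd);
	}
}
```

Only the accept thread ever touches the listen socket here, so the workers
never wait on a shared event source at all.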


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Jason Baron
On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron  wrote:
>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>> Epoll file descriptors that are added to a shared wakeup source are
>>>> always added in a non-exclusive manner. That means that when we have
>>>> multiple epoll fds attached to a shared wakeup source they are all
>>>> woken up. This can lead to excessive cpu usage and uneven load
>>>> distribution.
>>>>
>>>> This patch introduces two new 'events' flags that are intended to be
>>>> used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd
>>>> to the event source in an exclusive manner such that the minimum
>>>> number of threads are woken. EPOLLROUNDROBIN, which depends on
>>>> EPOLLEXCLUSIVE also being set, can also be added to the 'events'
>>>> flag, such that we round robin around the set of waiting threads.
>>>>
>>>> An implementation note is that in the epoll wakeup routine,
>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
>>>> successful wakeup, only when there are current waiters. The idea is
>>>> to use this additional heuristic in order minimize wakeup latencies.
>>> I don't understand what this is intended to do.
>>>
>>> If an event has EPOLLONESHOT, then this only one thread should be woken 
>>> regardless, right?  If not, isn't that just a bug that should be fixed?
>>>
>> hmm...so with EPOLLONESHOT you basically get notified once about an event. 
>> If i have multiple epoll fds (say 1 per-thread) attached to a single source 
>> in EPOLLONESHOT, then all threads will potentially get woken up once per 
>> event. Then, I would have to re-arm all of them. So I don't think this 
>> addresses this particular usecase...what I am trying to avoid is this mass 
>> wakeup or thundering herd for a shared event source.
> Now I understand.  Why are you using multiple epollfds?
>
> --Andy

So the multiple epollfds are really a way to partition the set of events.
Otherwise, I have all the threads contending on all the events that are being
generated, and I'm not sure that is scalable.

In the use case I'm trying to describe, I've partitioned a large set of the
events, but there may still be some event sources that we wish to share among
all of the threads (or even subsets of them), so as not to overload any one in
particular.

More specifically, in the case of a single listen socket, it's natural to call
accept() on the thread that has been woken up, but without doing round robin,
you quickly get into a very unbalanced load, and in addition you waste a lot of
cpu doing unnecessary wakeups. There are other approaches to solve this,
specifically using SO_REUSEPORT, which creates a separate socket per thread and
gets one back to the separately partitioned events case previously described.
However, SO_REUSEPORT, I believe, is very specific to tcp/udp, and in addition
does not have knowledge of the threads that are actively waiting, as the epoll
code does.

Thanks,

-Jason


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Andy Lutomirski
On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron  wrote:
> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>> Epoll file descriptors that are added to a shared wakeup source are always
>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>> fds attached to a shared wakeup source they are all woken up. This can
>>> lead to excessive cpu usage and uneven load distribution.
>>>
>>> This patch introduces two new 'events' flags that are intended to be used
>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the 
>>> event
>>> source in an exclusive manner such that the minimum number of threads are
>>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>>> also be added to the 'events' flag, such that we round robin around the set
>>> of waiting threads.
>>>
>>> An implementation note is that in the epoll wakeup routine,
>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a 
>>> successful
>>> wakeup, only when there are current waiters. The idea is to use this 
>>> additional
>>> heuristic in order minimize wakeup latencies.
>>
>> I don't understand what this is intended to do.
>>
>> If an event has EPOLLONESHOT, then this only one thread should be woken 
>> regardless, right?  If not, isn't that just a bug that should be fixed?
>>
>
> hmm...so with EPOLLONESHOT you basically get notified once about an event. If 
> i have multiple epoll fds (say 1 per-thread) attached to a single source in 
> EPOLLONESHOT, then all threads will potentially get woken up once per event. 
> Then, I would have to re-arm all of them. So I don't think this addresses 
> this particular usecase...what I am trying to avoid is this mass wakeup or 
> thundering herd for a shared event source.

Now I understand.  Why are you using multiple epollfds?

--Andy

>
>> If an event has EPOLLET, then the considerations are similar to 
>> EPOLLONESHOT, right?
>>
>
> EPOLLET is still going to cause this thundering herd.
>
>> If an event is a normal level-triggered non-one-shot event, then I don't 
>> understand how a round-robin wakeup makes any sense.  It's level-triggered, 
>> after all.
>
> Yeah, so the current behavior is to wake up all of the threads. I'm trying to 
> add a new mode where it load balances among the threads interested in the 
> event. Perhaps, the test program I attached to 0/2 will show the issue better?
>
> Also, this originally came up in the context of a single listening socket 
> which was attached to multiple epoll fds each in a separate thread. With the 
> attached patch, I can measure a large decrease in cpu usage and better 
> balancing behavior among the accepting threads.
>
> Thanks,
>
> -Jason



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Jason Baron
On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> On 02/09/2015 12:06 PM, Jason Baron wrote:
>> Epoll file descriptors that are added to a shared wakeup source are always
>> added in a non-exclusive manner. That means that when we have multiple epoll
>> fds attached to a shared wakeup source they are all woken up. This can
>> lead to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are intended to be used
>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
>> source in an exclusive manner such that the minimum number of threads are
>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>> also be added to the 'events' flag, such that we round robin around the set
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup routine,
>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a 
>> successful
>> wakeup, only when there are current waiters. The idea is to use this 
>> additional
>> heuristic in order minimize wakeup latencies.
>
> I don't understand what this is intended to do.
>
> If an event has EPOLLONESHOT, then this only one thread should be woken 
> regardless, right?  If not, isn't that just a bug that should be fixed?
>

hmm...so with EPOLLONESHOT you basically get notified once about an event. If I
have multiple epoll fds (say one per thread) attached to a single source with
EPOLLONESHOT, then all threads will potentially get woken up once per event.
Then I would have to re-arm all of them. So I don't think this addresses this
particular use case...what I am trying to avoid is this mass wakeup, or
thundering herd, for a shared event source.

> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT, 
> right?
>

EPOLLET is still going to cause this thundering herd.

> If an event is a normal level-triggered non-one-shot event, then I don't 
> understand how a round-robin wakeup makes any sense.  It's level-triggered, 
> after all.

Yeah, so the current behavior is to wake up all of the threads. I'm trying to
add a new mode where it load balances among the threads interested in the
event. Perhaps the test program I attached to 0/2 will show the issue better?

Also, this originally came up in the context of a single listening socket which
was attached to multiple epoll fds, each in a separate thread. With the attached
patch, I can measure a large decrease in cpu usage and better balancing
behavior among the accepting threads.
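
The flag rules from the patch description quoted above can be restated as a
small user-space predicate. This is only a sketch that mirrors the check
the patch adds to epoll_ctl(); PROPOSED_EPOLLEXCLUSIVE,
PROPOSED_EPOLLROUNDROBIN and proposed_flags_ok() are names invented here
(the bit values are the ones in the patch), and nothing below is a real
kernel or libc interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Bit values as proposed in the patch; prefixed to avoid header clashes. */
#define PROPOSED_EPOLLROUNDROBIN (1u << 27)
#define PROPOSED_EPOLLEXCLUSIVE  (1u << 28)

/*
 * Return true if the patch's epoll_ctl() check would let this request
 * through. target_is_epoll says whether the fd being added is itself
 * an epoll fd.
 */
bool proposed_flags_ok(int op, uint32_t events, bool target_is_epoll)
{
	uint32_t wait_flags = events &
		(PROPOSED_EPOLLEXCLUSIVE | PROPOSED_EPOLLROUNDROBIN);

	assert(op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD ||
	       op == EPOLL_CTL_DEL);

	if (!wait_flags)
		return true;		/* neither flag requested */
	if (op == EPOLL_CTL_MOD)
		return false;		/* MOD with either flag is rejected */
	if (op == EPOLL_CTL_ADD) {
		if (wait_flags == PROPOSED_EPOLLROUNDROBIN)
			return false;	/* round robin needs exclusive too */
		if (target_is_epoll)
			return false;	/* not allowed on nested epoll fds */
	}
	return true;
}
```

So EPOLLIN | EXCLUSIVE and EPOLLIN | EXCLUSIVE | ROUNDROBIN pass on
EPOLL_CTL_ADD, while ROUNDROBIN alone, any EPOLL_CTL_MOD carrying either
flag, and an add whose target is another epoll fd are all refused.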

Thanks,

-Jason


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]


On Mon, Feb 9, 2015 at 9:06 PM, Jason Baron  wrote:
> Epoll file descriptors that are added to a shared wakeup source are always
> added in a non-exclusive manner. That means that when we have multiple epoll
> fds attached to a shared wakeup source they are all woken up. This can
> lead to excessive cpu usage and uneven load distribution.
>
> This patch introduces two new 'events' flags that are intended to be used
> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> source in an exclusive manner such that the minimum number of threads are
> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> also be added to the 'events' flag, such that we round robin around the set
> of waiting threads.
>
> An implementation note is that in the epoll wakeup routine,
> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> wakeup, only when there are current waiters. The idea is to use this 
> additional
> heuristic in order minimize wakeup latencies.
>
> Signed-off-by: Jason Baron 
> ---
>  fs/eventpoll.c | 25 -
>  include/uapi/linux/eventpoll.h |  6 ++
>  2 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index d77f944..382c832 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -92,7 +92,8 @@
>   */
>
>  /* Epoll private bits inside the event mask */
> -#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
> +#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
> +			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
>
>  /* Maximum number of nesting allowed inside epoll sets */
>  #define EP_MAX_NESTS 4
> @@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
> +	int ewake = 0;
>
>  	if ((unsigned long)key & POLLFREE) {
>  		ep_pwq_from_wait(wait)->whead = NULL;
> @@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
>  	 * wait list.
>  	 */
> -	if (waitqueue_active(&ep->wq))
> +	if (waitqueue_active(&ep->wq)) {
> +		ewake = 1;
>  		wake_up_locked(&ep->wq);
> +	}
>  	if (waitqueue_active(&ep->poll_wait))
>  		pwake++;
>
> @@ -1078,6 +1082,8 @@ out_unlock:
>  	if (pwake)
>  		ep_poll_safewake(&ep->poll_wait);
>
> +	if (epi->event.events & EPOLLROUNDROBIN)
> +		return ewake;
>  	return 1;
>  }
>
> @@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>  		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>  		pwq->whead = whead;
>  		pwq->base = epi;
> -		add_wait_queue(whead, &pwq->wait);
> +		if (epi->event.events & EPOLLROUNDROBIN)
> +			add_wait_queue_rr(whead, &pwq->wait);
> +		else if (epi->event.events & EPOLLEXCLUSIVE)
> +			add_wait_queue_exclusive(whead, &pwq->wait);
> +		else
> +			add_wait_queue(whead, &pwq->wait);
>  		list_add_tail(&pwq->llink, &epi->pwqlist);
>  		epi->nwait++;
>  	} else {
> @@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
>  SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  		struct epoll_event __user *, event)
>  {
> -	int error;
> -	int full_check = 0;
> +	int error, full_check = 0, wait_flags = 0;
>  	struct fd f, tf;
>  	struct eventpoll *ep;
>  	struct epitem *epi;
> @@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>  	if (f.file == tf.file || !is_file_epoll(f.file))
>  		goto error_tgt_fput;
>
> +	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
> +	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
> +	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
> +		goto error_tgt_fput;
> +
> /*
>  * At this point it is safe to assume that the "private_data" contains
>  * our own data structure.
> diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
> index bc81fb2..10260a1 100644
> --- a/include/uapi/linux/eventpoll.h
> +++ b/include/uapi/linux/eventpoll.h
> @@ -26,6 +26,12 @@
>  #define EPOLL_CTL_DEL 2
>  #define EPOLL_CTL_MOD 3
>
> +/* Balance wakeups for a shared event source */
> +#define EPOLLROUNDROBIN (1 << 27)
> +
> +/* Add exclusively */
> +#define EPOLLEXCLUSIVE (1 << 28)
> +
>  /*
>   * Request the handling of system wakeup events so as to prevent system 
> 

Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Andy Lutomirski

On 02/09/2015 12:06 PM, Jason Baron wrote:
> Epoll file descriptors that are added to a shared wakeup source are always
> added in a non-exclusive manner. That means that when we have multiple epoll
> fds attached to a shared wakeup source they are all woken up. This can
> lead to excessive cpu usage and uneven load distribution.
>
> This patch introduces two new 'events' flags that are intended to be used
> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the event
> source in an exclusive manner such that the minimum number of threads are
> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> also be added to the 'events' flag, such that we round robin around the set
> of waiting threads.
>
> An implementation note is that in the epoll wakeup routine,
> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> wakeup, only when there are current waiters. The idea is to use this additional
> heuristic in order minimize wakeup latencies.

I don't understand what this is intended to do.

If an event has EPOLLONESHOT, then only one thread should be woken
regardless, right?  If not, isn't that just a bug that should be fixed?


If an event has EPOLLET, then the considerations are similar to 
EPOLLONESHOT, right?


If an event is a normal level-triggered non-one-shot event, then I don't 
understand how a round-robin wakeup makes any sense.  It's 
level-triggered, after all.


--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Jason Baron
Epoll file descriptors that are added to a shared wakeup source are always
added in a non-exclusive manner. That means that when we have multiple epoll
fds attached to a shared wakeup source they are all woken up. This can
lead to excessive cpu usage and uneven load distribution.

This patch introduces two new 'events' flags that are intended to be used
with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
source in an exclusive manner such that the minimum number of threads are
woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
also be added to the 'events' flag, such that we round robin around the set
of waiting threads.

An implementation note is that in the epoll wakeup routine,
'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
wakeup, only when there are current waiters. The idea is to use this additional
heuristic in order to minimize wakeup latencies.

Signed-off-by: Jason Baron <jba...@akamai.com>
---
 fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
 include/uapi/linux/eventpoll.h |  6 ++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d77f944..382c832 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,8 @@
  */
 
 /* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | \
+			 EPOLLEXCLUSIVE | EPOLLROUNDROBIN)
 
 /* Maximum number of nesting allowed inside epoll sets */
 #define EP_MAX_NESTS 4
@@ -1002,6 +1003,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	int ewake = 0;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1068,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
 	 * wait list.
 	 */
-	if (waitqueue_active(&ep->wq))
+	if (waitqueue_active(&ep->wq)) {
+		ewake = 1;
 		wake_up_locked(&ep->wq);
+	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;
 
@@ -1078,6 +1082,8 @@ out_unlock:
 	if (pwake)
 		ep_poll_safewake(&ep->poll_wait);
 
+	if (epi->event.events & EPOLLROUNDROBIN)
+		return ewake;
 	return 1;
 }
 
@@ -1095,7 +1101,12 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
 		pwq->whead = whead;
 		pwq->base = epi;
-		add_wait_queue(whead, &pwq->wait);
+		if (epi->event.events & EPOLLROUNDROBIN)
+			add_wait_queue_rr(whead, &pwq->wait);
+		else if (epi->event.events & EPOLLEXCLUSIVE)
+			add_wait_queue_exclusive(whead, &pwq->wait);
+		else
+			add_wait_queue(whead, &pwq->wait);
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1820,8 +1831,7 @@ SYSCALL_DEFINE1(epoll_create, int, size)
 SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 		struct epoll_event __user *, event)
 {
-	int error;
-	int full_check = 0;
+	int error, full_check = 0, wait_flags = 0;
 	struct fd f, tf;
 	struct eventpoll *ep;
 	struct epitem *epi;
@@ -1861,6 +1871,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	if (f.file == tf.file || !is_file_epoll(f.file))
 		goto error_tgt_fput;
 
+	wait_flags = epds.events & (EPOLLEXCLUSIVE | EPOLLROUNDROBIN);
+	if (wait_flags && ((op == EPOLL_CTL_MOD) || ((op == EPOLL_CTL_ADD) &&
+	    ((wait_flags == EPOLLROUNDROBIN) || (is_file_epoll(tf.file))))))
+		goto error_tgt_fput;
+
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..10260a1 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,12 @@
 #define EPOLL_CTL_DEL 2
 #define EPOLL_CTL_MOD 3
 
+/* Balance wakeups for a shared event source */
+#define EPOLLROUNDROBIN (1 << 27)
+
+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
 /*
  * Request the handling of system wakeup events so as to prevent system suspends
  * from happening while those events are being processed.
-- 
1.8.2.rc2
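
The flag check that the patch adds to epoll_ctl() can be restated as a
standalone predicate, which may make the rules easier to read than the nested
condition. This is a minimal user-space sketch, not kernel code: the helper
name wait_flags_rejected(), the SK_ prefixed constants and the target_is_epoll
parameter (standing in for is_file_epoll(tf.file)) are made up for
illustration; only the bit values and the logic come from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Op codes as in include/uapi/linux/eventpoll.h; flag bits as proposed in the patch. */
#define SK_EPOLL_CTL_ADD   1
#define SK_EPOLL_CTL_DEL   2
#define SK_EPOLL_CTL_MOD   3
#define SK_EPOLLROUNDROBIN (1u << 27)
#define SK_EPOLLEXCLUSIVE  (1u << 28)

/*
 * Hypothetical helper: returns true when the check added by the patch
 * would send epoll_ctl() to its error path because of the wait flags.
 */
static bool wait_flags_rejected(int op, uint32_t events, bool target_is_epoll)
{
	uint32_t wait_flags = events & (SK_EPOLLEXCLUSIVE | SK_EPOLLROUNDROBIN);

	assert(op == SK_EPOLL_CTL_ADD || op == SK_EPOLL_CTL_DEL ||
	       op == SK_EPOLL_CTL_MOD);

	if (!wait_flags)
		return false;	/* neither flag set: the new check is a no-op */
	if (op == SK_EPOLL_CTL_MOD)
		return true;	/* the flags may only be given at ADD time */
	if (op == SK_EPOLL_CTL_ADD)	/* RR alone, or an epoll fd as target */
		return wait_flags == SK_EPOLLROUNDROBIN || target_is_epoll;
	return false;		/* other ops are not covered by this check */
}
```

Read this way, EPOLLEXCLUSIVE alone and EPOLLEXCLUSIVE | EPOLLROUNDROBIN are
accepted on EPOLL_CTL_ADD for a non-epoll target, while EPOLLROUNDROBIN on its
own, any use with EPOLL_CTL_MOD, and nesting an epoll fd are refused.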


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]


On Mon, Feb 9, 2015 at 9:06 PM, Jason Baron <jba...@akamai.com> wrote:
> Epoll file descriptors that are added to a shared wakeup source are always
> added in a non-exclusive manner. That means that when we have multiple epoll
> fds attached to a shared wakeup source they are all woken up. This can
> lead to excessive cpu usage and uneven load distribution.
>
> This patch introduces two new 'events' flags that are intended to be used
> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
> source in an exclusive manner such that the minimum number of threads are
> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
> also be added to the 'events' flag, such that we round robin around the set
> of waiting threads.
>
> An implementation note is that in the epoll wakeup routine,
> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a successful
> wakeup, only when there are current waiters. The idea is to use this
> additional heuristic in order to minimize wakeup latencies.



Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Jason Baron
On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> On 02/09/2015 12:06 PM, Jason Baron wrote:
>> Epoll file descriptors that are added to a shared wakeup source are always
>> added in a non-exclusive manner. That means that when we have multiple epoll
>> fds attached to a shared wakeup source they are all woken up. This can
>> lead to excessive cpu usage and uneven load distribution.
>>
>> This patch introduces two new 'events' flags that are intended to be used
>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the event
>> source in an exclusive manner such that the minimum number of threads are
>> woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, can
>> also be added to the 'events' flag, such that we round robin around the set
>> of waiting threads.
>>
>> An implementation note is that in the epoll wakeup routine,
>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
>> successful wakeup, only when there are current waiters. The idea is to use
>> this additional heuristic in order to minimize wakeup latencies.
>
> I don't understand what this is intended to do.
>
> If an event has EPOLLONESHOT, then only one thread should be woken
> regardless, right?  If not, isn't that just a bug that should be fixed?

hmm...so with EPOLLONESHOT you basically get notified once about an event. If I
have multiple epoll fds (say 1 per-thread) attached to a single source in
EPOLLONESHOT, then all threads will potentially get woken up once per event.
Then, I would have to re-arm all of them. So I don't think this addresses this
particular use case...what I am trying to avoid is this mass wakeup or
thundering herd for a shared event source.

> If an event has EPOLLET, then the considerations are similar to EPOLLONESHOT,
> right?

EPOLLET is still going to cause this thundering herd.

> If an event is a normal level-triggered non-one-shot event, then I don't
> understand how a round-robin wakeup makes any sense.  It's level-triggered,
> after all.

Yeah, so the current behavior is to wake up all of the threads. I'm trying to 
add a new mode where it load balances among the threads interested in the 
event. Perhaps, the test program I attached to 0/2 will show the issue better?

Also, this originally came up in the context of a single listening socket which 
was attached to multiple epoll fds each in a separate thread. With the attached 
patch, I can measure a large decrease in cpu usage and better balancing 
behavior among the accepting threads.

Thanks,

-Jason
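
The "wake up all of the threads" behavior described in this message can be
observed from user space without any new flags. The sketch below is an
illustration under stated assumptions, not the test program attached to 0/2:
it uses an eventfd as the shared event source instead of a listening socket,
polls with a zero timeout instead of blocking threads, and the helper name
count_ready_epfds() is invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MAX_EPFDS 16

/*
 * Attach one shared source to n epoll instances (plain EPOLLIN, i.e. the
 * non-exclusive mode that exists today), signal it once, and count how many
 * epoll instances report it ready. Returns that count, or -1 on error.
 */
static int count_ready_epfds(int n)
{
	int epfds[MAX_EPFDS];
	int made = 0, ready = 0, src_fd, i;
	uint64_t one = 1;

	assert(n > 0 && n <= MAX_EPFDS);

	src_fd = eventfd(0, EFD_NONBLOCK);
	if (src_fd < 0)
		return -1;

	for (i = 0; i < n; i++) {
		struct epoll_event ev = { .events = EPOLLIN };

		ev.data.fd = src_fd;
		epfds[i] = epoll_create1(0);
		if (epfds[i] < 0)
			goto fail;
		made++;
		if (epoll_ctl(epfds[i], EPOLL_CTL_ADD, src_fd, &ev) < 0)
			goto fail;
	}

	/* A single event on the shared source... */
	if (write(src_fd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		goto fail;

	/* ...is reported by every epoll instance it was added to. */
	for (i = 0; i < n; i++) {
		struct epoll_event out;

		if (epoll_wait(epfds[i], &out, 1, 0) == 1)
			ready++;
	}
	goto out;
fail:
	ready = -1;
out:
	for (i = 0; i < made; i++)
		close(epfds[i]);
	close(src_fd);
	return ready;
}
```

With one thread blocked in epoll_wait() per epoll fd, each of those reports
corresponds to a thread being woken for the same event, which is the herd the
patch is trying to avoid.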


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Andy Lutomirski
On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jba...@akamai.com> wrote:
> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>> Epoll file descriptors that are added to a shared wakeup source are always
>>> added in a non-exclusive manner. That means that when we have multiple epoll
>>> fds attached to a shared wakeup source they are all woken up. This can
>>> lead to excessive cpu usage and uneven load distribution.
>>>
>>> This patch introduces two new 'events' flags that are intended to be used
>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the
>>> event source in an exclusive manner such that the minimum number of threads
>>> are woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set,
>>> can also be added to the 'events' flag, such that we round robin around the
>>> set of waiting threads.
>>>
>>> An implementation note is that in the epoll wakeup routine,
>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
>>> successful wakeup, only when there are current waiters. The idea is to use
>>> this additional heuristic in order to minimize wakeup latencies.
>>
>> I don't understand what this is intended to do.
>>
>> If an event has EPOLLONESHOT, then only one thread should be woken
>> regardless, right?  If not, isn't that just a bug that should be fixed?
>
> hmm...so with EPOLLONESHOT you basically get notified once about an event. If
> I have multiple epoll fds (say 1 per-thread) attached to a single source in
> EPOLLONESHOT, then all threads will potentially get woken up once per event.
> Then, I would have to re-arm all of them. So I don't think this addresses
> this particular use case...what I am trying to avoid is this mass wakeup or
> thundering herd for a shared event source.

Now I understand.  Why are you using multiple epollfds?

--Andy

>> If an event has EPOLLET, then the considerations are similar to
>> EPOLLONESHOT, right?
>
> EPOLLET is still going to cause this thundering herd.
>
>> If an event is a normal level-triggered non-one-shot event, then I don't
>> understand how a round-robin wakeup makes any sense.  It's level-triggered,
>> after all.
>
> Yeah, so the current behavior is to wake up all of the threads. I'm trying to
> add a new mode where it load balances among the threads interested in the
> event. Perhaps, the test program I attached to 0/2 will show the issue better?
>
> Also, this originally came up in the context of a single listening socket
> which was attached to multiple epoll fds each in a separate thread. With the
> attached patch, I can measure a large decrease in cpu usage and better
> balancing behavior among the accepting threads.
>
> Thanks,
>
> -Jason



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Jason Baron
On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jba...@akamai.com> wrote:
>> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
>>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>>>> Epoll file descriptors that are added to a shared wakeup source are always
>>>> added in a non-exclusive manner. That means that when we have multiple
>>>> epoll fds attached to a shared wakeup source they are all woken up. This
>>>> can lead to excessive cpu usage and uneven load distribution.
>>>>
>>>> This patch introduces two new 'events' flags that are intended to be used
>>>> with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to the
>>>> event source in an exclusive manner such that the minimum number of
>>>> threads are woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also
>>>> being set, can also be added to the 'events' flag, such that we round
>>>> robin around the set of waiting threads.
>>>>
>>>> An implementation note is that in the epoll wakeup routine,
>>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
>>>> successful wakeup, only when there are current waiters. The idea is to use
>>>> this additional heuristic in order to minimize wakeup latencies.
>>> I don't understand what this is intended to do.
>>>
>>> If an event has EPOLLONESHOT, then only one thread should be woken
>>> regardless, right?  If not, isn't that just a bug that should be fixed?
>>>
>> hmm...so with EPOLLONESHOT you basically get notified once about an event.
>> If I have multiple epoll fds (say 1 per-thread) attached to a single source
>> in EPOLLONESHOT, then all threads will potentially get woken up once per
>> event. Then, I would have to re-arm all of them. So I don't think this
>> addresses this particular use case...what I am trying to avoid is this mass
>> wakeup or thundering herd for a shared event source.
> Now I understand.  Why are you using multiple epollfds?
>
> --Andy

So the multiple epollfds is really a way to partition the set of events. 
Otherwise, I have all the threads contending on all the events that are being 
generated. So I'm not sure if that is scalable.

In the use-case I'm trying to describe, I've partitioned a large set of the 
events, but there may still be some event sources that we wish to share among 
all of the threads (or even subsets of them), so as not to overload any one in 
particular.

More specifically, in the case of a single listen socket, it's natural to call 
accept() on the thread that has been woken up, but without doing round robin, 
you quickly get into a very unbalanced load, and in addition you waste a lot of 
cpu doing unnecessary wakeups. There are other approaches to solve this, 
specifically using SO_REUSEPORT, which creates a separate socket per-thread and 
gets one back to the separately partitioned events case previously described. 
However, SO_REUSEPORT, I believe, is very specific to tcp/udp, and in addition 
does not have knowledge of the threads that are actively waiting as the epoll 
code does.

Thanks,

-Jason
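
The load-balancing argument in this message can be illustrated with a toy
user-space model of an exclusive wait queue. Everything here is an assumption
made for illustration: the struct and function names are invented, the real
wait queue code is not reproduced, and since patch 1/2 (which provides
add_wait_queue_rr()) is not shown in this thread, "round robin" is modeled as
moving the woken entry to the tail of the queue. The skip-if-not-blocked step
mirrors the ewake return value added to ep_poll_callback() in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_WAITERS 8

struct toy_waiter {
	int id;        /* which epoll fd / thread this entry stands for */
	bool blocked;  /* true while that thread sits in epoll_wait() */
	int events;    /* events handed to this entry so far */
};

struct toy_queue {
	struct toy_waiter *entry[MAX_WAITERS];
	size_t len;
	bool round_robin;  /* models EPOLLEXCLUSIVE | EPOLLROUNDROBIN */
};

/*
 * Deliver one event with exclusive semantics: stop at the first entry whose
 * callback reports success. Returns the id that took the event, or -1.
 */
static int toy_wake_one(struct toy_queue *q)
{
	size_t i;

	assert(q->len <= MAX_WAITERS);
	for (i = 0; i < q->len; i++) {
		struct toy_waiter *w = q->entry[i];

		/* With round robin the callback succeeds only if a waiter is
		 * blocked; with plain exclusive it always reports success. */
		if (q->round_robin && !w->blocked)
			continue;
		w->events++;
		if (q->round_robin) {
			/* Assumed rotation: move the woken entry to the tail. */
			memmove(&q->entry[i], &q->entry[i + 1],
				(q->len - i - 1) * sizeof(q->entry[0]));
			q->entry[q->len - 1] = w;
		}
		return w->id;
	}
	return -1;
}
```

In this model, with three waiters that are always blocked, a queue with
round_robin unset hands every event to the entry at the head, while a queue
with round_robin set spreads the same events evenly across all three.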


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Eric Wong
Jason Baron <jba...@akamai.com> wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron <jba...@akamai.com> wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
> >>>> Epoll file descriptors that are added to a shared wakeup source are
> >>>> always added in a non-exclusive manner. That means that when we have
> >>>> multiple epoll fds attached to a shared wakeup source they are all
> >>>> woken up. This can lead to excessive cpu usage and uneven load
> >>>> distribution.
> >>>>
> >>>> This patch introduces two new 'events' flags that are intended to be
> >>>> used with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE adds the epoll fd to
> >>>> the event source in an exclusive manner such that the minimum number of
> >>>> threads are woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also
> >>>> being set, can also be added to the 'events' flag, such that we round
> >>>> robin around the set of waiting threads.
> >>>>
> >>>> An implementation note is that in the epoll wakeup routine,
> >>>> 'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a
> >>>> successful wakeup, only when there are current waiters. The idea is to
> >>>> use this additional heuristic in order to minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then only one thread should be woken
> >>> regardless, right?  If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an
> >> event. If I have multiple epoll fds (say 1 per-thread) attached to a
> >> single source in EPOLLONESHOT, then all threads will potentially get
> >> woken up once per event. Then, I would have to re-arm all of them. So I
> >> don't think this addresses this particular use case...what I am trying
> >> to avoid is this mass wakeup or thundering herd for a shared event source.
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
>
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.
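
The single-epoll alternative suggested here can be sketched as one pass of a
worker thread. This is only an illustration under assumptions: BATCH stands in
for the "sufficiently large maxevents" and its value is a guess, worker_pass()
and drain_once() are hypothetical names, and every fd is assumed to have been
added with EPOLLIN | EPOLLONESHOT and its own fd stored in data.fd.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define BATCH 64   /* "sufficiently large" maxevents; the value is a guess */

/* Sample handler: consume whatever is readable right now (one read call). */
static void drain_once(int fd)
{
	char buf[256];

	if (read(fd, buf, sizeof(buf)) < 0)
		return;	/* a real handler would look at errno here */
}

/*
 * One pass of a worker: take up to BATCH one-shot events from the shared
 * epoll fd, handle the whole set, and re-arm each fd with EPOLL_CTL_MOD.
 * Returns the number of events processed, or -1 on error.
 */
static int worker_pass(int epfd, int timeout_ms, void (*handle)(int fd))
{
	struct epoll_event evs[BATCH];
	int n, i;

	assert(epfd >= 0);
	n = epoll_wait(epfd, evs, BATCH, timeout_ms);
	if (n < 0)
		return -1;

	for (i = 0; i < n; i++) {
		struct epoll_event rearm = { .events = EPOLLIN | EPOLLONESHOT };
		int fd = evs[i].data.fd;

		rearm.data.fd = fd;
		if (handle)
			handle(fd);
		/* One-shot entries stay disabled until they are re-armed, so
		 * no other thread sees this fd while it is being handled. */
		if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm) < 0)
			return -1;
	}
	return n;
}
```

Each worker would call worker_pass() in a loop on the same epoll fd. The
EPOLL_CTL_MOD call per event is where the contention mentioned above would
show up.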

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
>
> More specifically, in the case of a single listen socket, it's natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe, is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do their own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.