Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-07 Thread Jason Baron
On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jba...@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more
>> permanently to the thread. That is, in the case of a listen socket
>> we then add the connected socket to the woken-up thread's local set
>> of epoll events. So the load persists past the wakeup. And in this
>> case, doing the round-robin wakeups simply allows us to access more
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity
>> to do the wakeups as well, as you suggested.)
> So this is the part that I still don't understand.

Here's maybe another way to frame this. Each epoll set adds
a waiter to the shared wakeup source's wait queue, in a fixed
order, at the time the set is attached (via EPOLL_CTL_ADD).
This order does not change, modulo adds/deletes, which are
usually uncommon. So if we don't want to wake all threads
when, say, an interrupt occurs at some random point, we can
either:

1) Walk the list, wake up the first epoll set that has idle
threads (queued via epoll_wait()) and return.

or:

2) Walk the list and wake up the first epoll set that has idle
threads, but then 'rotate' or move this epoll set to the tail
of the queue before returning.

So because the epoll sets are in a fixed order, there is
an extreme bias toward picking the same epoll sets over and
over, regardless of the order in which threads return to wait
via epoll_wait(). So I think the rotate makes sense both for
the case where I am trying to assign work to threads that
may persist past the wakeup point, and for cases where the
threads can finish all their work before returning back to
epoll_wait().

Thanks,

-Jason

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-05 Thread Jason Baron

On 03/05/2015 04:15 AM, Ingo Molnar wrote:
> * Jason Baron <jba...@akamai.com> wrote:
>
>> 2) We are using the wakeup in this case to 'assign' work more
>> permanently to the thread. That is, in the case of a listen socket
>> we then add the connected socket to the woken-up thread's local set
>> of epoll events. So the load persists past the wakeup. And in this
>> case, doing the round-robin wakeups simply allows us to access more
>> cpu bandwidth. (I'm also looking into potentially using cpu affinity
>> to do the wakeups as well, as you suggested.)
> So this is the part that I still don't understand.
>
> What difference do LIFO versus FIFO wakeups make to CPU utilization:
> a thread waiting for work is idle, no matter whether it ran most
> recently or least recently.
>
> Once an idle worker thread is woken it will compute its own work, for
> whatever time it needs to, and won't be bothered by epoll again until
> it has finished its work and starts waiting again.
>
> So regardless of the wakeup order it's in principle the same bandwidth
> utilization, modulo caching artifacts [*] and modulo scheduling
> artifacts [**]:

So just adding the waiter to the wakeup source as 'exclusive'
would, I think, give much of the desired behavior. In the first
patch posting I separated 'exclusive' from 'rotate' (where rotate
depended on exclusive), since, as you point out, the idle threads
will tend to get assigned the new work vs. the busy threads,
and the workload naturally spreads out (modulo the artifacts
you mentioned).

However, I added the 'rotate' because I'm assigning work via the
wakeup that persists past the wakeup point. So without the rotate
I might end up assigning a lot of work to, say, always the first
thread if it's always idle, and then get a large burst of
work queued to it at some later point. The rotate is intended
to address this case.

To use some pseudo-code in hopes of clarifying things, each
thread is roughly doing:

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd...);

while(1) {

epoll_wait(epfd...);
fd = accept(listen_fd...);
epoll_ctl(epfd, EPOLL_CTL_ADD, fd...);
...do any additional desired fd processing...
}

So since the work persists past the wakeup point (after
the 'fd' has been assigned to the epfd set of the local
thread), I am trying to balance out future load.
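
For reference, a fuller, compilable version of that loop might
look roughly like this (a sketch only: MAX_EVENTS and handle_fd()
are illustrative, error handling is omitted, and the EPOLL_ROTATE
flag from this series is left out):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64		/* illustrative */

static void handle_fd(int fd)
{
	char buf[512];

	/* hypothetical per-fd work; stands in for real processing */
	if (read(fd, buf, sizeof(buf)) <= 0)
		close(fd);	/* closing drops the fd from the set */
}

static void worker(int epfd, int listen_fd)
{
	struct epoll_event ev, events[MAX_EVENTS];

	ev.events = EPOLLIN;
	ev.data.fd = listen_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	for (;;) {
		int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

		for (int i = 0; i < n; i++) {
			if (events[i].data.fd == listen_fd) {
				/* assign the new connection to *this*
				 * thread's epoll set - this is the load
				 * that persists past the wakeup */
				int fd = accept(listen_fd, NULL, NULL);

				ev.events = EPOLLIN;
				ev.data.fd = fd;
				epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
			} else {
				handle_fd(events[i].data.fd);
			}
		}
	}
}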

This is an issue that current userspace has to address in
various ways. In our case, we periodically remove all of the
epfds from the listen socket and then re-add them in a
different order. Another alternative, suggested by Eric, was
to have a dedicated thread (or threads) do the assignment.
So these approaches can work to an extent, but they seem, at
least to me, to complicate userspace somewhat. And at least
in our case, they don't provide balancing as good as this
approach.

So I am trying to use epoll in a special way to do work
assignment. I think the model is different from the standard
waker/wakee model. To that end, in this v3 version, I've
attempted to contain all the changes within epoll to reflect
that fact.

Thanks,

-Jason


>
> [*]  Caching artifacts: in that sense Andrew's point stands: given 
>  multiple equivalent choices it's more beneficial to pick a thread 
>  that was most recently used (and is thus most cache-hot - i.e. 
>  the current wakeup behavior), versus a thread that was least 
>  recently used (and is thus the most cache-cold - i.e. the 
>  round-robin wakeup you introduce).
>
> [**] The hack patch I posted in my previous reply.
>
> Thanks,
>
>   Ingo
>


Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-05 Thread Ingo Molnar

* Jason Baron <jba...@akamai.com> wrote:

> 2) We are using the wakeup in this case to 'assign' work more
> permanently to the thread. That is, in the case of a listen socket
> we then add the connected socket to the woken-up thread's local set
> of epoll events. So the load persists past the wakeup. And in this
> case, doing the round-robin wakeups simply allows us to access more
> cpu bandwidth. (I'm also looking into potentially using cpu affinity
> to do the wakeups as well, as you suggested.)

So this is the part that I still don't understand.

What difference do LIFO versus FIFO wakeups make to CPU utilization:
a thread waiting for work is idle, no matter whether it ran most
recently or least recently.

Once an idle worker thread is woken it will compute its own work, for
whatever time it needs to, and won't be bothered by epoll again until
it has finished its work and starts waiting again.

So regardless of the wakeup order it's in principle the same bandwidth
utilization, modulo caching artifacts [*] and modulo scheduling
artifacts [**]:

[*]  Caching artifacts: in that sense Andrew's point stands: given 
 multiple equivalent choices it's more beneficial to pick a thread 
 that was most recently used (and is thus most cache-hot - i.e. 
 the current wakeup behavior), versus a thread that was least 
 recently used (and is thus the most cache-cold - i.e. the 
 round-robin wakeup you introduce).

[**] The hack patch I posted in my previous reply.

Thanks,

Ingo


Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-04 Thread Jason Baron
On 03/04/2015 07:02 PM, Ingo Molnar wrote:
> * Andrew Morton <a...@linux-foundation.org> wrote:
>
>> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jba...@akamai.com> wrote:
>>
>>>> I don't really understand the need for rotation/round-robin.  We can
>>>> solve the thundering herd via exclusive wakeups, but what is the point
>>>> in choosing to wake the task which has been sleeping for the longest
>>>> time?  Why is that better than waking the task which has been sleeping
>>>> for the *least* time?  That's probably faster as that task's data is
>>>> more likely to still be in cache.
>>>>
>>>> The changelogs talk about "starvation" but they don't really say what
>>>> this term means in this context, nor why it is a bad thing.

>> I'm still not getting it.
>>
>>> So the idea with the 'rotation' is to try and distribute the
>>> workload more evenly across the worker threads.
>> Why?
>>
>>> We currently
>>> tend to wake up the 'head' of the queue over and over and
>>> thus the workload for us is not evenly distributed.
>> What's wrong with that?
>>
>>> In fact, we
>>> have a workload where we have to remove all the epoll sets
>>> and then re-add them in a different order to improve the situation.
>> Why?
> So my guess would be (but Jason would know this more precisely) that
> by spreading the workload to more tasks in a FIFO manner, the
> individual tasks can move between CPUs better, and fill in available
> CPU bandwidth better, increasing concurrency.
>
> With the current LIFO distribution of wakeups, the 'busiest' threads 
> will get many wakeups (potentially from different CPUs), making them 
> cache-hot, which may interfere with them easily migrating across CPUs.
>
> So while technically both approaches have similar concurrency, the 
> more 'spread out' task hierarchy schedules in a more consistent 
> manner.
>
> But ... this is just a wild guess and even if my description is 
> accurate then it should still be backed by robust measurements and 
> observations, before we extend the ABI.
>
> This hypothesis could be tested by the patch below: with the patch
> applied, if the performance difference between FIFO and LIFO epoll
> wakeups disappears, then the root cause is the cache-hotness code in
> the scheduler.
>
>

So what I think you are describing here fits the model where
you have a single epoll fd (returned by epoll_create()), which
is then attached to the wakeup fds. That can be thought of as
having a single 'event' queue (the single epoll fd), where
multiple threads are competing to grab events via epoll_wait(),
and things are currently LIFO there, as you describe.

However, the use-case I was trying to get at is where you have
multiple epoll fds (or event queues), and really just one thread
doing epoll_wait() against a single epoll fd. So instead of having
all threads competing for all events, we have divided up the
events into separate queues.
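
As a rough sketch of that topology (illustrative only: worker()
is a hypothetical per-thread loop that does the epoll_wait()/
accept() dance, and error handling is omitted):

#include <pthread.h>
#include <sys/epoll.h>

/* hypothetical per-thread loop: epoll_wait() on its own set,
 * accept(), then EPOLL_CTL_ADD the new connection locally */
extern void *worker(void *epfdp);

/* each thread gets its own epoll fd; the one shared listen socket
 * is added to every set, so all the sets share the same wakeup
 * source */
static void start_workers(int listen_fd, int nthreads,
			  pthread_t *tid, int *epfd)
{
	for (int i = 0; i < nthreads; i++) {
		struct epoll_event ev = { .events = EPOLLIN,
					  .data.fd = listen_fd };

		epfd[i] = epoll_create1(0);
		epoll_ctl(epfd[i], EPOLL_CTL_ADD, listen_fd, &ev);
		pthread_create(&tid[i], NULL, worker, &epfd[i]);
	}
}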

Now, the 'problematic' case is where there may be an event
source that is shared among all of these epoll fds - such as
a listen socket or a pipe. There are two distinct issues in
this case that this series is trying to address.

1) All epoll fds will receive a wakeup (and hence the threads
that are potentially blocking there, although they may not
return to user-space if the event has already been consumed).
I think the test case I posted shows this pretty clearly -
http://lwn.net/Articles/632590/. The number of context switches
without adding the waiters to the wait queue exclusively is 50x
the case where they are added exclusively. That's a lot of extra
cpu usage.

2) We are using the wakeup in this case to 'assign' work more
permanently to the thread. That is, in the case of a listen socket
we then add the connected socket to the woken-up thread's
local set of epoll events. So the load persists past the wakeup.
And in this case, doing the round-robin wakeups simply allows
us to access more cpu bandwidth. (I'm also looking into potentially
using cpu affinity to do the wakeups as well, as you suggested.)

Thanks,

-Jason

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-04 Thread Ingo Molnar

* Andrew Morton <a...@linux-foundation.org> wrote:

> On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jba...@akamai.com> wrote:
> 
> > 
> > >
> > > I don't really understand the need for rotation/round-robin.  We can
> > > solve the thundering herd via exclusive wakeups, but what is the point
> > > in choosing to wake the task which has been sleeping for the longest
> > > time?  Why is that better than waking the task which has been sleeping
> > > for the *least* time?  That's probably faster as that task's data is
> > > more likely to still be in cache.
> > >
> > > The changelogs talk about "starvation" but they don't really say what
> > > this term means in this context, nor why it is a bad thing.
> > >
> 
> I'm still not getting it.
> 
> > So the idea with the 'rotation' is to try and distribute the
> > workload more evenly across the worker threads.
> 
> Why?
> 
> > We currently
> > tend to wake up the 'head' of the queue over and over and
> > thus the workload for us is not evenly distributed.
> 
> What's wrong with that?
> 
> > In fact, we
> > have a workload where we have to remove all the epoll sets
> > and then re-add them in a different order to improve the situation.
> 
> Why?

So my guess would be (but Jason would know this more precisely) that
by spreading the workload to more tasks in a FIFO manner, the
individual tasks can move between CPUs better, and fill in available
CPU bandwidth better, increasing concurrency.

With the current LIFO distribution of wakeups, the 'busiest' threads 
will get many wakeups (potentially from different CPUs), making them 
cache-hot, which may interfere with them easily migrating across CPUs.

So while technically both approaches have similar concurrency, the 
more 'spread out' task hierarchy schedules in a more consistent 
manner.

But ... this is just a wild guess and even if my description is 
accurate then it should still be backed by robust measurements and 
observations, before we extend the ABI.

This hypothesis could be tested by the patch below: with the patch
applied, if the performance difference between FIFO and LIFO epoll
wakeups disappears, then the root cause is the cache-hotness code in
the scheduler.

Thanks,

Ingo

---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee595ef30470..89af04e946d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5354,7 +5354,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (p->sched_class != &fair_sched_class)
+	if (1 || p->sched_class != &fair_sched_class)
 		return 0;
 
 	if (unlikely(p->policy == SCHED_IDLE))

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-03-01 Thread Jason Baron
On 02/27/2015 04:31 PM, Jonathan Corbet wrote:
> On Fri, 27 Feb 2015 13:10:34 -0800
> Andrew Morton <a...@linux-foundation.org> wrote:
>
>> I don't really understand the need for rotation/round-robin.  We can
>> solve the thundering herd via exclusive wakeups, but what is the point
>> in choosing to wake the task which has been sleeping for the longest
>> time?  Why is that better than waking the task which has been sleeping
>> for the *least* time?  That's probably faster as that task's data is
>> more likely to still be in cache.
> So here's my chance to show the world what a fool I am (again)...  If I
> understand this at all, a task woken from epoll_wait() remains on the wait
> queues while it is off doing other stuff.  If you're doing exclusive
> wakeups, the task at the head of the queue will get all of them, since it
> never gets removed from the queue.  So you don't spread your load around,
> and, indeed, you may "wake" a process that is busy doing something else and
> can't deal with the event now anyway.  You need some way to shuffle up the
> wait queue, and round-robin is probably as good as any.
>
> (The alternative would be to take the process off the queue until it calls
> epoll_wait() again, but that runs counter to what epoll is all about).
>
> At least, that was my impression when I took a look at this stuff.
>
> jon

So tasks do not remain on wait queues when they are not in
epoll_wait(). That is, a task is added to the associated epoll
fd's wait queue at the beginning of epoll_wait(), and removed
from that wait queue when epoll_wait() returns.

One can think about the problem, perhaps, as assigning fd
events - POLLIN, POLLOUT, etc. - to a set of tasks. And this
discussion is about how to do the assignment in certain
cases. Namely, one could start by partitioning the set of fds
into unique sets and then assigning them
(via EPOLL_CTL_ADD) to different epoll fds. Then, if there
is, say, a single task blocking on each epoll fd (via
epoll_wait()), each task can work nicely on its own set of
events without necessarily needing to co-ordinate with the
other tasks.

Now, this all works fine until we have an 'fd' or event source
that we wish to share among all the tasks. We might
want to share it because it generates events, or work, that
would potentially overwhelm a single task.

So in this shared event source case, where we have added
the fd to all of the epoll fds, we currently do a wake-all.
This series attempts to change that behavior (with an
optional flag to epoll_create1()) into a round-robin wakeup
(both to avoid excessive wakeups and to more evenly distribute
the wakeups). Note also that it will continue waking up tasks
as long as it doesn't find one in epoll_wait(). Thus, it can
still potentially wake up all of them if nobody is in epoll_wait().

Now, we could instead try to distribute the fd events among tasks
all waiting on a single epoll fd (meaning we have a single
event queue). But we have already partitioned most of the
events, so why combine them back into a single queue?
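
In terms of the proposed interface, usage would be roughly as
follows (a sketch: EPOLL_ROTATE is the flag introduced by this
series, not part of any released kernel ABI, and the value below
is just a placeholder):

#include <stdio.h>
#include <sys/epoll.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 0x2	/* placeholder value, for illustration */
#endif

int main(void)
{
	/* ask for round-robin exclusive wakeups on this epoll set */
	int epfd = epoll_create1(EPOLL_ROTATE);

	if (epfd < 0) {
		/* expected on kernels without this series applied */
		perror("epoll_create1(EPOLL_ROTATE)");
		return 1;
	}
	/* ...add the shared wakeup source, e.g. the listen socket... */
	return 0;
}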

Thanks,

-Jason

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-27 Thread Andrew Morton
On Fri, 27 Feb 2015 17:01:32 -0500 Jason Baron <jba...@akamai.com> wrote:

> 
> >
> > I don't really understand the need for rotation/round-robin.  We can
> > solve the thundering herd via exclusive wakeups, but what is the point
> > in choosing to wake the task which has been sleeping for the longest
> > time?  Why is that better than waking the task which has been sleeping
> > for the *least* time?  That's probably faster as that task's data is
> > more likely to still be in cache.
> >
> > > The changelogs talk about "starvation" but they don't really say what
> > > this term means in this context, nor why it is a bad thing.
> >

I'm still not getting it.

> So the idea with the 'rotation' is to try and distribute the
> workload more evenly across the worker threads.

Why?

> We currently
> tend to wake up the 'head' of the queue over and over and
> thus the workload for us is not evenly distributed.

What's wrong with that?

> In fact, we
> have a workload where we have to remove all the epoll sets
> and then re-add them in a different order to improve the situation.

Why?

> We are trying to avoid this workaround and in addition avoid
> thundering wakeups when possible (using exclusive as you
> mention).


A better way to describe a patchset like this is to start out
describing what the user sees: what is wrong with the current kernel
behaviour and how will it be improved?  Once that is understood then
start getting into kernel internal details.


Jon's explanation was much more useful.  Was it accurate?



Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-27 Thread Jason Baron

On 02/27/2015 04:10 PM, Andrew Morton wrote:
> On Wed, 25 Feb 2015 11:27:04 -0500 Jason Baron <jba...@akamai.com> wrote:
>
>>> Libenzi inactive eventpoll appears to be without a 
>>> dedicated maintainer since 2011 or so. Is there anyone who 
>>> knows the code and its usages in detail and does final ABI 
>>> decisions on eventpoll - Andrew, Al or Linus?
>>>
>> Generally, Andrew and Al do more 'final' reviews here,
>> and a lot of others on lkml are always very helpful in
>> looking at this code. However, its not always clear, at
>> least to me, who I should pester.
> Yes, it's a difficult situation.
>
> The 3/3 changelog refers to "EPOLLROUNDROBIN" which I assume is
> a leftover from some earlier revision?

Yes, that's a typo there. It should read 'EPOLL_ROTATE'.

>
> I don't really understand the need for rotation/round-robin.  We can
> solve the thundering herd via exclusive wakeups, but what is the point
> in choosing to wake the task which has been sleeping for the longest
> time?  Why is that better than waking the task which has been sleeping
> for the *least* time?  That's probably faster as that task's data is
> more likely to still be in cache.
>
> The changelogs talks about "starvation" but they don't really say what
> this term means in this context, nor why it is a bad thing.
>

So the idea with the 'rotation' is to try and distribute the
workload more evenly across the worker threads. We currently
tend to wake up the 'head' of the queue over and over and
thus the workload for us is not evenly distributed. In fact, we
have a workload where we have to remove all the epoll sets
and then re-add them in a different order to improve the situation.
We are trying to avoid this workaround and in addition avoid
thundering wakeups when possible (using exclusive as you
mention).

I agree that waking up the task that has been sleeping longest
may not be the best choice for all workloads. So what I am proposing
here is an optional flag to suit a certain kind of workload. It might
not be right for everyone, but we have found it quite useful.

The 'starvation' mention was in regard to the fact that, with this
new behavior of not waking up all threads (and rotating them),
an adversarial thread might insert itself into our wakeup queue
and 'starve' us out. This concern was raised by Andy Lutomirski,
and this current series is not subject to this issue, b/c it works
by creating a new epoll fd and then adding that epoll fd to the
wakeup queue. Thus, this 'new' epoll fd is local to the thread,
and the wakeup queue continues to wake all threads. Only the
'new' epoll fd, which we then attach ourselves to, implements the
exclusive/rotate behavior.
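
In rough code, one plausible wiring of that indirection is the
following (my sketch of the idea, not the actual v3 code - the
exact mechanics in the patches may differ, and EPOLL_ROTATE is
the flag proposed here, not released ABI):

int rotate_fd = epoll_create1(EPOLL_ROTATE);	/* the 'new' epoll fd */
int my_epfd = epoll_create1(0);			/* the thread's own set */
struct epoll_event ev = { .events = EPOLLIN };

/* the shared source is attached to the rotate-capable epoll fd,
 * so it is that fd which sits on the source's wait queue ... */
ev.data.fd = shared_fd;
epoll_ctl(rotate_fd, EPOLL_CTL_ADD, shared_fd, &ev);

/* ... and the thread then attaches itself to the rotate fd
 * (nesting epoll fds is allowed); the exclusive/rotate behavior
 * lives only on the rotate fd, while the source's own wait queue
 * still wakes everyone */
ev.data.fd = rotate_fd;
epoll_ctl(my_epfd, EPOLL_CTL_ADD, rotate_fd, &ev);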

Thanks,

-Jason



Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-27 Thread Jonathan Corbet
On Fri, 27 Feb 2015 13:10:34 -0800
Andrew Morton <a...@linux-foundation.org> wrote:

> I don't really understand the need for rotation/round-robin.  We can
> solve the thundering herd via exclusive wakeups, but what is the point
> in choosing to wake the task which has been sleeping for the longest
> time?  Why is that better than waking the task which has been sleeping
> for the *least* time?  That's probably faster as that task's data is
> more likely to still be in cache.

So here's my chance to show the world what a fool I am (again)...  If I
understand this at all, a task woken from epoll_wait() remains on the wait
queues while it is off doing other stuff.  If you're doing exclusive
wakeups, the task at the head of the queue will get all of them, since it
never gets removed from the queue.  So you don't spread your load around,
and, indeed, you may "wake" a process that is busy doing something else and
can't deal with the event now anyway.  You need some way to shuffle up the
wait queue, and round-robin is probably as good as any.

(The alternative would be to take the process off the queue until it calls
epoll_wait() again, but that runs counter to what epoll is all about).

At least, that was my impression when I took a look at this stuff.

jon

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-27 Thread Andrew Morton
On Wed, 25 Feb 2015 11:27:04 -0500 Jason Baron <jba...@akamai.com> wrote:

> > Libenzi inactive eventpoll appears to be without a 
> > dedicated maintainer since 2011 or so. Is there anyone who 
> > knows the code and its usages in detail and does final ABI 
> > decisions on eventpoll - Andrew, Al or Linus?
> >
> Generally, Andrew and Al do more 'final' reviews here,
> and a lot of others on lkml are always very helpful in
> looking at this code. However, its not always clear, at
> least to me, who I should pester.

Yes, it's a difficult situation.

The 3/3 changelog refers to "EPOLLROUNDROBIN" which I assume is
a leftover from some earlier revision?

I don't really understand the need for rotation/round-robin.  We can
solve the thundering herd via exclusive wakeups, but what is the point
in choosing to wake the task which has been sleeping for the longest
time?  Why is that better than waking the task which has been sleeping
for the *least* time?  That's probably faster as that task's data is
more likely to still be in cache.

The changelogs talk about "starvation" but they don't really say what
this term means in this context, nor why it is a bad thing.


Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-25 Thread Jason Baron
On 02/25/2015 02:38 AM, Ingo Molnar wrote:
> * Jason Baron <jba...@akamai.com> wrote:
>
>> Hi,
>>
>> When we are sharing a wakeup source among multiple epoll 
>> fds, we end up with thundering herd wakeups, since there 
>> is currently no way to add to the wakeup source 
>> exclusively. This series introduces a new EPOLL_ROTATE 
>> flag to allow for round robin exclusive wakeups.
>>
>> I believe this patch series addresses the two main 
>> concerns that were raised in prior postings. Namely, that 
>> it affected code (and potentially performance) of the 
>> core kernel wakeup functions, even in cases where it was 
>> not strictly needed, and that it could lead to wakeup 
>> starvation (since we are no longer waking up all
>> waiters). It does so by adding an extra layer of
>> indirection, whereby waiters are attached to a 'pseudo'
>> epoll fd, which in turn is attached directly to the 
>> wakeup source.
>>   sched/wait: add __wake_up_rotate()
>>  include/linux/wait.h   |  1 +
>>  kernel/sched/wait.c| 27 ++
> So the scheduler bits are looking good to me in principle, 
> because they just add a new round-robin-rotating wakeup 
> variant and don't disturb the others.
>
> Is there consensus on the epoll ABI changes? With Davide 

I'm not sure there is a clear consensus on this change,
but I'm hoping that I've addressed the outstanding
concerns in this latest version.

I also think the addition of a way to do a 'wakeup policy'
here will open up other 'policies', such as taking into
account cpu affinity, as you suggested. So I think it's
potentially an interesting direction for this code.

> Libenzi inactive eventpoll appears to be without a 
> dedicated maintainer since 2011 or so. Is there anyone who 
> knows the code and its usages in detail and does final ABI 
> decisions on eventpoll - Andrew, Al or Linus?
>
Generally, Andrew and Al do more 'final' reviews here,
and a lot of others on lkml are always very helpful in
looking at this code. However, it's not always clear, at
least to me, who I should pester.

Thanks,

-Jason

Re: [PATCH v3 0/3] epoll: introduce round robin wakeup mode

2015-02-24 Thread Ingo Molnar

* Jason Baron <jba...@akamai.com> wrote:

> Hi,
> 
> When we are sharing a wakeup source among multiple epoll 
> fds, we end up with thundering herd wakeups, since there 
> is currently no way to add to the wakeup source 
> exclusively. This series introduces a new EPOLL_ROTATE 
> flag to allow for round robin exclusive wakeups.
> 
> I believe this patch series addresses the two main 
> concerns that were raised in prior postings. Namely, that 
> it affected code (and potentially performance) of the 
> core kernel wakeup functions, even in cases where it was 
> not strictly needed, and that it could lead to wakeup 
> starvation (since we are no longer waking up all
> waiters). It does so by adding an extra layer of
> indirection, whereby waiters are attached to a 'pseudo'
> epoll fd, which in turn is attached directly to the 
> wakeup source.

>   sched/wait: add __wake_up_rotate()

>  include/linux/wait.h   |  1 +
>  kernel/sched/wait.c| 27 ++

So the scheduler bits are looking good to me in principle, 
because they just add a new round-robin-rotating wakeup 
variant and don't disturb the others.
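
To illustrate, such a variant could plausibly look something
like the following (a guess for illustration only - this is not
the actual __wake_up_rotate() from the series, though the
wait-queue types and list helpers are real kernel APIs of that
era):

static void __wake_up_rotate(wait_queue_head_t *q, unsigned int mode,
			     void *key)
{
	unsigned long flags;
	wait_queue_t *curr, *next;

	spin_lock_irqsave(&q->lock, flags);
	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		/* wake the first waiter that accepts the wakeup ... */
		if (curr->func(curr, mode, 0, key)) {
			/* ... then rotate it to the tail so the next
			 * wakeup starts from a different waiter */
			list_move_tail(&curr->task_list, &q->task_list);
			break;
		}
	}
	spin_unlock_irqrestore(&q->lock, flags);
}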

Is there consensus on the epoll ABI changes? With Davide 
Libenzi inactive eventpoll appears to be without a 
dedicated maintainer since 2011 or so. Is there anyone who 
knows the code and its usages in detail and does final ABI 
decisions on eventpoll - Andrew, Al or Linus?

Thanks,

Ingo