subject:"\[RFC\] vmstat\: Avoid waking up idle\-cpu to service shepherd work"

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 09:47:01PM +0530, Viresh Kumar wrote:
> And all I get it is 8256 bytes, with or without the change.

Duh, rounded up to cacheline boundary ;-)

Trades two 4 byte holes at the start for a bigger 'hole' at the end.

struct tvec_base {
spinlock_t lock; /* 0 2 */

/* XXX 6 bytes hole, try to pack */

struct timer_list *running_timer;/* 8 8 */
long unsigned int  timer_jiffies;/*16 8 */
long unsigned int  next_timer;   /*24 8 */
long unsigned int  active_timers;/*32 8 */
long unsigned int  all_timers;   /*40 8 */
intcpu;  /*48 4 */

/* XXX 4 bytes hole, try to pack */

struct tvec_root   tv1;  /*56  4096 */
/* --- cacheline 64 boundary (4096 bytes) was 56 bytes ago --- */
struct tvectv2;  /*  4152  1024 */
/* --- cacheline 80 boundary (5120 bytes) was 56 bytes ago --- */
struct tvectv3;  /*  5176  1024 */
/* --- cacheline 96 boundary (6144 bytes) was 56 bytes ago --- */
struct tvectv4;  /*  6200  1024 */
/* --- cacheline 112 boundary (7168 bytes) was 56 bytes ago --- */
struct tvectv5;  /*  7224  1024 */
/* --- cacheline 128 boundary (8192 bytes) was 56 bytes ago --- */

/* size: 8256, cachelines: 129, members: 12 */
/* sum members: 8238, holes: 2, sum holes: 10 */
/* padding: 8 */
};

vs

struct tvec_base {
spinlock_t lock; /* 0 2 */

/* XXX 2 bytes hole, try to pack */

intcpu;  /* 4 4 */
struct timer_list *running_timer;/* 8 8 */
long unsigned int  timer_jiffies;/*16 8 */
long unsigned int  next_timer;   /*24 8 */
long unsigned int  active_timers;/*32 8 */
long unsigned int  all_timers;   /*40 8 */
struct tvec_root   tv1;  /*48  4096 */
/* --- cacheline 64 boundary (4096 bytes) was 48 bytes ago --- */
struct tvectv2;  /*  4144  1024 */
/* --- cacheline 80 boundary (5120 bytes) was 48 bytes ago --- */
struct tvectv3;  /*  5168  1024 */
/* --- cacheline 96 boundary (6144 bytes) was 48 bytes ago --- */
struct tvectv4;  /*  6192  1024 */
/* --- cacheline 112 boundary (7168 bytes) was 48 bytes ago --- */
struct tvectv5;  /*  7216  1024 */
/* --- cacheline 128 boundary (8192 bytes) was 48 bytes ago --- */

/* size: 8256, cachelines: 129, members: 12 */
/* sum members: 8238, holes: 1, sum holes: 2 */
/* padding: 16 */
};
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 30 March 2015 at 19:29, Peter Zijlstra  wrote:
> Yeah, so that _should_ not trigger (obviously), and while I agree with
> the sentiment of sanity checks, I'm not sure its worth keeping that
> variable around just for that.

I read it as I can remove it then ? :)

> Anyway, while I'm looking at struct tvec_base I notice the cpu member
> should be second after the lock, that'll save 8 bytes on the structure
> on 64bit machines.

Hmm, I tried it on my macbook-pro.

$ uname -a
Linux vireshk 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC
2015 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

config: arch/x86/configs/x86_64_defconfig

And all I get it is 8256 bytes, with or without the change.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..afc5d74678df 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -77,12 +77,12 @@ struct tvec_root {

 struct tvec_base {
spinlock_t lock;
+   int cpu;
struct timer_list *running_timer;
unsigned long timer_jiffies;
unsigned long next_timer;
unsigned long active_timers;
unsigned long all_timers;
-   int cpu;
struct tvec_root tv1;
struct tvec tv2;
struct tvec tv3;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Christoph Lameter

On Mon, 30 Mar 2015, Michal Hocko wrote:

> Why cannot we do something like refresh_cpu_vm_stats from the IRQ
> context?  Especially the first zone stat part. The per-cpu pagesets is
> more costly and it would need a special treatment, alright. A simple
> way would be to splice the lists from the per-cpu context and then free
> those pages from the kthread context.

That would work.

> I am still wondering why those two things were squashed into a single
> place. Why kswapd is not doing the pcp cleanup?

They were squashed together by me for conveniences sake. They could be
separated. AFAICT the pcp cleanup could be done only on demand and we
already have logic for that when flushihng via IPI.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 05:08:18PM +0200, Michal Hocko wrote:
> On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
> [...]
> > Alternatively the thing hocko suggests is an utter fail too. You cannot
> > stuff that into hardirq context, that's insane.
> 
> I guess you are referring to
> http://article.gmane.org/gmane.linux.kernel.mm/127569, right?
> 
> Why cannot we do something like refresh_cpu_vm_stats from the IRQ
> context?  Especially the first zone stat part.

Big machines have big zone counts. There are machines with >200 nodes.
Although with the current trend of bigger nodes, the number of nodes
seems to come down as well. Still.

> The per-cpu pagesets is
> more costly and it would need a special treatment, alright. A simple
> way would be to splice the lists from the per-cpu context and then free
> those pages from the kthread context.
> 
> I am still wondering why those two things were squashed into a single
> place. Why kswapd is not doing the pcp cleanup?

Probably because they could be. The problem with kswapd is that its per
node, not per cpu.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Michal Hocko

On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
[...]
> Alternatively the thing hocko suggests is an utter fail too. You cannot
> stuff that into hardirq context, that's insane.

I guess you are referring to
http://article.gmane.org/gmane.linux.kernel.mm/127569, right?

Why cannot we do something like refresh_cpu_vm_stats from the IRQ
context?  Especially the first zone stat part. The per-cpu pagesets is
more costly and it would need a special treatment, alright. A simple
way would be to splice the lists from the per-cpu context and then free
those pages from the kthread context.

I am still wondering why those two things were squashed into a single
place. Why kswapd is not doing the pcp cleanup?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 06:44:22PM +0530, Viresh Kumar wrote:
> On 30 March 2015 at 18:17, Peter Zijlstra  wrote:
> > No, I means something else with that. We can remove the
> > tvec_base::running_timer field. Everything that uses that can use
> > tbase_running() AFAICT.
> 
> Okay, there is one instance which still needs it.
> 
> migrate_timers():
> 
> BUG_ON(old_base->running_timer);
> 
> What I wasn't sure about it is if we get can drop this statement or not.
> If we decide not to drop it, then we can convert running_timer into a bool.

Yeah, so that _should_ not trigger (obviously), and while I agree with
the sentiment of sanity checks, I'm not sure its worth keeping that
variable around just for that.

Anyway, while I'm looking at struct tvec_base I notice the cpu member
should be second after the lock, that'll save 8 bytes on the structure
on 64bit machines.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 30 March 2015 at 18:17, Peter Zijlstra  wrote:
> No, I means something else with that. We can remove the
> tvec_base::running_timer field. Everything that uses that can use
> tbase_running() AFAICT.

Okay, there is one instance which still needs it.

migrate_timers():

BUG_ON(old_base->running_timer);

What I wasn't sure about it is if we get can drop this statement or not.
If we decide not to drop it, then we can convert running_timer into a bool.

> Drop yes, racy not so much I think.
>
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 2d3f5c504939..1394f9540348 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base 
> *base)
> cascade(base, >tv5, INDEX(3));
> ++base->timer_jiffies;
> list_replace_init(base->tv1.vec + index, head);
> +
> +again:
> while (!list_empty(head)) {
> void (*fn)(unsigned long);
> unsigned long data;
> bool irqsafe;
>
> -   timer = list_first_entry(head, struct 
> timer_list,entry);
> +   timer = list_first_entry(head, struct timer_list, 
> entry);
> +   if (unlikely(tbase_running(timer))) {
> +   /* Only one timer on the list, force wait. */
> +   if (unlikely(head->next == head->prev)) {
> +   spin_unlock(>lock);
> +
> +   /*
> +* The only way to get here is if the
> +* handler requeued itself on another
> +* base, this guarantees the timer 
> will
> +* not go away.
> +*/
> +   while (tbase_running(timer))
> +   cpu_relax();
> +
> +   spin_lock(>lock);
> +   } else  {
> +   /*
> +* Otherwise, rotate the list and try
> +* someone else.
> +*/
> +   list_move_tail(>entry, head);
> +   }
> +   goto again;
> +   }
> +
> fn = timer->function;
> data = timer->data;
> irqsafe = tbase_get_irqsafe(timer->base);

Yeah, so I have written something similar only. Wasn't sure about what you wrote
earlier. Thanks for the clarification.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 05:32:16PM +0530, Viresh Kumar wrote:
> On 29 March 2015 at 15:54, Peter Zijlstra  wrote:

> > What I didn't say, but had thought of is that __run_timer() should skip
> > any timer that has RUNNING set -- for obvious reasons :-)

> Below is copied from your first reply, and so you probably already
> said that ? :)
> 
> > Also, once you have tbase_running, we can take base->running_timer out
> > altogether.

No, I means something else with that. We can remove the
tvec_base::running_timer field. Everything that uses that can use
tbase_running() AFAICT.

> I wanted to clarify if I understood it correctly..
> 
> Are you saying that:

> Case 2.) we keep retrying for it, until the time the other handler finishes?

That.

If we remove it from the list before we call ->fn. Therefore, even if
migrate happens, it will not see a RUNNING timer entry, seeing how its
not actually on any lists.

The only way to get on a list while running is if ->fn() requeues itself
_on_another_base_. When that happens, we need to wait for it to complete
running.

> Case 2.) We kept waiting for the first handler to finish ..
> - cpuY may waste some cycles as it kept waiting for handler to finish on cpuX 
> ..

True, rather silly to requeue a timer on the same jiffy as its already
running through, but yes, an unlikely possibility.

You can run another timer while we wait -- if there is another of
course.

> - We may need to perform base unlock/lock on cpuY, so that cpuX can take 
> cpuY's
> lock to reset tbase_running. And that might be racy, not sure.

Drop yes, racy not so much I think.


diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..1394f9540348 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base *base)
cascade(base, >tv5, INDEX(3));
++base->timer_jiffies;
list_replace_init(base->tv1.vec + index, head);
+
+again:
while (!list_empty(head)) {
void (*fn)(unsigned long);
unsigned long data;
bool irqsafe;
 
-   timer = list_first_entry(head, struct timer_list,entry);
+   timer = list_first_entry(head, struct timer_list, 
entry);
+   if (unlikely(tbase_running(timer))) {
+   /* Only one timer on the list, force wait. */
+   if (unlikely(head->next == head->prev)) {
+   spin_unlock(>lock);
+
+   /*
+* The only way to get here is if the
+* handler requeued itself on another
+* base, this guarantees the timer will
+* not go away.
+*/
+   while (tbase_running(timer))
+   cpu_relax();
+
+   spin_lock(>lock);
+   } else  {
+   /*
+* Otherwise, rotate the list and try
+* someone else.
+*/
+   list_move_tail(>entry, head);
+   }
+   goto again;
+   }
+
fn = timer->function;
data = timer->data;
irqsafe = tbase_get_irqsafe(timer->base);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 29 March 2015 at 15:54, Peter Zijlstra  wrote:
> On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
>> > Now there are few issues I see here (Sorry if they are all imaginary):
>> > - In case a timer re-arms itself from its handler and is migrated from CPU 
>> > A to B, what
>> >   happens if the re-armed timer fires before the first handler finishes ? 
>> > i.e. timer->fn()
>> >   hasn't finished running on CPU A and it has fired again on CPU B. 
>> > Wouldn't this expose
>> >   us to a lot of other problems? It wouldn't be serialized to itself 
>> > anymore ?
>>
>> What I said above.
>
> What I didn't say, but had thought of is that __run_timer() should skip
> any timer that has RUNNING set -- for obvious reasons :-)

Below is copied from your first reply, and so you probably already
said that ? :)

> Also, once you have tbase_running, we can take base->running_timer out
> altogether.

I wanted to clarify if I understood it correctly..

Are you saying that:

Case 1.) if we find tbase_running on cpuY (because it was rearmed
from its handler on cpuX and has got migrated to cpuY), then we should drop the
timer from the list without calling its handler (as that is already
running in parallel) ?

Or

Case 2.) we keep retrying for it, until the time the other handler finishes?

I have few queries for both the cases.

Case 1.) Will that be fair to the timer user as the timer may get lost
completely.
If we skip the timer on cpuY here, it wouldn't be enqueued again and
so will be lost.

Case 2.) We kept waiting for the first handler to finish ..
- cpuY may waste some cycles as it kept waiting for handler to finish on cpuX ..
- We may need to perform base unlock/lock on cpuY, so that cpuX can take cpuY's
lock to reset tbase_running. And that might be racy, not sure.

--
viresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 05:32:16PM +0530, Viresh Kumar wrote:
 On 29 March 2015 at 15:54, Peter Zijlstra pet...@infradead.org wrote:

  What I didn't say, but had thought of is that __run_timer() should skip
  any timer that has RUNNING set -- for obvious reasons :-)

 Below is copied from your first reply, and so you probably already
 said that ? :)
 
  Also, once you have tbase_running, we can take base-running_timer out
  altogether.

No, I means something else with that. We can remove the
tvec_base::running_timer field. Everything that uses that can use
tbase_running() AFAICT.

 I wanted to clarify if I understood it correctly..
 
 Are you saying that:

 Case 2.) we keep retrying for it, until the time the other handler finishes?

That.

If we remove it from the list before we call -fn. Therefore, even if
migrate happens, it will not see a RUNNING timer entry, seeing how its
not actually on any lists.

The only way to get on a list while running is if -fn() requeues itself
_on_another_base_. When that happens, we need to wait for it to complete
running.

 Case 2.) We kept waiting for the first handler to finish ..
 - cpuY may waste some cycles as it kept waiting for handler to finish on cpuX 
 ..

True, rather silly to requeue a timer on the same jiffy as its already
running through, but yes, an unlikely possibility.

You can run another timer while we wait -- if there is another of
course.

 - We may need to perform base unlock/lock on cpuY, so that cpuX can take 
 cpuY's
 lock to reset tbase_running. And that might be racy, not sure.

Drop yes, racy not so much I think.


diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..1394f9540348 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base *base)
cascade(base, base-tv5, INDEX(3));
++base-timer_jiffies;
list_replace_init(base-tv1.vec + index, head);
+
+again:
while (!list_empty(head)) {
void (*fn)(unsigned long);
unsigned long data;
bool irqsafe;
 
-   timer = list_first_entry(head, struct timer_list,entry);
+   timer = list_first_entry(head, struct timer_list, 
entry);
+   if (unlikely(tbase_running(timer))) {
+   /* Only one timer on the list, force wait. */
+   if (unlikely(head-next == head-prev)) {
+   spin_unlock(base-lock);
+
+   /*
+* The only way to get here is if the
+* handler requeued itself on another
+* base, this guarantees the timer will
+* not go away.
+*/
+   while (tbase_running(timer))
+   cpu_relax();
+
+   spin_lock(base-lock);
+   } else  {
+   /*
+* Otherwise, rotate the list and try
+* someone else.
+*/
+   list_move_tail(timer-entry, head);
+   }
+   goto again;
+   }
+
fn = timer-function;
data = timer-data;
irqsafe = tbase_get_irqsafe(timer-base);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 30 March 2015 at 18:17, Peter Zijlstra pet...@infradead.org wrote:
 No, I means something else with that. We can remove the
 tvec_base::running_timer field. Everything that uses that can use
 tbase_running() AFAICT.

Okay, there is one instance which still needs it.

migrate_timers():

BUG_ON(old_base-running_timer);

What I wasn't sure about it is if we get can drop this statement or not.
If we decide not to drop it, then we can convert running_timer into a bool.

 Drop yes, racy not so much I think.


 diff --git a/kernel/time/timer.c b/kernel/time/timer.c
 index 2d3f5c504939..1394f9540348 100644
 --- a/kernel/time/timer.c
 +++ b/kernel/time/timer.c
 @@ -1189,12 +1189,39 @@ static inline void __run_timers(struct tvec_base 
 *base)
 cascade(base, base-tv5, INDEX(3));
 ++base-timer_jiffies;
 list_replace_init(base-tv1.vec + index, head);
 +
 +again:
 while (!list_empty(head)) {
 void (*fn)(unsigned long);
 unsigned long data;
 bool irqsafe;

 -   timer = list_first_entry(head, struct 
 timer_list,entry);
 +   timer = list_first_entry(head, struct timer_list, 
 entry);
 +   if (unlikely(tbase_running(timer))) {
 +   /* Only one timer on the list, force wait. */
 +   if (unlikely(head-next == head-prev)) {
 +   spin_unlock(base-lock);
 +
 +   /*
 +* The only way to get here is if the
 +* handler requeued itself on another
 +* base, this guarantees the timer 
 will
 +* not go away.
 +*/
 +   while (tbase_running(timer))
 +   cpu_relax();
 +
 +   spin_lock(base-lock);
 +   } else  {
 +   /*
 +* Otherwise, rotate the list and try
 +* someone else.
 +*/
 +   list_move_tail(timer-entry, head);
 +   }
 +   goto again;
 +   }
 +
 fn = timer-function;
 data = timer-data;
 irqsafe = tbase_get_irqsafe(timer-base);

Yeah, so I have written something similar only. Wasn't sure about what you wrote
earlier. Thanks for the clarification.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 06:44:22PM +0530, Viresh Kumar wrote:
 On 30 March 2015 at 18:17, Peter Zijlstra pet...@infradead.org wrote:
  No, I means something else with that. We can remove the
  tvec_base::running_timer field. Everything that uses that can use
  tbase_running() AFAICT.
 
 Okay, there is one instance which still needs it.
 
 migrate_timers():
 
 BUG_ON(old_base-running_timer);
 
 What I wasn't sure about it is if we get can drop this statement or not.
 If we decide not to drop it, then we can convert running_timer into a bool.

Yeah, so that _should_ not trigger (obviously), and while I agree with
the sentiment of sanity checks, I'm not sure its worth keeping that
variable around just for that.

Anyway, while I'm looking at struct tvec_base I notice the cpu member
should be second after the lock, that'll save 8 bytes on the structure
on 64bit machines.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 29 March 2015 at 15:54, Peter Zijlstra pet...@infradead.org wrote:
 On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
  Now there are few issues I see here (Sorry if they are all imaginary):
  - In case a timer re-arms itself from its handler and is migrated from CPU 
  A to B, what
happens if the re-armed timer fires before the first handler finishes ? 
  i.e. timer-fn()
hasn't finished running on CPU A and it has fired again on CPU B. 
  Wouldn't this expose
us to a lot of other problems? It wouldn't be serialized to itself 
  anymore ?

 What I said above.

 What I didn't say, but had thought of is that __run_timer() should skip
 any timer that has RUNNING set -- for obvious reasons :-)

Below is copied from your first reply, and so you probably already
said that ? :)

 Also, once you have tbase_running, we can take base-running_timer out
 altogether.

I wanted to clarify if I understood it correctly..

Are you saying that:

Case 1.) if we find tbase_running on cpuY (because it was rearmed
from its handler on cpuX and has got migrated to cpuY), then we should drop the
timer from the list without calling its handler (as that is already
running in parallel) ?

Or

Case 2.) we keep retrying for it, until the time the other handler finishes?


I have few queries for both the cases.

Case 1.) Will that be fair to the timer user as the timer may get lost
completely.
If we skip the timer on cpuY here, it wouldn't be enqueued again and
so will be lost.

Case 2.) We kept waiting for the first handler to finish ..
- cpuY may waste some cycles as it kept waiting for handler to finish on cpuX ..
- We may need to perform base unlock/lock on cpuY, so that cpuX can take cpuY's
lock to reset tbase_running. And that might be racy, not sure.

--
viresh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Michal Hocko

On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
[...]
 Alternatively the thing hocko suggests is an utter fail too. You cannot
 stuff that into hardirq context, that's insane.

I guess you are referring to
http://article.gmane.org/gmane.linux.kernel.mm/127569, right?

Why cannot we do something like refresh_cpu_vm_stats from the IRQ
context?  Especially the first zone stat part. The per-cpu pagesets is
more costly and it would need a special treatment, alright. A simple
way would be to splice the lists from the per-cpu context and then free
those pages from the kthread context.

I am still wondering why those two things were squashed into a single
place. Why kswapd is not doing the pcp cleanup?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 05:08:18PM +0200, Michal Hocko wrote:
 On Sat 28-03-15 10:53:22, Peter Zijlstra wrote:
 [...]
  Alternatively the thing hocko suggests is an utter fail too. You cannot
  stuff that into hardirq context, that's insane.
 
 I guess you are referring to
 http://article.gmane.org/gmane.linux.kernel.mm/127569, right?
 
 Why cannot we do something like refresh_cpu_vm_stats from the IRQ
 context?  Especially the first zone stat part.

Big machines have big zone counts. There are machines with 200 nodes.
Although with the current trend of bigger nodes, the number of nodes
seems to come down as well. Still.

 The per-cpu pagesets is
 more costly and it would need a special treatment, alright. A simple
 way would be to splice the lists from the per-cpu context and then free
 those pages from the kthread context.
 
 I am still wondering why those two things were squashed into a single
 place. Why kswapd is not doing the pcp cleanup?

Probably because they could be. The problem with kswapd is that its per
node, not per cpu.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Christoph Lameter

On Mon, 30 Mar 2015, Michal Hocko wrote:

 Why cannot we do something like refresh_cpu_vm_stats from the IRQ
 context?  Especially the first zone stat part. The per-cpu pagesets is
 more costly and it would need a special treatment, alright. A simple
 way would be to splice the lists from the per-cpu context and then free
 those pages from the kthread context.

That would work.

 I am still wondering why those two things were squashed into a single
 place. Why kswapd is not doing the pcp cleanup?

They were squashed together by me for conveniences sake. They could be
separated. AFAICT the pcp cleanup could be done only on demand and we
already have logic for that when flushihng via IPI.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Peter Zijlstra

On Mon, Mar 30, 2015 at 09:47:01PM +0530, Viresh Kumar wrote:
 And all I get it is 8256 bytes, with or without the change.

Duh, rounded up to cacheline boundary ;-)

Trades two 4 byte holes at the start for a bigger 'hole' at the end.

struct tvec_base {
spinlock_t lock; /* 0 2 */

/* XXX 6 bytes hole, try to pack */

struct timer_list *running_timer;/* 8 8 */
long unsigned int  timer_jiffies;/*16 8 */
long unsigned int  next_timer;   /*24 8 */
long unsigned int  active_timers;/*32 8 */
long unsigned int  all_timers;   /*40 8 */
intcpu;  /*48 4 */

/* XXX 4 bytes hole, try to pack */

struct tvec_root   tv1;  /*56  4096 */
/* --- cacheline 64 boundary (4096 bytes) was 56 bytes ago --- */
struct tvectv2;  /*  4152  1024 */
/* --- cacheline 80 boundary (5120 bytes) was 56 bytes ago --- */
struct tvectv3;  /*  5176  1024 */
/* --- cacheline 96 boundary (6144 bytes) was 56 bytes ago --- */
struct tvectv4;  /*  6200  1024 */
/* --- cacheline 112 boundary (7168 bytes) was 56 bytes ago --- */
struct tvectv5;  /*  7224  1024 */
/* --- cacheline 128 boundary (8192 bytes) was 56 bytes ago --- */

/* size: 8256, cachelines: 129, members: 12 */
/* sum members: 8238, holes: 2, sum holes: 10 */
/* padding: 8 */
};

vs

struct tvec_base {
spinlock_t lock; /* 0 2 */

/* XXX 2 bytes hole, try to pack */

intcpu;  /* 4 4 */
struct timer_list *running_timer;/* 8 8 */
long unsigned int  timer_jiffies;/*16 8 */
long unsigned int  next_timer;   /*24 8 */
long unsigned int  active_timers;/*32 8 */
long unsigned int  all_timers;   /*40 8 */
struct tvec_root   tv1;  /*48  4096 */
/* --- cacheline 64 boundary (4096 bytes) was 48 bytes ago --- */
struct tvectv2;  /*  4144  1024 */
/* --- cacheline 80 boundary (5120 bytes) was 48 bytes ago --- */
struct tvectv3;  /*  5168  1024 */
/* --- cacheline 96 boundary (6144 bytes) was 48 bytes ago --- */
struct tvectv4;  /*  6192  1024 */
/* --- cacheline 112 boundary (7168 bytes) was 48 bytes ago --- */
struct tvectv5;  /*  7216  1024 */
/* --- cacheline 128 boundary (8192 bytes) was 48 bytes ago --- */

/* size: 8256, cachelines: 129, members: 12 */
/* sum members: 8238, holes: 1, sum holes: 2 */
/* padding: 16 */
};
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-30 Thread Viresh Kumar

On 30 March 2015 at 19:29, Peter Zijlstra pet...@infradead.org wrote:
 Yeah, so that _should_ not trigger (obviously), and while I agree with
 the sentiment of sanity checks, I'm not sure its worth keeping that
 variable around just for that.

I read it as I can remove it then ? :)

 Anyway, while I'm looking at struct tvec_base I notice the cpu member
 should be second after the lock, that'll save 8 bytes on the structure
 on 64bit machines.

Hmm, I tried it on my macbook-pro.

$ uname -a
Linux vireshk 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC
2015 x86_64 x86_64 x86_64 GNU/Linux

$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

config: arch/x86/configs/x86_64_defconfig

And all I get it is 8256 bytes, with or without the change.

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..afc5d74678df 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -77,12 +77,12 @@ struct tvec_root {

 struct tvec_base {
spinlock_t lock;
+   int cpu;
struct timer_list *running_timer;
unsigned long timer_jiffies;
unsigned long next_timer;
unsigned long active_timers;
unsigned long all_timers;
-   int cpu;
struct tvec_root tv1;
struct tvec tv2;
struct tvec tv3;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Peter Zijlstra

On Sun, Mar 29, 2015 at 05:31:32PM +0530, Viresh Kumar wrote:
> Warning:
> 
> config: blackfin-allyesconfig (attached as .config)
> reproduce:
>   wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
> -O ~/bin/make.cross
>   chmod +x ~/bin/make.cross
>   git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
>   # save the attached .config to linux build tree
>   make.cross ARCH=blackfin
> 
> All error/warnings:
> 
>kernel/time/timer.c: In function 'init_timers':
> >> kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' 
> >> declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct 
> >> tvec_base) & TIMER_FLAG_MASK

Ha, this is because blackfin is broken.

blackfin doesn't respect cacheline_aligned and NOPs it for UP
builds. Why it thinks {__,}__cacheline_aligned semantics should differ
between SMP/UP is a mystery to me, we have the &_in_smp primitives for
that.

So just ignore this, let the blackfin people deal with it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Viresh Kumar

On 28 March 2015 at 19:14, Peter Zijlstra  wrote:
> Yeah, something like the below (at the very end) should ensure the thing
> is cacheline aligned, that should give us a fair few bits.

> ---
>  kernel/time/timer.c | 36 
>  1 file changed, 8 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 2d3f5c504939..c8c45bf50b2e 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -93,6 +93,7 @@ struct tvec_base {
>  struct tvec_base boot_tvec_bases;
>  EXPORT_SYMBOL(boot_tvec_bases);
>  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = _tvec_bases;
> +static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
>
>  /* Functions below help us manage 'deferrable' flag */
>  static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
> @@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
>
>  static int init_timers_cpu(int cpu)
>  {
> -   int j;
> -   struct tvec_base *base;
> +   struct tvec_base *base = per_cpu(tvec_bases, cpu);
> static char tvec_base_done[NR_CPUS];
> +   int j;
>
> if (!tvec_base_done[cpu]) {
> static char boot_done;
>
> -   if (boot_done) {
> -   /*
> -* The APs use this path later in boot
> -*/
> -   base = kzalloc_node(sizeof(*base), GFP_KERNEL,
> -   cpu_to_node(cpu));
> -   if (!base)
> -   return -ENOMEM;
> -
> -   /* Make sure tvec_base has TIMER_FLAG_MASK bits free 
> */
> -   if (WARN_ON(base != tbase_get_base(base))) {
> -   kfree(base);
> -   return -ENOMEM;
> -   }
> -   per_cpu(tvec_bases, cpu) = base;
> +   if (!boot_done) {
> +   boot_done = 1; /* skip the boot cpu */
> } else {
> -   /*
> -* This is for the boot CPU - we use compile-time
> -* static initialisation because per-cpu memory isn't
> -* ready yet and because the memory allocators are not
> -* initialised either.
> -*/
> -   boot_done = 1;
> -   base = _tvec_bases;
> +   base = per_cpu_ptr(&__tvec_bases);
> +   per_cpu(tvec_bases, cpu) = base;
> }
> +
> spin_lock_init(>lock);
> tvec_base_done[cpu] = 1;
> base->cpu = cpu;
> -   } else {
> -   base = per_cpu(tvec_bases, cpu);
> }
>
> -
> for (j = 0; j < TVN_SIZE; j++) {
> INIT_LIST_HEAD(base->tv5.vec + j);
> INIT_LIST_HEAD(base->tv4.vec + j);

Even after this with following diff gives me the same warning on blackfin..

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..58bc28d9cef2 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -68,7 +68,7 @@ extern struct tvec_base boot_tvec_bases;
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU

-#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU

 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .prev = TIMER_ENTRY_STATIC },\


-xx--

Warning:

config: blackfin-allyesconfig (attached as .config)
reproduce:
  wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
-O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
  # save the attached .config to linux build tree
  make.cross ARCH=blackfin

All error/warnings:

   kernel/time/timer.c: In function 'init_timers':
>> kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' 
>> declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct 
>> tvec_base) & TIMER_FLAG_MASK
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
> > Now there are few issues I see here (Sorry if they are all imaginary):
> > - In case a timer re-arms itself from its handler and is migrated from CPU 
> > A to B, what
> >   happens if the re-armed timer fires before the first handler finishes ? 
> > i.e. timer->fn()
> >   hasn't finished running on CPU A and it has fired again on CPU B. 
> > Wouldn't this expose
> >   us to a lot of other problems? It wouldn't be serialized to itself 
> > anymore ?
> 
> What I said above.

What I didn't say, but had thought of is that __run_timer() should skip
any timer that has RUNNING set -- for obvious reasons :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Viresh Kumar

On 28 March 2015 at 19:14, Peter Zijlstra pet...@infradead.org wrote:
 Yeah, something like the below (at the very end) should ensure the thing
 is cacheline aligned, that should give us a fair few bits.

 ---
  kernel/time/timer.c | 36 
  1 file changed, 8 insertions(+), 28 deletions(-)

 diff --git a/kernel/time/timer.c b/kernel/time/timer.c
 index 2d3f5c504939..c8c45bf50b2e 100644
 --- a/kernel/time/timer.c
 +++ b/kernel/time/timer.c
 @@ -93,6 +93,7 @@ struct tvec_base {
  struct tvec_base boot_tvec_bases;
  EXPORT_SYMBOL(boot_tvec_bases);
  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = boot_tvec_bases;
 +static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);

  /* Functions below help us manage 'deferrable' flag */
  static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
 @@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);

  static int init_timers_cpu(int cpu)
  {
 -   int j;
 -   struct tvec_base *base;
 +   struct tvec_base *base = per_cpu(tvec_bases, cpu);
 static char tvec_base_done[NR_CPUS];
 +   int j;

 if (!tvec_base_done[cpu]) {
 static char boot_done;

 -   if (boot_done) {
 -   /*
 -* The APs use this path later in boot
 -*/
 -   base = kzalloc_node(sizeof(*base), GFP_KERNEL,
 -   cpu_to_node(cpu));
 -   if (!base)
 -   return -ENOMEM;
 -
 -   /* Make sure tvec_base has TIMER_FLAG_MASK bits free 
 */
 -   if (WARN_ON(base != tbase_get_base(base))) {
 -   kfree(base);
 -   return -ENOMEM;
 -   }
 -   per_cpu(tvec_bases, cpu) = base;
 +   if (!boot_done) {
 +   boot_done = 1; /* skip the boot cpu */
 } else {
 -   /*
 -* This is for the boot CPU - we use compile-time
 -* static initialisation because per-cpu memory isn't
 -* ready yet and because the memory allocators are not
 -* initialised either.
 -*/
 -   boot_done = 1;
 -   base = boot_tvec_bases;
 +   base = per_cpu_ptr(__tvec_bases);
 +   per_cpu(tvec_bases, cpu) = base;
 }
 +
 spin_lock_init(base-lock);
 tvec_base_done[cpu] = 1;
 base-cpu = cpu;
 -   } else {
 -   base = per_cpu(tvec_bases, cpu);
 }

 -
 for (j = 0; j  TVN_SIZE; j++) {
 INIT_LIST_HEAD(base-tv5.vec + j);
 INIT_LIST_HEAD(base-tv4.vec + j);

Even after this with following diff gives me the same warning on blackfin..

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..58bc28d9cef2 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -68,7 +68,7 @@ extern struct tvec_base boot_tvec_bases;
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU

-#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU

 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .prev = TIMER_ENTRY_STATIC },\


-xx--

Warning:

config: blackfin-allyesconfig (attached as .config)
reproduce:
  wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
-O ~/bin/make.cross
  chmod +x ~/bin/make.cross
  git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
  # save the attached .config to linux build tree
  make.cross ARCH=blackfin

All error/warnings:

   kernel/time/timer.c: In function 'init_timers':
 kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' 
 declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct 
 tvec_base)  TIMER_FLAG_MASK
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Peter Zijlstra

On Sun, Mar 29, 2015 at 05:31:32PM +0530, Viresh Kumar wrote:
 Warning:
 
 config: blackfin-allyesconfig (attached as .config)
 reproduce:
   wget 
 https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
   chmod +x ~/bin/make.cross
   git checkout ca713e393c6eceb54e803df204772a3d6e6c7981
   # save the attached .config to linux build tree
   make.cross ARCH=blackfin
 
 All error/warnings:
 
kernel/time/timer.c: In function 'init_timers':
  kernel/time/timer.c:1648:2: error: call to '__compiletime_assert_1648' 
  declared with attribute error: BUILD_BUG_ON failed: __alignof__(struct 
  tvec_base)  TIMER_FLAG_MASK

Ha, this is because blackfin is broken.

blackfin doesn't respect cacheline_aligned and NOPs it for UP
builds. Why it thinks {__,}__cacheline_aligned semantics should differ
between SMP/UP is a mystery to me, we have the _in_smp primitives for
that.

So just ignore this, let the blackfin people deal with it.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-29 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 02:44:57PM +0100, Peter Zijlstra wrote:
  Now there are few issues I see here (Sorry if they are all imaginary):
  - In case a timer re-arms itself from its handler and is migrated from CPU 
  A to B, what
happens if the re-armed timer fires before the first handler finishes ? 
  i.e. timer-fn()
hasn't finished running on CPU A and it has fired again on CPU B. 
  Wouldn't this expose
us to a lot of other problems? It wouldn't be serialized to itself 
  anymore ?
 
 What I said above.

What I didn't say, but had thought of is that __run_timer() should skip
any timer that has RUNNING set -- for obvious reasons :-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 05:27:23PM +0530, viresh kumar wrote:
> So probably we need to make 'base' aligned to 8 bytes ?

Yeah, something like the below (at the very end) should ensure the thing
is cacheline aligned, that should give us a fair few bits.

> So, what you are suggesting is something like this (untested):

> @@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
> timer_stats_account_timer(timer);
> 
> base->running_timer = timer;
> +   tbase_set_running(timer->base);
> detach_expired_timer(timer, base);
> 
> if (irqsafe) {
> @@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
> }
> }
> base->running_timer = NULL;
> +   tbase_clear_running(timer->base);
> spin_unlock_irq(>lock);
>  }

That's broken. You need to clear running on all the timers you set it
on. Furthermore, you need to revalidate timer->base == base after
call_timer_fn().

Something like so:

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..489ce182f8ec 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1213,6 +1213,21 @@ static inline void __run_timers(struct tvec_base *base)
call_timer_fn(timer, fn, data);
spin_lock_irq(>lock);
}
+
+   if (unlikely(timer->base != base)) {
+   unsigned long flags;
+   struct tvec_base *tbase;
+
+   spin_unlock(>lock);
+
+   tbase = lock_timer_base(timer, );
+   tbase_clear_running(timer->base);
+   spin_unlock(>lock);
+
+   spin_lock(>lock);
+   } else {
+   tbase_clear_running(timer->base);
+   }
}
}
base->running_timer = NULL;

Also, once you have tbase_running, we can take base->running_timer out
altogether.

> Now there are few issues I see here (Sorry if they are all imaginary):
> - In case a timer re-arms itself from its handler and is migrated from CPU A 
> to B, what
>   happens if the re-armed timer fires before the first handler finishes ? 
> i.e. timer->fn()
>   hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't 
> this expose
>   us to a lot of other problems? It wouldn't be serialized to itself anymore ?

What I said above.

> - Because the timer has migrated to another CPU, the locking in __run_timers()
>   needs to be fixed. And that will make it complicated ..

Hardly.

>   - __run_timer() doesn't lock bases of other CPUs, and it has to do it now..

Yep, but rarely.

>   - We probably need to take locks of both local CPU and the one to which 
> timer migrated.

Nope, or rather, not at the same time. That's what the NULL magic buys
us.

> - Its possible now that there can be more than one running timer for a base, 
> which wasn't
>   true earlier. Not sure if it will break something.

Only if you messed it up real bad :-)

---
 kernel/time/timer.c | 36 
 1 file changed, 8 insertions(+), 28 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c8c45bf50b2e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -93,6 +93,7 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = _tvec_bases;
+static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
 static int init_timers_cpu(int cpu)
 {
-   int j;
-   struct tvec_base *base;
+   struct tvec_base *base = per_cpu(tvec_bases, cpu);
static char tvec_base_done[NR_CPUS];
+   int j;
 
if (!tvec_base_done[cpu]) {
static char boot_done;
 
-   if (boot_done) {
-   /*
-* The APs use this path later in boot
-*/
-   base = kzalloc_node(sizeof(*base), GFP_KERNEL,
-   cpu_to_node(cpu));
-   if (!base)
-   return -ENOMEM;
-
-   /* Make sure tvec_base has TIMER_FLAG_MASK bits free */
-   if (WARN_ON(base != tbase_get_base(base))) {
-   kfree(base);
-   return -ENOMEM;
-   }
-   per_cpu(tvec_bases, cpu) = base;
+   if (!boot_done) {
+

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Viresh Kumar

On 28 March 2015 at 17:27, viresh kumar  wrote:
> On 28 March 2015 at 15:23, Peter Zijlstra  wrote:
>
>> Well, for one your patch is indeed disgusting.
>
> Yeah, I agree :)

Sigh..

Sorry for the series of *nonsense* mails before the last one.

Its some thunderbird *BUG* which did that, I was accessing my
mail from both gmail's interface and thunderbird and somehow
this happened. I have hit the send button only once.

Really sorry for the noise.

(The last mail has few inquiries towards the end and a thanks note,
just to identify it..)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread viresh kumar

On 28 March 2015 at 15:23, Peter Zijlstra  wrote:

> Well, for one your patch is indeed disgusting.

Yeah, I agree :)

> But yes I'm aware Thomas
> wants to rewrite the timer thing. But Thomas is away for a little while
> and if this really needs to happen then it does.

Sometime back I was trying to use another bit from base pointer for
marking a timer as PINNED:

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..e7184f57449c 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU
+#define TIMER_PINNED   0x4LU
 -#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU


And Fenguang's build-bot showed the problem (only) on blackfin [1].

config: make ARCH=blackfin allyesconfig

All error/warnings:

   kernel/timer.c: In function 'init_timers':
>> kernel/timer.c:1683:2: error: call to '__compiletime_assert_1683'
>> declared with attribute error: BUILD_BUG_ON failed:
>> __alignof__(struct tvec_base) & TIMER_FLAG_MASK


So probably we need to make 'base' aligned to 8 bytes ?



So, what you are suggesting is something like this (untested):

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..68bf09d69352 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU
+#define TIMER_RUNNING  0x4LU

-#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU

 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .prev = TIMER_ENTRY_STATIC },\
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..8f9efa64bd34 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -105,6 +105,21 @@ static inline unsigned int tbase_get_irqsafe(struct 
tvec_base *base)
return ((unsigned int)(unsigned long)base & TIMER_IRQSAFE);
 }

+static inline unsigned int tbase_get_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base & TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_set_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base | TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_clear_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base & ~TIMER_RUNNING);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
@@ -781,21 +796,12 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires,
new_base = per_cpu(tvec_bases, cpu);

if (base != new_base) {
-   /*
-* We are trying to schedule the timer on the local CPU.
-* However we can't change timer's base while it is running,
-* otherwise del_timer_sync() can't detect that the timer's
-* handler yet has not finished. This also guarantees that
-* the timer is serialized wrt itself.
-*/
-   if (likely(base->running_timer != timer)) {
-   /* See the comment in lock_timer_base() */
-   timer_set_base(timer, NULL);
-   spin_unlock(>lock);
-   base = new_base;
-   spin_lock(>lock);
-   timer_set_base(timer, base);
-   }
+   /* See the comment in lock_timer_base() */
+   timer_set_base(timer, NULL);
+   spin_unlock(>lock);
+   base = new_base;
+   spin_lock(>lock);
+   timer_set_base(timer, base);
}

timer->expires = expires;
@@ -1016,7 +1022,7 @@ int try_to_del_timer_sync(struct timer_list *timer)

base = lock_timer_base(timer, );

-   if (base->running_timer != timer) {
+   if (tbase_get_running(timer->base)) {
timer_stats_timer_clear_start_info(timer);
ret = detach_if_pending(timer, base, true);
}
@@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
timer_stats_account_timer(timer);

base->running_timer = timer;
+   tbase_set_running(timer->base);
detach_expired_timer(timer, base);

if (irqsafe) {
@@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
}
}
base->running_timer = NULL;
+   tbase_clear_running(timer->base);
spin_unlock_irq(>lock);

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 09:58:38AM +0530, Viresh Kumar wrote:
> On 27 March 2015 at 17:32, Peter Zijlstra  wrote:
> > What's not clear to me is why that thing is allocated at all, AFAICT
> > something like:
> >
> > static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
> >
> > Should do the right thing and be much simpler.
> 
> Does this comment from timers.c answers your query ?
> 
> /*
>  * This is for the boot CPU - we use compile-time
>  * static initialisation because per-cpu memory isn't
>  * ready yet and because the memory allocators are not
>  * initialised either.
>  */

No. The reason is because __TIMER_INITIALIZER() needs to set ->base to a
valid pointer and we cannot get a compile time pointer to per-cpu
entries because we don't know where we'll map the section, even for the
boot cpu.

This in turn means we need that boot_tvec_bases thing.

Now the reason we need to set ->base to a valid pointer is because we've
made NULL special.

Arguably we could create another special pointer, but since that's init
time only we get to add code for that will 'never' be used, more special
cases.

Its all a bit of a bother, but it makes sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 09:48:28AM +0530, Viresh Kumar wrote:
> On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra  wrote:
> > On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
> 
> >> So the issue seems to be that we need base->running_timer in order to
> >> tell if a callback is running, right?
> >>
> >> We could align the base on 8 bytes to gain an extra bit in the pointer
> >> and use that bit to indicate the running state. Then these sites can
> >> spin on that bit while we can change the actual base pointer.
> >
> > Even though tvec_base has cacheline_aligned stuck on, most are
> > allocated using kzalloc_node() which does not actually respect that but
> > already guarantees a minimum u64 alignment, so I think we can use that
> > third bit without too much magic.
> 
> Right. So what I tried earlier was very much similar to you are suggesting.
> The only difference was that I haven't made much attempts towards
> saving memory.
> 
> But Thomas didn't like it for sure and I believe that Rant will hold true for
> what you are suggesting as well, isn't it ?
> 
> http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html

Well, for one your patch is indeed disgusting. But yes I'm aware Thomas
wants to rewrite the timer thing. But Thomas is away for a little while
and if this really needs to happen then it does.

Alternatively the thing hocko suggests is an utter fail too. You cannot
stuff that into hardirq context, that's insane.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Viresh Kumar

On 28 March 2015 at 17:27, viresh kumar viresh.ku...@linaro.org wrote:
 On 28 March 2015 at 15:23, Peter Zijlstra pet...@infradead.org wrote:

 Well, for one your patch is indeed disgusting.

 Yeah, I agree :)

Sigh..

Sorry for the series of *nonsense* mails before the last one.

Its some thunderbird *BUG* which did that, I was accessing my
mail from both gmail's interface and thunderbird and somehow
this happened. I have hit the send button only once.

Really sorry for the noise.

(The last mail has few inquiries towards the end and a thanks note,
just to identify it..)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 05:27:23PM +0530, viresh kumar wrote:
 So probably we need to make 'base' aligned to 8 bytes ?

Yeah, something like the below (at the very end) should ensure the thing
is cacheline aligned, that should give us a fair few bits.

 So, what you are suggesting is something like this (untested):

 @@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
 timer_stats_account_timer(timer);
 
 base-running_timer = timer;
 +   tbase_set_running(timer-base);
 detach_expired_timer(timer, base);
 
 if (irqsafe) {
 @@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
 }
 }
 base-running_timer = NULL;
 +   tbase_clear_running(timer-base);
 spin_unlock_irq(base-lock);
  }

That's broken. You need to clear running on all the timers you set it
on. Furthermore, you need to revalidate timer-base == base after
call_timer_fn().

Something like so:

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..489ce182f8ec 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1213,6 +1213,21 @@ static inline void __run_timers(struct tvec_base *base)
call_timer_fn(timer, fn, data);
spin_lock_irq(base-lock);
}
+
+   if (unlikely(timer-base != base)) {
+   unsigned long flags;
+   struct tvec_base *tbase;
+
+   spin_unlock(base-lock);
+
+   tbase = lock_timer_base(timer, flags);
+   tbase_clear_running(timer-base);
+   spin_unlock(tbase-lock);
+
+   spin_lock(base-lock);
+   } else {
+   tbase_clear_running(timer-base);
+   }
}
}
base-running_timer = NULL;

Also, once you have tbase_running, we can take base-running_timer out
altogether.

 Now there are few issues I see here (Sorry if they are all imaginary):
 - In case a timer re-arms itself from its handler and is migrated from CPU A 
 to B, what
   happens if the re-armed timer fires before the first handler finishes ? 
 i.e. timer-fn()
   hasn't finished running on CPU A and it has fired again on CPU B. Wouldn't 
 this expose
   us to a lot of other problems? It wouldn't be serialized to itself anymore ?

What I said above.

 - Because the timer has migrated to another CPU, the locking in __run_timers()
   needs to be fixed. And that will make it complicated ..

Hardly.

   - __run_timer() doesn't lock bases of other CPUs, and it has to do it now..

Yep, but rarely.

   - We probably need to take locks of both local CPU and the one to which 
 timer migrated.

Nope, or rather, not at the same time. That's what the NULL magic buys
us.

 - Its possible now that there can be more than one running timer for a base, 
 which wasn't
   true earlier. Not sure if it will break something.

Only if you messed it up real bad :-)

---
 kernel/time/timer.c | 36 
 1 file changed, 8 insertions(+), 28 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..c8c45bf50b2e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -93,6 +93,7 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = boot_tvec_bases;
+static DEFINE_PER_CPU(struct tvec_base, __tvec_bases);
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -1534,46 +1535,25 @@ EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
 static int init_timers_cpu(int cpu)
 {
-   int j;
-   struct tvec_base *base;
+   struct tvec_base *base = per_cpu(tvec_bases, cpu);
static char tvec_base_done[NR_CPUS];
+   int j;
 
if (!tvec_base_done[cpu]) {
static char boot_done;
 
-   if (boot_done) {
-   /*
-* The APs use this path later in boot
-*/
-   base = kzalloc_node(sizeof(*base), GFP_KERNEL,
-   cpu_to_node(cpu));
-   if (!base)
-   return -ENOMEM;
-
-   /* Make sure tvec_base has TIMER_FLAG_MASK bits free */
-   if (WARN_ON(base != tbase_get_base(base))) {
-   kfree(base);
-   return -ENOMEM;
-   }
-   per_cpu(tvec_bases, cpu) = base;
+   if (!boot_done) {
+

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 09:58:38AM +0530, Viresh Kumar wrote:
 On 27 March 2015 at 17:32, Peter Zijlstra pet...@infradead.org wrote:
  What's not clear to me is why that thing is allocated at all, AFAICT
  something like:
 
  static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
 
  Should do the right thing and be much simpler.
 
 Does this comment from timers.c answers your query ?
 
 /*
  * This is for the boot CPU - we use compile-time
  * static initialisation because per-cpu memory isn't
  * ready yet and because the memory allocators are not
  * initialised either.
  */

No. The reason is because __TIMER_INITIALIZER() needs to set -base to a
valid pointer and we cannot get a compile time pointer to per-cpu
entries because we don't know where we'll map the section, even for the
boot cpu.

This in turn means we need that boot_tvec_bases thing.

Now the reason we need to set -base to a valid pointer is because we've
made NULL special.

Arguably we could create another special pointer, but since that's init
time only we get to add code for that will 'never' be used, more special
cases.

Its all a bit of a bother, but it makes sense.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread viresh kumar

On 28 March 2015 at 15:23, Peter Zijlstra pet...@infradead.org wrote:

 Well, for one your patch is indeed disgusting.

Yeah, I agree :)

 But yes I'm aware Thomas
 wants to rewrite the timer thing. But Thomas is away for a little while
 and if this really needs to happen then it does.

Sometime back I was trying to use another bit from base pointer for
marking a timer as PINNED:

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..e7184f57449c 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU
+#define TIMER_PINNED   0x4LU
 -#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU


And Fenguang's build-bot showed the problem (only) on blackfin [1].

config: make ARCH=blackfin allyesconfig

All error/warnings:

   kernel/timer.c: In function 'init_timers':
 kernel/timer.c:1683:2: error: call to '__compiletime_assert_1683'
 declared with attribute error: BUILD_BUG_ON failed:
 __alignof__(struct tvec_base)  TIMER_FLAG_MASK


So probably we need to make 'base' aligned to 8 bytes ?



So, what you are suggesting is something like this (untested):

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197e1587..68bf09d69352 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -67,8 +67,9 @@ extern struct tvec_base boot_tvec_bases;
  */
 #define TIMER_DEFERRABLE   0x1LU
 #define TIMER_IRQSAFE  0x2LU
+#define TIMER_RUNNING  0x4LU

-#define TIMER_FLAG_MASK0x3LU
+#define TIMER_FLAG_MASK0x7LU

 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .prev = TIMER_ENTRY_STATIC },\
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c504939..8f9efa64bd34 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -105,6 +105,21 @@ static inline unsigned int tbase_get_irqsafe(struct 
tvec_base *base)
return ((unsigned int)(unsigned long)base  TIMER_IRQSAFE);
 }

+static inline unsigned int tbase_get_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base  TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_set_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base | TIMER_RUNNING);
+}
+
+static inline unsigned int tbase_clear_running(struct tvec_base *base)
+{
+   return ((unsigned int)(unsigned long)base  ~TIMER_RUNNING);
+}
+
 static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
 {
return ((struct tvec_base *)((unsigned long)base  ~TIMER_FLAG_MASK));
@@ -781,21 +796,12 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires,
new_base = per_cpu(tvec_bases, cpu);

if (base != new_base) {
-   /*
-* We are trying to schedule the timer on the local CPU.
-* However we can't change timer's base while it is running,
-* otherwise del_timer_sync() can't detect that the timer's
-* handler yet has not finished. This also guarantees that
-* the timer is serialized wrt itself.
-*/
-   if (likely(base-running_timer != timer)) {
-   /* See the comment in lock_timer_base() */
-   timer_set_base(timer, NULL);
-   spin_unlock(base-lock);
-   base = new_base;
-   spin_lock(base-lock);
-   timer_set_base(timer, base);
-   }
+   /* See the comment in lock_timer_base() */
+   timer_set_base(timer, NULL);
+   spin_unlock(base-lock);
+   base = new_base;
+   spin_lock(base-lock);
+   timer_set_base(timer, base);
}

timer-expires = expires;
@@ -1016,7 +1022,7 @@ int try_to_del_timer_sync(struct timer_list *timer)

base = lock_timer_base(timer, flags);

-   if (base-running_timer != timer) {
+   if (tbase_get_running(timer-base)) {
timer_stats_timer_clear_start_info(timer);
ret = detach_if_pending(timer, base, true);
}
@@ -1202,6 +1208,7 @@ static inline void __run_timers(struct tvec_base *base)
timer_stats_account_timer(timer);

base-running_timer = timer;
+   tbase_set_running(timer-base);
detach_expired_timer(timer, base);

if (irqsafe) {
@@ -1216,6 +1223,7 @@ static inline void __run_timers(struct tvec_base *base)
}
}
base-running_timer = NULL;
+   tbase_clear_running(timer-base);

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-28 Thread Peter Zijlstra

On Sat, Mar 28, 2015 at 09:48:28AM +0530, Viresh Kumar wrote:
 On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra pet...@infradead.org wrote:
  On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
 
  So the issue seems to be that we need base-running_timer in order to
  tell if a callback is running, right?
 
  We could align the base on 8 bytes to gain an extra bit in the pointer
  and use that bit to indicate the running state. Then these sites can
  spin on that bit while we can change the actual base pointer.
 
  Even though tvec_base has cacheline_aligned stuck on, most are
  allocated using kzalloc_node() which does not actually respect that but
  already guarantees a minimum u64 alignment, so I think we can use that
  third bit without too much magic.
 
 Right. So what I tried earlier was very much similar to you are suggesting.
 The only difference was that I haven't made much attempts towards
 saving memory.
 
 But Thomas didn't like it for sure and I believe that Rant will hold true for
 what you are suggesting as well, isn't it ?
 
 http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html

Well, for one your patch is indeed disgusting. But yes I'm aware Thomas
wants to rewrite the timer thing. But Thomas is away for a little while
and if this really needs to happen then it does.

Alternatively the thing hocko suggests is an utter fail too. You cannot
stuff that into hardirq context, that's insane.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On 27 March 2015 at 19:49, Michal Hocko  wrote:

> Wouldn't something like I was suggesting few months back
> (http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this
> problem as well? Scheduler should be idle aware, no? I mean it shouldn't
> wake up an idle CPU if the task might run on another one.

Probably yes. Lets see what others have to say about it..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On 27 March 2015 at 17:32, Peter Zijlstra  wrote:
> What's not clear to me is why that thing is allocated at all, AFAICT
> something like:
>
> static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
>
> Should do the right thing and be much simpler.

Does this comment from timers.c answers your query ?

/*
 * This is for the boot CPU - we use compile-time
 * static initialisation because per-cpu memory isn't
 * ready yet and because the memory allocators are not
 * initialised either.
 */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra  wrote:
> On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:

>> So the issue seems to be that we need base->running_timer in order to
>> tell if a callback is running, right?
>>
>> We could align the base on 8 bytes to gain an extra bit in the pointer
>> and use that bit to indicate the running state. Then these sites can
>> spin on that bit while we can change the actual base pointer.
>
> Even though tvec_base has cacheline_aligned stuck on, most are
> allocated using kzalloc_node() which does not actually respect that but
> already guarantees a minimum u64 alignment, so I think we can use that
> third bit without too much magic.

Right. So what I tried earlier was very much similar to you are suggesting.
The only difference was that I haven't made much attempts towards
saving memory.

But Thomas didn't like it for sure and I believe that Rant will hold true for
what you are suggesting as well, isn't it ?

http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Christoph Lameter

On Fri, 27 Mar 2015, Peter Zijlstra wrote:

> On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
> >
> > Create a new slab cache for this purpose that does the proper aligning?
>
> That is certainly a possibility, but we'll only ever allocate nr_cpus-1
> entries from it, a whole new slab cache might be overkill.

This will certainly be aliased to some other slab cache so not much
overhead is created.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Michal Hocko

On Thu 26-03-15 11:09:01, Viresh Kumar wrote:
> A delayed work to schedule vmstat_shepherd() is queued at periodic intervals 
> for
> internal working of vmstat core. This work and its timer end up waking an idle
> cpu sometimes, as this always stays on CPU0.
> 
> Because we re-queue the work from its handler, idle_cpu() returns false and so
> the timer (used by delayed work) never migrates to any other CPU.
> 
> This may not be the desired behavior always as waking up an idle CPU to queue
> work on few other CPUs isn't good from power-consumption point of view.

Wouldn't something like I was suggesting few months back
(http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this
problem as well? Scheduler should be idle aware, no? I mean it shouldn't
wake up an idle CPU if the task might run on another one.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
> On Fri, 27 Mar 2015, Peter Zijlstra wrote:
> 
> > > We could align the base on 8 bytes to gain an extra bit in the pointer
> > > and use that bit to indicate the running state. Then these sites can
> > > spin on that bit while we can change the actual base pointer.
> >
> > Even though tvec_base has cacheline_aligned stuck on, most are
> > allocated using kzalloc_node() which does not actually respect that but
> > already guarantees a minimum u64 alignment, so I think we can use that
> > third bit without too much magic.
> 
> Create a new slab cache for this purpose that does the proper aligning?

That is certainly a possibility, but we'll only ever allocate nr_cpus-1
entries from it, a whole new slab cache might be overkill.

What's not clear to me is why that thing is allocated at all, AFAICT
something like:

static DEFINE_PER_CPU(struct tvec_base, tvec_bases);

Should do the right thing and be much simpler.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Christoph Lameter

On Fri, 27 Mar 2015, Peter Zijlstra wrote:

> > We could align the base on 8 bytes to gain an extra bit in the pointer
> > and use that bit to indicate the running state. Then these sites can
> > spin on that bit while we can change the actual base pointer.
>
> Even though tvec_base has cacheline_aligned stuck on, most are
> allocated using kzalloc_node() which does not actually respect that but
> already guarantees a minimum u64 alignment, so I think we can use that
> third bit without too much magic.

Create a new slab cache for this purpose that does the proper aligning?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
> > On 27 March 2015 at 01:48, Andrew Morton  wrote:
> > > Shouldn't this be viewed as a shortcoming of the core timer code?
> > 
> > Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, 
> > but
> > they are rejected for obviously reasons [1].
> > 
> > > vmstat_shepherd() is merely rescheduling itself with
> > > schedule_delayed_work().  That's a dead bog simple operation and if
> > > it's producing suboptimal behaviour then we shouldn't be fixing it with
> > > elaborate workarounds in the caller?
> > 
> > I understand that, and that's why I sent it as an RFC to get the discussion
> > started. Does anyone else have got another (acceptable) idea to get this
> > resolved ?
> 
> So the issue seems to be that we need base->running_timer in order to
> tell if a callback is running, right?
> 
> We could align the base on 8 bytes to gain an extra bit in the pointer
> and use that bit to indicate the running state. Then these sites can
> spin on that bit while we can change the actual base pointer.

Even though tvec_base has cacheline_aligned stuck on, most are
allocated using kzalloc_node() which does not actually respect that but
already guarantees a minimum u64 alignment, so I think we can use that
third bit without too much magic.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
> On 27 March 2015 at 01:48, Andrew Morton  wrote:
> > Shouldn't this be viewed as a shortcoming of the core timer code?
> 
> Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, 
> but
> they are rejected for obviously reasons [1].
> 
> > vmstat_shepherd() is merely rescheduling itself with
> > schedule_delayed_work().  That's a dead bog simple operation and if
> > it's producing suboptimal behaviour then we shouldn't be fixing it with
> > elaborate workarounds in the caller?
> 
> I understand that, and that's why I sent it as an RFC to get the discussion
> started. Does anyone else have got another (acceptable) idea to get this
> resolved ?

So the issue seems to be that we need base->running_timer in order to
tell if a callback is running, right?

We could align the base on 8 bytes to gain an extra bit in the pointer
and use that bit to indicate the running state. Then these sites can
spin on that bit while we can change the actual base pointer.

Since the timer->base pointer is locked through the base->lock and
hand-over is safe vs lock_timer_base, this should all work.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:
 On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
  On 27 March 2015 at 01:48, Andrew Morton a...@linux-foundation.org wrote:
   Shouldn't this be viewed as a shortcoming of the core timer code?
  
  Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, 
  but
  they are rejected for obviously reasons [1].
  
   vmstat_shepherd() is merely rescheduling itself with
   schedule_delayed_work().  That's a dead bog simple operation and if
   it's producing suboptimal behaviour then we shouldn't be fixing it with
   elaborate workarounds in the caller?
  
  I understand that, and that's why I sent it as an RFC to get the discussion
  started. Does anyone else have got another (acceptable) idea to get this
  resolved ?
 
 So the issue seems to be that we need base-running_timer in order to
 tell if a callback is running, right?
 
 We could align the base on 8 bytes to gain an extra bit in the pointer
 and use that bit to indicate the running state. Then these sites can
 spin on that bit while we can change the actual base pointer.

Even though tvec_base has cacheline_aligned stuck on, most are
allocated using kzalloc_node() which does not actually respect that but
already guarantees a minimum u64 alignment, so I think we can use that
third bit without too much magic.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Christoph Lameter

On Fri, 27 Mar 2015, Peter Zijlstra wrote:

  We could align the base on 8 bytes to gain an extra bit in the pointer
  and use that bit to indicate the running state. Then these sites can
  spin on that bit while we can change the actual base pointer.

 Even though tvec_base has cacheline_aligned stuck on, most are
 allocated using kzalloc_node() which does not actually respect that but
 already guarantees a minimum u64 alignment, so I think we can use that
 third bit without too much magic.

Create a new slab cache for this purpose that does the proper aligning?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 10:19:54AM +0530, Viresh Kumar wrote:
 On 27 March 2015 at 01:48, Andrew Morton a...@linux-foundation.org wrote:
  Shouldn't this be viewed as a shortcoming of the core timer code?
 
 Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, 
 but
 they are rejected for obviously reasons [1].
 
  vmstat_shepherd() is merely rescheduling itself with
  schedule_delayed_work().  That's a dead bog simple operation and if
  it's producing suboptimal behaviour then we shouldn't be fixing it with
  elaborate workarounds in the caller?
 
 I understand that, and that's why I sent it as an RFC to get the discussion
 started. Does anyone else have got another (acceptable) idea to get this
 resolved ?

So the issue seems to be that we need base-running_timer in order to
tell if a callback is running, right?

We could align the base on 8 bytes to gain an extra bit in the pointer
and use that bit to indicate the running state. Then these sites can
spin on that bit while we can change the actual base pointer.

Since the timer-base pointer is locked through the base-lock and
hand-over is safe vs lock_timer_base, this should all work.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Peter Zijlstra

On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
 On Fri, 27 Mar 2015, Peter Zijlstra wrote:
 
   We could align the base on 8 bytes to gain an extra bit in the pointer
   and use that bit to indicate the running state. Then these sites can
   spin on that bit while we can change the actual base pointer.
 
  Even though tvec_base has cacheline_aligned stuck on, most are
  allocated using kzalloc_node() which does not actually respect that but
  already guarantees a minimum u64 alignment, so I think we can use that
  third bit without too much magic.
 
 Create a new slab cache for this purpose that does the proper aligning?

That is certainly a possibility, but we'll only ever allocate nr_cpus-1
entries from it, a whole new slab cache might be overkill.

What's not clear to me is why that thing is allocated at all, AFAICT
something like:

static DEFINE_PER_CPU(struct tvec_base, tvec_bases);

Should do the right thing and be much simpler.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Michal Hocko

On Thu 26-03-15 11:09:01, Viresh Kumar wrote:
 A delayed work to schedule vmstat_shepherd() is queued at periodic intervals 
 for
 internal working of vmstat core. This work and its timer end up waking an idle
 cpu sometimes, as this always stays on CPU0.
 
 Because we re-queue the work from its handler, idle_cpu() returns false and so
 the timer (used by delayed work) never migrates to any other CPU.
 
 This may not be the desired behavior always as waking up an idle CPU to queue
 work on few other CPUs isn't good from power-consumption point of view.

Wouldn't something like I was suggesting few months back
(http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this
problem as well? Scheduler should be idle aware, no? I mean it shouldn't
wake up an idle CPU if the task might run on another one.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Christoph Lameter

On Fri, 27 Mar 2015, Peter Zijlstra wrote:

 On Fri, Mar 27, 2015 at 06:11:44AM -0500, Christoph Lameter wrote:
 
  Create a new slab cache for this purpose that does the proper aligning?

 That is certainly a possibility, but we'll only ever allocate nr_cpus-1
 entries from it, a whole new slab cache might be overkill.

This will certainly be aliased to some other slab cache so not much
overhead is created.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On Fri, Mar 27, 2015 at 3:00 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Fri, Mar 27, 2015 at 10:16:13AM +0100, Peter Zijlstra wrote:

 So the issue seems to be that we need base-running_timer in order to
 tell if a callback is running, right?

 We could align the base on 8 bytes to gain an extra bit in the pointer
 and use that bit to indicate the running state. Then these sites can
 spin on that bit while we can change the actual base pointer.

 Even though tvec_base has cacheline_aligned stuck on, most are
 allocated using kzalloc_node() which does not actually respect that but
 already guarantees a minimum u64 alignment, so I think we can use that
 third bit without too much magic.

Right. So what I tried earlier was very much similar to you are suggesting.
The only difference was that I haven't made much attempts towards
saving memory.

But Thomas didn't like it for sure and I believe that Rant will hold true for
what you are suggesting as well, isn't it ?

http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On 27 March 2015 at 19:49, Michal Hocko mho...@suse.cz wrote:

 Wouldn't something like I was suggesting few months back
 (http://article.gmane.org/gmane.linux.kernel.mm/127569) solve this
 problem as well? Scheduler should be idle aware, no? I mean it shouldn't
 wake up an idle CPU if the task might run on another one.

Probably yes. Lets see what others have to say about it..
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-27 Thread Viresh Kumar

On 27 March 2015 at 17:32, Peter Zijlstra pet...@infradead.org wrote:
 What's not clear to me is why that thing is allocated at all, AFAICT
 something like:

 static DEFINE_PER_CPU(struct tvec_base, tvec_bases);

 Should do the right thing and be much simpler.

Does this comment from timers.c answers your query ?

/*
 * This is for the boot CPU - we use compile-time
 * static initialisation because per-cpu memory isn't
 * ready yet and because the memory allocators are not
 * initialised either.
 */
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-26 Thread Viresh Kumar

On 27 March 2015 at 01:48, Andrew Morton  wrote:
> Shouldn't this be viewed as a shortcoming of the core timer code?

Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but
they are rejected for obviously reasons [1].

> vmstat_shepherd() is merely rescheduling itself with
> schedule_delayed_work().  That's a dead bog simple operation and if
> it's producing suboptimal behaviour then we shouldn't be fixing it with
> elaborate workarounds in the caller?

I understand that, and that's why I sent it as an RFC to get the discussion
started. Does anyone else have got another (acceptable) idea to get this
resolved ?

--
viresh

[1] http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-26 Thread Andrew Morton

On Thu, 26 Mar 2015 11:09:01 +0530 Viresh Kumar  wrote:

> A delayed work to schedule vmstat_shepherd() is queued at periodic intervals 
> for
> internal working of vmstat core. This work and its timer end up waking an idle
> cpu sometimes, as this always stays on CPU0.
> 
> Because we re-queue the work from its handler, idle_cpu() returns false and so
> the timer (used by delayed work) never migrates to any other CPU.
> 
> This may not be the desired behavior always as waking up an idle CPU to queue
> work on few other CPUs isn't good from power-consumption point of view.
> 
> In order to avoid waking up an idle core, we can replace 
> schedule_delayed_work()
> with a normal work plus a separate timer. The timer handler will then queue 
> the
> work after re-arming the timer. If the CPU was idle before the timer fired,
> idle_cpu() will mostly return true and the next timer shall be migrated to a
> non-idle CPU.
> 
> But the timer core has a limitation, when the timer is re-armed from its
> handler, timer core disables migration of that timer to other cores. Details 
> of
> that limitation are present in kernel/time/timer.c:__mod_timer() routine.
> 
> Another simple yet effective solution can be to keep two timers with same
> handler and keep toggling between them, so that the above limitation doesn't
> hold true anymore.
> 
> This patch replaces schedule_delayed_work() with schedule_work() plus two
> timers. After this, it was seen that the timer and its do get migrated to 
> other
> non-idle CPUs, when the local cpu is idle.

Shouldn't this be viewed as a shortcoming of the core timer code? 

vmstat_shepherd() is merely rescheduling itself with
schedule_delayed_work().  That's a dead bog simple operation and if
it's producing suboptimal behaviour then we shouldn't be fixing it with
elaborate workarounds in the caller?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-26 Thread Viresh Kumar

On 27 March 2015 at 01:48, Andrew Morton a...@linux-foundation.org wrote:
 Shouldn't this be viewed as a shortcoming of the core timer code?

Yeah, it is. Some (not so pretty) solutions were tried earlier to fix that, but
they are rejected for obviously reasons [1].

 vmstat_shepherd() is merely rescheduling itself with
 schedule_delayed_work().  That's a dead bog simple operation and if
 it's producing suboptimal behaviour then we shouldn't be fixing it with
 elaborate workarounds in the caller?

I understand that, and that's why I sent it as an RFC to get the discussion
started. Does anyone else have got another (acceptable) idea to get this
resolved ?

--
viresh

[1] http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008866.html
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-26 Thread Andrew Morton

On Thu, 26 Mar 2015 11:09:01 +0530 Viresh Kumar viresh.ku...@linaro.org wrote:

 A delayed work to schedule vmstat_shepherd() is queued at periodic intervals 
 for
 internal working of vmstat core. This work and its timer end up waking an idle
 cpu sometimes, as this always stays on CPU0.
 
 Because we re-queue the work from its handler, idle_cpu() returns false and so
 the timer (used by delayed work) never migrates to any other CPU.
 
 This may not be the desired behavior always as waking up an idle CPU to queue
 work on few other CPUs isn't good from power-consumption point of view.
 
 In order to avoid waking up an idle core, we can replace 
 schedule_delayed_work()
 with a normal work plus a separate timer. The timer handler will then queue 
 the
 work after re-arming the timer. If the CPU was idle before the timer fired,
 idle_cpu() will mostly return true and the next timer shall be migrated to a
 non-idle CPU.
 
 But the timer core has a limitation, when the timer is re-armed from its
 handler, timer core disables migration of that timer to other cores. Details 
 of
 that limitation are present in kernel/time/timer.c:__mod_timer() routine.
 
 Another simple yet effective solution can be to keep two timers with same
 handler and keep toggling between them, so that the above limitation doesn't
 hold true anymore.
 
 This patch replaces schedule_delayed_work() with schedule_work() plus two
 timers. After this, it was seen that the timer and its do get migrated to 
 other
 non-idle CPUs, when the local cpu is idle.

Shouldn't this be viewed as a shortcoming of the core timer code? 

vmstat_shepherd() is merely rescheduling itself with
schedule_delayed_work().  That's a dead bog simple operation and if
it's producing suboptimal behaviour then we shouldn't be fixing it with
elaborate workarounds in the caller?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-25 Thread Viresh Kumar

A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for
internal working of vmstat core. This work and its timer end up waking an idle
cpu sometimes, as this always stays on CPU0.

Because we re-queue the work from its handler, idle_cpu() returns false and so
the timer (used by delayed work) never migrates to any other CPU.

This may not be the desired behavior always as waking up an idle CPU to queue
work on few other CPUs isn't good from power-consumption point of view.

In order to avoid waking up an idle core, we can replace schedule_delayed_work()
with a normal work plus a separate timer. The timer handler will then queue the
work after re-arming the timer. If the CPU was idle before the timer fired,
idle_cpu() will mostly return true and the next timer shall be migrated to a
non-idle CPU.

But the timer core has a limitation, when the timer is re-armed from its
handler, timer core disables migration of that timer to other cores. Details of
that limitation are present in kernel/time/timer.c:__mod_timer() routine.

Another simple yet effective solution can be to keep two timers with same
handler and keep toggling between them, so that the above limitation doesn't
hold true anymore.

This patch replaces schedule_delayed_work() with schedule_work() plus two
timers. After this, it was seen that the timer and its do get migrated to other
non-idle CPUs, when the local cpu is idle.

Tested-by: Vinayak Menon 
Tested-by: Shiraz Hashim 
Signed-off-by: Viresh Kumar 
---
This patch isn't sent to say its the best way forward, but to get a discussion
started on the same.

 mm/vmstat.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..d45e4243a046 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1424,8 +1424,18 @@ static bool need_update(int cpu)
  * inactivity.
  */
 static void vmstat_shepherd(struct work_struct *w);
+static DECLARE_WORK(shepherd, vmstat_shepherd);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+/*
+ * Two timers are used here to avoid waking up an idle CPU. If a single timer 
is
+ * kept, then re-arming the timer from its handler doesn't let us migrate it.
+ */
+static struct timer_list shepherd_timer[2];
+#define toggle_timer() (shepherd_timer_index = !shepherd_timer_index,  \
+   _timer[shepherd_timer_index])
+
+static void vmstat_shepherd_timer(unsigned long data);
+static int shepherd_timer_index;
 
 static void vmstat_shepherd(struct work_struct *w)
 {
@@ -1441,15 +1451,19 @@ static void vmstat_shepherd(struct work_struct *w)
_cpu(vmstat_work, cpu), 0);
 
put_online_cpus();
+}
 
-   schedule_delayed_work(,
-   round_jiffies_relative(sysctl_stat_interval));
+static void vmstat_shepherd_timer(unsigned long data)
+{
+   mod_timer(toggle_timer(),
+ jiffies + round_jiffies_relative(sysctl_stat_interval));
+   schedule_work();
 
 }
 
 static void __init start_shepherd_timer(void)
 {
-   int cpu;
+   int cpu, i = -1;
 
for_each_possible_cpu(cpu)
INIT_DELAYED_WORK(per_cpu_ptr(_work, cpu),
@@ -1459,8 +1473,13 @@ static void __init start_shepherd_timer(void)
BUG();
cpumask_copy(cpu_stat_off, cpu_online_mask);
 
-   schedule_delayed_work(,
-   round_jiffies_relative(sysctl_stat_interval));
+   while (++i < 2) {
+   init_timer(_timer[i]);
+   shepherd_timer[i].function = vmstat_shepherd_timer;
+   };
+
+   mod_timer(toggle_timer(),
+ jiffies + round_jiffies_relative(sysctl_stat_interval));
 }
 
 static void vmstat_cpu_dead(int node)
-- 
2.3.0.rc0.44.ga94655d

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC] vmstat: Avoid waking up idle-cpu to service shepherd work

2015-03-25 Thread Viresh Kumar

A delayed work to schedule vmstat_shepherd() is queued at periodic intervals for
internal working of vmstat core. This work and its timer end up waking an idle
cpu sometimes, as this always stays on CPU0.

Because we re-queue the work from its handler, idle_cpu() returns false and so
the timer (used by delayed work) never migrates to any other CPU.

This may not be the desired behavior always as waking up an idle CPU to queue
work on few other CPUs isn't good from power-consumption point of view.

In order to avoid waking up an idle core, we can replace schedule_delayed_work()
with a normal work plus a separate timer. The timer handler will then queue the
work after re-arming the timer. If the CPU was idle before the timer fired,
idle_cpu() will mostly return true and the next timer shall be migrated to a
non-idle CPU.

But the timer core has a limitation, when the timer is re-armed from its
handler, timer core disables migration of that timer to other cores. Details of
that limitation are present in kernel/time/timer.c:__mod_timer() routine.

Another simple yet effective solution can be to keep two timers with same
handler and keep toggling between them, so that the above limitation doesn't
hold true anymore.

This patch replaces schedule_delayed_work() with schedule_work() plus two
timers. After this, it was seen that the timer and its do get migrated to other
non-idle CPUs, when the local cpu is idle.

Tested-by: Vinayak Menon vinme...@codeaurora.org
Tested-by: Shiraz Hashim shas...@codeaurora.org
Signed-off-by: Viresh Kumar viresh.ku...@linaro.org
---
This patch isn't sent to say its the best way forward, but to get a discussion
started on the same.

 mm/vmstat.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..d45e4243a046 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1424,8 +1424,18 @@ static bool need_update(int cpu)
  * inactivity.
  */
 static void vmstat_shepherd(struct work_struct *w);
+static DECLARE_WORK(shepherd, vmstat_shepherd);
 
-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+/*
+ * Two timers are used here to avoid waking up an idle CPU. If a single timer 
is
+ * kept, then re-arming the timer from its handler doesn't let us migrate it.
+ */
+static struct timer_list shepherd_timer[2];
+#define toggle_timer() (shepherd_timer_index = !shepherd_timer_index,  \
+   shepherd_timer[shepherd_timer_index])
+
+static void vmstat_shepherd_timer(unsigned long data);
+static int shepherd_timer_index;
 
 static void vmstat_shepherd(struct work_struct *w)
 {
@@ -1441,15 +1451,19 @@ static void vmstat_shepherd(struct work_struct *w)
per_cpu(vmstat_work, cpu), 0);
 
put_online_cpus();
+}
 
-   schedule_delayed_work(shepherd,
-   round_jiffies_relative(sysctl_stat_interval));
+static void vmstat_shepherd_timer(unsigned long data)
+{
+   mod_timer(toggle_timer(),
+ jiffies + round_jiffies_relative(sysctl_stat_interval));
+   schedule_work(shepherd);
 
 }
 
 static void __init start_shepherd_timer(void)
 {
-   int cpu;
+   int cpu, i = -1;
 
for_each_possible_cpu(cpu)
INIT_DELAYED_WORK(per_cpu_ptr(vmstat_work, cpu),
@@ -1459,8 +1473,13 @@ static void __init start_shepherd_timer(void)
BUG();
cpumask_copy(cpu_stat_off, cpu_online_mask);
 
-   schedule_delayed_work(shepherd,
-   round_jiffies_relative(sysctl_stat_interval));
+   while (++i  2) {
+   init_timer(shepherd_timer[i]);
+   shepherd_timer[i].function = vmstat_shepherd_timer;
+   };
+
+   mod_timer(toggle_timer(),
+ jiffies + round_jiffies_relative(sysctl_stat_interval));
 }
 
 static void vmstat_cpu_dead(int node)
-- 
2.3.0.rc0.44.ga94655d

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

58 matches

Mail list logo