from:"Vladimir Davydov"

Re: [PATCH] cpuidle: menu: use nr_running instead of cpuload for calculating perf mult

2012-11-27 Thread Vladimir Davydov

Hello again

Playing with cpu cgroups, I've found that performance of various workloads 
running inside cgroups can be significantly improved by increasing sched load 
resolution and that this improvement has already been committed to the kernel 
(c8b281161dfa4bb5d5be63fb036ce19347b88c63). Unfortunately, it turned out that 
the commit triggered power usage regression and as a result was reverted until 
the root cause of the regression was found (see commit 
e4c2fb0d5776b58049d2556b456144a4db3fe5a9). And the root cause is still 
unknown...

I guess that the behavior of the cpuidle menu governor, which is stated in the 
patch below, can explain the regression. The governor uses this_cpu_load() to 
estimate idleness of the system. After commit 
c8b281161dfa4bb5d5be63fb036ce19347b88c63, which increases sched load 
resolution, this_cpu_load() seems to return large values even for small loads, 
which would probably lead to the cpuidle governor making wrong decisions due to 
overestimating the system load.

So, this seems to be another reason to use some different performance 
multiplier in cpuidle governor.

On Jun 4, 2012, at 2:24 PM, Vladimir Davydov  wrote:

> rq->cpuload strongly depends on cgroup hierarchy. For example, if hundreds of
> tasks are running inside cpu:/test cgroup, the sum of cpuload over all cpus
> won't exceed 1024 (by default). That makes the cpuidle menu governor take 
> wrong
> decisions, which can negatively affect overall performance.
> 
> To cope this, use nr_running last seen in __update_cpu_load() instead of
> cpuload for calculating performance multiplier.
> ---
> drivers/cpuidle/governors/menu.c |   17 +++--
> include/linux/sched.h|2 +-
> kernel/sched/core.c  |7 ---
> kernel/sched/sched.h |3 +++
> 4 files changed, 11 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/cpuidle/governors/menu.c 
> b/drivers/cpuidle/governors/menu.c
> index 0633575..2aa2625 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -123,17 +123,6 @@ struct menu_device {
> };
> 
> 
> -#define LOAD_INT(x) ((x) >> FSHIFT)
> -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
> -
> -static int get_loadavg(void)
> -{
> - unsigned long this = this_cpu_load();
> -
> -
> - return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10;
> -}
> -
> static inline int which_bucket(unsigned int duration)
> {
>   int bucket = 0;
> @@ -170,13 +159,13 @@ static inline int which_bucket(unsigned int duration)
> static inline int performance_multiplier(void)
> {
>   int mult = 1;
> + int this_cpu = smp_processor_id();
> 
>   /* for higher loadavg, we are more reluctant */
> -
> - mult += 2 * get_loadavg();
> + mult += 10 * nr_active_cpu(this_cpu);
> 
>   /* for IO wait tasks (per cpu!) we add 5x each */
> - mult += 10 * nr_iowait_cpu(smp_processor_id());
> + mult += 10 * nr_iowait_cpu(this_cpu);
> 
>   return mult;
> }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f45c0b2..fb83d22 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -141,7 +141,7 @@ extern unsigned long nr_running(void);
> extern unsigned long nr_uninterruptible(void);
> extern unsigned long nr_iowait(void);
> extern unsigned long nr_iowait_cpu(int cpu);
> -extern unsigned long this_cpu_load(void);
> +extern unsigned long nr_active_cpu(int cpu);
> 
> 
> extern void calc_global_load(unsigned long ticks);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 39eb601..8cc2011 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2155,10 +2155,10 @@ unsigned long nr_iowait_cpu(int cpu)
>   return atomic_read(&this->nr_iowait);
> }
> 
> -unsigned long this_cpu_load(void)
> +unsigned long nr_active_cpu(int cpu)
> {
> - struct rq *this = this_rq();
> - return this->cpu_load[0];
> + struct rq *this = cpu_rq(cpu);
> + return this->nr_active;
> }
> 
> 
> @@ -2494,6 +2494,7 @@ static void __update_cpu_load(struct rq *this_rq, 
> unsigned long this_load,
>   this_rq->nr_load_updates++;
> 
>   /* Update our load: */
> + this_rq->nr_active = this_rq->nr_running;
>   this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
>   for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
>   unsigned long old_load, new_load;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ba9dccf..2564712 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -379,6 +379,9 @@ struct rq {
>   struct list_head leaf_rt_rq_list;
&g

[PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only

2013-03-19 Thread Vladimir Davydov

mnt_drop_write() must be called only if mnt_want_write() succeeded,
otherwise the mnt_writers counter will diverge.

Signed-off-by: Vladimir Davydov 
Cc: Doug Ledford 
Cc: Andrew Morton 
Cc: KOSAKI Motohiro 
Cc: "Eric W. Biederman" 
---
 ipc/mqueue.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index e5c4f60..3953fda 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -840,7 +840,8 @@ out_putfd:
fd = error;
}
mutex_unlock(&root->d_inode->i_mutex);
-   mnt_drop_write(mnt);
+   if (!ro)
+   mnt_drop_write(mnt);
 out_putname:
putname(name);
return fd;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only

2013-03-19 Thread Vladimir Davydov

On Mar 20, 2013, at 1:09 AM, Andrew Morton 
 wrote:

> On Tue, 19 Mar 2013 13:31:18 +0400 Vladimir Davydov  
> wrote:
> 
>> mnt_drop_write() must be called only if mnt_want_write() succeeded,
>> otherwise the mnt_writers counter will diverge.
>> 
>> ...
>> 
>> --- a/ipc/mqueue.c
>> +++ b/ipc/mqueue.c
>> @@ -840,7 +840,8 @@ out_putfd:
>>  fd = error;
>>  }
>>  mutex_unlock(&root->d_inode->i_mutex);
>> -mnt_drop_write(mnt);
>> +if (!ro)
>> +mnt_drop_write(mnt);
>> out_putname:
>>  putname(name);
>>  return fd;
> 
> huh, that's been there for a while.  What were the runtime-visible
> effects of the bug?

mnt_writers counters are used to check if remounting FS as read-only is OK, so 
after an extra mnt_drop_write() call, it would be impossible to remount mqueue 
FS as read-only. Besides, on umount a warning would be printed like this one:

[  194.714880] =
[  194.719680] [ BUG: bad unlock balance detected! ]
[  194.724488] 3.9.0-rc3 #5 Not tainted
[  194.728159] -
[  194.732958] a.out/12486 is trying to release lock (sb_writers) at:
[  194.739355] [] mnt_drop_write+0x1f/0x30
[  194.744851] but there are no more locks to release!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] sched: initialize runtime to non-zero on cfs bw set

2013-02-07 Thread Vladimir Davydov

If cfs_rq->runtime_remaining is <= 0 then either
- cfs_rq is throttled and waiting for quota redistribution, or
- cfs_rq is currently executing and will be throttled on
  put_prev_entity, or
- cfs_rq is not throttled and has not executed since its quota was set
  (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).

It is obvious that the last case is rather an exception from the rule
"runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
soon as it finishes its execution". Moreover, it can lead to a task hang
as follows. If put_prev_task is called immediately after first
pick_next_task after quota was set, "immediately" meaning rq->clock in
both functions is the same, then the corresponding cfs_rq will be
throttled. Besides being unfair (the cfs_rq has not executed in fact),
the quota refilling timer can be idle at that time and it won't be
activated on put_prev_task because update_curr calls
account_cfs_rq_runtime, which activates the timer, only if delta_exec is
strictly positive. As a result we can get a task "running" inside a
throttled cfs_rq which will probably never be unthrottled.

To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
will be throttled only if it has executed for some positive number of
nanoseconds.
--
Several times we had our customers encountered such hangs inside a VM
(seems something is wrong or rather different in time accounting there).
Analyzing crash dumps revealed that hung tasks were running inside
cfs_rq's, which had the following setup

cfs_rq->throttled=1
cfs_rq->runtime_enabled=1
cfs_rq->runtime_remaining=0
cfs_rq->tg->cfs_bandwidth.idle=1
cfs_rq->tg->cfs_bandwidth.timer_active=0

which conforms pretty nice to the explanation given above.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/core.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..c7a078f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
 
raw_spin_lock_irq(&rq->lock);
cfs_rq->runtime_enabled = runtime_enabled;
-   cfs_rq->runtime_remaining = 0;
+   cfs_rq->runtime_remaining = 1;
 
if (cfs_rq->throttled)
unthrottle_cfs_rq(cfs_rq);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set

2013-02-08 Thread Vladimir Davydov

On Feb 8, 2013, at 6:46 PM, Paul Turner  wrote:

> On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
>> If cfs_rq->runtime_remaining is <= 0 then either
>> - cfs_rq is throttled and waiting for quota redistribution, or
>> - cfs_rq is currently executing and will be throttled on
>>  put_prev_entity, or
>> - cfs_rq is not throttled and has not executed since its quota was set
>>  (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
>> 
>> It is obvious that the last case is rather an exception from the rule
>> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
>> soon as it finishes its execution". Moreover, it can lead to a task hang
>> as follows. If put_prev_task is called immediately after first
>> pick_next_task after quota was set, "immediately" meaning rq->clock in
>> both functions is the same, then the corresponding cfs_rq will be
>> throttled. Besides being unfair (the cfs_rq has not executed in fact),
>> the quota refilling timer can be idle at that time and it won't be
>> activated on put_prev_task because update_curr calls
>> account_cfs_rq_runtime, which activates the timer, only if delta_exec is
>> strictly positive. As a result we can get a task "running" inside a
>> throttled cfs_rq which will probably never be unthrottled.
>> 
>> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
>> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
>> will be throttled only if it has executed for some positive number of
>> nanoseconds.
>> --
>> Several times we had our customers encountered such hangs inside a VM
>> (seems something is wrong or rather different in time accounting there).
> 
> Yeah, looks like!
> 
> It's not ultimately _super_ shocking; I can think of a few  places where such
> gremlins could lurk if they caused enough problems for someone to really go
> digging.
> 
>> Analyzing crash dumps revealed that hung tasks were running inside
>> cfs_rq's, which had the following setup
>> 
>> cfs_rq->throttled=1
>> cfs_rq->runtime_enabled=1
>> cfs_rq->runtime_remaining=0
>> cfs_rq->tg->cfs_bandwidth.idle=1
>> cfs_rq->tg->cfs_bandwidth.timer_active=0
>> 
>> which conforms pretty nice to the explanation given above.
>> 
>> Signed-off-by: Vladimir Davydov 
>> ---
>> kernel/sched/core.c |2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 26058d0..c7a078f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
>> u64 period, u64 quota)
>> 
>>  raw_spin_lock_irq(&rq->lock);
>>  cfs_rq->runtime_enabled = runtime_enabled;
>> -cfs_rq->runtime_remaining = 0;
>> +cfs_rq->runtime_remaining = 1;
> 
> So I agree this is reasonably correct and would fix the issue identified.
> However, one concern is that it would potentially grant a tick of execution
> time on all cfs_rqs which could result in large quota violations on a many 
> core
> machine; one trick then would be to give them "expired" quota; which would be
> safe against put_prev_entity->check_cfs_runtime, e.g.

Yeah, I missed that. Thank you for the update.

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dff78a..4369231 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7687,7 +7687,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, 
> u64 period, u64 quota)
> 
>   raw_spin_lock_irq(&rq->lock);
>   cfs_rq->runtime_enabled = runtime_enabled;
> - cfs_rq->runtime_remaining = 0;
> + /*
> +  * On re-definition of bandwidth values we allocate a trivial
> +  * amount of already expired quota.  This guarantees that
> +  * put_prev_entity() cannot lead to a throttle event before we
> +  * have seen a call to account_cfs_runtime(); while not being
> +  * usable by newly waking, or set_curr_task_fair-ing, cpus
> +  * since it would be immediately expired, requiring
> +  * reassignment.
> +  */
> + cfs_rq->runtime_remaining = 1;
> + cfs_rq->runtime_expires = rq_of(cfs_rq)->clock - 1;
> 
>   if (cfs_rq->throttled)
>   unthrottle_cfs_rq(cfs_rq);
> 
> A

Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set

2013-02-08 Thread Vladimir Davydov

On Feb 8, 2013, at 7:26 PM, Vladimir Davydov  wrote:

> On Feb 8, 2013, at 6:46 PM, Paul Turner  wrote:
> 
>> On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote:
>>> If cfs_rq->runtime_remaining is <= 0 then either
>>> - cfs_rq is throttled and waiting for quota redistribution, or
>>> - cfs_rq is currently executing and will be throttled on
>>> put_prev_entity, or
>>> - cfs_rq is not throttled and has not executed since its quota was set
>>> (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).
>>> 
>>> It is obvious that the last case is rather an exception from the rule
>>> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as
>>> soon as it finishes its execution". Moreover, it can lead to a task hang
>>> as follows. If put_prev_task is called immediately after first
>>> pick_next_task after quota was set, "immediately" meaning rq->clock in
>>> both functions is the same, then the corresponding cfs_rq will be
>>> throttled. Besides being unfair (the cfs_rq has not executed in fact),
>>> the quota refilling timer can be idle at that time and it won't be
>>> activated on put_prev_task because update_curr calls
>>> account_cfs_rq_runtime, which activates the timer, only if delta_exec is
>>> strictly positive. As a result we can get a task "running" inside a
>>> throttled cfs_rq which will probably never be unthrottled.
>>> 
>>> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize
>>> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq
>>> will be throttled only if it has executed for some positive number of
>>> nanoseconds.
>>> --
>>> Several times we had our customers encountered such hangs inside a VM
>>> (seems something is wrong or rather different in time accounting there).
>> 
>> Yeah, looks like!
>> 
>> It's not ultimately _super_ shocking; I can think of a few  places where such
>> gremlins could lurk if they caused enough problems for someone to really go
>> digging.
>> 
>>> Analyzing crash dumps revealed that hung tasks were running inside
>>> cfs_rq's, which had the following setup
>>> 
>>> cfs_rq->throttled=1
>>> cfs_rq->runtime_enabled=1
>>> cfs_rq->runtime_remaining=0
>>> cfs_rq->tg->cfs_bandwidth.idle=1
>>> cfs_rq->tg->cfs_bandwidth.timer_active=0
>>> 
>>> which conforms pretty nice to the explanation given above.
>>> 
>>> Signed-off-by: Vladimir Davydov 
>>> ---
>>> kernel/sched/core.c |2 +-
>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>> 
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 26058d0..c7a078f 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group 
>>> *tg, u64 period, u64 quota)
>>> 
>>> raw_spin_lock_irq(&rq->lock);
>>> cfs_rq->runtime_enabled = runtime_enabled;
>>> -   cfs_rq->runtime_remaining = 0;
>>> +   cfs_rq->runtime_remaining = 1;
>> 
>> So I agree this is reasonably correct and would fix the issue identified.
>> However, one concern is that it would potentially grant a tick of execution
>> time on all cfs_rqs which could result in large quota violations on a many 
>> core
>> machine; one trick then would be to give them "expired" quota; which would be
>> safe against put_prev_entity->check_cfs_runtime, e.g.
> 

But wait, what "large quota violations" are you talking about? First, the 
granted
quota is rather ephemeral because it will be neglected during the next cfs bw 
period.
Second, we will actually grant NR_CPUS nanoseconds, is is only 1 ms for a 1000 
cpus
host, because the task group will sleep in the throttled state for each 
nanosecond
consumed over the granted quota, i.e. it may be a spike in the load which the 
task
group will eventually pay for. Besides, I guess quota reconfiguration is rather 
a
rare event, so it should not be a concern.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] netfilter: nf_conntrack: Batch cleanup

2013-03-14 Thread Vladimir Davydov

The patch introduces nf_conntrack_cleanup_net_list(), which cleanups
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.
--
I've measured cleanup time for 1k dummy net ns. Here are the results:

 
 # modprobe nf_conntrack
 # time modprobe -r nf_conntrack

 real   0m10.337s
 user   0m0.000s
 sys0m0.376s

 
 # modprobe nf_conntrack
 # time modprobe -r nf_conntrack

 real0m5.661s
 user0m0.000s
 sys 0m0.216s

Signed-off-by: Vladimir Davydov 
Cc: Patrick McHardy 
Cc: "David S. Miller" 
Cc: "Eric W. Biederman" 
---
 include/net/netfilter/nf_conntrack_core.h |1 +
 net/netfilter/nf_conntrack_core.c |   46 -
 net/netfilter/nf_conntrack_standalone.c   |   16 ++
 3 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h 
b/include/net/netfilter/nf_conntrack_core.h
index 930275f..fb2b623 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -27,6 +27,7 @@ extern unsigned int nf_conntrack_in(struct net *net,
 
 extern int nf_conntrack_init_net(struct net *net);
 extern void nf_conntrack_cleanup_net(struct net *net);
+extern void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list);
 
 extern int nf_conntrack_proto_pernet_init(struct net *net);
 extern void nf_conntrack_proto_pernet_fini(struct net *net);
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index c8e001a..09c02ef 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1364,30 +1364,48 @@ void nf_conntrack_cleanup_end(void)
  */
 void nf_conntrack_cleanup_net(struct net *net)
 {
+   LIST_HEAD(single);
+
+   list_add(&net->exit_list, &single);
+   nf_conntrack_cleanup_net_list(&single);
+}
+
+void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
+{
+   int busy;
+   struct net *net;
+
/*
 * This makes sure all current packets have passed through
 *  netfilter framework.  Roll on, two-stage module
 *  delete...
 */
synchronize_net();
- i_see_dead_people:
-   nf_ct_iterate_cleanup(net, kill_all, NULL);
-   nf_ct_release_dying_list(net);
-   if (atomic_read(&net->ct.count) != 0) {
+i_see_dead_people:
+   busy = 0;
+   list_for_each_entry(net, net_exit_list, exit_list) {
+   nf_ct_iterate_cleanup(net, kill_all, NULL);
+   nf_ct_release_dying_list(net);
+   if (atomic_read(&net->ct.count) != 0)
+   busy = 1;
+   }
+   if (busy) {
schedule();
goto i_see_dead_people;
}
 
-   nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
-   nf_conntrack_proto_pernet_fini(net);
-   nf_conntrack_helper_pernet_fini(net);
-   nf_conntrack_ecache_pernet_fini(net);
-   nf_conntrack_tstamp_pernet_fini(net);
-   nf_conntrack_acct_pernet_fini(net);
-   nf_conntrack_expect_pernet_fini(net);
-   kmem_cache_destroy(net->ct.nf_conntrack_cachep);
-   kfree(net->ct.slabname);
-   free_percpu(net->ct.stat);
+   list_for_each_entry(net, net_exit_list, exit_list) {
+   nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
+   nf_conntrack_proto_pernet_fini(net);
+   nf_conntrack_helper_pernet_fini(net);
+   nf_conntrack_ecache_pernet_fini(net);
+   nf_conntrack_tstamp_pernet_fini(net);
+   nf_conntrack_acct_pernet_fini(net);
+   nf_conntrack_expect_pernet_fini(net);
+   kmem_cache_destroy(net->ct.nf_conntrack_cachep);
+   kfree(net->ct.slabname);
+   free_percpu(net->ct.stat);
+   }
 }
 
 void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls)
diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
index 6bcce40..6c69fbd 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -545,16 +545,20 @@ out_init:
return ret;
 }
 
-static void nf_conntrack_pernet_exit(struct net *net)
+static void nf_conntrack_pernet_exit(struct list_head *net_exit_list)
 {
-   nf_conntrack_standalone_fini_sysctl(net);
-   nf_conntrack_standalone_fini_proc(net);
-   nf_conntrack_cleanup_net(net);
+   struct net *net;
+
+   list_for_each_entry(net, net_exit_list, exit_list) {
+   nf_conntrack_standalone_fini_sysctl(net);
+   nf_conntrack_standalone_fini_proc(net);
+   }
+   nf_conntrack_cleanup_net_list(net_exit_list);
 }
 
 static struct pernet_operations nf_conntrack_net_ops = {
-   .init = nf_conntrack_pernet_init,
-   .exit = nf_conntrack_pernet_exit,
+   .init   = nf_conntrac

[PATCH] net: batch nf_conntrack_net_exit

2012-07-30 Thread Vladimir Davydov

The patch introduces nf_conntrack_cleanup_list(), which cleanups
nf_conntracks for a list of netns and calls synchronize_net() only
once for them all.
---
 include/net/netfilter/nf_conntrack_core.h |   10 +-
 net/netfilter/nf_conntrack_core.c |   21 +
 net/netfilter/nf_conntrack_standalone.c   |   14 +-
 3 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h 
b/include/net/netfilter/nf_conntrack_core.h
index d8f5b9f..f53b855 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -26,7 +26,15 @@ extern unsigned int nf_conntrack_in(struct net *net,
struct sk_buff *skb);
 
 extern int nf_conntrack_init(struct net *net);
-extern void nf_conntrack_cleanup(struct net *net);
+extern void nf_conntrack_cleanup_list(struct list_head *net_exit_list);
+
+static inline void nf_conntrack_cleanup(struct net *net)
+{
+   LIST_HEAD(single);
+
+   list_add(&net->exit_list, &single);
+   nf_conntrack_cleanup_list(&single);
+}
 
 extern int nf_conntrack_proto_init(struct net *net);
 extern void nf_conntrack_proto_fini(struct net *net);
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index cf48755..afa62f7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1363,21 +1363,26 @@ static void nf_conntrack_cleanup_net(struct net *net)
 
 /* Mishearing the voices in his head, our hero wonders how he's
supposed to kill the mall. */
-void nf_conntrack_cleanup(struct net *net)
+void nf_conntrack_cleanup_list(struct list_head *net_exit_list)
 {
-   if (net_eq(net, &init_net))
-   RCU_INIT_POINTER(ip_ct_attach, NULL);
+   struct net *net;
+
+   list_for_each_entry(net, net_exit_list, exit_list)
+   if (net_eq(net, &init_net))
+   RCU_INIT_POINTER(ip_ct_attach, NULL);
 
/* This makes sure all current packets have passed through
   netfilter framework.  Roll on, two-stage module
   delete... */
synchronize_net();
-   nf_conntrack_proto_fini(net);
-   nf_conntrack_cleanup_net(net);
 
-   if (net_eq(net, &init_net)) {
-   RCU_INIT_POINTER(nf_ct_destroy, NULL);
-   nf_conntrack_cleanup_init_net();
+   list_for_each_entry(net, net_exit_list, exit_list) {
+   nf_conntrack_proto_fini(net);
+   nf_conntrack_cleanup_net(net);
+   if (net_eq(net, &init_net)) {
+   RCU_INIT_POINTER(nf_ct_destroy, NULL);
+   nf_conntrack_cleanup_init_net();
+   }
}
 }
 
diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
index 9b39432..8f60708 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -551,16 +551,20 @@ out_init:
return ret;
 }
 
-static void nf_conntrack_net_exit(struct net *net)
+static void nf_conntrack_net_exit(struct list_head *net_exit_list)
 {
-   nf_conntrack_standalone_fini_sysctl(net);
-   nf_conntrack_standalone_fini_proc(net);
-   nf_conntrack_cleanup(net);
+   struct net *net;
+
+   list_for_each_entry(net, net_exit_list, exit_list) {
+   nf_conntrack_standalone_fini_sysctl(net);
+   nf_conntrack_standalone_fini_proc(net);
+   }
+   nf_conntrack_cleanup_list(net_exit_list);
 }
 
 static struct pernet_operations nf_conntrack_net_ops = {
.init = nf_conntrack_net_init,
-   .exit = nf_conntrack_net_exit,
+   .exit_batch = nf_conntrack_net_exit,
 };
 
 static int __init nf_conntrack_standalone_init(void)
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC] sched: move h_load calculation to task_h_load

2013-07-13 Thread Vladimir Davydov

The bad thing about update_h_load(), which computes hierarchical load
factor for task groups, is that it is called for each task group in the
system before every load balancer run, and since rebalance can be
triggered very often, this function can eat really a lot of cpu time if
there are many cpu cgroups in the system.

Although the situation was improved significantly by commit a35b646
('sched, cgroup: Reduce rq->lock hold times for large cgroup
hierarchies'), the problem still can arise under some kinds of loads,
e.g. when cpus are switching from idle to busy and back very frequently.

For instance, when I start 1000 of processes that wake up every
millisecond on my 8 cpus host, 'top' and 'perf top' show:

Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 243K cycles
  7.57%  [kernel]   [k] __schedule
  7.08%  [kernel]   [k] timerqueue_add
  6.13%  libc-2.12.so   [.] usleep

Then if I create 1 *idle* cpu cgroups (no processes in them), cpu
usage increases significantly although the 'wakers' are still executing
in the root cpu cgroup:

Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
Events: 230K cycles
 24.56%  [kernel][k] tg_load_down
  5.76%  [kernel][k] __schedule

This happens because this particular kind of load triggers 'new idle'
rebalance very frequently, which requires calling update_h_load(),
which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
though it is absolutely useless, because idle cpu cgroups have no tasks
to pull.

This patch tries to improve the situation by making h_load calculation
proceed only when h_load is really necessary. To achieve this, it
substitutes update_h_load() with update_cfs_rq_h_load(), which computes
h_load only for a given cfs_rq and all its ascendants, and makes the
load balancer call this function whenever it considers if a task should
be pulled, i.e. it moves h_load calculations directly to task_h_load().
For h_load of the same cfs_rq not to be updated multiple times (in case
several tasks in the same cgroup are considered during the same balance
run), the patch keeps the time of the last h_load update for each cfs_rq
and breaks calculation when it finds h_load to be uptodate.

The benefit of it is that h_load is computed only for those cfs_rq's,
which really need it, in particular all idle task groups are skipped.
Although this, in fact, moves h_load calculation under rq lock, it
should not affect latency much, because the amount of work done under rq
lock while trying to pull tasks is limited by sched_nr_migrate.

After the patch applied with the setup described above (1000 wakers in
the root cgroup and 1 idle cgroups), I get:

Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 242K cycles
  7.57%  [kernel]  [k] __schedule
  6.70%  [kernel]  [k] timerqueue_add
  5.93%  libc-2.12.so  [.] usleep

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c  |   56 ++
 kernel/sched/sched.h |7 +++
 2 files changed, 28 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f77f9c5..de90690 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4171,47 +4171,46 @@ static void update_blocked_averages(int cpu)
 }
 
 /*
- * Compute the cpu's hierarchical load factor for each task group.
+ * Compute the hierarchical load factor for cfs_rq and all its ascendants.
  * This needs to be done in a top-down fashion because the load of a child
  * group is a fraction of its parents load.
  */
-static int tg_load_down(struct task_group *tg, void *data)
+static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 {
+   struct rq *rq = rq_of(cfs_rq);
+   struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
unsigned long load;
-   long cpu = (long)data;
 
-   if (!tg->parent) {
-   load = cpu_rq(cpu)->avg.load_avg_contrib;
-   } else {
-   load = tg->parent->cfs_rq[cpu]->h_load;
-   load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib,
-   tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
+   cfs_rq->h_load_next = NULL;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   cfs_rq->h_load_next = se;
+   if (cfs_rq->last_h_load_update == rq->clock)
+   break;
}
 
-   tg->cfs_rq[cpu]->h_load = load;
-
-   return 0;
-}
-
-static void update_h_load(long cpu)
-{
-   struct rq *rq = cpu_rq(cpu);
-   unsigned long now = jiffies;
-
-   if (rq->h_load_throttle == now)
-   return;
-
-   rq->h_load_throttle = now;
+   if (!se) {
+   cfs_rq-&g

[PATCH] sched: Fix task_h_load calculation

2013-09-14 Thread Vladimir Davydov

Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b3fe1c..13abc29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4242,7 +4242,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}
 
if (!se) {
-   cfs_rq->h_load = rq->avg.load_avg_contrib;
+   cfs_rq->h_load = cfs_rq->runnable_load_avg;
cfs_rq->last_h_load_update = now;
}
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] sched: fix_small_imbalance: Fix local->avg_load > busiest->avg_load case

2013-09-15 Thread Vladimir Davydov

In busiest->group_imb case we can come to fix_small_imbalance() with
local->avg_load > busiest->avg_load. This can result in wrong imbalance
fix-up, because there is the following check there where all the
members are unsigned:

if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
(scaled_busy_load_per_task * imbn)) {
env->imbalance = busiest->load_per_task;
return;
}

As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.

Fix it by substituting the subtraction with an equivalent addition in
the check.
--
The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 507a8a9..bdaf1fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4823,8 +4823,8 @@ void fix_small_imbalance(struct lb_env *env, struct 
sd_lb_stats *sds)
(busiest->load_per_task * SCHED_POWER_SCALE) /
busiest->group_power;
 
-   if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
-   (scaled_busy_load_per_task * imbn)) {
+   if (busiest->avg_load + scaled_busy_load_per_task >=
+   local->avg_load + (scaled_busy_load_per_task * imbn)) {
env->imbalance = busiest->load_per_task;
return;
}
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] sched: calculate_imbalance: Fix local->avg_load > sds->avg_load case

2013-09-15 Thread Vladimir Davydov

In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows

env->imbalance = min(
max_pull * busiest->group_power,
(sds->avg_load - local->avg_load) * local->group_power
) / SCHED_POWER_SCALE;

As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.

Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.
--
The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b3fe1c..507a8a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4896,7 +4896,8 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
 * max load less than avg load(as we skip the groups at or below
 * its cpu_power, while calculating max_load..)
 */
-   if (busiest->avg_load < sds->avg_load) {
+   if (busiest->avg_load <= sds->avg_load ||
+   local->avg_load >= sds->avg_load) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] sched: load_balance: Prevent reselect prev dst_cpu if some pinned

2013-09-15 Thread Vladimir Davydov

Currently new_dst_cpu is prevented from being reselected actually, not
dst_cpu. This can result in attempting to pull tasks to this_cpu twice.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b3fe1c..cd59640 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5269,15 +5269,15 @@ more_balance:
 */
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
 
+   /* Prevent to re-select dst_cpu via env's cpus */
+   cpumask_clear_cpu(env.dst_cpu, env.cpus);
+
env.dst_rq   = cpu_rq(env.new_dst_cpu);
env.dst_cpu  = env.new_dst_cpu;
env.flags   &= ~LBF_SOME_PINNED;
env.loop = 0;
env.loop_break   = sched_nr_migrate_break;
 
-   /* Prevent to re-select dst_cpu via env's cpus */
-   cpumask_clear_cpu(env.dst_cpu, env.cpus);
-
/*
 * Go back to "more_balance" rather than "redo" since we
 * need to continue with same src_cpu.
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] sched: load_balance: Reset env when going to redo due to all pinned

2013-09-15 Thread Vladimir Davydov

Firstly, reset env.dst_cpu/dst_rq to this_cpu/this_rq, because it could
have changed in 'some pinned' case. Otherwise, should_we_balance() can
stop balancing beforehand.

Secondly, reset env.flags, because it can have LBF_SOME_PINNED set.

Thirdly, reset env.dst_grpmask cpus in env.cpus to allow handling 'some
pinned' case when pulling tasks from a new busiest cpu.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd59640..d840e51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5289,8 +5289,16 @@ more_balance:
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
if (!cpumask_empty(cpus)) {
-   env.loop = 0;
-   env.loop_break = sched_nr_migrate_break;
+   env.dst_cpu = this_cpu;
+   env.dst_rq  = this_rq;
+   env.flags   = 0;
+   env.loop= 0;
+   env.loop_break  = sched_nr_migrate_break;
+
+   /* Reset cpus cleared in LBF_SOME_PINNED case */
+   if (env.dst_grpmask)
+   cpumask_or(cpus, cpus, env.dst_grpmask);
+
goto redo;
}
goto out_balanced;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] sched: calculate_imbalance: Fix local->avg_load > sds->avg_load case

2013-09-16 Thread Vladimir Davydov


On 09/16/2013 09:52 AM, Peter Zijlstra wrote:

On Sun, Sep 15, 2013 at 05:49:13PM +0400, Vladimir Davydov wrote:

In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows

env->imbalance = min(
max_pull * busiest->group_power,
(sds->avg_load - local->avg_load) * local->group_power
) / SCHED_POWER_SCALE;

As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.

Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.
--
The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth.

Signed-off-by: Vladimir Davydov
---
  kernel/sched/fair.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b3fe1c..507a8a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4896,7 +4896,8 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
 * max load less than avg load(as we skip the groups at or below
 * its cpu_power, while calculating max_load..)
 */
-   if (busiest->avg_load < sds->avg_load) {
+   if (busiest->avg_load <= sds->avg_load ||
+   local->avg_load >= sds->avg_load) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}

Why the = part? Surely 'busiest->avg_load < sds->avg_load ||
local->avg_load > sds->avg_load' avoids both underflows?


Of course it does, but env->imbalance will be assigned to 0 anyway in = 
case, so why not go shortcut?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] sched: load_balance: Reset env when going to redo due to all pinned

2013-09-16 Thread Vladimir Davydov


On 09/16/2013 09:43 AM, Peter Zijlstra wrote:

On Sun, Sep 15, 2013 at 09:30:14PM +0400, Vladimir Davydov wrote:

Firstly, reset env.dst_cpu/dst_rq to this_cpu/this_rq, because it could
have changed in 'some pinned' case. Otherwise, should_we_balance() can
stop balancing beforehand.

Secondly, reset env.flags, because it can have LBF_SOME_PINNED set.

Thirdly, reset env.dst_grpmask cpus in env.cpus to allow handling 'some
pinned' case when pulling tasks from a new busiest cpu.

Did you actually run into any problems because of this?


IRL no, and now I see that catching 'all pinned' after 'some pinned' 
case can only happen if a task changes its affinity mask or gets 
throttled during load balance run, which is very unlikely. So this patch 
is rather for sanity and can be safely dropped.



Signed-off-by: Vladimir Davydov
---
  kernel/sched/fair.c |   12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd59640..d840e51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5289,8 +5289,16 @@ more_balance:
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
if (!cpumask_empty(cpus)) {
-   env.loop = 0;
-   env.loop_break = sched_nr_migrate_break;
+   env.dst_cpu = this_cpu;
+   env.dst_rq  = this_rq;
+   env.flags   = 0;
+   env.loop= 0;
+   env.loop_break  = sched_nr_migrate_break;
+
+   /* Reset cpus cleared in LBF_SOME_PINNED case */
+   if (env.dst_grpmask)
+   cpumask_or(cpus, cpus, env.dst_grpmask);
+
goto redo;
}
goto out_balanced;

So the problem I have with this is that it removes the bound on the
number of iterations we do. Currently we're limited by the bits in cpus,
but by resetting those we can do on and on and on...


find_busiest_group() never selects the local group, doesn't it? So none 
of env.dst_grpmask, which is initialized to 
sched_group_cpus(this_rq->sd), can be selected for the source cpu. That 
said, resetting env.dst_grpmask bits in the cpus bitmask actually 
doesn't affect the number of balance iterations.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/19] pramfs

2013-09-08 Thread Vladimir Davydov


On 09/07/2013 08:22 PM, Marco Stornelli wrote:

Il 07/09/2013 16:58, richard -rw- weinberger ha scritto:

On Sat, Sep 7, 2013 at 10:14 AM, Marco Stornelli
 wrote:

Hi all,

this is an attempt to include pramfs in mainline. At the moment pramfs
has been included in LTSI kernel. Since last review the code is more
or less the same but, with a really big thanks to Vladimir Davydov and
Parallels, the development of fsck has been started and we have now
the possibility to correct fs errors due to corruption. It's a "young"
tool but we are working on it. You can clone the code from our repos:

git clone git://git.code.sf.net/p/pramfs/code pramfs-code
git clone git://git.code.sf.net/p/pramfs/Tools pramfs-Tools


I'm a bit confused, what kind of non-volatile RAM is your fs targeting?
Wouldn't it make sense to use pstore like
arch/powerpc/platforms/pseries/nvram.c does?



Usually battery-backed SRAM, but actually it can be used in any piece 
of ram directly accessible and it provides a normal and complete fs 
interface. Usually I do the fs test remapping my system ram. You can 
find documentation here:


http://pramfs.sourceforge.net


I'd like to add that in contrast to pstore, pramfs allows storing any 
files in it, not only system logs. This can be of value even on machines 
w/o special devices like sram/nvram: one can store data that should be 
quickly restored after reboot in conventional ram and use kexec to boot 
to a new kernel. One of the use cases of this could be checkpointing 
time-critical services to ram (using criu.org) to be quickly restored 
after a kernel update providing almost zero-downtime.




Marco


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [tip:sched/core] sched/balancing: Fix cfs_rq-> task_h_load calculation

2013-09-30 Thread Vladimir Davydov


On 09/29/2013 01:47 PM, Yuanhan Liu wrote:

On Fri, Sep 20, 2013 at 06:46:59AM -0700, tip-bot for Vladimir Davydov wrote:

Commit-ID:  7e3115ef5149fc502e3a2e80719dba54a8e7409d
Gitweb:http://git.kernel.org/tip/7e3115ef5149fc502e3a2e80719dba54a8e7409d
Author: Vladimir Davydov
AuthorDate: Sat, 14 Sep 2013 19:39:46 +0400
Committer:  Ingo Molnar
CommitDate: Fri, 20 Sep 2013 11:59:39 +0200

sched/balancing: Fix cfs_rq->task_h_load calculation

Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.

Hi Vladimir,

FYI, Here we found a 17% netperf regression by this patch. Here are some
changed stats between this commit 7e3115ef5149fc502e3a2e80719dba54a8e7409d
and it's parent(3029ede39373c368f402a76896600d85a4f7121b)


Hello,

Could you please report the following info:

1) the test machine cpu topology (i.e. output of 
/sys/devices/system/cpu/cpu*/{thread_siblings_list,core_siblings_list})

2) kernel config you used during the test
3) the output of /sys/kernel/debug/sched_features (debugfs mounted).
4) netperf server/client options
5) did you place netserver into a separate cpu cgroup?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC] sched: boost throttled entities on wakeups

2012-10-18 Thread Vladimir Davydov

If several tasks in different cpu cgroups are contending for the same resource
(e.g. a semaphore) and one of those task groups is cpu limited (using cfs
bandwidth control), the priority inversion problem is likely to arise: if a cpu
limited task goes to sleep holding the resource (e.g. trying to take another
semaphore), it can be throttled (i.e.  removed from the runqueue), which will
result in other, perhaps high-priority, tasks waiting until the low-priority
task continues its execution.

The patch tries to solve this problem by boosting tasks in throttled groups on
wakeups, i.e.  temporarily unthrottling the groups a woken task belongs to in
order to let the task finish its execution in kernel space.  This obviously
should eliminate the priority inversion problem on voluntary preemptable
kernels.  However, it does not solve the problem for fully preemptable kernels,
although I guess the patch can be extended to handle those kernels too (e.g. by
boosting forcibly preempted tasks thus not allowing to throttle).

I wrote a simple test that demonstrates the problem (the test is attached). It
creates two cgroups each of which is bound to exactly one cpu using cpusets,
sets the limit of the first group to 10% and leaves the second group unlimited.
Then in both groups it starts processes reading the same (big enough) file
along with a couple of busyloops in the limited groups, and measures the read
time.

I've run the test 10 times for a 1 Gb file on a server with > 10 Gb of RAM and
4 cores x 2 hyperthreads (the kernel was with CONFIG_PREEMPT_VOLUNTARY=y). Here
are the results:

without the patch   40.03 +- 7.04 s
with the patch   8.42 +- 0.48 s

(Since the server's RAM can accommodate the whole file, the read time was the
same for both groups)

I would appreciate if you could answer the following questions regarding the
priority inversion problem and the proposed approach:

1) Do you agree that the problem exists and should be sorted out?

2) If so, does the general approach proposed (unthrottling on wakeups) suits
you? Why or why not?

3) If you think that the approach proposed is sane, what you dislike about the
patch?

Thank you!

---
 include/linux/sched.h   |8 ++
 kernel/sched/core.c |8 ++
 kernel/sched/fair.c |  182 ++-
 kernel/sched/features.h |2 +
 kernel/sched/sched.h|6 ++
 5 files changed, 204 insertions(+), 2 deletions(-)



sched-boost-throttled-entities-on-wakeups.patch
Description: sched-boost-throttled-entities-on-wakeups.patch


ioprio_inv_test.sh
Description: ioprio_inv_test.sh

Re: [Devel] [PATCH RFC] sched: boost throttled entities on wakeups

2012-10-18 Thread Vladimir Davydov

There is an error in the test script: I forgot to initialize cpuset.mems of 
test cgroups - without it it is impossible to add a task into a cpuset cgroup.

Sorry for that.

Fixed version of the test script is attached.

On Oct 18, 2012, at 11:32 AM, Vladimir Davydov wrote:

> If several tasks in different cpu cgroups are contending for the same resource
> (e.g. a semaphore) and one of those task groups is cpu limited (using cfs
> bandwidth control), the priority inversion problem is likely to arise: if a 
> cpu
> limited task goes to sleep holding the resource (e.g. trying to take another
> semaphore), it can be throttled (i.e.  removed from the runqueue), which will
> result in other, perhaps high-priority, tasks waiting until the low-priority
> task continues its execution.
> 
> The patch tries to solve this problem by boosting tasks in throttled groups on
> wakeups, i.e.  temporarily unthrottling the groups a woken task belongs to in
> order to let the task finish its execution in kernel space.  This obviously
> should eliminate the priority inversion problem on voluntary preemptable
> kernels.  However, it does not solve the problem for fully preemptable 
> kernels,
> although I guess the patch can be extended to handle those kernels too (e.g. 
> by
> boosting forcibly preempted tasks thus not allowing to throttle).
> 
> I wrote a simple test that demonstrates the problem (the test is attached). It
> creates two cgroups each of which is bound to exactly one cpu using cpusets,
> sets the limit of the first group to 10% and leaves the second group 
> unlimited.
> Then in both groups it starts processes reading the same (big enough) file
> along with a couple of busyloops in the limited groups, and measures the read
> time.
> 
> I've run the test 10 times for a 1 Gb file on a server with > 10 Gb of RAM and
> 4 cores x 2 hyperthreads (the kernel was with CONFIG_PREEMPT_VOLUNTARY=y). 
> Here
> are the results:
> 
> without the patch 40.03 +- 7.04 s
> with the patch 8.42 +- 0.48 s
> 
> (Since the server's RAM can accommodate the whole file, the read time was the
> same for both groups)
> 
> I would appreciate if you could answer the following questions regarding the
> priority inversion problem and the proposed approach:
> 
> 1) Do you agree that the problem exists and should be sorted out?
> 
> 2) If so, does the general approach proposed (unthrottling on wakeups) suits
> you? Why or why not?
> 
> 3) If you think that the approach proposed is sane, what you dislike about the
> patch?
> 
> Thank you!
> 
> ---
> include/linux/sched.h   |8 ++
> kernel/sched/core.c |8 ++
> kernel/sched/fair.c |  182 ++-
> kernel/sched/features.h |2 +
> kernel/sched/sched.h|6 ++
> 5 files changed, 204 insertions(+), 2 deletions(-)
> 
> 



ioprio_inv_test.sh
Description: ioprio_inv_test.sh

Re: [PATCH RFC] sched: boost throttled entities on wakeups

2012-10-19 Thread Vladimir Davydov

Thank you for the answer.

On Oct 19, 2012, at 6:24 PM, Peter Zijlstra wrote:

> its a quick hack similar to existing hacks done for rt, preferably we'd
> do smarter things though.

If you have any ideas how to fix this in a better way, please share.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 00/13] PRAM: Persistent over-kexec memory storage

2013-07-01 Thread Vladimir Davydov

s PRAM
   super block pfn, which is exported via /sys/kernel/pram, in the 'pram' boot
   param.

 * Since some memory is required for completing boot sequence, PRAM tracks all
   memory regions that have ever been reserved by other parts of the kernel and
   avoids using them for persistent memory. Since the device configuration
   cannot change during kexec, and the newly booted kernel is likely to have
   the same set of device drivers, it should work in most cases.

 * Since kexec may load the new kernel code to any memory region, it can
   destroy persistent memory. To exclude this, kexec should be forced to load
   the new kernel code to a memory region that is banned for PRAM. For that
   purpose, there is the 'pram_banned' boot param and --mem-min and --mem-max
   otpions of the kexec utility.

 * If a conflict still happens, it will be identified and all persistent memory
   will be discarded to prevent further errors. It is guaranteed by
   checksumming all data saved to PRAM.

 * tmpfs is saved to PRAM on unmount and loaded on mount if 'pram=NAME' mount
   option is passed. NAME specifies the PRAM node to save data to. This is to
   allow saving several tmpfs trees.

 * Saving tmpfs to PRAM is not well elaborated at present and serves rather as
   a proof of concept. Namely, only regular files without multiple hard links
   are supported and tmpfs may not be swapped out. If these requirements are
   not met, save to PRAM will be aborted spewing a message to the kernel log.
   This is not very difficult to fix, but at present one should turn off swap
   to test the feature.

 -- Future plans --

What we'd like to do:

 * Implement swap entries 'freezing' to allow saving a swapped out tmpfs.
 * Implement full support of tmpfs including saving dirs, special files, etc.
 * Implement SPLICE_F_MOVE, SPLICE_F_GIFT flags for splicing data from/to
   shmem. This would allow avoiding memory copying on checkpoint/restore.
 * Save uptodate fs cache on umount to be restored on mount after kexec.

Thanks,

Vladimir Davydov (13):
  mm: add PRAM API stubs and Kconfig
  mm: PRAM: implement node load and save functions
  mm: PRAM: implement page stream operations
  mm: PRAM: implement byte stream operations
  mm: PRAM: link nodes by pfn before reboot
  mm: PRAM: introduce super block
  mm: PRAM: preserve persistent memory at boot
  mm: PRAM: checksum saved data
  mm: PRAM: ban pages that have been reserved at boot time
  mm: PRAM: allow to ban arbitrary memory ranges
  mm: PRAM: allow to free persistent memory from userspace
  mm: shmem: introduce shmem_insert_page
  mm: shmem: enable saving to PRAM

 arch/x86/kernel/setup.c  |2 +
 arch/x86/mm/init_32.c|5 +
 arch/x86/mm/init_64.c|5 +
 include/linux/pram.h |   62 +++
 include/linux/shmem_fs.h |   29 ++
 mm/Kconfig   |   14 +
 mm/Makefile  |1 +
 mm/bootmem.c |4 +
 mm/memblock.c|7 +-
 mm/pram.c| 1279 ++
 mm/shmem.c   |   97 +++-
 mm/shmem_pram.c  |  378 ++
 12 files changed, 1878 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/pram.h
 create mode 100644 mm/pram.c
 create mode 100644 mm/shmem_pram.c

-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 13/13] mm: shmem: enable saving to PRAM

2013-07-01 Thread Vladimir Davydov

This patch illustrates how PRAM API can be used for making tmpfs
'persistent'. It adds 'pram=' option to tmpfs, which specifies the PRAM
node to load/save FS tree from/to.

If the option is passed on mount, shmem will look for the corresponding
PRAM node and load the FS tree from it. On the subsequent unmount, it
will save FS tree to that PRAM node.

A typical usage scenario looks like:

 # mount -t tmpfs -o pram=mytmpfs none /mnt
 # echo something > /mnt/smth
 # umount /mnt
 
 # mount -t tmpfs -o pram=mytmpfs none /mnt
 # cat /mnt/smth

Each FS tree is saved into two PRAM nodes, one acting as a byte stream
and the other acting as a page stream. The byte stream is used for
saving files metadata (name, permissions, etc) and data page offsets
while the page stream accommodates file content pages.

Current implementation serves for demonstration purposes and so is quite
simplified: it supports only regular files in the root directory without
multiple hard links, and it does not save swapped out files aborting if
any. However, it can be elaborated to fully support tmpfs.
---
 include/linux/shmem_fs.h |   26 
 mm/Makefile  |2 +-
 mm/shmem.c   |   29 +++-
 mm/shmem_pram.c  |  378 ++
 4 files changed, 430 insertions(+), 5 deletions(-)
 create mode 100644 mm/shmem_pram.c

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index da63308..0408421 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -23,6 +23,11 @@ struct shmem_inode_info {
struct inodevfs_inode;
 };
 
+#define SHMEM_PRAM_NAME_MAX128
+struct shmem_pram_info {
+   char name[SHMEM_PRAM_NAME_MAX];
+};
+
 struct shmem_sb_info {
unsigned long max_blocks;   /* How many blocks are allowed */
struct percpu_counter used_blocks;  /* How many are allocated */
@@ -33,6 +38,7 @@ struct shmem_sb_info {
kgid_t gid; /* Mount gid for root directory */
umode_t mode;   /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
+   struct shmem_pram_info *pram;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
@@ -62,7 +68,27 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
 }
 
+struct pagevec;
+
 extern int shmem_insert_page(struct inode *inode,
pgoff_t index, struct page *page, bool on_lru);
+extern unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
+   pgoff_t start, unsigned int nr_pages,
+   struct page **pages, pgoff_t *indices);
+extern void shmem_deswap_pagevec(struct pagevec *pvec);
+
+#ifdef CONFIG_PRAM
+extern int shmem_parse_pram(const char *str, struct shmem_pram_info **pram);
+extern void shmem_show_pram(struct seq_file *seq, struct shmem_pram_info 
*pram);
+extern void shmem_save_pram(struct super_block *sb);
+extern void shmem_load_pram(struct super_block *sb);
+#else
+static inline int shmem_parse_pram(const char *str,
+   struct shmem_pram_info **pram) { return 1; }
+static inline void shmem_show_pram(struct seq_file *seq,
+   struct shmem_pram_info *pram) { }
+static inline void shmem_save_pram(struct super_block *sb) { }
+static inline void shmem_load_pram(struct super_block *sb) { }
+#endif
 
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 33ad952..6a8c61d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,4 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
-obj-$(CONFIG_PRAM) += pram.o
+obj-$(CONFIG_PRAM) += pram.o shmem_pram.o
diff --git a/mm/shmem.c b/mm/shmem.c
index 71fac31..2d6b618 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -399,7 +399,7 @@ out:
 /*
  * Like find_get_pages, but collecting swap entries as well as pages.
  */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
+unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
pgoff_t start, unsigned int nr_pages,
struct page **pages, pgoff_t *indices)
 {
@@ -465,7 +465,7 @@ static int shmem_free_swap(struct address_space *mapping,
 /*
  * Pagevec may contain swap entries, so shuffle up pages before releasing.
  */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
+void shmem_deswap_pagevec(struct pagevec *pvec)
 {
int i, j;
 
@@ -2535,6 +2535,10 @@ static int shmem_parse_options(char *options, struct 
shmem_sb_info *sbinfo,
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+   } else if (!strcmp(this_cha

[PATCH RFC 12/13] mm: shmem: introduce shmem_insert_page

2013-07-01 Thread Vladimir Davydov

The function inserts a memory page to a shmem file under an arbitrary
offset. If there is something at the specified offset (page or swap),
the function fails.

The function will be sued by the next patch.
---
 include/linux/shmem_fs.h |3 ++
 mm/shmem.c   |   68 ++
 2 files changed, 71 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc..da63308 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -62,4 +62,7 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
 }
 
+extern int shmem_insert_page(struct inode *inode,
+   pgoff_t index, struct page *page, bool on_lru);
+
 #endif
diff --git a/mm/shmem.c b/mm/shmem.c
index 1c44af7..71fac31 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -328,6 +328,74 @@ static void shmem_delete_from_page_cache(struct page 
*page, void *radswap)
BUG_ON(error);
 }
 
+int shmem_insert_page(struct inode *inode,
+   pgoff_t index, struct page *page, bool on_lru)
+{
+   struct address_space *mapping = inode->i_mapping;
+   struct shmem_inode_info *info = SHMEM_I(inode);
+   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+   gfp_t gfp = mapping_gfp_mask(mapping);
+   int err;
+
+   if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
+   return -EFBIG;
+
+   err = -ENOSPC;
+   if (shmem_acct_block(info->flags))
+   goto out;
+   if (sbinfo->max_blocks) {
+   if (percpu_counter_compare(&sbinfo->used_blocks,
+  sbinfo->max_blocks) >= 0)
+   goto out_unacct;
+   percpu_counter_inc(&sbinfo->used_blocks);
+   }
+
+   if (!on_lru) {
+   SetPageSwapBacked(page);
+   __set_page_locked(page);
+   } else
+   lock_page(page);
+
+   err = mem_cgroup_cache_charge(page, current->mm,
+ gfp & GFP_RECLAIM_MASK);
+   if (err)
+   goto out_unlock;
+   err = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+   if (!err) {
+   err = shmem_add_to_page_cache(page, mapping, index, gfp, NULL);
+   radix_tree_preload_end();
+   }
+   if (err)
+   goto out_uncharge;
+
+   if (!on_lru)
+   lru_cache_add_anon(page);
+
+   spin_lock(&info->lock);
+   info->alloced++;
+   inode->i_blocks += BLOCKS_PER_PAGE;
+   shmem_recalc_inode(inode);
+   spin_unlock(&info->lock);
+
+   flush_dcache_page(page);
+   SetPageUptodate(page);
+   set_page_dirty(page);
+
+   unlock_page(page);
+   return 0;
+
+out_uncharge:
+   mem_cgroup_uncharge_cache_page(page);
+out_unlock:
+   unlock_page(page);
+   if (sbinfo->max_blocks)
+   percpu_counter_add(&sbinfo->used_blocks, -1);
+out_unacct:
+   shmem_unacct_blocks(info->flags, 1);
+out:
+   return err;
+}
+
 /*
  * Like find_get_pages, but collecting swap entries as well as pages.
  */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 11/13] mm: PRAM: allow to free persistent memory from userspace

2013-07-01 Thread Vladimir Davydov

To free all space utilized for persistent memory, one can write 0 to
/sys/kernel/pram. This will destroy all PRAM nodes that are not
currently being read or written.
---
 mm/pram.c |   39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/pram.c b/mm/pram.c
index 3ad769b..43ad85f 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -697,6 +697,32 @@ static void pram_truncate_node(struct pram_node *node)
 
 }
 
+/*
+ * Free all nodes that are not under operation.
+ */
+static void pram_truncate(void)
+{
+   struct page *page, *tmp;
+   struct pram_node *node;
+   LIST_HEAD(dispose);
+
+   mutex_lock(&pram_mutex);
+   list_for_each_entry_safe(page, tmp, &pram_nodes, lru) {
+   node = page_address(page);
+   if (!(node->flags & PRAM_ACCMODE_MASK))
+   list_move(&page->lru, &dispose);
+   }
+   mutex_unlock(&pram_mutex);
+
+   while (!list_empty(&dispose)) {
+   page = list_first_entry(&dispose, struct page, lru);
+   list_del(&page->lru);
+   node = page_address(page);
+   pram_truncate_node(node);
+   pram_free_page(node);
+   }
+}
+
 static void pram_stream_init(struct pram_stream *ps,
 struct pram_node *node, gfp_t gfp_mask)
 {
@@ -1189,8 +1215,19 @@ static ssize_t show_pram_sb_pfn(struct kobject *kobj,
return sprintf(buf, "%lx\n", pfn);
 }
 
+static ssize_t store_pram_sb_pfn(struct kobject *kobj,
+   struct kobj_attribute *attr, const char *buf, size_t count)
+{
+   int val;
+
+   if (kstrtoint(buf, 0, &val) || val)
+   return -EINVAL;
+   pram_truncate();
+   return count;
+}
+
 static struct kobj_attribute pram_sb_pfn_attr =
-   __ATTR(pram, 0444, show_pram_sb_pfn, NULL);
+   __ATTR(pram, 0644, show_pram_sb_pfn, store_pram_sb_pfn);
 
 static struct attribute *pram_attrs[] = {
&pram_sb_pfn_attr.attr,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 07/13] mm: PRAM: preserve persistent memory at boot

2013-07-01 Thread Vladimir Davydov

Persistent memory preservation is done by reserving memory pages
belonging to PRAM at early boot so that they will not be recycled. If
memory reservation fails for some reason (e.g. memory region is busy),
persistent memory will be lost.

Currently, PRAM preservation is only implemented for x86.
---
 arch/x86/kernel/setup.c |2 +
 arch/x86/mm/init_32.c   |4 +
 arch/x86/mm/init_64.c   |4 +
 include/linux/pram.h|8 ++
 mm/Kconfig  |1 +
 mm/pram.c   |  203 +++
 6 files changed, 222 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fae9134..caf1b29 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1127,6 +1128,7 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
 #endif
 
+   pram_reserve();
reserve_crashkernel();
 
vsmp_init();
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 2d19001..da38426 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -779,6 +780,9 @@ void __init mem_init(void)
 
after_bootmem = 1;
 
+   totalram_pages += pram_reserved_pages;
+   reservedpages -= pram_reserved_pages;
+
codesize =  (unsigned long) &_etext - (unsigned long) &_text;
datasize =  (unsigned long) &_edata - (unsigned long) &_etext;
initsize =  (unsigned long) &__init_end - (unsigned long) &__init_begin;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 474e28f..8aa4bc4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1077,6 +1078,9 @@ void __init mem_init(void)
reservedpages = max_pfn - totalram_pages - absent_pages;
after_bootmem = 1;
 
+   totalram_pages += pram_reserved_pages;
+   reservedpages -= pram_reserved_pages;
+
codesize =  (unsigned long) &_etext - (unsigned long) &_text;
datasize =  (unsigned long) &_edata - (unsigned long) &_etext;
initsize =  (unsigned long) &__init_end - (unsigned long) &__init_begin;
diff --git a/include/linux/pram.h b/include/linux/pram.h
index 61c536c..b7f2799 100644
--- a/include/linux/pram.h
+++ b/include/linux/pram.h
@@ -47,4 +47,12 @@ extern ssize_t pram_write(struct pram_stream *ps,
  const void *buf, size_t count);
 extern size_t pram_read(struct pram_stream *ps, void *buf, size_t count);
 
+#ifdef CONFIG_PRAM
+extern unsigned long pram_reserved_pages;
+extern void pram_reserve(void);
+#else
+#define pram_reserved_pages 0UL
+static inline void pram_reserve(void) { }
+#endif
+
 #endif /* _LINUX_PRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 46337e8..f1e11a0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -474,6 +474,7 @@ config FRONTSWAP
 
 config PRAM
bool "Persistent over-kexec memory storage"
+   depends on X86
default n
help
  This option adds the kernel API that enables saving memory pages of
diff --git a/mm/pram.c b/mm/pram.c
index 58ae9ed..380735f 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -5,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -93,6 +95,8 @@ static struct pram_super_block *pram_sb;
 static LIST_HEAD(pram_nodes);  /* linked through page::lru */
 static DEFINE_MUTEX(pram_mutex);   /* serializes open/close */
 
+unsigned long __initdata pram_reserved_pages;
+
 /*
  * The PRAM super block pfn, see above.
  */
@@ -102,6 +106,196 @@ static int __init parse_pram_sb_pfn(char *arg)
 }
 early_param("pram", parse_pram_sb_pfn);
 
+static void * __init pram_map_meta(unsigned long pfn)
+{
+   if (pfn >= max_low_pfn)
+   return ERR_PTR(-EINVAL);
+   return pfn_to_kaddr(pfn);
+}
+
+static int __init pram_reserve_page(unsigned long pfn)
+{
+   int err = 0;
+   phys_addr_t base, size;
+
+   if (pfn >= max_pfn)
+   return -EINVAL;
+
+   base = PFN_PHYS(pfn);
+   size = PAGE_SIZE;
+
+#ifdef CONFIG_NO_BOOTMEM
+   if (memblock_is_region_reserved(base, size) ||
+   memblock_reserve(base, size) < 0)
+   err = -EBUSY;
+#else
+   err = reserve_bootmem(base, size, BOOTMEM_EXCLUSIVE);
+#endif
+   if (!err)
+   pram_reserved_pages++;
+   return err;
+}
+
+static void __init pram_unreserve_page(unsigned long pfn)
+{
+   free_bootmem(PFN_PHYS(pfn), PAGE_SIZE);
+   pram_reserved_pages--;
+}
+
+static int __init pram_reserve_link(struct pram_link *link)
+{
+   int i;
+   int err = 0;
+
+   for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) {
+   struct pram_entr

[PATCH RFC 10/13] mm: PRAM: allow to ban arbitrary memory ranges

2013-07-01 Thread Vladimir Davydov

Banning for PRAM memory ranges that have been reserved at boot time is
not enough for avoiding all conflicts. The point is that kexec may load
the new kernel code to some address range that have never been reserved
possibly overwriting persistent data.

Fortunately, it is possible to specify a memory range kexec will load
the new kernel code into. Thus, to avoid kexec-vs-PRAM conflicts, it is
enough to disallow for PRAM some memory range large enough to load the
new kernel and make kexec load the new kernel code into that range.

For that purpose, This patch adds ability to specify arbitrary banned
ranges using the 'pram_banned' boot option.
---
 mm/pram.c |   45 +
 1 file changed, 45 insertions(+)

diff --git a/mm/pram.c b/mm/pram.c
index 969ff3f..3ad769b 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -127,6 +127,15 @@ static bool __meminitdata pram_reservation_in_progress;
  * persistent data. Since the device configuration cannot change during kexec
  * and the newly booted kernel is likely to have a similar boot-time device
  * driver set, this hack should work in most cases.
+ *
+ * This solution has one exception. The point is that kexec may load the new
+ * kernel code to some address range that have never been reserved and thus
+ * banned for PRAM by the current kernel possibly overwriting persistent data.
+ * Fortunately, it is possible to specify an exact range kexec will load the
+ * new kernel code into. Thus, to avoid kexec-vs-PRAM conflicts, one should
+ * disallow for PRAM some memory range large enough to load the new kernel (see
+ * the 'pram_banned' boot param) and make kexec load the new kernel code into
+ * that range.
  */
 
 /*
@@ -378,6 +387,42 @@ out:
 }
 
 /*
+ * A comma separated list of memory regions that PRAM is not allowed to use.
+ */
+static int __init parse_pram_banned(char *arg)
+{
+   char *cur = arg, *tmp;
+   unsigned long long start, end;
+
+   do {
+   start = memparse(cur, &tmp);
+   if (cur == tmp) {
+   pr_warning("pram_banned: Memory value expected\n");
+   return -EINVAL;
+   }
+   cur = tmp;
+   if (*cur != '-') {
+   pr_warning("pram_banned: '-' expected\n");
+   return -EINVAL;
+   }
+   cur++;
+   end = memparse(cur, &tmp);
+   if (cur == tmp) {
+   pr_warning("pram_banned: Memory value expected\n");
+   return -EINVAL;
+   }
+   if (end <= start) {
+   pr_warning("pram_banned: end <= start\n");
+   return -EINVAL;
+   }
+   pram_ban_region(PFN_DOWN(start), PFN_UP(end) - 1);
+   } while (*cur++ == ',');
+
+   return 0;
+}
+early_param("pram_banned", parse_pram_banned);
+
+/*
  * Bans pfn range [start..end] (inclusive) for PRAM.
  */
 void __meminit pram_ban_region(unsigned long start, unsigned long end)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 09/13] mm: PRAM: ban pages that have been reserved at boot time

2013-07-01 Thread Vladimir Davydov

Obviously, not all memory ranges can be used for saving persistent
over-kexec data, because some of them are reserved by the system core
and various device drivers at boot time. If a memory range used for
initialization of a particular device turns out to be busy because PRAM
uses it for storing its data, the device driver initialization stage or
even the whole system boot sequence may fail.

As a workaround the current implementation uses a rather dirty hack. It
tracks all memory regions that have ever been reserved during the boot
sequence and avoids using pages belonging to those regions for storing
persistent data. Since the device configuration cannot change during
kexec and the newly booted kernel is likely to have a similar boot-time
device driver set, this hack should work in most cases.
---
 arch/x86/mm/init_32.c |1 +
 arch/x86/mm/init_64.c |1 +
 include/linux/pram.h  |4 +
 mm/bootmem.c  |4 +
 mm/memblock.c |7 +-
 mm/pram.c |  211 -
 6 files changed, 225 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index da38426..67b963a 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -782,6 +782,7 @@ void __init mem_init(void)
 
totalram_pages += pram_reserved_pages;
reservedpages -= pram_reserved_pages;
+   pram_show_banned();
 
codesize =  (unsigned long) &_etext - (unsigned long) &_text;
datasize =  (unsigned long) &_edata - (unsigned long) &_etext;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8aa4bc4..fbe3e17 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1080,6 +1080,7 @@ void __init mem_init(void)
 
totalram_pages += pram_reserved_pages;
reservedpages -= pram_reserved_pages;
+   pram_show_banned();
 
codesize =  (unsigned long) &_etext - (unsigned long) &_text;
datasize =  (unsigned long) &_edata - (unsigned long) &_etext;
diff --git a/include/linux/pram.h b/include/linux/pram.h
index b7f2799..d4f23e3 100644
--- a/include/linux/pram.h
+++ b/include/linux/pram.h
@@ -50,9 +50,13 @@ extern size_t pram_read(struct pram_stream *ps, void *buf, 
size_t count);
 #ifdef CONFIG_PRAM
 extern unsigned long pram_reserved_pages;
 extern void pram_reserve(void);
+extern void pram_ban_region(unsigned long start, unsigned long end);
+extern void pram_show_banned(void);
 #else
 #define pram_reserved_pages 0UL
 static inline void pram_reserve(void) { }
+static inline void pram_ban_region(unsigned long start, unsigned long end) { }
+static inline void pram_show_banned(void) { }
 #endif
 
 #endif /* _LINUX_PRAM_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 2b0bcb0..34d0b42 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -328,6 +329,9 @@ static int __init __reserve(bootmem_data_t *bdata, unsigned 
long sidx,
bdebug("silent double reserve of PFN %lx\n",
idx + bdata->node_min_pfn);
}
+
+   pram_ban_region(sidx + bdata->node_min_pfn,
+   eidx + bdata->node_min_pfn - 1);
return 0;
 }
 
diff --git a/mm/memblock.c b/mm/memblock.c
index b8d9147..d2c248e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct memblock_region 
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 static struct memblock_region 
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
@@ -553,13 +554,17 @@ int __init_memblock memblock_free(phys_addr_t base, 
phys_addr_t size)
 int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 {
struct memblock_type *_rgn = &memblock.reserved;
+   int err;
 
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
 (unsigned long long)base,
 (unsigned long long)base + size,
 (void *)_RET_IP_);
 
-   return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+   err = memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+   if (!err)
+   pram_ban_region(PFN_DOWN(base), PFN_UP(base + size) - 1);
+   return err;
 }
 
 /**
diff --git a/mm/pram.c b/mm/pram.c
index 8a66a86..969ff3f 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -110,6 +111,47 @@ static LIST_HEAD(pram_nodes);  /* 
linked through page::lru */
 static DEFINE_MUTEX(pram_mutex);   /* serializes open/close */
 
 unsigned long __initdata pram_reserved_pages;
+static bool __meminitdata pram_reservation_in_progress;
+
+/*
+ * Obviously, not all memory ranges can be used for saving persistent
+ * over-kexec data, because some of them are reserved by the sy

[PATCH RFC 06/13] mm: PRAM: introduce super block

2013-07-01 Thread Vladimir Davydov

The PRAM super block is the starting point for restoring persistent
memory. If the kernel locates the super block at boot time, it will
preserve the persistent memory structure from the previous kernel. To
point the kernel to the location of the super block, one should pass its
pfn via the 'pram' boot param. For that purpose, the pram super block
pfn is exported via /sys/kernel/pram. If none is passed, persistent
memory will not be preserved, and a new super block will be allocated.

The current patch introduces only super block handling. Memory
preservation will be implemented later.
---
 mm/pram.c |   94 +++--
 1 file changed, 92 insertions(+), 2 deletions(-)

diff --git a/mm/pram.c b/mm/pram.c
index c7706dc..58ae9ed 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -3,15 +3,18 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -66,12 +69,39 @@ struct pram_node {
 #define PRAM_ACCMODE_MASK  3
 
 /*
+ * The PRAM super block contains data needed to restore the persistent memory
+ * structure on boot. The pointer to it (pfn) should be passed via the 'pram'
+ * boot param if one wants to restore persistent data saved by the previously
+ * executing kernel. For that purpose the kernel exports the pfn via
+ * /sys/kernel/pram. If none is passed, persistent memory if any will not be
+ * preserved and a new clean page will be allocated for the super block.
+ *
+ * The structure occupies a memory page.
+ */
+struct pram_super_block {
+   __u64   node_pfn;   /* points to the first element of
+* the node list */
+};
+
+static unsigned long __initdata pram_sb_pfn;
+static struct pram_super_block *pram_sb;
+
+/*
  * For convenience sake PRAM nodes are kept in an auxiliary doubly-linked list
  * connected through the lru field of the page struct.
  */
 static LIST_HEAD(pram_nodes);  /* linked through page::lru */
 static DEFINE_MUTEX(pram_mutex);   /* serializes open/close */
 
+/*
+ * The PRAM super block pfn, see above.
+ */
+static int __init parse_pram_sb_pfn(char *arg)
+{
+   return kstrtoul(arg, 16, &pram_sb_pfn);
+}
+early_param("pram", parse_pram_sb_pfn);
+
 static inline struct page *pram_alloc_page(gfp_t gfp_mask)
 {
return alloc_page(gfp_mask);
@@ -161,6 +191,7 @@ static void pram_stream_init(struct pram_stream *ps,
  * Returns 0 on success, -errno on failure.
  *
  * Error values:
+ *%ENODEV: PRAM not available
  *%ENAMETOOLONG: name len >= PRAM_NAME_MAX
  *%ENOMEM: insufficient memory available
  *%EEXIST: node with specified name already exists
@@ -175,6 +206,9 @@ int pram_prepare_save(struct pram_stream *ps,
struct pram_node *node;
int err = 0;
 
+   if (!pram_sb)
+   return -ENODEV;
+
BUG_ON(type != PRAM_PAGE_STREAM &&
   type != PRAM_BYTE_STREAM);
 
@@ -250,6 +284,7 @@ void pram_discard_save(struct pram_stream *ps)
  * Returns 0 on success, -errno on failure.
  *
  * Error values:
+ *%ENODEV: PRAM not available
  *%ENOENT: node with specified name does not exist
  *%EBUSY: save to required node has not finished yet
  *%EPERM: specified type conflicts with type of required node
@@ -262,6 +297,9 @@ int pram_prepare_load(struct pram_stream *ps,
struct pram_node *node;
int err = 0;
 
+   if (!pram_sb)
+   return -ENODEV;
+
mutex_lock(&pram_mutex);
node = pram_find_node(name);
if (!node) {
@@ -550,6 +588,7 @@ static void __pram_reboot(void)
node->node_pfn = node_pfn;
node_pfn = page_to_pfn(page);
}
+   pram_sb->node_pfn = node_pfn;
 }
 
 static int pram_reboot(struct notifier_block *notifier,
@@ -557,7 +596,8 @@ static int pram_reboot(struct notifier_block *notifier,
 {
if (val != SYS_RESTART)
return NOTIFY_DONE;
-   __pram_reboot();
+   if (pram_sb)
+   __pram_reboot();
return NOTIFY_OK;
 }
 
@@ -565,9 +605,59 @@ static struct notifier_block pram_reboot_notifier = {
.notifier_call = pram_reboot,
 };
 
+static ssize_t show_pram_sb_pfn(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   unsigned long pfn = pram_sb ? PFN_DOWN(__pa(pram_sb)) : 0;
+   return sprintf(buf, "%lx\n", pfn);
+}
+
+static struct kobj_attribute pram_sb_pfn_attr =
+   __ATTR(pram, 0444, show_pram_sb_pfn, NULL);
+
+static struct attribute *pram_attrs[] = {
+   &pram_sb_pfn_attr.attr,
+   NULL,
+};
+
+static struct attribute_group pram_attr_group = {
+   .attrs = pram_attrs,
+};
+
+/* returns non-zero on success */
+static int __init pram_init_sb(void)
+{
+   unsigned long pfn;
+   struct pram_node *node;
+
+   if (!pram_sb) {
+   struct p

[PATCH RFC 05/13] mm: PRAM: link nodes by pfn before reboot

2013-07-01 Thread Vladimir Davydov

Since page structs, which are used for linking PRAM nodes, are cleared
on boot, organize all PRAM nodes into a list singly-linked by pfn's
before reboot to facilitate the node list restore in the new kernel.
---
 mm/pram.c |   50 ++
 1 file changed, 50 insertions(+)

diff --git a/mm/pram.c b/mm/pram.c
index f7eebe1..c7706dc 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -1,11 +1,15 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -42,6 +46,9 @@ struct pram_link {
  * singly-linked list of PRAM link structures (see above), the node has a
  * pointer to the head of.
  *
+ * To facilitate data restore in the new kernel, before reboot all PRAM nodes
+ * are organized into a list singly-linked by pfn's (see pram_reboot()).
+ *
  * The structure occupies a memory page.
  */
 struct pram_node {
@@ -49,6 +56,7 @@ struct pram_node {
__u32   type;   /* data type, see enum pram_stream_type */
__u64   data_len;   /* data size, only for byte streams */
__u64   link_pfn;   /* points to the first link of the node */
+   __u64   node_pfn;   /* points to the next node in the node list */
 
__u8name[PRAM_NAME_MAX];
 };
@@ -57,6 +65,10 @@ struct pram_node {
 #define PRAM_LOAD  2
 #define PRAM_ACCMODE_MASK  3
 
+/*
+ * For convenience sake PRAM nodes are kept in an auxiliary doubly-linked list
+ * connected through the lru field of the page struct.
+ */
 static LIST_HEAD(pram_nodes);  /* linked through page::lru */
 static DEFINE_MUTEX(pram_mutex);   /* serializes open/close */
 
@@ -521,3 +533,41 @@ size_t pram_read(struct pram_stream *ps, void *buf, size_t 
count)
}
return read_count;
 }
+
+/*
+ * Build the list of PRAM nodes.
+ */
+static void __pram_reboot(void)
+{
+   struct page *page;
+   struct pram_node *node;
+   unsigned long node_pfn = 0;
+
+   list_for_each_entry_reverse(page, &pram_nodes, lru) {
+   node = page_address(page);
+   if (WARN_ON(node->flags & PRAM_ACCMODE_MASK))
+   continue;
+   node->node_pfn = node_pfn;
+   node_pfn = page_to_pfn(page);
+   }
+}
+
+static int pram_reboot(struct notifier_block *notifier,
+  unsigned long val, void *v)
+{
+   if (val != SYS_RESTART)
+   return NOTIFY_DONE;
+   __pram_reboot();
+   return NOTIFY_OK;
+}
+
+static struct notifier_block pram_reboot_notifier = {
+   .notifier_call = pram_reboot,
+};
+
+static int __init pram_init(void)
+{
+   register_reboot_notifier(&pram_reboot_notifier);
+   return 0;
+}
+module_init(pram_init);
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RFC 02/13] mm: PRAM: implement node load and save functions

2013-07-01 Thread Vladimir Davydov

Persistent memory is divided into nodes, which can be saved and loaded
independently of each other. PRAM nodes are kept on the list and
identified by unique names. Whenever a save operation is initiated by
calling pram_prepare_save(), a new node is created and linked to the
list. When the save operation has been committed by calling
pram_finish_save(), the node becomes loadable. A load operation can be
then initiated by calling pram_prepare_load(), which deletes the node
from the list and prepares the corresponding stream for loading data
from it. After the load has been finished, the pram_finish_load()
function must be called to free the node. Nodes are also deleted when a
save operation is discarded, i.e. pram_discard_save() is called instead
of pram_finish_save().
---
 include/linux/pram.h |7 ++-
 mm/pram.c|  158 --
 2 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/include/linux/pram.h b/include/linux/pram.h
index cf04548..5b8c2c1 100644
--- a/include/linux/pram.h
+++ b/include/linux/pram.h
@@ -5,7 +5,12 @@
 #include 
 #include 
 
-struct pram_stream;
+struct pram_node;
+
+struct pram_stream {
+   gfp_t gfp_mask;
+   struct pram_node *node;
+};
 
 #define PRAM_NAME_MAX  256 /* including nul */
 
diff --git a/mm/pram.c b/mm/pram.c
index cea0e87..3af2039 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -1,10 +1,75 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 
+/*
+ * Persistent memory is divided into nodes that can be saved or loaded
+ * independently of each other. The nodes are identified by unique name
+ * strings.
+ *
+ * The structure occupies a memory page.
+ */
+struct pram_node {
+   __u32   flags;  /* see PRAM_* flags below */
+   __u32   type;   /* data type, see enum pram_stream_type */
+
+   __u8name[PRAM_NAME_MAX];
+};
+
+#define PRAM_SAVE  1
+#define PRAM_LOAD  2
+#define PRAM_ACCMODE_MASK  3
+
+static LIST_HEAD(pram_nodes);  /* linked through page::lru */
+static DEFINE_MUTEX(pram_mutex);   /* serializes open/close */
+
+static inline struct page *pram_alloc_page(gfp_t gfp_mask)
+{
+   return alloc_page(gfp_mask);
+}
+
+static inline void pram_free_page(void *addr)
+{
+   free_page((unsigned long)addr);
+}
+
+static inline void pram_insert_node(struct pram_node *node)
+{
+   list_add(&virt_to_page(node)->lru, &pram_nodes);
+}
+
+static inline void pram_delete_node(struct pram_node *node)
+{
+   list_del(&virt_to_page(node)->lru);
+}
+
+static struct pram_node *pram_find_node(const char *name)
+{
+   struct page *page;
+   struct pram_node *node;
+
+   list_for_each_entry(page, &pram_nodes, lru) {
+   node = page_address(page);
+   if (strcmp(node->name, name) == 0)
+   return node;
+   }
+   return NULL;
+}
+
+static void pram_stream_init(struct pram_stream *ps,
+struct pram_node *node, gfp_t gfp_mask)
+{
+   memset(ps, 0, sizeof(*ps));
+   ps->gfp_mask = gfp_mask;
+   ps->node = node;
+}
+
 /**
  * Create a persistent memory node with name @name and initialize stream @ps
  * for saving data to it.
@@ -18,13 +83,49 @@
  *
  * Returns 0 on success, -errno on failure.
  *
+ * Error values:
+ *%ENAMETOOLONG: name len >= PRAM_NAME_MAX
+ *%ENOMEM: insufficient memory available
+ *%EEXIST: node with specified name already exists
+ *
  * After the save has finished, pram_finish_save() (or pram_discard_save() in
  * case of failure) is to be called.
  */
 int pram_prepare_save(struct pram_stream *ps,
const char *name, enum pram_stream_type type, gfp_t gfp_mask)
 {
-   return -ENOSYS;
+   struct page *page;
+   struct pram_node *node;
+   int err = 0;
+
+   BUG_ON(type != PRAM_PAGE_STREAM &&
+  type != PRAM_BYTE_STREAM);
+
+   if (strlen(name) >= PRAM_NAME_MAX)
+   return -ENAMETOOLONG;
+
+   page = pram_alloc_page(GFP_KERNEL | __GFP_ZERO);
+   if (!page)
+   return -ENOMEM;
+   node = page_address(page);
+
+   node->flags = PRAM_SAVE;
+   node->type = type;
+   strcpy(node->name, name);
+
+   mutex_lock(&pram_mutex);
+   if (!pram_find_node(name))
+   pram_insert_node(node);
+   else
+   err = -EEXIST;
+   mutex_unlock(&pram_mutex);
+   if (err) {
+   __free_page(page);
+   return err;
+   }
+
+   pram_stream_init(ps, node, gfp_mask);
+   return 0;
 }
 
 /**
@@ -33,7 +134,12 @@ int pram_prepare_save(struct pram_stream *ps,
  */
 void pram_finish_save(struct pram_stream *ps)
 {
-   BUG();
+   struct pram_node *node = ps->node;
+
+   BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_SAVE);
+
+   smp_wmb();
+   node->flags &= ~PRAM_ACCMODE_MA

[PATCH RFC 03/13] mm: PRAM: implement page stream operations

2013-07-01 Thread Vladimir Davydov

Using the pram_save_page() function, one can populate PRAM nodes with
memory pages, which can be then loaded using the pram_load_page()
function. Saving a memory page to PRAM is implemented as storing the pfn
in the PRAM node and incrementing its ref count so that it will not get
freed after the last user puts it.
---
 include/linux/pram.h |3 +
 mm/pram.c|  166 +-
 2 files changed, 167 insertions(+), 2 deletions(-)

diff --git a/include/linux/pram.h b/include/linux/pram.h
index 5b8c2c1..dd17316 100644
--- a/include/linux/pram.h
+++ b/include/linux/pram.h
@@ -6,10 +6,13 @@
 #include 
 
 struct pram_node;
+struct pram_link;
 
 struct pram_stream {
gfp_t gfp_mask;
struct pram_node *node;
+   struct pram_link *link; /* current link */
+   unsigned int page_index;/* next page index in link */
 };
 
 #define PRAM_NAME_MAX  256 /* including nul */
diff --git a/mm/pram.c b/mm/pram.c
index 3af2039..a443eb0 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -5,19 +5,48 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
 /*
+ * Represents a reference to a data page saved to PRAM.
+ */
+struct pram_entry {
+   __u32 flags;/* see PRAM_PAGE_* flags */
+   __u64 pfn;  /* the page frame number */
+};
+
+/*
+ * Keeps references to data pages saved to PRAM.
+ * The structure occupies a memory page.
+ */
+struct pram_link {
+   __u64   link_pfn;   /* points to the next link of the node */
+
+   /* the array occupies the rest of the link page; if the link is not
+* full, the rest of the array must be filled with zeros */
+   struct pram_entry entry[0];
+};
+
+#define PRAM_LINK_ENTRIES_MAX \
+   ((PAGE_SIZE-sizeof(struct pram_link))/sizeof(struct pram_entry))
+
+/*
  * Persistent memory is divided into nodes that can be saved or loaded
  * independently of each other. The nodes are identified by unique name
  * strings.
  *
+ * References to data pages saved to a persistent memory node are kept in a
+ * singly-linked list of PRAM link structures (see above), the node has a
+ * pointer to the head of.
+ *
  * The structure occupies a memory page.
  */
 struct pram_node {
__u32   flags;  /* see PRAM_* flags below */
__u32   type;   /* data type, see enum pram_stream_type */
+   __u64   link_pfn;   /* points to the first link of the node */
 
__u8name[PRAM_NAME_MAX];
 };
@@ -62,12 +91,46 @@ static struct pram_node *pram_find_node(const char *name)
return NULL;
 }
 
+static void pram_truncate_link(struct pram_link *link)
+{
+   int i;
+   unsigned long pfn;
+   struct page *page;
+
+   for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) {
+   pfn = link->entry[i].pfn;
+   if (!pfn)
+   continue;
+   page = pfn_to_page(pfn);
+   put_page(page);
+   }
+}
+
+static void pram_truncate_node(struct pram_node *node)
+{
+   unsigned long link_pfn;
+   struct pram_link *link;
+
+   link_pfn = node->link_pfn;
+   while (link_pfn) {
+   link = pfn_to_kaddr(link_pfn);
+   pram_truncate_link(link);
+   link_pfn = link->link_pfn;
+   pram_free_page(link);
+   cond_resched();
+   }
+   node->link_pfn = 0;
+
+}
+
 static void pram_stream_init(struct pram_stream *ps,
 struct pram_node *node, gfp_t gfp_mask)
 {
memset(ps, 0, sizeof(*ps));
ps->gfp_mask = gfp_mask;
ps->node = node;
+   if (node->link_pfn)
+   ps->link = pfn_to_kaddr(node->link_pfn);
 }
 
 /**
@@ -157,6 +220,7 @@ void pram_discard_save(struct pram_stream *ps)
pram_delete_node(node);
mutex_unlock(&pram_mutex);
 
+   pram_truncate_node(node);
pram_free_page(node);
 }
 
@@ -220,9 +284,46 @@ void pram_finish_load(struct pram_stream *ps)
 
BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD);
 
+   pram_truncate_node(node);
pram_free_page(node);
 }
 
+/*
+ * Insert page to PRAM node allocating a new PRAM link if necessary.
+ */
+static int __pram_save_page(struct pram_stream *ps,
+   struct page *page, int flags)
+{
+   struct pram_node *node = ps->node;
+   struct pram_link *link = ps->link;
+   struct pram_entry *entry;
+
+   if (!link || ps->page_index >= PRAM_LINK_ENTRIES_MAX) {
+   struct page *link_page;
+   unsigned long link_pfn;
+
+   link_page = pram_alloc_page((ps->gfp_mask & GFP_RECLAIM_MASK) |
+   __GFP_ZERO);
+   if (!link_page)
+   return -ENOMEM;
+
+   link_pfn = page_to_pfn(link_page);
+   if (link)
+   link->link_pfn = link_pfn;
+   else
+

[PATCH RFC 01/13] mm: add PRAM API stubs and Kconfig

2013-07-01 Thread Vladimir Davydov

Persistent memory subsys or PRAM is intended to be used for saving
memory pages of the currently executing kernel and restoring them after
a kexec in the newly booted one. This can be utilized for speeding up
reboot by leaving process memory and/or FS caches in-place.

The proposed API:

 * Persistent memory is divided into nodes, which can be saved or loaded
   independently of each other. The nodes are identified by unique name
   strings. PRAM node is created (removed) when save (load) is initiated
   by calling pram_prepare_save() (pram_prepare_load()), see below.

 * For saving/loading data from a PRAM node an instance of the
   pram_stream struct is used. The struct is initialized by calling
   pram_prepare_save() for saving data or pram_prepare_load() for
   loading data. After save (load) is complete, pram_finish_save()
   (pram_finish_load()) must be called. If an error occurred during
   save, the saved data and the PRAM node may be freed by calling
   pram_discard_save() instead of pram_finish_save().

 * Each pram_stream has a type, which  determines the set of operations
   that may be used for saving/loading data. The type is defined by the
   pram_stream_type enum. Currently there are two stream types
   available: PRAM_PAGE_STREAM to save/load memory pages, and
   PRAM_BYTE_STREAM to save/load byte strings. For page streams
   pram_save_page() and pram_load_page() may be used, and for byte
   streams pram_write() and pram_read() may be used for saving and
   loading data respectively.

Thus a sequence of operations for saving/loading data from PRAM should
look like:

  * For saving data to PRAM:

/* create PRAM node and initialize stream for saving data to it */
pram_prepare_save()

/* save data to the node */
pram_save_page()[,...]  /* for page stream, or
pram_write()[,...]   * ... for byte stream */

/* commit the save or discard and delete the node */
pram_finish_save()  /* on success, or
pram_discard_save()  * ... in case of error */

  * For loading data from PRAM:

/* remove PRAM node from the list and initialize stream for
 * loading data from it */
pram_prepare_load()

/* load data from the node */
pram_load_page()[,...]  /* for page stream, or
pram_read()[,...]* ... for byte stream */

/* free the node */
pram_finish_load()
---
 include/linux/pram.h |   38 +++
 mm/Kconfig   |9 
 mm/Makefile  |1 +
 mm/pram.c|  131 ++
 4 files changed, 179 insertions(+)
 create mode 100644 include/linux/pram.h
 create mode 100644 mm/pram.c

diff --git a/include/linux/pram.h b/include/linux/pram.h
new file mode 100644
index 000..cf04548
--- /dev/null
+++ b/include/linux/pram.h
@@ -0,0 +1,38 @@
+#ifndef _LINUX_PRAM_H
+#define _LINUX_PRAM_H
+
+#include 
+#include 
+#include 
+
+struct pram_stream;
+
+#define PRAM_NAME_MAX  256 /* including nul */
+
+enum pram_stream_type {
+   PRAM_PAGE_STREAM,
+   PRAM_BYTE_STREAM,
+};
+
+extern int pram_prepare_save(struct pram_stream *ps,
+   const char *name, enum pram_stream_type type, gfp_t gfp_mask);
+extern void pram_finish_save(struct pram_stream *ps);
+extern void pram_discard_save(struct pram_stream *ps);
+
+extern int pram_prepare_load(struct pram_stream *ps,
+   const char *name, enum pram_stream_type type);
+extern void pram_finish_load(struct pram_stream *ps);
+
+#define PRAM_PAGE_LRU  0x01/* page is on the LRU */
+
+/* page-stream specific methods */
+extern int pram_save_page(struct pram_stream *ps,
+ struct page *page, int flags);
+extern struct page *pram_load_page(struct pram_stream *ps, int *flags);
+
+/* byte-stream specific methods */
+extern ssize_t pram_write(struct pram_stream *ps,
+ const void *buf, size_t count);
+extern size_t pram_read(struct pram_stream *ps, void *buf, size_t count);
+
+#endif /* _LINUX_PRAM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 3bea74f..46337e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -471,3 +471,12 @@ config FRONTSWAP
  and swap data is stored as normal on the matching swap device.
 
  If unsure, say Y to enable frontswap.
+
+config PRAM
+   bool "Persistent over-kexec memory storage"
+   default n
+   help
+ This option adds the kernel API that enables saving memory pages of
+ the currently executing kernel and restoring them after a kexec in
+ the newly booted one. This can be utilized for speeding up reboot by
+ leaving process memory and/or FS caches in-place.
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..33ad952 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMO

[PATCH RFC 08/13] mm: PRAM: checksum saved data

2013-07-01 Thread Vladimir Davydov

Checksum PRAM pages with crc32 to ensure persistent memory is not
corrupted during reboot.
---
 mm/Kconfig |4 ++
 mm/pram.c  |  128 +++-
 2 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index f1e11a0..0a4d4c6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -475,6 +475,10 @@ config FRONTSWAP
 config PRAM
bool "Persistent over-kexec memory storage"
depends on X86
+   select CRC32
+   select LIBCRC32C
+   select CRYPTO_CRC32C
+   select CRYPTO_CRC32C_INTEL
default n
help
  This option adds the kernel API that enables saving memory pages of
diff --git a/mm/pram.c b/mm/pram.c
index 380735f..8a66a86 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -1,4 +1,6 @@
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -19,11 +21,14 @@
 #include 
 #include 
 
+#define PRAM_MAGIC 0x7072616D
+
 /*
  * Represents a reference to a data page saved to PRAM.
  */
 struct pram_entry {
__u32 flags;/* see PRAM_PAGE_* flags */
+   __u32 csum; /* the page csum */
__u64 pfn;  /* the page frame number */
 };
 
@@ -32,6 +37,9 @@ struct pram_entry {
  * The structure occupies a memory page.
  */
 struct pram_link {
+   __u32   magic;
+   __u32   csum;
+
__u64   link_pfn;   /* points to the next link of the node */
 
/* the array occupies the rest of the link page; if the link is not
@@ -57,6 +65,9 @@ struct pram_link {
  * The structure occupies a memory page.
  */
 struct pram_node {
+   __u32   magic;
+   __u32   csum;
+
__u32   flags;  /* see PRAM_* flags below */
__u32   type;   /* data type, see enum pram_stream_type */
__u64   data_len;   /* data size, only for byte streams */
@@ -81,6 +92,9 @@ struct pram_node {
  * The structure occupies a memory page.
  */
 struct pram_super_block {
+   __u32   magic;
+   __u32   csum;
+
__u64   node_pfn;   /* points to the first element of
 * the node list */
 };
@@ -106,11 +120,34 @@ static int __init parse_pram_sb_pfn(char *arg)
 }
 early_param("pram", parse_pram_sb_pfn);
 
+static u32 pram_data_csum(struct page *page)
+{
+   u32 ret;
+   void *addr;
+
+   addr = kmap_atomic(page);
+   ret = crc32c(0, addr, PAGE_SIZE);
+   kunmap_atomic(addr);
+   return ret;
+}
+
+/* SSE-4.2 crc32c faster than crc32, but not available at early boot */
+static inline u32 pram_meta_csum(void *addr)
+{
+   /* skip magic and csum fields */
+   return crc32(0, addr + 8, PAGE_SIZE - 8);
+}
+
 static void * __init pram_map_meta(unsigned long pfn)
 {
+   __u32 *p;
+
if (pfn >= max_low_pfn)
return ERR_PTR(-EINVAL);
-   return pfn_to_kaddr(pfn);
+   p = pfn_to_kaddr(pfn);
+   if (p[0] != PRAM_MAGIC || p[1] != pram_meta_csum(p))
+   return ERR_PTR(-EINVAL);
+   return p;
 }
 
 static int __init pram_reserve_page(unsigned long pfn)
@@ -332,6 +369,65 @@ static struct pram_node *pram_find_node(const char *name)
return NULL;
 }
 
+static void pram_csum_link(struct pram_link *link)
+{
+   int i;
+   struct pram_entry *entry;
+
+   for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) {
+   entry = &link->entry[i];
+   if (entry->pfn)
+   entry->csum = pram_data_csum(pfn_to_page(entry->pfn));
+   }
+}
+
+static void pram_csum_node(struct pram_node *node)
+{
+   unsigned long link_pfn;
+   struct pram_link *link;
+
+   link_pfn = node->link_pfn;
+   while (link_pfn) {
+   link = pfn_to_kaddr(link_pfn);
+   pram_csum_link(link);
+   link_pfn = link->link_pfn;
+   cond_resched();
+   }
+}
+
+static int pram_check_link(struct pram_link *link)
+{
+   int i;
+   struct pram_entry *entry;
+
+   for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) {
+   entry = &link->entry[i];
+   if (!entry->pfn)
+   break;
+   if (entry->csum != pram_data_csum(pfn_to_page(entry->pfn)))
+   return -EFAULT;
+   }
+   return 0;
+}
+
+static int pram_check_node(struct pram_node *node)
+{
+   unsigned long link_pfn;
+   struct pram_link *link;
+   int ret = 0;
+
+   link_pfn = node->link_pfn;
+   while (link_pfn) {
+   link = pfn_to_kaddr(link_pfn);
+   ret = pram_check_link(link);
+   if (ret)
+   break;
+   link_pfn = link->link_pfn;
+   cond_resched();
+   }
+   return ret;
+}
+
 static void pram_truncate_link(struct pram_link *link)
 {
int i;
@@ -449,6 +545,7 @@ void pram_finish_save(struct pram_stream *ps)
 
BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM

[PATCH RFC 04/13] mm: PRAM: implement byte stream operations

2013-07-01 Thread Vladimir Davydov

This patch adds ability to save arbitrary byte strings to PRAM using
pram_write() to be restored later using pram_read(). These two
operations are implemented on top of pram_save_page() and
pram_load_page() respectively.
---
 include/linux/pram.h |4 +++
 mm/pram.c|   86 --
 2 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/include/linux/pram.h b/include/linux/pram.h
index dd17316..61c536c 100644
--- a/include/linux/pram.h
+++ b/include/linux/pram.h
@@ -13,6 +13,10 @@ struct pram_stream {
struct pram_node *node;
struct pram_link *link; /* current link */
unsigned int page_index;/* next page index in link */
+
+   /* byte-stream specific */
+   struct page *data_page;
+   unsigned int data_offset;
 };
 
 #define PRAM_NAME_MAX  256 /* including nul */
diff --git a/mm/pram.c b/mm/pram.c
index a443eb0..f7eebe1 100644
--- a/mm/pram.c
+++ b/mm/pram.c
@@ -1,5 +1,6 @@
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -46,6 +47,7 @@ struct pram_link {
 struct pram_node {
__u32   flags;  /* see PRAM_* flags below */
__u32   type;   /* data type, see enum pram_stream_type */
+   __u64   data_len;   /* data size, only for byte streams */
__u64   link_pfn;   /* points to the first link of the node */
 
__u8name[PRAM_NAME_MAX];
@@ -284,6 +286,9 @@ void pram_finish_load(struct pram_stream *ps)
 
BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD);
 
+   if (ps->data_page)
+   put_page(ps->data_page);
+
pram_truncate_node(node);
pram_free_page(node);
 }
@@ -422,10 +427,51 @@ struct page *pram_load_page(struct pram_stream *ps, int 
*flags)
  *
  * On success, returns the number of bytes written, which is always equal to
  * @count. On failure, -errno is returned.
+ *
+ * Error values:
+ *%ENOMEM: insufficient amount of memory available
  */
 ssize_t pram_write(struct pram_stream *ps, const void *buf, size_t count)
 {
-   return -ENOSYS;
+   void *addr;
+   size_t copy_count, write_count = 0;
+   struct pram_node *node = ps->node;
+
+   BUG_ON(node->type != PRAM_BYTE_STREAM);
+   BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_SAVE);
+
+   while (count > 0) {
+   if (!ps->data_page) {
+   struct page *page;
+   int err;
+
+   page = pram_alloc_page((ps->gfp_mask & 
GFP_RECLAIM_MASK) |
+  __GFP_HIGHMEM | __GFP_ZERO);
+   if (!page)
+   return -ENOMEM;
+   err = __pram_save_page(ps, page, 0);
+   put_page(page);
+   if (err)
+   return err;
+   ps->data_page = page;
+   ps->data_offset = 0;
+   }
+
+   copy_count = min_t(size_t, count, PAGE_SIZE - ps->data_offset);
+   addr = kmap_atomic(ps->data_page);
+   memcpy(addr + ps->data_offset, buf, copy_count);
+   kunmap_atomic(addr);
+
+   buf += copy_count;
+   node->data_len += copy_count;
+   ps->data_offset += copy_count;
+   if (ps->data_offset >= PAGE_SIZE)
+   ps->data_page = NULL;
+
+   write_count += copy_count;
+   count -= copy_count;
+   }
+   return write_count;
 }
 
 /**
@@ -437,5 +483,41 @@ ssize_t pram_write(struct pram_stream *ps, const void 
*buf, size_t count)
  */
 size_t pram_read(struct pram_stream *ps, void *buf, size_t count)
 {
-   return 0;
+   char *addr;
+   size_t copy_count, read_count = 0;
+   struct pram_node *node = ps->node;
+
+   BUG_ON(node->type != PRAM_BYTE_STREAM);
+   BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD);
+
+   while (count > 0 && node->data_len > 0) {
+   if (!ps->data_page) {
+   struct page *page;
+
+   page = __pram_load_page(ps, NULL);
+   if (!page)
+   break;
+   ps->data_page = page;
+   ps->data_offset = 0;
+   }
+
+   copy_count = min_t(size_t, count, PAGE_SIZE - ps->data_offset);
+   if (copy_count > node->data_len)
+   copy_count = node->data_len;
+   addr = kmap_atomic(ps->data_page);
+   memcpy(buf, addr + ps->data_offset, copy_count);
+   kunmap_atomic(addr);
+
+   buf += copy_count;
+   node->data_len -= copy_count;
+   ps->data_offset += copy_count;
+   if (ps->data_offset >= PAGE_SIZE || !node->data_len) {
+   put_page(ps->data_page);
+

Re: [PATCH RFC] sched: move h_load calculation to task_h_load

2013-07-15 Thread Vladimir Davydov


On 07/15/2013 12:28 PM, Peter Zijlstra wrote:

OK, fair enough. It does somewhat rely on us getting the single
rq->clock update thing right, but that should be ok.


Frankly, I doubt that rq->clock is the right thing to use here, because 
it can be updated very frequently under some conditions, so that 
cfs_rq->h_load can get updated several times during the same balance 
run. Perhaps, jiffies would suit better for that purpose. This would 
work as h_load update throttler similar to how it works now (I mean 
rq->h_load_throttle in update_h_load()).


If there is something else you don't like/have some thoughts on, please 
share - I'll try to improve.


Thank you for your feedback.


But yeah, when you have stupid many cgroups we quickly need less h_load
instances than there are cgroups.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2] sched: move h_load calculation to task_h_load

2013-07-15 Thread Vladimir Davydov

The bad thing about update_h_load(), which computes hierarchical load
factor for task groups, is that it is called for each task group in the
system before every load balancer run, and since rebalance can be
triggered very often, this function can eat really a lot of cpu time if
there are many cpu cgroups in the system.

Although the situation was improved significantly by commit a35b646
('sched, cgroup: Reduce rq->lock hold times for large cgroup
hierarchies'), the problem still can arise under some kinds of loads,
e.g. when cpus are switching from idle to busy and back very frequently.

For instance, when I start 1000 of processes that wake up every
millisecond on my 8 cpus host, 'top' and 'perf top' show:

Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 243K cycles
  7.57%  [kernel]   [k] __schedule
  7.08%  [kernel]   [k] timerqueue_add
  6.13%  libc-2.12.so   [.] usleep

Then if I create 1 *idle* cpu cgroups (no processes in them), cpu
usage increases significantly although the 'wakers' are still executing
in the root cpu cgroup:

Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
Events: 230K cycles
 24.56%  [kernel][k] tg_load_down
  5.76%  [kernel][k] __schedule

This happens because this particular kind of load triggers 'new idle'
rebalance very frequently, which requires calling update_h_load(),
which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
though it is absolutely useless, because idle cpu cgroups have no tasks
to pull.

This patch tries to improve the situation by making h_load calculation
proceed only when h_load is really necessary. To achieve this, it
substitutes update_h_load() with update_cfs_rq_h_load(), which computes
h_load only for a given cfs_rq and all its ascendants, and makes the
load balancer call this function whenever it considers if a task should
be pulled, i.e. it moves h_load calculations directly to task_h_load().
For h_load of the same cfs_rq not to be updated multiple times (in case
several tasks in the same cgroup are considered during the same balance
run), the patch keeps the time of the last h_load update for each cfs_rq
and breaks calculation when it finds h_load to be uptodate.

The benefit of it is that h_load is computed only for those cfs_rq's,
which really need it, in particular all idle task groups are skipped.
Although this, in fact, moves h_load calculation under rq lock, it
should not affect latency much, because the amount of work done under rq
lock while trying to pull tasks is limited by sched_nr_migrate.

After the patch applied with the setup described above (1000 wakers in
the root cgroup and 1 idle cgroups), I get:

Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
Events: 242K cycles
  7.57%  [kernel]  [k] __schedule
  6.70%  [kernel]  [k] timerqueue_add
  5.93%  libc-2.12.so  [.] usleep

Changes in v2:
 * use jiffies instead of rq->clock for last_h_load_update.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c  |   58 +++---
 kernel/sched/sched.h |7 +++---
 2 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f77f9c5..12d0137 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4171,47 +4171,48 @@ static void update_blocked_averages(int cpu)
 }
 
 /*
- * Compute the cpu's hierarchical load factor for each task group.
+ * Compute the hierarchical load factor for cfs_rq and all its ascendants.
  * This needs to be done in a top-down fashion because the load of a child
  * group is a fraction of its parents load.
  */
-static int tg_load_down(struct task_group *tg, void *data)
+static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 {
-   unsigned long load;
-   long cpu = (long)data;
-
-   if (!tg->parent) {
-   load = cpu_rq(cpu)->avg.load_avg_contrib;
-   } else {
-   load = tg->parent->cfs_rq[cpu]->h_load;
-   load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib,
-   tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
-   }
-
-   tg->cfs_rq[cpu]->h_load = load;
-
-   return 0;
-}
-
-static void update_h_load(long cpu)
-{
-   struct rq *rq = cpu_rq(cpu);
+   struct rq *rq = rq_of(cfs_rq);
+   struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
unsigned long now = jiffies;
+   unsigned long load;
 
-   if (rq->h_load_throttle == now)
+   if (cfs_rq->last_h_load_update == now)
return;
 
-   rq->h_load_throttle = now;
+   cfs_rq->h_load_next = NULL;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   cfs_rq->h_load_next =

Re: [PATCH RFC] pram: persistent over-kexec memory file system

2013-07-27 Thread Vladimir Davydov


On 07/27/2013 07:41 PM, Marco Stornelli wrote:

Il 26/07/2013 14:29, Vladimir Davydov ha scritto:

Hi,

We want to propose a way to upgrade a kernel on a machine without
restarting all the user-space services. This is to be done with CRIU
project, but we need help from the kernel to preserve some data in
memory while doing kexec.

The key point of our implementation is leaving process memory in-place
during reboot. This should eliminate most io operations the services
would produce during initialization. To achieve this, we have
implemented a pseudo file system that preserves its content during
kexec. We propose saving CRIU dump files to this file system, kexec'ing
and then restoring the processes in the newly booted kernel.



http://pramfs.sourceforge.net/


AFAIU it's a bit different thing: PRAMFS as well as pstore, which has 
already been merged, requires hardware support for over-reboot 
persistency, so called non-volatile RAM, i.e. RAM which is not directly 
accessible and so is not used by the kernel. On the contrary, what we'd 
like to have is preserving usual RAM on kexec. It is possible, because 
RAM is not reset during kexec. This would allow leaving applications 
working set as well as filesystem caches in place, speeding the reboot 
process as a whole and reducing the downtime significantly.


Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] pram: persistent over-kexec memory file system

2013-07-28 Thread Vladimir Davydov


On 07/27/2013 09:37 PM, Marco Stornelli wrote:

Il 27/07/2013 19:35, Vladimir Davydov ha scritto:

On 07/27/2013 07:41 PM, Marco Stornelli wrote:

Il 26/07/2013 14:29, Vladimir Davydov ha scritto:

Hi,

We want to propose a way to upgrade a kernel on a machine without
restarting all the user-space services. This is to be done with CRIU
project, but we need help from the kernel to preserve some data in
memory while doing kexec.

The key point of our implementation is leaving process memory in-place
during reboot. This should eliminate most io operations the services
would produce during initialization. To achieve this, we have
implemented a pseudo file system that preserves its content during
kexec. We propose saving CRIU dump files to this file system, 
kexec'ing

and then restoring the processes in the newly booted kernel.



http://pramfs.sourceforge.net/


AFAIU it's a bit different thing: PRAMFS as well as pstore, which has
already been merged, requires hardware support for over-reboot
persistency, so called non-volatile RAM, i.e. RAM which is not directly
accessible and so is not used by the kernel. On the contrary, what we'd
like to have is preserving usual RAM on kexec. It is possible, because
RAM is not reset during kexec. This would allow leaving applications
working set as well as filesystem caches in place, speeding the reboot
process as a whole and reducing the downtime significantly.

Thanks.


Actually not. You can use normal system RAM reserved at boot with mem 
parameter without any kernel change. Until an hard reset happens, that 
area will be "persistent".


Thank you, we'll look at PRAMFS closer, but right now, after trying it I 
have a couple of concerns I'd appreciate if you could clarify:


1) As you advised, I tried to reserve a range of memory (passing 
memmap=4G$4G at boot) and mounted PRAMFS using the following options:


# mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none /mnt/pramfs

And it turned out that PRAMFS is very slow as compared to ramfs:

# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024]

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 9.23498 s, 45.4 MB/s
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024] conv=notrunc

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 3.04692 s, 138 MB/s

We need it to be as fast as usual RAM, because otherwise the benefit of 
it over hdd disappears. So before diving into the code, I'd like to ask 
you if it's intrinsic to PRAMFS, or can it be fixed? Or, perhaps, I used 
wrong mount/boot/config options (btw, I enabled only CONFIG_PRAMFS)?


2) To enable saving application dump files in memory using PRAMFS, one 
should reserve half of RAM for it. That's too expensive. While with 
ramfs, once SPLICE_F_MOVE flag is implemented, one could move anonymous 
memory pages to ramfs page cache and after kexec move it back so that 
almost no extra memory space costs would be required. Of course, 
SPLICE_F_MOVE is to be yet implemented, but with PRAMFS significant 
memory costs are inevitable... or am I wrong?


Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] pram: persistent over-kexec memory file system

2013-07-28 Thread Vladimir Davydov


On 07/28/2013 03:02 PM, Marco Stornelli wrote:

Il 28/07/2013 12:05, Vladimir Davydov ha scritto:

On 07/27/2013 09:37 PM, Marco Stornelli wrote:

Il 27/07/2013 19:35, Vladimir Davydov ha scritto:

On 07/27/2013 07:41 PM, Marco Stornelli wrote:

Il 26/07/2013 14:29, Vladimir Davydov ha scritto:

Hi,

We want to propose a way to upgrade a kernel on a machine without
restarting all the user-space services. This is to be done with CRIU
project, but we need help from the kernel to preserve some data in
memory while doing kexec.

The key point of our implementation is leaving process memory 
in-place

during reboot. This should eliminate most io operations the services
would produce during initialization. To achieve this, we have
implemented a pseudo file system that preserves its content during
kexec. We propose saving CRIU dump files to this file system,
kexec'ing
and then restoring the processes in the newly booted kernel.



http://pramfs.sourceforge.net/


AFAIU it's a bit different thing: PRAMFS as well as pstore, which has
already been merged, requires hardware support for over-reboot
persistency, so called non-volatile RAM, i.e. RAM which is not 
directly
accessible and so is not used by the kernel. On the contrary, what 
we'd

like to have is preserving usual RAM on kexec. It is possible, because
RAM is not reset during kexec. This would allow leaving applications
working set as well as filesystem caches in place, speeding the reboot
process as a whole and reducing the downtime significantly.

Thanks.


Actually not. You can use normal system RAM reserved at boot with mem
parameter without any kernel change. Until an hard reset happens, that
area will be "persistent".


Thank you, we'll look at PRAMFS closer, but right now, after trying it I
have a couple of concerns I'd appreciate if you could clarify:

1) As you advised, I tried to reserve a range of memory (passing
memmap=4G$4G at boot) and mounted PRAMFS using the following options:

# mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none 
/mnt/pramfs


And it turned out that PRAMFS is very slow as compared to ramfs:

# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy
bs=4096 count=$[100*1024]
102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 9.23498 s, 45.4 MB/s
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy
bs=4096 count=$[100*1024] conv=notrunc
102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 3.04692 s, 138 MB/s

We need it to be as fast as usual RAM, because otherwise the benefit of
it over hdd disappears. So before diving into the code, I'd like to ask
you if it's intrinsic to PRAMFS, or can it be fixed? Or, perhaps, I used
wrong mount/boot/config options (btw, I enabled only CONFIG_PRAMFS)?



In x86 you should have the write protection enabled. Turn it off or 
mount it with noprotect option.


I tried. This helps, but the write rate is still too low:

with write protect:

# mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none /mnt/pramfs
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024]

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 17.6007 s, 23.8 MB/s
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024] conv=notrunc

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 4.32923 s, 96.9 MB/s

w/o write protect:

# mount -t pramfs -o physaddr=0x1,init=4G,bs=4096,noprotect none 
/mnt/pramfs
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024]

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 9.07748 s, 46.2 MB/s
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy 
bs=4096 count=$[100*1024] conv=notrunc

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 3.04596 s, 138 MB/s

Also tried turning off CONFIG_PRAMFS_WRITE_PROTECT, the result is the 
same: the rate does not exceed 150 MB/s, which is too slow comparing to 
ramfs:


# mount -t ramfs none /mnt/ramfs
# dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/ramfs/dummy 
bs=4096 count=$[100*1024]

102400+0 records in
102400+0 records out
419430400 bytes (419 MB) copied, 0.200809 s, 2.1 GB/s




2) To enable saving application dump files in memory using PRAMFS, one
should reserve half of RAM for it. That's too expensive. While with
ramfs, once SPLICE_F_MOVE flag is implemented, one could move anonymous
memory pages to ramfs page cache and after kexec move it back so that
almost no extra memory space costs would be required. Of course,
SPLICE_F_MOVE is to be yet implemented, but with PRAMFS significant
memory costs are inevitable... or am I wrong?

Thanks.


From this point of view you are right. Pramfs (or other solution like 
that) are out of page cache, so you can't do any memory transfer. It&#x

[PATCH RFC] pram: persistent over-kexec memory file system

2013-07-26 Thread Vladimir Davydov

Hi,

We want to propose a way to upgrade a kernel on a machine without
restarting all the user-space services. This is to be done with CRIU
project, but we need help from the kernel to preserve some data in
memory while doing kexec.

The key point of our implementation is leaving process memory in-place
during reboot. This should eliminate most io operations the services
would produce during initialization. To achieve this, we have
implemented a pseudo file system that preserves its content during
kexec. We propose saving CRIU dump files to this file system, kexec'ing
and then restoring the processes in the newly booted kernel.

A typical usage scenario would look like this:

 1) Boot kernel with 'pram_banned=MEMRANGE' boot option
(MEMRANGE=MEMMIN-MEMMAX).

This is to prevent kexec from overwriting persistent data while
loading the new kernel. Later on kexec will be forced to load kernel
to the range specified. MEMRANGE=0-128M should be enough.

 2) Mount pram file system and save dump files there:

# mount -t pram none /mnt
# criu dump -D /mnt -t $PID

 3) Run kexec passing pram location to the new kernel and forcing it to
load the kernel image to MEMRAGE:

# kexec --load /vmlinuz --initrd=initrd.img \
--append="$(cat /proc/cmdline | sed -e 's/pram=[^ ]*//g') 
pram=$(cat /sys/kernel/pram)" \
--mem-min=$MEMMIN --mem-max=$MEMMAX
# reboot

 4) After reboot mount pram, restore processes, and cleanup:

# mount -t pram none /mnt
# criu restore -d -D /mnt
# rm -f /mnt/*
# umount /mnt

In this patch I introduce the pram pseudo file system that keeps its
memory in place during kexec. pram is based on ramfs, but it serializes
and leaves in memory its content on unmount, and restores it on the next
mount. To survive over kexec, pram finds the serialized content, whose
location should be specified by 'pram' boot param (exported via
/sys/kernel/pram), and reserves it at early boot. To avoid conflicts
with other parts of the kernel that make early reservation too, pram
tracks all memory regions that have ever been reserved and avoids using
them for storing its data. Plus, it adds 'pram_banned' boot param, which
can be used to explicitly disallow pram to use specified memory regions.
This may be useful for avoiding conflicts with kexec loading the new
kernel image (as it is done in the usage scenario).

This implementation serves as a proof of concept and so has a number of
limitations:

 * pram only supports regular files; directories, symlinks, etc are
   ignored

 * pram is implemented only for x86

 * pram does not support swapping out

 * pram checksums serialized content and drops it in case it is
   corrupted with no possibility of restore

What do you think about it, does it make sense to go on with this
approach or should we reconsider it as a whole?

Thanks.
---
 arch/x86/kernel/setup.c |2 +
 fs/Kconfig  |   13 +
 fs/ramfs/Makefile   |1 +
 fs/ramfs/inode.c|2 +-
 fs/ramfs/persistent.c   |  699 +++
 include/linux/pram.h|   18 ++
 include/linux/ramfs.h   |1 +
 kernel/ksysfs.c |   13 +
 mm/bootmem.c|5 +
 mm/memblock.c   |7 +-
 mm/nobootmem.c  |2 +
 mm/page_alloc.c |3 +
 12 files changed, 764 insertions(+), 2 deletions(-)
 create mode 100644 fs/ramfs/persistent.c
 create mode 100644 include/linux/pram.h

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f8ec578..7d22ad0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1137,6 +1138,7 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
 
initmem_init();
+   pram_reserve();
memblock_find_dma_reserve();
 
 #ifdef CONFIG_KVM_GUEST
diff --git a/fs/Kconfig b/fs/Kconfig
index c229f82..8d6943f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -168,6 +168,19 @@ config HUGETLB_PAGE
 
 source "fs/configfs/Kconfig"
 
+config PRAM
+   bool "Persistent over-kexec memory storage"
+   depends on X86
+   select CRC32
+   select LIBCRC32C
+   select CRYPTO_CRC32C
+   select CRYPTO_CRC32C_INTEL
+   default n
+   help
+ pram is a filesystem that saves its content on unmount to be restored
+ on the next mount after kexec. It can be used for speeding up system
+ reboot by saving application memory images there.
+
 endmenu
 
 menuconfig MISC_FILESYSTEMS
diff --git a/fs/ramfs/Makefile b/fs/ramfs/Makefile
index c71e65d..e6953d4 100644
--- a/fs/ramfs/Makefile
+++ b/fs/ramfs/Makefile
@@ -7,3 +7,4 @@ obj-y += ramfs.o
 file-mmu-y := file-nommu.o
 file-mmu-$(CONFIG_MMU) := file-mmu.o
 ramfs-objs += inode.o $(file-mmu-y)
+ramfs-$(CONFIG_PRAM) += persistent.o
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..86f9f9a 100644
--- a/fs/ramfs/inode.c
+

[PATCH 1/2] cpu: common: make clearcpuid option take bits list

2012-07-20 Thread Vladimir Davydov

It is more convenient to write 'clearcpuid=147,148,...' than
'clearcpuid=147 clearcpuid=148 ...'
---
 arch/x86/kernel/cpu/common.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b9333b..8ffe1b9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1012,10 +1012,10 @@ static __init int setup_disablecpuid(char *arg)
 {
int bit;
 
-   if (get_option(&arg, &bit) && bit < NCAPINTS*32)
-   setup_clear_cpu_cap(bit);
-   else
-   return 0;
+   while (get_option(&arg, &bit)) {
+   if (bit >= 0 && bit < NCAPINTS*32)
+   setup_clear_cpu_cap(bit);
+   }
 
return 1;
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-20 Thread Vladimir Davydov

If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be
reported in /proc/cpuinfo and used by the kernel. However, if a
userpsace process checks CPU features directly using the cpuid
instruction, it will be reported about all features supported by the CPU
irrespective of what features are cleared.

The patch makes the clearcpuid boot option not only clear CPU features
in kernel but also mask them in hardware for Intel and AMD CPUs that
support it so that the features cleared won't be reported even by the
cpuid instruction.

This can be useful for migration of virtual machines managed by
hypervisors that do not support/use Intel VT/AMD-V hardware-assisted
virtualization technology.

If CPUID masking is supported, this will be reported in
/proc/cpuinfo:flags as 'cpuidmask'.
---
 arch/x86/include/asm/cpufeature.h |1 +
 arch/x86/kernel/cpu/amd.c |   22 ++
 arch/x86/kernel/cpu/intel.c   |   59 +
 3 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h 
b/arch/x86/include/asm/cpufeature.h
index f91e80f..6f061ea 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -97,6 +97,7 @@
 #define X86_FEATURE_EXTD_APICID(3*32+26) /* has extended APICID (8 
bits) */
 #define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */
+#define X86_FEATURE_CPUIDMASK  (3*32+29) /* CPUID feature masking */
 
 /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */
 #define X86_FEATURE_XMM3   (4*32+ 0) /* "pni" SSE-3 */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 146bb62..ddc8ea2 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -482,6 +482,26 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 #endif
 }
 
+#define MSR_AMD_CPUID_FEATURES_OVERRIDE0xc0011004
+#define MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE0xc0011005
+
+static void __cpuinit amd_mask_cpu_caps(struct cpuinfo_x86 *c)
+{
+   u64 val;
+
+   if (c->x86 >= 0x0f) {
+   set_cpu_cap(c, X86_FEATURE_CPUIDMASK);
+
+   rdmsrl_amd_safe(MSR_AMD_CPUID_FEATURES_OVERRIDE, &val);
+   val &= ~((u64)cpu_caps_cleared[4] << 32 | cpu_caps_cleared[0]);
+   wrmsrl_amd_safe(MSR_AMD_CPUID_FEATURES_OVERRIDE, val);
+
+   rdmsrl_amd_safe(MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE, &val);
+   val &= ~((u64)cpu_caps_cleared[6] << 32 | cpu_caps_cleared[1]);
+   wrmsrl_amd_safe(MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE, val);
+   }
+}
+
 static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 {
u32 dummy;
@@ -684,6 +704,8 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
}
 
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
+
+   amd_mask_cpu_caps(c);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 3e6ff6c..1df1da5 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -358,6 +358,63 @@ static void __cpuinit detect_vmx_virtcap(struct 
cpuinfo_x86 *c)
}
 }
 
+static unsigned int __cpuinit intel_cpuid_mask_msr(struct cpuinfo_x86 *c,
+   unsigned int *msr_ext_cpuid_mask)
+{
+   unsigned int msr, msr_ext;
+
+   msr = msr_ext = 0;
+
+   switch (c->x86_model) {
+   case 0x17:
+   case 0x1D:
+   msr = 0x478;
+   break;
+   case 0x1A:
+   case 0x1E:
+   case 0x1F:
+   case 0x25:
+   case 0x2C:
+   case 0x2E:
+   case 0x2F:
+   msr = 0x130;
+   msr_ext = 0x131;
+   break;
+   case 0x2A:
+   msr = 0x132;
+   msr_ext = 0x133;
+   break;
+   }
+
+   if (msr_ext_cpuid_mask)
+   *msr_ext_cpuid_mask = msr_ext;
+
+   return msr;
+}
+
+static void __cpuinit intel_mask_cpu_caps(struct cpuinfo_x86 *c)
+{
+   u32 low, high;
+   unsigned int msr_cpuid_mask, msr_ext_cpuid_mask;
+
+   msr_cpuid_mask = intel_cpuid_mask_msr(c, &msr_ext_cpuid_mask);
+   if (msr_cpuid_mask) {
+   set_cpu_cap(c, X86_FEATURE_CPUIDMASK);
+
+   rdmsr(msr_cpuid_mask, low, high);
+   low &= ~cpu_caps_cleared[4];
+   high &= ~cpu_caps_cleared[0];
+   wrmsr(msr_cpuid_mask, low, high);
+
+   if (msr_ext_cpuid_mask != 0) {
+   rdmsr(msr_ext_cpuid_mask, low, high);
+   low &= ~cpu_caps_cleared[6];
+   high &= ~cpu_caps_cleared[1];
+   wrmsr(msr_ext_cpuid_mask, low, high);
+   }
+   }
+}
+
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 {
unsigned int l2 = 0;
@@ -474,6 +531,8 @@ static void __cpuinit init_intel(struct cpuin

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-20 Thread Vladimir Davydov

On Jul 20, 2012, at 9:20 PM, H. Peter Anvin wrote:

> On 07/20/2012 09:37 AM, Vladimir Davydov wrote:
>> If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be
>> reported in /proc/cpuinfo and used by the kernel. However, if a
>> userpsace process checks CPU features directly using the cpuid
>> instruction, it will be reported about all features supported by the CPU
>> irrespective of what features are cleared.
>> 
>> The patch makes the clearcpuid boot option not only clear CPU features
>> in kernel but also mask them in hardware for Intel and AMD CPUs that
>> support it so that the features cleared won't be reported even by the
>> cpuid instruction.
>> 
>> This can be useful for migration of virtual machines managed by
>> hypervisors that do not support/use Intel VT/AMD-V hardware-assisted
>> virtualization technology.
>> 
>> If CPUID masking is supported, this will be reported in
>> /proc/cpuinfo:flags as 'cpuidmask'.
> 
> I am a bit concerned about this patch:
> 
> 1. it silently changes existing behavior.

Yes, but who needs the current implementation of 'clearcpuid' which, in fact, 
just hides flags in /proc/cpuinfo while userspace apps will see and 
consequently use all CPU features?

So, I think it logically extends the existing behavior.

> 2. even on enabled hardware, only some of the bits are maskable.

The patch makes only words 0, 1, 4, 6 maskable, but words 3, 7, 8 are 
Linux-defined, words 2 and 5 are Transmeta-, Centaur-, etc- defined, and word 9 
contains some bizarre Intel CPU features. Thus, it is words 0, 1, 4, 6 that 
contain useful information for most hardware models.

If you ask about some Intel CPUs that can't mask CPUID function 0x8001, 
this function describes AMD-specific features, and I bet those Intel CPUs just 
don't have them at all and thus have nothing to mask.

> 
>   -hpa
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-20 Thread Vladimir Davydov

On Jul 21, 2012, at 12:19 AM, H. Peter Anvin wrote:

> On 07/20/2012 11:21 AM, Vladimir Davydov wrote:
>>> 
>>> I am a bit concerned about this patch:
>>> 
>>> 1. it silently changes existing behavior.
>> 
>> Yes, but who needs the current implementation of 'clearcpuid' which,
>> in fact, just hides flags in /proc/cpuinfo while userspace apps will
>> see and consequently use all CPU features?
> 
> Anyone who wants to disable a feature from the kernel, specifically.
> 

Another option then?

>> So, I think it logically extends the existing behavior.
>> 
>>> 2. even on enabled hardware, only some of the bits are maskable.
>> 
>> The patch makes only words 0, 1, 4, 6 maskable, but words 3, 7, 8 are 
>> Linux-defined, words 2 and 5 are Transmeta-, Centaur-, etc- defined, and 
>> word 9 contains some bizarre Intel CPU features. Thus, it is words 0, 1, 4, 
>> 6 that contain useful information for most hardware models.
> 
> "Bizarre"?  New features, perhaps.

All right, new :-)

If the features are widely used, they'll provide a way of masking them either I 
guess.

AFAIK, Intel added CPUID faulting for their newest 'Ivy Bridge' models, which 
allows masking all CPUID functions. Unfortunately, I don't have such a CPU. But 
later this feature can be utilized and wired into the code.

Anyway, masking of at least some of the features would be better than lacking 
of the ability at all, wouldn't it? Another question whether the kernel should 
report errors/warnings if a particular feature can't be masked.

> 
>> If you ask about some Intel CPUs that can't mask CPUID function 0x8001, 
>> this function describes AMD-specific features, and I bet those Intel CPUs 
>> just don't have them at all and thus have nothing to mask.
> 
> Not quite.
> 
>   -hpa
> 
> 
> 
> -- 
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel.  I don't speak on their behalf.
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-24 Thread Vladimir Davydov


On 07/21/2012 02:37 PM, Borislav Petkov wrote:

(+ Andre who's been doing some cross vendor stuff)

On Fri, Jul 20, 2012 at 08:37:33PM +0400, Vladimir Davydov wrote:

If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be
reported in /proc/cpuinfo and used by the kernel. However, if a
userpsace process checks CPU features directly using the cpuid
instruction, it will be reported about all features supported by the CPU
irrespective of what features are cleared.

The patch makes the clearcpuid boot option not only clear CPU features
in kernel but also mask them in hardware for Intel and AMD CPUs that
support it so that the features cleared won't be reported even by the
cpuid instruction.

This can be useful for migration of virtual machines managed by
hypervisors that do not support/use Intel VT/AMD-V hardware-assisted
virtualization technology.

As they say in Star Wars: "I have a bad feeling about this."

So opening the floodgates to people fiddling with this (not only
migrators) makes me feel pretty uneasy. And I won't wonder if all of
a sudden strange failures start to appear because code is querying
cpuid features but some funny distro has disabled it in its kernel boot
options.

Or some other obscure case where the culprit is hidden in kernel command
line options.

If it were only needed for migration, then I'd say you guys can use
msr-tools and run a script as root on the target machine to which you
want to migrate to and toggle the feature bits you want.


If msr-tools are used for cpuid masking, we will either get 
inconsistency between /proc/cpuinfo:flags and the output of the cpuid 
instruction or have to "synchronize" the clearcpuid boot option and the 
userspace app using msr-tools, which seems to be inconvenient. So, IMO, 
it would be better to have such functionality implemented in the kernel.



I don't think cross vendor migration alone justifies having a generic
kernel feature like that.

Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-24 Thread Vladimir Davydov


On 07/24/2012 12:14 PM, Andre Przywara wrote:

On 07/24/2012 09:06 AM, Vladimir Davydov wrote:

On 07/21/2012 02:37 PM, Borislav Petkov wrote:

(+ Andre who's been doing some cross vendor stuff)

On Fri, Jul 20, 2012 at 08:37:33PM +0400, Vladimir Davydov wrote:

If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be
reported in /proc/cpuinfo and used by the kernel. However, if a
userpsace process checks CPU features directly using the cpuid
instruction, it will be reported about all features supported by the CPU
irrespective of what features are cleared.

The patch makes the clearcpuid boot option not only clear CPU features
in kernel but also mask them in hardware for Intel and AMD CPUs that
support it so that the features cleared won't be reported even by the
cpuid instruction.

This can be useful for migration of virtual machines managed by
hypervisors that do not support/use Intel VT/AMD-V hardware-assisted
virtualization technology.

But for this case you want it more fine-grained, say on a pre-process or
per-container level, right?
For hardware-assisted virtualization you simply don't need it, and for
Xen PV guests for instance this can be more safely done by the
hypervisor. I assume Parallels is similar in this respect, so you may
want to switch the MSRs on the guest's entry and exit by the VMM.
Also if you want to restrict a guest's CPUID features, you don't want to
do this at the guest's discretion, but better one level below where the
guest cannot revert this, right?


Actually I meant OS-level virtualization (no hypervisors) based on the 
linux cgroup subsystem and namespaces like OpenVZ or LXC . Although the 
latter does not have the container migration ability at present, there 
is a project that will hopefully allow this soon (criu.org). For such 
virtualization systems, per-kernel option would be enough because all 
guests share the same kernel.




In general I am not reluctant to have this feature with a sane
interface, but I simply don't see the usefulness of having it per kernel.
Also note that AFAIK this masking only helps with the basic CPUID
features, namely leaf 1 and 0x8001 for ECX and EDX. This does not
cover the more advanced features and not the new ones at leaf 7.


I guess that when the more advanced features become widely-used, vendors 
will offer new MSRs and/or CPUID faulting.



So opening the floodgates to people fiddling with this (not only
migrators) makes me feel pretty uneasy. And I won't wonder if all of
a sudden strange failures start to appear because code is querying
cpuid features but some funny distro has disabled it in its kernel boot
options.

Actually these "strange failures" would be a bug then. If CPUID is not
there, the feature is not there. Full stop. In the past we had had
already some trouble with people ignoring CPUID and stating some funny
things like: "Every XYZ processor has this feature."
If someone disables MCE, then on purpose. Let the code cope with it.

And Boris: I don't like this "majority of users" argument. If there is
some sense in this feature, why not have it (unless it significantly
hurts the code base)? Remember, this is Linux: If you want to shoot
yourself in the foot, we will not prevent you.

Regards,
Andre.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-24 Thread Vladimir Davydov


On 07/24/2012 02:10 PM, Borislav Petkov wrote:

On Tue, Jul 24, 2012 at 12:29:19PM +0400, Vladimir Davydov wrote:

I guess that when the more advanced features become widely-used,
vendors will offer new MSRs and/or CPUID faulting.

And this right there is the dealbreaker:

So what are you doing for cpus which have the advanced CPUID features
leafs but there are no MSRs to turn those bits off?


We have not encountered this situation in our environments and I hope we 
won't :-)


But look, these CPUID functions cover majority of CPU features, don't 
they? So, most of "normal" apps inside VM will survive migration. 
Perhaps, some low-level utils won't. I guess that's why there are no 
MSRs for other levels provided by vendors.



You surely need some software-only solution for the migration to work,
no?


Yes.


If so, why not apply that solution to your hypervisor without touching
the kernel at all?


In most hypervisor-based virtualization products, this is already 
implemented using VMM-exits, so that each VM can have arbitrary CPUID 
mask set by the admin.


The problem is that we have no hypervisor. "Virtualization" we want this 
feature for is based on cgroups and namespaces (examples are OpenVZ and 
mainstream LXC). Tasks are just grouped into virtual environments and 
share the same kernel, which is proved to be more memory usage efficient 
than traditional hypervisor-based approaches.



Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-24 Thread Vladimir Davydov


On 07/25/2012 04:57 AM, H. Peter Anvin wrote:

On 07/24/2012 04:09 AM, Vladimir Davydov wrote:

We have not encountered this situation in our environments and I hope we
won't :-)

But look, these CPUID functions cover majority of CPU features, don't
they? So, most of "normal" apps inside VM will survive migration.
Perhaps, some low-level utils won't. I guess that's why there are no
MSRs for other levels provided by vendors.


You will once Ivy Bridge becomes common.


Ivy Bridge has CPUID faulting, which allows masking all CPUID 
levels/functions.



-hpa



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/24/2012 04:34 PM, Andre Przywara wrote:

On 07/24/2012 01:09 PM, Vladimir Davydov wrote:

On 07/24/2012 02:10 PM, Borislav Petkov wrote:

On Tue, Jul 24, 2012 at 12:29:19PM +0400, Vladimir Davydov wrote:

I guess that when the more advanced features become widely-used,
vendors will offer new MSRs and/or CPUID faulting.

And this right there is the dealbreaker:

So what are you doing for cpus which have the advanced CPUID features
leafs but there are no MSRs to turn those bits off?

We have not encountered this situation in our environments and I hope we
won't :-)

But look, these CPUID functions cover majority of CPU features, don't
they? So, most of "normal" apps inside VM will survive migration.
Perhaps, some low-level utils won't. I guess that's why there are no
MSRs for other levels provided by vendors.

You have the new feature leaf at EAX=7. This contains things like BMI
and AVX2 and probably more upcoming features.
So you may be safe for a while, but you need a solution in the long run.


You surely need some software-only solution for the migration to work,
no?

Yes.


If so, why not apply that solution to your hypervisor without touching
the kernel at all?

In most hypervisor-based virtualization products, this is already
implemented using VMM-exits, so that each VM can have arbitrary CPUID
mask set by the admin.

The problem is that we have no hypervisor. "Virtualization" we want this
feature for is based on cgroups and namespaces (examples are OpenVZ and
mainstream LXC). Tasks are just grouped into virtual environments and
share the same kernel, which is proved to be more memory usage efficient
than traditional hypervisor-based approaches.

So for this single kernel approach I'd understand it that way:
1. You boot up the kernel on the host, it should detect and enable all
the features, say MCA.
2. After boot, you use /src/msr-tools/wrmsr to mask CPUID bits, again
MCA for instance or AVX/AES or the like.
Since the (host side of the) kernel already detected it, this does not
hurt the kernel features like MCA. But AVX will not be available to
applications running in the "host container", which is probably OK since
these are mostly management applications, right?
3. Then you start guests. The guest's libc will not detect the features
because of the MSR masking. All you need now is /proc/cpuinfo filtering
to make this bullet-proof, preferably through the container
functionality. I see that you do already massive sysfs filtering and
also /proc/  filtering, so this maybe an option?


Yes, we do filter /proc/cpuinfo output in our product (OpenVZ), but 
there is also the LXC project that is based completely on the mainstream 
kernel. LXC developers have not faced the cross-vendor migration problem 
yet because currently they don't have migration at all, but they will 
surely face it in future since the work on migration is in progress now 
(CRIU).


So, you prefer adding some filtering of /proc/cpuinfo into the 
mainstream kernel (not now, later, for LXC to work) instead of enabling 
clearcpuid boot option to mask CPUID features? IMO, the latter would 
look clearer.




This approach does not need any kernel support (except for the
/proc/cpuinfo filtering). Does this address the issues you have?

Regards,
Andre.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/24/2012 04:44 PM, Alan Cox wrote:

This approach does not need any kernel support (except for the
/proc/cpuinfo filtering). Does this address the issues you have?

You can do the /proc/cpuinfo filtering in user space too



How?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/25/2012 02:58 PM, Andre Przywara wrote:

On 07/25/2012 12:31 PM, Vladimir Davydov wrote:

On 07/24/2012 04:44 PM, Alan Cox wrote:

This approach does not need any kernel support (except for the
/proc/cpuinfo filtering). Does this address the issues you have?

You can do the /proc/cpuinfo filtering in user space too


How?

I wanted to write the same reply yesterday, but followed the hint in
Alan's previous mail:
# mount --bind /dev/shm/faked_cpuinfo /somepath/proc/cpuinfo

I checked it, it works even with chroots and is not visible from within.


If CPUs go online/offline?



Regards,
Andre.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/25/2012 02:43 PM, Borislav Petkov wrote:

On Wed, Jul 25, 2012 at 02:31:23PM +0400, Vladimir Davydov wrote:

So, you prefer adding some filtering of /proc/cpuinfo into the
mainstream kernel

That's already there right? And your 1/2 patch was making toggling those
bits easier.


(not now, later, for LXC to work) instead of enabling clearcpuid boot
option to mask CPUID features? IMO, the latter would look clearer.

Yes, but for reasons noted earlier, you cannot tweak all hardware CPUID
features as you want them.

So, having a software-only layer of tweaking /proc/cpuinfo or something
different you can come up with is the only option you have.

And even in that case, applications running in the container which
execute CPUID might fail in a strange manner when the corresponding
/proc/cpuinfo flag was cleared by you intentionally but the hardware
CPUID flag is there. In such situations, issues like that should
probably be sorted on a case-by-case basis I guess.

Thanks.



We've agreed that tweaking CPUID bits in kernel is not a good idea and 
it is better to think about virtualization of /proc/cpuinfo and using 
msr-tools.


Thank you for your time and feedback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/25/2012 03:17 PM, Andre Przywara wrote:

On 07/25/2012 01:02 PM, Vladimir Davydov wrote:

On 07/25/2012 02:58 PM, Andre Przywara wrote:

On 07/25/2012 12:31 PM, Vladimir Davydov wrote:

On 07/24/2012 04:44 PM, Alan Cox wrote:

This approach does not need any kernel support (except for the
/proc/cpuinfo filtering). Does this address the issues you have?

You can do the /proc/cpuinfo filtering in user space too


How?

I wanted to write the same reply yesterday, but followed the hint in
Alan's previous mail:
# mount --bind /dev/shm/faked_cpuinfo /somepath/proc/cpuinfo

I checked it, it works even with chroots and is not visible from within.

If CPUs go online/offline?

Do you support CPU offlining from within the guest? My OpenVZ guest only
has /sys/class and nothing else, so I cannot offline any CPU.

So you setup a "hand-crafted" cpuinfo for the guest and this should stay
valid for the whole guest's runtime, right?

And since it is a dumped file, "host" CPU off/onlining does not affect
it. Or do you want to propagate this to the guests?


A guest cannot have more CPUs than the host in container virtualization 
(at least in OpenVZ). So yes, we will have to propagate.


Anyway, we've agreed that you were right and are going to think about 
/proc/cpuinfo virtualization.


Thank you for your time and feedback.


(Sorry, but my thinking is more Xen/KVM oriented, where guests can only
do most things if they are explicitly allowed to do it and separation
between guests and host is much stricter).


Regards,
Andre.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/25/2012 03:31 PM, Alan Cox wrote:

On Wed, 25 Jul 2012 14:31:30 +0400
Vladimir Davydov  wrote:


On 07/24/2012 04:44 PM, Alan Cox wrote:

This approach does not need any kernel support (except for the
/proc/cpuinfo filtering). Does this address the issues you have?

You can do the /proc/cpuinfo filtering in user space too


How?

bind mount your own normal file over the top or a FUSE file


In general, we've agreed that wiring CPUID masking into the kernel was 
not a good idea.


We will try to virtualize /proc/cpuinfo as well as other /proc stuff in 
mainstream.


Thank you for your time and feedback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/25/2012 04:57 AM, H. Peter Anvin wrote:

On 07/24/2012 04:09 AM, Vladimir Davydov wrote:

We have not encountered this situation in our environments and I hope we
won't :-)

But look, these CPUID functions cover majority of CPU features, don't
they? So, most of "normal" apps inside VM will survive migration.
Perhaps, some low-level utils won't. I guess that's why there are no
MSRs for other levels provided by vendors.


You will once Ivy Bridge becomes common.

-hpa



You're right here. We will try to find other ways for sorting the 
migration problem out.


Thank you for your time and feedback.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features

2012-07-25 Thread Vladimir Davydov


On 07/20/2012 09:10 PM, Andi Kleen wrote:

+   unsigned int *msr_ext_cpuid_mask)
+{
+   unsigned int msr, msr_ext;
+
+   msr = msr_ext = 0;
+
+   switch (c->x86_model) {

You have to check the family too.


+
+   return msr;
+}
+
+static void __cpuinit intel_mask_cpu_caps(struct cpuinfo_x86 *c)
+{
+   u32 low, high;
+   unsigned int msr_cpuid_mask, msr_ext_cpuid_mask;
+
+   msr_cpuid_mask = intel_cpuid_mask_msr(c,&msr_ext_cpuid_mask);
+   if (msr_cpuid_mask) {
+   set_cpu_cap(c, X86_FEATURE_CPUIDMASK);
+
+   rdmsr(msr_cpuid_mask, low, high);

And please use rdmsrl(), rdmsr() is so 90ies

Other than that patch looks good. I presume it will be useful for
other things than just migration too.

-Andi



Thank you for review, but we've decided to drop this since the kernel 
does not seem to be a suitable place for such "tweaking".

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] block: account iowait time when waiting for completion of IO request

2013-02-14 Thread Vladimir Davydov

Using wait_for_completion() for waiting for a IO request to be executed
results in wrong iowait time accounting. For example, a system having
the only task doing write() and fdatasync() on a block device can be
reported being idle instead of iowaiting as it should because
blkdev_issue_flush() calls wait_for_completion() which in turn calls
schedule() that does not increment the iowait proc counter and thus does
not turn on iowait time accounting.

The patch makes block layer use wait_for_completion_io() instead of
wait_for_completion() where appropriate to account iowait time
correctly.

Signed-off-by: Vladimir Davydov 
---
 block/blk-exec.c  |4 ++--
 block/blk-flush.c |2 +-
 block/blk-lib.c   |6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 74638ec..f634de7 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -120,9 +120,9 @@ int blk_execute_rq(struct request_queue *q, struct gendisk 
*bd_disk,
/* Prevent hang_check timer from firing at us during very long I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (hang_check)
-   while (!wait_for_completion_timeout(&wait, hang_check * 
(HZ/2)));
+   while (!wait_for_completion_io_timeout(&wait, hang_check * 
(HZ/2)));
else
-   wait_for_completion(&wait);
+   wait_for_completion_io(&wait);
 
if (rq->errors)
err = -EIO;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 720ad60..db8f1b5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -436,7 +436,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t 
gfp_mask,
 
bio_get(bio);
submit_bio(WRITE_FLUSH, bio);
-   wait_for_completion(&wait);
+   wait_for_completion_io(&wait);
 
/*
 * The driver must store the error location in ->bi_sector, if
diff --git a/block/blk-lib.c b/block/blk-lib.c
index b3a1f2b..d6f50d5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -126,7 +126,7 @@ int blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 
/* Wait for bios in-flight */
if (!atomic_dec_and_test(&bb.done))
-   wait_for_completion(&wait);
+   wait_for_completion_io(&wait);
 
if (!test_bit(BIO_UPTODATE, &bb.flags))
ret = -EIO;
@@ -200,7 +200,7 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
 
/* Wait for bios in-flight */
if (!atomic_dec_and_test(&bb.done))
-   wait_for_completion(&wait);
+   wait_for_completion_io(&wait);
 
if (!test_bit(BIO_UPTODATE, &bb.flags))
ret = -ENOTSUPP;
@@ -262,7 +262,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, 
sector_t sector,
 
/* Wait for bios in-flight */
if (!atomic_dec_and_test(&bb.done))
-   wait_for_completion(&wait);
+   wait_for_completion_io(&wait);
 
if (!test_bit(BIO_UPTODATE, &bb.flags))
/* One of bios in the batch was completed with error.*/
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] sched: add wait_for_completion_io[_timeout]

2013-02-14 Thread Vladimir Davydov

The only difference between wait_for_completion[_timeout]() and
wait_for_completion_io[_timeout]() is that the latter calls
io_schedule_timeout() instead of schedule_timeout() so that the caller
is accounted as waiting for IO, not just sleeping.

These functions can be used for correct iowait time accounting when the
completion struct is actually used for waiting for IO (e.g. completion
of a bio request in the block layer).

Signed-off-by: Vladimir Davydov 
---
 include/linux/completion.h |3 ++
 kernel/sched/core.c|   57 
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index 51494e6..33f0280 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -77,10 +77,13 @@ static inline void init_completion(struct completion *x)
 }
 
 extern void wait_for_completion(struct completion *);
+extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
   unsigned long timeout);
+extern unsigned long wait_for_completion_io_timeout(struct completion *x,
+   unsigned long timeout);
 extern long wait_for_completion_interruptible_timeout(
struct completion *x, unsigned long timeout);
 extern long wait_for_completion_killable_timeout(
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..e0396dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3268,7 +3268,8 @@ void complete_all(struct completion *x)
 EXPORT_SYMBOL(complete_all);
 
 static inline long __sched
-do_wait_for_common(struct completion *x, long timeout, int state)
+do_wait_for_common(struct completion *x,
+  long (*action)(long), long timeout, int state)
 {
if (!x->done) {
DECLARE_WAITQUEUE(wait, current);
@@ -3281,7 +3282,7 @@ do_wait_for_common(struct completion *x, long timeout, 
int state)
}
__set_current_state(state);
spin_unlock_irq(&x->wait.lock);
-   timeout = schedule_timeout(timeout);
+   timeout = action(timeout);
spin_lock_irq(&x->wait.lock);
} while (!x->done && timeout);
__remove_wait_queue(&x->wait, &wait);
@@ -3292,17 +3293,30 @@ do_wait_for_common(struct completion *x, long timeout, 
int state)
return timeout ?: 1;
 }
 
-static long __sched
-wait_for_common(struct completion *x, long timeout, int state)
+static inline long __sched
+__wait_for_common(struct completion *x,
+ long (*action)(long), long timeout, int state)
 {
might_sleep();
 
spin_lock_irq(&x->wait.lock);
-   timeout = do_wait_for_common(x, timeout, state);
+   timeout = do_wait_for_common(x, action, timeout, state);
spin_unlock_irq(&x->wait.lock);
return timeout;
 }
 
+static long __sched
+wait_for_common(struct completion *x, long timeout, int state)
+{
+   return __wait_for_common(x, schedule_timeout, timeout, state);
+}
+
+static long __sched
+wait_for_common_io(struct completion *x, long timeout, int state)
+{
+   return __wait_for_common(x, io_schedule_timeout, timeout, state);
+}
+
 /**
  * wait_for_completion: - waits for completion of a task
  * @x:  holds the state of this particular completion
@@ -3339,6 +3353,39 @@ wait_for_completion_timeout(struct completion *x, 
unsigned long timeout)
 EXPORT_SYMBOL(wait_for_completion_timeout);
 
 /**
+ * wait_for_completion_io: - waits for completion of a task
+ * @x:  holds the state of this particular completion
+ *
+ * This waits to be signaled for completion of a specific task. It is NOT
+ * interruptible and there is no timeout. The caller is accounted as waiting
+ * for IO.
+ */
+void __sched wait_for_completion_io(struct completion *x)
+{
+   wait_for_common_io(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(wait_for_completion_io);
+
+/**
+ * wait_for_completion_io_timeout: - waits for completion of a task (w/timeout)
+ * @x:  holds the state of this particular completion
+ * @timeout:  timeout value in jiffies
+ *
+ * This waits for either a completion of a specific task to be signaled or for 
a
+ * specified timeout to expire. The timeout is in jiffies. It is not
+ * interruptible. The caller is accounted as waiting for IO.
+ *
+ * The return value is 0 if timed out, and positive (at least 1, or number of
+ * jiffies left till timeout) if completed.
+ */
+unsigned long __sched
+wait_for_completion_io_timeout(struct completion *x, unsigned long timeout)
+{
+   return wait_for_common_io

Re: [PATCH] update sc->nr_reclaimed after each shrink_slab

2016-07-22 Thread Vladimir Davydov

On Fri, Jul 22, 2016 at 09:49:13AM +0200, Michal Hocko wrote:
> On Fri 22-07-16 11:43:30, Zhou Chengming wrote:
> > In !global_reclaim(sc) case, we should update sc->nr_reclaimed after each
> > shrink_slab in the loop. Because we need the correct sc->nr_reclaimed
> > value to see if we can break out.
> 
> Does this actually change anything? Maybe I am missing something but
> try_to_free_mem_cgroup_pages which is the main entry for the memcg
> reclaim doesn't set reclaim_state. I don't remember why... Vladimir?

We don't set reclaim_state on memcg reclaim, because there might be a
lot of unrelated slab objects freed from the interrupt context (e.g.
RCU freed) while we're doing memcg reclaim. Obviously, we don't want
them to contribute to nr_reclaimed.

Link to the thread with the problem discussion:

  http://marc.info/?l=linux-kernel&m=142132698209680&w=2

Re: [PATCH v2] mm: oom: deduplicate victim selection code for memcg and global oom

2016-07-23 Thread Vladimir Davydov

On Thu, Jul 21, 2016 at 08:41:44AM -0400, Johannes Weiner wrote:
> On Mon, Jun 27, 2016 at 07:39:54PM +0300, Vladimir Davydov wrote:
> > When selecting an oom victim, we use the same heuristic for both memory
> > cgroup and global oom. The only difference is the scope of tasks to
> > select the victim from. So we could just export an iterator over all
> > memcg tasks and keep all oom related logic in oom_kill.c, but instead we
> > duplicate pieces of it in memcontrol.c reusing some initially private
> > functions of oom_kill.c in order to not duplicate all of it. That looks
> > ugly and error prone, because any modification of select_bad_process
> > should also be propagated to mem_cgroup_out_of_memory.
> > 
> > Let's rework this as follows: keep all oom heuristic related code
> > private to oom_kill.c and make oom_kill.c use exported memcg functions
> > when it's really necessary (like in case of iterating over memcg tasks).
> 
> This approach, with the control flow in the OOM code, makes a lot of
> sense to me. I think it's particularly useful in preparation for
> supporting cgroup-aware OOM killing, where not just individual tasks
> but entire cgroups are evaluated and killed as opaque memory units.

Yeah, that too. Also, this patch can be thought of as a preparation step
for unified oom locking and oom timeouts (provided we ever agree to add
them). Currently, there's some code in memcg trying to implement proper
locking that would allow running oom in parallel in different cgroups
and wait until memory is actually freed instead of looping and retrying
reclaim. I think it could be reused for global case, although it's going
to be tricky as we need to support legacy cgroup oom control api.

> 
> I'm thinking about doing something like the following, which should be
> able to work regardless on what cgroup level - root, intermediate, or
> leaf node - the OOM killer is invoked, and this patch works toward it:
> 
> struct oom_victim {
> bool is_memcg;
> union {
> struct task_struct *task;
> struct mem_cgroup *memcg;
> } entity;
> unsigned long badness;
> };
> 
> oom_evaluate_memcg(oc, memcg, victim)
> {
> if (memcg == root) {
> for_each_memcg_process(p, memcg) {
> badness = oom_badness(oc, memcg, p);
> if (badness == some_special_value) {
> ...
> } else if (badness > victim->badness) {
>   victim->is_memcg = false;
>   victim->entity.task = p;
>   victim->badness = badness;
>   }
> }
> } else {
> badness = 0;
> for_each_memcg_process(p, memcg) {
> b = oom_badness(oc, memcg, p);
> if (b == some_special_value)
> ...
> else
> badness += b;
> }
> if (badness > victim.badness)
> victim->is_memcg = true;
>   victim->entity.memcg = memcg;
>   victim->badness = badness;

Yeah, that makes sense. However, I don't think we should always kill the
whole cgroup, even if it's badness is highest. IMO what should be killed
- cgroup or task - depends on the workload running inside the container.
Some workloads (e.g. those that fork often) can put up with youngest of
their tasks getting oom-killed, others will just get stuck if one of the
workers is killed - for them we'd better kill the whole container. I
guess we could introduce a per cgroup tunable which would define oom
behavior - whether the whole cgroup should be killed on oom or just one
task/sub-cgroup in the cgroup.

>   }
> }
> }
> 
> oom()
> {
> struct oom_victim victim = {
> .badness = 0,
> };
> 
> for_each_mem_cgroup_tree(memcg, oc->memcg)
> oom_evaluate_memcg(oc, memcg, &victim);
> 
> if (!victim.badness && !is_sysrq_oom(oc)) {
> dump_header(oc, NULL);
> panic("Out of memory and no killable processes...\n");
> }
> 
> if (victim.badness != -1) {
> oom_kill_victim(oc, &victim);
> schedule_timeout_killable(1);
> }
> 
> return true;
> }
> 
> But even without that, with the unification of two identical control
> flows and the privatization of a good amount of oom killer internals,
> the patch speaks for itself.
>   
> > Signed-off-by: Vladimir Davydov 
> 
> Acked-by: Johannes Weiner 

Thanks!

Re: [PATCH 1/3] mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting

2016-09-19 Thread Vladimir Davydov

On Wed, Sep 14, 2016 at 03:48:44PM -0400, Johannes Weiner wrote:
> From: Johannes Weiner 
> 
> During cgroup2 rollout into production, we started encountering css
> refcount underflows and css access crashes in the memory controller.
> Splitting the heavily shared css reference counter into logical users
> narrowed the imbalance down to the cgroup2 socket memory accounting.
> 
> The problem turns out to be the per-cpu charge cache. Cgroup1 had a
> separate socket counter, but the new cgroup2 socket accounting goes
> through the common charge path that uses a shared per-cpu cache for
> all memory that is being tracked. Those caches are safe against
> scheduling preemption, but not against interrupts - such as the newly
> added packet receive path. When cache draining is interrupted by
> network RX taking pages out of the cache, the resuming drain operation
> will put references of in-use pages, thus causing the imbalance.
> 
> Disable IRQs during all per-cpu charge cache operations.
> 
> Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified 
> hierarchy memory controller")
> Cc:  # 4.5+
> Signed-off-by: Johannes Weiner 

Acked-by: Vladimir Davydov

Re: [PATCH 2/3] cgroup: duplicate cgroup reference when cloning sockets

2016-09-19 Thread Vladimir Davydov

On Wed, Sep 14, 2016 at 03:48:45PM -0400, Johannes Weiner wrote:
> From: Johannes Weiner 
> 
> When a socket is cloned, the associated sock_cgroup_data is duplicated
> but not its reference on the cgroup. As a result, the cgroup reference
> count will underflow when both sockets are destroyed later on.
> 
> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
> Cc:  # 4.5+
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov

Re: [PATCH v3 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks

2016-06-21 Thread Vladimir Davydov

On Mon, Jun 20, 2016 at 04:43:34PM -0700, Andy Lutomirski wrote:
> Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a
> zone.  This only makes sense if each kernel stack exists entirely in
> one zone, and allowing vmapped stacks could break this assumption.
> 
> Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
> allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on
> all architectures.  Keep it simple and use KiB.
> 
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: linux...@kvack.org
> Signed-off-by: Andy Lutomirski 

Reviewed-by: Vladimir Davydov

Re: [PATCH v3 05/13] mm: Fix memcg stack accounting for sub-page stacks

2016-06-21 Thread Vladimir Davydov

On Mon, Jun 20, 2016 at 04:43:35PM -0700, Andy Lutomirski wrote:
> We should account for stacks regardless of stack size, and we need
> to account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the
> units to kilobytes and Move it into account_kernel_stack().
> 
> Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 
> memory.stat")
> Cc: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: linux...@kvack.org
> Signed-off-by: Andy Lutomirski 

Reviewed-by: Vladimir Davydov 

This patch is going to have a minor conflict with recent changes in
mmotm, where {alloc,free}_kmem_pages were dropped, The conflict should
be trivial to resolve - we only need to replace {alloc,free}_kmem_pages
with {alloc,free}_pages in this patch.

> ---
>  include/linux/memcontrol.h |  2 +-
>  kernel/fork.c  | 15 ++-
>  mm/memcontrol.c|  2 +-
>  3 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a805474df4ab..3b653b86bb8f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,7 @@ enum mem_cgroup_stat_index {
>   MEM_CGROUP_STAT_SWAP,   /* # of pages, swapped out */
>   MEM_CGROUP_STAT_NSTATS,
>   /* default hierarchy stats */
> - MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
> + MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS,
>   MEMCG_SLAB_RECLAIMABLE,
>   MEMCG_SLAB_UNRECLAIMABLE,
>   MEMCG_SOCK,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index be7f006af727..ff3c41c2ba96 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -165,20 +165,12 @@ static struct thread_info 
> *alloc_thread_info_node(struct task_struct *tsk,
>   struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
> THREAD_SIZE_ORDER);
>  
> - if (page)
> - memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> - 1 << THREAD_SIZE_ORDER);
> -
>   return page ? page_address(page) : NULL;
>  }
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> - struct page *page = virt_to_page(ti);
> -
> - memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK,
> - -(1 << THREAD_SIZE_ORDER));
> - __free_kmem_pages(page, THREAD_SIZE_ORDER);
> + free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, 
> int account)
>  
>   mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
>   THREAD_SIZE / 1024 * account);
> +
> + /* All stack pages belong to the same memcg. */
> + memcg_kmem_update_page_stat(
> + virt_to_page(ti), MEMCG_KERNEL_STACK_KB,
> + account * (THREAD_SIZE / 1024));
>  }
>  
>  void free_task(struct task_struct *tsk)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..8e13a2419dad 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
>   seq_printf(m, "file %llu\n",
>  (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE);
>   seq_printf(m, "kernel_stack %llu\n",
> -(u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE);
> +(u64)stat[MEMCG_KERNEL_STACK_KB] * 1024);
>   seq_printf(m, "slab %llu\n",
>  (u64)(stat[MEMCG_SLAB_RECLAIMABLE] +
>stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);

Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs

2016-06-21 Thread Vladimir Davydov

On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
> 
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
> 
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
> 
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
> 
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
> 
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
> 
> set -e
> mkdir -p pages
> for x in `seq 128000`; do
>   [ $((x % 1000)) -eq 0 ] && echo $x
>   mkdir /cgroup/foo
>   echo $$ >/cgroup/foo/cgroup.procs
>   echo trex >pages/$x
>   echo $$ >/cgroup/cgroup.procs
>   rmdir /cgroup/foo
> done
> 
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
> 
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
> 
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.

With 65K cgroups it will take the reclaimer a substantial amount of time
to iterate over all of them, which might result in latency spikes.
Probably, to avoid that, we could move pages from a dead cgroup's lru to
its parent's one on offline while still leaving dead cgroups pinned,
like we do in case of list_lru entries.

> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Vladimir Davydov 

One nit below.

...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..dc92b2df2585 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
>   { },/* terminate */
>  };
>  
> +/*
> + * Private memory cgroup IDR
> + *
> + * Swap-out records and page cache shadow entries need to store memcg
> + * references in constrained space, so we maintain an ID space that is
> + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
> + * memory-controlled cgroups to 64k.
> + *
> + * However, there usually are many references to the oflline CSS after
> + * the cgroup has been destroyed, such as page cache or reclaimable
> + * slab objects, that don't need to hang on to the ID. We want to keep
> + * those dead CSS from occupying IDs, or we might quickly exhaust the
> + * relatively small ID space and prevent the creation of new cgroups
> + * even when there are much fewer than 64k cgroups - possibly none.
> + *
> + * Maintain a private 16-bit ID space for memcg, and allow the ID to
> + * be freed and recycled when it's no longer needed, which is usually
> + * when the CSS is offlined.
> + *
> + * The only exception to that are records of swapped out tmpfs/shmem
> + * pages that need to be attributed to live ancestors on swapin. But
> + * those references are manageable from userspace.
> + */
> +
> +static struct idr mem_cgroup_idr;

static DEFINE_IDR(mem_cgroup_idr);

Re: [PATCH] memcg: mem_cgroup_migrate() may be called with irq disabled

2016-06-21 Thread Vladimir Davydov

On Mon, Jun 20, 2016 at 02:41:58PM -0400, Tejun Heo wrote:
> mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
> with irq disabled from migrate_page_copy().  This ends up enabling irq
> while holding a irq context lock triggering the following lockdep
> warning.  Fix it by using irq_save/restore instead.
> 
>   =
>   [ INFO: inconsistent lock state ]
>   4.7.0-rc1+ #52 Tainted: GW  
>   -
>   inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>   kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>(&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] 
> aio_migratepage+0x156/0x1e8
>   {IN-SOFTIRQ-W} state was registered at:
> [<001a8366>] __lock_acquire+0x5b6/0x1930
> [<001a9b9e>] lock_acquire+0xee/0x270
> [<00951fee>] _raw_spin_lock_irqsave+0x66/0xb0
> [<00390108>] aio_complete+0x98/0x328
> [<0037c7d4>] dio_complete+0xe4/0x1e0
> [<00650e64>] blk_update_request+0xd4/0x450
> [<0072a1a8>] scsi_end_request+0x48/0x1c8
> [<0072d7e2>] scsi_io_completion+0x272/0x698
> [<0065adb2>] blk_done_softirq+0xca/0xe8
> [<00953f80>] __do_softirq+0xc8/0x518
> [<001495de>] irq_exit+0xee/0x110
> [<0010ceba>] do_IRQ+0x6a/0x88
> [<0095342e>] io_int_handler+0x11a/0x25c
> [<0094fb5c>] __mutex_unlock_slowpath+0x144/0x1d8
> [<0094fb58>] __mutex_unlock_slowpath+0x140/0x1d8
> [<003c6114>] kernfs_iop_permission+0x64/0x80
> [<0033ba86>] __inode_permission+0x9e/0xf0
> [<0033ea96>] link_path_walk+0x6e/0x510
> [<0033f09c>] path_lookupat+0xc4/0x1a8
> [<0034195c>] filename_lookup+0x9c/0x160
> [<00341b44>] user_path_at_empty+0x5c/0x70
> [<00335250>] SyS_readlinkat+0x68/0x140
> [<00952f8e>] system_call+0xd6/0x270
>   irq event stamp: 971410
>   hardirqs last  enabled at (971409): [<0030f982>] 
> migrate_page_move_mapping+0x3ea/0x588
>   hardirqs last disabled at (971410): [<00951fc4>] 
> _raw_spin_lock_irqsave+0x3c/0xb0
>   softirqs last  enabled at (970526): [<00954318>] 
> __do_softirq+0x460/0x518
>   softirqs last disabled at (970519): [<001495de>] irq_exit+0xee/0x110
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>CPU0
>
> lock(&(&ctx->completion_lock)->rlock);
> 
>   lock(&(&ctx->completion_lock)->rlock);
> 
> *** DEADLOCK ***
> 
>   3 locks held by kcompactd0/151:
>#0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at: [<0038fc82>] 
> aio_migratepage+0x42/0x1e8
>#1:  (&ctx->ring_lock){+.+.+.}, at: [<0038fc9a>] 
> aio_migratepage+0x5a/0x1e8
>#2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] 
> aio_migratepage+0x156/0x1e8
> 
>   stack backtrace:
>   CPU: 20 PID: 151 Comm: kcompactd0 Tainted: GW   4.7.0-rc1+ #52
>0001c6cbb730 0001c6cbb7c0 0002  
>0001c6cbb860 0001c6cbb7d8 0001c6cbb7d8 00114496 
> 00b517ec 00b680b6 000b 
>0001c6cbb820 0001c6cbb7c0   
>04000184ad18 00114496 0001c6cbb7c0 0001c6cbb820 
>   Call Trace:
>   ([<001143d2>] show_trace+0xea/0xf0)
>   ([<0011444a>] show_stack+0x72/0xf0)
>   ([<00684522>] dump_stack+0x9a/0xd8)
>   ([<0028679c>] print_usage_bug.part.27+0x2d4/0x2e8)
>   ([<001a71ce>] mark_lock+0x17e/0x758)
>   ([<001a784a>] mark_held_locks+0xa2/0xd0)
>   ([<001a79b8>] trace_hardirqs_on_caller+0x140/0x1c0)
>   ([<00326026>] mem_cgroup_migrate+0x266/0x370)
>   ([<0038fdaa>] aio_migratepage+0x16a/0x1e8)
>   ([<00310568>] move_to_new_page+0xb0/0x260)
>   ([<003111b4>] migrate_pages+0x8f4/0x9f0)
>   ([<002c507c>] compact_zone+0x4dc/0xdc8)
>   ([<002c5e22>] kcompactd_do_work+0x1aa/0x358)
>   ([<002c608a>] kcompactd+0xba/0x2c8)
>   ([<0016b09a>] kthread+0x10a/0x110)
>   ([<0095315a>] kernel_thread_starter+0x6/0xc)
>   ([<00953154>] kernel_thread_starter+0x0/0xc)
>   INFO: lockdep is turned off.
> 
> Signed-off-by: Tejun Heo 
> Reported-by: Christian Borntraeger 
> Link: http://lkml.kernel.org/g/5767cfe5.7080...@de.ibm.com

Reviewed-by: Vladimir Davydov

[PATCH] mm: vmscan: fix memcg-aware shrinkers not called on global reclaim

2016-08-01 Thread Vladimir Davydov

We must call shrink_slab() for each memory cgroup on both global and
memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
changed that so that now shrink_slab() is only called with memcg != NULL
on memcg reclaim. As a result, memcg-aware shrinkers (including
dentry/inode) are never invoked on global reclaim. Fix that.

Fixes: d71df22b55099 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: Vladimir Davydov 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 650d26832569..374d95d04178 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2561,7 +2561,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;
 
-   if (!global_reclaim(sc))
+   if (memcg)
shrink_slab(sc->gfp_mask, pgdat->node_id,
memcg, sc->nr_scanned - scanned,
lru_pages);
-- 
2.1.4

[PATCH 2/3] mm: memcontrol: fix memcg id ref counter on swap charge move

2016-08-01 Thread Vladimir Davydov

Since commit 73f576c04b941 swap entries do not pin memcg->css.refcnt
directly. Instead, they pin memcg->id.ref. So we should adjust the
reference counters accordingly when moving swap charges between cgroups.

Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many 
small jobs")
Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fe285f27ea7..58c229071fb1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4030,9 +4030,9 @@ static struct cftype mem_cgroup_legacy_files[] = {
 
 static DEFINE_IDR(mem_cgroup_idr);
 
-static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n)
 {
-   atomic_inc(&memcg->id.ref);
+   atomic_add(n, &memcg->id.ref);
 }
 
 static struct mem_cgroup *mem_cgroup_id_get_active(struct mem_cgroup *memcg)
@@ -4042,9 +4042,9 @@ static struct mem_cgroup *mem_cgroup_id_get_active(struct 
mem_cgroup *memcg)
return memcg;
 }
 
-static void mem_cgroup_id_put(struct mem_cgroup *memcg)
+static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
-   if (atomic_dec_and_test(&memcg->id.ref)) {
+   if (atomic_sub_and_test(n, &memcg->id.ref)) {
idr_remove(&mem_cgroup_idr, memcg->id.id);
memcg->id.id = 0;
 
@@ -4053,6 +4053,16 @@ static void mem_cgroup_id_put(struct mem_cgroup *memcg)
}
 }
 
+static inline void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+   mem_cgroup_id_get_many(memcg, 1);
+}
+
+static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
+{
+   mem_cgroup_id_put_many(memcg, 1);
+}
+
 /**
  * mem_cgroup_from_id - look up a memcg from a memcg id
  * @id: the memcg id to look up
@@ -4687,6 +4697,8 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.from))
page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
 
+   mem_cgroup_id_put_many(mc.from, mc.moved_swap);
+
/*
 * we charged both to->memory and to->memsw, so we
 * should uncharge to->memory.
@@ -4694,9 +4706,9 @@ static void __mem_cgroup_clear_mc(void)
if (!mem_cgroup_is_root(mc.to))
page_counter_uncharge(&mc.to->memory, mc.moved_swap);
 
-   css_put_many(&mc.from->css, mc.moved_swap);
+   mem_cgroup_id_get_many(mc.to, mc.moved_swap);
+   css_put_many(&mc.to->css, mc.moved_swap);
 
-   /* we've already done css_get(mc.to) */
mc.moved_swap = 0;
}
memcg_oom_recover(from);
-- 
2.1.4

[PATCH 1/3] mm: memcontrol: fix swap counter leak on swapout from offline cgroup

2016-08-01 Thread Vladimir Davydov

An offline memory cgroup might have anonymous memory or shmem left
charged to it and no swap. Since only swap entries pin the id of an
offline cgroup, such a cgroup will have no id and so an attempt to
swapout its anon/shmem will not store memory cgroup info in the swap
cgroup map. As a result, memcg->swap or memcg->memsw will never get
uncharged from it and any of its ascendants.

Fix this by always charging swapout to the first ancestor cgroup that
hasn't released its id yet.

Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many 
small jobs")
Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 27 +--
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b5804e4e6324..5fe285f27ea7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4035,6 +4035,13 @@ static void mem_cgroup_id_get(struct mem_cgroup *memcg)
atomic_inc(&memcg->id.ref);
 }
 
+static struct mem_cgroup *mem_cgroup_id_get_active(struct mem_cgroup *memcg)
+{
+   while (!atomic_inc_not_zero(&memcg->id.ref))
+   memcg = parent_mem_cgroup(memcg);
+   return memcg;
+}
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
if (atomic_dec_and_test(&memcg->id.ref)) {
@@ -5751,7 +5758,7 @@ subsys_initcall(mem_cgroup_init);
  */
 void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
-   struct mem_cgroup *memcg;
+   struct mem_cgroup *memcg, *swap_memcg;
unsigned short oldid;
 
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -5766,15 +5773,20 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
if (!memcg)
return;
 
-   mem_cgroup_id_get(memcg);
-   oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
+   swap_memcg = mem_cgroup_id_get_active(memcg);
+   oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg));
VM_BUG_ON_PAGE(oldid, page);
-   mem_cgroup_swap_statistics(memcg, true);
+   mem_cgroup_swap_statistics(swap_memcg, true);
 
page->mem_cgroup = NULL;
 
if (!mem_cgroup_is_root(memcg))
page_counter_uncharge(&memcg->memory, 1);
+   if (memcg != swap_memcg) {
+   if (!mem_cgroup_is_root(swap_memcg))
+   page_counter_charge(&swap_memcg->memsw, 1);
+   page_counter_uncharge(&memcg->memsw, 1);
+   }
 
/*
 * Interrupts should be disabled here because the caller holds the
@@ -5814,11 +5826,14 @@ int mem_cgroup_try_charge_swap(struct page *page, 
swp_entry_t entry)
if (!memcg)
return 0;
 
+   memcg = mem_cgroup_id_get_active(memcg);
+
if (!mem_cgroup_is_root(memcg) &&
-   !page_counter_try_charge(&memcg->swap, 1, &counter))
+   !page_counter_try_charge(&memcg->swap, 1, &counter)) {
+   mem_cgroup_id_put(memcg);
return -ENOMEM;
+   }
 
-   mem_cgroup_id_get(memcg);
oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
VM_BUG_ON_PAGE(oldid, page);
mem_cgroup_swap_statistics(memcg, true);
-- 
2.1.4

Re: [PATCH] memcg: put soft limit reclaim out of way if the excess tree is empty

2016-08-01 Thread Vladimir Davydov

On Mon, Aug 01, 2016 at 12:00:21PM +0200, Michal Hocko wrote:
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c265212bec8c..eb7e39c2d948 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2543,6 +2543,11 @@ static int mem_cgroup_resize_memsw_limit(struct 
> mem_cgroup *memcg,
>   return ret;
>  }
>  
> +static inline bool soft_limit_tree_empty(struct mem_cgroup_tree_per_node 
> *mctz)
> +{
> + return rb_last(&mctz->rb_root) == NULL;
> +}
> +

I don't think traversing rb tree as rb_last() does w/o holding the lock
is a good idea. Why is RB_EMPTY_ROOT() insufficient here?

[PATCH] radix-tree: account nodes to memcg only if explicitly requested

2016-08-01 Thread Vladimir Davydov

Radix trees may be used not only for storing page cache pages, so
unconditionally accounting radix tree nodes to the current memory cgroup
is bad: if a radix tree node is used for storing data shared among
different cgroups we risk pinning dead memory cgroups forever. So let's
only account radix tree nodes if it was explicitly requested by passing
__GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to account
page cache entries, so mark mapping->page_tree so.

Signed-off-by: Vladimir Davydov 
---
 fs/inode.c   |  2 +-
 lib/radix-tree.c | 14 ++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 559a9da25237..1d04dab5211c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -345,7 +345,7 @@ EXPORT_SYMBOL(inc_nlink);
 void address_space_init_once(struct address_space *mapping)
 {
memset(mapping, 0, sizeof(*mapping));
-   INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
+   INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
spin_lock_init(&mapping->tree_lock);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 61b8fb529cef..1b7bf7314141 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -277,10 +277,11 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 
/*
 * Even if the caller has preloaded, try to allocate from the
-* cache first for the new node to get accounted.
+* cache first for the new node to get accounted to the memory
+* cgroup.
 */
ret = kmem_cache_alloc(radix_tree_node_cachep,
-  gfp_mask | __GFP_ACCOUNT | __GFP_NOWARN);
+  gfp_mask | __GFP_NOWARN);
if (ret)
goto out;
 
@@ -303,8 +304,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
kmemleak_update_trace(ret);
goto out;
}
-   ret = kmem_cache_alloc(radix_tree_node_cachep,
-  gfp_mask | __GFP_ACCOUNT);
+   ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 out:
BUG_ON(radix_tree_is_internal_node(ret));
return ret;
@@ -351,6 +351,12 @@ static int __radix_tree_preload(gfp_t gfp_mask, int nr)
struct radix_tree_node *node;
int ret = -ENOMEM;
 
+   /*
+* Nodes preloaded by one cgroup can be be used by another cgroup, so
+* they should never be accounted to any particular memory cgroup.
+*/
+   gfp_mask &= ~__GFP_ACCOUNT;
+
preempt_disable();
rtp = this_cpu_ptr(&radix_tree_preloads);
while (rtp->nr < nr) {
-- 
2.1.4

[PATCH v3] mm: oom: deduplicate victim selection code for memcg and global oom

2016-08-01 Thread Vladimir Davydov

When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom. The only difference is the scope of tasks to
select the victim from. So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead we
duplicate pieces of it in memcontrol.c reusing some initially private
functions of oom_kill.c in order to not duplicate all of it. That looks
ugly and error prone, because any modification of select_bad_process
should also be propagated to mem_cgroup_out_of_memory.

Let's rework this as follows: keep all oom heuristic related code
private to oom_kill.c and make oom_kill.c use exported memcg functions
when it's really necessary (like in case of iterating over memcg tasks).

Signed-off-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
---
Changes in v3:
 - rebase on top of v4.7-mmotm-2016-07-28-16-33
 - add ack by Johannes

 include/linux/memcontrol.h |  15 
 include/linux/oom.h|  43 +-
 mm/memcontrol.c| 114 ++
 mm/oom_kill.c  | 200 -
 4 files changed, 167 insertions(+), 205 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5d8ca6e02e39..0710143723bc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -366,6 +366,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
   struct mem_cgroup *,
   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+int mem_cgroup_scan_tasks(struct mem_cgroup *,
+ int (*)(struct task_struct *, void *), void *);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -446,6 +448,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec 
*lruvec, enum lru_list lru)
 
 void mem_cgroup_handle_over_high(void);
 
+unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg);
+
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
 
@@ -639,6 +643,12 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup 
*root,
 {
 }
 
+static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+   int (*fn)(struct task_struct *, void *), void *arg)
+{
+   return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
return 0;
@@ -669,6 +679,11 @@ mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
return 0;
 }
 
+static inline unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
+{
+   return 0;
+}
+
 static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5bc0457ee3a8..17946e5121b6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -34,23 +34,11 @@ struct oom_control {
 * for display purposes.
 */
const int order;
-};
 
-/*
- * Types of limitations to the nodes from which allocations may occur
- */
-enum oom_constraint {
-   CONSTRAINT_NONE,
-   CONSTRAINT_CPUSET,
-   CONSTRAINT_MEMORY_POLICY,
-   CONSTRAINT_MEMCG,
-};
-
-enum oom_scan_t {
-   OOM_SCAN_OK,/* scan thread and find its badness */
-   OOM_SCAN_CONTINUE,  /* do not consider thread for oom kill */
-   OOM_SCAN_ABORT, /* abort the iteration and return */
-   OOM_SCAN_SELECT,/* always select this thread first */
+   /* Used by oom implementation, do not set */
+   unsigned long totalpages;
+   struct task_struct *chosen;
+   unsigned long chosen_points;
 };
 
 extern struct mutex oom_lock;
@@ -70,30 +58,10 @@ static inline bool oom_task_origin(const struct task_struct 
*p)
return p->signal->oom_flag_origin;
 }
 
-extern void mark_oom_victim(struct task_struct *tsk);
-
-#ifdef CONFIG_MMU
-extern void wake_oom_reaper(struct task_struct *tsk);
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
-}
-#endif
-
 extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
unsigned long totalpages);
 
-extern void oom_kill_process(struct oom_control *oc, struct task_struct *p,
-unsigned int points, unsigned long totalpages,
-const char *message);
-
-extern void check_panic_on_oom(struct oom_control *oc,
-  enum oom_constraint constraint);
-
-extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
-  struct task_struct *task);
-
 extern bool out_of_memory(struct oom_control *oc);
 
 extern void exit_oom_victim(struct task_struct *tsk);
@@ -101,14 +69,11 @@ extern void exit_oom_victim(struct task_struct *t

[PATCH] Update my e-mail address

2016-08-31 Thread Vladimir Davydov

vdavydov@{parallels,virtuozzo}.com will bounce from now on.

Signed-off-by: Vladimir Davydov 
---
 .mailmap| 2 ++
 MAINTAINERS | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index b18912c5121e..de22daefd9da 100644
--- a/.mailmap
+++ b/.mailmap
@@ -159,6 +159,8 @@ Valdis Kletnieks 
 Viresh Kumar  
 Viresh Kumar  
 Viresh Kumar  
+Vladimir Davydov  
+Vladimir Davydov  
 Takashi YOSHII 
 Yusuke Goda 
 Gustavo Padovan 
diff --git a/MAINTAINERS b/MAINTAINERS
index d8e81b1dde30..46a7d3093a49 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3265,7 +3265,7 @@ F:kernel/cpuset.c
 CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
 M: Johannes Weiner 
 M: Michal Hocko 
-M: Vladimir Davydov 
+M: Vladimir Davydov 
 L: cgro...@vger.kernel.org
 L: linux...@kvack.org
 S: Maintained
-- 
2.1.4

Re: [PATCH] Update my e-mail address

2016-08-31 Thread Vladimir Davydov

On Wed, Aug 31, 2016 at 07:26:43AM -0700, Greg wrote:
> On Wed, 2016-08-31 at 15:01 +0300, Vladimir Davydov wrote:
> > vdavydov@{parallels,virtuozzo}.com will bounce from now on.
> > 
> > Signed-off-by: Vladimir Davydov 
> 
> Shouldn't MAINTAINERS be in the subject line?

You're right, sorry. Here it goes:

>From 11b0d4818cfc517ce0cd1dba743bd777f2ccd6b1 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov 
Date: Wed, 31 Aug 2016 14:36:19 +0300
Subject: [PATCH] MAINTAINERS: update my e-mail address

vdavydov@{parallels,virtuozzo}.com will bounce from now on.

Signed-off-by: Vladimir Davydov 
---
 .mailmap| 2 ++
 MAINTAINERS | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index b18912c5121e..de22daefd9da 100644
--- a/.mailmap
+++ b/.mailmap
@@ -159,6 +159,8 @@ Valdis Kletnieks 
 Viresh Kumar  
 Viresh Kumar  
 Viresh Kumar  
+Vladimir Davydov  
+Vladimir Davydov  
 Takashi YOSHII 
 Yusuke Goda 
 Gustavo Padovan 
diff --git a/MAINTAINERS b/MAINTAINERS
index d8e81b1dde30..46a7d3093a49 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3265,7 +3265,7 @@ F:kernel/cpuset.c
 CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
 M: Johannes Weiner 
 M:     Michal Hocko 
-M: Vladimir Davydov 
+M: Vladimir Davydov 
 L: cgro...@vger.kernel.org
 L: linux...@kvack.org
 S: Maintained
-- 
2.1.4

Re: [BUG] Bad page states

2016-08-08 Thread Vladimir Davydov

On Mon, Aug 08, 2016 at 10:48:45AM -0700, Linus Torvalds wrote:
...
> > [   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
> > [   43.484417] page:ea007fc0a8c0 count:0 mapcount:-511 mapping: 
> >  (null) index:0x0
> > [   43.492737] flags: 0x1000()
> > [   43.496602] page dumped because: nonzero mapcount
> 
> Hmm. The _mapcount field is a union with other fields, but that number
> doesn't make sense for any of the other fields.
> 
> So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So

Yes, it is - my bad. The thing is I set/clear PAGE_KMEMCG_MAPCOUNT_VALUE
for pages allocated with __GFP_ACCOUNT iff memcg_kmem_enabled() is true
(see __alloc_pages_nodemask and free_pages_prepare), while the latter
gets disabled when the last cgroup gets destroyed. So if you do

 mkdir /sys/fs/cgroup/memory/test
 # run something in the root cgroup that allocates pages with
 # __GFP_ACCOUNT, e.g. a program using pipe
 rmdir /sys/fs/cgroup/memory/test

Then, if there are no other memory cgroups, you'll see the bug.

Sorry about that :-(

Obviously, the PageKmemcg flag should only be set for pages that are
actually accounted to a non-root kmemcg and hence pin memcg_kmem_enabled
static key. I'll fix that.

Thanks,
Vladimir

[PATCH] mm: memcontrol: only mark charged pages with PageKmemcg

2016-08-08 Thread Vladimir Davydov

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in
__alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ea007f651700 count:0 mapcount:-511 mapping:  (null) index:0x0
  flags: 0x1000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator 
paths")
Reported-by: Eric Dumazet 
Signed-off-by: Vladimir Davydov 
Cc: [4.7+]
---
 fs/pipe.c   |  4 +---
 mm/memcontrol.c | 14 --
 mm/page_alloc.c | 14 +-
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 4b32928f5426..4ebe6b2e5217 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -144,10 +144,8 @@ static int anon_pipe_buf_steal(struct pipe_inode_info 
*pipe,
struct page *page = buf->page;
 
if (page_count(page) == 1) {
-   if (memcg_kmem_enabled()) {
+   if (memcg_kmem_enabled())
memcg_kmem_uncharge(page, 0);
-   __ClearPageKmemcg(page);
-   }
__SetPageLocked(page);
return 0;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c0aa59fd333..7c9c0bd475f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2308,8 +2308,11 @@ int memcg_kmem_charge(struct page *page, gfp_t gfp, int 
order)
return 0;
 
memcg = get_mem_cgroup_from_mm(current->mm);
-   if (!mem_cgroup_is_root(memcg))
+   if (!mem_cgroup_is_root(memcg)) {
ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+   if (!ret)
+   __SetPageKmemcg(page);
+   }
css_put(&memcg->css);
return ret;
 }
@@ -2336,6 +2339,11 @@ void memcg_kmem_uncharge(struct page *page, int order)
page_counter_uncharge(&memcg->memsw, nr_pages);
 
page->mem_cgroup = NULL;
+
+   /* slab pages do not have PageKmemcg flag set */
+   if (PageKmemcg(page))
+   __ClearPageKmemcg(page);
+
css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -5533,8 +5541,10 @@ static void uncharge_list(struct list_head *page_list)
else
nr_file += nr_pages;
pgpgout++;
-   } else
+   } else {
nr_kmem += 1 << compound_order(page);
+   __ClearPageKmemcg(page);
+   }
 
page->mem_cgroup = NULL;
} while (next != page_list);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee5a4b20daf4..3a7d3a1fb6a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,10 +1009,8 @@ static __always_inline bool free_pages_prepare(struct 
page *page,
}
if (PageMappingFlags(page))
page->mapping = NULL;
-   if (memcg_kmem_enabled() && PageKmemcg(page)) {
+   if (memcg_kmem_enabled() && PageKmemcg(page))
memcg_kmem_uncharge(page, order);
-   __ClearPageKmemcg(page);
-   }
if (check_free)
bad += free_pages_check(page);
if (bad)
@@ -3788,12 +3786,10 @@ no_zone:
}
 
 out:
-   if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) {
-   if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
-   __free_pages(page, order);
-   page = NULL;
-   } else
-   __SetPageKmemcg(page);
+   if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
+   unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+   __free_pages(page, order);
+   page = NULL;
}
 
if (kmemcheck_enabled && page)
-- 
2.1.4

Re: [PATCH] mm: memcontrol: avoid unused function warning

2016-08-24 Thread Vladimir Davydov

On Wed, Aug 24, 2016 at 10:22:43AM +0200, Arnd Bergmann wrote:
> A bugfix in v4.8-rc2 introduced a harmless warning when CONFIG_MEMCG_SWAP
> is disabled but CONFIG_MEMCG is enabled:
> 
> mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not 
> used [-Werror=unused-function]
>  static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
> 
> This adds an extra #ifdef that matches the one around the caller to
> avoid the warning.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from 
> offline cgroup")

Acked-by: Vladimir Davydov

Re: [PATCH v2] fs/dcache.c: fix spin lockup issue on nlru->lock

2017-06-21 Thread Vladimir Davydov

On Wed, Jun 21, 2017 at 12:09:15PM +0530, Sahitya Tummala wrote:
> __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for
> longer duration if there are more number of items in the lru list.
> As per the current code, it can hold the spin lock for upto maximum
> UINT_MAX entries at a time. So if there are more number of items in
> the lru list, then "BUG: spinlock lockup suspected" is observed in
> the below path -
> 
> [] spin_bug+0x90
> [] do_raw_spin_lock+0xfc
> [] _raw_spin_lock+0x28
> [] list_lru_add+0x28
> [] dput+0x1c8
> [] path_put+0x20
> [] terminate_walk+0x3c
> [] path_lookupat+0x100
> [] filename_lookup+0x6c
> [] user_path_at_empty+0x54
> [] SyS_faccessat+0xd0
> [] el0_svc_naked+0x24
> 
> This nlru->lock is acquired by another CPU in this path -
> 
> [] d_lru_shrink_move+0x34
> [] dentry_lru_isolate_shrink+0x48
> [] __list_lru_walk_one.isra.10+0x94
> [] list_lru_walk_node+0x40
> [] shrink_dcache_sb+0x60
> [] do_remount_sb+0xbc
> [] do_emergency_remount+0xb0
> [] process_one_work+0x228
> [] worker_thread+0x2e0
> [] kthread+0xf4
> [] ret_from_fork+0x10
> 
> Fix this lockup by reducing the number of entries to be shrinked
> from the lru list to 1024 at once. Also, add cond_resched() before
> processing the lru list again.
> 
> Link: http://marc.info/?t=14972286491&r=1&w=2
> Fix-suggested-by: Jan kara 
> Fix-suggested-by: Vladimir Davydov 
> Signed-off-by: Sahitya Tummala 
> ---
> v2: patch shrink_dcache_sb() instead of list_lru_walk()
> ---
>  fs/dcache.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index cddf397..c8ca150 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
>   LIST_HEAD(dispose);
>  
>   freed = list_lru_walk(&sb->s_dentry_lru,
> - dentry_lru_isolate_shrink, &dispose, UINT_MAX);
> + dentry_lru_isolate_shrink, &dispose, 1024);
>  
>   this_cpu_sub(nr_dentry_unused, freed);
>   shrink_dentry_list(&dispose);
> + cond_resched();
>   } while (freed > 0);

In an extreme case, a single invocation of list_lru_walk() can skip all
1024 dentries, in which case 'freed' will be 0 forcing us to break the
loop prematurely. I think we should loop until there's at least one
dentry left on the LRU, i.e.

while (list_lru_count(&sb->s_dentry_lru) > 0)

However, even that wouldn't be quite correct, because list_lru_count()
iterates over all memory cgroups to sum list_lru_one->nr_items, which
can race with memcg offlining code migrating dentries off a dead cgroup
(see memcg_drain_all_list_lrus()). So it looks like to make this check
race-free, we need to account the number of entries on the LRU not only
per memcg, but also per node, i.e. add list_lru_node->nr_items.
Fortunately, list_lru entries can't be migrated between NUMA nodes.

>  }
>  EXPORT_SYMBOL(shrink_dcache_sb);

[PATCH 1/2] mm/slab: skip memcg reclaim only if in atomic context

2015-08-30 Thread Vladimir Davydov

SLAB's implementation of kmem_cache_alloc() works as follows:
 1. First, it tries to allocate from the preferred NUMA node without
issuing reclaim.
 2. If step 1 fails, it tries all nodes in the order of preference,
again without invoking reclaimer
 3. Only if steps 1 and 2 fails, it falls back on allocation from any
allowed node with reclaim enabled.

Before commit 4167e9b2cf10f ("mm: remove GFP_THISNODE"), GFP_THISNODE
combination, which equaled __GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY on
NUMA enabled builds, was used in order to avoid reclaim during steps 1
and 2. If __alloc_pages_slowpath() saw this combination in gfp flags, it
aborted immediately even if __GFP_WAIT flag was set. So there was no
need in clearing __GFP_WAIT flag while performing steps 1 and 2 and
hence we could invoke memcg reclaim when allocating a slab page if the
context allowed.

Commit 4167e9b2cf10f zapped GFP_THISNODE combination. Instead of OR-ing
the gfp mask with GFP_THISNODE, gfp_exact_node() helper should now be
used. The latter sets __GFP_THISNODE and __GFP_NOWARN flags and clears
__GFP_WAIT on the current gfp mask. As a result, it effectively
prohibits invoking memcg reclaim on steps 1 and 2. This breaks
memcg/memory.high logic when kmem accounting is enabled. The memory.high
threshold is supposed to work as a soft limit, i.e. it does not fail an
allocation on breaching it, but it still forces the caller to invoke
direct reclaim to compensate for the excess. Without __GFP_WAIT flag
direct reclaim is impossible so the caller will go on without being
pushed back to the threshold.

To fix this issue, we get rid of gfp_exact_node() helper and move gfp
flags filtering to kmem_getpages() after memcg_charge_slab() is called.

To understand the patch, note that:
 - In fallback_alloc() the only effect of using gfp_exact_node() is
   preventing recursion fallback_alloc() -> cache_alloc_node() ->
   fallback_alloc().
 - Aside from fallback_alloc(), gfp_exact_node() is only used along with
   cache_grow(). Moreover, the only place where cache_grow() is used
   without it is fallback_alloc(), which, in contrast to other
   cache_grow() users, preallocates a page and passes it to cache_grow()
   so that the latter does not need to invoke kmem_getpages() by itself.

Reported-by: Tejun Heo 
Signed-off-by: Vladimir Davydov 
---
 mm/slab.c | 32 +++-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index d890750ec31e..9ee809d2ed8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -857,11 +857,6 @@ static inline void *cache_alloc_node(struct kmem_cache 
*cachep,
return NULL;
 }
 
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
-   return flags;
-}
-
 #else  /* CONFIG_NUMA */
 
 static void *cache_alloc_node(struct kmem_cache *, gfp_t, int);
@@ -1028,15 +1023,6 @@ static inline int cache_free_alien(struct kmem_cache 
*cachep, void *objp)
 
return __cache_free_alien(cachep, objp, node, page_node);
 }
-
-/*
- * Construct gfp mask to allocate from a specific node but do not invoke 
reclaim
- * or warn about failures.
- */
-static inline gfp_t gfp_exact_node(gfp_t flags)
-{
-   return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
-}
 #endif
 
 /*
@@ -1583,7 +1569,7 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t 
gfpflags, int nodeid)
  * would be relatively rare and ignorable.
  */
 static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
-   int nodeid)
+ int nodeid, bool fallback)
 {
struct page *page;
int nr_pages;
@@ -1595,6 +1581,9 @@ static struct page *kmem_getpages(struct kmem_cache 
*cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;
 
+   if (!fallback)
+   flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+
page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, 
cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
@@ -2641,7 +2630,8 @@ static int cache_grow(struct kmem_cache *cachep,
 * 'nodeid'.
 */
if (!page)
-   page = kmem_getpages(cachep, local_flags, nodeid);
+   page = kmem_getpages(cachep, local_flags, nodeid,
+!IS_ENABLED(CONFIG_NUMA));
if (!page)
goto failed;
 
@@ -2840,7 +2830,7 @@ alloc_done:
if (unlikely(!ac->avail)) {
int x;
 force_grow:
-   x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+   x = cache_grow(cachep, flags, node, NULL);
 
/* cache_grow can reenable interrupts, then ac could change. */
ac = cpu_cache_get(cachep);
@@ -3034,7 +3024,7 @@

[PATCH 2/2] mm/slub: do not bypass memcg reclaim for high-order page allocation

2015-08-30 Thread Vladimir Davydov

Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high order slab pages
without __GFP_WAIT in order to avoid invoking reclaim/compaction when we
can fall back on low order pages. However, it broke memcg/memory.high
logic in case kmem accounting is enabled. The memory.high threshold
works as a soft limit: an allocation does not fail if it is breached,
but we call direct reclaim to compensate for the excess. Without
__GFP_WAIT we cannot invoke reclaimer and therefore we will go on
exceeding memory.high more and more until a normal __GFP_WAIT allocation
is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let us fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.

Reported-by: Tejun Heo 
Signed-off-by: Vladimir Davydov 
---
 mm/slub.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..416a332277cb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,14 @@ static inline struct page *alloc_slab_page(struct 
kmem_cache *s,
if (memcg_charge_slab(s, flags, order))
return NULL;
 
+   /*
+* Let the initial higher-order allocation fail under memory pressure
+* so we fall-back to the minimum order allocation.
+*/
+   if (oo_order(oo) > oo_order(s->min))
+   flags = (flags | __GFP_NOWARN | __GFP_NOMEMALLOC) &
+   ~(__GFP_NOFAIL | __GFP_WAIT);
+
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
@@ -1348,7 +1356,6 @@ static struct page *allocate_slab(struct kmem_cache *s, 
gfp_t flags, int node)
 {
struct page *page;
struct kmem_cache_order_objects oo = s->oo;
-   gfp_t alloc_gfp;
void *start, *p;
int idx, order;
 
@@ -1359,23 +1366,14 @@ static struct page *allocate_slab(struct kmem_cache *s, 
gfp_t flags, int node)
 
flags |= s->allocflags;
 
-   /*
-* Let the initial higher-order allocation fail under memory pressure
-* so we fall-back to the minimum order allocation.
-*/
-   alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
-   if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
-   alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
-
-   page = alloc_slab_page(s, alloc_gfp, node, oo);
+   page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page)) {
oo = s->min;
-   alloc_gfp = flags;
/*
 * Allocation may have failed due to fragmentation.
 * Try a lower order alloc if possible
 */
-   page = alloc_slab_page(s, alloc_gfp, node, oo);
+   page = alloc_slab_page(s, flags, node, oo);
if (unlikely(!page))
goto out;
stat(s, ORDER_FALLBACK);
@@ -1385,7 +1383,7 @@ static struct page *allocate_slab(struct kmem_cache *s, 
gfp_t flags, int node)
!(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
int pages = 1 << oo_order(oo);
 
-   kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node);
+   kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
 
/*
 * Objects from caches that have a constructor don't get
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-30 Thread Vladimir Davydov

Hi,

Tejun reported that sometimes memcg/memory.high threshold seems to be
silently ignored if kmem accounting is enabled:

  http://www.spinics.net/lists/linux-mm/msg93613.html

It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
first. As a result, if there is enough free pages, memcg reclaim will
not get invoked on kmem allocations, which will lead to uncontrollable
growth of memory usage no matter what memory.high is set to.

This patch set attempts to fix this issue. For more details please see
comments to individual patches.

Thanks,

Vladimir Davydov (2):
  mm/slab: skip memcg reclaim only if in atomic context
  mm/slub: do not bypass memcg reclaim for high-order page allocation

 mm/slab.c | 32 +++-
 mm/slub.c | 24 +++-
 2 files changed, 22 insertions(+), 34 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:

> > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > silently ignored if kmem accounting is enabled:
> > 
> >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > 
> > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > first. As a result, if there is enough free pages, memcg reclaim will
> > not get invoked on kmem allocations, which will lead to uncontrollable
> > growth of memory usage no matter what memory.high is set to.
> 
> Right but isn't that what the caller explicitly asked for?

No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
might ignore that and charge memcg w/o __GFP_WAIT.

> Why should we ignore that for kmem accounting? It seems like a fix at
> a wrong layer to me.

Let's forget about memory.high for a minute.

 1. SLAB. Suppose someone calls kmalloc_node and there is enough free
memory on the preferred node. W/o memcg limit set, the allocation
will happen from the preferred node, which is OK. If there is memcg
limit, we can currently fail to allocate from the preferred node if
we are near the limit. We issue memcg reclaim and go to fallback
alloc then, which will most probably allocate from a different node,
although there is no reason for that. This is a bug.

 2. SLUB. Someone calls kmalloc and there is enough free high order
pages. If there is no memcg limit, we will allocate a high order
slab page, which is in accordance with SLUB internal logic. With
memcg limit set, we are likely to fail to charge high order page
(because we currently try to charge high order pages w/o __GFP_WAIT)
and fallback on a low order page. The latter is unexpected and
unjustified.

That being said, this is the fix at the right layer.

> Either we should start failing GFP_NOWAIT charges when we are above
> high wmark or deploy an additional catchup mechanism as suggested by
> Tejun.

The mechanism proposed by Tejun won't help us to avoid allocation
failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS.

To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when
the gap between limit and usage is getting too small. It may be done
from a workqueue or from task_work, but currently I don't see any reason
why complicate and not just start reclaim directly, just like
memory.high does.

I mean, currently you can protect against GFP_NOWAIT failures by setting
memory.high to be 1-2 MB lower than memory.high and this *will* work,
because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they
will alternate with normal GFP_KERNEL allocations sooner or later. It
does not mean we should encourage users to set memory.high to protect
against such failures, because, as pointed out by Tejun, logic behind
memory.high is currently opaque and can change, but we can introduce
memcg-internal watermarks that would work exactly as memory.high and
hence help us against GFP_NOWAIT/GFP_NOFS failures.

Thanks,
Vladimir

> I like the later more because it allows to better handle GFP_NOFS
> requests as well and there are many sources of these from kmem paths.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 09:43:35AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > Right but isn't that what the caller explicitly asked for? Why should we
> > ignore that for kmem accounting? It seems like a fix at a wrong layer to
> > me. Either we should start failing GFP_NOWAIT charges when we are above
> > high wmark or deploy an additional catchup mechanism as suggested by
> > Tejun. I like the later more because it allows to better handle GFP_NOFS
> > requests as well and there are many sources of these from kmem paths.
> 
> Yeah, this is beginning to look like we're trying to solve the problem
> at the wrong layer.  slab/slub or whatever else should be able to use
> GFP_NOWAIT in whatever frequency they want for speculative
> allocations.

slab/slub can issue alloc_pages() any time with any flags they want and
it won't be accounted to memcg, because kmem is accounted at slab/slub
layer, not in buddy.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 10:39:39AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote:
> > slab/slub can issue alloc_pages() any time with any flags they want and
> > it won't be accounted to memcg, because kmem is accounted at slab/slub
> > layer, not in buddy.
> 
> Hmmm?  I meant the eventual calling into try_charge w/ GFP_NOWAIT.
> Speculative usage of GFP_NOWAIT is bound to increase and we don't want
> to put on extra restrictions from memcg side. 

We already put restrictions on slab/slub from memcg side, because kmem
accounting is a part of slab/slub. They have to cooperate in order to
get things working. If slab/slub wants to make a speculative allocation
for some reason, it should just put memcg_charge out of this speculative
alloc section. This is what this patch set does.

We have to be cautious about placing memcg_charge in slab/slub. To
understand why, consider SLAB case, which first tries to allocate from
all nodes in the order of preference w/o __GFP_WAIT and only if it fails
falls back on an allocation from any node w/ __GFP_WAIT. This is its
internal algorithm. If we blindly put memcg_charge to alloc_slab method,
then, when we are near the memcg limit, we will go over all NUMA nodes
in vain, then finally fall back to __GFP_WAIT allocation, which will get
a slab from a random node. Not only we do more work than necessary due
to walking over all NUMA nodes for nothing, but we also break SLAB
internal logic! And you just can't fix it in memcg, because memcg knows
nothing about the internal logic of SLAB, how it handles NUMA nodes.

SLUB has a different problem. It tries to avoid high-order allocations
if there is a risk of invoking costly memory compactor. It has nothing
to do with memcg, because memcg does not care if the charge is for a
high order page or not.

Thanks,
Vladimir

> For memory.high,
> punting to the return path is a pretty stright-forward solution which
> should make the problem go away almost entirely.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 10:46:04AM -0400, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote:
> ...
> > That being said, this is the fix at the right layer.
> 
> While this *might* be a necessary workaround for the hard limit case
> right now, this is by no means the fix at the right layer.  The
> expectation is that mm keeps a reasonable amount of memory available
> for allocations which can't block.  These allocations may fail from
> time to time depending on luck and under extreme memory pressure but
> the caller should be able to depend on it as a speculative allocation
> mechanism which doesn't fail willy-nilly.
> 
> Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not
> slab or slub.

I never denied that there is GFP_NOWAIT/GFP_NOFS problem in memcg. I
even proposed ways to cope with it in one of the previous e-mails.

Nevertheless, we just can't allow slab/slub internals call memcg_charge
whenever they want as I pointed out in a parallel thread.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote:
> > We have to be cautious about placing memcg_charge in slab/slub. To
> > understand why, consider SLAB case, which first tries to allocate from
> > all nodes in the order of preference w/o __GFP_WAIT and only if it fails
> > falls back on an allocation from any node w/ __GFP_WAIT. This is its
> > internal algorithm. If we blindly put memcg_charge to alloc_slab method,
> > then, when we are near the memcg limit, we will go over all NUMA nodes
> > in vain, then finally fall back to __GFP_WAIT allocation, which will get
> > a slab from a random node. Not only we do more work than necessary due
> > to walking over all NUMA nodes for nothing, but we also break SLAB
> > internal logic! And you just can't fix it in memcg, because memcg knows
> > nothing about the internal logic of SLAB, how it handles NUMA nodes.
> > 
> > SLUB has a different problem. It tries to avoid high-order allocations
> > if there is a risk of invoking costly memory compactor. It has nothing
> > to do with memcg, because memcg does not care if the charge is for a
> > high order page or not.
> 
> Maybe I'm missing something but aren't both issues caused by memcg
> failing to provide headroom for NOWAIT allocations when the
> consumption gets close to the max limit? 

That's correct.

> Regardless of the specific usage, !__GFP_WAIT means "give me memory if
> it can be spared w/o inducing direct time-consuming maintenance work"
> and the contract around it is that such requests will mostly succeed
> under nominal conditions.  Also, slab/slub might not stay as the only
> user of try_charge().

Indeed, there might be other users trying GFP_NOWAIT before falling back
to GFP_KERNEL, but they are not doing that constantly and hence cause no
problems. If SLAB/SLUB plays such tricks, the problem becomes massive:
under certain conditions *every* try_charge may be invoked w/o
__GFP_WAIT, resulting in memory.high breaching and hitting memory.max.

Generally speaking, handing over reclaim responsibility to task_work
won't help, because there might be cases when a process spends quite a
lot of time in kernel invoking lots of GFP_KERNEL allocations before
returning to userspace. Without fixing slab/slub, such a process will
charge w/o __GFP_WAIT and therefore can exceed memory.high and reach
memory.max. If there are no other active processes in the cgroup, the
cgroup can stay with memory.high excess for a relatively long time
(suppose the process was throttled in kernel), possibly hurting the rest
of the system. What is worse, if the process happens to invoke a real
GFP_NOWAIT allocation when it's about to hit the limit, it will fail.

If we want to allow slab/slub implementation to invoke try_charge
wherever it wants, we need to introduce an asynchronous thread doing
reclaim when a memcg is approaching its limit (or teach kswapd do that).
That's a way to go, but what's the point to complicate things
prematurely while it seems we can fix the problem by using the technique
similar to the one behind memory.high?

Nevertheless, even if we introduced such a thread, it'd be just insane
to allow slab/slub blindly insert try_charge. Let me repeat the examples
of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
try_charge I gave above:

 - memcg knows nothing about NUMA nodes, so what's the point in failing
   !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
 - memcg knows nothing about high order pages, so what's the point in
   failing !__GFP_WAIT allocations used by SLUB to try to allocate a
   high order page?

Thanks,
Vladimir

> I still think solving this from memcg side is the right direction.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-08-31 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 01:03:09PM -0400, Tejun Heo wrote:
> On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote:
> ...
> > If we want to allow slab/slub implementation to invoke try_charge
> > wherever it wants, we need to introduce an asynchronous thread doing
> > reclaim when a memcg is approaching its limit (or teach kswapd do that).
> 
> In the long term, I think this is the way to go.

Quite probably, or we can use task_work, or direct reclaim instead. It's
not that obvious to me yet which one is the best.

> 
> > That's a way to go, but what's the point to complicate things
> > prematurely while it seems we can fix the problem by using the technique
> > similar to the one behind memory.high?
> 
> Cuz we're now scattering workarounds to multiple places and I'm sure
> we'll add more try_charge() users (e.g. we want to fold in tcp memcg
> under the same knobs) and we'll have to worry about the same problem
> all over again and will inevitably miss some cases leading to subtle
> failures.

I don't think we will need to insert try_charge_kmem anywhere else,
because all kmem users either allocate memory using kmalloc and friends
or using alloc_pages. kmalloc is accounted. For those who prefer
alloc_pages, there is alloc_kmem_pages helper.

> 
> > Nevertheless, even if we introduced such a thread, it'd be just insane
> > to allow slab/slub blindly insert try_charge. Let me repeat the examples
> > of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of
> > try_charge I gave above:
> > 
> >  - memcg knows nothing about NUMA nodes, so what's the point in failing
> >!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes?
> >  - memcg knows nothing about high order pages, so what's the point in
> >failing !__GFP_WAIT allocations used by SLUB to try to allocate a
> >high order page?
> 
> Both are optimistic speculative actions and as long as memcg can
> guarantee that those requests will succeed under normal circumstances,
> as does the system-wide mm does, it isn't a problem.
> 
> In general, we want to make sure inside-cgroup behaviors as close to
> system-wide behaviors as possible, scoped but equivalent in kind.
> Doing things differently, while inevitable in certain cases, is likely
> to get messy in the long term.

I totally agree that we should strive to make a kmem user feel roughly
the same in memcg as if it were running on a host with equal amount of
RAM. There are two ways to achieve that:

 1. Make the API functions, i.e. kmalloc and friends, behave inside
memcg roughly the same way as they do in the root cgroup.
 2. Make the internal memcg functions, i.e. try_charge and friends,
behave roughly the same way as alloc_pages.

I find way 1 more flexible, because we don't have to blindly follow
heuristics used on global memory reclaim and therefore have more
opportunities to achieve the same goal.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov

On Mon, Aug 31, 2015 at 03:22:22PM -0500, Christoph Lameter wrote:
> On Mon, 31 Aug 2015, Vladimir Davydov wrote:
> 
> > I totally agree that we should strive to make a kmem user feel roughly
> > the same in memcg as if it were running on a host with equal amount of
> > RAM. There are two ways to achieve that:
> >
> >  1. Make the API functions, i.e. kmalloc and friends, behave inside
> > memcg roughly the same way as they do in the root cgroup.
> >  2. Make the internal memcg functions, i.e. try_charge and friends,
> > behave roughly the same way as alloc_pages.
> >
> > I find way 1 more flexible, because we don't have to blindly follow
> > heuristics used on global memory reclaim and therefore have more
> > opportunities to achieve the same goal.
> 
> The heuristics need to integrate well if its in a cgroup or not. In
> general make use of cgroups as transparent as possible to the rest of the
> code.

Half of kmem accounting implementation resides in SLAB/SLUB. We can't
just make use of cgroups there transparent. For the rest of the code
using kmalloc, cgroups are transparent.

Indeed, we can make memcg_charge_slab behave exactly like alloc_pages,
we can even put it to alloc_pages (where it used to be), but why if the
only user of memcg_charge_slab is SLAB/SLUB core?

I think we'd have more space to manoeuvre if we just taught SLAB/SLUB to
use memcg_charge_slab wisely (as it used to until recently), because
memcg charge/reclaim is quite different from global alloc/reclaim:

 - it isn't aware of NUMA nodes, so trying to charge w/o __GFP_WAIT
   while inspecting nodes, like in case of SLAB, is meaningless

 - it isn't aware of high order page allocations, so trying to charge
   w/o __GFP_WAIT while trying optimistically to get a high order page,
   like in case of SLUB, is meaningless too

 - it can always let a high prio allocation go unaccounted, so IMO there
   is no point in introducing emergency reserves (__GFP_MEMALLOC
   handling)

 - it can always charge a GFP_NOWAIT allocation even if it exceeds the
   limit, issuing direct reclaim when a GFP_KERNEL allocation comes or
   from a task work, because there is no risk of depleting memory
   reserves; so it isn't obvious to me whether we really need an aync
   thread handling memcg reclaim like kswapd

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov

On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> > On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote:
> > > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote:
> > 
> > > > Tejun reported that sometimes memcg/memory.high threshold seems to be
> > > > silently ignored if kmem accounting is enabled:
> > > > 
> > > >   http://www.spinics.net/lists/linux-mm/msg93613.html
> > > > 
> > > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT
> > > > first. As a result, if there is enough free pages, memcg reclaim will
> > > > not get invoked on kmem allocations, which will lead to uncontrollable
> > > > growth of memory usage no matter what memory.high is set to.
> > > 
> > > Right but isn't that what the caller explicitly asked for?
> > 
> > No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we
> > might ignore that and charge memcg w/o __GFP_WAIT.
> 
> I was referring to the slab allocator as the caller. Sorry for not being
> clear about that.
> 
> > > Why should we ignore that for kmem accounting? It seems like a fix at
> > > a wrong layer to me.
> > 
> > Let's forget about memory.high for a minute.
> >
> >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > memory on the preferred node. W/o memcg limit set, the allocation
> > will happen from the preferred node, which is OK. If there is memcg
> > limit, we can currently fail to allocate from the preferred node if
> > we are near the limit. We issue memcg reclaim and go to fallback
> > alloc then, which will most probably allocate from a different node,
> > although there is no reason for that. This is a bug.
> 
> I am not familiar with the SLAB internals much but how is it different
> from the global case. If the preferred node is full then __GFP_THISNODE
> request will make it fail early even without giving GFP_NOWAIT
> additional access to atomic memory reserves. The fact that memcg case
> fails earlier is perfectly expected because the restriction is tighter
> than the global case.

memcg restrictions are orthogonal to NUMA: failing an allocation from a
particular node does not mean failing memcg charge and vice versa.

> 
> How the fallback is implemented and whether trying other node before
> reclaiming from the preferred one is reasonable I dunno. This is for
> SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> for memcg enabled setups subtly different. And that is bad.

Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
inspecting if a NUMA node has free pages makes SLAB behaviour subtly
differently: SLAB will walk over all NUMA nodes for nothing instead of
invoking memcg reclaim once a free page is found.

You are talking about memcg/kmem accounting as if it were done in the
buddy allocator on top of which the slab layer is built knowing nothing
about memcg accounting on the lower layer. That's not true and that
simply can't be true. Kmem accounting is implemented at the slab layer.
Memcg provides its memcg_charge_slab/uncharge methods solely for
slab core, so it's OK to have some calling conventions between them.
What we are really obliged to do is to preserve behavior of slab's
external API, i.e. kmalloc and friends.

> 
> >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > pages. If there is no memcg limit, we will allocate a high order
> > slab page, which is in accordance with SLUB internal logic. With
> > memcg limit set, we are likely to fail to charge high order page
> > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > and fallback on a low order page. The latter is unexpected and
> > unjustified.
> 
> And this case very similar and I even argue that it shows more
> brokenness with your patch. The SLUB allocator has _explicitly_ asked
> for an allocation _without_ reclaim because that would be unnecessarily
> too costly and there is other less expensive fallback. But memcg would

You are ignoring the fact that, in contrast to alloc_pages, for memcg
there is practically no difference between charging a 4-order page or a
1-order page. OTOH, using 1-order pages where we could go with 4-order
pages increases page fragmentation at the global level. This subtly
breaks internal SLUB optimization. Once again, kmem accounting is not
something staying aside from slab core, it's a part of slab core.

> be ignoring this with your patch AFAIU and break the optimization. There
> are other cases like

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-01 Thread Vladimir Davydov

On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> > > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote:
> {...}
> > > >  1. SLAB. Suppose someone calls kmalloc_node and there is enough free
> > > > memory on the preferred node. W/o memcg limit set, the allocation
> > > > will happen from the preferred node, which is OK. If there is memcg
> > > > limit, we can currently fail to allocate from the preferred node if
> > > > we are near the limit. We issue memcg reclaim and go to fallback
> > > > alloc then, which will most probably allocate from a different node,
> > > > although there is no reason for that. This is a bug.
> > > 
> > > I am not familiar with the SLAB internals much but how is it different
> > > from the global case. If the preferred node is full then __GFP_THISNODE
> > > request will make it fail early even without giving GFP_NOWAIT
> > > additional access to atomic memory reserves. The fact that memcg case
> > > fails earlier is perfectly expected because the restriction is tighter
> > > than the global case.
> > 
> > memcg restrictions are orthogonal to NUMA: failing an allocation from a
> > particular node does not mean failing memcg charge and vice versa.
> 
> Sure memcg doesn't care about NUMA it just puts an additional constrain
> on top of all existing ones. The point I've tried to make is that the
> logic is currently same whether it is page allocator (with the node
> restriction) or memcg (cumulative amount restriction) are behaving
> consistently. Neither of them try to reclaim in order to achieve its
> goals. How conservative is memcg about allowing GFP_NOWAIT allocation
> is a separate issue and all those details belong to memcg proper same
> as the allocation strategy for these allocations belongs to the page
> allocator.
>  
> > > How the fallback is implemented and whether trying other node before
> > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior
> > > for memcg enabled setups subtly different. And that is bad.
> > 
> > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > inspecting if a NUMA node has free pages makes SLAB behaviour subtly
> > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > invoking memcg reclaim once a free page is found.
> 
> So you are saying that the SLAB kmem accounting in this particular path
> is suboptimal because the fallback mode doesn't retry local node with
> the reclaim enabled before falling back to other nodes?

I'm just pointing out some subtle behavior changes in slab you were
opposed to.

> I would consider it quite surprising as well even for the global case
> because __GFP_THISNODE doesn't wake up kswapd to make room on that node.
> 
> > You are talking about memcg/kmem accounting as if it were done in the
> > buddy allocator on top of which the slab layer is built knowing nothing
> > about memcg accounting on the lower layer. That's not true and that
> > simply can't be true. Kmem accounting is implemented at the slab layer.
> > Memcg provides its memcg_charge_slab/uncharge methods solely for
> > slab core, so it's OK to have some calling conventions between them.
> > What we are really obliged to do is to preserve behavior of slab's
> > external API, i.e. kmalloc and friends.
> 
> I guess I understand what you are saying here but it sounds like special
> casing which tries to be clever because the current code understands
> both the lower level allocator and kmem charge paths to decide how to

What do you mean by saying "it understands the lower level allocator"?
AFAIK we have memcg callbacks only in special places, like page fault
handler or kmalloc.

> juggle with them. This is imho bad and hard to maintain long term.

We already juggle. Just grep where and how we insert
mem_cgroup_try_charge.

> 
> > > >  2. SLUB. Someone calls kmalloc and there is enough free high order
> > > > pages. If there is no memcg limit, we will allocate a high order
> > > > slab page, which is in accordance with SLUB internal logic. With
> > > > memcg limit set, we are likely to fail to charge high order page
> > > > (because we currently try to charge high order pages w/o __GFP_WAIT)
> > > > and fallback on a low order page. The latter is unexpected and

Re: [PATCH 3/3] memcg: simplify and inline __mem_cgroup_from_kmem

2015-10-16 Thread Vladimir Davydov

On Fri, Oct 16, 2015 at 04:17:26PM +0300, Kirill A. Shutemov wrote:
> On Mon, Oct 05, 2015 at 01:21:43AM +0300, Vladimir Davydov wrote:
> > Before the previous patch, __mem_cgroup_from_kmem had to handle two
> > types of kmem - slab pages and pages allocated with alloc_kmem_pages -
> > differently, because slab pages did not store information about owner
> > memcg in the page struct. Now we can unify it. Since after it, this
> > function becomes tiny we can fold it into mem_cgroup_from_kmem.
> > 
> > Signed-off-by: Vladimir Davydov 
> > ---
> >  include/linux/memcontrol.h |  7 ---
> >  mm/memcontrol.c| 18 --
> >  2 files changed, 4 insertions(+), 21 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 8a9b7a798f14..0e2e039609d1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -769,8 +769,6 @@ static inline int memcg_cache_id(struct mem_cgroup 
> > *memcg)
> >  struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep);
> >  void __memcg_kmem_put_cache(struct kmem_cache *cachep);
> >  
> > -struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr);
> > -
> >  static inline bool __memcg_kmem_bypass(gfp_t gfp)
> >  {
> > if (!memcg_kmem_enabled())
> > @@ -832,9 +830,12 @@ static __always_inline void 
> > memcg_kmem_put_cache(struct kmem_cache *cachep)
> >  
> >  static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
> >  {
> > +   struct page *page;
> > +
> > if (!memcg_kmem_enabled())
> > return NULL;
> > -   return __mem_cgroup_from_kmem(ptr);
> > +   page = virt_to_head_page(ptr);
> > +   return page->mem_cgroup;
> >  }
> 
> virt_to_head_page() is defined in  but you don't include it,
> and the commit breaks build for me (on v4.3-rc5-mmotm-2015-10-15-15-20).
> 
>   CC  arch/x86/kernel/asm-offsets.s
> In file included from /home/kas/linux/mm/include/linux/swap.h:8:0,
>  from /home/kas/linux/mm/include/linux/suspend.h:4,
>  from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12:
> /home/kas/linux/mm/include/linux/memcontrol.h: In function 
> â?~mem_cgroup_from_kmemâ?T:
> /home/kas/linux/mm/include/linux/memcontrol.h:841:9: error: implicit 
> declaration of function â?~virt_to_head_pageâ?T 
> [-Werror=implicit-function-declaration]
>   page = virt_to_head_page(ptr);
>  ^
> /home/kas/linux/mm/include/linux/memcontrol.h:841:7: warning: assignment 
> makes pointer from integer without a cast [-Wint-conversion]
>   page = virt_to_head_page(ptr);
>^
> In file included from /home/kas/linux/mm/include/linux/suspend.h:8:0,
>  from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12:
> /home/kas/linux/mm/include/linux/mm.h: At top level:
> /home/kas/linux/mm/include/linux/mm.h:452:28: error: conflicting types for 
> â?~virt_to_head_pageâ?T
>  static inline struct page *virt_to_head_page(const void *x)
> ^
> In file included from /home/kas/linux/mm/include/linux/swap.h:8:0,
>  from /home/kas/linux/mm/include/linux/suspend.h:4,
>  from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12:
> /home/kas/linux/mm/include/linux/memcontrol.h:841:9: note: previous implicit 
> declaration of â?~virt_to_head_pageâ?T was here
>   page = virt_to_head_page(ptr);
>  ^
> cc1: some warnings being treated as errors

Oops, in my config I have CONFIG_CGROUP_WRITEBACK enabled, which results
in including mm.h to memcontrol.h indirectly:

 linux/memcontrol.h
  linux/writeback.h
   linux/bio.h
linux/highmem.h
 linux/mm.h

That's why I didn't notice this. Sorry about that.

> 
> The patch below fixes it for me (and for allmodconfig on x86-64), but I'm not
> sure if it have any side effects on other configurations.

It should work OK with any config, otherwise CONFIG_CGROUP_WRITEBACK
would be broken too.

Andrew, could you please merge the fix by Kirill into
memcg-simplify-and-inline-__mem_cgroup_from_kmem.patch

Thanks,
Vladimir

> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 47677acb4516..e8e52e502c20 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] memcg: simplify and inline __mem_cgroup_from_kmem

2015-10-17 Thread Vladimir Davydov

On Fri, Oct 16, 2015 at 03:12:23PM -0700, Hugh Dickins wrote:
...
> Are you expecting to use mem_cgroup_from_kmem() from other places
> in future?  Seems possible; but at present it's called from only

Not in the near future. At least, currently I can't think of any other
use for it except list_lru_from_kmem.

> one place, and (given how memcontrol.h has somehow managed to avoid
> including mm.h all these years), I thought it would be nice to avoid
> it for just this; and fixed my build with the patch below last night.
> Whatever you all think best: just wanted to point out an alternative.

Makes sense, thanks!

I would even inline mem_cgroup_from_kmem to list_lru_from_kmem:

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 47677acb4516..2077b9bb4883 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -831,16 +831,6 @@ static __always_inline void memcg_kmem_put_cache(struct 
kmem_cache *cachep)
if (memcg_kmem_enabled())
__memcg_kmem_put_cache(cachep);
 }
-
-static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
-{
-   struct page *page;
-
-   if (!memcg_kmem_enabled())
-   return NULL;
-   page = virt_to_head_page(ptr);
-   return page->mem_cgroup;
-}
 #else
 #define for_each_memcg_cache_index(_idx)   \
for (; NULL; )
@@ -886,11 +876,6 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void memcg_kmem_put_cache(struct kmem_cache *cachep)
 {
 }
-
-static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
-{
-   return NULL;
-}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 28237476b055..00d0a70af70a 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -68,10 +68,10 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr)
 {
struct mem_cgroup *memcg;
 
-   if (!nlru->memcg_lrus)
+   if (!memcg_kmem_enabled() || !nlru->memcg_lrus)
return &nlru->lru;
 
-   memcg = mem_cgroup_from_kmem(ptr);
+   memcg = virt_to_head_page(ptr)->mem_cgroup;
if (!memcg)
return &nlru->lru;
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] memcg: unify slab and other kmem pages charging

2015-10-17 Thread Vladimir Davydov

On Fri, Oct 16, 2015 at 05:19:32PM -0700, Johannes Weiner wrote:
...
> I think it'd be better to have an outer function than a magic
> parameter for the memcg lookup. Could we fold this in there?

Yeah, that looks neater. Thanks!

Andrew, could you please fold this one too?

> 
> ---
> 
> Signed-off-by: Johannes Weiner 
> ---
>  include/linux/memcontrol.h |  7 ---
>  mm/memcontrol.c| 36 ++--
>  mm/slab.h  |  4 ++--
>  3 files changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 47677ac..730a65d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -756,8 +756,9 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup 
> *memcg)
>   * conditions, but because they are pretty simple, they are expected to be
>   * fast.
>   */
> -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order,
> - struct mem_cgroup *memcg);
> +int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> +   struct mem_cgroup *memcg);
> +int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge(struct page *page, int order);
>  
>  /*
> @@ -797,7 +798,7 @@ static __always_inline int memcg_kmem_charge(struct page 
> *page,
>  {
>   if (__memcg_kmem_bypass(gfp))
>   return 0;
> - return __memcg_kmem_charge(page, gfp, order, NULL);
> + return __memcg_kmem_charge(page, gfp, order);
>  }
>  
>  /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 15db655..6fc9959 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2378,39 +2378,39 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
>   css_put(&cachep->memcg_params.memcg->css);
>  }
>  
> -/*
> - * If @memcg != NULL, charge to @memcg, otherwise charge to the memcg the
> - * current task belongs to.
> - */
> -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order,
> - struct mem_cgroup *memcg)
> +int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> +   struct mem_cgroup *memcg)
>  {
> - struct page_counter *counter;
>   unsigned int nr_pages = 1 << order;
> - bool put = false;
> + struct page_counter *counter;
>   int ret = 0;
>  
> - if (!memcg) {
> - memcg = get_mem_cgroup_from_mm(current->mm);
> - put = true;
> - }
>   if (!memcg_kmem_is_active(memcg))
> - goto out;
> + return 0;
>  
>   ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter);
>   if (ret)
> - goto out;
> + return ret;
>  
>   ret = try_charge(memcg, gfp, nr_pages);
>   if (ret) {
>   page_counter_uncharge(&memcg->kmem, nr_pages);
> - goto out;
> + return ret;
>   }
>  
>   page->mem_cgroup = memcg;
> -out:
> - if (put)
> - css_put(&memcg->css);
> +
> + return 0;
> +}
> +
> +int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
> +{
> + struct mem_cgroup *memcg;
> + int ret;
> +
> + memcg = get_mem_cgroup_from_mm(current->mm);
> + ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> + css_put(&memcg->css);
>   return ret;
>  }
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index 3d667a4..27492eb 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -244,8 +244,8 @@ static __always_inline int memcg_charge_slab(struct page 
> *page,
>   return 0;
>   if (is_root_cache(s))
>   return 0;
> - return __memcg_kmem_charge(page, gfp, order,
> -s->memcg_params.memcg);
> + return __memcg_kmem_charge_memcg(page, gfp, order,
> +  s->memcg_params.memcg);
>  }
>  
>  extern void slab_init_memcg_params(struct kmem_cache *);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy

2015-10-22 Thread Vladimir Davydov

Hi Johannes,

On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote:
...
> Patch #5 adds accounting and tracking of socket memory to the unified
> hierarchy memory controller, as described above. It uses the existing
> per-cpu charge caches and triggers high limit reclaim asynchroneously.
> 
> Patch #8 uses the vmpressure extension to equalize pressure between
> the pages tracked natively by the VM and socket buffer pages. As the
> pool is shared, it makes sense that while natively tracked pages are
> under duress the network transmit windows are also not increased.

First of all, I've no experience in networking, so I'm likely to be
mistaken. Nevertheless I beg to disagree that this patch set is a step
in the right direction. Here goes why.

I admit that your idea to get rid of explicit tcp window control knobs
and size it dynamically basing on memory pressure instead does sound
tempting, but I don't think it'd always work. The problem is that in
contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only
stop growing them. Now suppose a system hasn't experienced memory
pressure for a while. If we don't have explicit tcp window limit, tcp
buffers on such a system might have eaten almost all available memory
(because of network load/problems). If a user workload that needs a
significant amount of memory is started suddenly then, the network code
will receive a notification and surely stop growing buffers, but all
those buffers accumulated won't disappear instantly. As a result, the
workload might be unable to find enough free memory and have no choice
but invoke OOM killer. This looks unexpected from the user POV.

That said, I think we do need per memcg tcp window control similar to
what we have system-wide. In other words, Glauber's work makes sense to
me. You might want to point me at my RFC patch where I proposed to
revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my
mind since then. Now I think I was mistaken, luckily I was stopped.
However, I may be mistaken again :-)

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting

2015-10-22 Thread Vladimir Davydov

On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote:
> The tcp memory controller has extensive provisions for future memory
> accounting interfaces that won't materialize after all. Cut the code
> base down to what's actually used, now and in the likely future.
> 
> - There won't be any different protocol counters in the future, so a
>   direct sock->sk_memcg linkage is enough. This eliminates a lot of
>   callback maze and boilerplate code, and restores most of the socket
>   allocation code to pre-tcp_memcontrol state.
> 
> - There won't be a tcp control soft limit, so integrating the memcg

In fact, the code is ready for the "soft" limit (I mean min, pressure,
max tuple), it just lacks a knob.

>   code into the global skmem limiting scheme complicates things
>   unnecessarily. Replace all that with simple and clear charge and
>   uncharge calls--hidden behind a jump label--to account skb memory.
> 
> - The previous jump label code was an elaborate state machine that
>   tracked the number of cgroups with an active socket limit in order
>   to enable the skmem tracking and accounting code only when actively
>   necessary. But this is overengineered: it was meant to protect the
>   people who never use this feature in the first place. Simply enable
>   the branches once when the first limit is set until the next reboot.
> 
...
> @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const 
> struct sock *sk)
>   if (!sk->sk_prot->memory_pressure)
>   return false;
>  
> - if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> - return !!sk->sk_cgrp->memory_pressure;
> -

AFAIU, now we won't shrink the window on hitting the limit, i.e. this
patch subtly changes the behavior of the existing knobs, potentially
breaking them.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

2015-10-22 Thread Vladimir Davydov

On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote:
...
> @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk)
>   */
>  bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> + unsigned int batch = max(CHARGE_BATCH, nr_pages);
>   struct page_counter *counter;
> + bool force = false;
>  
> - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter))
> + return true;
> + page_counter_charge(&memcg->skmem, nr_pages);
> + return false;
> + }
> +
> + if (consume_stock(memcg, nr_pages))
>   return true;
> +retry:
> + if (page_counter_try_charge(&memcg->memory, batch, &counter))
> + goto done;

Currently, we use memcg->memory only for charging memory pages. Besides,
every page charged to this counter (including kmem) has ->mem_cgroup
field set appropriately. This looks consistent and nice. As an extra
benefit, we can track all pages charged to a memory cgroup via
/proc/kapgecgroup.

Now, you charge "window size" to it, which AFAIU isn't necessarily equal
to the amount of memory actually consumed by the cgroup for socket
buffers. I think this looks ugly and inconsistent with the existing
behavior. I agree that we need to charge socker buffers to ->memory, but
IMO we should do that per each skb page, using memcg_kmem_charge_kmem
somewhere in alloc_skb_with_frags invoking the reclaimer just as we do
for kmalloc, while tcp window size control should stay aside.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity

2015-10-22 Thread Vladimir Davydov

On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote:
...
> @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   }
>   }
>  
> + vmpressure(sc->gfp_mask, memcg,
> +sc->nr_scanned - scanned,
> +sc->nr_reclaimed - reclaimed);
> +
>   /*
>* Direct reclaim and kswapd have to scan all memory
>* cgroups to fulfill the overall scan target for the
> @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct 
> scan_control *sc,
>   }
>   } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> -sc->nr_scanned - nr_scanned,
> -sc->nr_reclaimed - nr_reclaimed);
> -
>   if (sc->nr_reclaimed - nr_reclaimed)
>   reclaimable = true;
>  

I may be mistaken, but AFAIU this patch subtly changes the behavior of
vmpressure visible from the userspace: w/o this patch a userspace
process will only receive a notification for a memory cgroup only if
*this* memory cgroup calls reclaimer; with this patch userspace
notification will be issued even if reclaimer is invoked by any cgroup
up the hierarchy.

Thanks,
Vladimir
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure

2015-10-22 Thread Vladimir Davydov

On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote:
...
> @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work)
>   vmpr->reclaimed = 0;
>   spin_unlock(&vmpr->sr_lock);
>  
> + level = vmpressure_calc_level(scanned, reclaimed);
> +
> + if (level > VMPRESSURE_LOW) {

So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL?

> + struct mem_cgroup *memcg;
> + /*
> +  * Let the socket buffer allocator know that we are
> +  * having trouble reclaiming LRU pages.
> +  *
> +  * For hysteresis, keep the pressure state asserted
> +  * for a second in which subsequent pressure events
> +  * can occur.
> +  *
> +  * XXX: is vmpressure a global feature or part of
> +  * memcg? There shouldn't be anything memcg-specific
> +  * about exporting reclaim success ratios from the VM.
> +  */
> + memcg = container_of(vmpr, struct mem_cgroup, vmpressure);
> + if (memcg != root_mem_cgroup)
> + memcg->socket_pressure = jiffies + HZ;

Why 1 second?

Thanks,
Vladimir

> + }
> +
>   do {
> - if (vmpressure_event(vmpr, scanned, reclaimed))
> + if (vmpressure_event(vmpr, level))
>   break;
>   /*
>* If not handled, propagate the event upward into the
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled

2015-09-02 Thread Vladimir Davydov

[
  I'll try to summarize my point in one hunk instead of spreading it all
  over the e-mail, because IMO it's becoming a kind of difficult to
  follow. If you think that there's a question I dodge, please let me
  now and I'll try to address it separately.

  Also, adding Johannes to Cc (I noticed that I accidentally left him
  out), because this discussion seems to be fundamental and may affect
  our further steps dramatically.
]

On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote:
> On Tue 01-09-15 19:55:54, Vladimir Davydov wrote:
> > On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote:
> > > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote:
> > > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote:
> [...]
> > > > > How the fallback is implemented and whether trying other node before
> > > > > reclaiming from the preferred one is reasonable I dunno. This is for
> > > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the 
> > > > > behavior
> > > > > for memcg enabled setups subtly different. And that is bad.
> > > > 
> > > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while
> > > > inspecting if a NUMA node has free pages makes SLAB behaviour subtly
> > > > differently: SLAB will walk over all NUMA nodes for nothing instead of
> > > > invoking memcg reclaim once a free page is found.
> > > 
> > > So you are saying that the SLAB kmem accounting in this particular path
> > > is suboptimal because the fallback mode doesn't retry local node with
> > > the reclaim enabled before falling back to other nodes?
> > 
> > I'm just pointing out some subtle behavior changes in slab you were
> > opposed to.
> 
> I guess we are still not at the same page here. If the slab has a subtle
> behavior (and from what you are saying it seems it has the same behavior
> at the global scope) then we should strive to fix it rather than making
> it more obscure just to not expose GFP_NOWAIT to memcg which is not
> handled properly currently wrt. high limit (more on that below) which
> was the primary motivation for the patch AFAIU.

Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node
with __GFP_THISNODE and w/o __GFP_WAIT before falling back to
alloc_pages with the caller's context, it does the job normally done by
alloc_pages itself. It's not what is done massively.

Leaving slab charge path as is looks really ugly to me. Look, slab
iterates over all nodes, inspecting if they have free pages and fails
even if they do due to the memcg constraint...

My point is that what slab does is a pretty low level thing, normal
users call alloc_pages or kmalloc with flags corresponding to their
context. Of course, there may be special users trying optimistically
GFP_NOWAIT, but they aren't massive, and that simplifies things for
memcg a lot. I mean if we can rely on the fact that the number of
GFP_NOWAIT allocations that can occur in a row is limited we can use
direct reclaim (like memory.high) and/or task_work reclaim to fix
GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with
most its heuristics. I don't think that copying those heuristics is the
right thing to do, because in memcg case the same problems may be
resolved much easier, because we don't actually experience real memory
shortage when hitting the limit.

Moreover, we already treat some flags not in the same way as in case of
slab for simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged
instead of retrying infinitely. We ignore __GFP_THISNODE thing and we
just cannot take it into account. We ignore allocation order, because
that makes no sense for memcg.

To sum it up. Basically, there are two ways of handling kmemcg charges:

 1. Make the memcg try_charge mimic alloc_pages behavior.
 2. Make API functions (kmalloc, etc) work in memcg as if they were
called from the root cgroup, while keeping interactions between the
low level subsys (slab) and memcg private.

Way 1 might look appealing at the first glance, but at the same time it
is much more complex, because alloc_pages has grown over the years to
handle a lot of subtle situations that may arise on global memory
pressure, but impossible in memcg. What does way 1 give us then? We
can't insert try_charge directly to alloc_pages and have to spread its
calls all over the code anyway, so why is it better? Easier to use it in
places where users depend on buddy allocator peculiarities? There are
not many such users.

I understand that the idea of way 1 is to provide a well-defined memcg
API independent of the rest of the code, but that's just impossible. You
need special casing anyway. E.g. you need those get/

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1505 matches

Mail list logo