Re: [PATCH] cpuidle: menu: use nr_running instead of cpuload for calculating perf mult
Hello again Playing with cpu cgroups, I've found that performance of various workloads running inside cgroups can be significantly improved by increasing sched load resolution and that this improvement has already been committed to the kernel (c8b281161dfa4bb5d5be63fb036ce19347b88c63). Unfortunately, it turned out that the commit triggered power usage regression and as a result was reverted until the root cause of the regression was found (see commit e4c2fb0d5776b58049d2556b456144a4db3fe5a9). And the root cause is still unknown... I guess that the behavior of the cpuidle menu governor, which is stated in the patch below, can explain the regression. The governor uses this_cpu_load() to estimate idleness of the system. After commit c8b281161dfa4bb5d5be63fb036ce19347b88c63, which increases sched load resolution, this_cpu_load() seems to return large values even for small loads, which would probably lead to the cpuidle governor making wrong decisions due to overestimating the system load. So, this seems to be another reason to use some different performance multiplier in cpuidle governor. On Jun 4, 2012, at 2:24 PM, Vladimir Davydov wrote: > rq->cpuload strongly depends on cgroup hierarchy. For example, if hundreds of > tasks are running inside cpu:/test cgroup, the sum of cpuload over all cpus > won't exceed 1024 (by default). That makes the cpuidle menu governor take > wrong > decisions, which can negatively affect overall performance. > > To cope this, use nr_running last seen in __update_cpu_load() instead of > cpuload for calculating performance multiplier. > --- > drivers/cpuidle/governors/menu.c | 17 +++-- > include/linux/sched.h|2 +- > kernel/sched/core.c |7 --- > kernel/sched/sched.h |3 +++ > 4 files changed, 11 insertions(+), 18 deletions(-) > > diff --git a/drivers/cpuidle/governors/menu.c > b/drivers/cpuidle/governors/menu.c > index 0633575..2aa2625 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -123,17 +123,6 @@ struct menu_device { > }; > > > -#define LOAD_INT(x) ((x) >> FSHIFT) > -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) > - > -static int get_loadavg(void) > -{ > - unsigned long this = this_cpu_load(); > - > - > - return LOAD_INT(this) * 10 + LOAD_FRAC(this) / 10; > -} > - > static inline int which_bucket(unsigned int duration) > { > int bucket = 0; > @@ -170,13 +159,13 @@ static inline int which_bucket(unsigned int duration) > static inline int performance_multiplier(void) > { > int mult = 1; > + int this_cpu = smp_processor_id(); > > /* for higher loadavg, we are more reluctant */ > - > - mult += 2 * get_loadavg(); > + mult += 10 * nr_active_cpu(this_cpu); > > /* for IO wait tasks (per cpu!) we add 5x each */ > - mult += 10 * nr_iowait_cpu(smp_processor_id()); > + mult += 10 * nr_iowait_cpu(this_cpu); > > return mult; > } > diff --git a/include/linux/sched.h b/include/linux/sched.h > index f45c0b2..fb83d22 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -141,7 +141,7 @@ extern unsigned long nr_running(void); > extern unsigned long nr_uninterruptible(void); > extern unsigned long nr_iowait(void); > extern unsigned long nr_iowait_cpu(int cpu); > -extern unsigned long this_cpu_load(void); > +extern unsigned long nr_active_cpu(int cpu); > > > extern void calc_global_load(unsigned long ticks); > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 39eb601..8cc2011 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2155,10 +2155,10 @@ unsigned long nr_iowait_cpu(int cpu) > return atomic_read(&this->nr_iowait); > } > > -unsigned long this_cpu_load(void) > +unsigned long nr_active_cpu(int cpu) > { > - struct rq *this = this_rq(); > - return this->cpu_load[0]; > + struct rq *this = cpu_rq(cpu); > + return this->nr_active; > } > > > @@ -2494,6 +2494,7 @@ static void __update_cpu_load(struct rq *this_rq, > unsigned long this_load, > this_rq->nr_load_updates++; > > /* Update our load: */ > + this_rq->nr_active = this_rq->nr_running; > this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */ > for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { > unsigned long old_load, new_load; > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index ba9dccf..2564712 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -379,6 +379,9 @@ struct rq { > struct list_head leaf_rt_rq_list; &g
[PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only
mnt_drop_write() must be called only if mnt_want_write() succeeded, otherwise the mnt_writers counter will diverge. Signed-off-by: Vladimir Davydov Cc: Doug Ledford Cc: Andrew Morton Cc: KOSAKI Motohiro Cc: "Eric W. Biederman" --- ipc/mqueue.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/ipc/mqueue.c b/ipc/mqueue.c index e5c4f60..3953fda 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -840,7 +840,8 @@ out_putfd: fd = error; } mutex_unlock(&root->d_inode->i_mutex); - mnt_drop_write(mnt); + if (!ro) + mnt_drop_write(mnt); out_putname: putname(name); return fd; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mqueue: sys_mq_open: do not call mnt_drop_write() if read-only
On Mar 20, 2013, at 1:09 AM, Andrew Morton wrote: > On Tue, 19 Mar 2013 13:31:18 +0400 Vladimir Davydov > wrote: > >> mnt_drop_write() must be called only if mnt_want_write() succeeded, >> otherwise the mnt_writers counter will diverge. >> >> ... >> >> --- a/ipc/mqueue.c >> +++ b/ipc/mqueue.c >> @@ -840,7 +840,8 @@ out_putfd: >> fd = error; >> } >> mutex_unlock(&root->d_inode->i_mutex); >> -mnt_drop_write(mnt); >> +if (!ro) >> +mnt_drop_write(mnt); >> out_putname: >> putname(name); >> return fd; > > huh, that's been there for a while. What were the runtime-visible > effects of the bug? mnt_writers counters are used to check if remounting FS as read-only is OK, so after an extra mnt_drop_write() call, it would be impossible to remount mqueue FS as read-only. Besides, on umount a warning would be printed like this one: [ 194.714880] = [ 194.719680] [ BUG: bad unlock balance detected! ] [ 194.724488] 3.9.0-rc3 #5 Not tainted [ 194.728159] - [ 194.732958] a.out/12486 is trying to release lock (sb_writers) at: [ 194.739355] [] mnt_drop_write+0x1f/0x30 [ 194.744851] but there are no more locks to release! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] sched: initialize runtime to non-zero on cfs bw set
If cfs_rq->runtime_remaining is <= 0 then either - cfs_rq is throttled and waiting for quota redistribution, or - cfs_rq is currently executing and will be throttled on put_prev_entity, or - cfs_rq is not throttled and has not executed since its quota was set (runtime_remaining is set to 0 on cfs bandwidth reconfiguration). It is obvious that the last case is rather an exception from the rule "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as soon as it finishes its execution". Moreover, it can lead to a task hang as follows. If put_prev_task is called immediately after first pick_next_task after quota was set, "immediately" meaning rq->clock in both functions is the same, then the corresponding cfs_rq will be throttled. Besides being unfair (the cfs_rq has not executed in fact), the quota refilling timer can be idle at that time and it won't be activated on put_prev_task because update_curr calls account_cfs_rq_runtime, which activates the timer, only if delta_exec is strictly positive. As a result we can get a task "running" inside a throttled cfs_rq which will probably never be unthrottled. To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq will be throttled only if it has executed for some positive number of nanoseconds. -- Several times we had our customers encountered such hangs inside a VM (seems something is wrong or rather different in time accounting there). Analyzing crash dumps revealed that hung tasks were running inside cfs_rq's, which had the following setup cfs_rq->throttled=1 cfs_rq->runtime_enabled=1 cfs_rq->runtime_remaining=0 cfs_rq->tg->cfs_bandwidth.idle=1 cfs_rq->tg->cfs_bandwidth.timer_active=0 which conforms pretty nice to the explanation given above. Signed-off-by: Vladimir Davydov --- kernel/sched/core.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26058d0..c7a078f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) raw_spin_lock_irq(&rq->lock); cfs_rq->runtime_enabled = runtime_enabled; - cfs_rq->runtime_remaining = 0; + cfs_rq->runtime_remaining = 1; if (cfs_rq->throttled) unthrottle_cfs_rq(cfs_rq); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
On Feb 8, 2013, at 6:46 PM, Paul Turner wrote: > On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote: >> If cfs_rq->runtime_remaining is <= 0 then either >> - cfs_rq is throttled and waiting for quota redistribution, or >> - cfs_rq is currently executing and will be throttled on >> put_prev_entity, or >> - cfs_rq is not throttled and has not executed since its quota was set >> (runtime_remaining is set to 0 on cfs bandwidth reconfiguration). >> >> It is obvious that the last case is rather an exception from the rule >> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as >> soon as it finishes its execution". Moreover, it can lead to a task hang >> as follows. If put_prev_task is called immediately after first >> pick_next_task after quota was set, "immediately" meaning rq->clock in >> both functions is the same, then the corresponding cfs_rq will be >> throttled. Besides being unfair (the cfs_rq has not executed in fact), >> the quota refilling timer can be idle at that time and it won't be >> activated on put_prev_task because update_curr calls >> account_cfs_rq_runtime, which activates the timer, only if delta_exec is >> strictly positive. As a result we can get a task "running" inside a >> throttled cfs_rq which will probably never be unthrottled. >> >> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize >> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq >> will be throttled only if it has executed for some positive number of >> nanoseconds. >> -- >> Several times we had our customers encountered such hangs inside a VM >> (seems something is wrong or rather different in time accounting there). > > Yeah, looks like! > > It's not ultimately _super_ shocking; I can think of a few places where such > gremlins could lurk if they caused enough problems for someone to really go > digging. > >> Analyzing crash dumps revealed that hung tasks were running inside >> cfs_rq's, which had the following setup >> >> cfs_rq->throttled=1 >> cfs_rq->runtime_enabled=1 >> cfs_rq->runtime_remaining=0 >> cfs_rq->tg->cfs_bandwidth.idle=1 >> cfs_rq->tg->cfs_bandwidth.timer_active=0 >> >> which conforms pretty nice to the explanation given above. >> >> Signed-off-by: Vladimir Davydov >> --- >> kernel/sched/core.c |2 +- >> 1 files changed, 1 insertions(+), 1 deletions(-) >> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index 26058d0..c7a078f 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, >> u64 period, u64 quota) >> >> raw_spin_lock_irq(&rq->lock); >> cfs_rq->runtime_enabled = runtime_enabled; >> -cfs_rq->runtime_remaining = 0; >> +cfs_rq->runtime_remaining = 1; > > So I agree this is reasonably correct and would fix the issue identified. > However, one concern is that it would potentially grant a tick of execution > time on all cfs_rqs which could result in large quota violations on a many > core > machine; one trick then would be to give them "expired" quota; which would be > safe against put_prev_entity->check_cfs_runtime, e.g. Yeah, I missed that. Thank you for the update. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 1dff78a..4369231 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -7687,7 +7687,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, > u64 period, u64 quota) > > raw_spin_lock_irq(&rq->lock); > cfs_rq->runtime_enabled = runtime_enabled; > - cfs_rq->runtime_remaining = 0; > + /* > + * On re-definition of bandwidth values we allocate a trivial > + * amount of already expired quota. This guarantees that > + * put_prev_entity() cannot lead to a throttle event before we > + * have seen a call to account_cfs_runtime(); while not being > + * usable by newly waking, or set_curr_task_fair-ing, cpus > + * since it would be immediately expired, requiring > + * reassignment. > + */ > + cfs_rq->runtime_remaining = 1; > + cfs_rq->runtime_expires = rq_of(cfs_rq)->clock - 1; > > if (cfs_rq->throttled) > unthrottle_cfs_rq(cfs_rq); > > A
Re: [PATCH] sched: initialize runtime to non-zero on cfs bw set
On Feb 8, 2013, at 7:26 PM, Vladimir Davydov wrote: > On Feb 8, 2013, at 6:46 PM, Paul Turner wrote: > >> On Fri, Feb 08, 2013 at 11:10:46AM +0400, Vladimir Davydov wrote: >>> If cfs_rq->runtime_remaining is <= 0 then either >>> - cfs_rq is throttled and waiting for quota redistribution, or >>> - cfs_rq is currently executing and will be throttled on >>> put_prev_entity, or >>> - cfs_rq is not throttled and has not executed since its quota was set >>> (runtime_remaining is set to 0 on cfs bandwidth reconfiguration). >>> >>> It is obvious that the last case is rather an exception from the rule >>> "runtime_remaining<=0 iff cfs_rq is throttled or will be throttled as >>> soon as it finishes its execution". Moreover, it can lead to a task hang >>> as follows. If put_prev_task is called immediately after first >>> pick_next_task after quota was set, "immediately" meaning rq->clock in >>> both functions is the same, then the corresponding cfs_rq will be >>> throttled. Besides being unfair (the cfs_rq has not executed in fact), >>> the quota refilling timer can be idle at that time and it won't be >>> activated on put_prev_task because update_curr calls >>> account_cfs_rq_runtime, which activates the timer, only if delta_exec is >>> strictly positive. As a result we can get a task "running" inside a >>> throttled cfs_rq which will probably never be unthrottled. >>> >>> To avoid the problem, the patch makes tg_set_cfs_bandwidth initialize >>> runtime_remaining of each cfs_rq to 1 instead of 0 so that the cfs_rq >>> will be throttled only if it has executed for some positive number of >>> nanoseconds. >>> -- >>> Several times we had our customers encountered such hangs inside a VM >>> (seems something is wrong or rather different in time accounting there). >> >> Yeah, looks like! >> >> It's not ultimately _super_ shocking; I can think of a few places where such >> gremlins could lurk if they caused enough problems for someone to really go >> digging. >> >>> Analyzing crash dumps revealed that hung tasks were running inside >>> cfs_rq's, which had the following setup >>> >>> cfs_rq->throttled=1 >>> cfs_rq->runtime_enabled=1 >>> cfs_rq->runtime_remaining=0 >>> cfs_rq->tg->cfs_bandwidth.idle=1 >>> cfs_rq->tg->cfs_bandwidth.timer_active=0 >>> >>> which conforms pretty nice to the explanation given above. >>> >>> Signed-off-by: Vladimir Davydov >>> --- >>> kernel/sched/core.c |2 +- >>> 1 files changed, 1 insertions(+), 1 deletions(-) >>> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 26058d0..c7a078f 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group >>> *tg, u64 period, u64 quota) >>> >>> raw_spin_lock_irq(&rq->lock); >>> cfs_rq->runtime_enabled = runtime_enabled; >>> - cfs_rq->runtime_remaining = 0; >>> + cfs_rq->runtime_remaining = 1; >> >> So I agree this is reasonably correct and would fix the issue identified. >> However, one concern is that it would potentially grant a tick of execution >> time on all cfs_rqs which could result in large quota violations on a many >> core >> machine; one trick then would be to give them "expired" quota; which would be >> safe against put_prev_entity->check_cfs_runtime, e.g. > But wait, what "large quota violations" are you talking about? First, the granted quota is rather ephemeral because it will be neglected during the next cfs bw period. Second, we will actually grant NR_CPUS nanoseconds, is is only 1 ms for a 1000 cpus host, because the task group will sleep in the throttled state for each nanosecond consumed over the granted quota, i.e. it may be a spike in the load which the task group will eventually pay for. Besides, I guess quota reconfiguration is rather a rare event, so it should not be a concern. Thanks -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] netfilter: nf_conntrack: Batch cleanup
The patch introduces nf_conntrack_cleanup_net_list(), which cleanups nf_conntrack for a list of netns and calls synchronize_net() only once for them all. This should reduce netns destruction time. -- I've measured cleanup time for 1k dummy net ns. Here are the results: # modprobe nf_conntrack # time modprobe -r nf_conntrack real 0m10.337s user 0m0.000s sys0m0.376s # modprobe nf_conntrack # time modprobe -r nf_conntrack real0m5.661s user0m0.000s sys 0m0.216s Signed-off-by: Vladimir Davydov Cc: Patrick McHardy Cc: "David S. Miller" Cc: "Eric W. Biederman" --- include/net/netfilter/nf_conntrack_core.h |1 + net/netfilter/nf_conntrack_core.c | 46 - net/netfilter/nf_conntrack_standalone.c | 16 ++ 3 files changed, 43 insertions(+), 20 deletions(-) diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index 930275f..fb2b623 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -27,6 +27,7 @@ extern unsigned int nf_conntrack_in(struct net *net, extern int nf_conntrack_init_net(struct net *net); extern void nf_conntrack_cleanup_net(struct net *net); +extern void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list); extern int nf_conntrack_proto_pernet_init(struct net *net); extern void nf_conntrack_proto_pernet_fini(struct net *net); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index c8e001a..09c02ef 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1364,30 +1364,48 @@ void nf_conntrack_cleanup_end(void) */ void nf_conntrack_cleanup_net(struct net *net) { + LIST_HEAD(single); + + list_add(&net->exit_list, &single); + nf_conntrack_cleanup_net_list(&single); +} + +void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list) +{ + int busy; + struct net *net; + /* * This makes sure all current packets have passed through * netfilter framework. Roll on, two-stage module * delete... */ synchronize_net(); - i_see_dead_people: - nf_ct_iterate_cleanup(net, kill_all, NULL); - nf_ct_release_dying_list(net); - if (atomic_read(&net->ct.count) != 0) { +i_see_dead_people: + busy = 0; + list_for_each_entry(net, net_exit_list, exit_list) { + nf_ct_iterate_cleanup(net, kill_all, NULL); + nf_ct_release_dying_list(net); + if (atomic_read(&net->ct.count) != 0) + busy = 1; + } + if (busy) { schedule(); goto i_see_dead_people; } - nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size); - nf_conntrack_proto_pernet_fini(net); - nf_conntrack_helper_pernet_fini(net); - nf_conntrack_ecache_pernet_fini(net); - nf_conntrack_tstamp_pernet_fini(net); - nf_conntrack_acct_pernet_fini(net); - nf_conntrack_expect_pernet_fini(net); - kmem_cache_destroy(net->ct.nf_conntrack_cachep); - kfree(net->ct.slabname); - free_percpu(net->ct.stat); + list_for_each_entry(net, net_exit_list, exit_list) { + nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size); + nf_conntrack_proto_pernet_fini(net); + nf_conntrack_helper_pernet_fini(net); + nf_conntrack_ecache_pernet_fini(net); + nf_conntrack_tstamp_pernet_fini(net); + nf_conntrack_acct_pernet_fini(net); + nf_conntrack_expect_pernet_fini(net); + kmem_cache_destroy(net->ct.nf_conntrack_cachep); + kfree(net->ct.slabname); + free_percpu(net->ct.stat); + } } void *nf_ct_alloc_hashtable(unsigned int *sizep, int nulls) diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 6bcce40..6c69fbd 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -545,16 +545,20 @@ out_init: return ret; } -static void nf_conntrack_pernet_exit(struct net *net) +static void nf_conntrack_pernet_exit(struct list_head *net_exit_list) { - nf_conntrack_standalone_fini_sysctl(net); - nf_conntrack_standalone_fini_proc(net); - nf_conntrack_cleanup_net(net); + struct net *net; + + list_for_each_entry(net, net_exit_list, exit_list) { + nf_conntrack_standalone_fini_sysctl(net); + nf_conntrack_standalone_fini_proc(net); + } + nf_conntrack_cleanup_net_list(net_exit_list); } static struct pernet_operations nf_conntrack_net_ops = { - .init = nf_conntrack_pernet_init, - .exit = nf_conntrack_pernet_exit, + .init = nf_conntrac
[PATCH] net: batch nf_conntrack_net_exit
The patch introduces nf_conntrack_cleanup_list(), which cleanups nf_conntracks for a list of netns and calls synchronize_net() only once for them all. --- include/net/netfilter/nf_conntrack_core.h | 10 +- net/netfilter/nf_conntrack_core.c | 21 + net/netfilter/nf_conntrack_standalone.c | 14 +- 3 files changed, 31 insertions(+), 14 deletions(-) diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index d8f5b9f..f53b855 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -26,7 +26,15 @@ extern unsigned int nf_conntrack_in(struct net *net, struct sk_buff *skb); extern int nf_conntrack_init(struct net *net); -extern void nf_conntrack_cleanup(struct net *net); +extern void nf_conntrack_cleanup_list(struct list_head *net_exit_list); + +static inline void nf_conntrack_cleanup(struct net *net) +{ + LIST_HEAD(single); + + list_add(&net->exit_list, &single); + nf_conntrack_cleanup_list(&single); +} extern int nf_conntrack_proto_init(struct net *net); extern void nf_conntrack_proto_fini(struct net *net); diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index cf48755..afa62f7 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1363,21 +1363,26 @@ static void nf_conntrack_cleanup_net(struct net *net) /* Mishearing the voices in his head, our hero wonders how he's supposed to kill the mall. */ -void nf_conntrack_cleanup(struct net *net) +void nf_conntrack_cleanup_list(struct list_head *net_exit_list) { - if (net_eq(net, &init_net)) - RCU_INIT_POINTER(ip_ct_attach, NULL); + struct net *net; + + list_for_each_entry(net, net_exit_list, exit_list) + if (net_eq(net, &init_net)) + RCU_INIT_POINTER(ip_ct_attach, NULL); /* This makes sure all current packets have passed through netfilter framework. Roll on, two-stage module delete... */ synchronize_net(); - nf_conntrack_proto_fini(net); - nf_conntrack_cleanup_net(net); - if (net_eq(net, &init_net)) { - RCU_INIT_POINTER(nf_ct_destroy, NULL); - nf_conntrack_cleanup_init_net(); + list_for_each_entry(net, net_exit_list, exit_list) { + nf_conntrack_proto_fini(net); + nf_conntrack_cleanup_net(net); + if (net_eq(net, &init_net)) { + RCU_INIT_POINTER(nf_ct_destroy, NULL); + nf_conntrack_cleanup_init_net(); + } } } diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 9b39432..8f60708 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -551,16 +551,20 @@ out_init: return ret; } -static void nf_conntrack_net_exit(struct net *net) +static void nf_conntrack_net_exit(struct list_head *net_exit_list) { - nf_conntrack_standalone_fini_sysctl(net); - nf_conntrack_standalone_fini_proc(net); - nf_conntrack_cleanup(net); + struct net *net; + + list_for_each_entry(net, net_exit_list, exit_list) { + nf_conntrack_standalone_fini_sysctl(net); + nf_conntrack_standalone_fini_proc(net); + } + nf_conntrack_cleanup_list(net_exit_list); } static struct pernet_operations nf_conntrack_net_ops = { .init = nf_conntrack_net_init, - .exit = nf_conntrack_net_exit, + .exit_batch = nf_conntrack_net_exit, }; static int __init nf_conntrack_standalone_init(void) -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC] sched: move h_load calculation to task_h_load
The bad thing about update_h_load(), which computes hierarchical load factor for task groups, is that it is called for each task group in the system before every load balancer run, and since rebalance can be triggered very often, this function can eat really a lot of cpu time if there are many cpu cgroups in the system. Although the situation was improved significantly by commit a35b646 ('sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies'), the problem still can arise under some kinds of loads, e.g. when cpus are switching from idle to busy and back very frequently. For instance, when I start 1000 of processes that wake up every millisecond on my 8 cpus host, 'top' and 'perf top' show: Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si Events: 243K cycles 7.57% [kernel] [k] __schedule 7.08% [kernel] [k] timerqueue_add 6.13% libc-2.12.so [.] usleep Then if I create 1 *idle* cpu cgroups (no processes in them), cpu usage increases significantly although the 'wakers' are still executing in the root cpu cgroup: Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si Events: 230K cycles 24.56% [kernel][k] tg_load_down 5.76% [kernel][k] __schedule This happens because this particular kind of load triggers 'new idle' rebalance very frequently, which requires calling update_h_load(), which, in turn, calls tg_load_down() for every *idle* cpu cgroup even though it is absolutely useless, because idle cpu cgroups have no tasks to pull. This patch tries to improve the situation by making h_load calculation proceed only when h_load is really necessary. To achieve this, it substitutes update_h_load() with update_cfs_rq_h_load(), which computes h_load only for a given cfs_rq and all its ascendants, and makes the load balancer call this function whenever it considers if a task should be pulled, i.e. it moves h_load calculations directly to task_h_load(). For h_load of the same cfs_rq not to be updated multiple times (in case several tasks in the same cgroup are considered during the same balance run), the patch keeps the time of the last h_load update for each cfs_rq and breaks calculation when it finds h_load to be uptodate. The benefit of it is that h_load is computed only for those cfs_rq's, which really need it, in particular all idle task groups are skipped. Although this, in fact, moves h_load calculation under rq lock, it should not affect latency much, because the amount of work done under rq lock while trying to pull tasks is limited by sched_nr_migrate. After the patch applied with the setup described above (1000 wakers in the root cgroup and 1 idle cgroups), I get: Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si Events: 242K cycles 7.57% [kernel] [k] __schedule 6.70% [kernel] [k] timerqueue_add 5.93% libc-2.12.so [.] usleep Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 56 ++ kernel/sched/sched.h |7 +++ 2 files changed, 28 insertions(+), 35 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f77f9c5..de90690 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4171,47 +4171,46 @@ static void update_blocked_averages(int cpu) } /* - * Compute the cpu's hierarchical load factor for each task group. + * Compute the hierarchical load factor for cfs_rq and all its ascendants. * This needs to be done in a top-down fashion because the load of a child * group is a fraction of its parents load. */ -static int tg_load_down(struct task_group *tg, void *data) +static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) { + struct rq *rq = rq_of(cfs_rq); + struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)]; unsigned long load; - long cpu = (long)data; - if (!tg->parent) { - load = cpu_rq(cpu)->avg.load_avg_contrib; - } else { - load = tg->parent->cfs_rq[cpu]->h_load; - load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib, - tg->parent->cfs_rq[cpu]->runnable_load_avg + 1); + cfs_rq->h_load_next = NULL; + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + cfs_rq->h_load_next = se; + if (cfs_rq->last_h_load_update == rq->clock) + break; } - tg->cfs_rq[cpu]->h_load = load; - - return 0; -} - -static void update_h_load(long cpu) -{ - struct rq *rq = cpu_rq(cpu); - unsigned long now = jiffies; - - if (rq->h_load_throttle == now) - return; - - rq->h_load_throttle = now; + if (!se) { + cfs_rq-&g
[PATCH] sched: Fix task_h_load calculation
Patch a003a2 (sched: Consider runnable load average in move_tasks()) sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is always 0. This mistype leads to all tasks having weight 0 when load balancing in a cpu-cgroup enabled setup. There obviously should be sum of weights of all runnable tasks there instead. Fix it. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9b3fe1c..13abc29 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4242,7 +4242,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) } if (!se) { - cfs_rq->h_load = rq->avg.load_avg_contrib; + cfs_rq->h_load = cfs_rq->runnable_load_avg; cfs_rq->last_h_load_update = now; } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] sched: fix_small_imbalance: Fix local->avg_load > busiest->avg_load case
In busiest->group_imb case we can come to fix_small_imbalance() with local->avg_load > busiest->avg_load. This can result in wrong imbalance fix-up, because there is the following check there where all the members are unsigned: if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >= (scaled_busy_load_per_task * imbn)) { env->imbalance = busiest->load_per_task; return; } As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix it by substituting the subtraction with an equivalent addition in the check. -- The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 507a8a9..bdaf1fc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4823,8 +4823,8 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) (busiest->load_per_task * SCHED_POWER_SCALE) / busiest->group_power; - if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >= - (scaled_busy_load_per_task * imbn)) { + if (busiest->avg_load + scaled_busy_load_per_task >= + local->avg_load + (scaled_busy_load_per_task * imbn)) { env->imbalance = busiest->load_per_task; return; } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] sched: calculate_imbalance: Fix local->avg_load > sds->avg_load case
In busiest->group_imb case we can come to calculate_imbalance() with local->avg_load >= busiest->avg_load >= sds->avg_load. This can result in imbalance overflow, because it is calculated as follows env->imbalance = min( max_pull * busiest->group_power, (sds->avg_load - local->avg_load) * local->group_power ) / SCHED_POWER_SCALE; As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix this by skipping the assignment and assuming imbalance=0 in case local->avg_load > sds->avg_load. -- The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9b3fe1c..507a8a9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4896,7 +4896,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * max load less than avg load(as we skip the groups at or below * its cpu_power, while calculating max_load..) */ - if (busiest->avg_load < sds->avg_load) { + if (busiest->avg_load <= sds->avg_load || + local->avg_load >= sds->avg_load) { env->imbalance = 0; return fix_small_imbalance(env, sds); } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] sched: load_balance: Prevent reselect prev dst_cpu if some pinned
Currently new_dst_cpu is prevented from being reselected actually, not dst_cpu. This can result in attempting to pull tasks to this_cpu twice. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c |6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9b3fe1c..cd59640 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5269,15 +5269,15 @@ more_balance: */ if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { + /* Prevent to re-select dst_cpu via env's cpus */ + cpumask_clear_cpu(env.dst_cpu, env.cpus); + env.dst_rq = cpu_rq(env.new_dst_cpu); env.dst_cpu = env.new_dst_cpu; env.flags &= ~LBF_SOME_PINNED; env.loop = 0; env.loop_break = sched_nr_migrate_break; - /* Prevent to re-select dst_cpu via env's cpus */ - cpumask_clear_cpu(env.dst_cpu, env.cpus); - /* * Go back to "more_balance" rather than "redo" since we * need to continue with same src_cpu. -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] sched: load_balance: Reset env when going to redo due to all pinned
Firstly, reset env.dst_cpu/dst_rq to this_cpu/this_rq, because it could have changed in 'some pinned' case. Otherwise, should_we_balance() can stop balancing beforehand. Secondly, reset env.flags, because it can have LBF_SOME_PINNED set. Thirdly, reset env.dst_grpmask cpus in env.cpus to allow handling 'some pinned' case when pulling tasks from a new busiest cpu. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cd59640..d840e51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5289,8 +5289,16 @@ more_balance: if (unlikely(env.flags & LBF_ALL_PINNED)) { cpumask_clear_cpu(cpu_of(busiest), cpus); if (!cpumask_empty(cpus)) { - env.loop = 0; - env.loop_break = sched_nr_migrate_break; + env.dst_cpu = this_cpu; + env.dst_rq = this_rq; + env.flags = 0; + env.loop= 0; + env.loop_break = sched_nr_migrate_break; + + /* Reset cpus cleared in LBF_SOME_PINNED case */ + if (env.dst_grpmask) + cpumask_or(cpus, cpus, env.dst_grpmask); + goto redo; } goto out_balanced; -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] sched: calculate_imbalance: Fix local->avg_load > sds->avg_load case
On 09/16/2013 09:52 AM, Peter Zijlstra wrote: On Sun, Sep 15, 2013 at 05:49:13PM +0400, Vladimir Davydov wrote: In busiest->group_imb case we can come to calculate_imbalance() with local->avg_load >= busiest->avg_load >= sds->avg_load. This can result in imbalance overflow, because it is calculated as follows env->imbalance = min( max_pull * busiest->group_power, (sds->avg_load - local->avg_load) * local->group_power ) / SCHED_POWER_SCALE; As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix this by skipping the assignment and assuming imbalance=0 in case local->avg_load > sds->avg_load. -- The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9b3fe1c..507a8a9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4896,7 +4896,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * max load less than avg load(as we skip the groups at or below * its cpu_power, while calculating max_load..) */ - if (busiest->avg_load < sds->avg_load) { + if (busiest->avg_load <= sds->avg_load || + local->avg_load >= sds->avg_load) { env->imbalance = 0; return fix_small_imbalance(env, sds); } Why the = part? Surely 'busiest->avg_load < sds->avg_load || local->avg_load > sds->avg_load' avoids both underflows? Of course it does, but env->imbalance will be assigned to 0 anyway in = case, so why not go shortcut? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] sched: load_balance: Reset env when going to redo due to all pinned
On 09/16/2013 09:43 AM, Peter Zijlstra wrote: On Sun, Sep 15, 2013 at 09:30:14PM +0400, Vladimir Davydov wrote: Firstly, reset env.dst_cpu/dst_rq to this_cpu/this_rq, because it could have changed in 'some pinned' case. Otherwise, should_we_balance() can stop balancing beforehand. Secondly, reset env.flags, because it can have LBF_SOME_PINNED set. Thirdly, reset env.dst_grpmask cpus in env.cpus to allow handling 'some pinned' case when pulling tasks from a new busiest cpu. Did you actually run into any problems because of this? IRL no, and now I see that catching 'all pinned' after 'some pinned' case can only happen if a task changes its affinity mask or gets throttled during load balance run, which is very unlikely. So this patch is rather for sanity and can be safely dropped. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index cd59640..d840e51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5289,8 +5289,16 @@ more_balance: if (unlikely(env.flags & LBF_ALL_PINNED)) { cpumask_clear_cpu(cpu_of(busiest), cpus); if (!cpumask_empty(cpus)) { - env.loop = 0; - env.loop_break = sched_nr_migrate_break; + env.dst_cpu = this_cpu; + env.dst_rq = this_rq; + env.flags = 0; + env.loop= 0; + env.loop_break = sched_nr_migrate_break; + + /* Reset cpus cleared in LBF_SOME_PINNED case */ + if (env.dst_grpmask) + cpumask_or(cpus, cpus, env.dst_grpmask); + goto redo; } goto out_balanced; So the problem I have with this is that it removes the bound on the number of iterations we do. Currently we're limited by the bits in cpus, but by resetting those we can do on and on and on... find_busiest_group() never selects the local group, doesn't it? So none of env.dst_grpmask, which is initialized to sched_group_cpus(this_rq->sd), can be selected for the source cpu. That said, resetting env.dst_grpmask bits in the cpus bitmask actually doesn't affect the number of balance iterations. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/19] pramfs
On 09/07/2013 08:22 PM, Marco Stornelli wrote: Il 07/09/2013 16:58, richard -rw- weinberger ha scritto: On Sat, Sep 7, 2013 at 10:14 AM, Marco Stornelli wrote: Hi all, this is an attempt to include pramfs in mainline. At the moment pramfs has been included in LTSI kernel. Since last review the code is more or less the same but, with a really big thanks to Vladimir Davydov and Parallels, the development of fsck has been started and we have now the possibility to correct fs errors due to corruption. It's a "young" tool but we are working on it. You can clone the code from our repos: git clone git://git.code.sf.net/p/pramfs/code pramfs-code git clone git://git.code.sf.net/p/pramfs/Tools pramfs-Tools I'm a bit confused, what kind of non-volatile RAM is your fs targeting? Wouldn't it make sense to use pstore like arch/powerpc/platforms/pseries/nvram.c does? Usually battery-backed SRAM, but actually it can be used in any piece of ram directly accessible and it provides a normal and complete fs interface. Usually I do the fs test remapping my system ram. You can find documentation here: http://pramfs.sourceforge.net I'd like to add that in contrast to pstore, pramfs allows storing any files in it, not only system logs. This can be of value even on machines w/o special devices like sram/nvram: one can store data that should be quickly restored after reboot in conventional ram and use kexec to boot to a new kernel. One of the use cases of this could be checkpointing time-critical services to ram (using criu.org) to be quickly restored after a kernel update providing almost zero-downtime. Marco -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [tip:sched/core] sched/balancing: Fix cfs_rq-> task_h_load calculation
On 09/29/2013 01:47 PM, Yuanhan Liu wrote: On Fri, Sep 20, 2013 at 06:46:59AM -0700, tip-bot for Vladimir Davydov wrote: Commit-ID: 7e3115ef5149fc502e3a2e80719dba54a8e7409d Gitweb:http://git.kernel.org/tip/7e3115ef5149fc502e3a2e80719dba54a8e7409d Author: Vladimir Davydov AuthorDate: Sat, 14 Sep 2013 19:39:46 +0400 Committer: Ingo Molnar CommitDate: Fri, 20 Sep 2013 11:59:39 +0200 sched/balancing: Fix cfs_rq->task_h_load calculation Patch a003a2 (sched: Consider runnable load average in move_tasks()) sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is always 0. This mistype leads to all tasks having weight 0 when load balancing in a cpu-cgroup enabled setup. There obviously should be sum of weights of all runnable tasks there instead. Fix it. Hi Vladimir, FYI, Here we found a 17% netperf regression by this patch. Here are some changed stats between this commit 7e3115ef5149fc502e3a2e80719dba54a8e7409d and it's parent(3029ede39373c368f402a76896600d85a4f7121b) Hello, Could you please report the following info: 1) the test machine cpu topology (i.e. output of /sys/devices/system/cpu/cpu*/{thread_siblings_list,core_siblings_list}) 2) kernel config you used during the test 3) the output of /sys/kernel/debug/sched_features (debugfs mounted). 4) netperf server/client options 5) did you place netserver into a separate cpu cgroup? Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC] sched: boost throttled entities on wakeups
If several tasks in different cpu cgroups are contending for the same resource (e.g. a semaphore) and one of those task groups is cpu limited (using cfs bandwidth control), the priority inversion problem is likely to arise: if a cpu limited task goes to sleep holding the resource (e.g. trying to take another semaphore), it can be throttled (i.e. removed from the runqueue), which will result in other, perhaps high-priority, tasks waiting until the low-priority task continues its execution. The patch tries to solve this problem by boosting tasks in throttled groups on wakeups, i.e. temporarily unthrottling the groups a woken task belongs to in order to let the task finish its execution in kernel space. This obviously should eliminate the priority inversion problem on voluntary preemptable kernels. However, it does not solve the problem for fully preemptable kernels, although I guess the patch can be extended to handle those kernels too (e.g. by boosting forcibly preempted tasks thus not allowing to throttle). I wrote a simple test that demonstrates the problem (the test is attached). It creates two cgroups each of which is bound to exactly one cpu using cpusets, sets the limit of the first group to 10% and leaves the second group unlimited. Then in both groups it starts processes reading the same (big enough) file along with a couple of busyloops in the limited groups, and measures the read time. I've run the test 10 times for a 1 Gb file on a server with > 10 Gb of RAM and 4 cores x 2 hyperthreads (the kernel was with CONFIG_PREEMPT_VOLUNTARY=y). Here are the results: without the patch 40.03 +- 7.04 s with the patch 8.42 +- 0.48 s (Since the server's RAM can accommodate the whole file, the read time was the same for both groups) I would appreciate if you could answer the following questions regarding the priority inversion problem and the proposed approach: 1) Do you agree that the problem exists and should be sorted out? 2) If so, does the general approach proposed (unthrottling on wakeups) suits you? Why or why not? 3) If you think that the approach proposed is sane, what you dislike about the patch? Thank you! --- include/linux/sched.h |8 ++ kernel/sched/core.c |8 ++ kernel/sched/fair.c | 182 ++- kernel/sched/features.h |2 + kernel/sched/sched.h|6 ++ 5 files changed, 204 insertions(+), 2 deletions(-) sched-boost-throttled-entities-on-wakeups.patch Description: sched-boost-throttled-entities-on-wakeups.patch ioprio_inv_test.sh Description: ioprio_inv_test.sh
Re: [Devel] [PATCH RFC] sched: boost throttled entities on wakeups
There is an error in the test script: I forgot to initialize cpuset.mems of test cgroups - without it it is impossible to add a task into a cpuset cgroup. Sorry for that. Fixed version of the test script is attached. On Oct 18, 2012, at 11:32 AM, Vladimir Davydov wrote: > If several tasks in different cpu cgroups are contending for the same resource > (e.g. a semaphore) and one of those task groups is cpu limited (using cfs > bandwidth control), the priority inversion problem is likely to arise: if a > cpu > limited task goes to sleep holding the resource (e.g. trying to take another > semaphore), it can be throttled (i.e. removed from the runqueue), which will > result in other, perhaps high-priority, tasks waiting until the low-priority > task continues its execution. > > The patch tries to solve this problem by boosting tasks in throttled groups on > wakeups, i.e. temporarily unthrottling the groups a woken task belongs to in > order to let the task finish its execution in kernel space. This obviously > should eliminate the priority inversion problem on voluntary preemptable > kernels. However, it does not solve the problem for fully preemptable > kernels, > although I guess the patch can be extended to handle those kernels too (e.g. > by > boosting forcibly preempted tasks thus not allowing to throttle). > > I wrote a simple test that demonstrates the problem (the test is attached). It > creates two cgroups each of which is bound to exactly one cpu using cpusets, > sets the limit of the first group to 10% and leaves the second group > unlimited. > Then in both groups it starts processes reading the same (big enough) file > along with a couple of busyloops in the limited groups, and measures the read > time. > > I've run the test 10 times for a 1 Gb file on a server with > 10 Gb of RAM and > 4 cores x 2 hyperthreads (the kernel was with CONFIG_PREEMPT_VOLUNTARY=y). > Here > are the results: > > without the patch 40.03 +- 7.04 s > with the patch 8.42 +- 0.48 s > > (Since the server's RAM can accommodate the whole file, the read time was the > same for both groups) > > I would appreciate if you could answer the following questions regarding the > priority inversion problem and the proposed approach: > > 1) Do you agree that the problem exists and should be sorted out? > > 2) If so, does the general approach proposed (unthrottling on wakeups) suits > you? Why or why not? > > 3) If you think that the approach proposed is sane, what you dislike about the > patch? > > Thank you! > > --- > include/linux/sched.h |8 ++ > kernel/sched/core.c |8 ++ > kernel/sched/fair.c | 182 ++- > kernel/sched/features.h |2 + > kernel/sched/sched.h|6 ++ > 5 files changed, 204 insertions(+), 2 deletions(-) > > ioprio_inv_test.sh Description: ioprio_inv_test.sh
Re: [PATCH RFC] sched: boost throttled entities on wakeups
Thank you for the answer. On Oct 19, 2012, at 6:24 PM, Peter Zijlstra wrote: > its a quick hack similar to existing hacks done for rt, preferably we'd > do smarter things though. If you have any ideas how to fix this in a better way, please share. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 00/13] PRAM: Persistent over-kexec memory storage
s PRAM super block pfn, which is exported via /sys/kernel/pram, in the 'pram' boot param. * Since some memory is required for completing boot sequence, PRAM tracks all memory regions that have ever been reserved by other parts of the kernel and avoids using them for persistent memory. Since the device configuration cannot change during kexec, and the newly booted kernel is likely to have the same set of device drivers, it should work in most cases. * Since kexec may load the new kernel code to any memory region, it can destroy persistent memory. To exclude this, kexec should be forced to load the new kernel code to a memory region that is banned for PRAM. For that purpose, there is the 'pram_banned' boot param and --mem-min and --mem-max otpions of the kexec utility. * If a conflict still happens, it will be identified and all persistent memory will be discarded to prevent further errors. It is guaranteed by checksumming all data saved to PRAM. * tmpfs is saved to PRAM on unmount and loaded on mount if 'pram=NAME' mount option is passed. NAME specifies the PRAM node to save data to. This is to allow saving several tmpfs trees. * Saving tmpfs to PRAM is not well elaborated at present and serves rather as a proof of concept. Namely, only regular files without multiple hard links are supported and tmpfs may not be swapped out. If these requirements are not met, save to PRAM will be aborted spewing a message to the kernel log. This is not very difficult to fix, but at present one should turn off swap to test the feature. -- Future plans -- What we'd like to do: * Implement swap entries 'freezing' to allow saving a swapped out tmpfs. * Implement full support of tmpfs including saving dirs, special files, etc. * Implement SPLICE_F_MOVE, SPLICE_F_GIFT flags for splicing data from/to shmem. This would allow avoiding memory copying on checkpoint/restore. * Save uptodate fs cache on umount to be restored on mount after kexec. Thanks, Vladimir Davydov (13): mm: add PRAM API stubs and Kconfig mm: PRAM: implement node load and save functions mm: PRAM: implement page stream operations mm: PRAM: implement byte stream operations mm: PRAM: link nodes by pfn before reboot mm: PRAM: introduce super block mm: PRAM: preserve persistent memory at boot mm: PRAM: checksum saved data mm: PRAM: ban pages that have been reserved at boot time mm: PRAM: allow to ban arbitrary memory ranges mm: PRAM: allow to free persistent memory from userspace mm: shmem: introduce shmem_insert_page mm: shmem: enable saving to PRAM arch/x86/kernel/setup.c |2 + arch/x86/mm/init_32.c|5 + arch/x86/mm/init_64.c|5 + include/linux/pram.h | 62 +++ include/linux/shmem_fs.h | 29 ++ mm/Kconfig | 14 + mm/Makefile |1 + mm/bootmem.c |4 + mm/memblock.c|7 +- mm/pram.c| 1279 ++ mm/shmem.c | 97 +++- mm/shmem_pram.c | 378 ++ 12 files changed, 1878 insertions(+), 5 deletions(-) create mode 100644 include/linux/pram.h create mode 100644 mm/pram.c create mode 100644 mm/shmem_pram.c -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 13/13] mm: shmem: enable saving to PRAM
This patch illustrates how PRAM API can be used for making tmpfs 'persistent'. It adds 'pram=' option to tmpfs, which specifies the PRAM node to load/save FS tree from/to. If the option is passed on mount, shmem will look for the corresponding PRAM node and load the FS tree from it. On the subsequent unmount, it will save FS tree to that PRAM node. A typical usage scenario looks like: # mount -t tmpfs -o pram=mytmpfs none /mnt # echo something > /mnt/smth # umount /mnt # mount -t tmpfs -o pram=mytmpfs none /mnt # cat /mnt/smth Each FS tree is saved into two PRAM nodes, one acting as a byte stream and the other acting as a page stream. The byte stream is used for saving files metadata (name, permissions, etc) and data page offsets while the page stream accommodates file content pages. Current implementation serves for demonstration purposes and so is quite simplified: it supports only regular files in the root directory without multiple hard links, and it does not save swapped out files aborting if any. However, it can be elaborated to fully support tmpfs. --- include/linux/shmem_fs.h | 26 mm/Makefile |2 +- mm/shmem.c | 29 +++- mm/shmem_pram.c | 378 ++ 4 files changed, 430 insertions(+), 5 deletions(-) create mode 100644 mm/shmem_pram.c diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index da63308..0408421 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -23,6 +23,11 @@ struct shmem_inode_info { struct inodevfs_inode; }; +#define SHMEM_PRAM_NAME_MAX128 +struct shmem_pram_info { + char name[SHMEM_PRAM_NAME_MAX]; +}; + struct shmem_sb_info { unsigned long max_blocks; /* How many blocks are allowed */ struct percpu_counter used_blocks; /* How many are allocated */ @@ -33,6 +38,7 @@ struct shmem_sb_info { kgid_t gid; /* Mount gid for root directory */ umode_t mode; /* Mount mode for root directory */ struct mempolicy *mpol; /* default memory policy for mappings */ + struct shmem_pram_info *pram; }; static inline struct shmem_inode_info *SHMEM_I(struct inode *inode) @@ -62,7 +68,27 @@ static inline struct page *shmem_read_mapping_page( mapping_gfp_mask(mapping)); } +struct pagevec; + extern int shmem_insert_page(struct inode *inode, pgoff_t index, struct page *page, bool on_lru); +extern unsigned shmem_find_get_pages_and_swap(struct address_space *mapping, + pgoff_t start, unsigned int nr_pages, + struct page **pages, pgoff_t *indices); +extern void shmem_deswap_pagevec(struct pagevec *pvec); + +#ifdef CONFIG_PRAM +extern int shmem_parse_pram(const char *str, struct shmem_pram_info **pram); +extern void shmem_show_pram(struct seq_file *seq, struct shmem_pram_info *pram); +extern void shmem_save_pram(struct super_block *sb); +extern void shmem_load_pram(struct super_block *sb); +#else +static inline int shmem_parse_pram(const char *str, + struct shmem_pram_info **pram) { return 1; } +static inline void shmem_show_pram(struct seq_file *seq, + struct shmem_pram_info *pram) { } +static inline void shmem_save_pram(struct super_block *sb) { } +static inline void shmem_load_pram(struct super_block *sb) { } +#endif #endif diff --git a/mm/Makefile b/mm/Makefile index 33ad952..6a8c61d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -58,4 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o -obj-$(CONFIG_PRAM) += pram.o +obj-$(CONFIG_PRAM) += pram.o shmem_pram.o diff --git a/mm/shmem.c b/mm/shmem.c index 71fac31..2d6b618 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -399,7 +399,7 @@ out: /* * Like find_get_pages, but collecting swap entries as well as pages. */ -static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping, +unsigned shmem_find_get_pages_and_swap(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages, pgoff_t *indices) { @@ -465,7 +465,7 @@ static int shmem_free_swap(struct address_space *mapping, /* * Pagevec may contain swap entries, so shuffle up pages before releasing. */ -static void shmem_deswap_pagevec(struct pagevec *pvec) +void shmem_deswap_pagevec(struct pagevec *pvec) { int i, j; @@ -2535,6 +2535,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, mpol = NULL; if (mpol_parse_str(value, &mpol)) goto bad_val; + } else if (!strcmp(this_cha
[PATCH RFC 12/13] mm: shmem: introduce shmem_insert_page
The function inserts a memory page to a shmem file under an arbitrary offset. If there is something at the specified offset (page or swap), the function fails. The function will be sued by the next patch. --- include/linux/shmem_fs.h |3 ++ mm/shmem.c | 68 ++ 2 files changed, 71 insertions(+) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 30aa0dc..da63308 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -62,4 +62,7 @@ static inline struct page *shmem_read_mapping_page( mapping_gfp_mask(mapping)); } +extern int shmem_insert_page(struct inode *inode, + pgoff_t index, struct page *page, bool on_lru); + #endif diff --git a/mm/shmem.c b/mm/shmem.c index 1c44af7..71fac31 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -328,6 +328,74 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap) BUG_ON(error); } +int shmem_insert_page(struct inode *inode, + pgoff_t index, struct page *page, bool on_lru) +{ + struct address_space *mapping = inode->i_mapping; + struct shmem_inode_info *info = SHMEM_I(inode); + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); + gfp_t gfp = mapping_gfp_mask(mapping); + int err; + + if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT)) + return -EFBIG; + + err = -ENOSPC; + if (shmem_acct_block(info->flags)) + goto out; + if (sbinfo->max_blocks) { + if (percpu_counter_compare(&sbinfo->used_blocks, + sbinfo->max_blocks) >= 0) + goto out_unacct; + percpu_counter_inc(&sbinfo->used_blocks); + } + + if (!on_lru) { + SetPageSwapBacked(page); + __set_page_locked(page); + } else + lock_page(page); + + err = mem_cgroup_cache_charge(page, current->mm, + gfp & GFP_RECLAIM_MASK); + if (err) + goto out_unlock; + err = radix_tree_preload(gfp & GFP_RECLAIM_MASK); + if (!err) { + err = shmem_add_to_page_cache(page, mapping, index, gfp, NULL); + radix_tree_preload_end(); + } + if (err) + goto out_uncharge; + + if (!on_lru) + lru_cache_add_anon(page); + + spin_lock(&info->lock); + info->alloced++; + inode->i_blocks += BLOCKS_PER_PAGE; + shmem_recalc_inode(inode); + spin_unlock(&info->lock); + + flush_dcache_page(page); + SetPageUptodate(page); + set_page_dirty(page); + + unlock_page(page); + return 0; + +out_uncharge: + mem_cgroup_uncharge_cache_page(page); +out_unlock: + unlock_page(page); + if (sbinfo->max_blocks) + percpu_counter_add(&sbinfo->used_blocks, -1); +out_unacct: + shmem_unacct_blocks(info->flags, 1); +out: + return err; +} + /* * Like find_get_pages, but collecting swap entries as well as pages. */ -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 11/13] mm: PRAM: allow to free persistent memory from userspace
To free all space utilized for persistent memory, one can write 0 to /sys/kernel/pram. This will destroy all PRAM nodes that are not currently being read or written. --- mm/pram.c | 39 ++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/mm/pram.c b/mm/pram.c index 3ad769b..43ad85f 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -697,6 +697,32 @@ static void pram_truncate_node(struct pram_node *node) } +/* + * Free all nodes that are not under operation. + */ +static void pram_truncate(void) +{ + struct page *page, *tmp; + struct pram_node *node; + LIST_HEAD(dispose); + + mutex_lock(&pram_mutex); + list_for_each_entry_safe(page, tmp, &pram_nodes, lru) { + node = page_address(page); + if (!(node->flags & PRAM_ACCMODE_MASK)) + list_move(&page->lru, &dispose); + } + mutex_unlock(&pram_mutex); + + while (!list_empty(&dispose)) { + page = list_first_entry(&dispose, struct page, lru); + list_del(&page->lru); + node = page_address(page); + pram_truncate_node(node); + pram_free_page(node); + } +} + static void pram_stream_init(struct pram_stream *ps, struct pram_node *node, gfp_t gfp_mask) { @@ -1189,8 +1215,19 @@ static ssize_t show_pram_sb_pfn(struct kobject *kobj, return sprintf(buf, "%lx\n", pfn); } +static ssize_t store_pram_sb_pfn(struct kobject *kobj, + struct kobj_attribute *attr, const char *buf, size_t count) +{ + int val; + + if (kstrtoint(buf, 0, &val) || val) + return -EINVAL; + pram_truncate(); + return count; +} + static struct kobj_attribute pram_sb_pfn_attr = - __ATTR(pram, 0444, show_pram_sb_pfn, NULL); + __ATTR(pram, 0644, show_pram_sb_pfn, store_pram_sb_pfn); static struct attribute *pram_attrs[] = { &pram_sb_pfn_attr.attr, -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 07/13] mm: PRAM: preserve persistent memory at boot
Persistent memory preservation is done by reserving memory pages belonging to PRAM at early boot so that they will not be recycled. If memory reservation fails for some reason (e.g. memory region is busy), persistent memory will be lost. Currently, PRAM preservation is only implemented for x86. --- arch/x86/kernel/setup.c |2 + arch/x86/mm/init_32.c |4 + arch/x86/mm/init_64.c |4 + include/linux/pram.h|8 ++ mm/Kconfig |1 + mm/pram.c | 203 +++ 6 files changed, 222 insertions(+) diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index fae9134..caf1b29 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -69,6 +69,7 @@ #include #include #include +#include #include @@ -1127,6 +1128,7 @@ void __init setup_arch(char **cmdline_p) acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start); #endif + pram_reserve(); reserve_crashkernel(); vsmp_init(); diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 2d19001..da38426 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -779,6 +780,9 @@ void __init mem_init(void) after_bootmem = 1; + totalram_pages += pram_reserved_pages; + reservedpages -= pram_reserved_pages; + codesize = (unsigned long) &_etext - (unsigned long) &_text; datasize = (unsigned long) &_edata - (unsigned long) &_etext; initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin; diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 474e28f..8aa4bc4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include @@ -1077,6 +1078,9 @@ void __init mem_init(void) reservedpages = max_pfn - totalram_pages - absent_pages; after_bootmem = 1; + totalram_pages += pram_reserved_pages; + reservedpages -= pram_reserved_pages; + codesize = (unsigned long) &_etext - (unsigned long) &_text; datasize = (unsigned long) &_edata - (unsigned long) &_etext; initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin; diff --git a/include/linux/pram.h b/include/linux/pram.h index 61c536c..b7f2799 100644 --- a/include/linux/pram.h +++ b/include/linux/pram.h @@ -47,4 +47,12 @@ extern ssize_t pram_write(struct pram_stream *ps, const void *buf, size_t count); extern size_t pram_read(struct pram_stream *ps, void *buf, size_t count); +#ifdef CONFIG_PRAM +extern unsigned long pram_reserved_pages; +extern void pram_reserve(void); +#else +#define pram_reserved_pages 0UL +static inline void pram_reserve(void) { } +#endif + #endif /* _LINUX_PRAM_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 46337e8..f1e11a0 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -474,6 +474,7 @@ config FRONTSWAP config PRAM bool "Persistent over-kexec memory storage" + depends on X86 default n help This option adds the kernel API that enables saving memory pages of diff --git a/mm/pram.c b/mm/pram.c index 58ae9ed..380735f 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -1,3 +1,4 @@ +#include #include #include #include @@ -5,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -93,6 +95,8 @@ static struct pram_super_block *pram_sb; static LIST_HEAD(pram_nodes); /* linked through page::lru */ static DEFINE_MUTEX(pram_mutex); /* serializes open/close */ +unsigned long __initdata pram_reserved_pages; + /* * The PRAM super block pfn, see above. */ @@ -102,6 +106,196 @@ static int __init parse_pram_sb_pfn(char *arg) } early_param("pram", parse_pram_sb_pfn); +static void * __init pram_map_meta(unsigned long pfn) +{ + if (pfn >= max_low_pfn) + return ERR_PTR(-EINVAL); + return pfn_to_kaddr(pfn); +} + +static int __init pram_reserve_page(unsigned long pfn) +{ + int err = 0; + phys_addr_t base, size; + + if (pfn >= max_pfn) + return -EINVAL; + + base = PFN_PHYS(pfn); + size = PAGE_SIZE; + +#ifdef CONFIG_NO_BOOTMEM + if (memblock_is_region_reserved(base, size) || + memblock_reserve(base, size) < 0) + err = -EBUSY; +#else + err = reserve_bootmem(base, size, BOOTMEM_EXCLUSIVE); +#endif + if (!err) + pram_reserved_pages++; + return err; +} + +static void __init pram_unreserve_page(unsigned long pfn) +{ + free_bootmem(PFN_PHYS(pfn), PAGE_SIZE); + pram_reserved_pages--; +} + +static int __init pram_reserve_link(struct pram_link *link) +{ + int i; + int err = 0; + + for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) { + struct pram_entr
[PATCH RFC 10/13] mm: PRAM: allow to ban arbitrary memory ranges
Banning for PRAM memory ranges that have been reserved at boot time is not enough for avoiding all conflicts. The point is that kexec may load the new kernel code to some address range that have never been reserved possibly overwriting persistent data. Fortunately, it is possible to specify a memory range kexec will load the new kernel code into. Thus, to avoid kexec-vs-PRAM conflicts, it is enough to disallow for PRAM some memory range large enough to load the new kernel and make kexec load the new kernel code into that range. For that purpose, This patch adds ability to specify arbitrary banned ranges using the 'pram_banned' boot option. --- mm/pram.c | 45 + 1 file changed, 45 insertions(+) diff --git a/mm/pram.c b/mm/pram.c index 969ff3f..3ad769b 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -127,6 +127,15 @@ static bool __meminitdata pram_reservation_in_progress; * persistent data. Since the device configuration cannot change during kexec * and the newly booted kernel is likely to have a similar boot-time device * driver set, this hack should work in most cases. + * + * This solution has one exception. The point is that kexec may load the new + * kernel code to some address range that have never been reserved and thus + * banned for PRAM by the current kernel possibly overwriting persistent data. + * Fortunately, it is possible to specify an exact range kexec will load the + * new kernel code into. Thus, to avoid kexec-vs-PRAM conflicts, one should + * disallow for PRAM some memory range large enough to load the new kernel (see + * the 'pram_banned' boot param) and make kexec load the new kernel code into + * that range. */ /* @@ -378,6 +387,42 @@ out: } /* + * A comma separated list of memory regions that PRAM is not allowed to use. + */ +static int __init parse_pram_banned(char *arg) +{ + char *cur = arg, *tmp; + unsigned long long start, end; + + do { + start = memparse(cur, &tmp); + if (cur == tmp) { + pr_warning("pram_banned: Memory value expected\n"); + return -EINVAL; + } + cur = tmp; + if (*cur != '-') { + pr_warning("pram_banned: '-' expected\n"); + return -EINVAL; + } + cur++; + end = memparse(cur, &tmp); + if (cur == tmp) { + pr_warning("pram_banned: Memory value expected\n"); + return -EINVAL; + } + if (end <= start) { + pr_warning("pram_banned: end <= start\n"); + return -EINVAL; + } + pram_ban_region(PFN_DOWN(start), PFN_UP(end) - 1); + } while (*cur++ == ','); + + return 0; +} +early_param("pram_banned", parse_pram_banned); + +/* * Bans pfn range [start..end] (inclusive) for PRAM. */ void __meminit pram_ban_region(unsigned long start, unsigned long end) -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 09/13] mm: PRAM: ban pages that have been reserved at boot time
Obviously, not all memory ranges can be used for saving persistent over-kexec data, because some of them are reserved by the system core and various device drivers at boot time. If a memory range used for initialization of a particular device turns out to be busy because PRAM uses it for storing its data, the device driver initialization stage or even the whole system boot sequence may fail. As a workaround the current implementation uses a rather dirty hack. It tracks all memory regions that have ever been reserved during the boot sequence and avoids using pages belonging to those regions for storing persistent data. Since the device configuration cannot change during kexec and the newly booted kernel is likely to have a similar boot-time device driver set, this hack should work in most cases. --- arch/x86/mm/init_32.c |1 + arch/x86/mm/init_64.c |1 + include/linux/pram.h |4 + mm/bootmem.c |4 + mm/memblock.c |7 +- mm/pram.c | 211 - 6 files changed, 225 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index da38426..67b963a 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -782,6 +782,7 @@ void __init mem_init(void) totalram_pages += pram_reserved_pages; reservedpages -= pram_reserved_pages; + pram_show_banned(); codesize = (unsigned long) &_etext - (unsigned long) &_text; datasize = (unsigned long) &_edata - (unsigned long) &_etext; diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 8aa4bc4..fbe3e17 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1080,6 +1080,7 @@ void __init mem_init(void) totalram_pages += pram_reserved_pages; reservedpages -= pram_reserved_pages; + pram_show_banned(); codesize = (unsigned long) &_etext - (unsigned long) &_text; datasize = (unsigned long) &_edata - (unsigned long) &_etext; diff --git a/include/linux/pram.h b/include/linux/pram.h index b7f2799..d4f23e3 100644 --- a/include/linux/pram.h +++ b/include/linux/pram.h @@ -50,9 +50,13 @@ extern size_t pram_read(struct pram_stream *ps, void *buf, size_t count); #ifdef CONFIG_PRAM extern unsigned long pram_reserved_pages; extern void pram_reserve(void); +extern void pram_ban_region(unsigned long start, unsigned long end); +extern void pram_show_banned(void); #else #define pram_reserved_pages 0UL static inline void pram_reserve(void) { } +static inline void pram_ban_region(unsigned long start, unsigned long end) { } +static inline void pram_show_banned(void) { } #endif #endif /* _LINUX_PRAM_H */ diff --git a/mm/bootmem.c b/mm/bootmem.c index 2b0bcb0..34d0b42 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -328,6 +329,9 @@ static int __init __reserve(bootmem_data_t *bdata, unsigned long sidx, bdebug("silent double reserve of PFN %lx\n", idx + bdata->node_min_pfn); } + + pram_ban_region(sidx + bdata->node_min_pfn, + eidx + bdata->node_min_pfn - 1); return 0; } diff --git a/mm/memblock.c b/mm/memblock.c index b8d9147..d2c248e 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -19,6 +19,7 @@ #include #include #include +#include static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; @@ -553,13 +554,17 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size) int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) { struct memblock_type *_rgn = &memblock.reserved; + int err; memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n", (unsigned long long)base, (unsigned long long)base + size, (void *)_RET_IP_); - return memblock_add_region(_rgn, base, size, MAX_NUMNODES); + err = memblock_add_region(_rgn, base, size, MAX_NUMNODES); + if (!err) + pram_ban_region(PFN_DOWN(base), PFN_UP(base + size) - 1); + return err; } /** diff --git a/mm/pram.c b/mm/pram.c index 8a66a86..969ff3f 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -110,6 +111,47 @@ static LIST_HEAD(pram_nodes); /* linked through page::lru */ static DEFINE_MUTEX(pram_mutex); /* serializes open/close */ unsigned long __initdata pram_reserved_pages; +static bool __meminitdata pram_reservation_in_progress; + +/* + * Obviously, not all memory ranges can be used for saving persistent + * over-kexec data, because some of them are reserved by the sy
[PATCH RFC 06/13] mm: PRAM: introduce super block
The PRAM super block is the starting point for restoring persistent memory. If the kernel locates the super block at boot time, it will preserve the persistent memory structure from the previous kernel. To point the kernel to the location of the super block, one should pass its pfn via the 'pram' boot param. For that purpose, the pram super block pfn is exported via /sys/kernel/pram. If none is passed, persistent memory will not be preserved, and a new super block will be allocated. The current patch introduces only super block handling. Memory preservation will be implemented later. --- mm/pram.c | 94 +++-- 1 file changed, 92 insertions(+), 2 deletions(-) diff --git a/mm/pram.c b/mm/pram.c index c7706dc..58ae9ed 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -3,15 +3,18 @@ #include #include #include +#include #include #include #include #include #include +#include #include #include #include #include +#include #include /* @@ -66,12 +69,39 @@ struct pram_node { #define PRAM_ACCMODE_MASK 3 /* + * The PRAM super block contains data needed to restore the persistent memory + * structure on boot. The pointer to it (pfn) should be passed via the 'pram' + * boot param if one wants to restore persistent data saved by the previously + * executing kernel. For that purpose the kernel exports the pfn via + * /sys/kernel/pram. If none is passed, persistent memory if any will not be + * preserved and a new clean page will be allocated for the super block. + * + * The structure occupies a memory page. + */ +struct pram_super_block { + __u64 node_pfn; /* points to the first element of +* the node list */ +}; + +static unsigned long __initdata pram_sb_pfn; +static struct pram_super_block *pram_sb; + +/* * For convenience sake PRAM nodes are kept in an auxiliary doubly-linked list * connected through the lru field of the page struct. */ static LIST_HEAD(pram_nodes); /* linked through page::lru */ static DEFINE_MUTEX(pram_mutex); /* serializes open/close */ +/* + * The PRAM super block pfn, see above. + */ +static int __init parse_pram_sb_pfn(char *arg) +{ + return kstrtoul(arg, 16, &pram_sb_pfn); +} +early_param("pram", parse_pram_sb_pfn); + static inline struct page *pram_alloc_page(gfp_t gfp_mask) { return alloc_page(gfp_mask); @@ -161,6 +191,7 @@ static void pram_stream_init(struct pram_stream *ps, * Returns 0 on success, -errno on failure. * * Error values: + *%ENODEV: PRAM not available *%ENAMETOOLONG: name len >= PRAM_NAME_MAX *%ENOMEM: insufficient memory available *%EEXIST: node with specified name already exists @@ -175,6 +206,9 @@ int pram_prepare_save(struct pram_stream *ps, struct pram_node *node; int err = 0; + if (!pram_sb) + return -ENODEV; + BUG_ON(type != PRAM_PAGE_STREAM && type != PRAM_BYTE_STREAM); @@ -250,6 +284,7 @@ void pram_discard_save(struct pram_stream *ps) * Returns 0 on success, -errno on failure. * * Error values: + *%ENODEV: PRAM not available *%ENOENT: node with specified name does not exist *%EBUSY: save to required node has not finished yet *%EPERM: specified type conflicts with type of required node @@ -262,6 +297,9 @@ int pram_prepare_load(struct pram_stream *ps, struct pram_node *node; int err = 0; + if (!pram_sb) + return -ENODEV; + mutex_lock(&pram_mutex); node = pram_find_node(name); if (!node) { @@ -550,6 +588,7 @@ static void __pram_reboot(void) node->node_pfn = node_pfn; node_pfn = page_to_pfn(page); } + pram_sb->node_pfn = node_pfn; } static int pram_reboot(struct notifier_block *notifier, @@ -557,7 +596,8 @@ static int pram_reboot(struct notifier_block *notifier, { if (val != SYS_RESTART) return NOTIFY_DONE; - __pram_reboot(); + if (pram_sb) + __pram_reboot(); return NOTIFY_OK; } @@ -565,9 +605,59 @@ static struct notifier_block pram_reboot_notifier = { .notifier_call = pram_reboot, }; +static ssize_t show_pram_sb_pfn(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned long pfn = pram_sb ? PFN_DOWN(__pa(pram_sb)) : 0; + return sprintf(buf, "%lx\n", pfn); +} + +static struct kobj_attribute pram_sb_pfn_attr = + __ATTR(pram, 0444, show_pram_sb_pfn, NULL); + +static struct attribute *pram_attrs[] = { + &pram_sb_pfn_attr.attr, + NULL, +}; + +static struct attribute_group pram_attr_group = { + .attrs = pram_attrs, +}; + +/* returns non-zero on success */ +static int __init pram_init_sb(void) +{ + unsigned long pfn; + struct pram_node *node; + + if (!pram_sb) { + struct p
[PATCH RFC 05/13] mm: PRAM: link nodes by pfn before reboot
Since page structs, which are used for linking PRAM nodes, are cleared on boot, organize all PRAM nodes into a list singly-linked by pfn's before reboot to facilitate the node list restore in the new kernel. --- mm/pram.c | 50 ++ 1 file changed, 50 insertions(+) diff --git a/mm/pram.c b/mm/pram.c index f7eebe1..c7706dc 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -1,11 +1,15 @@ #include #include #include +#include #include #include #include +#include #include +#include #include +#include #include #include #include @@ -42,6 +46,9 @@ struct pram_link { * singly-linked list of PRAM link structures (see above), the node has a * pointer to the head of. * + * To facilitate data restore in the new kernel, before reboot all PRAM nodes + * are organized into a list singly-linked by pfn's (see pram_reboot()). + * * The structure occupies a memory page. */ struct pram_node { @@ -49,6 +56,7 @@ struct pram_node { __u32 type; /* data type, see enum pram_stream_type */ __u64 data_len; /* data size, only for byte streams */ __u64 link_pfn; /* points to the first link of the node */ + __u64 node_pfn; /* points to the next node in the node list */ __u8name[PRAM_NAME_MAX]; }; @@ -57,6 +65,10 @@ struct pram_node { #define PRAM_LOAD 2 #define PRAM_ACCMODE_MASK 3 +/* + * For convenience sake PRAM nodes are kept in an auxiliary doubly-linked list + * connected through the lru field of the page struct. + */ static LIST_HEAD(pram_nodes); /* linked through page::lru */ static DEFINE_MUTEX(pram_mutex); /* serializes open/close */ @@ -521,3 +533,41 @@ size_t pram_read(struct pram_stream *ps, void *buf, size_t count) } return read_count; } + +/* + * Build the list of PRAM nodes. + */ +static void __pram_reboot(void) +{ + struct page *page; + struct pram_node *node; + unsigned long node_pfn = 0; + + list_for_each_entry_reverse(page, &pram_nodes, lru) { + node = page_address(page); + if (WARN_ON(node->flags & PRAM_ACCMODE_MASK)) + continue; + node->node_pfn = node_pfn; + node_pfn = page_to_pfn(page); + } +} + +static int pram_reboot(struct notifier_block *notifier, + unsigned long val, void *v) +{ + if (val != SYS_RESTART) + return NOTIFY_DONE; + __pram_reboot(); + return NOTIFY_OK; +} + +static struct notifier_block pram_reboot_notifier = { + .notifier_call = pram_reboot, +}; + +static int __init pram_init(void) +{ + register_reboot_notifier(&pram_reboot_notifier); + return 0; +} +module_init(pram_init); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 02/13] mm: PRAM: implement node load and save functions
Persistent memory is divided into nodes, which can be saved and loaded independently of each other. PRAM nodes are kept on the list and identified by unique names. Whenever a save operation is initiated by calling pram_prepare_save(), a new node is created and linked to the list. When the save operation has been committed by calling pram_finish_save(), the node becomes loadable. A load operation can be then initiated by calling pram_prepare_load(), which deletes the node from the list and prepares the corresponding stream for loading data from it. After the load has been finished, the pram_finish_load() function must be called to free the node. Nodes are also deleted when a save operation is discarded, i.e. pram_discard_save() is called instead of pram_finish_save(). --- include/linux/pram.h |7 ++- mm/pram.c| 158 -- 2 files changed, 159 insertions(+), 6 deletions(-) diff --git a/include/linux/pram.h b/include/linux/pram.h index cf04548..5b8c2c1 100644 --- a/include/linux/pram.h +++ b/include/linux/pram.h @@ -5,7 +5,12 @@ #include #include -struct pram_stream; +struct pram_node; + +struct pram_stream { + gfp_t gfp_mask; + struct pram_node *node; +}; #define PRAM_NAME_MAX 256 /* including nul */ diff --git a/mm/pram.c b/mm/pram.c index cea0e87..3af2039 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -1,10 +1,75 @@ #include #include #include +#include #include +#include #include +#include #include +/* + * Persistent memory is divided into nodes that can be saved or loaded + * independently of each other. The nodes are identified by unique name + * strings. + * + * The structure occupies a memory page. + */ +struct pram_node { + __u32 flags; /* see PRAM_* flags below */ + __u32 type; /* data type, see enum pram_stream_type */ + + __u8name[PRAM_NAME_MAX]; +}; + +#define PRAM_SAVE 1 +#define PRAM_LOAD 2 +#define PRAM_ACCMODE_MASK 3 + +static LIST_HEAD(pram_nodes); /* linked through page::lru */ +static DEFINE_MUTEX(pram_mutex); /* serializes open/close */ + +static inline struct page *pram_alloc_page(gfp_t gfp_mask) +{ + return alloc_page(gfp_mask); +} + +static inline void pram_free_page(void *addr) +{ + free_page((unsigned long)addr); +} + +static inline void pram_insert_node(struct pram_node *node) +{ + list_add(&virt_to_page(node)->lru, &pram_nodes); +} + +static inline void pram_delete_node(struct pram_node *node) +{ + list_del(&virt_to_page(node)->lru); +} + +static struct pram_node *pram_find_node(const char *name) +{ + struct page *page; + struct pram_node *node; + + list_for_each_entry(page, &pram_nodes, lru) { + node = page_address(page); + if (strcmp(node->name, name) == 0) + return node; + } + return NULL; +} + +static void pram_stream_init(struct pram_stream *ps, +struct pram_node *node, gfp_t gfp_mask) +{ + memset(ps, 0, sizeof(*ps)); + ps->gfp_mask = gfp_mask; + ps->node = node; +} + /** * Create a persistent memory node with name @name and initialize stream @ps * for saving data to it. @@ -18,13 +83,49 @@ * * Returns 0 on success, -errno on failure. * + * Error values: + *%ENAMETOOLONG: name len >= PRAM_NAME_MAX + *%ENOMEM: insufficient memory available + *%EEXIST: node with specified name already exists + * * After the save has finished, pram_finish_save() (or pram_discard_save() in * case of failure) is to be called. */ int pram_prepare_save(struct pram_stream *ps, const char *name, enum pram_stream_type type, gfp_t gfp_mask) { - return -ENOSYS; + struct page *page; + struct pram_node *node; + int err = 0; + + BUG_ON(type != PRAM_PAGE_STREAM && + type != PRAM_BYTE_STREAM); + + if (strlen(name) >= PRAM_NAME_MAX) + return -ENAMETOOLONG; + + page = pram_alloc_page(GFP_KERNEL | __GFP_ZERO); + if (!page) + return -ENOMEM; + node = page_address(page); + + node->flags = PRAM_SAVE; + node->type = type; + strcpy(node->name, name); + + mutex_lock(&pram_mutex); + if (!pram_find_node(name)) + pram_insert_node(node); + else + err = -EEXIST; + mutex_unlock(&pram_mutex); + if (err) { + __free_page(page); + return err; + } + + pram_stream_init(ps, node, gfp_mask); + return 0; } /** @@ -33,7 +134,12 @@ int pram_prepare_save(struct pram_stream *ps, */ void pram_finish_save(struct pram_stream *ps) { - BUG(); + struct pram_node *node = ps->node; + + BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_SAVE); + + smp_wmb(); + node->flags &= ~PRAM_ACCMODE_MA
[PATCH RFC 03/13] mm: PRAM: implement page stream operations
Using the pram_save_page() function, one can populate PRAM nodes with memory pages, which can be then loaded using the pram_load_page() function. Saving a memory page to PRAM is implemented as storing the pfn in the PRAM node and incrementing its ref count so that it will not get freed after the last user puts it. --- include/linux/pram.h |3 + mm/pram.c| 166 +- 2 files changed, 167 insertions(+), 2 deletions(-) diff --git a/include/linux/pram.h b/include/linux/pram.h index 5b8c2c1..dd17316 100644 --- a/include/linux/pram.h +++ b/include/linux/pram.h @@ -6,10 +6,13 @@ #include struct pram_node; +struct pram_link; struct pram_stream { gfp_t gfp_mask; struct pram_node *node; + struct pram_link *link; /* current link */ + unsigned int page_index;/* next page index in link */ }; #define PRAM_NAME_MAX 256 /* including nul */ diff --git a/mm/pram.c b/mm/pram.c index 3af2039..a443eb0 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -5,19 +5,48 @@ #include #include #include +#include #include #include /* + * Represents a reference to a data page saved to PRAM. + */ +struct pram_entry { + __u32 flags;/* see PRAM_PAGE_* flags */ + __u64 pfn; /* the page frame number */ +}; + +/* + * Keeps references to data pages saved to PRAM. + * The structure occupies a memory page. + */ +struct pram_link { + __u64 link_pfn; /* points to the next link of the node */ + + /* the array occupies the rest of the link page; if the link is not +* full, the rest of the array must be filled with zeros */ + struct pram_entry entry[0]; +}; + +#define PRAM_LINK_ENTRIES_MAX \ + ((PAGE_SIZE-sizeof(struct pram_link))/sizeof(struct pram_entry)) + +/* * Persistent memory is divided into nodes that can be saved or loaded * independently of each other. The nodes are identified by unique name * strings. * + * References to data pages saved to a persistent memory node are kept in a + * singly-linked list of PRAM link structures (see above), the node has a + * pointer to the head of. + * * The structure occupies a memory page. */ struct pram_node { __u32 flags; /* see PRAM_* flags below */ __u32 type; /* data type, see enum pram_stream_type */ + __u64 link_pfn; /* points to the first link of the node */ __u8name[PRAM_NAME_MAX]; }; @@ -62,12 +91,46 @@ static struct pram_node *pram_find_node(const char *name) return NULL; } +static void pram_truncate_link(struct pram_link *link) +{ + int i; + unsigned long pfn; + struct page *page; + + for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) { + pfn = link->entry[i].pfn; + if (!pfn) + continue; + page = pfn_to_page(pfn); + put_page(page); + } +} + +static void pram_truncate_node(struct pram_node *node) +{ + unsigned long link_pfn; + struct pram_link *link; + + link_pfn = node->link_pfn; + while (link_pfn) { + link = pfn_to_kaddr(link_pfn); + pram_truncate_link(link); + link_pfn = link->link_pfn; + pram_free_page(link); + cond_resched(); + } + node->link_pfn = 0; + +} + static void pram_stream_init(struct pram_stream *ps, struct pram_node *node, gfp_t gfp_mask) { memset(ps, 0, sizeof(*ps)); ps->gfp_mask = gfp_mask; ps->node = node; + if (node->link_pfn) + ps->link = pfn_to_kaddr(node->link_pfn); } /** @@ -157,6 +220,7 @@ void pram_discard_save(struct pram_stream *ps) pram_delete_node(node); mutex_unlock(&pram_mutex); + pram_truncate_node(node); pram_free_page(node); } @@ -220,9 +284,46 @@ void pram_finish_load(struct pram_stream *ps) BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD); + pram_truncate_node(node); pram_free_page(node); } +/* + * Insert page to PRAM node allocating a new PRAM link if necessary. + */ +static int __pram_save_page(struct pram_stream *ps, + struct page *page, int flags) +{ + struct pram_node *node = ps->node; + struct pram_link *link = ps->link; + struct pram_entry *entry; + + if (!link || ps->page_index >= PRAM_LINK_ENTRIES_MAX) { + struct page *link_page; + unsigned long link_pfn; + + link_page = pram_alloc_page((ps->gfp_mask & GFP_RECLAIM_MASK) | + __GFP_ZERO); + if (!link_page) + return -ENOMEM; + + link_pfn = page_to_pfn(link_page); + if (link) + link->link_pfn = link_pfn; + else +
[PATCH RFC 01/13] mm: add PRAM API stubs and Kconfig
Persistent memory subsys or PRAM is intended to be used for saving memory pages of the currently executing kernel and restoring them after a kexec in the newly booted one. This can be utilized for speeding up reboot by leaving process memory and/or FS caches in-place. The proposed API: * Persistent memory is divided into nodes, which can be saved or loaded independently of each other. The nodes are identified by unique name strings. PRAM node is created (removed) when save (load) is initiated by calling pram_prepare_save() (pram_prepare_load()), see below. * For saving/loading data from a PRAM node an instance of the pram_stream struct is used. The struct is initialized by calling pram_prepare_save() for saving data or pram_prepare_load() for loading data. After save (load) is complete, pram_finish_save() (pram_finish_load()) must be called. If an error occurred during save, the saved data and the PRAM node may be freed by calling pram_discard_save() instead of pram_finish_save(). * Each pram_stream has a type, which determines the set of operations that may be used for saving/loading data. The type is defined by the pram_stream_type enum. Currently there are two stream types available: PRAM_PAGE_STREAM to save/load memory pages, and PRAM_BYTE_STREAM to save/load byte strings. For page streams pram_save_page() and pram_load_page() may be used, and for byte streams pram_write() and pram_read() may be used for saving and loading data respectively. Thus a sequence of operations for saving/loading data from PRAM should look like: * For saving data to PRAM: /* create PRAM node and initialize stream for saving data to it */ pram_prepare_save() /* save data to the node */ pram_save_page()[,...] /* for page stream, or pram_write()[,...] * ... for byte stream */ /* commit the save or discard and delete the node */ pram_finish_save() /* on success, or pram_discard_save() * ... in case of error */ * For loading data from PRAM: /* remove PRAM node from the list and initialize stream for * loading data from it */ pram_prepare_load() /* load data from the node */ pram_load_page()[,...] /* for page stream, or pram_read()[,...]* ... for byte stream */ /* free the node */ pram_finish_load() --- include/linux/pram.h | 38 +++ mm/Kconfig |9 mm/Makefile |1 + mm/pram.c| 131 ++ 4 files changed, 179 insertions(+) create mode 100644 include/linux/pram.h create mode 100644 mm/pram.c diff --git a/include/linux/pram.h b/include/linux/pram.h new file mode 100644 index 000..cf04548 --- /dev/null +++ b/include/linux/pram.h @@ -0,0 +1,38 @@ +#ifndef _LINUX_PRAM_H +#define _LINUX_PRAM_H + +#include +#include +#include + +struct pram_stream; + +#define PRAM_NAME_MAX 256 /* including nul */ + +enum pram_stream_type { + PRAM_PAGE_STREAM, + PRAM_BYTE_STREAM, +}; + +extern int pram_prepare_save(struct pram_stream *ps, + const char *name, enum pram_stream_type type, gfp_t gfp_mask); +extern void pram_finish_save(struct pram_stream *ps); +extern void pram_discard_save(struct pram_stream *ps); + +extern int pram_prepare_load(struct pram_stream *ps, + const char *name, enum pram_stream_type type); +extern void pram_finish_load(struct pram_stream *ps); + +#define PRAM_PAGE_LRU 0x01/* page is on the LRU */ + +/* page-stream specific methods */ +extern int pram_save_page(struct pram_stream *ps, + struct page *page, int flags); +extern struct page *pram_load_page(struct pram_stream *ps, int *flags); + +/* byte-stream specific methods */ +extern ssize_t pram_write(struct pram_stream *ps, + const void *buf, size_t count); +extern size_t pram_read(struct pram_stream *ps, void *buf, size_t count); + +#endif /* _LINUX_PRAM_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 3bea74f..46337e8 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -471,3 +471,12 @@ config FRONTSWAP and swap data is stored as normal on the matching swap device. If unsure, say Y to enable frontswap. + +config PRAM + bool "Persistent over-kexec memory storage" + default n + help + This option adds the kernel API that enables saving memory pages of + the currently executing kernel and restoring them after a kexec in + the newly booted one. This can be utilized for speeding up reboot by + leaving process memory and/or FS caches in-place. diff --git a/mm/Makefile b/mm/Makefile index 3a46287..33ad952 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_MEMO
[PATCH RFC 08/13] mm: PRAM: checksum saved data
Checksum PRAM pages with crc32 to ensure persistent memory is not corrupted during reboot. --- mm/Kconfig |4 ++ mm/pram.c | 128 +++- 2 files changed, 130 insertions(+), 2 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index f1e11a0..0a4d4c6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -475,6 +475,10 @@ config FRONTSWAP config PRAM bool "Persistent over-kexec memory storage" depends on X86 + select CRC32 + select LIBCRC32C + select CRYPTO_CRC32C + select CRYPTO_CRC32C_INTEL default n help This option adds the kernel API that enables saving memory pages of diff --git a/mm/pram.c b/mm/pram.c index 380735f..8a66a86 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -1,4 +1,6 @@ #include +#include +#include #include #include #include @@ -19,11 +21,14 @@ #include #include +#define PRAM_MAGIC 0x7072616D + /* * Represents a reference to a data page saved to PRAM. */ struct pram_entry { __u32 flags;/* see PRAM_PAGE_* flags */ + __u32 csum; /* the page csum */ __u64 pfn; /* the page frame number */ }; @@ -32,6 +37,9 @@ struct pram_entry { * The structure occupies a memory page. */ struct pram_link { + __u32 magic; + __u32 csum; + __u64 link_pfn; /* points to the next link of the node */ /* the array occupies the rest of the link page; if the link is not @@ -57,6 +65,9 @@ struct pram_link { * The structure occupies a memory page. */ struct pram_node { + __u32 magic; + __u32 csum; + __u32 flags; /* see PRAM_* flags below */ __u32 type; /* data type, see enum pram_stream_type */ __u64 data_len; /* data size, only for byte streams */ @@ -81,6 +92,9 @@ struct pram_node { * The structure occupies a memory page. */ struct pram_super_block { + __u32 magic; + __u32 csum; + __u64 node_pfn; /* points to the first element of * the node list */ }; @@ -106,11 +120,34 @@ static int __init parse_pram_sb_pfn(char *arg) } early_param("pram", parse_pram_sb_pfn); +static u32 pram_data_csum(struct page *page) +{ + u32 ret; + void *addr; + + addr = kmap_atomic(page); + ret = crc32c(0, addr, PAGE_SIZE); + kunmap_atomic(addr); + return ret; +} + +/* SSE-4.2 crc32c faster than crc32, but not available at early boot */ +static inline u32 pram_meta_csum(void *addr) +{ + /* skip magic and csum fields */ + return crc32(0, addr + 8, PAGE_SIZE - 8); +} + static void * __init pram_map_meta(unsigned long pfn) { + __u32 *p; + if (pfn >= max_low_pfn) return ERR_PTR(-EINVAL); - return pfn_to_kaddr(pfn); + p = pfn_to_kaddr(pfn); + if (p[0] != PRAM_MAGIC || p[1] != pram_meta_csum(p)) + return ERR_PTR(-EINVAL); + return p; } static int __init pram_reserve_page(unsigned long pfn) @@ -332,6 +369,65 @@ static struct pram_node *pram_find_node(const char *name) return NULL; } +static void pram_csum_link(struct pram_link *link) +{ + int i; + struct pram_entry *entry; + + for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) { + entry = &link->entry[i]; + if (entry->pfn) + entry->csum = pram_data_csum(pfn_to_page(entry->pfn)); + } +} + +static void pram_csum_node(struct pram_node *node) +{ + unsigned long link_pfn; + struct pram_link *link; + + link_pfn = node->link_pfn; + while (link_pfn) { + link = pfn_to_kaddr(link_pfn); + pram_csum_link(link); + link_pfn = link->link_pfn; + cond_resched(); + } +} + +static int pram_check_link(struct pram_link *link) +{ + int i; + struct pram_entry *entry; + + for (i = 0; i < PRAM_LINK_ENTRIES_MAX; i++) { + entry = &link->entry[i]; + if (!entry->pfn) + break; + if (entry->csum != pram_data_csum(pfn_to_page(entry->pfn))) + return -EFAULT; + } + return 0; +} + +static int pram_check_node(struct pram_node *node) +{ + unsigned long link_pfn; + struct pram_link *link; + int ret = 0; + + link_pfn = node->link_pfn; + while (link_pfn) { + link = pfn_to_kaddr(link_pfn); + ret = pram_check_link(link); + if (ret) + break; + link_pfn = link->link_pfn; + cond_resched(); + } + return ret; +} + static void pram_truncate_link(struct pram_link *link) { int i; @@ -449,6 +545,7 @@ void pram_finish_save(struct pram_stream *ps) BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM
[PATCH RFC 04/13] mm: PRAM: implement byte stream operations
This patch adds ability to save arbitrary byte strings to PRAM using pram_write() to be restored later using pram_read(). These two operations are implemented on top of pram_save_page() and pram_load_page() respectively. --- include/linux/pram.h |4 +++ mm/pram.c| 86 -- 2 files changed, 88 insertions(+), 2 deletions(-) diff --git a/include/linux/pram.h b/include/linux/pram.h index dd17316..61c536c 100644 --- a/include/linux/pram.h +++ b/include/linux/pram.h @@ -13,6 +13,10 @@ struct pram_stream { struct pram_node *node; struct pram_link *link; /* current link */ unsigned int page_index;/* next page index in link */ + + /* byte-stream specific */ + struct page *data_page; + unsigned int data_offset; }; #define PRAM_NAME_MAX 256 /* including nul */ diff --git a/mm/pram.c b/mm/pram.c index a443eb0..f7eebe1 100644 --- a/mm/pram.c +++ b/mm/pram.c @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -46,6 +47,7 @@ struct pram_link { struct pram_node { __u32 flags; /* see PRAM_* flags below */ __u32 type; /* data type, see enum pram_stream_type */ + __u64 data_len; /* data size, only for byte streams */ __u64 link_pfn; /* points to the first link of the node */ __u8name[PRAM_NAME_MAX]; @@ -284,6 +286,9 @@ void pram_finish_load(struct pram_stream *ps) BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD); + if (ps->data_page) + put_page(ps->data_page); + pram_truncate_node(node); pram_free_page(node); } @@ -422,10 +427,51 @@ struct page *pram_load_page(struct pram_stream *ps, int *flags) * * On success, returns the number of bytes written, which is always equal to * @count. On failure, -errno is returned. + * + * Error values: + *%ENOMEM: insufficient amount of memory available */ ssize_t pram_write(struct pram_stream *ps, const void *buf, size_t count) { - return -ENOSYS; + void *addr; + size_t copy_count, write_count = 0; + struct pram_node *node = ps->node; + + BUG_ON(node->type != PRAM_BYTE_STREAM); + BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_SAVE); + + while (count > 0) { + if (!ps->data_page) { + struct page *page; + int err; + + page = pram_alloc_page((ps->gfp_mask & GFP_RECLAIM_MASK) | + __GFP_HIGHMEM | __GFP_ZERO); + if (!page) + return -ENOMEM; + err = __pram_save_page(ps, page, 0); + put_page(page); + if (err) + return err; + ps->data_page = page; + ps->data_offset = 0; + } + + copy_count = min_t(size_t, count, PAGE_SIZE - ps->data_offset); + addr = kmap_atomic(ps->data_page); + memcpy(addr + ps->data_offset, buf, copy_count); + kunmap_atomic(addr); + + buf += copy_count; + node->data_len += copy_count; + ps->data_offset += copy_count; + if (ps->data_offset >= PAGE_SIZE) + ps->data_page = NULL; + + write_count += copy_count; + count -= copy_count; + } + return write_count; } /** @@ -437,5 +483,41 @@ ssize_t pram_write(struct pram_stream *ps, const void *buf, size_t count) */ size_t pram_read(struct pram_stream *ps, void *buf, size_t count) { - return 0; + char *addr; + size_t copy_count, read_count = 0; + struct pram_node *node = ps->node; + + BUG_ON(node->type != PRAM_BYTE_STREAM); + BUG_ON((node->flags & PRAM_ACCMODE_MASK) != PRAM_LOAD); + + while (count > 0 && node->data_len > 0) { + if (!ps->data_page) { + struct page *page; + + page = __pram_load_page(ps, NULL); + if (!page) + break; + ps->data_page = page; + ps->data_offset = 0; + } + + copy_count = min_t(size_t, count, PAGE_SIZE - ps->data_offset); + if (copy_count > node->data_len) + copy_count = node->data_len; + addr = kmap_atomic(ps->data_page); + memcpy(buf, addr + ps->data_offset, copy_count); + kunmap_atomic(addr); + + buf += copy_count; + node->data_len -= copy_count; + ps->data_offset += copy_count; + if (ps->data_offset >= PAGE_SIZE || !node->data_len) { + put_page(ps->data_page); +
Re: [PATCH RFC] sched: move h_load calculation to task_h_load
On 07/15/2013 12:28 PM, Peter Zijlstra wrote: OK, fair enough. It does somewhat rely on us getting the single rq->clock update thing right, but that should be ok. Frankly, I doubt that rq->clock is the right thing to use here, because it can be updated very frequently under some conditions, so that cfs_rq->h_load can get updated several times during the same balance run. Perhaps, jiffies would suit better for that purpose. This would work as h_load update throttler similar to how it works now (I mean rq->h_load_throttle in update_h_load()). If there is something else you don't like/have some thoughts on, please share - I'll try to improve. Thank you for your feedback. But yeah, when you have stupid many cgroups we quickly need less h_load instances than there are cgroups. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] sched: move h_load calculation to task_h_load
The bad thing about update_h_load(), which computes hierarchical load factor for task groups, is that it is called for each task group in the system before every load balancer run, and since rebalance can be triggered very often, this function can eat really a lot of cpu time if there are many cpu cgroups in the system. Although the situation was improved significantly by commit a35b646 ('sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies'), the problem still can arise under some kinds of loads, e.g. when cpus are switching from idle to busy and back very frequently. For instance, when I start 1000 of processes that wake up every millisecond on my 8 cpus host, 'top' and 'perf top' show: Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si Events: 243K cycles 7.57% [kernel] [k] __schedule 7.08% [kernel] [k] timerqueue_add 6.13% libc-2.12.so [.] usleep Then if I create 1 *idle* cpu cgroups (no processes in them), cpu usage increases significantly although the 'wakers' are still executing in the root cpu cgroup: Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si Events: 230K cycles 24.56% [kernel][k] tg_load_down 5.76% [kernel][k] __schedule This happens because this particular kind of load triggers 'new idle' rebalance very frequently, which requires calling update_h_load(), which, in turn, calls tg_load_down() for every *idle* cpu cgroup even though it is absolutely useless, because idle cpu cgroups have no tasks to pull. This patch tries to improve the situation by making h_load calculation proceed only when h_load is really necessary. To achieve this, it substitutes update_h_load() with update_cfs_rq_h_load(), which computes h_load only for a given cfs_rq and all its ascendants, and makes the load balancer call this function whenever it considers if a task should be pulled, i.e. it moves h_load calculations directly to task_h_load(). For h_load of the same cfs_rq not to be updated multiple times (in case several tasks in the same cgroup are considered during the same balance run), the patch keeps the time of the last h_load update for each cfs_rq and breaks calculation when it finds h_load to be uptodate. The benefit of it is that h_load is computed only for those cfs_rq's, which really need it, in particular all idle task groups are skipped. Although this, in fact, moves h_load calculation under rq lock, it should not affect latency much, because the amount of work done under rq lock while trying to pull tasks is limited by sched_nr_migrate. After the patch applied with the setup described above (1000 wakers in the root cgroup and 1 idle cgroups), I get: Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si Events: 242K cycles 7.57% [kernel] [k] __schedule 6.70% [kernel] [k] timerqueue_add 5.93% libc-2.12.so [.] usleep Changes in v2: * use jiffies instead of rq->clock for last_h_load_update. Signed-off-by: Vladimir Davydov --- kernel/sched/fair.c | 58 +++--- kernel/sched/sched.h |7 +++--- 2 files changed, 30 insertions(+), 35 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f77f9c5..12d0137 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4171,47 +4171,48 @@ static void update_blocked_averages(int cpu) } /* - * Compute the cpu's hierarchical load factor for each task group. + * Compute the hierarchical load factor for cfs_rq and all its ascendants. * This needs to be done in a top-down fashion because the load of a child * group is a fraction of its parents load. */ -static int tg_load_down(struct task_group *tg, void *data) +static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq) { - unsigned long load; - long cpu = (long)data; - - if (!tg->parent) { - load = cpu_rq(cpu)->avg.load_avg_contrib; - } else { - load = tg->parent->cfs_rq[cpu]->h_load; - load = div64_ul(load * tg->se[cpu]->avg.load_avg_contrib, - tg->parent->cfs_rq[cpu]->runnable_load_avg + 1); - } - - tg->cfs_rq[cpu]->h_load = load; - - return 0; -} - -static void update_h_load(long cpu) -{ - struct rq *rq = cpu_rq(cpu); + struct rq *rq = rq_of(cfs_rq); + struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)]; unsigned long now = jiffies; + unsigned long load; - if (rq->h_load_throttle == now) + if (cfs_rq->last_h_load_update == now) return; - rq->h_load_throttle = now; + cfs_rq->h_load_next = NULL; + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + cfs_rq->h_load_next =
Re: [PATCH RFC] pram: persistent over-kexec memory file system
On 07/27/2013 07:41 PM, Marco Stornelli wrote: Il 26/07/2013 14:29, Vladimir Davydov ha scritto: Hi, We want to propose a way to upgrade a kernel on a machine without restarting all the user-space services. This is to be done with CRIU project, but we need help from the kernel to preserve some data in memory while doing kexec. The key point of our implementation is leaving process memory in-place during reboot. This should eliminate most io operations the services would produce during initialization. To achieve this, we have implemented a pseudo file system that preserves its content during kexec. We propose saving CRIU dump files to this file system, kexec'ing and then restoring the processes in the newly booted kernel. http://pramfs.sourceforge.net/ AFAIU it's a bit different thing: PRAMFS as well as pstore, which has already been merged, requires hardware support for over-reboot persistency, so called non-volatile RAM, i.e. RAM which is not directly accessible and so is not used by the kernel. On the contrary, what we'd like to have is preserving usual RAM on kexec. It is possible, because RAM is not reset during kexec. This would allow leaving applications working set as well as filesystem caches in place, speeding the reboot process as a whole and reducing the downtime significantly. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] pram: persistent over-kexec memory file system
On 07/27/2013 09:37 PM, Marco Stornelli wrote: Il 27/07/2013 19:35, Vladimir Davydov ha scritto: On 07/27/2013 07:41 PM, Marco Stornelli wrote: Il 26/07/2013 14:29, Vladimir Davydov ha scritto: Hi, We want to propose a way to upgrade a kernel on a machine without restarting all the user-space services. This is to be done with CRIU project, but we need help from the kernel to preserve some data in memory while doing kexec. The key point of our implementation is leaving process memory in-place during reboot. This should eliminate most io operations the services would produce during initialization. To achieve this, we have implemented a pseudo file system that preserves its content during kexec. We propose saving CRIU dump files to this file system, kexec'ing and then restoring the processes in the newly booted kernel. http://pramfs.sourceforge.net/ AFAIU it's a bit different thing: PRAMFS as well as pstore, which has already been merged, requires hardware support for over-reboot persistency, so called non-volatile RAM, i.e. RAM which is not directly accessible and so is not used by the kernel. On the contrary, what we'd like to have is preserving usual RAM on kexec. It is possible, because RAM is not reset during kexec. This would allow leaving applications working set as well as filesystem caches in place, speeding the reboot process as a whole and reducing the downtime significantly. Thanks. Actually not. You can use normal system RAM reserved at boot with mem parameter without any kernel change. Until an hard reset happens, that area will be "persistent". Thank you, we'll look at PRAMFS closer, but right now, after trying it I have a couple of concerns I'd appreciate if you could clarify: 1) As you advised, I tried to reserve a range of memory (passing memmap=4G$4G at boot) and mounted PRAMFS using the following options: # mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none /mnt/pramfs And it turned out that PRAMFS is very slow as compared to ramfs: # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 9.23498 s, 45.4 MB/s # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] conv=notrunc 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 3.04692 s, 138 MB/s We need it to be as fast as usual RAM, because otherwise the benefit of it over hdd disappears. So before diving into the code, I'd like to ask you if it's intrinsic to PRAMFS, or can it be fixed? Or, perhaps, I used wrong mount/boot/config options (btw, I enabled only CONFIG_PRAMFS)? 2) To enable saving application dump files in memory using PRAMFS, one should reserve half of RAM for it. That's too expensive. While with ramfs, once SPLICE_F_MOVE flag is implemented, one could move anonymous memory pages to ramfs page cache and after kexec move it back so that almost no extra memory space costs would be required. Of course, SPLICE_F_MOVE is to be yet implemented, but with PRAMFS significant memory costs are inevitable... or am I wrong? Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] pram: persistent over-kexec memory file system
On 07/28/2013 03:02 PM, Marco Stornelli wrote: Il 28/07/2013 12:05, Vladimir Davydov ha scritto: On 07/27/2013 09:37 PM, Marco Stornelli wrote: Il 27/07/2013 19:35, Vladimir Davydov ha scritto: On 07/27/2013 07:41 PM, Marco Stornelli wrote: Il 26/07/2013 14:29, Vladimir Davydov ha scritto: Hi, We want to propose a way to upgrade a kernel on a machine without restarting all the user-space services. This is to be done with CRIU project, but we need help from the kernel to preserve some data in memory while doing kexec. The key point of our implementation is leaving process memory in-place during reboot. This should eliminate most io operations the services would produce during initialization. To achieve this, we have implemented a pseudo file system that preserves its content during kexec. We propose saving CRIU dump files to this file system, kexec'ing and then restoring the processes in the newly booted kernel. http://pramfs.sourceforge.net/ AFAIU it's a bit different thing: PRAMFS as well as pstore, which has already been merged, requires hardware support for over-reboot persistency, so called non-volatile RAM, i.e. RAM which is not directly accessible and so is not used by the kernel. On the contrary, what we'd like to have is preserving usual RAM on kexec. It is possible, because RAM is not reset during kexec. This would allow leaving applications working set as well as filesystem caches in place, speeding the reboot process as a whole and reducing the downtime significantly. Thanks. Actually not. You can use normal system RAM reserved at boot with mem parameter without any kernel change. Until an hard reset happens, that area will be "persistent". Thank you, we'll look at PRAMFS closer, but right now, after trying it I have a couple of concerns I'd appreciate if you could clarify: 1) As you advised, I tried to reserve a range of memory (passing memmap=4G$4G at boot) and mounted PRAMFS using the following options: # mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none /mnt/pramfs And it turned out that PRAMFS is very slow as compared to ramfs: # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 9.23498 s, 45.4 MB/s # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] conv=notrunc 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 3.04692 s, 138 MB/s We need it to be as fast as usual RAM, because otherwise the benefit of it over hdd disappears. So before diving into the code, I'd like to ask you if it's intrinsic to PRAMFS, or can it be fixed? Or, perhaps, I used wrong mount/boot/config options (btw, I enabled only CONFIG_PRAMFS)? In x86 you should have the write protection enabled. Turn it off or mount it with noprotect option. I tried. This helps, but the write rate is still too low: with write protect: # mount -t pramfs -o physaddr=0x1,init=4G,bs=4096 none /mnt/pramfs # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 17.6007 s, 23.8 MB/s # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] conv=notrunc 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 4.32923 s, 96.9 MB/s w/o write protect: # mount -t pramfs -o physaddr=0x1,init=4G,bs=4096,noprotect none /mnt/pramfs # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 9.07748 s, 46.2 MB/s # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/pramfs/dummy bs=4096 count=$[100*1024] conv=notrunc 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 3.04596 s, 138 MB/s Also tried turning off CONFIG_PRAMFS_WRITE_PROTECT, the result is the same: the rate does not exceed 150 MB/s, which is too slow comparing to ramfs: # mount -t ramfs none /mnt/ramfs # dd if=/dev/zero of=/mnt/pramfs if=/dev/zero of=/mnt/ramfs/dummy bs=4096 count=$[100*1024] 102400+0 records in 102400+0 records out 419430400 bytes (419 MB) copied, 0.200809 s, 2.1 GB/s 2) To enable saving application dump files in memory using PRAMFS, one should reserve half of RAM for it. That's too expensive. While with ramfs, once SPLICE_F_MOVE flag is implemented, one could move anonymous memory pages to ramfs page cache and after kexec move it back so that almost no extra memory space costs would be required. Of course, SPLICE_F_MOVE is to be yet implemented, but with PRAMFS significant memory costs are inevitable... or am I wrong? Thanks. From this point of view you are right. Pramfs (or other solution like that) are out of page cache, so you can't do any memory transfer. It
[PATCH RFC] pram: persistent over-kexec memory file system
Hi, We want to propose a way to upgrade a kernel on a machine without restarting all the user-space services. This is to be done with CRIU project, but we need help from the kernel to preserve some data in memory while doing kexec. The key point of our implementation is leaving process memory in-place during reboot. This should eliminate most io operations the services would produce during initialization. To achieve this, we have implemented a pseudo file system that preserves its content during kexec. We propose saving CRIU dump files to this file system, kexec'ing and then restoring the processes in the newly booted kernel. A typical usage scenario would look like this: 1) Boot kernel with 'pram_banned=MEMRANGE' boot option (MEMRANGE=MEMMIN-MEMMAX). This is to prevent kexec from overwriting persistent data while loading the new kernel. Later on kexec will be forced to load kernel to the range specified. MEMRANGE=0-128M should be enough. 2) Mount pram file system and save dump files there: # mount -t pram none /mnt # criu dump -D /mnt -t $PID 3) Run kexec passing pram location to the new kernel and forcing it to load the kernel image to MEMRAGE: # kexec --load /vmlinuz --initrd=initrd.img \ --append="$(cat /proc/cmdline | sed -e 's/pram=[^ ]*//g') pram=$(cat /sys/kernel/pram)" \ --mem-min=$MEMMIN --mem-max=$MEMMAX # reboot 4) After reboot mount pram, restore processes, and cleanup: # mount -t pram none /mnt # criu restore -d -D /mnt # rm -f /mnt/* # umount /mnt In this patch I introduce the pram pseudo file system that keeps its memory in place during kexec. pram is based on ramfs, but it serializes and leaves in memory its content on unmount, and restores it on the next mount. To survive over kexec, pram finds the serialized content, whose location should be specified by 'pram' boot param (exported via /sys/kernel/pram), and reserves it at early boot. To avoid conflicts with other parts of the kernel that make early reservation too, pram tracks all memory regions that have ever been reserved and avoids using them for storing its data. Plus, it adds 'pram_banned' boot param, which can be used to explicitly disallow pram to use specified memory regions. This may be useful for avoiding conflicts with kexec loading the new kernel image (as it is done in the usage scenario). This implementation serves as a proof of concept and so has a number of limitations: * pram only supports regular files; directories, symlinks, etc are ignored * pram is implemented only for x86 * pram does not support swapping out * pram checksums serialized content and drops it in case it is corrupted with no possibility of restore What do you think about it, does it make sense to go on with this approach or should we reconsider it as a whole? Thanks. --- arch/x86/kernel/setup.c |2 + fs/Kconfig | 13 + fs/ramfs/Makefile |1 + fs/ramfs/inode.c|2 +- fs/ramfs/persistent.c | 699 +++ include/linux/pram.h| 18 ++ include/linux/ramfs.h |1 + kernel/ksysfs.c | 13 + mm/bootmem.c|5 + mm/memblock.c |7 +- mm/nobootmem.c |2 + mm/page_alloc.c |3 + 12 files changed, 764 insertions(+), 2 deletions(-) create mode 100644 fs/ramfs/persistent.c create mode 100644 include/linux/pram.h diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index f8ec578..7d22ad0 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -50,6 +50,7 @@ #include #include #include +#include #include #include @@ -1137,6 +1138,7 @@ void __init setup_arch(char **cmdline_p) early_acpi_boot_init(); initmem_init(); + pram_reserve(); memblock_find_dma_reserve(); #ifdef CONFIG_KVM_GUEST diff --git a/fs/Kconfig b/fs/Kconfig index c229f82..8d6943f 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -168,6 +168,19 @@ config HUGETLB_PAGE source "fs/configfs/Kconfig" +config PRAM + bool "Persistent over-kexec memory storage" + depends on X86 + select CRC32 + select LIBCRC32C + select CRYPTO_CRC32C + select CRYPTO_CRC32C_INTEL + default n + help + pram is a filesystem that saves its content on unmount to be restored + on the next mount after kexec. It can be used for speeding up system + reboot by saving application memory images there. + endmenu menuconfig MISC_FILESYSTEMS diff --git a/fs/ramfs/Makefile b/fs/ramfs/Makefile index c71e65d..e6953d4 100644 --- a/fs/ramfs/Makefile +++ b/fs/ramfs/Makefile @@ -7,3 +7,4 @@ obj-y += ramfs.o file-mmu-y := file-nommu.o file-mmu-$(CONFIG_MMU) := file-mmu.o ramfs-objs += inode.o $(file-mmu-y) +ramfs-$(CONFIG_PRAM) += persistent.o diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index c24f1e1..86f9f9a 100644 --- a/fs/ramfs/inode.c +
[PATCH 1/2] cpu: common: make clearcpuid option take bits list
It is more convenient to write 'clearcpuid=147,148,...' than 'clearcpuid=147 clearcpuid=148 ...' --- arch/x86/kernel/cpu/common.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 6b9333b..8ffe1b9 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1012,10 +1012,10 @@ static __init int setup_disablecpuid(char *arg) { int bit; - if (get_option(&arg, &bit) && bit < NCAPINTS*32) - setup_clear_cpu_cap(bit); - else - return 0; + while (get_option(&arg, &bit)) { + if (bit >= 0 && bit < NCAPINTS*32) + setup_clear_cpu_cap(bit); + } return 1; } -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be reported in /proc/cpuinfo and used by the kernel. However, if a userpsace process checks CPU features directly using the cpuid instruction, it will be reported about all features supported by the CPU irrespective of what features are cleared. The patch makes the clearcpuid boot option not only clear CPU features in kernel but also mask them in hardware for Intel and AMD CPUs that support it so that the features cleared won't be reported even by the cpuid instruction. This can be useful for migration of virtual machines managed by hypervisors that do not support/use Intel VT/AMD-V hardware-assisted virtualization technology. If CPUID masking is supported, this will be reported in /proc/cpuinfo:flags as 'cpuidmask'. --- arch/x86/include/asm/cpufeature.h |1 + arch/x86/kernel/cpu/amd.c | 22 ++ arch/x86/kernel/cpu/intel.c | 59 + 3 files changed, 82 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h index f91e80f..6f061ea 100644 --- a/arch/x86/include/asm/cpufeature.h +++ b/arch/x86/include/asm/cpufeature.h @@ -97,6 +97,7 @@ #define X86_FEATURE_EXTD_APICID(3*32+26) /* has extended APICID (8 bits) */ #define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */ #define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */ +#define X86_FEATURE_CPUIDMASK (3*32+29) /* CPUID feature masking */ /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */ #define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 146bb62..ddc8ea2 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -482,6 +482,26 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c) #endif } +#define MSR_AMD_CPUID_FEATURES_OVERRIDE0xc0011004 +#define MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE0xc0011005 + +static void __cpuinit amd_mask_cpu_caps(struct cpuinfo_x86 *c) +{ + u64 val; + + if (c->x86 >= 0x0f) { + set_cpu_cap(c, X86_FEATURE_CPUIDMASK); + + rdmsrl_amd_safe(MSR_AMD_CPUID_FEATURES_OVERRIDE, &val); + val &= ~((u64)cpu_caps_cleared[4] << 32 | cpu_caps_cleared[0]); + wrmsrl_amd_safe(MSR_AMD_CPUID_FEATURES_OVERRIDE, val); + + rdmsrl_amd_safe(MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE, &val); + val &= ~((u64)cpu_caps_cleared[6] << 32 | cpu_caps_cleared[1]); + wrmsrl_amd_safe(MSR_AMD_EXT_CPUID_FEATURES_OVERRIDE, val); + } +} + static void __cpuinit init_amd(struct cpuinfo_x86 *c) { u32 dummy; @@ -684,6 +704,8 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) } rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy); + + amd_mask_cpu_caps(c); } #ifdef CONFIG_X86_32 diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 3e6ff6c..1df1da5 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -358,6 +358,63 @@ static void __cpuinit detect_vmx_virtcap(struct cpuinfo_x86 *c) } } +static unsigned int __cpuinit intel_cpuid_mask_msr(struct cpuinfo_x86 *c, + unsigned int *msr_ext_cpuid_mask) +{ + unsigned int msr, msr_ext; + + msr = msr_ext = 0; + + switch (c->x86_model) { + case 0x17: + case 0x1D: + msr = 0x478; + break; + case 0x1A: + case 0x1E: + case 0x1F: + case 0x25: + case 0x2C: + case 0x2E: + case 0x2F: + msr = 0x130; + msr_ext = 0x131; + break; + case 0x2A: + msr = 0x132; + msr_ext = 0x133; + break; + } + + if (msr_ext_cpuid_mask) + *msr_ext_cpuid_mask = msr_ext; + + return msr; +} + +static void __cpuinit intel_mask_cpu_caps(struct cpuinfo_x86 *c) +{ + u32 low, high; + unsigned int msr_cpuid_mask, msr_ext_cpuid_mask; + + msr_cpuid_mask = intel_cpuid_mask_msr(c, &msr_ext_cpuid_mask); + if (msr_cpuid_mask) { + set_cpu_cap(c, X86_FEATURE_CPUIDMASK); + + rdmsr(msr_cpuid_mask, low, high); + low &= ~cpu_caps_cleared[4]; + high &= ~cpu_caps_cleared[0]; + wrmsr(msr_cpuid_mask, low, high); + + if (msr_ext_cpuid_mask != 0) { + rdmsr(msr_ext_cpuid_mask, low, high); + low &= ~cpu_caps_cleared[6]; + high &= ~cpu_caps_cleared[1]; + wrmsr(msr_ext_cpuid_mask, low, high); + } + } +} + static void __cpuinit init_intel(struct cpuinfo_x86 *c) { unsigned int l2 = 0; @@ -474,6 +531,8 @@ static void __cpuinit init_intel(struct cpuin
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On Jul 20, 2012, at 9:20 PM, H. Peter Anvin wrote: > On 07/20/2012 09:37 AM, Vladimir Davydov wrote: >> If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be >> reported in /proc/cpuinfo and used by the kernel. However, if a >> userpsace process checks CPU features directly using the cpuid >> instruction, it will be reported about all features supported by the CPU >> irrespective of what features are cleared. >> >> The patch makes the clearcpuid boot option not only clear CPU features >> in kernel but also mask them in hardware for Intel and AMD CPUs that >> support it so that the features cleared won't be reported even by the >> cpuid instruction. >> >> This can be useful for migration of virtual machines managed by >> hypervisors that do not support/use Intel VT/AMD-V hardware-assisted >> virtualization technology. >> >> If CPUID masking is supported, this will be reported in >> /proc/cpuinfo:flags as 'cpuidmask'. > > I am a bit concerned about this patch: > > 1. it silently changes existing behavior. Yes, but who needs the current implementation of 'clearcpuid' which, in fact, just hides flags in /proc/cpuinfo while userspace apps will see and consequently use all CPU features? So, I think it logically extends the existing behavior. > 2. even on enabled hardware, only some of the bits are maskable. The patch makes only words 0, 1, 4, 6 maskable, but words 3, 7, 8 are Linux-defined, words 2 and 5 are Transmeta-, Centaur-, etc- defined, and word 9 contains some bizarre Intel CPU features. Thus, it is words 0, 1, 4, 6 that contain useful information for most hardware models. If you ask about some Intel CPUs that can't mask CPUID function 0x8001, this function describes AMD-specific features, and I bet those Intel CPUs just don't have them at all and thus have nothing to mask. > > -hpa > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On Jul 21, 2012, at 12:19 AM, H. Peter Anvin wrote: > On 07/20/2012 11:21 AM, Vladimir Davydov wrote: >>> >>> I am a bit concerned about this patch: >>> >>> 1. it silently changes existing behavior. >> >> Yes, but who needs the current implementation of 'clearcpuid' which, >> in fact, just hides flags in /proc/cpuinfo while userspace apps will >> see and consequently use all CPU features? > > Anyone who wants to disable a feature from the kernel, specifically. > Another option then? >> So, I think it logically extends the existing behavior. >> >>> 2. even on enabled hardware, only some of the bits are maskable. >> >> The patch makes only words 0, 1, 4, 6 maskable, but words 3, 7, 8 are >> Linux-defined, words 2 and 5 are Transmeta-, Centaur-, etc- defined, and >> word 9 contains some bizarre Intel CPU features. Thus, it is words 0, 1, 4, >> 6 that contain useful information for most hardware models. > > "Bizarre"? New features, perhaps. All right, new :-) If the features are widely used, they'll provide a way of masking them either I guess. AFAIK, Intel added CPUID faulting for their newest 'Ivy Bridge' models, which allows masking all CPUID functions. Unfortunately, I don't have such a CPU. But later this feature can be utilized and wired into the code. Anyway, masking of at least some of the features would be better than lacking of the ability at all, wouldn't it? Another question whether the kernel should report errors/warnings if a particular feature can't be masked. > >> If you ask about some Intel CPUs that can't mask CPUID function 0x8001, >> this function describes AMD-specific features, and I bet those Intel CPUs >> just don't have them at all and thus have nothing to mask. > > Not quite. > > -hpa > > > > -- > H. Peter Anvin, Intel Open Source Technology Center > I work for Intel. I don't speak on their behalf. > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/21/2012 02:37 PM, Borislav Petkov wrote: (+ Andre who's been doing some cross vendor stuff) On Fri, Jul 20, 2012 at 08:37:33PM +0400, Vladimir Davydov wrote: If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be reported in /proc/cpuinfo and used by the kernel. However, if a userpsace process checks CPU features directly using the cpuid instruction, it will be reported about all features supported by the CPU irrespective of what features are cleared. The patch makes the clearcpuid boot option not only clear CPU features in kernel but also mask them in hardware for Intel and AMD CPUs that support it so that the features cleared won't be reported even by the cpuid instruction. This can be useful for migration of virtual machines managed by hypervisors that do not support/use Intel VT/AMD-V hardware-assisted virtualization technology. As they say in Star Wars: "I have a bad feeling about this." So opening the floodgates to people fiddling with this (not only migrators) makes me feel pretty uneasy. And I won't wonder if all of a sudden strange failures start to appear because code is querying cpuid features but some funny distro has disabled it in its kernel boot options. Or some other obscure case where the culprit is hidden in kernel command line options. If it were only needed for migration, then I'd say you guys can use msr-tools and run a script as root on the target machine to which you want to migrate to and toggle the feature bits you want. If msr-tools are used for cpuid masking, we will either get inconsistency between /proc/cpuinfo:flags and the output of the cpuid instruction or have to "synchronize" the clearcpuid boot option and the userspace app using msr-tools, which seems to be inconvenient. So, IMO, it would be better to have such functionality implemented in the kernel. I don't think cross vendor migration alone justifies having a generic kernel feature like that. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/24/2012 12:14 PM, Andre Przywara wrote: On 07/24/2012 09:06 AM, Vladimir Davydov wrote: On 07/21/2012 02:37 PM, Borislav Petkov wrote: (+ Andre who's been doing some cross vendor stuff) On Fri, Jul 20, 2012 at 08:37:33PM +0400, Vladimir Davydov wrote: If 'clearcpuid=N' is specified in boot options, CPU feature #N won't be reported in /proc/cpuinfo and used by the kernel. However, if a userpsace process checks CPU features directly using the cpuid instruction, it will be reported about all features supported by the CPU irrespective of what features are cleared. The patch makes the clearcpuid boot option not only clear CPU features in kernel but also mask them in hardware for Intel and AMD CPUs that support it so that the features cleared won't be reported even by the cpuid instruction. This can be useful for migration of virtual machines managed by hypervisors that do not support/use Intel VT/AMD-V hardware-assisted virtualization technology. But for this case you want it more fine-grained, say on a pre-process or per-container level, right? For hardware-assisted virtualization you simply don't need it, and for Xen PV guests for instance this can be more safely done by the hypervisor. I assume Parallels is similar in this respect, so you may want to switch the MSRs on the guest's entry and exit by the VMM. Also if you want to restrict a guest's CPUID features, you don't want to do this at the guest's discretion, but better one level below where the guest cannot revert this, right? Actually I meant OS-level virtualization (no hypervisors) based on the linux cgroup subsystem and namespaces like OpenVZ or LXC . Although the latter does not have the container migration ability at present, there is a project that will hopefully allow this soon (criu.org). For such virtualization systems, per-kernel option would be enough because all guests share the same kernel. In general I am not reluctant to have this feature with a sane interface, but I simply don't see the usefulness of having it per kernel. Also note that AFAIK this masking only helps with the basic CPUID features, namely leaf 1 and 0x8001 for ECX and EDX. This does not cover the more advanced features and not the new ones at leaf 7. I guess that when the more advanced features become widely-used, vendors will offer new MSRs and/or CPUID faulting. So opening the floodgates to people fiddling with this (not only migrators) makes me feel pretty uneasy. And I won't wonder if all of a sudden strange failures start to appear because code is querying cpuid features but some funny distro has disabled it in its kernel boot options. Actually these "strange failures" would be a bug then. If CPUID is not there, the feature is not there. Full stop. In the past we had had already some trouble with people ignoring CPUID and stating some funny things like: "Every XYZ processor has this feature." If someone disables MCE, then on purpose. Let the code cope with it. And Boris: I don't like this "majority of users" argument. If there is some sense in this feature, why not have it (unless it significantly hurts the code base)? Remember, this is Linux: If you want to shoot yourself in the foot, we will not prevent you. Regards, Andre. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/24/2012 02:10 PM, Borislav Petkov wrote: On Tue, Jul 24, 2012 at 12:29:19PM +0400, Vladimir Davydov wrote: I guess that when the more advanced features become widely-used, vendors will offer new MSRs and/or CPUID faulting. And this right there is the dealbreaker: So what are you doing for cpus which have the advanced CPUID features leafs but there are no MSRs to turn those bits off? We have not encountered this situation in our environments and I hope we won't :-) But look, these CPUID functions cover majority of CPU features, don't they? So, most of "normal" apps inside VM will survive migration. Perhaps, some low-level utils won't. I guess that's why there are no MSRs for other levels provided by vendors. You surely need some software-only solution for the migration to work, no? Yes. If so, why not apply that solution to your hypervisor without touching the kernel at all? In most hypervisor-based virtualization products, this is already implemented using VMM-exits, so that each VM can have arbitrary CPUID mask set by the admin. The problem is that we have no hypervisor. "Virtualization" we want this feature for is based on cgroups and namespaces (examples are OpenVZ and mainstream LXC). Tasks are just grouped into virtual environments and share the same kernel, which is proved to be more memory usage efficient than traditional hypervisor-based approaches. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 04:57 AM, H. Peter Anvin wrote: On 07/24/2012 04:09 AM, Vladimir Davydov wrote: We have not encountered this situation in our environments and I hope we won't :-) But look, these CPUID functions cover majority of CPU features, don't they? So, most of "normal" apps inside VM will survive migration. Perhaps, some low-level utils won't. I guess that's why there are no MSRs for other levels provided by vendors. You will once Ivy Bridge becomes common. Ivy Bridge has CPUID faulting, which allows masking all CPUID levels/functions. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/24/2012 04:34 PM, Andre Przywara wrote: On 07/24/2012 01:09 PM, Vladimir Davydov wrote: On 07/24/2012 02:10 PM, Borislav Petkov wrote: On Tue, Jul 24, 2012 at 12:29:19PM +0400, Vladimir Davydov wrote: I guess that when the more advanced features become widely-used, vendors will offer new MSRs and/or CPUID faulting. And this right there is the dealbreaker: So what are you doing for cpus which have the advanced CPUID features leafs but there are no MSRs to turn those bits off? We have not encountered this situation in our environments and I hope we won't :-) But look, these CPUID functions cover majority of CPU features, don't they? So, most of "normal" apps inside VM will survive migration. Perhaps, some low-level utils won't. I guess that's why there are no MSRs for other levels provided by vendors. You have the new feature leaf at EAX=7. This contains things like BMI and AVX2 and probably more upcoming features. So you may be safe for a while, but you need a solution in the long run. You surely need some software-only solution for the migration to work, no? Yes. If so, why not apply that solution to your hypervisor without touching the kernel at all? In most hypervisor-based virtualization products, this is already implemented using VMM-exits, so that each VM can have arbitrary CPUID mask set by the admin. The problem is that we have no hypervisor. "Virtualization" we want this feature for is based on cgroups and namespaces (examples are OpenVZ and mainstream LXC). Tasks are just grouped into virtual environments and share the same kernel, which is proved to be more memory usage efficient than traditional hypervisor-based approaches. So for this single kernel approach I'd understand it that way: 1. You boot up the kernel on the host, it should detect and enable all the features, say MCA. 2. After boot, you use /src/msr-tools/wrmsr to mask CPUID bits, again MCA for instance or AVX/AES or the like. Since the (host side of the) kernel already detected it, this does not hurt the kernel features like MCA. But AVX will not be available to applications running in the "host container", which is probably OK since these are mostly management applications, right? 3. Then you start guests. The guest's libc will not detect the features because of the MSR masking. All you need now is /proc/cpuinfo filtering to make this bullet-proof, preferably through the container functionality. I see that you do already massive sysfs filtering and also /proc/ filtering, so this maybe an option? Yes, we do filter /proc/cpuinfo output in our product (OpenVZ), but there is also the LXC project that is based completely on the mainstream kernel. LXC developers have not faced the cross-vendor migration problem yet because currently they don't have migration at all, but they will surely face it in future since the work on migration is in progress now (CRIU). So, you prefer adding some filtering of /proc/cpuinfo into the mainstream kernel (not now, later, for LXC to work) instead of enabling clearcpuid boot option to mask CPUID features? IMO, the latter would look clearer. This approach does not need any kernel support (except for the /proc/cpuinfo filtering). Does this address the issues you have? Regards, Andre. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/24/2012 04:44 PM, Alan Cox wrote: This approach does not need any kernel support (except for the /proc/cpuinfo filtering). Does this address the issues you have? You can do the /proc/cpuinfo filtering in user space too How? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 02:58 PM, Andre Przywara wrote: On 07/25/2012 12:31 PM, Vladimir Davydov wrote: On 07/24/2012 04:44 PM, Alan Cox wrote: This approach does not need any kernel support (except for the /proc/cpuinfo filtering). Does this address the issues you have? You can do the /proc/cpuinfo filtering in user space too How? I wanted to write the same reply yesterday, but followed the hint in Alan's previous mail: # mount --bind /dev/shm/faked_cpuinfo /somepath/proc/cpuinfo I checked it, it works even with chroots and is not visible from within. If CPUs go online/offline? Regards, Andre. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 02:43 PM, Borislav Petkov wrote: On Wed, Jul 25, 2012 at 02:31:23PM +0400, Vladimir Davydov wrote: So, you prefer adding some filtering of /proc/cpuinfo into the mainstream kernel That's already there right? And your 1/2 patch was making toggling those bits easier. (not now, later, for LXC to work) instead of enabling clearcpuid boot option to mask CPUID features? IMO, the latter would look clearer. Yes, but for reasons noted earlier, you cannot tweak all hardware CPUID features as you want them. So, having a software-only layer of tweaking /proc/cpuinfo or something different you can come up with is the only option you have. And even in that case, applications running in the container which execute CPUID might fail in a strange manner when the corresponding /proc/cpuinfo flag was cleared by you intentionally but the hardware CPUID flag is there. In such situations, issues like that should probably be sorted on a case-by-case basis I guess. Thanks. We've agreed that tweaking CPUID bits in kernel is not a good idea and it is better to think about virtualization of /proc/cpuinfo and using msr-tools. Thank you for your time and feedback. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 03:17 PM, Andre Przywara wrote: On 07/25/2012 01:02 PM, Vladimir Davydov wrote: On 07/25/2012 02:58 PM, Andre Przywara wrote: On 07/25/2012 12:31 PM, Vladimir Davydov wrote: On 07/24/2012 04:44 PM, Alan Cox wrote: This approach does not need any kernel support (except for the /proc/cpuinfo filtering). Does this address the issues you have? You can do the /proc/cpuinfo filtering in user space too How? I wanted to write the same reply yesterday, but followed the hint in Alan's previous mail: # mount --bind /dev/shm/faked_cpuinfo /somepath/proc/cpuinfo I checked it, it works even with chroots and is not visible from within. If CPUs go online/offline? Do you support CPU offlining from within the guest? My OpenVZ guest only has /sys/class and nothing else, so I cannot offline any CPU. So you setup a "hand-crafted" cpuinfo for the guest and this should stay valid for the whole guest's runtime, right? And since it is a dumped file, "host" CPU off/onlining does not affect it. Or do you want to propagate this to the guests? A guest cannot have more CPUs than the host in container virtualization (at least in OpenVZ). So yes, we will have to propagate. Anyway, we've agreed that you were right and are going to think about /proc/cpuinfo virtualization. Thank you for your time and feedback. (Sorry, but my thinking is more Xen/KVM oriented, where guests can only do most things if they are explicitly allowed to do it and separation between guests and host is much stricter). Regards, Andre. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 03:31 PM, Alan Cox wrote: On Wed, 25 Jul 2012 14:31:30 +0400 Vladimir Davydov wrote: On 07/24/2012 04:44 PM, Alan Cox wrote: This approach does not need any kernel support (except for the /proc/cpuinfo filtering). Does this address the issues you have? You can do the /proc/cpuinfo filtering in user space too How? bind mount your own normal file over the top or a FUSE file In general, we've agreed that wiring CPUID masking into the kernel was not a good idea. We will try to virtualize /proc/cpuinfo as well as other /proc stuff in mainstream. Thank you for your time and feedback. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/25/2012 04:57 AM, H. Peter Anvin wrote: On 07/24/2012 04:09 AM, Vladimir Davydov wrote: We have not encountered this situation in our environments and I hope we won't :-) But look, these CPUID functions cover majority of CPU features, don't they? So, most of "normal" apps inside VM will survive migration. Perhaps, some low-level utils won't. I guess that's why there are no MSRs for other levels provided by vendors. You will once Ivy Bridge becomes common. -hpa You're right here. We will try to find other ways for sorting the migration problem out. Thank you for your time and feedback. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] cpu: intel, amd: mask cleared cpuid features
On 07/20/2012 09:10 PM, Andi Kleen wrote: + unsigned int *msr_ext_cpuid_mask) +{ + unsigned int msr, msr_ext; + + msr = msr_ext = 0; + + switch (c->x86_model) { You have to check the family too. + + return msr; +} + +static void __cpuinit intel_mask_cpu_caps(struct cpuinfo_x86 *c) +{ + u32 low, high; + unsigned int msr_cpuid_mask, msr_ext_cpuid_mask; + + msr_cpuid_mask = intel_cpuid_mask_msr(c,&msr_ext_cpuid_mask); + if (msr_cpuid_mask) { + set_cpu_cap(c, X86_FEATURE_CPUIDMASK); + + rdmsr(msr_cpuid_mask, low, high); And please use rdmsrl(), rdmsr() is so 90ies Other than that patch looks good. I presume it will be useful for other things than just migration too. -Andi Thank you for review, but we've decided to drop this since the kernel does not seem to be a suitable place for such "tweaking". -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] block: account iowait time when waiting for completion of IO request
Using wait_for_completion() for waiting for a IO request to be executed results in wrong iowait time accounting. For example, a system having the only task doing write() and fdatasync() on a block device can be reported being idle instead of iowaiting as it should because blkdev_issue_flush() calls wait_for_completion() which in turn calls schedule() that does not increment the iowait proc counter and thus does not turn on iowait time accounting. The patch makes block layer use wait_for_completion_io() instead of wait_for_completion() where appropriate to account iowait time correctly. Signed-off-by: Vladimir Davydov --- block/blk-exec.c |4 ++-- block/blk-flush.c |2 +- block/blk-lib.c |6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/block/blk-exec.c b/block/blk-exec.c index 74638ec..f634de7 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -120,9 +120,9 @@ int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, /* Prevent hang_check timer from firing at us during very long I/O */ hang_check = sysctl_hung_task_timeout_secs; if (hang_check) - while (!wait_for_completion_timeout(&wait, hang_check * (HZ/2))); + while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2))); else - wait_for_completion(&wait); + wait_for_completion_io(&wait); if (rq->errors) err = -EIO; diff --git a/block/blk-flush.c b/block/blk-flush.c index 720ad60..db8f1b5 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -436,7 +436,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask, bio_get(bio); submit_bio(WRITE_FLUSH, bio); - wait_for_completion(&wait); + wait_for_completion_io(&wait); /* * The driver must store the error location in ->bi_sector, if diff --git a/block/blk-lib.c b/block/blk-lib.c index b3a1f2b..d6f50d5 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -126,7 +126,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, /* Wait for bios in-flight */ if (!atomic_dec_and_test(&bb.done)) - wait_for_completion(&wait); + wait_for_completion_io(&wait); if (!test_bit(BIO_UPTODATE, &bb.flags)) ret = -EIO; @@ -200,7 +200,7 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector, /* Wait for bios in-flight */ if (!atomic_dec_and_test(&bb.done)) - wait_for_completion(&wait); + wait_for_completion_io(&wait); if (!test_bit(BIO_UPTODATE, &bb.flags)) ret = -ENOTSUPP; @@ -262,7 +262,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, /* Wait for bios in-flight */ if (!atomic_dec_and_test(&bb.done)) - wait_for_completion(&wait); + wait_for_completion_io(&wait); if (!test_bit(BIO_UPTODATE, &bb.flags)) /* One of bios in the batch was completed with error.*/ -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] sched: add wait_for_completion_io[_timeout]
The only difference between wait_for_completion[_timeout]() and wait_for_completion_io[_timeout]() is that the latter calls io_schedule_timeout() instead of schedule_timeout() so that the caller is accounted as waiting for IO, not just sleeping. These functions can be used for correct iowait time accounting when the completion struct is actually used for waiting for IO (e.g. completion of a bio request in the block layer). Signed-off-by: Vladimir Davydov --- include/linux/completion.h |3 ++ kernel/sched/core.c| 57 2 files changed, 55 insertions(+), 5 deletions(-) diff --git a/include/linux/completion.h b/include/linux/completion.h index 51494e6..33f0280 100644 --- a/include/linux/completion.h +++ b/include/linux/completion.h @@ -77,10 +77,13 @@ static inline void init_completion(struct completion *x) } extern void wait_for_completion(struct completion *); +extern void wait_for_completion_io(struct completion *); extern int wait_for_completion_interruptible(struct completion *x); extern int wait_for_completion_killable(struct completion *x); extern unsigned long wait_for_completion_timeout(struct completion *x, unsigned long timeout); +extern unsigned long wait_for_completion_io_timeout(struct completion *x, + unsigned long timeout); extern long wait_for_completion_interruptible_timeout( struct completion *x, unsigned long timeout); extern long wait_for_completion_killable_timeout( diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26058d0..e0396dd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3268,7 +3268,8 @@ void complete_all(struct completion *x) EXPORT_SYMBOL(complete_all); static inline long __sched -do_wait_for_common(struct completion *x, long timeout, int state) +do_wait_for_common(struct completion *x, + long (*action)(long), long timeout, int state) { if (!x->done) { DECLARE_WAITQUEUE(wait, current); @@ -3281,7 +3282,7 @@ do_wait_for_common(struct completion *x, long timeout, int state) } __set_current_state(state); spin_unlock_irq(&x->wait.lock); - timeout = schedule_timeout(timeout); + timeout = action(timeout); spin_lock_irq(&x->wait.lock); } while (!x->done && timeout); __remove_wait_queue(&x->wait, &wait); @@ -3292,17 +3293,30 @@ do_wait_for_common(struct completion *x, long timeout, int state) return timeout ?: 1; } -static long __sched -wait_for_common(struct completion *x, long timeout, int state) +static inline long __sched +__wait_for_common(struct completion *x, + long (*action)(long), long timeout, int state) { might_sleep(); spin_lock_irq(&x->wait.lock); - timeout = do_wait_for_common(x, timeout, state); + timeout = do_wait_for_common(x, action, timeout, state); spin_unlock_irq(&x->wait.lock); return timeout; } +static long __sched +wait_for_common(struct completion *x, long timeout, int state) +{ + return __wait_for_common(x, schedule_timeout, timeout, state); +} + +static long __sched +wait_for_common_io(struct completion *x, long timeout, int state) +{ + return __wait_for_common(x, io_schedule_timeout, timeout, state); +} + /** * wait_for_completion: - waits for completion of a task * @x: holds the state of this particular completion @@ -3339,6 +3353,39 @@ wait_for_completion_timeout(struct completion *x, unsigned long timeout) EXPORT_SYMBOL(wait_for_completion_timeout); /** + * wait_for_completion_io: - waits for completion of a task + * @x: holds the state of this particular completion + * + * This waits to be signaled for completion of a specific task. It is NOT + * interruptible and there is no timeout. The caller is accounted as waiting + * for IO. + */ +void __sched wait_for_completion_io(struct completion *x) +{ + wait_for_common_io(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE); +} +EXPORT_SYMBOL(wait_for_completion_io); + +/** + * wait_for_completion_io_timeout: - waits for completion of a task (w/timeout) + * @x: holds the state of this particular completion + * @timeout: timeout value in jiffies + * + * This waits for either a completion of a specific task to be signaled or for a + * specified timeout to expire. The timeout is in jiffies. It is not + * interruptible. The caller is accounted as waiting for IO. + * + * The return value is 0 if timed out, and positive (at least 1, or number of + * jiffies left till timeout) if completed. + */ +unsigned long __sched +wait_for_completion_io_timeout(struct completion *x, unsigned long timeout) +{ + return wait_for_common_io
Re: [PATCH] update sc->nr_reclaimed after each shrink_slab
On Fri, Jul 22, 2016 at 09:49:13AM +0200, Michal Hocko wrote: > On Fri 22-07-16 11:43:30, Zhou Chengming wrote: > > In !global_reclaim(sc) case, we should update sc->nr_reclaimed after each > > shrink_slab in the loop. Because we need the correct sc->nr_reclaimed > > value to see if we can break out. > > Does this actually change anything? Maybe I am missing something but > try_to_free_mem_cgroup_pages which is the main entry for the memcg > reclaim doesn't set reclaim_state. I don't remember why... Vladimir? We don't set reclaim_state on memcg reclaim, because there might be a lot of unrelated slab objects freed from the interrupt context (e.g. RCU freed) while we're doing memcg reclaim. Obviously, we don't want them to contribute to nr_reclaimed. Link to the thread with the problem discussion: http://marc.info/?l=linux-kernel&m=142132698209680&w=2
Re: [PATCH v2] mm: oom: deduplicate victim selection code for memcg and global oom
On Thu, Jul 21, 2016 at 08:41:44AM -0400, Johannes Weiner wrote: > On Mon, Jun 27, 2016 at 07:39:54PM +0300, Vladimir Davydov wrote: > > When selecting an oom victim, we use the same heuristic for both memory > > cgroup and global oom. The only difference is the scope of tasks to > > select the victim from. So we could just export an iterator over all > > memcg tasks and keep all oom related logic in oom_kill.c, but instead we > > duplicate pieces of it in memcontrol.c reusing some initially private > > functions of oom_kill.c in order to not duplicate all of it. That looks > > ugly and error prone, because any modification of select_bad_process > > should also be propagated to mem_cgroup_out_of_memory. > > > > Let's rework this as follows: keep all oom heuristic related code > > private to oom_kill.c and make oom_kill.c use exported memcg functions > > when it's really necessary (like in case of iterating over memcg tasks). > > This approach, with the control flow in the OOM code, makes a lot of > sense to me. I think it's particularly useful in preparation for > supporting cgroup-aware OOM killing, where not just individual tasks > but entire cgroups are evaluated and killed as opaque memory units. Yeah, that too. Also, this patch can be thought of as a preparation step for unified oom locking and oom timeouts (provided we ever agree to add them). Currently, there's some code in memcg trying to implement proper locking that would allow running oom in parallel in different cgroups and wait until memory is actually freed instead of looping and retrying reclaim. I think it could be reused for global case, although it's going to be tricky as we need to support legacy cgroup oom control api. > > I'm thinking about doing something like the following, which should be > able to work regardless on what cgroup level - root, intermediate, or > leaf node - the OOM killer is invoked, and this patch works toward it: > > struct oom_victim { > bool is_memcg; > union { > struct task_struct *task; > struct mem_cgroup *memcg; > } entity; > unsigned long badness; > }; > > oom_evaluate_memcg(oc, memcg, victim) > { > if (memcg == root) { > for_each_memcg_process(p, memcg) { > badness = oom_badness(oc, memcg, p); > if (badness == some_special_value) { > ... > } else if (badness > victim->badness) { > victim->is_memcg = false; > victim->entity.task = p; > victim->badness = badness; > } > } > } else { > badness = 0; > for_each_memcg_process(p, memcg) { > b = oom_badness(oc, memcg, p); > if (b == some_special_value) > ... > else > badness += b; > } > if (badness > victim.badness) > victim->is_memcg = true; > victim->entity.memcg = memcg; > victim->badness = badness; Yeah, that makes sense. However, I don't think we should always kill the whole cgroup, even if it's badness is highest. IMO what should be killed - cgroup or task - depends on the workload running inside the container. Some workloads (e.g. those that fork often) can put up with youngest of their tasks getting oom-killed, others will just get stuck if one of the workers is killed - for them we'd better kill the whole container. I guess we could introduce a per cgroup tunable which would define oom behavior - whether the whole cgroup should be killed on oom or just one task/sub-cgroup in the cgroup. > } > } > } > > oom() > { > struct oom_victim victim = { > .badness = 0, > }; > > for_each_mem_cgroup_tree(memcg, oc->memcg) > oom_evaluate_memcg(oc, memcg, &victim); > > if (!victim.badness && !is_sysrq_oom(oc)) { > dump_header(oc, NULL); > panic("Out of memory and no killable processes...\n"); > } > > if (victim.badness != -1) { > oom_kill_victim(oc, &victim); > schedule_timeout_killable(1); > } > > return true; > } > > But even without that, with the unification of two identical control > flows and the privatization of a good amount of oom killer internals, > the patch speaks for itself. > > > Signed-off-by: Vladimir Davydov > > Acked-by: Johannes Weiner Thanks!
Re: [PATCH 1/3] mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting
On Wed, Sep 14, 2016 at 03:48:44PM -0400, Johannes Weiner wrote: > From: Johannes Weiner > > During cgroup2 rollout into production, we started encountering css > refcount underflows and css access crashes in the memory controller. > Splitting the heavily shared css reference counter into logical users > narrowed the imbalance down to the cgroup2 socket memory accounting. > > The problem turns out to be the per-cpu charge cache. Cgroup1 had a > separate socket counter, but the new cgroup2 socket accounting goes > through the common charge path that uses a shared per-cpu cache for > all memory that is being tracked. Those caches are safe against > scheduling preemption, but not against interrupts - such as the newly > added packet receive path. When cache draining is interrupted by > network RX taking pages out of the cache, the resuming drain operation > will put references of in-use pages, thus causing the imbalance. > > Disable IRQs during all per-cpu charge cache operations. > > Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified > hierarchy memory controller") > Cc: # 4.5+ > Signed-off-by: Johannes Weiner Acked-by: Vladimir Davydov
Re: [PATCH 2/3] cgroup: duplicate cgroup reference when cloning sockets
On Wed, Sep 14, 2016 at 03:48:45PM -0400, Johannes Weiner wrote: > From: Johannes Weiner > > When a socket is cloned, the associated sock_cgroup_data is duplicated > but not its reference on the cgroup. As a result, the cgroup reference > count will underflow when both sockets are destroyed later on. > > Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") > Cc: # 4.5+ > Signed-off-by: Johannes Weiner Reviewed-by: Vladimir Davydov
Re: [PATCH v3 04/13] mm: Track NR_KERNEL_STACK in KiB instead of number of stacks
On Mon, Jun 20, 2016 at 04:43:34PM -0700, Andy Lutomirski wrote: > Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a > zone. This only makes sense if each kernel stack exists entirely in > one zone, and allowing vmapped stacks could break this assumption. > > Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack > allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on > all architectures. Keep it simple and use KiB. > > Cc: Vladimir Davydov > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: linux...@kvack.org > Signed-off-by: Andy Lutomirski Reviewed-by: Vladimir Davydov
Re: [PATCH v3 05/13] mm: Fix memcg stack accounting for sub-page stacks
On Mon, Jun 20, 2016 at 04:43:35PM -0700, Andy Lutomirski wrote: > We should account for stacks regardless of stack size, and we need > to account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the > units to kilobytes and Move it into account_kernel_stack(). > > Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 > memory.stat") > Cc: Vladimir Davydov > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: linux...@kvack.org > Signed-off-by: Andy Lutomirski Reviewed-by: Vladimir Davydov This patch is going to have a minor conflict with recent changes in mmotm, where {alloc,free}_kmem_pages were dropped, The conflict should be trivial to resolve - we only need to replace {alloc,free}_kmem_pages with {alloc,free}_pages in this patch. > --- > include/linux/memcontrol.h | 2 +- > kernel/fork.c | 15 ++- > mm/memcontrol.c| 2 +- > 3 files changed, 8 insertions(+), 11 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index a805474df4ab..3b653b86bb8f 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -52,7 +52,7 @@ enum mem_cgroup_stat_index { > MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ > MEM_CGROUP_STAT_NSTATS, > /* default hierarchy stats */ > - MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS, > + MEMCG_KERNEL_STACK_KB = MEM_CGROUP_STAT_NSTATS, > MEMCG_SLAB_RECLAIMABLE, > MEMCG_SLAB_UNRECLAIMABLE, > MEMCG_SOCK, > diff --git a/kernel/fork.c b/kernel/fork.c > index be7f006af727..ff3c41c2ba96 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -165,20 +165,12 @@ static struct thread_info > *alloc_thread_info_node(struct task_struct *tsk, > struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP, > THREAD_SIZE_ORDER); > > - if (page) > - memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK, > - 1 << THREAD_SIZE_ORDER); > - > return page ? page_address(page) : NULL; > } > > static inline void free_thread_info(struct thread_info *ti) > { > - struct page *page = virt_to_page(ti); > - > - memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK, > - -(1 << THREAD_SIZE_ORDER)); > - __free_kmem_pages(page, THREAD_SIZE_ORDER); > + free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER); > } > # else > static struct kmem_cache *thread_info_cache; > @@ -227,6 +219,11 @@ static void account_kernel_stack(struct thread_info *ti, > int account) > > mod_zone_page_state(zone, NR_KERNEL_STACK_KB, > THREAD_SIZE / 1024 * account); > + > + /* All stack pages belong to the same memcg. */ > + memcg_kmem_update_page_stat( > + virt_to_page(ti), MEMCG_KERNEL_STACK_KB, > + account * (THREAD_SIZE / 1024)); > } > > void free_task(struct task_struct *tsk) > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 75e74408cc8f..8e13a2419dad 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5133,7 +5133,7 @@ static int memory_stat_show(struct seq_file *m, void *v) > seq_printf(m, "file %llu\n", > (u64)stat[MEM_CGROUP_STAT_CACHE] * PAGE_SIZE); > seq_printf(m, "kernel_stack %llu\n", > -(u64)stat[MEMCG_KERNEL_STACK] * PAGE_SIZE); > +(u64)stat[MEMCG_KERNEL_STACK_KB] * 1024); > seq_printf(m, "slab %llu\n", > (u64)(stat[MEMCG_SLAB_RECLAIMABLE] + >stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE);
Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote: > The memory controller has quite a bit of state that usually outlives > the cgroup and pins its CSS until said state disappears. At the same > time it imposes a 16-bit limit on the CSS ID space to economically > store IDs in the wild. Consequently, when we use cgroups to contain > frequent but small and short-lived jobs that leave behind some page > cache, we quickly run into the 64k limitations of outstanding CSSs. > Creating a new cgroup fails with -ENOSPC while there are only a few, > or even no user-visible cgroups in existence. > > Although pinning CSSs past cgroup removal is common, there are only > two instances that actually need an ID after a cgroup is deleted: > cache shadow entries and swapout records. > > Cache shadow entries reference the ID weakly and can deal with the CSS > having disappeared when it's looked up later. They pose no hurdle. > > Swap-out records do need to pin the css to hierarchically attribute > swapins after the cgroup has been deleted; though the only pages that > remain swapped out after offlining are tmpfs/shmem pages. And those > references are under the user's control, so they are manageable. > > This patch introduces a private 16-bit memcg ID and switches swap and > cache shadow entries over to using that. This ID can then be recycled > after offlining when the CSS remains pinned only by objects that don't > specifically need it. > > This script demonstrates the problem by faulting one cache page in a > new cgroup and deleting it again: > > set -e > mkdir -p pages > for x in `seq 128000`; do > [ $((x % 1000)) -eq 0 ] && echo $x > mkdir /cgroup/foo > echo $$ >/cgroup/foo/cgroup.procs > echo trex >pages/$x > echo $$ >/cgroup/cgroup.procs > rmdir /cgroup/foo > done > > When run on an unpatched kernel, we eventually run out of possible IDs > even though there are no visible cgroups: > > [root@ham ~]# ./cssidstress.sh > [...] > 65000 > mkdir: cannot create directory '/cgroup/foo': No space left on device > > After this patch, the IDs get released upon cgroup destruction and the > cache and css objects get released once memory reclaim kicks in. With 65K cgroups it will take the reclaimer a substantial amount of time to iterate over all of them, which might result in latency spikes. Probably, to avoid that, we could move pages from a dead cgroup's lru to its parent's one on offline while still leaving dead cgroups pinned, like we do in case of list_lru entries. > > Signed-off-by: Johannes Weiner Reviewed-by: Vladimir Davydov One nit below. ... > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 75e74408cc8f..dc92b2df2585 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = { > { },/* terminate */ > }; > > +/* > + * Private memory cgroup IDR > + * > + * Swap-out records and page cache shadow entries need to store memcg > + * references in constrained space, so we maintain an ID space that is > + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of > + * memory-controlled cgroups to 64k. > + * > + * However, there usually are many references to the oflline CSS after > + * the cgroup has been destroyed, such as page cache or reclaimable > + * slab objects, that don't need to hang on to the ID. We want to keep > + * those dead CSS from occupying IDs, or we might quickly exhaust the > + * relatively small ID space and prevent the creation of new cgroups > + * even when there are much fewer than 64k cgroups - possibly none. > + * > + * Maintain a private 16-bit ID space for memcg, and allow the ID to > + * be freed and recycled when it's no longer needed, which is usually > + * when the CSS is offlined. > + * > + * The only exception to that are records of swapped out tmpfs/shmem > + * pages that need to be attributed to live ancestors on swapin. But > + * those references are manageable from userspace. > + */ > + > +static struct idr mem_cgroup_idr; static DEFINE_IDR(mem_cgroup_idr);
Re: [PATCH] memcg: mem_cgroup_migrate() may be called with irq disabled
On Mon, Jun 20, 2016 at 02:41:58PM -0400, Tejun Heo wrote: > mem_cgroup_migrate() uses local_irq_disable/enable() but can be called > with irq disabled from migrate_page_copy(). This ends up enabling irq > while holding a irq context lock triggering the following lockdep > warning. Fix it by using irq_save/restore instead. > > = > [ INFO: inconsistent lock state ] > 4.7.0-rc1+ #52 Tainted: GW > - > inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. > kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes: >(&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] > aio_migratepage+0x156/0x1e8 > {IN-SOFTIRQ-W} state was registered at: > [<001a8366>] __lock_acquire+0x5b6/0x1930 > [<001a9b9e>] lock_acquire+0xee/0x270 > [<00951fee>] _raw_spin_lock_irqsave+0x66/0xb0 > [<00390108>] aio_complete+0x98/0x328 > [<0037c7d4>] dio_complete+0xe4/0x1e0 > [<00650e64>] blk_update_request+0xd4/0x450 > [<0072a1a8>] scsi_end_request+0x48/0x1c8 > [<0072d7e2>] scsi_io_completion+0x272/0x698 > [<0065adb2>] blk_done_softirq+0xca/0xe8 > [<00953f80>] __do_softirq+0xc8/0x518 > [<001495de>] irq_exit+0xee/0x110 > [<0010ceba>] do_IRQ+0x6a/0x88 > [<0095342e>] io_int_handler+0x11a/0x25c > [<0094fb5c>] __mutex_unlock_slowpath+0x144/0x1d8 > [<0094fb58>] __mutex_unlock_slowpath+0x140/0x1d8 > [<003c6114>] kernfs_iop_permission+0x64/0x80 > [<0033ba86>] __inode_permission+0x9e/0xf0 > [<0033ea96>] link_path_walk+0x6e/0x510 > [<0033f09c>] path_lookupat+0xc4/0x1a8 > [<0034195c>] filename_lookup+0x9c/0x160 > [<00341b44>] user_path_at_empty+0x5c/0x70 > [<00335250>] SyS_readlinkat+0x68/0x140 > [<00952f8e>] system_call+0xd6/0x270 > irq event stamp: 971410 > hardirqs last enabled at (971409): [<0030f982>] > migrate_page_move_mapping+0x3ea/0x588 > hardirqs last disabled at (971410): [<00951fc4>] > _raw_spin_lock_irqsave+0x3c/0xb0 > softirqs last enabled at (970526): [<00954318>] > __do_softirq+0x460/0x518 > softirqs last disabled at (970519): [<001495de>] irq_exit+0xee/0x110 > > other info that might help us debug this: >Possible unsafe locking scenario: > >CPU0 > > lock(&(&ctx->completion_lock)->rlock); > > lock(&(&ctx->completion_lock)->rlock); > > *** DEADLOCK *** > > 3 locks held by kcompactd0/151: >#0: (&(&mapping->private_lock)->rlock){+.+.-.}, at: [<0038fc82>] > aio_migratepage+0x42/0x1e8 >#1: (&ctx->ring_lock){+.+.+.}, at: [<0038fc9a>] > aio_migratepage+0x5a/0x1e8 >#2: (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] > aio_migratepage+0x156/0x1e8 > > stack backtrace: > CPU: 20 PID: 151 Comm: kcompactd0 Tainted: GW 4.7.0-rc1+ #52 >0001c6cbb730 0001c6cbb7c0 0002 >0001c6cbb860 0001c6cbb7d8 0001c6cbb7d8 00114496 > 00b517ec 00b680b6 000b >0001c6cbb820 0001c6cbb7c0 >04000184ad18 00114496 0001c6cbb7c0 0001c6cbb820 > Call Trace: > ([<001143d2>] show_trace+0xea/0xf0) > ([<0011444a>] show_stack+0x72/0xf0) > ([<00684522>] dump_stack+0x9a/0xd8) > ([<0028679c>] print_usage_bug.part.27+0x2d4/0x2e8) > ([<001a71ce>] mark_lock+0x17e/0x758) > ([<001a784a>] mark_held_locks+0xa2/0xd0) > ([<001a79b8>] trace_hardirqs_on_caller+0x140/0x1c0) > ([<00326026>] mem_cgroup_migrate+0x266/0x370) > ([<0038fdaa>] aio_migratepage+0x16a/0x1e8) > ([<00310568>] move_to_new_page+0xb0/0x260) > ([<003111b4>] migrate_pages+0x8f4/0x9f0) > ([<002c507c>] compact_zone+0x4dc/0xdc8) > ([<002c5e22>] kcompactd_do_work+0x1aa/0x358) > ([<002c608a>] kcompactd+0xba/0x2c8) > ([<0016b09a>] kthread+0x10a/0x110) > ([<0095315a>] kernel_thread_starter+0x6/0xc) > ([<00953154>] kernel_thread_starter+0x0/0xc) > INFO: lockdep is turned off. > > Signed-off-by: Tejun Heo > Reported-by: Christian Borntraeger > Link: http://lkml.kernel.org/g/5767cfe5.7080...@de.ibm.com Reviewed-by: Vladimir Davydov
[PATCH] mm: vmscan: fix memcg-aware shrinkers not called on global reclaim
We must call shrink_slab() for each memory cgroup on both global and memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally changed that so that now shrink_slab() is only called with memcg != NULL on memcg reclaim. As a result, memcg-aware shrinkers (including dentry/inode) are never invoked on global reclaim. Fix that. Fixes: d71df22b55099 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: Vladimir Davydov --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 650d26832569..374d95d04178 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2561,7 +2561,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - if (!global_reclaim(sc)) + if (memcg) shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->nr_scanned - scanned, lru_pages); -- 2.1.4
[PATCH 2/3] mm: memcontrol: fix memcg id ref counter on swap charge move
Since commit 73f576c04b941 swap entries do not pin memcg->css.refcnt directly. Instead, they pin memcg->id.ref. So we should adjust the reference counters accordingly when moving swap charges between cgroups. Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5fe285f27ea7..58c229071fb1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4030,9 +4030,9 @@ static struct cftype mem_cgroup_legacy_files[] = { static DEFINE_IDR(mem_cgroup_idr); -static void mem_cgroup_id_get(struct mem_cgroup *memcg) +static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n) { - atomic_inc(&memcg->id.ref); + atomic_add(n, &memcg->id.ref); } static struct mem_cgroup *mem_cgroup_id_get_active(struct mem_cgroup *memcg) @@ -4042,9 +4042,9 @@ static struct mem_cgroup *mem_cgroup_id_get_active(struct mem_cgroup *memcg) return memcg; } -static void mem_cgroup_id_put(struct mem_cgroup *memcg) +static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n) { - if (atomic_dec_and_test(&memcg->id.ref)) { + if (atomic_sub_and_test(n, &memcg->id.ref)) { idr_remove(&mem_cgroup_idr, memcg->id.id); memcg->id.id = 0; @@ -4053,6 +4053,16 @@ static void mem_cgroup_id_put(struct mem_cgroup *memcg) } } +static inline void mem_cgroup_id_get(struct mem_cgroup *memcg) +{ + mem_cgroup_id_get_many(memcg, 1); +} + +static inline void mem_cgroup_id_put(struct mem_cgroup *memcg) +{ + mem_cgroup_id_put_many(memcg, 1); +} + /** * mem_cgroup_from_id - look up a memcg from a memcg id * @id: the memcg id to look up @@ -4687,6 +4697,8 @@ static void __mem_cgroup_clear_mc(void) if (!mem_cgroup_is_root(mc.from)) page_counter_uncharge(&mc.from->memsw, mc.moved_swap); + mem_cgroup_id_put_many(mc.from, mc.moved_swap); + /* * we charged both to->memory and to->memsw, so we * should uncharge to->memory. @@ -4694,9 +4706,9 @@ static void __mem_cgroup_clear_mc(void) if (!mem_cgroup_is_root(mc.to)) page_counter_uncharge(&mc.to->memory, mc.moved_swap); - css_put_many(&mc.from->css, mc.moved_swap); + mem_cgroup_id_get_many(mc.to, mc.moved_swap); + css_put_many(&mc.to->css, mc.moved_swap); - /* we've already done css_get(mc.to) */ mc.moved_swap = 0; } memcg_oom_recover(from); -- 2.1.4
[PATCH 1/3] mm: memcontrol: fix swap counter leak on swapout from offline cgroup
An offline memory cgroup might have anonymous memory or shmem left charged to it and no swap. Since only swap entries pin the id of an offline cgroup, such a cgroup will have no id and so an attempt to swapout its anon/shmem will not store memory cgroup info in the swap cgroup map. As a result, memcg->swap or memcg->memsw will never get uncharged from it and any of its ascendants. Fix this by always charging swapout to the first ancestor cgroup that hasn't released its id yet. Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 27 +-- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b5804e4e6324..5fe285f27ea7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4035,6 +4035,13 @@ static void mem_cgroup_id_get(struct mem_cgroup *memcg) atomic_inc(&memcg->id.ref); } +static struct mem_cgroup *mem_cgroup_id_get_active(struct mem_cgroup *memcg) +{ + while (!atomic_inc_not_zero(&memcg->id.ref)) + memcg = parent_mem_cgroup(memcg); + return memcg; +} + static void mem_cgroup_id_put(struct mem_cgroup *memcg) { if (atomic_dec_and_test(&memcg->id.ref)) { @@ -5751,7 +5758,7 @@ subsys_initcall(mem_cgroup_init); */ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) { - struct mem_cgroup *memcg; + struct mem_cgroup *memcg, *swap_memcg; unsigned short oldid; VM_BUG_ON_PAGE(PageLRU(page), page); @@ -5766,15 +5773,20 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) if (!memcg) return; - mem_cgroup_id_get(memcg); - oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg)); + swap_memcg = mem_cgroup_id_get_active(memcg); + oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg)); VM_BUG_ON_PAGE(oldid, page); - mem_cgroup_swap_statistics(memcg, true); + mem_cgroup_swap_statistics(swap_memcg, true); page->mem_cgroup = NULL; if (!mem_cgroup_is_root(memcg)) page_counter_uncharge(&memcg->memory, 1); + if (memcg != swap_memcg) { + if (!mem_cgroup_is_root(swap_memcg)) + page_counter_charge(&swap_memcg->memsw, 1); + page_counter_uncharge(&memcg->memsw, 1); + } /* * Interrupts should be disabled here because the caller holds the @@ -5814,11 +5826,14 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) if (!memcg) return 0; + memcg = mem_cgroup_id_get_active(memcg); + if (!mem_cgroup_is_root(memcg) && - !page_counter_try_charge(&memcg->swap, 1, &counter)) + !page_counter_try_charge(&memcg->swap, 1, &counter)) { + mem_cgroup_id_put(memcg); return -ENOMEM; + } - mem_cgroup_id_get(memcg); oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg)); VM_BUG_ON_PAGE(oldid, page); mem_cgroup_swap_statistics(memcg, true); -- 2.1.4
Re: [PATCH] memcg: put soft limit reclaim out of way if the excess tree is empty
On Mon, Aug 01, 2016 at 12:00:21PM +0200, Michal Hocko wrote: ... > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c265212bec8c..eb7e39c2d948 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2543,6 +2543,11 @@ static int mem_cgroup_resize_memsw_limit(struct > mem_cgroup *memcg, > return ret; > } > > +static inline bool soft_limit_tree_empty(struct mem_cgroup_tree_per_node > *mctz) > +{ > + return rb_last(&mctz->rb_root) == NULL; > +} > + I don't think traversing rb tree as rb_last() does w/o holding the lock is a good idea. Why is RB_EMPTY_ROOT() insufficient here?
[PATCH] radix-tree: account nodes to memcg only if explicitly requested
Radix trees may be used not only for storing page cache pages, so unconditionally accounting radix tree nodes to the current memory cgroup is bad: if a radix tree node is used for storing data shared among different cgroups we risk pinning dead memory cgroups forever. So let's only account radix tree nodes if it was explicitly requested by passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to account page cache entries, so mark mapping->page_tree so. Signed-off-by: Vladimir Davydov --- fs/inode.c | 2 +- lib/radix-tree.c | 14 ++ 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 559a9da25237..1d04dab5211c 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -345,7 +345,7 @@ EXPORT_SYMBOL(inc_nlink); void address_space_init_once(struct address_space *mapping) { memset(mapping, 0, sizeof(*mapping)); - INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC); + INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT); spin_lock_init(&mapping->tree_lock); init_rwsem(&mapping->i_mmap_rwsem); INIT_LIST_HEAD(&mapping->private_list); diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 61b8fb529cef..1b7bf7314141 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -277,10 +277,11 @@ radix_tree_node_alloc(struct radix_tree_root *root) /* * Even if the caller has preloaded, try to allocate from the -* cache first for the new node to get accounted. +* cache first for the new node to get accounted to the memory +* cgroup. */ ret = kmem_cache_alloc(radix_tree_node_cachep, - gfp_mask | __GFP_ACCOUNT | __GFP_NOWARN); + gfp_mask | __GFP_NOWARN); if (ret) goto out; @@ -303,8 +304,7 @@ radix_tree_node_alloc(struct radix_tree_root *root) kmemleak_update_trace(ret); goto out; } - ret = kmem_cache_alloc(radix_tree_node_cachep, - gfp_mask | __GFP_ACCOUNT); + ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask); out: BUG_ON(radix_tree_is_internal_node(ret)); return ret; @@ -351,6 +351,12 @@ static int __radix_tree_preload(gfp_t gfp_mask, int nr) struct radix_tree_node *node; int ret = -ENOMEM; + /* +* Nodes preloaded by one cgroup can be be used by another cgroup, so +* they should never be accounted to any particular memory cgroup. +*/ + gfp_mask &= ~__GFP_ACCOUNT; + preempt_disable(); rtp = this_cpu_ptr(&radix_tree_preloads); while (rtp->nr < nr) { -- 2.1.4
[PATCH v3] mm: oom: deduplicate victim selection code for memcg and global oom
When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Signed-off-by: Vladimir Davydov Acked-by: Johannes Weiner --- Changes in v3: - rebase on top of v4.7-mmotm-2016-07-28-16-33 - add ack by Johannes include/linux/memcontrol.h | 15 include/linux/oom.h| 43 +- mm/memcontrol.c| 114 ++ mm/oom_kill.c | 200 - 4 files changed, 167 insertions(+), 205 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5d8ca6e02e39..0710143723bc 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -366,6 +366,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, struct mem_cgroup *, struct mem_cgroup_reclaim_cookie *); void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); +int mem_cgroup_scan_tasks(struct mem_cgroup *, + int (*)(struct task_struct *, void *), void *); static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) { @@ -446,6 +448,8 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) void mem_cgroup_handle_over_high(void); +unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg); + void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); @@ -639,6 +643,12 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root, { } +static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, + int (*fn)(struct task_struct *, void *), void *arg) +{ + return 0; +} + static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) { return 0; @@ -669,6 +679,11 @@ mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, return 0; } +static inline unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg) +{ + return 0; +} + static inline void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { diff --git a/include/linux/oom.h b/include/linux/oom.h index 5bc0457ee3a8..17946e5121b6 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -34,23 +34,11 @@ struct oom_control { * for display purposes. */ const int order; -}; -/* - * Types of limitations to the nodes from which allocations may occur - */ -enum oom_constraint { - CONSTRAINT_NONE, - CONSTRAINT_CPUSET, - CONSTRAINT_MEMORY_POLICY, - CONSTRAINT_MEMCG, -}; - -enum oom_scan_t { - OOM_SCAN_OK,/* scan thread and find its badness */ - OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ - OOM_SCAN_ABORT, /* abort the iteration and return */ - OOM_SCAN_SELECT,/* always select this thread first */ + /* Used by oom implementation, do not set */ + unsigned long totalpages; + struct task_struct *chosen; + unsigned long chosen_points; }; extern struct mutex oom_lock; @@ -70,30 +58,10 @@ static inline bool oom_task_origin(const struct task_struct *p) return p->signal->oom_flag_origin; } -extern void mark_oom_victim(struct task_struct *tsk); - -#ifdef CONFIG_MMU -extern void wake_oom_reaper(struct task_struct *tsk); -#else -static inline void wake_oom_reaper(struct task_struct *tsk) -{ -} -#endif - extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); -extern void oom_kill_process(struct oom_control *oc, struct task_struct *p, -unsigned int points, unsigned long totalpages, -const char *message); - -extern void check_panic_on_oom(struct oom_control *oc, - enum oom_constraint constraint); - -extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, - struct task_struct *task); - extern bool out_of_memory(struct oom_control *oc); extern void exit_oom_victim(struct task_struct *tsk); @@ -101,14 +69,11 @@ extern void exit_oom_victim(struct task_struct *t
[PATCH] Update my e-mail address
vdavydov@{parallels,virtuozzo}.com will bounce from now on. Signed-off-by: Vladimir Davydov --- .mailmap| 2 ++ MAINTAINERS | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index b18912c5121e..de22daefd9da 100644 --- a/.mailmap +++ b/.mailmap @@ -159,6 +159,8 @@ Valdis Kletnieks Viresh Kumar Viresh Kumar Viresh Kumar +Vladimir Davydov +Vladimir Davydov Takashi YOSHII Yusuke Goda Gustavo Padovan diff --git a/MAINTAINERS b/MAINTAINERS index d8e81b1dde30..46a7d3093a49 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3265,7 +3265,7 @@ F:kernel/cpuset.c CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG) M: Johannes Weiner M: Michal Hocko -M: Vladimir Davydov +M: Vladimir Davydov L: cgro...@vger.kernel.org L: linux...@kvack.org S: Maintained -- 2.1.4
Re: [PATCH] Update my e-mail address
On Wed, Aug 31, 2016 at 07:26:43AM -0700, Greg wrote: > On Wed, 2016-08-31 at 15:01 +0300, Vladimir Davydov wrote: > > vdavydov@{parallels,virtuozzo}.com will bounce from now on. > > > > Signed-off-by: Vladimir Davydov > > Shouldn't MAINTAINERS be in the subject line? You're right, sorry. Here it goes: >From 11b0d4818cfc517ce0cd1dba743bd777f2ccd6b1 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Wed, 31 Aug 2016 14:36:19 +0300 Subject: [PATCH] MAINTAINERS: update my e-mail address vdavydov@{parallels,virtuozzo}.com will bounce from now on. Signed-off-by: Vladimir Davydov --- .mailmap| 2 ++ MAINTAINERS | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index b18912c5121e..de22daefd9da 100644 --- a/.mailmap +++ b/.mailmap @@ -159,6 +159,8 @@ Valdis Kletnieks Viresh Kumar Viresh Kumar Viresh Kumar +Vladimir Davydov +Vladimir Davydov Takashi YOSHII Yusuke Goda Gustavo Padovan diff --git a/MAINTAINERS b/MAINTAINERS index d8e81b1dde30..46a7d3093a49 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3265,7 +3265,7 @@ F:kernel/cpuset.c CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG) M: Johannes Weiner M: Michal Hocko -M: Vladimir Davydov +M: Vladimir Davydov L: cgro...@vger.kernel.org L: linux...@kvack.org S: Maintained -- 2.1.4
Re: [BUG] Bad page states
On Mon, Aug 08, 2016 at 10:48:45AM -0700, Linus Torvalds wrote: ... > > [ 43.477693] BUG: Bad page state in process S05containers pfn:1ff02a3 > > [ 43.484417] page:ea007fc0a8c0 count:0 mapcount:-511 mapping: > > (null) index:0x0 > > [ 43.492737] flags: 0x1000() > > [ 43.496602] page dumped because: nonzero mapcount > > Hmm. The _mapcount field is a union with other fields, but that number > doesn't make sense for any of the other fields. > > So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So Yes, it is - my bad. The thing is I set/clear PAGE_KMEMCG_MAPCOUNT_VALUE for pages allocated with __GFP_ACCOUNT iff memcg_kmem_enabled() is true (see __alloc_pages_nodemask and free_pages_prepare), while the latter gets disabled when the last cgroup gets destroyed. So if you do mkdir /sys/fs/cgroup/memory/test # run something in the root cgroup that allocates pages with # __GFP_ACCOUNT, e.g. a program using pipe rmdir /sys/fs/cgroup/memory/test Then, if there are no other memory cgroups, you'll see the bug. Sorry about that :-( Obviously, the PageKmemcg flag should only be set for pages that are actually accounted to a non-root kmemcg and hence pin memcg_kmem_enabled static key. I'll fix that. Thanks, Vladimir
[PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg, which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated with __GFP_ACCOUNT, including those that aren't actually charged to any cgroup, i.e. allocated from the root cgroup context. To avoid overhead in case cgroups are not used, we only do that if memcg_kmem_enabled() is true. The latter is set iff there are kmem-enabled memory cgroups (online or offline). The root cgroup is not considered kmem-enabled. As a result, if a page is allocated with __GFP_ACCOUNT for the root cgroup when there are kmem-enabled memory cgroups and is freed after all kmem-enabled memory cgroups were removed, e.g. # no memory cgroups has been created yet, create one mkdir /sys/fs/cgroup/memory/test # run something allocating pages with __GFP_ACCOUNT, e.g. # a program using pipe dmesg | tail # remove the memory cgroup rmdir /sys/fs/cgroup/memory/test we'll get bad page state bug complaining about page->_mapcount != -1: BUG: Bad page state in process swapper/0 pfn:1fd945c page:ea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0 flags: 0x1000() To avoid that, let's mark with PageKmemcg only those pages that are actually charged to and hence pin a non-root memory cgroup. Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths") Reported-by: Eric Dumazet Signed-off-by: Vladimir Davydov Cc: [4.7+] --- fs/pipe.c | 4 +--- mm/memcontrol.c | 14 -- mm/page_alloc.c | 14 +- 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 4b32928f5426..4ebe6b2e5217 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -144,10 +144,8 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, struct page *page = buf->page; if (page_count(page) == 1) { - if (memcg_kmem_enabled()) { + if (memcg_kmem_enabled()) memcg_kmem_uncharge(page, 0); - __ClearPageKmemcg(page); - } __SetPageLocked(page); return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1c0aa59fd333..7c9c0bd475f0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2308,8 +2308,11 @@ int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) return 0; memcg = get_mem_cgroup_from_mm(current->mm); - if (!mem_cgroup_is_root(memcg)) + if (!mem_cgroup_is_root(memcg)) { ret = memcg_kmem_charge_memcg(page, gfp, order, memcg); + if (!ret) + __SetPageKmemcg(page); + } css_put(&memcg->css); return ret; } @@ -2336,6 +2339,11 @@ void memcg_kmem_uncharge(struct page *page, int order) page_counter_uncharge(&memcg->memsw, nr_pages); page->mem_cgroup = NULL; + + /* slab pages do not have PageKmemcg flag set */ + if (PageKmemcg(page)) + __ClearPageKmemcg(page); + css_put_many(&memcg->css, nr_pages); } #endif /* !CONFIG_SLOB */ @@ -5533,8 +5541,10 @@ static void uncharge_list(struct list_head *page_list) else nr_file += nr_pages; pgpgout++; - } else + } else { nr_kmem += 1 << compound_order(page); + __ClearPageKmemcg(page); + } page->mem_cgroup = NULL; } while (next != page_list); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ee5a4b20daf4..3a7d3a1fb6a0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1009,10 +1009,8 @@ static __always_inline bool free_pages_prepare(struct page *page, } if (PageMappingFlags(page)) page->mapping = NULL; - if (memcg_kmem_enabled() && PageKmemcg(page)) { + if (memcg_kmem_enabled() && PageKmemcg(page)) memcg_kmem_uncharge(page, order); - __ClearPageKmemcg(page); - } if (check_free) bad += free_pages_check(page); if (bad) @@ -3788,12 +3786,10 @@ no_zone: } out: - if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) { - if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) { - __free_pages(page, order); - page = NULL; - } else - __SetPageKmemcg(page); + if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page && + unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) { + __free_pages(page, order); + page = NULL; } if (kmemcheck_enabled && page) -- 2.1.4
Re: [PATCH] mm: memcontrol: avoid unused function warning
On Wed, Aug 24, 2016 at 10:22:43AM +0200, Arnd Bergmann wrote: > A bugfix in v4.8-rc2 introduced a harmless warning when CONFIG_MEMCG_SWAP > is disabled but CONFIG_MEMCG is enabled: > > mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not > used [-Werror=unused-function] > static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg) > > This adds an extra #ifdef that matches the one around the caller to > avoid the warning. > > Signed-off-by: Arnd Bergmann > Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from > offline cgroup") Acked-by: Vladimir Davydov
Re: [PATCH v2] fs/dcache.c: fix spin lockup issue on nlru->lock
On Wed, Jun 21, 2017 at 12:09:15PM +0530, Sahitya Tummala wrote: > __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for > longer duration if there are more number of items in the lru list. > As per the current code, it can hold the spin lock for upto maximum > UINT_MAX entries at a time. So if there are more number of items in > the lru list, then "BUG: spinlock lockup suspected" is observed in > the below path - > > [] spin_bug+0x90 > [] do_raw_spin_lock+0xfc > [] _raw_spin_lock+0x28 > [] list_lru_add+0x28 > [] dput+0x1c8 > [] path_put+0x20 > [] terminate_walk+0x3c > [] path_lookupat+0x100 > [] filename_lookup+0x6c > [] user_path_at_empty+0x54 > [] SyS_faccessat+0xd0 > [] el0_svc_naked+0x24 > > This nlru->lock is acquired by another CPU in this path - > > [] d_lru_shrink_move+0x34 > [] dentry_lru_isolate_shrink+0x48 > [] __list_lru_walk_one.isra.10+0x94 > [] list_lru_walk_node+0x40 > [] shrink_dcache_sb+0x60 > [] do_remount_sb+0xbc > [] do_emergency_remount+0xb0 > [] process_one_work+0x228 > [] worker_thread+0x2e0 > [] kthread+0xf4 > [] ret_from_fork+0x10 > > Fix this lockup by reducing the number of entries to be shrinked > from the lru list to 1024 at once. Also, add cond_resched() before > processing the lru list again. > > Link: http://marc.info/?t=14972286491&r=1&w=2 > Fix-suggested-by: Jan kara > Fix-suggested-by: Vladimir Davydov > Signed-off-by: Sahitya Tummala > --- > v2: patch shrink_dcache_sb() instead of list_lru_walk() > --- > fs/dcache.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index cddf397..c8ca150 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb) > LIST_HEAD(dispose); > > freed = list_lru_walk(&sb->s_dentry_lru, > - dentry_lru_isolate_shrink, &dispose, UINT_MAX); > + dentry_lru_isolate_shrink, &dispose, 1024); > > this_cpu_sub(nr_dentry_unused, freed); > shrink_dentry_list(&dispose); > + cond_resched(); > } while (freed > 0); In an extreme case, a single invocation of list_lru_walk() can skip all 1024 dentries, in which case 'freed' will be 0 forcing us to break the loop prematurely. I think we should loop until there's at least one dentry left on the LRU, i.e. while (list_lru_count(&sb->s_dentry_lru) > 0) However, even that wouldn't be quite correct, because list_lru_count() iterates over all memory cgroups to sum list_lru_one->nr_items, which can race with memcg offlining code migrating dentries off a dead cgroup (see memcg_drain_all_list_lrus()). So it looks like to make this check race-free, we need to account the number of entries on the LRU not only per memcg, but also per node, i.e. add list_lru_node->nr_items. Fortunately, list_lru entries can't be migrated between NUMA nodes. > } > EXPORT_SYMBOL(shrink_dcache_sb);
[PATCH 1/2] mm/slab: skip memcg reclaim only if in atomic context
SLAB's implementation of kmem_cache_alloc() works as follows: 1. First, it tries to allocate from the preferred NUMA node without issuing reclaim. 2. If step 1 fails, it tries all nodes in the order of preference, again without invoking reclaimer 3. Only if steps 1 and 2 fails, it falls back on allocation from any allowed node with reclaim enabled. Before commit 4167e9b2cf10f ("mm: remove GFP_THISNODE"), GFP_THISNODE combination, which equaled __GFP_THISNODE|__GFP_NOWARN|__GFP_NORETRY on NUMA enabled builds, was used in order to avoid reclaim during steps 1 and 2. If __alloc_pages_slowpath() saw this combination in gfp flags, it aborted immediately even if __GFP_WAIT flag was set. So there was no need in clearing __GFP_WAIT flag while performing steps 1 and 2 and hence we could invoke memcg reclaim when allocating a slab page if the context allowed. Commit 4167e9b2cf10f zapped GFP_THISNODE combination. Instead of OR-ing the gfp mask with GFP_THISNODE, gfp_exact_node() helper should now be used. The latter sets __GFP_THISNODE and __GFP_NOWARN flags and clears __GFP_WAIT on the current gfp mask. As a result, it effectively prohibits invoking memcg reclaim on steps 1 and 2. This breaks memcg/memory.high logic when kmem accounting is enabled. The memory.high threshold is supposed to work as a soft limit, i.e. it does not fail an allocation on breaching it, but it still forces the caller to invoke direct reclaim to compensate for the excess. Without __GFP_WAIT flag direct reclaim is impossible so the caller will go on without being pushed back to the threshold. To fix this issue, we get rid of gfp_exact_node() helper and move gfp flags filtering to kmem_getpages() after memcg_charge_slab() is called. To understand the patch, note that: - In fallback_alloc() the only effect of using gfp_exact_node() is preventing recursion fallback_alloc() -> cache_alloc_node() -> fallback_alloc(). - Aside from fallback_alloc(), gfp_exact_node() is only used along with cache_grow(). Moreover, the only place where cache_grow() is used without it is fallback_alloc(), which, in contrast to other cache_grow() users, preallocates a page and passes it to cache_grow() so that the latter does not need to invoke kmem_getpages() by itself. Reported-by: Tejun Heo Signed-off-by: Vladimir Davydov --- mm/slab.c | 32 +++- 1 file changed, 11 insertions(+), 21 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index d890750ec31e..9ee809d2ed8b 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -857,11 +857,6 @@ static inline void *cache_alloc_node(struct kmem_cache *cachep, return NULL; } -static inline gfp_t gfp_exact_node(gfp_t flags) -{ - return flags; -} - #else /* CONFIG_NUMA */ static void *cache_alloc_node(struct kmem_cache *, gfp_t, int); @@ -1028,15 +1023,6 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) return __cache_free_alien(cachep, objp, node, page_node); } - -/* - * Construct gfp mask to allocate from a specific node but do not invoke reclaim - * or warn about failures. - */ -static inline gfp_t gfp_exact_node(gfp_t flags) -{ - return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; -} #endif /* @@ -1583,7 +1569,7 @@ slab_out_of_memory(struct kmem_cache *cachep, gfp_t gfpflags, int nodeid) * would be relatively rare and ignorable. */ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, - int nodeid) + int nodeid, bool fallback) { struct page *page; int nr_pages; @@ -1595,6 +1581,9 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, if (memcg_charge_slab(cachep, flags, cachep->gfporder)) return NULL; + if (!fallback) + flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; + page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); if (!page) { memcg_uncharge_slab(cachep, cachep->gfporder); @@ -2641,7 +2630,8 @@ static int cache_grow(struct kmem_cache *cachep, * 'nodeid'. */ if (!page) - page = kmem_getpages(cachep, local_flags, nodeid); + page = kmem_getpages(cachep, local_flags, nodeid, +!IS_ENABLED(CONFIG_NUMA)); if (!page) goto failed; @@ -2840,7 +2830,7 @@ alloc_done: if (unlikely(!ac->avail)) { int x; force_grow: - x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); + x = cache_grow(cachep, flags, node, NULL); /* cache_grow can reenable interrupts, then ac could change. */ ac = cpu_cache_get(cachep); @@ -3034,7 +3024,7 @@
[PATCH 2/2] mm/slub: do not bypass memcg reclaim for high-order page allocation
Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation") made allocate_slab() try to allocate high order slab pages without __GFP_WAIT in order to avoid invoking reclaim/compaction when we can fall back on low order pages. However, it broke memcg/memory.high logic in case kmem accounting is enabled. The memory.high threshold works as a soft limit: an allocation does not fail if it is breached, but we call direct reclaim to compensate for the excess. Without __GFP_WAIT we cannot invoke reclaimer and therefore we will go on exceeding memory.high more and more until a normal __GFP_WAIT allocation is issued. Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to memcg_charge_slab() even on high order page allocations w/o any performance impact. So let us fix this problem by excluding __GFP_WAIT only from alloc_pages() while still forwarding it to memcg_charge_slab() if the context allows. Reported-by: Tejun Heo Signed-off-by: Vladimir Davydov --- mm/slub.c | 24 +++- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index e180f8dcd06d..416a332277cb 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1333,6 +1333,14 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s, if (memcg_charge_slab(s, flags, order)) return NULL; + /* +* Let the initial higher-order allocation fail under memory pressure +* so we fall-back to the minimum order allocation. +*/ + if (oo_order(oo) > oo_order(s->min)) + flags = (flags | __GFP_NOWARN | __GFP_NOMEMALLOC) & + ~(__GFP_NOFAIL | __GFP_WAIT); + if (node == NUMA_NO_NODE) page = alloc_pages(flags, order); else @@ -1348,7 +1356,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) { struct page *page; struct kmem_cache_order_objects oo = s->oo; - gfp_t alloc_gfp; void *start, *p; int idx, order; @@ -1359,23 +1366,14 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) flags |= s->allocflags; - /* -* Let the initial higher-order allocation fail under memory pressure -* so we fall-back to the minimum order allocation. -*/ - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL; - if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min)) - alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT; - - page = alloc_slab_page(s, alloc_gfp, node, oo); + page = alloc_slab_page(s, flags, node, oo); if (unlikely(!page)) { oo = s->min; - alloc_gfp = flags; /* * Allocation may have failed due to fragmentation. * Try a lower order alloc if possible */ - page = alloc_slab_page(s, alloc_gfp, node, oo); + page = alloc_slab_page(s, flags, node, oo); if (unlikely(!page)) goto out; stat(s, ORDER_FALLBACK); @@ -1385,7 +1383,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) { int pages = 1 << oo_order(oo); - kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node); + kmemcheck_alloc_shadow(page, oo_order(oo), flags, node); /* * Objects from caches that have a constructor don't get -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
Hi, Tejun reported that sometimes memcg/memory.high threshold seems to be silently ignored if kmem accounting is enabled: http://www.spinics.net/lists/linux-mm/msg93613.html It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT first. As a result, if there is enough free pages, memcg reclaim will not get invoked on kmem allocations, which will lead to uncontrollable growth of memory usage no matter what memory.high is set to. This patch set attempts to fix this issue. For more details please see comments to individual patches. Thanks, Vladimir Davydov (2): mm/slab: skip memcg reclaim only if in atomic context mm/slub: do not bypass memcg reclaim for high-order page allocation mm/slab.c | 32 +++- mm/slub.c | 24 +++- 2 files changed, 22 insertions(+), 34 deletions(-) -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote: > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote: > > Tejun reported that sometimes memcg/memory.high threshold seems to be > > silently ignored if kmem accounting is enabled: > > > > http://www.spinics.net/lists/linux-mm/msg93613.html > > > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT > > first. As a result, if there is enough free pages, memcg reclaim will > > not get invoked on kmem allocations, which will lead to uncontrollable > > growth of memory usage no matter what memory.high is set to. > > Right but isn't that what the caller explicitly asked for? No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we might ignore that and charge memcg w/o __GFP_WAIT. > Why should we ignore that for kmem accounting? It seems like a fix at > a wrong layer to me. Let's forget about memory.high for a minute. 1. SLAB. Suppose someone calls kmalloc_node and there is enough free memory on the preferred node. W/o memcg limit set, the allocation will happen from the preferred node, which is OK. If there is memcg limit, we can currently fail to allocate from the preferred node if we are near the limit. We issue memcg reclaim and go to fallback alloc then, which will most probably allocate from a different node, although there is no reason for that. This is a bug. 2. SLUB. Someone calls kmalloc and there is enough free high order pages. If there is no memcg limit, we will allocate a high order slab page, which is in accordance with SLUB internal logic. With memcg limit set, we are likely to fail to charge high order page (because we currently try to charge high order pages w/o __GFP_WAIT) and fallback on a low order page. The latter is unexpected and unjustified. That being said, this is the fix at the right layer. > Either we should start failing GFP_NOWAIT charges when we are above > high wmark or deploy an additional catchup mechanism as suggested by > Tejun. The mechanism proposed by Tejun won't help us to avoid allocation failures if we are hitting memory.max w/o __GFP_WAIT or __GFP_FS. To fix GFP_NOFS/GFP_NOWAIT failures we just need to start reclaim when the gap between limit and usage is getting too small. It may be done from a workqueue or from task_work, but currently I don't see any reason why complicate and not just start reclaim directly, just like memory.high does. I mean, currently you can protect against GFP_NOWAIT failures by setting memory.high to be 1-2 MB lower than memory.high and this *will* work, because GFP_NOWAIT/GFP_NOFS allocations can't go on infinitely - they will alternate with normal GFP_KERNEL allocations sooner or later. It does not mean we should encourage users to set memory.high to protect against such failures, because, as pointed out by Tejun, logic behind memory.high is currently opaque and can change, but we can introduce memcg-internal watermarks that would work exactly as memory.high and hence help us against GFP_NOWAIT/GFP_NOFS failures. Thanks, Vladimir > I like the later more because it allows to better handle GFP_NOFS > requests as well and there are many sources of these from kmem paths. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 09:43:35AM -0400, Tejun Heo wrote: > On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote: > > Right but isn't that what the caller explicitly asked for? Why should we > > ignore that for kmem accounting? It seems like a fix at a wrong layer to > > me. Either we should start failing GFP_NOWAIT charges when we are above > > high wmark or deploy an additional catchup mechanism as suggested by > > Tejun. I like the later more because it allows to better handle GFP_NOFS > > requests as well and there are many sources of these from kmem paths. > > Yeah, this is beginning to look like we're trying to solve the problem > at the wrong layer. slab/slub or whatever else should be able to use > GFP_NOWAIT in whatever frequency they want for speculative > allocations. slab/slub can issue alloc_pages() any time with any flags they want and it won't be accounted to memcg, because kmem is accounted at slab/slub layer, not in buddy. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 10:39:39AM -0400, Tejun Heo wrote: > On Mon, Aug 31, 2015 at 05:30:08PM +0300, Vladimir Davydov wrote: > > slab/slub can issue alloc_pages() any time with any flags they want and > > it won't be accounted to memcg, because kmem is accounted at slab/slub > > layer, not in buddy. > > Hmmm? I meant the eventual calling into try_charge w/ GFP_NOWAIT. > Speculative usage of GFP_NOWAIT is bound to increase and we don't want > to put on extra restrictions from memcg side. We already put restrictions on slab/slub from memcg side, because kmem accounting is a part of slab/slub. They have to cooperate in order to get things working. If slab/slub wants to make a speculative allocation for some reason, it should just put memcg_charge out of this speculative alloc section. This is what this patch set does. We have to be cautious about placing memcg_charge in slab/slub. To understand why, consider SLAB case, which first tries to allocate from all nodes in the order of preference w/o __GFP_WAIT and only if it fails falls back on an allocation from any node w/ __GFP_WAIT. This is its internal algorithm. If we blindly put memcg_charge to alloc_slab method, then, when we are near the memcg limit, we will go over all NUMA nodes in vain, then finally fall back to __GFP_WAIT allocation, which will get a slab from a random node. Not only we do more work than necessary due to walking over all NUMA nodes for nothing, but we also break SLAB internal logic! And you just can't fix it in memcg, because memcg knows nothing about the internal logic of SLAB, how it handles NUMA nodes. SLUB has a different problem. It tries to avoid high-order allocations if there is a risk of invoking costly memory compactor. It has nothing to do with memcg, because memcg does not care if the charge is for a high order page or not. Thanks, Vladimir > For memory.high, > punting to the return path is a pretty stright-forward solution which > should make the problem go away almost entirely. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 10:46:04AM -0400, Tejun Heo wrote: > Hello, Vladimir. > > On Mon, Aug 31, 2015 at 05:20:49PM +0300, Vladimir Davydov wrote: > ... > > That being said, this is the fix at the right layer. > > While this *might* be a necessary workaround for the hard limit case > right now, this is by no means the fix at the right layer. The > expectation is that mm keeps a reasonable amount of memory available > for allocations which can't block. These allocations may fail from > time to time depending on luck and under extreme memory pressure but > the caller should be able to depend on it as a speculative allocation > mechanism which doesn't fail willy-nilly. > > Hardlimit breaking GFP_NOWAIT behavior is a bug on memcg side, not > slab or slub. I never denied that there is GFP_NOWAIT/GFP_NOFS problem in memcg. I even proposed ways to cope with it in one of the previous e-mails. Nevertheless, we just can't allow slab/slub internals call memcg_charge whenever they want as I pointed out in a parallel thread. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 11:47:56AM -0400, Tejun Heo wrote: > On Mon, Aug 31, 2015 at 06:18:14PM +0300, Vladimir Davydov wrote: > > We have to be cautious about placing memcg_charge in slab/slub. To > > understand why, consider SLAB case, which first tries to allocate from > > all nodes in the order of preference w/o __GFP_WAIT and only if it fails > > falls back on an allocation from any node w/ __GFP_WAIT. This is its > > internal algorithm. If we blindly put memcg_charge to alloc_slab method, > > then, when we are near the memcg limit, we will go over all NUMA nodes > > in vain, then finally fall back to __GFP_WAIT allocation, which will get > > a slab from a random node. Not only we do more work than necessary due > > to walking over all NUMA nodes for nothing, but we also break SLAB > > internal logic! And you just can't fix it in memcg, because memcg knows > > nothing about the internal logic of SLAB, how it handles NUMA nodes. > > > > SLUB has a different problem. It tries to avoid high-order allocations > > if there is a risk of invoking costly memory compactor. It has nothing > > to do with memcg, because memcg does not care if the charge is for a > > high order page or not. > > Maybe I'm missing something but aren't both issues caused by memcg > failing to provide headroom for NOWAIT allocations when the > consumption gets close to the max limit? That's correct. > Regardless of the specific usage, !__GFP_WAIT means "give me memory if > it can be spared w/o inducing direct time-consuming maintenance work" > and the contract around it is that such requests will mostly succeed > under nominal conditions. Also, slab/slub might not stay as the only > user of try_charge(). Indeed, there might be other users trying GFP_NOWAIT before falling back to GFP_KERNEL, but they are not doing that constantly and hence cause no problems. If SLAB/SLUB plays such tricks, the problem becomes massive: under certain conditions *every* try_charge may be invoked w/o __GFP_WAIT, resulting in memory.high breaching and hitting memory.max. Generally speaking, handing over reclaim responsibility to task_work won't help, because there might be cases when a process spends quite a lot of time in kernel invoking lots of GFP_KERNEL allocations before returning to userspace. Without fixing slab/slub, such a process will charge w/o __GFP_WAIT and therefore can exceed memory.high and reach memory.max. If there are no other active processes in the cgroup, the cgroup can stay with memory.high excess for a relatively long time (suppose the process was throttled in kernel), possibly hurting the rest of the system. What is worse, if the process happens to invoke a real GFP_NOWAIT allocation when it's about to hit the limit, it will fail. If we want to allow slab/slub implementation to invoke try_charge wherever it wants, we need to introduce an asynchronous thread doing reclaim when a memcg is approaching its limit (or teach kswapd do that). That's a way to go, but what's the point to complicate things prematurely while it seems we can fix the problem by using the technique similar to the one behind memory.high? Nevertheless, even if we introduced such a thread, it'd be just insane to allow slab/slub blindly insert try_charge. Let me repeat the examples of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of try_charge I gave above: - memcg knows nothing about NUMA nodes, so what's the point in failing !__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes? - memcg knows nothing about high order pages, so what's the point in failing !__GFP_WAIT allocations used by SLUB to try to allocate a high order page? Thanks, Vladimir > I still think solving this from memcg side is the right direction. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 01:03:09PM -0400, Tejun Heo wrote: > On Mon, Aug 31, 2015 at 07:51:32PM +0300, Vladimir Davydov wrote: > ... > > If we want to allow slab/slub implementation to invoke try_charge > > wherever it wants, we need to introduce an asynchronous thread doing > > reclaim when a memcg is approaching its limit (or teach kswapd do that). > > In the long term, I think this is the way to go. Quite probably, or we can use task_work, or direct reclaim instead. It's not that obvious to me yet which one is the best. > > > That's a way to go, but what's the point to complicate things > > prematurely while it seems we can fix the problem by using the technique > > similar to the one behind memory.high? > > Cuz we're now scattering workarounds to multiple places and I'm sure > we'll add more try_charge() users (e.g. we want to fold in tcp memcg > under the same knobs) and we'll have to worry about the same problem > all over again and will inevitably miss some cases leading to subtle > failures. I don't think we will need to insert try_charge_kmem anywhere else, because all kmem users either allocate memory using kmalloc and friends or using alloc_pages. kmalloc is accounted. For those who prefer alloc_pages, there is alloc_kmem_pages helper. > > > Nevertheless, even if we introduced such a thread, it'd be just insane > > to allow slab/slub blindly insert try_charge. Let me repeat the examples > > of SLAB/SLUB sub-optimal behavior caused by thoughtless usage of > > try_charge I gave above: > > > > - memcg knows nothing about NUMA nodes, so what's the point in failing > >!__GFP_WAIT allocations used by SLAB while inspecting NUMA nodes? > > - memcg knows nothing about high order pages, so what's the point in > >failing !__GFP_WAIT allocations used by SLUB to try to allocate a > >high order page? > > Both are optimistic speculative actions and as long as memcg can > guarantee that those requests will succeed under normal circumstances, > as does the system-wide mm does, it isn't a problem. > > In general, we want to make sure inside-cgroup behaviors as close to > system-wide behaviors as possible, scoped but equivalent in kind. > Doing things differently, while inevitable in certain cases, is likely > to get messy in the long term. I totally agree that we should strive to make a kmem user feel roughly the same in memcg as if it were running on a host with equal amount of RAM. There are two ways to achieve that: 1. Make the API functions, i.e. kmalloc and friends, behave inside memcg roughly the same way as they do in the root cgroup. 2. Make the internal memcg functions, i.e. try_charge and friends, behave roughly the same way as alloc_pages. I find way 1 more flexible, because we don't have to blindly follow heuristics used on global memory reclaim and therefore have more opportunities to achieve the same goal. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Mon, Aug 31, 2015 at 03:22:22PM -0500, Christoph Lameter wrote: > On Mon, 31 Aug 2015, Vladimir Davydov wrote: > > > I totally agree that we should strive to make a kmem user feel roughly > > the same in memcg as if it were running on a host with equal amount of > > RAM. There are two ways to achieve that: > > > > 1. Make the API functions, i.e. kmalloc and friends, behave inside > > memcg roughly the same way as they do in the root cgroup. > > 2. Make the internal memcg functions, i.e. try_charge and friends, > > behave roughly the same way as alloc_pages. > > > > I find way 1 more flexible, because we don't have to blindly follow > > heuristics used on global memory reclaim and therefore have more > > opportunities to achieve the same goal. > > The heuristics need to integrate well if its in a cgroup or not. In > general make use of cgroups as transparent as possible to the rest of the > code. Half of kmem accounting implementation resides in SLAB/SLUB. We can't just make use of cgroups there transparent. For the rest of the code using kmalloc, cgroups are transparent. Indeed, we can make memcg_charge_slab behave exactly like alloc_pages, we can even put it to alloc_pages (where it used to be), but why if the only user of memcg_charge_slab is SLAB/SLUB core? I think we'd have more space to manoeuvre if we just taught SLAB/SLUB to use memcg_charge_slab wisely (as it used to until recently), because memcg charge/reclaim is quite different from global alloc/reclaim: - it isn't aware of NUMA nodes, so trying to charge w/o __GFP_WAIT while inspecting nodes, like in case of SLAB, is meaningless - it isn't aware of high order page allocations, so trying to charge w/o __GFP_WAIT while trying optimistically to get a high order page, like in case of SLUB, is meaningless too - it can always let a high prio allocation go unaccounted, so IMO there is no point in introducing emergency reserves (__GFP_MEMALLOC handling) - it can always charge a GFP_NOWAIT allocation even if it exceeds the limit, issuing direct reclaim when a GFP_KERNEL allocation comes or from a task work, because there is no risk of depleting memory reserves; so it isn't obvious to me whether we really need an aync thread handling memcg reclaim like kswapd Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote: > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote: > > On Mon, Aug 31, 2015 at 03:24:15PM +0200, Michal Hocko wrote: > > > On Sun 30-08-15 22:02:16, Vladimir Davydov wrote: > > > > > > Tejun reported that sometimes memcg/memory.high threshold seems to be > > > > silently ignored if kmem accounting is enabled: > > > > > > > > http://www.spinics.net/lists/linux-mm/msg93613.html > > > > > > > > It turned out that both SLAB and SLUB try to allocate without __GFP_WAIT > > > > first. As a result, if there is enough free pages, memcg reclaim will > > > > not get invoked on kmem allocations, which will lead to uncontrollable > > > > growth of memory usage no matter what memory.high is set to. > > > > > > Right but isn't that what the caller explicitly asked for? > > > > No. If the caller of kmalloc() asked for a __GFP_WAIT allocation, we > > might ignore that and charge memcg w/o __GFP_WAIT. > > I was referring to the slab allocator as the caller. Sorry for not being > clear about that. > > > > Why should we ignore that for kmem accounting? It seems like a fix at > > > a wrong layer to me. > > > > Let's forget about memory.high for a minute. > > > > 1. SLAB. Suppose someone calls kmalloc_node and there is enough free > > memory on the preferred node. W/o memcg limit set, the allocation > > will happen from the preferred node, which is OK. If there is memcg > > limit, we can currently fail to allocate from the preferred node if > > we are near the limit. We issue memcg reclaim and go to fallback > > alloc then, which will most probably allocate from a different node, > > although there is no reason for that. This is a bug. > > I am not familiar with the SLAB internals much but how is it different > from the global case. If the preferred node is full then __GFP_THISNODE > request will make it fail early even without giving GFP_NOWAIT > additional access to atomic memory reserves. The fact that memcg case > fails earlier is perfectly expected because the restriction is tighter > than the global case. memcg restrictions are orthogonal to NUMA: failing an allocation from a particular node does not mean failing memcg charge and vice versa. > > How the fallback is implemented and whether trying other node before > reclaiming from the preferred one is reasonable I dunno. This is for > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior > for memcg enabled setups subtly different. And that is bad. Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while inspecting if a NUMA node has free pages makes SLAB behaviour subtly differently: SLAB will walk over all NUMA nodes for nothing instead of invoking memcg reclaim once a free page is found. You are talking about memcg/kmem accounting as if it were done in the buddy allocator on top of which the slab layer is built knowing nothing about memcg accounting on the lower layer. That's not true and that simply can't be true. Kmem accounting is implemented at the slab layer. Memcg provides its memcg_charge_slab/uncharge methods solely for slab core, so it's OK to have some calling conventions between them. What we are really obliged to do is to preserve behavior of slab's external API, i.e. kmalloc and friends. > > > 2. SLUB. Someone calls kmalloc and there is enough free high order > > pages. If there is no memcg limit, we will allocate a high order > > slab page, which is in accordance with SLUB internal logic. With > > memcg limit set, we are likely to fail to charge high order page > > (because we currently try to charge high order pages w/o __GFP_WAIT) > > and fallback on a low order page. The latter is unexpected and > > unjustified. > > And this case very similar and I even argue that it shows more > brokenness with your patch. The SLUB allocator has _explicitly_ asked > for an allocation _without_ reclaim because that would be unnecessarily > too costly and there is other less expensive fallback. But memcg would You are ignoring the fact that, in contrast to alloc_pages, for memcg there is practically no difference between charging a 4-order page or a 1-order page. OTOH, using 1-order pages where we could go with 4-order pages increases page fragmentation at the global level. This subtly breaks internal SLUB optimization. Once again, kmem accounting is not something staying aside from slab core, it's a part of slab core. > be ignoring this with your patch AFAIU and break the optimization. There > are other cases like
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote: > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote: > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote: > > > On Mon 31-08-15 17:20:49, Vladimir Davydov wrote: > {...} > > > > 1. SLAB. Suppose someone calls kmalloc_node and there is enough free > > > > memory on the preferred node. W/o memcg limit set, the allocation > > > > will happen from the preferred node, which is OK. If there is memcg > > > > limit, we can currently fail to allocate from the preferred node if > > > > we are near the limit. We issue memcg reclaim and go to fallback > > > > alloc then, which will most probably allocate from a different node, > > > > although there is no reason for that. This is a bug. > > > > > > I am not familiar with the SLAB internals much but how is it different > > > from the global case. If the preferred node is full then __GFP_THISNODE > > > request will make it fail early even without giving GFP_NOWAIT > > > additional access to atomic memory reserves. The fact that memcg case > > > fails earlier is perfectly expected because the restriction is tighter > > > than the global case. > > > > memcg restrictions are orthogonal to NUMA: failing an allocation from a > > particular node does not mean failing memcg charge and vice versa. > > Sure memcg doesn't care about NUMA it just puts an additional constrain > on top of all existing ones. The point I've tried to make is that the > logic is currently same whether it is page allocator (with the node > restriction) or memcg (cumulative amount restriction) are behaving > consistently. Neither of them try to reclaim in order to achieve its > goals. How conservative is memcg about allowing GFP_NOWAIT allocation > is a separate issue and all those details belong to memcg proper same > as the allocation strategy for these allocations belongs to the page > allocator. > > > > How the fallback is implemented and whether trying other node before > > > reclaiming from the preferred one is reasonable I dunno. This is for > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the behavior > > > for memcg enabled setups subtly different. And that is bad. > > > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while > > inspecting if a NUMA node has free pages makes SLAB behaviour subtly > > differently: SLAB will walk over all NUMA nodes for nothing instead of > > invoking memcg reclaim once a free page is found. > > So you are saying that the SLAB kmem accounting in this particular path > is suboptimal because the fallback mode doesn't retry local node with > the reclaim enabled before falling back to other nodes? I'm just pointing out some subtle behavior changes in slab you were opposed to. > I would consider it quite surprising as well even for the global case > because __GFP_THISNODE doesn't wake up kswapd to make room on that node. > > > You are talking about memcg/kmem accounting as if it were done in the > > buddy allocator on top of which the slab layer is built knowing nothing > > about memcg accounting on the lower layer. That's not true and that > > simply can't be true. Kmem accounting is implemented at the slab layer. > > Memcg provides its memcg_charge_slab/uncharge methods solely for > > slab core, so it's OK to have some calling conventions between them. > > What we are really obliged to do is to preserve behavior of slab's > > external API, i.e. kmalloc and friends. > > I guess I understand what you are saying here but it sounds like special > casing which tries to be clever because the current code understands > both the lower level allocator and kmem charge paths to decide how to What do you mean by saying "it understands the lower level allocator"? AFAIK we have memcg callbacks only in special places, like page fault handler or kmalloc. > juggle with them. This is imho bad and hard to maintain long term. We already juggle. Just grep where and how we insert mem_cgroup_try_charge. > > > > > 2. SLUB. Someone calls kmalloc and there is enough free high order > > > > pages. If there is no memcg limit, we will allocate a high order > > > > slab page, which is in accordance with SLUB internal logic. With > > > > memcg limit set, we are likely to fail to charge high order page > > > > (because we currently try to charge high order pages w/o __GFP_WAIT) > > > > and fallback on a low order page. The latter is unexpected and
Re: [PATCH 3/3] memcg: simplify and inline __mem_cgroup_from_kmem
On Fri, Oct 16, 2015 at 04:17:26PM +0300, Kirill A. Shutemov wrote: > On Mon, Oct 05, 2015 at 01:21:43AM +0300, Vladimir Davydov wrote: > > Before the previous patch, __mem_cgroup_from_kmem had to handle two > > types of kmem - slab pages and pages allocated with alloc_kmem_pages - > > differently, because slab pages did not store information about owner > > memcg in the page struct. Now we can unify it. Since after it, this > > function becomes tiny we can fold it into mem_cgroup_from_kmem. > > > > Signed-off-by: Vladimir Davydov > > --- > > include/linux/memcontrol.h | 7 --- > > mm/memcontrol.c| 18 -- > > 2 files changed, 4 insertions(+), 21 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 8a9b7a798f14..0e2e039609d1 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -769,8 +769,6 @@ static inline int memcg_cache_id(struct mem_cgroup > > *memcg) > > struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); > > void __memcg_kmem_put_cache(struct kmem_cache *cachep); > > > > -struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr); > > - > > static inline bool __memcg_kmem_bypass(gfp_t gfp) > > { > > if (!memcg_kmem_enabled()) > > @@ -832,9 +830,12 @@ static __always_inline void > > memcg_kmem_put_cache(struct kmem_cache *cachep) > > > > static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) > > { > > + struct page *page; > > + > > if (!memcg_kmem_enabled()) > > return NULL; > > - return __mem_cgroup_from_kmem(ptr); > > + page = virt_to_head_page(ptr); > > + return page->mem_cgroup; > > } > > virt_to_head_page() is defined in but you don't include it, > and the commit breaks build for me (on v4.3-rc5-mmotm-2015-10-15-15-20). > > CC arch/x86/kernel/asm-offsets.s > In file included from /home/kas/linux/mm/include/linux/swap.h:8:0, > from /home/kas/linux/mm/include/linux/suspend.h:4, > from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12: > /home/kas/linux/mm/include/linux/memcontrol.h: In function > â?~mem_cgroup_from_kmemâ?T: > /home/kas/linux/mm/include/linux/memcontrol.h:841:9: error: implicit > declaration of function â?~virt_to_head_pageâ?T > [-Werror=implicit-function-declaration] > page = virt_to_head_page(ptr); > ^ > /home/kas/linux/mm/include/linux/memcontrol.h:841:7: warning: assignment > makes pointer from integer without a cast [-Wint-conversion] > page = virt_to_head_page(ptr); >^ > In file included from /home/kas/linux/mm/include/linux/suspend.h:8:0, > from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12: > /home/kas/linux/mm/include/linux/mm.h: At top level: > /home/kas/linux/mm/include/linux/mm.h:452:28: error: conflicting types for > â?~virt_to_head_pageâ?T > static inline struct page *virt_to_head_page(const void *x) > ^ > In file included from /home/kas/linux/mm/include/linux/swap.h:8:0, > from /home/kas/linux/mm/include/linux/suspend.h:4, > from /home/kas/linux/mm/arch/x86/kernel/asm-offsets.c:12: > /home/kas/linux/mm/include/linux/memcontrol.h:841:9: note: previous implicit > declaration of â?~virt_to_head_pageâ?T was here > page = virt_to_head_page(ptr); > ^ > cc1: some warnings being treated as errors Oops, in my config I have CONFIG_CGROUP_WRITEBACK enabled, which results in including mm.h to memcontrol.h indirectly: linux/memcontrol.h linux/writeback.h linux/bio.h linux/highmem.h linux/mm.h That's why I didn't notice this. Sorry about that. > > The patch below fixes it for me (and for allmodconfig on x86-64), but I'm not > sure if it have any side effects on other configurations. It should work OK with any config, otherwise CONFIG_CGROUP_WRITEBACK would be broken too. Andrew, could you please merge the fix by Kirill into memcg-simplify-and-inline-__mem_cgroup_from_kmem.patch Thanks, Vladimir > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 47677acb4516..e8e52e502c20 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -26,6 +26,7 @@ > #include > #include > #include > +#include > #include > #include > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] memcg: simplify and inline __mem_cgroup_from_kmem
On Fri, Oct 16, 2015 at 03:12:23PM -0700, Hugh Dickins wrote: ... > Are you expecting to use mem_cgroup_from_kmem() from other places > in future? Seems possible; but at present it's called from only Not in the near future. At least, currently I can't think of any other use for it except list_lru_from_kmem. > one place, and (given how memcontrol.h has somehow managed to avoid > including mm.h all these years), I thought it would be nice to avoid > it for just this; and fixed my build with the patch below last night. > Whatever you all think best: just wanted to point out an alternative. Makes sense, thanks! I would even inline mem_cgroup_from_kmem to list_lru_from_kmem: diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 47677acb4516..2077b9bb4883 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -831,16 +831,6 @@ static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) if (memcg_kmem_enabled()) __memcg_kmem_put_cache(cachep); } - -static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) -{ - struct page *page; - - if (!memcg_kmem_enabled()) - return NULL; - page = virt_to_head_page(ptr); - return page->mem_cgroup; -} #else #define for_each_memcg_cache_index(_idx) \ for (; NULL; ) @@ -886,11 +876,6 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) { } - -static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) -{ - return NULL; -} #endif /* CONFIG_MEMCG_KMEM */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/list_lru.c b/mm/list_lru.c index 28237476b055..00d0a70af70a 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -68,10 +68,10 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) { struct mem_cgroup *memcg; - if (!nlru->memcg_lrus) + if (!memcg_kmem_enabled() || !nlru->memcg_lrus) return &nlru->lru; - memcg = mem_cgroup_from_kmem(ptr); + memcg = virt_to_head_page(ptr)->mem_cgroup; if (!memcg) return &nlru->lru; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] memcg: unify slab and other kmem pages charging
On Fri, Oct 16, 2015 at 05:19:32PM -0700, Johannes Weiner wrote: ... > I think it'd be better to have an outer function than a magic > parameter for the memcg lookup. Could we fold this in there? Yeah, that looks neater. Thanks! Andrew, could you please fold this one too? > > --- > > Signed-off-by: Johannes Weiner > --- > include/linux/memcontrol.h | 7 --- > mm/memcontrol.c| 36 ++-- > mm/slab.h | 4 ++-- > 3 files changed, 24 insertions(+), 23 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 47677ac..730a65d 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -756,8 +756,9 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup > *memcg) > * conditions, but because they are pretty simple, they are expected to be > * fast. > */ > -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order, > - struct mem_cgroup *memcg); > +int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, > + struct mem_cgroup *memcg); > +int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); > void __memcg_kmem_uncharge(struct page *page, int order); > > /* > @@ -797,7 +798,7 @@ static __always_inline int memcg_kmem_charge(struct page > *page, > { > if (__memcg_kmem_bypass(gfp)) > return 0; > - return __memcg_kmem_charge(page, gfp, order, NULL); > + return __memcg_kmem_charge(page, gfp, order); > } > > /** > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 15db655..6fc9959 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2378,39 +2378,39 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep) > css_put(&cachep->memcg_params.memcg->css); > } > > -/* > - * If @memcg != NULL, charge to @memcg, otherwise charge to the memcg the > - * current task belongs to. > - */ > -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order, > - struct mem_cgroup *memcg) > +int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, > + struct mem_cgroup *memcg) > { > - struct page_counter *counter; > unsigned int nr_pages = 1 << order; > - bool put = false; > + struct page_counter *counter; > int ret = 0; > > - if (!memcg) { > - memcg = get_mem_cgroup_from_mm(current->mm); > - put = true; > - } > if (!memcg_kmem_is_active(memcg)) > - goto out; > + return 0; > > ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); > if (ret) > - goto out; > + return ret; > > ret = try_charge(memcg, gfp, nr_pages); > if (ret) { > page_counter_uncharge(&memcg->kmem, nr_pages); > - goto out; > + return ret; > } > > page->mem_cgroup = memcg; > -out: > - if (put) > - css_put(&memcg->css); > + > + return 0; > +} > + > +int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) > +{ > + struct mem_cgroup *memcg; > + int ret; > + > + memcg = get_mem_cgroup_from_mm(current->mm); > + ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg); > + css_put(&memcg->css); > return ret; > } > > diff --git a/mm/slab.h b/mm/slab.h > index 3d667a4..27492eb 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -244,8 +244,8 @@ static __always_inline int memcg_charge_slab(struct page > *page, > return 0; > if (is_root_cache(s)) > return 0; > - return __memcg_kmem_charge(page, gfp, order, > -s->memcg_params.memcg); > + return __memcg_kmem_charge_memcg(page, gfp, order, > + s->memcg_params.memcg); > } > > extern void slab_init_memcg_params(struct kmem_cache *); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/8] mm: memcontrol: account socket memory in unified hierarchy
Hi Johannes, On Thu, Oct 22, 2015 at 12:21:28AM -0400, Johannes Weiner wrote: ... > Patch #5 adds accounting and tracking of socket memory to the unified > hierarchy memory controller, as described above. It uses the existing > per-cpu charge caches and triggers high limit reclaim asynchroneously. > > Patch #8 uses the vmpressure extension to equalize pressure between > the pages tracked natively by the VM and socket buffer pages. As the > pool is shared, it makes sense that while natively tracked pages are > under duress the network transmit windows are also not increased. First of all, I've no experience in networking, so I'm likely to be mistaken. Nevertheless I beg to disagree that this patch set is a step in the right direction. Here goes why. I admit that your idea to get rid of explicit tcp window control knobs and size it dynamically basing on memory pressure instead does sound tempting, but I don't think it'd always work. The problem is that in contrast to, say, dcache, we can't shrink tcp buffers AFAIU, we can only stop growing them. Now suppose a system hasn't experienced memory pressure for a while. If we don't have explicit tcp window limit, tcp buffers on such a system might have eaten almost all available memory (because of network load/problems). If a user workload that needs a significant amount of memory is started suddenly then, the network code will receive a notification and surely stop growing buffers, but all those buffers accumulated won't disappear instantly. As a result, the workload might be unable to find enough free memory and have no choice but invoke OOM killer. This looks unexpected from the user POV. That said, I think we do need per memcg tcp window control similar to what we have system-wide. In other words, Glauber's work makes sense to me. You might want to point me at my RFC patch where I proposed to revert it (https://lkml.org/lkml/2014/9/12/401). Well, I've changed my mind since then. Now I think I was mistaken, luckily I was stopped. However, I may be mistaken again :-) Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/8] net: consolidate memcg socket buffer tracking and accounting
On Thu, Oct 22, 2015 at 12:21:31AM -0400, Johannes Weiner wrote: > The tcp memory controller has extensive provisions for future memory > accounting interfaces that won't materialize after all. Cut the code > base down to what's actually used, now and in the likely future. > > - There won't be any different protocol counters in the future, so a > direct sock->sk_memcg linkage is enough. This eliminates a lot of > callback maze and boilerplate code, and restores most of the socket > allocation code to pre-tcp_memcontrol state. > > - There won't be a tcp control soft limit, so integrating the memcg In fact, the code is ready for the "soft" limit (I mean min, pressure, max tuple), it just lacks a knob. > code into the global skmem limiting scheme complicates things > unnecessarily. Replace all that with simple and clear charge and > uncharge calls--hidden behind a jump label--to account skb memory. > > - The previous jump label code was an elaborate state machine that > tracked the number of cgroups with an active socket limit in order > to enable the skmem tracking and accounting code only when actively > necessary. But this is overengineered: it was meant to protect the > people who never use this feature in the first place. Simply enable > the branches once when the first limit is set until the next reboot. > ... > @@ -1136,9 +1090,6 @@ static inline bool sk_under_memory_pressure(const > struct sock *sk) > if (!sk->sk_prot->memory_pressure) > return false; > > - if (mem_cgroup_sockets_enabled && sk->sk_cgrp) > - return !!sk->sk_cgrp->memory_pressure; > - AFAIU, now we won't shrink the window on hitting the limit, i.e. this patch subtly changes the behavior of the existing knobs, potentially breaking them. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy
On Thu, Oct 22, 2015 at 12:21:33AM -0400, Johannes Weiner wrote: ... > @@ -5500,13 +5524,38 @@ void sock_release_memcg(struct sock *sk) > */ > bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > + unsigned int batch = max(CHARGE_BATCH, nr_pages); > struct page_counter *counter; > + bool force = false; > > - if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > + if (page_counter_try_charge(&memcg->skmem, nr_pages, &counter)) > + return true; > + page_counter_charge(&memcg->skmem, nr_pages); > + return false; > + } > + > + if (consume_stock(memcg, nr_pages)) > return true; > +retry: > + if (page_counter_try_charge(&memcg->memory, batch, &counter)) > + goto done; Currently, we use memcg->memory only for charging memory pages. Besides, every page charged to this counter (including kmem) has ->mem_cgroup field set appropriately. This looks consistent and nice. As an extra benefit, we can track all pages charged to a memory cgroup via /proc/kapgecgroup. Now, you charge "window size" to it, which AFAIU isn't necessarily equal to the amount of memory actually consumed by the cgroup for socket buffers. I think this looks ugly and inconsistent with the existing behavior. I agree that we need to charge socker buffers to ->memory, but IMO we should do that per each skb page, using memcg_kmem_charge_kmem somewhere in alloc_skb_with_frags invoking the reclaimer just as we do for kmalloc, while tcp window size control should stay aside. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 7/8] mm: vmscan: report vmpressure at the level of reclaim activity
On Thu, Oct 22, 2015 at 12:21:35AM -0400, Johannes Weiner wrote: ... > @@ -2437,6 +2439,10 @@ static bool shrink_zone(struct zone *zone, struct > scan_control *sc, > } > } > > + vmpressure(sc->gfp_mask, memcg, > +sc->nr_scanned - scanned, > +sc->nr_reclaimed - reclaimed); > + > /* >* Direct reclaim and kswapd have to scan all memory >* cgroups to fulfill the overall scan target for the > @@ -2454,10 +2460,6 @@ static bool shrink_zone(struct zone *zone, struct > scan_control *sc, > } > } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); > > - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, > -sc->nr_scanned - nr_scanned, > -sc->nr_reclaimed - nr_reclaimed); > - > if (sc->nr_reclaimed - nr_reclaimed) > reclaimable = true; > I may be mistaken, but AFAIU this patch subtly changes the behavior of vmpressure visible from the userspace: w/o this patch a userspace process will only receive a notification for a memory cgroup only if *this* memory cgroup calls reclaimer; with this patch userspace notification will be issued even if reclaimer is invoked by any cgroup up the hierarchy. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 8/8] mm: memcontrol: hook up vmpressure to socket pressure
On Thu, Oct 22, 2015 at 12:21:36AM -0400, Johannes Weiner wrote: ... > @@ -185,8 +183,29 @@ static void vmpressure_work_fn(struct work_struct *work) > vmpr->reclaimed = 0; > spin_unlock(&vmpr->sr_lock); > > + level = vmpressure_calc_level(scanned, reclaimed); > + > + if (level > VMPRESSURE_LOW) { So we start socket_pressure at MEDIUM. Why not at LOW or CRITICAL? > + struct mem_cgroup *memcg; > + /* > + * Let the socket buffer allocator know that we are > + * having trouble reclaiming LRU pages. > + * > + * For hysteresis, keep the pressure state asserted > + * for a second in which subsequent pressure events > + * can occur. > + * > + * XXX: is vmpressure a global feature or part of > + * memcg? There shouldn't be anything memcg-specific > + * about exporting reclaim success ratios from the VM. > + */ > + memcg = container_of(vmpr, struct mem_cgroup, vmpressure); > + if (memcg != root_mem_cgroup) > + memcg->socket_pressure = jiffies + HZ; Why 1 second? Thanks, Vladimir > + } > + > do { > - if (vmpressure_event(vmpr, scanned, reclaimed)) > + if (vmpressure_event(vmpr, level)) > break; > /* >* If not handled, propagate the event upward into the -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Fix memcg/memory.high in case kmem accounting is enabled
[ I'll try to summarize my point in one hunk instead of spreading it all over the e-mail, because IMO it's becoming a kind of difficult to follow. If you think that there's a question I dodge, please let me now and I'll try to address it separately. Also, adding Johannes to Cc (I noticed that I accidentally left him out), because this discussion seems to be fundamental and may affect our further steps dramatically. ] On Tue, Sep 01, 2015 at 08:38:50PM +0200, Michal Hocko wrote: > On Tue 01-09-15 19:55:54, Vladimir Davydov wrote: > > On Tue, Sep 01, 2015 at 05:01:20PM +0200, Michal Hocko wrote: > > > On Tue 01-09-15 16:40:03, Vladimir Davydov wrote: > > > > On Tue, Sep 01, 2015 at 02:36:12PM +0200, Michal Hocko wrote: > [...] > > > > > How the fallback is implemented and whether trying other node before > > > > > reclaiming from the preferred one is reasonable I dunno. This is for > > > > > SLAB to decide. But ignoring GFP_NOWAIT for this path makes the > > > > > behavior > > > > > for memcg enabled setups subtly different. And that is bad. > > > > > > > > Quite the contrary. Trying to charge memcg w/o __GFP_WAIT while > > > > inspecting if a NUMA node has free pages makes SLAB behaviour subtly > > > > differently: SLAB will walk over all NUMA nodes for nothing instead of > > > > invoking memcg reclaim once a free page is found. > > > > > > So you are saying that the SLAB kmem accounting in this particular path > > > is suboptimal because the fallback mode doesn't retry local node with > > > the reclaim enabled before falling back to other nodes? > > > > I'm just pointing out some subtle behavior changes in slab you were > > opposed to. > > I guess we are still not at the same page here. If the slab has a subtle > behavior (and from what you are saying it seems it has the same behavior > at the global scope) then we should strive to fix it rather than making > it more obscure just to not expose GFP_NOWAIT to memcg which is not > handled properly currently wrt. high limit (more on that below) which > was the primary motivation for the patch AFAIU. Slab is a kind of abnormal alloc_pages user. By calling alloc_pages_node with __GFP_THISNODE and w/o __GFP_WAIT before falling back to alloc_pages with the caller's context, it does the job normally done by alloc_pages itself. It's not what is done massively. Leaving slab charge path as is looks really ugly to me. Look, slab iterates over all nodes, inspecting if they have free pages and fails even if they do due to the memcg constraint... My point is that what slab does is a pretty low level thing, normal users call alloc_pages or kmalloc with flags corresponding to their context. Of course, there may be special users trying optimistically GFP_NOWAIT, but they aren't massive, and that simplifies things for memcg a lot. I mean if we can rely on the fact that the number of GFP_NOWAIT allocations that can occur in a row is limited we can use direct reclaim (like memory.high) and/or task_work reclaim to fix GFP_NOWAIT failures. Otherwise, we have to mimic the global alloc with most its heuristics. I don't think that copying those heuristics is the right thing to do, because in memcg case the same problems may be resolved much easier, because we don't actually experience real memory shortage when hitting the limit. Moreover, we already treat some flags not in the same way as in case of slab for simplicity. E.g. we let __GFP_NOFAIL allocations go uncharged instead of retrying infinitely. We ignore __GFP_THISNODE thing and we just cannot take it into account. We ignore allocation order, because that makes no sense for memcg. To sum it up. Basically, there are two ways of handling kmemcg charges: 1. Make the memcg try_charge mimic alloc_pages behavior. 2. Make API functions (kmalloc, etc) work in memcg as if they were called from the root cgroup, while keeping interactions between the low level subsys (slab) and memcg private. Way 1 might look appealing at the first glance, but at the same time it is much more complex, because alloc_pages has grown over the years to handle a lot of subtle situations that may arise on global memory pressure, but impossible in memcg. What does way 1 give us then? We can't insert try_charge directly to alloc_pages and have to spread its calls all over the code anyway, so why is it better? Easier to use it in places where users depend on buddy allocator peculiarities? There are not many such users. I understand that the idea of way 1 is to provide a well-defined memcg API independent of the rest of the code, but that's just impossible. You need special casing anyway. E.g. you need those get/