This patch is the avenrun part of vz7 commit 34a1dc1e4e3d ("sched: Account task_group::cpustat,taskstats,avenrun").
Extracted from "Initial patch".

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

+++ ve/sched: Do not use kstat_glb_lock to update kstat_glob::nr_unint_avg

kstat_glob::nr_unint_avg can't be updated in parallel on two or more
cpus, so on modification we only have to protect against readers. So,
avoid using the global kstat_glb_lock here, to minimize its sharing
with the other counters it protects.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
(cherry picked from commit 715f311fdb4ab0b7922f9e53617c5821ae36bfaf)
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

+++ sched/ve: Use cfs_rq::h_nr_running to count loadavg

cfs_rq::nr_running contains the number of child entities one level
below: tasks and child cfs_rqs, but it does not contain tasks from
deeper levels. Use cfs_rq::h_nr_running instead, as it contains the
number of tasks in the whole child hierarchy.

https://jira.sw.ru/browse/PSBM-81572

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
Fixes: 028c54e613a3 ("sched: Account task_group::avenrun")
(cherry picked from vz7 commit 5f2a49a05629bd709ad6bfce83bfacc58a4db3d9)
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

+++ sched/ve: Iterate only VE root cpu cgroups to count loadavg

Counting loadavg, we are only interested in VE root cpu cgroups, since
a VE's loadavg is the analogue of a node's loadavg. So, this patch
makes us iterate over only that type of cpu cgroup when we calculate
loadavg. Since this code is called from interrupt context, this may
give positive performance results.

https://jira.sw.ru/browse/PSBM-81572

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
(cherry picked from vz7 commit 4140a241e5ec2230105f5c4513400a6b5ecea92f)
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

+++ sched: Export calc_load_ve()

This will be used in the next patch.
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

=========================
Patchset description:
Make calc_load_ve() be executed out of jiffies_lock

https://jira.sw.ru/browse/PSBM-84967

Kirill Tkhai (3):
  sched: Make calc_global_load() return true when it's need to update
    ve statistic
  sched: Export calc_load_ve()
  sched: Call calc_load_ve() out of jiffies_lock

(cherry picked from vz7 commit 738b92fb2cdd6577925a6b7019925f320cd379df)
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

+++ sched: Call calc_load_ve() out of jiffies_lock

jiffies_lock is a big global seqlock, which is used in many places. In
combination with other activity, such as SMP function calls and readers
of this seqlock, the system may hang for a long time. There is already
a pair of hard lockups caused by the long iteration in calc_load_ve()
with jiffies_lock held, which made readers of this seqlock spin for a
long time.

This patch makes calc_load_ve() use a separate lock, and this relaxes
jiffies_lock. I think this should be enough to resolve the problem,
since both of the crashes I saw show readers of the seqlock on parallel
cpus, and we won't have to relax further (say, by moving calc_load_ve()
to a softirq).

Note that the principal change this patch makes is that jiffies_lock
readers on parallel cpus won't wait till calc_load_ve() finishes, so
instead of (n_readers + 1) cpus waiting till this function completes,
there will be only 1 cpu doing that.
https://jira.sw.ru/browse/PSBM-84967

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

=========================
Patchset description:
Make calc_load_ve() be executed out of jiffies_lock

https://jira.sw.ru/browse/PSBM-84967

Kirill Tkhai (3):
  sched: Make calc_global_load() return true when it's need to update
    ve statistic
  sched: Export calc_load_ve()
  sched: Call calc_load_ve() out of jiffies_lock

+++ sched: really don't call calc_load_ve() under jiffies_lock

Previously we did all the preparation work for calc_load_ve() not being
executed under jiffies_lock, and thus not called from
calc_global_load(), but forgot to drop the call in calc_global_load().

So we still call the expensive calc_load_ve() under jiffies_lock and
get an NMI. Fix that.

Fixes: 19bc294a5691d ("sched: Call calc_load_ve() out of jiffies_lock")
https://jira.sw.ru/browse/PSBM-102573

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
Signed-off-by: Valeriy Vdovin <valeriy.vdo...@virtuozzo.com>
(cherry picked from vz7 commit 0610b98e5b6537d2ecd99522c3cbd1aa939565e7)
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
---
 include/linux/sched/loadavg.h |  8 ++++++
 kernel/sched/loadavg.c        | 50 +++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h          |  1 +
 kernel/time/tick-common.c     |  9 ++++++-
 kernel/time/tick-sched.c      |  6 ++++-
 kernel/time/timekeeping.c     |  5 +++-
 6 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 34061919f880..1da5768389b7 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -16,6 +16,8 @@
  */
 extern unsigned long avenrun[];		/* Load averages */
 extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
+extern void get_avenrun_ve(unsigned long *loads,
+			   unsigned long offset, int shift);
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)	/* 1.0 as fixed-point */
@@ -47,4 +49,10 @@ extern unsigned long calc_load_n(unsigned long load, unsigned long exp,
 
 extern bool calc_global_load(unsigned long ticks);
 
+#ifdef CONFIG_VE
+extern void calc_load_ve(void);
+#else
+#define calc_load_ve() do { } while (0)
+#endif
+
 #endif /* _LINUX_SCHED_LOADAVG_H */
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index a7b373053dc4..c62f34033112 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -76,6 +76,14 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
 	loads[2] = (avenrun[2] + offset) << shift;
 }
 
+void get_avenrun_ve(unsigned long *loads, unsigned long offset, int shift)
+{
+	struct task_group *tg = task_group(current);
+	loads[0] = (tg->avenrun[0] + offset) << shift;
+	loads[1] = (tg->avenrun[1] + offset) << shift;
+	loads[2] = (tg->avenrun[2] + offset) << shift;
+}
+
 long calc_load_fold_active(struct rq *this_rq, long adjust)
 {
 	long nr_active, delta = 0;
@@ -91,6 +99,48 @@ long calc_load_fold_active(struct rq *this_rq, long adjust)
 	return delta;
 }
 
+#ifdef CONFIG_VE
+extern struct list_head ve_root_list;
+extern spinlock_t load_ve_lock;
+
+void calc_load_ve(void)
+{
+	unsigned long nr_active;
+	struct task_group *tg;
+	int i;
+
+	/*
+	 * This is called without jiffies_lock, and here we protect
+	 * against very rare parallel execution on two or more cpus.
+	 */
+	spin_lock(&load_ve_lock);
+	list_for_each_entry(tg, &ve_root_list, ve_root_list) {
+		nr_active = 0;
+		for_each_possible_cpu(i) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+			nr_active += tg->cfs_rq[i]->h_nr_running;
+			/*
+			 * We do not export nr_unint to parent task groups
+			 * like we do for h_nr_running, as it gives additional
+			 * overhead for activate/deactivate operations. So, we
+			 * don't account child cgroup unint tasks here.
+			 */
+			nr_active += tg->cfs_rq[i]->nr_unint;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+			nr_active += tg->rt_rq[i]->rt_nr_running;
+#endif
+		}
+		nr_active *= FIXED_1;
+
+		tg->avenrun[0] = calc_load(tg->avenrun[0], EXP_1, nr_active);
+		tg->avenrun[1] = calc_load(tg->avenrun[1], EXP_5, nr_active);
+		tg->avenrun[2] = calc_load(tg->avenrun[2], EXP_15, nr_active);
+	}
+	spin_unlock(&load_ve_lock);
+}
+#endif /* CONFIG_VE */
+
 /**
  * fixed_power_int - compute: x^n, in O(log n) time
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93bf1d78c27d..3f1e5ba43910 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -408,6 +408,7 @@ struct task_group {
 	struct list_head ve_root_list;
 #endif
 
+	unsigned long avenrun[3];	/* loadavg data */
 	/* Monotonic time in nsecs: */
 	u64 start_time;
 
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 61ce3505c195..47a9e0719ee8 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/sched.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <trace/events/power.h>
 
@@ -87,13 +88,19 @@ int tick_is_oneshot_available(void)
 static void tick_periodic(int cpu)
 {
 	if (tick_do_timer_cpu == cpu) {
+		bool calc_ve;
+
 		write_seqlock(&jiffies_lock);
 
 		/* Keep track of the next tick event */
 		tick_next_period = ktime_add(tick_next_period, tick_period);
 
-		do_timer(1);
+		calc_ve = do_timer(1);
 		write_sequnlock(&jiffies_lock);
+
+		if (calc_ve)
+			calc_load_ve();
+
 		update_wall_time();
 	}
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4380af8ac923..5f265f7cce76 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/sched/loadavg.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -57,6 +58,7 @@ static ktime_t last_jiffies_update;
 static void tick_do_update_jiffies64(ktime_t now)
 {
 	unsigned long ticks = 0;
+	bool calc_ve = false;
 	ktime_t delta;
 
 	/*
@@ -85,7 +87,7 @@ static void tick_do_update_jiffies64(ktime_t now)
 		last_jiffies_update = ktime_add_ns(last_jiffies_update,
 						   incr * ticks);
 	}
-	do_timer(++ticks);
+	calc_ve = do_timer(++ticks);
 
 	/* Keep the tick_next_period variable up to date */
 	tick_next_period = ktime_add(last_jiffies_update, tick_period);
@@ -94,6 +96,8 @@
 		return;
 	}
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index bce92a9952f4..3b6500c5a357 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2398,8 +2398,11 @@ EXPORT_SYMBOL(hardpps);
  */
 void xtime_update(unsigned long ticks)
 {
+	bool calc_ve;
 	write_seqlock(&jiffies_lock);
-	do_timer(ticks);
+	calc_ve = do_timer(ticks);
 	write_sequnlock(&jiffies_lock);
+	if (calc_ve)
+		calc_load_ve();
 	update_wall_time();
 }
-- 
2.28.0

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel