Re: [PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us
* Gautham R Shenoy [2020-09-03 14:57:27]: > From: "Gautham R. Shenoy" > > commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values > of the Extended CEDE states advertised by the platform. The values > advertised by the platform are in timebase ticks. However the cpuidle > framework requires the latency values in microseconds. > > If the tb-ticks value advertised by the platform correspond to a value > smaller than 1us, during the conversion from tb-ticks to microseconds, > in the current code, the result becomes zero. This is incorrect as it > puts a CEDE state on par with the snooze state. > > This patch fixes this by rounding up the result obtained while > converting the latency value from tb-ticks to microseconds. It also > prints a warning in case we discover an extended-cede state with > wakeup latency to be 0. In such a case, ensure that CEDE(0) has a > non-zero wakeup latency. > > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > CEDE(0)") > > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by > Joel Stanley) > Also added code to ensure that CEDE(0) has a non-zero wakeup > latency. 
> drivers/cpuidle/cpuidle-pseries.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index ff6d99e..a2b5c6f 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void) > for (i = 0; i < nr_xcede_records; i++) { > struct xcede_latency_record *record = &payload->records[i]; > u64 latency_tb = be64_to_cpu(record->latency_ticks); > - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; > + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), > NSEC_PER_USEC); This would fix the issue of rounding down to zero. > + > + if (latency_us == 0) > + pr_warn("cpuidle: xcede record %d has an unrealistic > latency of 0us.\n", i); +1 This should not happen. > if (latency_us < min_latency_us) > min_latency_us = latency_us; > @@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void) >* Perform the fix-up. >*/ > if (min_latency_us < dedicated_states[1].exit_latency) { > - u64 cede0_latency = min_latency_us - 1; > + /* > + * We set a minimum of 1us wakeup latency for cede0 to > + * distinguish it from snooze > + */ > + u64 cede0_latency = 1; > > - if (cede0_latency <= 0) > - cede0_latency = min_latency_us; > + if (min_latency_us > cede0_latency) > + cede0_latency = min_latency_us - 1; Good checks to expect cede latency of 1us or more. > dedicated_states[1].exit_latency = cede0_latency; > dedicated_states[1].target_residency = 10 * (cede0_latency); --Vaidy
Re: [PATCH] Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"
* Pratik Rajesh Sampat [2020-08-26 13:59:18]: > Cpuidle stop state implementation has minor optimizations for P10 > where hardware preserves more SPR registers compared to P9. > The current P9 driver works for P10, although does few extra > save-restores. P9 driver can provide the required power management > features like SMT thread folding and core level power savings > on a P10 platform. > > Until the P10 stop driver is available, revert the commit which > allows for only P9 systems to utilize cpuidle and blocks all > idle stop states for P10. > Cpu idle states are enabled and tested on the P10 platform > with this fix. > > This reverts commit 8747bf36f312356f8a295a0c39ff092d65ce75ae. > > Fixes: 8747bf36f312 ("powerpc/powernv/idle: Replace CPU feature check with > PVR check") > Signed-off-by: Pratik Rajesh Sampat Reviewed-by: Vaidyanathan Srinivasan > --- > @mpe: This revert would resolve a staging issue wherein the P10 stop > driver is not yet ready while cpuidle stop states need not be blocked > on 5.9 for Power10 systems which could cause SMT folding related > performance issues. > > The P10 stop driver is in the works here: > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html > > arch/powerpc/platforms/powernv/idle.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 77513a80cef9..345ab062b21a 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -1223,7 +1223,7 @@ static void __init pnv_probe_idle_states(void) > return; > } > > - if (pvr_version_is(PVR_POWER9)) > + if (cpu_has_feature(CPU_FTR_ARCH_300)) > pnv_power9_idle_init(); > > for (i = 0; i < nr_pnv_idle_states; i++) This revert solves the stated problem and makes kernel v5.9 work reasonably well on P10 with stop states, which are required for SMT mode changes.
The complete P10 driver has been in the works and will build on this fix to provide the required platform support and optimizations. --Vaidy
Re: [PATCH 5/5] cpuidle-pseries: Block Extended CEDE(1) which adds no additional value.
* Gautham R Shenoy [2020-07-07 16:41:39]: > From: "Gautham R. Shenoy" > > The Extended CEDE state with latency-hint = 1 is only different from > normal CEDE (with latency-hint = 0) in that a CPU in Extended CEDE(1) > does not wakeup on timer events. Both CEDE and Extended CEDE(1) map to > the same hardware idle state. Since we already get SMT folding from > the normal CEDE, the Extended CEDE(1) doesn't provide any additional > value. This patch blocks Extended CEDE(1). > > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > drivers/cpuidle/cpuidle-pseries.c | 57 > --- > 1 file changed, 54 insertions(+), 3 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index 6f893cd..be0b8b2 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -350,6 +350,43 @@ static int pseries_cpuidle_driver_init(void) > return 0; > } > > +#define XCEDE1_HINT 1 > +#define ERR_NO_VALUE_ADD (-1) > +#define ERR_NO_EE_WAKEUP (-2) > + > +/* > + * Returns 0 if the Extende CEDE state with @hint is not blocked in > + * cpuidle framework. > + * > + * Returns ERR_NO_EE_WAKEUP if the Extended CEDE state is blocked due > + * to not being responsive to external interrupts. > + * > + * Returns ERR_NO_VALUE_ADD if the Extended CEDE state does not provide > + * added value addition over the normal CEDE. > + */ > +static int cpuidle_xcede_blocked(u8 hint, u64 latency_us, u8 > responsive_to_irqs) > +{ > + > + /* > + * We will only allow extended CEDE states that are responsive > + * to irqs do not require an H_PROD to be woken up. > + */ > + if (!responsive_to_irqs) > + return ERR_NO_EE_WAKEUP; > + > + /* > + * We already obtain SMT folding benefits from CEDE (which is > + * CEDE with hint 0). Furthermore, CEDE is also responsive to > + * timer-events, while XCEDE1 requires an external > + * interrupt/H_PROD to be woken up. Hence, block XCEDE1 since > + * it adds no further value. 
> + */ > + if (hint == XCEDE1_HINT) > + return ERR_NO_VALUE_ADD; > + > + return 0; > +} > + > static int add_pseries_idle_states(void) > { > int nr_states = 2; /* By default we have snooze, CEDE */ > @@ -365,15 +402,29 @@ static int add_pseries_idle_states(void) > char name[CPUIDLE_NAME_LEN]; > unsigned int latency_hint = xcede_records[i].latency_hint; > u64 residency_us; > + int rc; > + > + if (latency_us < min_latency_us) > + min_latency_us = latency_us; > + > + rc = cpuidle_xcede_blocked(latency_hint, latency_us, > +xcede_records[i].responsive_to_irqs); > > - if (!xcede_records[i].responsive_to_irqs) { > + if (rc) { > + switch (rc) { > + case ERR_NO_VALUE_ADD: > + pr_info("cpuidle : Skipping XCEDE%d. No > additional value-add\n", > + latency_hint); > + break; > + case ERR_NO_EE_WAKEUP: > pr_info("cpuidle : Skipping XCEDE%d. Not responsive to > IRQs\n", > latency_hint); > + break; > + } > + > continue; > } > > - if (latency_us < min_latency_us) > - min_latency_us = latency_us; > snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint); > > /* We need these heuristics to select/reject idle states exposed by platform firmware to Linux primarily because not all states are really useful to Linux on a given setup. --Vaidy
Re: [PATCH 4/5] cpuidle-pseries : Include extended CEDE states in cpuidle framework
* Gautham R Shenoy [2020-07-07 16:41:38]: > From: "Gautham R. Shenoy" > > This patch exposes those extended CEDE states to the cpuidle framework > which are responsive to external interrupts and do not need an H_PROD. > > Since as per the PAPR, all the extended CEDE states are non-responsive > to timers, we indicate this to the cpuidle subsystem via the > CPUIDLE_FLAG_TIMER_STOP flag for all those extende CEDE states which > can wake up on external interrupts. > > With the patch, we are able to see the extended CEDE state with > latency hint = 1 exposed via the cpuidle framework. > > $ cpupower idle-info > CPUidle driver: pseries_idle > CPUidle governor: menu > analyzing CPU 0: > > Number of idle states: 3 > Available idle states: snooze CEDE XCEDE1 > snooze: > Flags/Description: snooze > Latency: 0 > Usage: 33429446 > Duration: 27006062 > CEDE: > Flags/Description: CEDE > Latency: 1 > Usage: 10272 > Duration: 110786770 > XCEDE1: > Flags/Description: XCEDE1 > Latency: 12 > Usage: 26445 > Duration: 1436433815 > > Benchmark results: > TLDR: Over all we do not see any additional benefit from having XCEDE1 over > CEDE. > > ebizzy : > 2 threads bound to a big-core. With this patch, we see a 3.39% > regression compared to with only CEDE0 latency fixup. > x With only CEDE0 latency fixup > * With CEDE0 latency fixup + CEDE1 > N Min MaxMedian AvgStddev > x 10 2893813 5834474 5832448 5327281.3 1055941.4 > * 10 2907329 5834923 5831398 5146614.6 1193874.8 > > context_switch2: > With the context_switch2 there are no observable regressions in the > results. > > context_switch2 CPU0 CPU1 (Same Big-core, different small-cores). > No difference with and without patch. > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500343644348778345444 345584.02 1035.1658 > * 500344310347646345776 345877.22 802.19501 > > context_switch2 CPU0 CPU8 (different big-cores). 
Minor 0.05% improvement > with patch > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500287562288756288162 288134.76 262.24328 > * 500287874288960288306 288274.66 187.57034 > > schbench: > No regressions observed with schbench > > Without Patch: > Latency percentiles (usec) > 50.0th: 29 > 75.0th: 40 > 90.0th: 50 > 95.0th: 61 > *99.0th: 13648 > 99.5th: 14768 > 99.9th: 15664 > min=0, max=29812 > > With Patch: > Latency percentiles (usec) > 50.0th: 30 > 75.0th: 40 > 90.0th: 51 > 95.0th: 59 > *99.0th: 13616 > 99.5th: 14512 > 99.9th: 15696 > min=0, max=15996 > > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > drivers/cpuidle/cpuidle-pseries.c | 50 > +++ > 1 file changed, 50 insertions(+) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index 502f906..6f893cd 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -362,9 +362,59 @@ static int add_pseries_idle_states(void) > for (i = 0; i < nr_xcede_records; i++) { > u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks; > u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; > + char name[CPUIDLE_NAME_LEN]; > + unsigned int latency_hint = xcede_records[i].latency_hint; > + u64 residency_us; > + > + if (!xcede_records[i].responsive_to_irqs) { > + pr_info("cpuidle : Skipping XCEDE%d. Not responsive to > IRQs\n", > + latency_hint); > + continue; > + } > > if (latency_us < min_latency_us) > min_latency_us = latency_us; > + snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint); > + > + /* > + * As per the section 14.14.1 of PAPR version 2.8.1 > + * says that alling H_CEDE with the value of the cede > + * latency specifier set greater than zero allows the > + * processor timer facility to b
Re: [PATCH 3/5] cpuidle-pseries : Fixup exit latency for CEDE(0)
* Gautham R Shenoy [2020-07-07 16:41:37]: > From: "Gautham R. Shenoy" > > We are currently assuming that CEDE(0) has exit latency 10us, since > there is no way for us to query from the platform. However, if the > wakeup latency of an Extended CEDE state is smaller than 10us, then we > can be sure that the exit latency of CEDE(0) cannot be more than that. > that. > > In this patch, we fix the exit latency of CEDE(0) if we discover an > Extended CEDE state with wakeup latency smaller than 10us. The new > value is 1us lesser than the smallest wakeup latency among the > Extended CEDE states. > > Benchmark results: > > ebizzy: > 2 ebizzy threads bound to the same big-core. 25% improvement in the > avg records/s with patch. > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 10 2491089 5834307 5398375 4244335 1596244.9 > * 10 2893813 5834474 5832448 5327281.3 1055941.4 > > context_switch2 : > There is no major regression observed with this patch as seen from the > context_switch2 benchmark. > > context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different > small cores). We observe a minor 0.14% regression in the number of > context-switches (higher is better). > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500348872362236354712 354745.69 2711.827 > * 500349422361452353942 354215.4 2576.9258 > > context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37% > improvement in the number of context-switches (higher is better). > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500287956294940288896 288977.23 646.59295 > * 500288300294646289582 290064.76 1161.9992 > > schbench: > No major difference could be seen until the 99.9th percentile. 
> > Without-patch > Latency percentiles (usec) > 50.0th: 29 > 75.0th: 39 > 90.0th: 49 > 95.0th: 59 > *99.0th: 13104 > 99.5th: 14672 > 99.9th: 15824 > min=0, max=17993 > > With-patch: > Latency percentiles (usec) > 50.0th: 29 > 75.0th: 40 > 90.0th: 50 > 95.0th: 61 > *99.0th: 13648 > 99.5th: 14768 > 99.9th: 15664 > min=0, max=29812 > > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > drivers/cpuidle/cpuidle-pseries.c | 34 -- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index c13549b..502f906 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -353,12 +353,42 @@ static int pseries_cpuidle_driver_init(void) > static int add_pseries_idle_states(void) > { > int nr_states = 2; /* By default we have snooze, CEDE */ > + int i; > + u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency > */ > > if (parse_cede_parameters()) > return nr_states; > > - pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n", > - nr_xcede_records); > + for (i = 0; i < nr_xcede_records; i++) { > + u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks; > + u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; > + > + if (latency_us < min_latency_us) > + min_latency_us = latency_us; > + } > + > + /* > + * We are currently assuming that CEDE(0) has exit latency > + * 10us, since there is no way for us to query from the > + * platform. > + * > + * However, if the wakeup latency of an Extended CEDE state is > + * smaller than 10us, then we can be sure that CEDE(0) > + * requires no more than that. > + * > + * Perform the fix-up. 
> + */ > + if (min_latency_us < dedicated_states[1].exit_latency) { > + u64 cede0_latency = min_latency_us - 1; > + > + if (cede0_latency <= 0) > + cede0_latency = min_latency_us; > + > + dedicated_states[1].exit_latency = cede0_latency; > + dedicated_states[1].target_residency = 10 * (cede0_latency); > + pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n", > + cede0_latency); > + } As per PAPR spec the CEDE hints are in increasing order of exit latency. Hence a given state's exit latency cannot exceed the one following it. The quirk is such that the first one (hint 0) is implicit and hence we have to use the above logic to extract its characteristics. --Vaidy
Re: [PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records
* Gautham R Shenoy [2020-07-07 16:41:36]: > From: "Gautham R. Shenoy" > > Currently we use CEDE with latency-hint 0 as the only other idle state > on a dedicated LPAR apart from the polling "snooze" state. > > The platform might support additional extended CEDE idle states, which > can be discovered through the "ibm,get-system-parameter" rtas-call > made with CEDE_LATENCY_TOKEN. > > This patch adds a function to obtain information about the extended > CEDE idle states from the platform and parse the contents to populate > an array of extended CEDE states. These idle states thus discovered > will be added to the cpuidle framework in the next patch. > > dmesg on a POWER9 LPAR, demonstrating the output of parsing the > extended CEDE latency parameters. > > [5.913180] xcede : xcede_record_size = 10 > [5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, > Wake-on-irq = 1 > [5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, > Wake-on-irq = 0 > [5.913193] cpuidle : Skipping the 2 Extended CEDE idle states > > Signed-off-by: Gautham R. 
Shenoy Reviewed-by: Vaidyanathan Srinivasan > > --- > drivers/cpuidle/cpuidle-pseries.c | 129 > +- > 1 file changed, 127 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index 39d4bb6..c13549b 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > > struct cpuidle_driver pseries_idle_driver = { > .name = "pseries_idle", > @@ -105,9 +106,120 @@ static void check_and_cede_processor(void) > } > } > > -#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */ > +struct xcede_latency_records { > + u8 latency_hint; > + u64 wakeup_latency_tb_ticks; > + u8 responsive_to_irqs; > +}; > + > +/* > + * XCEDE : Extended CEDE states discovered through the > + * "ibm,get-systems-parameter" rtas-call with the token > + * CEDE_LATENCY_TOKEN > + */ > +#define MAX_XCEDE_STATES 4 > +#define XCEDE_LATENCY_RECORD_SIZE 10 > +#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \ > + (MAX_XCEDE_STATES * > XCEDE_LATENCY_RECORD_SIZE)) > + > +#define CEDE_LATENCY_TOKEN 45 > + > +#define NR_CEDE_STATES (MAX_XCEDE_STATES + 1) /* CEDE with > latency-hint 0 */ > #define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */ > > +struct xcede_latency_records xcede_records[MAX_XCEDE_STATES]; > +unsigned int nr_xcede_records; > +char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH]; > + > +static int parse_cede_parameters(void) > +{ > + int ret = -1, i; > + u16 payload_length; > + u8 xcede_record_size; > + u32 total_xcede_records_size; > + char *payload; > + > + memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH); > + > + ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1, > + NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters), > + XCEDE_LATENCY_PARAM_MAX_LENGTH); > + > + if (ret) { > + pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n"); > + return ret; > + } > + > + payload_length = be16_to_cpu(*(__be16 
*)(&xcede_parameters[0])); > payload = &xcede_parameters[2]; > + > + /* > + * If the platform supports the cede latency settings > + * information system parameter it must provide the following > + * information in the NULL terminated parameter string: > + * > + * a. The first byte is the length "N" of each cede > + *latency setting record minus one (zero indicates a length > + *of 1 byte). > + * > + * b. For each supported cede latency setting a cede latency > + *setting record consisting of the first "N" bytes as per > + *the following table. > + * > + * - > + * | Field | Field | > + * | Name| Length | > + * - > + * | Cede Latency| 1 Byte | > + * | Specifier Value || > + * - > + * | Maximum wakeup || > + * | latency in | 8 Bytes| > + * | tb-ticks|| > + * - >
Re: [PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE
* Gautham R Shenoy [2020-07-07 16:41:35]: > From: "Gautham R. Shenoy" > > As per the PAPR, each H_CEDE call is associated with a latency-hint to > be passed in the VPA field "cede_latency_hint". The CEDE states that > we were implicitly entering so far is CEDE with latency-hint = 0. > > This patch explicitly sets the latency hint corresponding to the CEDE > state that we are currently entering. While at it, we save the > previous hint, to be restored once we wakeup from CEDE. This will be > required in the future when we expose extended-cede states through the > cpuidle framework, where each of them will have a different > cede-latency hint. > > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > drivers/cpuidle/cpuidle-pseries.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index 4a37252..39d4bb6 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -105,19 +105,27 @@ static void check_and_cede_processor(void) > } > } > > +#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */ > +#define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */ > + > +u8 cede_latency_hint[NR_DEDICATED_STATES]; > static int dedicated_cede_loop(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > int index) > { > + u8 old_latency_hint; > > pseries_idle_prolog(); > get_lppaca()->donate_dedicated_cpu = 1; > + old_latency_hint = get_lppaca()->cede_latency_hint; > + get_lppaca()->cede_latency_hint = cede_latency_hint[index]; > > HMT_medium(); > check_and_cede_processor(); > > local_irq_disable(); > get_lppaca()->donate_dedicated_cpu = 0; > + get_lppaca()->cede_latency_hint = old_latency_hint; > > pseries_idle_epilog(); > > @@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev, > /* > * States for dedicated partition case. 
> */ > -static struct cpuidle_state dedicated_states[] = { > +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = { > { /* Snooze */ > .name = "snooze", > .desc = "snooze", Saving and restoring the current cede hint value helps in maintaining compatibility with other parts of the kernel. Over the long term we can make the cpuidle driver deterministically set the CEDE hint at each invocation of the H_CEDE call so that we do not have to do multiple redundant save-restores. This is a reasonable start to cleaning up the cpuidle subsystem on PAPR guests. --Vaidy
Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
* Gautham R Shenoy [2020-07-07 16:41:34]: > From: "Gautham R. Shenoy" > > Hi, > > On pseries Dedicated Linux LPARs, apart from the polling snooze idle > state, we currently have the CEDE idle state which cedes the CPU to > the hypervisor with latency-hint = 0. > > However, the PowerVM hypervisor supports additional extended CEDE > states, which can be queried through the "ibm,get-systems-parameter" > rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these > extended CEDE states to appropriate platform idle-states in order to > provide energy-savings as well as shifting power to the active > units. On existing pseries LPARs today we have extended CEDE with > latency-hints {1,2} supported. > > In Patches 1-3 of this patchset, we add the code to parse the CEDE > latency records provided by the hypervisor. We use this information to > determine the wakeup latency of the regular CEDE (which we have been > so far hardcoding to 10us while experimentally it is much lesser ~ > 1us), by looking at the wakeup latency provided by the hypervisor for > Extended CEDE states. Since the platform currently advertises Extended > CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup > latency of the regular CEDE is no more than this. > > Patch 4 (currently marked as RFC), expose the extended CEDE states > parsed above to the cpuidle framework, provided that they can wakeup > on an interrupt. On current platforms only Extended CEDE 1 fits the > bill, but this is going to change in future platforms where even > Extended CEDE 2 may be responsive to external interrupts. > > Patch 5 (currently marked as RFC), filters out Extended CEDE 1 since > it offers no added advantage over the normal CEDE. > > With Patches 1-3, we see an improvement in the single-threaded > performance on ebizzy. > > 2 ebizzy threads bound to the same big-core. 25% improvement in the > avg records/s (higher the better) with patches 1-3. 
> x without_patches > * with_patches > N Min MaxMedian AvgStddev > x 10 2491089 5834307 5398375 4244335 1596244.9 > * 10 2893813 5834474 5832448 5327281.3 1055941.4 > > We do not observe any major regression in either the context_switch2 > benchmark or the schbench benchmark > > context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different > small cores). We observe a minor 0.14% regression in the number of > context-switches (higher is better). > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500348872362236354712 354745.69 2711.827 > * 500349422361452353942 354215.4 2576.9258 > > context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37% > improvement in the number of context-switches (higher is better). > x without_patch > * with_patch > N Min MaxMedian AvgStddev > x 500287956294940288896 288977.23 646.59295 > * 500288300294646289582 290064.76 1161.9992 > > schbench: > No major difference could be seen until the 99.9th percentile. > > Without-patch > Latency percentiles (usec) > 50.0th: 29 > 75.0th: 39 > 90.0th: 49 > 95.0th: 59 > *99.0th: 13104 > 99.5th: 14672 > 99.9th: 15824 > min=0, max=17993 > > With-patch: > Latency percentiles (usec) > 50.0th: 29 > 75.0th: 40 > 90.0th: 50 > 95.0th: 61 > *99.0th: 13648 > 99.5th: 14768 > 99.9th: 15664 > min=0, max=29812 This patch series mainly cleans up the CEDE latency discovery and prepares to add different cpuidle states in virtualised environments. This helps in improving SMT folding speeds as well as power savings and power shifting with newer platform firmware. The current benefit is primarily from faster SMT folding and the resulting single-thread performance, achieved by updating the platform-firmware-provided heuristics in the cpuidle states. --Vaidy
Re: [Skiboot] [PATCH v8 3/3] Self save API integration
* Pratik Rajesh Sampat [2020-04-23 16:24:38]: > The commit makes the self save API available outside the firmware by defining > an OPAL wrapper. > This wrapper has a similar interface to that of self restore and expects the > cpu pir, SPR number, minus the value of that SPR to be passed in its > parameters and returns OPAL_SUCCESS on success. It adds a device-tree > node signifying support for self-save after verifying the stop API > version compatibility. > > The commit also documents both the self-save and the self-restore API > calls along with their working and usage. > > Signed-off-by: Pratik Rajesh Sampat Reviewed-by: Vaidyanathan Srinivasan > --- > doc/opal-api/opal-slw-self-save-reg-181.rst | 51 ++ > doc/opal-api/opal-slw-set-reg-100.rst | 5 + > doc/power-management.rst| 48 + > hw/slw.c| 106 > include/opal-api.h | 3 +- > include/p9_stop_api.H | 18 > include/skiboot.h | 3 + > 7 files changed, 233 insertions(+), 1 deletion(-) > create mode 100644 doc/opal-api/opal-slw-self-save-reg-181.rst This patch enables an OPAL interface to call the stop-api and set self-save. It basically completes the infrastructure required to use the new self-save function provided by the microcode. --Vaidy
Re: [PATCH v8 2/3] API to verify the STOP API and image compatibility
* Pratik Rajesh Sampat [2020-04-23 16:24:37]: > From: Prem Shanker Jha > > Commit defines a new API primarily intended for OPAL to determine > cpu register save API's compatibility with HOMER layout and > self save restore. It can help OPAL determine if version of > API integrated with OPAL is different from hostboot. > > Change-Id: Ic0de45a336cfb8b6b6096a10ac1cd3ffbaa44fc0 > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77612 > Tested-by: FSP CI Jenkins > Tested-by: Jenkins Server > Tested-by: Hostboot CI > Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA > Reviewed-by: Gregory S Still > Reviewed-by: Jennifer A Stofer > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77614 > Tested-by: Jenkins OP Build CI > Tested-by: Jenkins OP HW > Reviewed-by: Daniel M Crowell > Signed-off-by: Pratik Rajesh Sampat Reviewed-by: Vaidyanathan Srinivasan > --- > include/p9_stop_api.H| 25 ++ > libpore/p9_cpu_reg_restore_instruction.H | 7 ++- > libpore/p9_hcd_memmap_base.H | 7 +++ > libpore/p9_stop_api.C| 58 +++- > libpore/p9_stop_api.H| 26 ++- > libpore/p9_stop_util.H | 20 > 6 files changed, 130 insertions(+), 13 deletions(-) This stop-api code will help OPAL check and use self-save functions so that different versions of OPAL can be loaded and run with different versions of low level firmware stack. --Vaidy
Re: [Skiboot] [PATCH v8 1/3] Self Save: Introducing Support for SPR Self Save
* Pratik Rajesh Sampat [2020-04-23 16:24:36]: > From: Prem Shanker Jha > > The commit is a merger of commits that makes the following changes: > 1. Commit fixes some issues with code found during integration test > - replacement of addi with xor instruction during self save API. > - fixing instruction generation for MFMSR during self save > - data struct updates in STOP API > - error RC updates for hcode image build > - HOMER parser updates. > - removed self save support for URMOR and HRMOR > - code changes for compilation with OPAL > - populating CME Image header with unsecure HOMER address. > > Key_Cronus_Test=PM_REGRESS > > Change-Id: I7cedcc466267c4245255d8d75c01ed695e316720 > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66580 > Tested-by: FSP CI Jenkins > Tested-by: HWSV CI > Tested-by: PPE CI > Tested-by: Jenkins Server > Tested-by: Cronus HW CI > Tested-by: Hostboot CI > Reviewed-by: Gregory S. Still > Reviewed-by: RAHUL BATRA > Reviewed-by: Jennifer A. Stofer > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66587 > Reviewed-by: Christian R. Geddes > Signed-off-by: Prem Shanker Jha > Signed-off-by: Akshay Adiga > Signed-off-by: Pratik Rajesh Sampat Reviewed-by: Vaidyanathan Srinivasan > 2. The commit also incorporates changes that make STOP API project > agnostic changes include defining wrapper functions which call legacy > API. It also adds duplicate enum members which start with prefix PROC > instead of P9. > > Key_Cronus_Test=PM_REGRESS > > Change-Id: If87970f3e8cf9b507f33eb1be249e03eb3836a5e > RTC: 201128 > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71307 > Tested-by: FSP CI Jenkins > Tested-by: Jenkins Server > Tested-by: Hostboot CI > Tested-by: Cronus HW CI > Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA > Reviewed-by: Gregory S. Still > Reviewed-by: Jennifer A Stofer > Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71314 > Tested-by: Jenkins OP Build CI > Tested-by: Jenkins OP HW > Reviewed-by: Daniel M. 
Crowell > Signed-off-by: Prem Shanker Jha > Signed-off-by: Pratik Rajesh Sampat > --- > include/p9_stop_api.H| 79 +- > libpore/p9_cpu_reg_restore_instruction.H | 4 + > libpore/p9_stop_api.C| 954 +-- > libpore/p9_stop_api.H| 115 ++- > libpore/p9_stop_data_struct.H| 4 +- > libpore/p9_stop_util.H | 7 +- > 6 files changed, 721 insertions(+), 442 deletions(-) These code changes are copied as-is from the hcode component into the OPAL project for integration and use of the stop-api. The hcode component, in cooperation with hostboot, is loaded into memory before OPAL is loaded. This code allows runtime changes and use of the various power-management save/restore functions. This patch specifically enables the self-save feature of the microcode. --Vaidy
Re: [PATCH] pseries/energy: Use OF accessor functions to read ibm,drc-indexes
* Gautham R Shenoy [2019-03-08 21:03:24]: > From: "Gautham R. Shenoy" > > In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent, > we currently use of_read_property() to obtain the pointer to the array > corresponding to the property "ibm,drc-indexes". The elements of this > array are of type __be32, but are accessed without any conversion to > the OS-endianness, which is buggy on a Little Endian OS. > > Fix this by using of_property_read_u32_index() accessor function to > safely read the elements of the array. > > Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU > indexes") > Cc: #v4.16+ > Reported-by: Pavithra R. Prakash > Signed-off-by: Gautham R. Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > arch/powerpc/platforms/pseries/pseries_energy.c | 27 > - > 1 file changed, 18 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c > b/arch/powerpc/platforms/pseries/pseries_energy.c > index 6ed2212..1c4d1ba 100644 > --- a/arch/powerpc/platforms/pseries/pseries_energy.c > +++ b/arch/powerpc/platforms/pseries/pseries_energy.c > @@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu) > > ret = drc.drc_index_start + (thread_index * drc.sequential_inc); > } else { > - const __be32 *indexes; > - > - indexes = of_get_property(dn, "ibm,drc-indexes", NULL); > - if (indexes == NULL) > - goto err_of_node_put; > + u32 nr_drc_indexes, thread_drc_index; > > /* > - * The first element indexes[0] is the number of drc_indexes > - * returned in the list. Hence thread_index+1 will get the > - * drc_index corresponding to core number thread_index. > + * The first element of ibm,drc-indexes array is the > + * number of drc_indexes returned in the list. Hence > + * thread_index+1 will get the drc_index corresponding > + * to core number thread_index. 
>*/ > - ret = indexes[thread_index + 1]; > + rc = of_property_read_u32_index(dn, "ibm,drc-indexes", > + 0, &nr_drc_indexes); > + if (rc) > + goto err_of_node_put; > + > + WARN_ON(thread_index > nr_drc_indexes); > + rc = of_property_read_u32_index(dn, "ibm,drc-indexes", > + thread_index + 1, > + &thread_drc_index); > + if (rc) > + goto err_of_node_put; > + > + ret = thread_drc_index; Oops! Good bugfix. We should use device tree accessors like this in all places for correct and safe code. Thanks! --Vaidy
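The endianness bug fixed above can be reproduced in userspace: device-tree cells are stored big-endian, so reading one as a plain integer on a little-endian host yields garbage. A minimal sketch of the before/after read, using `ntohl` as a stand-in for the kernel's `be32_to_cpu` and a small array as a hypothetical model of the "ibm,drc-indexes" property layout:

```c
#include <arpa/inet.h>  /* ntohl/htonl: userspace stand-ins for be32_to_cpu/cpu_to_be32 */
#include <stdint.h>

/*
 * Model of "ibm,drc-indexes": cell 0 holds the element count,
 * cells 1..n hold the DRC indexes, all stored big-endian.
 */

/* Buggy read: returns the raw big-endian cell, wrong on a little-endian host. */
static uint32_t drc_index_raw(const uint32_t *cells, int thread_index)
{
    return cells[thread_index + 1];
}

/* Fixed read: convert the big-endian cell to host byte order first,
 * which is what of_property_read_u32_index() does internally. */
static uint32_t drc_index_fixed(const uint32_t *cells, int thread_index)
{
    return ntohl(cells[thread_index + 1]);
}
```

On a big-endian host both readers happen to agree, which is why the bug only shows up on a little-endian OS.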
Re: [PATCH] cpuidle:powernv: Add the CPUIDLE_FLAG_POLLING for snooze
* Gautham R Shenoy [2018-07-03 10:54:16]: > From: "Gautham R. Shenoy" > > In the situations where snooze is the only cpuidle state due to > firmware not exposing any platform idle states, the idle CPUs will > remain in snooze for a long time with interrupts disabled causing the > Hard-lockup detector to complain. snooze_loop() will spin in SMT low priority with interrupt enabled. We have local_irq_enable() before we get into the snooze loop. Since this is a polling state, we should wakeup without an interrupt and hence we set TIF_POLLING_NRFLAG as well. > watchdog: CPU 51 detected hard LOCKUP on other CPUs 59 > watchdog: CPU 51 TB:535296107736, last SMP heartbeat TB:527472229239 (15281ms > ago) > watchdog: CPU 59 Hard LOCKUP > watchdog: CPU 59 TB:535296252849, last heartbeat TB:526554725466 (17073ms ago) hmm.. not sure why watchdog will complain, maybe something more is going on. > Fix this by adding CPUIDLE_FLAG_POLLING flag to the state, so that the > cpuidle governor will do the right thing, such as not stopping the > tick if it is going to put the idle cpu to snooze. > > Reported-by: Akshay Adiga > Cc: Nicholas Piggin > Signed-off-by: Gautham R. Shenoy > --- > drivers/cpuidle/cpuidle-powernv.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index d29e4f0..b73041b 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -156,6 +156,7 @@ static int stop_loop(struct cpuidle_device *dev, > { /* Snooze */ > .name = "snooze", > .desc = "snooze", > + .flags = CPUIDLE_FLAG_POLLING, > .exit_latency = 0, > .target_residency = 0, > .enter = snooze_loop }, Adding the CPUIDLE_FLAG_POLLING is good and enables more optimization. But the reason that we spin with interrupt disabled does not seem right. --Vaidy
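For context, a polling idle state in this style spins on the need-resched flag with interrupts enabled, so the wakeup needs no IPI; advertising that property is exactly what CPUIDLE_FLAG_POLLING does for the governor. A simplified userspace model of such a loop (the flag and the iteration bound are illustrative, not the kernel's TIF_POLLING_NRFLAG machinery):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the task's TIF_NEED_RESCHED bit. */
static atomic_bool need_resched_flag;

/*
 * Polling idle loop: spin (conceptually at low SMT priority) until
 * someone sets need_resched. Returns true if woken by the flag,
 * false if the iteration budget ran out. No interrupt is required
 * for the wakeup -- that is the property CPUIDLE_FLAG_POLLING
 * advertises to the cpuidle governor.
 */
static bool snooze_poll(long max_iters)
{
    for (long i = 0; i < max_iters; i++) {
        if (atomic_load(&need_resched_flag))
            return true;
    }
    return false;
}
```

Because the wakeup path is a plain store to the flag, a governor that knows the state is polling can avoid stopping the tick before entering it.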
Re: [PATCH V3] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt
* Shilpa Bhat <shilpa.b...@linux.vnet.ibm.com> [2018-04-25 16:29:31]: > gpstate_timer_handler() uses synchronous smp_call to set the pstate > on the requested core. This causes the below hard lockup: > > [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 > (unreliable) > [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250 > [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580 > [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0 > [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0 > [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270 > [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4 > [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120 > [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0 > [c03fe566b760] [c0009014] decrementer_common+0x114/0x120 > -- interrupt: 901 at doorbell_global_ipi+0x34/0x50 > LR = arch_send_call_function_ipi_mask+0x120/0x130 > [c03fe566ba50] [c004876c] > arch_send_call_function_ipi_mask+0x4c/0x130 > [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450 > [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0 > [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270 > [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40 > [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340 > [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350 > [c03fe566be30] [c000b184] system_call+0x58/0x6c > > One way to avoid this is removing the smp-call. We can ensure that the timer > always runs on one of the policy-cpus. If the timer gets migrated to a > cpu outside the policy then re-queue it back on the policy->cpus. This way > we can get rid of the smp-call which was being used to set the pstate > on the policy->cpus. 
> > Fixes: 7bc54b652f13 (timers, cpufreq/powernv: Initialize the gpstate timer as > pinned) > Cc: <sta...@vger.kernel.org>[4.8+] > Reported-by: Nicholas Piggin <npig...@gmail.com> > Reported-by: Pridhiviraj Paidipeddi <ppaid...@linux.vnet.ibm.com> > Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> > --- > Changes from V2: > - Remove the check for active policy while requeing the migrated timer > Changes from V1: > - Remove smp_call in the pstate handler. > > drivers/cpufreq/powernv-cpufreq.c | 14 +++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/cpufreq/powernv-cpufreq.c > b/drivers/cpufreq/powernv-cpufreq.c > index 71f8682..e368e1f 100644 > --- a/drivers/cpufreq/powernv-cpufreq.c > +++ b/drivers/cpufreq/powernv-cpufreq.c > @@ -679,6 +679,16 @@ void gpstate_timer_handler(struct timer_list *t) > > if (!spin_trylock(&gpstates->gpstate_lock)) > return; > + /* > + * If the timer has migrated to the different cpu then bring > + * it back to one of the policy->cpus > + */ > + if (!cpumask_test_cpu(raw_smp_processor_id(), policy->cpus)) { > + gpstates->timer.expires = jiffies + msecs_to_jiffies(1); > + add_timer_on(&gpstates->timer, cpumask_first(policy->cpus)); > + spin_unlock(&gpstates->gpstate_lock); > + return; > + } > > /* >* If PMCR was last updated was using fast_swtich then > @@ -718,10 +728,8 @@ void gpstate_timer_handler(struct timer_list *t) > if (gpstate_idx != gpstates->last_lpstate_idx) > queue_gpstate_timer(gpstates); > > + set_pstate(&freq_data); > spin_unlock(&gpstates->gpstate_lock); > - > - /* Timer may get migrated to a different cpu on cpu hot unplug */ > - smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1); > } Fix looks good. Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
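The fix replaces the cross-CPU smp_call with a requeue check: if the handler finds itself running outside policy->cpus, it re-arms the timer on a policy CPU and bails out. The decision itself is simple; a userspace sketch with the cpumask modeled as a 64-bit mask (the names and the bitmask representation are illustrative, not the kernel's cpumask API):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model policy->cpus as a 64-bit mask. The handler may only call
 * set_pstate() directly when running on a CPU inside the policy;
 * on any other CPU it must requeue the timer onto a policy CPU.
 */
static bool must_requeue(unsigned int cur_cpu, uint64_t policy_mask)
{
    return !(policy_mask & (1ULL << cur_cpu));
}

/* First CPU set in the mask, i.e. the requeue target (cf. cpumask_first). */
static unsigned int first_policy_cpu(uint64_t policy_mask)
{
    unsigned int cpu = 0;

    while (cpu < 64 && !(policy_mask & (1ULL << cpu)))
        cpu++;
    return cpu;
}
```

With this guard, the pstate write always happens on a policy CPU from timer context, so no synchronous IPI from softirq context is needed.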
Re: [RESEND][PATCH] cpuidle/powernv : Restore different PSSCR for idle and hotplug
* Benjamin Herrenschmidt[2018-03-01 08:40:22]: > On Thu, 2018-03-01 at 01:03 +0530, Akshay Adiga wrote: > > commit 1e1601b38e6e ("powerpc/powernv/idle: Restore SPRs for deep idle > > states via stop API.") uses stop-api provided by the firmware to restore > > PSSCR. PSSCR restore is required for handling special wakeup. When special > > wakeup is completed, the core enters stop state based on restored PSSCR. > > > > Currently PSSCR is restored to deepest available stop state, causing > > a idle cpu to enter deeper stop state on a special wakeup, which causes > > the cpu to hang on wakeup. > > > > A "sensors" command which reads temperature (through DTS sensors) on idle > > cpu can trigger special wakeup. > > > > Failed Scenario : > > Request restore of PSSCR with RL = 11 > > cpu enters idle state (stop5) > > user triggers "sensors" command > >Assert special wakeup on cpu > > Restores PSSCR with RL = 11 < Done by firmware > > Read DTS sensor > >Deassert special wakeup > > cpu enters idle state (stop11) <-- Instead of stop5 > > > > Cpu hang is caused because cpu ended up in a deeper state than it requested > > > > This patch fixes instability caused by special wakeup when stop11 is > > enabled. Requests restore of PSSCR to deepest stop state used by cpuidle. > > Only when offlining cpu, request restore of PSSCR to deepest stop state. > > On onlining cpu, request restore of PSSCR to deepest stop state used by > > cpuidle. > > So if we chose a stop state, but somebody else does a special wakeup, > we'll end up going back into a *deeper* one than the one we came from ? Unfortunately yes. This is the current limitation. If we are in stop4 and above and we had not set a PSSCR to be restored, then the hardware will default to all bits set (stop15) leading to entry into stop11 after the special wakeup is removed. The requirement is such that we need to have a correct PSSCR restore value set using stop-api. 
We need to set a restore PSSCR value that represents one state in a group: stop4,5,6,7 have identical state loss, so we can set a PSSCR RL of 4, 5 or 7 for entry into any state in that group and not have to use the stop-api to set the exact value of stop4 or stop5 at every entry. > I still think this is broken by design. The chip should automatically > go back to the state it went to after special wakeup, thus the PPE > controlling the state should override the PSSCR value accordingly > rather than relying on those SW hoops. Special wakeup de-assertion and re-entry hits this limitation: we have lost the original content of the PSSCR SPR, and hence the CME does not know what was requested. Additional stop-api calls from software could have been avoided, but in practice only the cpu hotplug case uses stop11 and needs this stop-api call. We can default the system to stop4 or stop5, make a stop-api call to explicitly set stop11, and then hotplug out the cpu. We have to restore the deepest cpuidle state (stop4/5) during online. As such this is not much software overhead. But we need an elegant method to control these calls from OPAL flags so that kernel behaviour can be controlled more closely. If we want to use stop11 in cpuidle (despite it being very slow) for evaluation purposes, then we will need to make more stop-api calls to choose between stop4/5 and stop11, since they belong to different groups. Even in this case, since stop11 is the slow path, we would want to set stop11 before entry and restore stop4/5 after wakeup. This way we still completely avoid stop-api calls in the fast-path stop4/5 entry/exit. --Vaidy
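The policy described above reduces to: keep the PSSCR restore value at the deepest cpuidle group (stop4/5) on the fast path, and switch it to stop11 only around a hotplug offline, restoring it on online. A toy model of that selection (the RL numbers and the offlining flag are illustrative, not the actual stop-api interface):

```c
/*
 * Illustrative requested-level (RL) values: stop4/5 share a state-loss
 * group used by cpuidle; stop11 is the deepest state, used only for
 * CPU hotplug.
 */
enum { RL_CPUIDLE_DEEPEST = 5, RL_HOTPLUG_DEEPEST = 11 };

/*
 * PSSCR restore value to program via the stop-api: stop11 only while
 * the CPU is being offlined, otherwise the deepest cpuidle group, so
 * that a special wakeup never re-enters a deeper state than cpuidle
 * requested.
 */
static int psscr_restore_rl(int offlining)
{
    return offlining ? RL_HOTPLUG_DEEPEST : RL_CPUIDLE_DEEPEST;
}
```

The point of the grouping is that the restore value only needs updating at the (rare) hotplug transitions, never in the stop4/5 entry/exit fast path.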
Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:28:31]: > On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan > <sva...@linux.vnet.ibm.com> wrote: > > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). > > On PowerNV platform cpu_present could be less than cpu_possible in cases > > where firmware detects the cpu, but it is not available to the OS. When > > CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence > > we skip creating cpu_device. > > > > This breaks cpuidle on powernv where register_cpu() is not called for > > cpus in cpu_possible_mask that cannot be hot-added at runtime. > > > > Trying cpuidle_register_device() on cpu without cpu_device will cause > > crash like this: > > > > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490] > > pc: c022c8bc: string+0x34/0x60 > > lr: c022ed78: vsnprintf+0x284/0x42c > > sp: c00ff1503710 > >msr: 90009033 > >dar: 60006000 > > current = 0xc00ff148 > > paca= 0xcfe82d00 softe: 0irq_happened: 0x01 > > pid = 1, comm = swapper/8 > > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 > > (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017 > > enter ? 
for help > > [link register ] c022ed78 vsnprintf+0x284/0x42c > > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable) > > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44 > > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc > > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74 > > [c00ff15038c0] c0619694 printk+0x38/0x4c > > [c00ff15038e0] c0224950 kobject_get+0x40/0x60 > > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4 > > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78 > > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0 > > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c > > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc > > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0 > > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c > > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c > > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c > > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78 > > > > This patch fixes the bug by passing correct cpumask from > > powernv-cpuidle driver. > > > > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > > That needs to be ACKed by someone familiar with powernv. Previous version at https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155587.html I had not CCed linux-pm in the first post. Michael and Mikey have reviewed the previous version. Let me get an ack for you to proceed with the merge. Thanks, Vaidy
Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:27:31]: > On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan > <sva...@linux.vnet.ibm.com> wrote: > > If a given cpu is not in cpu_present and cpu hotplug > > is disabled, arch can skip setting up the cpu_dev. > > > > Arch cpuidle driver should pass correct cpu mask > > for registration, but failing to do so by the driver > > causes error to propagate and crash like this: > > > > [ 30.076045] Unable to handle kernel paging request for > > data at address 0x0048 > > [ 30.076100] Faulting instruction address: 0xc07b2f30 > > cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670] > > pc: c07b2f30: kobject_get+0x20/0x70 > > lr: c07b3c94: kobject_add_internal+0x54/0x3f0 > > sp: c03feb18b8f0 > >msr: 90009033 > >dar: 48 > > dsisr: 4000 > > current = 0xc03fd2ed8300 > > paca= 0xcfbab500 softe: 0irq_happened: 0x01 > > pid = 1, comm = swapper/0 > > Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0 > > 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017 > > enter ? for help > > [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0 > > [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0 > > [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130 > > [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0 > > [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120 > > [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4 > > [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0 > > [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360 > > [c03feb18bdc0] c000d864 kernel_init+0x24/0x160 > > [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74 > > > > Validating cpu_dev fixes the crash and reports correct error message like: > > > > [ 30.163506] Failed to register cpuidle device for cpu136 > > [ 30.173329] Registration of powernv driver failed. 
> > > > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> > > The previous version is in linux-next already and I'm going to push it > for merging shortly. Thanks Rafael. The previous version is good for merge. --Vaidy
[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
If a given cpu is not in cpu_present and cpu hotplug is disabled, arch can skip setting up the cpu_dev. Arch cpuidle driver should pass correct cpu mask for registration, but failing to do so by the driver causes error to propagate and crash like this: [ 30.076045] Unable to handle kernel paging request for data at address 0x0048 [ 30.076100] Faulting instruction address: 0xc07b2f30 cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670] pc: c07b2f30: kobject_get+0x20/0x70 lr: c07b3c94: kobject_add_internal+0x54/0x3f0 sp: c03feb18b8f0 msr: 90009033 dar: 48 dsisr: 4000 current = 0xc03fd2ed8300 paca= 0xcfbab500 softe: 0irq_happened: 0x01 pid = 1, comm = swapper/0 Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017 enter ? for help [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0 [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0 [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130 [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0 [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120 [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4 [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0 [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360 [c03feb18bdc0] c000d864 kernel_init+0x24/0x160 [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74 Validating cpu_dev fixes the crash and reports correct error message like: [ 30.163506] Failed to register cpuidle device for cpu136 [ 30.173329] Registration of powernv driver failed. 
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com> --- drivers/cpuidle/sysfs.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c index c5adc8c..f2c3bce 100644 --- a/drivers/cpuidle/sysfs.c +++ b/drivers/cpuidle/sysfs.c @@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev) struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu); int error; + /* +* Return error if cpu_device is not setup for this cpu. This +* could happen if arch did not setup cpu_device since this +* cpu is not in cpu_present mask and the driver did not send +* correct cpu mask at registration. Without this check we +* would end up passing bogus value for &cpu_dev->kobj in +* kobject_init_and_add(). +*/ + + if (!cpu_dev) + return -ENODEV; + kdev = kzalloc(sizeof(*kdev), GFP_KERNEL); if (!kdev) return -ENOMEM; -- 2.9.3
[PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init(). On the PowerNV platform cpu_present could be less than cpu_possible in cases where the firmware detects a cpu, but it is not available to the OS. When CONFIG_HOTPLUG_CPU=n, such cpus are not hotpluggable at runtime and hence we skip creating the cpu_device.

This breaks cpuidle on powernv, where register_cpu() is not called for cpus in cpu_possible_mask that cannot be hot-added at runtime. Trying cpuidle_register_device() on a cpu without a cpu_device causes a crash like this:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
    pc: c022c8bc: string+0x34/0x60
    lr: c022ed78: vsnprintf+0x284/0x42c
    sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca    = 0xcfe82d00
  softe: 0  irq_happened: 0x01
    pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4 (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the bug by passing the correct cpumask from the powernv-cpuidle driver.
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index a06df51..82f7b33 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
 		drv->state_count += 1;
 	}
 
+	/*
+	 * On the PowerNV platform cpu_present may be less than cpu_possible
+	 * in cases where the firmware detects the cpu, but it is not
+	 * available to the OS. If CONFIG_HOTPLUG_CPU=n, then such CPUs are
+	 * not hotpluggable at runtime and hence cpu_devices are not created
+	 * for those cpus by the generic topology_init().
+	 *
+	 * drv->cpumask defaults to cpu_possible_mask in
+	 * __cpuidle_driver_init(). This breaks cpuidle on powernv where
+	 * cpu_devices are not created for cpus in cpu_possible_mask that
+	 * cannot be hot-added later at runtime.
+	 *
+	 * Trying cpuidle_register_device() on a cpu without cpu_devices is
+	 * incorrect. Hence pass the correct cpu mask to the generic cpuidle
+	 * driver.
+	 */
+
+	drv->cpumask = (struct cpumask *)cpu_present_mask;
+
 	return 0;
 }
-- 
2.9.3
[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs
If a given cpu is not in cpu_present and cpu hotplug is disabled, the arch can skip setting up the cpu_dev. The arch cpuidle driver should pass the correct cpu mask for registration, but failing to do so by the driver causes the error to propagate and crash like this:

[   30.076045] Unable to handle kernel paging request for data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
    pc: c07b2f30: kobject_get+0x20/0x70
    lr: c07b3c94: kobject_add_internal+0x54/0x3f0
    sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca    = 0xcfbab500
  softe: 0  irq_happened: 0x01
    pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports a correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..f2c3bce 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
 	struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
 	int error;
 
+	/*
+	 * Return error if cpu_device is not setup for this cpu. This
+	 * could happen if the arch did not setup cpu_device since this
+	 * cpu is not in cpu_present mask and the driver did not send
+	 * a correct cpu mask at registration. Without this check we
+	 * would end up passing a bogus value for &cpu_dev->kobj in
+	 * kobject_init_and_add().
+	 */
+
+	if (!cpu_dev)
+		return -ENODEV;
+
 	kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
 	if (!kdev)
 		return -ENOMEM;
-- 
2.9.3
[PATCH v2 0/2] cpuidle: Fixes in cpuidle driver
When CONFIG_HOTPLUG_CPU=n and cpu_present is less than cpu_possible, cpuidle-powernv not passing an explicit drv->cpumask allows the generic cpuidle driver to try to create sysfs objects for cpus that do not have cpu_devices created by register_cpu(). This caused the kernel to access an incorrect address and crash.

The following patch series fixes the cpuidle-powernv driver and also adds additional checks in cpuidle_add_sysfs().

This patch set is against v4.11-rc3.

Changed from v1: Updated commit message and comments.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
Re: [PATCH 4.4 54/58] powerpc: Convert cmp to cmpd in idle enter sequence
* Greg Kroah-Hartman <gre...@linuxfoundation.org> [2017-01-06 22:44:39]:

> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> --
>
> From: Segher Boessenkool <seg...@kernel.crashing.org>
>
> commit 80f23935cadb1c654e81951f5a8b7ceae0acc1b4 upstream.
>
> PowerPC's "cmp" instruction has four operands. Normally people write
> "cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
> people forget, and write "cmp" with just three operands.
>
> With older binutils this is silently accepted as if this was "cmpw",
> while often "cmpd" is wanted. With newer binutils GAS will complain
> about this for 64-bit code. For 32-bit code it still silently assumes
> "cmpw" is what is meant.
>
> In this instance the code comes directly from ISA v2.07, including the
> cmp, but cmpd is correct. Backport to stable so that new toolchains can
> build old kernels.
>
> Fixes: 948cf67c4726 ("powerpc: Add NAP mode support on Power7 in HV mode")
> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> Signed-off-by: Segher Boessenkool <seg...@kernel.crashing.org>
> Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
> Signed-off-by: Joel Stanley <j...@jms.id.au>
> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

The change from cmp to cmpd is correct. This code will execute in 64-bit mode only and the fix applies to stable as described above.

> ---
>  arch/powerpc/kernel/idle_power7.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -44,7 +44,7 @@
>  	std	r0,0(r1);	\
>  	ptesync;	\
>  	ld	r0,0(r1);	\
> -1:	cmp	cr0,r0,r0;	\
> +1:	cmpd	cr0,r0,r0;	\
>  	bne	1b;	\
>  	IDLE_INST;	\
>  	b	.
Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING
* Thomas Gleixner <t...@linutronix.de> [2016-07-12 18:33:56]:

> Anton,
>
> On Tue, 12 Jul 2016, Anton Blanchard wrote:
> > > It really does not matter when we fold the load for the outgoing cpu.
> > > It's almost dead anyway, so there is no harm if we fail to fold the
> > > few microseconds which are required for going fully away.
> >
> > We are seeing the load average shoot up when hot unplugging CPUs (+1
> > for every CPU we offline) on ppc64. This reproduces on bare metal as
> > well as inside a KVM guest. A bisect points at this commit.
> >
> > As an example, a completely idle box with 128 CPUs and 112 hot
> > unplugged:
> >
> > # uptime
> >  04:35:30 up 1:23, 2 users, load average: 112.43, 122.94, 125.54
>
> Yes, it's an off by one as we now call that from the task which is tearing
> down the cpu. Does the patch below fix it?

Hi Thomas,

Yes, this patch fixes the issue. I was able to recreate the problem and also verify with this patch on 4.7.0-rc7.

> Thanks,
>
> 	tglx
>
> 8<--
>
> Subject: sched/migration: Correct off by one in load migration
> From: Thomas Gleixner <t...@linutronix.de>
>
> The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
> account that the function is now called from a thread running on the outgoing
> CPU. As a result a cpu unplug leaks a load of 1 into the global load
> accounting mechanism.
>
> Fix it by adjusting for the currently running thread which calls
> calc_load_migrate().
>
> Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into CPU_DYING"
> Reported-by: Anton Blanchard <an...@samba.org>

Tested-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> Signed-off-by: Thomas Gleixner <t...@linutronix.de>
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51d7105f529a..97ee9ac7e97c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5394,13 +5394,15 @@ void idle_task_exit(void)
>  /*
>   * Since this CPU is going 'away' for a while, fold any nr_active delta
>   * we might have. Assumes we're called after migrate_tasks() so that the
> - * nr_active count is stable.
> + * nr_active count is stable. We need to take the teardown thread which
> + * is calling this into account, so we hand in adjust = 1 to the load
> + * calculation.
>   *
>   * Also see the comment "Global load-average calculations".
>   */
>  static void calc_load_migrate(struct rq *rq)
>  {
> -	long delta = calc_load_fold_active(rq);
> +	long delta = calc_load_fold_active(rq, 1);
>
>  	if (delta)
>  		atomic_long_add(delta, &calc_load_tasks);
>  }
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index b0b93fd33af9..a2d6eb71f06b 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
>  	loads[2] = (avenrun[2] + offset) << shift;
>  }
>
> -long calc_load_fold_active(struct rq *this_rq)
> +long calc_load_fold_active(struct rq *this_rq, long adjust)
>  {
>  	long nr_active, delta = 0;
>
> -	nr_active = this_rq->nr_running;
> +	nr_active = this_rq->nr_running - adjust;
>  	nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;

Does the above calculation hold good even if we send adjust=1 and bump down nr_active? Tested ok though :)

--Vaidy
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
* Benjamin Herrenschmidt <b...@au1.ibm.com> [2015-05-30 20:38:22]:

> On Sat, 2015-05-30 at 11:31 +0530, Vaidyanathan Srinivasan wrote:
> > In shared lpar case, spinning in guest context may potentially take
> > away cycles from other lpars waiting to run on the same physical cpu.
> >
> > So the policy in shared lpar case is to let PowerVM hypervisor know
> > immediately that the guest cpu is idle which will allow the hypervisor
> > to use the cycles for other tasks/lpars.
>
> But that will have negative side effects under KVM no ?

Yes, you have a good point. If one of the threads in the core goes to cede, it can still come back quickly since the KVM guest context is not switched yet. But in a single threaded guest, this can force an unnecessary exit/context switch overhead.

Now that we have fixed the snooze loop to be bounded and exit predictably, a KVM guest should actually use the snooze state to improve latency. I will test this scenario and enable the snooze state for KVM guests.

> Suresh mentioned something with his new directed interrupts code that we
> had many cases where the interrupts ended up arriving shortly after we
> exited to host for NAP'ing ...
>
> Snooze might fix it...

Right. This scenario is worth experimenting with, and then introducing the snooze loop for guests.

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency
* Preeti U Murthy <pre...@linux.vnet.ibm.com> [2015-05-29 19:17:17]:

[snip]

> > +	if (max_idle_state > 1) {
> > +		snooze_timeout_en = true;
> > +		snooze_timeout = cpuidle_state_table[1].target_residency *
> > +				 tb_ticks_per_usec;
> > +	}
>
> Any idea why we don't have snooze defined on the shared lpar configuration ?

In the shared lpar case, spinning in guest context may potentially take away cycles from other lpars waiting to run on the same physical cpu.

So the policy in the shared lpar case is to let the PowerVM hypervisor know immediately that the guest cpu is idle, which will allow the hypervisor to use the cycles for other tasks/lpars.

--Vaidy
[tip:sched/core] sched: Fix asymmetric scheduling for POWER7
Commit-ID:  2042abe7977222ef606306faa2dce8fd51e98e65
Gitweb:     http://git.kernel.org/tip/2042abe7977222ef606306faa2dce8fd51e98e65
Author:     Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
AuthorDate: Wed, 30 Oct 2013 08:42:42 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 6 Nov 2013 12:37:54 +0100

sched: Fix asymmetric scheduling for POWER7

Asymmetric scheduling within a core is a scheduler loadbalancing feature that is triggered when the SD_ASYM_PACKING flag is set. The goal for the load balancer is to move tasks to lower order idle SMT threads within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core) is completely busy or we have an idle cpu.

The following check for SD_ASYM_PACKING:

    (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu, because cpumask_first_and() will not yield any set bits if this domain has no idle cpu.

Hence, the nr_busy check against group weight can be removed.

Reported-by: Michael Neuling <michael.neul...@au1.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
Tested-by: Michael Neuling <mi...@neuling.org>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
Cc: vincent.guit...@linaro.org
Cc: bitbuc...@online.de
Cc: b...@kernel.crashing.org
Cc: an...@samba.org
Cc: morten.rasmus...@arm.com
Cc: p...@google.com
Link: http://lkml.kernel.org/r/20131030031242.23426.13019.st...@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41c02b6..074551a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6800,7 +6800,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
 		goto need_kick_unlock;
 
-	if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+	if (sd->flags & SD_ASYM_PACKING
 	    && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
 		goto need_kick_unlock;
[PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources
From: Preeti U Murthy

The current logic in load balance is such that after picking the busiest group, the load is attempted to be moved from the busiest cpu in that group to the dst_cpu. If the load cannot be moved from the busiest cpu to dst_cpu due to either the tsk_cpus_allowed mask or cache hot tasks, then the dst_cpu is changed to be another idle cpu within the dst->grpmask. If even then the load cannot be moved from the busiest cpu, then the source group is changed. The next busiest group is found and the above steps are repeated.

However, if the cpus in the group share package resources, then when a load movement from the busiest cpu in this group fails as above, instead of finding the next busiest group to move load from, find the next busiest cpu *within the same group* from which to move load away. By doing so, a conscious effort is made during load balancing to keep just one cpu busy as much as possible within domains that have the SD_SHARE_PKG_RESOURCES flag set, unless under scenarios of high load. Having multiple busy cpus within a domain which share package resources could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the current task on the busiest cpu cannot be moved away due to task pinning, currently no more attempts at load balancing are made. This patch checks if the balancing is being done on a group whose cpus share package resources. If so, then check if the load balancing can be done for other cpus in the same group.
Signed-off-by: Preeti U Murthy
Signed-off-by: Vaidyanathan Srinivasan
---
 kernel/sched/fair.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	struct sched_group *group;
+	struct sched_domain *child;
+	int share_pkg_res = 0;
 	struct rq *busiest;
 	unsigned long flags;
 	struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_inc(sd, lb_count[idle]);
 
+	child = sd->child;
+	if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+		share_pkg_res = 1;
+
 redo:
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
 		goto out_balanced;
 	}
 
+redo_grp:
 	busiest = find_busiest_queue(&env, group);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
 		if (!cpumask_empty(cpus)) {
 			env.loop = 0;
 			env.loop_break = sched_nr_migrate_break;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+					to_cpumask(group->cpumask)))
+				goto redo_grp;
+
 			goto redo;
 		}
 		goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
 		 */
 		if (!cpumask_test_cpu(this_cpu,
 				tsk_cpus_allowed(busiest->curr))) {
+			cpumask_clear_cpu(cpu_of(busiest), cpus);
 			raw_spin_unlock_irqrestore(&busiest->lock,
 						    flags);
 			env.flags |= LBF_ALL_PINNED;
+			if (share_pkg_res &&
+			    cpumask_intersects(cpus,
+					to_cpumask(group->cpumask)))
+				goto redo_grp;
+
 			goto out_one_pinned;
 		}
[PATCH 2/3] sched: Fix asymmetric scheduling for POWER7
Asymmetric scheduling within a core is a scheduler loadbalancing feature that is triggered when SD_ASYM_PACKING flag is set. The goal for the load balancer is to move tasks to lower order idle SMT threads within a core on a POWER7 system. In nohz_kick_needed(), we intend to check if our sched domain (core) is completely busy or we have idle cpu. The following check for SD_ASYM_PACKING: (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu) already covers the case of checking if the domain has an idle cpu, because cpumask_first_and() will not yield any set bits if this domain has no idle cpu. Hence, nr_busy check against group weight can be removed. Reported-by: Michael Neuling Signed-off-by: Vaidyanathan Srinivasan Signed-off-by: Preeti U Murthy --- kernel/sched/fair.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 12f0eab..828ed97 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) goto need_kick_unlock; } - if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight - && (cpumask_first_and(nohz.idle_cpus_mask, + if (sd->flags & SD_ASYM_PACKING && + (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) goto need_kick_unlock;
[PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group
From: Preeti U Murthy In nohz_kick_needed() there are checks around the flags SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the domains with this flag set have more than one cpu busy. Therefore at every domain, a check has to be made on nr_busy of that domain. This means the sum of the nr_busy of each group in that domain needs to be checked, since nr_busy is a parameter which is associated with a group. However in the current implementation of nohz_kick_needed(), the nr_busy is being checked for just the group to which the cpu that has initiated this check belongs to. This will give us the wrong count of the number of busy cpus in that domain. The following commit which fixed the sgp->nr_busy_cpus computation actually exposed the bug in nohz_kick_needed() which worked when nr_busy was incorrectly > 1 25f55d9d01ad7a7ad248fd5af1d22675ffd202c5 sched: Fix init NOHZ_IDLE flag To illustrate the scenario, consider a core, whose domain will have the SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when we find that more than one thread in the core is busy. With the current implementation of nohz_kick_needed(), at this domain(sd), the nr_busy will be 1 always since it returns this parameter for sd->groups which encompasses a single thread, while we want this parameter for sd->parent->groups which will rightly point to the number of busy threads in the core. This patch also ensures that the order of check for SD_SHARE_PKG_RESOURCE comes before the check for ASYM_PACKING. Priority is given to avoid more than one busy thread in a core as much as possible before attempting asymmetric packing. 
Signed-off-by: Preeti U Murthy Signed-off-by: Vaidyanathan Srinivasan --- kernel/sched/fair.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7c70201..12f0eab 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu) rcu_read_lock(); for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; - struct sched_group_power *sgp = sg->sgp; - int nr_busy = atomic_read(&sgp->nr_busy_cpus); - - if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) - goto need_kick_unlock; + struct sched_domain *sd_parent = sd->parent; + struct sched_group *sg; + struct sched_group_power *sgp; + int nr_busy; + + if (sd_parent) { + sg = sd_parent->groups; + sgp = sg->sgp; + nr_busy = atomic_read(&sgp->nr_busy_cpus); + + if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1) + goto need_kick_unlock; + } if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight && (cpumask_first_and(nohz.idle_cpus_mask,
[PATCH 0/3] sched: Fixes for task placement in SMT threads
Hi, The following series fixes scheduler loadbalancing issues where we are missing opportunity to place tasks in SMT threads optimally especially on a POWER7 system. PATCH 1/3, Fixes the scenario where load balancing fails to move tasks away from the cpus in a domain which has SHARE_PKG_RESOURCES set even under the presence of idle cpus in other domains. PATCH 2/3, ensures lower order SMT threads are used for the SD_ASYM_PACKING load balancing case. PATCH 3/3, tries to fix the problem that is explained in its changelog. The following experiment exposes the scenario: A task placement test was conducted on a POWER7 as well as multi-core x86 system. At the beginning, tasks are pinned to all the threads within a core and later only the tasks pinned to the secondary threads in a core are unpinned. This leaves the primary thread with tasks that are unmovable, while the rest are free to be moved. Under the above setup, load balancing today does not move the unpinned tasks away from the secondary threads of the core in the above experiment although there are other idle cpus. The PATCH 3/3 fixes this situation and improves task placement even if some of the tasks in a core are pinned. This series applies on v3.12-rc6 and was tested on x86 and powerpc. --Vaidy --- Preeti U Murthy (2): sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group sched: Aggressive balance in domains whose groups share package resources Vaidyanathan Srinivasan (1): sched: Fix asymmetric scheduling for POWER7 kernel/sched/fair.c | 41 + 1 file changed, 33 insertions(+), 8 deletions(-)
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
* Preeti U Murthy [2013-07-27 13:20:37]: > Hi Ben, > > On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote: > > On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote: > >> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the > >> broadcast CPU to wake it up at timeX. Since we cannot program the lapic > >> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, > >> asking it to program its lapic to fire at timeX so as to wake up CPUX. > >> *With multiple CPUs the overhead of sending IPI, could result in > >> performance bottlenecks and may not scale well.* > >> > >> Hence the workaround is that the broadcast CPU on each of its timer > >> interrupt checks if any of the next timer event of a CPU in deep idle > >> state has expired, which can very well be found from dev->next_event of > >> that CPU. For example the timeX that has been mentioned above has > >> expired. If so the broadcast handler is called to send an IPI to the > >> idling CPU to wake it up. > >> > >> *If the broadcast CPU, is in tickless idle, its timer interrupt could be > >> many ticks away. It could miss waking up a CPU in deep idle*, if its > >> wakeup is much before this timer interrupt of the broadcast CPU. But > >> without tickless idle, atleast at each period we are assured of a timer > >> interrupt. At which time broadcast handling is done as stated in the > >> previous paragraph and we will not miss wakeup of CPUs in deep idle states. > > > > But that means a great loss of power saving on the broadcast CPU when the > > machine > > is basically completely idle. We might be able to come up with some thing > > better. > > > > (Note : I do no know the timer offload code if it exists already, I'm > > describing > > how things could happen "out of the blue" without any knowledge of > > pre-existing > > framework here) > > > > We can know when the broadcast CPU expects to wake up next. 
When a CPU goes > > to > > a deep sleep state, it can then > > > > - Indicate to the broadcast CPU when it intends to be woken up by queuing > > itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION: > > Play with the locality of that: have one queue (and one "broadcast CPU") per > > chip or per node instead of a global one to limit cache bouncing). > > > > - Check if that happens before the broadcast CPU intended wake time (we > > need statistics to see how often that happens), and in that case send an IPI > > to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep > > time to the min of it's intended sleep time and the new sleeper time. > > (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?) > > > > - We can probably limit spurrious wakeups a *LOT* by aligning that target > > time > > to a global jiffy boundary, meaning that several CPUs going to idle are > > likely > > to be choosing the same. Or maybe better, an adaptative alignment by > > essentially > > getting more coarse grained as we go further in the future > > > > - When the "broadcast" CPU goes to sleep, it can play the same game of > > alignment. > > > > I don't like the concept of a dedicated broadcast CPU however. I'd rather > > have a > > general queue (or per node) of sleepers needing a wakeup and more/less > > dynamically > > pick a waker to be the last man standing, but it does make things a bit more > > tricky with tickless scheduler (non-idle). > > > > Still, I wonder if we could just have some algorithm to actually pick wakers > > more dynamically based on who ever has the closest "next wakeup" planned, > > that sort of thing. A fixed broadcaster will create an imbalance in > > power/thermal within the chip in addition to needing to be moved around on > > hotplug etc... > > Thank you for having listed out the above suggestions. 
Below, I will > bring out some ideas about how the concerns that you have raised can be > addressed in the increasing order of priority. > > - To begin with, I think we can have the following model to have the > responsibility of the broadcast CPU float around certain CPUs. i.e. Not > have a dedicated broadcast CPU. I will refer to the broadcast CPU as the > bc_cpu henceforth for convenience. > > 1. The first CPU that intends to enter deep sleep state will be the bc_cpu. > > 2. Every other CPU that intends to enter deep idle state will enter > themselves into a mask, say the bc_mask, which is already being done > today, after they check that a bc_cpu has been assigned. > > 3. The bc_cpu should not enter tickless idle, until step 5a holds true. > > 4. So on every timer interrupt, which is at-least every period, it > checks the bc_mask to see if any CPUs need to be woken up. > > 5. The bc cpu should not enter tickless idle *until* it is de-nominated > as the bc_cpu. The de-nomination occurs when: > a. In one of its timer interrupts, it does broadcast handling to find > out that there are no CPUs to be woken up. > > 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides to enter deep idle state again, in which case steps 1 to 5 repeat. > > - We could optimize this further, to allow
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
* Benjamin Herrenschmidt [2013-07-27 16:30:05]: > On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote: > > *The lapic of a broadcast CPU is active always*. Say CPUX, wants the > > broadcast CPU to wake it up at timeX. Since we cannot program the lapic > > of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, > > asking it to program its lapic to fire at timeX so as to wake up CPUX. > > *With multiple CPUs the overhead of sending IPI, could result in > > performance bottlenecks and may not scale well.* > > > > Hence the workaround is that the broadcast CPU on each of its timer > > interrupt checks if any of the next timer event of a CPU in deep idle > > state has expired, which can very well be found from dev->next_event of > > that CPU. For example the timeX that has been mentioned above has > > expired. If so the broadcast handler is called to send an IPI to the > > idling CPU to wake it up. > > > > *If the broadcast CPU, is in tickless idle, its timer interrupt could be > > many ticks away. It could miss waking up a CPU in deep idle*, if its > > wakeup is much before this timer interrupt of the broadcast CPU. But > > without tickless idle, atleast at each period we are assured of a timer > > interrupt. At which time broadcast handling is done as stated in the > > previous paragraph and we will not miss wakeup of CPUs in deep idle states. > > But that means a great loss of power saving on the broadcast CPU when the > machine > is basically completely idle. We might be able to come up with some thing > better. Hi Ben, Yes, we will need to improve on this case in stages. In the current design, we will have to hold one of the CPUs in a shallow idle state (nap) to wake up other deep idle cpus. The cost of keeping the periodic tick ON on the broadcast CPU is minimal (but not zero) since we would not allow that CPU to enter any deep idle states even if there were no periodic timers queued.
> (Note : I do no know the timer offload code if it exists already, I'm > describing > how things could happen "out of the blue" without any knowledge of > pre-existing > framework here) > > We can know when the broadcast CPU expects to wake up next. When a CPU goes to > a deep sleep state, it can then > > - Indicate to the broadcast CPU when it intends to be woken up by queuing > itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION: > Play with the locality of that: have one queue (and one "broadcast CPU") per > chip or per node instead of a global one to limit cache bouncing). > > - Check if that happens before the broadcast CPU intended wake time (we > need statistics to see how often that happens), and in that case send an IPI > to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep > time to the min of it's intended sleep time and the new sleeper time. > (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?) This will be an improvement and the idea we have is to have a hierarchical method of finding a waking CPU within core/socket/node in order to find a better fit and ultimately send IPI to wakeup a broadcast CPU only if there is no other fit. This condition would imply that more CPUs are in deep idle state and the cost of sending an IPI to reprogram the broadcast cpu's local timer may well payoff. > - We can probably limit spurrious wakeups a *LOT* by aligning that target > time > to a global jiffy boundary, meaning that several CPUs going to idle are likely > to be choosing the same. Or maybe better, an adaptative alignment by > essentially > getting more coarse grained as we go further in the future > > - When the "broadcast" CPU goes to sleep, it can play the same game of > alignment. CPUs entering deep idle state would need to wakeup only at a jiffy boundary or the jiffy boundary earlier than the target wakeup time. 
Your point is can the broadcast cpu wakeup the sleeping CPU *around* the designated wakeup time (earlier) so as to avoid reprogramming its timer. > I don't like the concept of a dedicated broadcast CPU however. I'd rather > have a > general queue (or per node) of sleepers needing a wakeup and more/less > dynamically > pick a waker to be the last man standing, but it does make things a bit more > tricky with tickless scheduler (non-idle). > > Still, I wonder if we could just have some algorithm to actually pick wakers > more dynamically based on who ever has the closest "next wakeup" planned, > that sort of thing. A fixed broadcaster will create an imbalance in > power/thermal within the chip in addition to needing to be moved around on > hotplug etc... Right Ben. The hierarchical way of selecting the waker will help us have multiple wakers in different sockets/cores. The broadcast framework allows us to decouple the cpu going to idle and the waker to be selected independently. This patch series is the start where we pick one and allow it to move around. The ideal goal to achieve
Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
* Preeti U Murthy pre...@linux.vnet.ibm.com [2013-07-27 13:20:37]: Hi Ben, On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote: On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote: *The lapic of a broadcast CPU is active always*. Say CPUX, wants the broadcast CPU to wake it up at timeX. Since we cannot program the lapic of a remote CPU, CPUX will need to send an IPI to the broadcast CPU, asking it to program its lapic to fire at timeX so as to wake up CPUX. *With multiple CPUs the overhead of sending IPI, could result in performance bottlenecks and may not scale well.* Hence the workaround is that the broadcast CPU on each of its timer interrupt checks if any of the next timer event of a CPU in deep idle state has expired, which can very well be found from dev-next_event of that CPU. For example the timeX that has been mentioned above has expired. If so the broadcast handler is called to send an IPI to the idling CPU to wake it up. *If the broadcast CPU, is in tickless idle, its timer interrupt could be many ticks away. It could miss waking up a CPU in deep idle*, if its wakeup is much before this timer interrupt of the broadcast CPU. But without tickless idle, atleast at each period we are assured of a timer interrupt. At which time broadcast handling is done as stated in the previous paragraph and we will not miss wakeup of CPUs in deep idle states. But that means a great loss of power saving on the broadcast CPU when the machine is basically completely idle. We might be able to come up with some thing better. (Note : I do no know the timer offload code if it exists already, I'm describing how things could happen out of the blue without any knowledge of pre-existing framework here) We can know when the broadcast CPU expects to wake up next. When a CPU goes to a deep sleep state, it can then - Indicate to the broadcast CPU when it intends to be woken up by queuing itself into an ordered queue (ordered by target wakeup time). 
(OPTIMISATION: Play with the locality of that: have one queue (and one broadcast CPU) per chip or per node instead of a global one, to limit cache bouncing.)

- Check if that happens before the broadcast CPU's intended wake time (we need statistics to see how often that happens), and in that case send an IPI to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep time to the min of its intended sleep time and the new sleeper's time. (OPTIMISATION: Dynamically choose a broadcast CPU based on closest expiry?)

- We can probably limit spurious wakeups a *LOT* by aligning that target time to a global jiffy boundary, meaning that several CPUs going idle are likely to choose the same one. Or maybe better, an adaptive alignment, essentially getting more coarse-grained as we go further into the future.

- When the broadcast CPU goes to sleep, it can play the same game of alignment.

I don't like the concept of a dedicated broadcast CPU, however. I'd rather have a general queue (or one per node) of sleepers needing a wakeup, and more or less dynamically pick a waker to be the last man standing, but that does make things a bit more tricky with the tickless scheduler (non-idle). Still, I wonder if we could have some algorithm to pick wakers more dynamically based on whoever has the closest next wakeup planned, that sort of thing. A fixed broadcaster will create an imbalance in power/thermal within the chip, in addition to needing to be moved around on hotplug etc...

Thank you for having listed out the above suggestions. Below, I will bring out some ideas about how the concerns that you have raised can be addressed, in increasing order of priority.

- To begin with, I think we can have the following model to have the responsibility of the broadcast CPU float around certain CPUs, i.e. not have a dedicated broadcast CPU. I will refer to the broadcast CPU as the bc_cpu henceforth, for convenience.

1. The first CPU that intends to enter a deep sleep state becomes the bc_cpu.
2. Every other CPU that intends to enter a deep idle state enters itself into a mask, say the bc_mask (which is already being done today), after checking that a bc_cpu has been assigned.
3. The bc_cpu should not enter tickless idle until step 5a holds true.
4. So on every timer interrupt, which is at least every period, it checks the bc_mask to see if any CPUs need to be woken up.
5. The bc_cpu should not enter tickless idle *until* it is de-nominated as the bc_cpu. The de-nomination occurs when:
   a. in one of its timer interrupts, it does broadcast handling and finds that there are no CPUs to be woken up.
6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides to enter deep idle state again, in which case steps 1 to 5 repeat.

- We could optimize this further, to allow
Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
* Mel Gorman [2012-11-08 18:02:57]:

> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:

Hi Mel,

Thanks for the detailed review and comments. The goal of this patch series is to brainstorm on ideas that enable the Linux VM to record and exploit memory region boundaries. The first approach that we had last year (hierarchy) has more runtime overhead. This approach of sorted-buddy was one of the alternatives discussed earlier, and we are trying to find out if simple requirements of biasing memory allocations can be achieved with this approach. Smart reclaim based on this approach is a key piece we still need to design. Ideas from compaction will certainly help.

> > Today memory subsystems offer a wide range of capabilities for managing
> > memory power consumption. As a quick example, if a block of memory is not
> > referenced for a threshold amount of time, the memory controller can decide
> > to put that chunk into a low-power content-preserving state. And the next
> > reference to that memory chunk would bring it back to full power for
> > read/write. With this capability in place, it becomes important for the OS
> > to understand the boundaries of such power-manageable chunks of memory and
> > to ensure that references are consolidated to a minimum number of such
> > memory power management domains.
>
> How much power is saved?

On embedded platforms the savings could be around 5%, as discussed in the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935 On larger servers with large amounts of memory the savings could be more. We do not yet have all the pieces together to evaluate.

> > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > the firmware can expose information regarding the boundaries of such memory
> > power management domains to the OS in a standard way.
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events? For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
>
> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.

Correct. The platform interface part is not included in this patch set, mainly because there is not much design required there. Each platform can have code to collect the memory region boundaries from BIOS/firmware and load them into the Linux VM. The goal of this patch is to brainstorm on the idea of how the core VM should use the region information.

> > How can Linux VM help memory power savings?
> >
> > o Consolidate memory allocations and/or references such that they are
> >   not spread across the entire memory address space. Basically an area of
> >   memory that is not being referenced can reside in a low power state.
>
> Which the series does not appear to do.

Correct. We need to design the correct reclaim strategy for this to work. However, having the buddy list sorted by region address could get us one step closer to shaping the allocations.

> > o Support targeted memory reclaim, where certain areas of memory that can
> >   be easily freed can be offlined, allowing those areas of memory to be
> >   put into lower power states.
>
> Which the series does not appear to do judging from this;
>
>  include/linux/mm.h     |  38 +++
>  include/linux/mmzone.h |  52 +
>  mm/compaction.c        |   8 +
>  mm/page_alloc.c        | 263
>  mm/vmstat.c            |  59 ++-
>
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.

Correct.
Evaluating different ideas for reclaim will be the next step, before getting into the platform interface parts.

> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent it assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.

This patch is roughly based on the idea that ACPI MPST will give us memory region boundaries. It is not designed to implement all options defined in the spec. We have taken the general case where regions do not overlap, while the memory addresses themselves can be discontinuous.

> It seems to me that superficially the VM implementation for the support
> would have
>
> a) Involved a tree that managed the overlapping regions (even
> if it's not hierarchal it feels more sensible) and picked the
Re: [PATCH 33/36] autonuma: powerpc port
* Benjamin Herrenschmidt [2012-08-23 08:56:34]:

> On Thu, 2012-08-23 at 08:01 +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> > > From: Vaidyanathan Srinivasan
> > >
> > > * PMD flagging is not required in powerpc since large pages
> > >   are tracked in ptes.
> > > * Yet to be tested with large pages
> > > * This is an initial patch that partially works
> > > * knuma_scand and numa hinting page faults work
> > > * Page migration is yet to be observed/verified
> > >
> > > Signed-off-by: Vaidyanathan Srinivasan
> > > Signed-off-by: Andrea Arcangeli
> >
> > I don't like this.
>
> What I mean here is that it's fine as a proof of concept ;-) I don't
> like it being in a series aimed at upstream...

I agree. My intent was to get the ppc64 discussions started and also see what it takes to get autonuma to a new arch.

> We can try to flush out the issues, but as it is, the patch isn't
> upstreamable imho.

Yes. The patch is still in RFC phase and good only for review/testing.

> As for finding PTE bits, I have a few ideas we need to discuss, but
> nothing simple I'm afraid.

Sure Ben, let's try them and find the better fit.

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Analysis of sched_mc_power_savings
* Ingo Molnar <[EMAIL PROTECTED]> [2008-01-09 12:35:07]:

> * Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
> > How do we take this technique to the next step where we can
> > consolidate short running jobs as well? Did you face any difficulty
> > biasing the CPU for short running jobs?
>
> are you sure your measurement tasks do not impact the measurement
> workload? If you use something like 'top' then try running it reniced to
> +19. (or perhaps even bound to a particular CPU, say #3, to make its
> impact isolated)

Hi Ingo,

I will watch this during the experiments. I have been using the klog application to dump relayfs data. I did run powertop and top as well; I will bind them to certain CPUs and isolate their impact. I believe the margin of error would be small since all the measurement tasks sleep for long durations.

Thanks,
Vaidy
Re: Analysis of sched_mc_power_savings
* Siddha, Suresh B <[EMAIL PROTECTED]> [2008-01-08 13:24:00]:

> On Tue, Jan 08, 2008 at 11:08:15PM +0530, Vaidyanathan Srinivasan wrote:
> > Hi,
> >
> > The following experiments were conducted on a two socket dual core
> > intel processor based machine in order to understand the impact of
> > sched_mc_power_savings scheduler heuristics.
>
> Thanks for these experiments. sched_mc_power_savings is mainly for
> reasonably long running (which can be caught by periodic load balance)
> light load (when number of tasks < number of logical cpus). This tunable
> will minimize the number of packages carrying load, enabling the other
> completely idle packages to go to the available deepest P and C states.

Hi Suresh,

Thanks for the explanation. I tried a few more experiments based on your comments and found that in a long running CPU intensive job, the heuristics did help to get the load onto the same package.

> > The scheduler heuristics for multi core systems
> > (/sys/devices/system/cpu/sched_mc_power_savings) should ideally extend
> > the cpu tickless idle time on at least a few CPUs in an SMP machine.
>
> Not really. Ideally sched_mc_power_savings re-distributes the load (in
> the scenarios mentioned above) to different logical cpus in the system,
> so as to minimize the busy packages in the system.

But the problem is that we will not be able to get longer idle times and go to lower sleep states on other packages if we are not able to move the short bursts of daemon jobs to the busy packages. We are looking at various means to increase uninterrupted CPU idle times on at least some of the CPUs in an under-utilised SMP machine.

> > Experiment 1:
> > -------------
> ...
> > Observations with sched_mc_power_savings=1:
> >
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Significant idle time improvement on CPU2
> > * However, significant idle time reduction on CPU3
>
> In your setup, CPU0 and 3 are the core siblings? If so, this is probably
> expected. Previously there was some load distribution happening between
> packages. With sched_mc_power_savings, load is now getting distributed
> between cores in the same package. But on my systems, typically CPU0 and 2
> are core siblings. I will let you confirm your system topology, before we
> come to conclusions.

No, contrary to your observation, on my machine CPU0 and CPU1 are siblings on the first package while CPU2 and CPU3 are in the second package.

> > Experiment 2:
> > -------------
> ...
> > Observations with sched_mc_power_savings=1:
> >
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Good idle time improvement on CPU2 and CPU3
>
> I would have expected similar results like in experiment-1. CPU3 data
> seems almost the same with and without sched_mc_power_savings. At least
> the data is not significantly different as in other cases like CPU2 for
> example.

There is a general improvement in idle time in this experiment as compared to the first experiment. So if sched_mc does not impact short running tasks (daemons and other event managers in an idle system), then the observation in experiment-1 may be just a random redistribution of tasks over time.

> > Please review the experiment and comment on how the effectiveness of
> > sched_mc_power_savings can be analysed.
>
> While we should see no change or a simple redistribution (to minimize
> busy packages) in your experiments, for evaluating sched_mc_power_savings,
> we can also use some lightly loaded system (like kernel-compilation or
> specjbb with 2 threads on a DP with dual-core, for example) and see how
> the load is distributed with and without MC power savings.

I have tried running two CPU intensive threads. With sched_mc turned ON, they were scheduled on CPU0-1 or CPU2-3. With sched_mc turned OFF, I can see that they were scheduled arbitrarily. Later I added random usleep() calls to make them consume less CPU, in which case the scheduling was very random. The two threads with random sleeps were scheduled on any of the 4 CPUs, and the sched_mc parameter had no effect in consolidating the threads onto one package. The addition of sleeps makes the threads short-running and not CPU intensive. This completely validates your explanation of consolidating only 'long running' jobs.

How do we take this technique to the next step where we can consolidate short running jobs as well? Did you face any difficulty biasing the CPU for short running jobs?

> Similarly it will be interesting to see how this data varies with and
> without tickless.

I am sure tickless has a very significant impact on the idle time. I will get you more data on this comparison.

> For some loads which can't be caught by periodic load balancer, we may
> not see any difference. But at least we should not see any scheduling
> anomalies.

True, we did not see any scheduling anomalies yet.

Thanks,
Vaidy
Analysis of sched_mc_power_savings
Hi,

The following experiments were conducted on a two socket dual core intel processor based machine in order to understand the impact of the sched_mc_power_savings scheduler heuristics.

Kernel linux-2.6.24-rc6:

CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y

tick-sched.c has been instrumented to collect idle entry and exit time stamps.

Instrumentation patch:

Instrument tick-sched nohz code and generate time stamp trace data.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 kernel/time/tick-sched.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- linux-2.6.24-rc6.orig/kernel/time/tick-sched.c
+++ linux-2.6.24-rc6/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/tick.h>
+#include <linux/ktrace.h>

 #include <asm/irq_regs.h>

@@ -200,7 +201,10 @@ void tick_nohz_stop_sched_tick(void)
 	if (ts->tick_stopped) {
 		delta = ktime_sub(now, ts->idle_entrytime);
 		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-	}
+		ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_INFO1,
+			    ktime_to_ns(now), ts->idle_calls);	>>>>>>>>>>>>>>>> Tracepoint A
+	} else
+		ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_FUNC_ENTER, ktime_to_ns(now), 0);	>>>>>>>>>>>>>>>> Tracepoint B

 	ts->idle_entrytime = now;
 	ts->idle_calls++;

@@ -391,6 +395,8 @@ void tick_nohz_restart_sched_tick(void)
 		tick_do_update_jiffies64(now);
 		now = ktime_get();
 	}
+	ktrace_log2(KT_FUNC_tick_nohz_restart_sched_tick, KT_EVENT_FUNC_EXIT,
+		    ktime_to_ns(now), ts->idle_calls);	>>>>>>>>>>>>>>>> Tracepoint C

 	local_irq_enable();
 }

The idle time collected is the time stamp at (C) minus the one at (B). This is the time interval between stopping ticks and restarting ticks in an idle system.
Complete patch series:
http://svaidy.googlepages.com/1-klog.patch
http://svaidy.googlepages.com/1-trace-sched.patch

Userspace program to extract trace data: http://svaidy.googlepages.com/1-klog.c
Python script to post process binary trace data: http://svaidy.googlepages.com/1-sched-stats.py
Gnuplot scripts that were used to generate the graphs: http://svaidy.googlepages.com/1-multiplot.gp

The scheduler heuristics for multi core systems (/sys/devices/system/cpu/sched_mc_power_savings) should ideally extend the cpu tickless idle time on at least a few CPUs in an SMP machine. However, in the experiment it was found that turning on sched_mc_power_savings marginally increased the idle time on only some of the CPUs.

Experiment 1:
-------------

Setup:

* The yum-updatesd and irqbalance daemons were stopped to reduce the idle wakeup rate
* All irqs manually routed to CPU0 only, hoping this will keep the other CPUs idle (http://svaidy.googlepages.com/1-irq-config.txt)
* Powertop shows around 35 wakeups per second during idle (http://svaidy.googlepages.com/1-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the system in idle state

Results:

There are 4 png files that plot the idle time for each CPU in the system.
Please get the graphs from the following URLs:
http://svaidy.googlepages.com/1-idle-cpu0.png
http://svaidy.googlepages.com/1-idle-cpu1.png
http://svaidy.googlepages.com/1-idle-cpu2.png
http://svaidy.googlepages.com/1-idle-cpu3.png

Each png file has 4 graphs plotted that are relevant to one CPU:

* The right-top plot is the idle time samples obtained during the experiment
* The left-top graph is a histogram of the right-top plot
* The bottom graphs correspond to idle times when sched_mc_power_savings=1

Observations with sched_mc_power_savings=1:

* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Significant idle time improvement on CPU2
* However, significant idle time reduction on CPU3

Experiment 2:
-------------

Setup:

* USB stopped
* Most daemons like yum-updatesd, hal, autofs, syslog, crond, irqbalance, sendmail, pcscd were stopped
* Interrupt routing left to default, but the irqbalance daemon stopped
* Powertop shows around 4 wakeups per second during idle (http://svaidy.googlepages.com/2-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the system in idle state

Results:

There are 4 png files that plot the idle time for each CPU in the system:
http://svaidy.googlepages.com/2-idle-cpu0.png
http://svaidy.googlepages.com/2-idle-cpu1.png
http://svaidy.googlepages.com/2-idle-cpu2.png
http://svaidy.googlepages.com/2-idle-cpu3.png

The details of the plot are the same as in the previous experiment.

Observations with sched_mc_power_savings=1:

* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Good idle time improvement on CPU2 and CPU3

Please review the experiment
Analysis of sched_mc_power_savings
Hi, The following experiments were conducted on a two socket dual core intel processor based machine in order to understand the impact of sched_mc_power_savings scheduler heuristics. Kernel linux-2.6.24-rc6: CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y tick-sched.c has been instrumented to collect idle entry and exit time stamps. Instrumentation patch: Instrument tick-sched nohz code and generate time stamp trace data. Signed-off-by: Vaidyanathan Srinivasan [EMAIL PROTECTED] --- kernel/time/tick-sched.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) --- linux-2.6.24-rc6.orig/kernel/time/tick-sched.c +++ linux-2.6.24-rc6/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include linux/profile.h #include linux/sched.h #include linux/tick.h +#include linux/ktrace.h #include asm/irq_regs.h @@ -200,7 +201,10 @@ void tick_nohz_stop_sched_tick(void) if (ts-tick_stopped) { delta = ktime_sub(now, ts-idle_entrytime); ts-idle_sleeptime = ktime_add(ts-idle_sleeptime, delta); - } + ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_INFO1, + ktime_to_ns(now), ts-idle_calls); Tracepoint A + } else + ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_FUNC_ENTER, ktime_to_ns(now), 0); Tracepoint B ts-idle_entrytime = now; ts-idle_calls++; @@ -391,6 +395,8 @@ void tick_nohz_restart_sched_tick(void) tick_do_update_jiffies64(now); now = ktime_get(); } + ktrace_log2(KT_FUNC_tick_nohz_restart_sched_tick, KT_EVENT_FUNC_EXIT, + ktime_to_ns(now), ts-idle_calls); Tracepoint C local_irq_enable(); } The idle time collected are time stamp at (C) minus (B). This is the time interval between stopping ticks and restarting ticks in an idle system. 
Complete patch series:
http://svaidy.googlepages.com/1-klog.patch
http://svaidy.googlepages.com/1-trace-sched.patch

Userspace program to extract trace data:
http://svaidy.googlepages.com/1-klog.c

Python script to post-process the binary trace data:
http://svaidy.googlepages.com/1-sched-stats.py

Gnuplot script that was used to generate the graphs:
http://svaidy.googlepages.com/1-multiplot.gp

The scheduler heuristic for multi-core systems, /sys/devices/system/cpu/sched_mc_power_savings, should ideally extend the CPU tickless idle time on at least a few CPUs in an SMP machine. However, in these experiments it was found that turning on sched_mc_power_savings only marginally increased the idle time on some of the CPUs.

Experiment 1:

Setup:
* The yum-updatesd and irqbalance daemons were stopped to reduce the idle wakeup rate
* All IRQs were manually routed to CPU0 only (hoping this would keep the other CPUs idle) (http://svaidy.googlepages.com/1-irq-config.txt)
* Powertop shows around 35 wakeups per second during idle (http://svaidy.googlepages.com/1-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the system in an idle state

Results:

There are 4 png files that plot the idle time for each CPU in the system.
Please get the graphs from the following URLs:
http://svaidy.googlepages.com/1-idle-cpu0.png
http://svaidy.googlepages.com/1-idle-cpu1.png
http://svaidy.googlepages.com/1-idle-cpu2.png
http://svaidy.googlepages.com/1-idle-cpu3.png

Each png file has 4 graphs plotted that are relevant to one CPU:
* The right-top plot is the idle time samples obtained during the experiment
* The left-top graph is a histogram of the right-top plot
* The bottom graphs correspond to idle times when sched_mc_power_savings=1

Observations with sched_mc_power_savings=1:
* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Significant idle time improvement on CPU2
* However, significant idle time reduction on CPU3

Experiment 2:

Setup:
* USB stopped
* Most daemons, such as yum-updatesd, hal, autofs, syslog, crond, irqbalance, sendmail and pcscd, were stopped
* Interrupt routing was left at the default, but the irqbalance daemon was stopped
* Powertop shows around 4 wakeups per second during idle (http://svaidy.googlepages.com/2-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the system in an idle state

Results:

There are 4 png files that plot the idle time for each CPU in the system:
http://svaidy.googlepages.com/2-idle-cpu0.png
http://svaidy.googlepages.com/2-idle-cpu1.png
http://svaidy.googlepages.com/2-idle-cpu2.png
http://svaidy.googlepages.com/2-idle-cpu3.png

The details of the plots are the same as in the previous experiment.

Observations with sched_mc_power_savings=1:
* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Good idle time improvement on CPU2 and CPU3

Please review the experiment.
Re: 1.0.0.0 DNS replies for many domain names (network)
* Amogh Hooshdar <[EMAIL PROTECTED]> [2007-12-14 17:20:17]:

> I am having a strange problem with Debian Etch 4.0 (both 64-bit and
> 32-bit) using 2.6.18 kernel. Most websites do not open with browser,
> Pidgin and most other GUI applications, but I am able to ping them
> fine. I am also able to do nslookup properly. When I tried to
> investigate it with the Wireshark net sniffer, I observed the following.
>
> PROBLEM WITH 2.6.18
> Say, I try to open www.google.com; the browser sends a DNS query for
> www.google.com to my DNS server, which is correctly configured in
> resolv.conf. It replies with the correct IP address. www.google.com
> redirects the browser to www.google.co.in. The browser sends a DNS query
> again for www.google.co.in and the DNS server replies with 1.0.0.0,
> which obviously is the wrong address.

I had this problem on Debian 4.0, and it was due to a bug in the DSL router. I had the DNS server set to 192.168.1.1, which is my DSL router; it holds the real DNS IP and forwards the DNS lookup requests. Once in a while the router's DNS proxy would give out 1.0.0.0. The solution I used was to find the real DNS server and put it in resolv.conf. This bypasses the DNS proxy on the router, and the problem went away.

https://bugs.launchpad.net/ubuntu/+bug/81057

--Vaidy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc1 on PPC64: machine check exception
> >> http://patchwork.ozlabs.org/linuxppc/patch?id=14475
>
> Thanks for pointing me to this patch. I will try it out and let you know if
> this fixed the problem.

Hi Anton,

This patch fixed the problem. I am able to run and profile ebizzy on the 128-way PPC64. However, this fix is not included in 2.6.24-rc2 either. I will watch for the inclusion of this fix in 2.6.24.

--Vaidy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc1 on PPC64: machine check exception
Anton Blanchard wrote: > Hi > >> I got the following exception on a 128-way PPC64 machine while running >> ebizzy-0.2 benchmark. > > Is it a 64bit application? Im guessing its fixed by this: Hi Anton, The ebizzy application I tried is 64-bit. > http://patchwork.ozlabs.org/linuxppc/patch?id=14475 Thanks for pointing me to this patch. I will try out and let you know if this fixed the problem. --Vaidy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.24-rc1 on PPC64: machine check exception
Hi, I got the following exception on a 128-way PPC64 machine while running ebizzy-0.2 benchmark. cpu 0x48: Vector: 200 (Machine Check) at [c06df1bb37c0] pc: c007a26c: .exit_robust_list+0x78/0x228 lr: c007a240: .exit_robust_list+0x4c/0x228 sp: c06df1bb3a40 msr: 80109032 current = 0xc06df1725120 paca= 0xc06f4000 pid = 5950, comm = ebizzy cpu 0x78: Vector: 200 (Machine Check) at [c00e21bd37c0] pc: c007a26c: .exit_robust_list+0x78/0x228 lr: c007a240: .exit_robust_list+0x4c/0x228 sp: c00e21bd3a40 msr: 80109032 current = 0xc00e pa paca= 0xc06fb800 cidpid = 5941, comm = ebizzy = 5941,[c06df1bb3b10] c00578a8 .do_exit+0x260/0x7fc [c06df1bb3bc0] c0057ef8 .sys_exit_group+0x0/0x8 [c06df1bb3c50] c00636e0 .get_signal_to_deliver+0x45c/0x4ac [c06df1bb3d00] c0011d44 .do_signal+0x68/0x2e4 [c06df1bb3e30] c0008ae8 do_work+0x28/0x2c --- Exception: 500 (Hardware Interrupt) at 10007698 SP (40016601610) is in userspace Config file attached. I did not see any similar failure reported with 2.6.24-rc1. --Vaidy # # Automatically generated make config: don't edit # Linux kernel version: 2.6.24-rc1 # Mon Nov 5 02:25:46 2007 # CONFIG_PPC64=y # # Processor support # # CONFIG_POWER4_ONLY is not set CONFIG_POWER3=y CONFIG_POWER4=y # CONFIG_TUNE_CELL is not set CONFIG_PPC_FPU=y CONFIG_ALTIVEC=y CONFIG_PPC_STD_MMU=y CONFIG_PPC_MM_SLICES=y CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_SMP=y CONFIG_NR_CPUS=128 CONFIG_64BIT=y CONFIG_WORD_SIZE=64 CONFIG_PPC_MERGE=y CONFIG_MMU=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_HARDIRQS=y CONFIG_IRQ_PER_CPU=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_ILOG2_U32=y CONFIG_ARCH_HAS_ILOG2_U64=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_GENERIC_FIND_NEXT_BIT=y CONFIG_ARCH_NO_VIRT_TO_BUS=y CONFIG_PPC=y CONFIG_EARLY_PRINTK=y CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_OF=y CONFIG_OF=y 
CONFIG_PPC_UDBG_16550=y # CONFIG_GENERIC_TBSYNC is not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y # CONFIG_DEFAULT_UIMAGE is not set # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set # CONFIG_PPC_OF_PLATFORM_PCI is not set CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="-sv" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_TREE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=17 CONFIG_CGROUPS=y # CONFIG_CGROUP_DEBUG is not set # CONFIG_CGROUP_NS is not set # CONFIG_CGROUP_CPUACCT is not set CONFIG_CPUSETS=y # CONFIG_FAIR_GROUP_SCHED is not set # CONFIG_SYSFS_DEPRECATED is not set CONFIG_PROC_PID_CPUSET=y CONFIG_RELAY=y CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLAB=y # CONFIG_SLUB is not set # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_KMOD=y CONFIG_STOP_MACHINE=y CONFIG_BLOCK=y # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_BLK_DEV_BSG is not set CONFIG_BLOCK_COMPAT=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not 
set # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="anticipatory" # # Platform support # CONFIG_PPC_MULTIPLATFORM=y # CONFIG_PPC_82xx is not set # CONFIG_PPC_83xx is not set # CONFIG_PPC_86xx is not set CONFIG_PPC_PSERIES=y CONFIG_PPC_SPLPAR=y CONFIG_EEH=y CONFIG_SCANLOG=m CONFIG_LPARCFG=y # CONFIG_PPC_ISERIES is not set # CONFIG_PPC_MPC52xx is not set # CONFIG_PPC_MPC5200 is not set # CONFIG_PPC_PMAC is not set # CONFIG_PPC_MAPLE is not set # CONFIG_PPC_PASEMI is not set # CONFIG_PPC_CELLEB is not set # CONFIG_PPC_PS3 is not set # CONFIG_PPC_CELL is not set # CONFIG_PPC_CELL_NATIVE is not set # CONFIG_PPC_IBM_CELL_BLADE is not set # CONFIG_PQ2ADS is not set CONFIG_PPC_NATIVE=y # CONFIG_UDBG_RTAS_CONSOLE is not set CONFIG_XICS=y
Re: Question about free/used memory on Linux
Ravinandan Arakali (rarakali) wrote:
> Hi Vaidy,
> Thanks for clarifying several of my doubts.
>
> To answer your question about my intention, we currently have a
> system with 2 GB RAM and I need to find out the actual used and
> free memory so that we can decide if the same setup (applications,
> tmpfs etc.) can run on another system with less memory.
>
> Is it correct to say that the "used" field of "free -m" excluding
> buffers/caches would give the correct idea of used memory
> (I mean, does it take care of shared memory, shared copies of
> libraries etc.)? I assume it does not include /dev/shm usage
> since that's also a tmpfs partition?

That's correct. The used count excluding the buffers/caches gives most of the memory used by the system. You have excluded _all_ file-backed memory, including shm.

> If so, then I can add the memory used by tmpfs partitions to
> the above and get the total memory used?
>
> For eg. if my "free -m" appears as below:
> Linux(debug)# free -m
>              total   used   free   shared   buffers   cached
> Mem:          2014    984   1030        0        80       594
> -/+ buffers/cache:    309   1705
>
> Can I say that 309 MB + 350 MB (size of tmpfs partitions including
> /dev/shm) is the used memory on my system?

Two problems with this logic:

1. All of tmpfs may not be really used. You are over-committing.

2. You still miss the pages needed to map the program code. They are file-backed too, though this will be a very small amount of memory compared to data and shared memory.

Let me suggest a metric:

1. Take the used part excluding the buffers (309 MB in your case) and add 'Mapped' and 'Dirty' from /proc/meminfo. This may be better than adding the tmpfs/shm size.

2. Once your system is running with all applications loaded, clean up the pagecache (file data cached in memory):

   sync
   echo 1 > /proc/sys/vm/drop_caches

   The first sync will bring down the 'Dirty' count, and drop_caches will reclaim all 'not needed' file cache memory.

Now if you use 'free' and take the used count _with_ the buffers and file cache, it will provide a realistic value. (Actually 'Free' in /proc/meminfo.) Do not exclude buffers; they are _needed_ for optimum system operation.

With the above figure you can probably add 10% or more memory as extra memory for file cache when the system is operating under full load.

If you want to be sure of these estimates, boot your system with less memory using the mem=xxx kernel parameter and run some performance tests to ensure the degradation is within acceptable limits.

--Vaidy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
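The metric suggested above (the used count excluding buffers/caches, plus 'Mapped' and 'Dirty' from /proc/meminfo) can be sketched as a small script. The field names are the real /proc/meminfo keys, but the parsing approach and the sample values below are illustrative only, not taken from any real system.

```python
# Sketch of the suggested used-memory metric:
#   (MemTotal - MemFree - Buffers - Cached) + Mapped + Dirty
# i.e. 'free -m' used-minus-buffers/cache plus mapped and dirty pages.
# On a live system, read the text from /proc/meminfo instead of `sample`.

def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(':')
        if rest:
            fields[key.strip()] = int(rest.split()[0])
    return fields

def used_estimate_kb(mi):
    """Used memory estimate per the metric suggested in the thread."""
    used_minus_cache = (mi['MemTotal'] - mi['MemFree']
                        - mi['Buffers'] - mi['Cached'])
    return used_minus_cache + mi['Mapped'] + mi['Dirty']

sample = """\
MemTotal:     2062336 kB
MemFree:      1054720 kB
Buffers:        81920 kB
Cached:        608256 kB
Mapped:        204800 kB
Dirty:           1024 kB"""

print(used_estimate_kb(parse_meminfo(sample)), "kB")  # 523264 kB
```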
Re: Question about free/used memory on Linux
Ravinandan Arakali (rarakali) wrote:
> Hi kernel gurus,
> I am trying to find out the memory that's used on my linux box.
> I find that there are quite a few confusing metrics. How do
> I find out the "true" used memory?

Pagecache pages may be the reason for the confusion. Most free memory can be consumed under 'Cached' in /proc/meminfo. Most of this memory is easily reclaimable and can be considered 'free'. However, the cost of reclaim increases if the pages are memory mapped and/or dirty. 'Cached' minus 'Mapped' minus 'Dirty' in /proc/meminfo can be seen as easily freeable memory and can be added to the 'Free' count from /proc/meminfo.

What is your intention in finding the free memory in the system? The Linux kernel takes the best decision on using available RAM for file cache or application code/data as and when needed. Ideally the 'Free' count will be a very low value on a heavily used system. If an application needs more memory, the kernel will shrink the caches and give the reclaimed memory to the application.

> 1. For eg. "free -m" shows free memory (excluding buffers/caches)
> as 308 MB while I can see (from "df" output) that the tmpfs
> partitions take up about 400 MB. So, does "free -m" not consider
> the tmpfs partitions?

Pages used from tmpfs should come under Mapped or Cached. They are not counted as free.

> 2. I try to add up the RSS field of all processes reported by
> the "ps aux" command. But is it true that this would be misleading
> in that shared memory used by, say, 2 processes would show
> up twice here although there's only one copy in memory? Also,
> does this consider the fact that there's only one copy
> of shared libraries?

RSS is from each process's point of view. If a page is present in RAM, it is counted. If a page is shared, it is counted in both process address spaces.

> 3. I guess the "free -m" and "top" commands use /proc/meminfo
> and hence all these outputs are the same?
Yes, all of them parse /proc/meminfo --Vaidy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
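The 'Cached' minus 'Mapped' minus 'Dirty' rule of thumb described above can be sketched as follows. The arithmetic matches the description in this thread; the sample numbers are made up for illustration, not taken from a real system.

```python
# Sketch of the "easily freeable" estimate: pages counted in 'Cached'
# that are neither mapped nor dirty are cheap to reclaim, so they can be
# added to 'MemFree' to estimate the effectively free memory.
# All values are in kB, as reported by /proc/meminfo.

def effectively_free_kb(mem_free, cached, mapped, dirty):
    """MemFree plus the easily reclaimable part of the page cache."""
    easily_freeable = max(cached - mapped - dirty, 0)
    return mem_free + easily_freeable

# e.g. MemFree=315392, Cached=608256, Mapped=204800, Dirty=1024 (kB)
print(effectively_free_kb(315392, 608256, 204800, 1024))  # 717824
```

Note this is only an estimate: reclaim cost rises as more of the cache is mapped or dirty, as explained above.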
Re: [-mm PATCH 0/9] Memory controller introduction (v4)
KAMEZAWA Hiroyuki wrote:
> On Wed, 8 Aug 2007 12:51:39 +0900
> KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
>
>> On Sat, 28 Jul 2007 01:39:37 +0530
>> Balbir Singh <[EMAIL PROTECTED]> wrote:
>>> At OLS, the resource management BOF, it was discussed that we need to manage
>>> RSS and unmapped page cache together. This patchset is a step towards that
>>>
>> Can I ask a question? Why limit RSS instead of the # of used pages per
>> container? Maybe because of shared pages between containers
> Sorry, ignore the above question.
> I didn't understand what mem_container_charge() accounts and limits.
> It controls the # of meta_pages.

Hi Kame,

Actually, the number of pages resident in memory brought in by a container is charged. Each such page has a meta_page allocated to keep the extra data. Yes, the accounting counts the number of meta_pages, which is the same as the number of mapped and unmapped (pagecache) pages brought into system memory by this container.

Whether pagecache pages should be included or not is configurable per container through the 'type' file in containerfs.

--Vaidy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v4)
Vaidyanathan Srinivasan wrote:
>
> YAMAMOTO Takashi wrote:
>>> +unsigned long mem_container_isolate_pages(unsigned long nr_to_scan,
>>> +					struct list_head *dst,
>>> +					unsigned long *scanned, int order,
>>> +					int mode, struct zone *z,
>>> +					struct mem_container *mem_cont,
>>> +					int active)
>>> +{
>>> +	unsigned long nr_taken = 0;
>>> +	struct page *page;
>>> +	unsigned long scan;
>>> +	LIST_HEAD(mp_list);
>>> +	struct list_head *src;
>>> +	struct meta_page *mp;
>>> +
>>> +	if (active)
>>> +		src = &mem_cont->active_list;
>>> +	else
>>> +		src = &mem_cont->inactive_list;
>>> +
>>> +	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>>> +		mp = list_entry(src->prev, struct meta_page, lru);
>> what prevents another thread from freeing mp here?
>
> mem_cont->lru_lock protects the list and the validity of mp. If we hold
> mem_cont->lru_lock for this entire loop, then we preserve the validity
> of mp. However, that would hold up container charge and uncharge.
>
> This entire routine is called with zone->lru_lock held by the caller,
> so within a zone this routine is serialized.
>
> However, page uncharge may race with isolate page. But will that lead
> to any corruption of the list? We may be holding the lock for too
> much time just to be on the safe side.
>
> Please allow us some time to verify whether this is indeed inadequate
> locking that will lead to corruption of the list.

I did a few runs and checked the ref_cnt on the meta_page, and there does seem to be a race between isolate pages and uncharge. We will probably have to increase the ref_cnt on the meta_page while we are isolating it. I am trying to see if we can solve the problem by manipulating the ref_cnt on the meta_page.

--Vaidy

> Thanks for pointing out this situation.
> --Vaidy
>
>>> +		spin_lock(&mem_cont->lru_lock);
>>> +		if (mp)
>>> +			page = mp->page;
>>> +		spin_unlock(&mem_cont->lru_lock);
>>> +		if (!mp)
>>> +			continue;
>> YAMAMOTO Takashi
>> ___
>> Containers mailing list
>> [EMAIL PROTECTED]
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v4)
YAMAMOTO Takashi wrote:
>> +unsigned long mem_container_isolate_pages(unsigned long nr_to_scan,
>> +		struct list_head *dst,
>> +		unsigned long *scanned, int order,
>> +		int mode, struct zone *z,
>> +		struct mem_container *mem_cont,
>> +		int active)
>> +{
>> +	unsigned long nr_taken = 0;
>> +	struct page *page;
>> +	unsigned long scan;
>> +	LIST_HEAD(mp_list);
>> +	struct list_head *src;
>> +	struct meta_page *mp;
>> +
>> +	if (active)
>> +		src = &mem_cont->active_list;
>> +	else
>> +		src = &mem_cont->inactive_list;
>> +
>> +	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>> +		mp = list_entry(src->prev, struct meta_page, lru);
>
> what prevents another thread from freeing mp here?

mem_cont->lru_lock protects the list and the validity of mp. If we hold mem_cont->lru_lock for this entire loop, then we preserve the validity of mp. However, that would hold up container charge and uncharge.

This entire routine is called with zone->lru_lock held by the caller. So within a zone, this routine is serialized.

However, page uncharge may race with isolate page. But will that lead to any corruption of the list? We may be holding the lock for too much time just to be on the safe side.

Please allow us some time to verify whether this is indeed inadequate locking that will lead to corruption of the list.

Thanks for pointing out this situation.

--Vaidy

>> +		spin_lock(&mem_cont->lru_lock);
>> +		if (mp)
>> +			page = mp->page;
>> +		spin_unlock(&mem_cont->lru_lock);
>> +		if (!mp)
>> +			continue;
>
> YAMAMOTO Takashi
Re: [-mm PATCH 4/9] Memory controller memory accounting (v4)
YAMAMOTO Takashi wrote:
>> +	lock_meta_page(page);
>> +	/*
>> +	 * Check if somebody else beat us to allocating the meta_page
>> +	 */
>> +	race_mp = page_get_meta_page(page);
>> +	if (race_mp) {
>> +		kfree(mp);
>> +		mp = race_mp;
>> +		atomic_inc(&mp->ref_cnt);
>> +		res_counter_uncharge(&mem->res, 1);
>> +		goto done;
>> +	}
>
> i think you need css_put here.

That's correct. We do need css_put in this path.

Thanks,
Vaidy

> YAMAMOTO Takashi
Re: [-mm PATCH 4/7] Memory controller memory accounting
Balbir Singh wrote:
> Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is done at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache() and __delete_from_page_cache(). Swap cache is also accounted for.
>
> Each page's meta_page is protected with a bit in page flags; this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the meta_page to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache.
>
> Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
> ---
>
>  fs/exec.c                  |   1
>  include/linux/memcontrol.h |  11 +++
>  include/linux/page-flags.h |   3 +
>  mm/filemap.c               |   8 ++
>  mm/memcontrol.c            | 132 -
>  mm/memory.c                |  22 +++
>  mm/migrate.c               |   6 ++
>  mm/page_alloc.c            |   3 +
>  mm/rmap.c                  |   2
>  mm/swap_state.c            |   8 ++
>  mm/swapfile.c              |  40 +++--
>  11 files changed, 218 insertions(+), 18 deletions(-)
>
[snip]
> diff -puN mm/migrate.c~mem-control-accounting mm/migrate.c
> --- linux-2.6.22-rc6/mm/migrate.c~mem-control-accounting	2007-07-04 15:05:27.0 -0700
> +++ linux-2.6.22-rc6-balbir/mm/migrate.c	2007-07-04 15:05:27.0 -0700
> @@ -28,6 +28,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/vmalloc.h>
>  #include <linux/security.h>
> +#include <linux/memcontrol.h>
>
>  #include "internal.h"
>
> @@ -157,6 +158,11 @@ static void remove_migration_pte(struct
>  		return;
>  	}
>
> +	if (mem_container_charge(page, mm)) {

Minor correction. The above line should be

	if (mem_container_charge(new, mm)) {

to avoid a compilation error.

--Vaidy

[snip]
Re: IRQ handling difference between i386 and x86_64
Arjan van de Ven wrote:
> On Sat, 2007-06-30 at 16:55 +0200, Krzysztof Oledzki wrote:
>> Hello,
>>
>> It seems that IRQ handling is somehow different between i386 and x86_64.
>>
>> On my Dell PowerEdge 1950 it is possible to enable interrupt spreading over all CPUs. This is a single CPU, four core system (Quad-Core E5335 Xeon), so I think that interrupt migration may be useful. Unfortunately, it works only with a 32-bit kernel. Booting with x86_64 leads to a situation where all interrupts go only to the first CPU matching the smp_affinity mask.
>
> arguably that is the most efficient behavior... round robin of interrupts is the worst possible case in terms of performance
>
> are you using irqbalance ? (www.irqbalance.org)

If you have not been using irqbalance, then setting more bits in irq/smp_affinity will utilize hardware APIC routing. Sending an interrupt to more than one CPU works in flat mode only. In a 32-bit kernel the IOAPIC is configured in flat mode, while in 64-bit it is in physical flat mode. Physical flat mode supports interrupt routing to only one CPU.

32-bit: Enabling APIC mode: Flat. Using 2 I/O APICs
64-bit: Setting APIC routing to physical flat

This is the reason for the observed interrupt routing behavior. I guess future kernels will use 'physical flat' mode to avoid hardware bugs on various complex configurations and also to make it scale up to larger systems. Also, utilizing hardware routing to send interrupts to many CPUs may not provide the desired behavior; the irqbalance application will do the needful. I agree with Arjan that equal interrupt distribution need not provide the best performance. You will run into complex routing issues only when you have more NICs than the actual number of CPU cores.
--Vaidy
[RFC][PATCH 3/3] Pagecache reclaim
Pagecache controller reclaim changes

The reclaim path needs performance improvement; for now these are minor changes to include unmapped pages in our list of page_containers.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 mm/rss_container.c |    3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -243,9 +243,6 @@ void container_rss_move_lists(struct pag
 	struct rss_container *rss;
 	struct page_container *pc;
 
-	if (!page_mapped(pg))
-		return;
-
 	pc = page_container(pg);
 	if (pc == NULL)
 		return;
[RFC][PATCH 2/3] Pagecache and RSS accounting hooks
Pagecache and RSS accounting hooks
--

New calls have been added from swap_state.c and filemap.c to track pagecache and swapcache pages. All existing RSS hooks have been generalised for pagecache accounting as well. Most of these are function prototype changes.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 fs/exec.c       |    4 ++--
 mm/filemap.c    |   17 +
 mm/memory.c     |   18 +-
 mm/migrate.c    |    4 ++--
 mm/rmap.c       |   12 ++--
 mm/swap_state.c |   16 
 mm/swapfile.c   |    4 ++--
 7 files changed, 54 insertions(+), 21 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/fs/exec.c
+++ linux-2.6.22-rc2-mm1/fs/exec.c
@@ -321,7 +321,7 @@ void install_arg_page(struct vm_area_str
 	if (unlikely(anon_vma_prepare(vma)))
 		goto out;
 
-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont))
 		goto out;
 
 	flush_dcache_page(page);
@@ -343,7 +343,7 @@ void install_arg_page(struct vm_area_str
 	return;
 out_release:
-	container_rss_release(pcont);
+	container_page_release(pcont);
 out:
 	__free_page(page);
 	force_sig(SIGKILL, current);
--- linux-2.6.22-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc2-mm1/mm/filemap.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include <linux/rss_container.h>
 #include "filemap.h"
 #include "internal.h"
@@ -117,6 +118,9 @@ void __remove_from_page_cache(struct pag
 	struct address_space *mapping = page->mapping;
 
 	radix_tree_delete(&mapping->page_tree, page->index);
+	/* Uncharge before the mapping is gone */
+	if (page_container(page))
+		container_page_del(page_container(page));
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -440,6 +444,8 @@ int add_to_page_cache(struct page *page,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	struct page_container *pc;
+	struct mm_struct *mm;
 
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
@@ -453,6 +459,17 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		/* Unlock before charge, because we may reclaim this inline */
+		if (!error) {
+			if (current->mm)
+				mm = current->mm;
+			else
+				mm = &init_mm;
+			if (!container_page_prepare(page, mm, &pc))
+				container_page_add(pc);
+			else
+				BUG();
+		}
 		radix_tree_preload_end();
 	}
 	return error;
--- linux-2.6.22-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.22-rc2-mm1/mm/memory.c
@@ -1755,7 +1755,7 @@ gotten:
 		cow_user_page(new_page, old_page, address, vma);
 	}
 
-	if (container_rss_prepare(new_page, vma, &pcont))
+	if (container_page_prepare(new_page, vma->vm_mm, &pcont))
 		goto oom;
 
 	/*
@@ -1791,7 +1791,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		container_rss_release(pcont);
+		container_page_release(pcont);
 
 	if (new_page)
 		page_cache_release(new_page);
@@ -2217,7 +2217,7 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}
 
-	if (container_rss_prepare(page, vma, &pcont)) {
+	if (container_page_prepare(page, vma->vm_mm, &pcont)) {
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
@@ -2235,7 +2235,7 @@ static int do_swap_page(struct mm_struct
 
 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
-		container_rss_release(pcont);
+		container_page_release(pcont);
 		goto out_nomap;
 	}
@@ -2271,7 +2271,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	container_rss_release(pcont);
+	container_page_release(pcont);
 	pte_unmap_unlock(page_table, ptl);
 	unlock_page(page);
 	page_cache_release(page);
@@ -2302,7 +2302,7 @@ static int do_anonymous_page(struct mm_s
 	if (!page)
 		goto oom;
 
-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont))
 		goto oom_release;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2338,7 +2338,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return VM_FAULT_MINOR;
 release_container:
-	container_rss_release(pcont);
+	container_page_release(pcont);
[RFC][PATCH 1/3] Pagecache accounting
Pagecache accounting

The rss accounting hooks have been generalised to handle both anon pages and file-backed pages and charge the resource counter. A ref count has been added to the page_container structure. The ref count is used to ensure a page is added to or removed from the page_container list only once, independent of repeated calls from pagecache, swapcache and mmap to RSS.

No setup patch is required since rss_limit and rss_usage have been generalised as the resource counter for pagecache as well.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 include/linux/rss_container.h |  18 ++---
 mm/rss_container.c            | 134 --
 2 files changed, 99 insertions(+), 53 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/include/linux/rss_container.h
+++ linux-2.6.22-rc2-mm1/include/linux/rss_container.h
@@ -68,11 +68,11 @@ struct rss_container;
  *
  */
 
-int container_rss_prepare(struct page *, struct vm_area_struct *vma,
+int container_page_prepare(struct page *, struct mm_struct *mm,
 		struct page_container **);
-void container_rss_add(struct page_container *);
-void container_rss_del(struct page_container *);
-void container_rss_release(struct page_container *);
+void container_page_add(struct page_container *);
+void container_page_del(struct page_container *);
+void container_page_release(struct page_container *);
 
 void container_out_of_memory(struct rss_container *);
 
@@ -85,22 +85,22 @@ unsigned long isolate_pages_in_container
 		int order, int mode, struct zone *zone,
 		struct rss_container *, int active);
 #else
-static inline int container_rss_prepare(struct page *pg,
-		struct vm_area_struct *vma, struct page_container **pc)
+static inline int container_page_prepare(struct page *pg,
+		struct mm_struct *mm, struct page_container **pc)
 {
 	*pc = NULL; /* to make gcc happy */
 	return 0;
 }
 
-static inline void container_rss_add(struct page_container *pc)
+static inline void container_page_add(struct page_container *pc)
 {
 }
 
-static inline void container_rss_del(struct page_container *pc)
+static inline void container_page_del(struct page_container *pc)
 {
 }
 
-static inline void container_rss_release(struct page_container *pc)
+static inline void container_page_release(struct page_container *pc)
 {
 }
--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -56,6 +56,9 @@ struct rss_container {
 	 */
 
 struct page_container {
+	unsigned long ref_cnt;	/* Ref cnt to keep track of
+				 * multiple additions of same page
+				 */
 	struct page *page;
 	struct rss_container *cnt;
 	struct list_head list;	/* this is the element of (in)active_list of
@@ -93,26 +96,36 @@ void mm_free_container(struct mm_struct
 * I bet you have already read the comment in include/linux/rss_container.h :)
 */
 
-int container_rss_prepare(struct page *page, struct vm_area_struct *vma,
+int container_page_prepare(struct page *page, struct mm_struct *mm,
 		struct page_container **ppc)
 {
-	struct rss_container *rss;
-	struct page_container *pc;
-
-	rcu_read_lock();
-	rss = rcu_dereference(vma->vm_mm->rss_container);
-	css_get(&rss->css);
-	rcu_read_unlock();
-
-	pc = kmalloc(sizeof(struct page_container), GFP_KERNEL);
-	if (pc == NULL)
-		goto out_nomem;
+	struct rss_container *rss;
+	struct page_container *pc;
+	int rc = 1;
+
+	/* Page may have been added to a container earlier */
+	pc = page_container(page);
+	/* Check if this is a first-time addition or not */
+	if (!pc) {
+		rcu_read_lock();
+		rss = rcu_dereference(mm->rss_container);
+		css_get(&rss->css);
+		rcu_read_unlock();
+	} else {
+		rss = pc->cnt;
+	}
 
-	while (res_counter_charge(&rss->res, 1)) {
-		if (try_to_free_pages_in_container(rss)) {
-			atomic_inc(&rss->rss_reclaimed);
-			continue;
-		}
+	/* Charge the respective resource count the first time only */
+	while (rc && !pc) {
+		rc = res_counter_charge(&rss->res, 1);
+
+		if (!rc)
+			break;	/* All well */
+
+		if (try_to_free_pages_in_container(rss)) {
+			atomic_inc(&rss->rss_reclaimed);
+			continue;	/* Try again to charge the container */
+		}
 
 	/*
 	 * try_to_free_pages() might not give us a full picture
@@ -125,60 +138,93 @@ int container_rss_prepare(struct page *p
 		if (res_counter_check_under_limit(&rss->res))
 			continue;
 
-		container_out_of_memory(rss);
[RFC][PATCH 0/3] Containers: Integrated RSS and pagecache control v5
Containers: Integrated RSS and pagecache accounting and control v5
--

Based on the discussions at OLS yesterday, the consensus was to try an integrated pagecache controller along with the RSS controller under the same usage limit. This patch extends the RSS controller to account for and reclaim pagecache and swapcache pages. The same 'rss_limit' now applies to both RSS pages and pagecache pages. When the limit is reached, both pagecache and RSS pages are reclaimed in LRU order as per the normal system-wide reclaim policy.

This patch is based on RSS Controller v3.1 by Pavel and Balbir. This patch depends on

1. Paul Menage's Containers (V10): Generic Process Containers http://lwn.net/Articles/236032/
2. Pavel Emelianov's RSS controller based on process containers (v3.1) http://lwn.net/Articles/236817/
3. Balbir's fixes for the RSS controller as mentioned in http://lkml.org/lkml/2007/6/04/185

This is very much work in progress, and it has been posted for comments after some basic testing with kernbench. Comments, suggestions and criticisms are welcome.

--Vaidy

Features:
* Single limit for both RSS and pagecache/swapcache pages
* No new subsystem is added. The RSS controller subsystem is extended, since most of the code can be shared between pagecache control and RSS control.
* The accounted numbers include pages in the swap cache and filesystem buffer pages apart from the pagecache; basically everything under NR_FILE_PAGES is counted under rss_usage.
* The usage limit set in rss_limit applies to the sum of both RSS and pagecache pages
* Limits on pagecache can be set by echo -n 10 > rss_limit on the /container file system. The unit is pages of 4 kilobytes.
* If the pagecache+RSS utilisation exceeds the limit, the container reclaim code is invoked to recover pages from the container.

Advantages:
* Minimal code changes to the RSS controller to include pagecache pages

Limitations:
* All limitations of the RSS controller apply to this code as well
* Per-container reclaim knobs like dirty ratio and vm_swappiness may provide better control

Usage:
* Add all dependent patches before including this patch
* No new config settings apart from enabling CONFIG_RSS_CONTAINER
* Boot the new kernel
* Mount the container filesystem:
  mount -t container none /container
  cd /container
* Create a new container:
  mkdir mybox
  cd /container/mybox
* Add the current shell to the container:
  echo $$ > tasks
* In order to set a limit, echo a value in pages (4KB) to rss_limit:
  echo -n 10 > rss_limit  # This would set a 409MB limit on pagecache+rss usage
* Thrash the system from the current shell using scp/cp/dd/tar etc.
* Watch rss_usage and /proc/meminfo to verify the behavior

Tests:
* Simple dd/cat/cp tests of the pagecache limit/reclaim
* rss_limit was tested with a simple test application that mallocs a predefined amount of memory and touches it to allocate pages
* kernbench was run in a container with a 400MB memory limit

ToDo:
* Optimise the reclaim
* Per-container VM stats and knobs

Patch series:
pagecache-controller-v5-acct.patch
pagecache-controller-v5-acct-hooks.patch
pagecache-controller-v5-reclaim.patch

ChangeLog:
v5: Integrated pagecache + rss controller
* No separate pagecache_limit
* pagecache and rss pages accounted in rss_usage and governed by rss_limit
* Each page counted only once in rss_usage. Mapped or unmapped pagecache pages are counted alike in rss_usage.

v4:
* Patch remerged to Container v10 and RSS v3.1
* Bug fixes
* Tested with kernbench

v3:
* Patch merged with Containers v8 and RSS v2
[RFC][PATCH 1/3] Pagecache accounting
Pagecache Accounting
--------------------

The rss accounting hooks have been generalised to handle both anon pages and file-backed pages and charge the resource counter. A ref count has been added to the page_container structure. The ref count is used to ensure a page is added to or removed from the page_container list only once, independent of repeated calls from pagecache, swapcache and mmap to RSS.

No setup patch is required since rss_limit and rss_usage have been generalised as the resource counter for pagecache as well.

Signed-off-by: Vaidyanathan Srinivasan [EMAIL PROTECTED]

---
 include/linux/rss_container.h |  18 ++---
 mm/rss_container.c            | 134 ++++++++++++++++++++----------
 2 files changed, 99 insertions(+), 53 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/include/linux/rss_container.h
+++ linux-2.6.22-rc2-mm1/include/linux/rss_container.h
@@ -68,11 +68,11 @@ struct rss_container;
  *
  */

-int container_rss_prepare(struct page *, struct vm_area_struct *vma,
+int container_page_prepare(struct page *, struct mm_struct *mm,
 		struct page_container **);
-void container_rss_add(struct page_container *);
-void container_rss_del(struct page_container *);
-void container_rss_release(struct page_container *);
+void container_page_add(struct page_container *);
+void container_page_del(struct page_container *);
+void container_page_release(struct page_container *);

 void container_out_of_memory(struct rss_container *);

@@ -85,22 +85,22 @@ unsigned long isolate_pages_in_container
 		int order, int mode, struct zone *zone,
 		struct rss_container *, int active);
 #else
-static inline int container_rss_prepare(struct page *pg,
-		struct vm_area_struct *vma, struct page_container **pc)
+static inline int container_page_prepare(struct page *pg,
+		struct mm_struct *mm, struct page_container **pc)
 {
 	*pc = NULL; /* to make gcc happy */
 	return 0;
 }

-static inline void container_rss_add(struct page_container *pc)
+static inline void container_page_add(struct page_container *pc)
 {
 }

-static inline void container_rss_del(struct page_container *pc)
+static inline void container_page_del(struct page_container *pc)
 {
 }

-static inline void container_rss_release(struct page_container *pc)
+static inline void container_page_release(struct page_container *pc)
 {
 }

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -56,6 +56,9 @@ struct rss_container {
  */

 struct page_container {
+	unsigned long ref_cnt;	/* Ref cnt to keep track of
+				 * multiple additions of same page
+				 */
 	struct page *page;
 	struct rss_container *cnt;
 	struct list_head list;	/* this is the element of (in)active_list of
@@ -93,26 +96,36 @@ void mm_free_container(struct mm_struct
 * I bet you have already read the comment in include/linux/rss_container.h :)
 */

-int container_rss_prepare(struct page *page, struct vm_area_struct *vma,
+int container_page_prepare(struct page *page, struct mm_struct *mm,
 		struct page_container **ppc)
 {
-	struct rss_container *rss;
-	struct page_container *pc;
-
-	rcu_read_lock();
-	rss = rcu_dereference(vma->vm_mm->rss_container);
-	css_get(&rss->css);
-	rcu_read_unlock();
-
-	pc = kmalloc(sizeof(struct page_container), GFP_KERNEL);
-	if (pc == NULL)
-		goto out_nomem;
+	struct rss_container *rss;
+	struct page_container *pc;
+	int rc = 1;
+
+	/* Page may have been added to container earlier */
+	pc = page_container(page);
+	/* Check if this is a first-time addition or not */
+	if (!pc) {
+		rcu_read_lock();
+		rss = rcu_dereference(mm->rss_container);
+		css_get(&rss->css);
+		rcu_read_unlock();
+	} else {
+		rss = pc->cnt;
+	}

-	while (res_counter_charge(&rss->res, 1)) {
-		if (try_to_free_pages_in_container(rss)) {
-			atomic_inc(&rss->rss_reclaimed);
-			continue;
-		}
+	/* Charge the respective resource count the first time only */
+	while (rc && !pc) {
+		rc = res_counter_charge(&rss->res, 1);
+
+		if (!rc)
+			break;	/* All well */
+
+		if (try_to_free_pages_in_container(rss)) {
+			atomic_inc(&rss->rss_reclaimed);
+			continue;	/* Try again to charge container */
+		}

 		/*
 		 * try_to_free_pages() might not give us a full picture
@@ -125,60 +138,93 @@ int container_rss_prepare(struct page *p
 		if (res_counter_check_under_limit(&rss->res))
 			continue;

-		container_out_of_memory(rss);
-		if (test_thread_flag(TIF_MEMDIE
[RFC][PATCH 2/3] Pagecache and RSS accounting hooks
Pagecache and RSS accounting Hooks
----------------------------------

New calls have been added from swap_state.c and filemap.c to track pagecache and swapcache pages. All existing RSS hooks have been generalised for pagecache accounting as well. Most of these are function prototype changes.

Signed-off-by: Vaidyanathan Srinivasan [EMAIL PROTECTED]

---
 fs/exec.c       |  4 ++--
 mm/filemap.c    | 17 +++++++++++++++++
 mm/memory.c     | 18 +++++++++---------
 mm/migrate.c    |  4 ++--
 mm/rmap.c       | 12 ++++++------
 mm/swap_state.c | 16 ++++++++++++++++
 mm/swapfile.c   |  4 ++--
 7 files changed, 54 insertions(+), 21 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/fs/exec.c
+++ linux-2.6.22-rc2-mm1/fs/exec.c
@@ -321,7 +321,7 @@ void install_arg_page(struct vm_area_str
 	if (unlikely(anon_vma_prepare(vma)))
 		goto out;

-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont))
 		goto out;

 	flush_dcache_page(page);
@@ -343,7 +343,7 @@ void install_arg_page(struct vm_area_str
 	return;
 out_release:
-	container_rss_release(pcont);
+	container_page_release(pcont);
 out:
 	__free_page(page);
 	force_sig(SIGKILL, current);
--- linux-2.6.22-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc2-mm1/mm/filemap.c
@@ -30,6 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
+#include <linux/rss_container.h>
 #include "filemap.h"
 #include "internal.h"
@@ -117,6 +118,9 @@ void __remove_from_page_cache(struct pag
 	struct address_space *mapping = page->mapping;

 	radix_tree_delete(&mapping->page_tree, page->index);
+	/* Uncharge before the mapping is gone */
+	if (page_container(page))
+		container_page_del(page_container(page));
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -440,6 +444,8 @@ int add_to_page_cache(struct page *page,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	struct page_container *pc;
+	struct mm_struct *mm;

 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
@@ -453,6 +459,17 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		/* Unlock before charge, because we may reclaim this inline */
+		if (!error) {
+			if (current->mm)
+				mm = current->mm;
+			else
+				mm = &init_mm;
+			if (!container_page_prepare(page, mm, &pc))
+				container_page_add(pc);
+			else
+				BUG();
+		}
 		radix_tree_preload_end();
 	}
 	return error;
--- linux-2.6.22-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.22-rc2-mm1/mm/memory.c
@@ -1755,7 +1755,7 @@ gotten:
 		cow_user_page(new_page, old_page, address, vma);
 	}

-	if (container_rss_prepare(new_page, vma, &pcont))
+	if (container_page_prepare(new_page, vma->vm_mm, &pcont))
 		goto oom;

 	/*
@@ -1791,7 +1791,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		container_rss_release(pcont);
+		container_page_release(pcont);

 	if (new_page)
 		page_cache_release(new_page);
@@ -2217,7 +2217,7 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}

-	if (container_rss_prepare(page, vma, &pcont)) {
+	if (container_page_prepare(page, vma->vm_mm, &pcont)) {
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
@@ -2235,7 +2235,7 @@ static int do_swap_page(struct mm_struct
 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
-		container_rss_release(pcont);
+		container_page_release(pcont);
 		goto out_nomap;
 	}
@@ -2271,7 +2271,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	container_rss_release(pcont);
+	container_page_release(pcont);
 	pte_unmap_unlock(page_table, ptl);
 	unlock_page(page);
 	page_cache_release(page);
@@ -2302,7 +2302,7 @@ static int do_anonymous_page(struct mm_s
 	if (!page)
 		goto oom;

-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont))
 		goto oom_release;

 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2338,7 +2338,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return VM_FAULT_MINOR;
release_container
[RFC][PATCH 3/3] Pagecache reclaim
Pagecache controller reclaim changes

The reclaim path needs performance improvement. For now these are minor changes to include unmapped pages in our list of page_containers.

Signed-off-by: Vaidyanathan Srinivasan [EMAIL PROTECTED]

---
 mm/rss_container.c |    3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -243,9 +243,6 @@ void container_rss_move_lists(struct pag
 	struct rss_container *rss;
 	struct page_container *pc;

-	if (!page_mapped(pg))
-		return;
-
 	pc = page_container(pg);
 	if (pc == NULL)
 		return;
Re: [RFC] mm-controller
Paul Menage wrote:
> On 6/25/07, Paul Menage <[EMAIL PROTECTED]> wrote:
>> On 6/22/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
>>> Merging both limits will eliminate the issue, however we would need
>>> individual limits for pagecache and RSS for better control. There are
>>> use cases for pagecache_limit alone without RSS_limit like the case of
>>> database application using direct IO, backup applications and
>>> streaming applications that does not make good use of pagecache.
>>
>> If streaming applications would otherwise litter the pagecache with
>> unwanted data, then limiting their total memory footprint (with a
>> single limit) and forcing them to drop old data sooner sounds like a
>> great idea.
>
> Actually, reading what you wrote more carefully, that's sort of what
> you were already saying. But it's not clear why you wouldn't also want
> to limit the anon pages for a job, if you're already concerned that
> it's not playing nicely with the rest of the system.

Hi Paul,

Limiting the memory footprint (RSS and pagecache) for multimedia applications would work. However, streaming applications generally have a fairly constant RSS size (mapped pagecache pages + ANON), while the unmapped pagecache pages are what we want to control better. If we had a combined limit for unmapped pagecache pages and RSS, then we would have to bring in vm_swappiness-style knobs for each container to influence the per-container reclaim process so as to not hurt application performance badly. The RSS controller should be able to take care of the mapped memory footprint if needed.

In the case of a database server, moving out any of its RSS pages will hurt its performance, while we are free to shrink the unmapped pagecache pages to any smaller limit, since the database is using direct IO and does not benefit from pagecache.

With the pagecache controller, we are able to split an application's memory pages into mapped and unmapped pages.
The ability to account for and control unmapped pages in memory provides more possibilities for fine-grained resource management.

--Vaidy
Re: [RFC] mm-controller
Peter Zijlstra wrote:
> On Fri, 2007-06-22 at 22:05 +0530, Vaidyanathan Srinivasan wrote:
>
>> Merging both limits will eliminate the issue, however we would need
>> individual limits for pagecache and RSS for better control. There are
>> use cases for pagecache_limit alone without RSS_limit like the case of
>> database application using direct IO, backup applications and
>> streaming applications that does not make good use of pagecache.
>
> I'm aware that some people want this. However we rejected adding a
> pagecache limit to the kernel proper on grounds that reclaim should do a
> better job.
>
> And now we're sneaking it in the backdoor.
>
> If we're going to do this, get it in the kernel proper first.

Good point. We should probably revisit this in the context of containers, virtualization and server consolidation. The kernel takes the best decision in the context of overall system performance, but when we want the kernel to favor a certain group of applications relative to others, we hit corner cases. Streaming multimedia applications are one of the corner cases where the kernel's effort to manage pagecache does not help overall system performance.

There have been several patches suggested to provide a system-wide pagecache limit. There are some user-mode fadvise()-based techniques as well. However, solving the problem in the context of containers provides certain advantages:

* Containers provide task grouping
* Relative priority or importance can be assigned to each group using resource limits
* The memory controller under the container framework provides infrastructure for detailed accounting of memory usage
* Containers and controllers form a generalised infrastructure to create localised VM behavior for a group of tasks

I would see the introduction of a pagecache limit in containers as a safe place to add the new feature rather than a backdoor. Since this feature has a relatively small user base, it is best left as a container plugin rather than a system-wide tunable.
I am not arguing against system-wide pagecache control. We should definitely try to find solutions for pagecache control outside of containers as well.

--Vaidy
Re: [RFC] mm-controller
Peter Zijlstra wrote:
> On Thu, 2007-06-21 at 16:33 +0530, Balbir Singh wrote:
>> Peter Zijlstra wrote:

[snip]

> Not quite sure on 2, from reading the pagecache controller, I got the
> impression that enforcing both limits got you into trouble. Merging the
> limits would rid us of that issue, no?

Hi Peter,

Setting both limits in the pagecache controller (v4) will work correctly in principle. What I intended in the comment is performance issues and the wrong type of page being reclaimed. We are working on the issues and will fix them in future versions. You can set any combination of limits and the reclaim will work to keep page utilisation below the limits.

When the RSS limit is hit, ANON pages are pushed to swapcache. We would need to limit swapcache (using pagecache_limit) to force a write-out to disk. Merging both limits will eliminate the issue; however, we would need individual limits for pagecache and RSS for better control. There are use cases for pagecache_limit alone without RSS_limit, like the case of a database application using direct IO, backup applications and streaming applications that do not make good use of pagecache.

Thank you for the review and the new design proposal.

--Vaidy

[snip]
[RFC][PATCH 3/4] Pagecache and RSS accounting hooks
Pagecache and RSS accounting Hooks
----------------------------------

New calls have been added from swap_state.c and filemap.c to track pagecache and swapcache pages. All existing RSS hooks have been generalised for pagecache accounting as well. Most of these are function prototype changes.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>

---
 fs/exec.c       |  4 ++--
 mm/filemap.c    | 18 ++++++++++++++++++
 mm/memory.c     | 18 +++++++++---------
 mm/migrate.c    |  4 ++--
 mm/rmap.c       | 12 ++++++------
 mm/swap_state.c | 16 ++++++++++++++++
 mm/swapfile.c   |  4 ++--
 7 files changed, 55 insertions(+), 21 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/fs/exec.c
+++ linux-2.6.22-rc2-mm1/fs/exec.c
@@ -321,7 +321,7 @@ void install_arg_page(struct vm_area_str
 	if (unlikely(anon_vma_prepare(vma)))
 		goto out;

-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
 		goto out;

 	flush_dcache_page(page);
@@ -343,7 +343,7 @@ void install_arg_page(struct vm_area_str
 	return;
 out_release:
-	container_rss_release(pcont);
+	container_page_release(pcont, PAGE_TYPE_RSS);
 out:
 	__free_page(page);
 	force_sig(SIGKILL, current);
--- linux-2.6.22-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc2-mm1/mm/filemap.c
@@ -30,6 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
+#include <linux/rss_container.h>
 #include "filemap.h"
 #include "internal.h"
@@ -117,6 +118,9 @@ void __remove_from_page_cache(struct pag
 	struct address_space *mapping = page->mapping;

 	radix_tree_delete(&mapping->page_tree, page->index);
+	/* Uncharge before the mapping is gone */
+	if (page_container(page))
+		container_page_del(page_container(page), PAGE_TYPE_PAGECACHE);
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -440,6 +444,8 @@ int add_to_page_cache(struct page *page,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	struct page_container *pc;
+	struct mm_struct *mm;

 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
@@ -453,6 +459,18 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		/* Unlock before charge, because we may reclaim this inline */
+		if (!error) {
+			if (current->mm)
+				mm = current->mm;
+			else
+				mm = &init_mm;
+			if (!container_page_prepare(page, mm, &pc, PAGE_TYPE_PAGECACHE))
+				container_page_add(pc, PAGE_TYPE_PAGECACHE);
+			else
+				BUG();
+		}
+
 		radix_tree_preload_end();
 	}
 	return error;
--- linux-2.6.22-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.22-rc2-mm1/mm/memory.c
@@ -1755,7 +1755,7 @@ gotten:
 		cow_user_page(new_page, old_page, address, vma);
 	}

-	if (container_rss_prepare(new_page, vma, &pcont))
+	if (container_page_prepare(new_page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
 		goto oom;

 	/*
@@ -1791,7 +1791,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		container_rss_release(pcont);
+		container_page_release(pcont, PAGE_TYPE_RSS);

 	if (new_page)
 		page_cache_release(new_page);
@@ -2217,7 +2217,7 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}

-	if (container_rss_prepare(page, vma, &pcont)) {
+	if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS)) {
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
@@ -2235,7 +2235,7 @@ static int do_swap_page(struct mm_struct
 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
-		container_rss_release(pcont);
+		container_page_release(pcont, PAGE_TYPE_RSS);
 		goto out_nomap;
 	}
@@ -2271,7 +2271,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	container_rss_release(pcont);
+	container_page_release(pcont, PAGE_TYPE_RSS);
 	pte_unmap_unlock(page_table, ptl);
 	unlock_page(page);
 	page_cache_release(page);
@@ -2302,7 +2302,7 @@ static int do_anonymous_page(struct mm_s
 	if (!page)
 		goto oom;

-	if (container_rss_prepare(page, vma, &pcont))
+	if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
 		goto oom_release;

 	entry = mk_pte(page, vma->vm_page_prot)
[RFC][PATCH 4/4] Pagecache reclaim
Pagecache controller reclaim changes

The reclaim path needs performance improvement. For now these are minor changes to include unmapped pages in our list of page_containers.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>

---
 mm/rss_container.c |    3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -274,9 +274,6 @@ void container_rss_move_lists(struct pag
 	struct rss_container *rss;
 	struct page_container *pc;

-	if (!page_mapped(pg))
-		return;
-
 	pc = page_container(pg);
 	if (pc == NULL)
 		return;