Re: [PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-03 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-09-03 14:57:27]:

> From: "Gautham R. Shenoy" 
> 
> commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform. The values
> advertised by the platform are in timebase ticks. However the cpuidle
> framework requires the latency values in microseconds.
> 
> If the tb-ticks value advertised by the platform corresponds to a value
> smaller than 1us, the current conversion from tb-ticks to microseconds
> truncates the result to zero. This is incorrect, as it
> puts a CEDE state on par with the snooze state.
> 
> This patch fixes this by rounding up the result obtained while
> converting the latency value from tb-ticks to microseconds. It also
> prints a warning in case we discover an extended-cede state with a
> wakeup latency of 0. In such a case, we ensure that CEDE(0) has a
> non-zero wakeup latency.
> 
> Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)")
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
> v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by 
> Joel Stanley)
>  Also added code to ensure that CEDE(0) has a non-zero wakeup 
> latency. 
>  drivers/cpuidle/cpuidle-pseries.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index ff6d99e..a2b5c6f 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void)
>   for (i = 0; i < nr_xcede_records; i++) {
>   struct xcede_latency_record *record = >records[i];
>   u64 latency_tb = be64_to_cpu(record->latency_ticks);
> - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), 
> NSEC_PER_USEC);

This would fix the issue of rounding down to zero.
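To see the effect of the rounding in isolation, here is a minimal
user-space sketch (assuming a 512 MHz timebase and simplified stand-ins
for the kernel's tb_to_ns()/DIV_ROUND_UP_ULL(); not the driver code
itself):

    #include <stdio.h>
    #include <stdint.h>

    #define TB_HZ         512000000ULL    /* assumed timebase frequency */
    #define NSEC_PER_SEC  1000000000ULL
    #define NSEC_PER_USEC 1000ULL

    static uint64_t tb_to_ns(uint64_t tb)          /* simplified stand-in */
    {
            return tb * NSEC_PER_SEC / TB_HZ;
    }

    static uint64_t div_round_up(uint64_t n, uint64_t d)
    {
            return (n + d - 1) / d;
    }

    int main(void)
    {
            uint64_t latency_tb = 0x100;  /* 256 tb-ticks = 0.5us at 512 MHz */

            printf("truncating: %llu us\n", (unsigned long long)
                   (tb_to_ns(latency_tb) / NSEC_PER_USEC));            /* 0 us */
            printf("rounded up: %llu us\n", (unsigned long long)
                   div_round_up(tb_to_ns(latency_tb), NSEC_PER_USEC)); /* 1 us */
            return 0;
    }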

> +
> + if (latency_us == 0)
> + pr_warn("cpuidle: xcede record %d has an unrealistic 
> latency of 0us.\n", i);

+1  This should not happen.

>   if (latency_us < min_latency_us)
>   min_latency_us = latency_us;
> @@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void)
>* Perform the fix-up.
>*/
>   if (min_latency_us < dedicated_states[1].exit_latency) {
> - u64 cede0_latency = min_latency_us - 1;
> + /*
> +  * We set a minimum of 1us wakeup latency for cede0 to
> +  * distinguish it from snooze
> +  */
> + u64 cede0_latency = 1;
> 
> - if (cede0_latency <= 0)
> - cede0_latency = min_latency_us;
> + if (min_latency_us > cede0_latency)
> + cede0_latency = min_latency_us - 1;

Good checks to expect cede latency of 1us or more.
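Working through the new logic with a few example inputs (values in us,
assuming the 10us default for dedicated_states[1].exit_latency):

    min_latency_us = 2    ->  cede0_latency = 2 - 1 = 1
    min_latency_us = 1    ->  cede0_latency = 1      (floor applied, not 0)
    min_latency_us >= 10  ->  no fix-up, CEDE(0) keeps the 10us default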

>   dedicated_states[1].exit_latency = cede0_latency;
>   dedicated_states[1].target_residency = 10 * (cede0_latency);

--Vaidy



Re: [PATCH] Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"

2020-08-26 Thread Vaidyanathan Srinivasan
* Pratik Rajesh Sampat  [2020-08-26 13:59:18]:

> Cpuidle stop state implementation has minor optimizations for P10
> where hardware preserves more SPR registers compared to P9.
> The current P9 driver works for P10, although it does a few extra
> save-restores. The P9 driver can provide the required power management
> features like SMT thread folding and core level power savings
> on a P10 platform.
> 
> Until the P10 stop driver is available, revert the commit which
> allows only P9 systems to utilize cpuidle and thereby blocks all
> idle stop states for P10.
> Cpu idle states are enabled and tested on the P10 platform
> with this fix.
> 
> This reverts commit 8747bf36f312356f8a295a0c39ff092d65ce75ae.
> 
> Fixes: 8747bf36f312 ("powerpc/powernv/idle: Replace CPU feature check with 
> PVR check")
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Vaidyanathan Srinivasan 


> ---
>  @mpe: This revert would resolve a staging issue wherein the P10 stop
>  driver is not yet ready, while cpuidle stop states need not be blocked
>  on 5.9 for Power10 systems; blocking them could cause SMT-folding
>  related performance issues.
> 
>  The P10 stop driver is in the works here:
>  https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216773.html
> 
>  arch/powerpc/platforms/powernv/idle.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 77513a80cef9..345ab062b21a 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -1223,7 +1223,7 @@ static void __init pnv_probe_idle_states(void)
>   return;
>   }
>  
> - if (pvr_version_is(PVR_POWER9))
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
>   pnv_power9_idle_init();
>  
>   for (i = 0; i < nr_pnv_idle_states; i++)


This revert solves the stated problem and makes kernel v5.9 work
reasonably well on P10, with the stop states that are required for SMT
mode changes.

The complete P10 driver is in the works; it will build on this fix
and add the required platform support and optimizations.

--Vaidy



Re: [PATCH 5/5] cpuidle-pseries: Block Extended CEDE(1) which adds no additional value.

2020-07-20 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:39]:

> From: "Gautham R. Shenoy" 
> 
> The Extended CEDE state with latency-hint = 1 is only different from
> normal CEDE (with latency-hint = 0) in that a CPU in Extended CEDE(1)
> does not wakeup on timer events. Both CEDE and Extended CEDE(1) map to
> the same hardware idle state. Since we already get SMT folding from
> the normal CEDE, the Extended CEDE(1) doesn't provide any additional
> value. This patch blocks Extended CEDE(1).
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-pseries.c | 57 
> ---
>  1 file changed, 54 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index 6f893cd..be0b8b2 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -350,6 +350,43 @@ static int pseries_cpuidle_driver_init(void)
>   return 0;
>  }
> 
> +#define XCEDE1_HINT  1
> +#define ERR_NO_VALUE_ADD (-1)
> +#define ERR_NO_EE_WAKEUP (-2)
> +
> +/*
> + * Returns 0 if the Extended CEDE state with @hint is not blocked in the
> + * cpuidle framework.
> + *
> + * Returns ERR_NO_EE_WAKEUP if the Extended CEDE state is blocked due
> + * to not being responsive to external interrupts.
> + *
> + * Returns ERR_NO_VALUE_ADD if the Extended CEDE state does not provide
> + * any added value over the normal CEDE.
> + */
> +static int cpuidle_xcede_blocked(u8 hint, u64 latency_us, u8 
> responsive_to_irqs)
> +{
> +
> + /*
> +  * We will only allow extended CEDE states that are responsive
> +  * to irqs and do not require an H_PROD to be woken up.
> +  */
> + if (!responsive_to_irqs)
> + return ERR_NO_EE_WAKEUP;
> +
> + /*
> +  * We already obtain SMT folding benefits from CEDE (which is
> +  * CEDE with hint 0). Furthermore, CEDE is also responsive to
> +  * timer-events, while XCEDE1 requires an external
> +  * interrupt/H_PROD to be woken up. Hence, block XCEDE1 since
> +  * it adds no further value.
> +  */
> + if (hint == XCEDE1_HINT)
> + return ERR_NO_VALUE_ADD;
> +
> + return 0;
> +}
> +
>  static int add_pseries_idle_states(void)
>  {
>   int nr_states = 2; /* By default we have snooze, CEDE */
> @@ -365,15 +402,29 @@ static int add_pseries_idle_states(void)
>   char name[CPUIDLE_NAME_LEN];
>   unsigned int latency_hint = xcede_records[i].latency_hint;
>   u64 residency_us;
> + int rc;
> +
> + if (latency_us < min_latency_us)
> + min_latency_us = latency_us;
> +
> + rc = cpuidle_xcede_blocked(latency_hint, latency_us,
> +xcede_records[i].responsive_to_irqs);
> 
> - if (!xcede_records[i].responsive_to_irqs) {
> + if (rc) {
> + switch (rc) {
> + case ERR_NO_VALUE_ADD:
> + pr_info("cpuidle : Skipping XCEDE%d. No 
> additional value-add\n",
> + latency_hint);
> + break;
> + case ERR_NO_EE_WAKEUP:
>   pr_info("cpuidle : Skipping XCEDE%d. Not responsive to 
> IRQs\n",
>   latency_hint);
> + break;
> + }
> +
>   continue;
>   }
> 
> - if (latency_us < min_latency_us)
> - min_latency_us = latency_us;
>   snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
> 
>   /*


We need these heuristics to select/reject idle states exposed by
platform firmware to Linux primarily because not all states are really
useful to Linux on a given setup.

--Vaidy



Re: [PATCH 4/5] cpuidle-pseries : Include extended CEDE states in cpuidle framework

2020-07-20 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:38]:

> From: "Gautham R. Shenoy" 
> 
> This patch exposes to the cpuidle framework those extended CEDE states
> which are responsive to external interrupts and do not need an H_PROD.
> 
> Since, as per the PAPR, all the extended CEDE states are non-responsive
> to timers, we indicate this to the cpuidle subsystem via the
> CPUIDLE_FLAG_TIMER_STOP flag for all those extended CEDE states which
> can wake up on external interrupts.
> 
> With the patch, we are able to see the extended CEDE state with
> latency hint = 1 exposed via the cpuidle framework.
> 
>   $ cpupower idle-info
>   CPUidle driver: pseries_idle
>   CPUidle governor: menu
>   analyzing CPU 0:
> 
>   Number of idle states: 3
>   Available idle states: snooze CEDE XCEDE1
>   snooze:
>   Flags/Description: snooze
>   Latency: 0
>   Usage: 33429446
>   Duration: 27006062
>   CEDE:
>   Flags/Description: CEDE
>   Latency: 1
>   Usage: 10272
>   Duration: 110786770
>   XCEDE1:
>   Flags/Description: XCEDE1
>   Latency: 12
>   Usage: 26445
>   Duration: 1436433815
> 
> Benchmark results:
> TLDR: Overall, we do not see any additional benefit from having XCEDE1 over
> CEDE.
> 
> ebizzy :
> 2 threads bound to a big-core. With this patch, we see a 3.39%
> regression compared to with only CEDE0 latency fixup.
> x With only CEDE0 latency fixup
> * With CEDE0 latency fixup + CEDE1
> N    Min       Max       Median     Avg        Stddev
> x  10   2893813   5834474   5832448 5327281.3 1055941.4
> *  10   2907329   5834923   5831398 5146614.6 1193874.8
> 
> context_switch2:
> With the context_switch2 there are no observable regressions in the
> results.
> 
> context_switch2 CPU0 CPU1 (Same Big-core, different small-cores).
> No difference with and without patch.
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  343644   348778   345444    345584.02  1035.1658
> * 500  344310   347646   345776    345877.22  802.19501
> 
> context_switch2 CPU0 CPU8 (different big-cores). Minor 0.05% improvement
> with patch
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  287562   288756   288162    288134.76  262.24328
> * 500  287874   288960   288306    288274.66  187.57034
> 
> schbench:
> No regressions observed with schbench
> 
> Without Patch:
> Latency percentiles (usec)
>   50.0th: 29
>   75.0th: 40
>   90.0th: 50
>   95.0th: 61
>   *99.0th: 13648
>   99.5th: 14768
>   99.9th: 15664
>   min=0, max=29812
> 
> With Patch:
> Latency percentiles (usec)
>   50.0th: 30
>   75.0th: 40
>   90.0th: 51
>   95.0th: 59
>   *99.0th: 13616
>   99.5th: 14512
>   99.9th: 15696
>   min=0, max=15996
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-pseries.c | 50 
> +++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index 502f906..6f893cd 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -362,9 +362,59 @@ static int add_pseries_idle_states(void)
>   for (i = 0; i < nr_xcede_records; i++) {
>   u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
>   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> + char name[CPUIDLE_NAME_LEN];
> + unsigned int latency_hint = xcede_records[i].latency_hint;
> + u64 residency_us;
> +
> + if (!xcede_records[i].responsive_to_irqs) {
> + pr_info("cpuidle : Skipping XCEDE%d. Not responsive to 
> IRQs\n",
> + latency_hint);
> + continue;
> + }
> 
>   if (latency_us < min_latency_us)
>   min_latency_us = latency_us;
> + snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
> +
> + /*
> +  * As per section 14.14.1 of PAPR version 2.8.1, calling
> +  * H_CEDE with the value of the cede
> +  * latency specifier set greater than zero allows the
> +  * processor timer facility to b

Re: [PATCH 3/5] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-20 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:37]:

> From: "Gautham R. Shenoy" 
> 
> We are currently assuming that CEDE(0) has exit latency 10us, since
> there is no way for us to query from the platform.  However, if the
> wakeup latency of an Extended CEDE state is smaller than 10us, then we
> can be sure that the exit latency of CEDE(0) cannot be more than that.
> 
> In this patch, we fix the exit latency of CEDE(0) if we discover an
> Extended CEDE state with wakeup latency smaller than 10us. The new
> value is 1us less than the smallest wakeup latency among the
> Extended CEDE states.
> 
> Benchmark results:
> 
> ebizzy:
> 2 ebizzy threads bound to the same big-core. 25% improvement in the
> avg records/s with patch.
> x without_patch
> * with_patch
> N   Min   MaxMedian   AvgStddev
> x  10   2491089   5834307   5398375   4244335 1596244.9
> *  10   2893813   5834474   5832448 5327281.3 1055941.4
> 
> context_switch2 :
> There is no major regression observed with this patch as seen from the
> context_switch2 benchmark.
> 
> context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
> small cores). We observe a minor 0.14% regression in the number of
> context-switches (higher is better).
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  348872   362236   354712    354745.69  2711.827
> * 500  349422   361452   353942    354215.4   2576.9258
> 
> context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
> improvement in the number of context-switches (higher is better).
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  287956   294940   288896    288977.23  646.59295
> * 500  288300   294646   289582    290064.76  1161.9992
> 
> schbench:
> No major difference could be seen until the 99.9th percentile.
> 
> Without-patch
> Latency percentiles (usec)
>   50.0th: 29
>   75.0th: 39
>   90.0th: 49
>   95.0th: 59
>   *99.0th: 13104
>   99.5th: 14672
>   99.9th: 15824
>   min=0, max=17993
> 
> With-patch:
> Latency percentiles (usec)
>   50.0th: 29
>   75.0th: 40
>   90.0th: 50
>   95.0th: 61
>   *99.0th: 13648
>   99.5th: 14768
>   99.9th: 15664
>   min=0, max=29812
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 


> ---
>  drivers/cpuidle/cpuidle-pseries.c | 34 --
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index c13549b..502f906 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -353,12 +353,42 @@ static int pseries_cpuidle_driver_init(void)
>  static int add_pseries_idle_states(void)
>  {
>   int nr_states = 2; /* By default we have snooze, CEDE */
> + int i;
> + u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency 
> */
> 
>   if (parse_cede_parameters())
>   return nr_states;
> 
> - pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
> - nr_xcede_records);
> + for (i = 0; i < nr_xcede_records; i++) {
> + u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
> + u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> +
> + if (latency_us < min_latency_us)
> + min_latency_us = latency_us;
> + }
> +
> + /*
> +  * We are currently assuming that CEDE(0) has exit latency
> +  * 10us, since there is no way for us to query from the
> +  * platform.
> +  *
> +  * However, if the wakeup latency of an Extended CEDE state is
> +  * smaller than 10us, then we can be sure that CEDE(0)
> +  * requires no more than that.
> +  *
> +  * Perform the fix-up.
> +  */
> + if (min_latency_us < dedicated_states[1].exit_latency) {
> + u64 cede0_latency = min_latency_us - 1;
> +
> + if (cede0_latency <= 0)
> + cede0_latency = min_latency_us;
> +
> + dedicated_states[1].exit_latency = cede0_latency;
> + dedicated_states[1].target_residency = 10 * (cede0_latency);
> + pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
> + cede0_latency);
> + }


As per the PAPR spec, the CEDE hints are in increasing order of exit
latency; a given state's exit latency cannot exceed that of the one
following it.  The quirk is that the first state (hint 0) is implicit,
hence we have to use the above logic to extract its characteristics.
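As a worked example, with the two records shown in the patch 2/5 dmesg
and assuming a 512 MHz timebase:

    hint 1: 0x400 tb-ticks    = 1024 ticks     ~ 2 us
    hint 2: 0x3e8000 tb-ticks = 4096000 ticks  ~ 8000 us (8 ms)

The smallest advertised wakeup latency (2 us) is below the assumed 10 us
for CEDE(0), so CEDE(0) gets fixed up to 2 - 1 = 1 us.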

--Vaidy





Re: [PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-20 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:36]:

> From: "Gautham R. Shenoy" 
> 
> Currently we use CEDE with latency-hint 0 as the only other idle state
> on a dedicated LPAR apart from the polling "snooze" state.
> 
> The platform might support additional extended CEDE idle states, which
> can be discovered through the "ibm,get-system-parameter" rtas-call
> made with CEDE_LATENCY_TOKEN.
> 
> This patch adds a function to obtain information about the extended
> CEDE idle states from the platform and parse the contents to populate
> an array of extended CEDE states. These idle states thus discovered
> will be added to the cpuidle framework in the next patch.
> 
> dmesg on a POWER9 LPAR, demonstrating the output of parsing the
> extended CEDE latency parameters.
> 
> [5.913180] xcede : xcede_record_size = 10
> [5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, 
> Wake-on-irq = 1
> [5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, 
> Wake-on-irq = 0
> [5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

>
> ---
>  drivers/cpuidle/cpuidle-pseries.c | 129 
> +-
>  1 file changed, 127 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index 39d4bb6..c13549b 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  struct cpuidle_driver pseries_idle_driver = {
>   .name = "pseries_idle",
> @@ -105,9 +106,120 @@ static void check_and_cede_processor(void)
>   }
>  }
> 
> -#define NR_CEDE_STATES   1  /* CEDE with latency-hint 0 */
> +struct xcede_latency_records {
> + u8  latency_hint;
> + u64 wakeup_latency_tb_ticks;
> + u8  responsive_to_irqs;
> +};
> +
> +/*
> + * XCEDE : Extended CEDE states discovered through the
> + * "ibm,get-systems-parameter" rtas-call with the token
> + * CEDE_LATENCY_TOKEN
> + */
> +#define MAX_XCEDE_STATES 4
> +#define  XCEDE_LATENCY_RECORD_SIZE   10
> +#define XCEDE_LATENCY_PARAM_MAX_LENGTH   (2 + 2 + \
> + (MAX_XCEDE_STATES * 
> XCEDE_LATENCY_RECORD_SIZE))
> +
> +#define CEDE_LATENCY_TOKEN   45
> +
> +#define NR_CEDE_STATES   (MAX_XCEDE_STATES + 1) /* CEDE with 
> latency-hint 0 */
>  #define NR_DEDICATED_STATES  (NR_CEDE_STATES + 1) /* Includes snooze */
> 
> +struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
> +unsigned int nr_xcede_records;
> +char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
> +
> +static int parse_cede_parameters(void)
> +{
> + int ret = -1, i;
> + u16 payload_length;
> + u8 xcede_record_size;
> + u32 total_xcede_records_size;
> + char *payload;
> +
> + memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> + ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
> + NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
> + XCEDE_LATENCY_PARAM_MAX_LENGTH);
> +
> + if (ret) {
> + pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
> + return ret;
> + }
> +
> + payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
> + payload = &xcede_parameters[2];
> +
> + /*
> +  * If the platform supports the cede latency settings
> +  * information system parameter it must provide the following
> +  * information in the NULL terminated parameter string:
> +  *
> +  * a. The first byte is the length "N" of each cede
> +  *latency setting record minus one (zero indicates a length
> +  *of 1 byte).
> +  *
> +  * b. For each supported cede latency setting a cede latency
> +  *setting record consisting of the first "N" bytes as per
> +  *the following table.
> +  *
> +  *  -
> +  *  | Field   | Field  |
> +  *  | Name| Length |
> +  *  -
> +  *  | Cede Latency| 1 Byte |
> +  *  | Specifier Value ||
> +  *  -
> +  *  | Maximum wakeup  ||
> +  *  | latency in  | 8 Bytes|
> +  *  | tb-ticks||
> +  *  -
>
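Based on the struct and the record size defined in the patch, each
record is 10 bytes: a 1-byte latency hint, an 8-byte big-endian wakeup
latency in tb-ticks, and a 1-byte wake-on-interrupt field (the last
field is inferred from the struct, since the quoted table stops short).
A user-space sketch of decoding one record, under that assumption:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>
    #include <endian.h>

    struct xcede_record {
            uint8_t  latency_hint;
            uint64_t wakeup_latency_tb_ticks;
            uint8_t  responsive_to_irqs;
    };

    /* buf points at one 10-byte record inside the payload */
    static void parse_record(const uint8_t *buf, struct xcede_record *rec)
    {
            uint64_t be_latency;

            rec->latency_hint = buf[0];
            memcpy(&be_latency, &buf[1], sizeof(be_latency));
            rec->wakeup_latency_tb_ticks = be64toh(be_latency);
            rec->responsive_to_irqs = buf[9];
    }

    int main(void)
    {
            /* hint 1, latency 0x400 tb-ticks, wakes on irq */
            uint8_t raw[10] = { 0x01, 0, 0, 0, 0, 0, 0, 0x04, 0x00, 0x01 };
            struct xcede_record rec;

            parse_record(raw, &rec);
            printf("hint %u, latency 0x%llx tb-ticks, wake-on-irq %u\n",
                   rec.latency_hint,
                   (unsigned long long)rec.wakeup_latency_tb_ticks,
                   rec.responsive_to_irqs);
            return 0;
    }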

Re: [PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-19 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:35]:

> From: "Gautham R. Shenoy" 
> 
> As per the PAPR, each H_CEDE call is associated with a latency-hint to
> be passed in the VPA field "cede_latency_hint". The CEDE state that
> we have been implicitly entering so far is CEDE with latency-hint = 0.
> 
> This patch explicitly sets the latency hint corresponding to the CEDE
> state that we are currently entering. While at it, we save the
> previous hint, to be restored once we wakeup from CEDE. This will be
> required in the future when we expose extended-cede states through the
> cpuidle framework, where each of them will have a different
> cede-latency hint.
> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 


> ---
>  drivers/cpuidle/cpuidle-pseries.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index 4a37252..39d4bb6 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -105,19 +105,27 @@ static void check_and_cede_processor(void)
>   }
>  }
> 
> +#define NR_CEDE_STATES   1  /* CEDE with latency-hint 0 */
> +#define NR_DEDICATED_STATES  (NR_CEDE_STATES + 1) /* Includes snooze */
> +
> +u8 cede_latency_hint[NR_DEDICATED_STATES];
>  static int dedicated_cede_loop(struct cpuidle_device *dev,
>   struct cpuidle_driver *drv,
>   int index)
>  {
> + u8 old_latency_hint;
> 
>   pseries_idle_prolog();
>   get_lppaca()->donate_dedicated_cpu = 1;
> + old_latency_hint = get_lppaca()->cede_latency_hint;
> + get_lppaca()->cede_latency_hint = cede_latency_hint[index];
> 
>   HMT_medium();
>   check_and_cede_processor();
> 
>   local_irq_disable();
>   get_lppaca()->donate_dedicated_cpu = 0;
> + get_lppaca()->cede_latency_hint = old_latency_hint;
> 
>   pseries_idle_epilog();
> 
> @@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
>  /*
>   * States for dedicated partition case.
>   */
> -static struct cpuidle_state dedicated_states[] = {
> +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
>   { /* Snooze */
>   .name = "snooze",
>   .desc = "snooze",


Saving and restoring the current cede hint value helps in maintaining
compatibility with other parts of the kernel.  Over the long term we can
make the cpuidle driver deterministically set the CEDE hint at each
invocation of the H_CEDE call, so that we do not have to do multiple
redundant save-restores.

This is a reasonable start to cleaning up the cpuidle subsystem on PAPR
guests.
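For illustration, the way the per-state hint table is meant to be used
once XCEDE states are added (a sketch based on the later patches in
this series; treat the exact statements as illustrative):

    /* index 0 = snooze (no H_CEDE), index 1 = CEDE with hint 0 */
    u8 cede_latency_hint[NR_DEDICATED_STATES];

    /* when an XCEDE state is accepted into dedicated_states[nr_states]: */
    cede_latency_hint[nr_states] = latency_hint;    /* e.g. 1 for XCEDE1 */

    /*
     * dedicated_cede_loop(dev, drv, index) then does
     *     get_lppaca()->cede_latency_hint = cede_latency_hint[index];
     * before ceding, and restores the old value on wakeup.
     */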

--Vaidy



Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-19 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2020-07-07 16:41:34]:

> From: "Gautham R. Shenoy" 
> 
> Hi,
> 
> On pseries Dedicated Linux LPARs, apart from the polling snooze idle
> state, we currently have the CEDE idle state which cedes the CPU to
> the hypervisor with latency-hint = 0.
> 
> However, the PowerVM hypervisor supports additional extended CEDE
> states, which can be queried through the "ibm,get-systems-parameter"
> rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
> extended CEDE states to appropriate platform idle-states in order to
> provide energy-savings as well as shifting power to the active
> units. On existing pseries LPARs today we have extended CEDE with
> latency-hints {1,2} supported.
> 
> In Patches 1-3 of this patchset, we add the code to parse the CEDE
> latency records provided by the hypervisor. We use this information to
> determine the wakeup latency of the regular CEDE (which we have been
> so far hardcoding to 10us, while experimentally it is much smaller, ~
> 1us), by looking at the wakeup latency provided by the hypervisor for
> Extended CEDE states. Since the platform currently advertises Extended
> CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
> latency of the regular CEDE is no more than this.
> 
> Patch 4 (currently marked as RFC) exposes the extended CEDE states
> parsed above to the cpuidle framework, provided that they can wake up
> on an interrupt. On current platforms only Extended CEDE 1 fits the
> bill, but this is going to change in future platforms where even
> Extended CEDE 2 may be responsive to external interrupts.
> 
> Patch 5 (currently marked as RFC) filters out Extended CEDE 1 since
> it offers no added advantage over the normal CEDE.
> 
> With Patches 1-3, we see an improvement in the single-threaded
> performance on ebizzy.
> 
> 2 ebizzy threads bound to the same big-core. 25% improvement in the
> avg records/s (higher the better) with patches 1-3.
> x without_patches
> * with_patches
> N    Min       Max       Median     Avg        Stddev
> x  10   2491089   5834307   5398375   4244335 1596244.9
> *  10   2893813   5834474   5832448 5327281.3 1055941.4
> 
> We do not observe any major regression in either the context_switch2
> benchmark or the schbench benchmark
> 
> context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
> small cores). We observe a minor 0.14% regression in the number of
> context-switches (higher is better).
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  348872   362236   354712    354745.69  2711.827
> * 500  349422   361452   353942    354215.4   2576.9258
> 
> context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
> improvement in the number of context-switches (higher is better).
> x without_patch
> * with_patch
> N    Min      Max      Median    Avg        Stddev
> x 500  287956   294940   288896    288977.23  646.59295
> * 500  288300   294646   289582    290064.76  1161.9992
> 
> schbench:
> No major difference could be seen until the 99.9th percentile.
> 
> Without-patch
> Latency percentiles (usec)
>   50.0th: 29
>   75.0th: 39
>   90.0th: 49
>   95.0th: 59
>   *99.0th: 13104
>   99.5th: 14672
>   99.9th: 15824
>   min=0, max=17993
> 
> With-patch:
> Latency percentiles (usec)
>   50.0th: 29
>   75.0th: 40
>   90.0th: 50
>   95.0th: 61
>   *99.0th: 13648
>   99.5th: 14768
>   99.9th: 15664
>   min=0, max=29812

This patch series mainly cleans up the CEDE latency discovery and
prepares to add different cpuidle states in a virtualised environment.
This helps in improving SMT folding speeds, and also power savings and
power shifting with newer platform firmware.

The current benefit is primarily from faster SMT folding and the
resulting single-thread performance, achieved by updating the
platform-firmware-provided heuristics in the cpuidle states.

--Vaidy





Re: [Skiboot] [PATCH v8 3/3] Self save API integration

2020-04-29 Thread Vaidyanathan Srinivasan
* Pratik Rajesh Sampat  [2020-04-23 16:24:38]:

> The commit makes the self save API available outside the firmware by defining
> an OPAL wrapper.
> This wrapper has a similar interface to that of self restore and expects the
> cpu pir and SPR number (minus the value of that SPR) to be passed in its
> parameters, and returns OPAL_SUCCESS on success. It adds a device-tree
> node signifying support for self-save after verifying the stop API
> version compatibility.
> 
> The commit also documents both the self-save and the self-restore API
> calls along with their working and usage.
> 
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  doc/opal-api/opal-slw-self-save-reg-181.rst |  51 ++
>  doc/opal-api/opal-slw-set-reg-100.rst   |   5 +
>  doc/power-management.rst|  48 +
>  hw/slw.c| 106 
>  include/opal-api.h  |   3 +-
>  include/p9_stop_api.H   |  18 
>  include/skiboot.h   |   3 +
>  7 files changed, 233 insertions(+), 1 deletion(-)
>  create mode 100644 doc/opal-api/opal-slw-self-save-reg-181.rst

This patch enables an OPAL interface to call the stop-api and set
self-save.  It basically completes the infrastructure required to use
the new self-save function provided by the microcode.
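For reference, a sketch of what a caller of the new wrapper might look
like; the call name is inferred from the documentation file added above
(opal-slw-self-save-reg-181.rst) and the parameters (cpu pir and SPR
number, no value) from the commit message, so treat both as assumptions
rather than the definitive interface:

    int64_t rc;

    rc = opal_slw_self_save_reg(pir, SPRN_PSSCR);
    if (rc != OPAL_SUCCESS)
            pr_warn("self-save of PSSCR failed for pir 0x%llx, rc %lld\n",
                    pir, rc);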

--Vaidy


Re: [PATCH v8 2/3] API to verify the STOP API and image compatibility

2020-04-29 Thread Vaidyanathan Srinivasan
* Pratik Rajesh Sampat  [2020-04-23 16:24:37]:

> From: Prem Shanker Jha 
> 
> Commit defines a new API primarily intended for OPAL to determine
> cpu register save API's compatibility with HOMER layout and
> self save restore. It can help OPAL determine if the version of the
> API integrated with OPAL is different from hostboot's.
> 
> Change-Id: Ic0de45a336cfb8b6b6096a10ac1cd3ffbaa44fc0
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77612
> Tested-by: FSP CI Jenkins 
> Tested-by: Jenkins Server 
> Tested-by: Hostboot CI 
> Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA 
> Reviewed-by: Gregory S Still 
> Reviewed-by: Jennifer A Stofer 
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77614
> Tested-by: Jenkins OP Build CI 
> Tested-by: Jenkins OP HW 
> Reviewed-by: Daniel M Crowell 
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  include/p9_stop_api.H| 25 ++
>  libpore/p9_cpu_reg_restore_instruction.H |  7 ++-
>  libpore/p9_hcd_memmap_base.H |  7 +++
>  libpore/p9_stop_api.C| 58 +++-
>  libpore/p9_stop_api.H| 26 ++-
>  libpore/p9_stop_util.H   | 20 
>  6 files changed, 130 insertions(+), 13 deletions(-)

This stop-api code will help OPAL check and use self-save functions so
that different versions of OPAL can be loaded and run with different
versions of the low-level firmware stack.

--Vaidy


Re: [Skiboot] [PATCH v8 1/3] Self Save: Introducing Support for SPR Self Save

2020-04-29 Thread Vaidyanathan Srinivasan
* Pratik Rajesh Sampat  [2020-04-23 16:24:36]:

> From: Prem Shanker Jha 
> 
> The commit is a merger of commits that makes the following changes:
> 1. Commit fixes some issues with code found during integration test
>   -  replacement of addi with xor instruction during self save API.
>   -  fixing instruction generation for MFMSR during self save
>   -  data struct updates in STOP API
>   -  error RC updates for hcode image build
>   -  HOMER parser updates.
>   -  removed self save support for URMOR and HRMOR
>   -  code changes for compilation with OPAL
>   -  populating CME Image header with unsecure HOMER address.
> 
> Key_Cronus_Test=PM_REGRESS
> 
> Change-Id: I7cedcc466267c4245255d8d75c01ed695e316720
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66580
> Tested-by: FSP CI Jenkins 
> Tested-by: HWSV CI 
> Tested-by: PPE CI 
> Tested-by: Jenkins Server 
> Tested-by: Cronus HW CI 
> Tested-by: Hostboot CI 
> Reviewed-by: Gregory S. Still 
> Reviewed-by: RAHUL BATRA 
> Reviewed-by: Jennifer A. Stofer 
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66587
> Reviewed-by: Christian R. Geddes 
> Signed-off-by: Prem Shanker Jha 
> Signed-off-by: Akshay Adiga 
> Signed-off-by: Pratik Rajesh Sampat 

Reviewed-by: Vaidyanathan Srinivasan 

> 2. The commit also incorporates changes that make the STOP API project
> agnostic; these include defining wrapper functions which call the legacy
> API. It also adds duplicate enum members which start with prefix PROC
> instead of P9.
> 
> Key_Cronus_Test=PM_REGRESS
> 
> Change-Id: If87970f3e8cf9b507f33eb1be249e03eb3836a5e
> RTC: 201128
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71307
> Tested-by: FSP CI Jenkins 
> Tested-by: Jenkins Server 
> Tested-by: Hostboot CI 
> Tested-by: Cronus HW CI 
> Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA 
> Reviewed-by: Gregory S. Still 
> Reviewed-by: Jennifer A Stofer 
> Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71314
> Tested-by: Jenkins OP Build CI 
> Tested-by: Jenkins OP HW 
> Reviewed-by: Daniel M. Crowell 
> Signed-off-by: Prem Shanker Jha 
> Signed-off-by: Pratik Rajesh Sampat 
> ---
>  include/p9_stop_api.H|  79 +-
>  libpore/p9_cpu_reg_restore_instruction.H |   4 +
>  libpore/p9_stop_api.C| 954 +--
>  libpore/p9_stop_api.H| 115 ++-
>  libpore/p9_stop_data_struct.H|   4 +-
>  libpore/p9_stop_util.H   |   7 +-
>  6 files changed, 721 insertions(+), 442 deletions(-)


These code changes are from the hcode component, copied as-is into the
OPAL project for integration and use of the stop-api.

The hcode component, in cooperation with hostboot, is loaded in memory
before OPAL is loaded. This code will allow runtime changes and usage
of the various power management save/restore functions.

This patch specifically enables the self-save feature provided by the
microcode.

--Vaidy



Re: [PATCH] pseries/energy: Use OF accessor functions to read ibm,drc-indexes

2019-03-08 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2019-03-08 21:03:24]:

> From: "Gautham R. Shenoy" 
> 
> In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
> we currently use of_read_property() to obtain the pointer to the array
> corresponding to the property "ibm,drc-indexes". The elements of this
> array are of type __be32, but are accessed without any conversion to
> the OS-endianness, which is buggy on a Little Endian OS.
> 
> Fix this by using of_property_read_u32_index() accessor function to
> safely read the elements of the array.
> 
> Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU 
> indexes")
> Cc:  #v4.16+
> Reported-by: Pavithra R. Prakash 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  arch/powerpc/platforms/pseries/pseries_energy.c | 27 
> -
>  1 file changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
> b/arch/powerpc/platforms/pseries/pseries_energy.c
> index 6ed2212..1c4d1ba 100644
> --- a/arch/powerpc/platforms/pseries/pseries_energy.c
> +++ b/arch/powerpc/platforms/pseries/pseries_energy.c
> @@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu)
> 
>   ret = drc.drc_index_start + (thread_index * drc.sequential_inc);
>   } else {
> - const __be32 *indexes;
> -
> - indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> - if (indexes == NULL)
> - goto err_of_node_put;
> + u32 nr_drc_indexes, thread_drc_index;
> 
>   /*
> -  * The first element indexes[0] is the number of drc_indexes
> -  * returned in the list.  Hence thread_index+1 will get the
> -  * drc_index corresponding to core number thread_index.
> +  * The first element of ibm,drc-indexes array is the
> +  * number of drc_indexes returned in the list.  Hence
> +  * thread_index+1 will get the drc_index corresponding
> +  * to core number thread_index.
>*/
> - ret = indexes[thread_index + 1];
> + rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
> + 0, &nr_drc_indexes);
> + if (rc)
> + goto err_of_node_put;
> +
> + WARN_ON(thread_index > nr_drc_indexes);
> + rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
> + thread_index + 1,
> + &thread_drc_index);
> + if (rc)
> + goto err_of_node_put;
> +
> + ret = thread_drc_index;

Oops!  Good bugfix.  We should use device tree accessors like this in
all places for correct and safe code.
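For reference, the difference in a nutshell (property name from the
patch above; error handling trimmed, so this is only a sketch):

    /* buggy on LE: raw pointer, elements are __be32 */
    const __be32 *indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
    u32 drc = be32_to_cpu(indexes[thread_index + 1]); /* conversion easy to forget */

    /* endian-safe accessor does the conversion internally */
    u32 drc_index;
    int rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
                                        thread_index + 1, &drc_index);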

Thanks!

--Vaidy



Re: [PATCH] cpuidle:powernv: Add the CPUIDLE_FLAG_POLLING for snooze

2018-07-03 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2018-07-03 10:54:16]:

> From: "Gautham R. Shenoy" 
> 
> In the situations where snooze is the only cpuidle state due to
> firmware not exposing any platform idle states, the idle CPUs will
> remain in snooze for a long time with interrupts disabled causing the
> Hard-lockup detector to complain.

snooze_loop() will spin at SMT low priority with interrupts enabled. We
have local_irq_enable() before we get into the snooze loop.
Since this is a polling state, we should wake up without an interrupt,
and hence we set TIF_POLLING_NRFLAG as well.
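Roughly, the snooze pattern under discussion looks like this
(simplified sketch, not the exact driver code):

    static int snooze_loop(struct cpuidle_device *dev,
                           struct cpuidle_driver *drv, int index)
    {
            set_thread_flag(TIF_POLLING_NRFLAG);
            local_irq_enable();     /* interrupts stay on while polling */

            HMT_very_low();         /* drop SMT priority */
            while (!need_resched())
                    barrier();      /* spin until the scheduler wants us */

            HMT_medium();
            clear_thread_flag(TIF_POLLING_NRFLAG);
            smp_mb();
            return index;
    }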

 
> watchdog: CPU 51 detected hard LOCKUP on other CPUs 59
> watchdog: CPU 51 TB:535296107736, last SMP heartbeat TB:527472229239 (15281ms 
> ago)
> watchdog: CPU 59 Hard LOCKUP
> watchdog: CPU 59 TB:535296252849, last heartbeat TB:526554725466 (17073ms ago)

Hmm, not sure why the watchdog would complain; maybe something more is
going on.
 
> Fix this by adding CPUIDLE_FLAG_POLLING flag to the state, so that the
> cpuidle governor will do the right thing, such as not stopping the
> tick if it is going to put the idle cpu to snooze.
> 
> Reported-by: Akshay Adiga 
> Cc: Nicholas Piggin 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index d29e4f0..b73041b 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -156,6 +156,7 @@ static int stop_loop(struct cpuidle_device *dev,
>   { /* Snooze */
>   .name = "snooze",
>   .desc = "snooze",
> + .flags = CPUIDLE_FLAG_POLLING,
>   .exit_latency = 0,
>   .target_residency = 0,
>   .enter = snooze_loop },

Adding the CPUIDLE_FLAG_POLLING is good and enables more optimization.
But the claim that we spin with interrupts disabled does not seem
right.

--Vaidy



Re: [PATCH V3] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt

2018-04-25 Thread Vaidyanathan Srinivasan
* Shilpa Bhat <shilpa.b...@linux.vnet.ibm.com> [2018-04-25 16:29:31]:

> gpstate_timer_handler() uses synchronous smp_call to set the pstate
> on the requested core. This causes the below hard lockup:
> 
> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 
> (unreliable)
> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
> -- interrupt: 901 at doorbell_global_ipi+0x34/0x50
> LR = arch_send_call_function_ipi_mask+0x120/0x130
> [c03fe566ba50] [c004876c]
> arch_send_call_function_ipi_mask+0x4c/0x130
> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
> [c03fe566be30] [c000b184] system_call+0x58/0x6c
> 
> One way to avoid this is removing the smp-call. We can ensure that the timer
> always runs on one of the policy-cpus. If the timer gets migrated to a
> cpu outside the policy then re-queue it back on the policy->cpus. This way
> we can get rid of the smp-call which was being used to set the pstate
> on the policy->cpus.
> 
> Fixes: 7bc54b652f13 (timers, cpufreq/powernv: Initialize the gpstate timer as 
> pinned)
> Cc: <sta...@vger.kernel.org>[4.8+]
> Reported-by: Nicholas Piggin <npig...@gmail.com>
> Reported-by: Pridhiviraj Paidipeddi <ppaid...@linux.vnet.ibm.com>
> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
> Changes from V2:
> - Remove the check for active policy while requeing the migrated timer
> Changes from V1:
> - Remove smp_call in the pstate handler.
> 
>  drivers/cpufreq/powernv-cpufreq.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 71f8682..e368e1f 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -679,6 +679,16 @@ void gpstate_timer_handler(struct timer_list *t)
> 
>   if (!spin_trylock(&gpstates->gpstate_lock))
>   return;
> + /*
> +  * If the timer has migrated to the different cpu then bring
> +  * it back to one of the policy->cpus
> +  */
> + if (!cpumask_test_cpu(raw_smp_processor_id(), policy->cpus)) {
> + gpstates->timer.expires = jiffies + msecs_to_jiffies(1);
> + add_timer_on(&gpstates->timer, cpumask_first(policy->cpus));
> + spin_unlock(&gpstates->gpstate_lock);
> + return;
> + }
> 
>   /*
>* If PMCR was last updated was using fast_swtich then
> @@ -718,10 +728,8 @@ void gpstate_timer_handler(struct timer_list *t)
>   if (gpstate_idx != gpstates->last_lpstate_idx)
>   queue_gpstate_timer(gpstates);
> 
> + set_pstate(_data);
>   spin_unlock(&gpstates->gpstate_lock);
> -
> - /* Timer may get migrated to a different cpu on cpu hot unplug */
> - smp_call_function_any(policy->cpus, set_pstate, _data, 1);
>  }

Fix looks good. 

Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>




Re: [RESEND][PATCH] cpuidle/powernv : Restore different PSSCR for idle and hotplug

2018-03-06 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt  [2018-03-01 08:40:22]:

> On Thu, 2018-03-01 at 01:03 +0530, Akshay Adiga wrote:
> > commit 1e1601b38e6e ("powerpc/powernv/idle: Restore SPRs for deep idle
> > states via stop API.") uses stop-api provided by the firmware to restore
> > PSSCR. PSSCR restore is required for handling special wakeup. When special
> > wakeup is completed, the core enters stop state based on restored PSSCR.
> > 
> > Currently PSSCR is restored to deepest available stop state, causing
> > a idle cpu to enter deeper stop state on a special wakeup, which causes
> > the cpu to hang on wakeup.
> > 
> > A "sensors" command which reads temperature (through DTS sensors) on idle
> > cpu can trigger special wakeup.
> > 
> > Failed Scenario :
> > Request restore of PSSCR with RL = 11
> > cpu enters idle state (stop5)
> >   user triggers "sensors" command
> >Assert special wakeup on cpu
> >  Restores PSSCR with RL = 11  < Done by firmware
> >   Read DTS sensor
> >Deassert special wakeup
> >   cpu enters idle state (stop11) <-- Instead of stop5
> > 
> > Cpu hang is caused because cpu ended up in a deeper state than it requested
> > 
> > This patch fixes instability caused by special wakeup when stop11 is
> > enabled. Requests restore of PSSCR to deepest stop state used by cpuidle.
> > Only when offlining cpu, request restore of PSSCR to deepest stop state.
> > On onlining cpu, request restore of PSSCR to deepest stop state used by
> > cpuidle.
> 
> So if we chose a stop state, but somebody else does a special wakeup,
> we'll end up going back into a *deeper* one than the one we came from ?

Unfortunately yes.  This is the current limitation.  If we are in stop4
and above and we had not set a PSSCR to be restored, then the hardware
will default to all bits set (stop15) leading to entry into stop11
after the special wakeup is removed.  The requirement is such that we
need to have a correct PSSCR restore value set using stop-api.

We need to set a restore PSSCR value that represents one state in a
group: stop4, 5, 6 and 7 have identical state loss, hence we can set a
PSSCR of 4, 5 or 7 for entry into any of these stop states and not have
to use the stop-api to set the exact value of stop4 or 5 at every entry.
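As a rough illustration of the grouping point (the RL field is the low
4 bits of the PSSCR; the state numbers are just examples, and the
actual stop-api call is elided):

    #define PSSCR_RL_MASK   0xFULL          /* requested level */

    u64 psscr_restore;  /* value the firmware restores on special wakeup */

    /* normal operation: deepest state cpuidle uses, e.g. stop5 */
    psscr_restore = (psscr_restore & ~PSSCR_RL_MASK) | 5;

    /* just before offlining a cpu: allow the deepest state, e.g. stop11 */
    psscr_restore = (psscr_restore & ~PSSCR_RL_MASK) | 11;

    /* on online, switch back to the cpuidle-deepest value again */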
 
> I still think this is broken by design. The chip should automatically
> go back to the state it went to after special wakeup, thus the PPE
> controlling the state should override the PSSCR value accordingly
> rather than relying on those SW hoops.

Special wakeup de-assertion and re-entry hits this limitation where we
have lost the original content of the PSSCR SPR, and hence the CME does
not know
what was requested.

Additional stop-api calls from software could have been avoided, but
practically only the cpu hotplug case uses stop11 and needs this
stop-api.  We can default the system to stop4 or stop5, make a stop-api
call to explicitly set stop11, and then hotplug out the cpu. We have to
restore the deepest cpuidle state (stop4/5) back during online.  As
such this is not much software overhead. But we need an elegant method
to control these calls via OPAL flags so that kernel behaviour can be
controlled more closely.

If we want to use stop11 in cpuidle (despite it being very slow) for
evaluation reasons, then we will need to make more stop-api calls to
choose between stop4/5 and stop11, since they belong to different groups.
Even in this case, since stop11 is the slow path, we would want to set
stop11 before entry and restore to stop4/5 after wakeup.  This way we
still completely avoid stop-api calls in the fast-path stop4/5 entry/exit.

--Vaidy



Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:28:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > On PowerNV platform cpu_present could be less than cpu_possible in cases
> > where firmware detects the cpu, but it is not available to the OS.  When
> > CONFIG_HOTPLUG_CPU=n, such cpus are not hotpluggable at runtime and hence
> > we skip creating cpu_device.
> >
> > This breaks cpuidle on powernv where register_cpu() is not called for
> > cpus in cpu_possible_mask that cannot be hot-added at runtime.
> >
> > Trying cpuidle_register_device() on cpu without cpu_device will cause
> > crash like this:
> >
> > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
> > pc: c022c8bc: string+0x34/0x60
> > lr: c022ed78: vsnprintf+0x284/0x42c
> > sp: c00ff1503710
> >msr: 90009033
> >dar: 60006000
> >   current = 0xc00ff148
> >   paca= 0xcfe82d00   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/8
> > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
> > (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
> > enter ? for help
> > [link register   ] c022ed78 vsnprintf+0x284/0x42c
> > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
> > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44
> > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
> > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
> > [c00ff15038c0] c0619694 printk+0x38/0x4c
> > [c00ff15038e0] c0224950 kobject_get+0x40/0x60
> > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
> > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
> > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
> > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
> > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
> > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
> > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
> > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
> > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
> > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78
> >
> > This patch fixes the bug by passing correct cpumask from
> > powernv-cpuidle driver.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> 
> That needs to be ACKed by someone familiar with powernv.

Previous version at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155587.html

I had not CCed linux-pm in the first post.

Michael and Mikey have reviewed the previous version.  Let me get an
ack for you to proceed with the merge.

Thanks,
Vaidy



Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki <raf...@kernel.org> [2017-03-23 16:27:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
> <sva...@linux.vnet.ibm.com> wrote:
> > If a given cpu is not in cpu_present and cpu hotplug
> > is disabled, arch can skip setting up the cpu_dev.
> >
> > Arch cpuidle driver should pass correct cpu mask
> > for registration, but failing to do so by the driver
> > causes error to propagate and crash like this:
> >
> > [   30.076045] Unable to handle kernel paging request for
> > data at address 0x0048
> > [   30.076100] Faulting instruction address: 0xc07b2f30
> > cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
> > pc: c07b2f30: kobject_get+0x20/0x70
> > lr: c07b3c94: kobject_add_internal+0x54/0x3f0
> > sp: c03feb18b8f0
> >msr: 90009033
> >dar: 48
> >  dsisr: 4000
> >   current = 0xc03fd2ed8300
> >   paca= 0xcfbab500   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
> > 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
> > enter ? for help
> > [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
> > [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
> > [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
> > [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
> > [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
> > [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
> > [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
> > [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
> > [c03feb18bdc0] c000d864 kernel_init+0x24/0x160
> > [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74
> >
> > Validating cpu_dev fixes the crash and reports correct error message like:
> >
> > [   30.163506] Failed to register cpuidle device for cpu136
> > [   30.173329] Registration of powernv driver failed.
> >
> > Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> 
> The previous version is in linux-next already and I'm going to push it
> for merging shortly.

Thanks Rafael.  The previous version is good for merge.

--Vaidy




[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
If a given cpu is not in cpu_present and cpu hotplug
is disabled, arch can skip setting up the cpu_dev.

Arch cpuidle driver should pass correct cpu mask
for registration, but failing to do so by the driver
causes error to propagate and crash like this:

[   30.076045] Unable to handle kernel paging request for
data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
pc: c07b2f30: kobject_get+0x20/0x70
lr: c07b3c94: kobject_add_internal+0x54/0x3f0
sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca= 0xcfbab500   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/sysfs.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..f2c3bce 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
int error;
 
+   /*
+* Return error if cpu_device is not setup for this cpu.  This
+* could happen if arch did not setup cpu_device since this
+* cpu is not in cpu_present mask and the driver did not send
+* correct cpu mask at registration.  Without this check we
+* would end up passing bogus value for &cpu_dev->kobj in
+* kobject_init_and_add().
+*/
+
+   if (!cpu_dev)
+   return -ENODEV;
+
kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
if (!kdev)
return -ENOMEM;
-- 
2.9.3



[PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
On PowerNV platform cpu_present could be less than cpu_possible in cases
where firmware detects the cpu, but it is not available to the OS.  When
CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
we skip creating cpu_device.

This breaks cpuidle on powernv where register_cpu() is not called for
cpus in cpu_possible_mask that cannot be hot-added at runtime.

Trying cpuidle_register_device() on cpu without cpu_device will cause
crash like this:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
pc: c022c8bc: string+0x34/0x60
lr: c022ed78: vsnprintf+0x284/0x42c
sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca= 0xcfe82d00   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
(Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the bug by passing correct cpumask from
powernv-cpuidle driver.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index a06df51..82f7b33 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
drv->state_count += 1;
}
 
+   /*
+* On PowerNV platform cpu_present may be less than cpu_possible in
+* cases where firmware detects the cpu, but it is not available to the
+* OS.  If CONFIG_HOTPLUG_CPU=n then such CPUs are not hotplugable at
+* runtime and hence cpu_devices are not created for those cpus by
+* generic topology_init().
+*
+* drv->cpumask defaults to cpu_possible_mask in
+* __cpuidle_driver_init().  This breaks cpuidle on powernv where
+* cpu_devices are not created for cpus in cpu_possible_mask that
+* cannot be hot-added later at runtime.
+*
+* Trying cpuidle_register_device() on a cpu without cpu_devices is
+* incorrect. Hence pass correct cpu mask to generic cpuidle driver.
+*/
+
+   drv->cpumask = (struct cpumask *)cpu_present_mask;
+
return 0;
 }
 
-- 
2.9.3



[PATCH v1 0/2] cpuidle: Fixes in cpuidle driver

2017-03-23 Thread Vaidyanathan Srinivasan
When CONFIG_HOTPLUG_CPU=n and cpu_present is less than cpu_possible,
cpuidle-powernv not passing an explicit drv->cpumask allows the
generic cpuidle driver to try to create sysfs objects for cpus that do
not have cpu_devices created by register_cpu().

This caused the kernel to access an incorrect address and crash.  The
following patch series fixes the cpuidle-powernv driver and also adds
an additional check in cpuidle_add_sysfs().

This patch set is against v4.11-rc3.

Changed from v1: Updated commit message and comments.

Signed-off-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>



Re: [PATCH 4.4 54/58] powerpc: Convert cmp to cmpd in idle enter sequence

2017-01-09 Thread Vaidyanathan Srinivasan
* Greg Kroah-Hartman <gre...@linuxfoundation.org> [2017-01-06 22:44:39]:

> 4.4-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Segher Boessenkool <seg...@kernel.crashing.org>
> 
> commit 80f23935cadb1c654e81951f5a8b7ceae0acc1b4 upstream.
> 
> PowerPC's "cmp" instruction has four operands. Normally people write
> "cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
> people forget, and write "cmp" with just three operands.
> 
> With older binutils this is silently accepted as if this was "cmpw",
> while often "cmpd" is wanted. With newer binutils GAS will complain
> about this for 64-bit code. For 32-bit code it still silently assumes
> "cmpw" is what is meant.
> 
> In this instance the code comes directly from ISA v2.07, including the
> cmp, but cmpd is correct. Backport to stable so that new toolchains can
> build old kernels.
> 
> Fixes: 948cf67c4726 ("powerpc: Add NAP mode support on Power7 in HV mode")
> Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>
> Signed-off-by: Segher Boessenkool <seg...@kernel.crashing.org>
> Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
> Signed-off-by: Joel Stanley <j...@jms.id.au>
> Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>

Acked-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

The change from cmp to cmpd is correct. This code will execute
in 64-bit mode only and fix applies to stable as described above.

> 
> 
> ---
>  arch/powerpc/kernel/idle_power7.S |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/arch/powerpc/kernel/idle_power7.S
> +++ b/arch/powerpc/kernel/idle_power7.S
> @@ -44,7 +44,7 @@
>   std r0,0(r1);   \
>   ptesync;\
>   ld  r0,0(r1);   \
> -1:   cmp cr0,r0,r0;  \
> +1:   cmpdcr0,r0,r0;  \
>   bne 1b; \
>   IDLE_INST;  \
>   b   .
> 
> 



Re: [patch 10/15] sched/migration: Move calc_load_migrate() into CPU_DYING

2016-07-12 Thread Vaidyanathan Srinivasan
* Thomas Gleixner <t...@linutronix.de> [2016-07-12 18:33:56]:

> Anton,
> 
> On Tue, 12 Jul 2016, Anton Blanchard wrote:
> > > It really does not matter when we fold the load for the outgoing cpu.
> > > It's almost dead anyway, so there is no harm if we fail to fold the
> > > few microseconds which are required for going fully away.
> > 
> > We are seeing the load average shoot up when hot unplugging CPUs (+1
> > for every CPU we offline) on ppc64. This reproduces on bare metal as
> > well as inside a KVM guest. A bisect points at this commit.
> > 
> > As an example, a completely idle box with 128 CPUS and 112 hot
> > unplugged:
> > 
> > # uptime
> >  04:35:30 up  1:23,  2 users,  load average: 112.43, 122.94, 125.54
> 
> Yes, it's an off by one as we now call that from the task which is tearing
> down the cpu. Does the patch below fix it?

Hi Thomas,

Yes this patch fixes the issue.  I was able to recreate the problem
and also verify with this patch on 4.7.0-rc7.

> 
> Thanks,
> 
>   tglx
> 
> 8<--
> 
> Subject: sched/migration: Correct off by one in load migration
> From: Thomas Gleixner <t...@linutronix.de>
> 
> The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
> account that the function is now called from a thread running on the outgoing
> CPU. As a result a cpu unplug leakes a load of 1 into the global load
> accounting mechanism.
> 
> Fix it by adjusting for the currently running thread which calls
> calc_load_migrate().
> 
> Fixes: e9cd8fa4fcfd: "sched/migration: Move calc_load_migrate() into 
> CPU_DYING"
> Reported-by: Anton Blanchard <an...@samba.org>

Tested-by: Vaidyanathan Srinivasan <sva...@linux.vnet.ibm.com>

> Signed-off-by: Thomas Gleixner <t...@linutronix.de>
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 51d7105f529a..97ee9ac7e97c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5394,13 +5394,15 @@ void idle_task_exit(void)
>  /*
>   * Since this CPU is going 'away' for a while, fold any nr_active delta
>   * we might have. Assumes we're called after migrate_tasks() so that the
> - * nr_active count is stable.
> + * nr_active count is stable. We need to take the teardown thread which
> + * is calling this into account, so we hand in adjust = 1 to the load
> + * calculation.
>   *
>   * Also see the comment "Global load-average calculations".
>   */
>  static void calc_load_migrate(struct rq *rq)
>  {
> - long delta = calc_load_fold_active(rq);
> + long delta = calc_load_fold_active(rq, 1);
>   if (delta)
>   atomic_long_add(delta, &calc_load_tasks);
>  }
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index b0b93fd33af9..a2d6eb71f06b 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -78,11 +78,11 @@ void get_avenrun(unsigned long *loads, unsigned long 
> offset, int shift)
>   loads[2] = (avenrun[2] + offset) << shift;
>  }
> 
> -long calc_load_fold_active(struct rq *this_rq)
> +long calc_load_fold_active(struct rq *this_rq, long adjust)
>  {
>   long nr_active, delta = 0;
> 
> - nr_active = this_rq->nr_running;
> + nr_active = this_rq->nr_running - adjust;
>   nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;

Does the above calculation hold good even if we send adjust=1 and bump
down nr_active? Tested ok though :)
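
For what it is worth, a small user-space analogue of the fold (plain
longs and hypothetical values, not the kernel code) shows why
adjust = 1 avoids the leak even though nr_active is bumped down:

#include <stdio.h>

/* Same calculation as calc_load_fold_active(), with plain longs. */
static long fold_active(long nr_running, long nr_uninterruptible,
                        long *calc_load_active, long adjust)
{
    long nr_active = nr_running - adjust;   /* drop the teardown thread */
    long delta = 0;

    nr_active += nr_uninterruptible;
    if (nr_active != *calc_load_active) {
        delta = nr_active - *calc_load_active;
        *calc_load_active = nr_active;
    }
    return delta;
}

int main(void)
{
    long calc_load_active = 0;

    /* Idle cpu being unplugged: only the teardown thread is runnable. */
    printf("delta without adjust: %ld\n",
           fold_active(1, 0, &calc_load_active, 0));    /* leaks +1 */

    calc_load_active = 0;
    printf("delta with adjust=1:  %ld\n",
           fold_active(1, 0, &calc_load_active, 1));    /* 0        */
    return 0;
}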

--Vaidy



Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-06-03 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt  [2015-05-30 20:38:22]:

> On Sat, 2015-05-30 at 11:31 +0530, Vaidyanathan Srinivasan wrote:
> > In shared lpar case, spinning in guest context may potentially take
> > away cycles from other lpars waiting to run on the same physical cpu.
> > 
> > So the policy in shared lpar case is to let PowerVM hypervisor know
> > immediately that the guest cpu is idle which will allow the hypervisor
> > to use the cycles for other tasks/lpars.
> 
> But that will have negative side effects under KVM no ?

Yes, you have a good point.  If one of the threads in the core goes to
cede, it can still come back quickly since the KVM guest context is
not switched yet.  But in a single-threaded guest, this can force
unnecessary exit/context-switch overhead.

Now that we have fixed the snooze loop to be bounded and to exit
predictably, a KVM guest should actually use the snooze state to
improve latency.
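
For reference, a rough user-space analogue of what "bounded" means
here (the names and the timeout value are made up for illustration;
the real driver polls need_resched() against a timebase-derived
deadline):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static bool need_resched_flag;  /* stand-in for need_resched() */

static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Bounded snooze: poll until work arrives or the time budget runs
 * out, then return so the caller can pick a deeper state (or cede
 * immediately, for a shared-processor guest). */
static int snooze(long long timeout_ns)
{
    long long exit_time = now_ns() + timeout_ns;

    while (!need_resched_flag) {
        if (now_ns() > exit_time)
            return 1;   /* timed out, promote to a deeper state */
    }
    return 0;           /* work arrived while snoozing */
}

int main(void)
{
    printf("snooze result: %d\n", snooze(100000));  /* ~100us budget */
    return 0;
}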

I will test this scenario and enable snooze state for KVM guest.
 
> Suresh mentioned something with his new directed interrupts code that we
> had many cases where the interrupts ended up arriving shortly after we
> exited to host for NAP'ing ...
> 
> Snooze might fix it...

Right.  This scenario is worth experimenting with, and then we can
introduce the snooze loop for guests.

--Vaidy



Re: [PATCH] cpuidle: powernv/pseries: Decrease the snooze residency

2015-05-30 Thread Vaidyanathan Srinivasan
* Preeti U Murthy  [2015-05-29 19:17:17]:

[snip]

> > +   if (max_idle_state > 1) {
> > +   snooze_timeout_en = true;
> > +   snooze_timeout = cpuidle_state_table[1].target_residency *
> > +tb_ticks_per_usec;
> > +   }
> 
> Any idea why we don't have snooze defined on the shared lpar configuration ?

In shared lpar case, spinning in guest context may potentially take
away cycles from other lpars waiting to run on the same physical cpu.

So the policy in shared lpar case is to let PowerVM hypervisor know
immediately that the guest cpu is idle which will allow the hypervisor
to use the cycles for other tasks/lpars.

--Vaidy





[tip:sched/core] sched: Fix asymmetric scheduling for POWER7

2013-11-06 Thread tip-bot for Vaidyanathan Srinivasan
Commit-ID:  2042abe7977222ef606306faa2dce8fd51e98e65
Gitweb: http://git.kernel.org/tip/2042abe7977222ef606306faa2dce8fd51e98e65
Author: Vaidyanathan Srinivasan 
AuthorDate: Wed, 30 Oct 2013 08:42:42 +0530
Committer:  Ingo Molnar 
CommitDate: Wed, 6 Nov 2013 12:37:54 +0100

sched: Fix asymmetric scheduling for POWER7

Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core)
is completely busy or we have idle cpu.

The following check for SD_ASYM_PACKING:

(cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.

Reported-by: Michael Neuling 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Preeti U Murthy 
Tested-by: Michael Neuling 
Signed-off-by: Peter Zijlstra 
Cc: vincent.guit...@linaro.org
Cc: bitbuc...@online.de
Cc: b...@kernel.crashing.org
Cc: an...@samba.org
Cc: morten.rasmus...@arm.com
Cc: p...@google.com
Link: 
http://lkml.kernel.org/r/20131030031242.23426.13019.st...@preeti.in.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41c02b6..074551a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6800,7 +6800,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
goto need_kick_unlock;
 
-   if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
+   if (sd->flags & SD_ASYM_PACKING
&& (cpumask_first_and(nohz.idle_cpus_mask,
  sched_domain_span(sd)) < cpu))
goto need_kick_unlock;


[PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy 

The current logic in load balance is such that after picking the
busiest group, the load is attempted to be moved from the busiest cpu
in that group to the dst_cpu. If the load cannot be moved from the
busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
hot tasks, then the dst_cpu is changed to be another idle cpu within
the dst->grpmask. If even then, the load cannot be moved from the
busiest cpu, then the source group is changed. The next busiest group
is found and the above steps are repeated.

However if the cpus in the group share package resources, then when
a load movement from the busiest cpu in this group fails as above,
instead of finding the next busiest group to move load from, find the
next busiest cpu *within the same group* from which to move load away.
By doing so, a conscious effort is made during load balancing to keep
just one cpu busy as much as possible within domains that have
SHARED_PKG_RESOURCES flag set unless under scenarios of high load.
Having multiple cpus busy within a domain which shares package
resources could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the
current task on the busiest cpu cannot be moved away due to task
pinning, currently no more attempts at load balancing is made. This
patch checks if the balancing is being done on a group whose cpus
share package resources. If so, then check if the load balancing can
be done for other cpus in the same group.

Signed-off-by: Preeti U Murthy 
Signed-off-by: Vaidyanathan Srinivasan 
---
 kernel/sched/fair.c |   18 ++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
+   struct sched_domain *child;
+   int share_pkg_res = 0;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
schedstat_inc(sd, lb_count[idle]);
 
+   child = sd->child;
+   if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+   share_pkg_res = 1;
+
 redo:
if (!should_we_balance()) {
*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
goto out_balanced;
}
 
+redo_grp:
busiest = find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
if (!cpumask_empty(cpus)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
+   if (share_pkg_res &&
+   cpumask_intersects(cpus,
+   to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto redo;
}
goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
 */
if (!cpumask_test_cpu(this_cpu,
tsk_cpus_allowed(busiest->curr))) {
+   cpumask_clear_cpu(cpu_of(busiest), cpus);
raw_spin_unlock_irqrestore(&busiest->lock,
flags);
env.flags |= LBF_ALL_PINNED;
+   if (share_pkg_res &&
+   cpumask_intersects(cpus,
+   to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto out_one_pinned;
}
 



[PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

2013-10-21 Thread Vaidyanathan Srinivasan
Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core)
is completely busy or we have idle cpu.

The following check for SD_ASYM_PACKING:

(cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.
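
As an illustration of why this is sufficient, a small user-space
analogue using a plain bitmask (not the kernel cpumask API): when the
intersection is empty, the "first" index falls past the last cpu, so
the "< cpu" comparison fails and no kick happens.

#include <stdio.h>

#define NR_CPUS 8

/* Analogue of cpumask_first_and(): first bit set in (a & b), or
 * NR_CPUS when the intersection is empty. */
static int first_and(unsigned int a, unsigned int b)
{
    unsigned int inter = a & b;

    for (int i = 0; i < NR_CPUS; i++)
        if (inter & (1u << i))
            return i;
    return NR_CPUS;
}

int main(void)
{
    unsigned int domain = 0x0f; /* cpus 0-3 form the core            */
    int cpu = 2;                /* cpu running nohz_kick_needed()    */

    /* No idle cpu in the domain: first_and() == NR_CPUS, so no kick. */
    printf("all busy:  kick=%d\n", first_and(0x00, domain) < cpu);
    /* cpu0 idle: first_and() == 0 < 2, so kick the idle balancer.    */
    printf("cpu0 idle: kick=%d\n", first_and(0x01, domain) < cpu);
    return 0;
}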

Reported-by: Michael Neuling 
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Preeti U Murthy 
---
 kernel/sched/fair.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12f0eab..828ed97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick_unlock;
}
 
-   if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
-   && (cpumask_first_and(nohz.idle_cpus_mask,
+   if (sd->flags & SD_ASYM_PACKING &&
+   (cpumask_first_and(nohz.idle_cpus_mask,
  sched_domain_span(sd)) < cpu))
goto need_kick_unlock;
 



[PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy 

In nohz_kick_needed() there are checks around the flags
SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the
domains with this flag set have more than one cpu busy. Therefore at
every domain, a check has to be made on nr_busy of that domain. This
means the sum of the nr_busy of each group in that domain needs to be
checked, since nr_busy is a parameter which is associated with
a group.  However, in the current implementation of nohz_kick_needed(),
nr_busy is checked only for the group to which the cpu that initiated
this check belongs. This gives us the wrong count of the number of
busy cpus in that domain.

The following commit which fixed the sgp->nr_busy_cpus computation
actually exposed the bug in nohz_kick_needed() which worked when
nr_busy was incorrectly > 1

25f55d9d01ad7a7ad248fd5af1d22675ffd202c5
sched: Fix init NOHZ_IDLE flag

To illustrate the scenario,  consider a core, whose domain will have
the SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when
we find that more than one thread in the core is busy. With the
current implementation of nohz_kick_needed(), at this domain(sd), the
nr_busy will be 1 always since it returns this parameter for
sd->groups which encompasses a single thread, while we want this
parameter for sd->parent->groups which will rightly point to the
number of busy threads in the core.
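
A toy user-space illustration (not the scheduler data structures) of
the difference between the two levels:

#include <stdio.h>

int main(void)
{
    /* 4 SMT threads in one core, two of them busy. */
    int thread_busy[4] = { 1, 1, 0, 0 };

    /* At the SMT level, sd->groups spans a single thread, so its
     * nr_busy can never exceed 1. */
    int smt_group_nr_busy = thread_busy[0];

    /* sd->parent->groups spans the whole core and aggregates all of
     * its threads, which is the count the check actually needs. */
    int core_group_nr_busy = 0;
    for (int i = 0; i < 4; i++)
        core_group_nr_busy += thread_busy[i];

    printf("smt-level  nr_busy = %d (never > 1)\n", smt_group_nr_busy);
    printf("core-level nr_busy = %d\n", core_group_nr_busy);
    return 0;
}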

This patch also ensures that the order of check for
SD_SHARE_PKG_RESOURCE comes before the check for ASYM_PACKING.
Priority is given to avoid more than one busy thread in a core as much
as possible before attempting asymmetric packing.

Signed-off-by: Preeti U Murthy 
Signed-off-by: Vaidyanathan Srinivasan 
---
 kernel/sched/fair.c |   19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int 
cpu)
 
rcu_read_lock();
for_each_domain(cpu, sd) {
-   struct sched_group *sg = sd->groups;
-   struct sched_group_power *sgp = sg->sgp;
-   int nr_busy = atomic_read(&sgp->nr_busy_cpus);
-
-   if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
-   goto need_kick_unlock;
+   struct sched_domain *sd_parent = sd->parent;
+   struct sched_group *sg;
+   struct sched_group_power *sgp;
+   int nr_busy;
+
+   if (sd_parent) {
+   sg = sd_parent->groups;
+   sgp = sg->sgp;
+   nr_busy = atomic_read(&sgp->nr_busy_cpus);
+
+   if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
+   goto need_kick_unlock;
+   }
 
if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
&& (cpumask_first_and(nohz.idle_cpus_mask,



[PATCH 0/3] sched: Fixes for task placement in SMT threads

2013-10-21 Thread Vaidyanathan Srinivasan
Hi,

The following series fixes scheduler loadbalancing issues where we are
missing opportunity to place tasks in SMT threads optimally especially
on a POWER7 system.

PATCH 1/3, Fixes the scenario where load balancing fails to move tasks
away from the cpus in a domain which has SHARE_PKG_RESOURCES set even
under the presence of idle cpus in other domains.

PATCH 2/3, ensures lower order SMT threads are used for the
SD_ASYM_PACKING load balancing case.

PATCH 3/3, tries to fix the problem that is explained in its
changelog. The following experiment expose the scenario:

A task placement test was conducted on a POWER7 as well as multi-core
x86 system. At the beginning, tasks are pinned to all the threads
within a core and later only the tasks pinned to the secondary threads
in a core are unpinned. This leaves the primary thread with tasks that
are unmovable, while the rest are free to be moved.

Under the above setup, load balancing today does not move the unpinned
tasks away from the secondary threads of the core in the above
experiment although there are other idle cpus. The PATCH 3/3 fixes
this situation and improves task placement even if some of the tasks
in a core are pinned.

This series applies on v3.12-rc6 and tested on x86 and powerpc.

--Vaidy

---

Preeti U Murthy (2):
  sched: Fix nohz_kick_needed to consider the nr_busy of the parent 
domain's group
  sched: Aggressive balance in domains whose groups share package resources

Vaidyanathan Srinivasan (1):
  sched: Fix asymmetric scheduling for POWER7


 kernel/sched/fair.c |   41 +
 1 file changed, 33 insertions(+), 8 deletions(-)



[PATCH 0/3] sched: Fixes for task placement in SMT threads

2013-10-21 Thread Vaidyanathan Srinivasan
Hi,

The following series fixes scheduler loadbalancing issues where we are
missing opportunity to place tasks in SMT threads optimally especially
on a POWER7 system.

PATCH 1/3, Fixes the scenario where load balancing fails to move tasks
away from the cpus in a domain which has SHARE_PKG_RESOURCES set even
under the presence of idle cpus in other domains.

PATCH 2/3, ensures lower order SMT threads are used for the
SD_ASYM_PACKING load balancing case.

PATCH 3/3, tries to fix the problem that is explained in its
changelog. The following experiment expose the scenario:

A task placement test was conducted on a POWER7 as well as multi-core
x86 system. At the beginning, tasks are pinned to all the threads
within a core and later only the tasks pinned to the secondary threads
in a core are unpinned. This leaves the primary thread with tasks that
are unmovable, while the rest are free to be moved.

Under the above setup, load balancing today does not move the unpinned
tasks away from the secondary threads of the core in the above
experiment although there are other idle cpus. The PATCH 3/3 fixes
this situation and improves task placement even if some of the tasks
in a core are pinned.

This series applies on v3.12-rc6 and tested on x86 and powerpc.

--Vaidy

---

Preeti U Murthy (2):
  sched: Fix nohz_kick_needed to consider the nr_busy of the parent 
domain's group
  sched: Aggressive balance in domains whose groups share package resources

Vaidyanathan Srinivasan (1):
  sched: Fix asymmetric scheduling for POWER7


 kernel/sched/fair.c |   41 +
 1 file changed, 33 insertions(+), 8 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy pre...@linux.vnet.ibm.com

In nohz_kick_needed() there are checks around the flags
SD_SHARE_PKG_RESOURCES which decide to initiate nohz_balance if the
domains with this flag set have more than one cpu busy. Therefore at
every domain, a check has to be made on nr_busy of that domain. This
means the sum of the nr_busy of each group in that domain needs to be
checked, since nr_busy is a parameter which is associated with
a group.  However in the current implementation of nohz_kick_needed(),
the nr_busy is being checked for just the group to which the cpu that
has initiated this check belongs to. This will give us the wrong count
of the number of busy cpus in that domain.

The following commit which fixed the sgp-nr_busy_cpus computation
actually exposed the bug in nohz_kick_needed() which worked when
nr_busy was incorrectly  1

25f55d9d01ad7a7ad248fd5af1d22675ffd202c5
sched: Fix init NOHZ_IDLE flag

To illustrate the scenario,  consider a core, whose domain will have
the SD_SHARE_PKG_RESOURCES set. We want to kick nohz_idle_balance when
we find that more than one thread in the core is busy. With the
current implementation of nohz_kick_needed(), at this domain(sd), the
nr_busy will be 1 always since it returns this parameter for
sd-groups which encompasses a single thread, while we want this
parameter for sd-parent-groups which will rightly point to the
number of busy threads in the core.

This patch also ensures that the order of check for
SD_SHARE_PKG_RESOURCE comes before the check for ASYM_PACKING.
Priority is given to avoid more than one busy thread in a core as much
as possible before attempting asymmetric packing.

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..12f0eab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,12 +5807,19 @@ static inline int nohz_kick_needed(struct rq *rq, int 
cpu)
 
rcu_read_lock();
for_each_domain(cpu, sd) {
-   struct sched_group *sg = sd-groups;
-   struct sched_group_power *sgp = sg-sgp;
-   int nr_busy = atomic_read(sgp-nr_busy_cpus);
-
-   if (sd-flags  SD_SHARE_PKG_RESOURCES  nr_busy  1)
-   goto need_kick_unlock;
+   struct sched_domain *sd_parent = sd-parent;
+   struct sched_group *sg;
+   struct sched_group_power *sgp;
+   int nr_busy;
+
+   if (sd_parent) {
+   sg = sd_parent-groups;
+   sgp = sg-sgp;
+   nr_busy = atomic_read(sgp-nr_busy_cpus);
+
+   if (sd-flags  SD_SHARE_PKG_RESOURCES  nr_busy  1)
+   goto need_kick_unlock;
+   }
 
if (sd-flags  SD_ASYM_PACKING  nr_busy != sg-group_weight
 (cpumask_first_and(nohz.idle_cpus_mask,

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] sched: Fix asymmetric scheduling for POWER7

2013-10-21 Thread Vaidyanathan Srinivasan
Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core)
is completely busy or we have idle cpu.

The following check for SD_ASYM_PACKING:

(cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.
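
As an illustration (not part of the patch) of why that is enough:
cpumask_first_and() returns a value >= nr_cpu_ids when the intersection
is empty, so the "< cpu" comparison fails both when no cpu in the domain
is idle and when the only idle cpus are higher-numbered threads:

	/* illustrative only */
	int first_idle = cpumask_first_and(nohz.idle_cpus_mask,
					   sched_domain_span(sd));

	if (sd->flags & SD_ASYM_PACKING && first_idle < cpu)
		goto need_kick_unlock;	/* a lower-numbered idle thread exists */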

Reported-by: Michael Neuling michael.neul...@au1.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12f0eab..828ed97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5821,8 +5821,8 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
goto need_kick_unlock;
}
 
-   if (sd->flags & SD_ASYM_PACKING && nr_busy != sg->group_weight
-    && (cpumask_first_and(nohz.idle_cpus_mask,
+   if (sd->flags & SD_ASYM_PACKING &&
+   (cpumask_first_and(nohz.idle_cpus_mask,
  sched_domain_span(sd)) < cpu))
goto need_kick_unlock;
 



[PATCH 3/3] sched: Aggressive balance in domains whose groups share package resources

2013-10-21 Thread Vaidyanathan Srinivasan
From: Preeti U Murthy pre...@linux.vnet.ibm.com

The current logic in load balance is such that after picking the
busiest group, the load is attempted to be moved from the busiest cpu
in that group to the dst_cpu. If the load cannot be moved from the
busiest cpu to dst_cpu due to either tsk_cpus_allowed mask or cache
hot tasks, then the dst_cpu is changed to be another idle cpu within
the dst->grpmask. If even then, the load cannot be moved from the
busiest cpu, then the source group is changed. The next busiest group
is found and the above steps are repeated.

However if the cpus in the group share package resources, then when
a load movement from the busiest cpu in this group fails as above,
instead of finding the next busiest group to move load from, find the
next busiest cpu *within the same group* from which to move load away.
By doing so, a conscious effort is made during load balancing to keep
just one cpu busy as much as possible within domains that have
SD_SHARE_PKG_RESOURCES flag set, unless under scenarios of high load.
Having multiple cpus busy within a domain which share package resource
could lead to a performance hit.

A similar scenario arises in active load balancing as well. When the
current task on the busiest cpu cannot be moved away due to task
pinning, currently no more attempts at load balancing is made. This
patch checks if the balancing is being done on a group whose cpus
share package resources. If so, then check if the load balancing can
be done for other cpus in the same group.
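
In rough pseudo-code (illustrative only; "nothing_moved" is a stand-in
for the pinned/cache-hot conditions, the real code is in the diff
below), the retry logic looks like:

	redo:
		group = find_busiest_group(&env);
	redo_grp:
		busiest = find_busiest_queue(&env, group);
		/* ... attempt to pull load from busiest ... */
		if (nothing_moved) {
			cpumask_clear_cpu(cpu_of(busiest), cpus);
			if (share_pkg_res &&
			    cpumask_intersects(cpus, to_cpumask(group->cpumask)))
				goto redo_grp;	/* next cpu in the same group */
			if (!cpumask_empty(cpus))
				goto redo;	/* fall back to a new group */
		}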

Signed-off-by: Preeti U Murthy pre...@linux.vnet.ibm.com
Signed-off-by: Vaidyanathan Srinivasan sva...@linux.vnet.ibm.com
---
 kernel/sched/fair.c |   18 ++
 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 828ed97..bbcd96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5165,6 +5165,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 {
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
+   struct sched_domain *child;
+   int share_pkg_res = 0;
struct rq *busiest;
unsigned long flags;
struct cpumask *cpus = __get_cpu_var(load_balance_mask);
@@ -5190,6 +5192,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
schedstat_inc(sd, lb_count[idle]);
 
+   child = sd->child;
+   if (child && child->flags & SD_SHARE_PKG_RESOURCES)
+   share_pkg_res = 1;
+
 redo:
if (!should_we_balance(env)) {
*continue_balancing = 0;
@@ -5202,6 +5208,7 @@ redo:
goto out_balanced;
}
 
+redo_grp:
busiest = find_busiest_queue(env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
@@ -5292,6 +5299,11 @@ more_balance:
if (!cpumask_empty(cpus)) {
env.loop = 0;
env.loop_break = sched_nr_migrate_break;
+   if (share_pkg_res &&
+   cpumask_intersects(cpus,
+   to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto redo;
}
goto out_balanced;
@@ -5318,9 +5330,15 @@ more_balance:
 */
if (!cpumask_test_cpu(this_cpu,
tsk_cpus_allowed(busiest->curr))) {
+   cpumask_clear_cpu(cpu_of(busiest), cpus);
raw_spin_unlock_irqrestore(&busiest->lock,
flags);
env.flags |= LBF_ALL_PINNED;
+   if (share_pkg_res &&
+   cpumask_intersects(cpus,
+   to_cpumask(group->cpumask)))
+   goto redo_grp;
+
goto out_one_pinned;
}
 



Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-28 Thread Vaidyanathan Srinivasan
* Preeti U Murthy  [2013-07-27 13:20:37]:

> Hi Ben,
> 
> On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
> > On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
> >> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> >> broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
> >> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> >> asking it to program its lapic to fire at timeX so as to wake up CPUX.
> >> *With multiple CPUs the overhead of sending IPI, could result in
> >> performance bottlenecks and may not scale well.*
> >>
> >> Hence the workaround is that the broadcast CPU on each of its timer
> >> interrupt checks if any of the next timer event of a CPU in deep idle
> >> state has expired, which can very well be found from dev->next_event of
> >> that CPU. For example the timeX that has been mentioned above has
> >> expired. If so the broadcast handler is called to send an IPI to the
> >> idling CPU to wake it up.
> >>
> >> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> >> many ticks away. It could miss waking up a CPU in deep idle*, if its
> >> wakeup is much before this timer interrupt of the broadcast CPU. But
> >> without tickless idle, atleast at each period we are assured of a timer
> >> interrupt. At which time broadcast handling is done as stated in the
> >> previous paragraph and we will not miss wakeup of CPUs in deep idle states.
> > 
> > But that means a great loss of power saving on the broadcast CPU when the 
> > machine
> > is basically completely idle. We might be able to come up with some thing 
> > better.
> > 
> > (Note : I do no know the timer offload code if it exists already, I'm 
> > describing
> > how things could happen "out of the blue" without any knowledge of 
> > pre-existing
> > framework here)
> > 
> > We can know when the broadcast CPU expects to wake up next. When a CPU goes 
> > to
> > a deep sleep state, it can then
> > 
> >  - Indicate to the broadcast CPU when it intends to be woken up by queuing
> > itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
> > Play with the locality of that: have one queue (and one "broadcast CPU") per
> > chip or per node instead of a global one to limit cache bouncing).
> > 
> >  - Check if that happens before the broadcast CPU intended wake time (we
> > need statistics to see how often that happens), and in that case send an IPI
> > to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
> > time to the min of it's intended sleep time and the new sleeper time.
> > (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
> > 
> >  - We can probably limit spurrious wakeups a *LOT* by aligning that target 
> > time
> > to a global jiffy boundary, meaning that several CPUs going to idle are 
> > likely
> > to be choosing the same. Or maybe better, an adaptative alignment by 
> > essentially
> > getting more coarse grained as we go further in the future
> > 
> >  - When the "broadcast" CPU goes to sleep, it can play the same game of 
> > alignment.
> > 
> > I don't like the concept of a dedicated broadcast CPU however. I'd rather 
> > have a
> > general queue (or per node) of sleepers needing a wakeup and more/less 
> > dynamically
> > pick a waker to be the last man standing, but it does make things a bit more
> > tricky with tickless scheduler (non-idle).
> > 
> > Still, I wonder if we could just have some algorithm to actually pick wakers
> > more dynamically based on who ever has the closest "next wakeup" planned,
> > that sort of thing. A fixed broadcaster will create an imbalance in
> > power/thermal within the chip in addition to needing to be moved around on
> > hotplug etc...
> 
> Thank you for having listed out the above suggestions. Below, I will
> bring out some ideas about how the concerns that you have raised can be
> addressed in the increasing order of priority.
> 
> - To begin with, I think we can have the following model to have the
> responsibility of the broadcast CPU float around certain CPUs. i.e. Not
> have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
> bc_cpu henceforth for convenience.
> 
> 1. The first CPU that intends to enter deep sleep state will be the bc_cpu.
> 
> 2. Every other CPU that intends to enter deep idle state will enter
> themselves into a mask, say the bc_mask, which is already being done
> today, after they check that a bc_cpu has been assigned.
> 
> 3. The bc_cpu should not enter tickless idle, until step 5a holds true.
> 
> 4. So on every timer interrupt, which is at-least every period, it
> checks the bc_mask to see if any CPUs need to be woken up.
> 
> 5. The bc cpu should not enter tickless idle *until* it is de-nominated
> as the bc_cpu. The de-nomination occurs when:
>   a. In one of its timer interrupts, it does broadcast handling to find
> out that there are no CPUs to be woken up.
> 
> 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
> to enter deep idle state again, in which case steps 1 to 5 repeat.
> 
> - We could optimize this further, to allow 

Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

2013-07-28 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt  [2013-07-27 16:30:05]:

> On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
> > *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> > broadcast CPU to wake it up at timeX.  Since we cannot program the lapic
> > of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> > asking it to program its lapic to fire at timeX so as to wake up CPUX.
> > *With multiple CPUs the overhead of sending IPI, could result in
> > performance bottlenecks and may not scale well.*
> > 
> > Hence the workaround is that the broadcast CPU on each of its timer
> > interrupt checks if any of the next timer event of a CPU in deep idle
> > state has expired, which can very well be found from dev->next_event of
> > that CPU. For example the timeX that has been mentioned above has
> > expired. If so the broadcast handler is called to send an IPI to the
> > idling CPU to wake it up.
> > 
> > *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> > many ticks away. It could miss waking up a CPU in deep idle*, if its
> > wakeup is much before this timer interrupt of the broadcast CPU. But
> > without tickless idle, atleast at each period we are assured of a timer
> > interrupt. At which time broadcast handling is done as stated in the
> > previous paragraph and we will not miss wakeup of CPUs in deep idle states.
> 
> But that means a great loss of power saving on the broadcast CPU when the 
> machine
> is basically completely idle. We might be able to come up with some thing 
> better.

Hi Ben,

Yes, we will need to improve on this case in stages.  In the current
design, we will have to hold one of the CPUs in a shallow idle state
(nap) to wake up other deep idle cpus.  The cost of keeping the
periodic tick ON on the broadcast CPU is minimal (but not zero) since
we would not allow that CPU to enter any deep idle states even if
there were no periodic timers queued.
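
To make that concrete, the broadcast cpu's periodic tick would
conceptually do something like the sketch below (illustrative only, not
the posted code; bc_mask stands for the mask of cpus parked in deep
idle that is discussed elsewhere in this thread):

/* illustrative sketch, not the posted patch */
static void broadcast_check_sleepers(ktime_t now)
{
	int cpu;

	for_each_cpu(cpu, bc_mask) {	/* cpus parked in deep idle */
		struct clock_event_device *dev =
			per_cpu(tick_cpu_device, cpu).evtdev;

		if (ktime_to_ns(dev->next_event) <= ktime_to_ns(now))
			smp_send_reschedule(cpu);	/* IPI to wake it up */
	}
}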
 
> (Note : I do no know the timer offload code if it exists already, I'm 
> describing
> how things could happen "out of the blue" without any knowledge of 
> pre-existing
> framework here)
> 
> We can know when the broadcast CPU expects to wake up next. When a CPU goes to
> a deep sleep state, it can then
> 
>  - Indicate to the broadcast CPU when it intends to be woken up by queuing
> itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
> Play with the locality of that: have one queue (and one "broadcast CPU") per
> chip or per node instead of a global one to limit cache bouncing).
> 
>  - Check if that happens before the broadcast CPU intended wake time (we
> need statistics to see how often that happens), and in that case send an IPI
> to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
> time to the min of it's intended sleep time and the new sleeper time.
> (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)

This will be an improvement and the idea we have is to have
a hierarchical method of finding a waking CPU within core/socket/node
in order to find a better fit and ultimately send IPI to wakeup
a broadcast CPU only if there is no other fit.  This condition would
imply that more CPUs are in deep idle state and the cost of sending an
IPI to reprogram the broadcast cpu's local timer may well pay off.

>  - We can probably limit spurrious wakeups a *LOT* by aligning that target 
> time
> to a global jiffy boundary, meaning that several CPUs going to idle are likely
> to be choosing the same. Or maybe better, an adaptative alignment by 
> essentially
> getting more coarse grained as we go further in the future
> 
>  - When the "broadcast" CPU goes to sleep, it can play the same game of 
> alignment.

CPUs entering deep idle state would need to wake up only at a jiffy
boundary, or at the jiffy boundary just before the target wakeup time.
Your point is: can the broadcast cpu wake up the sleeping CPU *around*
the designated wakeup time (earlier), so as to avoid reprogramming its
timer?
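
(Roughly, and only as an illustration of the alignment idea, not actual
code:

	u64 expiry_ns  = ktime_to_ns(dev->next_event);
	u64 aligned_ns = rounddown(expiry_ns, TICK_NSEC); /* previous jiffy edge */

so a sleeper asks to be woken at aligned_ns rather than expiry_ns.)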
 
> I don't like the concept of a dedicated broadcast CPU however. I'd rather 
> have a
> general queue (or per node) of sleepers needing a wakeup and more/less 
> dynamically
> pick a waker to be the last man standing, but it does make things a bit more
> tricky with tickless scheduler (non-idle).
> 
> Still, I wonder if we could just have some algorithm to actually pick wakers
> more dynamically based on who ever has the closest "next wakeup" planned,
> that sort of thing. A fixed broadcaster will create an imbalance in
> power/thermal within the chip in addition to needing to be moved around on
> hotplug etc...

Right Ben.  The hierarchical way of selecting the waker will help us
have multiple wakers in different sockets/cores.  The broadcast
framework allows us to decouple the cpu going to idle and the waker to
be selected independently.  This patch series is the start where we
pick one and allow it to move around.  The ideal goal to achieve would
be that we can have multiple wakers serving wakeup requests from a 

Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

2012-11-08 Thread Vaidyanathan Srinivasan
* Mel Gorman  [2012-11-08 18:02:57]:

> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > 

Hi Mel,

Thanks for detailed review and comments.  The goal of this patch
series is to brainstorm on ideas that enable Linux VM to record and
exploit memory region boundaries.

The first approach that we had last year (hierarchy) has more runtime
overhead.  This approach of sorted-buddy was one of the alternatives
discussed earlier and we are trying to find out if simple requirements
of biasing memory allocations can be achieved with this approach.

Smart reclaim based on this approach is a key piece we still need to
design.  Ideas from compaction will certainly help.

> > Today memory subsystems are offer a wide range of capabilities for managing
> > memory power consumption. As a quick example, if a block of memory is not
> > referenced for a threshold amount of time, the memory controller can decide 
> > to
> > put that chunk into a low-power content-preserving state. And the next
> > reference to that memory chunk would bring it back to full power for 
> > read/write.
> > With this capability in place, it becomes important for the OS to understand
> > the boundaries of such power-manageable chunks of memory and to ensure that
> > references are consolidated to a minimum number of such memory power 
> > management
> > domains.
> > 
> 
> How much power is saved?

On embedded platform the savings could be around 5% as discussed in
the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

On larger servers with large amounts of memory the savings could be
more.  We do not yet have all the pieces together to evaluate.

> > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > the firmware can expose information regarding the boundaries of such memory
> > power management domains to the OS in a standard way.
> > 
> 
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events? For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
> 
> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.

Correct.  The platform interface part is not included in this patch
set mainly because there is not much design required there.  Each
platform can have code to collect the memory region boundaries from
BIOS/firmware and load it into the Linux VM.  The goal of this patch
is to brainstorm on the idea of how the core VM should use the region
information.
 
> > How can Linux VM help memory power savings?
> > 
> > o Consolidate memory allocations and/or references such that they are
> > not spread across the entire memory address space.  Basically area of memory
> > that is not being referenced, can reside in low power state.
> > 
> 
> Which the series does not appear to do.

Correct.  We need to design the correct reclaim strategy for this to
work.  However having buddy list sorted by region address could get us
one step closer to shaping the allocations.
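
As a rough illustration of what "sorted by region" means for a free
list (illustrative only, not the posted patches; page_region_id() is
a made-up helper standing in for whatever maps a page to its memory
region):

/* illustrative sketch only */
static void add_to_freelist_region_sorted(struct page *page,
					  struct list_head *list)
{
	struct page *iter;

	/* keep the list ordered by region so allocations, which pop from
	 * the head, are naturally packed into the lowest regions */
	list_for_each_entry(iter, list, lru) {
		if (page_region_id(page) <= page_region_id(iter)) {
			list_add_tail(&page->lru, &iter->lru);
			return;
		}
	}
	list_add_tail(&page->lru, list);
}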

> > o Support targeted memory reclaim, where certain areas of memory that can be
> > easily freed can be offlined, allowing those areas of memory to be put into
> > lower power states.
> > 
> 
> Which the series does not appear to do judging from this;
> 
>   include/linux/mm.h |   38 +++
>   include/linux/mmzone.h |   52 +
>   mm/compaction.c|8 +
>   mm/page_alloc.c|  263 
> 
>   mm/vmstat.c|   59 ++-
> 
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.

Correct.  Evaluating different ideas for reclaim will be next step
before getting into the platform interface parts.

> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent is assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.

This patch is roughly based on the idea that ACPI MPST will give us
memory region boundaries.  It is not designed to implement all options
defined in the spec.  We have taken the general case where regions do
not overlap while the memory addresses themselves can be discontinuous.

> It seems to me that superficially the VM implementation for the support
> would have
> 
> a) Involved a tree that managed the overlapping regions (even if it's
> not hierarchal it feels more sensible) and picked the 

Re: [PATCH 33/36] autonuma: powerpc port

2012-08-22 Thread Vaidyanathan Srinivasan
* Benjamin Herrenschmidt  [2012-08-23 08:56:34]:

> On Thu, 2012-08-23 at 08:01 +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-08-22 at 16:59 +0200, Andrea Arcangeli wrote:
> > > From: Vaidyanathan Srinivasan 
> > > 
> > > * PMD flaging is not required in powerpc since large pages
> > >   are tracked in ptes.
> > > * Yet to be tested with large pages
> > > * This is an initial patch that partially works
> > > * knuma_scand and numa hinting page faults works
> > > * Page migration is yet to be observed/verified
> > > 
> > > Signed-off-by: Vaidyanathan Srinivasan 
> > > Signed-off-by: Andrea Arcangeli 
> > 
> > I don't like this.
> 
> What I mean here is that it's fine as a proof of concept ;-) I don't
> like it being in a series aimed at upstream...

I agree.  My intent was to get the ppc64 discussions started and also
see what it takes to get autonuma to a new arch.

> We can try to flush out the issues, but as it is, the patch isn't
> upstreamable imho.

Yes.  The patch is still in RFC phase and good only for
review/testing.

> As for finding PTE bits, I have a few ideas we need to discuss, but
> nothing simple I'm afraid.

Sure Ben, lets try them and find the better fit.

--Vaidy


Re: Analysis of sched_mc_power_savings

2008-01-09 Thread Vaidyanathan Srinivasan
* Ingo Molnar <[EMAIL PROTECTED]> [2008-01-09 12:35:07]:

> 
> * Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
> 
> > How do we take this technique to the next step where we can 
> > consolidate short running jobs as well?  Did you face any difficulty 
> > biasing the CPU for short running jobs?
> 
> are you sure your measurement tasks do not impact the measurement 
> workload? If you use something like 'top' then try running it reniced to 
> +19. (or perhaps even bound to a particular CPU, say #3, to make its 
> impact isolated)

Hi Ingo,

I will watch this during the experiments.  I have been using klog
application to dump relayfs data.  I did run powertop and top as well,
I will bind them to certain CPUs and isolate their impact.

I believe the margin of error would be less since all the measurement
tasks sleep for long duration.

Thanks,
Vaidy



Re: Analysis of sched_mc_power_savings

2008-01-09 Thread Vaidyanathan Srinivasan
* Siddha, Suresh B <[EMAIL PROTECTED]> [2008-01-08 13:24:00]:

> On Tue, Jan 08, 2008 at 11:08:15PM +0530, Vaidyanathan Srinivasan wrote:
> > Hi,
> > 
> > The following experiments were conducted on a two socket dual core
> > intel processor based machine in order to understand the impact of
> > sched_mc_power_savings scheduler heuristics.
> 
> Thanks for these experiments. sched_mc_power_savings is mainly for
> the reasonably long(which can be caught by periodic load balance) running
> light load(when number of tasks < number of logical cpus).  This tunable
> will minimize the number of packages carrying load, enabling the other
> completely idle packages to go to the available deepest P and C states.

Hi Suresh,

Thanks for the explanation.  I tried few more experiments based on
your comments and found that in a long running CPU intensive job, the
heuristics did help to get the load on the same package.

> > 
> > The scheduler heuristics for multi core system
> > /sys/devices/system/cpu/sched_mc_power_savings should ideally extend
> > the cpu tickless idle time atleast on few CPU in an SMP machine.
> 
> Not really. Ideally sched_mc_power_savings re-distributes the load(in
> the scenarios mentioned above) to different logical cpus in the system, so
> as to minimize the busy packages in the system.

But the problem is we will not be able to get longer idle times and go
to lower sleep states on other packages if we are not able to move the
short bursts of daemon jobs to busy packages.

We are looking at various means to increase uninterrupted CPU idle
times on atleast some of the CPUs in an under-utilised SMP machine.
 
> > Experiment 1:
> > -
> ...
> > Observations with sched_mc_power_savings=1:
> > 
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Significant idle time improvement on CPU2
> > * However, significant idle time reduction on CPU3
> 
> In your setup, CPU0 and 3 are the core siblings? If so, this is probably
> expected. Previously there was some load distribution happening between
> packages. With sched_mc_power_savings, load is now getting distributed
> between cores in the same package. But on my systems, typically CPU0 and 2
> are core siblings. I will let you confirm your system topology, before we
> come to conclusions.

No, in contrary to your observation, on my machine, CPU0 and CPU1 are
siblings on the first package while CPU2 and CPU3 are in the second
package.

> > Experiment 2:
> > -
> ...
> > 
> > Observations with sched_mc_power_savings=1:
> > 
> > * No major impact of sched_mc_power_savings on CPU0 and CPU1
> > * Good idle time improvement on CPU2 and CPU3
> 
> I would have expected similar results like in experiment-1. CPU3 data
> seems almost same with and without sched_mc_power_savings. Atleast the
> data is not significantly different as in other cases like CPU2 for ex.

There is a general improvement in idle time in this experiment as
compared to the first experiment.  So if sched_mc will not impact
short running tasks (daemons and other event managers in idle system),
then the observation on experiment-1 may be just a random
redistribution of tasks over time.

> > Please review the experiment and comment on how the effectiveness of
> > sched_mc_power_savings can be analysed.  
> 
> While we should see a no change or a simple redistribution(to minimize 
> busy packages) in your experiments, for evaluating sched_mc_power_savings,
> we can also use some lightly loaded system (like kernel-compilation or
> specjbb with 2 threads on  a DP with dual-core, for example and see how the
> load is distributed with and without MC power savings.)

I have tried running two CPU intensive threads.  With sched_mc turned
ON, they were scheduled on CPU0-1 or CPU2-3.  With sched_mc turned
OFF, I can see that they were scheduled arbitrarily.  

Later I added random usleep() to make them consume less CPU, in that
case the scheduling was very random.  The two threads with random
sleeps were scheduled on any of the 4 CPUs and the sched_mc parameter
had no effect in consolidating the threads to one package.  Addition
of sleeps makes the threads very short run and not CPU intensive.
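
(A minimal sketch of the kind of test threads described above;
illustrative, not the exact program that was used:)

/* two worker threads; pass any argument to get the short-run variant */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static int bursty;

static void *worker(void *unused)
{
	volatile unsigned long x = 0;
	unsigned long i;

	for (;;) {
		for (i = 0; i < 1000000; i++)
			x += i;				/* burn cpu */
		if (bursty)
			usleep(rand() % 10000);		/* make it short-run */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t1, t2;

	bursty = (argc > 1);
	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);				/* never returns */
	return 0;
}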

This completely validates your explanation of consolidating only 'long
running' jobs.

How do we take this technique to the next step where we can
consolidate short running jobs as well?  Did you face any difficulty
biasing the CPU for short running jobs?

> Similarly it will be interesting to see how this data varies with and without
> tickless.

I am sure tickless has very significant impact on the idle time.
I will get you more data on this comparison.
 
> For some loads which can't be caught by periodic load balancer, we may
> not see any difference. But atleast we should not see any scheduling
> anomalies.

True, we did not see any scheduling anomalies yet.

Thanks,
Vaidy


Analysis of sched_mc_power_savings

2008-01-08 Thread Vaidyanathan Srinivasan
Hi,

The following experiments were conducted on a two socket dual core
intel processor based machine in order to understand the impact of
sched_mc_power_savings scheduler heuristics.

Kernel linux-2.6.24-rc6:

CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y

tick-sched.c has been instrumented to collect idle entry and exit time
stamps.

Instrumentation patch:

Instrument tick-sched nohz code and generate time stamp trace data.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 kernel/time/tick-sched.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- linux-2.6.24-rc6.orig/kernel/time/tick-sched.c
+++ linux-2.6.24-rc6/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/profile.h>
 #include <linux/sched.h>
 #include <linux/tick.h>
+#include <linux/ktrace.h>
 
 #include <asm/irq_regs.h>
 
@@ -200,7 +201,10 @@ void tick_nohz_stop_sched_tick(void)
if (ts->tick_stopped) {
delta = ktime_sub(now, ts->idle_entrytime);
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-   }
+   ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick, KT_EVENT_INFO1,
+   ktime_to_ns(now), ts->idle_calls);
>>>>>>>>>>>>>>>>Tracepoint A

+   } else
+   ktrace_log2(KT_FUNC_tick_nohz_stop_sched_tick,
KT_EVENT_FUNC_ENTER, ktime_to_ns(now), 0);
 
>>>>>>>>>>>>>>>>Tracepoint B

ts->idle_entrytime = now;
ts->idle_calls++;
@@ -391,6 +395,8 @@ void tick_nohz_restart_sched_tick(void)
tick_do_update_jiffies64(now);
now = ktime_get();
}
+   ktrace_log2(KT_FUNC_tick_nohz_restart_sched_tick, KT_EVENT_FUNC_EXIT,
+   ktime_to_ns(now), ts->idle_calls);
>>>>>>>>>>>>>>>>Tracepoint C
local_irq_enable();
 }

The idle time collected is the time stamp at (C) minus that at (B). This is the
time interval between stopping ticks and restarting ticks in an idle
system.
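
(For illustration, the pairing done by the post-processing can be
sketched as below; the record layout and the KT_* values are assumed
here, the real decoding is in the python script linked below:)

#define KT_EVENT_FUNC_ENTER	1	/* values assumed for illustration */
#define KT_EVENT_FUNC_EXIT	2

struct ktrace_rec {
	unsigned int		func;	/* KT_FUNC_* id */
	unsigned int		event;	/* KT_EVENT_* id */
	unsigned long long	ns;	/* ktime_to_ns(now) at the tracepoint */
	unsigned long long	data;	/* idle_calls or 0 */
};

/* total tickless idle time: sum of (C) - (B) pairs for one cpu */
static unsigned long long idle_ns(const struct ktrace_rec *r, int n)
{
	unsigned long long total = 0, last_stop = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (r[i].event == KT_EVENT_FUNC_ENTER) {	/* tracepoint B */
			last_stop = r[i].ns;
		} else if (r[i].event == KT_EVENT_FUNC_EXIT && last_stop) {
			total += r[i].ns - last_stop;		/* tracepoint C */
			last_stop = 0;
		}
	}
	return total;
}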

Complete patch series:
http://svaidy.googlepages.com/1-klog.patch
http://svaidy.googlepages.com/1-trace-sched.patch

Userspace program to extract trace data:
http://svaidy.googlepages.com/1-klog.c

Python script to post process binary trace data:
http://svaidy.googlepages.com/1-sched-stats.py

Gnuplot scripts that was used to generate the graphs:
http://svaidy.googlepages.com/1-multiplot.gp

The scheduler heuristics for multi core system
/sys/devices/system/cpu/sched_mc_power_savings should ideally extend
the cpu tickless idle time atleast on few CPU in an SMP machine.

However in the experiment it was found that turning on
sched_mc_power_savings marginally increased the idle time in only some
of the CPUs.

Experiment 1:
-

Setup:

* yum-updated and irqbalance daemon was stopped to reduce the idle
  wakeup rate
* All irqs manually routed to CPU0 only (hoping this will keep other
  CPUs idle) (http://svaidy.googlepages.com/1-irq-config.txt)
* Powertop shows around 35 wakeups per second during idle
  (http://svaidy.googlepages.com/1-powertop-screen.txt)
* The trace of idle time stamps was collected for 120 seconds with the
  system in idle state

Results:

There are 4 png files that plot the idle time for each CPU in the
system.  

Please get the graphs from the following URLs

http://svaidy.googlepages.com/1-idle-cpu0.png
http://svaidy.googlepages.com/1-idle-cpu1.png
http://svaidy.googlepages.com/1-idle-cpu2.png
http://svaidy.googlepages.com/1-idle-cpu3.png

Each png file has 4 graphs plotted that is relevant to one CPU

* Right-top plot is the idle time sample obtained during the
  experiment
* Left-top graph is histogram of right top plot
* The bottom graphs correspond to idle times when
  sched_mc_power_savings=1

Observations with sched_mc_power_savings=1:

* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Significant idle time improvement on CPU2
* However, significant idle time reduction on CPU3

Experiment 2:
-

Setup:

* USB stopped
* Most daemons like yum-updatesd, hal, autofs, syslog, crond, irqbalance,
  sendmail, pcscd were stopped  
* Interrupt routing left to default but irqbalance daemon stopped  
* Powertop shows around 4 wakeups per second during idle
  (http://svaidy.googlepages.com/2-powertop-screen.txt) 
* The trace of idle time stamps was collected for 120 seconds with the
  system in idle state

Results:

There are 4 png files that plot the idle time for each CPU in the
system.  

http://svaidy.googlepages.com/2-idle-cpu0.png
http://svaidy.googlepages.com/2-idle-cpu1.png
http://svaidy.googlepages.com/2-idle-cpu2.png
http://svaidy.googlepages.com/2-idle-cpu3.png

The details of the plot are same as the previous experiment.

Observations with sched_mc_power_savings=1:

* No major impact of sched_mc_power_savings on CPU0 and CPU1
* Good idle time improvement on CPU2 and CPU3


Re: 1.0.0.0 DNS replies for many domain names (network)

2007-12-16 Thread Vaidyanathan Srinivasan
* Amogh Hooshdar <[EMAIL PROTECTED]> [2007-12-14 17:20:17]:

> I am having a strange problem with Debian Etch 4.0 (both 64-bit and
> 32-bit) using 2.6.18 kernel. Most websites do not open with browser,
> Pidgin and most other GUI applicatoins. but I am able to ping them
> fine. I am also able to do nslookup properly. When I tried to
> investigate it with Wireshark net sniffer, I observed the following.
> 
> PROBLEM WITH 2.6.18
> Say, I try to open www.google.com, browser sends DNS query for
> www.google.com to my DNS server which is correctly configured in
> resolv.conf. It replies with the correct IP address. www.google.com
> redirects the browser to www.google.co.in. browser sends a DNS query
> again for www.google.co.in and the DNS server replies with 1.0.0.0
> which obviously is the wrong address.

I had this problem on Debian 4.0 and it was due to a bug in the DSL
router.  I had the DNS server set to 192.168.1.1, which is my DSL
router; it holds the real DNS IP and forwards the DNS lookup requests.

Once in a while the router's DNS proxy will give out 1.0.0.0.  The
solution I used was to find the real DNS servers and fill them in
resolv.conf.  This bypasses the DNS proxy on the router, and the
problem went away.
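
For example, the working /etc/resolv.conf ended up looking roughly like
this (the addresses below are placeholders; use the real DNS servers given
by your ISP instead of the router's 192.168.1.1):

# /etc/resolv.conf  (hypothetical example)
nameserver 203.0.113.53
nameserver 203.0.113.54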

https://bugs.launchpad.net/ubuntu/+bug/81057

--Vaidy
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1 on PPC64: machine check exception

2007-11-15 Thread Vaidyanathan Srinivasan
> 
>> http://patchwork.ozlabs.org/linuxppc/patch?id=14475
> 
> Thanks for pointing me to this patch.  I will try it out and let you know if
> it fixes the problem.

Hi Anton,

This patch fixed the problem.  I am able to run and profile ebizzy on the
128-way PPC64.  However, this fix is not included in 2.6.24-rc2 either.
I will watch for inclusion of this fix in 2.6.24.

--Vaidy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1 on PPC64: machine check exception

2007-11-05 Thread Vaidyanathan Srinivasan
Anton Blanchard wrote:
> Hi
> 
>> I got the following exception on a 128-way PPC64 machine while running
>> ebizzy-0.2 benchmark.
> 
> Is it a 64bit application? Im guessing its fixed by this:

Hi Anton,

The ebizzy application I tried is 64-bit.

> http://patchwork.ozlabs.org/linuxppc/patch?id=14475

Thanks for pointing me to this patch.  I will try it out and let you know if
it fixes the problem.

--Vaidy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.24-rc1 on PPC64: machine check exception

2007-11-05 Thread Vaidyanathan Srinivasan
Hi,

I got the following exception on a 128-way PPC64 machine while running
the ebizzy-0.2 benchmark.

cpu 0x48: Vector: 200 (Machine Check) at [c06df1bb37c0]
pc: c007a26c: .exit_robust_list+0x78/0x228
lr: c007a240: .exit_robust_list+0x4c/0x228
sp: c06df1bb3a40
   msr: 80109032
  current = 0xc06df1725120
  paca= 0xc06f4000
pid   = 5950, comm = ebizzy
cpu 0x78: Vector: 200 (Machine Check) at [c00e21bd37c0]
pc: c007a26c: .exit_robust_list+0x78/0x228
lr: c007a240: .exit_robust_list+0x4c/0x228
sp: c00e21bd3a40
   msr: 80109032
  current = 0xc00e pa  paca= 0xc06fb800
cidpid   = 5941, comm = ebizzy
   = 5941,[c06df1bb3b10] c00578a8 .do_exit+0x260/0x7fc
[c06df1bb3bc0] c0057ef8 .sys_exit_group+0x0/0x8
[c06df1bb3c50] c00636e0 .get_signal_to_deliver+0x45c/0x4ac
[c06df1bb3d00] c0011d44 .do_signal+0x68/0x2e4
[c06df1bb3e30] c0008ae8 do_work+0x28/0x2c
--- Exception: 500 (Hardware Interrupt) at 10007698
SP (40016601610) is in userspace

Config file attached. I did not see any similar failure reported with
2.6.24-rc1.

--Vaidy
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc1
# Mon Nov  5 02:25:46 2007
#
CONFIG_PPC64=y

#
# Processor support
#
# CONFIG_POWER4_ONLY is not set
CONFIG_POWER3=y
CONFIG_POWER4=y
# CONFIG_TUNE_CELL is not set
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_SMP=y
CONFIG_NR_CPUS=128
CONFIG_64BIT=y
CONFIG_WORD_SIZE=64
CONFIG_PPC_MERGE=y
CONFIG_MMU=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_IRQ_PER_CPU=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_ARCH_NO_VIRT_TO_BUS=y
CONFIG_PPC=y
CONFIG_EARLY_PRINTK=y
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_OF=y
CONFIG_OF=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_DEFAULT_UIMAGE is not set
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-sv"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_CPUSETS=y
# CONFIG_FAIR_GROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_PROC_PID_CPUSET=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_BLK_DEV_BSG is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Platform support
#
CONFIG_PPC_MULTIPLATFORM=y
# CONFIG_PPC_82xx is not set
# CONFIG_PPC_83xx is not set
# CONFIG_PPC_86xx is not set
CONFIG_PPC_PSERIES=y
CONFIG_PPC_SPLPAR=y
CONFIG_EEH=y
CONFIG_SCANLOG=m
CONFIG_LPARCFG=y
# CONFIG_PPC_ISERIES is not set
# CONFIG_PPC_MPC52xx is not set
# CONFIG_PPC_MPC5200 is not set
# CONFIG_PPC_PMAC is not set
# CONFIG_PPC_MAPLE is not set
# CONFIG_PPC_PASEMI is not set
# CONFIG_PPC_CELLEB is not set
# CONFIG_PPC_PS3 is not set
# CONFIG_PPC_CELL is not set
# CONFIG_PPC_CELL_NATIVE is not set
# CONFIG_PPC_IBM_CELL_BLADE is not set
# CONFIG_PQ2ADS is not set
CONFIG_PPC_NATIVE=y
# CONFIG_UDBG_RTAS_CONSOLE is not set
CONFIG_XICS=y


Re: Question about free/used memory on Linux

2007-10-21 Thread Vaidyanathan Srinivasan

Ravinandan Arakali (rarakali) wrote:
> Hi Vaidy,
> Thanks for clarifying several of my doubts.
> 
> To answer your question about my intention, we currently have a 
> system with 2 GB RAM and I need to find out the actual used and 
> free memory so that we can decide if the same setup(applications,
> tmpfs etc.) can run on another system with lesser memory.
> 
> Is it correct to say that the "used" field "free -m" excluding
> buffers/caches would give the correct idea of used memory
> (I mean does it take care of shared memory, shared copies of 
> libraries etc.) ? I assume it does not include /dev/shm usage
> since that's also a tmpfs partition ?

That's correct. The used excluding the buffers/cache gives most of the
memory used by the system.  Note that this excludes _all_ file backed
memory, including shm.

> 
> If so, then I can add the memory used by tmpfs partitions to
> the above and get the total memory used ?
> 
> For eg. if my "free -m" appears as below:
> Linux(debug)# free -m
>              total       used       free     shared    buffers     cached
> Mem:          2014        984       1030          0         80        594
> -/+ buffers/cache:        309       1705
> 
> Can I say that 309MB + 350 MB(size of tmpfs partitions including
> /dev/shm)
> is the used memory on my system ?

Two problems with this logic:

1. All of tmpfs may not really be used.  You would be over-committing.
2. You still miss the pages needed to map the program code.  They are file
backed too, though this will be much less memory compared to data and
shared memory.

Let me suggest a metric:

1. Take the used part excluding the buffers/cache (309MB in your case) and add
'Mapped' and 'Dirty' from /proc/meminfo (a rough sketch of this follows below)

This may be better than adding tmpfs/shmdev size.

2. Once your system is running with all applications loaded, clean up the
pagecache (file data cached in memory)

sync
echo 1 > /proc/sys/vm/drop_caches

The first sync will bring down 'Dirty' count and drop_caches will reclaim
all 'not needed' file cache memory.

Now if you use 'free' and take the used count _with_ the buffers and file
cache, this will provide a realistic value. (Actually Free in /proc/meminfo)

Do not exclude buffers; they are _needed_ for optimum system operation.
With the above figure, you can probably add 10% or more memory as extra
memory for file cache when the system is operating under full load.

If you want to be sure, boot your system with less memory using the
mem=xxx kernel parameter and run some performance tests to ensure the
degradation is within acceptable limits.
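
As a rough, illustrative sketch of the metric in step 1 above (not a
precise accounting tool, and with no error handling), something like the
following reads /proc/meminfo and prints "used excluding buffers/cache"
plus Mapped and Dirty:

/* Rough sketch (no error handling) of the metric suggested in step 1:
 *   used   = MemTotal - MemFree - Buffers - Cached   ("-/+ buffers/cache" used)
 *   metric = used + Mapped + Dirty
 * All values are in kB as reported by /proc/meminfo. */
#include <stdio.h>
#include <string.h>

static long get_kb(const char *key)
{
	FILE *f = fopen("/proc/meminfo", "r");
	char name[64];
	long val, kb = -1;

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld kB", name, &val) == 2)
		if (strcmp(name, key) == 0)
			kb = val;
	fclose(f);
	return kb;
}

int main(void)
{
	long used = get_kb("MemTotal:") - get_kb("MemFree:")
		  - get_kb("Buffers:") - get_kb("Cached:");
	long metric = used + get_kb("Mapped:") + get_kb("Dirty:");

	printf("used excluding buffers/cache: %ld kB\n", used);
	printf("used + Mapped + Dirty:        %ld kB\n", metric);
	return 0;
}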

--Vaidy


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Question about free/used memory on Linux

2007-10-21 Thread Vaidyanathan Srinivasan
Ravinandan Arakali (rarakali) wrote:
> Hi kernel gurus,
> I am trying to find out the memory that's used on my linux box.
> I find that there are quite a few confusing metrics. How do
> I find out the "true" used memory ?

Pagecache pages may be the reason for the confusion.  Most otherwise free
memory gets consumed by the page cache and shows up under 'Cached' in
/proc/meminfo.

Most of this memory is easily reclaimable and can be considered 'free'.
However the cost of reclaim increases if the pages are memory mapped and/or
dirty.

Cached - Mapped - Dirty in /proc/meminfo can be seen as easily freeable
memory and can be added to the 'Free' count from /proc/meminfo.
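
As a purely hypothetical worked example: if /proc/meminfo showed MemFree
100 MB, Cached 600 MB, Mapped 200 MB and Dirty 20 MB, then the easily
freeable estimate would be 600 - 200 - 20 = 380 MB, and roughly
100 + 380 = 480 MB could be treated as available without costly reclaim.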

What is your intention in finding the free memory in the system?  The Linux
kernel decides how best to use the available RAM for file cache or for
application code/data as and when needed.  Ideally the 'Free' count will
be a very low value on a heavily used system.  If an application needs more
memory, then the kernel will shrink the caches and give the reclaimed
memory to the application.

> 
> 1. For eg. "free -m" shows free memory (excluding buffers/caches) 
> as 308 MB while I can see(from "df" output) that the the tmpfs 
> partitions take up about 400 MB. So, does "free -m" not consider 
> the tmpfs partitions ?

Pages used from tmpfs should come under Mapped or Cached.  They are not
counted as free.

> 2. I try to add up RSS field of all processes reported by
> "ps aux" command. But is it true that this would be misleading
> in that, shared memory used by, say 2 processes would show
> up twice here although there's only one copy in memory. Also
> does this consider the fact that there's only one copy
> of shared libraries ?
>

RSS is from each process's point of view.  If the page is present in RAM, it
is counted.  If the page is shared, then it is counted in both processes'
address spaces.

> 3. I guess "free -m" and "top" commands use /proc/meminfo 
> and hence all these outputs are same ?

Yes, all of them parse /proc/meminfo

--Vaidy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH 0/9] Memory controller introduction (v4)

2007-08-09 Thread Vaidyanathan Srinivasan

KAMEZAWA Hiroyuki wrote:
> On Wed, 8 Aug 2007 12:51:39 +0900
> KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
> 
>> On Sat, 28 Jul 2007 01:39:37 +0530
>> Balbir Singh <[EMAIL PROTECTED]> wrote:
>>> At OLS, the resource management BOF, it was discussed that we need to manage
>>> RSS and unmapped page cache together. This patchset is a step towards that
>>>
>> Can I make a question ? Why limiting RSS instead of # of used pages per
>> container ? Maybe bacause of shared pages between container 
> SorryIgnore above question.
> I didn't understand what mem_container_charge() accounts and limits.
> It controls # of meta_pages.

Hi Kame,

Actually the number of pages resident in memory brought in by a
container is charged.  However each such page will have a meta_page
allocated to keep the extra data.

Yes, the accounting counts the number of meta_pages, which is the same as
the number of mapped and unmapped (pagecache) pages brought into
system memory by this container.  Whether pagecache pages should be
included or not is configurable per container through the 'type' file
in containerfs.
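
As an illustrative sketch of that bookkeeping (field names follow the ones
visible elsewhere in this thread; the real definition in the patchset may
differ slightly):

/* One meta_page per page brought into memory by a container; the
 * container's usage is simply the number of these. Illustrative only. */
struct meta_page {
	struct list_head lru;		/* per-container (in)active list linkage */
	struct page *page;		/* the page being tracked */
	struct mem_container *mem_cont;	/* container charged for this page */
	atomic_t ref_cnt;		/* mapped + pagecache users of the page */
};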

--Vaidy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v4)

2007-08-07 Thread Vaidyanathan Srinivasan


Vaidyanathan Srinivasan wrote:
> 
> YAMAMOTO Takashi wrote:
>>> +unsigned long mem_container_isolate_pages(unsigned long nr_to_scan,
>>> +   struct list_head *dst,
>>> +   unsigned long *scanned, int order,
>>> +   int mode, struct zone *z,
>>> +   struct mem_container *mem_cont,
>>> +   int active)
>>> +{
>>> +   unsigned long nr_taken = 0;
>>> +   struct page *page;
>>> +   unsigned long scan;
>>> +   LIST_HEAD(mp_list);
>>> +   struct list_head *src;
>>> +   struct meta_page *mp;
>>> +
>>> +   if (active)
>>> +   src = &mem_cont->active_list;
>>> +   else
>>> +   src = &mem_cont->inactive_list;
>>> +
>>> +   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>>> +   mp = list_entry(src->prev, struct meta_page, lru);
>> what prevents another thread from freeing mp here?
> 
> mem_cont->lru_lock protects the list and validity of mp.  If we hold
> mem_cont->lru_lock for this entire loop, then we preserve the validity
> of mp.  However that will be holding up container charge and uncharge.
> 
> This entire routine is called with zone->lru_lock held by the caller.
>  So within a zone, this routine is serialized.
> 
> However page uncharge may race with isolate page.  But will that lead
> to any corruption of the list?  We may be holding the lock for too
> much time just to be on the safe side.
> 
> Please allow us some time to verify whether this is indeed inadequate
> locking that will lead to corruption of the list.

I did a few runs and checked the ref_cnt on the meta_page, and there seems
to be a race between isolate pages and uncharge.  We will probably have
to increase the ref_cnt on the meta_page while we are isolating it.  I am
trying to see if we can solve the problem by manipulating the ref_cnt
on the meta_page.
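
A rough sketch of that idea (untested, for discussion only) would be to
take the extra reference while still holding mem_cont->lru_lock, so a
racing uncharge cannot free the meta_page between the lookup and the use:

	/* Sketch only, not a tested fix: pin mp under mem_cont->lru_lock so a
	 * concurrent uncharge cannot free it while we isolate the page. */
	spin_lock(&mem_cont->lru_lock);
	if (list_empty(src)) {
		spin_unlock(&mem_cont->lru_lock);
		break;
	}
	mp = list_entry(src->prev, struct meta_page, lru);
	atomic_inc(&mp->ref_cnt);	/* hold a reference across the isolation */
	page = mp->page;
	spin_unlock(&mem_cont->lru_lock);
	/* ... isolate the page, then drop the extra reference afterwards ... */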

--Vaidy

> Thanks for pointing out this situation.
> --Vaidy
> 
>>> +   spin_lock(&mem_cont->lru_lock);
>>> +   if (mp)
>>> +   page = mp->page;
>>> +   spin_unlock(&mem_cont->lru_lock);
>>> +   if (!mp)
>>> +   continue;
>> YAMAMOTO Takashi
>> ___
>> Containers mailing list
>> [EMAIL PROTECTED]
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH 6/9] Memory controller add per container LRU and reclaim (v4)

2007-07-31 Thread Vaidyanathan Srinivasan


YAMAMOTO Takashi wrote:
>> +unsigned long mem_container_isolate_pages(unsigned long nr_to_scan,
>> +struct list_head *dst,
>> +unsigned long *scanned, int order,
>> +int mode, struct zone *z,
>> +struct mem_container *mem_cont,
>> +int active)
>> +{
>> +unsigned long nr_taken = 0;
>> +struct page *page;
>> +unsigned long scan;
>> +LIST_HEAD(mp_list);
>> +struct list_head *src;
>> +struct meta_page *mp;
>> +
>> +if (active)
>> +src = &mem_cont->active_list;
>> +else
>> +src = &mem_cont->inactive_list;
>> +
>> +for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>> +mp = list_entry(src->prev, struct meta_page, lru);
> 
> what prevents another thread from freeing mp here?

mem_cont->lru_lock protects the list and validity of mp.  If we hold
mem_cont->lru_lock for this entire loop, then we preserve the validity
of mp.  However that will be holding up container charge and uncharge.

This entire routine is called with zone->lru_lock held by the caller.
 So within a zone, this routine is serialized.

However page uncharge may race with isolate page.  But will that lead
to any corruption of the list?  We may be holding the lock for too
much time just to be on the safe side.

Please allow us some time to verify whether this is indeed inadequate
locking that will lead to corruption of the list.

Thanks for pointing out this situation.
--Vaidy

>> +spin_lock(&mem_cont->lru_lock);
>> +if (mp)
>> +page = mp->page;
>> +spin_unlock(&mem_cont->lru_lock);
>> +if (!mp)
>> +continue;
> 
> YAMAMOTO Takashi
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH 4/9] Memory controller memory accounting (v4)

2007-07-31 Thread Vaidyanathan Srinivasan


YAMAMOTO Takashi wrote:
>> +lock_meta_page(page);
>> +/*
>> + * Check if somebody else beat us to allocating the meta_page
>> + */
>> +race_mp = page_get_meta_page(page);
>> +if (race_mp) {
>> +kfree(mp);
>> +mp = race_mp;
>> +atomic_inc(&mp->ref_cnt);
>> +res_counter_uncharge(&mem->res, 1);
>> +goto done;
>> +}
> 
> i think you need css_put here.

That's correct. We do need css_put in this path.
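
For reference, a sketch of where the css_put would go in the race path
(names as in the hunk quoted above, assuming mem's css was pinned with
css_get() earlier in the charge path; sketch only, not the final patch):

	if (race_mp) {
		kfree(mp);
		mp = race_mp;
		atomic_inc(&mp->ref_cnt);
		res_counter_uncharge(&mem->res, 1);
		css_put(&mem->css);	/* drop the reference taken for the losing charge */
		goto done;
	}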

Thanks,
Vaidy

> YAMAMOTO Takashi
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [-mm PATCH 4/7] Memory controller memory accounting

2007-07-05 Thread Vaidyanathan Srinivasan


Balbir Singh wrote:
> Add the accounting hooks. The accounting is carried out for RSS and Page
> Cache (unmapped) pages. There is now a common limit and accounting for both.
> The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
> time. Page cache is accounted at add_to_page_cache(),
> __delete_from_page_cache(). Swap cache is also accounted for.
> 
> Each page's meta_page is protected with a bit in page flags, this makes
> handling of race conditions involving simultaneous mappings of a page easier.
> A reference count is kept in the meta_page to deal with cases where a page
> might be unmapped from the RSS of all tasks, but still lives in the page
> cache.
> 
> Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
> ---
> 
>  fs/exec.c  |1 
>  include/linux/memcontrol.h |   11 +++
>  include/linux/page-flags.h |3 +
>  mm/filemap.c   |8 ++
>  mm/memcontrol.c|  132 
> -
>  mm/memory.c|   22 +++
>  mm/migrate.c   |6 ++
>  mm/page_alloc.c|3 +
>  mm/rmap.c  |2 
>  mm/swap_state.c|8 ++
>  mm/swapfile.c  |   40 +++--
>  11 files changed, 218 insertions(+), 18 deletions(-)
> 

[snip]

> diff -puN mm/migrate.c~mem-control-accounting mm/migrate.c
> --- linux-2.6.22-rc6/mm/migrate.c~mem-control-accounting  2007-07-04 
> 15:05:27.0 -0700
> +++ linux-2.6.22-rc6-balbir/mm/migrate.c  2007-07-04 15:05:27.0 
> -0700
> @@ -28,6 +28,7 @@
>  #include <linux/mempolicy.h>
>  #include <linux/vmalloc.h>
>  #include <linux/security.h>
> +#include <linux/memcontrol.h>
> 
>  #include "internal.h"
> 
> @@ -157,6 +158,11 @@ static void remove_migration_pte(struct 
>   return;
>   }
> 
> + if (mem_container_charge(page, mm)) {

Minor correction.  The above line should be

if (mem_container_charge(new, mm)) {

to avoid compilation error.

--Vaidy

[snip]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IRQ handling difference between i386 and x86_64

2007-07-02 Thread Vaidyanathan Srinivasan

Arjan van de Ven wrote:
> On Sat, 2007-06-30 at 16:55 +0200, Krzysztof Oledzki wrote:
>> Hello,
>>
>> It seems that IRQ handling is somehow different between i386 and x86_64.
>>
>> In my Dell PowerEdge 1950 is it possible to enable interrupts spreading 
>> over all CPUs. This a single CPU, four CORE system (Quad-Core E5335 Xeon) 
>> so I think that interrupts migration may be useful. Unfortunately, it 
>> works only with 32-bit kernel. Booting it with x86_64 leads to situation, 
>> when all interrupts goes only to the first cpu matching a smp_affinity 
>> mask.
> 
> arguably that is the most efficient behavior... round robin of
> interrupts is the worst possible case in terms of performance
> 
> are you using irqbalance ? (www.irqbalance.org)

If you have not been using irqbalance, then setting more bits in
irq/smp_affinity will utilize hardware APIC routing. Sending interrupts
to more than one CPU will work in flat mode only.

In the 32-bit kernel the IOAPIC is configured in flat mode, while in the
64-bit kernel it is in physical flat mode.  Physical flat mode supports
interrupt routing to only one CPU.

32-bit: Enabling APIC mode:  Flat.  Using 2 I/O APICs
64-bit: Setting APIC routing to physical flat

This is the reason for the observed interrupt routing behavior.  I
guess future kernels will use 'physical flat' mode to avoid hardware
bugs on various complex configurations and also to make it scale up to
larger systems.  Also, utilizing hardware routing to send interrupts to
many CPUs may not provide the desired behavior.

The irqbalance application will do the needful.  I agree with Arjan
that equal interrupt distribution need not provide the best
performance.  You will run into complex routing issues only when you
have more NICs than the actual number of CPU cores.

--Vaidy


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 3/3] Pagecache reclaim

2007-06-29 Thread Vaidyanathan Srinivasan
Pagecache controller reclaim changes


The reclaim path needs performance improvement.
For now these are minor changes to include unmapped
pages in our list of page_containers.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 mm/rss_container.c |3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -243,9 +243,6 @@ void container_rss_move_lists(struct pag
struct rss_container *rss;
struct page_container *pc;

-   if (!page_mapped(pg))
-   return;
-
pc = page_container(pg);
if (pc == NULL)
return;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 2/3] Pagecache and RSS accounting hooks

2007-06-29 Thread Vaidyanathan Srinivasan
Pagecache and RSS accounting Hooks
--

New calls have been added from swap_state.c and filemap.c to track pagecache and
swapcache pages.

All existing RSS hooks have been generalised for pagecache accounting as well.

Most of these are function prototype changes.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 fs/exec.c   |4 ++--
 mm/filemap.c|   17 +
 mm/memory.c |   18 +-
 mm/migrate.c|4 ++--
 mm/rmap.c   |   12 ++--
 mm/swap_state.c |   16 
 mm/swapfile.c   |4 ++--
 7 files changed, 54 insertions(+), 21 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/fs/exec.c
+++ linux-2.6.22-rc2-mm1/fs/exec.c
@@ -321,7 +321,7 @@ void install_arg_page(struct vm_area_str
if (unlikely(anon_vma_prepare(vma)))
goto out;

-   if (container_rss_prepare(page, vma, &pcont))
+   if (container_page_prepare(page, vma->vm_mm, &pcont))
goto out;

flush_dcache_page(page);
@@ -343,7 +343,7 @@ void install_arg_page(struct vm_area_str
return;

 out_release:
-   container_rss_release(pcont);
+   container_page_release(pcont);
 out:
__free_page(page);
force_sig(SIGKILL, current);
--- linux-2.6.22-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc2-mm1/mm/filemap.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "filemap.h"
 #include "internal.h"

@@ -117,6 +118,9 @@ void __remove_from_page_cache(struct pag
struct address_space *mapping = page->mapping;

radix_tree_delete(&mapping->page_tree, page->index);
+   /* Uncharge before the mapping is gone */
+   if (page_container(page))
+   container_page_del(page_container(page));
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -440,6 +444,8 @@ int add_to_page_cache(struct page *page,
pgoff_t offset, gfp_t gfp_mask)
 {
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+   struct page_container *pc;
+   struct mm_struct *mm;

if (error == 0) {
write_lock_irq(&mapping->tree_lock);
@@ -453,6 +459,17 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
+   /* Unlock before charge, because we may reclaim this inline */
+   if(!error) {
+   if (current->mm)
+   mm = current->mm;
+   else
+   mm = &init_mm;
+   if (!container_page_prepare(page, mm, &pc))
+   container_page_add(pc);
+   else
+   BUG();
+   }
radix_tree_preload_end();
}
return error;
--- linux-2.6.22-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.22-rc2-mm1/mm/memory.c
@@ -1755,7 +1755,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
}

-   if (container_rss_prepare(new_page, vma, &pcont))
+   if (container_page_prepare(new_page, vma->vm_mm, &pcont))
goto oom;

/*
@@ -1791,7 +1791,7 @@ gotten:
new_page = old_page;
ret |= VM_FAULT_WRITE;
} else
-   container_rss_release(pcont);
+   container_page_release(pcont);

if (new_page)
page_cache_release(new_page);
@@ -2217,7 +2217,7 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}

-   if (container_rss_prepare(page, vma, &pcont)) {
+   if (container_page_prepare(page, vma->vm_mm, &pcont)) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2235,7 +2235,7 @@ static int do_swap_page(struct mm_struct

if (unlikely(!PageUptodate(page))) {
ret = VM_FAULT_SIGBUS;
-   container_rss_release(pcont);
+   container_page_release(pcont);
goto out_nomap;
}

@@ -2271,7 +2271,7 @@ unlock:
 out:
return ret;
 out_nomap:
-   container_rss_release(pcont);
+   container_page_release(pcont);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
@@ -2302,7 +2302,7 @@ static int do_anonymous_page(struct mm_s
if (!page)
goto oom;

-   if (container_rss_prepare(page, vma, &pcont))
+   if (container_page_prepare(page, vma->vm_mm, &pcont))
goto oom_release;

entry = mk_pte(page, vma->vm_page_prot);
@@ -2338,7 +2338,7 @@ unlock:
pte_unmap_unlock(page_table, ptl);
return VM_FAULT_MINOR;
 release_container:
-   container_rss_release(pcont);
+   container_page_

[RFC][PATCH 1/3] Pagecache accounting

2007-06-29 Thread Vaidyanathan Srinivasan
Pagecache Accounting


The rss accounting hooks have been generalised to handle both anon pages
and file backed pages and charge the resource counter.

Ref count has been added to page_container structure.  The ref count is used
to ensure a page is added or removed from page_container list only once
independent of repeated calls from pagecache, swapcache and mmap to RSS.

No setup patch is required since rss_limit and rss_usage has been generalised
as the resource counter for pagecache as well.
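
As a minimal userspace sketch of the ref-count idea (illustrative only; the
structure fields, the limit value and the reclaim stub below are made up and
are not the kernel code in this patch): a page is charged to the container
only on its first addition, repeated additions from the pagecache, swapcache
and mmap paths only bump the reference count, and the charge path retries
after a reclaim pass when the limit is hit.

#include <stdio.h>

struct container {
	unsigned long usage;		/* pages currently charged */
	unsigned long limit;		/* rss_limit in pages */
};

struct page_container {
	unsigned long ref_cnt;		/* additions from rss/pagecache/swapcache */
	struct container *cnt;
};

/* Stand-in for try_to_free_pages_in_container(): pretend one page was reclaimed. */
static int reclaim_one_page(struct container *c)
{
	if (c->usage == 0)
		return 0;		/* nothing left to reclaim */
	c->usage--;
	return 1;
}

/* Charge the container only the first time a page is added; retry after reclaim. */
static int page_add(struct page_container *pc, struct container *c)
{
	if (pc->ref_cnt > 0) {		/* already charged once */
		pc->ref_cnt++;
		return 0;
	}
	while (c->usage + 1 > c->limit) {
		if (!reclaim_one_page(c))
			return -1;	/* would be a container OOM */
	}
	c->usage++;
	pc->cnt = c;
	pc->ref_cnt = 1;
	return 0;
}

/* Uncharge only when the last user of the page drops it. */
static void page_del(struct page_container *pc)
{
	if (pc->ref_cnt == 0 || --pc->ref_cnt > 0)
		return;
	pc->cnt->usage--;
}

int main(void)
{
	struct container box = { .usage = 0, .limit = 2 };
	struct page_container pg = { .ref_cnt = 0, .cnt = NULL };

	page_add(&pg, &box);	/* mmap fault charges the page */
	page_add(&pg, &box);	/* add_to_page_cache() sees it again: no new charge */
	printf("usage=%lu ref=%lu\n", box.usage, pg.ref_cnt);	/* usage=1 ref=2 */
	page_del(&pg);
	page_del(&pg);		/* last reference: the page is uncharged */
	printf("usage=%lu ref=%lu\n", box.usage, pg.ref_cnt);	/* usage=0 ref=0 */
	return 0;
}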

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 include/linux/rss_container.h |   18 ++---
 mm/rss_container.c|  134 --
 2 files changed, 99 insertions(+), 53 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/include/linux/rss_container.h
+++ linux-2.6.22-rc2-mm1/include/linux/rss_container.h
@@ -68,11 +68,11 @@ struct rss_container;
  *
  */

-int container_rss_prepare(struct page *, struct vm_area_struct *vma,
+int container_page_prepare(struct page *, struct mm_struct *mm,
struct page_container **);
-void container_rss_add(struct page_container *);
-void container_rss_del(struct page_container *);
-void container_rss_release(struct page_container *);
+void container_page_add(struct page_container *);
+void container_page_del(struct page_container *);
+void container_page_release(struct page_container *);

 void container_out_of_memory(struct rss_container *);

@@ -85,22 +85,22 @@ unsigned long isolate_pages_in_container
int order, int mode, struct zone *zone,
struct rss_container *, int active);
 #else
-static inline int container_rss_prepare(struct page *pg,
-   struct vm_area_struct *vma, struct page_container **pc)
+static inline int container_page_prepare(struct page *pg,
+   struct mm_struct *mm, struct page_container **pc)
 {
*pc = NULL; /* to make gcc happy */
return 0;
 }

-static inline void container_rss_add(struct page_container *pc)
+static inline void container_page_add(struct page_container *pc)
 {
 }

-static inline void container_rss_del(struct page_container *pc)
+static inline void container_page_del(struct page_container *pc)
 {
 }

-static inline void container_rss_release(struct page_container *pc)
+static inline void container_page_release(struct page_container *pc)
 {
 }

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -56,6 +56,9 @@ struct rss_container {
  */

 struct page_container {
+   unsigned long ref_cnt;  /* Ref cnt to keep track of
+* multiple additions of same page
+*/
struct page *page;
struct rss_container *cnt;
struct list_head list; /* this is the element of (int)active_list of
@@ -93,26 +96,36 @@ void mm_free_container(struct mm_struct
  * I bet you have already read the comment in include/linux/rss_container.h :)
  */

-int container_rss_prepare(struct page *page, struct vm_area_struct *vma,
+int container_page_prepare(struct page *page, struct mm_struct *mm,
struct page_container **ppc)
 {
-   struct rss_container *rss;
-   struct page_container *pc;
-
-   rcu_read_lock();
-   rss = rcu_dereference(vma->vm_mm->rss_container);
css_get(&rss->css);
-   rcu_read_unlock();
-
-   pc = kmalloc(sizeof(struct page_container), GFP_KERNEL);
-   if (pc == NULL)
-   goto out_nomem;
+   struct rss_container *rss;
+   struct page_container *pc;
+   int rc = 1;
+
+   /* Page may have been added to container earlier */
+   pc = page_container(page);
+   /* Check if this is first time addition or not */
+   if (!pc) {
+   rcu_read_lock();
+   rss = rcu_dereference(mm->rss_container);
+   css_get(&rss->css);
+   rcu_read_unlock();
+   } else {
+   rss = pc->cnt;
+   }

-   while (res_counter_charge(&rss->res, 1)) {
-   if (try_to_free_pages_in_container(rss)) {
-   atomic_inc(&rss->rss_reclaimed);
-   continue;
-   }
+   /* Charge the respective resource count first time only */
+   while (rc && !pc) {
+   rc = res_counter_charge(&rss->res, 1);
+
+   if (!rc)
+   break; /* All well */
+
+   if (try_to_free_pages_in_container(rss)) {
+   atomic_inc(&rss->rss_reclaimed);
+   continue; /* Try again to charge container */
+   }

/*
 * try_to_free_pages() might not give us a full picture
@@ -125,60 +138,93 @@ int container_rss_prepare(struct page *p
if (res_counter_check_under_limit(&rss->res))
continue;

-   container_out_of_memory(rss);

[RFC][PATCH 0/3] Containers: Integrated RSS and pagecache control v5

2007-06-29 Thread Vaidyanathan Srinivasan
Containers: Integrated RSS and pagecache accounting and control v5
--

Based on the discussions at OLS yesterday, the consensus was to try an
integrated pagecache controller along with RSS controller under the
same usage limit.

This patch extends the RSS controller to account and reclaim pagecache
and swapcache pages.  The same 'rss_limit' now applies to both RSS pages
and pagecache pages. When the limit is reached, both pagecache and RSS
pages are reclaimed in LRU order as per the normal system wide reclaim
policy.

This patch is based on RSS Controller V3.1 by Pavel and Balbir.  This patch
depends on

1. Paul Menage's Containers(V10): Generic Process Containers
http://lwn.net/Articles/236032/
2. Pavel Emelianov's RSS controller based on process containers (v3.1)
http://lwn.net/Articles/236817/
3. Balbir's fixes for RSS controller as mentioned in
http://lkml.org/lkml/2007/6/04/185

This is very much work-in-progress and it has been posted for comments
after some basic testing with kernbench.

Comments, suggestions and criticisms are welcome.

--Vaidy

Features:

* Single limit for both RSS and pagecache/swapcache pages
* No new subsystem is added. The RSS controller subsystem is extended
  since most of the code can be shared between pagecache control and
  RSS control.
* The accounting numbers include pages in swap cache and filesystem
  buffer pages apart from pagecache; basically everything under
  NR_FILE_PAGES is counted under rss_usage.
* The usage limit set in rss_limit applies to sum of both RSS and
  pagecache pages
* Limits on pagecache can be set by echo -n 100000 > rss_limit on
  the /container file system.  The unit is in pages of 4 kilobytes
* If the pagecache+RSS utilisation exceeds the limit, the container reclaim
  code is invoked to recover pages from the container.

Advantages:
---
* Minimal code changes to RSS controller to include pagecache pages

Limitations:
---
* All limitations of the RSS controller apply to this code as well
* Per-container reclaim knobs like dirty ratio, vm_swappiness may
  provide better control

Usage:
--
* Add all dependent patches before including this patch
* No new config settings apart from enabling CONFIG_RSS_CONTAINER
* Boot new kernel
* Mount container filesystem
mount -t container none /container
cd /container
* Create new container
mkdir mybox
cd /container/mybox
* Add current shell to container
echo $$ > tasks
* In order to set the limit, echo a value in pages (4KB) to rss_limit
echo -n 100000 > rss_limit
#This would set a 409MB limit on pagecache+rss usage
* Thrash the system from the current shell using scp/cp/dd/tar etc
* Watch rss_usage and /proc/meminfo to verify behavior

Tests:
--
* Simple dd/cat/cp test on pagecache limit/reclaim
* rss_limit was tested with a simple test application that mallocs a
  predefined amount of memory and touches it to fault the pages in
  (a sketch of such a program follows this list)
* kernbench was run under container with 400MB memory limit
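
A minimal sketch of the malloc-and-touch test program mentioned in the second
item above, assuming a 4KB page size and a size given in MB on the command
line; the exact tool used for the tests is not part of this posting.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Size to allocate in MB; defaults to 400 to match the kernbench setup. */
	size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 400;
	size_t bytes = mb << 20;
	size_t page_size = 4096;	/* assumed page size */
	char *buf = malloc(bytes);

	if (!buf) {
		perror("malloc");
		return 1;
	}
	/* Touch one byte per page so every page is really faulted in (RSS grows). */
	for (size_t off = 0; off < bytes; off += page_size)
		buf[off] = 1;
	printf("touched %zu MB; watch rss_usage in the container\n", mb);
	pause();	/* keep the pages resident until the process is killed */
	return 0;
}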

ToDo:

* Optimise the reclaim.
* Per-container VM stats and knobs

Patch Series:
-
pagecache-controller-v5-acct.patch
pagecache-controller-v5-acct-hooks.patch
pagecache-controller-v5-reclaim.patch

ChangeLog:
-

v5: Integrated pagecache + rss controller

* No separate pagecache_limit
* pagecache and rss pages accounted in rss_usage and governed by rss_limit
* Each page counted only once in rss_usage.  Mapped or unmapped
  pagecache pages are counted alike in rss_usage

v4:
* Patch remerged to Container v10 and RSS v3.1
* Bug fixes
* Tested with kernbench

v3:
* Patch merged with Containers v8 and RSS v2

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 3/3] Pagecache reclaim

2007-06-29 Thread Vaidyanathan Srinivasan
Pagecache controller reclaim changes


The reclaim path needs performance improvement.
For now these are minor changes to include unmapped
pages in our list of page_container.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 mm/rss_container.c |3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -243,9 +243,6 @@ void container_rss_move_lists(struct pag
struct rss_container *rss;
struct page_container *pc;

-   if (!page_mapped(pg))
-   return;
-
pc = page_container(pg);
if (pc == NULL)
return;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] mm-controller

2007-06-25 Thread Vaidyanathan Srinivasan


Paul Menage wrote:
> On 6/25/07, Paul Menage <[EMAIL PROTECTED]> wrote:
>> On 6/22/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:
>>> Merging both limits will eliminate the issue, however we would need
>>> individual limits for pagecache and RSS for better control.  There are
>>> use cases for pagecache_limit alone without RSS_limit like the case of
>>> database application using direct IO, backup applications and
>>> streaming applications that does not make good use of pagecache.
>>>
>> If streaming applications would otherwise litter the pagecache with
>> unwanted data, then limiting their total memory footprint (with a
>> single limit) and forcing them to drop old data sooner sounds like a
>> great idea.
> 
> Actually, reading what you wrote more carefully, that's sort of what
> you were already saying. But it's not clear why you wouldn't also want
> to limit the anon pages for a job, if you're already concerned that
> it's not playing nicely with the rest of the system.

Hi Paul,

Limiting the memory footprint (RSS and pagecache) for multimedia
applications would work.  However, streaming applications generally
have a fairly constant RSS size (mapped pagecache pages + ANON) while
the unmapped pagecache pages are what we want to control better.

If we have a combined limit for unmapped pagecache pages and RSS, then
 we will have to bring in vm_swappiness kind of knobs for each
container to influence the per container reclaim process so as to not
hurt the application performance badly.

RSS controller should be able to take care of the mapped memory
footprint if needed.  In the case of a database server, moving out any of its
RSS pages will hurt its performance, while we are free to shrink the
unmapped pagecache pages to any smaller limit since the database is
using direct IO and does not benefit from pagecache.

With the pagecache controller, we are able to split an application's memory
pages into mapped and unmapped pages. The ability to account for and control
unmapped pages in memory provides more possibilities for fine-grained
resource management.

--Vaidy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] mm-controller

2007-06-25 Thread Vaidyanathan Srinivasan


Peter Zijlstra wrote:
> On Fri, 2007-06-22 at 22:05 +0530, Vaidyanathan Srinivasan wrote:
> 
>> Merging both limits will eliminate the issue, however we would need
>> individual limits for pagecache and RSS for better control.  There are
>> use cases for pagecache_limit alone without RSS_limit like the case of
>> database application using direct IO, backup applications and
>> streaming applications that does not make good use of pagecache.
> 
> I'm aware that some people want this. However we rejected adding a
> pagecache limit to the kernel proper on grounds that reclaim should do a
> better job.
> 
> And now we're sneaking it in the backdoor.
> 
> If we're going to do this, get it in the kernel proper first.
> 

Good point.  We should probably revisit this in the context of
containers, virtualization and server consolidation.  The kernel takes the
best decision in the context of overall system performance, but when
we want the kernel to favor a certain group of applications relative to
others, we hit corner cases.  Streaming multimedia applications
are one of the corner cases where the kernel's effort to manage the
pagecache does not help overall system performance.

There have been several patches suggested to provide a system-wide
pagecache limit.  There are some user-mode fadvise() based techniques
as well.  However, solving the problem in the context of containers
provides certain advantages

* Containers provide task grouping
* Relative priority or importance can be assigned to each group using
resource limits.
* Memory controller under the container framework provides infrastructure
for detailed accounting of memory usage
* Containers and controllers form generalised infrastructure to create
localised VM behavior for a group of tasks

I would see the introduction of a pagecache limit in containers as a safe
place to add the new feature rather than a backdoor.  Since this
feature has a relatively small user base, it is best left as a
container plugin rather than a system-wide tunable.

I am not suggesting against system wide pagecache control.  We should
definitely try to find solutions for pagecache control outside of
containers as well.

--Vaidy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] mm-controller

2007-06-22 Thread Vaidyanathan Srinivasan


Peter Zijlstra wrote:
> On Thu, 2007-06-21 at 16:33 +0530, Balbir Singh wrote:
>> Peter Zijlstra wrote:
[snip]

> Not quite sure on 2, from reading the pagecache controller, I got the
> impression that enforcing both limits got you into trouble. Merging the
> limits would rid us of that issue, no?
> 

Hi Peter,

Setting both limits in the pagecache controller (v4) will work correctly in
principle.  What I intended in the comment is that there are performance
issues and that the wrong type of page can get reclaimed.  We are working on
the issues and will fix them in future versions.  You can set any combination
of limits and the reclaim will work to keep the page utilization below the
limits.

When RSS limit is hit, ANON pages are pushed to swapcache.  We would
need to limit swapcache (using pagecache_limit) to force a write out
to disk.

Merging both limits will eliminate the issue, however we would need
individual limits for pagecache and RSS for better control.  There are
use cases for pagecache_limit alone without RSS_limit like the case of
database application using direct IO, backup applications and
streaming applications that does not make good use of pagecache.

Thank you for the review and new design proposal.

--Vaidy

[snip]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH 3/4] Pagecache and RSS accounting hooks

2007-06-20 Thread Vaidyanathan Srinivasan
Pagecache and RSS accounting Hooks
--

New calls have been added from swap_state.c and filemap.c to track pagecache
and swapcache pages.

All existing RSS hooks have been generalised for pagecache accounting as well.

Most of these are function prototype changes.
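
As a rough userspace model of what the extra page-type argument buys
(PAGE_TYPE_RSS and PAGE_TYPE_PAGECACHE appear in the hunks below; the
container structure, field names and charge/uncharge helpers here are purely
illustrative): a single usage counter is still governed by one limit, while
per-type counts record whether a charge came from an RSS fault or from the
pagecache/swapcache path.

#include <stdio.h>

enum page_type { PAGE_TYPE_RSS, PAGE_TYPE_PAGECACHE, PAGE_TYPE_MAX };

struct container {
	unsigned long usage;			/* single counter, one limit */
	unsigned long limit;
	unsigned long per_type[PAGE_TYPE_MAX];	/* bookkeeping only */
};

/* Charge one page of the given type; the caller would reclaim and retry on failure. */
static int charge(struct container *c, enum page_type type)
{
	if (c->usage >= c->limit)
		return -1;
	c->usage++;
	c->per_type[type]++;
	return 0;
}

static void uncharge(struct container *c, enum page_type type)
{
	c->usage--;
	c->per_type[type]--;
}

int main(void)
{
	struct container box = { .usage = 0, .limit = 100 };

	charge(&box, PAGE_TYPE_RSS);		/* e.g. a fault in do_anonymous_page() */
	charge(&box, PAGE_TYPE_PAGECACHE);	/* e.g. add_to_page_cache() */
	printf("usage=%lu rss=%lu pagecache=%lu\n", box.usage,
	       box.per_type[PAGE_TYPE_RSS], box.per_type[PAGE_TYPE_PAGECACHE]);
	uncharge(&box, PAGE_TYPE_PAGECACHE);
	return 0;
}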

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 fs/exec.c   |4 ++--
 mm/filemap.c|   18 ++
 mm/memory.c |   18 +-
 mm/migrate.c|4 ++--
 mm/rmap.c   |   12 ++--
 mm/swap_state.c |   16 
 mm/swapfile.c   |4 ++--
 7 files changed, 55 insertions(+), 21 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/fs/exec.c
+++ linux-2.6.22-rc2-mm1/fs/exec.c
@@ -321,7 +321,7 @@ void install_arg_page(struct vm_area_str
if (unlikely(anon_vma_prepare(vma)))
goto out;

-   if (container_rss_prepare(page, vma, &pcont))
+   if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
goto out;

flush_dcache_page(page);
@@ -343,7 +343,7 @@ void install_arg_page(struct vm_area_str
return;

 out_release:
-   container_rss_release(pcont);
+   container_page_release(pcont, PAGE_TYPE_RSS);
 out:
__free_page(page);
force_sig(SIGKILL, current);
--- linux-2.6.22-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.22-rc2-mm1/mm/filemap.c
@@ -30,6 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
+#include <linux/rss_container.h>
 #include "filemap.h"
 #include "internal.h"

@@ -117,6 +118,9 @@ void __remove_from_page_cache(struct pag
struct address_space *mapping = page->mapping;

radix_tree_delete(&mapping->page_tree, page->index);
+   /* Uncharge before the mapping is gone */
+   if (page_container(page))
+   container_page_del(page_container(page), PAGE_TYPE_PAGECACHE);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -440,6 +444,8 @@ int add_to_page_cache(struct page *page,
pgoff_t offset, gfp_t gfp_mask)
 {
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+   struct page_container *pc;
+   struct mm_struct *mm;

if (error == 0) {
write_lock_irq(&mapping->tree_lock);
@@ -453,6 +459,18 @@ int add_to_page_cache(struct page *page,
__inc_zone_page_state(page, NR_FILE_PAGES);
}
write_unlock_irq(&mapping->tree_lock);
+   /* Unlock before charge, because we may reclaim this inline */
+   if(!error) {
+   if (current->mm)
+   mm = current->mm;
+   else
+   mm = &init_mm;
+   if (!container_page_prepare(page, mm, &pc, PAGE_TYPE_PAGECACHE))
+   container_page_add(pc, PAGE_TYPE_PAGECACHE);
+   else
+   BUG();
+   }
+
radix_tree_preload_end();
}
return error;
--- linux-2.6.22-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.22-rc2-mm1/mm/memory.c
@@ -1755,7 +1755,7 @@ gotten:
cow_user_page(new_page, old_page, address, vma);
}

-   if (container_rss_prepare(new_page, vma, &pcont))
+   if (container_page_prepare(new_page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
goto oom;

/*
@@ -1791,7 +1791,7 @@ gotten:
new_page = old_page;
ret |= VM_FAULT_WRITE;
} else
-   container_rss_release(pcont);
+   container_page_release(pcont, PAGE_TYPE_RSS);

if (new_page)
page_cache_release(new_page);
@@ -2217,7 +2217,7 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}

-   if (container_rss_prepare(page, vma, &pcont)) {
+   if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS)) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2235,7 +2235,7 @@ static int do_swap_page(struct mm_struct

if (unlikely(!PageUptodate(page))) {
ret = VM_FAULT_SIGBUS;
-   container_rss_release(pcont);
+   container_page_release(pcont, PAGE_TYPE_RSS);
goto out_nomap;
}

@@ -2271,7 +2271,7 @@ unlock:
 out:
return ret;
 out_nomap:
-   container_rss_release(pcont);
+   container_page_release(pcont, PAGE_TYPE_RSS);
pte_unmap_unlock(page_table, ptl);
unlock_page(page);
page_cache_release(page);
@@ -2302,7 +2302,7 @@ static int do_anonymous_page(struct mm_s
if (!page)
goto oom;

-   if (container_rss_prepare(page, vma, &pcont))
+   if (container_page_prepare(page, vma->vm_mm, &pcont, PAGE_TYPE_RSS))
goto oom_release;

entry = mk_pte(page, vma->vm_page_prot);

[RFC][PATCH 4/4] Pagecache reclaim

2007-06-20 Thread Vaidyanathan Srinivasan
Pagecache controller reclaim changes


The reclaim path needs performance improvement.
For now these are minor changes to include unmapped
pages in our list of page_container.

Signed-off-by: Vaidyanathan Srinivasan <[EMAIL PROTECTED]>
---
 mm/rss_container.c |3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.22-rc2-mm1.orig/mm/rss_container.c
+++ linux-2.6.22-rc2-mm1/mm/rss_container.c
@@ -274,9 +274,6 @@ void container_rss_move_lists(struct pag
struct rss_container *rss;
struct page_container *pc;

-   if (!page_mapped(pg))
-   return;
-
pc = page_container(pg);
if (pc == NULL)
return;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >