Re: [patch] x86, apic: use tsc deadline for oneshot when available

2012-10-30 Thread Venki Pallipadi
Patch looks good.

Acked-by: Venkatesh Pallipadi <ve...@google.com>

On Mon, Oct 22, 2012 at 2:37 PM, Suresh Siddha
<suresh.b.sid...@intel.com> wrote:
>
> Thomas, You wanted to run some tests with this, right? Please give it a
> try and see if this is ok to be pushed to the -tip.
>
> thanks,
> suresh
> --8<--
> From: Suresh Siddha <suresh.b.sid...@intel.com>
> Subject: x86, apic: use tsc deadline for oneshot when available
>
> If the TSC deadline mode is supported, LAPIC timer one-shot mode can be
> implemented using IA32_TSC_DEADLINE MSR. An interrupt will be generated
> when the TSC value equals or exceeds the value in the IA32_TSC_DEADLINE
> MSR.
>
> This enables us to skip the APIC calibration during boot. Also,
> in xapic mode, this enables us to skip the uncached apic access
> to re-arm the APIC timer.
>
> As this timer ticks at the high-frequency TSC rate, we use the
> TSC_DIVISOR (32) to work with the 32-bit restrictions in the clockevent
> APIs and to avoid 64-bit divides etc. (the frequency is u32 and "unsigned long"
> in set_next_event(), and max_delta limits the next event to 32 bits for a
> 32-bit kernel).
>
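(For illustration only, and not part of the quoted diff: with the divisor in
place, arming a one-shot event via the deadline MSR boils down to a sketch
like the one below. The function name is illustrative; rdtscll()/wrmsrl() are
the usual kernel TSC/MSR accessors assumed here.)

    static int lapic_next_deadline(unsigned long delta,
                                   struct clock_event_device *evt)
    {
            u64 tsc;

            rdtscll(tsc);           /* current TSC value */
            /* clockevents hands us delta in (TSC rate / TSC_DIVISOR) units,
             * so scale it back up before programming the deadline */
            wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + ((u64)delta) * TSC_DIVISOR);
            return 0;
    }
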
> Signed-off-by: Suresh Siddha <suresh.b.sid...@intel.com>
> ---
>  Documentation/kernel-parameters.txt |4 ++
>  arch/x86/include/asm/msr-index.h|2 +
>  arch/x86/kernel/apic/apic.c |   66 
> ++-
>  3 files changed, 55 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 9776f06..4aa9ca0 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1304,6 +1304,10 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
> lapic   [X86-32,APIC] Enable the local APIC even if BIOS
> disabled it.
>
> +   lapic=  [x86,APIC] "notscdeadline" Do not use TSC deadline
> +   value for LAPIC timer one-shot implementation. Default
> +   back to the programmable timer unit in the LAPIC.
> +
> lapic_timer_c2_ok   [X86,APIC] trust the local apic timer
> in C2 power state.
>
> diff --git a/arch/x86/include/asm/msr-index.h 
> b/arch/x86/include/asm/msr-index.h
> index 7f0edce..e400cdb 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -337,6 +337,8 @@
>  #define MSR_IA32_MISC_ENABLE_TURBO_DISABLE (1ULL << 38)
>  #define MSR_IA32_MISC_ENABLE_IP_PREF_DISABLE   (1ULL << 39)
>
> +#define MSR_IA32_TSC_DEADLINE  0x06E0
> +
>  /* P4/Xeon+ specific */
>  #define MSR_IA32_MCG_EAX   0x0180
>  #define MSR_IA32_MCG_EBX   0x0181
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index b17416e..b0c49b1 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -90,21 +90,6 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
>   */
>  DEFINE_EARLY_PER_CPU_READ_MOSTLY(int, x86_cpu_to_logical_apicid, BAD_APICID);
>
> -/*
> - * Knob to control our willingness to enable the local APIC.
> - *
> - * +1=force-enable
> - */
> -static int force_enable_local_apic __initdata;
> -/*
> - * APIC command line parameters
> - */
> -static int __init parse_lapic(char *arg)
> -{
> -   force_enable_local_apic = 1;
> -   return 0;
> -}
> -early_param("lapic", parse_lapic);
>  /* Local APIC was disabled by the BIOS and enabled by the kernel */
>  static int enabled_via_apicbase;
>
> @@ -133,6 +118,25 @@ static inline void imcr_apic_to_pic(void)
>  }
>  #endif
>
> +/*
> + * Knob to control our willingness to enable the local APIC.
> + *
> + * +1=force-enable
> + */
> +static int force_enable_local_apic __initdata;
> +/*
> + * APIC command line parameters
> + */
> +static int __init parse_lapic(char *arg)
> +{
> +   if (config_enabled(CONFIG_X86_32) && !arg)
> +   force_enable_local_apic = 1;
> +   else if (!strncmp(arg, "notscdeadline", 13))
> +   setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
> +   return 0;
> +}
> +early_param("lapic", parse_lapic);
> +
>  #ifdef CONFIG_X86_64
>  static int apic_calibrate_pmtmr __initdata;
>  static __init int setup_apicpmtimer(char *s)
> @@ -315,6 +319,7 @@ int lapic_get_maxlvt(void)
>
>  /* Clock divisor */
>  #define APIC_DIVISOR 16
> +#define TSC_DIVISOR  32
>
>  /*
>   * This function sets up the local APIC timer, with a timeout of
> @@ -333,6 +338,9 @@ static void __setup_APIC_LVTT(unsigned int clocks, int 
> oneshot, int irqen)
> lvtt_value = LOCAL_TIMER_VECTOR;
> if (!oneshot)
> lvtt_value |= APIC_LVT_TIMER_PERIODIC;
> +   else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
> +   lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
> +
> if (!lapic_is_integrated())
> lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
>
> @@ -341,6 +349,11 @@ static void __setup_APIC_LVTT(unsigned int clocks, int 
> oneshot, int irqen)
>
> 


Re: 2.6.25-rc1 regression - suspend to ram

2008-02-11 Thread Venki Pallipadi
On Tue, Feb 12, 2008 at 12:10:54AM +0100, R. J. Wysocki wrote:
> On Monday, 11 of February 2008, Lukas Hejtmanek wrote:
> > Hello,
> 
> Hi,
>  
> > 2.6.25-rc1 takes a really long time to suspend (about 30-40 secs, it used to
> > be about 5 secs) and resuming takes a few minutes.  While resuming,
> > capslock toggles the capslock LED but with a few seconds' delay.
> > 
> > 2.6.24-git15 was OK. 2.6.24 is OK.
> > 
> > I have Lenovo ThinkPad T61.
> 
> If you have CONFIG_CPU_IDLE set, please try to boot with idle=poll and see if
> that helps.
> 

I just sent this patch to fix a regression in acpi processor_idle.c in another
thread. Can you try the patch below and check whether it helps?

Thanks,
Venki


An earlier patch (bc71bec91f9875ef825d12104acf3bf4ca215fa4) broke
suspend/resume on many laptops. The problem was reported by
Carlos R. Mafra and Calvin Walton, who bisected the issue to the above patch.

The problem was that the C2 and C3 code were calling acpi_idle_enter_c1
directly, with C2 or C3 as the state parameter, while suspend/resume was in
progress. Patch bc71bec started making use of that state information,
assuming it would always refer to the C1 state. This broke suspend/resume
because we ended up using the C2/C3 state indirectly.

Fix this by adding an acpi_idle_suspend check in enter_c1.
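
(A simplified sketch of the call pattern being described, not the actual
driver code: during suspend the C2/C3 entry points fall back to the C1
routine but still pass their own C2/C3 state, which is why enter_c1 itself
must avoid the ACPI I/O port accesses.)

    static int acpi_idle_enter_simple(struct cpuidle_device *dev,
                                      struct cpuidle_state *state)
    {
            /* 'state' here is the C2 or C3 state... */
            if (acpi_idle_suspend)
                    /* ...yet we fall back to the C1 routine with it */
                    return acpi_idle_enter_c1(dev, state);
            /* ... normal C2/C3 entry ... */
            return 0;
    }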

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.25-rc1/drivers/acpi/processor_idle.c
===
--- linux-2.6.25-rc1.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.25-rc1/drivers/acpi/processor_idle.c
@@ -1420,6 +1420,14 @@ static int acpi_idle_enter_c1(struct cpu
return 0;
 
local_irq_disable();
+
+   /* Do not access any ACPI IO ports in suspend path */
+   if (acpi_idle_suspend) {
+   acpi_safe_halt();
+   local_irq_enable();
+   return 0;
+   }
+
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);
 


Re: [2.6.25-rc1 regression] Suspend to RAM (bisected)

2008-02-11 Thread Venki Pallipadi
On Mon, Feb 11, 2008 at 12:06:50PM -0800, Venki Pallipadi wrote:
> On Mon, Feb 11, 2008 at 05:37:04PM -0200, Carlos R. Mafra wrote:
> > Pallipadi, Venkatesh wrote:
> > > 
> > > > Can you send me the output of acpidump and a full dmesg. Looks like
> > > > it is a platform issue due to which we cannot use C1 mwait idle during
> > > > suspend/resume, something similar to the issue we had with using the
> > > > C2/C3 states during idle.
> > 
> > Full dmesg and acpidump outputs are attached.
> 
> The above acpidump doesn't have all the info, as it is loading some SSDT at run time.
> Can you get the output of
> 
> # acpidump --addr 0x7F6D8709 --length 0x04B7
> # acpidump --addr 0x7F6D8BC0 --length 0x0092
> 

Thanks for sending the dumps Carlos.

The patch below (on top of rc1) should fix the problem. Can you please
check it.

Thanks,
Venki


An earlier patch (bc71bec91f9875ef825d12104acf3bf4ca215fa4) broke
suspend/resume on many laptops. The problem was reported by
Carlos R. Mafra and Calvin Walton, who bisected the issue to the above patch.

The problem was that the C2 and C3 code were calling acpi_idle_enter_c1
directly, with C2 or C3 as the state parameter, while suspend/resume was in
progress. Patch bc71bec started making use of that state information,
assuming it would always refer to the C1 state. This broke suspend/resume
because we ended up using the C2/C3 state indirectly.

Fix this by adding an acpi_idle_suspend check in enter_c1.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.25-rc1/drivers/acpi/processor_idle.c
===
--- linux-2.6.25-rc1.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.25-rc1/drivers/acpi/processor_idle.c
@@ -1420,6 +1420,14 @@ static int acpi_idle_enter_c1(struct cpu
return 0;
 
local_irq_disable();
+
+   /* Do not access any ACPI IO ports in suspend path */
+   if (acpi_idle_suspend) {
+   acpi_safe_halt();
+   local_irq_enable();
+   return 0;
+   }
+
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);
 


Re: [2.6.25-rc1 regression] Suspend to RAM (bisected)

2008-02-11 Thread Venki Pallipadi
On Mon, Feb 11, 2008 at 05:37:04PM -0200, Carlos R. Mafra wrote:
> Pallipadi, Venkatesh wrote:
> > 
> > Can you send me the output of acpidump and a full dmesg. Looks like
> > it is a platform issue due to which we cannot use C1 mwait idle during
> > suspend/resume, something similar to the issue we had with using the
> > C2/C3 states during idle.
> 
> Full dmesg and acpidump outputs are attached.

The above acpidump doesn't have all the info, as it is loading some SSDT at run time.
Can you get the output of

# acpidump --addr 0x7F6D8709 --length 0x04B7
# acpidump --addr 0x7F6D8BC0 --length 0x0092

Thanks,
Venki



Re: [PATCH] x86: Simplify cpu_idle_wait

2008-02-08 Thread Venki Pallipadi
On Fri, Feb 08, 2008 at 11:28:48AM +0100, Andi Kleen wrote:
> 
> > -   set_cpus_allowed(current, tmp);
> > +   smp_mb();
> > +   /* kick all the CPUs so that they exit out of pm_idle */
> > +   smp_call_function(do_nothing, NULL, 0, 0);
> 
> I think the last argument (wait) needs to be 1 to make sure it is 
> synchronous (for 32/64) Otherwise the patch looks great.

Yes. Below is the updated patch


An earlier commit, 40d6a146629b98d8e322b6f9332b182c7cbff3df,
added an smp_call_function in cpu_idle_wait() to kick CPUs that are in tickless
idle. Looking at the cpu_idle_wait code at that time, the code seemed to be
over-engineered for a case which is rarely used (changing the idle handler).

Below is a simplified version of cpu_idle_wait, which just makes
a dummy smp_call_function call to all CPUs to make them come out of the old
idle handler and start using the new idle handler. It also eliminates the code
in the idle loop that handled cpu_idle_wait.
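
(Usage sketch with a hypothetical caller, not taken from this patch: per the
comment added in the patch below, the caller publishes the new handler first
and only then calls cpu_idle_wait().)

    void switch_idle_handler(void (*new_idle)(void))
    {
            pm_idle = new_idle;     /* publish the new handler first */
            cpu_idle_wait();        /* then ensure no CPU still runs the old one */
    }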

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.25-rc/arch/x86/kernel/process_32.c
===
--- linux-2.6.25-rc.orig/arch/x86/kernel/process_32.c
+++ linux-2.6.25-rc/arch/x86/kernel/process_32.c
@@ -82,7 +82,6 @@ unsigned long thread_saved_pc(struct tas
  */
 void (*pm_idle)(void);
 EXPORT_SYMBOL(pm_idle);
-static DEFINE_PER_CPU(unsigned int, cpu_idle_state);
 
 void disable_hlt(void)
 {
@@ -181,9 +180,6 @@ void cpu_idle(void)
while (!need_resched()) {
void (*idle)(void);
 
-   if (__get_cpu_var(cpu_idle_state))
-   __get_cpu_var(cpu_idle_state) = 0;
-
check_pgt_cache();
rmb();
idle = pm_idle;
@@ -208,40 +204,19 @@ static void do_nothing(void *unused)
 {
 }
 
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
+ * pm_idle and update to new pm_idle value. Required while changing pm_idle
+ * handler on SMP systems.
+ *
+ * Caller must have changed pm_idle to the new value before the call. Old
+ * pm_idle value will not be used by any CPU after the return of this function.
+ */
 void cpu_idle_wait(void)
 {
-   unsigned int cpu, this_cpu = get_cpu();
-   cpumask_t map, tmp = current->cpus_allowed;
-
-   set_cpus_allowed(current, cpumask_of_cpu(this_cpu));
-   put_cpu();
-
-   cpus_clear(map);
-   for_each_online_cpu(cpu) {
-   per_cpu(cpu_idle_state, cpu) = 1;
-   cpu_set(cpu, map);
-   }
-
-   __get_cpu_var(cpu_idle_state) = 0;
-
-   wmb();
-   do {
-   ssleep(1);
-   for_each_online_cpu(cpu) {
-   if (cpu_isset(cpu, map) && !per_cpu(cpu_idle_state, 
cpu))
-   cpu_clear(cpu, map);
-   }
-   cpus_and(map, map, cpu_online_map);
-   /*
-* We waited 1 sec, if a CPU still did not call idle
-* it may be because it is in idle and not waking up
-* because it has nothing to do.
-* Give all the remaining CPUS a kick.
-*/
-   smp_call_function_mask(map, do_nothing, 0, 0);
-   } while (!cpus_empty(map));
-
-   set_cpus_allowed(current, tmp);
+   smp_mb();
+   /* kick all the CPUs so that they exit out of pm_idle */
+   smp_call_function(do_nothing, NULL, 0, 1);
 }
 EXPORT_SYMBOL_GPL(cpu_idle_wait);
 
Index: linux-2.6.25-rc/arch/x86/kernel/process_64.c
===
--- linux-2.6.25-rc.orig/arch/x86/kernel/process_64.c
+++ linux-2.6.25-rc/arch/x86/kernel/process_64.c
@@ -64,7 +64,6 @@ EXPORT_SYMBOL(boot_option_idle_override)
  */
 void (*pm_idle)(void);
 EXPORT_SYMBOL(pm_idle);
-static DEFINE_PER_CPU(unsigned int, cpu_idle_state);
 
 static ATOMIC_NOTIFIER_HEAD(idle_notifier);
 
@@ -139,41 +138,19 @@ static void do_nothing(void *unused)
 {
 }
 
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
+ * pm_idle and update to new pm_idle value. Required while changing pm_idle
+ * handler on SMP systems.
+ *
+ * Caller must have changed pm_idle to the new value before the call. Old
+ * pm_idle value will not be used by any CPU after the return of this function.
+ */
 void cpu_idle_wait(void)
 {
-   unsigned int cpu, this_cpu = get_cpu();
-   cpumask_t map, tmp = current->cpus_allowed;
-
-   set_cpus_allowed(current, cpumask_of_cpu(this_cpu));
-   put_cpu();
-
-   cpus_clear(map);
-   for_each_online_cpu(cpu) {
-   per_cpu(cpu_idle_state, cpu) = 1;
-   cpu_set(cpu, map);
-   }
-
-   __get_cpu_var(cpu_idle_state) = 0;
-
-   wmb();
-   do {
-   ssleep(1);
-   for_each_online_cpu(cpu) {
-   if (cpu_isset(cpu, map) &&
-   !per_cpu(cpu_idle_state, cpu))
-


[PATCH] x86: Simplify cpu_idle_wait

2008-02-07 Thread Venki Pallipadi

An earlier commit, 40d6a146629b98d8e322b6f9332b182c7cbff3df,
added an smp_call_function in cpu_idle_wait() to kick CPUs that are in tickless
idle. Looking at the cpu_idle_wait code at that time, the code seemed to be
over-engineered for a case which is rarely used (changing the idle handler).

Below is a simplified version of cpu_idle_wait, which just makes
a dummy smp_call_function call to all CPUs to make them come out of the old
idle handler and start using the new idle handler. It also eliminates the code
in the idle loop that handled cpu_idle_wait.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.25-rc/arch/x86/kernel/process_32.c
===
--- linux-2.6.25-rc.orig/arch/x86/kernel/process_32.c
+++ linux-2.6.25-rc/arch/x86/kernel/process_32.c
@@ -82,7 +82,6 @@ unsigned long thread_saved_pc(struct tas
  */
 void (*pm_idle)(void);
 EXPORT_SYMBOL(pm_idle);
-static DEFINE_PER_CPU(unsigned int, cpu_idle_state);
 
 void disable_hlt(void)
 {
@@ -181,9 +180,6 @@ void cpu_idle(void)
while (!need_resched()) {
void (*idle)(void);
 
-   if (__get_cpu_var(cpu_idle_state))
-   __get_cpu_var(cpu_idle_state) = 0;
-
check_pgt_cache();
rmb();
idle = pm_idle;
@@ -208,40 +204,19 @@ static void do_nothing(void *unused)
 {
 }
 
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
+ * pm_idle and update to new pm_idle value. Required while changing pm_idle
+ * handler on SMP systems.
+ *
+ * Caller must have changed pm_idle to the new value before the call. Old
+ * pm_idle value will not be used by any CPU after the return of this function.
+ */
 void cpu_idle_wait(void)
 {
-   unsigned int cpu, this_cpu = get_cpu();
-   cpumask_t map, tmp = current->cpus_allowed;
-
-   set_cpus_allowed(current, cpumask_of_cpu(this_cpu));
-   put_cpu();
-
-   cpus_clear(map);
-   for_each_online_cpu(cpu) {
-   per_cpu(cpu_idle_state, cpu) = 1;
-   cpu_set(cpu, map);
-   }
-
-   __get_cpu_var(cpu_idle_state) = 0;
-
-   wmb();
-   do {
-   ssleep(1);
-   for_each_online_cpu(cpu) {
-   if (cpu_isset(cpu, map) && !per_cpu(cpu_idle_state, 
cpu))
-   cpu_clear(cpu, map);
-   }
-   cpus_and(map, map, cpu_online_map);
-   /*
-* We waited 1 sec, if a CPU still did not call idle
-* it may be because it is in idle and not waking up
-* because it has nothing to do.
-* Give all the remaining CPUS a kick.
-*/
-   smp_call_function_mask(map, do_nothing, 0, 0);
-   } while (!cpus_empty(map));
-
-   set_cpus_allowed(current, tmp);
+   smp_mb();
+   /* kick all the CPUs so that they exit out of pm_idle */
+   smp_call_function(do_nothing, NULL, 0, 0);
 }
 EXPORT_SYMBOL_GPL(cpu_idle_wait);
 
Index: linux-2.6.25-rc/arch/x86/kernel/process_64.c
===
--- linux-2.6.25-rc.orig/arch/x86/kernel/process_64.c
+++ linux-2.6.25-rc/arch/x86/kernel/process_64.c
@@ -64,7 +64,6 @@ EXPORT_SYMBOL(boot_option_idle_override)
  */
 void (*pm_idle)(void);
 EXPORT_SYMBOL(pm_idle);
-static DEFINE_PER_CPU(unsigned int, cpu_idle_state);
 
 static ATOMIC_NOTIFIER_HEAD(idle_notifier);
 
@@ -139,41 +138,19 @@ static void do_nothing(void *unused)
 {
 }
 
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
+ * pm_idle and update to new pm_idle value. Required while changing pm_idle
+ * handler on SMP systems.
+ *
+ * Caller must have changed pm_idle to the new value before the call. Old
+ * pm_idle value will not be used by any CPU after the return of this function.
+ */
 void cpu_idle_wait(void)
 {
-   unsigned int cpu, this_cpu = get_cpu();
-   cpumask_t map, tmp = current->cpus_allowed;
-
-   set_cpus_allowed(current, cpumask_of_cpu(this_cpu));
-   put_cpu();
-
-   cpus_clear(map);
-   for_each_online_cpu(cpu) {
-   per_cpu(cpu_idle_state, cpu) = 1;
-   cpu_set(cpu, map);
-   }
-
-   __get_cpu_var(cpu_idle_state) = 0;
-
-   wmb();
-   do {
-   ssleep(1);
-   for_each_online_cpu(cpu) {
-   if (cpu_isset(cpu, map) &&
-   !per_cpu(cpu_idle_state, cpu))
-   cpu_clear(cpu, map);
-   }
-   cpus_and(map, map, cpu_online_map);
-   /*
-* We waited 1 sec, if a CPU still did not call idle
-* it may be because it is in idle and not waking up
-* because it has nothing to do.
-* Give all the remaining CPUS a kick.
-*/
-  


Re: 2.6.24-rc8-mm1

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 11:40:32AM -0800, Andrew Morton wrote:
> On Thu, 17 Jan 2008 11:22:19 -0800 "Pallipadi, Venkatesh" <[EMAIL PROTECTED]> 
> wrote:
> 
> >  
> > The problem is
> > >> modprobe:2584 conflicting cache attribute 5000-50001000
> > >> uncached<->default
> > 
> > Some address range here is being mapped with conflicting types.
> > Somewhere the range was mapped with default (write-back). Later
> > pci_iomap() is mapping that region as uncacheable, which is basically
> > aliasing. The PAT code detects the aliasing and fails the second uncacheable
> > request, which leads to the failure.
> 
> It sounds to me like you need considerably more runtime debugging and
> reporting support in that code.  Ensure that it generates enough output
> both during regular operation and during failures for you to be able to
> diagnose things in a single iteration.
> 
> We can always take it out later.
> 
> 
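(To make the aliasing scenario described above concrete, a hypothetical
module-init sketch; the addresses and names are illustrative only, not taken
from the failing driver.)

    static int __init alias_demo_init(void)
    {
            /* first mapping of a range with the default attribute
             * (write-back when PAT is enabled) */
            void __iomem *wb = ioremap(0x50000000, 0x1000);

            /* second mapping of the same range, now uncached: the PAT
             * tracking code sees WB vs UC for one physical range and
             * fails this request, which is what makes modprobe fail here */
            void __iomem *uc = ioremap_nocache(0x50000000, 0x1000);

            if (!uc)
                    printk(KERN_INFO "conflicting cache attribute rejected\n");
            if (wb)
                    iounmap(wb);
            return 0;
    }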

The patch below makes the interesting printks from PAT non-DEBUG.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.git/arch/x86/mm/ioremap.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap.c2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap.c 2008-01-17 08:11:51.0 -0800
@@ -25,10 +25,13 @@
  */
 void __iomem *ioremap_wc(unsigned long phys_addr, unsigned long size)
 {
-   if (pat_wc_enabled)
+   if (pat_wc_enabled) {
+   printk(KERN_INFO "ioremap_wc: addr %lx, size %lx\n",
+  phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_WC);
-   else
+   } else {
return ioremap_nocache(phys_addr, size);
+   }
 }
 EXPORT_SYMBOL(ioremap_wc);
 
Index: linux-2.6.git/arch/x86/mm/ioremap_32.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_32.c 2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_32.c  2008-01-17 08:10:58.0 
-0800
@@ -164,6 +164,8 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
+   printk(KERN_INFO "ioremap_nocache: addr %lx, size %lx\n",
+  phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_UC);
 }
 EXPORT_SYMBOL(ioremap_nocache);
Index: linux-2.6.git/arch/x86/mm/ioremap_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_64.c 2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_64.c  2008-01-17 08:10:13.0 
-0800
@@ -144,7 +144,7 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
-   printk(KERN_DEBUG "ioremap_nocache: addr %lx, size %lx\n",
+   printk(KERN_INFO "ioremap_nocache: addr %lx, size %lx\n",
   phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_UC);
 }
Index: linux-2.6.git/arch/x86/mm/pat.c
===
--- linux-2.6.git.orig/arch/x86/mm/pat.c2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/pat.c 2008-01-17 08:06:23.0 -0800
@@ -170,7 +170,7 @@
 
if (!fattr && attr != ml->attr) {
printk(
-   KERN_DEBUG "%s:%d conflicting cache attribute %Lx-%Lx %s<->%s\n",
+   KERN_WARNING "%s:%d conflicting cache attribute %Lx-%Lx %s<->%s\n",
current->comm, current->pid,
start, end,
cattr_name(attr), cattr_name(ml->attr));
@@ -205,7 +205,7 @@
list_for_each_entry(ml, &mattr_list, nd) {
if (ml->start == start && ml->end == end) {
if (ml->attr != attr)
-   printk(KERN_DEBUG
+   printk(KERN_WARNING
"%s:%d conflicting cache attributes on free %Lx-%Lx %s<->%s\n",
current->comm, current->pid, start, end,
cattr_name(attr), cattr_name(ml->attr));
@@ -217,7 +217,7 @@
}
spin_unlock(&mattr_lock);
if (err)
-   printk(KERN_DEBUG "%s:%d freeing invalid mattr %Lx-%Lx %s\n",
+   printk(KERN_WARNING "%s:%d freeing invalid mattr %Lx-%Lx %s\n",
current->comm, current->pid,
start, end, cattr_name(attr));
return err;
Index: linux-2.6.git/include/asm-x86/io_32.h
===
--- linux-2.6.git.orig/include/asm-x86/io_32.h  2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-x86/io_32.h   2008-01-17 08:09:30.0 
-0800
@@ -113,6 +113,8 @@
 
 static inline void __iomem * ioremap(unsigned long offset, unsigned long size)
 {
+   printk(KERN_INFO "ioremap: addr %lx, size %lx\n",
+  offset, size);
  

Re: [patch 0/4] x86: PAT followup - Incremental changes and bug fixes

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 11:52:43PM +0100, Andreas Herrmann3 wrote:
> On Thu, Jan 17, 2008 at 11:15:05PM +0100, Ingo Molnar wrote:
> > 
> > * Andreas Herrmann3 <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, Jan 17, 2008 at 10:42:09PM +0100, Ingo Molnar wrote:
> > > > 
> > > > * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > > On Thu, Jan 17, 2008 at 10:13:08PM +0100, Ingo Molnar wrote:
> > > > > > but in general we must be robust enough in this case and just 
> > > > > > degrade 
> > > > > > any overlapping page to UC (and emit a warning perhaps) - instead 
> > > > > > of 
> > > > > > failing the ioremap and thus failing the driver (and the bootup).
> > > > > 
> > > > > But then, this will cause an attribute conflict. The old one was 
> > > > > specifying WB in PAT (ioremap with noflags) and the new ioremap 
> > > > > specifies UC.
> > > > 
> > > > we could fix up all aliases of that page as well and degrade them to UC?
> > > 
> > > Yes, we must fix all aliases or reject the conflicting mapping. But 
> > > fixing all aliases might not be that easy. (I've just seen a panic 
> > > when using your patch ;-(
> > 
> > yes, indeed my patch is bad if you have PAT enabled: conflicting cache 
> > attributes might be present. I'll go with your patch for now.
> 
> I think the best is to just reject conflicting mappings. (Because now
> I am too tired to think about a safe way how to change the aliases to the
> most restrictive memory type. ;-)
> 
> But then of course such boot-time problems like I've seen on my test
> machines should be avoided somehow.
> 
> 
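(Only to illustrate the "reject conflicting mappings" option discussed above;
the structure and list names below are borrowed loosely from the PAT tracking
code and are not part of this series.)

    /* refuse a request whose attribute disagrees with an overlapping,
     * already-tracked mapping; otherwise the range would be aliased */
    static int reserve_mem_attr(u64 start, u64 end, unsigned long attr)
    {
            struct memattr *ml;

            list_for_each_entry(ml, &mattr_list, nd) {
                    if (start < ml->end && end > ml->start && attr != ml->attr)
                            return -EBUSY;  /* conflicting cache attribute */
            }
            /* no conflict: record the new range here ... */
            return 0;
    }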

Below is another potential fix for the problem here. Going through the ACPI
ioremap usages, we found one place where the mapping is cached as a possible
optimization and not unmapped later. The patch below always unmaps the
ioremap at this place in ACPICA.

Thanks,
Venki


Index: linux-2.6.git/drivers/acpi/executer/exregion.c
===
--- linux-2.6.git.orig/drivers/acpi/executer/exregion.c 2008-01-17 
03:18:39.0 -0800
+++ linux-2.6.git/drivers/acpi/executer/exregion.c  2008-01-17 
07:34:33.0 -0800
@@ -48,6 +48,8 @@
 #define _COMPONENT  ACPI_EXECUTER
 ACPI_MODULE_NAME("exregion")
 
+static int ioremap_cache;
+
 
/***
  *
  * FUNCTION:acpi_ex_system_memory_space_handler
@@ -249,6 +251,13 @@
break;
}
 
+   if (!ioremap_cache) {
+   acpi_os_unmap_memory(mem_info->mapped_logical_address,
+window_size);
+   mem_info->mapped_logical_address = 0;
+   mem_info->mapped_physical_address = 0;
+   mem_info->mapped_length = 0;
+   }
return_ACPI_STATUS(status);
 }
 


Re: [-mm Patch] uml: fix a building error

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 04:14:37PM -0500, Jeff Dike wrote:
> On Thu, Jan 17, 2008 at 11:38:53AM -0800, Pallipadi, Venkatesh wrote:
> > Apart from unxlate, there is also ioremap_wc which is defined in the
> > same way.
> 
> And while we're on the subject, what's the deal with these, in
> include/asm-x86/io.h?
> 
> #define ioremap_wc ioremap_wc
> #define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
> 

If archs want to override the defaults for these two functions, they define
the above and then include asm-generic/iomap.h.

Archs which don't want to implement anything in these new funcs just have to
include asm-generic/iomap.h, which has the proper stubs.
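
(A sketch of the override pattern just described; the fallback shown for the
generic-header side is an assumption, paraphrased rather than quoted from the
tree.)

    /* arch header (e.g. include/asm-x86/io.h): provide a real ioremap_wc
     * and tell the generic header not to install a stub */
    void __iomem *ioremap_wc(unsigned long phys_addr, unsigned long size);
    #define ioremap_wc ioremap_wc
    #include <asm-generic/iomap.h>

    /* asm-generic/iomap.h side, roughly: only stub it out if the arch
     * did not define it */
    #ifndef ioremap_wc
    #define ioremap_wc ioremap_nocache
    #endif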

So, a patch like the one below is what is required here for all archs to
include asm-generic/iomap.h (without the other patch that
defines a null unxlate in the asm-specific header).

Totally untested.

Thanks,
Venki

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.git/include/asm-arm/io.h
===
--- linux-2.6.git.orig/include/asm-arm/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-arm/io.h  2008-01-17 06:39:13.0 -0800
@@ -27,6 +27,8 @@
 #include <asm/byteorder.h>
 #include <asm/memory.h>
 
+#include <asm-generic/iomap.h>
+
 /*
  * ISA I/O bus memory addresses are 1:1 with the physical address.
  */
Index: linux-2.6.git/include/asm-avr32/io.h
===
--- linux-2.6.git.orig/include/asm-avr32/io.h   2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-avr32/io.h2008-01-17 06:39:13.0 
-0800
@@ -10,6 +10,8 @@
 
 #include <asm/arch/io.h>
 
+#include <asm-generic/iomap.h>
+
 /* virt_to_phys will only work when address is in P1 or P2 */
 static __inline__ unsigned long virt_to_phys(volatile void *address)
 {
Index: linux-2.6.git/include/asm-blackfin/io.h
===
--- linux-2.6.git.orig/include/asm-blackfin/io.h2008-01-17 
06:28:06.0 -0800
+++ linux-2.6.git/include/asm-blackfin/io.h 2008-01-17 06:39:13.0 
-0800
@@ -8,6 +8,8 @@
 #endif
 #include <linux/compiler.h>
 
+#include <asm-generic/iomap.h>
+
 /*
  * These are for ISA/PCI shared memory _only_ and should never be used
  * on any other type of memory, including Zorro memory. They are meant to
Index: linux-2.6.git/include/asm-cris/io.h
===
--- linux-2.6.git.orig/include/asm-cris/io.h2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-cris/io.h 2008-01-17 06:39:13.0 -0800
@@ -5,6 +5,8 @@
 #include <asm/arch/io.h>
 #include <linux/kernel.h>
 
+#include <asm-generic/iomap.h>
+
 struct cris_io_operations
 {
u32 (*read_mem)(void *addr, int size);
Index: linux-2.6.git/include/asm-frv/io.h
===
--- linux-2.6.git.orig/include/asm-frv/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-frv/io.h  2008-01-17 06:39:13.0 -0800
@@ -23,6 +23,8 @@
 #include <asm/mb-regs.h>
 #include <linux/delay.h>
 
+#include <asm-generic/iomap.h>
+
 /*
  * swap functions are sometimes needed to interface little-endian hardware
  */
Index: linux-2.6.git/include/asm-h8300/io.h
===
--- linux-2.6.git.orig/include/asm-h8300/io.h   2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-h8300/io.h2008-01-17 06:39:13.0 
-0800
@@ -13,6 +13,8 @@
 #error UNKNOWN CPU TYPE
 #endif
 
+#include <asm-generic/iomap.h>
+
 
 /*
  * These are for ISA/PCI shared memory _only_ and should never be used
Index: linux-2.6.git/include/asm-m32r/io.h
===
--- linux-2.6.git.orig/include/asm-m32r/io.h2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-m32r/io.h 2008-01-17 06:39:13.0 -0800
@@ -5,6 +5,8 @@
 #include <linux/compiler.h>
 #include <asm/page.h>  /* __va */
 
+#include <asm-generic/iomap.h>
+
 #ifdef __KERNEL__
 
 #define IO_SPACE_LIMIT  0x
Index: linux-2.6.git/include/asm-m68knommu/io.h
===
--- linux-2.6.git.orig/include/asm-m68knommu/io.h   2008-01-17 
06:28:06.0 -0800
+++ linux-2.6.git/include/asm-m68knommu/io.h2008-01-17 06:39:13.0 
-0800
@@ -1,6 +1,8 @@
 #ifndef _M68KNOMMU_IO_H
 #define _M68KNOMMU_IO_H
 
+#include <asm-generic/iomap.h>
+
 #ifdef __KERNEL__
 
 
Index: linux-2.6.git/include/asm-ppc/io.h
===
--- linux-2.6.git.orig/include/asm-ppc/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-ppc/io.h  2008-01-17 06:39:13.0 -0800
@@ -10,6 +10,8 @@
 #include <asm/synch.h>
 #include <asm/mmu.h>
 
+#include <asm-generic/iomap.h>
+
 #define SIO_CONFIG_RA  0x398
 #define SIO_CONFIG_RD  0x399
 
Index: linux-2.6.git/include/asm-s390/io.h
===
--- linux-2.6.git.orig/include/asm-s390/io.h2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-s390/io.h 2008-01-17 06:39:13.0 -0800
@@ -15,6 +15,8 @@
 
 

Re: [-mm Patch] uml: fix a building error

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 04:14:37PM -0500, Jeff Dike wrote:
 On Thu, Jan 17, 2008 at 11:38:53AM -0800, Pallipadi, Venkatesh wrote:
  Apart from unxlate, there is also ioremap_wc which is defined in the
  same way.
 
 And while we're on the subject, what's the deal with these, in
 include/asm-x86/io.h?
 
 #define ioremap_wc ioremap_wc
 #define unxlate_dev_mem_ptr unxlate_dev_mem_ptr
 

If archs want to override the defaults for these two functions, they define
the above and then include asm-generic/iomap.h.

Archs which doesnt want to implement anything in these new funcs just have to
include asm-generic/iomap.h which has the proper stubs.

So, a patch like the below is what is required here for all archs to
include asm-generic iomap.h (without the other patch that
defines null unxlate in asm specific header).

Totally untested.

Thanks,
Venki

Signed-off-by: Venkatesh Pallipadi [EMAIL PROTECTED]

Index: linux-2.6.git/include/asm-arm/io.h
===
--- linux-2.6.git.orig/include/asm-arm/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-arm/io.h  2008-01-17 06:39:13.0 -0800
@@ -27,6 +27,8 @@
 #include asm/byteorder.h
 #include asm/memory.h
 
+#include asm-generic/iomap.h
+
 /*
  * ISA I/O bus memory addresses are 1:1 with the physical address.
  */
Index: linux-2.6.git/include/asm-avr32/io.h
===
--- linux-2.6.git.orig/include/asm-avr32/io.h   2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-avr32/io.h2008-01-17 06:39:13.0 
-0800
@@ -10,6 +10,8 @@
 
 #include asm/arch/io.h
 
+#include asm-generic/iomap.h
+
 /* virt_to_phys will only work when address is in P1 or P2 */
 static __inline__ unsigned long virt_to_phys(volatile void *address)
 {
Index: linux-2.6.git/include/asm-blackfin/io.h
===
--- linux-2.6.git.orig/include/asm-blackfin/io.h2008-01-17 
06:28:06.0 -0800
+++ linux-2.6.git/include/asm-blackfin/io.h 2008-01-17 06:39:13.0 
-0800
@@ -8,6 +8,8 @@
 #endif
 #include linux/compiler.h
 
+#include asm-generic/iomap.h
+
 /*
  * These are for ISA/PCI shared memory _only_ and should never be used
  * on any other type of memory, including Zorro memory. They are meant to
Index: linux-2.6.git/include/asm-cris/io.h
===
--- linux-2.6.git.orig/include/asm-cris/io.h2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-cris/io.h 2008-01-17 06:39:13.0 -0800
@@ -5,6 +5,8 @@
 #include asm/arch/io.h
 #include linux/kernel.h
 
+#include asm-generic/iomap.h
+
 struct cris_io_operations
 {
u32 (*read_mem)(void *addr, int size);
Index: linux-2.6.git/include/asm-frv/io.h
===
--- linux-2.6.git.orig/include/asm-frv/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-frv/io.h  2008-01-17 06:39:13.0 -0800
@@ -23,6 +23,8 @@
 #include asm/mb-regs.h
 #include linux/delay.h
 
+#include asm-generic/iomap.h
+
 /*
  * swap functions are sometimes needed to interface little-endian hardware
  */
Index: linux-2.6.git/include/asm-h8300/io.h
===
--- linux-2.6.git.orig/include/asm-h8300/io.h   2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-h8300/io.h2008-01-17 06:39:13.0 
-0800
@@ -13,6 +13,8 @@
 #error UNKNOWN CPU TYPE
 #endif
 
+#include asm-generic/iomap.h
+
 
 /*
  * These are for ISA/PCI shared memory _only_ and should never be used
Index: linux-2.6.git/include/asm-m32r/io.h
===
--- linux-2.6.git.orig/include/asm-m32r/io.h2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-m32r/io.h 2008-01-17 06:39:13.0 -0800
@@ -5,6 +5,8 @@
 #include linux/compiler.h
 #include asm/page.h  /* __va */
 
+#include asm-generic/iomap.h
+
 #ifdef __KERNEL__
 
 #define IO_SPACE_LIMIT  0x
Index: linux-2.6.git/include/asm-m68knommu/io.h
===
--- linux-2.6.git.orig/include/asm-m68knommu/io.h   2008-01-17 
06:28:06.0 -0800
+++ linux-2.6.git/include/asm-m68knommu/io.h2008-01-17 06:39:13.0 
-0800
@@ -1,6 +1,8 @@
 #ifndef _M68KNOMMU_IO_H
 #define _M68KNOMMU_IO_H
 
+#include asm-generic/iomap.h
+
 #ifdef __KERNEL__
 
 
Index: linux-2.6.git/include/asm-ppc/io.h
===
--- linux-2.6.git.orig/include/asm-ppc/io.h 2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-ppc/io.h  2008-01-17 06:39:13.0 -0800
@@ -10,6 +10,8 @@
 #include asm/synch.h
 #include asm/mmu.h
 
+#include asm-generic/iomap.h
+
 #define SIO_CONFIG_RA  0x398
 #define 

Re: [patch 0/4] x86: PAT followup - Incremental changes and bug fixes

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 11:52:43PM +0100, Andreas Herrmann3 wrote:
 On Thu, Jan 17, 2008 at 11:15:05PM +0100, Ingo Molnar wrote:
  
  * Andreas Herrmann3 [EMAIL PROTECTED] wrote:
  
   On Thu, Jan 17, 2008 at 10:42:09PM +0100, Ingo Molnar wrote:

* Siddha, Suresh B [EMAIL PROTECTED] wrote:

 On Thu, Jan 17, 2008 at 10:13:08PM +0100, Ingo Molnar wrote:
  but in general we must be robust enough in this case and just 
  degrade 
  any overlapping page to UC (and emit a warning perhaps) - instead 
  of 
  failing the ioremap and thus failing the driver (and the bootup).
 
 But then, this will cause an attribute conflicit. Old one was 
 specifying WB in PAT (ioremap with noflags) and the new ioremap 
 specifies UC.

we could fix up all aliases of that page as well and degrade them to UC?
   
   Yes, we must fix all aliases or reject the conflicting mapping. But 
   fixing all aliases might not be that easy. (I've just seen a panic 
   when using your patch ;-(
  
  yes, indeed my patch is bad if you have PAT enabled: conflicting cache 
  attributes might be present. I'll go with your patch for now.
 
 I think the best is to just reject conflicting mappings. (Because now
 I am too tired to think about a safe way how to change the aliases to the
 most restrictive memory type. ;-)
 
 But then of course such boot-time problems like I've seen on my test
 machines should be avoided somehow.
 
 

Below is another potential fix for the problem here. Going through ACPI
ioremap usages, we found at one place the mapping is cached for possible
optimization reason and not unmapped later. Patch below always unmaps
ioremap at this place in ACPICA.

Thanks,
Venki


Index: linux-2.6.git/drivers/acpi/executer/exregion.c
===
--- linux-2.6.git.orig/drivers/acpi/executer/exregion.c 2008-01-17 
03:18:39.0 -0800
+++ linux-2.6.git/drivers/acpi/executer/exregion.c  2008-01-17 
07:34:33.0 -0800
@@ -48,6 +48,8 @@
 #define _COMPONENT  ACPI_EXECUTER
 ACPI_MODULE_NAME(exregion)
 
+static int ioremap_cache;
+
 
/***
  *
  * FUNCTION:acpi_ex_system_memory_space_handler
@@ -249,6 +251,13 @@
break;
}
 
+   if (!ioremap_cache) {
+   acpi_os_unmap_memory(mem_info-mapped_logical_address,
+window_size);
+   mem_info-mapped_logical_address = 0;
+   mem_info-mapped_physical_address = 0;
+   mem_info-mapped_length = 0;
+   }
return_ACPI_STATUS(status);
 }
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc8-mm1

2008-01-17 Thread Venki Pallipadi
On Thu, Jan 17, 2008 at 11:40:32AM -0800, Andrew Morton wrote:
 On Thu, 17 Jan 2008 11:22:19 -0800 Pallipadi, Venkatesh [EMAIL PROTECTED] 
 wrote:
 
   
  The problem is
   modprobe:2584 conflicting cache attribute 5000-50001000
   uncached-default
  
  Some address range here is being mapped with conflicting types.
  Somewhere the range was mapped with default (write-back). Later
  pci_iomap() is mapping that region as uncacheable which is basically
  aliasing. PAT code detects the aliasing and fails the second uncacheable
  request which leads in the failure.
 
 It sounds to me like you need considerably more runtime debugging and
 reporting support in that code.  Ensure that it generates enough output
 both during regular operation and during failures for you to be able to
 diagnose things in a single iteration.
 
 We can always take it out later.
 
 

Patch below makes the interesting printks from PAT non DEBUG.

Signed-off-by: Venkatesh Pallipadi [EMAIL PROTECTED]

Index: linux-2.6.git/arch/x86/mm/ioremap.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap.c2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap.c 2008-01-17 08:11:51.0 -0800
@@ -25,10 +25,13 @@
  */
 void __iomem *ioremap_wc(unsigned long phys_addr, unsigned long size)
 {
-   if (pat_wc_enabled)
+   if (pat_wc_enabled) {
+   printk(KERN_INFO ioremap_wc: addr %lx, size %lx\n,
+  phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_WC);
-   else
+   } else {
return ioremap_nocache(phys_addr, size);
+   }
 }
 EXPORT_SYMBOL(ioremap_wc);
 
Index: linux-2.6.git/arch/x86/mm/ioremap_32.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_32.c 2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_32.c  2008-01-17 08:10:58.0 
-0800
@@ -164,6 +164,8 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
+   printk(KERN_INFO "ioremap_nocache: addr %lx, size %lx\n",
+  phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_UC);
 }
 EXPORT_SYMBOL(ioremap_nocache);
Index: linux-2.6.git/arch/x86/mm/ioremap_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_64.c 2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_64.c  2008-01-17 08:10:13.0 
-0800
@@ -144,7 +144,7 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
-   printk(KERN_DEBUG "ioremap_nocache: addr %lx, size %lx\n",
+   printk(KERN_INFO "ioremap_nocache: addr %lx, size %lx\n",
   phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_UC);
 }
Index: linux-2.6.git/arch/x86/mm/pat.c
===
--- linux-2.6.git.orig/arch/x86/mm/pat.c2008-01-17 03:18:59.0 
-0800
+++ linux-2.6.git/arch/x86/mm/pat.c 2008-01-17 08:06:23.0 -0800
@@ -170,7 +170,7 @@
 
if (!fattr && attr != ml->attr) {
printk(
-   KERN_DEBUG "%s:%d conflicting cache attribute %Lx-%Lx %s-%s\n",
+   KERN_WARNING "%s:%d conflicting cache attribute %Lx-%Lx %s-%s\n",
current->comm, current->pid,
start, end,
cattr_name(attr), cattr_name(ml->attr));
@@ -205,7 +205,7 @@
list_for_each_entry(ml, &mattr_list, nd) {
if (ml->start == start && ml->end == end) {
if (ml->attr != attr)
-   printk(KERN_DEBUG
+   printk(KERN_WARNING
"%s:%d conflicting cache attributes on free %Lx-%Lx %s-%s\n",
current->comm, current->pid, start, end,
cattr_name(attr), cattr_name(ml->attr));
@@ -217,7 +217,7 @@
}
spin_unlock(&mattr_lock);
if (err)
-   printk(KERN_DEBUG "%s:%d freeing invalid mattr %Lx-%Lx %s\n",
+   printk(KERN_WARNING "%s:%d freeing invalid mattr %Lx-%Lx %s\n",
current->comm, current->pid,
start, end, cattr_name(attr));
return err;
Index: linux-2.6.git/include/asm-x86/io_32.h
===
--- linux-2.6.git.orig/include/asm-x86/io_32.h  2008-01-17 06:28:06.0 
-0800
+++ linux-2.6.git/include/asm-x86/io_32.h   2008-01-17 08:09:30.0 
-0800
@@ -113,6 +113,8 @@
 
 static inline void __iomem * ioremap(unsigned long offset, unsigned long size)
 {
+   printk(KERN_INFO "ioremap: addr %lx, size %lx\n",
+  offset, size);
return __ioremap(offset, size, 0);
 }
 
Index: 

Re: [patch 2/4] x86: PAT followup - Remove KERNPG_TABLE from pte entry

2008-01-16 Thread Venki Pallipadi
On Wed, Jan 16, 2008 at 10:14:00AM +0200, Mika Penttilä wrote:
> [EMAIL PROTECTED] kirjoitti:
> >KERNPG_TABLE was a bug in earlier patch. Remove it from pte.
> >pte_val() check is redundant as this routine is called immediately after a
> >ptepage is allocated afresh.
> >
> >Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>
> >Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> >
> >Index: linux-2.6.git/arch/x86/mm/init_64.c
> >===
> >--- linux-2.6.git.orig/arch/x86/mm/init_64.c 2008-01-15 
> >11:02:23.0 -0800
> >+++ linux-2.6.git/arch/x86/mm/init_64.c  2008-01-15 
> >11:06:37.0 -0800
> >@@ -541,9 +541,6 @@
> > if (address >= end)
> > break;
> > 
> >-if (pte_val(*pte))
> >-continue;
> >-
> > /* Nothing to map. Map the null page */
> > if (!(address & (~PAGE_MASK)) &&
> > (address + PAGE_SIZE <= end) &&
> >@@ -561,9 +558,9 @@
> > }
> > 
> > if (exec)
> >-entry = _PAGE_NX|_KERNPG_TABLE|_PAGE_GLOBAL|address;
> >+entry = _PAGE_NX|_PAGE_GLOBAL|address;
> > else
> >-entry = _KERNPG_TABLE|_PAGE_GLOBAL|address;
> >+entry = _PAGE_GLOBAL|address;
> > entry &= __supported_pte_mask;
> > set_pte(pte, __pte(entry));
> > }
> >
> >  
> 
> Hmm then what's the point of mapping not present 4k pages for valid mem 
> here?
> 

Ingo,

Below incremental patch fixes this pte entry setting correctly. Thanks to
Mika for catching this.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.git/arch/x86/mm/init_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/init_64.c2008-01-16 03:38:32.0 
-0800
+++ linux-2.6.git/arch/x86/mm/init_64.c 2008-01-16 03:51:34.0 -0800
@@ -515,9 +515,9 @@
}
 
if (exec)
-   entry = _PAGE_NX|_PAGE_GLOBAL|address;
+   entry = __PAGE_KERNEL_EXEC | _PAGE_GLOBAL | address;
else
-   entry = _PAGE_GLOBAL|address;
+   entry = __PAGE_KERNEL | _PAGE_GLOBAL | address;
entry &= __supported_pte_mask;
set_pte(pte, __pte(entry));
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/4] x86: PAT followup - Incremental changes and bug fixes

2008-01-16 Thread Venki Pallipadi
On Wed, Jan 16, 2008 at 07:57:48PM +0100, Andreas Herrmann wrote:
> Hi,
> 
> I just want to report that the PAT support in x86/mm causes crashes
> on two of my test machines. On both boxes the SATA detection does
> not work when the PAT support is patched into the kernel.
> 
> Symptoms are as follows -- best described by a diff between the
> two boot.logs:
> 
> # diff boot-failing.log boot-working.log
> 
> -Linux version 2.6.24-rc8-ga9f7faa5 ([EMAIL PROTECTED]) (gcc version ...
> +Linux version 2.6.24-rc8-g2ea3cf43 ([EMAIL PROTECTED]) (gcc version ...
> ...
>  early_iounmap(82a0b000, 1000)
> -early_ioremap(c000, 1000) => -02103394304
> -early_iounmap(82a0c000, 1000)
This does not look to be the problem here. We just mapped some new low
address due to possibly a different code path. But, seems to have worked fine.

>  early_iounmap(82808000, 1000)
> ...
> -ACPI: PCI interrupt for device :00:12.0 disabled
> -sata_sil: probe of :00:12.0 failed with error -12
> +scsi0 : sata_sil
> +scsi1 : sata_sil
> +ata1: SATA max UDMA/100 mmio [EMAIL PROTECTED] tf 0xc0403080 irq 22
> ...
> -AC'97 space ioremap problem
> -ACPI: PCI interrupt for device :00:14.5 disabled
> -ATI IXP AC97 controller: probe of :00:14.5 failed with error -5

This ioremap failing seems to be the real problem. This can be due to
new tracking of ioremaps introduced by PAT patches. We do not allow
conflicting ioremaps to same region. Probably that is happening
in both Sound and sata initialization which results in driver init failing.
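
(For reference, the new tracking essentially remembers the range and cache
attribute of every ioremap and refuses a later overlapping request that asks
for a different attribute -- roughly along the lines of the sketch below. The
struct and list names here are only illustrative, not the exact x86/mm code.)

/* Illustrative sketch only -- names are made up for the example */
struct memattr {
	struct list_head nd;
	u64 start, end;
	unsigned long attr;		/* cache attribute of the existing mapping */
};
static LIST_HEAD(mattr_list);

static int check_ioremap_conflict(u64 start, u64 end, unsigned long attr)
{
	struct memattr *ml;

	list_for_each_entry(ml, &mattr_list, nd)
		if (ml->start < end && ml->end > start && ml->attr != attr)
			return -EBUSY;	/* overlapping alias, different type */
	return 0;
}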

Can you please try the debug patch below over latest x86/mm and boot kernel with
debug boot option and send us the dmesg from the failure. That will give us
better info about ioremaps.

Thanks,
Venki


Index: linux-2.6.git/arch/x86/mm/ioremap_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_64.c 2008-01-16 03:38:32.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_64.c  2008-01-16 05:16:28.0 
-0800
@@ -150,6 +150,8 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
+   printk(KERN_DEBUG "ioremap_nocache: addr %lx, size %lx\n",
+  phys_addr, size);
return __ioremap(phys_addr, size, _PAGE_UC);
 }
 EXPORT_SYMBOL(ioremap_nocache);
Index: linux-2.6.git/include/asm-x86/io_64.h
===
--- linux-2.6.git.orig/include/asm-x86/io_64.h  2008-01-16 03:38:32.0 
-0800
+++ linux-2.6.git/include/asm-x86/io_64.h   2008-01-16 05:16:57.0 
-0800
@@ -154,6 +154,8 @@
 
 static inline void __iomem * ioremap (unsigned long offset, unsigned long size)
 {
+   printk(KERN_DEBUG "ioremap: addr %lx, size %lx\n",
+  offset, size);
return __ioremap(offset, size, 0);
 }
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Folding _PAGE_PWT into _PAGE_PCD (was Re: unify pagetable accessors patch causes double fault II)

2008-01-15 Thread Venki Pallipadi
On Tue, Jan 15, 2008 at 09:16:50AM -0800, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >-#define _PAGE_PRESENT   (_AC(1, UL)<<_PAGE_BIT_PRESENT)
> >-#define _PAGE_RW(_AC(1, UL)<<_PAGE_BIT_RW)
> >-#define _PAGE_USER  (_AC(1, UL)<<_PAGE_BIT_USER)
> >-#define _PAGE_PWT   (_AC(1, UL)<<_PAGE_BIT_PWT)
> >-#define _PAGE_PCD   ((_AC(1, UL)<<_PAGE_BIT_PCD) | _PAGE_PWT)
> >  
> 
> BTW, I just noticed that _PAGE_PWT has been folded into _PAGE_PCD.  This 
> seems like a really bad idea to me, since it breaks the rule that 
> _PAGE_X == 1 << _PAGE_BIT_X.  I can't think of a specific place where 
> this would cause problems, but this kind of non-uniformity always ends 
> up biting someone in the arse.
> 
> I think having a specific _PAGE_NOCACHE which combines these bits is a 
> better approach.
> 
>J

How about the patch below? It defines a new _PAGE_UC. One concern is drivers
continuing to use _PAGE_PCD and getting wrong attributes. Maybe we need to
rename _PAGE_PCD to catch those errors as well?

Thanks,
Venki

Do not fold PCD and PWT bits in _PAGE_PCD. Instead, introduce a new
_PAGE_UC which defines uncached mappings and use it in place of _PAGE_PCD.
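
For clarity, the new macro is meant to be nothing more than the combination of
the two existing page bits; the header change would look roughly like this
(sketch, assuming the current _PAGE_BIT_* layout -- the actual hunk is not
included below):

#define _PAGE_PWT	(_AC(1, UL)<<_PAGE_BIT_PWT)
#define _PAGE_PCD	(_AC(1, UL)<<_PAGE_BIT_PCD)	/* plain bit again */
#define _PAGE_UC	(_PAGE_PCD | _PAGE_PWT)		/* PCD=1, PWT=1: uncached */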

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.git/arch/x86/mm/ioremap_32.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_32.c 2008-01-15 03:29:38.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_32.c  2008-01-15 04:42:59.0 
-0800
@@ -173,7 +173,7 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
-   return __ioremap(phys_addr, size, _PAGE_PCD);
+   return __ioremap(phys_addr, size, _PAGE_UC);
 }
 EXPORT_SYMBOL(ioremap_nocache);
 
Index: linux-2.6.git/arch/x86/mm/ioremap_64.c
===
--- linux-2.6.git.orig/arch/x86/mm/ioremap_64.c 2008-01-15 03:29:38.0 
-0800
+++ linux-2.6.git/arch/x86/mm/ioremap_64.c  2008-01-15 04:43:07.0 
-0800
@@ -150,7 +150,7 @@
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
-   return __ioremap(phys_addr, size, _PAGE_PCD);
+   return __ioremap(phys_addr, size, _PAGE_UC);
 }
 EXPORT_SYMBOL(ioremap_nocache);
 
Index: linux-2.6.git/arch/x86/mm/pat.c
===
--- linux-2.6.git.orig/arch/x86/mm/pat.c2008-01-15 03:29:38.0 
-0800
+++ linux-2.6.git/arch/x86/mm/pat.c 2008-01-15 05:01:43.0 -0800
@@ -64,7 +64,7 @@
if (smp_processor_id() && !pat_wc_enabled)
return;
 
-   /* Set PWT+PCD to Write-Combining. All other bits stay the same */
+   /* Set PCD to Write-Combining. All other bits stay the same */
/* PTE encoding used in Linux:
  PAT
  |PCD
@@ -72,7 +72,7 @@
  |||
  000 WB default
  010 WC _PAGE_WC
- 011 UC _PAGE_PCD
+ 011 UC _PAGE_UC
PAT bit unused */
pat = PAT(0,WB) | PAT(1,WT) | PAT(2,WC) | PAT(3,UC) |
  PAT(4,WB) | PAT(5,WT) | PAT(6,WC) | PAT(7,UC);
@@ -97,7 +97,7 @@
 {
switch (flags & _PAGE_CACHE_MASK) {
case _PAGE_WC:  return "write combining";
-   case _PAGE_PCD: return "uncached";
+   case _PAGE_UC: return "uncached";
case 0: return "default";
default:return "broken";
}
@@ -144,7 +144,7 @@
if (!fattr)
return -EINVAL;
else
-   *fattr  = _PAGE_PCD;
+   *fattr  = _PAGE_UC;
}
 
return 0;
@@ -227,13 +227,13 @@
unsigned long flags;
unsigned long want_flags = 0;
if (file->f_flags & O_SYNC)
-   want_flags = _PAGE_PCD;
+   want_flags = _PAGE_UC;
 
 #ifdef CONFIG_X86_32
/*
 * On the PPro and successors, the MTRRs are used to set
 * memory types for physical addresses outside main memory,
-* so blindly setting PCD or PWT on those pages is wrong.
+* so blindly setting UC or PWT on those pages is wrong.
 * For Pentiums and earlier, the surround logic should disable
 * caching for the high addresses through the KEN pin, but
 * we maintain the tradition of paranoia in this code.
@@ -244,7 +244,7 @@
test_bit(X86_FEATURE_CYRIX_ARR, boot_cpu_data.x86_capability) ||
test_bit(X86_FEATURE_CENTAUR_MCR, 
boot_cpu_data.x86_capability)) &&
   offset >= __pa(high_memory))
-   want_flags = _PAGE_PCD;
+   want_flags = _PAGE_UC;
 #endif
 
/* ignore error because we can't handle it here */
Index: linux-2.6.git/arch/x86/pci/i386.c
===
--- linux-2.6.git.orig/arch/x86/pci/i386.c  2008-01-15 


Re: + restore-missing-sysfs-max_cstate-attr.patch added to -mm tree

2008-01-03 Thread Venki Pallipadi

Reintroduce run time configurable max_cstate for !CPU_IDLE case.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc/drivers/acpi/processor_idle.c
===
--- linux-2.6.24-rc.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.24-rc/drivers/acpi/processor_idle.c
@@ -76,7 +76,11 @@ static void (*pm_idle_save) (void) __rea
 #define PM_TIMER_TICKS_TO_US(p)(((p) * 
1000)/(PM_TIMER_FREQUENCY/1000))
 
 static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER;
+#ifdef CONFIG_CPU_IDLE
 module_param(max_cstate, uint, 0000);
+#else
+module_param(max_cstate, uint, 0644);
+#endif
 static unsigned int nocst __read_mostly;
 module_param(nocst, uint, 0000);
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-19 Thread Venki Pallipadi
On Wed, Dec 19, 2007 at 08:32:55PM +0100, Ingo Molnar wrote:
> 
> * Venki Pallipadi <[EMAIL PROTECTED]> wrote:
> 
> > Aviod TLB flush IPIs during C3 states by voluntary leave_mm() before 
> > entering C3.
> > 
> > The performance impact of TLB flush on C3 should not be significant 
> > with respect to C3 wakeup latency. Also, CPUs tend to flush TLB in 
> > hardware while in C3 anyways.
> > 
> > On a 8 logical CPU system, running make -j2, the number of tlbflush 
> > IPIs goes down from 40 per second to ~ 0. Total number of interrupts 
> > during the run of this workload was ~1200 per second, which makes it 
> > ~3% savings in wakeups.
> > 
> > There was no measurable performance or power impact however.
> 
> thanks, applied to x86.git. Nice and elegant patch!
> 
> Btw., since the TLB flush state machine is really subtle and fragile, 
> could you try to run the following mmap stresstest i wrote some time 
> ago:
> 
>http://redhat.com/~mingo/threaded-mmap-stresstest/
> 
> for a couple of hours. It runs nr_cpus threads which then do a "random 
> crazy mix" of mappings/unmappings/remappings of a 800 MB memory window. 
> The more sockets/cores, the crazier the TLB races get ;-)
> 

Ingo,

I ran this stress test on two systems (8 cores and 2 cores) for over
4 hours without any issues. There was more than 20% C3 time during the
run. So, this C3 tlbflush path must have been stressed well during the run.

And sorry about the patch not working on UP config. That was a silly oversight
on my part.

Thanks,
Venki
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-19 Thread Venki Pallipadi
On Wed, Dec 19, 2007 at 11:48:14AM -0800, H. Peter Anvin wrote:
> Ingo Molnar wrote:
> >
> >i dont think it's required for C3 to even turn off any portion of the 
> >CPU - if an interrupt arrives after the C3 sequence is initiated but 
> >just before dirty cachelines have been flushed then the CPU can just 
> >return without touching anything (such as the TLB) - right? So i dont 
> >think there's any implicit guarantee of TLB flushing (nor should there 
> >be), but in practice, a good C3 sequence would (statistically) turn off 
> >large portions of the CPU and hence the TLB as well.
> >
> 
> I think C3 guarantees that the cache contents stay intact, and thus it 
> might make sense in some technology to preserve the TLB as well (being a 
> kind of cache.)
> 
> Otherwise, what you say here of course is absolutely correct.
> 

C3 does not guarantee all cache contents. In fact, at least on Intel,
L1 will almost always be flushed. Newer, more power-efficient CPUs do dynamic
cache sizing [1].

C3 just guarantees that the caches are coherent. That is, if they are intact,
then DMA will keep cache consistent.

Thanks,
Venki

[1] - 
http://download.intel.com/products/processor/core2duo/mobile_prod_brief.pdf

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-19 Thread Venki Pallipadi
On Wed, Dec 19, 2007 at 08:40:32PM +0100, Ingo Molnar wrote:
> 
> * H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> 
> > Ingo Molnar wrote:
> >> * Venki Pallipadi <[EMAIL PROTECTED]> wrote:
> >>
> >>> Aviod TLB flush IPIs during C3 states by voluntary leave_mm() before 
> >>> entering C3.
> >>>
> >>> The performance impact of TLB flush on C3 should not be significant with 
> >>> respect to C3 wakeup latency. Also, CPUs tend to flush TLB in hardware 
> >>> while in C3 anyways.
> >>>
> >
> > Are there any CPUs around which *don't* flush the TLB across C3?  (I 
> > guess it's not guaranteed by the spec, though, and as TLBs grow larger 
> > there might be incentive to keep them online.)
> 
> i dont think it's required for C3 to even turn off any portion of the 
> CPU - if an interrupt arrives after the C3 sequence is initiated but 
> just before dirty cachelines have been flushed then the CPU can just 
> return without touching anything (such as the TLB) - right? So i dont 
> think there's any implicit guarantee of TLB flushing (nor should there 
> be), but in practice, a good C3 sequence would (statistically) turn off 
> large portions of the CPU and hence the TLB as well.
> 

Yes. There are cases where hardware/BIOS can do C-state changes behind the OS,
e.g. staying in C1 for a while and only then going to C2/C3, etc. In such
cases there will be times when the TLBs are not really flushed in hardware.
But ideally, if C3 results in deep idle, the TLBs would be turned off. And in
cases where we wake up earlier than expected, the C-state policy should
identify that and choose a lower C-state next time around.

I also tried one variation of this, where I only do the flush if more than
one CPU is sharing the mm. But that did not help the test case I was using
(which is probably the worst case). What I would see is:
Process runs on CPU x and mm is not shared
Goes idle (C3) waiting on something
Wakes up on CPU y which will now start sharing mm
and would send flush IPI anyway
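
The variation itself was just a guard around the unlazy call before entering
C3, something like this (sketch; the cpumask test is the experimental part,
not code from the applied patch):

	/* only unlazy if some other CPU may actually share this mm */
	if (cpus_weight(current->active_mm->cpu_vm_mask) > 1)
		acpi_unlazy_tlb(smp_processor_id());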

Thanks,
Venki

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-19 Thread Venki Pallipadi

Avoid TLB flush IPIs during C3 states by doing a voluntary leave_mm()
before entering C3.

The performance impact of TLB flush on C3 should not be significant with
respect to C3 wakeup latency. Also, CPUs tend to flush TLB in hardware while in
C3 anyways.

On an 8 logical CPU system, running make -j2, the number of tlbflush IPIs goes
down from 40 per second to ~ 0. Total number of interrupts during the run
of this workload was ~1200 per second, which makes it ~3% savings in wakeups.

There was no measurable performance or power impact however.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc/arch/x86/kernel/smp_64.c
===
--- linux-2.6.24-rc.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6.24-rc/arch/x86/kernel/smp_64.c
@@ -70,7 +70,7 @@ static DEFINE_PER_CPU(union smp_flush_st
  * We cannot call mmdrop() because we are in interrupt context, 
  * instead update mm->cpu_vm_mask.
  */
-static inline void leave_mm(int cpu)
+void leave_mm(int cpu)
 {
if (read_pda(mmu_state) == TLBSTATE_OK)
BUG();
Index: linux-2.6.24-rc/include/asm-x86/acpi_32.h
===
--- linux-2.6.24-rc.orig/include/asm-x86/acpi_32.h
+++ linux-2.6.24-rc/include/asm-x86/acpi_32.h
@@ -31,6 +31,7 @@
 #include <acpi/pdc_intel.h>
 
 #include <asm/system.h>	/* defines cmpxchg */
+#include <asm/mmu.h>
 
 #define COMPILER_DEPENDENT_INT64   long long
 #define COMPILER_DEPENDENT_UINT64  unsigned long long
@@ -138,6 +139,8 @@ static inline void disable_acpi(void) { 
 
 #define ARCH_HAS_POWER_INIT1
 
+#define acpi_unlazy_tlb(x) leave_mm(x)
+
 #endif /*__KERNEL__*/
 
 #endif /*_ASM_ACPI_H*/
Index: linux-2.6.24-rc/include/asm-x86/acpi_64.h
===
--- linux-2.6.24-rc.orig/include/asm-x86/acpi_64.h
+++ linux-2.6.24-rc/include/asm-x86/acpi_64.h
@@ -30,6 +30,7 @@
 
 #include <acpi/pdc_intel.h>
 #include <asm/numa.h>
+#include <asm/mmu.h>
 
 #define COMPILER_DEPENDENT_INT64   long long
 #define COMPILER_DEPENDENT_UINT64  unsigned long long
@@ -148,6 +149,8 @@ static inline void acpi_fake_nodes(const
 }
 #endif
 
+#define acpi_unlazy_tlb(x) leave_mm(x)
+
 #endif /*__KERNEL__*/
 
 #endif /*_ASM_ACPI_H*/
Index: linux-2.6.24-rc/drivers/acpi/processor_idle.c
===
--- linux-2.6.24-rc.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.24-rc/drivers/acpi/processor_idle.c
@@ -530,6 +530,7 @@ static void acpi_processor_idle(void)
break;
 
case ACPI_STATE_C3:
+   acpi_unlazy_tlb(smp_processor_id());
/*
 * disable bus master
 * bm_check implies we need ARB_DIS
@@ -1485,6 +1486,7 @@ static int acpi_idle_enter_bm(struct cpu
return 0;
}
 
+   acpi_unlazy_tlb(smp_processor_id());
/*
 * Must be done before busmaster disable as we might need to
 * access HPET !
Index: linux-2.6.24-rc/include/asm-ia64/acpi.h
===
--- linux-2.6.24-rc.orig/include/asm-ia64/acpi.h
+++ linux-2.6.24-rc/include/asm-ia64/acpi.h
@@ -126,6 +126,8 @@ extern int __devinitdata pxm_to_nid_map[
 extern int __initdata nid_to_pxm_map[MAX_NUMNODES];
 #endif
 
+#define acpi_unlazy_tlb(x)
+
 #endif /*__KERNEL__*/
 
 #endif /*_ASM_ACPI_H*/
Index: linux-2.6.24-rc/include/asm-x86/mmu.h
===
--- linux-2.6.24-rc.orig/include/asm-x86/mmu.h
+++ linux-2.6.24-rc/include/asm-x86/mmu.h
@@ -20,4 +20,6 @@ typedef struct { 
void *vdso;
 } mm_context_t;
 
+void leave_mm(int cpu);
+
 #endif /* _ASM_X86_MMU_H */
Index: linux-2.6.24-rc/arch/x86/kernel/smp_32.c
===
--- linux-2.6.24-rc.orig/arch/x86/kernel/smp_32.c
+++ linux-2.6.24-rc/arch/x86/kernel/smp_32.c
@@ -256,7 +256,7 @@ static DEFINE_SPINLOCK(tlbstate_lock);
  * We need to reload %cr3 since the page tables may be going
  * away from under us..
  */
-void leave_mm(unsigned long cpu)
+void leave_mm(int cpu)
 {
if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK)
BUG();
Index: linux-2.6.24-rc/include/asm-x86/mmu_context_32.h
===
--- linux-2.6.24-rc.orig/include/asm-x86/mmu_context_32.h
+++ linux-2.6.24-rc/include/asm-x86/mmu_context_32.h
@@ -32,8 +32,6 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
-void leave_mm(unsigned long cpu);
-
 static inline void switch_mm(struct mm_struct *prev,
 struct mm_struct *next,
 struct task_struct *tsk)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

2007-12-14 Thread Venki Pallipadi
On Fri, Dec 14, 2007 at 01:42:12AM +0100, Andi Kleen wrote:
> > +void __cpuinit pat_init(void)
> > +{
> > +   /* Set PWT+PCD to Write-Combining. All other bits stay the same */
> > +   if (cpu_has_pat) {
> 
> All the old CPUs (PPro etc.) with known PAT bugs need to clear this flag 
> now in their CPU init functions. It is fine to be aggressive there
> because these old systems have lived so long without PAT they can do 
> so forever. So perhaps it's best to just white list it only for newer
> CPUs on the Intel side at least.

Yes. Enabling this only on relatively newer CPUs is safer. Will do that in next 
iteration of the patches.
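
Something along these lines in the CPU init path would do it (sketch only; the
family/model cutoff below is just a placeholder and still needs to be decided):

	/* sketch: treat older parts with known PAT errata as having no PAT */
	if (c->x86 == 6 && c->x86_model < 15)
		clear_bit(X86_FEATURE_PAT, c->x86_capability);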
 
> Another problem is that there are some popular modules (ATI, Nvidia for once)
> who reprogram the PAT registers on their own, likely different. Need some way 
> to detect
> that case I guess, otherwise lots of users will see strange malfunctions.
> Maybe recheck after module load?

Yes. We can check that at load time. But they can still do bad things at
runtime, like say when 3D gets enabled etc.?

 
> > +   |||
> > +  000 WB default
> > +  010 UC_MINUS   _PAGE_PCD
> > +  011 WC _PAGE_WC
> > +  PAT bit unused */
> > +   pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
> > +   rdmsrl(MSR_IA32_CR_PAT, boot_pat_state);
> > +   wrmsrl(MSR_IA32_CR_PAT, pat);
> > +   __flush_tlb_all();
> > +   asm volatile("wbinvd");
> 
> Have you double checked this is the full procedure from the manual? iirc there
> were some steps missing.


Checking the manual for this. You are right, we had missed some steps here.
Actually, the manual says that on MP, the PAT MSR on all CPUs must be
consistent (even when they are not really using it in their page tables).
So, this will change the init and shutdown parts significantly and there may be 
some challenges with CPU offline and KEXEC. We will redo this part in next 
iteration.
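
The per-CPU write itself will also need the usual cache-disable sequence from
the manual, roughly like below (sketch; the cross-CPU rendezvous needed to keep
the MSR consistent is not shown):

	unsigned long flags;

	local_irq_save(flags);
	write_cr0(read_cr0() | X86_CR0_CD);	/* enter no-fill cache mode */
	asm volatile("wbinvd");			/* flush caches */
	__flush_tlb_all();			/* flush TLBs */
	wrmsrl(MSR_IA32_CR_PAT, pat);		/* program the new PAT value */
	asm volatile("wbinvd");
	__flush_tlb_all();
	write_cr0(read_cr0() & ~X86_CR0_CD);	/* re-enable caching */
	local_irq_restore(flags);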

Thanks,
Venki
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1 and 2.6.24.rc2 hangs while running udev on my laptop

2007-11-09 Thread Venki Pallipadi
On Fri, Nov 09, 2007 at 10:10:43AM -0800, Pallipadi, Venkatesh wrote:
>  
> 
> >-Original Message-
> >From: Andrew Morton [mailto:[EMAIL PROTECTED] 
> >Sent: Friday, November 09, 2007 2:03 AM
> >To: SANGOI DINO LEONARDO
> >Cc: linux-kernel@vger.kernel.org; Rafael J. Wysocki; Brown, 
> >Len; Pallipadi, Venkatesh; [EMAIL PROTECTED]
> >Subject: Re: 2.6.24-rc1 and 2.6.24.rc2 hangs while running 
> >udev on my laptop
> >
> >
> >(cc's added)
> >
> >On Fri, 9 Nov 2007 09:47:02 +0100  SANGOI DINO LEONARDO 
> ><[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>  
> >> My laptop (an HP nx6125) doesn't boot with kernels 2.6.24-rc1 and
> >> 2.6.24.rc2. 
> >> It works fine with 2.6.23 and older.
> >> 
> >> I seen this bug first while running fedora rawhide, so you 
> >can find hardware
> >> 
> >> info and boot logs at 
> >https://bugzilla.redhat.com/show_bug.cgi?id=312201.
> >> 
> >> I did a git bisect, and got this:
> >> 
> >> $ git bisect bad
> >> 4f86d3a8e297205780cca027e974fd5f81064780 is first bad commit
> >> commit 4f86d3a8e297205780cca027e974fd5f81064780
> >> Author: Len Brown <[EMAIL PROTECTED]>
> >> Date:   Wed Oct 3 18:58:00 2007 -0400
> >> 
> >> cpuidle: consolidate 2.6.22 cpuidle branch into one patch
> >> [SNIP full commit log]
> >> 
> 
> > 
> >> 
> >> Config is taken from Fedora kernel. CONFIG_CPU_IDLE is set 
> >to y (tell me if
> >> full config is needed).
> >> 
> >> If I use 'nolapic' parameter, kernel 2.6.24-rc1 boots fine. 
> >> Setting CONFIG_CPU_IDLE=n also gives me a working kernel.
> >> 
> >> Ask me if more info is needed (please CC me).
> >> 
> >> Thanks,
> >> 
> >> Dino
> 
> 

Dino,

Can you try the patch below over rc2 and see whether it fixes the problem.
Looking at the code, it should fix the problem. If it does not, can you send
me the output of acpidump from your system. That will help to look further
into this. You can get acpidump from latest pmtools package here.
www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/

Thanks,
Venki


Test patch for the bug report at
https://bugzilla.redhat.com/show_bug.cgi?id=312201

Signed-off-by: Venki Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc/drivers/acpi/processor_idle.c
===
--- linux-2.6.24-rc.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.24-rc/drivers/acpi/processor_idle.c
@@ -1502,23 +1502,28 @@ static int acpi_idle_enter_bm(struct cpu
} else {
acpi_idle_update_bm_rld(pr, cx);
 
-   spin_lock(&c3_lock);
-   c3_cpu_count++;
-   /* Disable bus master arbitration when all CPUs are in C3 */
-   if (c3_cpu_count == num_online_cpus())
-   acpi_set_register(ACPI_BITREG_ARB_DISABLE, 1);
-   spin_unlock(&c3_lock);
+   if (pr->flags.bm_check && pr->flags.bm_control) {
+   spin_lock(&c3_lock);
+   c3_cpu_count++;
+   /* Disable bus master arbitration when all CPUs are in 
C3 */
+   if (c3_cpu_count == num_online_cpus())
+   acpi_set_register(ACPI_BITREG_ARB_DISABLE, 1);
+   spin_unlock(&c3_lock);
+   } else if (!pr->flags.bm_check) {
+   ACPI_FLUSH_CPU_CACHE();
+   }
 
t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
acpi_idle_do_entry(cx);
t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 
-   spin_lock(&c3_lock);
/* Re-enable bus master arbitration */
-   if (c3_cpu_count == num_online_cpus())
+   if (pr->flags.bm_check && pr->flags.bm_control) {
+   spin_lock(&c3_lock);
acpi_set_register(ACPI_BITREG_ARB_DISABLE, 0);
-   c3_cpu_count--;
-   spin_unlock(&c3_lock);
+   c3_cpu_count--;
+   spin_unlock(&c3_lock);
+   }
}
 
 #if defined (CONFIG_GENERIC_TIME) && defined (CONFIG_X86_TSC)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Track accurate idle time with tick_sched.idle_sleeptime

2007-08-27 Thread Venki Pallipadi

Current idle time in kstat is based on jiffies and is coarse grained.
tick_sched.idle_sleeptime is making some attempt to keep track of
idle time in a fine grained manner. But, it is not handling
the time spent in interrupts fully.

Make tick_sched.idle_sleeptime accurate with respect to time spent on
handling interrupts and also add tick_sched.idle_lastupdate, which
keeps track of last time when idle_sleeptime was updated.

These statistics will be crucial for the cpufreq ondemand governor, which can
shed some of the conservative guard band it uses today while setting the
frequency. The ondemand changes that use the exact idle time are coming soon.
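
On the governor side the usage will look roughly like this (sketch of the
upcoming ondemand change; the prev_* values are per-CPU state kept by the
governor, named here only for illustration):

	u64 cur_wall_time, cur_idle_time, wall_delta, idle_delta;

	cur_idle_time = get_cpu_idle_time_us(cpu, &cur_wall_time);
	wall_delta = cur_wall_time - prev_wall_time;
	idle_delta = cur_idle_time - prev_idle_time;
	/* load (%) = (wall_delta - idle_delta) * 100 / wall_delta */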

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22/kernel/time/tick-sched.c
===
--- linux-2.6.22.orig/kernel/time/tick-sched.c
+++ linux-2.6.22/kernel/time/tick-sched.c
@@ -141,6 +141,43 @@ void tick_nohz_update_jiffies(void)
local_irq_restore(flags);
 }
 
+void tick_nohz_stop_idle(int cpu)
+{
+   struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+   if (ts->idle_active) {
+   ktime_t now, delta;
+   now = ktime_get();
+   delta = ktime_sub(now, ts->idle_entrytime);
+   ts->idle_lastupdate = now;
+   ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+   ts->idle_active = 0;
+   }
+}
+
+static void tick_nohz_start_idle(int cpu)
+{
+   struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+   ktime_t now, delta;
+
+   now = ktime_get();
+   if (ts->idle_active) {
+   delta = ktime_sub(now, ts->idle_entrytime);
+   ts->idle_lastupdate = now;
+   ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+   }
+   ts->idle_entrytime = now;
+   ts->idle_active = 1;
+}
+
+u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
+{
+   struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+
+   *last_update_time = ktime_to_us(ts->idle_lastupdate);
+   return ktime_to_us(ts->idle_sleeptime);
+}
+
 /**
  * tick_nohz_stop_sched_tick - stop the idle tick from the idle task
  *
@@ -152,13 +189,15 @@ void tick_nohz_stop_sched_tick(void)
 {
unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
struct tick_sched *ts;
-   ktime_t last_update, expires, now, delta;
+   ktime_t last_update, expires, now;
struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
int cpu;
 
local_irq_save(flags);
 
cpu = smp_processor_id();
+   tick_nohz_start_idle(cpu);
+
ts = &per_cpu(tick_cpu_sched, cpu);
 
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
@@ -178,19 +217,7 @@ void tick_nohz_stop_sched_tick(void)
}
}
 
-   now = ktime_get();
-   /*
-* When called from irq_exit we need to account the idle sleep time
-* correctly.
-*/
-   if (ts->tick_stopped) {
-   delta = ktime_sub(now, ts->idle_entrytime);
-   ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-   }
-
-   ts->idle_entrytime = now;
ts->idle_calls++;
-
/* Read jiffies and the time when jiffies were updated last */
do {
seq = read_seqbegin(&xtime_lock);
@@ -320,23 +347,22 @@ void tick_nohz_restart_sched_tick(void)
int cpu = smp_processor_id();
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
unsigned long ticks;
-   ktime_t now, delta;
+   ktime_t now;
+
+   local_irq_disable();
+   tick_nohz_stop_idle(cpu);
 
-   if (!ts->tick_stopped)
+   if (!ts->tick_stopped) {
+   local_irq_enable();
return;
+   }
 
/* Update jiffies first */
-   now = ktime_get();
-
-   local_irq_disable();
select_nohz_load_balancer(0);
+   now = ktime_get();
tick_do_update_jiffies64(now);
cpu_clear(cpu, nohz_cpu_mask);
 
-   /* Account the idle time */
-   delta = ktime_sub(now, ts->idle_entrytime);
-   ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
-
/*
 * We stopped the tick in idle. Update process times would miss the
 * time we slept as update_process_times does only a 1 tick
Index: linux-2.6.22/include/linux/tick.h
===
--- linux-2.6.22.orig/include/linux/tick.h
+++ linux-2.6.22/include/linux/tick.h
@@ -51,8 +51,10 @@ struct tick_sched {
unsigned long   idle_jiffies;
unsigned long   idle_calls;
unsigned long   idle_sleeps;
+   int idle_active;
ktime_t idle_entrytime;
ktime_t idle_sleeptime;
+   ktime_t idle_lastupdate;
ktime_t


Re: Cpu-Hotplug and Real-Time

2007-08-07 Thread Venki Pallipadi
On Tue, Aug 07, 2007 at 07:13:36PM +0400, Oleg Nesterov wrote:
> On 08/07, Gautham R Shenoy wrote:
> >
> > After some debugging, I saw that the hang occured because
> > the high prio process was stuck in a loop doing yield() inside
> > wait_task_inactive(). Description follows:
> > 
> > Say a high-prio task (A) does a kthread_create(B),
> > followed by a kthread_bind(B, cpu1). At this moment, 
> > only cpu0 is online.
> > 
> > Now, immediately after being created, B would
> > do a 
> > complete(&create->started) [kernel/kthread.c: kthread()], 
> > before scheduling itself out.
> > 
> > This complete() will wake up kthreadd, which had spawned B.
> > It is possible that during the wakeup, kthreadd might preempt B.
> > Thus, B is still on the runqueue, and not yet called schedule().
> > 
> > kthreadd, will inturn do a 
> > complete(&create->done); [kernel/kthread.c: create_kthread()]
> > which will wake up the thread which had called kthread_create().
> > In our case it's task A, which will run immediately, since its priority
> > is higher.
> > 
> > A will now call kthread_bind(B, cpu1).
> > kthread_bind(), calls wait_task_inactive(B), to ensures that 
> > B has scheduled itself out.
> > 
> > B is still on the runqueue, so A calls yield() in wait_task_inactive().
> > But since A is the task with the highest prio, scheduler schedules it
> > back again.
> > 
> > Thus B never gets to run to schedule itself out.
> > A loops waiting for B to schedule out leading  to system hang.
> 
> As for kthread_bind(), I think wait_task_inactive+set_task_cpu is just
> an optimization, and easy to "fix":
> 
> --- kernel/kthread.c  2007-07-28 16:58:17.0 +0400
> +++ /proc/self/fd/0   2007-08-07 18:56:54.248073547 +0400
> @@ -166,10 +166,7 @@ void kthread_bind(struct task_struct *k,
>   WARN_ON(1);
>   return;
>   }
> - /* Must have done schedule() in kthread() before we set_task_cpu */
> - wait_task_inactive(k);
> - set_task_cpu(k, cpu);
> - k->cpus_allowed = cpumask_of_cpu(cpu);
> + set_cpus_allowed(current, cpumask_of_cpu(cpu));
>  }
>  EXPORT_SYMBOL(kthread_bind);
> 

Not sure whether set_cpus_allowed() will work here. It looks like it needs the
CPU to be online during the call, and in the kthread_bind() case the CPU may be offline.

Thanks,
Venki


Re: Time Problems with 2.6.23-rc1-gf695baf2

2007-07-31 Thread Venki Pallipadi
On Tue, Jul 31, 2007 at 05:38:08PM +0200, Eric Sesterhenn / Snakebyte wrote:
> * Pallipadi, Venkatesh ([EMAIL PROTECTED]) wrote:
> > This means things should work fine with processor.max_cstate=2 boot
> > option
> > as well. Can you please double check that.
> 
> yes, system boots fine with this kernel parameter
> 
> > Also, please send in the acpidump from your system.
> 
> here we go, if you need some parameters to acpidump, just say so.
> 

Eric,

Can you check the test patch below (over latest git) and let me know whether it
resolves the issue.

Thanks,
Venki


Enable C3 without bm control only for CST based C3.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6/drivers/acpi/processor_idle.c
===
--- linux-2.6.orig/drivers/acpi/processor_idle.c2007-07-31 
04:29:26.0 -0700
+++ linux-2.6/drivers/acpi/processor_idle.c 2007-07-31 04:52:50.0 
-0700
@@ -969,11 +969,17 @@
}
 
if (pr->flags.bm_check) {
-   /* bus mastering control is necessary */
if (!pr->flags.bm_control) {
-   /* In this case we enter C3 without bus mastering */
-   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
-   "C3 support without bus mastering control\n"));
+   if (pr->flags.has_cst != 1) {
+   /* bus mastering control is necessary */
+   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
+   "C3 support requires BM control\n"));
+   return;
+   } else {
+   /* Here we enter C3 without bus mastering */
+   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
+   "C3 support without BM control\n"));
+   }
}
} else {
/*


[PATCH 7/7] ICH Force HPET: Add ICH7_0 pciid to quirk list

2007-06-22 Thread Venki Pallipadi


Add another PCI ID for ICH7 force hpet.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.21/arch/i386/kernel/quirks.c
===
--- linux-2.6.21.orig/arch/i386/kernel/quirks.c
+++ linux-2.6.21/arch/i386/kernel/quirks.c
@@ -149,6 +149,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
  ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1,
  ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0,
+ ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1,
  ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31,


[PATCH 4/7] ICH Force HPET: Late initialization of hpet after quirk

2007-06-22 Thread Venki Pallipadi


Enable HPET later during boot, after the force detect in PCI quirks.
Also add a call to repeat the force enabling at resume time.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---
 arch/i386/kernel/hpet.c |   50 +++-
 include/asm-i386/hpet.h |1 
 2 files changed, 46 insertions(+), 5 deletions(-)

Index: linux-2.6.22-rc5/include/asm-i386/hpet.h
===
--- linux-2.6.22-rc5.orig/include/asm-i386/hpet.h   2007-06-17 
08:52:10.0 +0200
+++ linux-2.6.22-rc5/include/asm-i386/hpet.h2007-06-17 08:52:10.0 
+0200
@@ -64,6 +64,7 @@
 
 /* hpet memory map physical address */
 extern unsigned long hpet_address;
+extern unsigned long force_hpet_address;
 extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 
Index: linux-2.6.22-rc5/arch/i386/kernel/hpet.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/hpet.c   2007-06-17 
08:52:10.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/hpet.c2007-06-17 08:52:10.0 
+0200
@@ -25,6 +25,8 @@ extern struct clock_event_device *global
  */
 unsigned long hpet_address;
 
+static void __iomem * hpet_virt_address;
+
 #ifdef CONFIG_X86_64
 
 #include <asm/pgtable.h>
@@ -34,19 +36,22 @@ static inline void hpet_set_mapping(void
 {
set_fixmap_nocache(FIX_HPET_BASE, hpet_address);
__set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE);
+   hpet_virt_address = (void __iomem *)fix_to_virt(FIX_HPET_BASE);
+
 }
 
 static inline void __iomem *hpet_get_virt_address(void)
 {
-   return (void __iomem *)fix_to_virt(FIX_HPET_BASE);
+   return hpet_virt_address;
 }
 
-static inline void hpet_clear_mapping(void) { }
+static inline void hpet_clear_mapping(void)
+{
+   hpet_virt_address = NULL;
+}
 
 #else
 
-static void __iomem * hpet_virt_address;
-
 static inline unsigned long hpet_readl(unsigned long a)
 {
return readl(hpet_virt_address + a);
@@ -173,6 +178,7 @@ static struct clock_event_device hpet_cl
.set_next_event = hpet_legacy_next_event,
.shift  = 32,
.irq= 0,
+   .rating = 50,
 };
 
 static void hpet_start_counter(void)
@@ -187,6 +193,17 @@ static void hpet_start_counter(void)
hpet_writel(cfg, HPET_CFG);
 }
 
+static void hpet_resume_device(void)
+{
+   ich_force_hpet_resume();
+}
+
+static void hpet_restart_counter(void)
+{
+   hpet_resume_device();
+   hpet_start_counter();
+}
+
 static void hpet_enable_legacy_int(void)
 {
unsigned long cfg = hpet_readl(HPET_CFG);
@@ -308,7 +325,7 @@ static struct clocksource clocksource_hp
.mask   = HPET_MASK,
.shift  = HPET_SHIFT,
.flags  = CLOCK_SOURCE_IS_CONTINUOUS,
-   .resume = hpet_start_counter,
+   .resume = hpet_restart_counter,
 #ifdef CONFIG_X86_64
.vread  = vread_hpet,
 #endif
@@ -372,6 +389,9 @@ int __init hpet_enable(void)
 {
unsigned long id;
 
+   if (hpet_get_virt_address())
+   return 0;
+
if (!is_hpet_capable())
return 0;
 
@@ -416,6 +436,26 @@ out_nohpet:
 }
 
 
+static int __init hpet_late_init(void)
+{
+   if (boot_hpet_disable)
+   return -ENODEV;
+
+   if (!hpet_address) {
+   if (!force_hpet_address)
+   return -ENODEV;
+
+   hpet_address = force_hpet_address;
+   hpet_enable();
+   if (!hpet_get_virt_address())
+   return -ENODEV;
+   }
+
+   return 0;
+}
+fs_initcall(hpet_late_init);
+
+
 #ifdef CONFIG_HPET_EMULATE_RTC
 
 /* HPET in LegacyReplacement Mode eats up RTC interrupt line. When, HPET


[PATCH 5/7] ICH Force HPET: ICH5 quirk to force detect enable

2007-06-22 Thread Venki Pallipadi


force_enable hpet for ICH5.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---
 arch/i386/kernel/hpet.c   |2 
 arch/i386/kernel/quirks.c |  101 +-
 include/asm-i386/hpet.h   |2 
 include/linux/pci_ids.h   |1 
 4 files changed, 103 insertions(+), 3 deletions(-)

Index: linux-2.6.22-rc5/arch/i386/kernel/quirks.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/quirks.c 2007-06-17 
08:52:10.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/quirks.c  2007-06-17 08:52:10.0 
+0200
@@ -54,9 +54,15 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_IN
 #if defined(CONFIG_HPET_TIMER)
 unsigned long force_hpet_address;
 
+static enum {
+   NONE_FORCE_HPET_RESUME,
+   OLD_ICH_FORCE_HPET_RESUME,
+   ICH_FORCE_HPET_RESUME
+} force_hpet_resume_type;
+
 static void __iomem *rcba_base;
 
-void ich_force_hpet_resume(void)
+static void ich_force_hpet_resume(void)
 {
u32 val;
 
@@ -133,6 +139,7 @@ static void ich_force_enable_hpet(struct
iounmap(rcba_base);
printk(KERN_DEBUG "Failed to force enable HPET\n");
} else {
+   force_hpet_resume_type = ICH_FORCE_HPET_RESUME;
printk(KERN_DEBUG "Force enabled HPET at base address 0x%lx\n",
   force_hpet_address);
}
@@ -148,4 +155,96 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
  ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_1,
  ich_force_enable_hpet);
+
+
+static struct pci_dev *cached_dev;
+
+static void old_ich_force_hpet_resume(void)
+{
+   u32 val, gen_cntl;
+
+   if (!force_hpet_address || !cached_dev)
+   return;
+
+   pci_read_config_dword(cached_dev, 0xD0, &gen_cntl);
+   gen_cntl &= (~(0x7 << 15));
+   gen_cntl |= (0x4 << 15);
+
+   pci_write_config_dword(cached_dev, 0xD0, gen_cntl);
+   pci_read_config_dword(cached_dev, 0xD0, &gen_cntl);
+   val = gen_cntl >> 15;
+   val &= 0x7;
+   if (val == 0x4)
+   printk(KERN_DEBUG "Force enabled HPET at resume\n");
+   else
+   BUG();
+}
+
+static void old_ich_force_enable_hpet(struct pci_dev *dev)
+{
+   u32 val, gen_cntl;
+
+   if (hpet_address || force_hpet_address)
+   return;
+
+   pci_read_config_dword(dev, 0xD0, &gen_cntl);
+   /*
+* Bit 17 is HPET enable bit.
+* Bit 16:15 control the HPET base address.
+*/
+   val = gen_cntl >> 15;
+   val &= 0x7;
+   if (val & 0x4) {
+   val &= 0x3;
+   force_hpet_address = 0xFED0 | (val << 12);
+   printk(KERN_DEBUG "HPET at base address 0x%lx\n",
+  force_hpet_address);
+   cached_dev = dev;
+   return;
+   }
+
+   /*
+* HPET is disabled. Trying enabling at FED0 and check
+* whether it sticks
+*/
+   gen_cntl &= (~(0x7 << 15));
+   gen_cntl |= (0x4 << 15);
+   pci_write_config_dword(dev, 0xD0, gen_cntl);
+
+   pci_read_config_dword(dev, 0xD0, &gen_cntl);
+
+   val = gen_cntl >> 15;
+   val &= 0x7;
+   if (val & 0x4) {
+   /* HPET is enabled in HPTC. Just not reported by BIOS */
+   val &= 0x3;
+   force_hpet_address = 0xFED0 | (val << 12);
+   printk(KERN_DEBUG "Force enabled HPET at base address 0x%lx\n",
+  force_hpet_address);
+   force_hpet_resume_type = OLD_ICH_FORCE_HPET_RESUME;
+   return;
+   }
+
+   printk(KERN_DEBUG "Failed to force enable HPET\n");
+}
+
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801EB_0,
+ old_ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801EB_12,
+ old_ich_force_enable_hpet);
+
+void force_hpet_resume(void)
+{
+   switch (force_hpet_resume_type) {
+   case ICH_FORCE_HPET_RESUME:
+   return ich_force_hpet_resume();
+
+   case OLD_ICH_FORCE_HPET_RESUME:
+   return old_ich_force_hpet_resume();
+
+   default:
+   break;
+   }
+}
+
 #endif
Index: linux-2.6.22-rc5/arch/i386/kernel/hpet.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/hpet.c   2007-06-17 
08:52:10.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/hpet.c2007-06-17 08:52:10.0 
+0200
@@ -195,7 +195,7 @@ static void hpet_start_counter(void)
 
 static void hpet_resume_device(void)
 {
-   ich_force_hpet_resume();
+   force_hpet_resume();
 }
 
 static void hpet_restart_counter(void)
Index: linux-2.6.22-rc5/include/asm-i386/hpet.h

[PATCH 6/7] ICH Force HPET: ICH5 fix a bug with suspend/resume

2007-06-22 Thread Venki Pallipadi


A bugfix in ich5 hpet force detect which caused resumes to fail.
Thanks to Udo A Steinberg for reporting the problem.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---
 arch/i386/kernel/quirks.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.22-rc5/arch/i386/kernel/quirks.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/quirks.c 2007-06-17 
08:52:10.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/quirks.c  2007-06-17 08:52:10.0 
+0200
@@ -199,7 +199,6 @@ static void old_ich_force_enable_hpet(st
force_hpet_address = 0xFED0 | (val << 12);
printk(KERN_DEBUG "HPET at base address 0x%lx\n",
   force_hpet_address);
-   cached_dev = dev;
return;
}
 
@@ -221,6 +220,7 @@ static void old_ich_force_enable_hpet(st
force_hpet_address = 0xFED0 | (val << 12);
printk(KERN_DEBUG "Force enabled HPET at base address 0x%lx\n",
   force_hpet_address);
+   cached_dev = dev;
force_hpet_resume_type = OLD_ICH_FORCE_HPET_RESUME;
return;
}


[PATCH 3/7] ICH Force HPET: ICH7 or later quirk to force detect enable

2007-06-22 Thread Venki Pallipadi

Force detect and/or enable HPET on ICH chipsets. This patch just handles the
detection part; the following patches use this information. It also adds a
function to repeat the force enabling at resume time.

Using HPET this way instead of the PIT increases the time CPUs can
reside in C-states when the system is totally idle. On my test system with
Core 2 Duo, average C-state residency goes up from ~20ms to ~80ms.
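
As an illustrative aside (not part of the patch), the HPET configuration bits
this quirk pokes carry an enable bit plus two address-select bits, so the
force-detected base can only be one of four 4 KiB-aligned windows. The
0xFED00000 base in this standalone sketch is the conventional ICH HPET window
and is an assumption of the example:

#include <stdio.h>

int main(void)
{
	unsigned int sel;

	/* Two select bits give four possible HPET base addresses. */
	for (sel = 0; sel < 4; sel++)
		printf("select %u -> HPET base 0x%08x\n",
		       sel, 0xFED00000u | (sel << 12));
	return 0;
}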

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---
 arch/i386/kernel/quirks.c |  101 ++
 include/asm-i386/hpet.h   |2 
 2 files changed, 103 insertions(+)

Index: linux-2.6.22-rc5/arch/i386/kernel/quirks.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/quirks.c 2007-06-17 
08:51:58.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/quirks.c  2007-06-17 08:52:10.0 
+0200
@@ -4,6 +4,8 @@
 #include <linux/pci.h>
 #include <linux/irq.h>
 
+#include <asm/hpet.h>
+
 #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
 
 static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
@@ -48,3 +50,102 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_IN
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL,   PCI_DEVICE_ID_INTEL_E7525_MCH,  
quirk_intel_irqbalance);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL,   PCI_DEVICE_ID_INTEL_E7520_MCH,  
quirk_intel_irqbalance);
 #endif
+
+#if defined(CONFIG_HPET_TIMER)
+unsigned long force_hpet_address;
+
+static void __iomem *rcba_base;
+
+void ich_force_hpet_resume(void)
+{
+   u32 val;
+
+   if (!force_hpet_address)
+   return;
+
+   if (rcba_base == NULL)
+   BUG();
+
+   /* read the Function Disable register, dword mode only */
+   val = readl(rcba_base + 0x3404);
+   if (!(val & 0x80)) {
+   /* HPET disabled in HPTC. Trying to enable */
+   writel(val | 0x80, rcba_base + 0x3404);
+   }
+
+   val = readl(rcba_base + 0x3404);
+   if (!(val & 0x80))
+   BUG();
+   else
+   printk(KERN_DEBUG "Force enabled HPET at resume\n");
+
+   return;
+}
+
+static void ich_force_enable_hpet(struct pci_dev *dev)
+{
+   u32 val, rcba;
+   int err = 0;
+
+   if (hpet_address || force_hpet_address)
+   return;
+
+   pci_read_config_dword(dev, 0xF0, &rcba);
+   rcba &= 0xC000;
+   if (rcba == 0) {
+   printk(KERN_DEBUG "RCBA disabled. Cannot force enable HPET\n");
+   return;
+   }
+
+   /* use bits 31:14, 16 kB aligned */
+   rcba_base = ioremap_nocache(rcba, 0x4000);
+   if (rcba_base == NULL) {
+   printk(KERN_DEBUG "ioremap failed. Cannot force enable HPET\n");
+   return;
+   }
+
+   /* read the Function Disable register, dword mode only */
+   val = readl(rcba_base + 0x3404);
+
+   if (val & 0x80) {
+   /* HPET is enabled in HPTC. Just not reported by BIOS */
+   val = val & 0x3;
+   force_hpet_address = 0xFED0 | (val << 12);
+   printk(KERN_DEBUG "Force enabled HPET at base address 0x%lx\n",
+  force_hpet_address);
+   iounmap(rcba_base);
+   return;
+   }
+
+   /* HPET disabled in HPTC. Trying to enable */
+   writel(val | 0x80, rcba_base + 0x3404);
+
+   val = readl(rcba_base + 0x3404);
+   if (!(val & 0x80)) {
+   err = 1;
+   } else {
+   val = val & 0x3;
+   force_hpet_address = 0xFED0 | (val << 12);
+   }
+
+   if (err) {
+   force_hpet_address = 0;
+   iounmap(rcba_base);
+   printk(KERN_DEBUG "Failed to force enable HPET\n");
+   } else {
+   printk(KERN_DEBUG "Force enabled HPET at base address 0x%lx\n",
+  force_hpet_address);
+   }
+}
+
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ESB2_0,
+ ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1,
+ ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1,
+ ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31,
+ ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_1,
+ ich_force_enable_hpet);
+#endif
Index: linux-2.6.22-rc5/include/asm-i386/hpet.h
===
--- linux-2.6.22-rc5.orig/include/asm-i386/hpet.h   2007-06-17 
08:52:09.0 +0200
+++ linux-2.6.22-rc5/include/asm-i386/hpet.h2007-06-17 08:52:10.0 
+0200
@@ -72,6 +72,8 @@ extern int hpet_enable(void);
 #include <asm/vsyscall.h>
 #endif
 
+void ich_force_hpet_resume(void);
+
 #ifdef CONFIG_HPET_EMULATE_RTC
 

[PATCH 2/7] ICH Force HPET: Restructure hpet generic clock code

2007-06-22 Thread Venki Pallipadi

Restructure and rename legacy replacement mode HPET timer support.
Just the code structural changes and should be zero functionality change.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---

Index: linux-2.6.22-rc5/arch/i386/kernel/hpet.c
===
--- linux-2.6.22-rc5.orig/arch/i386/kernel/hpet.c   2007-06-17 
08:52:09.0 +0200
+++ linux-2.6.22-rc5/arch/i386/kernel/hpet.c2007-06-17 08:52:10.0 
+0200
@@ -158,9 +158,9 @@ static void hpet_reserve_platform_timers
  */
 static unsigned long hpet_period;
 
-static void hpet_set_mode(enum clock_event_mode mode,
+static void hpet_legacy_set_mode(enum clock_event_mode mode,
  struct clock_event_device *evt);
-static int hpet_next_event(unsigned long delta,
+static int hpet_legacy_next_event(unsigned long delta,
   struct clock_event_device *evt);
 
 /*
@@ -169,8 +169,8 @@ static int hpet_next_event(unsigned long
 static struct clock_event_device hpet_clockevent = {
.name   = "hpet",
.features   = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
-   .set_mode   = hpet_set_mode,
-   .set_next_event = hpet_next_event,
+   .set_mode   = hpet_legacy_set_mode,
+   .set_next_event = hpet_legacy_next_event,
.shift  = 32,
.irq= 0,
 };
@@ -187,7 +187,7 @@ static void hpet_start_counter(void)
hpet_writel(cfg, HPET_CFG);
 }
 
-static void hpet_enable_int(void)
+static void hpet_enable_legacy_int(void)
 {
unsigned long cfg = hpet_readl(HPET_CFG);
 
@@ -196,7 +196,39 @@ static void hpet_enable_int(void)
hpet_legacy_int_enabled = 1;
 }
 
-static void hpet_set_mode(enum clock_event_mode mode,
+static void hpet_legacy_clockevent_register(void)
+{
+   uint64_t hpet_freq;
+
+   /* Start HPET legacy interrupts */
+   hpet_enable_legacy_int();
+
+   /*
+* The period is a femto seconds value. We need to calculate the
+* scaled math multiplication factor for nanosecond to hpet tick
+* conversion.
+*/
+   hpet_freq = 1000ULL;
+   do_div(hpet_freq, hpet_period);
+   hpet_clockevent.mult = div_sc((unsigned long) hpet_freq,
+ NSEC_PER_SEC, 32);
+   /* Calculate the min / max delta */
+   hpet_clockevent.max_delta_ns = clockevent_delta2ns(0x7FFF,
+  &hpet_clockevent);
+   hpet_clockevent.min_delta_ns = clockevent_delta2ns(0x30,
+  &hpet_clockevent);
+
+   /*
+* Start hpet with the boot cpu mask and make it
+* global after the IO_APIC has been initialized.
+*/
+   hpet_clockevent.cpumask = cpumask_of_cpu(smp_processor_id());
+   clockevents_register_device(&hpet_clockevent);
+   global_clock_event = &hpet_clockevent;
+   printk(KERN_DEBUG "hpet clockevent registered\n");
+}
+
+static void hpet_legacy_set_mode(enum clock_event_mode mode,
  struct clock_event_device *evt)
 {
unsigned long cfg, cmp, now;
@@ -237,12 +269,12 @@ static void hpet_set_mode(enum clock_eve
break;
 
case CLOCK_EVT_MODE_RESUME:
-   hpet_enable_int();
+   hpet_enable_legacy_int();
break;
}
 }
 
-static int hpet_next_event(unsigned long delta,
+static int hpet_legacy_next_event(unsigned long delta,
   struct clock_event_device *evt)
 {
unsigned long cnt;
@@ -282,58 +314,11 @@ static struct clocksource clocksource_hp
 #endif
 };
 
-/*
- * Try to setup the HPET timer
- */
-int __init hpet_enable(void)
+static int hpet_clocksource_register(void)
 {
-   unsigned long id;
-   uint64_t hpet_freq;
u64 tmp, start, now;
cycle_t t1;
 
-   if (!is_hpet_capable())
-   return 0;
-
-   hpet_set_mapping();
-
-   /*
-* Read the period and check for a sane value:
-*/
-   hpet_period = hpet_readl(HPET_PERIOD);
-   if (hpet_period < HPET_MIN_PERIOD || hpet_period > HPET_MAX_PERIOD)
-   goto out_nohpet;
-
-   /*
-* The period is a femto seconds value. We need to calculate the
-* scaled math multiplication factor for nanosecond to hpet tick
-* conversion.
-*/
-   hpet_freq = 1000ULL;
-   do_div(hpet_freq, hpet_period);
-   hpet_clockevent.mult = div_sc((unsigned long) hpet_freq,
- NSEC_PER_SEC, 32);
-   /* Calculate the min / max delta */
-   hpet_clockevent.max_delta_ns = clockevent_delta2ns(0x7FFF,
-  &hpet_clockevent);
-   hpet_clockevent.min_delta_ns = clockevent_delta2ns(0x30,
-  _clockevent);
-
-  

[PATCH 1/7] ICH Force HPET: Make generic time capable of switching broadcast timer

2007-06-22 Thread Venki Pallipadi


Auto-detect the presence of HPET on ICH5 or newer platforms and enable
HPET as the broadcast timer. This gives a bigger upper limit for the tickless
tick interval and improves power consumption compared to using the PIT as the
broadcast timer.

This patch:

Change the broadcast timer if a timer with a higher rating becomes available.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

---

Applies over linux-2.6.22-rc4-mm2 +
tglx's  patch-2.6.22-rc4-mm2-hrt4 patch

The patchset has been baking for a while along with patch-2.6.22-rc*-hrt*
without breaking anything, and it reduces the number of timer interrupts
with tickless on various platforms.

 kernel/time/tick-broadcast.c |   13 ++---
 kernel/time/tick-common.c|4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)

Index: linux-2.6.22-rc5/kernel/time/tick-common.c
===
--- linux-2.6.22-rc5.orig/kernel/time/tick-common.c 2007-06-17 
08:52:07.0 +0200
+++ linux-2.6.22-rc5/kernel/time/tick-common.c  2007-06-17 08:52:10.0 
+0200
@@ -200,7 +200,7 @@ static int tick_check_new_device(struct 
 
cpu = smp_processor_id();
if (!cpu_isset(cpu, newdev->cpumask))
-   goto out;
+   goto out_bc;
 
td = &per_cpu(tick_cpu_device, cpu);
curdev = td->evtdev;
@@ -265,7 +265,7 @@ out_bc:
 */
if (tick_check_broadcast_device(newdev))
ret = NOTIFY_STOP;
-out:
+
spin_unlock_irqrestore(&tick_device_lock, flags);
 
return ret;
Index: linux-2.6.22-rc5/kernel/time/tick-broadcast.c
===
--- linux-2.6.22-rc5.orig/kernel/time/tick-broadcast.c  2007-06-17 
08:52:07.0 +0200
+++ linux-2.6.22-rc5/kernel/time/tick-broadcast.c   2007-06-17 
08:52:10.0 +0200
@@ -64,8 +64,9 @@ static void tick_broadcast_start_periodi
  */
 int tick_check_broadcast_device(struct clock_event_device *dev)
 {
-   if (tick_broadcast_device.evtdev ||
-   (dev->features & CLOCK_EVT_FEAT_C3STOP))
+   if ((tick_broadcast_device.evtdev &&
+tick_broadcast_device.evtdev->rating >= dev->rating) ||
+(dev->features & CLOCK_EVT_FEAT_C3STOP))
return 0;
 
clockevents_exchange_device(NULL, dev);
@@ -519,11 +520,9 @@ static void tick_broadcast_clear_oneshot
  */
 void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
-   if (bc->mode != CLOCK_EVT_MODE_ONESHOT) {
-   bc->event_handler = tick_handle_oneshot_broadcast;
-   clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
-   bc->next_event.tv64 = KTIME_MAX;
-   }
+   bc->event_handler = tick_handle_oneshot_broadcast;
+   clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
+   bc->next_event.tv64 = KTIME_MAX;
 }
 
 /*


[PATCH 8/8] cpuidle: first round of documentation updates

2007-06-06 Thread Venki Pallipadi


Documentation changes based on Pavel's feedback.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/Documentation/cpuidle/sysfs.txt
===
--- linux-2.6.22-rc-mm.orig/Documentation/cpuidle/sysfs.txt 2007-06-06 
11:33:25.0 -0700
+++ linux-2.6.22-rc-mm/Documentation/cpuidle/sysfs.txt  2007-06-06 
11:35:37.0 -0700
@@ -4,14 +4,22 @@
 
cpuidle sysfs
 
-System global cpuidle information are under
+System global cpuidle related information and tunables are under
 /sys/devices/system/cpu/cpuidle
 
 The current interfaces in this directory has self-explanatory names:
+* current_driver_ro
+* current_governor_ro
+
+With cpuidle_sysfs_switch boot option (meant for developer testing)
+following objects are visible instead.
 * available_drivers
 * available_governors
 * current_driver
 * current_governor
+In this case user can switch the driver, governor at run time by writing
+onto current_driver and current_governor.
+
 
 Per logical CPU specific cpuidle information are under
 /sys/devices/system/cpu/cpuX/cpuidle
@@ -19,9 +27,9 @@
 
 Under this percpu directory, there is a directory for each idle state supported
 by the driver, which in turn has
-* latency
-* power
-* time
-* usage
+* latency : Latency to exit out of this idle state (in microseconds)
+* power : Power consumed while in this idle state (in milliwatts)
+* time : Total time spent in this idle state (in microseconds)
+* usage : Number of times this state was entered (count)
 
 
Index: linux-2.6.22-rc-mm/Documentation/cpuidle/governor.txt
===
--- linux-2.6.22-rc-mm.orig/Documentation/cpuidle/governor.txt  2007-06-06 
11:33:25.0 -0700
+++ linux-2.6.22-rc-mm/Documentation/cpuidle/governor.txt   2007-06-06 
11:33:34.0 -0700
@@ -11,12 +11,16 @@
 cpuidle governor is policy routine that decides what idle state to enter at
 any given time. cpuidle core uses different callbacks to governor while
 handling idle entry.
-* select_state callback where governor can determine next idle state to enter
-* prepare_idle callback is called before entering an idle state
-* scan callback is called after a driver forces redetection of the states
+* select_state() callback where governor can determine next idle state to enter
+* prepare_idle() callback is called before entering an idle state
+* scan() callback is called after a driver forces redetection of the states
 
 More than one governor can be registered at the same time and
-user can switch between drivers using /sysfs interface.
+user can switch between drivers using /sysfs interface (when supported).
+
+More than one governor part is supported for developers to easily experiment
+with different governors. By default, most optimal governor based on your
+kernel configuration and platform will be selected by cpuidle.
 
 Interfaces:
 int cpuidle_register_governor(struct cpuidle_governor *gov);
Index: linux-2.6.22-rc-mm/Documentation/cpuidle/core.txt
===
--- linux-2.6.22-rc-mm.orig/Documentation/cpuidle/core.txt  2007-06-06 
11:33:25.0 -0700
+++ linux-2.6.22-rc-mm/Documentation/cpuidle/core.txt   2007-06-06 
11:33:34.0 -0700
@@ -12,6 +12,6 @@
 standardized infrastructure to support independent development of
 governors and drivers.
 
-cpuidle resides under /drivers/cpuidle.
+cpuidle resides under drivers/cpuidle.
 
 
Index: linux-2.6.22-rc-mm/Documentation/cpuidle/driver.txt
===
--- linux-2.6.22-rc-mm.orig/Documentation/cpuidle/driver.txt2007-06-06 
11:33:25.0 -0700
+++ linux-2.6.22-rc-mm/Documentation/cpuidle/driver.txt 2007-06-06 
11:33:34.0 -0700
@@ -7,16 +7,21 @@
 
 
 
-cpuidle driver supports capability detection for a particular system. The
-init and exit routines will be called for each online CPU, with a percpu
-cpuidle_driver object and driver should fill in cpuidle_states inside
-cpuidle_driver depending on the CPU capability.
+cpuidle driver hooks into the cpuidle infrastructure and does the
+architecture/platform dependent part of CPU idle states. Driver
+provides the platform idle state detection capability and also
+has mechanisms in place to support actual entry-exit into a CPU idle state.
+
+cpuidle driver supports capability detection for a platform using the
+init and exit routines. They will be called for each online CPU, with a
+percpu cpuidle_driver object and driver should fill in cpuidle_states
+inside cpuidle_driver depending on the CPU capability.
 
 Driver can handle dynamic state changes (like battery<->AC), by calling
 force_redetect interface.
 
 It is possible to have more than one driver registered at the same time and
-user can switch between drivers using /sysfs interface.
+user can switch between 

[PATCH 7/8] cpuidle: add rating to the governors and pick the one with highest rating by default

2007-06-06 Thread Venki Pallipadi



Introduce a governor rating scheme to pick the right governor by default.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/include/linux/cpuidle.h
===
--- linux-2.6.22-rc-mm.orig/include/linux/cpuidle.h 2007-06-05 
17:00:09.0 -0700
+++ linux-2.6.22-rc-mm/include/linux/cpuidle.h  2007-06-05 17:01:08.0 
-0700
@@ -159,6 +159,7 @@
 struct cpuidle_governor {
char name[CPUIDLE_NAME_LEN];
struct list_head governor_list;
+   unsigned int rating;
 
int  (*init)(struct cpuidle_device *dev);
void (*exit)(struct cpuidle_device *dev);
Index: linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/governors/menu.c2007-06-05 
15:46:34.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c 2007-06-05 
17:04:32.0 -0700
@@ -153,6 +153,7 @@
 
 struct cpuidle_governor menu_governor = {
.name = "menu",
+   .rating =   20,
.scan = menu_scan_device,
.select =   menu_select,
.reflect =  menu_reflect,
Index: linux-2.6.22-rc-mm/drivers/cpuidle/governor.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/governor.c  2007-06-01 
16:25:49.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/governor.c   2007-06-05 
17:15:05.0 -0700
@@ -131,7 +131,8 @@
if (__cpuidle_find_governor(gov->name) == NULL) {
ret = 0;
list_add_tail(&gov->governor_list, &cpuidle_governors);
-   if (!cpuidle_curr_governor)
+   if (!cpuidle_curr_governor ||
+   cpuidle_curr_governor->rating < gov->rating)
cpuidle_switch_governor(gov);
}
mutex_unlock(&cpuidle_lock);
@@ -142,6 +143,29 @@
 EXPORT_SYMBOL_GPL(cpuidle_register_governor);
 
 /**
+ * cpuidle_replace_governor - find a replacement governor
+ * @exclude_rating: the rating that will be skipped while looking for
+ * new governor.
+ */
+struct cpuidle_governor *cpuidle_replace_governor(int exclude_rating)
+{
+   struct cpuidle_governor *gov;
+   struct cpuidle_governor *ret_gov = NULL;
+   unsigned int max_rating = 0;
+
list_for_each_entry(gov, &cpuidle_governors, governor_list) {
+   if (gov->rating == exclude_rating)
+   continue;
+   if (gov->rating > max_rating) {
+   max_rating = gov->rating;
+   ret_gov = gov;
+   }
+   }
+
+   return ret_gov;
+}
+
+/**
  * cpuidle_unregister_governor - unregisters a governor
  * @gov: the governor
  */
@@ -151,8 +175,11 @@
return;
 
mutex_lock(&cpuidle_lock);
-   if (gov == cpuidle_curr_governor)
-   cpuidle_switch_governor(NULL);
+   if (gov == cpuidle_curr_governor) {
+   struct cpuidle_governor *new_gov;
+   new_gov = cpuidle_replace_governor(gov->rating);
+   cpuidle_switch_governor(new_gov);
+   }
list_del(&gov->governor_list);
mutex_unlock(&cpuidle_lock);
 }
Index: linux-2.6.22-rc-mm/drivers/cpuidle/governors/ladder.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/governors/ladder.c  2007-06-01 
16:25:49.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/governors/ladder.c   2007-06-05 
17:03:37.0 -0700
@@ -199,6 +199,7 @@
 
 static struct cpuidle_governor ladder_governor = {
.name = "ladder",
+   .rating =   10,
.init = ladder_init_device,
.exit = ladder_exit_device,
.scan = ladder_scan_device,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 6/8] cpuidle: make cpuidle sysfs driver/governor switch off by default

2007-06-06 Thread Venki Pallipadi


Make the default cpuidle sysfs show current_governor and current_driver in
read-only mode. The more elaborate available_governors and available_drivers,
together with the writeable current_governor and current_driver interfaces,
only appear with the "cpuidle_sysfs_switch" boot parameter.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/cpuidle/cpuidle.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/cpuidle.c   2007-06-05 
17:52:32.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/cpuidle.c2007-06-06 
10:57:41.0 -0700
@@ -25,7 +25,6 @@
 LIST_HEAD(cpuidle_detected_devices);
 static void (*pm_idle_old)(void);
 
-
 /**
  * cpuidle_idle_call - the main idle loop
  *
Index: linux-2.6.22-rc-mm/drivers/cpuidle/sysfs.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/sysfs.c 2007-06-05 
17:52:56.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/sysfs.c  2007-06-06 11:29:50.0 
-0700
@@ -13,6 +13,14 @@
 
 #include "cpuidle.h"
 
+static unsigned int sysfs_switch;
+static int __init cpuidle_sysfs_setup(char *unused)
+{
+   sysfs_switch = 1;
+   return 1;
+}
+__setup("cpuidle_sysfs_switch", cpuidle_sysfs_setup);
+
 static ssize_t show_available_drivers(struct sys_device *dev, char *buf)
 {
ssize_t i = 0;
@@ -127,6 +135,15 @@
return count;
 }
 
+static SYSDEV_ATTR(current_driver_ro, 0444, show_current_driver, NULL);
+static SYSDEV_ATTR(current_governor_ro, 0444, show_current_governor, NULL);
+
+static struct attribute *cpuclass_default_attrs[] = {
+   &attr_current_driver_ro.attr,
+   &attr_current_governor_ro.attr,
+   NULL
+};
+
 static SYSDEV_ATTR(available_drivers, 0444, show_available_drivers, NULL);
 static SYSDEV_ATTR(available_governors, 0444, show_available_governors, NULL);
 static SYSDEV_ATTR(current_driver, 0644, show_current_driver,
@@ -134,7 +151,7 @@
 static SYSDEV_ATTR(current_governor, 0644, show_current_governor,
store_current_governor);
 
-static struct attribute *cpuclass_default_attrs[] = {
+static struct attribute *cpuclass_switch_attrs[] = {
&attr_available_drivers.attr,
&attr_available_governors.attr,
&attr_current_driver.attr,
@@ -152,6 +169,9 @@
  */
 int cpuidle_add_class_sysfs(struct sysdev_class *cls)
 {
+   if (sysfs_switch)
+   cpuclass_attr_group.attrs = cpuclass_switch_attrs;
+
return sysfs_create_group(&cls->kset.kobj, &cpuclass_attr_group);
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/8] cpuidle: menu governor change the early break condition

2007-06-06 Thread Venki Pallipadi


Change the C-state early break out algorithm in menu governor.

We only look at early breakouts that result in wakeups shorter than the idle
state's target_residency. If such breakouts become frequent enough, the
particular idle state is eliminated for a timeout period.
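
For reference, a condensed sketch of the bookkeeping described above, with
simplified, hypothetical names (the menu.c diff below is the authoritative
version; the timeout handling is elided here):

#define DEMOTION_THRESHOLD	5

static int deepest_break_state = -1;	/* -1: no state is being avoided */
static int break_cnt;

/* called on every wakeup with the state we chose and what actually happened */
static void note_wakeup(int chosen_state, unsigned int target_residency_us,
			unsigned int measured_us)
{
	if (measured_us < target_residency_us) {
		/* early break: enough of these and the state gets avoided */
		if (break_cnt > DEMOTION_THRESHOLD)
			deepest_break_state = chosen_state - 1;
		else
			break_cnt++;
	} else if (break_cnt > 0) {
		break_cnt--;
	}
}

The select path then clamps its deepest candidate to deepest_break_state + 1
until the demotion times out.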

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/governors/menu.c2007-06-05 
09:39:27.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c 2007-06-05 
15:46:34.0 -0700
@@ -14,19 +14,20 @@
 #include <linux/hrtimer.h>
 #include <linux/tick.h>
 
-#define BM_HOLDOFF 20000   /* 20 ms */
+#define BM_HOLDOFF 20000   /* 20 ms */
+#define DEMOTION_THRESHOLD 5
+#define DEMOTION_TIMEOUT_MULTIPLIER 1000
 
 struct menu_device {
int last_state_idx;
-   int deepest_bm_state;
 
-   int break_last_us;
-   int break_elapsed_us;
+   int deepest_break_state;
+   struct timespec break_expire_time_ts;
+   int break_last_cnt;
 
+   int deepest_bm_state;
int bm_elapsed_us;
int bm_holdoff_us;
-
-   unsigned long   idle_jiffies;
 };
 
 static DEFINE_PER_CPU(struct menu_device, menu_devices);
@@ -45,7 +46,6 @@
 
/* determine the expected residency time */
expected_us = (s32) ktime_to_ns(tick_nohz_get_sleep_length()) / 1000;
-   expected_us = min(expected_us, data->break_last_us);
 
/* determine the maximum state compatible with current BM status */
if (cpuidle_get_bm_activity())
@@ -53,17 +53,33 @@
if (data->bm_elapsed_us <= data->bm_holdoff_us)
max_state = data->deepest_bm_state + 1;
 
+   /* determine the maximum state compatible with recent idle breaks */
+   if (data->deepest_break_state >= 0) {
+   struct timespec now;
+   ktime_get_ts(&now);
+   if (timespec_compare(&data->break_expire_time_ts, &now) > 0) {
+   max_state = min(max_state,
+   data->deepest_break_state + 1);
+   } else {
+   data->deepest_break_state = -1;
+   }
+   }
+   
/* find the deepest idle state that satisfies our constraints */
for (i = 1; i < max_state; i++) {
struct cpuidle_state *s = &dev->states[i];
+
if (s->target_residency > expected_us)
break;
+
if (s->exit_latency > system_latency_constraint())
break;
}
 
+   if (data->last_state_idx != i - 1)
+   data->break_last_cnt = 0;
+
data->last_state_idx = i - 1;
-   data->idle_jiffies = tick_nohz_get_idle_jiffies();
return i - 1;
 }
 
@@ -91,14 +107,27 @@
measured_us = USEC_PER_SEC / HZ;
 
data->bm_elapsed_us += measured_us;
-   data->break_elapsed_us += measured_us;
+
+   if (data->last_state_idx == 0)
+   return;
 
/*
-* Did something other than the timer interrupt cause the break event?
+* Did something other than the timer interrupt
+* cause an early break event?
 */
-   if (tick_nohz_get_idle_jiffies() == data->idle_jiffies) {
-   data->break_last_us = data->break_elapsed_us;
-   data->break_elapsed_us = 0;
+   if (unlikely(measured_us < target->target_residency)) {
+   if (data->break_last_cnt > DEMOTION_THRESHOLD) {
+   data->deepest_break_state = data->last_state_idx - 1;
+   ktime_get_ts(&data->break_expire_time_ts);
+   timespec_add_ns(&data->break_expire_time_ts,
+   target->target_residency *
+   DEMOTION_TIMEOUT_MULTIPLIER);
+   } else {
+   data->break_last_cnt++;
+   }
+   } else {
+   if (data->break_last_cnt > 0)
+   data->break_last_cnt--;
}
 }
 
@@ -112,10 +141,9 @@
int i;
 
data->last_state_idx = 0;
-   data->break_last_us = 0;
-   data->break_elapsed_us = 0;
data->bm_elapsed_us = 0;
data->bm_holdoff_us = BM_HOLDOFF;
+   data->deepest_break_state = -1;
 
for (i = 1; i < dev->state_count; i++)
if (dev->states[i].flags & CPUIDLE_FLAG_CHECK_BM)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/8] cpuidle: fix the uninitialized variable in sysfs routine

2007-06-06 Thread Venki Pallipadi


Fix the uninitialized usage of ret.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/cpuidle/sysfs.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/sysfs.c 2007-06-04 
15:44:17.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/sysfs.c  2007-06-04 15:46:49.0 
-0700
@@ -301,7 +301,7 @@
  */
 int cpuidle_add_driver_sysfs(struct cpuidle_device *device)
 {
-   int i, ret;
+   int i, ret = -ENOMEM;
struct cpuidle_state_kobj *kobj;
 
/* state statistics */
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/8] cpuidle: reenable /proc/acpi/ power interface for the time being

2007-06-06 Thread Venki Pallipadi


Keep /proc/acpi/processor/CPU*/power around for a while as powertop depends
on it. It will be marked deprecated and removed in future. powertop can use
cpuidle interfaces instead.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/acpi/processor_idle.c
===
--- linux-2.6.22-rc-mm.orig/drivers/acpi/processor_idle.c   2007-06-01 
16:17:40.0 -0700
+++ linux-2.6.22-rc-mm/drivers/acpi/processor_idle.c2007-06-01 
17:20:57.0 -0700
@@ -792,7 +792,7 @@
  * @t1: the start time
  * @t2: the end time
  */
-static inline u32 ticks_elapsed(u32 t1, u32 t2)
+static inline u32 ticks_elapsed_in_us(u32 t1, u32 t2)
 {
if (t2 >= t1)
return PM_TIMER_TICKS_TO_US(t2 - t1);
@@ -802,6 +802,16 @@
return PM_TIMER_TICKS_TO_US((0xFFFFFFFF - t1) + t2);
 }
 
+static inline u32 ticks_elapsed(u32 t1, u32 t2)
+{
+   if (t2 >= t1)
+   return (t2 - t1);
+   else if (!(acpi_gbl_FADT.flags & ACPI_FADT_32BIT_TIMER))
+   return (((0x00FFFFFF - t1) + t2) & 0x00FFFFFF);
+   else
+   return ((0xFFFFFFFF - t1) + t2);
+}
+
 /**
  * acpi_idle_update_bm_rld - updates the BM_RLD bit depending on target state
  * @pr: the processor
@@ -925,7 +935,8 @@
cx->usage++;
 
acpi_state_timer_broadcast(pr, cx, 0);
-   return ticks_elapsed(t1, t2);
+   cx->time += ticks_elapsed(t1, t2);
+   return ticks_elapsed_in_us(t1, t2);
 }
 
 static int c3_cpu_count;
@@ -1009,7 +1020,8 @@
cx->usage++;
 
acpi_state_timer_broadcast(pr, cx, 0);
-   return ticks_elapsed(t1, t2);
+   cx->time += ticks_elapsed(t1, t2);
+   return ticks_elapsed_in_us(t1, t2);
 }
 
 /**
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/8] cpuidle: menu governor and hrtimer compile fix

2007-06-06 Thread Venki Pallipadi


Compile fix for menu governor.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c
===
--- linux-2.6.22-rc-mm.orig/drivers/cpuidle/governors/menu.c2007-06-01 
16:25:49.0 -0700
+++ linux-2.6.22-rc-mm/drivers/cpuidle/governors/menu.c 2007-06-05 
17:52:33.0 -0700
@@ -11,8 +11,8 @@
 #include <linux/latency.h>
 #include <linux/time.h>
 #include <linux/ktime.h>
-#include <linux/tick.h>
 #include <linux/hrtimer.h>
+#include <linux/tick.h>
 
 #define BM_HOLDOFF 20000   /* 20 ms */
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/8] cpuidle: acpi_set_cstate_limit compile fix

2007-06-06 Thread Venki Pallipadi

Len,

Following are a bunch of small changes to cpuidle trying to prepare it
for mainline. Some of the changes are just compile time errors/warnings
and you probably already have them in acpi-test.

Should apply cleanly to latest acpi-test. Please include in acpi-test.

Thanks,
Venki

This patch:


cpuidle compile fix related to acpi_set_cstate_limit().

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/acpi/osl.c
===
--- linux-2.6.22-rc-mm.orig/drivers/acpi/osl.c  2007-06-01 16:17:40.0 
-0700
+++ linux-2.6.22-rc-mm/drivers/acpi/osl.c   2007-06-01 16:21:43.0 
-0700
@@ -1030,6 +1030,7 @@
if (acpi_do_set_cstate_limit)
acpi_do_set_cstate_limit();
 }
+EXPORT_SYMBOL(acpi_set_cstate_limit);
 
 /*
  * Acquire a spinlock.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

2007-06-01 Thread Venki Pallipadi
On Fri, Jun 01, 2007 at 02:41:57PM -0700, Jesse Barnes wrote:
> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> > And normally the MTRRs win, don't they (if I remember the table correctly)
> > So if the MTRR says UC and PAT disagrees it might not actually help
> 
> I just checked, yes the MTRRs win for UC types.  But it sounds like the cases 
> we're talking about are actually situations where there's no MTRR coverage, 
> so the default type is used.  The manual doesn't specifically call out how 
> memory using the default type interacts with PAT, but it may well be that it 
> stays uncached if the default type is uncached.  Again that argues for fixing 
> the MTRR mapping problem in some way.
> 

I feel having a silent/transparent workaround is not a good idea. With that,
chances are the BIOS bug will go unnoticed (an error message in dmesg may not
get noticed either). Probably we should just panic at boot with a
detailed message about the e820/MTRR discrepancy (which can be logged as
a BUG to the BIOS provider) and suggest a temporary workaround of "mem=___".
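
For illustration only (not a submitted patch), the kind of check I have in
mind would be along these lines; mtrr_covers_wb() and the call site are
hypothetical:

/*
 * If part of RAM is left uncached because of a BIOS MTRR bug, fail loudly
 * with an actionable message instead of booting slowly and silently.
 */
static void __init check_mtrr_e820_consistency(unsigned long long ram_top)
{
	if (!mtrr_covers_wb(0, ram_top))
		panic("BIOS bug: MTRRs do not cover RAM up to %#llx.\n"
		      "Please report this to your BIOS vendor; as a temporary\n"
		      "workaround boot with a suitable mem= limit.\n", ram_top);
}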

Thanks,
Venki
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Add a flag to indicate deferrable timers in /proc/timer_stats

2007-05-30 Thread Venki Pallipadi

Add a flag in /proc/timer_stats to indicate deferrable timers. This will let
developers/users differentiate between types of timers in /proc/timer_stats.

Deferrable timer and normal timer will appear in /proc/timer_stats as below.
  10D, 1 swapper  queue_delayed_work_on (delayed_work_timer_fn)
   10, 1 swapper  queue_delayed_work_on (delayed_work_timer_fn)

Also version of timer_stats changes from v0.1 to v0.2
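
For context, a hypothetical driver snippet (not part of this patch) whose
timer would show up with the "D" flag above; init_timer_deferrable() and
round_jiffies() are the existing helpers:

static struct timer_list poll_timer;

static void poll_timeout(unsigned long data)
{
	/* low-priority housekeeping; fine if it slips while the CPU idles */
	mod_timer(&poll_timer, round_jiffies(jiffies + 10 * HZ));
}

static void poll_start(void)
{
	init_timer_deferrable(&poll_timer);
	poll_timer.function = poll_timeout;
	poll_timer.data = 0;
	mod_timer(&poll_timer, round_jiffies(jiffies + 10 * HZ));
}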

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/include/linux/hrtimer.h
===
--- linux-2.6.22-rc-mm.orig/include/linux/hrtimer.h 2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/include/linux/hrtimer.h  2007-05-30 15:02:49.0 
-0700
@@ -329,12 +329,13 @@
 #ifdef CONFIG_TIMER_STATS
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-void *timerf, char * comm);
+void *timerf, char * comm,
+unsigned int timer_flag);
 
 static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
 {
timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
-timer->function, timer->start_comm);
+timer->function, timer->start_comm, 0);
 }
 
 extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
Index: linux-2.6.22-rc-mm/kernel/timer.c
===
--- linux-2.6.22-rc-mm.orig/kernel/timer.c  2007-05-24 17:04:10.0 
-0700
+++ linux-2.6.22-rc-mm/kernel/timer.c   2007-05-30 15:19:15.0 -0700
@@ -305,6 +305,20 @@
memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
timer->start_pid = current->pid;
 }
+
+static void timer_stats_account_timer(struct timer_list *timer)
+{
+   unsigned int flag = 0;
+
+   if (unlikely(tbase_get_deferrable(timer->base)))
+   flag |= TIMER_STATS_FLAG_DEFERRABLE;
+
+   timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+timer->function, timer->start_comm, flag);
+}
+
+#else
+static void timer_stats_account_timer(struct timer_list *timer) {}
 #endif
 
 /**
Index: linux-2.6.22-rc-mm/kernel/time/timer_stats.c
===
--- linux-2.6.22-rc-mm.orig/kernel/time/timer_stats.c   2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/kernel/time/timer_stats.c2007-05-30 
15:36:55.0 -0700
@@ -68,6 +68,7 @@
 * Number of timeout events:
 */
unsigned long   count;
+   unsigned int timer_flag;
 
/*
 * We save the command-line string to preserve
@@ -227,7 +228,8 @@
  * incremented. Otherwise the timer is registered in a free slot.
  */
 void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
- void *timerf, char * comm)
+ void *timerf, char * comm,
+ unsigned int timer_flag)
 {
/*
 * It doesnt matter which lock we take:
@@ -240,6 +242,7 @@
input.start_func = startf;
input.expire_func = timerf;
input.pid = pid;
+   input.timer_flag = timer_flag;
 
spin_lock_irqsave(lock, flags);
if (!active)
@@ -286,7 +289,7 @@
period = ktime_to_timespec(time);
ms = period.tv_nsec / 1000000;
 
-   seq_puts(m, "Timer Stats Version: v0.1\n");
+   seq_puts(m, "Timer Stats Version: v0.2\n");
seq_printf(m, "Sample period: %ld.%03ld s\n", period.tv_sec, ms);
if (atomic_read(&overflow_count))
seq_printf(m, "Overflow: %d entries\n",
   atomic_read(&overflow_count));
 
for (i = 0; i < nr_entries; i++) {
entry = entries + i;
-   seq_printf(m, "%4lu, %5d %-16s ",
+   if (entry->timer_flag & TIMER_STATS_FLAG_DEFERRABLE) {
+   seq_printf(m, "%4luD, %5d %-16s ",
entry->count, entry->pid, entry->comm);
+   } else {
+   seq_printf(m, " %4lu, %5d %-16s ",
+   entry->count, entry->pid, entry->comm);
+   }
 
print_name_offset(m, (unsigned long)entry->start_func);
seq_puts(m, " (");
Index: linux-2.6.22-rc-mm/include/linux/timer.h
===
--- linux-2.6.22-rc-mm.orig/include/linux/timer.h   2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/include/linux/timer.h2007-05-30 15:19:23.0 
-0700
@@ -85,16 +85,13 @@
  */
 #ifdef CONFIG_TIMER_STATS
 
+#define TIMER_STATS_FLAG_DEFERRABLE0x1
+
 extern void init_timer_stats(void);
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void 

Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-30 Thread Venki Pallipadi
On Wed, May 30, 2007 at 01:30:39PM -0700, Stephen Hemminger wrote:
> On Wed, 30 May 2007 12:55:51 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
> 
> > From: Patrick McHardy <[EMAIL PROTECTED]>
> > Date: Wed, 30 May 2007 20:42:32 +0200
> > 
> > > Stephen Hemminger wrote:
> > > >>>Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
> > > >>>===
> > > >>>--- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c2007-05-24 
> > > >>>11:16:03.0 -0700
> > > >>>+++ linux-2.6.22-rc-mm/net/sched/sch_generic.c 2007-05-25 
> > > >>>15:10:02.0 -0700
> > > >>>@@ -224,7 +224,8 @@
> > > >>>   if (dev->tx_timeout) {
> > > >>>   if (dev->watchdog_timeo <= 0)
> > > >>>   dev->watchdog_timeo = 5*HZ;
> > > >>>-  if (!mod_timer(&dev->watchdog_timer, jiffies + 
> > > >>>dev->watchdog_timeo))
> > > >>>+  if (!mod_timer(&dev->watchdog_timer,
> > > >>>+ round_jiffies(jiffies + 
> > > >>>dev->watchdog_timeo)))
> > > >>>   dev_hold(dev);
> > > >>>   }
> > > >>> }
> > > >>
> > > >>Please cc netdev on net patches.
> > > >>
> > > >>Again, I worry that if people set the watchdog timeout to, say, 0.1 
> > > >>seconds
> > > >>then they will get one second, which is grossly different.
> > > >>
> > > >>And if they were to set it to 1.5 seconds, they'd get 2.0 which is 
> > > >>pretty
> > > >>significant, too.
> > > > 
> > > > 
> > > > Alternatively, we could change to a timer that is pushed forward after 
> > > > each
> > > > TX, maybe using hrtimer and hrtimer_forward().  That way the timer would
> > > > never run in normal case.
> > > 
> > > 
> > > It seems wasteful to add per-packet overhead for tx timeouts, which
> > > should be an exception. Do drivers really care about the exact
> > > timeout value? Compared to a packet transmission time its incredibly
> > > long anyways ..
> > 
> > I agree, this change is absolutely rediculious and is just a blind
> > cookie-cutter change made without consideration of what the code is
> > doing and what it's requirements are.
> > 
> 
> what about the obvious compromise:
> 
> --- a/net/sched/sch_generic.c 2007-05-30 11:42:18.0 -0700
> +++ b/net/sched/sch_generic.c 2007-05-30 13:29:34.0 -0700
> @@ -203,7 +203,11 @@ static void dev_watchdog(unsigned long a
>  dev->name);
>   dev->tx_timeout(dev);
>   }
> - if (!mod_timer(&dev->watchdog_timer, 
> round_jiffies(jiffies + dev->watchdog_timeo)))
> +
> + if (!mod_timer(&dev->watchdog_timer,
> +dev->watchdog_timeo > 2 * HZ
> +? round_jiffies(jiffies + 
> dev->watchdog_timeo)
> +: jiffies + dev->watchdog_timeo))
>   dev_hold(dev);
>   }
>   }
> 
> 

If this does not work:
Another option is to use a 'deferrable timer' here, which fires at the same
time as before when the CPU is busy, but on an idle CPU is delayed until the
CPU comes out of idle for some other event.
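
A minimal sketch of that option (not a submitted patch; dev_watchdog_init()
is a stand-in name for the real initialization site in
net/sched/sch_generic.c):

static void dev_watchdog_init(struct net_device *dev)
{
	/* deferrable: an idle CPU is not woken just to re-arm the watchdog */
	init_timer_deferrable(&dev->watchdog_timer);
	dev->watchdog_timer.function = dev_watchdog;
	dev->watchdog_timer.data = (unsigned long)dev;
}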

Thanks,
Venki
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-30 Thread Venki Pallipadi
On Wed, May 30, 2007 at 12:55:51PM -0700, David Miller wrote:
> From: Patrick McHardy <[EMAIL PROTECTED]>
> Date: Wed, 30 May 2007 20:42:32 +0200
> 
> > Stephen Hemminger wrote:
> > >>>Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
> > >>>===
> > >>>--- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c  2007-05-24 
> > >>>11:16:03.0 -0700
> > >>>+++ linux-2.6.22-rc-mm/net/sched/sch_generic.c   2007-05-25 
> > >>>15:10:02.0 -0700
> > >>>@@ -224,7 +224,8 @@
> > >>> if (dev->tx_timeout) {
> > >>> if (dev->watchdog_timeo <= 0)
> > >>> dev->watchdog_timeo = 5*HZ;
> > >>>-if (!mod_timer(&dev->watchdog_timer, jiffies + 
> > >>>dev->watchdog_timeo))
> > >>>+if (!mod_timer(&dev->watchdog_timer,
> > >>>+   round_jiffies(jiffies + 
> > >>>dev->watchdog_timeo)))
> > >>> dev_hold(dev);
> > >>> }
> > >>> }
> > >>
> > >>Please cc netdev on net patches.
> > >>
> > >>Again, I worry that if people set the watchdog timeout to, say, 0.1 
> > >>seconds
> > >>then they will get one second, which is grossly different.
> > >>
> > >>And if they were to set it to 1.5 seconds, they'd get 2.0 which is pretty
> > >>significant, too.
> > > 
> > > 
> > > Alternatively, we could change to a timer that is pushed forward after 
> > > each
> > > TX, maybe using hrtimer and hrtimer_forward().  That way the timer would
> > > never run in normal case.
> > 
> > 
> > It seems wasteful to add per-packet overhead for tx timeouts, which
> > should be an exception. Do drivers really care about the exact
> > timeout value? Compared to a packet transmission time its incredibly
> > long anyways ..
> 
> I agree, this change is absolutely rediculious and is just a blind
> cookie-cutter change made without consideration of what the code is
> doing and what it's requirements are.

I hope I could at least highlight the issue here despite the cookie-cutter
patch.
On a totally idle system I see something like 85 wakeups every 5 seconds,
which I am trying to reduce (to cut power consumption and increase battery
life), and 1 of those 85 happens to be the netdev watchdog timer.

Thanks,
Venki
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-30 Thread Venki Pallipadi
On Wed, May 30, 2007 at 08:42:32PM +0200, Patrick McHardy wrote:
> Stephen Hemminger wrote:
> >>>Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
> >>>===
> >>>--- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c2007-05-24 
> >>>11:16:03.0 -0700
> >>>+++ linux-2.6.22-rc-mm/net/sched/sch_generic.c 2007-05-25 
> >>>15:10:02.0 -0700
> >>>@@ -224,7 +224,8 @@
> >>>   if (dev->tx_timeout) {
> >>>   if (dev->watchdog_timeo <= 0)
> >>>   dev->watchdog_timeo = 5*HZ;
> >>>-  if (!mod_timer(&dev->watchdog_timer, jiffies + 
> >>>dev->watchdog_timeo))
> >>>+  if (!mod_timer(&dev->watchdog_timer,
> >>>+ round_jiffies(jiffies + dev->watchdog_timeo)))
> >>>   dev_hold(dev);
> >>>   }
> >>> }
> >>
> >>Please cc netdev on net patches.
> >>
> >>Again, I worry that if people set the watchdog timeout to, say, 0.1 seconds
> >>then they will get one second, which is grossly different.
> >>
> >>And if they were to set it to 1.5 seconds, they'd get 2.0 which is pretty
> >>significant, too.
> > 
> > 
> > Alternatively, we could change to a timer that is pushed forward after each
> > TX, maybe using hrtimer and hrtimer_forward().  That way the timer would
> > never run in normal case.
> 
> 
> It seems wasteful to add per-packet overhead for tx timeouts, which
> should be an exception. Do drivers really care about the exact
> timeout value? Compared to a packet transmission time its incredibly
> long anyways ..

I agree. Doing a mod_timer or hrtimer_forward to push forward may add to the
complexity depending on how often TX happens.

Are the drivers really worried about exact timeouts here? Can we use rounding
for the timers that are more than a second, at least?
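
A rough sketch of that compromise for the re-arm path quoted above (round
only when the timeout is long enough that a sub-second shift cannot matter):

	unsigned long expires = jiffies + dev->watchdog_timeo;

	if (dev->watchdog_timeo >= 2 * HZ)
		expires = round_jiffies(expires);
	if (!mod_timer(&dev->watchdog_timer, expires))
		dev_hold(dev);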

Thanks,
Venki
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-30 Thread Venki Pallipadi
On Wed, May 30, 2007 at 12:55:51PM -0700, David Miller wrote:
 From: Patrick McHardy [EMAIL PROTECTED]
 Date: Wed, 30 May 2007 20:42:32 +0200
 
  Stephen Hemminger wrote:
  Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
  ===
  --- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c  2007-05-24 
  11:16:03.0 -0700
  +++ linux-2.6.22-rc-mm/net/sched/sch_generic.c   2007-05-25 
  15:10:02.0 -0700
  @@ -224,7 +224,8 @@
   if (dev-tx_timeout) {
   if (dev-watchdog_timeo = 0)
   dev-watchdog_timeo = 5*HZ;
  -if (!mod_timer(dev-watchdog_timer, jiffies + 
  dev-watchdog_timeo))
  +if (!mod_timer(dev-watchdog_timer,
  +   round_jiffies(jiffies + 
  dev-watchdog_timeo)))
   dev_hold(dev);
   }
   }
  
  Please cc netdev on net patches.
  
  Again, I worry that if people set the watchdog timeout to, say, 0.1 
  seconds
  then they will get one second, which is grossly different.
  
  And if they were to set it to 1.5 seconds, they'd get 2.0 which is pretty
  significant, too.
   
   
   Alternatively, we could change to a timer that is pushed forward after 
   each
   TX, maybe using hrtimer and hrtimer_forward().  That way the timer would
   never run in normal case.
  
  
  It seems wasteful to add per-packet overhead for tx timeouts, which
  should be an exception. Do drivers really care about the exact
  timeout value? Compared to a packet transmission time its incredibly
  long anyways ..
 
 I agree, this change is absolutely rediculious and is just a blind
 cookie-cutter change made without consideration of what the code is
 doing and what it's requirements are.

I hope I could atleast highlight the issue here despite the cookie-cutter
patch..
On a totally idle system I have something like 85 wakeups for every 5 seconds
which I am trying to reduce (to reduce the power consumption and increase
battery life. And 1 interrupt out of 85 happens to be netdev watchdog timer.

Thanks,
Venki
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-30 Thread Venki Pallipadi
On Wed, May 30, 2007 at 01:30:39PM -0700, Stephen Hemminger wrote:
 On Wed, 30 May 2007 12:55:51 -0700 (PDT)
 David Miller [EMAIL PROTECTED] wrote:
 
  From: Patrick McHardy [EMAIL PROTECTED]
  Date: Wed, 30 May 2007 20:42:32 +0200
  
   Stephen Hemminger wrote:
   Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
   ===
   --- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c2007-05-24 
   11:16:03.0 -0700
   +++ linux-2.6.22-rc-mm/net/sched/sch_generic.c 2007-05-25 
   15:10:02.0 -0700
   @@ -224,7 +224,8 @@
  if (dev-tx_timeout) {
  if (dev-watchdog_timeo = 0)
  dev-watchdog_timeo = 5*HZ;
   -  if (!mod_timer(dev-watchdog_timer, jiffies + 
   dev-watchdog_timeo))
   +  if (!mod_timer(dev-watchdog_timer,
   + round_jiffies(jiffies + 
   dev-watchdog_timeo)))
  dev_hold(dev);
  }
}
   
   Please cc netdev on net patches.
   
   Again, I worry that if people set the watchdog timeout to, say, 0.1 
   seconds
   then they will get one second, which is grossly different.
   
   And if they were to set it to 1.5 seconds, they'd get 2.0 which is 
   pretty
   significant, too.


Alternatively, we could change to a timer that is pushed forward after 
each
TX, maybe using hrtimer and hrtimer_forward().  That way the timer would
never run in normal case.
   
   
   It seems wasteful to add per-packet overhead for tx timeouts, which
   should be an exception. Do drivers really care about the exact
   timeout value? Compared to a packet transmission time its incredibly
   long anyways ..
  
  I agree, this change is absolutely rediculious and is just a blind
  cookie-cutter change made without consideration of what the code is
  doing and what it's requirements are.
  
 
 what about the obvious compromise:
 
 --- a/net/sched/sch_generic.c 2007-05-30 11:42:18.0 -0700
 +++ b/net/sched/sch_generic.c 2007-05-30 13:29:34.0 -0700
 @@ -203,7 +203,11 @@ static void dev_watchdog(unsigned long a
 dev->name);
   dev->tx_timeout(dev);
   }
 - if (!mod_timer(&dev->watchdog_timer, 
 round_jiffies(jiffies + dev->watchdog_timeo)))
 +
 + if (!mod_timer(&dev->watchdog_timer,
 +dev->watchdog_timeo > 2 * HZ
 +? round_jiffies(jiffies + 
 dev->watchdog_timeo)
 +: jiffies + dev->watchdog_timeo))
   dev_hold(dev);
   }
   }
 
 

If this does not work:
Another option is to use a 'deferrable timer' here, which fires at the same
time as before when the CPU is busy, but on an idle CPU is delayed until the
CPU comes out of idle due to some other event.
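
For illustration, a minimal sketch of what that could look like (assuming the
init_timer_deferrable() helper from the deferrable-timer work; the dev_watchdog
and dev->watchdog_* names are only borrowed from sch_generic.c, and the exact
hook-up is not shown):

	/*
	 * Sketch only: a deferrable timer is set up like a normal timer,
	 * but an idle CPU will not be woken just to service it.
	 */
	init_timer_deferrable(&dev->watchdog_timer);
	dev->watchdog_timer.function = dev_watchdog;
	dev->watchdog_timer.data = (unsigned long)dev;
	mod_timer(&dev->watchdog_timer,
		  round_jiffies(jiffies + dev->watchdog_timeo));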

Thanks,
Venki


[PATCH] Add a flag to indicate deferrable timers in /proc/timer_stats

2007-05-30 Thread Venki Pallipadi

Add a flag in /proc/timer_stats to indicate deferrable timers. This will let
developers/users differentiate between the types of timers in /proc/timer_stats.

A deferrable timer and a normal timer will appear in /proc/timer_stats as below.
  10D, 1 swapper  queue_delayed_work_on (delayed_work_timer_fn)
   10, 1 swapper  queue_delayed_work_on (delayed_work_timer_fn)

Also, the timer_stats version changes from v0.1 to v0.2.

Signed-off-by: Venkatesh Pallipadi [EMAIL PROTECTED]

Index: linux-2.6.22-rc-mm/include/linux/hrtimer.h
===
--- linux-2.6.22-rc-mm.orig/include/linux/hrtimer.h 2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/include/linux/hrtimer.h  2007-05-30 15:02:49.0 
-0700
@@ -329,12 +329,13 @@
 #ifdef CONFIG_TIMER_STATS
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-void *timerf, char * comm);
+void *timerf, char * comm,
+unsigned int timer_flag);
 
 static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
 {
timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
-timer->function, timer->start_comm);
+timer->function, timer->start_comm, 0);
 }
 
 extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
Index: linux-2.6.22-rc-mm/kernel/timer.c
===
--- linux-2.6.22-rc-mm.orig/kernel/timer.c  2007-05-24 17:04:10.0 
-0700
+++ linux-2.6.22-rc-mm/kernel/timer.c   2007-05-30 15:19:15.0 -0700
@@ -305,6 +305,20 @@
memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
timer->start_pid = current->pid;
 }
+
+static void timer_stats_account_timer(struct timer_list *timer)
+{
+   unsigned int flag = 0;
+
+   if (unlikely(tbase_get_deferrable(timer->base)))
+   flag |= TIMER_STATS_FLAG_DEFERRABLE;
+
+   timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+timer->function, timer->start_comm, flag);
+}
+
+#else
+static void timer_stats_account_timer(struct timer_list *timer) {}
 #endif
 
 /**
Index: linux-2.6.22-rc-mm/kernel/time/timer_stats.c
===
--- linux-2.6.22-rc-mm.orig/kernel/time/timer_stats.c   2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/kernel/time/timer_stats.c2007-05-30 
15:36:55.0 -0700
@@ -68,6 +68,7 @@
 * Number of timeout events:
 */
unsigned long   count;
+   unsigned int   timer_flag;
 
/*
 * We save the command-line string to preserve
@@ -227,7 +228,8 @@
  * incremented. Otherwise the timer is registered in a free slot.
  */
 void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
- void *timerf, char * comm)
+ void *timerf, char * comm,
+ unsigned int timer_flag)
 {
/*
 * It doesnt matter which lock we take:
@@ -240,6 +242,7 @@
input.start_func = startf;
input.expire_func = timerf;
input.pid = pid;
+   input.timer_flag = timer_flag;
 
spin_lock_irqsave(lock, flags);
if (!active)
@@ -286,7 +289,7 @@
period = ktime_to_timespec(time);
ms = period.tv_nsec / 1000000;
 
-   seq_puts(m, "Timer Stats Version: v0.1\n");
+   seq_puts(m, "Timer Stats Version: v0.2\n");
seq_printf(m, "Sample period: %ld.%03ld s\n", period.tv_sec, ms);
if (atomic_read(&overflow_count))
seq_printf(m, "Overflow: %d entries\n",
@@ -294,8 +297,13 @@
 
for (i = 0; i < nr_entries; i++) {
entry = entries + i;
-   seq_printf(m, "%4lu, %5d %-16s ",
+   if (entry->timer_flag & TIMER_STATS_FLAG_DEFERRABLE) {
+   seq_printf(m, "%4luD, %5d %-16s ",
entry->count, entry->pid, entry->comm);
+   } else {
+   seq_printf(m, " %4lu, %5d %-16s ",
+   entry->count, entry->pid, entry->comm);
+   }
 
print_name_offset(m, (unsigned long)entry->start_func);
seq_puts(m, " (");
Index: linux-2.6.22-rc-mm/include/linux/timer.h
===
--- linux-2.6.22-rc-mm.orig/include/linux/timer.h   2007-05-24 
17:04:10.0 -0700
+++ linux-2.6.22-rc-mm/include/linux/timer.h2007-05-30 15:19:23.0 
-0700
@@ -85,16 +85,13 @@
  */
 #ifdef CONFIG_TIMER_STATS
 
+#define TIMER_STATS_FLAG_DEFERRABLE 0x1
+
 extern void init_timer_stats(void);
 
 extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
-  

Re: [PATCH] Display Intel Dynamic Acceleration feature in /proc/cpuinfo

2007-05-29 Thread Venki Pallipadi
On Thu, May 24, 2007 at 05:04:13PM -0700, H. Peter Anvin wrote:
> 
> If they grow slowly from the bottom, I guess we could simply allocate
> space in the vector byte by byte instead.  Either way, it means more
> work whenever anything has to change.
> 

hpa,

Below patch adds a new word for feature bits that will be used for all Intel
features that may be spread around in CPUID leafs like 0x6, 0xA, etc.
I added the "ida" bit first into this word. I will send an incremental patch
to move the ARCH_PERFMON bit and any other feature bits in these leafs subsequently.
The patch is against newsetup git tree.

Please apply.

Thanks,
Venki



Use a new CPU feature word to cover all Intel features that are spread around
in different CPUID leafs like 0x5, 0x6 and 0xA. Make this
feature detection code common across i386 and x86_64.

Display Intel Dynamic Acceleration feature in /proc/cpuinfo. This feature
will be enabled automatically by current acpi-cpufreq driver.

Refer to Intel Software Developer's Manual for more details about the feature.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6/include/asm-i386/cpufeature.h
===
--- linux-2.6.orig/include/asm-i386/cpufeature.h2007-05-29 
07:30:28.0 -0700
+++ linux-2.6/include/asm-i386/cpufeature.h 2007-05-29 10:21:17.0 
-0700
@@ -12,7 +12,7 @@
 #endif
 #include 
 
-#define NCAPINTS   7   /* N 32-bit words worth of info */
+#define NCAPINTS   8   /* N 32-bit words worth of info */
 
 /* Intel-defined CPU features, CPUID level 0x0001 (edx), word 0 */
 #define X86_FEATURE_FPU(0*32+ 0) /* Onboard FPU */
@@ -109,6 +109,9 @@
 #define X86_FEATURE_LAHF_LM(6*32+ 0) /* LAHF/SAHF in long mode */
 #define X86_FEATURE_CMP_LEGACY (6*32+ 1) /* If yes HyperThreading not valid */
 
+/* More extended Intel flags: From various new CPUID levels like 0x6, 0xA etc 
*/
+#define X86_FEATURE_IDA(7*32+ 0) /* Intel Dynamic Acceleration 
*/
+
 #define cpu_has(c, bit)
\
(__builtin_constant_p(bit) &&   \
 ( (((bit)>>5)==0 && (1UL<<((bit)&31) & REQUIRED_MASK0)) || \
@@ -117,7 +120,8 @@
   (((bit)>>5)==3 && (1UL<<((bit)&31) & REQUIRED_MASK3)) || \
   (((bit)>>5)==4 && (1UL<<((bit)&31) & REQUIRED_MASK4)) || \
   (((bit)>>5)==5 && (1UL<<((bit)&31) & REQUIRED_MASK5)) || \
-  (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) )  \
+  (((bit)>>5)==6 && (1UL<<((bit)&31) & REQUIRED_MASK6)) || \
+  (((bit)>>5)==7 && (1UL<<((bit)&31) & REQUIRED_MASK7)) )  \
  ? 1 : \
  test_bit(bit, (c)->x86_capability))
#define boot_cpu_has(bit)  cpu_has(&boot_cpu_data, bit)
Index: linux-2.6/arch/i386/kernel/cpu/proc.c
===
--- linux-2.6.orig/arch/i386/kernel/cpu/proc.c  2007-05-29 07:30:20.0 
-0700
+++ linux-2.6/arch/i386/kernel/cpu/proc.c   2007-05-29 08:20:51.0 
-0700
@@ -65,6 +65,12 @@
"osvw", "ibs", NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+
+   /* Intel-defined (#3) */
+   "ida", NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
};
static const char * const x86_power_flags[] = {
"ts",   /* temperature sensor */
Index: linux-2.6/arch/x86_64/kernel/setup.c
===
--- linux-2.6.orig/arch/x86_64/kernel/setup.c   2007-05-29 07:30:21.0 
-0700
+++ linux-2.6/arch/x86_64/kernel/setup.c2007-05-29 09:20:01.0 
-0700
@@ -699,6 +699,7 @@
/* Cache sizes */
unsigned n;
 
+   init_additional_intel_features(c);
init_intel_cacheinfo(c);
if (c->cpuid_level > 9 ) {
unsigned eax = cpuid_eax(10);
@@ -973,6 +974,12 @@
"osvw", "ibs", NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+
+   /* Intel-defined (#3) */
+   "ida", NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
};
static char *x86_power_flags[] = { 
"ts",   /* temperature sensor */
Index: linux-2.6/include/asm-i386/required-features.h

Re: [PATCH 1/4] Make usb-autosuspend timer 1 sec jiffy aligned

2007-05-29 Thread Venki Pallipadi
On Tue, May 29, 2007 at 11:22:30AM -0700, Randy Dunlap wrote:
> On Tue, 29 May 2007 10:58:21 -0700 Venki Pallipadi wrote:
> 
> > 
> > 
> > Below are a bunch of random timers, that were active on my system,
> > that can better be round_jiffies() aligned.
> 
> and these 4 patches help with (a) power usage, or (b) cache
> usage/niceness, or (c) other (be specific)...
> 

Yes. They are all related to power savings with the tickless kernel.
A 5 sec timer accounts for 0.2 unnecessary wakeups per sec (powertop numbers).
All these patches together account for somewhere between 0.5 and 1
wakeup per second of savings. That means my wakeups per second
come down from ~18 to ~17.
On my dual core laptop, the CPUs see more than a 3% increase in
average C3 residency (the actual powertop number went from ~104ms to ~108ms
long-term C3 residency).

The actual AC power numbers were not consistent enough to be reported here.
But all these small changes will add up in terms of power savings and
battery life.

Thanks,
Venki


[PATCH 4/4] Make mce polling timers 1 sec jiffy aligned

2007-05-29 Thread Venki Pallipadi

round_jiffies() for i386 and x86-64 non-critical/corrected MCE polling.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/arch/x86_64/kernel/mce.c
===
--- linux-2.6.22-rc-mm.orig/arch/x86_64/kernel/mce.c2007-05-24 
11:15:57.0 -0700
+++ linux-2.6.22-rc-mm/arch/x86_64/kernel/mce.c 2007-05-25 17:29:21.0 
-0700
@@ -366,7 +366,8 @@
next_interval = min(next_interval*2, check_interval*HZ);
}
 
-   schedule_delayed_work(&mcheck_work, next_interval);
+   schedule_delayed_work(&mcheck_work,
+ round_jiffies_relative(next_interval));
 }
 
 
@@ -374,7 +375,8 @@
 { 
next_interval = check_interval * HZ;
if (next_interval)
-   schedule_delayed_work(&mcheck_work, next_interval);
+   schedule_delayed_work(&mcheck_work,
+ round_jiffies_relative(next_interval));
return 0;
 } 
 __initcall(periodic_mcheck_init);
@@ -618,7 +620,8 @@
on_each_cpu(mce_init, NULL, 1, 1);   
next_interval = check_interval * HZ;
if (next_interval)
-   schedule_delayed_work(&mcheck_work, next_interval);
+   schedule_delayed_work(&mcheck_work,
+ round_jiffies_relative(next_interval));
 }
 
 static struct sysdev_class mce_sysclass = {
Index: linux-2.6.22-rc-mm/arch/i386/kernel/cpu/mcheck/non-fatal.c
===
--- linux-2.6.22-rc-mm.orig/arch/i386/kernel/cpu/mcheck/non-fatal.c 
2007-04-25 20:08:32.0 -0700
+++ linux-2.6.22-rc-mm/arch/i386/kernel/cpu/mcheck/non-fatal.c  2007-05-25 
17:27:49.0 -0700
@@ -57,7 +57,7 @@
 static void mce_work_fn(struct work_struct *work)
 { 
on_each_cpu(mce_checkregs, NULL, 1, 1);
-   schedule_delayed_work(&mce_work, MCE_RATE);
+   schedule_delayed_work(&mce_work, round_jiffies_relative(MCE_RATE));
 } 
 
 static int __init init_nonfatal_mce_checker(void)
@@ -82,7 +82,7 @@
/*
 * Check for non-fatal errors every MCE_RATE s
 */
-   schedule_delayed_work(&mce_work, MCE_RATE);
+   schedule_delayed_work(&mce_work, round_jiffies_relative(MCE_RATE));
printk(KERN_INFO "Machine check exception polling timer started.\n");
return 0;
 }


[PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-29 Thread Venki Pallipadi

round_jiffies for net dev watchdog timer.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/net/sched/sch_generic.c
===
--- linux-2.6.22-rc-mm.orig/net/sched/sch_generic.c 2007-05-24 
11:16:03.0 -0700
+++ linux-2.6.22-rc-mm/net/sched/sch_generic.c  2007-05-25 15:10:02.0 
-0700
@@ -224,7 +224,8 @@
if (dev->tx_timeout) {
if (dev->watchdog_timeo <= 0)
dev->watchdog_timeo = 5*HZ;
-   if (!mod_timer(&dev->watchdog_timer, jiffies + 
dev->watchdog_timeo))
+   if (!mod_timer(&dev->watchdog_timer,
+  round_jiffies(jiffies + dev->watchdog_timeo)))
dev_hold(dev);
}
 }


[PATCH 2/4] Make page-writeback timers 1 sec jiffy aligned

2007-05-29 Thread Venki Pallipadi


timer round_jiffies in page-writeback.

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/mm/page-writeback.c
===
--- linux-2.6.22-rc-mm.orig/mm/page-writeback.c 2007-05-25 10:49:11.0 
-0700
+++ linux-2.6.22-rc-mm/mm/page-writeback.c  2007-05-25 10:49:29.0 
-0700
@@ -469,7 +469,7 @@
if (time_before(next_jif, jiffies + HZ))
next_jif = jiffies + HZ;
if (dirty_writeback_interval)
-   mod_timer(&wb_timer, next_jif);
+   mod_timer(&wb_timer, round_jiffies(next_jif));
 }
 
 /*
@@ -481,7 +481,7 @@
proc_dointvec_userhz_jiffies(table, write, file, buffer, length, ppos);
if (dirty_writeback_interval) {
mod_timer(&wb_timer,
-   jiffies + dirty_writeback_interval);
+   round_jiffies(jiffies + dirty_writeback_interval));
} else {
del_timer(&wb_timer);
}
@@ -491,7 +491,8 @@
 static void wb_timer_fn(unsigned long unused)
 {
if (pdflush_operation(wb_kupdate, 0) < 0)
-   mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
+   mod_timer(&wb_timer, round_jiffies(jiffies + HZ));
+   /* delay 1 second */
 }
 
 static void laptop_flush(unsigned long unused)
@@ -511,7 +512,7 @@
  */
 void laptop_io_completion(void)
 {
-   mod_timer(&laptop_mode_wb_timer, jiffies + laptop_mode);
+   mod_timer(&laptop_mode_wb_timer, round_jiffies(jiffies + laptop_mode));
 }
 
 /*
@@ -582,7 +583,7 @@
  */
 void __init page_writeback_init(void)
 {
-   mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
+   mod_timer(&wb_timer, round_jiffies(jiffies + dirty_writeback_interval));
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
 }


[PATCH 1/4] Make usb-autosuspend timer 1 sec jiffy aligned

2007-05-29 Thread Venki Pallipadi


Below are a bunch of random timers, active on my system, that can
better be round_jiffies() aligned.

I guess we need an audit of all timer usages, at least in the kernel core.

This patch:

Make usb autosuspend timers 1sec jiffy aligned.
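
For background, the effect of the rounding is roughly the following (an
illustrative sketch only, not the kernel's round_jiffies() implementation;
the HZ value is an arbitrary example):

	#define HZ 250	/* example tick rate, assumption */

	/*
	 * Sketch of the idea behind round_jiffies(): align an absolute
	 * expiry to a whole-second boundary (roughly, round to the
	 * nearest multiple of HZ) so that unrelated timers tend to
	 * expire in the same tick instead of each waking the CPU.
	 */
	static unsigned long round_to_second(unsigned long j)
	{
		unsigned long rem = j % HZ;

		return (rem < HZ / 4) ? j - rem : j - rem + HZ;
	}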

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc-mm/drivers/usb/core/driver.c
===
--- linux-2.6.22-rc-mm.orig/drivers/usb/core/driver.c   2007-05-24 
11:16:00.0 -0700
+++ linux-2.6.22-rc-mm/drivers/usb/core/driver.c2007-05-25 
10:00:50.0 -0700
@@ -974,7 +974,7 @@
 * or for the past.
 */
queue_delayed_work(ksuspend_usb_wq, &udev->autosuspend,
-   suspend_time - jiffies);
+   round_jiffies_relative(suspend_time - jiffies));
}
return -EAGAIN;
}


Re: [PATCH] Display Intel Dynamic Acceleration feature in /proc/cpuinfo

2007-05-24 Thread Venki Pallipadi
On Thu, May 24, 2007 at 03:02:23PM -0700, Andrew Morton wrote:
> On Wed, 23 May 2007 15:46:37 -0700
> Venki Pallipadi <[EMAIL PROTECTED]> wrote:
> 
> > Display Intel Dynamic Acceleration feature in /proc/cpuinfo. This feature
> > will be enabled automatically by current acpi-cpufreq driver and cpufreq.
> > 
> > Refer to Intel Software Developer's Manual for more details about the 
> > feature.
> > 
> > Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>
> > 
> > Index: linux-2.6.22-rc-mm/arch/i386/kernel/cpu/proc.c
> > ===
> > --- linux-2.6.22-rc-mm.orig/arch/i386/kernel/cpu/proc.c
> > +++ linux-2.6.22-rc-mm/arch/i386/kernel/cpu/proc.c
> > @@ -41,7 +41,7 @@ static int show_cpuinfo(struct seq_file 
> > "cxmmx", "k6_mtrr", "cyrix_arr", "centaur_mcr",
> > NULL, NULL, NULL, NULL,
> > "constant_tsc", "up", NULL, NULL, NULL, NULL, NULL, NULL,
> > -   NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> > +   "ida", NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> > NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> >  
> > /* Intel-defined (#2) */
> > Index: linux-2.6.22-rc-mm/arch/x86_64/kernel/setup.c
> > ===
> > --- linux-2.6.22-rc-mm.orig/arch/x86_64/kernel/setup.c
> > +++ linux-2.6.22-rc-mm/arch/x86_64/kernel/setup.c
> > @@ -949,7 +949,7 @@ static int show_cpuinfo(struct seq_file 
> > /* Other (Linux-defined) */
> > "cxmmx", NULL, "cyrix_arr", "centaur_mcr", NULL,
> > "constant_tsc", NULL, NULL,
> > -   "up", NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> > +   "up", NULL, NULL, NULL, "ida", NULL, NULL, NULL,
> > NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> > NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> 
> Ho hum.  This clashes with hpa's git-newsetup tree, which goes for a great
> tromp through the cpuinfo implementation.
> 

Hmm.. I will move the feature detection to the setup routines, refresh the
patch against the latest -mm, and resend it.

Thanks,
Venki
 


Re: [PATCH] Display Intel Dynamic Acceleration feature in /proc/cpuinfo

2007-05-24 Thread Venki Pallipadi
On Thu, May 24, 2007 at 11:25:27PM +0200, Andi Kleen wrote:
> On Thursday 24 May 2007 23:13:37 Venki Pallipadi wrote:
> > On Thu, May 24, 2007 at 11:08:38PM +0200, Andi Kleen wrote:
> > > 
> > > I think it's generally a good idea to push cpuinfo flags in earliest
> > > as possible; just make sure we actually use the final name (so that we 
> > > don't get
> > > into a pni->sse3 mess again) 
> > >  
> > 
> > ida is the official name as in the Software Developer's Manual now. So, it
> > should not be an issue unless the marketing folks change their mind in the future :-)
> 
> Well they did sometimes in the past.
> 
> But actually reading the patch: it seems weird to detect the flag 
> in acpi-cpufreq and essentially change /proc/cpuinfo when a
> module is loaded. Why not in the intel setup function? And why is it 
> not in the standard CPUID 1 features mask anyways?
> 

I can do it in the Intel setup function. But the feature may not be activated
unless the driver is loaded. Going by the hardware-capability point of view,
we can do it in the setup function.

The feature appears in CPUID leaf 6 (the Power Management Leaf) instead of the
regular CPUID 1 feature flags.
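
For reference, a minimal userspace sketch of probing that leaf (this assumes,
per the SDM, that bit 1 of CPUID.06H:EAX indicates IDA; it is only an
illustration, not the detection code in the patch):

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* Leaf 6 is the Power Management leaf. */
		if (!__get_cpuid(6, &eax, &ebx, &ecx, &edx))
			return 1;

		/* Assumption: EAX bit 1 is the IDA flag. */
		printf("ida: %s\n", (eax & (1 << 1)) ? "yes" : "no");
		return 0;
	}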

Thanks,
Venki


Re: [PATCH] Display Intel Dynamic Acceleration feature in /proc/cpuinfo

2007-05-24 Thread Venki Pallipadi
On Thu, May 24, 2007 at 11:08:38PM +0200, Andi Kleen wrote:
> 
> I think it's generally a good idea to push cpuinfo flags in earliest
> as possible; just make sure we actually use the final name (so that we don't 
> get
> into a pni->sse3 mess again) 
>  

ida is the official name as in the Software Developer's Manual now. So, it
should not be an issue unless the marketing folks change their mind in the future :-)

Thanks,
Venki


Re: [PATCH] Display Intel Dynamic Acceleration feature in /proc/cpuinfo

2007-05-24 Thread Venki Pallipadi
On Thu, May 24, 2007 at 05:01:04PM -0400, Dave Jones wrote:
> On Thu, May 24, 2007 at 01:55:13PM -0700, Andrew Morton wrote:
>  > On Wed, 23 May 2007 15:46:37 -0700
>  > Venki Pallipadi <[EMAIL PROTECTED]> wrote:
>  > 
>  > > Display Intel Dynamic Acceleration feature in /proc/cpuinfo. This feature
>  > > will be enabled automatically by current acpi-cpufreq driver and cpufreq.
>  > 
>  > So you're saying that the cpufreq code in Linus's tree aleady supports IDA?
>  > If so, this is a 2.6.22 patch, isn't it?
> 
> From my limited understanding[*], ida is the "We're single threaded,
> disable the 2nd core, and clock the first core faster" magic.
> It doesn't need code-changes, as its all done in hardware afaik.

The IDA state will appear as a new highest-frequency P-state (P0), and when
software requests that frequency, the hardware can opportunistically and
transparently provide a higher frequency than that.

The current cpufreq code will detect this new state and enter it when the
CPU is busy.

> 
> identifying & exporting the flags on earlier kernels should be harmless,
> but not really 'mustfix'.
> 

I agree with Dave that it is not a must-fix. As the patch is pretty harmless,
it would be nice to have it in 2.6.22.

Thanks,
Venki

