Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-12-30 Thread Ingo Molnar

* Mike Galbraith <[EMAIL PROTECTED]> wrote:

> > Is this after resume ? If yes, then something (probably BIOS) is 
> > fiddling with the TSC of one CPU when the resume happens.
> 
> My P4 box has the same "problem", which is remedied by..

> - start = get_cycles_sync();
> + start = last_tsc = get_cycles_sync();

This is slightly racy - your second patch that initializes things 
properly is the right solution IMO. I'm wondering: if others are seeing 
this too, should we make this a v2.6.24 item? It's a bit late for that, I 
think - although it shouldn't hurt.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-12-29 Thread Mike Galbraith
(hm, google says i'm not the only one seeing this, so...)

On Sun, 2007-03-18 at 00:32 +0100, Thomas Gleixner wrote:
> Maxim,
> 
> On Sun, 2007-03-18 at 01:00 +0200, Maxim wrote:
> > >Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> > >[CPU#0 -> CPU#1]:
> > >Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles 
> > >TSC warp between CPUs, turning off
> > 
> > ^ This one I don't think is related to NO_HZ, maybe it is hardware
> > problem, but it exist without NO_HZ
> 
> The TSC is checked for synchronization between the CPUs. It's nothing to
> worry about. We switch off the TSC and use a different clocksource.
> 
> Is this after resume ? If yes, then something (probably BIOS) is
> fiddling with the TSC of one CPU when the resume happens.

My P4 box has the same "problem", which is remedied by..

diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 9125efe..7b74969 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -46,7 +46,7 @@ static __cpuinit void check_tsc_warp(void)
    cycles_t start, now, prev, end;
    int i;
 
-   start = get_cycles_sync();
+   start = last_tsc = get_cycles_sync();
    /*
     * The measurement runs for 20 msecs:
     */

..whacking the ancient last_tsc before entering test loop.  Question is,
is there a good reason to disable the TSC once it's been stepped upon by
BIOS?  Are there any ill effects to be awaited by ignoring this BIOS
artifact?  All seems just fine here.

-Mike
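
Editorial sketch (not from the thread): the effect of the one-line change
can be illustrated with a simplified, single-threaded user-space model of
the warp check. It assumes, as the patch implies, that last_tsc is shared
state left over from an earlier run. The names fake_tsc, read_fake_tsc(),
check_warp() and warp_after_resume() are hypothetical, and the counter
values are made up. The real check runs on two CPUs under a lock, which
this model leaves out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycles_t;

/* Shared state that survives between runs of the check. */
static cycles_t last_tsc;

/* Hypothetical stand-in for the TSC; it can be rewound to mimic firmware. */
static cycles_t fake_tsc;

cycles_t read_fake_tsc(void)
{
    fake_tsc += 100;
    return fake_tsc;
}

/*
 * Simplified model of the warp check: returns the largest backwards
 * step seen over `iterations` reads. If reset_last_tsc is non-zero,
 * last_tsc is re-initialised first, as in the one-line patch.
 */
cycles_t check_warp(int iterations, int reset_last_tsc)
{
    cycles_t max_warp = 0;
    cycles_t start = read_fake_tsc();

    assert(iterations > 0);
    if (reset_last_tsc)
        last_tsc = start;

    for (int i = 0; i < iterations; i++) {
        cycles_t prev = last_tsc;
        cycles_t now = read_fake_tsc();

        last_tsc = now;
        if (prev > now && prev - now > max_warp)
            max_warp = prev - now;
    }
    return max_warp;
}

/* One run at "boot", then the counter is rewound and the check reruns. */
cycles_t warp_after_resume(int reset_last_tsc)
{
    fake_tsc = 72000000000ULL;  /* illustrative pre-suspend value */
    last_tsc = 0;
    (void)check_warp(10, reset_last_tsc);

    fake_tsc = 0;               /* assumption: firmware restarted the TSC */
    return check_warp(10, reset_last_tsc);
}
```

In this model warp_after_resume(0) reports a huge backwards step that comes
only from the stale last_tsc, while warp_after_resume(1) reports none. Being
single-threaded, it cannot show the race Ingo points out.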


Re: sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far)

2007-03-22 Thread Thomas Gleixner
On Thu, 2007-03-22 at 08:28 -0700, Greg KH wrote:
> On Tue, Mar 20, 2007 at 11:54:03AM +, Pavel Machek wrote:
> > Hi!
> > 
> > > [EMAIL PROTECTED]:/home/maxim# cat 
> > > /sys/devices/system/clockevents/clockevents0/registered
> > > lapicF:0007 M:3(periodic) C: 1
> > > hpet F:0003 M:1(shutdown) C: 0
> > > lapicF:0007 M:3(periodic) C: 0
> > > [EMAIL PROTECTED]:/home/maxim#   
> > 
> > Now... this file needs to die, before 2.6.21 is released. It tries to
> > bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
> > part of stable ABI!
> 
> Eeek!
> 
> I agree, that needs to be fixed now.
> 
> Remember, 1 value per file in sysfs!  Shall I just submit a patch
> ripping it out for now?

I'll fix it.

tglx


Re: sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far)

2007-03-22 Thread Greg KH
On Tue, Mar 20, 2007 at 11:54:03AM +, Pavel Machek wrote:
> Hi!
> 
> > [EMAIL PROTECTED]:/home/maxim# cat 
> > /sys/devices/system/clockevents/clockevents0/registered
> > lapicF:0007 M:3(periodic) C: 1
> > hpet F:0003 M:1(shutdown) C: 0
> > lapicF:0007 M:3(periodic) C: 0
> > [EMAIL PROTECTED]:/home/maxim#   
> 
> Now... this file needs to die, before 2.6.21 is released. It tries to
> bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
> part of stable ABI!

Eeek!

I agree, that needs to be fixed now.

Remember, 1 value per file in sysfs!  Shall I just submit a patch
ripping it out for now?

thanks,

greg k-h
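
Editorial sketch (not from the thread): a minimal user-space illustration
of the "one value per file" rule, assuming the `registered` output were
split into one attribute per value. The struct, the helper names and the
meaning given to "F:" and "M:" are assumptions made for illustration. This
is not the clockevents code and not the fix that was eventually applied.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one device; field meanings are assumed. */
struct clockevent_info {
    const char *name;
    unsigned int features;  /* assumed meaning of "F:" */
    unsigned int mode;      /* assumed meaning of "M:" */
};

/* Each helper writes exactly one value plus a newline into buf. */
int show_name(const struct clockevent_info *ce, char *buf, size_t len)
{
    assert(ce != NULL && buf != NULL);
    return snprintf(buf, len, "%s\n", ce->name);
}

int show_features(const struct clockevent_info *ce, char *buf, size_t len)
{
    assert(ce != NULL && buf != NULL);
    return snprintf(buf, len, "%04x\n", ce->features);
}

int show_mode(const struct clockevent_info *ce, char *buf, size_t len)
{
    assert(ce != NULL && buf != NULL);
    return snprintf(buf, len, "%u\n", ce->mode);
}
```

A reader of any one of these values needs no parsing beyond stripping the
newline, which is the property the multi-column `registered` file lacks.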


sysfs ugly timer interface (was Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far)

2007-03-22 Thread Pavel Machek
Hi!

> [EMAIL PROTECTED]:/home/maxim# cat 
> /sys/devices/system/clockevents/clockevents0/registered
> lapicF:0007 M:3(periodic) C: 1
> hpet F:0003 M:1(shutdown) C: 0
> lapicF:0007 M:3(periodic) C: 0
> [EMAIL PROTECTED]:/home/maxim#   

Now... this file needs to die, before 2.6.21 is released. It tries to
bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
part of stable ABI!
Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Eric St-Laurent
On Tue, 2007-20-03 at 10:15 +0100, Arjan van de Ven wrote:

> disabling that is a BAD idea. I'm no fan of SMM myself, but it's there,
> and we have to live with it. Disabling it without knowing what it does
> on your system is madness.
> 

Like Lee said, for "debugging", mainly trying to resolve unexplained
long latencies.

I've had a laptop that caused latency spikes when the CPU fan was turned
on. I tried disabling SMI to diagnose the problem, with no success.

My current system has a BIOS feature to control fan speed according to
temperature. I presume this must use an SMI to work, right?  In this case
it should be possible to find and disable the related SMI and replace the
fan control with user-space software.

Of course it's not wise to blindly disable SMIs as we don't precisely
know what they do.


- Eric


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Andy Lutomirski

Arjan van de Ven wrote:
> On Tue, 2007-03-20 at 01:36 -0400, Eric St-Laurent wrote:
> > On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:
> > 
> > > I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> > > not to mention people trying to spec out hardware for RT
> > > applications...
> > 
> > There is a SMI disabling module in RTAI, check the smi-module.c in this:
> > 
> > https://www.rtai.org/RTAI/rtai-3.5.tar.bz2
> > 
> > More infos:
> > 
> > http://www.captain.at/rtai-smi-high-latency.php
> > http://www.captain.at/xenomai-smi-high-latency.php
> > 
> > It might make sense to merge this code, at least in the -rt tree.
> 
> it NEVER makes sense to disable SMM.
> 
> SMM is there to ensure that your hardware doesn't get physically
> damaged.
> 
> disabling that is a BAD idea. I'm no fan of SMM myself, but it's there,
> and we have to live with it. Disabling it without knowing what it does
> on your system is madness.

How about disabling it long enough to calibrate the timers and then 
turning it back on?


--Andy

(apologies if anyone gets duplicates of this.  i'm encountering 
nightly-thunderbird-build bugs.)


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Oliver Neukum
On Tuesday, 20 March 2007 12:36, Andi Kleen wrote:
> It's long after timer calibration, which is what it interfered with here.
> 
> To handle that it would need to be moved to the x86 early quirks and
> use boot_ioremap etc. It would be probably somewhat messy, but doable.

USB is not specific to x86. And not necessarily the only user of SMM.
Is this really necessary?

Regards
Oliver


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Andi Kleen
On Mon, Mar 19, 2007 at 09:27:34PM -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > > 
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> > 
> > That's not true - we do early pci discovery. Doing USB handsoff
> > there would be quite possible.
> 
> What, we don't do USB "handoff" early enough in the boot process?  It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and

Early for drivers, but quite late for architecture initialization.

> initramfs shells...)

It's long after timer calibration, which is what it interfered with here.

To handle that it would need to be moved to the x86 early quirks and
use boot_ioremap etc. It would be probably somewhat messy, but doable.

-Andi


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Arjan van de Ven
On Tue, 2007-03-20 at 01:36 -0400, Eric St-Laurent wrote:
> On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:
> 
> > I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> > not to mention people trying to spec out hardware for RT
> > applications...
> 
> There is a SMI disabling module in RTAI, check the smi-module.c in this:
> 
> https://www.rtai.org/RTAI/rtai-3.5.tar.bz2
> 
> More infos:
> 
> http://www.captain.at/rtai-smi-high-latency.php
> http://www.captain.at/xenomai-smi-high-latency.php
> 
> It might make sense to merge this code, at least in the -rt tree.

it NEVER makes sense to disable SMM.

SMM is there to ensure that your hardware doesn't get physically
damaged.

disabling that is a BAD idea. I'm no fan of SMM myself, but it's there,
and we have to live with it. Disabling it without knowing what it does
on your system is madness.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Arjan van de Ven
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > > 
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> > 
> > That's not true - we do early pci discovery. Doing USB handsoff
> > there would be quite possible.
> 
> What, we don't do USB "handoff" early enough in the boot process?  It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and
> initramfs shells...)

it's not early enough for this bug, where the SMM code is ruining the
cpu calibrations :)

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > > 
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> > 
> > That's not true - we do early pci discovery. Doing USB handsoff
> > there would be quite possible.
> 
> What, we don't do USB "handoff" early enough in the boot process?  It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and
> initramfs shells...)

It happens way after the CPUs are brought up. At this point both the
delay loop calibration and the local APIC calibration are already done.

tglx
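
Editorial sketch (not from the thread): a back-of-the-envelope model of why
an SMI inside a calibration window matters. It assumes a calibration that
counts busy-loop iterations during a fixed window, and an SMI that takes
part of that window away from the CPU. The function names and all numbers
are hypothetical; the thread gives no figures for SMI duration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Loop iterations counted in a window of window_us microseconds when
 * smi_us of that window were spent in SMM instead of in the loop.
 */
uint64_t measured_loops(uint64_t loops_per_us, uint64_t window_us,
                        uint64_t smi_us)
{
    assert(smi_us <= window_us);
    return loops_per_us * (window_us - smi_us);
}

/* Resulting underestimate, in parts per thousand of the true value. */
uint64_t underestimate_permille(uint64_t window_us, uint64_t smi_us)
{
    assert(window_us > 0 && smi_us <= window_us);
    return smi_us * 1000 / window_us;
}
```

With a 10 ms window and 0.5 ms spent in SMM, underestimate_permille(10000,
500) gives 50, so under these assumptions the loop count comes out about 5%
low.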


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Eric St-Laurent
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:

> I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> not to mention people trying to spec out hardware for RT
> applications...

There is a SMI disabling module in RTAI, check the smi-module.c in this:

https://www.rtai.org/RTAI/rtai-3.5.tar.bz2

More infos:

http://www.captain.at/rtai-smi-high-latency.php
http://www.captain.at/xenomai-smi-high-latency.php

It might make sense to merge this code, at least in the -rt tree.


- Eric


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Lee Revell

On 3/16/07, Thomas Gleixner <[EMAIL PROTECTED]> wrote:

> Yes, this is probably caused by SMM code trying to emulate a PS/2
> keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> have no way to disable this BIOS misfeature in the early boot process.


https://mail.rtai.org/pipermail/rtai/2003-March/002949.html

http://www.embeddedrelated.com/usenet/embedded/show/50333-1.php

I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
not to mention people trying to spec out hardware for RT
applications...

Lee


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Greg KH
On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > 
> > well we can do the handshake to take ownership like we do much later in
> > boot, but that requires PCI to be there and fully discovered, which we
> > don't have this early.
> 
> That's not true - we do early pci discovery. Doing USB handsoff
> there would be quite possible.

What, we don't do USB "handoff" early enough in the boot process?  It's
happening at PCI quirk time now, which I think should be early enough
for everyone (and too early for some who rely on USB keyboards and
initramfs shells...)

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Greg KH
On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
 Arjan van de Ven [EMAIL PROTECTED] writes:
  
  well we can do the handshake to take ownership like we do much later in
  boot, but that requires PCI to be there and fully discovered, which we
  don't have this early.
 
 That's not true - we do early pci discovery. Doing USB handsoff
 there would be quite possible.

What, we don't do USB handoff early enough in the boot process?  It's
happening at PCI quirk time now, which I think should be early enough
for everyone (and too early for some who rely on USB keyboards and
initramfs shells...)

thanks,

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Lee Revell

On 3/16/07, Thomas Gleixner [EMAIL PROTECTED] wrote:

Yes, this is probably caused by SMM code trying to emulate a PS/2
keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
have no way to disable this BIOS misfeature in the early boot process.


https://mail.rtai.org/pipermail/rtai/2003-March/002949.html

http://www.embeddedrelated.com/usenet/embedded/show/50333-1.php

I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
not to mention people trying to spec out hardware for RT
applications...

Lee
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Eric St-Laurent
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:

> I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> not to mention people trying to spec out hardware for RT
> applications...

There is a SMI disabling module in RTAI, check the smi-module.c in this:

https://www.rtai.org/RTAI/rtai-3.5.tar.bz2

More infos:

http://www.captain.at/rtai-smi-high-latency.php
http://www.captain.at/xenomai-smi-high-latency.php

It might make sense to merge this code, at least in the -rt tree.


- Eric




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Thomas Gleixner
On Mon, 2007-03-19 at 21:27 -0700, Greg KH wrote:
> On Sat, Mar 17, 2007 at 02:26:57PM +0100, Andi Kleen wrote:
> > Arjan van de Ven <[EMAIL PROTECTED]> writes:
> > > 
> > > well we can do the handshake to take ownership like we do much later in
> > > boot, but that requires PCI to be there and fully discovered, which we
> > > don't have this early.
> > 
> > That's not true - we do early pci discovery. Doing USB handsoff
> > there would be quite possible.
> 
> What, we don't do USB handoff early enough in the boot process?  It's
> happening at PCI quirk time now, which I think should be early enough
> for everyone (and too early for some who rely on USB keyboards and
> initramfs shells...)

It happens way after the CPUs are brought up. At this point both the
delay loop calibration and the local APIC calibration are already done.

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
Maxim,

On Sun, 2007-03-18 at 01:00 +0200, Maxim wrote:
> >Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> >[CPU#0 -> CPU#1]:
> >Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> >warp between CPUs, turning off
> 
> ^ This one I don't think is related to NO_HZ, maybe it is hardware
> problem, but it exist without NO_HZ

The TSC is checked for synchronization between the CPUs. It's nothing to
worry about. We switch off the TSC and use a different clocksource.
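
For anyone unfamiliar with the term, here is a minimal sketch of what a
"warp" means in this check. It is a simplified single-threaded model, not
the kernel's check_tsc_sync_source() code: max_tsc_warp() is a hypothetical
helper, and samples[] stands for TSC readings listed in the order in which
the two CPUs actually took them. With synchronized TSCs that sequence never
goes backwards, so any backward step is a warp.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative model only (hypothetical helper, not kernel code).
 * samples[] holds TSC readings in global acquisition order, taken
 * alternately on different CPUs. Returns the largest backward step
 * seen, or 0 if time never appeared to run backwards.
 */
uint64_t max_tsc_warp(const uint64_t *samples, size_t n)
{
	uint64_t worst = 0;

	for (size_t i = 1; i < n; i++) {
		if (samples[i] < samples[i - 1]) {
			uint64_t warp = samples[i - 1] - samples[i];

			if (warp > worst)
				worst = warp;
		}
	}
	return worst;
}
```

A non-zero result is the kind of condition that makes the kernel print
"Measured ... cycles TSC warp between CPUs" and stop using the TSC as a
clocksource.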

Is this after resume ? If yes, then something (probably BIOS) is
fiddling with the TSC of one CPU when the resume happens.

> >[   36.217405] ENABLING IO-APIC IRQs
> >[   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> >[   36.433917] APIC timer disabled due to verification failure.
> 
> This one is now discussed, I will look at it and it is not related to NO_HZ

I sent a patch for this yesterday:

http://marc.info/?l=linux-kernel&m=117408952322631&w=2

> And I forgot to tell about another problem with (now I know ,hi-resolution 
> timers)
> That before suspend to ram APIC timer is used and HPET is not used :
> 
> [EMAIL PROTECTED]:/home/maxim# cat 
> /sys/devices/system/clockevents/clockevents0/registered
> lapic F:0007 M:3(periodic) C: 1
> hpet F:0003 M:1(shutdown) C: 0
> lapic F:0007 M:3(periodic) C: 0
> [EMAIL PROTECTED]:/home/maxim#   
> 
> But after suspend to ram HPET is 'woken'
> 
> [EMAIL PROTECTED]:/home/maxim# cat 
> /sys/devices/system/clockevents/clockevents0/registered
> lapic F:0007 M:3(one shoot) C: 1
> hpet F:0003 M:3(one shoot) C: 0
> lapic F:0007 M:3(one shoot) C: 0

This is unrelated to suspend / resume. The local apic timers stop
(hardware madness), when the CPU enters C3 power state. In this case we
switch to HPET (or PIT when HPET is not available) and broadcast the
events via Inter Processor Interrupts. This is nothing to worry about. 
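
The fallback described above can be sketched as a small decision function.
This is only an illustration of the rule as stated in this mail; the enum
and pick_tick_source() are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum tick_source {
	TICK_LAPIC,          /* per-CPU local APIC timer keeps running */
	TICK_HPET_BROADCAST, /* HPET fires, events forwarded via IPI */
	TICK_PIT_BROADCAST,  /* PIT fires, events forwarded via IPI */
};

/*
 * If the local APIC timer stops in C3, a timer that keeps running has
 * to take over and broadcast the events: HPET when available, PIT
 * otherwise.
 */
enum tick_source pick_tick_source(bool lapic_stops_in_c3, bool hpet_available)
{
	if (!lapic_stops_in_c3)
		return TICK_LAPIC;
	return hpet_available ? TICK_HPET_BROADCAST : TICK_PIT_BROADCAST;
}
```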

I'm a bit surprised though, that your system was in periodic mode before
suspend and switched to one shot mode on resume.

Is this reproducible ? If yes, can you please provide the dmesg output
from boot to resume ?

> Note that I added those (one shoot), (periodic) descriptions, would be
> nice to have them in kernel, can I send a patch ?  ;-)

Sure, just s/shoot/shot/ :)

> and I see average of 18 IRQs/sec on IRQ 0

So dynticks are working as expected.

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 01:39:01 Thomas Gleixner wrote:
> On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
> > check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
> > show_trace_log_lvl+0x1a/0x30
> > Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> > [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> > warp between CPUs, turning off
> > 
> > It looks clear that preempt is enabled all the way in second cpu 
> > initialization, ( I think that at least in check_tsc_sync_source, it should 
> > be disabled,
> > shouldn't it ? )
> 
> This should be fixed by commit d04f41e35343f1d788551fd3f753f51794f4afcf
> 
>   tglx
> 
> 
> 
> 

Hi,

Yes, it is fixed, thanks

Regards,
Maxim Levitsky


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 01:19:44 Len Brown wrote:
> On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
> > 
> > Good day, 
> > 
> > I want to report regressions I have with 2.6.21-rc3 kernel.
> > I use CONFIG_NO_HZ.
> 
> Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
> or are they all independent of it?
> 
> thanks,
> -Len
> 
> > 1) Both suspend to disk and suspend to RAM are completely broken:
> > On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works 
> > _almost_ perfectly (I will tell about that later).
> > On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to 
> > disk hangs before image write , and after suspend to ram , 
> > some devices are powered down (disk,power leds) , and some and not(fans, 
> > power) , and system hangs).
> > 
> > I did a git-bisect and I found which commit caused that:
> > e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change 
> > code ordering in main.c (breaks  S3)
> > ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: 
> > Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested 
> > it)
> > 259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: 
> > Change code ordering in user.c (breaks uswsusp, that I use)
> > 
> > I reverted those commits and now system suspends correctly to disk, but 
> > suspend to ram showed some more regressions.
> > 
> > 
> > 2) ) After suspend to ram I get this 
> > 
> > Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
> > check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
> > show_trace_log_lvl+0x1a/0x30
> > Mar 14 00:22:23 MAIN kernel: [2.072881]  [show_trace+18/32] 
> > show_trace+0x12/0x20
> > Mar 14 00:22:23 MAIN kernel: [2.072884]  [dump_stack+22/32] 
> > dump_stack+0x16/0x20
> > Mar 14 00:22:23 MAIN kernel: [2.072887]  
> > [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0xb0
> > Mar 14 00:22:23 MAIN kernel: [2.072891]  [check_tsc_sync_source+29/256] 
> > check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [2.072894]  [__cpu_up+80/384] 
> > __cpu_up+0x50/0x180
> > Mar 14 00:22:23 MAIN kernel: [2.072897]  [_cpu_up+98/208] 
> > _cpu_up+0x62/0xd0
> > Mar 14 00:22:23 MAIN kernel: [2.072901]  [cpu_up+46/80] cpu_up+0x2e/0x50
> > Mar 14 00:22:23 MAIN kernel: [2.072903]  [enable_nonboot_cpus+110/160] 
> > enable_nonboot_cpus+0x6e/0xa0
> > Mar 14 00:22:23 MAIN kernel: [2.072906]  [enter_state+326/496] 
> > enter_state+0x146/0x1f0
> > Mar 14 00:22:23 MAIN kernel: [2.072909]  [state_store+174/192] 
> > state_store+0xae/0xc0
> > Mar 14 00:22:23 MAIN kernel: [2.072912]  [subsys_attr_store+43/64] 
> > subsys_attr_store+0x2b/0x40
> > Mar 14 00:22:23 MAIN kernel: [2.072917]  [sysfs_write_file+186/272] 
> > sysfs_write_file+0xba/0x110
> > Mar 14 00:22:23 MAIN kernel: [2.072920]  [vfs_write+150/352] 
> > vfs_write+0x96/0x160
> > Mar 14 00:22:23 MAIN kernel: [2.072923]  [sys_write+61/112] 
> > sys_write+0x3d/0x70
> > Mar 14 00:22:23 MAIN kernel: [2.072926]  [sysenter_past_esp+93/153] 
> > sysenter_past_esp+0x5d/0x99
> > Mar 14 00:22:23 MAIN kernel: [2.072929]  ===
> > Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> > [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> > warp between CPUs, turning off
> > 
> > It looks clear that preempt is enabled all the way in second cpu 
> > initialization, ( I think that at least in check_tsc_sync_source, it should 
> > be disabled,
> > shouldn't it ? )
> > 
> > Then I did add preempt_disable() / preempt_enable()  to this function , and 
> >  I still got this:
> > 
> > Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> > [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> > warp between CPUs, turning off
> > 
> > It happens after second CPU is brought back on-line.
> > 
> > Now I understand that this is TSC sync problem and I tried to do some tests:
> > 
> >  I tried to disable/enable second CPU by hand, eg I did number of times,
> > 
> > echo "0" > /sys/devices/system/cpu/cpu1/online
> > echo "1" > /sys/devices/system/cpu/cpu1/online
> > 
> > and TSC sync was ok.
> > 
> > Then I disabled 2nd CPU, have suspended system to RAM , resumed it  , and 
> > then enabled 2nd CPU and got same error message.
> > Then I disabled cpufreq , and did above tests, and got same results.
> > I think that maybe this error is false, that there is some difference in 
> > TSC clock, but this difference is constant, and can be fixed
> > 
> > 3) Sometimes I get this (once in three boots or so)
> > 
> > [   36.217405] ENABLING IO-APIC IRQs
> > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [   36.433917] APIC timer disabled due to verification 

Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 03:32:53 Len Brown wrote:
> On Friday 16 March 2007 19:44, Thomas Gleixner wrote:
> > Maxim,
> > 
> > On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > > 3) Sometimes I get this (once in three boots or so)
> > > 
> > > [   36.217405] ENABLING IO-APIC IRQs
> > > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > > [   36.433917] APIC timer disabled due to verification failure.
> > > 
> > > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > > I haven't investigated that yet.
> > > It looks like another new test that my hardware fails to perform... 
> > 
> > Yes, this is probably caused by SMM code trying to emulate a PS/2
> > keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> > have no way to disable this BIOS misfeature in the early boot process. 
> > Arjan, Len ?
> 
> Nope.  By definition, SMM is invisible to the OS -- we don't even
> get a bit that said it occurred (though we'd like one -- it would
> be really helpful to diagnose issues like this one)
> 
> So go into BIOS SETUP and see if there is a USB Legacy Emulation
> feature that you can disable.  Sometimes there is not, but disabling
> onboard USB altogether may help at least prove the issue is in that area.
> 
> > I built in this test to rule out bogus LAPIC timer calibration values
> > which are sometimes off by factor 2-10.
> > 
> > But I also built in a calibration against the PM-Timer, which turned out
> > to be quite reliable and I think the additional verification step is
> > only necessary for sytems without PM-Timer.
> > 
> > That was a bit over cautious from my side. I send a patch to avoid this
> > when PM-Timer is available in a separate mail.
> 
> PM-Timer was invented to work-around the issue that the TSC became unreliable
> in the face of power management on laptops.  In particular, to be able
> to time duration of OS idle where TSC stopped.
> 
> While it is not fine grain, and it is not low-latency, is should
> be very reliable.  My understanding is that it is implemented as
> a simple divider right off the system 14MHz clock -- the signal
> which most motherboard clocks are PLL multiplied up from --
> including the 100MHz front-side bus which drives the LAPIC timer.
> 
> But that said, I don't understand why calibrating the LAPIC timer
> using the PM-timer is going to be more reliable -- exactly how
> and why did the previous calibration scheme fail?
> Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
> output for this box.
> 
> cheers,
> -Len
> 
> 
> 

Hi,

Yes, USB emulation is enabled, but I need it. I will test without USB
emulation, but since the problem shows up only sometimes, I don't know yet
whether USB legacy emulation affects it.

Regards, 
Maxim Levitsky


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:
> > Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
> > output for this box.
> 
> calibrating APIC timer ...
> ... lapic delta = 2426884
> ... PM timer delta = 833908
> APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
> APIC delta adjusted to PM-Timer: 1041737 (2426884)
> . delta 1041737
> . mult: 44749065
> . calibration result: 166677
> . CPU clock speed is 4659.0624 MHz.
> . host bus clock speed is 166.0677 MHz.
> 
> This box is off by factor 2.3 and using the PM-Timer instead of the
> PIT/jiffies values gives me a correct result.

I instrumented the lapic calibration on this box:

I1:     999 us   total:    999 us
I2:     999 us   total:   1998 us
...
I28:    999 us   total:  27980 us
I29: 135097 us   total: 163077 us  <
I30:    881 us   total: 163958 us
...
I98:   1000 us   total: 231918 us
I99:    999 us   total: 232917 us

So it vanishes away for 132 ms, which is exactly the error above. This
happens in random places and sometimes I'm lucky that it does not happen
at all.
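
A sketch of how such a trace can be scanned for the stall: given the
per-interrupt intervals in microseconds, find the longest one. The helper
name longest_interval() is made up for this example, and the input in the
usage note is just a few of the values listed above.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the index of the longest interval in
 * interval_us[] and stores its length in *length_us. n must be > 0.
 */
size_t longest_interval(const unsigned long *interval_us, size_t n,
			unsigned long *length_us)
{
	size_t worst = 0;

	for (size_t i = 1; i < n; i++) {
		if (interval_us[i] > interval_us[worst])
			worst = i;
	}
	*length_us = interval_us[worst];
	return worst;
}
```

Fed with { 999, 999, 135097, 881, 1000 } it points at the 135097 us entry.
With a 1000 us tick that one interval accounts for nearly all of the
excess in the 232917 us total.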

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Andi Kleen
Arjan van de Ven <[EMAIL PROTECTED]> writes:
> 
> well we can do the handshake to take ownership like we do much later in
> boot, but that requires PCI to be there and fully discovered, which we
> don't have this early.

That's not true - we do early pci discovery. Doing USB handsoff
there would be quite possible.

-Andi



Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:
> calibrating APIC timer ...
> ... lapic delta = 2426884
> ... PM timer delta = 833908
> APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
> APIC delta adjusted to PM-Timer: 1041737 (2426884)
> . delta 1041737
> . mult: 44749065
> . calibration result: 166677
> . CPU clock speed is 4659.0624 MHz.
> . host bus clock speed is 166.0677 MHz.
> 
> This box is off by factor 2.3 and using the PM-Timer instead of the
> PIT/jiffies values gives me a correct result.
> 
> Another one:
> APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
> APIC delta adjusted to PM-Timer: 1254436 (2534)
> 
> Off by factor 20 !!

This weird behaviour also can be seen with the BogoMIPS calibration:

Calibrating delay using timer specific routine.. 6428.32 BogoMIPS 
(lpj=12856647)

Initializing CPU#1
Calibrating delay using timer specific routine.. 103837.25 BogoMIPS 
(lpj=207674508)

Note, that I never observed that on CPU#0. It always affects CPU#1.
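
For reference, the BogoMIPS figure is derived directly from lpj (loops per
jiffy), so a bogus lpj shows up one to one in the printed value. A minimal
sketch of that conversion follows. The helper and struct names are
hypothetical, and HZ=250 is an assumption: it is the value that reproduces
both numbers printed above.

```c
/* Hypothetical container for the two parts of the printed value. */
struct bogomips {
	unsigned long whole;      /* part before the decimal point */
	unsigned long hundredths; /* two digits after the decimal point */
};

/*
 * Convert loops-per-jiffy into the "NNNN.NN BogoMIPS" figure.
 * hz is the tick rate the kernel was built with (assumed, see above)
 * and must divide 5000 evenly for the fraction to be exact.
 */
struct bogomips lpj_to_bogomips(unsigned long lpj, unsigned long hz)
{
	struct bogomips b;

	b.whole = lpj / (500000 / hz);
	b.hundredths = (lpj / (5000 / hz)) % 100;
	return b;
}
```

With hz = 250, lpj=12856647 gives 6428.32 and lpj=207674508 gives
103837.25, i.e. CPU#1 was measured roughly 16 times too fast.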

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Arjan van de Ven
On Fri, 2007-03-16 at 21:32 -0400, Len Brown wrote:
> On Friday 16 March 2007 19:44, Thomas Gleixner wrote:
> > Maxim,
> > 
> > On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > > 3) Sometimes I get this (once in three boots or so)
> > > 
> > > [   36.217405] ENABLING IO-APIC IRQs
> > > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > > [   36.433917] APIC timer disabled due to verification failure.
> > > 
> > > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > > I haven't investigated that yet.
> > > It looks like another new test that my hardware fails to perform... 
> > 
> > Yes, this is probably caused by SMM code trying to emulate a PS/2
> > keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> > have no way to disable this BIOS misfeature in the early boot process. 
> > Arjan, Len ?
> 
> Nope.  By definition, SMM is invisible to the OS -- we don't even
> get a bit that said it occurred (though we'd like one -- it would
> be really helpful to diagnose issues like this one)


well we can do the handshake to take ownership like we do much later in
boot, but that requires PCI to be there and fully discovered, which we
don't have this early.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
Len,

On Fri, 2007-03-16 at 21:32 -0400, Len Brown wrote:
> > > [   36.433917] APIC timer disabled due to verification failure.
> > > 
> > > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > > I haven't investigated that yet.
> > > It looks like another new test that my hardware fails to perform... 
> > 
> > Yes, this is probably caused by SMM code trying to emulate a PS/2
> > keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> > have no way to disable this BIOS misfeature in the early boot process. 
> > Arjan, Len ?
> 
> Nope.  By definition, SMM is invisible to the OS -- we don't even
> get a bit that said it occurred (though we'd like one -- it would
> be really helpful to diagnose issues like this one)

I know that it is invisible. Nevertheless I know that the BIOSes emulate
PS/2 keyboards from USB via SMM during the boot process until we call
the usb_handoff function.

> So go into BIOS SETUP and see if there is a USB Legacy Emulation
> feature that you can disable.  Sometimes there is not, but disabling
> onboard USB altogether may help at least prove the issue is in that area.

I have more than one box (even original Intel mainboards), where either
plugging a PS/2 keyboard or switching off USB makes this problem go
away.

> > I built in this test to rule out bogus LAPIC timer calibration values
> > which are sometimes off by factor 2-10.
> > 
> > But I also built in a calibration against the PM-Timer, which turned out
> > to be quite reliable and I think the additional verification step is
> > only necessary for sytems without PM-Timer.
> > 
> > That was a bit over cautious from my side. I send a patch to avoid this
> > when PM-Timer is available in a separate mail.
> 
> PM-Timer was invented to work-around the issue that the TSC became unreliable
> in the face of power management on laptops.  In particular, to be able
> to time duration of OS idle where TSC stopped.
> 
> While it is not fine grain, and it is not low-latency, is should
> be very reliable.  My understanding is that it is implemented as
> a simple divider right off the system 14MHz clock -- the signal
> which most motherboard clocks are PLL multiplied up from --
> including the 100MHz front-side bus which drives the LAPIC timer.
> 
> But that said, I don't understand why calibrating the LAPIC timer
> using the PM-timer is going to be more reliable -- exactly how
> and why did the previous calibration scheme fail?
> Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
> output for this box.

calibrating APIC timer ...
... lapic delta = 2426884
... PM timer delta = 833908
APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
APIC delta adjusted to PM-Timer: 1041737 (2426884)
. delta 1041737
. mult: 44749065
. calibration result: 166677
. CPU clock speed is 4659.0624 MHz.
. host bus clock speed is 166.0677 MHz.

This box is off by factor 2.3 and using the PM-Timer instead of the
PIT/jiffies values gives me a correct result.
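
The numbers in that log can be reproduced with plain integer arithmetic.
The sketch below is not the apic.c code; the helper names are made up.
The only outside fact it relies on is the ACPI PM timer frequency of
3.579545 MHz.

```c
#include <stdint.h>

/* ACPI PM timer frequency in Hz */
#define PMTMR_TICKS_PER_SEC 3579545ULL

/* How long the calibration window really was, according to the PM timer. */
uint64_t pm_elapsed_ms(uint64_t delta_pm)
{
	return delta_pm * 1000 / PMTMR_TICKS_PER_SEC;
}

/*
 * Scale the measured lapic delta down to what it would have been if the
 * window had lasted exactly 100ms of PM timer time.
 */
uint64_t lapic_delta_per_100ms(uint64_t delta_lapic, uint64_t delta_pm)
{
	const uint64_t pm_100ms = PMTMR_TICKS_PER_SEC / 10;

	return delta_lapic * pm_100ms / delta_pm;
}
```

With delta_pm = 833908 the window comes out as 232ms, and scaling
delta_lapic = 2426884 yields 1041737, the "adjusted" value in the log.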

Another one:
APIC calibration not consistent with PM Timer: 2020ms instead of 100ms
APIC delta adjusted to PM-Timer: 1254436 (2534)

Off by factor 20 !!

The original APIC timer calibration did:

local_irq_disable();
wait_until_pit_underflows();
t1 = read_apic_counter();
for (i = 0; i < HZ/10; i++)
	wait_until_pit_underflows();
t2 = read_apic_counter();

and calculated the APIC timer frequency from the delta of t1 and t2 vs.
the 100ms time.

This had 2 problems:
1. It gave results, which are off by factor 2-10 on a couple of boxen.
2. Some systems stop there dead as the PIT readout is broken.

I changed it to do:

local_irq_disable();
original_pit_handler = pit->handler;
pit->handler = lapic_calibration_handler;
loops = 0;
local_irq_enable();
wait_until_handler_has_done_HZ/10_loops();

The handler does:

if (!loops++) {
	t1_apic = read_apic_counter();
	t1_jiffies = jiffies;
	t1_pm = read_pm_timer();
}

if (loops == HZ/10) {
	t2_apic = read_apic_counter();
	t2_jiffies = jiffies;
	t2_pm = read_pm_timer();
	done = 1;
}

If the pmtimer is available, then calculate the APIC timer frequency
from the t1_pm/t2_pm delta, otherwise use jiffies.

When pm_timer is there, we can trust the calculated value, if not we do
a verify run of the periodic apic timer and the pit timer. If this fails
- and it fails often due to the SMM crap - then I use the PIT and IPIs.
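
Put as code, the decision described in the two paragraphs above looks
roughly like this. It is a sketch of the logic as described in this mail,
with hypothetical names, not the actual apic.c implementation.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum lapic_cal_result {
	LAPIC_TRUST_PM_CALIBRATION, /* PM timer based value, no verify run */
	LAPIC_VERIFIED_AGAINST_PIT, /* jiffies based value passed the verify run */
	LAPIC_FALL_BACK_TO_PIT_IPI, /* verify failed: use PIT and IPIs */
};

/*
 * verify_run_passed is only looked at when there is no PM timer,
 * because the verification run is skipped when one is available.
 */
enum lapic_cal_result lapic_cal_decide(bool pm_timer_available,
				       bool verify_run_passed)
{
	if (pm_timer_available)
		return LAPIC_TRUST_PM_CALIBRATION;
	if (verify_run_passed)
		return LAPIC_VERIFIED_AGAINST_PIT;
	return LAPIC_FALL_BACK_TO_PIT_IPI;
}
```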

In the first version I did a verification run even when pm_timer was
there, but this produced false positives as well, because the lapic
timer interrupt is in the same way delayed as the PIT interrupt. I
removed this to avoid unnecessary switching to IPIs after I verified,
that it always produced false positives when the calibration was done
against the PM-Timer.

tglx


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Andi Kleen
Arjan van de Ven [EMAIL PROTECTED] writes:
 
> well we can do the handshake to take ownership like we do much later in
> boot, but that requires PCI to be there and fully discovered, which we
> don't have this early.

That's not true - we do early PCI discovery. Doing the USB handoff
there would be quite possible.

-Andi



Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
On Sat, 2007-03-17 at 10:56 +0100, Thomas Gleixner wrote:
> > Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
> > output for this box.
>
> calibrating APIC timer ...
> ... lapic delta = 2426884
> ... PM timer delta = 833908
> APIC calibration PIT not consistent with PM Timer: 232ms instead of 100ms
> APIC delta adjusted to PM-Timer: 1041737 (2426884)
> . delta 1041737
> . mult: 44749065
> . calibration result: 166677
> . CPU clock speed is 4659.0624 MHz.
> . host bus clock speed is 166.0677 MHz.
>
> This box is off by factor 2.3 and using the PM-Timer instead of the
> PIT/jiffies values gives me a correct result.

I instrumented the lapic calibration on this box:

I1:     999 us total:    999 us
I2:     999 us total:   1998 us
...
I28:    999 us total:  27980 us
I29: 135097 us total: 163077 us
I30:    881 us total: 163958 us
...
I98:   1000 us total: 231918 us
I99:    999 us total: 232917 us

So it vanishes away for 132 ms, which is exactly the error above. This
happens in random places and sometimes I'm lucky that it does not happen
at all.
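
Spotting such a stall in the instrumented output is a matter of flagging any interval far longer than the nominal one. This is only a sketch: the nominal period of 1000 us and the factor of 2 are illustrative assumptions, the function name is hypothetical, and only the intervals actually shown in the listing are used because the rest are elided.

```python
# Sketch only: flag calibration intervals that are far longer than the
# nominal period. nominal_us=1000 and factor=2 are illustrative assumptions.
def find_stalls(intervals, nominal_us=1000, factor=2):
    """Return (index, duration_us) for intervals longer than factor * nominal."""
    return [(i, us) for i, us in intervals if us > factor * nominal_us]


# Only the intervals shown in the listing; the elided ones are omitted.
samples = [(1, 999), (2, 999), (28, 999), (29, 135097),
           (30, 881), (98, 1000), (99, 999)]

for index, us in find_stalls(samples):
    print(f"I{index}: {us} us, about {us - 1000} us longer than nominal")
```

On the lines shown, only I29 is flagged.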

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 03:32:53 Len Brown wrote:
> On Friday 16 March 2007 19:44, Thomas Gleixner wrote:
> > Maxim,
> >
> > On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > > 3) Sometimes I get this (once in three boots or so)
> > >
> > > [   36.217405] ENABLING IO-APIC IRQs
> > > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > > [   36.433917] APIC timer disabled due to verification failure.
> > >
> > > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > > I haven't investigated that yet.
> > > It looks like another new test that my hardware fails to perform...
> >
> > Yes, this is probably caused by SMM code trying to emulate a PS/2
> > keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> > have no way to disable this BIOS misfeature in the early boot process.
> > Arjan, Len ?
>
> Nope.  By definition, SMM is invisible to the OS -- we don't even
> get a bit that said it occurred (though we'd like one -- it would
> be really helpful to diagnose issues like this one)
>
> So go into BIOS SETUP and see if there is a USB Legacy Emulation
> feature that you can disable.  Sometimes there is not, but disabling
> onboard USB altogether may help at least prove the issue is in that area.
>
> > I built in this test to rule out bogus LAPIC timer calibration values
> > which are sometimes off by factor 2-10.
> >
> > But I also built in a calibration against the PM-Timer, which turned out
> > to be quite reliable and I think the additional verification step is
> > only necessary for systems without PM-Timer.
> >
> > That was a bit over cautious from my side. I send a patch to avoid this
> > when PM-Timer is available in a separate mail.
>
> PM-Timer was invented to work-around the issue that the TSC became unreliable
> in the face of power management on laptops.  In particular, to be able
> to time duration of OS idle where TSC stopped.
>
> While it is not fine grain, and it is not low-latency, it should
> be very reliable.  My understanding is that it is implemented as
> a simple divider right off the system 14MHz clock -- the signal
> which most motherboard clocks are PLL multiplied up from --
> including the 100MHz front-side bus which drives the LAPIC timer.
>
> But that said, I don't understand why calibrating the LAPIC timer
> using the PM-timer is going to be more reliable -- exactly how
> and why did the previous calibration scheme fail?
> Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
> output for this box.
>
> cheers,
> -Len

Hi,

Yes, USB emulation is enabled, but I need it. I will test without USB
emulation, but since the problem shows up only sometimes,
I don't know yet whether USB legacy emulation affects it.

Regards, 
Maxim Levitsky


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 01:19:44 Len Brown wrote:
 On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
  
  Good day, 
  
  I want to report regressions I have with 2.6.21-rc3 kernel.
  I use CONFIG_NO_HZ.
 
 Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
 or are they all independent of it?
 
 thanks,
 -Len
 
  1) Both suspend to disk and suspend to RAM are completely broken:
  On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works 
  _almost_ perfectly (I will tell about that later).
  On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to 
  disk hangs before image write , and after suspend to ram , 
  some devices are powered down (disk,power leds) , and some and not(fans, 
  power) , and system hangs).
  
  I did a git-bisect and I found which commit caused that:
  e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change 
  code ordering in main.c (breaks  S3)
  ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: 
  Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested 
  it)
  259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: 
  Change code ordering in user.c (breaks uswsusp, that I use)
  
  I reverted those commits and now system suspends correctly to disk, but 
  suspend to ram showed some more regressions.
  
  
  2) ) After suspend to ram I get this 
  
  Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
  check_tsc_sync_source+0x1d/0x100
  Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
  show_trace_log_lvl+0x1a/0x30
  Mar 14 00:22:23 MAIN kernel: [2.072881]  [show_trace+18/32] 
  show_trace+0x12/0x20
  Mar 14 00:22:23 MAIN kernel: [2.072884]  [dump_stack+22/32] 
  dump_stack+0x16/0x20
  Mar 14 00:22:23 MAIN kernel: [2.072887]  
  [debug_smp_processor_id+173/176] debug_smp_processor_id+0xad/0xb0
  Mar 14 00:22:23 MAIN kernel: [2.072891]  [check_tsc_sync_source+29/256] 
  check_tsc_sync_source+0x1d/0x100
  Mar 14 00:22:23 MAIN kernel: [2.072894]  [__cpu_up+80/384] 
  __cpu_up+0x50/0x180
  Mar 14 00:22:23 MAIN kernel: [2.072897]  [_cpu_up+98/208] 
  _cpu_up+0x62/0xd0
  Mar 14 00:22:23 MAIN kernel: [2.072901]  [cpu_up+46/80] cpu_up+0x2e/0x50
  Mar 14 00:22:23 MAIN kernel: [2.072903]  [enable_nonboot_cpus+110/160] 
  enable_nonboot_cpus+0x6e/0xa0
  Mar 14 00:22:23 MAIN kernel: [2.072906]  [enter_state+326/496] 
  enter_state+0x146/0x1f0
  Mar 14 00:22:23 MAIN kernel: [2.072909]  [state_store+174/192] 
  state_store+0xae/0xc0
  Mar 14 00:22:23 MAIN kernel: [2.072912]  [subsys_attr_store+43/64] 
  subsys_attr_store+0x2b/0x40
  Mar 14 00:22:23 MAIN kernel: [2.072917]  [sysfs_write_file+186/272] 
  sysfs_write_file+0xba/0x110
  Mar 14 00:22:23 MAIN kernel: [2.072920]  [vfs_write+150/352] 
  vfs_write+0x96/0x160
  Mar 14 00:22:23 MAIN kernel: [2.072923]  [sys_write+61/112] 
  sys_write+0x3d/0x70
  Mar 14 00:22:23 MAIN kernel: [2.072926]  [sysenter_past_esp+93/153] 
  sysenter_past_esp+0x5d/0x99
  Mar 14 00:22:23 MAIN kernel: [2.072929]  ===
  Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
  [CPU#0 -> CPU#1]:
  Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
  warp between CPUs, turning off
  
  It looks clear that preempt is enabled all the way in second cpu 
  initialization, ( I think that at least in check_tsc_sync_source, it should 
  be disabled,
  shouldn't it ? )
  
  Then I did add preempt_disable() / preempt_enable()  to this function , and 
   I still got this:
  
  Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
  [CPU#0 -> CPU#1]:
  Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
  warp between CPUs, turning off
  
  It happens after second CPU is brought back on-line.
  
  Now I understand that this is TSC sync problem and I tried to do some tests:
  
   I tried to disable/enable second CPU by hand, eg I did number of times,
  
  echo "0" > /sys/devices/system/cpu/cpu1/online
  echo "1" > /sys/devices/system/cpu/cpu1/online
  
  and TSC sync was ok.
  
  Then I disabled 2nd CPU, have suspended system to RAM , resumed it  , and 
  then enabled 2nd CPU and got same error message.
  Then I disabled cpufreq , and did above tests, and got same results.
  I think that maybe this error is false, that there is some difference in 
  TSC clock, but this difference is constant, and can be fixed
  
  3) Sometimes I get this (once in three boots or so)
  
  [   36.217405] ENABLING IO-APIC IRQs
  [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
  [   36.433917] APIC timer disabled due to verification failure.
  
  And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
  I haven't investigated that yet.
  It looks like another new test that my hardware fails to perform... 
  
  
  And now I want to tell you 

Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Maxim
On Saturday 17 March 2007 01:39:01 Thomas Gleixner wrote:
> On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > Mar 14 00:22:23 MAIN kernel: [2.072875] caller is
> > check_tsc_sync_source+0x1d/0x100
> > Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48]
> > show_trace_log_lvl+0x1a/0x30
> > Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization
> > [CPU#0 -> CPU#1]:
> > Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC
> > warp between CPUs, turning off
> >
> > It looks clear that preempt is enabled all the way in second cpu
> > initialization, ( I think that at least in check_tsc_sync_source, it should
> > be disabled,
> > shouldn't it ? )
>
> This should be fixed by commit d04f41e35343f1d788551fd3f753f51794f4afcf
>
> tglx

Hi,

Yes, it is fixed, thanks

Regards,
Maxim Levitsky


Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-17 Thread Thomas Gleixner
Maxim,

On Sun, 2007-03-18 at 01:00 +0200, Maxim wrote:
> Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization
> [CPU#0 -> CPU#1]:
> Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC
> warp between CPUs, turning off
>
> ^ This one I don't think is related to NO_HZ, maybe it is hardware
> problem, but it exist without NO_HZ

The TSC is checked for synchronization between the CPUs. It's nothing to
worry about. We switch off the TSC and use a different clocksource.

Is this after resume ? If yes, then something (probably BIOS) is
fiddling with the TSC of one CPU when the resume happens.

> [   36.217405] ENABLING IO-APIC IRQs
> [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> [   36.433917] APIC timer disabled due to verification failure.
>
> This one is now discussed, I will look at it and it is not related to NO_HZ

I sent a patch for this yesterday:

http://marc.info/?l=linux-kernel&m=117408952322631&w=2

> And I forgot to tell about another problem with (now I know ,hi-resolution
> timers)
> That before suspend to ram APIC timer is used and HPET is not used :
>
> [EMAIL PROTECTED]:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
> lapic F:0007 M:3(periodic) C: 1
> hpet  F:0003 M:1(shutdown) C: 0
> lapic F:0007 M:3(periodic) C: 0
> [EMAIL PROTECTED]:/home/maxim#
>
> But after suspend to ram HPET is 'woken'
>
> [EMAIL PROTECTED]:/home/maxim# cat /sys/devices/system/clockevents/clockevents0/registered
> lapic F:0007 M:3(one shoot) C: 1
> hpet  F:0003 M:3(one shoot) C: 0
> lapic F:0007 M:3(one shoot) C: 0

This is unrelated to suspend / resume. The local apic timers stop
(hardware madness), when the CPU enters C3 power state. In this case we
switch to HPET (or PIT when HPET is not available) and broadcast the
events via Inter Processor Interrupts. This is nothing to worry about. 

I'm a bit surprised though, that your system was in periodic mode before
suspend and switched to one shot mode on resume.

Is this reproducible ? If yes, can you please provide the dmesg output
from boot to resume ?

> Note that I added those (one shoot), (periodic) descriptions, would be
> nice to have them in kernel, can I send a patch ?  ;-)

Sure, just s/shoot/shot/ :)

> and I see average of 18 IRQs/sec on IRQ 0

So dynticks are working as expected.

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Len Brown
On Friday 16 March 2007 19:44, Thomas Gleixner wrote:
> Maxim,
> 
> On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> > 3) Sometimes I get this (once in three boots or so)
> > 
> > [   36.217405] ENABLING IO-APIC IRQs
> > [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> > [   36.433917] APIC timer disabled due to verification failure.
> > 
> > And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> > I haven't investigated that yet.
> > It looks like another new test that my hardware fails to perform... 
> 
> Yes, this is probably caused by SMM code trying to emulate a PS/2
> keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
> have no way to disable this BIOS misfeature in the early boot process. 
> Arjan, Len ?

Nope.  By definition, SMM is invisible to the OS -- we don't even
get a bit that said it occurred (though we'd like one -- it would
be really helpful to diagnose issues like this one)

So go into BIOS SETUP and see if there is a USB Legacy Emulation
feature that you can disable.  Sometimes there is not, but disabling
onboard USB altogether may help at least prove the issue is in that area.

> I built in this test to rule out bogus LAPIC timer calibration values
> which are sometimes off by factor 2-10.
> 
> But I also built in a calibration against the PM-Timer, which turned out
> to be quite reliable and I think the additional verification step is
> only necessary for systems without PM-Timer.
> 
> That was a bit over cautious from my side. I send a patch to avoid this
> when PM-Timer is available in a separate mail.

PM-Timer was invented to work-around the issue that the TSC became unreliable
in the face of power management on laptops.  In particular, to be able
to time duration of OS idle where TSC stopped.

While it is not fine grain, and it is not low-latency, it should
be very reliable.  My understanding is that it is implemented as
a simple divider right off the system 14MHz clock -- the signal
which most motherboard clocks are PLL multiplied up from --
including the 100MHz front-side bus which drives the LAPIC timer.
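
The divider relationship can be written down in a few lines. This is a sketch under assumptions the thread does not state: that the "14MHz" clock is the 14.31818 MHz reference, that the divider is 4 (giving 3.579545 MHz), and that the counter is 24 bits wide (some implementations are 32 bits wide instead). All names here are hypothetical.

```python
# Sketch only. Assumptions not stated in the thread: the "14MHz" clock is
# the 14.31818 MHz reference, the divider is 4, and the counter is 24 bits
# wide (some implementations are 32 bits wide instead).
REF_CLOCK_HZ = 14318180
PM_TIMER_HZ = REF_CLOCK_HZ // 4          # 3579545
PM_MASK = (1 << 24) - 1


def pm_delta(start, end):
    """Ticks elapsed between two raw reads, tolerating one counter wrap."""
    return (end - start) & PM_MASK


def pm_ticks_to_us(ticks):
    """Convert PM timer ticks to microseconds (rounded down)."""
    return ticks * 1_000_000 // PM_TIMER_HZ


print(PM_TIMER_HZ, pm_delta(0xFFFFF0, 0x000010), pm_ticks_to_us(PM_TIMER_HZ))
```

Masking the difference is what lets a short measurement survive a counter wrap; a 24-bit counter at this rate wraps every few seconds, so only intervals shorter than that can be timed this way.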

But that said, I don't understand why calibrating the LAPIC timer
using the PM-timer is going to be more reliable -- exactly how
and why did the previous calibration scheme fail?
Maybe I could follow the new logic in apic.c if I saw the "apic=debug"
output for this box.

cheers,
-Len




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Thomas Gleixner
Maxim,

On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> 3) Sometimes I get this (once in three boots or so)
> 
> [   36.217405] ENABLING IO-APIC IRQs
> [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> [   36.433917] APIC timer disabled due to verification failure.
> 
> And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> I haven't investigated that yet.
> It looks like another new test that my hardware fails to perform... 

Yes, this is probably caused by SMM code trying to emulate a PS/2
keyboard from a (maybe connected or not) USB keyboard. Unfortunately we
have no way to disable this BIOS misfeature in the early boot process. 
Arjan, Len ?

I built in this test to rule out bogus LAPIC timer calibration values
which are sometimes off by factor 2-10.

But I also built in a calibration against the PM-Timer, which turned out
to be quite reliable and I think the additional verification step is
only necessary for systems without PM-Timer.

That was a bit over cautious from my side. I send a patch to avoid this
when PM-Timer is available in a separate mail.

tglx




Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Thomas Gleixner
On Fri, 2007-03-16 at 12:30 +0200, Maxim Levitsky wrote:
> Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
> check_tsc_sync_source+0x1d/0x100
> Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
> show_trace_log_lvl+0x1a/0x30
> Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> [CPU#0 -> CPU#1]:
> Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> warp between CPUs, turning off
> 
> It looks clear that preempt is enabled all the way in second cpu 
> initialization, ( I think that at least in check_tsc_sync_source, it should 
> be disabled,
> shouldn't it ? )

This should be fixed by commit d04f41e35343f1d788551fd3f753f51794f4afcf

tglx





Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Len Brown
On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
> 
> Good day, 
> 
> I want to report regressions I have with 2.6.21-rc3 kernel.
> I use CONFIG_NO_HZ.

Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
or are they all independent of it?

thanks,
-Len

> 1) Both suspend to disk and suspend to RAM are completely broken:
> On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works 
> _almost_ perfectly (I will tell about that later).
> On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to 
> disk hangs before image write , and after suspend to ram , 
> some devices are powered down (disk,power leds) , and some and not(fans, 
> power) , and system hangs).
> 
> I did a git-bisect and I found which commit caused that:
>   e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change 
> code ordering in main.c (breaks  S3)
>   ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: 
> Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested 
> it)
> 259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: 
> Change code ordering in user.c (breaks uswsusp, that I use)
> 
> I reverted those commits and now system suspends correctly to disk, but 
> suspend to ram showed some more regressions.
> 
> 
> 2) ) After suspend to ram I get this 
> 
> Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
> check_tsc_sync_source+0x1d/0x100
> Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
> show_trace_log_lvl+0x1a/0x30
> Mar 14 00:22:23 MAIN kernel: [2.072881]  [show_trace+18/32] 
> show_trace+0x12/0x20
> Mar 14 00:22:23 MAIN kernel: [2.072884]  [dump_stack+22/32] 
> dump_stack+0x16/0x20
> Mar 14 00:22:23 MAIN kernel: [2.072887]  [debug_smp_processor_id+173/176] 
> debug_smp_processor_id+0xad/0xb0
> Mar 14 00:22:23 MAIN kernel: [2.072891]  [check_tsc_sync_source+29/256] 
> check_tsc_sync_source+0x1d/0x100
> Mar 14 00:22:23 MAIN kernel: [2.072894]  [__cpu_up+80/384] 
> __cpu_up+0x50/0x180
> Mar 14 00:22:23 MAIN kernel: [2.072897]  [_cpu_up+98/208] 
> _cpu_up+0x62/0xd0
> Mar 14 00:22:23 MAIN kernel: [2.072901]  [cpu_up+46/80] cpu_up+0x2e/0x50
> Mar 14 00:22:23 MAIN kernel: [2.072903]  [enable_nonboot_cpus+110/160] 
> enable_nonboot_cpus+0x6e/0xa0
> Mar 14 00:22:23 MAIN kernel: [2.072906]  [enter_state+326/496] 
> enter_state+0x146/0x1f0
> Mar 14 00:22:23 MAIN kernel: [2.072909]  [state_store+174/192] 
> state_store+0xae/0xc0
> Mar 14 00:22:23 MAIN kernel: [2.072912]  [subsys_attr_store+43/64] 
> subsys_attr_store+0x2b/0x40
> Mar 14 00:22:23 MAIN kernel: [2.072917]  [sysfs_write_file+186/272] 
> sysfs_write_file+0xba/0x110
> Mar 14 00:22:23 MAIN kernel: [2.072920]  [vfs_write+150/352] 
> vfs_write+0x96/0x160
> Mar 14 00:22:23 MAIN kernel: [2.072923]  [sys_write+61/112] 
> sys_write+0x3d/0x70
> Mar 14 00:22:23 MAIN kernel: [2.072926]  [sysenter_past_esp+93/153] 
> sysenter_past_esp+0x5d/0x99
> Mar 14 00:22:23 MAIN kernel: [2.072929]  ===
> Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> [CPU#0 -> CPU#1]:
> Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> warp between CPUs, turning off
> 
> It looks clear that preempt is enabled all the way in second cpu 
> initialization, ( I think that at least in check_tsc_sync_source, it should 
> be disabled,
> shouldn't it ? )
> 
> Then I did add preempt_disable() / preempt_enable()  to this function , and  
> I still got this:
> 
> Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
> [CPU#0 -> CPU#1]:
> Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
> warp between CPUs, turning off
> 
> It happens after second CPU is brought back on-line.
> 
> Now I understand that this is TSC sync problem and I tried to do some tests:
> 
>  I tried to disable/enable second CPU by hand, eg I did number of times,
> 
> echo "0" > /sys/devices/system/cpu/cpu1/online
> echo "1" > /sys/devices/system/cpu/cpu1/online
> 
> and TSC sync was ok.
> 
> Then I disabled 2nd CPU, have suspended system to RAM , resumed it  , and 
> then enabled 2nd CPU and got same error message.
> Then I disabled cpufreq , and did above tests, and got same results.
> I think that maybe this error is false, that there is some difference in TSC 
> clock, but this difference is constant, and can be fixed
> 
> 3) Sometimes I get this (once in three boots or so)
> 
> [   36.217405] ENABLING IO-APIC IRQs
> [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
> [   36.433917] APIC timer disabled due to verification failure.
> 
> And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
> I haven't investigated that yet.
> It looks like another new test that my hardware fails to perform... 
> 
> 
> And now I want to tell you about that _almost_ working suspend to ram I got 
> 

Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-16 Thread Len Brown
On Friday 16 March 2007 06:30, Maxim Levitsky wrote:
 
 Good day, 
 
 I want to report regressions I have with 2.6.21-rc3 kernel.
 I use CONFIG_NO_HZ.

Do any of these issues go away with CONFIG_NO_HZ=n (or boot with nohz=n)
or are they all independent of it?

thanks,
-Len

 1) Both suspend to disk and suspend to RAM are completely broken:
 On vanilla 2.6.20 suspend to disk works perfectly and suspend to ram works 
 _almost_ perfectly (I will tell about that later).
 On 2.6.21-rc1 and later system hangs even before suspend begins (suspend to 
 disk hangs before image write , and after suspend to ram , 
 some devices are powered down (disk,power leds) , and some and not(fans, 
 power) , and system hangs).
 
 I did a git-bisect and I found which commit caused that:
   e3c7db621bed4afb8e231cb005057f2feb5db557 - [PATCH] [PATCH] PM: Change 
 code ordering in main.c (breaks  S3)
   ed746e3b18f4df18afa3763155972c5835f284c5 - [PATCH] [PATCH] swsusp: 
 Change code ordering in disk.c (breaks swsusp, I don't use it, but I tested 
 it)
 259130526c267550bc365d3015917d90667732f1 - [PATCH] [PATCH] swsusp: 
 Change code ordering in user.c (breaks uswsusp, that I use)
 
 I reverted those commits and now system suspends correctly to disk, but 
 suspend to ram showed some more regressions.
 
 
 2) ) After suspend to ram I get this 
 
 Mar 14 00:22:23 MAIN kernel: [2.072875] caller is 
 check_tsc_sync_source+0x1d/0x100
 Mar 14 00:22:23 MAIN kernel: [2.072878]  [show_trace_log_lvl+26/48] 
 show_trace_log_lvl+0x1a/0x30
 Mar 14 00:22:23 MAIN kernel: [2.072881]  [show_trace+18/32] 
 show_trace+0x12/0x20
 Mar 14 00:22:23 MAIN kernel: [2.072884]  [dump_stack+22/32] 
 dump_stack+0x16/0x20
 Mar 14 00:22:23 MAIN kernel: [2.072887]  [debug_smp_processor_id+173/176] 
 debug_smp_processor_id+0xad/0xb0
 Mar 14 00:22:23 MAIN kernel: [2.072891]  [check_tsc_sync_source+29/256] 
 check_tsc_sync_source+0x1d/0x100
 Mar 14 00:22:23 MAIN kernel: [2.072894]  [__cpu_up+80/384] 
 __cpu_up+0x50/0x180
 Mar 14 00:22:23 MAIN kernel: [2.072897]  [_cpu_up+98/208] 
 _cpu_up+0x62/0xd0
 Mar 14 00:22:23 MAIN kernel: [2.072901]  [cpu_up+46/80] cpu_up+0x2e/0x50
 Mar 14 00:22:23 MAIN kernel: [2.072903]  [enable_nonboot_cpus+110/160] 
 enable_nonboot_cpus+0x6e/0xa0
 Mar 14 00:22:23 MAIN kernel: [2.072906]  [enter_state+326/496] 
 enter_state+0x146/0x1f0
 Mar 14 00:22:23 MAIN kernel: [2.072909]  [state_store+174/192] 
 state_store+0xae/0xc0
 Mar 14 00:22:23 MAIN kernel: [2.072912]  [subsys_attr_store+43/64] 
 subsys_attr_store+0x2b/0x40
 Mar 14 00:22:23 MAIN kernel: [2.072917]  [sysfs_write_file+186/272] 
 sysfs_write_file+0xba/0x110
 Mar 14 00:22:23 MAIN kernel: [2.072920]  [vfs_write+150/352] 
 vfs_write+0x96/0x160
 Mar 14 00:22:23 MAIN kernel: [2.072923]  [sys_write+61/112] 
 sys_write+0x3d/0x70
 Mar 14 00:22:23 MAIN kernel: [2.072926]  [sysenter_past_esp+93/153] 
 sysenter_past_esp+0x5d/0x99
 Mar 14 00:22:23 MAIN kernel: [2.072929]  ===
 Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
 [CPU#0 -> CPU#1]:
 Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
 warp between CPUs, turning off
 
 It looks clear that preempt is enabled all the way in second cpu 
 initialization, ( I think that at least in check_tsc_sync_source, it should 
 be disabled,
 shouldn't it ? )
 
 Then I did add preempt_disable() / preempt_enable()  to this function , and  
 I still got this:
 
 Mar 14 00:22:23 MAIN kernel: [2.072931] checking TSC synchronization 
 [CPU#0 -> CPU#1]:
 Mar 14 00:22:23 MAIN kernel: [2.092922] Measured 72051818872 cycles TSC 
 warp between CPUs, turning off
 
 It happens after second CPU is brought back on-line.
 
 Now I understand that this is TSC sync problem and I tried to do some tests:
 
  I tried to disable/enable second CPU by hand, eg I did number of times,
 
 echo "0" > /sys/devices/system/cpu/cpu1/online
 echo "1" > /sys/devices/system/cpu/cpu1/online
 
 and TSC sync was ok.
 
 Then I disabled 2nd CPU, have suspended system to RAM , resumed it  , and 
 then enabled 2nd CPU and got same error message.
 Then I disabled cpufreq , and did above tests, and got same results.
 I think that maybe this error is false, that there is some difference in TSC 
 clock, but this difference is constant, and can be fixed
 
 3) Sometimes I get this (once in three boots or so)
 
 [   36.217405] ENABLING IO-APIC IRQs
 [   36.217587] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
 [   36.433917] APIC timer disabled due to verification failure.
 
 And NO_HZ is disabled due to that (I get 1000/s timer's interrupts)
 I haven't investigated that yet.
 It looks like another new test that my hardware fails to perform... 
 
 
 And now I want to tell you about that _almost_ working suspend to ram I got 
 in 2.6.20:
 To put it simply sometimes system wakes from resume, and sometimes not (about 
 1 in 5 times)
 When it 
