Re: Suspend to RAM generates oops and general protection fault

2007-03-23 Thread Jean-Marc Valin
Hi,

Sorry I haven't replied recently about that bug, but I have to admit I
have no idea where to start. There actually seem to be much more
fundamental problems with the kernel on my machines. I initially
realised that even without using suspend to RAM, I was still getting
crashes when docking. So I stopped docking and realised my machine would
sometimes just crash when I plug/unplug the AC adaptor. Just to give an
idea, I've experienced about 10-15 crashes in the past two months -- I
don't think I've even done a single clean shutdown during that period.

To make things worse, the behaviour is always different. Sometimes I get
a panic with keyboard LEDs flashing. Sometimes I get nothing at all and
the machine is just frozen (doesn't respond to pings or to Alt-SysRq
commands). Sometimes, I just lose my keyboard and/or mouse but the
machine stays up. I'm running a vanilla 2.6.20 kernel (not tainted) with
the following configuration: http://jmspeex.livejournal.com/1090.html

Jean-Marc


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Suspend to RAM generates oops and general protection fault

2007-01-23 Thread Jean-Marc Valin
Luming Yu wrote:
> what about removing psmouse module?

Trying that now. Any particular reason you suspect that one?

Jean-Marc

> On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:
>> >>> will be a device driver. Common causes of suspend/resume problems from
>> >>> the list you give below are acpi modules, bluetooth and usb. I'd also be
>> >>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>> >>> everything possible in the first instance.
>> >> Actually, the reason I sent this is that when I showed the oops/gpf to
>> >> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> >> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> >> suspend to RAM now works ~95% of the time.
>> >
>> > Try a kernel without CONFIG_SMP... that will verify if it is SMP
>> > related.
>>
>> Well, this happens to be my main work machine, which I'm not willing to
>> have running at half speed for several weeks. Anything else you can
>> suggest?
>>
>> Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-23 Thread Luming Yu

On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:
> Luming Yu wrote:
> > what about removing psmouse module?
>
> Trying that now. Any particular reason you suspect that one?

I suspect it is due to broken modules. If not psmouse, please try
booting with minimal modules loaded, and re-test.

Thanks,
Luming


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Luming Yu

what about removing psmouse module?

On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:

> >>> will be a device driver. Common causes of suspend/resume problems from
> >>> the list you give below are acpi modules, bluetooth and usb. I'd also be
> >>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
> >>> everything possible in the first instance.
> >> Actually, the reason I sent this is that when I showed the oops/gpf to
> >> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
> >> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
> >> suspend to RAM now works ~95% of the time.
> >
> > Try a kernel without CONFIG_SMP... that will verify if it is SMP
> > related.
>
> Well, this happens to be my main work machine, which I'm not willing to
> have running at half speed for several weeks. Anything else you can suggest?
>
> Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>>> will be a device driver. Common causes of suspend/resume problems from
>>> the list you give below are acpi modules, bluetooth and usb. I'd also be
>>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>>> everything possible in the first instance.
>> Actually, the reason I sent this is that when I showed the oops/gpf to
>> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> suspend to RAM now works ~95% of the time.
> 
> Try a kernel without CONFIG_SMP... that will verify if it is SMP
> related.

Well, this happens to be my main work machine, which I'm not willing to
have running at half speed for several weeks. Anything else you can suggest?

Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
>>
>> This happens when I'm running 2.6.20-rc5. The previous kernel version I
>> was using is 2.6.19-rc6 and was much more broken (second attempt
>> *always* failed), so it's probably not a regression.
> 
> This is a shot against the odds, but could you please check if the attached
> patch has any effect?

Thanks, I'll try that. It may take a while because the problem only
happened once in dozens of suspend/resume cycles.

Jean-Marc

> Rafael
> 
> 
> 
> 
> 
> 
> Both process_zones() and drain_node_pages() check for populated zones before
> touching pagesets. However, __drain_pages() does not do so.
> 
> This may result in a NULL pointer dereference for pagesets in unpopulated
> zones if a NUMA setup is combined with cpu hotplug.
> 
> Initially the unpopulated zone has the pcp pointers pointing to the boot
> pagesets.  Since the zone is not populated the boot pageset pointers will
> not be changed during page allocator and slab bootstrap.
> 
> If a cpu is later brought down (first call to __drain_pages()) then the pcp
> pointers for cpus in unpopulated zones are set to NULL since __drain_pages
> does not first check for an unpopulated zone.
> 
> If the cpu is then brought up again then we call process_zones() which will
> ignore the unpopulated zone. So the pageset pointers will still be NULL.
> 
> If the cpu is then again brought down then __drain_pages will attempt to drain
> pages by following the NULL pageset pointer for unpopulated zones.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> ---
>  mm/page_alloc.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6.20-rc4/mm/page_alloc.c
> ===
> --- linux-2.6.20-rc4.orig/mm/page_alloc.c
> +++ linux-2.6.20-rc4/mm/page_alloc.c
> @@ -714,6 +714,9 @@ static void __drain_pages(unsigned int c
>   if (!populated_zone(zone))
>   continue;
>  
> + if (!populated_zone(zone))
> + continue;
> +
>   pset = zone_pcp(zone, cpu);
>   for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
>   struct per_cpu_pages *pcp;


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Pavel Machek
Hi!

> > will be a device driver. Common causes of suspend/resume problems from
> > the list you give below are acpi modules, bluetooth and usb. I'd also be
> > consider pcmcia, drm and fuse possibilities. But again, go for unloading
> > everything possible in the first instance.
> 
> Actually, the reason I sent this is that when I showed the oops/gpf to
> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
> suspend to RAM now works ~95% of the time.

Try a kernel without CONFIG_SMP... that will verify if it is SMP
related.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Rafael J. Wysocki
Hi,

On Monday, 22 January 2007 03:34, Jean-Marc Valin wrote:
> Hi,
> 
> I just encountered the following oops and general protection fault
> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
> relevant errors are below but the full dmesg log is at
> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
> 
> This happens when I'm running 2.6.20-rc5. The previous kernel version I
> was using is 2.6.19-rc6 and was much more broken (second attempt
> *always* failed), so it's probably not a regression.

This is a shot against the odds, but could you please check if the attached
patch has any effect?

Rafael


-- 
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King
Both process_zones() and drain_node_pages() check for populated zones before
touching pagesets. However, __drain_pages() does not do so.

This may result in a NULL pointer dereference for pagesets in unpopulated
zones if a NUMA setup is combined with cpu hotplug.

Initially the unpopulated zone has the pcp pointers pointing to the boot
pagesets.  Since the zone is not populated the boot pageset pointers will
not be changed during page allocator and slab bootstrap.

If a cpu is later brought down (first call to __drain_pages()) then the pcp
pointers for cpus in unpopulated zones are set to NULL since __drain_pages
does not first check for an unpopulated zone.

If the cpu is then brought up again then we call process_zones() which will ignore
the unpopulated zone. So the pageset pointers will still be NULL.

If the cpu is then again brought down then __drain_pages will attempt to drain
pages by following the NULL pageset pointer for unpopulated zones.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/page_alloc.c |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.20-rc4/mm/page_alloc.c
===
--- linux-2.6.20-rc4.orig/mm/page_alloc.c
+++ linux-2.6.20-rc4/mm/page_alloc.c
@@ -714,6 +714,9 @@ static void __drain_pages(unsigned int c
 		if (!populated_zone(zone))
 			continue;
 
+		if (!populated_zone(zone))
+			continue;
+
 		pset = zone_pcp(zone, cpu);
 		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
 			struct per_cpu_pages *pcp;
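
To make the sequence in the patch description easier to follow, here is a
minimal user-space sketch of it. This is not kernel code: the struct
layouts, the three-zone setup, and the function first_failing_offline() are
assumptions made up for illustration, and the model folds the pointer reset
into the drain step exactly as the description words it. Only the names
populated_zone(), process_zones() and the drain step mirror the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 3

/* Stand-in for the kernel's per-cpu pageset (hypothetical layout). */
struct pageset { int count; };

/* Simplified zone: one CPU's pageset pointer only (hypothetical layout). */
struct zone {
    unsigned long present_pages;   /* 0 means the zone is unpopulated */
    struct pageset *pcp;           /* pageset pointer for the modelled CPU */
};

static struct pageset boot_pageset;            /* shared boot-time pageset */
static struct pageset real_pagesets[NR_ZONES]; /* per-zone runtime pagesets */

static bool populated_zone(const struct zone *z)
{
    return z->present_pages != 0;
}

/* Boot: every zone initially points at the boot pageset. */
static void boot_init(struct zone *zones)
{
    for (int i = 0; i < NR_ZONES; i++)
        zones[i].pcp = &boot_pageset;
}

/* CPU up: only populated zones get a real pageset; others are ignored. */
static void process_zones(struct zone *zones)
{
    for (int i = 0; i < NR_ZONES; i++) {
        if (!populated_zone(&zones[i]))
            continue;
        zones[i].pcp = &real_pagesets[i];
    }
}

/*
 * CPU down: drain each zone's pageset and reset the pointer to NULL.
 * Returns false where the kernel would follow a NULL pageset pointer.
 * check_populated models whether the patch's extra check is present.
 */
static bool drain_pages(struct zone *zones, bool check_populated)
{
    for (int i = 0; i < NR_ZONES; i++) {
        if (check_populated && !populated_zone(&zones[i]))
            continue;
        if (zones[i].pcp == NULL)
            return false;          /* NULL pointer dereference in the kernel */
        zones[i].pcp->count = 0;
        zones[i].pcp = NULL;
    }
    return true;
}

/*
 * Run `cycles` offline/online cycles. Returns the 1-based number of the
 * offline step that would hit a NULL pageset, or 0 if none does.
 */
int first_failing_offline(bool check_populated, int cycles)
{
    /* Zone 1 is unpopulated; the sizes are arbitrary illustration values. */
    struct zone zones[NR_ZONES] = { {1024, NULL}, {0, NULL}, {2048, NULL} };

    assert(cycles >= 0);
    boot_init(zones);
    process_zones(zones);          /* bootstrap: unpopulated zone keeps boot pageset */

    for (int c = 1; c <= cycles; c++) {
        if (!drain_pages(zones, check_populated))
            return c;
        process_zones(zones);      /* CPU brought back up */
    }
    return 0;
}
```

In this model, first_failing_offline(false, 5) returns 2: the first offline
resets the unpopulated zone's pointer, the online step skips that zone, and
the second offline follows the NULL pointer. With the check enabled,
first_failing_offline(true, 5) returns 0 because the unpopulated zone is
never touched.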


Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Nigel Cunningham
Hi.

On Mon, 2007-01-22 at 16:16 +1100, Jean-Marc Valin wrote:
> >> I just encountered the following oops and general protection fault
> >> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
> >> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
> >> relevant errors are below but the full dmesg log is at
> >> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
> >> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
> ...
> > It looks like something is stomping on memory it shouldn't be touching,
> > so I would suggest testing multiple cycles with a minimal (preferably
> > zero) number of modules loaded. If that looks good and reliable, add
> > modules & processes until you can say 'If I do X, it breaks.'. If having
> > a minimal number of modules loaded doesn't help, I would then suggest
> > reviewing your kernel config to see if other things can be built as
> > modules and the same logic applied. You can be reasonably sure that it
> > will be a device driver. Common causes of suspend/resume problems from
> > the list you give below are acpi modules, bluetooth and usb. I'd also be
> > consider pcmcia, drm and fuse possibilities. But again, go for unloading
> > everything possible in the first instance.
> 
> Actually, the reason I sent this is that when I showed the oops/gpf to
> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
> suspend to RAM now works ~95% of the time.

I agree that the second is cpu hotplug, but the first is something else,
hence my recommendations above.

Regards,

Nigel




Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
...
> It looks like something is stomping on memory it shouldn't be touching,
> so I would suggest testing multiple cycles with a minimal (preferably
> zero) number of modules loaded. If that looks good and reliable, add
> modules & processes until you can say 'If I do X, it breaks.'. If having
> a minimal number of modules loaded doesn't help, I would then suggest
> reviewing your kernel config to see if other things can be built as
> modules and the same logic applied. You can be reasonably sure that it
> will be a device driver. Common causes of suspend/resume problems from
> the list you give below are acpi modules, bluetooth and usb. I'd also be
> consider pcmcia, drm and fuse possibilities. But again, go for unloading
> everything possible in the first instance.

Actually, the reason I sent this is that when I showed the oops/gpf to
Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
suspend to RAM now works ~95% of the time.

Jean-Marc

> Regards,
> 
> Nigel
> 
>> Cheers,
>>
>>  Jean-Marc
>>
>> P.S. This is the same laptop I had at LCA for which Linus told me to
>> disable preemption and try the newest rc version.
>>
>> [10746.449071] Unable to handle kernel NULL pointer dereference at
>> 0038 RIP:
>> [10746.449080]  [] iput+0x18/0x80
>> [10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
>> [10746.449099] Oops:  [1] SMP
>> [10746.449104] CPU 0
>> [10746.449107] Modules linked in: psmouse battery ac thermal fan button
>> ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
>> ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
>> speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
>> cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
>> asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
>> parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
>> snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
>> pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
>> rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
>> ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
>> [10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
>> [10746.449196] RIP: 0010:[]  []
>> iput+0x18/0x80
>> [10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
>> [10746.449212] RAX:  RBX: 8103fcf0 RCX:
>> 8103fd20
>> [10746.449219] RDX: 0001 RSI: 0286 RDI:
>> 8103fcf0
>> [10746.449225] RBP: 0042 R08:  R09:
>> 
>> [10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:
>> 
>> [10746.449239] R13: 810075721c70 R14: 805fa940 R15:
>> 
>> [10746.449246] FS:  () GS:8058e000()
>> knlGS:
>> [10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
>> [10746.449259] CR2: 0038 CR3: 1207f000 CR4:
>> 06e0
>> [10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
>> task 810037a1b760)
>> [10746.449269] Stack:  811ce2f0 802ddaf8
>> 811ce3c0 811ce2f0
>> [10746.449280]  0042 8022f645 810037f2dd80
>> 0001cb60
>> [10746.449288]  0090 81007daa0e00 00d0
>> 802ddb49
>> [10746.449296] Call Trace:
>> [10746.449305]  [] prune_one_dentry+0x68/0xa0
>> [10746.449314]  [] prune_dcache+0x145/0x1e0
>> [10746.449323]  [] shrink_dcache_memory+0x19/0x50
>> [10746.449331]  [] shrink_slab+0x117/0x190
>> [10746.449342]  [] kswapd+0x382/0x4e0
>> [10746.449356]  [] autoremove_wake_function+0x0/0x30
>> [10746.449370]  [] kswapd+0x0/0x4e0
>> [10746.449376]  [] keventd_create_kthread+0x0/0x90
>> [10746.449383]  [] kthread+0xd9/0x120
>> [10746.449394]  [] child_rip+0xa/0x12
>> [10746.449401]  [] keventd_create_kthread+0x0/0x90
>> [10746.449414]  [] kthread+0x0/0x120
>> [10746.449421]  [] child_rip+0x0/0x12
>> [10746.449426]
>> [10746.449429]
>> [10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
>> 40 28 48
>> [10746.449449] RIP  [] iput+0x18/0x80
>> [10746.449456]  RSP 
>> [10746.449460] CR2: 0038
>> [10746.449463]  ACPI Exception (pci_bind-0299): AE_NOT_FOUND, Unable to
>> get data from device DCKS [20060707]
>>
>>
>> and later:
>>
>>
>> [3.668009] SMP alternatives: switching to SMP code
>> [3.668168] Booting 

Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Nigel Cunningham
Hi.

On Mon, 2007-01-22 at 13:34 +1100, Jean-Marc Valin wrote:
> Hi,
> 
> I just encountered the following oops and general protection fault
> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
> relevant errors are below but the full dmesg log is at
> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
> 
> This happens when I'm running 2.6.20-rc5. The previous kernel version I
> was using is 2.6.19-rc6 and was much more broken (second attempt
> *always* failed), so it's probably not a regression.

A second attempt always failing usually indicates that a driver was
dazed and confused after the first cycle and properly killed by the
second attempt, usually because of a lack of [proper] power management
code.

Between any two versions, some things can be fixed, some things can be
broken and some things can become broken in different ways, so your
different experience with 2.6.20-rc5 doesn't necessarily mean that this
is a different issue.

It looks like something is stomping on memory it shouldn't be touching,
so I would suggest testing multiple cycles with a minimal (preferably
zero) number of modules loaded. If that looks good and reliable, add
modules & processes until you can say 'If I do X, it breaks.'. If having
a minimal number of modules loaded doesn't help, I would then suggest
reviewing your kernel config to see if other things can be built as
modules and the same logic applied. You can be reasonably sure that it
will be a device driver. Common causes of suspend/resume problems from
the list you give below are acpi modules, bluetooth and usb. I'd also
consider pcmcia, drm and fuse possibilities. But again, go for unloading
everything possible in the first instance.
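
The add-modules-until-it-breaks procedure described above can be sketched roughly as follows. This is only an illustration, not something posted in the thread: `find_breaking_module` and the `cycle_ok` callback are hypothetical names, and `cycle_ok` is assumed to load the given modules, run one suspend/resume cycle by whatever means the machine uses, and report whether it came back cleanly. Because the failure is intermittent, each step is repeated several times.

```python
from typing import Callable, Iterable, List, Optional


def find_breaking_module(
    modules: Iterable[str],
    cycle_ok: Callable[[List[str]], bool],
    cycles: int = 5,
) -> Optional[str]:
    """Return the first module whose addition makes a cycle fail, or None.

    cycle_ok is a hypothetical callback supplied by the caller: it gets the
    list of modules that should be loaded and returns True if one
    suspend/resume cycle succeeded with exactly those modules.
    """
    loaded: List[str] = []

    # Baseline: the minimal (ideally empty) module set must be reliable first.
    for _ in range(cycles):
        if not cycle_ok(list(loaded)):
            raise RuntimeError("fails with the minimal module set")

    # Add one module at a time and repeat the test several times per step.
    for mod in modules:
        loaded.append(mod)
        for _ in range(cycles):
            if not cycle_ok(list(loaded)):
                return mod
    return None
```

If the baseline itself fails, the function raises instead of blaming a module, which corresponds to the case where built-in code has to be reviewed in the kernel config.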

Regards,

Nigel

> Cheers,
> 
>   Jean-Marc
> 
> P.S. This is the same laptop I had at LCA for which Linus told me to
> disable preemption and try the newest rc version.
> 
> [10746.449071] Unable to handle kernel NULL pointer dereference at
> 0038 RIP:
> [10746.449080]  [] iput+0x18/0x80
> [10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
> [10746.449099] Oops:  [1] SMP
> [10746.449104] CPU 0
> [10746.449107] Modules linked in: psmouse battery ac thermal fan button
> ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
> ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
> speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
> cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
> asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
> parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
> snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
> pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
> rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
> ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
> [10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
> [10746.449196] RIP: 0010:[]  []
> iput+0x18/0x80
> [10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
> [10746.449212] RAX:  RBX: 8103fcf0 RCX:
> 8103fd20
> [10746.449219] RDX: 0001 RSI: 0286 RDI:
> 8103fcf0
> [10746.449225] RBP: 0042 R08:  R09:
> 
> [10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:
> 
> [10746.449239] R13: 810075721c70 R14: 805fa940 R15:
> 
> [10746.449246] FS:  () GS:8058e000()
> knlGS:
> [10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> [10746.449259] CR2: 0038 CR3: 1207f000 CR4:
> 06e0
> [10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
> task 810037a1b760)
> [10746.449269] Stack:  811ce2f0 802ddaf8
> 811ce3c0 811ce2f0
> [10746.449280]  0042 8022f645 810037f2dd80
> 0001cb60
> [10746.449288]  0090 81007daa0e00 00d0
> 802ddb49
> [10746.449296] Call Trace:
> [10746.449305]  [] prune_one_dentry+0x68/0xa0
> [10746.449314]  [] prune_dcache+0x145/0x1e0
> [10746.449323]  [] shrink_dcache_memory+0x19/0x50
> [10746.449331]  [] shrink_slab+0x117/0x190
> [10746.449342]  [] kswapd+0x382/0x4e0
> [10746.449356]  [] autoremove_wake_function+0x0/0x30
> [10746.449370]  [] kswapd+0x0/0x4e0
> [10746.449376]  [] keventd_create_kthread+0x0/0x90
> [10746.449383]  [] kthread+0xd9/0x120
> [10746.449394]  [] child_rip+0xa/0x12
> [10746.449401]  [] keventd_create_kthread+0x0/0x90
> [10746.449414]  [] kthread+0x0/0x120
> [10746.449421]  [] child_rip+0x0/0x12
> [10746.449426]
> [10746.449429]
> [10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
> 

Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Jean-Marc Valin
Hi,

I just encountered the following oops and general protection fault
trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
relevant errors are below but the full dmesg log is at
http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
http://people.xiph.org/~jm/config-2.6.20-rc5.txt

This happens when I'm running 2.6.20-rc5. The previous kernel version I
was using is 2.6.19-rc6 and was much more broken (second attempt
*always* failed), so it's probably not a regression.

Cheers,

Jean-Marc

P.S. This is the same laptop I had at LCA for which Linus told me to
disable preemption and try the newest rc version.

[10746.449071] Unable to handle kernel NULL pointer dereference at
0038 RIP:
[10746.449080]  [8022b9c8] iput+0x18/0x80
[10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
[10746.449099] Oops:  [1] SMP
[10746.449104] CPU 0
[10746.449107] Modules linked in: psmouse battery ac thermal fan button
ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
[10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
[10746.449196] RIP: 0010:[8022b9c8]  [8022b9c8]
iput+0x18/0x80
[10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
[10746.449212] RAX:  RBX: 8103fcf0 RCX:
8103fd20
[10746.449219] RDX: 0001 RSI: 0286 RDI:
8103fcf0
[10746.449225] RBP: 0042 R08:  R09:

[10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:

[10746.449239] R13: 810075721c70 R14: 805fa940 R15:

[10746.449246] FS:  () GS:8058e000()
knlGS:
[10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[10746.449259] CR2: 0038 CR3: 1207f000 CR4:
06e0
[10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
task 810037a1b760)
[10746.449269] Stack:  811ce2f0 802ddaf8
811ce3c0 811ce2f0
[10746.449280]  0042 8022f645 810037f2dd80
0001cb60
[10746.449288]  0090 81007daa0e00 00d0
802ddb49
[10746.449296] Call Trace:
[10746.449305]  [802ddaf8] prune_one_dentry+0x68/0xa0
[10746.449314]  [8022f645] prune_dcache+0x145/0x1e0
[10746.449323]  [802ddb49] shrink_dcache_memory+0x19/0x50
[10746.449331]  [802418a7] shrink_slab+0x117/0x190
[10746.449342]  [8025a392] kswapd+0x382/0x4e0
[10746.449356]  [802a13b0] autoremove_wake_function+0x0/0x30
[10746.449370]  [8025a010] kswapd+0x0/0x4e0
[10746.449376]  [802a11d0] keventd_create_kthread+0x0/0x90
[10746.449383]  [802335a9] kthread+0xd9/0x120
[10746.449394]  [80260ec8] child_rip+0xa/0x12
[10746.449401]  [802a11d0] keventd_create_kthread+0x0/0x90
[10746.449414]  [802334d0] kthread+0x0/0x120
[10746.449421]  [80260ebe] child_rip+0x0/0x12
[10746.449426]
[10746.449429]
[10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
40 28 48
[10746.449449] RIP  [8022b9c8] iput+0x18/0x80
[10746.449456]  RSP 810037f2dd50
[10746.449460] CR2: 0038
[10746.449463]  ACPI Exception (pci_bind-0299): AE_NOT_FOUND, Unable to
get data from device DCKS [20060707]


and later:


[3.668009] SMP alternatives: switching to SMP code
[3.668168] Booting processor 1/2 APIC 0x1
[4.149691] Initializing CPU#1
[4.229595] Calibrating delay using timer specific routine.. 3990.32
BogoMIPS (lpj=7980654)
[4.229602] CPU: L1 I cache: 32K, L1 D cache: 32K
[4.229604] CPU: L2 cache: 4096K
[4.229606] CPU 1/1 -> Node 0
[4.229608] CPU: Physical Processor ID: 0
[4.229609] CPU: Processor Core ID: 1
[4.230107] Intel(R) Core(TM)2 CPU T7200  @ 2.00GHz stepping 06
[4.233607] CPU 1: Syncing TSC to CPU 0.
[3.762970] CPU 1: synchronized TSC with CPU 0 (last diff 0 cycles,
maxerr 960 cycles)
[3.764689] general protection fault:  [2] SMP
[3.764963] CPU 1
[3.764983] Modules linked in: psmouse battery ac thermal fan button
arc4 ecb blkcipher ieee80211_crypt_wep ieee80211_crypt binfmt_misc
rfcomm l2cap bluetooth i915 drm speedstep_centrino 

Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
...
> It looks like something is stomping on memory it shouldn't be touching,
> so I would suggest testing multiple cycles with a minimal (preferably
> zero) number of modules loaded. If that looks good and reliable, add
> modules & processes until you can say 'If I do X, it breaks.'. If having
> a minimal number of modules loaded doesn't help, I would then suggest
> reviewing your kernel config to see if other things can be built as
> modules and the same logic applied. You can be reasonably sure that it
> will be a device driver. Common causes of suspend/resume problems from
> the list you give below are acpi modules, bluetooth and usb. I'd also
> consider pcmcia, drm and fuse possibilities. But again, go for unloading
> everything possible in the first instance.

Actually, the reason I sent this is that when I showed the oops/gpf to
Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
suspend to RAM now works ~95% of the time.

Jean-Marc

> Regards,
>
> Nigel
>
>> Cheers,
>>
>>   Jean-Marc
>>
>> P.S. This is the same laptop I had at LCA for which Linus told me to
>> disable preemption and try the newest rc version.
>>
>> [10746.449071] Unable to handle kernel NULL pointer dereference at
>> 0038 RIP:
>> [10746.449080]  [8022b9c8] iput+0x18/0x80
>> [10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
>> [10746.449099] Oops:  [1] SMP
>> [10746.449104] CPU 0
>> [10746.449107] Modules linked in: psmouse battery ac thermal fan button
>> ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
>> ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
>> speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
>> cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
>> asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
>> parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
>> snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
>> pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
>> rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
>> ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
>> [10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
>> [10746.449196] RIP: 0010:[8022b9c8]  [8022b9c8]
>> iput+0x18/0x80
>> [10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
>> [10746.449212] RAX:  RBX: 8103fcf0 RCX:
>> 8103fd20
>> [10746.449219] RDX: 0001 RSI: 0286 RDI:
>> 8103fcf0
>> [10746.449225] RBP: 0042 R08:  R09:
>>
>> [10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:
>>
>> [10746.449239] R13: 810075721c70 R14: 805fa940 R15:
>>
>> [10746.449246] FS:  () GS:8058e000()
>> knlGS:
>> [10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
>> [10746.449259] CR2: 0038 CR3: 1207f000 CR4:
>> 06e0
>> [10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
>> task 810037a1b760)
>> [10746.449269] Stack:  811ce2f0 802ddaf8
>> 811ce3c0 811ce2f0
>> [10746.449280]  0042 8022f645 810037f2dd80
>> 0001cb60
>> [10746.449288]  0090 81007daa0e00 00d0
>> 802ddb49
>> [10746.449296] Call Trace:
>> [10746.449305]  [802ddaf8] prune_one_dentry+0x68/0xa0
>> [10746.449314]  [8022f645] prune_dcache+0x145/0x1e0
>> [10746.449323]  [802ddb49] shrink_dcache_memory+0x19/0x50
>> [10746.449331]  [802418a7] shrink_slab+0x117/0x190
>> [10746.449342]  [8025a392] kswapd+0x382/0x4e0
>> [10746.449356]  [802a13b0] autoremove_wake_function+0x0/0x30
>> [10746.449370]  [8025a010] kswapd+0x0/0x4e0
>> [10746.449376]  [802a11d0] keventd_create_kthread+0x0/0x90
>> [10746.449383]  [802335a9] kthread+0xd9/0x120
>> [10746.449394]  [80260ec8] child_rip+0xa/0x12
>> [10746.449401]  [802a11d0] keventd_create_kthread+0x0/0x90
>> [10746.449414]  [802334d0] kthread+0x0/0x120
>> [10746.449421]  [80260ebe] child_rip+0x0/0x12
>> [10746.449426]
>> [10746.449429]
>> [10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
>> 40 28 48
>> [10746.449449] RIP  [8022b9c8] iput+0x18/0x80
>> [10746.449456]  RSP 810037f2dd50
>> [10746.449460] CR2: 0038
>> [10746.449463]  ACPI Exception (pci_bind-0299): AE_NOT_FOUND, Unable to
>> get data from device DCKS [20060707]



Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Nigel Cunningham
Hi.

On Mon, 2007-01-22 at 16:16 +1100, Jean-Marc Valin wrote:
> >> I just encountered the following oops and general protection fault
> >> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
> >> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
> >> relevant errors are below but the full dmesg log is at
> >> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
> >> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
> ...
> > It looks like something is stomping on memory it shouldn't be touching,
> > so I would suggest testing multiple cycles with a minimal (preferably
> > zero) number of modules loaded. If that looks good and reliable, add
> > modules & processes until you can say 'If I do X, it breaks.'. If having
> > a minimal number of modules loaded doesn't help, I would then suggest
> > reviewing your kernel config to see if other things can be built as
> > modules and the same logic applied. You can be reasonably sure that it
> > will be a device driver. Common causes of suspend/resume problems from
> > the list you give below are acpi modules, bluetooth and usb. I'd also
> > consider pcmcia, drm and fuse possibilities. But again, go for unloading
> > everything possible in the first instance.
>
> Actually, the reason I sent this is that when I showed the oops/gpf to
> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
> suspend to RAM now works ~95% of the time.

I agree that the second is cpu hotplug, but the first is something else,
hence my recommendations above.
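
One way to look at the CPU hotplug part separately from suspend might be to take the second CPU offline and online repeatedly through sysfs and watch dmesg for the general protection fault. The sketch below is an assumption on top of the thread, not something suggested in it: `cycle_cpu_hotplug` is a hypothetical helper, it relies on the standard `/sys/devices/system/cpu/cpuN/online` file (present when the kernel has CPU hotplug support), and it needs root on a real machine.

```python
import os


def cycle_cpu_hotplug(cpu: int, rounds: int,
                      sysfs_root: str = "/sys/devices/system/cpu") -> int:
    """Take one CPU offline and back online `rounds` times.

    Writes "0" and then "1" to <sysfs_root>/cpu<cpu>/online and returns the
    number of completed rounds. sysfs_root is a parameter only so the
    function can be pointed at a scratch directory for a dry run.
    """
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    done = 0
    for _ in range(rounds):
        for value in ("0", "1"):
            # Each write is a separate open/close, like `echo 0 > .../online`.
            with open(path, "w") as f:
                f.write(value)
        done += 1
    return done
```

For example, `cycle_cpu_hotplug(1, 50)` would exercise CPU 1 fifty times. If the fault shows up this way without any suspend involved, that would support treating the two reports as separate problems.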

Regards,

Nigel

