Re: AMD graphics performance regression in 4.15 and later

2018-04-11 Thread Jean-Marc Valin
On 04/11/2018 05:37 AM, Christian König wrote:
>> With your patches my EPYC box is unusable with  4.15++ kernels.
>> The whole Desktop is acting weird.  This one is using
>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>
>> Box is  2 * EPYC 7281 with 128 GB ECC RAM
>>
>> Also a 14C Xeon box with a HD7700 is broken same way.
> 
> The hardware is irrelevant for this. We need to know what software stack
> you use on top of it.

Well, the hardware appears to be part of the issue too. I don't think
it's a coincidence that Gabriel has the problem on 2xEPYC, I have it on
2xXeon and the previous reporter had it on a Core 2 Quad that internally
has two dies.

I've not tested your disable CONFIG_SWIOTLB fix yet -- might try it
over the weekend and report what happens.

Cheers,

Jean-Marc


Re: AMD graphics performance regression in 4.15 and later

2018-04-09 Thread Jean-Marc Valin
On 04/09/2018 05:42 AM, Christian König wrote:
> Backporting all the detection logic is too invasive, but you could just
> go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully use the
> other code path.
> 
> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.

Do you mean just taking the 4.15 code as is and replacing
"#ifdef CONFIG_SWIOTLB" with "#if 0" in
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c or are you talking about using a
different version of drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c ?

Jean-Marc
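
[Editor's sketch] If the first reading of the question above is the intended one (keep the 4.15 source and turn each check into "#if 0"), the edit could be scripted roughly as below. This is only an illustration: the function name and the sample fragment are made up and are not taken from amdgpu_ttm.c, and the thread does not show which reading Christian confirmed.

```python
import re
from typing import Tuple

# Matches a preprocessor line "#ifdef CONFIG_SWIOTLB", allowing whitespace
# around the "#" and anything (e.g. a comment) after the symbol.
_SWIOTLB_IFDEF = re.compile(
    r"^([ \t]*)#[ \t]*ifdef[ \t]+CONFIG_SWIOTLB\b.*$", re.MULTILINE
)


def disable_swiotlb_ifdefs(source: str) -> Tuple[str, int]:
    """Replace each '#ifdef CONFIG_SWIOTLB' line with '#if 0'.

    Returns the patched text and the number of lines that were changed.
    Matching '#else' / '#endif' lines are left alone, so the non-SWIOTLB
    branch is what ends up being compiled.
    """
    return _SWIOTLB_IFDEF.subn(r"\1#if 0 /* was: #ifdef CONFIG_SWIOTLB */", source)


# Hypothetical fragment for illustration only (not real driver code).
example = """\
#ifdef CONFIG_SWIOTLB
    use_dma_pool();
#endif
    use_regular_pool();
"""

patched, count = disable_swiotlb_ifdefs(example)
print(count)
print(patched)
```

Reviewing the resulting diff by hand before building is still advisable, since a regex cannot tell whether a given check is one that is safe to disable.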


Re: AMD graphics performance regression in 4.15 and later

2018-04-06 Thread Jean-Marc Valin
Hi Christian,

Thanks for the info. FYI, I've also opened a Firefox bug for that at:
https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
Feel free to comment since you have a better understanding of what's
going on.

One last question: right now I'm running 4.15.0 with the "offending"
patch reverted. Is that safe to run or are there possible bad
interactions with other changes?

Cheers,

Jean-Marc

On 04/06/2018 01:20 PM, Christian König wrote:
> Am 06.04.2018 um 18:42 schrieb Jean-Marc Valin:
>> Hi Christian,
>>
>> On 04/09/2018 07:48 AM, Christian König wrote:
>>> Am 06.04.2018 um 17:30 schrieb Jean-Marc Valin:
>>>> Hi Christian,
>>>>
>>>> Is there a way to turn off these huge pages at boot-time/run-time?
>>> Only at compile time by not setting CONFIG_TRANSPARENT_HUGEPAGE.
>> Any reason why
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> doesn't solve the problem?
> 
> Because we unfortunately try to allocate huge pages anyway, we
> unfortunately just fail in 100% of all cases.
> 
> That basically gives you both, the extra allocation overhead and the
> still bad throughput.
> 
>> Also, I assume that disabling CONFIG_TRANSPARENT_HUGEPAGE will disable
>> them for everything and not just what your patch added, right?
> 
> Correct, that's why I wrote that disabling SWIOTLBs might be better.
> 
>>>> I'm not sure what you mean by "We mitigated the problem by avoiding the
>>>> slow coherent DMA code path on almost all platforms on newer
>>>> kernels". I
>>>> tested up to 4.16 and the performance regression is just as bad as
>>>> it is
>>>> for 4.15.
>>> Indeed 4.16 still doesn't have that. You could use the
>>> amd-staging-drm-next branch or wait for 4.17.
>> Is there a way to pull just that change or are there too many
>> interactions with other changes?
> 
> It adds a new detection if memory allocation needs to be coherent or
> not, that is not something you can easily pull into older versions.
> 
>>> That isn't related to the GFX hardware, but to your CPU/motherboard and
>>> whatever else you have in the system.
>> Well, I have an nvidia GPU in the same system (normally only used for
>> CUDA) and if I use it instead of my RX 560 then I'm not seeing any
>> performance issue with 4.15.
> 
> That's because you are probably using the Nvidia binary driver which has
> a completely separate code base.
> 
>>> Some part of your system needs SWIOTLB and that makes allocating memory
>>> much slower.
>> What would that part be? FTR, I have a complete description of my system
>> at https://jmvalin.dreamwidth.org/15583.html
>>
>> I don't know if it's related, but I can maybe see one thing in common
>> between my machine and the Core 2 Quad from the other bug report and
>> that's the "NUMA part". I have a dual-socket Xeon and (AFAIK) the Core 2
>> Quad is made of two two-core CPUs glued together with little
>> communication between them.
> 
> Yeah, that is probably the reason.
> 
>>> Intel doesn't use TTM because they don't have dedicated VRAM, but the
>>> open source nvidia driver should be affected as well.
>> I'm using the proprietary nvidia driver (because CUDA). Is that supposed
>> to be affected as well?
> 
> No.
> 
>>> We already mitigated that problem and I don't see any solution which
>>> will arrive faster than 4.17.
>> Is that supposed to make the slowdown unnoticeable or just slightly
>> better?
> 
> It completely goes away. The issue with the coherent path is that it
> tries to always allocate the lowest possible memory to make sure that it
> fits into the DMA constraints of all devices in the system.
> 
> But since AMD GPU can handle 40 bits of addresses you would need at least
> 1TB of memory in the system to trigger that (or a NUMA where some system
> is low and some in a high area).
> 
> Christian.
> 
>>> The only quick workaround I can see is to avoid firefox, chrome for
>>> example is reported to work perfectly fine.
>> Or use an unaffected GPU/driver ;-)
>>
>> Cheers,
>>
>> Jean-Marc
>>
> 
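
[Editor's sketch] The "40 bits / 1 TB" figure in Christian's reply above is plain address arithmetic: 2^40 bytes is 1 TiB. The snippet below works that out and checks whether an address is reachable for a given address width. It is a simplified model with made-up helper names, not the kernel's DMA mask logic, and it assumes RAM is laid out contiguously from address 0, which real machines (NUMA ones in particular) need not do.

```python
GIB = 1 << 30
TIB = 1 << 40


def dma_limit_bytes(address_bits: int) -> int:
    """Number of bytes addressable with the given number of address bits."""
    return 1 << address_bits


def fits_dma_mask(physical_address: int, address_bits: int) -> bool:
    """True if a device with that address width can reach the address."""
    return 0 <= physical_address < dma_limit_bytes(address_bits)


# 40 address bits cover exactly 1 TiB.
print(dma_limit_bytes(40) == TIB)

# Highest byte of 128 GiB of RAM, assuming a contiguous layout from 0:
top_of_ram = 128 * GIB - 1
print(fits_dma_mask(top_of_ram, 40))  # within reach of a 40-bit device
print(fits_dma_mask(top_of_ram, 32))  # out of reach of a 32-bit-only device
```

Under that model, a 40-bit device never needs memory placed specially low on a 128 GB machine, which matches the explanation that the slow path is only needed with about 1 TB of RAM or when part of the memory sits at high addresses.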


Re: AMD graphics performance regression in 4.15 and later

2018-04-06 Thread Jean-Marc Valin
Hi Christian,

On 04/09/2018 07:48 AM, Christian König wrote:
> Am 06.04.2018 um 17:30 schrieb Jean-Marc Valin:
>> Hi Christian,
>>
>> Is there a way to turn off these huge pages at boot-time/run-time?
> 
> Only at compile time by not setting CONFIG_TRANSPARENT_HUGEPAGE.

Any reason why
echo never > /sys/kernel/mm/transparent_hugepage/enabled
doesn't solve the problem?

Also, I assume that disabling CONFIG_TRANSPARENT_HUGEPAGE will disable
them for everything and not just what your patch added, right?

>> I'm not sure what you mean by "We mitigated the problem by avoiding the
>> slow coherent DMA code path on almost all platforms on newer kernels". I
>> tested up to 4.16 and the performance regression is just as bad as it is
>> for 4.15.
> 
> Indeed 4.16 still doesn't have that. You could use the
> amd-staging-drm-next branch or wait for 4.17.

Is there a way to pull just that change or are there too many
interactions with other changes?

> That isn't related to the GFX hardware, but to your CPU/motherboard and
> whatever else you have in the system.

Well, I have an nvidia GPU in the same system (normally only used for
CUDA) and if I use it instead of my RX 560 then I'm not seeing any
performance issue with 4.15.

> Some part of your system needs SWIOTLB and that makes allocating memory
> much slower.

What would that part be? FTR, I have a complete description of my system
at https://jmvalin.dreamwidth.org/15583.html

I don't know if it's related, but I can maybe see one thing in common
between my machine and the Core 2 Quad from the other bug report and
that's the "NUMA part". I have a dual-socket Xeon and (AFAIK) the Core 2
Quad is made of two two-core CPUs glued together with little
communication between them.

> Intel doesn't use TTM because they don't have dedicated VRAM, but the
> open source nvidia driver should be affected as well.

I'm using the proprietary nvidia driver (because CUDA). Is that supposed
to be affected as well?

> We already mitigated that problem and I don't see any solution which
> will arrive faster than 4.17.

Is that supposed to make the slowdown unnoticeable or just slightly better?

> The only quick workaround I can see is to avoid firefox, chrome for
> example is reported to work perfectly fine.

Or use an unaffected GPU/driver ;-)

Cheers,

Jean-Marc
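
[Editor's sketch] For anyone checking the runtime setting mentioned above: the sysfs file lists all modes on one line and marks the active one in square brackets, for example "always [madvise] never". A small reader could look like the following; the function names are illustrative. Note that, per Christian's reply elsewhere in this thread, setting "never" does not stop TTM from attempting huge page allocations on the affected kernels, so this only reports the setting and does not work around the regression.

```python
from pathlib import Path
from typing import Optional

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the sysfs line."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode marked in: %r" % text)


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Read the active mode, or None if the file cannot be read
    (for example on a kernel built without CONFIG_TRANSPARENT_HUGEPAGE)."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_thp_mode())
```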



Re: AMD graphics performance regression in 4.15 and later

2018-04-06 Thread Jean-Marc Valin
Hi Christian,

Is there a way to turn off these huge pages at boot-time/run-time? Right
now the recent kernels are making Firefox pretty much unusable for me.
I've been able to revert the patch from 4.15 but it's not really a
long-term solution.

You mention that the purpose of the patch is to improve performance, but
I haven't actually noticed anything running faster on my system. Is
there any particular test where I'm supposed to see an improvement
compared to 4.14?

I'm not sure what you mean by "We mitigated the problem by avoiding the
slow coherent DMA code path on almost all platforms on newer kernels". I
tested up to 4.16 and the performance regression is just as bad as it is
for 4.15.

Unlike the older hardware reported on kernel bug 198511, the hardware I
have is quite recent (RX 560) and still being sold. I've also confirmed
that neither nvidia (on the same machine) nor intel GPUs (on a less
powerful machine) are affected, so it seems like there's a way to avoid
that slow performance. I'm not saying that what Firefox is doing is
ideal (I don't know what it does and why), but it still seems like
something that should still be avoided in the kernel.

Cheers,

Jean-Marc

On 04/06/2018 04:03 AM, Christian König wrote:
> Hi Jean,
> 
> yeah, that is a known problem. Using huge pages improves the performance
> because of better TLB usage, but at the cost of higher allocation
> overhead.
> 
> What we found is that firefox is doing something rather strange by
> allocating large textures and then just throwing them away again
> immediately.
> 
> We mitigated the problem by avoiding the slow coherent DMA code path on
> almost all platforms on newer kernels, but essentially somebody needs to
> figure out why firefox and/or the user space stack is doing this
> constant allocation/freeing of memory.
> 
> There is also a bug tracker on bugs.kernel.org about this, but I can't
> find it any more offhand.
> 
> Regards,
> Christian.
> 
> Am 06.04.2018 um 02:30 schrieb Jean-Marc Valin:
>> Hi,
>>
>> I noticed a serious graphics performance regression between 4.14 and
>> 4.15. It is most noticeable with Firefox (tried FF57 through FF60) and
>> causes scrolling to be really choppy/sluggish. I've confirmed that the
>> problem is also there on 4.16, while 4.13 works fine.
>>
>> After a bisection, I've narrowed the regression down to this commit:
>>
>> commit 648bc3574716400acc06f99915815f80d9563783
>> Author: Christian König <christian.koe...@amd.com>
>> Date:   Thu Jul 6 09:59:43 2017 +0200
>>
>>  drm/ttm: add transparent huge page support for DMA allocations v2
>>
>>
>> Some details about my system:
>> Distro: Fedora 27 (up-to-date)
>> Video: MSI Radeon RX 560 AERO
>> CPU: Dual-socket Xeon E5-2640 v4 (20 cores total)
>> RAM: 128 GB ECC
>>
>>
>> As a comparison, when running Firefox with 4.15 on a Lenovo W540 laptop
>> (with Intel graphics only) the responsiveness is much better than what
>> I'm getting on the Xeon machine above with the Radeon card, so this
>> really seems to be an AMD-only issue.
>>
>> Any way to fix the issue?
>>
>> Thanks,
>>
>> Jean-Marc
>> ___
>> dri-devel mailing list
>> dri-de...@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 


AMD graphics performance regression in 4.15 and later

2018-04-05 Thread Jean-Marc Valin
Hi,

I noticed a serious graphics performance regression between 4.14 and
4.15. It is most noticeable with Firefox (tried FF57 through FF60) and
causes scrolling to be really choppy/sluggish. I've confirmed that the
problem is also there on 4.16, while 4.13 works fine.

After a bisection, I've narrowed the regression down to this commit:

commit 648bc3574716400acc06f99915815f80d9563783
Author: Christian König 
Date:   Thu Jul 6 09:59:43 2017 +0200

drm/ttm: add transparent huge page support for DMA allocations v2


Some details about my system:
Distro: Fedora 27 (up-to-date)
Video: MSI Radeon RX 560 AERO
CPU: Dual-socket Xeon E5-2640 v4 (20 cores total)
RAM: 128 GB ECC


As a comparison, when running Firefox with 4.15 on a Lenovo W540 laptop
(with Intel graphics only) the responsiveness is much better than what
I'm getting on the Xeon machine above with the Radeon card, so this
really seems to be an AMD-only issue.

Any way to fix the issue?

Thanks,

Jean-Marc
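
[Editor's sketch] The bisection mentioned in this report is a binary search over the commit history between a known-good and a known-bad kernel, which is what "git bisect" automates. The following is a minimal model of that search. The commit names and the is_bad test are placeholders; in practice the test means building and booting a kernel and checking whether scrolling is sluggish.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit.

    'commits' is ordered oldest to newest; commits[0] is assumed to be
    known good and commits[-1] known bad, with a single good-to-bad
    transition somewhere in between.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Placeholder history: c0 (good) ... c9 (bad), regression introduced in c6.
history = ["c%d" % i for i in range(10)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 6)
print(culprit)
```

Each step halves the remaining range, so even the several thousand commits between two kernel releases need only a dozen or so test builds.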


Re: Suspend to RAM generates oops and general protection fault

2007-03-23 Thread Jean-Marc Valin
Hi,

Sorry I haven't replied recently about that bug, but I have to admit I
have no idea where to start. There actually seems to be much more
fundamental problems with the kernel on my machines. I initially
realised that even without using suspend to RAM, I was still getting
crashes when docking. So I stopped docking and realised my machine would
sometimes just crash when I plug/unplug the AC adaptor. Just to give an
idea, I've experienced about 10-15 crashes in the past two months -- I
don't think I've even done a single clean shutdown during that period.

To make things worse, the behaviour is always different. Sometimes I get
a panic with keyboard LEDs flashing. Sometimes I get nothing at all and
the machine is just frozen (doesn't respond to pings or to Alt-SysRq
commands). Sometimes, I just lose my keyboard and/or mouse but the
machine stays up. I'm running a vanilla 2.6.20 kernel (not tainted) with
the following configuration: http://jmspeex.livejournal.com/1090.html

Jean-Marc


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Suspend to RAM generates oops and general protection fault

2007-01-23 Thread Jean-Marc Valin
Luming Yu wrote:
> what about removing psmouse module?

Trying that now. Any particular reason you suspect that one?

Jean-Marc

> On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:
>> >>> will be a device driver. Common causes of suspend/resume problems from
>> >>> the list you give below are acpi modules, bluetooth and usb. I'd also
>> >>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>> >>> everything possible in the first instance.
>> >> Actually, the reason I sent this is that when I showed the oops/gpf to
>> >> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> >> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> >> suspend to RAM now works ~95% of the time.
>> >
>> > Try a kernel without CONFIG_SMP... that will verify if it is SMP
>> > related.
>>
>> Well, this happens to be my main work machine, which I'm not willing to
>> have running at half speed for several weeks. Anything else you can
>> suggest?
>>
>> Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>>> will be a device driver. Common causes of suspend/resume problems from
>>> the list you give below are acpi modules, bluetooth and usb. I'd also
>>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>>> everything possible in the first instance.
>> Actually, the reason I sent this is that when I showed the oops/gpf to
>> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> suspend to RAM now works ~95% of the time.
> 
> Try a kernel without CONFIG_SMP... that will verify if it is SMP
> related.

Well, this happens to be my main work machine, which I'm not willing to
have running at half speed for several weeks. Anything else you can suggest?

Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
>>
>> This happens when I'm running 2.6.20-rc5. The previous kernel version I
>> was using is 2.6.19-rc6 and was much more broken (second attempt
>> *always* failed), so it's probably not a regression.
> 
> This is a shot against the odds, but could you please check if the attached
> patch has any effect?

Thanks, I'll try that. It may take a while because the problem only
happened once in dozens of suspend/resume cycles.

Jean-Marc

> Rafael
> 
> 
> 
> 
> 
> 
> Both process_zones() and drain_node_pages() check for populated zones before
> touching pagesets. However, __drain_pages does not do so.
> 
> This may result in a NULL pointer dereference for pagesets in unpopulated
> zones if a NUMA setup is combined with cpu hotplug.
> 
> Initially the unpopulated zone has the pcp pointers pointing to the boot
> pagesets.  Since the zone is not populated the boot pageset pointers will
> not be changed during page allocator and slab bootstrap.
> 
> If a cpu is later brought down (first call to __drain_pages()) then the pcp
> pointers for cpus in unpopulated zones are set to NULL since __drain_pages
> does not first check for an unpopulated zone.
> 
> If the cpu is then brought up again then we call process_zones() which will
> ignore the unpopulated zone. So the pageset pointers will still be NULL.
> 
> If the cpu is then again brought down then __drain_pages will attempt to drain
> pages by following the NULL pageset pointer for unpopulated zones.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> ---
>  mm/page_alloc.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6.20-rc4/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.20-rc4.orig/mm/page_alloc.c
> +++ linux-2.6.20-rc4/mm/page_alloc.c
> @@ -714,6 +714,9 @@ static void __drain_pages(unsigned int c
>  		if (!populated_zone(zone))
>  			continue;
>  
> +		if (!populated_zone(zone))
> +			continue;
> +
>  		pset = zone_pcp(zone, cpu);
>  		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
>  			struct per_cpu_pages *pcp;
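
The sequence in the quoted patch description can be replayed in a small user-space model. This is only an illustrative sketch under stated assumptions, not kernel code: the toy_* names, the single-zone, single-CPU simplification and the with_check flag are all made up here, and the model clears the pointer on CPU-down only because that is what the description says happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: one zone and one CPU's pageset pointer. */
struct toy_pageset { int count; };

struct toy_zone {
    bool populated;               /* stands in for populated_zone(zone) */
    struct toy_pageset *pageset;  /* stands in for zone_pcp(zone, cpu) */
};

static struct toy_pageset boot_pageset;    /* shared boot-time pageset */
static struct toy_pageset dynamic_pageset; /* stands in for a per-CPU allocation */

/* Boot: every zone starts out pointing at the boot pageset. */
void toy_zone_init(struct toy_zone *z, bool populated)
{
    z->populated = populated;
    z->pageset = &boot_pageset;
}

/* CPU up: like process_zones() in the description, skip unpopulated zones. */
void toy_cpu_up(struct toy_zone *z)
{
    if (!z->populated)
        return;
    z->pageset = &dynamic_pageset;
}

/*
 * CPU down: drain the pageset, then clear the pointer.
 * Returns false at the point where the real code would follow a NULL pointer.
 * with_check selects whether the populated-zone test from the patch is present.
 */
bool toy_cpu_down(struct toy_zone *z, bool with_check)
{
    if (with_check && !z->populated)
        return true;              /* the check the patch adds */
    if (z->pageset == NULL)
        return false;             /* this is where the oops would be */
    z->pageset->count = 0;        /* "drain" */
    z->pageset = NULL;
    return true;
}
```

Calling toy_zone_init() for an unpopulated zone and then down, up, down without the check reaches the NULL case on the second down; with the check enabled every step returns true and the pointer is never cleared.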



Re: Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
...
> It looks like something is stomping on memory it shouldn't be touching,
> so I would suggest testing multiple cycles with a minimal (preferably
> zero) number of modules loaded. If that looks good and reliable, add
> modules & processes until you can say 'If I do X, it breaks.'. If having
> a minimal number of modules loaded doesn't help, I would then suggest
> reviewing your kernel config to see if other things can be built as
> modules and the same logic applied. You can be reasonably sure that it
> will be a device driver. Common causes of suspend/resume problems from
> the list you give below are acpi modules, bluetooth and usb. I'd also
> consider pcmcia, drm and fuse possibilities. But again, go for unloading
> everything possible in the first instance.

Actually, the reason I sent this is that when I showed the oops/gpf to
Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
suspend to RAM now works ~95% of the time.

Jean-Marc

> Regards,
> 
> Nigel
> 
>> Cheers,
>>
>>  Jean-Marc
>>
>> P.S. This is the same laptop I had at LCA for which Linus told me to
>> disable preemption and try the newest rc version.
>>
>> [10746.449071] Unable to handle kernel NULL pointer dereference at
>> 0038 RIP:
>> [10746.449080]  [] iput+0x18/0x80
>> [10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
>> [10746.449099] Oops:  [1] SMP
>> [10746.449104] CPU 0
>> [10746.449107] Modules linked in: psmouse battery ac thermal fan button
>> ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
>> ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
>> speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
>> cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
>> asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
>> parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
>> snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
>> pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
>> rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
>> ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
>> [10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
>> [10746.449196] RIP: 0010:[]  []
>> iput+0x18/0x80
>> [10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
>> [10746.449212] RAX:  RBX: 8103fcf0 RCX:
>> 8103fd20
>> [10746.449219] RDX: 0001 RSI: 0286 RDI:
>> 8103fcf0
>> [10746.449225] RBP: 0042 R08:  R09:
>> 
>> [10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:
>> 
>> [10746.449239] R13: 810075721c70 R14: 805fa940 R15:
>> 
>> [10746.449246] FS:  () GS:8058e000()
>> knlGS:
>> [10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
>> [10746.449259] CR2: 0038 CR3: 1207f000 CR4:
>> 06e0
>> [10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
>> task 810037a1b760)
>> [10746.449269] Stack:  811ce2f0 802ddaf8
>> 811ce3c0 811ce2f0
>> [10746.449280]  0042 8022f645 810037f2dd80
>> 0001cb60
>> [10746.449288]  0090 81007daa0e00 00d0
>> 802ddb49
>> [10746.449296] Call Trace:
>> [10746.449305]  [] prune_one_dentry+0x68/0xa0
>> [10746.449314]  [] prune_dcache+0x145/0x1e0
>> [10746.449323]  [] shrink_dcache_memory+0x19/0x50
>> [10746.449331]  [] shrink_slab+0x117/0x190
>> [10746.449342]  [] kswapd+0x382/0x4e0
>> [10746.449356]  [] autoremove_wake_function+0x0/0x30
>> [10746.449370]  [] kswapd+0x0/0x4e0
>> [10746.449376]  [] keventd_create_kthread+0x0/0x90
>> [10746.449383]  [] kthread+0xd9/0x120
>> [10746.449394]  [] child_rip+0xa/0x12
>> [10746.449401]  [] keventd_create_kthread+0x0/0x90
>> [10746.449414]  [] kthread+0x0/0x120
>> [10746.449421]  [] child_rip+0x0/0x12
>> [10746.449426]
>> [10746.449429]
>> [10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
>> 40 28 48
>> [10746.449449] RIP  [] iput+0x18/0x80
>> [10746.449456]  RSP 
>> [10746.449460] CR2: 0038
>> [10746.449463]  ACPI Exception (pci_bind-0299): AE_NOT_FOUND, Unable to
>> get data from device DCKS [20060707]
>>
>>
>> and later:
>>
>>
>> [3.668009] SMP alternatives: switching to SMP code
>> [3.668168] Booting 

Suspend to RAM generates oops and general protection fault

2007-01-21 Thread Jean-Marc Valin
Hi,

I just encountered the following oops and general protection fault
trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
relevant errors are below but the full dmesg log is at
http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
http://people.xiph.org/~jm/config-2.6.20-rc5.txt

This happens when I'm running 2.6.20-rc5. The previous kernel version I
was using is 2.6.19-rc6 and was much more broken (second attempt
*always* failed), so it's probably not a regression.

Cheers,

Jean-Marc

P.S. This is the same laptop I had at LCA for which Linus told me to
disable preemption and try the newest rc version.

[10746.449071] Unable to handle kernel NULL pointer dereference at
0038 RIP:
[10746.449080]  [8022b9c8] iput+0x18/0x80
[10746.449092] PGD 3a607067 PUD 27b20067 PMD 0
[10746.449099] Oops:  [1] SMP
[10746.449104] CPU 0
[10746.449107] Modules linked in: psmouse battery ac thermal fan button
ipw3945 ieee80211 tg3 arc4 ecb blkcipher ieee80211_crypt_wep
ieee80211_crypt binfmt_misc rfcomm l2cap bluetooth i915 drm
speedstep_centrino cpufreq_userspace cpufreq_powersave cpufreq_ondemand
cpufreq_stats freq_table cpufreq_conservative video sbs i2c_ec dock
asus_acpi backlight container ipv6 fuse sbp2 af_packet parport_pc lp
parport sg sr_mod cdrom snd_hda_intel snd_hda_codec tsdev snd_pcm_oss
snd_mixer_oss pcmcia snd_pcm snd_timer ata_generic snd shpchp
pci_hotplug soundcore snd_page_alloc serio_raw yenta_socket
rsrc_nonstatic pcmcia_core pcspkr evdev ext3 jbd mbcache ohci1394
ehci_hcd ieee1394 ide_generic uhci_hcd usbcore generic sd_mod processor
[10746.449190] Pid: 218, comm: kswapd0 Not tainted 2.6.20-rc5-x86-64 #1
[10746.449196] RIP: 0010:[8022b9c8]  [8022b9c8]
iput+0x18/0x80
[10746.449206] RSP: :810037f2dd50  EFLAGS: 00010283
[10746.449212] RAX:  RBX: 8103fcf0 RCX:
8103fd20
[10746.449219] RDX: 0001 RSI: 0286 RDI:
8103fcf0
[10746.449225] RBP: 0042 R08:  R09:

[10746.449232] R10: 28f5c28f5c28f5c3 R11: 8023ae90 R12:

[10746.449239] R13: 810075721c70 R14: 805fa940 R15:

[10746.449246] FS:  () GS:8058e000()
knlGS:
[10746.449253] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[10746.449259] CR2: 0038 CR3: 1207f000 CR4:
06e0
[10746.449265] Process kswapd0 (pid: 218, threadinfo 810037f2c000,
task 810037a1b760)
[10746.449269] Stack:  811ce2f0 802ddaf8
811ce3c0 811ce2f0
[10746.449280]  0042 8022f645 810037f2dd80
0001cb60
[10746.449288]  0090 81007daa0e00 00d0
802ddb49
[10746.449296] Call Trace:
[10746.449305]  [802ddaf8] prune_one_dentry+0x68/0xa0
[10746.449314]  [8022f645] prune_dcache+0x145/0x1e0
[10746.449323]  [802ddb49] shrink_dcache_memory+0x19/0x50
[10746.449331]  [802418a7] shrink_slab+0x117/0x190
[10746.449342]  [8025a392] kswapd+0x382/0x4e0
[10746.449356]  [802a13b0] autoremove_wake_function+0x0/0x30
[10746.449370]  [8025a010] kswapd+0x0/0x4e0
[10746.449376]  [802a11d0] keventd_create_kthread+0x0/0x90
[10746.449383]  [802335a9] kthread+0xd9/0x120
[10746.449394]  [80260ec8] child_rip+0xa/0x12
[10746.449401]  [802a11d0] keventd_create_kthread+0x0/0x90
[10746.449414]  [802334d0] kthread+0x0/0x120
[10746.449421]  [80260ebe] child_rip+0x0/0x12
[10746.449426]
[10746.449429]
[10746.449430] Code: 48 8b 40 38 75 04 0f 0b eb fe 48 85 c0 74 0b 48 8b
40 28 48
[10746.449449] RIP  [8022b9c8] iput+0x18/0x80
[10746.449456]  RSP 810037f2dd50
[10746.449460] CR2: 0038
[10746.449463]  ACPI Exception (pci_bind-0299): AE_NOT_FOUND, Unable to
get data from device DCKS [20060707]


and later:


[3.668009] SMP alternatives: switching to SMP code
[3.668168] Booting processor 1/2 APIC 0x1
[4.149691] Initializing CPU#1
[4.229595] Calibrating delay using timer specific routine.. 3990.32
BogoMIPS (lpj=7980654)
[4.229602] CPU: L1 I cache: 32K, L1 D cache: 32K
[4.229604] CPU: L2 cache: 4096K
[4.229606] CPU 1/1 -> Node 0
[4.229608] CPU: Physical Processor ID: 0
[4.229609] CPU: Processor Core ID: 1
[4.230107] Intel(R) Core(TM)2 CPU T7200  @ 2.00GHz stepping 06
[4.233607] CPU 1: Syncing TSC to CPU 0.
[3.762970] CPU 1: synchronized TSC with CPU 0 (last diff 0 cycles,
maxerr 960 cycles)
[3.764689] general protection fault:  [2] SMP
[3.764963] CPU 1
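[3.764983] Modules linked in: psmouse battery ac thermal fan button
arc4 ecb blkcipher ieee80211_crypt_wep ieee80211_crypt binfmt_misc
rfcomm l2cap bluetooth i915 drm speedstep_centrino cpufreq_userspace
cpufreq_powersave cpufreq_ondemand cpufreq_stats freq_table
cpufreq_conservative video sbs i2c_ec dock asus_acpi backlight container
ipv6 fuse sbp2 af_packet parport_pc lp parport sg sr_mod cdrom
snd_hda_intel snd_hda_codec tsdev snd_pcm_oss snd_mixer_oss pcmcia
snd_pcm snd_timer 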
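
The first oops can be read fairly directly off its Code: line. Assuming the dumped bytes start at the faulting instruction and that the blank RAX field in this archived copy was originally zero (both are assumptions, since the archive has stripped parts of the dump), the leading bytes 48 8b 40 38 are a 64-bit load from RAX plus 0x38, which lines up with "CR2: 0038". The sketch below decodes just that one instruction form; decode_mov_disp8 and effective_address are hypothetical helper names, not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of decoding "REX.W 8B /r" with an 8-bit displacement. */
struct mov_disp8 {
    int dst_reg;   /* ModRM.reg: destination register number (0 = RAX) */
    int base_reg;  /* ModRM.rm: base register number (0 = RAX) */
    int8_t disp;   /* signed 8-bit displacement */
};

/*
 * Decode only the single form needed here: 48 8B modrm disp8, with
 * ModRM.mod == 01 and no SIB byte (rm != 4).
 * Returns 0 on success, -1 for anything else.
 */
int decode_mov_disp8(const uint8_t *code, size_t len, struct mov_disp8 *out)
{
    uint8_t modrm;

    if (len < 4 || code[0] != 0x48 || code[1] != 0x8b)
        return -1;
    modrm = code[2];
    if ((modrm >> 6) != 1 || (modrm & 7) == 4)
        return -1;
    out->dst_reg = (modrm >> 3) & 7;
    out->base_reg = modrm & 7;
    out->disp = (int8_t)code[3];
    return 0;
}

/* Address the load touches: base register value plus sign-extended disp8. */
uint64_t effective_address(uint64_t base, int8_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}
```

Feeding it the bytes 48 8b 40 38 yields base register 0 (RAX), destination register 0 and displacement 0x38, so a zero base gives address 0x38, which is the pattern of a field read through a NULL structure pointer.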

Low latency patches

2005-04-06 Thread Jean-Marc Valin
Hi,

I've recently come across Con Kolivas' isochronous scheduler and Ingo's
RLIMIT_RT_CPU patch. I cannot comment on Ingo's patch, but I've been
using Con's scheduler for a few days and I only have good things to say
about it (latency is as good as running the process as root). The only
thing missing is perhaps a way to enable the feature on a per-user basis
(e.g. enable only for owner of the console), though I'm not sure whether
it goes in kernel or user space.

Are there any plans on merging some of that work? I think it would
really help everyone doing audio (or other real-time stuff) on Linux.

Jean-Marc

P.S. Please include me in CC, I'm not subscribed.

-- 
Jean-Marc Valin <[EMAIL PROTECTED]>
Université de Sherbrooke



Re: ext3 bug

2005-02-28 Thread Jean-Marc Valin
On Monday, 28 February 2005 at 08:31 -0700, jmerkey wrote:
> I see this problem infrequently on systems that have low memory
> conditions and with heavy swapping. I have not seen it on 2.6.9 but I
> have seen it on 2.6.10.

My machine has 1 GB RAM and I wasn't using much of it at that time (2GB
free on the swap), so I doubt that's the problem in my case.

Jean-Marc

-- 
Jean-Marc Valin <[EMAIL PROTECTED]>
Université de Sherbrooke



Re: ext3 bug

2005-02-27 Thread Jean-Marc Valin
> Hmm.. So that error is not FC3 specific, it is present in stock 2.6.10 as 
> well.  Also - This is on a USB disk, right? If so, the error may re-surface. 
> Try upgrading to latest kernel if possible. 

It's a USB disk (3.5" IDE + IDE to USB). What has been changed in
2.6.11-rcX?

Jean-Marc

-- 
Jean-Marc Valin <[EMAIL PROTECTED]>
Université de Sherbrooke



Re: ext3 bug

2005-02-27 Thread Jean-Marc Valin
> Please try stock kernel. 2.6.11-rc3 onwards should be fine. - I saw a similar 
> problem while running 2.6.10 kernel from Fedora Core 3. It doesn't happen 
> with stock kernels.

I did use a stock 2.6.10 kernel (I said custom in the sense that it
wasn't a Debian kernel). After a reboot, I was able to run fsck on the
disk (many, many errors) and it went fine after.

Jean-Marc

-- 
Jean-Marc Valin <[EMAIL PROTECTED]>
Université de Sherbrooke



ext3 bug

2005-02-26 Thread Jean-Marc Valin
Hi,

Looks like I ran into an ext3 bug (or at least the log says so). I got a
bunch of messages like:
ext3_free_blocks_sb: aborting transaction: Journal has aborted in
__ext3_journal_get_undo_access<2>EXT3-fs error (device sda2) in
ext3_free_blocks_sb: Journal has aborted
EXT3-fs error (device sda2): ext3_free_blocks: Freeing blocks in system
zones -Block = 228, count = 1

It happened while I was doing an "rm -rf" on a directory. The "rm" gave
a segfault and now I can't unmount the filesystem: unmount says "device
is busy", even though lsof reports nothing. The filesystem is on a USB
hard disk. The actual dump is in attachment. I'm running Debian unstable
with a custom 2.6.10 kernel on a 1.6 GHz Pentium-M.

    Jean-Marc

-- 
Jean-Marc Valin <[EMAIL PROTECTED]>
Université de Sherbrooke

Feb 27 01:15:48 idefix kernel: ------------[ cut here ]------------
Feb 27 01:15:48 idefix kernel: PREEMPT 
Feb 27 01:15:48 idefix kernel: Modules linked in: msdos sd_mod udf isofs sr_mod 
usb_storage scsi_mod joydev usbhid appletalk ax25 ipx radeon ipt_state 
iptable_filter iptable_mangle iptable_nat ip_conntrack ip_tables ipv6 
orinoco_cs orinoco hermes pcmcia lp binfmt_misc af_packet parport_pc parport 
uhci_hcd pci_hotplug intel_agp agpgart yenta_socket pcmcia_core tg3 
snd_intel8x0 snd_ac97_codec ehci_hcd usbcore nls_iso8859_1 nls_cp437 vfat fat 
ppp_async ppp_generic slhc crc_ccitt snd_pcm_oss tsdev evdev snd_pcm snd_timer 
snd_page_alloc snd_mixer_oss snd soundcore psmouse thermal fan button ac 
battery cpufreq_ondemand cpufreq_powersave speedstep_centrino freq_table 
processor
Feb 27 01:15:48 idefix kernel: CPU:0
Feb 27 01:15:48 idefix kernel: EIP:0060:[]Not tainted VLI
Feb 27 01:15:48 idefix kernel: EFLAGS: 00210286   (2.6.10) 
Feb 27 01:15:48 idefix kernel: EIP is at journal_forget+0x1d0/0x220
Feb 27 01:15:48 idefix kernel: eax: 005f   ebx: d1f1c000   ecx: b032c7cc   
edx: b032c7cc
Feb 27 01:15:48 idefix kernel: esi: b8932d48   edi: bb2ad41c   ebp: dd668080   
esp: d1f1dda0
Feb 27 01:15:48 idefix kernel: ds: 007b   es: 007b   ss: 0068
Feb 27 01:15:48 idefix kernel: Process rm (pid: 10370, threadinfo=d1f1c000 
task=c97f49e0)
Feb 27 01:15:48 idefix kernel: Stack: b02f67e0 b02e1027 b02f445b 04ca 
b02f4571  be0a5aac b8932d48 
Feb 27 01:15:48 idefix kernel:dfc002b8 b019c940 dfc002b8 b8932d48 
e73d7980 b275f400 b8932d48 0006 
Feb 27 01:15:48 idefix kernel:b0aeb448 dfc002b8 be0a5aac b019f028 
dfc002b8  be0a5aac b8932d48 
Feb 27 01:15:48 idefix kernel: Call Trace:
Feb 27 01:15:48 idefix kernel:  [] ext3_forget+0xf0/0x100
Feb 27 01:15:48 idefix kernel:  [] ext3_clear_blocks+0x118/0x170
Feb 27 01:15:48 idefix kernel:  [] ext3_free_data+0x98/0x150
Feb 27 01:15:48 idefix kernel:  [] ext3_free_branches+0xec/0x270
Feb 27 01:15:48 idefix kernel:  [] ext3_truncate+0x46b/0x5d0
Feb 27 01:15:48 idefix kernel:  [] ext3_mark_iloc_dirty+0x28/0x40
Feb 27 01:15:48 idefix kernel:  [] journal_start+0xad/0xe0
Feb 27 01:15:48 idefix kernel:  [] __ext3_journal_stop+0x24/0x50
Feb 27 01:15:48 idefix kernel:  [] start_transaction+0x29/0x70
Feb 27 01:15:48 idefix kernel:  [] ext3_delete_inode+0xc8/0x100
Feb 27 01:15:48 idefix kernel:  [] ext3_delete_inode+0x0/0x100
Feb 27 01:15:48 idefix kernel:  [] generic_delete_inode+0xa5/0x170
Feb 27 01:15:48 idefix kernel:  [] iput+0x63/0x90
Feb 27 01:15:48 idefix kernel:  [] sys_unlink+0xd7/0x150
Feb 27 01:15:48 idefix kernel:  [] sys_getdents64+0xa0/0xaa
Feb 27 01:15:48 idefix kernel:  [] filldir64+0x0/0x100
Feb 27 01:15:48 idefix kernel:  [] syscall_call+0x7/0xb
Feb 27 01:15:48 idefix kernel: Code: 2f b0 b8 71 45 2f b0 89 44 24 10 b8 ca 04 
00 00 89 44 24 0c b8 5b 44 2f b0 89 44 24 08 b8 27 10 2e b0 89 44 24 04 e8 c0 
a6 f6 ff <0f> 0b ca 04 5b 44 2f b0 e9 4d ff ff ff c7 04 24 e0 67 2f b0 b8 
Feb 27 01:15:48 idefix kernel:  <6>note: rm[10370] exited with preempt_count 2
Feb 27 01:15:48 idefix kernel:  [] schedule+0x532/0x540
Feb 27 01:15:48 idefix kernel:  [] unmap_page_range+0x53/0x80
Feb 27 01:15:48 idefix kernel:  [] unmap_vmas+0x1b6/0x1d0
Feb 27 01:15:48 idefix kernel:  [] exit_mmap+0x7d/0x160
Feb 27 01:15:48 idefix kernel:  [] mmput+0x37/0xa0
Feb 27 01:15:48 idefix kernel:  [] do_exit+0x16f/0x470
Feb 27 01:15:48 idefix kernel:  [] do_invalid_op+0x0/0xd0
Feb 27 01:15:48 idefix kernel:  [] die+0x18b/0x190
Feb 27 01:15:48 idefix kernel:  [] do_invalid_op+0xb2/0xd0
Feb 27 01:15:48 idefix kernel:  [] journal_forget+0x1d0/0x220
Feb 27 01:15:48 idefix kernel:  [] __wake_up_common+0x41/0x70
Feb 27 01:15:48 idefix kernel:  [] release_console_sem+0xbf/0xd0
Feb 27 01:15:48 idefix kernel:  [] error_code+0x2b/0x30
Feb 27 01:15:48 idefix kernel:  [] journal_forget+0x1d0/0x220
Feb 27 01:15:48 idefix kernel:  [] ext3_forget+0xf0/0x100
Feb 27 01:15:48 idefix kernel:  [] ext3_clear_blocks+0x118/0x170
Feb 27 01:15:48 idefix kernel:  [] ext3_free_data+0x98/0x150
Feb 27 01:15:48 idefix kernel:  [] ext3_
