nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-22 Thread Rafał Miłecki
2013/3/15 Martin Peres 
> As a follow up, Konrad sent me in private his vbios and the issue turned out 
> to be trivial.
> The reason why it behaved this way was that his vbios didn't have sensor 
> calibration values.
> The fix is available here: 
> http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commit/59b4006b5b30828bbd094dffe3937333b43d1e12
>
> This fix is part of a pull request I sent to Ben.
>
> Thanks again Konrad for reporting and testing the patches, I'll add you as a 
> tester to this patch :)

Thanks, guys, for debugging, analyzing and fixing this. I had the same problem on
00:05.0 VGA compatible controller [0300]: NVIDIA Corporation C51G
[GeForce 6100] [10de:0242] (rev a2)
and now it's fixed.

It seems that wasn't the only BIOS like that in the world ;)

--
Rafał
-- next part --
8698080ee092bdbd6ee2cd5e7f707ceea2812bd8
Merge branch 'drm-nouveau-fixes-3.9' of 
git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
Regression fixes and oops fixes for nouveau.
[   76.082597] nouveau  [  DEVICE][:00:05.0] BOOT0  : 0x04e000a2
[   76.082605] nouveau  [  DEVICE][:00:05.0] Chipset: C51 (NV4E)
[   76.082609] nouveau  [  DEVICE][:00:05.0] Family : NV40
[   76.084534] nouveau  [   VBIOS][:00:05.0] checking PRAMIN for image...
[   76.125409] nouveau  [   VBIOS][:00:05.0] ... appears to be valid
[   76.125418] nouveau  [   VBIOS][:00:05.0] using image from PRAMIN
[   76.125658] nouveau  [   VBIOS][:00:05.0] BIT signature found
[   76.125663] nouveau  [   VBIOS][:00:05.0] version 05.51.22.28.10
[   76.128699] nouveau  [ PFB][:00:05.0] RAM type: stolen system memory
[   76.128708] nouveau  [ PFB][:00:05.0] RAM size: 64 MiB
[   76.128711] nouveau  [ PFB][:00:05.0]ZCOMP: 0 tags
[   76.781036] nouveau  [  PTHERM][:00:05.0] FAN control: none / external
[   76.781053] nouveau  [  PTHERM][:00:05.0] Thermal management: disabled
[   76.781057] nouveau  [  PTHERM][:00:05.0] internal sensor: yes
[   76.791261] nouveau  [  PTHERM][:00:05.0] programmed thresholds [ 90(2), 
95(3), 145(2), 135(5) ]
[   76.791267] nouveau  [  PTHERM][:00:05.0] temperature (154 C) hit the 
'fanboost' threshold
[   76.791271] nouveau  [  PTHERM][:00:05.0] Thermal management: automatic
[   76.791277] nouveau  [  PTHERM][:00:05.0] temperature (154 C) hit the 
'downclock' threshold
[   76.791281] nouveau  [  PTHERM][:00:05.0] temperature (154 C) hit the 
'critical' threshold
[   76.791285] nouveau  [  PTHERM][:00:05.0] temperature (154 C) hit the 
'shutdown' threshold

cf9a625fae3d0ce8dffab53b2758d7c0cf4a5ad4
Merge branch 'drm-nouveau-fixes-3.9' of 
git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
Lots of thermal fixes and fix a lockdep warning we've been seeing.
[   55.668598] nouveau  [  DEVICE][:00:05.0] BOOT0  : 0x04e000a2
[   55.668606] nouveau  [  DEVICE][:00:05.0] Chipset: C51 (NV4E)
[   55.668609] nouveau  [  DEVICE][:00:05.0] Family : NV40
[   55.670533] nouveau  [   VBIOS][:00:05.0] checking PRAMIN for image...
[   55.711390] nouveau  [   VBIOS][:00:05.0] ... appears to be valid
[   55.711399] nouveau  [   VBIOS][:00:05.0] using image from PRAMIN
[   55.711639] nouveau  [   VBIOS][:00:05.0] BIT signature found
[   55.711644] nouveau  [   VBIOS][:00:05.0] version 05.51.22.28.10
[   55.714712] nouveau  [ PFB][:00:05.0] RAM type: stolen system memory
[   55.714721] nouveau  [ PFB][:00:05.0] RAM size: 64 MiB
[   55.714724] nouveau  [ PFB][:00:05.0]ZCOMP: 0 tags
[   56.367033] nouveau  [  PTHERM][:00:05.0] FAN control: none / external
[   56.367052] nouveau  [  PTHERM][:00:05.0] fan management: disabled
[   56.367056] nouveau  [  PTHERM][:00:05.0] internal sensor: no
[   56.387298] nouveau  [  PTHERM][:00:05.0] programmed thresholds [ 90(2), 
95(3), 145(2), 135(5) ]


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-15 Thread Martin Peres
Hi everyone,

As a follow up, Konrad sent me in private his vbios and the issue turned 
out to be trivial.
The reason why it behaved this way was that his vbios didn't have sensor 
calibration values.
The fix is available here: 
http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commit/59b4006b5b30828bbd094dffe3937333b43d1e12
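
To illustrate why missing calibration values matter, here is a minimal sketch, not the commit itself. It assumes the raw sensor reading is converted with a linear formula built from slope and offset values provided by the vbios, that a zero means "not provided", and that the safe reaction is to refuse to report a temperature at all (which would match the "internal sensor: no" line in the logs taken after the fix). The struct, field and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical calibration record; assumed to be filled from the vbios. */
struct sensor_calib {
	int slope_mult;
	int slope_div;
	int offset_num;
	int offset_den;
};

/* Assumption: a zero field means the vbios did not provide that value. */
static int calib_is_complete(const struct sensor_calib *c)
{
	assert(c != NULL);
	return c->slope_mult && c->slope_div && c->offset_num && c->offset_den;
}

/*
 * Convert a raw reading to degrees Celsius, or return -ENODEV when the
 * calibration is incomplete, so that no bogus temperature ever reaches
 * the threshold checks.
 */
static int raw_to_celsius(const struct sensor_calib *c, int raw)
{
	if (!calib_is_complete(c))
		return -ENODEV;
	return raw * c->slope_mult / c->slope_div + c->offset_num / c->offset_den;
}
```

With a complete record the function returns a scaled temperature; with an all-zero record it returns -ENODEV instead of passing an uncalibrated number on to the shutdown logic.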

This fix is part of a pull request I sent to Ben.

Thanks again Konrad for reporting and testing the patches, I'll add you 
as a tester to this patch :)

Cheers,
Mupuf

PS: For the record, here is a forward of our private conversation.

-------- Original Message --------
Subject: Re: nouveau shuts the machine down with v3.9-rc1 (temperature 
(72 C) hit the 'shutdown' threshold).
Date:    Fri, 15 Mar 2013 11:16:17 -0400
From:    Konrad Rzeszutek Wilk 
To:      Martin Peres 


On Fri, Mar 15, 2013 at 02:30:44AM +0100, Martin Peres wrote:
> On 13/03/2013 03:20, Konrad Rzeszutek Wilk wrote:
> >>Ah ah, what challenge? The reason why the temperature is messed up
> >>is ... trivial.
> >>
> >>Will send a patch for that!
> >Heh. Pls CC me so I can test it and add the Tested-by flag:
> >>Thanks for reporting the bug!
> >Of course.
> >>Martin
> Hey Konrad,
>
> Here are the thermal patches I sent to Ben Skeggs for review. The
> patch that should solve your problem is the patch 6.
>
> Let me know if it solves your issue (that I managed to reproduce by
> faking a different vbios).
>

> dmesg | grep nou
[   12.177930] calling  nouveau_drm_init+0x0/0x1000 [nouveau] @ 1488
[   12.330206] nouveau :00:0d.0: setting latency timer to 64
[   12.353307] nouveau  [  DEVICE][:00:0d.0] BOOT0  : 0x04c000a2
[   12.359398] nouveau  [  DEVICE][:00:0d.0] Chipset: C61 (NV4C)
[   12.365477] nouveau  [  DEVICE][:00:0d.0] Family : NV40
[   12.371621] nouveau  [   VBIOS][:00:0d.0] checking PRAMIN for image...
[   12.416327] nouveau  [   VBIOS][:00:0d.0] ... appears to be valid
[   12.422758] nouveau  [   VBIOS][:00:0d.0] using image from PRAMIN
[   12.429324] nouveau  [   VBIOS][:00:0d.0] BIT signature found
[   12.429326] nouveau  [   VBIOS][:00:0d.0] version 05.61.32.22.01
[   12.443160] nouveau  [ PFB][:00:0d.0] RAM type: unknown
[   12.443161] nouveau  [ PFB][:00:0d.0] RAM size: 128 MiB
[   12.443162] nouveau  [ PFB][:00:0d.0]ZCOMP: 0 tags
[   12.50] nouveau  [  PTHERM][:00:0d.0] FAN control: none / external
[   12.514647] nouveau  [  PTHERM][:00:0d.0] fan management: disabled
[   12.521161] nouveau  [  PTHERM][:00:0d.0] internal sensor: no
[   12.547272] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 90(2), 
95(3), 145(2), 135(5) ]
[   12.573758] nouveau  [ DRM] VRAM: 125 MiB
[   12.579153] nouveau  [ DRM] GART: 512 MiB
[   12.584887] nouveau  [ DRM] TMDS table version 1.1
[   12.590018] nouveau  [ DRM] DCB version 3.0
[   12.594555] nouveau  [ DRM] DCB outp 00: 01000310 0023
[   12.601754] nouveau  [ DRM] DCB outp 01: 00110204 97e5
[   12.607585] nouveau  [ DRM] DCB conn 00: 
[   12.612424] nouveau  [ DRM] Saving VGA fonts
[   12.656034] nouveau W[ DRM] DCB type 4 not known
[   12.660991] nouveau W[ DRM] Unknown-1 has no encoders, removing
[   12.681157] nouveau  [ DRM] 1 available performance level(s)
[   12.687714] nouveau  [ DRM] 0: core 425MHz shader 425MHz fanspeed 100%
[   12.694575] nouveau  [ DRM] c:
[   12.699270] nouveau  [ DRM] MM: using M2MF for buffer copies
[   12.738742] nouveau :00:0d.0: No connectors reported connected with modes
[   12.752063] nouveau  [ DRM] allocated 1024x768 fb: 0x9000, bo 
88012dffbc00
[   12.763397] fbcon: nouveaufb (fb0) is primary device
[   12.780410] nouveau :00:0d.0: fb0: nouveaufb frame buffer device
[   12.786754] nouveau :00:0d.0: registered panic notifier
[   12.792330] [drm] Initialized nouveau 1.1.0 20120801 for :00:0d.0 on 
minor 0
[   12.800071] initcall nouveau_drm_init+0x0/0x1000 [nouveau] returned 0 after 
602409 usecs


and no poweroffs :-)

So definitely Tested-by: Konrad Rzeszutek Wilk 
for all of the patches.

Thanks!
> Cheers,
> Martin






nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-12 Thread Martin Peres
On 11/03/2013 13:38, Konrad Rzeszutek Wilk wrote:
>> With that I am still getting the issues (even with an insane delay of 100 
>> seconds).
>> Here is the serial log with various runs.
> Any thoughts?
Sorry for taking so long to answer, but I had the flu for a week and still 
had to do my research duties :s

Anyway, as a matter of fact, I do have some thoughts. If you don't mind, 
the tests I would like you to run are listed at the end of the message.
>> [   13.523878] initcall init_sg+0x0/0x1000 [sg] returned 0 after 5355 usecs
>> ^G^G[   13.621376] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 
>> 90(2), 95(3), 145(2), 135(5) ]
>> [   13.630487] nouveau 39079] nouveau  [  PTHERM][:00:0d.0] Thermal 
>> management: automatic
>> [   13.646028] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
>> 'downclock' threshold
>> [   13.654702] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
>> 'critical' threshold
>> [   13.663296] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
>> 'shutdown' threshold
>> [   13.671992] [TTM] Zone  kernel: Available graphics memory: 1963774 kiB
> Perhaps I've some insanely stupid BIOS?

So, first of all, I would indeed like to see your vbios, and I would also 
like to know the bitfields of some registers.

The easiest way to do both is to grab and compile the envytools[0].

To grab your vbios, please do the following:
nvagetbios > nv4c_vbios.rom

To get the bitfield of the thermal-related regs:
nvascan 15b0 10 > nv4c_therm_scan

Please send me both of these files and I'll see what I can do.

Sorry again for the very late answer (I'm slowly getting better).

Martin

[0] https://github.com/pathscale/envytools


nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-11 Thread Konrad Rzeszutek Wilk
> With that I am still getting the issues (even with an insane delay of 100 
> seconds).
> Here is the serial log with various runs.

Any thoughts?
> [   13.523878] initcall init_sg+0x0/0x1000 [sg] returned 0 after 5355 usecs
> ^G^G[   13.621376] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 
> 90(2), 95(3), 145(2), 135(5) ]
> [   13.630487] nouveau 39079] nouveau  [  PTHERM][:00:0d.0] Thermal 
> management: automatic
> [   13.646028] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
> 'downclock' threshold
> [   13.654702] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
> 'critical' threshold
> [   13.663296] nouveau  [  PTHERM][:00:0d.0] temperature (218 C) hit the 
> 'shutdown' threshold
> [   13.671992] [TTM] Zone  kernel: Available graphics memory: 1963774 kiB

Perhaps I've some insanely stupid BIOS?




nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-05 Thread Martin Peres
On 04/03/2013 22:41, Konrad Rzeszutek Wilk wrote:
> Pls CC me in case you would like me also to test them with the mdelay 
> patch. 

Hi Konrad,

Marcin suggested another explanation for the issue you are seeing, and 
it made me look at the code again.

I don't have enough nv4x hardware to test all the conditions, but with the 
attached patches you may get saner behaviour than a computer that shuts 
down whenever you turn it on (like a "most useless machine ever").
The most important patch is the 8th one.

Please try applying them on top of your 3.9-rc1 kernel and send me back 
your kernel logs + sensors output.

Cheers,
Martin

PS: The attached patches are part of my current thermal-related queue. 
I'll post them to the list soon.
- http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commits/thermal

Attachments (scrubbed by the list archive):
- 0001-drm-nv40-therm-improve-selection-between-the-old-and.patch (text/x-patch, 3038 bytes)
- 0002-drm-nv40-therm-increase-the-sensor-s-settling-delay-.patch (text/x-patch, 1541 bytes)
- 0003-drm-nouveau-therm-do-not-make-assumptions-on-tempera.patch (text/x-patch, 1679 bytes)
- 0004-drm-nouveau-therm-remove-some-confusion-introduced-b.patch (text/x-patch, 4288 bytes)
- 0008-drm-nv40-therm-DO-NOT-PUSH-move-nv4c-to-the-newer-te.patch (text/x-patch, 946 bytes)



nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-05 Thread Konrad Rzeszutek Wilk
On Tue, Mar 05, 2013 at 12:13:57PM +0100, Martin Peres wrote:
> On 04/03/2013 22:41, Konrad Rzeszutek Wilk wrote:
> >Pls CC me in case you would like me also to test them with the
> >mdelay patch.
> 
> Hi Konrad,
> 
> Marcin suggested another explanation for the issue you are seeing,
> and it made me look at the code again.
> 
> I don't have enough nv4x hardware to test all the conditions, but with
> the attached patches you may get saner behaviour than a computer
> that shuts down whenever you turn it on
> (like a "most useless machine ever").
> The most important patch is the 8th one.
> 
> Please try applying them on top of your 3.9-rc1 kernel and send me
> back your kernel logs + sensors output.

I also added a debug patch on top of this to twiddle with the values:

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index 92f3fca..a5a8abe 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -31,22 +31,28 @@ struct nv40_therm_priv {
 
 enum nv40_sensor_style { INVALID_STYLE = -1, OLD_STYLE = 0, NEW_STYLE = 1 };
 
+extern int hack_old_style;
+extern int hack_mdelay;
 static enum nv40_sensor_style
 nv40_is_older_style_sensor(struct nouveau_therm *therm)
 {
 	struct nouveau_device *device = nv_device(therm);
 
+	if (hack_old_style) {
+		if (device->chipset == 0x4c)
+			return OLD_STYLE;
+	}
 	switch (device->chipset) {
 	case 0x43:
 	case 0x44:
 	case 0x4a:
 	case 0x47:
-	case 0x4c:
 		return OLD_STYLE;
 
 	case 0x46:
 	case 0x49:
 	case 0x4b:
+	case 0x4c:
 	case 0x4e:
 	case 0x67:
 	case 0x68:
@@ -66,11 +72,17 @@ nv40_sensor_setup(struct nouveau_therm *therm)
 	if (style == NEW_STYLE) {
 		nv_mask(therm, 0x15b8, 0x8000, 0);
 		nv_wr32(therm, 0x15b0, 0x80003fff);
-		mdelay(20); /* wait for the temperature to stabilize */
+		if (hack_mdelay)
+			mdelay(hack_mdelay);
+		else
+			mdelay(20); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0x3fff;
 	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
-		mdelay(20); /* wait for the temperature to stabilize */
+		if (hack_mdelay)
+			mdelay(hack_mdelay);
+		else
+			mdelay(20); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0xff;
 	} else
 		return -ENODEV;
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index d109936..d51bf21 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -69,6 +69,12 @@ MODULE_PARM_DESC(modeset, "enable driver (default: auto, "
 int nouveau_modeset = -1;
 module_param_named(modeset, nouveau_modeset, int, 0400);
 
+int hack_mdelay = 0;
+module_param_named(mdelay, hack_mdelay, int, 0400);
+
+int hack_old_style = 1;
+module_param_named(old_style, hack_old_style, int, 0400);
+
 static struct drm_driver driver;
 
 static int

With that I am still getting the issues (even with an insane delay of 100 
seconds).
Here is the serial log with various runs.


-- next part --
PXELINUX 3.82 2009-06-09  Copyright (C) 1994-2009 H. Peter Anvin et al
boot: 
Loading 
vmlinuz
Loading 
initramfs.cpio.gzready.

Re: nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-05 Thread Martin Peres

From e2a10f1e7060b0cbabb032590ec2588f952016fc Mon Sep 17 00:00:00 2001
From: Martin Peres martin.pe...@labri.fr
Date: Tue, 5 Mar 2013 10:26:30 +0100
Subject: [PATCH 1/8] drm/nv40/therm: improve selection between the old and the
 new style

The condition to select between the old and new style was a thinko
as rnndb orders chipsets based on their release date (or general
chronology hw-wise) and not based on their chipset number.

As the nv40 family is a mess when it comes to numbers, this patch
introduces a switch-based selection between the old and new style.

Signed-off-by: Martin Peres martin.pe...@labri.fr
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 50 ++--
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index 0f5363e..9d9ecaa 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -29,42 +29,68 @@ struct nv40_therm_priv {
 	struct nouveau_therm_priv base;
 };
 
+enum nv40_sensor_style { INVALID_STYLE = -1, OLD_STYLE = 0, NEW_STYLE = 1 };
+
+static enum nv40_sensor_style
+nv40_is_older_style_sensor(struct nouveau_therm *therm)
+{
+	struct nouveau_device *device = nv_device(therm);
+
+	switch (device->chipset) {
+	case 0x43:
+	case 0x44:
+	case 0x4a:
+	case 0x47:
+		return OLD_STYLE;
+
+	case 0x46:
+	case 0x49:
+	case 0x4b:
+	case 0x4e:
+	case 0x4c:
+	case 0x67:
+	case 0x68:
+	case 0x63:
+		return NEW_STYLE;
+	default:
+		return INVALID_STYLE;
+	}
+}
+
 static int
 nv40_sensor_setup(struct nouveau_therm *therm)
 {
-	struct nouveau_device *device = nv_device(therm);
+	enum nv40_sensor_style style = nv40_is_older_style_sensor(therm);
 
 	/* enable ADC readout and disable the ALARM threshold */
-	if (device->chipset >= 0x46) {
+	if (style == NEW_STYLE) {
 		nv_mask(therm, 0x15b8, 0x8000, 0);
 		nv_wr32(therm, 0x15b0, 0x80003fff);
 		mdelay(10); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0x3fff;
-	} else {
+	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
 		return nv_rd32(therm, 0x15b4) & 0xff;
-	}
+	} else
+		return -ENODEV;
 }
 
 static int
 nv40_temp_get(struct nouveau_therm *therm)
 {
 	struct nouveau_therm_priv *priv = (void *)therm;
-	struct nouveau_device *device = nv_device(therm);
 	struct nvbios_therm_sensor *sensor = &priv->bios_sensor;
+	enum nv40_sensor_style style = nv40_is_older_style_sensor(therm);
 	int core_temp;
 
-	if (device->chipset >= 0x46) {
+	if (style == NEW_STYLE) {
 		nv_wr32(therm, 0x15b0, 0x80003fff);
 		core_temp = nv_rd32(therm, 0x15b4) & 0x3fff;
-	} else {
+	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
 		core_temp = nv_rd32(therm, 0x15b4) & 0xff;
-	}
-
-	/* Setup the sensor if the temperature is 0 */
-	if (core_temp == 0)
-		core_temp = nv40_sensor_setup(therm);
+	} else
+		return -ENODEV;
 
 	if (sensor->slope_div == 0)
 		sensor->slope_div = 1;
-- 
1.8.1.5
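
The point made in the commit message of patch 1/8 can be checked with a small standalone sketch. It compares the old numeric test (chipset >= 0x46) with the per-chipset table from the patch and counts the chipsets on which the two disagree. The chipset lists are copied from the patch; the function names are made up for this illustration and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>

enum sensor_style { INVALID_STYLE = -1, OLD_STYLE = 0, NEW_STYLE = 1 };

/* The pre-patch rule: anything numbered 0x46 or above is "new style". */
static enum sensor_style style_by_number(int chipset)
{
	return chipset >= 0x46 ? NEW_STYLE : OLD_STYLE;
}

/* The per-chipset table introduced by patch 1/8. */
static enum sensor_style style_by_table(int chipset)
{
	switch (chipset) {
	case 0x43: case 0x44: case 0x4a: case 0x47:
		return OLD_STYLE;
	case 0x46: case 0x49: case 0x4b: case 0x4e:
	case 0x4c: case 0x67: case 0x68: case 0x63:
		return NEW_STYLE;
	default:
		return INVALID_STYLE;
	}
}

/* Count the chipsets that the numeric rule classifies differently. */
static int count_mismatches(const int *chipsets, size_t n)
{
	int mismatches = 0;

	assert(chipsets != NULL);
	for (size_t i = 0; i < n; i++)
		if (style_by_number(chipsets[i]) != style_by_table(chipsets[i]))
			mismatches++;
	return mismatches;
}
```

Run over the twelve chipsets named in the patch, the numeric rule disagrees with the table on 0x47 and 0x4a, both of which are numbered above 0x46 but listed as old style.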

From 38893c70faa9bbedded908694c3344d755cb05bf Mon Sep 17 00:00:00 2001
From: Martin Peres martin.pe...@labri.fr
Date: Tue, 5 Mar 2013 10:35:20 +0100
Subject: [PATCH 2/8] drm/nv40/therm: increase the sensor's settling delay to
 20ms

Based on my experience, 10ms wasn't always enough. Let's bump that
to a little more.

If this turns out to be insufficient-enough again, then an approach
based on letting the sensor settle for several seconds before starting
polling on the temperature would be better suited. This way, boot time
wouldn't be impacted by those waits too much.

Signed-off-by: Martin Peres martin.pe...@labri.fr
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index 9d9ecaa..818060e 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -66,10 +66,11 @@ nv40_sensor_setup(struct nouveau_therm *therm)
 	if (style == NEW_STYLE) {
 		nv_mask(therm, 0x15b8, 0x8000, 0);
 		nv_wr32(therm, 0x15b0, 

nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-04 Thread Martin Peres
Hi Konrad,

On 04/03/2013 19:40, Konrad Rzeszutek Wilk wrote:
 > After git merge ab7826595e9ec51a51f622c5fc91e2f59440481a
 > (Merge tag 'mfd-3.9-1' of 
 > git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6)
 > the nouveau driver ends up shutting off the machine when booting.
 >
 >
 > I haven't done a git bisection yet and was wondering if there are some
 > juicy commits I ought to look at?

Sure, no need to bisect, it is a new (apparently-broken-for-you) feature.

The code is in /drivers/gpu/drm/nouveau/core/subdev/therm/


 >
 > Here is the serial console:


 > [6.940628] nouveau  [  PTHERM][:00:0d.0] Thermal management: disabled
 > [6.957474] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
 > [6.966594] nouveau 6.975100] nouveau  [ PTHERM][:00:0d.0] Thermal management: automatic
 > [6.982059] nouveau  [  PTHERM][:00:0d.0] temperature (88 C) hit the 'downclock' threshold
 > [6.990680] nouveau  [  PTHERM][:00:0d.0] temperature (88 C) hit the 'critical' threshold
 > [6.999194] nouveau  [  PTHERM][:00:0d.0] temperature (90 C) hit the 'shutdown' threshold

See, this is strange. If I believe the "programmed thresholds" line, the 
fanboost threshold is at 90°C, downclock is at 95°C, the critical 
temperature is at 145°C and shutdown is at 135°C.
So, from the BIOS side, things seem to be in fairly good shape (critical 
should be lower than shutdown, but that's OK).
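
To make that reading of the thresholds concrete, here is a small sketch. It assumes the number in parentheses is a hysteresis in degrees, which the log line itself does not state, and the type and function names are invented for the example. With the values from the log, a first reading of 90 C should trip fanboost only, which is why a log that also reports downclock, critical and shutdown at 88-90 C looks strange.

```c
#include <assert.h>
#include <stddef.h>

enum { THR_FANBOOST, THR_DOWNCLOCK, THR_CRITICAL, THR_SHUTDOWN, THR_COUNT };

struct threshold {
	int temp;       /* trip point in degrees C */
	int hysteresis; /* assumed meaning of the value in parentheses */
};

/* Values taken from the "programmed thresholds" log line. */
static const struct threshold programmed[THR_COUNT] = {
	[THR_FANBOOST]  = {  90, 2 },
	[THR_DOWNCLOCK] = {  95, 3 },
	[THR_CRITICAL]  = { 145, 2 },
	[THR_SHUTDOWN]  = { 135, 5 },
};

/*
 * New state of one threshold: an inactive threshold trips at its trip
 * point, an active one stays active until the temperature falls below
 * trip point minus hysteresis.
 */
static int threshold_update(const struct threshold *t, int active, int temp)
{
	assert(t != NULL);
	if (!active)
		return temp >= t->temp;
	return temp >= t->temp - t->hysteresis;
}

/* Bitmask of the thresholds that a first reading would trip. */
static unsigned int tripped_mask(int temp)
{
	unsigned int mask = 0;

	for (int i = 0; i < THR_COUNT; i++)
		if (threshold_update(&programmed[i], 0, temp))
			mask |= 1u << i;
	return mask;
}
```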

My theory is that your temperature sensor's output is very variable,
which would set off the shutdown alarm. So, either the sensor needs more
settling time or the output is genuinely very variable.

In the first case, we could fix that by increasing the settling time (at
the expense of a longer boot period). We could also force a 10s wait at
boot time before reading the temperature.
In the latter case, the only solution is to average the temperature over
several samples. I would need statistics on the variability in order to
calculate a proper low-pass filter that wouldn't be too slow or too
RAM/wakeup-intensive.
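
To make the averaging idea concrete, here is a minimal sketch of one possible low-pass filter, an exponential moving average in integer arithmetic. It is not what nouveau does; the struct, the function name and the 1/4 smoothing weight are all assumptions chosen for illustration. It keeps a single integer of state, so it is cheap in RAM, but a smaller weight would also make it slower to react to a real overheat.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filter state: one running average, kept scaled by 16
 * so that fractions of a degree are not lost between updates. */
struct temp_filter {
	int avg_x16;
	int primed;	/* 0 until the first sample has been seen */
};

/* Feed one sample (degrees C) and return the filtered temperature.
 * Each new sample contributes 1/4 of its distance to the average. */
static int temp_filter_update(struct temp_filter *f, int sample)
{
	assert(f != NULL);

	if (!f->primed) {
		/* Start from the first reading instead of from zero. */
		f->avg_x16 = sample * 16;
		f->primed = 1;
	} else {
		f->avg_x16 += (sample * 16 - f->avg_x16) / 4;
	}
	return f->avg_x16 / 16;
}
```

Starting from a steady 60 C, a single spurious 140 C sample only raises the filtered value to 80 C, and the next 60 C sample brings it back down to 75 C, so one outlier no longer crosses a 135 C shutdown threshold.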

I really hope the problem is the settling time!


Here is what you can do to test the theory:

Change the mdelay at line 41 of 
/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c 
(http://cgit.freedesktop.org/nouveau/linux-2.6/tree/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c#n41)
from 10 to 1000.
Please also add an mdelay of 1000 between lines 44 and 45.

If it works with this patch, then try decreasing the delay to 20ms.

In any case, I'll send some thermal patches tonight to make the driver
more robust against long settling times.

Thanks for reporting!

Martin (mupuf)




nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-04 Thread Konrad Rzeszutek Wilk
On Mon, Mar 04, 2013 at 08:21:48PM +0100, Martin Peres wrote:
> Hi Konrad,
> 
> On 04/03/2013 19:40, Konrad Rzeszutek Wilk wrote:
> > After git merge ab7826595e9ec51a51f622c5fc91e2f59440481a
> > (Merge tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6)
> > the nouveau driver ends up shutting off the machine when booting.
> >
> >
> > I haven't done a git bisection yet and was wondering if there are some
> > juicy commits I ought to look at?
> 
> Sure, no need to bisect, it is a new (apparently-broken-for-you) feature.
> 
> The code is in /drivers/gpu/drm/nouveau/core/subdev/therm/
> 
> 
> >
> > Here is the serial console:
> 
> 
> > [6.940628] nouveau  [  PTHERM][:00:0d.0] Thermal management: disabled
> > [6.957474] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
> > [6.966594] nouveau 6.975100] nouveau  [ PTHERM][:00:0d.0] Thermal management: automatic
> > [6.982059] nouveau  [  PTHERM][:00:0d.0] temperature (88 C) hit the 'downclock' threshold
> > [6.990680] nouveau  [  PTHERM][:00:0d.0] temperature (88 C) hit the 'critical' threshold
> > [6.999194] nouveau  [  PTHERM][:00:0d.0] temperature (90 C) hit the 'shutdown' threshold
> 
> See, this is strange. If I believe the "programmed thresholds" line,
> the fanboost threshold is at 90°C, downclock is at 95°C, critical
> temperature is at 145°C and shutdown is at 135°C.
> So, from the BIOS side, things seem to be in fairly good shape
> (critical should be lower than shutdown, but that's OK).
> 
> My theory is that your temperature sensor's output is very variable,
> which would set off the shutdown alarm. So, either the sensor needs more
> settling time or the output is genuinely very variable.

You should see it when I boot it under Xen:

[8.427789] nouveau  [  PTHERM][:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
[8.427855] nouveau  [  PTHERM][:00:0d.0] temperature (222 C) hit the 'fanboost' threshold
[8.427919] nouveau  [  PTHERM][:00:0d.0] Thermal management: automatic
[8.427973] nouveau  [  PTHERM][:00:0d.0] temperature (222 C) hit the 'downclock' threshold
[8.428036] nouveau  [  PTHERM][:00:0d.0] temperature (222 C) hit the 'critical' threshold
[8.428099] nouveau  [  PTHERM][:00:0d.0] temperature (222 C) hit the 'shutdown' threshold

> 
> In the first case, we could fix that by increasing the settling time
> (at the expense of a longer boot period). We could also force a 10s
> wait at boot time before reading the temperature.
> In the latter case, the only solution is to average the temperature
> over several samples. I would need statistics on the variability in
> order to calculate a proper low-pass filter that wouldn't be too slow
> or too RAM/wakeup-intensive.
> 
> I really hope the problem is the settling time!
> 
> 
> Here is what you can do to test the theory:
> 
> Change the mdelay at line 41 of
> /drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c 
> (http://cgit.freedesktop.org/nouveau/linux-2.6/tree/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c#n41)
> from 10 to 1000.
> Please also add an mdelay of 1000 between lines 44 and 45.

Let me do that tomorrow and report my findings.
> 
> If it works with this patch, then try decreasing the delay to 20ms.
> 
> In any case, I'll send some thermal patches tonight to make the driver
> more robust against long settling times.

Please CC me if you would also like me to test them with the
mdelay patch.

> 
> Thanks for reporting!

Of course.
> 
> Martin (mupuf)
> 
> 


nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

2013-03-04 Thread Konrad Rzeszutek Wilk
After git merge ab7826595e9ec51a51f622c5fc91e2f59440481a
(Merge tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6)
the nouveau driver ends up shutting off the machine when booting.


I haven't done a git bisection yet and was wondering if there are some
juicy commits I ought to look at?

Here is the serial console:


Loading latest/initramfs.cpio.gz..ready.
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.8.0upstream-10478-g31c7742-dirty (konrad at build.dumpdata.com) (gcc version 4.4.4 20100503 (Red Hat 4.4.4-2) (GCC) ) #1 SMP Sun Mar 3 18:09:03 EST 2013
[0.00] Command line: initrd=latest/initramfs.cpio.gz zcache nofb debug selinux=0 console=ttyS0,115200 loglevel=10 apic=debug BOOT_IMAGE=latest/vmlinuz
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009f3ff] usable
[0.00] BIOS-e820: [mem 0x0009f400-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xb7ed] usable
[0.00] BIOS-e820: [mem 0xb7ee-0xb7ee2fff] ACPI NVS
[0.00] BIOS-e820: [mem 0xb7ee3000-0xb7ee] ACPI data
[0.00] BIOS-e820: [mem 0xb7ef-0xb7ef] reserved
[0.00] BIOS-e820: [mem 0xb800-0xbfff] reserved
[0.00] BIOS-e820: [mem 0xf000-0xf3ff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00013fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.5 present.
[0.00] DMI: BIOSTAR Group N61PB-M2S/N61PB-M2S, BIOS 6.00 PG 09/03/2009
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] No AGP bridge found
[0.00] e820: last_pfn = 0x14 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-C7FFF write-protect
[0.00]   C8000-F uncachable
[0.00] MTRR variable ranges enabled:
[0.00]   0 base  mask 8000 write-back
[0.00]   1 base 8000 mask C000 write-back
[0.00]   2 base 0001 mask C000 write-back
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] TOM2: 00014000 aka 5120M
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] e820: update [mem 0xc000-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xb7ee0 max_arch_pfn = 0x4
[0.00] Scan for SMP in [mem 0x-0x03ff]
[0.00] Scan for SMP in [mem 0x0009fc00-0x0009]
[0.00] Scan for SMP in [mem 0x000f-0x000f]
[0.00] found SMP MP-table at [mem 0x000f3a30-0x000f3a3f] mapped at [880f3a30]
[0.00]   mpc: f1f44-f2088
[0.00] Scanning 1 areas for low memory corruption
[0.00] ACPI: RSDP 000f7e60 00014 (v00 Nvidia)
[0.00] ACPI: RSDT b7ee3000 00038 (v01 Nvidia NVDAACPI 42302E31 NVDA )
[0.00] ACPI: FACP b7ee3080 00074 (v01 Nvidia NVDAACPI 42302E31 NVDA )
[0.00] ACPI BIOS Bug: Warning: Optional FADT field Pm2ControlBlock has zero address or length: 0x/0x1 (20130117/tbfadt-599)
[0.00] ACPI: 
