[Nouveau] [Bug 103721] New: Frequent freezes with nouveau on Thinkpad P50

2017-11-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=103721

    Bug ID: 103721
   Summary: Frequent freezes with nouveau on Thinkpad P50
   Product: xorg
   Version: unspecified
  Hardware: Other
        OS: All
    Status: NEW
  Severity: normal
  Priority: medium
 Component: Driver/nouveau
  Assignee: nouveau@lists.freedesktop.org
  Reporter: will.new...@gmail.com
QA Contact: xorg-t...@lists.x.org

Created attachment 135437
  --> https://bugs.freedesktop.org/attachment.cgi?id=135437&action=edit
Output of journalctl -k -b-1

I have been experiencing frequent freezes of xorg/wayland with the nouveau
driver on a Thinkpad P50 20EN.

The freezes seem to be related to system load and occur several times per day.

I'm running Fedora 26 with kernel 4.13.11-200.fc26.x86_64, and the freezes seem
to happen under both Xorg and Wayland. I'm running with discrete graphics
enabled in the BIOS, as enabling Hybrid prevents the system from booting (there
is no option to run purely integrated graphics).

I have attached the kernel logs to the ticket.

I've also reported the issue in the Fedora bug tracker, where there are some
more logs, but have had no response there yet:
https://bugzilla.redhat.com/show_bug.cgi?id=1509294

lspci:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI
Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal
subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI
#1 (rev 31)
00:16.3 Serial controller: Intel Corporation Sunrise Point-H KT Redirection
(rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller
[AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1
(rev f1)
00:1c.2 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3
(rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5
(rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #13
(rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM
(rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M2000M]
(rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
04:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
3e:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI
Express Card Reader (rev 01)

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-13 Thread Martin Peres
Hello,

On 13/11/17 09:15, John Hubbard wrote:
> On 11/12/2017 06:29 PM, Martin Peres wrote:
>> Hello,
>>
>> Some users have been complaining for years about their GPU sounding like
>> a jet engine at take off. Last year, I finally got my hands on one of
>> these GPUs and have been trying to fix this issue on and off since then.
> 
> Some early feedback: can you tell us the exact SKUs you have? And are these
> production boards with production VBIOSes?  
> 
> Normally, it's just our bringup boards that we'd expect to be noisy like 
> this, so we're looking for a few more details.

Thanks for the quick feedback.

We only have access to production hardware with production vbioses, as
far as I know. In any case, I made all my experiments on the following
GPU (with a stock vbios, albeit modified to perform the experiment):

NVIDIA Corporation GF108 [GeForce GT 620] (rev a1) (prog-if 00 [VGA
controller])
Subsystem: eVga.com. Corp. Device 2625

I pushed my vbios to http://fs.mupuf.org/nvidia/fan_calib/ if this is
interesting to you (I doubt it, but if that can save us a round trip,
then let's do this :)).

Thanks,
Martin

> 
> thanks,
> John Hubbard
> NVIDIA
> 
>>
>> After failing to find anything in the HW, I figured out that the duty
>> cycle set by nvidia's proprietary driver would be way under the expected
>> value. By randomly changing values in the unknown tables of the vbios, I
>> found out that there is a fan calibration table at the offset 0x18 in
>> the BIT P table (version 2).
>>
>> In this table, I identified 2 major 16-bit parameters at offsets 0xa and
>> 0xc[2]. The first one, I named pwm_max, while naming the latter
>> pwm_offset. As expected, these parameters look like a mapping function
>> of the form aX + b. However, after gathering more samples, I found out
>> that the output was not continuous when linearly increasing pwm_offset
>> [1]. Even more funnily, the period of this square function is linear
>> with the frequency used for the fan's PWM.
>>
>> I tried reverse engineering the formula to describe this function, but
>> failed to find a version that would work perfectly for all PWM
>> frequencies. This is the closest I have got to[3], and I basically stopped
>> there about a year ago because I could not figure it out and got
>> frustrated :s.
>>
>> I started again on this project 2 weeks ago, with the intent of finding
>> a good-enough solution for nouveau, and modelling the rest of the
>> equation that would allow me to compute what duty I should set for
>> every wanted fan speed (%). I again mostly succeeded... but it would
>> seem that the interpretation of the table depends on the generation of
>> chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
>> proprietary driver is not consistent for rules such as what to do when
>> the computed duty value is going to be lower than 0 (sometimes we
>> clamp it to 0, sometimes we set it to the same value as the divider,
>> sometimes we set it to a slightly lower value than the divider).
>>
>> I have been trying to cover all edge cases by generating a randomized
>> set of values for the PWM frequency, pwm_max, and pwm_offset,
>> flashing the vbios, and iterating from 0% to 100% fan speed while dumping
>> the values set by your driver. Using half a million sample points (which
>> took a week to acquire), my model computes 97% of the values correctly
>> (ignoring off by ones), while the remaining 3% are worryingly off (by up
>> to 100%)... It is clear that the code is not trivial and is full of
>> branching, which makes clean-room reverse engineering a chore.
>>
>> As a final attempt to make a somewhat complete solution, I tried this
>> weekend to make a "safe" model that would still make the GPUs quiet. I
>> managed to improve the pass rate from 97 to 99.6%, but the remaining
>> failures conflict with my previous findings, which are also way more
>> prevalent. In the end, the only completely-safe way of driving the fan
>> is the current behaviour of nouveau...
>>
>> At this point, I am ready to throw in the towel and hardcode parameters
>> in nouveau to address the problem of the loudest GPUs, but this is of
>> course suboptimal. This is why I am asking for your help. Would you have
>> some documentation about this fan calibration table that could help me
>> here? Code would be even more appreciated.
>>
>> Thanks a lot in advance,
>> Martin
>>
>> PS: here is most of the code you may want to see:
>> http://fs.mupuf.org/nvidia/fan_calib/
>>
>> [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
>> [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
>> [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
>>
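
For readers following the thread, the "aX + b" reading of pwm_max and
pwm_offset described in the quoted message can be sketched roughly as below.
This is only an illustration under stated assumptions, not the formula used by
nouveau or by the proprietary driver: the names fan_duty and Underflow, the
percent scaling, and the treatment of pwm_offset as a signed value are all
hypothetical. The quoted message also makes clear that the real output is not
continuous and differs between chipset generations, which this sketch does not
attempt to model. The third observed rule (a value slightly below the divider)
is left out because the amount is not given.

```python
from enum import Enum


class Underflow(Enum):
    """Hypothetical names for two of the underflow rules observed in the thread."""
    CLAMP_ZERO = "clamp_zero"   # duty clamped to 0
    DIVIDER = "divider"         # duty set to the same value as the divider


def fan_duty(speed_pct, pwm_max, pwm_offset, divider,
             underflow=Underflow.CLAMP_ZERO):
    """Naive linear mapping duty = a*X + b (an assumption, not the real formula).

    speed_pct  -- wanted fan speed, 0..100 (the X in aX + b)
    pwm_max    -- assumed to scale the slope a
    pwm_offset -- assumed to be the signed offset b
    divider    -- PWM divider, used here as the upper bound for the duty
    """
    if not 0 <= speed_pct <= 100:
        raise ValueError("speed_pct must be within 0..100")

    raw = (pwm_max * speed_pct) // 100 + pwm_offset

    if raw < 0:
        # The thread reports that the observed rule is not consistent here.
        return 0 if underflow is Underflow.CLAMP_ZERO else divider

    # Never report a duty above the divider.
    return min(raw, divider)


if __name__ == "__main__":
    for pct in (0, 50, 100):
        print(pct, fan_duty(pct, pwm_max=1000, pwm_offset=100, divider=2000))
```

Switching the underflow argument shows how the same table values could yield
either a stopped fan or a fully driven one, which is why the inconsistency
matters for a "safe" model.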
