Re: [Nouveau] Addressing the problem of noisy GPUs under Nouveau

2017-11-22 Thread Martin Peres
Hey,

Thanks for your answer, Andy!

On 22/11/17 04:06, Ilia Mirkin wrote:
> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger  wrote:
>> Hi Martin,
> 
> Martin should have complete answers,
> 
>>
>> I was asked to clarify a few things:
>>
>> (1) Are all the user reports of loud fans on Fermi-era GPUs?
> 
> Yes. Although I believe some GK208 users are also having trouble,
> including yours truly. (It's been quite a while since I've checked
> though... my memory is weak in that regard.)

We did not hear back from a lot of users about these issues, but I can
see that most GF108 vbios in our vbios repo are problematic, and some
GK106/GT215/GT216/GT218 might be too.

> 
>>
>> (2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon
>> processor (PMU), which will do basic fan management on its own.  We call this
>> init ucode "IFR" (Init From ROM).  nvidia.ko will restore the IFR ucode when
>> unloaded.  I assume the loud fan symptom occurs after Nouveau is loaded and
>> running, correct?  I.e., this is a problem in Nouveau's fan control
>> programming, rather than a problem in IFR.
> 
> Correct.

Indeed.

> 
>>
>> (3) IFR will run until something else is loaded on the Falcon processor 
>> (PMU).
>> On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written
>> ucode from here:
>>
>> drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc
>>
>> correct?  I only ask to rule out the possibility that IFR and Nouveau are 
>> both
>> attempting to program fans simultaneously.  The symptoms you describe don't
>> sound like that, but just double checking...
> 
> Correct.

Indeed.

> 
>>
>> (4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on 
>> Fermi,
>> how much does Nouveau strictly depend on Nouveau's PMU ucode?  Would it be an
>> option to just let IFR continue to manage fans?
> 
> Reclocking is still on our horizon, which clearly won't happen without
> nouveau PMU code loaded. Not sure what it's used for until reclocking
> becomes a thing on Fermi.

Yeah, this would hinder our reclocking efforts :s

The best idea I can come up with is to fake the temperature (register
0x20408) to 1°C (minimum the hardware allows us) and read the PWM duty,
then we can get the maximum duty by setting the temperature to the
fan_boost threshold.

Not sure we have a sure-way of computing the fan_boost threshold though,
maybe we can just use of the thermal throttling threshold for this (more
on this later in the email).

In any case, all of these solutions are workarounds. Given that the code
to compute these values is already found in vbioses, why is it a problem
to share the meaning of all the values in the fan calibration table,
and/or the algorithm?

> 
>>
>> (5) Lastly, I was asked how Nouveau determines what fan speed to (attempt
>> to) program.

Oh, thanks for giving me an idea about what the other values in this
table may be about :D

Anyways, the current code uses the entry id 0x46 of the thermal table
(bit P, offset 0x10) to find out what are the thermal points for
$fan_min and $fan_max. The $fan_min and $fan_max values are found in the
entry id 0x22 of the same table.

If the 0x46 entry is not present in the thermal table (which seems to be
the norm for Fermi), we revert to default values: 40 -> 85°C.

With these 4 values, we get 2 trip points (temp_min, fan_min)
and(temp_max, fan_max), and we merely do linear interpolation between them.

> 
> I'll let Martin answer this, but as you're probably aware, there's 2
> different ways this can be done - there might be a PWM, we might have
> to toggle it manually. Maybe something else still.

The manual toggle fans are only present on pre-tesla GPUs, let's ignore
them here, because we know what to do there.

All recent (2006+) GPUs use PWM, and anything after the GT215 use this
fan calibration table which took me a while to find, and that is still
mostly a mystery to me :s

> 
> Have a look at drm/nouveau/nvkm/subdev/therm/fan.c and the various
> bits it ends up calling (pre-GF119 fermi's end up with the nv50
> fan_set, I believe).
> 
> The bios stuff is parsed in nvkm/subdev/bios/fan.c and therm.c,
> although I believe Martin's latest analysis is more advanced than
> what's in that code.

Absolutely :) I have not updated Nouveau yet, in fear of setting a value
lower than what the proprietary driver does...

> 
> Martin's question was very long, but it boils down to this:
> 
> How do we compute the correct values to write into the e114/e118 pwm
> registers based on the VBIOS contents and current state of the board
> (like temperature).

Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on
GF119+, or 0x200cd/d0 on Kepler+.

At least, it looks like we know which PWM controler we need to drive, so
I did not want to muddy the water even more by giving register
addresses, rather concentrating on the problem at hand: How to compute
the duty value for the PWM controler.

> 
> We generally do this right, but 

[Nouveau] [Bug 91523] [NVE7] driver cannot initialize gpu(failed to parse ramcfg data)

2017-11-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=91523

--- Comment #6 from FeepingCreature  ---
Looking at envytools' nvbios decode, given ramcfg=1, the first entry of the
timing mapping table (0x718f) has a value of 0xf at index 1, offset 0. This is
taken as index into the timing table (0x733c); however, the timing table only
defines 12 entries, and only six of them are nonzero.

If I simply skip entry 0, reclocking works fine. Maybe that's what 0x0f
indicates?

-- 
You are receiving this mail because:
You are the assignee for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [Bug 91523] [NVE7] driver cannot initialize gpu(failed to parse ramcfg data)

2017-11-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=91523

--- Comment #5 from default_357-l...@yahoo.de ---
I still get this error in 4.14.1. Same laptop, same vbios.rom. If I comment out
the "return ret" in
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgk104.c#L1577,
it boots and runs 3d acceleration with xrandr offload, but cannot reclock
successfully.

-- 
You are receiving this mail because:
You are the assignee for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [Bug 100146] nouveau system lockup -> fifo: read fault engine 15 [PCE0] client 01 [PCOPY0] reason 02 [PAGE_NOT_PRESENT] on channel

2017-11-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=100146

--- Comment #2 from Ilia Mirkin  ---
(In reply to t-IX from comment #1)
> I think i have same problem.

If you ever think you have the same problem as someone, file a new bug, and
make a mention of the fact that it *might* be related to some other bug.

All you're doing now is spamming other people's unrelated issues.

-- 
You are receiving this mail because:
You are the assignee for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


[Nouveau] [Bug 100146] nouveau system lockup -> fifo: read fault engine 15 [PCE0] client 01 [PCOPY0] reason 02 [PAGE_NOT_PRESENT] on channel

2017-11-22 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=100146

--- Comment #1 from t-IX  ---
I think i have same problem.
Can reproduce error by play 4k video in mpv.
Or start Steam bigscreen mode.

I'm trying to install nvidia firmware, but system feels very slow and
unresponsive, and i can't test video or steam.


Linux 4.12.12-gentoo x86_64 Gentoo

Nov 22 00:51:36 kernel: nouveau :01:00.0: gr: TRAP ch 5 [003fb72000
X[12943]]
Nov 22 00:51:36 kernel: nouveau :01:00.0: gr: GPC0/PROP trap: 0400
[RT_LINEAR_MISMATCH] x = 2048, y = 1024, format = 18, storage type = 0
Nov 22 00:51:36 kernel: nouveau :01:00.0: fifo: write fault at 0009612000
engine 00 [PGRAPH] client 0f [GPC0/PROP] reason 02 [PAGE_NOT_PRESENT] on chann
el 5 [003fb72000 X[12943]]
Nov 22 00:51:36 kernel: nouveau :01:00.0: fifo: gr engine fault on channel
5, recovering...
Nov 22 00:51:36 kernel: nouveau :01:00.0: X[12943]: channel 5 killed!
Nov 22 00:52:49 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
10 [X[12943]]
Nov 22 00:53:04 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
10 [X[12943]]
Nov 22 00:53:04 kernel: nouveau :01:00.0: fifo: read fault at 013000
engine 07 [PFIFO] client 07 [BAR_READ] reason 02 [PAGE_NOT_PRESENT] on channel 
10 [003f8ae000 X[12943]]
Nov 22 00:53:04 kernel: nouveau :01:00.0: fifo: fifo engine fault on
channel 10, recovering...
Nov 22 00:53:04 kernel: nouveau :01:00.0: X[12943]: channel 10 killed!
Nov 22 00:53:19 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
13 [X[12943]]
Nov 22 00:53:34 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
13 [X[12943]]
Nov 22 00:53:49 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
9 [X[12943]]
Nov 22 00:54:04 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
9 [X[12943]]
Nov 22 00:54:19 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
6 [X[12943]]
Nov 22 00:54:34 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
6 [X[12943]]
Nov 22 00:54:49 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
12 [X[12943]]
Nov 22 00:55:04 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
12 [X[12943]]
Nov 22 00:55:19 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
7 [X[12943]]
Nov 22 00:55:34 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
7 [X[12943]]
Nov 22 00:55:49 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
4 [X[12943]]
Nov 22 00:56:04 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
4 [X[12943]]
Nov 22 00:56:19 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
3 [X[12943]]
Nov 22 00:56:34 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
3 [X[12943]]
Nov 22 00:56:49 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
2 [X[12943]]
Nov 22 00:57:04 kernel: nouveau :01:00.0: X[12943]: failed to idle channel
2 [X[12943]]

Chipset: "NVIDIA NVC1"
NVIDIA Corporation GF108 [GeForce GT 430] (rev a1)
Digital interface is DisplayPort

-- 
You are receiving this mail because:
You are the assignee for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v2] drm: don't continue with anything after the GPU couldn't be woken up

2017-11-22 Thread Karol Herbst
On Wed, Nov 22, 2017 at 11:31 AM, Thierry Reding
 wrote:
> On Tue, Nov 21, 2017 at 08:03:20PM +0100, Karol Herbst wrote:
>> On Tue, Nov 21, 2017 at 6:46 PM, Thierry Reding
>>  wrote:
>> > On Tue, Nov 21, 2017 at 04:01:16PM +0100, Karol Herbst wrote:
>> >> This should make systems more stable where resuming the GPU fails. This
>> >> can happen due to bad firmware or due to a bug within the kernel. The
>> >> last thing which should happen in either case is an unusable system.
>> >>
>> >> v2: do the same in nouveau_pmops_resume
>> >>
>> >> Tested-by: Karl Hastings 
>> >> Signed-off-by: Karol Herbst 
>> >> ---
>> >>  drm/nouveau/nouveau_drm.c | 31 +++
>> >>  1 file changed, 23 insertions(+), 8 deletions(-)
>> >>
>> >> diff --git a/drm/nouveau/nouveau_drm.c b/drm/nouveau/nouveau_drm.c
>> >> index 8d4a5be3..6e4cb4f7 100644
>> >> --- a/drm/nouveau/nouveau_drm.c
>> >> +++ b/drm/nouveau/nouveau_drm.c
>> >> @@ -792,6 +792,27 @@ nouveau_pmops_suspend(struct device *dev)
>> >>   return 0;
>> >>  }
>> >>
>> >> +static int
>> >> +nouveau_set_power_state_D0(struct pci_dev *pdev)
>> >> +{
>> >> + struct nouveau_drm *drm = nouveau_drm(pci_get_drvdata(pdev));
>> >> + int ret;
>> >> +
>> >> + pci_set_power_state(pdev, PCI_D0);
>> >> + /* abort if anything went wrong */
>> >> + if (pdev->current_state != PCI_D0) {
>> >> + NV_ERROR(drm, "couldn't wake up GPU!\n");
>> >> + return -EBUSY;
>> >> + }
>> >
>> > Looks to me like the more idiomatic way to do this is:
>> >
>> > ret = pci_set_power_state(pdev, PCI_D0);
>> > if (ret < 0 && ret != -EIO)
>> > return ret;
>> >
>>
>> I thought so too, but it ends up returning 0 even if setting the power
>> state fails. Or maybe I did something wrong when installing the
>> kernel. I could take another shot at it, but what I came up with seems
>> to work. Adding airlied in CC, because he saw my patch and didn't
>> complain about it. Hopefully he knows more.
>
> pci_raw_set_power_state(), called by pci_set_power_state(), contains
> this, which looks to me like it would be the only case where the problem
> you're describing could be coming from:
>
> dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> if (dev->current_state != state && printk_ratelimit())
> dev_info(>dev, "Refused to change power state, currently 
> in D%d\n",
>  dev->current_state);
>
> Do you happen to see this in the kernel logs? Perhaps this should be
> considered an error rather than just an KERN_INFO level message?
>
> Adding Bjorn and linux-pci for visibility.
>
> Thierry

yeah, that is the error we have in dmesg.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH 03/32] therm: Split return code and value in nvkm_get_temp

2017-11-22 Thread Karol Herbst
On Wed, Nov 22, 2017 at 9:29 AM, Martin Peres  wrote:
> On 22/11/17 03:42, Karol Herbst wrote:
>> On Wed, Nov 22, 2017 at 1:32 AM, Martin Peres  wrote:
>>> On 17/11/17 02:04, Karol Herbst wrote:
 The current hwmon code doesn't check if the returned value was actually an
 error.

 Since Kepler temperature sensors are able to report negative values. Those
 negative values are not for error reporting, but rather when you buried
 your GPU in snow somewhere in Antarctica and still want a valid
 temperature to be reported (unverified).

 Since Pascal (and maybe earlier) we have sensors with improved precision.

 Adjust the nvkm_get_temp method to be able to deal with those changes
 and let hwmon return an error properly.
>>>
>>> Where did you get this information? And if it is the case, then we will
>>> royally screw up the value because we don't even know on how many bits
>>> 0x20400 uses...
>>>
>>> If you are indeed right, I would like to see g84.c's temp_get().
>>>
>>
>> Nvidia uses 0x020460 on Pascal and I thought we had that in envytools 
>> already?
>
> Nothing in nvbios.
>
>>
>> [0] 228.412618 MMIO32 R 0x020460 0x20002f20 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20002f20
>> [0] 229.413039 MMIO32 R 0x020460 0x20002f10 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20002f10
>> [0] 230.416976 MMIO32 R 0x020460 0x20002f68 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20002f68
>> [0] 231.417407 MMIO32 R 0x020460 0x20002fe8 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20002fe8
>> [0] 232.417742 MMIO32 R 0x020460 0x20002ff0 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20002ff0
>> [0] 233.417742 MMIO32 R 0x020460 0x20003020 PTHERM.I2C_SLAVE+0x60 => 
>> 0x20003020
>>
>> or nvapeek 0x20400 && nvapeek 0x020460;
>> 00020400: 002c
>> 00020460: 20002c60
>
> Cool that they increased the accuracy, but where do you get that the
> value can be negative? And if it is, how many bits are used? 16 or more?
> What are those high bits?
>

I was asking rhyskidd to add stuff to rnndb, because he wrote the
nouveau patch for those sensors, check out for more details:
https://github.com/skeggsb/nouveau/blob/master/drm/nouveau/nvkm/subdev/therm/gp100.c#L27

regarding negative values, we have 0x02044c + 0x020450 for that and I
could imagine that 0x020460 can display negative values as well...
nobody tried for sure, but it seems plausible.

Anyway, moving from u8 to int should make life much easier for us if
we decide to really care about the higher precision. Currently our
code can't handle that for multiple reasons, but this solves one of
those at least.

>>

 Signed-off-by: Karol Herbst 
 ---
  drm/nouveau/include/nvkm/subdev/clk.h   |  4 ++--
  drm/nouveau/include/nvkm/subdev/therm.h |  2 +-
  drm/nouveau/nouveau_hwmon.c | 15 +--
  drm/nouveau/nvkm/subdev/clk/base.c  |  2 +-
  drm/nouveau/nvkm/subdev/therm/base.c| 19 ++-
  drm/nouveau/nvkm/subdev/therm/g84.c | 13 -
  drm/nouveau/nvkm/subdev/therm/gp100.c   |  9 +
  drm/nouveau/nvkm/subdev/therm/nv40.c|  9 +++--
  drm/nouveau/nvkm/subdev/therm/nv50.c|  9 +++--
  drm/nouveau/nvkm/subdev/therm/priv.h|  4 ++--
  drm/nouveau/nvkm/subdev/therm/temp.c| 16 
  11 files changed, 60 insertions(+), 42 deletions(-)

 diff --git a/drm/nouveau/include/nvkm/subdev/clk.h 
 b/drm/nouveau/include/nvkm/subdev/clk.h
 index e5275f74..506f8cc6 100644
 --- a/drm/nouveau/include/nvkm/subdev/clk.h
 +++ b/drm/nouveau/include/nvkm/subdev/clk.h
 @@ -100,7 +100,7 @@ struct nvkm_clk {
   int ustate_dc; /* user-requested (-1 disabled, -2 perfmon) */
   int astate; /* perfmon adjustment (base) */
   int dstate; /* display adjustment (min+) */
 - u8  temp;
 + int temp;

   bool allow_reclock;
  #define NVKM_CLK_BOOST_NONE 0x0
 @@ -122,7 +122,7 @@ int nvkm_clk_read(struct nvkm_clk *, enum nv_clk_src);
  int nvkm_clk_ustate(struct nvkm_clk *, int req, int pwr);
  int nvkm_clk_astate(struct nvkm_clk *, int req, int rel, bool wait);
  int nvkm_clk_dstate(struct nvkm_clk *, int req, int rel);
 -int nvkm_clk_tstate(struct nvkm_clk *, u8 temperature);
 +int nvkm_clk_tstate(struct nvkm_clk *, int temperature);

  int nv04_clk_new(struct nvkm_device *, int, struct nvkm_clk **);
  int nv40_clk_new(struct nvkm_device *, int, struct nvkm_clk **);
 diff --git a/drm/nouveau/include/nvkm/subdev/therm.h 
 b/drm/nouveau/include/nvkm/subdev/therm.h
 index 9841f076..8c84017f 100644
 --- a/drm/nouveau/include/nvkm/subdev/therm.h
 +++ b/drm/nouveau/include/nvkm/subdev/therm.h
 @@ -86,7 +86,7 @@ struct nvkm_therm {
   int (*attr_set)(struct nvkm_therm *, enum nvkm_therm_attr_type, int);
  };

 -int nvkm_therm_temp_get(struct nvkm_therm *);
 +int 

Re: [Nouveau] [PATCH v2] drm: don't continue with anything after the GPU couldn't be woken up

2017-11-22 Thread Thierry Reding
On Tue, Nov 21, 2017 at 08:03:20PM +0100, Karol Herbst wrote:
> On Tue, Nov 21, 2017 at 6:46 PM, Thierry Reding
>  wrote:
> > On Tue, Nov 21, 2017 at 04:01:16PM +0100, Karol Herbst wrote:
> >> This should make systems more stable where resuming the GPU fails. This
> >> can happen due to bad firmware or due to a bug within the kernel. The
> >> last thing which should happen in either case is an unusable system.
> >>
> >> v2: do the same in nouveau_pmops_resume
> >>
> >> Tested-by: Karl Hastings 
> >> Signed-off-by: Karol Herbst 
> >> ---
> >>  drm/nouveau/nouveau_drm.c | 31 +++
> >>  1 file changed, 23 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drm/nouveau/nouveau_drm.c b/drm/nouveau/nouveau_drm.c
> >> index 8d4a5be3..6e4cb4f7 100644
> >> --- a/drm/nouveau/nouveau_drm.c
> >> +++ b/drm/nouveau/nouveau_drm.c
> >> @@ -792,6 +792,27 @@ nouveau_pmops_suspend(struct device *dev)
> >>   return 0;
> >>  }
> >>
> >> +static int
> >> +nouveau_set_power_state_D0(struct pci_dev *pdev)
> >> +{
> >> + struct nouveau_drm *drm = nouveau_drm(pci_get_drvdata(pdev));
> >> + int ret;
> >> +
> >> + pci_set_power_state(pdev, PCI_D0);
> >> + /* abort if anything went wrong */
> >> + if (pdev->current_state != PCI_D0) {
> >> + NV_ERROR(drm, "couldn't wake up GPU!\n");
> >> + return -EBUSY;
> >> + }
> >
> > Looks to me like the more idiomatic way to do this is:
> >
> > ret = pci_set_power_state(pdev, PCI_D0);
> > if (ret < 0 && ret != -EIO)
> > return ret;
> >
> 
> I thought so too, but it ends up returning 0 even if setting the power
> state fails. Or maybe I did something wrong when installing the
> kernel. I could take another shot at it, but what I came up with seems
> to work. Adding airlied in CC, because he saw my patch and didn't
> complain about it. Hopefully he knows more.

pci_raw_set_power_state(), called by pci_set_power_state(), contains
this, which looks to me like it would be the only case where the problem
you're describing could be coming from:

dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
if (dev->current_state != state && printk_ratelimit())
dev_info(>dev, "Refused to change power state, currently 
in D%d\n",
 dev->current_state);

Do you happen to see this in the kernel logs? Perhaps this should be
considered an error rather than just an KERN_INFO level message?

Adding Bjorn and linux-pci for visibility.

Thierry


signature.asc
Description: PGP signature
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH 03/32] therm: Split return code and value in nvkm_get_temp

2017-11-22 Thread Martin Peres
On 22/11/17 03:42, Karol Herbst wrote:
> On Wed, Nov 22, 2017 at 1:32 AM, Martin Peres  wrote:
>> On 17/11/17 02:04, Karol Herbst wrote:
>>> The current hwmon code doesn't check if the returned value was actually an
>>> error.
>>>
>>> Since Kepler temperature sensors are able to report negative values. Those
>>> negative values are not for error reporting, but rather when you buried
>>> your GPU in snow somewhere in Antarctica and still want a valid
>>> temperature to be reported (unverified).
>>>
>>> Since Pascal (and maybe earlier) we have sensors with improved precision.
>>>
>>> Adjust the nvkm_get_temp method to be able to deal with those changes
>>> and let hwmon return an error properly.
>>
>> Where did you get this information? And if it is the case, then we will
>> royally screw up the value because we don't even know on how many bits
>> 0x20400 uses...
>>
>> If you are indeed right, I would like to see g84.c's temp_get().
>>
> 
> Nvidia uses 0x020460 on Pascal and I thought we had that in envytools already?

Nothing in nvbios.

> 
> [0] 228.412618 MMIO32 R 0x020460 0x20002f20 PTHERM.I2C_SLAVE+0x60 => 
> 0x20002f20
> [0] 229.413039 MMIO32 R 0x020460 0x20002f10 PTHERM.I2C_SLAVE+0x60 => 
> 0x20002f10
> [0] 230.416976 MMIO32 R 0x020460 0x20002f68 PTHERM.I2C_SLAVE+0x60 => 
> 0x20002f68
> [0] 231.417407 MMIO32 R 0x020460 0x20002fe8 PTHERM.I2C_SLAVE+0x60 => 
> 0x20002fe8
> [0] 232.417742 MMIO32 R 0x020460 0x20002ff0 PTHERM.I2C_SLAVE+0x60 => 
> 0x20002ff0
> [0] 233.417742 MMIO32 R 0x020460 0x20003020 PTHERM.I2C_SLAVE+0x60 => 
> 0x20003020
> 
> or nvapeek 0x20400 && nvapeek 0x020460;
> 00020400: 002c
> 00020460: 20002c60

Cool that they increased the accuracy, but where do you get that the
value can be negative? And if it is, how many bits are used? 16 or more?
What are those high bits?

> 
>>>
>>> Signed-off-by: Karol Herbst 
>>> ---
>>>  drm/nouveau/include/nvkm/subdev/clk.h   |  4 ++--
>>>  drm/nouveau/include/nvkm/subdev/therm.h |  2 +-
>>>  drm/nouveau/nouveau_hwmon.c | 15 +--
>>>  drm/nouveau/nvkm/subdev/clk/base.c  |  2 +-
>>>  drm/nouveau/nvkm/subdev/therm/base.c| 19 ++-
>>>  drm/nouveau/nvkm/subdev/therm/g84.c | 13 -
>>>  drm/nouveau/nvkm/subdev/therm/gp100.c   |  9 +
>>>  drm/nouveau/nvkm/subdev/therm/nv40.c|  9 +++--
>>>  drm/nouveau/nvkm/subdev/therm/nv50.c|  9 +++--
>>>  drm/nouveau/nvkm/subdev/therm/priv.h|  4 ++--
>>>  drm/nouveau/nvkm/subdev/therm/temp.c| 16 
>>>  11 files changed, 60 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/drm/nouveau/include/nvkm/subdev/clk.h 
>>> b/drm/nouveau/include/nvkm/subdev/clk.h
>>> index e5275f74..506f8cc6 100644
>>> --- a/drm/nouveau/include/nvkm/subdev/clk.h
>>> +++ b/drm/nouveau/include/nvkm/subdev/clk.h
>>> @@ -100,7 +100,7 @@ struct nvkm_clk {
>>>   int ustate_dc; /* user-requested (-1 disabled, -2 perfmon) */
>>>   int astate; /* perfmon adjustment (base) */
>>>   int dstate; /* display adjustment (min+) */
>>> - u8  temp;
>>> + int temp;
>>>
>>>   bool allow_reclock;
>>>  #define NVKM_CLK_BOOST_NONE 0x0
>>> @@ -122,7 +122,7 @@ int nvkm_clk_read(struct nvkm_clk *, enum nv_clk_src);
>>>  int nvkm_clk_ustate(struct nvkm_clk *, int req, int pwr);
>>>  int nvkm_clk_astate(struct nvkm_clk *, int req, int rel, bool wait);
>>>  int nvkm_clk_dstate(struct nvkm_clk *, int req, int rel);
>>> -int nvkm_clk_tstate(struct nvkm_clk *, u8 temperature);
>>> +int nvkm_clk_tstate(struct nvkm_clk *, int temperature);
>>>
>>>  int nv04_clk_new(struct nvkm_device *, int, struct nvkm_clk **);
>>>  int nv40_clk_new(struct nvkm_device *, int, struct nvkm_clk **);
>>> diff --git a/drm/nouveau/include/nvkm/subdev/therm.h 
>>> b/drm/nouveau/include/nvkm/subdev/therm.h
>>> index 9841f076..8c84017f 100644
>>> --- a/drm/nouveau/include/nvkm/subdev/therm.h
>>> +++ b/drm/nouveau/include/nvkm/subdev/therm.h
>>> @@ -86,7 +86,7 @@ struct nvkm_therm {
>>>   int (*attr_set)(struct nvkm_therm *, enum nvkm_therm_attr_type, int);
>>>  };
>>>
>>> -int nvkm_therm_temp_get(struct nvkm_therm *);
>>> +int nvkm_therm_temp_get(struct nvkm_therm *, int *);
>>>  int nvkm_therm_fan_sense(struct nvkm_therm *);
>>>  int nvkm_therm_cstate(struct nvkm_therm *, int, int);
>>>
>>> diff --git a/drm/nouveau/nouveau_hwmon.c b/drm/nouveau/nouveau_hwmon.c
>>> index 7c965648..7486e4af 100644
>>> --- a/drm/nouveau/nouveau_hwmon.c
>>> +++ b/drm/nouveau/nouveau_hwmon.c
>>> @@ -326,8 +326,9 @@ nouveau_temp_is_visible(const void *data, u32 attr, int 
>>> channel)
>>>  {
>>>   struct nouveau_drm *drm = nouveau_drm((struct drm_device *)data);
>>>   struct nvkm_therm *therm = nvxx_therm(>client.device);
>>> + int val;
>>>
>>> - if (therm && therm->attr_get && nvkm_therm_temp_get(therm) < 0)
>>> + if (therm && therm->attr_get && nvkm_therm_temp_get(therm, ))
>>>   return 0;
>>>
>>>   switch