[Bug 108585] *ERROR* hw_init of IP block failed -22

2019-01-08 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #19 from Dan Horák  ---
They (=
https://src.fedoraproject.org/fork/sharkcz/rpms/kernel/blob/talos/f/ppc64-talos-amdgpu-reset.patch)
can go upstream.

I have a 4.20 kernel on the host with recent firmware for polaris11; the
skiroot boot environment (4.15 IIRC) uses polaris11 firmware files from
20181015. The system boots OK in the skiroot kernel -> kexec -> host kernel
sequence.

The diff between the firmware sets is
--- a-old   2019-01-08 14:34:02.300578731 +0100
+++ a-new   2019-01-08 14:34:14.519024951 +0100
@@ -1,19 +1,21 @@
 5dc1006ce1896d997232b9165f2ce8ebad52a63c  polaris11_ce.bin
 193b0baaabf2a0a9e37a201a251013f26b0b70ee  polaris11_ce_2.bin
+8ee2e5db95fe589e8292642e638a15d1b8291bcb  polaris11_k_mc.bin
 cf2b8cd1d7f723f0edb6f17123711d6fa21ef379  polaris11_k_smc.bin
+a4ab9c0484cc9957112ab803d88e5b967c412c01  polaris11_k2_smc.bin
 f6551d45c0b652955009560bea1694c5ca86c1af  polaris11_mc.bin
 a3bfc83f5f52978365428e8756ed165655984a3b  polaris11_me.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272  polaris11_mec.bin
-c960ba31f13806c889d076bbe0b796281fdb0075  polaris11_mec_2.bin
+4e2290c5e030d6211168802fbe60db53b7c076c5  polaris11_mec_2.bin
 eb17beac2b09e25cfdc4662afc353f51a1c23272  polaris11_mec2.bin
-a13f4c7ce6e30ed930c7a5756196b24114247691  polaris11_mec2_2.bin
+f7956bba6312950db2de12336f79ccb28201593a  polaris11_mec2_2.bin
 a7fb9fab4529707592ce3dd449cd27fbf415fb94  polaris11_me_2.bin
 6377d75775fbe5353c8397ff03d230be0f4d6bcf  polaris11_pfp.bin
-af8c1b94170ada698379d1dab33924003ee525c0  polaris11_pfp_2.bin
+01eda59a1f159889d9a0ea2a9744eae4e09eaa1c  polaris11_pfp_2.bin
 38c512c82fe4773f33e53ba1ada414ddbe9b9e09  polaris11_rlc.bin
 82d8fcf56ac3051981b9e70199b115ed9d46995f  polaris11_sdma.bin
 8b21e98cb7e0ab000d131543c13a3ed95aa6687a  polaris11_sdma1.bin
 e01ac87abb011582d1da84eda9444353de082d11  polaris11_smc.bin
-6b804243472b5653ba449106426a0da1c46a9d84  polaris11_smc_sk.bin
+f8680ef51f84df00b388d9c230a28d150836ce08  polaris11_smc_sk.bin
 85a2f70f1f3b63e02a1bfbaba73a1729cee2104e  polaris11_uvd.bin
 a9abead599bb8497f38d587f682726a00bc067d2  polaris11_vce.bin
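
A listing like the one above (two files added, four changed) can be reproduced
with a short script. This is only a sketch: the directory layout and the
function names are assumptions for illustration, not part of the original
report.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of one file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def firmware_sums(directory: Path, pattern: str = "polaris11_*.bin") -> dict[str, str]:
    """Map file name -> SHA-1 for every firmware file matching the pattern."""
    return {p.name: sha1_of(p) for p in sorted(directory.glob(pattern))}


def compare_sets(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify file names as added, removed, or changed between two sets."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }
```

Calling compare_sets(firmware_sums(old_dir), firmware_sums(new_dir)) on two
unpacked firmware directories (hypothetical paths) returns the names in each
category.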

-- 
You are receiving this mail because:
You are the assignee for the bug.
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-12-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Alex Deucher  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugs.freedesktop.org/show_bug.cgi?id=108754



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-12-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #18 from Alex Deucher  ---
(In reply to Dan Horák from comment #17)
> Fedora/ppc64le users can find a pre-built kernel with the patchset at
> https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/

Should these patches go upstream?  Can you confirm they fix your issues?



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-11-01 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #17 from Dan Horák  ---
Fedora/ppc64le users can find a pre-built kernel with the patchset at
https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/build/817728/



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-11-01 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #16 from Dan Horák  ---
Reset on init sounds better to me as the loader kernel (in kexec case) is more
difficult to update than the host kernel.

And for the record - after updating the skiroot kernel firmware version to the
latest there is no problem/crash.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #15 from Alex Deucher  ---
Created attachment 142316
  --> https://bugs.freedesktop.org/attachment.cgi?id=142316&action=edit
more involved fix

These patches attempt to reset the GPU on init if the GPU was already running
from a previous load of the driver.  Compile tested only at the moment.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #14 from Alex Deucher  ---
Created attachment 142303
  --> https://bugs.freedesktop.org/attachment.cgi?id=142303&action=edit
possible fix

(In reply to Benjamin Herrenschmidt from comment #12)
> We have no control on what firmware is loaded by the target distro so the
> right thing is going to be to reset the adapter.
> 
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

Does the attached patch help?  I'd been hesitant to add reset to the shutdown
path because it adds latency to the regular shutdown path and users complain
when that slows down.

> 
> Do you have details of what specific PCIe config space write you use ? FLR ?

The reset sequence is ASIC-specific.  Older parts just happened to use PCI
config space to trigger a GPU reset via an AMD-specific sequence.  Newer GPUs
reset via the PSP.  FLR is only available on SR-IOV capable SKUs, so it's not a
general solution.
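
To see which generic reset methods the kernel believes a given PCI device
supports, sysfs can be inspected. The sketch below assumes the standard
/sys/bus/pci/devices layout; the "reset" file is present only when the kernel
knows some reset method for the device, and "reset_method" exists only on newer
kernels. The function name and the returned dictionary shape are illustrative
choices, and none of this covers the AMD-specific sequence mentioned above.

```python
from pathlib import Path


def reset_info(bdf: str, sysfs_root: Path = Path("/sys/bus/pci/devices")) -> dict:
    """Report what sysfs says about resetting the PCI device at address bdf."""
    device = sysfs_root / bdf
    method_file = device / "reset_method"
    methods = []
    if method_file.is_file():
        # Newer kernels list the supported methods separated by spaces.
        methods = method_file.read_text().split()
    return {
        "present": device.is_dir(),
        "resettable": (device / "reset").exists(),
        "methods": methods,
    }
```

The address passed in (domain:bus:device.function) is whatever lspci -D reports
for the card; a missing "flr" entry would be consistent with the statement that
FLR is not generally available.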



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #13 from Christian König  ---
(In reply to Benjamin Herrenschmidt from comment #12)
> We'll probably need to add something to the amdgpu shutdown() path to force
> an adapter reset.

If that were possible, we would already have done it.

The problem is that you do a full ASIC reset. So not only the GPU is affected,
but also bridges, sound codecs etc... If any of those parts have a driver
loaded while you do the reset you usually crash the system.

In addition, AFAIK this doesn't work on APUs, because there the GPU is part of
the CPU, so you would need to reset both.

How about not using amdgpu in the boot loader? For just displaying a
splash screen, vesafb or efifb should do fine as well.

> Do you have details of what specific PCIe config space write you use ? FLR ?

Alex knows the details of that, but a FLR alone doesn't work AFAIK.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #12 from Benjamin Herrenschmidt  ---
We have no control on what firmware is loaded by the target distro so the right
thing is going to be to reset the adapter.

We'll probably need to add something to the amdgpu shutdown() path to force an
adapter reset.

Do you have details of what specific PCIe config space write you use ? FLR ?



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #11 from Dan Horák  ---
Thanks for the info, I've documented that in the Talos wiki under
https://wiki.raptorcs.com/wiki/Troubleshooting/GPU#AMDGPU_driver_crashes_after_firmware_update



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Christian König  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joel.s...@gmail.com

--- Comment #10 from Christian König  ---
*** Bug 108607 has been marked as a duplicate of this bug. ***



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

Christian König  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |NOTABUG
             Status|NEW                         |RESOLVED

--- Comment #9 from Christian König  ---
(In reply to Benjamin Herrenschmidt from comment #6)
> They may or may not be related ... Alex, kexec is how we boot these
> machines, there's a Linux kernel in flash that runs a Linux based bootloader.

Yeah, you guys should have noted that because that combination is known to not
work correctly.

The problem is that some parts of the hardware are explicitly designed in a way
which only allows loading one firmware after an ASIC reset. So as long as kexec
doesn't perform a full PCIe level ASIC reset, the second driver load is
intended to fail.
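
The behaviour described above can be pictured with a toy model. This is a
simplified reading of the explanation, not a description of the actual
hardware or driver: the class, the function, and the firmware labels are all
made up for illustration.

```python
class OneShotFirmwareBlock:
    """Toy hardware block that accepts a single firmware load per ASIC reset."""

    def __init__(self) -> None:
        self.running = None  # firmware image currently running, if any

    def load(self, image: str) -> str:
        """Load image if nothing is running yet; return what is running."""
        if self.running is None:
            self.running = image
        return self.running

    def asic_reset(self) -> None:
        """A full reset clears the block so a new image can be loaded."""
        self.running = None


def driver_init(block: OneShotFirmwareBlock, image: str) -> bool:
    """Model a driver load: it succeeds only if its own image ends up running."""
    return block.load(image) == image


block = OneShotFirmwareBlock()
loader_ok = driver_init(block, "fw-loader")      # first kernel: load accepted
host_ok = driver_init(block, "fw-host")          # after kexec, no reset: rejected
block.asic_reset()
host_after_reset = driver_init(block, "fw-host")  # after a reset: accepted
```

In this model a second load of the same image also succeeds, which matches the
report that identical firmware in the loader and the host kernel avoids the
failure.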

We have the same problem with virtualization and used to have a workaround in
KVM which triggers the ASIC reset with a PCIe config space write. Alex should
know the details.

The only solution I can see is to either use the same workaround as the KVM
guys or use the same firmware for both the loader and the final kernel.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #8 from Dan Horák  ---
I should have mentioned I'm kexec-ing too. It's from 4.15.9 (in skiroot) to
Fedora kernels 4.16, 4.17, 4.18 and now 4.19 over time. It worked fine
until the recent amdgpu firmware update. The skiroot kernel uses amdgpu
firmware from ~June.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #7 from Joel  ---
(In reply to Benjamin Herrenschmidt from comment #6)
> Dan... did you do some firmware changes here ? Could it have to do with the
> versions differences between petitboot and the final kernel ?

FWIW, Talos II machines use a fork of op-build that includes the amdgpu driver
in petitboot. (They also appear to be stuck on 4.16).

I was experimenting with the same.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #6 from Benjamin Herrenschmidt  ---
They may or may not be related ... Alex, kexec is how we boot these machines,
there's a Linux kernel in flash that runs a Linux based bootloader.

Until recently however, that didn't have an amdgpu driver. This might have
changed.

Dan... did you do some firmware changes here ? Could it have to do with the
versions differences between petitboot and the final kernel ?

Alex, whether we track it here or separately, we probably need to look into
kexec support. It's not just us, there's a bit of momentum around kexec based
bootloaders (google's on it too) as was seen at the recent OSFC (firmware
conf).

A workaround in the meantime (for kexec problems) could be to hot reset the
card during the transition I suppose.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #5 from Alex Deucher  ---
(In reply to Joel from comment #4)
> I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree
> mid-merge window).
> 
> My system has a "fiji" card. The first kernel is 4.19 (upstream release),
> and the second kernel where the backtrace occurs is with 4.19+.
> 
> The second kernel is kexec'd from the first.

Please file your own bug.  Kexec is not likely to work and should be tracked
separately.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #4 from Joel  ---
I see a similar backtrace on 4.19.0-11706-g11743c56785c (Linus' tree mid-merge
window).

My system has a "fiji" card. The first kernel is 4.19 (upstream release), and
the second kernel where the backtrace occurs is with 4.19+.

The second kernel is kexec'd from the first.

If I don't load amdgpu in the first kernel, the second kernel works. So there
is something missed in the shutdown path.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #3 from Dan Horák  ---
Ha, so it's the firmware stored in the initrds that is different (lsinitrd
lied). And the latest polaris11* ones provoke the crash. When I manually
replaced them with the ones from the rc8 initrd, I successfully booted into
the 4.19 GA kernel.



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #2 from Dan Horák  ---
(In reply to Michel Dänzer from comment #1)
> There were no amdgpu driver changes between rc8 and final... Are you sure
> this is 100% reproducible with the latter and not reproducible with the
> former? If so, can you bisect?

So far it is 100% reproducible. I will try bisecting the kernel sources and
will also look at what else might have changed.

For the record:
- you can find the kernels at
https://copr.fedorainfracloud.org/coprs/sharkcz/talos-kernel/builds/
- amdgpu.dc=0 or 1 makes no difference (0 is my default value, see bug 107049)
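
To confirm which amdgpu options a running kernel was actually booted with, the
kernel command line can be parsed. This is a minimal sketch: the function name
is made up for illustration, and it does not handle quoted values containing
spaces.

```python
def module_params(cmdline: str, module: str) -> dict[str, str]:
    """Extract module.key=value options for one module from a kernel command line."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params
```

On a live system the input would come from /proc/cmdline, for example
module_params(open("/proc/cmdline").read(), "amdgpu").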



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

--- Comment #1 from Michel Dänzer  ---
There were no amdgpu driver changes between rc8 and final... Are you sure this
is 100% reproducible with the latter and not reproducible with the former? If
so, can you bisect?



[Bug 108585] *ERROR* hw_init of IP block failed -22

2018-10-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=108585

            Bug ID: 108585
           Summary: *ERROR* hw_init of IP block  failed -22
           Product: DRI
           Version: unspecified
          Hardware: PowerPC
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/AMDgpu
          Assignee: dri-devel@lists.freedesktop.org
          Reporter: d...@danny.cz
                CC: bcroc...@redhat.com

Created attachment 142253
  --> https://bugs.freedesktop.org/attachment.cgi?id=142253&action=edit
full dmesg output

amdgpu driver fails to initialize Radeon WX4100 PRO on my Talos Power9 system
with kernel 4.19 (GA). There is no such problem with 4.19-rc8 (and earlier).
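
The -22 in the summary is a negated errno value, as kernel functions
conventionally return; 22 is EINVAL on Linux. A small helper (the name is an
illustrative choice) makes such codes readable:

```python
import errno
import os


def describe_kernel_error(code: int) -> tuple[str, str]:
    """Translate a kernel-style negative errno into its symbol and message."""
    number = abs(code)
    return errno.errorcode.get(number, "UNKNOWN"), os.strerror(number)
```

describe_kernel_error(-22) returns the symbol EINVAL together with the
platform's message text for it.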

...
[2.421393] [drm] amdgpu kernel modesetting enabled.
[2.421512] amdgpu :01:00.0: enabling device (0540 -> 0542)
[2.421732] [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3
0x1002:0x0B0D 0x00).
[2.421776] [drm] register mmio base: 0x
[2.421781] [drm] register mmio size: 262144
[2.421787] [drm] PCI I/O BAR is not found.
[2.421798] [drm] add ip block number 0 
[2.421801] [drm] add ip block number 1 
[2.421805] [drm] add ip block number 2 
[2.421808] [drm] add ip block number 3 
[2.421811] [drm] add ip block number 4 
[2.421814] [drm] add ip block number 5 
[2.421818] [drm] add ip block number 6 
[2.421821] [drm] add ip block number 7 
[2.421824] [drm] add ip block number 8 
[2.421837] [drm] UVD is enabled in VM mode
[2.421840] [drm] UVD ENC is enabled in VM mode
[2.421845] [drm] VCE enabled in VM mode
[2.609475] md/raid1:md127: active with 2 out of 2 mirrors
[2.625800] md127: detected capacity change from 0 to 481708474368
[2.627770] md/raid1:md126: active with 2 out of 2 mirrors
[2.643643] md126: detected capacity change from 0 to 1072693248
[2.769520] usb 1-4: new high-speed USB device number 4 using xhci_hcd
[2.769550] ATOM BIOS: 113-D0150600-103
[2.769747] [drm] vm size is 256 GB, 2 levels, block size is 10-bit,
fragment size is 9-bit
[2.769846] pci :01 : [PE# 00] pseudo-bypass sizes: tracker 32800
bitmap 8192 TCEs 65536
[2.769851] pci :01 : [PE# 00] TCE tables configured for
pseudo-bypass
[2.769903] amdgpu :01:00.0: BAR 2: releasing [mem
0x61000-0x6101f 64bit pref]
[2.769907] amdgpu :01:00.0: BAR 0: releasing [mem
0x6-0x60fff 64bit pref]
[2.769939] pci :00:00.0: BAR 15: releasing [mem
0x6-0x6003fbff0 64bit pref]
[2.769956] pci :00:00.0: BAR 15: assigned [mem
0x6-0x600017fff 64bit pref]
[2.769961] amdgpu :01:00.0: BAR 0: assigned [mem
0x6-0x6 64bit pref]
[2.769972] amdgpu :01:00.0: BAR 2: assigned [mem
0x60001-0x60001001f 64bit pref]
[2.770004] pci :00:00.0: PCI bridge to [bus 01]
[2.770009] pci :00:00.0:   bridge window [mem
0x600c0-0x600c07fef]
[2.770015] pci :00:00.0:   bridge window [mem
0x6-0x6003fbff0 64bit pref]
[2.770066] amdgpu :01:00.0: VRAM: 4096M 0x00F4 -
0x00F4 (4096M used)
[2.770069] amdgpu :01:00.0: GART: 256M 0x -
0x0FFF
[2.770075] [drm] Detected VRAM RAM=4096M, BAR=4096M
[2.770077] [drm] RAM width 128bits GDDR5
[2.770162] [TTM] Zone  kernel: Available graphics memory: 32717248 kiB
[2.770165] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[2.770166] [TTM] Initializing pool allocator
[2.771771] [drm] amdgpu: 4096M of VRAM memory ready
[2.771774] [drm] amdgpu: 4096M of GTT memory ready.
[2.771790] [drm] GART: num cpu pages 4096, num gpu pages 65536
[2.771839] [drm] PCIE GART of 256M enabled (table at 0x00F4008D).
[2.771911] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[2.771913] [drm] Driver supports precise vblank timestamp query.
[2.772311] [drm] AMDGPU Display Connectors
[2.772313] [drm] Connector 0:
[2.772315] [drm]   DP-1
[2.772316] [drm]   HPD5
[2.772318] [drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b
0x486b
[2.772320] [drm]   Encoders:
[2.772322] [drm] DFP1: INTERNAL_UNIPHY1
[2.772323] [drm] Connector 1:
[2.772325] [drm]   DP-2
[2.772326] [drm]   HPD4
[2.772328] [drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f
0x486f
[2.772330] [drm]   Encoders:
[2.772332] [drm] DFP2: INTERNAL_UNIPHY1
[2.772333] [drm] Connector 2:
[2.772335] [drm]   DP-3
[2.772336] [drm]   HPD3
[2.772338] [drm]   DDC: 0x4870 0x4870 0x4871 0x4871 0x4872 0x4872 0x4873
0x4873
[2.772340] [drm]   Encoders:
[2.772341] [drm] DFP3: INTERNAL_UNIPHY
[2.772343] [drm] Connector 3:
[2.772345] [drm]   DP-4
[2.772346] [drm]   HPD2
[2.772348] [drm]   DDC: 0x4874