*** This bug is a duplicate of bug 1956845 ***
https://bugs.launchpad.net/bugs/1956845
Hardware: dual-socket (DP) AMD EPYC Milan 7763 node with 2x AMD Instinct MI100
Kernel: Ubuntu 18.04.6 LTS with linux-hwe 5.4.0-107-generic
Driver: ROCm 5.1.0 with amdgpu driver version 5.13.20.5.1
Workload: in-house software built against ROCm 5.1.0.
Might this be related?
Logs:
[304726.475355] beegfs: enabling unsafe global rkey
[304734.912424] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process hyprep pid 122284 thread hyprep pid 122284)
[304734.928526] amdgpu 0000:23:00.0: amdgpu: in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304734.939972] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[304734.948130] amdgpu 0000:23:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[304734.955858] amdgpu 0000:23:00.0: amdgpu: MORE_FAULTS: 0x1
[304734.962115] amdgpu 0000:23:00.0: amdgpu: WALKER_ERROR: 0x0
[304734.968441] amdgpu 0000:23:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[304734.975196] amdgpu 0000:23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304734.981580] amdgpu 0000:23:00.0: amdgpu: RW: 0x0
[304735.568400] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process pid 0 thread pid 0)
[304735.582318] amdgpu 0000:23:00.0: amdgpu: in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304735.593722] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[304735.601851] amdgpu 0000:23:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[304735.609465] amdgpu 0000:23:00.0: amdgpu: MORE_FAULTS: 0x0
[304735.615686] amdgpu 0000:23:00.0: amdgpu: WALKER_ERROR: 0x0
[304735.621994] amdgpu 0000:23:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[304735.628737] amdgpu 0000:23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304735.635104] amdgpu 0000:23:00.0: amdgpu: RW: 0x0
[321839.599489] beegfs: enabling unsafe global rkey
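For anyone comparing fault reports across bugs, here is a rough sketch of how the raw VM_L2_PROTECTION_FAULT_STATUS word relates to the per-field lines printed after it. The bit positions are an assumption inferred from the values in the log above (0x00301031 and its decoded fields), not copied from the amdgpu source, and the names FIELDS and decode_fault_status are made up for this illustration.

```python
# Sketch: split a VM_L2_PROTECTION_FAULT_STATUS value into its fields.
# ASSUMPTION: the (shift, width) layout below is inferred from the decoded
# values the driver printed in the log above; it is not taken from the
# amdgpu driver source and may differ on other GPU generations.
FIELDS = {
    "MORE_FAULTS":       (0, 1),
    "WALKER_ERROR":      (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR":     (8, 1),
    "CID":               (9, 9),   # the "Faulty UTCL2 client ID"
    "RW":                (18, 1),
    "VMID":              (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Return each field's value, extracted with a shift and a mask."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Value from the first fault in the log above.
    for name, value in decode_fault_status(0x00301031).items():
        print(f"{name}: {value:#x}")
```

With 0x00301031 this gives MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, CID 0x8 (TCP) and VMID 0x3, matching what the driver printed for the first fault. The second fault reports a status of 0x00000000, so every field decodes to zero.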
Driver initialization log:
Apr 02 22:19:59 n004 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 02 22:19:59 n004 kernel: [drm] amdgpu version: 5.13.20.5.1
Apr 02 22:19:59 n004 kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for CPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add CPU node
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x67800000000 -> 0x67fffffffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x68000000000 -> 0x680001fffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xeb400000 -> 0xeb47ffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: enabling device (0000 -> 0003)
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 02 22:19:59 n004 kernel: amdgpu: ATOM BIOS: 113-D3431401-100
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: MEM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SRAM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 32752M of VRAM memory ready
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 2064153M of GTT memory ready.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: PSP runtime database doesn't exist
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Will use PSP to load VCN firmware
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: DTM: optional dtm ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: RAP: optional rap ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: use vbios provided pptable
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: PMFW based fan control disabled
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SMU is initialized successfully!
Apr 02 22:19:59 n004 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for GPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add dGPU node [0x738c:0x1002]
Apr 02 22:19:59 n004 kernel: kfd kfd: amdgpu: added device 1002:738c
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 120
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu: Detected AMDGPU 6 Perf Events.
Apr 02 22:19:59 n004 kernel: [drm] Initialized amdgpu 3.45.0 20150101 for 0000:23:00.0 on minor 1
Apr 02 22:21:38 n004 kernel: amdgpu: PeerDirect support was initialized successfully
--
https://bugs.launchpad.net/bugs/1940690
Title:
amdgpu kernel crash