*** This bug is a duplicate of bug 1956845 ***
    https://bugs.launchpad.net/bugs/1956845

Hardware: Dual-processor (DP) EPYC Milan 7763 node with 2x AMD Instinct MI100

Kernel:   Ubuntu 18.04.6 LTS with linux-hwe 5.4.0-107-generic

Software: ROCm 5.1.0 with AMDGPU driver version 5.13.20.5.1

The workload is in-house software developed with ROCm 5.1.0.

Might this be related to the crash reported in this bug?


Logs:

[304726.475355] beegfs: enabling unsafe global rkey
[304734.912424] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process hyprep pid 122284 thread hyprep pid 122284)
[304734.928526] amdgpu 0000:23:00.0: amdgpu:   in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304734.939972] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[304734.948130] amdgpu 0000:23:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[304734.955858] amdgpu 0000:23:00.0: amdgpu:     MORE_FAULTS: 0x1
[304734.962115] amdgpu 0000:23:00.0: amdgpu:     WALKER_ERROR: 0x0
[304734.968441] amdgpu 0000:23:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[304734.975196] amdgpu 0000:23:00.0: amdgpu:     MAPPING_ERROR: 0x0
[304734.981580] amdgpu 0000:23:00.0: amdgpu:     RW: 0x0
[304735.568400] amdgpu 0000:23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process  pid 0 thread  pid 0)
[304735.582318] amdgpu 0000:23:00.0: amdgpu:   in page starting at address 0x0000000001753000 from IH client 0x1b (UTCL2)
[304735.593722] amdgpu 0000:23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[304735.601851] amdgpu 0000:23:00.0: amdgpu:     Faulty UTCL2 client ID: CB (0x0)
[304735.609465] amdgpu 0000:23:00.0: amdgpu:     MORE_FAULTS: 0x0
[304735.615686] amdgpu 0000:23:00.0: amdgpu:     WALKER_ERROR: 0x0
[304735.621994] amdgpu 0000:23:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[304735.628737] amdgpu 0000:23:00.0: amdgpu:     MAPPING_ERROR: 0x0
[304735.635104] amdgpu 0000:23:00.0: amdgpu:     RW: 0x0
[321839.599489] beegfs: enabling unsafe global rkey
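
For anyone comparing fault reports, here is a rough sketch of how the
VM_L2_PROTECTION_FAULT_STATUS value in the log above breaks down into the
fields the driver prints. The bit positions are an assumption on my part
(taken from my reading of the gfx9 register layout, not from anything in this
report), and decode_fault_status is just a name made up for the example. The
layout is at least consistent with the first fault: 0x00301031 yields
MORE_FAULTS 0x1, PERMISSION_FAULTS 0x3, client ID 0x8 and VMID 3, matching the
logged lines.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS on gfx9-class GPUs,
# given as name -> (shift, width). This is an assumption for illustration;
# it agrees with the values logged above but is not confirmed by the report.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # printed by the driver as "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into the named bit fields (hypothetical helper)."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value of the first fault in the log above.
    for name, value in decode_fault_status(0x00301031).items():
        print(f"{name}: {value:#x}")
```

The second fault reports a status of 0x00000000, so every field decodes to
zero, which is why the driver prints client ID 0x0 there.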

Driver initialization log:
Apr 02 22:19:59 n004 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 02 22:19:59 n004 kernel: [drm] amdgpu version: 5.13.20.5.1
Apr 02 22:19:59 n004 kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for CPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add CPU node
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x67800000000 -> 0x67fffffffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x68000000000 -> 0x680001fffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xeb400000 -> 0xeb47ffff
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: enabling device (0000 -> 0003)
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 02 22:19:59 n004 kernel: amdgpu: ATOM BIOS: 113-D3431401-100
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: MEM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SRAM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 32752M of VRAM memory ready
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 2064153M of GTT memory ready.
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: PSP runtime database doesn't exist
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: Will use PSP to load VCN firmware
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: DTM: optional dtm ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: RAP: optional rap ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: use vbios provided pptable
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: PMFW based fan control disabled
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SMU is initialized successfully!
Apr 02 22:19:59 n004 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for GPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add dGPU node [0x738c:0x1002]
Apr 02 22:19:59 n004 kernel: kfd kfd: amdgpu: added device 1002:738c
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: SE 8, SH per SE 1, CU per SH 16, active_cu_number 120
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 0 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 1 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 4 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 5 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 6 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 7 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 8 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 9 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 10 on hub 0
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma1 uses VM inv eng 1 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma2 uses VM inv eng 4 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma3 uses VM inv eng 5 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma4 uses VM inv eng 6 on hub 1
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma5 uses VM inv eng 0 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma6 uses VM inv eng 1 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring sdma7 uses VM inv eng 4 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 5 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 6 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 7 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 8 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 9 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 10 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring jpeg_dec_0 uses VM inv eng 11 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu 0000:23:00.0: amdgpu: ring jpeg_dec_1 uses VM inv eng 12 on hub 2
Apr 02 22:19:59 n004 kernel: amdgpu: Detected AMDGPU 6 Perf Events.
Apr 02 22:19:59 n004 kernel: [drm] Initialized amdgpu 3.45.0 20150101 for 0000:23:00.0 on minor 1
Apr 02 22:21:38 n004 kernel: amdgpu: PeerDirect support was initialized successfully

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1940690

Title:
  amdgpu kernel crash

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940690/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
