*** This bug is a duplicate of bug 1956845 ***
https://bugs.launchpad.net/bugs/1956845
Hardware: Dual-processor (DP) AMD EPYC Milan 7763 node with two AMD Instinct MI100 GPUs
Kernel: Ubuntu 18.04.6 LTS with linux-hwe 5.4.0-107-generic
ROCm 5.1.0 with AMDGPU driver version 5.13.20.5.1
Homegrown software developed using ROCm 5.1.0.
Might this be related?
Logs:
[304726.475355] beegfs: enabling unsafe global rkey
[304734.912424] amdgpu :23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process hyprep pid 122284 thread hyprep pid 122284)
[304734.928526] amdgpu :23:00.0: amdgpu: in page starting at address 0x01753000 from IH client 0x1b (UTCL2)
[304734.939972] amdgpu :23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[304734.948130] amdgpu :23:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[304734.955858] amdgpu :23:00.0: amdgpu: MORE_FAULTS: 0x1
[304734.962115] amdgpu :23:00.0: amdgpu: WALKER_ERROR: 0x0
[304734.968441] amdgpu :23:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[304734.975196] amdgpu :23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304734.981580] amdgpu :23:00.0: amdgpu: RW: 0x0
[304735.568400] amdgpu :23:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32769, for process pid 0 thread pid 0)
[304735.582318] amdgpu :23:00.0: amdgpu: in page starting at address 0x01753000 from IH client 0x1b (UTCL2)
[304735.593722] amdgpu :23:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x
[304735.601851] amdgpu :23:00.0: amdgpu: Faulty UTCL2 client ID: CB (0x0)
[304735.609465] amdgpu :23:00.0: amdgpu: MORE_FAULTS: 0x0
[304735.615686] amdgpu :23:00.0: amdgpu: WALKER_ERROR: 0x0
[304735.621994] amdgpu :23:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[304735.628737] amdgpu :23:00.0: amdgpu: MAPPING_ERROR: 0x0
[304735.635104] amdgpu :23:00.0: amdgpu: RW: 0x0
[321839.599489] beegfs: enabling unsafe global rkey
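For anyone cross-checking the first fault, the per-field values the driver prints can be reproduced from the raw status word 0x00301031. The following is a minimal sketch, not driver code: the bit layout in FIELDS is an assumption based on the GFX9 register definition of VM_L2_PROTECTION_FAULT_STATUS, and the names FIELDS and decode_fault_status are made up for illustration.

```python
# Assumed GFX9 layout of VM_L2_PROTECTION_FAULT_STATUS: name -> (shift, width).
# This table is an assumption for illustration; verify it against the
# register headers that match your driver version before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status word taken from the first fault in the log above.
    for name, value in decode_fault_status(0x00301031).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, the decoded values agree with the log: MORE_FAULTS 0x1, WALKER_ERROR 0x0, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x0, client ID 0x8 (TCP) and RW 0x0.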
Driver:
Apr 02 22:19:59 n004 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 02 22:19:59 n004 kernel: [drm] amdgpu version: 5.13.20.5.1
Apr 02 22:19:59 n004 kernel: amdgpu: Ignoring ACPI CRAT on non-APU system
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for CPU
Apr 02 22:19:59 n004 kernel: amdgpu: Topology: Add CPU node
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x678 -> 0x67f
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x680 -> 0x680001f
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xeb40 -> 0xeb47
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: enabling device ( -> 0003)
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 02 22:19:59 n004 kernel: amdgpu: ATOM BIOS: 113-D3431401-100
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: MEM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: SRAM ECC is active.
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: VRAM: 32752M 0x0080 - 0x0087FEFF (32752M used)
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: GART: 512M 0x - 0x1FFF
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: AGP: 267878400M 0x0088 - 0x
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 32752M of VRAM memory ready
Apr 02 22:19:59 n004 kernel: [drm] amdgpu: 2064153M of GTT memory ready.
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: PSP runtime database doesn't exist
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: Will use PSP to load VCN firmware
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: DTM: optional dtm ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: RAP: optional rap ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: use vbios provided pptable
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.6
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: PMFW based fan control disabled
Apr 02 22:19:59 n004 kernel: amdgpu :23:00.0: amdgpu: SMU is initialized successfully!
Apr 02 22:19:59 n004 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Apr 02 22:19:59 n004 kernel: amdgpu: Virtual CRAT table created for