apport information

** Attachment added: "ProcEnviron.txt"
   https://bugs.launchpad.net/bugs/1990323/+attachment/5617875/+files/ProcEnviron.txt

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1990323

Title:
  amdgpu driver hangs periodically, causes display to permanently crash

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  Apologies in advance if this isn't the right place to file this bug.
  Please let me know if I should report this elsewhere or if there's any
  other info I can add.

  What I suspect is an amdgpu driver issue has been causing display
  issues and machine crashes. Once the issue starts, the display won't
  come back from being blank, and turning the machine off takes five
  minutes or longer. Additionally, executing `sensors` hangs on reading
  data from the GPU and can even cause 100% CPU utilization for multiple
  minutes. I believe this causes significant delays to other system calls, as
  the entire machine will start to behave sluggishly in spurts where
  every program hangs and then many things happen all at once. I can't
  reliably trigger the issue, although repeatedly reading `sensors` in a
  loop seems to be one method that eventually works.
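
  The loop described above could be sketched roughly as follows. This is
  only an illustration, not the exact method used for this report: the
  helper names (time_command, poll) and the thresholds are hypothetical.
  Note that if the child process is stuck inside the kernel, killing it
  after the timeout may itself block for a while.

```python
import subprocess
import time


def time_command(cmd, timeout_s=30.0):
    """Run cmd once and return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        timed_out = True
    return time.monotonic() - start, timed_out


def poll(cmd, iterations=100, slow_s=2.0, timeout_s=30.0):
    """Run cmd repeatedly and collect the iterations that were slow or hung."""
    slow = []
    for i in range(iterations):
        elapsed, timed_out = time_command(cmd, timeout_s)
        if timed_out or elapsed >= slow_s:
            slow.append((i, round(elapsed, 2), timed_out))
    return slow
```

  Calling poll(["sensors"]) would return the iterations on which reading
  the sensors took two seconds or more, or timed out entirely.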

  The main symptom other than the behavior described above is a cycle of
  messages in the journal of the form

  kernel: amdgpu:
      last message was failed ret is 0
  kernel: amdgpu:
      failed to send message 145 ret is 0
  kernel: amdgpu:
      last message was failed ret is 0
  kernel: amdgpu:
      failed to send message 146 ret is 0

  The messages sent are 145, 146, 5e, and 148.
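
  For anyone who wants to tally these from their own logs, here is a
  minimal sketch. It assumes the kernel log has been saved as text (for
  example from `journalctl -k -b`) and that the lines look like the ones
  quoted above. The function name and the regular expression are my own
  and are not part of any amdgpu tooling.

```python
import re
from collections import Counter

# Matches e.g. "failed to send message 145 ret is 0", also when the
# message is wrapped onto the line after "kernel: amdgpu:".
FAILED_SEND = re.compile(
    r"failed to send message\s+([0-9a-fA-F]+)\s+ret is\s+(-?\w+)"
)


def count_failed_messages(journal_text):
    """Count how often each message ID appears in a failed-send line."""
    return Counter(m.group(1).lower() for m in FAILED_SEND.finditer(journal_text))
```

  The result maps each message ID, as printed in the log, to the number
  of times it failed to send.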

  I've had this GPU for 5 years without any of these problems. However,
  I only recently upgraded from an older Intel CPU to a Ryzen 5600 and an
  ASRock B550M Pro4. I have no idea if the issues are related to the
  upgrade or how they could be.

  Some misc info.
  Distro: Ubuntu 22.04.1 LTS x86_64
  Kernel: 5.15.0-48-generic
  Graphics card: Radeon R9 380X
  Motherboard: ASRock B550M Pro4 (P2.30 BIOS)
  CPU: Ryzen 5600
  Desktop: i3 (with picom compositor)

  I'm attaching the relevant logs from my most recent boot when I was
  able to boot and use the machine for several hours. I left the machine
  to blank its screen and when I returned, I was unable to unblank the
  screen. The only thing I could do was press the power button and leave
  the machine to shut down over the course of the next 5 or 10 minutes.

  The only other thing that is interesting is that on startup I'm seeing
  the following warning about undefined behavior.

  UBSAN: shift-out-of-bounds in /build/linux-kQ6jNR/linux-5.15.0/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:997:32
  shift exponent 64 is too large for 64-bit type 'long long unsigned int'
  CPU: 10 PID: 483 Comm: systemd-udevd Not tainted 5.15.0-48-generic #54-Ubuntu
  Hardware name: To Be Filled By O.E.M. B550M Pro4/B550M Pro4, BIOS P2.30 02/24/2022
  Call Trace:
   <TASK>
   show_stack+0x52/0x5c
   dump_stack_lvl+0x4a/0x63
   dump_stack+0x10/0x16
   ubsan_epilogue+0x9/0x49
   __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
   initialize_nocpsch.cold+0x15/0x59 [amdgpu]
   device_queue_manager_init+0x20b/0x3b0 [amdgpu]
   kgd2kfd_device_init.cold+0x1af/0x483 [amdgpu]
   amdgpu_amdkfd_device_init+0x135/0x170 [amdgpu]
   amdgpu_device_ip_init+0x681/0x6a4 [amdgpu]
  loop33: detected capacity change from 0 to 8
   amdgpu_device_init.cold+0x25b/0x7db [amdgpu]
   ? do_pci_enable_device+0xdb/0x110
   amdgpu_driver_load_kms+0x1e/0x270 [amdgpu]
   amdgpu_pci_probe+0x1ce/0x260 [amdgpu]
   local_pci_probe+0x4b/0x90
   pci_device_probe+0x119/0x1f0
   really_probe+0x222/0x420
   __driver_probe_device+0x119/0x190
   driver_probe_device+0x23/0xc0
   __driver_attach+0xbd/0x1e0
   ? __device_attach_driver+0x120/0x120
   bus_for_each_dev+0x7e/0xd0
   driver_attach+0x1e/0x30
   bus_add_driver+0x148/0x220
   driver_register+0x95/0x100
   __pci_register_driver+0x68/0x70
   amdgpu_init+0x7c/0x1000 [amdgpu]
   ? 0xffffffffc1a40000
   do_one_initcall+0x48/0x1e0
   ? kmem_cache_alloc_trace+0x19e/0x2e0
   do_init_module+0x52/0x260
   load_module+0xacd/0xbc0
   __do_sys_finit_module+0xbf/0x120
   __x64_sys_finit_module+0x18/0x20
   do_syscall_64+0x5c/0xc0
   ? syscall_exit_to_user_mode+0x27/0x50
   ? __x64_sys_newfstatat+0x1c/0x30
   ? do_syscall_64+0x69/0xc0
   ? __x64_sys_mmap+0x33/0x50
   ? do_syscall_64+0x69/0xc0
   ? do_syscall_64+0x69/0xc0
   entry_SYSCALL_64_after_hwframe+0x61/0xcb
  RIP: 0033:0x7f06f3fb9a3d
  Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffc7ce54ae8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  RAX: ffffffffffffffda RBX: 0000556c9ab3e3d0 RCX: 00007f06f3fb9a3d
  RDX: 0000000000000000 RSI: 00007f06f4150441 RDI: 000000000000001a
  RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002
  R10: 000000000000001a R11: 0000000000000246 R12: 00007f06f4150441
  R13: 0000556c9aa05fb0 R14: 0000556c9ab40460 R15: 0000556c9ab35150
   </TASK>

  It's even visible on screen before the splash screen appears. I don't
  remember seeing this before the motherboard/CPU upgrade.

  I haven't tried to trigger the issue from a bootable USB, but I can
  confirm that the warning about undefined behavior is present there as
  well.
  --- 
  ProblemType: Bug
  ApportVersion: 2.20.11-0ubuntu82.1
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC1:  emichael   2373 F.... pulseaudio
   /dev/snd/controlC2:  emichael   2373 F.... pulseaudio
   /dev/snd/controlC0:  emichael   2373 F.... pulseaudio
  CasperMD5CheckResult: unknown
  CurrentDesktop: i3
  DistroRelease: Ubuntu 22.04
  HibernationDevice: RESUME=UUID=4154f3bc-32d5-44ad-8af7-193b3f9c6483
  InstallationDate: Installed on 2016-01-18 (2438 days ago)
  InstallationMedia: Ubuntu 14.04.3 LTS "Trusty Tahr" - Beta amd64 (20150805)
  IwConfig:
   lo        no wireless extensions.
   
   enp5s0    no wireless extensions.
  MachineType: To Be Filled By O.E.M. B550M Pro4
  Package: linux (not installed)
  ProcFB: 0 amdgpudrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic root=UUID=91431a5f-fc77-4987-9f34-3e61da41a3b4 ro
  ProcVersionSignature: Ubuntu 5.15.0-48.54-generic 5.15.53
  RelatedPackageVersions:
   linux-restricted-modules-5.15.0-48-generic N/A
   linux-backports-modules-5.15.0-48-generic  N/A
   linux-firmware                             20220329.git681281e4-0ubuntu3.5
  RfKill:
   0: hci0: Bluetooth
        Soft blocked: no
        Hard blocked: no
  Tags:  jammy
  Uname: Linux 5.15.0-48-generic x86_64
  UpgradeStatus: Upgraded to jammy on 2022-05-06 (137 days ago)
  UserGroups: adm cdrom dip docker fuse lpadmin plugdev sambashare sudo video
  _MarkForUpload: True
  dmi.bios.date: 02/24/2022
  dmi.bios.release: 5.17
  dmi.bios.vendor: American Megatrends International, LLC.
  dmi.bios.version: P2.30
  dmi.board.name: B550M Pro4
  dmi.board.vendor: ASRock
  dmi.chassis.asset.tag: To Be Filled By O.E.M.
  dmi.chassis.type: 3
  dmi.chassis.vendor: To Be Filled By O.E.M.
  dmi.chassis.version: To Be Filled By O.E.M.
  dmi.modalias: dmi:bvnAmericanMegatrendsInternational,LLC.:bvrP2.30:bd02/24/2022:br5.17:svnToBeFilledByO.E.M.:pnB550MPro4:pvrToBeFilledByO.E.M.:rvnASRock:rnB550MPro4:rvr:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:skuToBeFilledByO.E.M.:
  dmi.product.family: To Be Filled By O.E.M.
  dmi.product.name: B550M Pro4
  dmi.product.sku: To Be Filled By O.E.M.
  dmi.product.version: To Be Filled By O.E.M.
  dmi.sys.vendor: To Be Filled By O.E.M.
  modified.conffile..etc.default.apport: [modified]
  mtime.conffile..etc.default.apport: 2022-09-17T22:35:32.791110

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1990323/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
