Public bug reported:

I'm running Ubuntu 22.04 with kernel 5.15.0.27.30 on an HPE ProLiant
DL20 Gen9 server. The server has an HPE Smart HBA H240 SATA controller.

Since moving to Ubuntu 22.04, the kernel starts logging IOMMU (DMAR)
errors after a few hours of uptime. The problem begins with a few
instances of a message such as this:

Apr 26 12:03:37 <hostname> kernel: DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set (to 7bf32003 not 24c563801)
Apr 26 12:03:37 <hostname> kernel: ------------[ cut here ]------------
Apr 26 12:03:37 <hostname> kernel: WARNING: CPU: 1 PID: 10171 at drivers/iommu/intel/iommu.c:2391 __domain_mapping.cold+0x94/0xcb
Apr 26 12:03:37 <hostname> kernel: Modules linked in: tls rpcsec_gss_krb5 binfmt_misc ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limi>
Apr 26 12:03:37 <hostname> kernel:  drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci cec crypto_simd i2c_i801 rc_core cryptd drm xhci_pci_renesas ahci i2c_smbus tg3 hpsa>
Apr 26 12:03:37 <hostname> kernel: CPU: 1 PID: 10171 Comm: kworker/u4:0 Not tainted 5.15.0-27-generic #28-Ubuntu
Apr 26 12:03:37 <hostname> kernel: Hardware name: HP ProLiant DL20 Gen9/ProLiant DL20 Gen9, BIOS U22 04/01/2021
Apr 26 12:03:37 <hostname> kernel: Workqueue: writeback wb_workfn (flush-253:2)
Apr 26 12:03:37 <hostname> kernel: RIP: 0010:__domain_mapping.cold+0x94/0xcb
Apr 26 12:03:37 <hostname> kernel: Code: 27 9d 4c 89 4d b8 4c 89 45 c0 e8 03 c5 fa ff 8b 05 e7 e6 40 01 4c 8b 45 c0 4c 8b 4d b8 85 c0 74 09 83 e8 01 89 05 d2 e6 40 01 <0f> 0b e9 7e b2 b1 ff 89 ca 48 83 >
Apr 26 12:03:37 <hostname> kernel: RSP: 0018:ffffc077826b2fa0 EFLAGS: 00010202
Apr 26 12:03:37 <hostname> kernel: RAX: 0000000000000004 RBX: ffff9f0042062990 RCX: 0000000000000000
Apr 26 12:03:37 <hostname> kernel: RDX: 0000000000000000 RSI: ffff9f02b3d20980 RDI: ffff9f02b3d20980
Apr 26 12:03:37 <hostname> kernel: RBP: ffffc077826b2ff0 R08: 000000024c563801 R09: 000000000024c563
Apr 26 12:03:37 <hostname> kernel: R10: 00000000ffffffff R11: ffffffffc01550e0 R12: 000000000000000f
Apr 26 12:03:37 <hostname> kernel: R13: 000000000007bf32 R14: ffff9f00412f5800 R15: ffff9f0042062938
Apr 26 12:03:37 <hostname> kernel: FS:  0000000000000000(0000) GS:ffff9f02b3d00000(0000) knlGS:0000000000000000
Apr 26 12:03:37 <hostname> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 12:03:37 <hostname> kernel: CR2: 00001530f676a01c CR3: 000000029c210001 CR4: 00000000002706e0
Apr 26 12:03:37 <hostname> kernel: Call Trace:
Apr 26 12:03:37 <hostname> kernel:  <TASK>
Apr 26 12:03:37 <hostname> kernel:  intel_iommu_map_pages+0xdc/0x120
Apr 26 12:03:37 <hostname> kernel:  ? __alloc_and_insert_iova_range+0x203/0x240
Apr 26 12:03:37 <hostname> kernel:  __iommu_map+0xda/0x270
Apr 26 12:03:37 <hostname> kernel:  __iommu_map_sg+0x8e/0x120
Apr 26 12:03:37 <hostname> kernel:  iommu_map_sg_atomic+0x14/0x20
Apr 26 12:03:37 <hostname> kernel:  iommu_dma_map_sg+0x345/0x4d0
Apr 26 12:03:37 <hostname> kernel:  __dma_map_sg_attrs+0x68/0x70
Apr 26 12:03:37 <hostname> kernel:  dma_map_sg_attrs+0xe/0x20
Apr 26 12:03:37 <hostname> kernel:  scsi_dma_map+0x39/0x50
Apr 26 12:03:37 <hostname> kernel:  hpsa_scsi_ioaccel2_queue_command.constprop.0+0x11e/0x570 [hpsa]
Apr 26 12:03:37 <hostname> kernel:  ? __blk_rq_map_sg+0x36/0x160
Apr 26 12:03:37 <hostname> kernel:  hpsa_scsi_ioaccel_queue_command+0x82/0xd0 [hpsa]
Apr 26 12:03:37 <hostname> kernel:  hpsa_ioaccel_submit+0x174/0x190 [hpsa]
Apr 26 12:03:37 <hostname> kernel:  hpsa_scsi_queue_command+0x19c/0x240 [hpsa]
Apr 26 12:03:37 <hostname> kernel:  ? recalibrate_cpu_khz+0x10/0x10
Apr 26 12:03:37 <hostname> kernel:  scsi_dispatch_cmd+0x93/0x1f0
Apr 26 12:03:37 <hostname> kernel:  scsi_queue_rq+0x2d1/0x690
Apr 26 12:03:37 <hostname> kernel:  blk_mq_dispatch_rq_list+0x126/0x600
Apr 26 12:03:37 <hostname> kernel:  ? __sbitmap_queue_get+0x1/0x10
Apr 26 12:03:37 <hostname> kernel:  __blk_mq_do_dispatch_sched+0xba/0x2d0
Apr 26 12:03:37 <hostname> kernel:  __blk_mq_sched_dispatch_requests+0x104/0x150
Apr 26 12:03:37 <hostname> kernel:  blk_mq_sched_dispatch_requests+0x35/0x60
Apr 26 12:03:37 <hostname> kernel:  __blk_mq_run_hw_queue+0x34/0xb0
Apr 26 12:03:37 <hostname> kernel:  __blk_mq_delay_run_hw_queue+0x162/0x170
Apr 26 12:03:37 <hostname> kernel:  blk_mq_run_hw_queue+0x83/0x120
Apr 26 12:03:37 <hostname> kernel:  blk_mq_sched_insert_requests+0x69/0xf0
Apr 26 12:03:37 <hostname> kernel:  blk_mq_flush_plug_list+0x103/0x1c0
Apr 26 12:03:37 <hostname> kernel:  blk_flush_plug_list+0xdd/0x100
Apr 26 12:03:37 <hostname> kernel:  blk_mq_submit_bio+0x2bd/0x600
Apr 26 12:03:37 <hostname> kernel:  __submit_bio+0x1ea/0x220
Apr 26 12:03:37 <hostname> kernel:  ? mempool_alloc_slab+0x17/0x20
Apr 26 12:03:37 <hostname> kernel:  __submit_bio_noacct+0x85/0x1f0
Apr 26 12:03:37 <hostname> kernel:  submit_bio_noacct+0x4e/0x120
Apr 26 12:03:37 <hostname> kernel:  ? radix_tree_lookup+0xd/0x10
Apr 26 12:03:37 <hostname> kernel:  ? bio_associate_blkg_from_css+0x1b2/0x310
Apr 26 12:03:37 <hostname> kernel:  submit_bio+0x4a/0x130
Apr 26 12:03:37 <hostname> kernel:  ? wbc_account_cgroup_owner+0x2c/0x80
Apr 26 12:03:37 <hostname> kernel:  submit_bh_wbc+0x18d/0x1c0
Apr 26 12:03:37 <hostname> kernel:  __block_write_full_page+0x227/0x4a0
Apr 26 12:03:37 <hostname> kernel:  ? block_invalidatepage+0x150/0x150
Apr 26 12:03:37 <hostname> kernel:  ? blkdev_llseek+0x60/0x60
Apr 26 12:03:37 <hostname> kernel:  block_write_full_page+0x6f/0x90
Apr 26 12:03:37 <hostname> kernel:  blkdev_writepage+0x18/0x20
Apr 26 12:03:37 <hostname> kernel:  __writepage+0x1e/0x70
Apr 26 12:03:37 <hostname> kernel:  write_cache_pages+0x1a9/0x460
Apr 26 12:03:37 <hostname> kernel:  ? __set_page_dirty_no_writeback+0x40/0x40
Apr 26 12:03:37 <hostname> kernel:  generic_writepages+0x58/0x90
Apr 26 12:03:37 <hostname> kernel:  ? __blk_mq_do_dispatch_sched+0x7f/0x2d0
Apr 26 12:03:37 <hostname> kernel:  blkdev_writepages+0xe/0x10
Apr 26 12:03:37 <hostname> kernel:  do_writepages+0xda/0x200
Apr 26 12:03:37 <hostname> kernel:  ? __percpu_counter_sum+0x6f/0xa0
Apr 26 12:03:37 <hostname> kernel:  ? __blk_mq_sched_dispatch_requests+0x104/0x150
Apr 26 12:03:37 <hostname> kernel:  ? mem_cgroup_css_rstat_flush+0x43a/0x870
Apr 26 12:03:37 <hostname> kernel:  ? cpumask_next+0x23/0x30
Apr 26 12:03:37 <hostname> kernel:  __writeback_single_inode+0x44/0x290
Apr 26 12:03:37 <hostname> kernel:  writeback_sb_inodes+0x223/0x4d0
Apr 26 12:03:37 <hostname> kernel:  __writeback_inodes_wb+0x56/0xf0
Apr 26 12:03:37 <hostname> kernel:  wb_writeback+0x1cc/0x290
Apr 26 12:03:37 <hostname> kernel:  wb_do_writeback+0x1a4/0x280
Apr 26 12:03:37 <hostname> kernel:  wb_workfn+0x77/0x250
Apr 26 12:03:37 <hostname> kernel:  ? psi_task_switch+0xc6/0x220
Apr 26 12:03:37 <hostname> kernel:  ? finish_task_switch.isra.0+0xa6/0x270
Apr 26 12:03:37 <hostname> kernel:  process_one_work+0x22b/0x3d0
Apr 26 12:03:37 <hostname> kernel:  worker_thread+0x53/0x410
Apr 26 12:03:37 <hostname> kernel:  ? process_one_work+0x3d0/0x3d0
Apr 26 12:03:37 <hostname> kernel:  kthread+0x12a/0x150
Apr 26 12:03:37 <hostname> kernel:  ? set_kthread_struct+0x50/0x50
Apr 26 12:03:37 <hostname> kernel:  ret_from_fork+0x22/0x30
Apr 26 12:03:37 <hostname> kernel:  </TASK>
Apr 26 12:03:37 <hostname> kernel: ---[ end trace 6eaabfe8ad4492e0 ]---
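
The first line of this trace can be unpacked a little. The sketch below
is Python and not part of the original report. It assumes 4 KiB IOMMU
pages and the Intel second-level page-table entry layout in which bit 0
grants read and bit 1 grants write; the helper names are made up for
illustration. It reads "to X not Y" as the entry already present
followed by the value the kernel tried to install.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB IOMMU pages

# Matches the "already set" error exactly as it appears in the log above.
PTE_MSG = re.compile(
    r"DMA PTE for vPFN 0x(?P<vpfn>[0-9a-f]+) already set "
    r"\(to (?P<old>[0-9a-f]+) not (?P<new>[0-9a-f]+)\)"
)


def decode_pte(value):
    """Split a PTE value into its page address and read/write bits."""
    return {
        "addr": value & ~((1 << PAGE_SHIFT) - 1),  # clear the low flag bits
        "read": bool(value & 0x1),   # assumption: bit 0 = read permission
        "write": bool(value & 0x2),  # assumption: bit 1 = write permission
    }


def decode_message(line):
    """Return the IOVA plus the decoded existing and requested entries."""
    m = PTE_MSG.search(line)
    if m is None:
        return None
    return {
        "iova": int(m["vpfn"], 16) << PAGE_SHIFT,
        "existing": decode_pte(int(m["old"], 16)),
        "requested": decode_pte(int(m["new"], 16)),
    }


SAMPLE = ("kernel: DMAR: ERROR: DMA PTE for vPFN 0x7bf32 "
          "already set (to 7bf32003 not 24c563801)")
decoded = decode_message(SAMPLE)
```

Under those assumptions, the message says that I/O virtual address
0x7bf32000 was already mapped read/write to physical 0x7bf32000 when
the driver asked for a read-only mapping to 0x24c563000.
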

Afterwards, messages like

Apr 26 12:55:29 <hostname> kernel: dmar_fault: 152 callbacks suppressed
Apr 26 12:55:29 <hostname> kernel: DMAR: DRHD: handling fault status reg 2
Apr 26 12:55:29 <hostname> kernel: DMAR: [DMA Write NO_PASID] Request device [06:00.0] fault addr 0x7bf4a000 [fault reason 0x05] PTE Write access is not set

or

Apr 26 12:56:50 <hostname> kernel: dmar_fault: 152 callbacks suppressed
Apr 26 12:56:50 <hostname> kernel: DMAR: DRHD: handling fault status reg 2
Apr 26 12:56:50 <hostname> kernel: DMAR: [DMA Read NO_PASID] Request device [06:00.0] fault addr 0x7bf32000 [fault reason 0x06] PTE Read access is not set

are logged continuously. The device 06:00.0 named in these faults is
the PCI address of the HPE Smart HBA H240 controller.
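
To see which devices and pages are affected once the faults start
flooding the log, the fault lines can be tallied. This is a rough
Python sketch, not something from the original report: the regular
expression is written against the two sample lines quoted above (on one
line each), the function name is made up, and 4 KiB pages are assumed
when turning a fault address into a page frame number.

```python
import re
from collections import Counter

# Pattern derived from the two fault lines quoted in this report.
FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write) NO_PASID\] Request device "
    r"\[(?P<dev>[0-9a-f:.]+)\] fault addr 0x(?P<addr>[0-9a-f]+) "
    r"\[fault reason 0x(?P<reason>[0-9a-f]+)\]"
)


def summarize_faults(lines):
    """Count faults per (device, operation, page frame number)."""
    counts = Counter()
    for line in lines:
        m = FAULT.search(line)
        if m:
            pfn = int(m["addr"], 16) >> 12  # assumption: 4 KiB pages
            counts[(m["dev"], m["op"], pfn)] += 1
    return counts


SAMPLE_LINES = [
    "kernel: DMAR: DRHD: handling fault status reg 2",
    "kernel: DMAR: [DMA Write NO_PASID] Request device [06:00.0] "
    "fault addr 0x7bf4a000 [fault reason 0x05] PTE Write access is not set",
    "kernel: DMAR: [DMA Read NO_PASID] Request device [06:00.0] "
    "fault addr 0x7bf32000 [fault reason 0x06] PTE Read access is not set",
]
summary = summarize_faults(SAMPLE_LINES)
```

On the samples, both faults come from 06:00.0, and the read fault falls
on page frame 0x7bf32, the same vPFN named in the initial "already set"
error.
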

The errors stop after a reboot and return a few hours later. In most
cases the server also reports a hardware fault and the storage fan
spins up to 100 %.

The problem did _not_ occur in Ubuntu 20.04; the last 20.04 kernel I ran
on this server was 5.13.0.39.44~20.04.24.

Setting the intel_iommu=off kernel boot parameter seems to work around
the problem.
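
When testing the workaround, it helps to confirm that the parameter
actually reached the running kernel. A minimal Python sketch follows,
not from the original report: the function name is invented and the
sample command line is a placeholder. On a live system the string would
come from /proc/cmdline.

```python
def intel_iommu_setting(cmdline):
    """Return the last intel_iommu= value on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "intel_iommu" and sep:
            value = val  # keep scanning so the last occurrence is returned
    return value


# Placeholder command line for illustration only.
SAMPLE_CMDLINE = "BOOT_IMAGE=/vmlinuz-5.15.0-27-generic ro intel_iommu=off"
setting = intel_iommu_setting(SAMPLE_CMDLINE)

# On the affected server one would instead run:
#   with open("/proc/cmdline") as f:
#       print(intel_iommu_setting(f.read()))
```

A result of "off" means the workaround is active for the current boot;
None means the parameter was not passed at all.
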

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1970453

Title:
  DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1970453/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
