Public bug reported:

[Impact]
TXG sync stalls, causing ZFS workloads to hang indefinitely

[Description]
For certain ZFS workloads, we can see hung task timeouts in the kernel logs
due to a transaction group deadlock. Userspace processes hang and display
stack traces similar to the one below:
[49181.619711] clnt_server     D    0 21699  28868 0x00000320
[49181.619715] Call Trace:
[49181.619725]  __schedule+0x24e/0x880
[49181.619730]  schedule+0x2c/0x80
[49181.619750]  cv_wait_common+0x11e/0x140 [spl]
[49181.619763]  ? wait_woken+0x80/0x80
[49181.619775]  __cv_wait+0x15/0x20 [spl]
[49181.619872]  zil_commit.part.14+0x80/0x8c0 [zfs]
[49181.619884]  ? _cond_resched+0x19/0x40
[49181.619887]  ? mutex_lock+0x12/0x40
[49181.619959]  zil_commit+0x17/0x20 [zfs]
[49181.620026]  zfs_fsync+0x77/0xe0 [zfs]
[49181.620093]  zpl_fsync+0x68/0xa0 [zfs]
[49181.620100]  vfs_fsync_range+0x51/0xb0
[49181.620105]  do_fsync+0x3d/0x70
[49181.620109]  SyS_fsync+0x10/0x20
[49181.620114]  do_syscall_64+0x73/0x130
[49181.620119]  entry_SYSCALL_64_after_hwframe+0x41/0xa6

We might also see a kworker thread blocked in the ZFS writeback/evict path:
[49181.881570] kworker/u17:3   D    0  4915      2 0x80000000
[49181.881576] Workqueue: writeback wb_workfn (flush-zfs-10)
[49181.881577] Call Trace:
[49181.881580]  __schedule+0x24e/0x880
[49181.881582]  ? atomic_t_wait+0x60/0x60
[49181.881584]  schedule+0x2c/0x80
[49181.881588]  bit_wait+0x11/0x60
[49181.881592]  __wait_on_bit+0x4c/0x90
[49181.881596]  ? atomic_t_wait+0x60/0x60
[49181.881599]  __inode_wait_for_writeback+0xb9/0xf0
[49181.881601]  ? bit_waitqueue+0x40/0x40
[49181.881605]  inode_wait_for_writeback+0x26/0x40
[49181.881609]  evict+0xb5/0x1a0
[49181.881611]  iput+0x19c/0x230
[49181.881648]  zfs_iput_async+0x1d/0x80 [zfs]
[49181.881682]  zfs_get_data+0x1d4/0x2a0 [zfs]
[49181.881718]  zil_commit.part.14+0x640/0x8c0 [zfs]
[49181.881752]  zil_commit+0x17/0x20 [zfs]
[49181.881784]  zpl_writepages+0xd5/0x160 [zfs]
[49181.881787]  do_writepages+0x4b/0xe0
[49181.881790]  __writeback_single_inode+0x45/0x350
[49181.881792]  ? __writeback_single_inode+0x45/0x350
[49181.881794]  writeback_sb_inodes+0x1d7/0x530
[49181.881796]  wb_writeback+0xfb/0x300
[49181.881799]  wb_workfn+0xad/0x400
[49181.881800]  ? wb_workfn+0xad/0x400
[49181.881803]  ? __switch_to_asm+0x35/0x70
[49181.881809]  process_one_work+0x1de/0x420
[49181.881811]  worker_thread+0x32/0x410
[49181.881813]  kthread+0x121/0x140
[49181.881815]  ? process_one_work+0x420/0x420
[49181.881817]  ? kthread_create_worker_on_cpu+0x70/0x70
[49181.881819]  ret_from_fork+0x35/0x40

This is caused by a race between the ZFS writeback and evict threads,
usually during a transaction group sync operation. Two iput() threads can
race for the same inode: one scheduled asynchronously, the other executed
synchronously as part of the writeback path. If the writeback thread tries
to evict the inode while the async thread is running, it might re-enter the
block layer for the same inode because the ZFS reference counters are in an
inconsistent state. The kworker thread then stalls the writeback, which in
turn prevents the transaction group sync from completing and blocks other
ZFS threads.
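
To illustrate the race, the pre-fix release path in zfs_iput_async()
(zfs_zrele_async() in newer OpenZFS) follows roughly the check-then-act
pattern below. This is a simplified, kernel-context sketch for illustration
only, not the verbatim OpenZFS source; "zrele_taskq" stands in for the
per-pool taskq actually used.

    /* Pre-fix behaviour, simplified sketch (illustrative function name). */
    static void
    zrele_async_sketch(znode_t *zp)
    {
            struct inode *ip = ZTOI(zp);

            /*
             * Racy: i_count can drop between this read and the iput() below.
             * A writeback thread (zil_commit -> zfs_get_data) can read
             * i_count > 1, take the synchronous branch, and still end up
             * holding the last reference once the other holder releases its
             * own. The final iput() then runs evict() inline in the
             * writeback path, which waits on writeback for that same inode
             * and stalls the TXG sync.
             */
            if (atomic_read(&ip->i_count) == 1)
                    VERIFY(taskq_dispatch(zrele_taskq, (task_func_t *)iput,
                        ip, TQ_SLEEP) != TASKQID_INVALID);
            else
                    iput(ip);    /* may unexpectedly be the final release */
    }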

This is fixed by the upstream commit:
- Fix zrele race in zrele_async that can cause hang (2921ad6cba54) [0]
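
Our reading of the upstream change [0] is that it replaces the racy
reference-count read with an atomic "decrement unless last" operation, so a
thread can no longer take the synchronous branch and then unexpectedly end
up holding the final reference; only a genuinely final release is handed off
to the taskq. A simplified sketch of the post-fix logic (not the exact
diff):

    /* Post-fix behaviour, simplified sketch: atomic_add_unless() decrements
     * i_count only if it is not 1, closing the window described above. If
     * this is the last reference, the full iput()/eviction is deferred to
     * the dedicated taskq instead of running in the caller's context. */
    if (!atomic_add_unless(&ip->i_count, -1, 1))
            VERIFY(taskq_dispatch(zrele_taskq, (task_func_t *)iput,
                ip, TQ_SLEEP) != TASKQID_INVALID);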

[Test Case]
Because this is a race condition, the issue has been hard to reproduce
consistently. It has been reported on heavy I/O workloads that mix file
creation and deletion. Reports from both upstream and Ubuntu users indicate
that it is usually reproducible with e.g. heavy SQL workloads or complex
ccache-enabled builds [1].
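
As a rough illustration of that kind of workload, the hypothetical stress
program below mixes file creation, fsync and deletion across several
threads on a ZFS mount. The mountpoint, thread count and file count are
made-up values for illustration; this is not a known deterministic
reproducer.

    /* Build: gcc -O2 -pthread -o zfs-churn zfs-churn.c
     * Runs until interrupted (Ctrl-C). */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define TARGET_DIR "/tank/churn"   /* assumed ZFS dataset mountpoint */
    #define NTHREADS   8
    #define NFILES     1000

    static void *churn(void *arg)
    {
            long id = (long)arg;
            char path[256], buf[4096];

            memset(buf, 'x', sizeof(buf));
            for (;;) {
                    for (int i = 0; i < NFILES; i++) {
                            snprintf(path, sizeof(path),
                                TARGET_DIR "/t%ld-f%d", id, i);
                            int fd = open(path,
                                O_CREAT | O_WRONLY | O_TRUNC, 0644);
                            if (fd < 0)
                                    continue;
                            (void)write(fd, buf, sizeof(buf));
                            (void)fsync(fd);  /* exercises zil_commit() */
                            close(fd);
                            unlink(path);     /* deletion -> iput/zrele churn */
                    }
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NTHREADS];

            for (long i = 0; i < NTHREADS; i++)
                    pthread_create(&tid[i], NULL, churn, (void *)i);
            for (int i = 0; i < NTHREADS; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }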

[0] https://github.com/openzfs/zfs/pull/11530
[1] https://github.com/openzfs/zfs/issues/11527

[Regression Potential]
The patch has been validated with the ZFS test suite and in production
environments, so the potential for further regressions should be fairly
limited. Any regressions would most likely surface in the ZFS writeback
path, so we should monitor heavy I/O workloads that put a lot of stress on
the sync and evict paths.

** Affects: zfs-linux (Ubuntu)
     Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress

** Affects: zfs-linux (Ubuntu Xenial)
     Importance: High
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress

** Affects: zfs-linux (Ubuntu Bionic)
     Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress

** Affects: zfs-linux (Ubuntu Focal)
     Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress

** Affects: zfs-linux (Ubuntu Groovy)
     Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress

** Affects: zfs-linux (Ubuntu Hirsute)
     Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
         Status: In Progress


** Tags: sts

** Also affects: zfs-linux (Ubuntu Hirsute)
   Importance: Critical
     Assignee: Heitor Alves de Siqueira (halves)
       Status: Confirmed

** Also affects: zfs-linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: zfs-linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Also affects: zfs-linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Also affects: zfs-linux (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Changed in: zfs-linux (Ubuntu Groovy)
   Importance: Undecided => Critical

** Changed in: zfs-linux (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: zfs-linux (Ubuntu Focal)
   Importance: High => Critical

** Changed in: zfs-linux (Ubuntu Bionic)
   Importance: Undecided => Critical

** Changed in: zfs-linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: zfs-linux (Ubuntu Xenial)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)

** Changed in: zfs-linux (Ubuntu Bionic)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)

** Changed in: zfs-linux (Ubuntu Groovy)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)

** Changed in: zfs-linux (Ubuntu Focal)
     Assignee: (unassigned) => Heitor Alves de Siqueira (halves)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/1916486

Title:
  zfs_zrele_async can cause txg sync deadlocks

Status in zfs-linux package in Ubuntu:
  In Progress
Status in zfs-linux source package in Xenial:
  In Progress
Status in zfs-linux source package in Bionic:
  In Progress
Status in zfs-linux source package in Focal:
  In Progress
Status in zfs-linux source package in Groovy:
  In Progress
Status in zfs-linux source package in Hirsute:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1916486/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
