[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-13 Thread Matthew Ruffell
Hi Benjamin,

Great to hear that the early results look promising.

Yes, let's give it a week, so we can ensure that these patches do their
job under the high loads the git server faces each day over the week.

If things still look good early next week, I will go ahead and prepare
the patches for SRU to the Ubuntu kernels.

Let me know how the test kernel is going next week.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-13 Thread Benjamin Allot
The testing of the new kernel looks very promising. We don't observe any
of the latency/IOwait we had before even with the btree_shrinker
enabled.

We'll give it a week probably, but having a backport of those patches
would be fantastic for sure.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-12 Thread Benjamin Allot
Thanks Matthew, we'll try this kernel tomorrow.

I tested it on an openstack instance, the only downside is that it
uninstalls the official 4.15.0-118-generic one.

Regards,

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Matthew Ruffell
Hi Benjamin,

I think Dan is onto something.

The following commit was merged in upstream 5.5-rc1, but was backported
to 4.15.0-87-generic through upstream stable:

commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
Author: Coly Li 
Date:   Wed Nov 13 16:03:24 2019 +0800
Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
Link: 
https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b

It mentions in the description that:

> If sc->nr_to_scan is smaller than c->btree_pages, after the above
> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
> frequeently observed that only 1 or 2 is set to sc->nr_to_scan and make
> nr to be zero. Then bch_mca_scan() will do nothing more then acquiring
> and releasing mutex c->bucket_lock.

Spending a lot of idle time in bch_mca_scan() waiting on locks seems to
be a key symptom from the logs you have gathered.

I went and had a look through the commit history, and I think these
three patches might solve the problem:

commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
Author: Coly Li 
Date:   Fri Jan 24 01:01:40 2020 +0800
Subject: bcache: remove member accessed from struct btree
Link: 
https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686

commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
Author: Coly Li 
Date:   Fri Jan 24 01:01:41 2020 +0800
Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
Link: 
https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795

commit e3de04469a49ee09c89e80bf821508df458ccee6
Author: Coly Li 
Date:   Fri Jan 24 01:01:42 2020 +0800
Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
Link: Link: 
https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6

The first commit is a dependency of the other two. The first commit
removes a "recently accessed" marker, used to indicate if a particular
cache has been used recently, and if it has been, not consider it for
cache eviction. The commit mentions that under heavy IO, all caches will
end up being recently accessed, and nothing will ever be shrunk.

The second commit changes a previous design decision of skipping the
first 3 caches to shrink, since it is a common case to call
bch_mca_scan() with nr being 1, or 2, just like 0 was common in the very
first commit I mentioned. This time, if we land on 1 or 2, the loop
exits and nothing happens, and we waste time waiting on locks, just like
the very first commit I mentioned. The fix is to try shrink caches from
the tail of the list, and not the beginning.

The third commit fixes a minor issue where we don't want to re-arrange
the linked list, which is what the second commit ended up doing, and
instead, just shrink the cache at the end of the linked list, and not
change the order.

All three commits are clean cherry picks to 4.15.0-118-generic, and I
have got a test kernel building for you in:

https://launchpad.net/~mruffell/+archive/ubuntu/sf294907-test

It should be ready in a few hours time. Make sure the build has finished
before trying to install.

Please note this package is NOT SUPPORTED by Canonical, and is for
TESTING PURPOSES ONLY. ONLY Install in a dedicated test environment.

Instructions to install (on a Bionic system):
1) sudo add-apt-repository ppa:mruffell/sf294907-test
2) sudo apt update
3) sudo apt install linux-image-unsigned-4.15.0-118-generic 
linux-modules-4.15.0-118-generic \
linux-modules-extra-4.15.0-118-generic linux-headers-4.15.0-118 
linux-headers-4.15.0-118-generic
4) sudo reboot
5) uname -rv

The uname string should be along the lines of
4.15.0-118.119+TEST294907v20201009b1. You may need to change your grub
settings if you don't boot into the correct kernel.

If you can, schedule some downtime and install the test kernel. Enable
bcache shrinking and try reproduce the problem. Hopefully the test
kernel is an improvement and fixes your issues.

Let me know how the test kernel goes.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Benjamin Allot
Hi Mauricio,

The tests in [5][6] and [7] have been done with a 44GB memory VM.
This VM usually has 64GB of allocated memory.
The goal was to verify that the load of the whole btree in RAM was not 
prevented (like in mca-reap -> down_write_trylock).

All this kernel stuff is rather new to me, so I may be following a wrong
lead.

** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
  This is the full list of process and the kernel function they were stuck in 
[0].
  
  The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped
  significantly
  
  Here is a trace of the bcache event related lock in the kernel obtained
  with some bpfcc-tools [4].
  
  klockstat-bpfcc -c bch_ -i 5 -s 3
  
  The trace has been run in parallel with the following command line
  
  echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
  $(date)"; echo 0 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
  60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
  60; echo "End of test: $(date)"
  
  Trying to dig more, we reduced by 20 GB the memory allocated to a VM on the 
server.
  * The bcache btree size fluctuation seems "normal" [5]
- * I noticed that, when the shrinker was enabled,a lot of time was spent in 
the locks during "bch_btree_insert_node".
+ * I noticed that, when the shrinker was enabled,a lot of time was spent in 
the locks during "bch_btree_insert_node". [6]
  
  I decided to check if one of the function called during
  bch_btree_insert_node was taking longer than usual when the shrinker was
  enabled.
  
  I finally found the "funclatency" tool and tried do have the same approach I 
had with the klockstat [7]. However, that was inconclusive. I could see there 
that the bch_btree_insert_node was barely called during the whole duration of 
the test.
  Which made me think it's amount of time spent in lock is more due to another 
process acquiring the lock.
  
  I'm going to try to have another go with some perf/klockstat/funclatency
  focused on bch_mca_scan and the function called there.
  
  Also, here are some memory related metrics [8].
  
- 
  Now another perf stacktrace with the command used [9].
  Strangely this one doesn't show any bch_mca_scan at all.
  
  I enabled the shrinker again, hoping to get more traces, but apparently the 
timeframe was not right. Not enough load to trigger the cliff resulting in a 
1sec IOwait plateau.
  Which is interesting because that means that without the maximal workload, 
the kernel can cope with the shrinker.
- 
  
  [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
  [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
  [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
  [5]: https://pastebin.ubuntu.com/p/McjxxqTVjn/
  [6]: https://pastebin.ubuntu.com/p/KmrnW4Ng8F/
  [7]: https://pastebin.ubuntu.com/p/fSX4c7tTFV/
  [8]: https://pastebin.ubuntu.com/p/CZgXkgKhmJ/
  [9]: https://pastebin.ubuntu.com/p/DzKCP8NGdf/
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
   crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 
'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
  
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic 
root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-118-generic N/A
   

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Benjamin Allot
** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
  This is the full list of process and the kernel function they were stuck in 
[0].
  
  The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped
  significantly
  
  Here is a trace of the bcache event related lock in the kernel obtained
  with some bpfcc-tools [4].
  
  klockstat-bpfcc -c bch_ -i 5 -s 3
  
  The trace has been run in parallel with the following command line
  
  echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
  $(date)"; echo 0 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
  60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
  60; echo "End of test: $(date)"
  
  Trying to dig more, we reduced by 20 GB the memory allocated to a VM on the 
server.
  * The bcache btree size fluctuation seems "normal" [5]
  * I noticed that, when the shrinker was enabled,a lot of time was spent in 
the locks during "bch_btree_insert_node".
  
  I decided to check if one of the function called during
  bch_btree_insert_node was taking longer than usual when the shrinker was
  enabled.
  
  I finally found the "funclatency" tool and tried do have the same approach I 
had with the klockstat [7]. However, that was inconclusive. I could see there 
that the bch_btree_insert_node was barely called during the whole duration of 
the test.
  Which made me think it's amount of time spent in lock is more due to another 
process acquiring the lock.
  
  I'm going to try to have another go with some perf/klockstat/funclatency
  focused on bch_mca_scan and the function called there.
  
  Also, here are some memory related metrics [8].
  
  
+ Now another perf stacktrace with the command used [9].
+ Strangely this one doesn't show any bch_mca_scan at all.
+ 
+ I enabled the shrinker again, hoping to get more traces, but apparently the 
timeframe was not right. Not enough load to trigger the cliff resulting in a 
1sec IOwait plateau.
+ Which is interesting because that means that without the maximal workload, 
the kernel can cope with the shrinker.
+ 
+ 
  [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
  [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
  [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
  [5]: https://pastebin.ubuntu.com/p/McjxxqTVjn/
  [6]: https://pastebin.ubuntu.com/p/KmrnW4Ng8F/
  [7]: https://pastebin.ubuntu.com/p/fSX4c7tTFV/
  [8]: https://pastebin.ubuntu.com/p/CZgXkgKhmJ/
+ [9]: https://pastebin.ubuntu.com/p/DzKCP8NGdf/
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
   crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 
'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
  
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic 
root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-118-generic N/A
   linux-backports-modules-4.15.0-118-generic  N/A
   linux-firmware  1.173.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2019-09-27 (375 days ago)
  dmi.bios.date: 05/05/2011
  dmi.bios.vendor: HP
  dmi.bios.version: P67
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: 
dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
  

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Benjamin Allot
** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
  This is the full list of process and the kernel function they were stuck in 
[0].
  
  The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped
  significantly
  
  Here is a trace of the bcache event related lock in the kernel obtained
  with some bpfcc-tools [4].
  
  klockstat-bpfcc -c bch_ -i 5 -s 3
  
  The trace has been run in parallel with the following command line
  
  echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
  $(date)"; echo 0 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
  60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
  /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
  60; echo "End of test: $(date)"
  
+ Trying to dig more, we reduced by 20 GB the memory allocated to a VM on the 
server.
+ * The bcache btree size fluctuation seems "normal" [5]
+ * I noticed that, when the shrinker was enabled,a lot of time was spent in 
the locks during "bch_btree_insert_node".
+ 
+ I decided to check if one of the function called during
+ bch_btree_insert_node was taking longer than usual when the shrinker was
+ enabled.
+ 
+ I finally found the "funclatency" tool and tried do have the same approach I 
had with the klockstat [7]. However, that was inconclusive. I could see there 
that the bch_btree_insert_node was barely called during the whole duration of 
the test.
+ Which made me think it's amount of time spent in lock is more due to another 
process acquiring the lock.
+ 
+ I'm going to try to have another go with some perf/klockstat/funclatency
+ focused on bch_mca_scan and the function called there.
+ 
+ Also, here are some memory related metrics [8].
+ 
  
  [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
  [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
  [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
+ [5]: https://pastebin.ubuntu.com/p/McjxxqTVjn/
+ [6]: https://pastebin.ubuntu.com/p/KmrnW4Ng8F/
+ [7]: https://pastebin.ubuntu.com/p/fSX4c7tTFV/
+ [8]: https://pastebin.ubuntu.com/p/CZgXkgKhmJ/
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
   crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 
'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
  
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic 
root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-118-generic N/A
   linux-backports-modules-4.15.0-118-generic  N/A
   linux-firmware  1.173.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2019-09-27 (375 days ago)
  dmi.bios.date: 05/05/2011
  dmi.bios.vendor: HP
  dmi.bios.version: P67
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: 
dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
  dmi.product.family: ProLiant
  dmi.product.name: ProLiant DL380 G7
  dmi.sys.vendor: HP

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Mauricio Faria de Oliveira
Hi Benjamin,

It's interesting that the memory shrinkers are being invoked without memory 
pressure.
I'll try to understand it better, maybe I'm missing something.

I guess the before/after memory sizes for the test are swapped?
i.e., to alleviate potential memory pressure the VM's memory had
been _increased_ from 44G to 64G? and not the opposite.

And maybe it's not necessary to trace/reproduce the exact load with
fio; perhaps one of fio's default/generic modes ([rand]r/w/rw) w/
a variation of queue depth might hit a similar pattern on iowait.
(If that helps w/ one less step to do :)

cheers,
Mauricio

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Benjamin Allot
Hello Mauricio,

That was also one of the conclusion we reached yesterday.

I followed a wrong lead yesterday, after getting another more detailed
output with klockstat.

I'll update the description of the bug at the end, adding the bits I
found.

The system is not really having memory pressure but to confirm this, we resized 
the memory allocated to a VM on the server from 64GB to 44GB and looked into 
the shrinker behavior.
It didn't change a thing, and the IO wait started to rise again.

As for the IO load, I'm not sure, I would probably need to do a blktrace
and register it (never did that, but I read that you can then passes the
result to fio to reproduce a load)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-08 Thread Benjamin Allot
** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
  This is the full list of process and the kernel function they were stuck in 
[0].
  
  The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped significantly
- [0]: https://pastebin.canonical.com/p/wYYKwHdRXk/
- [1]: https://pastebin.canonical.com/p/n2Tw57QyBC/
- [2]: https://pastebin.canonical.com/p/3QqFTfdHhX/
+ [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
+ [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
+ [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
   crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 
'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
  
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic 
root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-118-generic N/A
   linux-backports-modules-4.15.0-118-generic  N/A
   linux-firmware  1.173.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2019-09-27 (375 days ago)
  dmi.bios.date: 05/05/2011
  dmi.bios.vendor: HP
  dmi.bios.version: P67
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: 
dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
  dmi.product.family: ProLiant
  dmi.product.name: ProLiant DL380 G7
  dmi.sys.vendor: HP

** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
  This is the full list of process and the kernel function they were stuck in 
[0].
  
  The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
   | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
- The server started to respond normally and the IO wait dropped significantly
+ The server started to respond normally and the IO wait dropped
+ significantly
+ 
+ Here is a trace of the bcache event related lock in the kernel obtained
+ with some bpfcc-tools [4].
+ 
+ klockstat-bpfcc -c bch_ -i 5 -s 3
+ 
+ The trace has been run in parallel with the following command line
+ 
+ echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
+ $(date)"; echo 0 | sudo tee
+ /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
+ 60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
+ /sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
+ 60; echo "End of test: $(date)"
+ 
+ 
  [0]: https://pastebin.ubuntu.com/p/QYXPdsMCWC/
  [1]: https://pastebin.ubuntu.com/p/BFsvF7H54r/
  [2]: https://pastebin.ubuntu.com/p/35qdsHYHf5/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
+ [4]: https://pastebin.ubuntu.com/p/qhyqP35fCw/
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-07 Thread Mauricio Faria de Oliveira
Hi Benjamin,

Thanks for the detailed report.

Does this system show signs of memory pressure?

bch_mca_scan() is part of bcache's memory shrinker, and thus should be
called when the system is trying to release memory from its several
caches.

Also, the bucket lock usage is widespread in bcache (from a quick grep;
and more used on writes, I'd imagine) thus if bch_mca_scan() is waiting
a lot on it, ie, showing lock contention, it would seem like the IO load
is indeed significant, as you mentioned.

Do you know the IO load profile, or could reproduce the issue with the
fio tool?

If we can reproduce this, there _might_ be some heuristics to consider
to use non-blocking trylock instead of blocking lock, and bail out/don't
wait if the lock is taken and it doesn't seem worth it. (but that would
be a development item, not a proper "bug" fix :)

cheers,
Mauricio

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-07 Thread Benjamin Allot
Here is a trace of the bcache event related lock in the kernel obtained
with some bpfcc-tools.

klockstat-bpfcc -c bch_ -i 5 -s 3

The trace has been run in parallel with the following command line

echo "Shrinker disabled: $(date)"; sleep 60; echo "Enabling shrinker:
$(date)"; echo 0 | sudo tee
/sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled ; sleep
60; echo "Disabling shrinker: $(date)"; echo 1 | sudo tee
/sys/block/bcache0/bcache/cache/internal/btree_shrinker_disabled; sleep
60; echo "End of test: $(date)"

The log are here : https://pastebin.canonical.com/p/jVKdbV3RrK/

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-07 Thread Benjamin Allot
Actually we cannot be sure no. The server didn't have any metrics prior
to few days ago and the issue was already there.

It's worth nothing that few servers have this bcache configuration,
because the cache mode is configured as writethrough and the load is
pretty significant.

So no last "good" version.

Actually, we have various IO wait issue on another platform (but running
xenial-hwe kernel) and we suspected bcache already. Mentioning it to
show that bcache behavior seems to be related to some disk performance
since quite some time.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-07 Thread Juerg Haefliger
** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-06 Thread Khaled El Mously
correction: from *a prior kernel version

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-06 Thread Khaled El Mously
@Benjamin Allot is this a performance regression from the prior kernel
version? If so, can you confirm the last "good" version? Perhaps we can
do a bisect

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1898786] Re: Issue with bcache bch_mca_scan causing huge IO wait

2020-10-06 Thread Haw Loeung
** Description changed:

  Hello,
  
  In short, we faced an issue with a huge IO wait on a bionic Ubuntu 
4.15.0-118.119-generic kernel.
- This is the full list of process and the kernel function they were stuck in 
[2].
+ This is the full list of process and the kernel function they were stuck in 
[0].
  
- The main issue can probably be summarized by this perf reports 
+ The main issue can probably be summarized by this perf reports
  * first identify that the cpu are stuck in idle because of something[1]
  * second, see what kernel function seems to stuck the process kswapd0 and 
kswapd1 [2].
  
  We could see that this seems to be the mutex_lock in the bch_mca_scan
  function [3].
  
  After running the command:
  
-  | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
+  | sudo bash -c "echo 1 > /sys/fs/bcache/f1a1e8cb-3e6b-40ea-852e-
  583c48d0c2b8/internal/btree_shrinker_disabled"
  
  The server started to respond normally and the IO wait dropped significantly
  [0]: https://pastebin.canonical.com/p/wYYKwHdRXk/
  [1]: https://pastebin.canonical.com/p/n2Tw57QyBC/
  [2]: https://pastebin.canonical.com/p/3QqFTfdHhX/
  [3]: 
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/md/bcache/btree.c?h=Ubuntu-4.15.0-118.119#n674
  
  
  $ cat /proc/version_signature
  Ubuntu 4.15.0-118.119-generic 4.15.18
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-118-generic 4.15.0-118.119
  ProcVersionSignature: User Name 4.15.0-118.119-generic 4.15.18
  Uname: Linux 4.15.0-118-generic x86_64
  AlsaDevices:
-  total 0
-  crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
-  crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
+  total 0
+  crw-rw 1 root audio 116,  1 Sep 29 10:04 seq
+  crw-rw 1 root audio 116, 33 Sep 29 10:04 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.16
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 
'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Tue Oct  6 20:36:18 2020
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  MachineType: HP ProLiant DL380 G7
  PciMultimedia:
-  
+ 
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-118-generic 
root=UUID=c6ad1629-a506-4043-a339-6d57f0708d12 ro console=ttyS1,115200 nosplash
  RelatedPackageVersions:
-  linux-restricted-modules-4.15.0-118-generic N/A
-  linux-backports-modules-4.15.0-118-generic  N/A
-  linux-firmware  1.173.18
+  linux-restricted-modules-4.15.0-118-generic N/A
+  linux-backports-modules-4.15.0-118-generic  N/A
+  linux-firmware  1.173.18
  RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2019-09-27 (375 days ago)
  dmi.bios.date: 05/05/2011
  dmi.bios.vendor: HP
  dmi.bios.version: P67
  dmi.chassis.type: 23
  dmi.chassis.vendor: HP
  dmi.modalias: 
dmi:bvnHP:bvrP67:bd05/05/2011:svnHP:pnProLiantDL380G7:pvr:cvnHP:ct23:cvr:
  dmi.product.family: ProLiant
  dmi.product.name: ProLiant DL380 G7
  dmi.sys.vendor: HP

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1898786

Title:
  Issue with bcache bch_mca_scan causing huge IO wait

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1898786/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs