Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On Thu, Jul 02, 2020 at 03:57:32PM +0200, Alexandre DERUMIER wrote:
> Hi,
>
> you should give ceph octopus a try. librbd has greatly improved for
> write, and I can recommend enabling writeback by default now.

Yes, that will probably work even better. But what I'm trying to determine
here is whether krbd is a better choice than librbd :)

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl

___
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
Hi,

you should give ceph octopus a try. librbd has greatly improved for write,
and I can recommend enabling writeback by default now.

Here are some iops results with 1 vm - 1 disk - 4k blocks, iodepth=64,
librbd, no iothread:

              nautilus-cache=none  nautilus-cache=writeback  octopus-cache=none  octopus-cache=writeback
randread 4k         62.1k                  25.2k                    61.1k                60.8k
randwrite 4k        27.7k                  19.5k                    34.5k                53.0k
seqwrite 4k          7850                  37.5k                    24.9k                82.6k

----- Original Message -----
From: "Mark Schouten"
To: "proxmoxve"
Sent: Thursday, 2 July 2020 15:15:20
Subject: Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.

On Thu, Jul 02, 2020 at 09:06:54AM +1000, Lindsay Mathieson wrote:
> I did some adhoc testing last night - definitely a difference, in KRBD's
> favour. Both sequential and random IO was much better with it enabled.

Interesting! I just did some testing too on our demo cluster: Ceph with
6 OSDs over three nodes, size 2.

root@node04:~# pveversion
pve-manager/6.2-6/ee1d7754 (running kernel: 5.4.41-1-pve)
root@node04:~# ceph -v
ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)

rbd create fio_test --size 10G -p Ceph
rbd create map_test --size 10G -p Ceph
rbd map Ceph/map_test

When just using a write test (rw=randwrite), krbd wins, big:

rbd:
  WRITE: bw=37.9MiB/s (39.8MB/s), 37.9MiB/s-37.9MiB/s (39.8MB/s-39.8MB/s), io=10.0GiB (10.7GB), run=269904-269904msec
krbd:
  WRITE: bw=207MiB/s (217MB/s), 207MiB/s-207MiB/s (217MB/s-217MB/s), io=10.0GiB (10.7GB), run=49582-49582msec

However, using rw=randrw (rwmixread=75), things change a lot:

rbd:
  READ:  bw=49.0MiB/s (52.4MB/s), 49.0MiB/s-49.0MiB/s (52.4MB/s-52.4MB/s), io=7678MiB (8051MB), run=153607-153607msec
  WRITE: bw=16.7MiB/s (17.5MB/s), 16.7MiB/s-16.7MiB/s (17.5MB/s-17.5MB/s), io=2562MiB (2687MB), run=153607-153607msec
krbd:
  READ:  bw=5511KiB/s (5643kB/s), 5511KiB/s-5511KiB/s (5643kB/s-5643kB/s), io=7680MiB (8053MB), run=1426930-1426930msec
  WRITE: bw=1837KiB/s (1881kB/s), 1837KiB/s-1837KiB/s (1881kB/s-1881kB/s), io=2560MiB (2685MB), run=1426930-1426930msec

Maybe I'm interpreting or testing stuff wrong, but it looks like simply
writing to krbd is much faster, while actually trying to use that data
seems slower. Let me know what you guys think.

Attachments are being stripped, IIRC, so here's the config and the full
output of the tests:

RESULTS
=======
rbd_write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
rbd_readwrite: (g=1): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32
krbd_write: (g=2): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
krbd_readwrite: (g=3): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 4 processes
Jobs: 1 (f=1): [_(3),f(1)][100.0%][eta 00m:00s]
rbd_write: (groupid=0, jobs=1): err= 0: pid=1846441: Thu Jul  2 15:08:42 2020
  write: IOPS=9712, BW=37.9MiB/s (39.8MB/s)(10.0GiB/269904msec); 0 zone resets
    slat (nsec): min=943, max=1131.9k, avg=6367.94, stdev=10934.84
    clat (usec): min=1045, max=259066, avg=3286.70, stdev=4553.24
     lat (usec): min=1053, max=259069, avg=3293.06, stdev=4553.20
    clat percentiles (usec):
     |  1.00th=[  1844],  5.00th=[  2114], 10.00th=[  2311], 20.00th=[  2573],
     | 30.00th=[  2769], 40.00th=[  2933], 50.00th=[  3064], 60.00th=[  3228],
     | 70.00th=[  3425], 80.00th=[  3621], 90.00th=[  3982], 95.00th=[  4359],
     | 99.00th=[  5538], 99.50th=[  6718], 99.90th=[ 82314], 99.95th=[125305],
     | 99.99th=[187696]
   bw (  KiB/s): min=17413, max=40282, per=83.81%, avg=32561.17, stdev=3777.39, samples=539
   iops        : min= 4353, max=10070, avg=8139.93, stdev=944.34, samples=539
  lat (msec)   : 2=2.64%, 4=87.80%, 10=9.37%, 20=0.08%, 50=0.01%
  lat (msec)   : 100=0.02%, 250=0.09%, 500=0.01%
  cpu          : usr=8.73%, sys=5.27%, ctx=1254152, majf=0, minf=8484
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2621440,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
rbd_readwrite: (groupid=1, jobs=1): err= 0: pid=1852029: Thu Jul  2 15:08:42 2020
  read: IOPS=12.8k, BW=49.0MiB/s (52.4MB/s)(7678MiB/153607msec)
    slat (nsec): min=315, max=4467.8k, avg=3247.91, stdev=7360.28
    clat (usec): min=276, max=160495, avg=1412.53, stdev=656.11
     lat (usec): min=281, max=160497, avg=1415.78, stdev=656.02
    clat percentiles (usec):
     |  1.00th=[  494],  5.00th=[
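Mark notes the attachment with the config was stripped. From the job
definitions in the output above, the job file presumably looked roughly
like this (a reconstruction, not the original; the pool/image names and
the mapped device path /dev/rbd0 are assumptions based on the rbd
commands shown earlier):

```ini
; Reconstructed fio job file (sketch). Assumes the librbd jobs target
; pool "Ceph" / image "fio_test", and that "rbd map Ceph/map_test"
; exposed the image as /dev/rbd0 for the krbd jobs.
[global]
bs=4k
iodepth=32
direct=1
size=10G

[rbd_write]
stonewall
ioengine=rbd
clientname=admin
pool=Ceph
rbdname=fio_test
rw=randwrite

[rbd_readwrite]
stonewall
ioengine=rbd
clientname=admin
pool=Ceph
rbdname=fio_test
rw=randrw
rwmixread=75

[krbd_write]
stonewall
ioengine=libaio
filename=/dev/rbd0
rw=randwrite

[krbd_readwrite]
stonewall
ioengine=libaio
filename=/dev/rbd0
rw=randrw
rwmixread=75
```

`stonewall` makes each job wait for the previous one to finish, which
matches the four sequential groups (g=0..3) in the output.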
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On 30/06/2020 11:09 pm, Mark Schouten wrote:
> I think this is incorrect. Using KRBD uses the kernel driver, which is
> usually older than the userland version. Also, upgrading is easier when
> not using KRBD. I'd like to hear that I'm wrong, am I? :)

I did some adhoc testing last night - definitely a difference, in KRBD's
favour. Both sequential and random IO was much better with it enabled.

There are some interesting threads online regarding librbd vs KRBD.
Apparently since 12.x librbd should perform better, but a lot of people
aren't seeing it.

-- 
Lindsay
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On 2/07/2020 2:54 am, jameslipski via pve-user wrote:
> Thank you, there is definitely an improvement to using krbd -- not
> seeing any i/o waits.

Excellent, glad to hear it.

-- 
Lindsay
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
--- Begin Message ---
Thank you, there is definitely an improvement to using krbd -- not seeing
any i/o waits.

‐‐‐ Original Message ‐‐‐
On Tuesday, June 30, 2020 7:47 PM, Lindsay Mathieson wrote:

> On 30/06/2020 11:09 pm, Mark Schouten wrote:
>
> > I think this is incorrect. Using KRBD uses the kernel-driver which is
> > usually older than the userland-version. Also, upgrading is easier when
> > not using KRBD.
>
> Older yes - Luminous (12.x)
>
> But it supports sufficient features and I found it considerably faster
> than the user driver.
>
> --
> Lindsay
--- End Message ---
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On 30/06/2020 11:09 pm, Mark Schouten wrote:
> I think this is incorrect. Using KRBD uses the kernel-driver which is
> usually older than the userland-version. Also, upgrading is easier when
> not using KRBD.

Older, yes - Luminous (12.x). But it supports sufficient features, and I
found it considerably faster than the user driver.

-- 
Lindsay
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On 30/06/2020 10:07 pm, jameslipski via pve-user wrote:
> Before I update ceph, regarding KRBD: I've just enabled it. Do I have to
> re-create the pool, restart ceph, restart the node, etc., or does it
> just take effect?

No need to recreate the pool - just stop/start the VMs accessing it.

-- 
Lindsay
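In Proxmox, KRBD is a flag on the storage definition rather than a
property of the Ceph pool itself, which is why nothing needs rebuilding.
As a sketch, the relevant stanza in /etc/pve/storage.cfg looks something
like this with KRBD enabled (the storage/pool name "Ceph" and the monitor
addresses here are assumptions taken from the setup described at the
start of the thread):

```
rbd: Ceph
        content images
        krbd 1
        monhost 10.125.0.101 10.125.0.102 10.125.0.103
        pool Ceph
        username admin
```

Running VMs keep using whichever driver their disks were attached with;
the switch only applies the next time a disk is attached, hence the
stop/start of the VMs.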
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On Tue, Jun 30, 2020 at 11:28:51AM +1000, Lindsay Mathieson wrote:
> Do you have KRBD set for the Proxmox Ceph Storage? That helps a lot.

I think this is incorrect. Using KRBD uses the kernel driver, which is
usually older than the userland version. Also, upgrading is easier when
not using KRBD. I'd like to hear that I'm wrong, am I? :)

-- 
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | i...@tuxis.nl
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
--- Begin Message ---
Thanks for the reply.

All nodes are connected to a 10Gbit switch. Ceph is currently running on
14.2.2, but I will update to the latest. KRBD was not enabled on the pool.

Before I update ceph, regarding KRBD: I've just enabled it. Do I have to
re-create the pool, restart ceph, restart the node, etc., or does it just
take effect?

‐‐‐ Original Message ‐‐‐
On Monday, June 29, 2020 9:28 PM, Lindsay Mathieson wrote:

> On 30/06/2020 11:08 am, jameslipski via pve-user wrote:
>
> > Just to give a little bit of background: we currently have 6 nodes.
> > We're running CEPH, and each node consists of 2 OSDs (each node has
> > 2x Intel SSDSC2KG019T8); OSD type is bluestore. Global ceph
> > configuration (at least as shown on the proxmox interface) is as
> > follows:
>
> Network config? (i.e. speed etc.)
>
> Ceph is Nautilus 14.2.9? (latest on proxmox)
>
> Do you have KRBD set for the Proxmox Ceph Storage? That helps a lot.
>
> --
> Lindsay
--- End Message ---
Re: [PVE-User] High I/O waits, not sure if it's a ceph issue.
On 30/06/2020 11:08 am, jameslipski via pve-user wrote:
> Just to give a little bit of background: we currently have 6 nodes.
> We're running CEPH, and each node consists of 2 OSDs (each node has 2x
> Intel SSDSC2KG019T8); OSD type is bluestore. Global ceph configuration
> (at least as shown on the proxmox interface) is as follows:

Network config? (i.e. speed etc.)

Ceph is Nautilus 14.2.9? (latest on proxmox)

Do you have KRBD set for the Proxmox Ceph Storage? That helps a lot.

-- 
Lindsay
[PVE-User] High I/O waits, not sure if it's a ceph issue.
--- Begin Message ---
Greetings,

I'm trying out PVE. Currently I'm just doing tests, and I ran into an
issue relating to high I/O waits.

Just to give a little bit of background: we currently have 6 nodes. We're
running CEPH, and each node consists of 2 OSDs (each node has 2x Intel
SSDSC2KG019T8); OSD type is bluestore. Global ceph configuration (at
least as shown on the proxmox interface) is as follows:

[global]
auth_client_required =
auth_cluster_required =
auth_service_required =
cluster_network = 10.125.0.0/24
fsid = f64d2a67-98c3-4dbc-abfd-906ea7aaf314
mon_allow_pool_delete = true
mon_host = 10.125.0.101 10.125.0.102 10.125.0.103 10.125.0.105 10.125.0.106 10.125.0.104
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.125.0.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

If I'm missing any relevant information about my ceph setup (I'm still
learning this), please let me know. Each node has 2x Xeon E5-2660 v3.

Where I ran into high I/O waits is when running 2 VMs. One VM is a mysql
replication server (using 8 cores) and is performing mostly writes. The
second VM is running Debian with Cacti. These two systems are on
different nodes, but both use CEPH to store the VM disk. When I copied
files over the network to the VM running Cacti, I noticed high I/O waits
in my mysql VM.

I'm assuming that this has something to do with ceph, though the only
thing I'm seeing in the ceph logs is the following:

02:43:01.062082 mgr.node01 (mgr.2914449) 8009571 : cluster [DBG] pgmap v8009574: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 2.4 MiB/s wr, 274 op/s
02:43:03.063137 mgr.node01 (mgr.2914449) 8009572 : cluster [DBG] pgmap v8009575: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 3.0 MiB/s wr, 380 op/s
02:43:05.064125 mgr.node01 (mgr.2914449) 8009573 : cluster [DBG] pgmap v8009576: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.9 MiB/s wr, 332 op/s
02:43:07.065373 mgr.node01 (mgr.2914449) 8009574 : cluster [DBG] pgmap v8009577: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.7 MiB/s wr, 313 op/s
02:43:09.066210 mgr.node01 (mgr.2914449) 8009575 : cluster [DBG] pgmap v8009578: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 2.9 MiB/s wr, 350 op/s
02:43:11.066913 mgr.node01 (mgr.2914449) 8009576 : cluster [DBG] pgmap v8009579: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.1 MiB/s wr, 346 op/s
02:43:13.067926 mgr.node01 (mgr.2914449) 8009577 : cluster [DBG] pgmap v8009580: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.5 MiB/s wr, 408 op/s
02:43:15.068834 mgr.node01 (mgr.2914449) 8009578 : cluster [DBG] pgmap v8009581: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.0 MiB/s wr, 320 op/s
02:43:17.069627 mgr.node01 (mgr.2914449) 8009579 : cluster [DBG] pgmap v8009582: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 2.5 MiB/s wr, 285 op/s
02:43:19.070507 mgr.node01 (mgr.2914449) 8009580 : cluster [DBG] pgmap v8009583: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 341 B/s rd, 3.0 MiB/s wr, 349 op/s
02:43:21.071241 mgr.node01 (mgr.2914449) 8009581 : cluster [DBG] pgmap v8009584: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 0 B/s rd, 2.8 MiB/s wr, 319 op/s
02:43:23.072286 mgr.node01 (mgr.2914449) 8009582 : cluster [DBG] pgmap v8009585: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 2.7 MiB/s wr, 329 op/s
02:43:25.073369 mgr.node01 (mgr.2914449) 8009583 : cluster [DBG] pgmap v8009586: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 2.8 MiB/s wr, 304 op/s
02:43:27.074315 mgr.node01 (mgr.2914449) 8009584 : cluster [DBG] pgmap v8009587: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 2.2 MiB/s wr, 262 op/s
02:43:29.075284 mgr.node01 (mgr.2914449) 8009585 : cluster [DBG] pgmap v8009588: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd, 2.9 MiB/s wr, 342 op/s
02:43:31.076180 mgr.node01 (mgr.2914449) 8009586 : cluster [DBG] pgmap v8009589: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd, 2.4 MiB/s wr, 269 op/s
02:43:33.077523 mgr.node01 (mgr.2914449) 8009587 : cluster [DBG] pgmap v8009590: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd, 3.4 MiB/s wr, 389 op/s
02:43:35.078543 mgr.node01 (mgr.2914449) 8009588 : cluster [DBG] pgmap v8009591: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, 20 TiB / 21 TiB avail; 682 B/s rd, 3.1 MiB/s wr, 344 op/s
02:43:37.079428
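To skim the client throughput and op/s figures out of pgmap lines like
the ones above, a small parser helps. A sketch (the regex assumes the
exact field order shown in these logs; this is not part of any Ceph
tooling):

```python
import re

# Pull client I/O figures out of a ceph-mgr pgmap log line like the ones
# above. Fields that may be absent (e.g. "rd" when there is no read
# traffic) come back as None.
PGMAP_RE = re.compile(
    r"(?P<data>[\d.]+ \w+) data, (?P<used>[\d.]+ \w+) used.*?avail"
    r"(?:; (?:(?P<rd>[\d.]+ ?\w*B)/s rd, )?(?P<wr>[\d.]+ ?\w*B)/s wr, (?P<ops>\d+) op/s)?"
)

def parse_pgmap(line):
    """Return a dict of data/used/rd/wr/ops from a pgmap line, or None."""
    m = PGMAP_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    if d["ops"] is not None:
        d["ops"] = int(d["ops"])
    return d

line = ("02:43:03.063137 mgr.node01 (mgr.2914449) 8009572 : cluster [DBG] "
        "pgmap v8009575: 512 pgs: 512 active+clean; 246 GiB data, 712 GiB used, "
        "20 TiB / 21 TiB avail; 0 B/s rd, 3.0 MiB/s wr, 380 op/s")
print(parse_pgmap(line))
```

For an entry with no read figure (e.g. the 02:43:01 line), the "rd" key
is simply None, so the same parser works across all the variants above.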