[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-10-15 Thread Frank Schilder
> Just a datapoint - we upgraded several large Mimic-born clusters straight to 15.2.12 with the quick fsck disabled in ceph.conf, then did require-osd-release, and finally

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-26 Thread Tyler Stachecki
Just a datapoint - we upgraded several large Mimic-born clusters straight to 15.2.12 with the quick fsck disabled in ceph.conf, then did require-osd-release, and finally did the omap conversion offline after the cluster was upgraded using the bluestore tool while the OSDs were down (all done in
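A minimal sketch of that sequence, assuming the option in question is bluestore_fsck_quick_fix_on_mount and with osd.0 standing in for each OSD id (verify option and tool behaviour against your exact release):

    # ceph.conf on the OSD hosts, in place before restarting into Octopus:
    [osd]
        bluestore_fsck_quick_fix_on_mount = false

    # once every daemon runs Octopus:
    ceph osd require-osd-release octopus

    # then per OSD, with the daemon stopped, convert the omap data offline:
    systemctl stop ceph-osd@0
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
    systemctl start ceph-osd@0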

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-26 Thread Marc
Hi Frank, Thank you very much for this! :)
> we just completed a third upgrade test. There are 2 ways to convert the OSDs:
> A) convert along with the upgrade (quick-fix-on-start=true)
> B) convert after setting require-osd-release=octopus (quick-fix-on-start=false until
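In config terms the two modes boil down to roughly the following (a sketch, assuming the option is bluestore_fsck_quick_fix_on_mount):

    # A) convert during the upgrade: option left true, each OSD converts
    #    its omap format on the first start after the binaries are updated
    bluestore_fsck_quick_fix_on_mount = true

    # B) convert afterwards: keep it false while upgrading, run
    #    "ceph osd require-osd-release octopus", then set it to true and
    #    restart the OSDs (or convert offline with ceph-bluestore-tool)
    bluestore_fsck_quick_fix_on_mount = false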

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Boris Behrens
Hi, I just checked and all OSDs have it set to true. It also does not seem to be a problem with the snaptrim operation. We had two occasions in the last 7 days where nearly all OSDs logged a lot (around 3k times in 20 minutes) of these messages: 2022-09-12T20:27:19.146+0200 7f576de49700 -1 osd.9 786378
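For reference, one way to confirm the value on every OSD (a sketch; osd.9 is just an example id):

    # effective value from the central config for all OSDs
    ceph config get osd bluefs_buffered_io

    # value a specific running daemon is actually using (run on its host)
    ceph daemon osd.9 config get bluefs_buffered_io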

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Wesley Dillingham
I haven't read through this entire thread so forgive me if already mentioned: What is the parameter "bluefs_buffered_io" set to on your OSDs? We once saw a terrible slowdown on our OSDs during snaptrim events and setting bluefs_buffered_io to true alleviated that issue. That was on a nautilus
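If it turns out not to be true, a sketch of how to flip it (osd.9 is a placeholder; older releases may need an OSD restart for it to take full effect):

    # persist the setting for all OSDs in the mon config store
    ceph config set osd bluefs_buffered_io true

    # or inject it into a single running OSD
    ceph tell osd.9 injectargs '--bluefs_buffered_io=true'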

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Boris Behrens
The cluster is SSD-only with 2TB, 4TB and 8TB disks, so I would expect this to be done fairly fast. For now I will recreate every OSD in the cluster and check if this helps. Do you experience slow ops (so the cluster shows a message like "cluster [WRN] Health check update: 679 slow ops,
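To see which OSDs the slow ops are reported against and what they are stuck on, something along these lines (a sketch; osd.9 is a placeholder):

    # which OSDs report slow ops, and since when
    ceph health detail | grep -i 'slow ops'

    # on the affected host: what those ops are waiting on
    ceph daemon osd.9 dump_ops_in_flight
    ceph daemon osd.9 dump_historic_slow_ops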

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Marc
> It might be possible that converting OSDs before setting require-osd-release=octopus leads to a broken state of the converted OSDs. I could not yet find a way out of this situation. We will soon perform a third upgrade test to test this hypothesis.
So with upgrading one should
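The currently active value can be checked before converting anything (a sketch using the standard CLI):

    # prints e.g. "require_osd_release octopus" once the flag has been set
    ceph osd dump | grep require_osd_release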

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Boris Behrens
Hi Frank, we converted the OSDs directly during the upgrade: 1. install the new ceph version 2. restart all OSD daemons 3. wait some time (took around 5-20 minutes) 4. all OSDs were online again. So I would expect that the OSDs are all upgraded correctly. I also checked when the trimming happens,
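A quick way to confirm that every daemon really came back on the new release after such a restart (a sketch):

    # per-daemon-type version summary; all OSDs should report the octopus build
    ceph versions
    ceph osd versions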

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Boris Behrens
I checked the cluster for other snaptrim operations and they happen all over the place, so for me it looks like they just happened to be running when the issue occurred, but were not the driving factor. On Tue, 13 Sep 2022 at 12:04, Boris Behrens wrote: > Because someone mentioned that the
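The snaptrim check itself can be done from the PG states, for example (a sketch using standard commands):

    # PGs currently trimming snapshots or queued for it
    ceph pg dump pgs_brief | grep snaptrim

    # overall PG state summary
    ceph pg stat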

[ceph-users] Re: laggy OSDs and staling krbd IO after upgrade from nautilus to octopus

2022-09-13 Thread Boris Behrens
Because someone mentioned that the attachments did not go through, I created pastebin links: monlog: https://pastebin.com/jiNPUrtL osdlog: https://pastebin.com/dxqXgqDz On Tue, 13 Sep 2022 at 11:43, Boris Behrens wrote: > Hi, I need your help really badly. > > we are currently

[ceph-users] Re: Laggy OSDs

2022-03-29 Thread Alex Closs
Hi - I've been bitten by that too and checked, and that *did* happen, but I swapped them off a while ago. Thanks for your quick reply :) -Alex On Mar 29, 2022, 6:26 PM -0400, Arnaud M wrote: > Hello > > is swap enabled on your host ? Is swap used ? > > For our cluster we tend to allocate enough

[ceph-users] Re: Laggy OSDs

2022-03-29 Thread Arnaud M
Hello, is swap enabled on your host? Is swap used? For our cluster we tend to allocate enough RAM and disable swap. Maybe the reboot of your host re-activated swap? Try to disable swap and see if it helps. All the best Arnaud On Tue, 29 Mar 2022 at 23:41, David Orman wrote: > We're
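For the swap check, a minimal sketch on the OSD host:

    # is swap active, and how much is in use?
    swapon --show
    free -h

    # turn it off now; also remove or comment the swap line in /etc/fstab
    # so a reboot does not bring it back
    swapoff -a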

[ceph-users] Re: Laggy OSDs

2022-03-29 Thread David Orman
We're definitely dealing with something that sounds similar, but it is hard to state definitively without more detail. Do you have object lock/versioned buckets in use (especially if one started being used around the time of the slowdown)? Was this cluster always 16.2.7? What is your pool configuration
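For the pool-configuration part of that question, the usual starting points are (a sketch; bucket-level versioning/object-lock details would additionally need the RGW admin tooling):

    # size/EC profile, pg_num and flags for every pool
    ceph osd pool ls detail

    # per-pool usage and object counts
    ceph df detail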