[ceph-users] cephadm shell version not consistent across monitors

2024-04-02 Thread J-P Methot
Hi, We are still running Ceph Pacific with cephadm and we have run into a peculiar issue. When we run the `cephadm shell` command on monitor1, the container we get runs Ceph 16.2.9. However, when we run the same command on monitor2, the container runs 16.2.15, which is the current version of
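
For context, a minimal sketch of how one might compare the versions the daemons are actually running and pin the image used by the shell, assuming a cephadm-managed cluster; the image tag below is only a placeholder, not one taken from this thread:

    # Show the Ceph version of every running daemon
    ceph versions

    # List daemons managed by the orchestrator, including the image they run
    ceph orch ps

    # Open a shell from an explicit container image instead of the host's default
    cephadm --image quay.io/ceph/ceph:v16.2.15 shell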

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-11 Thread J-P Methot
or something for BlueStore was changed? You can check this via `ceph config diff`. As Mark said, it would be nice to have a tracker, if this really is a release problem. Thanks, k Sent from my iPhone On 7 Sep 2023, at 20:22, J-P Methot wrote: We went from 16.2.13 to 16.2.14 Also, timeout is 15
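
A sketch of two ways such a configuration check is commonly done, assuming access to the monitor CLI and to an OSD's admin socket; osd.0 is a placeholder:

    # Options overridden in the monitors' centralized config database
    ceph config dump

    # Options on a single daemon that differ from the built-in defaults
    ceph daemon osd.0 config diff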

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
to significant allocation fragmentation. That got fixed, but I wouldn't be surprised if we have some other sub-optimal behaviors we don't know about. Mark On 9/7/23 12:28, J-P Methot wrote: Hi, By this point, we're 95% sure that, contrary to our previous beliefs, it's an issue with changes

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
, but even with testing on NVMe clusters this was quite low (maybe a couple of percent). Mark On 9/7/23 10:21, J-P Methot wrote: Hi, Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common performance degradation

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
. I'll post the actual resolution when we confirm 100% that it works. On 9/7/23 12:18, Konstantin Shalygin wrote: Hi, On 7 Sep 2023, at 18:21, J-P Methot wrote: Since my post, we've been speaking with a member of the Ceph dev team. He did, at first, believe it was an issue linked to the common

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
appear to change anything. Deactivating scrubs altogether did not impact performance in any way. Furthermore, I'll stress that this is only happening since we upgraded to the latest Pacific, yesterday. On 9/7/23 10:49, Stefan Kooman wrote: On 07-09-2023 09:05, J-P Methot wrote: Hi, We're
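
Deactivating scrubs for this kind of test is usually done with the cluster-wide flags; a minimal sketch:

    # Pause scrubbing while investigating
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Re-enable it once done
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub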

[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
We're talking about automatic online compaction here, not running the command. On 9/7/23 04:04, Konstantin Shalygin wrote: Hi, On 7 Sep 2023, at 10:05, J-P Methot wrote: We're running latest Pacific on our production cluster and we've been seeing the dreaded 'OSD::osd_op_tp thread

[ceph-users] Rocksdb compaction and OSD timeout

2023-09-07 Thread J-P Methot
Hi, We're running the latest Pacific on our production cluster and we've been seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700 had timed out after 15.00954s' error. We have reasons to believe this happens each time the RocksDB compaction process is launched on an OSD. My question
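
The 15 s in that message matches the default of osd_op_thread_timeout; a short sketch of checks commonly run when chasing compaction stalls, where osd.0 is a placeholder:

    # The thread-pool timeout behind the message (default 15 seconds)
    ceph config get osd osd_op_thread_timeout

    # Manually trigger a RocksDB compaction on one OSD to see if it reproduces the stall
    ceph tell osd.0 compact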

[ceph-users] Mysteriously dead OSD process

2023-04-05 Thread J-P Methot
Hi, We currently use Ceph Pacific 16.2.10 deployed with cephadm on this storage cluster. Last night, one of our OSDs died. Since its storage device is an SSD, we ran hardware checks and found no issue with the SSD itself. However, if we try starting the service again, the container just
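
A minimal sketch of how the dead daemon's container can be inspected under cephadm; osd.12 and the fsid are placeholders:

    # Daemons cephadm knows about on this host, with their current state
    cephadm ls

    # Journal output of the failed OSD container
    cephadm logs --name osd.12
    journalctl -u ceph-<fsid>@osd.12.service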

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
There's nothing in the CPU graph that suggests soft lock-ups at these times. However, thank you for pointing out that the disk I/O scheduler could have an impact. Ubuntu seems to default to mq-deadline, so we just switched to none, as I believe it best fits our workload. I don't know if
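
For reference, checking and switching the scheduler at runtime looks roughly like this; sda is a placeholder device, and the change does not persist across reboots:

    # The active scheduler is shown in brackets
    cat /sys/block/sda/queue/scheduler

    # Switch to 'none', which generally suits NVMe/SSD-backed OSDs
    echo none > /sys/block/sda/queue/scheduler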

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
? Are you sharing NICs between public / replication networks? That is another metric that needs looking into. *From:* J-P Methot *Sent:* 18 January 2023 12:42 *To:* ceph-users *Subject:* [ceph-users] Flapping OSDs on pacific 16.2.10

[ceph-users] Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
Hi, We have a full SSD production cluster running on Pacific 16.2.10 and deployed with cephadm that is experiencing OSD flapping issues. Essentially, random OSDs will get kicked out of the cluster and then automatically brought back in a few times a day. As an example, let's take the case of
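
While investigating flapping, the usual stop-gap is to hold the OSDs in the map with cluster flags; a sketch (these flags mask symptoms and should be cleared afterwards):

    # Prevent OSDs from being marked down/out during the investigation
    ceph osd set nodown
    ceph osd set noout

    # Clear the flags once the root cause is found
    ceph osd unset nodown
    ceph osd unset noout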

[ceph-users] Re: OSD container won't boot up

2022-12-01 Thread J-P Methot
and the BlueStore tool's repair can't fix it. Is there a way to fix this, or is this completely unrepairable? On 11/29/22 13:57, J-P Methot wrote: Hi, I've been testing the cephadm upgrade process in my staging environment and I'm running into an issue where the docker container just doesn't boot
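
For reference, the repair attempt mentioned above would typically look something like this under cephadm, with the OSD stopped first; osd.12 is a placeholder:

    # Enter a shell with the OSD's data directory mounted
    cephadm shell --name osd.12

    # Check, then attempt to repair, the BlueStore metadata
    ceph-bluestore-tool fsck   --path /var/lib/ceph/osd/ceph-12
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12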

[ceph-users] OSD container won't boot up

2022-11-29 Thread J-P Methot
Hi, I've been testing the cephadm upgrade process in my staging environment and I'm running into an issue where the docker container just doesn't boot up anymore. This is an Octopus to Pacific 16.2.10 upgrade, and I expect to upgrade to Quincy afterwards. This is also running on Ubuntu
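
For context, a cephadm-managed upgrade of this kind is normally driven by the orchestrator; a minimal sketch using the target version from this thread:

    # Start the staged upgrade to the target release
    ceph orch upgrade start --ceph-version 16.2.10

    # Monitor, pause, or abort it
    ceph orch upgrade status
    ceph orch upgrade pause
    ceph orch upgrade stop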

[ceph-users] Re: Freak issue every few weeks

2022-09-23 Thread J-P Methot
/23/22 15:22, J-P Methot wrote: Thank you for your reply, discard is not enabled in our configuration as it is mainly the default conf. Are you suggesting to enable it? No. There is no consensus on whether enabling it is a good idea (depends on proper implementation among other things). From my

[ceph-users] Re: Freak issue every few weeks

2022-09-23 Thread J-P Methot
Thank you for your reply, discard is not enabled in our configuration as it is mainly the default conf. Are you suggesting to enable it? On 9/22/22 14:20, Stefan Kooman wrote: Just guessing here: have you configured "discard": bdev enable discard bdev async discard We've seen monitor slow
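
Checking (or, if one chose to, enabling) those options would look roughly like this; as the thread itself notes, enabling discard is not a consensus recommendation:

    # Current values
    ceph config get osd bdev_enable_discard
    ceph config get osd bdev_async_discard

    # Enabling them cluster-wide, only if you decide it is appropriate
    ceph config set osd bdev_enable_discard true
    ceph config set osd bdev_async_discard true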

[ceph-users] Freak issue every few weeks

2022-09-22 Thread J-P Methot
Hi, We've been running into a mysterious issue on Ceph 16.2.7. Every few weeks or so (anywhere from two weeks to a month and a half), we get input/output errors on a random OSD. Here are the logs: 2022-09-22T15:54:11.600Z    syslog    debug -6> 2022-09-22T15:41:05.678+ 7fec2ebaa080 -1

[ceph-users] MTU mismatch error in Ceph dashboard

2021-08-04 Thread J-P Methot
Hi, We're running Ceph 16.2.5 Pacific and, in the ceph dashboard, we keep getting an MTU mismatch alert. However, all our hosts have the same network configuration: => bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000 => vlan.24@bond0: mtu 9000 qdisc noqueue state UP group
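
A quick way to compare MTUs across hosts is to dump them per interface on every node; a minimal sketch:

    # Interface name and MTU, one line per interface
    ip -o link show | awk '{print $2, $4, $5}'

It may also be worth including interfaces that are down or unused in the comparison, since those can carry a different MTU from the bonded interfaces shown above.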

[ceph-users] Re: Lost data from a RBD while client was not connected

2021-08-04 Thread J-P Methot
I'm replying to my own message as it appears we have "fixed" the issue. Basically, we restarted all OSD hosts and all the presumed lost data reappeared. It's likely that some OSDs were stuck unreachable but were somehow never flagged as such in the cluster. On 8/3/21 8:15 PM, J-P Me

[ceph-users] Lost data from a RBD while client was not connected

2021-08-03 Thread J-P Methot
Hi, We've encountered this issue on Ceph Pacific, with an OpenStack Wallaby cluster hooked to it. Essentially, we're slowly pushing this setup into production, so we're testing it and encountered this oddity. My colleague wanted to do some network redundancy tests, so he manually shut down a