Hi,
We are still running Ceph Pacific with cephadm and we have run into a
peculiar issue. When we run the `cephadm shell` command on monitor1, the
container we get runs Ceph 16.2.9. However, when we run the same command
on monitor2, the container runs 16.2.15, which is the current version of
Pacific. [...] or was something for BlueStore
changed?
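A quick way to compare what each host's shell actually runs, and to pin the image explicitly (the tag below is an assumption, not necessarily the one your cluster should use):

  # Print the Ceph version inside the shell container, on each monitor host:
  cephadm shell -- ceph --version

  # Pin the shell to a specific image so every host matches:
  cephadm shell --image quay.io/ceph/ceph:v16.2.15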
You can check this via `ceph config diff`
As Mark said, it would be nice to have a tracker, if this really
is a release problem.
Thanks,
k
Sent from my iPhone
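For reference, a minimal sketch of the diff check mentioned above, here in its admin-socket form run on the host where the daemon lives (the daemon name is hypothetical):

  # Show only the options that differ from their compiled-in defaults:
  ceph daemon mon.monitor1 config diff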
On 7 Sep 2023, at 20:22, J-P Methot wrote:
We went from 16.2.13 to 16.2.14.
Also, the timeout is 15 seconds.
[...] to significant allocation fragmentation. That got fixed,
but I wouldn't be surprised if we have some other sub-optimal
behaviors we don't know about.
Mark
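For anyone who wants to check for this on their own OSDs, the allocator exposes a fragmentation score over the admin socket; a minimal sketch (the OSD id is hypothetical):

  # Score ranges from 0 (no fragmentation) to 1 (completely fragmented):
  ceph daemon osd.0 bluestore allocator score block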
On 9/7/23 12:28, J-P Methot wrote:
Hi,
By this point, we're 95% sure that, contrary to our previous beliefs,
it's an issue with changes
[...] but even with testing
on NVMe clusters this was quite low (maybe a couple of percent).
Mark
On 9/7/23 10:21, J-P Methot wrote:
Hi,
Since my post, we've been speaking with a member of the Ceph dev
team. He did, at first, believe it was an issue linked to the common
performance degradation [...]. I'll post the actual
resolution when we confirm 100% that it works.
On 9/7/23 12:18, Konstantin Shalygin wrote:
Hi,
On 7 Sep 2023, at 18:21, J-P Methot wrote:
Since my post, we've been speaking with a member of the Ceph dev
team. He did, at first, believe it was an issue linked to the common
performance degradation. [...] did not appear to change anything.
Deactivating scrubs altogether did not impact performance in any way.
Furthermore, I'll stress that this is only happening since we upgraded
to the latest Pacific, yesterday.
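For completeness, deactivating scrubs cluster-wide as described above is normally done with the scrub flags; a minimal sketch:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # And to re-enable once testing is done:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub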
On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,
We're talking about automatic online compaction here, not running the
compaction command manually.
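For contrast, the manual form being distinguished here looks like this (the OSD id is hypothetical); the automatic online compaction is what RocksDB schedules itself during normal writes:

  # Ask a single OSD to compact its RocksDB now:
  ceph tell osd.0 compact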
On 9/7/23 04:04, Konstantin Shalygin wrote:
Hi,
On 7 Sep 2023, at 10:05, J-P Methot wrote:
Hi,
We're running latest Pacific on our production cluster and we've been
seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
after 15.00954s' error. We have reasons to believe this happens each
time the RocksDB compaction process is launched on an OSD. My question [...]
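If compaction is indeed the trigger, one short-term mitigation is raising the threadpool warning timeout; this treats the symptom rather than the cause, so take it as a sketch only:

  # Default is 15 seconds; the hard "suicide" timeout (default 150 s) is separate:
  ceph config set osd osd_op_thread_timeout 30
  ceph config get osd osd_op_thread_suicide_timeout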
Hi,
We currently use Ceph Pacific 16.2.10, deployed with cephadm, on this
storage cluster. Last night, one of our OSDs died. Since its backing
storage is an SSD, we ran hardware checks and found no issue with the SSD
itself. However, if we try starting the service again, the container
just [...]
There's nothing in the CPU graph that suggests soft lock-ups at these
times. However, thank you for pointing out that the disk io scheduler
could have an impact. Ubuntu seems to be on mq-deadline by default, so
we just switched to none, as I believe it fits our workload best. I
don't know if [...]
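For reference, a minimal sketch of checking and switching the scheduler (the device name is hypothetical; the change does not persist across reboots):

  # The active scheduler is shown in brackets:
  cat /sys/block/sda/queue/scheduler
  echo none > /sys/block/sda/queue/scheduler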
[...] are you sharing NICs between the public and replication networks? That is
another metric that needs looking into.
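A quick way to answer that question is to compare the configured networks against the host's interfaces; a minimal sketch (the subnets returned are site-specific):

  ceph config get mon public_network
  ceph config get mon cluster_network
  # Then map each subnet to its interface:
  ip -o addr show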
*From:* J-P Methot
*Sent:* 18 January 2023 12:42
*To:* ceph-users
*Subject:* [ceph-users] Flapping OSDs on pacific 16.2.10
Hi,
We have a full SSD production cluster running on Pacific 16.2.10 and
deployed with cephadm that is experiencing OSD flapping issues.
Essentially, random OSDs will get kicked out of the cluster and then
automatically brought back in a few times a day. As an example, let's
take the case of
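While investigating flapping like this, a common stop-gap is to keep the monitors from marking OSDs down or out automatically; a minimal sketch (remember to unset both flags afterwards):

  ceph osd set nodown
  ceph osd set noout
  # ...investigate heartbeats and the network, then:
  ceph osd unset nodown
  ceph osd unset noout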
and the bluestore tools repair can't fix it.
Is there a way to fix this, or is this completely unrepairable?
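For reference, the fsck/repair invocations being referred to look roughly like this (the OSD path is hypothetical, and the OSD must be stopped first):

  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0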
On 11/29/22 13:57, J-P Methot wrote:
Hi,
I've been testing the cephadm upgrade process in my staging
environment and I'm running into an issue where the docker container
just doesn't boot
Hi,
I've been testing the cephadm upgrade process in my staging environment
and I'm running into an issue where the docker container just doesn't
boot up anymore. This is an Octopus to Pacific 16.2.10 upgrade and I
expect to upgrade to Quincy afterwards. This is also running on Ubuntu [...]
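A minimal sketch of pulling the logs for a container that won't start (the daemon name and fsid are hypothetical):

  # Through cephadm:
  cephadm logs --name osd.3
  # Or directly via systemd on the host:
  journalctl -u ceph-<fsid>@osd.3.service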
On [...]/23/22 15:22, J-P Methot wrote:
Thank you for your reply,
discard is not enabled in our configuration as it is mainly the
default conf. Are you suggesting to enable it?
No. There is no consensus if enabling it is a good idea (depends on
proper implementation among other things). From my [...]
Thank you for your reply,
discard is not enabled in our configuration as it is mainly the default
conf. Are you suggesting to enable it?
On 9/22/22 14:20, Stefan Kooman wrote:
Just guessing here: have you configured "discard":
bdev enable discard
bdev async discard
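For reference, a minimal sketch of setting those options at runtime rather than in ceph.conf (whether enabling them is wise is exactly the open question here, and the OSDs may need a restart for the change to take effect):

  ceph config set osd bdev_enable_discard true
  ceph config set osd bdev_async_discard true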
We've seen monitor slow [...]
Hi,
We've been running into a mysterious issue on Ceph 16.2.7. Every few
weeks or so (can be from 2 weeks to a month and a half), we get
input/output errors on a random OSD. Here's the logs :
2022-09-22T15:54:11.600Z syslog debug -6>
2022-09-22T15:41:05.678+ 7fec2ebaa080 -1 [...]
Hi,
We're running Ceph 16.2.5 Pacific and, in the ceph dashboard, we keep
getting a MTU mismatch alert. However, all our hosts have the same
network configuration:
bond0: mtu 9000 qdisc noqueue state UP group default qlen 1000
vlan.24@bond0: mtu 9000 qdisc noqueue state UP group [...]
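The alert fires when interface MTUs differ across hosts, so a manual cross-check on every node helps find the outlier; a minimal sketch:

  # Any stray 1500 among the 9000s is the mismatch the dashboard is flagging:
  ip -o link show | awk '{print $2, $4, $5}'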
I'm replying to my own message as it appears we have "fixed" the issue.
Basically, we restarted all OSD hosts and all the presumed lost data
reappeared. It's likely that some OSDs were stuck unreachable but were
somehow never flagged as such in the cluster.
On 8/3/21 8:15 PM, J-P Methot wrote:
Hi,
We've encountered this issue on Ceph Pacific, with an Openstack Wallaby
cluster hooked to it. Essentially, we're slowly pushing this setup into
production so we're testing it and encountered this oddity. My colleague
wanted to do some network redundancy tests, so he manually shut down a [...]