I have an 8-node cluster with old hardware. A week ago 4 nodes went down and
the Ceph cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync,
so I reduced the number of mons to one, and mgrs to one as well.
Now recovery starts with 100% unknown PGs, and then the PGs start
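In case it helps others following along, here is a minimal sketch of how one might shrink to a single mon/mgr and watch PG states while peering recovers; the host/daemon names and the assumption of a cephadm-managed cluster are mine, not from the original post:

  # Assuming a cephadm-managed cluster; host names are placeholders.
  ceph orch apply mon --placement="1 node1"   # pin a single monitor on node1
  ceph orch apply mgr --placement="1 node1"   # pin a single manager on node1
  # On a non-cephadm cluster, dead monitors can instead be dropped from the monmap:
  ceph mon remove node2
  # Watch PG states while peering/recovery proceeds:
  ceph -s
  ceph pg stat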
There is some pretty strange compaction behavior happening in these
logs. For instance, in osd.0 we see an L1 compaction of the O-1 column
family (CF) that's taking ~204 seconds:
2023-09-21T20:03:59.378+ 7f16a286c700 4 rocksdb: (Original Log Time
2023/09/21-20:03:59.381808) EVENT_LOG_v1 {"time_micros":
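(Not from the original post, but if you want to quantify these compaction stalls, the OSD's RocksDB perf counters can be dumped over the admin socket; the osd id and the "rocksdb" filter below are assumptions about your setup and version.)

  # Dump RocksDB perf counters for osd.0, which include compaction time/latency:
  ceph daemon osd.0 perf dump rocksdb
  # If the logger filter isn't accepted on your release, dump everything:
  ceph daemon osd.0 perf dump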
Hi Milind, Team,
Thank you for the response @Milind.
>>Snap-schedule no longer accepts a --subvol argument,
Thank you for the information.
Currently, we are using the following commands to create the snap-schedules:
Syntax:
*"ceph fs snap-schedule add ///
"*
*"ceph fs snap-schedule retention add
Hi Venky and cephers,
Thanks for the reply.
No config changes had been made before the issues occurred; we suspect a
client bug. Please see the following message about the accumulation of log
segments waiting to be trimmed. For the moment the problematic client nodes
cannot be rebooted. Evicting the client will
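(If eviction is the route you end up taking, a minimal sketch of identifying and evicting a client via the MDS follows; the MDS rank and client id are placeholders, not from your cluster.)

  # List sessions on the active MDS (rank 0 here is an assumption):
  ceph tell mds.0 client ls
  # Evict a specific client by id once the problematic one is identified:
  ceph tell mds.0 client evict id=12345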
Hi Sudhin,
It looks like manual DB compactions are being issued (periodically?) via the
admin socket for your OSDs, which, per my working hypothesis, triggers DB
access stalls.
Here are the log lines indicating such calls:
debug 2023-09-22T11:24:55.234+ 7fc4efa20700 1 osd.1 1192508
triggering manual
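(For context, that "triggering manual" line appears to be what an OSD prints when it receives a compact request. A quick way to check whether something on the host issues it on a schedule is sketched below; the osd id and paths are just examples.)

  # The admin-socket / tell call that triggers a manual compaction on an OSD:
  ceph daemon osd.1 compact        # or: ceph tell osd.1 compact
  # Look for cron jobs or systemd units on the OSD host that might run it periodically:
  grep -r compact /etc/cron* /etc/systemd/system 2>/dev/null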
On Wed, 27 Sep 2023 at 15:32, Nicolas FONTAINE wrote:
> Hi everyone,
> Is there a way to specify which MGR and which MDS should be the active one?
>
At least for the mgr, you can just fail over until it lands on the one you
want it to be running on.
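A sketch of what that looks like in practice (daemon names are placeholders):

  ceph mgr stat      # shows which mgr daemon is currently active
  ceph mgr fail      # fail the active mgr so a standby takes over
  # (On older releases the active mgr's name must be given: ceph mgr fail <name>.)
  # Repeat until the standby you want ends up active.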