Reminds me of https://tracker.ceph.com/issues/57007, which wasn't fixed in
Pacific until 16.2.11, so this is probably just the result of a cephadm bug,
unfortunately.
On Fri, Jun 23, 2023 at 5:16 PM Malte Stroem wrote:
> Hello Eugen,
>
> thanks.
>
> We found the cause.
>
> Somehow all
>
> /var/lib
Oh, okay. I believe there was a thread reporting something very
similar as well some time ago. I don’t remember the details but having
outdated information on the OSDs was part of it. Were the nodes you
removed also MON nodes?
But it’s great that you found the root cause.
Quoting Malte St
Hello Eugen,
thanks.
We found the cause.
Somehow all
/var/lib/ceph/fsid/osd.XX/config
files on every host were still filled with expired information about the
mons.
So refreshing the files helped to bring the osds up again. Damn.
All other configs for the mons, mds', rgws and so on were u
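
For anyone hitting the same symptom, here is a minimal sketch of how one might spot per-OSD config files that still list old monitors. It assumes the files are in the usual ceph.conf style with a `mon_host` (or `mon host`) line and that the monitors use IPv4 addresses. The function names are hypothetical, not part of Ceph or cephadm, and the caller has to supply both the list of config paths and the set of current monitor IPs (for example, taken from `ceph mon dump`).

```python
import re
from pathlib import Path

# Assumption: monitors are addressed by IPv4; IPv6 is not handled in this sketch.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mon_ips_in_config(text):
    """Return the set of IPv4 addresses found on the mon_host line."""
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and section headers such as [global].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Ceph accepts both "mon host" and "mon_host" as the option name.
        if key.strip().replace(" ", "_") == "mon_host":
            return set(IPV4.findall(value))
    return set()


def stale_configs(config_paths, current_mon_ips):
    """Map each config path to the mon IPs it lists that are not current."""
    current = set(current_mon_ips)
    stale = {}
    for path in config_paths:
        unknown = mon_ips_in_config(Path(path).read_text()) - current
        if unknown:
            stale[str(path)] = unknown
    return stale
```

Any path that shows up in the result of `stale_configs` still points at a monitor address that is not in the current set and is a candidate for refreshing. The sketch only reads files; it does not rewrite them.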
On 6/21/23 11:20, Malte Stroem wrote:
Hello Eugen,
recovery and rebalancing finished; however, all PGs now show missing
OSDs.
Everything looks like the PGs are missing OSDs although it finished
correctly.
As if we shut down the servers immediately.
But we removed the nodes the way it is
I still can’t really grasp what might have happened here. But could
you please clarify which of the down OSDs (or Hosts) are supposed to
be down and which you’re trying to bring back online? Obviously osd.40
is one of your attempts. But what about the hosts cephx01 and cephx08?
Are those th
Hello Eugen,
recovery and rebalancing finished; however, all PGs now show missing OSDs.
Everything looks like the PGs are missing OSDs although it finished
correctly.
As if we shut down the servers immediately.
But we removed the nodes the way it is described in the documentation.
We just
Hi,
Yes, we drained the nodes. It needed two weeks to finish the
process, and yes, I think this is the root cause.
So we still have the nodes but when I try to restart one of those
OSDs it still cannot join:
if the nodes were drained successfully (can you confirm that all PGs
were active+
Hello Eugen,
thank you. Yesterday I thought: Well, Eugen can help!
Yes, we drained the nodes. It needed two weeks to finish the process,
and yes, I think this is the root cause.
So we still have the nodes but when I try to restart one of those OSDs
it still cannot join:
Jun 21 09:46:03 cep
Hi,
can you share more details about what exactly you did? How did you remove
the nodes? Hopefully, you waited for the draining to finish? But if
the remaining OSDs wait for removed OSDs it sounds like the draining
was not finished.
Quoting Malte Stroem:
Hello,
we removed some nodes from o