[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-24 Thread Adam King
Reminds me of https://tracker.ceph.com/issues/57007, which wasn't fixed in Pacific until 16.2.11, so this is probably just the result of a cephadm bug, unfortunately.

On Fri, Jun 23, 2023 at 5:16 PM Malte Stroem wrote:
> Hello Eugen,
>
> thanks.
>
> We found the cause.
>
> Somehow all
>
> /var/lib

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-24 Thread Eugen Block
Oh, okay. I believe there was a thread reporting something very similar some time ago. I don’t remember the details, but having outdated information on the OSDs was part of it. Were the nodes you removed also MON nodes? But it’s great that you found the root cause. Quoting Malte St

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-23 Thread Malte Stroem
Hello Eugen, thanks. We found the cause. Somehow all /var/lib/ceph/fsid/osd.XX/config files on every host still contained outdated information about the mons. So refreshing the files helped to bring the OSDs up again. Damn. All other configs for the mons, mds', rgws and so on were u

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Stefan Kooman
On 6/21/23 11:20, Malte Stroem wrote: Hello Eugen, recovery and rebalancing finished; however, all PGs now show missing OSDs. It looks as if the PGs are missing OSDs although it finished correctly, as if we had shut down the servers immediately. But we removed the nodes the way it is

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block
I still can’t really grasp what might have happened here. But could you please clarify which of the down OSDs (or Hosts) are supposed to be down and which you’re trying to bring back online? Obviously osd.40 is one of your attempts. But what about the hosts cephx01 and cephx08? Are those th

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Malte Stroem
Hello Eugen, recovery and rebalancing finished; however, all PGs now show missing OSDs. It looks as if the PGs are missing OSDs although it finished correctly, as if we had shut down the servers immediately. But we removed the nodes the way it is described in the documentation. We just

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block
Hi,

> Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause. So we still have the nodes but when I try to restart one of those OSDs it still cannot join:

if the nodes were drained successfully (can you confirm that all PGs were active+

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Malte Stroem
Hello Eugen, thank you. Yesterday I thought: Well, Eugen can help! Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause. So we still have the nodes but when I try to restart one of those OSDs it still cannot join: Jun 21 09:46:03 cep

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block
Hi, can you share more details about what exactly you did? How did you remove the nodes? Hopefully, you waited for the draining to finish? But if the remaining OSDs are waiting for removed OSDs, it sounds like the draining was not finished. Quoting Malte Stroem: Hello, we removed some nodes from o