[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-12 Thread Alexander E. Patrakov
Fri, Oct 7, 2022 at 19:50, Frank Schilder : > For the interested future reader, we have subdivided 400G high-performance SSDs into 4x100G OSDs for our FS metadata pool. The increased concurrency improves performance a lot. But yes, we are on the edge. OMAP+META is almost 50%. Please be
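For reference, carving one SSD into several OSDs like this is typically done at deployment time with ceph-volume; a minimal sketch, assuming an illustrative device path /dev/sdX:

  # create 4 equally sized OSDs on a single 400G device (device path is illustrative)
  ceph-volume lvm batch --osds-per-device 4 /dev/sdX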

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-12 Thread Frank Schilder
From: Szabo, Istvan (Agoda) Sent: 07 October 2022 14:28 To: Frank Schilder Cc: Igor Fedotov; ceph-users@ceph.io Subject: RE: [ceph-users] Re: OSD crashes during upgrade mimic->octopus Finally, how is your PG distribution? How many PGs per disk? Istvan Sza
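For readers wanting to check the same thing on their own cluster, the PGS column of ceph osd df answers the PGs-per-disk question:

  # per-OSD usage and PG count; the PGS column shows placement groups per OSD
  ceph osd df tree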

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Szabo, Istvan (Agoda)
Schilder Sent: Friday, October 7, 2022 6:50 PM To: Igor Fedotov; ceph-users@ceph.io Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus Hi all, try

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
crashes during upgrade mimic->octopus Hi Igor, I added a sample of OSDs on identical disks. The usage is quite well balanced, so the numbers I included are representative. I don't believe that we had one such extreme outlier. Maybe it ran full during conversion. Most of the data is OMAP after

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
Just FYI: standalone ceph-bluestore-tool's quick-fix behaves pretty similarly to the action performed on start-up with bluestore_fsck_quick_fix_on_mount = true On 10/7/2022 10:18 AM, Frank Schilder wrote: Hi Stefan, super thanks! I found a quick-fix command in the help output: #
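A minimal sketch of that standalone invocation against a stopped OSD (OSD id and systemd unit are illustrative and assume a non-containerized deployment):

  systemctl stop ceph-osd@16
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-16 quick-fix
  systemctl start ceph-osd@16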

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
For format updates one can use the quick-fix command instead of repair; it might work a bit faster... On 10/7/2022 10:07 AM, Stefan Kooman wrote: On 10/7/22 09:03, Frank Schilder wrote: Hi Igor and Stefan, thanks a lot for your help! Our cluster is almost finished with recovery and I would like

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
Hi Frank, one more thing I realized during the night :) When performing the conversion the DB gets a significant amount of new data (approx. on par with the original OMAP volume) without the old data being immediately removed. Hence one should expect the DB size to grow dramatically at this point. Which should
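One way to watch this growth while the conversion runs is the OMAP/META columns of ceph osd df, or the bluefs counters of a running OSD; a sketch with an illustrative OSD id:

  # OMAP and META columns per OSD
  ceph osd df
  # db_total_bytes / db_used_bytes for a running OSD (run on the OSD host)
  ceph daemon osd.16 perf dump bluefs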

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Stefan, super thanks! I found a quick-fix command in the help output: # ceph-bluestore-tool -h [...] Positional options: --command arg fsck, repair, quick-fix, bluefs-export, bluefs-bdev-sizes, bluefs-bdev-expand,

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Igor and Stefan, thanks a lot for your help! Our cluster is almost finished with recovery and I would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's mails I could find the command for manual compaction: ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}"
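The preview cuts the command off; based on the ceph-kvstore-tool documentation, the full offline compaction call would presumably be (run only against a stopped OSD):

  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-${OSD_ID} compact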

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
From: Frank Schilder Sent: 07 October 2022 01:53:20 To: Igor Fedotov; ceph-users@ceph.io Subject: [ceph-users] Re: OSD crashes during upgrade mimic->octopus Hi Igor, I added a sample of OSDs on identical disks. The usage is quite well balanced, so the numbers I inclu

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
well, I've just realized that you're apparently unable to collect these high-level stats for broken OSDs, aren't you? But if that's the case you shouldn't make any assumption about faulty OSDs utilization from healthy ones - it's definitely a very doubtful approach ;) On 10/7/2022 2:19

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
The log I inspected was for osd.16, so please share that OSD's utilization... And honestly, I trust the allocator's stats more, so if anything it's rather the CLI stats that are incorrect. Anyway, a free-dump should provide additional proof... And once again - do other non-starting OSDs show the same ENOSPC error? 
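For reference, the free-dump Igor mentions is a ceph-bluestore-tool subcommand run against a stopped OSD; a sketch, OSD id illustrative:

  # dump the allocator's free extents (and a fragmentation score) for offline inspection
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-16 free-dump > osd.16-free-dump.json
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-16 free-score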

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, I suspect there is something wrong with the data reported. These OSDs are only 50-60% used. For example: ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME 29 ssd 0.09099 1.0 93

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Hi Frank, the abort message "bluefs enospc" indicates a lack of free space for additional bluefs space allocations, which prevents the OSD from starting up. From the following log line one can see that bluefs needs ~1M more space while the total available is approx 622M. The problem is that bluefs
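When the underlying partition or LV actually has spare room, the help output quoted further up lists subcommands to inspect and grow the bluefs devices; a hedged sketch (this only helps if the device itself was enlarged first, OSD path illustrative):

  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-16 bluefs-bdev-sizes
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-16 bluefs-bdev-expand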

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the show. I collected its startup log here: https://pastebin.com/25D3piS6 . The line sticking out is line 603:

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan and anyone else reading this, we are probably misunderstanding each other here: > There is a strict MDS maintenance dance you have to perform [1]. > ... > [1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/ Our ceph fs shut-down was *after* completing the upgrade to octopus, *not
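For readers following the linked document: the core of that maintenance dance is reducing the file system to a single active MDS before restarting the MDS daemons, roughly:

  # note the current number of active MDS daemons, then reduce to one
  ceph fs get <fs_name> | grep max_mds
  ceph fs set <fs_name> max_mds 1
  # upgrade/restart the MDS daemons, then restore the original max_mds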

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan, to answer your question as well: > ... conversion from octopus to pacific, and the resharding as well). We would save half the time by compacting them beforehand. It would take, in our case, many hours to do a conversion, so it would pay off immensely. ... With experiments on

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor. > But could you please share full OSD startup log for any one which is unable to restart after host reboot? Will do. I also would like to know what happened here and if it is possible to recover these OSDs. The rebuild takes ages with the current throttled recovery settings. >

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Sorry - no clue about CephFS related questions... But could you please share full OSD startup log for any one which is unable to restart after host reboot? On 10/6/2022 5:12 PM, Frank Schilder wrote: Hi Igor and Stefan. Not sure why you're talking about replicated(!) 4(2) pool. Its
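On a non-containerized systemd deployment, the startup log Igor asks for can be pulled with journalctl, e.g. (OSD id illustrative):

  journalctl -u ceph-osd@16 --since "1 hour ago" > osd.16-startup.log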

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 16:12, Frank Schilder wrote: Hi Igor and Stefan. Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I have to

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor and Stefan. > Not sure why you're talking about replicated(!) 4(2) pool. > It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I have to take this one back. It is indeed

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
On 10/6/2022 3:16 PM, Stefan Kooman wrote: On 10/6/22 13:41, Frank Schilder wrote: Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Are the crashing OSDs still bound to two hosts? If not - does any dead OSD unconditionally mean its underlying disk is no longer available? On 10/6/2022 3:35 PM, Frank Schilder wrote: Hi Igor. Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor. > Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I just lost another disk, we have PGs down now. I really hope the

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
On 10/6/2022 2:55 PM, Frank Schilder wrote: Hi Igor, it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly. Got it. By the way, there are 2 out of 4 OSDs

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 13:41, Frank Schilder wrote: Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, these OSDs

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly. By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even
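Whether PGs stay active with only 2 of 4 replicas up depends on the pool's min_size; for readers wanting to check the same thing (pool name illustrative):

  ceph osd pool get <pool_name> size
  ceph osd pool get <pool_name> min_size
  # list the PGs that are currently stuck inactive
  ceph pg dump_stuck inactive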

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
From your response to Stefan I'm getting that one of two damaged hosts has all OSDs down and unable to start. Is that correct? If so you can reboot it with no problem and proceed with manual compaction [and other experiments] quite "safely" for the rest of the cluster. On 10/6/2022 2:35 PM,

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, these OSDs completed conversion and rebooted, I would
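The per-host switch described here corresponds to a ceph.conf fragment along these lines on the OSD host (a sketch of the approach, not the exact file used):

  [osd]
  bluestore_fsck_quick_fix_on_mount = true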

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, I can't access these drives. They have an OSD- or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well. I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly. After we have full redundancy

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
IIUC the OSDs that expose "had timed out after 15" are failing to start up. Is that correct, or did I miss something? I meant trying compaction for them... On 10/6/2022 2:27 PM, Frank Schilder wrote: Hi Igor, thanks for your response. And what's the target Octopus release? ceph version

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 13:06, Frank Schilder wrote: Hi all, we are stuck with a really unpleasant situation and we would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, thanks for your response. > And what's the target Octopus release? ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
And what's the target Octopus release? On 10/6/2022 2:06 PM, Frank Schilder wrote: Hi all, we are stuck with a really unpleasant situation and we would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with