Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-27 Thread Willem Jan Withagen
On 26/09/2018 12:41, Eugen Block wrote: Hi, I'm not sure how the recovery "still works" with the flag "norecover". Anyway, I think you should unset the flags norecover and nobackfill. Even if not all OSDs come back up, you should allow the cluster to backfill PGs. Not sure, but unsetting

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread KEVIN MICHAEL HRPCEK
Hey, don't lose hope. I just went through two 3-5 day outages after a Mimic upgrade with no data loss. I'd recommend looking through the thread about it to see how close it is to your issue. From my point of view there seem to be some similarities.

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block
Hi, I'm not sure how the recovery "still works" with the flag "norecover". Anyway, I think you should unset the flags norecover and nobackfill. Even if not all OSDs come back up, you should allow the cluster to backfill PGs. Not sure, but unsetting norebalance could also be useful, but that
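For reference, the flags mentioned above are cleared with the standard ceph CLI; a minimal sketch, run from a node with an admin keyring:

    # allow recovery and backfill activity again
    ceph osd unset norecover
    ceph osd unset nobackfill
    # optionally, as suggested, also allow rebalancing
    ceph osd unset norebalance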

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread by morphin
Hello Eugen. Thank you for your answer. I was losing hope of getting an answer here. I have faced losing 2/3 of the mons many times, but I never faced any problem like this on Luminous. The recovery is still running and it has been 30 hours. The last state of my cluster is:

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block
Hi, could this be related to this other Mimic upgrade thread [1]? Your failing MONs sound a bit like the problem described there; eventually the user reported a successful recovery. You could try the described steps: - disable cephx auth with 'auth_cluster_required = none' - set the
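A minimal sketch of the first of those steps, assuming the setting goes into the [global] section of ceph.conf on the MON and OSD hosts and the daemons are restarted afterwards (the remaining steps are in the referenced thread [1]):

    # ceph.conf -- temporarily disable cephx authentication inside the cluster
    [global]
        auth_cluster_required = none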

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
After trying so many things with a lot of help on IRC, my pool health is still in ERROR and I think I can't recover from this. https://paste.ubuntu.com/p/HbsFnfkYDT/ In the end, 2 of 3 mons crashed and started at the same time, and the pool went offline. Recovery takes more than 12 hours and it is way

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
Hi, the cluster is still down :( Up to now we have managed to stabilize the OSDs. 118 of 160 OSDs are stable and the cluster is still in the process of settling. Thanks to Be-El in the ceph IRC channel, who helped a lot to make the flapping OSDs stable. What we have learned so far is that this is

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Eugen Block
I would try to reduce recovery to a minimum; something like this helped us in a small cluster (25 OSDs on 3 hosts) to keep operations running without impact during recovery: ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2' ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
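Written out, the two runtime overrides quoted above are (values are the ones reported for that 25-OSD cluster; injected settings do not survive an OSD restart):

    # allow at most 2 active recovery operations per OSD
    ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2'
    # allow up to 8 concurrent backfills per OSD
    ceph tell 'osd.*' injectargs '--osd-max-backfills 8'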

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
Reducing the recovery parameter values did not change much. There are still a lot of OSDs marked down. I don't know what to do from this point on. [osd] osd recovery op priority = 63 osd client op priority = 1 osd recovery max active = 1 osd max scrubs = 1 ceph -s cluster: id:
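As a side note: since the cluster is on Mimic, the reduced values could also be applied at runtime through the monitor config database instead of editing ceph.conf and restarting OSDs; a sketch, not taken from the thread:

    # apply the reduced recovery settings from the [osd] section above at runtime
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_max_scrubs 1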

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Now you also have PGs in 'creating' state. Creating PGs is a very IO-intensive operation. To me, nothing special is going on there - recovery + deep scrubbing + creating PGs results in the expected degradation of performance. September 25, 2018 2:32 PM, "by morphin" wrote: > 29 creating+down > 4

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
The config didn't work; increasing the numbers led to more OSD drops. ceph -s cluster: id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106 health: HEALTH_ERR norebalance,norecover flag(s) set 1 osds down 17/8839434 objects unfound (0.000%)

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Settings that heavily affect recovery performance are: osd_recovery_sleep osd_recovery_sleep_[hdd|ssd] See this for details: http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ September 25, 2018 1:57 PM, "by morphin" wrote: > Thank you for answer > > What do you think the
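As an illustration (the values are examples, not taken from the thread), these sleep settings can be adjusted at runtime with injectargs; larger values slow recovery down to protect client IO, 0 disables the throttle:

    # per-op recovery sleep for HDD-backed OSDs, in seconds
    ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hdd 0.1'
    # SSD-backed OSDs default to no sleep
    ceph tell 'osd.*' injectargs '--osd_recovery_sleep_ssd 0'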

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
Thank you for the answer. What do you think of this conf to speed up the recovery? [osd] osd recovery op priority = 63 osd client op priority = 1 osd recovery max active = 16 osd max scrubs = 16 The user at that address wrote on Tue, 25 Sep 2018 at 13:37: > > Just let it recover. > > data: >
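Laid out as a ceph.conf fragment, the proposed settings are shown below; note that 63/1 inverts the default priorities (3 for recovery, 63 for client IO) so recovery is favoured over client traffic, and another message in the thread reports dropping the max active and max scrubs values back to 1:

    [osd]
        # favour recovery traffic over client IO (defaults: 3 / 63)
        osd recovery op priority = 63
        osd client op priority = 1
        # aggressive parallelism; reduced back to 1 elsewhere in the thread
        osd recovery max active = 16
        osd max scrubs = 16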

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Caspar Smit
You can set: *osd_scrub_during_recovery = false* and in addition maybe set the noscrub and nodeep-scrub flags to let it settle. Kind regards, Caspar On Tue, 25 Sep 2018 at 12:39, Sergey Malinin wrote: > Just let it recover. > > data: > pools: 1 pools, 4096 pgs > objects: 8.95 M
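A minimal sketch combining both suggestions; the injectargs form is one way to apply the option without restarting OSDs and is an assumption, not something stated in the message:

    # stop new scrubs cluster-wide while recovery settles
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # and disable scrubbing during recovery on the running OSDs
    ceph tell 'osd.*' injectargs '--osd_scrub_during_recovery=false'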

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Sergey Malinin
Just let it recover. data: pools: 1 pools, 4096 pgs objects: 8.95 M objects, 17 TiB usage: 34 TiB used, 577 TiB / 611 TiB avail pgs: 94.873% pgs not active 48475/17901254 objects degraded (0.271%) 1/8950627 objects unfound (0.000%)

[ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread by morphin
Hello. Half an hour ago, 7 of my 28 servers crashed (because of corosync! "2.4.4-3") and 2 of them were MONs; I have 3 MONs in my cluster. After they came back, I saw high disk utilization because of the ceph-osd processes. My whole cluster is not responding right now! All of my OSDs are consuming