On 26/09/2018 12:41, Eugen Block wrote:
Hi,
I'm not sure how the recovery "still works" with the flag "norecover".
Anyway, I think you should unset the flags norecover, nobackfill. Even
if not all OSDs come back up you should allow the cluster to backfill
PGs. Not sure, but unsetting
Hey, don't lose hope. I just went through two 3-5 day outages after a Mimic
upgrade with no data loss. I'd recommend looking through the thread about it to
see how close it is to your issue. From my point of view there seem to be some
similarities.
Hi,
I'm not sure how the recovery "still works" with the flag "norecover".
Anyway, I think you should unset the flags norecover, nobackfill. Even
if not all OSDs come back up you should allow the cluster to backfill
PGs. Not sure, but unsetting norebalance could also be useful, but
that
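For reference, unsetting those flags is one command per flag (standard ceph CLI,
nothing cluster-specific):
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance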
Hello Eugen. Thank you for your answer. I was losing hope of getting
an answer here.
I have faced losing 2/3 of my MONs many times, but I never had a
problem like this on Luminous.
The recovery is still running and it has been 30 hours. The latest state of
my cluster is:
Hi,
could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery success. You could try the described steps:
- disable cephx auth with 'auth_cluster_required = none'
- set the
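A minimal sketch of the ceph.conf change mentioned there (assuming it goes in
the [global] section on the MON hosts, and should be reverted once the cluster
is healthy again):
[global]
auth_cluster_required = none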
After trying so many things with a lot of help on IRC, my cluster
health is still in ERROR and I think I can't recover from this.
https://paste.ubuntu.com/p/HbsFnfkYDT/
In the end, 2 of 3 MONs crashed and restarted at the same time and the pool
went offline. Recovery has taken more than 12 hours and it is way
Hi,
Cluster is still down :(
Up to now we have managed to stabilize the OSDs. 118 of 160 OSDs are
stable and the cluster is still in the process of settling. Thanks to
Be-El in the Ceph IRC channel; he helped a lot to make the
flapping OSDs stable.
What we learned up to now is that this is
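One common way to keep flapping OSDs from repeatedly being marked down while
they settle (not necessarily what Be-El suggested) is to set the nodown and
noout flags, and unset them again once the OSDs are stable:
ceph osd set nodown
ceph osd set noout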
I would try to reduce recovery to a minimum, something like this
helped us in a small cluster (25 OSDs on 3 hosts) during
recovery while operation continued without impact:
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2'
ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
After reducing the recovery parameters not much changed.
There are still a lot of OSDs marked down.
I don't know what I need to do at this point.
[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1
ceph -s
cluster:
id:
Now you also have PGs in 'creating' state. Creating PGs is a very IO-intensive
operation.
To me, nothing special is going on there - recovery + deep scrubbing + creating
PGs results in an expected degradation of performance.
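The PGs that are stuck creating or otherwise inactive can be listed with the
standard commands (a quick sketch, nothing cluster-specific):
ceph health detail
ceph pg dump_stuck inactive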
September 25, 2018 2:32 PM, "by morphin" wrote:
> 29 creating+down
> 4
The config didn't work, because increasing the numbers led to more OSD drops.
bhfs -s
cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
norebalance,norecover flag(s) set
1 osds down
17/8839434 objects unfound (0.000%)
Settings that heavily affect recovery performance are:
osd_recovery_sleep
osd_recovery_sleep_[hdd|ssd]
See this for details:
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
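For example, the recovery sleep can be lowered at runtime to speed recovery up
(at the cost of client IO), using injectargs as earlier in this thread - a
sketch, assuming HDD OSDs:
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hdd 0'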
September 25, 2018 1:57 PM, "by morphin" wrote:
> Thank you for answer
>
> What do you think the
Thank you for the answer.
What do you think of this conf to speed up the recovery?
[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 16
osd max scrubs = 16
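If applied at runtime instead of via ceph.conf, the same values could be
injected without restarting the OSDs (a sketch; note that another reply in this
thread reports that raising these led to more OSD drops):
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 16 --osd-max-scrubs 16'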
The user with that address wrote on Tue, 25 Sep 2018,
at 13:37:
>
> Just let it recover.
>
> data:
>
You can set:
osd_scrub_during_recovery = false
and in addition maybe set the noscrub and nodeep-scrub flags to let it
settle.
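For reference, that would be something like this (standard commands; worth
reverting once recovery has finished):
ceph tell 'osd.*' injectargs '--osd_scrub_during_recovery false'
ceph osd set noscrub
ceph osd set nodeep-scrub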
Kind regards,
Caspar
On Tue, 25 Sep 2018 at 12:39, Sergey Malinin wrote:
> Just let it recover.
>
> data:
> pools: 1 pools, 4096 pgs
> objects: 8.95 M
Just let it recover.
data:
pools: 1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage: 34 TiB used, 577 TiB / 611 TiB avail
pgs: 94.873% pgs not active
48475/17901254 objects degraded (0.271%)
1/8950627 objects unfound (0.000%)
Hello.
Half an hour ago, 7 of my 28 servers crashed (because of corosync!
"2.4.4-3")
and 2 of them were MONs; I have 3 MONs in my cluster.
After they came back, I saw high disk utilization caused by the ceph-osd processes.
My whole cluster is not responding right now! All of my OSDs are
consuming