Hi Denny,
the recommendation for ceph maintenance is to set three flags if you
need to shutdown a node (or the entire cluster):
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
Although the 'noout' flag seems to be enough for many maintenance
tasks, it doesn't prevent the
one server with noout.
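For reference, newer releases can scope this flag more narrowly than
cluster-wide. A hedged sketch; the per-OSD flags exist since Luminous,
the set-group variant for whole hosts since Mimic, and the names here
are hypothetical:
ceph osd add-noout osd.0
ceph osd rm-noout osd.0
ceph osd set-group noout host1
ceph osd unset-group noout host1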
Paul
On Fri, 19 Oct 2018 at 11:37, Eugen Block wrote:
Hi Denny,
the recommendation for ceph maintenance is to set three flags if you
need to shutdown a node (or the entire cluster):
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
Although
Hi,
you can monitor the cache size and see if the new values are applied:
ceph@mds:~> ceph daemon mds. cache status
{
"pool": {
"items": 106708834,
"bytes": 5828227058
}
}
You should also see in top (or similar tools) that the memory
increases/decreases. From my
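In any case, a hedged sketch of applying a new limit at runtime via the
admin socket, with a hypothetical MDS name:
ceph daemon mds.mds1 config set mds_cache_memory_limit 4294967296
ceph daemon mds.mds1 config get mds_cache_memory_limit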
I only tried to use the Ceph CLI once out of curiosity, simply because
it is there, but I don't really benefit from it.
Usually when I'm working with clusters it requires a combination of
different commands (rbd, rados, ceph etc.), so this would mean either
exiting and entering the CLI back
for the clarification!
Quoting Ilya Dryomov:
On Thu, Aug 30, 2018 at 1:04 PM Eugen Block wrote:
Hi again,
we still didn't figure out the reason for the flapping, but I wanted
to get back on the dmesg entries.
They just reflect what happened in the past, they're no indicator to
predict anything
and will expire in version "Sodium".
ago 31 12:43:51 polar salt-minion[1421]: [ERROR ] Mine on polar.iq.ufrgs.br for cephdisks.list
ago 31 12:43:51 polar salt-minion[1421]: [ERROR ] Module function osd.deploy threw an exception. Exception: Mine on polar.
Hi Olivier,
what size does the cache tier have? You could set cache-mode to
forward and flush it, maybe restarting those OSDs (68, 69) helps, too.
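A hedged sketch of those two steps, reusing the pool name that appears
later in this thread:
ceph osd tier cache-mode cache-bkp-foo forward --yes-i-really-mean-it
rados -p cache-bkp-foo cache-flush-evict-all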
Or there could be an issue with the cache tier, what do those logs say?
Regards,
Eugen
Quoting Olivier Bonvalet:
Hello,
on a Luminous
ier
On Friday, 21 September 2018 at 09:34 +0000, Eugen Block wrote:
Hi Olivier,
what size does the cache tier have? You could set cache-mode to
forward and flush it, maybe restarting those OSDs (68, 69) helps,
too.
Or there could be an issue with the cache tier, what do those logs
say?
Regards,
I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-evict-all",
but it blocks on the object "rbd_data.f66c92ae8944a.000f2596".
This is the object that's stuck in the cache tier (according to your
output in https://pastebin.com/zrwu5X0w). Can you verify if that block
I also switched the cache tier to "readproxy", to avoid using this
cache. But, it's still blocked.
You could change the cache mode to "none" to disable it. Could you
paste the output of:
ceph osd pool ls detail | grep cache-bkp-foo
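If you do decide to disable the tier, a hedged sketch (I'm not certain
whether "none" requires the confirmation flag):
ceph osd tier cache-mode cache-bkp-foo none --yes-i-really-mean-it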
Quoting Olivier Bonvalet:
In fact, one object (only
Hi,
I am wondering if it is possible to move the ssd journal for the
bluestore osd? I would like to move it from one ssd drive to another.
yes, this question has been asked several times.
Depending on your deployment there are several things to be aware of,
maybe you should first read [1]
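The reference [1] is cut off here. On newer releases (Nautilus and
later, if I remember correctly) the move itself can be done with
ceph-bluestore-tool; a hedged sketch with a hypothetical OSD path and
target device:
ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-0 \
  --devs-source /var/lib/ceph/osd/ceph-0/block.db --dev-target /dev/nvme0n1p1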
Hi,
how did you create the OSDs? Were they built from scratch with the
respective command options (--block.db /dev/)?
You could check what the bluestore tool tells you about the block.db:
ceph1:~ # ceph-bluestore-tool show-label --dev
/var/lib/ceph/osd/ceph-21/block | grep path
Hi,
I can confirm this for:
ceph --version
ceph version 12.2.5-419-g8cbf63d997
(8cbf63d997fb5cdc783fe7bfcd4f5032ee140c0c) luminous (stable)
Setting ACLs on a file works as expected (restrict file access to
specific user), getfacl displays correct information, but 'ls -la'
does not
l strongly suggests unsetting the nodown parameter.
What do you think?
On Wed, 26 Sep 2018 at 12:54, Eugen Block wrote:
Hi,
could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery
Hi,
could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery success. You could try the described steps:
- disable cephx auth with 'auth_cluster_required = none'
- set the
I would try to reduce recovery to a minimum; something like this
helped us in a small cluster (25 OSDs on 3 hosts) to keep operations
running without impact while recovery continued:
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2'
ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
Yeah, since we haven't knowingly done anything about it, it would be a
(pleasant) surprise if it was accidentally resolved in mimic ;-)
Too bad ;-)
Thanks for your help!
Eugen
Quoting John Spray:
On Wed, Sep 19, 2018 at 10:37 AM Eugen Block wrote:
Hi John,
> I'm not 100% s
Hi,
to reduce impact on clients during migration I would set the OSD's
primary-affinity to 0 beforehand. This should prevent the slow
requests, at least this setting has helped us a lot with problematic
OSDs.
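A hedged example with a hypothetical OSD id; 0 stops the OSD from being
chosen as primary, 1 restores the default:
ceph osd primary-affinity osd.12 0
ceph osd primary-affinity osd.12 1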
Regards
Eugen
Quoting Jaime Ibar:
Hi all,
we recently upgrade from
Hi Jan,
I think you're running into an issue reported a couple of times.
For the use of LVM you have to specify the name of the Volume Group
and the respective Logical Volume instead of the path, e.g.
ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda
Regards,
Eugen
Hi Thomas,
What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?
I don't think one pool per DB is reasonable. If the number of DBs
Hi,
If the OSD is the primary one for a PG, then all IO will be
stopped, which may lead to application failure.
no, that's not how it works. You have an acting set of OSDs for a PG,
typically 3 OSDs in a replicated pool. If the primary OSD goes down,
the secondary becomes the
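A hedged way to look at the acting set of a given PG (hypothetical PG
id); the output ends in something like "up [1,4,7] acting [1,4,7]":
ceph pg map 2.1f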
volume-baa6c928-8ac1-4240-b189-32b444b434a3
volume-c23a69dc-d043-45f7-970d-1eec2ccb10cc
volume-f1872ae6-48e3-4a62-9f46-bf157f079e7f
On Wed, 19 Dec 2018 at 09:25, Eugen Block wrote:
Hi,
can you explain more detailed what exactly goes wrong?
In many cases it's an authentication error, can you
Hello list,
there are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.
mon osd report timeout
- The grace period in seconds before declaring unresponsive Ceph OSD
Daemons down. Default 900
mon osd down out interval
- The
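Both values can also be checked on a running MON via the admin socket; a
hedged sketch with a hypothetical mon id:
ceph daemon mon.a config show | egrep 'mon_osd_report_timeout|mon_osd_down_out_interval'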
Hello list,
I noticed my last post was displayed as a reply to a different thread,
so I re-send my question, please excuse the noise.
There are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.
mon osd report timeout
- The
help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Jan 14, 2019 at 10:17 AM Eugen Block wrote:
Hello list,
I noticed my last post was displayed as a reply to a different thread,
so I re-send my que
Hi Jan,
how did you move the WAL and DB to the SSD/NVMe? By recreating the
OSDs or a different approach? Did you check afterwards that the
devices were really used for that purpose? We had to deal with that a
couple of months ago [1] and it's not really obvious if the new
devices are
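One hedged way to verify which devices an OSD actually uses
(hypothetical OSD id):
ceph osd metadata 21 | grep -E 'bluefs_db_|bluefs_wal_'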
Hi Stefan,
mds.mds1 [WRN] replayed op client.15327973:15585315,15585103 used ino
0x19918de but session next is 0x1873b8b
Nothing of importance is logged in the mds ("debug_mds_log": "1/5").
What does this warning message mean / indicate?
we face these messages on a regular basis.
Hi,
Between tests we destroyed the OSDs and created them from scratch. We used
Docker image to deploy Ceph on one machine.
I've seen that there are WAL/DB partitions created on the disks.
Should I also check somewhere in ceph config that it actually uses those?
if you created them from
the replay
mds until we hit a real issue. ;-)
It's probably impossible to predict any improvement on this with mimic, right?
Regards,
Eugen
Quoting John Spray:
On Mon, Sep 17, 2018 at 2:49 PM Eugen Block wrote:
Hi,
from your response I understand that these messages are not expected
Hi,
from your response I understand that these messages are not expected
if everything is healthy.
We face them every now and then, three or four times a week, but
there's no real connection to specific jobs or a high load in our
cluster. It's a Luminous cluster (12.2.7) with 1 active, 1
Hi,
can you explain more detailed what exactly goes wrong?
In many cases it's an authentication error, can you check if your
specified user is allowed to create volumes in the respective pool?
You could try something like this (from compute node):
rbd --user -k
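Filled out with purely hypothetical names (client "cinder", pool
"volumes"), such a check might look like:
rbd --id cinder -k /etc/ceph/ceph.client.cinder.keyring -p volumes ls
rbd --id cinder -k /etc/ceph/ceph.client.cinder.keyring create volumes/testimg --size 128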
--long" command. Thanks for the clarification.
Eugen
Quoting Jason Dillaman:
On Tue, Apr 2, 2019 at 8:42 AM Eugen Block wrote:
Hi,
> If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace.
I just tried the command "rbd
Hi,
If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace.
I just tried the command "rbd snap ls --all" on a lab cluster
(nautilus) and get this error:
ceph-2:~ # rbd snap ls --all
rbd: image name was not specified
Are there any requirements I haven't
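For what it's worth, the error suggests an image spec is still expected
even with --all; a hedged example with a hypothetical image:
rbd snap ls --all rbd/myimage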
Hi Dan,
I don't know about keeping the osd-id but I just partially recreated
your scenario. I wiped one OSD and recreated it. You are trying to
re-use the existing block.db-LV with the device path (--block.db
/dev/vg-name/lv-name) instead of the lv notation (--block.db
vg-name/lv-name):
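The command was truncated here; a hedged reconstruction with
hypothetical names:
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db-vg/db-lv0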
t cluster.
Regards,
Eugen
Quoting Janne Johansson:
On Mon, 25 Feb 2019 at 13:40, Eugen Block wrote:
I just moved a (virtual lab) cluster to a different network, it worked
like a charm.
In an offline method - you need to:
- set osd noout, ensure there are no OSDs up
- Change the MONs IP,
Hi,
We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran salt-run state.orch ceph.stage.3
PS: All of the above ran
I just moved a (virtual lab) cluster to a different network, it worked
like a charm.
In an offline method - you need to:
- set osd noout, ensure there are no OSDs up
- Change the MONs IP, See the bottom of [1] "CHANGING A MONITOR’S IP
ADDRESS", MONs are the only ones really
sticky with the
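A hedged sketch of that monmap surgery, with hypothetical mon name and
address (the mon has to be stopped first):
ceph-mon -i mon1 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm mon1
monmaptool /tmp/monmap --add mon1 192.168.100.11:6789
ceph-mon -i mon1 --inject-monmap /tmp/monmap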
Hi,
my first guess would be a network issue. Double-check your connections
and make sure the network setup works as expected. Check syslogs,
dmesg, switches etc. for hints that a network interruption may have
occured.
Regards,
Eugen
Quoting Zhenshi Zhou:
Hi,
I deployed a ceph
in DB. And assertion might still happen
(hopefully with less frequency).
So could you please run fsck for OSDs that were broken once and
share the results?
Then we can decide if it makes sense to proceed with the repair.
Thanks,
Igor
On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,
I
Hi,
could it be a missing 'ceph osd require-osd-release luminous' on your cluster?
When I check a luminous cluster I get this:
host1:~ # ceph osd dump | grep recovery
flags sortbitwise,recovery_deletes,purged_snapdirs
The flags in the code you quote seem related to that.
Can you check that
Hi Francois,
Is it correct that recovery will be forbidden by the crush rule if
a node is down?
yes, that is correct, failure-domain=host means no two chunks of the
same PG can be on the same host. So if your PG is divided into 6
chunks, they're all on different hosts, no recovery is
Hi cephers,
I'm struggling a little with the deep-scrubs. I know this has been
discussed multiple times (e.g. in [1]) and we also use a known crontab
script in a Luminous cluster (12.2.10) to start the deep-scrubbing
manually (a quarter of all PGs 4 times a week). The script works just
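The script itself isn't shown here; a minimal sketch of the idea,
assuming jq is available and the Luminous JSON layout of 'ceph pg dump'
(pg_stats at the top level):
# deep-scrub the 25 PGs with the oldest deep-scrub stamps
ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_stats | sort_by(.last_deep_scrub_stamp) | .[:25][].pgid' \
  | while read pg; do ceph pg deep-scrub "$pg"; done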
Thank you, Konstantin,
I'll give that a try.
Do you have any comment on osd_deep_mon_scrub_interval?
Eugen
Quoting Konstantin Shalygin:
The expectation was to prevent the automatic deep-scrubs but they are
started anyway
You can disable deep-scrubs per pool via `ceph osd pool set
ction CEPH...
Regards,
/st
-----Original Message-----
From: ceph-users On Behalf Of Eugen Block
Sent: Tuesday, February 12, 2019 5:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object
relocation in OSD failure ?
Hi,
I came to the same conclusion a
I created http://tracker.ceph.com/issues/38310 for this.
Regards,
Eugen
Quoting Konstantin Shalygin:
On 2/14/19 2:21 PM, Eugen Block wrote:
Already did, but now with highlighting ;-)
http://docs.ceph.com/docs/luminous/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval
2:16 PM, Eugen Block wrote:
Exactly, it's also not available in a Mimic test-cluster. But it's
mentioned in the docs for L and M (I didn't check the docs for
other releases), that's what I was wondering about.
Can you provide url to this page?
k
My Ceph Luminous doesn't know anything about this option:
# ceph daemon osd.7 config help osd_deep_mon_scrub_interval
{
"error": "Setting not found: 'osd_deep_mon_scrub_interval'"
}
Exactly, it's also not available in a Mimic test-cluster. But it's
mentioned in the docs for L and M (I
Hi,
I came to the same conclusion after doing various tests with rooms and
failure domains. I agree with Maged and suggest to use size=4,
min_size=2 for replicated pools. It's more overhead but you can
survive the loss of one room and even one more OSD (of the affected
PG) without losing
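Hedged, with a hypothetical pool name, that would be:
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2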
I have no issues opening that site from Germany.
Quoting Dan van der Ster:
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote:
On 15/02/2019 10:39, Ilya Dryomov wrote:
> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote:
>>
>> Hi Marc,
>>
>> You can see previous designs on the
Hi list,
I found this thread [1] about crashing SSD OSDs, although that was
about an upgrade to 12.2.7, we just hit (probably) the same issue
after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crashing (for the first time):
. And assertion might still happen
(hopefully with less frequency).
So could you please run fsck for OSDs that were broken once and
share the results?
Then we can decide if it makes sense to proceed with the repair.
Thanks,
Igor
On 2/7/2019 3:37 PM, Eugen Block wrote:
Hi list,
I found
Hi,
I see that as a security feature ;-)
You can prevent data loss if k chunks are intact, but you don't want
to work with the least required amount of chunks. In a disaster
scenario you can reduce min_size to k temporarily, but the main goal
should always be to get the OSDs back up.
For
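A hedged example of that temporary emergency step, assuming a
hypothetical EC pool with k=4 (so min_size drops from 5 to 4); revert as
soon as the OSDs are back:
ceph osd pool set ecpool min_size 4
# ... once recovery is done:
ceph osd pool set ecpool min_size 5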
fsck report first.
W.r.t to running ceph-bluestore-tool - you might want to specify log
file and increase log level to 20 using --log-file and --log-level
options.
On 2/7/2019 4:45 PM, Eugen Block wrote:
Hi Igor,
thanks for the quick response!
Just to make sure I don't misunderstand
68a5700 failed to open image: (1) Operation not permitted
rbd: error opening image isa: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted
Regards
Thomas
On 25.01.2019 at 11:52, Eugen B
Hi,
I replied to your thread a couple of days ago, maybe you didn't notice:
Restricting user access is possible on rbd image level. You can grant
read/write access for one client and only read access for other
clients, you have to create different clients for that, see [1] for
more
You can check all objects of that pool to see if your caps match:
rados -p backup ls | grep rbd_id
Quoting Eugen Block:
caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_p
Hi,
I haven't used the --show-config option until now, but if you ask your
OSD daemon directly, your change should have been applied:
host1:~ # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
host1:~ # ceph daemon osd.1 config show | grep osd_recovery_max_active
I'll keep it that way. ;-)
Quoting Janne Johansson:
On Wed, 10 Apr 2019 at 13:37, Eugen Block wrote:
> If you don't specify which daemon to talk to, it tells you what the
> defaults would be for a random daemon started just now using the same
> config as you have in /etc/ceph/
osd_recovery_max_active
osd_recovery_max_active = 3
Quoting Janne Johansson:
On Wed, 10 Apr 2019 at 13:31, Eugen Block wrote:
While --show-config still shows
host1:~ # ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3
It seems as if --show-config is not really up-to-date
": "173b6382-504b-421f-aa4d-52526fa80dfa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQBXwwddy5OEAxAAS4AidvOF0kl+kxIBvFhT1A==",
"ready": "ready"
Hi,
this OSD must have been part of a previous cluster, I assume.
I would remove it from crush if it's still there (check just to make
sure), wipe the disk, remove any traces like logical volumes (if it
was a ceph-volume lvm OSD) and if possible, reboot the node.
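A hedged sketch of the wipe with ceph-volume (hypothetical device):
ceph-volume lvm zap --destroy /dev/sdc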
Regards,
Eugen
Quoting
Hi Alex,
The cluster has been idle at the moment being new and all. I
noticed some disk related errors in dmesg but that was about it.
It looked to me like the failure was not detected for the next
20-30 minutes. All OSDs were up and in and health was OK. OSD logs
had no smoking gun
Hi, have you tried 'ceph health detail'?
Quoting Lars Täuber:
Hi everybody,
with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.
# ceph -s
cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health:
Hi Marc,
have you configured the other MDS to be standby-replay for the active
MDS? I have three MDS servers, one is active, the second is
active-standby and the third just standby. If the active fails, the
second takes over within seconds. This is what I have in my ceph.conf:
[mds.]
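The section name was cut off; a hedged reconstruction with a
hypothetical daemon name, using the options as they existed up to
Luminous/Mimic:
[mds.mds2]
mds_standby_replay = true
mds_standby_for_rank = 0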
Hi,
this question comes up regularly and is being discussed just now:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html
Regards,
Eugen
Quoting Yoann Moulin:
Dear all,
I am doing some tests with Nautilus and cephfs on erasure coding pool.
I noticed something
Sure there is:
ceph pg ls-by-osd
Regards,
Eugen
Quoting Igor Podlesny:
Or is there no direct way to accomplish that?
What workarounds can be used then?
--
End of message. Next message?
Hi,
did you also remove that OSD from crush and also from auth before
recreating it?
ceph osd crush remove osd.71
ceph auth del osd.71
Regards,
Eugen
Quoting "ST Wong (ITSC)":
Hi all,
We replaced a faulty disk out of N OSD and tried to follow steps
according to "Replacing an OSD"
Hi,
some more information about the cluster status would be helpful, such as
ceph -s
ceph osd tree
service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for
rank 0 so that a failover can happen?
Regards,
Eugen
Quoting
m Air International Inc.
dhils...@performair.com
www.PerformAir.com
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Eugen Block
Sent: Thursday, June 27, 2019 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] MGR Logs after Failure Testing
Hi
Good morning,
the OSDs are usually marked out after 10 minutes, that's when
rebalancing starts. But the I/O should not drop during that time, this
could be related to your pool configuration. If you have a replicated
pool of size 3 and also set min_size to 3 the I/O would pause if a
node
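Hedged commands to check and, if needed, relax that (hypothetical pool
name):
ceph osd pool get mypool size
ceph osd pool get mypool min_size
ceph osd pool set mypool min_size 2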
Hi,
did you try to use rbd and rados commands with the cinder keyring, not
the admin keyring? Did you check if the caps for that client are still
valid (do the caps differ between the two cinder pools)?
Are the ceph versions on your hypervisors also nautilus?
Regards,
Eugen
Quoting
Hi,
I can’t get data flushed out of osd with weights set to 0. Is there
any way of checking the tasks queued for PG remapping? Thank you.
can you give some more details about your cluster (replicated or EC
pools, applied rules etc.)? My first guess would be that the other
OSDs are
Hi,
are the OSD nodes on Nautilus already? We upgraded from Luminous to
Nautilus recently and the commands return valid output, except for
those OSDs that haven't been upgraded yet.
Quoting Gary Molenkamp:
I've had no luck in tracing this down. I've tried setting debugging and
log
Hi,
Then I tried to move DB to a new device (SSD) that is not formatted:
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path
/var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk
too many positional options have been specified on the command line
I think you're trying the wrong option.
Sorry, I misread, your option is correct, of course since there was no
external db device.
This worked for me:
ceph-2:~ # CEPH_ARGS="--bluestore-block-db-size 1048576"
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 bluefs-bdev-new-db
--dev-target /dev/sdb
inferring bluefs devices
Hi list,
we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).
We added new servers with Nautilus to the existing Luminous cluster,
so we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and then
needs his vacation. ;-)
Regards,
Eugen
Quoting Wido den Hollander:
On 7/18/19 12:21 PM, Eugen Block wrote:
Hi list,
we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).
We added new servers with Nautilus to the existing Luminous cluster, so
we co
Hi all,
we just upgraded our cluster to:
ceph version 14.2.0-300-gacd2f2b9e1
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)
When clicking through the dashboard to see what's new we noticed that
the crushmap viewer only shows the first root of our crushmap (we have
two
Thank you very much!
Quoting EDH - Manuel Rios Fernandez:
Hi Eugen,
Yes its solved, we reported in 14.2.1 and team fixed in 14.2.2
Regards,
Manuel
-----Original Message-----
From: ceph-users On Behalf Of Eugen Block
Sent: Wednesday, 24 July 2019 15:10
To: ceph-users
Hi,
deep-scrubs can also be configured per pool, so even if you have
adjusted the general deep-scrub time the deep-scrubs will still
happen. To disable per pool deep-scrubs you need to set
ceph osd pool set nodeep-scrub true
Regards,
Eugen
Quoting c...@elchaka.de:
Hello,
I have a
Hi,
check if the active MGR is hanging.
I had this when testing pg_autoscaler, after some time every command
would hang. Restarting the MGR helped for a short period of time, then
I disabled pg_autoscaler. This is an upgraded cluster, currently on
Nautilus.
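A hedged way to bounce the active MGR (hypothetical daemon name):
ceph mgr fail mgr1
# or, on the MGR host:
systemctl restart ceph-mgr@mgr1.service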
Regards,
Eugen
Quoting
Hi,
since we upgraded our cluster to Nautilus we also see those messages
sometimes when it's rebalancing. There are several reports about this
[1] [2], we didn't see it in Luminous. But eventually the rebalancing
finished and the error message cleared, so I'd say there's (probably)
Hi,
can you provide more details?
ceph daemon mds. cache status
ceph config show mds. | grep mds_cache_memory_limit
Regards,
Eugen
Quoting Ranjan Ghosh:
Okay, now, after I settled the issue with the oneshot service thanks to
the amazing help of Paul and Richard (thanks again!), I still
Hi,
can you share `ceph osd tree`? What crush rules are in use in your
cluster? I assume that the two failed OSDs prevent the remapping
because the rules can't be applied.
Regards,
Eugen
Quoting Philipp Schwaha:
hi,
I have a problem with a cluster being stuck in recovery after osd
that helps. This would
allow the recovery to proceed - but you should consider adding OSDs
(or at least increase the memory allocated to OSDs above the
defaults).
Andras
On 10/22/19 3:02 PM, Philipp Schwaha wrote:
hi,
On 2019-10-22 08:05, Eugen Block wrote:
Hi,
can you share `ceph osd t
Hi,
there is also /var/log/ceph/ceph.log on the MONs, it has the stats
you're asking for. Does this answer your question?
Regards,
Eugen
Quoting nokia ceph:
Hi Team,
With default log settings , the ceph stats will be logged like
cluster [INF] pgmap v30410386: 8192 pgs: 8192
Hi Kristof,
setting the OSD "out" doesn't change the crush weight of that OSD, but
removing it from the tree does, that's why the cluster started to
rebalance.
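Hedged commands illustrating the difference (hypothetical OSD id):
ceph osd out 12                    # sets reweight to 0, crush weight unchanged
ceph osd crush reweight osd.12 0   # changes the crush weight itself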
Regards,
Eugen
Quoting Kristof Coucke:
Hi all,
We are facing a strange symptom here.
We're testing our recovery
Hi,
you say the daemons are locally up and running but restarting fails?
Which one is it?
Do you see any messages suggesting flapping OSDs? After 5 retries
within 10 minutes the OSDs would be marked out. What is the result of
your checks for iostat etc.? Anything pointing to a high load on
Hi,
A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.
if all OSDs come back (stable) the recovery should eventually finish.
B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2