nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/
kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
ng like a journal-scrub, maybe.
Has someone experienced similar issues and can shed some light on
this? Any insights would be very helpful.
Regards,
Eugen
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024913.html
o much if you want to deploy separate block.db
from a bluestore made without block.db
kind regards
Ronny Aasen
Hi,
I'm not sure if this is deprecated or something, but I usually have to
execute an additional "ceph auth del osd.<id>" before recreating an OSD.
Otherwise the OSD fails to start. Maybe this is a missing step.
Regards,
Eugen
Quoting Gary Molenkamp:
Good morning all,
Last week I started co
enkamp :
Thanks Eugen,
The OSDs will start immediately after completing the "ceph-volume
prepare", but they won't start on a clean reboot. It seems that
the "prepare" is mounting the /var/lib/ceph/osd/ceph-osdX
path/structure but this is missing now in my boot process.
Ga
Hi,
I have a similar issue and would also need some advice on how to get rid
of the already deleted files.
Ceph is our OpenStack backend and there was a nova clone without
parent information. Apparently, the base image had been deleted
without a warning or anything although there were existi
Hi,
So "something" goes wrong:
# cat /var/log/libvirt/libxl/libxl-driver.log
-> ...
2018-05-20 15:28:15.270+0000: libxl:
libxl_bootloader.c:634:bootloader_finished: bootloader failed - consult
logfile /var/log/xen/bootloader.7.log
2018-05-20 15:28:15.270+0000: libxl:
libxl_exec.c:118:libxl_repor
Hi list,
we have a Luminous bluestore cluster with separate block.db/block.wal
on SSDs. We were running version 12.2.2 and upgraded yesterday to
12.2.5. The upgrade went smoothly, but since the restart of the OSDs I
noticed that 'ceph osd df' shows a different total disk size:
---cut here
On 5/25/2018 2:22 PM, Eugen Block wrote:
Hi list,
we have a Luminous bluestore cluster with separate
block.db/block.wal on SSDs. We were running version 12.2.2 and
upgraded yesterday to 12.2.5. The upgrade went smoothly, but since
the restart of the OSDs I noticed that 'ceph osd df
Hi,
[root@n1 ~]# ceph osd pool rm mytestpool mytestpool --yes-i-really-mean-it
Error EPERM: WARNING: this will *PERMANENTLY DESTROY* all data stored
if the command you posted is complete then you forgot one "really" in
the --yes-i-really-really-mean-it option.
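A cluster-free sketch of the flag check (the mon_allow_pool_delete note is an addition from memory, not from the thread):

```shell
# The full safety flag spells "really" twice; count the occurrences locally.
flag="--yes-i-really-really-mean-it"
count=$(echo "$flag" | grep -o 'really' | wc -l)
echo "$count"
# On Luminous and later the monitors must also allow deletion first:
#   ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
#   ceph osd pool rm mytestpool mytestpool --yes-i-really-really-mean-it
```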
Regards
Quoting Steffen W
Hi,
we had to recreate some block.db's for some OSDs just a couple of
weeks ago because our existing journal SSD had failed. This way we
avoided rebalancing the whole cluster; only the OSD had to be filled
up. Maybe this will help you too.
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/r
the other nodes?
How do you deal with these slow requests?
Thanks for any help!
Regards,
Eugen
--
Eugen Block voice : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg
8",
"age": 112.118210,
"duration": 26.452526,
They also contain many "waiting for rw locks" messages, but not as
many as the dump from the reporting OSD.
To me it seems that because two OSDs take a lot of time to process
their requests (on
mpty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 577,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
idn't have these inconsistencies
since we increased the size to 3.
Quoting Mio Vlahović:
Hello,
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Eugen Block
I had a similar issue recently, where I had a replication size of 2 (I
changed that to 3 after th
Glad I could help! :-)
Quoting Mio Vlahović:
From: Eugen Block [mailto:ebl...@nde.ag]
From what I understand, with a rep size of 2 the cluster can't decide
which object is intact if one is broken, so the repair fails. If you
had a size of 3, the cluster would see 2 intact objec
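The majority reasoning above can be sketched numerically (counts assumed; this models the poster's argument, not the exact repair internals):

```shell
# size=3 with one corrupt replica: two intact copies outvote the bad one,
# so a repair can pick the good version; with size=2 it is 1 vs 1.
size=3; corrupt=1
intact=$((size - corrupt))
if [ "$intact" -gt "$corrupt" ]; then
  echo "majority intact: repair can decide"
fi
```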
ay help you nail it.
I suspect though, that it may come down to enabling debug logging and
tracking a slow request through the logs.
On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block wrote:
Hi,
Looking at the output of dump_historic_ops and dump_ops_in_flight
I waited for new slow request messa
scrubs in the evening, to avoid performance impact during working
hours.
Please let me know if I missed anything. I really appreciate you
looking into this.
Regards,
Eugen
Quoting Christian Balzer:
Hello,
On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:
Hi,
I haven't t
lot.
Eugen
Quoting Christian Balzer:
Hello,
On Wed, 01 Feb 2017 15:16:15 +0100 Eugen Block wrote:
> You've told us absolutely nothing about your cluster
You're right, I'll try to provide as much information as possible.
Please note that we have kind of a "special"
ut a full recovery. Or should I
have deleted that PG instead of re-activating old OSDs? I'm not sure
what the best practice would be in this case.
Any help is appreciated!
Regards,
Eugen
Mon, Feb 13, 2017 at 7:05 AM Wido den Hollander wrote:
> On 13 February 2017 at 16:03, Eugen Block wrote:
>
>
> Hi experts,
>
> I have a strange situation right now. We are re-organizing our 4 node
> Hammer cluster from LVM-based OSDs to HDDs. When we did this on the
>
easier
to use and give production deployment.
Thanks,
Shambhu Rajak
Hi,
can you explain in more detail what exactly goes wrong?
In many cases it's an authentication error; can you check if your
specified user is allowed to create volumes in the respective pool?
You could try something like this (from compute node):
rbd --user <user> -k
/etc/ceph/ceph.client.OPENS
6-a36a-587cc50ed1ff
volume-baa6c928-8ac1-4240-b189-32b444b434a3
volume-c23a69dc-d043-45f7-970d-1eec2ccb10cc
volume-f1872ae6-48e3-4a62-9f46-bf157f079e7f
On Wed, 19 Dec 2018 at 09:25, Eugen Block wrote:
Hi,
can you explain in more detail what exactly goes wrong?
In many cases it's an auth
Hello list,
there are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.
mon osd report timeout
- The grace period in seconds before declaring unresponsive Ceph OSD
Daemons down. Default 900
mon osd down out interval
- The
Hello list,
I noticed my last post was displayed as a reply to a different thread,
so I re-send my question, please excuse the noise.
There are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.
mon osd report timeout
- The
ooking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Mon, Jan 14, 2019 at 10:17 AM Eugen Block wrote:
Hello list,
I noticed my last post was displayed as a reply to a different thread,
so I re-send
Hi Jan,
I think you're running into an issue reported a couple of times.
For the use of LVM you have to specify the name of the Volume Group
and the respective Logical Volume instead of the path, e.g.
ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda
Regards,
Eugen
Hi Thomas,
What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?
I don't think one pool per DB is reasonable. If the number of DBs
increase
Hi,
If the OSD represents the primary one for a PG, then all IO will be
stopped..which may lead to application failure..
no, that's not how it works. You have an acting set of OSDs for a PG,
typically 3 OSDs in a replicated pool. If the primary OSD goes down,
the secondary becomes the prim
Hi,
I replied to your thread a couple of days ago, maybe you didn't notice:
Restricting user access is possible on rbd image level. You can grant
read/write access for one client and only read access for other
clients, you have to create different clients for that, see [1] for
more details
:ImageState:
0x5643668a5700 failed to open image: (1) Operation not permitted
rbd: error opening image isa: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted
Regards
Thomas
On 25.01.2019 at 1
You can check all objects of that pool to see if your caps match:
rados -p backup ls | grep rbd_id
Quoting Eugen Block:
caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_p
Hi list,
I found this thread [1] about crashing SSD OSDs, although that was
about an upgrade to 12.2.7, we just hit (probably) the same issue
after our update to 12.2.10 two days ago in a production cluster.
Just half an hour ago I saw one OSD (SSD) crashing (for the first time):
2019-02-07
t
earlier are still present in DB. And assertion might still happen
(hopefully with less frequency).
So could you please run fsck for OSDs that were broken once and
share the results?
Then we can decide if it makes sense to proceed with the repair.
Thanks,
Igor
On 2/7/2019 3:37 PM, Eu
properly. We'll see, let's get the fsck report first.
W.r.t to running ceph-bluestore-tool - you might want to specify log
file and increase log level to 20 using --log-file and --log-level
options.
On 2/7/2019 4:45 PM, Eugen Block wrote:
Hi Igor,
thanks for the quick response!
Just to m
Hi,
could it be a missing 'ceph osd require-osd-release luminous' on your cluster?
When I check a luminous cluster I get this:
host1:~ # ceph osd dump | grep recovery
flags sortbitwise,recovery_deletes,purged_snapdirs
The flags in the code you quote seem related to that.
Can you check that out
Hi Francois,
Is that correct that recovery will be forbidden by the crush rule if
a node is down?
yes, that is correct, failure-domain=host means no two chunks of the
same PG can be on the same host. So if your PG is divided into 6
chunks, they're all on different hosts, no recovery is po
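As a sketch of the arithmetic (a k=4, m=2 profile is assumed here; the thread only mentions 6 chunks):

```shell
# k data chunks + m coding chunks, and failure-domain=host places each
# chunk on a different host, so k+m hosts are required.
k=4; m=2
hosts_needed=$((k + m))
echo "$hosts_needed"
# With exactly 6 hosts and one down, there is no spare host to recover onto.
```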
Hi,
I came to the same conclusion after doing various tests with rooms and
failure domains. I agree with Maged and suggest to use size=4,
min_size=2 for replicated pools. It's more overhead but you can
survive the loss of one room and even one more OSD (of the affected
PG) without losing
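The replica arithmetic behind that recommendation, sketched (2 replicas per room assumed):

```shell
# size=4/min_size=2 across two rooms: losing a room leaves exactly min_size
# replicas (I/O continues); one further OSD loss leaves one copy, so data
# survives but I/O pauses until recovery restores min_size.
size=4; min_size=2; per_room=2
after_room_loss=$((size - per_room))
after_extra_osd=$((after_room_loss - 1))
echo "$after_room_loss $after_extra_osd"
```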
ction CEPH...
Regards,
/st
-Original Message-
From: ceph-users On Behalf Of Eugen Block
Sent: Tuesday, February 12, 2019 5:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object
relocation in OSD failure ?
Hi,
I came to the same conclusion a
Hi cephers,
I'm struggling a little with the deep-scrubs. I know this has been
discussed multiple times (e.g. in [1]) and we also use a known crontab
script in a Luminous cluster (12.2.10) to start the deep-scrubbing
manually (a quarter of all PGs 4 times a week). The script works just
fi
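A local sketch of the batching such a crontab script typically does (the mock PG list stands in for real `ceph pg dump` output; the deep-scrub loop itself is commented out because it needs a cluster):

```shell
# Deep-scrub a quarter of all PGs per run, so the full set is covered
# in four runs per week.
pgs="1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7"   # mock list; real source: ceph pg dump
total=$(echo "$pgs" | wc -w)
quarter=$(( (total + 3) / 4 ))           # round up
batch=$(echo "$pgs" | tr ' ' '\n' | head -n "$quarter")
echo "$quarter"
# for pg in $batch; do ceph pg deep-scrub "$pg"; done
```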
Thank you, Konstantin,
I'll give that a try.
Do you have any comment on osd_deep_mon_scrub_interval?
Eugen
Quoting Konstantin Shalygin:
The expectation was to prevent the automatic deep-scrubs but they are
started anyway
You can disable deep-scrubs per pool via `ceph osd pool set <pool> nodeep-scrub true`
My Ceph Luminous doesn't know anything about this option:
# ceph daemon osd.7 config help osd_deep_mon_scrub_interval
{
"error": "Setting not found: 'osd_deep_mon_scrub_interval'"
}
Exactly, it's also not available in a Mimic test-cluster. But it's
mentioned in the docs for L and M (I didn'
2:16 PM, Eugen Block wrote:
Exactly, it's also not available in a Mimic test-cluster. But it's
mentioned in the docs for L and M (I didn't check the docs for
other releases), that's what I was wondering about.
Can you provide ur
I created http://tracker.ceph.com/issues/38310 for this.
Regards,
Eugen
Quoting Konstantin Shalygin:
On 2/14/19 2:21 PM, Eugen Block wrote:
Already did, but now with highlighting ;-)
http://docs.ceph.com/docs/luminous/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval
I have no issues opening that site from Germany.
Quoting Dan van der Ster:
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote:
On 15/02/2019 10:39, Ilya Dryomov wrote:
> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote:
>>
>> Hi Marc,
>>
>> You can see previous designs on the C
Hi,
We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.
Hi,
I see that as a security feature ;-)
You can prevent data loss as long as k chunks are intact, but you don't
want to operate with only the minimum required number of chunks. In a
disaster scenario you can reduce min_size to k temporarily, but the main goal
should always be to get the OSDs back up.
For ex
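The rule of thumb being described, with example values (k and m are assumptions, not from the thread):

```shell
# Default EC pool min_size is k+1; in a disaster it can be lowered to k
# temporarily, at the cost of running with zero redundancy margin.
k=8; m=3
default_min=$((k + 1))
disaster_min=$k
echo "$default_min $disaster_min"
```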
I just moved a (virtual lab) cluster to a different network, it worked
like a charm.
In an offline method - you need to:
- set osd noout, ensure there are no OSDs up
- Change the MONs IP, See the bottom of [1] "CHANGING A MONITOR’S IP
ADDRESS", MONs are the only ones really
sticky with the
only a test cluster.
Regards,
Eugen
Quoting Janne Johansson:
Den mån 25 feb. 2019 kl 13:40 skrev Eugen Block :
I just moved a (virtual lab) cluster to a different network, it worked
like a charm.
In an offline method - you need to:
- set osd noout, ensure there are no OSDs up
- Change the
earlier are still present in DB. And assertion might still happen
(hopefully with less frequency).
So could you please run fsck for OSDs that were broken once and
share the results?
Then we can decide if it makes sense to proceed with the repair.
Thanks,
Igor
On 2/7/2019 3:37 PM, Eugen Block
Hi,
my first guess would be a network issue. Double-check your connections
and make sure the network setup works as expected. Check syslogs,
dmesg, switches etc. for hints that a network interruption may have
occurred.
Regards,
Eugen
Quoting Zhenshi Zhou:
Hi,
I deployed a ceph clus
Hi Dan,
I don't know about keeping the osd-id but I just partially recreated
your scenario. I wiped one OSD and recreated it. You are trying to
re-use the existing block.db-LV with the device path (--block.db
/dev/vg-name/lv-name) instead the lv notation (--block.db
vg-name/lv-name):
#
Hi,
If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace.
I just tried the command "rbd snap ls --all" on a lab cluster
(nautilus) and get this error:
ceph-2:~ # rbd snap ls --all
rbd: image name was not specified
Are there any requirements I haven't noticed?
--long" command. Thanks for the clarification.
Eugen
Quoting Jason Dillaman:
On Tue, Apr 2, 2019 at 8:42 AM Eugen Block wrote:
Hi,
> If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace.
I just tried the command "rbd
Hi,
I haven't used the --show-config option until now, but if you ask your
OSD daemon directly, your change should have been applied:
host1:~ # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
host1:~ # ceph daemon osd.1 config show | grep osd_recovery_max_active
"osd_recovery
grep osd_recovery_max_active
osd_recovery_max_active = 3
Quoting Janne Johansson:
Den ons 10 apr. 2019 kl 13:31 skrev Eugen Block :
While --show-config still shows
host1:~ # ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3
It seems as if --show-config is not really up-to
#x27;ll keep it that way. ;-)
Quoting Janne Johansson:
Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :
> If you don't specify which daemon to talk to, it tells you what the
> defaults would be for a random daemon started just now using the same
> config as you have in /etc/
Good morning,
the OSDs are usually marked out after 10 minutes, that's when
rebalancing starts. But the I/O should not drop during that time, this
could be related to your pool configuration. If you have a replicated
pool of size 3 and also set min_size to 3 the I/O would pause if a
node
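The 10 minutes mentioned above corresponds to the default mon_osd_down_out_interval of 600 seconds (default quoted from memory, so treat it as an assumption):

```shell
# Grace period before a down OSD is marked out and rebalancing begins.
mon_osd_down_out_interval=600   # seconds (ceph default)
minutes=$((mon_osd_down_out_interval / 60))
echo "$minutes"
```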
Sure there is:
ceph pg ls-by-osd <osd-id>
Regards,
Eugen
Quoting Igor Podlesny:
Or is there no direct way to accomplish that?
What workarounds can be used then?
--
End of message. Next message?
Hi, have you tried 'ceph health detail'?
Quoting Lars Täuber:
Hi everybody,
with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.
# ceph -s
cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WAR
Hi Marc,
have you configured the other MDS to be standby-replay for the active
MDS? I have three MDS servers, one is active, the second is
active-standby and the third just standby. If the active fails, the
second takes over within seconds. This is what I have in my ceph.conf:
[mds.<name>]
mds_
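A minimal ceph.conf sketch of such a setup (the daemon name "mds.b" and the pre-Nautilus option names are assumptions based on Luminous-era configs):

```ini
[mds.b]
mds_standby_replay = true
mds_standby_for_rank = 0
```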
Hi,
this question comes up regularly and is been discussed just now:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html
Regards,
Eugen
Quoting Yoann Moulin:
Dear all,
I am doing some tests with Nautilus and cephfs on erasure coding pool.
I noticed something strang
Hi Alex,
The cluster has been idle at the moment being new and all. I
noticed some disk related errors in dmesg but that was about it.
It looked to me for the next 20 - 30 minutes the failure has not
been detected. All osds were up and in and health was OK. OSD logs
had no smoking gun eit
Hi,
this OSD must have been part of a previous cluster, I assume.
I would remove it from crush if it's still there (check just to make
sure), wipe the disk, remove any traces like logical volumes (if it
was a ceph-volume lvm OSD) and if possible, reboot the node.
Regards,
Eugen
Quoting
,
"bluefs": "1",
"ceph_fsid": "173b6382-504b-421f-aa4d-52526fa80dfa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQBXwwddy5OEAxAAS4AidvOF0kl+k
Hi,
some more information about the cluster status would be helpful, such as
ceph -s
ceph osd tree
service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for
rank 0 so that a failover can happen?
Regards,
Eugen
Quoting dhils...@performair
on Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Eugen Block
Sent: Thursday, June 27, 2019 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] MGR Logs after
Hi,
I can’t get data flushed out of osd with weights set to 0. Is there
any way of checking the tasks queued for PG remapping ? Thank You.
can you give some more details about your cluster (replicated or EC
pools, applied rules etc.)? My first guess would be that the other
OSDs are either
Hi,
did you try to use rbd and rados commands with the cinder keyring, not
the admin keyring? Did you check if the caps for that client are still
valid (do the caps differ between the two cinder pools)?
Are the ceph versions on your hypervisors also nautilus?
Regards,
Eugen
Quoting Adr
Hi,
did you also remove that OSD from crush and also from auth before
recreating it?
ceph osd crush remove osd.71
ceph auth del osd.71
Regards,
Eugen
Quoting "ST Wong (ITSC)":
Hi all,
We replaced a faulty disk out of N OSD and tried to follow steps
according to "Replacing an OSD"
Hi list,
we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).
We added new servers with Nautilus to the existing Luminous cluster,
so we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and then
ess hours, maybe that's why he needs his vacation. ;-)
Regards,
Eugen
Quoting Wido den Hollander:
On 7/18/19 12:21 PM, Eugen Block wrote:
Hi list,
we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).
We added new servers with Nautilus to
Hi all,
we just upgraded our cluster to:
ceph version 14.2.0-300-gacd2f2b9e1
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)
When clicking through the dashboard to see what's new we noticed that
the crushmap viewer only shows the first root of our crushmap (we have
two roots
Thank you very much!
Quoting EDH - Manuel Rios Fernandez:
Hi Eugen,
Yes its solved, we reported in 14.2.1 and team fixed in 14.2.2
Regards,
Manuel
-Original Message-
From: ceph-users On behalf of Eugen Block
Sent: Wednesday, 24 July 2019 15:10
To: ceph-users
Hi,
are the OSD nodes on Nautilus already? We upgraded from Luminous to
Nautilus recently and the commands return valid output, except for
those OSDs that haven't been upgraded yet.
Quoting Gary Molenkamp:
I've had no luck in tracing this down. I've tried setting debugging and
log ch
Hi,
Then I tried to move DB to a new device (SSD) that is not formatted:
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path
/var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk
too many positional options have been specified on the command line
I think you're trying the wrong option.
Sorry, I misread, your option is correct, of course since there was no
external db device.
This worked for me:
ceph-2:~ # CEPH_ARGS="--bluestore-block-db-size 1048576"
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 bluefs-bdev-new-db
--dev-target /dev/sdb
inferring bluefs devices from
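Note that 1048576 above is just 1 MiB, fine for a lab test but far too small in production; a common rule of thumb (an assumption here, not stated in the thread) puts block.db at a few percent of the data device:

```shell
# Sizing sketch: ~2% of a 4 TiB data device, reported in GiB.
data_bytes=$((4 * 1024 * 1024 * 1024 * 1024))
db_bytes=$((data_bytes / 100 * 2))
db_gib=$((db_bytes / 1024 / 1024 / 1024))
echo "$db_gib"
```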
Hi,
there is also /var/log/ceph/ceph.log on the MONs, it has the stats
you're asking for. Does this answer your question?
Regards,
Eugen
Quoting nokia ceph:
Hi Team,
With default log settings , the ceph stats will be logged like
cluster [INF] pgmap v30410386: 8192 pgs: 8192 active+cl
Hi,
can you share `ceph osd tree`? What crush rules are in use in your
cluster? I assume that the two failed OSDs prevent the remapping
because the rules can't be applied.
Regards,
Eugen
Quoting Philipp Schwaha:
hi,
I have a problem with a cluster being stuck in recovery after osd
_max_pg_per_osd to, say, 400 and see if that helps. This would
allow the recovery to proceed - but you should consider adding OSDs
(or at least increase the memory allocated to OSDs above the
defaults).
Andras
On 10/22/19 3:02 PM, Philipp Schwaha wrote:
hi,
On 2019-10-22 08:05, Eugen Bloc
Hi,
deep-scrubs can also be configured per pool, so even if you have
adjusted the general deep-scrub time the deep-scrubs will still
happen. To disable per pool deep-scrubs you need to set
ceph osd pool set <pool> nodeep-scrub true
Regards,
Eugen
Quoting c...@elchaka.de:
Hello,
I have a N
Hi,
check if the active MGR is hanging.
I had this when testing pg_autoscaler, after some time every command
would hang. Restarting the MGR helped for a short period of time, then
I disabled pg_autoscaler. This is an upgraded cluster, currently on
Nautilus.
Regards,
Eugen
Zitat von Thom
Hi,
can you provide more details?
ceph daemon mds.<name> cache status
ceph config show mds.<name> | grep mds_cache_memory_limit
Regards,
Eugen
Quoting Ranjan Ghosh:
Okay, now, after I settled the issue with the oneshot service thanks to
the amazing help of Paul and Richard (thanks again!), I still
Hi,
since we upgraded our cluster to Nautilus we also see those messages
sometimes when it's rebalancing. There are several reports about this
[1] [2], we didn't see it in Luminous. But eventually the rebalancing
finished and the error message cleared, so I'd say there's (probably)
nothin
Hi Kristof,
setting the OSD "out" doesn't change the crush weight of that OSD, but
removing it from the tree does, that's why the cluster started to
rebalance.
Regards,
Eugen
Quoting Kristof Coucke:
Hi all,
We are facing a strange symptom here.
We're testing our recovery procedures.
Hi,
A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.
if all OSDs come back (stable) the recovery should eventually finish.
B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above
Hi,
you say the daemons are locally up and running but restarting fails?
Which one is it?
Do you see any messages suggesting flapping OSDs? After 5 retries
within 10 minutes the OSDs would be marked out. What is the result of
your checks for iostat etc.? Anything pointing to a high load on