Hi,
cephadm stores a local copy of the cephadm binary in
/var/lib/ceph/{FSID}/cephadm.{DIGEST}:
quincy-1:~ # ls -lrt /var/lib/ceph/{FSID}/cephadm.*
-rw-r--r-- 1 root root 350889 26. Okt 2023
/var/lib/ceph/{FSID}/cephadm.f6868821c084cd9740b59c7c5eb59f0dd47f6e3b1e6fecb542cb44134ace8d78
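If you want to verify that the suffix is in fact the SHA-256 digest of
the binary itself (my assumption, the message doesn't state it
explicitly), you can compare it against:
quincy-1:~ # sha256sum /var/lib/ceph/{FSID}/cephadm.*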
there will soon be some more remapping. :-)
So I would consider this thread as closed, all good.
Quoting Eugen Block:
No, we didn’t change much, just increased the max pg per osd to
avoid warnings and inactive PGs in case a node would fail during
this process. And the max backfills
Hi,
can you share the current 'ceph status'? Do you have any inconsistent
PGs or something? What are the cephfs data pool's min_size and size?
Quoting Alexey GERASIMOV:
Colleagues, thank you for the advice to check the operability of
MGRs. In fact, it is also strange: we checked our
" in method 1 and "migrating
PGs" in method 2? I think method 1 must read the OSD to be removed.
Otherwise, we would not see slow ops warning. Does method 2 not involve
reading this OSD?
Thanks,
Mary
On Fri, Apr 26, 2024 at 5:15 AM Eugen Block wrote:
> Hi,
>
> if you rem
Hi, I didn't find any config options other than the ones you already found.
Just wanted to note that I did read your message. :-)
Maybe one of the Devs can comment.
Quoting Stefan Kooman:
Hi,
We're testing with rbd-mirror (mode snapshot) and try to get status
updates about snapshots as fast
Hi,
if you remove the OSD this way, it will be drained, which means that
the cluster will try to recover PGs from this OSD; in case of a hardware
failure this might lead to slow requests. It might make sense to
forcefully remove the OSD without draining:
- stop the osd daemon
- mark it as out
-
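As an illustration only (the list above is cut off and the exact
commands are my assumption for a cephadm-managed cluster), the forced
removal could look roughly like this:
ceph orch daemon stop osd.<ID>               # stop the osd daemon
ceph osd out <ID>                            # mark it as out
ceph osd purge <ID> --yes-i-really-mean-it   # remove it from crush, auth and the osd map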
Hi, it's unlikely that all OSDs fail at the same time, it seems like a
network issue. Do you have an active MGR? Just a couple of days ago
someone reported incorrect OSD stats because no MGR was up. Although
your 'ceph health detail' output doesn't mention that, there are still
issues when
mon_osd_nearfull_ratio temporarily?
Frédéric.
- On 25 Apr 24, at 12:35, Eugen Block ebl...@nde.ag wrote:
For those interested, just a short update: the split process is
approaching its end, two days ago there were around 230 PGs left
(the target is 4096 PGs). So far there have been no complaints, no cluster
increasing osd_max_backfills to any values
higher than
2-3 will not help much with the recovery/backfilling speed.
Either way, you'll have to be patient. :-)
Cheers,
Frédéric.
- On 10 Apr 24, at 12:54, Eugen Block ebl...@nde.ag wrote:
Thank you for the input!
We started the split with max
Hi,
I saw something like this a couple of weeks ago on a customer cluster.
I'm not entirely sure, but this was either due to (yet) missing or
wrong cephadm ssh config or a label/client-keyring management issue.
If this is still an issue I would recommend checking the configured
keys to be
In addition to Nico's response, three years ago I wrote a blog post
[1] about that topic, maybe that can help as well. It might be a bit
outdated, what it definitely doesn't contain is this command from the
docs [2] once the server has been re-added to the host list:
ceph cephadm osd
possible to implement a modify operation in the future
without breaking stuff. And you can save time on the documentation,
because it works like other stuff.
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
____
From: Eugen Bl
Oh, I see. Unfortunately, I don't have a cluster in stretch mode so I
can't really test that. Thanks for pointing to the tracker.
Quoting Stefan Kooman:
On 23-04-2024 14:40, Eugen Block wrote:
Hi,
what's the right way to add another pool?
create pool with 4/2 and use the rule
Hi,
I believe the docs [2] are okay, running 'ceph fs authorize' will
overwrite the existing caps, it will not add more caps to the client:
Capabilities can be modified by running fs authorize only in the
case when read/write permissions must be changed.
If a client already has a
Hi,
what's the right way to add another pool?
create pool with 4/2 and use the rule for the stretched mode, finished?
the existing pools were automatically set to 4/2 after "ceph mon
enable_stretch_mode".
if that is what you require, then yes, it's as easy as that. Although
I haven't played
I'm not entirely sure if I ever tried it with the rbd-mirror user
instead of admin user, but I see the same error message on 17.2.7. I
assume that it's not expected, I think a tracker issue makes sense.
Thanks,
Eugen
Quoting Stefan Kooman:
Hi,
We are testing rbd-mirroring. There seems
IIRC, you have 8 GB configured for the mds cache memory limit, and it
doesn’t seem to be enough. Does the host run into oom killer as well?
But it’s definitely a good approach to increase the cache limit (try
24 GB if possible since it’s trying to use at least 19 GB) on a host
with enough
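A hedged example of raising that limit at runtime (the 24 GB value
follows the suggestion above, the byte conversion is mine):
ceph config set mds mds_cache_memory_limit 25769803776   # 24 GiB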
t have any clients connected).
Quoting Eugen Block:
Hi,
I don't see a reason why Quincy rgw daemons shouldn't work with a
Reef cluster. It would basically mean that you have a staggered
upgrade [1] running and didn't upgrade RGWs yet. It should also work
to just downgrade them, e
Hi,
I don't see a reason why Quincy rgw daemons shouldn't work with a Reef
cluster. It would basically mean that you have a staggered upgrade [1]
running and didn't upgrade RGWs yet. It should also work to just
downgrade them, either by providing a different default image, then
redeploy
Right, I just figured from the health output you would have a couple
of seconds or so to query the daemon:
mds: 1/1 daemons up
Quoting Alexey GERASIMOV:
Ok, we will create the ticket.
Eugen Block - ceph tell command needs to communicate with the MDS
daemon running
Hi Erich,
there's no simple answer to your question, as always it depends.
Every now and then there are threads about clients misbehaving,
especially with the "flush tid" messages. For example, the docs [1]
state:
The CephFS client-MDS protocol uses a field called the oldest tid to
What’s the output of:
ceph tell mds.0 damage ls
Quoting alexey.gerasi...@opencascade.com:
Dear colleagues, hope that anybody can help us.
The initial point: Ceph cluster v15.2 (installed and controlled by
Proxmox) with 3 nodes based on physical servers rented from a
cloud
Hi, there are lots of metrics that are collected by the MGR. So if
no active MGR is running, the cluster health details can be wrong or outdated.
Quoting Tobias Langner:
Hey Alwin,
Thanks for your reply, answers inline.
I'd assume (w/o pool config) that the EC 2+1 is putting PGs
inactive.
Hi,
without looking too deep into it, I would just assume that the daemons
and clients are connected to different MONs. Or am I misunderstanding
your question?
Quoting Joel Davidow:
Just curious why the feature_map portions differ in the return of
mon_status across a cluster. Below
Hi,
I'm not sure if and how that could help, there's a get-crushmap
command for the ceph-monstore-tool:
[ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/
show-versions -- --map-type crushmap > show-versions
[ceph: root@host1 /]# cat show-versions
first committed:
"if something goes wrong,
monitors will fail" rather discouraging :-)
/Z
On Tue, 16 Apr 2024 at 18:59, Eugen Block wrote:
Sorry, I meant extra-entrypoint-arguments:
https://www.spinics.net/lists/ceph-users/msg79251.html
Quoting Eugen Block:
> You can use the extra containe
Sorry, I meant extra-entrypoint-arguments:
https://www.spinics.net/lists/ceph-users/msg79251.html
Quoting Eugen Block:
You can use the extra container arguments I pointed out a few months
ago. Those work in my test clusters, although I haven’t enabled that
in production yet
in theory this
> should result in lower but much faster compression.
>
> I hope this helps. My plan is to keep the monitors with the current
> settings, i.e. 3 with compression + 2 without compression, until the next
> minor release of Pacific to see whether the monitors with compressed
Ah, okay, thanks for the hint. In that case what I see is expected.
Quoting Robert Sander:
Hi,
On 16.04.24 10:49, Eugen Block wrote:
I believe I can confirm your suspicion, I have a test cluster on
Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1].
Stopping
Hm, no, I can't confirm it yet. I missed something in the config, the
failover happens and a new nfs daemon is deployed on a different node.
But I still see client interruptions so I'm gonna look into that first.
Quoting Eugen Block:
Hi,
I believe I can confirm your suspicion, I have
Hi,
I believe I can confirm your suspicion, I have a test cluster on Reef
18.2.1 and deployed nfs without HAProxy but with keepalived [1].
Stopping the active NFS daemon doesn't trigger anything, the MGR
notices that it's stopped at some point, but nothing else seems to
happen. I didn't
ou'll have to be patient. :-)
Cheers,
Frédéric.
- On 10 Apr 24, at 12:54, Eugen Block ebl...@nde.ag wrote:
Thank you for the input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much
, but we
haven't noticed it before.
HTH,
Greg.
On 10/4/24 14:42, Eugen Block wrote:
Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as
well, we've had good experience with that in the past, without the
autoscaler. I just haven't dealt with such large PGs, I've
) and now they finally started to listen. Well, they would still
ignore it if it didn't impact all kinds of things now. ;-)
Thanks,
Eugen
Quoting Janne Johansson:
On Tue, 9 Apr 2024 at 10:39, Eugen Block wrote:
I'm trying to estimate the possible impact when large PGs are
split
is a simpler
In any case, it’s worth trying and using the maximum capabilities of
the upmap
Good luck,
k
[1] https://github.com/digitalocean/pgremapper
On 9 Apr 2024, at 11:39, Eugen Block wrote:
I'm trying to estimate the possible impact when large PGs are
split. Here's one example
Hi,
I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:
PG_STAT OBJECTS BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG UP
86.3ff  277708  4144030984090  0  3092  3092
Hi Torkil,
I assume the affected OSDs were the ones with slow requests, no? You
should still see them in some of the logs (mon, mgr).
Quoting Torkil Svensgaard:
On 06-04-2024 18:10, Torkil Svensgaard wrote:
Hi
Cephadm Reef 18.2.1
Started draining 5 18-20 TB HDD OSDs (DB/WAL on NVMe)
Sorry, I hit send too early, to enable multi-active MDS the full command is:
ceph fs flag set enable_multiple true
Quoting Eugen Block:
Did you enable multi-active MDS? Can you please share 'ceph fs
dump'? Port 6789 is the MON port (v1, v2 is 3300). If you haven't
enabled multi-active
Did you enable multi-active MDS? Can you please share 'ceph fs dump'?
Port 6789 is the MON port (v1, v2 is 3300). If you haven't enabled
multi-active, run:
ceph fs flag set enable_multiple
Quoting elite_...@163.com:
I tried to remove the default fs then it works, but port 6789 still
Hi,
the noin flag seems to be only applicable to existing OSDs which are
already in the crushmap. It doesn't apply to newly created OSDs, I
could confirm that in a small test cluster with Pacific and Reef. I
don't have any insights if that is by design or not, I assume it's
supposed to
parameter?
Or maybe look into speeding up LV creation (if this is the bottleneck)?
Thanks a lot,
Mathias
-Original Message-
From: Kuhring, Mathias
Sent: Friday, March 22, 2024 5:38 PM
To: Eugen Block ; ceph-users@ceph.io
Subject: [ceph-users] Re: [ext] Re: cephadm auto disk preparation
Hi,
1. I see no systemd units with the fsid in them, as described in the
document above. Both before and after the upgrade, my mon and other
units are:
ceph-mon@.service, ceph-osd@[N].service
etc
Should I be concerned?
I think this is expected because it's not containerized, no reason to
Hi,
how many OSDs do you have in total? Can you share your osd tree, please?
You could check the unit.meta file on each OSD host to see which
service it refers to and simply change it according to the service you
intend to keep:
host1:~ # grep -r service_name
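The command above is cut off; purely as a hypothetical illustration of
what such a check could look like on a cephadm host (path assumed):
host1:~ # grep -r service_name /var/lib/ceph/<FSID>/osd.*/unit.meta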
Hi,
you need to deploy more daemons because your current active MDS is
responsible for the already existing CephFS. There are several ways to
do this, I like the yaml file approach and increase the number of MDS
daemons, just as an example from a test cluster with one CephFS I
added the
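The spec itself is cut off above; a minimal sketch of what such an MDS
spec might look like (service id and count are assumptions, not taken
from the original message):
service_type: mds
service_id: cephfs
placement:
  count: 3
Applying it with 'ceph orch apply -i mds.yaml' then deploys the
additional daemons.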
9945d0514222bd7a83e28b96e8440c630ba6891f",
"RepoTags": [
"ceph/daemon:latest-pacific"
"RepoDigests": [
"ceph/daemon@sha256:261bbe628f4b438f5bf10de5a8ee05282f2697a5a2cb7ff7668f776b61b9d586"
-Original Message-
From: Adiga, Anantha
Sent:
, but that was it.
/Z
On Tue, 2 Apr 2024 at 11:00, Eugen Block wrote:
Hi,
here's the link to the docs [1] on how to replace OSDs.
ceph orch osd rm <osd_id> --replace --zap [--force]
This should zap both the data drive and db LV (yes, its data is
useless without the data drive), not sure how it will handle if the
data
Hi,
here's the link to the docs [1] on how to replace OSDs.
ceph orch osd rm <osd_id> --replace --zap [--force]
This should zap both the data drive and db LV (yes, its data is
useless without the data drive), not sure how it will handle if the
data drive isn't accessible though.
One thing I'm not
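A hedged usage example (the OSD id is made up for illustration):
ceph orch osd rm 12 --replace --zap
With --replace the OSD should be marked 'destroyed' in 'ceph osd tree'
once drained, so a new disk can later reuse the same OSD id.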
Hi,
without knowing the whole story, to cancel OSD removal you can run
this command:
ceph orch osd rm stop <osd_id>
Regards,
Eugen
Zitat von "adam.ther" :
Hello,
I have a single node host with a VM as a backup MON, MGR, etc.
This has caused all OSDs to be pending as 'deleting', can I safely
- a001s017
- a001s018
# ceph orch ls --service_name=mon --export
service_type: mon
service_name: mon
placement:
count: 3
hosts:
- a001s016
- a001s017
- a001s018
-Original Message-
From: Adiga, Anantha
Sent: Monday, April 1, 2024 6:06 PM
To: Eugen Block
Cc: ceph-users@c
n_mon_release 16 (pacific)
election_strategy: 1
0: [v2:10.45.128.28:3300/0,v1:10.45.128.28:6789/0] mon.a001s018
1: [v2:10.45.128.27:3300/0,v1:10.45.128.27:6789/0] mon.a001s017
Thank you,
Anantha
-Original Message-
From: Eugen Block
Sent: Monday, April 1, 2024 1:10 PM
To: ceph-users@ce
Maybe it’s just not in the monmap? Can you show the output of:
ceph mon dump
Did you do any maintenance (apparently OSDs restarted recently) and
maybe accidentally removed a MON from the monmap?
Zitat von "Adiga, Anantha" :
Hi Anthony,
Seeing it since last afternoon. It is the same with
Hi,
what does your node-exporter spec look like?
ceph orch ls node-exporter --export
If other node-exporter daemons are running in the cluster, what's the
difference between them? Do they all have the same container image?
ceph config get mgr mgr/cephadm/container_image_node_exporter
and
omp rx=0 tx=0)._fault waiting 15.00
2024-03-13T11:14:29.795+0800 7f6980206640 10 RDMAStack polling finally
delete qp = 0x5650c54164b0
Eugen Block wrote on Tue, 19 Mar 2024 at 14:50:
Hi,
there are several existing threads on this list, have you tried to
apply those suggestions? A couple of them were:
- ceph mgr
Hi,
before getting into that the first thing I would do is to fail the
mgr. There have been many issues that were resolved simply by failing
over the mgr.
If that doesn't help, the cephadm.log should show something useful
(/var/log/ceph/cephadm.log on the OSD hosts, I'm still not too
Hi Jesper,
could you please provide more details about the cluster (the usual
like 'ceph osd tree', 'ceph osd df', 'ceph versions')?
I find it unusual to enable maintenance mode to add OSDs, is there a
specific reason?
And why add OSDs manually with 'ceph orch osd add', why not have a
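For comparison, a minimal sketch of an OSD service spec that lets
cephadm pick up available disks automatically (spec id and host pattern
are assumptions):
service_type: osd
service_id: all-available-devices
placement:
  host_pattern: '*'
data_devices:
  all: true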
Hi,
there are several existing threads on this list, have you tried to
apply those suggestions? A couple of them were:
- ceph mgr fail
- check time sync (NTP, chrony)
- different weights for MONs
- Check debug logs
Regards,
Eugen
Quoting faicker mo:
some logs here,
It's your pool replication (size = 3):
3886733 (number of objects) * 3 = 11660199
Quoting Thorne Lawler:
Can anyone please tell me what "COPIES" means in this context?
[ceph: root@san2 /]# rados df -p cephfs.shared.data
POOL_NAME USED OBJECTS CLONES COPIES
Hi Torkil,
Num is 0 but it's not replicated so how does this translate to
picking 3 of 3 datacenters?
it doesn't really make a difference if replicated or not, it just
defines how many crush buckets to choose, so it applies in the same
way as for your replicated pool.
I am thinking we
Yeah, the whole story would help to give better advice. With EC the
default min_size is k+1, you could reduce the min_size to 5
temporarily, this might bring the PGs back online. But the long term
fix is to have all required OSDs up and have enough OSDs to sustain an
outage.
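A hedged example of the temporary change (pool name is a placeholder):
ceph osd pool set <pool> min_size 5
Remember to raise it back to the default (k+1) once all required OSDs
are up again.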
Quoting
Hi,
I assume you're still on a "low" pacific release? This was fixed by PR
[1][2] and the warning is suppressed when autoscaler is on, it was
merged into Pacific 16.2.8 [3].
I can't answer why autoscaler doesn't increase the pg_num, but yes,
you can increase it by yourself. The pool for
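A hedged example of doing that manually (pool name and target are
placeholders):
ceph osd pool set <pool> pg_num <target_pg_num>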
Hi,
your ceph version seems to be 17.2.4, not 17.2.6 (which is the locally
installed ceph version on the system where you ran the command). Could
you add the 'ceph versions' output as well?
How is the load on the systems when the recovery starts? The OSDs
crash after around 20 minutes,
sd.3, it crashes in less than a minute
23:49 : After I mark osd.3 "in" and start it again, it comes back
online with osd.0 and osd.11 soon after
Best regards,
Romain Lebbadi-Breteau
On 2024-03-08 3:17 a.m., Eugen Block wrote:
Hi,
can you share more details? Which OSD are you trying
Hi,
can you share more details? Which OSD are you trying to get out, the
primary osd.3?
Can you also share 'ceph osd df'?
It looks like a replicated pool with size 3, can you confirm with
'ceph osd pool ls detail'?
Do you have logs from the crashing OSDs when you take out osd.3?
Which ceph
Thanks! That's very interesting to know!
Zitat von "David C." :
some monitors have existed for many years (weight 10) others have been
added (weight 0)
=> https://github.com/ceph/ceph/commit/2d113dedf851995e000d3cce136b69bfa94b6fe0
On Thursday, March 7, 2024, Eugen Block wrote:
I’m curious how the weights might have been changed. I’ve never
touched a mon weight myself, do you know how that happened?
Zitat von "David C." :
Ok, got it :
[root@pprod-admin:/var/lib/ceph/]# ceph mon dump -f json-pretty
|egrep "name|weigh"
dumped monmap epoch 14
Are the scrubs eventually reported as "scrub ok" in the OSD logs? How
long do the scrubs take? Do you see updated timestamps in the 'ceph pg
dump' output (column DEEP_SCRUB_STAMP)?
Quoting thymus_03fumb...@icloud.com:
I recently switched from 16.2.x to 18.2.x and migrated to cephadm,
You're welcome, great that your cluster is healthy again.
Quoting matt...@peregrineit.net:
Thanks Eugen, you pointed me in the right direction :-)
Yes, the config files I mentioned were the ones in
`/var/lib/ceph/{FSID}/mgr.{MGR}/config` - I wasn't aware there were
others (well, I
Hi,
your response arrived in my inbox today, so sorry for the delay.
I wrote a blog post [1] just two weeks ago for that procedure with
cephadm, Zac adopted that and updated the docs [2]. Can you give that
a try and let me know if it worked? I repeated that procedure a couple
of times to
-Original Message-
From: Eugen Block
Sent: Wednesday, March 6, 2024 10:47
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck
There was another issue when having more than two MGRs, maybe you're
hitting that (https://tracker.ceph.com/issues/57675,
https
There was another issue when having more than two MGRs, maybe you're
hitting that (https://tracker.ceph.com/issues/57675,
https://github.com/ceph/ceph/pull/48258). I believe my workaround was
to set the global config to a newer image (target version) and then
deployed a new mgr.
Quoting
Hi,
a couple of things.
First, is there any specific reason why you're upgrading from .1 to
.2? Why not directly to .15? It seems unnecessary and you're risking
upgrading to a "bad" version (I believe it was 16.2.7) if you're
applying every minor release. Or why not upgrade to Quincy or
Hi,
I've checked, checked, and checked again that the individual config
files all point towards the correct ip subnet for the monitors, and
I cannot find any trace of the old subnet's ip address in any config
file (that I can find).
what are those "individual config files"? The ones
one for both.
On Tue, Mar 5, 2024 at 8:26 AM Eugen Block wrote:
It seems to be an issue with the service type (in this case "mon"),
it's not entirely "broken", with the node-exporter it works:
quincy-1:~ # cat node-exporter.yaml
service_type: node-exporter
service_name: nod
extra_entrypoint_args:
-
"--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2"
quincy-1:~ # ceph orch apply -i node-exporter.yaml
Scheduled node-exporter update...
I'll keep looking... unless one of the devs is reading this thread and
finds it quicker.
Quoting Eugen Blo
Oh, you're right. I just checked on Quincy as well and it failed with
the same error message. For pacific it still works. I'll check for
existing tracker issues.
Quoting Robert Sander:
Hi,
On 3/5/24 08:57, Eugen Block wrote:
extra_entrypoint_args:
-
'--mon-rocksdb-options
", but it seems that this
option doesn't have any effect at all.
/Z
On Tue, 5 Mar 2024 at 09:58, Eugen Block wrote:
Hi,
> 1. RocksDB options, which I provided to each mon via their configuration
> files, got overwritten during mon redeployment and I had to re-add
> mon_rocksdb_option
Hi,
1. RocksDB options, which I provided to each mon via their configuration
files, got overwritten during mon redeployment and I had to re-add
mon_rocksdb_options back.
IIRC, you didn't use the extra_entrypoint_args for that option but
added it directly to the container unit.run file. So
Hi,
yes you can activate existing OSDs [1] as if you reinstalled a server
(for example if the host OS was damaged). I wrote a blog post [2] a
few years ago for an early Octopus version in a virtual lab
environment where I describe a manual procedure to reintroduce
existing OSDs on a new
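For reference, with cephadm the re-activation of such OSDs is usually a
single call of this form (hostname is a placeholder, and I'm going from
memory here):
ceph cephadm osd activate <host>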
help.
Cédric
On 26 Feb 2024, at 10:57, Eugen Block wrote:
Hi,
thanks for the context. Was there any progress over the weekend?
The hanging commands seem to be MGR related, and there's only one
in your cluster according to your output. Can you deploy a second
one manually, then adopt
Maybe this [2] helps, one specific mountpoint is excluded:
mountpoint !~ "/mnt.*"
[2] https://alex.dzyoba.com/blog/prometheus-alerts/
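For illustration, such an exclusion usually sits directly in the alert
expression, e.g. something along these lines (metric and threshold are
just an example, not taken from the linked post):
node_filesystem_avail_bytes{mountpoint!~"/mnt.*"} / node_filesystem_size_bytes{mountpoint!~"/mnt.*"} < 0.10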
Quoting Eugen Block:
Hi,
let me refer you to my response to a similar question [1]. I don't
have a working example how to exclude some m
Hi,
let me refer you to my response to a similar question [1]. I don't
have a working example of how to exclude some mountpoints, but it should
be possible to modify existing rules.
Regards,
Eugen
[1]
if things look better.
But would it then use corrupted data on osd 269 to rebuild?
-
Kai Stian Olstad
On 26.02.2024 10:19, Eugen Block wrote:
Hi,
I think your approach makes sense. But I'm wondering if moving only
the problematic PGs to different OSDs could have an effect as
well. I
Hi,
I see these messages regularly but haven't looked too deep into the
cause. It appears to be related to short interruptions like log
rotation or a mgr failover. I think they're harmless.
Regards,
Eugen
Quoting Denis Polom:
Hi,
running Ceph Quincy 17.2.7 on Ubuntu Focal LTS,
Hi,
I'm not a dev, but as I understand it, the command would issue a 'pg
repair' on each (primary) PG of the provided pool. It might be useful
if you have multiple (or even many) inconsistent PGs in a pool. But
I've never used that and this is just a hypothesis.
Regards,
Eugen
Quoting
Hi,
thanks for the context. Was there any progress over the weekend? The
hanging commands seem to be MGR related, and there's only one in your
cluster according to your output. Can you deploy a second one
manually, then adopt it with cephadm? Can you add 'ceph versions' as
well?
Quoting
Hi,
I think your approach makes sense. But I'm wondering if moving only
the problematic PGs to different OSDs could have an effect as well. I
assume that moving the 2 PGs is much quicker than moving all BUT those
2 PGs. If that doesn't work you could still fall back to draining the
Hi,
no, you can't go directly from O to R, you need to upgrade to Q first.
Technically it might be possible but it's not supported.
Your approach to first adopt the cluster by cephadm is my preferred
way as well.
Regards,
Eugen
Zitat von "Alex Hussein-Kershaw (HE/HIM)" :
Hi ceph-users,
that has been installed is 17.2.5. But this method does not
work at all.
On Fri, Feb 23, 2024, 10:23 AM Eugen Block wrote:
Which ceph version is this? In a small Reef test cluster this works as
expected:
# cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
--zap-osds --force
Using
the logs ?
Best Regards,
Edouard FAZENDA
Technical Support
Chemin du Curé-Desclouds 2, CH-1226 THONEX +41 (0)22 869 04 40
www.csti.ch
-Original Message-----
From: Eugen Block
Sent: Friday, February 23, 2024 12:50
To: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS in ReadOnly and 2 M
Hi,
the mds log should contain information why it goes into read-only
mode. Just a few weeks ago I helped a user with a broken CephFS (MDS
went into read-only mode because of missing objects in the journal).
Can you check the journal status:
# cephfs-journal-tool --rank=cephfs:0
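The command above is cut off; the full invocation is usually of this
form (subcommand assumed, not taken from the original message):
# cephfs-journal-tool --rank=cephfs:0 journal inspect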
Which ceph version is this? In a small Reef test cluster this works as
expected:
# cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
--zap-osds --force
Using recent ceph image
This seems to be the relevant stack trace:
---snip---
Feb 23 15:18:39 cephgw02 conmon[2158052]: debug -1>
2024-02-23T08:18:39.609+ 7fccc03c0700 -1
You still haven't provided any details (logs) of what happened. The
short excerpt from yesterday isn't useful as it only shows the startup
of the daemon.
Quoting nguyenvand...@baoviet.com.vn:
Could you please help me understand the volume status "recovering"?
What is it? And do we need to
ter status (ceph -s)? And maybe write the entire query output to
a file and attach it?
[2] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogPG.cc#L12407
[3] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogScrub.cc#L54
Quoting Cedric:
On Thu, Feb 22, 2024 at 12:37 PM E
e of "ceph pg_mark_unfound_lost revert"
action, but we wonder if there is a risk of data loss.
On Thu, Feb 22, 2024 at 11:50 AM Eugen Block wrote:
I found a config to force scrub invalid PGs, what is your current
setting on that?
ceph config get osd osd_scrub_invalid_stats
true
The config referen
;: false,
"manifest_stats_invalid": false,
I also provide again the cluster information that was lost in the previous
missed reply-all. Don't hesitate to ask for more if needed, I would be
glad to provide it.
Cédric
On Thu, Feb 22, 2024 at 11:04 AM Eugen Block wrote:
Hm, I won
If it crashes after two minutes, that's the time window to look at.
Restart the mds daemon and capture everything after that until the
crash.
Quoting nguyenvand...@baoviet.com.vn:
the log is too long, could you please guide me on how to grep/filter
the important things in the logs?
Hi,
just responding to the last questions:
- After the bootstrap, the Web interface was accessible :
- How can I access the wizard page again? If I don't use it the first
time I could not find another way to get it.
I don't know how to recall the wizard, but you should be able
migrated from HDD/SSD to NVME a while ago but
tiering remains, unfortunately.
So actually we are trying to understand the root cause
On Tue, Feb 20, 2024 at 1:43 PM Eugen Block wrote:
Please don't drop the list from your response.
The first question coming to mind is, why do you have a cache
There are a couple of ways; find your MDS daemon with:
ceph fs status -> should show you the to-be-active MDS
On that host run:
cephadm logs --name mds.{MDS}
or alternatively:
cephadm ls --no-detail | grep mds
journalctl -u ceph-{FSID}@mds.{MDS} --no-pager > {MDS}.log
Quoting