[ceph-users] Re: which grafana version to use with 17.2.x ceph version

2024-04-29 Thread Eugen Block
Hi, cephadm stores a local copy of the cephadm binary in /var/lib/ceph/{FSID}/cephadm.{DIGEST}: quincy-1:~ # ls -lrt /var/lib/ceph/{FSID}/cephadm.* -rw-r--r-- 1 root root 350889 26. Okt 2023 /var/lib/ceph/{FSID}/cephadm.f6868821c084cd9740b59c7c5eb59f0dd47f6e3b1e6fecb542cb44134ace8d78
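On Quincy that stored file is still a plain Python script, so the default monitoring images pinned by that cephadm build can be read straight out of it. A minimal sketch, assuming the copy is present; FSID and digest are placeholders and the constant names may differ between releases:

  grep -E 'DEFAULT_[A-Z_]*IMAGE' /var/lib/ceph/{FSID}/cephadm.{DIGEST}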

[ceph-users] Re: Impact of large PG splits

2024-04-29 Thread Eugen Block
there will be soon some more remapping. :-) So I would consider this thread as closed, all good. Zitat von Eugen Block : No, we didn’t change much, just increased the max pg per osd to avoid warnings and inactive PGs in case a node would fail during this process. And the max backfills

[ceph-users] Re: MDS crash

2024-04-28 Thread Eugen Block
Hi, can you share the current 'ceph status'? Do you have any inconsistent PGs or something? What are the cephfs data pool's min_size and size? Zitat von Alexey GERASIMOV : Colleagues, thank you for the advice to check the operability of MGRs. In fact, it is strange also: we checked our

[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

2024-04-27 Thread Eugen Block
" in method 1 and "migrating PGs" in method 2? I think method 1 must read the OSD to be removed. Otherwise, we would not see slow ops warning. Does method 2 not involve reading this OSD? Thanks, Mary On Fri, Apr 26, 2024 at 5:15 AM Eugen Block wrote: > Hi, > > if you rem

[ceph-users] Re: rbd-mirror get status updates quicker

2024-04-27 Thread Eugen Block
Hi, I didn’t find any other config options other than you already did. Just wanted to note that I did read your message. :-) Maybe one of the Devs can comment. Zitat von Stefan Kooman : Hi, We're testing with rbd-mirror (mode snapshot) and try to get status updates about snapshots as fast

[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

2024-04-26 Thread Eugen Block
Hi, if you remove the OSD this way, it will be drained. Which means that it will try to recover PGs from this OSD, and in case of hardware failure it might lead to slow requests. It might make sense to forcefully remove the OSD without draining: - stop the osd daemon - mark it as out -
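A sketch of how such a forced removal can look on a cephadm-managed cluster; OSD id 3, the host and the device path are placeholders, not values from the thread:

  ceph orch daemon stop osd.3              # stop the failing daemon
  ceph osd out 3                           # PGs now recover from the remaining replicas
  # once the cluster is healthy again:
  ceph osd purge 3 --yes-i-really-mean-it  # removes the CRUSH entry, auth key and OSD id
  ceph orch device zap <host> /dev/sdX --force   # wipe the disk so it can be reused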

[ceph-users] Re: MDS crash

2024-04-26 Thread Eugen Block
Hi, it's unlikely that all OSDs fail at the same time, it seems like a network issue. Do you have an active MGR? Just a couple of days ago someone reported incorrect OSD stats because no MGR was up. Although your 'ceph health detail' output doesn't mention that, there are still issues when

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Eugen Block
mon_osd_nearfull_ratio temporarily? Frédéric. - On 25 Apr 24, at 12:35, Eugen Block ebl...@nde.ag wrote: For those interested, just a short update: the split process is approaching its end, two days ago there were around 230 PGs left (the target is 4096 PGs). So far there were no complaints, no cluster

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Eugen Block
increasing osd_max_backfills to any value higher than 2-3 will not help much with the recovery/backfilling speed. Either way, you'll have to be patient. :-) Cheers, Frédéric. - On 10 Apr 24, at 12:54, Eugen Block ebl...@nde.ag wrote: Thank you for the input! We started the split with max
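For reference, the knobs discussed in this thread map to commands like the following; a sketch only, the pool name and pg_num target are placeholders:

  ceph config set osd osd_max_backfills 2               # per-OSD backfill concurrency
  ceph config set mgr target_max_misplaced_ratio 0.05   # cap on misplaced PGs while pg_num changes (default 5%)
  ceph osd pool set <pool> pg_num 4096                  # triggers the actual split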

[ceph-users] Re: Cephadm stacktrace on copying ceph.conf

2024-04-25 Thread Eugen Block
Hi, I saw something like this a couple of weeks ago on a customer cluster. I'm not entirely sure, but this was either due to (yet) missing or wrong cephadm ssh config or a label/client-keyring management issue. If this is still an issue I would recommend to check the configured keys to be

[ceph-users] Re: Reconstructing an OSD server when the boot OS is corrupted

2024-04-24 Thread Eugen Block
In addition to Nico's response, three years ago I wrote a blog post [1] about that topic, maybe that can help as well. It might be a bit outdated, what it definitely doesn't contain is this command from the docs [2] once the server has been re-added to the host list: ceph cephadm osd
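The truncated command at the end is presumably the OSD activation from the docs; a sketch, with <host> as a placeholder:

  ceph cephadm osd activate <host>   # scan the host for existing OSD LVs and start them under cephadm again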

[ceph-users] Re: Latest Doco Out Of Date?

2024-04-24 Thread Eugen Block
possible to implement a modify operation in the future without breaking stuff. And you can save time on the documentation, because it works like other stuff. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Eugen Bl

[ceph-users] Re: stretched cluster new pool and second pool with nvme

2024-04-24 Thread Eugen Block
Oh, I see. Unfortunately, I don't have a cluster in stretch mode so I can't really test that. Thanks for pointing to the tracker. Zitat von Stefan Kooman : On 23-04-2024 14:40, Eugen Block wrote: Hi, whats the right way to add another pool? create pool with 4/2 and use the rule

[ceph-users] Re: Latest Doco Out Of Date?

2024-04-24 Thread Eugen Block
Hi, I believe the docs [2] are okay, running 'ceph fs authorize' will overwrite the existing caps, it will not add more caps to the client: Capabilities can be modified by running fs authorize only in the case when read/write permissions must be changed. If a client already has a

[ceph-users] Re: stretched cluster new pool and second pool with nvme

2024-04-23 Thread Eugen Block
Hi, what's the right way to add another pool? Create a pool with 4/2 and use the rule for stretched mode, finished? The existing pools were automatically set to 4/2 after "ceph mon enable_stretch_mode". If that is what you require, then yes, it's as easy as that. Although I haven't played

[ceph-users] Re: rbd-mirror failed to query services: (13) Permission denied

2024-04-23 Thread Eugen Block
I'm not entirely sure if I ever tried it with the rbd-mirror user instead of admin user, but I see the same error message on 17.2.7. I assume that it's not expected, I think a tracker issue makes sense. Thanks, Eugen Zitat von Stefan Kooman : Hi, We are testing rbd-mirroring. There seems

[ceph-users] Re: Stuck in replay?

2024-04-22 Thread Eugen Block
IIRC, you have 8 GB configured for the mds cache memory limit, and it doesn’t seem to be enough. Does the host run into oom killer as well? But it’s definitely a good approach to increase the cache limit (try 24 GB if possible since it’s trying to use at least 19 GB) on a host with enough

[ceph-users] Re: RGWs stop processing requests after upgrading to Reef

2024-04-22 Thread Eugen Block
t have a any clients connected). Zitat von Eugen Block : Hi, I don't see a reason why Quincy rgw daemons shouldn't work with a Reef cluster. It would basically mean that you have a staggered upgrade [1] running and didn't upgrade RGWs yet. It should also work to just downgrade them, e

[ceph-users] Re: RGWs stop processing requests after upgrading to Reef

2024-04-22 Thread Eugen Block
Hi, I don't see a reason why Quincy rgw daemons shouldn't work with a Reef cluster. It would basically mean that you have a staggered upgrade [1] running and didn't upgrade RGWs yet. It should also work to just downgrade them, either by providing a different default image, then redeploy
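A sketch of both options mentioned here; the image tags are examples and the rgw daemon name is hypothetical, adjust to your cluster:

  # staggered upgrade, touching only the rgw daemons
  ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.2 --daemon-types rgw
  # or redeploy a single rgw with a specific (older) image
  ceph orch daemon redeploy rgw.myzone.host1.abcdef quay.io/ceph/ceph:v17.2.7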

[ceph-users] Re: MDS crash

2024-04-22 Thread Eugen Block
Right, I just figured from the health output you would have a couple of seconds or so to query the daemon: mds: 1/1 daemons up Zitat von Alexey GERASIMOV : Ok, we will create the ticket. Eugen Block - ceph tell command needs to communicate with the MDS daemon running

[ceph-users] Re: Multiple MDS Daemon needed?

2024-04-22 Thread Eugen Block
Hi Erich, there's no simple answer to your question, as always it depends. Every now and then there are threads about clients misbehaving, especially with the "flush tid" messages. For example, the docs [1] state: The CephFS client-MDS protocol uses a field called the oldest tid to

[ceph-users] Re: MDS crash

2024-04-21 Thread Eugen Block
What’s the output of: ceph tell mds.0 damage ls Zitat von alexey.gerasi...@opencascade.com: Dear colleagues, hope that anybody can help us. The initial point: Ceph cluster v15.2 (installed and controlled by the Proxmox) with 3 nodes based on physical servers rented from a cloud

[ceph-users] Re: Working ceph cluster reports large amount of pgs in state unknown/undersized and objects degraded

2024-04-20 Thread Eugen Block
Hi, there are lots of metrics that are collected by the MGR. So if there is none, the cluster health details can be wrong or outdated. Zitat von Tobias Langner : Hey Alwin, Thanks for your reply, answers inline. I'd assume (w/o pool config) that the EC 2+1 is putting PG as inactive.

[ceph-users] Re: feature_map differs across mon_status

2024-04-17 Thread Eugen Block
Hi, without looking too deep into it, I would just assume that the daemons and clients are connected to different MONs. Or am I misunderstanding your question? Zitat von Joel Davidow : Just curious why the feature_map portions differ in the return of mon_status across a cluster. Below

[ceph-users] Re: crushmap history

2024-04-17 Thread Eugen Block
Hi, I'm not sure if and how that could help, there's a get-crushmap command for the ceph-monstore-tool: [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ show-versions -- --map-type crushmap > show-versions [ceph: root@host1 /]# cat show-versions first committed:
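If the goal is to look at an older crush map, one alternative is to fetch a historic osdmap epoch (as long as the MONs still retain it) and extract the crush map from that; the epoch and file names below are placeholders:

  ceph osd getmap 123456 -o /tmp/osdmap.123456
  osdmaptool /tmp/osdmap.123456 --export-crush /tmp/crush.123456
  crushtool -d /tmp/crush.123456 -o /tmp/crush.123456.txt   # decompile to readable text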

[ceph-users] Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

2024-04-16 Thread Eugen Block
"if something goes wrong, monitors will fail" rather discouraging :-) /Z On Tue, 16 Apr 2024 at 18:59, Eugen Block wrote: Sorry, I meant extra-entrypoint-arguments: https://www.spinics.net/lists/ceph-users/msg79251.html Zitat von Eugen Block : > You can use the extra containe

[ceph-users] Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

2024-04-16 Thread Eugen Block
Sorry, I meant extra-entrypoint-arguments: https://www.spinics.net/lists/ceph-users/msg79251.html Zitat von Eugen Block : You can use the extra container arguments I pointed out a few months ago. Those work in my test clusters, although I haven’t enabled that in production yet

[ceph-users] Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

2024-04-16 Thread Eugen Block
in theory this > should result in lower but much faster compression. > > I hope this helps. My plan is to keep the monitors with the current > settings, i.e. 3 with compression + 2 without compression, until the next > minor release of Pacific to see whether the monitors with compressed

[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

2024-04-16 Thread Eugen Block
Ah, okay, thanks for the hint. In that case what I see is expected. Zitat von Robert Sander : Hi, On 16.04.24 10:49, Eugen Block wrote: I believe I can confirm your suspicion, I have a test cluster on Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1]. Stopping

[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

2024-04-16 Thread Eugen Block
Hm, no, I can't confirm it yet. I missed something in the config, the failover happens and a new nfs daemon is deployed on a different node. But I still see client interruptions so I'm gonna look into that first. Zitat von Eugen Block : Hi, I believe I can confirm your suspicion, I have

[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

2024-04-16 Thread Eugen Block
Hi, I believe I can confirm your suspicion, I have a test cluster on Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1]. Stopping the active NFS daemon doesn't trigger anything, the MGR notices that it's stopped at some point, but nothing else seems to happen. I didn't

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Eugen Block
ou'll have to be patient. :-) Cheers, Frédéric. - On 10 Apr 24, at 12:54, Eugen Block ebl...@nde.ag wrote: Thank you for the input! We started the split with max_backfills = 1 and watched for a few minutes, then gradually increased it to 8. Now it's backfilling with around 180 MB/s, not really much

[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Eugen Block
, but we haven't noticed it before. HTH, Greg. On 10/4/24 14:42, Eugen Block wrote: Thank you, Janne. I believe the default 5% target_max_misplaced_ratio would work as well, we've had good experience with that in the past, without the autoscaler. I just haven't dealt with such large PGs, I've

[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Eugen Block
) and now they finally started to listen. Well, they would still ignore it if it wouldn't impact all kinds of things now. ;-) Thanks, Eugen Zitat von Janne Johansson : Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block : I'm trying to estimate the possible impact when large PGs are splitted

[ceph-users] Re: Impact of large PG splits

2024-04-09 Thread Eugen Block
is a simpler In any case, it’s worth trying and using the maximum capabilities of the upmap Good luck, k [1] https://github.com/digitalocean/pgremapper On 9 Apr 2024, at 11:39, Eugen Block wrote: I'm trying to estimate the possible impact when large PGs are splitted. Here's one example

[ceph-users] Impact of large PG splits

2024-04-09 Thread Eugen Block
Hi, I'm trying to estimate the possible impact when large PGs are split. Here's one example of such a PG: PG_STAT OBJECTS BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG UP 86.3ff 277708 4144030984090 0 3092 3092

[ceph-users] Re: NFS never recovers after slow ops

2024-04-06 Thread Eugen Block
Hi Torkil, I assume the affected OSDs were the ones with slow requests, no? You should still see them in some of the logs (mon, mgr). Zitat von Torkil Svensgaard : On 06-04-2024 18:10, Torkil Svensgaard wrote: Hi Cephadm Reef 18.2.1 Started draining 5 18-20 TB HDD OSDs (DB/WAL on NVMe)

[ceph-users] Re: Issue about execute "ceph fs new"

2024-04-06 Thread Eugen Block
Sorry, I hit send too early, to enable multi-active MDS the full command is: ceph fs flag set enable_multiple true Zitat von Eugen Block : Did you enable multi-active MDS? Can you please share 'ceph fs dump'? Port 6789 is the MON port (v1, v2 is 3300). If you haven't enabled multi-active

[ceph-users] Re: Issue about execute "ceph fs new"

2024-04-06 Thread Eugen Block
Did you enable multi-active MDS? Can you please share 'ceph fs dump'? Port 6789 is the MON port (v1, v2 is 3300). If you haven't enabled multi-active, run: ceph fs flag set enable_multiple Zitat von elite_...@163.com: I tried to remove the default fs then it works, but port 6789 still

[ceph-users] Re: Pacific 16.2.15 `osd noin`

2024-04-04 Thread Eugen Block
Hi, the noin flag seems to be only applicable to existing OSDs which are already in the crushmap. It doesn't apply to newly created OSDs, I could confirm that in a small test cluster with Pacific and Reef. I don't have any insights if that is by design or not, I assume it's supposed to

[ceph-users] Re: [ext] Re: cephadm auto disk preparation and OSD installation incomplete

2024-04-03 Thread Eugen Block
parameter? Or maybe look into speeding up LV creation (if this is the bottleneck)? Thanks a lot, Mathias -Original Message- From: Kuhring, Mathias Sent: Friday, March 22, 2024 5:38 PM To: Eugen Block ; ceph-users@ceph.io Subject: [ceph-users] Re: [ext] Re: cephadm auto disk preparation

[ceph-users] Re: quincy-> reef upgrade non-cephadm

2024-04-03 Thread Eugen Block
Hi, 1. I see no systemd units with the fsid in them, as described in the document above. Both before and after the upgrade, my mon and other units are: ceph-mon@.service, ceph-osd@[N].service etc. Should I be concerned? I think this is expected because it's not containerized, no reason to

[ceph-users] Re: ceph orchestrator for osds

2024-04-03 Thread Eugen Block
Hi, how many OSDs do you have in total? Can you share your osd tree, please? You could check the unit.meta file on each OSD host to see which service it refers to and simply change it according to the service you intend to keep: host1:~ # grep -r service_name
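A sketch of that check; the FSID is a placeholder, and the service_name field shows which spec each OSD daemon is currently bound to:

  grep -r service_name /var/lib/ceph/<FSID>/osd.*/unit.meta
  ceph orch ps --daemon-type osd   # verify which service each daemon reports after the change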

[ceph-users] Re: Issue about execute "ceph fs new"

2024-04-03 Thread Eugen Block
Hi, you need to deploy more daemons because your current active MDS is responsible for the already existing CephFS. There are several ways to do this, I like the yaml file approach and increase the number of MDS daemons, just as an example from a test cluster with one CephFS I added the
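Instead of a yaml spec, the MDS count can also be raised directly on the CLI; a sketch with placeholder names:

  ceph orch apply mds <fs_name> --placement=3   # run three MDS daemons for this service
  ceph fs status                                # confirm a standby is now available before 'ceph fs new'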

[ceph-users] Re: ceph status not showing correct monitor services

2024-04-03 Thread Eugen Block
9945d0514222bd7a83e28b96e8440c630ba6891f", "RepoTags": [ "ceph/daemon:latest-pacific" "RepoDigests": [ "ceph/daemon@sha256:261bbe628f4b438f5bf10de5a8ee05282f2697a5a2cb7ff7668f776b61b9d586" -Original Message- From: Adiga, Anantha Sent:

[ceph-users] Re: Replace block drives of combined NVME+HDD OSDs

2024-04-02 Thread Eugen Block
, but that was it. /Z On Tue, 2 Apr 2024 at 11:00, Eugen Block wrote: Hi, here's the link to the docs [1] how to replace OSDs. ceph orch osd rm --replace --zap [--force] This should zap both the data drive and db LV (yes, its data is useless without the data drive), not sure how it will handle if the data

[ceph-users] Re: Replace block drives of combined NVME+HDD OSDs

2024-04-02 Thread Eugen Block
Hi, here's the link to the docs [1] how to replace OSDs. ceph orch osd rm --replace --zap [--force] This should zap both the data drive and db LV (yes, its data is useless without the data drive), not sure how it will handle if the data drive isn't accessible though. One thing I'm not

[ceph-users] Re: Drained A Single Node Host On Accident

2024-04-02 Thread Eugen Block
Hi, without knowing the whole story, to cancel OSD removal you can run this command: ceph orch osd rm stop Regards, Eugen Zitat von "adam.ther" : Hello, I have a single node host with a VM as a backup MON,MGR,ect. This has caused all OSD's to be pending as 'deleting', can i safely
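For completeness, a sketch with a placeholder OSD id:

  ceph orch osd rm status   # list OSDs currently scheduled for removal/draining
  ceph orch osd rm stop 5   # cancel the scheduled removal of osd.5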

[ceph-users] Re: ceph status not showing correct monitor services

2024-04-02 Thread Eugen Block
- a001s017 - a001s018 # ceph orch ls --service_name=mon --export service_type: mon service_name: mon placement: count: 3 hosts: - a001s016 - a001s017 - a001s018 -Original Message- From: Adiga, Anantha Sent: Monday, April 1, 2024 6:06 PM To: Eugen Block Cc: ceph-users@c

[ceph-users] Re: ceph status not showing correct monitor services

2024-04-01 Thread Eugen Block
n_mon_release 16 (pacific) election_strategy: 1 0: [v2:10.45.128.28:3300/0,v1:10.45.128.28:6789/0] mon.a001s018 1: [v2:10.45.128.27:3300/0,v1:10.45.128.27:6789/0] mon.a001s017 Thank you, Anantha -Original Message- From: Eugen Block Sent: Monday, April 1, 2024 1:10 PM To: ceph-users@ce

[ceph-users] Re: ceph status not showing correct monitor services

2024-04-01 Thread Eugen Block
Maybe it’s just not in the monmap? Can you show the output of: ceph mon dump Did you do any maintenance (apparently OSDs restarted recently) and maybe accidentally removed a MON from the monmap? Zitat von "Adiga, Anantha" : Hi Anthony, Seeing it since last after noon. It is same with

[ceph-users] Re: node-exporter error

2024-03-22 Thread Eugen Block
Hi, what does your node-exporter spec look like? ceph orch ls node-exporter --export If other node-exporter daemons are running in the cluster, what's the difference between them? Do they all have the same container image? ceph config get mgr mgr/cephadm/container_image_node_exporter and

[ceph-users] Re: mon stuck in probing

2024-03-21 Thread Eugen Block
omp rx=0 tx=0)._fault waiting 15.00 2024-03-13T11:14:29.795+0800 7f6980206640 10 RDMAStack polling finally delete qp = 0x5650c54164b0 Eugen Block wrote on Tue, 19 Mar 2024 at 14:50: Hi, there are several existing threads on this list, have you tried to apply those suggestions? A couple of them were: - ceph mgr

[ceph-users] Re: cephadm auto disk preparation and OSD installation incomplete

2024-03-21 Thread Eugen Block
Hi, before getting into that the first thing I would do is to fail the mgr. There have been too many issues where failing over the mgr resolved many of them. If that doesn't help, the cephadm.log should show something useful (/var/log/ceph/cephadm.log on the OSD hosts, I'm still not too

[ceph-users] Re: Adding new OSD's - slow_ops and other issues.

2024-03-19 Thread Eugen Block
Hi Jesper, could you please provide more details about the cluster (the usual like 'ceph osd tree', 'ceph osd df', 'ceph versions')? I find it unusual to enable maintenance mode to add OSDs, is there a specific reason? And why adding OSDs manually with 'ceph orch osd add', why not have a

[ceph-users] Re: mon stuck in probing

2024-03-19 Thread Eugen Block
Hi, there are several existing threads on this list, have you tried to apply those suggestions? A couple of them were: - ceph mgr fail - check time sync (NTP, chrony) - different weights for MONs - Check debug logs Regards, Eugen Zitat von faicker mo : some logs here,

[ceph-users] Re: CephFS space usage

2024-03-19 Thread Eugen Block
It's your pool replication (size = 3): 3886733 (number of objects) * 3 = 11660199 Zitat von Thorne Lawler : Can anyone please tell me what "COPIES" means in this context? [ceph: root@san2 /]# rados df -p cephfs.shared.data POOL_NAME USED  OBJECTS  CLONES    COPIES

[ceph-users] Re: Num values for 3 DC 4+2 crush rule

2024-03-16 Thread Eugen Block
Hi Torkil, Num is 0 but it's not replicated so how does this translate to picking 3 of 3 datacenters? it doesn't really make a difference if replicated or not, it just defines how many crush buckets to choose, so it applies in the same way as for your replicated pool. I am thinking we

[ceph-users] Re: activating+undersized+degraded+remapped

2024-03-16 Thread Eugen Block
Yeah, the whole story would help to give better advice. With EC the default min_size is k+1, you could reduce the min_size to 5 temporarily, this might bring the PGs back online. But the long term fix is to have all required OSDs up and have enough OSDs to sustain an outage. Zitat von
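A sketch of that temporary change; the pool name is a placeholder and the values mirror the thread (min_size 6 = k+1 lowered to 5). Revert as soon as all required OSDs are back up:

  ceph osd pool set <ec_pool> min_size 5   # temporarily allow I/O with one more missing shard
  ceph osd pool set <ec_pool> min_size 6   # restore k+1 afterwards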

[ceph-users] Re: MANY_OBJECT_PER_PG on 1 pool which is cephfs_metadata

2024-03-11 Thread Eugen Block
Hi, I assume you're still on a "low" Pacific release? This was fixed by PR [1][2] and the warning is suppressed when the autoscaler is on; it was merged into Pacific 16.2.8 [3]. I can't answer why the autoscaler doesn't increase the pg_num, but yes, you can increase it yourself. The pool for

[ceph-users] Re: PG damaged "failed_repair"

2024-03-11 Thread Eugen Block
Hi, your ceph version seems to be 17.2.4, not 17.2.6 (which is the locally installed ceph version on the system where you ran the command) Could you add the 'ceph versions' output as well? How is the load on the systems when the recovery starts? The OSDs crash after around 20 minutes,

[ceph-users] Re: PG damaged "failed_repair"

2024-03-10 Thread Eugen Block
sd.3, it crashes in less than a minute 23:49 : After I mark osd.3 "in" and start it again, it comes back online with osd.0 and osd.11 soon after Best regards, Romain Lebbadi-Breteau On 2024-03-08 3:17 a.m., Eugen Block wrote: Hi, can you share more details? Which OSD are you trying

[ceph-users] Re: PG damaged "failed_repair"

2024-03-08 Thread Eugen Block
Hi, can you share more details? Which OSD are you trying to get out, the primary osd.3? Can you also share 'ceph osd df'? It looks like a replicated pool with size 3, can you confirm with 'ceph osd pool ls detail'? Do you have logs from the crashing OSDs when you take out osd.3? Which ceph

[ceph-users] Re: All MGR loop crash

2024-03-07 Thread Eugen Block
Thanks! That's very interesting to know! Zitat von "David C." : some monitors have existed for many years (weight 10), others have been added (weight 0) => https://github.com/ceph/ceph/commit/2d113dedf851995e000d3cce136b69bfa94b6fe0 On Thursday, 7 March 2024, Eugen Block wrote:

[ceph-users] Re: All MGR loop crash

2024-03-07 Thread Eugen Block
I’m curious how the weights might have been changed. I’ve never touched a mon weight myself, do you know how that happened? Zitat von "David C." : Ok, got it : [root@pprod-admin:/var/lib/ceph/]# ceph mon dump -f json-pretty |egrep "name|weigh" dumped monmap epoch 14

[ceph-users] Re: Ceph is constantly scrubbing 1/4 of all PGs and still has PGs not scrubbed in time

2024-03-07 Thread Eugen Block
Are the scrubs eventually reported as "scrub ok" in the OSD logs? How long do the scrubs take? Do you see updated timestamps in the 'ceph pg dump' output (column DEEP_SCRUB_STAMP)? Zitat von thymus_03fumb...@icloud.com: I recently switched from 16.2.x to 18.2.x and migrated to cephadm,

[ceph-users] Re: Ceph Cluster Config File Locations?

2024-03-06 Thread Eugen Block
You're welcome, great that your cluster is healthy again. Zitat von matt...@peregrineit.net: Thanks Eugen, you pointed me in the right direction :-) Yes, the config files I mentioned were the ones in `/var/lib/ceph/{FSID}/mgr.{MGR}/config` - I wasn't aware there were others (well, I

[ceph-users] Re: change ip node and public_network in cluster

2024-03-06 Thread Eugen Block
Hi, your response arrived in my inbox today, so sorry for the delay. I wrote a blog post [1] just two weeks ago for that procedure with cephadm, Zac adopted that and updated the docs [2]. Can you give that a try and let me know if it worked? I repeated that procedure a couple of times to

[ceph-users] Re: Upgrade from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block
-Original Message- From: Eugen Block Sent: Wednesday, 6 March 2024 10:47 To: ceph-users@ceph.io Subject: [ceph-users] Re: Upgrade from 16.2.1 to 16.2.2 pacific stuck There was another issue when having more than two MGRs, maybe you're hitting that (https://tracker.ceph.com/issues/57675, https

[ceph-users] Re: Upgrade from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block
There was another issue when having more than two MGRs, maybe you're hitting that (https://tracker.ceph.com/issues/57675, https://github.com/ceph/ceph/pull/48258). I believe my workaround was to set the global config to a newer image (target version) and then deployed a new mgr. Zitat
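A sketch of that workaround; the image tag and host are placeholders:

  ceph config set global container_image quay.io/ceph/ceph:v16.2.15   # make the target image the default
  ceph orch daemon add mgr <host>                                     # the new mgr comes up with that image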

[ceph-users] Re: Upgrade from 16.2.1 to 16.2.2 pacific stuck

2024-03-06 Thread Eugen Block
Hi, a couple of things. First, is there any specific reason why you're upgrading from .1 to .2? Why not directly to .15? It seems unnecessary and you're risking upgrading to a "bad" version (I believe it was 16.2.7) if you're applying every minor release. Or why not upgrade to Quincy or

[ceph-users] Re: Ceph Cluster Config File Locations?

2024-03-05 Thread Eugen Block
Hi, I've checked, checked, and checked again that the individual config files all point towards the correct ip subnet for the monitors, and I cannot find any trace of the old subnet's ip address in any config file (that I can find). what are those "individual config files"? The ones

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Eugen Block
one for both. On Tue, Mar 5, 2024 at 8:26 AM Eugen Block wrote: It seems to be an issue with the service type (in this case "mon"), it's not entirely "broken", with the node-exporter it works: quincy-1:~ # cat node-exporter.yaml service_type: node-exporter service_name: nod

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Eugen Block
xtra_entrypoint_args:  - "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2" quincy-1:~ # ceph orch apply -i node-exporter.yaml   Scheduled node-exporter update... I'll keep looking... unless one of the devs is reading this thread and finds it quicker. Zitat von Eugen Blo

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Eugen Block
Oh, you're right. I just checked on Quincy as well and it failed with the same error message. For Pacific it still works. I'll check for existing tracker issues. Zitat von Robert Sander : Hi, On 3/5/24 08:57, Eugen Block wrote: extra_entrypoint_args:   - '--mon-rocksdb-options

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Eugen Block
", but it seems that this option doesn't have any effect at all. /Z On Tue, 5 Mar 2024 at 09:58, Eugen Block wrote: Hi, > 1. RocksDB options, which I provided to each mon via their configuration > files, got overwritten during mon redeployment and I had to re-add > mon_rocksdb_option

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-04 Thread Eugen Block
Hi, 1. RocksDB options, which I provided to each mon via their configuration files, got overwritten during mon redeployment and I had to re-add mon_rocksdb_options back. IIRC, you didn't use the extra_entrypoint_args for that option but added it directly to the container unit.run file. So

[ceph-users] Re: Renaming an OSD node

2024-02-29 Thread Eugen Block
Hi, yes you can activate existing OSDs [1] as if you reinstalled a server (for example if the host OS was damaged). I wrote a blog post [2] a few years ago for an early Octopus version in a virtual lab environment where I describe a manual procedure to reintroduce existing OSDs on a new

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-28 Thread Eugen Block
help. Cédric On 26 Feb 2024, at 10:57, Eugen Block wrote: Hi, thanks for the context. Was there any progress over the weekend? The hanging commands seem to be MGR related, and there's only one in your cluster according to your output. Can you deploy a second one manually, then adopt

[ceph-users] Re: Possible to tune Full Disk warning ??

2024-02-28 Thread Eugen Block
Maybe this [2] helps, one specific mountpoint is excluded: mountpoint !~ "/mnt.*" [2] https://alex.dzyoba.com/blog/prometheus-alerts/ Zitat von Eugen Block : Hi, let me refer you to my response to a similar question [1]. I don't have a working example how to exclude some m

[ceph-users] Re: Possible to tune Full Disk warning ??

2024-02-28 Thread Eugen Block
Hi, let me refer you to my response to a similar question [1]. I don't have a working example of how to exclude some mountpoints, but it should be possible to modify existing rules. Regards, Eugen [1]

[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"

2024-02-28 Thread Eugen Block
if things look better. But would it then use corrupted data on osd 269 to rebuild. - Kai Stian Olstad On 26.02.2024 10:19, Eugen Block wrote: Hi, I think your approach makes sense. But I'm wondering if moving only the problematic PGs to different OSDs could have an effect as well. I

[ceph-users] Re: ceph-mgr client.0 error registering admin socket command: (17) File exists

2024-02-26 Thread Eugen Block
Hi, I see these messages regularly but haven't looked to deep into the cause. It appears to be related to short interruptions like log rotation or a mgr failover. I think they're harmless. Regards, Eugen Zitat von Denis Polom : Hi, running Ceph Quincy 17.2.7 on Ubuntu Focal LTS,

[ceph-users] Re: What exactly does the osd pool repair funtion do?

2024-02-26 Thread Eugen Block
Hi, I'm not a dev, but as I understand it, the command would issue a 'pg repair' on each (primary) PG of the provided pool. It might be useful if you have multiple (or even many) inconsistent PGs in a pool. But I've never used that and this is just a hypothesis. Regards, Eugen Zitat von

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-26 Thread Eugen Block
Hi, thanks for the context. Was there any progress over the weekend? The hanging commands seem to be MGR related, and there's only one in your cluster according to your output. Can you deploy a second one manually, then adopt it with cephadm? Can you add 'ceph versions' as well? Zitat

[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"

2024-02-26 Thread Eugen Block
Hi, I think your approach makes sense. But I'm wondering if moving only the problematic PGs to different OSDs could have an effect as well. I assume that moving the 2 PGs is much quicker than moving all BUT those 2 PGs. If that doesn't work you could still fall back to draining the

[ceph-users] Re: Is a direct Octopus to Reef Upgrade Possible?

2024-02-26 Thread Eugen Block
Hi, no, you can't go directly from O to R, you need to upgrade to Q first. Technically it might be possible but it's not supported. Your approach to first adopt the cluster by cephadm is my preferred way as well. Regards, Eugen Zitat von "Alex Hussein-Kershaw (HE/HIM)" : Hi ceph-users,

[ceph-users] Re: cephadm purge cluster does not work

2024-02-23 Thread Eugen Block
that has been installed is 17.2.5. But this method does not work at all. On Fri, Feb 23, 2024, 10:23 AM Eugen Block wrote: Which ceph version is this? In a small Reef test cluster this works as expected: # cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a --zap-osds --force Using

[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Eugen Block
the logs? Best Regards, Edouard FAZENDA Technical Support Chemin du Curé-Desclouds 2, CH-1226 THONEX +41 (0)22 869 04 40 www.csti.ch -Original Message----- From: Eugen Block Sent: Friday, 23 February 2024 12:50 To: ceph-users@ceph.io Subject: [ceph-users] Re: MDS in ReadOnly and 2 M

[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Eugen Block
Hi, the mds log should contain information why it goes into read-only mode. Just a few weeks ago I helped a user with a broken CephFS (MDS went into read-only mode because of missing objects in the journal). Can you check the journal status: # cephfs-journal-tool --rank=cephfs:0
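The truncated command presumably continues with a journal inspection; a sketch with a placeholder file system name (take a journal export before attempting any repair):

  cephfs-journal-tool --rank=<fs_name>:0 journal inspect
  cephfs-journal-tool --rank=<fs_name>:0 journal export /root/mds.0.journal.bin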

[ceph-users] Re: cephadm purge cluster does not work

2024-02-23 Thread Eugen Block
Which ceph version is this? In a small Reef test cluster this works as expected: # cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a --zap-osds --force Using recent ceph image

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread Eugen Block
This seems to be the relevant stack trace: ---snip--- Feb 23 15:18:39 cephgw02 conmon[2158052]: debug -1> 2024-02-23T08:18:39.609+ 7fccc03c0700 -1

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread Eugen Block
You still haven't provided any details (logs) of what happened. The short excerpt from yesterday isn't useful as it only shows the startup of the daemon. Zitat von nguyenvand...@baoviet.com.vn: Could you pls help me explain the status of volume: recovering ? what is it ? and do we need to

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Eugen Block
ter status (ceph -s)? And maybe attach the entire query output to a file and attach it? [2] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogPG.cc#L12407 [3] https://github.com/ceph/ceph/blob/v16.2.13/src/osd/PrimaryLogScrub.cc#L54 Zitat von Cedric : On Thu, Feb 22, 2024 at 12:37 PM E

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Eugen Block
e of "ceph pg_mark_unfound_lost revert" action, but we wonder if there is a risk of data loss. On Thu, Feb 22, 2024 at 11:50 AM Eugen Block wrote: I found a config to force scrub invalid PGs, what is your current setting on that? ceph config get osd osd_scrub_invalid_stats true The config referen

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Eugen Block
;: false, "manifest_stats_invalid": false, I also provide again the cluster information that was lost in the previous missed reply-all. Don't hesitate to ask for more if needed, I would be glad to provide it. Cédric On Thu, Feb 22, 2024 at 11:04 AM Eugen Block wrote: Hm, I won

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-22 Thread Eugen Block
If it crashes after two minutes you have your time window to look for. Restart the mds daemon and capture everything after that until the crash. Zitat von nguyenvand...@baoviet.com.vn: it suck too long log, could you pls guide me how to grep/filter important things in logs ?

[ceph-users] Re: Some questions about cephadm

2024-02-22 Thread Eugen Block
Hi, just responding to the last questions: - After the bootstrap, the Web interface was accessible : - How can I access the wizard page again? If I don't use it the first time I could not find another way to get it. I don't know how to recall the wizard, but you should be able

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Eugen Block
migrated from HDD/SSD to NVME a while ago but tiering remains, unfortunately. So actually we are trying to understand the root cause On Tue, Feb 20, 2024 at 1:43 PM Eugen Block wrote: Please don't drop the list from your response. The first question coming to mind is, why do you have a cache

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-22 Thread Eugen Block
There a couple of ways, find your MDS daemon with: ceph fs status -> should show you the to-be-active MDS On that host run: cephadm logs --name mds.{MDS} or alternatively: cephadm ls --no-detail | grep mds journalctl -u ceph-{FSID}@mds.{MDS} --no-pager > {MDS}.log Zitat von
