[ceph-users] Re: MDS stuck in rejoin

2023-07-20 Thread Xiubo Li
On 7/20/23 22:09, Frank Schilder wrote: Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still

[ceph-users] Re: mds terminated

2023-07-20 Thread Venky Shankar
On Thu, Jul 20, 2023 at 11:19 PM wrote: > > If any rook-ceph users see the situation that mds is stuck in replay, then > look at the logs of the mds pod. > > When it runs and then terminates repeatedly, check if there is "liveness > probe terminated" error message by typing "kubectl describe

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-20 Thread Anthony D'Atri
I can believe the month timeframe for a cluster with multiple large spinners behind each HBA. I’ve witnessed such personally. > On Jul 20, 2023, at 4:16 PM, Michel Jouvin > wrote: > > Hi Niklas, > > As I said, ceph placement is based on more than fulfilling the failure domain > constraint.

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-20 Thread Michel Jouvin
Hi Niklas, As I said, Ceph placement is based on more than fulfilling the failure domain constraint. This is a core feature of Ceph's design. There is no reason for a rebalancing on a cluster with a few hundred OSDs to last a month. Just before 17 you have to adjust the max backfills parameter
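For reference, a hedged sketch of the knobs being referred to (values are illustrative; on Quincy's mClock scheduler an extra override flag may be needed):

  # Pre-Quincy (or wpq scheduler): adjust backfill/recovery concurrency per OSD
  ceph config set osd osd_max_backfills 2
  ceph config set osd osd_recovery_max_active 2
  # Quincy 17.2.x with mClock ignores these unless overrides are explicitly allowed
  ceph config set osd osd_mclock_override_recovery_settings true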

[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread Anthony D'Atri
Sometimes one can even get away with "ceph osd down 343" which doesn't affect the process. I have had occasions when this goosed peering in a less-intrusive way. I believe it just marks the OSD down in the mons' map, and when that makes it to the OSD, the OSD responds with "I'm not dead yet"
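A minimal sketch of that approach, assuming OSD 343 is the one you want to nudge:

  # Mark the OSD down in the mons' OSD map only; the ceph-osd process keeps running
  ceph osd down 343
  # The daemon should report itself back up within a few seconds
  ceph osd tree | grep -w osd.343
  # Watch the affected PG(s) re-peer
  ceph pg ls peering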

[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-20 Thread Niklas Hambüchen
Thank you both Michel and Christian. Looks like I will have to do the rebalancing eventually. From past experience with Ceph 16 the rebalance will likely take at least a month with my 500 M objects. It seems like a good idea to upgrade to Ceph 17 first as Michel suggests. Unless: I was
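For context, the kind of CRUSH change being discussed looks roughly like this (bucket names are hypothetical):

  # Create a datacenter bucket and move existing racks under it
  ceph osd crush add-bucket dc1 datacenter
  ceph osd crush move dc1 root=default
  ceph osd crush move rack1 datacenter=dc1
  # Expect data movement as soon as the rule's failure domain resolves differently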

[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread Matthew Leonard (BLOOMBERG/ 120 PARK)
Assuming you're running systemd-managed OSDs, you can run the following command on the host that OSD 343 resides on: systemctl restart ceph-osd@343 From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To: ceph-users@ceph.io Subject: [ceph-users] Re: 1 PG stucked in
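If you are not sure which host the OSD lives on, a quick check like this should locate it first (a sketch; the cephadm variant assumes orchestrator-managed daemons):

  # Show the host and CRUSH location of OSD 343
  ceph osd find 343
  # Non-cephadm, systemd-managed OSD, on that host:
  systemctl restart ceph-osd@343
  # cephadm-managed equivalent:
  ceph orch daemon restart osd.343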

[ceph-users] RGWs offline after upgrade to Nautilus

2023-07-20 Thread Ben . Zieglmeier
Hello, We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The upgrade went mostly fine, though now several of our RGWs will not start. One RGW is working fine, the rest will not initialize. They are in a crash loop. This is part of a multisite configuration, and is
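A sketch of how one might capture more detail from a gateway that dies during init (the daemon name is an assumption; adjust to your instance):

  # Run one failing gateway in the foreground with verbose logging to see where init stops
  radosgw -f -n client.rgw.<name> --debug-rgw 20 --debug-ms 1
  # Multisite: confirm the realm/period the gateway expects is actually present
  radosgw-admin period get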

[ceph-users] Re: rgw multisite sync not syncing data, error: RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data log shards

2023-07-20 Thread david . piper
Hey Christian, What does sync look like on the first site? And does restarting the RGW instances on the first site fix up your issues? We saw issues in the past that sound a lot like yours. We've adopted the practice of restarting the RGW instances in the first cluster after deploying a
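For anyone following along, comparing sync state on both sites before and after restarting the RGWs usually looks something like this (a sketch, not from the original thread; the service name is a placeholder):

  # On each zone's RGW host
  radosgw-admin sync status
  radosgw-admin sync error list
  # Restart the gateways (cephadm-managed example)
  ceph orch restart rgw.<service-name>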

[ceph-users] Re: mds terminated

2023-07-20 Thread dxodnd
If any rook-ceph users see the situation where the mds is stuck in replay, then look at the logs of the mds pod. When it runs and then terminates repeatedly, check if there is a "liveness probe terminated" error message by typing "kubectl describe pod -n (namespace) (mds' pod name)". If there is the

[ceph-users] Re: mds terminated

2023-07-20 Thread dxodnd
This issue has been closed. If any rook-ceph users see this: when mds replay takes a long time, look at the logs in the mds pod. If it's going well and then abruptly terminates, try describing the mds pod, and if the liveness probe terminated it, try increasing the threshold of the liveness probe.
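A hedged sketch of that diagnosis for Rook users (namespace, filesystem and pod names are assumptions; the exact CR field depends on the Rook version):

  # Inspect why Kubernetes keeps killing the MDS pod
  kubectl -n rook-ceph get pods -l app=rook-ceph-mds
  kubectl -n rook-ceph describe pod <mds-pod> | grep -A3 Liveness
  # If the probe kills the MDS during a long replay, relax or disable it in the
  # CephFilesystem CR (e.g. a livenessProbe override under metadataServer)
  kubectl -n rook-ceph edit cephfilesystem <fs-name>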

[ceph-users] Re: librbd hangs during large backfill

2023-07-20 Thread fb2cd0fc-933c-4cfe-b534-93d67045a088
We did have a peering storm; we're past that portion of the backfill and still experiencing new instances of rbd volumes hanging. It is for sure not just the peering storm. We still have 22.184% of objects misplaced, with a bunch of PGs left to backfill (around 75k). Our rbd pool is using about

[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread siddhit . renake
What would be the appropriate way to restart the primary OSD in this case (343)?

[ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time

2023-07-20 Thread siddhit . renake
Hello Eugen, The requested details are below. PG ID: 15.28f0 Pool ID: 15 Pool: default.rgw.buckets.data Pool EC Ratio: 8:3 Number of Hosts: 12 ## crush dump for rule ## #ceph osd crush rule dump data_ec_rule { "rule_id": 1, "rule_name": "data_ec_rule", "ruleset": 1, "type":
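To see why the PG stays undersized, querying it directly is usually the quickest check (a sketch using the PG id from above):

  # Which OSDs the PG maps to vs. which are actually in the acting set
  ceph pg map 15.28f0
  # Peering/backfill state and any missing shards
  ceph pg 15.28f0 query | less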

[ceph-users] Re: mds terminated

2023-07-20 Thread dxodnd
I think the rook-ceph mds is not responding to the liveness probe (confirmed by k8s describe on the mds pod). I don't think it's memory, as I don't limit it, and I have the cpu set to 500m per mds, but what direction should I go from here?

[ceph-users] Re: librbd hangs during large backfill

2023-07-20 Thread Jack Hayhurst
We did have a peering storm; we're past that portion of the backfill and still experiencing new instances of rbd volumes hanging. It is for sure not just the peering storm. We still have 22.184% of objects misplaced, with a bunch of PGs left to backfill (around 75k). Our rbd pool is using about
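One way to check whether client I/O is stuck behind specific OSDs rather than the backfill as a whole (a sketch; the OSD id is hypothetical):

  # Any slow/blocked requests reported cluster-wide?
  ceph health detail
  # Inspect in-flight ops on a suspect OSD
  ceph tell osd.123 dump_ops_in_flight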

[ceph-users] Quincy 17.2.6 - Rados gateway crash -

2023-07-20 Thread xadhoom76
Hi, we have a service that is still crashing when the S3 client (Veeam backup) starts to write data. Main log from the rgw service: req 13170422438428971730 0.00886s s3:get_obj WARNING: couldn't find acl header for object, generating default 2023-07-20T14:36:45.331+ 7fa5adb4c700 -1 *** Caught
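A sketch of pulling the full backtrace for a crash like this (the config target naming is an assumption; match it to your RGW instance):

  # Crashes recorded by the crash module since the last archive
  ceph crash ls-new
  ceph crash info <crash-id>
  # Temporarily raise gateway logging to catch the request that triggers it
  ceph config set client.rgw.<instance> debug_rgw 10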

[ceph-users] Re: index object in shard begins with hex 80

2023-07-20 Thread Christopher Durham
Ok, I think I figured this out. First, as I think I wrote earlier, these objects in the ugly namespace begin with "<80>0_", and as such are a "bucket log index" file according to the bucket_index_prefixes[] in cls_rgw.cc. These objects were multiplying, and caused the 'Large omap object'
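For reference, bucket index log ("bilog") entries can be listed and, once any multisite peers have caught up, trimmed (a sketch; the bucket name is a placeholder):

  # Inspect the <80>0_ entries for one bucket
  radosgw-admin bilog list --bucket=<bucket> --max-entries=10
  # Trim them only if sync no longer needs them
  radosgw-admin bilog trim --bucket=<bucket>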

[ceph-users] what is the point of listing "auth: unable to find a keyring on /etc/ceph/ceph.client nfs-ganesha

2023-07-20 Thread Marc
I need some help understanding this. I have configured nfs-ganesha for cephfs using something like this in ganesha.conf: FSAL { Name = CEPH; User_Id = "testing.nfs"; Secret_Access_Key = "AAA=="; } But I constantly have these messages in the ganesha logs, 6x per user_id: auth:
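The warning usually just means libcephfs probes the default keyring paths before falling back to the key in ganesha.conf; giving it a keyring file in one of those paths silences it. A sketch, assuming the user id above:

  # Export the key so libcephfs finds it in a default location
  ceph auth get client.testing.nfs -o /etc/ceph/ceph.client.testing.nfs.keyring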

[ceph-users] Re: Workload that delete 100 M object daily via lifecycle

2023-07-20 Thread Paul JURCO
Enabling debug lc will make the LC run more often, but please keep in mind that it might not respect the expiration time set. By design it will treat the time set in the interval as one day. So, if it runs more often, you will end up removing objects sooner than 365 days (as an example) if set to do so. Please test
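For completeness, the option in question shrinks the LC's notion of a "day" (a sketch; intended for test clusters, and the config section may differ per deployment):

  # Treat 600 seconds as one "day" for lifecycle processing
  ceph config set global rgw_lc_debug_interval 600
  # Watch which buckets are being processed
  radosgw-admin lc list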

[ceph-users] Re: ceph-mgr ssh connections left open

2023-07-20 Thread John Mulligan
On Thursday, July 20, 2023 10:36:02 AM EDT Wyll Ingersoll wrote: > Yes, it is ceph pacific 16.2.11. > > Is this a known issue that is fixed in a more recent pacific update? We're > not ready to move to quincy yet. > > thanks, >Wyllys > To the best of my knowledge there's no fix in

[ceph-users] Re: ceph-mgr ssh connections left open

2023-07-20 Thread Wyll Ingersoll
Yes, it is ceph pacific 16.2.11. Is this a known issue that is fixed in a more recent pacific update? We're not ready to move to quincy yet. thanks, Wyllys From: John Mulligan Sent: Thursday, July 20, 2023 10:30 AM To: ceph-users@ceph.io Cc: Wyll

[ceph-users] Re: ceph-mgr ssh connections left open

2023-07-20 Thread John Mulligan
On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote: > Every night at midnight, our ceph-mgr daemons open up ssh connections to the > other nodes and then leaves them open. Eventually they become zombies. I > cannot figure out what module is causing this or how to turn it off. If >
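To narrow down which mgr module is opening the connections, a quick check like this may help (a sketch; the cephadm orchestrator is the usual ssh user in Pacific):

  # Which modules are enabled, and is cephadm the active orchestrator?
  ceph mgr module ls
  ceph orch status
  # Recent cephadm activity around midnight
  ceph log last 100 info cephadm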

[ceph-users] MDS stuck in rejoin

2023-07-20 Thread Frank Schilder
Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS.
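For anyone hitting the same warning, the offending session can usually be identified before resorting to an MDS failover (a sketch; MDS name and session id are placeholders):

  # Which client id is named in the warning?
  ceph health detail
  # List sessions on the active MDS and find that client
  ceph tell mds.<name> session ls
  # Last resort: evict the stuck session
  ceph tell mds.<name> client evict id=<session-id>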

[ceph-users] Re: cephadm does not redeploy OSD

2023-07-20 Thread Luis Domingues
Here you go. So in the log, when cephadm gets the inventory: Found inventory for host [Device(path=/ Device(path=/dev/nvme2n1, lvs=[{'cluster_fsid': '11b47c57-5e7f-44c0-8b19-ddd801a89435', 'cluster_name': 'ceph', 'db_uuid': 'irQUVH-txAO-fh3p-tkEj-ZoAH-p7lI-HcHOJp', 'name':
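When cephadm sees the device in its inventory but still does not create the OSD, re-checking availability and the applied spec is the usual next step (a sketch):

  # Force a fresh inventory and see whether the device is marked available
  ceph orch device ls --refresh
  # Compare against the OSD service spec cephadm is trying to apply
  ceph orch ls osd --export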