On 7/20/23 22:09, Frank Schilder wrote:
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to
advance oldest client/flush tid". I looked at the client and there was nothing going
on, so I rebooted it. After the client was back, the message was still
On Thu, Jul 20, 2023 at 11:19 PM wrote:
>
> If any rook-ceph users see the situation that mds is stuck in replay, then
> look at the logs of the mds pod.
>
> When it runs and then terminates repeatedly, check if there is a "liveness
> probe terminated" error message by typing "kubectl describe
I can believe the month timeframe for a cluster with multiple large spinners
behind each HBA. I’ve witnessed such personally.
> On Jul 20, 2023, at 4:16 PM, Michel Jouvin
> wrote:
>
> Hi Niklas,
>
> As I said, ceph placement is based on more than fulfilling the failure domain
> constraint.
Hi Niklas,
As I said, ceph placement is based on more than fulfilling the failure
domain constraint. This is a core feature of ceph's design. There is no
reason for a rebalancing on a cluster with a few hundred OSDs to last a
month. Prior to 17 you have to adjust the max backfills parameter
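For pre-Quincy clusters, adjusting those throttles might look roughly like this (a hedged sketch; the values are illustrative, not recommendations, and on Quincy (17) the mClock scheduler manages these limits itself by default):

```shell
# Raise backfill/recovery throttles while rebalancing (illustrative values).
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4

# Watch progress:
ceph -s
ceph pg stat

# Restore the defaults once the rebalance settles:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 3
```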
Sometimes one can even get away with "ceph osd down 343" which doesn't affect
the process. I have had occasions when this goosed peering in a less-intrusive
way. I believe it just marks the OSD down in the mons' map, and when that
makes it to the OSD, the OSD responds with "I'm not dead yet"
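The non-intrusive re-peer trick described here might look like the following (OSD id 343 and PG 15.28f0 are taken from this thread):

```shell
# Mark the OSD down in the monitors' map only; the daemon keeps running,
# re-asserts itself ("I'm not dead yet"), and the PGs it hosts re-peer.
ceph osd down 343

# Verify it came back up, then re-check the stuck PG:
ceph osd tree | grep -w 343
ceph pg ls | grep 15.28f0
```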
Thank you both Michel and Christian.
Looks like I will have to do the rebalancing eventually.
From past experience with Ceph 16 the rebalance will likely take at least a
month with my 500 M objects.
It seems like a good idea to upgrade to Ceph 17 first as Michel suggests.
Unless:
I was
Assuming you're running systemd-managed OSDs, you can run the following command
on the host that OSD 343 resides on.
systemctl restart ceph-osd@343
From: siddhit.ren...@nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To:
ceph-users@ceph.io
Subject: [ceph-users] Re: 1 PG stucked in
Hello,
We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The
upgrade went mostly fine, though now several of our RGWs will not start. One
RGW is working fine, the rest will not initialize. They are on a crash loop.
This is part of a multisite configuration, and is
Hey Christian,
What does sync look like on the first site? And does restarting the RGW
instances on the first site fix up your issues?
We saw issues in the past that sound a lot like yours. We've adopted the
practice of restarting the RGW instances in the first cluster after deploying a
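As a sketch of that practice (the systemd unit name varies by deployment, so treat it as a placeholder):

```shell
# Check multisite sync health from the first site
# (run on a node with RGW admin credentials):
radosgw-admin sync status

# If sync is stuck behind shards, the workaround described above is to
# restart the RGW instances on the first site, e.g.:
systemctl restart "ceph-radosgw@rgw.$(hostname -s).service"  # unit name is deployment-specific
```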
If any rook-ceph users see the situation that mds is stuck in replay, then look
at the logs of the mds pod.
When it runs and then terminates repeatedly, check if there is a "liveness probe
terminated" error message by typing "kubectl describe pod -n (namespace) (mds'
pod name)"
If there is the
This issue has been closed.
If any rook-ceph users see this, when mds replay takes a long time, look at the
logs in mds pod.
If it's going well and then abruptly terminates, try describing the mds pod,
and if the liveness probe terminated, try increasing the threshold of the
liveness probe.
We did have a peering storm, we're past that portion of the backfill and still
experiencing new instances of rbd volumes hanging. It is for sure not just the
peering storm.
We've still got 22.184% of objects misplaced, with a bunch of pgs left to
backfill (like 75k). Our rbd pool is using about
What should be appropriate way to restart primary OSD in this case (343) ?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
Hello Eugen,
Requested details are as below.
PG ID: 15.28f0
Pool ID: 15
Pool: default.rgw.buckets.data
Pool EC Ratio: 8:3
Number of Hosts: 12
## crush dump for rule ##
#ceph osd crush rule dump data_ec_rule
{
"rule_id": 1,
"rule_name": "data_ec_rule",
"ruleset": 1,
"type":
I think the rook-ceph is not responding to the liveness probe (confirmed by k8s
describe mds pod) I don't think it's the memory as I don't limit it, and I have
the cpu set to 500m per mds, but what direction should I go from here?
Hi, we have a service that is still crashing when the S3 client (Veeam Backup)
starts to write data.
Main log from the rgw service:
req 13170422438428971730 0.00886s s3:get_obj WARNING: couldn't find acl
header for object, generating
default
2023-07-20T14:36:45.331+ 7fa5adb4c700 -1 *** Caught
Ok,
I think I figured this out. First, as I think I wrote earlier, these objects in
the ugly namespace begin with "<80>0_", and as such are a "bucket log
index" file according to the bucket_index_prefixes[] in cls_rgw.cc.
These objects were multiplying, and caused the 'Large omap object'
I need some help understanding this. I have configured nfs-ganesha for cephfs
using something like this in ganesha.conf
FSAL {
    Name = CEPH;
    User_Id = "testing.nfs";
    Secret_Access_Key = "AAA==";
}
But I constantly have these messages in the ganesha logs, 6x per user_id
auth:
Enabling debug lc will make the LC run more often, but please mind that it
might not respect the expiration time you set. By design it treats the time
set in the interval as one day.
So, if it runs more often, you will end up removing objects sooner than
365 days (as an example) if set to do so.
Please test
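A hedged sketch of that test setup (600 seconds is an arbitrary example; only do this on a test cluster, since expirations will fire far earlier than the configured days):

```shell
# Shrink the LC "day" for testing (interval in seconds):
ceph config set client.rgw rgw_lc_debug_interval 600

# Inspect and kick lifecycle processing:
radosgw-admin lc list
radosgw-admin lc process

# Remove the override afterwards so expirations honor real days again:
ceph config rm client.rgw rgw_lc_debug_interval
```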
On Thursday, July 20, 2023 10:36:02 AM EDT Wyll Ingersoll wrote:
> Yes, it is ceph pacific 16.2.11.
>
> Is this a known issue that is fixed in a more recent pacific update? We're
> not ready to move to quincy yet.
>
> thanks,
> Wyllys
>
To the best of my knowledge there's no fix in
Yes, it is ceph pacific 16.2.11.
Is this a known issue that is fixed in a more recent pacific update? We're not
ready to move to quincy yet.
thanks,
Wyllys
From: John Mulligan
Sent: Thursday, July 20, 2023 10:30 AM
To: ceph-users@ceph.io
Cc: Wyll
On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leaves them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off. If
>
Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients
failing to advance oldest client/flush tid". I looked at the client and there
was nothing going on, so I rebooted it. After the client was back, the message
was still there. To clean this up I failed the MDS.
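For reference, the inspection and failover steps might look like this (rank 0 is assumed for a single-active MDS):

```shell
# The warning names the offending client id:
ceph health detail

# Map that client id to a host/mount via its sessions:
ceph tell mds.0 session ls

# Last resort used above: fail the active MDS so a standby takes over,
# replays, and rebuilds the session state:
ceph mds fail 0
```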
Here you go.
So on the log when cephadm gets the inventory:
Found inventory for host [Device(path=/
Device(path=/dev/nvme2n1, lvs=[{'cluster_fsid':
'11b47c57-5e7f-44c0-8b19-ddd801a89435', 'cluster_name': 'ceph', 'db_uuid':
'irQUVH-txAO-fh3p-tkEj-ZoAH-p7lI-HcHOJp', 'name':