Please share the mds perf dump as requested. We need to understand what's
happening before suggesting anything.
Thanks & Regards,
Kotresh H R
On Fri, May 17, 2024 at 5:35 PM Akash Warkhade wrote:
> @Kotresh Hiremath Ravishankar
> Can you please help with the above?
On Fri, 17 May, 2024, 12:26 pm Akash Warkhade wrote:
Hi Kotresh,
Thanks for the reply.
1) There are no custom configs defined
2) Subtree pinning is not enabled
3) There were no warnings related to RADOS
So I wanted to know: in order to fix this, should we increase the default
mds_cache_memory_limit from 4GB to 6GB or more?
Or is there any other solution for
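For reference, a rough sketch of how that limit could be raised through the centralized config, assuming that is what the cluster uses (6442450944 is simply 6 GiB expressed in bytes):
  ceph config set mds mds_cache_memory_limit 6442450944
  ceph config get mds mds_cache_memory_limit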
Hi,
~6K log segments to be trimmed, that's huge.
1. Are there any custom configs configured on this setup?
2. Is subtree pinning enabled?
3. Are there any warnings w.r.t. RADOS slowness?
4. Please share the mds perf dump to check for latencies and other stuff.
$ ceph tell mds.<id> perf dump
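For what it's worth, a rough sketch of pulling just the log-trimming and memory counters out of that dump (mds.a is a placeholder daemon name; the mds_log / mds_mem section names are taken from recent releases and may differ):
  ceph tell mds.a perf dump | jq '{mds_log: .mds_log, mds_mem: .mds_mem}'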
On 4/30/24 04:05, Erich Weiler wrote:
Hi Xiubo,
Is there any way to possibly get a PR development release we could
upgrade to, in order to test and see if the lock order bug per Bug
#62123 could be the answer? Although I'm not sure that bug has been
fixed yet?
I think you can get the
On 4/21/24 9:39 PM, Xiubo Li wrote:
Hi Erich,
I raised one tracker for this https://tracker.ceph.com/issues/65607.
Currently I haven't figured out what was holding the 'dn->lock' in the
'lookup' request or somewhere else, since there is no debug log.
Hopefully we can get the debug logs, with which we can push this further.
Thanks
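As a side note, a rough sketch of how the requested MDS debug logs are typically captured (the levels are only illustrative and very verbose, so they should be reverted after reproducing the issue):
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1
  # reproduce the stuck request, collect the MDS log, then revert:
  ceph config rm mds debug_mds
  ceph config rm mds debug_ms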
Hi Xiubo,
Never mind, I was wrong; most of the blocked ops were 12 hours old. Ugh.
I restarted the MDS daemon to clear them.
I just reset to having one active MDS instead of two, let's see if that
makes a difference.
I am beginning to think it may be impossible to catch the logs that
matter
Hi Erich,
Two things I need to make clear:
1. Since there is no debug log, I am not very sure my fix PR will
100% fix this.
2. It will take time to get this PR merged upstream, so I
couldn't tell exactly when this PR will be backported to downstream and
then be
On Wed, 10 Apr 2024 at 14:01, Xiubo Li wrote:
> > I assume if this fix is approved and backported it will then appear in
> > like 18.2.3 or something?
> >
> Yeah, it will be backported after being well tested.
>
We believe we are being bitten by this bug too, looking forward to the fix.
thanks.
Or... Maybe the fix will first appear in the "centos-ceph-reef-test"
repo that I see? Is that how RedHat usually does it?
On 4/11/24 10:30, Erich Weiler wrote:
I guess we are specifically using the "centos-ceph-reef" repository, and
it looks like the latest version in that repo is 18.2.2-1.el9s. Will
this fix appear in 18.2.2-2.el9s or something like that? I don't know
how often the release cycle updates the repos...?
On 4/11/24 09:40, Erich
I have raised one PR to fix the lock order issue; if possible please
have a try to see whether it resolves this issue.
That's great! When do you think that will be available?
Thank you! Yeah, this issue is happening every couple of days now. It
just happened again today and I got more MDS dumps.
On 4/10/24 11:48, Erich Weiler wrote:
Does that mean it could be the lock order bug
(https://tracker.ceph.com/issues/62123) as Xiubo suggested?
On 4/8/24 12:32, Erich Weiler wrote:
Ah, I see. Yes, we are already running version 18.2.1 on the server side (we
just installed this cluster a few weeks ago from scratch). So I guess if the
fix has already been backported to that version, then we still have a problem.
Does that mean it
Hi Erich,
On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler wrote:
Hi Xiubo,
Thanks for your logs, and it should be the same issue with
https://tracker.ceph.com/issues/62052, could you try to test with this
fix again ?
This sounds good - but I'm not clear on what I should do? I see a patch
in that tracker page, is that what you are referring to? If so,
Hi Erich,
Thanks for your logs, and it should be the same issue with
https://tracker.ceph.com/issues/62052, could you try to test with this
fix again ?
Please let me know if you can still see this bug; then it should be the
lock order bug, as in https://tracker.ceph.com/issues/62123.
Hi,
we have similar problems from time to time, running Reef on the servers and
the latest Ubuntu 20.04 HWE kernel on the clients.
There are probably two scenarios with slightly different observations:
1. MDS reports slow ops
Some client is holding caps for a certain file / directory and blocks
Hello Erich,
What you are experiencing is definitely a bug - but possibly a client
bug. Not sure. Upgrading Ceph packages on the clients, though, will
not help, because the actual CephFS client is the kernel. You can try
upgrading it to the latest 6.8.x (or, better, trying the same workload
from
Could there be an issue with the fact that the servers (MDS, MGR, MON,
OSD) are running reef and all the clients are running quincy?
I can easily enough get the new reef repo in for all our clients (Ubuntu
22.04) and upgrade the clients to reef if that might help..?
On 3/28/24 3:05 PM, Erich
I asked the user and they said no, no rsync involved. Although I
rsync'd 500TB into this filesystem in the beginning without incident, so
hopefully it's not a big deal here.
I'm asking the user what their workflow does to try and pin this down.
Are there any other known reasons why a slow
Hello Erich,
Does the workload, by any chance, involve rsync? It is unfortunately
well-known for triggering such issues. A workaround is to export the
directory via NFS and run rsync against the NFS mount instead of
directly against CephFS.
On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler wrote:
MDS logs show:
Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
3676.400077 secs
Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]:
mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
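For context, a rough sketch of how the requests behind such warnings are usually listed, via the ops tracker of the MDS named in the log above (run against the admin socket on the MDS host; recent releases accept the same commands through ceph tell as well):
  ceph daemon mds.slugfs.pr-md-02.sbblqq dump_ops_in_flight
  ceph daemon mds.slugfs.pr-md-02.sbblqq dump_blocked_ops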
Wow those are extremely useful commands. Next time this happens I'll be
sure to use them. A quick test shows they work just great!
cheers,
erich
On 3/28/24 11:16 AM, Alexander E. Patrakov wrote:
Hi Erich,
Here is how to map the client ID to some extra info:
ceph tell mds.0 client ls id=99445
Here is how to map inode ID to the path:
ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path
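For example, plugging in the client and inode IDs that appear in the slow-request log further down (client.99375 and 0x100081b9ceb); the jq field path is an assumption based on the session dump format of recent releases:
  ceph tell mds.0 client ls id=99375 | jq -r '.[0].client_metadata.hostname'
  ceph tell mds.0 dump inode 0x100081b9ceb | jq -r .path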
On Fri, Mar 29, 2024 at 1:12 AM Erich Weiler wrote:
Here are some of the MDS logs:
Mar 27 11:58:25 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
log [WRN] : slow request 511.703289 seconds old, received at
2024-03-27T18:49:53.623192+: client_request(client.99375:459393
getattr AsXsFs #0x100081b9ceb 2024-03-27T18:49:53.620806+
On 3/28/24 04:03, Erich Weiler wrote:
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning as such:
# ceph -s
  cluster:
    id:     58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_WARN
            1 MDSs report slow
From: Francois Legrand
Sent: 09 June 2020 22:20:29
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory
exhausted
Hi,
Actually I let the mds
From: Francois Legrand
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
I already had some discussion on the list about this problem. But I
should ask again.
We really lost some objects and there are not enough shards to
reconstruct
From: Francois Legrand
Sent: 08 June 2020 15:27:59
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks again for the hint!
Indeed, I did a
ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests
and it seems that osd 27 is more or less
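A rough sketch of how a suspect OSD like osd.27 could be checked further (osd.27 is taken from the observation above; the second command has to run on the host carrying that OSD):
  ceph osd perf                           # per-OSD commit/apply latency
  ceph daemon osd.27 dump_ops_in_flight   # requests currently stuck on that OSD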
From: Francois Legrand
Sent: 08 June 2020 14:45:13
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi Franck,
Finally I did:
ceph config set global mds_beacon_grace 60
and created /etc/sysctl.d/sysctl-ceph.conf with
vm.min_free_kbytes=4194303
and then
sysctl --system
After
From: Francois Legrand
Sent: 08 June 2020 16:00:28
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory
exhausted
There is no recovery going on
From: Francois Legrand
Sent: 06 June 2020 11:11
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
Thanks for the tip,
I will try that. For now vm.min_free_kbytes = 90112
Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0
but this
Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
Hi Francois,
yes, the beacon grace needs to be higher due to the latency of swap. Not sure
if 60s will do. For this particular recovery operation, you might want to go
much higher (1h) and watch the cluster health closely
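A minimal sketch of temporarily raising the grace as suggested (3600 s mirrors the 1h figure above; the override should be removed once the MDS is stable again):
  ceph config set global mds_beacon_grace 3600
  # after recovery:
  ceph config rm global mds_beacon_grace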
From: Francois Legrand
Sent: 05 June 2020 23:51:04
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
Hi,
Unfortunately adding swap did not solve the problem !
I added
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
I was also wondering if setting mds dump cache after rejoin could help?
On 05/06/2020 at 12:49, Frank Schilder wrote:
Out of interest, I did the same on a mimic cluster a f
int in one of the other threads.
Good luck!
Out of interest, I did the same on a mimic cluster a few months ago, running up
to 5 parallel rsync sessions without any problems. I moved about 120TB. Each
rsync was running on a separate client with its own cache. I made sure that the
sync dirs were all disjoint (no overlap of
. Will take a while, but it will do eventually.
Best regards,
From: Francois Legrand
Sent: 05 June 2020 13:46:03
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mds behind on trimming
Hi,
Thanks for your answer.
I have:
osd_op_queue=wpq
osd_op_queue_cut_off=low
I can try to set osd_op_queue_cut_off to high, but it will be useful
only if the mds gets active, true?
For now, the mds_cache_memory_limit is set to 8589934592 (so 8GB,
which seems reasonable for an mds server
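For reference, a rough sketch of how the cut_off change could be applied cluster-wide, assuming the centralized config is in use (this option normally only takes effect after the OSDs are restarted):
  ceph config set osd osd_op_queue_cut_off high
  ceph config show osd.0 osd_op_queue_cut_off   # verify the value an OSD is running with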