Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-31 Thread Oliver Freyermuth
Am 01.06.2018 um 02:59 schrieb Yan, Zheng:
> On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
>  wrote:
>> Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
>>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>>>  wrote:
>>>> Hi,
>>>>
>>>> in our case, there's only a single active MDS
>>>> (+1 standby-replay + 1 standby).
>>>> We also get the health warning in case it happens.
>>>>
>>>
>>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>>> warnings in the cluster log? Please send them to me if there were.
>>
>> Yes, indeed, I almost missed them!
>>
>> Here you go:
>>
>> 
>> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : 
>> cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to 
>> respond to capability release
>> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : 
>> cluster [WRN] Health check failed: 1 clients failing to respond to 
>> capability release (MDS_CLIENT_LATE_RELEASE)
>> 
>> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
>> 15745 : cluster [WRN] client.1524813 isn't responding to 
>> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, 
>> sent 63.908382 seconds ago
>> 
>> <repetition of message with increasing delays in between>
>> 
>> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
>> 17169 : cluster [WRN] client.1524813 isn't responding to 
>> mclientcaps(revoke), ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, 
>> sent 15364.240272 seconds ago
>> 
>>
>> After evicting the client, I also get:
>> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : 
>> cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability 
>> release; 1 MDSs report slow requests
>> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : 
>> cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX 
>> failing to respond to capability release
>> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : 
>> cluster [INF] MDS health message cleared (mds.0): 123 slow requests are 
>> blocked > 30 sec
>> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : 
>> cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients 
>> failing to respond to capability release)
>> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : 
>> cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report 
>> slow requests)
>> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : 
>> cluster [INF] Cluster is now healthy
>> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
>> 8 : cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
>> 0x13909d0 but session next is 0x1388af6
>> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
>> 9 : cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
>> 0x13909d1 but session next is 0x1388af6
>> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : 
>> cluster [INF] overall HEALTH_OK
>>
>> Thanks for looking into it!
>>
>> Cheers,
>> Oliver
>>
>>
> 
> I found the cause of your issue: http://tracker.ceph.com/issues/24369

Wow, many thanks! 
I did not yet manage to reproduce the stuck behaviour, since the user who could 
reliably cause it made use of the national holiday around here. 

But the issue seems extremely likely to be exactly that one - quotas are set 
for the directory tree which was affected. 
Let me know if I still should ask him to reproduce and collect the information 
from the client to confirm. 
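
In case it is useful for cross-checking: CephFS quotas are plain vxattrs, so 
confirming that a quota really is set on the affected tree should work with 
something like the following (just a sketch - the path is a placeholder for the 
affected directory on a ceph-fuse mount): 

  getfattr -n ceph.quota.max_bytes /cephfs/path/to/affected/dir
  getfattr -n ceph.quota.max_files /cephfs/path/to/affected/dir

A missing attribute or a value of 0 means no quota is set at that level. 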

Many thanks and cheers,
Oliver

> 
>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>>>> It could be http://tracker.ceph.com/issues/24172
>>>>>
>>>>>
>>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>>>> In my case, I have multiple active MDS (with directory pinning at the 
>>>>>> very
>>>>>> top level), and there would be "Client xxx failing to respond to 
>>>>>> capability
>>>>>> release" health warning every single time that happens.
>>>>>>
>>>

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-31 Thread Yan, Zheng
On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
 wrote:
> Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>>  wrote:
>>> Hi,
>>>
>>> in our case, there's only a single active MDS
>>> (+1 standby-replay + 1 standby).
>>> We also get the health warning in case it happens.
>>>
>>
>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>> warnings in the cluster log? Please send them to me if there were.
>
> Yes, indeed, I almost missed them!
>
> Here you go:
>
> 
> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : 
> cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to 
> respond to capability release
> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : 
> cluster [WRN] Health check failed: 1 clients failing to respond to capability 
> release (MDS_CLIENT_LATE_RELEASE)
> 
> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
> 15745 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), 
> ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds 
> ago
> 
> <repetition of message with increasing delays in between>
> 
> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
> 17169 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), 
> ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 
> seconds ago
> 
>
> After evicting the client, I also get:
> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : 
> cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability 
> release; 1 MDSs report slow requests
> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : 
> cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX 
> failing to respond to capability release
> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : 
> cluster [INF] MDS health message cleared (mds.0): 123 slow requests are 
> blocked > 30 sec
> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : 
> cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients 
> failing to respond to capability release)
> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : 
> cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow 
> requests)
> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : 
> cluster [INF] Cluster is now healthy
> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
> 8 : cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
> 0x13909d0 but session next is 0x1388af6
> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
> 9 : cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
> 0x13909d1 but session next is 0x1388af6
> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : 
> cluster [INF] overall HEALTH_OK
>
> Thanks for looking into it!
>
> Cheers,
> Oliver
>
>

I found the cause of your issue: http://tracker.ceph.com/issues/24369

>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>>> It could be http://tracker.ceph.com/issues/24172
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>>> top level), and there would be "Client xxx failing to respond to 
>>>>> capability
>>>>> release" health warning every single time that happens.
>>>>>
>>>>> 
>>>>> From: ceph-users  on behalf of Yan, 
>>>>> Zheng
>>>>> 
>>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>>> To: Oliver Freyermuth
>>>>> Cc: Ceph Users; Peter Wienemann
>>>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed 
>>>>> to
>>>>> authpin local pins"
>>>>>
>>>>> Single or multiple active MDS? Were there "Client xxx failing to
>>>>> respond to capability release" health warning?
>>>>>
>>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>>  wrote:
>>>>>> Dear Cephalopodians,
>>>>>>
>>>>>> we ju

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Yan, Zheng
On Wed, May 30, 2018 at 5:17 PM, Oliver Freyermuth
 wrote:
> Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
>> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>>  wrote:
>>> Hi,
>>>
>>> in our case, there's only a single active MDS
>>> (+1 standby-replay + 1 standby).
>>> We also get the health warning in case it happens.
>>>
>>
>> Were there "client.xxx isn't responding to mclientcaps(revoke)"
>> warnings in the cluster log? Please send them to me if there were.
>
> Yes, indeed, I almost missed them!
>
> Here you go:
>
> 
> 2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : 
> cluster [WRN] MDS health message (mds.0): Client XXX:XXX failing to 
> respond to capability release
> 2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : 
> cluster [WRN] Health check failed: 1 clients failing to respond to capability 
> release (MDS_CLIENT_LATE_RELEASE)
> 
> 2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
> 15745 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), 
> ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds 
> ago
> 
> <repetition of message with increasing delays in between>
> 
> 2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 
> 17169 : cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), 
> ino 0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 
> seconds ago
> 

The client failed to release Fw. When it happens again, please check
if there are hung osd requests (ceph
--admin-daemon=/var/run/ceph/ceph-client.admin.xxx.asok
objecter_requests)
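
For completeness, on the client host that should look roughly like this (a
sketch only - the exact socket name depends on the client name and pid):

  # locate the ceph-fuse admin socket
  ls /var/run/ceph/ceph-client.*.asok

  # dump any in-flight RADOS requests of that client
  ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests

An empty "ops" list means no OSD request is stuck on the client side.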


>
> After evicting the client, I also get:
> 2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : 
> cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability 
> release; 1 MDSs report slow requests
> 2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : 
> cluster [INF] MDS health message cleared (mds.0): Client XXX:XXX 
> failing to respond to capability release
> 2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : 
> cluster [INF] MDS health message cleared (mds.0): 123 slow requests are 
> blocked > 30 sec
> 2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : 
> cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients 
> failing to respond to capability release)
> 2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : 
> cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow 
> requests)
> 2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : 
> cluster [INF] Cluster is now healthy
> 2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
> 8 : cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
> 0x13909d0 but session next is 0x1388af6
> 2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 
> 9 : cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
> 0x13909d1 but session next is 0x1388af6
> 2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : 
> cluster [INF] overall HEALTH_OK
>
> Thanks for looking into it!
>
> Cheers,
> Oliver
>
>
>>
>>> Cheers,
>>> Oliver
>>>
>>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>>> It could be http://tracker.ceph.com/issues/24172
>>>>
>>>>
>>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>>> top level), and there would be "Client xxx failing to respond to 
>>>>> capability
>>>>> release" health warning every single time that happens.
>>>>>
>>>>> 
>>>>> From: ceph-users  on behalf of Yan, 
>>>>> Zheng
>>>>> 
>>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>>> To: Oliver Freyermuth
>>>>> Cc: Ceph Users; Peter Wienemann
>>>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed 
>>>>> to
>>>>> authpin local pins"
>>>>>
>>>>> Single or multiple active MDS? Were there "Client xxx failing to
>>>>> respond to capability release" health warning?
>>>>>
>>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Oliver Freyermuth
Am 30.05.2018 um 10:37 schrieb Yan, Zheng:
> On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
>  wrote:
>> Hi,
>>
>> in our case, there's only a single active MDS
>> (+1 standby-replay + 1 standby).
>> We also get the health warning in case it happens.
>>
> 
> Were there "client.xxx isn't responding to mclientcaps(revoke)"
> warnings in the cluster log? Please send them to me if there were.

Yes, indeed, I almost missed them!

Here you go:


2018-05-29 12:16:02.491186 mon.mon003 mon.0 10.161.8.40:6789/0 11177 : cluster 
[WRN] MDS health message (mds.0): Client XXX:XXX failing to respond to 
capability release
2018-05-29 12:16:03.401014 mon.mon003 mon.0 10.161.8.40:6789/0 11178 : cluster 
[WRN] Health check failed: 1 clients failing to respond to capability release 
(MDS_CLIENT_LATE_RELEASE)

2018-05-29 12:16:00.567520 mds.mon001 mds.0 10.161.8.191:6800/3068262341 15745 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.908382 seconds ago

<repetition of message with increasing delays in between>

2018-05-29 16:31:00.899416 mds.mon001 mds.0 10.161.8.191:6800/3068262341 17169 
: cluster [WRN] client.1524813 isn't responding to mclientcaps(revoke), ino 
0x1388ae0 pending pAsLsXsFr issued pAsLsXsFrw, sent 15364.240272 seconds ago


After evicting the client, I also get:
2018-05-29 17:00:00.000134 mon.mon003 mon.0 10.161.8.40:6789/0 11293 : cluster 
[WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 
MDSs report slow requests
2018-05-29 17:09:50.964730 mon.mon003 mon.0 10.161.8.40:6789/0 11297 : cluster 
[INF] MDS health message cleared (mds.0): Client XXX:XXX failing to 
respond to capability release
2018-05-29 17:09:50.964767 mon.mon003 mon.0 10.161.8.40:6789/0 11298 : cluster 
[INF] MDS health message cleared (mds.0): 123 slow requests are blocked > 30 sec
2018-05-29 17:09:51.015071 mon.mon003 mon.0 10.161.8.40:6789/0 11299 : cluster 
[INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to 
respond to capability release)
2018-05-29 17:09:51.015154 mon.mon003 mon.0 10.161.8.40:6789/0 11300 : cluster 
[INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2018-05-29 17:09:51.015191 mon.mon003 mon.0 10.161.8.40:6789/0 11301 : cluster 
[INF] Cluster is now healthy
2018-05-29 17:14:26.178321 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 8 
: cluster [WRN]  replayed op client.1495010:32710304,32710299 used ino 
0x13909d0 but session next is 0x1388af6
2018-05-29 17:14:26.178393 mds.mon002 mds.34884 10.161.8.192:6800/2102077019 9 
: cluster [WRN]  replayed op client.1495010:32710306,32710299 used ino 
0x13909d1 but session next is 0x1388af6
2018-05-29 18:00:00.000132 mon.mon003 mon.0 10.161.8.40:6789/0 11304 : cluster 
[INF] overall HEALTH_OK

Thanks for looking into it!

Cheers,
Oliver


> 
>> Cheers,
>> Oliver
>>
>> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>>> It could be http://tracker.ceph.com/issues/24172
>>>
>>>
>>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>>> In my case, I have multiple active MDS (with directory pinning at the very
>>>> top level), and there would be "Client xxx failing to respond to capability
>>>> release" health warning every single time that happens.
>>>>
>>>> ________
>>>> From: ceph-users  on behalf of Yan, 
>>>> Zheng
>>>> 
>>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>>> To: Oliver Freyermuth
>>>> Cc: Ceph Users; Peter Wienemann
>>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>>>> authpin local pins"
>>>>
>>>> Single or multiple active MDS? Were there "Client xxx failing to
>>>> respond to capability release" health warning?
>>>>
>>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>>  wrote:
>>>>> Dear Cephalopodians,
>>>>>
>>>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>>>> behind, for over 2 days.
>>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>>> "currently failed to authpin local pins". Metadata pool usage did grow by 
>>>>> 10
>>>>> GB in those 2 days.
>>>>>
>>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>>> metadata usage is down again, and all stuck requests were processed 
>>>>> quickly.
>>>>

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Yan, Zheng
On Wed, May 30, 2018 at 3:04 PM, Oliver Freyermuth
 wrote:
> Hi,
>
> in our case, there's only a single active MDS
> (+1 standby-replay + 1 standby).
> We also get the health warning in case it happens.
>

Were there "client.xxx isn't responding to mclientcaps(revoke)"
warnings in the cluster log? Please send them to me if there were.
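
(In case it helps locating them: on one of the mon hosts something like

  grep "mclientcaps(revoke)" /var/log/ceph/ceph.log

should turn them up, assuming the default cluster log location.)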

> Cheers,
> Oliver
>
> Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
>> It could be http://tracker.ceph.com/issues/24172
>>
>>
>> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>>> In my case, I have multiple active MDS (with directory pinning at the very
>>> top level), and there would be "Client xxx failing to respond to capability
>>> release" health warning every single time that happens.
>>>
>>> 
>>> From: ceph-users  on behalf of Yan, Zheng
>>> 
>>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>>> To: Oliver Freyermuth
>>> Cc: Ceph Users; Peter Wienemann
>>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>>> authpin local pins"
>>>
>>> Single or multiple active MDS? Were there "Client xxx failing to
>>> respond to capability release" health warning?
>>>
>>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>>  wrote:
>>>> Dear Cephalopodians,
>>>>
>>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>>> behind, for over 2 days.
>>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>>> "currently failed to authpin local pins". Metadata pool usage did grow by 
>>>> 10
>>>> GB in those 2 days.
>>>>
>>>> Rebooting the node to force a client eviction solved the issue, and now
>>>> metadata usage is down again, and all stuck requests were processed 
>>>> quickly.
>>>>
>>>> Is there any idea on what could cause something like that? On the client,
>>>>> there was no CPU load, but many processes waiting for cephfs to respond.
>>>>> Syslog did not yield anything. It only affected one user and his user
>>>> directory.
>>>>
>>>> If there are no ideas: How can I collect good debug information in case
>>>> this happens again?
>>>>
>>>> Cheers,
>>>> Oliver
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>>
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-30 Thread Oliver Freyermuth
Hi,

in our case, there's only a single active MDS
(+1 standby-replay + 1 standby). 
We also get the health warning in case it happens. 

Cheers,
Oliver

Am 30.05.2018 um 03:25 schrieb Yan, Zheng:
> It could be http://tracker.ceph.com/issues/24172
> 
> 
> On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
>> In my case, I have multiple active MDS (with directory pinning at the very
>> top level), and there would be "Client xxx failing to respond to capability
>> release" health warning every single time that happens.
>>
>> 
>> From: ceph-users  on behalf of Yan, Zheng
>> 
>> Sent: Tuesday, 29 May 2018 9:53:43 PM
>> To: Oliver Freyermuth
>> Cc: Ceph Users; Peter Wienemann
>> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
>> authpin local pins"
>>
>> Single or multiple active MDS? Were there "Client xxx failing to
>> respond to capability release" health warning?
>>
>> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Cephalopodians,
>>>
>>> we just had a "lockup" of many MDS requests, and also trimming fell
>>> behind, for over 2 days.
>>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>>> GB in those 2 days.
>>>
>>> Rebooting the node to force a client eviction solved the issue, and now
>>> metadata usage is down again, and all stuck requests were processed quickly.
>>>
>>> Is there any idea on what could cause something like that? On the client,
>>> there was no CPU load, but many processes waiting for cephfs to respond.
>>> Syslog did not yield anything. It only affected one user and his user
>>> directory.
>>>
>>> If there are no ideas: How can I collect good debug information in case
>>> this happens again?
>>>
>>> Cheers,
>>> Oliver
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Linh Vu
That could be it. Every time it happens for me, it is indeed from a non-auth 
MDS.


From: Yan, Zheng 
Sent: Wednesday, 30 May 2018 11:25:59 AM
To: Linh Vu
Cc: Oliver Freyermuth; Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

It could be http://tracker.ceph.com/issues/24172


On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
> In my case, I have multiple active MDS (with directory pinning at the very
> top level), and there would be "Client xxx failing to respond to capability
> release" health warning every single time that happens.
>
> 
> From: ceph-users  on behalf of Yan, Zheng
> 
> Sent: Tuesday, 29 May 2018 9:53:43 PM
> To: Oliver Freyermuth
> Cc: Ceph Users; Peter Wienemann
> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
> authpin local pins"
>
> Single or multiple active MDS? Were there "Client xxx failing to
> respond to capability release" health warning?
>
> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> we just had a "lockup" of many MDS requests, and also trimming fell
>> behind, for over 2 days.
>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>> GB in those 2 days.
>>
>> Rebooting the node to force a client eviction solved the issue, and now
>> metadata usage is down again, and all stuck requests were processed quickly.
>>
>> Is there any idea on what could cause something like that? On the client,
>> there was no CPU load, but many processes waiting for cephfs to respond.
>> Syslog did not yield anything. It only affected one user and his user
>> directory.
>>
>> If there are no ideas: How can I collect good debug information in case
>> this happens again?
>>
>> Cheers,
>> Oliver
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Yan, Zheng
It could be http://tracker.ceph.com/issues/24172


On Wed, May 30, 2018 at 9:01 AM, Linh Vu  wrote:
> In my case, I have multiple active MDS (with directory pinning at the very
> top level), and there would be "Client xxx failing to respond to capability
> release" health warning every single time that happens.
>
> 
> From: ceph-users  on behalf of Yan, Zheng
> 
> Sent: Tuesday, 29 May 2018 9:53:43 PM
> To: Oliver Freyermuth
> Cc: Ceph Users; Peter Wienemann
> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to
> authpin local pins"
>
> Single or multiple active MDS? Were there "Client xxx failing to
> respond to capability release" health warning?
>
> On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> we just had a "lockup" of many MDS requests, and also trimming fell
>> behind, for over 2 days.
>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
>> "currently failed to authpin local pins". Metadata pool usage did grow by 10
>> GB in those 2 days.
>>
>> Rebooting the node to force a client eviction solved the issue, and now
>> metadata usage is down again, and all stuck requests were processed quickly.
>>
>> Is there any idea on what could cause something like that? On the client,
>> there was no CPU load, but many processes waiting for cephfs to respond.
>> Syslog did not yield anything. It only affected one user and his user
>> directory.
>>
>> If there are no ideas: How can I collect good debug information in case
>> this happens again?
>>
>> Cheers,
>> Oliver
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Linh Vu
In my case, I have multiple active MDS (with directory pinning at the very top 
level), and there would be "Client xxx failing to respond to capability 
release" health warning every single time that happens.


From: ceph-users  on behalf of Yan, Zheng 

Sent: Tuesday, 29 May 2018 9:53:43 PM
To: Oliver Freyermuth
Cc: Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

Single or multiple active MDS? Were there "Client xxx failing to
respond to capability release" health warning?

On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
 wrote:
> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell behind, 
> for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
>
> Is there any idea on what could cause something like that? On the client, there 
> was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did not yield anything. It only affected one user and his user directory.
>
> If there are no ideas: How can I collect good debug information in case this 
> happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Yan, Zheng
Single or multiple active MDS? Were there "Client xxx failing to
respond to capability release" health warning?

On Mon, May 28, 2018 at 10:38 PM, Oliver Freyermuth
 wrote:
> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell behind, 
> for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
>
> Is there any idea on what could cause something like that? On the client, there 
> was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did not yield anything. It only affected one user and his user directory.
>
> If there are no ideas: How can I collect good debug information in case this 
> happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-29 Thread Oliver Freyermuth
I get the feeling this is not dependent on the exact Ceph version... 

In our case, I know what the user has done (and he'll not do it again). He 
misunderstood how our cluster works and started 1100 cluster jobs,
all entering the very same directory on CephFS (mounted via ceph-fuse on 38 
machines), all running "make clean; make -j10 install". 
So 1100 processes from 38 clients have been trying to lock / delete / write the 
very same files. 

In parallel, an IDE (eclipse) and an indexing service (zeitgeist...) may have 
accessed the very same directory via nfs-ganesha since the user mounted the 
NFS-exported directory via sshfs into his desktop home directory... 

So I can't really blame CephFS for becoming as unhappy as I would become 
myself. 
However, I would have hoped it would not enter a "stuck" state in which only 
client eviction will help... 
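
For the record, in case someone else hits this: evicting just the stuck session 
from the MDS side (instead of rebooting the whole client node) should also work, 
roughly along these lines (a sketch only - the client id below is simply the one 
from the warnings above, as a placeholder): 

  # list the sessions known to the active MDS and find the stuck client's id
  ceph tell mds.0 client ls

  # evict only that session
  ceph tell mds.0 client evict id=1524813

The ceph-fuse mount on the affected node will usually need a remount afterwards. 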

Cheers,
Oliver


Am 29.05.2018 um 03:26 schrieb Linh Vu:
> I get the opposite behaviour with the same error message "currently failed to 
> authpin local pins". Had a few clients on ceph-fuse 12.2.2 and they ran into 
> those issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed it. 
> The main cluster is on 12.2.4.
> 
> 
> The cause is user's HPC jobs or even just their login on multiple nodes 
> accessing the same files, in a particular way. Doesn't happen to other users. 
> Haven't quite dug into it deep enough as upgrading to 12.2.5 fixed our 
> problem. 
> 
> --
> From: ceph-users  on behalf of Oliver 
> Freyermuth 
> Sent: Tuesday, 29 May 2018 7:29:06 AM
> To: Paul Emmerich
> Cc: Ceph Users; Peter Wienemann
> Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
> authpin local pins"
>  
> Dear Paul,
> 
> Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
>> I encountered the exact same issue earlier today immediately after upgrading 
>> a customer's cluster from 12.2.2 to 12.2.5.
>> I've evicted the session and restarted the ganesha client to fix it, as I 
>> also couldn't find any obvious problem.
> 
> interesting! In our case, the client with the problem (it happened again a 
> few hours later...) always was a ceph-fuse client. Evicting / rebooting the 
> client node helped.
> However, it may well be that the original issue was caused by a Ganesha 
> client, which we also use (and the user in question who complained was 
> accessing files in parallel via NFS and ceph-fuse),
> but I don't have a clear indication of that.
> 
> Cheers,
>     Oliver
> 
>> 
>> Paul
>> 
>> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth > <mailto:freyerm...@physik.uni-bonn.de>>:
>> 
>> Dear Cephalopodians,
>> 
>> we just had a "lockup" of many MDS requests, and also trimming fell 
>>behind, for over 2 days.
>> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
>>"currently failed to authpin local pins". Metadata pool usage did grow by 10 
>>GB in those 2 days.
>> 
>> Rebooting the node to force a client eviction solved the issue, and now 
>>metadata usage is down again, and all stuck requests were processed quickly.
>> 
>> Is there any idea on what could cause something like that? On the 
>> client, there was no CPU load, but many processes waiting for cephfs to respond.
>> Syslog did not yield anything. It only affected one user and his user 
>>directory.
>> 
>> If there are no ideas: How can I collect good debug information in case 
>>this happens again?
>> 
>> Cheers,
>>         Oliver
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com

Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-28 Thread Linh Vu
I get the opposite behaviour with the same error message "currently failed to authpin 
local pins". Had a few clients on ceph-fuse 12.2.2 and they ran into those 
issues a lot (evicting works). Upgrading to ceph-fuse 12.2.5 fixed it. The main 
cluster is on 12.2.4.


The cause is a user's HPC jobs, or even just their logins on multiple nodes, 
accessing the same files in a particular way. It doesn't happen to other users. 
We haven't dug into it deeply enough, as upgrading to 12.2.5 fixed our problem.


From: ceph-users  on behalf of Oliver 
Freyermuth 
Sent: Tuesday, 29 May 2018 7:29:06 AM
To: Paul Emmerich
Cc: Ceph Users; Peter Wienemann
Subject: Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to 
authpin local pins"

Dear Paul,

Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
> I encountered the exact same issue earlier today immediately after upgrading 
> a customer's cluster from 12.2.2 to 12.2.5.
> I've evicted the session and restarted the ganesha client to fix it, as I 
> also couldn't find any obvious problem.

interesting! In our case, the client with the problem (it happened again a few 
hours later...) always was a ceph-fuse client. Evicting / rebooting the client 
node helped.
However, it may well be that the original issue was caused by a Ganesha client, 
which we also use (and the user in question who complained was accessing files 
in parallel via NFS and ceph-fuse),
but I don't have a clear indication of that.

Cheers,
Oliver

>
> Paul
>
> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth  <mailto:freyerm...@physik.uni-bonn.de>>:
>
> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell 
> behind, for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
>
> Is there any idea on what could cause something like that? On the client, 
> there was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did not yield anything. It only affected one user and his user 
> directory.
>
> If there are no ideas: How can I collect good debug information in case 
> this happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-28 Thread Oliver Freyermuth
Dear Paul,

Am 28.05.2018 um 20:16 schrieb Paul Emmerich:
> I encountered the exact same issue earlier today immediately after upgrading 
> a customer's cluster from 12.2.2 to 12.2.5.
> I've evicted the session and restarted the ganesha client to fix it, as I 
> also couldn't find any obvious problem.

interesting! In our case, the client with the problem (it happened again a few 
hours later...) always was a ceph-fuse client. Evicting / rebooting the client 
node helped. 
However, it may well be that the original issue was caused by a Ganesha client, 
which we also use (and the user in question who complained was accessing files 
in parallel via NFS and ceph-fuse),
but I don't have a clear indication of that. 

Cheers,
Oliver

> 
> Paul
> 
> 2018-05-28 16:38 GMT+02:00 Oliver Freyermuth  >:
> 
> Dear Cephalopodians,
> 
> we just had a "lockup" of many MDS requests, and also trimming fell 
> behind, for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status 
> "currently failed to authpin local pins". Metadata pool usage did grow by 10 
> GB in those 2 days.
> 
> Rebooting the node to force a client eviction solved the issue, and now 
> metadata usage is down again, and all stuck requests were processed quickly.
> 
> Is there any idea on what could cause something like that? On the client, 
> there was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did not yield anything. It only affected one user and his user 
> directory.
> 
> If there are no ideas: How can I collect good debug information in case 
> this happens again?
> 
> Cheers,
>         Oliver
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse getting stuck with "currently failed to authpin local pins"

2018-05-28 Thread Paul Emmerich
I encountered the exact same issue earlier today immediately after
upgrading a customer's cluster from 12.2.2 to 12.2.5.
I've evicted the session and restarted the ganesha client to fix it, as I
also couldn't find any obvious problem.
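
For anyone else debugging this: the "currently failed to authpin local pins"
string should also show up in the in-flight op dump on the active MDS, which
helps to find the stuck requests and the client session they belong to. A rough
sketch (the daemon name is just a placeholder):

  # on the host running the active MDS
  ceph daemon mds.mon001 dump_ops_in_flight | grep -B 10 authpin

  # map the client.<id> from those ops back to a session / mount point
  ceph daemon mds.mon001 session ls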

Paul

2018-05-28 16:38 GMT+02:00 Oliver Freyermuth 
:

> Dear Cephalopodians,
>
> we just had a "lockup" of many MDS requests, and also trimming fell
> behind, for over 2 days.
> One of the clients (all ceph-fuse 12.2.5 on CentOS 7.5) was in status
> "currently failed to authpin local pins". Metadata pool usage did grow by
> 10 GB in those 2 days.
>
> Rebooting the node to force a client eviction solved the issue, and now
> metadata usage is down again, and all stuck requests were processed
> quickly.
>
> Is there any idea on what could cause something like that? On the client,
> there was no CPU load, but many processes waiting for cephfs to respond.
> Syslog did not yield anything. It only affected one user and his user
> directory.
>
> If there are no ideas: How can I collect good debug information in case
> this happens again?
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com