Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-22 Thread Andras Pataki
Just to close this thread out: it looks like all of the problems were
related to setting the "mds cache size" option in Luminous instead of
using "mds cache memory limit".  The documentation for "mds cache size"
does say "it is recommended to use mds_cache_memory_limit ...", but in
practice "mds cache size" simply does not work in Luminous the way it
did in Jewel (or does not work at all).  As a result the MDS was trying
to aggressively shrink its cache in our setup.  Since we switched all of
the MDSes over to an 'mds cache memory limit' of 16GB and bounced them,
we have had no performance or cache pressure issues, and as expected
they hover around 22-23GB of RSS.
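
For anyone who finds this thread later, here is a minimal sketch of the
change described above (the daemon id is the one used in this thread; we
simply bounced the MDSes after making it):

    # ceph.conf on the MDS hosts
    [mds]
    # old Jewel-era inode-count knob, no longer used:
    # mds cache size = 4000000
    # new Luminous memory target (16 GiB):
    mds cache memory limit = 17179869184

    # optionally push the new value to a running MDS before restarting it
    ceph tell mds.cephmon00 injectargs '--mds_cache_memory_limit=17179869184'

Note that mds_cache_memory_limit is a target rather than a hard cap, so RSS
settling noticeably above it (the 22-23GB above against a 16GB limit) is
expected.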


Thanks everyone for the help,

Andras


On 01/18/2018 12:34 PM, Patrick Donnelly wrote:

Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki wrote:

Hi John,

Some other symptoms of the problem:  when the MDS has been running for a few
days, it starts looking really busy.  At this time, listing directories
becomes really slow.  An "ls -l" on a directory with about 250 entries takes
about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
Interestingly enough the memory usage seems pretty low (compared to the
allowed cache limit).


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1604408 ceph      20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92 /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup ceph

Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
range.  The same ls -l after the bounce takes about 0.5 seconds.  I
remounted the filesystem before each test to ensure there isn't anything
cached.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     00 ceph      20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55 /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup ceph

Also, I have a crawler that crawls the file system periodically.  Normally
the full crawl runs for about 24 hours, but with the slowing down MDS, now
it has been running for more than 2 days and isn't close to finishing.

The MDS related settings we are running with are:

mds_cache_memory_limit = 17179869184
mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/
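
For reference, a rough sketch of what that looks like in practice (the MDS
id is the one from this thread and the log path is the stock default, so
adjust as needed):

    # raise debugging on the running MDS
    ceph tell mds.cephmon00 injectargs '--debug_mds=20 --debug_ms=1'

    # reproduce the slowness, then turn debugging back down
    ceph tell mds.cephmon00 injectargs '--debug_mds=1 --debug_ms=0'

    # upload the log for the tracker ticket; ceph-post-file prints a tag to paste into the ticket
    ceph-post-file /var/log/ceph/ceph-mds.cephmon00.log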





Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-18 Thread Patrick Donnelly
Hi Andras,

On Thu, Jan 18, 2018 at 3:38 AM, Andras Pataki wrote:
> Hi John,
>
> Some other symptoms of the problem:  when the MDS has been running for a few
> days, it starts looking really busy.  At this time, listing directories
> becomes really slow.  An "ls -l" on a directory with about 250 entries takes
> about 2.5 seconds.  All the metadata is on OSDs with NVMe backing stores.
> Interestingly enough the memory usage seems pretty low (compared to the
> allowed cache limit).
>
>
> PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92 /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup ceph
>
> Once I bounce it (fail it over), the CPU usage goes down to the 10-25%
> range.  The same ls -l after the bounce takes about 0.5 seconds.  I
> remounted the filesystem before each test to ensure there isn't anything
> cached.
>
> PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>      00 ceph  20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55 /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup ceph
>
> Also, I have a crawler that crawls the file system periodically.  Normally
> the full crawl runs for about 24 hours, but with the slowing down MDS, now
> it has been running for more than 2 days and isn't close to finishing.
>
> The MDS related settings we are running with are:
>
> mds_cache_memory_limit = 17179869184
> mds_cache_reservation = 0.10

Debug logs from the MDS at that time would be helpful with `debug mds
= 20` and `debug ms = 1`. Feel free to create a tracker ticket and use
ceph-post-file [1] to share logs.

[1] http://docs.ceph.com/docs/hammer/man/8/ceph-post-file/

-- 
Patrick Donnelly


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-18 Thread Andras Pataki

Hi John,

Some other symptoms of the problem:  when the MDS has been running for a 
few days, it starts looking really busy.  At this time, listing 
directories becomes really slow.  An "ls -l" on a directory with about 
250 entries takes about 2.5 seconds.  All the metadata is on OSDs with 
NVMe backing stores.  Interestingly enough the memory usage seems pretty 
low (compared to the allowed cache limit).



    PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND
1604408 ceph  20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92 
/usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph 
--setgroup ceph


Once I bounce it (fail it over), the CPU usage goes down to the 10-25% 
range.  The same ls -l after the bounce takes about 0.5 seconds.  I 
remounted the filesystem before each test to ensure there isn't anything 
cached.


    PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM TIME+ COMMAND
00 ceph  20   0 6537052 5.864g  18500 S  17.6 2.3   9:23.55 
/usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph 
--setgroup ceph


Also, I have a crawler that crawls the file system periodically. 
Normally the full crawl runs for about 24 hours, but with the slowing 
down MDS, now it has been running for more than 2 days and isn't close 
to finishing.


The MDS related settings we are running with are:

   mds_cache_memory_limit = 17179869184
   mds_cache_reservation = 0.10
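
If it is useful for comparison, the values the active MDS is actually
running with can be confirmed through its admin socket (daemon id taken
from this thread):

    ceph daemon mds.cephmon00 config get mds_cache_memory_limit
    ceph daemon mds.cephmon00 config get mds_cache_reservation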


Andras


On 01/17/2018 01:11 PM, John Spray wrote:

On Wed, Jan 17, 2018 at 3:36 PM, Andras Pataki wrote:

Hi John,

All our hosts are CentOS 7 hosts, the majority are 7.4 with kernel
3.10.0-693.5.2.el7.x86_64, with fuse 2.9.2-8.el7.  We have some hosts that
have slight variations in kernel versions; the oldest ones are a handful of
CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and fuse
2.9.2-7.el7.  I know Red Hat has been backporting lots of stuff, so perhaps
these kernels fall into the category you are describing?

Quite possibly -- this issue was originally noticed on RHEL, so maybe
the relevant bits made it back to CentOS recently.

However, it looks like the fixes for that issue[1,2] are already in
12.2.2, so maybe this is something completely unrelated :-/

The ceph-fuse executable does create an admin command socket in
/var/run/ceph (named something like ceph-client...) that you can drive with
"ceph daemon <name> dump_cache", but the output is extremely verbose
and low level and may not be informative.

John

1. http://tracker.ceph.com/issues/21423
2. http://tracker.ceph.com/issues/22269
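
As a concrete sketch of driving that socket from a client node (the socket
name below is illustrative; the actual names vary per mount):

    # find the ceph-fuse admin socket(s) on a client host
    ls /var/run/ceph/ceph-client.*.asok

    # list all commands the socket answers
    ceph daemon /var/run/ceph/ceph-client.admin.12345.asok help

    # dump the client's cache (very verbose, as noted above)
    ceph daemon /var/run/ceph/ceph-client.admin.12345.asok dump_cache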


When the cache pressure problem happens, is there a way to know exactly
which hosts are involved, and what items are in their caches easily?

Andras



On 01/17/2018 06:09 AM, John Spray wrote:

On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki wrote:

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to
Luminous
(12.2.2).  The upgrade went smoothly for the most part, except we seem to
be
hitting an issue with cephfs.  After about a day or two of use, the MDS
start complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.

Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.

John
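
The version details being asked about can be gathered quickly on a client
node; a minimal sketch assuming the CentOS/RHEL packages mentioned above:

    # kernel version, e.g. 3.10.0-693.5.2.el7.x86_64
    uname -r

    # userspace fuse and ceph-fuse package versions
    rpm -q fuse fuse-libs ceph-fuse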


[root@cephmon00 ~]# ceph -s
cluster:
  id: d7b33135-0940-4e48-8aa6-1d2026597c2f
  health: HEALTH_WARN
  1 MDSs have many clients failing to respond to cache
pressure
  noout flag(s) set
  1 osds down

services:
  mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
  mgr: cephmon00(active), standbys: cephmon01, cephmon02
  mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
  osd: 2208 osds: 2207 up, 2208 in
   flags noout

data:
  pools:   6 pools, 42496 pgs
  objects: 919M objects, 3062 TB
  usage:   9203 TB used, 4618 TB / 13822 TB avail
  pgs: 42470 active+clean
   22    active+clean+scrubbing+deep
   4 active+clean+scrubbing

io:
  client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache
pressure;
noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to
cache
pressure
  mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressure
  client_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
  osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down


We are using exclusively the 12.2.2 fuse 

Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-17 Thread John Spray
On Wed, Jan 17, 2018 at 3:36 PM, Andras Pataki wrote:
> Hi John,
>
> All our hosts are CentOS 7 hosts, the majority are 7.4 with kernel
> 3.10.0-693.5.2.el7.x86_64, with fuse 2.9.2-8.el7.  We have some hosts that
> have slight variations in kernel versions; the oldest ones are a handful of
> CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and fuse
> 2.9.2-7.el7.  I know Red Hat has been backporting lots of stuff, so perhaps
> these kernels fall into the category you are describing?

Quite possibly -- this issue was originally noticed on RHEL, so maybe
the relevant bits made it back to CentOS recently.

However, it looks like the fixes for that issue[1,2] are already in
12.2.2, so maybe this is something completely unrelated :-/

The ceph-fuse executable does create an admin command socket in
/var/run/ceph (named something like ceph-client...) that you can drive with
"ceph daemon <name> dump_cache", but the output is extremely verbose
and low level and may not be informative.

John

1. http://tracker.ceph.com/issues/21423
2. http://tracker.ceph.com/issues/22269

>
> When the cache pressure problem happens, is there a way to know exactly
> which hosts are involved, and what items are in their caches easily?
>
> Andras
>
>
>
> On 01/17/2018 06:09 AM, John Spray wrote:
>>
>> On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki wrote:
>>>
>>> Dear Cephers,
>>>
>>> We've upgraded the back end of our cluster from Jewel (10.2.10) to
>>> Luminous
>>> (12.2.2).  The upgrade went smoothly for the most part, except we seem to
>>> be
>>> hitting an issue with cephfs.  After about a day or two of use, the MDS
>>> start complaining about clients failing to respond to cache pressure:
>>
>> What's the OS, kernel version and fuse version on the hosts where the
>> clients are running?
>>
>> There have been some issues with ceph-fuse losing the ability to
>> properly invalidate cached items when certain updated OS packages were
>> installed.
>>
>> Specifically, ceph-fuse checks the kernel version against 3.18.0 to
>> decide which invalidation method to use, and if your OS has backported
>> new behaviour to a low-version-numbered kernel, that can confuse it.
>>
>> John
>>
>>> [root@cephmon00 ~]# ceph -s
>>>cluster:
>>>  id: d7b33135-0940-4e48-8aa6-1d2026597c2f
>>>  health: HEALTH_WARN
>>>  1 MDSs have many clients failing to respond to cache
>>> pressure
>>>  noout flag(s) set
>>>  1 osds down
>>>
>>>services:
>>>  mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>>>  mgr: cephmon00(active), standbys: cephmon01, cephmon02
>>>  mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
>>>  osd: 2208 osds: 2207 up, 2208 in
>>>   flags noout
>>>
>>>data:
>>>  pools:   6 pools, 42496 pgs
>>>  objects: 919M objects, 3062 TB
>>>  usage:   9203 TB used, 4618 TB / 13822 TB avail
>>>  pgs: 42470 active+clean
>>>   22    active+clean+scrubbing+deep
>>>   4 active+clean+scrubbing
>>>
>>>io:
>>>  client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>>>
>>> [root@cephmon00 ~]# ceph health detail
>>> HEALTH_WARN 1 MDSs have many clients failing to respond to cache
>>> pressure;
>>> noout flag(s) set; 1 osds down
>>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to
>>> cache
>>> pressure
>>>  mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressure
>>>  client_count: 103
>>> OSDMAP_FLAGS noout flag(s) set
>>> OSD_DOWN 1 osds down
>>>  osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>>>
>>>
>>> We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
>>> (out of which it seems 100 are not responding to cache pressure in this
>>> log).  When this happens, clients appear pretty sluggish also (listing
>>> directories, etc.).  After bouncing the MDS, everything returns to normal
>>> after the failover for a while.  Ignore the message about 1 OSD down,
>>> that
>>> corresponds to a failed drive and all data has been re-replicated since.
>>>
>>> We were also using the 12.2.2 fuse client with the Jewel back end before
>>> the
>>> upgrade, and have not seen this issue.
>>>
>>> We are running with a larger MDS cache than usual, we have mds_cache_size
>>> set to 4 million.  All other MDS configs are the defaults.
>>>
>>> Is this a known issue?  If not, any hints on how to further diagnose the
>>> problem?
>>>
>>> Andras
>>>
>>>
>


Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-17 Thread Andras Pataki

Hi John,

All our hosts are CentOS 7 hosts, the majority are 7.4 with kernel
3.10.0-693.5.2.el7.x86_64, with fuse 2.9.2-8.el7.  We have some hosts
that have slight variations in kernel versions; the oldest ones are a
handful of CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and
fuse 2.9.2-7.el7.  I know Red Hat has been backporting lots of stuff, so
perhaps these kernels fall into the category you are describing?


When the cache pressure problem happens, is there a way to know exactly 
which hosts are involved, and what items are in their caches easily?


Andras


On 01/17/2018 06:09 AM, John Spray wrote:

On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki wrote:

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
(12.2.2).  The upgrade went smoothly for the most part, except we seem to be
hitting an issue with cephfs.  After about a day or two of use, the MDS
start complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.

Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.

John


[root@cephmon00 ~]# ceph -s
   cluster:
 id: d7b33135-0940-4e48-8aa6-1d2026597c2f
 health: HEALTH_WARN
 1 MDSs have many clients failing to respond to cache pressure
 noout flag(s) set
 1 osds down

   services:
 mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
 mgr: cephmon00(active), standbys: cephmon01, cephmon02
 mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
 osd: 2208 osds: 2207 up, 2208 in
  flags noout

   data:
 pools:   6 pools, 42496 pgs
 objects: 919M objects, 3062 TB
 usage:   9203 TB used, 4618 TB / 13822 TB avail
 pgs: 42470 active+clean
  22    active+clean+scrubbing+deep
  4 active+clean+scrubbing

   io:
 client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
pressure
 mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressure
 client_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
 osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down


We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
(out of which it seems 100 are not responding to cache pressure in this
log).  When this happens, clients appear pretty sluggish also (listing
directories, etc.).  After bouncing the MDS, everything returns to normal
after the failover for a while.  Ignore the message about 1 OSD down, that
corresponds to a failed drive and all data has been re-replicated since.

We were also using the 12.2.2 fuse client with the Jewel back end before the
upgrade, and have not seen this issue.

We are running with a larger MDS cache than usual, we have mds_cache_size
set to 4 million.  All other MDS configs are the defaults.

Is this a known issue?  If not, any hints on how to further diagnose the
problem?

Andras







Re: [ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-17 Thread John Spray
On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki wrote:
> Dear Cephers,
>
> We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
> (12.2.2).  The upgrade went smoothly for the most part, except we seem to be
> hitting an issue with cephfs.  After about a day or two of use, the MDS
> start complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.

Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.

John

>
> [root@cephmon00 ~]# ceph -s
>   cluster:
> id: d7b33135-0940-4e48-8aa6-1d2026597c2f
> health: HEALTH_WARN
> 1 MDSs have many clients failing to respond to cache pressure
> noout flag(s) set
> 1 osds down
>
>   services:
> mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
> mgr: cephmon00(active), standbys: cephmon01, cephmon02
> mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
> osd: 2208 osds: 2207 up, 2208 in
>  flags noout
>
>   data:
> pools:   6 pools, 42496 pgs
> objects: 919M objects, 3062 TB
> usage:   9203 TB used, 4618 TB / 13822 TB avail
> pgs: 42470 active+clean
>  22    active+clean+scrubbing+deep
>  4 active+clean+scrubbing
>
>   io:
> client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>
> [root@cephmon00 ~]# ceph health detail
> HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
> noout flag(s) set; 1 osds down
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
> mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressure
> client_count: 103
> OSDMAP_FLAGS noout flag(s) set
> OSD_DOWN 1 osds down
> osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>
>
> We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
> (out of which it seems 100 are not responding to cache pressure in this
> log).  When this happens, clients appear pretty sluggish also (listing
> directories, etc.).  After bouncing the MDS, everything returns to normal
> after the failover for a while.  Ignore the message about 1 OSD down, that
> corresponds to a failed drive and all data has been re-replicated since.
>
> We were also using the 12.2.2 fuse client with the Jewel back end before the
> upgrade, and have not seen this issue.
>
> We are running with a larger MDS cache than usual, we have mds_cache_size
> set to 4 million.  All other MDS configs are the defaults.
>
> Is this a known issue?  If not, any hints on how to further diagnose the
> problem?
>
> Andras
>
>


[ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

2018-01-16 Thread Andras Pataki

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to 
Luminous (12.2.2).  The upgrade went smoothly for the most part, except 
we seem to be hitting an issue with cephfs.  After about a day or two of 
use, the MDS start complaining about clients failing to respond to cache 
pressure:


   [root@cephmon00 ~]# ceph -s
  cluster:
    id: d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
        1 MDSs have many clients failing to respond to cache
   pressure
    noout flag(s) set
    1 osds down

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active), standbys: cephmon01, cephmon02
    mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
    osd: 2208 osds: 2207 up, 2208 in
 flags noout

  data:
    pools:   6 pools, 42496 pgs
    objects: 919M objects, 3062 TB
    usage:   9203 TB used, 4618 TB / 13822 TB avail
    pgs: 42470 active+clean
 22    active+clean+scrubbing+deep
 4 active+clean+scrubbing

  io:
    client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

   [root@cephmon00 ~]# ceph health detail
   HEALTH_WARN 1 MDSs have many clients failing to respond to cache
   pressure; noout flag(s) set; 1 osds down
   MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond
   to cache pressure
       mdscephmon00(mds.0): Many clients (103) failing to respond to
   cache pressure
       client_count: 103
   OSDMAP_FLAGS noout flag(s) set
   OSD_DOWN 1 osds down
    osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is
   down


We are using exclusively the 12.2.2 fuse client on about 350 nodes or so 
(out of which it seems 100 are not responding to cache pressure in this 
log).  When this happens, clients appear pretty sluggish also (listing 
directories, etc.).  After bouncing the MDS, everything returns to
normal after the failover for a while.  Ignore the message about 1 OSD
down, that corresponds to a failed drive and all data has been 
re-replicated since.


We were also using the 12.2.2 fuse client with the Jewel back end before 
the upgrade, and have not seen this issue.


We are running with a larger MDS cache than usual, we have 
mds_cache_size set to 4 million.  All other MDS configs are the defaults.
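
In case it helps others compare numbers, a minimal sketch of checking what
the MDS is actually holding versus the configured limit (daemon id as used
elsewhere in this thread):

    # the mds_mem section of the perf dump output has inode/dentry/cap counts and memory figures
    ceph daemon mds.cephmon00 perf dump

    # one entry per client session, including client metadata and the number of caps held
    ceph daemon mds.cephmon00 session ls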


Is this a known issue?  If not, any hints on how to further diagnose the 
problem?


Andras
