Re: xfs_reclaim_inodes_ag taking several seconds

2020-11-18 Thread Donald Buczek

On 8/12/20 8:49 AM, Donald Buczek wrote:

On 8/4/20 12:11 AM, Dave Chinner wrote:

On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:

On 01.08.20 00:32, Dave Chinner wrote:

On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:

Dear Linux people,

we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
(16 * 7.3 T disks) each. The current Linux version is 5.4.54.

.

root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
MemTotal:   263572332 kB


256GB of RAM.


MemFree: 2872368 kB
MemAvailable:   204193824 kB


200GB "available"


Buffers:    2568 kB
Cached: 164931356 kB


160GB in page cache


KReclaimable:   40079660 kB
Slab:   49988268 kB
SReclaimable:   40079660 kB


40GB in reclaimable slab objects.

IOWs, you have no free memory in the machine and so allocation
will frequently be dipping into memory reclaim to free up page cache
and slab caches to make memory available.


xfs_inode 30978282 31196832    960    4    1 : tunables   54   27    8 
: slabdata 7799208 7799208    434


Yes, 30 million cached inodes.


bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120   60   
 8 : slabdata 1438261 1438261    480


Either there is a memory leak in this slab, or it is shared with
something like the xfs_ili slab, which would indicate that most
of the cached inodes have been dirtied in memory at some point in
time.


I think you are right here:

 crash> p $s->name
 $84 = 0x82259401 "bio_integrity_payload"
 crash> p $s->refcount
 $88 = 8
 crash> p $s
 $92 = (struct kmem_cache *) 0x88bff92d2bc0
 crash> p sizeof(xfs_inode_log_item_t)
 $93 = 192
 crash> p $s->object_size
 $94 = 192
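
For completeness, a sketch of how such a merge could also be checked on the
running system, assuming SLUB with its sysfs interface enabled (the paths and
the target name below are illustrative and worth double-checking):

 # On SLUB, merged caches appear in /sys/kernel/slab as symlinks that all
 # point at the same backing directory, named after the object size, so
 # xfs_ili and bio_integrity_payload should resolve to the same target.
 ls -l /sys/kernel/slab/ | awk '/^l/ && $NF ~ /-0000192$/ {print $(NF-2), "->", $NF}'
 # each cache also reports how many other caches were merged into it
 cat /sys/kernel/slab/bio_integrity_payload/aliases 2>/dev/null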

So if I understand you correctly, this is expected behavior with
this kind of load and conceptual changes are already scheduled for
kernel 5.9. I don't understand most of it, but isn't it true that
with those planned changes the impact might be better limited to
the filesystem, so that the performance of other areas of the
system might improve?


What the changes in 5.9 will do is remove the direct memory reclaim
latency that comes from waiting on IO in the shrinker. Hence you
will no longer see this problem from applications doing memory
allocation. i.e. they'll get some other memory reclaimed without
blocking (e.g. page cache or clean inodes) and so the specific
symptom of having large numbers of dirty inodes in memory that you
are seeing will go away.

Which means that dirty inodes in memory will continue to build up
until the next constraint is hit, and then it will go back to having
unpredictable large latencies while waiting for inodes to be written
back to free up whatever resource the filesystem has run out of.

That resource will, most likely, be filesystem journal space. Every
fs modification needs to reserve sufficient journal to complete
before the modification starts. Hence if the journal fills, any
modification to the fs will block waiting on dirty inode writeback
to release space in the journal.

You might be lucky and the backup process is slow enough that the
disk subsystem can keep up with the rate of ingest of new data and
so you never hit this limitation. However, the reported state of the
machine and the amount of RAM it has for caching says to me that the
underlying problem is that ingest is far faster than the filesystem
and disk subsystem can sink...

A solution to this problem might be to spread out the backups being
done over a wider timeframe, such that there isn't a sustained heavy
load at 3am in the morning when every machine is scheduled to be
backed up at the same time...


It is already running round the clock. We have two of these servers, doing 
daily backups for 1374 file spaces (= directory trees below 1 TB) on 392 
clients. The servers are doing daily mirrors of these file spaces, creating 
hard links for existing files and keeping these daily trees for 4 months. The 
schedule is free from fixed wall clock times. A backup is due when the last 
iteration is older than 24 hours, and will be done as time, threads and some 
locking constraints allow. Under normal circumstances the servers keep up, but 
are nearly continuously busy. The excess capacity is spread all over the day 
(sometimes there is no work to do for some minutes).

I monitor how long a local mount takes (usually 0.03-0.04 seconds), and the 
worst time seen so far is over 16 minutes! Because we rely on autofs mounts for 
logins and some other things, the system kind of appears to be dead during that 
time.
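
The probe itself is nothing more than timing a trivial mount/umount cycle; a
minimal sketch (filesystem type, mountpoint and interval are placeholders, not
necessarily what we actually run):

 # Registering/unregistering a superblock's shrinker takes shrinker_rwsem,
 # so even a tmpfs mount should be enough to expose the stalls.
 PROBE_MNT=/mnt/mount-probe            # hypothetical mountpoint
 mkdir -p "$PROBE_MNT"
 while sleep 60; do
     t0=$(date +%s.%N)
     mount -t tmpfs probe "$PROBE_MNT" && umount "$PROBE_MNT"
     t1=$(date +%s.%N)
     printf '%s mount+umount: %.3f s\n' "$(date -Is)" "$(echo "$t1 - $t0" | bc)"
 done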

I've limited the inode caches by running the backup jobs in memory control 
groups, but, as expected, this didn't really bring down the delays. I've also 
added a cond_resched() to mm/vmscan.c:shrink_slab_memcg(), which seems to be 
missing after the up_read(), but this didn't help either.
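
For the record, the jobs were confined roughly along these lines (a sketch
only; cgroup v2 and the limit values are assumptions, not our real setup):

 # page cache and reclaimable slab allocated on behalf of the job are
 # charged to the group and reclaimed against memory.high / memory.max
 mkdir -p /sys/fs/cgroup/backup
 echo 64G > /sys/fs/cgroup/backup/memory.high      # start reclaim/throttling here
 echo 80G > /sys/fs/cgroup/backup/memory.max       # hard limit
 echo $$ > /sys/fs/cgroup/backup/cgroup.procs      # move this shell, then start the jobs
 rsync -aH --link-dest=/backup/yesterday client:/ /backup/today/    # illustrative job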

From a user perspective it is hard to understand that a 

Re: xfs_reclaim_inodes_ag taking several seconds

2020-08-12 Thread Donald Buczek

On 8/4/20 12:11 AM, Dave Chinner wrote:

On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:

On 01.08.20 00:32, Dave Chinner wrote:

On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:

Dear Linux people,

we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
(16 * 7.3 T disks) each. The current Linux version is 5.4.54.

.

root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
MemTotal:   263572332 kB


256GB of RAM.


MemFree: 2872368 kB
MemAvailable:   204193824 kB


200GB "available"


Buffers:            2568 kB
Cached: 164931356 kB


160GB in page cache


KReclaimable:   40079660 kB
Slab:   49988268 kB
SReclaimable:   40079660 kB


40GB in reclaimable slab objects.

IOWs, you have no free memory in the machine and so allocation
will frequently be dipping into memory reclaim to free up page cache
and slab caches to make memory available.


xfs_inode 30978282 31196832    960    4    1 : tunables   54   27    8 
: slabdata 7799208 7799208    434


Yes, 30 million cached inodes.


bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120   60   
 8 : slabdata 1438261 1438261    480


Either there is a memory leak in this slab, or it is shared with
something like the xfs_ili slab, which would indicate that most
of the cached inodes have been dirtied in memory at some point in
time.


I think you are right here:

 crash> p $s->name
 $84 = 0x82259401 "bio_integrity_payload"
 crash> p $s->refcount
 $88 = 8
 crash> p $s
 $92 = (struct kmem_cache *) 0x88bff92d2bc0
 crash> p sizeof(xfs_inode_log_item_t)
 $93 = 192
 crash> p $s->object_size
 $94 = 192

So if I understand you correctly, this is expected behavior with
this kind of load and conceptual changes are already scheduled for
kernel 5.9. I don't understand most of it, but isn't it true that
with those planned changes the impact might be better limited to
the filesystem, so that the performance of other areas of the
system might improve?


What the changes in 5.9 will do is remove the direct memory reclaim
latency that comes from waiting on IO in the shrinker. Hence you
will no longer see this problem from applications doing memory
allocation. i.e. they'll get some other memory reclaimed without
blocking (e.g. page cache or clean inodes) and so the specific
symptom of having large numbers of dirty inodes in memory that you
are seeing will go away.

Which means that dirty inodes in memory will continue to build up
until the next constraint is hit, and then it will go back to having
unpredictable large latencies while waiting for inodes to be written
back to free up whatever resource the filesystem has run out of.

That resource will, most likely, be filesystem journal space. Every
fs modification needs to reserve sufficient journal to complete
before the modification starts. Hence if the journal fills, any
modification to the fs will block waiting on dirty inode writeback
to release space in the journal.

You might be lucky and the backup process is slow enough that the
disk subsystem can keep up with the rate of ingest of new data and
so you never hit this limitation. However, the reported state of the
machine and the amount of RAM it has for caching says to me that the
underlying problem is that ingest is far faster than the filesystem
and disk subsystem can sink...

A solution to this problem might be to spread out the backups being
done over a wider timeframe, such that there isn't a sustained heavy
load at 3am in the morning when every machine is scheduled to be
backed up at the same time...


It is already running round the clock. We have two of these servers, doing 
daily backups for 1374 file spaces (= directory trees below 1 TB) on 392 
clients. The servers are doing daily mirrors of these file spaces, creating 
hard links for existing files and keeping these daily trees for 4 months. The 
schedule is free from fixed wall clock times. A backup is due when the last 
iteration is older than 24 hours, and will be done as time, threads and some 
locking constraints allow. Under normal circumstances the servers keep up, but 
are nearly continuously busy. The excess capacity is spread all over the day 
(sometimes there is no work to do for some minutes).

I monitor how long a local mount takes (usually 0.03-0.04 seconds), and the 
worst time seen so far is over 16 minutes! Because we rely on autofs mounts for 
logins and some other things, the system kind of appears to be dead during that 
time.

I've limited the inode caches by running the backup jobs in memory control 
groups, but, as expected, this didn't really bring down the delays. I've also 
added a cond_resched() to mm/vmscan.c:shrink_slab_memcg(), which seems to be 
missing after the up_read(), but this didn't help either.

From a user perspective it is hard to understand that a saturated block device 
delays unrelated 

Re: xfs_reclaim_inodes_ag taking several seconds

2020-08-03 Thread Dave Chinner
On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:
> On 01.08.20 00:32, Dave Chinner wrote:
> > On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
> > > Dear Linux people,
> > > 
> > > we have a backup server with two xfs filesystems on 101.9TB md-raid6 
> > > devices (16 * 7.3 T disks) each. The current Linux version is 5.4.54.
> > .
> > > root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat 
> > > /proc/meminfo
> > > MemTotal:   263572332 kB
> > 
> > 256GB of RAM.
> > 
> > > MemFree: 2872368 kB
> > > MemAvailable:   204193824 kB
> > 
> > 200GB "available"
> > 
> > > Buffers:            2568 kB
> > > Cached: 164931356 kB
> > 
> > 160GB in page cache
> > 
> > > KReclaimable:   40079660 kB
> > > Slab:   49988268 kB
> > > SReclaimable:   40079660 kB
> > 
> > 40GB in reclaimable slab objects.
> > 
> > IOWs, you have no free memory in the machine and so allocation
> > will frequently be dipping into memory reclaim to free up page cache
> > and slab caches to make memory available.
> > 
> > > xfs_inode 30978282 31196832    960    4    1 : tunables   54   27 
> > >    8 : slabdata 7799208 7799208    434
> > 
> > Yes, 30 million cached inodes.
> > 
> > > bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120  
> > >  60    8 : slabdata 1438261 1438261    480
> > 
> > Either there is a memory leak in this slab, or it is shared with
> > something like the xfs_ili slab, which would indicate that most
> > of the cached inodes have been dirtied in memory at some point in
> > time.
> 
> I think you are right here:
> 
> crash> p $s->name
> $84 = 0x82259401 "bio_integrity_payload"
> crash> p $s->refcount
> $88 = 8
> crash> p $s
> $92 = (struct kmem_cache *) 0x88bff92d2bc0
> crash> p sizeof(xfs_inode_log_item_t)
> $93 = 192
> crash> p $s->object_size
> $94 = 192
> 
> So if I understand you correctly, this is expected behavior with
> this kind of load and conceptual changes are already scheduled for
> kernel 5.9. I don't understand most of it, but isn't it true that
> with those planned changes the impact might be better limited to
> the filesystem, so that the performance of other areas of the
> system might improve?

What the changes in 5.9 will do is remove the direct memory reclaim
latency that comes from waiting on IO in the shrinker. Hence you
will no longer see this problem from applications doing memory
allocation. i.e. they'll get some other memory reclaimed without
blocking (e.g. page cache or clean inodes) and so the specific
symptom of having large numbers of dirty inodes in memory that you
are seeing will go away.

Which means that dirty inodes in memory will continue to build up
until the next constraint is hit, and then it will go back to having
unpredictable large latencies while waiting for inodes to be written
back to free up whatever resource the filesystem has run out of.

That resource will, most likely, be filesystem journal space. Every
fs modification needs to reserve sufficient journal to complete
before the modification starts. Hence if the journal fills, any
modification to the fs will block waiting on dirty inode writeback
to release space in the journal.

You might be lucky and the backup process is slow enough that the
disk subsystem can keep up with the rate of ingest of new data and
so you never hit this limitation. However, the reported state of the
machine and the amount of RAM it has for caching says to me that the
underlying problem is that ingest is far faster than the filesystem
and disk subsystem can sink...

A solution to this problem might be to spread out the backups being
done over a wider timeframe, such that there isn't a sustained heavy
load at 3am in the morning when every machine is scheduled to be
backed up at the same time...

> I'd love to test that with our load, but I
> don't want to risk our backup data and it would be difficult to
> produce the same load on a toy system. The patch set is not yet
> ready to be tested on production data, is it?

Not unless you like testing -rc1 kernels in production :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: xfs_reclaim_inodes_ag taking several seconds

2020-08-01 Thread Donald Buczek



On 01.08.20 12:25, Donald Buczek wrote:


So if I understand you correctly, this is expected behavior with this kind of 
load and conceptual changes are already scheduled for kernel 5.9. I don't 
understand most of it, but isn't it true that with those planned changes the 
impact might be better limited to the filesystem, so that the performance of 
other areas of the system might improve? I'd love to test that with our load, 
but I don't want to risk our backup data and it would be difficult to produce 
the same load on a toy system. The patch set is not yet ready to be tested on 
production data, is it?

So I guess I'll try to put the backup processes into one or more cgroups to 
limit the memory available for their fs caches and leave some room for 
unrelated (maintenance) processes. I hope that makes sense.


Which it doesn't, because it totally ignores what was said before. The affected 
processes were not waiting for memory but for the shrinker to finish.

D.



Thank you both for your analysis!

Donald


And if you have 30 million inodes in memory, and lots of them are
dirty, and the shrinkers are running, then they will be doing
dirty inode writeback to throttle memory reclaim to
ensure it makes progress and doesn't declare OOM and kill processes
prematurely.

You have spinning disks, RAID6. I'm betting that it can only clean a
couple of hundred inodes a second because RAID6 is severely IOP
limited for small writes (like inode clusters). And when you have many, 
many thousands (maybe millions) of dirty inodes, anything that has
to wait on inode writeback is going to be waiting for some time...


root:done:/home/buczek/linux_problems/shrinker_semaphore/# xfs_info 
/amd/done/C/C8024
meta-data=/dev/md0   isize=512    agcount=102, agsize=268435328 blks
  =   sectsz=4096  attr=2, projid32bit=1
  =   crc=1    finobt=1, sparse=1, rmapbt=0
  =   reflink=0
data =   bsize=4096   blocks=27348629504, imaxpct=1
  =   sunit=128    swidth=1792 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=521728, version=2


And full size journals, so the filesystem can hold an awful lot of
active dirty inodes in memory before it starts throttling on a full
journal (think millions of dirty inodes per filesystem)...

So, yeah, this is the classic "in memory operation is orders of
magnitude faster than disk operation" and it all comes crashing down
when something needs to wait for inodes to be written back. The
patchset Darrick pointed you at should fix the shrinker issue, but
it's likely that this will just push the problem to the next
throttling point, which is the journal filling up.

IOWs, I suspect fixing your shrinker problem is only going to make
the overload of dirty inodes in the system behave worse, because
running out of journal space causes *all modifications* to the 
filesystem to start taking significant delays while they wait for
inode writeback to free journal space, not just have things
trying to register/unregister shrinkers take delays...

Cheers,

Dave.





--
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433


Re: xfs_reclaim_inodes_ag taking several seconds

2020-08-01 Thread Donald Buczek

On 01.08.20 00:32, Dave Chinner wrote:

On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:

Dear Linux people,

we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
(16 * 7.3 T disks) each. The current Linux version is 5.4.54.

.

root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
MemTotal:   263572332 kB


256GB of RAM.


MemFree: 2872368 kB
MemAvailable:   204193824 kB


200GB "available"


Buffers:            2568 kB
Cached: 164931356 kB


160GB in page cache


KReclaimable:   40079660 kB
Slab:   49988268 kB
SReclaimable:   40079660 kB


40GB in reclaimable slab objects.

IOWs, you have no free memory in the machine and so allocation
will frequently be dipping into memory reclaim to free up page cache
and slab caches to make memory available.


xfs_inode 30978282 31196832    960    4    1 : tunables   54   27    8 
: slabdata 7799208 7799208    434


Yes, 30 million cached inodes.


bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120   60   
 8 : slabdata 1438261 1438261    480


Either there is a memory leak in this slab, or it is shared with
something like the xfs_ili slab, which would indicate that most
of the cached inodes have been dirtied in memory at some point in
time.


I think you are right here:

crash> p $s->name
$84 = 0x82259401 "bio_integrity_payload"
crash> p $s->refcount
$88 = 8
crash> p $s
$92 = (struct kmem_cache *) 0x88bff92d2bc0
crash> p sizeof(xfs_inode_log_item_t)
$93 = 192
crash> p $s->object_size
$94 = 192

So if I understand you correctly, this is expected behavior with this kind of 
load and conceptual changes are already scheduled for kernel 5.9. I don't 
understand most of it, but isn't it true that with those planned changes the 
impact might be better limited to the filesystem, so that the performance of 
other areas of the system might improve? I'd love to test that with our load, 
but I don't want to risk our backup data and it would be difficult to produce 
the same load on a toy system. The patch set is not yet ready to be tested on 
production data, is it?

So I guess I'll try to put the backup processes into one or more cgroups to 
limit the memory available for their fs caches and leave some room for 
unrelated (maintenance) processes. I hope that makes sense.

Thank you both for your analysis!

Donald


And if you have 30 million inodes in memory, and lots of them are
dirty, and the shrinkers are running, then they will be doing
dirty inode writeback to throttle memory reclaim to
ensure it makes progress and doesn't declare OOM and kill processes
prematurely.

You have spinning disks, RAID6. I'm betting that it can only clean a
couple of hundred inodes a second because RAID6 is severely IOP
limited for small writes (like inode clusters). And when you have many, 
many thousands (maybe millions) of dirty inodes, anything that has
to wait on inode writeback is going to be waiting for some time...


root:done:/home/buczek/linux_problems/shrinker_semaphore/# xfs_info 
/amd/done/C/C8024
meta-data=/dev/md0   isize=512    agcount=102, agsize=268435328 blks
  =   sectsz=4096  attr=2, projid32bit=1
  =   crc=1    finobt=1, sparse=1, rmapbt=0
  =   reflink=0
data =   bsize=4096   blocks=27348629504, imaxpct=1
  =   sunit=128    swidth=1792 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=521728, version=2


And full size journals, so the filesystem can hold an awful lot of
active dirty inodes in memory before it starts throttling on a full
journal (think millions of dirty inodes per filesystem)...

So, yeah, this is the classic "in memory operation is orders of
magnitude faster than disk operation" and it all comes crashing down
when something needs to wait for inodes to be written back. The
patchset Darrick pointed you at should fix the shrinker issue, but
it's likely that this will just push the problem to the next
throttling point, which is the journal filling up.

IOWs, I suspect fixing your shrinker problem is only going to make
the overload of dirty inodes in the system behave worse, because
running out of journal space causes *all modifications* to the 
filesystem to start taking significant delays while they wait for
inode writeback to free journal space, not just have things
trying to register/unregister shrinkers take delays...

Cheers,

Dave.



--
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433


Re: xfs_reclaim_inodes_ag taking several seconds

2020-07-31 Thread Dave Chinner
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
> Dear Linux people,
> 
> we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
> (16 * 7.3 T disks) each, Current Linux version is 5.4.54.
.
> root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
> MemTotal:   263572332 kB

256GB of RAM.

> MemFree: 2872368 kB
> MemAvailable:   204193824 kB

200GB "available"

> Buffers:            2568 kB
> Cached: 164931356 kB

160GB in page cache

> KReclaimable:   40079660 kB
> Slab:   49988268 kB
> SReclaimable:   40079660 kB

40GB in reclaimable slab objects.

IOWs, you have no free memory in the machine and so allocation
will frequently be dipping into memory reclaim to free up page cache
and slab caches to make memory available.

> xfs_inode 30978282 31196832    960    4    1 : tunables   54   27
> 8 : slabdata 7799208 7799208    434

Yes, 30 million cached inodes.

> bio_integrity_payload 29644966 30203481    192   21    1 : tunables  120   60 
> 8 : slabdata 1438261 1438261    480

Either there is a memory leak in this slab, or it is shared with
something like the xfs_ili slab, which would indicate that most
of the cached inodes have been dirtied in memory at some point in
time. 
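
Rough arithmetic for scale, from the slabinfo lines quoted above:

 # approximate memory pinned by the two caches (num_objs * object size)
 echo "xfs_inode:         $(( 31196832 * 960 / 1048576 )) MiB"    # ~28 GiB
 echo "merged 192B cache: $(( 30203481 * 192 / 1048576 )) MiB"    # ~5.5 GiB

Those two caches alone account for most of the ~40GB of reclaimable slab
reported above.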

And if you have 30 million inodes in memory, and lots of them are
dirty, and the shrinkers are running, then they will be doing
dirty inode writeback to throttle memory reclaim to
ensure it makes progress and doesn't declare OOM and kill processes
prematurely.

You have spinning disks, RAID6. I'm betting that it can only clean a
couple of hundred inodes a second because RAID6 is severely IOP
limited for small writes (like inode clusters). And when you have many, 
many thousands (maybe millions) of dirty inodes, anything that has
to wait on inode writeback is going to be waiting for some time...

> root:done:/home/buczek/linux_problems/shrinker_semaphore/# xfs_info 
> /amd/done/C/C8024
> meta-data=/dev/md0   isize=512    agcount=102, agsize=268435328 
> blks
>  =   sectsz=4096  attr=2, projid32bit=1
>  =   crc=1    finobt=1, sparse=1, rmapbt=0
>  =   reflink=0
> data =   bsize=4096   blocks=27348629504, imaxpct=1
>  =   sunit=128    swidth=1792 blks
> naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
> log  =internal log   bsize=4096   blocks=521728, version=2

And full size journals, so the filesystem can hold an awful lot of
active dirty inodes in memory before it starts throttling on a full
journal (think millions of dirty inodes per filesystem)...
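
For scale, a quick back-of-the-envelope check from the xfs_info above:

 echo $(( 521728 * 4096 / 1048576 )) MiB    # log size: 2038 MiB, i.e. ~2GiB

i.e. each filesystem can queue up on the order of 2GiB of logged changes
before the journal itself starts pushing back.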

So, yeah, this is the classic "in memory operation is orders of
magnitude faster than disk operation" and it all comes crashing down
when something needs to wait for inodes to be written back. The
patchset Darrick pointed you at should fix the shrinker issue, but
it's likely that this will just push the problem to the next
throttling point, which is the journal filling up.

IOWs, I suspect fixing your shrinker problem is only going to make
the overload of dirty inodes in the system behave worse, because
running out of journal space causes *all modifications* to the 
filesystem to start taking significant delays while they wait for
inode writeback to free journal space, not just have things
trying to register/unregister shrinkers take delays...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: xfs_reclaim_inodes_ag taking several seconds

2020-07-31 Thread Darrick J. Wong
On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
> Dear Linux people,
> 
> we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
> (16 * 7.3 T disks) each, Current Linux version is 5.4.54.
> 
> The server is doing a lot of metadata I/O to these filesystems (two rsyncs of 
> complete client filesystems with --link-dest to the previous backup, so lots 
> of hard links, and two `rm -rf` of directory trees expiring old backups). 
> Lots and lots of inodes on the filesystem and in the cache.
> 
> While these backup tasks seem to perform okay, often the system seems to be 
> stalled for minutes, e.g. when a mount is required.
> 
> I figured out that xfs_reclaim_inodes_ag can take several tens of seconds. 
> These add up to minutes while the system is running the shrinkers. Other 
> processes (e.g. those which require mounts) block on shrinker_rwsem when they 
> try to register/deregister their shrinkers.
> 
> Is this expected behavior? Any ideas?

On any kernel through 5.8?  Yes.

5.9 will have this patchset[1] that changes xfs to throttle frontend
processes that are trying to modify inodes instead of using dirty inode
writeback to throttle memory reclaim.

I'd be curious to see how the new code handles everyone else's
workloads, but we haven't even finished 5.8 yet so it's still early
days...

--D

[1] 
https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-da...@fromorbit.com/

> 
> Best
>   Donald
> 
> Some data:
> 
> root:done:~/# ps -f -p 20379
> UIDPID  PPID  C STIME TTY  TIME CMD
> root 20379 20378  0 11:42 ?00:00:00 /sbin/mount.nfs 
> handsomejack:/amd/handsomejack/2/home/abt_srv/klages /home/klages -s -o 
> rw,nosuid
> root:done:~/# cat /proc/20379/stack
> [<0>] rwsem_down_write_slowpath+0x21b/0x470
> [<0>] unregister_memcg_shrinker.isra.55+0x18/0x40
> [<0>] unregister_shrinker+0x6e/0x80
> [<0>] deactivate_locked_super+0x2e/0x70
> [<0>] cleanup_mnt+0xb8/0x150
> [<0>] task_work_run+0x7e/0xa0
> [<0>] exit_to_usermode_loop+0xbf/0xd0
> [<0>] do_syscall_64+0xfe/0x130
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> # echo "vmscan:*" > /sys/kernel/debug/tracing/set_event
> # cat /sys/kernel/debug/tracing/trace > trace.001
> # grep kswapd1-137 trace.001
> # # only lines with noticeable time gaps ( > 1sec) pasted here:
> 
> kswapd1-137   [011]  66671.873532: mm_vmscan_kswapd_wake: nid=1 order=3
> kswapd1-137   [011]  66671.878236: mm_shrink_slab_start: 
> super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 objects to shrink 824 
> gfp_flags GFP_KERNEL cache items 1065515 delta 520 total_scan 1344 priority 12
> 
> kswapd1-137   [007]  66707.247798: mm_shrink_slab_end: 
> super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 unused scan count 824 new 
> scan count 320 total_scan 320 last shrinker return val 1021
> kswapd1-137   [007]  66707.255750: mm_shrink_slab_start: 
> super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 objects to shrink 726 
> gfp_flags GFP_KERNEL cache items 1064232 delta 1038 total_scan 1764 priority 
> 11
> 
> kswapd1-137   [003]  66712.718536: mm_shrink_slab_end: 
> super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 unused scan count 726 new 
> scan count 740 total_scan 740 last shrinker return val 1022
> kswapd1-137   [001]  66712.984923: mm_vmscan_kswapd_sleep: nid=1
> 
> # true > /sys/kernel/debug/tracing/set_event
> # true > /sys/kernel/debug/tracing/trace
> # echo super_cache_scan > /sys/kernel/debug/tracing/set_graph_function
> # echo function_graph > /sys/kernel/debug/tracing/current_tracer
> # cat /sys/kernel/debug/tracing/trace > trace.002
> # fgrep \$ trace.002 | tail -20
> 
>  13) $ 1918769 us  |  }
>   9) $ 1952622 us  |  }
>   8) $ 2503877 us  |} /* xfs_reclaim_inodes_ag */
>   8) $ 2503888 us  |  } /* xfs_reclaim_inodes_nr */
>   8) $ 2503889 us  |} /* xfs_fs_free_cached_objects */
>   8) $ 2503939 us  |  } /* super_cache_scan */
>   5) $ 2119727 us  |}
>   5) $ 2119727 us  |  }
>   5) $ 2119729 us  |}
>   5) $ 2119732 us  |  }
>   5) $ 2119732 us  |}
>   5) $ 2119744 us  |  }
>   5) $ 2178135 us  |}
>   5) $ 2228517 us  |  }
>   5) $ 2228518 us  |}
>   5) $ 2229533 us  |  }
>  14) $ 18086849 us |} /* xfs_reclaim_inodes_ag */
>  14) $ 18086858 us |  } /* xfs_reclaim_inodes_nr */
>  14) $ 18086858 us |} /* xfs_fs_free_cached_objects */
>  14) $ 18086907 us |  } /* super_cache_scan */
> 
> root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
> MemTotal:   263572332 kB
> MemFree: 2872368 kB
> MemAvailable:   204193824 kB
> Buffers:            2568 kB
> Cached: 164931356 kB
> SwapCached:            0 kB
> Active: 50608440 kB
> Inactive:   114468364 kB
> Active(anon): 143724 kB
> Inactive(anon): 

xfs_reclaim_inodes_ag taking several seconds

2020-07-31 Thread Donald Buczek

Dear Linux people,

we have a backup server with two xfs filesystems on 101.9TB md-raid6 devices 
(16 * 7.3 T disks) each. The current Linux version is 5.4.54.

The server is doing a lot of metadata I/O to these filesystems (two rsyncs of 
complete client filesystems with --link-dest to the previous backup, so lots of 
hard links, and two `rm -rf` of directory trees expiring old backups). Lots and 
lots of inodes on the filesystem and in the cache.

While these backup tasks seem to perform okay, often the system seems to be 
stalled for minutes, e.g. when a mount is required.

I figured out that xfs_reclaim_inodes_ag can take several tens of seconds. 
These add up to minutes while the system is running the shrinkers. Other 
processes (e.g. those which require mounts) block on shrinker_rwsem when they 
try to register/deregister their shrinkers.

Is this expected behavior? Any ideas?

Best
  Donald

Some data:

root:done:~/# ps -f -p 20379
UIDPID  PPID  C STIME TTY  TIME CMD
root 20379 20378  0 11:42 ?00:00:00 /sbin/mount.nfs 
handsomejack:/amd/handsomejack/2/home/abt_srv/klages /home/klages -s -o 
rw,nosuid
root:done:~/# cat /proc/20379/stack
[<0>] rwsem_down_write_slowpath+0x21b/0x470
[<0>] unregister_memcg_shrinker.isra.55+0x18/0x40
[<0>] unregister_shrinker+0x6e/0x80
[<0>] deactivate_locked_super+0x2e/0x70
[<0>] cleanup_mnt+0xb8/0x150
[<0>] task_work_run+0x7e/0xa0
[<0>] exit_to_usermode_loop+0xbf/0xd0
[<0>] do_syscall_64+0xfe/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
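
A crude sketch (not part of the data above) to see how many other tasks are
stuck on the shrinker paths at such a moment:

 # needs root; /proc/<pid>/stack shows each task's kernel stack
 for p in /proc/[0-9]*; do
     if grep -qE 'register_shrinker|shrink_slab' "$p/stack" 2>/dev/null; then
         echo "== $p"; cat "$p/stack"
     fi
 done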


# echo "vmscan:*" > /sys/kernel/debug/tracing/set_event
# cat /sys/kernel/debug/tracing/trace > trace.001
# grep kswapd1-137 trace.001
# # only lines with noticeable time gaps ( > 1sec) pasted here:

kswapd1-137   [011]  66671.873532: mm_vmscan_kswapd_wake: nid=1 order=3
kswapd1-137   [011]  66671.878236: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 objects to shrink 824 
gfp_flags GFP_KERNEL cache items 1065515 delta 520 total_scan 1344 priority 12

kswapd1-137   [007]  66707.247798: mm_shrink_slab_end: 
super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 unused scan count 824 new 
scan count 320 total_scan 320 last shrinker return val 1021
kswapd1-137   [007]  66707.255750: mm_shrink_slab_start: 
super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 objects to shrink 726 
gfp_flags GFP_KERNEL cache items 1064232 delta 1038 total_scan 1764 priority 11

kswapd1-137   [003]  66712.718536: mm_shrink_slab_end: 
super_cache_scan+0x0/0x1a0 3e89d20a: nid: 1 unused scan count 726 new 
scan count 740 total_scan 740 last shrinker return val 1022
kswapd1-137   [001]  66712.984923: mm_vmscan_kswapd_sleep: nid=1

# true > /sys/kernel/debug/tracing/set_event
# true > /sys/kernel/debug/tracing/trace
# echo super_cache_scan > /sys/kernel/debug/tracing/set_graph_function
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# cat /sys/kernel/debug/tracing/trace > trace.002
# fgrep \$ trace.002 | tail -20

 13) $ 1918769 us  |  }
  9) $ 1952622 us  |  }
  8) $ 2503877 us  |} /* xfs_reclaim_inodes_ag */
  8) $ 2503888 us  |  } /* xfs_reclaim_inodes_nr */
  8) $ 2503889 us  |} /* xfs_fs_free_cached_objects */
  8) $ 2503939 us  |  } /* super_cache_scan */
  5) $ 2119727 us  |}
  5) $ 2119727 us  |  }
  5) $ 2119729 us  |}
  5) $ 2119732 us  |  }
  5) $ 2119732 us  |}
  5) $ 2119744 us  |  }
  5) $ 2178135 us  |}
  5) $ 2228517 us  |  }
  5) $ 2228518 us  |}
  5) $ 2229533 us  |  }
 14) $ 18086849 us |} /* xfs_reclaim_inodes_ag */
 14) $ 18086858 us |  } /* xfs_reclaim_inodes_nr */
 14) $ 18086858 us |} /* xfs_fs_free_cached_objects */
 14) $ 18086907 us |  } /* super_cache_scan */

root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
MemTotal:   263572332 kB
MemFree: 2872368 kB
MemAvailable:   204193824 kB
Buffers:               2568 kB
Cached:           164931356 kB
SwapCached:               0 kB
Active:            50608440 kB
Inactive:         114468364 kB
Active(anon):        143724 kB
Inactive(anon):        1484 kB
Active(file):      50464716 kB
Inactive(file):   114466880 kB
Unevictable:              0 kB
Mlocked:                  0 kB
SwapTotal:                0 kB
SwapFree:                 0 kB
Dirty:               685232 kB
Writeback:                0 kB
AnonPages:           140452 kB
Mapped:               36220 kB
Shmem:                 1956 kB
KReclaimable:      40079660 kB
Slab:              49988268 kB
SReclaimable:      40079660 kB
SUnreclaim:         9908608 kB
KernelStack:           6368 kB
PageTables:            4912 kB
NFS_Unstable:             0 kB
Bounce:                   0 kB
WritebackTmp:             0 kB
CommitLimit:      263572332 kB
Committed_AS: 507052 kB
VmallocTotal:   34359738367 kB
VmallocUsed:  228600 kB
VmallocChunk:  0 kB