Re: [lustre-discuss] Memory Management in Lustre

2022-01-19 Thread Jeff Johnson
Ellis,

I haven't messed with it much personally, but if you look at some of
the Lustre module parameters (for example, those of the obdclass module)
you will see some options that could be of interest, like lu_cache_percent.

I'm sure a Whamcloud person might chime in with more detail.

# modinfo obdclass
filename:   /lib/modules/3.10.0-957.27.2.el7.DPC.x86_64/extra/obdclass.ko.xz
license:GPL
version:2.12.2
description:Lustre Class Driver
author: OpenSFS, Inc. 
alias:  fs-lustre
retpoline:  Y
rhelversion:7.6
srcversion: 3D7126D7BB611F089C67867
depends:libcfs,lnet,crc-t10dif
vermagic:   3.10.0-957.27.2.el7.DPC.x86_64 SMP mod_unload modversions
parm:   lu_cache_percent:Percentage of memory to be used as lu_object cache (int)
parm:   lu_cache_nr:Maximum number of objects in lu_object cache (long)
parm:   lprocfs_no_percpu_stats:Do not alloc percpu data for lprocfs stats (int)
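
I haven't tried it here, but an options line along these lines in a modprobe
config should take effect at the next module load (the file name and the
value of 10 are only examples):

# e.g. in /etc/modprobe.d/lustre.conf (file name is just an example)
options obdclass lu_cache_percent=10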

--Jeff


-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Memory Management in Lustre

2022-01-19 Thread Ellis Wilson via lustre-discuss
Hi folks,

Broader (but related) question than my current malaise with OOM issues on 
2.14/2.15:  Is there any documentation or can somebody point me at some code 
that explains memory management within Lustre?  I've hunted through Lustre 
manuals, the Lustre internals doc, and a bunch of code, but can find nothing 
that documents the memory architecture in place.  I'm specifically looking at 
PTLRPC and OBD code right now, and I can't seem to find anywhere that 
explicitly limits the amount of allocations Lustre will perform.  On other 
filesystems I've worked on there are memory pools that you can explicitly size 
with maxes, and while these may be discrete per-subsystem pools or reference 
counters drawing on a system-shared pool, I expected to see /something/ that 
might bake in limits of some kind.  I'm sure I'm just not finding it.  Any help is 
greatly appreciated.

Best,

ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] IPoIB best practises

2022-01-19 Thread Horn, Chris via lustre-discuss
Local LNet configuration can be done either via modprobe config or via 
lnetctl/yaml. We are slowly moving away from modprobe config (kernel module 
parameters) in favor of lnetctl/yaml because the latter provides more 
flexibility.
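
As a rough sketch only (the network name, interface name, and file path below
are just examples), a minimal local configuration via lnetctl looks like:

lnetctl lnet configure
lnetctl net add --net o2ib0 --if ib0
lnetctl export > /etc/lnet.conf    # save as YAML so it can be re-imported at boot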

For IB and TCP networks, every interface needs an IP address assigned.

It is okay to have multiple interfaces on the same subnet as long as you have 
appropriate ip route/rules and ARP settings in place. Otherwise the network 
stack may not actually send traffic to/from the correct interfaces, or there 
may be connection failures, etc. There was some work to do this automatically 
for TCP networks in https://jira.whamcloud.com/browse/LU-14662 . There is some 
discussion of the issue on the wiki at 
https://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node but 
I’m not sure how up-to-date that guidance is.
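
As an illustration only (the wiki page above covers the reasoning, and the
right values depend on your routing setup), the ARP knobs involved are
sysctls along these lines:

sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_announce=2
sysctl -w net.ipv4.conf.ib1.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_announce=2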

LNet/ko2iblnd only uses IPoIB for connection setup via RDMA CM. After a 
connection is established all traffic between IB peers is via RDMA protocol.

The multi-rail feature requires more than just a local LNet configuration. It 
also requires configuration of the peer table. In Lustre 2.10, this peer table 
was statically defined. In Lustre 2.11 (and later), the LNet Dynamic Peer 
Discovery feature allows LNet to create the peer table dynamically.

Chris Horn

From: lustre-discuss  on behalf of Åke 
Sandgren 
Date: Monday, January 17, 2022 at 1:10 AM
To: Lustre discussion 
Subject: Re: [lustre-discuss] IPoIB best practises


On 1/17/22 2:36 AM, Angelos Ching via lustre-discuss wrote:
> Hi Eli,
>
> Yes & no; part of my info is a bit rusty because I carried it over from
> around version 2.10. MR is now turned on by default.
>
> But you'll need to have an IP set up on each IPoIB interface, and all the
> ib0 and all the ib1 interfaces should be in different subnets. E.g.: all ib0
> on 192.168.100.0/24 and all ib1 on 192.168.101.0/24

The multirail setup we have is that both ib0 and ib1 are on the same
subnet, that's how DDN configured it for us.

ip a s ib0 | grep inet
inet 172.27.1.30/24 brd 172.27.1.255 scope global ib0
ip a s ib1 | grep inet
inet 172.27.1.50/24 brd 172.27.1.255 scope global ib1

and the modprobe config is

options lnet networks="o2ib1(ib0,ib1)"

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-19 Thread Patrick Farrell via lustre-discuss
Ellis,

As you may have guessed, that function set just looks like a node which is 
doing buffered I/O and thrashing for memory.  No particular insight is 
available from the count of functions there.

Would you consider opening a bug report in the Whamcloud JIRA?  You should have 
enough for a good report; here are a few things that would be helpful as well:

It sounds like you can hang the node on demand.  If you could collect stack 
traces with:

echo t > /proc/sysrq-trigger

after creating the hang, that would be useful.  (It will print to dmesg.)
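
If sysrq happens to be disabled on that node, something along these lines
should do it (paths are just examples):

echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if it is currently off
echo t > /proc/sysrq-trigger       # dump all task stacks to the kernel log
dmesg -T > /tmp/stacks.txt         # save the traces for the ticket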

You've also collected debug logs.  Could you include, say, the last 100 MiB of 
that log set?  That should be reasonable to attach if compressed.
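
Something like this should produce a manageable attachment (the file names and
the tail-based trimming are just examples):

lctl debug_kernel /tmp/lustre-debug.txt    # dump the in-memory debug buffer
tail -c 100M /tmp/lustre-debug.txt | gzip > /tmp/lustre-debug.tail.gz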

Regards,
Patrick


From: lustre-discuss  on behalf of 
Ellis Wilson via lustre-discuss 
Sent: Wednesday, January 19, 2022 8:32 AM
To: Andreas Dilger 
Cc: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O 
(2.14/2.15)


Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-19 Thread Ellis Wilson via lustre-discuss
Hi Andreas,

Apologies in advance for the top-post.  I'm required to use Outlook for work, 
and it doesn't handle in-line or bottom-posting well.

Client-side defaults prior to any tuning of mine (this is a very minimal 
1-client, 1-MDS/MGS, 2-OSS cluster):

~# lctl get_param llite.*.max_cached_mb
llite.lustrefs-8d52a9c52800.max_cached_mb=
users: 5
max_cached_mb: 7748
used_mb: 0
unused_mb: 7748
reclaim_count: 0
~# lctl get_param osc.*.max_dirty_mb
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=1938
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=1938
~# lctl get_param osc.*.max_rpcs_in_flight
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=8
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=8
~# lctl get_param osc.*.max_pages_per_rpc
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=1024
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=1024

Thus far I've reduced the following to what I felt were really conservative 
values for a 16GB RAM machine:

~# lctl set_param llite.*.max_cached_mb=1024
llite.lustrefs-8d52a9c52800.max_cached_mb=1024
~# lctl set_param osc.*.max_dirty_mb=512
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=512
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=512
~# lctl set_param osc.*.max_pages_per_rpc=128
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=128
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=128
~# lctl set_param osc.*.max_rpcs_in_flight=2
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=2
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=2

This slows down how fast I get to basically OOM from <10 seconds to more like 
25 seconds, but the trend is identical.
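
(I'm setting these live with lctl set_param for now; if any of them turn out to
help, I assume they could be made persistent from the MGS with something like
"lctl set_param -P osc.*.max_dirty_mb=512", but I haven't gone that far yet.)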

As an example of what I'm seeing on the client, you can see below that we start 
with most memory free, and then iozone rapidly (within ~10 seconds) causes all 
memory to be marked used.  That stabilizes at about 140MB free until, at some 
point, it stalls for 20 or more seconds and then some memory has been synced out:

~# dstat --mem
--memory-usage-
used  free  buff  cach
1029M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1088M 13.9G 2756k  215M
2550M 11.5G 2764k 1238M
3989M 10.1G 2764k 1236M
5404M 8881M 2764k 1239M
6831M 7453M 2772k 1240M
8254M 6033M 2772k 1237M
9672M 4613M 2772k 1239M
10.6G 3462M 2772k 1240M
12.1G 1902M 2772k 1240M
13.4G  582M 2772k 1240M
13.9G  139M 2488k 1161M
13.9G  139M 1528k 1174M
13.9G  140M  896k 1175M
13.9G  139M  676k 1176M
13.9G  142M  528k 1177M
13.9G  140M  484k 1188M
13.9G  139M  492k 1188M
13.9G  139M  488k 1188M
13.9G  141M  488k 1186M
13.9G  141M  480k 1187M
13.9G  139M  492k 1188M
13.9G  141M  600k 1188M
13.9G  139M  580k 1187M
13.9G  140M  536k 1186M
13.9G  141M  668k 1186M
13.9G  139M  580k 1188M
13.9G  140M  568k 1187M
12.7G 1299M 2064k 1197M missed 20 ticks <-- client is totally unresponsive 
during this time
11.0G 2972M 5404k 1238M^C

Additionally, I've messed with sysctl settings.  Defaults:
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500

Revised to conservative values:
vm.dirty_background_bytes = 1073741824
vm.dirty_background_ratio = 0
vm.dirty_bytes = 2147483648
vm.dirty_expire_centisecs = 200
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500

No observed improvement.

I'm going to trawl two logs today side-by-side, one with ldiskfs backing the 
OSTs, and one with zfs backing the OSTs, and see if I can see what the 
differences are since the zfs-backed version never gave us this problem.  The 
only other potentially useful thing I can share right now is that when I turned 
on full debug logging and ran the test until I hit OOM, the following were the 
most frequently hit callsites in the logs (the count is the first column).  This 
was approximately 30s of logs:

 205874 cl_page.c:518:cl_vmpage_page())
 206587 cl_page.c:545:cl_page_owner_clear())
 206673 cl_page.c:551:cl_page_owner_clear())
 206748 osc_cache.c:2483:osc_teardown_async_page())
 206815 cl_page.c:867:cl_page_delete())
 206862 cl_page.c:837:cl_page_delete0())
 206878 osc_cache.c:2478:osc_teardown_async_page())
 206928 cl_page.c:869:cl_page_delete())
 206930 cl_page.c:441:cl_page_state_set0())
 206988 osc_page.c:206:osc_page_delete())
 207021 cl_page.c:179:__cl_page_free())
 207021 cl_page.c:193:cl_page_free())
 207021 cl_page.c:532:cl_vmpage_page())
 207024 cl_page.c:210:cl_page_free())
 207075 cl_page.c:430:cl_page_state_set0())
 207169 osc_cache.c:2505:osc_teardown_async_page())
 207175 cl_page.c:475:cl_pagevec_put())
 207202 cl_page.c:492:cl_pagevec_put())
 207211 cl_page.c:822:cl_page_delete0())
 207384 osc_page.c:178:osc_page_delete())
 207422 osc_page.c:177:osc_page_delete())
 413680 cl_page.c:433:cl_page_state_set0())
 413701 cl_page.c:477:cl_pagevec_put())
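
For reference, counts like the above can be extracted from an lctl debug dump
with something along these lines (debug.txt being the dumped log file):

grep -o '[a-z_0-9]*\.c:[0-9]*:[A-Za-z_0-9]*())' debug.txt | sort | uniq -c | sort -n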

If anybody has any additional suggestions or requests for more info, don't 
hesitate to let me know.