Re: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory
Following up on this...

On one of the nodes on which I'd bounced the pagepool around I managed to cause what appeared to that node as filesystem corruption (I/O errors and fsstruct errors) on every single fs. Thankfully none of the other nodes in the cluster seemed to agree that the fs was corrupt. I'll open a PMR on that but I thought it was interesting nonetheless. I haven't run an fsck on any of the filesystems but my belief is that they're OK since so far none of the other nodes in the cluster have complained.

Secondly, I can see the pagepool allocations that align with registered verbs MRs (looking at mmfsadm dump verbs). In theory one can free an IB MR after registration as long as it's not in use, but one has to track that and I could see that being a tricky thing. (Given that GPFS has its own page allocator it might be relatively trivial to figure out, but it might also require re-establishing RDMA connections depending on whether or not a given QP is associated with a PD that uses the MR being freed...I think that makes sense.)

Anyway, I'm wondering if the need to free the IB MR on pagepool shrink could be avoided altogether by limiting the amount of memory that gets allocated to verbs MRs (e.g. something like verbsPagePoolMaxMB) so that those regions never need to be freed but the amount of memory available for user caching could grow and shrink as required. It's probably not that simple, though :)

Another thought I had was doing something like creating a file in /dev/shm, registering it as a loopback device, and using that as an LROC device. I just don't think that's feasible at scale given the current method of LROC device registration (e.g. via the mmsdrfs file).

I think there's much to be gained from the ability to dynamically change the memory-based file cache size on a per-job basis so I'm really hopeful we can find a way to make this work.

-Aaron

On 2/25/18 11:45 AM, Aaron Knister wrote:

Hmm...interesting.
It sure seems to try :) The pmap command was this:

pmap $(pidof mmfsd) | sort -n -k3 | tail

-Aaron

On 2/23/18 9:35 AM, IBM Spectrum Scale wrote:

AFAIK you can increase the pagepool size dynamically but you cannot shrink it dynamically. To shrink it you must restart the GPFS daemon. Also, could you please provide the actual pmap commands you executed?

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.

If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.

From: Aaron Knister
To:
Date: 02/22/2018 10:30 PM
Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory
Sent by: gpfsug-discuss-boun...@spectrumscale.org

This is also interesting (although I don't know what it really means).
Looking at pmap run against mmfsd I can see what happens after each step:

# baseline
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
0200 1048576K 1048576K 1048576K 1048576K 0K rwxp [anon]
Total: 1613580K 1191020K 1189650K 1171836K 0K

# tschpool 64G
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
0200 67108864K 67108864K 67108864K 67108864K 0K rwxp [anon]
Total: 67706636K 67284108K 67282625K 67264920K 0K

# tschpool 1G
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
02000140 139264K 139264K 139264K 139264K 0K rwxp [anon]
020fc940 897024K 897024K 897024K 897024K 0K rwxp [anon]
020009c0 66052096K 0K 0K 0K 0K rwxp [anon]
Total: 67706636K 1223820K 1222451K 1204632K 0K

Even though mmfsd has that 64G chunk allocated there's none of it *used*. I wonder why Linux seems to be accounting it as allocated.

-Aaron

On 2/22/18 10:17 PM, Aaron Knister wrote:
> I've been exploring the idea for a while of writing a SLURM SPANK plugin
> to allow users to dynamically change the pagepool size on a node. Every
> now and then we
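One way to spot the pattern in the pmap listings above (a mapping that is still tens of gigabytes in size but has nothing resident) is to filter the output with awk. This is only a sketch, not something from the thread: the function name unbacked_mappings and the 1 GiB default threshold are made up for illustration, and it assumes that the second field is the mapping size and the third is the resident size. That matches how the listings are read in this thread, but no column header confirms it.

```shell
# Print "address size" for mappings that are at least min_kb KiB in size
# but whose resident column is zero. Reads pmap-style lines on stdin.
# ASSUMPTION: field 2 = mapping size, field 3 = resident size, both with a
# trailing "K", as in the listings quoted in this thread.
unbacked_mappings() {
    min_kb="${1:-1048576}"   # hypothetical default threshold: 1 GiB in KiB
    awk -v min_kb="$min_kb" '
        $1 == "Total:" { next }                  # skip the summary row
        $2 ~ /^[0-9]+K$/ && $3 ~ /^[0-9]+K$/ {
            size = $2 + 0                        # numeric conversion ignores the trailing K
            rss  = $3 + 0
            if (size >= min_kb && rss == 0) print $1, $2
        }
    '
}
```

Fed the "# tschpool 1G" listing, this prints only the 66052096K mapping. Lines that do not have the size and resident columns in fields 2 and 3 are ignored, so output from a pmap invocation with a different column layout would simply produce nothing.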
Re: [gpfsug-discuss] [non-nasa source] Re: pagepool shrink doesn't release all memory
Oh, and I think you're absolutely right about the RDMA interaction. If I stop the InfiniBand service on a node and try the same exercise again, I can jump between 100G and 1G several times and the freed memory is actually released.

-Aaron

On 2/25/18 11:54 AM, Aaron Knister wrote:

Hi Stijn,

Thanks for sharing your experiences. I'm glad I'm not the only one who's had the idea (and come up empty handed).

About the pagepool and NUMA awareness, I'd remembered seeing something about that somewhere and I did some googling and found there's a parameter called numaMemoryInterleave that "starts mmfsd with numactl --interleave=all". Do you think that provides the kind of NUMA awareness you're looking for?

-Aaron

On 2/23/18 9:44 AM, Stijn De Weirdt wrote:

hi all,

we had the same idea long ago, afaik the issue we had was due to the pinned memory the pagepool uses when RDMA is enabled.

at some point we restarted gpfs on the compute nodes for each job, similar to the way we do swapoff/swapon; but in certain scenarios gpfs really did not like it; so we gave up on it.

the other issue that needs to be resolved is that the pagepool needs to be numa aware, so the pagepool is nicely allocated across all numa domains, instead of using the first ones available. otherwise compute jobs might start that only do non-local domain memory access.

stijn

On 02/23/2018 03:35 PM, IBM Spectrum Scale wrote:

AFAIK you can increase the pagepool size dynamically but you cannot shrink it dynamically. To shrink it you must restart the GPFS daemon. Also, could you please provide the actual pmap commands you executed?

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.

From: Aaron Knister
To:
Date: 02/22/2018 10:30 PM
Subject: Re: [gpfsug-discuss] pagepool shrink doesn't release all memory
Sent by: gpfsug-discuss-boun...@spectrumscale.org

This is also interesting (although I don't know what it really means). Looking at pmap run against mmfsd I can see what happens after each step:

# baseline
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
0200 1048576K 1048576K 1048576K 1048576K 0K rwxp [anon]
Total: 1613580K 1191020K 1189650K 1171836K 0K

# tschpool 64G
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
0200 67108864K 67108864K 67108864K 67108864K 0K rwxp [anon]
Total: 67706636K 67284108K 67282625K 67264920K 0K

# tschpool 1G
7fffe4639000 59164K 0K 0K 0K 0K ---p [anon]
7fffd837e000 61960K 0K 0K 0K 0K ---p [anon]
02000140 139264K 139264K 139264K 139264K 0K rwxp [anon]
020fc940 897024K 897024K 897024K 897024K 0K rwxp [anon]
020009c0 66052096K 0K 0K 0K 0K rwxp [anon]
Total: 67706636K 1223820K 1222451K 1204632K 0K

Even though mmfsd has that 64G chunk allocated there's none of it *used*. I wonder why Linux seems to be accounting it as allocated.

-Aaron

On 2/22/18 10:17 PM, Aaron Knister wrote:

I've been exploring the idea for a while of writing a SLURM SPANK plugin to allow users to dynamically change the pagepool size on a node. Every now and then we have some users who would benefit significantly from a much larger pagepool on compute nodes, but by default we keep it on the smaller side to make as much physmem available as possible to batch work.
In testing, though, it seems as though reducing the pagepool doesn't quite release all of the memory. I don't really understand it because I've never before seen memory that was previously resident become un-resident but still maintain the virtual memory allocation.

Here's what I mean. Let's take a node with 128G and a 1G pagepool. If I do the following to simulate what might happen as various jobs tweak the pagepool:

- tschpool 64G
- tschpool 1G
- tschpool 32G
- tschpool 1G
- tschpool 32G

I end up with this: mmfsd thinks there's 32G resident but 64G virt

# ps -o vsz,rss,comm -p 24397
VSZ RSS COMMAND
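The comparison described above, virtual size against resident size for mmfsd, can be reduced to a single number that is easy to log after each tschpool step. The sketch below is not from the thread: vsz_rss_gap is a hypothetical helper that reads "VSZ RSS" pairs (in KiB, as ps reports them) on stdin and prints the difference, skipping any header or non-numeric line.

```shell
# Read lines of "VSZ RSS" (KiB) on stdin and print, for each one, how many
# KiB are mapped but not resident. Lines whose first two fields are not
# plain integers (for example a "VSZ RSS COMMAND" header) are ignored.
# NOTE: vsz_rss_gap is an illustrative name, not an existing tool.
vsz_rss_gap() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
        printf "%.0f\n", $1 - $2    # %.0f avoids scientific notation for large values
    }'
}
```

A possible invocation is `ps -o vsz= -o rss= -p "$(pidof mmfsd)" | vsz_rss_gap`, where the empty `=` headers suppress the header row. A gap that stays large after the pagepool has been shrunk would correspond to the behaviour reported in this message.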