On Fri, 4 Sep 2009, Matthias Birkner wrote:

> At $work we've been having a discussion about what the right amount of
> swap is for a given amount of RAM in our standard linux image and I'm
> looking for additional input.
>
> The "old school" conventional wisdom says "swap = 2x RAM". The more
> modern conventional wisdom seems to vary from "swap = 1x RAM + 4G" to
> "swap = 4G regardless of RAM".
in part this depends heavily on the virtual memory design of the *nix system that you are using. some systems allocate a page in swap for every page of virtual address space you can ever use (including the ones that you have real memory for), and for those the total address space available to your kernel is equal to your swap size (even if it's less than your ram size). so for those, swap needs to be at least as big as ram, and bigger if you actually want to page anything out (if you had 1G of ram and 512M of swap you would only ever be able to use 512M of ram). other systems use pages of swap in addition to pages of memory, and so your total address space is swap + ram.

the current linux VM system is in the second category, so you only need as much swap as you want to allow the system to use. since swap is _extremely_ expensive to use, you don't actually want to use much, if any, in an HPC cluster.

HOWEVER, there is the issue of memory overcommit and how you choose to deal with it.

Linux frequently uses a feature called 'Copy On Write' (COW) where instead of copying a page of memory it marks the page read-only and COW, and allows multiple processes to keep accessing it. if any of the processes tries to make a change, it triggers a page fault, the kernel copies the page, and life continues. this is a HUGE win for almost all systems.

for example, if you are running firefox and it is using 1.5G of ram and you click on a pdf file, firefox downloads the file and then starts your pdf reader. to do this it first forks a copy of itself, and then executes the pdf reader. between the time that it does the fork and makes the exec call to start the pdf reader, you technically have two identical copies of firefox in ram, each needing 1.5G of ram. with COW you end up only using a few K of ram for this instead of having to really allocate and copy the 1.5G of ram.

because of this feature, you can have a lot more address space in use than you actually have memory for (with the firefox example above, COW lets you do it in 2G of ram, while without COW you would need 3.5G of ram + swap). the bad thing is that real memory can get used long after the fork or malloc completed successfully, whenever a shared page finally gets written to. that additional memory use could push you into swap or run you out of memory entirely.

by default linux allows overcommit, and if you actually run out of memory it triggers the Out Of Memory killer (OOM killer), which tries to figure out what to kill to try and keep the system running (and as with any heuristic, sometimes it works, sometimes it doesn't).

you can change this default to disable overcommit, in which case the kernel has to be able to fully back every possible COW split. if you don't have enough swap allocated to cover those splits, the system will reject the malloc, EVEN IF YOU HAVE UNUSED RAM.

so you need to either allow overcommit (which can kill processes at unexpected times when you run out of ram), or disable overcommit and have 'enough' swap (which runs the risk of pushing you into swap and bringing your system to a crawl).
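
to make the fork/exec part of that concrete, here's a rough C sketch (my own illustration, nothing from the actual firefox code; the 512M buffer and /bin/true are just stand-ins for firefox and the pdf reader). the parent allocates and touches a big buffer, forks, and the child immediately execs something else. with COW the fork itself copies almost nothing, but the kernel still has to decide whether to promise that it could back a full second copy of that buffer if both processes started writing to it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    size_t big = 512UL * 1024 * 1024;   /* pretend this is firefox's 1.5G */
    char *buf = malloc(big);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    memset(buf, 'x', big);              /* touch it so it is really resident */

    pid_t pid = fork();                 /* COW: no copy of buf is made here */
    if (pid == 0) {
        /* child: replace ourselves with something tiny; buf is simply dropped */
        execl("/bin/true", "true", (char *)NULL);
        _exit(127);                     /* only reached if the exec fails */
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}

watch it in top, or run it under /usr/bin/time -v, and you'll see the fork doesn't come close to doubling the resident memory.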
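and a second rough sketch (again mine, purely illustrative) of what the overcommit knob changes. the real setting is vm.overcommit_memory (0 = the default heuristic, 1 = always overcommit, 2 = don't overcommit, with the limit controlled by vm.overcommit_ratio). the program below grabs address space in chunks without ever writing to it:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f != NULL) {
        int mode = 0;
        if (fscanf(f, "%d", &mode) == 1)
            printf("vm.overcommit_memory = %d\n", mode);
        fclose(f);
    }

    /* grab address space in 256M chunks but never write to it (so nothing
     * ever has to back it), stopping at 64G just to keep the demo bounded.
     * with overcommit on (mode 0 or 1) this usually sails far past ram +
     * swap.  with overcommit off (mode 2) the mallocs start failing once
     * the commit limit (swap + ram * overcommit_ratio%) is reached, even
     * if the machine still has plenty of free ram.  we leak on purpose;
     * the process exits right after. */
    size_t chunk = 256UL * 1024 * 1024;
    unsigned long long total = 0;       /* bytes of address space obtained */
    while (total < (64ULL << 30) && malloc(chunk) != NULL)
        total += chunk;

    printf("got %llu G of address space before malloc said no\n", total >> 30);
    return 0;
}

if you watch CommitLimit and Committed_AS in /proc/meminfo while it runs you can see the accounting happen. obviously don't do this on a production node; under mode 2 it will chew up the commit limit until it exits.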
personally, I choose to leave overcommit on, and have a small amount of swap, no matter how much ram I have. for historical reasons (since 2G used to be the limit on swap partition size), I have fallen into the habit of creating a 2G swap partition on all my systems.

If I was going to change it, I would probably shrink it down (by the time a system is using 512M to 1G of swap, it's probably slowed to unusable levels anyway, and I would just as soon have the system crash so that my clustering HA solution can kick in instead).

David Lang

> So if you're running/managing a Linux HPC cluster, or you have strong
> opinions on the subject, or you just want to comment :), I'd love to hear
> your thoughts.
>
> Some info about our environment... We have several HPC clusters scattered
> around the globe with anywhere from 100 to somewhat over 1000 systems in
> each cluster. Workload in the clusters is managed using LSF and typically
> they are configured to have one job-slot per cpu. The memory configs in
> each system range from 4G RAM up to 512G. Not sure if the OS version
> matters but in case it does, we're primarily running RHEL4u5 and starting
> a migration to RHEL5u3.
>
> Thanks much,
> Matt
>
> ===========================================================
> "If they are the pillars of our community,
>  We better keep a sharp eye on the roof."
> ===========================================================

_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
