Re: [lopsa-tech] Swap sizing in Linux HPC cluster nodes.

david Sun, 06 Sep 2009 10:37:09 -0700

On Sun, 6 Sep 2009, Yves Dorfsman wrote:

> [email protected] wrote:
>>
>
> > note that there is a flag that the backup software should be using to tell
> > the system that it's not going to be accessing this data again.
>
> Which flag, on which function ?
> At the end of the day, aren't all functions reading from disk mapping to a
> read(3) ?


sorry I can't pinpoint it more (I don't do much C programming nowadays), but
I have seen it mentioned on many kernel threads where people have
complained about this behavior. it's an O_ something flag. I'll do a
little digging and see if I can find it. I believe that what actually
happens is that the pages still go into the cache, but are inserted at the
other end of the (normally) LRU queue so that they are the first to be
discarded when memory is needed (including by the same process reading the
next batch of pages from disk).
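
One hint of this kind that does exist is posix_fadvise(2) with
POSIX_FADV_DONTNEED. Whether that is the flag meant above is an assumption,
and on Linux it asks the kernel to drop the cached pages for a range rather
than reorder the LRU queue. A minimal Python sketch of a backup-style reader
using it (the function name and the 1 MiB chunk size are made up for
illustration):

```python
import os

CHUNK = 1024 * 1024  # illustrative chunk size: 1 MiB per read


def read_without_keeping_cache(path):
    """Read a file start to finish, advising the kernel after each chunk
    that the pages just read will not be needed again.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Hint that access will be sequential (length 0 means "to end of file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            data = os.read(fd, CHUNK)
            if not data:
                break
            # A real backup tool would hand `data` to its destination here.
            total += len(data)
            # Advise that this byte range will not be accessed again.
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

The advice is only a hint: the kernel is free to ignore it, and the read
itself still goes through the page cache.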

>>> in, once you are done the machine seems frozen for a few minutes (all the
>>> apps were idle, even without any i/o activity, the memory for the apps got
>>> freed up).
>>
>> what else is running on the system that is asking for memory? the kernel
>> won't throw away memory unless something else is asking for it.
>
> I agree with you, but my understanding is that with a high value for
> swappiness the kernel will swap out processes in order to make space for the
> file system cache. Look at this test:
> http://lwn.net/Articles/100978/
>
> Just doing dd's they get the vm to swap memory out, which confirms my
> understanding of it.
>
> If I am right, then to obtain the result you are talking about ("the kernel
> won't throw away memory unless something else is asking for it"), you need
> to set swappiness to zero.

by doing the dd you are asking for memory implicitly.

the kernel is trying to balance the need for memory to hold several
different categories of things:

1. pages of code that it can re-read from the binary on disk

2. pages of application generated data that it can swap out

3. pages of data read from disk that may be used again (disk read cache)

4. pages of data being written to disk (disk write cache, doesn't go to 
swap when written, but there are knobs to adjust how quickly and how hard 
the kernel works to write these pages out, after which they become clean 
cache pages like the read cache)

5. pages of kernel generated data that it can swap out
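
To see roughly how memory is split between these categories on a running
system, /proc/meminfo can be read. The sketch below parses it and picks out
a few fields. The mapping from field names to the categories above is loose
and is my assumption, and the function names are illustrative:

```python
# Fields of /proc/meminfo that loosely correspond to the categories above:
#   Mapped     file-backed pages mapped into processes (includes program code)
#   AnonPages  anonymous application data, which can go to swap
#   Cached     page cache (data read from or written to files)
#   Dirty      cached pages waiting to be written to disk
#   Writeback  pages currently being written to disk
FIELDS = ("Mapped", "AnonPages", "Cached", "Dirty", "Writeback")


def parse_meminfo(text):
    """Turn 'Name:   1234 kB' lines into a {name: number} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def summarize(values):
    """Keep only the fields listed in FIELDS that are present."""
    return {name: values[name] for name in FIELDS if name in values}


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        for name, kb in summarize(parse_meminfo(f.read())).items():
            print(f"{name:10s} {kb:12d} kB")
```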

_many_ programs nowadays are huge, but when people are using them
they seldom use more than a tiny fraction of the capabilities included
(and therefore seldom touch a large portion of the code). keeping all that
unused code in memory can significantly reduce the amount of memory left
to cache the data that you actually use.

yes, if too much gets swapped out (and especially if your disk is
extremely slow, like a laptop's), you can suffer when it gets swapped back
in, but usually you don't have to pull much in at any time.

>>
> >> The only case I can think of swappiness > 0 making any sense is if you
> >> start a lot of applications, but only use a few, and do not change from
> >> apps to apps very often.
> >
> > I don't think 0 is the right value, but for a long time the kernel did
> > default to a much too high value, within the last year or so the default
> > was greatly reduced.
>
> This is the latest (2.6.31) kernel from L. Torvalds, and it still has
> swappiness=60
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=mm/vmscan.c;h=94e86dd6954c295830478011fd8e71465f1a9f2d;hb=e07cccf4046978df10f2e13fe2b99b2f9b3a65db#l127
>
> So which distribution do you use (I use Fedora 10, Ubuntu 9.10 and CentOS
> 5.3, they leave the default of 60) ? What value do they put by default for
> swappiness ?

I could be mixing up the swappiness value and the writeback aggressiveness
values. I know that one of them changed recently.

> Anyway, I have been setting swappiness to zero by default on all the systems
> I take care of for a few years, so if there is a reason why I should not,
> I'd love to hear it.
>
> Why do you set it at a value different than zero (what is the expected
> outcome, how is it different than if it were set at zero) ?
>
> If neither 0 nor 60 are the right values, what is the right value ? If it
> depends on the load, how do you make an objective decision ?

since part of what is involved here is your particular workload and
preferences (do you prefer to be faster most of the time at the cost of
occasional slowdowns, or are you willing to be a little slower all the
time, but not have the hiccups) I think it's like every other tuning
parameter: there is no one right answer for everyone.

some of the issues that you have run into (the backup pushing things into 
swap) can be addressed in a way that will do what you want, but for other 
things it's not nearly as clear.

with the default at 60 and you setting it to 0, I would suggest trying it
set at 10 or 20 and seeing if you notice any difference. if you like the
change, keep tinkering, if you hate the change switch it back (I suspect
that going to a low value will make little noticeable difference, but I
could easily be wrong).
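
For experimenting with values like these, the setting is exposed as
/proc/sys/vm/swappiness (the same knob as `sysctl vm.swappiness`). A small
sketch for reading and changing it follows. The helper names are
illustrative, writing needs root, and the change does not survive a reboot
unless it is also put in /etc/sysctl.conf. The 0 to 100 range check is an
assumption that matches kernels of this era; newer kernels accept larger
values.

```python
from pathlib import Path

SWAPPINESS = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def set_swappiness(value, path=SWAPPINESS):
    """Write a new swappiness value (needs root for the real /proc file)."""
    if not 0 <= value <= 100:  # assumed range, see the note above
        raise ValueError("swappiness must be between 0 and 100")
    Path(path).write_text(f"{value}\n")


if __name__ == "__main__":
    print("current swappiness:", read_swappiness())
    # As root, to try one of the suggested values:
    # set_swappiness(10)
```

The `path` argument exists only so the helpers can be pointed at a scratch
file for testing instead of the live kernel setting.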

there are some applications (like firefox) that appear to have memory
leaks in them. saying that application data should _never_ be swapped out
means that the leaked memory directly fights with disk caches. if
instead it gets swapped out you probably never need to swap it back in
again (at least before it's time to shut down), so that would be a case
where swappiness of 0 would hurt you.

David Lang
_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
