Re: Swap oddities

2007-10-30 Thread Rob van der Heij
On 10/30/07, Marcy Cortes [EMAIL PROTECTED] wrote:

 huge 15 IFL at peak app :)... We've had our share of these and of
 badly written code there before...  The generational GC, new with 6.1,
 seems to be a *phenomenal* difference (and I don't say that lightly,
 never believing that perf knobs at the app level save you much of anything at,

From what I read about the controls of modern GC in the JVM (not sure
this is all new with 6.1), some of it should also be beneficial for
Linux on z/VM, because it enforces some locality of reference.

Note that you do need sufficient free swap space for Linux to feel
secure that it can allow the next request for virtual memory, even though
the process may never really use it all. That's why you may be unable
to satisfy a request for virtual memory even when there is still free
memory and free swap space. An unused VDISK of 2G (or even a few of them)
is a cheap way to provide that headroom.
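
For completeness, here is a rough sketch of what that looks like from the
Linux side.  The device number, DASD name and priority are placeholders (not
anything from this thread), and it assumes the VDISK is already formatted
and partitioned:

  # How much virtual memory Linux has promised versus what it can back
  grep -E 'CommitLimit|Committed_AS|SwapTotal|SwapFree' /proc/meminfo

  # Bring a spare VDISK online and add it as low-priority swap, so the
  # commit accounting has headroom even if the space is rarely written to
  chccwdev -e 0.0.0300           # example device number
  mkswap /dev/dasdz1             # example device name
  swapon -p 0 /dev/dasdz1        # low priority: used after the other devices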

Rob
--
Rob van der Heij
Velocity Software, Inc
http://velocitysoftware.com/



Re: Swap oddities

2007-10-29 Thread Vic Cross
On Sun, 28 Oct 2007 08:41:16 am Marcy Cortes wrote:

 So, if I'm understanding right, those would be dirty pages no longer
 needed hanging out there in swap?

That's right -- but you'll get arguments on the definition of "no longer 
needed".  Having sent a page to the swap device, Linux will keep it out there 
even if the page gets swapped back in.  The reason: if the page needs to be 
swapped out again, and it wasn't modified while it was swapped back in, you save an 
I/O (so the claim is that it's not that the page is "no longer needed", it's that 
it's not needed right now but might be again soon).

I read about this and other interesting behaviours at http://linux-mm.org -- 
it seems that the operation of Linux's memory management has generated enough 
discussion for someone to start a wiki on it. :)

The real issue in terms of VDISK is that even if we could eliminate the "keep 
it in case we need it" behaviour of Linux, there's no way for Linux to inform 
CP that a page of a VDISK is no longer needed and can be de-allocated.  Even 
doing swapoff/swapon with an intervening mkswap, or even chccwdev'ing the thing 
offline from Linux and back on again, won't tell CP that it can flush the disk -- 
AFAIK, only a DELETE/DEFINE would do it.
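
For reference, the cycle described above amounts to roughly this (the device
name is a placeholder); it hands the space back to Linux, but as noted, CP
still keeps the VDISK pages it has already allocated:

  swapoff /dev/dasdz1            # migrate any in-use pages elsewhere
  mkswap  /dev/dasdz1            # rewrite the swap signature
  swapon  -p 10 /dev/dasdz1      # re-enable at its old priority (example value)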

 I thought the point of the prioritized 
 swap was that it'd keep reusing those on the highest numbered disks
 before starting down to the next disk.  It was well into the 3rd disk
 (they are like 250M, 500M, 1G, 1G).   (At least I think it used to work
 that way!)  Could there be a Linux bug here?

From what I've seen, Linux is working as designed, unfortunately.  The 
hierarchy of swap devices was a theory (tested by others much more skilled 
and equipped than me, even though I drew the funny pictures of it in the 
ISP/ASP Redbook).  Regardless, it was only meant as an indicator of how big 
your *central storage* needs to be: as soon as the guest touched the second 
disk, that was a flag to increase the central.  (Can't increase central?  Divide 
the workload across a number of guests.)  Ideally you *never* want to swap; 
having a swap device that's almost as fast as memory helps mitigate the cost 
of swapping, but using that fast swap is not a habit to keep up.
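
For what it's worth, the prioritized layout under discussion might look
roughly like this in /etc/fstab.  The device names are placeholders and the
priority values are arbitrary; only their ordering matters (higher number =
used first):

  /dev/dasdb1  swap  swap  pri=40  0 0   # 250M VDISK -- filled first
  /dev/dasdc1  swap  swap  pri=30  0 0   # 500M VDISK
  /dev/dasdd1  swap  swap  pri=20  0 0   # 1G VDISK
  /dev/dasde1  swap  swap  pri=10  0 0   # 1G VDISK -- last resort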

It's also quite possible that your smaller devices became fragmented and 
unable to satisfy a request for a large number of contiguous pages.  Such 
fragmentation would make it ever more likely that the later devices would get 
swapped-onto as your uptime wore on.

 Seems like vm.swappiness=0 (or at least a lower number than the default
 of 60) would be a good setting for Linux under VM.  Has anyone studied
 this?

/proc/sys/vm/swappiness was introduced with kernel 2.6 [1].  The doco suggests 
that using swappiness=0 makes the kernel behave like it used to in the 2.4 
(and earlier) days -- sacrifice cache to reduce swapping.  I have seen SLES 9 
systems (with 2.6 kernels) appear to use far more memory than equivalent SLES 
8 systems (kernel 2.4), so from experience a low value is useful for the z/VM 
environment [2].
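
A minimal sketch of trying a lower value (the 10 is purely illustrative,
not a recommendation from this thread):

  sysctl -w vm.swappiness=10                      # takes effect immediately
  echo 'vm.swappiness = 10' >> /etc/sysctl.conf   # survives a reboot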

CMM is meant to be the remedy to all of this of course.  Now we can give all 
our Linux guests a central storage allocation beyond their wildest dreams 
(I'm kidding), and let VMRM handle the dirty work for us.  I could imagine 
that we could be a bit more relaxed about our vm.swappiness value then -- we 
still don't want each of our penguins to buffer up its disks, but perhaps the 
consequences aren't as severe when allocations are more fluid and more 
effective sharing is taking place[3].  Unfortunately I haven't used CMM in 
anger as I'm a little light on systems to play with nowadays.

Cheerio,
Vic Cross

[1] Swappiness controls the likelihood that a given page of memory will be 
retained as cache when the kernel needs memory -- it's a range from 100 (cache 
pages are preserved and non-cache pages are swapped out to satisfy the 
request) down to 0 (cache pages are dropped to free memory to satisfy the 
request).
[2] If only to preserve the way that we used to tune our guests prior to 
2.6. :)
[3] We might even be able to do the Embedded Linux thing and disable swapping 
entirely!




Re: Swap oddities

2007-10-29 Thread Marcy Cortes
Interesting!  Thanks for your response Vic.
I'm not sure it is working as designed.  Eventually, when we use up our
swap, WAS crashes OOM (that's *our* real issue, at least our biggest one
anyway :).  But if we are able to swapoff/swapon and recover that space
without crashing WAS, that kind of says to me that it didn't need it
anyway -- of course I haven't tried that while workload was running
through...  Maybe it is destructive.

We plan to experiment some with vm.swappiness and see if that helps.
At the very least, I guess we can add enough VDISKs and enough VM paging
packs to get through the week without a recycle until we figure this out,
as long as response time & CPU savings remain this good with 6.1.


Marcy Cortes 
 

Re: Swap oddities

2007-10-29 Thread Vic Cross
On Tue, 30 Oct 2007 06:19:33 am Marcy Cortes wrote:
 I'm not sure it is working as designed.

I never said it was a good design -- and perhaps I should have read your
earlier messages prior to saying that. :)  It does depend on your point of
view though -- it's another one of those aspects that betrays Linux's
single-system, non-resource-sharing heritage.  In a non-shared environment,
keeping swap pages hanging around on disk is a good design point, in that it
can realistically save costly I/O.  It's not so good for us though.  :)

 Eventually, when we use up our
 swap, WAS crashes OOM (that's *our* real issue, at least our biggest one
 anyway :).

Yes... and that's not going to be solved by CMM or creating different swap
VDISKs or anything like that.  The earlier hints about JVM heap size and
garbage collection and so on will be useful here.  I guess the application is
being checked for leaks as well -- or do your developers write perfect code
first-time-every-time too? ;-P

 But if we are able to swapoff/swapon and recover that space
 without crashing WAS, that kind of says to me that it didn't need it
 anyway - of course I haven't tried that while workload was running
 through...  Maybe it is destructive.

It might be, but as long as your Linux has more free virtual memory than the
number of pages in use on the device you want to remove, you *should* be able
to do a swapoff without impact (things might get a little sluggish for a few
seconds while kswapd shuffles things around, though).  It would be nice to be
able to tell accurately just how much swap space is being used on a
device -- /proc/meminfo is system-wide.  SwapCached in /proc/meminfo is a
helpful indicator that counts the swap space hanging around (you could try
http://www.linuxweblog.com/meminfo among heaps of other places for more info
about what the numbers from meminfo mean); if this number is low compared to
your total available swap, then you're not likely to get much benefit from
swapoff/swapon cycles.
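
One complementary source is /proc/swaps, which does give a per-device
breakdown of swap size and usage, although it doesn't separate out the
kept-just-in-case pages that SwapCached counts system-wide.  A small sketch:

  cat /proc/swaps                                  # per-device Size/Used/Priority
  grep -E 'SwapTotal|SwapFree|SwapCached' /proc/meminfo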

 We plan to experiment some with vm.swappiness and see if that helps.
 I guess at the very least, we can add enough VDISKs and enough VM paging
 packs to get through the week without a recycle until we figure this out as
 long as response time & CPU savings remain this good with 6.1.

Good plan, although vm.swappiness is only likely to delay your swap usage
rather than eliminate it entirely (if something is asking for that much
memory, at some point it's going to have to get it from somewhere).  Of
course, if it delays heavy swapping long enough to get you through the week,
then that's a win.

While you've got this WAS issue you are *possibly* justified in throwing a
DASD swap device at the end of your line of VDISKs (I emphasise *possibly*
because I don't want to offend Rob et al too much).  Perhaps the last thing
you want would be to just keep adding VDISKs and VM page packs until your VM
paging system is consumed by leaked Linux memory.  You could do a nightly
swapoff/swapon of some of the VDISKs to flush things out and reduce the
activity to the DASD swap.  I guess what I'm saying is that you could think
of this WAS problem as an aberration rather than the normal operating mode
for your system -- don't jeopardise your entire environment for the sake of
one problem system, and be prepared to let best practice slide a bit while
you get the issue sorted.  Of course you're in a much better position than me
to decide whether your paging environment needs such protection.
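
If you did go the nightly-flush route, the job might look something like
the sketch below.  The device names and priority are placeholders, and it
deliberately skips a device when swapoff fails (for example, when there
isn't enough free memory to absorb its pages):

  #!/bin/sh
  # Flush each VDISK swap device in turn so stale pages don't accumulate
  # and spill over onto the DASD swap device.
  for dev in /dev/dasdb1 /dev/dasdc1 /dev/dasdd1; do
      swapoff "$dev" || continue   # skip if its pages can't be absorbed now
      mkswap "$dev" > /dev/null
      swapon -p 10 "$dev"          # placeholder priority; reuse your pri= values
  done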

I also transposed my client's problem onto your shop -- I thought you were
concerned about the number of pages allocated to VDISKs.  That's why I
mentioned the stuff about DELETE/DEFINE of your VDISK swaps.

Best of luck with the issue!

Cheerio,
Vic Cross




Re: Swap oddities

2007-10-29 Thread Marcy Cortes
 
Thanks Vic!

JVM heap size and garbage collection seem to be under control.  Believe
me, this is well looked at by both us and IBM's finest, since it is a
huge 15 IFL at peak app :)... We've had our share of these and of
badly written code there before...  The generational GC, new with 6.1, seems
to be a *phenomenal* difference (and I don't say that lightly, never
believing that perf knobs at the app level save you much of anything at all),
saving 13% in CPU and shaving 100ms off of response times (on a 500ms
response time transaction).  Unbelievable...  So I gotta believe they
are close to as good as it gets for your average programmer... Of course
there's always the outlier/different transaction that could be coming in
and gumming up the whole system...  And of course WAS support says
all of their leaks are fixed now :))  (and there are some significant
ones there apparently in fixpacks earlier than the current 11, if you study
the WAS 6 support site :)..  They are saying this is a native memory leak,
not in the JVM heap, so the tracing that is needed is totally invasive and
therefore nearly impossible in our env.  And there is a possibility
that it will *stabilize* at maybe 3-5 Gig, thus telling us what the
virtual memory size should be (hard to believe when it was so happy,
even overcommitted and probably needing 1.3G, on WAS 5/SLES 8 at 1.5
Gig, but you know, 64-bit is bigger :) ).  We're leaving some up with larger
swap sizes now, pending the stabilization or near-crash, whichever
comes first.


Swapoff does appear to cause some long pauses, so we can't do that in
production :(  Can't afford to lose even 1 second, because that results
in ATMs not reaching back-end systems...  We recycle weekly anyway for
DR reasons... so for now, we just need to make it 7 days without loss
of response time.

It probably does make more sense to keep adding VDISKs & VM paging
volumes rather than dedicated disk for swapping.  At least they'll all
share that way (clustered app with a few servers on each LPAR)...

Now, on the other hand, in our test environment, with probably 35 out of 100
guests running WAS 6, it becomes not an aberration but the norm for the load
we have there, unfortunately.  Luckily the paging system is so robust (I
think we hit 20K per sec to DASD in today's Monday morning fun).  More
experimentation is definitely needed there!



Marcy Cortes 

Swap oddities

2007-10-27 Thread Marcy Cortes
OK, this test system is SLES 9 s390x (2.6.5-7.286-s390x) running WAS 6.1,
currently idle.

It was using 901MB of swap space (on 4 VDISKs) and about 217M of cache.
I decided to see if I could make the cache go away: I set vm.swappiness=0,
then did swapoff / swapon one VDISK at a time.
It cleared up the swap space, taking it down to 69MB total.
It also took the buffers and cache down to about 110MB total.
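
For reference, that sequence amounts to roughly the following; the device
names and priorities are placeholders:

  sysctl -w vm.swappiness=0
  swapoff /dev/dasdb1 && swapon -p 40 /dev/dasdb1   # one VDISK at a time,
  swapoff /dev/dasdc1 && swapon -p 30 /dev/dasdc1   # re-adding each at its
  swapoff /dev/dasdd1 && swapon -p 20 /dev/dasdd1   # original priority
  swapoff /dev/dasde1 && swapon -p 10 /dev/dasde1   # (values illustrative)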

So, if I'm understanding right, those would be dirty pages no longer
needed hanging out there in swap?  I thought the point of the prioritized
swap was that it'd keep reusing those on the highest numbered disks
before starting down to the next disk.  It was well into the 3rd disk
(they are like 250M, 500M, 1G, 1G).   (At least I think it used to work
that way!)  Could there be a Linux bug here?

Seems like vm.swappiness=0 (or at least a lower number than the default
of 60) would be a good setting for Linux under VM.  Has anyone studied
this?

1 of 2   LINUX UCD Memory Analysis Report   NODE LNX132   LIMIT 50   2094 6784A

                    ---------Storage sizes in MegaBytes---------
                    --Real Storage-- Over- ------SWAP Storage------ Total
Time     Node        Total Avail Used head Total Avail  Used   MIN  Avail
-------- --------   ------ ----- ---- ---- ----- ----- ----- -----  -----
15:14:00 LNX132     1464.7   7.3 1457 1346  3662  3592  69.8  15.6   3600
15:13:00 LNX132     1464.7   7.9 1457 1346  3662  3592  69.8  15.6   3600
15:12:00 LNX132     1464.7   6.0 1459 1351  3662  3592  69.9  15.6   3598
15:11:00 LNX132     1464.7   5.6 1459 1346  3662  3593  69.2  15.6   3599
15:10:00 LNX132     1464.7   7.6 1457 1346  3662  3588  74.4  15.6   3595
15:09:00 LNX132     1464.7   8.2 1456 1349  3662  3588  74.4  15.6   3596
15:08:00 LNX132     1464.7   3.5 1461 1351  3449  3359  90.2  15.6   3363
15:07:00 LNX132     1464.7   6.3 1458 1342  3536  3382 153.3  15.6   3389
15:06:00 LNX132     1464.7   6.9 1458 1293  3637  3408 229.1  15.6   3415
15:05:00 LNX132     1464.7   7.0 1458 1279  3662  3409 253.0  15.6   3416
15:04:00 LNX132     1464.7   6.7 1458 1278  2197  1944 253.0  15.6   1951
15:03:00 LNX132     1464.7   5.4 1459 1255  2395  1953 441.9  15.6   1958
15:02:00 LNX132     1464.7   7.0 1458 1237  2686  2059 626.3  15.6   2066
15:01:00 LNX132     1464.7   5.9 1459 1210  2739  2059 679.7  15.6   2065
15:00:00 LNX132     1464.7   4.8 1460 1183  2890  2059 830.5  15.6   2064
14:59:00 LNX132     1464.7   6.0 1459 1173  3662  2760 901.6  15.6   2766
14:58:00 LNX132     1464.7   7.2 1457 1169  3662  2760 901.6  15.6   2768
14:57:00 LNX132     1464.7   5.5 1459 1169  3662  2760 901.6  15.6   2766
14:56:00 LNX132     1464.7   8.6 1456 1169  3662  2760 901.6  15.6   2769
14:55:00 LNX132     1464.7   8.9 1456 1169  3662  2760 901.6  15.6   2769
14:54:00 LNX132     1464.7   8.9 1456 1169  3662  2760 901.6  15.6   2769
14:53:00 LNX132     1464.7   6.2 1458 1169  3662  2760 901.6  15.6   2767
14:52:00 LNX132     1464.7   6.3 1458 1169  3662  2760 901.6  15.6   2767
14:51:00 LNX132     1464.7   6.4 1458 1169  3662  2760 901.6  15.6   2767
14:50:00 LNX132     1464.7   6.4 1458 1169  3662  2760 901.6  15.6   2767
14:49:00 LNX132     1464.7   6.6 1458 1169  3662  2760 901.6  15.6   2767
14:48:00 LNX132     1464.7   6.7 1458 1169  3662  2760 901.6  15.6   2767
14:47:00 LNX132     1464.7   7.5 1457 1169  3662  2760 901.6  15.6   2768
14:46:00 LNX132     1464.7   6.8 1458 1169  3662  2760 901.6  15.6   2767
14:45:00 LNX132     1464.7   6.4 1458 1170  3662  2760 901.6  15.6   2767
14:44:00 LNX132     1464.7   6.6 1458 1169  3662  2760 901.6  15.6   2767
14:43:00 LNX132     1464.7   6.7 1458 1169  3662  2760 901.6  15.6   2767
14:42:00 LNX132     1464.7   6.8 1458 1170  3662  2760 901.6  15.6   2767

2 of 2   LINUX UCD Memory Analysis Report   NODE LNX132   LIMIT 50   2094 6784A

                    ------Storage sizes (in Megabytes)------
                    --Real Storage--  --Storage in Use--  Error
Time     Node        Total Avail Used  Shared Buffer Cache  Message
-------- --------   ------ ----- ----  ------ ------ -----  -------
15:14:00 LNX132     1464.7   7.3 1457     0.0    6.4 105.3
15:13:00 LNX132     1464.7   7.9 1457     0.0    6.2 105.0
15:12:00 LNX132     1464.7   6.0 1459     0.0    5.7 102.2
15:11:00 LNX132     1464.7   5.6 1459     0.0    4.9 107.8
15:10:00 LNX132     1464.7   7.6 1457     0.0    5.1 106.0
15:09:00 LNX132     1464.7   8.2 1456     0.0    4.8 103.1
15:08:00 LNX132     1464.7   3.5 1461     0.0    5.2 105.3
15:07:00 LNX132     1464.7   6.3 1458     0.0    4.6 111.5
15:06:00 LNX132     1464.7   6.9 1458     0.0   17.7 147.5
15:05:00 LNX132     1464.7   7.0 1458     0.0   20.6 157.8
15:04:00 LNX132     1464.7   6.7 1458     0.0   20.5 159.1
15:03:00 LNX132     1464.7   5.4 1459     0.0   27.5 176.4
15:02:00 LNX132     1464.7   7.0 1458     0.0   32.7 187.7
15:01:00 LNX132     1464.7   5.9