Re: Swap oddities
On 10/30/07, Marcy Cortes [EMAIL PROTECTED] wrote:

> ... it is a huge 15 IFL at peak app :)... We've had our share there of
> these and badly written code before... The generational GC, new with
> 6.1, seems to be a *phenomenal* difference...

From what I read about the controls of modern GC in the JVM (not sure this is all new with 6.1), some of it would also be beneficial for Linux on z/VM because it will enforce some locality of reference.

Note that you do need sufficient free swap space for Linux to feel secure that it can grant the next request for virtual memory, even though the process may never really use it all. That's why you may be unable to satisfy a request for virtual memory even when there is still free memory and free swap space. An unused VDISK of 2G (or even a few of them) is a cheap way to provide that headroom.

Rob
--
Rob van der Heij
Velocity Software, Inc
http://velocitysoftware.com/
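As a concrete sketch of that "cheap headroom" idea -- the device number, block count, and Linux device name below are illustrative placeholders, not anything prescribed in this thread:

    # In the guest's CP directory entry, a 2G VDISK (4194304 512-byte
    # blocks) at virtual device 0300:
    #   MDISK 0300 FB-512 V-DISK 4194304
    # or defined on the fly from the guest:
    #   CP DEFINE VFB-512 AS 0300 BLK 4194304
    # Then in Linux, assuming the VDISK surfaces as /dev/dasdb:
    chccwdev -e 0.0.0300        # bring the FBA device online
    mkswap /dev/dasdb           # initialise a swap signature on it
    swapon -p 10 /dev/dasdb     # low priority: headroom, rarely touched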
Re: Swap oddities
On Sun, 28 Oct 2007 08:41:16 am Marcy Cortes wrote:

> So, if I'm understanding right, those would be dirty pages no longer
> needed hanging out there in swap?

That's right -- but you'll get arguments on the definition of "no longer needed". Having sent a page to the swap device, Linux will keep it out there even if the page gets swapped in. The reason: if the page again needs to be swapped out, and it wasn't modified while it was swapped back in, you save an I/O (so the claim is that it's not that it's no longer needed, it's that it's not needed right now but might be again soon). I read about this and other interesting behaviours at http://linux-mm.org -- it seems that the operation of Linux's memory management has generated enough discussion for someone to start a wiki on it. :)

The real issue in terms of VDISK is that even if we could eliminate the "keep it in case we need it" behaviour of Linux, there's no way for Linux to inform CP that a page of a VDISK is no longer needed and can be de-allocated. Even doing swapon/swapoff with an intervening mkswap, even chccwdev'ing the thing off from Linux and back on again, won't tell CP that it can flush the disk -- AFAIK, only DELETE/DEFINE would do it.

> I thought the point of the prioritized swap was that it'd keep reusing
> those on the highest numbered disks before starting down to the next
> disk. It was well into the 3rd disk (they are like 250M, 500M, 1G,
> 1G). (At least I think it used to work that way!) Could there be a
> Linux bug here?

From what I've seen, Linux is working as designed, unfortunately. The hierarchy of swap devices was a theory (tested by others much more skilled and equipped than me, even though I drew the funny pictures of it in the ISP/ASP Redbook). Regardless, it was only meant as an indicator for how big your *central storage* needs to be; as soon as the guest touched the second disk, it was a flag to increase the central. (Can't increase central? Divide the workload across a number of guests.) Ideally you *never* want to swap; having a swap device that's almost as fast as memory helps mitigate the cost of swapping, but using that fast swap is not a habit to keep up. (A sketch of how those per-device priorities are set follows below.)

It's also quite possible that your smaller devices became fragmented and unable to satisfy a request for a large number of contiguous pages. Such fragmentation would make it ever more likely that the later devices would get swapped onto as your uptime wore on.
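A minimal sketch of that per-device hierarchy, assuming hypothetical DASD names and the 250M/500M/1G/1G layout mentioned above (higher pri values are used first):

    # /etc/fstab -- placeholder device names; higher pri = used first
    /dev/dasdb1  swap  swap  pri=40  0 0    # 250M VDISK, first choice
    /dev/dasdc1  swap  swap  pri=30  0 0    # 500M VDISK
    /dev/dasdd1  swap  swap  pri=20  0 0    # 1G VDISK
    /dev/dasde1  swap  swap  pri=10  0 0    # 1G VDISK, last resort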
> Seems like vm.swappiness=0 (or at least a lower number than the
> default of 60) would be a good setting for Linux under VM. Has anyone
> studied this?

/proc/sys/vm/swappiness was introduced with kernel 2.6 [1]. The doco suggests that using swappiness=0 makes the kernel behave like it used to in the 2.4 (and earlier) days -- sacrifice cache to reduce swapping. I have seen SLES 9 systems (with 2.6 kernels) appear to use far more memory than equivalent SLES 8 systems (kernel 2.4), so from experience a low value is useful for the z/VM environment [2].

CMM is meant to be the remedy to all of this, of course. Now we can give all our Linux guests a central storage allocation beyond their wildest dreams (I'm kidding), and let VMRM handle the dirty work for us. I could imagine that we could be a bit more relaxed about our vm.swappiness value then -- we still don't want each of our penguins to buffer up its disks, but perhaps the consequences aren't as severe when allocations are more fluid and more effective sharing is taking place [3].

Unfortunately I haven't used CMM in anger, as I'm a little light on systems to play with nowadays.

Cheerio,
Vic Cross

[1] Swappiness controls the likelihood that a given page of memory will be retained as cache if the kernel needs memory. It's a range from 100 (cache pages are preserved, and non-cache pages are swapped out to satisfy the request) to 0 (cache pages are flushed to free memory to satisfy the request).
[2] If only to preserve the way that we used to tune our guests prior to 2.6. :)
[3] We might even be able to do the Embedded Linux thing and disable swapping entirely!
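For reference, a minimal sketch of adjusting vm.swappiness (the value 0 is the one debated above, not a universal recommendation):

    # Runtime change, takes effect immediately
    sysctl -w vm.swappiness=0
    # equivalent: echo 0 > /proc/sys/vm/swappiness
    # Persist across reboots
    echo "vm.swappiness = 0" >> /etc/sysctl.conf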
Re: Swap oddities
Interesting! Thanks for your response, Vic.

I'm not sure it is working as designed. Eventually, when we use up our swap, WAS crashes OOM (that's *our* real issue, at least our biggest one anyway :). But if we are able to swapoff/swapon and recover that space without crashing WAS, that kinda says to me that it didn't need it anyway -- course I haven't tried that whilst workload was running through... Maybe it is destructive.

We plan to experiment some with vm.swappiness and see if that helps. I guess at the very least, we can add enough VDISKs and enough VM paging packs to get through the week without a recycle until we figure this out, as long as the response time and CPU savings remain this good with 6.1.

Marcy Cortes

-----Original Message-----
From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of Vic Cross
Sent: Monday, October 29, 2007 7:58 AM
To: LINUX-390@VM.MARIST.EDU
Subject: Re: [LINUX-390] Swap oddities
Re: Swap oddities
On Tue, 30 Oct 2007 06:19:33 am Marcy Cortes wrote:

> I'm not sure it is working as designed.

I never said it was a good design -- and perhaps I should have read your earlier messages prior to saying that. :) It does depend on your point of view, though -- it's another one of those aspects that belies Linux's single-system, non-resource-sharing heritage. In a non-shared environment, keeping swap pages hanging around on disk is a good design point in that it can realistically save costly I/O. It's not so good for us, though. :)

> Eventually, when we use up our swap, WAS crashes OOM (that's *our*
> real issue, at least our biggest one anyway :).

Yes... and that's not going to be solved by CMM or creating different swap VDISKs or anything like that. The earlier hints about JVM heap size and garbage collection and so on will be useful here. I guess the application is being checked for leaks as well -- or do your developers write perfect code first-time-every-time too? ;-P

> But if we are able to swapoff/swapon and recover that space without
> crashing WAS, that kinda says to me that it didn't need it anyway --
> course I haven't tried that whilst workload was running through...
> Maybe it is destructive.

It might be, but as long as your Linux has more free virtual memory than the number of pages in use on the device you want to remove, you *should* be able to do a swapoff without impact (things might get a little sluggish for a few seconds while kswapd shuffles things around, though). It would be nice to be able to tell accurately just how much swap space is being used on a device -- /proc/meminfo is system-wide. SwapCached in /proc/meminfo is a helpful indicator that counts the swap space hanging around (you could try http://www.linuxweblog.com/meminfo among heaps of other places for more info about what the numbers from meminfo mean); if this number is low compared to your total available swap, then you're not likely to get much benefit from swapoff/swapon cycles. (A sketch of these checks follows below.)

> We plan to experiment some with vm.swappiness and see if that helps. I
> guess at the very least, we can add enough VDISKs and enough VM paging
> packs to get through the week without a recycle until we figure this
> out.

Good plan, although vm.swappiness is only likely to delay your swap usage rather than eliminate it entirely (if something is asking for that much memory, at some point it's going to have to get it from somewhere). Of course, if it delays heavy swapping long enough to get you through the week, then that's a win.

While you've got this WAS issue you are *possibly* justified in throwing a DASD swap device at the end of your line of VDISKs (I emphasise "possibly" because I don't want to offend Rob et al too much). Perhaps the last thing you want would be to just keep adding VDISKs and VM page packs until your VM paging system is consumed by leaked Linux memory. You could do a nightly swapoff/swapon of some of the VDISKs to flush things out and reduce the activity to the DASD swap. I guess what I'm saying is that you could think about this WAS problem as an aberration rather than the normal operating mode for your system -- don't jeopardise your entire environment for the sake of one problem system, and be prepared to let best practice slide a bit while you get the issue sorted. Of course, you're in a much better position than me to decide if your paging environment needs such protection.
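A small sketch of those checks; note that on 2.6 kernels /proc/swaps does report a per-device Used column, which goes some way toward the per-device accounting wished for above:

    # Per-device swap usage: Filename, Type, Size, Used, Priority
    cat /proc/swaps
    # System-wide count of swap space "hanging around" after swap-in
    grep SwapCached /proc/meminfo
    # ...compared against the total configured swap
    grep SwapTotal /proc/meminfo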
I also transposed my client's problem onto your shop -- I thought you were concerned about the number of pages allocated to VDISKs. That's why I mentioned the stuff about DELETE/DEFINE of your VDISK swaps.

Best of luck with the issue!

Cheerio,
Vic Cross
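As a sketch of the nightly swapoff/swapon flush suggested above -- the device names and priority are placeholders, and such a job should only run when free memory comfortably exceeds the space in use on each device:

    #!/bin/sh
    # Hypothetical nightly job: cycle each VDISK swap device in turn to
    # flush the stale "keep it in case we need it" pages Linux leaves
    # out there.  Check /proc/swaps before trusting this to cron.
    for dev in /dev/dasdb1 /dev/dasdc1; do
        swapoff "$dev" || continue   # fails (and skips) if pages can't move
        swapon -p 40 "$dev"          # placeholder priority
    done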
Re: Swap oddities
Thanks Vic!

JVM heap size and garbage collection seem to be under control. Believe me, this is well looked at by both us and IBM's finest, since it is a huge 15-IFL-at-peak app :)... We've had our share there of these and badly written code before... The generational GC, new with 6.1, seems to be a *phenomenal* difference (and I don't say that lightly, never believing that perf knobs at the app level save you much of anything), saving 13% in CPU and shaving 100ms off of response times (on a 500ms response time transaction). Unbelievable... So I gotta believe they are close to as good as it gets for your average programmer... Course there's always the outlier/different transaction that could be coming in and gumming up the whole system...

... And of course WAS support says all of their leaks are fixed now :)) (and there are some significant ones there apparently in fixpacks less than the current 11, if you study the WAS 6 support site :).. They are saying this is a native memory leak, not in the JVM heap, so the tracing that is needed is totally invasive and therefore nearly impossible in our env. And that there is a possibility that it will *stabilize* at maybe 3-5 gig, thus telling us what the virtual memory size should be (hard to believe when it was so happy, even overcommitted and probably needing 1.3G, on WAS 5/SLES 8 at 1.5 gig, but you know, 64-bit is bigger :) )

We're leaving some up with larger swap sizes now, pending the stabilization or near-crash, whichever comes first. Swapoff does appear to cause some long pauses, so can't do that in production :( Can't afford to lose even 1 second, because that results in ATMs not reaching back-end systems... We recycle weekly anyway for DR reasons... so for now... we just need to make it 7 days without loss of response time.

It probably does make more sense to keep adding VDISKs and VM paging volumes rather than dedicated disk for swapping. At least they'll all share that way (clustered app with a few servers on each LPAR)...

Now, on the other hand... in our test environment, with probably 35 out of 100 running WAS 6, it becomes not an aberration but the norm for the load we have there, unfortunately. Luckily the paging system is so robust (I think we hit 20K per sec to DASD in today's Monday morning fun). More experimentation is definitely needed there!

Marcy Cortes

-----Original Message-----
From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of Vic Cross
Sent: Monday, October 29, 2007 7:29 PM
To: LINUX-390@VM.MARIST.EDU
Subject: Re: [LINUX-390] Swap oddities
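For readers wanting to chase the generational GC gain Marcy describes: a hedged sketch of the relevant options on IBM's J9 JVM (shipped with WAS 6.1). The nursery size is a placeholder, and where generic JVM arguments are set varies by WAS release, so check IBM's documentation:

    # Select the generational/concurrent collector on IBM J9
    -Xgcpolicy:gencon
    # Optionally size the nursery (placeholder value, tune per workload)
    -Xmn256m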
Swap oddities
OK, this test one is sles9x (2.6.5-7.286-s390x) with WAS 6.1. Idle. Using 901MB of swap space (on 4 VDISKs), and something like 217M in the cache.

I decided to see if I could make the cache go away. I set vm.swappiness=0, then I did swapoff/swapon, 1 VDISK at a time. It cleared up the swap space, taking it down to 69MB total. It also took the buffers and cache down to about 110MB total.

So, if I'm understanding right, those would be dirty pages no longer needed hanging out there in swap?

I thought the point of the prioritized swap was that it'd keep reusing those on the highest numbered disks before starting down to the next disk. It was well into the 3rd disk (they are like 250M, 500M, 1G, 1G). (At least I think it used to work that way!) Could there be a Linux bug here?

Seems like vm.swappiness=0 (or at least a lower number than the default of 60) would be a good setting for Linux under VM. Has anyone studied this?

1 of 2            LINUX UCD Memory Analysis Report
                  NODE LNX132  LIMIT 50  2094 6784A
---------------- Storage sizes in MegaBytes ----------------
                  ---Real Storage---  Over  ----SWAP Storage----  Total
Time     Node      Total Avail  Used  head   Total Avail  Used  MIN  Avail
-------- ------   ------ ----- ----- -----  ------ ----- ----- ----- -----
15:14:00 LNX132   1464.7   7.3  1457  1346    3662  3592  69.8  15.6  3600
15:13:00 LNX132   1464.7   7.9  1457  1346    3662  3592  69.8  15.6  3600
15:12:00 LNX132   1464.7   6.0  1459  1351    3662  3592  69.9  15.6  3598
15:11:00 LNX132   1464.7   5.6  1459  1346    3662  3593  69.2  15.6  3599
15:10:00 LNX132   1464.7   7.6  1457  1346    3662  3588  74.4  15.6  3595
15:09:00 LNX132   1464.7   8.2  1456  1349    3662  3588  74.4  15.6  3596
15:08:00 LNX132   1464.7   3.5  1461  1351    3449  3359  90.2  15.6  3363
15:07:00 LNX132   1464.7   6.3  1458  1342    3536  3382 153.3  15.6  3389
15:06:00 LNX132   1464.7   6.9  1458  1293    3637  3408 229.1  15.6  3415
15:05:00 LNX132   1464.7   7.0  1458  1279    3662  3409 253.0  15.6  3416
15:04:00 LNX132   1464.7   6.7  1458  1278    2197  1944 253.0  15.6  1951
15:03:00 LNX132   1464.7   5.4  1459  1255    2395  1953 441.9  15.6  1958
15:02:00 LNX132   1464.7   7.0  1458  1237    2686  2059 626.3  15.6  2066
15:01:00 LNX132   1464.7   5.9  1459  1210    2739  2059 679.7  15.6  2065
15:00:00 LNX132   1464.7   4.8  1460  1183    2890  2059 830.5  15.6  2064
14:59:00 LNX132   1464.7   6.0  1459  1173    3662  2760 901.6  15.6  2766
14:58:00 LNX132   1464.7   7.2  1457  1169    3662  2760 901.6  15.6  2768
14:57:00 LNX132   1464.7   5.5  1459  1169    3662  2760 901.6  15.6  2766
14:56:00 LNX132   1464.7   8.6  1456  1169    3662  2760 901.6  15.6  2769
14:55:00 LNX132   1464.7   8.9  1456  1169    3662  2760 901.6  15.6  2769
14:54:00 LNX132   1464.7   8.9  1456  1169    3662  2760 901.6  15.6  2769
14:53:00 LNX132   1464.7   6.2  1458  1169    3662  2760 901.6  15.6  2767
14:52:00 LNX132   1464.7   6.3  1458  1169    3662  2760 901.6  15.6  2767
14:51:00 LNX132   1464.7   6.4  1458  1169    3662  2760 901.6  15.6  2767
14:50:00 LNX132   1464.7   6.4  1458  1169    3662  2760 901.6  15.6  2767
14:49:00 LNX132   1464.7   6.6  1458  1169    3662  2760 901.6  15.6  2767
14:48:00 LNX132   1464.7   6.7  1458  1169    3662  2760 901.6  15.6  2767
14:47:00 LNX132   1464.7   7.5  1457  1169    3662  2760 901.6  15.6  2768
14:46:00 LNX132   1464.7   6.8  1458  1169    3662  2760 901.6  15.6  2767
14:45:00 LNX132   1464.7   6.4  1458  1170    3662  2760 901.6  15.6  2767
14:44:00 LNX132   1464.7   6.6  1458  1169    3662  2760 901.6  15.6  2767
14:43:00 LNX132   1464.7   6.7  1458  1169    3662  2760 901.6  15.6  2767
14:42:00 LNX132   1464.7   6.8  1458  1170    3662  2760 901.6  15.6  2767

2 of 2            LINUX UCD Memory Analysis Report
                  NODE LNX132  LIMIT 50  2094 6784A
------------- Storage sizes (in Megabytes) -------------
                  ---Real Storage---  --Storage in Use--  Error
Time     Node      Total Avail  Used  Shared Buffer Cache  Message
-------- ------   ------ ----- -----  ------ ------ -----  -------
15:14:00 LNX132   1464.7   7.3  1457     0.0    6.4 105.3
15:13:00 LNX132   1464.7   7.9  1457     0.0    6.2 105.0
15:12:00 LNX132   1464.7   6.0  1459     0.0    5.7 102.2
15:11:00 LNX132   1464.7   5.6  1459     0.0    4.9 107.8
15:10:00 LNX132   1464.7   7.6  1457     0.0    5.1 106.0
15:09:00 LNX132   1464.7   8.2  1456     0.0    4.8 103.1
15:08:00 LNX132   1464.7   3.5  1461     0.0    5.2 105.3
15:07:00 LNX132   1464.7   6.3  1458     0.0    4.6 111.5
15:06:00 LNX132   1464.7   6.9  1458     0.0   17.7 147.5
15:05:00 LNX132   1464.7   7.0  1458     0.0   20.6 157.8
15:04:00 LNX132   1464.7   6.7  1458     0.0   20.5 159.1
15:03:00 LNX132   1464.7   5.4  1459     0.0   27.5 176.4
15:02:00 LNX132   1464.7   7.0  1458     0.0   32.7 187.7
15:01:00 LNX132   1464.7   5.9
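For reference, a sketch of the drain experiment described in this message, with placeholder device names and priorities standing in for the four VDISKs:

    # Stop favouring page cache over program pages
    sysctl -w vm.swappiness=0
    # Note usage before: per-device in /proc/swaps, totals via free
    cat /proc/swaps; free -m
    # Cycle the devices one at a time so the others absorb live pages
    swapoff /dev/dasdb1 && swapon -p 40 /dev/dasdb1   # 250M
    swapoff /dev/dasdc1 && swapon -p 30 /dev/dasdc1   # 500M
    swapoff /dev/dasdd1 && swapon -p 20 /dev/dasdd1   # 1G
    swapoff /dev/dasde1 && swapon -p 10 /dev/dasde1   # 1G
    # Note usage after
    cat /proc/swaps; free -m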