Re: KSM For All Via LD_PRELOAD?
On 06/10/2010 08:33 AM, Dor Laor wrote:
On 06/09/2010 01:31 PM, Gordan Bobic wrote:
On 06/09/2010 09:56 AM, Paolo Bonzini wrote:
Or is this too crazy an idea?

It should work. Note that the malloced memory should be aligned in order to get better sharing. Within glibc malloc, large blocks are mmaped, so they are automatically aligned. Effective sharing of small blocks would take too much luck or too much wasted memory, so madvising brk memory is probably not too useful. Of course there are exceptions. Bitmaps are very much sharable, but not big. And some programs have their own allocator, in all likelihood using mmap and slicing the resulting block. Typically these will be virtual machines for garbage-collected languages (but GCC, for example, does this too). They will store a lot of pointers in there too, so in this case KSM would likely work a lot for little benefit. So if you really want to apply it to _all_ processes, it comes to mind to wrap both mmap and malloc so that you can set a flag only for mmap-within-malloc... It will take some experimentation and heuristics to actually not degrade performance (and of course it will depend on the workload), but it should work.

Arguably, the way QEMU/KVM does it for the VM's entire memory block doesn't seem to distinguish the types of memory allocation inside the VM, so simply covering all mmap()/brk() calls would probably do no worse in terms of performance. Or am I missing something?

There won't be a drastic effect for qemu-kvm since the non-guest-RAM areas are minimal. I thought you were trying to trap mmap/brk/malloc for other general applications regardless of virt.

Why does it matter that the non-guest RAM areas are minimal? The way I envisage using it is by putting: export LD_PRELOAD=myksmintercept.so as the first line in rc.sysinit and having _all_ processes in the system subject to this. So the memory areas not subject to KSM would be as negligible as in the virt case, if not more so.
Or am I misunderstanding what you're saying? Gordan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KSM For All Via LD_PRELOAD?
On 06/10/2010 08:44 AM, Jes Sorensen wrote:
On 06/08/10 20:43, Gordan Bobic wrote:
Is this plausible? I'm trying to work out if it's even worth considering this approach to enable all memory used in a system to be open to KSM page merging, rather than only memory used by specific programs aware of it (e.g. kvm/qemu). Something like this would address the fact that container-based virtualization (OpenVZ, VServer, LXC) cannot benefit from KSM. What I'm thinking about is somehow intercepting malloc() and wrapping it so that all malloc()-ed memory gets madvise()-d as well.

Not sure if it is worth it, but you might want to look at ElectricFence, which does malloc wrapping in a somewhat similar way. It might save you some code :)

I'll look into it, but I don't see this requiring more than maybe 50 lines of code, including comments, headers and Makefile. I was planning to literally just intercept mmap()/brk()/malloc() and mark them with madvise() when the underlying call returns. Which brings me to another question: would intercepting malloc() be completely redundant if mmap() is intercepted? Would I also need to do something with intercepting free()? Is there anything else I would need to intercept?

Whether or not you will run into problems if you run it system-wide is really hard to predict. Any other application that might be linked in a special way or use preload itself might bark, but you can try it out and see what explodes.

Thanks for the heads up. Can you think of any such applications off the top of your head? Gordan
Re: KSM For All Via LD_PRELOAD?
On 06/09/2010 09:56 AM, Paolo Bonzini wrote:
Or is this too crazy an idea?

It should work. Note that the malloced memory should be aligned in order to get better sharing. Within glibc malloc, large blocks are mmaped, so they are automatically aligned. Effective sharing of small blocks would take too much luck or too much wasted memory, so madvising brk memory is probably not too useful. Of course there are exceptions. Bitmaps are very much sharable, but not big. And some programs have their own allocator, in all likelihood using mmap and slicing the resulting block. Typically these will be virtual machines for garbage-collected languages (but GCC, for example, does this too). They will store a lot of pointers in there too, so in this case KSM would likely work a lot for little benefit. So if you really want to apply it to _all_ processes, it comes to mind to wrap both mmap and malloc so that you can set a flag only for mmap-within-malloc... It will take some experimentation and heuristics to actually not degrade performance (and of course it will depend on the workload), but it should work.

Arguably, the way QEMU/KVM does it for the VM's entire memory block doesn't seem to distinguish the types of memory allocation inside the VM, so simply covering all mmap()/brk() calls would probably do no worse in terms of performance. Or am I missing something? Gordan
KSM For All Via LD_PRELOAD?
Is this plausible? I'm trying to work out if it's even worth considering this approach to enable all memory used in a system to be open to KSM page merging, rather than only memory used by specific programs aware of it (e.g. kvm/qemu). Something like this would address the fact that container-based virtualization (OpenVZ, VServer, LXC) cannot benefit from KSM. What I'm thinking about is somehow intercepting malloc() and wrapping it so that all malloc()-ed memory gets madvise()-d as well. Has this been done? Or is this too crazy an idea? Gordan
Re: Shouldn't cache=none be the default for drives?
Troels Arvin wrote:
Hello, I'm conducting some performance tests with KVM-virtualized CentOSes. One thing I noticed is that guest I/O performance seems to be significantly better for virtio-based block devices (drives) if the cache=none argument is used. (This was with a rather powerful storage system backend which is hard to saturate.) So: why isn't cache=none the default for drives?

Is that the right question? Or is the right question "Why is cache=none faster?" What did you use for measuring the performance? I have found in the past that the virtio block device was slower than IDE block device emulation. Gordan
KSM without VT / KSM for all memory
Hi, Is it possible to use KSM:
1) Without hardware VT support
2) For all memory in a system, without patching all applications to register with it
TIA. Gordan
Re: KSM without VT / KSM for all memory
Chris Wright wrote:
2) For all memory in a system, without patching all applications to register with it

No. Right now, an app must be modified to call madvise(MADV_MERGEABLE). Further, the core scanning loop that ksmd performs is based on per-process virtual memory regions rather than physical memory.

You mean only pages within the same process are de-duplicated? Gordan
Re: virtio disk slower than IDE?
john cooper wrote:
The test is building the Linux kernel (only taking the second run to give the test the benefit of local cache): make clean; make -j8 all; make clean; sync; time make -j8 all This takes about 10 minutes with IDE disk emulation and about 13 minutes with virtio. I ran the tests multiple times with most non-essential services on the host switched off (including cron/atd), and the guest in single-user mode to reduce the noise in the test to the minimum, and the results are pretty consistent, with virtio being about 30% behind.

I'd expect for an observed 30% wall clock time difference of an operation as complex as a kernel build the base i/o throughput disparity is substantially greater. Did you try a more simple/regular load, e.g. a streaming dd read of various block sizes from guest raw disk devices? This is also considerably easier to debug vs. the complex i/o load generated by a build.

I'm not convinced it's the read performance, since it's the second pass that is timed, by which time all the source files will be in the guest's cache. I verified this by doing just one pass and priming it with: find . -type f -exec cat '{}' > /dev/null \; The execution times are indistinguishable from the second pass in the two-pass test. To me that would indicate that the problem is with write performance, rather than read performance.

One way to chop up the problem space is using blktrace on the host to observe both the i/o patterns coming out of qemu and the host's response to them in terms of turnaround time. I expect you'll see requests of a somewhat different nature generated by qemu w/r/t blocking and number of threads serving virtio_blk requests relative to ide, but the host response should be essentially the same in terms of data returned per unit time. If the host looks to be turning around i/o requests with similar latency in both cases, the problem would be lower frequency of requests generated by qemu in the case of virtio_blk.
Here it would be useful to know the host load generated by the guest for both cases.

With virtio the CPU usage did seem to be noticeably lower. I figured that was because it was spending more time waiting for I/O to finish, since it was clearly bottlenecking on disk I/O (since that's the only thing that changed). I'll try iozone's write tests and see how that compares. If I'm right about write performance being problematic, iozone might show the same performance deterioration on write tests compared to the IDE emulation. Gordan
Re: virtio disk slower than IDE?
Dor Laor wrote:
On 11/14/2009 04:23 PM, Gordan Bobic wrote:
I just tried paravirtualized virtio block devices, and my tests show that they are approximately 30% slower than emulated IDE devices. I'm guessing this isn't normal. Is this a known issue or am I likely to have misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and guest).

Please try to change the io scheduler on the host to io scheduler, it should boost your performance back.

I presume you mean the deadline io scheduler. I tried that (kernel parameter elevator=deadline) and it made no measurable difference compared to the cfq scheduler. Gordan
Re: Virtualization Performance: Intel vs. AMD
Thomas Fjellstrom wrote:
On Sun November 15 2009, Neil Aggarwal wrote:
The Core i7 has hyperthreading, so you see 8 logical CPUs.

Are you saying the AMD processors do not have hyperthreading?

Course not. Hyperthreading is dubious at best.

That's a rather questionable answer to a rather broad issue. SMT is useful, especially on processors with deep pipelines (think Pentium 4 - and in general, deeper pipelines tend to be required for higher clock speeds), because it reduces the number of context switches. Context switches are certainly one of the most expensive operations, if not the most expensive, that you can do on a processor, and typically require flushing the pipelines. Double the number of hardware threads, and you halve the number of context switches. This typically isn't useful if your CPU is processing one single-threaded application 99% of the time, but on a loaded server it can make a significant difference to throughput. Gordan
Re: virtio disk slower than IDE?
Dor Laor wrote:
On 11/15/2009 02:00 PM, Gordan Bobic wrote:
Dor Laor wrote:
On 11/14/2009 04:23 PM, Gordan Bobic wrote:
I just tried paravirtualized virtio block devices, and my tests show that they are approximately 30% slower than emulated IDE devices. I'm guessing this isn't normal. Is this a known issue or am I likely to have misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and guest).

Please try to change the io scheduler on the host to io scheduler, it should boost your performance back.

I presume you mean the deadline io scheduler. I tried that (kernel parameter elevator=deadline) and it made no measurable difference compared to the cfq scheduler.

What version of kvm do you use? Is it rhel5.4?

It's RHEL 5.4.
$ rpm -qa | grep -i kvm
kmod-kvm-83-105.el5_4.9
kvm-83-105.el5_4.9

Can you post the qemu cmdline and the perf test in the guest?

Here is what is in the libvirt log. For IDE emulation:
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root LOGNAME=root /usr/libexec/qemu-kvm -S -M pc -m 2048 -smp 4 -name RHEL_5_x86-64 -uuid cb44b2c5-e64b-848f-77af-f8e7f02fa2ca -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEL_5_x86-64.pid -boot c -drive file=/var/lib/libvirt/images/RHEL_5_x86-64.img,if=ide,index=0,boot=on -net nic,macaddr=54:52:00:5a:67:4b,vlan=0,model=e1000 -net tap,fd=15,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-gb

For virtio:
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root LOGNAME=root /usr/libexec/qemu-kvm -S -M pc -m 2048 -smp 4 -name RHEL_5_x86-64 -uuid cb44b2c5-e64b-848f-77af-f8e7f02fa2ca -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEL_5_x86-64.pid -boot c -drive file=/var/lib/libvirt/images/CentOS_5_x86-64.img,if=virtio,index=0,boot=on -net nic,macaddr=54:52:00:5a:67:4b,vlan=0,model=e1000 -net tap,fd=15,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-gb

The test is building the
Linux kernel (only taking the second run to give the test the benefit of local cache): make clean; make -j8 all; make clean; sync; time make -j8 all This takes about 10 minutes with IDE disk emulation and about 13 minutes with virtio. I ran the tests multiple times with most non-essential services on the host switched off (including cron/atd), and the guest in single-user mode to reduce the noise in the test to the minimum, and the results are pretty consistent, with virtio being about 30% behind.

Lastly, do you use cache=wb on qemu? It's just a fun mode, we use cache=off only.

I don't see the option being set in the logs, so I'd guess it's whatever qemu-kvm defaults to. Gordan
Re: Virtualization Performance: Intel vs. AMD
Thomas Fjellstrom wrote:
The Core i7 has hyperthreading, so you see 8 logical CPUs.

Are you saying the AMD processors do not have hyperthreading?

Course not. Hyperthreading is dubious at best.

That's a rather questionable answer to a rather broad issue. SMT is useful, especially on processors with deep pipelines (think Pentium 4 - and in general, deeper pipelines tend to be required for higher clock speeds), because it reduces the number of context switches. Context switches are certainly one of the most expensive operations, if not the most expensive, that you can do on a processor, and typically require flushing the pipelines. Double the number of hardware threads, and you halve the number of context switches.

Hardware context switches aren't free either. And while it really has nothing to do with this discussion, the P4 arch was far from perfect (many would say, far from GOOD).

I actually disagree with a lot of the criticism of the P4. The reason its performance _appeared_ to be poor was that it was more reliant on compilers doing their job well. Unfortunately, most compilers generate very poor code, and most programmers aren't even aware of the improvements that can be had in this area with a bit of extra work and a decent compiler. Performance differences of 7+ times (700%) aren't unheard of on the Pentium 4 between, say, ICC- and GCC-generated code. The P4 wasn't a bad design - the compilers just weren't good enough to leverage it to anywhere near its potential.

This typically isn't useful if your CPU is processing one single-threaded application 99% of the time, but on a loaded server it can make a significant difference to throughput.

I'll buy that. Though you'll have to agree that the initial Hyperthreading implementation in Intel CPUs was really bad. I hear good things about the latest version though.

As measured by what? A single-threaded desktop benchmark?

But hey, if you can stick more cores in, or do what AMD is doing with its upcoming line, why not do that?
Hyperthreading seems like more of a gimmick than anything.

If there weren't clear and quantifiable benefits then IBM wouldn't be putting it in its Power series of high-end processors, it wouldn't be in the X-Box 360's Xenon (a PPC970 variant), and Sun wouldn't be going massively SMT in the Niagara SPARCs. Silicon die space is _expensive_ - it wouldn't be getting wasted on gimmicks.

What seems to help the most with the new Intel arch is the auto overclocking when some cores are idle. Far more of a performance improvement than Hyperthreading will ever be, it seems.

Which is targeted at gamers and desktop enthusiasts who think that FPS in Crysis is a meaningful measure of performance for most applications. Server load profile is a whole different ball game. Anyway, let's get this back on topic for the list before we get told off (of course, I'm more than happy to continue the discussion off list). Gordan
virtio disk slower than IDE?
I just tried paravirtualized virtio block devices, and my tests show that they are approximately 30% slower than emulated IDE devices. I'm guessing this isn't normal. Is this a known issue or am I likely to have misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and guest). Thanks. Gordan
Re: Guest OpenGL Acceleration
On Tue, 18 Aug 2009 13:02:18 +0100, Armindo Silva deathon2l...@gmail.com wrote:
There's a patch for qemu: http://qemu-forum.ipi.fi/viewtopic.php?t=2984

Interesting, and along the lines of exactly what I was after (including the opengl32.dll win32 library). But that thread is from 2+ years ago, with no mention of whether the project is maintained. Does it work with the KVM virtualization back end for QEMU?

and there's also this: http://sysweb.cs.toronto.edu/projects/7 I think this is used by vbox.

No, vbox seems to use something very similar to the approach in the first link you posted. Gordan
Re: Trouble shutting down vm on guest halt
On Fri, 14 Aug 2009 14:38:52 +0200, Flemming Frandsen flemming.frand...@stibo.com wrote:
I'm having some problems getting kvm to exit when the guest OS has halted. Specifically I'm running CentOS 5.2 as the guest on ubuntu 8.1. I've noticed that 32-bit Windows XP and 64-bit Ubuntu 9.10 can power down a vm as expected. Any idea where I should look for documentation on how to tickle kvm ACPI the right way from CentOS?

I have CentOS 5.3 on CentOS 5.3 and the shutdown on that works OK, so it seems probable that this is a host/KVM side issue. What version of KVM are you running? Oh, and you aren't running it with -no-acpi, are you? Gordan
Guest OpenGL Acceleration
Is OpenGL Acceleration based on the host's OpenGL capability available in KVM? Thanks. Gordan
Disk Emulation and Trim Instruction
With the recent talk of the trim SATA instruction becoming supported in upcoming versions of Windows, and claims from Intel that support for it in their SSDs is imminent, it occurs to me that this would be equally useful in virtual disk emulation. Since the disk image is a sparse file, it only ever grows, and eventually it will grow to its full intended size even if the actual used space is a small fraction of the container size. Since the trim instruction tells the disk that a particular block is no longer used (and can thus be scheduled for erasing as and when required), the same thing could be used to reclaim space used by sparse files backing the VM. It would allow for higher overcommit of disk usage on VM farms. Is this feature likely to be available in KVM soon? Gordan