Re: KSM For All Via LD_PRELOAD?

2010-06-10 Thread Gordan Bobic

On 06/10/2010 08:33 AM, Dor Laor wrote:

On 06/09/2010 01:31 PM, Gordan Bobic wrote:

On 06/09/2010 09:56 AM, Paolo Bonzini wrote:

Or is this too crazy an idea?


It should work. Note that the malloced memory should be aligned in
order to get better sharing.


Within glibc malloc large blocks are mmaped, so they are automatically
aligned. Effective sharing of small blocks would take too much luck or
too much wasted memory, so probably madvising brk memory is not too
useful.

Of course there are exceptions. Bitmaps are very much sharable, but not
big. And some programs have their own allocator, using mmap in all
likelihood and slicing the resulting block. Typically these will be
virtual machines for garbage collected languages (but also GCC for
example does this). They will store a lot of pointers in there too, so
in this case KSM would likely work a lot for little benefit.

So if you really want to apply it to _all_ processes, it comes to mind
to wrap both mmap and malloc so that you can set a flag only for
mmap-within-malloc... It will take some experimentation and heuristics
to actually not degrade performance (and of course it will depend on the
workload), but it should work.


Arguably, the way QEMU KVM does it for the VM's entire memory block
doesn't seem to be distinguishing the types of memory allocation inside
the VM, so simply covering all mmap()/brk() calls would probably do no
worse in terms of performance. Or am I missing something?


There won't be a drastic effect for qemu-kvm since the non-guest RAM areas
are minimal. I thought you were trying to trap mmap/brk/malloc for other
general applications regardless of virt.


Why does it matter that the non-guest RAM areas are minimal? The way I 
envisage using it is by putting:

export LD_PRELOAD=myksmintercept.so
as the first line in rc.sysinit and having _all_ processes in the system 
subject to this. So the memory areas not subject to KSM would be as 
negligible as in the virt case if not more so. Or am I misunderstanding 
what you're saying?


Gordan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KSM For All Via LD_PRELOAD?

2010-06-10 Thread Gordan Bobic

On 06/10/2010 08:44 AM, Jes Sorensen wrote:

On 06/08/10 20:43, Gordan Bobic wrote:

Is this plausible?

I'm trying to work out if it's even worth considering this approach to
enable all memory used in a system to be open to KSM page merging,
rather than only memory used by specific programs aware of it (e.g.
kvm/qemu).

Something like this would address the fact that container-based
virtualization (OpenVZ, VServer, LXC) cannot benefit from KSM.

What I'm thinking about is somehow intercepting malloc() and wrapping it
so that all malloc()-ed memory gets madvise()-d as well.


Not sure if it is worth it, but you might want to look at ElectricFence
which does malloc wrapping in a somewhat similar way. It might save you
some code :)


I'll look into it, but I don't see this requiring more than maybe 50 
lines of code, including comments, headers and Makefile. I was planning 
to literally just intercept mmap()/brk()/malloc() and mark them with 
madvise() when the underlying call returns.


Which brings me to another question:

Would intercepting malloc() be completely redundant if mmap() is 
intercepted? Would I also need to do something with intercepting free()? 
Is there anything else I would need to intercept?



Whether or not you will run into problems if you run it system-wide is
really hard to predict. Any other application that might be linked in a
special way or use preload itself might bark, but you can try it out and
see what explodes.


Thanks for the heads up. Can you think of any such applications off the 
top of your head?


Gordan


Re: KSM For All Via LD_PRELOAD?

2010-06-09 Thread Gordan Bobic

On 06/09/2010 09:56 AM, Paolo Bonzini wrote:

Or is this too crazy an idea?


It should work. Note that the malloced memory should be aligned in
order to get better sharing.


Within glibc malloc large blocks are mmaped, so they are automatically
aligned. Effective sharing of small blocks would take too much luck or
too much wasted memory, so probably madvising brk memory is not too useful.

Of course there are exceptions. Bitmaps are very much sharable, but not
big. And some programs have their own allocator, using mmap in all
likelihood and slicing the resulting block. Typically these will be
virtual machines for garbage collected languages (but also GCC for
example does this). They will store a lot of pointers in there too, so
in this case KSM would likely work a lot for little benefit.

So if you really want to apply it to _all_ processes, it comes to mind
to wrap both mmap and malloc so that you can set a flag only for
mmap-within-malloc... It will take some experimentation and heuristics
to actually not degrade performance (and of course it will depend on the
workload), but it should work.


Arguably, the way QEMU KVM does it for the VM's entire memory block 
doesn't seem to be distinguishing the types of memory allocation inside 
the VM, so simply covering all mmap()/brk() calls would probably do no 
worse in terms of performance. Or am I missing something?


Gordan


KSM For All Via LD_PRELOAD?

2010-06-08 Thread Gordan Bobic

Is this plausible?

I'm trying to work out if it's even worth considering this approach to 
enable all memory used in a system to be open to KSM page merging, 
rather than only memory used by specific programs aware of it (e.g. 
kvm/qemu).


Something like this would address the fact that container-based 
virtualization (OpenVZ, VServer, LXC) cannot benefit from KSM.


What I'm thinking about is somehow intercepting malloc() and wrapping it 
so that all malloc()-ed memory gets madvise()-d as well.


Has this been done?

Or is this too crazy an idea?

Gordan


Re: Shouldn't cache=none be the default for drives?

2010-04-07 Thread Gordan Bobic

Troels Arvin wrote:

Hello,

I'm conducting some performance tests with KVM-virtualized CentOSes. One 
thing I noticed is that guest I/O performance seems to be significantly 
better for virtio-based block devices (drives) if the cache=none 
argument is used. (This was with a rather powerful storage system 
backend which is hard to saturate.)


So: Why isn't cache=none the default for drives?


Is that the right question? Or is the right question "Why is cache=none 
faster?"

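For what it's worth, the cache= values map to how QEMU opens the backing file; cache=none uses O_DIRECT, so guest I/O bypasses the host page cache entirely. A hedged illustration (the image path is a placeholder, and writethrough was the default in QEMU of this era):

```shell
# Illustrative -drive syntax only: cache=none opens the image O_DIRECT,
# so guest I/O bypasses the host page cache (no double caching, and
# host RAM never absorbs guest writes). Other values in this era:
# writethrough (the default) and writeback.
qemu-kvm ... -drive file=/var/lib/libvirt/images/guest.img,if=virtio,cache=none
```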

What did you use for measuring the performance? I have found in the past 
that the virtio block device was slower than IDE block device emulation.


Gordan


KSM without VT / KSM for all memory

2010-03-26 Thread Gordan Bobic

Hi,

Is it possible to use KSM:
1) Without hardware VT support
2) For all memory in a system, without patching all applications to 
register with it


TIA.

Gordan


Re: KSM without VT / KSM for all memory

2010-03-26 Thread Gordan Bobic

Chris Wright wrote:

2) For all memory in a system, without patching all applications to  
register with it


No.

Right now, an app must be modified to call madvise(MADV_MERGEABLE).
Further, the core scanning loop that ksmd performs is based on per-process
virtual memory regions rather than physical memory.


You mean only pages within the same process are de-duplicated?

Gordan


Re: virtio disk slower than IDE?

2009-11-16 Thread Gordan Bobic

john cooper wrote:


The test is building the Linux kernel (only taking the second run to give the 
test the benefit of local cache):

make clean; make -j8 all; make clean; sync; time make -j8 all

This takes about 10 minutes with IDE disk emulation and about 13 minutes with virtio. I 
ran the tests multiple times with most non-essential services on the host switched off 
(including cron/atd), and the guest in single-user mode to reduce the noise 
in the test to the minimum, and the results are pretty consistent, with virtio being 
about 30% behind.


I'd expect for an observed 30% wall clock time difference
of an operation as complex as a kernel build the base i/o
throughput disparity is substantially greater.  Did you
try a more simple/regular load, eg: a streaming dd read
of various block sizes from guest raw disk devices?
This is also considerably easier to debug vs. the complex
i/o load generated by a build.


I'm not convinced it's the read performance, since it's the second pass 
that is timed, by which time all the source files will be in the guest's 
cache. I verified this by doing just one pass and priming it with:


find . -type f -exec cat '{}' > /dev/null \;

The execution times are indistinguishable from the second pass in the 
two-pass test.


To me that would indicate that the problem is with write performance, 
rather than read performance.



One way to chop up the problem space is using blktrace
on the host to observe both the i/o patterns coming out
of qemu and the host's response to them in terms of
turn around time.  I expect you'll see somewhat different
nature requests generated by qemu w/r/t blocking and
number of threads serving virtio_blk requests relative
to ide but the host response should be essentially the
same in terms of data returned per unit time.

If the host looks to be turning around i/o request with
similar latency in both cases, the problem would be lower
frequency of requests generated by qemu in the case of
virtio_blk.   Here it would be useful to know the host
load generated by the guest for both cases.


With virtio the CPU usage did seem to be noticeably lower. I figured 
that was because it was spending more time waiting for I/O to finish, 
since it was clearly bottlenecking on disk I/O (since that's the only 
thing that changed).


I'll try iozone's write tests and see how that compares. If I'm right 
about write performance being problematic, iozone might show the same 
performance deterioration on write tests compared to the IDE emulation.


Gordan


Re: virtio disk slower than IDE?

2009-11-15 Thread Gordan Bobic

Dor Laor wrote:

On 11/14/2009 04:23 PM, Gordan Bobic wrote:

I just tried paravirtualized virtio block devices, and my tests show
that they are approximately 30% slower than emulated IDE devices. I'm
guessing this isn't normal. Is this a known issue or am I likely to have
misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and
guest).


Please try to change the io scheduler on the host to io scheduler, it 
should boost your performance back.


I presume you mean the deadline io scheduler. I tried that (kernel 
parameter elevator=deadline) and it made no measurable difference 
compared to the cfq scheduler.
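For reference, the scheduler can also be switched per-disk at runtime rather than only at boot ("sda" is a placeholder):

```shell
# Check the active scheduler (shown in brackets), then switch one disk
# at runtime; elevator=deadline on the kernel command line sets the
# boot-time default for all disks instead.
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler
```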


Gordan


Re: Virtualization Performance: Intel vs. AMD

2009-11-15 Thread Gordan Bobic

Thomas Fjellstrom wrote:

On Sun November 15 2009, Neil Aggarwal wrote:

The Core i7 has hyperthreading, so you see 8 logical CPUs.

Are you saying the AMD processors do not have hyperthreading?


Course not. Hyperthreading is dubious at best.


That's a rather questionable answer to a rather broad issue. SMT is 
useful, especially on processors with deep pipelines (think Pentium 4 - 
and in general, deeper pipelines tend to be required for higher clock 
speeds), because it reduces the number of context switches. Context 
switches are certainly one of the most expensive operations if not the 
most expensive operation you can do on a processor, and typically 
require flushing the pipelines. Double the number of hardware threads, 
and you halve the number of context switches.


This typically isn't useful if your CPU is processing one 
single-threaded application 99% of the time, but on a loaded server it 
can make a significant difference to throughput.


Gordan


Re: virtio disk slower than IDE?

2009-11-15 Thread Gordan Bobic

Dor Laor wrote:

On 11/15/2009 02:00 PM, Gordan Bobic wrote:

Dor Laor wrote:

On 11/14/2009 04:23 PM, Gordan Bobic wrote:

I just tried paravirtualized virtio block devices, and my tests show
that they are approximately 30% slower than emulated IDE devices. I'm
guessing this isn't normal. Is this a known issue or am I likely to 
have

misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and
guest).


Please try to change the io scheduler on the host to io scheduler, it
should boost your performance back.


I presume you mean the deadline io scheduler. I tried that (kernel
parameter elevator=deadline) and it made no measurable difference
compared to the cfq scheduler.


What version of kvm do you use? Is it rhel5.4?


It's RHEL 5.4.

$ rpm -qa | grep -i kvm
kmod-kvm-83-105.el5_4.9
kvm-83-105.el5_4.9




Can you post the qemu cmdline and the perf test in the guest?


Here is what is in the libvirt log:

For IDE emulation:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root 
LOGNAME=root /usr/libexec/qemu-kvm -S -M pc -m 2048 -smp 4 -name 
RHEL_5_x86-64 -uuid cb44b2c5-e64b-848f-77af-f8e7f02fa2ca 
-no-kvm-pit-reinjection -monitor pty -pidfile 
/var/run/libvirt/qemu//RHEL_5_x86-64.pid -boot c -drive 
file=/var/lib/libvirt/images/RHEL_5_x86-64.img,if=ide,index=0,boot=on 
-net nic,macaddr=54:52:00:5a:67:4b,vlan=0,model=e1000 -net 
tap,fd=15,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb 
-vnc 127.0.0.1:0 -k en-gb


For virtio:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root 
LOGNAME=root /usr/libexec/qemu-kvm -S -M pc -m 2048 -smp 4 -name 
RHEL_5_x86-64 -uuid cb44b2c5-e64b-848f-77af-f8e7f02fa2ca 
-no-kvm-pit-reinjection -monitor pty -pidfile 
/var/run/libvirt/qemu//RHEL_5_x86-64.pid -boot c -drive 
file=/var/lib/libvirt/images/CentOS_5_x86-64.img,if=virtio,index=0,boot=on 
-net nic,macaddr=54:52:00:5a:67:4b,vlan=0,model=e1000 -net 
tap,fd=15,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb 
-vnc 127.0.0.1:0 -k en-gb


The test is building the Linux kernel (only taking the second run to 
give the test the benefit of local cache):


make clean; make -j8 all; make clean; sync; time make -j8 all

This takes about 10 minutes with IDE disk emulation and about 13 minutes 
with virtio. I ran the tests multiple times with most non-essential 
services on the host switched off (including cron/atd), and the guest in 
single-user mode to reduce the noise in the test to the minimum, and 
the results are pretty consistent, with virtio being about 30% behind.


Lastly, do you use cache=wb on qemu? it's just a fun mode, we use 
cache=off only.


I don't see the option being set in the logs, so I'd guess it's whatever 
qemu-kvm defaults to.


Gordan


Re: Virtualization Performance: Intel vs. AMD

2009-11-15 Thread Gordan Bobic

Thomas Fjellstrom wrote:


The Core i7 has hyperthreading, so you see 8 logical CPUs.

Are you saying the AMD processors do not have hyperthreading?

Course not. Hyperthreading is dubious at best.

That's a rather questionable answer to a rather broad issue. SMT is
useful, especially on processors with deep pipelines (think Pentium 4 -
and in general, deeper pipelines tend to be required for higher clock
speeds), because it reduces the number of context switches. Context
switches are certainly one of the most expensive operations if not the
most expensive operation you can do on a processor, and typically
require flushing the pipelines. Double the number of hardware threads,
and you halve the number of context switches.


Hardware context switches aren't free either. And while it really has 
nothing to do with this discussion, the P4 arch was far from perfect (many 
would say, far from GOOD).


I actually disagree with a lot of criticism of P4. The reason why its 
performance _appeared_ to be poor was because it was more reliant on 
compilers doing their job well. Unfortunately, most compilers generate 
very poor code, and most programmers aren't even aware of the 
improvements that can be had in this area with a bit of extra work and a 
decent compiler. Performance differences of 7+ times (700%) aren't 
unheard of on Pentium 4 between, say, ICC and GCC generated code.


P4 wasn't a bad design - the compilers just weren't good enough to 
leverage it to anywhere near its potential.



This typically isn't useful if your CPU is processing one
single-threaded application 99% of the time, but on a loaded server it
can make a significant difference to throughput.


I'll buy that. Though you'll have to agree that the initial Hyperthread 
implementation in Intel CPUs was really bad. I hear good things about the 
latest version though.


As measured by what? A single-threaded desktop benchmark?

But hey, if you can stick more cores in, or do what AMD is doing with its 
upcoming line, why not do that? Hyperthreading seems like more of a gimmick 
than anything.


If there weren't clear and quantifiable benefits then IBM wouldn't be 
putting it in its Power series of high-end processors, it wouldn't be 
in the X-Box 360's Xenon (PPC970 variant), and Sun wouldn't be going 
massively SMT in the Niagara SPARCs. Silicon die space is _expensive_ - 
it wouldn't be getting wasted on gimmicks.


What seems to help the most with the new Intel arch is the 
auto overclocking when some cores are idle. Far more of a performance 
improvement than Hyperthreading will ever be it seems.


Which is targeted at gamers and desktop enthusiasts who think that FPS 
in Crysis is a meaningful measure of performance for most applications. 
Server load profile is a whole different ball game.


Anyway, let's get this back on topic for the list before we get told off 
(of course, I'm more than happy to continue the discussion off list).


Gordan


virtio disk slower than IDE?

2009-11-14 Thread Gordan Bobic
I just tried paravirtualized virtio block devices, and my tests show 
that they are approximately 30% slower than emulated IDE devices. I'm 
guessing this isn't normal. Is this a known issue or am I likely to have 
misconfigured something? I'm using 64-bit RHEL/CentOS 5 (both host and 
guest).


Thanks.

Gordan


Re: Guest OpenGL Acceleration

2009-08-18 Thread Gordan Bobic
On Tue, 18 Aug 2009 13:02:18 +0100, Armindo Silva deathon2l...@gmail.com
wrote:
 There's a patch for qemu:
 
 http://qemu-forum.ipi.fi/viewtopic.php?t=2984

Interesting, and along the lines of exactly what I was after (including the
opengl32.dll win32 library). But that thread is from 2+ years ago, with no
mention of whether the project is maintained. Does it work with the KVM
virtualization back end for QEMU?

 and there's also this:
 
 http://sysweb.cs.toronto.edu/projects/7
 
 I think this is used by vbox.

No, vbox seems to use something very similar to the approach in the first
link you posted.

Gordan


Re: Trouble shutting down vm on guest halt

2009-08-14 Thread Gordan Bobic
On Fri, 14 Aug 2009 14:38:52 +0200, Flemming Frandsen
flemming.frand...@stibo.com wrote:
 I'm having some problems getting kvm to exit when the guest OS has
halted.
 
 Specifically I'm running CentOS 5.2 as the guest on ubuntu 8.10.
 I've noticed that 32 bit windows xp and 64 bit ubuntu 9.10 can power 
 down a vm as expected.
 
 Any idea where I should look for documentation on how to tickle kvm ACPI 
 the right way from CentOS?

I have CentOS 5.3 on CentOS 5.3 and the shutdown on that works OK, so it
seems probable that this is a host/KVM side issue. What version of KVM are
you running? Oh, and you aren't running it with -no-acpi are you?

Gordan


Guest OpenGL Acceleration

2009-08-13 Thread Gordan Bobic
Is OpenGL Acceleration based on the host's OpenGL capability available 
in KVM?


Thanks.

Gordan


Disk Emulation and Trim Instruction

2009-08-13 Thread Gordan Bobic
With the recent talk of the trim SATA instruction becoming supported in 
the upcoming versions of Windows and claims from Intel that support for 
it in their SSDs is imminent, it occurs to me that this would be equally 
useful in virtual disk emulation.


Since the disk image is a sparse file, it always only grows, and 
eventually it will grow to its full intended size even if the actual 
used space is a small fraction of the container size. Since the trim 
instruction tells the disk that a particular block is no longer used 
(and can thus be scheduled for erasing as and when required), the same 
thing could be used to reclaim space used by sparse files backing the 
VM. It would allow for higher overcommit of disk usage on VM farms.


Is this feature likely to be available in KVM soon?

Gordan