Re: Question on skip_emulated_instructions()
2010/4/6 Gleb Natapov g...@redhat.com:

On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:

Hi. When handle_io() is called, rip is currently advanced *before* the I/O is actually handled by qemu in userland. While implementing Kemari for KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly in userland qemu, we hit a problem: synchronizing the contents of the VCPU before handling I/O in qemu is too late, because rip has already been advanced in KVM. Although we avoided this issue with a temporary hack, I would like to ask a few questions about skip_emulated_instructions().

1. Does rip need to be advanced before the I/O is handled by qemu?

In current kvm.git, rip is advanced before I/O is handled by qemu only in the case of the out instruction. From an architecture point of view I think that is OK, since on real hardware you cannot guarantee that the I/O takes effect before the instruction pointer is advanced. It is done that way because we want out emulation to be really fast, so we skip the x86 emulator.

Thanks for your reply. If advancing rip later doesn't break device behavior or introduce a slowdown, I would like that to be done.

2. If not, is it possible to split skip_emulated_instructions(), e.g. into rec_emulated_instructions() to remember next_rip, and skip_emulated_instructions() to actually advance rip?

Currently only the emulator can call into userspace to do I/O, so after userspace returns from an I/O exit, control is handed back to the emulator unconditionally. The out instruction skips the emulator, but there is nothing to do after userspace returns, so the regular cpu loop is executed. If we want to advance rip only after userspace has executed the I/O done by out, we need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out()) and call different code depending on who it was. That can be done by having a callback that (if not NULL) is called on return from userspace.
Your suggestion is to introduce a callback entry, and instead of calling kvm_rip_write(), set the callback before calling kvm_fast_pio_out(), and check it upon return from userspace, correct? According to the comment in x86.c, for an out instruction vcpu->arch.pio.count is set to 0 to skip the emulator. To call kvm_fast_pio_out(), !string and !in must hold. If we can check vcpu->arch.pio.count, string and in on return from userspace, can't we distinguish who requested the I/O, the emulator or kvm_fast_pio_out()?

3. svm has next_rip, but when it is 0, a nop is emulated. Can this be modified to continue without emulating the nop when next_rip is 0?

I don't see where a nop is emulated if next_rip is 0. As far as I can see, in the next_rip == 0 case the instruction at rip is decoded to figure out its length, and then rip is advanced by that length. Anyway, next_rip is an svm-only thing.

Sorry, I wasn't understanding the code well enough:

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
	...
	if (!svm->next_rip) {
		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
				EMULATE_DONE)
			printk(KERN_DEBUG "%s: NOP\n", __func__);
		return;
	}

Since the printk says NOP, I thought emulate_instruction was doing so... The reason I asked about next_rip is that I was hoping to use this field to advance rip only after userspace has executed the I/O done by out, i.e. if next_rip is != 0, call kvm_rip_write(); and to introduce next_rip to vmx if it is usable, because vmx currently uses a local variable for rip.

Yoshi

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
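The callback approach discussed above can be modeled in a few lines. This is a hypothetical Python sketch, not the actual KVM code: the class and function names (Vcpu, kvm_fast_pio_out, return_from_userspace, complete_userspace_io) are illustrative. The idea is that whoever requests the I/O exit installs a completion callback, and the return path runs it instead of guessing who asked.

```python
# Sketch of "advance rip only after userspace did the I/O" via a callback.
# All names are illustrative, not the real KVM implementation.

class Vcpu:
    def __init__(self):
        self.rip = 0
        self.complete_userspace_io = None  # callback, or None

def kvm_fast_pio_out(vcpu, insn_len):
    # Request the I/O exit, but defer advancing rip until userspace is done.
    next_rip = vcpu.rip + insn_len
    def complete():
        vcpu.rip = next_rip  # the deferred skip_emulated_instruction()
    vcpu.complete_userspace_io = complete

def return_from_userspace(vcpu):
    # Called once qemu has actually performed the I/O.
    cb, vcpu.complete_userspace_io = vcpu.complete_userspace_io, None
    if cb:
        cb()   # out-instruction fast path: advance rip now
    # else: emulator path, the x86 emulator resumes instead

vcpu = Vcpu()
kvm_fast_pio_out(vcpu, insn_len=2)  # "out" at rip=0, 2 bytes long
assert vcpu.rip == 0                # rip not yet advanced
return_from_userspace(vcpu)
assert vcpu.rip == 2                # advanced only after userspace I/O
```

This is the same distinction Gleb describes: the callback is non-NULL only when kvm_fast_pio_out() requested the exit, so the emulator path is untouched.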
[PATCH] [PPC] Add dequeue for external on BookE
Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the IRQ line from userspace. This added a new core-specific callback that I apparently forgot to add for BookE. So let's add the callback for BookE as well, making it build again.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/booke.c | 6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 4d686cc..e170672 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -132,6 +132,12 @@ void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_EXTERNAL);
 }
 
+void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
+                                  struct kvm_interrupt *irq)
+{
+	clear_bit(BOOKE_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions);
+}
+
 /* Deliver the interrupt of the corresponding priority, if possible. */
 static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
                                         unsigned int priority)
--
1.6.0.2

-- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
On Wed, Apr 07, 2010 at 10:41:08AM +0800, Xin, Xiaohui wrote:

Michael, qemu needs a userspace write; is that a synchronous one or an asynchronous one?

It's a synchronous non-blocking write.

Sorry, why does qemu live migration need the device to have a userspace write? How does the write operation work? And why is a read operation not needed here? Thanks Xiaohui

Roughly: with ethernet bridges, moving a device from one location in the network to another makes forwarding tables incorrect (or incomplete) until outgoing traffic from the device causes those tables to be updated. Since there's no guarantee that the guest will generate outgoing traffic, after migration qemu sends out several dummy packets itself.

-- MST
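The bridge-relearning trick Michael describes can be illustrated with a minimal frame builder. This is a hedged sketch, not qemu's actual announce packet (which differs in detail): the point is simply that a broadcast frame carrying the guest's source MAC is enough for learning bridges to update their forwarding tables.

```python
# Sketch of the post-migration "dummy packet": a broadcast Ethernet frame
# whose source MAC is the guest's, so bridges relearn which port it is on.
# Frame layout is illustrative only, not qemu's real announce packet.

def build_announce_frame(mac: bytes) -> bytes:
    assert len(mac) == 6
    dst = b"\xff" * 6          # broadcast, so every bridge segment sees it
    ethertype = b"\x08\x06"    # illustrative; only the source MAC matters
    payload = b"\x00" * 46     # pad to the 60-byte minimum frame size
    return dst + mac + ethertype + payload

frame = build_announce_frame(b"\x52\x54\x00\x12\x34\x56")
assert len(frame) == 60
assert frame[6:12] == b"\x52\x54\x00\x12\x34\x56"  # source MAC is the guest's
```

Sending a handful of these right after migration is the behavior described above; no read path is needed because the bridges only learn from traffic *originating* at the moved port.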
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
On Wed, Apr 07, 2010 at 09:36:36AM +0800, Xin, Xiaohui wrote:

Michael, for the write logging, do you have a function at hand that we can use to recompute the log? If so, I think I can use it to recompute the log info when logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers that were submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors; waiting for them to finish may take some time. I think that when the logging ioctl changes, having the logging change just after that is also reasonable.

The key point is that after the logging ioctl returns, any subsequent change to memory must be logged. It does not matter when the request was submitted; otherwise we will get memory corruption on migration.

The change to memory happens in vhost_add_used_and_signal(), right? So after the ioctl returns, just recomputing the log info for the events in the async queue is OK, since the ioctl and write-log operations are all protected by vq->mutex. Thanks Xiaohui

Yes, I think this will work.

Thanks, so do you have the function to recompute the log info at hand that I can use? I vaguely remember you mentioned one some time ago.

Doesn't just rerunning vhost_get_vq_desc() work? Am I missing something here?

vhost_get_vq_desc() looks in the vq, finds the first available buffers, and converts them to an iovec. I think the first available buffer is not one of the buffers in the async queue, so I think rerunning vhost_get_vq_desc() cannot work. Thanks Xiaohui

Right, but we can move the head back, so we'll find the same buffers again, or add a variant of vhost_get_vq_desc() that will process descriptors already consumed.
Thanks Xiaohui

 drivers/vhost/net.c   | 189 ++++++++++++++++++++++++++++++++++++++++++++++---
 drivers/vhost/vhost.h |  10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+
+	vq_log = unlikely(vhost_has_feature(
+			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
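The "move the head back" idea from the discussion above — rewinding the ring so the same buffers are found again after a failure or a log recompute — can be sketched as a toy ring consumer. The class and method names here are illustrative analogues of vhost's API, not the real kernel code.

```python
# Toy model of vhost's available-ring consumption with a rewind ("discard")
# operation, mirroring vhost_get_vq_desc()/vhost_discard_vq_desc().
# Names are illustrative only.

class AvailRing:
    def __init__(self, descs):
        self.descs = descs
        self.last_avail_idx = 0   # consumer's position in the ring

    def get_desc(self):
        # vhost_get_vq_desc() analogue: take the next available buffer.
        if self.last_avail_idx >= len(self.descs):
            return None           # ring empty
        d = self.descs[self.last_avail_idx]
        self.last_avail_idx += 1
        return d

    def discard_desc(self, n):
        # vhost_discard_vq_desc() analogue: rewind so the same buffers
        # are returned again, e.g. after -EAGAIN or when the dirty log
        # must be recomputed for already-consumed descriptors.
        self.last_avail_idx -= n

vq = AvailRing(["buf0", "buf1", "buf2"])
assert vq.get_desc() == "buf0"
vq.discard_desc(1)               # move the head back
assert vq.get_desc() == "buf0"   # same buffer found again
```

A variant that walks descriptors already consumed (the second option Michael mentions) would iterate `descs[0:last_avail_idx]` without touching the index at all.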
[PATCH 1/3] KVM Test: Add control file dbench.control.200 for dbench
This control file sets seconds to 200. It is used by the ioquit script.

Signed-off-by: Feng Yang fy...@redhat.com
---
 .../tests/kvm/autotest_control/dbench.control.200 | 20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/autotest_control/dbench.control.200

diff --git a/client/tests/kvm/autotest_control/dbench.control.200 b/client/tests/kvm/autotest_control/dbench.control.200
new file mode 100644
index 000..c648f7a
--- /dev/null
+++ b/client/tests/kvm/autotest_control/dbench.control.200
@@ -0,0 +1,20 @@
+TIME="SHORT"
+AUTHOR = "Martin Bligh <mbl...@google.com>"
+DOC = """
+dbench is one of our standard kernel stress tests. It produces filesystem
+load like netbench originally did, but involves no network system calls.
+Its results include throughput rates, which can be used for performance
+analysis.
+
+More information on dbench can be found here:
+http://samba.org/ftp/tridge/dbench/README
+
+Currently it needs to be updated in its configuration. It is a great test for
+the higher level I/O systems but barely touches the disk right now.
+"""
+NAME = 'dbench'
+TEST_CLASS = 'kernel'
+TEST_CATEGORY = 'Functional'
+TEST_TYPE = 'client'
+
+job.run_test('dbench', seconds=200)
--
1.5.5.6
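Since the only difference between this control file and the stock dbench control is the `seconds=200` argument, such per-duration control files could also be generated. A small sketch (the helper name is hypothetical, not part of autotest):

```python
# Hypothetical generator for a seconds-parameterized dbench control file,
# instead of checking in one near-identical file per duration.

def make_dbench_control(seconds):
    return ("NAME = 'dbench'\n"
            "TEST_CLASS = 'kernel'\n"
            "TEST_CATEGORY = 'Functional'\n"
            "TEST_TYPE = 'client'\n"
            "\n"
            "job.run_test('dbench', seconds=%d)\n" % seconds)

control = make_dbench_control(200)
assert "seconds=200" in control
assert control.endswith("\n")
```

The checked-in file above is equivalent to `make_dbench_control(200)` plus the DOC metadata block.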
[PATCH 2/3] KVM Test: Add function run_autotest_background and wait_autotest_background.
Add functions run_autotest_background and wait_autotest_background to kvm_test_utils.py. These two functions are used in the ioquit test script.

Signed-off-by: Feng Yang fy...@redhat.com
---
 client/tests/kvm/kvm_test_utils.py | 68 +++++++++++++++++++++++++++++++++++-
 1 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index f512044..2a1054e 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -21,7 +21,7 @@ More specifically:
 @copyright: 2008-2009 Red Hat Inc.
 """
 
-import time, os, logging, re, commands
+import time, os, logging, re, commands, sys
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
 import kvm_utils, kvm_vm, kvm_subprocess, scan_results
@@ -402,3 +402,69 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir):
         result = bad_results[0]
         raise error.TestFail("Test '%s' ended with %s (reason: '%s')"
                              % (result[0], result[1], result[3]))
+
+
+def run_autotest_background(vm, session, control_path, timeout, test_name,
+                            outputdir):
+    """
+    Wrapper of run_autotest() that makes it run in the background through
+    fork(), in a child process.
+    1) Flush stdio.
+    2) Build the test params received from the arguments and used by
+       run_autotest().
+    3) Fork the process and let run_autotest() run in the child.
+    4) Catch any exception raised by run_autotest() and exit the child with
+       a non-zero return code.
+    5) If no exception is caught, exit with 0.
+
+    @param vm: VM object.
+    @param session: A shell session on the VM provided.
+    @param control_path: An autotest control file.
+    @param timeout: Timeout under which the autotest test must complete.
+    @param test_name: Autotest client test name.
+    @param outputdir: Path on host where we should copy the guest autotest
+            results to.
+    """
+    def flush():
+        sys.stdout.flush()
+        sys.stderr.flush()
+
+    logging.info("Running autotest in background ...")
+    flush()
+    pid = os.fork()
+    if pid:
+        # Parent process
+        return pid
+
+    try:
+        # Launch autotest
+        logging.info("Child process of run_autotest_background")
+        run_autotest(vm, session, control_path, timeout, test_name, outputdir)
+    except error.TestFail, message_fail:
+        logging.info("[Autotest Background FAIL] %s" % message_fail)
+        os._exit(1)
+    except error.TestError, message_error:
+        logging.info("[Autotest Background ERROR] %s" % message_error)
+        os._exit(2)
+    except:
+        os._exit(3)
+
+    logging.info("[Autotest Background GOOD]")
+    os._exit(0)
+
+
+def wait_autotest_background(pid):
+    """
+    Wait for the background autotest to finish.
+
+    @param pid: Pid of the child process executing the background autotest.
+    """
+    logging.info("Waiting for background autotest to finish ...")
+
+    (pid, s) = os.waitpid(pid, 0)
+    status = os.WEXITSTATUS(s)
+    if status != 0:
+        return False
+    return True
--
1.5.5.6
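The fork/waitpid pattern this patch introduces can be distilled to a few lines. This is a generic sketch of the same mechanism (the helper names run_background/wait_background are illustrative, not the patch's API): the child runs the workload and reports via its exit status, and the parent reaps it later.

```python
import os
import sys

# Minimal sketch of the run_autotest_background()/wait_autotest_background()
# pattern: fork, run the workload in the child, report through exit status.
# POSIX-only (uses os.fork).

def run_background(func):
    sys.stdout.flush()      # avoid duplicated buffered output after fork
    sys.stderr.flush()
    pid = os.fork()
    if pid:
        return pid          # parent: hand back the child's pid
    try:
        func()              # child: run the workload
    except Exception:
        os._exit(1)         # any failure -> non-zero exit status
    os._exit(0)             # success; never return into the caller

def wait_background(pid):
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

ok_pid = run_background(lambda: None)
assert wait_background(ok_pid) is True
```

Note the child must terminate with `os._exit()` rather than `sys.exit()` so it never runs the parent's remaining code or cleanup handlers, which is exactly what the patch does.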
[PATCH 3/3] KVM Test: Add ioquit test case
Signed-off-by: Feng Yang fy...@redhat.com
---
 client/tests/kvm/tests/ioquit.py       | 54 ++++++++++++++++++++++++++++++++
 client/tests/kvm/tests_base.cfg.sample |  4 ++
 2 files changed, 58 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/ioquit.py

diff --git a/client/tests/kvm/tests/ioquit.py b/client/tests/kvm/tests/ioquit.py
new file mode 100644
index 000..c75a0e3
--- /dev/null
+++ b/client/tests/kvm/tests/ioquit.py
@@ -0,0 +1,54 @@
+import logging, time, random, signal, os
+from autotest_lib.client.common_lib import error
+import kvm_test_utils, kvm_utils
+
+
+def run_ioquit(test, params, env):
+    """
+    Emulate poweroff under an I/O workload (dbench so far) using the monitor
+    command 'quit'.
+
+    @param test: Kvm test object
+    @param params: Dictionary with the test parameters.
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+    session2 = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+
+    def is_autotest_launched():
+        if session.get_command_status("pgrep autotest") != 0:
+            logging.debug("Autotest process not found")
+            return False
+        return True
+
+    test_name = params.get("background_test", "dbench")
+    control_file = params.get("control_file", "dbench.control")
+    timeout = int(params.get("test_timeout", 300))
+    control_path = os.path.join(test.bindir, "autotest_control",
+                                control_file)
+    outputdir = test.outputdir
+
+    pid = kvm_test_utils.run_autotest_background(vm, session2, control_path,
+                                                 timeout, test_name,
+                                                 outputdir)
+    if pid < 0:
+        raise error.TestError("Could not create child process to execute "
+                              "autotest in the background")
+
+    if kvm_utils.wait_for(is_autotest_launched, 240, 0, 2):
+        logging.debug("Background autotest launched successfully")
+    else:
+        logging.debug("Background autotest failed, start the test anyway")
+
+    time.sleep(100 + random.randrange(0, 100))
+    logging.info("Kill the virtual machine")
+    vm.process.close()
+
+    logging.info("Kill the tracking process")
+    kvm_utils.safe_kill(pid, signal.SIGKILL)
+    kvm_test_utils.wait_autotest_background(pid)
+    session.close()
+    session2.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index 9b12fc2..d8530f6 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -305,6 +305,10 @@ variants:
             - ksm_parallel:
                 ksm_mode = "parallel"
 
+        - ioquit:
+            type = ioquit
+            control_file = dbench.control.200
+            background_test = dbench
         # system_powerdown, system_reset and shutdown *must* be the last ones
         # defined (in this order), since the effect of such tests can leave
         # the VM on a bad state.
--
1.5.5.6
Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin Xiaohui xiaohui@intel.com
---
Michael, Thanks a lot for the explanation. I have drafted a patch for the qemu write after I looked into the tun driver. Does it do it the right way? Thanks Xiaohui

 drivers/vhost/mpassthru.c | 45 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index e9449ac..1cde097 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
 	return mask;
 }
 
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+	skb_set_network_header(skb, ETH_HLEN);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+	mp->dev->stats.tx_packets++;
+	mp->dev->stats.tx_bytes += len;
+
+	mp_put(mp);
+	return result;
+}
+
 static int mp_chr_close(struct inode *inode, struct file *file)
 {
 	struct mp_file *mfile = file->private_data;
@@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
 static const struct file_operations mp_fops = {
 	.owner = THIS_MODULE,
 	.llseek = no_llseek,
+	.write = do_sync_write,
+	.aio_write = mp_chr_aio_write,
 	.poll = mp_chr_poll,
 	.unlocked_ioctl = mp_chr_ioctl,
 	.open = mp_chr_open,
--
1.5.4.4
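The data path of mp_chr_aio_write() above — compute the total iovec length, copy the pieces into one buffer, transmit — can be modeled outside the kernel. This is a pure-Python sketch under illustrative names, not the driver's API; it also makes explicit that a write should report the number of bytes it queued.

```python
# Model of the aio_write data path: iov_length() followed by
# skb_copy_datagram_from_iovec() into a single buffer (the skb analogue).
# Illustrative only; no kernel APIs involved.

def aio_write_model(iov):
    length = sum(len(seg) for seg in iov)   # iov_length() analogue
    skb = bytearray()
    for seg in iov:                         # gather the iovec segments
        skb.extend(seg)
    assert len(skb) == length               # all bytes copied
    return len(skb)                         # bytes queued, what write returns

assert aio_write_model([b"abc", b"defg"]) == 7
```

In the actual C, the equivalent of that final return value is `len`, the value the write path should hand back to the caller.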
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
On Tue, Apr 06, 2010 at 01:32:53PM -0700, David L Stevens wrote:

This patch adds support for the Mergeable Receive Buffers feature to vhost_net. +-DLS

Changes from previous revision:
1) renamed: vhost_discard_vq_desc -> vhost_discard_desc, vhost_get_heads -> vhost_get_desc_n, vhost_get_vq_desc -> vhost_get_desc
2) added heads as argument to vhost_get_desc_n
3) changed vq->heads from iovec to vring_used_elem, removed casts
4) changed vhost_add_used to do multiple elements in a single copy_to_user, or two when we wrap the ring.
5) removed rxmaxheadcount and available buffer checks in favor of running until an allocation failure, but making sure we break the loop if we get two in a row, indicating we have at least 1 buffer, but not enough for the current receive packet
6) restore non-vnet header handling

Signed-Off-By: David L Stevens dlstev...@us.ibm.com

Thanks! There's some whitespace damage; are you sending with your new sendmail setup? It seems to have worked for the qemu patches ...

diff -ruNp net-next-p0/drivers/vhost/net.c net-next-v3/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c	2010-03-22 12:04:38.0 -0700
+++ net-next-v3/drivers/vhost/net.c	2010-04-06 12:54:56.0 -0700
@@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
 	hdr_size = vq->hdr_size;
 	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
+		head = vhost_get_desc(&net->dev, vq, vq->iov,
+				      ARRAY_SIZE(vq->iov), &out, &in,
 				      NULL, NULL);
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {
+				vhost_discard_desc(vq, 1);
+				tx_poll_start(net, sock);
+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head,
+							  0);

vhost does not currently have a consistent error handling strategy: if we drop packets, we need to think about which other errors should cause packet drops. I prefer to just call vq_err for now, and have us look at handling segfaults etc. in a consistent way separately.

+			}
 			break;
 		}
 		if (err != len)
@@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
 	unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct sock *sk)
+{
+	struct sk_buff *head;
+	int len = 0;
+
+	lock_sock(sk);
+	head = skb_peek(&sk->sk_receive_queue);
+	if (head)
+		len = head->len;
+	release_sock(sk);
+	return len;
+}
+

I wonder whether it makes sense to check skb_queue_empty(&sk->sk_receive_queue) outside the lock, to reduce the cost of this call on an empty queue (we know that it happens at least once each time we exit the loop on rx)?

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned head, out, in, log, s;
+	unsigned in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
 		.msg_flags = MSG_DONTWAIT,
 	};
-	struct virtio_net_hdr hdr = {
-		.flags = 0,
-		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
+	int retries = 0;
 	size_t len, total_len = 0;
-	int err;
+	int err, headcount, datalen;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-
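The headcount logic this patch is about — peek at the next packet's length with vhost_head_len(), then keep taking rx buffers until they cover it, breaking out when the ring runs dry — can be sketched as follows. The function name and signature are illustrative, not vhost's actual API.

```python
# Sketch of the mergeable-rx "headcount" computation: how many ring buffers
# are needed for a packet of datalen bytes, or None if the ring runs out
# (the allocation-failure case item 5 above describes).  Names illustrative.

def headcount_for(datalen, buffer_sizes, start):
    """Return (count, next_index) of buffers needed to hold datalen bytes,
    or (None, start) if the ring is exhausted first."""
    need, count, i = datalen, 0, start
    while need > 0:
        if i >= len(buffer_sizes):
            return None, start   # not enough buffers: break the loop, retry
        need -= buffer_sizes[i]
        count += 1
        i += 1
    return count, i

count, nxt = headcount_for(datalen=3000, buffer_sizes=[1500, 1500, 1500], start=0)
assert count == 2   # one 3000-byte packet merged across two 1500-byte buffers
count, _ = headcount_for(datalen=5000, buffer_sizes=[1500, 1500], start=0)
assert count is None   # ring exhausted: poll and come back later
```

The `count` here is what the guest later sees as `num_buffers` in the `virtio_net_hdr_mrg_rxbuf` header: how many descriptor heads one received packet was spread over.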
Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
On Wed, Apr 07, 2010 at 05:00:39PM +0800, xiaohui@intel.com wrote:

From: Xin Xiaohui xiaohui@intel.com
---
Michael, Thanks a lot for the explanation. I have drafted a patch for the qemu write after I looked into the tun driver. Does it do it the right way? Thanks Xiaohui

 drivers/vhost/mpassthru.c | 45 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index e9449ac..1cde097 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
 	return mask;
 }
 
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result;
+
+	if (!mp)
+		return -EBADFD;
+

Can this happen? When?

+	/* currently, async is not supported */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;

Really necessary? I think do_sync_write handles all this.

+
+	len = iov_length(iov, count);
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;

Surely not EFAULT. -EAGAIN?

+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+	skb_set_network_header(skb, ETH_HLEN);

Is this really right or necessary? Also, we probably need to check that the length is at least ETH_ALEN before doing this.

+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);

eth_type_trans?

+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+	mp->dev->stats.tx_packets++;
+	mp->dev->stats.tx_bytes += len;

Doesn't the hard start xmit function for the device increment the counters?

+
+	mp_put(mp);
+	return result;
+}
+
 static int mp_chr_close(struct inode *inode, struct file *file)
 {
 	struct mp_file *mfile = file->private_data;
@@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
 static const struct file_operations mp_fops = {
 	.owner = THIS_MODULE,
 	.llseek = no_llseek,
+	.write = do_sync_write,
+	.aio_write = mp_chr_aio_write,
 	.poll = mp_chr_poll,
 	.unlocked_ioctl = mp_chr_ioctl,
 	.open = mp_chr_open,
--
1.5.4.4
[GSoC 2010][RESEND] Shared memory transport between guest(s) and host
Hi, I am interested in the "Shared memory transport between guest(s) and host" project for GSoC 2010. The description of the project is pretty straightforward, but I am a little bit lost on some parts:

1- Is there any documentation available on KVM shared memory transport? This would definitely help in understanding how inter-VM shared memory should work.

2- Does the project only aim at providing a shared memory transport between a single host and a number of guests, with the host acting as a central node containing shared memory objects and communication taking place only between guests and host, or is there any kind of guest-to-guest communication to be supported? If yes, how should it be done?
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
Some corrections:

On Wed, Apr 07, 2010 at 01:59:10PM +0300, Michael S. Tsirkin wrote:

On Tue, Apr 06, 2010 at 01:32:53PM -0700, David L Stevens wrote:

This patch adds support for the Mergeable Receive Buffers feature to vhost_net. +-DLS

Changes from previous revision:
1) renamed: vhost_discard_vq_desc -> vhost_discard_desc, vhost_get_heads -> vhost_get_desc_n, vhost_get_vq_desc -> vhost_get_desc
2) added heads as argument to vhost_get_desc_n
3) changed vq->heads from iovec to vring_used_elem, removed casts
4) changed vhost_add_used to do multiple elements in a single copy_to_user, or two when we wrap the ring.
5) removed rxmaxheadcount and available buffer checks in favor of running until an allocation failure, but making sure we break the loop if we get two in a row, indicating we have at least 1 buffer, but not enough for the current receive packet
6) restore non-vnet header handling

Signed-Off-By: David L Stevens dlstev...@us.ibm.com

Thanks! There's some whitespace damage; are you sending with your new sendmail setup? It seems to have worked for the qemu patches ...

diff -ruNp net-next-p0/drivers/vhost/net.c net-next-v3/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c	2010-03-22 12:04:38.0 -0700
+++ net-next-v3/drivers/vhost/net.c	2010-04-06 12:54:56.0 -0700
@@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
 	hdr_size = vq->hdr_size;
 	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
+		head = vhost_get_desc(&net->dev, vq, vq->iov,
+				      ARRAY_SIZE(vq->iov), &out, &in,
 				      NULL, NULL);
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {
+				vhost_discard_desc(vq, 1);
+				tx_poll_start(net, sock);
+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head,
+							  0);

vhost does not currently have a consistent error handling strategy: if we drop packets, we need to think about which other errors should cause packet drops. I prefer to just call vq_err for now, and have us look at handling segfaults etc. in a consistent way separately.

+			}
 			break;
 		}
 		if (err != len)
@@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
 	unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct sock *sk)
+{
+	struct sk_buff *head;
+	int len = 0;
+
+	lock_sock(sk);
+	head = skb_peek(&sk->sk_receive_queue);
+	if (head)
+		len = head->len;
+	release_sock(sk);
+	return len;
+}
+

I wonder whether it makes sense to check skb_queue_empty(&sk->sk_receive_queue) outside the lock, to reduce the cost of this call on an empty queue (we know that it happens at least once each time we exit the loop on rx)?

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned head, out, in, log, s;
+	unsigned in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
 		.msg_flags = MSG_DONTWAIT,
 	};
-	struct virtio_net_hdr hdr = {
-		.flags = 0,
-		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
+	int retries = 0;
 	size_t len, total_len = 0;
-	int err;
+	int err, headcount, datalen;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
-	for (;;) {
-		head =
[PATCH] KVM test: Add a subtest iofuzz
The design of iofuzz is simple: it just generate random I/O port activity inside the virtual machine. The correctness of the device emulation may be verified through this test. As the instrcutions are randomly generated, guest may enter the wrong state. The test solve this issue by detect the hang and restart the virtual machine. The test duration could also be adjusted through the fuzz_count. And the parameter skip_devices is used to specified the devices which should not be used to do the fuzzing. For current version, every activity were logged and the commnad was sent through a seesion between host and guest. Through this method may slow down the whole test but it works well. The enumeration was done through /proc/ioports and the scenario of avtivity is not aggressive. Suggestions are welcomed. Signed-off-by: Jason Wang jasow...@redhat.com --- client/tests/kvm/tests/iofuzz.py | 97 client/tests/kvm/tests_base.cfg.sample |2 + 2 files changed, 99 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/iofuzz.py diff --git a/client/tests/kvm/tests/iofuzz.py b/client/tests/kvm/tests/iofuzz.py new file mode 100644 index 000..c2f22af --- /dev/null +++ b/client/tests/kvm/tests/iofuzz.py @@ -0,0 +1,97 @@ +import logging, time, re, random +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_iofuzz(test, params, env): + +KVM iofuzz test: +1) Log into a guest + +@param test: kvm test object +@param params: Dictionary with the test parameters +@param env: Dictionary with test environment. 
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm, 0,
+                                            float(params.get("boot_timeout", 240)),
+                                            0, 2)
+
+    def outb(session, port, data):
+        logging.debug("outb(0x%x,0x%x)" % (port, data))
+        outb_cmd = "echo -e '\\%s' | dd of=/dev/port seek=%d bs=1 count=1" % \
+                   (oct(data), port)
+        s, o = session.get_command_status_output(outb_cmd)
+        if s != 0:
+            logging.debug("Non-zero value returned")
+
+    def inb(session, port):
+        logging.debug("inb(0x%x)" % port)
+        inb_cmd = "dd if=/dev/port seek=%d of=/dev/null bs=1 count=1" % port
+        s, o = session.get_command_status_output(inb_cmd)
+        if s != 0:
+            logging.debug("Non-zero value returned")
+
+    def fuzz(session, inst_list):
+        for (op, operand) in inst_list:
+            if op == "read":
+                inb(session, operand[0])
+            elif op == "write":
+                outb(session, operand[0], operand[1])
+            else:
+                raise error.TestError("Unknown command %s" % op)
+
+            if not session.is_responsive():
+                logging.debug("Session is not responsive")
+                if vm.process.is_alive():
+                    logging.debug("VM is alive, try to re-login")
+                    try:
+                        session = kvm_test_utils.wait_for_login(vm, 0, 10, 0, 2)
+                    except:
+                        logging.debug("Could not re-login, reboot the guest")
+                        session = kvm_test_utils.reboot(vm, session,
+                                                        method="system_reset")
+                else:
+                    raise error.TestFail("VM has quit abnormally")
+
+    try:
+        ports = {}
+        ran = random.SystemRandom()
+
+        logging.info("Enumerate the devices through /proc/ioports")
+        ioports = session.get_command_output("cat /proc/ioports")
+        logging.debug(ioports)
+        devices = re.findall("(\w+)-(\w+)\ : (.*)", ioports)
+
+        skip_devices = params.get("skip_devices", "")
+        fuzz_count = int(params.get("fuzz_count", 10))
+
+        for (beg, end, name) in devices:
+            ports[(int(beg, base=16), int(end, base=16))] = name.strip()
+
+        for (beg, end) in ports.keys():
+
+            name = ports[(beg, end)]
+            if name in skip_devices:
+                logging.info("Skipping %s" % name)
+                continue
+
+            logging.info("Fuzzing %s at 0x%x-0x%x" % (name, beg, end))
+            inst = []
+
+            # Read all ports
+            for port in range(beg, end+1):
+                inst.append(("read", [port]))
+
+            # Outb with zero
+            for port in range(beg, end+1):
+                inst.append(("write", [port, 0]))
+
+            # Random fuzzing
+            for seq in range(fuzz_count * (end-beg+1)):
+                inst.append(("write", [ran.randint(beg, end),
+                                       ran.randint(0, 255)]))
+
+            fuzz(session, inst)
+
+    finally:
+        session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample
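The tests_base.cfg.sample hunk is cut off above. For illustration only — the exact two added lines are not shown in this message, so the variant name, placement, and example device names below are my assumptions based on how other subtests are registered — an entry enabling the test with its optional parameters might look like:

```
    - iofuzz:
            type = iofuzz
            # optional knobs read by run_iofuzz() (illustrative values):
            # fuzz_count = 10
            # skip_devices = keyboard
```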
Re: Setting nx bit in virtual CPU
On 07/04/10 06:39, Avi Kivity wrote: On 04/07/2010 01:31 AM, Richard Simpson wrote: 2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so? Umm, could you give me a clue about how to do that? It is some time since I configured the host kernel, but I do have a /proc/config.gz. Could I check by looking in that? The attached script should verify it.

rs% ./check-nx
Traceback (most recent call last):
  File "./check-nx", line 17, in <module>
    efer = msr().read(0xc0000080, 0)
  File "./check-nx", line 8, in __init__
    self.f = file('/dev/msr0')
IOError: [Errno 2] No such file or directory: '/dev/msr0'

Sorry! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Setting nx bit in virtual CPU
On 04/07/2010 03:10 PM, Richard Simpson wrote: On 07/04/10 06:39, Avi Kivity wrote: On 04/07/2010 01:31 AM, Richard Simpson wrote: 2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so? Umm, could you give me a clue about how to do that? It is some time since I configured the host kernel, but I do have a /proc/config.gz. Could I check by looking in that? The attached script should verify it.

rs% ./check-nx
Traceback (most recent call last):
  File "./check-nx", line 17, in <module>
    efer = msr().read(0xc0000080, 0)
  File "./check-nx", line 8, in __init__
    self.f = file('/dev/msr0')
IOError: [Errno 2] No such file or directory: '/dev/msr0'

Run as root, please. And check first that you have a file named /dev/cpu/0/msr. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
KVM freeze when using --serial
Hi there, I already posted this problem to #kvm on freenode. Please set me in CC: when replying to this mail, as I am not subscribed to this mailing list right now. The scenario is as follows: I have 2 VM processes in userspace. The first is started with the parameter --monitor pty. => This results in a file /dev/pts/x on the host (crw--w 1 kittel tty 136, 3 2010-04-07 15:51 /dev/pts/3 on my system). Another VM is then started with the parameter --serial /dev/pts/3 => This results in /dev/ttyS0 inside the second VM. Both VMs are running debian lenny. The host (debian) uses qemu-kvm 0.12.3. startvms.sh start is used to start the VMs. Running the executable built from test.c in the second VM results in a freeze of this VM. (The test.c included uses /dev/ttyS1, as /dev/ttyS0 is the VM's serial console in my setup.) The process uses 100% CPU and is stuck in kvm_mutex_lock(). Trying to use the built-in gdbserver didn't work because it also locked. Is there a way to tunnel one VM's monitor console to another VM? Thanks Thomas

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <pthread.h>

void signal_handler(int signum){
	pthread_exit(NULL);
}

void *readFile(void *ptr){
	signal(SIGTERM, signal_handler);
	int fd;
	char buffer;
	fd = open("/dev/ttyS1", O_RDONLY);
	while(1){
		read(fd, &buffer, 1);
		printf("%c", buffer);
		fflush(stdout);
	}
	close(fd);
	pthread_exit(NULL);
}

int main(int argc, char** argv){
	pthread_t thread;
	pthread_create(&thread, NULL, readFile, NULL);
	sleep(10);
	pthread_kill(thread, SIGTERM);
	pthread_join(thread, NULL);
}
Shouldn't cache=none be the default for drives?
Hello, I'm conducting some performance tests with KVM-virtualized CentOSes. One thing I noticed is that guest I/O performance seems to be significantly better for virtio-based block devices (drives) if the cache=none argument is used. (This was with a rather powerful storage system backend which is hard to saturate.) So: why isn't cache=none the default for drives? -- Troels
Re: KVM freeze when using --serial
Hi again, I just tried to use unix domain sockets. So I used the parameter --monitor unix:monitor:server:nowait on the first VM and the parameter --serial unix:monitor on the second VM. And again the second VM freezes when running my test application. cya Tom Thomas Kittel wrote: Hi there, I already posted this problem to #kvm on freenode. Please set me in CC: when replying to this mail, as I am not subscribed to this mailing list right now. The scenario is as follows: I have 2 VM processes in userspace. The first is started with the parameter --monitor pty. => This results in a file /dev/pts/x on the host (crw--w 1 kittel tty 136, 3 2010-04-07 15:51 /dev/pts/3 on my system). Another VM is then started with the parameter --serial /dev/pts/3 => This results in /dev/ttyS0 inside the second VM. Both VMs are running debian lenny. The host (debian) uses qemu-kvm 0.12.3. startvms.sh start is used to start the VMs. Running the executable built from test.c in the second VM results in a freeze of this VM. (The test.c included uses /dev/ttyS1, as /dev/ttyS0 is the VM's serial console in my setup.) The process uses 100% CPU and is stuck in kvm_mutex_lock(). Trying to use the built-in gdbserver didn't work because it also locked. Is there a way to tunnel one VM's monitor console to another VM? Thanks Thomas
Re: Shouldn't cache=none be the default for drives?
Troels Arvin wrote: Hello, I'm conducting some performance tests with KVM-virtualized CentOSes. One thing I noticed is that guest I/O performance seems to be significantly better for virtio-based block devices (drives) if the cache=none argument is used. (This was with a rather powerful storage system backend which is hard to saturate.) So: why isn't cache=none the default for drives? Is that the right question? Or is the right question "Why is cache=none faster?" What did you use for measuring the performance? I have found in the past that the virtio block device was slower than IDE block device emulation. Gordan
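One plausible explanation — my assumption, not something established in this thread — is that cache=none opens the backing file with O_DIRECT, so guest I/O bypasses the host page cache entirely. That avoids double caching (the guest already caches in its own page cache) and the extra data copy, while the then-default writethrough mode routes every request through the host cache. The mode is selected per drive on the command line, e.g.:

```
qemu-system-x86_64 -drive file=guest.img,if=virtio,cache=none
```

Whether this wins depends on the workload: with a large host cache and a re-read-heavy guest, writethrough caching can still come out ahead.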
Re: Question on skip_emulated_instructions()
On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote: 2010/4/6 Gleb Natapov g...@redhat.com: On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote: Hi. When handle_io() is called, rip is currently advanced *before* the I/O is actually handled by qemu in userland. Upon implementing Kemari for KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly in userland qemu, we encountered a problem: synchronizing the contents of the VCPU before handling I/O in qemu is too late, because rip has already been advanced in KVM. Although we avoided this issue with a temporary hack, I would like to ask a few questions on skip_emulated_instructions. 1. Does rip need to be advanced before the I/O is handled by qemu? In current kvm.git rip is advanced before I/O is handled by qemu only in the case of the out instruction. From an architectural point of view I think that's OK, since on real HW you can't guarantee that I/O will take effect before the instruction pointer is advanced. It is done like that because we want out emulation to be really fast, so we skip the x86 emulator. Thanks for your reply. If advancing rip later doesn't break the behavior of devices or introduce a slowdown, I would like that to be done. Devices could not care less about what value the rip register currently has. Why does it matter for your code? 2. If not, is it possible to divide skip_emulated_instructions() into something like rec_emulated_instructions(), to remember the next rip, and skip_emulated_instructions(), to actually advance the rip? Currently only the emulator can call userspace to do I/O, so when userspace returns after an I/O exit, control is handed back to the emulator unconditionally. The out instruction skips the emulator, but there is nothing to do after userspace returns, so the regular cpu loop is executed. If we want to advance rip only after userspace has executed the I/O done by out, we need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out()) and call different code depending on who that was.
It can be done by having a callback that (if not null) is called on return from userspace. Your suggestion is to introduce a callback entry, and instead of calling kvm_rip_write(), set it to the entry before calling kvm_fast_pio_out(), and check the entry upon return from userspace, correct? Something like that, yes. According to the comment in x86.c, when it was an out instruction vcpu->arch.pio.count is set to 0 to skip the emulator. To call kvm_fast_pio_out(), !string and !in must be set. If we can check vcpu->arch.pio.count, string and in on return from userspace, can't we distinguish who requested the I/O, the emulator or kvm_fast_pio_out()? Maybe, but the callback approach is much cleaner. string and in can have stale data, for instance. 3. svm has next_rip, but when it is 0 a nop is emulated. Can this be modified to continue without emulating a nop when next_rip is 0? I don't see where a nop is emulated if next_rip is 0. As far as I can see, in the case of next_rip == 0 the instruction at rip is decoded to figure out its length and then rip is advanced by the instruction length. Anyway, next_rip is an svm thing only. Sorry, I wasn't understanding the code enough.

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
...
	if (!svm->next_rip) {
		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
				EMULATE_DONE)
			printk(KERN_DEBUG "%s: NOP\n", __func__);
		return;
	}

Since the printk says NOP, I thought emulate_instruction was doing so... The reason I asked about next_rip is because I was hoping to use this entry to advance rip only after userspace executed the I/O done by out: if next_rip is !0, call kvm_rip_write(); and introduce next_rip to vmx if it is usable, because vmx is currently using a local variable rip. Yoshi -- Gleb.
Re: [PATCH 1/2] KVM MMU: remove unused field
On Tue, Apr 06, 2010 at 06:29:05PM +0800, Xiao Guangrong wrote: kvm_mmu_page.oos_link is not used, so remove it Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |2 -- arch/x86/kvm/mmu.c |1 - 2 files changed, 0 insertions(+), 3 deletions(-) Applied both, thanks.
Re: [PATCH] [PPC] Add dequeue for external on BookE
On Wed, Apr 07, 2010 at 10:03:25AM +0200, Alexander Graf wrote: Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the IRQ line from userspace. This added a new core specific callback that I apparently forgot to add for BookE. So let's add the callback for BookE as well, making it build again. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/booke.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) Applied, thanks.
Re: [PATCH] [PPC] Add dequeue for external on BookE
On Wed, 7 Apr 2010 12:58:34 -0300 Marcelo Tosatti mtosa...@redhat.com wrote: On Wed, Apr 07, 2010 at 10:03:25AM +0200, Alexander Graf wrote: Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the IRQ line from userspace. This added a new core specific callback that I apparently forgot to add for BookE. So let's add the callback for BookE as well, making it build again. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/booke.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) Applied, thanks. Thanks, guys. -- Cheers, Stephen Rothwell s...@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
Re: [GSoC 2010][RESEND] Shared memory transport between guest(s) and host
On Wed, Apr 7, 2010 at 5:30 AM, Mohammed Gamal m.gamal...@gmail.com wrote: Hi, I am interested in the Shared memory transport between guest(s) and host project for GSoC 2010. The description of the project is pretty straightforward, but I am a little bit lost on some parts: 1- Is there any documentation available on KVM shared memory transport? This'd definitely help understand how inter-vm shared memory should work. Hi Mohammed, A shared memory transport would be a new addition to the code base, so there isn't anything yet in KVM. That said, I'm working on my patch and it will hopefully be accepted soon. Frankly, while I suggested this project, I'm not sure there's enough work remaining for the full Summer. 2- Does the project only aim at providing a shared memory transport between a single host and a number of guests, with the host acting as a central node containing shared memory objects and communication taking place only between guests and host, or is there any kind of guest-guest communication to be supported? If yes, how should it be done? My patch currently supports guest-to-guest communication and guest-to-host. I'll be sending out a new version shortly. You can see if there's something you might like to add to it, whether it's part of GSoC or not. Cheers, Cam
Re: Question on skip_emulated_instructions()
2010/4/8 Gleb Natapov g...@redhat.com: On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote: 2010/4/6 Gleb Natapov g...@redhat.com: On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote: Hi. When handle_io() is called, rip is currently advanced *before* the I/O is actually handled by qemu in userland. Upon implementing Kemari for KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly in userland qemu, we encountered a problem: synchronizing the contents of the VCPU before handling I/O in qemu is too late, because rip has already been advanced in KVM. Although we avoided this issue with a temporary hack, I would like to ask a few questions on skip_emulated_instructions. 1. Does rip need to be advanced before the I/O is handled by qemu? In current kvm.git rip is advanced before I/O is handled by qemu only in the case of the out instruction. From an architectural point of view I think that's OK, since on real HW you can't guarantee that I/O will take effect before the instruction pointer is advanced. It is done like that because we want out emulation to be really fast, so we skip the x86 emulator. Thanks for your reply. If advancing rip later doesn't break the behavior of devices or introduce a slowdown, I would like that to be done. Devices could not care less about what value the rip register currently has. Why does it matter for your code? My code, Kemari, is a mechanism to synchronize VMs to achieve fault tolerance. It transfers the whole VM state upon events such as disk or network output, so that the secondary server can keep running upon hardware failure. Please think of it as continuous live migration. I've implemented this feature in userland qemu, which calls the live migration function when it detects any output from the device emulators. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The problem here is that I needed to transfer the VM state from just *before* the output to the devices.
Otherwise, the VM state has already moved on, and after failover some I/O didn't work as I expected. I tracked down this issue and figured out that rip had already been advanced in KVM, so transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work on it and post a patch. 2. If not, is it possible to divide skip_emulated_instructions() into something like rec_emulated_instructions(), to remember the next rip, and skip_emulated_instructions(), to actually advance the rip? Currently only the emulator can call userspace to do I/O, so when userspace returns after an I/O exit, control is handed back to the emulator unconditionally. The out instruction skips the emulator, but there is nothing to do after userspace returns, so the regular cpu loop is executed. If we want to advance rip only after userspace has executed the I/O done by out, we need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out()) and call different code depending on who that was. It can be done by having a callback that (if not null) is called on return from userspace. Your suggestion is to introduce a callback entry, and instead of calling kvm_rip_write(), set it to the entry before calling kvm_fast_pio_out(), and check the entry upon return from userspace, correct? Something like that, yes. OK. Let me work on that. According to the comment in x86.c, when it was an out instruction vcpu->arch.pio.count is set to 0 to skip the emulator. To call kvm_fast_pio_out(), !string and !in must be set. If we can check vcpu->arch.pio.count, string and in on return from userspace, can't we distinguish who requested the I/O, the emulator or kvm_fast_pio_out()? Maybe, but the callback approach is much cleaner. string and in can have stale data, for instance. I see. I was thinking that could be a trade-off against introducing a new variable. I'll take the callback approach first, and think again later if necessary. 3.
svm has next_rip, but when it is 0 a nop is emulated. Can this be modified to continue without emulating a nop when next_rip is 0? I don't see where a nop is emulated if next_rip is 0. As far as I can see, in the case of next_rip == 0 the instruction at rip is decoded to figure out its length and then rip is advanced by the instruction length. Anyway, next_rip is an svm thing only. Sorry, I wasn't understanding the code enough.

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
...
	if (!svm->next_rip) {
		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
				EMULATE_DONE)
			printk(KERN_DEBUG "%s: NOP\n", __func__);
		return;
	}

Since the printk says NOP, I thought emulate_instruction was doing so... The reason I asked about next_rip is because I was hoping to use this entry to advance rip only after userspace executed the I/O done by out, like
Re: Question on skip_emulated_instructions()
On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote: The problem here is that I needed to transfer the VM state from just *before* the output to the devices. Otherwise, the VM state has already moved on, and after failover some I/O didn't work as I expected. I tracked down this issue and figured out that rip had already been advanced in KVM, so transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work on it and post a patch. vcpu state is undefined when an mmio operation is pending; Documentation/kvm/api.txt says the following: NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations. Currently we complete instructions for output operations and leave them incomplete for input operations. Deferring completion for output operations should work, except it may break the vmware backdoor port (see hw/vmport.c), which changes register state following an output instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state following a write instruction. Do you really need to transfer the vcpu state before the instruction, or do you just need a consistent state? If the latter, then you can get away with posting a signal and re-entering the guest. kvm will complete the instruction and exit immediately, and you will have fully consistent state. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
Thanks! There's some whitespace damage, are you sending with your new sendmail setup? It seems to have worked for qemu patches ... Yes, I saw some line wraps in what I received, but I checked the original draft to be sure and they weren't there. Possibly from the relay; sigh.

@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
		/* TODO: Check specific error and bomb out unless ENOBUFS? */
		err = sock->ops->sendmsg(NULL, sock, &msg, len);
		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {
+				vhost_discard_desc(vq, 1);
+				tx_poll_start(net, sock);
+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head,
+							  0);

vhost does not currently have a consistent error handling strategy: if we drop packets, we need to think about which other errors should cause packet drops. I prefer to just call vq_err for now, and have us look at handling segfaults etc. in a consistent way separately. I had to add this to avoid an infinite loop when I wrote a bad packet on the socket. I agree error handling needs a better look, but retrying a bad packet continuously while dumping in the log is what it was doing when I hit an error before this code. Isn't this better, at least until a second look?

+			}
+

I wonder whether it makes sense to check skb_queue_empty(&sk->sk_receive_queue) outside the lock, to reduce the cost of this call on an empty queue (we know that it happens at least once each time we exit the loop on rx)? I was looking at alternatives to adding the lock in the first place, but I found I couldn't measure a difference in the cost with and without the lock.
+	int retries = 0;
	size_t len, total_len = 0;
-	int err;
+	int err, headcount, datalen;
	size_t hdr_size;
	struct socket *sock = rcu_dereference(vq->private_data);
	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
		vq->log : NULL;
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+					     vq_log, &log);

This looks like a bug, I think we need to pass datalen + header size to vhost_get_desc_n. Not sure how we know the header size that the backend will use though. Maybe just look at our features. Yes; we have hdr_size, so I can add it here. It'll be 0 for the cases where the backend and guest both have a vnet header (either the regular or larger mergeable-buffers one), but should be added in for future raw socket support.

		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(vq))) {
+		if (!headcount) {
+			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
				/* They have slipped one in as we were
				 * doing that: check again. */
				vhost_disable_notify(vq);
+				retries++;
				continue;
			}

Hmm. The reason we have the code at all, as the comment says, is because the guest could have added more buffers between the time we read the last index and the time we enabled notification. So if we just break like this the race still exists. We could remember the last head value we observed, and have vhost_enable_notify check against this value? This is to prevent a spin loop in the case where we have some buffers available, but not enough for the current packet (i.e., this is the replacement code for the rxmaxheadcount business).
If they actually added something new, retrying once should see it, but what vhost_enable_notify() returns non-zero on is not "new buffers" but rather "not empty". In the case mentioned, we aren't empty, so vhost_enable_notify() returns non-zero every time, but the guest hasn't given us enough buffers to proceed, so we retry continuously; this code breaks the spin loop until we've really got new buffers from the guest. Need to think about it. Another concern here is that on retries vhost_get_desc_n is doing extra work, rescanning the same descriptor again and again. Not sure how common this is; might be worthwhile to add a TODO to consider this at least. I had a printk in there to test the code and, with the retries counter, it happens when we fill the ring (once, because of the retries checks), and then proceeds as desired when the guest gives us more buffers. Without the check, it spews until we
[GIT PULL] vhost-net fix for 2.6.34-rc3
David, The following tree includes a patch fixing an issue with vhost-net in 2.6.34-rc3. Please pull for 2.6.34. Thanks! The following changes since commit 2eaa9cfdf33b8d7fb7aff27792192e0019ae8fc6: Linux 2.6.34-rc3 (2010-03-30 09:24:39 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost Jeff Dike (1): vhost-net: fix vq_memory_access_ok error checking drivers/vhost/vhost.c |4 1 files changed, 4 insertions(+), 0 deletions(-)
Re: Some Code for Performance Profiling
2010/4/5 Avi Kivity a...@redhat.com: On 03/31/2010 07:53 PM, Jiaqing Du wrote: Hi, We have some code for performance profiling in KVM. It is the output of a school project. Previous discussions in the KVM, Perfmon2, and Xen mailing lists helped us a lot. The code is NOT in good shape and is only meant to demonstrate the feasibility of doing performance profiling in KVM. Feel free to use it if you want. Performance monitoring is an important feature for kvm. Is there any chance you can work at getting it into good shape? I have been following the discussions about PMU virtualization in the list for a while. Exporting a proper interface, i.e., guest-visible MSRs and supported events, to the guest across a large number of physical CPUs from different vendors, families, and models is the major problem. For KVM, currently it also supports almost a dozen different types of virtual CPUs. I will think about it and try to come up with something more general. We categorize performance profiling in a virtualized environment into two types: *guest-wide profiling* and *system-wide profiling*. For guest-wide profiling, only the guest is profiled. KVM virtualizes the PMU and the user runs a profiler directly in the guest. It requires no modifications to the guest OS or the profiler running in the guest. For system-wide profiling, both KVM and the guest OS are profiled. The results are similar to what XenOprof outputs. In this case, one profiler runs in the host and one profiler runs in the guest. Still it requires no modifications to the guest or the profiler running in it. Can your implementation support both simultaneously? What do you mean by simultaneously? With my implementation, you either do guest-wide profiling or system-wide profiling. They are achieved through different patches. Actually, the result of guest-wide profiling is a subset of system-wide profiling. For guest-wide profiling, there are two possible places to save and restore the related MSRs.
One is where the CPU switches between guest mode and host mode. We call this *CPU-switch*. Profiling with this enabled reflects how the guest behaves on the physical CPU, plus other virtualized, not emulated, devices. The other place is where the CPU switches between the KVM context and others. Here KVM context means the CPU is executing guest code or KVM code, both kernel space and user space. We call this *domain-switch*. Profiling with this enabled discloses how the guest behaves on both the physical CPU and KVM. (Some emulated operations are really expensive in a virtualized environment.) Which method do you use? Or do you support both? I posted two patches in my previous email. One is for CPU-switch, and the other is for domain-switch. Note disclosing host pmu data to the guest is sometimes a security issue. For instance? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: Some Code for Performance Profiling
On 04/07/2010 10:23 PM, Jiaqing Du wrote: Can your implementation support both simultaneously? What do you mean by simultaneously? With my implementation, you either do guest-wide profiling or system-wide profiling. They are achieved through different patches. Actually, the result of guest-wide profiling is a subset of system-wide profiling. A guest admin monitors the performance of their guest via a vpmu. Meanwhile the host admin monitors the performance of the host (including all guests) using the host pmu. Given that the host pmu and the vpmu may select different counters, it is difficult to support both simultaneously. For guest-wide profiling, there are two possible places to save and restore the related MSRs. One is where the CPU switches between guest mode and host mode. We call this *CPU-switch*. Profiling with this enabled reflects how the guest behaves on the physical CPU, plus other virtualized, not emulated, devices. The other place is where the CPU switches between the KVM context and others. Here KVM context means the CPU is executing guest code or KVM code, both kernel space and user space. We call this *domain-switch*. Profiling with this enabled discloses how the guest behaves on both the physical CPU and KVM. (Some emulated operations are really expensive in a virtualized environment.) Which method do you use? Or do you support both? I posted two patches in my previous email. One is for CPU-switch, and the other is for domain-switch. I see. I'm not sure I know which one is better! Note disclosing host pmu data to the guest is sometimes a security issue. For instance? The standard example is hyperthreading, where the memory bus unit is shared among two logical processors. A guest sampling a vcpu on one thread can gain information about what is happening on the other - the number of bus transactions the other thread has issued.
This can be used to establish a communication channel between two guests that shouldn't be communicating, or to eavesdrop on another guest. A similar problem happens with multicores. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
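To make the two switch points concrete, here is a toy accounting model (illustrative Python, not KVM code): with CPU-switch the guest's counters only run while the CPU is in guest mode, while with domain-switch they also accumulate across KVM kernel/user code, so expensive emulated operations show up in the profile.

```python
# Toy model (not KVM code): what guest-wide profiling counts under the two
# MSR save/restore points discussed above. Events are attributed to the
# guest only while its PMU MSRs are "loaded".

def profile(events, switch_point):
    """events: list of (context, cost) samples.
    switch_point='cpu': guest PMU MSRs loaded only in guest mode.
    switch_point='domain': MSRs stay loaded across KVM kernel/user code too."""
    guest_contexts = {'guest'} if switch_point == 'cpu' else {'guest', 'kvm'}
    return sum(cost for ctx, cost in events if ctx in guest_contexts)

# A guest that exits to KVM for an (expensive) emulated device access:
trace = [('guest', 100), ('kvm', 40), ('guest', 60), ('other-host', 30)]
```

With this trace, CPU-switch profiling attributes 160 units to the guest, while domain-switch profiling attributes 200, reflecting the emulation cost paid inside KVM.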
VMX and save/restore guest in virtual-8086 mode
During initialization, WinXP.32 switches to virtual-8086 mode, with paging enabled, to use VGABIOS functions. Since enter_pmode unconditionally clears the IOPL and VM bits in RFLAGS:

    flags = vmcs_readl(GUEST_RFLAGS);
    flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
    flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
    vmcs_writel(GUEST_RFLAGS, flags);

And the order of loading state is set_regs (rflags) followed by set_sregs (cr0), these bits are lost across save/restore:

savevm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
system_reset
loadvm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
cont
kvm: unhandled exit 8021
kvm_run returned -22

The following patch fixes it, but it has some drawbacks:
- cpu_synchronize_state+writeback is noticeably slow with tpr patching, this makes it slower.
- Should be conditional on VMX && !unrestricted guest.
- It's a fugly workaround.

Any better ideas?

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..9821653 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -956,6 +956,7 @@ void kvm_arch_load_regs(CPUState *env, int level)
     sregs.efer = env->efer;
     kvm_set_sregs(env, sregs);
+    kvm_set_regs(env, regs);
     /* msrs */
     n = 0;
Re: Setting nx bit in virtual CPU
On 07/04/10 13:23, Avi Kivity wrote: On 04/07/2010 03:10 PM, Richard Simpson wrote: On 07/04/10 06:39, Avi Kivity wrote: On 04/07/2010 01:31 AM, Richard Simpson wrote: 2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so? The attached script should verify it. IOError: [Errno 2] No such file or directory: '/dev/msr0' Run as root, please. And check first that you have a file named /dev/cpu/0/msr. Doh!

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine. Any other ideas? I am beginning to get that horrible feeling that there isn't a real problem and it is just me being dumb!
Re: VMX and save/restore guest in virtual-8086 mode
On 04/07/2010 11:24 PM, Marcelo Tosatti wrote: During initialization, WinXP.32 switches to virtual-8086 mode, with paging enabled, to use VGABIOS functions. Since enter_pmode unconditionally clears the IOPL and VM bits in RFLAGS:

    flags = vmcs_readl(GUEST_RFLAGS);
    flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
    flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
    vmcs_writel(GUEST_RFLAGS, flags);

Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)? I think we have a small related bug in realmode emulation - we run the guest with iopl=3. This means the guest can use pushfl and see the host iopl instead of the guest iopl. We should run with iopl=0, which causes pushfl/popfl to #GP, where we can emulate the flags correctly (by updating rmode.save_iopl and rmode.save_vm). That has lots of implications however... And the order of loading state is set_regs (rflags) followed by set_sregs (cr0), these bits are lost across save/restore:

savevm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
system_reset
loadvm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
cont
kvm: unhandled exit 8021
kvm_run returned -22

The following patch fixes it, but it has some drawbacks: - cpu_synchronize_state+writeback is noticeably slow with tpr patching, this makes it slower. Isn't it a very rare event? - Should be conditional on VMX && !unrestricted guest. Userspace should know nothing of this mess. - It's a fugly workaround. True.
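The masking under discussion can be sketched as follows (illustrative Python, not the actual KVM code; the constants are the standard x86 RFLAGS bit positions). Note that masking the eflags value from the savevm dump, 0x33286, yields exactly the 0x10286 seen after loadvm, which is how the VM and IOPL bits get lost.

```python
# Sketch (not KVM code): how enter_pmode-style masking drops IOPL/VM, and
# how stashing them (like rmode.save_iopl and a hypothetical save_vm) would
# let them be restored. Bit positions are from the x86 architecture:
X86_EFLAGS_VM = 1 << 17      # virtual-8086 mode flag
X86_EFLAGS_IOPL = 0x3000     # I/O privilege level field (bits 12-13)
IOPL_SHIFT = 12

def enter_pmode(rflags, saved):
    """Clear IOPL/VM for the VMCS, stashing them for later restore."""
    saved['iopl'] = (rflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT
    saved['vm'] = bool(rflags & X86_EFLAGS_VM)
    return rflags & ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM)

def leave_pmode(rflags, saved):
    """Restore the stashed bits on the way back to virtual-8086 mode."""
    rflags |= saved['iopl'] << IOPL_SHIFT
    if saved['vm']:
        rflags |= X86_EFLAGS_VM
    return rflags
```

Running enter_pmode on 0x33286 (IOPL=3, VM=1) gives 0x10286, matching the dump; leave_pmode recovers the original value, which is what gets lost when set_regs runs before set_sregs.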
Re: Setting nx bit in virtual CPU
On 04/07/2010 11:38 PM, Richard Simpson wrote: On 07/04/10 13:23, Avi Kivity wrote: On 04/07/2010 03:10 PM, Richard Simpson wrote: On 07/04/10 06:39, Avi Kivity wrote: On 04/07/2010 01:31 AM, Richard Simpson wrote: 2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so? The attached script should verify it. IOError: [Errno 2] No such file or directory: '/dev/msr0' Run as root, please. And check first that you have a file named /dev/cpu/0/msr. Doh!

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine. Any other ideas? I am beginning to get that horrible feeling that there isn't a real problem and it is just me being dumb! I really hope so, because I am out of ideas... :) Can you verify check-nx returns disabled on the guest? Does /proc/cpuinfo show nx in the guest?
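For reference, the essence of the check-nx script discussed in this thread can be sketched like this (a hedged reconstruction - the actual script is an attachment to the original mail and may differ): NXE is bit 11 of IA32_EFER (MSR 0xC0000080), readable through the msr module as root.

```python
# Sketch of an EFER.NXE check (assumed behaviour of the check-nx script,
# not the script itself). Reading /dev/cpu/N/msr needs root and the msr
# kernel module loaded.
import struct

IA32_EFER = 0xC0000080
EFER_NXE_BIT = 11  # No-Execute Enable

def efer_nx_enabled(efer_value):
    """Pure bit test, usable on any EFER value."""
    return bool((efer_value >> EFER_NXE_BIT) & 1)

def read_msr(msr, cpu=0):
    """Read one 64-bit MSR from the msr device node (root only)."""
    with open('/dev/cpu/%d/msr' % cpu, 'rb') as f:
        f.seek(msr)
        return struct.unpack('<Q', f.read(8))[0]

# On real hardware (as root):
# print('nx:', 'enabled' if efer_nx_enabled(read_msr(IA32_EFER)) else 'disabled')
```

This matches the symptom in the thread: the host EFER has NXE set, but inside the guest the bit (and the nx cpuid flag) is absent.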
Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM: On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote: Thanks! There's some whitespace damage, are you sending with your new sendmail setup? It seems to have worked for qemu patches... Yes, I saw some line wraps in what I received, but I checked the original draft to be sure and they weren't there. Possibly from the relay; sigh.

@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 /* TODO: Check specific error and bomb out unless ENOBUFS? */
 err = sock->ops->sendmsg(NULL, sock, &msg, len);
 if (unlikely(err < 0)) {
-    vhost_discard_vq_desc(vq);
-    tx_poll_start(net, sock);
+    if (err == -EAGAIN) {
+        vhost_discard_desc(vq, 1);
+        tx_poll_start(net, sock);
+    } else {
+        vq_err(vq, "sendmsg: errno %d\n", -err);
+        /* drop packet; do not discard/resend */
+        vhost_add_used_and_signal(net->dev, vq, head,
+                                  0);

vhost does not currently have a consistent error handling strategy: if we drop packets, we need to think about which other errors should cause packet drops. I prefer to just call vq_err for now, and have us look at handling segfaults etc. in a consistent way separately. I had to add this to avoid an infinite loop when I wrote a bad packet on the socket. I agree error handling needs a better look, but retrying a bad packet continuously while dumping in the log is what it was doing when I hit an error before this code. Isn't this better, at least until a second look? Hmm, what do you mean 'continuously'? Don't we only try again on the next kick? If the packet is corrupt (in my case, a missing vnet header during testing), every send will fail and we never make progress. I had thousands of error messages in the log (for the same packet) before I added this code. If the problem is with the packet, retrying the same one as the original code does will never recover.
This isn't required for mergeable rx buffer support, so I can certainly remove it from this patch, but I think the original error handling doesn't handle a single corrupted packet very gracefully.

@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 vq_log = unlikely(vhost_has_feature(net->dev, VHOST_F_LOG_ALL)) ?
     vq->log : NULL;
-    for (;;) {
-        head = vhost_get_vq_desc(net->dev, vq, vq->iov,
-                                 ARRAY_SIZE(vq->iov),
-                                 &out, &in,
-                                 vq_log, &log);
+    while ((datalen = vhost_head_len(sock->sk))) {
+        headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+                                     vq_log, &log);

This looks like a bug, I think we need to pass datalen + header size to vhost_get_desc_n. Not sure how we know the header size that the backend will use, though. Maybe just look at our features. Yes; we have hdr_size, so I can add it here. It'll be 0 for the cases where the backend and guest both have a vnet header (either the regular or larger mergeable buffers one), but should be added in for future raw socket support. So hdr_size is the wrong thing to add then. We need to add a non-zero value for tap now. datalen includes the vnet_hdr in the tap case, so we don't need a non-zero hdr_size. The socket data has the entire packet and vnet_hdr, and that length is what we're getting from vhost_head_len().

 /* OK, now we need to know about added descriptors. */
-    if (head == vq->num) {
-        if (unlikely(vhost_enable_notify(vq))) {
+    if (!headcount) {
+        if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
             /* They have slipped one in as we were
              * doing that: check again. */
             vhost_disable_notify(vq);
+            retries++;
             continue;
         }

Hmm. The reason we have the code at all, as the comment says, is because the guest could have added more buffers between the time we read the last index and the time we enabled notification. So if we just break like this the race still exists. We could remember the last head value we observed, and have vhost_enable_notify check against this value?
This is to prevent a spin loop in the case where we have some buffers available, but not enough for the current packet (ie, this is the replacement code for the rxmaxheadcount business). If they actually added something new, retrying once should see it, but what vhost_enable_notify() returns non-zero on is not new buffers but rather not empty. In the case mentioned, we aren't empty, so vhost_enable_notify() returns nonzero every time, but the guest hasn't given us enough buffers to proceed, so we continuously retry; this code breaks the
Re: [PATCH] KVM: VMX: Disable unrestricted guest when EPT disabled
On Thu, Mar 18, 2010 at 02:11:19PM +0800, Sheng Yang wrote: Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest supported processors. Signed-off-by: Sheng Yang sh...@linux.intel.com Now included through a different submission. thanks, greg k-h
Using serial with Windows XP - need help
Using: kvm-0.12.3, libvirt-0.7.7, kernel-2.6.31. I am trying to persuade my Win XP guest to use the serial port which I have configured using virt-manager, but whatever I try no COMs appear in the guest. I have tried with socket, tcp, pipe, etc. with just the same result - no COM within the guest. I have obviously missed something. I need help to set up that COM so it appears and works within the guest machine. The guest's XML part for serial looks like:

<serial type='tcp'>
  <source mode='connect' host='127.0.0.1' service='4555'/>
  <protocol type='raw'/>
  <target port='0'/>
</serial>
<console type='tcp'>
  <source mode='connect' host='127.0.0.1' service='4555'/>
  <protocol type='raw'/>
  <target port='0'/>
</console>

Here is the latest command line to run my guest:

/usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name windowsxp -uuid 1f012431-f345-48f4-9ae5-b10787e24de7 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/windowsxp.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=localtime -boot c -drive file=/var/lib/kvm/images/windowsxp/disk0,if=none,id=drive-ide0-0-0,boot=on -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive if=none,media=cdrom,id=drive-ide0-1-0 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/lib/kvm/images/windowsxp/disk1,if=none,id=drive-ide0-0-1 -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -device rtl8139,vlan=0,id=net0,mac=52:54:00:6b:9d:6e,bus=pci.0,addr=0x4 -net tap,fd=20,vlan=0,name=hostnet0 -device rtl8139,vlan=1,id=net1,mac=52:54:00:3e:2a:f8,bus=pci.0,addr=0x5 -net tap,fd=21,vlan=1,name=hostnet1 -device rtl8139,vlan=2,id=net2,mac=52:54:00:ce:35:fe,bus=pci.0,addr=0x6 -net tap,fd=22,vlan=2,name=hostnet2 -chardev socket,id=serial0,host=127.0.0.1,port=4555,server,nowait -device isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3

Iztok
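One thing worth checking with a mode='connect' serial configuration: qemu expects something to already be listening on 127.0.0.1:4555 when the guest starts. A minimal raw-TCP endpoint for testing could look like this (hypothetical helper, not part of qemu or libvirt):

```python
# Minimal raw-TCP endpoint for a <source mode='connect' .../> serial config:
# qemu connects to host:service at guest startup and streams the guest's
# COM port bytes over the connection. (Illustrative helper only.)
import socket

def serve_serial(host='127.0.0.1', port=4555, handle=lambda data: None):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, _ = srv.accept()          # qemu connects here when the guest starts
    try:
        while True:
            data = conn.recv(4096)  # raw bytes from the guest's serial port
            if not data:
                break
            handle(data)
    finally:
        conn.close()
        srv.close()
```

If nothing is listening, the chardev cannot connect and the COM port will never carry data; alternatively, configure the chardev in server mode (as the qemu command line above does with server,nowait) and connect to it from the host.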
[PATCH v4 0/3] PCI Shared memory device
Latest patch for PCI shared memory device that maps a host shared memory object to be shared between guests. New in this series:

- moved to single Doorbell register and use datamatch to trigger different VMs rather than one register per eventfd
- remove writing arbitrary values to eventfds. Only values of 1 are now written to ensure correct usage

Cam Macdonell (3):
  Device specification for shared memory PCI device
  Support adding a file to qemu's ram allocation
  Inter-VM shared memory PCI device

 Makefile.target                    |    3 +
 cpu-common.h                       |    1 +
 docs/specs/ivshmem_device_spec.txt |   85 +
 exec.c                             |   33 ++
 hw/ivshmem.c                       |  700
 qemu-char.c                        |    6 +
 qemu-char.h                        |    3 +
 7 files changed, 831 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/ivshmem_device_spec.txt
 create mode 100644 hw/ivshmem.c
[PATCH v4 1/3] Device specification for shared memory PCI device
--- docs/specs/ivshmem_device_spec.txt | 85 1 files changed, 85 insertions(+), 0 deletions(-) create mode 100644 docs/specs/ivshmem_device_spec.txt diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt new file mode 100644 index 000..9895782 --- /dev/null +++ b/docs/specs/ivshmem_device_spec.txt @@ -0,0 +1,85 @@ + +Device Specification for Inter-VM shared memory device +-- + +The Inter-VM shared memory device is designed to share a region of memory to +userspace in multiple virtual guests. The memory region does not belong to any +guest, but is a POSIX memory object on the host. Optionally, the device may +support sending interrupts to other guests sharing the same memory region. + +The Inter-VM PCI device +--- + +BARs + +The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support +registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is +used to map the shared memory object from the host. The size of BAR2 is +specified when the guest is started and must be a power of 2 in size. + +Registers + +The device currently supports 4 registers of 32-bits each. Registers +are used for synchronization between guests sharing the same memory object when +interrupts are supported (this requires using the shared memory server). + +The server assigns each VM an ID number and sends this ID number to the Qemu +process when the guest starts. + +enum ivshmem_registers { +IntrMask = 0, +IntrStatus = 4, +IVPosition = 8, +Doorbell = 12 +}; + +The first two registers are the interrupt mask and status registers. Mask and +status are only used with pin-based interrupts. They are unused with MSI +interrupts. The IVPosition register is read-only and reports the guest's ID +number. To interrupt another guest, a guest must write to the Doorbell +register. The doorbell register is 32-bits, logically divided into two 16-bit +fields. 
The high 16-bits are the guest ID to interrupt and the low 16-bits are +the interrupt vector to trigger. + +The semantics of the value written to the doorbell depends on whether the +device is using MSI or a regular pin-based interrupt. In short, MSI uses +vectors and regular interrupts set the status register. + +Regular Interrupts +-- + +If regular interrupts are used (due to either a guest not supporting MSI or the +user specifying not to use them on startup) then the value written to the lower +16-bits of the Doorbell register results is arbitrary and will trigger an +interrupt in the destination guest. + +An interrupt is also generated when a new guest accesses the shared memory +region. A status of (2^32 - 1) indicates that a new guest has joined. + +Message Signalled Interrupts + + +A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits +written to the Doorbell register must be between 1 and the maximum number of +vectors the guest supports. The lower 16 bits written to the doorbell is the +MSI vector that will be raised in the destination guest. The number of MSI +vectors can vary but it is set when the VM is started, however vector 0 is +used to notify that a new guest has joined. Guests should not use vector 0 for +any other purpose. + +The important thing to remember with MSI is that it is only a signal, no status +is set (since MSI interrupts are not shared). All information other than the +interrupt itself should be communicated via the shared memory region. Devices +supporting multiple MSI vectors can use different vectors to indicate different +events have occurred. The semantics of interrupt vectors are left to the +user's discretion. + +Usage in the Guest +-- + +The shared memory device is intended to be used with the provided UIO driver. +Very little configuration is needed. 
The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself. The size of the shared memory region
+is specified when the guest (or shared memory server) is started. A guest may
+map the whole shared memory region or only part of it.
--
1.6.0.6
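The Doorbell encoding described in the spec (high 16 bits = destination guest ID, low 16 bits = interrupt vector) can be sketched as a pair of helpers (illustrative Python; the helper names are made up, not part of the spec):

```python
# Sketch of the ivshmem Doorbell register encoding from the spec above:
# one 32-bit write, high half selects the guest, low half the vector.

def doorbell_encode(dest_id, vector):
    """Compose a 32-bit doorbell value from guest ID and vector."""
    assert 0 <= dest_id < (1 << 16) and 0 <= vector < (1 << 16)
    return (dest_id << 16) | vector

def doorbell_decode(value):
    """Split a doorbell write back into (guest ID, vector)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

So a guest wanting to raise MSI vector 1 in guest 3 writes 0x00030001 to the Doorbell register; with pin-based interrupts the low half is arbitrary, per the spec.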
[PATCH v4 2/3] Support adding a file to qemu's ram allocation
This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to map a host file into guest RAM. This function mmaps the opened file anywhere and adds the memory to the ram blocks. Usage is qemu_ram_mmap(fd, size, MAP_SHARED, offset);
---
 cpu-common.h |    1 +
 exec.c       |   33 +
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/cpu-common.h b/cpu-common.h
index 49c7fb3..87c82fc 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -32,6 +32,7 @@ static inline void cpu_register_physical_memory(target_phys_addr_t start_addr,
 }
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
+ram_addr_t qemu_ram_mmap(int, ram_addr_t, int, int);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device. */
diff --git a/exec.c b/exec.c
index 467a0e7..2303be7 100644
--- a/exec.c
+++ b/exec.c
@@ -2811,6 +2811,39 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
+ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, int offset)
+{
+    RAMBlock *new_block;
+
+    size = TARGET_PAGE_ALIGN(size);
+    new_block = qemu_malloc(sizeof(*new_block));
+
+    // map the file passed as a parameter to be this part of memory
+    new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset);
+
+#ifdef MADV_MERGEABLE
+    madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
+
+    new_block->offset = last_ram_offset;
+    new_block->length = size;
+
+    new_block->next = ram_blocks;
+    ram_blocks = new_block;
+
+    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+                                  (last_ram_offset + size) >> TARGET_PAGE_BITS);
+    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
+           0xff, size >> TARGET_PAGE_BITS);
+
+    last_ram_offset += size;
+
+    if (kvm_enabled())
+        kvm_setup_guest_memory(new_block->host, size);
+
+    return new_block->offset;
+}
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
     RAMBlock *new_block;
--
1.6.0.6
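The mechanism qemu_ram_mmap relies on can be demonstrated in miniature (illustrative Python, POSIX-only): two independent MAP_SHARED mappings of the same file observe each other's writes, which is how a single host memory object can back RAM visible to multiple guests.

```python
# Miniature demo of the MAP_SHARED semantics qemu_ram_mmap depends on:
# writes through one shared mapping of a file are visible through another.
import mmap, os, tempfile

def demo_shared_mapping(size=4096):
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)                      # give the file a backing size
    a = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    b = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)
    a[0:5] = b"hello"                           # write through one mapping...
    result = bytes(b[0:5])                      # ...read through the other
    a.close(); b.close(); os.close(fd); os.unlink(path)
    return result
```

In the ivshmem case the file is a POSIX shared memory object and each mapping lives in a different qemu process, but the visibility guarantee is the same.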
[PATCH v4 3/3] Inter-VM shared memory PCI device
Support an inter-vm shared memory device that maps a shared-memory object as a PCI device in the guest. This patch also supports interrupts between guests by communicating over a unix domain socket. This patch applies to the qemu-kvm repository.

-device ivshmem,size=<size in MB>[,shm=<shm name>]

Interrupts are supported between multiple VMs by using a shared memory server by using a chardev socket.

-device ivshmem,size=<size in MB>[,shm=<shm name>][,chardev=<id>][,msi=on][,irqfd=on][,vectors=n]
-chardev socket,path=<path>,id=<id>

Sample programs, init scripts and the shared memory server are available in a git repo here: www.gitorious.org/nahanni
---
 Makefile.target |    3 +
 hw/ivshmem.c    |  700
 qemu-char.c     |    6 +
 qemu-char.h     |    3 +
 4 files changed, 712 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index 1ffd802..bc9a681 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y = pckbd.o dma.o
 obj-i386-y += vga.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 000..2ec6c2c
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,700 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *     Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define PCI_COMMAND_IOACCESS  0x0001
+#define PCI_COMMAND_MEMACCESS 0x0002
+
+#define DEBUG_IVSHMEM
+
+#define IVSHMEM_IRQFD 0
+#define IVSHMEM_MSI   1
+#define IVSHMEM_MAX_EVENTFDS 16
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...) \
+    do { printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+#define NEW_GUEST_VAL UINT_MAX
+
+struct eventfd_entry {
+    PCIDevice *pdev;
+    int vector;
+};
+
+typedef struct IVShmemState {
+    PCIDevice dev;
+    uint32_t intrmask;
+    uint32_t intrstatus;
+    uint32_t doorbell;
+
+    CharDriverState *chr;
+    CharDriverState **eventfd_chr;
+    int ivshmem_mmio_io_addr;
+
+    pcibus_t mmio_addr;
+    uint8_t *ivshmem_ptr;
+    unsigned long ivshmem_offset;
+    unsigned int ivshmem_size;
+    int shm_fd; /* shared memory file descriptor */
+
+    /* array of eventfds for each guest */
+    int *eventfds[IVSHMEM_MAX_EVENTFDS];
+    /* keep track of # of eventfds for each guest */
+    int *eventfds_posn_count;
+
+    int vm_id;
+    int num_eventfds;
+    uint32_t vectors;
+    uint32_t features;
+    struct eventfd_entry eventfd_table[IVSHMEM_MAX_EVENTFDS];
+
+    char *shmobj;
+    uint32_t size; /* size of shared memory in MB */
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+    IntrMask = 0,
+    IntrStatus = 4,
+    IVPosition = 8,
+    Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline int is_power_of_two(int x) {
+    return (x & (x - 1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+                        pcibus_t addr, pcibus_t size, int type)
+{
+IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev); + +IVSHMEM_DPRINTF(addr = %u size = %u\n, (uint32_t)addr, (uint32_t)size); +cpu_register_physical_memory(addr, s-ivshmem_size, s-ivshmem_offset); + +} + +/* accessing registers - based on rtl8139 */ +static void ivshmem_update_irq(IVShmemState *s, int val) +{ +int isr; +isr = (s-intrstatus s-intrmask) 0x; + +/* don't print ISR resets */ +if (isr) { +IVSHMEM_DPRINTF(Set IRQ to %d (%04x %04x)\n, + isr ? 1 : 0, s-intrstatus, s-intrmask); +} + +qemu_set_irq(s-dev.irq[0], (isr != 0)); +} + +static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val) +{ +IVSHMEM_DPRINTF(IntrMask write(w) val = 0x%04x\n, val); + +s-intrmask = val; + +ivshmem_update_irq(s, val); +} + +static uint32_t ivshmem_IntrMask_read(IVShmemState *s) +{ +uint32_t ret = s-intrmask; + +IVSHMEM_DPRINTF(intrmask read(w) val = 0x%04x\n, ret); + +return ret; +} + +static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val) +{ +IVSHMEM_DPRINTF(IntrStatus write(w) val = 0x%04x\n, val); + +s-intrstatus = val; + +ivshmem_update_irq(s, val); +return; +} + +static
[PATCH v4] Shared memory uio_pci driver
This patch adds a driver for my shared memory PCI device using the uio_pci interface. The driver has three memory regions. The first memory region is for device registers for sending interrupts. The second BAR is for receiving MSI-X interrupts and the third memory region maps the shared memory. The device only exports the first and third memory regions to userspace. This driver supports MSI-X and regular pin interrupts. Currently, the number of MSI vectors is set to 2 (one for new connections and the other for interrupts) but it could easily be increased. If MSI is not available, then regular interrupts will be used. This version added formatting and style corrections as well as better error-checking and cleanup when errors occur. --- drivers/uio/Kconfig |8 ++ drivers/uio/Makefile |1 + drivers/uio/uio_ivshmem.c | 252 + 3 files changed, 261 insertions(+), 0 deletions(-) create mode 100644 drivers/uio/uio_ivshmem.c diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig index 1da73ec..b92cded 100644 --- a/drivers/uio/Kconfig +++ b/drivers/uio/Kconfig @@ -74,6 +74,14 @@ config UIO_SERCOS3 If you compile this as a module, it will be called uio_sercos3. +config UIO_IVSHMEM + tristate KVM shared memory PCI driver + default n + help + Userspace I/O interface for the KVM shared memory device. This + driver will make available two memory regions, the first is + registers and the second is a region for sharing between VMs. 
+ config UIO_PCI_GENERIC tristate Generic driver for PCI 2.3 and PCI Express cards depends on PCI diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile index 18fd818..25c1ca5 100644 --- a/drivers/uio/Makefile +++ b/drivers/uio/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_UIO_AEC) += uio_aec.o obj-$(CONFIG_UIO_SERCOS3) += uio_sercos3.o obj-$(CONFIG_UIO_PCI_GENERIC) += uio_pci_generic.o obj-$(CONFIG_UIO_NETX) += uio_netx.o +obj-$(CONFIG_UIO_IVSHMEM) += uio_ivshmem.o diff --git a/drivers/uio/uio_ivshmem.c b/drivers/uio/uio_ivshmem.c new file mode 100644 index 000..42ac9a7 --- /dev/null +++ b/drivers/uio/uio_ivshmem.c @@ -0,0 +1,252 @@ +/* + * UIO IVShmem Driver + * + * (C) 2009 Cam Macdonell + * based on Hilscher CIF card driver (C) 2007 Hans J. Koch h...@linutronix.de + * + * Licensed under GPL version 2 only. + * + */ + +#include linux/device.h +#include linux/module.h +#include linux/pci.h +#include linux/uio_driver.h + +#include asm/io.h + +#define IntrStatus 0x04 +#define IntrMask 0x00 + +struct ivshmem_info { + struct uio_info *uio; + struct pci_dev *dev; + char (*msix_names)[256]; + struct msix_entry *msix_entries; + int nvectors; +}; + +static irqreturn_t ivshmem_handler(int irq, struct uio_info *dev_info) +{ + + void __iomem *plx_intscr = dev_info-mem[0].internal_addr + + IntrStatus; + u32 val; + + val = readl(plx_intscr); + if (val == 0) + return IRQ_NONE; + + return IRQ_HANDLED; +} + +static irqreturn_t ivshmem_msix_handler(int irq, void *opaque) +{ + + struct uio_info * dev_info = (struct uio_info *) opaque; + + /* we have to do this explicitly when using MSI-X */ + uio_event_notify(dev_info); + return IRQ_HANDLED; +} + +static void free_msix_vectors(struct ivshmem_info *ivs_info, + const int max_vector) +{ + int i; + + for (i = 0; i max_vector; i++) + free_irq(ivs_info-msix_entries[i].vector, ivs_info-uio); +} + +static int request_msix_vectors(struct ivshmem_info *ivs_info, int nvectors) +{ + int i, err; + const char *name = ivshmem; + + ivs_info-nvectors = 
nvectors; + + ivs_info-msix_entries = kmalloc(nvectors * sizeof * + ivs_info-msix_entries, + GFP_KERNEL); + if (ivs_info-msix_entries == NULL) + return -ENOSPC; + + ivs_info-msix_names = kmalloc(nvectors * sizeof *ivs_info-msix_names, + GFP_KERNEL); + if (ivs_info-msix_names == NULL) { + kfree(ivs_info-msix_entries); + return -ENOSPC; + } + + for (i = 0; i nvectors; ++i) + ivs_info-msix_entries[i].entry = i; + + err = pci_enable_msix(ivs_info-dev, ivs_info-msix_entries, + ivs_info-nvectors); + if (err 0) { + ivs_info-nvectors = err; /* msi-x positive error code +returns the number available*/ + err = pci_enable_msix(ivs_info-dev, ivs_info-msix_entries, + ivs_info-nvectors); + if (err) { + printk(KERN_INFO no MSI (%d). Back to INTx.\n, err); +
Re: Setting nx bit in virtual CPU
gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine. Any other ideas? I am beginning to get that horrible feeling that there isn't a real problem and it is just me being dumb! I really hope so, because I am out of ideas... :)

Can you verify check-nx returns disabled on the guest? Does /proc/cpuinfo show nx in the guest?

OK, time for a summary:

Host: /proc/cpuinfo shows 'nx' and check-nx shows 'enabled'
Guest: /proc/cpuinfo doesn't show nx and check-nx shows 'disabled'
Guest (with -no-kvm option): /proc/cpuinfo shows 'nx', but check-nx still shows 'disabled'

Below I have included all the listings which I think might be useful, but if you would like to see anything else then please ask.

HOST: /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 79
model name      : AMD Athlon(tm) 64 Processor 3200+
stepping        : 2
cpu MHz         : 1000.000
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm svm extapic cr8_legacy
bogomips        : 2000.06
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

GUEST with command line - kvm -hda /dev/mapper/vols-andrew -kernel ./bzImage -append root=/dev/hda2 -cpu host -runas xx -net nic -net user -m 256 -k en-gb -vnc :1 -monitor stdio

/proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 79
model name      : AMD Athlon(tm) 64 Processor 3200+
stepping        : 2
cpu MHz         : 1.330
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext fxsr_opt lm rep_good pni cx16 lahf_lm
bogomips        : 2000.06
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Results of paxtest

PaXtest - Copyright(c) 2003,2004 by Peter Busser pe...@adamantix.org
Released under the GNU Public Licence version 2 or later

Mode: kiddie
Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31 GMT 2010 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

Executable anonymous mapping             : Vulnerable
Executable bss                           : Vulnerable
Executable data                          : Vulnerable
Executable heap                          : Vulnerable
Executable stack                         : Vulnerable
Executable anonymous mapping (mprotect)  : Vulnerable
Executable bss (mprotect)                : Vulnerable
Executable data (mprotect)               : Vulnerable
Executable heap (mprotect)               : Vulnerable
Executable stack (mprotect)              : Vulnerable
Executable shared library bss (mprotect) : Vulnerable
Executable shared library data (mprotect): Vulnerable
Writable text segments                   : Killed
Anonymous mapping randomisation test     : 33 bits (guessed)
Heap randomisation test (ET_EXEC)        : 13 bits (guessed)
Heap randomisation test (ET_DYN)         : 40 bits (guessed)
Main executable randomisation (ET_EXEC)  : No randomisation
Main executable randomisation (ET_DYN)   : 12 bits (guessed)
Shared library randomisation test        : 33 bits (guessed)
Stack randomisation test (SEGMEXEC)      : 40 bits (guessed)
Stack randomisation test (PAGEEXEC)      : 40 bits (guessed)
Return to function (strcpy)              : paxtest: bad luck, try different compiler options.
Return to function (memcpy)              : *** buffer overflow detected ***: rettofunc2 - terminated
rettofunc2: buffer overflow attack in function unknown - terminated
Report to http://bugs.gentoo.org/
Killed
Return to function (strcpy, RANDEXEC)    : paxtest: bad luck, try different compiler options.
Return to function (memcpy, RANDEXEC)    : *** buffer overflow detected ***: rettofunc2x - terminated
rettofunc2x: buffer overflow attack in function unknown - terminated
Report to http://bugs.gentoo.org/
Killed
Executable shared library bss            : Killed
Executable shared library data           : Killed

GUEST with command line - kvm -hda /dev/mapper/vols-andrew -kernel ./bzImage -append root=/dev/hda2 -no-kvm -runas xx -net nic -net user -m 256 -k en-gb -vnc :1 -monitor stdio

/proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 2
model name      : QEMU Virtual CPU version 0.12.3
stepping        : 3
cpu MHz         : 1998.067
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp
Re: [GIT PULL] vhost-net fix for 2.6.34-rc3
From: Michael S. Tsirkin m...@redhat.com Date: Wed, 7 Apr 2010 20:35:02 +0300 David, The following tree includes a patch fixing an issue with vhost-net in 2.6.34-rc3. Please pull for 2.6.34. Pulled, thanks Michael. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on default_x86_64_debian_5_0
The Buildbot has detected a new failure of default_x86_64_debian_5_0 on qemu-kvm.
Full details are available at:
 http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_debian_5_0/builds/347
Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/
Buildslave for this Build: b1_qemu_kvm_1
Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist:
BUILD FAILED: failed compile
sincerely,
 -The Buildbot
buildbot failure in qemu-kvm on default_x86_64_out_of_tree
The Buildbot has detected a new failure of default_x86_64_out_of_tree on qemu-kvm.
Full details are available at:
 http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_out_of_tree/builds/288
Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/
Buildslave for this Build: b1_qemu_kvm_1
Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist:
BUILD FAILED: failed compile
sincerely,
 -The Buildbot
KVM autotest patch queue report 04-07-2010
Summary:
 Total patches: 13
 Reviewed patches: 8
 Reviews unfinished: 1
 Unreviewed patches: 4

Autotest patchwork (patches under review):
 http://patchwork.test.kernel.org/project/autotest/list/
Autotest timeline (patches already applied):
 http://autotest.kernel.org/timeline

KVM test: Add a subtest iofuzz (2010-04-07, Jason Wang, lmr) - Under Review
 Unreviewed patch.

[3/3] KVM Test: Add ioquit test case (2010-04-07, Feng Yang, lmr) - Under Review
[2/3] KVM Test: Add function run_autotest_background and wait_autotest_background. (2010-04-07, Feng Yang, lmr) - Under Review
[1/3] KVM Test: Add control file dbench.control.200 for dbench (2010-04-07, Feng Yang, lmr) - Under Review
 Unreviewed patch series.

[RFC] KVM test: Introduce sample performance test set (2010-04-07, Lucas Meneghel Rodrigues, lmr) - Under Review
 First pass at implementing automated performance testing. Still needs a lot of work to get it in shape for upstream inclusion.

[v2] KVM-test: Add a subtest 'qemu_img' (2010-03-31, Yolkfull Chow, lmr) - Under Review
 Improved version of the qemu_img test, currently in the testing stage.

[KVM-AUTOTEST] Opensuse unattended install (2010-03-23, yogi, lmr) - Under Review
 I am concerned about a timeout added at the end of the unattended install, which could have bad side effects on other guests. Didn't manage to finish testing.

KVM-Test: Add kvm userspace unit test (2010-03-05, sshang, lmr) - Under Review
 Naphtali Sprei is working on support for running the unittests with a new in-qemu infrastructure made by Avi, so this test will wait a little bit so we can merge both approaches into one. Updates: Naphtali already sent his RFC patches, currently under review.
[2/2] KVM test: Add cpu_set subtest (2010-02-25, Lucas Meneghel Rodrigues, lmr) - Under Review
 This patch will stay on the queue until the feature tested gets in better shape on KVM upstream.

KVM test: Add support for ipv6 addresses (2010-02-24, Lucas Meneghel Rodrigues, lmr) - Under Review
 This test was reviewed and the decision is that it will stay on the queue until we have more extensive guest network testing.

KVM test: Memory ballooning test for KVM guest (2010-02-11, pradeep, lmr) - Under Review
 Same status as previous week. Waiting on a revised patch from the originator.

[2/2] KVM test: subtest migration: Add rem_host and rem_port for migrate() (2009-12-08, Yolkfull Chow, lmr) - Under Review
[1/2,-,V3] Add a server-side test - kvm_migration (2009-12-08, Yolkfull Chow, lmr) - Under Review
 Same status as previous week. Needs full review.
Re: Shouldn't cache=none be the default for drives?
On Wed, 07 Apr 2010 16:39:41 +0200, Troels Arvin wrote:

Hello, I'm conducting some performance tests with KVM-virtualized CentOSes. One thing I noticed is that guest I/O performance seems to be significantly better for virtio-based block devices (drives) if the cache=none argument is used. (This was with a rather powerful storage system backend which is hard to saturate.) So: why isn't cache=none the default for drives?

A while ago I suffered poor performance with virtio and Win2008. This helped a lot: I enabled the deadline block scheduler instead of the default cfq on the host system. Tested with: host Debian with the deadline scheduler, guest Win2008 with virtio and cache=none (a boost from 26 MB/s to 50 MB/s was measured). Maybe this also holds for Linux-on-Linux. I expect that the noop scheduler would be a good choice for Linux guests.

- Thomas
Re: Question on skip_emulated_instructions()
Avi Kivity wrote:

On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:

The problem here is that I needed to transfer the VM state which is just *before* the output to the devices. Otherwise, the VM state has already proceeded, and after failover, some I/O didn't work as I expected. I tracked this issue down and figured out that rip had already been advanced in KVM, so transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work on it and post a patch.

vcpu state is undefined when an mmio operation is pending. Documentation/kvm/api.txt says the following:

NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations.

Thanks for the information. So the point is that the vcpu state that can be observed from qemu upon KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used, because it is not complete/consistent?

Currently we complete instructions for output operations and leave them incomplete for input operations. Deferring completion for output operations should work, except it may break the vmware backdoor port (see hw/vmport.c), which changes register state following an output instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state following a write instruction.

Do you really need to transfer the vcpu state before the instruction, or do you just need a consistent state? If the latter, then you can get away with posting a signal and re-entering the guest. kvm will complete the instruction and exit immediately, and you will have fully consistent state.
The requirement is that the guest must always be able to replay at least the instruction which triggered the synchronization on the primary. From that point of view, I think I need to transfer the vcpu state before the instruction. If I post a signal and let the guest or emulator proceed, I'm not sure whether the guest on the secondary can replay as expected. Please point it out if I'm misunderstanding.
Re: Question on skip_emulated_instructions()
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:

Avi Kivity wrote:

On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:

The problem here is that I needed to transfer the VM state which is just *before* the output to the devices. Otherwise, the VM state has already proceeded, and after failover, some I/O didn't work as I expected. I tracked this issue down and figured out that rip had already been advanced in KVM, so transferring this VCPU state was meaningless. I'm planning to post the patch set of Kemari soon, but I would like to solve this rip issue before that. If there is no drawback, I'm happy to work on it and post a patch.

vcpu state is undefined when an mmio operation is pending. Documentation/kvm/api.txt says the following:

NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations.

Thanks for the information. So the point is that the vcpu state that can be observed from qemu upon KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used, because it is not complete/consistent?

Definitely. The VCPU is in the middle of an instruction's execution, so the state is undefined. One instruction may generate more than one I/O exit during its execution, BTW.

--
Gleb.