Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura
2010/4/6 Gleb Natapov g...@redhat.com:
 On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
 Hi.

 When handle_io() is called, rip is currently advanced *before* the I/O is
 actually handled by qemu in userland.  While implementing Kemari for
 KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly
 in userland qemu, we encountered a problem: synchronizing the contents of
 the VCPU before handling I/O in qemu is too late, because rip has already
 been advanced in KVM.  We avoided this issue with a temporary hack, but I
 would like to ask a few questions on skip_emulated_instructions.

 1. Does rip need to be advanced before the I/O is handled by qemu?
 In current kvm.git rip is advanced before I/O is handled by qemu only
 in case of the out instruction. From an architecture point of view I think
 it's OK, since on real HW you can't guarantee that I/O will take effect
 before the instruction pointer is advanced. It is done like that because we
 want out emulation to be real fast so we skip the x86 emulator.
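For reference, the fast path described here looks roughly like the following
sketch of vmx.c's I/O exit handler (simplified from kvm.git of that era, not
a verbatim copy): string and in operations go through the emulator, while a
plain out advances rip immediately and then calls kvm_fast_pio_out().

static int handle_io(struct kvm_vcpu *vcpu)
{
	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
	int string = (exit_qualification & 16) != 0;
	int in = (exit_qualification & 8) != 0;
	int size, port;

	if (string || in)
		return !!emulate_instruction(vcpu, 0, 0, 0);

	port = exit_qualification >> 16;
	size = (exit_qualification & 7) + 1;
	/* rip is advanced *before* userspace performs the out: */
	skip_emulated_instruction(vcpu);

	return kvm_fast_pio_out(vcpu, size, port);
}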

Thanks for your reply.

If advancing rip later doesn't break device behavior or
introduce a slowdown, I would like that to be done.

 2. If no, is it possible to divide skip_emulated_instructions(), like
 rec_emulated_instructions() to remember the next_rip, and
 skip_emulated_instructions() to actually advance the rip?
 Currently only the emulator can call userspace to do I/O, so after
 userspace returns from an I/O exit, control is handed back to the emulator
 unconditionally.  The out instruction skips the emulator, but there is nothing
 to do after userspace returns, so the regular cpu loop is executed. If we
 want to advance rip only after userspace has executed the I/O done by out, we
 need to distinguish who requested the I/O (emulator or kvm_fast_pio_out())
 and call different code depending on who that was. It can be done by
 having a callback that (if not null) is called on return from userspace.
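A minimal sketch of that callback idea (all names below are hypothetical
illustrations, not existing kvm.git code):

/* hypothetical per-vcpu hook, checked on return from userspace */
int (*complete_userspace_io)(struct kvm_vcpu *vcpu);

/* kvm_fast_pio_out() would save the target rip instead of writing it: */
static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
{
	kvm_rip_write(vcpu, vcpu->arch.pio_next_rip);	/* hypothetical field */
	return 1;
}
	...
	vcpu->arch.complete_userspace_io = complete_fast_pio_out;

/* and the KVM_RUN path, after userspace has done the I/O: */
if (vcpu->arch.complete_userspace_io) {
	int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
	vcpu->arch.complete_userspace_io = NULL;
	r = cui(vcpu);
}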

Your suggestion is to introduce a callback entry and, instead of
calling kvm_rip_write(), set the callback before calling
kvm_fast_pio_out(), and check the entry upon return from userspace,
correct?

According to the comment in x86.c, when it was an out instruction,
vcpu->arch.pio.count is set to 0 to skip the emulator.
To call kvm_fast_pio_out(), !string and !in must be set.
If we can check vcpu->arch.pio.count, string and in on return
from userspace, can't we distinguish who requested the I/O, the emulator
or kvm_fast_pio_out()?

 3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
 modified to continue without emulating the nop when next_rip is 0?

 I don't see where a nop is emulated if next_rip is 0. As far as I see, in
 case of next_rip==0 the instruction at rip is decoded to figure out its
 length and then rip is advanced by the instruction length. Anyway, next_rip
 is an svm-only thing.

Sorry.  I wasn't understanding the code enough.

static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
...
	if (!svm->next_rip) {
		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
				EMULATE_DONE)
			printk(KERN_DEBUG "%s: NOP\n", __func__);
		return;
	}

Since the printk says NOP, I thought emulate_instruction was doing so...

The reason I asked about next_rip is because I was hoping to use this
entry to advance rip only after userspace has executed the I/O done by out:
if next_rip is !0, call kvm_rip_write(), and introduce next_rip to vmx if
it is usable, because vmx is currently using a local variable rip.

Yoshi


[PATCH] [PPC] Add dequeue for external on BookE

2010-04-07 Thread Alexander Graf
Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the
IRQ line from userspace. This added a new core-specific callback that I
apparently forgot to add for BookE.

So let's add the callback for BookE as well, making it build again.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/booke.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 4d686cc..e170672 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -132,6 +132,12 @@ void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_EXTERNAL);
 }
 
+void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
+				  struct kvm_interrupt *irq)
+{
+	clear_bit(BOOKE_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions);
+}
+
 /* Deliver the interrupt of the corresponding priority, if possible. */
 static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 unsigned int priority)
-- 
1.6.0.2



Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-07 Thread Michael S. Tsirkin
On Wed, Apr 07, 2010 at 10:41:08AM +0800, Xin, Xiaohui wrote:
 Michael,
  
  Qemu needs a userspace write, is that a synchronous one or
 asynchronous one?
 
 It's a synchronous non-blocking write.
 Sorry, why does Qemu live migration need the device to have a userspace write?
 How does the write operation work? And why is a read operation not needed here?
 
 Thanks
 Xiaohui

Roughly: with ethernet bridges, moving a device from one location in
the network to another makes forwarding tables incorrect (or incomplete)
until outgoing traffic from the device causes these tables
to be updated. Since there's no guarantee that the guest
will generate outgoing traffic, after migration qemu sends out several
dummy packets itself.
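As a rough illustration, a standalone userspace sketch of the same trick
(the interface name and MAC below are placeholders; qemu does this internally
after migration completes):

#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	unsigned char mac[6] = {0x52, 0x54, 0x00, 0x12, 0x34, 0x56};
	unsigned char frame[60] = {0};
	struct sockaddr_ll sll = {0};
	int i, fd = socket(AF_PACKET, SOCK_RAW, 0);	/* needs root */

	memset(frame, 0xff, 6);		/* dst: broadcast */
	memcpy(frame + 6, mac, 6);	/* src: the MAC bridges must relearn */

	sll.sll_family = AF_PACKET;
	sll.sll_ifindex = if_nametoindex("eth0");	/* placeholder ifname */
	sll.sll_halen = 6;

	for (i = 0; i < 5; i++)		/* a handful is enough */
		sendto(fd, frame, sizeof(frame), 0,
		       (struct sockaddr *)&sll, sizeof(sll));
	close(fd);
	return 0;
}

The exact payload matters less than the source MAC: the bridges only need to
see frames from the moved address arriving on the new port.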

-- 
MST


Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-07 Thread Michael S. Tsirkin
On Wed, Apr 07, 2010 at 09:36:36AM +0800, Xin, Xiaohui wrote:
 Michael,
 For the write logging, do you have a function in hand that we can use
 to recompute the log? If so, I think I can use it to recompute the
 log info when the logging is suddenly enabled.
 For the outstanding requests, do you mean all the user buffers that
 have been submitted before the logging ioctl changed? That may be a lot,
 and some of them are still in NIC ring descriptors. Waiting for them to
 finish may need some time. I think changing the logging just after the
 logging ioctl has changed is also reasonable.
 
   The key point is that after the logging ioctl returns, any
   subsequent change to memory must be logged. It does not
   matter when the request was submitted, otherwise we will
   get memory corruption on migration.
 
   The change to memory happens in vhost_add_used_and_signal(), right?
   So after the ioctl returns, just recomputing the log info for the
   events in the async queue is ok, since the ioctl and write log
   operations are all protected by vq->mutex.
 
   Thanks
   Xiaohui
 
  Yes, I think this will work.
 
  Thanks, so do you have the function to recompute the log info in hand
  that I can use? I vaguely remember that you mentioned it some time ago.
 
 Doesn't just rerunning vhost_get_vq_desc work?
 
 Am I missing something here?
 vhost_get_vq_desc() looks at the vq, finds the first available buffers,
 and converts them to an iovec. I think the first available buffer is
 not one of the buffers in the async queue, so I think rerunning
 vhost_get_vq_desc() cannot work.
 
 Thanks
 Xiaohui

Right, but we can move the head back, so we'll find the same buffers
again, or add a variant of vhost_get_vq_desc that will process
descriptors already consumed.
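For reference, "moving the head back" is cheap; modeled on vhost's existing
vhost_discard_vq_desc(), a multi-descriptor variant is just:

/* Sketch: return n descriptors to the available ring; the next
 * vhost_get_vq_desc() call will fetch the same buffers again. */
static void vhost_discard_desc(struct vhost_virtqueue *vq, int n)
{
	vq->last_avail_idx -= n;
}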

 Thanks
 Xiaohui
   
 drivers/vhost/net.c   |  189 ++++++++++++++++++++++++++++++++++++++++++++---
 drivers/vhost/vhost.h |   10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)
   
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
 
@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache	*cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+
+	vq_log = unlikely(vhost_has_feature(
+			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+ 

[PATCH 1/3] KVM Test: Add control file dbench.control.200 for dbench

2010-04-07 Thread Feng Yang
This control file sets seconds to 200. It is used by the
ioquit test script.

Signed-off-by: Feng Yang fy...@redhat.com
---
 .../tests/kvm/autotest_control/dbench.control.200  |   20 
 1 files changed, 20 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/autotest_control/dbench.control.200

diff --git a/client/tests/kvm/autotest_control/dbench.control.200 b/client/tests/kvm/autotest_control/dbench.control.200
new file mode 100644
index 000..c648f7a
--- /dev/null
+++ b/client/tests/kvm/autotest_control/dbench.control.200
@@ -0,0 +1,20 @@
+TIME="SHORT"
+AUTHOR = "Martin Bligh <mbl...@google.com>"
+DOC = """
+dbench is one of our standard kernel stress tests.  It produces filesystem
+load like netbench originally did, but involves no network system calls.
+Its results include throughput rates, which can be used for performance
+analysis.
+
+More information on dbench can be found here:
+http://samba.org/ftp/tridge/dbench/README
+
+Currently it needs to be updated in its configuration. It is a great test for
+the higher level I/O systems but barely touches the disk right now.
+"""
+NAME = 'dbench'
+TEST_CLASS = 'kernel'
+TEST_CATEGORY = 'Functional'
+TEST_TYPE = 'client'
+
+job.run_test('dbench', seconds=200)
-- 
1.5.5.6



[PATCH 2/3] KVM Test: Add function run_autotest_background and wait_autotest_background.

2010-04-07 Thread Feng Yang
Add functions run_autotest_background and wait_autotest_background to
kvm_test_utils.py.  These two functions are used in the ioquit test script.

Signed-off-by: Feng Yang fy...@redhat.com
---
 client/tests/kvm/kvm_test_utils.py |   68 +++-
 1 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index f512044..2a1054e 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -21,7 +21,7 @@ More specifically:
 @copyright: 2008-2009 Red Hat Inc.
 
 
-import time, os, logging, re, commands
+import time, os, logging, re, commands, sys
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
 import kvm_utils, kvm_vm, kvm_subprocess, scan_results
@@ -402,3 +402,69 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir):
         result = bad_results[0]
         raise error.TestFail("Test '%s' ended with %s (reason: '%s')"
                              % (result[0], result[1], result[3]))
+
+
+def run_autotest_background(vm, session, control_path, timeout, test_name,
+                            outputdir):
+    """
+    Wrapper around run_autotest() that makes it run in the background through
+    fork(), letting it run in the child process.
+    1) Flush the stdio.
+    2) Build the test params, which are received from the arguments and used
+       by run_autotest().
+    3) Fork the process and let run_autotest() run in the child.
+    4) Catch any exception raised by run_autotest() and exit the child with a
+       non-zero return code.
+    5) If no exception is caught, return 0.
+
+    @param vm: VM object.
+    @param session: A shell session on the VM provided.
+    @param control: An autotest control file.
+    @param timeout: Timeout under which the autotest test must complete.
+    @param test_name: Autotest client test name.
+    @param outputdir: Path on host where we should copy the guest autotest
+            results to.
+    """
+
+    def flush():
+        sys.stdout.flush()
+        sys.stderr.flush()
+
+    logging.info("Running autotest background ...")
+    flush()
+    pid = os.fork()
+    if pid:
+        # Parent process
+        return pid
+
+    try:
+        # Launch autotest
+        logging.info("child process of run_autotest_background")
+        run_autotest(vm, session, control_path, timeout, test_name, outputdir)
+    except error.TestFail, message_fail:
+        logging.info("[Autotest Background FAIL] %s" % message_fail)
+        os._exit(1)
+    except error.TestError, message_error:
+        logging.info("[Autotest Background ERROR] %s" % message_error)
+        os._exit(2)
+    except:
+        os._exit(3)
+
+    logging.info("[Autotest Background GOOD]")
+    os._exit(0)
+
+
+def wait_autotest_background(pid):
+    """
+    Wait for the background autotest to finish.
+
+    @param pid: Pid of the child process executing the background autotest.
+    """
+    logging.info("Waiting for background autotest to finish ...")
+
+    (pid, s) = os.waitpid(pid, 0)
+    status = os.WEXITSTATUS(s)
+    if status != 0:
+        return False
+    return True
+
-- 
1.5.5.6



[PATCH 3/3] KVM Test: Add ioquit test case

2010-04-07 Thread Feng Yang
Signed-off-by: Feng Yang fy...@redhat.com
---
 client/tests/kvm/tests/ioquit.py   |   54 
 client/tests/kvm/tests_base.cfg.sample |4 ++
 2 files changed, 58 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/ioquit.py

diff --git a/client/tests/kvm/tests/ioquit.py b/client/tests/kvm/tests/ioquit.py
new file mode 100644
index 000..c75a0e3
--- /dev/null
+++ b/client/tests/kvm/tests/ioquit.py
@@ -0,0 +1,54 @@
+import logging, time, random, signal, os
+from autotest_lib.client.common_lib import error
+import kvm_test_utils, kvm_utils
+
+
+def run_ioquit(test, params, env):
+    """
+    Emulate poweroff under an IO workload (dbench so far) using the monitor
+    command 'quit'.
+
+    @param test: Kvm test object
+    @param params: Dictionary with the test parameters.
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+    session2 = kvm_test_utils.wait_for_login(vm,
+                  timeout=int(params.get("login_timeout", 360)))
+    def is_autotest_launched():
+        if session.get_command_status("pgrep autotest") != 0:
+            logging.debug("Autotest process not found")
+            return False
+        return True
+
+    test_name = params.get("background_test", "dbench")
+    control_file = params.get("control_file", "dbench.control")
+    timeout = int(params.get("test_timeout", 300))
+    control_path = os.path.join(test.bindir, "autotest_control",
+                                control_file)
+    outputdir = test.outputdir
+
+    pid = kvm_test_utils.run_autotest_background(vm, session2, control_path,
+                                                 timeout, test_name,
+                                                 outputdir)
+    if pid < 0:
+        raise error.TestError("Could not create child process to execute "
+                              "autotest background")
+
+    if kvm_utils.wait_for(is_autotest_launched, 240, 0, 2):
+        logging.debug("Background autotest successfully started")
+    else:
+        logging.debug("Background autotest failed, start the test anyway")
+
+    time.sleep(100 + random.randrange(0, 100))
+    logging.info("Kill the virtual machine")
+    vm.process.close()
+
+    logging.info("Kill the tracking process")
+    kvm_utils.safe_kill(pid, signal.SIGKILL)
+    kvm_test_utils.wait_autotest_background(pid)
+    session.close()
+    session2.close()
+
diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index 9b12fc2..d8530f6 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -305,6 +305,10 @@ variants:
 - ksm_parallel:
 ksm_mode = parallel
 
+- ioquit:
+type = ioquit
+control_file = dbench.control.200
+background_test = dbench
 # system_powerdown, system_reset and shutdown *must* be the last ones
 # defined (in this order), since the effect of such tests can leave
 # the VM on a bad state.
-- 
1.5.5.6



Re:[PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-07 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

---

Michael,
Thanks a lot for the explanation. I have drafted a patch for the qemu write
after looking into the tun driver. Does it do it the right way?

Thanks
Xiaohui

 drivers/vhost/mpassthru.c |   45 +
 1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index e9449ac..1cde097 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
 	return mask;
 }
 
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -EFAULT;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+	skb_set_network_header(skb, ETH_HLEN);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+	mp->dev->stats.tx_packets++;
+	mp->dev->stats.tx_bytes += len;
+
+	mp_put(mp);
+	return result;
+}
+
 static int mp_chr_close(struct inode *inode, struct file *file)
 {
 	struct mp_file *mfile = file->private_data;
@@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
 static const struct file_operations mp_fops = {
 	.owner  = THIS_MODULE,
 	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
 	.poll   = mp_chr_poll,
 	.unlocked_ioctl = mp_chr_ioctl,
 	.open   = mp_chr_open,
-- 
1.5.4.4



Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread Michael S. Tsirkin
On Tue, Apr 06, 2010 at 01:32:53PM -0700, David L Stevens wrote:
 
 This patch adds support for the Mergeable Receive Buffers feature to
 vhost_net.
 
   +-DLS
 
 Changes from previous revision:
 1) renamed:
 	vhost_discard_vq_desc -> vhost_discard_desc
 	vhost_get_heads -> vhost_get_desc_n
 	vhost_get_vq_desc -> vhost_get_desc
 2) added heads as argument to vhost_get_desc_n
 3) changed vq->heads from iovec to vring_used_elem, removed casts
 4) changed vhost_add_used to do multiple elements in a single
 copy_to_user,
 	or two when we wrap the ring.
 5) removed rxmaxheadcount and available buffer checks in favor of
 running until
 	an allocation failure, but making sure we break the loop if we get
 	two in a row, indicating we have at least 1 buffer, but not enough
 	for the current receive packet
 6) restore non-vnet header handling
 
 Signed-Off-By: David L Stevens dlstev...@us.ibm.com

Thanks!
There's some whitespace damage, are you sending with your new
sendmail setup? It seems to have worked for qemu patches ...

 diff -ruNp net-next-p0/drivers/vhost/net.c net-next-v3/drivers/vhost/net.c
 --- net-next-p0/drivers/vhost/net.c	2010-03-22 12:04:38.0 -0700
 +++ net-next-v3/drivers/vhost/net.c	2010-04-06 12:54:56.0 -0700
 @@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
 	hdr_size = vq->hdr_size;
 
 	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 ARRAY_SIZE(vq->iov),
 -					 &out, &in,
 +		head = vhost_get_desc(&net->dev, vq, vq->iov,
 +				      ARRAY_SIZE(vq->iov), &out, &in,
 					 NULL, NULL);
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 -			vhost_discard_vq_desc(vq);
 -			tx_poll_start(net, sock);
 +			if (err == -EAGAIN) {
 +				vhost_discard_desc(vq, 1);
 +				tx_poll_start(net, sock);
 +			} else {
 +				vq_err(vq, "sendmsg: errno %d\n", -err);
 +				/* drop packet; do not discard/resend */
 +				vhost_add_used_and_signal(&net->dev, vq, head,
 +							  0);

vhost does not currently have a consistent error handling strategy:
if we drop packets, we need to think about which other errors should cause
packet drops.  I prefer to just call vq_err for now,
and have us look at handling segfaults etc. in a consistent way
separately.

 +			}
 			break;
 		}
 		if (err != len)
 @@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
 	unuse_mm(net->dev.mm);
  }
  
 +static int vhost_head_len(struct sock *sk)
 +{
 +	struct sk_buff *head;
 +	int len = 0;
 +
 +	lock_sock(sk);
 +	head = skb_peek(&sk->sk_receive_queue);
 +	if (head)
 +		len = head->len;
 +	release_sock(sk);
 +	return len;
 +}
 +

I wonder whether it makes sense to check
skb_queue_empty(&sk->sk_receive_queue)
outside the lock, to reduce the cost of this call
on an empty queue (we know that it happens at least once
each time we exit the loop on rx)?
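Sketched out, the suggestion would look like this (a variant of the
vhost_head_len() above, assuming a racy emptiness check is acceptable
because the queue is rechecked under the lock anyway):

static int vhost_head_len(struct sock *sk)
{
	struct sk_buff *head;
	int len = 0;

	/* cheap unlocked check first; common case when exiting the rx loop */
	if (skb_queue_empty(&sk->sk_receive_queue))
		return 0;

	lock_sock(sk);
	head = skb_peek(&sk->sk_receive_queue);
	if (head)
		len = head->len;
	release_sock(sk);
	return len;
}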

  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_rx(struct vhost_net *net)
  {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 -	unsigned head, out, in, log, s;
 +	unsigned in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
 @@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
 		.msg_flags = MSG_DONTWAIT,
 	};
 
 -	struct virtio_net_hdr hdr = {
 -		.flags = 0,
 -		.gso_type = VIRTIO_NET_HDR_GSO_NONE
 +	struct virtio_net_hdr_mrg_rxbuf hdr = {
 +		.hdr.flags = 0,
 +		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
 
 +	int retries = 0;
 	size_t len, total_len = 0;
 -	int err;
 +	int err, headcount, datalen;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
 -	for (;;) {
 -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 -					 

Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-07 Thread Michael S. Tsirkin
On Wed, Apr 07, 2010 at 05:00:39PM +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 ---
 
 Michael,
 Thanks a lot for the explanation. I have drafted a patch for the qemu write
 after looking into the tun driver. Does it do it the right way?
 
 Thanks
 Xiaohui
 
  drivers/vhost/mpassthru.c |   45 +++++++++++++++++++++++++++++++++++++++++
  1 files changed, 45 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
 index e9449ac..1cde097 100644
 --- a/drivers/vhost/mpassthru.c
 +++ b/drivers/vhost/mpassthru.c
 @@ -1065,6 +1065,49 @@ static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
 	return mask;
  }
  
 +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
 +				unsigned long count, loff_t pos)
 +{
 +	struct file *file = iocb->ki_filp;
 +	struct mp_struct *mp = mp_get(file->private_data);
 +	struct sock *sk = mp->socket.sk;
 +	struct sk_buff *skb;
 +	int len, err;
 +	ssize_t result;
 +
 +	if (!mp)
 +		return -EBADFD;
 +

Can this happen? When?

 +	/* currently, async is not supported */
 +	if (!is_sync_kiocb(iocb))
 +		return -EFAULT;

Really necessary? I think do_sync_write handles all this.
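For context, a paraphrased sketch of what do_sync_write() does (from
fs/read_write.c of that era; details such as the retry loop are omitted):
the kiocb it hands to ->aio_write is always initialized with
init_sync_kiocb(), so is_sync_kiocb() is true by construction on this path.

ssize_t do_sync_write(struct file *filp, const char __user *buf,
		      size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
	struct kiocb kiocb;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);	/* makes is_sync_kiocb() true */
	kiocb.ki_pos = *ppos;

	ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
	if (ret > 0)
		*ppos = kiocb.ki_pos;
	return ret;
}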

 +
 +	len = iov_length(iov, count);
 +	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
 +				  file->f_flags & O_NONBLOCK, &err);
 +
 +	if (!skb)
 +		return -EFAULT;

Surely not EFAULT. -EAGAIN?
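A sketch of the suggested fix: sock_alloc_send_skb() stores the real reason
in err (e.g. -EAGAIN when the send buffer is full and O_NONBLOCK is set), so
that can simply be propagated:

	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
				  file->f_flags & O_NONBLOCK, &err);
	if (!skb)
		return err;	/* -EAGAIN, -ENOBUFS, ..., not -EFAULT */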

 +
 +	skb_reserve(skb, NET_IP_ALIGN);
 +	skb_put(skb, len);
 +
 +	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
 +		kfree_skb(skb);
 +		return -EFAULT;
 +	}
 +	skb_set_network_header(skb, ETH_HLEN);

Is this really right or necessary? Also,
probably need to check that length is at least ETH_ALEN before
doing this.

 +	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);

eth_type_trans?

 +	skb->dev = mp->dev;
 +
 +	dev_queue_xmit(skb);
 +	mp->dev->stats.tx_packets++;
 +	mp->dev->stats.tx_bytes += len;

Doesn't the hard start xmit function for the device
increment the counters?

 +
 +	mp_put(mp);
 +	return result;
 +}
 +
  static int mp_chr_close(struct inode *inode, struct file *file)
  {
 	struct mp_file *mfile = file->private_data;
 @@ -1084,6 +1127,8 @@ static int mp_chr_close(struct inode *inode, struct file *file)
  static const struct file_operations mp_fops = {
 	.owner  = THIS_MODULE,
 	.llseek = no_llseek,
 +	.write  = do_sync_write,
 +	.aio_write = mp_chr_aio_write,
 	.poll   = mp_chr_poll,
 	.unlocked_ioctl = mp_chr_ioctl,
 	.open   = mp_chr_open,
 -- 
 1.5.4.4


[GSoC 2010][RESEND] Shared memory transport between guest(s) and host

2010-04-07 Thread Mohammed Gamal
Hi,
I am interested in the Shared memory transport between guest(s) and
host project for GSoC 2010. The description of the project is pretty
straightforward, but I am a little bit lost on some parts:

1- Is there any documentation available on the KVM shared memory
transport? This'd definitely help in understanding how inter-VM shared
memory should work.

2- Does the project only aim at providing a shared memory transport
between a single host and a number of guests, with the host acting as
a central node containing shared memory objects and communication
taking place only between guests and host, or is there any kind of
guest-to-guest communication to be supported? If yes, how should it be
done?


[PATCH] KVM test: Add a subtest iofuzz

2010-04-07 Thread Jason Wang
The design of iofuzz is simple: it just generates random I/O port
activity inside the virtual machine. The correctness of the device
emulation may be verified through this test.

As the instructions are randomly generated, the guest may enter a wrong
state. The test solves this issue by detecting the hang and restarting the
virtual machine.

The test duration can also be adjusted through fuzz_count. And
the parameter skip_devices is used to specify the devices which
should not be used for fuzzing.

In the current version, every activity is logged and the command is
sent through a session between host and guest. This method may
slow down the whole test, but it works well. The enumeration is done
through /proc/ioports and the activity scenario is not aggressive.

Suggestions are welcomed.
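For reference, the guest-side primitive that the dd commands drive can also
be exercised directly from C: /dev/port is seekable, one byte per port, so a
read at offset P is an inb(P) and a write is an outb(). A minimal sketch
(port 0x80, the POST diagnostic port, is only an example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned char val = 0;
	int fd = open("/dev/port", O_RDWR);	/* needs root */

	if (fd < 0)
		return 1;
	pread(fd, &val, 1, 0x80);	/* inb(0x80) */
	printf("port 0x80 = 0x%02x\n", val);
	pwrite(fd, &val, 1, 0x80);	/* outb(val, 0x80) */
	close(fd);
	return 0;
}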

Signed-off-by: Jason Wang jasow...@redhat.com
---
 client/tests/kvm/tests/iofuzz.py   |   97 
 client/tests/kvm/tests_base.cfg.sample |2 +
 2 files changed, 99 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/iofuzz.py

diff --git a/client/tests/kvm/tests/iofuzz.py b/client/tests/kvm/tests/iofuzz.py
new file mode 100644
index 000..c2f22af
--- /dev/null
+++ b/client/tests/kvm/tests/iofuzz.py
@@ -0,0 +1,97 @@
+import logging, time, re, random
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+
+def run_iofuzz(test, params, env):
+    """
+    KVM iofuzz test:
+    1) Log into a guest
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm, 0,
+                                            float(params.get("boot_timeout",
+                                                             240)),
+                                            0, 2)
+
+    def outb(session, port, data):
+        logging.debug("outb(0x%x,0x%x)" % (port, data))
+        outb_cmd = "echo -e '\\%s' | dd of=/dev/port seek=%d bs=1 count=1" % \
+                   (oct(data), port)
+        s, o = session.get_command_status_output(outb_cmd)
+        if s != 0:
+            logging.debug("Non-zero value returned")
+
+    def inb(session, port):
+        logging.debug("inb(0x%x)" % port)
+        inb_cmd = "dd if=/dev/port skip=%d of=/dev/null bs=1 count=1" % port
+        s, o = session.get_command_status_output(inb_cmd)
+        if s != 0:
+            logging.debug("Non-zero value returned")
+
+    def fuzz(session, inst_list):
+        for (op, operand) in inst_list:
+            if op == "read":
+                inb(session, operand[0])
+            elif op == "write":
+                outb(session, operand[0], operand[1])
+            else:
+                raise error.TestError("Unknown command %s" % op)
+
+            if not session.is_responsive():
+                logging.debug("Session is not responsive")
+                if vm.process.is_alive():
+                    logging.debug("VM is alive, try to re-login")
+                    try:
+                        session = kvm_test_utils.wait_for_login(vm, 0, 10,
+                                                                0, 2)
+                    except:
+                        logging.debug("Could not re-login, reboot the guest")
+                        session = kvm_test_utils.reboot(vm, session,
+                                                        method="system_reset")
+                else:
+                    raise error.TestFail("VM has quit abnormally")
+    try:
+        ports = {}
+        ran = random.SystemRandom()
+
+        logging.info("Enumerate the devices through /proc/ioports")
+        ioports = session.get_command_output("cat /proc/ioports")
+        logging.debug(ioports)
+        devices = re.findall("(\w+)-(\w+) : (.*)", ioports)
+
+        skip_devices = params.get("skip_devices", "")
+        fuzz_count = int(params.get("fuzz_count", 10))
+
+        for (beg, end, name) in devices:
+            ports[(int(beg, base=16), int(end, base=16))] = name.strip()
+
+        for (beg, end) in ports.keys():
+
+            name = ports[(beg, end)]
+            if name in skip_devices:
+                logging.info("Skipping %s" % name)
+                continue
+
+            logging.info("Fuzzing %s at 0x%x-0x%x" % (name, beg, end))
+            inst = []
+
+            # Read all ports
+            for port in range(beg, end + 1):
+                inst.append(("read", [port]))
+
+            # Outb with zero
+            for port in range(beg, end + 1):
+                inst.append(("write", [port, 0]))
+
+            # Random fuzzing
+            for seq in range(fuzz_count * (end - beg + 1)):
+                inst.append(("write", [ran.randint(beg, end),
+                                       ran.randint(0, 255)]))
+
+            fuzz(session, inst)
+
+    finally:
+        session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 

Re: Setting nx bit in virtual CPU

2010-04-07 Thread Richard Simpson
On 07/04/10 06:39, Avi Kivity wrote:
 On 04/07/2010 01:31 AM, Richard Simpson wrote:

 2.6.27 should be plenty fine for nx.  Really the important bit is that
 the host kernel has nx enabled.  Can you check if that is so?

  
 Umm, could you give me a clue about how to do that?  It is some time
 since I configured the host kernel, but I do have a /proc/config.gz.
 Could I check by looking in that?

 
 The attached script should verify it.
 

rs% ./check-nx
Traceback (most recent call last):
  File "./check-nx", line 17, in <module>
    efer = msr().read(0xc080, 0)
  File "./check-nx", line 8, in __init__
    self.f = file('/dev/msr0')
IOError: [Errno 2] No such file or directory: '/dev/msr0'

Sorry!


Re: Setting nx bit in virtual CPU

2010-04-07 Thread Avi Kivity

On 04/07/2010 03:10 PM, Richard Simpson wrote:

On 07/04/10 06:39, Avi Kivity wrote:
   

On 04/07/2010 01:31 AM, Richard Simpson wrote:
 
   

2.6.27 should be plenty fine for nx.  Really the important bit is that
the host kernel has nx enabled.  Can you check if that is so?


 

Umm, could you give me a clue about how to do that.  It is some time
since I configured the host kernel, but I do have a /proc/config.gz.
Could I check by looking in that?

   

The attached script should verify it.

 

rs% ./check-nx
Traceback (most recent call last):
   File "./check-nx", line 17, in <module>
     efer = msr().read(0xc080, 0)
   File "./check-nx", line 8, in __init__
     self.f = file('/dev/msr0')
IOError: [Errno 2] No such file or directory: '/dev/msr0'

   


Run as root, please.  And check first that you have a file named 
/dev/cpu/0/msr.
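The same check can be done in C, assuming the msr driver is loaded
(modprobe msr): read the EFER MSR (0xc0000080) from /dev/cpu/0/msr and test
the NXE bit (bit 11).

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t efer;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);	/* needs root */

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* the msr driver uses the file offset as the MSR index */
	if (pread(fd, &efer, sizeof(efer), 0xc0000080) != sizeof(efer)) {
		perror("read EFER");
		return 1;
	}
	printf("nx %s\n", (efer & (1ULL << 11)) ? "enabled" : "disabled");
	close(fd);
	return 0;
}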


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



KVM freeze when using --serial

2010-04-07 Thread Thomas Kittel

Hi there,

I already posted this problem to #kvm on freenode.

Please set me in CC: when replying to this mail, as I am not subscribed 
to this mailing lists right now.


The Scenario is as follows:

I got 2 VM processes in userspace.

The first is started with the parameter --monitor pty.
   => This results in a file /dev/pts/x on the host,
   (crw--w 1 kittel tty 136, 3 2010-04-07 15:51 /dev/pts/3 on 
my system)


Another VM is then started with the parameter --serial /dev/pts/3
   => This results in /dev/ttyS0 inside the second VM.

Both VMs are running debian lenny. The host (debian) uses qemu-kvm 0.12.3.
startvms.sh start is used to start the VMs.

Running the executable build from test.c in the second VM results in a 
freeze of this VM.
(The test.c included uses /dev/ttyS1 as /dev/ttyS0 is the VMs serial 
console in my setup.)

The process uses 100% CPU and is stuck in kvm_mutex_lock().

Trying to use the built-in gdbserver didn't work because it also locked.

Is there a way to tunnel one VMs monitor console to another VM?

Thanks Thomas




#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdbool.h>

#include <signal.h>

#include <pthread.h>

void signal_handler(int signum){
	pthread_exit(NULL);
}

void *readFile(void * ptr){
	signal(SIGTERM, signal_handler);
	int fd;
	char buffer;
	fd = open("/dev/ttyS1", O_RDONLY);
	while(true){
		read(fd, &buffer, 1);
		printf("%c", buffer);
		fflush(stdout);
	}

	close(fd);
	pthread_exit(NULL);

}

int main(int argc, char** argv){
	pthread_t thread;

	pthread_create(&thread, NULL, readFile, NULL);

	sleep(10);

	pthread_kill(thread, SIGTERM);
	pthread_join(thread, NULL);

}
	




Shouldn't cache=none be the default for drives?

2010-04-07 Thread Troels Arvin

Hello,

I'm conducting some performance tests with KVM-virtualized CentOSes. One
thing I noticed is that guest I/O performance seems to be significantly
better for virtio-based block devices (drives) if the cache=none
argument is used. (This was with a rather powerful storage system
backend which is hard to saturate.)


So: Why isn't cache=none the default for drives?

--
Troels


Re: KVM freeze when using --serial

2010-04-07 Thread Thomas Kittel

Hi again,

I just tried to use unix domain sockets.

So I used the parameter --monitor unix:monitor:server:nowait on the first VM
and the parameter --serial unix:monitor on the second VM,

And again the second VM freezes when running my test application.

cya Tom

Thomas Kittel wrote:

Hi there,

I already posted this problem to #kvm on freenode.

Please set me in CC: when replying to this mail, as I am not 
subscribed to this mailing lists right now.


The Scenario is as follows:

I got 2 VM processes in userspace.

The first is started with the parameter --monitor pty.
   => This results in a file /dev/pts/x on the host,
   (crw--w 1 kittel tty 136, 3 2010-04-07 15:51 /dev/pts/3 on 
my system)


Another VM is then started with the parameter --serial /dev/pts/3
   => This results in /dev/ttyS0 inside the second VM.

Both VMs are running debian lenny. The host (debian) uses qemu-kvm 
0.12.3.

startvms.sh start is used to start the VMs.

Running the executable build from test.c in the second VM results in a 
freeze of this VM.
(The test.c included uses /dev/ttyS1 as /dev/ttyS0 is the VMs serial 
console in my setup.)

The process uses 100% CPU and is stuck in kvm_mutex_lock().

Trying to use the built-in gdbserver didn't work because it also locked.

Is there a way to tunnel one VMs monitor console to another VM?

Thanks Thomas








Re: Shouldn't cache=none be the default for drives?

2010-04-07 Thread Gordan Bobic

Troels Arvin wrote:

Hello,

I'm conducting some performance tests with KVM-virtualized CentOSes. One
thing I noticed is that guest I/O performance seems to be significantly 
better for virtio-based block devices (drives) if the cache=none 
argument is used. (This was with a rather powerful storage system 
backend which is hard to saturate.)


So: Why isn't cache=none the default for drives?


Is that the right question? Or is the right question "Why is cache=none
faster?"


What did you use for measuring the performance? I have found in the past 
that virtio block device was slower than IDE block device emulation.


Gordan


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Gleb Natapov
On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote:
 2010/4/6 Gleb Natapov g...@redhat.com:
  On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
  Hi.
 
  When handle_io() is called, rip is currently advanced *before* the I/O is
  actually handled by qemu in userland.  While implementing Kemari for
  KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly
  in userland qemu, we encountered a problem: synchronizing the contents of
  the VCPU before handling I/O in qemu is too late, because rip has already
  been advanced in KVM.  We avoided this issue with a temporary hack, but I
  would like to ask a few questions on skip_emulated_instructions.
 
  1. Does rip need to be advanced before the I/O is handled by qemu?
  In current kvm.git rip is advanced before I/O is handled by qemu only
  in case of the out instruction. From an architecture point of view I think
  it's OK, since on real HW you can't guarantee that I/O will take effect
  before the instruction pointer is advanced. It is done like that because we
  want out emulation to be real fast so we skip the x86 emulator.
 
 Thanks for your reply.
 
 If advancing rip later doesn't break device behavior or
 introduce a slowdown, I would like that to be done.
 
Devices could not care less about what value the rip register currently has.
Why does it matter for your code?

  2. If no, is it possible to divide skip_emulated_instructions(), like
  rec_emulated_instructions() to remember the next_rip, and
  skip_emulated_instructions() to actually advance the rip?
  Currently only the emulator can call userspace to do I/O, so after
  userspace returns from an I/O exit, control is handed back to the emulator
  unconditionally.  The out instruction skips the emulator, but there is nothing
  to do after userspace returns, so the regular cpu loop is executed. If we
  want to advance rip only after userspace has executed the I/O done by out, we
  need to distinguish who requested the I/O (emulator or kvm_fast_pio_out())
  and call different code depending on who that was. It can be done by
  having a callback that (if not null) is called on return from userspace.
 
 Your suggestion is to introduce a callback entry and, instead of
 calling kvm_rip_write(), set the callback before calling
 kvm_fast_pio_out(), and check the entry upon return from userspace,
 correct?
 
Something like that, yes.

 According to the comment in x86.c, when it was an out instruction,
 vcpu->arch.pio.count is set to 0 to skip the emulator.
 To call kvm_fast_pio_out(), !string and !in must be set.
 If we can check vcpu->arch.pio.count, string and in on return
 from userspace, can't we distinguish who requested the I/O, the emulator
 or kvm_fast_pio_out()?
 
Maybe, but the callback approach is much cleaner. string and in can have
stale data, for instance.

  3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
  modified to continue without emulating the nop when next_rip is 0?
 
  I don't see where a nop is emulated if next_rip is 0. As far as I see, in
  case of next_rip==0 the instruction at rip is decoded to figure out its
  length and then rip is advanced by the instruction length. Anyway, next_rip
  is an svm-only thing.
 
 Sorry.  I wasn't understanding the code enough.
 
 static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 ...
 	if (!svm->next_rip) {
 		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
 				EMULATE_DONE)
 			printk(KERN_DEBUG "%s: NOP\n", __func__);
 		return;
 	}
 
 Since the printk says NOP, I thought emulate_instruction was doing so...
 
 The reason I asked about next_rip is because I was hoping to use this
 entry to advance rip only after userspace has executed the I/O done by out:
 if next_rip is !0, call kvm_rip_write(), and introduce next_rip to vmx if
 it is usable, because vmx is currently using a local variable rip.
 
 Yoshi

--
Gleb.


Re: [PATCH 1/2] KVM MMU: remove unused field

2010-04-07 Thread Marcelo Tosatti
On Tue, Apr 06, 2010 at 06:29:05PM +0800, Xiao Guangrong wrote:
 kvm_mmu_page.oos_link is not used, so remove it
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/include/asm/kvm_host.h |2 --
  arch/x86/kvm/mmu.c  |1 -
  2 files changed, 0 insertions(+), 3 deletions(-)

Applied both, thanks.



Re: [PATCH] [PPC] Add dequeue for external on BookE

2010-04-07 Thread Marcelo Tosatti
On Wed, Apr 07, 2010 at 10:03:25AM +0200, Alexander Graf wrote:
 Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the
 IRQ line from userspace. This added a new core specific callback that I
 apparently forgot to add for BookE.
 
 So let's add the callback for BookE as well, making it build again.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 ---
  arch/powerpc/kvm/booke.c |6 ++
  1 files changed, 6 insertions(+), 0 deletions(-)

Applied, thanks.



Re: [PATCH] [PPC] Add dequeue for external on BookE

2010-04-07 Thread Stephen Rothwell
On Wed, 7 Apr 2010 12:58:34 -0300 Marcelo Tosatti mtosa...@redhat.com wrote:

 On Wed, Apr 07, 2010 at 10:03:25AM +0200, Alexander Graf wrote:
  Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the
  IRQ line from userspace. This added a new core specific callback that I
  apparently forgot to add for BookE.
  
  So let's add the callback for BookE as well, making it build again.
  
  Signed-off-by: Alexander Graf ag...@suse.de
  ---
   arch/powerpc/kvm/booke.c |6 ++
   1 files changed, 6 insertions(+), 0 deletions(-)
 
 Applied, thanks.

Thanks, guys.

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/




Re: [GSoC 2010][RESEND] Shared memory transport between guest(s) and host

2010-04-07 Thread Cam Macdonell
On Wed, Apr 7, 2010 at 5:30 AM, Mohammed Gamal m.gamal...@gmail.com wrote:
 Hi,
 I am interested in the Shared memory transport between guest(s) and
 host project for GSoC 2010. The description of the project is pretty
 straightforward, but I am a little bit lost on some parts:

 1- Is there any documentation available on KVM shared memory
 transport. This'd definitely help understand how inter-vm shared
 memory should work.

Hi Mohammed,

A shared memory transport would be a new addition to the code base so
there isn't anything yet in KVM.  That said, I'm working on my patch
and it will hopefully be accepted soon.  Frankly, while I suggested
this project, I'm not sure there's enough work remaining for the full
Summer.


 2- Does the project only aim at providing a shared memory transport
 between a single host and a number of guests, with the host acting as
 a central node containing shared memory objects and communication
 taking place only between guests and host, or is there any kind of
 guest-guest communications to be supported? If yes, how should it be
 done?

My patch currently supports guest-to-guest communication and
guest-to-host.  I'll be sending out a new version shortly.  You can
see if there's something you might like to add to it whether it's part
of GSoC or not.

Cheers,
Cam


Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura
2010/4/8 Gleb Natapov g...@redhat.com:
 On Wed, Apr 07, 2010 at 03:25:10PM +0900, Yoshiaki Tamura wrote:
 2010/4/6 Gleb Natapov g...@redhat.com:
  On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
  Hi.
 
  When handle_io() is called, rip is currently advanced *before* the I/O is
  actually handled by qemu in userland.  While implementing Kemari for
  KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly
  in userland qemu, we encountered a problem: synchronizing the contents of
  the VCPU before handling I/O in qemu is too late, because rip has already
  been advanced in KVM.  We avoided this issue with a temporary hack, but I
  would like to ask a few questions on skip_emulated_instructions.
 
  1. Does rip need to be advanced before the I/O is handled by qemu?
  In current kvm.git rip is advanced before I/O is handled by qemu only
  in case of the out instruction. From an architecture point of view I think
  it's OK, since on real HW you can't guarantee that I/O will take effect
  before the instruction pointer is advanced. It is done like that because we
  want out emulation to be real fast so we skip the x86 emulator.

 Thanks for your reply.

 If advancing rip later doesn't break device behavior or
 introduce a slowdown, I would like that to be done.

 Devices could not care less about what value the rip register currently has.
 Why does it matter for your code?

My code, Kemari, is a mechanism to synchronize VMs to achieve fault tolerance.
It transfers the whole VM state upon events such as disk or network output,
so that the secondary server can keep running upon hardware failure.
Please think of it as continuous live migration.
I've implemented this feature in userland qemu, which calls the live migration
function when it detects any output from the device emulators.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The problem here is that I needed to transfer the VM state which is
just *before* the output to the devices.  Otherwise, the VM state has
already advanced, and after failover, some I/O didn't work as I expected.
I tracked down this issue and figured out that rip was already advanced in
KVM, so transferring this VCPU state was meaningless.

I'm planning to post the patch set of Kemari soon, but I would like to solve
this rip issue before that.  If there is no drawback, I'm happy to work on it
and post a patch.

  2. If no, is it possible to divide skip_emulated_instructions(), like
  rec_emulated_instructions() to remember the next_rip, and
  skip_emulated_instructions() to actually advance the rip?
  Currently only the emulator can call userspace to do I/O, so after
  userspace returns from an I/O exit, control is handed back to the emulator
  unconditionally.  The out instruction skips the emulator, but there is nothing
  to do after userspace returns, so the regular cpu loop is executed. If we
  want to advance rip only after userspace has executed the I/O done by out, we
  need to distinguish who requested the I/O (emulator or kvm_fast_pio_out())
  and call different code depending on who that was. It can be done by
  having a callback that (if not null) is called on return from userspace.

 Your suggestion is to introduce a callback entry and, instead of
 calling kvm_rip_write(), set the callback before calling
 kvm_fast_pio_out(), and check the entry upon return from userspace,
 correct?

 Something like that, yes.

OK.  Let me work on that.

 According to the comment in x86.c, when it was an out instruction,
 vcpu->arch.pio.count is set to 0 to skip the emulator.
 To call kvm_fast_pio_out(), !string and !in must be set.
 If we can check vcpu->arch.pio.count, string and in on return
 from userspace, can't we distinguish who requested the I/O, the emulator
 or kvm_fast_pio_out()?

 Maybe, but the callback approach is much cleaner. string and in can have
 stale data, for instance.

I see.  I was thinking that could be a trade-off against introducing a
new variable.
I'll take the callback approach first, and think again later if necessary.


  3. svm has next_rip, but when it is 0, a nop is emulated.  Can this be
  modified to continue without emulating the nop when next_rip is 0?
 
  I don't see where a nop is emulated if next_rip is 0. As far as I see, in
  case of next_rip==0 the instruction at rip is decoded to figure out its
  length and then rip is advanced by the instruction length. Anyway, next_rip
  is an svm-only thing.

 Sorry.  I wasn't understanding the code enough.

 static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 ...
 	if (!svm->next_rip) {
 		if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
 				EMULATE_DONE)
 			printk(KERN_DEBUG "%s: NOP\n", __func__);
 		return;
 	}

 Since the printk says NOP, I thought emulate_instruction was doing so...

 The reason I asked about next_rip is because I was hoping to use this
 entry to advance rip only after userspace executed I/O done by out,
 like 

Re: Question on skip_emulated_instructions()

2010-04-07 Thread Avi Kivity

On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:


The problem here is that I needed to transfer the VM state which is
just *before* the output to the devices.  Otherwise, the VM state has
already advanced, and after failover, some I/O didn't work as I expected.
I tracked down this issue and figured out that rip was already advanced in
KVM, so transferring this VCPU state was meaningless.

I'm planning to post the patch set of Kemari soon, but I would like to solve
this rip issue before that.  If there is no drawback, I'm happy to work on it
and post a patch.
   


vcpu state is undefined when an mmio operation is pending, 
Documentation/kvm/api.txt says the following:



NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
operations are complete (and guest state is consistent) only after 
userspace

has re-entered the kernel with KVM_RUN.  The kernel side will first finish
incomplete operations and then check for pending signals.  Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.


Currently we complete instructions for output operations and leave them 
incomplete for input operations.  Deferring completion for output 
operations should work, except it may break the vmware backdoor port 
(see hw/vmport.c), which changes register state following an output 
instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state 
following a write instruction.


Do you really need to transfer the vcpu state before the instruction, or 
do you just need a consistent state?  If the latter, then you can get 
away by posting a signal and re-entering the guest.  kvm will complete 
the instruction and exit immediately, and you will have fully consistent 
state.
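
As a hedged userspace sketch of that trick (SIG_IPI stands for whatever
signal the vcpu thread already uses; error handling omitted):

#include <linux/kvm.h>
#include <pthread.h>
#include <signal.h>
#include <sys/ioctl.h>

static void complete_pending_op(int vcpu_fd)
{
	/* with a signal pending, KVM_RUN finishes the incomplete
	 * instruction and exits immediately */
	pthread_kill(pthread_self(), SIG_IPI);
	ioctl(vcpu_fd, KVM_RUN, 0);
	/* KVM_GET_REGS/KVM_GET_SREGS now observe consistent state */
}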


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
 
 Thanks!
 There's some whitespace damage, are you sending with your new
 sendmail setup? It seems to have worked for qemu patches ...

Yes, I saw some line wraps in what I received, but I checked
the original draft to be sure and they weren't there. Possibly from
the relay; Sigh.


  @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
  -			vhost_discard_vq_desc(vq);
  -			tx_poll_start(net, sock);
  +			if (err == -EAGAIN) {
  +				vhost_discard_desc(vq, 1);
  +				tx_poll_start(net, sock);
  +			} else {
  +				vq_err(vq, "sendmsg: errno %d\n", -err);
  +				/* drop packet; do not discard/resend */
  +				vhost_add_used_and_signal(&net->dev, vq, head,
  +							  0);
 
 vhost does not currently have a consistent error handling strategy:
 if we drop packets, need to think which other errors should cause
 packet drops.  I prefer to just call vq_err for now,
 and have us look at handling segfaults etc in a consistent way
 separately.

I had to add this to avoid an infinite loop when I wrote a bad
packet on the socket. I agree error handling needs a better look,
but retrying a bad packet continuously while dumping in the log
is what it was doing when I hit an error before this code. Isn't
this better, at least until a second look?


  +}
  +
 
 I wonder whether it makes sense to check
 skb_queue_empty(&sk->sk_receive_queue)
 outside the lock, to reduce the cost of this call
 on an empty queue (we know that it happens at least once
 each time we exit the loop on rx)?

I was looking at alternatives to adding the lock in the
first place, but I found I couldn't measure a difference in the
cost with and without the lock.


  +	int retries = 0;
  	size_t len, total_len = 0;
  -	int err;
  +	int err, headcount, datalen;
  	size_t hdr_size;
  	struct socket *sock = rcu_dereference(vq->private_data);
  	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
  @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
  	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
  		vq->log : NULL;
  
  -	for (;;) {
  -		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  -					 ARRAY_SIZE(vq->iov),
  -					 &out, &in,
  -					 vq_log, &log);
  +	while ((datalen = vhost_head_len(sock->sk))) {
  +		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
  +					     vq_log, &log);
 
 This looks like a bug, I think we need to pass
 datalen + header size to vhost_get_desc_n.
 Not sure how we know the header size that backend will use though.
 Maybe just look at our features.

Yes; we have hdr_size, so I can add it here. It'll be 0 for
the cases where the backend and guest both have vnet header (either
the regular or larger mergeable buffers one), but should be added
in for future raw socket support.

 
  		/* OK, now we need to know about added descriptors. */
  -		if (head == vq->num) {
  -			if (unlikely(vhost_enable_notify(vq))) {
  +		if (!headcount) {
  +			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
  				/* They have slipped one in as we were
  				 * doing that: check again. */
  				vhost_disable_notify(vq);
  +				retries++;
  				continue;
  			}
 
 Hmm. The reason we have the code at all, as the comment says, is because
 guest could have added more buffers between the time we read last index
 and the time we enabled notification. So if we just break like this
 the race still exists. We could remember the
 last head value we observed, and have vhost_enable_notify check
 against this value?

This is to prevent a spin loop in the case where we have some
buffers available, but not enough for the current packet (ie, this
is the replacement code for the rxmaxheadcount business). If they
actually added something new, retrying once should see it, but what
vhost_enable_notify() returns non-zero on is not new buffers but
rather not empty. In the case mentioned, we aren't empty, so
vhost_enable_notify() returns nonzero every time, but the guest hasn't
given us enough buffers to proceed, so we continuously retry; this
code breaks the spin loop until we've really got new buffers from
the guest.

 
 Need to think about it.
 
 Another concern here is that on retries vhost_get_desc_n
 is doing extra work, rescanning the same descriptor
 again and again. Not sure how common this is, might be
 worthwhile to add a TODO to consider this at least.

I had a printk in there to test the code and with the
retries counter, it happens when we fill the ring (once,
because of the retries checks), and then proceeds as
desired when the guest gives us more buffers. Without the
check, it spews until we 

[GIT PULL] vhost-net fix for 2.6.34-rc3

2010-04-07 Thread Michael S. Tsirkin
David,
The following tree includes a patch fixing an issue with vhost-net in
2.6.34-rc3.  Please pull for 2.6.34.
 
Thanks!

The following changes since commit 2eaa9cfdf33b8d7fb7aff27792192e0019ae8fc6:

  Linux 2.6.34-rc3 (2010-03-30 09:24:39 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost

Jeff Dike (1):
  vhost-net: fix vq_memory_access_ok error checking

 drivers/vhost/vhost.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)


Re: Some Code for Performance Profiling

2010-04-07 Thread Jiaqing Du
2010/4/5 Avi Kivity a...@redhat.com:
 On 03/31/2010 07:53 PM, Jiaqing Du wrote:

 Hi,

 We have some code about performance profiling in KVM. They are outputs
 of a school project. Previous discussions in KVM, Perfmon2, and Xen
 mailing lists helped us a lot. The code is NOT in good shape and
 is only meant to demonstrate the feasibility of doing performance
 profiling in KVM. Feel free to use it if you want.


 Performance monitoring is an important feature for kvm.  Is there any chance
 you can work at getting it into good shape?

I have been following the discussions about PMU virtualization in the
list for a while. Exporting a proper interface, i.e., guest visible
MSRs and supported events, to the guest across a large number of physical
CPUs from different vendors, families, and models is the major
problem. KVM also currently supports almost a dozen different
types of virtual CPUs. I will think about it and try to come up with
something more general.


 We categorize performance profiling in a virtualized environment into
 two types: *guest-wide profiling* and *system-wide profiling*. For
 guest-wide profiling, only the guest is profiled. KVM virtualizes the
 PMU and the user runs a profiler directly in the guest. It requires no
 modifications to the guest OS and the profiler running in the guest.
 For system-wide profiling, both KVM and the guest OS are profiled. The
 results are similar to what XenOprof outputs. In this case, one
 profiler running in the host and one profiler running in the guest.
 Still it requires no modifications to the guest and the profiler
 running in it.


 Can your implementation support both simultaneously?

What do you mean simultaneously? With my implementation, you either
do guest-wide profiling or system-wide profiling. They are achieved
through different patches. Actually, the result of guest-wide
profiling is a subset of system-wide profiling.


 For guest-wide profiling, there are two possible places to save and
 restore the related MSRs. One is where the CPU switches between guest
 mode and host mode. We call this *CPU-switch*. Profiling with this
 enabled reflects how the guest behaves on the physical CPU, plus other
 virtualized, not emulated, devices. The other place is where the CPU
 switches between the KVM context and others. Here KVM context means
 the CPU is executing guest code or KVM code, both kernel space and
 user space. We call this *domain-switch*. Profiling with this enabled
 discloses how the guest behaves on both the physical CPU and KVM.
 (Some emulated operations are really expensive in a virtualized
 environment.)


 Which method do you use?  Or do you support both?

I post two patches in my previous email. One is for CPU-switch, and
the other is for domain-switch.
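
As a rough sketch of the CPU-switch flavour, the idea is to swap a few
counter MSRs around guest entry; the hook point and the MSR list below
are assumptions for illustration only:

static const u32 pmu_msrs[] = { MSR_P6_EVNTSEL0, MSR_P6_PERFCTR0 };
static u64 host_pmu[ARRAY_SIZE(pmu_msrs)];
static u64 guest_pmu[ARRAY_SIZE(pmu_msrs)];

static void pmu_switch_to_guest(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(pmu_msrs); i++) {
		rdmsrl(pmu_msrs[i], host_pmu[i]);   /* save host counters */
		wrmsrl(pmu_msrs[i], guest_pmu[i]);  /* load guest counters */
	}
}

Domain-switch would instead do the swap where the host scheduler
switches away from the vcpu thread, so KVM's own kernel and user space
work is counted as well.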


 Note disclosing host pmu data to the guest is sometimes a security issue.


For instance?

 --
 Do not meddle in the internals of kernels, for they are subtle and quick to
 panic.




Re: Some Code for Performance Profiling

2010-04-07 Thread Avi Kivity

On 04/07/2010 10:23 PM, Jiaqing Du wrote:



Can your implementation support both simultaneously?
 

What do you mean simultaneously? With my implementation, you either
do guest-wide profiling or system-wide profiling. They are achieved
through different patches. Actually, the result of guest-wide
profiling is a subset of system-wide profiling.

   


A guest admin monitors the performance of their guest via a vpmu.  
Meanwhile the host admin monitors the performance of the host (including 
all guests) using the host pmu.  Given that the host pmu and the vpmu 
may select different counters, it is difficult to support both 
simultaneously.



For guest-wide profiling, there are two possible places to save and
restore the related MSRs. One is where the CPU switches between guest
mode and host mode. We call this *CPU-switch*. Profiling with this
enabled reflects how the guest behaves on the physical CPU, plus other
virtualized, not emulated, devices. The other place is where the CPU
switches between the KVM context and others. Here KVM context means
the CPU is executing guest code or KVM code, both kernel space and
user space. We call this *domain-switch*. Profiling with this enabled
discloses how the guest behaves on both the physical CPU and KVM.
(Some emulated operations are really expensive in a virtualized
environment.)

   

Which method do you use?  Or do you support both?
 

I post two patches in my previous email. One is for CPU-switch, and
the other is for domain-switch.

   


I see.  I'm not sure I know which one is better!


Note disclosing host pmu data to the guest is sometimes a security issue.

 

For instance?
   


The standard example is hyperthreading where the memory bus unit is 
shared among two logical processors.  A guest sampling a vcpu on one 
thread can gain information about what is happening on the other - the 
number of bus transactions the other thread has issued.  This can be 
used to establish a communication channel between two guests that 
shouldn't be communicating, or to eavesdrop on another guest.  A similar 
problem happens with multicores.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



VMX and save/restore guest in virtual-8086 mode

2010-04-07 Thread Marcelo Tosatti

During initialization, WinXP.32 switches to virtual-8086 mode, with
paging enabled, to use VGABIOS functions.

Since enter_pmode unconditionally clears IOPL and VM bits in RFLAGS

flags = vmcs_readl(GUEST_RFLAGS);
flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
vmcs_writel(GUEST_RFLAGS, flags);

And the order of loading state is set_regs (rflags) followed by
set_sregs (cr0), these bits are lost across save/restore:

savevm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
system_reset
loadvm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
cont
kvm: unhandled exit 8021
kvm_run returned -22

The following patch fixes it, but it has some drawbacks:

- cpu_synchronize_state+writeback is noticeably slow with tpr patching,
  this makes it slower.
- Should be conditional on VMX && !unrestricted guest.
- It's a fugly workaround.

Any better ideas?

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 748ff69..9821653 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -956,6 +956,7 @@ void kvm_arch_load_regs(CPUState *env, int level)
 sregs.efer = env->efer;
 
 kvm_set_sregs(env, sregs);
+kvm_set_regs(env, regs);
 
 /* msrs */
 n = 0;






Re: Setting nx bit in virtual CPU

2010-04-07 Thread Richard Simpson
On 07/04/10 13:23, Avi Kivity wrote:
 On 04/07/2010 03:10 PM, Richard Simpson wrote:
 On 07/04/10 06:39, Avi Kivity wrote:
   
 On 04/07/2010 01:31 AM, Richard Simpson wrote:
 
   
 2.6.27 should be plenty fine for nx.  Really the important bit is that
 the host kernel has nx enabled.  Can you check if that is so?
 The attached script should verify it.
 IOError: [Errno 2] No such file or directory: '/dev/msr0'
 
 Run as root, please.  And check first that you have a file named
 /dev/cpu/0/msr.

Doh!

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine.  Any other ideas?  I am beginning to
get that horrible feeling that there isn't a real problem and it is just
me being dumb!


Re: VMX and save/restore guest in virtual-8086 mode

2010-04-07 Thread Avi Kivity

On 04/07/2010 11:24 PM, Marcelo Tosatti wrote:

During initialization, WinXP.32 switches to virtual-8086 mode, with
paging enabled, to use VGABIOS functions.

Since enter_pmode unconditionally clears IOPL and VM bits in RFLAGS

 flags = vmcs_readl(GUEST_RFLAGS);
 flags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_VM);
 flags |= (vmx->rmode.save_iopl << IOPL_SHIFT);
 vmcs_writel(GUEST_RFLAGS, flags);

   



Looks like KVM_SET_REGS should write rmode.save_iopl (and a new save_vm)?

I think we have a small related bug in realmode emulation - we run the 
guest with iopl=3.  This means the guest can use pushfl and see the host 
iopl instead of the guest iopl.  We should run with iopl=0, which causes 
pushfl/popfl to #GP, where we can emulate the flags correctly (by 
updating rmode.save_iopl and rmode.save_vm).  That has lots of 
implications however...
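
A hedged sketch of the rflags side of that suggestion (save_vm is the
proposed new field; treat the rmode field names as assumptions):

static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (vmx->rmode.active) {
		/* latch the guest-visible bits before forcing vm86 */
		vmx->rmode.save_iopl =
			(rflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT;
		vmx->rmode.save_vm = !!(rflags & X86_EFLAGS_VM);
		rflags |= X86_EFLAGS_IOPL | X86_EFLAGS_VM;
	}
	vmcs_writel(GUEST_RFLAGS, rflags);
}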




And the order of loading state is set_regs (rflags) followed by
set_sregs (cr0), these bits are lost across save/restore:

savevm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=33286
system_reset
loadvm 1
kvm_arch_save_regs EIP=7a04 cr0=8001003b eflags=10286
cont
kvm: unhandled exit 8021
kvm_run returned -22

The following patch fixes it, but it has some drawbacks:

- cpu_synchronize_state+writeback is noticeably slow with tpr patching,
   this makes it slower.
   


Isn't it a very rare event?


- Should be conditional on VMX && !unrestricted guest.
   


Userspace should know nothing of this mess.


- Its a fugly workaround.
   


True.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Setting nx bit in virtual CPU

2010-04-07 Thread Avi Kivity

On 04/07/2010 11:38 PM, Richard Simpson wrote:

On 07/04/10 13:23, Avi Kivity wrote:
   

On 04/07/2010 03:10 PM, Richard Simpson wrote:
 

On 07/04/10 06:39, Avi Kivity wrote:

   

On 04/07/2010 01:31 AM, Richard Simpson wrote:

 


   

2.6.27 should be plenty fine for nx.  Really the important bit is that
the host kernel has nx enabled.  Can you check if that is so?
 

The attached script should verify it.
 

IOError: [Errno 2] No such file or directory: '/dev/msr0'
   

Run as root, please.  And check first that you have a file named
/dev/cpu/0/msr.
 

Doh!

gordon Code # ./check-nx
nx: enabled
gordon Code #

OK, seems to be enabled just fine.  Any other ideas?  I am beginning to
get that horrible feeling that there isn't a real problem and it is just
me being dumb!
   


I really hope so, because I am out of ideas... :)

Can you verify check-nx returns disabled on the guest?
Does /proc/cpuinfo show nx in the guest?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-07 Thread David Stevens
kvm-ow...@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:

 On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:
   
   Thanks!
   There's some whitespace damage, are you sending with your new
   sendmail setup? It seems to have worked for qemu patches ...
  
  Yes, I saw some line wraps in what I received, but I checked
  the original draft to be sure and they weren't there. Possibly from
  the relay; Sigh.
  
  
@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
-			tx_poll_start(net, sock);
+			if (err == -EAGAIN) {
+				vhost_discard_desc(vq, 1);
+				tx_poll_start(net, sock);
+			} else {
+				vq_err(vq, "sendmsg: errno %d\n", -err);
+				/* drop packet; do not discard/resend */
+				vhost_add_used_and_signal(&net->dev, vq, head,
+							  0);
   
   vhost does not currently have a consistent error handling strategy:
   if we drop packets, need to think which other errors should cause
   packet drops.  I prefer to just call vq_err for now,
   and have us look at handling segfaults etc in a consistent way
   separately.
  
  I had to add this to avoid an infinite loop when I wrote a bad
  packet on the socket. I agree error handling needs a better look,
  but retrying a bad packet continuously while dumping in the log
  is what it was doing when I hit an error before this code. Isn't
  this better, at least until a second look?
  
 
 Hmm, what do you mean 'continuously'? Don't we only try again
 on next kick?

If the packet is corrupt (in my case, a missing vnet header
during testing), every send will fail and we never make progress.
I had thousands of error messages in the log (for the same packet)
before I added this code. If the problem is with the packet,
retrying the same one as the original code does will never recover.
This isn't required for mergeable rx buffer support, so I
can certainly remove it from this patch, but I think the original
error handling doesn't handle a single corrupted packet very
gracefully.


@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;

-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+					     vq_log, &log);
   
   This looks like a bug, I think we need to pass
   datalen + header size to vhost_get_desc_n.
   Not sure how we know the header size that backend will use though.
   Maybe just look at our features.
  
  Yes; we have hdr_size, so I can add it here. It'll be 0 for
  the cases where the backend and guest both have vnet header (either
  the regular or larger mergeable buffers one), but should be added
  in for future raw socket support.
 
 So hdr_size is the wrong thing to add then.
 We need to add non-zero value for tap now.

datalen includes the vnet_hdr in the tap case, so we don't need
a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
and that length is what we're getting from vhost_head_len().

 
   
 		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(vq))) {
+		if (!headcount) {
+			if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
 				vhost_disable_notify(vq);
+				retries++;
 				continue;
 			}
   
   Hmm. The reason we have the code at all, as the comment says, is 
because
   guest could have added more buffers between the time we read last 
index
   and the time we enabled notification. So if we just break like this
   the race still exists. We could remember the
   last head value we observed, and have vhost_enable_notify check
   against this value?
  
  This is to prevent a spin loop in the case where we have some
  buffers available, but not enough for the current packet (ie, this
  is the replacement code for the rxmaxheadcount business). If they
  actually added something new, retrying once should see it, but what
  vhost_enable_notify() returns non-zero on is not new buffers but
  rather not empty. In the case mentioned, we aren't empty, so
  vhost_enable_notify() returns nonzero every time, but the guest hasn't
  given us enough buffers to proceed, so we continuously retry; this
  code breaks the 

Re: [PATCH] KVM: VMX: Disable unrestricted guest when EPT disabled

2010-04-07 Thread Greg KH
On Thu, Mar 18, 2010 at 02:11:19PM +0800, Sheng Yang wrote:
 Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest
 supported processors.
 
 Signed-off-by: Sheng Yang sh...@linux.intel.com

Now included through a different submission.

thanks,

greg k-h


Using serial with Windows XP - need help

2010-04-07 Thread Iztok Kobal

Using:
kvm-0.12.3
libvirt-0.7.7
kernel-2.6.31

I am trying to persuade my Win XP guest to use the serial port which I have 
configured using virt-manager. But whatever I try, no COM ports appear in 
the guest. I have tried with socket, tcp, pipe, etc. with just the same 
result - no COM within the guest. I have obviously missed something. I 
need help setting things up so that a COM port appears and works within the 
guest machine.


guest's XML part for serial looks like:

<serial type='tcp'>
  <source mode='connect' host='127.0.0.1' service='4555'/>
  <protocol type='raw'/>
  <target port='0'/>
</serial>
<console type='tcp'>
  <source mode='connect' host='127.0.0.1' service='4555'/>
  <protocol type='raw'/>
  <target port='0'/>
</console>


Here is the latest command line to run my guest:

/usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 512 -smp 
1,sockets=1,cores=1,threads=1 -name windowsxp -uuid 
1f012431-f345-48f4-9ae5-b10787e24de7 -nodefaults -chardev 
socket,id=monitor,path=/var/lib/libvirt/qemu/windowsxp.monitor,server,nowait 
-mon chardev=monitor,mode=readline -rtc base=localtime -boot c -drive 
file=/var/lib/kvm/images/windowsxp/disk0,if=none,id=drive-ide0-0-0,boot=on 
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 
-drive if=none,media=cdrom,id=drive-ide0-1-0 -device 
ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive 
file=/var/lib/kvm/images/windowsxp/disk1,if=none,id=drive-ide0-0-1 
-device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 
-device rtl8139,vlan=0,id=net0,mac=52:54:00:6b:9d:6e,bus=pci.0,addr=0x4 
-net tap,fd=20,vlan=0,name=hostnet0 -device 
rtl8139,vlan=1,id=net1,mac=52:54:00:3e:2a:f8,bus=pci.0,addr=0x5 -net 
tap,fd=21,vlan=1,name=hostnet1 -device 
rtl8139,vlan=2,id=net2,mac=52:54:00:ce:35:fe,bus=pci.0,addr=0x6 -net 
tap,fd=22,vlan=2,name=hostnet2 -chardev 
socket,id=serial0,host=127.0.0.1,port=4555,server,nowait -device 
isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3


Iztok


[PATCH v4 0/3] PCI Shared memory device

2010-04-07 Thread Cam Macdonell
Latest patch for PCI shared memory device that maps a host shared memory object
to be shared between guests

new in this series
- moved to single Doorbell register and use datamatch to trigger different
  VMs rather than one register per eventfd
- remove writing arbitrary values to eventfds.  Only values of 1 are now
  written to ensure correct usage

Cam Macdonell (3):
  Device specification for shared memory PCI device
  Support adding a file to qemu's ram allocation
  Inter-VM shared memory PCI device

 Makefile.target|3 +
 cpu-common.h   |1 +
 docs/specs/ivshmem_device_spec.txt |   85 +
 exec.c |   33 ++
 hw/ivshmem.c   |  700 
 qemu-char.c|6 +
 qemu-char.h|3 +
 7 files changed, 831 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/ivshmem_device_spec.txt
 create mode 100644 hw/ivshmem.c



[PATCH v4 1/3] Device specification for shared memory PCI device

2010-04-07 Thread Cam Macdonell
---
 docs/specs/ivshmem_device_spec.txt |   85 
 1 files changed, 85 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/ivshmem_device_spec.txt

diff --git a/docs/specs/ivshmem_device_spec.txt 
b/docs/specs/ivshmem_device_spec.txt
new file mode 100644
index 000..9895782
--- /dev/null
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -0,0 +1,85 @@
+
+Device Specification for Inter-VM shared memory device
+--
+
+The Inter-VM shared memory device is designed to share a region of memory to
+userspace in multiple virtual guests.  The memory region does not belong to any
+guest, but is a POSIX memory object on the host.  Optionally, the device may
+support sending interrupts to other guests sharing the same memory region.
+
+The Inter-VM PCI device
+---
+
+BARs
+
+The device supports three BARs.  BAR0 is a 1 Kbyte MMIO region to support
+registers.  BAR1 is used for MSI-X when it is enabled in the device.  BAR2 is
+used to map the shared memory object from the host.  The size of BAR2 is
+specified when the guest is started and must be a power of 2 in size.
+
+Registers
+
+The device currently supports 4 registers of 32-bits each.  Registers
+are used for synchronization between guests sharing the same memory object when
+interrupts are supported (this requires using the shared memory server).
+
+The server assigns each VM an ID number and sends this ID number to the Qemu
+process when the guest starts.
+
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 4,
+IVPosition = 8,
+Doorbell = 12
+};
+
+The first two registers are the interrupt mask and status registers.  Mask and
+status are only used with pin-based interrupts.  They are unused with MSI
+interrupts.  The IVPosition register is read-only and reports the guest's ID
+number.  To interrupt another guest, a guest must write to the Doorbell
+register.  The doorbell register is 32-bits, logically divided into two 16-bit
+fields.  The high 16-bits are the guest ID to interrupt and the low 16-bits are
+the interrupt vector to trigger.
+
+The semantics of the value written to the doorbell depends on whether the
+device is using MSI or a regular pin-based interrupt.  In short, MSI uses
+vectors and regular interrupts set the status register.
+
+Regular Interrupts
+--
+
+If regular interrupts are used (due to either a guest not supporting MSI or the
+user specifying not to use them on startup) then the value written to the lower
+16-bits of the Doorbell register is arbitrary and will trigger an
+interrupt in the destination guest.
+
+An interrupt is also generated when a new guest accesses the shared memory
+region.  A status of (2^32 - 1) indicates that a new guest has joined.
+
+Message Signalled Interrupts
+
+
+An ivshmem device may support multiple MSI vectors.  If so, the lower 16-bits
+written to the Doorbell register must be between 1 and the maximum number of
+vectors the guest supports.  The lower 16 bits written to the doorbell is the
+MSI vector that will be raised in the destination guest.  The number of MSI
+vectors can vary but is fixed when the VM is started; vector 0, however, is
+used to notify that a new guest has joined.  Guests should not use vector 0 for
+any other purpose.
+
+The important thing to remember with MSI is that it is only a signal, no status
+is set (since MSI interrupts are not shared).  All information other than the
+interrupt itself should be communicated via the shared memory region.  Devices
+supporting multiple MSI vectors can use different vectors to indicate different
+events have occurred.  The semantics of interrupt vectors are left to the
+user's discretion.
+
+Usage in the Guest
+--
+
+The shared memory device is intended to be used with the provided UIO driver.
+Very little configuration is needed.  The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself.  The size of the shared memory region
+is specified when the guest (or shared memory server) is started.  A guest may
+map the whole shared memory region or only part of it.
-- 
1.6.0.6
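
A hedged guest-side illustration of the register layout above, ringing
another guest's doorbell through the UIO mapping; the device path, peer
ID and vector are made-up values and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    /* map 0 of the UIO device is BAR0, the 1 Kbyte register region */
    volatile uint32_t *regs = mmap(NULL, 1024, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

    /* Doorbell at offset 12: high 16 bits = target guest ID,
     * low 16 bits = vector to raise in that guest */
    regs[12 / 4] = (2u << 16) | 1u;    /* interrupt guest 2, vector 1 */

    munmap((void *)regs, 1024);
    close(fd);
    return 0;
}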



[PATCH v4 2/3] Support adding a file to qemu's ram allocation

2010-04-07 Thread Cam Macdonell
This avoids the need of using qemu_ram_alloc and mmap with MAP_FIXED to map a
host file into guest RAM.  This function mmaps the opened file anywhere and adds
the memory to the ram blocks.

Usage is

qemu_ram_mmap(fd, size, MAP_SHARED, offset);
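
A hedged caller sketch (shm_open and the physical registration are how
a device like ivshmem might use the helper; not part of this patch):

    int fd = shm_open("/ivshmem", O_RDWR, S_IRWXU);
    ram_addr_t offset = qemu_ram_mmap(fd, size, MAP_SHARED, 0);
    cpu_register_physical_memory(addr, size, offset);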
---
 cpu-common.h |1 +
 exec.c   |   33 +
 2 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/cpu-common.h b/cpu-common.h
index 49c7fb3..87c82fc 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -32,6 +32,7 @@ static inline void 
cpu_register_physical_memory(target_phys_addr_t start_addr,
 }
 
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
+ram_addr_t qemu_ram_mmap(int, ram_addr_t, int, int);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
diff --git a/exec.c b/exec.c
index 467a0e7..2303be7 100644
--- a/exec.c
+++ b/exec.c
@@ -2811,6 +2811,39 @@ static void *file_ram_alloc(ram_addr_t memory, const 
char *path)
 }
 #endif
 
+ram_addr_t qemu_ram_mmap(int fd, ram_addr_t size, int flags, int offset)
+{
+    RAMBlock *new_block;
+
+    size = TARGET_PAGE_ALIGN(size);
+    new_block = qemu_malloc(sizeof(*new_block));
+
+    // map the file passed as a parameter to be this part of memory
+    new_block->host = mmap(0, size, PROT_READ|PROT_WRITE, flags, fd, offset);
+
+#ifdef MADV_MERGEABLE
+    madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
+
+    new_block->offset = last_ram_offset;
+    new_block->length = size;
+
+    new_block->next = ram_blocks;
+    ram_blocks = new_block;
+
+    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+        (last_ram_offset + size) >> TARGET_PAGE_BITS);
+    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
+           0xff, size >> TARGET_PAGE_BITS);
+
+    last_ram_offset += size;
+
+    if (kvm_enabled())
+        kvm_setup_guest_memory(new_block->host, size);
+
+    return new_block->offset;
+}
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
 RAMBlock *new_block;
-- 
1.6.0.6



[PATCH v4 3/3] Inter-VM shared memory PCI device

2010-04-07 Thread Cam Macdonell
Support an inter-vm shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guest by
communicating over a unix domain socket.  This patch applies to the qemu-kvm
repository.

-device ivshmem,size=<size in MB>[,shm=<shm name>]

Interrupts are supported between multiple VMs by using a shared memory server
by using a chardev socket.

-device ivshmem,size=<size in MB>[,shm=<shm name>][,chardev=<id>][,msi=on]
[,irqfd=on][,vectors=n]
-chardev socket,path=<path>,id=<id>

Sample programs, init scripts and the shared memory server are available in a
git repo here:

www.gitorious.org/nahanni
---
 Makefile.target |3 +
 hw/ivshmem.c|  700 +++
 qemu-char.c |6 +
 qemu-char.h |3 +
 4 files changed, 712 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index 1ffd802..bc9a681 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -199,6 +199,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y = pckbd.o dma.o
 obj-i386-y += vga.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 000..2ec6c2c
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,700 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *  Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#define PCI_COMMAND_IOACCESS    0x0001
+#define PCI_COMMAND_MEMACCESS   0x0002
+
+#define DEBUG_IVSHMEM
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI 1
+#define IVSHMEM_MAX_EVENTFDS  16
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)        \
+    do { printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+#define NEW_GUEST_VAL UINT_MAX
+
+struct eventfd_entry {
+PCIDevice *pdev;
+int vector;
+};
+
+typedef struct IVShmemState {
+PCIDevice dev;
+uint32_t intrmask;
+uint32_t intrstatus;
+uint32_t doorbell;
+
+CharDriverState * chr;
+CharDriverState ** eventfd_chr;
+int ivshmem_mmio_io_addr;
+
+pcibus_t mmio_addr;
+uint8_t *ivshmem_ptr;
+unsigned long ivshmem_offset;
+unsigned int ivshmem_size;
+int shm_fd; /* shared memory file descriptor */
+
+/* array of eventfds for each guest */
+int * eventfds[IVSHMEM_MAX_EVENTFDS];
+/* keep track of # of eventfds for each guest*/
+int * eventfds_posn_count;
+
+int vm_id;
+int num_eventfds;
+uint32_t vectors;
+uint32_t features;
+struct eventfd_entry eventfd_table[IVSHMEM_MAX_EVENTFDS];
+
+char * shmobj;
+uint32_t size; /*size of shared memory in MB*/
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 4,
+IVPosition = 8,
+Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+    return (ivs->features & (1 << feature));
+}
+
+static inline int is_power_of_two(int x) {
+    return (x & (x - 1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+pcibus_t addr, pcibus_t size, int type)
+{
+    IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+    IVSHMEM_DPRINTF("addr = %u size = %u\n", (uint32_t)addr, (uint32_t)size);
+    cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+    int isr;
+    isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+    /* don't print ISR resets */
+    if (isr) {
+        IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+                        isr ? 1 : 0, s->intrstatus, s->intrmask);
+    }
+
+    qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
+
+    s->intrmask = val;
+
+    ivshmem_update_irq(s, val);
+}
+
+static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
+{
+    uint32_t ret = s->intrmask;
+
+    IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
+
+    return ret;
+}
+
+static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
+{
+    IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
+
+    s->intrstatus = val;
+
+    ivshmem_update_irq(s, val);
+    return;
+}
+
+static 

[PATCH v4] Shared memory uio_pci driver

2010-04-07 Thread Cam Macdonell
This patch adds a driver for my shared memory PCI device using the uio_pci
interface.  The driver has three memory regions.  The first memory region is for
device registers for sending interrupts. The second BAR is for receiving MSI-X
interrupts and the third memory region maps the shared memory.  The device only
exports the first and third memory regions to userspace.

This driver supports MSI-X and regular pin interrupts.  Currently, the number
of MSI vectors is set to 2 (one for new connections and the other for
interrupts) but it could easily be increased.  If MSI is not available, then
regular interrupts will be used.

This version added formatting and style corrections as well as better
error-checking and cleanup when errors occur.

---
 drivers/uio/Kconfig   |8 ++
 drivers/uio/Makefile  |1 +
 drivers/uio/uio_ivshmem.c |  252 +
 3 files changed, 261 insertions(+), 0 deletions(-)
 create mode 100644 drivers/uio/uio_ivshmem.c

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 1da73ec..b92cded 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -74,6 +74,14 @@ config UIO_SERCOS3
 
  If you compile this as a module, it will be called uio_sercos3.
 
+config UIO_IVSHMEM
+   tristate "KVM shared memory PCI driver"
+   default n
+   help
+ Userspace I/O interface for the KVM shared memory device.  This
+ driver will make available two memory regions, the first is
+ registers and the second is a region for sharing between VMs.
+
 config UIO_PCI_GENERIC
 tristate "Generic driver for PCI 2.3 and PCI Express cards"
depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 18fd818..25c1ca5 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UIO_AEC)   += uio_aec.o
 obj-$(CONFIG_UIO_SERCOS3)  += uio_sercos3.o
 obj-$(CONFIG_UIO_PCI_GENERIC)  += uio_pci_generic.o
 obj-$(CONFIG_UIO_NETX) += uio_netx.o
+obj-$(CONFIG_UIO_IVSHMEM) += uio_ivshmem.o
diff --git a/drivers/uio/uio_ivshmem.c b/drivers/uio/uio_ivshmem.c
new file mode 100644
index 000..42ac9a7
--- /dev/null
+++ b/drivers/uio/uio_ivshmem.c
@@ -0,0 +1,252 @@
+/*
+ * UIO IVShmem Driver
+ *
+ * (C) 2009 Cam Macdonell
+ * based on Hilscher CIF card driver (C) 2007 Hans J. Koch h...@linutronix.de
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+
+#include <asm/io.h>
+
+#define IntrStatus 0x04
+#define IntrMask 0x00
+
+struct ivshmem_info {
+   struct uio_info *uio;
+   struct pci_dev *dev;
+   char (*msix_names)[256];
+   struct msix_entry *msix_entries;
+   int nvectors;
+};
+
+static irqreturn_t ivshmem_handler(int irq, struct uio_info *dev_info)
+{
+
+   void __iomem *plx_intscr = dev_info->mem[0].internal_addr
+                   + IntrStatus;
+   u32 val;
+
+   val = readl(plx_intscr);
+   if (val == 0)
+   return IRQ_NONE;
+
+   return IRQ_HANDLED;
+}
+
+static irqreturn_t ivshmem_msix_handler(int irq, void *opaque)
+{
+
+   struct uio_info * dev_info = (struct uio_info *) opaque;
+
+   /* we have to do this explicitly when using MSI-X */
+   uio_event_notify(dev_info);
+   return IRQ_HANDLED;
+}
+
+static void free_msix_vectors(struct ivshmem_info *ivs_info,
+   const int max_vector)
+{
+   int i;
+
+   for (i = 0; i < max_vector; i++)
+           free_irq(ivs_info->msix_entries[i].vector, ivs_info->uio);
+}
+
+static int request_msix_vectors(struct ivshmem_info *ivs_info, int nvectors)
+{
+   int i, err;
+   const char *name = ivshmem;
+
+   ivs_info->nvectors = nvectors;
+
+   ivs_info->msix_entries = kmalloc(nvectors * sizeof *
+           ivs_info->msix_entries,
+           GFP_KERNEL);
+   if (ivs_info->msix_entries == NULL)
+           return -ENOSPC;
+
+   ivs_info->msix_names = kmalloc(nvectors * sizeof *ivs_info->msix_names,
+           GFP_KERNEL);
+   if (ivs_info->msix_names == NULL) {
+           kfree(ivs_info->msix_entries);
+           return -ENOSPC;
+   }
+
+   for (i = 0; i < nvectors; ++i)
+           ivs_info->msix_entries[i].entry = i;
+
+   err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+           ivs_info->nvectors);
+   if (err > 0) {
+           ivs_info->nvectors = err; /* msi-x positive error code
+                                        returns the number available */
+           err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+                   ivs_info->nvectors);
+           if (err) {
+                   printk(KERN_INFO "no MSI (%d). Back to INTx.\n", err);
+  

Re: Setting nx bit in virtual CPU

2010-04-07 Thread Richard Simpson

 gordon Code # ./check-nx
 nx: enabled
 gordon Code #

 OK, seems to be enabled just fine.  Any other ideas?  I am beginning to
 get that horrible feeling that there isn't a real problem and it is just
 me being dumb!

 I really hope so, because I am out of ideas... :)
 
 Can you verify check-nx returns disabled on the guest?
 Does /proc/cpuinfo show nx in the guest?
 

OK, time for a summary:

Host:  /proc/cpuinfo shows 'nx' and check-nx shows 'enabled'

Guest: /proc/cpuinfo doesn't show nx and check-nx shows 'disabled'

Guest (with -no-kvm option): /proc/cpuinfo shows 'nx', but check-nx
still shows 'disabled'

Below I have included all the listings which I think might be useful,
but if you would like to see anything else then please ask.

HOST:

/proc/cpuinfo

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 79
model name  : AMD Athlon(tm) 64 Processor 3200+
stepping: 2
cpu MHz : 1000.000
cache size  : 512 KB
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm svm extapic
cr8_legacy
bogomips: 2000.06
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

GUEST with command line - kvm -hda /dev/mapper/vols-andrew -kernel
./bzImage -append root=/dev/hda2 -cpu host -runas xx -net nic -net user
-m 256 -k en-gb -vnc :1 -monitor stdio

/proc/cpuinfo

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 79
model name  : AMD Athlon(tm) 64 Processor 3200+
stepping: 2
cpu MHz : 1.330
cache size  : 512 KB
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext fxsr_opt lm
rep_good pni cx16 lahf_lm
bogomips: 2000.06
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Results of paxtest

PaXtest - Copyright(c) 2003,2004 by Peter Busser pe...@adamantix.org
Released under the GNU Public Licence version 2 or later

Mode: kiddie
Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31 GMT 2010 x86_64
AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

Executable anonymous mapping : Vulnerable
Executable bss   : Vulnerable
Executable data  : Vulnerable
Executable heap  : Vulnerable
Executable stack : Vulnerable
Executable anonymous mapping (mprotect)  : Vulnerable
Executable bss (mprotect): Vulnerable
Executable data (mprotect)   : Vulnerable
Executable heap (mprotect)   : Vulnerable
Executable stack (mprotect)  : Vulnerable
Executable shared library bss (mprotect) : Vulnerable
Executable shared library data (mprotect): Vulnerable
Writable text segments   : Killed
Anonymous mapping randomisation test : 33 bits (guessed)
Heap randomisation test (ET_EXEC): 13 bits (guessed)
Heap randomisation test (ET_DYN) : 40 bits (guessed)
Main executable randomisation (ET_EXEC)  : No randomisation
Main executable randomisation (ET_DYN)   : 12 bits (guessed)
Shared library randomisation test: 33 bits (guessed)
Stack randomisation test (SEGMEXEC)  : 40 bits (guessed)
Stack randomisation test (PAGEEXEC)  : 40 bits (guessed)
Return to function (strcpy)  : paxtest: bad luck, try
different compiler options.
Return to function (memcpy)  : *** buffer overflow detected
***: rettofunc2 - terminated
rettofunc2: buffer overflow attack in function unknown - terminated
Report to http://bugs.gentoo.org/
Killed
Return to function (strcpy, RANDEXEC): paxtest: bad luck, try
different compiler options.
Return to function (memcpy, RANDEXEC): *** buffer overflow detected
***: rettofunc2x - terminated
rettofunc2x: buffer overflow attack in function unknown - terminated
Report to http://bugs.gentoo.org/
Killed
Executable shared library bss: Killed
Executable shared library data   : Killed

GUEST with command line - kvm -hda /dev/mapper/vols-andrew -kernel
./bzImage -append root=/dev/hda2 -no-kvm -runas xx -net nic -net user -m
256 -k en-gb -vnc :1 -monitor stdio

/proc/cpuinfo

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 6
model   : 2
model name  : QEMU Virtual CPU version 0.12.3
stepping: 3
cpu MHz : 1998.067
cache size  : 512 KB
fpu : yes
fpu_exception   : yes
cpuid level : 4
wp 

Re: [GIT PULL] vhost-net fix for 2.6.34-rc3

2010-04-07 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Wed, 7 Apr 2010 20:35:02 +0300

 David,
 The following tree includes a patch fixing an issue with vhost-net in
 2.6.34-rc3.  Please pull for 2.6.34.

Pulled, thanks Michael.


buildbot failure in qemu-kvm on default_x86_64_debian_5_0

2010-04-07 Thread qemu-kvm
The Buildbot has detected a new failure of default_x86_64_debian_5_0 on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_debian_5_0/builds/347

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



buildbot failure in qemu-kvm on default_x86_64_out_of_tree

2010-04-07 Thread qemu-kvm
The Buildbot has detected a new failure of default_x86_64_out_of_tree on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_out_of_tree/builds/288

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



KVM autotest patch queue report 04-07-2010

2010-04-07 Thread Lucas Meneghel Rodrigues
Summary:

Total patches: 13
Reviewed patches: 8
Reviews unfinished: 1
Unreviewed patches: 4 

Autotest patchwork (patches under review)
http://patchwork.test.kernel.org/project/autotest/list/

Autotest timeline (patches already applied)
http://autotest.kernel.org/timeline


KVM test: Add a subtest iofuzz   2010-04-07   Jason Wang   lmr   Under Review

Unreviewed patch.


[3/3] KVM Test: Add ioquit test case  2010-04-07  Feng Yang  lmr  Under Review
[2/3] KVM Test: Add function run_autotest_background and 
wait_autotest_background.  2010-04-07  Feng Yang  lmr  Under Review
[1/3] KVM Test: Add control file dbench.control.200 for dbench  2010-04-07  
Feng Yang  lmr  Under Review

Unreviewed patch series.


[RFC] KVM test: Introduce sample performance test set  2010-04-07  Lucas 
Meneghel Rodrigues  lmr  Under Review

First pass at implementing automated performance testing. Still needs a lot of 
work to get it in shape for upstream inclusion.


[v2] KVM-test: Add a subtest 'qemu_img'  2010-03-31  Yolkfull Chow  lmr  Under 
Review

Improved version of qemu_img test, currently in testing stage.


 
[KVM-AUTOTEST] Opensuse unattended install 2010-03-23 yogi lmr Under Review

I am concerned about a timeout added at the end of unattended install,
which could have bad side effects on other guests. Didn't manage to finish
testing.


KVM-Test: Add kvm userspace unit test 2010-03-05 sshang lmr Under review

Naphtali Sprei is working on support for running the unittests with a new
in-qemu infrastructure made by Avi, so this test will wait a little bit
so we can merge both approaches into one. Update: Naphtali already sent his RFC
patches, currently under review.


[2/2] KVM test: Add cpu_set subtest 2010-02-25 Lucas Meneghel Rodrigues lmr 
Under Review

This patch will stay on the queue until the feature tested gets in a
better shape on KVM upstream


KVM test: Add support for ipv6 addresses 2010-02-24 Lucas Meneghel Rodrigues 
lmr Under Review

This test was reviewed and the decision is that it will stay on the
queue until we have more extensive guest network testing.


KVM test: Memory ballooning test for KVM guest  2010-02-11  pradeep  lmr  Under 
Review

Same status from previous week. Waiting on revised patch from originator.


[2/2] KVM test: subtest migration: Add rem_host and rem_port for migrate()  
2009-12-08  Yolkfull Chow  lmr  Under Review
[1/2,-,V3] Add a server-side test - kvm_migration  2009-12-08  Yolkfull Chow  
lmr  Under Review

Same status from previous week. Needs full review



Re: Shouldn't cache=none be the default for drives?

2010-04-07 Thread Thomas Mueller
On Wed, 07 Apr 2010 16:39:41 +0200, Troels Arvin wrote:

 Hello,
 
 I'm conducting some performancetests with KVM-virtualized CentOSes. One
 thing I noticed is that guest I/O performance seems to be significantly
 better for virtio-based block devices (drives) if the cache=none
 argument is used. (This was with a rather powerful storage system
 backend which is hard to saturate.)
 
 So: Why isn't cache=none be the default for drives?

A while ago I suffered poor performance with virtio and Win2008.

This helped a lot:

I enabled the deadline block scheduler instead of the default cfq on the 
host system (e.g. echo deadline > /sys/block/<dev>/queue/scheduler). Tested 
with: host Debian with scheduler deadline, guest Win2008 with virtio and 
cache=none. (26MB/s to 50MB/s boost measured.) 
Maybe this is also true for Linux/Linux.

I expect that the noop scheduler would be good for Linux guests.

- Thomas




Re: Question on skip_emulated_instructions()

2010-04-07 Thread Yoshiaki Tamura

Avi Kivity wrote:

On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:


The problem here is that I needed to transfer the VM state from
just *before* the output to the devices. Otherwise, the VM state has
already advanced, and after failover some I/O didn't work as I
expected.
I tracked down this issue and figured out that rip had already been
advanced in KVM,
and transferring this VCPU state was meaningless.

I'm planning to post the patch set of Kemari soon, but I would like to
solve
this rip issue before that. If there is no drawback, I'm happy to work
and post a patch.


vcpu state is undefined when an mmio operation is pending,
Documentation/kvm/api.txt says the following:


NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
operations are complete (and guest state is consistent) only after
userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals. Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.


Thanks for the information.

So the point is that the vcpu state that can be observed from qemu upon 
KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used because it's not 
complete/consistent?



Currently we complete instructions for output operations and leave them
incomplete for input operations. Deferring completion for output
operations should work, except it may break the vmware backdoor port
(see hw/vmport.c), which changes register state following an output
instruction, and KVM_EXIT_TPR_ACCESS, where userspace reads the state
following a write instruction.

Do you really need to transfer the vcpu state before the instruction, or
do you just need a consistent state? If the latter, then you can get
away by posting a signal and re-entering the guest. kvm will complete
the instruction and exit immediately, and you will have fully consistent
state.


The requirement is that the guest must always be able to replay at least the 
instruction which triggered the synchronization on the primary.  From that point 
of view, I think I need to transfer the vcpu state before the instruction.  If I 
post a signal and let the guest or emulator proceed, I'm not sure whether the 
guest on the secondary can replay it as expected.  Please point it out if I'm 
misunderstanding.



Re: Question on skip_emulated_instructions()

2010-04-07 Thread Gleb Natapov
On Thu, Apr 08, 2010 at 02:27:53PM +0900, Yoshiaki Tamura wrote:
 Avi Kivity wrote:
 On 04/07/2010 08:21 PM, Yoshiaki Tamura wrote:
 
 The problem here is that I needed to transfer the VM state from
 just *before* the output to the devices. Otherwise, the VM state has
 already advanced, and after failover some I/O didn't work as I
 expected.
 I tracked down this issue and figured out that rip had already been
 advanced in KVM,
 and transferring this VCPU state was meaningless.
 
 I'm planning to post the patch set of Kemari soon, but I would like to
 solve
 this rip issue before that. If there is no drawback, I'm happy to work
 and post a patch.
 
 vcpu state is undefined when an mmio operation is pending,
 Documentation/kvm/api.txt says the following:
 
 NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
 operations are complete (and guest state is consistent) only after
 userspace
 has re-entered the kernel with KVM_RUN. The kernel side will first finish
 incomplete operations and then check for pending signals. Userspace
 can re-enter the guest with an unmasked signal pending to complete
 pending operations.
 
 Thanks for the information.
 
 So the point is the vcpu state that can been observed from qemu upon
 KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI should not be used
 because it's not complete/consistent?
 
Definitely. The VCPU is in the middle of an instruction's execution, so the
state is undefined. One instruction may generate more than one IO exit
during its execution, BTW.

--
Gleb.