Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-05 Thread Neil Brown
On Wed, 5 May 2010 14:28:41 +0930
Rusty Russell ru...@rustcorp.com.au wrote:

 On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
  Jens Axboe wrote:
   On Tue, May 04 2010, Rusty Russell wrote:
ISTR someone mentioning a desire for such an API years ago, so CC'ing
the usual I/O suspects...
   
   It would be nice to have a fuller API for this, but the reality is
   that only the flush approach is really workable. Even just strict
   ordering of requests could only be supported on SCSI, and even there the
   kernel still lacks proper guarantees on error handling to prevent
   reordering there.
  
  There's a few I/O scheduling differences that might be useful:
  
  1. The I/O scheduler could freely move WRITEs before a FLUSH but not
 before a BARRIER.  That might be useful for time-critical WRITEs,
 and those issued by high I/O priority.
 
 This is only because no one actually wants flushes or barriers, though
 I/O people seem to only offer that.  We really want "these writes must
 occur before this write".  That offers maximum choice to the I/O subsystem
 and potentially to smart (virtual?) disks.
 
  2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
 only for data belonging to a particular file (e.g. fdatasync with
 no file size change, even on btrfs if O_DIRECT was used for the
 writes being committed).  That would entail tagging FLUSHes and
 WRITEs with a fs-specific identifier (such as inode number), opaque
 to the scheduler which only checks equality.
 
 This is closer.  In userspace I'd be happy with an "all prior writes to this
 struct file before all future writes" guarantee.  Even if the original
 guarantees were stronger (i.e. on an inode basis).  We currently implement
 transactions using 4 fsync/msync pairs.
 
   write_recovery_data(fd);
   fsync(fd);
   msync(mmap);
   write_recovery_header(fd);
   fsync(fd);
   msync(mmap);
   overwrite_with_new_data(fd);
   fsync(fd);
   msync(mmap);
   remove_recovery_header(fd);
   fsync(fd);
   msync(mmap);

Seems over-zealous.
If the recovery_header held a strong checksum of the recovery_data you would
not need the first fsync, and as long as you have two places to write recovery
data, you don't need the 3rd and 4th syncs.
Just:
  write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
  fsync / msync
  overwrite_with_new_data()

To recover, you choose the most recent log_space and replay the content.
That may be a redundant operation, but that is no loss.
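
Roughly, in code, the simplified scheme would look like this (a sketch only;
the record layout, the two fixed log slots and the csum32() helper are
illustrative, not an existing API):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* One self-checksummed log record.  Two fixed slots on disk are written
 * alternately; on recovery, pick the slot whose checksum verifies and whose
 * seq is higher, and replay it (replaying twice is harmless). */
struct log_record {
	uint64_t seq;
	uint32_t len;
	uint32_t csum;		/* covers seq, len and data[0..len) */
	char     data[4096];
};

static uint32_t csum32(const void *p, size_t n)	/* stand-in checksum */
{
	const unsigned char *c = p;
	uint32_t s = 5381;

	while (n--)
		s = (s << 5) + s + *c++;
	return s;
}

/* write_internally_checksummed_recovery_data_and_header_to_unused_log_space() */
static int log_write(int fd, off_t slot_off, uint64_t seq,
		     const void *buf, uint32_t len)
{
	struct log_record rec = { .seq = seq, .len = len };

	memcpy(rec.data, buf, len);
	rec.csum = csum32(rec.data, len) ^ (uint32_t)seq ^ len;
	if (pwrite(fd, &rec, sizeof(rec), slot_off) != sizeof(rec))
		return -1;
	return fsync(fd);	/* the single sync; overwrite_with_new_data() follows */
}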

Also, I cannot see the point of msync if you have already performed an fsync,
and if there is a point, I would expect you to call msync before
fsync... Maybe there is some subtlety there that I am not aware of.

 
 Yet we really only need ordering, not guarantees about it actually hitting
 disk before returning.
 
  In other words, FLUSH can be more relaxed than BARRIER inside the
  kernel.  It's ironic that we think of fsync as stronger than
  fbarrier outside the kernel :-)
 
 It's an implementation detail; barrier has less flexibility because it has
 less information about what is required. I'm saying I want to give you as
 much information as I can, even if you don't use it yet.

Only, we know that approach doesn't work.
People will learn that they don't need to give the extra information to still
achieve the same result - just like they did with ext3 and fsync.
Then when we improve the implementation to only provide the guarantees that
you asked for, people will complain that they are getting empty files that
they didn't expect.

The abstraction I would like to see is a simple 'barrier' that contains no
data and has a filesystem-wide effect.

If a filesystem wanted a 'full' barrier such as the current BIO_RW_BARRIER,
it would send an empty barrier, then the data, then another empty barrier.
(However I suspect most filesystems don't really need barriers on both sides.)
A low level driver might merge these together if the underlying hardware
supported that combined operation (which I believe some do).
I think this merging would be less complex than the current need to split a
BIO_RW_BARRIER into the three separate operations when only a flush is
possible (I know it would make the md code a lot nicer :-).

I would probably expose this to user-space as extra flags to sync_file_range:
   SYNC_FILE_RANGE_BARRIER_BEFORE
   SYNC_FILE_RANGE_BARRIER_AFTER

This would make it clear that a barrier does *not* imply a sync, it only
applies to data for which a sync has already been requested. So data that has
already been 'synced' is stored strictly before data which has not yet been
submitted with write() (or by changing a mmapped area).
The barrier would still be filesystem-wide in that if you
SYNC_FILE_RANGE_WRITE one file, then SYNC_FILE_RANGE_BARRIER_BEFORE another
file on the same filesystem, the pages scheduled in the first file would be
affected by the barrier request on the second file.
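
For example, a commit sequence in user space might then look like the
following sketch (the two barrier flags are hypothetical values, shown only
to illustrate the proposed call sequence; they do not exist in the kernel
today):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical flags -- not in the kernel; values merely avoid the existing
 * SYNC_FILE_RANGE_* bits. */
#define SYNC_FILE_RANGE_BARRIER_BEFORE	0x08
#define SYNC_FILE_RANGE_BARRIER_AFTER	0x10

/* Journal commit under the proposed semantics: the journal blocks have had
 * writeback requested, so the barrier orders them before the commit record
 * submitted afterwards -- no flush, and no waiting, is implied. */
static int commit(int fd, off_t jrnl_off, off_t jrnl_len,
		  const void *rec, size_t rec_len, off_t rec_off)
{
	/* 1. request writeback of the journal blocks */
	if (sync_file_range(fd, jrnl_off, jrnl_len, SYNC_FILE_RANGE_WRITE))
		return -1;

	/* 2. filesystem-wide barrier: everything synced so far is ordered
	 *    before anything written from here on */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_BARRIER_BEFORE))
		return -1;

	/* 3. the commit record can now be written and scheduled lazily */
	if (pwrite(fd, rec, rec_len, rec_off) != (ssize_t)rec_len)
		return -1;
	return 0;
}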

Implementing 

question on virtio

2010-05-05 Thread Michael S. Tsirkin
Hi!
I see this in virtio_ring.c:

/* Put entry in available array (but don't update avail->idx *
   until they do sync). */

Why is it done this way?
It seems that updating the index straight away would be simpler, while
this might allow the host to speculatively look up the buffer and handle
it, without waiting for the kick.
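
For reference, the pattern being asked about, paraphrased from virtio_ring.c
(a simplified sketch; descriptor setup and error handling omitted):

/* add_buf: fill the next available slot, but leave avail->idx alone */
avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
vq->vring.avail->ring[avail] = head;	/* head = start of descriptor chain */

/* kick: publish all added entries at once, then notify the host */
virtio_wmb();				/* descriptors visible before idx */
vq->vring.avail->idx += vq->num_added;
vq->num_added = 0;
virtio_mb();
if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
	vq->notify(&vq->vq);		/* the "sync" / kick */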

-- 
MST


Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
The purpose of this email is to introduce the architecture and the design 
principles. The overall project involves more than just changes to vmxnet3 
driver and hence we thought an overview email would be better. Once people agree 
to the design in general we intend to provide the code changes to the vmxnet3 
driver.

The architecture supports more than Intel NICs. We started the project with 
Intel but plan to support all major IHVs including Broadcom, Qlogic, Emulex and 
others through a certification program. The architecture works on VMware ESX 
server only as it requires significant support from the hypervisor. Also, the 
vmxnet3 driver works on VMware platform only. AFAICT Xen has a different model 
for supporting SR-IOV devices and allowing live migration and the document 
briefly talks about it (paragraph 6).

Thanks,

-pankaj


On Tue, May 04, 2010 at 05:05:31PM -0700, Stephen Hemminger wrote:
 Date: Tue, 4 May 2010 17:05:31 -0700
 From: Stephen Hemminger shemmin...@vyatta.com
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org linux-ker...@vger.kernel.org,
   net...@vger.kernel.org net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org
  virtualization@lists.linux-foundation.org,
   pv-driv...@vmware.com pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 On Tue, 4 May 2010 16:02:25 -0700
 Pankaj Thakkar pthak...@vmware.com wrote:
 
Device passthrough technology allows a guest to bypass the hypervisor and
drive the underlying physical device. VMware has been exploring various ways to
deliver this technology to users in a manner which is easy to adopt. In this
process we have prepared an architecture along with Intel - NPA (Network Plugin
Architecture). NPA allows the guest to use the virtualized NIC vmxnet3 to
passthrough to a number of physical NICs which support it. The document below
provides an overview of NPA.
  
  We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
  Linux users can exploit the benefits provided by passthrough devices in a
  seamless manner while retaining the benefits of virtualization. The document
below tries to answer most of the questions which we anticipated. Please let
us know your comments and queries.
  
  Thank you.
  
  Signed-off-by: Pankaj Thakkar pthak...@vmware.com
 
 
 Code please. Also, it has to work for all architectures not just VMware and
 Intel.


Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
Sure. We have been working on NPA for a while and have the code internally up
and running. Let me sync up internally on how and when we can provide the
vmxnet3 driver code so that people can look at it.


On Tue, May 04, 2010 at 05:32:36PM -0700, David Miller wrote:
 Date: Tue, 4 May 2010 17:32:36 -0700
 From: David Miller da...@davemloft.net
 To: Pankaj Thakkar pthak...@vmware.com
 CC: shemmin...@vyatta.com shemmin...@vyatta.com,
   linux-ker...@vger.kernel.org linux-ker...@vger.kernel.org,
   net...@vger.kernel.org net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org
  virtualization@lists.linux-foundation.org,
   pv-driv...@vmware.com pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 From: Pankaj Thakkar pthak...@vmware.com
 Date: Tue, 4 May 2010 17:18:57 -0700
 
  The purpose of this email is to introduce the architecture and the
  design principles. The overall project involves more than just
   changes to vmxnet3 driver and hence we thought an overview email
  would be better. Once people agree to the design in general we
  intend to provide the code changes to the vmxnet3 driver.
 
 Stephen's point is that code talks and bullshit walks.
 
 Talk about high level designs rarely gets any traction, and often goes
 nowhere.  Give us an example implementation so there is something
 concrete for us to sink our teeth into.


Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Christoph Hellwig
On Tue, May 04, 2010 at 04:02:25PM -0700, Pankaj Thakkar wrote:
 The plugin image is provided by the IHVs along with the PF driver and is
 packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
 either into a Linux VM or a Windows VM. The plugin is written against the
 Shell API interface which the shell is responsible for implementing. The API

We're not going to add any kind of loader for binary blobs into kernel
space, sorry.  Don't even bother wasting your time on this.



Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Christoph Hellwig
On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
 Yes, with the exception that the only body of code that will be
 accepted by the shell should be GPL-licensed and thus open and available
 for examining. This is not different from having a standard kernel
 module that is loaded normally and plugs into a certain subsystem.
 The difference is that the binary resides not on guest filesystem
 but elsewhere.

Forget about the licensing.  Loading binary blobs written to a shim
layer is a complete pain in the ass and totally unsupportable, and
also uninteresting because of the overhead.

If you have any interest in developing this further, do:

 (1) move the limited VF drivers directly into the kernel tree,
 talk to them through a normal ops vector
 (2) get rid of the whole shim crap and instead integrate the limited
 VF driver with the full VF driver we already have, instead of
 duplicating the code
 (3) don't make the PV to VF integration VMware-specific but also
 provide an open reference implementation like virtio.  We're not
 going to add massive amount of infrastructure that is not actually
 useable in a free software stack.



RE: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar


 -Original Message-
 From: Christoph Hellwig [mailto:h...@infradead.org]
 Sent: Wednesday, May 05, 2010 10:40 AM
 To: Dmitry Torokhov
 Cc: Christoph Hellwig; pv-driv...@vmware.com; Pankaj Thakkar;
 net...@vger.kernel.org; linux-ker...@vger.kernel.org;
 virtualization@lists.linux-foundation.org
 Subject: Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for
 vmxnet3
 
 On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
  Yes, with the exception that the only body of code that will be
  accepted by the shell should be GPL-licensed and thus open and
 available
  for examining. This is not different from having a standard kernel
  module that is loaded normally and plugs into a certain subsystem.
  The difference is that the binary resides not on guest filesystem
  but elsewhere.
 
 Forget about the licensing.  Loading binary blobs written to a shim
 layer is a complete pain in the ass and totally unsupportable, and
 also uninteresting because of the overhead.

[PT] Why do you think it is unsupportable? How different is it from any module 
written against a well maintained interface? What overhead are you talking 
about?

 
 If you have any interest in developing this further, do:
 
  (1) move the limited VF drivers directly into the kernel tree,
  talk to them through a normal ops vector
[PT] This assumes that all the VF drivers would always be available. Also we 
have to support windows and our current design supports it nicely in an OS 
agnostic manner.

  (2) get rid of the whole shim crap and instead integrate the limited
  VF driver with the full VF driver we already have, instead of
  duplicating the code
[PT] Having a full VF driver adds a lot of dependency on the guest VM and this 
is what NPA tries to avoid.

  (3) don't make the PV to VF integration VMware-specific but also
  provide an open reference implementation like virtio.  We're not
  going to add massive amount of infrastructure that is not actually
  useable in a free software stack.
[PT] Today this is tied to the vmxnet3 device and is intended to work on the
ESX hypervisor only (vmxnet3 works on the VMware hypervisor only). All the
loading support is inside the ESX hypervisor. I am going to post the interface
between the shell and the plugin soon, and you can see that there are not a
whole lot of dependencies or infrastructure requirements on the Linux kernel.
Please keep in mind that we don't use Linux as a hypervisor but as a guest VM.



Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
On Tue, May 04, 2010 at 05:58:52PM -0700, Chris Wright wrote:
 Date: Tue, 4 May 2010 17:58:52 -0700
 From: Chris Wright chr...@sous-sol.org
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org linux-ker...@vger.kernel.org,
   net...@vger.kernel.org net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org
  virtualization@lists.linux-foundation.org,
   pv-driv...@vmware.com pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com,
   k...@vger.kernel.org k...@vger.kernel.org
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 * Pankaj Thakkar (pthak...@vmware.com) wrote:
  We intend to upgrade the upstreamed vmxnet3 driver to implement NPA so that
  Linux users can exploit the benefits provided by passthrough devices in a
  seamless manner while retaining the benefits of virtualization. The document
  below tries to answer most of the questions which we anticipated. Please 
  let us
  know your comments and queries.
 
 How does the throughput, latency, and host CPU utilization for normal
 data path compare with say NetQueue?

NetQueue is really for scaling across multiple VMs. NPA allows similar scaling
and also helps in improving the CPU efficiency for a single VM since the
hypervisor is bypassed. Throughput-wise, both emulation and passthrough (NPA)
can obtain line rate on 10GbE, but passthrough saves up to 40% CPU depending on
the workload. We did a demo at IDF 2009 where we compared 8 VMs running on
NetQueue vs. 8 VMs running on NPA (using Niantic) and we obtained similar CPU
efficiency gains.

 
 And does this obsolete your UPT implementation?

NPA and UPT share a lot of code in the hypervisor. UPT was adopted by only a
very limited set of IHVs, and hence NPA is our way forward to get all IHVs on
board.

 How many cards actually support this NPA interface?  What does it look
 like, i.e. where is the NPA specification?  (AFAIK, we never got the UPT
 one).

We have it working internally with Intel Niantic (10G) and Kawela (1G) SR-IOV
NICs. We are also working with an upcoming Broadcom 10G card and plan to
support other IHVs. This is unlike UPT: we don't dictate the register sets or
rings like we did in UPT. Rather, we have guidelines, such as that the card
should have an embedded switch for inter-VF switching or should support
programming (RX filters, VLAN, etc.) through the PF driver rather than the VF
driver.

 How do you handle hardware which has a more symmetric view of the
 SR-IOV world (SR-IOV is only a PCI specification, not a network driver
 specification)?  Or hardware which has multiple functions per physical
 port (multiqueue, hw filtering, embedded switch, etc.)?

I am not sure what you mean by a symmetric view of the SR-IOV world.

NPA allows multi-queue VFs and requires an embedded switch currently. As far as
the PF driver is concerned, we require IHVs to support all existing and upcoming
features like NetQueue, FCoE, etc. The PF driver is considered special and is
used to drive the traffic for the emulated/paravirtualized VMs and is also used
to program things on behalf of the VFs through the hypervisor. If the hardware
has multiple physical functions, they are treated as separate adapters (with
their own set of VFs) and we require the embedded switch to maintain that
distinction as well.


  NPA offers several benefits:
  1. Performance: Critical performance sensitive paths are not trapped and the
  guest can directly drive the hardware without incurring virtualization
  overheads.
 
 Can you demonstrate with data?

The setup is a 2.667GHz Nehalem server running a SLES11 VM, talking to a
2.33GHz Barcelona client box running RHEL 5.1. We had netperf streams with 16k
message size over a 64k socket size running between the server VM and the
client, using Intel Niantic 10G cards. In both cases (NPA and regular) the VM
was CPU saturated (used one full core).

TX: regular vmxnet3 = 3085.5 Mbps/GHz; NPA vmxnet3 = 4397.2 Mbps/GHz
RX: regular vmxnet3 = 1379.6 Mbps/GHz; NPA vmxnet3 = 2349.7 Mbps/GHz

We have similar results for other configurations, and in general we have seen
that NPA is better in terms of CPU cost and can save up to 40% of it.

 
  2. Hypervisor control: All control operations from the guest such as 
  programming
  MAC address go through the hypervisor layer and hence can be subjected to
  hypervisor policies. The PF driver can be further used to put policy 
  decisions
  like which VLAN the guest should be on.
 
 This can happen without NPA as well.  VF simply needs to request
 the change via the PF (in fact, hw does that right now).  Also, we
 already have a host side management interface via PF (see, for example,
 RTM_SETLINK IFLA_VF_MAC interface).
 
 What is control plane interface?  Just something like a fixed register set?

All operations other than TX/RX go through the vmxnet3 shell to the vmxnet3
device emulation. So the control plane is really the vmxnet3 device emulation
as far as the guest is concerned.

 
  3. Guest Management: 

[PATCH 1/1] staging: hv: Add Time Sync feature to hv_utils module

2010-05-05 Thread Haiyang Zhang
From: Haiyang Zhang haiya...@microsoft.com

Subject: Add Time Sync feature to hv_utils module.
The Time Sync feature synchronizes guest time to host UTC time after reboot
and after restoring from a saved/paused state.

Cc: Greg Kroah-Hartman gre...@suse.de
Signed-off-by: Hank Janssen hjans...@microsoft.com
Signed-off-by: Haiyang Zhang haiya...@microsoft.com

---
 drivers/staging/hv/ChannelMgmt.c  |   24 ++-
 drivers/staging/hv/hyperv_utils.c |   84 +
 drivers/staging/hv/utils.h|   25 ++-
 3 files changed, 130 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/hv/ChannelMgmt.c b/drivers/staging/hv/ChannelMgmt.c
index 445506d..71fe8dd 100644
--- a/drivers/staging/hv/ChannelMgmt.c
+++ b/drivers/staging/hv/ChannelMgmt.c
@@ -33,8 +33,8 @@ struct vmbus_channel_message_table_entry {
void (*messageHandler)(struct vmbus_channel_message_header *msg);
 };
 
-#define MAX_MSG_TYPES                    1
-#define MAX_NUM_DEVICE_CLASSES_SUPPORTED 5
+#define MAX_MSG_TYPES                    2
+#define MAX_NUM_DEVICE_CLASSES_SUPPORTED 6
 
 static const struct hv_guid
gSupportedDeviceClasses[MAX_NUM_DEVICE_CLASSES_SUPPORTED] = {
@@ -81,6 +81,14 @@ static const struct hv_guid
0x81, 0x8B, 0x38, 0XD9, 0x0C, 0xED, 0x39, 0xDB
}
},
+   /* {9527E630-D0AE-497b-ADCE-E80AB0175CAF} */
+   /* TimeSync */
+   {
+   .data = {
+   0x30, 0xe6, 0x27, 0x95, 0xae, 0xd0, 0x7b, 0x49,
+   0xad, 0xce, 0xe8, 0x0a, 0xb0, 0x17, 0x5c, 0xaf
+   }
+   },
 };
 
 
@@ -191,6 +199,18 @@ struct hyperv_service_callback hv_cb_utils[MAX_MSG_TYPES] 
= {
.callback = chn_cb_negotiate,
.log_msg = "Shutdown channel functionality initialized",
},
+
+   /* {9527E630-D0AE-497b-ADCE-E80AB0175CAF} */
+   /* TimeSync */
+   {
+   .msg_type = HV_TIMESYNC_MSG,
+   .data = {
+   0x30, 0xe6, 0x27, 0x95, 0xae, 0xd0, 0x7b, 0x49,
+   0xad, 0xce, 0xe8, 0x0a, 0xb0, 0x17, 0x5c, 0xaf
+   },
+   .callback = chn_cb_negotiate,
+   .log_msg = "Timesync channel functionality initialized",
+   },
 };
 EXPORT_SYMBOL(hv_cb_utils);
 
diff --git a/drivers/staging/hv/hyperv_utils.c b/drivers/staging/hv/hyperv_utils.c
index cbebad3..9174f79 100644
--- a/drivers/staging/hv/hyperv_utils.c
+++ b/drivers/staging/hv/hyperv_utils.c
@@ -106,6 +106,82 @@ static void shutdown_onchannelcallback(void *context)
orderly_poweroff(false);
 }
 
+
+/*
+ * Synchronize time with host after reboot, restore, etc.
+ */
+static void adj_guesttime(winfiletime_t hosttime, u8 flags)
+{
+   s64 host_tns;
+   struct timespec host_ts;
+   static s32 scnt = 50;
+
+   host_tns = (hosttime - WLTIMEDELTA) * 100;
+   host_ts = ns_to_timespec(host_tns);
+
+   if ((flags & ICTIMESYNCFLAG_SYNC) != 0) {
+   do_settimeofday(&host_ts);
+   return;
+   }
+
+   if ((flags & ICTIMESYNCFLAG_SAMPLE) != 0 &&
+   scnt > 0) {
+   scnt--;
+   do_settimeofday(&host_ts);
+   }
+
+   return;
+}
+
+/*
+ * Time Sync Channel message handler.
+ */
+static void timesync_onchannelcallback(void *context)
+{
+   struct vmbus_channel *channel = context;
+   u8 *buf;
+   u32 buflen, recvlen;
+   u64 requestid;
+   struct icmsg_hdr *icmsghdrp;
+   struct ictimesync_data *timedatap;
+
+   DPRINT_ENTER(VMBUS);
+
+   buflen = PAGE_SIZE;
+   buf = kmalloc(buflen, GFP_ATOMIC);
+
+   VmbusChannelRecvPacket(channel, buf, buflen, &recvlen, &requestid);
+
+   if (recvlen > 0) {
+   DPRINT_DBG(VMBUS, "timesync packet: recvlen=%d, requestid=%lld",
+   recvlen, requestid);
+
+   icmsghdrp = (struct icmsg_hdr *)&buf[
+   sizeof(struct vmbuspipe_hdr)];
+
+   if (icmsghdrp->icmsgtype == ICMSGTYPE_NEGOTIATE) {
+   prep_negotiate_resp(icmsghdrp, NULL, buf);
+   } else {
+   timedatap = (struct ictimesync_data *)&buf[
+   sizeof(struct vmbuspipe_hdr) +
+   sizeof(struct icmsg_hdr)];
+   adj_guesttime(timedatap->parenttime, timedatap->flags);
+   }
+
+   icmsghdrp->icflags = ICMSGHDRFLAG_TRANSACTION
+   | ICMSGHDRFLAG_RESPONSE;
+
+   VmbusChannelSendPacket(channel, buf,
+   recvlen, requestid,
+   VmbusPacketTypeDataInBand, 0);
+   }
+
+   kfree(buf);
+
+   DPRINT_EXIT(VMBUS);
+}
+
+
 static int __init init_hyperv_utils(void)
 {
printk(KERN_INFO "Registering HyperV Utility Driver\n");
@@ -114,6 +190,10 @@ static int __init init_hyperv_utils(void)
 

Re: RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Pankaj Thakkar
On Wed, May 05, 2010 at 10:59:51AM -0700, Avi Kivity wrote:
 Date: Wed, 5 May 2010 10:59:51 -0700
 From: Avi Kivity a...@redhat.com
 To: Pankaj Thakkar pthak...@vmware.com
 CC: linux-ker...@vger.kernel.org linux-ker...@vger.kernel.org,
   net...@vger.kernel.org net...@vger.kernel.org,
   virtualization@lists.linux-foundation.org
  virtualization@lists.linux-foundation.org,
   pv-driv...@vmware.com pv-driv...@vmware.com,
   Shreyas Bhatewara sbhatew...@vmware.com
 Subject: Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
 
 On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
  2. Hypervisor control: All control operations from the guest such as 
  programming
  MAC address go through the hypervisor layer and hence can be subjected to
  hypervisor policies. The PF driver can be further used to put policy 
  decisions
  like which VLAN the guest should be on.
 
 
 Is this enforced?  Since you pass the hardware through, you can't rely 
 on the guest actually doing this, yes?

We don't pass the whole VF to the guest. Only the BAR which is responsible for
TX/RX/intr is mapped into guest space. The interface between the shell and
plugin only allows operations related to TX and RX, such as sending a packet
to the VF, allocating RX buffers, or indicating a packet up to the shell. All
control operations are handled by the shell, and the shell does what the
existing vmxnet3 driver does (touch a specific register and let the device
emulation do the work). When a VF is mapped to the guest the hypervisor knows
this and programs the h/w accordingly on behalf of the shell. So, for example,
if the VM does a MAC address change inside the guest, the shell would write to
VMXNET3_REG_MAC{L|H} registers which would trigger the device emulation to read
the new mac address and update its internal virtual port information for the
virtual switch and if the VF is mapped it would also program the embedded
switch RX filters to reflect the new mac address.
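
Roughly, the shell side of that path looks like the following sketch (modeled
on the existing vmxnet3 driver; the helper and macro names are approximate):

/* The guest-visible half: write the new MAC to the emulated BAR1 registers.
 * The write is trapped by the vmxnet3 device emulation, which updates the
 * virtual switch port and, if a VF is mapped, reprograms the embedded
 * switch RX filters through the PF driver. */
static void shell_write_mac(struct vmxnet3_adapter *adapter, const u8 *mac)
{
	u32 lo = mac[0] | (mac[1] << 8) | (mac[2] << 16) | (mac[3] << 24);
	u32 hi = mac[4] | (mac[5] << 8);

	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACL, lo);
	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_MACH, hi);
}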

 
 The plugin image is provided by the IHVs along with the PF driver and is
 packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
 either into a Linux VM or a Windows VM. The plugin is written against the
 Shell API interface which the shell is responsible for implementing. The API
 interface allows the plugin to do TX and RX only by programming the hardware
 rings (along with things like buffer allocation and basic initialization).
 The virtual machine comes up in paravirtualized/emulated mode when it is
 booted. The hypervisor allocates the VF and other resources and notifies the
 shell of the availability of the VF. The hypervisor injects the plugin into
 memory location specified by the shell. The shell initializes the plugin by
 calling into a known entry point and the plugin initializes the data path.
 The control path is already initialized by the PF driver when the VF is
 allocated. At this point the shell switches to using the loaded plugin to do
 all further TX and RX operations. The guest networking stack does not
 participate in these operations and continues to function normally. All the
 control operations continue being trapped by the hypervisor and are directed
 to the PF driver as needed. For example, if the MAC address changes the
 hypervisor updates its internal state and changes the state of the embedded
 switch as well through the PF control API.
 
 
 This is essentially a miniature network stack with its own mini 
 bonding layer, mini hotplug, and mini API, except s/API/ABI/.  Is this a 
 correct view?

To some extent yes, but there is no complicated bonding, nor is there anything
like PCI hotplug. The shell interface is small and the OS always interacts
with the shell as the main driver. The plugin changes based on the underlying
VF, and the plugin itself is really small: our vmxnet3 s/w plugin is about
1300 lines with whitespace and comments, and the Intel Kawela plugin is about
1100 lines with whitespace and comments. The design principle is to put more
of the complexity related to initialization/control into the PF driver rather
than into the plugin.

 
 If so, the Linuxy approach would be to use the ordinary drivers and the 
 Linux networking API, and hide the bond setup using namespaces.  The 
 bond driver, or perhaps a new, similar, driver can be enhanced to 
 propagate ethtool commands to its (hidden) components, and to have a 
 control channel with the hypervisor.
 
 This would make the approach hypervisor agnostic, you're just pairing 
 two devices and presenting them to the rest of the stack as a single device.
 
 We have reworked our existing Linux vmxnet3 driver to accommodate NPA by
 splitting the driver into two parts: Shell and Plugin. The new split driver is
 
 
 So the Shell would be the reworked or new bond driver, and Plugins would 
 be ordinary Linux network drivers.

In NPA we do not rely on the guest OS to provide any of these services like

Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Arnd Bergmann
On Wednesday 05 May 2010 19:47:10 Pankaj Thakkar wrote:
  
  Forget about the licensing.  Loading binary blobs written to a shim
  layer is a complete pain in the ass and totally unsupportable, and
  also uninteresting because of the overhead.
 
 [PT] Why do you think it is unsupportable? How different is it from any module
 written against a well maintained interface? What overhead are you talking 
 about?

We have the right number of module loaders in the kernel: one. If you
add another one, you're doubling the amount of code that anyone
working on that code needs to know about.
 
  If you have any interest in developing this further, do:
  
   (1) move the limited VF drivers directly into the kernel tree,
   talk to them through a normal ops vector
 [PT] This assumes that all the VF drivers would always be available.
 Also we have to support windows and our current design supports it
 nicely in an OS agnostic manner.

Your approach assumes that the plugin is always available, which has
exactly the same implications.

   (2) get rid of the whole shim crap and instead integrate the limited
   VF driver with the full VF driver we already have, instead of
   duplicating the code
 [PT] Having a full VF driver adds a lot of dependency on the guest VM
 and this is what NPA tries to avoid.

If you have the limited driver for some hardware that does not have
the real thing, we could still ship just that. I would however guess
that most vendors are interested in not just running in VMware but
also in other hypervisors that still require the full driver, so that
case would be rare, especially in the long run.

Arnd


[PATCH 3/3] virtio: console: Accept console size along with resize control message

2010-05-05 Thread Amit Shah
The VIRTIO_CONSOLE_RESIZE control message sent to us by the host now
contains the new {rows, cols} values for the console. This ensures each
console port gets its own size, and we don't depend on the config-space
rows and cols values at all now.

Signed-off-by: Amit Shah amit.s...@redhat.com
CC: Christian Borntraeger borntrae...@de.ibm.com
CC: linuxppc-...@ozlabs.org
CC: Kusanagi Kouichi sl...@ac.auone-net.jp
---
 drivers/char/virtio_console.c |   13 -
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index ccfe68a..5cab839 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -1194,12 +1194,23 @@ static void handle_control_message(struct ports_device 
*portdev,
 * have to notify the host first.
 */
break;
-   case VIRTIO_CONSOLE_RESIZE:
+   case VIRTIO_CONSOLE_RESIZE: {
+   struct {
+   __u16 rows;
+   __u16 cols;
+   } size;
+
if (!is_console_port(port))
break;
+
+   memcpy(&size, buf->buf + buf->offset + sizeof(*cpkt),
+  sizeof(size));
+   set_console_size(port, size.rows, size.cols);
+
port->cons.hvc->irq_requested = 1;
resize_console(port);
break;
+   }
case VIRTIO_CONSOLE_PORT_OPEN:
port->host_connected = cpkt->value;
wake_up_interruptible(&port->waitqueue);
-- 
1.6.2.5
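
For reference, a sketch of how the host side could lay out the payload this
patch parses (struct virtio_console_control is the existing control header,
repeated here for completeness; the rest is illustrative, and how the buffer
is queued to the guest is host-specific and not shown):

struct virtio_console_control {
	__u32 id;		/* port number */
	__u16 event;		/* VIRTIO_CONSOLE_RESIZE */
	__u16 value;
};

struct resize_msg {
	struct virtio_console_control cpkt;
	__u16 rows;
	__u16 cols;
} __attribute__((packed));

static void fill_resize_msg(struct resize_msg *msg, __u32 port_id,
			    __u16 rows, __u16 cols)
{
	msg->cpkt.id    = port_id;
	msg->cpkt.event = VIRTIO_CONSOLE_RESIZE;
	msg->cpkt.value = 0;
	msg->rows = rows;	/* the guest memcpy above reads these two */
	msg->cols = cols;
}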



[PATCH RFC] virtio: put last seen used index into ring itself

2010-05-05 Thread Michael S. Tsirkin
Generally, the Host end of the virtio ring doesn't need to see where
Guest is up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, host can reduce the number of interrupts by detecting
that the guest is currently handling previous buffers.

Fortunately, we have room to expand: the ring is always a whole number
of pages and there's hundreds of bytes of padding after the avail ring
and the used ring, whatever the number of descriptors (which must be a
power of 2).

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

This is based on a patch by Rusty Russell, with the main difference
being that we dedicate a feature bit to guest to tell the host it is
writing the used index.  This way we don't need to force host to publish
the last available index until we have a use for it.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

Rusty,
this is a simplified form of a patch you posted in the past.
I have a vhost patch that, using this feature, shows external
to host bandwidth grow from 5 to 7 GB/s, by avoiding
an interrupt in the window after previous interrupt
was sent and before interrupts were disabled for the vq.
With vhost under some external to host loads I see
this window being hit about 30% sometimes.

I'm finalizing the host bits and plan to send
the final version for inclusion when all's ready,
but I'd like to hear comments meanwhile.
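
To illustrate, the host-side check this feature enables is roughly the
following (a sketch only, not the actual vhost code):

/* old_used_idx is used->idx before the host appended new entries;
 * guest_last_used is the index the guest published into the ring. */
static bool need_interrupt(u16 guest_last_used, u16 old_used_idx)
{
	/* If the guest had already consumed everything we previously
	 * published, it may be about to re-enable interrupts and sleep, so
	 * signal it.  Otherwise it is still walking the used ring and will
	 * pick up the new entries on its own.  (There is an inherent race
	 * window here, which is why this is a statistical win rather than a
	 * guarantee.) */
	return guest_last_used == old_used_idx;
}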

 drivers/virtio/virtio_ring.c |   28 +---
 include/linux/virtio_ring.h  |   14 +-
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1ca8890..7729aba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -89,9 +89,6 @@ struct vring_virtqueue
/* Number we've added since last sync. */
unsigned int num_added;
 
-   /* Last used index we've seen. */
-   u16 last_used_idx;
-
/* How to notify other side. FIXME: commonalize hcalls! */
void (*notify)(struct virtqueue *vq);
 
@@ -285,12 +282,13 @@ static void detach_buf(struct vring_virtqueue *vq, 
unsigned int head)
 
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-   return vq->last_used_idx != vq->vring.used->idx;
+   return *vq->vring.last_used_idx != vq->vring.used->idx;
 }
 
 void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_used_elem *u;
void *ret;
unsigned int i;
 
@@ -307,12 +305,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned 
int *len)
return NULL;
}
 
-   /* Only get used array entries after they have been exposed by host. */
-   virtio_rmb();
-
-   i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-   *len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+   /* Only get used array entries after they have been exposed by host.
+* Need mb(), not just rmb() because we write last_used_idx below. */
+   virtio_mb();

+   u = &vq->vring.used->ring[*vq->vring.last_used_idx % vq->vring.num];
+   i = u->id;
+   *len = u->len;
if (unlikely(i >= vq->vring.num)) {
BAD_RING(vq, "id %u out of range\n", i);
return NULL;
@@ -325,7 +324,8 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int 
*len)
/* detach_buf clears data, so grab it now. */
ret = vq->data[i];
detach_buf(vq, i);
-   vq->last_used_idx++;
+   (*vq->vring.last_used_idx)++;
+
END_USE(vq);
return ret;
 }
@@ -431,7 +431,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
vq->vq.name = name;
vq->notify = notify;
vq->broken = false;
-   vq->last_used_idx = 0;
+   *vq->vring.last_used_idx = 0;
vq->num_added = 0;
list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -440,6 +440,10 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 
+   /* We publish used index whether Host offers it or not: if not, it's
+* junk space anyway.  But calling this acknowledges the feature. */
+   virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_USED);
+
/* No callback?  Tell other side not to bother us. */
if (!callback)
vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -473,6 +477,8 @@ void vring_transport_features(struct virtio_device *vdev)
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
break;
+   case VIRTIO_RING_F_PUBLISH_INDICES:
+   break;
default:
/* We don't understand this 

Re: question on virtio

2010-05-05 Thread Michael S. Tsirkin
On Wed, May 05, 2010 at 02:40:15PM -0500, Anthony Liguori wrote:
 On 05/05/2010 06:09 AM, Michael S. Tsirkin wrote:
 Hi!
 I see this in virtio_ring.c:

  /* Put entry in available array (but don't update avail->idx *
 until they do sync). */

 Why is it done this way?
 It seems that updating the index straight away would be simpler, while
 this might allow the host to speculatively look up the buffer and handle
 it, without waiting for the kick.


 It should be okay as long as you don't update idx for partial vectors.

 Regards,

 Anthony Liguori

Sorry, what do you mean by partial vectors here?


Re: [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-05 Thread Dor Laor
On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
 Generally, the Host end of the virtio ring doesn't need to see where
 Guest is up to in consuming the ring.  However, to completely understand
 what's going on from the outside, this information must be exposed.
 For example, host can reduce the number of interrupts by detecting
 that the guest is currently handling previous buffers.

 Fortunately, we have room to expand: the ring is always a whole number
 of pages and there's hundreds of bytes of padding after the avail ring
 and the used ring, whatever the number of descriptors (which must be a
 power of 2).

 We add a feature bit so the guest can tell the host that it's writing
 out the current value there, if it wants to use that.

 This is based on a patch by Rusty Russell, with the main difference
 being that we dedicate a feature bit to guest to tell the host it is
 writing the used index.  This way we don't need to force host to publish
 the last available index until we have a use for it.

 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---

 Rusty,
 this is a simplified form of a patch you posted in the past.
 I have a vhost patch that, using this feature, shows external
 to host bandwidth grow from 5 to 7 GB/s, by avoiding

You mean external to guest I guess.

We have a similar issue with virtio-blk - when using very fast 
multi-spindle storage on the host side, there are too many irq injection 
events. This patch should probably reduce them a lot.
The principle exactly matches the Xen ring.

 an interrupt in the window after previous interrupt
 was sent and before interrupts were disabled for the vq.
 With vhost under some external to host loads I see
 this window being hit about 30% sometimes.

 I'm finalizing the host bits and plan to send
 the final version for inclusion when all's ready,
 but I'd like to hear comments meanwhile.


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Arnd Bergmann
On Wednesday 05 May 2010 22:36:31 Dmitry Torokhov wrote:
 
 On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
If you have any interest in developing this further, do:

 (1) move the limited VF drivers directly into the kernel tree,
 talk to them through a normal ops vector
   
   [PT] This assumes that all the VF drivers would always be available.
   Also we have to support windows and our current design supports it
   nicely in an OS agnostic manner.
  
  Your approach assumes that the plugin is always available, which has
  exactly the same implications.
 
 Since plugin[s] are carried by the host they are indeed always
 available.

But what makes you think that you can build code that can be linked
into arbitrary future kernel versions? The kernel does not define any
calling conventions that are stable across multiple versions or
configurations. For example, you'd have to provide different binaries
for each combination of

- 32/64 bit code
- gcc -mregparm=?
- lockdep
- tracepoints
- stackcheck
- NOMMU
- highmem
- whatever new gets merged

If you build the plugins only for specific versions of enterprise Linux
kernels, the code becomes really hard to debug and maintain.
If you wrap everything in your own version of the existing interfaces, your
code gets bloated to the point of being unmaintainable.

So I have to correct myself: this is very different from assuming the
driver is available in the guest, it's actually much worse.

Arnd


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Dmitry Torokhov
On Wednesday 05 May 2010 10:23:16 am Christoph Hellwig wrote:
 On Tue, May 04, 2010 at 04:02:25PM -0700, Pankaj Thakkar wrote:
  The plugin image is provided by the IHVs along with the PF driver and is
  packaged in the hypervisor. The plugin image is OS agnostic and can be
  loaded either into a Linux VM or a Windows VM. The plugin is written
  against the Shell API interface which the shell is responsible for
  implementing. The API
 
 We're not going to add any kind of loader for binary blobs into kernel
 space, sorry.  Don't even bother wasting your time on this.
 

It would not be a binary blob but software properly released under the GPL.
The current plan is for the shell to enforce the GPL requirement on the
plugin code, similar to what the module loader does for regular kernel
modules.

-- 
Dmitry


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Dmitry Torokhov
On Wednesday 05 May 2010 10:31:20 am Christoph Hellwig wrote:
 On Wed, May 05, 2010 at 10:29:40AM -0700, Dmitry Torokhov wrote:
   We're not going to add any kind of loader for binary blobs into kernel
   space, sorry.  Don't even bother wasting your time on this.
  
  It would not be a binary blob but software properly released under GPL.
  The current plan is for the shell to enforce GPL requirement on the
  plugin code, similar to what module loaded does for regular kernel
  modules.
 
 The mechanism described in the document is loading a binary blob
 coded to an abstract API.

Yes, with the exception that the only body of code that will be
accepted by the shell should be GPL-licensed and thus open and available
for examining. This is not different from having a standard kernel
module that is loaded normally and plugs into a certain subsystem.
The difference is that the binary resides not on guest filesystem
but elsewhere.

 
 That's something entirely different from having normal modules for
 the Virtual Functions, which we already have for various pieces of
 hardware anyway.

-- 
Dmitry


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-05 Thread Dmitry Torokhov
On Wednesday 05 May 2010 01:09:48 pm Arnd Bergmann wrote:
   If you have any interest in developing this further, do:
   
(1) move the limited VF drivers directly into the kernel tree,
talk to them through a normal ops vector
  
  [PT] This assumes that all the VF drivers would always be available.
  Also we have to support windows and our current design supports it
  nicely in an OS agnostic manner.
 
 Your approach assumes that the plugin is always available, which has
 exactly the same implications.

Since plugin[s] are carried by the host they are indeed always
available.

-- 
Dmitry


Re: virtio: put last_used and last_avail index into ring itself.

2010-05-05 Thread Rusty Russell
On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
  virtio: put last_used and last_avail index into ring itself.
  
  Generally, the other end of the virtio ring doesn't need to see where
  you're up to in consuming the ring.  However, to completely understand
  what's going on from the outside, this information must be exposed.
  For example, if you want to save and restore a virtio_ring, but you're
  not the consumer because the kernel is using it directly.
  
  Fortunately, we have room to expand: the ring is always a whole number
  of pages and there's hundreds of bytes of padding after the avail ring
  and the used ring, whatever the number of descriptors (which must be a
  power of 2).
  
  We add a feature bit so the guest can tell the host that it's writing
  out the current value there, if it wants to use that.
  
  Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
 I've been looking at this patch some more (more on why
 later), and I wonder: would it be better to add some
 alignment to the last used index address, so that
 if we later add more stuff at the tail, it all
 fits in a single cache line?

In theory, but not in practice.  We don't have many rings, so the
difference between 1 and 2 cache lines is not very much.

 We use a new feature bit anyway, so layout change should not be
 a problem.
 
 Since I raised the question of caches: for used ring,
 the ring is not aligned to 64 bit, so on CPUs with 64 bit
 or larger cache lines, used entries will often cross
 cache line boundaries. Am I right and might it
 have been better to align ring entries to cache line boundaries?
 
 What do you think?

I think everyone is settled on 128-byte cache lines for the foreseeable
future, so it's not really an issue.

Cheers,
Rusty.


Re: [PATCH 1/1] staging: hv: Add Time Sync feature to hv_utils module

2010-05-05 Thread Joe Perches
On Wed, 2010-05-05 at 19:23 +, Haiyang Zhang wrote:
 From: Haiyang Zhang haiya...@microsoft.com
 
 Subject: Add Time Sync feature to hv_utils module.
 The Time Sync feature synchronizes guest time to host UTC time after reboot
 and after restoring from a saved/paused state.
 +static void adj_guesttime(winfiletime_t hosttime, u8 flags)
 +{
 + s64 host_tns;
 + struct timespec host_ts;
 + static s32 scnt = 50;

Why a maximum of 50 samples?

 + host_tns = (hosttime - WLTIMEDELTA) * 100;
 + host_ts = ns_to_timespec(host_tns);
 +
 + if ((flags & ICTIMESYNCFLAG_SYNC) != 0) {
 + do_settimeofday(&host_ts);
 + return;
 + }
 +
 + if ((flags & ICTIMESYNCFLAG_SAMPLE) != 0 &&
 + scnt > 0) {
 + scnt--;
 + do_settimeofday(&host_ts);
 + }

It might be better to do something like this,
so the ns_to_timespec isn't performed when unnecessary.

static void settimeofday(winfiletime_t hosttime)
{
	s64 host_tns = (hosttime - WLTIMEDELTA) * 100;
	struct timespec host_ts = ns_to_timespec(host_tns);

	do_settimeofday(&host_ts);
}

static void adj_guesttime(winfiletime_t hosttime, u8 flags)
{
	static s32 scnt = 50;

	if ((flags & ICTIMESYNCFLAG_SYNC) != 0) {
		settimeofday(hosttime);
		return;
	}

	if ((flags & ICTIMESYNCFLAG_SAMPLE) != 0 && scnt > 0) {
		scnt--;
		settimeofday(hosttime);
	}
}





Re: question on virtio

2010-05-05 Thread Rusty Russell
On Wed, 5 May 2010 08:39:47 pm Michael S. Tsirkin wrote:
 Hi!
 I see this in virtio_ring.c:
 
 /* Put entry in available array (but don't update avail->idx *
  until they do sync). */
 
 Why is it done this way?
 It seems that updating the index straight away would be simpler, while
 this might allow the host to speculatively look up the buffer and handle
 it, without waiting for the kick.

I agree.  From my TODO:
    what if we actually expose in ->add_buf?

I don't *think* anyone adds buffers without being ready for them to be used,
so changing this should be safe.
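
Concretely, something like this sketch (publish per buffer in add_buf, leaving
kick as just the barrier plus notify; a sketch only, not tested code):

/* in add_buf(): expose the entry immediately */
vq->vring.avail->ring[vq->vring.avail->idx % vq->vring.num] = head;
virtio_wmb();			/* descriptor chain visible first */
vq->vring.avail->idx++;		/* host may start on it speculatively */

/* in kick(): nothing left to publish, just notify */
virtio_mb();
if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
	vq->notify(&vq->vq);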

Want to give it a try and report back?

Thanks!
Rusty.


RE: [PATCH 1/1] staging: hv: Add Time Sync feature to hv_utils module

2010-05-05 Thread Joe Perches
On Thu, 2010-05-06 at 01:42 +, Haiyang Zhang wrote:
  Why a maximum of 50 samples?
 After reboot, the flag ICTIMESYNCFLAG_SYNC is included in the
 first time-sync message after the timesync channel is opened. Since the
 hv_utils module is loaded after hv_vmbus, the first message is usually
 missed. The other thing is, the system time is automatically set from the
 emulated hardware clock, which may not be UTC time or in the same time
 zone. So, to override these effects, we use the first 50 time samples
 for the initial system time setting.

I suggest putting that in a commit message or a code comment.

cheers, Joe




Re: [PATCH 0/3] virtio: console: Handle multiple console port resizes

2010-05-05 Thread Rusty Russell
On Thu, 6 May 2010 06:05:06 am Amit Shah wrote:
 Hello,
 
 This series adds resize support for multiple console ports. The size
 for each console is stored in its structure and the host informs the
 guest about size changes via the VIRTIO_CONSOLE_RESIZE control
 message.

Thanks, applied!

Rusty.


Re: [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-05 Thread Rusty Russell
On Thu, 6 May 2010 06:28:14 am Michael S. Tsirkin wrote:
 Rusty,
 this is a simplified form of a patch you posted in the past.
 I have a vhost patch that, using this feature, shows external
 to host bandwidth grow from 5 to 7 GB/s, by avoiding
 an interrupt in the window after previous interrupt
 was sent and before interrupts were disabled for the vq.
 With vhost under some external to host loads I see
 this window being hit about 30% sometimes.

Fascinating.  So you use this to guess if the guest is still processing?
I haven't thought about it hard, but is that racy?

Obviously happy to apply this when you finalize it.

Thanks!
Rusty.