[PATCH resend] block: silently error unsupported empty barriers too

2009-09-03 Thread Mark McLoughlin
With 2.6.31-rc5 in a KVM guest using dm and virtio_blk, we see the
following errors:

  end_request: I/O error, dev vda, sector 0
  end_request: I/O error, dev vda, sector 0

The errors go away if dm stops submitting empty barriers, by reverting:

  commit 52b1fd5a27c625c78373e024bf570af3c9d44a79
  Author: Mikulas Patocka mpato...@redhat.com
dm: send empty barriers to targets in dm_flush

We should error all barriers, even empty barriers, on devices like
virtio_blk which don't support them.

See also:

  https://bugzilla.redhat.com/514901

Signed-off-by: Mark McLoughlin mar...@redhat.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Mikulas Patocka mpato...@redhat.com
Cc: Alasdair G Kergon a...@redhat.com
Cc: Neil Brown ne...@suse.de
Cc: Christoph Hellwig h...@infradead.org
---
 block/blk-core.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..35ad2bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1163,8 +1163,7 @@ static int __make_request(struct request_queue *q, struct 
bio *bio)
const int unplug = bio_unplug(bio);
int rw_flags;
 
-   if (bio_barrier(bio)  bio_has_data(bio) 
-   (q-next_ordered == QUEUE_ORDERED_NONE)) {
+   if (bio_barrier(bio)  (q-next_ordered == QUEUE_ORDERED_NONE)) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
-- 
1.6.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 3/4] qemu-kvm: vhost-net implementation

2009-08-20 Thread Mark McLoughlin
Hi Michael,

Some random high-level comments:

  - I had expected this to be available as:

  -net raw,ifname=eth2 -net nic,model=virtio

I'd prefer it this way, because it means you can use this mode even 
without vhost and it's ties in better with the way all other qemu 
networking modes work.

Like the VNET_HDR support, the detect that there is only one 
backend on the vlan, and that it is a raw socket code will be a 
hack until we have command line options that don't involve vlans.

Open question whether we the user should explicitly have to request
vhost acceleration or we use it if it is available. The latter
sounds nice, but it hasn't turned out great for kvm.

  - CAP_NET_ADMIN is needed for raw sockets, so for e.g. libvirt I 
think we need to be able to support passing the raw socket fd via 
the command line and the monitor interface. I don't think we need 
that for the vhost fd, it should be safe to allow unprivileged 
users access to that, I think.

  - I thought this version supports migration, but I don't see where 
that is handled in the code?

  - You're not using the errors eventfd?

I've taken a first look through the kernel patch, hopefully I'll post
decent comments for that soon, but in the meantime:

  - I think /dev/vhost makes more sense - we shouldn't need to add
another character device if we implement kernel backends for other
virtio devices

  - I'd really like vhost to support a 'tap' mode, so that we can still 
use a bridge if a NIC isn't available to be assigned. It would 
result in this stuff getting much more testing. Options I see:

   1) Add tap-like functionality to vhost
   2) Add VHOST_NET_SET_TAP
   3) Just tell people to set up a tap and bind a raw socket too it

 IMHO, (2) makes the most sense - it should be much less exta kernel
code than (1), and it would be much more convenient than (3)

  - Distros will need a udev rule to set the device perms to 0666

  - Distros will also need some way to have the module loaded if its 
not built-in; best I can think of in Fedora is to stick it in
/etc/sysconfig/modules/kvm.modules

What would be nicer is if loading the kvm module could cause vhost 
to be loaded. It's nice that vhost can be used without kvm, but I 
think if kvm is loaded it's just very convenient to load vhost too. 
See also the problems we've cause with needing mkinitrd/dracut 
hacks by not having KVM_GUEST require VIRTIO_PCI

On Mon, 2009-08-17 at 15:37 +0300, Michael S. Tsirkin wrote:
 This adds support for vhost-net virtio kernel backend.
 To enable (assuming device eth2):
 1. enable promisc mode or program guest mac in device eth2

Why can't vhost do this itself?

 2. disable tso, gso, lro, jumbo frames on the card
(disabling lro + jumbo frames should be sufficient,
 haven't tested this)

And this.

If we leave that up to the user or the management app, we need to expose
to them what features vhost supports so that they can know in future to
stop disabling them.

 3. add vhost=eth2 to -net flag
 4. run with CAP_NET_ADMIN priviledge (e.g. root)
 
 This patch is RFC, but works without issues for me.
 
 It still needs to be split up, tested and benchmarked properly,
 but posting it here in case people want to test drive
 the kernel bits I posted.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  Makefile.target |3 +-
  hw/vhost_net.c  |  181 
 +++
  hw/vhost_net.h  |   30 +
  hw/virtio-net.c |   32 ++-
  hw/virtio-pci.c |   40 
  hw/virtio.c |   19 --
  hw/virtio.h |   28 -
  net.c   |6 ++-
  net.h   |1 +
  qemu-kvm.c  |8 ---
  qemu-kvm.h  |9 +++
  11 files changed, 324 insertions(+), 33 deletions(-)
  create mode 100644 hw/vhost_net.c
  create mode 100644 hw/vhost_net.h
 
 diff --git a/Makefile.target b/Makefile.target
 index f6d9708..e941a36 100644
 --- a/Makefile.target
 +++ b/Makefile.target
 @@ -170,7 +170,8 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o 
 machine.o \
  gdbstub.o gdbstub-xml.o msix.o ioport.o qemu-config.o
  # virtio has to be here due to weird dependency between PCI and virtio-net.
  # need to fix this properly
 -obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o 
 virtio-pci.o
 +obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o 
 virtio-pci.o \
 + vhost_net.o
  obj-$(CONFIG_KVM) += kvm.o kvm-all.o
  
  LIBS+=-lz
 diff --git a/hw/vhost_net.c b/hw/vhost_net.c
 new file mode 100644
 index 000..7d52de0
 --- /dev/null
 +++ b/hw/vhost_net.c
 @@ -0,0 +1,181 @@
 +#include sys/eventfd.h
 +#include sys/socket.h
 +#include linux/kvm.h
 +#include fcntl.h
 +#include sys/ioctl.h
 +#include linux/vhost.h
 +#include linux/virtio_ring.h
 +#include netpacket/packet.h
 +#include net/ethernet.h
 +#include 

[PATCH] block: silently error unsupported empty barriers too

2009-08-06 Thread Mark McLoughlin
With 2.6.31-rc5 in a KVM guest using dm and virtio_blk, we see the
following errors:

  end_request: I/O error, dev vda, sector 0
  end_request: I/O error, dev vda, sector 0

The errors go away if dm stops submitting empty barriers, by reverting:

  commit 52b1fd5a27c625c78373e024bf570af3c9d44a79
  Author: Mikulas Patocka mpato...@redhat.com
dm: send empty barriers to targets in dm_flush

We should error all barriers, even empty barriers, on devices like
virtio_blk which don't support them.

See also:

  https://bugzilla.redhat.com/514901

Signed-off-by: Mark McLoughlin mar...@redhat.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Mikulas Patocka mpato...@redhat.com
Cc: Alasdair G Kergon a...@redhat.com
Cc: Neil Brown ne...@suse.de
---
 block/blk-core.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..35ad2bb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1163,8 +1163,7 @@ static int __make_request(struct request_queue *q, struct 
bio *bio)
const int unplug = bio_unplug(bio);
int rw_flags;
 
-   if (bio_barrier(bio)  bio_has_data(bio) 
-   (q-next_ordered == QUEUE_ORDERED_NONE)) {
+   if (bio_barrier(bio)  (q-next_ordered == QUEUE_ORDERED_NONE)) {
bio_endio(bio, -EOPNOTSUPP);
return 0;
}
-- 
1.6.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio-pci: correctly unregister root device on error

2009-07-07 Thread Mark McLoughlin
If pci_register_driver() fails we're incorrectly unregistering the root
device with device_unregister() rather than root_device_unregister().

Reported-by: Don Zickus dzic...@redhat.com
Signed-off-by: Mark McLoughlin mar...@redhat.com
---
 drivers/virtio/virtio_pci.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 330aacb..2cafa62 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -440,7 +440,7 @@ static int __init virtio_pci_init(void)
 
err = pci_register_driver(virtio_pci_driver);
if (err)
-   device_unregister(virtio_pci_root);
+   root_device_unregister(virtio_pci_root);
 
return err;
 }
-- 
1.6.2.5

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-17 Thread Mark McLoughlin
On Tue, 2009-06-16 at 19:44 +0100, Jamie Lokier wrote:
 Mark McLoughlin wrote:
   Worst case we hardcode those numbers (gasp, faint).
  
  Maybe we can just add the open slots to the -help output. That'd be nice
  and clean.

I was being sarcastic - libvirt currently must parse qemu -help, and
even has some test infrastructure to check that it works with various
versions of qemu. Extending this would not be nice and clean :-)

 I particularly don't like the idea of arcane machine-dependent slot
 allocation knowledge living in libvirt, because it needs to be in Qemu
 anyway for non-libvirt users.  No point in having two implementations
 of something tricky and likely to have machine quirks, if one will do.

Indeed.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-16 Thread Mark McLoughlin
On Mon, 2009-06-15 at 13:12 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  So long as the restrictions would be known to the management app via
  some what slots are available mechanism in qemu, that sounds fine.

 
 I'm not sure a what slots are available mechanism is as straight 
 forward as has been claimed.

If qemu can't provide that information, then the management app does not
have sufficient information to do the slot allocation itself. In which
case, it must leave it up to qemu to do it.

 It doesn't matter though because it's orthogonal to the current proposal.

It is not orthogonal to solving the actual problem at hand, though -
i.e. how to allow management apps to provide stable PCI addresses.

  I'm not at all arguing against pci_addr.  I'm arguing about how 
  libvirt should use it with respect to the genesis use-case where 
  libvirt has no specific reason to choose one PCI slot over another.  
  In that case, I'm merely advocating that we want to let QEMU make the 
  decision.

  However this may end up, isn't it offtopic?  Whatever we do we have to 
  support both pci_addr= and default placement, so we can push this 
  discussion to livirt-devel and bid them godspeed.
  
 
  Presumably you're not proposing that qemu-devel completely ignore the
  typical requirements of management apps?

 
 This is a happy case where the current proposals allow both usages to 
 occur.  Which one libvirt chooses it up to it.
 
 To summarize, I think we have:
 
 1) Introduce addressing to all host device configurations
   - Either in the canonical form pci_addr=bus:dev.fn or target=3,lun=1 
 or in flattened form addr=bus:dev.fn or addr=target.lun.  I prefer the 
 later form but I think either would be acceptable.

That helps, but it's not enough on its own.

The management app needs to figure out what addresses to pass either by:

   a) Initially allowing qemu to do the address allocation, and 
  thereafter using those addresses - this requires some way to query
  the addresses of devices

or b) Doing the initial address allocation itself - this requires some 
  way to query what slots are available.

 2) Whenever the default machine type changes in a guest-visible way, 
 introduce a new machine type
   - Use explicit versions in name: pc-v1, pc-v2 or use more descriptive 
 names pc-with-usb
   - Easily transitions to device config files

To be clear - you're not proposing this is a solution to the stable PCI
addresses problem, are you? The main requirement is for the addresses
to stay stable even if the user adds/removes other devices.

This is a fine solution to the stable guest ABI problem ... assuming
there's some way of querying the current default machine type.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Mark McLoughlin
On Sun, 2009-06-14 at 12:50 +0300, Michael S. Tsirkin wrote:
 On Fri, Jun 12, 2009 at 05:48:23PM +0100, Mark McLoughlin wrote:
  However, in order to retain compat for that SCSI device (e.g. ensuring
  the PCI address doesn't change as other devices are added an removed),
  we're back to the same problem ... either:
  
1) Use '-drive file=foo.img,if=scsi,pci_addr=foo'; in order to figure 
   out what address to use, libvirt would need to query qemu for what 
   address was originally allocated to device or it would do all the 
   PCI address allocation itself ... 
 
 This last option makes sense to me: in a real world the user has
 control over where he places the device on the bus, so why
 not with qemu?

Yep, most people seem to agree that it makes sense to allow this, but
some believe it should only be via a machine description file, not the
command line.

However, the first problem is that it isn't a solution to the guest ABI
problem more generally.

And the second problem is that for e.g. libvirt to use it, it would have
to be possible to query qemu for what PCI slots were assigned to the
devices - libvirt would need to be able to parse 'info pci' and match
the devices listed with the devices specified on the command line.

Again, details written up here:

  https://fedoraproject.org/wiki/Features/KVM_Stable_PCI_Addresses

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Mark McLoughlin
On Sun, 2009-06-14 at 10:58 +0300, Avi Kivity wrote:
 Mark McLoughlin wrote:
 
   
 
  I think the point is that you don't need version numbers if you have a 
  proper device tree.
  
 
  How do you add a new attribute to the device tree and, when a supplied
  device tree lacking said attribute, distinguish between a device tree
  from an old version of qemu (i.e. use the old default) and a partial
  device tree from the VM manager (i.e. use the new default) ?

 
 -baseline 0.10

That's a version number :-)

(I was responding to Anthony's you don't need a version number)

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Mark McLoughlin
On Mon, 2009-06-15 at 07:48 -0500, Anthony Liguori wrote:
 Avi Kivity wrote:
  On 06/15/2009 12:09 PM, Mark McLoughlin wrote:
  I think the point is that you don't need version numbers if you 
  have a
  proper device tree.
 
   
  How do you add a new attribute to the device tree and, when a supplied
  device tree lacking said attribute, distinguish between a device tree
  from an old version of qemu (i.e. use the old default) and a partial
  device tree from the VM manager (i.e. use the new default) ?
 
 
  -baseline 0.10
   
 
  That's a version number :-)
 
  (I was responding to Anthony's you don't need a version number)
 
 
  If you want to prevent incompatibilities, you need to make everything 
  new (potentially including bugfixes) non-default.

No need to punish new guests in order to maintain compatibility for old
guests.

  Eventually the 
  default configuration becomes increasingly unusable and you need a new 
  baseline.  You must still be able to fall back to the old baseline for 
  older guests.  I don't think games with configuration files can hide 
  that.
 
 -M pc1
 -M pc2
 
 etc.
 
 This is pretty easy to maintain with config files.

I think this would be reasonable, but it is essentially just a version
number which you objected to on the basis that it would make
cherry-picking harder for distros.

One thing that would be nice with this '-M pc1' thing would be to retain
'-M pc' as a symlink to the latest version. We'd also need a way to read
the symlink too, so that you can query what the current latest version
is and use that in future.

How would this machine type version relate to e.g. changing the default
PCI class of virtio-blk? Would we bump the version number of all machine
types can use virtio-blk?

A per-device version number is workable alternative, but only with a
saveabi type file IMHO.

I've tried to summarise the options here:

  https://fedoraproject.org/wiki/Features/KVM_Stable_Guest_ABI

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Mark McLoughlin
On Mon, 2009-06-15 at 18:05 +0300, Avi Kivity wrote:
 On 06/15/2009 05:24 PM, Anthony Liguori wrote:
  Dor Laor wrote:
  Libvirt does not support r2d. I hope it won't start to support it.
 
  It supports mips, sparc, and ppc machines now.  I don't see why it 
  wouldn't support r2d.  For ppcemb, I expect this same problem to 
  occur.  This sort of restriction is going to be common with embedded 
  boards.
 
 I expect these restrictions will have to be known by the management 
 application.  Otherwise the users will try invalid configurations only 
 to receive errors when they launch them.  GUIs exist to guide users, not 
 as an inefficient means of trial-and-error.

So long as the restrictions would be known to the management app via
some what slots are available mechanism in qemu, that sounds fine.

  I'm not at all arguing against pci_addr.  I'm arguing about how 
  libvirt should use it with respect to the genesis use-case where 
  libvirt has no specific reason to choose one PCI slot over another.  
  In that case, I'm merely advocating that we want to let QEMU make the 
  decision.
 
 However this may end up, isn't it offtopic?  Whatever we do we have to 
 support both pci_addr= and default placement, so we can push this 
 discussion to livirt-devel and bid them godspeed.

Presumably you're not proposing that qemu-devel completely ignore the
typical requirements of management apps?

You can push the discussion to libvirt-devel, and the conclusion would
most likely be:

  We can do slot allocation if you provide us with a way to query free 
   slots, or we can use qemu's default allocation if you provide us a
   way to query the allocation.

   We'd prefer the default allocation problem, but we don't really 
   care. Both require about the same amount of work for us.

libvirt was only mentioned in this thread as a concrete example of how
the suggested solutions would actually be used by management apps.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Mark McLoughlin
On Mon, 2009-06-15 at 18:12 +0300, Dor Laor wrote:
  It doesn't want to. As Mark said, libvirt just wants to be able to ensure
  a stable guest ABI, of which stable PCI addresses is one aspect. This does
  not imply libvirt wants to allocate the PCI addresses, just that it wants
  a way to keep them stable. All else being equal I'd rather libvirt wasn't
  in the PCI address allocation business.

 
 It's not about what libvirt wants. It's about what will serve the end 
 user the most.

Absolutely. And not just about what most helps end users of libvirt
based management apps, but also any app managing qemu.

 Apart for stable guest ABI, end users need to have the option to
 control the slot for their devices. Just like them have for physical
 machines. It's not theoretical discussion, limiting issues with shared
 irq is one real life example.

Providing end users with the *option* to choose PCI slots sounds like a
fine feature request for any management app.

Requiring all management apps to force end users to explicitly choose
PCI slots in order for slots to be stable is not so reasonable.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Wed, 2009-06-10 at 20:27 +0100, Jamie Lokier wrote:
 Michael S. Tsirkin wrote:
   I think the right long term answer to all this is a way to get QEMU to
   dump it's current machine configuration in glorious detail as a file
   which can be reloaded as a machine configuration.
  
  And then we'll have the same set of problems there.
 
 We will, and the solution will be the same: options to create devices
 as they were in older versions of QEMU.  It only needs to cover device
 features which matter to guests, not every bug fix.
 
 However with a machine configuration which is generated by QEMU,
 there's less worry about proliferation of obscure options, compared
 with the command line.  You don't necessarily have to document every
 backward-compatibility option in any detail, you just have to make
 sure it's written and read properly, which is much the same thing as
 the snapshot code does.

This is a sensible plan, but I don't think we should mix these compat
options in with the VM manager supplied configuration.

There are two problems with that approach.

= Problem 1 - VM manager needs to parse qemu config =

Your proposal implies:

  - VM manager supplies a basic configuration to qemu

  - It then immediately asks qemu for a dump of the machine 
configuration in all its glorious detail and retains that
config

  - If the VM manager wishes to add a new device it needs to parse the 
qemu config and add it, rather than just generate an entirely new 
config

= Problem 2 - We can't predict the future =

If a VM manager supplies a configuration which is missing any given
option, qemu cannot tell the difference between:

  - This is a basic config, the VM manager wants whatever the default 
of the current qemu version

  - This is a complete config dumped using an old version of qemu, the 
VM manager wants the old default

= Solution - Separate configuration from compat hints =

As I suggested before:

  - Allow the VM manager to dump compat hints; this would be an opaque 
file format, more like the savevm format than a config file

  - Use defaults where compat hints are not available; e.g. if the VM 
manager specifies a device config, but no compat hints are 
supplied for it, then just use default values

  - Make the config override compat hints; e.g. if there are compat 
hints specified for a device not included in the machine config, 
just ignore those hints

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 16:59 +0300, Michael S. Tsirkin wrote:
 On Fri, Jun 12, 2009 at 09:43:29AM +0100, Mark McLoughlin wrote:
  = Solution - Separate configuration from compat hints =
  
  As I suggested before:
  
- Allow the VM manager to dump compat hints; this would be an opaque 
  file format, more like the savevm format than a config file
 
 Why make it like the savevm format then?
 If they are opaque anyway, compat hints could be part of savevm format.

So a savevm --only-compat-hints command? It might make sense, since we
would want the compat hints with savevm too.

- Use defaults where compat hints are not available; e.g. if the VM 
  manager specifies a device config, but no compat hints are 
  supplied for it, then just use default values
  
- Make the config override compat hints; e.g. if there are compat 
  hints specified for a device not included in the machine config, 
  just ignore those hints
  
  Cheers,
  Mark.
 
 If compat hints are opaque and only editable by qemu, we can get into a
 situation where one can't create a specific setup with a new qemu.

An example?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 09:55 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  On Wed, 2009-06-10 at 20:27 +0100, Jamie Lokier wrote:

  = Solution - Separate configuration from compat hints =
 
  As I suggested before:
 
- Allow the VM manager to dump compat hints; this would be an opaque 
  file format, more like the savevm format than a config file

 
 How is compat hints different from a device tree?
 
 In my mind, that's what compat hints is.  I don't see another sane way 
 to implement it.

A device tree with a different purpose than a config file.

In its simplest form it could be a device tree with a version number for
each device[1]. 

The other obvious piece to add to it would be PCI addresses, so that
even if you remove a device, the addresses assigned to existing devices
don't change.

Cheers,
Mark.

[1] - Adding such a per-device version number to the config file would
solve problem (2)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 11:11 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  On Fri, 2009-06-12 at 09:51 -0500, Anthony Liguori wrote:

  Mark McLoughlin wrote:
  
  On Wed, 2009-06-10 at 20:27 +0100, Jamie Lokier wrote:


  Michael S. Tsirkin wrote:
  
  
  I think the right long term answer to all this is a way to get QEMU to
  dump it's current machine configuration in glorious detail as a file
  which can be reloaded as a machine configuration.
  
  
  And then we'll have the same set of problems there.


  We will, and the solution will be the same: options to create devices
  as they were in older versions of QEMU.  It only needs to cover device
  features which matter to guests, not every bug fix.
 
  However with a machine configuration which is generated by QEMU,
  there's less worry about proliferation of obscure options, compared
  with the command line.  You don't necessarily have to document every
  backward-compatibility option in any detail, you just have to make
  sure it's written and read properly, which is much the same thing as
  the snapshot code does.
  
  
  This is a sensible plan, but I don't think we should mix these compat
  options in with the VM manager supplied configuration.
 
  There are two problems with that approach.
 
  = Problem 1 - VM manager needs to parse qemu config =
 
  Your proposal implies:
 
- VM manager supplies a basic configuration to qemu
 
- It then immediately asks qemu for a dump of the machine 
  configuration in all its glorious detail and retains that
  config
 
- If the VM manager wishes to add a new device it needs to parse the 
  qemu config and add it, rather than just generate an entirely new 
  config


  What's the problem with parsing the device config and modifying it?  Is 
  it just complexity?
  
 
  Yes, complexity is the issue.
 

  If we provided a mechanism to simplify manipulating a device config, 
  would that eliminate the concern here?
  
 
  In libvirt's case, a lot of the complexity would come from needing to
  figure out what to change.

 
 Right, libvirt wants to be able to easily say add a scsi block device 
 to this VM.  The way I see this working is that there would be a 
 default pc.dtc.  We would still have a -drive file=foo.img,if=scsi 
 option that would really just be a wrapper around first searching for an 
 existing LSI controller, if one exists, attaching the lun, if not, 
 create one, etc.
 
 libvirt could continue to use this sort of interface.  However, as it 
 wants to do more advanced things, it may have to dive into the device 
 tree itself.
 
 On live migration, QEMU will save a copy of the device tree somewhere 
 and libvirt needs to keep track of it.  It can treat it as opaque.  -M 
 /path/to/foo.dtc -drive file=foo.img,if=scsi should continue working as 
 expected IMHO.

So, when libvirt creates a guest for the first time, it makes a copy of
the device tree and continues to use that even if qemu is upgraded.
That's enough to ensure compat is retained for all built-in devices.

However, in order to retain compat for that SCSI device (e.g. ensuring
the PCI address doesn't change as other devices are added an removed),
we're back to the same problem ... either:

  1) Use '-drive file=foo.img,if=scsi,pci_addr=foo'; in order to figure 
 out what address to use, libvirt would need to query qemu for what 
 address was originally allocated to device or it would do all the 
 PCI address allocation itself ... or:

  2) Don't use the command line, instead get a dump of the entire 
 device tree (including the SCSI device) - if the device is to be 
 removed or modified in future, libvirt would need to modify the 
 device tree

The basic problem would be that the command line config would have very
limited ability to override the device tree config.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 11:12 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  On Fri, 2009-06-12 at 09:55 -0500, Anthony Liguori wrote:

  Mark McLoughlin wrote:
  
  On Wed, 2009-06-10 at 20:27 +0100, Jamie Lokier wrote:

  = Solution - Separate configuration from compat hints =
 
  As I suggested before:
 
- Allow the VM manager to dump compat hints; this would be an opaque 
  file format, more like the savevm format than a config file


  How is compat hints different from a device tree?
 
  In my mind, that's what compat hints is.  I don't see another sane way 
  to implement it.
  
 
  A device tree with a different purpose than a config file.
 
  In its simplest form it could be a device tree with a version number for
  each device[1]. 

 
 I think the point is that you don't need version numbers if you have a 
 proper device tree.

How do you add a new attribute to the device tree and, when a supplied
device tree lacking said attribute, distinguish between a device tree
from an old version of qemu (i.e. use the old default) and a partial
device tree from the VM manager (i.e. use the new default) ?

 NB the device tree contains no host configuration information.

So, it wouldn't e.g. include the path to the image file for a block
device? That would always be specified on the command line?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 12:00 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  So, when libvirt creates a guest for the first time, it makes a copy of
  the device tree and continues to use that even if qemu is upgraded.
  That's enough to ensure compat is retained for all built-in devices.
 
  However, in order to retain compat for that SCSI device (e.g. ensuring
  the PCI address doesn't change as other devices are added an removed),
  we're back to the same problem ... either:
 
1) Use '-drive file=foo.img,if=scsi,pci_addr=foo'; in order to figure 
   out what address to use, libvirt would need to query qemu for what 
   address was originally allocated to device or it would do all the 
   PCI address allocation itself ... or:
 
2) Don't use the command line, instead get a dump of the entire 
   device tree (including the SCSI device) - if the device is to be 
   removed or modified in future, libvirt would need to modify the 
   device tree
 
  The basic problem would be that the command line config would have very
  limited ability to override the device tree config.

 
 After libvirt has done -drive file=foo... it should dump the machine 
 config and use that from then on.

Right - libvirt then wouldn't be able to avoid the complexity of merging
any future changes into the dumped machine config.

 To combined to a single thread...
  How do you add a new attribute to the device tree and, when a supplied
  device tree lacking said attribute, distinguish between a device tree
  from an old version of qemu (i.e. use the old default) and a partial
  device tree from the VM manager (i.e. use the new default) ?

 
 Please define attribute.  I don't follow what you're asking.

e.g. a per-device enable MSI support flag.

If qemu is supplied with a device tree that lacks that flag, does it
enable or disable MSI?

Enable by default is bad - it could be a device tree dumped from an old
version of qemu, so compat would be broken.

Disable by default is bad - it could be a simple device tree supplied by
the user, and the latest features are wanted.

Maybe we want a per-device this is a complete device description flag
and if anything is missing from a supposedly complete description, the
old defaults would be used. A config dumped from qemu would have this
flag set, a config generated by libvirt would not have the flag.

  NB the device tree contains no host configuration information.
  
 
  So, it wouldn't e.g. include the path to the image file for a block
  device? That would always be specified on the command line?

 
 No, the IDE definition would contain some sort of symbolic node name.  A 
 separate mechanism (either command line or host config file) would then 
 link a image file to the symbolic name.

Okay.

 libvirt should really never worry about the machine config file for 
 normal things unless it needs to change what devices are exposed to a guest.

But changing devices *is* normal ... e.g. removing a block device.

Writing out a device tree is not a problem for libvirt (or any other
management tools), it's the need to merge changes into an existing
device tree is where the real complexity would lie.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-12 Thread Mark McLoughlin
On Fri, 2009-06-12 at 20:44 +0300, Blue Swirl wrote:

 If the device has different behavior or different properties from
 guest perspective compared to the old device, it should get a new
 device type so that you could specify in the device tree either the
 old device or the new one.

Yes, that works - it's analogous to a device (type, version) pair.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/4] virtio_net: return NETDEV_TX_BUSY instead of queueing an extra skb.

2009-06-02 Thread Mark McLoughlin
On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:

 This effectively reverts 99ffc696d10b28580fe93441d627cf290ac4484c
 virtio: wean net driver off NETDEV_TX_BUSY.
 
 The complexity of queuing an skb (setting a tasklet to re-xmit) is
 questionable,

It certainly adds some subtle complexities to start_xmit() 

  especially once we get rid of the other reason for the
 tasklet in the next patch.
 
 If the skb won't fit in the tx queue, just return NETDEV_TX_BUSY.  It
 might be frowned upon, but it's common and not going away any time
 soon.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 Cc: Herbert Xu herb...@gondor.apana.org.au
 ---
  drivers/net/virtio_net.c |   49 
 ++-
  1 file changed, 11 insertions(+), 38 deletions(-)
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
  
 @@ -526,27 +517,14 @@ again:
   /* Free up any pending old buffers before queueing new ones. */
   free_old_xmit_skbs(vi);
  
 - /* If we has a buffer left over from last time, send it now. */
 - if (unlikely(vi-last_xmit_skb) 
 - xmit_skb(vi, vi-last_xmit_skb) != 0)
 - goto stop_queue;
 + /* Put new one in send queue and do transmit */
 + __skb_queue_head(vi-send, skb);
 + if (likely(xmit_skb(vi, skb) == 0)) {
 + vi-svq-vq_ops-kick(vi-svq);
 + return NETDEV_TX_OK;
 + }

Hmm, is it okay to leave the skb on the send queue if we return
NETDEV_TX_BUSY?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Mark McLoughlin
On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
 The virtio_net driver is complicated by the two methods of freeing old
 xmit buffers (in addition to freeing old ones at the start of the xmit
 path).
 
 The original code used a 1/10 second timer attached to xmit_free(),
 reset on every xmit.  Before we orphaned skbs on xmit, the
 transmitting userspace could block with a full socket until the timer
 fired, the skb destructor was called, and they were re-woken.

The timer was actually added to solve a hang when trying to unload
nf_conntrack AFAIR - nf_conntrack was blocking on the skb being freed
and we never actually freed it.

I think skb_orphan() is enough to prevent this, is it?

 So we added the VIRTIO_F_NOTIFY_ON_EMPTY feature: supporting devices
 send an interrupt (even if normally suppressed) on an empty xmit ring
 which makes us schedule xmit_tasklet().  This was a benchmark win.
 
 Unfortunately, VIRTIO_F_NOTIFY_ON_EMPTY makes quite a lot of work: a
 host which is faster than the guest will fire the interrupt every xmit
 packet (slowing the guest down further).

Ouch. So, does simply disabling host support for
VIRTIO_F_NOTIFY_ON_EMPTY speed up current guests?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/4] virtio_net: don't free buffers in xmit ring

2009-06-02 Thread Mark McLoughlin
On Tue, 2009-06-02 at 09:13 +0100, Mark McLoughlin wrote:
 On Fri, 2009-05-29 at 23:46 +0930, Rusty Russell wrote:
  The virtio_net driver is complicated by the two methods of freeing old
  xmit buffers (in addition to freeing old ones at the start of the xmit
  path).
  
  The original code used a 1/10 second timer attached to xmit_free(),
  reset on every xmit.  Before we orphaned skbs on xmit, the
  transmitting userspace could block with a full socket until the timer
  fired, the skb destructor was called, and they were re-woken.
 
 The timer was actually added to solve a hang when trying to unload
 nf_conntrack AFAIR - nf_conntrack was blocking on the skb being freed
 and we never actually freed it.
 
 I think skb_orphan() is enough to prevent this, is it?

Oops, I meant:

  I don't think skb_orphan() is enough to prevent this, is it?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

2009-05-11 Thread Mark McLoughlin
On Mon, 2009-05-04 at 11:49 +0930, Rusty Russell wrote:
 On Mon, 27 Apr 2009 05:13:53 pm Dor Laor wrote:
  Mark McLoughlin wrote:
   Hi Rusty,
  
   On Thu, 2008-12-18 at 17:10 +, Mark McLoughlin wrote:
 
   Add a new feature flag for indirect ring entries. These are ring
   entries which point to a table of buffer descriptors.
  
   The idea here is to increase the ring capacity by allowing a larger
   effective ring size whereby the ring size dictates the number of
   requests that may be outstanding, rather than the size of those
   requests.
 
 OK, just so we track our mistakes.
 
 1) virtio_rings must be physically contiguous, even though they actually
have two independent parts.
 2) The number of elements in a ring must be a power of 2.
 3) virtio_pci tells the guest what number of elements to use.
 4) The guest has to allocate that much physically contiguous memory, or fail.
 
 In practice, 128 elements = 2 pages, 256 elements = 3 pages, 512 elements
 = 5 pages.  Order 1, order 2, order 3 under Linux.  1 is OK, 2 is iffy, 3 is
 hard.
 
 Blocked from doing the simpler thing, we've decided to go with a layer
 of indirection.  But the patch is simple and clean, so there's nothing
 fundamental to object to.

Still have one FIXME in the patch worth looking at - at what point
should we use an indirect entry rather than consuming N entries? 

 I can't find 3/3, did it go missing?

Following up with all three patches again.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

2009-04-21 Thread Mark McLoughlin
Hi Rusty,

On Thu, 2008-12-18 at 17:10 +, Mark McLoughlin wrote:
 Add a new feature flag for indirect ring entries. These are ring
 entries which point to a table of buffer descriptors.
 
 The idea here is to increase the ring capacity by allowing a larger
 effective ring size whereby the ring size dictates the number of
 requests that may be outstanding, rather than the size of those
 requests.
 
 This should be most effective in the case of block I/O where we can
 potentially benefit by concurrently dispatching a large number of
 large requests. Even in the simple case of single segment block
 requests, this results in a threefold increase in ring capacity.

Apparently, this would also be useful for the windows virtio-net
drivers.

Dor can explain further, but apparently Windows has been observed
passing the driver a packet with 256 fragments when using TSO.

With a ring size of 256, the guest can either drop the packet or copy it
into a single buffer. We'd much rather if we could use an indirect ring
entry to pass this number of fragments without copying.

For reference the original patch was here:

  http://lkml.org/lkml/2008/12/18/212

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

2008-12-22 Thread Mark McLoughlin
Hi Ingo,

On Sat, 2008-12-20 at 12:38 +0100, Ingo Oeser wrote:
 Hi Mark,
 
 On Thursday 18 December 2008, Mark McLoughlin wrote:
  diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
  index 5777196..2330c4b 100644
  --- a/drivers/virtio/virtio_ring.c
  +++ b/drivers/virtio/virtio_ring.c
  @@ -70,6 +73,55 @@ struct vring_virtqueue
   
   #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
   
  +/* Set up an indirect table of descriptors and add it to the queue. */
  +static int vring_add_indirect(struct vring_virtqueue *vq,
  + struct scatterlist sg[],
  + unsigned int out,
  + unsigned int in)
  +{
  +   struct vring_desc *desc;
  +   unsigned head;
  +   int i;
  +
  +   desc = kmalloc((out + in) * sizeof(struct vring_desc), GFP_ATOMIC);
 
 kmalloc() returns ZERO_SIZE_PTR, if (out + in) == 0

vring_add_buf() has:

  BUG_ON(out + in == 0)

I should just add that here too before the kmalloc() call.

Thanks,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/6] virtio: add register_virtio_root_device()

2008-12-11 Thread Mark McLoughlin
On Thu, 2008-12-11 at 13:59 +0100, Cornelia Huck wrote:
 On Wed, 10 Dec 2008 17:45:35 +,
 Mark McLoughlin [EMAIL PROTECTED] wrote:
 
  Add a function to allocate a root device object to group the
  devices from a given virtio implementation.
  
  Also add a 'module' sysfs symlink to allow so that userspace
  can generically determine which virtio implementation a
  device is associated with. This will be used by Fedora
  mkinitrd to generically determine e.g. that virtio_pci is
  needed to mount a given root filesystem.
 
 Nothing about this is really virtio-specific (just as
 s390_root_dev_register() is not really s390-specific), and a 'module'
 symlink doesn't really hurt in a generic implementation, even if it is
 unneeded. I'm voting to put this in some generic, always built-in code
 (or have the users select it) so we could also use it from s390.

Okay, coming up ...

  Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
  ---
   drivers/virtio/virtio.c |   71 
  +++
   include/linux/virtio.h  |   10 ++
   2 files changed, 81 insertions(+), 0 deletions(-)
  
  diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
  index 018c070..61e6597 100644
  --- a/drivers/virtio/virtio.c
  +++ b/drivers/virtio/virtio.c
  @@ -1,6 +1,7 @@
   #include linux/virtio.h
   #include linux/spinlock.h
   #include linux/virtio_config.h
  +#include linux/err.h
  
   /* Unique numbering for virtio devices. */
   static unsigned int dev_index;
  @@ -200,6 +201,76 @@ void unregister_virtio_device(struct virtio_device 
  *dev)
   }
   EXPORT_SYMBOL_GPL(unregister_virtio_device);
  
  +/* A root device for virtio devices from a given backend.  This makes them
  + * appear as /sys/devices/{name}/0,1,2 not /sys/devices/0,1,2. It also 
  allows
  + * us to have a /sys/devices/{name}/module symlink to the backend module. 
  */
  +struct virtio_root_device
  +{
  +   struct device dev;
  +   struct module *owner;
  +};
  +
  +static struct virtio_root_device *to_virtio_root(struct device *dev)
  +{
  +return container_of(dev, struct virtio_root_device, dev);
  +}
  +
  +static void release_virtio_root_device(struct device *dev)
  +{
  +   struct virtio_root_device *root = to_virtio_root(dev);
  +   if (root-owner)
  +   sysfs_remove_link(root-dev.kobj, module);
  +   kfree(root);
  +}
 
 Can this code be a module? If yes, move the release callback to a
 build-in as there are races with release-functions in modules.

Not sure I fully understand the issue here, but it won't be an problem
with it if we move to driver core.

  +struct device *__register_virtio_root_device(const char *name,
  +struct module *owner)
  +{
  +   struct virtio_root_device *root;
  +   int err = -ENOMEM;
  +
  +   root = kzalloc(sizeof(struct virtio_root_device), GFP_KERNEL);
  +   if (!root)
  +   goto out;
  +
  +   err = dev_set_name(root-dev, name);
  +   if (err)
  +   goto free_root;
  +
  +   err = device_register(root-dev);
  +   if (err)
  +   goto free_root;
  +
  +   root-dev.parent  = NULL;
  +   root-dev.release = release_virtio_root_device;
 
 You must set -release before calling device_register(), and setting
 the parent is unneeded.

Okay.

  +   if (owner) {
  +   struct module_kobject *mk = owner-mkobj;
  +
  +   err = sysfs_create_link(root-dev.kobj, mk-kobj, module);
  +if (err) {
  +   device_unregister(root-dev);
  +   return ERR_PTR(err);
  +   }
  +
  +   root-owner = owner;
  +   }
  +
  +   return root-dev;
  +
  +free_root:
  +   kfree(root);
 
 You need to call device_put() if you called device_register().

Oh, I missed that subtlety. So the rules are:

  1) To release before calling device_register(), use kfree()

  2) To release if device_register() failed, put_device()

  3) To release once device_register() succeeds, device_unregister()

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/4] driver core: add root_device_register()

2008-12-11 Thread Mark McLoughlin
Add support for allocating root device objects which group
device objects under /sys/devices directories.

Also add a sysfs 'module' symlink which points to the owner
of the root device object. This symlink will be used in virtio
to allow userspace to determine which virtio bus implementation
a given device is associated with.

[Includes suggestions from Cornelia Huck]

Signed-off-by: Mark McLoughlin mar...@redhat.com
---
 drivers/base/core.c|   87 
 include/linux/device.h |   11 ++
 2 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 8c2cc26..20e5825 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1196,6 +1196,93 @@ EXPORT_SYMBOL_GPL(put_device);
 EXPORT_SYMBOL_GPL(device_create_file);
 EXPORT_SYMBOL_GPL(device_remove_file);
 
+struct root_device
+{
+   struct device dev;
+   struct module *owner;
+};
+
+static void root_device_release(struct device *dev)
+{
+   kfree(dev);
+}
+
+/**
+ * __root_device_register - allocate and register a root device
+ * @name: root device name
+ * @owner: owner module of the root device, usually THIS_MODULE
+ *
+ * This function allocates a root device and registers it
+ * using device_register(). In order to free the returned
+ * device, use root_device_unregister().
+ *
+ * Root devices are dummy devices which allow other devices
+ * to be grouped under /sys/devices. Use this function to
+ * allocate a root device and then use it as the parent of
+ * any device which should appear under /sys/devices/{name}
+ *
+ * The /sys/devices/{name} directory will also contain a
+ * 'module' symlink which points to the @owner directory
+ * in sysfs.
+ *
+ * Note: You probably want to use root_device_register().
+ */
+struct device *__root_device_register(const char *name, struct module *owner)
+{
+   struct root_device *root;
+   int err = -ENOMEM;
+
+   root = kzalloc(sizeof(struct root_device), GFP_KERNEL);
+   if (!root)
+   return ERR_PTR(err);
+
+   err = dev_set_name(root-dev, name);
+   if (err) {
+   kfree(root);
+   return ERR_PTR(err);
+   }
+
+   root-dev.release = root_device_release;
+
+   err = device_register(root-dev);
+   if (err) {
+   put_device(root-dev);
+   return ERR_PTR(err);
+   }
+
+   if (owner) {
+   struct module_kobject *mk = owner-mkobj;
+
+   err = sysfs_create_link(root-dev.kobj, mk-kobj, module);
+   if (err) {
+   device_unregister(root-dev);
+   return ERR_PTR(err);
+   }
+   root-owner = owner;
+   }
+
+   return root-dev;
+}
+EXPORT_SYMBOL_GPL(__root_device_register);
+
+/**
+ * root_device_unregister - unregister and free a root device
+ * @root: device going away.
+ *
+ * This function unregisters and cleans up a device that was created by
+ * root_device_register().
+ */
+void root_device_unregister(struct device *dev)
+{
+   struct root_device *root = container_of(dev, struct root_device, dev);
+
+   if (root-owner)
+   sysfs_remove_link(root-dev.kobj, module);
+
+   device_unregister(dev);
+}
+EXPORT_SYMBOL_GPL(root_device_unregister);
+
 
 static void device_create_release(struct device *dev)
 {
diff --git a/include/linux/device.h b/include/linux/device.h
index 1a3686d..9e02980 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -483,6 +483,17 @@ extern int device_rename(struct device *dev, char 
*new_name);
 extern int device_move(struct device *dev, struct device *new_parent);
 
 /*
+ * Root device objects for grouping under /sys/devices
+ */
+extern struct device *__root_device_register(const char *name,
+struct module *owner);
+static inline struct device *root_device_register(const char *name)
+{
+   return __root_device_register(name, THIS_MODULE);
+}
+extern void root_device_unregister(struct device *root);
+
+/*
  * Manual binding of a device to driver. See drivers/base/bus.c
  * for information on use.
  */
-- 
1.5.4.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 0/6] Clean up virtio device object handling [was Re: [PATCH] virtio: make PCI devices take a virtio_pci module ref]

2008-12-10 Thread Mark McLoughlin
(Moved from [EMAIL PROTECTED] to [EMAIL PROTECTED], changed
subject, cleaned up cc list)

On Wed, 2008-12-10 at 13:02 +0100, Kay Sievers wrote:
 On Wed, Dec 10, 2008 at 10:49, Mark McLoughlin [EMAIL PROTECTED] wrote:
  On Tue, 2008-12-09 at 19:16 +0100, Kay Sievers wrote:
  On Tue, Dec 9, 2008 at 17:41, Mark McLoughlin [EMAIL PROTECTED] wrote:
   On Mon, 2008-12-08 at 08:46 -0600, Anthony Liguori wrote:
   Mark McLoughlin wrote:
On Sun, 2008-12-07 at 18:52 +1030, Rusty Russell wrote:
On Saturday 06 December 2008 01:37:06 Mark McLoughlin wrote:
   
Another example of a lack of an explicit dependency causing 
problems is
Fedora's mkinitrd having this hack:
   
if echo $PWD | grep -q /virtio-pci/ ; then
findmodule virtio_pci
fi
   
which basically says if this is a virtio device, don't forget to
include virtio_pci in the initrd too!. Now, mkinitrd is full of 
hacks,
but this is a particularly unusual one.
   
Um, I don't know what this does, sorry.
   
I have no idea how Fedora chooses what to put in an initrd; I can't 
think
of a sensible way of deciding what goes in and what doesn't other 
than
lists and heuristics.
   
   
Fedora's mkinitrd creates an initrd suitable to boot the machine you 
run
mkinitrd on, rather than creating an initrd suitable to boot any
machine.
   
So, it goes ah, / is mounted from /dev/vda, we need to include
virtio_blk and it's dependencies. It does that in a generic way that
works well for most setups:
   
  1) Find the device name (e.g. vda) below /sys/block
   
  2) Follow the 'device' link to e.g. /sys/devices/virtio-pci/virtio1
   
  3) Find the module need for this through either 'modalias' or the
 'driver/module' symlink
   
  4) Use modprobe to list any dependencies of that module
   
Clearly, virtio-pci won't be pulled in by any of this so we've added a
hack to say oh, it's a virtio device, let's include virtio_pci just 
in
case.
   
It's not even the case that mkinitrd needs to know how to include the
the module for the bus, because in our case that's virtio.ko ... we've
pretty effectively hidden the the bus *implementation* from userspace.
   
I don't think this is worth wasting too much time fixing, that's why 
I'm
thinking we should just make virtio_pci built-in by default with
CONFIG_KVM_GUEST.
   
  
   What if we have multiple virtio transports?
  
   I don't think that's so much an an issue (just build in any transport
   supported by KVM), but rather that you might build a non-pv_ops kernel
   to run on QEMU which would benefit from using virtio drivers ...
  
 Is there a way that we can
   expose the relationship with virtio-blk and virtio-pci in sysfs?  We
   have a struct device for the PCI device, it's just a matter of making
   the link visible.
  
   It feels a bit like busy work to generalise this since only virtio_pci
   can be built as a module, but here's a patch.
  
   The mkinitrd hack turns into:
  
  # Handle finding virtio bus implementations
  if [ -L ./virtio_module ] ; then
  findmodule $(basename $(readlink ./virtio_module))
  else if echo $PWD | grep -q /virtio-pci/ ; then
  findmodule virtio_pci
  fi; fi
  
   [PATCH] virtio: add a 'virtio_module' sysfs symlink
 
  Doesn't the device have a driver link already? If yes, the driver it
  points to should have a module link.
 
  The virtio bus is an abstraction that has several different backend
  implementations - currently virtio-pci, lguest and kvm-s390.
 
  So yes, the driver/module link gives us the device driver, but the
  virtio_module link is to the virtio bus driver (aka implementation,
  transport, backend, ...):
 
   $ basename $(readlink virtio_module)
   virtio_pci
   $ basename $(readlink driver/module)
   virtio_net
 
 I see. But why not just call it module, like we do in all other
 places, when it points to /sys/module/.
 
 To find dependent modules, you would walk up the chain of parents, and
 include everything that is found by looking for driver/module and
 module links?
 
 Wouldn't that make it completely generic, without any virtio specific hacks?

Yeah, that sounds much better - a minor detail is that it'd be better to
hang the symlink off each virtio implementation's root object rather
than off each device.

To that end, I've hacked up register_virtio_root_device() which fixes
the fact that we statically allocate root objects and gives us a sane
place to add this generic symlink.

It might make sense to add this to the core, though - e.g.
device_register_root() - and that would also allow us use the same
approach as module_add_driver() to add the module symlink for built-in
modules.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux

Re: [PATCH] virtio_net: large tx MTU support

2008-11-28 Thread Mark McLoughlin
On Fri, 2008-11-28 at 10:52 +1030, Rusty Russell wrote:
 On Friday 28 November 2008 00:27:05 Mark McLoughlin wrote:
  Hi Rusty,
 
  On Thu, 2008-11-27 at 23:00 +1030, Rusty Russell wrote:
   On Thursday 27 November 2008 00:28:11 Mark McLoughlin wrote:
We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.
  
   Hi Mark,
  
   Just one comment: maybe we should be conservative and maybe limit to 1500
   if the host doesn't offer any of the GSO or MRG_RXBUF features?
 
  That was actually what I was going to do until I thought about it a bit
  more and discussed it with Herbert.
 
  The virtio_net MTU only affects the transmit path, so there shouldn't be
  any issue with a host that doesn't support those features.
 
 Not quite what I meant.

Well, you did mention MRG_RXBUF :-)

 A minimal host can reasonably expect ethernet-fitting packets.  If it
 supports GSO of course it must handle larger ones.

I think this is orthogonal to GSO - e.g. a host may support GSO even if
it can only physically transmit 1500 byte frames.

MTU configuration is commonly a trial and error thing, so we're better
off allowing larger MTU sizes in cases where the host might not support
it rather than disallowing it in cases where the host can support it.

 Otherwise we should add YA feature bit or even a max-mtu field.

If e.g. we allowed physical device assignment via virtio, then the MTU
would be limited to the MTU supported by the physical device. In that
case it might make sense to add a max-mtu field or similar, but IMHO
it's fine to allow larger MTU sizes in the mean time.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_net: large tx MTU support

2008-11-27 Thread Mark McLoughlin
Hi Rusty,

On Thu, 2008-11-27 at 23:00 +1030, Rusty Russell wrote:
 On Thursday 27 November 2008 00:28:11 Mark McLoughlin wrote:
  We don't really have a max tx packet size limit, so allow configuring
  the device with up to 64k tx MTU.
 
 Hi Mark,
 
 Just one comment: maybe we should be conservative and maybe limit to 1500 if 
 the host doesn't offer any of the GSO or MRG_RXBUF features?

That was actually what I was going to do until I thought about it a bit
more and discussed it with Herbert.

The virtio_net MTU only affects the transmit path, so there shouldn't be
any issue with a host that doesn't support those features.

The MTU on the receive path is a different story - e.g. the host needs
to have a large enough buffer when reading from the tap fd and the guest
needs to have allocated large enough receive buffers. The former we
fixed recently in KVM (we used to only have a 64k buffer if built with
IFF_VNET_HDR support) and the latter is currently only true if either
GSO or MRG_RXBUF has been negotiated.

So, in summary - making change_mtu() check for GSO and MRG_RXBUF isn't
correct, but perhaps we could do it anyway so as to give the user a hint
that a large MTU isn't going to work on the receive path. If you feel
that's best, here's another version.

Cheers,
Mark.

From: Mark McLoughlin [EMAIL PROTECTED]
Subject: [PATCH] virtio_net: large tx MTU support

We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.

On the receive path, we can only handle a large MTU if we negotiate
GSO or MRG_RXBUF with the host. In order to give the user some hint
as to this limitation, we also limit the transmit path to a smaller
MTU if neither of those features are available.

Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e6b5d6e..499c7a5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -613,6 +613,29 @@ static struct ethtool_ops virtnet_ethtool_ops = {
.set_tso = ethtool_op_set_tso,
 };
 
+#define MIN_MTU 68
+#define MAX_MTU 65535
+
+static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int max_mtu;
+
+   /* Only allow a large MTU if we know we have a chance
+* of also supporting that MTU on the receive side. */
+   if (vi-mergeable_rx_bufs || vi-big_packets)
+   max_mtu = MAX_MTU;
+   else
+   max_mtu = ETH_DATA_LEN;
+
+   if (new_mtu  MIN_MTU || new_mtu  max_mtu)
+   return -EINVAL;
+
+   dev-mtu = new_mtu;
+
+   return 0;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int err;
@@ -628,6 +651,7 @@ static int virtnet_probe(struct virtio_device *vdev)
dev-open = virtnet_open;
dev-stop = virtnet_close;
dev-hard_start_xmit = start_xmit;
+   dev-change_mtu = virtnet_change_mtu;
dev-features = NETIF_F_HIGHDMA;
 #ifdef CONFIG_NET_POLL_CONTROLLER
dev-poll_controller = virtnet_netpoll;
-- 
1.6.0.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio_net: hook up the set-tso ethtool op

2008-10-22 Thread Mark McLoughlin
Seems like an oversight that we have set-tx-csum and set-sg hooked
up, but not set-tso.

Also leads to the strange situation that if you e.g. disable tx-csum,
then tso doesn't get disabled.

Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index cca6435..79b59cc 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -612,6 +612,7 @@ static int virtnet_set_tx_csum(struct net_device *dev, u32 
data)
 static struct ethtool_ops virtnet_ethtool_ops = {
.set_tx_csum = virtnet_set_tx_csum,
.set_sg = ethtool_op_set_sg,
+   .set_tso = ethtool_op_set_tso,
 };
 
 static int virtnet_probe(struct virtio_device *vdev)
-- 
1.6.0.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-10 Thread Mark McLoughlin
On Thu, 2008-10-09 at 14:26 -0500, Anthony Liguori wrote:
 Mark McLoughlin wrote:
  
  Also, including virtio_net_hdr in the data buffer would need another
  feature flag. Rightly or wrongly, KVM's implementation requires
  virtio_net_hdr to be the first buffer:
  
  if (elem.in_num  1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
  fprintf(stderr, virtio-net header not in first element\n);
  exit(1);
  }
  
  i.e. it's part of the ABI ... at least as KVM sees it :-)
 
 This is actually something that's broken in a nasty way.  Having the 
 header in the first element is not supposed to be part of the ABI but it 
 sort of has to be ATM.
 
 If an older version of QEMU were to use a newer kernel, and the newer 
 kernel had a larger header size, then if we just made the header be the 
 first X bytes, QEMU has no way of knowing how many bytes that should be. 
   Instead, the guest actually has to allocate the virtio-net header in 
 such a way that it only presents the size depending on the features that 
 the host supports.  We don't use a simple versioning scheme, so you'd 
 have to check for a combination of features advertised by the host but 
 that's not good enough because the host may disable certain features.
 
 Perhaps the header size is whatever the longest element that has been 
 commonly negotiated?
 
 So that's why this aggressive check is here.  Not to necessarily cement 
 this into the ABI but as a way to make someone figure out how to 
 sanitize this all.

Well, features may be orthogonal but they are still added sequentially
to the ABI. So, you would have a kind of implicit ABI versioning, while
still allowing individual selection of features.

e.g. if NET_F_FOO adds int foo to the header and then NET_F_BAR adds
int bar to the header then if NET_F_FOO is negotiated, the guest
should only send a header with foo and if NET_F_FOO|NET_F_BAR or
NET_F_BAR is negotiated, then the guest sends a header with both foo
and bar.

Or put it another way, a host or guest may not implement NET_F_FOO but
knowledge of the foo header field is part of the ABI of NET_F_BAR.
That knowledge would be as simple as knowing that the field exists and
that it should be ignored if the feature isn't used.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-10 Thread Mark McLoughlin
On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote:
 On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:

   The size of the logical buffer is
   returned to the guest rather than the size of the individual smaller
   buffers.
  
  That's a virtio transport breakage: can you use the standard virtio 
  mechanism, 
  just put the extended length or number of extra buffers inside the 
  virtio_net_hdr?
 
 Sure that sounds reasonable.

Okay, here we go.

The new header is lamely called virtio_net_hdr2 - I've added some
padding in there so we can extend it further in future.

It gets messy for lguest because tun/tap isn't using the same header
format anymore.

Rusty - let me know if this looks reasonable and, if so, I'll merge it
back into the original patches and resend.

Cheers,
Mark.

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index da934c2..0f840f2 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -940,14 +940,21 @@ static void handle_net_output(int fd, struct virtqueue 
*vq, bool timeout)
 {
unsigned int head, out, in, num = 0;
int len;
-   struct iovec iov[vq-vring.num];
+   struct iovec iov[vq-vring.num + 1];
static int last_timeout_num;
 
/* Keep getting output buffers from the Guest until we run out. */
-   while ((head = get_vq_desc(vq, iov, out, in)) != vq-vring.num) {
+   while ((head = get_vq_desc(vq, iov[1], out, in)) != vq-vring.num) {
if (in)
errx(1, Input buffers in output queue?);
-   len = writev(vq-dev-fd, iov, out);
+
+   /* tapfd needs a virtio_net_hdr, not virtio_net_hdr2 */
+   iov[0].iov_base  = iov[1].iov_base;
+   iov[0].iov_len   = sizeof(struct virtio_net_hdr);
+   iov[1].iov_base += sizeof(struct virtio_net_hdr2);
+   iov[1].iov_len  -= sizeof(struct virtio_net_hdr2);
+
+   len = writev(vq-dev-fd, iov, out + 1);
if (len  0)
err(1, Writing network packet to tun);
add_used_and_trigger(fd, vq, head, len);
@@ -998,18 +1005,24 @@ static unsigned int get_net_recv_head(struct device 
*dev, struct iovec *iov,
 
 /* Here we add used recv buffers to the used queue but, also, return unused
  * buffers to the avail queue. */
-static void add_net_recv_used(struct device *dev, unsigned int *heads,
- int *bufsizes, int nheads, int used_len)
+static void add_net_recv_used(struct device *dev, struct virtio_net_hdr2 *hdr2,
+ unsigned int *heads, int *bufsizes,
+ int nheads, int used_len)
 {
int len, idx;
 
/* Add the buffers we've actually used to the used queue */
len = idx = 0;
while (len  used_len) {
-   add_used(dev-vq, heads[idx], used_len, idx);
+   if (bufsizes[idx]  (used_len - len))
+   bufsizes[idx] = used_len - len;
+   add_used(dev-vq, heads[idx], bufsizes[idx], idx);
len += bufsizes[idx++];
}
 
+   /* The guest needs to know how many buffers to fetch */
+   hdr2-num_buffers = idx;
+
/* Return the rest of them back to the avail queue */
lg_last_avail(dev-vq) -= nheads - idx;
dev-vq-inflight  -= nheads - idx;
@@ -1022,12 +1035,17 @@ static void add_net_recv_used(struct device *dev, 
unsigned int *heads,
  * Guest. */
 static bool handle_tun_input(int fd, struct device *dev)
 {
-   struct iovec iov[dev-vq-vring.num];
+   struct virtio_net_hdr hdr;
+   struct virtio_net_hdr2 *hdr2;
+   struct iovec iov[dev-vq-vring.num + 1];
unsigned int heads[NET_MAX_RECV_PAGES];
int bufsizes[NET_MAX_RECV_PAGES];
int nheads, len, iovcnt;
 
-   nheads = len = iovcnt = 0;
+   nheads = len = 0;
+
+   /* First iov is for the header */
+   iovcnt = 1;
 
/* First we need enough network buffers from the Guests's recv
 * virtqueue for the largest possible packet. */
@@ -1056,13 +1074,26 @@ static bool handle_tun_input(int fd, struct device *dev)
len += bufsizes[nheads++];
}
 
+   /* Read virtio_net_hdr from tapfd */
+   iov[0].iov_base = hdr;
+   iov[0].iov_len = sizeof(hdr);
+
+   /* Read data into buffer after virtio_net_hdr2 */
+   hdr2 = iov[1].iov_base;
+   iov[1].iov_base += sizeof(*hdr2);
+   iov[1].iov_len  -= sizeof(*hdr2);
+
/* Read the packet from the device directly into the Guest's buffer. */
len = readv(dev-fd, iov, iovcnt);
if (len = 0)
err(1, reading network);
 
+   /* Copy the virtio_net_hdr into the virtio_net_hdr2 */
+   hdr2-hdr = hdr;
+   len += sizeof(*hdr2) - sizeof(hdr);
+
/* Return unused buffers to the recv queue */
-   add_net_recv_used(dev, heads, bufsizes, nheads, len);
+

Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-09 Thread Mark McLoughlin
On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote: 
 On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
 
  There are three approaches we should investigate before adding YA feature.  
  Obviously, we can simply increase the number of ring entries.
 
 That's not going to work so well as you need to increase the ring
 size by MAX_SKB_FRAGS times to achieve the same level of effect.
 
 Basically the current scheme is either going to suck at non-TSO
 traffic or it's going to chew too much resources.

Yeah ... to put some numbers on it, assume we have a 256 entry ring now.

Currently, with GSO enabled in the host the guest will fill this with 12
buffer heads with 20 buffers per head (a 10 byte buffer, an MTU sized
buffer and 18 page sized buffers).

That means we allocate ~900k for receive buffers, 12k for the ring, fail
to use 16 ring entries and the ring ends up with a capacity of 12
packets. In the case of MTU sized packets from an off-host source,
that's a huge amount of overhead for ~17k of data.

If we wanted to match the packet capacity that Herbert's suggestion
enables (i.e. 256 packets), we'd need to bump the ring size to 4k
entries (assuming we reduce it to 19 entries per packet). This would
mean we'd need to allocate ~200k for the ring and ~18M in receive
buffers. Again, assuming MTU sized packets, that's massive overhead for
~400k of data.

  Secondly, we can put the virtio_net_hdr at the head of the skb data (this 
  is 
  also worth considering for xmit I think if we have headroom) and drop 
  MAX_SKB_FRAGS which contains a gratuitous +2.
 
 That's fine but having skb-data in the ring still means two
 different kinds of memory in there and it sucks when you only
 have 1500-byte packets.

Also, including virtio_net_hdr in the data buffer would need another
feature flag. Rightly or wrongly, KVM's implementation requires
virtio_net_hdr to be the first buffer:

if (elem.in_num  1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
fprintf(stderr, virtio-net header not in first element\n);
exit(1);
}

i.e. it's part of the ABI ... at least as KVM sees it :-)

   The size of the logical buffer is
   returned to the guest rather than the size of the individual smaller
   buffers.
  
  That's a virtio transport breakage: can you use the standard virtio 
  mechanism, 
  just put the extended length or number of extra buffers inside the 
  virtio_net_hdr?
 
 Sure that sounds reasonable.


I'll give that a shot.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/1] tun: TUNGETIFF interface to query name and flags

2008-08-15 Thread Mark McLoughlin
On Thu, 2008-08-14 at 11:58 +1000, Rusty Russell wrote:
 On Thursday 14 August 2008 00:30:16 Mark McLoughlin wrote:
  A very simple approach is attached; I did consider doing a TUNGETFLAGS
  that would return tun-flags, but I think it's nicer to have a companion
  to TUNGETIFF since it also allows one to query the interface name from
  the file descriptor.
 
 This seems really sensible to me.
 
 If Max acks it,

How about it Max?

Trying a different email address - previous mail at:

  http://marc.info/?l=linux-netdevm=121863813904363

 I'd say Dave should merge it.

Maybe it's too late already, but I'm really hoping to get this in
2.6.27.

Without it, the IFF_VNET_HDR goodness already added in 2.6.27 can't be
used by KVM guests created by libvirt ... which is a good chunk of the
potential users for this stuff.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/3] virtio: fix virtio_net xmit of freed skb bug

2008-06-13 Thread Mark McLoughlin
On Thu, 2008-05-29 at 16:34 +1000, Rusty Russell wrote:
 On Tuesday 27 May 2008 21:06:26 Mark McLoughlin wrote:
  On Mon, 2008-05-26 at 17:42 +1000, Rusty Russell wrote:
   If we fail to transmit a packet, we assume the queue is full and put
   the skb into last_xmit_skb.  However, if more space frees up before we
   xmit it, we loop, and the result can be transmitting the same skb twice.
  
   Fix is simple: set skb to NULL if we've used it in some way, and check
   before sending.
 
 Great! It's a corner case, but it's no more complicated to do it your way.
 
 Minor mod, I find it clearer to have the vi-last_xmit_skb = NULL; under
 the branch:
 
 Subject: [PATCH] virtio_net: Delay dropping tx skbs
 Date: Tue, 27 May 2008 12:06:26 +0100
 From: Mark McLoughlin [EMAIL PROTECTED]
 
 Currently we drop the skb in start_xmit() if we have a
 queued buffer and fail to transmit it.
 
 However, if we delay dropping it until we've stopped the
 queue and enabled the tx notification callback, then there
 is a chance space might become available for it.

Hmm, we lost this one somewhere along the way ...

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [3/6] [TUN]: Fix GSO mapping

2008-05-29 Thread Mark McLoughlin
On Fri, 2008-04-18 at 11:17 +0800, Herbert Xu wrote:
 This patch avoids the correctness issue on the user-space mapping
 by just copying the memory.
 
 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index 4c15dc4..d75cfd2 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c

 + virt = kmap_atomic(f-page, KM_USER0);
 + err = memcpy_fromiovec(virt + f-size, iv, copy);
 + kunmap_atomic(virt, KM_USER0);

Seeing an oops from this; fix below.

Cheers,
Mark.

Subject: [PATCH 1/1] tun: Do not use kmap_atomic() since memcpy_fromiovec() can 
sleep

Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
---
 drivers/net/tun.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 151b409..aff338e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -322,9 +322,9 @@ static int get_user_skb_frags(struct iovec *iv, size_t 
count,
if (copy  len)
copy = len;
 
-   virt = kmap_atomic(f-page, KM_USER0);
+   virt = kmap(f-page);
err = memcpy_fromiovec(virt + f-size, iv, copy);
-   kunmap_atomic(virt, KM_USER0);
+   kunmap(f-page);
 
if (err)
return err;


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_net: free transmit skbs in a timer

2008-05-15 Thread Mark McLoughlin
On Wed, 2008-05-14 at 11:59 +0300, Avi Kivity wrote:
 Rusty Russell wrote:
  Sorry to barge in late, but IMO the timer should be on the host, which
  is cheaper than on the guest (well, a 100ms timer is likely zero cost,
  but  I still don't like it).
 
  the host should fire a tx completion interrupt whenever the completion
  queue has enough entries, where we can define enough now as the
  halfway mark or a timer expiry, whichever comes earlier.
 
  We can later improve enough to be just enough so the timer never
  triggers and adjust it dynamically.  It probably doesn't matter for
  Linux, but I don't want to punish guests that can do true async
  networking and depend on timely completion notification.
  
 
  This implies that we should not be supressing notifications in the guest at 
  all (unless we're sure there are more packets to come, which currently we 
  never are: that needs new net infrastructure).

 
 We don't have to be sure, just reasonably confident.  If we see a stream 
 of packets, we open the window, but set a timer in case we're wrong.  
 The expectation is that the timer will only fire when tx rate drops (or 
 tx stops completely).
 
  But that means we'd get a notification on every xmit at the moment.  
  Benchmarks anyone?

 
 Notification on every xmit will surely kill performance.  I'm trying to 
 get batching to work but also good latency when the link is not saturated.

I think Rusty is speaking from the POV of the guest driver - i.e. that
virtio_net should never disable notifications on the xmit queue using
disable_cb()?

Sounds like you think agree, but that the host side should throttle the
rate of xmit notifications?

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_net: free transmit skbs in a timer

2008-05-12 Thread Mark McLoughlin
Hi Rusty,
Sorry 'bout the lag ...

On Fri, 2008-05-02 at 20:55 +1000, Rusty Russell wrote:
 On Thursday 01 May 2008 00:31:46 Mark McLoughlin wrote:
  virtio_net currently only frees old transmit skbs just
  before queueing new ones. If the queue is full, it then
  enables interrupts and waits for notification that more
  work has been performed.
 
 Hi Mark,
 
This patch is fine, but it's better to do it from skb_xmit_done().

Unless I'm missing something, we only get this callback when we've
stopped the queue and we're waiting for buffers to be freed up.

In the normal case, where the callback is disabled, we don't get any
notification that the host has finished with the buffer ... hence the
need for a timer.

2.6.25-rc2 rebase below.

   Of 
 course, this is usually called from an interrupt handler, so it's not 
 entirely trivial: we can't free the skbs there.
 
A softirq is probably the answer here, but AFAICT that's old fashioned.  
 Not sure what the right way of doing this is now...

Thanks,
Mark.

Subject: [PATCH] virtio_net: free transmit skbs in a timer

virtio_net currently only frees old transmit skbs just
before queueing new ones. If the queue is full, it then
enables interrupts and waits for notification that more
work has been performed.

However, a side-effect of this scheme is that there are
always xmit skbs left dangling when no new packets are
sent, against the Documentation/networking/driver.txt
guideline:

  ... it is not allowed for your TX mitigation scheme
   to let TX packets hang out in the TX ring unreclaimed
   forever if no new TX packets are sent.

Add a timer to ensure that any time we queue new TX
skbs, we will shortly free them again.

This fixes an easily reproduced hang at shutdown where
iptables attempts to unload nf_conntrack and nf_conntrack
waits for an skb it is tracking to be freed, but virtio_net
never frees it.

Signed-off-by: Mark McLoughlin [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |   30 --
 1 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f926b5a..69b308a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -44,6 +44,8 @@ struct virtnet_info
/* The skb we couldn't send because buffers were full. */
struct sk_buff *last_xmit_skb;
 
+   struct timer_list xmit_free_timer;
+
/* Number of input buffers, and max we've ever had. */
unsigned int num, max;
 
@@ -230,9 +232,23 @@ static void free_old_xmit_skbs(struct virtnet_info *vi)
}
 }
 
+static void xmit_free(unsigned long data)
+{
+   struct virtnet_info *vi = (void *)data;
+
+   netif_tx_lock(vi-dev);
+
+   free_old_xmit_skbs(vi);
+
+   if (!skb_queue_empty(vi-send))
+   mod_timer(vi-xmit_free_timer, jiffies + (HZ/10));
+
+   netif_tx_unlock(vi-dev);
+}
+
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
-   int num;
+   int num, err;
struct scatterlist sg[2+MAX_SKB_FRAGS];
struct virtio_net_hdr *hdr;
const unsigned char *dest = ((struct ethhdr *)skb-data)-h_dest;
@@ -275,7 +291,11 @@ static int xmit_skb(struct virtnet_info *vi, struct 
sk_buff *skb)
vnet_hdr_to_sg(sg, skb);
num = skb_to_sgvec(skb, sg+1, 0, skb-len) + 1;
 
-   return vi-svq-vq_ops-add_buf(vi-svq, sg, num, 0, skb);
+   err = vi-svq-vq_ops-add_buf(vi-svq, sg, num, 0, skb);
+   if (!err)
+   mod_timer(vi-xmit_free_timer, jiffies + (HZ/10));
+
+   return err;
 }
 
 static int start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -428,6 +448,10 @@ static int virtnet_probe(struct virtio_device *vdev)
skb_queue_head_init(vi-recv);
skb_queue_head_init(vi-send);
 
+   init_timer(vi-xmit_free_timer);
+   vi-xmit_free_timer.data = (unsigned long)vi;
+   vi-xmit_free_timer.function = xmit_free;
+
err = register_netdev(dev);
if (err) {
pr_debug(virtio_net: registering device failed\n);
@@ -465,6 +489,8 @@ static void virtnet_remove(struct virtio_device *vdev)
/* Stop all the virtqueues. */
vdev-config-reset(vdev);
 
+   del_timer_sync(vi-xmit_free_timer);
+
/* Free our skbs in send and recv queues, if any. */
while ((skb = __skb_dequeue(vi-recv)) != NULL) {
kfree_skb(skb);
--

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] xen: Enable Xen console by default in domU

2008-04-11 Thread Mark McLoughlin
On Fri, 2008-04-11 at 09:52 +0100, Mark McLoughlin wrote:
 On Thu, 2008-04-10 at 17:46 +0200, Markus Armbruster wrote:
  diff --git a/kernel/printk.c b/kernel/printk.c
  index c46a20a..c07bfc1 100644
  --- a/kernel/printk.c
  +++ b/kernel/printk.c
  @@ -118,6 +118,7 @@ struct console_cmdline
   static struct console_cmdline console_cmdline[MAX_CMDLINECONSOLES];
   static int selected_console = -1;
   static int preferred_console = -1;
  +int console_set_on_cmdline;

You'd also need to export this symbol for the unusual case where fbfront
is a module ...

Cheers,
Mark.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-02-14 Thread Mark McLoughlin
On Wed, 2008-02-13 at 20:54 +, Ian Campbell wrote:
 This allows other boot loaders such as the Xen domain builder the
 opportunity to extract the ELF file.

Right, Xen currently can't boot bzImage (it needs the ELF image) so you
still can't use the same kernel image on Xen as bare-metal.

 +Field name:  compressed_payload_offset
 +Type:read
 +Offset/size: 0x248/4
 +Protocol:2.08+
 +
 +  If non-zero then this field contains the offset from the end of the
 +  real-mode code to the compressed payload. The compression format
 +  should be determined using the standard magic number, currently only
 +  gzip is used.

Should probably mention that the payload format is expected to be ELF.
 
 diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
 index f88458e..9695aff 100644
 --- a/arch/x86/boot/Makefile
 +++ b/arch/x86/boot/Makefile
 @@ -94,6 +94,20 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
  
  SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
  
 +sed-offsets := -e 's/^00*/0/' \
 +-e 's/^\([0-9a-fA-F]*\) . \(input_data\|input_data_end\)$$/\#define 
 \2 0x\1/p'
 +
 +quiet_cmd_offsets = OFFSETS $@
 +  cmd_offsets = $(NM) $ | sed -n $(sed-offsets)  $@
 +
 +$(obj)/offsets.h: $(obj)/compressed/vmlinux FORCE
 + $(call if_changed,offsets)
 +
 +targets += offsets.h
 +
 +AFLAGS_header.o += -I$(obj)
 +$(obj)/header.o: $(obj)/offsets.h

How about this?

+sed-offsets := -e 's/^00*/0/' \
+-e 's/^\([0-9a-fA-F]*\) . \(input_data\|input_data_end\)$$/-D\2=0x\1 
/p'
+
+$(obj)/header.o: AFLAGS_header.o += $(shell $(NM) $(obj)/compressed/vmlinux | 
sed -n $(sed-offsets))
+$(obj)/header.o: $(obj)/compressed/vmlinux FORCE

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 1/3] x86: use ELF format in compressed images.

2008-02-14 Thread Mark McLoughlin
On Thu, 2008-02-14 at 17:01 +, Ian Campbell wrote:
 On Thu, 2008-02-14 at 11:34 +, Mark McLoughlin wrote:
  On Wed, 2008-02-13 at 20:54 +, Ian Campbell wrote:
   This allows other boot loaders such as the Xen domain builder the
   opportunity to extract the ELF file.
  
  Right, Xen currently can't boot bzImage (it needs the ELF image) so you
  still can't use the same kernel image on Xen as bare-metal.
 
 I have a xen domain builder patch as well. I was waiting for the Linux
 side to gain some traction before putting it forward (I'd attach it now
 but it's at home on a laptop which is sleeping).

Yep, just want to highlight to people that your patches (or an
alternative) are needed before the same pv_ops kernel truly can be used
on bare-metal and Xen.

  How about this?
  
  +sed-offsets := -e 's/^00*/0/' \
  +-e 's/^\([0-9a-fA-F]*\) . 
  \(input_data\|input_data_end\)$$/-D\2=0x\1 /p'
  +
  +$(obj)/header.o: AFLAGS_header.o += $(shell $(NM) 
  $(obj)/compressed/vmlinux | sed -n $(sed-offsets))
  +$(obj)/header.o: $(obj)/compressed/vmlinux FORCE
 
 That's probably a neater way of doing it. Although the .../header.o:
 AFLAGS_header.o is redundant, either 
   header.o: AFLAGS += foo

With this, AFLAGS would apply to building when building the
prerequisites of header.o too, which you don't want

The make manual says:

  when you define a target-specific variable that variable value is
   also in effect for all prerequisites of this target, and all their
   prerequisites

 or
   AFLAGS_header.o += foo
 with the second being preferred in Linux Makefiles I think.

And with this, it would try and read vmlinux before it is built.

But, hmm, given the fact that the variable is defined when building
prerequisites, you'd think that even in my version AFLAGS would be
evaluated before building vmlinux.

Cheers,
Mark.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization