from:"Ben Hutchings"

Re: [PATCH] KVM: arm/arm64: Revert to old way of checking for device mapping in stage2_flush_ptes().

2015-12-03 Thread Ben Hutchings

On Wed, 2015-12-02 at 18:41 +0100, Ard Biesheuvel wrote:
> Hi Pavel,
> 
> Thanks for getting to the bottom of this.
> 
> On 1 December 2015 at 14:03, Pavel Fedin <p.fe...@samsung.com> wrote:
> > This function takes stage-II physical addresses (A.K.A. IPA), on input, not
> > real physical addresses. This causes kvm_is_device_pfn() to return wrong
> > values, depending on how much guest and host memory maps match. This
> > results in completely broken KVM on some boards. The problem has been
> > caught on Samsung proprietary hardware.
> > 
> > Cc: sta...@vger.kernel.org
> > Fixes: e6fab5442345 ("ARM/arm64: KVM: test properly for a PTE's 
> > uncachedness")
> > 
> 
> That commit is not in a release yet, so no need for cc stable
[...]

But it is cc'd to stable, so unless it is going to be nacked at review
stage, any subsequent fixes should also be cc'd.

Ben.

-- 
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison

signature.asc
Description: This is a digitally signed message part

Re: Bug#802926: linux-image-4.2.0-1-amd64: KVM hangs with 100% cpu on 4.2

2015-10-26 Thread Ben Hutchings

It seems that real mode virtualisation on Nehalem has regressed in 4.2:

On Sun, 2015-10-25 at 10:08 +0100, Stefan Fritsch wrote:
[...]
> I cannot use KVM with 4.2, qemu loops with 100% CPU during seabios
> initialization. Booting with the latest linux-image-4.1.0-2-amd64 fixes
> the issue.
[...]
> kvm tracing shows:
> 
> ...
>  qemu-system-x86-3219  [007]   1090.728418: kvm_set_irq: gsi 12 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728418: kvm_pic_set_irq: chip 1 pin 4 
> (edge|masked)
>  qemu-system-x86-3219  [007]   1090.728419: kvm_ioapic_set_irq: pin 12 
> dst 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3219  [007]   1090.728419: kvm_set_irq: gsi 1 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728420: kvm_pic_set_irq: chip 0 pin 1 
> (edge|masked)
>  qemu-system-x86-3219  [007]   1090.728420: kvm_ioapic_set_irq: pin 1 dst 
> 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3219  [007]   1090.728420: kvm_set_irq: gsi 12 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728421: kvm_pic_set_irq: chip 1 pin 4 
> (edge|masked)
>  qemu-system-x86-3219  [007]   1090.728421: kvm_ioapic_set_irq: pin 12 
> dst 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3219  [007]   1090.728723: kvm_set_irq: gsi 0 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728724: kvm_pic_set_irq: chip 0 pin 0 
> (edge)
>  qemu-system-x86-3219  [007]   1090.728725: kvm_ioapic_set_irq: pin 2 dst 
> 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3219  [007]   1090.728725: kvm_set_irq: gsi 0 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728725: kvm_pic_set_irq: chip 0 pin 0 
> (edge)
>  qemu-system-x86-3219  [007]   1090.728725: kvm_ioapic_set_irq: pin 2 dst 
> 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3219  [007]   1090.728726: kvm_set_irq: gsi 0 level 0 
> source 0
>  qemu-system-x86-3219  [007]   1090.728726: kvm_pic_set_irq: chip 0 pin 0 
> (edge)
>  qemu-system-x86-3219  [007]   1090.728726: kvm_ioapic_set_irq: pin 2 dst 
> 0 vec=0 (Fixed|physical|edge|masked)
>  qemu-system-x86-3221  [000] d...  1090.729926: kvm_write_tsc_offset: vcpu=0 
> prev=18446740943986499809 next=18446740943986499809
>  qemu-system-x86-3221  [000]   1090.729927: kvm_track_tsc: vcpu_id 0 
> masterclock 1 offsetmatched 0 nr_online 1 hostclock tsc
>  qemu-system-x86-3221  [000]   1090.730004: kvm_update_master_clock: 
> masterclock 1 hostclock tsc offsetmatched 1
>  qemu-system-x86-3221  [000] d...  1090.730010: kvm_entry: vcpu 0
>  qemu-system-x86-3221  [000]   1090.730013: kvm_emulate_insn: 
> :fff0:ea 5b e0 00 f0 (real)
>  qemu-system-x86-3221  [000] d...  1090.730016: kvm_entry: vcpu 0
>  qemu-system-x86-3221  [000] d...  1090.730017: kvm_exit: reason 
> EPT_VIOLATION rip 0xe05b info 81 0
>  qemu-system-x86-3221  [000]   1090.730018: kvm_page_fault: address 
> feffc000 error_code 81
>  qemu-system-x86-3221  [000] d...  1090.730032: kvm_entry: vcpu 0
>  qemu-system-x86-3221  [000] d...  1090.730034: kvm_exit: reason 
> EXCEPTION_NMI rip 0xe05b info 0 8b08
>  qemu-system-x86-3221  [000]   1090.730035: kvm_inj_exception: #DF (0x0)
>  qemu-system-x86-3221  [000] d...  1090.730040: kvm_entry: vcpu 0
>  qemu-system-x86-3221  [000] d...  1090.730041: kvm_exit: reason 
> EXCEPTION_NMI rip 0xfea5 info 0 8b08
>  qemu-system-x86-3221  [000]   1090.730041: kvm_inj_exception: #DF (0x0)
>  qemu-system-x86-3221  [000] d...  1090.730043: kvm_entry: vcpu 0
> ...
[...]
> I am rather surprised that there is no bug report about this yet, so I
> guess it only happens on some hardware. Mine is a Core i7-860.

Ben.

-- 
Ben Hutchings
Beware of bugs in the above code;
I have only proved it correct, not tried it. - Donald Knuth

signature.asc
Description: This is a digitally signed message part

Re: [PATCH v7 28/46] vhost: make features 64 bit

2014-11-30 Thread Ben Hutchings

On Sun, 2014-11-30 at 18:44 +0300, Sergei Shtylyov wrote:
 Hello.
 
 On 11/30/2014 6:11 PM, Michael S. Tsirkin wrote:
 
  We need to use bit 32 for virtio 1.0
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  Reviewed-by: Jason Wang jasow...@redhat.com
  ---
drivers/vhost/vhost.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
 
  diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
  index 3eda654..c624b09 100644
  --- a/drivers/vhost/vhost.h
  +++ b/drivers/vhost/vhost.h
  @@ -106,7 +106,7 @@ struct vhost_virtqueue {
  /* Protected by virtqueue mutex. */
  struct vhost_memory *memory;
  void *private_data;
  -   unsigned acked_features;
  +   u64 acked_features;
  /* Log write descriptors */
  void __user *log_base;
  struct vhost_log *log;
  @@ -174,6 +174,6 @@ enum {
 
static inline int vhost_has_feature(struct vhost_virtqueue *vq, int bit)
{
  -   return vq-acked_features  (1  bit);
  +   return vq-acked_features  (1ULL  bit);
 
 Erm, wouldn't the high word be just dropped when returning *int*? I think 
 you need !!(vq-acked_features  (1ULL  bit)).

Or change the return type to bool.

Ben.

-- 
Ben Hutchings
The first rule of tautology club is the first rule of tautology club.


signature.asc
Description: This is a digitally signed message part

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings

On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable
UFO through virtio) breaks live migration of KVM guests from older to
newer host kernels:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4

The problem occurs when a guest running on a host kernel without commit
3d0ad0941 in tun.ko attempts to live migration to a host with commit
3d0ad0941.

Live migration fails in QEMU with the following error message:

virtio-net: saved image requires TUN_F_UFO support

The old host tun.ko advertised support for TUN_F_UFO. The new host
tun.ko does not and that's why QEMU aborts live migration. QEMU cannot
change the features of a running virtio-net device.

Yes, this is known and was mentioned in the DSA.

tuxcrafter provided logs from two Debian hosts migrating from
3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:

http://paste.debian.net/131264/

I haven't investigated enough to suggest a fix, just wanted to bring it
to your attention. Soon a lot of people will be hitting this problem as
they upgrade their infrastructure and migrate guests - seems like a
critical issue.

You can work around this by making macvtap and tun still claim to
support UFO. They continue to support it even if it's not advertised
because the tap features don't reliably get propagated to virtio
devices.

Ben.

--
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
- Carolyn Scheppner

signature.asc
Description: This is a digitally signed message part

[PATCH net] Revert drivers/net: Disable UFO through virtio in macvtap and tun

2014-11-11 Thread Ben Hutchings

This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for
the tap drivers, but leaves UFO disabled in virtio_net.

libvirt at least assumes that tap features will never be dropped
in new kernel versions, and doing so prevents migration of VMs to
the never kernel version while they are running with virtio net
devices.

Fixes: 88e0e0e5aa7a (drivers/net: Disable UFO through virtio)
Signed-off-by: Ben Hutchings b...@decadent.org.uk
---
Compile-tested only.

Ben.

 drivers/net/macvtap.c | 13 -
 drivers/net/tun.c | 19 ---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6f226de..aeaeb6d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -66,7 +66,7 @@ static struct cdev macvtap_cdev;
 static const struct proto_ops macvtap_socket_ops;
 
 #define TUN_OFFLOADS (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | NETIF_F_TSO | \
- NETIF_F_TSO6)
+ NETIF_F_TSO6 | NETIF_F_UFO)
 #define RX_OFFLOADS (NETIF_F_GRO | NETIF_F_LRO)
 #define TAP_FEATURES (NETIF_F_GSO | NETIF_F_SG)
 
@@ -570,8 +570,6 @@ static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   pr_warn_once(macvtap: %s: using disabled UFO feature; 
please fix this program\n,
-current-comm);
gso_type = SKB_GSO_UDP;
if (skb-protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
@@ -619,6 +617,8 @@ static void macvtap_skb_to_vnet_hdr(const struct sk_buff 
*skb,
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
else if (sinfo-gso_type  SKB_GSO_TCPV6)
vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+   else if (sinfo-gso_type  SKB_GSO_UDP)
+   vnet_hdr-gso_type = VIRTIO_NET_HDR_GSO_UDP;
else
BUG();
if (sinfo-gso_type  SKB_GSO_TCP_ECN)
@@ -953,6 +953,9 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
if (arg  TUN_F_TSO6)
feature_mask |= NETIF_F_TSO6;
}
+
+   if (arg  TUN_F_UFO)
+   feature_mask |= NETIF_F_UFO;
}
 
/* tun/tap driver inverts the usage for TSO offloads, where
@@ -963,7 +966,7 @@ static int set_offload(struct macvtap_queue *q, unsigned 
long arg)
 * When user space turns off TSO, we turn off GSO/LRO so that
 * user-space will not receive TSO frames.
 */
-   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6))
+   if (feature_mask  (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_UFO))
features |= RX_OFFLOADS;
else
features = ~RX_OFFLOADS;
@@ -1064,7 +1067,7 @@ static long macvtap_ioctl(struct file *file, unsigned int 
cmd,
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg  ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-   TUN_F_TSO_ECN))
+   TUN_F_TSO_ECN | TUN_F_UFO))
return -EINVAL;
 
rtnl_lock();
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7302398..a0987d1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,7 +175,7 @@ struct tun_struct {
struct net_device   *dev;
netdev_features_t   set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
- NETIF_F_TSO6)
+ NETIF_F_TSO6|NETIF_F_UFO)
 
int vnet_hdr_sz;
int sndbuf;
@@ -1152,20 +1152,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
skb_shinfo(skb)-gso_type = SKB_GSO_TCPV6;
break;
case VIRTIO_NET_HDR_GSO_UDP:
-   {
-   static bool warned;
-
-   if (!warned) {
-   warned = true;
-   netdev_warn(tun-dev,
-   %s: using disabled UFO feature; 
please fix this program\n,
-   current-comm);
-   }
skb_shinfo(skb)-gso_type = SKB_GSO_UDP;
if (skb-protocol == htons(ETH_P_IPV6))
ipv6_proxy_select_ident(skb);
break;
-   }
default:
tun-dev-stats.rx_frame_errors++;
kfree_skb(skb);
@@ -1265,6 +1255,8 @@ static ssize_t tun_put_user(struct tun_struct *tun

Re: TUN_F_UFO change breaks live migration

2014-11-11 Thread Ben Hutchings

On Tue, 2014-11-11 at 17:57 +0200, Michael S. Tsirkin wrote:
On Tue, Nov 11, 2014 at 12:17:26PM +, Ben Hutchings wrote:
On Tue, 2014-11-11 at 10:58 +, Stefan Hajnoczi wrote:
Commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4 (drivers/net: Disable
UFO through virtio) breaks live migration of KVM guests from older to
newer host kernels:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3d0ad09412ffe00c9afa201d01effdb6023d09b4

The problem occurs when a guest running on a host kernel without commit
3d0ad0941 in tun.ko attempts to live migration to a host with commit
3d0ad0941.

Live migration fails in QEMU with the following error message:

virtio-net: saved image requires TUN_F_UFO support

The old host tun.ko advertised support for TUN_F_UFO. The new host
tun.ko does not and that's why QEMU aborts live migration. QEMU cannot
change the features of a running virtio-net device.

Yes, this is known and was mentioned in the DSA.

tuxcrafter provided logs from two Debian hosts migrating from
3.2.60-1+deb7u3 to 3.2.63-2+deb7u1:

http://paste.debian.net/131264/

You can work around this by making macvtap and tun still claim to
support UFO.

If this is what we want userspace to do, let's just put the
feature flag back?

Let's not *just* put the feature flag back. I accept this is probably
needed as a workaround, but UFO/IPv6 still won't work correctly over
virtio.

Basically userspace assumed that features will only
ever be added, never removed, so this change is
breaking it.

OK.

They continue to support it even if it's not advertised
because the tap features don't reliably get propagated to virtio
devices.

Ben.

Hmm I don't understand this last sentence.
features are actually reliably propagated to virtio devices.

They might be when using current QEMU and libvirt on the host. They
weren't when I tested on Debian stable. The warnings about 'using
disabled UFO feature' are reliably triggered if the host's tap driver is
patched and the guest's virtio_net driver is not.

Ben.

--
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
- Carolyn Scheppner

signature.asc
Description: This is a digitally signed message part

Re: [GIT PULL 2/2] KVM: s390/mm: try a cow on read only pages for key ops

2014-08-26 Thread Ben Hutchings

On Mon, 2014-08-25 at 15:10 +0200, Christian Borntraeger wrote:
 The PFMF instruction handler  blindly wrote the storage key even if
 the page was mapped R/O in the host. Lets try a COW before continuing
 and bail out in case of errors.
 
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 Reviewed-by: Dominik Dingel din...@linux.vnet.ibm.com
 Cc: sta...@vger.kernel.org
 ---
  arch/s390/mm/pgtable.c | 10 ++
  1 file changed, 10 insertions(+)
 
 diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
 index 19daa53..5404a62 100644
 --- a/arch/s390/mm/pgtable.c
 +++ b/arch/s390/mm/pgtable.c
 @@ -986,11 +986,21 @@ int set_guest_storage_key(struct mm_struct *mm, 
 unsigned long addr,
   pte_t *ptep;
  
   down_read(mm-mmap_sem);
 +retry:
   ptep = get_locked_pte(current-mm, addr, ptl);
   if (unlikely(!ptep)) {
   up_read(mm-mmap_sem);
   return -EFAULT;
   }
 + if (!(pte_val(*ptep)  _PAGE_INVALID) 
 +  (pte_val(*ptep)  _PAGE_PROTECT)) {
 + pte_unmap_unlock(*ptep, ptl);
 + if (fixup_user_fault(current, mm, addr, 
 FAULT_FLAG_WRITE)) {
 + up_read(mm-mmap_sem);
 + return -EFAULT;
 + }
 + goto retry;
 + }

Every line below the first 'if' is indented one tab stop too far.

Ben.

   new = old = pgste_get_lock(ptep);
   pgste_val(new) = ~(PGSTE_GR_BIT | PGSTE_GC_BIT |

-- 
Ben Hutchings
No political challenge can be met by shopping. - George Monbiot
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v5] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-04-22 Thread Ben Hutchings

On Mon, 2014-04-21 at 12:26 +0100, Luis Henriques wrote:
 Hi David,

 On Thu, Mar 27, 2014 at 03:29:56PM -0400, David Miller wrote:
  From: Zoltan Kiss zoltan.k...@citrix.com
  Date: Wed, 26 Mar 2014 22:37:45 +

   skb_zerocopy can copy elements of the frags array between skbs, but it 
   doesn't
   orphan them. Also, it doesn't handle errors, so this patch takes care of 
   that
   as well, and modify the callers accordingly. skb_tx_error() is also added 
   to
   the callers so they will signal the failed delivery towards the creator 
   of the
   skb.

   Signed-off-by: Zoltan Kiss zoltan.k...@citrix.com

  Fingers crossed :-)  Applied, thanks Zoltan.
  --
  To unsubscribe from this list: send the line unsubscribe netdev in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html

 Could you please queue this for stable as well?  It has CVE-2014-2568
 assigned to it and I believe it is applicable to some of the trees.

It was applied to 3.13, but needs backporting to earlier versions.  I
posted my attempt in
1397429860.10849.86.ca...@deadeye.wl.decadent.org.uk but it needs
testing/reviewing.

Ben.

-- 
Ben Hutchings
Beware of programmers who carry screwdrivers. - Leonard Brandwein

signature.asc
Description: This is a digitally signed message part

[PATCH 3.13] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-04-22 Thread Ben Hutchings

From: Zoltan Kiss zoltan.k...@citrix.com

commit 36d5fe6a000790f56039afe26834265db0a3ad4c upstream.

skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss zoltan.k...@citrix.com
Signed-off-by: David S. Miller da...@davemloft.net
[bwh: Backported to 3.13: skb_zerocopy() is new in 3.14, but was moved from a
 static function in nfnetlink_queue.  We need to patch that and its caller, but
 not openvswitch.]
Signed-off-by: Ben Hutchings b...@decadent.org.uk
---
On Tue, 2014-04-22 at 19:02 +0100, Zoltan Kiss wrote:
 On 22/04/14 16:38, Ben Hutchings wrote:
  On Mon, 2014-04-21 at 12:26 +0100, Luis Henriques wrote:
  Hi David,
 
  On Thu, Mar 27, 2014 at 03:29:56PM -0400, David Miller wrote:
  From: Zoltan Kiss zoltan.k...@citrix.com
  Date: Wed, 26 Mar 2014 22:37:45 +
 
  skb_zerocopy can copy elements of the frags array between skbs, but it 
  doesn't
  orphan them. Also, it doesn't handle errors, so this patch takes care of 
  that
  as well, and modify the callers accordingly. skb_tx_error() is also 
  added to
  the callers so they will signal the failed delivery towards the creator 
  of the
  skb.
 
  Signed-off-by: Zoltan Kiss zoltan.k...@citrix.com
 
  Fingers crossed :-)  Applied, thanks Zoltan.
  --
  To unsubscribe from this list: send the line unsubscribe netdev in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
  Could you please queue this for stable as well?  It has CVE-2014-2568
  assigned to it and I believe it is applicable to some of the trees.
 
  It was applied to 3.13, but needs backporting to earlier versions.  I
  posted my attempt in
  1397429860.10849.86.ca...@deadeye.wl.decadent.org.uk but it needs
  testing/reviewing.
 
 This one is a different issue, although it is very similar.

Sorry for mixing these up.  I believe this bug has been present since
zerocopy was added to nfqueue in Linux 3.10 (commit ae08ce002108).  This
is the version I used for Debian's 3.13 branch, which might be usable
for older stable branches too.

Ben.

---
--- a/net/netfilter/nfnetlink_queue_core.c
+++ b/net/netfilter/nfnetlink_queue_core.c
@@ -235,22 +235,23 @@ nfqnl_flush(struct nfqnl_instance *queue
spin_unlock_bh(queue-lock);
 }
 
-static void
+static int
 nfqnl_zcopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen)
 {
int i, j = 0;
int plen = 0; /* length of skb-head fragment */
+   int ret;
struct page *page;
unsigned int offset;
 
/* dont bother with small payloads */
-   if (len = skb_tailroom(to)) {
-   skb_copy_bits(from, 0, skb_put(to, len), len);
-   return;
-   }
+   if (len = skb_tailroom(to))
+   return skb_copy_bits(from, 0, skb_put(to, len), len);
 
if (hlen) {
-   skb_copy_bits(from, 0, skb_put(to, hlen), hlen);
+   ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen);
+   if (unlikely(ret))
+   return ret;
len -= hlen;
} else {
plen = min_t(int, skb_headlen(from), len);
@@ -268,6 +269,11 @@ nfqnl_zcopy(struct sk_buff *to, const st
to-len += len + plen;
to-data_len += len + plen;
 
+   if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) {
+   skb_tx_error(from);
+   return -ENOMEM;
+   }
+
for (i = 0; i  skb_shinfo(from)-nr_frags; i++) {
if (!len)
break;
@@ -278,6 +284,8 @@ nfqnl_zcopy(struct sk_buff *to, const st
j++;
}
skb_shinfo(to)-nr_frags = j;
+
+   return 0;
 }
 
 static int
@@ -374,13 +382,16 @@ nfqnl_build_packet_message(struct net *n
 
skb = nfnetlink_alloc_skb(net, size, queue-peer_portid,
  GFP_ATOMIC);
-   if (!skb)
+   if (!skb) {
+   skb_tx_error(entskb);
return NULL;
+   }
 
nlh = nlmsg_put(skb, 0, 0,
NFNL_SUBSYS_QUEUE  8 | NFQNL_MSG_PACKET,
sizeof(struct nfgenmsg), 0);
if (!nlh) {
+   skb_tx_error(entskb);
kfree_skb(skb);
return NULL;
}
@@ -504,13 +515,15 @@ nfqnl_build_packet_message(struct net *n
nla-nla_type = NFQA_PAYLOAD;
nla-nla_len = nla_attr_size(data_len);
 
-   nfqnl_zcopy(skb, entskb, data_len, hlen);
+   if (nfqnl_zcopy(skb, entskb, data_len, hlen))
+   goto nla_put_failure;
}
 
nlh-nlmsg_len = skb-len;
return skb;
 
 nla_put_failure:
+   skb_tx_error(entskb

Re: [RFC v2 2/4] net: enables interface option to skip IP

2014-02-24 Thread Ben Hutchings

On Mon, 2014-02-24 at 18:04 -0500, David Miller wrote:
 From: Dan Williams d...@redhat.com
 Date: Mon, 24 Feb 2014 12:22:00 -0600

  In the future I expect more people will want to disable IPv4 as
  they move to IPv6.

 I definitely don't.

 I've been lightly following this conversation and I have to say
 a few things.

 disable_ipv6 was added because people wanted to make sure their
 machines didn't generate any ipv6 traffic because ipv6 is not
 mature, we don't have our firewalls configured to handle that
 kind of traffic etc.

 None of these things apply to ipv4.

 And if you think people will go to ipv6 only, you are dreaming.

 Name a provider of a major web sitewho will go to strictly only
 providing an ipv6 facing site?

 Only an idiot who wanted to lose significiant nunbers of page views
 and traffic would do that,

That's obviously true for public-facing servers, but that doesn't mean
it's not useful to anyone.

 so ipv4 based connectivity will be universally necessary forever.

You can run an internal network, or access network, as v6-only with
NAT64 and DNS64 at the border.  I believe some mobile networks are doing
this; it was also done on the main FOSDEM wireless network this year.

Ben.

 I think disable_ipv4 is absolutely a non-starter.

-- 
Ben Hutchings
Beware of bugs in the above code;
I have only proved it correct, not tried it. - Donald Knuth

signature.asc
Description: This is a digitally signed message part

Re: [RFC v2 2/4] net: enables interface option to skip IP

2014-02-24 Thread Ben Hutchings

On Mon, 2014-02-24 at 19:12 -0500, David Miller wrote:
 From: Ben Hutchings b...@decadent.org.uk
 Date: Tue, 25 Feb 2014 00:02:00 +

  You can run an internal network, or access network, as v6-only with
  NAT64 and DNS64 at the border.  I believe some mobile networks are doing
  this; it was also done on the main FOSDEM wireless network this year.

 This seems to be bloating up the networking headers of the internal
 network, for what purpose?

 For mobile that's doubly inadvisable.

I don't know what the reasoning is for the mobile network operators.
They're forced to do NAT for v4 somewhere, and maybe v6-only makes the
access network easier to manage.

I doubt the extra header length hurts that much on a 3G or 4G network.

Ben.

-- 
Ben Hutchings
Everything should be made as simple as possible, but not simpler.
   - Albert Einstein

signature.asc
Description: This is a digitally signed message part

Re: [RFC v2 1/4] bridge: enable interfaces to opt out from becoming the root bridge

2014-02-16 Thread Ben Hutchings

On Fri, 2014-02-14 at 18:59 -0800, Luis R. Rodriguez wrote:
 From: Luis R. Rodriguez mcg...@suse.com
 
 It doesn't make sense for some interfaces to become a root bridge

I think you mean 'root port'.

 at any point in time. One example is virtual backend interfaces
 which rely on other entities on the bridge for actual physical
 connectivity. They only provide virtual access.
 
 Device drivers that know they should never become part of the
 root bridge have been using a trick of setting their MAC address
 to a high broadcast MAC address such as FE:FF:FF:FF:FF:FF. Instead
 of using these hacks lets the interfaces annotate its intent and
 generalizes a solution for multiple drivers, while letting the
 drivers use a random MAC address or one prefixed with a proper OUI.
 This sort of hack is used by both qemu and xen for their backend
 interfaces.
 
 Cc: Stephen Hemminger step...@networkplumber.org
 Cc: bri...@lists.linux-foundation.org
 Cc: net...@vger.kernel.org
 Cc: linux-ker...@vger.kernel.org
 Signed-off-by: Luis R. Rodriguez mcg...@suse.com
 ---
  include/uapi/linux/if.h | 1 +
  net/bridge/br_if.c  | 2 ++
  net/bridge/br_private.h | 1 +
  net/bridge/br_stp_if.c  | 2 ++
  4 files changed, 6 insertions(+)
 
 diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
 index d758163..8d10382 100644
 --- a/include/uapi/linux/if.h
 +++ b/include/uapi/linux/if.h
 @@ -84,6 +84,7 @@
  #define IFF_LIVE_ADDR_CHANGE 0x10/* device supports hardware 
 address
* change when it's running */
  #define IFF_MACVLAN 0x20 /* Macvlan device */
 +#define IFF_BRIDGE_NON_ROOT 0x40/* Don't consider for root bridge */
[...]

Does it really make sense to add a flag that says exactly which special
behaviour you want, or would it be better to define the flag as a
passive property, which other drivers/protocols then use as a condition
for special behaviour?

The fact that you also define the IFF_BRIDGE_SKIP_IP flag, and set it on
exactly the same devices, makes me think that they should actually be a
single flag.  I don't know how that flag should be named or described,
though.

Ben.

-- 
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.


signature.asc
Description: This is a digitally signed message part

Re: [PATCH V2 5/6] vhost_net: poll vhost queue after marking DMA is done

2013-08-30 Thread Ben Hutchings

On Fri, 2013-08-30 at 12:29 +0800, Jason Wang wrote:
 We used to poll vhost queue before making DMA is done, this is racy if vhost
 thread were waked up before marking DMA is done which can result the signal to
 be missed. Fix this by always poll the vhost thread before DMA is done.

Does this bug only exist in net-next or is it older?  Should the fix go
to net and stable branches?

Ben.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c |9 +
  1 files changed, 5 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index ff60c2a..d09c17c 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -308,6 +308,11 @@ static void vhost_zerocopy_callback(struct ubuf_info 
 *ubuf, bool success)
   struct vhost_virtqueue *vq = ubufs-vq;
   int cnt = atomic_read(ubufs-kref.refcount);
  
 + /* set len to mark this desc buffers done DMA */
 + vq-heads[ubuf-desc].len = success ?
 + VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 + vhost_net_ubuf_put(ubufs);
 +
   /*
* Trigger polling thread if guest stopped submitting new buffers:
* in this case, the refcount after decrement will eventually reach 1
 @@ -318,10 +323,6 @@ static void vhost_zerocopy_callback(struct ubuf_info 
 *ubuf, bool success)
*/
   if (cnt = 2 || !(cnt % 16))
   vhost_poll_queue(vq-poll);
 - /* set len to mark this desc buffers done DMA */
 - vq-heads[ubuf-desc].len = success ?
 - VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
 - vhost_net_ubuf_put(ubufs);
  }
  
  /* Expects to be always run from workqueue - which acts as

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021

2013-05-29 Thread Ben Hutchings

On Wed, May 29, 2013 at 05:05:55PM +0200, Stefan Pietsch wrote:
 On 19.05.2013 14:32, Gleb Natapov wrote:
  On Sun, May 19, 2013 at 02:00:31AM +0100, Ben Hutchings wrote:
  Dear KVM maintainers, it appears that there is a gap in x86 emulation,
  at least on a 32-bit host.  Stefan found this when running GRML, a live
  distribution which can be downloaded from:
  http://download.grml.org/grml32-full_2013.02.iso.  His original
  reported is at http://bugs.debian.org/707257.
 
  Can you verify with latest linux.git HEAD? It works for me there on
  64bit. There were a lot of problems fixed in this area in 3.9/3.10 time 
  frame,
  so it would be helpful if you'll test 32bit before I install one myself.
 
 
 Ben,
 
 can you provide a 3.9 series kernel package?

I will do soon, but you should be able to build your own:

# ...unpack upstream source...
$ cp /boot/config-$(uname -r) .config
$ yes  | make oldconfig
$ make deb-pkg -j$(nproc)

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with entry failed, hardware error 0x80000021

2013-05-18 Thread Ben Hutchings

Dear KVM maintainers, it appears that there is a gap in x86 emulation,
at least on a 32-bit host.  Stefan found this when running GRML, a live
distribution which can be downloaded from:
http://download.grml.org/grml32-full_2013.02.iso.  His original
reported is at http://bugs.debian.org/707257.

On Thu, 2013-05-16 at 13:26 +0200, Stefan Pietsch wrote:
 On 09.05.2013 20:56, Stefan Pietsch wrote:
  On 09.05.2013 03:08, Ben Hutchings wrote:
  
  Please could you test some of the intermediate versions at
  http://snapshot.debian.org/package/linux/ to find the first upstream
  version where this was broken.
  
  The first version which does not work is 3.6.4-1~experimental.1.
  3.5.5-1~experimental.1 works.
 
 
 I was able to start KVM under kernel version 3.8.12-1 after loading the
 kvm_intel module with the option emulate_invalid_guest_state=0.

And one of the many changes between 3.5 and 3.6 was to change the
default value of that parameter from 0 to 1.  So we don't know when the
the bug in emulation was introduced (or if it was always there).

Ben.

-- 
Ben Hutchings
The generation of random numbers is too important to be left to chance.
- Robert Coveyou


signature.asc
Description: This is a digitally signed message part

Re: [PATCH] virtio-net: Reporting traffic queue distribution statistics through ethtool

2013-05-16 Thread Ben Hutchings

On Thu, 2013-05-16 at 13:24 -0700, Sriram Narasimhan wrote:
 This patch allows virtio-net driver to report traffic distribution
 to inbound/outbound queues through ethtool -S.  The per_cpu
 virtnet_stats is split into receive and transmit stats and are
 maintained on a per receive_queue and send_queue basis.
 virtnet_stats() is modified to aggregate interface level statistics
 from per-queue statistics.  Sample output below:
 
 NIC statistics:
  rxq0: rx_packets: 4357802
  rxq0: rx_bytes: 292642052
  txq0: tx_packets: 824540
  txq0: tx_bytes: 55256404
  rxq1: rx_packets: 0
  rxq1: rx_bytes: 0
  txq1: tx_packets: 1094268
  txq1: tx_bytes: 73328316
  rxq2: rx_packets: 0
  rxq2: rx_bytes: 0
  txq2: tx_packets: 1091466
  txq2: tx_bytes: 73140566
  rxq3: rx_packets: 0
  rxq3: rx_bytes: 0
  txq3: tx_packets: 1093043
  txq3: tx_bytes: 73246142
 
 Signed-off-by: Sriram Narasimhan sriram.narasim...@hp.com
[...]

That ordering is totally unreadable.  I want to see patches for ethtool
to let the user order and aggregate the queue statistics in a sensible
way:

$ ethtool -S eth0   # default would show aggregated statistics
NIC statistics:
rx_packets: 4357802
rx_bytes: 292642052
tx_packets: 4103317
tx_bytes: 274971428

$ ethtool -S eth0 full group queue  # full statistics, grouped by queue name
NIC statistics:
rxq0:
rx_packets: 4357802
rx_bytes: 292642052
rxq1:
rx_packets: 0
rx_bytes: 0
[...]
txq0:
tx_packets: 824540
tx_bytes: 55256404
txq1:
tx_packets: 1094268
tx_bytes: 73328316
[...]

$ ethtool -S eth0 full group statistic  # full statistics, grouped by statistic 
name
NIC statistics:
rx_packets:
rxq0: 4357802
rxq1: 0
rxq2: 0
rxq3: 0
rx_bytes:
rxq0: 292642052
rxq1: 0
rxq2: 0
rxq3: 0
[...]

Maybe even:

$ ethtool -S eth0 full table
NIC statistics:
rx_packets   rx_bytes
rxq0:  4357802  292642052
rxq1:0  0
rxq2:0  0
rxq3:0  0
tx_packets   tx_bytes
txq0:   824540   55256404
txq1:  1094268   73328316
txq2:  1091466   73140566
txq3:  1093043   73246142

(Difficult to do, but probably very useful!)

The queue names should be rx-index and tx-index like in sysfs.

We'll need to reach a consensus on the naming scheme for per-queue and
otherwise disaggregated statistics (see previous discussions in
http://thread.gmane.org/gmane.linux.kernel.virtualization/15572/focus=15608). 
 I don't much care what they look like as long as there's an implementation for 
the ethtool side and they don't look too awful in older versions of ethtool.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net] vhost/net: fix heads usage of ubuf_info

2013-03-21 Thread Ben Hutchings

On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
 On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
  From: Michael S. Tsirkin m...@redhat.com
  Date: Sun, 17 Mar 2013 14:46:09 +0200

   ubuf info allocator uses guest controlled head as an index,
   so a malicious guest could put the same head entry in the ring twice,
   and we will get two callbacks on the same value.
   To fix use upend_idx which is guaranteed to be unique.

   Reported-by: Rusty Russell ru...@rustcorp.com.au
   Signed-off-by: Michael S. Tsirkin m...@redhat.com

  Applied and queued up for -stable, thanks.

  And thankfully you got the stable URL wrong,

 Yes I wrote sta...@kernel.org that's what an old copy
 says here:
 https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt

 I should have known better than look at it on the 'net.  The top
 'Everything you ever wanted to know about Linux 2.6 -stable releases.'
 is a big hint that it's stale.
 Any idea who maintains this? Better update it or remove it or redirect to git.

Rob Landley maintains it, but he's been having trouble updating it since
all the upload mechanisms were changed on kernel.org.

(My stable maintenance scripts still match the old address, anyway.  Not
sure about Greg's.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net] vhost/net: fix heads usage of ubuf_info

2013-03-21 Thread Ben Hutchings

On Thu, 2013-03-21 at 18:28 +0200, Michael S. Tsirkin wrote:
 On Thu, Mar 21, 2013 at 04:23:48PM +, Ben Hutchings wrote:
  On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
   On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
From: Michael S. Tsirkin m...@redhat.com
Date: Sun, 17 Mar 2013 14:46:09 +0200

 ubuf info allocator uses guest controlled head as an index,
 so a malicious guest could put the same head entry in the ring twice,
 and we will get two callbacks on the same value.
 To fix use upend_idx which is guaranteed to be unique.

 Reported-by: Rusty Russell ru...@rustcorp.com.au
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Applied and queued up for -stable, thanks.

And thankfully you got the stable URL wrong,

   Yes I wrote sta...@kernel.org that's what an old copy
   says here:
   https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt

   I should have known better than look at it on the 'net.  The top
   'Everything you ever wanted to know about Linux 2.6 -stable releases.'
   is a big hint that it's stale.
   Any idea who maintains this? Better update it or remove it or redirect to 
   git.

  Rob Landley maintains it, but he's been having trouble updating it since
  all the upload mechanisms were changed on kernel.org.

  (My stable maintenance scripts still match the old address, anyway.  Not
  sure about Greg's.)

  Ben.

 I hope you mean it will match both the old and the new address?

Yes, of course!

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: virtio-rng only returns zeros with CONFIG_HW_RANDOM=m

2013-02-26 Thread Ben Hutchings

On Wed, 2013-02-27 at 10:43 +1030, Rusty Russell wrote:
 Aurelien Jarno aurel...@aurel32.net writes:
  Hi,
 
  I have noticed that virtio-rng only returns zero for kernels = 2.6.33
  built with CONFIG_HW_RANDOM=m. This is a bit much too predictable for a
  random generator ;-).
 
 Wow.  Fortunately, all of SLES, RHEL, Ubuntu or Fedora set
 CONFIG_HW_RANDOM=y.  What do they know that we don't?
 
 Oops, looks like Debian testing: config-3.2.0-4-amd64:CONFIG_HW_RANDOM=m
[...]

*Groan*.  I'll change the configuration for now, but trust this will be
fixed in virtio-rng later.  Thanks for letting me know.

Ben.

-- 
Ben Hutchings
Never attribute to conspiracy what can adequately be explained by stupidity.


signature.asc
Description: This is a digitally signed message part

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings

On Thu, 2012-12-06 at 10:13 +0200, Michael S. Tsirkin wrote:
 On Wed, Dec 05, 2012 at 08:39:26PM +, Ben Hutchings wrote:
  On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
   Add RFS support to virtio network device.
   Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
   configuration field max_virtqueue_pairs to detect supported number of
   virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
   packet steering for unidirectional protocols.
  [...]
   +Programming of the receive flow classificator is implicit.
   + Transmitting a packet of a specific flow on transmitqX will cause 
   incoming
   + packets for this flow to be steered to receiveqX.
   + For uni-directional protocols, or where no packets have been transmitted
   + yet, device will steer a packet to a random queue out of the specified
   + receiveq0..receiveqn.
  [...]
  
  It doesn't seem like this is usable to implement accelerated RFS in the
  guest, though perhaps that doesn't matter.
 
 What is the issue? Could you be more explicit please?
 
 It seems to work pretty well: if we have
 # of queues = # of cpus, incoming TCP_STREAM into
 guest scales very nicely without manual tweaks in guest.
 
 The way it works is, when guest sends a packet driver
 select the rx queue that we want to use for incoming
 packets for this slow, and transmit on the matching tx queue.
 This is exactly what text above suggests no?

Yes, I get that.

   On the host side, presumably
  you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
  only without a socket?  But in any case, that requires an rxhash, so I
  don't see how this is supposed to work.
  
  Ben.
 
 Host should just do what guest tells it to.
 On the host side we build up the steering table as we get packets
 to transmit. See the code in drivers/net/tun.c in recent
 kernels.
 
 Again this actually works fine - what are the problems that you see?
 Could you give an example please?

I'm not saying it doesn't work in its own way, I just don't see how you
would make it work with the existing RFS!

Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings

On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
[...]
  Since this doesn't seem to be intended to have *any* connection with the
  existing core networking feature called RFS, perhaps you could find a
  different name for it.
  
  Ben.
 
 
 Ah I see what you mean. We started out calling this feature multiqueue
 Rusty suggested RFS since it gives similar functionality to RFS but in
 device: it has receive steering logic per flow as part of the device.

The name is quite generic, but in the context of Linux it has so far
been used for a specific software feature and not as a generic name for
flow steering by hardware (or drivers).  The existing documentation
(Documentation/networking/scaling.txt) states quite clearly that 'RFS'
means that specific software implementation (with optional driver
integration) and configuration interface.

 Maybe simply adding a statement similar to the one above would be
 sufficient to avoid confusion?

No, I don't think it's sufficient.  We have documentation that says how
to configure 'RFS', and you're proposing to add a very similar feature
called 'RFS' that is configured differently.  No matter how clearly you
distinguish them in new documentation, this will make the old
documentation confusing.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-06 Thread Ben Hutchings

On Thu, 2012-12-06 at 23:01 +0200, Michael S. Tsirkin wrote:
 On Thu, Dec 06, 2012 at 08:53:59PM +, Ben Hutchings wrote:
  On Thu, 2012-12-06 at 22:29 +0200, Michael S. Tsirkin wrote:
   On Thu, Dec 06, 2012 at 08:03:14PM +, Ben Hutchings wrote:
  [...]
Since this doesn't seem to be intended to have *any* connection with the
existing core networking feature called RFS, perhaps you could find a
different name for it.

Ben.
   
   
   Ah I see what you mean. We started out calling this feature multiqueue
   Rusty suggested RFS since it gives similar functionality to RFS but in
   device: it has receive steering logic per flow as part of the device.
  
  The name is quite generic, but in the context of Linux it has so far
  been used for a specific software feature and not as a generic name for
  flow steering by hardware (or drivers).  The existing documentation
  (Documentation/networking/scaling.txt) states quite clearly that 'RFS'
  means that specific software implementation (with optional driver
  integration) and configuration interface.
 
   Maybe simply adding a statement similar to the one above would be
   sufficient to avoid confusion?
  
  No, I don't think it's sufficient.  We have documentation that says how
  to configure 'RFS', and you're proposing to add a very similar feature
  called 'RFS' that is configured differently.  No matter how clearly you
  distinguish them in new documentation, this will make the old
  documentation confusing.
  
  Ben.
 
 I don't mind, renaming is just s/RFS/whatever/ away -
 how should hardware call this in your opinion?

If by 'this' you mean the use of perfect filters or a large hash table
to select the RX queue per flow, then 'flow steering'.

But that is usually combined with the fall-back of a simple mapping from
hash to queue ('RSS' or 'flow hashing') in case there is no specific
queue selection yet, which I can see tun has.  And you're specifying
multiple transmit queues too.  If you want a name for the whole set of
features involved, I don't see any better name than 'multiqueue'/'MQ'.

If you want a name for this specific flow steering mechanism, add some
distinguishing adjective(s) like 'virtual' or 'automatic'.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv5] virtio-spec: virtio network device RFS support

2012-12-05 Thread Ben Hutchings

On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
 Add RFS support to virtio network device.
 Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
 configuration field max_virtqueue_pairs to detect supported number of
 virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
 packet steering for unidirectional protocols.
[...]
 +Programming of the receive flow classificator is implicit.
 + Transmitting a packet of a specific flow on transmitqX will cause incoming
 + packets for this flow to be steered to receiveqX.
 + For uni-directional protocols, or where no packets have been transmitted
 + yet, device will steer a packet to a random queue out of the specified
 + receiveq0..receiveqn.
[...]

It doesn't seem like this is usable to implement accelerated RFS in the
guest, though perhaps that doesn't matter.  On the host side, presumably
you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
only without a socket?  But in any case, that requires an rxhash, so I
don't see how this is supposed to work.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next rfc v7 3/3] virtio-net: change the number of queues through ethtool

2012-12-03 Thread Ben Hutchings

On Mon, 2012-12-03 at 13:25 +0200, Michael S. Tsirkin wrote:
 On Mon, Dec 03, 2012 at 02:09:28PM +0800, Jason Wang wrote:
  On Sunday, December 02, 2012 06:09:06 PM Michael S. Tsirkin wrote:
   On Tue, Nov 27, 2012 at 06:16:00PM +0800, Jason Wang wrote:
This patch implement the {set|get}_channels method of ethool to allow 
user
to change the number of queues dymaically when the device is running.
This would let the user to configure it on demand.

Signed-off-by: Jason Wang jasow...@redhat.com
---

 drivers/net/virtio_net.c |   41 
+
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcaa6e5..f08ec2a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1578,10 +1578,51 @@ static struct virtio_driver virtio_net_driver = 
{

 #endif
 };

+/* TODO: Eliminate OOO packets during switching */
+static int virtnet_set_channels(struct net_device *dev,
+   struct ethtool_channels *channels)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   u16 queue_pairs = channels-combined_count;
 
 by the way shouldn't this be combined_count / 2?
 
 And below channels-max_combined = vi-max_queue_pairs * 2; ?
[...]

In this ethtool API, 'channel' means an IRQ and set of queues that
trigger it.  So each ethtool-channel will correspond to one queue-pair
and not one virtio channel.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [rfc net-next v6 2/3] virtio_net: multiqueue support

2012-11-19 Thread Ben Hutchings

On Sun, 2012-11-18 at 11:13 +0200, Michael S. Tsirkin wrote:
 On Sat, Nov 17, 2012 at 12:35:29AM +, Ben Hutchings wrote:
  On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote:
   On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote:
 @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct 
 net_device *dev,
  {
   struct virtnet_info *vi = netdev_priv(dev);
  
 - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq);
 - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq);
 + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq);
 + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq);
   ring-rx_pending = ring-rx_max_pending;
   ring-tx_pending = ring-tx_max_pending;
 -
  }

This assumes all vqs are the same size.  I think this should probably
check: for mq mode, use the first vq, otherewise use the 0th.
   
   For rx_pending/tx_pending I think what is required here is the
   actual number of outstanding buffers.
   Dave, Eric - right?
   
   So this should be the total over all rings and to be useful,
   rx_max_pending/tx_max_pending should be the total too.
  
  So far as I know, all current implementations use the number of
  descriptors per ring here. virtio_net should be consistent with this.
  
  Ben.
 
 Problem is, it could in theory be different between rings. I guess we
 could use the maximum.
 
 What's the right thing to do for rx_pending - I am guessing
 we want the current outstanding packets right?

The 'max_pending' fields are for the maximum ring sizes supported; the
'pending' fields are for the current ring sizes.  The current number of
descriptors pending is not included (would be a bit pointless).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [rfc net-next v6 2/3] virtio_net: multiqueue support

2012-11-16 Thread Ben Hutchings

On Tue, 2012-11-13 at 08:40 +0200, Michael S. Tsirkin wrote:
 On Mon, Nov 05, 2012 at 11:38:39AM +1030, Rusty Russell wrote:
   @@ -924,11 +1032,10 @@ static void virtnet_get_ringparam(struct 
   net_device *dev,
{
 struct virtnet_info *vi = netdev_priv(dev);

   - ring-rx_max_pending = virtqueue_get_vring_size(vi-rvq);
   - ring-tx_max_pending = virtqueue_get_vring_size(vi-svq);
   + ring-rx_max_pending = virtqueue_get_vring_size(vi-rq[0].vq);
   + ring-tx_max_pending = virtqueue_get_vring_size(vi-sq[0].vq);
 ring-rx_pending = ring-rx_max_pending;
 ring-tx_pending = ring-tx_max_pending;
   -
}
  
  This assumes all vqs are the same size.  I think this should probably
  check: for mq mode, use the first vq, otherewise use the 0th.
 
 For rx_pending/tx_pending I think what is required here is the
 actual number of outstanding buffers.
 Dave, Eric - right?
 
 So this should be the total over all rings and to be useful,
 rx_max_pending/tx_max_pending should be the total too.

So far as I know, all current implementations use the number of
descriptors per ring here. virtio_net should be consistent with this.

Ben.

  For bonus points, check this assertion at probe time.
 
 Looks like we can easily support different queues too.
 

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [rfc net-next v6 3/3] virtio-net: change the number of queues through ethtool

2012-11-08 Thread Ben Hutchings

On Tue, 2012-10-30 at 18:03 +0800, Jason Wang wrote:
 This patch implement the {set|get}_channels method of ethool to allow user to
 change the number of queues dymaically when the device is running. This would
 let the user to tune the device for specific applications.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/net/virtio_net.c |   43 +++
  1 files changed, 43 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 8cc43e5..66fc129 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -1616,10 +1616,53 @@ static struct virtio_driver virtio_net_driver = {
  #endif
  };
  
 +/* TODO: Eliminate OOO packets during switching */
 +static int virtnet_set_channels(struct net_device *dev,
 + struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 + u16 queue_pairs = channels-combined_count;
 +
 + /* Only two modes were support currently */
 + if (queue_pairs == 0)
 + return -EINVAL;
 + if (queue_pairs != vi-total_queue_pairs - 1  queue_pairs != 1)
 + return -EINVAL;

You should also check that all of the other counts are == 0.

 + vi-num_queue_pairs = queue_pairs  1 ? queue_pairs + 1 : 1;

This doesn't make sense.  You're setting vi-num_queue_pairs to be
combined_count + 1...

 + BUG_ON(virtnet_set_queues(vi));
 +
 + netif_set_real_num_tx_queues(dev, vi-num_queue_pairs);
 + netif_set_real_num_rx_queues(dev, vi-num_queue_pairs);
 +
 + return 0;
 +}
 +
 +static void virtnet_get_channels(struct net_device *dev,
 +  struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 +
 + if (vi-total_queue_pairs != 1) {
 + channels-combined_count = vi-num_queue_pairs;

but here you report combined_count as being vi-num_queue_pairs.  Do you
need to subtract 1 here?

Ben.

 + channels-max_combined = vi-total_queue_pairs - 1;
 + } else {
 + channels-combined_count = 1;
 + channels-max_combined = 1;
 + }
 +
 + channels-max_other = 0;
 + channels-rx_count = 0;
 + channels-tx_count = 0;
 + channels-other_count = 0;
 +}
 +
  static const struct ethtool_ops virtnet_ethtool_ops = {
   .get_drvinfo = virtnet_get_drvinfo,
   .get_link = ethtool_op_get_link,
   .get_ringparam = virtnet_get_ringparam,
 + .set_channels = virtnet_set_channels,
 + .get_channels = virtnet_get_channels,
  };
  
  static int __init init(void)

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-12 Thread Ben Hutchings

On Wed, 2012-09-12 at 07:40 -0700, Tom Herbert wrote:
 On Wed, Sep 12, 2012 at 12:57 AM, Michael S. Tsirkin m...@redhat.com wrote:
  On Wed, Sep 12, 2012 at 03:19:11PM +0930, Rusty Russell wrote:
[...]
  Perhaps Tom can explain how we avoid out-of-order receive for the
  accelerated RFS case?  It's not clear to me, but we need to be able to
  do that for virtio-net if it implements accelerated RFS.
 
 AFAIK ooo RX is still possible with accelerated RFS.  We have an
 algorithm that prevents this for RFS by deferring a migration to a new
 queue as long as it's possible that a flow might have outstanding
 packets on the old queue.  I suppose this could be implemented in the
 device for the HW queues, but I don't think it would be easy to cover
 all cases where packets were already in transit to the host or other
 cases where host and device queues are out of sync.
[...]

Yes, I couldn't see any way to eliminate the possibility of OOO.  The
software queue check in RFS should redirect the flow only when it is new
or has had an idle period, when I hope only a few packets will be
received before we send some kind of response (transport or application
layer ACK).  So I think that OOO is not that likely in practice, but I
don't have the evidence to back that up.

If the filter update latency is high enough that a response can overtake
the filter update, there may be more of a problem.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-09 Thread Ben Hutchings

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
 This patch let the virtio_net driver can negotiate the number of queues it
 wishes to use through control virtqueue and export an ethtool interface to let
 use tweak it.
 
 As current multiqueue virtio-net implementation has optimizations on per-cpu
 virtuqueues, so only two modes were support:
 
 - single queue pair mode
 - multiple queue paris mode, the number of queues matches the number of vcpus
 
 The single queue mode were used by default currently due to regression of
 multiqueue mode in some test (especially in stream test).
 
 Since virtio core does not support paritially deleting virtqueues, so during
 mode switching the whole virtqueue were deleted and the driver would re-create
 the virtqueues it would used.
 
 btw. The queue number negotiating were defered to .ndo_open(), this is because
 only after feature negotitaion could we send the command to control virtqueue
 (as it may also use event index).
[...]
 +static int virtnet_set_channels(struct net_device *dev,
 + struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 + u16 queues = channels-rx_count;
 + unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
 +
 + if (channels-rx_count != channels-tx_count)
 + return -EINVAL;
[...]
 +static void virtnet_get_channels(struct net_device *dev,
 +  struct ethtool_channels *channels)
 +{
 + struct virtnet_info *vi = netdev_priv(dev);
 +
 + channels-max_rx = vi-total_queue_pairs;
 + channels-max_tx = vi-total_queue_pairs;
 + channels-max_other = 0;
 + channels-max_combined = 0;
 + channels-rx_count = vi-num_queue_pairs;
 + channels-tx_count = vi-num_queue_pairs;
 + channels-other_count = 0;
 + channels-combined_count = 0;
 +}
[...]

It looks like the queue-pairs should be treated as 'combined channels',
not separate RX and TX channels.  Also you don't need to clear the other
members; you can assume that the ethtool core will zero-initialise
structures for 'get' operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/5] drivers/net/ethernet/dec/tulip: Use standard __set_bit_le() function

2012-06-13 Thread Ben Hutchings

On Wed, 2012-06-13 at 23:00 +0900, Takuya Yoshikawa wrote:
 On Wed, 13 Jun 2012 22:31:13 +0900
 Akinobu Mita akinobu.m...@gmail.com wrote:
 
   Should this hash_table be converted from u16 hash_table[32] to
   DECLARE_BITMAP(hash_table, 16 * 32) to ensure that it is aligned
   on long-word boundary?
  
   I think hash_table is already long-word aligned because it is placed
   right after a pointer.
  
  I recommend converting to proper bitmap.  Because such an implicit
  assumption is easily broken by someone touching this function.
 
 Do you mean something like:
   DECLARE_BITMAP(__hash_table, 16 * 32);
   u16 *hash_table = (u16 *)__hash_table;
 ?
[...]

Could this driver perhaps use:

union hash_table {
DECLARE_BITMAP(bits, 16 * 32);
__le16 words[32];
};

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] sfc: Use standard __{clear,set}_bit_le() functions

2012-06-12 Thread Ben Hutchings

There are now standard functions for dealing with little-endian bit
arrays, so use them instead of our own implementations.

Signed-off-by: Ben Hutchings bhutchi...@solarflare.com
---
Please use this version instead.

Ben.

 drivers/net/ethernet/sfc/efx.c|4 ++--
 drivers/net/ethernet/sfc/net_driver.h |   12 
 drivers/net/ethernet/sfc/nic.c|4 ++--
 3 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index b95f2e1..ca2a348 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -1976,14 +1976,14 @@ static void efx_set_rx_mode(struct net_device *net_dev)
netdev_for_each_mc_addr(ha, net_dev) {
crc = ether_crc_le(ETH_ALEN, ha-addr);
bit = crc  (EFX_MCAST_HASH_ENTRIES - 1);
-   set_bit_le(bit, mc_hash-byte);
+   __set_bit_le(bit, mc_hash);
}
 
/* Broadcast packets go through the multicast hash filter.
 * ether_crc_le() of the broadcast address is 0xbe2612ff
 * so we always add bit 0xff to the mask.
 */
-   set_bit_le(0xff, mc_hash-byte);
+   __set_bit_le(0xff, mc_hash);
}
 
if (efx-port_enabled)
diff --git a/drivers/net/ethernet/sfc/net_driver.h 
b/drivers/net/ethernet/sfc/net_driver.h
index 0e57535..6f1a7f7 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -1080,18 +1080,6 @@ static inline struct efx_rx_buffer *efx_rx_buffer(struct 
efx_rx_queue *rx_queue,
return rx_queue-buffer[index];
 }
 
-/* Set bit in a little-endian bitfield */
-static inline void set_bit_le(unsigned nr, unsigned char *addr)
-{
-   addr[nr / 8] |= (1  (nr % 8));
-}
-
-/* Clear bit in a little-endian bitfield */
-static inline void clear_bit_le(unsigned nr, unsigned char *addr)
-{
-   addr[nr / 8] = ~(1  (nr % 8));
-}
-
 
 /**
  * EFX_MAX_FRAME_LEN - calculate maximum frame length
diff --git a/drivers/net/ethernet/sfc/nic.c b/drivers/net/ethernet/sfc/nic.c
index 4a9a5be..bb0172d 100644
--- a/drivers/net/ethernet/sfc/nic.c
+++ b/drivers/net/ethernet/sfc/nic.c
@@ -473,9 +473,9 @@ void efx_nic_init_tx(struct efx_tx_queue *tx_queue)
 
efx_reado(efx, reg, FR_AA_TX_CHKSM_CFG);
if (tx_queue-queue  EFX_TXQ_TYPE_OFFLOAD)
-   clear_bit_le(tx_queue-queue, (void *)reg);
+   __clear_bit_le(tx_queue-queue, reg);
else
-   set_bit_le(tx_queue-queue, (void *)reg);
+   __set_bit_le(tx_queue-queue, reg);
efx_writeo(efx, reg, FR_AA_TX_CHKSM_CFG);
}
 
-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] KVM fixes for 3.2.17

2012-05-21 Thread Ben Hutchings

On Fri, 2012-05-18 at 17:58 -0300, Marcelo Tosatti wrote:
 See individual patches for details.
[...]

These came a little too late for 3.2.18, but I've queued them up now.

Ben.

-- 
Ben Hutchings
You can't have everything.  Where would you put it?


signature.asc
Description: This is a digitally signed message part

Re: [PATCH 00/11] KVM fixes for 3.3.5

2012-05-13 Thread Ben Hutchings

On Sun, 2012-05-13 at 12:23 +0300, Avi Kivity wrote:
 On 05/12/2012 03:35 AM, Ben Hutchings wrote:
  On Wed, 2012-05-09 at 16:10 +0300, Avi Kivity wrote:
   After a long hiatus, here are a bunch of very delayed fixes for 3.3.5.
 
  Are any of these also applicable to 3.2.y?
 
 Yes, and more.  We'll prepare a patchset for 3.2.

Thanks.

Ben.

  Also, would you consider these two fixes from 3.3 important enough for
  3.2.y?
 
  d6185f20a0efbf175e12831d0de330e4f21725aa KVM: nVMX: Add 
  KVM_REQ_IMMEDIATE_EXIT
  51cfe38ea50aa631f58ed8c340ed6f0143c325a8 KVM: nVMX: Fix warning-causing 
  idt-vectoring-info behavior
 
 No.
 

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.


signature.asc
Description: This is a digitally signed message part

Re: [PATCH 00/11] KVM fixes for 3.3.5

2012-05-11 Thread Ben Hutchings

On Wed, 2012-05-09 at 16:10 +0300, Avi Kivity wrote:
 After a long hiatus, here are a bunch of very delayed fixes for 3.3.5.

Are any of these also applicable to 3.2.y?

Also, would you consider these two fixes from 3.3 important enough for
3.2.y?

d6185f20a0efbf175e12831d0de330e4f21725aa KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
51cfe38ea50aa631f58ed8c340ed6f0143c325a8 KVM: nVMX: Fix warning-causing 
idt-vectoring-info behavior

Ben.

 Alex Williamson (1):
   KVM: lock slots_lock around device assignment
 
 Avi Kivity (3):
   KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
   KVM: VMX: Fix delayed load of shared MSRs
   KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context
 
 Christian Borntraeger (1):
   KVM: s390: Sanitize fpc registers for KVM_SET_FPU
 
 Gleb Natapov (1):
   KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction
 emulation
 
 Jens Freimann (1):
   KVM: s390: do store status after handling STOP_ON_STOP bit
 
 Marcelo Tosatti (1):
   KVM: VMX: vmx_set_cr0 expects kvm-srcu locked
 
 Nadav Har'El (1):
   KVM: nVMX: Fix erroneous exception bitmap check
 
 Takuya Yoshikawa (2):
   KVM: Fix write protection race during dirty logging
   KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
 
  arch/ia64/kvm/kvm-ia64.c  |5 +
  arch/s390/kvm/intercept.c |   20 
  arch/s390/kvm/kvm-s390.c  |2 +-
  arch/x86/kvm/pmu.c|2 +-
  arch/x86/kvm/vmx.c|   10 +-
  arch/x86/kvm/x86.c|   19 +--
  include/linux/kvm_host.h  |7 +++
  virt/kvm/iommu.c  |   23 +++
  virt/kvm/kvm_main.c   |   23 ++-
  9 files changed, 77 insertions(+), 34 deletions(-)
 

-- 
Ben Hutchings
Experience is directly proportional to the value of equipment destroyed.
 - Carolyn Scheppner


signature.asc
Description: This is a digitally signed message part

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread Ben Hutchings

[I'm just catching up with this after getting my own driver changes into
shape.]

On Fri, 2012-02-10 at 10:18 -0500, jamal wrote:
 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
[...]
  Now here is the potential issue,
  
  (G) The frame transmitted from ethx.y with the destination address of
  veth0 but the embedded switch is not a learning switch. If the FDB
  update is done in user space its possible (likely?) that the FDB
  entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
[...]

Well, in addition, there are SR-IOV network adapters that don't have any
bridge.  For these, the software bridge is necessary to handle
multicast, broadcast and forwarding between local ports, not only to do
learning.

Solarflare's implementation of accelerated guest networking (which
Shradha and I are gradually sending upstream) builds on libvirt's
existing support for software bridges and assigns VFs to guests as a
means to offload some of the forwarding.

If and when we implement a hardware bridge, we would probably still want
to keep the software bridge as a fallback.  If a guest is dependent on a
VF that's connected to a hardware bridge, it becomes impossible or at
least very disruptive to migrate it to another host that doesn't have a
compatible VF available.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread Ben Hutchings

On Thu, 2012-02-02 at 00:46 -0800, John Fastabend wrote:
[...]
 OK finally got to read through this. And its not clear to me why we need
 these per VF/PF filter netdevice ops and netlink extensions if we can
 get the stacking correct. (Adding filters to the macvlan seems reasonable
 to me)
 
 In the cases I saw listed above I see a few enumerations:
 
 PF -- MACVLAN  --- Guest --- [...]
 
 VF -- MACVLAN  --- Guest --- [...]   
 
 VF|Guest --- [...]   direct assigned VF
 
 PF|Guest --- [...]   direct assigned PF
 
 
 I used '[...]' to represent whatever additional stacking is done in the
 guest unknown to the host. In the direct assign VF case (Greg Rose
 correct me if I am wrong) the normal uc and mc addr lists should suffice
 along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
 addresses and/or VLANS as normal and then the VF-PF back channel
 should handle this if needed. This should work for Linux guests and other
 OS's should do something similar.
 
 In the direct assign PF case the hardware is owned by the guest so
 no problems here.
 
 This leaves the two MACVLAN cases which can be handled the same. If
 the MACVLAN driver and netlink interface is extended to add filters
 to the MACVLAN then the addresses can be pushed to the lower device
 using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.
[...]

There is another case: hybrid acceleration.  Without a bridge in the
NIC, you need a software bridge for multicast/broadcast replication,
traffic between guests, and traffic between guest and host.  A guest
driver can then send and receive to remote addresses through a VF while
retaining fallback to the software bridge.

In order to do this, the guest driver needs to know which addresses are
local.  The net driver for the PF can tell it about the addresses
assigned to each function, but if there are other devices included in
the bridge then it will not know about them.

In Solarflare's current out-of-tree implementation this is dealt with in
an extension to libvirt that writes the additional 'local' MAC addresses
to a driver-specific sysfs file, but that is obviously not likely to be
acceptable in-tree!  So I was interested in this proposal to extend MAC
filtering, but wanted to get the semantics clear.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vhost-net: add module alias (v2)

2012-01-11 Thread Ben Hutchings

On Wed, 2012-01-11 at 09:16 -0800, Stephen Hemminger wrote:
 By adding the correct module alias, programs won't have to explicitly
 call modprobe. Vhost-net will always be available if built into the kernel.
 It does require assigning a permanent minor number for depmod to work.
 Choose one next to TUN since this driver is related to it.
 
 Also, use C99 style initialization.
 
 Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
 
 ---
 v2 - document minor number and make sure to not overlap
[...]
 --- a/include/linux/miscdevice.h  2012-01-10 10:56:59.779189436 -0800
 +++ b/include/linux/miscdevice.h  2012-01-11 09:13:20.803694316 -0800
 @@ -42,6 +42,7 @@
  #define AUTOFS_MINOR 235
  #define MAPPER_CTRL_MINOR236
  #define LOOP_CTRL_MINOR  237
 +#define VHOST_NET_MINOR  238
  #define MISC_DYNAMIC_MINOR   255
  
  struct device;
 --- a/Documentation/devices.txt   2012-01-10 10:56:53.399116518 -0800
 +++ b/Documentation/devices.txt   2012-01-11 09:12:49.251197653 -0800
 @@ -447,6 +447,8 @@ Your cooperation is appreciated.
   234 = /dev/btrfs-controlBtrfs control device
   235 = /dev/autofs   Autofs control device
   236 = /dev/mapper/control   Device-Mapper control device
 + 237 = /dev/vhost-netHost kernel accelerator for virtio net
[...]

238 != 237.  It looks like someone forgot to add loopctrl here.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 0/5] Series short description

2011-12-14 Thread Ben Hutchings

On Fri, 2011-12-09 at 16:01 +1030, Rusty Russell wrote:
 On Wed, 7 Dec 2011 17:02:04 +, Ben Hutchings bhutchi...@solarflare.com 
 wrote:
  Solarflare controllers (sfc driver) have 8192 perfect filters for
  TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
  are organised as a hash table, but matched based on 5-tuples.)  I
  implemented the 'accelerated RFS' interface in this driver.
  
  I believe the Intel 82599 controllers (ixgbe driver) have both
  hash-based and perfect filter modes and the driver can be configured to
  use one or the other.  The driver has its own independent mechanism for
  steering RX and TX flows which predates RFS; I don't know whether it
  uses hash-based or perfect filters.
 
 Thanks for this summary (and Jason, too).  I've fallen a long way behind
 NIC state-of-the-art.
  
  Most multi-queue controllers could support a kind of hash-based
  filtering for TCP/IP by adjusting the RSS indirection table.  However,
  this table is usually quite small (64-256 entries).  This means that
  hash collisions will be quite common and this can result in reordering.
  The same applies to the small table Jason has proposed for virtio-net.
 
 But this happens on real hardware today.  Better that real hardware is
 nice, but is it overkill?

What do you mean, it happens on real hardware today?  So far as I know,
the only cases where we have dynamic adjustment of flow steering are in
ixgbe (big table of hash filters, I think) and sfc (perfect filters).
I don't think that anyone's currently doing flow steering with the RSS
indirection table.  (At least, not on Linux.  I think that Microsoft was
intending to do so on Windows, but I don't know whether they ever did.)

 And can't you reorder even with perfect matching, since prior packets
 will be on the old queue and more recent ones on the new queue?  Does it
 discard or requeue old ones?  Or am I missing a trick?

Yes, that is possible.  RFS is careful to avoid such reordering by only
changing the steering of a flow when none of its packets can be in a
software receive queue.  It is not generally possible to do the same for
hardware receive queues.  However, when the first condition is met it is
likely that there won't be a whole lot of packets for that flow in the
hardware receive queue either.  (But if there are, then I think as a
side-effect of commit 09994d1 RFS will repeatedly ask the driver to
steer the flow.  Which isn't ideal.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 0/5] Series short description

2011-12-07 Thread Ben Hutchings

On Wed, 2011-12-07 at 19:31 +0800, Jason Wang wrote:
 On 12/07/2011 03:30 PM, Rusty Russell wrote:
  On Mon, 05 Dec 2011 16:58:37 +0800, Jason Wangjasow...@redhat.com  wrote:
  multiple queue virtio-net: flow steering through host/guest cooperation
 
  Hello all:
 
  This is a rough series adds the guest/host cooperation of flow
  steering support based on Krish Kumar's multiple queue virtio-net
  driver patch 3/3 (http://lwn.net/Articles/467283/).
  Is there a real (physical) device which does this kind of thing?  How do
  they do it?  Can we copy them?
 
  Cheers,
  Rusty.
 As far as I see, ixgbe and sfc have similar but much more sophisticated 
 mechanism.
 
 The idea was originally suggested by Ben and it was just borrowed form 
 those real physical nic cards who can dispatch packets based on their 
 hash. All of theses cards can filter the flow based on the hash of 
 L2/L3/L4 header and the stack would tell the card which queue should 
 this flow goes.

Solarflare controllers (sfc driver) have 8192 perfect filters for
TCP/IPv4 and UDP/IPv4 which can be used for flow steering.  (The filters
are organised as a hash table, but matched based on 5-tuples.)  I
implemented the 'accelerated RFS' interface in this driver.

I believe the Intel 82599 controllers (ixgbe driver) have both
hash-based and perfect filter modes and the driver can be configured to
use one or the other.  The driver has its own independent mechanism for
steering RX and TX flows which predates RFS; I don't know whether it
uses hash-based or perfect filters.

Most multi-queue controllers could support a kind of hash-based
filtering for TCP/IP by adjusting the RSS indirection table.  However,
this table is usually quite small (64-256 entries).  This means that
hash collisions will be quite common and this can result in reordering.
The same applies to the small table Jason has proposed for virtio-net.

 So in host, a simple hash to queue table were introduced in tap/macvtap 
 and in guest, the guest driver would tell the desired queue of a flow 
 through changing this table.

I don't think accelerated RFS can work well without the use of perfect
filtering or hash-based filtering with a very low rate of collisions.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support

2011-12-06 Thread Ben Hutchings

On Tue, 2011-12-06 at 15:21 +0800, Jason Wang wrote:
 On 12/06/2011 04:09 AM, Ben Hutchings wrote:
  On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
  This patch adds a simple flow director to tun/tap device. It is just a
  page that contains the hash to queue mapping which could be changed by
  user-space. The backend (tap/macvtap) would query this table to get
  the desired queue of a packets when it send packets to userspace.
  This is just flow hashing (RSS), not flow steering.
 
  The page address were set through a new kind of ioctl - TUNSETFD and
  were pinned until device exit or another new page were specified.
  [...]
 
  You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.
 
  Ben.
 
 
 I'm not fully understanding this. The page belongs to guest, and the 
 idea is to let guest driver can easily change any entry. Looks like if 
 ethtool_set_rxfh_indir() is used, this kind of change is not easy as it 
 needs one copy and can only accept the whole table as its parameters.

Sorry, yes, I was misreading this.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-06 Thread Ben Hutchings

On Tue, 2011-12-06 at 15:25 +0800, Jason Wang wrote:
 On 12/06/2011 04:42 AM, Ben Hutchings wrote:
[...]
  This is not a proper implementation of ndo_rx_flow_steer.  If you steer
  a flow by changing the RSS table this can easily cause packet reordering
  in other flows.  The filtering should be more precise, ideally matching
  exactly a single flow by e.g. VID and IP 5-tuple.
 
  I think you need to add a second hash table which records exactly which
  flow is supposed to be steered.  Also, you must call
  rps_may_expire_flow() to check whether an entry in this table may be
  replaced; otherwise you can cause packet reordering in the flow that was
  previously being steered.
 
  Finally, this function must return the table index it assigned, so that
  rps_may_expire_flow() works.
 
 Thanks for the explanation, how about document this briefly in scaling.txt?
[...]

I believe scaling.txt is intended for users/administrators, not
developers.

The documentation for implementers of accelerated RFS is in the comment
for struct net_device_ops and the commit message adding it.  But I
really should improve that comment.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 2/5] tuntap: simple flow director support

2011-12-05 Thread Ben Hutchings

On Mon, 2011-12-05 at 16:58 +0800, Jason Wang wrote:
 This patch adds a simple flow director to tun/tap device. It is just a
 page that contains the hash to queue mapping which could be changed by
 user-space. The backend (tap/macvtap) would query this table to get
 the desired queue of a packets when it send packets to userspace.

This is just flow hashing (RSS), not flow steering.

 The page address were set through a new kind of ioctl - TUNSETFD and
 were pinned until device exit or another new page were specified.
[...]

You should implement ethtool ETHTOOL_{G,S}RXFHINDIR instead.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 3/5] macvtap: flow director support

2011-12-05 Thread Ben Hutchings

Similarly, macvtap chould implement the ethtool {get,set}_rxfh_indir
operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-05 Thread Ben Hutchings

On Mon, 2011-12-05 at 16:59 +0800, Jason Wang wrote:
 In order to let the packets of a flow to be passed to the desired
 guest cpu, we can co-operate with devices through programming the flow
 director which was just a hash to queue table.
 
 This kinds of co-operation is done through the accelerate RFS support,
 a device specific flow sterring method virtnet_fd() is used to modify
 the flow director based on rfs mapping. The desired queue were
 calculated through reverse mapping of the irq affinity table. In order
 to parallelize the ingress path, irq affinity of rx queue were also
 provides by the driver.
 
 In addition to accelerate RFS, we can also use the guest scheduler to
 balance the load of TX and reduce the lock contention on egress path,
 so the processor_id() were used to tx queue selection.
[...]
 +#ifdef CONFIG_RFS_ACCEL
 +
 +int virtnet_fd(struct net_device *net_dev, const struct sk_buff *skb,
 +u16 rxq_index, u32 flow_id)
 +{
 + struct virtnet_info *vi = netdev_priv(net_dev);
 + u16 *table = NULL;
 +
 + if (skb-protocol != htons(ETH_P_IP) || !skb-rxhash)
 + return -EPROTONOSUPPORT;

Why only IPv4?

 + table = kmap_atomic(vi-fd_page);
 + table[skb-rxhash  TAP_HASH_MASK] = rxq_index;
 + kunmap_atomic(table);
 +
 + return 0;
 +}
 +#endif

This is not a proper implementation of ndo_rx_flow_steer.  If you steer
a flow by changing the RSS table this can easily cause packet reordering
in other flows.  The filtering should be more precise, ideally matching
exactly a single flow by e.g. VID and IP 5-tuple.

I think you need to add a second hash table which records exactly which
flow is supposed to be steered.  Also, you must call
rps_may_expire_flow() to check whether an entry in this table may be
replaced; otherwise you can cause packet reordering in the flow that was
previously being steered.

Finally, this function must return the table index it assigned, so that
rps_may_expire_flow() works.

 +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
 +{
 + int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
 +smp_processor_id();
 +
 + /* As we make use of the accelerate rfs which let the scheduler to
 +  * balance the load, it make sense to choose the tx queue also based on
 +  * theprocessor id?
 +  */
 + while (unlikely(txq = dev-real_num_tx_queues))
 + txq -= dev-real_num_tx_queues;
 + return txq;
 +}
[...]

Don't do this, let XPS handle it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Ben Hutchings

On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote:
 On 11/29/2011 9:19 AM, Ben Hutchings wrote:
  On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
 
  Maybe I missed something!
[...]
  If not, please explain what the new model *is*.
 
 The new model is to incorporate a VEB into the NIC.  The current model 
 doesn't address any of the requirements of a VEB in the NIC and this 
 proposed set of patches allow us to set MAC filters for the *ports* on 
 the internal NIC VEB.  Consider the PF and each of the VFs as just a 
 port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
 VLAN) for each of the ports on that VEB.  There is no currently 
 supported method for doing this.  So yes, this is a new model although 
 it's a fairly simple one.

Explain precisely how the VEB changes the existing model.  Explain how
the existing MAC filter and VF filter APIs interact with port filters on
the VEB.  Refer to any relevant standards.

(I have really had enough of net driver API proposals where all the
difficult questions are punted to the implementation.  Either
implementations diverge and users and userspace developers are left
horribly confused, or else the second and subsequent implementations
have to follow whatever quirks the first implementation had.  It's an
essential part of the review process that such questions are asked and
answered.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Ben Hutchings

On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:
 * Ben Hutchings (bhutchi...@solarflare.com) wrote:
  On Wed, 2011-11-30 at 09:34 -0800, Greg Rose wrote:
   On 11/29/2011 9:19 AM, Ben Hutchings wrote:
On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
   
Maybe I missed something!
  [...]
If not, please explain what the new model *is*.
   
   The new model is to incorporate a VEB into the NIC.  The current model 
   doesn't address any of the requirements of a VEB in the NIC and this 
   proposed set of patches allow us to set MAC filters for the *ports* on 
   the internal NIC VEB.  Consider the PF and each of the VFs as just a 
   port on the VEB.  We need the ability to set L2 filters (MAC, MC and 
   VLAN) for each of the ports on that VEB.  There is no currently 
   supported method for doing this.  So yes, this is a new model although 
   it's a fairly simple one.
  
  Explain precisely how the VEB changes the existing model.  Explain how
  the existing MAC filter and VF filter APIs interact with port filters on
  the VEB.  Refer to any relevant standards.
 
 I agree that it's confusing.  Couldn't you simplify your ascii art
 (hopefully removing hw assumptions about receive processing, and
 completely ignoring vlans for the moment) to something like:

  |RX
  v
 ++-+
 | +--++|
 | | RX MAC filter ||
 | |and port select||
 | +---+|
 |/|\   |
 |   / | \   match 2|
 |  /  v  \ |
 | /match  \|
 |/  1 |\   |
 |   / | \  |
 |match /  |  \ |
 |  0  /   |   \|
 |v|v   |
 ||||   |
 ++++---+
  |||
 PF   VF 1 VF 2
 
 And there's an unclear number of ways to update RX MAC filter and port
 select table.
 
 1) PF ndo_set_mac_addr
 I expect that to be implicit to match 0.
 
 2) PF ndo_set_rx_mode
 Less clear, but I'd still expect these to implicitly match 0
 
 3) PF ndo_set_vf_mac
 I expect these to be an explicit match to VF N (given the interface
 specifices which VF's MAC is being programmed).

I'm not sure whether this is supposed to implicitly add to the MAC
filter or whether that has to be changed too.  That's the main
difference between my models (a) and (b).

There's also PF ndo_set_vf_vlan.

 4) VF ndo_set_mac_addr
 This one may or may not be allowed (setting MAC+port if the VF is owned
 by a guest is likely not allowed), but would expect an implicit VF N.
 
 5) VF ndo_set_rx_mode
 Same as 4) above.

So this is where we are today.

 6) PF or VF? ndo_set_rx_filter_addr
 The new proposal, which has an explicit VF, although when it's VF_SELF
 I'm not clear if this is just the same as 5) above?
 
 Have I missed anything?

Any physical port can be bridged to a mixture of guests with and without
their own VFs.  Packets sent from a guest with a VF to the address of a
guest without a VF need to be forwarded to the PF rather than the
physical port, but none of the drivers currently get to know about those
addresses.

Packets sent from a guest with a VF to the address of another guest with
a VF need to be forwarded similarly, but the driver should be able to
infer that from (3).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-29 Thread Ben Hutchings

On Tue, 2011-11-29 at 16:35 +, Ben Hutchings wrote:
 On Mon, 2011-11-21 at 09:41 -0800, Greg Rose wrote:
  On 11/18/2011 9:40 AM, Ben Hutchings wrote:
 [...]
   What concerns me is that this seems to be a workaround rather than a fix
   for over-use of promiscuous mode, and it changes the semantics of
   filtering modes in ways that haven't been well-specified.
  
  I feel the opposite is true.  It allows a known set of receive filters 
  so that you don't have to use promiscuous mode, which cuts down on 
  overhead from processing packets the upper layer stack isn't really 
  interested in.
 
  
   What if there's a software bridge between two net devices corresponding
   to separate physical ports, so that they really need to be promiscuous?
   What if the administrator runs tcpdump and really wants the (PF) net
   device to be promiscuous?
  
  I don't believe there is anything in this patch set that removes 
  promiscuous mode operation as it is commonly used.  Perhaps I've missed 
  something.
 [...]
 
 Maybe I missed something!
 
 Let's be clear on what our models are for filtering.  At the moment we
 have MAC filters set through ndo_set_rx_mode and VF filters set through
 ndo_set_vf_{mac,vlan}.
 
 Ignoring anti-spoofing for the moment, should the currently defined
 filters look like this (a):
 
 TX ^   | RX
|   v
 +--+---+-+
 |  |  +++|
 |  |  |RX MAC filter||
 |  |  +++|
 |  |   |match|
 |  ^   v |
 |  |  +++|
 |  |  |RX VF filters||
 |  |  +---+-+|
 | /|\ no /|\ |
 || | \ match/ | |match 2 |
 || ^  \/  v ||
 || |   \  /match||
 ||  \   \/  1/  ||
 ||   \  /\  /   ||
 |^\/  \/v|
 ||/\  /\||
 ||   /  ||  \   ||
 ||  /   ||   \  ||
 || /||\ ||
 ||| || |||
 +++-++-+++
  || || ||
  PFVF 1   VF 2
 
 or like this (b):
 
 TX ^   | RX
|   v
 +--+---+-+
 |  |  +++|
 |  |  |RX VF filters||
 |  |  +++---+|
 |  | no|match  /||
 |  ^   v  | ||
 |  | +-++ | ||
 |  | |RX MAC| | ||
 |  | |filter| | ||
 |  | +--+ | ||
 |  |   |match | ||
 | /|\  |  | ||
 || | \ | match| |match 2 |
 || ^  \/1 v ||
 || |  /\  | ||
 ||  \/  \/  ||
 ||  /\   \  /   ||
 |^ /  \   \/v|
 |||\  /\||
 ||| ||  \   ||
 ||| ||   \  ||
 ||| ||\ ||
 ||| || |||
 +++-++-+++
  || || ||
  PFVF 1   VF 2
 
 I think the current model is (a); do you agree?
 
 So is the proposed new model something like this (c):

Corrected diagram:

TX ^   | RX
   |   v
+--+---+-+
|  |  +++|
|  |  |RX MAC filter||
|  ^  +++|
|  |   |match|
|  no match|   v |
|  ++ +++|
|  |loopback filters| |RX VF filters||
|  +-+-++ +---+-+|
|   /|\   /|\ match  /|\ |
|  v | `-++-+-.2   / | ||
|   \ \  | |m \ \   / | ||
| match 0\ `-+-+.a \ \ /  v ||
| \  | | \t \ X   / ||
|  \ |  \ \c X \ /  ||
|   \|   \ \h \ X   ||
|\\/\1 X \  v|
|||   /\ |/ \ \ ||
||v  /  ||   \ \||
||| /   ^|\ ||
|||/|v |||
||| || |||
+++-++-+++
 || || ||
 PFVF 1   VF 2

 (I've labelled the new filters as loopback filters here, and I'm still
 leaving out anti-spoofing.)
 
 If not, please explain what

Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-18 Thread Ben Hutchings

On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
  On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
   Changes for multiqueue virtio_net driver.
  [...]
   @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
{
 struct virtnet_info *vi = netdev_priv(dev);
 int cpu;
   - unsigned int start;

 for_each_possible_cpu(cpu) {
   - struct virtnet_stats __percpu *stats
   - = per_cpu_ptr(vi-stats, cpu);
   - u64 tpackets, tbytes, rpackets, rbytes;
   -
   - do {
   - start = u64_stats_fetch_begin(stats-syncp);
   - tpackets = stats-tx_packets;
   - tbytes   = stats-tx_bytes;
   - rpackets = stats-rx_packets;
   - rbytes   = stats-rx_bytes;
   - } while (u64_stats_fetch_retry(stats-syncp, start));
   -
   - tot-rx_packets += rpackets;
   - tot-tx_packets += tpackets;
   - tot-rx_bytes   += rbytes;
   - tot-tx_bytes   += tbytes;
   + int qpair;
   +
   + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
   + struct virtnet_send_stats __percpu *tx_stat;
   + struct virtnet_recv_stats __percpu *rx_stat;
  
  While you're at it, you can drop the per-CPU stats and make them only
  per-queue.  There is unlikely to be any benefit in maintaining them
  per-CPU while receive and transmit processing is serialised per-queue.
 
 It allows you to update stats without a lock.

But you'll already be holding a lock related to the queue.

 Whats the benefit of having them per queue?

It should save some memory (and a little time when summing stats, though
that's unlikely to matter much).

The important thing is that splitting up stats per-CPU *and* per-queue
is a waste.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-18 Thread Ben Hutchings

On Fri, 2011-11-18 at 18:18 +0200, Sasha Levin wrote:
 On Fri, 2011-11-18 at 15:40 +, Ben Hutchings wrote:
  On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
   On Fri, 2011-11-18 at 01:08 +, Ben Hutchings wrote:
On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
 Changes for multiqueue virtio_net driver.
[...]
 @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
  {
   struct virtnet_info *vi = netdev_priv(dev);
   int cpu;
 - unsigned int start;
  
   for_each_possible_cpu(cpu) {
 - struct virtnet_stats __percpu *stats
 - = per_cpu_ptr(vi-stats, cpu);
 - u64 tpackets, tbytes, rpackets, rbytes;
 -
 - do {
 - start = u64_stats_fetch_begin(stats-syncp);
 - tpackets = stats-tx_packets;
 - tbytes   = stats-tx_bytes;
 - rpackets = stats-rx_packets;
 - rbytes   = stats-rx_bytes;
 - } while (u64_stats_fetch_retry(stats-syncp, start));
 -
 - tot-rx_packets += rpackets;
 - tot-tx_packets += tpackets;
 - tot-rx_bytes   += rbytes;
 - tot-tx_bytes   += tbytes;
 + int qpair;
 +
 + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
 + struct virtnet_send_stats __percpu *tx_stat;
 + struct virtnet_recv_stats __percpu *rx_stat;

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.
   
   It allows you to update stats without a lock.
  
  But you'll already be holding a lock related to the queue.
 
 Right, but now you're holding a queue lock just when playing with the
 queue, we don't hold it when we process the data - which is when we
 usually need to update stats.
[...]

The *stack* is holding the appropriate lock when calling the NAPI poll
function or ndo_start_xmit function.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-18 Thread Ben Hutchings

On Fri, 2011-11-18 at 08:58 -0800, Greg Rose wrote:
 On 11/17/2011 4:44 PM, Ben Hutchings wrote:
  On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
  On 11/17/2011 4:15 PM, Ben Hutchings wrote:
  Sorry to come to this rather late.
 
  On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
  [...]
  v2 -   v3
  - Moved set and get filter ops from rtnl_link_ops to netdev_ops
  - Support for SRIOV VFs.
[Note: The get filters msg (in the way current get rtnetlink 
  handles
it) might get too big for SRIOV vfs. This patch follows 
  existing sriov
vf get code and tries to accomodate filters for all VF's in a 
  PF.
And for the SRIOV case I have only tested the fact that the VF
arguments are getting delivered to rtnetlink correctly. The 
  code
follows existing sriov vf handling code so rest of it should 
  work fine]
  [...]
 
  This is already broken for large numbers of VFs, and increasing the
  amount of information per VF is going to make the situation worse.  I am
  no netlink expert but I think that the current approach of bundling all
  information about an interface in a single message may not be
  sustainable.
 
  Also, I'm unclear on why this interface is to be used to set filtering
  for the (PF) net device as well as for related VFs.  Doesn't that
  duplicate the functionality of ndo_set_rx_mode and
  ndo_vlan_rx_{add,kill}_vid?
 
  Functionally yes but contextually no.  This allows the PF driver to know
  that it is setting these filters in the context of the existence of VFs,
  allowing it to take appropriate action.  The other two functions may be
  called without the presence of SR-IOV enablement and the existence of VFs.
 
  Anyway, that's why I asked Roopa to add that capability.
 
  I don't follow.  The PF driver already knows whether it has enabled VFs.
 
  How do filters set this way interact with filters set through the
  existing operations?  Should they override promiscuous mode?  None of
  this has been specified.
 
 Promiscuous mode is exactly the issue this feature is intended for.  I'm 
 not familiar with the solarflare device but Intel HW promiscuous mode is 
 only promiscuous on the physical port, not on the VEB.  So a packet sent 
 from a VF will not be captured by the PF across the VEB unless the MAC 
 and VLAN filters have been programmed into the HW.

Yes, I get it.  The hardware bridge needs to know more about the address
configuration on the host than the driver is getting at the moment.

 So you may not need 
 the feature for your devices but it is required for Intel devices.

Well we don't have the hardware bridge but that means each VF driver
needs to know whether to fall back to the software bridge.  The net
driver needs much the same additional information.

 And 
 it's a fairly simple request, just allow -1 to indicate that the target 
 of the filter requests is for the PF itself.  Using the already existing 
 set_rx_mode function wont' work because the PF driver will look at it 
 and figure it's in promiscuous mode anyway, so it won't set the filters 
 into the HW.  At least that is how it is in the case of our HW and 
 driver.  Again, the behavior of your HW and driver is unknown to me and 
 thus you may not require this feature.

What concerns me is that this seems to be a workaround rather than a fix
for over-use of promiscuous mode, and it changes the semantics of
filtering modes in ways that haven't been well-specified.

What if there's a software bridge between two net devices corresponding
to separate physical ports, so that they really need to be promiscuous?
What if the administrator runs tcpdump and really wants the (PF) net
device to be promiscuous?

These cases shouldn't break because of VF acceleration.  Or at least we
should make a conscious and documented decision that 'promiscuous'
doesn't mean that if you enable it on your network adapter.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode

2011-11-17 Thread Ben Hutchings

On Thu, 2011-10-20 at 13:43 -0700, Rose, Gregory V wrote:
  -Original Message-
  From: Roopa Prabhu [mailto:ropra...@cisco.com]
[...]
  Moving the ops to netdev should be trivial. You probably want the ops to
  work on the VF via the PF, like the existing ndo_set_vf_mac etc.
 
 That is correct, so we would need to add some way to pass the VF number to 
 the op.
 In addition, there are use cases for multiple MAC address filters for the 
 Physical
 Function (PF) so we would like to be able to identify to the netdev op that 
 it is
 supposed to perform the action on the PF filters instead of a VF.
 
 An example of this would be when an administrator has created some number of 
 VFs
 for a given PF but is also running the PF in bridged (i.e. promiscuous) mode 
 so that it
 can support purely SW emulated network connections in some VMs that have low 
 network
 latency and bandwidth requirements while reserving the VFs for VMs that 
 require the low latency, high throughput that directly assigned VFs can 
 provide.  In this case an
 emulated SW interface in a VM is unable to properly communicate with VFs on 
 the same
 PF because the emulated SW interface's MAC address isn't programmed into the 
 HW filters
 on the PF.  If we could use this op to program the MAC address and VLAN 
 filters of
 the emulated SW interfaces into the PF HW a VF could then properly 
 communicate across
 the NIC's internal VEB to the emulated SW interfaces.
[...]

This would also be good for Solarflare's VF plugin architecture.  The VF
driver works as a plugin for virtio or xen_netfront and can refuse
packets that need to be bridged to another (physically) local address.
The PF driver has to tell VFs what the local addresses are and currently
relies on some custom scripting to know about those extra addresses.

(No, none of that is upstream - I'm preparing for that now.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address filtering support for passthru mode

2011-11-17 Thread Ben Hutchings

On Tue, 2011-10-25 at 08:59 -0700, Rose, Gregory V wrote:
  -Original Message-
  From: Michael S. Tsirkin [mailto:m...@redhat.com]
  Sent: Tuesday, October 25, 2011 8:46 AM
  To: Roopa Prabhu
  Cc: net...@vger.kernel.org; s...@us.ibm.com; dragos.tatu...@gmail.com;
  a...@arndb.de; kvm@vger.kernel.org; da...@davemloft.net;
  mc...@broadcom.com; dwa...@cisco.com; shemmin...@vyatta.com;
  eric.duma...@gmail.com; ka...@trash.net; be...@cisco.com; Rose, Gregory V
  Subject: Re: [net-next-2.6 PATCH 0/8 RFC v2] macvlan: MAC Address
  filtering support for passthru mode

  On Mon, Oct 24, 2011 at 11:15:05AM -0700, Roopa Prabhu wrote:
... And also I dont know of any hw
that provides an interface to set hw filters on a per queue basis.

VMDq hardware would support this, no?

   Am not really sure. This patch uses netdev to pass filters to hw. And I
   don't see any netdev infrastructure that would support per queue
  filters.

  Sure. I was only saying that as far as I understand,
  VMDq hardware does support this functionality.
  Right, Greg?

 In the case of Intel HW yes.  I'll refrain from speaking for other HW
 vendors although I'm guessing it would be true in their cases also.
 YMMV, caveat emptor, etc. etc.

Current Solarflare hardware supports:
- RX MAC address filters for queue selection (steering), which can be
  combined with RSS (flow hashing)
- TX MAC address filters to prevent spoofing
Multiple filters can be associated with a single queue.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-17 Thread Ben Hutchings

Sorry to come to this rather late.

On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
[...]
 v2 - v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink handles
 it) might get too big for SRIOV vfs. This patch follows existing 
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF. 
 And for the SRIOV case I have only tested the fact that the VF 
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work 
 fine]
[...]

This is already broken for large numbers of VFs, and increasing the
amount of information per VF is going to make the situation worse.  I am
no netlink expert but I think that the current approach of bundling all
information about an interface in a single message may not be
sustainable.

Also, I'm unclear on why this interface is to be used to set filtering
for the (PF) net device as well as for related VFs.  Doesn't that
duplicate the functionality of ndo_set_rx_mode and
ndo_vlan_rx_{add,kill}_vid?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 1/6 v4] rtnetlink: Netlink interface for setting MAC and VLAN filters

2011-11-17 Thread Ben Hutchings

On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
 From: Roopa Prabhu ropra...@cisco.com
 
 This patch introduces the following netlink interface to set
 MAC and VLAN filters on an network interface. It can be used to
 set RX filter on any network interface (if supported by the driver) and
 also on a SRIOV VF via its PF

 Interface to set RX filter on a SRIOV VF
 [IFLA_VF_RX_FILTERS] = {
   [IFLA_VF_RX_FILTER] = {
   [IFLA_RX_FILTER_VF]
   [IFLA_RX_FILTER_ADDR] = {
   [IFLA_RX_FILTER_ADDR_FLAGS]
   [IFLA_RX_FILTER_ADDR_UC_LIST] = {
   [IFLA_ADDR_LIST_ENTRY]
   }
   [IFLA_RX_FILTER_ADDR_MC_LIST] = {
   [IFLA_ADDR_LIST_ENTRY]
   }
   }
   [IFLA_RX_FILTER_VLAN] = {
   [IFLA_RX_FILTER_VLAN_BITMAP]
   }
   }
   ...
 }
[...]

Please put the details of both syntax *and semantics* in if_link.h.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-17 Thread Ben Hutchings

On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
 On 11/17/2011 4:15 PM, Ben Hutchings wrote:
  Sorry to come to this rather late.
 
  On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
  [...]
  v2 -  v3
  - Moved set and get filter ops from rtnl_link_ops to netdev_ops
  - Support for SRIOV VFs.
   [Note: The get filters msg (in the way current get rtnetlink 
  handles
   it) might get too big for SRIOV vfs. This patch follows existing 
  sriov
   vf get code and tries to accomodate filters for all VF's in a PF.
   And for the SRIOV case I have only tested the fact that the VF
   arguments are getting delivered to rtnetlink correctly. The code
   follows existing sriov vf handling code so rest of it should work 
  fine]
  [...]
 
  This is already broken for large numbers of VFs, and increasing the
  amount of information per VF is going to make the situation worse.  I am
  no netlink expert but I think that the current approach of bundling all
  information about an interface in a single message may not be
  sustainable.
 
  Also, I'm unclear on why this interface is to be used to set filtering
  for the (PF) net device as well as for related VFs.  Doesn't that
  duplicate the functionality of ndo_set_rx_mode and
  ndo_vlan_rx_{add,kill}_vid?
 
 Functionally yes but contextually no.  This allows the PF driver to know 
 that it is setting these filters in the context of the existence of VFs, 
 allowing it to take appropriate action.  The other two functions may be 
 called without the presence of SR-IOV enablement and the existence of VFs.
 
 Anyway, that's why I asked Roopa to add that capability.

I don't follow.  The PF driver already knows whether it has enabled VFs.

How do filters set this way interact with filters set through the
existing operations?  Should they override promiscuous mode?  None of
this has been specified.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-17 Thread Ben Hutchings

On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
 Changes for multiqueue virtio_net driver.
[...]
 @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
  {
   struct virtnet_info *vi = netdev_priv(dev);
   int cpu;
 - unsigned int start;
  
   for_each_possible_cpu(cpu) {
 - struct virtnet_stats __percpu *stats
 - = per_cpu_ptr(vi-stats, cpu);
 - u64 tpackets, tbytes, rpackets, rbytes;
 -
 - do {
 - start = u64_stats_fetch_begin(stats-syncp);
 - tpackets = stats-tx_packets;
 - tbytes   = stats-tx_bytes;
 - rpackets = stats-rx_packets;
 - rbytes   = stats-rx_bytes;
 - } while (u64_stats_fetch_retry(stats-syncp, start));
 -
 - tot-rx_packets += rpackets;
 - tot-tx_packets += tpackets;
 - tot-rx_bytes   += rbytes;
 - tot-tx_bytes   += tbytes;
 + int qpair;
 +
 + for (qpair = 0; qpair  vi-num_queue_pairs; qpair++) {
 + struct virtnet_send_stats __percpu *tx_stat;
 + struct virtnet_recv_stats __percpu *rx_stat;

While you're at it, you can drop the per-CPU stats and make them only
per-queue.  There is unlikely to be any benefit in maintaining them
per-CPU while receive and transmit processing is serialised per-queue.

[...]
 +static int invoke_find_vqs(struct virtnet_info *vi)
 +{
 + vq_callback_t **callbacks;
 + struct virtqueue **vqs;
 + int ret = -ENOMEM;
 + int i, total_vqs;
 + char **names;
 +
 + /*
 +  * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed
 +  * by the same 'vi-num_queue_pairs-1' more times, and optionally
 +  * one control virtqueue.
 +  */
 + total_vqs = vi-num_queue_pairs * 2 +
 + virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_VQ);
 +
 + /* Allocate space for find_vqs parameters */
 + vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
 + callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
 + names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
 + if (!vqs || !callbacks || !names)
 + goto err;
 +
 + /* Allocate/initialize parameters for recv virtqueues */
 + for (i = 0; i  vi-num_queue_pairs * 2; i += 2) {
 + callbacks[i] = skb_recv_done;
 + names[i] = kasprintf(GFP_KERNEL, input.%d, i / 2);
 + if (!names[i])
 + goto err;
 + }
 +
 + /* Allocate/initialize parameters for send virtqueues */
 + for (i = 1; i  vi-num_queue_pairs * 2; i += 2) {
 + callbacks[i] = skb_xmit_done;
 + names[i] = kasprintf(GFP_KERNEL, output.%d, i / 2);
 + if (!names[i])
 + goto err;
 + }
[...]

The RX and TX interrupt names for a multiqueue device should follow the
formats device-rx-index and device-tx-index.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [RFC v2 PATCH 5/4 PATCH] virtio-net: send gratuitous packet when needed

2011-10-24 Thread Ben Hutchings

On Mon, 2011-10-24 at 07:25 +0200, Michael S. Tsirkin wrote:
 On Mon, Oct 24, 2011 at 02:54:59PM +1030, Rusty Russell wrote:
  On Sat, 22 Oct 2011 13:43:11 +0800, Jason Wang jasow...@redhat.com wrote:
   This make let virtio-net driver can send gratituous packet by a new
   config bit - VIRTIO_NET_S_ANNOUNCE in each config update
   interrupt. When this bit is set by backend, the driver would schedule
   a workqueue to send gratituous packet through NETDEV_NOTIFY_PEERS.
   
   This feature is negotiated through bit VIRTIO_NET_F_GUEST_ANNOUNCE.
   
   Signed-off-by: Jason Wang jasow...@redhat.com
  
  This seems like a huge layering violation.  Imagine this in real
  hardware, for example.
 
 commits 06c4648d46d1b757d6b9591a86810be79818b60c
 and 99606477a5888b0ead0284fecb13417b1da8e3af
 document the need for this:
 
 NETDEV_NOTIFY_PEERS notifier indicates that a device moved to a 
 different physical link.
   and
 In real hardware such notifications are only
 generated when the device comes up or the address changes.
 
 So hypervisor could get the same behaviour by sending link up/down
 events, this is just an optimization so guest won't do
 unecessary stuff like try to reconfigure an IP address.
 
 
 Maybe LOCATION_CHANGE would be a better name?
[...]

We also use this in bonding failover, where the system location doesn't
change but a different link is used.  However, I do recognise that the
name ought to indicate what kind of change happened and not what the
expected action is.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [net-next RFC V2 PATCH 0/5] Multiqueue support in tun/tap

2011-09-19 Thread Ben Hutchings

On Sat, 2011-09-17 at 14:02 +0800, Jason Wang wrote:
[...]
 2 Current implementation may also get regression for single session
 packet transmission.
 
 The reason is packets from each flow were not handled by the same
 queue/vhost thread.
 
 Various method could be done to handle this:
 
 2.1 hack the guest driver, and store the queue index into the rxhash and
 use it when choosing tx in guest. This need some hack to store the
 rxhash into sk and pass it in to skb again in
 skb_orphan_try(). sk_rxhash is only used by RPS now, so some more
 clean method is needed.
[...]

I have previously suggested doing this as a general rule.  However, I
now think we can do much better with accelerated RFS and automatic XPS
(but the latter is not yet implemented).  For virtio_net, accelerated
RFS would effectively push the guest's RFS socket map out to the host.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice

2011-05-16 Thread Ben Hutchings

On Mon, 2011-05-16 at 12:28 -0700, Shirley Ma wrote:
 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---
 
  include/linux/netdevice.h |   10 ++
  1 files changed, 10 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index a134d80..2646251 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -1066,6 +1066,16 @@ struct net_device {
  #define NETIF_F_NOCACHE_COPY (1  30) /* Use no-cache copyfromuser */
  #define NETIF_F_LOOPBACK (1  31) /* Enable loopback */
  
 +/*
 + * Bit 31 is for device to map userspace buffers -- zerocopy
 + * Device can set this flag when it supports HIGHDMA.
 + * Device can't recycle this kind of skb buffers.
 + * There are 256 bytes copied, the rest of buffers are mapped.
 + * The userspace callback should only be called when last reference to this 
 skb
 + * is gone.
 + */
 +#define NETIF_F_ZEROCOPY (1  31)

Sorry, bit 31 is taken.  You get the job of turning features into a
wider bitmap.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice

2011-05-16 Thread Ben Hutchings

On Mon, 2011-05-16 at 12:38 -0700, Shirley Ma wrote:
 On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote:
  Sorry, bit 31 is taken.  You get the job of turning features into a
  wider bitmap.
 
 :) will do it.

Bear in mind that feature masks are manipulated in many different
places.  This is not a simple task.

See previous discussion at:
http://thread.gmane.org/gmane.linux.network/193284
and especially:
http://thread.gmane.org/gmane.linux.network/193284/focus=193332

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V3 2/8] Add a new zerocopy device flag

2011-04-20 Thread Ben Hutchings

On Wed, 2011-04-20 at 12:44 -0700, Shirley Ma wrote:
 This zerocopy flag is used to support device DMA userspace buffers.
 
 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---
 
  include/linux/netdevice.h |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index 0249fe7..0998d3d 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -1067,6 +1067,9 @@ struct net_device {
  #define NETIF_F_RXHASH   (1  28) /* Receive hashing offload */
  #define NETIF_F_RXCSUM   (1  29) /* Receive checksumming 
 offload */
  
 +/* bit 29 is for device to map userspace buffers -- zerocopy */
 +#define NETIF_F_ZEROCOPY (1  29)

Look above.

Ben.

   /* Segmentation offload features */
  #define NETIF_F_GSO_SHIFT16
  #define NETIF_F_GSO_MASK 0x00ff
 
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 02/17] Add a new struct for device to manipulate external buffer.

2010-09-27 Thread Ben Hutchings

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  include/linux/netdevice.h |   22 +-
  1 files changed, 21 insertions(+), 1 deletions(-)
 
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index fa8b476..ba582e1 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 @@ -530,6 +530,25 @@ struct netdev_queue {
   unsigned long   tx_dropped;
  } cacheline_aligned_in_smp;
  
 +/* Add a structure in structure net_device, the new field is
 + * named as mp_port. It's for mediate passthru (zero-copy).

That belongs in the commit message.

 + * It contains the capability for the net device driver,
 + * a socket, and an external buffer creator, external means
 + * skb buffer belongs to the device may not be allocated from
 + * kernel space.

Who sets which fields in this structure?  Can you make this a kernel-doc
comment specifying the use of each field?

Ben.

 + */
 +struct mpassthru_port{
 + int hdr_len;
 + int data_len;
 + int npages;
 + unsignedflags;
 + struct socket   *sock;
 + int vnet_hlen;
 + struct skb_ext_page *(*ctor)(struct mpassthru_port *,
 + struct sk_buff *, int);
 + struct skb_ext_page *(*hash)(struct net_device *,
 + struct page *);
 +};
  
  /*
   * This structure defines the management hooks for network devices.
 @@ -952,7 +971,8 @@ struct net_device {
   struct macvlan_port *macvlan_port;
   /* GARP */
   struct garp_port*garp_port;
 -
 + /* mpassthru */
 + struct mpassthru_port   *mp_port;
   /* class/net/name entry */
   struct device   dev;
   /* space for optional device, statistics, and wireless sysfs groups */

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-09-27 Thread Ben Hutchings

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 If the driver want to allocate external buffers,
 then it can export it's capability, as the skb
 buffer header length, the page length can be DMA, etc.
 The external buffers owner may utilize this.
[...]

This information needs to be included in the comment above struct
net_device_ops, not just in the commit message.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 04/17]Add a function make external buffer owner to query capability.

2010-09-27 Thread Ben Hutchings

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
[...]
 diff --git a/net/core/dev.c b/net/core/dev.c
 index 264137f..636f11b 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
   rcu_read_unlock();
  }
  
 +/* To support meidate passthru(zero-copy) with NIC driver,
 + * we'd better query NIC driver for the capability it can
 + * provide, especially for packet split mode, now we only
 + * query for the header size, and the payload a descriptor
 + * may carry. If a driver does not use the API to export,
 + * then we may try to use a default value, currently,
 + * we use the default value from an IGB driver. Now,
 + * it's only called by mpassthru device.
 + */
 +#if defined(CONFIG_MEDIATE_PASSTHRU) || 
 defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
 +int netdev_mp_port_prep(struct net_device *dev,
 + struct mpassthru_port *port)
 +{
 + int rc;
 + int npages, data_len;
 + const struct net_device_ops *ops = dev-netdev_ops;
 +
 + if (ops-ndo_mp_port_prep) {
 + rc = ops-ndo_mp_port_prep(dev, port);
 + if (rc)
 + return rc;
 + } else {
 + /* If the NIC driver did not report this,
 +  * then we try to use default value.
 +  */
 + port-hdr_len = 128;
 + port-data_len = 2048;
 + port-npages = 1;
 + }
[...]

Is it really necessary to have a default?

Also have you considered an API for changing the header/data split?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 11/17] Add header file for mp device.

2010-09-27 Thread Ben Hutchings

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  include/linux/mpassthru.h |   25 +
  1 files changed, 25 insertions(+), 0 deletions(-)
  create mode 100644 include/linux/mpassthru.h
 
 diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
 new file mode 100644
 index 000..ba8f320
 --- /dev/null
 +++ b/include/linux/mpassthru.h
 @@ -0,0 +1,25 @@
 +#ifndef __MPASSTHRU_H
 +#define __MPASSTHRU_H
 +
 +#include linux/types.h
 +#include linux/if_ether.h
 +
 +/* ioctl defines */
 +#define MPASSTHRU_BINDDEV  _IOW('M', 213, int)
 +#define MPASSTHRU_UNBINDDEV_IO('M', 214)
[...]

You need to include linux/ioctl.h first!

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 12/17] Add a kconfig entry and make entry for mp device.

2010-09-27 Thread Ben Hutchings

This patch is in the wrong position in the sequence.  It needs to be
applied after mpassthru.c is created, not before.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-27 Thread Ben Hutchings

) {
 + set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
 + iocb-ki_user_data * 4096 * 2,
 + iocb-ki_user_data * 4096 * 2);
 + }
 +
 + /* Translate address to kernel */
 + info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
 + if (!info)
 + return -ENOMEM;

I'm not convinced that the checks above this ensure that there will be
= MAX_SKB_FRAGS fragments.

[...]
 +static int mp_chr_open(struct inode *inode, struct file * file)
 +{
 + struct mp_file *mfile;
 + cycle_kernel_lock();

Seriously?

[...]
 +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
 + unsigned long count, loff_t pos)
 +{
 + struct file *file = iocb-ki_filp;
 + struct mp_struct *mp = mp_get(file-private_data);
 + struct sock *sk = mp-socket.sk;
 + struct sk_buff *skb;
 + int len, err;
 + ssize_t result = 0;
 +
 + if (!mp)
 + return -EBADFD;
 +
 + /* currently, async is not supported.
 +  * but we may support real async aio from user application,
 +  * maybe qemu virtio-net backend.
 +  */
 + if (!is_sync_kiocb(iocb))
 + return -EFAULT;
 +
 + len = iov_length(iov, count);
 +
 + if (unlikely(len)  ETH_HLEN)
 + return -EINVAL;

The first close-paren is in the wrong place.

 + skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
 +   file-f_flags  O_NONBLOCK, err);
 +
 + if (!skb)
 + return -EFAULT;

Why EFAULT?

 + skb_reserve(skb, NET_IP_ALIGN);
 + skb_put(skb, len);
 +
 + if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
 + kfree_skb(skb);
 + return -EAGAIN;
 + }
 +
 + skb-protocol = eth_type_trans(skb, mp-dev);

Why are you calling eth_type_trans() on transmit?

[...]
 +static int mp_device_event(struct notifier_block *unused,
 + unsigned long event, void *ptr)
 +{
 + struct net_device *dev = ptr;
 + struct mpassthru_port *port;
 + struct mp_struct *mp = NULL;
 + struct socket *sock = NULL;
 + struct sock *sk;
 +
 + port = dev-mp_port;
 + if (port == NULL)
 + return NOTIFY_DONE;
 +
 + switch (event) {
 + case NETDEV_UNREGISTER:
 + sock = dev-mp_port-sock;
 + mp = container_of(sock-sk, struct mp_sock, sk)-mp;
 + do_unbind(mp-mfile);
[...]

This can deadlock - netdev notifiers are called under the RTNL lock and
do_unbind() acquires the mp_mutex, whereas in other places they are
acquired in the opposite order.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v11 17/17]add two new ioctls for mp device.

2010-09-27 Thread Ben Hutchings

On Sat, 2010-09-25 at 12:27 +0800, xiaohui@intel.com wrote:
 From: Xin Xiaohui xiaohui@intel.com
 
 The patch add two ioctls for mp device.
 One is for userspace to query how much memory locked to make mp device
 run smoothly. Another one is for userspace to set how much meory locked
 it really wants.
[...]
 diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
 index ba8f320..083e9f7 100644
 --- a/include/linux/mpassthru.h
 +++ b/include/linux/mpassthru.h
 @@ -7,6 +7,8 @@
  /* ioctl defines */
  #define MPASSTHRU_BINDDEV  _IOW('M', 213, int)
  #define MPASSTHRU_UNBINDDEV_IO('M', 214)
 +#define MPASSTHRU_SET_MEM_LOCKED _IOW('M', 215, unsigned long)
 +#define MPASSTHRU_GET_MEM_LOCKED_NEED_IOR('M', 216, unsigned long)
[...]

These will need compat handling.  You can avoid that by defining them to
use a parameter type of u32 or u64.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-20 Thread Ben Hutchings

On Sun, 2010-06-20 at 21:59 +1000, Herbert Xu wrote:
 On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
 
  Let's do this then.  So far the virtio spec avoided making layout
  assumptions, leaving guests lay out data as they see fit.
  Isn't it possible to keep supporting this with zero copy for hardware
  that can issue DMA at arbitrary addresses?
 
 I think you're mistaken with respect to what is being proposed.
 Raising 512 bytes isn't a hard constraint, it is merely an
 optimisation for Intel NICs because their PS mode can produce
 a head fragment of up to 512 bytes.
 
 If the guest didn't allocate 512 bytes it wouldn't be the end of
 the world, it'd just mean that we'd either copy whatever is in
 the head fragment, or we waste 4096-X bytes of memory where X
 is the number of bytes in the head.

If I understand correctly what this 'PS mode' is (I haven't seen the
documentation for it), it is a feature that Microsoft requested from
hardware vendors for use in Hyper-V.  As a result, the SFC9000 family
and presumably other controllers also implement something similar.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 02/17] vbus: add virtual-bus definitions

2009-04-02 Thread Ben Hutchings

On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:
[...]
 +Create a device instance
 +
 +
 +Devices are instantiated by again utilizing the /config/vbus configfs area.
 +At first you may suspect that devices are created as subordinate objects of a
 +bus/container instance, but you would be mistaken.

This is kind of patronising; why don't you simply lay out how things
_do_ work?

  Devices are actually
 +root-level objects in vbus specifically to allow greater flexibility in the
 +association of a device.  For instance, it may be desirable to have a single
 +device that spans multiple VMs (consider an ethernet switch, or a shared disk
 +for a cluster).  Therefore, device lifecycles are managed by 
 creating/deleting
 +objects in /config/vbus/devices.
 +
 +Note: Creating a device instance is actually a two step process:  We need to
 +give the device instance a unique name, and we also need to give it a 
 specific
 +device type.  It is hard to express both parameters using standard filesystem
 +operations like mkdir, so the design decision was made to require performing
 +the operation in two steps.

How about exposing a subdir for each device class under
/config/vbus/devices/ and allowing device creation only within those?
Two-stage construction is a pain for both users and implementors.

[...]
 +At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
 +namespace with one device, id=0.  Its type is:
 +
 +# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
 +virtual-ethernet
 +
 +virtual-ethernet?  Why is it not venet-tap?  Device-classes are allowed 
 to
 +register their interfaces under an id that is not required to be the same as
 +their deviceclass.  This supports device polymorphism.   For instance,
 +consider that an interface virtual-ethernet may provide basic 802.x packet
 +exchange.  However, we could have various implementations of a device that
 +supports the 802.x interface, while having various implementations behind
 +them.
[...]

It seems to me that your device-classes correspond to drivers and
interfaces correspond to device classes in the LDM.  To avoid
confusion, I think the vbus terminology should be made consistent with
LDM.  And certainly these should not both be called simply type in the
configfs/sysfs interface.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

70 matches

Mail list logo