[PATCH] vhost/vsock: fix reset orphans race with close timeout

2018-12-06 Thread Stefan Hajnoczi
If a local process has closed a connected socket and hasn't received a
RST packet yet, then the socket remains in the table until a timeout
expires.

When a vhost_vsock instance is released with the timeout still pending,
the socket is never freed because vhost_vsock has already set the
SOCK_DONE flag.

Check if the close timer is pending and let it close the socket.  This
prevents the race which can leak sockets.

Reported-by: Maximilian Riemensberger 
Cc: Graham Whaley 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vsock.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 34bc3ab40c6d..731e2ea2aeca 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -563,13 +563,21 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
 * executing.
 */
 
-   if (!vhost_vsock_get(vsk->remote_addr.svm_cid)) {
-   sock_set_flag(sk, SOCK_DONE);
-   vsk->peer_shutdown = SHUTDOWN_MASK;
-   sk->sk_state = SS_UNCONNECTED;
-   sk->sk_err = ECONNRESET;
-   sk->sk_error_report(sk);
-   }
+   /* If the peer is still valid, no need to reset connection */
+   if (vhost_vsock_get(vsk->remote_addr.svm_cid))
+   return;
+
+   /* If the close timeout is pending, let it expire.  This avoids races
+* with the timeout callback.
+*/
+   if (vsk->close_work_scheduled)
+   return;
+
+   sock_set_flag(sk, SOCK_DONE);
+   vsk->peer_shutdown = SHUTDOWN_MASK;
+   sk->sk_state = SS_UNCONNECTED;
+   sk->sk_err = ECONNRESET;
+   sk->sk_error_report(sk);
 }
 
 static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
-- 
2.19.2



Re: [PATCH net V2] vhost-vsock: fix use after free

2018-09-27 Thread Stefan Hajnoczi
On Thu, Sep 27, 2018 at 08:22:04PM +0800, Jason Wang wrote:
> The access of vsock is not protected by vhost_vsock_lock. This may
> lead to use after free since vhost_vsock_dev_release() may free the
> pointer at the same time.
> 
> Fix this by holding the lock during the access.
> 
> Reported-by: syzbot+e3e074963495f92a8...@syzkaller.appspotmail.com
> Fixes: 16320f363ae1 ("vhost-vsock: add pkt cancel capability")
> Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
> Cc: Stefan Hajnoczi 
> Signed-off-by: Jason Wang 
> ---
> - V2: fix typos
> - The patch is needed for -stable.
> ---
>  drivers/vhost/vsock.c | 26 +++---
>  1 file changed, 19 insertions(+), 7 deletions(-)

Thank you, Jason!

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH net] VSOCK: check sk state before receive

2018-05-30 Thread Stefan Hajnoczi
On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
> Hmm...Although I won't reproduce this bug with my reproducer after
> apply my patch. I could still get a similiar issue with syzkaller sock vnet 
> test.
> 
> It looks this patch is not complete. Here is the KASAN call trace with my 
> patch.
> I can also reproduce it without my patch.

Seems like a race between vmci_datagram_destroy_handle() and the
delayed callback, vmci_transport_recv_dgram_cb().

I don't know the VMCI transport well so I'll leave this to Jorgen.

> ==
> BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 
> [vmw_vsock_vmci_transport]
> Read of size 4 at addr 880026a3a914 by task kworker/0:2/96
> 
> CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> Workqueue: events dg_delayed_dispatch [vmw_vmci]
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0xdd/0x18e lib/dump_stack.c:113
>  print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
>  kasan_report_error mm/kasan/report.c:354 [inline]
>  kasan_report+0x1dd/0x460 mm/kasan/report.c:412
>  vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>  vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
>  dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
>  process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
>  worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
>  kthread+0x343/0x4b0 kernel/kthread.c:240
>  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412
> 
> Allocated by task 2684:
>  set_track mm/kasan/kasan.c:460 [inline]
>  kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>  slab_post_alloc_hook mm/slab.h:444 [inline]
>  slab_alloc_node mm/slub.c:2741 [inline]
>  slab_alloc mm/slub.c:2749 [inline]
>  kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
>  sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
>  sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
>  __vsock_create+0xc8/0x9b0 [vsock]
>  vsock_create+0xfd/0x1a0 [vsock]
>  __sock_create+0x310/0x690 net/socket.c:1285
>  sock_create net/socket.c:1325 [inline]
>  __sys_socket+0x101/0x240 net/socket.c:1355
>  __do_sys_socket net/socket.c:1364 [inline]
>  __se_sys_socket net/socket.c:1362 [inline]
>  __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
>  do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 2684:
>  set_track mm/kasan/kasan.c:460 [inline]
>  __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
>  slab_free_hook mm/slub.c:1388 [inline]
>  slab_free_freelist_hook mm/slub.c:1415 [inline]
>  slab_free mm/slub.c:2988 [inline]
>  kmem_cache_free+0xce/0x410 mm/slub.c:3004
>  sk_prot_free net/core/sock.c:1509 [inline]
>  __sk_destruct+0x629/0x940 net/core/sock.c:1593
>  sk_destruct+0x4e/0x90 net/core/sock.c:1601
>  __sk_free+0xd3/0x320 net/core/sock.c:1612
>  sk_free+0x2a/0x30 net/core/sock.c:1623
>  __vsock_release+0x431/0x610 [vsock]
>  vsock_release+0x3c/0xc0 [vsock]
>  sock_release+0x91/0x200 net/socket.c:594
>  sock_close+0x17/0x20 net/socket.c:1149
>  __fput+0x368/0xa20 fs/file_table.c:209
>  task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
>  exit_task_work include/linux/task_work.h:22 [inline]
>  do_exit+0x1876/0x26c0 kernel/exit.c:865
>  do_group_exit+0x159/0x3e0 kernel/exit.c:968
>  get_signal+0x65a/0x1780 kernel/signal.c:2482
>  do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
>  exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
>  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>  do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The buggy address belongs to the object at 880026a3a600
>  which belongs to the cache AF_VSOCK of size 1056
> The buggy address is located 788 bytes inside of
>  1056-byte region [880026a3a600, 880026a3aa20)
> The buggy address belongs to the page:
> page:ea9a8e00 count:1 mapcount:0 mapping: index:0x0 
> compound_mapcount: 0
> flags: 0xfc0008100(slab|head)
> raw: 000fc0008100   0001000d000d
> raw: dead0100 dead0200 880034471a40 
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  880026a3a880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >880026a3a900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ^
>  880026a3a980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  880026a3aa00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
> ==


signature.asc
Description: PGP signature


Re: [PATCH] VSOCK: make af_vsock.ko removable again

2018-04-17 Thread Stefan Hajnoczi
On Tue, Apr 17, 2018 at 09:45:12AM -0400, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Tue, 17 Apr 2018 14:25:58 +0800
> 
> > Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> > vsock_init_tables()") introduced a module_init() function without a
> > corresponding module_exit() function.
> > 
> > Modules with an init function can only be removed if they also have an
> > exit function.  Therefore the vsock module was considered "permanent"
> > and could not be removed.
> > 
> > This patch adds an empty module_exit() function so that "rmmod vsock"
> > works.  No explicit cleanup is required because:
> > 
> > 1. Transports call vsock_core_exit() upon exit and cannot be removed
> >while sockets are still alive.
> > 2. vsock_diag.ko does not perform any action that requires cleanup by
> >vsock.ko.
> > 
> > Reported-by: Xiumei Mu <x...@redhat.com>
> > Cc: Cong Wang <xiyou.wangc...@gmail.com>
> > Cc: Jorgen Hansen <jhan...@vmware.com>
> > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> 
> Applied, but please provide a proper Fixes: tag next time.  I added it
> for you this time.

Will do.  Thanks!

Stefan


signature.asc
Description: PGP signature


[PATCH] VSOCK: make af_vsock.ko removable again

2018-04-17 Thread Stefan Hajnoczi
Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
vsock_init_tables()") introduced a module_init() function without a
corresponding module_exit() function.

Modules with an init function can only be removed if they also have an
exit function.  Therefore the vsock module was considered "permanent"
and could not be removed.

This patch adds an empty module_exit() function so that "rmmod vsock"
works.  No explicit cleanup is required because:

1. Transports call vsock_core_exit() upon exit and cannot be removed
   while sockets are still alive.
2. vsock_diag.ko does not perform any action that requires cleanup by
   vsock.ko.

Reported-by: Xiumei Mu <x...@redhat.com>
Cc: Cong Wang <xiyou.wangc...@gmail.com>
Cc: Jorgen Hansen <jhan...@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 net/vmw_vsock/af_vsock.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index aac9b8f6552e..c1076c19b858 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -2018,7 +2018,13 @@ const struct vsock_transport 
*vsock_core_get_transport(void)
 }
 EXPORT_SYMBOL_GPL(vsock_core_get_transport);
 
+static void __exit vsock_exit(void)
+{
+   /* Do nothing.  This function makes this module removable. */
+}
+
 module_init(vsock_init_tables);
+module_exit(vsock_exit);
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMware Virtual Socket Family");
-- 
2.14.3



[PATCH v3 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
  return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
  return A;
  return B;

This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bec722e41f58..fc805b7fad9d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-   int ret = vq_log_access_ok(vq, vq->log_base);
+   if (!vq_log_access_ok(vq, vq->log_base))
+   return 0;
 
-   if (ret || vq->iotlb)
-   return ret;
+   /* Access validation occurs at prefetch time with IOTLB */
+   if (vq->iotlb)
+   return 1;
 
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
 }
-- 
2.14.3



[PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
v3:
 * Rebased onto net/master and resolved conflict [DaveM]

v2:
 * Rewrote the conditional to make the vq access check clearer [Linus]
 * Added Patch 2 to make the return type consistent and harder to misuse [Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to prevent
future bugs.

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

 drivers/vhost/vhost.h |  4 +--
 drivers/vhost/vhost.c | 70 ++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.14.3



[PATCH v3 2/2] vhost: return bool from *_access_ok() functions

2018-04-10 Thread Stefan Hajnoczi
Currently vhost *_access_ok() functions return int.  This is error-prone
because there are two popular conventions:

1. 0 means failure, 1 means success
2. -errno means failure, 0 means success

Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.

This patch changes the return type from int to bool so that false means
failure and true means success.  This eliminates a potential source of
errors.

Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 drivers/vhost/vhost.h |  4 ++--
 drivers/vhost/vhost.c | 66 +--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d8ee85ae8fdc..6c844b90a168 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user 
*argp);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user 
*argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fc805b7fad9d..0fcb51a9940c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
u64 a = addr / VHOST_PAGE_SIZE / 8;
 
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
-   return 0;
+   return false;
 
return access_ok(VERIFY_WRITE, log_base + a,
 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
-  int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+   int log_all)
 {
struct vhost_umem_node *node;
 
if (!umem)
-   return 0;
+   return false;
 
list_for_each_entry(node, >umem_list, link) {
unsigned long a = node->userspace_addr;
 
if (vhost_overflow(node->userspace_addr, node->size))
-   return 0;
+   return false;
 
 
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
-   return 0;
+   return false;
else if (log_all && !log_access_ok(log_base,
   node->start,
   node->size))
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct 
vhost_virtqueue *vq,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
-   int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+int log_all)
 {
int i;
 
for (i = 0; i < d->nvqs; ++i) {
-   int ok;
+   bool ok;
bool log;
 
mutex_lock(>vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct 
vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
 umem, log);
else
-   ok = 1;
+   ok = true;
mutex_unlock(>vqs[i]->mutex);
if (!ok)
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock

Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 09:23:53AM +, Liang, Cunming wrote:
> If QEMU is going to build a user space driver framework there, we're open 
> mind on that, even leveraging DPDK as the underlay library. Looking forward 
> to more others' comments from community.

There is already an NVMe VFIO driver in QEMU (see block/nvme.c).  So in
principle there's no reason against userspace drivers in QEMU.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 10:50:43AM -0400, David Miller wrote:
> From: Jason Wang <jasow...@redhat.com>
> Date: Tue, 10 Apr 2018 14:40:10 +0800
> 
> > On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
> >> v2:
> >>   * Rewrote the conditional to make the vq access check clearer [Linus]
> >>   * Added Patch 2 to make the return type consistent and harder to misuse
> >>   * [Linus]
> >>
> >> The first patch fixes the vhost virtqueue access check which was
> >> recently
> >> broken.  The second patch replaces the int return type with bool to
> >> prevent
> >> future bugs.
> >>
> >> Stefan Hajnoczi (2):
> >>vhost: fix vhost_vq_access_ok() log check
> >>vhost: return bool from *_access_ok() functions
> >>
> >>   drivers/vhost/vhost.h |  4 +--
> >>   drivers/vhost/vhost.c | 70
> >>   ++-
> >>   2 files changed, 38 insertions(+), 36 deletions(-)
> >>
> > 
> > Acked-by: Jason Wang <jasow...@redhat.com>
> 
> This sereis doesn't apply cleanly to the net tree, please respin
> and add the appropriate Acks and Fixes: tags that Michael and Jason
> have provided.

Sorry, will fix.

Stefan


signature.asc
Description: PGP signature


[PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
v2:
 * Rewrote the conditional to make the vq access check clearer [Linus]
 * Added Patch 2 to make the return type consistent and harder to misuse [Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to prevent
future bugs.

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

 drivers/vhost/vhost.h |  4 +--
 drivers/vhost/vhost.c | 70 ++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.14.3



[PATCH v2 2/2] vhost: return bool from *_access_ok() functions

2018-04-09 Thread Stefan Hajnoczi
Currently vhost *_access_ok() functions return int.  This is error-prone
because there are two popular conventions:

1. 0 means failure, 1 means success
2. -errno means failure, 0 means success

Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.

This patch changes the return type from int to bool so that false means
failure and true means success.  This eliminates a potential source of
errors.

Suggested-by: Linus Torvalds <torva...@linux-foundation.org>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 drivers/vhost/vhost.h |  4 ++--
 drivers/vhost/vhost.c | 66 +--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac4b6056f19a..6e00fa57af09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user 
*argp);
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 93fd0c75b0d8..b6a082ef33dd 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
u64 a = addr / VHOST_PAGE_SIZE / 8;
 
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
-   return 0;
+   return false;
 
return access_ok(VERIFY_WRITE, log_base + a,
 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
-  int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+   int log_all)
 {
struct vhost_umem_node *node;
 
if (!umem)
-   return 0;
+   return false;
 
list_for_each_entry(node, >umem_list, link) {
unsigned long a = node->userspace_addr;
 
if (vhost_overflow(node->userspace_addr, node->size))
-   return 0;
+   return false;
 
 
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
-   return 0;
+   return false;
else if (log_all && !log_access_ok(log_base,
   node->start,
   node->size))
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct 
vhost_virtqueue *vq,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
-   int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+int log_all)
 {
int i;
 
for (i = 0; i < d->nvqs; ++i) {
-   int ok;
+   bool ok;
bool log;
 
mutex_lock(>vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct 
vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
 umem, log);
else
-   ok = 1;
+   ok = true;
mutex_unlock(>vqs[i]->mutex);
if (!ok)
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock(>iotlb_lock)

[PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
  return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
  return A;
  return B;

This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..93fd0c75b0d8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-   int ret = vq_log_access_ok(vq, vq->log_base);
+   if (!vq_log_access_ok(vq, vq->log_base))
+   return 0;
 
-   if (ret || vq->iotlb)
-   return ret;
+   /* Access validation occurs at prefetch time with IOTLB */
+   if (vq->iotlb)
+   return 1;
 
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
 }
-- 
2.14.3



Re: [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 3:40 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
>
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression.  The logic was
> originally:
>
>   if (vq->iotlb)
>   return 1;
>   return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
>   if (A || vq->iotlb)
>   return A;
>   return B;
>
> The correct logic is:
>
>   if (!A || vq->iotlb)
>   return A;
>   return B;
>
> Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
> Cc: Jason Wang <jasow...@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> Acked-by: Michael S. Tsirkin <m...@redhat.com>
>
> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

NACK

I will send a v2 with cleaner logic as suggested by Linus.

Stefan


Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)

2018-04-09 Thread Stefan Hajnoczi
On Mon, Apr 9, 2018 at 11:28 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote:
>> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
>> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot
>> > <syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com> wrote:
>> > > syzbot hit the following crash on upstream commit
>> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +)
>> > > Merge tag 'armsoc-drivers' of
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>> > > syzbot dashboard link:
>> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
>> >
>> > To prevent duplicated work: I am working on this one.
>> >
>> > Stefan
>>
>> Do you want to try this patchset:
>> https://lkml.org/lkml/2018/4/5/665
>>
>> ?
>
> Thanks, I'll give it a shot.
>
> I also noticed a regression in commit
> d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when
> IOTLB is enabled") and am currently testing a fix.

I have sent a fix:
https://lkml.org/lkml/2018/4/9/390

Stefan


Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)

2018-04-08 Thread Stefan Hajnoczi
On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot
> > <syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com> wrote:
> > > syzbot hit the following crash on upstream commit
> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +)
> > > Merge tag 'armsoc-drivers' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> > > syzbot dashboard link:
> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
> > 
> > To prevent duplicated work: I am working on this one.
> > 
> > Stefan
> 
> Do you want to try this patchset:
> https://lkml.org/lkml/2018/4/5/665
> 
> ?

Thanks, I'll give it a shot.

I also noticed a regression in commit
d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when
IOTLB is enabled") and am currently testing a fix.

Stefan


signature.asc
Description: PGP signature


Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)

2018-04-08 Thread Stefan Hajnoczi
On Sat, Apr 7, 2018 at 3:02 AM, syzbot
 wrote:
> syzbot hit the following crash on upstream commit
> 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +)
> Merge tag 'armsoc-drivers' of
> git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd

To prevent duplicated work: I am working on this one.

Stefan

>
> So far this crash happened 4 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5813481738265533882
> compiler: gcc (GCC) 8.0.1 20180301 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> [ cut here ]
> kernel BUG at drivers/vhost/vhost.c:1652!
> invalid opcode:  [#1] SMP KASAN
> [ cut here ]
> Dumping ftrace buffer:
> kernel BUG at drivers/vhost/vhost.c:1652!
>(ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
> RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
> RSP: 0018:8801b256f920 EFLAGS: 00010293
> RAX: 8801adc9e2c0 RBX: dc00 RCX: 85924a0f
> RDX:  RSI: 85924cea RDI: 0005
> RBP: 8801b256fa58 R08: 8801adc9e2c0 R09: ed003962412d
> R10: 8801b256fad8 R11: 8801cb12096f R12: 0001
> R13: ed00364adf36 R14:  R15: 8801b256fa30
> FS:  7fdf24b19700() GS:8801db10() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20bf6000 CR3: 0001ae6a7000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
>  vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
>  vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
>  vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
>  vfs_ioctl fs/ioctl.c:46 [inline]
>  file_ioctl fs/ioctl.c:500 [inline]
>  do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
>  ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
>  SYSC_ioctl fs/ioctl.c:708 [inline]
>  SyS_ioctl+0x24/0x30 fs/ioctl.c:706
>  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x4456c9
> RSP: 002b:7fdf24b18da8 EFLAGS: 0297 ORIG_RAX: 0010
> RAX: ffda RBX: 006dac24 RCX: 004456c9
> RDX: 20f82ffc RSI: 4004af61 RDI: 001b
> RBP: 006dac20 R08:  R09: 
> R10:  R11: 0297 R12: 6b636f73762d7473
> R13: 6f68762f7665642f R14: fffc R15: 0007
> Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8
> 03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f 5e
> e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
> RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:
> 8801b256f920
> RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: 8801b256f920
> invalid opcode:  [#2] SMP KASAN
> ---[ end trace 0d0ff45aa44d8a23 ]---
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.


Re: [v2] vhost: add vsock compat ioctl

2018-03-22 Thread Stefan Hajnoczi
On Fri, Mar 16, 2018 at 7:30 PM, David Miller  wrote:
> Although the top level ioctls are probably size and layout compatible,
> I do not think that the deeper ioctls can be called by compat binaries
> without some translations in order for them to work.

I audited the vhost ioctl code when reviewing this patch and was
unable to find anything that would break for a 32-bit userspace
process.

drivers/vhost/net.c does the same thing already, which doesn't prove
it's correct but makes me more confident I didn't miss something while
auditing the vhost ioctl code.

Did you have a specific ioctl in mind?

Stefan


Re: [PATCH v2] vhost: add vsock compat ioctl

2018-03-16 Thread Stefan Hajnoczi
On Wed, Mar 14, 2018 at 02:36:25PM -0700, Sonny Rao wrote:
> This will allow usage of vsock from 32-bit binaries on a 64-bit
> kernel.
> 
> Signed-off-by: Sonny Rao <sonny...@chromium.org>
> ---
>  drivers/vhost/vsock.c | 11 +++
>  1 file changed, 11 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


Re: [PATCH] drivers: vhost: vsock: fixed a brace coding style issue

2018-03-09 Thread Stefan Hajnoczi
On Fri, Mar 09, 2018 at 08:26:03AM +0530, Vaibhav Murkute wrote:
> Fixed a coding style issue.
> 
> Signed-off-by: Vaibhav Murkute <vaibhavmurkut...@gmail.com>
> ---
>  drivers/vhost/vsock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


Re: [PATCH] vsock.7: document VSOCK socket address family

2018-02-01 Thread Stefan Hajnoczi
On Tue, Jan 30, 2018 at 10:31:54PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Stefan,
> 
> Ping on the below please, since it either blocks the man-pages release
> I'd currently like to make, or I must remove the vsock.7 page for this
> release.

Sorry for the delay.  The verbatim license is fine.

Stefan


signature.asc
Description: PGP signature


[PATCH] VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING

2018-01-26 Thread Stefan Hajnoczi
select(2) with wfds but no rfds must return when the socket is shut down
by the peer.  This way userspace notices socket activity and gets -EPIPE
from the next write(2).

Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received.  This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 net/vmw_vsock/af_vsock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5d28abf87fbf..c9473d698525 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -951,7 +951,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
 * but local send is not shutdown.
 */
-   if (sk->sk_state == TCP_CLOSE) {
+   if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_CLOSING) {
if (!(sk->sk_shutdown & SEND_SHUTDOWN))
mask |= POLLOUT | POLLWRNORM;
 
-- 
2.14.3



Re: [PATCH 0/5] VSOCK: add vsock_test test suite

2018-01-02 Thread Stefan Hajnoczi
On Wed, Dec 20, 2017 at 02:48:43PM +, Jorgen S. Hansen wrote:
> 
> > On Dec 13, 2017, at 3:49 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > 
> > The vsock_diag.ko module already has a test suite but the core AF_VSOCK
> > functionality has no tests.  This patch series adds several test cases that
> > exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv, connect/accept,
> > half-closed connections, simultaneous connections).
> > 
> > The test suite is modest but I hope to cover additional cases in the future.
> > My goal is to have a shared test suite so VMCI, Hyper-V, and KVM can ensure
> > that our transports behave the same.
> > 
> > I have tested virtio-vsock.
> > 
> > Jorgen: Please test the VMCI transport and let me know if anything needs to 
> > be
> > adjusted.  See tools/testing/vsock/README for information on how to run the
> > test suite.
> > 
> 
> I tried running the vsock_test on VMCI, and all the tests failed in one way or
> another:

Great, thank you for testing and looking into the failures!

> 1) connection reset test: when the guest tries to connect to the host, we
>   get EINVAL as the error instead of ECONNRESET. I’ll fix that.

Yay, the tests found a bug!

> 2) client close and server close tests: On the host side, VMCI doesn’t
>   support reading data from a socket that has been closed by the
>   guest. When the guest closes a connection, all data is gone, and
>   we return EOF on the host side. So the tests that try to read data
>   after close, should not attempt that on VMCI host side. I got the
>   tests to pass by adding a getsockname call to determine if
>   the local CID was the host CID, and then skip the read attempt
>   in that case. We could add a vmci flag, that would enable
>   this behavior.

Interesting behavior.  Is there a reason for disallowing half-closed
sockets on the host side?

> 3) send_byte(fd, -EPIPE): for the VMCI transport, the close
>  isn’t necessarily visible immediately on the peer. So in most
>  cases, these send operations would complete with success.
>  I was running these tests using nested virtualization, so I
>  suspect that the problem is more likely to occur here, but
>  I had to add a sleep to be sure to get the EPIPE error.

Good point, you've discovered a race condition that affects all
transports.  The vsock close state transition might not have occurred
yet when the TCP control channel receives the "CLOSED" message.

test_stream_client_close_server() needs to wait for the socket status to
change before attempting send_byte(fd, -EPIPE).  I guess I'll have to
use vsock_diag or another kernel interface to check the socket's state.

> 5) multiple connections tests: with the standard socket sizes,
>   VMCI is only able to support about 100 concurrent stream
>   connections so this test passes with MULTICONN_NFDS
>   set to 100.

The 1000 magic number is because many distros have a maximum number of
file descriptors ulimit of 1024.  But it's an arbitrary number and we
could lower it to 100.

Is this VMCI concurrent stream limit a denial of service vector?  Can an
unprivileged guest userspace process open many sockets to prevent
legitimate connections from other users within the same guest?

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 2/5] VSOCK: extract connect/accept functions from vsock_diag_test.c

2017-12-14 Thread Stefan Hajnoczi
On Wed, Dec 13, 2017 at 04:32:58PM -0500, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Wed, 13 Dec 2017 14:49:08 +
> 
> > +#include 
> > +#include "../../../include/uapi/linux/vm_sockets.h"
> > +
>  ...
> > -#include "../../../include/uapi/linux/vm_sockets.h"
> >  #include "../../../include/uapi/linux/vm_sockets_diag.h"
> 
> This really should never be necessary.
> 
> As part of the build, before tunning tests, one is required to do
> "make headers_install" and whether as root or a normal user, this
> will make the UAPI headers of the current kernel available to the
> build of the testcases.
> 
> So please undo this weird relative include stuff.

Thanks for explaining!  Will fix in v2.

Stefan


signature.asc
Description: PGP signature


Re: AF_VSOCK connection refused errno

2017-12-13 Thread Stefan Hajnoczi
On Wed, Dec 13, 2017 at 10:28:30AM +, Jorgen S. Hansen wrote:
> 
> > On Dec 12, 2017, at 4:53 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > 
> > When connect(2) fails because the peer is not listening the virtio vsock
> > transport returns ECONNRESET.  I believe the VMCI transport does the
> > same (based on code inspection).
> > 
> > Jorgen: Can you confirm this VMCI transport behavior?
> 
> Yes, that is correct.
> 
> > I'd like to change to ECONNREFUSED for all transports because developers
> > will be surprised when they get ECONNRESET.  It makes porting AF_INET
> > code harder.
> > 
> > On the other hand, it may be too late to fix this if there userspace
> > applications that rely on ECONNRESET?  I'm not aware of any such
> > applications myself.
> 
> In the past, I’ve explained to customers that an ECONNRESET error on connect
> can indicate that the peer isn’t listening on the dest address. Whether they 
> went
> and used that information isn’t clear, but changing this behavior now would
> risk breaking applications. While it is unfortunate that we deviate from INET 
> in
> this case, I would prefer it to stay as is.

That's fine.  Thanks for confirming.

Stefan


signature.asc
Description: PGP signature


[PATCH 1/5] VSOCK: extract utility functions from vsock_diag_test.c

2017-12-13 Thread Stefan Hajnoczi
Move useful functions into a separate file in preparation for more vsock
test programs.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 tools/testing/vsock/Makefile  |  2 +-
 tools/testing/vsock/util.h| 35 +
 tools/testing/vsock/util.c| 70 ++
 tools/testing/vsock/vsock_diag_test.c | 92 +--
 4 files changed, 128 insertions(+), 71 deletions(-)
 create mode 100644 tools/testing/vsock/util.h
 create mode 100644 tools/testing/vsock/util.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 66ba0924194d..cf0650007fa8 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,6 +1,6 @@
 all: test
 test: vsock_diag_test
-vsock_diag_test: vsock_diag_test.o timeout.o control.o
+vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
new file mode 100644
index ..a51e17578ccf
--- /dev/null
+++ b/tools/testing/vsock/util.h
@@ -0,0 +1,35 @@
+#ifndef UTIL_H
+#define UTIL_H
+
+/* Tests can either run as the client or the server */
+enum test_mode {
+   TEST_MODE_UNSET,
+   TEST_MODE_CLIENT,
+   TEST_MODE_SERVER
+};
+
+/* Test runner options */
+struct test_opts {
+   enum test_mode mode;
+   unsigned int peer_cid;
+};
+
+/* A test case definition.  Test functions must print failures to stderr and
+ * terminate with exit(EXIT_FAILURE).
+ */
+struct test_case {
+   const char *name; /* human-readable name */
+
+   /* Called when test mode is TEST_MODE_CLIENT */
+   void (*run_client)(const struct test_opts *opts);
+
+   /* Called when test mode is TEST_MODE_SERVER */
+   void (*run_server)(const struct test_opts *opts);
+};
+
+void init_signals(void);
+unsigned int parse_cid(const char *str);
+void run_tests(const struct test_case *test_cases,
+  const struct test_opts *opts);
+
+#endif /* UTIL_H */
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
new file mode 100644
index ..2567abd90153
--- /dev/null
+++ b/tools/testing/vsock/util.c
@@ -0,0 +1,70 @@
+/*
+ * vsock test utilities
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "timeout.h"
+#include "util.h"
+
+/* Install signal handlers */
+void init_signals(void)
+{
+   struct sigaction act = {
+   .sa_handler = sigalrm,
+   };
+
+   sigaction(SIGALRM, , NULL);
+   signal(SIGPIPE, SIG_IGN);
+}
+
+/* Parse a CID in string representation */
+unsigned int parse_cid(const char *str)
+{
+   char *endptr = NULL;
+   unsigned long int n;
+
+   errno = 0;
+   n = strtoul(str, , 10);
+   if (errno || *endptr != '\0') {
+   fprintf(stderr, "malformed CID \"%s\"\n", str);
+   exit(EXIT_FAILURE);
+   }
+   return n;
+}
+
+/* Run test cases.  The program terminates if a failure occurs. */
+void run_tests(const struct test_case *test_cases,
+  const struct test_opts *opts)
+{
+   int i;
+
+   for (i = 0; test_cases[i].name; i++) {
+   void (*run)(const struct test_opts *opts);
+
+   printf("%s...", test_cases[i].name);
+   fflush(stdout);
+
+   if (opts->mode == TEST_MODE_CLIENT)
+   run = test_cases[i].run_client;
+   else
+   run = test_cases[i].run_server;
+
+   if (run)
+   run(opts);
+
+   printf("ok\n");
+   }
+}
diff --git a/tools/testing/vsock/vsock_diag_test.c 
b/tools/testing/vsock/vsock_diag_test.c
index e896a4af52f4..f1ace5d96e00 100644
--- a/tools/testing/vsock/vsock_diag_test.c
+++ b/tools/testing/vsock/vsock_diag_test.c
@@ -13,12 +13,10 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -33,12 +31,7 @@
 
 #include "timeout.h"
 #include "control.h"
-
-enum test_mode {
-   TEST_MODE_UNSET,
-   TEST_MODE_CLIENT,
-   TEST_MODE_SERVER
-};
+#include "util.h"
 
 /* Per-socket status */
 struct vsock_stat {
@@ -339,7 +332,7 @@ static void free_sock_stat(struct list_head *sockets)
free(st);
 }
 
-static void test_no_sockets(unsigned int peer_cid)
+static void test_

[PATCH 4/5] VSOCK: add send_byte()/recv_byte() test utilities

2017-12-13 Thread Stefan Hajnoczi
Test cases will want to transfer data.  This patch adds utility
functions to do this.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 tools/testing/vsock/util.h |  2 +
 tools/testing/vsock/util.c | 99 ++
 2 files changed, 101 insertions(+)

diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index 3e6971f15caa..cf43d42591bc 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -35,6 +35,8 @@ unsigned int parse_cid(const char *str);
 int vsock_stream_connect(unsigned int cid, unsigned int port);
 int vsock_stream_accept(unsigned int cid, unsigned int port,
struct sockaddr_vm *clientaddrp);
+void send_byte(int fd, int expected_ret);
+void recv_byte(int fd, int expected_ret);
 void run_tests(const struct test_case *test_cases,
   const struct test_opts *opts);
 
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2be923fe9922..e28085a2a429 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -153,6 +154,104 @@ int vsock_stream_accept(unsigned int cid, unsigned int 
port,
return client_fd;
 }
 
+/* Transmit one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void send_byte(int fd, int expected_ret)
+{
+   const uint8_t byte = 'A';
+   ssize_t nwritten;
+
+   timeout_begin(TIMEOUT);
+   do {
+   nwritten = write(fd, , sizeof(byte));
+   timeout_check("write");
+   } while (nwritten < 0 && errno == EINTR);
+   timeout_end();
+
+   if (expected_ret < 0) {
+   if (nwritten != -1) {
+   fprintf(stderr, "bogus write(2) return value %zd\n",
+   nwritten);
+   exit(EXIT_FAILURE);
+   }
+   if (errno != -expected_ret) {
+   perror("write");
+   exit(EXIT_FAILURE);
+   }
+   return;
+   }
+
+   if (nwritten < 0) {
+   perror("write");
+   exit(EXIT_FAILURE);
+   }
+   if (nwritten == 0) {
+   if (expected_ret == 0)
+   return;
+
+   fprintf(stderr, "unexpected EOF while sending byte\n");
+   exit(EXIT_FAILURE);
+   }
+   if (nwritten != sizeof(byte)) {
+   fprintf(stderr, "bogus write(2) return value %zd\n", nwritten);
+   exit(EXIT_FAILURE);
+   }
+}
+
+/* Receive one byte and check the return value.
+ *
+ * expected_ret:
+ *  <0 Negative errno (for testing errors)
+ *   0 End-of-file
+ *   1 Success
+ */
+void recv_byte(int fd, int expected_ret)
+{
+   uint8_t byte;
+   ssize_t nread;
+
+   timeout_begin(TIMEOUT);
+   do {
+   nread = read(fd, , sizeof(byte));
+   timeout_check("read");
+   } while (nread < 0 && errno == EINTR);
+   timeout_end();
+
+   if (expected_ret < 0) {
+   if (nread != -1) {
+   fprintf(stderr, "bogus read(2) return value %zd\n",
+   nread);
+   exit(EXIT_FAILURE);
+   }
+   if (errno != -expected_ret) {
+   perror("read");
+   exit(EXIT_FAILURE);
+   }
+   return;
+   }
+
+   if (nread < 0) {
+   perror("read");
+   exit(EXIT_FAILURE);
+   }
+   if (nread == 0) {
+   if (expected_ret == 0)
+   return;
+
+   fprintf(stderr, "unexpected EOF while receiving byte\n");
+   exit(EXIT_FAILURE);
+   }
+   if (nread != sizeof(byte)) {
+   fprintf(stderr, "bogus read(2) return value %zd\n", nread);
+   exit(EXIT_FAILURE);
+   }
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
   const struct test_opts *opts)
-- 
2.14.3



[PATCH 3/5] VSOCK: add full barrier between test cases

2017-12-13 Thread Stefan Hajnoczi
See code comment for details.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 tools/testing/vsock/util.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 75a78f295b37..2be923fe9922 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -165,10 +165,24 @@ void run_tests(const struct test_case *test_cases,
printf("%s...", test_cases[i].name);
fflush(stdout);
 
-   if (opts->mode == TEST_MODE_CLIENT)
+   if (opts->mode == TEST_MODE_CLIENT) {
+   /* Full barrier before executing the next test.  This
+* ensures that client and server are executing the
+* same test case.  In particular, it means whoever is
+* faster will not see the peer still executing the
+* last test.  This is important because port numbers
+* can be used by multiple test cases.
+*/
+   control_expectln("NEXT");
+   control_writeln("NEXT");
+
run = test_cases[i].run_client;
-   else
+   } else {
+   control_writeln("NEXT");
+   control_expectln("NEXT");
+
run = test_cases[i].run_server;
+   }
 
if (run)
run(opts);
-- 
2.14.3



[PATCH 5/5] VSOCK: add AF_VSOCK test cases

2017-12-13 Thread Stefan Hajnoczi
The vsock_test.c program runs a test suite of AF_VSOCK test cases.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 tools/testing/vsock/Makefile |   5 +-
 tools/testing/vsock/vsock_test.c | 321 +++
 tools/testing/vsock/.gitignore   |   1 +
 tools/testing/vsock/README   |   1 +
 4 files changed, 326 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/vsock/vsock_test.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index cf0650007fa8..9efb9010df2e 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,9 +1,10 @@
 all: test
-test: vsock_diag_test
+test: vsock_test vsock_diag_test
+vsock_test: vsock_test.o timeout.o control.o util.o
 vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
 
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
 clean:
-   ${RM} *.o *.d vsock_diag_test
+   ${RM} *.o *.d vsock_test vsock_diag_test
 -include *.d
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
new file mode 100644
index ..c1fc376eea00
--- /dev/null
+++ b/tools/testing/vsock/vsock_test.c
@@ -0,0 +1,321 @@
+/*
+ * vsock_test - vsock.ko test suite
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "timeout.h"
+#include "control.h"
+#include "util.h"
+
+static void test_stream_connection_reset(const struct test_opts *opts)
+{
+   union {
+   struct sockaddr sa;
+   struct sockaddr_vm svm;
+   } addr = {
+   .svm = {
+   .svm_family = AF_VSOCK,
+   .svm_port = 1234,
+   .svm_cid = opts->peer_cid,
+   },
+   };
+   int ret;
+   int fd;
+
+   fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+   timeout_begin(TIMEOUT);
+   do {
+   ret = connect(fd, , sizeof(addr.svm));
+   timeout_check("connect");
+   } while (ret < 0 && errno == EINTR);
+   timeout_end();
+
+   if (ret != -1) {
+   fprintf(stderr, "expected connect(2) failure, got %d\n", ret);
+   exit(EXIT_FAILURE);
+   }
+   if (errno != ECONNRESET) {
+   fprintf(stderr, "unexpected connect(2) errno %d\n", errno);
+   exit(EXIT_FAILURE);
+   }
+
+   close(fd);
+}
+
+static void test_stream_client_close_client(const struct test_opts *opts)
+{
+   int fd;
+
+   fd = vsock_stream_connect(opts->peer_cid, 1234);
+   if (fd < 0) {
+   perror("connect");
+   exit(EXIT_FAILURE);
+   }
+
+   send_byte(fd, 1);
+   close(fd);
+   control_writeln("CLOSED");
+}
+
+static void test_stream_client_close_server(const struct test_opts *opts)
+{
+   int fd;
+
+   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+   if (fd < 0) {
+   perror("accept");
+   exit(EXIT_FAILURE);
+   }
+
+   control_expectln("CLOSED");
+
+   send_byte(fd, -EPIPE);
+   recv_byte(fd, 1);
+   recv_byte(fd, 0);
+   close(fd);
+}
+
+static void test_stream_server_close_client(const struct test_opts *opts)
+{
+   int fd;
+
+   fd = vsock_stream_connect(opts->peer_cid, 1234);
+   if (fd < 0) {
+   perror("connect");
+   exit(EXIT_FAILURE);
+   }
+
+   control_expectln("CLOSED");
+
+   send_byte(fd, -EPIPE);
+   recv_byte(fd, 1);
+   recv_byte(fd, 0);
+   close(fd);
+}
+
+static void test_stream_server_close_server(const struct test_opts *opts)
+{
+   int fd;
+
+   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+   if (fd < 0) {
+   perror("accept");
+   exit(EXIT_FAILURE);
+   }
+
+   send_byte(fd, 1);
+   close(fd);
+   control_writeln("CLOSED");
+}
+
+#define MULTICONN_NFDS 1000
+
+static void test_stream_multiconn_client(const struct test_opts *opts)
+{
+   int fds[MULTICONN_NFDS];
+   int i;
+
+   for (i = 0; i < MULTICONN_NFDS; i++) {
+   fds[i] = vsock_stream_connect(opts->peer_cid, 1234);
+   if (fds[i] < 0) {
+   perror("conn

[PATCH 0/5] VSOCK: add vsock_test test suite

2017-12-13 Thread Stefan Hajnoczi
The vsock_diag.ko module already has a test suite but the core AF_VSOCK
functionality has no tests.  This patch series adds several test cases that
exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv, connect/accept,
half-closed connections, simultaneous connections).

The test suite is modest but I hope to cover additional cases in the future.
My goal is to have a shared test suite so VMCI, Hyper-V, and KVM can ensure
that our transports behave the same.

I have tested virtio-vsock.

Jorgen: Please test the VMCI transport and let me know if anything needs to be
adjusted.  See tools/testing/vsock/README for information on how to run the
test suite.

Dexuan: I'm not sure if this test suite is useful for the Hyper-V transport
since the host is Windows and uses a different API for AF_HYPERV?

Stefan Hajnoczi (5):
  VSOCK: extract utility functions from vsock_diag_test.c
  VSOCK: extract connect/accept functions from vsock_diag_test.c
  VSOCK: add full barrier between test cases
  VSOCK: add send_byte()/recv_byte() test utilities
  VSOCK: add AF_VSOCK test cases

 tools/testing/vsock/Makefile  |   7 +-
 tools/testing/vsock/util.h|  43 +
 tools/testing/vsock/util.c| 291 ++
 tools/testing/vsock/vsock_diag_test.c | 167 +++---
 tools/testing/vsock/vsock_test.c  | 321 ++
 tools/testing/vsock/.gitignore|   1 +
 tools/testing/vsock/README|   1 +
 7 files changed, 685 insertions(+), 146 deletions(-)
 create mode 100644 tools/testing/vsock/util.h
 create mode 100644 tools/testing/vsock/util.c
 create mode 100644 tools/testing/vsock/vsock_test.c

-- 
2.14.3



[PATCH 2/5] VSOCK: extract connect/accept functions from vsock_diag_test.c

2017-12-13 Thread Stefan Hajnoczi
Many test cases will need to connect to the server or accept incoming
connections.  This patch extracts these operations into utility
functions that can be reused.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 tools/testing/vsock/util.h|   6 ++
 tools/testing/vsock/util.c| 108 ++
 tools/testing/vsock/vsock_diag_test.c |  81 ++---
 3 files changed, 119 insertions(+), 76 deletions(-)

diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index a51e17578ccf..3e6971f15caa 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -1,6 +1,9 @@
 #ifndef UTIL_H
 #define UTIL_H
 
+#include 
+#include "../../../include/uapi/linux/vm_sockets.h"
+
 /* Tests can either run as the client or the server */
 enum test_mode {
TEST_MODE_UNSET,
@@ -29,6 +32,9 @@ struct test_case {
 
 void init_signals(void);
 unsigned int parse_cid(const char *str);
+int vsock_stream_connect(unsigned int cid, unsigned int port);
+int vsock_stream_accept(unsigned int cid, unsigned int port,
+   struct sockaddr_vm *clientaddrp);
 void run_tests(const struct test_case *test_cases,
   const struct test_opts *opts);
 
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2567abd90153..75a78f295b37 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -15,8 +15,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "timeout.h"
+#include "control.h"
 #include "util.h"
 
 /* Install signal handlers */
@@ -45,6 +47,112 @@ unsigned int parse_cid(const char *str)
return n;
 }
 
+/* Connect to <cid, port> and return the file descriptor. */
+int vsock_stream_connect(unsigned int cid, unsigned int port)
+{
+   union {
+   struct sockaddr sa;
+   struct sockaddr_vm svm;
+   } addr = {
+   .svm = {
+   .svm_family = AF_VSOCK,
+   .svm_port = port,
+   .svm_cid = cid,
+   },
+   };
+   int ret;
+   int fd;
+
+   control_expectln("LISTENING");
+
+   fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+   timeout_begin(TIMEOUT);
+   do {
+   ret = connect(fd, , sizeof(addr.svm));
+   timeout_check("connect");
+   } while (ret < 0 && errno == EINTR);
+   timeout_end();
+
+   if (ret < 0) {
+   int old_errno = errno;
+
+   close(fd);
+   fd = -1;
+   errno = old_errno;
+   }
+   return fd;
+}
+
+/* Listen on <cid, port> and return the first incoming connection.  The remote
+ * address is stored to clientaddrp.  clientaddrp may be NULL.
+ */
+int vsock_stream_accept(unsigned int cid, unsigned int port,
+   struct sockaddr_vm *clientaddrp)
+{
+   union {
+   struct sockaddr sa;
+   struct sockaddr_vm svm;
+   } addr = {
+   .svm = {
+   .svm_family = AF_VSOCK,
+   .svm_port = port,
+   .svm_cid = cid,
+   },
+   };
+   union {
+   struct sockaddr sa;
+   struct sockaddr_vm svm;
+   } clientaddr;
+   socklen_t clientaddr_len = sizeof(clientaddr.svm);
+   int fd;
+   int client_fd;
+   int old_errno;
+
+   fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+   if (bind(fd, , sizeof(addr.svm)) < 0) {
+   perror("bind");
+   exit(EXIT_FAILURE);
+   }
+
+   if (listen(fd, 1) < 0) {
+   perror("listen");
+   exit(EXIT_FAILURE);
+   }
+
+   control_writeln("LISTENING");
+
+   timeout_begin(TIMEOUT);
+   do {
+   client_fd = accept(fd, , _len);
+   timeout_check("accept");
+   } while (client_fd < 0 && errno == EINTR);
+   timeout_end();
+
+   old_errno = errno;
+   close(fd);
+   errno = old_errno;
+
+   if (client_fd < 0)
+   return client_fd;
+
+   if (clientaddr_len != sizeof(clientaddr.svm)) {
+   fprintf(stderr, "unexpected addrlen from accept(2), %zu\n",
+   (size_t)clientaddr_len);
+   exit(EXIT_FAILURE);
+   }
+   if (clientaddr.sa.sa_family != AF_VSOCK) {
+   fprintf(stderr, "expected AF_VSOCK from accept(2), got %d\n",
+   clientaddr.sa.sa_family);
+   exit(EXIT_FAILURE);
+   }
+
+   if (clientaddrp)
+   *clientaddrp = clientaddr.svm;
+   return client_fd;
+}
+
 /* Run test cases.  The program terminates if a failure occurs. */
 void run_tests(const struct test_case *test_cases,
   const struct te

AF_VSOCK connection refused errno

2017-12-12 Thread Stefan Hajnoczi
When connect(2) fails because the peer is not listening the virtio vsock
transport returns ECONNRESET.  I believe the VMCI transport does the
same (based on code inspection).

Jorgen: Can you confirm this VMCI transport behavior?

I'd like to change to ECONNREFUSED for all transports because developers
will be surprised when they get ECONNRESET.  It makes porting AF_INET
code harder.

On the other hand, it may be too late to fix this if there userspace
applications that rely on ECONNRESET?  I'm not aware of any such
applications myself.

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-12 Thread Stefan Hajnoczi
On Mon, Dec 11, 2017 at 08:32:20PM +0100, Michael Kerrisk (man-pages) wrote:
> On 12/05/2017 11:56 AM, Stefan Hajnoczi wrote:
> > +Data is transferred using the usual
> > +.BR send (2)
> > +and
> > +.BR recv (2)
> 
> Or equally, write(2) and read(2), right? By failing to mention those, the
> text subtly implies that send(2) and recv(2) are preferred, but I don't
> suppose that is true.
> 
> > +family of socket system calls.

Yes, this file descriptor is a socket so write(2) and read(2) work.

I said "family of socket system calls" to avoid listing all the
variations of send(2), sendmsg(2), sendfile(2), sendmmsg(2), etc but I
guess that doesn't include the read(2)/write(2) family of syscalls
(readv(2)/writev(2)).

Will send a follow-up patch to clarify this.

> > +.B EOPNOTSUPP
> > +Operation not supported.  This includes:
> > +the
> > +.B MSG_OOB
> > +flag that is not implemented for
> > +.BR sendmsg (2)
> > +and
> > +.B MSG_PEEK
> > +for
> > +.BR recvmsg (2).
> 
> So these errors might also occur for send() and recv(), right?

Yes, I'll change this to "the send(2) family of syscalls" and "recv(2)
family of syscalls", respectively.


signature.asc
Description: PGP signature


[PATCH] VSOCK: fix outdated sk_state value in hvs_release()

2017-12-05 Thread Stefan Hajnoczi
Since commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 ("VSOCK: use TCP
state constants for sk_state") VSOCK has used TCP_* constants for
sk_state.

Commit b4562ca7925a3bedada87a3dd072dd5bad043288 ("hv_sock: add locking
in the open/close/release code paths") reintroduced the SS_DISCONNECTING
constant.

This patch replaces the old SS_DISCONNECTING with the new TCP_CLOSING
constant.

CC: Dexuan Cui <de...@microsoft.com>
CC: Cathy Avery <cav...@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 net/vmw_vsock/hyperv_transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 5583df708b8c..a827547aa102 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -487,7 +487,7 @@ static void hvs_release(struct vsock_sock *vsk)
 
lock_sock(sk);
 
-   sk->sk_state = SS_DISCONNECTING;
+   sk->sk_state = TCP_CLOSING;
vsock_remove_sock(vsk);
 
release_sock(sk);
-- 
2.14.3



[PATCH v2] vsock.7: document VSOCK socket address family

2017-12-05 Thread Stefan Hajnoczi
The AF_VSOCK address family has been available since Linux 3.9 without a
corresponding man page.

This patch adds vsock.7 and describes its use along the same lines as
existing ip.7, unix.7, and netlink.7 man pages.

CC: Jorgen Hansen <jhan...@vmware.com>
CC: Dexuan Cui <de...@microsoft.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 man7/vsock.7 | 180 +++
 1 file changed, 180 insertions(+)
 create mode 100644 man7/vsock.7

diff --git a/man7/vsock.7 b/man7/vsock.7
new file mode 100644
index 0..46dc561f5
--- /dev/null
+++ b/man7/vsock.7
@@ -0,0 +1,180 @@
+.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
+.SH NAME
+vsock \- Linux VSOCK address family
+.SH SYNOPSIS
+.B #include 
+.br
+.B #include 
+.PP
+.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
+.br
+.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
+.SH DESCRIPTION
+The VSOCK address family facilitates communication between virtual machines and
+the host they are running on.  This address family is used by guest agents and
+hypervisor services that need a communications channel that is independent of
+virtual machine network configuration.
+.PP
+Valid socket types are
+.B SOCK_STREAM
+and
+.BR SOCK_DGRAM .
+.B SOCK_STREAM
+provides connection-oriented byte streams with guaranteed, in-order delivery.
+.B SOCK_DGRAM
+provides a connectionless datagram packet service with best-effort delivery and
+best-effort ordering.  Availability of these socket types is dependent on the
+underlying hypervisor.
+.PP
+A new socket is created with
+.PP
+socket(AF_VSOCK, socket_type, 0);
+.PP
+When a process wants to establish a connection it calls
+.BR connect (2)
+with a given destination socket address.  The socket is automatically bound to
+a free port if unbound.
+.PP
+A process can listen for incoming connections by first binding to a socket
+address using
+.BR bind (2)
+and then calling
+.BR listen (2).
+.PP
+Data is transferred using the usual
+.BR send (2)
+and
+.BR recv (2)
+family of socket system calls.
+.SS Address format
+A socket address is defined as a combination of a 32-bit Context Identifier
+(CID) and a 32-bit port number.  The CID identifies the source or destination,
+which is either a virtual machine or the host.  The port number differentiates
+between multiple services running on a single machine.
+.PP
+.in +4n
+.EX
+struct sockaddr_vm {
+sa_family_t svm_family; /* address family: AF_VSOCK */
+unsigned short  svm_reserved1;
+unsigned intsvm_port;   /* port in native byte order */
+unsigned intsvm_cid;/* address in native byte order */
+};
+.EE
+.in
+.PP
+.I svm_family
+is always set to
+.BR AF_VSOCK .
+.I svm_reserved1
+is always set to 0.
+.I svm_port
+contains the port in native byte order.
+The port numbers below 1024 are called
+.IR "privileged ports" .
+Only a process with
+.B CAP_NET_BIND_SERVER
+capability may
+.BR bind (2)
+to these port numbers.
+.PP
+There are several special addresses:
+.B VMADDR_CID_ANY
+(-1U)
+means any address for binding;
+.B VMADDR_CID_HYPERVISOR
+(0) is reserved for services built into the hypervisor;
+.B VMADDR_CID_RESERVED
+(1) must not be used;
+.B VMADDR_CID_HOST
+(2)
+is the well-known address of the host.
+.PP
+The special constant
+.B VMADDR_PORT_ANY
+(-1U)
+means any port number for binding.
+.SS Live migration
+Sockets are affected by live migration of virtual machines.  Connected
+.B SOCK_STREAM
+sockets become disconnected when the virtual machine migrates to a new host.
+Applications must reconnect when this happens.
+.PP
+The local CID may change across live migration if the old CID is not available
+on the new host.  Bound sockets are automatically updated to the new CID.
+.SS Ioctls
+.TP
+.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
+Get the CID of the local machine.  The argument is a pointer to an unsigned 
int.
+.IP
+.in +4n
+.EX
+.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  ");"
+.EE
+.in
+.IP
+Consider using
+.B VMADDR_CID_ANY
+when binding instead of getting the local CID with
+.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
+.SH ERRORS
+.TP
+.B EACCES
+Unable to bind to a privileged port without the
+.B CAP_NET_BIND_SERVICE
+capability.
+.TP
+.B EINVAL
+Invalid parameters.  This includes:
+attempting to bind a socket that is already bound, providing an invalid struct
+.BR sockaddr_vm ,
+and other input validation errors.
+.TP
+.B EOPNOTSUPP
+Operation not supported.  This includes:
+the
+.B MSG_OOB
+flag that is not implemented for
+.BR sendmsg (2)
+and
+.B MSG_PEEK
+for
+.BR recvmsg (2).
+.TP
+.B EADDRINUSE
+Unable to bind to a port that is already in use.
+.TP
+.B EADDRNOTAVAIL
+Unable to find a free port for binding or unable to bind to a non-local CID.
+.TP
+.B ENOTCONN
+Unable to perform operation on an unconnected 

Re: [PATCH] vsock.7: document VSOCK socket address family

2017-12-02 Thread Stefan Hajnoczi
On Fri, Dec 01, 2017 at 09:57:04AM -0500, G. Branden Robinson wrote:
> At 2017-12-01T13:09:01+0000, Stefan Hajnoczi wrote:
> > On Thu, Nov 30, 2017 at 01:21:26PM +, Jorgen S. Hansen wrote:
> > > > On Nov 30, 2017, at 12:21 PM, Stefan Hajnoczi <stefa...@redhat.com> 
> > > > wrote:
> > 
> > Thanks for the quick review!
> > 
> > I forgot to ask you: Is SOCK_DGRAM reliable and in-order over VMCI?
> > 
> > > > +.PP
> > > > +Valid socket types are
> > > > +.B SOCK_STREAM
> > > > +and
> > > > +.B SOCK_DGRAM .
> > > 
> > > The space here results in a space between SOCK_DGRAM and the “.” in the 
> > > formatted text. Is that intentional?
> > 
> > I haven't figured out the groff syntax to avoid the space :(.  Any
> > ideas?
> 
> What you want is the .BR macro.
> 
> .BR SOCK_DGRAM .
> 
> The man macro package defines six "two-font" macros for switching
> between roman, bold, and italic faces without intervening space.
> 
> See man(7) and groff_man(7).

Excellent, thank you!

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] vsock.7: document VSOCK socket address family

2017-12-01 Thread Stefan Hajnoczi
On Thu, Nov 30, 2017 at 01:21:26PM +, Jorgen S. Hansen wrote:
> > On Nov 30, 2017, at 12:21 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:

Thanks for the quick review!

I forgot to ask you: Is SOCK_DGRAM reliable and in-order over VMCI?

> > +.PP
> > +Valid socket types are
> > +.B SOCK_STREAM
> > +and
> > +.B SOCK_DGRAM .
> 
> The space here results in a space between SOCK_DGRAM and the “.” in the 
> formatted text. Is that intentional?

I haven't figured out the groff syntax to avoid the space :(.  Any
ideas?

> > +.PP
> > +There are several special addresses:
> > +.B VMADDR_CID_ANY
> > +(-1U)
> > +means any address for binding;
> > +.B VMADDR_CID_HYPERVISOR
> 
> We use VMADDR_CID_HYPERVISOR for communicating with services in the 
> hypervisor, so you could describe this as “an address reserved for services 
> built into the hypervisor”.

Will fix.

> > +.SH VERSIONS
> > +Support for VMware has been available since Linux 3.9.  KVM (virtio) is
> 
> VMware (VMCI) for clarity.

Will fix.

> Apart from the nits, this looks great.

Please let me know if there are any other topics that we should cover in
the man page.


signature.asc
Description: PGP signature


[PATCH] vsock.7: document VSOCK socket address family

2017-11-30 Thread Stefan Hajnoczi
The AF_VSOCK address family has been available since Linux 3.9 without a
corresponding man page.

This patch adds vsock.7 and describes its use along the same lines as
existing ip.7, unix.7, and netlink.7 man pages.

CC: Jorgen Hansen <jhan...@vmware.com>
CC: Dexuan Cui <de...@microsoft.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 man7/vsock.7 | 175 +++
 1 file changed, 175 insertions(+)
 create mode 100644 man7/vsock.7

diff --git a/man7/vsock.7 b/man7/vsock.7
new file mode 100644
index 0..48c6c2e1e
--- /dev/null
+++ b/man7/vsock.7
@@ -0,0 +1,175 @@
+.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
+.SH NAME
+vsock \- Linux VSOCK address family
+.SH SYNOPSIS
+.B #include 
+.br
+.B #include 
+.PP
+.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
+.br
+.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
+.SH DESCRIPTION
+The VSOCK address family facilitates communication between virtual machines and
+the host they are running on.  This address family is used by guest agents and
+hypervisor services that need a communications channel that is independent of
+virtual machine network configuration.
+.PP
+Valid socket types are
+.B SOCK_STREAM
+and
+.B SOCK_DGRAM .
+.B SOCK_STREAM
+provides connection-oriented byte streams with guaranteed, in-order delivery.
+.B SOCK_DGRAM
+provides a connectionless datagram packet service.  Availability of these
+socket types is dependent on the underlying hypervisor.
+.PP
+A new socket is created with
+.PP
+socket(AF_VSOCK, socket_type, 0);
+.PP
+When a process wants to establish a connection it calls
+.BR connect (2)
+with a given destination socket address.  The socket is automatically bound to
+a free port if unbound.
+.PP
+A process can listen for incoming connections by first binding to a socket 
address using
+.BR bind (2)
+and then calling
+.BR listen (2).
+.PP
+Data is transferred using the usual
+.BR send (2)
+and
+.BR recv (2)
+family of socket system calls.
+.SS Address format
+A socket address is defined as a combination of a 32-bit Context Identifier 
(CID) and a 32-bit port number.  The CID identifies the source or destination, 
which is either a virtual machine or the host.  The port number differentiates 
between multiple services running on a single machine.
+.PP
+.in +4n
+.EX
+struct sockaddr_vm {
+sa_family_t svm_family; /* address family: AF_VSOCK */
+unsigned short  svm_reserved1;
+unsigned intsvm_port;   /* port in native byte order */
+unsigned intsvm_cid;/* address in native byte order */
+};
+.EE
+.in
+.PP
+.I svm_family
+is always set to
+.BR AF_VSOCK .
+.I svm_reserved1
+is always set to 0.
+.I svm_port
+contains the port in native byte order.
+The port numbers below 1024 are called
+.IR "privileged ports" .
+Only a process with
+.B CAP_NET_BIND_SERVER
+capability may
+.BR bind (2)
+to these port numbers.
+.PP
+There are several special addresses:
+.B VMADDR_CID_ANY
+(-1U)
+means any address for binding;
+.B VMADDR_CID_HYPERVISOR
+(0) and
+.B VMADDR_CID_RESERVED
+(1) are unused addresses;
+.B VMADDR_CID_HOST
+(2)
+is the well-known address of the host.
+.PP
+The special constant
+.B VMADDR_PORT_ANY
+(-1U)
+means any port number for binding.
+.SS Live migration
+Sockets are affected by live migration of virtual machines.  Connected
+.B SOCK_STREAM
+sockets become disconnected when the virtual machine migrates to a new host.
+Applications must reconnect when this happens.
+.PP
+The local CID may change across live migration if the old CID is not available
+on the new host.  Bound sockets are automatically updated to the new CID.
+.SS Ioctls
+.TP
+.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
+Get the CID of the local machine.  The argument is a pointer to an unsigned 
int.
+.IP
+.in +4n
+.EX
+.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  ");"
+.EE
+.in
+.IP
+Consider using
+.B VMADDR_CID_ANY
+when binding instead of getting the local CID with
+.B IOCTL_VM_SOCKETS_GET_LOCAL_CID .
+.SH ERRORS
+.TP
+.B EACCES
+Unable to bind to a privileged port without the
+.B CAP_NET_BIND_SERVICE
+capability.
+.TP
+.B EINVAL
+Invalid parameters.  This includes:
+attempting to bind a socket that is already bound, providing an invalid struct
+.B sockaddr_vm ,
+and other input validation errors.
+.TP
+.B EOPNOTSUPP
+Operation not supported.  This includes:
+the
+.B MSG_OOB
+flag that is not implemented for
+.B sendmsg (2)
+and
+.B MSG_PEEK
+for
+.B recvmsg (2).
+.TP
+.B EADDRINUSE
+Unable to bind to a port that is already in use.
+.TP
+.B EADDRNOTAVAIL
+Unable to find a free port for binding or unable to bind to a non-local CID.
+.TP
+.B ENOTCONN
+Unable to perform operation on an unconnected socket.
+.TP
+.B ENOPROTOOPT
+Invalid socket option in
+.B setsockopt (2)
+or
+.B getsockopt (2).
+.TP

Re: [PATCH] VSOCK: Don't set sk_state to TCP_CLOSE before testing it

2017-11-28 Thread Stefan Hajnoczi
On Mon, Nov 27, 2017 at 05:29:32AM -0800, Jorgen Hansen wrote:
> A recent commit (3b4477d2dcf2) converted the sk_state to use
> TCP constants. In that change, vmci_transport_handle_detach
> was changed such that sk->sk_state was set to TCP_CLOSE before
> we test whether it is TCP_SYN_SENT. This change moves the
> sk_state change back to the original locations in that function.
> 
> Signed-off-by: Jorgen Hansen <jhan...@vmware.com>
> ---
>  net/vmw_vsock/vmci_transport.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Sorry, silly bug on my part!

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


Re: [PATCH] uapi: add SPDX identifier to vm_sockets_diag.h

2017-11-24 Thread Stefan Hajnoczi
On Fri, Nov 24, 2017 at 8:08 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> New file seems to have missed the SPDX license scan and update.
>
> Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
> ---
>  include/uapi/linux/vm_sockets_diag.h | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


Re: [PATCH] VSOCK: Don't call vsock_stream_has_data in atomic context

2017-11-23 Thread Stefan Hajnoczi
On Tue, Nov 21, 2017 at 10:46 AM, Jorgen Hansen <jhan...@vmware.com> wrote:
> When using the host personality, VMCI will grab a mutex for any
> queue pair access. In the detach callback for the vmci vsock
> transport, we call vsock_stream_has_data while holding a spinlock,
> and vsock_stream_has_data will access a queue pair.
>
> To avoid this, we can simply omit calling vsock_stream_has_data
> for host side queue pairs, since the QPs are empty per default
> when the guest has detached.
>
> This bug affects users of VMware Workstation using kernel version
> 4.4 and later.
>
> Testing: Ran vsock tests between guest and host, and verified that
> with this change, the host isn't calling vsock_stream_has_data
> during detach. Ran mixedTest between guest and host using both
> guest and host as server.
>
> Reviewed-by: Adit Ranadive <ad...@vmware.com>
> Reviewed-by: Aditya Sarwade <asarw...@vmware.com>
> Signed-off-by: Jorgen Hansen <jhan...@vmware.com>
> ---
>  net/vmw_vsock/vmci_transport.c |   10 +++---
>  1 files changed, 7 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


Re: [Patch net-next] vsock: always call vsock_init_tables()

2017-10-26 Thread Stefan Hajnoczi
On Tue, Oct 24, 2017 at 03:30:37PM -0700, Cong Wang wrote:
> Although CONFIG_VSOCKETS_DIAG depends on CONFIG_VSOCKETS,
> vsock_init_tables() is not always called, it is called only
> if other modules call its caller. Therefore if we only
> enable CONFIG_VSOCKETS_DIAG, it would crash kernel on uninitialized
> vsock_bind_table.
> 
> This patch fixes it by moving vsock_init_tables() to its own
> module_init().
> 
> Fixes: 413a4317aca7 ("VSOCK: add sock_diag interface")
> Reported-by: syzkaller bot
> Cc: Stefan Hajnoczi <stefa...@redhat.com>
> Cc: Jorgen Hansen <jhan...@vmware.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---
>  net/vmw_vsock/af_vsock.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Thank you, Cong Wang!  I was just trying to figure out the syzkaller bug
report and you've already fixed it :).  The patch looks good.

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


Re: BUG: unable to handle kernel paging request in vsock_diag_dump

2017-10-26 Thread Stefan Hajnoczi
On Tue, Oct 24, 2017 at 08:14:01AM -0700, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 28f50eb20931f32a2ceeb6aba8fa2cd5ca96ad9f
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> IP: read_pnet include/net/net_namespace.h:269 [inline]
> IP: sock_net include/net/sock.h:2299 [inline]
> IP: vsock_diag_dump+0x290/0xa80 net/vmw_vsock/diag.c:87

Feel free to ignore my previous reply, Cong Wang
 has already found the root cause.  Thank you!

Stefan


Re: BUG: unable to handle kernel paging request in vsock_diag_dump

2017-10-26 Thread Stefan Hajnoczi
On Tue, Oct 24, 2017 at 08:14:01AM -0700, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 28f50eb20931f32a2ceeb6aba8fa2cd5ca96ad9f
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> IP: read_pnet include/net/net_namespace.h:269 [inline]
> IP: sock_net include/net/sock.h:2299 [inline]
> IP: vsock_diag_dump+0x290/0xa80 net/vmw_vsock/diag.c:87

[   41.303353] BUG: unable to handle kernel paging request at fb70

Unfortunately the reproducer doesn't hit the issue on my machine (even
after multiple runs).

Is it possible to get the exact net/vmw_vsock/vsock_diag.ko file used in
this syzkaller run?  The machine code snippet below doesn't contain
enough context for me to understand the issue, and seeing the whole
module might help.

Thanks,
Stefan

> PGD 21fff4067 P4D 21fff4067 PUD 21fff3067 PMD 21fff2067 PTE 0
> Oops:  [#1] SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 2986 Comm: syzkaller165749 Not tainted 4.14.0-rc3+ #76
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> task: 8801ceaaa1c0 task.stack: 8801ce0e
> RIP: 0010:read_pnet include/net/net_namespace.h:269 [inline]
> RIP: 0010:sock_net include/net/sock.h:2299 [inline]
> RIP: 0010:vsock_diag_dump+0x290/0xa80 net/vmw_vsock/diag.c:87
> RSP: 0018:8801ce0e73d0 EFLAGS: 00010a02
> RAX: 1f70 RBX: fb50 RCX: 
> RDX:  RSI: 83f09694 RDI: fb80
> RBP: 8801ce0e74e0 R08: 8801ce0e6d20 R09: 8801ce0e6d18
> R10: 8801ce0e6c90 R11: 110039d556cb R12: dc00
> R13:  R14: 87962d20 R15: 8801ce0e74b8
> FS:  0152a880() GS:8801db30() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: fb70 CR3: 0001ce70f000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  netlink_dump+0x46b/0xb90 net/netlink/af_netlink.c:2186
>  __netlink_dump_start+0x4e9/0x710 net/netlink/af_netlink.c:2278
>  netlink_dump_start include/linux/netlink.h:216 [inline]
>  vsock_diag_handler_dump+0x206/0x2c0 net/vmw_vsock/diag.c:161
>  __sock_diag_cmd net/core/sock_diag.c:231 [inline]
>  sock_diag_rcv_msg+0x204/0x390 net/core/sock_diag.c:263
>  netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2408
>  sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274
>  netlink_unicast_kernel net/netlink/af_netlink.c:1273 [inline]
>  netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1299
>  netlink_sendmsg+0xa4a/0xe70 net/netlink/af_netlink.c:1862
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  sock_write_iter+0x320/0x5e0 net/socket.c:912
>  call_write_iter include/linux/fs.h:1770 [inline]
>  do_iter_readv_writev+0x531/0x7f0 fs/read_write.c:673
>  do_iter_write+0x15a/0x540 fs/read_write.c:952
>  vfs_writev+0x18a/0x340 fs/read_write.c:997
>  do_writev+0xfc/0x2a0 fs/read_write.c:1032
>  SYSC_writev fs/read_write.c:1105 [inline]
>  SyS_writev+0x27/0x30 fs/read_write.c:1102
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x43fd69
> RSP: 002b:7fff5da11d98 EFLAGS: 0203 ORIG_RAX: 0014
> RAX: ffda RBX: 004002c8 RCX: 0043fd69
> RDX: 0001 RSI: 206a3ff0 RDI: 0003
> RBP: 0086 R08:  R09: 
> R10:  R11: 0203 R12: 004016d0
> R13: 00401760 R14:  R15: 
> Code: 69 07 00 00 48 8b 83 b0 04 00 00 49 39 c6 48 8d 98 50 fb ff ff 0f 84
> ce 03 00 00 e8 bb a2 9f fc 48 8d 7b 30 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00
> 0f 85 2d 07 00 00 48 8b 85 70 ff ff ff 48 3b 43
> RIP: read_pnet include/net/net_namespace.h:269 [inline] RSP:
> 8801ce0e73d0
> RIP: sock_net include/net/sock.h:2299 [inline] RSP: 8801ce0e73d0
> RIP: vsock_diag_dump+0x290/0xa80 net/vmw_vsock/diag.c:87 RSP:
> 8801ce0e73d0
> CR2: fb70
> ---[ end trace 9852ce58edddbce7 ]---
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> Once a fix for this bug is committed, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens 

[PATCH iproute2 v2 0/3] ss: add AF_VSOCK support

2017-10-06 Thread Stefan Hajnoczi
v2:
 * Use uint64_t instead of __u64 for filter->families
 * Added reference to net-next commit that merged vsock_diag.ko

This patch series adds AF_VSOCK support to ss(8).  AF_VSOCK is a host<->guest
communications channel supported by VMware, KVM (virtio-vsock), and Hyper-V.

To dump AF_VSOCK sockets:

  $ ss --vsock

The vsock_diag.ko module has now been merged in the Linux net-next tree.  I
have verified that the  header copy in this patch
series is in sync with Linux net-next.  See commit
5820299a271fd3dc9b1733e1e10cd7b983edd028 ("Merge branch 'VSOCK-sock_diag'").

Stefan Hajnoczi (3):
  ss: allow AF_FAMILY constants >32
  include: add 
  ss: add AF_VSOCK support

 include/linux/vm_sockets_diag.h |  33 ++
 misc/ss.c   | 238 +++-
 man/man8/ss.8   |   8 +-
 3 files changed, 249 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/vm_sockets_diag.h

-- 
2.13.6



[PATCH iproute2 v2 2/3] include: add

2017-10-06 Thread Stefan Hajnoczi
This new Linux header file defines the sock_diag interface used by
AF_VSOCK.  This new header file was merged in net-next in commit
413a4317aca7d6367d57a5971b0c461f03851207 ("VSOCK: add sock_diag
interface").

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/linux/vm_sockets_diag.h | 33 +
 1 file changed, 33 insertions(+)
 create mode 100644 include/linux/vm_sockets_diag.h

diff --git a/include/linux/vm_sockets_diag.h b/include/linux/vm_sockets_diag.h
new file mode 100644
index ..14cd7dc5
--- /dev/null
+++ b/include/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include 
+
+/* Request */
+struct vsock_diag_req {
+   __u8sdiag_family;   /* must be AF_VSOCK */
+   __u8sdiag_protocol; /* must be 0 */
+   __u16   pad;/* must be 0 */
+   __u32   vdiag_states;   /* query bitmap (e.g. 1 << TCP_LISTEN) */
+   __u32   vdiag_ino;  /* must be 0 (reserved) */
+   __u32   vdiag_show; /* must be 0 (reserved) */
+   __u32   vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+   __u8vdiag_family;   /* AF_VSOCK */
+   __u8vdiag_type; /* SOCK_STREAM or SOCK_DGRAM */
+   __u8vdiag_state;/* sk_state (e.g. TCP_LISTEN) */
+   __u8vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+   __u32   vdiag_src_cid;
+   __u32   vdiag_src_port;
+   __u32   vdiag_dst_cid;
+   __u32   vdiag_dst_port;
+   __u32   vdiag_ino;
+   __u32   vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
-- 
2.13.6



[PATCH iproute2 v2 1/3] ss: allow AF_FAMILY constants >32

2017-10-06 Thread Stefan Hajnoczi
Linux has more than 32 address families defined in .  Use
a 64-bit type so all of them can be represented in the filter->families
bitmask.

It's easy to introduce bugs when using (1 << AF_FAMILY) because the
value is 32-bit.  This can produce incorrect results from bitmask
operations so introduce the FAMILY_MASK() macro to eliminate these bugs.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 misc/ss.c | 54 --
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index dd8dfaa4..005e781d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -170,55 +170,57 @@ enum {
 struct filter {
int dbs;
int states;
-   int families;
+   uint64_t families;
struct ssfilter *f;
bool kill;
 };
 
+#define FAMILY_MASK(family) ((uint64_t)1 << (family))
+
 static const struct filter default_dbs[MAX_DB] = {
[TCP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[DCCP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[UDP_DB] = {
.states   = (1 << SS_ESTABLISHED),
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[RAW_DB] = {
.states   = (1 << SS_ESTABLISHED),
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[UNIX_DG_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[UNIX_ST_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[UNIX_SQ_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[PACKET_DG_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_PACKET),
+   .families = FAMILY_MASK(AF_PACKET),
},
[PACKET_R_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_PACKET),
+   .families = FAMILY_MASK(AF_PACKET),
},
[NETLINK_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_NETLINK),
+   .families = FAMILY_MASK(AF_NETLINK),
},
[SCTP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
 };
 
@@ -258,14 +260,14 @@ static void filter_db_set(struct filter *f, int db)
 static void filter_af_set(struct filter *f, int af)
 {
f->states  |= default_afs[af].states;
-   f->families|= 1 << af;
+   f->families|= FAMILY_MASK(af);
do_default  = 0;
preferred_family= af;
 }
 
 static int filter_af_get(struct filter *f, int af)
 {
-   return f->families & (1 << af);
+   return !!(f->families & FAMILY_MASK(af));
 }
 
 static void filter_default_dbs(struct filter *f)
@@ -302,7 +304,7 @@ static void filter_merge_defaults(struct filter *f)
f->families |= default_dbs[db].families;
}
for (af = 0; af < AF_MAX; af++) {
-   if (!(f->families & (1 << af)))
+   if (!(f->families & FAMILY_MASK(af)))
continue;
 
if (!(default_afs[af].dbs & f->dbs))
@@ -2608,7 +2610,7 @@ static int show_one_inet_sock(const struct sockaddr_nl 
*addr,
struct inet_diag_msg *r = NLMSG_DATA(h);
struct sockstat s = {};
 
-   if (!(diag_arg->f->families & (1 << r->idiag_family)))
+   if (!(diag_arg->f->families & FAMILY_MASK(r->idiag_family)))
return 0;
 
parse_diag_msg(h, );
@@ -2802,7 +2804,7 @@ static int tcp_show(struct filter *f)
return -1;
}
 
-   if (f->families & (1<<AF_INET)) {
+   if (f->families & FAMILY_MASK(AF_INET)) {
if ((fp = net_tcp_open()) == NULL)
goto outerr;
 
@@ -2812,7 +2814,7 @@ static int tcp_show(struct filter *f)
fclose(fp);
}
 
-   if ((f->families &a

[PATCH iproute2 v2 3/3] ss: add AF_VSOCK support

2017-10-06 Thread Stefan Hajnoczi
The AF_VSOCK address family is a host<->guest communications channel
supported by VMware, KVM, and Hyper-V.  Initial VMware support was
released in Linux 3.9 in 2013 and transports for other hypervisors were
added later.

AF_VSOCK addresses are  tuples.  The 32-bit cid
integer is comparable to an IP address.  AF_VSOCK ports work like
TCP/UDP ports.

Both SOCK_STREAM and SOCK_DGRAM socket types are available.

This patch adds AF_VSOCK support to ss(8) so that sockets can be
observed.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 misc/ss.c | 184 +-
 man/man8/ss.8 |   8 ++-
 2 files changed, 188 insertions(+), 4 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 005e781d..8b563bd4 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define MAGIC_SEQ 123456
 
@@ -126,6 +127,8 @@ enum {
PACKET_R_DB,
NETLINK_DB,
SCTP_DB,
+   VSOCK_ST_DB,
+   VSOCK_DG_DB,
MAX_DB
 };
 
@@ -134,6 +137,7 @@ enum {
 #define ALL_DB ((1<<MAX_DB)-1)
 #define INET_L4_DBM ((1<<TCP_DB)|(1<<UDP_DB)|(1<<DCCP_DB)|(1<<SCTP_DB))
 #define INET_DBM (INET_L4_DBM | (1<<RAW_DB))
+#define VSOCK_DBM ((1<<VSOCK_ST_DB)|(1<<VSOCK_DG_DB))
 
 enum {
SS_UNKNOWN,
@@ -222,6 +226,14 @@ static const struct filter default_dbs[MAX_DB] = {
.states   = SS_CONN,
.families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
+   [VSOCK_ST_DB] = {
+   .states   = SS_CONN,
+   .families = FAMILY_MASK(AF_VSOCK),
+   },
+   [VSOCK_DG_DB] = {
+   .states   = SS_CONN,
+   .families = FAMILY_MASK(AF_VSOCK),
+   },
 };
 
 static const struct filter default_afs[AF_MAX] = {
@@ -245,6 +257,10 @@ static const struct filter default_afs[AF_MAX] = {
.dbs= (1 << NETLINK_DB),
.states = (1 << SS_CLOSE),
},
+   [AF_VSOCK] = {
+   .dbs= VSOCK_DBM,
+   .states = SS_CONN,
+   },
 };
 
 static int do_default = 1;
@@ -283,6 +299,8 @@ static void filter_default_dbs(struct filter *f)
filter_db_set(f, PACKET_DG_DB);
filter_db_set(f, NETLINK_DB);
filter_db_set(f, SCTP_DB);
+   filter_db_set(f, VSOCK_ST_DB);
+   filter_db_set(f, VSOCK_DG_DB);
 }
 
 static void filter_states_set(struct filter *f, int states)
@@ -791,6 +809,18 @@ static const char *proto_name(int protocol)
return "???";
 }
 
+static const char *vsock_netid_name(int type)
+{
+   switch (type) {
+   case SOCK_STREAM:
+   return "v_str";
+   case SOCK_DGRAM:
+   return "v_dgr";
+   default:
+   return "???";
+   }
+}
+
 static void sock_state_print(struct sockstat *s)
 {
const char *sock_name;
@@ -823,6 +853,9 @@ static void sock_state_print(struct sockstat *s)
case AF_NETLINK:
sock_name = "nl";
break;
+   case AF_VSOCK:
+   sock_name = vsock_netid_name(s->type);
+   break;
default:
sock_name = "unknown";
}
@@ -1149,6 +1182,8 @@ static int run_ssfilter(struct ssfilter *f, struct 
sockstat *s)
return s->lport == 0 && s->local.data[0] == 0;
if (s->local.family == AF_NETLINK)
return s->lport < 0;
+   if (s->local.family == AF_VSOCK)
+   return s->lport > 1023;
 
return is_ephemeral(s->lport);
}
@@ -1524,6 +1559,15 @@ void *parse_devcond(char *name)
return res;
 }
 
+static void vsock_set_inet_prefix(inet_prefix *a, __u32 cid)
+{
+   *a = (inet_prefix){
+   .bytelen = sizeof(cid),
+   .family = AF_VSOCK,
+   };
+   memcpy(a->data, , sizeof(cid));
+}
+
 void *parse_hostcond(char *addr, bool is_port)
 {
char *port = NULL;
@@ -1598,6 +1642,37 @@ void *parse_hostcond(char *addr, bool is_port)
goto out;
}
 
+   if (fam == AF_VSOCK || strncmp(addr, "vsock:", 6) == 0) {
+   __u32 cid = ~(__u32)0;
+
+   a.addr.family = AF_VSOCK;
+   if (strncmp(addr, "vsock:", 6) == 0)
+   addr += 6;
+
+   if (is_port)
+   port = addr;
+   else {
+   port = strchr(addr, ':');
+   if (port) {
+   *port = '\0';
+   port++;
+   }
+   }
+
+   if (port && strcmp(port, "*") &&
+   get_u32((__u32 *), port, 0))
+  

[PATCH v3 5/5] VSOCK: add tools/testing/vsock/vsock_diag_test

2017-10-05 Thread Stefan Hajnoczi
This patch adds tests for the vsock_diag.ko module.

These tests are not self-tests because they require manual set up of a
KVM or VMware guest.  Please see tools/testing/vsock/README for
instructions.

The control.h and timeout.h infrastructure can be used for additional
AF_VSOCK tests in the future.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS   |   1 +
 tools/testing/vsock/Makefile  |   9 +
 tools/testing/vsock/control.h |  13 +
 tools/testing/vsock/timeout.h |  14 +
 tools/testing/vsock/control.c | 219 +++
 tools/testing/vsock/timeout.c |  64 
 tools/testing/vsock/vsock_diag_test.c | 681 ++
 tools/testing/vsock/.gitignore|   2 +
 tools/testing/vsock/README|  36 ++
 9 files changed, 1039 insertions(+)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

diff --git a/MAINTAINERS b/MAINTAINERS
index 039f0ad13482..5ad7381d92aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14282,6 +14282,7 @@ F:  net/vmw_vsock/virtio_transport.c
 F: drivers/net/vsockmon.c
 F: drivers/vhost/vsock.c
 F: drivers/vhost/vsock.h
+F: tools/testing/vsock/
 
 VIRTIO CONSOLE DRIVER
 M: Amit Shah <a...@kernel.org>
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
new file mode 100644
index ..66ba0924194d
--- /dev/null
+++ b/tools/testing/vsock/Makefile
@@ -0,0 +1,9 @@
+all: test
+test: vsock_diag_test
+vsock_diag_test: vsock_diag_test.o timeout.o control.o
+
+CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
+.PHONY: all test clean
+clean:
+   ${RM} *.o *.d vsock_diag_test
+-include *.d
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
new file mode 100644
index ..54a07efd267c
--- /dev/null
+++ b/tools/testing/vsock/control.h
@@ -0,0 +1,13 @@
+#ifndef CONTROL_H
+#define CONTROL_H
+
+#include 
+
+void control_init(const char *control_host, const char *control_port,
+ bool server);
+void control_cleanup(void);
+void control_writeln(const char *str);
+char *control_readln(void);
+void control_expectln(const char *str);
+
+#endif /* CONTROL_H */
diff --git a/tools/testing/vsock/timeout.h b/tools/testing/vsock/timeout.h
new file mode 100644
index ..77db9ce9860a
--- /dev/null
+++ b/tools/testing/vsock/timeout.h
@@ -0,0 +1,14 @@
+#ifndef TIMEOUT_H
+#define TIMEOUT_H
+
+enum {
+   /* Default timeout */
+   TIMEOUT = 10 /* seconds */
+};
+
+void sigalrm(int signo);
+void timeout_begin(unsigned int seconds);
+void timeout_check(const char *operation);
+void timeout_end(void);
+
+#endif /* TIMEOUT_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
new file mode 100644
index ..90fd47f0e422
--- /dev/null
+++ b/tools/testing/vsock/control.c
@@ -0,0 +1,219 @@
+/* Control socket for client/server test execution
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* The client and server may need to coordinate to avoid race conditions like
+ * the client attempting to connect to a socket that the server is not
+ * listening on yet.  The control socket offers a communications channel for
+ * such coordination tasks.
+ *
+ * If the client calls control_expectln("LISTENING"), then it will block until
+ * the server calls control_writeln("LISTENING").  This provides a simple
+ * mechanism for coordinating between the client and the server.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "timeout.h"
+#include "control.h"
+
+static int control_fd = -1;
+
+/* Open the control socket, either in server or client mode */
+void control_init(const char *control_host,
+ const char *control_port,
+ bool server)
+{
+   struct addrinfo hints = {
+   .ai_socktype = SOCK_STREAM,
+   };
+   struct addrinfo *result = NULL;
+   struct addrinfo *ai;
+   int ret;
+
+   ret = getaddrinfo(control_host, control_port, , );
+   if (ret != 0) {
+   fprintf(stderr, "%s\n", gai_strerror(ret));
+   e

[PATCH v3 3/5] VSOCK: use TCP state constants for sk_state

2017-10-05 Thread Stefan Hajnoczi
There are two state fields: socket->state and sock->sk_state.  The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families.  Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

  SS_FREE -> TCP_CLOSE
  SS_UNCONNECTED -> TCP_CLOSE
  SS_CONNECTING -> TCP_SYN_SENT
  SS_CONNECTED -> TCP_ESTABLISHED
  SS_DISCONNECTING -> TCP_CLOSING
  VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  3 --
 net/vmw_vsock/af_vsock.c | 46 
 net/vmw_vsock/hyperv_transport.c | 12 
 net/vmw_vsock/virtio_transport.c |  2 +-
 net/vmw_vsock/virtio_transport_common.c  | 22 ++---
 net/vmw_vsock/vmci_transport.c   | 34 ++--
 net/vmw_vsock/vmci_transport_notify.c|  2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |  2 +-
 8 files changed, 64 insertions(+), 59 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 3dd217718a2f..9324ac2d9ff2 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -22,9 +22,6 @@
 
 #include "vsock_addr.h"
 
-/* vsock-specific sock->sk_state constants */
-#define VSOCK_SS_LISTEN 255
-
 #define LAST_RESERVED_PORT 1023
 
 #define VSOCK_HASH_SIZE 251
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9b179a0081b3..98359c19522f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -36,7 +36,7 @@
  * not support simultaneous connects (two "client" sockets connecting).
  *
  * - "Server" sockets are referred to as listener sockets throughout this
- * implementation because they are in the VSOCK_SS_LISTEN state.  When a
+ * implementation because they are in the TCP_LISTEN state.  When a
  * connection request is received (the second kind of socket mentioned above),
  * we create a new socket and refer to it as a pending socket.  These pending
  * sockets are placed on the pending connection list of the listener socket.
@@ -82,6 +82,15 @@
  * argument, we must ensure the reference count is increased to ensure the
  * socket isn't freed before the function is run; the deferred function will
  * then drop the reference.
+ *
+ * - sk->sk_state uses the TCP state constants because they are widely used by
+ * other address families and exposed to userspace tools like ss(8):
+ *
+ *   TCP_CLOSE - unconnected
+ *   TCP_SYN_SENT - connecting
+ *   TCP_ESTABLISHED - connected
+ *   TCP_CLOSING - disconnecting
+ *   TCP_LISTEN - listening
  */
 
 #include 
@@ -477,7 +486,7 @@ void vsock_pending_work(struct work_struct *work)
if (vsock_in_connected_table(vsk))
vsock_remove_connected(vsk);
 
-   sk->sk_state = SS_FREE;
+   sk->sk_state = TCP_CLOSE;
 
 out:
release_sock(sk);
@@ -617,7 +626,6 @@ struct sock *__vsock_create(struct net *net,
 
sk->sk_destruct = vsock_sk_destruct;
sk->sk_backlog_rcv = vsock_queue_rcv_skb;
-   sk->sk_state = 0;
sock_reset_flag(sk, SOCK_DONE);
 
INIT_LIST_HEAD(>bound_table);
@@ -891,7 +899,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
/* Listening sockets that have connections in their accept
 * queue can be read.
 */
-   if (sk->sk_state == VSOCK_SS_LISTEN
+   if (sk->sk_state == TCP_LISTEN
&& !vsock_is_accept_queue_empty(sk))
mask |= POLLIN | POLLRDNORM;
 
@@ -920,7 +928,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
}
 
/* Connected sockets that can produce data can be written. */
-   if (sk->sk_state == SS_CONNECTED) {
+   if (sk->sk_state == TCP_ESTABLISHED) {
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
bool space_avail_now = false;
int ret = transport->notify_poll_out(
@@ -942,7 +950,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
 

[PATCH v3 4/5] VSOCK: add sock_diag interface

2017-10-05 Thread Stefan Hajnoczi
This patch adds the sock_diag interface for querying sockets from
userspace.  Tools like ss(8) and netstat(8) can use this interface to
list open sockets.

The userspace ABI is defined in  and includes
netlink request and response structs.  The request can query sockets
based on their sk_state (e.g. listening sockets only) and the response
contains socket information fields including the local/remote addresses,
inode number, etc.

This patch does not dump VMCI pending sockets because I have only tested
the virtio transport, which does not use pending sockets.  Support can
be added later by extending vsock_diag_dump() if needed by VMCI users.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS  |   2 +
 net/vmw_vsock/Makefile   |   3 +
 include/uapi/linux/vm_sockets_diag.h |  33 +++
 net/vmw_vsock/diag.c | 186 +++
 net/vmw_vsock/Kconfig|  10 ++
 5 files changed, 234 insertions(+)
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 net/vmw_vsock/diag.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 004816a585b8..039f0ad13482 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14274,6 +14274,8 @@ S:  Maintained
 F: include/linux/virtio_vsock.h
 F: include/uapi/linux/virtio_vsock.h
 F: include/uapi/linux/vsockmon.h
+F: include/uapi/linux/vm_sockets_diag.h
+F: net/vmw_vsock/diag.c
 F: net/vmw_vsock/af_vsock_tap.c
 F: net/vmw_vsock/virtio_transport_common.c
 F: net/vmw_vsock/virtio_transport.c
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index e63d574234a9..64afc06805da 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
+obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
@@ -6,6 +7,8 @@ obj-$(CONFIG_HYPERV_VSOCKETS) += hv_sock.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
+vsock_diag-y += diag.o
+
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
vmci_transport_notify_qstate.o
 
diff --git a/include/uapi/linux/vm_sockets_diag.h 
b/include/uapi/linux/vm_sockets_diag.h
new file mode 100644
index ..14cd7dc5a187
--- /dev/null
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include 
+
+/* Request */
+struct vsock_diag_req {
+   __u8sdiag_family;   /* must be AF_VSOCK */
+   __u8sdiag_protocol; /* must be 0 */
+   __u16   pad;/* must be 0 */
+   __u32   vdiag_states;   /* query bitmap (e.g. 1 << TCP_LISTEN) */
+   __u32   vdiag_ino;  /* must be 0 (reserved) */
+   __u32   vdiag_show; /* must be 0 (reserved) */
+   __u32   vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+   __u8vdiag_family;   /* AF_VSOCK */
+   __u8vdiag_type; /* SOCK_STREAM or SOCK_DGRAM */
+   __u8vdiag_state;/* sk_state (e.g. TCP_LISTEN) */
+   __u8vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+   __u32   vdiag_src_cid;
+   __u32   vdiag_src_port;
+   __u32   vdiag_dst_cid;
+   __u32   vdiag_dst_port;
+   __u32   vdiag_ino;
+   __u32   vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
new file mode 100644
index ..31b567652250
--- /dev/null
+++ b/net/vmw_vsock/diag.c
@@ -0,0 +1,186 @@
+/*
+ * vsock sock_diag(7) module
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+   u32 portid, u32 seq, u32 flags)
+{
+   struct vsock_sock *vsk = vsock_sk(sk);
+   struct vsock_diag_msg *rep;
+   struct nlmsghdr *nlh;
+
+   nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+   flags);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   rep = nlmsg_data(nlh);
+   rep->vdiag_family = AF_VSOCK;
+
+   /* Lock order dictates that sk_lo

[PATCH v3 0/5] VSOCK: add sock_diag interface

2017-10-05 Thread Stefan Hajnoczi
v3:
 * Rebased onto net-next/master and resolved Hyper-V transport conflict

v2:
 * Moved tests to tools/testing/vsock/.  I was unable to put them in selftests/
   because they require manual setup of a VMware/KVM guest.
 * Moved to __vsock_in_bound/connected_table() to af_vsock.h
 * Fixed local variable ordering in Patch 4

There is currently no way for userspace to query open AF_VSOCK sockets.  This
means ss(8), netstat(8), and other utilities cannot display AF_VSOCK sockets.

This patch series adds the netlink sock_diag interface for AF_VSOCK.  Userspace
programs sent a DUMP request including an sk_state bitmap to filter sockets
based on their state (connected, listening, etc).  The vsock_diag.ko module
replies with information about matching sockets.  This userspace ABI is defined
in .

The final patch adds a test suite that exercises the basic cases.

Jorgen and Dexuan: I have only tested the virtio transport but this should also
work for VMCI and Hyper-V.  Please give it a shot if you have time.

Stefan Hajnoczi (5):
  VSOCK: export socket tables for sock_diag interface
  VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h
  VSOCK: use TCP state constants for sk_state
  VSOCK: add sock_diag interface
  VSOCK: add tools/testing/vsock/vsock_diag_test

 MAINTAINERS  |   3 +
 net/vmw_vsock/Makefile   |   3 +
 tools/testing/vsock/Makefile |   9 +
 include/net/af_vsock.h   |  20 +-
 include/uapi/linux/vm_sockets_diag.h |  33 ++
 tools/testing/vsock/control.h|  13 +
 tools/testing/vsock/timeout.h|  14 +
 net/vmw_vsock/af_vsock.c |  66 +--
 net/vmw_vsock/diag.c | 186 
 net/vmw_vsock/hyperv_transport.c |  12 +-
 net/vmw_vsock/virtio_transport.c |   2 +-
 net/vmw_vsock/virtio_transport_common.c  |  22 +-
 net/vmw_vsock/vmci_transport.c   |  34 +-
 net/vmw_vsock/vmci_transport_notify.c|   2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |   2 +-
 tools/testing/vsock/control.c| 219 +
 tools/testing/vsock/timeout.c|  64 +++
 tools/testing/vsock/vsock_diag_test.c| 681 +++
 net/vmw_vsock/Kconfig|  10 +
 tools/testing/vsock/.gitignore   |   2 +
 tools/testing/vsock/README   |  36 ++
 21 files changed, 1360 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 net/vmw_vsock/diag.c
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

-- 
2.13.6



[PATCH v3 1/5] VSOCK: export socket tables for sock_diag interface

2017-10-05 Thread Stefan Hajnoczi
The socket table symbols need to be exported from vsock.ko so that the
vsock_diag.ko module will be able to traverse sockets.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  5 +
 net/vmw_vsock/af_vsock.c | 10 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f9fb566e75cf..30cba806e344 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -27,6 +27,11 @@
 
 #define LAST_RESERVED_PORT 1023
 
+#define VSOCK_HASH_SIZE 251
+extern struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+extern struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+extern spinlock_t vsock_table_lock;
+
 #define vsock_sk(__sk)((struct vsock_sock *)__sk)
 #define sk_vsock(__vsk)   (&(__vsk)->sk)
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index dfc8c51e4d74..9afe4da8c67d 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -153,7 +153,6 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
  * vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.  The hash function
  * mods with VSOCK_HASH_SIZE to ensure this.
  */
-#define VSOCK_HASH_SIZE 251
 #define MAX_PORT_RETRIES24
 
 #define VSOCK_HASH(addr)((addr)->svm_port % VSOCK_HASH_SIZE)
@@ -168,9 +167,12 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
 #define vsock_connected_sockets_vsk(vsk)   \
vsock_connected_sockets(&(vsk)->remote_addr, &(vsk)->local_addr)
 
-static struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
-static struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
-static DEFINE_SPINLOCK(vsock_table_lock);
+struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+EXPORT_SYMBOL_GPL(vsock_bind_table);
+struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+EXPORT_SYMBOL_GPL(vsock_connected_table);
+DEFINE_SPINLOCK(vsock_table_lock);
+EXPORT_SYMBOL_GPL(vsock_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
-- 
2.13.6



[PATCH v3 2/5] VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h

2017-10-05 Thread Stefan Hajnoczi
The vsock_diag.ko module will need to check socket table membership.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   | 12 
 net/vmw_vsock/af_vsock.c | 10 --
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 30cba806e344..3dd217718a2f 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -180,6 +180,18 @@ const struct vsock_transport 
*vsock_core_get_transport(void);
 
 / UTILS /
 
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_bound_table(struct vsock_sock *vsk)
+{
+   return !list_empty(>bound_table);
+}
+
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_connected_table(struct vsock_sock *vsk)
+{
+   return !list_empty(>connected_table);
+}
+
 void vsock_release_pending(struct sock *pending);
 void vsock_add_pending(struct sock *listener, struct sock *pending);
 void vsock_remove_pending(struct sock *listener, struct sock *pending);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9afe4da8c67d..9b179a0081b3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -250,16 +250,6 @@ static struct sock *__vsock_find_connected_socket(struct 
sockaddr_vm *src,
return NULL;
 }
 
-static bool __vsock_in_bound_table(struct vsock_sock *vsk)
-{
-   return !list_empty(>bound_table);
-}
-
-static bool __vsock_in_connected_table(struct vsock_sock *vsk)
-{
-   return !list_empty(>connected_table);
-}
-
 static void vsock_insert_unbound(struct vsock_sock *vsk)
 {
spin_lock_bh(_table_lock);
-- 
2.13.6



[PATCH v2 5/5] VSOCK: add tools/testing/vsock/vsock_diag_test

2017-10-04 Thread Stefan Hajnoczi
This patch adds tests for the vsock_diag.ko module.

These tests are not self-tests because they require manual set up of a
KVM or VMware guest.  Please see tools/testing/vsock/README for
instructions.

The control.h and timeout.h infrastructure can be used for additional
AF_VSOCK tests in the future.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS   |   1 +
 tools/testing/vsock/Makefile  |   9 +
 tools/testing/vsock/control.h |  13 +
 tools/testing/vsock/timeout.h |  14 +
 tools/testing/vsock/control.c | 219 +++
 tools/testing/vsock/timeout.c |  64 
 tools/testing/vsock/vsock_diag_test.c | 681 ++
 tools/testing/vsock/.gitignore|   2 +
 tools/testing/vsock/README|  36 ++
 9 files changed, 1039 insertions(+)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

diff --git a/MAINTAINERS b/MAINTAINERS
index 200dac93f34b..76b9fa587bfa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13983,6 +13983,7 @@ F:  net/vmw_vsock/virtio_transport.c
 F: drivers/net/vsockmon.c
 F: drivers/vhost/vsock.c
 F: drivers/vhost/vsock.h
+F: tools/testing/vsock/
 
 VIRTIO CONSOLE DRIVER
 M: Amit Shah <a...@kernel.org>
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
new file mode 100644
index ..66ba0924194d
--- /dev/null
+++ b/tools/testing/vsock/Makefile
@@ -0,0 +1,9 @@
+all: test
+test: vsock_diag_test
+vsock_diag_test: vsock_diag_test.o timeout.o control.o
+
+CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
+.PHONY: all test clean
+clean:
+   ${RM} *.o *.d vsock_diag_test
+-include *.d
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
new file mode 100644
index ..54a07efd267c
--- /dev/null
+++ b/tools/testing/vsock/control.h
@@ -0,0 +1,13 @@
+#ifndef CONTROL_H
+#define CONTROL_H
+
+#include 
+
+void control_init(const char *control_host, const char *control_port,
+ bool server);
+void control_cleanup(void);
+void control_writeln(const char *str);
+char *control_readln(void);
+void control_expectln(const char *str);
+
+#endif /* CONTROL_H */
diff --git a/tools/testing/vsock/timeout.h b/tools/testing/vsock/timeout.h
new file mode 100644
index ..77db9ce9860a
--- /dev/null
+++ b/tools/testing/vsock/timeout.h
@@ -0,0 +1,14 @@
+#ifndef TIMEOUT_H
+#define TIMEOUT_H
+
+enum {
+   /* Default timeout */
+   TIMEOUT = 10 /* seconds */
+};
+
+void sigalrm(int signo);
+void timeout_begin(unsigned int seconds);
+void timeout_check(const char *operation);
+void timeout_end(void);
+
+#endif /* TIMEOUT_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
new file mode 100644
index ..90fd47f0e422
--- /dev/null
+++ b/tools/testing/vsock/control.c
@@ -0,0 +1,219 @@
+/* Control socket for client/server test execution
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* The client and server may need to coordinate to avoid race conditions like
+ * the client attempting to connect to a socket that the server is not
+ * listening on yet.  The control socket offers a communications channel for
+ * such coordination tasks.
+ *
+ * If the client calls control_expectln("LISTENING"), then it will block until
+ * the server calls control_writeln("LISTENING").  This provides a simple
+ * mechanism for coordinating between the client and the server.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "timeout.h"
+#include "control.h"
+
+static int control_fd = -1;
+
+/* Open the control socket, either in server or client mode */
+void control_init(const char *control_host,
+ const char *control_port,
+ bool server)
+{
+   struct addrinfo hints = {
+   .ai_socktype = SOCK_STREAM,
+   };
+   struct addrinfo *result = NULL;
+   struct addrinfo *ai;
+   int ret;
+
+   ret = getaddrinfo(control_host, control_port, , );
+   if (ret != 0) {
+   fprintf(stderr, "%s\n", gai_strerror(ret));
+   e

[PATCH v2 3/5] VSOCK: use TCP state constants for sk_state

2017-10-04 Thread Stefan Hajnoczi
There are two state fields: socket->state and sock->sk_state.  The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families.  Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

  SS_FREE -> TCP_CLOSE
  SS_UNCONNECTED -> TCP_CLOSE
  SS_CONNECTING -> TCP_SYN_SENT
  SS_CONNECTED -> TCP_ESTABLISHED
  SS_DISCONNECTING -> TCP_CLOSING
  VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  3 --
 net/vmw_vsock/af_vsock.c | 46 
 net/vmw_vsock/virtio_transport.c |  2 +-
 net/vmw_vsock/virtio_transport_common.c  | 22 ++---
 net/vmw_vsock/vmci_transport.c   | 34 ++--
 net/vmw_vsock/vmci_transport_notify.c|  2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |  2 +-
 7 files changed, 58 insertions(+), 53 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 3dd217718a2f..9324ac2d9ff2 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -22,9 +22,6 @@
 
 #include "vsock_addr.h"
 
-/* vsock-specific sock->sk_state constants */
-#define VSOCK_SS_LISTEN 255
-
 #define LAST_RESERVED_PORT 1023
 
 #define VSOCK_HASH_SIZE 251
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9b179a0081b3..98359c19522f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -36,7 +36,7 @@
  * not support simultaneous connects (two "client" sockets connecting).
  *
  * - "Server" sockets are referred to as listener sockets throughout this
- * implementation because they are in the VSOCK_SS_LISTEN state.  When a
+ * implementation because they are in the TCP_LISTEN state.  When a
  * connection request is received (the second kind of socket mentioned above),
  * we create a new socket and refer to it as a pending socket.  These pending
  * sockets are placed on the pending connection list of the listener socket.
@@ -82,6 +82,15 @@
  * argument, we must ensure the reference count is increased to ensure the
  * socket isn't freed before the function is run; the deferred function will
  * then drop the reference.
+ *
+ * - sk->sk_state uses the TCP state constants because they are widely used by
+ * other address families and exposed to userspace tools like ss(8):
+ *
+ *   TCP_CLOSE - unconnected
+ *   TCP_SYN_SENT - connecting
+ *   TCP_ESTABLISHED - connected
+ *   TCP_CLOSING - disconnecting
+ *   TCP_LISTEN - listening
  */
 
 #include 
@@ -477,7 +486,7 @@ void vsock_pending_work(struct work_struct *work)
if (vsock_in_connected_table(vsk))
vsock_remove_connected(vsk);
 
-   sk->sk_state = SS_FREE;
+   sk->sk_state = TCP_CLOSE;
 
 out:
release_sock(sk);
@@ -617,7 +626,6 @@ struct sock *__vsock_create(struct net *net,
 
sk->sk_destruct = vsock_sk_destruct;
sk->sk_backlog_rcv = vsock_queue_rcv_skb;
-   sk->sk_state = 0;
sock_reset_flag(sk, SOCK_DONE);
 
INIT_LIST_HEAD(>bound_table);
@@ -891,7 +899,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
/* Listening sockets that have connections in their accept
 * queue can be read.
 */
-   if (sk->sk_state == VSOCK_SS_LISTEN
+   if (sk->sk_state == TCP_LISTEN
&& !vsock_is_accept_queue_empty(sk))
mask |= POLLIN | POLLRDNORM;
 
@@ -920,7 +928,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
}
 
/* Connected sockets that can produce data can be written. */
-   if (sk->sk_state == SS_CONNECTED) {
+   if (sk->sk_state == TCP_ESTABLISHED) {
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
bool space_avail_now = false;
int ret = transport->notify_poll_out(
@@ -942,7 +950,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
 * but local send is not shutdown.

[PATCH v2 2/5] VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h

2017-10-04 Thread Stefan Hajnoczi
The vsock_diag.ko module will need to check socket table membership.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   | 12 
 net/vmw_vsock/af_vsock.c | 10 --
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 30cba806e344..3dd217718a2f 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -180,6 +180,18 @@ const struct vsock_transport 
*vsock_core_get_transport(void);
 
 / UTILS /
 
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_bound_table(struct vsock_sock *vsk)
+{
+   return !list_empty(>bound_table);
+}
+
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_connected_table(struct vsock_sock *vsk)
+{
+   return !list_empty(>connected_table);
+}
+
 void vsock_release_pending(struct sock *pending);
 void vsock_add_pending(struct sock *listener, struct sock *pending);
 void vsock_remove_pending(struct sock *listener, struct sock *pending);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9afe4da8c67d..9b179a0081b3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -250,16 +250,6 @@ static struct sock *__vsock_find_connected_socket(struct 
sockaddr_vm *src,
return NULL;
 }
 
-static bool __vsock_in_bound_table(struct vsock_sock *vsk)
-{
-   return !list_empty(>bound_table);
-}
-
-static bool __vsock_in_connected_table(struct vsock_sock *vsk)
-{
-   return !list_empty(>connected_table);
-}
-
 static void vsock_insert_unbound(struct vsock_sock *vsk)
 {
spin_lock_bh(_table_lock);
-- 
2.13.6



[PATCH v2 4/5] VSOCK: add sock_diag interface

2017-10-04 Thread Stefan Hajnoczi
This patch adds the sock_diag interface for querying sockets from
userspace.  Tools like ss(8) and netstat(8) can use this interface to
list open sockets.

The userspace ABI is defined in  and includes
netlink request and response structs.  The request can query sockets
based on their sk_state (e.g. listening sockets only) and the response
contains socket information fields including the local/remote addresses,
inode number, etc.

This patch does not dump VMCI pending sockets because I have only tested
the virtio transport, which does not use pending sockets.  Support can
be added later by extending vsock_diag_dump() if needed by VMCI users.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS  |   2 +
 net/vmw_vsock/Makefile   |   3 +
 include/uapi/linux/vm_sockets_diag.h |  33 +++
 net/vmw_vsock/diag.c | 186 +++
 net/vmw_vsock/Kconfig|  10 ++
 5 files changed, 234 insertions(+)
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 net/vmw_vsock/diag.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1c3feffb1c1c..200dac93f34b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13975,6 +13975,8 @@ S:  Maintained
 F: include/linux/virtio_vsock.h
 F: include/uapi/linux/virtio_vsock.h
 F: include/uapi/linux/vsockmon.h
+F: include/uapi/linux/vm_sockets_diag.h
+F: net/vmw_vsock/diag.c
 F: net/vmw_vsock/af_vsock_tap.c
 F: net/vmw_vsock/virtio_transport_common.c
 F: net/vmw_vsock/virtio_transport.c
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 09fc2eb29dc8..e5dbf153aff0 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,10 +1,13 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
+obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
+vsock_diag-y += diag.o
+
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
vmci_transport_notify_qstate.o
 
diff --git a/include/uapi/linux/vm_sockets_diag.h 
b/include/uapi/linux/vm_sockets_diag.h
new file mode 100644
index ..14cd7dc5a187
--- /dev/null
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include 
+
+/* Request */
+struct vsock_diag_req {
+   __u8sdiag_family;   /* must be AF_VSOCK */
+   __u8sdiag_protocol; /* must be 0 */
+   __u16   pad;/* must be 0 */
+   __u32   vdiag_states;   /* query bitmap (e.g. 1 << TCP_LISTEN) */
+   __u32   vdiag_ino;  /* must be 0 (reserved) */
+   __u32   vdiag_show; /* must be 0 (reserved) */
+   __u32   vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+   __u8vdiag_family;   /* AF_VSOCK */
+   __u8vdiag_type; /* SOCK_STREAM or SOCK_DGRAM */
+   __u8vdiag_state;/* sk_state (e.g. TCP_LISTEN) */
+   __u8vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+   __u32   vdiag_src_cid;
+   __u32   vdiag_src_port;
+   __u32   vdiag_dst_cid;
+   __u32   vdiag_dst_port;
+   __u32   vdiag_ino;
+   __u32   vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
new file mode 100644
index ..31b567652250
--- /dev/null
+++ b/net/vmw_vsock/diag.c
@@ -0,0 +1,186 @@
+/*
+ * vsock sock_diag(7) module
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+   u32 portid, u32 seq, u32 flags)
+{
+   struct vsock_sock *vsk = vsock_sk(sk);
+   struct vsock_diag_msg *rep;
+   struct nlmsghdr *nlh;
+
+   nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+   flags);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   rep = nlmsg_data(nlh);
+   rep->vdiag_family = AF_VSOCK;
+
+   /* Lock order dictates that sk_lock is acquired before
+* vsock_table_lock, so we cannot lock her

[PATCH v2 1/5] VSOCK: export socket tables for sock_diag interface

2017-10-04 Thread Stefan Hajnoczi
The socket table symbols need to be exported from vsock.ko so that the
vsock_diag.ko module will be able to traverse sockets.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  5 +
 net/vmw_vsock/af_vsock.c | 10 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f9fb566e75cf..30cba806e344 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -27,6 +27,11 @@
 
 #define LAST_RESERVED_PORT 1023
 
+#define VSOCK_HASH_SIZE 251
+extern struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+extern struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+extern spinlock_t vsock_table_lock;
+
 #define vsock_sk(__sk)((struct vsock_sock *)__sk)
 #define sk_vsock(__vsk)   (&(__vsk)->sk)
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index dfc8c51e4d74..9afe4da8c67d 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -153,7 +153,6 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
  * vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.  The hash function
  * mods with VSOCK_HASH_SIZE to ensure this.
  */
-#define VSOCK_HASH_SIZE 251
 #define MAX_PORT_RETRIES24
 
 #define VSOCK_HASH(addr)((addr)->svm_port % VSOCK_HASH_SIZE)
@@ -168,9 +167,12 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
 #define vsock_connected_sockets_vsk(vsk)   \
vsock_connected_sockets(&(vsk)->remote_addr, &(vsk)->local_addr)
 
-static struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
-static struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
-static DEFINE_SPINLOCK(vsock_table_lock);
+struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+EXPORT_SYMBOL_GPL(vsock_bind_table);
+struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+EXPORT_SYMBOL_GPL(vsock_connected_table);
+DEFINE_SPINLOCK(vsock_table_lock);
+EXPORT_SYMBOL_GPL(vsock_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
-- 
2.13.6



[PATCH v2 0/5] VSOCK: add sock_diag interface

2017-10-04 Thread Stefan Hajnoczi
v2:
 * Moved tests to tools/testing/vsock/.  I was unable to put them in selftests/
   because they require manual setup of a VMware/KVM guest.
 * Moved to __vsock_in_bound/connected_table() to af_vsock.h
 * Fixed local variable ordering in Patch 4

There is currently no way for userspace to query open AF_VSOCK sockets.  This
means ss(8), netstat(8), and other utilities cannot display AF_VSOCK sockets.

This patch series adds the netlink sock_diag interface for AF_VSOCK.  Userspace
programs sent a DUMP request including an sk_state bitmap to filter sockets
based on their state (connected, listening, etc).  The vsock_diag.ko module
replies with information about matching sockets.  This userspace ABI is defined
in .

The final patch adds a test suite that exercises the basic cases.

Jorgen and Dexuan: I have only tested the virtio transport but this should also
work for VMCI and Hyper-V.  Please give it a shot if you have time.

Stefan Hajnoczi (5):
  VSOCK: export socket tables for sock_diag interface
  VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h
  VSOCK: use TCP state constants for sk_state
  VSOCK: add sock_diag interface
  VSOCK: add tools/testing/vsock/vsock_diag_test

 MAINTAINERS  |   3 +
 net/vmw_vsock/Makefile   |   3 +
 tools/testing/vsock/Makefile |   9 +
 include/net/af_vsock.h   |  20 +-
 include/uapi/linux/vm_sockets_diag.h |  33 ++
 tools/testing/vsock/control.h|  13 +
 tools/testing/vsock/timeout.h|  14 +
 net/vmw_vsock/af_vsock.c |  66 +--
 net/vmw_vsock/diag.c | 186 
 net/vmw_vsock/virtio_transport.c |   2 +-
 net/vmw_vsock/virtio_transport_common.c  |  22 +-
 net/vmw_vsock/vmci_transport.c   |  34 +-
 net/vmw_vsock/vmci_transport_notify.c|   2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |   2 +-
 tools/testing/vsock/control.c| 219 +
 tools/testing/vsock/timeout.c|  64 +++
 tools/testing/vsock/vsock_diag_test.c| 681 +++
 net/vmw_vsock/Kconfig|  10 +
 tools/testing/vsock/.gitignore   |   2 +
 tools/testing/vsock/README   |  36 ++
 20 files changed, 1354 insertions(+), 67 deletions(-)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 net/vmw_vsock/diag.c
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

-- 
2.13.6



Re: [PATCH 2/5] VSOCK: export __vsock_in_bound/connected_table()

2017-10-04 Thread Stefan Hajnoczi
On Tue, Oct 03, 2017 at 09:46:17PM -0700, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Tue,  3 Oct 2017 11:39:40 -0400
> 
> > @@ -250,15 +250,17 @@ static struct sock 
> > *__vsock_find_connected_socket(struct sockaddr_vm *src,
> > return NULL;
> >  }
> >  
> > -static bool __vsock_in_bound_table(struct vsock_sock *vsk)
> > +bool __vsock_in_bound_table(struct vsock_sock *vsk)
> >  {
> > return !list_empty(>bound_table);
> >  }
> > +EXPORT_SYMBOL_GPL(__vsock_in_bound_table);
> >  
> > -static bool __vsock_in_connected_table(struct vsock_sock *vsk)
> > +bool __vsock_in_connected_table(struct vsock_sock *vsk)
> >  {
> > return !list_empty(>connected_table);
> >  }
> > +EXPORT_SYMBOL_GPL(__vsock_in_connected_table);
> 
> Maybe you can just make these inline helpers in af_vsock.h?

Will fix in v2.


Re: [PATCH 5/5] VSOCK: add tools/vsock/vsock_diag_test

2017-10-04 Thread Stefan Hajnoczi
On Tue, Oct 03, 2017 at 09:48:05PM -0700, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Tue,  3 Oct 2017 11:39:43 -0400
> 
> >  MAINTAINERS   |   1 +
> >  tools/vsock/Makefile  |   9 +
> >  tools/vsock/control.h |  13 +
> >  tools/vsock/timeout.h |  14 +
> >  tools/vsock/control.c | 219 ++
> >  tools/vsock/timeout.c |  64 
> >  tools/vsock/vsock_diag_test.c | 681 
> > ++
> >  tools/vsock/.gitignore|   2 +
> 
> Please don't create you own "special" directory for tests.
> 
> Tests belong under tools/testing/selftests/
> 
> If you put your tests in the proper place, and structure them properly 
> (especially
> the Makefile rules), they will automatically be run by various automated build
> and test frameworks.

Cool, thanks for the hint.  I wasn't aware of that.

Will fix in v2.

Stefan


Re: [PATCH 4/5] VSOCK: add sock_diag interface

2017-10-04 Thread Stefan Hajnoczi
On Tue, Oct 03, 2017 at 09:46:52PM -0700, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Tue,  3 Oct 2017 11:39:42 -0400
> 
> > +static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
> > +   u32 portid, u32 seq, u32 flags)
> > +{
> > +   struct nlmsghdr *nlh;
> > +   struct vsock_diag_msg *rep;
> > +   struct vsock_sock *vsk = vsock_sk(sk);
> 
> Please order local variables from longest to shortest line.
> 
> > +static int vsock_diag_dump(struct sk_buff *skb, struct netlink_callback 
> > *cb)
> > +{
> > +   struct vsock_diag_req *req;
> > +   unsigned int table;
> > +   unsigned int bucket;
> > +   unsigned int last_i;
> > +   unsigned int i;
> > +   struct vsock_sock *vsk;
> > +   struct net *net;
> 
> Likewise.

Will fix in v2.


Re: [PATCH iproute2 2/3] include: add

2017-10-04 Thread Stefan Hajnoczi
On Tue, Oct 03, 2017 at 11:24:55AM -0700, Stephen Hemminger wrote:
> On Tue,  3 Oct 2017 13:57:43 -0400
> Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> > This new Linux header file defines the sock_diag interface used by
> > AF_VSOCK.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> > ---
> >  include/linux/vm_sockets_diag.h | 33 +
> >  1 file changed, 33 insertions(+)
> >  create mode 100644 include/linux/vm_sockets_diag.h
> > 
> 
> All header files in linux must be headers that come from 'make 
> install_headers'
> in kernel. I don't see vm_sockets_diag.h even in net-next.

This series depends on "[PATCH 0/5] VSOCK: add sock_diag interface".
Once that has been merged in Linux the  header
will be available.

I sent the ss(8) patch at the same time and referenced the Linux patch
series in the cover letter so they can be reviewed/tested together.

Thanks,
Stefan


Re: [PATCH iproute2 1/3] ss: allow AF_FAMILY constants >32

2017-10-04 Thread Stefan Hajnoczi
On Tue, Oct 03, 2017 at 11:26:53AM -0700, Stephen Hemminger wrote:
> On Tue,  3 Oct 2017 13:57:42 -0400
> Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> > Linux has more than 32 address families defined in .  Use
> > a 64-bit type so all of them can be represented in the filter->families
> > bitmask.
> > 
> > It's easy to introduce bugs when using (1 << AF_FAMILY) because the
> > value is 32-bit.  This can produce incorrect results from bitmask
> > operations so introduce the FAMILY_MASK() macro to eliminate these bugs.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
> > ---
> >  misc/ss.c | 54 --
> >  1 file changed, 28 insertions(+), 26 deletions(-)
> > 
> > diff --git a/misc/ss.c b/misc/ss.c
> > index dd8dfaa4..12a31c90 100644
> > --- a/misc/ss.c
> > +++ b/misc/ss.c
> > @@ -170,55 +170,57 @@ enum {
> >  struct filter {
> > int dbs;
> > int states;
> > -   int families;
> > +   __u64 families;
> 
> Since this isn't a value that is coming from kernel. It should be uint64_t
> rather than __u64.

Okay, will fix in v2.


[PATCH iproute2 1/3] ss: allow AF_FAMILY constants >32

2017-10-03 Thread Stefan Hajnoczi
Linux has more than 32 address families defined in .  Use
a 64-bit type so all of them can be represented in the filter->families
bitmask.

It's easy to introduce bugs when using (1 << AF_FAMILY) because the
value is 32-bit.  This can produce incorrect results from bitmask
operations so introduce the FAMILY_MASK() macro to eliminate these bugs.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 misc/ss.c | 54 --
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index dd8dfaa4..12a31c90 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -170,55 +170,57 @@ enum {
 struct filter {
int dbs;
int states;
-   int families;
+   __u64 families;
struct ssfilter *f;
bool kill;
 };
 
+#define FAMILY_MASK(family) ((__u64)1 << (family))
+
 static const struct filter default_dbs[MAX_DB] = {
[TCP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[DCCP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[UDP_DB] = {
.states   = (1 << SS_ESTABLISHED),
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[RAW_DB] = {
.states   = (1 << SS_ESTABLISHED),
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
[UNIX_DG_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[UNIX_ST_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[UNIX_SQ_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_UNIX),
+   .families = FAMILY_MASK(AF_UNIX),
},
[PACKET_DG_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_PACKET),
+   .families = FAMILY_MASK(AF_PACKET),
},
[PACKET_R_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_PACKET),
+   .families = FAMILY_MASK(AF_PACKET),
},
[NETLINK_DB] = {
.states   = (1 << SS_CLOSE),
-   .families = (1 << AF_NETLINK),
+   .families = FAMILY_MASK(AF_NETLINK),
},
[SCTP_DB] = {
.states   = SS_CONN,
-   .families = (1 << AF_INET) | (1 << AF_INET6),
+   .families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
 };
 
@@ -258,14 +260,14 @@ static void filter_db_set(struct filter *f, int db)
 static void filter_af_set(struct filter *f, int af)
 {
f->states  |= default_afs[af].states;
-   f->families|= 1 << af;
+   f->families|= FAMILY_MASK(af);
do_default  = 0;
preferred_family= af;
 }
 
 static int filter_af_get(struct filter *f, int af)
 {
-   return f->families & (1 << af);
+   return !!(f->families & FAMILY_MASK(af));
 }
 
 static void filter_default_dbs(struct filter *f)
@@ -302,7 +304,7 @@ static void filter_merge_defaults(struct filter *f)
f->families |= default_dbs[db].families;
}
for (af = 0; af < AF_MAX; af++) {
-   if (!(f->families & (1 << af)))
+   if (!(f->families & FAMILY_MASK(af)))
continue;
 
if (!(default_afs[af].dbs & f->dbs))
@@ -2608,7 +2610,7 @@ static int show_one_inet_sock(const struct sockaddr_nl 
*addr,
struct inet_diag_msg *r = NLMSG_DATA(h);
struct sockstat s = {};
 
-   if (!(diag_arg->f->families & (1 << r->idiag_family)))
+   if (!(diag_arg->f->families & FAMILY_MASK(r->idiag_family)))
return 0;
 
parse_diag_msg(h, );
@@ -2802,7 +2804,7 @@ static int tcp_show(struct filter *f)
return -1;
}
 
-   if (f->families & (1<<AF_INET)) {
+   if (f->families & FAMILY_MASK(AF_INET)) {
if ((fp = net_tcp_open()) == NULL)
goto outerr;
 
@@ -2812,7 +2814,7 @@ static int tcp_show(struct filter *f)
fclose(fp);
}
 
-   if ((f->families & (1&

[PATCH iproute2 0/3] ss: add AF_VSOCK support

2017-10-03 Thread Stefan Hajnoczi
This patch series adds AF_VSOCK support to ss(8).  AF_VSOCK is a host<->guest
communications channel supported by VMware, KVM (virtio-vsock), and Hyper-V.

To dump AF_VSOCK sockets:

  $ ss --vsock

The vsock_diag.ko kernel module for Linux was posted in "[PATCH 0/5] VSOCK: add
sock_diag interface".  That patch series also adds the
 header file that this patch series relies on.  The
Linux patches are also available here:

  https://github.com/stefanha/linux/commits/vsock-diag

Stefan Hajnoczi (3):
  ss: allow AF_FAMILY constants >32
  include: add 
  ss: add AF_VSOCK support

 include/linux/vm_sockets_diag.h |  33 ++
 misc/ss.c   | 238 +++-
 man/man8/ss.8   |   8 +-
 3 files changed, 249 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/vm_sockets_diag.h

-- 
2.13.6



[PATCH iproute2 3/3] ss: add AF_VSOCK support

2017-10-03 Thread Stefan Hajnoczi
The AF_VSOCK address family is a host<->guest communications channel
supported by VMware, KVM, and Hyper-V.  Initial VMware support was
released in Linux 3.9 in 2013 and transports for other hypervisors were
added later.

AF_VSOCK addresses are  tuples.  The 32-bit cid
integer is comparable to an IP address.  AF_VSOCK ports work like
TCP/UDP ports.

Both SOCK_STREAM and SOCK_DGRAM socket types are available.

This patch adds AF_VSOCK support to ss(8) so that sockets can be
observed.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 misc/ss.c | 184 +-
 man/man8/ss.8 |   8 ++-
 2 files changed, 188 insertions(+), 4 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 12a31c90..164356a0 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define MAGIC_SEQ 123456
 
@@ -126,6 +127,8 @@ enum {
PACKET_R_DB,
NETLINK_DB,
SCTP_DB,
+   VSOCK_ST_DB,
+   VSOCK_DG_DB,
MAX_DB
 };
 
@@ -134,6 +137,7 @@ enum {
 #define ALL_DB ((1<<MAX_DB)-1)
 #define INET_L4_DBM ((1<<TCP_DB)|(1<<UDP_DB)|(1<<DCCP_DB)|(1<<SCTP_DB))
 #define INET_DBM (INET_L4_DBM | (1<<RAW_DB))
+#define VSOCK_DBM ((1<<VSOCK_ST_DB)|(1<<VSOCK_DG_DB))
 
 enum {
SS_UNKNOWN,
@@ -222,6 +226,14 @@ static const struct filter default_dbs[MAX_DB] = {
.states   = SS_CONN,
.families = FAMILY_MASK(AF_INET) | FAMILY_MASK(AF_INET6),
},
+   [VSOCK_ST_DB] = {
+   .states   = SS_CONN,
+   .families = FAMILY_MASK(AF_VSOCK),
+   },
+   [VSOCK_DG_DB] = {
+   .states   = SS_CONN,
+   .families = FAMILY_MASK(AF_VSOCK),
+   },
 };
 
 static const struct filter default_afs[AF_MAX] = {
@@ -245,6 +257,10 @@ static const struct filter default_afs[AF_MAX] = {
.dbs= (1 << NETLINK_DB),
.states = (1 << SS_CLOSE),
},
+   [AF_VSOCK] = {
+   .dbs= VSOCK_DBM,
+   .states = SS_CONN,
+   },
 };
 
 static int do_default = 1;
@@ -283,6 +299,8 @@ static void filter_default_dbs(struct filter *f)
filter_db_set(f, PACKET_DG_DB);
filter_db_set(f, NETLINK_DB);
filter_db_set(f, SCTP_DB);
+   filter_db_set(f, VSOCK_ST_DB);
+   filter_db_set(f, VSOCK_DG_DB);
 }
 
 static void filter_states_set(struct filter *f, int states)
@@ -791,6 +809,18 @@ static const char *proto_name(int protocol)
return "???";
 }
 
+static const char *vsock_netid_name(int type)
+{
+   switch (type) {
+   case SOCK_STREAM:
+   return "v_str";
+   case SOCK_DGRAM:
+   return "v_dgr";
+   default:
+   return "???";
+   }
+}
+
 static void sock_state_print(struct sockstat *s)
 {
const char *sock_name;
@@ -823,6 +853,9 @@ static void sock_state_print(struct sockstat *s)
case AF_NETLINK:
sock_name = "nl";
break;
+   case AF_VSOCK:
+   sock_name = vsock_netid_name(s->type);
+   break;
default:
sock_name = "unknown";
}
@@ -1149,6 +1182,8 @@ static int run_ssfilter(struct ssfilter *f, struct 
sockstat *s)
return s->lport == 0 && s->local.data[0] == 0;
if (s->local.family == AF_NETLINK)
return s->lport < 0;
+   if (s->local.family == AF_VSOCK)
+   return s->lport > 1023;
 
return is_ephemeral(s->lport);
}
@@ -1524,6 +1559,15 @@ void *parse_devcond(char *name)
return res;
 }
 
+static void vsock_set_inet_prefix(inet_prefix *a, __u32 cid)
+{
+   *a = (inet_prefix){
+   .bytelen = sizeof(cid),
+   .family = AF_VSOCK,
+   };
+   memcpy(a->data, , sizeof(cid));
+}
+
 void *parse_hostcond(char *addr, bool is_port)
 {
char *port = NULL;
@@ -1598,6 +1642,37 @@ void *parse_hostcond(char *addr, bool is_port)
goto out;
}
 
+   if (fam == AF_VSOCK || strncmp(addr, "vsock:", 6) == 0) {
+   __u32 cid = ~(__u32)0;
+
+   a.addr.family = AF_VSOCK;
+   if (strncmp(addr, "vsock:", 6) == 0)
+   addr += 6;
+
+   if (is_port)
+   port = addr;
+   else {
+   port = strchr(addr, ':');
+   if (port) {
+   *port = '\0';
+   port++;
+   }
+   }
+
+   if (port && strcmp(port, "*") &&
+   get_u32((__u32 *), port, 0))
+  

[PATCH iproute2 2/3] include: add

2017-10-03 Thread Stefan Hajnoczi
This new Linux header file defines the sock_diag interface used by
AF_VSOCK.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/linux/vm_sockets_diag.h | 33 +
 1 file changed, 33 insertions(+)
 create mode 100644 include/linux/vm_sockets_diag.h

diff --git a/include/linux/vm_sockets_diag.h b/include/linux/vm_sockets_diag.h
new file mode 100644
index ..14cd7dc5
--- /dev/null
+++ b/include/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include 
+
+/* Request */
+struct vsock_diag_req {
+   __u8sdiag_family;   /* must be AF_VSOCK */
+   __u8sdiag_protocol; /* must be 0 */
+   __u16   pad;/* must be 0 */
+   __u32   vdiag_states;   /* query bitmap (e.g. 1 << TCP_LISTEN) */
+   __u32   vdiag_ino;  /* must be 0 (reserved) */
+   __u32   vdiag_show; /* must be 0 (reserved) */
+   __u32   vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+   __u8vdiag_family;   /* AF_VSOCK */
+   __u8vdiag_type; /* SOCK_STREAM or SOCK_DGRAM */
+   __u8vdiag_state;/* sk_state (e.g. TCP_LISTEN) */
+   __u8vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+   __u32   vdiag_src_cid;
+   __u32   vdiag_src_port;
+   __u32   vdiag_dst_cid;
+   __u32   vdiag_dst_port;
+   __u32   vdiag_ino;
+   __u32   vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
-- 
2.13.6



[PATCH 3/5] VSOCK: use TCP state constants for sk_state

2017-10-03 Thread Stefan Hajnoczi
There are two state fields: socket->state and sock->sk_state.  The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families.  Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

  SS_FREE -> TCP_CLOSE
  SS_UNCONNECTED -> TCP_CLOSE
  SS_CONNECTING -> TCP_SYN_SENT
  SS_CONNECTED -> TCP_ESTABLISHED
  SS_DISCONNECTING -> TCP_CLOSING
  VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  3 --
 net/vmw_vsock/af_vsock.c | 46 
 net/vmw_vsock/virtio_transport.c |  2 +-
 net/vmw_vsock/virtio_transport_common.c  | 22 ++---
 net/vmw_vsock/vmci_transport.c   | 34 ++--
 net/vmw_vsock/vmci_transport_notify.c|  2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |  2 +-
 7 files changed, 58 insertions(+), 53 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 88149d580975..eaf45df90d97 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -22,9 +22,6 @@
 
 #include "vsock_addr.h"
 
-/* vsock-specific sock->sk_state constants */
-#define VSOCK_SS_LISTEN 255
-
 #define LAST_RESERVED_PORT 1023
 
 #define VSOCK_HASH_SIZE 251
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f2d0fb593908..9b76953aeac6 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -36,7 +36,7 @@
  * not support simultaneous connects (two "client" sockets connecting).
  *
  * - "Server" sockets are referred to as listener sockets throughout this
- * implementation because they are in the VSOCK_SS_LISTEN state.  When a
+ * implementation because they are in the TCP_LISTEN state.  When a
  * connection request is received (the second kind of socket mentioned above),
  * we create a new socket and refer to it as a pending socket.  These pending
  * sockets are placed on the pending connection list of the listener socket.
@@ -82,6 +82,15 @@
  * argument, we must ensure the reference count is increased to ensure the
  * socket isn't freed before the function is run; the deferred function will
  * then drop the reference.
+ *
+ * - sk->sk_state uses the TCP state constants because they are widely used by
+ * other address families and exposed to userspace tools like ss(8):
+ *
+ *   TCP_CLOSE - unconnected
+ *   TCP_SYN_SENT - connecting
+ *   TCP_ESTABLISHED - connected
+ *   TCP_CLOSING - disconnecting
+ *   TCP_LISTEN - listening
  */
 
 #include 
@@ -489,7 +498,7 @@ void vsock_pending_work(struct work_struct *work)
if (vsock_in_connected_table(vsk))
vsock_remove_connected(vsk);
 
-   sk->sk_state = SS_FREE;
+   sk->sk_state = TCP_CLOSE;
 
 out:
release_sock(sk);
@@ -629,7 +638,6 @@ struct sock *__vsock_create(struct net *net,
 
sk->sk_destruct = vsock_sk_destruct;
sk->sk_backlog_rcv = vsock_queue_rcv_skb;
-   sk->sk_state = 0;
sock_reset_flag(sk, SOCK_DONE);
 
INIT_LIST_HEAD(>bound_table);
@@ -903,7 +911,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
/* Listening sockets that have connections in their accept
 * queue can be read.
 */
-   if (sk->sk_state == VSOCK_SS_LISTEN
+   if (sk->sk_state == TCP_LISTEN
&& !vsock_is_accept_queue_empty(sk))
mask |= POLLIN | POLLRDNORM;
 
@@ -932,7 +940,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
}
 
/* Connected sockets that can produce data can be written. */
-   if (sk->sk_state == SS_CONNECTED) {
+   if (sk->sk_state == TCP_ESTABLISHED) {
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
bool space_avail_now = false;
int ret = transport->notify_poll_out(
@@ -954,7 +962,7 @@ static unsigned int vsock_poll(struct file *file, struct 
socket *sock,
 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
 * but local send is not shutdown.

[PATCH 4/5] VSOCK: add sock_diag interface

2017-10-03 Thread Stefan Hajnoczi
This patch adds the sock_diag interface for querying sockets from
userspace.  Tools like ss(8) and netstat(8) can use this interface to
list open sockets.

The userspace ABI is defined in  and includes
netlink request and response structs.  The request can query sockets
based on their sk_state (e.g. listening sockets only) and the response
contains socket information fields including the local/remote addresses,
inode number, etc.

This patch does not dump VMCI pending sockets because I have only tested
the virtio transport, which does not use pending sockets.  Support can
be added later by extending vsock_diag_dump() if needed by VMCI users.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS  |   2 +
 net/vmw_vsock/Makefile   |   3 +
 include/uapi/linux/vm_sockets_diag.h |  33 +++
 net/vmw_vsock/diag.c | 186 +++
 net/vmw_vsock/Kconfig|  10 ++
 5 files changed, 234 insertions(+)
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 net/vmw_vsock/diag.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1c3feffb1c1c..200dac93f34b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13975,6 +13975,8 @@ S:  Maintained
 F: include/linux/virtio_vsock.h
 F: include/uapi/linux/virtio_vsock.h
 F: include/uapi/linux/vsockmon.h
+F: include/uapi/linux/vm_sockets_diag.h
+F: net/vmw_vsock/diag.c
 F: net/vmw_vsock/af_vsock_tap.c
 F: net/vmw_vsock/virtio_transport_common.c
 F: net/vmw_vsock/virtio_transport.c
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 09fc2eb29dc8..e5dbf153aff0 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,10 +1,13 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
+obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
+vsock_diag-y += diag.o
+
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
vmci_transport_notify_qstate.o
 
diff --git a/include/uapi/linux/vm_sockets_diag.h 
b/include/uapi/linux/vm_sockets_diag.h
new file mode 100644
index ..14cd7dc5a187
--- /dev/null
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include 
+
+/* Request */
+struct vsock_diag_req {
+   __u8sdiag_family;   /* must be AF_VSOCK */
+   __u8sdiag_protocol; /* must be 0 */
+   __u16   pad;/* must be 0 */
+   __u32   vdiag_states;   /* query bitmap (e.g. 1 << TCP_LISTEN) */
+   __u32   vdiag_ino;  /* must be 0 (reserved) */
+   __u32   vdiag_show; /* must be 0 (reserved) */
+   __u32   vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+   __u8vdiag_family;   /* AF_VSOCK */
+   __u8vdiag_type; /* SOCK_STREAM or SOCK_DGRAM */
+   __u8vdiag_state;/* sk_state (e.g. TCP_LISTEN) */
+   __u8vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+   __u32   vdiag_src_cid;
+   __u32   vdiag_src_port;
+   __u32   vdiag_dst_cid;
+   __u32   vdiag_dst_port;
+   __u32   vdiag_ino;
+   __u32   vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
new file mode 100644
index ..13266329b56a
--- /dev/null
+++ b/net/vmw_vsock/diag.c
@@ -0,0 +1,186 @@
+/*
+ * vsock sock_diag(7) module
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+   u32 portid, u32 seq, u32 flags)
+{
+   struct nlmsghdr *nlh;
+   struct vsock_diag_msg *rep;
+   struct vsock_sock *vsk = vsock_sk(sk);
+
+   nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+   flags);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   rep = nlmsg_data(nlh);
+   rep->vdiag_family = AF_VSOCK;
+
+   /* Lock order dictates that sk_lock is acquired before
+* vsock_table_lock, so we cannot lock her

[PATCH 2/5] VSOCK: export __vsock_in_bound/connected_table()

2017-10-03 Thread Stefan Hajnoczi
The vsock_diag.ko module will need to check socket table membership.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   | 2 ++
 net/vmw_vsock/af_vsock.c | 6 --
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 30cba806e344..88149d580975 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -187,6 +187,8 @@ void vsock_enqueue_accept(struct sock *listener, struct 
sock *connected);
 void vsock_insert_connected(struct vsock_sock *vsk);
 void vsock_remove_bound(struct vsock_sock *vsk);
 void vsock_remove_connected(struct vsock_sock *vsk);
+bool __vsock_in_bound_table(struct vsock_sock *vsk);
+bool __vsock_in_connected_table(struct vsock_sock *vsk);
 struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
 struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
 struct sockaddr_vm *dst);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9afe4da8c67d..f2d0fb593908 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -250,15 +250,17 @@ static struct sock *__vsock_find_connected_socket(struct 
sockaddr_vm *src,
return NULL;
 }
 
-static bool __vsock_in_bound_table(struct vsock_sock *vsk)
+bool __vsock_in_bound_table(struct vsock_sock *vsk)
 {
return !list_empty(>bound_table);
 }
+EXPORT_SYMBOL_GPL(__vsock_in_bound_table);
 
-static bool __vsock_in_connected_table(struct vsock_sock *vsk)
+bool __vsock_in_connected_table(struct vsock_sock *vsk)
 {
return !list_empty(>connected_table);
 }
+EXPORT_SYMBOL_GPL(__vsock_in_connected_table);
 
 static void vsock_insert_unbound(struct vsock_sock *vsk)
 {
-- 
2.13.6



[PATCH 1/5] VSOCK: export socket tables for sock_diag interface

2017-10-03 Thread Stefan Hajnoczi
The socket table symbols need to be exported from vsock.ko so that the
vsock_diag.ko module will be able to traverse sockets.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/net/af_vsock.h   |  5 +
 net/vmw_vsock/af_vsock.c | 10 ++
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f9fb566e75cf..30cba806e344 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -27,6 +27,11 @@
 
 #define LAST_RESERVED_PORT 1023
 
+#define VSOCK_HASH_SIZE 251
+extern struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+extern struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+extern spinlock_t vsock_table_lock;
+
 #define vsock_sk(__sk)((struct vsock_sock *)__sk)
 #define sk_vsock(__vsk)   (&(__vsk)->sk)
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index dfc8c51e4d74..9afe4da8c67d 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -153,7 +153,6 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
  * vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.  The hash function
  * mods with VSOCK_HASH_SIZE to ensure this.
  */
-#define VSOCK_HASH_SIZE 251
 #define MAX_PORT_RETRIES24
 
 #define VSOCK_HASH(addr)((addr)->svm_port % VSOCK_HASH_SIZE)
@@ -168,9 +167,12 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
 #define vsock_connected_sockets_vsk(vsk)   \
vsock_connected_sockets(&(vsk)->remote_addr, &(vsk)->local_addr)
 
-static struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
-static struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
-static DEFINE_SPINLOCK(vsock_table_lock);
+struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+EXPORT_SYMBOL_GPL(vsock_bind_table);
+struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+EXPORT_SYMBOL_GPL(vsock_connected_table);
+DEFINE_SPINLOCK(vsock_table_lock);
+EXPORT_SYMBOL_GPL(vsock_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
-- 
2.13.6



[PATCH 0/5] VSOCK: add sock_diag interface

2017-10-03 Thread Stefan Hajnoczi
There is currently no way for userspace to query open AF_VSOCK sockets.  This
means ss(8), netstat(8), and other utilities cannot display AF_VSOCK sockets.

This patch series adds the netlink sock_diag interface for AF_VSOCK.  Userspace
programs sent a DUMP request including an sk_state bitmap to filter sockets
based on their state (connected, listening, etc).  The vsock_diag.ko module
replies with information about matching sockets.  This userspace ABI is defined
in .

The final patch adds a test suite that exercises the basic cases.

Jorgen and Dexuan: I have only tested the virtio transport but this should also
work for VMCI and Hyper-V.  Please give it a shot if you have time.

Stefan Hajnoczi (5):
  VSOCK: export socket tables for sock_diag interface
  VSOCK: export __vsock_in_bound/connected_table()
  VSOCK: use TCP state constants for sk_state
  VSOCK: add sock_diag interface
  VSOCK: add tools/vsock/vsock_diag_test

 MAINTAINERS  |   3 +
 net/vmw_vsock/Makefile   |   3 +
 tools/vsock/Makefile |   9 +
 include/net/af_vsock.h   |  10 +-
 include/uapi/linux/vm_sockets_diag.h |  33 ++
 tools/vsock/control.h|  13 +
 tools/vsock/timeout.h|  14 +
 net/vmw_vsock/af_vsock.c |  62 ++-
 net/vmw_vsock/diag.c | 186 
 net/vmw_vsock/virtio_transport.c |   2 +-
 net/vmw_vsock/virtio_transport_common.c  |  22 +-
 net/vmw_vsock/vmci_transport.c   |  34 +-
 net/vmw_vsock/vmci_transport_notify.c|   2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |   2 +-
 tools/vsock/control.c| 219 +
 tools/vsock/timeout.c|  64 +++
 tools/vsock/vsock_diag_test.c| 681 +++
 net/vmw_vsock/Kconfig|  10 +
 tools/vsock/.gitignore   |   2 +
 19 files changed, 1312 insertions(+), 59 deletions(-)
 create mode 100644 tools/vsock/Makefile
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 tools/vsock/control.h
 create mode 100644 tools/vsock/timeout.h
 create mode 100644 net/vmw_vsock/diag.c
 create mode 100644 tools/vsock/control.c
 create mode 100644 tools/vsock/timeout.c
 create mode 100644 tools/vsock/vsock_diag_test.c
 create mode 100644 tools/vsock/.gitignore

-- 
2.13.6



[PATCH 5/5] VSOCK: add tools/vsock/vsock_diag_test

2017-10-03 Thread Stefan Hajnoczi
This patch adds tests for the vsock_diag.ko module.

To run the tests:

  # qemu-system-x86_64 -M accel=kvm ... \
   -device vhost-vsock-pci,peer-cid=3
  (host)# ./vsock_diag_test --mode=server \
--control-port=1234 \
--peer-cid=3
  (guest)# ./vsock_diag_test --mode=client \
 --control-host=10.0.2.2 \
 --control-port 1234 \
 --peer-cid=2

The control.h and timeout.h infrastructure can be used for additional
AF_VSOCK tests in the future.

Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 MAINTAINERS   |   1 +
 tools/vsock/Makefile  |   9 +
 tools/vsock/control.h |  13 +
 tools/vsock/timeout.h |  14 +
 tools/vsock/control.c | 219 ++
 tools/vsock/timeout.c |  64 
 tools/vsock/vsock_diag_test.c | 681 ++
 tools/vsock/.gitignore|   2 +
 8 files changed, 1003 insertions(+)
 create mode 100644 tools/vsock/Makefile
 create mode 100644 tools/vsock/control.h
 create mode 100644 tools/vsock/timeout.h
 create mode 100644 tools/vsock/control.c
 create mode 100644 tools/vsock/timeout.c
 create mode 100644 tools/vsock/vsock_diag_test.c
 create mode 100644 tools/vsock/.gitignore

diff --git a/MAINTAINERS b/MAINTAINERS
index 200dac93f34b..bd396f52670b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13983,6 +13983,7 @@ F:  net/vmw_vsock/virtio_transport.c
 F: drivers/net/vsockmon.c
 F: drivers/vhost/vsock.c
 F: drivers/vhost/vsock.h
+F: tools/vsock/
 
 VIRTIO CONSOLE DRIVER
 M: Amit Shah <a...@kernel.org>
diff --git a/tools/vsock/Makefile b/tools/vsock/Makefile
new file mode 100644
index ..fccd860593ad
--- /dev/null
+++ b/tools/vsock/Makefile
@@ -0,0 +1,9 @@
+all: test
+test: vsock_diag_test
+vsock_diag_test: vsock_diag_test.o timeout.o control.o
+
+CFLAGS += -g -O2 -Werror -Wall -I. -I../include/uapi -I../include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
+.PHONY: all test clean
+clean:
+   ${RM} *.o *.d vsock_diag_test
+-include *.d
diff --git a/tools/vsock/control.h b/tools/vsock/control.h
new file mode 100644
index ..54a07efd267c
--- /dev/null
+++ b/tools/vsock/control.h
@@ -0,0 +1,13 @@
+#ifndef CONTROL_H
+#define CONTROL_H
+
+#include 
+
+void control_init(const char *control_host, const char *control_port,
+ bool server);
+void control_cleanup(void);
+void control_writeln(const char *str);
+char *control_readln(void);
+void control_expectln(const char *str);
+
+#endif /* CONTROL_H */
diff --git a/tools/vsock/timeout.h b/tools/vsock/timeout.h
new file mode 100644
index ..77db9ce9860a
--- /dev/null
+++ b/tools/vsock/timeout.h
@@ -0,0 +1,14 @@
+#ifndef TIMEOUT_H
+#define TIMEOUT_H
+
+enum {
+   /* Default timeout */
+   TIMEOUT = 10 /* seconds */
+};
+
+void sigalrm(int signo);
+void timeout_begin(unsigned int seconds);
+void timeout_check(const char *operation);
+void timeout_end(void);
+
+#endif /* TIMEOUT_H */
diff --git a/tools/vsock/control.c b/tools/vsock/control.c
new file mode 100644
index ..90fd47f0e422
--- /dev/null
+++ b/tools/vsock/control.c
@@ -0,0 +1,219 @@
+/* Control socket for client/server test execution
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefa...@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* The client and server may need to coordinate to avoid race conditions like
+ * the client attempting to connect to a socket that the server is not
+ * listening on yet.  The control socket offers a communications channel for
+ * such coordination tasks.
+ *
+ * If the client calls control_expectln("LISTENING"), then it will block until
+ * the server calls control_writeln("LISTENING").  This provides a simple
+ * mechanism for coordinating between the client and the server.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "timeout.h"
+#include "control.h"
+
+static int control_fd = -1;
+
+/* Open the control socket, either in server or client mode */
+void control_init(const char *control_host,
+ const char *control_port,
+ bool server)
+{
+   struct addrinfo hints = {
+   .ai_socktype = SOCK_STREAM,
+   };
+   struct addrinfo *result = NULL;
+   struct addrinfo *ai;
+   int ret;
+
+   ret = getaddrinfo(control_host, control_port, , );
+   if (ret != 0) {
+   fprintf(stderr, "%s\n", gai_strerror(ret));
+   exit(EXIT_

Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-22 Thread Stefan Hajnoczi
On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:
> This patch introduces a helper which just increase the used idx. This
> will be used in pair with vhost_prefetch_desc_indices() by batching
> code.
> 
> Signed-off-by: Jason Wang <jasow...@redhat.com>
> ---
>  drivers/vhost/vhost.c | 33 +
>  drivers/vhost/vhost.h |  1 +
>  2 files changed, 34 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-22 Thread Stefan Hajnoczi
On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct 
> vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>  
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> + struct vring_used_elem *heads,
> + u16 num, bool used_update)

Missing doc comment.

> +{
> + int ret, ret2;
> + u16 last_avail_idx, last_used_idx, total, copied;
> + __virtio16 avail_idx;
> + struct vring_used_elem __user *used;
> + int i;

The following variable names are a little confusing:

last_avail_idx vs vq->last_avail_idx.  last_avail_idx is a wrapped
avail->ring[] index, vq->last_avail_idx is a free-running counter.  The
same for last_used_idx vs vq->last_used_idx.

num argument vs vq->num.  The argument could be called nheads instead to
make it clear that this is heads[] and not the virtqueue size.

Not a bug but it took me a while to figure out what was going on.


Re: [PATCH net-next RFC 1/5] vhost: split out ring head fetching logic

2017-09-22 Thread Stefan Hajnoczi
On Fri, Sep 22, 2017 at 04:02:31PM +0800, Jason Wang wrote:
> +/* This looks in the virtqueue and for the first available buffer, and 
> converts
> + * it to an iovec for convenient access.  Since descriptors consist of some
> + * number of output then some number of input descriptors, it's actually two
> + * iovecs, but we pack them into one and note how many of each there were.
> + *
> + * This function returns the descriptor number found, or vq->num (which is
> + * never a valid descriptor number) if none was found.  A negative code is
> + * returned on error. */
> +int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
> + struct iovec iov[], unsigned int iov_size,
> + unsigned int *out_num, unsigned int *in_num,
> + struct vhost_log *log, unsigned int *log_num,
> + __virtio16 head)
[...]
> +int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> +   struct iovec iov[], unsigned int iov_size,
> +   unsigned int *out_num, unsigned int *in_num,
> +   struct vhost_log *log, unsigned int *log_num)

Please document vhost_get_vq_desc().

Please also explain the difference between __vhost_get_vq_desc() and
vhost_get_vq_desc() in the documentation.


Re: [PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-20 Thread Stefan Hajnoczi
On Tue, Sep 19, 2017 at 10:38:40AM -0700, David Miller wrote:
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Date: Mon, 18 Sep 2017 16:21:00 +0100
> 
> > On Fri, Sep 15, 2017 at 02:14:32PM -0700, David Miller wrote:
> >> > diff --git a/include/uapi/linux/vm_sockets.h 
> >> > b/include/uapi/linux/vm_sockets.h
> >> > index b4ed5d895699..4ae5c625ac56 100644
> >> > --- a/include/uapi/linux/vm_sockets.h
> >> > +++ b/include/uapi/linux/vm_sockets.h
> >> > @@ -18,6 +18,10 @@
> >> >  
> >> >  #include 
> >> >  
> >> > +#ifndef __KERNEL__
> >> > +#include  /* struct sockaddr */
> >> > +#endif
> >> > +
> >> 
> >> There is no precedence whatsoever to include sys/socket.h in _any_ UAPI
> >> header file provided by the kernel.
> > 
> >  does it for the same reason:
> > 
> > include/uapi/linux/if.h:#include  /* for 
> > struct sockaddr. */
> 
> You don't need it for struct sockaddr, you need it for sa_family_t,
> the comment is very misleading.
> 
> Please do as I have instructed and it will fix this problem.

No, you really cannot rely on struct sockaddr from  in
uapi headers.  You can check this yourself:

  $ cd /tmp && gcc -o a.o -c /usr/include/linux/vm_sockets.h
  /usr/include/linux/vm_sockets.h:148:32: error: invalid application of 
‘sizeof’ to incomplete type ‘struct sockaddr’
  unsigned char svm_zero[sizeof(struct sockaddr) -
^~

The weird situation is:

1. When compiling the kernel,  brings in struct sockaddr
   because the compiler finds include/linux/socket.h first before
   include/uapi/linux/socket.h.

2. When compiling a userspace application,  does not
   bring in struct sockaddr because include/uapi/linux/socket.h is
   found.

This is why I added the #include  when !__KERNEL__.  Sorry
that the commit description wasn't clear on this.

Am I misunderstanding something?

Stefan


Re: [PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-19 Thread Stefan Hajnoczi
On Fri, Sep 15, 2017 at 02:14:32PM -0700, David Miller wrote:
> > diff --git a/include/uapi/linux/vm_sockets.h 
> > b/include/uapi/linux/vm_sockets.h
> > index b4ed5d895699..4ae5c625ac56 100644
> > --- a/include/uapi/linux/vm_sockets.h
> > +++ b/include/uapi/linux/vm_sockets.h
> > @@ -18,6 +18,10 @@
> >  
> >  #include 
> >  
> > +#ifndef __KERNEL__
> > +#include  /* struct sockaddr */
> > +#endif
> > +
> 
> There is no precedence whatsoever to include sys/socket.h in _any_ UAPI
> header file provided by the kernel.

 does it for the same reason:

include/uapi/linux/if.h:#include  /* for struct 
sockaddr. */

If that is not acceptable for  then we could do the
same thing as :

#define __SOCK_SIZE__   16  /* sizeof(struct sockaddr)  */

Is that okay or do you have another suggestion for getting sizeof(struct
sockaddr)?  It's needed to pad struct sockaddr_vm.

> __kernel_sa_family_t is what should be used in UAPI headers, only
> non-UAPI headers can use plain sa_family_t.
> 
> So that is the correct fix for this problem.

Thanks, will fix in v2.


[PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-12 Thread Stefan Hajnoczi
This patch fixes the following compiler errors when userspace
applications use the vm_sockets.h header:

  include/uapi/linux/vm_sockets.h:148:32: error: invalid application of 
‘sizeof’ to incomplete type ‘struct sockaddr’
unsigned char svm_zero[sizeof(struct sockaddr) -
  ^~
  include/uapi/linux/vm_sockets.h:149:18: error: ‘sa_family_t’ undeclared here 
(not in a function)
 sizeof(sa_family_t) -
^~~

Two issues:
1. In the kernel struct sockaddr comes in via  but in
   userspace  is required.
2. struct sockaddr_vm has a __kernel_sa_family_t field so let's be
   consistent and use the same type for the sizeof(sa_family_t)
   calculation.

Currently userspace applications work around this broken header by first
including .  In the kernel there is no compiler error
because  provides everything.  It's worth fixing the
header file though.

Cc: Jorgen Hansen <jhan...@vmware.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 include/uapi/linux/vm_sockets.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index b4ed5d895699..4ae5c625ac56 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -18,6 +18,10 @@
 
 #include 
 
+#ifndef __KERNEL__
+#include  /* struct sockaddr */
+#endif
+
 /* Option name for STREAM socket buffer size.  Use as the option name in
  * setsockopt(3) or getsockopt(3) to set or get an unsigned long long that
  * specifies the size of the buffer underlying a vSockets STREAM socket.
@@ -146,7 +150,7 @@ struct sockaddr_vm {
unsigned int svm_port;
unsigned int svm_cid;
unsigned char svm_zero[sizeof(struct sockaddr) -
-  sizeof(sa_family_t) -
+  sizeof(__kernel_sa_family_t) -
   sizeof(unsigned short) -
   sizeof(unsigned int) - sizeof(unsigned int)];
 };
-- 
2.13.5



Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-31 Thread Stefan Hajnoczi
On Tue, Aug 29, 2017 at 03:37:07PM +, Jorgen S. Hansen wrote:
> > On Aug 29, 2017, at 4:36 AM, Dexuan Cui  wrote:
> If we allow multiple host side transports, virtio host side support and
> vmci should be able to coexist regardless of the order of initialization.

That sounds good to me.

This means af_vsock.c needs to be aware of CID allocation.  Currently the
vhost_vsock.ko driver handles this itself (it keeps a list of CIDs and
checks that they are not used twice).  It should be possible to move
that state into af_vsock.c so we have  pairs.

I'm currently working on NFS over AF_VSOCK and sock_diag support (for
ss(8) and netstat-like tools).

Multi-transport support is lower priority for me at the moment.  I'm
happy to review patches though.  If there is no progress on this by the
end of the year then I will have time to work on it.

Are either of you are in Prague, Czech Republic on October 25-27 for
Linux Kernel Summit, Open Source Summit Europe, Embedded Linux
Conference Europe, KVM Forum, or MesosCon Europe?

Stefan


Re: [PATCH net-next 3/3] hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)

2017-08-24 Thread Stefan Hajnoczi
On Tue, Aug 22, 2017 at 09:40:01PM +, Dexuan Cui wrote:
> > From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> > On Fri, Aug 18, 2017 at 10:23:54PM +, Dexuan Cui wrote:
> > > > > +static bool hvs_stream_allow(u32 cid, u32 port)
> > > > > +{
> > > > > + static const u32 valid_cids[] = {
> > > > > + VMADDR_CID_ANY,
> > > >
> > > > Is this for loopback?
> > >
> > > No, we don't support lookback in Linux VM, at least for now.
> > > In our Linux implementation, Linux VM can only connect to the host, and
> > > here when Linux VM calls connect(), I treat  VMADDR_CID_ANY
> > > the same as VMADDR_CID_HOST.
> > 
> > VMCI and virtio-vsock do not treat connect(VMADDR_CID_ANY) the same as
> > connect(VMADDR_CID_HOST).  It is an error to connect to VMADDR_CID_ANY.
> 
> Ok. Then I'll only allow VMADDR_CID_HOST as the destination CID, since 
> we don't support loopback mode.
> 
> > > > > + /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x)
> > > > is
> > > > > +  * reserved as ephemeral ports, which are used as the host's 
> > > > > ports
> > > > > +  * when the host initiates connections.
> > > > > +  */
> > > > > + if (port > MAX_HOST_LISTEN_PORT)
> > > > > + return false;
> > > >
> > > > Without this if statement the guest will attempt to connect.  I guess
> > > > there will be no listen sockets above MAX_HOST_LISTEN_PORT, so the
> > > > connection attempt will fail.
> > >
> > > You're correct.
> > > To use the vsock common infrastructure, we have to map Hyper-V's
> > > GUID <VM_ID, Service_ID> to int <cid, port>, and hence we must limit
> > > the port range we can listen() on to [0, MAX_LISTEN_PORT], i.e.
> > > we can only use half of the whole 32-bit port space for listen().
> > > This is detailed in the long comments starting at about Line 100.
> > >
> > > > ...but hardcode this knowledge into the guest driver?
> > > I'd like the guest's connect() to fail immediately here.
> > > IMO this is better than a connect timeout. :-)
> > 
> > Thanks for explaining.  Perhaps the comment could be updated:
> > 
> >  /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x) is
> >   * reserved as ephemeral ports, which are used as the host's ports when
> >   * the host initiates connections.
> >   *
> >   * Perform this check in the guest so an immediate error is produced
> >   * instead of a timeout.
> >   */
> > 
> > Stefan
> 
> Thank you, Stefan! 
> Please see the below for the updated version of the function:
> 
> static bool hvs_stream_allow(u32 cid, u32 port)
> {
> /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x) is
>  * reserved as ephemeral ports, which are used as the host's ports
>  * when the host initiates connections.
>  *
>  * Perform this check in the guest so an immediate error is produced
>  * instead of a timeout.
>  */
> if (port > MAX_HOST_LISTEN_PORT)
> return false;
> 
> if (cid == VMADDR_CID_HOST)
> return true;
> 
> return false;
> }
> 
> I'll send a v2 of the patch later today.

Great, thanks!


Re: [PATCH net-next 3/3] hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)

2017-08-22 Thread Stefan Hajnoczi
On Fri, Aug 18, 2017 at 10:23:54PM +, Dexuan Cui wrote:
> > From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> > Sent: Thursday, August 17, 2017 07:56
> > To: Dexuan Cui <de...@microsoft.com>
> > On Tue, Aug 15, 2017 at 10:18:41PM +, Dexuan Cui wrote:
> > > +static u32 hvs_get_local_cid(void)
> > > +{
> > > + return VMADDR_CID_ANY;
> > > +}
> > 
> > Interesting concept: the guest never knows its CID.  This is nice from a
> > live migration perspective.  Currently VMCI and virtio adjust listen
> > socket local CIDs after migration.
> > 
> > > +static bool hvs_stream_allow(u32 cid, u32 port)
> > > +{
> > > + static const u32 valid_cids[] = {
> > > + VMADDR_CID_ANY,
> > 
> > Is this for loopback?
> 
> No, we don't support lookback in Linux VM, at least for now.
> In our Linux implementation, Linux VM can only connect to the host, and
> here when Linux VM calls connect(), I treat  VMADDR_CID_ANY 
> the same as VMADDR_CID_HOST.

VMCI and virtio-vsock do not treat connect(VMADDR_CID_ANY) the same as
connect(VMADDR_CID_HOST).  It is an error to connect to VMADDR_CID_ANY.

> > > + VMADDR_CID_HOST,
> > > + };
> > > + int i;
> > > +
> > > + /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x)
> > is
> > > +  * reserved as ephemeral ports, which are used as the host's ports
> > > +  * when the host initiates connections.
> > > +  */
> > > + if (port > MAX_HOST_LISTEN_PORT)
> > > + return false;
> > 
> > Without this if statement the guest will attempt to connect.  I guess
> > there will be no listen sockets above MAX_HOST_LISTEN_PORT, so the
> > connection attempt will fail.
> 
> You're correct.
> To use the vsock common infrastructure, we have to map Hyper-V's
> GUID <VM_ID, Service_ID> to int <cid, port>, and hence we must limit
> the port range we can listen() on to [0, MAX_LISTEN_PORT], i.e.
> we can only use half of the whole 32-bit port space for listen().
> This is detailed in the long comments starting at about Line 100.
>  
> > ...but hardcode this knowledge into the guest driver?
> I'd like the guest's connect() to fail immediately here.
> IMO this is better than a connect timeout. :-)

Thanks for explaining.  Perhaps the comment could be updated:

 /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x) is
  * reserved as ephemeral ports, which are used as the host's ports when
  * the host initiates connections.
  *
  * Perform this check in the guest so an immediate error is produced
  * instead of a timeout.
  */

Stefan


Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-22 Thread Stefan Hajnoczi
On Fri, Aug 18, 2017 at 11:07:37PM +, Dexuan Cui wrote:
> > From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> > > CID is not really used by us, because we only support guest<->host
> > communication,
> > > and don't support guest<->guest communication. The Hyper-V host
> > references
> > > every VM by VmID (which is invisible to the VM), and a VM can only talk to
> > the
> > > host via this feature.
> > 
> > Applications running inside the guest should use VMADDR_CID_HOST (2) to
> > connect to the host, even on Hyper-V.
> I have no objection, and this patch does support this usage of the
> user-space applications.
>
> > By the way, we should collaborate on a test suite and a vsock(7) man
> > page that documents the semantics of AF_VSOCK sockets.  This way our
> > transports will have the same behavior and AF_VSOCK applications will
> > work on all 3 hypervisors.
> I can't agree more. :-)
> BTW, I have been using Rolf's test suite to test my patch:
> https://github.com/rn/virtsock/tree/master/c
> Maybe this can be a good starting point.

Thanks for sharing this, I will try it with virtio-vsock.

I have a netcat-like utility here:
https://github.com/stefanha/linux/blob/vsock-extras/nc-vsock.c

> > Not all features need to be supported.  For example, VMCI supports
> > SOCK_DGRAM while Hyper-V and virtio do not.  But features that are
> > available should behave identically.
> I totally agree, though I'm afraid Hyper-V may have a little more limitations
> compared to VMware/KVM duo to the <VM_ID, ServiceID> <--> <cid, port>
> mapping.
>  
> > > Can we use the 'protocol' parameter in the socket() function:
> > > int socket(int domain, int type, int protocol)
> > >
> > > IMO currently the 'protocol' is not really used.
> > > I think we can modify __vsock_core_init() to allow multiple transport 
> > > layers
> > to
> > >  be registered, and we can define different 'protocol' numbers for
> > > VMware/KVM/Hyper-V, and ask the application to explicitly specify what
> > should
> > > be used. Considering compatibility, we can use the default transport in a
> > given
> > > VM depending on the underlying hypervisor.
> > 
> > I think AF_VSOCK should hide the transport from users/applications.
> Ideally yes, but let's consider the KVM-on-KVM nested scenario: when
> an application in the Level-1 VM creates an AF_VSOCK socket and call
> connect() for it, how can we know if the app is trying to connect to
> the Level-0 host, or connect to the Level-2 VM? We can't.

We *can* by looking at the destination CID.  Please take a look at
drivers/misc/vmw_vmci/vmci_route.c:vmci_route() to see how VMCI handles
nested virt.

It boils down to something like this:

  static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
  int addr_len, int flags)
  {
  ...
  if (remote_addr.svm_cid == VMADDR_CID_HOST)
  transport = host_transport;
  else
  transport = guest_transport;

It's easy for connect(2) but Jorgen mentioned it's harder for listen(2)
because the socket would need to listen on both transports.  We define
two new constants VMADDR_CID_LISTEN_FROM_GUEST and
VMADDR_CID_LISTEN_FROM_HOST for bind(2) so that applications can decide
which side to listen on.  Or the listen socket could simply listen to
both sides.

Stefan


Re: [PATCH net-next 2/3] vsock: fix vsock_dequeue/enqueue_accept race

2017-08-18 Thread Stefan Hajnoczi
On Tue, Aug 15, 2017 at 10:15:39PM +, Dexuan Cui wrote:
> With the current code, when vsock_dequeue_accept() is removing a sock
> from the list, nothing prevents vsock_enqueue_accept() from adding a new
> sock into the list concurrently. We should add a lock to protect the list.

The listener sock is locked, preventing concurrent modification.  I have
checked both the virtio and vmci transports.  Can you post an example
where the listener sock isn't locked?

Stefan


signature.asc
Description: PGP signature


Re: [PATCH net-next 3/3] hv_sock: implements Hyper-V transport for Virtual Sockets (AF_VSOCK)

2017-08-18 Thread Stefan Hajnoczi
On Tue, Aug 15, 2017 at 10:18:41PM +, Dexuan Cui wrote:
> +static u32 hvs_get_local_cid(void)
> +{
> + return VMADDR_CID_ANY;
> +}

Interesting concept: the guest never knows its CID.  This is nice from a
live migration perspective.  Currently VMCI and virtio adjust listen
socket local CIDs after migration.

> +static bool hvs_stream_allow(u32 cid, u32 port)
> +{
> + static const u32 valid_cids[] = {
> + VMADDR_CID_ANY,

Is this for loopback?

> + VMADDR_CID_HOST,
> + };
> + int i;
> +
> + /* The host's port range [MIN_HOST_EPHEMERAL_PORT, 0x) is
> +  * reserved as ephemeral ports, which are used as the host's ports
> +  * when the host initiates connections.
> +  */
> + if (port > MAX_HOST_LISTEN_PORT)
> + return false;

Without this if statement the guest will attempt to connect.  I guess
there will be no listen sockets above MAX_HOST_LISTEN_PORT, so the
connection attempt will fail.

...but hardcode this knowledge into the guest driver?


signature.asc
Description: PGP signature


Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-18 Thread Stefan Hajnoczi
On Fri, Aug 18, 2017 at 03:07:30AM +, Dexuan Cui wrote:
> > From: Jorgen S. Hansen [mailto:jhan...@vmware.com]
> > Sent: Thursday, August 17, 2017 08:17
> > >
> > > Putting aside nested virtualization, I want to load the transport (vmci,
> > > Hyper-V, vsock) for which there is paravirtualized hardware present
> > > inside the guest.
> > 
> > Good points. Completely agree that this is the desired behavior for a guest.
> > 
> > 
> > > It's a little tricker on the host side (doesn't matter for Hyper-V and
> > > probably also doesn't for VMware) because the host-side driver is a
> > > software device with no hardware backing it.  In KVM we assume the
> > > vhost_vsock.ko kernel module will be loaded sufficiently early.
> > 
> > Since the vmci driver is currently tied to PF_VSOCK it hasn’t been a 
> > problem,
> > but on the host side the VMCI driver has no hardware backing it either, so
> > when we move to a more appropriate solution, this will be an issue for VMCI 
> > as
> > well. I’ll check our shipped products, but they most likely assume that if 
> > an
> > upstreamed vmci module is present, it will be loaded automatically.
> 
> Hyper-V Sockets is a standard feature of VMBus v4.0, so we can easily know
> we can and should load iff vmbus_proto_version >= VERSION_WIN10.
> 
> > > Things get trickier with nested virtualization because the VM might want
> > > to talk to its host but also to its nested VMs.  The simple way of
> > > fixing this would be to allow two transports loaded simultaneously and
> > > route traffic destined to CID 2 to the host transport and all other
> > > traffic to the guest transport.
> 
> This sounds like a little tricky to me.
> CID is not really used by us, because we only support guest<->host 
> communication,
> and don't support guest<->guest communication. The Hyper-V host references
> every VM by VmID (which is invisible to the VM), and a VM can only talk to the
> host via this feature.

Applications running inside the guest should use VMADDR_CID_HOST (2) to
connect to the host, even on Hyper-V.

By the way, we should collaborate on a test suite and a vsock(7) man
page that documents the semantics of AF_VSOCK sockets.  This way our
transports will have the same behavior and AF_VSOCK applications will
work on all 3 hypervisors.

Not all features need to be supported.  For example, VMCI supports
SOCK_DGRAM while Hyper-V and virtio do not.  But features that are
available should behave identically.

> > This is close to the routing the VMCI driver does in a nested environment, 
> > but
> > that is with the assumption that there is only one type of transport. 
> > Having two
> > different transports would require that we delay resolving the transport 
> > type
> > until the socket endpoint has been bound to an address. Things get trickier 
> > if
> > listening sockets use VMADDR_CID_ANY - if only one transport is present, 
> > this
> > would allow the socket to accept connections from both guests and outer 
> > host,
> > but with multiple transports that won’t work, since we can’t associate a 
> > socket
> > with a transport until the socket is bound.
> > 
> > >
> > > Perhaps we should discuss these cases a bit more to figure out how to
> > > avoid conflicts over MODULE_ALIAS_NETPROTO(PF_VSOCK).
> > 
> > Agreed.
> 
> Can we use the 'protocol' parameter in the socket() function:
> int socket(int domain, int type, int protocol) 
> 
> IMO currently the 'protocol' is not really used.
> I think we can modify __vsock_core_init() to allow multiple transport layers 
> to
>  be registered, and we can define different 'protocol' numbers for
> VMware/KVM/Hyper-V, and ask the application to explicitly specify what should
> be used. Considering compatibility, we can use the default transport in a 
> given
> VM depending on the underlying hypervisor. 

I think AF_VSOCK should hide the transport from users/applications.
Think of same-on-same nested virtualization: VMware-on-VMware or
KVM-on-KVM.  In that case specifying VMCI or virtio doesn't help.

We'd still need to distinguish between "to guest" and "to host"
(currently VMCI has code to do this but virtio does not).

The natural place to distinguish the destination is when dealing with
the sockaddr in connect(), bind(), etc.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-17 Thread Stefan Hajnoczi
On Thu, Aug 17, 2017 at 08:00:29AM +, Dexuan Cui wrote:
> 
> Without the patch, vmw_vsock_vmci_transport.ko can automatically load
> when an application creates an AF_VSOCK socket.
> 
> This is the expected good behavior on VMware hypervisor, but as we
> are going to add hv_sock.ko (i.e. Hyper-V transport for AF_VSOCK), we
> should make sure vmw_vsock_vmci_transport.ko can't load on Hyper-V,
> otherwise there is a -EBUSY conflict when both vmw_vsock_vmci_transport.ko
> and hv_sock.ko try to call vsock_core_init() on Hyper-V.
> 
> On the other hand, hv_sock.ko can only load on Hyper-V, because it
> depends on hv_vmbus.ko, which detects Hyper-V in hv_acpi_init().
> 
> KVM's vsock_virtio_transport doesn't have the issue because it doesn't
> define MODULE_ALIAS_NETPROTO(PF_VSOCK).

Thanks for sending this patch, vmci's MODULE_ALIAS_NETPROTO(PF_VSOCK) is
a problem for vhost_vsock.ko (the virtio host driver) too.  A host
userspace program can create a AF_VSOCK socket before vhost_vsock is
loaded.  The vmci transport will be unconditionally loaded and that's
not the right behavior.

Putting aside nested virtualization, I want to load the transport (vmci,
Hyper-V, vsock) for which there is paravirtualized hardware present
inside the guest.

It's a little tricker on the host side (doesn't matter for Hyper-V and
probably also doesn't for VMware) because the host-side driver is a
software device with no hardware backing it.  In KVM we assume the
vhost_vsock.ko kernel module will be loaded sufficiently early.

Things get trickier with nested virtualization because the VM might want
to talk to its host but also to its nested VMs.  The simple way of
fixing this would be to allow two transports loaded simultaneously and
route traffic destined to CID 2 to the host transport and all other
traffic to the guest transport.

Perhaps we should discuss these cases a bit more to figure out how to
avoid conflicts over MODULE_ALIAS_NETPROTO(PF_VSOCK).

> 
> The patch also adds a module parameter "skip_hypervisor_check" for
> vmw_vsock_vmci_transport.ko.
> 
> Signed-off-by: Dexuan Cui 
> Cc: Alok Kataria 
> Cc: Andy King 
> Cc: Adit Ranadive 
> Cc: George Zhang 
> Cc: Jorgen Hansen 
> Cc: K. Y. Srinivasan 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> ---
>  net/vmw_vsock/Kconfig  |  2 +-
>  net/vmw_vsock/vmci_transport.c | 11 +++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
> index a24369d..3f52929 100644
> --- a/net/vmw_vsock/Kconfig
> +++ b/net/vmw_vsock/Kconfig
> @@ -17,7 +17,7 @@ config VSOCKETS
>  
>  config VMWARE_VMCI_VSOCKETS
>   tristate "VMware VMCI transport for Virtual Sockets"
> - depends on VSOCKETS && VMWARE_VMCI
> + depends on VSOCKETS && VMWARE_VMCI && HYPERVISOR_GUEST
>   help
> This module implements a VMCI transport for Virtual Sockets.
>  
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 10ae782..c068873 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -73,6 +74,10 @@ struct vmci_transport_recv_pkt_info {
>   struct vmci_transport_packet pkt;
>  };
>  
> +static bool skip_hypervisor_check;
> +module_param(skip_hypervisor_check, bool, 0444);
> +MODULE_PARM_DESC(hot_add, "If set, attempt to load on non-VMware platforms");
> +
>  static LIST_HEAD(vmci_transport_cleanup_list);
>  static DEFINE_SPINLOCK(vmci_transport_cleanup_lock);
>  static DECLARE_WORK(vmci_transport_cleanup_work, vmci_transport_cleanup);
> @@ -2085,6 +2090,12 @@ static int __init vmci_transport_init(void)
>  {
>   int err;
>  
> + /* Check if we are running on VMware's hypervisor and bail out
> +  * if we are not.
> +  */
> + if (!skip_hypervisor_check && x86_hyper != _hyper_vmware)
> + return -ENODEV;
> +
>   /* Create the datagram handle that we will use to send and receive all
>* VSocket control messages for this context.
>*/
> -- 
> 2.7.4
> 


signature.asc
Description: PGP signature


Re: AF_VSOCK unimplemented sockopts

2017-08-15 Thread Stefan Hajnoczi
On Fri, Aug 11, 2017 at 09:23:17AM +, Jorgen S. Hansen wrote:
> > On Aug 3, 2017, at 12:41 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > 
> > Hi Jorgen,
> > There are 3 sockopts defined in include/uapi/linux/vm_sockets.h that are
> > currently not implemented in net/vmw_vsock/af_vsock.c:
> > 
> > * SO_VM_SOCKETS_PEER_HOST_VM_ID
> > * SO_VM_SOCKETS_TRUSTED
> > * SO_VM_SOCKETS_NONBLOCK_TXRX
> > 
> > I noticed this because SO_VM_SOCKETS_TRUSTED is interesting for
> > virtio-vsock.  Services listening on AF_VSOCK inside the guest may not
> > want arbitrary unprivileged host processes to connect.  Instead of
> > inventing a new solution I wanted to look into SO_VM_SOCKETS_TRUSTED but
> > found it is not implemented in linux.git.
> > 
> > What is the status of these sockets?
> 
> These options were only implemented for ESX host endpoints, so were never 
> part of the Linux host side support. It looks like they could have been 
> omitted from vm_sockets.h, when the initial upstreaming was performed.
> 
> On ESX, the equivalent of SO_VM_SOCKETS_TRUSTED, is used for retrieving the 
> value of s->trusted of a VMCI socket. It cannot be used to mark a socket as 
> trusted. On Linux, trusted is tied to the CAP_NET_ADMIN capability of the 
> socket creator. VMCI based vSockets will per default only allow host side 
> sockets that are trusted, or are created by the same user as the VM, to 
> communicate with a given VM. This is achieved by per default creating VMs 
> with the VMCI privilege flag VMCI_PRIVILEGE_FLAG_RESTRICTED. It is possible 
> to create a VM that isn’t restricted, in which case any host process will be 
> able to communicate with the VM.
> 
> So it should be straight forward to implement the getsockopt part of 
> SO_VM_SOCKETS_TRUSTED, since it just needs to return s->trusted.

Currently virtio-vsock does not implement the 'restricted' mode but I'm
evaluating using it by default for stronger security.  Thanks for your
response!

Stefan


signature.asc
Description: PGP signature


AF_VSOCK unimplemented sockopts

2017-08-03 Thread Stefan Hajnoczi
Hi Jorgen,
There are 3 sockopts defined in include/uapi/linux/vm_sockets.h that are
currently not implemented in net/vmw_vsock/af_vsock.c:

 * SO_VM_SOCKETS_PEER_HOST_VM_ID
 * SO_VM_SOCKETS_TRUSTED
 * SO_VM_SOCKETS_NONBLOCK_TXRX

I noticed this because SO_VM_SOCKETS_TRUSTED is interesting for
virtio-vsock.  Services listening on AF_VSOCK inside the guest may not
want arbitrary unprivileged host processes to connect.  Instead of
inventing a new solution I wanted to look into SO_VM_SOCKETS_TRUSTED but
found it is not implemented in linux.git.

What is the status of these sockopts?

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: Use sock_diag instead of procfs for new address families?

2017-07-19 Thread Stefan Hajnoczi
On Tue, Jul 18, 2017 at 09:58:38AM -0700, Stephen Hemminger wrote:
> On Tue, 18 Jul 2017 17:18:06 +0100
> Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> > I am implementing userspace access to socket information for AF_VSOCK.
> > A few hours into writing and testing a /proc/net/vsock seq_file I
> > noticed that ss(8) prefers NETLINK_SOCK_DIAG over procfs.
> > 
> > Before potentially wasting time implementing a legacy interface that
> > won't be accepted, I thought it might be good to ask :).
> > 
> > Which approach is preferred?
> > 1. New address families must implement only sock_diag.
> > 2. Both sock_diag and procfs must be implemented.
> > 3. Implement whichever interface you prefer.
> > 
> > Thanks,
> > Stefan
> 
> You are correct, I am unlikely to take any new code using /proc
> in ss.

Thanks Stephen and David, will switch to sock_diag.

Stefan


signature.asc
Description: PGP signature


Use sock_diag instead of procfs for new address families?

2017-07-18 Thread Stefan Hajnoczi
I am implementing userspace access to socket information for AF_VSOCK.
A few hours into writing and testing a /proc/net/vsock seq_file I
noticed that ss(8) prefers NETLINK_SOCK_DIAG over procfs.

Before potentially wasting time implementing a legacy interface that
won't be accepted, I thought it might be good to ask :).

Which approach is preferred?
1. New address families must implement only sock_diag.
2. Both sock_diag and procfs must be implemented.
3. Implement whichever interface you prefer.

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [PATCH] vhost: Coalesce vq_err formats, pr_fmt misuse, add missing newlines

2017-05-22 Thread Stefan Hajnoczi
On Mon, May 22, 2017 at 03:33:43AM -0700, Joe Perches wrote:
> vhost logging uses of vq_err has a few defects and style inconsistencies.
> 
> o pr_debug already uses pr_fmt so its use in vq_err is defective
>   however no #defines of pr_fmt exist in this code so no actual
>   defects exist
> o vq_err uses need terminating newlines so add the missing ones
> o Coalesce formats and realign arguments
> 
> Signed-off-by: Joe Perches <j...@perches.com>
> ---
>  drivers/vhost/net.c   | 17 ++---
>  drivers/vhost/scsi.c  | 17 ++---
>  drivers/vhost/test.c  |  4 +--
>  drivers/vhost/vhost.c | 70 
> +++
>  drivers/vhost/vhost.h | 11 
>  5 files changed, 54 insertions(+), 65 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


Re: vhost/scsi: Delete error messages for failed memory allocations in five functions

2017-05-22 Thread Stefan Hajnoczi
On Mon, May 22, 2017 at 01:34:34PM +0200, SF Markus Elfring wrote:
> >> Do you find information from a Linux allocation failure report sufficient
> >> for any function implementations here?
> > 
> > If kmalloc() and friends guarantee to print a warning and backtrace on
> > every allocation failure, then there's no need for error messages in
> > callers.
> > 
> > That seems like good justification that can go in the commit
> > description, but I'm not sure if kmalloc() and friends guarantee to show
> > a message (not just the first time, but for every failed allocation)?
> 
> I am also looking for a more complete and easier accessible documentation
> for this aspect of the desired exception handling.
> How would we like to resolve any remaining open issues there?

No objection from me but please make sure to keep vq_err().

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 2/2] vhost/scsi: Delete error messages for failed memory allocations in five functions

2017-05-22 Thread Stefan Hajnoczi
On Mon, May 22, 2017 at 03:38:33PM +0300, Dan Carpenter wrote:
> On Mon, May 22, 2017 at 12:23:20PM +0100, Stefan Hajnoczi wrote:
> > I'm not sure if kmalloc() and friends guarantee to show
> > a message (not just the first time, but for every failed allocation)?
> >
> 
> It prints multiple times, but it's ratelimited.  It can also be disabled
> using a config option.
> 
> See slab_out_of_memory().

Thanks!

Stefan


signature.asc
Description: PGP signature


Re: [Patch net] vsock: use new wait API for vsock_stream_sendmsg()

2017-05-22 Thread Stefan Hajnoczi
On Fri, May 19, 2017 at 11:21:59AM -0700, Cong Wang wrote:
> As reported by Michal, vsock_stream_sendmsg() could still
> sleep at vsock_stream_has_space() after prepare_to_wait():
> 
>   vsock_stream_has_space
> vmci_transport_stream_has_space
>   vmci_qpair_produce_free_space
> qp_lock
>   qp_acquire_queue_mutex
> mutex_lock
> 
> Just switch to the new wait API like we did for commit
> d9dc8b0f8b4e ("net: fix sleeping for sk_wait_event()").
> 
> Reported-by: Michal Kubecek <mkube...@suse.cz>
> Cc: Stefan Hajnoczi <stefa...@redhat.com>
> Cc: Jorgen Hansen <jhan...@vmware.com>
> Cc: "Michael S. Tsirkin" <m...@redhat.com>
> Cc: Claudio Imbrenda <imbre...@linux.vnet.ibm.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---
>  net/vmw_vsock/af_vsock.c | 21 -
>  1 file changed, 8 insertions(+), 13 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


Re: [PATCH 2/2] vhost/scsi: Delete error messages for failed memory allocations in five functions

2017-05-22 Thread Stefan Hajnoczi
On Mon, May 22, 2017 at 12:50:39PM +0200, SF Markus Elfring wrote:
> > Why are you trying to get rid of memory allocation failure messages?
> 
> Do you find information from a Linux allocation failure report sufficient
> for any function implementations here?

If kmalloc() and friends guarantee to print a warning and backtrace on
every allocation failure, then there's no need for error messages in
callers.

That seems like good justification that can go in the commit
description, but I'm not sure if kmalloc() and friends guarantee to show
a message (not just the first time, but for every failed allocation)?

> >> +++ b/drivers/vhost/scsi.c
> >> @@ -417,5 +417,4 @@ vhost_scsi_allocate_evt(struct vhost_scsi *vs,
> >>if (!evt) {
> >> -  vq_err(vq, "Failed to allocate vhost_scsi_evt\n");
> > 
> > #define vq_err(vq, fmt, ...) do {  \
> > pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> > if ((vq)->error_ctx)   \
> > eventfd_signal((vq)->error_ctx, 1);\
> > } while (0)
> > 
> > You silently dropped the eventfd_signal() call.
> 
> Do you prefer to preserve this special error handling then?

Yes, please leave vq_err() calls.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 2/2] vhost/scsi: Delete error messages for failed memory allocations in five functions

2017-05-22 Thread Stefan Hajnoczi
On Sat, May 20, 2017 at 04:32:17PM +0200, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sat, 20 May 2017 15:50:30 +0200
> 
> Omit seven extra messages for memory allocation failures in these functions.
> 
> This issue was detected by using the Coccinelle software.
> 
> Link: 
> http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf

Please include an actual explanation for this change instead of linking
to slides.  Why are you trying to get rid of memory allocation failure
messages?

> Signed-off-by: Markus Elfring 
> ---
>  drivers/vhost/scsi.c | 24 +++-
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 650533916c19..49d07950e2e5 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -417,5 +417,4 @@ vhost_scsi_allocate_evt(struct vhost_scsi *vs,
>   if (!evt) {
> - vq_err(vq, "Failed to allocate vhost_scsi_evt\n");

#define vq_err(vq, fmt, ...) do {  \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
if ((vq)->error_ctx)   \
eventfd_signal((vq)->error_ctx, 1);\
} while (0)

You silently dropped the eventfd_signal() call.  Please explain.


signature.asc
Description: PGP signature


Re: [PATCH 1/2] vhost/scsi: Improve a size determination in four functions

2017-05-22 Thread Stefan Hajnoczi
On Sat, May 20, 2017 at 04:31:13PM +0200, SF Markus Elfring wrote:
> From: Markus Elfring <elfr...@users.sourceforge.net>
> Date: Sat, 20 May 2017 13:48:44 +0200
> 
> Replace the specification of four data structures by pointer dereferences
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring <elfr...@users.sourceforge.net>
> ---
>  drivers/vhost/scsi.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>


signature.asc
Description: PGP signature


  1   2   3   >