Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Maxime Coquelin




On 06/26/2018 11:14 AM, Tiwei Bie wrote:

On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:

-Original Message-
From: Bie, Tiwei
Sent: Tuesday, June 26, 2018 10:22 AM
To: Stojaczyk, DariuszX 
Cc: Dariusz Stojaczyk ; dev@dpdk.org; Maxime
Coquelin ; Tetsuya Mukawa
; Stefan Hajnoczi ; Thomas
Monjalon ; y...@fridaylinux.org; Harris, James R
; Kulasek, TomaszX ;
Wodkowski, PawelX 
Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:

-Original Message-
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
Sent: Monday, June 25, 2018 1:02 PM


Hi Dariusz,



Hi Tiwei,


Thank you for putting efforts in making the DPDK
vhost more generic!

 From my understanding, your proposal is that:

1) Introduce rte_vhost2 to provide the APIs which
allow users to implement vhost backends like
SCSI, net, crypto, ..



That's right.


2) Refactor the existing rte_vhost to use rte_vhost2.
The rte_vhost will still provide below existing
sets of APIs:
 1. The APIs which allow users to implement
external vhost backends (these APIs were
designed for SPDK previously)
 2. The APIs provided by the net backend
 3. The APIs provided by the crypto backend
And above APIs in rte_vhost won't be changed.


That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops

underneath and will call existing vhost_device_ops for e.g. starting the device
once all queues are started.

Currently I have below concerns and questions:

- The rte_vhost's problem is still there. Even though
   rte_vhost2 is introduced, the net and crypto backends
   in rte_vhost won't benefit from the new callbacks.

   The existing rte_vhost in DPDK not only provides the
   APIs for DPDK applications to implement the external
   backends. But also provides high performance net and
   crypto backends implementation (maybe more in the
   future). So it's important that besides the DPDK
   applications which implement their external backends,
   the DPDK applications which use the builtin backends
   will also benefit from the new callbacks.

   So we should have a clear plan on how will the legacy
   callbacks in rte_vhost be dealt with in the next step.

   Besides, the new library's name is a bit misleading.
   It makes the existing rte_vhost library sound like an
   obsolete library. But actually the existing rte_vhost
   isn't an obsolete library. It will still provide the
   net and crypto backends. So if we want to introduce
   this new library, we should give it a better name.

- It's possible to solve rte_vhost's problem you met
   by refactoring the existing vhost library directly
   instead of re-implementing a new vhost library from
   scratch and keeping the old one's problem as is.

   In this way, it will solve the problem you met and
   also solve the problem for rte_vhost. Why not go
   this way? Something like:

   Below is the existing callbacks set in rte_vhost.h:

   /**
* Device and vring operations.
*/
   struct vhost_device_ops {
   ..
   };

   It's a legacy implementation, and doesn't really
   follow the DPDK API design (e.g. no rte_ prefix).
   We can design and implement a new message handling
   and a new set of callbacks for rte_vhost to solve
   the problem you met without changing the old one.
   Something like:

   struct rte_vhost_device_ops {
   ..
   }

   int
   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
*msg)
   {
   ..

   if (!vdev->is_using_new_device_ops) {
   // Call the existing message handler
   return vhost_user_msg_handler_legacy(vdev, msg);
   }

   // Implement the new logic here
   ..
   }

   A vhost application is allowed to register only struct
   rte_vhost_device_ops or struct vhost_device_ops (which
   should be deprecated in the future). The two ops cannot
   be registered at the same time.

   The existing applications could use the old ops. And
   if an application registers struct rte_vhost_device_ops,
   the new callbacks and message handler will be used.


Please notice that some features like vIOMMU are not even a part of the public 
rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating 
vhost-net from a generic vhost library (rte_vhost2) would avoid making such 
design mistakes in future. What's the point of having a single rte_vhost 
library, if some vhost-user features are only implemented for vhost-net.


These APIs can be safely added at any time.
And we can also ask developers to add public
APIs if it matters when adding new features
in the future. I don't think it's a big
problem.


+1, I don't think it is a problem.
It is better to have it internal only at the beginning 

Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Tiwei Bie
On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:
> > -Original Message-
> > From: Bie, Tiwei
> > Sent: Tuesday, June 26, 2018 10:22 AM
> > To: Stojaczyk, DariuszX 
> > Cc: Dariusz Stojaczyk ; dev@dpdk.org; Maxime
> > Coquelin ; Tetsuya Mukawa
> > ; Stefan Hajnoczi ; Thomas
> > Monjalon ; y...@fridaylinux.org; Harris, James R
> > ; Kulasek, TomaszX ;
> > Wodkowski, PawelX 
> > Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> > 
> > On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > > -Original Message-
> > > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> > > > Sent: Monday, June 25, 2018 1:02 PM
> > > > 
> > > >
> > > > Hi Dariusz,
> > > >
> > >
> > > Hi Tiwei,
> > >
> > > > Thank you for putting efforts in making the DPDK
> > > > vhost more generic!
> > > >
> > > > From my understanding, your proposal is that:
> > > >
> > > > 1) Introduce rte_vhost2 to provide the APIs which
> > > >allow users to implement vhost backends like
> > > >SCSI, net, crypto, ..
> > > >
> > >
> > > That's right.
> > >
> > > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > > >The rte_vhost will still provide below existing
> > > >sets of APIs:
> > > > 1. The APIs which allow users to implement
> > > >external vhost backends (these APIs were
> > > >designed for SPDK previously)
> > > > 2. The APIs provided by the net backend
> > > > 3. The APIs provided by the crypto backend
> > > >And above APIs in rte_vhost won't be changed.
> > >
> > > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> > underneath and will call existing vhost_device_ops for e.g. starting the 
> > device
> > once all queues are started.
> > 
> > Currently I have below concerns and questions:
> > 
> > - The rte_vhost's problem is still there. Even though
> >   rte_vhost2 is introduced, the net and crypto backends
> >   in rte_vhost won't benefit from the new callbacks.
> > 
> >   The existing rte_vhost in DPDK not only provides the
> >   APIs for DPDK applications to implement the external
> >   backends. But also provides high performance net and
> >   crypto backends implementation (maybe more in the
> >   future). So it's important that besides the DPDK
> >   applications which implement their external backends,
> >   the DPDK applications which use the builtin backends
> >   will also benefit from the new callbacks.
> > 
> >   So we should have a clear plan on how will the legacy
> >   callbacks in rte_vhost be dealt with in the next step.
> > 
> >   Besides, the new library's name is a bit misleading.
> >   It makes the existing rte_vhost library sound like an
> >   obsolete library. But actually the existing rte_vhost
> >   isn't an obsolete library. It will still provide the
> >   net and crypto backends. So if we want to introduce
> >   this new library, we should give it a better name.
> > 
> > - It's possible to solve rte_vhost's problem you met
> >   by refactoring the existing vhost library directly
> >   instead of re-implementing a new vhost library from
> >   scratch and keeping the old one's problem as is.
> > 
> >   In this way, it will solve the problem you met and
> >   also solve the problem for rte_vhost. Why not go
> >   this way? Something like:
> > 
> >   Below is the existing callbacks set in rte_vhost.h:
> > 
> >   /**
> >* Device and vring operations.
> >*/
> >   struct vhost_device_ops {
> >   ..
> >   };
> > 
> >   It's a legacy implementation, and doesn't really
> >   follow the DPDK API design (e.g. no rte_ prefix).
> >   We can design and implement a new message handling
> >   and a new set of callbacks for rte_vhost to solve
> >   the problem you met without changing the old one.
> >   Something like:
> > 
> >   struct rte_vhost_device_ops {
> >   ..
> >   }
> > 
> >   int
> >   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> > *msg)
> >   {
> >   ..
> > 
> >   if (!vdev->is_using_new_d

Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Stojaczyk, DariuszX


> -Original Message-
> From: Bie, Tiwei
> Sent: Tuesday, June 26, 2018 10:22 AM
> To: Stojaczyk, DariuszX 
> Cc: Dariusz Stojaczyk ; dev@dpdk.org; Maxime
> Coquelin ; Tetsuya Mukawa
> ; Stefan Hajnoczi ; Thomas
> Monjalon ; y...@fridaylinux.org; Harris, James R
> ; Kulasek, TomaszX ;
> Wodkowski, PawelX 
> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> 
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > -Original Message-
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> > > Sent: Monday, June 25, 2018 1:02 PM
> > > 
> > >
> > > Hi Dariusz,
> > >
> >
> > Hi Tiwei,
> >
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > >
> > > From my understanding, your proposal is that:
> > >
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > >allow users to implement vhost backends like
> > >SCSI, net, crypto, ..
> > >
> >
> > That's right.
> >
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > >The rte_vhost will still provide below existing
> > >sets of APIs:
> > > 1. The APIs which allow users to implement
> > >external vhost backends (these APIs were
> > >designed for SPDK previously)
> > > 2. The APIs provided by the net backend
> > > 3. The APIs provided by the crypto backend
> > >And above APIs in rte_vhost won't be changed.
> >
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> underneath and will call existing vhost_device_ops for e.g. starting the 
> device
> once all queues are started.
> 
> Currently I have below concerns and questions:
> 
> - The rte_vhost's problem is still there. Even though
>   rte_vhost2 is introduced, the net and crypto backends
>   in rte_vhost won't benefit from the new callbacks.
> 
>   The existing rte_vhost in DPDK not only provides the
>   APIs for DPDK applications to implement the external
>   backends. But also provides high performance net and
>   crypto backends implementation (maybe more in the
>   future). So it's important that besides the DPDK
>   applications which implement their external backends,
>   the DPDK applications which use the builtin backends
>   will also benefit from the new callbacks.
> 
>   So we should have a clear plan on how will the legacy
>   callbacks in rte_vhost be dealt with in the next step.
> 
>   Besides, the new library's name is a bit misleading.
>   It makes the existing rte_vhost library sound like an
>   obsolete library. But actually the existing rte_vhost
>   isn't an obsolete library. It will still provide the
>   net and crypto backends. So if we want to introduce
>   this new library, we should give it a better name.
> 
> - It's possible to solve rte_vhost's problem you met
>   by refactoring the existing vhost library directly
>   instead of re-implementing a new vhost library from
>   scratch and keeping the old one's problem as is.
> 
>   In this way, it will solve the problem you met and
>   also solve the problem for rte_vhost. Why not go
>   this way? Something like:
> 
>   Below is the existing callbacks set in rte_vhost.h:
> 
>   /**
>* Device and vring operations.
>*/
>   struct vhost_device_ops {
>   ..
>   };
> 
>   It's a legacy implementation, and doesn't really
>   follow the DPDK API design (e.g. no rte_ prefix).
>   We can design and implement a new message handling
>   and a new set of callbacks for rte_vhost to solve
>   the problem you met without changing the old one.
>   Something like:
> 
>   struct rte_vhost_device_ops {
>   ..
>   }
> 
>   int
>   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> *msg)
>   {
>   ..
> 
>   if (!vdev->is_using_new_device_ops) {
>   // Call the existing message handler
>   return vhost_user_msg_handler_legacy(vdev, msg);
>   }
> 
>   // Implement the new logic here
>   ..
>   }
> 
>   A vhost application is allowed to register only struct
>   rte_vhost_device_ops or struct vhost_device_ops (which
>   should be deprecated in the future). The two ops cannot
>   be registered at the same time.
> 
>   The existing applications could use the old ops. And
>   if an application registers struct rte_vhost_device_ops,
>   the new callbacks and message handler will be used.

Please notice that some features like vIOMMU are not even a part of the public 
rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating 
vhost-net from a generic vhost library (rte_vhost2) would avoid making such 
design mistakes in future. What's the point of having a single rte_vhost 
library, if some vhost-user features are only implemented for vhost-net.

> 
> Best regards,
> Tiwei Bie
> 
> 
> > Regards,
> > D.
> >
> > >
> > > Is my above understanding correct? Thanks!
> > >
> > > Best regards,
> > > Tiwei Bie
> > >


Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Thomas Monjalon
26/06/2018 10:22, Tiwei Bie:
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > From: Tiwei Bie
> > > 
> > > Hi Dariusz,
> > 
> > Hi Tiwei,
> > 
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > > 
> > > From my understanding, your proposal is that:
> > > 
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > >allow users to implement vhost backends like
> > >SCSI, net, crypto, ..
> > 
> > That's right.
> > 
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > >The rte_vhost will still provide below existing
> > >sets of APIs:
> > > 1. The APIs which allow users to implement
> > >external vhost backends (these APIs were
> > >designed for SPDK previously)
> > > 2. The APIs provided by the net backend
> > > 3. The APIs provided by the crypto backend
> > >And above APIs in rte_vhost won't be changed.
> > 
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops 
> > underneath and will call existing vhost_device_ops for e.g. starting the 
> > device once all queues are started.
> 
> Currently I have below concerns and questions:
> 
> - The rte_vhost's problem is still there. Even though
>   rte_vhost2 is introduced, the net and crypto backends
>   in rte_vhost won't benefit from the new callbacks.
> 
>   The existing rte_vhost in DPDK not only provides the
>   APIs for DPDK applications to implement the external
>   backends. But also provides high performance net and
>   crypto backends implementation (maybe more in the
>   future). So it's important that besides the DPDK
>   applications which implement their external backends,
>   the DPDK applications which use the builtin backends
>   will also benefit from the new callbacks.
> 
>   So we should have a clear plan on how will the legacy
>   callbacks in rte_vhost be dealt with in the next step.
> 
>   Besides, the new library's name is a bit misleading.
>   It makes the existing rte_vhost library sound like an
>   obsolete library. But actually the existing rte_vhost
>   isn't an obsolete library. It will still provide the
>   net and crypto backends. So if we want to introduce
>   this new library, we should give it a better name.
> 
> - It's possible to solve rte_vhost's problem you met
>   by refactoring the existing vhost library directly
>   instead of re-implementing a new vhost library from
>   scratch and keeping the old one's problem as is.

+1

>   In this way, it will solve the problem you met and
>   also solve the problem for rte_vhost. Why not go
>   this way? Something like:
> 
>   Below is the existing callbacks set in rte_vhost.h:
> 
>   /**
>* Device and vring operations.
>*/
>   struct vhost_device_ops {
>   ..
>   };
> 
>   It's a legacy implementation, and doesn't really
>   follow the DPDK API design (e.g. no rte_ prefix).
>   We can design and implement a new message handling
>   and a new set of callbacks for rte_vhost to solve
>   the problem you met without changing the old one.
>   Something like:
> 
>   struct rte_vhost_device_ops {
>   ..
>   }
> 
>   int
>   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
>   {
>   ..
> 
>   if (!vdev->is_using_new_device_ops) {
>   // Call the existing message handler
>   return vhost_user_msg_handler_legacy(vdev, msg);
>   }
> 
>   // Implement the new logic here
>   ..
>   }
> 
>   A vhost application is allowed to register only struct
>   rte_vhost_device_ops or struct vhost_device_ops (which
>   should be deprecated in the future). The two ops cannot
>   be registered at the same time.
> 
>   The existing applications could use the old ops. And
>   if an application registers struct rte_vhost_device_ops,
>   the new callbacks and message handler will be used.






Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Tiwei Bie
On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > -Original Message-
> > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> > Sent: Monday, June 25, 2018 1:02 PM
> > 
> > 
> > Hi Dariusz,
> > 
> 
> Hi Tiwei,
> 
> > Thank you for putting efforts in making the DPDK
> > vhost more generic!
> > 
> > From my understanding, your proposal is that:
> > 
> > 1) Introduce rte_vhost2 to provide the APIs which
> >allow users to implement vhost backends like
> >SCSI, net, crypto, ..
> > 
> 
> That's right.
> 
> > 2) Refactor the existing rte_vhost to use rte_vhost2.
> >The rte_vhost will still provide below existing
> >sets of APIs:
> > 1. The APIs which allow users to implement
> >external vhost backends (these APIs were
> >designed for SPDK previously)
> > 2. The APIs provided by the net backend
> > 3. The APIs provided by the crypto backend
> >And above APIs in rte_vhost won't be changed.
> 
> That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops 
> underneath and will call existing vhost_device_ops for e.g. starting the 
> device once all queues are started.

Currently I have below concerns and questions:

- The rte_vhost's problem is still there. Even though
  rte_vhost2 is introduced, the net and crypto backends
  in rte_vhost won't benefit from the new callbacks.

  The existing rte_vhost in DPDK not only provides the
  APIs for DPDK applications to implement the external
  backends. But also provides high performance net and
  crypto backends implementation (maybe more in the
  future). So it's important that besides the DPDK
  applications which implement their external backends,
  the DPDK applications which use the builtin backends
  will also benefit from the new callbacks.

  So we should have a clear plan on how will the legacy
  callbacks in rte_vhost be dealt with in the next step.

  Besides, the new library's name is a bit misleading.
  It makes the existing rte_vhost library sound like an
  obsolete library. But actually the existing rte_vhost
  isn't an obsolete library. It will still provide the
  net and crypto backends. So if we want to introduce
  this new library, we should give it a better name.

- It's possible to solve rte_vhost's problem you met
  by refactoring the existing vhost library directly
  instead of re-implementing a new vhost library from
  scratch and keeping the old one's problem as is.

  In this way, it will solve the problem you met and
  also solve the problem for rte_vhost. Why not go
  this way? Something like:

  Below is the existing callbacks set in rte_vhost.h:

  /**
   * Device and vring operations.
   */
  struct vhost_device_ops {
  ..
  };

  It's a legacy implementation, and doesn't really
  follow the DPDK API design (e.g. no rte_ prefix).
  We can design and implement a new message handling
  and a new set of callbacks for rte_vhost to solve
  the problem you met without changing the old one.
  Something like:

  struct rte_vhost_device_ops {
  ..
  }

  int
  vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
  {
  ..

  if (!vdev->is_using_new_device_ops) {
  // Call the existing message handler
  return vhost_user_msg_handler_legacy(vdev, msg);
  }

  // Implement the new logic here
  ..
  }

  A vhost application is allowed to register only struct
  rte_vhost_device_ops or struct vhost_device_ops (which
  should be deprecated in the future). The two ops cannot
  be registered at the same time.

  The existing applications could use the old ops. And
  if an application registers struct rte_vhost_device_ops,
  the new callbacks and message handler will be used.

Best regards,
Tiwei Bie


> Regards,
> D.
> 
> > 
> > Is my above understanding correct? Thanks!
> > 
> > Best regards,
> > Tiwei Bie
> > 


Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-25 Thread Stojaczyk, DariuszX


> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> Sent: Monday, June 25, 2018 1:02 PM
> 
> 
> Hi Dariusz,
> 

Hi Tiwei,

> Thank you for putting efforts in making the DPDK
> vhost more generic!
> 
> From my understanding, your proposal is that:
> 
> 1) Introduce rte_vhost2 to provide the APIs which
>allow users to implement vhost backends like
>SCSI, net, crypto, ..
> 

That's right.

> 2) Refactor the existing rte_vhost to use rte_vhost2.
>The rte_vhost will still provide below existing
>sets of APIs:
> 1. The APIs which allow users to implement
>external vhost backends (these APIs were
>designed for SPDK previously)
> 2. The APIs provided by the net backend
> 3. The APIs provided by the crypto backend
>And above APIs in rte_vhost won't be changed.

That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath 
and will call existing vhost_device_ops for e.g. starting the device once all 
queues are started.
Regards,
D.

> 
> Is my above understanding correct? Thanks!
> 
> Best regards,
> Tiwei Bie
> 


Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-25 Thread Tiwei Bie
On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed.
> 
> The problem can be temporarily worked around by start-stopping entire
> device (all queues) on each vhost-user message that could change queue
> state. This would increase general VM boot time and is totally
> unacceptable though.
> 
> Fixing rte_vhost properly would require quite a big amount
> of changes which would completely break backward compatibility.
> We're introducing a new rte_vhost2 library intended to smooth the transition.
> It exposes a low-level API for implementing new vhost devices.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> underneath and demanding users could now use rte_vhost2 directly.
> 
> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
> 
> rte_vhost2 abstracts away most vhost-user/virtio-vhost-user specifics
> and allows developers to implement vhost devices with an ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed at
> runtime, etc.
> 
> Compared to the rte_vhost, this lib additionally allows the following:
> 
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
>   the event handling for asynchronous applications and doesn't
>   make anything harder for synchronous ones.
> * this is low-level vhost API. It doesn't have any vhost-net, nvme
>   or crypto references. These backend-specific libraries will
>   be later refactored to use *this* generic library underneath.
>   This implies that the library doesn't do any virtqueue processing,
>   it only delivers vring addresses to the user, so he can process
>   virtqueues by himself.
> * abstracting away Unix domain sockets (vhost-user) and PCI
>   (virtio-vhost-user).
> * APIs for interrupt-driven drivers
> * replacing gpa_to_vva function with an IOMMU-aware variant
> * The API imposes how public functions can be called and how
>   internal data can change, so there's only a minimal work required
>   to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
> 
> The proposed structure for the new library is described below.
>  * rte_vhost2.h
>- public API
>- registering targets with provided ops
>- unregistering targets
>- iova_to_vva()
>  * transport.h/.c
>- implements rte_vhost2.h
>- allows registering vhost transports, which are opaquely required by the
>  rte_vhost2.h API (target register function requires transport name).
>- exposes a set of callbacks to be implemented by each transport
>  * vhost_user.c
>- vhost-user Unix domain socket transport
>- does recvmsg()
>- uses the underlying vhost-user helper lib to process messages, but still
>  handles some transport-specific ones, e.g. SET_MEM_TABLE
>- calls some of the rte_vhost2.h ops registered with a target
>  * fd_man.h/.c
>- polls provided descriptors, calls user callbacks on fd events
>- based on the original rte_vhost version
>- additionally allows calling user-provided callbacks on the poll thread
>  * vhost.h/.c
>- a transport-agnostic vhost-user library
>- calls most of the rte_vhost2.h ops registered with a target
>- manages virtqueues state
>- hopefully to be reused by the virtio-vhost-user
>- exposes a set of callbacks to be implemented by each transport
>  (for e.g. sending message replies)
> 
> This series includes only vhost-user transport. Virtio-vhost-user
> is to be done later.
> 
> The following items are still TBD:
>   * vhost-user slave channel
>   * get/set_config
>   * cfg_call() implementation
>   * IOTLB
>   * NUMA awareness
>   * Live migration
>   * vhost-net implementation (to port rte_vhost over)
>   * vhost-crypto implementation (to port rte_vhost over)
>   * vhost-user client mode (connecting to an external socket)
>   * various error logging
> 
> Dariusz Stojaczyk (7):
>   vhost2: add initial implementation
>   vhost2: add vhost-user helper layer
>   vhost2/vhost: handle queue_stop/device_destroy failure
>   vhost2: add asynchronous fd_man
>   vhost2: add initial vhost-user transport
>   vhost2/vhost: support protocol features
>   vhost2/user: implement tgt unregister path

Hi Dariusz,

Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-13 Thread Dariusz Stojaczyk
Hi Stefan,
I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.

pt., 8 cze 2018 o 15:29 Stefan Hajnoczi  napisaƂ(a):
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> >  * rte_vhost2.h
> >- public API
> >- registering targets with provided ops
> >- unregistering targets
> >- iova_to_vva()
> >  * transport.h/.c
> >- implements rte_vhost2.h
> >- allows registering vhost transports, which are opaquely required by the
> >  rte_vhost2.h API (target register function requires transport name).
> >- exposes a set of callbacks to be implemented by each transport
> >  * vhost_user.c
> >- vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects.  If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).

Ack. Also, virtio-vhost-user transport still has to be placed in
drivers/ directory, right? We could move most of this library
somewhere into drivers/, leaving only rte_vhost2.h, transport.h and
transport.c in lib/librte_vhost2. What do you think?

>
> >- does recvmsg()
> >- uses the underlying vhost-user helper lib to process messages, but 
> > still
> >  handles some transport-specific ones, e.g. SET_MEM_TABLE
> >- calls some of the rte_vhost2.h ops registered with a target
> >  * fd_man.h/.c
> >- polls provided descriptors, calls user callbacks on fd events
> >- based on the original rte_vhost version
> >- additionally allows calling user-provided callbacks on the poll thread
>
> Ths is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use existing (e)poll functionality in DPDK?

We have to use poll here, and I haven't seen any DPDK APIs for poll,
only rte_epoll_*. Since received vhost-user messages may be handled
asynchronously, we have to temporarily remove an fd from the poll
group for the time each message is handled. We do it by setting the fd
in the polled fds array to -1. man page for poll(2) explicitly
suggests this to ignore poll() events.

>
> >  * vhost.h/.c
> >- a transport-agnostic vhost-user library
> >- calls most of the rte_vhost2.h ops registered with a target
> >- manages virtqueues state
> >- hopefully to be reused by the virtio-vhost-user
> >- exposes a set of callbacks to be implemented by each transport
> >  (for e.g. sending message replies)
> >
> > This series includes only vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> >   * vhost-user slave channel
> >   * get/set_config
> >   * cfg_call() implementation
> >   * IOTLB
> >   * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts.  NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
>on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
>node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
>storage device (e.g. NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to virtual machine configuration (QEMU command-line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.

I agree on all points.

> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.

Of course. Since we offer raw vq pointers, they will be able to use
get_mempolicy(2). No additional APIs required.

>
> >   * Live migration
>
> Another important feature to design in from the beginning.

This is being worked on at the moment.
Thanks,
D.