Re: [dpdk-dev] [PATCH v2 1/3] vhost: fix malloc in rte_vhost_get_mem_table

2017-05-11 Thread Stojaczyk, DariuszX
The size variable is still used a few lines later:
memcpy(m->regions, dev->mem->regions, size);
That line is ok. Only the amount of malloc'ed memory was too small.

-Original Message-
From: Jens Freimann [mailto:jfrei...@redhat.com] 
Sent: Thursday, May 11, 2017 1:42 PM
To: Stojaczyk, DariuszX 
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 1/3] vhost: fix malloc in 
rte_vhost_get_mem_table

On Thu, May 11, 2017 at 12:56:46PM +0200, Dariusz Stojaczyk wrote:
> Amount of allocated memory was too small, causing buffer overflow.
> 
> Signed-off-by: Dariusz Stojaczyk 
> ---
> Removed Gerrit Change-Id
>  lib/librte_vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c index 
> 0b19d2e..1f565fb 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -369,7 +369,7 @@ rte_vhost_get_mem_table(int vid, struct rte_vhost_memory 
> **mem)
>   return -1;
>  
>   size = dev->mem->nregions * sizeof(struct rte_vhost_mem_region);
> - m = malloc(size);
> + m = malloc(sizeof(struct rte_vhost_memory) + size);

Why not just add it to the line above where size is calculated?
With that changed,

Reviewed-by: Jens Freimann  


regards,
Jens



Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial 
Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | 
Kapital zakladowy 200.000 PLN.

Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i 
moze zawierac informacje poufne. W razie przypadkowego otrzymania tej 
wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; 
jakiekolwiek
przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole 
use of the intended recipient(s). If you are not the intended recipient, please 
contact the sender and delete all copies; any review or distribution by
others is strictly prohibited.



Re: [dpdk-dev] [PATCH] vhost: fix deadlock on rte_vhost_driver_unregister()

2017-05-16 Thread Stojaczyk, DariuszX
> -Original Message-
> From: Stojaczyk, DariuszX
> Sent: Tuesday, May 16, 2017 4:35 PM
> To: dev@dpdk.org
> Cc: Wodkowski, PawelX ; Stojaczyk, DariuszX
> 
> Subject: [PATCH] vhost: fix deadlock on rte_vhost_driver_unregister()
> 
> Consider the following scenario, threads A and B:
> (A)
>  * fdset_event_dispatch() start
>* pfdentry->busy = 1;
>* vhost_user_read_cb() start
>  * vhost_destroy_device() start
> (B)
>  * rte_vhost_driver_unregister() start
>* pthread_mutex_lock(&vsocket->conn_mutex);
>* fdset_del()
>  * endless loop, waiting for pfdentry->busy == 0
> (A)
>  * vhost_destroy_device() end
>  * pthread_mutex_lock(&vsocket->conn_mutex);
>(mutex already locked - deadlock at this point)
> 
> Thread B has locked vsocket->conn_mutex and is in while(1) loop waiting for
> given fd to change it's busy flag to 0.
> Thread A would have to finish vhost_user_read_cb() in order to set busy flag
> back to 0, but that can't happen due to the vsocket->conn_mutex lock.
> 
> This patch synchronizes rte_vhost_driver_unregister() with
> vhost_user_read_cb() through vhost_user.mutex.
> 
> Signed-off-by: Dariusz Stojaczyk 
> ---
>  lib/librte_vhost/socket.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c index
> c7f99b0..77e58fe 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -273,6 +273,8 @@ vhost_user_read_cb(int connfd, void *dat, int
> *remove)
> 
>   ret = vhost_user_msg_handler(conn->vid, connfd);
>   if (ret < 0) {
> + pthread_mutex_lock(&vhost_user.mutex);
> +
>   close(connfd);
>   *remove = 1;
>   vhost_destroy_device(conn->vid);
> @@ -287,6 +289,8 @@ vhost_user_read_cb(int connfd, void *dat, int
> *remove)
>   create_unix_socket(vsocket);
>   vhost_user_start_client(vsocket);
>   }
> +
> + pthread_mutex_unlock(&vhost_user.mutex);
>   }
>  }
> 
> --
> 2.7.4

I've found other cases where rte_vhost_driver_unregister() can still deadlock. 
Please do not merge this patch now, I will try to come up with a different 
solution for this issue.


Re: [dpdk-dev] [PATCH v3 3/3] vhost: access VhostUsrMsg via packed struct

2017-05-24 Thread Stojaczyk, DariuszX
> This is for fixing compile warnings with new clang 4.0?
> 
> http://dpdk.org/ml/archives/dev/2017-April/064089.html
> 
> If so, please show the exact warning in the commit log.
> 

Everything compiles, but is undefined behavior.  Accessing packed struct's 
fields through pointers would have to be done as following:
e.g vhost_user_set_vring_addr(struct virtio_net *dev, struct vhost_vring_addr 
*addr __attribute__((aligned(1)))
Since the code above is unacceptable, this patch makes all functions take 
pointer to the parent struct (VhostUserMsg)

> >
> > Signed-off-by: Daniel Verkamp 
> > Signed-off-by: Dariusz Stojaczyk 
> > ---
> > Fixed checkpatch warnings
> 
> It's likely it will be easily missed while review. We normally do that:
> 
> ---
> 
> v3: fix checkpatch warnings
> 
> v2: remove gerrit id
> 
>   --yliu

Thanks, I'll stick with it from now on


Re: [dpdk-dev] [PATCH] memory: do not use base-virtaddr in secondary processes

2018-06-18 Thread Stojaczyk, DariuszX

> -Original Message-
> From: Alejandro Lucero [mailto:alejandro.luc...@netronome.com]
> Sent: Monday, June 18, 2018 7:22 PM
> 
> Should not be better to handle these allocations being aware about the
> problem for secondary processes?
> 
> I do not know exactly what are the (other) reasons behind base-virtaddr,
> but it turns out NFP requires this to be used when DPDK apps executed
> by non-root users.
> 
> I'm working on a RFC for handling our specific case, that could also be
> required for other devices, and this change would make the NFP unusable
> for the secondary processes.
> 

The only place base-virtaddr is used in secondary processes in DPDK 18.05 is 
this shadow memseg mapping, which shouldn't really need to be accessed by 
anyone else than DPDK EAL. Can you point me out to an NFP guide or some code 
that describes this in more detail?

If we're talking about base-virtaddr for hugepages, then that's always 
inherited from the primary process, regardless of what base-virtaddr is 
supplied to the secondary.

Regards!
D.


Re: [dpdk-dev] [PATCH] memory: do not use base-virtaddr in secondary processes

2018-06-18 Thread Stojaczyk, DariuszX


> -Original Message-
> From: Alejandro Lucero [mailto:alejandro.luc...@netronome.com]
> Sent: Monday, June 18, 2018 9:33 PM
> 
> On Mon, Jun 18, 2018 at 8:03 PM, Stojaczyk, DariuszX
> mailto:dariuszx.stojac...@intel.com> >
> wrote:
> 
>   Can you point me out to an NFP guide or some code that describes
> this in more detail?
> 
> 
> 
> As I said, I'm working on a RFC. I will send something shortly. But I could 
> give
> you an advance: the hugepages needs to be mapped below certain virtual
> address, 1TB, and I'm afraid that includes the primary and also the
> secondary processes. At least if any process can send or receive packets
> to/from a NFP.
> 
> 

Thanks, I'm pretty sure we're safe, then.

> 
>   If we're talking about base-virtaddr for hugepages, then that's always
> inherited from the primary process, regardless of what base-virtaddr is
> supplied to the secondary.
> 
> 
> 
> 
> But, is not your patch avoiding to use that base-virtaddr for secondary
> processes?

I see now that the patch name is slightly misleading. Maybe I shouldn’t pick 
such a catchy title. Let me clarify: As of DPDK 18.05, --base-virtaddr param 
for secondary process applications only affects that shadow memseg metadata 
that's not useful for anyone, but can still do a lot of harm. Hugepage memory 
in secondary processes is always mapped to the same addresses the primary 
process uses.

D.


Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-25 Thread Stojaczyk, DariuszX


> -Original Message-
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> Sent: Monday, June 25, 2018 1:02 PM
> 
> 
> Hi Dariusz,
> 

Hi Tiwei,

> Thank you for putting efforts in making the DPDK
> vhost more generic!
> 
> From my understanding, your proposal is that:
> 
> 1) Introduce rte_vhost2 to provide the APIs which
>allow users to implement vhost backends like
>SCSI, net, crypto, ..
> 

That's right.

> 2) Refactor the existing rte_vhost to use rte_vhost2.
>The rte_vhost will still provide below existing
>sets of APIs:
> 1. The APIs which allow users to implement
>external vhost backends (these APIs were
>designed for SPDK previously)
> 2. The APIs provided by the net backend
> 3. The APIs provided by the crypto backend
>And above APIs in rte_vhost won't be changed.

That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath 
and will call existing vhost_device_ops for e.g. starting the device once all 
queues are started.
Regards,
D.

> 
> Is my above understanding correct? Thanks!
> 
> Best regards,
> Tiwei Bie
> 


Re: [dpdk-dev] [PATCH 1/2] eal/thread: fix return codes for rte_thread_setname()

2018-06-25 Thread Stojaczyk, DariuszX



> -Original Message-
> From: Olivier Matz [mailto:olivier.m...@6wind.com]
> Sent: Monday, June 25, 2018 4:36 PM
> To: Burakov, Anatoly 
> Cc: Stojaczyk, DariuszX ; dev@dpdk.org;
> thomas.monja...@6wind.com; sta...@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 1/2] eal/thread: fix return codes for
> rte_thread_setname()
> 
> On Mon, Jun 18, 2018 at 11:00:55AM +0100, Burakov, Anatoly wrote:
> > On 08-Jun-18 1:37 PM, Dariusz Stojaczyk wrote:
> > > The doc says this function returns negative errno
> > > on error, but it currently returns either -1 or
> > > positive errno.
> > >
> > > It was incorrectly assumed that pthread_setname_np()
> > > returns negative error numbers. It always returns
> > > positive ones, so this patch negates its return value
> > > before returning.
> > >
> > > While here, also ignore rte_thread_setname() failure
> > > in rte_ctrl_thread_create() and print a debug message
> > > instead.
> > >
> > > Fixes: 3901ed99c2f8 ("eal: fix thread naming on FreeBSD")
> > > Cc: thomas.monja...@6wind.com
> > > Cc: sta...@dpdk.org
> > >
> > > Signed-off-by: Dariusz Stojaczyk 
> > > ---
> >
> > For patch contents,
> >
> > Acked-by: Anatoly Burakov 
> >
> > However, maybe this should be split in two patches.
> 
> Agree it should be split.
> 
> Reviewed-by: Olivier Matz 
> 
> Out of curiosity, do you have a use-case where rte_thread_setname()
> fails? The only reason I see is a too long name. Why this error
> should be ignored?

I don't have any use-case like that. It's just that the error is not fatal and 
we can physically continue creating the thread. EAL does the same thing for the 
lcore threads.


Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

2018-06-26 Thread Stojaczyk, DariuszX


> -Original Message-
> From: Bie, Tiwei
> Sent: Tuesday, June 26, 2018 10:22 AM
> To: Stojaczyk, DariuszX 
> Cc: Dariusz Stojaczyk ; dev@dpdk.org; Maxime
> Coquelin ; Tetsuya Mukawa
> ; Stefan Hajnoczi ; Thomas
> Monjalon ; y...@fridaylinux.org; Harris, James R
> ; Kulasek, TomaszX ;
> Wodkowski, PawelX 
> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> 
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > -Original Message-
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Tiwei Bie
> > > Sent: Monday, June 25, 2018 1:02 PM
> > > 
> > >
> > > Hi Dariusz,
> > >
> >
> > Hi Tiwei,
> >
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > >
> > > From my understanding, your proposal is that:
> > >
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > >allow users to implement vhost backends like
> > >SCSI, net, crypto, ..
> > >
> >
> > That's right.
> >
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > >The rte_vhost will still provide below existing
> > >sets of APIs:
> > > 1. The APIs which allow users to implement
> > >external vhost backends (these APIs were
> > >designed for SPDK previously)
> > > 2. The APIs provided by the net backend
> > > 3. The APIs provided by the crypto backend
> > >And above APIs in rte_vhost won't be changed.
> >
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> underneath and will call existing vhost_device_ops for e.g. starting the 
> device
> once all queues are started.
> 
> Currently I have below concerns and questions:
> 
> - The rte_vhost's problem is still there. Even though
>   rte_vhost2 is introduced, the net and crypto backends
>   in rte_vhost won't benefit from the new callbacks.
> 
>   The existing rte_vhost in DPDK not only provides the
>   APIs for DPDK applications to implement the external
>   backends. But also provides high performance net and
>   crypto backends implementation (maybe more in the
>   future). So it's important that besides the DPDK
>   applications which implement their external backends,
>   the DPDK applications which use the builtin backends
>   will also benefit from the new callbacks.
> 
>   So we should have a clear plan on how will the legacy
>   callbacks in rte_vhost be dealt with in the next step.
> 
>   Besides, the new library's name is a bit misleading.
>   It makes the existing rte_vhost library sound like an
>   obsolete library. But actually the existing rte_vhost
>   isn't an obsolete library. It will still provide the
>   net and crypto backends. So if we want to introduce
>   this new library, we should give it a better name.
> 
> - It's possible to solve rte_vhost's problem you met
>   by refactoring the existing vhost library directly
>   instead of re-implementing a new vhost library from
>   scratch and keeping the old one's problem as is.
> 
>   In this way, it will solve the problem you met and
>   also solve the problem for rte_vhost. Why not go
>   this way? Something like:
> 
>   Below is the existing callbacks set in rte_vhost.h:
> 
>   /**
>* Device and vring operations.
>*/
>   struct vhost_device_ops {
>   ..
>   };
> 
>   It's a legacy implementation, and doesn't really
>   follow the DPDK API design (e.g. no rte_ prefix).
>   We can design and implement a new message handling
>   and a new set of callbacks for rte_vhost to solve
>   the problem you met without changing the old one.
>   Something like:
> 
>   struct rte_vhost_device_ops {
>   ..
>   }
> 
>   int
>   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> *msg)
>   {
>   ..
> 
>   if (!vdev->is_using_new_device_ops) {
>   // Call the existing message handler
>   return vhost_user_msg_handler_legacy(vdev, msg);
>   }
> 
>   // Implement the new logic here
>   ..
>   }
> 
>   A vhost application is allowed to register only struct
>   rte_vhost_device_ops or struct vhost_device_ops (which
>   should be deprecated in the future). The two ops cannot
>   be registered at the same time.
> 
>   The existing applications could use the old ops. And
>   if an application registers struct rte_vhost_device_ops,
>   the new callbacks and message handler will be used.

Please notice that some features like vIOMMU are not even a part of the public 
rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating 
vhost-net from a generic vhost library (rte_vhost2) would avoid making such 
design mistakes in future. What's the point of having a single rte_vhost 
library, if some vhost-user features are only implemented for vhost-net.

> 
> Best regards,
> Tiwei Bie
> 
> 
> > Regards,
> > D.
> >
> > >
> > > Is my above understanding correct? Thanks!
> > >
> > > Best regards,
> > > Tiwei Bie
> > >


Re: [dpdk-dev] [PATCH] memory: fix alignment in eal_get_virtual_area()

2018-07-16 Thread Stojaczyk, DariuszX

> -Original Message-
> From: Burakov, Anatoly
> Sent: Monday, July 16, 2018 2:58 PM
> To: Stojaczyk, DariuszX ; dev@dpdk.org
> Cc: sta...@dpdk.org
> Subject: Re: [PATCH] memory: fix alignment in eal_get_virtual_area()
> 
> On 13-Jun-18 8:08 PM, Dariusz Stojaczyk wrote:
> > Although the alignment mechanism works as intended, the
> > `no_align` bool flag was set incorrectly. We were aligning
> > buffers that didn't need extra alignment, and weren't
> > aligning ones that really needed it.
> >
> > Fixes: b7cc54187ea4 ("mem: move virtual area function in common directory")
> > Cc: anatoly.bura...@intel.com
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Dariusz Stojaczyk 
> > ---
> >   lib/librte_eal/common/eal_common_memory.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/common/eal_common_memory.c
> b/lib/librte_eal/common/eal_common_memory.c
> > index 4f0688f..a7c89f0 100644
> > --- a/lib/librte_eal/common/eal_common_memory.c
> > +++ b/lib/librte_eal/common/eal_common_memory.c
> > @@ -70,7 +70,7 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
> >  * system page size is the same as requested page size.
> >  */
> > no_align = (requested_addr != NULL &&
> > -   ((uintptr_t)requested_addr & (page_sz - 1)) == 0) ||
> > +   ((uintptr_t)requested_addr & (page_sz - 1))) ||
> > page_sz == system_page_sz;
> >
> > do {
> >
> 
> This patch is wrong - no alignment should be performed if address is
> already alighed, e.g. if requested_addr & (page_sz - 1) == 0. The
> original code was correct.

If we provide an aligned address with ADDR_IS_HINT flag and OS decides not to 
use it, we end up with potentially unaligned address that needs to be manually 
aligned and that's what this patch does. If the requested address wasn't 
aligned to the provided page_sz, why would we bother aligning it manually?

Maybe there's additional case I'm not seeing, but this patch should not be 
reverted.

D.

> 
> Thomas, could you please revert this patch?
> 
> --
> Thanks,
> Anatoly


Re: [dpdk-dev] [PATCH] mem: fix alignment of requested virtual areas

2018-07-17 Thread Stojaczyk, DariuszX



> -Original Message-
> From: Burakov, Anatoly
> Sent: Monday, July 16, 2018 4:57 PM
> To: dev@dpdk.org
> Cc: tho...@monjalon.net; Yao, Lei A ; Stojaczyk, DariuszX
> ; sta...@dpdk.org
> Subject: [PATCH] mem: fix alignment of requested virtual areas
> 
> The original code did not align any addresses that were requested as
> page-aligned, but were different because addr_is_hint was set.
> 
> Below fix by Dariusz has introduced an issue where all unaligned addresses
> were left as unaligned.
> 
> This patch is a partial revert of
> commit 7fa7216ed48d ("mem: fix alignment of requested virtual areas")
> 
> and implements a proper fix for this issue, by asking for alignment in all
> but the following two cases:
> 
> 1) page size is equal to system page size, or
> 2) we got an aligned requested address, and will not accept a different one
> 
> This ensures that alignment is performed in all cases, except for those we
> can guarantee that the address will not need alignment.
> 
> Fixes: b7cc54187ea4 ("mem: move virtual area function in common directory")
> Fixes: 7fa7216ed48d ("mem: fix alignment of requested virtual areas")
> Cc: dariuszx.stojac...@intel.com
> Cc: sta...@dpdk.org
> 
> Signed-off-by: Anatoly Burakov 

Acked-by: Dariusz Stojaczyk 

> ---
>  lib/librte_eal/common/eal_common_memory.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_memory.c
> b/lib/librte_eal/common/eal_common_memory.c
> index 659cc08f6..fbfb1b055 100644
> --- a/lib/librte_eal/common/eal_common_memory.c
> +++ b/lib/librte_eal/common/eal_common_memory.c
> @@ -66,14 +66,17 @@ eal_get_virtual_area(void *requested_addr, size_t *size,
>   addr_is_hint = true;
>   }
> 
> - /* if requested address is not aligned by page size, or if requested
> -  * address is NULL, add page size to requested length as we may get an
> -  * address that's aligned by system page size, which can be smaller than
> -  * our requested page size. additionally, we shouldn't try to align if
> -  * system page size is the same as requested page size.
> + /* we don't need alignment of resulting pointer in the following cases:
> +  *
> +  * 1. page size is equal to system size
> +  * 2. we have a requested address, and it is page-aligned, and we will
> +  *be discarding the address if we get a different one.
> +  *
> +  * for all other cases, alignment is potentially necessary.
>*/
>   no_align = (requested_addr != NULL &&
> - ((uintptr_t)requested_addr & (page_sz - 1))) ||
> + requested_addr == RTE_PTR_ALIGN(requested_addr, page_sz)
> &&
> + !addr_is_hint) ||
>   page_sz == system_page_sz;
> 
>   do {
> --
> 2.17.1


Re: [dpdk-dev] [PATCH] memory: fix alignment in eal_get_virtual_area()

2018-07-17 Thread Stojaczyk, DariuszX


> -Original Message-
> From: Burakov, Anatoly
> Sent: Monday, July 16, 2018 4:17 PM
> To: Stojaczyk, DariuszX ; dev@dpdk.org
> Cc: sta...@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] memory: fix alignment in 
> eal_get_virtual_area()
> 
> On 16-Jul-18 3:01 PM, Burakov, Anatoly wrote:
> > On 16-Jul-18 2:29 PM, Stojaczyk, DariuszX wrote:
> >>
> >>> -Original Message-
> >>> From: Burakov, Anatoly
> >>> Sent: Monday, July 16, 2018 2:58 PM
> >>> To: Stojaczyk, DariuszX ; dev@dpdk.org
> >>> Cc: sta...@dpdk.org
> >>> Subject: Re: [PATCH] memory: fix alignment in eal_get_virtual_area()
> >>>
> >>> On 13-Jun-18 8:08 PM, Dariusz Stojaczyk wrote:
> >>>> Although the alignment mechanism works as intended, the
> >>>> `no_align` bool flag was set incorrectly. We were aligning
> >>>> buffers that didn't need extra alignment, and weren't
> >>>> aligning ones that really needed it.
> >>>>
> >>>> Fixes: b7cc54187ea4 ("mem: move virtual area function in common
> >>>> directory")
> >>>> Cc: anatoly.bura...@intel.com
> >>>> Cc: sta...@dpdk.org
> >>>>
> >>>> Signed-off-by: Dariusz Stojaczyk 
> >>>> ---
> >>>>    lib/librte_eal/common/eal_common_memory.c | 2 +-
> >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/lib/librte_eal/common/eal_common_memory.c
> >>> b/lib/librte_eal/common/eal_common_memory.c
> >>>> index 4f0688f..a7c89f0 100644
> >>>> --- a/lib/librte_eal/common/eal_common_memory.c
> >>>> +++ b/lib/librte_eal/common/eal_common_memory.c
> >>>> @@ -70,7 +70,7 @@ eal_get_virtual_area(void *requested_addr, size_t
> >>>> *size,
> >>>>     * system page size is the same as requested page size.
> >>>>     */
> >>>>    no_align = (requested_addr != NULL &&
> >>>> -    ((uintptr_t)requested_addr & (page_sz - 1)) == 0) ||
> >>>> +    ((uintptr_t)requested_addr & (page_sz - 1))) ||
> >>>>    page_sz == system_page_sz;
> >>>>
> >>>>    do {
> >>>>
> >>>
> >>> This patch is wrong - no alignment should be performed if address is
> >>> already alighed, e.g. if requested_addr & (page_sz - 1) == 0. The
> >>> original code was correct.
> >>
> >> If we provide an aligned address with ADDR_IS_HINT flag and OS decides
> >> not to use it, we end up with potentially unaligned address that needs
> >> to be manually aligned and that's what this patch does. If the
> >> requested address wasn't aligned to the provided page_sz, why would we
> >> bother aligning it manually?
> >
> > no_align is a flag that indicates whether we should or shouldn't align
> > the resulting end address - it is not meant to align requested address.
> >
> > If requested_addr was NULL, no_align will be set to "false" (we don't
> > know what we get, so we must reserve additional space for alignment
> > purposes).
> >
> > However, it will be set to "true" if page size is equal to system size
> > (the OS will return pointer that is already aligned to system page size,
> > so we don't need to align the result and thus don't need to reserve
> > additional space for alignment).
> >
> > If requested address wasn't null, again we don't need alignment if
> > system page size is equal to requested page size, as any resulting
> > address will be already page-aligned (hence no_align set to "true").
> >
> > If requested address wasn't already page-aligned and page size is not
> > equal to system page size, then we set "no_align" to false, because we
> > will need to align the resulting address.

I haven't seen such use case in the code and I deliberately didn't handle it. I 
believe that was my problem.

> >
> > The crucial part to understand is that the logic here is inverted - "if
> > requested address isn't NULL, and if the requested address is already
> > aligned (i.e. (addr & pgsz-1) == 0), then we *don't* need to align the
> > address". So, if the requested address is *not* aligned, "no_align" must
> > be set to false - because we *will* need to align the address.
> >
> > As an 

Re: [dpdk-dev] [PATCH] vhost: add API for getting last_idx of vrings

2018-07-26 Thread Stojaczyk, DariuszX
It can be abandoned. I can see Zhihong added equivalent APIs called 
rte_vhost_get_vring_base/set_vring_base back in April.

> -Original Message-
> From: Thomas Monjalon [mailto:tho...@monjalon.net]
> Sent: Thursday, July 26, 2018 5:44 PM
> To: Maxime Coquelin ; Kulasek, TomaszX
> 
> Cc: dev@dpdk.org; y...@fridaylinux.org; Verkamp, Daniel
> ; Harris, James R ;
> Wodkowski, PawelX ; Stojaczyk, DariuszX
> 
> Subject: Re: [dpdk-dev] [PATCH] vhost: add API for getting last_idx of vrings
> 
> What is the status of this patch?
> 
> 19/04/2018 16:57, Maxime Coquelin:
> > Hi Tomasz,
> >
> > On 03/28/2018 11:31 AM, Kulasek, TomaszX wrote:
> > > Hi Maxime,
> > >
> > >> -Original Message-
> > >> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> > >> Sent: Wednesday, March 28, 2018 10:57
> > >> To: Kulasek, TomaszX ; y...@fridaylinux.org
> > >> Cc: Verkamp, Daniel ; Harris, James R
> > >> ; Wodkowski, PawelX
> ;
> > >> dev@dpdk.org; Stojaczyk, DariuszX 
> > >> Subject: Re: [dpdk-dev] [PATCH] vhost: add API for getting last_idx of 
> > >> vrings
> > >>
> > >> Hi Tomasz,
> > >>
> > >> On 03/05/2018 04:59 PM, Tomasz Kulasek wrote:
> > >>> vhost-net devices might keep track of last descriptors indices by
> > >>> themselves, and assuming they initially start at 0, but that is not the
> > >>> case for vhost-scsi. Initial last descriptor indices are set via
> > >>> VHOST_USER_SET_VRING_BASE message, and we cannot possibly predict
> what
> > >>> will they be. Setting these to vqueue->used->idx is also not an option,
> > >>> because there might be some yet unprocessed requests between these and
> > >>> the actual last_idx. This patch adds API for getting/setting last
> > >>> descriptor indices of vrings, so that they can be synchronized between
> > >>> user-device and rte_vhost.
> > >>>
> > >>> The last_idx flow could be as following:
> > >>>
> > >>>* vhost start,
> > >>>* received SET_VRING_BASE msg, last_idx is set on rte_vhost side,
> > >>>* created user-device, last_idx pulled from rte_vhost,
> > >>>* requests are being processed by user-device, last_idx changes,
> > >>>* destroyed user-device, last_idx pushed to rte_vhost,
> > >>>* at this point, vrings could be recreated and another SET_VRING_BASE
> > >>>  message could arrive, so last_idx would be set
> > >>>* recreated user-device, last_idx pulled from rte_vhost.
> > >>>
> > >>>
> > >>> Signed-off-by: Dariusz Stojaczyk 
> > >>> Signed-off-by: Tomasz Kulasek 
> > >>> ---
> > >>>lib/librte_vhost/rte_vhost.h | 24 
> > >>>lib/librte_vhost/vhost.c | 27 +++
> > >>>2 files changed, 51 insertions(+)
> > >>>
> > >>
> > >> I agree with the patch, but it is missing the declaration of the new API
> > >> in rte_vhost_version.map.
> > >>
> > >> Thanks,
> > >> Maxime
> > >
> > > Yes, I will send v2.
> >
> > Do you plan to send v2 for v18.02?
> > It can still make it to -rc2 if you post it early next week.
> >
> > Thanks,
> > Maxime
> >
> > > Tomasz
> > >
> >
> 
> 
> 
> 



Re: [dpdk-dev] [RFC] vhost: new rte_vhost API proposal

2018-05-10 Thread Stojaczyk, DariuszX
Hi,

> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> Sent: Friday, May 11, 2018 12:37 AM
> On Thu, May 10, 2018 at 03:22:53PM +0200, Dariusz Stojaczyk wrote:
> > rte_virtio would offer both vhost and virtio driver APIs. These two
> > have a lot of common code for vhost-user handling or PCI access for
> > initiator/virtio-vhost-user (and possibly vDPA) so there's little
> > sense to keep target and initiator code separated between different
> > libs. Of course, the APIs would be separate - only some parts of
> > the code would be shared.
> 
> The API below seems to be for vhost backends (aka slaves).  rte_virtio_*
> is a misnomer because vhost and virtio are two different things.  This
> is not for implementing virtio devices, it's specifically for vhost
> devices.

I agree it's named a bit off if we're talking about vhost. My idea was to 
introduce a generic library for userspace Virtio processing and that's where 
the name came from. Even when you use just the vhost API that's introduced 
here, you are only required to implement vring processing, config access, and 
possibly interrupt handling, all of which are typical Virtio things. The vhost 
logic is hidden inside.

> 
> Vhost does not offer the full virtio device model - otherwise it would
> be just another transport in the VIRTIO specification.  Instead vhost is
> a protocol for vhost devices, which are subsets of virtio devices.
> 
> I suggest calling it rte_vhost2 since it's basically a new, incompatible
> rte_vhost API.

Rte_vhost2 sounds better for what we have now, but would that name be still 
valid once we add a true Virtio driver functionality? (I believe it's called 
Virtio PMD in DPDK right now). That driver would reuse a lot of the vhost code 
for PCI and vhost-user, so it makes some sense to put these two together. 

I don't think rte_vhost2 is a permanent name anyway, so maybe we could call it 
like so for now, and rename it later once I introduce that additional Virtio 
functionality? Would that work?

> 
> Also, the initiator/target terminology does not match the vhost-user
> specification.  It uses master/client and slave/backend/server.  Adding
> another pair of words makes things more confusing.  Please stick to the
> words used by the spec.

Ack.

> 
> > +/**
> > + * Device/queue related callbacks, all optional. Provided callback
> > + * parameters are guaranteed not to be NULL until explicitly specified.
> 
> s/until/unless/ ?

Ack.

> > + /**
> > + * Stop processing vq. It shouldn't be accessed after this callback
> > + * completes (via tgt_cb_complete). This can be called prior to
> shutdown
> 
> s/tgt_cb_complete/rte_virtio_tgt_cb_complete/

Ack.

> 
> > + * or before actions that require changing vhost device/vq state.
> > + */
> > + void (*queue_stop)(struct rte_virtio_dev *vdev, struct rte_virtio_vq
> *vq);
> > + /** Device disconnected. All queues are guaranteed to be stopped by
> now */
> > + void (*device_destroy)(struct rte_virtio_dev *vdev);
> > + /**
> > + * Custom message handler. `vdev` and `vq` can be NULL. This is called
> > + * for backend-specific actions. The `id` should be prefixed by the
> 
> Since vdev can be NULL, does this mean custom_msg() may be invoked at
> any time during the lifecycle and even before/after
> device_create()/device_destroy()?

Theoretically. I was thinking of some poorly-written backends notifying they're 
out of internal resources, but I agree it's just poor. I'll remove the `vdev 
can be NULL` part.

> > + */
> > + void (*custom_msg)(struct rte_virtio_dev *vdev, struct rte_virtio_vq
> *vq,
> > +   char *id, void *ctx);
> 
> What is the purpose of id and why is it char* instead of const char*?

Ack, It should be const. (same thing goes to every other char* in this patch)

For example vhost-crypto introduces two new vhost-user messages for 
initializing and destroying crypto session. The underlying vhost-crypto 
vhost-user backend after receiving such message could execute this callback as 
follows:

struct my_crypto_data *data = calloc();
[...]
Ops->custom_msg(vdev, NULL, "crypto_sess_init", data);

> 
> Is ctx the "message"?  If ctx is untrusted message data from an external
> process, why is there no size argument?  Who validates the message size?

Ack. Will add size parameter.

> 
> > +
> > + /**
> > + * Interrupt handler, synchronous. If this callback is set to NULL,
> > + * rte_virtio will hint the initiators not to send any interrupts.
> > + */
> > + void (*queue_kick)(struct rte_virtio_dev *vdev, struct rte_virtio_vq
> *vq);
> 
> Devices often have multiple types of queues.  Some of them may be
> suitable for polling, others may be suitable for an interrupt-driven
> model.  Is there a way to enable/disable interrupts for specific queues?

Thanks, I didn't think of that. I'll need to move the responsibility of setting 
vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT to the user.

> 
> > + /** Device config read, synchronous. */
> 
> What is the

Re: [dpdk-dev] [RFC] vhost: new rte_vhost API proposal

2018-05-18 Thread Stojaczyk, DariuszX


> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> Sent: Friday, May 11, 2018 6:06 PM
> On Fri, May 11, 2018 at 05:55:45AM +0000, Stojaczyk, DariuszX wrote:
> > > -Original Message-
> > > From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> > > Sent: Friday, May 11, 2018 12:37 AM
> > > On Thu, May 10, 2018 at 03:22:53PM +0200, Dariusz Stojaczyk wrote:
> > > > rte_virtio would offer both vhost and virtio driver APIs. These two
> > > > have a lot of common code for vhost-user handling or PCI access for
> > > > initiator/virtio-vhost-user (and possibly vDPA) so there's little
> > > > sense to keep target and initiator code separated between different
> > > > libs. Of course, the APIs would be separate - only some parts of
> > > > the code would be shared.
> > >
> > > The API below seems to be for vhost backends (aka slaves).
> rte_virtio_*
> > > is a misnomer because vhost and virtio are two different things.  This
> > > is not for implementing virtio devices, it's specifically for vhost
> > > devices.
> >
> > I agree it's named a bit off if we're talking about vhost. My idea was to
> introduce a generic library for userspace Virtio processing and that's
> where the name came from. Even when you use just the vhost API that's
> introduced here, you are only required to implement vring processing,
> config access, and possibly interrupt handling, all of which are typical
> Virtio things. The vhost logic is hidden inside.
> 
> No, the vhost logic is not hidden: there is custom_msg() and the whole
> tgt_ops struct is an abstraction of the vhost protocol, not virtio.
> 
> It sounds like you're hoping to create a single API that can support
> both vhost and virtio access.  For example, one "net" device backend
> implementation using rte_virtio can be accessed via vhost or virtio.
> 
> This won't work because vhost and virtio are not equivalent.  vhost-net
> devices don't implement the virtio-net config space and they only have a
> subset of the virtqueues.  vhost-net devices support special vhost
> messages that don't exist in virtio-net.
> 
> Additionally, the virtio and vhost-user specifications are independent
> and make no promise of a 1:1 mapping.  They have the freedom to
> change
> in ways which will break any abstraction you come up with today.
> 
> I hope it will be possible to unify the two in the future, but that
> needs to happen at the spec level first, before trying to unify them in
> code.
> 
> This is why I'm belaboring the point that vhost should not be confused
> with virtio.  Each needs to be separate and clearly identified to avoid
> confusion.
>   

Ok, I'm convinced now. Thanks for the explanation. I'll name the lib rte_vhost2 
in v2.


> >
> > >
> > > Vhost does not offer the full virtio device model - otherwise it would
> > > be just another transport in the VIRTIO specification.  Instead vhost is
> > > a protocol for vhost devices, which are subsets of virtio devices.
> > >
> > > I suggest calling it rte_vhost2 since it's basically a new, incompatible
> > > rte_vhost API.
> >
> > Rte_vhost2 sounds better for what we have now, but would that name
> be still valid once we add a true Virtio driver functionality? (I believe it's
> called Virtio PMD in DPDK right now). That driver would reuse a lot of the
> vhost code for PCI and vhost-user, so it makes some sense to put these
> two together.
> >
> > I don't think rte_vhost2 is a permanent name anyway, so maybe we
> could call it like so for now, and rename it later once I introduce that
> additional Virtio functionality? Would that work?
> 
> The natural layering for is that vhost depends on virtio.  Virtio header
> files (feature bits, config space layout, vring layout) and the vring
> API can be reused by vhost.
> 
> Virtio doesn't need knowledge of virtio though and the two can be in
> separate packages without code duplication.
> 
> That said, it doesn't really matter whether there are rte_virtio +
> rte_vhost2 packages or a single rte_virtio package, as long as the
> function and struct names for vhost interfaces contain the name "vhost"
> so they cannot be confused with virtio.
> 
> > > > + * or before actions that require changing vhost device/vq state.
> > > > + */
> > > > + void (*queue_stop)(struct rte_virtio_dev *vdev, struct
> rte_virtio_vq
> > > *vq);
> > > > + /** Device disconnected. All queues are g

Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal

2018-05-22 Thread Stojaczyk, DariuszX
Hi Maxime,

> -Original Message-
> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> Sent: Friday, May 18, 2018 9:51 PM
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. qFixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> >
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
> 
> I like the idea, and the proposed way to smooth the transition.
> 
> I will certainly have other comments later, but please find below
> the ones I have for the moment.
> 
> > 
> > +
> > +/**
> > + * Registers a new vhost target accepting remote connections. Multiple
> > + * available transports are available. It is possible to create a Vhost-
> user
> > + * Unix domain socket polling local connections or connect to a
> physical
> > + * Virtio device and install an interrupt handler .
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param trtype type of the transport used, e.g. "vhost-user",
> > + * "PCI-vhost-user", "PCI-vDPA".
> > + * \param trid identifier of the device. For PCI this would be the BDF
> address,
> > + * for vhost-user the socket name.
> > + * \param trflags additional options for the specified transport
> > + * \param trctx additional data for the specified transport. Can be
> NULL.
> > + * \param tgt_ops callbacks to be called upon reaching specific
> initialization
> > + * states.
> > + * \param features supported Virtio features. To be negotiated with
> the
> > + * driver ones. rte_vhost2 will append a couple of generic feature bits
> > + * which are required by the Virtio spec. TODO list these features here
> > + * \return 0 on success, negative errno otherwise
> > + */
> > +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> > +   uint64_t trflags, void *trctx,
> > +   struct rte_vhost2_tgt_ops *tgt_ops,
> > +   uint64_t features);
> 
> Couldn't the register API also pass the vdev?
> Doing this, the backend could have rte_vhost2_dev in its device
> struct.

Please notice the register API is for registering targets, not devices. A 
single Vhost-user server target can spawn multiple devices - one for each 
connection. I know the nomenclature is different from rte_vhost, but since each 
connection uses its own (virt)queues it makes sense to call things this way.

Initially I thought about adding some rte_vhost2_tgt struct declaration that 
register function would return, but later on came to a conclusion that it would 
only complicate things for the library user. A parent struct that would keep 
rte_vhost2_tgt* needs to contain `const char *trtype` and `const char *trid` 
anyway, so it's just easier to use these two strings for target identification. 

> > 
> > +/**
> > + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param mem vhost device memory
> > + * \param gpa guest physical address
> > + * \param len length of the memory to translate (in bytes). If
> requested
> > + * memory chunk crosses memory region boundary, the *len will be
> set to
> > + * the remaining, maximum length of virtually contiguous memory. In
> such
> > + * case the user will be required to call another gpa_to_vva(gpa +
> *len).
> > + * \return vhost virtual address or NULL if requested `gpa` is not
> mapped.
> > + */
> > +static inline void *
> > +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t
> gpa, uint64_t *len)
> > +{
> > +   struct rte_vhost2_mem_region *r;
> > +   uint32_t i;
> > +
> > +   for (i = 0; i < mem->nregions; i++) {
> > +   r = &mem->regions[i];
> > +   if (gpa >= r->guest_phys_addr &&
> > +   gpa <  r->guest_phys_addr + r->size) {
> > +
> > +   if (unlikely(*len > r->guest_phys_addr + r->size -
> gpa)) {
> > +   *len = r->guest_phys_addr + r->size - gpa;
> > +   }
> > +
> > +   return gpa - r->guest_phys_addr + r-
> >host_user_addr;
> > +   }
> > +   }
> > +   *len = 0;
> > +
> > +   return 0;
> > +}
> 
> Maybe we could take the opportunity to only have
> rte_vhost2_iova_to_vva.

Good idea; will remove it in v3.

Thanks,
D.


Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal

2018-05-29 Thread Stojaczyk, DariuszX



> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> Sent: Friday, May 25, 2018 12:06 PM
> On Fri, May 18, 2018 at 03:01:05PM +0200, Dariusz Stojaczyk wrote:
> > +struct rte_vhost2_msg {
> > +   uint32_t id;
> 
> Is this what the vhost-user specification calls the "request type"?  I
> suggest following the vhost-user spec terminology.
> 
> > +   uint32_t flags;
> > +   uint32_t size; /**< The following payload size. */
> > +   void *payload;
> > +   int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> 
> Is it necessary to expose file descriptor passing in the API?
> virtio-vhost-user doesn't have file descriptor passing, so it's best if this
> can be hidden inside rte_vhost2.

So it's another argument for not exposing raw message handling to the user.
If there's some backend-specific vhost-user message in future that contains an 
fd, it will need a set of new abstractions to work with virtio-vhost-user 
anyway.
I guess I'll get back the original custom_msg idea from V1.

> 
> > +};
> > +
> > +/** Single memory region. Both physically and virtually contiguous */
> > +struct rte_vhost2_mem_region {
> > +   uint64_t guest_phys_addr;
> > +   uint64_t guest_user_addr;
> > +   uint64_t host_user_addr;
> > +   uint64_t size;
> > +   void *mmap_addr;
> > +   uint64_t mmap_size;
> > +   int fd;
> 
> virtio-vhost-user doesn't have an fd.  Why do API consumers need to
> know about the fd?

They don't. Ack. I'll strip this struct.

> 
> > +/**
> > + * Device/queue related callbacks, all optional. Provided callback
> > + * parameters are guaranteed not to be NULL unless explicitly
> specified.
> > + */
> 
> This is a good place to mention that all callbacks are asynchronous unless
> specified otherwise.  Without that knowledge statements below like "If
> this is completed with a non-zero status" are confusing on a void
> function.

Ack.

> 
> > +struct rte_vhost2_tgt_ops {
> > +   /**
> > +* New driver connected. If this is completed with a non-zero
> status,
> > +* rte_vhost2 will terminate the connection.
> > +*/
> > +   void (*device_create)(struct rte_vhost2_dev *vdev);
> > +   /**
> > +   * Device is ready to operate. vdev data is now initialized. This
> callback
> > +   * may be called multiple times as e.g. memory mappings can
> change
> > +   * dynamically. All queues are guaranteed to be stopped by now.
> > +   */
> > +   void (*device_init)(struct rte_vhost2_dev *vdev);
> > +   /**
> > +   * Features have changed in runtime. This is called at least once
> > +during
> 
> s/in/at/

Ack.

> 
> > +   /**
> > +   * Custom vhost-user message handler. This is called for
> > +   * backend-specific messages (net/crypto/scsi) that weren't
> recognized
> > +   * by the generic message parser. `msg` is available until
> > +   * \c rte_vhost2_tgt_cb_complete is called.
> > +   */
> > +   void (*custom_msg)(struct rte_vhost2_dev *vdev, struct
> > +rte_vhost2_msg *msg);
> 
> What happens if rte_vhost2_tgt_cb_complete() is called with a negative
> rc?  Does the specific errno value matter?

My current implementation only checks for rc != 0 now. I'm still working this 
out.

> 
> Where is the API for sending a vhost-user reply message?

I didn't push any. Now that you pointed out the fds in public API I think I'll 
rollback this custom_msg stuff to V1.

D.


Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal

2018-05-30 Thread Stojaczyk, DariuszX



> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> Sent: Wednesday, May 30, 2018 10:57 AM
> On Tue, May 29, 2018 at 01:38:33PM +0000, Stojaczyk, DariuszX wrote:
> >
> >
> > > -Original Message-
> > > From: Stefan Hajnoczi [mailto:stefa...@redhat.com]
> > > Sent: Friday, May 25, 2018 12:06 PM
> > > On Fri, May 18, 2018 at 03:01:05PM +0200, Dariusz Stojaczyk wrote:
> > > > +struct rte_vhost2_msg {
> > > > +   uint32_t id;
> > >
> > > Is this what the vhost-user specification calls the "request type"?
> > > I suggest following the vhost-user spec terminology.
> > >
> > > > +   uint32_t flags;
> > > > +   uint32_t size; /**< The following payload size. */
> > > > +   void *payload;
> > > > +   int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> > >
> > > Is it necessary to expose file descriptor passing in the API?
> > > virtio-vhost-user doesn't have file descriptor passing, so it's best
> > > if this can be hidden inside rte_vhost2.
> >
> > So it's another argument for not exposing raw message handling to the
> user.
> > If there's some backend-specific vhost-user message in future that
> contains an fd, it will need a set of new abstractions to work with virtio-
> vhost-user anyway.
> > I guess I'll get back the original custom_msg idea from V1.
> 
> Hold on, have you looked at the device-specific messages defined in the
> vhost-user spec?  Do any of them even pass resources (file descriptors)?

There's at least VHOST_USER_NVME_SET_CQ_CALL (currently in review, not present 
in master yet). Vhost-nvme is a one big hack that doesn't use any virtqueues, 
so it has its own message for callfd.


Re: [dpdk-dev] [PATCH v2 1/2] memalloc: do not leave unmapped holes in EAL virtual memory area

2018-06-01 Thread Stojaczyk, DariuszX
The Intel SMTP server keeps dying to me. It's really not fun trying to resend 
these messages for half an hour without success. I'll resend this series from 
my private mail in the evening.

D.

> -Original Message-
> From: Stojaczyk, DariuszX
> Sent: Friday, June 1, 2018 3:39 PM
> To: dev@dpdk.org; Burakov, Anatoly 
> Cc: Stojaczyk, DariuszX 
> Subject: [PATCH v2 1/2] memalloc: do not leave unmapped holes in EAL
> virtual memory area
> 
> EAL reserves a huge area in virtual address space to provide virtual address
> contiguity for e.g.
> future memory extensions (memory hotplug). During memory hotplug, if
> the hugepage mmap succeeds but doesn't suffice EAL's requiriments, the
> EAL would unmap this mapping straight away, leaving a hole in its virtual
> memory area and making it available to everyone. As EAL still thinks it owns
> the entire region, it may try to mmap it later with MAP_FIXED, possibly
> overriding a user's mapping that was made in the meantime.
> 
> This patch ensures each hole is mapped back by EAL, so that it won't be
> available to anyone else.
> 
> Changes from v1:
>  * checkpatch fixes
> 
> Signed-off-by: Dariusz Stojaczyk 
> ---
>  lib/librte_eal/linuxapp/eal/eal_memalloc.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> index 8c11f98..2ab3d34 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> @@ -39,6 +39,7 @@
>  #include "eal_filesystem.h"
>  #include "eal_internal_cfg.h"
>  #include "eal_memalloc.h"
> +#include "eal_private.h"
> 
>  /*
>   * not all kernel version support fallocate on hugetlbfs, so fall back to @@ 
> -
> 490,6 +491,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int
> socket_id,
>   int ret = 0;
>   int fd;
>   size_t alloc_sz;
> + int flags;
> + void *new_addr;
> 
>   /* takes out a read lock on segment or segment list */
>   fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx); @@ -585,6
> +588,20 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
> 
>  mapped:
>   munmap(addr, alloc_sz);
> + flags = MAP_FIXED;
> +#ifdef RTE_ARCH_PPC_64
> + flags |= MAP_HUGETLB;
> +#endif
> + new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0,
> flags);
> + if (new_addr != addr) {
> + if (new_addr != NULL)
> + munmap(new_addr, alloc_sz);
> + /* we're leaving a hole in our virtual address space. if
> +  * somebody else maps this hole now, we could accidentally
> +  * override it in the future.
> +  */
> + rte_panic("can't mmap holes in our virtual address space");
> + }
>  resized:
>   if (internal_config.single_file_segments) {
>   resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
> --
> 2.7.4



Re: [dpdk-dev] [PATCH 1/2] memalloc: do not leave unmapped holes in EAL virtual memory area

2018-06-01 Thread Stojaczyk, DariuszX

> -Original Message-
> From: Burakov, Anatoly
> Sent: Friday, June 1, 2018 1:00 PM
> On 01-Jun-18 1:51 PM, Dariusz Stojaczyk wrote:
> > EAL reserves a huge area in virtual address space to provide virtual
> > address contiguity for e.g.
> > future memory extensions (memory hotplug). During memory hotplug, if
> > the hugepage mmap succeeds but doesn't suffice EAL's requiriments, the
> > EAL would unmap this mapping straight away, leaving a hole in its
> > virtual memory area and making it available to everyone. As EAL still
> > thinks it owns the entire region, it may try to mmap it later with
> > MAP_FIXED, possibly overriding a user's mapping that was made in the
> > meantime.
> >
> > This patch ensures each hole is mapped back by EAL, so that it won't
> > be available to anyone else.
> >
> > Signed-off-by: Dariusz Stojaczyk 
> > ---
> 
> Generally, if the commit is a bugfix, reference to the original commit that
> introduces the issue should be part of the commit message. See DPDK
> contribution guidelines [1] and "git fixline" [2]. Also, this bug is present 
> in
> 18.05, so you should also Cc: sta...@dpdk.org (same applies for all of your
> other patches for memalloc).
> 
> [1] http://dpdk.org/doc/guides/contributing/patches.html
> [2] http://dpdk.org/dev

Ack, thanks.

> 
> >   lib/librte_eal/linuxapp/eal/eal_memalloc.c | 17 +
> >   1 file changed, 17 insertions(+)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> > b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> > index 8c11f98..b959ea5 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
> > @@ -39,6 +39,7 @@
> >   #include "eal_filesystem.h"
> >   #include "eal_internal_cfg.h"
> >   #include "eal_memalloc.h"
> > +#include "eal_private.h"
> >
> >   /*
> >* not all kernel version support fallocate on hugetlbfs, so fall
> > back to @@ -490,6 +491,8 @@ alloc_seg(struct rte_memseg *ms, void
> *addr, int socket_id,
> > int ret = 0;
> > int fd;
> > size_t alloc_sz;
> > +   int flags;
> > +   void *new_addr;
> >
> > /* takes out a read lock on segment or segment list */
> > fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx); @@
> > -585,6 +588,20 @@ alloc_seg(struct rte_memseg *ms, void *addr, int
> > socket_id,
> >
> >   mapped:
> > munmap(addr, alloc_sz);
> > +   flags = MAP_FIXED;
> > +#ifdef RTE_ARCH_PPC_64
> > +   flags |= MAP_HUGETLB;
> > +#endif
> > +   new_addr = eal_get_virtual_area(addr, &alloc_sz, alloc_sz, 0,
> > +flags);
> 
> Page size shouldn't be zero, should be alloc_sz.

The 0 above is for the `flags` parameter. Page size is set to alloc_sz.

```
void *
eal_get_virtual_area(void *requested_addr, size_t *size,
size_t page_sz, int flags, int mmap_flags);
```

I believe it's okay. Correct me if I'm wrong.

> 
> > +   if (new_addr != addr) {
> > +   if (new_addr != NULL) {
> > +   munmap(new_addr, alloc_sz);
> > +   }
> > +   /* We're leaving a hole in our virtual address space. If
> > +* somebody else maps this hole now, we might accidentally
> > +* override it in the future. */
> > +   rte_panic("can't mmap holes in our virtual address space");
> 
> I don't think rte_panic() here makes sense. First of all, we're now striving 
> to
> not panic inside DPDK libraries, especially once initialization has already
> finished. But more importantly, when you reach this panic, you're deep in
> the memory allocation process, which means you hold a write lock to DPDK
> memory hotplug. If you crash now, the lock will never be released and
> subsequent memory hotplug requests will just deadlock.
> 
> What's worse is you may be inside a secondary process, synchronizing the
> memory map with that of a primary, which means that even if you release
> the lock here, you're basically releasing someone else's lock, so behavior
> will be undefined at that point.
> 
> I think an error message with the highest possible log level should suffice
> (CRIT?), as there's really nothing more we can do here.

Ok, I'll fallback to CRIT log for now. We could try to add some proper error 
handling in a separate patch later.

> 
> That said, how likely is this scenario?

I can't think of any reason why that last mmap would fail, but it's still 
technically possible.

>
> 
> > +   }
> >   resized:
> > if (internal_config.single_file_segments) {
> > resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
> >
> 
> Generally, if the above is fixed, i'm OK with the patch.
> 
> --
> Thanks,
> Anatoly

D.


Re: [dpdk-dev] [PATCH 2/2] memalloc: keep in mind a failed MAP_FIXED mmap may still perform an unmap

2018-06-01 Thread Stojaczyk, DariuszX

> -Original Message-
> From: Burakov, Anatoly
> Sent: Friday, June 1, 2018 1:09 PM
> To: Stojaczyk, DariuszX ; dev@dpdk.org
> Subject: Re: [PATCH 2/2] memalloc: keep in mind a failed MAP_FIXED
> mmap may still perform an unmap
> 
> On 01-Jun-18 1:51 PM, Dariusz Stojaczyk wrote:
> > This isn't documented in the manuals, but a failed mmap(...,
> > MAP_FIXED) may still unmap overlapping regions. In such case, we need
> > to remap these regions back into our address space to ensure mem
> > contiguity.
> > We do it unconditionally now on mmap failure just to be safe.
> >
> > Verified on Linux 4.9.0-4-amd64. I was getting ENOMEM when trying to
> > map in hugetlbfs with no space left, but the previous anonymous
> > mapping was still being removed.
> >
> > Signed-off-by: Dariusz Stojaczyk 
> > ---
> 
> Does this also happen with other error values?

The man pages aren't clear on this. It could. I know this patch only remaps the 
region if errrno == ENOMEM, but we could remap it unconditionally after each 
failed mmap just as well.  It won't hurt if we remap a region unnecessarily. 
This was my original intent, but apparently I didn't fully commit my changes 
before publishing patches here. Uhh, sorry.

D.

> 
> --
> Thanks,
> Anatoly


Re: [dpdk-dev] [PATCH 1/2] eal/thread: fix return codes for rte_thread_setname()

2018-06-08 Thread Stojaczyk, DariuszX
+CC tho...@monjalon.net 
The previous @6wind email appears to be unavailable now.

> -Original Message-
> From: Stojaczyk, DariuszX
> Sent: Friday, June 8, 2018 2:37 PM
> To: dev@dpdk.org
> Cc: Stojaczyk, DariuszX ;
> thomas.monja...@6wind.com; sta...@dpdk.org
> Subject: [PATCH 1/2] eal/thread: fix return codes for rte_thread_setname()
> 
> The doc says this function returns negative errno on error, but it currently
> returns either -1 or positive errno.
> 
> It was incorrectly assumed that pthread_setname_np() returns negative
> error numbers. It always returns positive ones, so this patch negates its
> return value before returning.
> 
> While here, also ignore rte_thread_setname() failure in
> rte_ctrl_thread_create() and print a debug message instead.
> 
> Fixes: 3901ed99c2f8 ("eal: fix thread naming on FreeBSD")
> Cc: thomas.monja...@6wind.com
> Cc: sta...@dpdk.org
> 
> Signed-off-by: Dariusz Stojaczyk 
> ---
>  lib/librte_eal/common/eal_common_thread.c | 3 ++-
> lib/librte_eal/linuxapp/eal/eal_thread.c  | 4 ++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_eal/common/eal_common_thread.c
> b/lib/librte_eal/common/eal_common_thread.c
> index 4239863..8110ac2 100644
> --- a/lib/librte_eal/common/eal_common_thread.c
> +++ b/lib/librte_eal/common/eal_common_thread.c
> @@ -191,7 +191,8 @@ rte_ctrl_thread_create(pthread_t *thread, const
> char *name,
>   if (name != NULL) {
>   ret = rte_thread_setname(*thread, name);
>   if (ret < 0)
> - goto fail;
> + RTE_LOG(DEBUG, EAL,
> + "Cannot set name for ctrl thread\n");
>   }
> 
>   cpu_found = 0;
> diff --git a/lib/librte_eal/linuxapp/eal/eal_thread.c
> b/lib/librte_eal/linuxapp/eal/eal_thread.c
> index f652ff9..b496fc7 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_thread.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_thread.c
> @@ -176,7 +176,7 @@ int rte_sys_gettid(void)
> 
>  int rte_thread_setname(pthread_t id, const char *name)  {
> - int ret = -1;
> + int ret = ENOSYS;
>  #if defined(__GLIBC__) && defined(__GLIBC_PREREQ)  #if
> __GLIBC_PREREQ(2, 12)
>   ret = pthread_setname_np(id, name);
> @@ -184,5 +184,5 @@ int rte_thread_setname(pthread_t id, const char
> *name)  #endif
>   RTE_SET_USED(id);
>   RTE_SET_USED(name);
> - return ret;
> + return -ret;
>  }
> --
> 2.7.4



Re: [dpdk-dev] [PATCH] vhost: added user callbacks for socket open/close

2017-08-22 Thread Stojaczyk, DariuszX
Hi Jens,

> I'm a little uncertain but my gut feeling is that in this context a 
> connection is
> something between two sockets, not between devices.

What do you mean?
This is a unix domain socket connection. DPDK can create the socket, then the 
client may connect to it via connect(2).

> I would probably add
> these callbacks to struct vhost_user_socket. This is also where we keep the
> list of connections.

I get your point. However, it's vhost_device_ops struct that's being set by the 
user via rte_vhost_driver_callback_register(). The new_connection callback is 
there just to mark the device as *in use, can't be deleted*. It doesn't 
transport any connection data.

Regards,
D.


Re: [dpdk-dev] [PATCH v2] vhost: added user callbacks for socket open/close

2017-08-28 Thread Stojaczyk, DariuszX
Hi Jens,

> I'm still not sure I understand the use case. So just for my
> understanding: users need to distinct between "the device is going away
> temporarily, keep the connection" and "we're shutting down for good", is
> that it?

Yes, exactly.

> Maybe it's just me or maybe it means you could explain your example in the
> commit message a bit more.

Ok. How about the following commit message instead:
```
rte_vhost: added user callbacks for socket open/close

Added new callbacks to notify about socket connection status.
As destroy_device is used for virtqueue processing *pause* as
well as connection close, the user has no distinction between those.

Consider the following scenario:
rte_vhost: received SET_VRING_BASE message,
  calling destroy_device() as usual

user:  end-user asks to remove the device (together with socket file),
  OK, device is not *in use* - that's NOT the behavior we want
  calling rte_vhost_driver_unregister() etc.

Instead of changing new_device/destroy_device callbacks and breaking
the ABI, a set of new functions new_connection/destroy_connection
has been added.
```

> Oh, and you should put the maintainers on Cc to get a faster review.

Thanks, I will!
Regards,
D. 


[dpdk-dev] virtio with 2MB hugepages - bringing back single file segments

2018-03-01 Thread Stojaczyk, DariuszX
Hi,

I'm trying to make a vhost-user initiator built upon DPDK work with 2MB 
hugepages. In the initiator we have to share all memory with the host process, 
so it
can perform DMA. DPDK currently enforces having one descriptor per hugepage and 
there's an artificial limit of shared descriptors in DPDK vhost-user 
implementation (currently 8). Because of that, all DPDK vhost-user initiators 
are practically limited to 1GB hugepages at the moment. We can always increase 
the artificial descriptor limit, but then we're limited by sendmsg() itself, 
which on Linux accepts no more than 253 descriptors. However, could we increase 
the vhost-user implementation limit to - say - 128, and bring back "single file 
segments" [1]?

Could I send a patch series that does this? The single file segments code would 
go through a cleanup - at least making it available via a runtime option rather 
than #ifdefs.

I know there's an ongoing rework of the memory allocator in DPDK [2] and it 
includes a similar single file segments functionality. However, it will 
probably take quite some time before merged and even then, the new 
functionality would only be available in the *new* allocator. The old one is 
kept unchanged. It could use single file segments as well.

Regards,
D.

[1] http://dpdk.org/dev/patchwork/patch/16042/
[2] http://dpdk.org/ml/archives/dev/2017-December/084302.html 


Re: [dpdk-dev] virtio with 2MB hugepages - bringing back single file segments

2018-03-02 Thread Stojaczyk, DariuszX
Hi Maxime,

> > Hi,
> >
> > I'm trying to make a vhost-user initiator built upon DPDK work with
> > 2MB hugepages. In the initiator we have to share all memory with the host
> process, so it can perform DMA. DPDK currently enforces having one descriptor
> per hugepage and there's an artificial limit of shared descriptors in DPDK
> vhost-user implementation (currently 8). Because of that, all DPDK vhost-user
> initiators are practically limited to 1GB hugepages at the moment. We can
> always increase the artificial descriptor limit, but then we're limited by
> sendmsg() itself, which on Linux accepts no more than 253 descriptors.
> However, could we increase the vhost-user implementation limit to - say - 128,
> and bring back "single file segments" [1]?
> 
> If you do something like this, you'll have first to update the vhost-user 
> spec,
> which should I think include a new protocol feature bit. 

Do I? I can't see any memory region limitation in the vhost-user spec [1]. If a 
128-region initiator tries to connect to an 8-region rte_vhost, the recvmsg() 
would simply fail - rte_vhost rejects `truncated` messages.

> 
> Also, you will have to consider improving the translation functions with a 
> better
> search algorithm, else you'll have very poor performance.

That's right, I got that sorted out already.

Regards,
D.

[1] https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.txt