Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Michael S. Tsirkin
On Tue, Aug 22, 2017 at 11:28:28AM -0700, Eric Dumazet wrote:
> On Tue, 2017-08-22 at 11:01 -0700, David Miller wrote:
> > From: "Michael S. Tsirkin" 
> > Date: Tue, 22 Aug 2017 20:55:56 +0300
> > 
> > > Which reminds me that skb_linearize in net core seems to be
> > > fundamentally racy - I suspect that if skb is cloned, and someone is
> > > trying to use the shared frags while another thread calls skb_linearize,
> > > we get some use after free bugs which likely mostly go undetected
> > > because the corrupted packets mostly go on wire and get dropped
> > > by checksum code.
> > 
> > Indeed, it does assume that the skb from which the clone was made
> > never has its geometry changed.
> > 
> > I don't think even the TCP retransmit queue has this guarantee.
> 
> TCP retransmit makes sure to avoid that.
> 
> if (skb_unclone(skb, GFP_ATOMIC))
>  return -ENOMEM;
> 
> ( Before cloning again skb )
> 
> 

I'm pretty sure not all users of skb_clone, or more generally of
__pskb_pull_tail, are careful like this.

E.g. skb_cow_data actually pulls pages intentionally even when the
skb is cloned.
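
For reference, the opening of skb_cow_data() does roughly the following
(paraphrased from net/core/skbuff.c, not a verbatim copy): it pulls all paged
data into the linear area even when the skb is cloned, which is the behaviour
being referred to above.

	/* If the skb is cloned or carries paged data, pull everything into
	 * the linear area so callers can scribble on the buffer. Note that
	 * this runs for cloned skbs as well. */
	if ((skb_cloned(skb) || skb_shinfo(skb)->nr_frags) &&
	    __pskb_pull_tail(skb, skb_pagelen(skb) - skb_headlen(skb)) == NULL)
		return -ENOMEM;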


-- 
MST


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Eric Dumazet
On Tue, 2017-08-22 at 11:01 -0700, David Miller wrote:
> From: "Michael S. Tsirkin" 
> Date: Tue, 22 Aug 2017 20:55:56 +0300
> 
> > Which reminds me that skb_linearize in net core seems to be
> > fundamentally racy - I suspect that if skb is cloned, and someone is
> > trying to use the shared frags while another thread calls skb_linearize,
> > we get some use after free bugs which likely mostly go undetected
> > because the corrupted packets mostly go on wire and get dropped
> > by checksum code.
> 
> Indeed, it does assume that the skb from which the clone was made
> never has its geometry changed.
> 
> I don't think even the TCP retransmit queue has this guarantee.

TCP retransmit makes sure to avoid that.

if (skb_unclone(skb, GFP_ATOMIC))
 return -ENOMEM;

( Before cloning again skb )
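
For readers following along, skb_unclone() is roughly the following helper
(paraphrased from include/linux/skbuff.h, not a verbatim copy): if the skb is
cloned, it reallocates the header via pskb_expand_head(), so the caller gets a
private skb_shared_info and later geometry changes do not touch the data that
the clone still references.

	static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
	{
		/* Only do work if someone else holds a clone of this skb. */
		if (skb_cloned(skb))
			return pskb_expand_head(skb, 0, 0, pri);
		return 0;
	}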





Re: [PATCH] net/virtio-net: reverse unregistering on exit

2017-08-22 Thread Michael S. Tsirkin
On Tue, Aug 22, 2017 at 04:59:51PM +0200, Pierre Morel wrote:
> Hi,
> 
> I got a problem with virtio-net in a QEMU guest.
> 
> When virtio-net is used as a module in a guest and the network is
> activated, removing the virtio-net modules produces a kernel error:
> 
> [ 2375.624949] Last Breaking-Event-Address:
> [ 2375.624954]  [<001453d4>] __cpuhp_remove_state+0x84/0x1a8
> [ 2375.624959] ---[ end trace cc9fd68b89f5235a ]---
> [ 2375.624966] Error: Removing state 163 which has instances left.
> [ 2375.624985] [ cut here ]

Which kernel do you use?


> The problem is produced by an inversion in the order of unregistering
> the driver and the hotplug state machine callbacks in virtio_net_driver_exit.
> 
> This patch solves the problem by first unregistering the virtio_net_driver
> before unregistering the hotplug state machine callbacks.
> 
> Best regards,
> 
> Pierre Morel
> 
> Pierre Morel (1):
>   net/virtio-net: reverse unregistering on exit
> 
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> -- 
> 2.7.4


Re: [PATCH] net/virtio-net: reverse unregistering on exit

2017-08-22 Thread Michael S. Tsirkin
On Tue, Aug 22, 2017 at 04:59:52PM +0200, Pierre Morel wrote:
> unregister_virtio_driver should be done before the unregistering of
> the hotplug state machine callbacks, otherwise the state machine still
> holds some instance states at that time.
> 
> Let's unregister the virtio_net_driver first and then the hotplug
> state machine callbacks.
> 
> Signed-off-by: Pierre Morel 

I'm having a deja vu.

Wasn't this fixed by

commit cfa0ebc9d6d6308564f5174ecb655b9d504b2be5
Author: Andrew Jones 
Date:   Mon Jul 24 15:38:32 2017 +0200

virtio-net: fix module unloading

already?

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 143d8a9..c042ffd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2734,9 +2734,9 @@ module_init(virtio_net_driver_init);
>  
>  static __exit void virtio_net_driver_exit(void)
>  {
> + unregister_virtio_driver(&virtio_net_driver);
>   cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>   cpuhp_remove_multi_state(virtionet_online);
> - unregister_virtio_driver(&virtio_net_driver);
>  }
>  module_exit(virtio_net_driver_exit);
>  
> -- 
> 2.7.4


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 22 Aug 2017 20:55:56 +0300

> Which reminds me that skb_linearize in net core seems to be
> fundamentally racy - I suspect that if skb is cloned, and someone is
> trying to use the shared frags while another thread calls skb_linearize,
> we get some use after free bugs which likely mostly go undetected
> because the corrupted packets mostly go on wire and get dropped
> by checksum code.

Indeed, it does assume that the skb from which the clone was made
never has its geometry changed.

I don't think even the TCP retransmit queue has this guarantee.


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Michael S. Tsirkin
On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > Perhaps the descriptor pool should also be
> > revised to allow out of order completions. Then there is no need to
> > copy zerocopy packets whenever they may experience delay.
> 
> Yes, but as replied in the referenced thread, windows driver may treat out
> of order completion as a bug.

That would be a windows driver bug then, but I don't think it makes this
assumption. What the referenced thread
(https://patchwork.kernel.org/patch/3787671/) is saying is that the host
must use any buffers made available on a tx vq within a reasonable
timeframe, otherwise windows guests panic.

Ideally we would detect that a packet is actually experiencing delay and
trigger the copy at that point, e.g. by calling skb_linearize. But it
isn't easy to track these packets, and it is even harder to do a data
copy without races.

Which reminds me that skb_linearize in net core seems to be
fundamentally racy - I suspect that if skb is cloned, and someone is
trying to use the shared frags while another thread calls skb_linearize,
we get some use after free bugs which likely mostly go undetected
because the corrupted packets mostly go on wire and get dropped
by checksum code.

-- 
MST


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Willem de Bruijn
>> > An issue of the referenced patch is that sndbuf could be smaller than low
>> > watermark.
> We cannot determine the low watermark properly, not only because of the
> sndbuf size issue but also because the upper vhost-net cannot directly see
> how many descriptors are currently available at the virtio-net tx queue. It
> depends on the multiqueue settings and on other senders which are also using
> the same tx queue. Note that in the latter case, if they are constantly
> transmitting, the deadlock cannot occur(*); however, if they have only
> temporarily filled some portion of the pool in the meantime, then the low
> watermark cannot help.
> (*: That is because it is reliable enough in the sense I mention below.)
>
> Keeping this in mind, let me briefly describe the possible deadlock I
> mentioned:
> (1). vhost-net on the L1 guest cannot do sendmsg until the upper layer sets
> up new descriptors, which depends only on the vhost-net zerocopy callback
> and on adding newly used descriptors.
> (2). the vhost-net callback depends only on the skb freeing on the xmit path.
> (3). the xmit path depends (possibly only) on the vhost-net sendmsg.
> As you can see, it is enough to bring about this situation that the L1
> virtio-net reaches its limit earlier than the L0 host processing does. The
> vhost-net pool could be almost full or empty; it does not matter which.

Thanks for the context. This issue is very similar to the one that used to
exist when running out of transmit descriptors, before the removal of
the timer and introduction of skb_orphan in start_xmit.

To make sure that I understand correctly, let me paraphrase:

A. guest socket cannot send because it exhausted its sk budget (sndbuf, tsq, ..)

B. budget is not freed up until guest receives tx completion for this flow

C. tx completion is held back on the host side in vhost_zerocopy_signal_used
   behind the completion for an unrelated skb

D. unrelated packet is delayed somewhere in the host stack, e.g., netem

The issue that is specific to vhost-net zerocopy is that (C) enforces strict
ordering of transmit completions causing head of line blocking behind
vhost-net zerocopy callbacks.

This is a different problem from

C1. tx completion is delayed until guest sends another packet and
   triggers free_old_xmit_skbs

Both in host and guest, zerocopy packets should never be able to loop
to a receive path where they can cause unbounded delay.

The obvious cases of latency are queueing, like netem. That leads
to poor performance for unrelated flows, but I don't see how this
could cause deadlock.
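
For context on (C): the strict ordering comes from the way
vhost_zerocopy_signal_used() hands used descriptors back to the guest. A
simplified sketch, paraphrased from memory of drivers/vhost/net.c (field names
approximate, ring wrap-around handling omitted), shows where the head of line
blocking arises:

	static void zerocopy_signal_used_sketch(struct vhost_net_virtqueue *nvq)
	{
		struct vhost_virtqueue *vq = &nvq->vq;
		int i, done = 0;

		/* Completions are returned strictly in submission order,
		 * starting at done_idx. The first entry whose DMA has not
		 * finished stops the scan, so every entry queued behind it,
		 * including those of unrelated flows, has to wait. */
		for (i = nvq->done_idx; i != nvq->upend_idx;
		     i = (i + 1) % UIO_MAXIOV) {
			if (!VHOST_DMA_IS_DONE(vq->heads[i].len))
				break;		/* head of line blocking */
			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
			done++;
		}
		if (done)
			vhost_add_used_and_signal_n(vq->dev, vq,
						    &vq->heads[nvq->done_idx],
						    done);
		nvq->done_idx = (nvq->done_idx + done) % UIO_MAXIOV;
	}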


[PATCH] net/virtio-net: reverse unregistering on exit

2017-08-22 Thread Pierre Morel
Hi,

I got a problem with virtio-net in a QEMU guest.

When virtio-net is used as a module in a guest and the network is
activated, removing the virtio-net modules produces a kernel error:

[ 2375.624949] Last Breaking-Event-Address:
[ 2375.624954]  [<001453d4>] __cpuhp_remove_state+0x84/0x1a8
[ 2375.624959] ---[ end trace cc9fd68b89f5235a ]---
[ 2375.624966] Error: Removing state 163 which has instances left.
[ 2375.624985] [ cut here ]

The problem is produced by an inversion in the order of unregistering
the driver and the hotplug state machine callbacks in virtio_net_driver_exit.

This patch solves the problem by first unregistering the virtio_net_driver
before unregistering the hotplug state machine callbacks.

Best regards,

Pierre Morel

Pierre Morel (1):
  net/virtio-net: reverse unregistering on exit

 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4



[PATCH] net/virtio-net: reverse unregistering on exit

2017-08-22 Thread Pierre Morel
unregister_virtio_driver should be done before the unregistering of
the hotplug state machine callbacks, otherwise the state machine still
holds some instance states at that time.

Let's unregister the virtio_net_driver first and then the hotplug
state machine callbacks.

Signed-off-by: Pierre Morel 
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 143d8a9..c042ffd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2734,9 +2734,9 @@ module_init(virtio_net_driver_init);
 
 static __exit void virtio_net_driver_exit(void)
 {
+   unregister_virtio_driver(&virtio_net_driver);
cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
cpuhp_remove_multi_state(virtionet_online);
-   unregister_virtio_driver(&virtio_net_driver);
 }
 module_exit(virtio_net_driver_exit);
 
-- 
2.7.4
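
For context (not part of the patch itself): module exit should mirror module
init in reverse order. A rough sketch of virtio_net_driver_init() from this era
of the driver, paraphrased from memory (exact hotplug state names and error
labels may differ), registers the hotplug states first and the virtio driver
last, which is why the driver has to be unregistered first on exit:

	static __init int virtio_net_driver_init(void)
	{
		int ret;

		/* Hotplug callbacks are registered before the driver... */
		ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
					      "virtio/net:online",
					      virtnet_cpu_online,
					      virtnet_cpu_down_prep);
		if (ret < 0)
			return ret;
		virtionet_online = ret;
		ret = cpuhp_setup_state_multi(CPUHP_VIRT_NET_DEAD,
					      "virtio/net:dead",
					      NULL, virtnet_cpu_dead);
		if (ret)
			goto err_dead;

		/* ...and the driver is registered last, so on exit it must
		 * be unregistered first. */
		ret = register_virtio_driver(&virtio_net_driver);
		if (ret)
			goto err_virtio;
		return 0;

	err_virtio:
		cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
	err_dead:
		cpuhp_remove_multi_state(virtionet_online);
		return ret;
	}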



Re: [PATCH] vhost: fix end of range for access_ok

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 22:45 +0300, Michael S. Tsirkin wrote:
> During access_ok checks, addr increases as we iterate over the data
> structure, thus addr + len - 1 will point beyond the end of region we
> are translating.  Harmless since we then verify that the region covers
> addr, but let's not waste cpu cycles.
> 
> Reported-by: Koichiro Den 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> Lightly tested, would appreciate an ack from reporter.
> 
>  drivers/vhost/vhost.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e4613a3..ecd70e4 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1176,7 +1176,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
>  {
>   const struct vhost_umem_node *node;
>   struct vhost_umem *umem = vq->iotlb;
> - u64 s = 0, size, orig_addr = addr;
> + u64 s = 0, size, orig_addr = addr, last = addr + len - 1;
>  
>   if (vhost_vq_meta_fetch(vq, addr, len, type))
>   return true;
> @@ -1184,7 +1184,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
>   while (len > s) {
>   node = vhost_umem_interval_tree_iter_first(>umem_tree,
>      addr,
> -    addr + len - 1);
> +    last);
>   if (node == NULL || node->start > addr) {
>   vhost_iotlb_miss(vq, addr, access);
>   return false;

Michael, Thank you for this one.

Acked-by: Koichiro Den 


Re: [RFC 2/3] virtio-iommu: device probing and operations

2017-08-22 Thread Jean-Philippe Brucker
On 22/08/17 07:24, Tian, Kevin wrote:
>>> (sorry to pick up this old thread, as the .tex one is not good for review
>>> and this thread provides necessary background for IOASID).
>>>
>>> Hi, Jean,
>>>
>>> I'd like to hear more clarification regarding the relationship between
>>> IOASID and PASID. When reading back above explanation, it looks
>>> confusing to me now (though I might get the meaning months ago :/).
>>> At least Intel VT-d only understands PASID (or you can think IOASID
>>> =PASID). There is no such layered address space concept. Then for
>>> map/unmap type request, do you intend to steal some PASIDs for
>>> that purpose on such architecture (since IOASID is a mandatory field
>>> in map/unmap request)?
>>
>> IOASID is a logical ID, it isn't used by hardware. The address space
>> concept in virtio-iommu allows to group endpoints together, so that they
>> have the same address space. I thought it was pretty much the same as
>> "domains" in VT-d? In any case, it is the same as domains in Linux. An
>> IOASID provides a handle for communication between virtio-iommu device and
>> driver, but unlike PASID, the IOASID number doesn't mean anything outside
>> of virtio-iommu.
> 
> Thanks. It's clear to me then.
> 
> btw does it make more sense to use "domain id" instead of "IO address
> space id"? For one, when talking about layered address spaces, usually the
> parent address space is a superset of all child address spaces, which
> doesn't apply to this case, since either the anonymous address space or the
> PASID-tagged address spaces are completely isolated. Instead, 'domain' is a
> more inclusive terminology to embrace multiple address spaces. For two,
> 'domain' is better aligned to software terminology (e.g. iommu_domain),
> which is easier for people to catch up with. :-)

I do agree that the naming isn't great. I didn't use "domain" for various
reasons (it also has a different meaning in ARM) but I keep regretting
it. As there is no virtio-iommu code upstream yet, it is still possible to
change this one.

I find that "address space" was a good fit for the baseline device, but
the name doesn't scale. When introducing PASIDs, the address space moves
one level down in the programming model. It is contexts, anonymous or
PASID-tagged, that should be called address spaces. I was considering
replacing it with "domain", "container", "partition"...

Even though I don't want to use too much Linux terminology (virtio isn't
just Linux), "domain" is better fitted, somewhat neutral, and gets the
point across. A domain has one or more input address spaces and a single
output address space.

When introducing nested translation to virtio-iommu (for the guest to have
virtual machines itself), there will be one or more intermediate address
spaces. Domains will be nested, with the terminology "parent domain" and
"child domain". I only briefly looked at a programming model for this but
I think we can nest virtio-iommus without much hassle.

If there is no objection the next version will use "domain" in place of
"address_space". The change is quite invasive at this point, but I believe
that it will make things clearer down the road.
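
Purely as an illustration of the rename (hypothetical, not from the spec
draft), the virtio_iommu_req_attach structure from the RFC would simply
become:

	struct virtio_iommu_req_attach {
		le32	domain;		/* was: address_space */
		le32	device;
		le32	reserved;	/* flags/reserved in the RFC */
	};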

>> I haven't introduced PASIDs in public virtio-iommu documents yet, but the
>> way I intend it, PASID != IOASID. We will still have a logical address
>> space identified by IOASID, that can contain multiple contexts identified
>> by PASID. At the moment, after the ATTACH request, an address space
>> contains a single anonymous context (NO PASID) that can be managed with
>> MAP/UNMAP requests. With virtio-iommu v0.4, structures look like the
>> following. The NO PASID context is implicit.
>>
>>            address space  context
>> endpoint -.                          .- mapping
>> endpoint -+--- IOASID --- NO PASID --+- mapping
>> endpoint -'                          '- mapping
>>
>> I'd like to add a flag to ATTACH that says "don't create a default
>> anonymous context, I'll handle contexts myself". Then a new "ADD_TABLE"
>> request to handle contexts. When creating a context, the guest decides if
>> it wants to manage it via MAP/UNMAP requests (and a new "context" field),
>> or instead manage mappings itself by allocating a page directory and use
>> INVALIDATE requests.
>>
>>            address space    context
>> endpoint -.                            .- mapping
>> endpoint -+--- IOASID -+--- NO PASID --+- mapping
>> endpoint -'            |               '- mapping
>>                        +--- PASID 0   pgd
>>                        |    ...
>>                        '--- PASID N   pgd
>>
>> In this example the guest chose to still have an anonymous context that
>> uses MAP/UNMAP, along with a few PASID contexts with their own page
>> tables.
>>
> 
> Above explanation is a good background. Is it useful to include it
> in current spec? Though SVM 

Re: [PATCH 1/2] vhost: remove the possible fruitless search on iotlb prefetch

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 22:45 +0300, Michael S. Tsirkin wrote:
> On Sat, Aug 19, 2017 at 03:41:14PM +0900, Koichiro Den wrote:
> > Signed-off-by: Koichiro Den 
> > ---
> >  drivers/vhost/vhost.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index e4613a3c362d..93e909afc1c3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1184,7 +1184,7 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
> >     while (len > s) {
> >             node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
> >                                                        addr,
> > -                                                      addr + len - 1);
> > +                                                      addr + len - s - 1);
> >             if (node == NULL || node->start > addr) {
> >     vhost_iotlb_miss(vq, addr, access);
> >     return false;
> 
> This works but it probably makes sense to just refactor the code to make
> end of range a variable. I posted a patch like this, pls take a look.
> 
> > -- 
> > 2.9.4
> > 
Ok, thanks.


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Koichiro Den
On Tue, 2017-08-22 at 08:11 -0400, Willem de Bruijn wrote:
> On Sun, Aug 20, 2017 at 4:49 PM, Willem de Bruijn
>  wrote:
> > On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den  wrote:
> > > Facing the possible unbounded delay relying on freeing on the xmit path,
> > > we had better also invoke and clear the upper layer zerocopy callback
> > > beforehand to keep it from waiting for an unbounded duration in vain.
> > 
> > Good point.
> > 
> > > For instance, this removes the possible deadlock in the case that the
> > > upper layer is a zerocopy-enabled vhost-net.
> > > This does not apply if napi_tx is enabled since it will be called in
> > > reasonable time.
> > 
> > Indeed. Btw, I am gathering data to eventually make napi the default
> > mode. But that is taking some time.
> > 
> > > 
> > > Signed-off-by: Koichiro Den 
> > > ---
> > >  drivers/net/virtio_net.c | 8 
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 4302f313d9a7..f7deaa5b7b50 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > > struct net_device *dev)
> > > 
> > > /* Don't wait up for transmitted skbs to be freed. */
> > > if (!use_napi) {
> > > +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > > +   struct ubuf_info *uarg;
> > > +   uarg = skb_shinfo(skb)->destructor_arg;
> > > +   if (uarg->callback)
> > > +   uarg->callback(uarg, true);
> > > +   skb_shinfo(skb)->destructor_arg = NULL;
> > > +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > > +   }
> > 
> > Instead of open coding, this can use skb_zcopy_clear.
> 
> It is not correct to only send the zerocopy completion here, as
> the skb will still be shared with the nic. The only safe approach
> to notifying early is to create a copy with skb_orphan_frags_rx.
> That will call skb_zcopy_clear(skb, false) internally. But the
> copy will be very expensive.
I noticed this email after my last post. I cannot imagine how it could be
unsafe in the virtio case. Sorry, could you explain a bit more?
> 
> Is the deadlock you refer to the case where tx interrupts are
> disabled, so transmit completions are only handled in start_xmit
> and somehow the socket(s) are unable to send new data? The
> question is what is blocking them. If it is zerocopy, it may be a
> too low optmem_max or hitting the per-user locked pages limit.
> We may have to keep interrupts enabled when zerocopy skbs
> are in flight.
Even if we keep interrupts enabled, we miss the completion without start_xmit.
And it's also likely that the next start_xmit depends on the completion itself
as I described in my last post.


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Koichiro Den
On Mon, 2017-08-21 at 23:10 -0400, Willem de Bruijn wrote:
> > > > Interesting, deadlock could be treated as a radical case of the
> > > > discussion
> > > > here https://patchwork.kernel.org/patch/3787671/.
> > > > 
> > > > git grep tells more similar skb_orphan() cases. Do we need to change
> > > > them
> > > > all (or part)?
> > > 
> > > Most skb_orphan calls are not relevant to the issue of transmit delay.
> > 
> > 
> > Yes, but at least we should audit the ones in drivers/net.
> 
> Do you mean other virtual device driver transmit paths, like xen,
> specifically?
> 
> > > > Actually, we may meet similar issues at many other places (e.g netem).
> > > 
> > > Netem is an interesting case. Because it is intended to mimic network
> > > delay, at least in the case where it calls skb_orphan, it may make
> > > sense to release all references, including calling skb_zcopy_clear.
> > > 
> > > In general, zerocopy reverts to copy on all paths that may cause
> > > unbounded delay due to another process. Guarding against delay
> > > induced by the administrator is infeasible. It is always possible to
> > > just pause the nic. Netem is one instance of that, and not unbounded.
> > 
> > 
> > The problem is, admin may only delay the traffic in e.g one interface, but
> > it actually delays or stalls all traffic inside a VM.
> 
> Understood. Ideally we can remove the HoL blocking cause of this,
> itself.
> 
> > > > Need
> > > > to consider a complete solution for this. Figuring out all places that
> > > > could
> > > > delay a packet is a method.
> > > 
> > > The issue described in the referenced patch seems like head of line
> > > blocking between two flows. If one flow delays zerocopy descriptor
> > > release from the vhost-net pool, it blocks all subsequent descriptors
> > > in that pool from being released, also delaying other flows that use
> > > the same descriptor pool. If the pool is empty, all transmission stopped.
> > > 
> > > Reverting to copy tx when the pool reaches a low watermark, as the
> > > patch does, fixes this.
> > 
> > 
> > An issue of the referenced patch is that sndbuf could be smaller than low
> > watermark.
We cannot determine the low watermark properly, not only because of the sndbuf
size issue but also because the upper vhost-net cannot directly see how many
descriptors are currently available at the virtio-net tx queue. It depends on
the multiqueue settings and on other senders which are also using the same tx
queue. Note that in the latter case, if they are constantly transmitting, the
deadlock cannot occur(*); however, if they have only temporarily filled some
portion of the pool in the meantime, then the low watermark cannot help.
(*: That is because it is reliable enough in the sense I mention below.)

Keeping this in mind, let me briefly describe the possible deadlock I mentioned:
(1). vhost-net on the L1 guest cannot do sendmsg until the upper layer sets up
new descriptors, which depends only on the vhost-net zerocopy callback and on
adding newly used descriptors.
(2). the vhost-net callback depends only on the skb freeing on the xmit path.
(3). the xmit path depends (possibly only) on the vhost-net sendmsg.
As you can see, it is enough to bring about this situation that the L1
virtio-net reaches its limit earlier than the L0 host processing does. The
vhost-net pool could be almost full or empty; it does not matter which.

Also note that relaxing restriction (2) (and thus removing the dependency
(3) -> (1)) seems quite natural, but it needs to be reliable enough to never
miss the last trigger if it is based on interrupts. We might need to modify the
virtio interrupt mitigation in the non tx-napi case, but that is too costly and
it is not fair to have it cover an upper-layer peculiarity.

So I think drivers that have characteristics such as (2) are worth checking to
see whether we should revise them.
> > 
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > 
> > Yes, but as replied in the referenced thread, windows driver may treat out
> > of order completion as a bug.
> 
> Interesting. I missed that. Perhaps the zerocopy optimization
> could be gated on guest support for out of order completions.
> 
> > > On the point of counting copy vs zerocopy: the new msg_zerocopy
> > > variant of ubuf_info has a field to record whether a deep copy was
> > > made. This can be used with vhost-net zerocopy, too.
> > 
> > 
> > Just to make sure I understand. It's still not clear to me how to reuse this
> > for vhost-net, e.g zerocopy flag is in a union which is not used by
> > vhost_net.
> 
> True, but that is not set in stone. I went back and forth on that when
> preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
> with msg_zerocopy"). The field can be moved outside the union and
> initialized in the other zerocopy paths.


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Willem de Bruijn
On Sun, Aug 20, 2017 at 4:49 PM, Willem de Bruijn
 wrote:
> On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den  wrote:
>> Facing the possible unbounded delay relying on freeing on the xmit path,
>> we had better also invoke and clear the upper layer zerocopy callback
>> beforehand to keep it from waiting for an unbounded duration in vain.
>
> Good point.
>
>> For instance, this removes the possible deadlock in the case that the
>> upper layer is a zerocopy-enabled vhost-net.
>> This does not apply if napi_tx is enabled since it will be called in
>> reasonable time.
>
> Indeed. Btw, I am gathering data to eventually make napi the default
> mode. But that is taking some time.
>
>>
>> Signed-off-by: Koichiro Den 
>> ---
>>  drivers/net/virtio_net.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 4302f313d9a7..f7deaa5b7b50 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
>> struct net_device *dev)
>>
>> /* Don't wait up for transmitted skbs to be freed. */
>> if (!use_napi) {
>> +   if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
>> +   struct ubuf_info *uarg;
>> +   uarg = skb_shinfo(skb)->destructor_arg;
>> +   if (uarg->callback)
>> +   uarg->callback(uarg, true);
>> +   skb_shinfo(skb)->destructor_arg = NULL;
>> +   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>> +   }
>
> Instead of open coding, this can use skb_zcopy_clear.

It is not correct to only send the zerocopy completion here, as
the skb will still be shared with the nic. The only safe approach
to notifying early is to create a copy with skb_orphan_frags_rx.
That will call skb_zcopy_clear(skb, false) internally. But the
copy will be very expensive.
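
A minimal sketch of that safer approach, assuming the copy is done near the
top of start_xmit() (before xmit_skb() posts the pages to the virtqueue) and
assuming the net-next zerocopy helpers skb_orphan_frags_rx()/skb_copy_ubufs();
this is an illustration of the idea, not a proposed patch:

	/* Early in start_xmit(), before the skb's pages are handed to the
	 * ring: if the skb references zerocopy user/guest pages and we will
	 * not complete it promptly (no tx napi), replace those pages with
	 * private copies. skb_orphan_frags_rx() calls skb_copy_ubufs(),
	 * which copies the frags and fires the ubuf_info callback, so the
	 * upper layer is released immediately. */
	if (!use_napi && unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) {
		/* Copy failed (out of memory): drop the packet. */
		dev->stats.tx_dropped++;
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}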

Is the deadlock you refer to the case where tx interrupts are
disabled, so transmit completions are only handled in start_xmit
and somehow the socket(s) are unable to send new data? The
question is what is blocking them. If it is zerocopy, it may be a
too low optmem_max or hitting the per-user locked pages limit.
We may have to keep interrupts enabled when zerocopy skbs
are in flight.


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-22 Thread Jason Wang



On 2017-08-22 at 11:10, Willem de Bruijn wrote:

Interesting, deadlock could be treated as a radical case of the discussion
here https://patchwork.kernel.org/patch/3787671/.

git grep tells more similar skb_orphan() cases. Do we need to change them
all (or part)?

Most skb_orphan calls are not relevant to the issue of transmit delay.


Yes, but at least we should audit the ones in drivers/net.

Do you mean other virtual device driver transmit paths, like xen,
specifically?


Git grep does not show skb_orphan() being used for xen for me. But looking
at cxgb4/sge.c, it seems to call skb_orphan() for large packets and to
reclaim transmitted packets when:

- doing ndo_start_xmit()
- or a timer fires.


Actually, we may meet similar issues at many other places (e.g netem).

Netem is an interesting case. Because it is intended to mimic network
delay, at least in the case where it calls skb_orphan, it may make
sense to release all references, including calling skb_zcopy_clear.

In general, zerocopy reverts to copy on all paths that may cause
unbounded delay due to another process. Guarding against delay
induced by the administrator is infeasible. It is always possible to
just pause the nic. Netem is one instance of that, and not unbounded.


The problem is, admin may only delay the traffic in e.g one interface, but
it actually delays or stalls all traffic inside a VM.

Understood. Ideally we can remove the HoL blocking cause of this,
itself.


Need
to consider a complete solution for this. Figuring out all places that
could
delay a packet is a method.

The issue described in the referenced patch seems like head of line
blocking between two flows. If one flow delays zerocopy descriptor
release from the vhost-net pool, it blocks all subsequent descriptors
in that pool from being released, also delaying other flows that use
the same descriptor pool. If the pool is empty, all transmission stopped.

Reverting to copy tx when the pool reaches a low watermark, as the
patch does, fixes this.


An issue of the referenced patch is that sndbuf could be smaller than low
watermark.


Perhaps the descriptor pool should also be
revised to allow out of order completions. Then there is no need to
copy zerocopy packets whenever they may experience delay.


Yes, but as replied in the referenced thread, windows driver may treat out
of order completion as a bug.

Interesting. I missed that. Perhaps the zerocopy optimization
could be gated on guest support for out of order completions.


Yes, we may plan to explicitly notify the driver about out of order
completion in future virtio.





On the point of counting copy vs zerocopy: the new msg_zerocopy
variant of ubuf_info has a field to record whether a deep copy was
made. This can be used with vhost-net zerocopy, too.


Just to make sure I understand. It's still not clear to me how to reuse this
for vhost-net, e.g zerocopy flag is in a union which is not used by
vhost_net.

True, but that is not set in stone. I went back and forth on that when
preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
with msg_zerocopy"). The field can be moved outside the union and
initialized in the other zerocopy paths.


Ok. I see.

Thanks

Re: [PATCH] drm: virtio: constify drm_fb_helper_funcs

2017-08-22 Thread Daniel Vetter
On Mon, Aug 21, 2017 at 04:37:32PM +0530, Arvind Yadav wrote:
> drm_fb_helper_funcs are not supposed to change at runtime.
> All functions working with drm_fb_helper_funcs provided
> by  work with const drm_fb_helper_funcs.
> So mark the non-const structs as const.
> 
> Signed-off-by: Arvind Yadav 
> ---
>  drivers/gpu/drm/virtio/virtgpu_fb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_fb.c 
> b/drivers/gpu/drm/virtio/virtgpu_fb.c
> index 33df067..61f33ec 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_fb.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_fb.c
> @@ -309,7 +309,7 @@ static int virtio_gpu_fbdev_destroy(struct drm_device 
> *dev,
>  
>   return 0;
>  }
> -static struct drm_fb_helper_funcs virtio_gpu_fb_helper_funcs = {
> +static const struct drm_fb_helper_funcs virtio_gpu_fb_helper_funcs = {

I have this patch already from someone else. Please create your patches
against linux-next to avoid this kind of duplicated work.

Thanks, Daniel
>   .fb_probe = virtio_gpufb_create,
>  };
>  
> -- 
> 1.9.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-08-22 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Monday, August 21, 2017 8:00 PM
> 
> On 21/08/17 08:59, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> >> Sent: Monday, April 24, 2017 11:06 PM
> >>   1. Attach device
> >>   
> >>
> >> struct virtio_iommu_req_attach {
> >>     le32    address_space;
> >>     le32    device;
> >>     le32    flags/reserved;
> >> };
> >>
> >> Attach a device to an address space. 'address_space' is an identifier
> >> unique to the guest. If the address space doesn't exist in the IOMMU
> >
> > Based on your description this address space ID is per operation right?
> > MAP/UNMAP and page-table sharing should have different ID spaces...
> 
>  I think it's simpler if we keep a single IOASID space per virtio-iommu
>  device, because the maximum number of address spaces (described by
>  ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
>  you still need to define which devices will share a page directory using
>  ATTACH requests, though that interface is not set in stone.
> >>>
> >>> got you. yes VM is supposed to consume less IOASIDs than physically
> >>> available. It doesn’t hurt to have one IOASID space for both IOVA
> >>> map/unmap usages (one IOASID per device) and SVM usages (multiple
> >>> IOASIDs per device). The former is digested by software and the latter
> >>> will be bound to hardware.
> >>>
> >>
> >> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
> >> then contexts indexed by PASID when talking about SVM. So in my mind an
> >> address space can have multiple sub-address-spaces (contexts). Number of
> >> IOASIDs is a limitation of the pIOMMU, and number of PASIDs is a
> >> limitation of the device. Therefore attaching devices to address spaces
> >> would update the number of available contexts in that address space. The
> >> terminology is not ideal, and I'd be happy to change it for something more
> >> clear.
> >>
> >
> > (sorry to pick up this old thread, as the .tex one is not good for review
> > and this thread provides necessary background for IOASID).
> >
> > Hi, Jean,
> >
> > I'd like to hear more clarification regarding the relationship between
> > IOASID and PASID. When reading back above explanation, it looks
> > confusing to me now (though I might get the meaning months ago :/).
> > At least Intel VT-d only understands PASID (or you can think IOASID
> > =PASID). There is no such layered address space concept. Then for
> > map/unmap type request, do you intend to steal some PASIDs for
> > that purpose on such architecture (since IOASID is a mandatory field
> > in map/unmap request)?
> 
> IOASID is a logical ID, it isn't used by hardware. The address space
> concept in virtio-iommu allows to group endpoints together, so that they
> have the same address space. I thought it was pretty much the same as
> "domains" in VT-d? In any case, it is the same as domains in Linux. An
> IOASID provides a handle for communication between virtio-iommu device and
> driver, but unlike PASID, the IOASID number doesn't mean anything outside
> of virtio-iommu.

Thanks. It's clear to me then.

btw does it make more sense to use "domain id" instead of "IO address
space id"? For one, when talking about layered address spaces, usually the
parent address space is a superset of all child address spaces, which
doesn't apply to this case, since either the anonymous address space or the
PASID-tagged address spaces are completely isolated. Instead, 'domain' is a
more inclusive terminology to embrace multiple address spaces. For two,
'domain' is better aligned to software terminology (e.g. iommu_domain),
which is easier for people to catch up with. :-)

> 
> I haven't introduced PASIDs in public virtio-iommu documents yet, but the
> way I intend it, PASID != IOASID. We will still have a logical address
> space identified by IOASID, that can contain multiple contexts identified
> by PASID. At the moment, after the ATTACH request, an address space
> contains a single anonymous context (NO PASID) that can be managed with
> MAP/UNMAP requests. With virtio-iommu v0.4, structures look like the
> following. The NO PASID context is implicit.
> 
>            address space  context
> endpoint -.                          .- mapping
> endpoint -+--- IOASID --- NO PASID --+- mapping
> endpoint -'                          '- mapping
> 
> I'd like to add a flag to ATTACH that says "don't create a default
> anonymous context, I'll handle contexts myself". Then a new "ADD_TABLE"
> request to handle contexts. When creating a context, the guest decides if
> it wants to manage it via MAP/UNMAP requests (and a new "context" field),
> or instead manage mappings itself by allocating a page directory and use
> INVALIDATE requests.
>