Re: [External] Re: [RFC PATCH] virtio-vsock: add description for datagram type

2021-03-26 Thread Jiang Wang .
Hi Michael and Stefan,

I thought about this and discussed it with my colleague Cong Wang.
One idea is to make the current asynchronous send_pkt flow synchronous;
then, if the virtqueue is full, the function can return -ENOMEM all the
way back to the caller, and the caller can check the return value of
sendmsg() and slow down when necessary.

In the spec, we can put something like: if the virtqueue is full, the
caller should be notified with an error.

In terms of implementation, that means we will remove the current
send_pkt_work for both stream and dgram sockets. The current code path
uses RCU and a work queue, then grabs a mutex in the work queue
function. Since we cannot grab a mutex inside an RCU critical section,
we would have to change the RCU to a normal reference counting
mechanism. I think this is doable. The drawback is that reference
counting generally costs a few more cycles than RCU, so there is a
small price to pay. Another option is to use sleepable RCU (SRCU) and
remove the work queue.
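
A minimal sketch of the synchronous path (every name below, my_vsock_dev,
tx_lock, tx_vq, try_enqueue_pkt(), is a placeholder and not the existing
driver API; it only illustrates how the error would propagate):

static int send_pkt_sync(struct my_vsock_dev *dev, struct my_pkt *pkt)
{
        int ret;

        /* Plain mutex instead of RCU + work queue: we may sleep here. */
        mutex_lock(&dev->tx_lock);
        ret = try_enqueue_pkt(dev->tx_vq, pkt); /* placeholder helper */
        if (ret == -ENOSPC)
                ret = -ENOMEM;                  /* virtqueue full */
        mutex_unlock(&dev->tx_lock);

        /* The caller's sendmsg() sees the error and can back off and retry. */
        return ret;
}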

What do you guys think?

btw, I will also add some sndbuf restrictions for the dgram sockets,
but I don't think that needs to be in the spec.

Regards,

Jiang

On Tue, Mar 23, 2021 at 1:53 AM Stefan Hajnoczi  wrote:
>
> On Mon, Mar 22, 2021 at 07:23:14PM -0700, Jiang Wang . wrote:
> > Got it. Will do.
>
> You could look at udp_sendmsg() to see how sockets compete when
> transmitting to the same net device.
>
> I'm not very familiar with this but I guess that the qdisc (like
> fq_codel) decides which packets to place into the device's tx queue. I
> guess sk_buffs waiting to be put onto the device's tx queue are
> accounted for against the socket's sndbuf. Further sendmsg calls will
> fail with -ENOBUFS when the sndbuf limit is reached.
>
> It's not clear to me how much of the existing code can be reused since
> vsock does not use sk_buff or netdev :(.
>
> Stefan


Re: [PATCH] virtio-balloon: move release_pages_balloon() outside of mutex_unlock(&vb->balloon_lock)

2021-03-26 Thread David Hildenbrand

On 26.03.21 10:53, Liu Xiang wrote:

Since pages have been deflated to a local list,
there is no race between fill and leak.

Signed-off-by: Liu Xiang 
---
  drivers/virtio/virtio_balloon.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 8985fc2ce..7da25b87f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -303,8 +303,8 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
+	release_pages_balloon(vb, &pages);
 	return num_freed_pages;
 }
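
For context, an abbreviated sketch of leak_balloon() with this change
applied (simplified; the dequeue loop and some locals are omitted, so
details differ from the real function). The deflated pages sit on a
function-local list, so releasing them after dropping balloon_lock
cannot race with fill_balloon():

static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
{
        unsigned num_freed_pages;
        LIST_HEAD(pages);       /* local list, private to this call */

        mutex_lock(&vb->balloon_lock);
        /* ... dequeue up to num pages and move them onto the local list ... */
        num_freed_pages = vb->num_pfns;
        if (vb->num_pfns != 0)
                tell_host(vb, vb->deflate_vq);
        mutex_unlock(&vb->balloon_lock);
        /* Safe outside the lock: only this call can see the local list. */
        release_pages_balloon(vb, &pages);
        return num_freed_pages;
}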
  



I think this should be fine.

Reviewed-by: David Hildenbrand 

--
Thanks,

David / dhildenb



Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver

2021-03-26 Thread Jason Wang


On 2021/3/26 2:56 PM, Yongji Xie wrote:

On Fri, Mar 26, 2021 at 2:16 PM Jason Wang  wrote:


On 2021/3/26 1:14 PM, Yongji Xie wrote:

+ }
+ map->bounce_page = page;
+
+ /* paired with vduse_domain_map_page() */
+ smp_mb();

So this is suspicious. It's better to explain it as: we need to make
sure A is done after B.

OK. I see. It's used to protect this pattern:

  vduse_domain_alloc_bounce_page:    vduse_domain_map_page:
  write map->bounce_page             write map->orig_phys
  mb()                               mb()
  read map->orig_phys                read map->bounce_page

This makes sure there will always be a path that does the bouncing.
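
Spelled out as a sketch (do_bounce() and INVALID_PHYS_ADDR are
placeholders here, not necessarily the exact driver code), the paired
barriers guarantee that at least one of the two paths observes the
other's write and performs the copy:

/* vduse_domain_alloc_bounce_page() side: */
map->bounce_page = page;                 /* A: publish the bounce page */
smp_mb();                                /* pairs with the barrier below */
if (map->orig_phys != INVALID_PHYS_ADDR) /* B: did map_page() already run? */
        do_bounce(map->orig_phys, page);

/* vduse_domain_map_page() side: */
map->orig_phys = orig_phys;              /* A': publish the original address */
smp_mb();                                /* pairs with the barrier above */
if (map->bounce_page)                    /* B': is the bounce page already there? */
        do_bounce(orig_phys, map->bounce_page);

Whichever side writes first, the other side's read after its barrier
sees that write, so the bouncing cannot be skipped by both.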

Ok.


And it looks to me the iotlb_lock is sufficient for the synchronization
here. E.g. is there any reason you don't take it in
vduse_domain_map_bounce_page()?

Yes, we can. But the performance in multi-queue cases will go down if
we use iotlb_lock on this critical path.

And what's more, is there any way to avoid holding the spinlock during
the bouncing?

Looks like we can't. In the case that multiple page faults happen on
the same page, we should make sure the bouncing is done before any
page fault handler returns.

So it looks to me all this extra complexity comes from the fact that
bounce_page and orig_phys are set in different places, so we need to
do the bouncing in two places.

I wonder how much we can gain from the "lazy" bouncing in the page
fault handler. The buffer mapped via dma_ops from the virtio driver is
expected to be accessed by userspace soon. It looks to me we could do
all of this during dma_map(), and then things would be greatly
simplified.

If so, we need to allocate lots of pages from the pool reserved for
atomic memory allocation requests.

This should be fine; a lot of drivers try to allocate pages in atomic
context. The point is to simplify the code to make it easy to
determine correctness, so we can add optimizations on top simply by
benchmarking the difference.
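
As a rough sketch of the eager approach (the types and helpers
bounce_domain, bounce_map, alloc_iova_range(), iova_to_map() are
placeholders, not the VDUSE code): the bounce page is allocated with
GFP_ATOMIC and filled directly in the map path, so no page fault
handler and no cross-path ordering is needed.

static dma_addr_t map_page_eager(struct bounce_domain *domain,
                                 struct page *page, unsigned long offset,
                                 size_t size)
{
        dma_addr_t iova = alloc_iova_range(domain, size);   /* placeholder */
        struct bounce_map *map = iova_to_map(domain, iova); /* placeholder */

        map->orig_phys = page_to_phys(page) + offset;

        /* Eager bouncing in atomic context instead of lazily at fault time. */
        map->bounce_page = alloc_page(GFP_ATOMIC);
        if (!map->bounce_page)
                return DMA_MAPPING_ERROR;

        memcpy(page_address(map->bounce_page),
               page_address(page) + offset, size);
        return iova;
}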

OK. I will do it this way in the next version.

E.g. we have several places that access orig_phys:

1) map_page(), write
2) unmap_page(), write
3) page fault handler, read

It's not clear to me how they are synchronized. Or if it is
synchronized implicitly (via the iova allocator?), we'd better
document it.

Yes.

Or simply use a spinlock (which is the preferable way I'd like to go).
We probably don't need to worry too much about the cost of the
spinlock since the iova allocator uses it heavily.

Actually, the iova allocator implements a per-CPU cache to optimize it.

Thanks,
Yongji


Right, but from a quick glance, I guess what you meant is that usually
there's no lock contention except during cpu hot-plug. This can work,
but the problem is that such synchronization depends on the internal
implementation of the IOVA allocator, which is kind of fragile. I
still think we should do it on our own.


I might be missing something. It looks like we don't need any
synchronization if the page fault handler is removed as you suggested.
We should not access the same orig_phys concurrently (in map_page()
and unmap_page()) unless we free the iova before accessing it.

Thanks,
Yongji



You're right. I overestimated the complexity required by the
synchronization.


Thanks







Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver

2021-03-26 Thread Jason Wang


On 2021/3/26 1:14 PM, Yongji Xie wrote:

+ }
+ map->bounce_page = page;
+
+ /* paired with vduse_domain_map_page() */
+ smp_mb();

So this is suspicious. It's better to explain it as: we need to make
sure A is done after B.

OK. I see. It's used to protect this pattern:

  vduse_domain_alloc_bounce_page:    vduse_domain_map_page:
  write map->bounce_page             write map->orig_phys
  mb()                               mb()
  read map->orig_phys                read map->bounce_page

This makes sure there will always be a path that does the bouncing.

Ok.



And it looks to me the iotlb_lock is sufficient for the synchronization
here. E.g. is there any reason you don't take it in
vduse_domain_map_bounce_page()?


Yes, we can. But the performance in multi-queue cases will go down if
we use iotlb_lock on this critical path.


And what's more, is there any way to avoid holding the spinlock during
the bouncing?


Looks like we can't. In the case that multiple page faults happen on
the same page, we should make sure the bouncing is done before any
page fault handler returns.

So it looks to me all this extra complexity comes from the fact that
bounce_page and orig_phys are set in different places, so we need to
do the bouncing in two places.

I wonder how much we can gain from the "lazy" bouncing in the page
fault handler. The buffer mapped via dma_ops from the virtio driver is
expected to be accessed by userspace soon. It looks to me we could do
all of this during dma_map(), and then things would be greatly
simplified.


If so, we need to allocate lots of pages from the pool reserved for
atomic memory allocation requests.

This should be fine; a lot of drivers try to allocate pages in atomic
context. The point is to simplify the code to make it easy to
determine correctness, so we can add optimizations on top simply by
benchmarking the difference.


OK. I will do it this way in the next version.


E.g. we have several places that access orig_phys:

1) map_page(), write
2) unmap_page(), write
3) page fault handler, read

It's not clear to me how they are synchronized. Or if it is
synchronized implicitly (via the iova allocator?), we'd better
document it.

Yes.


Or simply use a spinlock (which is the preferable way I'd like to go).
We probably don't need to worry too much about the cost of the
spinlock since the iova allocator uses it heavily.


Actually, the iova allocator implements a per-CPU cache to optimize it.

Thanks,
Yongji



Right, but from a quick glance, I guess what you meant is that usually
there's no lock contention except during cpu hot-plug. This can work,
but the problem is that such synchronization depends on the internal
implementation of the IOVA allocator, which is kind of fragile. I
still think we should do it on our own.


Thanks

