Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 02:25:26PM +0800, Jason Wang wrote:

On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:


On Mon, 22 Nov 2021 06:35:18 +0100
Halil Pasic  wrote:

> > I think it should be a common issue, looking at
> > vhost_vsock_handle_tx_kick(), it did:
> >
> > len += sizeof(pkt->hdr);
> > vhost_add_used(vq, head, len);
> >
> > which looks like a violation of the spec since it's TX.
>
> I'm not sure the lines above look like a violation of the spec. If you
> examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> len == pkt->len == pkt->hdr.len
> which makes sense since according to the spec both tx and rx messages
> are hdr+payload. And I believe hdr.len is the size of the payload,
> although that does not seem to be properly documented by the spec.


Sorry for being unclear: what I meant is that we should probably use
zero here, since TX doesn't actually use in (device-writable) buffers.

According to the spec, 0 should be the used length:

"and len the total of bytes written into the buffer."


>
> On the other hand tx messages are stated to be device read-only (in the
> spec) so if the device writes stuff, that is certainly wrong.
>


Yes.


> If that is what happens.
>
> Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> happens. My hypothesis is that we just assume the last descriptor is an 'in'
> type descriptor (i.e. a device writable one). For tx that assumption
> would be wrong.
>
> I will have another look at this today and send a fix patch if my
> suspicion is confirmed.

If my suspicion is right something like:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 00f64f2f8b72..efb57898920b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
struct vring_virtqueue *vq = to_vvq(_vq);
void *ret;
unsigned int i;
+   bool has_in;
u16 last_used;

START_USE(vq);
@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
vq->split.vring.used->ring[last_used].id);
*len = virtio32_to_cpu(_vq->vdev,
vq->split.vring.used->ring[last_used].len);
+   has_in = virtio16_to_cpu(_vq->vdev,
+   vq->split.vring.used->ring[last_used].flags)
+   & VRING_DESC_F_WRITE;


Did you mean vring.desc actually? If yes, it's better not depend on
the descriptor ring which can be modified by the device. We've stored
the flags in desc_extra[].



if (unlikely(i >= vq->split.vring.num)) {
BAD_RING(vq, "id %u out of range\n", i);
@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
BAD_RING(vq, "id %u is not a head!\n", i);
return NULL;
}
-   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
+   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
BAD_RING(vq, "used len %d is larger than in buflen %u\n",
*len, vq->buflen[i]);
return NULL;

would fix the problem for split. I will try that out and let you know
later.


I'm not sure I get this; in virtqueue_add_split(), buflen[i] only
contains the in buffer length.

I think the fixes are:

1) fixing the vhost vsock


Yep, in vhost_vsock_handle_tx_kick() we should have vhost_add_used(vq, 
head, 0) since the device doesn't write anything.



2) use suppress_used_validation=true to let the vsock driver validate
the in buffer length
3) probably a new feature so the driver can only enable the validation
when the feature is enabled.


I fully agree with these steps.

Thanks,
Stefano

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 11:51:09AM +0800, Jason Wang wrote:

On Fri, Nov 19, 2021 at 11:10 PM Halil Pasic  wrote:


On Wed, 27 Oct 2021 10:21:04 +0800
Jason Wang  wrote:

> This patch validates the used buffer length provided by the device
> before trying to use it. This is done by recording the in buffer length
> in a new field in the desc_state structure during virtqueue_add(), then we
> can fail the virtqueue_get_buf() when we find the device is trying to
> give us a used buffer length which is greater than the in buffer
> length.
>
> Since some drivers have already done the validation by themselves,
> this patch tries to make the core validation optional. A driver
> that doesn't want the validation can set
> suppress_used_validation to true (which can be overridden by the
> force_used_validation module parameter). To be more efficient, a
> dedicated array is used for storing the validated used length; this
> helps to eliminate the cache stress if validation is done by the
> driver.
>
> Signed-off-by: Jason Wang 

Hi Jason!

Our CI has detected, that virtio-vsock became unusable with this
patch on s390x. I didn't test on x86 yet. The guest kernel says
something like:
vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0

Did you, or anybody else, see something like this on platforms other than
s390x?


Adding Stefan and Stefano.

I think it should be a common issue, looking at


Yep, I confirm the same behaviour on x86_64. On Friday evening I had the
same failure while testing the latest QEMU and Linux kernel.



vhost_vsock_handle_tx_kick(), it did:

len += sizeof(pkt->hdr);
vhost_add_used(vq, head, len);

which looks like a violation of the spec since it's TX.



I had a quick look at this code, and I speculate that it probably
uncovers a pre-existing bug, rather than introducing a new one.


I agree.



If somebody is already working on this please reach out to me.




My plan was to debug and test it today, so let me know if you need some 
help.



AFAIK, no. I think the plan is to fix both the device and driver side
(but I'm not sure we need a new feature for this if we stick to the
validation).



Yes, maybe we need a new feature, since I believe there has been this 
problem since the beginning.


Thanks,
Stefano



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Jason Wang
On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:
>
> On Mon, 22 Nov 2021 06:35:18 +0100
> Halil Pasic  wrote:
>
> > > I think it should be a common issue, looking at
> > > vhost_vsock_handle_tx_kick(), it did:
> > >
> > > len += sizeof(pkt->hdr);
> > > vhost_add_used(vq, head, len);
> > >
> > > which looks like a violation of the spec since it's TX.
> >
> > I'm not sure the lines above look like a violation of the spec. If you
> > examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> > len == pkt->len == pkt->hdr.len
> > which makes sense since according to the spec both tx and rx messages
> > are hdr+payload. And I believe hdr.len is the size of the payload,
> > although that does not seem to be properly documented by the spec.

Sorry for being unclear: what I meant is that we should probably use
zero here, since TX doesn't actually use in (device-writable) buffers.

According to the spec, 0 should be the used length:

"and len the total of bytes written into the buffer."

> >
> > On the other hand tx messages are stated to be device read-only (in the
> > spec) so if the device writes stuff, that is certainly wrong.
> >

Yes.

> > If that is what happens.
> >
> > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> > happens. My hypothesis is that we just assume the last descriptor is an 'in'
> > type descriptor (i.e. a device writable one). For tx that assumption
> > would be wrong.
> >
> > I will have another look at this today and send a fix patch if my
> > suspicion is confirmed.
>
> If my suspicion is right something like:
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 00f64f2f8b72..efb57898920b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
> *_vq,
> struct vring_virtqueue *vq = to_vvq(_vq);
> void *ret;
> unsigned int i;
> +   bool has_in;
> u16 last_used;
>
> START_USE(vq);
> @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
> *_vq,
> vq->split.vring.used->ring[last_used].id);
> *len = virtio32_to_cpu(_vq->vdev,
> vq->split.vring.used->ring[last_used].len);
> +   has_in = virtio16_to_cpu(_vq->vdev,
> +   vq->split.vring.used->ring[last_used].flags)
> +   & VRING_DESC_F_WRITE;

Did you mean vring.desc actually? If yes, it's better not depend on
the descriptor ring which can be modified by the device. We've stored
the flags in desc_extra[].

>
> if (unlikely(i >= vq->split.vring.num)) {
> BAD_RING(vq, "id %u out of range\n", i);
> @@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
> *_vq,
> BAD_RING(vq, "id %u is not a head!\n", i);
> return NULL;
> }
> -   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
> +   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
> BAD_RING(vq, "used len %d is larger than in buflen %u\n",
> *len, vq->buflen[i]);
> return NULL;
>
> would fix the problem for split. I will try that out and let you know
> later.

I'm not sure I get this; in virtqueue_add_split(), buflen[i] only
contains the in buffer length.

I think the fixes are:

1) fixing the vhost vsock
2) use suppress_used_validation=true to let the vsock driver validate
the in buffer length
3) probably a new feature so the driver can only enable the validation
when the feature is enabled.

Thanks

>
> Regards,
> Halil
>



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Halil Pasic
On Mon, 22 Nov 2021 06:35:18 +0100
Halil Pasic  wrote:

> > I think it should be a common issue, looking at
> > vhost_vsock_handle_tx_kick(), it did:
> > 
> > len += sizeof(pkt->hdr);
> > vhost_add_used(vq, head, len);
> > 
> > which looks like a violation of the spec since it's TX.  
> 
> I'm not sure the lines above look like a violation of the spec. If you
> examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> len == pkt->len == pkt->hdr.len
> which makes sense since according to the spec both tx and rx messages
> are hdr+payload. And I believe hdr.len is the size of the payload,
> although that does not seem to be properly documented by the spec.
> 
> On the other hand tx messages are stated to be device read-only (in the
> spec) so if the device writes stuff, that is certainly wrong.
> 
> If that is what happens. 
> 
> Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> happens. My hypothesis is that we just assume the last descriptor is an 'in'
> type descriptor (i.e. a device writable one). For tx that assumption
> would be wrong.
> 
> I will have another look at this today and send a fix patch if my
> suspicion is confirmed.

If my suspicion is right something like:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 00f64f2f8b72..efb57898920b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
struct vring_virtqueue *vq = to_vvq(_vq);
void *ret;
unsigned int i;
+   bool has_in;
u16 last_used;
 
START_USE(vq);
@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
vq->split.vring.used->ring[last_used].id);
*len = virtio32_to_cpu(_vq->vdev,
vq->split.vring.used->ring[last_used].len);
+   has_in = virtio16_to_cpu(_vq->vdev,
+   vq->split.vring.used->ring[last_used].flags)
+   & VRING_DESC_F_WRITE;
 
if (unlikely(i >= vq->split.vring.num)) {
BAD_RING(vq, "id %u out of range\n", i);
@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
BAD_RING(vq, "id %u is not a head!\n", i);
return NULL;
}
-   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
+   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
BAD_RING(vq, "used len %d is larger than in buflen %u\n",
*len, vq->buflen[i]);
return NULL;

would fix the problem for split. I will try that out and let you know
later.

Regards,
Halil


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Halil Pasic
On Mon, 22 Nov 2021 11:51:09 +0800
Jason Wang  wrote:

> On Fri, Nov 19, 2021 at 11:10 PM Halil Pasic  wrote:
> >
> > On Wed, 27 Oct 2021 10:21:04 +0800
> > Jason Wang  wrote:
> >  
> > > This patch validates the used buffer length provided by the device
> > > before trying to use it. This is done by recording the in buffer length
> > > in a new field in the desc_state structure during virtqueue_add(), then we
> > > can fail the virtqueue_get_buf() when we find the device is trying to
> > > give us a used buffer length which is greater than the in buffer
> > > length.
> > >
> > > Since some drivers have already done the validation by themselves,
> > > this patch tries to make the core validation optional. A driver
> > > that doesn't want the validation can set
> > > suppress_used_validation to true (which can be overridden by the
> > > force_used_validation module parameter). To be more efficient, a
> > > dedicated array is used for storing the validated used length; this
> > > helps to eliminate the cache stress if validation is done by the
> > > driver.
> > >
> > > Signed-off-by: Jason Wang   
> >
> > Hi Jason!
> >
> > Our CI has detected, that virtio-vsock became unusable with this
> > patch on s390x. I didn't test on x86 yet. The guest kernel says
> > something like:
> > vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in 
> > buflen 0
> >
> > Did you, or anybody else, see something like this on platforms other than
> > s390x?
> 
> Adding Stefan and Stefano.
> 
> I think it should be a common issue, looking at
> vhost_vsock_handle_tx_kick(), it did:
> 
> len += sizeof(pkt->hdr);
> vhost_add_used(vq, head, len);
> 
> which looks like a violation of the spec since it's TX.

I'm not sure the lines above look like a violation of the spec. If you
examine vhost_vsock_alloc_pkt() I believe that you will agree that:
len == pkt->len == pkt->hdr.len
which makes sense since according to the spec both tx and rx messages
are hdr+payload. And I believe hdr.len is the size of the payload,
although that does not seem to be properly documented by the spec.

On the other hand tx messages are stated to be device read-only (in the
spec) so if the device writes stuff, that is certainly wrong.

If that is what happens. 

Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
happens. My hypothesis is that we just assume the last descriptor is an 'in'
type descriptor (i.e. a device writable one). For tx that assumption
would be wrong.

I will have another look at this today and send a fix patch if my
suspicion is confirmed.


> 
> >
> > I had a quick look at this code, and I speculate that it probably
> > uncovers a pre-existing bug, rather than introducing a new one.
> 
> I agree.
> 

:) I'm not so sure any more myself.

> >
> > If somebody is already working on this please reach out to me.  
> 
> AFAIK, no. 

Thanks for the info! Then I will dig a little deeper. I asked in order
to avoid doing the debugging and fixing just to see that somebody was
faster :D

> I think the plan is to fix both the device and driver side
> (but I'm not sure we need a new feature for this if we stick to the
> validation).
> 
> Thanks
> 

Thank you!

Regards,
Halil


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-21 Thread Jason Wang
On Fri, Nov 19, 2021 at 11:10 PM Halil Pasic  wrote:
>
> On Wed, 27 Oct 2021 10:21:04 +0800
> Jason Wang  wrote:
>
> > This patch validates the used buffer length provided by the device
> > before trying to use it. This is done by recording the in buffer length
> > in a new field in the desc_state structure during virtqueue_add(), then we
> > can fail the virtqueue_get_buf() when we find the device is trying to
> > give us a used buffer length which is greater than the in buffer
> > length.
> >
> > Since some drivers have already done the validation by themselves,
> > this patch tries to make the core validation optional. A driver
> > that doesn't want the validation can set
> > suppress_used_validation to true (which can be overridden by the
> > force_used_validation module parameter). To be more efficient, a
> > dedicated array is used for storing the validated used length; this
> > helps to eliminate the cache stress if validation is done by the
> > driver.
> >
> > Signed-off-by: Jason Wang 
>
> Hi Jason!
>
> Our CI has detected, that virtio-vsock became unusable with this
> patch on s390x. I didn't test on x86 yet. The guest kernel says
> something like:
> vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0
>
> Did you, or anybody else, see something like this on platforms other than
> s390x?

Adding Stefan and Stefano.

I think it should be a common issue, looking at
vhost_vsock_handle_tx_kick(), it did:

len += sizeof(pkt->hdr);
vhost_add_used(vq, head, len);

which looks like a violation of the spec since it's TX.

>
> I had a quick look at this code, and I speculate that it probably
> uncovers a pre-existing bug, rather than introducing a new one.

I agree.

>
> If somebody is already working on this please reach out to me.

AFAIK, no. I think the plan is to fix both the device and driver side
(but I'm not sure we need a new feature for this if we stick to the
validation).

Thanks

>
> Regards,
> Halil
>



Re: [PATCH 6/6] vhost_net: use RCU callbacks instead of synchronize_rcu()

2021-11-21 Thread Jason Wang
On Fri, Nov 19, 2021 at 7:31 PM Andrey Ryabinin  wrote:
>
>
>
> On 11/16/21 8:00 AM, Jason Wang wrote:
> > On Mon, Nov 15, 2021 at 11:32 PM Andrey Ryabinin  
> > wrote:
> >>
> >> Currently vhost_net_release() uses synchronize_rcu() to synchronize
> >> freeing with vhost_zerocopy_callback(). However synchronize_rcu()
> >> is quite a costly operation. It takes more than 10 seconds
> >> to shut down qemu launched with a couple of net devices like this:
> >> -netdev tap,id=tap0,..,vhost=on,queues=80
> >> because we end up calling synchronize_rcu() netdev_count*queues times.
> >>
> >> Free vhost net structures in rcu callback instead of using
> >> synchronize_rcu() to fix the problem.
> >
> > I admit the release code is somehow hard to understand. But I wonder
> > if the following case can still happen with this:
> >
> > CPU 0 (vhost_dev_cleanup)   CPU1
> > (vhost_net_zerocopy_callback()->vhost_work_queue())
> > if (!dev->worker)
> > dev->worker = NULL
> >
> > wake_up_process(dev->worker)
> >
> > If this is true. It seems the fix is to move RCU synchronization stuff
> > in vhost_net_ubuf_put_and_wait()?
> >
>
> It all depends on whether vhost_zerocopy_callback() can be called outside of
> vhost thread context or not.

I think the answer is yes, the callback will be mainly used in the
zerocopy path when the underlying NIC finishes the DMA of a packet.

> If it can run after the vhost thread has stopped, then the race you
> describe seems possible and the fix in commit b0c057ca7e83 ("vhost: fix a
> theoretical race in device cleanup") wasn't complete. I would fix it by
> calling synchronize_rcu() after vhost_net_flush() and before
> vhost_dev_cleanup().
>
> As for the performance problem, it can be solved by replacing 
> synchronize_rcu() with synchronize_rcu_expedited().

Yes, that's another way, but see below.

>
> But now I'm not sure that this race actually exists and that
> synchronize_rcu() is needed at all.
> I did a bit of testing and I only see callback being called from vhost thread:
>
> vhost-3724  3733 [002]  2701.768731: probe:vhost_zerocopy_callback: 
> (81af8c10)
> 81af8c11 vhost_zerocopy_callback+0x1 ([kernel.kallsyms])
> 81bb34f6 skb_copy_ubufs+0x256 ([kernel.kallsyms])
> 81bce621 __netif_receive_skb_core.constprop.0+0xac1 
> ([kernel.kallsyms])
> 81bd062d __netif_receive_skb_one_core+0x3d ([kernel.kallsyms])
> 81bd0748 netif_receive_skb+0x38 ([kernel.kallsyms])
> 819a2a1e tun_get_user+0xdce ([kernel.kallsyms])
> 819a2cf4 tun_sendmsg+0xa4 ([kernel.kallsyms])
> 81af9229 handle_tx_zerocopy+0x149 ([kernel.kallsyms])
> 81afaf05 handle_tx+0xc5 ([kernel.kallsyms])
> 81afce86 vhost_worker+0x76 ([kernel.kallsyms])
> 811581e9 kthread+0x169 ([kernel.kallsyms])
> 810018cf ret_from_fork+0x1f ([kernel.kallsyms])
>0 [unknown] ([unknown])
>

From the call trace, packets are being sent between two TAP devices. Since
the TX of TAP is synchronous, we can't see the callback being called outside
of the vhost thread.

In order to test it, we need to 1) enable zerocopy
(experimental_zcopytx=1) and 2) send packets to a real NIC
via a bridge or macvlan.

Zerocopy was disabled due to a lot of issues (098eadce3c62 "vhost_net:
disable zerocopy by default"). So if we fix this by moving it to
vhost_net_ubuf_put_and_wait(), there won't be a synchronize_rcu() in
the non-zerocopy path, which seems to be sufficient. And we can use
synchronize_rcu_expedited() on top if it is really needed.

Thanks

> This means that the callback can't run after kthread_stop() in
> vhost_dev_cleanup() and no synchronize_rcu() is needed.
>
> I'm not confident that my quite limited testing covers all possible
> vhost_zerocopy_callback() call stacks.
>



Re: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-21 Thread Jason Wang
On Fri, Nov 19, 2021 at 11:09 AM Parav Pandit  wrote:
>
>
>
> > From: Jason Wang 
> > Sent: Friday, November 19, 2021 8:12 AM
> >
> > On Thu, Nov 18, 2021 at 1:58 PM Eli Cohen  wrote:
> > >
> > > Add support for querying virtqueue statistics. Supported statistics are:
> > >
> > > received_desc - number of descriptors received for the virtqueue
> > > completed_desc - number of descriptors completed for the virtqueue
> > >
> > > A new callback was added to vdpa_config_ops which provides the means
> > > for the vdpa driver to return statistics results.
> > >
> > > The interface allows for reading all the supported virtqueues,
> > > including the control virtqueue if it exists, by returning the next
> > > queue index to query.
> > >
> > > Examples:
> > > 1. Read statistics for the virtqueue at index 1: $ vdpa dev stats show
> > > vdpa-a index 1
> > > vdpa-a:
> > > index 1 tx received_desc 21 completed_desc 21
> > >
> > > 2. Read statistics for all the virtqueues: vdpa dev stats show vdpa-a
> > > vdpa-a:
> > > index 0 rx received_desc 256 completed_desc 12 index 1 tx
> > > received_desc 21 completed_desc 21 index 2 ctrl received_desc 0
> > > completed_desc 0
> >
> > Adding Adrian and Laurent.
> >
> > It's quite useful, but I think these are vendor-specific statistics.
> These are specific to Linux vdpa devices,
> and are very generic VQ statistics for all device types.

The question is what happens if the parent doesn't implement those statistics.

>
> > I wonder if it's better
> > to use "vendor" prefix in the protocol, then we use this instead:
> >
> > vdpa dev vendor-stats show vdpa-a
> >
> Maybe. Let's evaluate whether the stats of this patch are generic enough or not.
>
> > Or if we want to make it standard, is exposing virtio indices better?
> >
> > qid 0 last_avail_idx X avail_idx Y last_used_idx M used_idx N
> >
> I did consider this option a while back. Showing indices is equally useful.
> I think we should show that as vq info, along with other VQ attributes
> (addr, len).

That may work, but it looks to me that received_desc/completed_desc are
also per vq.

Another question: is it more useful to count buffers instead of
descriptors? E.g. how are indirect descriptors counted?

> $ vdpa dev show vq
>
> But indices are less statistics and more the current state of the
> queue. For example, rollover of the indices won't give the absolute
> number of descriptors processed for the queue. And even if we make
> them u64 (not good), a non-developer end user needs to keep using a
> side calculator to compute the delta.

How about exposing those raw indices via the protocol and letting the
vdpa tool do the calculation for us?

Thanks

>
> So I think useful q indices belong to q info.
>



Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

2021-11-21 Thread Jens Axboe
On 11/21/21 10:49 AM, Mike Christie wrote:
> Convert io_uring and io-wq to use kernel_worker.

I don't like the kernel_worker name, that implies it's always giving you
a kernel thread or kthread. That's not the io_uring use case, it's
really just a thread off the original task that just happens to never
exit to userspace.

Can we do a better name? At least io_thread doesn't imply that.

Also I do think this deserves a bit more commit message, there's really
nothing in here.

-- 
Jens Axboe



[PATCH V5 09/10] vhost: move worker thread fields to new struct

2021-11-21 Thread Mike Christie
This is just a prep patch. It moves the worker related fields to a new
vhost_worker struct and moves the code around to create some helpers that
will be used in the next patch.

Signed-off-by: Mike Christie 
Reviewed-by: Stefan Hajnoczi 
Acked-by: Michael S. Tsirkin 
Reviewed-by: Christoph Hellwig 
---
 drivers/vhost/vhost.c | 98 ---
 drivers/vhost/vhost.h | 11 +++--
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 59edb5a1ffe2..c9a1f706989c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -263,8 +263,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(>node, >work_list);
-   wake_up_process(dev->worker);
+   llist_add(>node, >worker->work_list);
+   wake_up_process(dev->worker->task);
}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -272,7 +272,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return !llist_empty(>work_list);
+   return dev->worker && !llist_empty(>worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -343,7 +343,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-   struct vhost_dev *dev = data;
+   struct vhost_worker *worker = data;
+   struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
 
@@ -358,7 +359,7 @@ static int vhost_worker(void *data)
break;
}
 
-   node = llist_del_all(>work_list);
+   node = llist_del_all(>work_list);
if (!node)
schedule();
 
@@ -368,7 +369,7 @@ static int vhost_worker(void *data)
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, >flags);
__set_current_state(TASK_RUNNING);
-   kcov_remote_start_common(dev->kcov_handle);
+   kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
if (need_resched())
@@ -487,7 +488,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->byte_weight = byte_weight;
dev->use_worker = use_worker;
dev->msg_handler = msg_handler;
-   init_llist_head(>work_list);
init_waitqueue_head(>wait);
INIT_LIST_HEAD(>read_list);
INIT_LIST_HEAD(>pending_list);
@@ -579,10 +579,60 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
 }
 
+static void vhost_worker_free(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker = dev->worker;
+
+   if (!worker)
+   return;
+
+   dev->worker = NULL;
+   WARN_ON(!llist_empty(>work_list));
+   kthread_stop(worker->task);
+   kfree(worker);
+}
+
+static int vhost_worker_create(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker;
+   struct task_struct *task;
+   int ret;
+
+   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+   if (!worker)
+   return -ENOMEM;
+
+   dev->worker = worker;
+   worker->dev = dev;
+   worker->kcov_handle = kcov_common_handle();
+   init_llist_head(>work_list);
+
+   task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+   if (IS_ERR(task)) {
+   ret = PTR_ERR(task);
+   goto free_worker;
+   }
+
+   worker->task = task;
+   wake_up_process(task); /* avoid contributing to loadavg */
+
+   ret = vhost_attach_cgroups(dev);
+   if (ret)
+   goto stop_worker;
+
+   return 0;
+
+stop_worker:
+   kthread_stop(worker->task);
+free_worker:
+   kfree(worker);
+   dev->worker = NULL;
+   return ret;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-   struct task_struct *worker;
int err;
 
/* Is there an owner already? */
@@ -593,36 +643,21 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
vhost_attach_mm(dev);
 
-   dev->kcov_handle = kcov_common_handle();
if (dev->use_worker) {
-   worker = kthread_create(vhost_worker, dev,
-   "vhost-%d", current->pid);
-   if (IS_ERR(worker)) {
-   err = PTR_ERR(worker);
-   goto err_worker;
-   }
-
-   dev->worker = worker;
-   wake_up_process(worker); /* avoid contributing to loadavg */
-
-   err = vhost_attach_cgroups(dev);

[PATCH V5 07/10] io_uring: switch to kernel_worker

2021-11-21 Thread Mike Christie
Convert io_uring and io-wq to use kernel_worker.

Signed-off-by: Mike Christie 
Reviewed-by: Christian Brauner 
Reviewed-by: Jens Axboe 
Reviewed-by: Christoph Hellwig 
---
 fs/io-wq.c| 15 ---
 fs/io_uring.c | 11 +--
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 88202de519f6..b39cf045d6ce 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -70,6 +70,9 @@ struct io_worker {
 
 #define IO_WQ_NR_HASH_BUCKETS  (1u << IO_WQ_HASH_ORDER)
 
+#define IO_WQ_CLONE_FLAGS  (CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
+CLONE_THREAD | CLONE_IO)
+
 struct io_wqe_acct {
unsigned nr_workers;
unsigned max_workers;
@@ -600,13 +603,9 @@ static int io_wqe_worker(void *data)
struct io_wqe *wqe = worker->wqe;
struct io_wq *wq = wqe->wq;
bool last_timeout = false;
-   char buf[TASK_COMM_LEN];
 
worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
 
-   snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
-   set_task_comm(current, buf);
-
audit_alloc_kernel(current);
 
while (!test_bit(IO_WQ_BIT_EXIT, >state)) {
@@ -704,7 +703,7 @@ static void io_init_new_worker(struct io_wqe *wqe, struct 
io_worker *worker,
list_add_tail_rcu(>all_list, >all_list);
worker->flags |= IO_WORKER_F_FREE;
raw_spin_unlock(>lock);
-   wake_up_new_task(tsk);
+   kernel_worker_start(tsk, "iou-wrk-%d", wqe->wq->task->pid);
 }
 
 static bool io_wq_work_match_all(struct io_wq_work *work, void *data)
@@ -734,7 +733,8 @@ static void create_worker_cont(struct callback_head *cb)
worker = container_of(cb, struct io_worker, create_work);
clear_bit_unlock(0, >create_state);
wqe = worker->wqe;
-   tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+   tsk = kernel_worker(io_wqe_worker, worker, wqe->node,
+   IO_WQ_CLONE_FLAGS, KERN_WORKER_IO);
if (!IS_ERR(tsk)) {
io_init_new_worker(wqe, worker, tsk);
io_worker_release(worker);
@@ -801,7 +801,8 @@ static bool create_io_worker(struct io_wq *wq, struct 
io_wqe *wqe, int index)
if (index == IO_WQ_ACCT_BOUND)
worker->flags |= IO_WORKER_F_BOUND;
 
-   tsk = create_io_thread(io_wqe_worker, worker, wqe->node);
+   tsk = kernel_worker(io_wqe_worker, worker, wqe->node, IO_WQ_CLONE_FLAGS,
+   KERN_WORKER_IO);
if (!IS_ERR(tsk)) {
io_init_new_worker(wqe, worker, tsk);
} else if (!io_should_retry_thread(PTR_ERR(tsk))) {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..d01eef85cc03 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7431,12 +7431,8 @@ static int io_sq_thread(void *data)
struct io_sq_data *sqd = data;
struct io_ring_ctx *ctx;
unsigned long timeout = 0;
-   char buf[TASK_COMM_LEN];
DEFINE_WAIT(wait);
 
-   snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
-   set_task_comm(current, buf);
-
if (sqd->sq_cpu != -1)
set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
else
@@ -8665,6 +8661,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
fdput(f);
}
if (ctx->flags & IORING_SETUP_SQPOLL) {
+   unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+   CLONE_THREAD | CLONE_IO;
struct task_struct *tsk;
struct io_sq_data *sqd;
bool attached;
@@ -8710,7 +8708,8 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
sqd->task_pid = current->pid;
sqd->task_tgid = current->tgid;
-   tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);
+   tsk = kernel_worker(io_sq_thread, sqd, NUMA_NO_NODE,
+   flags, KERN_WORKER_IO);
if (IS_ERR(tsk)) {
ret = PTR_ERR(tsk);
goto err_sqpoll;
@@ -8718,7 +8717,7 @@ static __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
 
sqd->thread = tsk;
ret = io_uring_alloc_task_context(tsk, ctx);
-   wake_up_new_task(tsk);
+   kernel_worker_start(tsk, "iou-sqp-%d", sqd->task_pid);
if (ret)
goto err;
} else if (p->flags & IORING_SETUP_SQ_AFF) {
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH V5 10/10] vhost: use kernel_worker to check RLIMITs

2021-11-21 Thread Mike Christie
For vhost workers we use the kthread API, which inherits its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked. This patch has us use the kernel_worker function, which
will inherit its values/checks from the thread that owns the device.

Signed-off-by: Mike Christie 
Acked-by: Michael S. Tsirkin 
Reviewed-by: Christoph Hellwig 

---
 drivers/vhost/vhost.c | 65 +++
 drivers/vhost/vhost.h |  7 -
 2 files changed, 28 insertions(+), 44 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c9a1f706989c..9aa04fcdf210 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include <linux/kthread.h>
 #include 
 #include 
 #include 
@@ -344,17 +343,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
struct vhost_worker *worker = data;
-   struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
 
-   kthread_use_mm(dev->mm);
-
for (;;) {
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
 
-   if (kthread_should_stop()) {
+   if (test_bit(VHOST_WORKER_FLAG_STOP, &worker->flags)) {
__set_current_state(TASK_RUNNING);
break;
}
@@ -376,8 +372,9 @@ static int vhost_worker(void *data)
schedule();
}
}
-   kthread_unuse_mm(dev->mm);
-   return 0;
+
+   complete(worker->exit_done);
+   do_exit(0);
 }
 
 static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
@@ -517,31 +514,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-   struct vhost_work work;
-   struct task_struct *owner;
-   int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-   struct vhost_attach_cgroups_struct *s;
-
-   s = container_of(work, struct vhost_attach_cgroups_struct, work);
-   s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-   struct vhost_attach_cgroups_struct attach;
-
-   attach.owner = current;
-   vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-   vhost_work_queue(dev, &attach.work);
-   vhost_work_dev_flush(dev);
-   return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -579,6 +551,16 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
 }
 
+static void vhost_worker_stop(struct vhost_worker *worker)
+{
+   DECLARE_COMPLETION_ONSTACK(exit_done);
+
+   worker->exit_done = &exit_done;
+   set_bit(VHOST_WORKER_FLAG_STOP, &worker->flags);
+   wake_up_process(worker->task);
+   wait_for_completion(worker->exit_done);
+}
+
 static void vhost_worker_free(struct vhost_dev *dev)
 {
struct vhost_worker *worker = dev->worker;
@@ -588,7 +570,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
dev->worker = NULL;
WARN_ON(!llist_empty(&worker->work_list));
-   kthread_stop(worker->task);
+   vhost_worker_stop(worker);
kfree(worker);
 }
 
@@ -603,27 +585,24 @@ static int vhost_worker_create(struct vhost_dev *dev)
return -ENOMEM;
 
dev->worker = worker;
-   worker->dev = dev;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);
 
-   task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
+   /*
+* vhost used to use the kthread API which ignores all signals by
+* default and the drivers expect this behavior.
+*/
+   task = kernel_worker(vhost_worker, worker, NUMA_NO_NODE, CLONE_FS,
+KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN);
if (IS_ERR(task)) {
ret = PTR_ERR(task);
goto free_worker;
}
 
worker->task = task;
-   wake_up_process(task); /* avoid contributing to loadavg */
-
-   ret = vhost_attach_cgroups(dev);
-   if (ret)
-   goto stop_worker;
-
+   kernel_worker_start(task, "vhost-%d", current->pid);
return 0;
 
-stop_worker:
-   kthread_stop(worker->task);
 free_worker:
kfree(worker);
dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 102ce25e4e13..09748694cb66 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -25,11 +25,16 @@ struct vhost_work {
unsigned long   flags;
 };
 
+enum {
+   VHOST_WORKER_FLAG_STOP,
+};
+
 struct vhost_worker {
struct task_struct  *task;
+   struct completion   *exit_done;
struct llist_head   work_list;
-   struct vhost_dev 

[PATCH V5 06/10] fork: add helper to clone a process

2021-11-21 Thread Mike Christie
The vhost layer has similar requirements to io_uring, where its worker
threads need to access the userspace thread's memory, want to inherit the
parent's cgroups and namespaces, and be checked against the parent's
RLIMITs. Right now, the vhost layer uses the kthread API, which has
kthread_use_mm for memory access, and those threads can use
cgroup_attach_task_all for v1 cgroups, but there are no helpers for the
other items.

This adds a helper to clone a process so we can inherit everything we
want in one call. It's a more generic version of create_io_thread which
will be used by the vhost layer and io_uring in later patches in this set.

[added flag validation code from Christian Brauner's SIG_IGN patch]
Signed-off-by: Mike Christie 
Acked-by: Christian Brauner 
Reviewed-by: Christoph Hellwig 
---
 include/linux/sched/task.h |  4 +++
 kernel/fork.c  | 71 ++
 2 files changed, 75 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef3a8e7f7ed9..2188be3a3142 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -90,6 +90,10 @@ extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
 struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+ unsigned long clone_flags, u32 worker_flags);
+__printf(2, 3)
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6e6050d588ae..3729abafbdf9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2543,6 +2543,77 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
return copy_process(NULL, 0, node, &args);
 }
 
+static bool kernel_worker_flags_valid(struct kernel_clone_args *kargs)
+{
+   /* Verify that no unknown flags are passed along. */
+   if (kargs->worker_flags & ~(KERN_WORKER_IO | KERN_WORKER_USER |
+   KERN_WORKER_NO_FILES | KERN_WORKER_SIG_IGN))
+   return false;
+
+   /*
+* If we're ignoring all signals don't allow sharing struct sighand and
+* don't bother clearing signal handlers.
+*/
+   if ((kargs->flags & (CLONE_SIGHAND | CLONE_CLEAR_SIGHAND)) &&
+   (kargs->worker_flags & KERN_WORKER_SIG_IGN))
+   return false;
+
+   return true;
+}
+
+/**
+ * kernel_worker - create a copy of a process to be used by the kernel
+ * @fn: function for the worker thread to run
+ * @arg: data to be passed to fn
+ * @node: numa node to allocate task from
+ * @clone_flags: CLONE flags
+ * @worker_flags: KERN_WORKER flags
+ *
+ * This returns a created task, or an error pointer. The returned task is
+ * inactive, and the caller must fire it up through kernel_worker_start(). If
+ * this is a PF_IO_WORKER, all signals but KILL and STOP are blocked.
+ */
+struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
+ unsigned long clone_flags, u32 worker_flags)
+{
+   struct kernel_clone_args args = {
+   .flags  = ((lower_32_bits(clone_flags) | CLONE_VM |
+  CLONE_UNTRACED) & ~CSIGNAL),
+   .exit_signal= (lower_32_bits(clone_flags) & CSIGNAL),
+   .stack  = (unsigned long)fn,
+   .stack_size = (unsigned long)arg,
+   .worker_flags   = KERN_WORKER_USER | worker_flags,
+   };
+
+   if (!kernel_worker_flags_valid(&args))
+   return ERR_PTR(-EINVAL);
+
+   return copy_process(NULL, 0, node, &args);
+}
+EXPORT_SYMBOL_GPL(kernel_worker);
+
+/**
+ * kernel_worker_start - Start a task created with kernel_worker
+ * @tsk: task to wake up
+ * @namefmt: printf-style format string for the thread name
+ * @arg: arguments for @namefmt
+ */
+void kernel_worker_start(struct task_struct *tsk, const char namefmt[], ...)
+{
+   char name[TASK_COMM_LEN];
+   va_list args;
+
+   WARN_ON(!(tsk->flags & PF_USER_WORKER));
+
+   va_start(args, namefmt);
+   vsnprintf(name, sizeof(name), namefmt, args);
+   set_task_comm(tsk, name);
+   va_end(args);
+
+   wake_up_new_task(tsk);
+}
+EXPORT_SYMBOL_GPL(kernel_worker_start);
+
 /*
  *  Ok, this is the main fork-routine.
  *
-- 
2.25.1



[PATCH V5 00/10] Use copy_process/create_io_thread in vhost layer

2021-11-21 Thread Mike Christie
The following patches, made over Linus's tree, allow the vhost layer to do
a copy_process on the thread that does the VHOST_SET_OWNER ioctl, like how
io_uring does a copy_process against its userspace app. This allows the
vhost layer's worker threads to inherit cgroups, namespaces, address
space, etc., and these worker threads will also be accounted against the
owner/parent process's RLIMIT_NPROC limit.

If you are not familiar with qemu and vhost here is more detailed
problem description:

Qemu will create vhost devices in the kernel which perform network, SCSI,
etc. IO and management operations from worker threads created by the
kthread API. Because the kthread API does a copy_process on the kthreadd
thread, the vhost layer has to use kthread_use_mm to access the Qemu
thread's memory and cgroup_attach_task_all to add itself to the Qemu
thread's cgroups.

The problem with this approach is that we then have to add new functions/
args/functionality for everything we want to inherit. I started doing
that here:

https://lkml.org/lkml/2021/6/23/1233

for the RLIMIT_NPROC check, but it seems it might be easier to just
inherit everything from the beginning, because I'd need to do something
like that patch several times.

V5:
- Handle kbuild errors by building patchset against current kernel that
  has all deps merged. Also add patch to remove create_io_thread code as
  it's not used anymore.
- Rebase patchset against current kernel and handle a new vm PF_IO_WORKER
  case added in 5.16-rc1.
- Add PF_USER_WORKER flag so we can check it later after the initial
  thread creation for the wake up, vm and signal cases.
- Added patch to auto reap the worker thread.
V4:
- Drop NO_SIG patch and replaced with Christian's SIG_IGN patch.
- Merged Christian's kernel_worker_flags_valid helpers into patch 5 that
  added the new kernel worker functions.
- Fixed extra "i" issue.
- Added PF_USER_WORKER flag and added check that kernel_worker_start users
  had that flag set. Also dropped patches that passed worker flags to
  copy_thread and replaced with PF_USER_WORKER check.
V3:
- Add parentheses in p->flags and worker_flags check in copy_thread.
- Fix check in arm/arm64 which was doing the reverse of other archs
  where it did likely(!flags) instead of unlikely(flags).
V2:
- Rename kernel_copy_process to kernel_worker.
- Instead of exporting functions, make kernel_worker() a proper
  function/API that does common work for the caller.
- Instead of adding new fields to kernel_clone_args for each option
  make it flag based similar to CLONE_*.
- Drop unused completion struct in vhost.
- Fix compile warnings by merging vhost cgroup cleanup patch and
  vhost conversion patch.





[PATCH V5 01/10] fork: Make IO worker options flag based

2021-11-21 Thread Mike Christie
This patchset adds a couple of new options to kernel_clone_args for IO
thread-like/related users. Instead of adding new fields to
kernel_clone_args for each option, this moves us to a flags-based approach
by first converting io_thread.

Signed-off-by: Mike Christie 
Suggested-by: Christian Brauner 
Acked-by: Christian Brauner 
Reviewed-by: Christoph Hellwig 
---
 include/linux/sched/task.h | 4 +++-
 kernel/fork.c  | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ba88a6987400..c688e1d2e3e3 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -18,8 +18,11 @@ struct css_set;
 /* All the bits taken by the old clone syscall. */
#define CLONE_LEGACY_FLAGS 0xffffffffULL
 
+#define KERN_WORKER_IO BIT(0)
+
 struct kernel_clone_args {
u64 flags;
+   u32 worker_flags;
int __user *pidfd;
int __user *child_tid;
int __user *parent_tid;
@@ -31,7 +34,6 @@ struct kernel_clone_args {
/* Number of elements in *set_tid */
size_t set_tid_size;
int cgroup;
-   int io_thread;
struct cgroup *cgrp;
struct css_set *cset;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 3244cc56b697..58dcbf7dd28d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2023,7 +2023,7 @@ static __latent_entropy struct task_struct *copy_process(
p = dup_task_struct(current, node);
if (!p)
goto fork_out;
-   if (args->io_thread) {
+   if (args->worker_flags & KERN_WORKER_IO) {
/*
 * Mark us an IO worker, and block any signal that isn't
 * fatal or STOP
@@ -2524,7 +2524,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.exit_signal= (lower_32_bits(flags) & CSIGNAL),
.stack  = (unsigned long)fn,
.stack_size = (unsigned long)arg,
-   .io_thread  = 1,
+   .worker_flags   = KERN_WORKER_IO,
};
 
return copy_process(NULL, 0, node, &args);
-- 
2.25.1



[PATCH V5 08/10] fork: remove create_io_thread

2021-11-21 Thread Mike Christie
create_io_thread is not used anymore, so remove it.

Signed-off-by: Mike Christie 
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c  | 22 --
 2 files changed, 23 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 2188be3a3142..313fb8c825ae 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -89,7 +89,6 @@ extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
 extern pid_t kernel_clone(struct kernel_clone_args *kargs);
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node);
 struct task_struct *kernel_worker(int (*fn)(void *), void *arg, int node,
  unsigned long clone_flags, u32 worker_flags);
 __printf(2, 3)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3729abafbdf9..1b44d048ec16 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2521,28 +2521,6 @@ struct mm_struct *copy_init_mm(void)
return dup_mm(NULL, &init_mm);
 }
 
-/*
- * This is like kernel_clone(), but shaved down and tailored to just
- * creating io_uring workers. It returns a created task, or an error pointer.
- * The returned task is inactive, and the caller must fire it up through
- * wake_up_new_task(p). All signals are blocked in the created task.
- */
-struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
-{
-   unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
-   CLONE_IO;
-   struct kernel_clone_args args = {
-   .flags  = ((lower_32_bits(flags) | CLONE_VM |
-   CLONE_UNTRACED) & ~CSIGNAL),
-   .exit_signal= (lower_32_bits(flags) & CSIGNAL),
-   .stack  = (unsigned long)fn,
-   .stack_size = (unsigned long)arg,
-   .worker_flags   = KERN_WORKER_IO | KERN_WORKER_USER,
-   };
-
-   return copy_process(NULL, 0, node, &args);
-}
-
 static bool kernel_worker_flags_valid(struct kernel_clone_args *kargs)
 {
/* Verify that no unknown flags are passed along. */
-- 
2.25.1



[PATCH V5 04/10] fork: Add KERNEL_WORKER flag to ignore signals

2021-11-21 Thread Mike Christie
From: Christian Brauner 

Since this mirrors kthread's signal-ignore API introduced in

commit 10ab825bdef8 ("change kernel threads to ignore signals instead of
blocking them")

this patch adds an option flag, KERN_WORKER_SIG_IGN, handled in
copy_process() after copy_sighand() and copy_signals() to ease the
transition from kthreads to this new API.

Signed-off-by: Christian Brauner 
Signed-off-by: Mike Christie 
Reviewed-by: Christoph Hellwig 
---
 include/linux/sched/task.h | 1 +
 kernel/fork.c  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 47ede41b19c5..ef3a8e7f7ed9 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -21,6 +21,7 @@ struct css_set;
 #define KERN_WORKER_IO BIT(0)
 #define KERN_WORKER_USER   BIT(1)
 #define KERN_WORKER_NO_FILES   BIT(2)
+#define KERN_WORKER_SIG_IGNBIT(3)
 
 struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index a1ba423eec4d..6e6050d588ae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2211,6 +2211,9 @@ static __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
+   if (args->worker_flags & KERN_WORKER_SIG_IGN)
+   ignore_signals(p);
+
stackleak_task_init(p);
 
if (pid != &init_struct_pid) {
-- 
2.25.1



[PATCH V5 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag

2021-11-21 Thread Mike Christie
This adds a new flag, PF_USER_WORKER, that's used for behavior common to
both PF_IO_WORKER and users like vhost, which will use the new
kernel_worker helpers added later in this patchset.

The common behavior PF_USER_WORKER covers is the initial frame and FPU
setup and the vm reclaim handling.

Signed-off-by: Mike Christie 
---
 arch/alpha/kernel/process.c  | 2 +-
 arch/arc/kernel/process.c| 2 +-
 arch/arm/kernel/process.c| 2 +-
 arch/arm64/kernel/process.c  | 2 +-
 arch/csky/kernel/process.c   | 2 +-
 arch/h8300/kernel/process.c  | 2 +-
 arch/hexagon/kernel/process.c| 2 +-
 arch/ia64/kernel/process.c   | 2 +-
 arch/m68k/kernel/process.c   | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c   | 2 +-
 arch/nds32/kernel/process.c  | 2 +-
 arch/nios2/kernel/process.c  | 2 +-
 arch/openrisc/kernel/process.c   | 2 +-
 arch/parisc/kernel/process.c | 2 +-
 arch/powerpc/kernel/process.c| 2 +-
 arch/riscv/kernel/process.c  | 2 +-
 arch/s390/kernel/process.c   | 2 +-
 arch/sh/kernel/process_32.c  | 2 +-
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c | 2 +-
 arch/x86/kernel/fpu/core.c   | 4 ++--
 arch/x86/kernel/process.c| 2 +-
 arch/xtensa/kernel/process.c | 2 +-
 include/linux/sched.h| 1 +
 include/linux/sched/task.h   | 1 +
 kernel/fork.c| 5 -
 mm/vmscan.c  | 4 ++--
 29 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5f8527081da9..f4759e4ee4a9 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,7 +249,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childti->pcb.ksp = (unsigned long) childstack;
childti->pcb.flags = 1; /* set FEN, clear everything else */
 
-   if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+   if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
/* kernel thread */
memset(childstack, 0,
sizeof(struct switch_stack) + sizeof(struct pt_regs));
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8e90052f6f05..b409ecb1407f 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -191,7 +191,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childksp[0] = 0;/* fp */
childksp[1] = (unsigned long)ret_from_fork; /* blink */
 
-   if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+   if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(c_regs, 0, sizeof(struct pt_regs));
 
c_callee->r13 = kthread_arg;
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index d47159f3791c..44603062d661 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -251,7 +251,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
thread->cpu_domain = get_domain();
 #endif
 
-   if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+   if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->ARM_r0 = 0;
if (stack_start)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index aacf2f5559a8..b4c6c41847be 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -333,7 +333,7 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
ptrauth_thread_init_kernel(p);
 
-   if (likely(!(p->flags & (PF_KTHREAD | PF_IO_WORKER)))) {
+   if (likely(!(p->flags & (PF_KTHREAD | PF_USER_WORKER)))) {
*childregs = *current_pt_regs();
childregs->regs[0] = 0;
 
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index 3d0ca22cd0e2..509f2bfe4ace 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -49,7 +49,7 @@ int copy_thread(unsigned long clone_flags,
/* setup thread.sp for switch_to !!! */
p->thread.sp = (unsigned long)childstack;
 
-   if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+   if (unlikely(p->flags & (PF_KTHREAD | PF_USER_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
childstack->r15 = (unsigned long) ret_from_kernel_thread;
childstack->r10 = kthread_arg;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8833fa4f5d51..050aca44ba6d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
childregs = (struct pt_regs *) (THREAD_SIZE + task_stack_page(p)) - 1;
 
- 

[PATCH V5 05/10] signal: Perform autoreap for PF_USER_WORKER

2021-11-21 Thread Mike Christie
Userspace doesn't know about PF_USER_WORKER threads, so it can't call
wait() to clean them up. For cases where qemu does dynamic/hot add/remove
of vhost devices, we need to auto-reap the thread like was done for the
kthread case, because qemu does not know which API the kernel/vhost layer
is using.

This has us do autoreaping for these threads similar to when the parent
ignores SIGCHLD and for kthreads.

Signed-off-by: Mike Christie 
---
 kernel/signal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 7c4b7ae714d4..07e7d6f9bf66 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2049,9 +2049,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 
psig = tsk->parent->sighand;
spin_lock_irqsave(&psig->siglock, flags);
-   if (!tsk->ptrace && sig == SIGCHLD &&
+   if (!tsk->ptrace && (tsk->flags & PF_USER_WORKER || (sig == SIGCHLD &&
(psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
-(psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
+(psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))))) {
/*
 * We are exiting and our parent doesn't care.  POSIX.1
 * defines special semantics for setting SIGCHLD to SIG_IGN
-- 
2.25.1



[PATCH V5 03/10] fork: add KERNEL_WORKER flag to not dup/clone files

2021-11-21 Thread Mike Christie
Each vhost device gets a thread that is used to perform IO and management
operations. Instead of a thread that is accessing a device, the thread is
part of the device, so when it calls the kernel_worker() function added in
the next patch we can't dup or clone the parent's files/FDs, because that
would take an extra reference on ourselves.

Later, when we do:

Qemu process exits:
do_exit -> exit_files -> put_files_struct -> close_files

we would leak the device's resources because of that extra refcount
on the fd or files_struct.

This patch adds a no_files option so these worker threads can prevent
taking an extra refcount on themselves.

Signed-off-by: Mike Christie 
Acked-by: Christian Brauner 
Reviewed-by: Christoph Hellwig 
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c  | 11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 1ad98e31d5bc..47ede41b19c5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -20,6 +20,7 @@ struct css_set;
 
 #define KERN_WORKER_IO BIT(0)
 #define KERN_WORKER_USER   BIT(1)
+#define KERN_WORKER_NO_FILES   BIT(2)
 
 struct kernel_clone_args {
u64 flags;
diff --git a/kernel/fork.c b/kernel/fork.c
index 8f6cd9581e2e..a1ba423eec4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1529,7 +1529,8 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
+ int no_files)
 {
struct files_struct *oldf, *newf;
int error = 0;
@@ -1541,6 +1542,11 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
if (!oldf)
goto out;
 
+   if (no_files) {
+   tsk->files = NULL;
+   goto out;
+   }
+
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2179,7 +2185,8 @@ static __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
-   retval = copy_files(clone_flags, p);
+   retval = copy_files(clone_flags, p,
+   args->worker_flags & KERN_WORKER_NO_FILES);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
-- 
2.25.1



WorldCist'22 - 10th World Conference on IST | Montenegro | Deadline: November 24

2021-11-21 Thread WorldCIST
* Conference listed in CORE Ranking

** Google Scholar H5-Index = 23

*** Best papers selected for SCI/SSCI journals



--

WorldCIST'22 - 10th World Conference on Information Systems and Technologies

12-14 April 2022, Budva, Montenegro

http://worldcist.org 

---


The WorldCist'22 - 10th World Conference on Information Systems and 
Technologies, to be held in Budva, Montenegro, 12-14 April 2022, is a global 
forum for researchers and practitioners to present and discuss the most recent 
innovations, trends, results, experiences and concerns in the several 
perspectives of Information Systems and Technologies.

We are pleased to invite you to submit your papers to WorldCist'22. All 
submissions will be reviewed on the basis of relevance, originality, importance 
and clarity.



TOPICS

Submitted papers should be related with one or more of the main themes proposed 
for the Conference:

A) Information and Knowledge Management (IKM);

B) Organizational Models and Information Systems (OMIS);

C) Software and Systems Modeling (SSM);

D) Software Systems, Architectures, Applications and Tools (SSAAT);

E) Multimedia Systems and Applications (MSA);

F) Computer Networks, Mobility and Pervasive Systems (CNMPS);

G) Intelligent and Decision Support Systems (IDSS);

H) Big Data Analytics and Applications (BDAA);

I) Human-Computer Interaction (HCI);

J) Ethics, Computers and Security (ECS)

K) Health Informatics (HIS);

L) Information Technologies in Education (ITE);

M) Technologies for Biomedical Applications (TBA)

N) Information Technologies in Radiocommunications (ITR);



TYPES of SUBMISSIONS and DECISIONS

Four types of papers can be submitted:

Full paper: Finished or consolidated R&D works, to be included in one of the 
Conference themes. These papers are assigned a 10-page limit.

Short paper: Ongoing works with relevant preliminary results, open to 
discussion. These papers are assigned a 7-page limit.

Poster paper: Initial work with relevant ideas, open to discussion. These 
papers are assigned a 4-page limit.

Company paper: Companies' papers that show practical experience, R&D, tools, 
etc., focused on some topics of the conference. These papers are assigned a 
4-page limit.

Submitted papers must comply with the format of Advances in Intelligent Systems 
and Computing Series (see Instructions for Authors at Springer Website), be 
written in English, must not have been published before, not be under review 
for any other conference or publication and not include any information leading 
to the authors’ identification. Therefore, the authors’ names, affiliations and 
bibliographic references should not be included in the version for evaluation 
by the Program Committee. This information should only be included in the 
camera-ready version, saved in Word or Latex format and also in PDF format. 
These files must be accompanied by the Consent to Publish form filled out, in a 
ZIP file, and uploaded at the conference management system.

All papers will be subjected to a “double-blind review” by at least two members 
of the Program Committee.

Based on Program Committee evaluation, a paper can be rejected or accepted by 
the Conference Chairs. In the latter case, it can be accepted as the type 
originally submitted or as another type. Thus, full papers can be accepted as 
short papers or poster papers only. Similarly, short papers can be accepted as 
poster papers only.

Poster papers and Company papers are not published in the Conference 
Proceedings, being only presented and discussed. The authors of accepted poster 
papers should build and print a poster to be exhibited during the Conference. 
This poster must follow an A1 or A2 vertical format. The Conference includes 
Work Sessions where these posters are presented and orally discussed, with a 7 
minute limit per poster.

The authors of accepted Full papers will have 15 minutes to present their work 
in a Conference Work Session; approximately 5 minutes of discussion will follow 
each presentation. The authors of accepted Short papers and Company papers will 
have 11 minutes to present their work in a Conference Work Session; 
approximately 4 minutes of discussion will follow each presentation.



PUBLICATION & INDEXING

To ensure that a full paper or short paper is published, poster paper or 
company paper is presented, at least one of the authors must be fully 
registered by the 8th of January 2022, and the paper must comply with the 
suggested layout and page-limit. Additionally, all recommended changes must be 
addressed by the authors before they submit the camera-ready version.

No more than one paper per registration will be published. An extra fee must be 
paid for publication of additional papers, with a