Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-27 Thread Eric W. Biederman
Mike Christie  writes:

> On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>> 
>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>> before we call work->fn(). Is it "safe" to run this callback with
>> signal_pending() or fatal_signal_pending() ?
>
> The questions before this one I'll leave for the core vhost devs since
> they know best.

Let me ask a clarifying question:

Is it only the call to schedule() in vhost_worker that you are worried
about not sleeping when signal_pending() or fatal_signal_pending() is true?

Is there a concern that the worker functions, aka "work->fn()", will also
have killable or interruptible sleeps that will misbehave?

We can handle schedule() in vhost_worker without problem.

If a worker function has interruptible or killable sleeps, those will
turn into busy waits or, worse, not sleep long enough, and that seems
like a problem.  There is no way to guarantee that the outer loop of
vhost_worker will protect the worker functions from signal_pending()
or fatal_signal_pending() becoming true.
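
To make the failure mode concrete, here is a minimal sketch (not the
actual vhost code; worker_should_stop(), have_work() and run_work() are
hypothetical stand-ins for the real llist handling) of why a plain
interruptible sleep degenerates into a busy loop once a signal is
pending:

#include <linux/sched.h>

/* hypothetical helpers standing in for the real work-list handling */
static bool worker_should_stop(void *data);
static bool have_work(void *data);
static void run_work(void *data);

static int worker_loop_sketch(void *data)
{
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (worker_should_stop(data)) {
			__set_current_state(TASK_RUNNING);
			break;
		}

		if (!have_work(data)) {
			/*
			 * Once signal_pending() or fatal_signal_pending()
			 * is true, schedule() in TASK_INTERRUPTIBLE returns
			 * without blocking, so this spins.  The outer loop
			 * can check and consume the signal before sleeping,
			 * but it cannot shield interruptible or killable
			 * sleeps inside work->fn().
			 */
			schedule();
			continue;
		}

		__set_current_state(TASK_RUNNING);
		run_work(data);		/* runs the queued work->fn() callbacks */
	}
	return 0;
}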


Eric


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-27 Thread Linus Torvalds
On Sat, May 27, 2023 at 6:17 PM Eric W. Biederman  wrote:
>
> It seems like a good approach for including in the -rc series.
> I think the change should look more like my change below.

I have no objections. My patch was a fairly "hack and slash" thing to
just disassociate the IO workers entirely from the core dumping. Yours
seems to be slightly more surgical.

  Linus

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-27 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman  
> wrote:
>>
>> The real sticky widget for me is how to handle one of these processes
>> coredumping.  It really looks like it will result in a reliable hang.
>
> Well, if *that* is the main worry, I think that's trivial enough to deal with.
>
> In particular, we could make the rule just be that user worker threads
> simply do not participate in core-dumps.
>
> THAT isn't hard.
>
> All we need to do is
>
>  (a) not count those threads in zap_threads()
>
>  (b) make sure that they don't add themselves to the "dumper" list by
> not calling "coredump_task_exit()"
>
>  (c) not initiate core-dumping themselves.
>
> and I think that's pretty much it.
>
> In fact, that really seems like a good model *regardless*, because
> honestly, a PF_IO_WORKER doesn't have valid register state for the
> core dump anyway, and anything that would have caused an IO thread to
> get a SIGSEGV *should* have caused a kernel oops already.
>
> So the only worry is that the core dump will now happen while an IO
> worker is still busy and so it's not "atomic" wrt possible VM changes,
> but while that used to be a big problem back in the dark ages when we
> didn't get the VM locks for core dumping, that got fixed a few years
> ago because it already caused lots of potential issues.


>
> End result: I think the attached patch is probably missing something,
> but the approach "FeelsRight(tm)" to me.
>
> Comments?

It seems like a good approach for including in the -rc series.
I think the change should look more like my change below.

nits:
  - The threads all need to participate in the group exit even if they
aren't going to be in the coredump.

  - For vhost_worker we need s/PF_IO_WORKER/PF_USER_WORKER/.

  - Moving PF_IO_WORKER above the sig_kernel_coredump(signr) test is
unnecessary.  The sig_kernel_coredump(signr) test can only become
true if a process exit has not been initiated yet.  More importantly,
moving the test obscures the fact that only do_group_exit is
moved out of get_signal for the PF_IO_WORKER special case (see the
sketch below).
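
As a rough illustration of that last nit (with the flag from the second
nit swapped in), the tail of get_signal() would keep its current shape,
something like this abridged sketch (assuming the PF_USER_WORKER flag
from the vhost_task series; not a tested patch):

	if (sig_kernel_coredump(signr)) {
		if (print_fatal_signals)
			print_fatal_signal(ksig->info.si_signo);
		do_coredump(&ksig->info);
	}

	/*
	 * PF_USER_WORKER threads will catch and exit on fatal signals
	 * themselves.  They have cleanup that must be performed, so
	 * we cannot call do_exit() on their behalf.
	 */
	if (current->flags & PF_USER_WORKER)
		goto out;

	/*
	 * Death signals, no core dump.
	 */
	do_group_exit(ksig->info.si_signo);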


Long term I think we want an approach that stops the worker threads
during the coredumps.  It will just yield a better quality of
implementation if we minimize the amount of concurrency during the
coredump.

I have a pending patchset that moves the coredump rendezvous into
get_signal.  At that point stopping all of the threads is, just like
SIGSTOP, something the worker threads can use, and it won't introduce any
issues.  Today the problem is that some of the worker thread code must run
before the coredump stop.

Looking forward I don't see not asking the worker threads to stop
for the coredump right now causing any problems in the future.
So I think we can use this to resolve the coredump issue I spotted.


diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..620f7f9dc894 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -371,7 +371,8 @@ static int zap_process(struct task_struct *start, int exit_code)
 		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
-			nr++;
+			if (!(t->flags & PF_IO_WORKER))
+				nr++;
 		}
 	}
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..6082dba9131a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -411,7 +411,9 @@ static void coredump_task_exit(struct task_struct *tsk)
 	tsk->flags |= PF_POSTCOREDUMP;
 	core_state = tsk->signal->core_state;
 	spin_unlock_irq(&tsk->sighand->siglock);
-	if (core_state) {
+
+	/* I/O Workers don't participate in coredumps */
+	if (core_state && !(tsk->flags & PF_IO_WORKER)) {
 		struct core_thread self;
 
 		self.task = current;


>   current->flags |= PF_SIGNALED;
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves and do not participate in core dumping.
> +  *
> +  * They have cleanup that must be performed, so we cannot
> +  * call do_exit() on their behalf.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + goto out;
> +
>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);
> @@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
>   do_coredump(&ksig->info);
>   }
>  
> - /*
> -  * PF_IO_WORKER threads will catch and exit on fatal signals
> -  * themselves. They have cleanup that must be performed, so
> -  * we cannot call do_exit() on their behalf.
> -  */
> - if (current->flags & PF_IO_WORKER)
> - goto out;
> -

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-27 Thread Linus Torvalds
On Sat, May 27, 2023 at 2:49 AM Eric W. Biederman  wrote:
>
> The real sticky widget for me is how to handle one of these processes
> coredumping.  It really looks like it will result in a reliable hang.

Well, if *that* is the main worry, I think that's trivial enough to deal with.

In particular, we could make the rule just be that user worker threads
simply do not participate in core-dumps.

THAT isn't hard.

All we need to do is

 (a) not count those threads in zap_threads()

 (b) make sure that they don't add themselves to the "dumper" list by
 not calling "coredump_task_exit()"

 (c) not initiate core-dumping themselves.

and I think that's pretty much it.

In fact, that really seems like a good model *regardless*, because
honestly, a PF_IO_WORKER doesn't have valid register state for the
core dump anyway, and anything that would have caused an IO thread to
get a SIGSEGV *should* have caused a kernel oops already.

So the only worry is that the core dump will now happen while an IO
worker is still busy and so it's not "atomic" wrt possible VM changes,
but while that used to be a big problem back in the dark ages when we
didn't get the VM locks for core dumping, that got fixed a few years
ago because it already caused lots of potential issues.

End result: I think the attached patch is probably missing something,
but the approach "FeelsRight(tm)" to me.

Comments?

   Linus
 fs/coredump.c   |  2 +-
 kernel/exit.c   |  6 ++
 kernel/signal.c | 18 ++
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701b..46f8145b39e6 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -368,7 +368,7 @@ static int zap_process(struct task_struct *start, int exit_code)
 
 	for_each_thread(start, t) {
 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-		if (t != current && !(t->flags & PF_POSTCOREDUMP)) {
+		if (t != current && !(t->flags & (PF_POSTCOREDUMP | PF_IO_WORKER))) {
 			sigaddset(&t->pending.signal, SIGKILL);
 			signal_wake_up(t, 1);
 			nr++;
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..fde57b9f4494 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -400,6 +400,12 @@ static void coredump_task_exit(struct task_struct *tsk)
 {
 	struct core_state *core_state;
 
+	/*
+	 * IO workers do not participate in dumping core
+	 */
+	if (tsk->flags & PF_IO_WORKER)
+		return;
+
 	/*
 	 * Serialize with any possible pending coredump.
 	 * We must hold siglock around checking core_state
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..e0acb11d3a1d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2845,6 +2845,16 @@ bool get_signal(struct ksignal *ksig)
 		 */
 		current->flags |= PF_SIGNALED;
 
+		/*
+		 * PF_IO_WORKER threads will catch and exit on fatal signals
+		 * themselves and do not participate in core dumping.
+		 *
+		 * They have cleanup that must be performed, so we cannot
+		 * call do_exit() on their behalf.
+		 */
+		if (current->flags & PF_IO_WORKER)
+			goto out;
+
 		if (sig_kernel_coredump(signr)) {
 			if (print_fatal_signals)
 				print_fatal_signal(ksig->info.si_signo);
@@ -2860,14 +2870,6 @@ bool get_signal(struct ksignal *ksig)
 			do_coredump(&ksig->info);
 		}
 
-		/*
-		 * PF_IO_WORKER threads will catch and exit on fatal signals
-		 * themselves. They have cleanup that must be performed, so
-		 * we cannot call do_exit() on their behalf.
-		 */
-		if (current->flags & PF_IO_WORKER)
-			goto out;
-
 		/*
 		 * Death signals, no core dump.
 		 */

Re: [PATCH net-next 3/5] virtio_net: Add page pool fragmentation support

2023-05-27 Thread Liang Chen
On Fri, May 26, 2023 at 4:29 PM Horatiu Vultur
 wrote:
>
> The 05/26/2023 13:46, Liang Chen wrote:
>
> Hi Liang,
>
> >
> > To further enhance performance, implement page pool fragmentation
> > support and introduce a module parameter to enable or disable it.
> >
> > In single-core vm testing environments, there is an additional performance
> > gain observed in the normal path compared to the one packet per page
> > approach.
> >   Upstream codebase: 47.5 Gbits/sec
> >   Upstream codebase with page pool: 50.2 Gbits/sec
> >   Upstream codebase with page pool fragmentation support: 52.3 Gbits/sec
> >
> > There is also some performance gain for XDP cpumap.
> >   Upstream codebase: 1.38 Gbits/sec
> >   Upstream codebase with page pool: 9.74 Gbits/sec
> >   Upstream codebase with page pool fragmentation: 10.3 Gbits/sec
> >
> > Signed-off-by: Liang Chen 
> > ---
> >  drivers/net/virtio_net.c | 72 ++--
> >  1 file changed, 55 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 99c0ca0c1781..ac40b8c66c59 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -32,7 +32,9 @@ module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> >
> >  static bool page_pool_enabled;
> > +static bool page_pool_frag;
> >  module_param(page_pool_enabled, bool, 0400);
> > +module_param(page_pool_frag, bool, 0400);
> >
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > @@ -909,23 +911,32 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >struct page *p,
> >int offset,
> >int page_off,
> > -  unsigned int *len)
> > +  unsigned int *len,
> > +  unsigned int *pp_frag_offset)
>
> The 'unsigned int *pp_frag_offset' seems to be unaligned.
>

Sure, Thanks!
> >  {
> > int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > struct page *page;
> > +   unsigned int pp_frag_offset_val;
>
> Please use reverse christmas tree notation here. The pp_frag_offset_val
> needs to be declared before page;
>

Sure. Will do on v2.
> >
> > if (page_off + *len + tailroom > PAGE_SIZE)
> > return NULL;
> >
> > if (rq->page_pool)
> > -   page = page_pool_dev_alloc_pages(rq->page_pool);
> > +   if (rq->page_pool->p.flags & PP_FLAG_PAGE_FRAG)
> > +   page = page_pool_dev_alloc_frag(rq->page_pool, 
> > pp_frag_offset,
> > +   PAGE_SIZE);
>
> Don't you need to check if pp_frag_offset is null? As you call it once with
> NULL.
>

At the moment, page_pool is enabled only for mergeable mode, and the
path leading to a call with NULL pp_frag_offset is from small mode.
But I will evaluate again whether it is beneficial to support
page_pool for small mode on v2. Thanks.
> > +   else
> > +   page = page_pool_dev_alloc_pages(rq->page_pool);
> > else
> > page = alloc_page(GFP_ATOMIC);
> >
> > if (!page)
> > return NULL;
> >
> > -   memcpy(page_address(page) + page_off, page_address(p) + offset, 
> > *len);
> > +   pp_frag_offset_val = pp_frag_offset ? *pp_frag_offset : 0;
> > +
> > +   memcpy(page_address(page) + page_off + pp_frag_offset_val,
> > +  page_address(p) + offset, *len);
> > page_off += *len;
> >
> > while (--*num_buf) {
> > @@ -948,7 +959,7 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> > goto err_buf;
> > }
> >
> > -   memcpy(page_address(page) + page_off,
> > +   memcpy(page_address(page) + page_off + pp_frag_offset_val,
> >page_address(p) + off, buflen);
> > page_off += buflen;
> > virtnet_put_page(rq, p);
> > @@ -1029,7 +1040,7 @@ static struct sk_buff *receive_small_xdp(struct 
> > net_device *dev,
> > SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > xdp_page = xdp_linearize_page(rq, &num_buf, page,
> >   offset, header_offset,
> > - &len);
> > + &len, NULL);
> > if (!xdp_page)
> > goto err_xdp;
> >
> > @@ -1323,6 +1334,7 @@ static void *mergeable_xdp_get_buf(struct 
> > virtnet_info *vi,
> > unsigned int headroom = mergeable_ctx_to_headroom(ctx);
> > struct page *xdp_page;
> > unsigned int xdp_room;
> > +   unsigned int page_frag_offset = 0;
>
> Please use reverse x-mas 
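
For reference, a minimal sketch (not the proposed virtio_net code; the
helper name and NULL handling are illustrative only) of allocating
either a fragment or a whole page from a pool while tolerating a NULL
offset pointer, using the stock page_pool API:

#include <net/page_pool.h>

static struct page *pp_alloc_buf(struct page_pool *pool,
				 unsigned int *pp_frag_offset,
				 unsigned int size)
{
	unsigned int offset;
	struct page *page;

	if ((pool->p.flags & PP_FLAG_PAGE_FRAG) && pp_frag_offset) {
		/* page_pool fills in the offset of the fragment it hands out */
		page = page_pool_dev_alloc_frag(pool, &offset, size);
		if (page)
			*pp_frag_offset = offset;
	} else {
		/* fall back to a whole page, offset 0 if the caller wants one */
		page = page_pool_dev_alloc_pages(pool);
		if (page && pp_frag_offset)
			*pp_frag_offset = 0;
	}

	return page;
}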

Re: [PATCH net-next 5/5] virtio_net: Implement DMA pre-handler

2023-05-27 Thread Liang Chen
On Fri, May 26, 2023 at 3:06 PM Jason Wang  wrote:
>
> On Fri, May 26, 2023 at 1:47 PM Liang Chen  wrote:
> >
> > Adding a DMA pre-handler that utilizes page pool for managing DMA mappings.
> > When IOMMU is enabled, turning on the page_pool_dma_map module parameter to
> > select page pool for DMA mapping management gives a significant reduction
> > in the overhead caused by DMA mappings.
> >
> > In testing environments with a single core vm and qemu emulated IOMMU,
> > significant performance improvements can be observed:
> >   Upstream codebase: 1.76 Gbits/sec
> >   Upstream codebase with page pool fragmentation support: 1.81 Gbits/sec
> >   Upstream codebase with page pool fragmentation and DMA support: 19.3
> >   Gbits/sec
> >
> > Signed-off-by: Liang Chen 
> > ---
> >  drivers/net/virtio_net.c | 55 
> >  1 file changed, 55 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index ac40b8c66c59..73cc4f9fe4fa 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  static int napi_weight = NAPI_POLL_WEIGHT;
> >  module_param(napi_weight, int, 0444);
> > @@ -33,8 +34,10 @@ module_param(napi_tx, bool, 0644);
> >
> >  static bool page_pool_enabled;
> >  static bool page_pool_frag;
> > +static bool page_pool_dma_map;
> >  module_param(page_pool_enabled, bool, 0400);
> >  module_param(page_pool_frag, bool, 0400);
> > +module_param(page_pool_dma_map, bool, 0400);
> >
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > @@ -3830,6 +3833,49 @@ static void virtnet_del_vqs(struct virtnet_info *vi)
> > virtnet_free_queues(vi);
> >  }
> >
> > +static dma_addr_t virtnet_pp_dma_map_page(struct device *dev, struct page 
> > *page,
> > + unsigned long offset, size_t size,
> > + enum dma_data_direction dir, 
> > unsigned long attrs)
> > +{
> > +   struct page *head_page;
> > +
> > +   if (dir != DMA_FROM_DEVICE)
> > +   return 0;
> > +
> > +   head_page = compound_head(page);
> > +   return page_pool_get_dma_addr(head_page)
> > +   + (page - head_page) * PAGE_SIZE
> > +   + offset;
>
> So it's not a map, it is just a query of the dma address from the pool.
>
> > +}
> > +
> > +static bool virtnet_pp_dma_unmap_page(struct device *dev, dma_addr_t 
> > dma_handle,
> > + size_t size, enum dma_data_direction 
> > dir,
> > + unsigned long attrs)
> > +{
> > +   phys_addr_t phys;
> > +
> > +   /* Handle only the RX direction, and sync the DMA memory only if 
> > it's not
> > +* a DMA coherent architecture.
> > +*/
> > +   if (dir != DMA_FROM_DEVICE)
> > +   return false;
> > +
> > +   if (dev_is_dma_coherent(dev))
> > +   return true;
> > +
> > +   phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
>
> This would be somewhat slow. If we track the mapping by driver, it
> would be much faster.
>
> More could be seen here:
>
> https://lists.linuxfoundation.org/pipermail/virtualization/2023-May/066778.html
>
> Thanks
>

Thanks for the information. I agree with your suggestion, and I will
drop the last two patches on v2 and wait for Xuan's patch to land for
dma mapping management.




> > +   if (WARN_ON(!phys))
> > +   return false;
> > +
> > +   arch_sync_dma_for_cpu(phys, size, dir);
> > +   return true;
> > +}
> > +
> > +static struct virtqueue_pre_dma_ops virtnet_pp_pre_dma_ops = {
> > +   .map_page = virtnet_pp_dma_map_page,
> > +   .unmap_page = virtnet_pp_dma_unmap_page,
> > +};
> > +
> >  static void virtnet_alloc_page_pool(struct receive_queue *rq)
> >  {
> > struct virtio_device *vdev = rq->vq->vdev;
> > @@ -3845,6 +3891,15 @@ static void virtnet_alloc_page_pool(struct 
> > receive_queue *rq)
> > if (page_pool_frag)
> > pp_params.flags |= PP_FLAG_PAGE_FRAG;
> >
> > +   /* Consider using page pool DMA support only when DMA API is used. 
> > */
> > +   if (virtio_has_feature(vdev, VIRTIO_F_ACCESS_PLATFORM) &&
> > +   page_pool_dma_map) {
> > +   pp_params.flags |= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > +   pp_params.dma_dir = DMA_FROM_DEVICE;
> > +   pp_params.max_len = PAGE_SIZE << pp_params.order;
> > +   virtqueue_register_pre_dma_ops(rq->vq, &virtnet_pp_pre_dma_ops);
> > +   }
> > +
> > rq->page_pool = page_pool_create(&pp_params);
> > if (IS_ERR(rq->page_pool)) {
> > dev_warn(&vdev->dev, "page pool creation failed: %ld\n",
> > --
> > 2.31.1
> >
>
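
For context on the DMA discussion above, here is a minimal sketch (not
the proposed virtio_net code; "dev" and the pool size of 256 are
assumptions) of letting page_pool own the RX DMA mapping and then
querying the DMA address of an allocated page:

#include <linux/dma-mapping.h>
#include <linux/numa.h>
#include <net/page_pool.h>

static struct page_pool *rx_pool_create(struct device *dev)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= 256,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,		/* device that does the DMA */
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE,	/* sync length for DMA_SYNC_DEV */
		.offset		= 0,
	};

	return page_pool_create(&pp_params);	/* returns ERR_PTR() on failure */
}

static dma_addr_t rx_buf_dma(struct page_pool *pool)
{
	struct page *page = page_pool_dev_alloc_pages(pool);

	if (!page)
		return DMA_MAPPING_ERROR;

	/* the pool recorded the mapping at allocation time */
	return page_pool_get_dma_addr(page);
}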

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-27 Thread Liang Chen
On Fri, May 26, 2023 at 2:51 PM Jason Wang  wrote:
>
> On Fri, May 26, 2023 at 1:46 PM Liang Chen  wrote:
> >
> > The implementation at the moment uses one page per packet in both the
> > normal and XDP path.
>
> It's better to explain why we need a page pool and how it can help the
> performance.
>

Sure, I will include that on v2.
> > In addition, introduce a module parameter to enable
> > or disable the usage of page pool (disabled by default).
>
> If page pool wins for most of the cases, any reason to disable it by default?
>

Thank you for raising the point. It does make sense to enable it by default.
> >
> > In single-core vm testing environments, it gives a modest performance gain
> > in the normal path.
> >   Upstream codebase: 47.5 Gbits/sec
> >   Upstream codebase + page_pool support: 50.2 Gbits/sec
> >
> > In multi-core vm testing environments, the most significant performance
> > gain is observed in XDP cpumap:
> >   Upstream codebase: 1.38 Gbits/sec
> >   Upstream codebase + page_pool support: 9.74 Gbits/sec
>
> Please show more details on the test. E.g. which kinds of tests have
> you measured?
>
> Btw, it would be better to measure PPS as well.
>

Sure. It will be added on v2.
> >
> > With this foundation, we can further integrate page pool fragmentation and
> > DMA map/unmap support.
> >
> > Signed-off-by: Liang Chen 
> > ---
> >  drivers/net/virtio_net.c | 188 ++-
>
> I believe we should make virtio-net select CONFIG_PAGE_POOL or do
> the ifdef tricks at least.
>

Sure. It will be done on v2.
> >  1 file changed, 146 insertions(+), 42 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c5dca0d92e64..99c0ca0c1781 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -31,6 +31,9 @@ module_param(csum, bool, 0444);
> >  module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> >
> > +static bool page_pool_enabled;
> > +module_param(page_pool_enabled, bool, 0400);
> > +
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> >  #define GOOD_COPY_LEN  128
> > @@ -159,6 +162,9 @@ struct receive_queue {
> > /* Chain pages by the private ptr. */
> > struct page *pages;
> >
> > +   /* Page pool */
> > +   struct page_pool *page_pool;
> > +
> > /* Average packet length for mergeable receive buffers. */
> > struct ewma_pkt_len mrg_avg_pkt_len;
> >
> > @@ -459,6 +465,14 @@ static struct sk_buff *virtnet_build_skb(void *buf, 
> > unsigned int buflen,
> > return skb;
> >  }
> >
> > +static void virtnet_put_page(struct receive_queue *rq, struct page *page)
> > +{
> > +   if (rq->page_pool)
> > +   page_pool_put_full_page(rq->page_pool, page, true);
> > +   else
> > +   put_page(page);
> > +}
> > +
> >  /* Called from bottom half context */
> >  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >struct receive_queue *rq,
> > @@ -555,7 +569,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
> > *vi,
> > hdr = skb_vnet_hdr(skb);
> > memcpy(hdr, hdr_p, hdr_len);
> > if (page_to_free)
> > -   put_page(page_to_free);
> > +   virtnet_put_page(rq, page_to_free);
> >
> > return skb;
> >  }
> > @@ -802,7 +816,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> > return ret;
> >  }
> >
> > -static void put_xdp_frags(struct xdp_buff *xdp)
> > +static void put_xdp_frags(struct xdp_buff *xdp, struct receive_queue *rq)
> >  {
>
> rq could be fetched from xdp_rxq_info?

Yeah, it has the queue_index there.
>
> > struct skb_shared_info *shinfo;
> > struct page *xdp_page;
> > @@ -812,7 +826,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > shinfo = xdp_get_shared_info_from_buff(xdp);
> > for (i = 0; i < shinfo->nr_frags; i++) {
> > xdp_page = skb_frag_page(&shinfo->frags[i]);
> > -   put_page(xdp_page);
> > +   virtnet_put_page(rq, xdp_page);
> > }
> > }
> >  }
> > @@ -903,7 +917,11 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> > if (page_off + *len + tailroom > PAGE_SIZE)
> > return NULL;
> >
> > -   page = alloc_page(GFP_ATOMIC);
> > +   if (rq->page_pool)
> > +   page = page_pool_dev_alloc_pages(rq->page_pool);
> > +   else
> > +   page = alloc_page(GFP_ATOMIC);
> > +
> > if (!page)
> > return NULL;
> >
> > @@ -926,21 +944,24 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >  * is sending packet larger than the MTU.
> >  */
> > if ((page_off + buflen + tailroom) > PAGE_SIZE) {
> > -   put_page(p);
> > + 
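
For reference, the alloc/free pairing that the virtnet_put_page() helper
above enforces can be boiled down to the following sketch (not the
proposed virtio_net code; a NULL pool models the module parameter being
off):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <net/page_pool.h>

static struct page *rx_page_get(struct page_pool *pool)
{
	if (pool)
		return page_pool_dev_alloc_pages(pool);
	return alloc_page(GFP_ATOMIC);
}

static void rx_page_put(struct page_pool *pool, struct page *page)
{
	if (pool)
		/* pages that came from a pool must go back to that pool */
		page_pool_put_full_page(pool, page, true);
	else
		put_page(page);
}
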

Re: [PATCH net-next 1/5] virtio_net: Fix an unsafe reference to the page chain

2023-05-27 Thread Liang Chen
On Fri, May 26, 2023 at 2:39 PM Jason Wang  wrote:
>
> On Fri, May 26, 2023 at 1:46 PM Liang Chen  wrote:
> >
> > "private" of buffer page is currently used for big mode to chain pages.
> > But in mergeable mode, that offset of page could mean something else,
> > e.g. when page_pool page is used instead. So excluding mergeable mode to
> > avoid such a problem.
>
> If this issue happens only in the case of page_pool, it would be
> better to squash it there.
>
> Thanks

Sure, thanks!


>
> >
> > Signed-off-by: Liang Chen 
> > ---
> >  drivers/net/virtio_net.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 5a7f7a76b920..c5dca0d92e64 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -497,7 +497,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
> > *vi,
> > return NULL;
> >
> > page = (struct page *)page->private;
> > -   if (page)
> > +   if (!vi->mergeable_rx_bufs && page)
> > give_pages(rq, page);
> > goto ok;
> > }
> > --
> > 2.31.1
> >
>

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-27 Thread Eric W. Biederman
Linus Torvalds  writes:

> So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions.


The real sticky widget for me is how to handle one of these processes
coredumping.  It really looks like it will result in a reliable hang.

Limiting ourselves to changes that will only affect vhost, all I can
see is allowing the vhost_worker thread to exit as soon as
get_signal reports that the process is exiting.  Then vhost_dev_flush
would need to process the pending work.

Something like this:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..fb5ebc50c553 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -234,14 +234,31 @@ EXPORT_SYMBOL_GPL(vhost_poll_stop);
 void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
+   struct vhost_worker *worker = dev->worker;
+   struct vhost_work *work, *work_next;
+   struct llist_node *node;
+
+   if (!worker)
+   return;
+
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
 
-   if (dev->worker) {
-   init_completion(&flush.wait_event);
-   vhost_work_init(&flush.work, vhost_flush_work);
+   vhost_work_queue(dev, &flush.work);
 
-   vhost_work_queue(dev, &flush.work);
-   wait_for_completion(&flush.wait_event);
+   /* Either vhost_worker runs the pending work or we do */
+   node = llist_del_all(&worker->work_list);
+   if (node) {
+   node = llist_reverse_order(node);
+   /* make sure flag is seen after deletion */
+   smp_wmb();
+   llist_for_each_entry_safe(work, work_next, node, node) {
+   clear_bit(VHOST_WORK_QUEUED, &work->flags);
+   work->fn(work);
+   cond_resched();
+   }
}
+
+   wait_for_completion(&flush.wait_event);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
@@ -338,6 +355,7 @@ static int vhost_worker(void *data)
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;
+   struct ksignal ksig;
 
for (;;) {
/* mb paired w/ kthread_stop */
@@ -348,6 +366,9 @@ static int vhost_worker(void *data)
break;
}
 
+   if (get_signal(&ksig))
+   break;
+
node = llist_del_all(&worker->work_list);
if (!node)
schedule();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..613d52f01c07 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -47,6 +47,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
 * not exiting then reap the task.
 */
kernel_wait4(pid, NULL, __WCLONE, NULL);
+   put_task_struct(vtsk->task);
kfree(vtsk);
 }
 EXPORT_SYMBOL_GPL(vhost_task_stop);
@@ -101,7 +102,7 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
void *arg,
return NULL;
}
 
-   vtsk->task = tsk;
+   vtsk->task = get_task_struct(tsk);
return vtsk;
 }
 EXPORT_SYMBOL_GPL(vhost_task_create);

Eric
