Re: [RFC 3/7] iommufd: Add iommufd_device_bind_pasid()

2023-11-07 Thread Yi Liu

On 2023/10/10 16:19, Tian, Kevin wrote:

From: Liu, Yi L 
Sent: Monday, October 9, 2023 4:51 PM

+struct iommufd_device *iommufd_device_bind_pasid(struct iommufd_ctx
*ictx,
+struct device *dev,
+u32 pasid, u32 *id)
+{
+   struct iommufd_device *idev;
+   int rc;
+
+   /*
+* iommufd always sets IOMMU_CACHE because we offer no way for
userspace
+* to restore cache coherency.
+*/
+   if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY))
+   return ERR_PTR(-EINVAL);
+
+   /*
+* No iommu supports pasid-granular msi message today. Here we
+* just check whether the parent device can do safe interrupts.
+* Isolation between virtual devices within the parent device
+* relies on the parent driver to enforce.
+*/
+   if (!iommufd_selftest_is_mock_dev(dev) &&
+   !msi_device_has_isolated_msi(dev)) {
+   rc = iommufd_allow_unsafe_interrupts(dev);
+   if (rc)
+   return ERR_PTR(rc);
+   }
+


Only MemWr w/o pasid can be interpreted as an interrupt message
then we need msi isolation to protect.


yes.



But for SIOV all MemWr's are tagged with a pasid hence can never
trigger an interrupt. From this angle looks this check is unnecessary.


But the interrupts out from a SIOV virtual device do not have pasid (at
least today). Seems still need a check here if we consider this bind for
a SIOV virtual device just like binding a physical device.

--
Regards,
Yi Liu


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Yunsheng Lin
On 2023/11/8 6:10, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.
> 
> On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin  wrote:
>>
>> On 2023/11/6 10:44, Mina Almasry wrote:
>>> +
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov)
>>> +{
>>> + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
>>> +
>>> + refcount_set(&ppiov->refcount, 1);
>>> +
>>> + if (gen_pool_has_addr(binding->chunk_pool,
>>> +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
>>
>> When gen_pool_has_addr() returns false, does it mean something has gone
>> really wrong here?
>>
> 
> Yes, good eye. gen_pool_has_addr() should never return false, but then
> again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
> which is an extremely severe reaction to what can be a minor bug in
> the accounting. I prefer to leak rather than crash the machine. It's a
> bit of defensive programming that is normally frowned upon, but I feel
> like in this case it's maybe warranted due to the very severe reaction
> (BUG_ON).

I would argue that why is the above defensive programming not done in the
gen_pool core:)

> 


Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice

2023-11-07 Thread Yunsheng Lin
On 2023/11/8 5:59, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 11:46 PM Yunsheng Lin  wrote:
>>
>> On 2023/11/6 10:44, Mina Almasry wrote:
>>> +
>>> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
>>> +{
>>> + size_t size, avail;
>>> +
>>> + gen_pool_for_each_chunk(binding->chunk_pool,
>>> + netdev_devmem_free_chunk_owner, NULL);
>>> +
>>> + size = gen_pool_size(binding->chunk_pool);
>>> + avail = gen_pool_avail(binding->chunk_pool);
>>> +
>>> + if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
>>> +   size, avail))
>>> + gen_pool_destroy(binding->chunk_pool);
>>
>>
>> Is there any other place calling the gen_pool_destroy() when the above
>> warning is triggered? Do we have a leaking for binding->chunk_pool?
>>
> 
> gen_pool_destroy BUG_ON() if it's not empty at the time of destroying.
> Technically that should never happen, because
> __netdev_devmem_binding_free() should only be called when the refcount
> hits 0, so all the chunks have been freed back to the gen_pool. But,
> just in case, I don't want to crash the server just because I'm
> leaking a chunk... this is a bit of defensive programming that is
> typically frowned upon, but the behavior of gen_pool is so severe I
> think the WARN() + check is warranted here.

It seems it is pretty normal for the above to happen nowadays because of
retransmits timeouts, NAPI defer schemes mentioned below:

https://lkml.kernel.org/netdev/168269854650.2191653.8465259808498269815.stgit@firesoul/

And currently page pool core handles that by using a workqueue.


Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-07 Thread David Ahern
On 11/7/23 5:02 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev  wrote:
>>
>> On 11/05, Mina Almasry wrote:
>>> +static inline bool page_is_page_pool_iov(const struct page *page)
>>> +{
>>> + return (unsigned long)page & PP_DEVMEM;
>>> +}
>>
>> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
>> bit is that it will make debugging with bpftrace a bit (more)
>> complicated. If somebody were trying to get to that page_pool_iov from
>> the frags, they will have to do the equivalent of page_is_page_pool_iov,
>> but probably not a big deal? (thinking out loud)
> 
> Good point, but that doesn't only apply to bpf I think. I'm guessing
> even debugger drgn access to the bv_page in the frag will have trouble
> if it's actually accessing an iov with LSB set.
> 
> But this is not specific to this use for LSB pointer trick. I think
> all code that currently uses LSB pointer trick will have similar
> troubles. In this context my humble vote is that we get such big
> upside from reducing code churn that it's reasonable to tolerate such
> side effects.

+1

> 
> I could alleviate some of the issues by teaching drgn to do the right
> thing for devmem/iovs... time permitting.
> 
Tools like drgn and crash have to know when the LSB trick is used  -
e.g., dst_entry - and handle it when dereferencing pointers.


Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev  wrote:
>
> On 11/05, Mina Almasry wrote:
> > +static inline bool page_is_page_pool_iov(const struct page *page)
> > +{
> > + return (unsigned long)page & PP_DEVMEM;
> > +}
>
> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
> bit is that it will make debugging with bpftrace a bit (more)
> complicated. If somebody were trying to get to that page_pool_iov from
> the frags, they will have to do the equivalent of page_is_page_pool_iov,
> but probably not a big deal? (thinking out loud)

Good point, but that doesn't only apply to bpf I think. I'm guessing
even debugger drgn access to the bv_page in the frag will have trouble
if it's actually accessing an iov with LSB set.

But this is not specific to this use for LSB pointer trick. I think
all code that currently uses LSB pointer trick will have similar
troubles. In this context my humble vote is that we get such big
upside from reducing code churn that it's reasonable to tolerate such
side effects.

I could alleviate some of the issues by teaching drgn to do the right
thing for devmem/iovs... time permitting.

On Mon, Nov 6, 2023 at 3:49 PM David Ahern  wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/include/net/page_pool/helpers.h 
> > b/include/net/page_pool/helpers.h
> > index 78cbb040af94..b93243c2a640 100644
> > --- a/include/net/page_pool/helpers.h
> > +++ b/include/net/page_pool/helpers.h
> > @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov 
> > *ppiov)
> >   return page_pool_iov_owner(ppiov)->binding;
> >  }
> >
> > +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> > +{
> > + return refcount_read(&ppiov->refcount);
> > +}
> > +
> > +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> > +   unsigned int count)
> > +{
> > + refcount_add(count, &ppiov->refcount);
> > +}
> > +
> > +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> > +
> > +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> > +   unsigned int count)
> > +{
> > + if (!refcount_sub_and_test(count, &ppiov->refcount))
> > + return;
> > +
> > + __page_pool_iov_free(ppiov);
> > +}
> > +
> > +/* page pool mm helpers */
> > +
> > +static inline bool page_is_page_pool_iov(const struct page *page)
> > +{
> > + return (unsigned long)page & PP_DEVMEM;
>
> This is another one where the code can be more generic to not force a
> lot changes later.  e.g., PP_CUSTOM or PP_NO_PAGE. Then io_uring use
> case with host memory can leverage the iov pool in a similar manner.
>
> That does mean skb->devmem needs to be a flag on the page pool and not
> just assume iov == device memory.
>
>
>


-- 
Thanks,
Mina


Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-07 Thread David Ahern
On 11/7/23 4:55 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
>  wrote:
>>
>> On Mon, Nov 6, 2023 at 3:55 PM David Ahern  wrote:
>>>
>>> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
> The concise notification API returns tokens as a range for
> compression, encoding as two 32-bit unsigned integers start + length.
> It allows for even further batching by returning multiple such ranges
> in a single call.

 Tangential: should tokens be u64? Otherwise we can't have more than
 4gb unacknowledged. Or that's a reasonable constraint?

>>>
>>> Was thinking the same and with bits reserved for a dmabuf id to allow
>>> multiple dmabufs in a single rx queue (future extension, but build the
>>> capability in now). e.g., something like a 37b offset (128GB dmabuf
>>> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
>>
>> Agreed. Converting to 64b now sounds like a good forward looking revision.
> 
> The concept of IDing a dma-buf came up in a couple of different
> contexts. First, in the context of us giving the dma-buf ID to the
> user on recvmsg() to tell the user the data is in this specific
> dma-buf. The second context is here, to bind dma-bufs with multiple
> user-visible IDs to an rx queue.
> 
> My issue here is that I don't see anything in the struct dma_buf that
> can practically serve as an ID:
> 
> https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302
> 
> Actually, from the userspace, only the name of the dma-buf seems
> queryable. That's only unique if the user sets it as such. The dmabuf
> FD can't serve as an ID. For our use case we need to support 1 process
> doing the dma-buf bind via netlink, sharing the dma-buf FD to another
> process, and that process receives the data.  In this case the FDs
> shown by the 2 processes may be different. Converting to 64b is a
> trivial change I can make now, but I'm not sure how to ID these
> dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
> allow adding a new ID + APIs to query said dma-buf ID.
> 

The API can be unique to this usage: e.g., add a dmabuf id to the
netlink API. Userspace manages the ids (tells the kernel what value to
use with an instance), the kernel validates no 2 dmabufs have the same
id and then returns the value here.




Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
 wrote:
>
> On Mon, Nov 6, 2023 at 3:55 PM David Ahern  wrote:
> >
> > On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
> > >> The concise notification API returns tokens as a range for
> > >> compression, encoding as two 32-bit unsigned integers start + length.
> > >> It allows for even further batching by returning multiple such ranges
> > >> in a single call.
> > >
> > > Tangential: should tokens be u64? Otherwise we can't have more than
> > > 4gb unacknowledged. Or that's a reasonable constraint?
> > >
> >
> > Was thinking the same and with bits reserved for a dmabuf id to allow
> > multiple dmabufs in a single rx queue (future extension, but build the
> > capability in now). e.g., something like a 37b offset (128GB dmabuf
> > size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
>
> Agreed. Converting to 64b now sounds like a good forward looking revision.

The concept of IDing a dma-buf came up in a couple of different
contexts. First, in the context of us giving the dma-buf ID to the
user on recvmsg() to tell the user the data is in this specific
dma-buf. The second context is here, to bind dma-bufs with multiple
user-visible IDs to an rx queue.

My issue here is that I don't see anything in the struct dma_buf that
can practically serve as an ID:

https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302

Actually, from the userspace, only the name of the dma-buf seems
queryable. That's only unique if the user sets it as such. The dmabuf
FD can't serve as an ID. For our use case we need to support 1 process
doing the dma-buf bind via netlink, sharing the dma-buf FD to another
process, and that process receives the data.  In this case the FDs
shown by the 2 processes may be different. Converting to 64b is a
trivial change I can make now, but I'm not sure how to ID these
dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
allow adding a new ID + APIs to query said dma-buf ID.

--
Thanks,
Mina


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Mina Almasry
On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>
> On 11/7/23 3:10 PM, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
> >>
> >> On 11/5/23 7:44 PM, Mina Almasry wrote:
> >>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>> index eeeda849115c..1c351c138a5b 100644
> >>> --- a/include/linux/netdevice.h
> >>> +++ b/include/linux/netdevice.h
> >>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
> >>>  };
> >>>
> >>>  #ifdef CONFIG_DMA_SHARED_BUFFER
> >>> +struct page_pool_iov *
> >>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> >>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
> >>
> >> netdev_{alloc,free}_dmabuf?
> >>
> >
> > Can do.
> >
> >> I say that because a dmabuf can be host memory, at least I am not aware
> >> of a restriction that a dmabuf is device memory.
> >>
> >
> > In my limited experience dma-buf is generally device memory, and
> > that's really its use case. CONFIG_UDMABUF is a driver that mocks
> > dma-buf with a memfd which I think is used for testing. But I can do
> > the rename, it's more clear anyway, I think.
>
> config UDMABUF
> bool "userspace dmabuf misc driver"
> default n
> depends on DMA_SHARED_BUFFER
> depends on MEMFD_CREATE || COMPILE_TEST
> help
>   A driver to let userspace turn memfd regions into dma-bufs.
>   Qemu can use this to create host dmabufs for guest framebuffers.
>
>
> Qemu is just a userspace process; it is no way a special one.
>
> Treating host memory as a dmabuf should radically simplify the io_uring
> extension of this set.

I agree actually, and I was about to make that comment to David Wei's
series once I have the time.

David, your io_uring RX zerocopy proposal actually works with devmem
TCP, if you're inclined to do that instead, what you'd do roughly is
(I think):

- Allocate a memfd,
- Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
- Bind the dma-buf to the NIC using the netlink API in this RFC.
- Your io_uring extensions and io_uring uapi should work as-is almost
on top of this series, I think.

If you do this the incoming packets should land into your memfd, which
may or may not work for you. In the future if you feel inclined to use
device memory, this approach that I'm describing here would be more
extensible to device memory, because you'd already be using dma-bufs
for your user memory; you'd just replace one kind of dma-buf (UDMABUF)
with another.

> That the io_uring set needs to dive into
> page_pools is just wrong - complicating the design and code and pushing
> io_uring into a realm it does not need to be involved in.
>
> Most (all?) of this patch set can work with any memory; only device
> memory is unreadable.
>
>


-- 
Thanks,
Mina


Re: [PATCH v3 3/3] kselftest: Add new test for detecting unprobed Devicetree devices

2023-11-07 Thread Nícolas F . R . A . Prado
On Mon, Nov 06, 2023 at 11:09:44AM -0600, Rob Herring wrote:
> On Thu, Nov 2, 2023 at 12:36 PM Mark Brown  wrote:
> >
> > On Thu, Nov 02, 2023 at 07:15:58PM +0530, Naresh Kamboju wrote:
> > > On Thu, 2 Nov 2023 at 17:41, Aishwarya TCV  wrote:
> >
> > > > https://storage.kernelci.org/mainline/master/v6.6-9152-gdeefd5024f07/arm64/defconfig%2Bkselftest/gcc-10/logs/kselftest.log
> >
> > ...
> >
> > > May be due to, A loop of symlinks that are pointing to self / same files ?
> >
> > Right, it does look like something bad is going on with symlinks:
> >
> > > > '/tmp/kci/linux/tools/testing/selftests/../../../build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/build/source/tools/testing/selftests/powerpc/vphn/vphn.c'
> >
> > > Please build by using tuxmake and validate builds are working.
> >
> > Note that tuxmake does an in tree build of kselftest:
> >
> >   make --silent --keep-going --jobs=8 
> > O=/home/tuxbuild/.cache/tuxmake/builds/1/build 
> > INSTALL_PATH=/home/tuxbuild/.cache/tuxmake/builds/1/build/kselftest_install 
> > ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- 
> > CROSS_COMPILE_COMPAT=arm-linux-gnueabihf- 'CC=sccache 
> > aarch64-linux-gnu-gcc' 'HOSTCC=sccache gcc' kselftest-install
> >
> > and does it's own tarball build too, whereas kernelci does an out of
> > tree build and uses kselftest-gen_tar:
> >
> >   make KBUILD_BUILD_USER=KernelCI FORMAT=.xz ARCH=arm64 HOSTCC=gcc 
> > CROSS_COMPILE=aarch64-linux-gnu- CROSS_COMPILE_COMPAT=arm-linux-gnueabihf- 
> > CC="ccache aarch64-linux-gnu-gcc" O=/tmp/kci/linux/build -C/tmp/kci/linux 
> > -j10 kselftest-gen_tar
> >
> > and that the error is in the dt-extract-compatibles program which is
> > part of the kernel (well, imported into the kernel from dtc upstream):
> >
> >   File 
> > "/tmp/kci/linux/tools/testing/selftests/../../../scripts/dtc/dt-extract-compatibles",
> >  line 107, in 
> > compat_ignore_list.extend(parse_compatibles_to_ignore(f))
> >
> > This all suggests that something to do with how the build is set up is
> > resulting in the source symlink that gets created for out of tree builds
> > blowing up, I guess it's not specifically the DT stuff that's blowing it
> > up but rather that it's tripping over an existing bug.  Really does look
> > like a legitimate bug though, the source link is set up by the in tree
> > kernel build infrastructure.
> >
> > I did poke a bit at reproducing outside of the KernelCI scripts but
> > didn't manage to yet.
> 
> I can repro with "make dt_compatible_check". The problem is with an
> 'out of tree' build within the tree. That's my normal setup, but the
> difference is I have ".build" directories. If I use "build" instead,
> then I can repro. The issue is the iglob will recurse into "build" but
> not hidden directories (by default). There's no option to not follow
> symlinks which would solve this (there is an open python issue since
> 2017 to add it). I don't see a simple solution in python other than
> getting a full list with glob(), convert to absolute paths, and remove
> duplicates. I imagine that will be somewhat slow.

Hi, sorry for the delay, I was on vacation last week.

I was able to reproduce the issue the way you described. And I also suspected
an alternative approach would be slower, but after trying it out it ran just as
fast as the current one, even on cold cache, so I sent it out:

https://lore.kernel.org/all/20231107225624.9811-1-nfrapr...@collabora.com

Let me know your thoughts there.

Thanks,
Nícolas


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread David Ahern
On 11/7/23 3:10 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.

config UDMABUF
bool "userspace dmabuf misc driver"
default n
depends on DMA_SHARED_BUFFER
depends on MEMFD_CREATE || COMPILE_TEST
help
  A driver to let userspace turn memfd regions into dma-bufs.
  Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set. That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.

Most (all?) of this patch set can work with any memory; only device
memory is unreadable.




Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Eric Dumazet wrote:
> On Tue, Nov 7, 2023 at 10:05 PM Stanislav Fomichev  wrote:
> 
> >
> > I don't understand. We require an elaborate setup to receive devmem cmsgs,
> > why would some random application receive those?
> 
> 
> A TCP socket can receive 'valid TCP packets' from many different sources,
> especially with BPF hooks...
> 
> Think of a bonding setup, packets being mirrored by some switches or
> even from tc.
> 
> Better double check than be sorry.
> 
> We have not added a 5th component in the 4-tuple lookups, being "is
> this socket a devmem one".
> 
> A mix of regular/devmem skb is supported.

Can we mark a socket as devmem-only? Do we have any use-case for those
hybrid setups? Or, let me put it that way: do we expect API callers
to handle both linear and non-linear cases correctly?
As a consumer of the previous versions of these apis internally,
I find all those corner cases confusing :-( Hence trying to understand
whether we can make it a bit more rigid and properly defined upstream.

But going back to that MSG_SOCK_DEVMEM flag. If the application is
supposed to handle both linear and devmem chucks, why do we need
this extra MSG_SOCK_DEVMEM opt-in to signal that it's able to process
it? From Mina's reply, it seemed like MSG_SOCK_DEVMEM is there to
protect random applications that get misrouted devmem skb. I don't
see how returning EFAULT helps in that case.


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index eeeda849115c..1c351c138a5b 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
> >  };
> >
> >  #ifdef CONFIG_DMA_SHARED_BUFFER
> > +struct page_pool_iov *
> > +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> > +void netdev_free_devmem(struct page_pool_iov *ppiov);
>
> netdev_{alloc,free}_dmabuf?
>

Can do.

> I say that because a dmabuf can be host memory, at least I am not aware
> of a restriction that a dmabuf is device memory.
>

In my limited experience dma-buf is generally device memory, and
that's really its use case. CONFIG_UDMABUF is a driver that mocks
dma-buf with a memfd which I think is used for testing. But I can do
the rename, it's more clear anyway, I think.

On Mon, Nov 6, 2023 at 11:45 PM Yunsheng Lin  wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > +
> > +void netdev_free_devmem(struct page_pool_iov *ppiov)
> > +{
> > + struct netdev_dmabuf_binding *binding = page_pool_iov_binding(ppiov);
> > +
> > + refcount_set(&ppiov->refcount, 1);
> > +
> > + if (gen_pool_has_addr(binding->chunk_pool,
> > +   page_pool_iov_dma_addr(ppiov), PAGE_SIZE))
>
> When gen_pool_has_addr() returns false, does it mean something has gone
> really wrong here?
>

Yes, good eye. gen_pool_has_addr() should never return false, but then
again, gen_pool_free()  BUG_ON()s if it doesn't find the address,
which is an extremely severe reaction to what can be a minor bug in
the accounting. I prefer to leak rather than crash the machine. It's a
bit of defensive programming that is normally frowned upon, but I feel
like in this case it's maybe warranted due to the very severe reaction
(BUG_ON).

-- 
Thanks,
Mina


Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 11:46 PM Yunsheng Lin  wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > +
> > +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> > +{
> > + size_t size, avail;
> > +
> > + gen_pool_for_each_chunk(binding->chunk_pool,
> > + netdev_devmem_free_chunk_owner, NULL);
> > +
> > + size = gen_pool_size(binding->chunk_pool);
> > + avail = gen_pool_avail(binding->chunk_pool);
> > +
> > + if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> > +   size, avail))
> > + gen_pool_destroy(binding->chunk_pool);
>
>
> Is there any other place calling the gen_pool_destroy() when the above
> warning is triggered? Do we have a leaking for binding->chunk_pool?
>

gen_pool_destroy BUG_ON() if it's not empty at the time of destroying.
Technically that should never happen, because
__netdev_devmem_binding_free() should only be called when the refcount
hits 0, so all the chunks have been freed back to the gen_pool. But,
just in case, I don't want to crash the server just because I'm
leaking a chunk... this is a bit of defensive programming that is
typically frowned upon, but the behavior of gen_pool is so severe I
think the WARN() + check is warranted here.

> > +
> > + dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> > +  DMA_BIDIRECTIONAL);
> > + dma_buf_detach(binding->dmabuf, binding->attachment);
> > + dma_buf_put(binding->dmabuf);
> > + kfree(binding);
> > +}
> > +
>
>


-- 
Thanks,
Mina


Re: [RFC PATCH v3 07/12] page-pool: device memory support

2023-11-07 Thread Mina Almasry
On Tue, Nov 7, 2023 at 12:00 AM Yunsheng Lin  wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page* into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >  include/net/page_pool/helpers.h | 74 -
> >  net/core/page_pool.c| 63 
> >  2 files changed, 118 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/net/page_pool/helpers.h 
> > b/include/net/page_pool/helpers.h
> > index b93243c2a640..08f1a2cc70d2 100644
> > --- a/include/net/page_pool/helpers.h
> > +++ b/include/net/page_pool/helpers.h
> > @@ -151,6 +151,64 @@ static inline struct page_pool_iov 
> > *page_to_page_pool_iov(struct page *page)
> >   return NULL;
> >  }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > + if (page_is_page_pool_iov(page))
> > + return page_pool_iov_refcount(page_to_page_pool_iov(page));
>
> We have added a lot of 'if' for the devmem case, it would be better to
> make it more generic so that we can have more unified metadata handling
> for normal page and devmem. If we add another memory type here, do we
> need another 'if' here?

Maybe, not sure. I'm guessing new memory types will either be pages or
iovs, so maybe no new if statements needed.

> That is part of the reason I suggested using a more unified metadata for
> all the types of memory chunks used by page_pool.

I think your suggestion was to use struct pages for devmem. That was
thoroughly considered and intensely argued about in the initial
conversations regarding devmem and the initial RFC, and from the
conclusions there it's extremely clear to me that devmem struct pages
are categorically a no-go.

--
Thanks,
Mina


Re: [PATCH v6] selftests: rtc: Fixes rtctest error handling.

2023-11-07 Thread Shuah Khan

On 10/7/23 09:43, Atul Kumar Pant wrote:

On Sat, Sep 23, 2023 at 11:06:58PM +0530, Atul Kumar Pant wrote:

On Thu, Aug 17, 2023 at 02:44:01PM +0530, Atul Kumar Pant wrote:

Adds a check to verify if the rtc device file is valid or not
and prints a useful error message if the file is not accessible.

Signed-off-by: Atul Kumar Pant 
---



Sorry for the delay. I will pick this up for the next rc.

thanks,
-- Shuah


Re: [PATCH] selftests: capabilities: namespace create varies for root and normal user

2023-11-07 Thread Shuah Khan

On 9/29/23 06:53, Swarup Laxman Kotiaklapudi wrote:

Change namespace creation for root and non-root
user differently in create_and_enter_ns() function



Sorry for the delay on reviewing this.

Can you tell me more about why this change is needed and
include it in the change log.

thanks,
-- Shuah


Re: [RFC PATCH v3 08/12] net: support non paged skb frags

2023-11-07 Thread Mina Almasry
On Tue, Nov 7, 2023 at 1:00 AM Yunsheng Lin  wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > Make skb_frag_page() fail in the case where the frag is not backed
> > by a page, and fix its relevent callers to handle this case.
> >
> > Correctly handle skb_frag refcounting in the page_pool_iovs case.
> >
> > Signed-off-by: Mina Almasry 
> >
>
> ...
>
> >  /**
> >   * skb_frag_page - retrieve the page referred to by a paged fragment
> >   * @frag: the paged fragment
> >   *
> > - * Returns the &struct page associated with @frag.
> > + * Returns the &struct page associated with @frag. Returns NULL if this 
> > frag
> > + * has no associated page.
> >   */
> >  static inline struct page *skb_frag_page(const skb_frag_t *frag)
> >  {
> > - return frag->bv_page;
> > + if (!page_is_page_pool_iov(frag->bv_page))
> > + return frag->bv_page;
> > +
> > + return NULL;
>
> It seems most of callers don't expect NULL returning for skb_frag_page(),
> and this patch only changes a few relevant callers to handle the NULL case.
>

Yes, I did not change code that I guessed was not likely to be
affected or enable the devmem TCP case. Here is my breakdown:

➜  cos-kernel git:(tcpdevmem) ✗ ack -i "skb_frag_page\("
--ignore-dir=drivers -t cc -l
net/core/dev.c
net/core/datagram.c
net/core/xdp.c
net/core/skbuff.c
net/core/filter.c
net/core/gro.c
net/appletalk/ddp.c
net/wireless/util.c
net/tls/tls_device.c
net/tls/tls_device_fallback.c
net/ipv4/tcp.c
net/ipv4/tcp_output.c
net/bpf/test_run.c
include/linux/skbuff.h

I'm ignoring ank skb_frag_page() calls in drivers because drivers need
to add support for devmem TCP, and handle these calls at time of
adding support, I think that's reasonable.

net/core/dev.c:
I think I missed ilegal_highdma()

net/core/datagram.c:
__skb_datagram_iter() protected by not_readable(skb) check.

net/core/skbuff.c:
protected by not_readable(skb) check.

net/core/filter.c:
bpf_xdp_frags_shrink_tail seems like xdp specific, not sure it's relevant here.

net/core/gro.c:
skb_gro_reset_offset: protected by NULL check

net/ipv4/tcp.c:
tcp_zerocopy_receive protected by NULL check.

net/ipv4/tcp_output.c:
tcp_clone_payload: handles NULL return fine.

net/bpf/test_run.c:
seems xdp specific and not sure if it can run into devmem issues.

include/linux/skbuff.h:
I think the multiple calls here are being handled correctly, but let
me know if not.

All the calls in these files, I think, are code paths not possible to
hit devmem TCP with the current support, I think:
net/core/xdp.c
net/appletalk/ddp.c
net/wireless/util.c
net/tls/tls_device.c
net/tls/tls_device_fallback.c

All in all I think maybe all in all I missed illegal_highdma(). I'll
fix it in the next iteration.

> It may make more sense to add a new helper to do the above checking, and
> add a warning in skb_frag_page() to catch any missing NULL checking for
> skb_frag_page() caller, something like below?
>
>  static inline struct page *skb_frag_page(const skb_frag_t *frag)
>  {
> -   return frag->bv_page;
> +   struct page *page = frag->bv_page;
> +
> +   BUG_ON(page_is_page_pool_iov(page));
> +
> +   return page;
> +}
> +
> +static inline struct page *skb_frag_readable_page(const skb_frag_t *frag)
> +{
> +   struct page *page = frag->bv_page;
> +
> +   if (!page_is_page_pool_iov(page))
> +   return page;
> +
> +   return NULL;
>  }
>
>

My personal immediate reaction is that this may just introduce code
churn without significant benefit. If an unsuspecting caller call
skb_frag_page() on devmem frag and doesn't correctly handle NULL
return, it will crash or error out anyway, and likely in some obvious
way, so maybe the BUG_ON() isn't so useful that it's worth changing
all the call sites. But if there is consensus on adding a change like
you propose, I have no problem adding it.

-- 
Thanks,
Mina


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Eric Dumazet
On Tue, Nov 7, 2023 at 10:05 PM Stanislav Fomichev  wrote:

>
> I don't understand. We require an elaborate setup to receive devmem cmsgs,
> why would some random application receive those?


A TCP socket can receive 'valid TCP packets' from many different sources,
especially with BPF hooks...

Think of a bonding setup, packets being mirrored by some switches or
even from tc.

Better double check than be sorry.

We have not added a 5th component in the 4-tuple lookups, being "is
this socket a devmem one".

A mix of regular/devmem skb is supported.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 5:06 PM Stanislav Fomichev  wrote:
> [..]
> > > > > And the socket has to know this association; otherwise those tokens
> > > > > are useless since they don't carry anything to identify the dmabuf.
> > > > >
> > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > it somehow implies that I have an option of passing or not passing it
> > > > > for an individual system call.
> > >
> > > You do have the option of passing it or not passing it per system
> > > call. The MSG_SOCK_DEVMEM says the application is willing to receive
> > > devmem cmsgs - that's all. The application doesn't get to decide
> > > whether it's actually going to receive a devmem cmsg or not, because
> > > that's dictated by the type of skb that is present in the receive
> > > queue, and not up to the application. I should explain this in the
> > > commit message...
> >
> > What would be the case of passing it or not passing it? Some fallback to
> > the host memory after flow steering update? Yeah, would be useful to
> > document those constrains. I'd lean on starting stricter and relaxing
> > those conditions if we find the use-cases.
> >
> 
> MSG_SOCK_DEVMEM (or its replacement SOCK_DEVMEM or SO_SOCK_DEVMEM),
> just says that the application is able to receive devmem cmsgs and
> will parse them. The use case for not setting that flag is existing
> applications that are not aware of devmem cmsgs. I don't want those
> applications to think they're receiving data in the linear buffer only
> to find out that the data is in devmem and they ignored the devmem
> cmsg.
> 
> So, what happens:
> 
> - MSG_SOCK_DEVMEM provided and next skb in the queue is devmem:
> application receives cmsgs.
> - MSG_SOCK_DEVMEM provided and next skb in the queue is non-devmem:
> application receives in the linear buffer.
> - MSG_SOCK_DEVMEM not provided and net skb is devmem: application
> receives EFAULT.
> - MSG_SOCK_DEVMEM not provided and next skb is non-devmem: application
> receives in the linear buffer.
> 
> My bad on not including some docs about this. The next version should
> have the commit message beefed up to explain all this, or a docs
> patch.

I don't understand. We require an elaborate setup to receive devmem cmsgs,
why would some random application receive those?


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Mina Almasry
On Mon, Nov 6, 2023 at 5:06 PM Stanislav Fomichev  wrote:
[..]
> > > > And the socket has to know this association; otherwise those tokens
> > > > are useless since they don't carry anything to identify the dmabuf.
> > > >
> > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > it somehow implies that I have an option of passing or not passing it
> > > > for an individual system call.
> >
> > You do have the option of passing it or not passing it per system
> > call. The MSG_SOCK_DEVMEM says the application is willing to receive
> > devmem cmsgs - that's all. The application doesn't get to decide
> > whether it's actually going to receive a devmem cmsg or not, because
> > that's dictated by the type of skb that is present in the receive
> > queue, and not up to the application. I should explain this in the
> > commit message...
>
> What would be the case of passing it or not passing it? Some fallback to
> the host memory after flow steering update? Yeah, would be useful to
> document those constrains. I'd lean on starting stricter and relaxing
> those conditions if we find the use-cases.
>

MSG_SOCK_DEVMEM (or its replacement SOCK_DEVMEM or SO_SOCK_DEVMEM),
just says that the application is able to receive devmem cmsgs and
will parse them. The use case for not setting that flag is existing
applications that are not aware of devmem cmsgs. I don't want those
applications to think they're receiving data in the linear buffer only
to find out that the data is in devmem and they ignored the devmem
cmsg.

So, what happens:

- MSG_SOCK_DEVMEM provided and next skb in the queue is devmem:
application receives cmsgs.
- MSG_SOCK_DEVMEM provided and next skb in the queue is non-devmem:
application receives in the linear buffer.
- MSG_SOCK_DEVMEM not provided and net skb is devmem: application
receives EFAULT.
- MSG_SOCK_DEVMEM not provided and next skb is non-devmem: application
receives in the linear buffer.

My bad on not including some docs about this. The next version should
have the commit message beefed up to explain all this, or a docs
patch.


-- 
Thanks,
Mina


Re: [PATCH] Kunit to check the longest symbol length

2023-11-07 Thread kernel test robot
Hi Sergio,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.6 next-20231107]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Sergio-Gonz-lez-Collado/Kunit-to-check-the-longest-symbol-length/20231106-024653
base:   linus/master
patch link:
https://lore.kernel.org/r/20231105184010.49194-1-sergio.collado%40gmail.com
patch subject: [PATCH] Kunit to check the longest symbol length
config: sparc-allyesconfig 
(https://download.01.org/0day-ci/archive/20231108/202311080319.fcep5dtc-...@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231108/202311080319.fcep5dtc-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202311080319.fcep5dtc-...@intel.com/

All warnings (new ones prefixed by >>):

>> lib/longest_symbol_kunit.c:16:18: warning: no previous prototype for 
>> 'sg1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7g1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7ng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7g1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7n'
>>  [-Wmissing-prototypes]
  16 | #define DI(name) s##name##name
 |  ^
   lib/longest_symbol_kunit.c:17:19: note: in expansion of macro 'DI'
  17 | #define DDI(name) DI(n##name##name)
 |   ^~
   lib/longest_symbol_kunit.c:18:20: note: in expansion of macro 'DDI'
  18 | #define DDDI(name) DDI(n##name##name)
 |^~~
   lib/longest_symbol_kunit.c:19:21: note: in expansion of macro 'DDDI'
  19 | #define I(name) DDDI(n##name##name)
 | ^~~~
   lib/longest_symbol_kunit.c:20:22: note: in expansion of macro 'I'
  20 | #define DI(name) I(n##name##name)
 |  ^
   lib/longest_symbol_kunit.c:25:27: note: in expansion of macro 'DI'
  25 | #define LONGEST_SYM_NAME  DI(g1h2i3j4k5l6m7n)
 |   ^~
   lib/longest_symbol_kunit.c:33:14: note: in expansion of macro 
'LONGEST_SYM_NAME'
  33 | noinline int LONGEST_SYM_NAME(void)
 |  ^~~~
   In file included from :
   lib/longest_symbol_kunit.c:16:18: warning: no previous prototype for 
'sg1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7g1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7ng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7g1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nnng1h2i3j4k5l6m7ng1h2i3j4k5l6m7nng1h2i3j4k5l6m7ng1h2i3j4k5l6m7ne'
 [-Wmissing-prototypes]
  16 | #define DI(name) s##name##name
 |  ^
   include/linux/compiler_types.h:83:23: note: in definition of macro '___PASTE'
  83 | #define ___PASTE(a,b) a##b
 |   ^
   lib/longest_symbol_kunit.c:22:21: note: in expansion of macro '__PASTE'
  22 | #define PLUS1(name) __PASTE(name, e)
 | ^~~
   lib/longest_symbol_kunit.c:28:32: note: in expansion of macro 'PLUS1'
  28 | #define LONGEST_SYM_NAME_PLUS1 PLUS1(LONGEST_SYM_NAME)
 |^
   lib/longest_symbol_kunit.c:17:19: note: in expansion of macro 'DI'
  17 | #define DDI(name) DI(n##name##name)
 |   ^~
   lib/longest_symbol_kunit.c:18:20: note: in expansion of macro 'DDI'
  18 | #define DDDI(name) DDI(n##name##name)
 |^~~
   lib/longest_symbol_kunit.c:19:21: note: in expansion of macro 'DDDI'
  19 | #define I(name) DDDI(n##name##name)
 | ^~~~
   lib/longest_symbol_kunit.c:20:22: note: in expansion of macro 'I'
  20 | #define DI(name) I(n##name##name)
 |  ^
   lib/longest_symbol_kunit.c:25:27:

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/07, Willem de Bruijn wrote:
> On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev  wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is 
> > > > > > > that
> > > > > > > it somehow implies that I have an option of passing or not 
> > > > > > > passing it
> > > > > > > for an individual system call.
> > > > > > > If we know that we're going to use dmabuf with the socket, maybe 
> > > > > > > we
> > > > > > > should move this flag to the socket() syscall?
> > > > > > >
> > > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > I think it should then be a setsockopt called before any data is
> > > > > > exchanged, with no change of modifying mode later. We generally use
> > > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > > passively opened connections, or be overly restrictive: one approach
> > > > > > for all accepted child sockets.
> > > > >
> > > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > > are plenty of bits we can grab. But setsockopt works as well!
> > > >
> > > > To follow up: if we have this flag on a socket, not on a per-message
> > > > basis, can we also use recvmsg for the recycling part maybe?
> > > >
> > > > while (true) {
> > > > memset(msg, 0, ...);
> > > >
> > > > /* receive the tokens */
> > > > ret = recvmsg(fd, &msg, 0);
> > > >
> > > > /* recycle the tokens from the above recvmsg() */
> > > > ret = recvmsg(fd, &msg, MSG_RECYCLE);
> > > > }
> > > >
> > > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > > to recycle a range.
> > > >
> > > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > > Or is it more confusing?
> > >
> > > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> > >
> > >
> > > I am not aware of any precedent in multiplexing the data stream and a
> > > control operation stream in this manner. It would also require adding
> > > a branch in the sendmsg hot path.
> >
> > Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> > stack where we need it? Mixing in sendmsg is indeed ugly :-(
> 
> I tried exactly the inverse of that when originally adding
> MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
> on sendmsg calls by writing to sendmsg msg_control on return to user.
> It required significant code churn, which the performance gains did
> not warrant. Doing so also breaks the simple rule that recv is for
> reading and send is for writing.

We're breaking so many rules here, so not sure we should be super
constrained :-D

> > Regarding hot patch: aren't we already doing copy_to_user for the tokens in
> > this hot path, so having one extra condition shouldn't hurt too much?
> 
> We're doing that in the optional cmsg handling of recvmsg, which is
> already a slow path (compared to the data read() itself).
> 
> > > The memory is associated with the socket, freed when the socket is
> > > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > > state operation, for which setsockopt is the socket interface.
> > >
> > > Is your request purely a dislike, or is there some technical concern
> > > with BPF and setsockopt?
> >
> > It's mostly because I've been bitten too much by custom socket options that
> > are not really on/off/update-value operations:
> >
> > 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> > 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> > 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> > d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
> > PAGE_SIZE
> >
> > I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> > things tend to evolve and change.
> 
> I see. I'm a bit concerned if we start limiting what we can do in
> sockets because of dependencies that BPF processing places on them.
> The use case for BPF [gs]etsockopt is limited to specific control mode
> calls. Would it make sense to just exclude calls like
> SO_DEVMEM_DONTNEED from this interpositioning?

Yup, that's why I'm asking. We already have ->bpf_bypass_getsockopt()
to special-case tcp zerocopy. We might add another bpf_bypass_setsockopt
to special case SO_DEVMEM_DONTNEED. That's why I'm trying to see if
there is a better alternative.

> At a high level what we really want is a high rate metadata path from
> user to kernel. And there are no perfect solutions. From kernel to
> user we use the socket error queue for this. That was never intended
> for high event rate itself, dealing with ICMP errors and the like
> before timestamps and zerocopy notifications wer

Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Willem de Bruijn
On Tue, Nov 7, 2023 at 12:44 PM Stanislav Fomichev  wrote:
>
> On 11/06, Willem de Bruijn wrote:
> > > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > > it somehow implies that I have an option of passing or not passing 
> > > > > > it
> > > > > > for an individual system call.
> > > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > > should move this flag to the socket() syscall?
> > > > > >
> > > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > > >
> > > > > > ?
> > > > >
> > > > > I think it should then be a setsockopt called before any data is
> > > > > exchanged, with no change of modifying mode later. We generally use
> > > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > > passively opened connections, or be overly restrictive: one approach
> > > > > for all accepted child sockets.
> > > >
> > > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > > are plenty of bits we can grab. But setsockopt works as well!
> > >
> > > To follow up: if we have this flag on a socket, not on a per-message
> > > basis, can we also use recvmsg for the recycling part maybe?
> > >
> > > while (true) {
> > > memset(msg, 0, ...);
> > >
> > > /* receive the tokens */
> > > ret = recvmsg(fd, &msg, 0);
> > >
> > > /* recycle the tokens from the above recvmsg() */
> > > ret = recvmsg(fd, &msg, MSG_RECYCLE);
> > > }
> > >
> > > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > > to recycle a range.
> > >
> > > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > > Or is it more confusing?
> >
> > It would have to be sendmsg, as recvmsg is a copy_to_user operation.
> >
> >
> > I am not aware of any precedent in multiplexing the data stream and a
> > control operation stream in this manner. It would also require adding
> > a branch in the sendmsg hot path.
>
> Is it too much plumbing to copy_from_user msg_control deep in recvmsg
> stack where we need it? Mixing in sendmsg is indeed ugly :-(

I tried exactly the inverse of that when originally adding
MSG_ZEROCOPY: to allow piggy-backing zerocopy completion notifications
on sendmsg calls by writing to sendmsg msg_control on return to user.
It required significant code churn, which the performance gains did
not warrant. Doing so also breaks the simple rule that recv is for
reading and send is for writing.

> Regarding hot patch: aren't we already doing copy_to_user for the tokens in
> this hot path, so having one extra condition shouldn't hurt too much?

We're doing that in the optional cmsg handling of recvmsg, which is
already a slow path (compared to the data read() itself).

> > The memory is associated with the socket, freed when the socket is
> > closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> > state operation, for which setsockopt is the socket interface.
> >
> > Is your request purely a dislike, or is there some technical concern
> > with BPF and setsockopt?
>
> It's mostly because I've been bitten too much by custom socket options that
> are not really on/off/update-value operations:
>
> 29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
> 00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
> 9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
> d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
> PAGE_SIZE
>
> I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
> things tend to evolve and change.

I see. I'm a bit concerned if we start limiting what we can do in
sockets because of dependencies that BPF processing places on them.
The use case for BPF [gs]etsockopt is limited to specific control mode
calls. Would it make sense to just exclude calls like
SO_DEVMEM_DONTNEED from this interpositioning?

At a high level what we really want is a high rate metadata path from
user to kernel. And there are no perfect solutions. From kernel to
user we use the socket error queue for this. That was never intended
for high event rate itself, dealing with ICMP errors and the like
before timestamps and zerocopy notifications were added.

If I squint hard enough I can see some prior art in mixing data and
high rate state changes within the same channel in NIC descriptor
queues, where some devices do this, e.g.,  { "insert encryption key",
"send packet" }. But fundamentally I think we should keep the socket
queues for data only.


Re: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

2023-11-07 Thread Jason Gunthorpe
On Tue, Nov 07, 2023 at 08:35:10AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, November 2, 2023 8:48 PM
> >
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > Hi folks,
> > >
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework for nested translation.
> > Nested
> > > translation is a hardware feature that supports two-stage translation
> > > tables for IOMMU. The second-stage translation table is managed by the
> > > host VMM, while the first-stage translation table is owned by user
> > > space. This allows user space to control the IOMMU mappings for its
> > > devices.
> > 
> > Having now looked more closely at the ARM requirements it seems we
> > will need generic events, not just page fault events to have a
> > complete emulation.
> 
> Can you elaborate?

There are many events related to object in guest memory or controlled
by the guest, eg C_BAD_CD and C_BAD_STE. These should be relayed or
the emulation is not working well.

> > > User space indicates its capability of handling IO page faults by
> > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > > for page fault delivery. On a successful return of HWPT allocation, the
> > > user can retrieve and respond to page faults by reading and writing to
> > > the file descriptor (FD) returned in out_fault_fd.
> > 
> > This is the right way to approach it, and more broadly this shouldn't
> > be an iommufd specific thing. Kernel drivers will also need to create
> > fault capable PAGING iommu domains.
> 
> Are you suggesting a common interface used by both iommufd and
> kernel drivers?

Yes
 
> but I didn't get the last piece. If those domains are created by kernel
> drivers why would they require a uAPI for userspace to specify fault
> capable?

Not to userspace, but a kapi to request a fault capable domain and to
supply the fault handler. Eg:

 iommu_domain_alloc_faultable(dev, handler);

Jason


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-07 Thread Stanislav Fomichev
On 11/06, Willem de Bruijn wrote:
> > > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > > it somehow implies that I have an option of passing or not passing it
> > > > > for an individual system call.
> > > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > > should move this flag to the socket() syscall?
> > > > >
> > > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > > >
> > > > > ?
> > > >
> > > > I think it should then be a setsockopt called before any data is
> > > > exchanged, with no change of modifying mode later. We generally use
> > > > setsockopts for the mode of a socket. This use of the protocol field
> > > > in socket() for setting a mode would be novel. Also, it might miss
> > > > passively opened connections, or be overly restrictive: one approach
> > > > for all accepted child sockets.
> > >
> > > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > > are plenty of bits we can grab. But setsockopt works as well!
> >
> > To follow up: if we have this flag on a socket, not on a per-message
> > basis, can we also use recvmsg for the recycling part maybe?
> >
> > while (true) {
> > memset(msg, 0, ...);
> >
> > /* receive the tokens */
> > ret = recvmsg(fd, &msg, 0);
> >
> > /* recycle the tokens from the above recvmsg() */
> > ret = recvmsg(fd, &msg, MSG_RECYCLE);
> > }
> >
> > recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> > exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> > to recycle a range.
> >
> > Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> > Or is it more confusing?
> 
> It would have to be sendmsg, as recvmsg is a copy_to_user operation.
>
>
> I am not aware of any precedent in multiplexing the data stream and a
> control operation stream in this manner. It would also require adding
> a branch in the sendmsg hot path.

Is it too much plumbing to copy_from_user msg_control deep in recvmsg
stack where we need it? Mixing in sendmsg is indeed ugly :-(

Regarding hot patch: aren't we already doing copy_to_user for the tokens in
this hot path, so having one extra condition shouldn't hurt too much?

> The memory is associated with the socket, freed when the socket is
> closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
> state operation, for which setsockopt is the socket interface.
> 
> Is your request purely a dislike, or is there some technical concern
> with BPF and setsockopt?

It's mostly because I've been bitten too much by custom socket options that
are not really on/off/update-value operations:

29ebbba7d461 - bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
00e74ae08638 - bpf: Don't EFAULT for getsockopt with optval=NULL
9cacf81f8161 - bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
d8fe449a9c51 - bpf: Don't return EINVAL from {get,set}sockopt when optlen > 
PAGE_SIZE

I do agree that this particular case of SO_DEVMEM_DONTNEED seems ok, but
things tend to evolve and change.


Re: selftests: arm64: za-fork - ZA state invalid in child

2023-11-07 Thread Naresh Kamboju
On Tue, 7 Nov 2023 at 14:39, Naresh Kamboju  wrote:
>
> Following selftests: arm64: za-fork test failures noticed on qemu-arm64.
> But the same test passed on FVP and Juno-r2.
>
> Are we missing something on qemu-arm64 ?
> qemu-system-arm installed at version: 8.1

The Qemu version got updated from v8.0 to v8.1 and this test failed.

- Naresh


Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Naresh Kamboju
Hi Mark,

On Tue, 7 Nov 2023 at 21:37, Mark Brown  wrote:
>
> On Tue, Nov 07, 2023 at 08:14:59PM +0530, Naresh Kamboju wrote:
> > On Tue, 7 Nov 2023 at 19:51, Mark Brown  wrote:
>
> > > This all seems very surprising, especially given that AFAICT there are
> > > no changes in stable-6.6-rc for arch/arm64.
>
> > We do not see on the mainline and next.
> > Is this reported problems on stable-rc 6.6 and 6.5 are due to running
> > latest kselftest on older kernels ?
>
> There's also no backports I can see in the selftests (at all, never mind
> just arm64).  There were a small number of selftest changes for arm64
> went in during the merge window but nothing that looks super relevant.

The Qemu version got updated from v8.0 to v8.1 and started getting these
test failures.

- Naresh


Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Mark Brown
On Tue, Nov 07, 2023 at 08:14:59PM +0530, Naresh Kamboju wrote:
> On Tue, 7 Nov 2023 at 19:51, Mark Brown  wrote:

> > This all seems very surprising, especially given that AFAICT there are
> > no changes in stable-6.6-rc for arch/arm64.

> We do not see on the mainline and next.
> Is this reported problems on stable-rc 6.6 and 6.5 are due to running
> latest kselftest on older kernels ?

There's also no backports I can see in the selftests (at all, never mind
just arm64).  There were a small number of selftest changes for arm64
went in during the merge window but nothing that looks super relevant.


signature.asc
Description: PGP signature


Re: [RFC PATCH v3 00/12] Device Memory TCP

2023-11-07 Thread David Ahern
Is there a policy about cc'ing moderated lists on patch sets? I thought
there was, but not finding anything under Documentation/. Getting a
'needs moderator approval response' on every message is rather annoying.


Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Naresh Kamboju
On Tue, 7 Nov 2023 at 19:51, Mark Brown  wrote:
>
> On Tue, Nov 07, 2023 at 06:43:25PM +0530, Naresh Kamboju wrote:
>
> > # # SVE-VL-64-0: Expected
> > [390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
> > <>
>
> You've elided *lots* of error reports from the actual test which suggest
> that there is substantial memory corruption, it looks like tearing part
> way through loading or saving the values - the start of the vectors
> looks fine but at some point they get what looks like a related process'
> data, eg:
>
> # # SVE-VL-64-0:Expected 
> [390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
> # # SVE-VL-64-0:Got  
> [390439044000390480003904c000390480003904c00039040001390440013904000139044001390480013904c001390480013904c0013904000239044002]
>
> This only appears to affect SVE and SME, I didn't spot any FPSIMD
> corruption but then that is the smallest case (and I didn't notice any
> VL 16 cases either).  It looks like the corruption is on the first thing
> we check each time (either register 0 or the highest ZA.H vector for
> ZA), all the values do look lke they were plausibly generated by
> fp-stress test programs.
>
> Then we get what looks like memory corruption:
>
> > # # SVE-VL-256-<1>[   88.160313] Unable to handle kernel paging
> > request at virtual address 00550f0344550f02
>
> > <4>[   88.195706] Call trace:
> > <4>[ 88.196098] percpu_ref_get_many
> > (include/linux/percpu-refcount.h:174 (discriminator 2)
> > include/linux/percpu-refcount.h:204 (discriminator 2))
> > <4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
> > <4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
> > <4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630
> > include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208
> > mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
> > <4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
> > <4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
> > <4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13
> > (discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1)
> > kernel/rcu/tree.c:2403 (discriminator 1))
>
> This all seems very surprising, especially given that AFAICT there are
> no changes in stable-6.6-rc for arch/arm64.

We do not see on the mainline and next.
Is this reported problems on stable-rc 6.6 and 6.5 are due to running
latest kselftest on older kernels ?

--
# # SSVE-VL-32-1: Mismatch: PID=641, iteration=0, reg=0
# # SSVE-VL-128-1: Got []
# # SSVE-VL-256-1: Got   []

Unable to handle kernel paging request at virtual address 00740f0322740f02
0<1>[   89.400173] Mem abort info:
<1>[   89.400844]   ESR = 0x9604
<1>[   89.401974]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[   89.403399]   SET = 0, FnV = 0
<1>[   89.404421]   EA = 0, S1PTW = 0
<1>[   89.405317]   FSC = 0x04: level 0 translation fault
<1>[   89.406545] Data abort info:
<1>[   89.407493]   ISV = 0, ISS = 0x0004, ISS2 = 0x
<1>[   89.408785]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[   89.410001]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[   89.411485] [00740f0322740f02] address between user and kernel
address ranges
<0>[   89.413851] Internal error: Oops: 9604 [#1] PREEMPT SMP
<4>[   89.415573] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce
sha512_ce sha512_arm64 fuse drm dm_mod ip_tables x_tables
<4>[   89.419561] CPU: 1 PID: 22 Comm: ksoftirqd/1 Not tainted 6.5.11-rc1 #1
<4>[   89.420795] Hardware name: linux,dummy-virt (DT)
<4>[   89.422676] pstate: 624000c9 (nZCv daIF +PAN -UAO +TCO -DIT
-SSBS BTYPE=--)
<4>[   89.424344] pc : refill_obj_stock+0x6c/0x250
<4>[   89.426324] lr : refill_obj_stock+0x6c/0x250

<4>[   89.447170] Call trace:
<4>[   89.447756]  refill_obj_stock+0x6c/0x250
<4>[   89.449033]  obj_cgroup_uncharge+0x20/0x38
<4>[   89.450457]  kmem_cache_free+0xf8/0x500
<4>[   89.451066]  delayed_put_pid+0x50/0xb0
<4>[   89.452401]  rcu_core+0x3cc/0x950
<4>[   89.453369]  rcu_core_si+0x1c/0x30
<4>[   89.454465]  __do_softirq+0x118/0x438
<4>[   89.455738]  run_ksoftirqd+0x40/0xf8
<4>[   89.456893]  smpboot_thread_fn+0x1d0/0x248
<4>[   89.457969]  kthread+0xfc/0x1a0
<4>[   89.459171]  ret_from_fork+0x10/0x20
<0>[   89.460445] Code: aa1603e0 97fffef8 aa0003f4 97f6cbf6 (f9400269)
<4>[   89.462028] ---[ end trace  ]---
<0>[   89.463494] Kernel panic - not syncing: Oops: Fatal exception in interrupt
<2>[   89.465046] SMP: stopping secondary CPUs
<0>[   89.466327] Kernel Offset: 0x2dabffa0 from 0x80008000
<0>[   89.467385] PHYS_OFFSET: 0x4000
<0>[   89.468131] CPU features: 0x,68f167a1,cce6773f
<0>[   89.469850] Memory Limit: none
<0>[   89.470836] ---[ end Kernel panic - not syncing: Oops: Fatal
exception in interrupt ]---



Links:
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-

Re: [PATCH v3 3/3] kselftest: Add new test for detecting unprobed Devicetree devices

2023-11-07 Thread Mark Brown
On Mon, Nov 06, 2023 at 11:09:44AM -0600, Rob Herring wrote:

> A simple solution would be instead of passing the source tree root to
> dt-extract-compatibles, pass 'arch', 'drivers', and 'sound' instead.
> There shouldn't be compatibles anywhere else.

This does seem like a reasonable quick fix that avoids the issue for
now - nothing would stop someone implementing a more complete solution
later.


signature.asc
Description: PGP signature


Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Mark Brown
On Tue, Nov 07, 2023 at 06:43:25PM +0530, Naresh Kamboju wrote:

> # # SVE-VL-64-0: Expected
> [390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
> <>

You've elided *lots* of error reports from the actual test which suggest
that there is substantial memory corruption, it looks like tearing part
way through loading or saving the values - the start of the vectors
looks fine but at some point they get what looks like a related process'
data, eg:

# # SVE-VL-64-0:Expected 
[390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
# # SVE-VL-64-0:Got  
[390439044000390480003904c000390480003904c00039040001390440013904000139044001390480013904c001390480013904c0013904000239044002]

This only appears to affect SVE and SME, I didn't spot any FPSIMD
corruption but then that is the smallest case (and I didn't notice any
VL 16 cases either).  It looks like the corruption is on the first thing
we check each time (either register 0 or the highest ZA.H vector for
ZA), all the values do look lke they were plausibly generated by
fp-stress test programs.

Then we get what looks like memory corruption:

> # # SVE-VL-256-<1>[   88.160313] Unable to handle kernel paging
> request at virtual address 00550f0344550f02

> <4>[   88.195706] Call trace:
> <4>[ 88.196098] percpu_ref_get_many
> (include/linux/percpu-refcount.h:174 (discriminator 2)
> include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
> <4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
> <4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630
> include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208
> mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
> <4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
> <4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
> <4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13
> (discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1)
> kernel/rcu/tree.c:2403 (discriminator 1))

This all seems very surprising, especially given that AFAICT there are
no changes in stable-6.6-rc for arch/arm64.


signature.asc
Description: PGP signature


Re: selftests: gpio: crash on arm64

2023-11-07 Thread Naresh Kamboju
Hi Linus and Bartosz,

On Tue, 20 Jun 2023 at 22:11, Andy Shevchenko
 wrote:
>
> On Tue, Apr 11, 2023 at 10:57:28AM +0200, Linus Walleij wrote:
> > On Mon, Apr 10, 2023 at 11:16 AM Naresh Kamboju
> >  wrote:
>
> ...
>
> > Add a pr_info() devm_gpio_chip_release() in drivers/gpio/gpiolib-devres.c
> > and see if the callback is even called. I think this could be the
> > problem: if that isn't cleaned up, there will be dangling references.
>
> Side note: Since we have devres tracepoints, your patch seems an overkill :-)
> Just enable devres tracepoints and filter out by the function name. I believe
> that should work.

Since I have been tracking open issues on the stable-rc kernel,
The reported problem on stable-rc linux.6.3.y has been solved
on the stable-rc linux.6.6.y branch.

Thanks for fixing this reported issue.

Upstream links about this fix and discussion,

Commit daecca4b8433
gpiolib: Do not alter GPIO chip fwnode member

[1] 
https://lore.kernel.org/linux-gpio/20230703142308.5772-4-andriy.shevche...@linux.intel.com/
[2] 
https://lore.kernel.org/linux-gpio/CAMRc=mffebsej78no7xeuzamj0kezepaywswnfxxaryqpaf...@mail.gmail.com/
[3] 
https://lore.kernel.org/linux-gpio/ca+g9fyv94gx8+-jmzbmqaue3q3y6qdbmsgucdd-26x5xavl...@mail.gmail.com/

>
> --
> With Best Regards,
> Andy Shevchenko
>
>


Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Catalin Marinas
On Tue, Nov 07, 2023 at 06:32:13PM +0530, Naresh Kamboju wrote:
> # <1>[   88.160313] Unable to handle kernel paging
> request at virtual address 00550f0344550f02

It would really help if the kernel dump was not wrapped. Just use git
send-email to post to the list if your email client wraps the log. I'll
try to unwrap some of this below:

> <1>[   88.161949] Mem abort info:
> <1>[   88.162574]   ESR = 0x9604
> <1>[   88.163283]   EC = 0x25: DABT (current EL), IL = 32 bits
> <1>[   88.164330]   SET = 0, FnV = 0
> <1>[   88.164930]   EA = 0, S1PTW = 0
> <1>[   88.165854]   FSC = 0x04: level 0 translation fault
> <1>[   88.166852] Data abort info:
> <1>[   88.167463]   ISV = 0, ISS = 0x0004, ISS2 = 0x
> <1>[   88.168566]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> <1>[   88.169558]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> <1>[   88.170580] [00550f0344550f02] address between user and kernel address 
> ranges
> <0>[   88.172317] Internal error: Oops: 9604 [#1] PREEMPT SMP
> <4>[   88.173833] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce 
> sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
> <4>[   88.177601] CPU: 1 PID: 1 Comm: systemd Not tainted 6.6.1-rc1 #1
> <4>[   88.178992] Hardware name: linux,dummy-virt (DT)
> <4>[   88.180334] pstate: 224000c9 (nzCv daIF +PAN -UAO +TCO -DIT -SSBS 
> BTYPE=--)
> <4>[ 88.181149] pc : percpu_ref_get_many (include/linux/percpu-refcount.h:174 
> (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[ 88.182885] lr : percpu_ref_get_many (include/linux/percpu-refcount.h:174 
> (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[   88.183621] sp : 80008000bd80
> <4>[   88.184039] x29: 80008000bd80 x28: c02c8000 x27: 
> 000a
> <4>[   88.185245] x26:  x25: 0002 x24: 
> 
> <4>[   88.187718] x23: c2306f40 x22:  x21: 
> 44550f0344550f02
> <4>[   88.188696] x20: 44550f0344550f02 x19: 0001 x18: 
> 
> <4>[   88.189556] x17: 436cf77c7000 x16: 800080008000 x15: 
> 
> <4>[   88.190568] x14:  x13: c2290026 x12: 
> 80008002bcb4
> <4>[   88.191589] x11: 0040 x10: c00ea0a8 x9 : 
> bc9405d93864
> <4>[   88.192573] x8 : 80008000bcd8 x7 : c09fe000 x6 : 
> 436cf77c7000
> <4>[   88.193523] x5 : 80008000bd40 x4 : fef8 x3 : 
> 0040
> <4>[   88.194472] x2 : 0002 x1 : c02c8000 x0 : 
> 0001
> <4>[   88.195706] Call trace:
> <4>[ 88.196098] percpu_ref_get_many (include/linux/percpu-refcount.h:174 
> (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
> <4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
> <4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630 
> include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208 mm/slab.h:572 
> mm/slub.c:3804 mm/slub.c:3831)
> <4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
> <4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
> <4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13 (discriminator 
> 1) kernel/rcu/tree.c:2146 (discriminator 1) kernel/rcu/tree.c:2403 
> (discriminator 1))
> <4>[ 88.200978] rcu_core_si (kernel/rcu/tree.c:2421)
> <4>[ 88.201972] __do_softirq (arch/arm64/include/asm/jump_label.h:21 
> include/linux/jump_label.h:207 include/trace/events/irq.h:142 
> kernel/softirq.c:554)
> <4>[ 88.202587] do_softirq (arch/arm64/kernel/irq.c:81)
> <4>[ 88.203049] call_on_irq_stack (arch/arm64/kernel/entry.S:892)
> <4>[ 88.203544] do_softirq_own_stack (arch/arm64/kernel/irq.c:86)
> <4>[ 88.204008] irq_exit_rcu (arch/arm64/include/asm/percpu.h:44 
> kernel/softirq.c:612 kernel/softirq.c:634 kernel/softirq.c:644)
> <4>[ 88.204401] el1_interrupt (arch/arm64/include/asm/current.h:19 
> arch/arm64/kernel/entry-common.c:246 arch/arm64/kernel/entry-common.c:505 
> arch/arm64/kernel/entry-common.c:517)
> <4>[ 88.205751] el1h_64_irq_handler (arch/arm64/kernel/entry-common.c:523)
> <4>[ 88.206672] el1h_64_irq (arch/arm64/kernel/entry.S:591)
> <4>[ 88.207329] map_id_range_down (kernel/user_namespace.c:299 
> kernel/user_namespace.c:319)
> <4>[ 88.208250] make_kuid (kernel/user_namespace.c:412)
> <4>[ 88.208826] inode_init_always (include/linux/fs.h:1343 (discriminator 1) 
> fs/inode.c:174 (discriminator 1))
> <4>[ 88.209678] alloc_inode (fs/inode.c:266 (discriminator 2))
> <4>[ 88.210105] new_inode (fs/inode.c:1004 fs/inode.c:1030)
> <4>[ 88.210542] proc_pid_make_inode (fs/proc/base.c:1898)
> <4>[ 88.210963] proc_pid_instantiate (fs/proc/base.c:1949 fs/proc/base.c:3420)
> <4>[ 88.211361] proc_pid_lookup (fs/proc/base.c:3464)
> <4>[ 88.211762] proc_root_lookup (fs/proc/root.c:325 (discriminator 1))
> <4>[ 88.212299] __lookup_slow (fs/namei.c:1694)
> <4>[ 88.2

selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Naresh Kamboju
Following kernel oops noticed while running kselftests arm64 on qemu-arm64
on stable-rc linux-6.6.y branch.

I have re-built vmlinux with the same config and ran decode stackdump.

Reported-by: Linux Kernel Functional Testing 

Logs:

# selftests: arm64: fp-stress
# TAP version 13
# 1..32
# # 2 CPUs, 5 SVE VLs, 5 SME VLs, SME2 absent
# # Will run for 10s
# # Started FPSIMD-0-0
<>
# # SVE-VL-64-0: Expected
[390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
<>
# # Finishing up...
# # SSVE-VL-16-0: Terminated by signal 15, no error, iterations=50467, signals=9
# # SVE-VL-16-0: Terminated by signal 15, no error, iterations=56669, signals=9
# # FPSIMD-1-0: Terminated by signal 15, no error, iterations=20632, signals=10
# # FPSIMD-0-0: Terminated by signal 15, no error, iterations=21549, signals=9
# # SSVE-VL-16-1: Terminated by signal 15, no error, iterations=49077,
signals=10
# # ZA-VL-16-0: Terminated by signal 15, no error, iterations=24878, signals=9
# # ZA-VL-16-1: Terminated by signal 15, no error, iterations=22452, signals=10
# # SVE-VL-16-1: Terminated by signal 15, no error, iterations=49039, signals=10
# ok 1 FPSIMD-0-0
# # SVE-VL-256-<1>[   88.160313] Unable to handle kernel paging
request at virtual address 00550f0344550f02
<1>[   88.161949] Mem abort info:
<1>[   88.162574]   ESR = 0x9604
<1>[   88.163283]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[   88.164330]   SET = 0, FnV = 0
<1>[   88.164930]   EA = 0, S1PTW = 0
<1>[   88.165854]   FSC = 0x04: level 0 translation fault
<1>[   88.166852] Data abort info:
<1>[   88.167463]   ISV = 0, ISS = 0x0004, ISS2 = 0x
<1>[   88.168566]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[   88.169558]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[   88.170580] [00550f0344550f02] address between user and kernel
address ranges
<0>[   88.172317] Internal error: Oops: 9604 [#1] PREEMPT SMP
<4>[   88.173833] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce
sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
<4>[   88.177601] CPU: 1 PID: 1 Comm: systemd Not tainted 6.6.1-rc1 #1
0<4>[   88.178992] Hardware name: linux,dummy-virt (DT)
<4>[   88.180334] pstate: 224000c9 (nzCv daIF +PAN -UAO +TCO -DIT
-SSBS BTYPE=--)
<4>[ 88.181149] pc : percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.182885] lr : percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[   88.183621] sp : 80008000bd80
<4>[   88.184039] x29: 80008000bd80 x28: c02c8000 x27:
000a
<4>[   88.185245] x26:  x25: 0002 x24:

<4>[   88.187718] x23: c2306f40 x22:  x21:
44550f0344550f02
<4>[   88.188696] x20: 44550f0344550f02 x19: 0001 x18:

<4>[   88.189556] x17: 436cf77c7000 x16: 800080008000 x15:

<4>[   88.190568] x14:  x13: c2290026 x12:
80008002bcb4
<4>[   88.191589] x11: 0040 x10: c00ea0a8 x9 :
bc9405d93864
<4>[   88.192573] x8 : 80008000bcd8 x7 : c09fe000 x6 :
436cf77c7000
<4>[   88.193523] x5 : 80008000bd40 x4 : fef8 x3 :
0040
<4>[   88.194472] x2 : 0002 x1 : c02c8000 x0 :
0001
<4>[   88.195706] Call trace:
<4>[ 88.196098] percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
<4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
<4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630
include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208
mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
<4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
<4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
<4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13
(discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1)
kernel/rcu/tree.c:2403 (discriminator 1))
<4>[ 88.200978] rcu_core_si (kernel/rcu/tree.c:2421)
<4>[ 88.201972] __do_softirq (arch/arm64/include/asm/jump_label.h:21
include/linux/jump_label.h:207 include/trace/events/irq.h:142
kernel/softirq.c:554)
<4>[ 88.202587] do_softirq (arch/arm64/kernel/irq.c:81)
<4>[ 88.203049] call_on_irq_stack (arch/arm64/kernel/entry.S:892)
<4>[ 88.203544] do_softirq_own_stack (arch/arm64/kernel/irq.c:86)
<4>[ 88.204008] irq_exit_rcu (arch/arm64/include/asm/percpu.h:44
kernel/softirq.c:612 kernel/softirq.c:634 kernel/softirq.c:644)
<4>[ 88.204401] el1_interrupt (arch/arm64/include/asm/current.h:19
arch/arm64/kernel/entry-common.c:246
arch/arm64/kernel/entry-common.c:505
arch/arm64/kernel/entry-common.c:517)
<4>[ 88.205751] el1h_64_irq_handler (

Re: selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Naresh Kamboju
My apologies !
Please ignore this email.
I will re-run for arm64 and get back to you.


On Tue, 7 Nov 2023 at 18:32, Naresh Kamboju  wrote:
>
> Following kernel oops noticed while running kselftests arm64 on qemu-arm64
> on stable-rc linux-6.6.y branch.
>
> I have re-built vmlinux with the same config and ran decode stackdump.
>
> Reported-by: Linux Kernel Functional Testing 
>
> Logs:
> 
> # selftests: arm64: fp-stress
> # TAP version 13
> # 1..32
> # # 2 CPUs, 5 SVE VLs, 5 SME VLs, SME2 absent
> # # Will run for 10s
> # # Started FPSIMD-0-0
> <>
> # # SVE-VL-64-0: Expected
> [390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
> <>
> # # Finishing up...
> # # SSVE-VL-16-0: Terminated by signal 15, no error, iterations=50467, 
> signals=9
> # # SVE-VL-16-0: Terminated by signal 15, no error, iterations=56669, 
> signals=9
> # # FPSIMD-1-0: Terminated by signal 15, no error, iterations=20632, 
> signals=10
> # # FPSIMD-0-0: Terminated by signal 15, no error, iterations=21549, signals=9
> # # SSVE-VL-16-1: Terminated by signal 15, no error, iterations=49077,
> signals=10
> # # ZA-VL-16-0: Terminated by signal 15, no error, iterations=24878, signals=9
> # # ZA-VL-16-1: Terminated by signal 15, no error, iterations=22452, 
> signals=10
> # # SVE-VL-16-1: Terminated by signal 15, no error, iterations=49039, 
> signals=10
> # ok 1 FPSIMD-0-0
> # # SVE-VL-256-<1>[   88.160313] Unable to handle kernel paging
> request at virtual address 00550f0344550f02
> <1>[   88.161949] Mem abort info:
> <1>[   88.162574]   ESR = 0x9604
> <1>[   88.163283]   EC = 0x25: DABT (current EL), IL = 32 bits
> <1>[   88.164330]   SET = 0, FnV = 0
> <1>[   88.164930]   EA = 0, S1PTW = 0
> <1>[   88.165854]   FSC = 0x04: level 0 translation fault
> <1>[   88.166852] Data abort info:
> <1>[   88.167463]   ISV = 0, ISS = 0x0004, ISS2 = 0x
> <1>[   88.168566]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> <1>[   88.169558]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> <1>[   88.170580] [00550f0344550f02] address between user and kernel
> address ranges
> <0>[   88.172317] Internal error: Oops: 9604 [#1] PREEMPT SMP
> <4>[   88.173833] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce
> sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
> <4>[   88.177601] CPU: 1 PID: 1 Comm: systemd Not tainted 6.6.1-rc1 #1
> 0<4>[   88.178992] Hardware name: linux,dummy-virt (DT)
> <4>[   88.180334] pstate: 224000c9 (nzCv daIF +PAN -UAO +TCO -DIT
> -SSBS BTYPE=--)
> <4>[ 88.181149] pc : percpu_ref_get_many
> (include/linux/percpu-refcount.h:174 (discriminator 2)
> include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[ 88.182885] lr : percpu_ref_get_many
> (include/linux/percpu-refcount.h:174 (discriminator 2)
> include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[   88.183621] sp : 80008000bd80
> <4>[   88.184039] x29: 80008000bd80 x28: c02c8000 x27:
> 000a
> <4>[   88.185245] x26:  x25: 0002 x24:
> 
> <4>[   88.187718] x23: c2306f40 x22:  x21:
> 44550f0344550f02
> <4>[   88.188696] x20: 44550f0344550f02 x19: 0001 x18:
> 
> <4>[   88.189556] x17: 436cf77c7000 x16: 800080008000 x15:
> 
> <4>[   88.190568] x14:  x13: c2290026 x12:
> 80008002bcb4
> <4>[   88.191589] x11: 0040 x10: c00ea0a8 x9 :
> bc9405d93864
> <4>[   88.192573] x8 : 80008000bcd8 x7 : c09fe000 x6 :
> 436cf77c7000
> <4>[   88.193523] x5 : 80008000bd40 x4 : fef8 x3 :
> 0040
> <4>[   88.194472] x2 : 0002 x1 : c02c8000 x0 :
> 0001
> <4>[   88.195706] Call trace:
> <4>[ 88.196098] percpu_ref_get_many
> (include/linux/percpu-refcount.h:174 (discriminator 2)
> include/linux/percpu-refcount.h:204 (discriminator 2))
> <4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
> <4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
> <4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630
> include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208
> mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
> <4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
> <4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
> <4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13
> (discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1)
> kernel/rcu/tree.c:2403 (discriminator 1))
> <4>[ 88.200978] rcu_core_si (kernel/rcu/tree.c:2421)
> <4>[ 88.201972] __do_softirq (arch/arm64/include/asm/jump_label.h:21
> include/linux/jump_label.h:207 include/trace/events/irq.h:142
> kernel/softirq.c:554)
> <4>[ 88.202587] do_softirq (arch/arm64/kernel/irq.c:81)
> <4>[ 88.203049] call_on_irq_stack (arch/arm64/kernel/entry.S:892)
> <4>[ 88.203544] do_softirq_own_stack (arch/arm64/kernel/irq

selftests: arm64: fp-stress: Unable to handle kernel paging request at virtual address

2023-11-07 Thread Naresh Kamboju
Following kernel oops noticed while running kselftests arm64 on qemu-arm64
on stable-rc linux-6.6.y branch.

I have re-built vmlinux with the same config and ran decode stackdump.

Reported-by: Linux Kernel Functional Testing 

Logs:

# selftests: arm64: fp-stress
# TAP version 13
# 1..32
# # 2 CPUs, 5 SVE VLs, 5 SME VLs, SME2 absent
# # Will run for 10s
# # Started FPSIMD-0-0
<>
# # SVE-VL-64-0: Expected
[390439044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
<>
# # Finishing up...
# # SSVE-VL-16-0: Terminated by signal 15, no error, iterations=50467, signals=9
# # SVE-VL-16-0: Terminated by signal 15, no error, iterations=56669, signals=9
# # FPSIMD-1-0: Terminated by signal 15, no error, iterations=20632, signals=10
# # FPSIMD-0-0: Terminated by signal 15, no error, iterations=21549, signals=9
# # SSVE-VL-16-1: Terminated by signal 15, no error, iterations=49077,
signals=10
# # ZA-VL-16-0: Terminated by signal 15, no error, iterations=24878, signals=9
# # ZA-VL-16-1: Terminated by signal 15, no error, iterations=22452, signals=10
# # SVE-VL-16-1: Terminated by signal 15, no error, iterations=49039, signals=10
# ok 1 FPSIMD-0-0
# # SVE-VL-256-<1>[   88.160313] Unable to handle kernel paging
request at virtual address 00550f0344550f02
<1>[   88.161949] Mem abort info:
<1>[   88.162574]   ESR = 0x9604
<1>[   88.163283]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[   88.164330]   SET = 0, FnV = 0
<1>[   88.164930]   EA = 0, S1PTW = 0
<1>[   88.165854]   FSC = 0x04: level 0 translation fault
<1>[   88.166852] Data abort info:
<1>[   88.167463]   ISV = 0, ISS = 0x0004, ISS2 = 0x
<1>[   88.168566]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[   88.169558]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[   88.170580] [00550f0344550f02] address between user and kernel
address ranges
<0>[   88.172317] Internal error: Oops: 9604 [#1] PREEMPT SMP
<4>[   88.173833] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce
sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
<4>[   88.177601] CPU: 1 PID: 1 Comm: systemd Not tainted 6.6.1-rc1 #1
0<4>[   88.178992] Hardware name: linux,dummy-virt (DT)
<4>[   88.180334] pstate: 224000c9 (nzCv daIF +PAN -UAO +TCO -DIT
-SSBS BTYPE=--)
<4>[ 88.181149] pc : percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.182885] lr : percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[   88.183621] sp : 80008000bd80
<4>[   88.184039] x29: 80008000bd80 x28: c02c8000 x27:
000a
<4>[   88.185245] x26:  x25: 0002 x24:

<4>[   88.187718] x23: c2306f40 x22:  x21:
44550f0344550f02
<4>[   88.188696] x20: 44550f0344550f02 x19: 0001 x18:

<4>[   88.189556] x17: 436cf77c7000 x16: 800080008000 x15:

<4>[   88.190568] x14:  x13: c2290026 x12:
80008002bcb4
<4>[   88.191589] x11: 0040 x10: c00ea0a8 x9 :
bc9405d93864
<4>[   88.192573] x8 : 80008000bcd8 x7 : c09fe000 x6 :
436cf77c7000
<4>[   88.193523] x5 : 80008000bd40 x4 : fef8 x3 :
0040
<4>[   88.194472] x2 : 0002 x1 : c02c8000 x0 :
0001
<4>[   88.195706] Call trace:
<4>[ 88.196098] percpu_ref_get_many
(include/linux/percpu-refcount.h:174 (discriminator 2)
include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
<4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
<4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630
include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208
mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
<4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
<4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
<4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13
(discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1)
kernel/rcu/tree.c:2403 (discriminator 1))
<4>[ 88.200978] rcu_core_si (kernel/rcu/tree.c:2421)
<4>[ 88.201972] __do_softirq (arch/arm64/include/asm/jump_label.h:21
include/linux/jump_label.h:207 include/trace/events/irq.h:142
kernel/softirq.c:554)
<4>[ 88.202587] do_softirq (arch/arm64/kernel/irq.c:81)
<4>[ 88.203049] call_on_irq_stack (arch/arm64/kernel/entry.S:892)
<4>[ 88.203544] do_softirq_own_stack (arch/arm64/kernel/irq.c:86)
<4>[ 88.204008] irq_exit_rcu (arch/arm64/include/asm/percpu.h:44
kernel/softirq.c:612 kernel/softirq.c:634 kernel/softirq.c:644)
<4>[ 88.204401] el1_interrupt (arch/arm64/include/asm/current.h:19
arch/arm64/kernel/entry-common.c:246
arch/arm64/kernel/entry-common.c:505
arch/arm64/kernel/entry-common.c:517)
<4>[ 88.205751] el1h_64_irq_handler (

[PATCH AUTOSEL 6.5 37/37] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 4.19 5/5] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 4.14 4/4] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 5.4 6/6] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 5.10 11/11] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 5.15 20/20] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 6.1 25/25] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



[PATCH AUTOSEL 6.6 40/40] selftests/efivarfs: create-read: fix a resource leak

2023-11-07 Thread Sasha Levin
From: zhujun2 

[ Upstream commit 3f6f8a8c5e11a9b384a36df4f40f0c9a653b6975 ]

The opened file should be closed in main(), otherwise resource
leak will occur that this problem was discovered by code reading

Signed-off-by: zhujun2 
Signed-off-by: Shuah Khan 
Signed-off-by: Sasha Levin 
---
 tools/testing/selftests/efivarfs/create-read.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/efivarfs/create-read.c 
b/tools/testing/selftests/efivarfs/create-read.c
index 9674a19396a32..7bc7af4eb2c17 100644
--- a/tools/testing/selftests/efivarfs/create-read.c
+++ b/tools/testing/selftests/efivarfs/create-read.c
@@ -32,8 +32,10 @@ int main(int argc, char **argv)
rc = read(fd, buf, sizeof(buf));
if (rc != 0) {
fprintf(stderr, "Reading a new var should return EOF\n");
+   close(fd);
return EXIT_FAILURE;
}
 
+   close(fd);
return EXIT_SUCCESS;
 }
-- 
2.42.0



Re: [PATCH 23/24] selftests/resctrl: Add L2 CAT test

2023-11-07 Thread Ilpo Järvinen
On Mon, 6 Nov 2023, Reinette Chatre wrote:
> On 11/6/2023 9:03 AM, Reinette Chatre wrote:
> > On 11/6/2023 1:53 AM, Ilpo Järvinen wrote:
> >> On Fri, 3 Nov 2023, Reinette Chatre wrote:
> >>> On 11/3/2023 3:39 AM, Ilpo Järvinen wrote:
>  On Thu, 2 Nov 2023, Reinette Chatre wrote:
> > On 10/24/2023 2:26 AM, Ilpo Järvinen wrote:
> 
> >> Add L2 CAT selftest. As measuring L2 misses is not easily available
> >> with perf, use L3 accesses as a proxy for L2 CAT working or not.
> >
> > I understand the exact measurement is not available but I do notice some
> > L2 related symbolic counters when I run "perf list". 
> > l2_rqsts.all_demand_miss
> > looks promising.
> 
>  Okay, I was under impression that L2 misses are not available. Both 
>  based 
>  on what you mentioned to me half an year ago and because of what flags I 
>  found from the header. But I'll take another look into it.
> >>>
> >>> You are correct that when I did L2 testing a long time ago I used
> >>> the model specific L2 miss counts. I was hoping that things have improved
> >>> so that model specific counters are not needed, as you have tried here.
> >>> I found the l2_rqsts symbol while looking for alternatives but I am not
> >>> familiar enough with perf to know how these symbolic names are mapped.
> >>> I was hoping that they could be a simple drop-in replacement to
> >>> experiment with.
> >>
> >> According to perf_event_open() manpage, mapping those symbolic names 
> >> requires libpfm so this would add a library dependency?
> > 
> > I do not see perf list using this library to determine the event and
> > umask but I am in unfamiliar territory. I'll have to spend some more
> > time here to determine options.
> 
> tools/perf/pmu-events/README cleared it up for me. The architecture specific
> tables are included in the perf binary. Potentially pmu-events.h could be
> included or the test could just stick with the architectural events.
> A quick look at the various cache.json files created the impression that
> the events of interest may actually have the same event code and umask across
> platforms.
> I am not familiar with libpfm. This can surely be considered if it supports
> this testing. Several selftests have library dependencies.

man perf_event_open() says this:

"If type is PERF_TYPE_RAW, then a custom "raw" config  value  is  needed.
Most  CPUs  support  events  that  are  not covered by the "generalized"
events.  These are implementation defined; see your CPU manual (for  ex-
ample  the  Intel Volume 3B documentation or the AMD BIOS and Kernel De-
veloper Guide).  The libpfm4 library can be used to translate  from  the
name in the architectural manuals to the raw hex value perf_event_open()
expects in this field."

...I've not come across libpfm myself either but to me it looks libpfm 
bridges between those architecture specific tables and perf_event_open(). 
That is, it could provide the binary value necessary in constructing the 
perf_event_attr struct.

I think this is probably the function which maps string -> 
perf_event_attr:

https://man7.org/linux/man-pages/man3/pfm_get_os_event_encoding.3.html


-- 
 i.


selftests: arm64: za-fork - ZA state invalid in child

2023-11-07 Thread Naresh Kamboju
Following selftests: arm64: za-fork test failures noticed on qemu-arm64.
But the same test passed on FVP and Juno-r2.

Are we missing something on qemu-arm64 ?
qemu-system-arm installed at version: 8.1
host architecture: amd64

Started failing from Linux next-20231106

GOOD: next-20231103
Bad: next-20231103

Please find the links to compare results between the Linux next tags.

kselftest: Running tests in arm64
TAP version 13
1..16
# timeout set to 45
# selftests: arm64: za-fork
# TAP version 13
# 1..1
# # PID: 348
# # ZA state invalid in child
# not ok 1 fork_test# Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: arm64: za-fork # exit=1

Reported-by: Linux Kernel Functional Testing 

Steps to reproduce:

 - 
https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2XmWvN5YW3lPR4UFwxEubBVVC7y/reproducer

Links:
 - 
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231107/testrun/20969859/suite/kselftest-arm64/test/arm64_za-fork/history/
 - 
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231107/testrun/20969859/suite/kselftest-arm64/test/arm64_check_gcr_el1_cswitch/log
 - 
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231106/testrun/20927295/suite/kselftest-arm64/test/arm64_za-fork/details/

--
Linaro LKFT
https://lkft.linaro.org


Re: [RFC PATCH v3 08/12] net: support non paged skb frags

2023-11-07 Thread Yunsheng Lin
On 2023/11/6 10:44, Mina Almasry wrote:
> Make skb_frag_page() fail in the case where the frag is not backed
> by a page, and fix its relevent callers to handle this case.
> 
> Correctly handle skb_frag refcounting in the page_pool_iovs case.
> 
> Signed-off-by: Mina Almasry 
> 

...

>  /**
>   * skb_frag_page - retrieve the page referred to by a paged fragment
>   * @frag: the paged fragment
>   *
> - * Returns the &struct page associated with @frag.
> + * Returns the &struct page associated with @frag. Returns NULL if this frag
> + * has no associated page.
>   */
>  static inline struct page *skb_frag_page(const skb_frag_t *frag)
>  {
> - return frag->bv_page;
> + if (!page_is_page_pool_iov(frag->bv_page))
> + return frag->bv_page;
> +
> + return NULL;

It seems most of callers don't expect NULL returning for skb_frag_page(),
and this patch only changes a few relevant callers to handle the NULL case.

It may make more sense to add a new helper to do the above checking, and
add a warning in skb_frag_page() to catch any missing NULL checking for
skb_frag_page() caller, something like below?

 static inline struct page *skb_frag_page(const skb_frag_t *frag)
 {
-   return frag->bv_page;
+   struct page *page = frag->bv_page;
+
+   BUG_ON(page_is_page_pool_iov(page));
+
+   return page;
+}
+
+static inline struct page *skb_frag_readable_page(const skb_frag_t *frag)
+{
+   struct page *page = frag->bv_page;
+
+   if (!page_is_page_pool_iov(page))
+   return page;
+
+   return NULL;
 }




RE: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space

2023-11-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, November 2, 2023 8:48 PM
>
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> >
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation.
> Nested
> > translation is a hardware feature that supports two-stage translation
> > tables for IOMMU. The second-stage translation table is managed by the
> > host VMM, while the first-stage translation table is owned by user
> > space. This allows user space to control the IOMMU mappings for its
> > devices.
> 
> Having now looked more closely at the ARM requirements it seems we
> will need generic events, not just page fault events to have a
> complete emulation.

Can you elaborate?

> 
> So I'd like to see this generalized into a channel to carry any
> events..
> 
> > User space indicates its capability of handling IO page faults by
> > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > for page fault delivery. On a successful return of HWPT allocation, the
> > user can retrieve and respond to page faults by reading and writing to
> > the file descriptor (FD) returned in out_fault_fd.
> 
> This is the right way to approach it, and more broadly this shouldn't
> be an iommufd specific thing. Kernel drivers will also need to create
> fault capable PAGING iommu domains.
> 

Are you suggesting a common interface used by both iommufd and
kernel drivers?

but I didn't get the last piece. If those domains are created by kernel
drivers why would they require a uAPI for userspace to specify fault
capable?


Re: [RFC PATCH v3 07/12] page-pool: device memory support

2023-11-07 Thread Yunsheng Lin
On 2023/11/6 10:44, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page* into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 
> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
>  include/net/page_pool/helpers.h | 74 -
>  net/core/page_pool.c| 63 
>  2 files changed, 118 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index b93243c2a640..08f1a2cc70d2 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -151,6 +151,64 @@ static inline struct page_pool_iov 
> *page_to_page_pool_iov(struct page *page)
>   return NULL;
>  }
>  
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> + if (page_is_page_pool_iov(page))
> + return page_pool_iov_refcount(page_to_page_pool_iov(page));

We have added a lot of 'if' for the devmem case, it would be better to
make it more generic so that we can have more unified metadata handling
for normal page and devmem. If we add another memory type here, do we
need another 'if' here?
That is part of the reason I suggested using a more unified metadata for
all the types of memory chunks used by page_pool.