Re: [PATCH 1/4] kunit: Add APIs for managing devices
Hey Greg,

On Wed, 6 Dec 2023 at 01:31, Greg Kroah-Hartman wrote:
> On Tue, Dec 05, 2023 at 03:31:33PM +0800, david...@google.com wrote:
> > Tests for drivers often require a struct device to pass to other
> > functions. While it's possible to create these with
> > root_device_register(), or to use something like a platform device,
> > this is both a misuse of those APIs, and can be difficult to clean up
> > after, for example, a failed assertion.
> >
> > Add some KUnit-specific functions for registering and unregistering a
> > struct device:
> > - kunit_device_register()
> > - kunit_device_register_with_driver()
> > - kunit_device_unregister()
> >
> > These helpers allocate a device on a 'kunit' bus which will either
> > probe the driver passed in (kunit_device_register_with_driver), or
> > will create a stub driver (kunit_device_register) which is cleaned up
> > on test shutdown.
> >
> > Devices are automatically unregistered on test shutdown, but can be
> > manually unregistered earlier with kunit_device_unregister() in order
> > to, for example, test device release code.
>
> At first glance, nice work. But looks like 0-day doesn't like it that
> much, so I'll wait for the next version to review it properly.

Thanks very much for taking a look. I'll send v2 with the 0-day (and
other) issues fixed sometime tomorrow.

In the meantime, I've tried to explain some of the weirder decisions
below -- it mostly boils down to the existing use-cases only wanting an
opaque 'struct device *' they can pass around, and my attempt to find a
minimal (but still sensible) implementation of that. I'm definitely
happy to tweak this to make it a more 'normal' use of the device model
where that makes sense, though, especially if it doesn't require too
much boilerplate on the part of test authors.

> One nit I did notice:
>
> > +// For internal use only -- registers the kunit_bus.
> > +int kunit_bus_init(void);
>
> Put stuff like this in a local .h file, don't pollute the include/linux/
> files for things that you do not want any other part of the kernel to
> call.

v2 will have this in lib/kunit/device-impl.h

> > +/**
> > + * kunit_device_register_with_driver() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + * @drv: The struct device_driver to associate with the device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given
> > + * name, and driver. The device will be cleaned up on test exit, or when
> > + * kunit_device_unregister is called. See also kunit_device_register, if you
> > + * wish KUnit to create and manage a driver for you
> > + */
> > +struct device *kunit_device_register_with_driver(struct kunit *test,
> > +						 const char *name,
> > +						 struct device_driver *drv);
>
> Shouldn't "struct device_driver *" be a constant pointer?

Done (and for the internal functions) for v2.

> But really, why is this a "raw" device_driver pointer and not a pointer
> to the driver type for your bus?

So, this is where the more difficult questions start (and where my
knowledge of the driver model gets a bit shakier). At the moment,
there's no struct kunit_driver; the goal was to have whatever the
minimal amount of infrastructure needed to get a 'struct device *' that
could be plumbed through existing code which accepts it. (Read: mostly
devres resource management stuff, get_device(), etc.)

So, in this version, there's no:
- struct kunit_driver: we've no extra data to store / function pointers
  other than what's in struct device_driver.
- The kunit_bus is as minimal as I could get it: each device matches
  exactly one driver pointer (which is passed as struct
  kunit_device->driver).
- The 'struct kunit_device' type is kept private, and 'struct device'
  is used instead, as this is supposed to only be passed to generic
  device functions (KUnit is just managing its lifecycle).

I've no problem adding these extra types to flesh this out into a more
'normal' setup, though I'd rather keep the boilerplate on the user side
minimal if possible. I suspect if we were to return a struct
kunit_device, everyone would be quickly grabbing and passing around a
raw 'struct device *' anyway, which is what the existing tests with
fake devices do (via root_device_register, which returns struct device,
or by returning _device->dev from a helper).

Similarly, the kunit_bus is not ever exposed to test code, nor really
is the driver (except via kunit_device_register_with_driver(), which
isn't actually being used anywhere yet, so it may make sense to cut it
out from the next version). So, ideally tests won't even be aware that
their devices are attached to the kunit_bus, nor that they have drivers
attached: it's mostly just to make these normal enough that they show
up nicely in sysfs and play well with the devm_ resource management.
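To make the intended usage concrete, here is a hypothetical test sketch based on the descriptions in this thread -- it is not code from the patch, and the device name and error-checking conventions (ERR_PTR-or-NULL on failure) are my assumptions:

```c
#include <kunit/device.h>
#include <kunit/test.h>
#include <linux/device.h>
#include <linux/gfp.h>

static void example_fake_device_test(struct kunit *test)
{
	struct device *dev;
	void *buf;

	/* An opaque struct device on the kunit_bus, with a stub driver. */
	dev = kunit_device_register(test, "example-fake-device");
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev);

	/* devm_ allocations are tied to the device's lifetime... */
	buf = devm_kzalloc(dev, 16, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);

	/*
	 * ...and everything is torn down automatically on test exit, or
	 * explicitly here, e.g. to exercise release/devres paths early.
	 */
	kunit_device_unregister(test, dev);
}
```

The point of the design discussed above is visible here: the test only ever sees a plain 'struct device *', and never touches the kunit_bus or the stub driver directly.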
Re: [PATCH 1/4] kunit: Add APIs for managing devices
On Tue, 5 Dec 2023 at 16:30, Matti Vaittinen wrote:
> On 12/5/23 09:31, david...@google.com wrote:
> > Tests for drivers often require a struct device to pass to other
> > functions. While it's possible to create these with
> > root_device_register(), or to use something like a platform device,
> > this is both a misuse of those APIs, and can be difficult to clean up
> > after, for example, a failed assertion.
> >
> > Add some KUnit-specific functions for registering and unregistering a
> > struct device:
> > - kunit_device_register()
> > - kunit_device_register_with_driver()
> > - kunit_device_unregister()
>
> Thanks a lot David! I have been missing these!
>
> I love the explanation you added under Documentation. Very helpful I'd
> say. I only have very minor comments which you can ignore if they don't
> make sense to you or the kunit-subsystem.
>
> With or without the suggested changes:
>
> Reviewed-by: Matti Vaittinen
>
> > --- /dev/null
> > +++ b/include/kunit/device.h
> > @@ -0,0 +1,76 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * KUnit basic device implementation
> > + *
> > + * Helpers for creating and managing fake devices for KUnit tests.
> > + *
> > + * Copyright (C) 2023, Google LLC.
> > + * Author: David Gow
> > + */
> > +
> > +#ifndef _KUNIT_DEVICE_H
> > +#define _KUNIT_DEVICE_H
> > +
> > +#if IS_ENABLED(CONFIG_KUNIT)
> > +
> > +#include
> > +
> > +struct kunit_device;
> > +struct device;
> > +struct device_driver;
> > +
> > +// For internal use only -- registers the kunit_bus.
> > +int kunit_bus_init(void);
> > +
> > +/**
> > + * kunit_driver_create() - Create a struct device_driver attached to the kunit_bus
> > + * @test: The test context object.
> > + * @name: The name to give the created driver.
> > + *
> > + * Creates a struct device_driver attached to the kunit_bus, with the name @name.
> > + * This driver will automatically be cleaned up on test exit.
> > + */
> > +struct device_driver *kunit_driver_create(struct kunit *test, const char *name);
> > +
> > +/**
> > + * kunit_device_register() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given name,
> > + * and a corresponding driver. The device and driver will be cleaned up on test
> > + * exit, or when kunit_device_unregister is called. See also
> > + * kunit_device_register_with_driver, if you wish to provide your own
> > + * struct device_driver.
> > + */
> > +struct device *kunit_device_register(struct kunit *test, const char *name);
> > +
> > +/**
> > + * kunit_device_register_with_driver() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + * @drv: The struct device_driver to associate with the device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given
> > + * name, and driver. The device will be cleaned up on test exit, or when
> > + * kunit_device_unregister is called. See also kunit_device_register, if you
> > + * wish KUnit to create and manage a driver for you
> > + */
> > +struct device *kunit_device_register_with_driver(struct kunit *test,
> > +						 const char *name,
> > +						 struct device_driver *drv);
> > +
> > +/**
> > + * kunit_device_unregister() - Unregister a KUnit-managed device
> > + * @test: The test context object which created the device
> > + * @dev: The device.
> > + *
> > + * Unregisters and destroys a struct device which was created with
> > + * kunit_device_register or kunit_device_register_with_driver. If KUnit created
> > + * a driver, cleans it up as well.
> > + */
> > +void kunit_device_unregister(struct kunit *test, struct device *dev);
>
> I wish the return values for error case(s) were also mentioned.
> But please, see my next comment as well.

I'll add these for v2.

> > +
> > +#endif
> > +
> > +#endif
>
> ...
>
> > diff --git a/lib/kunit/device.c b/lib/kunit/device.c
> > new file mode 100644
> > index ..93ace1a2297d
> > --- /dev/null
> > +++ b/lib/kunit/device.c
> > @@ -0,0 +1,176 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * KUnit basic device implementation
> > + *
> > + * Implementation of struct kunit_device helpers.
> > + *
> > + * Copyright (C) 2023, Google LLC.
> > + * Author: David Gow
> > + */
> > +
> > ...
> >
> > +static void kunit_device_release(struct device *d)
> > +{
> > +	kfree(to_kunit_device(d));
> > +}
>
> I see you added the function documentation to the header. I assume this
> is the kunit style(?) I may be heretical, but I'd love to see at least a
> very short documentation for (all) exported functions here. I think the
> arguments are mostly self-explanatory, but at least for me the return
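For the _with_driver variant discussed in the quoted kernel-doc, a usage sketch might look like the following -- this is my illustration, not code from the patch, and the error-return convention (ERR_PTR-or-NULL, the detail Matti asks to have documented) is an assumption pending v2:

```c
#include <kunit/device.h>
#include <kunit/test.h>
#include <linux/device.h>

static void example_custom_driver_test(struct kunit *test)
{
	struct device_driver *drv;
	struct device *dev;

	/* A KUnit-managed driver on the kunit_bus, freed on test exit. */
	drv = kunit_driver_create(test, "example-driver");
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, drv);

	/* A device bound to that specific driver, also cleaned up for us. */
	dev = kunit_device_register_with_driver(test, "example-device", drv);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev);
}
```

As noted upthread, nothing in-tree uses this variant yet, so whether it survives to v2 (and whether @drv becomes const) is an open question in this thread.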
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 15:36, Yosry Ahmed wrote:
> On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou wrote:
>> On 2023/12/6 13:59, Yosry Ahmed wrote:
>>> [..]
>>>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>>>> 	return entry;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> +* shrinker functions
>>>>> +**/
>>>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>>>> +				       spinlock_t *lock, void *arg);
>>>>> +
>>>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>>>> +					 struct shrink_control *sc)
>>>>> +{
>>>>> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>>>> +	unsigned long shrink_ret, nr_protected, lru_size;
>>>>> +	struct zswap_pool *pool = shrinker->private_data;
>>>>> +	bool encountered_page_in_swapcache = false;
>>>>> +
>>>>> +	nr_protected =
>>>>> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>>>> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>>>> +
>>>>> +	/*
>>>>> +	 * Abort if the shrinker is disabled or if we are shrinking into the
>>>>> +	 * protected region.
>>>>> +	 *
>>>>> +	 * This short-circuiting is necessary because if we have too many multiple
>>>>> +	 * concurrent reclaimers getting the freeable zswap object counts at the
>>>>> +	 * same time (before any of them made reasonable progress), the total
>>>>> +	 * number of reclaimed objects might be more than the number of unprotected
>>>>> +	 * objects (i.e the reclaimers will reclaim into the protected area of the
>>>>> +	 * zswap LRU).
>>>>> +	 */
>>>>> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>>>> +		sc->nr_scanned = 0;
>>>>> +		return SHRINK_STOP;
>>>>> +	}
>>>>> +
>>>>> +	shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>>>> +					  &encountered_page_in_swapcache);
>>>>> +
>>>>> +	if (encountered_page_in_swapcache)
>>>>> +		return SHRINK_STOP;
>>>>> +
>>>>> +	return shrink_ret ? shrink_ret : SHRINK_STOP;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>>>> +					  struct shrink_control *sc)
>>>>> +{
>>>>> +	struct zswap_pool *pool = shrinker->private_data;
>>>>> +	struct mem_cgroup *memcg = sc->memcg;
>>>>> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>>>> +	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>>>> +
>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>> +	cgroup_rstat_flush(memcg->css.cgroup);
>>>>> +	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>>>> +	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>>>> +#else
>>>>> +	/* use pool stats instead of memcg stats */
>>>>> +	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>>>> +	nr_stored = atomic_read(&pool->nr_stored);
>>>>> +#endif
>>>>> +
>>>>> +	if (!zswap_shrinker_enabled || !nr_stored)
>>>>
>>>> When I tested with this series, with !zswap_shrinker_enabled in the
>>>> default case, I found the performance is much worse than that without
>>>> this patch.
>>>>
>>>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs
>>>> directory.
>>>>
>>>> The reason seems the above cgroup_rstat_flush(), caused much rstat
>>>> lock contention to the zswap_store() path. And if I put the
>>>> "zswap_shrinker_enabled" check above the cgroup_rstat_flush(), the
>>>> performance become much better.
>>>>
>>>> Maybe we can put the "zswap_shrinker_enabled" check above
>>>> cgroup_rstat_flush()?
>>>
>>> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
>>> use mem_cgroup_flush_stats() here like other places unless accuracy is
>>> crucial, which I doubt given that reclaim uses
>>> mem_cgroup_flush_stats().
>>
>> Yes. After changing to use mem_cgroup_flush_stats() here, the
>> performance become much better.
>>
>>> mem_cgroup_flush_stats() has some thresholding to make sure we don't
>>> do flushes unnecessarily, and I have a pending series in mm-unstable
>>> that makes that thresholding per-memcg. Keep in mind that adding a
>>> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>>
>> My test branch is linux-next 20231205, and it's all good after changing
>> to use mem_cgroup_flush_stats(memcg).
>
> Thanks for reporting back. We should still move the
> zswap_shrinker_enabled check ahead, no need to even call
> mem_cgroup_flush_stats() if we will do nothing anyway.

Yes, agree!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou wrote:
> On 2023/12/6 13:59, Yosry Ahmed wrote:
> > [..]
> >>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >>> 	return entry;
> >>>  }
> >>>
> >>> +/*
> >>> +* shrinker functions
> >>> +**/
> >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >>> +				       spinlock_t *lock, void *arg);
> >>> +
> >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >>> +					 struct shrink_control *sc)
> >>> +{
> >>> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >>> +	unsigned long shrink_ret, nr_protected, lru_size;
> >>> +	struct zswap_pool *pool = shrinker->private_data;
> >>> +	bool encountered_page_in_swapcache = false;
> >>> +
> >>> +	nr_protected =
> >>> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> >>> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> >>> +
> >>> +	/*
> >>> +	 * Abort if the shrinker is disabled or if we are shrinking into the
> >>> +	 * protected region.
> >>> +	 *
> >>> +	 * This short-circuiting is necessary because if we have too many multiple
> >>> +	 * concurrent reclaimers getting the freeable zswap object counts at the
> >>> +	 * same time (before any of them made reasonable progress), the total
> >>> +	 * number of reclaimed objects might be more than the number of unprotected
> >>> +	 * objects (i.e the reclaimers will reclaim into the protected area of the
> >>> +	 * zswap LRU).
> >>> +	 */
> >>> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> >>> +		sc->nr_scanned = 0;
> >>> +		return SHRINK_STOP;
> >>> +	}
> >>> +
> >>> +	shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> >>> +					  &encountered_page_in_swapcache);
> >>> +
> >>> +	if (encountered_page_in_swapcache)
> >>> +		return SHRINK_STOP;
> >>> +
> >>> +	return shrink_ret ? shrink_ret : SHRINK_STOP;
> >>> +}
> >>> +
> >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >>> +					  struct shrink_control *sc)
> >>> +{
> >>> +	struct zswap_pool *pool = shrinker->private_data;
> >>> +	struct mem_cgroup *memcg = sc->memcg;
> >>> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> >>> +	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> >>> +
> >>> +#ifdef CONFIG_MEMCG_KMEM
> >>> +	cgroup_rstat_flush(memcg->css.cgroup);
> >>> +	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> >>> +	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> >>> +#else
> >>> +	/* use pool stats instead of memcg stats */
> >>> +	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> >>> +	nr_stored = atomic_read(&pool->nr_stored);
> >>> +#endif
> >>> +
> >>> +	if (!zswap_shrinker_enabled || !nr_stored)
> >>
> >> When I tested with this series, with !zswap_shrinker_enabled in the
> >> default case, I found the performance is much worse than that without
> >> this patch.
> >>
> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs
> >> directory.
> >>
> >> The reason seems the above cgroup_rstat_flush(), caused much rstat
> >> lock contention to the zswap_store() path. And if I put the
> >> "zswap_shrinker_enabled" check above the cgroup_rstat_flush(), the
> >> performance become much better.
> >>
> >> Maybe we can put the "zswap_shrinker_enabled" check above
> >> cgroup_rstat_flush()?
> >
> > Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> > use mem_cgroup_flush_stats() here like other places unless accuracy is
> > crucial, which I doubt given that reclaim uses
> > mem_cgroup_flush_stats().
>
> Yes. After changing to use mem_cgroup_flush_stats() here, the
> performance become much better.
>
> > mem_cgroup_flush_stats() has some thresholding to make sure we don't
> > do flushes unnecessarily, and I have a pending series in mm-unstable
> > that makes that thresholding per-memcg. Keep in mind that adding a
> > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>
> My test branch is linux-next 20231205, and it's all good after changing
> to use mem_cgroup_flush_stats(memcg).

Thanks for reporting back. We should still move the
zswap_shrinker_enabled check ahead, no need to even call
mem_cgroup_flush_stats() if we will do nothing anyway.

> > because the series there adds a memcg argument to
> > mem_cgroup_flush_stats(). That should be easily amenable though, I can
> > post a fixlet for my series to add the memcg argument there on top of
> > users if needed.
>
> It's great. Thanks!
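Putting the agreed changes together, the reordered count callback would take roughly the following shape. This is a sketch of the fix discussed above, not the final committed code; the elided tail and the no-argument mem_cgroup_flush_stats() signature (which the pending mm-unstable series changes to take a memcg) are assumptions:

```c
static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
					  struct shrink_control *sc)
{
	struct zswap_pool *pool = shrinker->private_data;
	struct mem_cgroup *memcg = sc->memcg;
	unsigned long nr_backing, nr_stored;

	/*
	 * Check the knob before flushing anything: when the shrinker is
	 * disabled this avoids all rstat lock traffic, which was
	 * contending with the zswap_store() path in the reported test.
	 */
	if (!zswap_shrinker_enabled)
		return 0;

#ifdef CONFIG_MEMCG_KMEM
	/* Ratelimited/thresholded flush, unlike cgroup_rstat_flush(). */
	mem_cgroup_flush_stats();
	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
#else
	/* use pool stats instead of memcg stats */
	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
	nr_stored = atomic_read(&pool->nr_stored);
#endif

	if (!nr_stored)
		return 0;

	/* ... remainder of the freeable-object accounting as in the patch ... */
	return 0;
}
```

The key design point from the thread is simply ordering: the cheap boolean test must come before any stats flush, so a disabled shrinker costs nothing on the reclaim path.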
[PATCH net-next 9/9] selftests/net: convert vrf-xfrm-tests.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf-xfrm-tests.sh No qdisc on VRF device TEST: IPv4 no xfrm policy [ OK ] TEST: IPv6 no xfrm policy [ OK ] TEST: IPv4 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy with VRF in selector [ OK ] TEST: IPv4 xfrm policy with xfrm device [ OK ] TEST: IPv6 xfrm policy with xfrm device [ OK ] netem qdisc on VRF device TEST: IPv4 no xfrm policy [ OK ] TEST: IPv6 no xfrm policy [ OK ] TEST: IPv4 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy with VRF in selector [ OK ] TEST: IPv4 xfrm policy with xfrm device [ OK ] TEST: IPv6 xfrm policy with xfrm device [ OK ] Tests passed: 14 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- tools/testing/selftests/net/vrf-xfrm-tests.sh | 77 +-- 1 file changed, 36 insertions(+), 41 deletions(-) diff --git a/tools/testing/selftests/net/vrf-xfrm-tests.sh b/tools/testing/selftests/net/vrf-xfrm-tests.sh index 452638ae8aed..b64dd891699d 100755 --- a/tools/testing/selftests/net/vrf-xfrm-tests.sh +++ b/tools/testing/selftests/net/vrf-xfrm-tests.sh @@ -3,9 +3,7 @@ # # Various combinations of VRF with xfrms and qdisc. -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 - +source lib.sh PAUSE_ON_FAIL=no VERBOSE=0 ret=0 @@ -67,7 +65,7 @@ run_cmd_host1() printf "COMMAND: $cmd\n" fi - out=$(eval ip netns exec host1 $cmd 2>&1) + out=$(eval ip netns exec $host1 $cmd 2>&1) rc=$? 
if [ "$VERBOSE" = "1" ]; then if [ -n "$out" ]; then @@ -116,9 +114,6 @@ create_ns() [ -z "${addr}" ] && addr="-" [ -z "${addr6}" ] && addr6="-" - ip netns add ${ns} - - ip -netns ${ns} link set lo up if [ "${addr}" != "-" ]; then ip -netns ${ns} addr add dev lo ${addr} fi @@ -177,25 +172,25 @@ connect_ns() cleanup() { - ip netns del host1 - ip netns del host2 + cleanup_ns $host1 $host2 } setup() { - create_ns "host1" - create_ns "host2" + setup_ns host1 host2 + create_ns "$host1" + create_ns "$host2" - connect_ns "host1" eth0 ${HOST1_4}/24 ${HOST1_6}/64 \ - "host2" eth0 ${HOST2_4}/24 ${HOST2_6}/64 + connect_ns "$host1" eth0 ${HOST1_4}/24 ${HOST1_6}/64 \ + "$host2" eth0 ${HOST2_4}/24 ${HOST2_6}/64 - create_vrf "host1" ${VRF} ${TABLE} - ip -netns host1 link set dev eth0 master ${VRF} + create_vrf "$host1" ${VRF} ${TABLE} + ip -netns $host1 link set dev eth0 master ${VRF} } cleanup_xfrm() { - for ns in host1 host2 + for ns in $host1 $host2 do for x in state policy do @@ -218,57 +213,57 @@ setup_xfrm() # # host1 - IPv4 out - ip -netns host1 xfrm policy add \ + ip -netns $host1 xfrm policy add \ src ${h1_4} dst ${h2_4} ${devarg} dir out \ tmpl src ${HOST1_4} dst ${HOST2_4} proto esp mode tunnel # host2 - IPv4 in - ip -netns host2 xfrm policy add \ + ip -netns $host2 xfrm policy add \ src ${h1_4} dst ${h2_4} dir in \ tmpl src ${HOST1_4} dst ${HOST2_4} proto esp mode tunnel # host1 - IPv4 in - ip -netns host1 xfrm policy add \ + ip -netns $host1 xfrm policy add \ src ${h2_4} dst ${h1_4} ${devarg} dir in \ tmpl src ${HOST2_4} dst ${HOST1_4} proto esp mode tunnel # host2 - IPv4 out - ip -netns host2 xfrm policy add \ + ip -netns $host2 xfrm policy add \ src ${h2_4} dst ${h1_4} dir out \ tmpl src ${HOST2_4} dst ${HOST1_4} proto esp mode tunnel # host1 - IPv6 out - ip -6 -netns host1 xfrm policy add \ + ip -6 -netns $host1 xfrm policy add \ src ${h1_6} dst ${h2_6} ${devarg} dir out \ tmpl src ${HOST1_6} dst ${HOST2_6} proto esp mode tunnel # host2 - IPv6 in - ip -6 -netns 
host2 xfrm policy add \ + ip -6 -netns $host2 xfrm policy add \ src ${h1_6} dst ${h2_6} dir in \ tmpl src ${HOST1_6} dst ${HOST2_6} proto esp mode tunnel # host1 - IPv6 in - ip -6 -netns host1 xfrm policy add \ + ip -6 -netns $host1 xfrm policy add \ src ${h2_6} dst ${h1_6} ${devarg} dir in \ tmpl src ${HOST2_6} dst ${HOST1_6} proto esp mode tunnel # host2 - IPv6 out -
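The conversions in this series lean on the net selftest lib.sh helpers (setup_ns/cleanup_ns), which suffix each requested name with random characters so concurrent test runs cannot collide on a fixed namespace name like "host1". A standalone sketch of that naming idea (my illustration, with a hypothetical gen_ns_name helper -- the real setup_ns also creates the netns and brings up its loopback device):

```shell
#!/bin/sh
# Derive a collision-resistant namespace name from a base like "host1".
gen_ns_name() {
	# mktemp -u only generates a unique-looking name; it creates nothing.
	echo "$1-$(mktemp -u XXXXXX)"
}

host1=$(gen_ns_name host1)
host2=$(gen_ns_name host2)

# Teardown can now target exactly the namespaces this run created,
# instead of deleting a shared, hard-coded "host1"/"host2".
echo "would create/delete: $host1 $host2"
```

This is why the diffs above replace every literal `host1`/`host2` with `$host1`/`$host2`: the name is only known at runtime once setup_ns has picked the suffix.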
[PATCH net-next 8/9] selftests/net: convert vrf_strict_mode_test.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf_strict_mode_test.sh TEST SECTION: VRF strict_mode test on init network namespace TEST: init: net.vrf.strict_mode is available[ OK ] TEST: init: strict_mode=0 by default, 0 vrfs[ OK ] ... TEST: init: check strict_mode=1 [ OK ] TEST: testns-HvoZkB: check strict_mode=0[ OK ] Tests passed: 37 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/vrf_strict_mode_test.sh | 47 +-- 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/net/vrf_strict_mode_test.sh b/tools/testing/selftests/net/vrf_strict_mode_test.sh index 417d214264f3..01552b542544 100755 --- a/tools/testing/selftests/net/vrf_strict_mode_test.sh +++ b/tools/testing/selftests/net/vrf_strict_mode_test.sh @@ -3,9 +3,7 @@ # This test is designed for testing the new VRF strict_mode functionality. -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 - +source lib.sh ret=0 # identifies the "init" network namespace which is often called root network @@ -247,13 +245,12 @@ setup() { modprobe vrf - ip netns add testns - ip netns exec testns ip link set lo up + setup_ns testns } cleanup() { - ip netns del testns 2>/dev/null + ip netns del $testns 2>/dev/null ip link del vrf100 2>/dev/null ip link del vrf101 2>/dev/null @@ -298,28 +295,28 @@ vrf_strict_mode_tests_testns() { log_section "VRF strict_mode test on testns network namespace" - vrf_strict_mode_check_support testns + vrf_strict_mode_check_support $testns - strict_mode_check_default testns + strict_mode_check_default $testns - enable_strict_mode_and_check testns + enable_strict_mode_and_check $testns - add_vrf_and_check testns vrf100 100 - config_vrf_and_check testns 10.0.100.1/24 vrf100 + add_vrf_and_check $testns vrf100 100 + config_vrf_and_check $testns 10.0.100.1/24 vrf100 - add_vrf_and_check_fail testns vrf101 100 + add_vrf_and_check_fail $testns vrf101 100 - add_vrf_and_check_fail testns vrf102 100 + add_vrf_and_check_fail 
$testns vrf102 100 - add_vrf_and_check testns vrf200 200 + add_vrf_and_check $testns vrf200 200 - disable_strict_mode_and_check testns + disable_strict_mode_and_check $testns - add_vrf_and_check testns vrf101 100 + add_vrf_and_check $testns vrf101 100 - add_vrf_and_check testns vrf102 100 + add_vrf_and_check $testns vrf102 100 - #the strict_mode is disabled in the testns + #the strict_mode is disabled in the $testns } vrf_strict_mode_tests_mix() @@ -328,25 +325,25 @@ vrf_strict_mode_tests_mix() read_strict_mode_compare_and_check init 1 - read_strict_mode_compare_and_check testns 0 + read_strict_mode_compare_and_check $testns 0 - del_vrf_and_check testns vrf101 + del_vrf_and_check $testns vrf101 - del_vrf_and_check testns vrf102 + del_vrf_and_check $testns vrf102 disable_strict_mode_and_check init - enable_strict_mode_and_check testns + enable_strict_mode_and_check $testns enable_strict_mode_and_check init enable_strict_mode_and_check init - disable_strict_mode_and_check testns - disable_strict_mode_and_check testns + disable_strict_mode_and_check $testns + disable_strict_mode_and_check $testns read_strict_mode_compare_and_check init 1 - read_strict_mode_compare_and_check testns 0 + read_strict_mode_compare_and_check $testns 0 } -- 2.43.0
[PATCH net-next 7/9] selftests/net: convert vrf_route_leaking.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf_route_leaking.sh ### IPv4 (sym route): VRF ICMP ttl error route lookup ping ### TEST: Basic IPv4 connectivity [ OK ] TEST: Ping received ICMP ttl exceeded [ OK ] ... TEST: Basic IPv6 connectivity [ OK ] TEST: Traceroute6 reports a hop on r1 [ OK ] Tests passed: 18 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/vrf_route_leaking.sh| 201 +- 1 file changed, 96 insertions(+), 105 deletions(-) diff --git a/tools/testing/selftests/net/vrf_route_leaking.sh b/tools/testing/selftests/net/vrf_route_leaking.sh index dedc52562b4f..2da32f4c479b 100755 --- a/tools/testing/selftests/net/vrf_route_leaking.sh +++ b/tools/testing/selftests/net/vrf_route_leaking.sh @@ -58,6 +58,7 @@ # to send an ICMP error back to the source when the ttl of a packet reaches 1 # while it is forwarded between different vrfs. +source lib.sh VERBOSE=0 PAUSE_ON_FAIL=no DEFAULT_TTYPE=sym @@ -171,11 +172,7 @@ run_cmd_grep() cleanup() { - local ns - - for ns in h1 h2 r1 r2; do - ip netns del $ns 2>/dev/null - done + cleanup_ns $h1 $h2 $r1 $r2 } setup_vrf() @@ -212,72 +209,69 @@ setup_sym() # # create nodes as namespaces - # - for ns in h1 h2 r1; do - ip netns add $ns - ip -netns $ns link set lo up - - case "${ns}" in - h[12]) ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0 - ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 - ;; - r1)ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1 - ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1 - esac + setup_ns h1 h2 r1 + for ns in $h1 $h2 $r1; do + if echo $ns | grep -q h[12]-; then + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0 + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 + else + ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1 + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1 + fi done # # create interconnects # - ip -netns h1 link add eth0 type veth peer name r1h1 - 
ip -netns h1 link set r1h1 netns r1 name eth0 up + ip -netns $h1 link add eth0 type veth peer name r1h1 + ip -netns $h1 link set r1h1 netns $r1 name eth0 up - ip -netns h2 link add eth0 type veth peer name r1h2 - ip -netns h2 link set r1h2 netns r1 name eth1 up + ip -netns $h2 link add eth0 type veth peer name r1h2 + ip -netns $h2 link set r1h2 netns $r1 name eth1 up # # h1 # - ip -netns h1 addr add dev eth0 ${H1_N1_IP}/24 - ip -netns h1 -6 addr add dev eth0 ${H1_N1_IP6}/64 nodad - ip -netns h1 link set eth0 up + ip -netns $h1 addr add dev eth0 ${H1_N1_IP}/24 + ip -netns $h1 -6 addr add dev eth0 ${H1_N1_IP6}/64 nodad + ip -netns $h1 link set eth0 up # h1 to h2 via r1 - ip -netns h1route add ${H2_N2} via ${R1_N1_IP} dev eth0 - ip -netns h1 -6 route add ${H2_N2_6} via "${R1_N1_IP6}" dev eth0 + ip -netns $h1route add ${H2_N2} via ${R1_N1_IP} dev eth0 + ip -netns $h1 -6 route add ${H2_N2_6} via "${R1_N1_IP6}" dev eth0 # # h2 # - ip -netns h2 addr add dev eth0 ${H2_N2_IP}/24 - ip -netns h2 -6 addr add dev eth0 ${H2_N2_IP6}/64 nodad - ip -netns h2 link set eth0 up + ip -netns $h2 addr add dev eth0 ${H2_N2_IP}/24 + ip -netns $h2 -6 addr add dev eth0 ${H2_N2_IP6}/64 nodad + ip -netns $h2 link set eth0 up # h2 to h1 via r1 - ip -netns h2 route add default via ${R1_N2_IP} dev eth0 - ip -netns h2 -6 route add default via ${R1_N2_IP6} dev eth0 + ip -netns $h2 route add default via ${R1_N2_IP} dev eth0 + ip -netns $h2 -6 route add default via ${R1_N2_IP6} dev eth0 # # r1 # - setup_vrf r1 - create_vrf r1 blue 1101 - create_vrf r1 red 1102 - ip -netns r1 link set mtu 1400 dev eth1 - ip -netns r1 link set eth0 vrf blue up - ip -netns r1 link set eth1 vrf red up - ip -netns r1 addr add dev eth0 ${R1_N1_IP}/24 - ip -netns r1 -6 addr add dev eth0 ${R1_N1_IP6}/64 nodad - ip -netns r1 addr add dev eth1 ${R1_N2_IP}/24 - ip -netns r1 -6 addr add dev eth1 ${R1_N2_IP6}/64 nodad + setup_vrf $r1 +
[PATCH net-next 6/9] selftests/net: convert test_vxlan_vnifiltering.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_vnifiltering.sh TEST: Create traditional vxlan device [ OK ] TEST: Cannot create vnifilter device without external flag [ OK ] TEST: Creating external vxlan device with vnifilter flag[ OK ] ... TEST: VM connectivity over traditional vxlan (ipv6 default rdst)[ OK ] TEST: VM connectivity over metadata nonfiltering vxlan (ipv4 default rdst) [ OK ] Tests passed: 27 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_vnifiltering.sh | 154 +++--- 1 file changed, 95 insertions(+), 59 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_vnifiltering.sh b/tools/testing/selftests/net/test_vxlan_vnifiltering.sh index 8c3ac0a72545..6127a78ee988 100755 --- a/tools/testing/selftests/net/test_vxlan_vnifiltering.sh +++ b/tools/testing/selftests/net/test_vxlan_vnifiltering.sh @@ -78,10 +78,8 @@ # # # This test tests the new vxlan vnifiltering api - +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # all tests in this script. Can be overridden with -t option TESTS=" @@ -148,18 +146,18 @@ run_cmd() } check_hv_connectivity() { - ip netns exec hv-1 ping -c 1 -W 1 $1 &>/dev/null + ip netns exec $hv_1 ping -c 1 -W 1 $1 &>/dev/null sleep 1 - ip netns exec hv-1 ping -c 1 -W 1 $2 &>/dev/null + ip netns exec $hv_1 ping -c 1 -W 1 $2 &>/dev/null return $? } check_vm_connectivity() { - run_cmd "ip netns exec vm-11 ping -c 1 -W 1 10.0.10.12" + run_cmd "ip netns exec $vm_11 ping -c 1 -W 1 10.0.10.12" log_test $? 0 "VM connectivity over $1 (ipv4 default rdst)" - run_cmd "ip netns exec vm-21 ping -c 1 -W 1 10.0.10.22" + run_cmd "ip netns exec $vm_21 ping -c 1 -W 1 10.0.10.22" log_test $? 
0 "VM connectivity over $1 (ipv6 default rdst)" } @@ -167,26 +165,23 @@ cleanup() { ip link del veth-hv-1 2>/dev/null || true ip link del vethhv-11 vethhv-12 vethhv-21 vethhv-22 2>/dev/null || true - for ns in hv-1 hv-2 vm-11 vm-21 vm-12 vm-22 vm-31 vm-32; do - ip netns del $ns 2>/dev/null || true - done + cleanup_ns $hv_1 $hv_2 $vm_11 $vm_21 $vm_12 $vm_22 $vm_31 $vm_32 } trap cleanup EXIT setup-hv-networking() { - hv=$1 + id=$1 local1=$2 mask1=$3 local2=$4 mask2=$5 - ip netns add hv-$hv - ip link set veth-hv-$hv netns hv-$hv - ip -netns hv-$hv link set veth-hv-$hv name veth0 - ip -netns hv-$hv addr add $local1/$mask1 dev veth0 - ip -netns hv-$hv addr add $local2/$mask2 dev veth0 - ip -netns hv-$hv link set veth0 up + ip link set veth-hv-$id netns ${hv[$id]} + ip -netns ${hv[$id]} link set veth-hv-$id name veth0 + ip -netns ${hv[$id]} addr add $local1/$mask1 dev veth0 + ip -netns ${hv[$id]} addr add $local2/$mask2 dev veth0 + ip -netns ${hv[$id]} link set veth0 up } # Setups a "VM" simulated by a netns an a veth pair @@ -208,21 +203,20 @@ setup-vm() { lastvxlandev="" # create bridge - ip -netns hv-$hvid link add br$brid type bridge vlan_filtering 1 vlan_default_pvid 0 \ + ip -netns ${hv[$hvid]} link add br$brid type bridge vlan_filtering 1 vlan_default_pvid 0 \ mcast_snooping 0 - ip -netns hv-$hvid link set br$brid up + ip -netns ${hv[$hvid]} link set br$brid up # create vm namespace and interfaces and connect to hypervisor # namespace - ip netns add vm-$vmid hvvethif="vethhv-$vmid" vmvethif="veth-$vmid" ip link add $hvvethif type veth peer name $vmvethif - ip link set $hvvethif netns hv-$hvid - ip link set $vmvethif netns vm-$vmid - ip -netns hv-$hvid link set $hvvethif up - ip -netns vm-$vmid link set $vmvethif up - ip -netns hv-$hvid link set $hvvethif master br$brid + ip link set $hvvethif netns ${hv[$hvid]} + ip link set $vmvethif netns ${vm[$vmid]} + ip -netns ${hv[$hvid]} link set $hvvethif up + ip -netns ${vm[$vmid]} link set $vmvethif up + ip -netns 
${hv[$hvid]} link set $hvvethif master br$brid # configure VM vlan/vni filtering on hypervisor for vmap in $(echo $vattrs | cut -d "," -f1- --output-delimiter=' ') @@ -234,9 +228,9 @@ setup-vm() { local vtype=$(echo $vmap | awk -F'-' '{print ($5)}') local port=$(echo $vmap | awk -F'-' '{print ($6)}') - ip -netns vm-$vmid link add name $vmvethif.$vid link $vmvethif type vlan id $vid - ip -netns vm-$vmid addr add 10.0.$vid.$vmid/24 dev $vmvethif.$vid - ip -netns vm-$vmid link set $vmvethif.$vid up + ip -netns ${vm[$vmid]} link add name $vmvethif.$vid link $vmvethif type vlan id $vid + ip -netns ${vm[$vmid]} addr add 10.0.$vid.$vmid/24 dev
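The vlan/vni map entries in the loop above are plain "-"-delimited strings: cut expands the comma-separated list and awk pulls out individual fields. The mechanics can be exercised standalone; the entry below is made up, and only the field positions used in the loop ($1, $5) are assumed:

```shell
# Hypothetical vattrs value; only the cut/awk mechanics matter here.
vattrs="10-v4-239.1.1.100-2-vnifilterg-4789,20-v4-239.1.1.200-3-vnifilterg-4789"

for vmap in $(echo $vattrs | cut -d "," -f1- --output-delimiter=' ')
do
	vid=$(echo $vmap | awk -F'-' '{print ($1)}')
	vtype=$(echo $vmap | awk -F'-' '{print ($5)}')
	echo "vid=$vid vtype=$vtype"
done
# prints: vid=10 vtype=vnifilterg
#         vid=20 vtype=vnifilterg
```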
[PATCH net-next 5/9] selftests/net: convert test_vxlan_under_vrf.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_under_vrf.sh Checking HV connectivity [ OK ] Check VM connectivity through VXLAN (underlay in the default VRF) [ OK ] Check VM connectivity through VXLAN (underlay in a VRF)[ OK ] Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_under_vrf.sh | 70 ++- 1 file changed, 36 insertions(+), 34 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_under_vrf.sh b/tools/testing/selftests/net/test_vxlan_under_vrf.sh index 1fd1250ebc66..ae8fbe3f0779 100755 --- a/tools/testing/selftests/net/test_vxlan_under_vrf.sh +++ b/tools/testing/selftests/net/test_vxlan_under_vrf.sh @@ -43,15 +43,14 @@ # This tests both the connectivity between vm-1 and vm-2, and that the underlay # can be moved in and out of the vrf by unsetting and setting veth0's master. +source lib.sh set -e cleanup() { ip link del veth-hv-1 2>/dev/null || true ip link del veth-tap 2>/dev/null || true -for ns in hv-1 hv-2 vm-1 vm-2; do -ip netns del $ns 2>/dev/null || true -done +cleanup_ns $hv_1 $hv_2 $vm_1 $vm_2 } # Clean start @@ -60,72 +59,75 @@ cleanup &> /dev/null [[ $1 == "clean" ]] && exit 0 trap cleanup EXIT +setup_ns hv_1 hv_2 vm_1 vm_2 +hv[1]=$hv_1 +hv[2]=$hv_2 +vm[1]=$vm_1 +vm[2]=$vm_2 # Setup "Hypervisors" simulated with netns ip link add veth-hv-1 type veth peer name veth-hv-2 setup-hv-networking() { -hv=$1 +id=$1 -ip netns add hv-$hv -ip link set veth-hv-$hv netns hv-$hv -ip -netns hv-$hv link set veth-hv-$hv name veth0 +ip link set veth-hv-$id netns ${hv[$id]} +ip -netns ${hv[$id]} link set veth-hv-$id name veth0 -ip -netns hv-$hv link add vrf-underlay type vrf table 1 -ip -netns hv-$hv link set vrf-underlay up -ip -netns hv-$hv addr add 172.16.0.$hv/24 dev veth0 -ip -netns hv-$hv link set veth0 up +ip -netns ${hv[$id]} link add vrf-underlay type vrf table 1 +ip -netns ${hv[$id]} link set vrf-underlay up +ip -netns ${hv[$id]} addr add 172.16.0.$id/24 dev veth0 +ip -netns ${hv[$id]} link set veth0 up -ip 
-netns hv-$hv link add br0 type bridge -ip -netns hv-$hv link set br0 up +ip -netns ${hv[$id]} link add br0 type bridge +ip -netns ${hv[$id]} link set br0 up -ip -netns hv-$hv link add vxlan0 type vxlan id 10 local 172.16.0.$hv dev veth0 dstport 4789 -ip -netns hv-$hv link set vxlan0 master br0 -ip -netns hv-$hv link set vxlan0 up +ip -netns ${hv[$id]} link add vxlan0 type vxlan id 10 local 172.16.0.$id dev veth0 dstport 4789 +ip -netns ${hv[$id]} link set vxlan0 master br0 +ip -netns ${hv[$id]} link set vxlan0 up } setup-hv-networking 1 setup-hv-networking 2 # Check connectivity between HVs by pinging hv-2 from hv-1 echo -n "Checking HV connectivity " -ip netns exec hv-1 ping -c 1 -W 1 172.16.0.2 &> /dev/null || (echo "[FAIL]"; false) +ip netns exec $hv_1 ping -c 1 -W 1 172.16.0.2 &> /dev/null || (echo "[FAIL]"; false) echo "[ OK ]" # Setups a "VM" simulated by a netns an a veth pair setup-vm() { id=$1 -ip netns add vm-$id ip link add veth-tap type veth peer name veth-hv -ip link set veth-tap netns hv-$id -ip -netns hv-$id link set veth-tap master br0 -ip -netns hv-$id link set veth-tap up +ip link set veth-tap netns ${hv[$id]} +ip -netns ${hv[$id]} link set veth-tap master br0 +ip -netns ${hv[$id]} link set veth-tap up ip link set veth-hv address 02:1d:8d:dd:0c:6$id -ip link set veth-hv netns vm-$id -ip -netns vm-$id addr add 10.0.0.$id/24 dev veth-hv -ip -netns vm-$id link set veth-hv up +ip link set veth-hv netns ${vm[$id]} +ip -netns ${vm[$id]} addr add 10.0.0.$id/24 dev veth-hv +ip -netns ${vm[$id]} link set veth-hv up } setup-vm 1 setup-vm 2 # Setup VTEP routes to make ARP work -bridge -netns hv-1 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.2 self permanent -bridge -netns hv-2 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.1 self permanent +bridge -netns $hv_1 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.2 self permanent +bridge -netns $hv_2 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.1 self permanent echo -n "Check VM connectivity 
through VXLAN (underlay in the default VRF) " -ip netns exec vm-1 ping -c 1 -W 1 10.0.0.2 &> /dev/null || (echo "[FAIL]"; false) +ip netns exec $vm_1 ping -c 1 -W 1 10.0.0.2 &> /dev/null || (echo "[FAIL]"; false) echo "[ OK ]" # Move the underlay to a non-default VRF -ip -netns hv-1 link set veth0 vrf vrf-underlay -ip -netns hv-1 link set vxlan0 down -ip -netns hv-1 link set vxlan0 up -ip -netns hv-2 link set veth0 vrf vrf-underlay -ip -netns hv-2 link set vxlan0 down -ip -netns hv-2 link set vxlan0 up +ip -netns $hv_1 link set veth0 vrf vrf-underlay +ip -netns $hv_1 link set vxlan0 down +ip -netns
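The ids the script passes around (1, 2) are kept by mapping them to the unique namespace names through a bash array, so call sites only change from "hv-$id" to "${hv[$id]}". A standalone sketch of that mapping (placeholder names; no namespaces are created here):

```shell
# Placeholder unique names standing in for what setup_ns would assign.
hv[1]=hv-aB3xYz
hv[2]=hv-Qw9Lmn

setup_hv_networking() {
	local id=$1
	# Stands in for the real "ip -netns ${hv[$id]} ..." commands.
	echo "would run: ip -netns ${hv[$id]} link set veth0 up"
}

setup_hv_networking 1
setup_hv_networking 2
```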
[PATCH net-next 4/9] selftests/net: convert test_vxlan_nolocalbypass.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_nolocalbypass.sh TEST: localbypass enabled [ OK ] TEST: Packet received by local VXLAN device - localbypass [ OK ] TEST: localbypass disabled [ OK ] TEST: Packet not received by local VXLAN device - nolocalbypass [ OK ] TEST: localbypass enabled [ OK ] TEST: Packet received by local VXLAN device - localbypass [ OK ] Tests passed: 6 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_nolocalbypass.sh | 48 +-- 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh b/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh index f75212bf142c..b8805983b728 100755 --- a/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh +++ b/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh @@ -9,9 +9,8 @@ # option and verifies that packets are no longer received by the second VXLAN # device. +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. 
-ksft_skip=4 TESTS=" nolocalbypass @@ -98,20 +97,19 @@ tc_check_packets() setup() { - ip netns add ns1 + setup_ns ns1 - ip -n ns1 link set dev lo up - ip -n ns1 address add 192.0.2.1/32 dev lo - ip -n ns1 address add 198.51.100.1/32 dev lo + ip -n $ns1 address add 192.0.2.1/32 dev lo + ip -n $ns1 address add 198.51.100.1/32 dev lo - ip -n ns1 link add name vx0 up type vxlan id 100 local 198.51.100.1 \ + ip -n $ns1 link add name vx0 up type vxlan id 100 local 198.51.100.1 \ dstport 4789 nolearning - ip -n ns1 link add name vx1 up type vxlan id 100 dstport 4790 + ip -n $ns1 link add name vx1 up type vxlan id 100 dstport 4790 } cleanup() { - ip netns del ns1 &> /dev/null + cleanup_ns $ns1 } @@ -122,40 +120,40 @@ nolocalbypass() local smac=00:01:02:03:04:05 local dmac=00:0a:0b:0c:0d:0e - run_cmd "bridge -n ns1 fdb add $dmac dev vx0 self static dst 192.0.2.1 port 4790" + run_cmd "bridge -n $ns1 fdb add $dmac dev vx0 self static dst 192.0.2.1 port 4790" - run_cmd "tc -n ns1 qdisc add dev vx1 clsact" - run_cmd "tc -n ns1 filter add dev vx1 ingress pref 1 handle 101 proto all flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $ns1 qdisc add dev vx1 clsact" + run_cmd "tc -n $ns1 filter add dev vx1 ingress pref 1 handle 101 proto all flower src_mac $smac dst_mac $dmac action pass" - run_cmd "tc -n ns1 qdisc add dev lo clsact" - run_cmd "tc -n ns1 filter add dev lo ingress pref 1 handle 101 proto ip flower ip_proto udp dst_port 4790 action drop" + run_cmd "tc -n $ns1 qdisc add dev lo clsact" + run_cmd "tc -n $ns1 filter add dev lo ingress pref 1 handle 101 proto ip flower ip_proto udp dst_port 4790 action drop" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" log_test $? 
0 "localbypass enabled" - run_cmd "ip netns exec ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" + run_cmd "ip netns exec $ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" - tc_check_packets "ns1" "dev vx1 ingress" 101 1 + tc_check_packets "$ns1" "dev vx1 ingress" 101 1 log_test $? 0 "Packet received by local VXLAN device - localbypass" - run_cmd "ip -n ns1 link set dev vx0 type vxlan nolocalbypass" + run_cmd "ip -n $ns1 link set dev vx0 type vxlan nolocalbypass" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == false'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == false'" log_test $? 0 "localbypass disabled" - run_cmd "ip netns exec ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" + run_cmd "ip netns exec $ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" - tc_check_packets "ns1" "dev vx1 ingress" 101 1 + tc_check_packets "$ns1" "dev vx1 ingress" 101 1 log_test $? 0 "Packet not received by local VXLAN device - nolocalbypass" - run_cmd "ip -n ns1 link set dev vx0 type vxlan localbypass" + run_cmd "ip -n $ns1 link set dev vx0 type vxlan localbypass" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'"
[PATCH net-next 3/9] selftests/net: convert test_vxlan_mdb.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_mdb.sh Control path: Basic (*, G) operations - IPv4 overlay / IPv4 underlay TEST: MDB entry addition[ OK ] ... Data path: MDB torture test - IPv6 overlay / IPv6 underlay -- TEST: Torture test [ OK ] Tests passed: 620 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- tools/testing/selftests/net/test_vxlan_mdb.sh | 202 +- 1 file changed, 99 insertions(+), 103 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_mdb.sh b/tools/testing/selftests/net/test_vxlan_mdb.sh index 6e996f8063cd..6725fd9157b9 100755 --- a/tools/testing/selftests/net/test_vxlan_mdb.sh +++ b/tools/testing/selftests/net/test_vxlan_mdb.sh @@ -55,9 +55,8 @@ # | ns2_v4 | | ns2_v6 | # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 CONTROL_PATH_TESTS=" basic_star_g_ipv4_ipv4 @@ -260,9 +259,6 @@ setup_common() local local_addr1=$1; shift local local_addr2=$1; shift - ip netns add $ns1 - ip netns add $ns2 - ip link add name veth0 type veth peer name veth1 ip link set dev veth0 netns $ns1 name veth0 ip link set dev veth1 netns $ns2 name veth0 @@ -273,36 +269,36 @@ setup_common() setup_v4() { - setup_common ns1_v4 ns2_v4 192.0.2.1 192.0.2.2 + setup_ns ns1_v4 ns2_v4 + setup_common $ns1_v4 $ns2_v4 192.0.2.1 192.0.2.2 - ip -n ns1_v4 address add 192.0.2.17/28 dev veth0 - ip -n ns2_v4 address add 192.0.2.18/28 dev veth0 + ip -n $ns1_v4 address add 192.0.2.17/28 dev veth0 + ip -n $ns2_v4 address add 192.0.2.18/28 dev veth0 - ip -n ns1_v4 route add default via 192.0.2.18 - ip -n ns2_v4 route add default via 192.0.2.17 + ip -n $ns1_v4 route add default via 192.0.2.18 + ip -n $ns2_v4 route add default via 192.0.2.17 } cleanup_v4() { - ip netns del ns2_v4 - ip netns del ns1_v4 + cleanup_ns $ns2_v4 $ns1_v4 } setup_v6() { - setup_common ns1_v6 ns2_v6 2001:db8:1::1 2001:db8:1::2 + setup_ns ns1_v6 ns2_v6 + setup_common $ns1_v6 $ns2_v6 2001:db8:1::1 2001:db8:1::2 - ip -n ns1_v6 address add 
2001:db8:2::1/64 dev veth0 nodad - ip -n ns2_v6 address add 2001:db8:2::2/64 dev veth0 nodad + ip -n $ns1_v6 address add 2001:db8:2::1/64 dev veth0 nodad + ip -n $ns2_v6 address add 2001:db8:2::2/64 dev veth0 nodad - ip -n ns1_v6 route add default via 2001:db8:2::2 - ip -n ns2_v6 route add default via 2001:db8:2::1 + ip -n $ns1_v6 route add default via 2001:db8:2::2 + ip -n $ns2_v6 route add default via 2001:db8:2::1 } cleanup_v6() { - ip netns del ns2_v6 - ip netns del ns1_v6 + cleanup_ns $ns2_v6 $ns1_v6 } setup() @@ -433,7 +429,7 @@ basic_common() basic_star_g_ipv4_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp 239.1.1.1" local vtep_ip=198.51.100.100 @@ -446,7 +442,7 @@ basic_star_g_ipv4_ipv4() basic_star_g_ipv6_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp ff0e::1" local vtep_ip=198.51.100.100 @@ -459,7 +455,7 @@ basic_star_g_ipv6_ipv4() basic_star_g_ipv4_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp 239.1.1.1" local vtep_ip=2001:db8:1000::1 @@ -472,7 +468,7 @@ basic_star_g_ipv4_ipv6() basic_star_g_ipv6_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp ff0e::1" local vtep_ip=2001:db8:1000::1 @@ -485,7 +481,7 @@ basic_star_g_ipv6_ipv6() basic_sg_ipv4_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp 239.1.1.1 src 192.0.2.129" local vtep_ip=198.51.100.100 @@ -498,7 +494,7 @@ basic_sg_ipv4_ipv4() basic_sg_ipv6_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp ff0e::1 src 2001:db8:100::1" local vtep_ip=198.51.100.100 @@ -511,7 +507,7 @@ basic_sg_ipv6_ipv4() basic_sg_ipv4_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp 239.1.1.1 src 192.0.2.129" local vtep_ip=2001:db8:1000::1 @@ -524,7 +520,7 @@ basic_sg_ipv4_ipv6() basic_sg_ipv6_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp ff0e::1 src 2001:db8:100::1" local vtep_ip=2001:db8:1000::1 @@ -694,7 +690,7 @@ star_g_common() star_g_ipv4_ipv4() { - local ns1=ns1_v4 + local 
ns1=$ns1_v4 local grp=239.1.1.1 local
[PATCH net-next 2/9] selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_bridge_neigh_suppress.sh Per-port ARP suppression - VLAN 10 -- TEST: arping[ OK ] TEST: ARP suppression [ OK ] ... TEST: NS suppression (VLAN 20) [ OK ] Tests passed: 148 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../net/test_bridge_neigh_suppress.sh | 331 +- 1 file changed, 162 insertions(+), 169 deletions(-) diff --git a/tools/testing/selftests/net/test_bridge_neigh_suppress.sh b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh index d80f2cd87614..8533393a4f18 100755 --- a/tools/testing/selftests/net/test_bridge_neigh_suppress.sh +++ b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh @@ -45,9 +45,8 @@ # | sw1| | sw2| # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # All tests in this script. Can be overridden with -t option. TESTS=" @@ -140,9 +139,6 @@ setup_topo_ns() { local ns=$1; shift - ip netns add $ns - ip -n $ns link set dev lo up - ip netns exec $ns sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1 ip netns exec $ns sysctl -qw net.ipv6.conf.default.ignore_routes_with_linkdown=1 ip netns exec $ns sysctl -qw net.ipv6.conf.all.accept_dad=0 @@ -153,21 +149,22 @@ setup_topo() { local ns - for ns in h1 h2 sw1 sw2; do + setup_ns h1 h2 sw1 sw2 + for ns in $h1 $h2 $sw1 $sw2; do setup_topo_ns $ns done ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns h1 name eth0 - ip link set dev veth1 netns sw1 name swp1 + ip link set dev veth0 netns $h1 name eth0 + ip link set dev veth1 netns $sw1 name swp1 ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns sw1 name veth0 - ip link set dev veth1 netns sw2 name veth0 + ip link set dev veth0 netns $sw1 name veth0 + ip link set dev veth1 netns $sw2 name veth0 ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns h2 name eth0 - ip link set dev veth1 netns sw2 name swp1 + ip link set dev veth0 netns $h2 name eth0 
+ ip link set dev veth1 netns $sw2 name swp1 } setup_host_common() @@ -190,7 +187,7 @@ setup_host_common() setup_h1() { - local ns=h1 + local ns=$h1 local v4addr1=192.0.2.1/28 local v4addr2=192.0.2.17/28 local v6addr1=2001:db8:1::1/64 @@ -201,7 +198,7 @@ setup_h1() setup_h2() { - local ns=h2 + local ns=$h2 local v4addr1=192.0.2.2/28 local v4addr2=192.0.2.18/28 local v6addr1=2001:db8:1::2/64 @@ -254,7 +251,7 @@ setup_sw_common() setup_sw1() { - local ns=sw1 + local ns=$sw1 local local_addr=192.0.2.33 local remote_addr=192.0.2.34 local veth_addr=192.0.2.49 @@ -265,7 +262,7 @@ setup_sw1() setup_sw2() { - local ns=sw2 + local ns=$sw2 local local_addr=192.0.2.34 local remote_addr=192.0.2.33 local veth_addr=192.0.2.50 @@ -291,11 +288,7 @@ setup() cleanup() { - local ns - - for ns in h1 h2 sw1 sw2; do - ip netns del $ns &> /dev/null - done + cleanup_ns $h1 $h2 $sw1 $sw2 } @@ -312,80 +305,80 @@ neigh_suppress_arp_common() echo "Per-port ARP suppression - VLAN $vid" echo "--" - run_cmd "tc -n sw1 qdisc replace dev vx0 clsact" - run_cmd "tc -n sw1 filter replace dev vx0 egress pref 1 handle 101 proto 0x0806 flower indev swp1 arp_tip $tip arp_sip $sip arp_op request action pass" + run_cmd "tc -n $sw1 qdisc replace dev vx0 clsact" + run_cmd "tc -n $sw1 filter replace dev vx0 egress pref 1 handle 101 proto 0x0806 flower indev swp1 arp_tip $tip arp_sip $sip arp_op request action pass" # Initial state - check that ARP requests are not suppressed and that # ARP replies are received. - run_cmd "ip netns exec h1 arping -q -b -c 1 -w 5 -s $sip -I eth0.$vid $tip" + run_cmd "ip netns exec $h1 arping -q -b -c 1 -w 5 -s $sip -I eth0.$vid $tip" log_test $? 0 "arping" - tc_check_packets sw1 "dev vx0 egress" 101 1 + tc_check_packets $sw1 "dev vx0 egress" 101 1 log_test $? 0 "ARP suppression" # Enable neighbor suppression and check that nothing changes compared # to the initial state. - run_cmd "bridge -n sw1 link set dev vx0 neigh_suppress on" - run_cmd "bridge -n sw1 -d
[PATCH net-next 1/9] selftests/net: convert test_bridge_backup_port.sh to run it in unique namespace
There is no h1 h2 actually. Remove it. Here is the test result after conversion. ]# ./test_bridge_backup_port.sh Backup port --- TEST: Forwarding out of swp1[ OK ] TEST: No forwarding out of vx0 [ OK ] TEST: swp1 carrier off [ OK ] TEST: No forwarding out of swp1 [ OK ] ... Backup nexthop ID - ping TEST: Ping with backup nexthop ID [ OK ] TEST: Ping after disabling backup nexthop ID[ OK ] Backup nexthop ID - torture test TEST: Torture test [ OK ] Tests passed: 83 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_bridge_backup_port.sh | 371 +- 1 file changed, 182 insertions(+), 189 deletions(-) diff --git a/tools/testing/selftests/net/test_bridge_backup_port.sh b/tools/testing/selftests/net/test_bridge_backup_port.sh index 112cfd8a10ad..70a7d87ba2d2 100755 --- a/tools/testing/selftests/net/test_bridge_backup_port.sh +++ b/tools/testing/selftests/net/test_bridge_backup_port.sh @@ -35,9 +35,8 @@ # | sw1| | sw2| # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # All tests in this script. Can be overridden with -t option. 
TESTS=" @@ -132,9 +131,6 @@ setup_topo_ns() { local ns=$1; shift - ip netns add $ns - ip -n $ns link set dev lo up - ip netns exec $ns sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1 ip netns exec $ns sysctl -qw net.ipv6.conf.default.ignore_routes_with_linkdown=1 ip netns exec $ns sysctl -qw net.ipv6.conf.all.accept_dad=0 @@ -145,13 +141,14 @@ setup_topo() { local ns - for ns in sw1 sw2; do + setup_ns sw1 sw2 + for ns in $sw1 $sw2; do setup_topo_ns $ns done ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns sw1 name veth0 - ip link set dev veth1 netns sw2 name veth0 + ip link set dev veth0 netns $sw1 name veth0 + ip link set dev veth1 netns $sw2 name veth0 } setup_sw_common() @@ -190,7 +187,7 @@ setup_sw_common() setup_sw1() { - local ns=sw1 + local ns=$sw1 local local_addr=192.0.2.33 local remote_addr=192.0.2.34 local veth_addr=192.0.2.49 @@ -203,7 +200,7 @@ setup_sw1() setup_sw2() { - local ns=sw2 + local ns=$sw2 local local_addr=192.0.2.34 local remote_addr=192.0.2.33 local veth_addr=192.0.2.50 @@ -229,11 +226,7 @@ setup() cleanup() { - local ns - - for ns in h1 h2 sw1 sw2; do - ip netns del $ns &> /dev/null - done + cleanup_ns $sw1 $sw2 } @@ -248,85 +241,85 @@ backup_port() echo "Backup port" echo "---" - run_cmd "tc -n sw1 qdisc replace dev swp1 clsact" - run_cmd "tc -n sw1 filter replace dev swp1 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $sw1 qdisc replace dev swp1 clsact" + run_cmd "tc -n $sw1 filter replace dev swp1 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" - run_cmd "tc -n sw1 qdisc replace dev vx0 clsact" - run_cmd "tc -n sw1 filter replace dev vx0 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $sw1 qdisc replace dev vx0 clsact" + run_cmd "tc -n $sw1 filter replace dev vx0 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" - run_cmd "bridge -n sw1 
fdb replace $dmac dev swp1 master static vlan 10" + run_cmd "bridge -n $sw1 fdb replace $dmac dev swp1 master static vlan 10" # Initial state - check that packets are forwarded out of swp1 when it # has a carrier and not forwarded out of any port when it does not have # a carrier. - run_cmd "ip netns exec sw1 mausezahn br0.10 -a $smac -b $dmac -A 198.51.100.1 -B 198.51.100.2 -t ip -p 100 -q -c 1" - tc_check_packets sw1 "dev swp1 egress" 101 1 + run_cmd "ip netns exec $sw1 mausezahn br0.10 -a $smac -b $dmac -A 198.51.100.1 -B 198.51.100.2 -t ip -p 100 -q -c 1" + tc_check_packets $sw1 "dev swp1 egress" 101 1 log_test $? 0 "Forwarding out of swp1" - tc_check_packets sw1 "dev vx0 egress" 101 0 + tc_check_packets $sw1 "dev vx0 egress" 101 0 log_test $? 0 "No forwarding out of vx0" - run_cmd "ip -n sw1 link set dev swp1 carrier off" +
[PATCH net-next 0/9] Convert net selftests to run in unique namespace (Part 2)
Here is the 2nd part of converting net selftests to run in unique namespace. This part converts all bridge, vxlan, vrf tests. Here is the part 1 link: https://lore.kernel.org/netdev/20231202020110.362433-1-liuhang...@gmail.com Hangbin Liu (9): selftests/net: convert test_bridge_backup_port.sh to run it in unique namespace selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace selftests/net: convert test_vxlan_mdb.sh to run it in unique namespace selftests/net: convert test_vxlan_nolocalbypass.sh to run it in unique namespace selftests/net: convert test_vxlan_under_vrf.sh to run it in unique namespace selftests/net: convert test_vxlan_vnifiltering.sh to run it in unique namespace selftests/net: convert vrf_route_leaking.sh to run it in unique namespace selftests/net: convert vrf_strict_mode_test.sh to run it in unique namespace selftests/net: convert vrf-xfrm-tests.sh to run it in unique namespace .../selftests/net/test_bridge_backup_port.sh | 371 +- .../net/test_bridge_neigh_suppress.sh | 331 tools/testing/selftests/net/test_vxlan_mdb.sh | 202 +- .../selftests/net/test_vxlan_nolocalbypass.sh | 48 ++- .../selftests/net/test_vxlan_under_vrf.sh | 70 ++-- .../selftests/net/test_vxlan_vnifiltering.sh | 154 +--- tools/testing/selftests/net/vrf-xfrm-tests.sh | 77 ++-- .../selftests/net/vrf_route_leaking.sh| 201 +- .../selftests/net/vrf_strict_mode_test.sh | 47 ++- 9 files changed, 751 insertions(+), 750 deletions(-) -- 2.43.0
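The common thread in all nine conversions is lib.sh's setup_ns/cleanup_ns pair, which replaces hard-coded names like "ns1" with unique, per-run names exported through shell variables. A minimal model of the naming scheme (simplified sketch; the real helper also runs "ip netns add" and brings up loopback, which needs root):

```shell
# Simplified model of lib.sh's setup_ns: each requested name gets a
# unique random suffix and is exported via a variable of that name.
setup_ns() {
	local ns_name
	for ns_name in "$@"; do
		# e.g. h1 -> h1-Xq3bZk; callers then use "$h1" everywhere
		eval "${ns_name}=${ns_name}-$(mktemp -u XXXXXX)"
	done
}

setup_ns h1 sw1
echo "$h1 $sw1"
```

Because every run gets fresh names, two instances of the same selftest no longer collide on namespace names, which is the point of the series.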
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 13:59, Yosry Ahmed wrote: > [..] >>> @@ -526,6 +582,102 @@ static struct zswap_entry >>> *zswap_entry_find_get(struct rb_root *root, >>> return entry; >>> } >>> >>> +/* >>> +* shrinker functions >>> +**/ >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct >>> list_lru_one *l, >>> +spinlock_t *lock, void *arg); >>> + >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, >>> NODE_DATA(sc->nid)); >>> + unsigned long shrink_ret, nr_protected, lru_size; >>> + struct zswap_pool *pool = shrinker->private_data; >>> + bool encountered_page_in_swapcache = false; >>> + >>> + nr_protected = >>> + >>> atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); >>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc); >>> + >>> + /* >>> + * Abort if the shrinker is disabled or if we are shrinking into the >>> + * protected region. >>> + * >>> + * This short-circuiting is necessary because if we have too many >>> multiple >>> + * concurrent reclaimers getting the freeable zswap object counts at >>> the >>> + * same time (before any of them made reasonable progress), the total >>> + * number of reclaimed objects might be more than the number of >>> unprotected >>> + * objects (i.e the reclaimers will reclaim into the protected area >>> of the >>> + * zswap LRU). >>> + */ >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - >>> sc->nr_to_scan) { >>> + sc->nr_scanned = 0; >>> + return SHRINK_STOP; >>> + } >>> + >>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, >>> &shrink_memcg_cb, >>> + &encountered_page_in_swapcache); >>> + >>> + if (encountered_page_in_swapcache) >>> + return SHRINK_STOP; >>> + >>> + return shrink_ret ? 
shrink_ret : SHRINK_STOP; >>> +} >>> + >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct zswap_pool *pool = shrinker->private_data; >>> + struct mem_cgroup *memcg = sc->memcg; >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; >>> + >>> +#ifdef CONFIG_MEMCG_KMEM >>> + cgroup_rstat_flush(memcg->css.cgroup); >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); >>> +#else >>> + /* use pool stats instead of memcg stats */ >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; >>> + nr_stored = atomic_read(&pool->nr_stored); >>> +#endif >>> + >>> + if (!zswap_shrinker_enabled || !nr_stored) >> When I tested with this series, with !zswap_shrinker_enabled in the default >> case, >> I found the performance is much worse than that without this patch. >> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs >> directory. >> >> The reason seems the above cgroup_rstat_flush(), caused much rstat lock >> contention >> to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check >> above >> the cgroup_rstat_flush(), the performance become much better. >> >> Maybe we can put the "zswap_shrinker_enabled" check above >> cgroup_rstat_flush()? > Yes, we should do nothing if !zswap_shrinker_enabled. We should also > use mem_cgroup_flush_stats() here like other places unless accuracy is > crucial, which I doubt given that reclaim uses > mem_cgroup_flush_stats(). > Yes. After changing to use mem_cgroup_flush_stats() here, the performance became much better. > mem_cgroup_flush_stats() has some thresholding to make sure we don't > do flushes unnecessarily, and I have a pending series in mm-unstable > that makes that thresholding per-memcg. 
Keep in mind that adding a > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable, My test branch is linux-next 20231205, and it's all good after changing to use mem_cgroup_flush_stats(memcg). > because the series there adds a memcg argument to > mem_cgroup_flush_stats(). That should be easily amenable though, I can > post a fixlet for my series to add the memcg argument there on top of > users if needed. > It's great. Thanks!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
[..] > > @@ -526,6 +582,102 @@ static struct zswap_entry > > *zswap_entry_find_get(struct rb_root *root, > > return entry; > > } > > > > +/* > > +* shrinker functions > > +**/ > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct > > list_lru_one *l, > > +spinlock_t *lock, void *arg); > > + > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > > + struct shrink_control *sc) > > +{ > > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, > > NODE_DATA(sc->nid)); > > + unsigned long shrink_ret, nr_protected, lru_size; > > + struct zswap_pool *pool = shrinker->private_data; > > + bool encountered_page_in_swapcache = false; > > + > > + nr_protected = > > + > > atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); > > + lru_size = list_lru_shrink_count(&pool->list_lru, sc); > > + > > + /* > > + * Abort if the shrinker is disabled or if we are shrinking into the > > + * protected region. > > + * > > + * This short-circuiting is necessary because if we have too many > > multiple > > + * concurrent reclaimers getting the freeable zswap object counts at > > the > > + * same time (before any of them made reasonable progress), the total > > + * number of reclaimed objects might be more than the number of > > unprotected > > + * objects (i.e the reclaimers will reclaim into the protected area > > of the > > + * zswap LRU). > > + */ > > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - > > sc->nr_to_scan) { > > + sc->nr_scanned = 0; > > + return SHRINK_STOP; > > + } > > + > > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, > > &shrink_memcg_cb, > > + &encountered_page_in_swapcache); > > + > > + if (encountered_page_in_swapcache) > > + return SHRINK_STOP; > > + > > + return shrink_ret ? 
shrink_ret : SHRINK_STOP; > > +} > > + > > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > > + struct shrink_control *sc) > > +{ > > + struct zswap_pool *pool = shrinker->private_data; > > + struct mem_cgroup *memcg = sc->memcg; > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); > > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > > + > > +#ifdef CONFIG_MEMCG_KMEM > > + cgroup_rstat_flush(memcg->css.cgroup); > > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > > +#else > > + /* use pool stats instead of memcg stats */ > > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; > > + nr_stored = atomic_read(&pool->nr_stored); > > +#endif > > + > > + if (!zswap_shrinker_enabled || !nr_stored) > When I tested with this series, with !zswap_shrinker_enabled in the default > case, > I found the performance is much worse than that without this patch. > > Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs > directory. > > The reason seems the above cgroup_rstat_flush(), caused much rstat lock > contention > to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check > above > the cgroup_rstat_flush(), the performance become much better. > > Maybe we can put the "zswap_shrinker_enabled" check above > cgroup_rstat_flush()? Yes, we should do nothing if !zswap_shrinker_enabled. We should also use mem_cgroup_flush_stats() here like other places unless accuracy is crucial, which I doubt given that reclaim uses mem_cgroup_flush_stats(). mem_cgroup_flush_stats() has some thresholding to make sure we don't do flushes unnecessarily, and I have a pending series in mm-unstable that makes that thresholding per-memcg. Keep in mind that adding a call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable, because the series there adds a memcg argument to mem_cgroup_flush_stats(). 
That should be easy to amend, though; I can post a fixlet for my series to
add the memcg argument there on top of yours if needed.

> Thanks!

> > +		return 0;
> > +
> > +	nr_protected =
> > +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +	nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > +	/*
> > +	 * Subtract the lru size by an estimate of the number of pages
> > +	 * that should be protected.
> > +	 */
> > +	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +
> > +	/*
> > +	 * Scale the number of freeable pages by the memory saving factor.
> > +	 * This ensures that the better zswap compresses memory, the fewer
> > +	 * pages we will evict to swap (as it will otherwise incur IO for
> > +	 * relatively small memory saving).
> > +	 */
> > +	return mult_frac(nr_freeable,
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/1 03:40, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
>
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
>
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
>
> * Estimate the number of warm pages residing in zswap, and attempt to
>   protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
>   saving factor. The better zswap compresses the data, the fewer pages
>   we will evict to swap (as we will otherwise incur IO for relatively
>   small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
>   brought into memory, the shrinker will cautiously terminate its
>   shrinking action, as this is a sign that it is touching the warmer
>   region of the zswap LRU.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improve the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Signed-off-by: Nhat Pham
> Acked-by: Johannes Weiner
> ---
>  Documentation/admin-guide/mm/zswap.rst |  10 ++
>  include/linux/mmzone.h                 |   2 +
>  include/linux/zswap.h                  |  25 +++-
>  mm/Kconfig                             |  14 ++
>  mm/mmzone.c                            |   1 +
>  mm/swap_state.c                        |   2 +
>  mm/zswap.c                             | 185 -
>  7 files changed, 233 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 45b98390e938..62fc244ec702 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,16 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +When there is a sizable amount of cold memory residing in the zswap pool, it
> +can be advantageous to proactively write these cold pages to swap and reclaim
> +the memory for other use cases. By default, the zswap shrinker is disabled.
> +Users can enable it as follows:
> +
> +  echo Y > /sys/module/zswap/parameters/shrinker_enabled
> +
> +This can be enabled at boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
> +selected.
> +
>  A debugfs interface is provided for various statistics about pool size, number
>  of pages stored, same-value filled pages and various counters for the reasons
>  pages are rejected.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7b1816450bfc..b23bc5390240 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -22,6 +22,7 @@
>  #include
>  #include
>  #include
> +#include <linux/zswap.h>
>  #include
>
>  /* Free memory management - zoned buddy allocator.
>   */
> @@ -641,6 +642,7 @@ struct lruvec {
>  #ifdef CONFIG_MEMCG
>  	struct pglist_data *pgdat;
>  #endif
> +	struct zswap_lruvec_state zswap_lruvec_state;
>  };
>
>  /* Isolate for asynchronous migration */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index e571e393669b..08c240e16a01 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -5,20 +5,40 @@
>  #include
>  #include
>
> +struct lruvec;
> +
>  extern u64 zswap_pool_total_size;
>  extern atomic_t zswap_stored_pages;
>
>  #ifdef CONFIG_ZSWAP
>
> +struct zswap_lruvec_state {
> +	/*
> +	 * Number of pages in zswap that should be protected from the shrinker.
> +	 * This number is an estimate of the following counts:
> +	 *
> +	 * a) Recent page faults.
> +	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> +	 *    as well as recent zswap LRU rotations.
> +	 *
> +	 * These pages are likely to be warm, and might incur IO if they are
> +	 * written to swap.
> +	 */
> +	atomic_long_t nr_zswap_protected;
> +};
> +
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
>  void
Re: [PATCH v8 0/6] workload-specific and memory pressure-driven zswap writeback
On Thu, Nov 30, 2023 at 11:40:17AM -0800, Nhat Pham wrote:
> Changelog:
> v8:
>    * Fixed a couple of build errors in the case of !CONFIG_MEMCG
>    * Simplified the online memcg selection scheme for the zswap global
>      limit reclaim (suggested by Michal Hocko and Johannes Weiner)
>      (patch 2 and patch 3)
>    * Added a new kconfig to allow users to enable the zswap shrinker by
>      default. (suggested by Johannes Weiner) (patch 6)
> v7:
>    * Added the mem_cgroup_iter_online() function to the API for the new
>      behavior (suggested by Andrew Morton) (patch 2)
>    * Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
> v6:
>    * Rebase on top of latest mm-unstable.
>    * Fix/improve the in-code documentation of the new list_lru
>      manipulation functions (patch 1)
> v5:
>    * Replace reference getting with an rcu_read_lock() section for
>      zswap lru modifications (suggested by Yosry)
>    * Add a new prep patch that allows mem_cgroup_iter() to return
>      online cgroup.
>    * Add a callback that updates pool->next_shrink when the cgroup is
>      offlined (suggested by Yosry Ahmed, Johannes Weiner)
> v4:
>    * Rename list_lru_add to list_lru_add_obj and __list_lru_add to
>      list_lru_add (patch 1) (suggested by Johannes Weiner and
>      Yosry Ahmed)
>    * Some cleanups on the memcg aware LRU patch (patch 2)
>      (suggested by Yosry Ahmed)
>    * Use event interface for the new per-cgroup writeback counters.
>      (patch 3) (suggested by Yosry Ahmed)
>    * Abstract zswap's lruvec states and handling into
>      zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
> v3:
>    * Add a patch to export per-cgroup zswap writeback counters
>    * Add a patch to update zswap's kselftest
>    * Separate the new list_lru functions into its own prep patch
>    * Do not start from the top of the hierarchy when encountering a memcg
>      that is not online for the global limit zswap writeback (patch 2)
>      (suggested by Yosry Ahmed)
>    * Do not remove the swap entry from list_lru in
>      __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
>    * Removed a redundant zswap pool getting (patch 2)
>      (reported by Ryan Roberts)
>    * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
>      (patch 5) (suggested by Yosry Ahmed)
>    * Remove the per-cgroup zswap shrinker knob (patch 5)
>      (suggested by Yosry Ahmed)
> v2:
>    * Fix loongarch compiler errors
>    * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KMEM
>
> There are currently several issues with zswap writeback:
>
> 1. There is only a single global LRU for zswap, making it impossible to
>    perform workload-specific shrinking - a memcg under memory pressure
>    cannot determine which pages in the pool it owns, and often ends up
>    writing pages from other memcgs. This issue has been previously
>    observed in practice and mitigated by simply disabling
>    memcg-initiated shrinking:
>
>    https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
>    But this solution leaves a lot to be desired, as we still do not
>    have an avenue for a memcg to free up its own memory locked up in
>    the zswap pool.
>
> 2. We only shrink the zswap pool when the user-defined limit is hit.
>    This means that if we set the limit too high, cold data that are
>    unlikely to be used again will reside in the pool, wasting precious
>    memory.
>    It is hard to predict how much zswap space will be needed
>    ahead of time, as this depends on the workload (specifically, on
>    factors such as memory access patterns and compressibility of the
>    memory pages).
>
> This patch series solves these issues by separating the global zswap
> LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
> (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
> new shrinker does not have any parameter that must be tuned by the
> user, and can be opted in or out on a per-memcg basis.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improve the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Domenico Cerasuolo (3):
>   zswap: make shrinking memcg-aware
>   mm: memcg: add per-memcg zswap writeback stat
>   selftests: cgroup: update per-memcg zswap writeback selftest
>
> Nhat Pham (3):
>   list_lru: allows explicit memcg and NUMA node selection
>   memcontrol: implement mem_cgroup_tryget_online()
>   zswap: shrinks zswap pool based on memory pressure
>
>  Documentation/admin-guide/mm/zswap.rst |  10 +
>  drivers/android/binder_alloc.c         |   7 +-
>  fs/dcache.c                            |   8 +-
>  fs/gfs2/quota.c                        |   6
[PATCH v1] selftests/sgx: Skip non X86_64 platform
From: Zhao Mengmeng

When building the whole selftests on arm64, rsync gives an error about sgx:

  rsync: [sender] link_stat "/root/linux-next/tools/testing/selftests/sgx/test_encl.elf" failed: No such file or directory (2)
  rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1327) [sender=3.2.5]

The root cause is that sgx is only used on X86_64 and shall be skipped on
other platforms. Fix this by moving TEST_CUSTOM_PROGS and TEST_FILES inside
the if check, so that the build result will be "Skipping non-existent dir:
sgx".

Fixes: 2adcba79e69d ("selftests/x86: Add a selftest for SGX")
Signed-off-by: Zhao Mengmeng
---
 tools/testing/selftests/sgx/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile
index 50aab6b57da3..01abe4969b0f 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -16,10 +16,10 @@ HOST_CFLAGS := -Wall -Werror -g $(INCLUDES) -fPIC -z noexecstack
 ENCL_CFLAGS := -Wall -Werror -static -nostdlib -nostartfiles -fPIC \
 	       -fno-stack-protector -mrdrnd $(INCLUDES)

+ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
 TEST_FILES := $(OUTPUT)/test_encl.elf

-ifeq ($(CAN_BUILD_X86_64), 1)
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
--
2.38.1
[PATCH v8 3/6] zswap: make shrinking memcg-aware (fix 2)
Drop the pool's reference at the end of the writeback step. Apply on top of the first fixlet: https://lore.kernel.org/linux-mm/20231130203522.gc543...@cmpxchg.org/T/#m6ba8efd2205486b1b333a29f5a890563b45c7a7e Signed-off-by: Nhat Pham --- mm/zswap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/zswap.c b/mm/zswap.c index 7a84c1454988..56d4a8cc461d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -859,6 +859,7 @@ static void shrink_worker(struct work_struct *w) resched: cond_resched(); } while (!zswap_can_accept()); + zswap_pool_put(pool); } static struct zswap_pool *zswap_pool_create(char *type, char *compressor) -- 2.34.1
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
>
> From: Domenico Cerasuolo
>
> Currently, we only have a single global LRU for zswap. This makes it
> impossible to perform workload-specific shrinking - a memcg cannot
> determine which pages in the pool it owns, and often ends up writing
> pages from other memcgs. This issue has been previously observed in
> practice and mitigated by simply disabling memcg-initiated shrinking:
>
> https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
> This patch fully resolves the issue by replacing the global zswap LRU
> with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
>
> a) When a store attempt hits a memcg limit, it now triggers a
>    synchronous reclaim attempt that, if successful, allows the new
>    hotter page to be accepted by zswap.
> b) If the store attempt instead hits the global zswap limit, it will
>    trigger an asynchronous reclaim attempt, in which a memcg is
>    selected for reclaim in a round-robin-like fashion.
>
> Signed-off-by: Domenico Cerasuolo
> Co-developed-by: Nhat Pham
> Signed-off-by: Nhat Pham
> ---
>  include/linux/memcontrol.h |   5 +
>  include/linux/zswap.h      |   2 +
>  mm/memcontrol.c            |   2 +
>  mm/swap.h                  |   3 +-
>  mm/swap_state.c            |  24 +++-
>  mm/zswap.c                 | 269 +
>  6 files changed, 245 insertions(+), 60 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2bd7d14ace78..a308c8eacf20 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
>  	return NULL;
>  }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> +{
> +	return NULL;
> +}
> +
>  static inline bool folio_memcg_kmem(struct folio *folio)
>  {
>  	return false;
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 2a60ce39cfde..e571e393669b 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
> +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>
>  #else
>
> @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
>  static inline void zswap_invalidate(int type, pgoff_t offset) {}
>  static inline void zswap_swapon(int type) {}
>  static inline void zswap_swapoff(int type) {}
> +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>
>  #endif
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 470821d1ba1a..792ca21c5815 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	page_counter_set_min(&memcg->memory, 0);
>  	page_counter_set_low(&memcg->memory, 0);
>
> +	zswap_memcg_offline_cleanup(memcg);
> +
>  	memcg_offline_kmem(memcg);
>  	reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
> diff
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) > { > struct swap_info_struct *si; > struct folio *folio; > @@ -470,6 +471,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, > gfp_t gfp_mask, > if (err != -EEXIST) > goto fail_put_swap; > > + /* > +* Protect against a recursive call to > __read_swap_cache_async() > +
Re: [PATCHv3 net-next 01/14] selftests/net: add lib.sh
On Tue, Dec 05, 2023 at 01:00:29PM +0100, Paolo Abeni wrote:
> > +cleanup_ns()
> > +{
> > +	local ns=""
> > +	local errexit=0
> > +	local ret=0
> > +
> > +	# disable errexit temporarily
> > +	if [[ $- =~ "e" ]]; then
> > +		errexit=1
> > +		set +e
> > +	fi
> > +
> > +	for ns in "$@"; do
> > +		ip netns delete "${ns}" &> /dev/null
> > +		if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
> > +			echo "Warn: Failed to remove namespace $ns"
> > +			ret=1
> > +		fi
> > +	done
> > +
> > +	[ $errexit -eq 1 ] && set -e
> > +	return $ret
> > +}
> > +
> > +# setup netns with given names as prefix. e.g
> > +# setup_ns local remote
> > +setup_ns()
> > +{
> > +	local ns=""
> > +	local ns_name=""
> > +	local ns_list=""
> > +	for ns_name in "$@"; do
> > +		# Some tests may setup/remove the same netns multiple times
> > +		if unset ${ns_name} 2> /dev/null; then
> > +			ns="${ns_name,,}-$(mktemp -u XX)"
> > +			eval readonly ${ns_name}="$ns"
> > +		else
> > +			eval ns='$'${ns_name}
> > +			cleanup_ns "$ns"
> > +		fi
> > +
> > +		if ! ip netns add "$ns"; then
> > +			echo "Failed to create namespace $ns_name"
> > +			cleanup_ns "$ns_list"
> > +			return $ksft_skip
> > +		fi
> > +		ip -n "$ns" link set lo up
> > +		ns_list="$ns_list $ns"
>
> Side note for a possible follow-up: if you maintain $ns_list as a global
> variable, and remove from that list the namespaces deleted by cleanup_ns,
> you could remove the cleanup trap from the individual tests with something
> like:
>
> final_cleanup_ns()
> {
> 	cleanup_ns $ns_list
> }
>
> trap final_cleanup_ns EXIT
>
> No respin needed for the above, could be a follow-up if agreed upon.

Hi Paolo,

I did something similar in the first version. But Petr said[1] we should let
the client do the cleanup explicitly. I agree that the client scripts should
keep this in mind. On the other hand, maybe we can add this final cleanup
and let the client call it directly. What do you think?

[1] https://lore.kernel.org/netdev/878r6nf9x5@nvidia.com/

Thanks
Hangbin
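Paolo's suggested trap-based catch-all can be prototyped without root or
netns at all. The sketch below uses plain temp directories as stand-ins for
namespaces (the helper names mirror the patch but the bodies are otherwise
hypothetical): every setup appends to a global list, explicit cleanups remove
entries from it, and a single EXIT trap tears down whatever remains.

```shell
#!/bin/bash
# Sketch of the global-list + EXIT-trap cleanup pattern, with temp
# directories standing in for network namespaces (ip netns needs root).
ns_list=""
work=$(mktemp -d)

setup_ns() {                 # create a "namespace" and track it globally
	local ns="$work/$1"
	mkdir "$ns"
	ns_list="$ns_list $ns"
}

cleanup_ns() {               # remove the given "namespaces" from the list
	local ns
	for ns in "$@"; do
		rmdir "$ns"
		ns_list="${ns_list// $ns/}"
	done
}

final_cleanup_ns() {         # catch-all, as suggested in the thread
	[ -n "$ns_list" ] && cleanup_ns $ns_list
	rmdir "$work"
}
trap final_cleanup_ns EXIT

setup_ns local1
setup_ns remote1
cleanup_ns "$work/local1"    # a test may still clean up one ns explicitly
echo "remaining:$ns_list"    # remote1 is left for the EXIT trap
```

Because cleanup_ns removes what it deletes from `$ns_list`, the EXIT trap
only ever touches leftovers, so explicit cleanup and the catch-all can
coexist without double-delete warnings.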
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 4:10 PM Chris Li wrote:
>
> Hi Nhat,
>
> Still working my way through your patch series.
>
> On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
> >
> > From: Domenico Cerasuolo
> >
> > Currently, we only have a single global LRU for zswap. This makes it
> > impossible to perform workload-specific shrinking - a memcg cannot
> > determine which pages in the pool it owns, and often ends up writing
> > pages from other memcgs. This issue has been previously observed in
> > practice and mitigated by simply disabling memcg-initiated shrinking:
> >
> > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
> >
> > This patch fully resolves the issue by replacing the global zswap LRU
> > with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
> >
> > a) When a store attempt hits a memcg limit, it now triggers a
> >    synchronous reclaim attempt that, if successful, allows the new
> >    hotter page to be accepted by zswap.
> > b) If the store attempt instead hits the global zswap limit, it will
> >    trigger an asynchronous reclaim attempt, in which a memcg is
> >    selected for reclaim in a round-robin-like fashion.
> >
> > Signed-off-by: Domenico Cerasuolo
> > Co-developed-by: Nhat Pham
> > Signed-off-by: Nhat Pham
> > ---
> >  include/linux/memcontrol.h |   5 +
> >  include/linux/zswap.h      |   2 +
> >  mm/memcontrol.c            |   2 +
> >  mm/swap.h                  |   3 +-
> >  mm/swap_state.c            |  24 +++-
> >  mm/zswap.c                 | 269 +
> >  6 files changed, 245 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2bd7d14ace78..a308c8eacf20 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> >  	return NULL;
> >  }
> >
> > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline bool folio_memcg_kmem(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 2a60ce39cfde..e571e393669b 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(int type, pgoff_t offset);
> >  void zswap_swapon(int type);
> >  void zswap_swapoff(int type);
> > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> >
> >  #else
> >
> > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
> >  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> >  static inline void zswap_swapon(int type) {}
> >  static inline void zswap_swapoff(int type) {}
> > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> >
> >  #endif
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 470821d1ba1a..792ca21c5815 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >  	page_counter_set_min(&memcg->memory, 0);
> >  	page_counter_set_low(&memcg->memory, 0);
> >
> > +
zswap_memcg_offline_cleanup(memcg); > > + > > memcg_offline_kmem(memcg); > > reparent_shrinker_deferred(memcg); > > wb_memcg_offline(memcg); > > diff --git a/mm/swap.h b/mm/swap.h > > index 73c332ee4d91..c0dc73e10e91 100644 > > --- a/mm/swap.h > > +++ b/mm/swap.h > > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, > > gfp_t gfp_mask, > >struct swap_iocb **plug); > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated); > > +bool *new_page_allocated, > > +bool skip_if_exists); > > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > > struct mempolicy *mpol, pgoff_t ilx); > > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 85d9e5806a6a..6c84236382f3 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > > address_space *mapping, > > > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated) > > +bool *new_page_allocated, > > +bool skip_if_exists) > > I think this skip_if_exists is
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Peter, On 12/5/2023 4:33 PM, Peter Newman wrote: > On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre > wrote: >> On 12/1/2023 12:56 PM, Peter Newman wrote: >>> On Tue, May 16, 2023 at 5:06 PM Reinette Chatre I think it may be optimistic to view this as a replacement of a PQR write. As you point out, that requires that a CPU switches between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that much more, it has to be clear about this impact and motivate that it is acceptable. >>> >>> We were operating under the assumption that if the overhead wasn't >>> acceptable, we would have heard complaints about it by now, but we >>> ultimately learned that this feature wasn't deployed as much as we had >>> originally thought on AMD hardware and that the overhead does need to >>> be addressed. >>> >>> I am interested in your opinion on two options I'm exploring to >>> mitigate the overhead, both of which depend on an API like the one >>> Babu recently proposed for the AMD ABMC feature [1], where a new file >>> interface will allow the user to indicate which mon_groups are >>> actively being measured. I will refer to this as "assigned" for now, >>> as that's the current proposal. >>> >>> The first is likely the simpler approach: only read MBM event counters >>> which have been marked as "assigned" in the filesystem to avoid paying >>> the context switch cost on tasks in groups which are not actively >>> being measured. In our use case, we calculate memory bandwidth on >>> every group every few minutes by reading the counters twice, 5 seconds >>> apart. We would just need counters read during this 5-second window. >> >> I assume that tasks within a monitoring group can be scheduled on any >> CPU and from the cover letter of this work I understand that only an >> RMID assigned to a processor can be guaranteed to be tracked by hardware. 
>>
>> Are you proposing for this option that you keep this "soft RMID" approach
>> with CPUs permanently assigned a "hard RMID" but only update the counts
>> for a "soft RMID" that is "assigned"?
>
> Yes
>
>> I think that means that the context
>> switch cost for the monitored group would increase even more than with the
>> implementation in this series since the counters need to be read on context
>> switch in as well as context switch out.
>>
>> If I understand correctly then only one monitoring group can be measured
>> at a time. If such a measurement takes 5 seconds then theoretically 12 groups
>> can be measured in one minute. It may be possible to create many more
>> monitoring groups than this. Would it be possible to reach monitoring
>> goals in your environment?
>
> We actually measure all of the groups at the same time, so thinking
> about this more, the proposed ABMC fix isn't actually a great fit: the
> user would have to assign all groups individually when a global
> setting would have been fine.
>
> Ignoring any present-day resctrl interfaces, what we minimally need is...
>
> 1. global "start measurement", which enables a
> read-counters-on-context switch flag, and broadcasts an IPI to all
> CPUs to read their current count
> 2. wait 5 seconds
> 3. global "end measurement", to IPI all CPUs again for final counts
> and clear the flag from step 1
>
> Then the user could read at their leisure all the (frozen) event
> counts from memory until the next measurement begins.
>
> In our case, if we're measuring as often as 5 seconds for every
> minute, that will already be a 12x aggregate reduction in overhead,
> which would be worthwhile enough.

The "con" here would be that during those 5 seconds (which I assume would be
controlled via user space so potentially shorter or longer) all tasks in the
system are expected to see a significant (but yet to be measured) impact on
context switch delay.
I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters. >>> The second involves avoiding the situation where a hardware counter >>> could be deallocated: Determine the number of simultaneous RMIDs >>> supported, reduce the effective number of RMIDs available to that >>> number. Use the default RMID (0) for all "unassigned" monitoring >> >> hmmm ... so on the one side there is "only the RMID within the PQR >> register can be guaranteed to be tracked by hardware" and on the >> other side there is "A given implementation may have insufficient >> hardware to simultaneously track the bandwidth for all RMID values >> that the hardware supports." >> >> From the above there seems to be something in the middle where >> some subset of the RMID values supported by hardware can be used >> to simultaneously track bandwidth? How can it be determined >> what this number of RMID values is? > > In the context of AMD, we could use the smallest number of CPUs in any > L3 domain as a lower bound of the
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Tue, Dec 5, 2023 at 4:16 PM Chris Li wrote:
>
> On Mon, Dec 4, 2023 at 5:39 PM Nhat Pham wrote:
> > >
> > > memcg as a candidate for the global limit reclaim.
> > >
> > > Very minor nitpick. This patch can fold with the later patch that uses
> > > it. That makes the review easier, no need to cross reference different
> > > patches. It will also make it harder to introduce API that nobody
> > > uses.
> >
> > I don't have a strong preference one way or the other :) Probably not
> > worth the churn tho.
>
> Squashing a patch is very easy. If you are refreshing a new series, it
> is worthwhile to do it. I notice on the other thread Yosry pointed out
> you did not use the function "mem_cgroup_tryget_online" in patch 3,
> that is exactly the situation my suggestion is trying to prevent.

I doubt squashing it would solve the issue - in fact, I think Yosry noticed
it precisely because he had to stare at a separate patch detailing the
addition of the new function in the first place :P

In general though, I'm hesitant to extend this API silently in a patch that
uses it. Is it not better to have a separate patch announcing this API
extension? list_lru_add() was originally part of this series too - we
separated it out into its own patch because it was getting confusing.

Another benefit is that there will be less work in the future if we want to
revert the per-cgroup zswap LRU patch: there's already another
mem_cgroup_tryget_online() user, so we can keep this patch.

But yeah, we'll see - I'll think about it if I actually have to send v9. If
not, let's not add unnecessary churn.

> If you don't have a strong preference, it sounds like you should squash it.
> > Chris
> >
> > >
> > > Chris
> > >
> > > >
> > > > Signed-off-by: Nhat Pham
> > > > ---
> > > >  include/linux/memcontrol.h | 10 ++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 7bdcf3020d7a..2bd7d14ace78 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
> > > >  	return !memcg || css_tryget(&memcg->css);
> > > >  }
> > > >
> > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > > +{
> > > > +	return !memcg || css_tryget_online(&memcg->css);
> > > > +}
> > > > +
> > > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > >  {
> > > >  	if (memcg)
> > > >  		css_put(&memcg->css);
> > > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
> > > >  	return true;
> > > >  }
> > > >
> > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > >  {
> > > >  }
> > > > --
> > > > 2.34.1
> > > >
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Reinette, On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre wrote: > On 12/1/2023 12:56 PM, Peter Newman wrote: > > On Tue, May 16, 2023 at 5:06 PM Reinette Chatre > >> I think it may be optimistic to view this as a replacement of a PQR write. > >> As you point out, that requires that a CPU switches between tasks with the > >> same CLOSID. You demonstrate that resctrl already contributes a significant > >> delay to __switch_to - this work will increase that much more, it has to > >> be clear about this impact and motivate that it is acceptable. > > > > We were operating under the assumption that if the overhead wasn't > > acceptable, we would have heard complaints about it by now, but we > > ultimately learned that this feature wasn't deployed as much as we had > > originally thought on AMD hardware and that the overhead does need to > > be addressed. > > > > I am interested in your opinion on two options I'm exploring to > > mitigate the overhead, both of which depend on an API like the one > > Babu recently proposed for the AMD ABMC feature [1], where a new file > > interface will allow the user to indicate which mon_groups are > > actively being measured. I will refer to this as "assigned" for now, > > as that's the current proposal. > > > > The first is likely the simpler approach: only read MBM event counters > > which have been marked as "assigned" in the filesystem to avoid paying > > the context switch cost on tasks in groups which are not actively > > being measured. In our use case, we calculate memory bandwidth on > > every group every few minutes by reading the counters twice, 5 seconds > > apart. We would just need counters read during this 5-second window. > > I assume that tasks within a monitoring group can be scheduled on any > CPU and from the cover letter of this work I understand that only an > RMID assigned to a processor can be guaranteed to be tracked by hardware. 
> > Are you proposing for this option that you keep this "soft RMID" approach > with CPUs permanently assigned a "hard RMID" but only update the counts for a > "soft RMID" that is "assigned"?

Yes

> I think that means that the context > switch cost for the monitored group would increase even more than with the > implementation in this series since the counters need to be read on context > switch in as well as context switch out. > > If I understand correctly then only one monitoring group can be measured > at a time. If such a measurement takes 5 seconds then theoretically 12 groups > can be measured in one minute. It may be possible to create many more > monitoring groups than this. Would it be possible to reach monitoring > goals in your environment?

We actually measure all of the groups at the same time, so thinking about this more, the proposed ABMC fix isn't actually a great fit: the user would have to assign all groups individually when a global setting would have been fine. Ignoring any present-day resctrl interfaces, what we minimally need is...

1. global "start measurement", which enables a read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count
2. wait 5 seconds
3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1

Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins. In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.

> > > > > The second involves avoiding the situation where a hardware counter > > could be deallocated: Determine the number of simultaneous RMIDs > > supported, reduce the effective number of RMIDs available to that > > number. Use the default RMID (0) for all "unassigned" monitoring > hmmm ...
so on the one side there is "only the RMID within the PQR > register can be guaranteed to be tracked by hardware" and on the > other side there is "A given implementation may have insufficient > hardware to simultaneously track the bandwidth for all RMID values > that the hardware supports." > > From the above there seems to be something in the middle where > some subset of the RMID values supported by hardware can be used > to simultaneously track bandwidth? How can it be determined > what this number of RMID values is? In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters. If the number is actually higher, it's not too difficult to probe at runtime. The technique used by the test script[1] reliably identifies the number of counters, but some experimentation would be needed to see how quickly the hardware will repurpose a counter, as the script today is using way too long of a workload for the kernel to be invoking. Maybe a reasonable compromise would be to initialize the HW counter estimate at the CPUs-per-domain value and add a file node to let the user
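The start/end measurement flow described in this thread is easy to model outside the kernel. Below is a minimal Python sketch of the proposed semantics only; all names are hypothetical, and the per-CPU "hardware counters" are simulated. It shows the two properties the proposal relies on: the read-on-context-switch flag is only set inside the window, and the deltas stay frozen after "end measurement" so userspace can read them at leisure.

```python
# Hypothetical model of the proposed global start/end measurement window.
# start() models the first IPI broadcast (snapshot every CPU's count and
# set the read-counters-on-context-switch flag); end() models the second
# IPI (final read, freeze the deltas, clear the flag).

class MeasurementWindow:
    def __init__(self, ncpus):
        self.ncpus = ncpus
        self.reading_on_switch = False      # flag consulted in __switch_to
        self.start_counts = [0] * ncpus
        self.frozen_deltas = [0] * ncpus

    def start(self, read_counter):
        # "start measurement": IPI all CPUs to read their current count.
        self.start_counts = [read_counter(cpu) for cpu in range(self.ncpus)]
        self.reading_on_switch = True

    def end(self, read_counter):
        # "end measurement": final counts; deltas stay frozen until the
        # next start(), so userspace can read them whenever it likes.
        self.frozen_deltas = [read_counter(cpu) - self.start_counts[cpu]
                              for cpu in range(self.ncpus)]
        self.reading_on_switch = False


# Toy hardware counters for two CPUs.
counts = [100, 250]
win = MeasurementWindow(ncpus=2)
win.start(lambda cpu: counts[cpu])
counts = [160, 400]            # traffic during the 5-second window
win.end(lambda cpu: counts[cpu])
print(win.frozen_deltas)       # -> [60, 150]
```

With a 5-second window out of every 60, the flag is set less than a tenth of the time, which is where the quoted "12x aggregate reduction" in context-switch overhead comes from.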
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Mon, Dec 4, 2023 at 5:39 PM Nhat Pham wrote: > > > > memcg as a candidate for the global limit reclaim. > > > > Very minor nitpick. This patch can fold with the later patch that uses > > it. That makes the review easier, no need to cross reference different > > patches. It will also make it harder to introduce API that nobody > > uses. > > I don't have a strong preference one way or the other :) Probably not > worth the churn tho. Squashing a patch is very easy. If you are refreshing a new series, it is worthwhile to do it. I notice on the other thread Yosry pointed out you did not use the function "mem_cgroup_tryget_online" in patch 3, that is exactly the situation my suggestion is trying to prevent. If you don't have a strong preference, it sounds like you should squash it. Chris > > > > > Chris > > > > > > > > Signed-off-by: Nhat Pham > > > --- > > > include/linux/memcontrol.h | 10 ++ > > > 1 file changed, 10 insertions(+) > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 7bdcf3020d7a..2bd7d14ace78 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct > > > mem_cgroup *memcg) > > > return !memcg || css_tryget(&memcg->css); > > > } > > > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > > +{ > > > + return !memcg || css_tryget_online(&memcg->css); > > > +} > > > + > > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > > { > > > if (memcg) > > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct > > > mem_cgroup *memcg) > > > return true; > > > } > > > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > > +{ > > > + return true; > > > +} > > > + > > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > > { > > > } > > > -- > > > 2.34.1 > > > >
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
Hi Nhat, Still working my way up your patch series. On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Currently, we only have a single global LRU for zswap. This makes it > impossible to perform workload-specific shrinking - a memcg cannot > determine which pages in the pool it owns, and often ends up writing > pages from other memcgs. This issue has been previously observed in > practice and mitigated by simply disabling memcg-initiated shrinking: > > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u > > This patch fully resolves the issue by replacing the global zswap LRU > with memcg- and NUMA-specific LRUs, and modifies the reclaim logic: > > a) When a store attempt hits a memcg limit, it now triggers a >synchronous reclaim attempt that, if successful, allows the new >hotter page to be accepted by zswap. > b) If the store attempt instead hits the global zswap limit, it will >trigger an asynchronous reclaim attempt, in which a memcg is >selected for reclaim in a round-robin-like fashion. 
> > Signed-off-by: Domenico Cerasuolo > Co-developed-by: Nhat Pham > Signed-off-by: Nhat Pham > --- > include/linux/memcontrol.h | 5 + > include/linux/zswap.h | 2 + > mm/memcontrol.c| 2 + > mm/swap.h | 3 +- > mm/swap_state.c| 24 +++- > mm/zswap.c | 269 + > 6 files changed, 245 insertions(+), 60 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 2bd7d14ace78..a308c8eacf20 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup > *page_memcg_check(struct page *page) > return NULL; > } > > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup > *objcg) > +{ > + return NULL; > +} > + > static inline bool folio_memcg_kmem(struct folio *folio) > { > return false; > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index 2a60ce39cfde..e571e393669b 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio); > void zswap_invalidate(int type, pgoff_t offset); > void zswap_swapon(int type); > void zswap_swapoff(int type); > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > > #else > > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio) > static inline void zswap_invalidate(int type, pgoff_t offset) {} > static inline void zswap_swapon(int type) {} > static inline void zswap_swapoff(int type) {} > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > > #endif > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 470821d1ba1a..792ca21c5815 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct > cgroup_subsys_state *css) > page_counter_set_min(&memcg->memory, 0); > page_counter_set_low(&memcg->memory, 0); > > + zswap_memcg_offline_cleanup(memcg); > + > memcg_offline_kmem(memcg); > reparent_shrinker_deferred(memcg); > wb_memcg_offline(memcg); > diff 
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) I think this skip_if_exists is problematic here, and you might need to redesign it. First of all, with skip_if_exists as the argument name, the meaning to the caller is not clear. When I saw this, I wondered: what does the function return when this condition is triggered? Unlike "*new_page_allocated", which is a
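One way to make the outcome the reviewer is asking about explicit is to return a status alongside the page, rather than folding it into a boolean out-parameter. The sketch below is an illustrative Python model of that idea only, not the actual __read_swap_cache_async() API; the names and the toy dict-backed cache are invented for the example.

```python
from enum import Enum, auto

class LookupStatus(Enum):
    ALLOCATED = auto()          # a new entry was allocated and inserted
    FOUND = auto()              # an existing cache entry was returned
    SKIPPED_EXISTING = auto()   # caller asked to skip already-present entries

def cache_lookup(cache, key, skip_if_exists):
    """Toy model of a read-or-allocate cache lookup.

    Returning an explicit status makes the skip_if_exists outcome
    visible to the caller instead of being implied by a bool flag.
    """
    if key in cache:
        if skip_if_exists:
            return None, LookupStatus.SKIPPED_EXISTING
        return cache[key], LookupStatus.FOUND
    cache[key] = f"page-for-{key}"
    return cache[key], LookupStatus.ALLOCATED

cache = {}
page, status = cache_lookup(cache, 42, skip_if_exists=True)
assert status is LookupStatus.ALLOCATED
page, status = cache_lookup(cache, 42, skip_if_exists=True)
assert status is LookupStatus.SKIPPED_EXISTING and page is None
```

In C the same effect could be had with an enum return plus a page out-parameter; the point is only that the caller can distinguish "skipped" from "not found" without guessing.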
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, 2023-12-05 at 16:43 +, Mark Brown wrote: > Right, it's a small and fairly easily auditable list - it's more > about > the app than the double enable which was what I thought your concern > was. It's a bit annoying definitely and not something we want to do > in > general but for something like this where we're adding specific > coverage > for API extensions for the feature it seems like a reasonable > tradeoff. > > If the x86 toolchain/libc support is widely enough deployed (or you > just > don't mind any missing coverage) we could use the toolchain support > there and only have the manual enable for arm64, it'd be inconsistent > but not wildly so. > > > I'm hoping there is not too much of a gap before the glibc support starts filtering out. Long term, ELF bit enabling is probably the right thing for the generic tests. Short term, manual enabling is OK with me if no one else minds. Maybe we could add my "don't do" list as a comment if we do manual enabling? I'll have to check your new series, but I also wonder if we could cram the manual enabling and status checking pieces into some headers and not have to have "if x86" "if arm" logic in the tests themselves.
Re: [PATCH RFT v4 2/5] fork: Add shadow stack support to clone3()
On Tue, 2023-12-05 at 15:51 +, Mark Brown wrote: > On Tue, Dec 05, 2023 at 12:26:57AM +, Edgecombe, Rick P wrote: > > On Tue, 2023-11-28 at 18:22 +, Mark Brown wrote: > > > > - size = adjust_shstk_size(stack_size); > > > + size = adjust_shstk_size(size); > > > addr = alloc_shstk(0, size, 0, false); > > > Hmm. I didn't test this, but in the copy_process(), copy_mm() > > happens > > before this point. So the shadow stack would get mapped in > > current's MM > > (i.e. the parent). So in the !CLONE_VM case with > > shadow_stack_size!=0 > > the SSP in the child will be updated to an area that is not mapped > > in > > the child. I think we need to pass tsk->mm into alloc_shstk(). But > > such > > an exotic clone usage does give me pause, regarding whether all of > > this > > is premature. > > Hrm, right. And we then can't use do_mmap() either. I'd be somewhat > tempted to disallow that specific case for now rather than deal with > it > though that's not really in the spirit of just always following what > the > user asked for. Oh, yea. What a pain. It doesn't seem like we could easily even add a do_mmap() variant that takes an mm either. I did a quick logging test on a Fedora userspace. systemd (I think) appears to do a clone(!CLONE_VM) with a stack passed. So maybe the combo might actually get used with a shadow_stack_size if it used clone3 some day. At the same time, fixing clone to mmap() in the child doesn't seem straightforward at all. Checking with some of our MM folks, the suggestion was to look at doing the child's shadow stack mapping in dup_mm() to avoid tripping over complications that happen when a remote MM becomes more "live". If we just punt on this combination for now, then the documented rules for args->shadow_stack_size would be something like: clone3 will use the parent's shadow stack when CLONE_VM is not present. If CLONE_VFORK is set then it will use the parent's shadow stack only when args->shadow_stack_size is non-zero. 
In the cases when the parent's shadow stack is not used, args->shadow_stack_size is used for the size whenever non-zero. I guess it doesn't seem overly complicated, but I don't think any of the options are great. I'd unhappily lean towards not supporting shadow_stack_size!=0 && !CLONE_VM for now. But it seems like there may be a user for the unsupported case, so this would just be improving things a little and kicking the can down the road. I also wonder if this is a sign to reconsider the earlier token-consuming design.
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Peter, On 12/1/2023 12:56 PM, Peter Newman wrote: > Hi Reinette, > > On Tue, May 16, 2023 at 5:06 PM Reinette Chatre > wrote: >> On 5/15/2023 7:42 AM, Peter Newman wrote: >>> >>> I used a simple parent-child pipe loop benchmark with the parent in >>> one monitoring group and the child in another to trigger 2M >>> context-switches on the same CPU and compared the sample-based >>> profiles on an AMD and Intel implementation. I used perf diff to >>> compare the samples between hard and soft RMID switches. >>> >>> Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz: >>> >>> +44.80% [kernel.kallsyms] [k] __rmid_read >>> 10.43% -9.52% [kernel.kallsyms] [k] __switch_to >>> >>> AMD EPYC 7B12 64-Core Processor: >>> >>> +28.27% [kernel.kallsyms] [k] __rmid_read >>> 13.45%-13.44% [kernel.kallsyms] [k] __switch_to >>> >>> Note that a soft RMID switch that doesn't change CLOSID skips the >>> PQR_ASSOC write completely, so from this data I can roughly say that >>> __rmid_read() is a little over 2x the length of a PQR_ASSOC write that >>> changes the current RMID on the AMD implementation and about 4.5x >>> longer on Intel. >>> >>> Let me know if this clarifies the cost enough or if you'd like to also >>> see instrumented measurements on the individual WRMSR/RDMSR >>> instructions. >> >> I can see from the data the portion of total time spent in __rmid_read(). >> It is not clear to me what the impact on a context switch is. Is it >> possible to say with this data that: this solution makes a context switch >> x% slower? >> >> I think it may be optimistic to view this as a replacement of a PQR write. >> As you point out, that requires that a CPU switches between tasks with the >> same CLOSID. You demonstrate that resctrl already contributes a significant >> delay to __switch_to - this work will increase that much more, it has to >> be clear about this impact and motivate that it is acceptable. 
> > We were operating under the assumption that if the overhead wasn't > acceptable, we would have heard complaints about it by now, but we > ultimately learned that this feature wasn't deployed as much as we had > originally thought on AMD hardware and that the overhead does need to > be addressed. > > I am interested in your opinion on two options I'm exploring to > mitigate the overhead, both of which depend on an API like the one > Babu recently proposed for the AMD ABMC feature [1], where a new file > interface will allow the user to indicate which mon_groups are > actively being measured. I will refer to this as "assigned" for now, > as that's the current proposal. > > The first is likely the simpler approach: only read MBM event counters > which have been marked as "assigned" in the filesystem to avoid paying > the context switch cost on tasks in groups which are not actively > being measured. In our use case, we calculate memory bandwidth on > every group every few minutes by reading the counters twice, 5 seconds > apart. We would just need counters read during this 5-second window. I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware. Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"? I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out. If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. 
Would it be possible to reach monitoring goals in your environment? > > The second involves avoiding the situation where a hardware counter > could be deallocated: Determine the number of simultaneous RMIDs > supported, reduce the effective number of RMIDs available to that > number. Use the default RMID (0) for all "unassigned" monitoring hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports." From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is? > groups and report "Unavailable" on all counter reads (and address the > default monitoring group's counts being unreliable). When assigned, > attempt to allocate one of the
[PATCH] kunit: tool: fix parsing of test attributes
Add parsing of attributes as diagnostic data. Fixes issue with test plan being parsed incorrectly as diagnostic data when located after suite-level attributes. Note that if there does not exist a test plan line, the diagnostic lines between the suite header and the first result will be saved in the suite log rather than the first test case log. Signed-off-by: Rae Moar --- Note this patch is a resend but I removed the second patch in the series so now it is a standalone patch. tools/testing/kunit/kunit_parser.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py index 79d8832c862a..ce34be15c929 100644 --- a/tools/testing/kunit/kunit_parser.py +++ b/tools/testing/kunit/kunit_parser.py @@ -450,7 +450,7 @@ def parse_diagnostic(lines: LineStream) -> List[str]: Log of diagnostic lines """ log = [] # type: List[str] - non_diagnostic_lines = [TEST_RESULT, TEST_HEADER, KTAP_START, TAP_START] + non_diagnostic_lines = [TEST_RESULT, TEST_HEADER, KTAP_START, TAP_START, TEST_PLAN] while lines and not any(re.match(lines.peek()) for re in non_diagnostic_lines): log.append(lines.pop()) @@ -726,6 +726,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest: # test plan test.name = "main" ktap_line = parse_ktap_header(lines, test) + test.log.extend(parse_diagnostic(lines)) parse_test_plan(lines, test) parent_test = True else: @@ -737,6 +738,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest: if parent_test: # If KTAP version line and/or subtest header is found, attempt # to parse test plan and print test header + test.log.extend(parse_diagnostic(lines)) parse_test_plan(lines, test) print_test_header(test) expected_count = test.expected_count base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86 -- 2.43.0.rc2.451.g8631bc7472-goog
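The failure mode this patch fixes can be reproduced with a toy version of the diagnostic scan (simplified regexes for illustration; the real patterns and LineStream live in kunit_parser.py): if the test-plan pattern is not in the stop list, a plan line sitting after suite-level attribute lines gets swallowed as diagnostic data instead of being left for parse_test_plan().

```python
import re

# Simplified stand-ins for the parser's line patterns.
TEST_RESULT = re.compile(r'^(ok|not ok) ')
TEST_HEADER = re.compile(r'^# Subtest:')
TEST_PLAN   = re.compile(r'^1\.\.[0-9]+')

def collect_diagnostic(lines, stop_patterns):
    """Pop lines into the diagnostic log until a structural line is seen."""
    log = []
    while lines and not any(p.match(lines[0]) for p in stop_patterns):
        log.append(lines.pop(0))
    return log

# Suite-level attribute line followed by the test plan.
stream = ['# module: example', '1..2', 'ok 1 case_one']

# Before the fix: the plan line is eaten as diagnostic data.
log = collect_diagnostic(list(stream), [TEST_RESULT, TEST_HEADER])
assert '1..2' in log

# After the fix (TEST_PLAN added to the stop list): scanning stops at
# the plan line, so it remains in the stream to be parsed as a plan.
log = collect_diagnostic(list(stream), [TEST_RESULT, TEST_HEADER, TEST_PLAN])
assert log == ['# module: example']
```

The real patch does exactly this, adding TEST_PLAN to non_diagnostic_lines and calling parse_diagnostic() before each parse_test_plan() call so the attribute lines end up in the right log.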
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)
On Tue, Dec 5, 2023 at 11:33 AM Nhat Pham wrote: > > Rename ZSWP_WB to ZSWPWB to better match the existing counters naming > scheme. > > Suggested-by: Johannes Weiner > Signed-off-by: Nhat Pham For the original patch + this fix: Reviewed-by: Yosry Ahmed > --- > include/linux/vm_event_item.h | 2 +- > mm/memcontrol.c | 2 +- > mm/vmstat.c | 2 +- > mm/zswap.c| 4 ++-- > 4 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index f4569ad98edf..747943bc8cc2 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_ZSWAP > ZSWPIN, > ZSWPOUT, > - ZSWP_WB, > + ZSWPWB, > #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 21d79249c8b4..0286b7d38832 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > ZSWPIN, > ZSWPOUT, > - ZSWP_WB, > + ZSWPWB, > #endif > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > THP_FAULT_ALLOC, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 2249f85e4a87..cfd8d8256f8e 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_ZSWAP > "zswpin", > "zswpout", > - "zswp_wb", > + "zswpwb", > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > diff --git a/mm/zswap.c b/mm/zswap.c > index c65b8ccc6b72..0fb0945c0031 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head > *item, struct list_lru_o > zswap_written_back_pages++; > > if (entry->objcg) > - count_objcg_event(entry->objcg, ZSWP_WB); > + count_objcg_event(entry->objcg, ZSWPWB); > > - count_vm_event(ZSWP_WB); > + count_vm_event(ZSWPWB); > /* > * Writeback started successfully, the page now 
belongs to the > * swapcache. Drop the entry from zswap - unless invalidate already > -- > 2.34.1
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
Stanislav Fomichev wrote: > On 12/05, Willem de Bruijn wrote: > > Stanislav Fomichev wrote: > > > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > > > wrote: > > > > > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > > > This patch enables Launch Time (Time-Based Scheduling) support > > > > > > > > to XDP zero > > > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > > > --- > > > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > > > > > As requested before, I think we need to see another driver > > > > > > > implementing > > > > > > > this. > > > > > > > > > > > > > > I propose driver igc and chip i225. > > > > > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 > > > > > > > second > > > > > > > into the future[1] is handled code wise. One suggestion is to add > > > > > > > a > > > > > > > section to Documentation/networking/xsk-tx-metadata.rst per > > > > > > > driver that > > > > > > > mentions/documents these different hardware limitations. It is > > > > > > > natural > > > > > > > that different types of hardware have limitations. This is a > > > > > > > close-to > > > > > > > hardware-level abstraction/API, and IMHO as long as we document > > > > > > > the > > > > > > > limitations we can expose this API without too many limitations > > > > > > > for more > > > > > > > capable hardware. > > > > > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > > > cannot be programmed. 
> > > > > > > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return > > > > > value. > > > > > So user won't know if their request is fail. > > > > > It is complex to inform user which request is failing. > > > > > Therefore, IMHO, it is good that we let driver handle the error > > > > > silently. > > > > > > > > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > > > never make it to the wire, right? > > > > Programmable behavior is to either drop or cap to some boundary > > value, such as the farthest programmable time in the future: the > > horizon. In fq: > > > > /* Check if packet timestamp is too far in the future. */ > > if (fq_packet_beyond_horizon(skb, q, now)) { > > if (q->horizon_drop) { > > q->stat_horizon_drops++; > > return qdisc_drop(skb, sch, > > to_free); > > } > > q->stat_horizon_caps++; > > skb->tstamp = now + q->horizon; > > } > > fq_skb_cb(skb)->time_to_send = skb->tstamp; > > > > Drop is the more obviously correct mode. > > > > Programming with a clock source that the driver does not support will > > then be a persistent failure. > > > > Preferably, this driver capability can be queried beforehand (rather > > than only through reading error counters afterwards). > > > > Perhaps it should not be a driver task to convert from possibly > > multiple clock sources to the device native clock. Right now, we do > > use per-device timecounters for this, implemented in the driver. > > > > As for which clocks are relevant. For PTP, I suppose the device PHC, > > converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. > > Do we need to expose some generic netdev netlink apis to query/adjust > nic clock sources (or maybe there is something existing already)? > Then the userspace can be responsible for syncing/converting the > timestamps to the internal nic clocks. +1 to trying to avoid doing > this in the drivers. Perhaps. 
I'm just a bit hesitant since that is UAPI and this is all quite hand-wavy still. Some of the conversion necessarily has to be in the driver. Only the driver knows the descriptor format, and limitations of that, such as the bit-width that can be encoded. If we cannot move anything out of the drivers (quite likely), then agreed that a netdev/ethtool netlink query approach is helpful. To be clear: I don't mean that that should be part of this series. This is not an XSK specific concern. > > > > That is clearly a situation that the user should be informed about. For > > > > RT systems this normally means that something is really wrong regarding > > > > timing / cycle overflow. Such systems have to react on that situation. > > > > > > In general, af_xdp is a bit lacking in this 'notify the user that they
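The fq logic quoted earlier in this thread reduces to a small per-packet decision, restated here in Python with the same semantics (toy nanosecond values; this mirrors the quoted C, it is not a new proposal):

```python
def apply_horizon(tstamp, now, horizon, horizon_drop):
    """Mirror of the quoted fq check: a launch time beyond now + horizon
    is either dropped or capped to the farthest programmable time."""
    if tstamp > now + horizon:              # fq_packet_beyond_horizon()
        if horizon_drop:
            return None                     # qdisc_drop(): never hits the wire
        return now + horizon                # cap: stat_horizon_caps++
    return tstamp                           # becomes time_to_send

NSEC_PER_SEC = 1_000_000_000
now, horizon = 1_000 * NSEC_PER_SEC, 10 * NSEC_PER_SEC

# Within the horizon: sent at the requested time.
assert apply_horizon(now + 5 * NSEC_PER_SEC, now, horizon, True) == now + 5 * NSEC_PER_SEC
# Beyond the horizon: dropped, or capped to now + horizon.
assert apply_horizon(now + 60 * NSEC_PER_SEC, now, horizon, True) is None
assert apply_horizon(now + 60 * NSEC_PER_SEC, now, horizon, False) == now + horizon
```

For a driver with a hardware LaunchTime limit (e.g. the "max 1 second into the future" mentioned above), the open question in the thread is which of these two behaviors the driver should pick, and how the user learns about it.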
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Tue, Dec 5, 2023 at 10:03 AM Yosry Ahmed wrote: > > On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > > > This patch implements a helper function that tries to get a reference to > > a memcg's css, as well as checking if it is online. This new function > > is almost exactly the same as the existing mem_cgroup_tryget(), except > > for the onlineness check. In the !CONFIG_MEMCG case, it always returns > > true, analogous to mem_cgroup_tryget(). This is useful for, e.g., the > > new zswap writeback scheme, where we need to select the next online > > memcg as a candidate for the global limit reclaim. > > > > Signed-off-by: Nhat Pham > > Reviewed-by: Yosry Ahmed Thanks for the review, Yosry :) Really appreciate the effort and your comments so far. > > > --- > > include/linux/memcontrol.h | 10 ++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 7bdcf3020d7a..2bd7d14ace78 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > > *memcg) > > return !memcg || css_tryget(&memcg->css); > > } > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > +{ > > + return !memcg || css_tryget_online(&memcg->css); > > +} > > + > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > { > > if (memcg) > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct > > mem_cgroup *memcg) > > return true; > > } > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > +{ > > + return true; > > +} > > + > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > { > > } > > -- > > 2.34.1
[PATCH v8 3/6] zswap: make shrinking memcg-aware (fix)
Use the correct function for the onlineness check for the memcg selection, and use mem_cgroup_iter_break() to break the iteration. Suggested-by: Yosry Ahmed Signed-off-by: Nhat Pham --- mm/zswap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index f323e45cbdc7..7a84c1454988 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -834,9 +834,9 @@ static void shrink_worker(struct work_struct *w) goto resched; } - if (!mem_cgroup_online(memcg)) { + if (!mem_cgroup_tryget_online(memcg)) { /* drop the reference from mem_cgroup_iter() */ - mem_cgroup_put(memcg); + mem_cgroup_iter_break(NULL, memcg); pool->next_shrink = NULL; spin_unlock(&zswap_pools_lock); @@ -985,7 +985,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool) list_lru_destroy(&pool->list_lru); spin_lock(&zswap_pools_lock); - mem_cgroup_put(pool->next_shrink); + mem_cgroup_iter_break(NULL, pool->next_shrink); pool->next_shrink = NULL; spin_unlock(&zswap_pools_lock); -- 2.34.1
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On 12/05, Willem de Bruijn wrote: > Stanislav Fomichev wrote: > > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > > wrote: > > > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > > This patch enables Launch Time (Time-Based Scheduling) support to > > > > > > > XDP zero > > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > > --- > > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > > > As requested before, I think we need to see another driver > > > > > > implementing > > > > > > this. > > > > > > > > > > > > I propose driver igc and chip i225. > > > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 > > > > > > second > > > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver > > > > > > that > > > > > > mentions/documents these different hardware limitations. It is > > > > > > natural > > > > > > that different types of hardware have limitations. This is a > > > > > > close-to > > > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > > > limitations we can expose this API without too many limitations for > > > > > > more > > > > > > capable hardware. > > > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > > cannot be programmed. > > > > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > > > So user won't know if their request is fail. 
> > > > It is complex to inform user which request is failing. > > > > Therefore, IMHO, it is good that we let driver handle the error > > > > silently. > > > > > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > > never make it to the wire, right? > > Programmable behavior is to either drop or cap to some boundary > value, such as the farthest programmable time in the future: the > horizon. In fq: > > /* Check if packet timestamp is too far in the future. */ > if (fq_packet_beyond_horizon(skb, q, now)) { > if (q->horizon_drop) { > q->stat_horizon_drops++; > return qdisc_drop(skb, sch, to_free); > } > q->stat_horizon_caps++; > skb->tstamp = now + q->horizon; > } > fq_skb_cb(skb)->time_to_send = skb->tstamp; > > Drop is the more obviously correct mode. > > Programming with a clock source that the driver does not support will > then be a persistent failure. > > Preferably, this driver capability can be queried beforehand (rather > than only through reading error counters afterwards). > > Perhaps it should not be a driver task to convert from possibly > multiple clock sources to the device native clock. Right now, we do > use per-device timecounters for this, implemented in the driver. > > As for which clocks are relevant. For PTP, I suppose the device PHC, > converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. Do we need to expose some generic netdev netlink apis to query/adjust nic clock sources (or maybe there is something existing already)? Then the userspace can be responsible for syncing/converting the timestamps to the internal nic clocks. +1 to trying to avoid doing this in the drivers. > > > That is clearly a situation that the user should be informed about. For > > > RT systems this normally means that something is really wrong regarding > > > timing / cycle overflow. Such systems have to react on that situation. 
> > > > In general, af_xdp is a bit lacking in this 'notify the user that they > > somehow messed up' area :-( > > For example, pushing a tx descriptor with a wrong addr/len in zc mode > > will not give any visible signal back (besides driver potentially > > spilling something into dmesg as it was in the mlx case). > > We can probably start with having some counters for these events? > > This is because the AF_XDP completion queue descriptor format is only > a u64 address? Yeah. XDP_COPY mode has the descriptor validation which is exported via recvmsg errno, but zerocopy path seems to be too deep in the stack to report something back. And there is no place, as you mention, in the completion ring to report the status. > Could error conditions be reported on tx completion in the metadata, > using xsk_tx_metadata_complete? That would be
[PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)
Rename ZSWP_WB to ZSWPWB to better match the existing counters naming
scheme.

Suggested-by: Johannes Weiner
Signed-off-by: Nhat Pham
---
 include/linux/vm_event_item.h | 2 +-
 mm/memcontrol.c               | 2 +-
 mm/vmstat.c                   | 2 +-
 mm/zswap.c                    | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f4569ad98edf..747943bc8cc2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_ZSWAP
 	ZSWPIN,
 	ZSWPOUT,
-	ZSWP_WB,
+	ZSWPWB,
 #endif
 #ifdef CONFIG_X86
 	DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21d79249c8b4..0286b7d38832 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	ZSWPIN,
 	ZSWPOUT,
-	ZSWP_WB,
+	ZSWPWB,
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2249f85e4a87..cfd8d8256f8e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_ZSWAP
 	"zswpin",
 	"zswpout",
-	"zswp_wb",
+	"zswpwb",
 #endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index c65b8ccc6b72..0fb0945c0031 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	zswap_written_back_pages++;
 
 	if (entry->objcg)
-		count_objcg_event(entry->objcg, ZSWP_WB);
+		count_objcg_event(entry->objcg, ZSWPWB);
 
-	count_vm_event(ZSWP_WB);
+	count_vm_event(ZSWPWB);
 	/*
 	 * Writeback started successfully, the page now belongs to the
 	 * swapcache. Drop the entry from zswap - unless invalidate already
--
2.34.1
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 11:00 AM Yosry Ahmed wrote: > > [..] > > > > static void shrink_worker(struct work_struct *w) > > > > { > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > shrink_work); > > > > + struct mem_cgroup *memcg; > > > > int ret, failures = 0; > > > > > > > > + /* global reclaim will select cgroup in a round-robin fashion. > > > > */ > > > > do { > > > > - ret = zswap_reclaim_entry(pool); > > > > - if (ret) { > > > > - zswap_reject_reclaim_fail++; > > > > - if (ret != -EAGAIN) > > > > + spin_lock(_pools_lock); > > > > + pool->next_shrink = mem_cgroup_iter(NULL, > > > > pool->next_shrink, NULL); > > > > + memcg = pool->next_shrink; > > > > + > > > > + /* > > > > +* We need to retry if we have gone through a full > > > > round trip, or if we > > > > +* got an offline memcg (or else we risk undoing the > > > > effect of the > > > > +* zswap memcg offlining cleanup callback). This is not > > > > catastrophic > > > > +* per se, but it will keep the now offlined memcg > > > > hostage for a while. > > > > +* > > > > +* Note that if we got an online memcg, we will keep > > > > the extra > > > > +* reference in case the original reference obtained by > > > > mem_cgroup_iter > > > > +* is dropped by the zswap memcg offlining callback, > > > > ensuring that the > > > > +* memcg is not killed when we are reclaiming. > > > > +*/ > > > > + if (!memcg) { > > > > + spin_unlock(_pools_lock); > > > > + if (++failures == MAX_RECLAIM_RETRIES) > > > > break; > > > > + > > > > + goto resched; > > > > + } > > > > + > > > > + if (!mem_cgroup_online(memcg)) { > > > > + /* drop the reference from mem_cgroup_iter() */ > > > > + mem_cgroup_put(memcg); > > > > > > Probably better to use mem_cgroup_iter_break() here? > > > > mem_cgroup_iter_break(NULL, memcg) seems to perform the same thing, right? > > Yes, but it's better to break the iteration with the documented API > (e.g. if mem_cgroup_iter_break() changes to do extra work). 
Hmm, a mostly aesthetic fix to me, but I don't have a strong opinion otherwise. > > > > > > > > > Also, I don't see mem_cgroup_tryget_online() being used here (where I > > > expected it to be used), did I miss it? > > > > Oh shoot yeah that was a typo - it should be > > mem_cgroup_tryget_online(). Let me send a fix to that. > > > > > > > > > + pool->next_shrink = NULL; > > > > + spin_unlock(_pools_lock); > > > > + > > > > if (++failures == MAX_RECLAIM_RETRIES) > > > > break; > > > > + > > > > + goto resched; > > > > } > > > > + spin_unlock(_pools_lock); > > > > + > > > > + ret = shrink_memcg(memcg); > > > > > > We just checked for online-ness above, and then shrink_memcg() checks > > > it again. Is this intentional? > > > > Hmm these two checks are for two different purposes. The check above > > is mainly to prevent accidentally undoing the offline cleanup callback > > during memcg selection step. Inside shrink_memcg(), we check > > onlineness again to prevent reclaiming from offlined memcgs - which in > > effect will trigger the reclaim of the parent's memcg. > > Right, but two checks in close proximity are not doing a lot. > Especially that the memcg online-ness can change right after the check > inside shrink_memcg() anyway, so it's a best effort thing. > > Anyway, it shouldn't matter much. We can leave it. > > > > > > > > > > + /* drop the extra reference */ > > > > > > Where does the extra reference come from? > > > > The extra reference is from mem_cgroup_tryget_online(). We get two > > references in the dance above - one from mem_cgroup_iter() (which can > > be dropped) and one extra from mem_cgroup_tryget_online(). I kept the > > second one in case the first one was dropped by the zswap memcg > > offlining callback, but after reclaiming it is safe to just drop it. > > Right. I was confused by the missing mem_cgroup_tryget_online(). 
> > > > > > > > > > + mem_cgroup_put(memcg); > > > > + > > > > + if (ret == -EINVAL) > > > > + break; > > > > + if (ret && ++failures == MAX_RECLAIM_RETRIES) > > > > + break; > > > > + > > > > +resched: > > > > cond_resched(); > > > > } while (!zswap_can_accept()); > > > > -
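The retry logic being discussed — skip a NULL iterator result (end of a round trip) and offline memcgs, but count both as failures so an all-offline hierarchy cannot spin forever — can be modeled in isolation. The sketch below replaces `mem_cgroup_iter()` and `shrink_memcg()` with stubs over an array; it illustrates only the control flow of `shrink_worker()`, none of the locking or reference counting:

```c
#include <stddef.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 5

struct fake_memcg { bool online; };

/*
 * Stand-in for mem_cgroup_iter(): walks an array round-robin and
 * returns NULL once per full round trip, like the real iterator does
 * at the end of the hierarchy walk.
 */
static struct fake_memcg *iter_next(struct fake_memcg **list, size_t n,
				    size_t *pos)
{
	if (*pos == n) {
		*pos = 0;
		return NULL;	/* end of one full round */
	}
	return list[(*pos)++];
}

/*
 * Control flow of shrink_worker(): NULL and offline memcgs both count
 * as failures, capped at MAX_RECLAIM_RETRIES; online memcgs are
 * "reclaimed" (here: counted). Returns the number of successful
 * shrink calls before giving up or reaching the target.
 */
static int shrink_rounds(struct fake_memcg **list, size_t n, int wanted)
{
	size_t pos = 0;
	int failures = 0, done = 0;

	while (done < wanted) {
		struct fake_memcg *m = iter_next(list, n, &pos);

		if (!m || !m->online) {
			if (++failures == MAX_RECLAIM_RETRIES)
				break;
			continue;	/* stands in for "goto resched" */
		}
		done++;	/* stands in for a successful shrink_memcg() */
	}
	return done;
}
```

With every memcg offline the loop terminates after MAX_RECLAIM_RETRIES failures; with at least one online memcg it keeps making progress across round trips.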
Re: [RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
On Tue, 2023-12-05 at 11:59 -0700, Jonathan Corbet wrote:
> Nikolai Kondrashov writes:
>
> > Introduce a new tag, 'Tested-with:', documented in the
> > Documentation/process/submitting-patches.rst file.
[]
> I have to ask whether we *really* need to introduce yet another tag for
> this. How are we going to use this information? Are we going to try to
> make a tag for every way in which somebody might test a patch?

In general, I think Link: would be good enough. And remember that all
this goes stale after a while, and that includes old test suites.
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
[..] > > > static void shrink_worker(struct work_struct *w) > > > { > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > shrink_work); > > > + struct mem_cgroup *memcg; > > > int ret, failures = 0; > > > > > > + /* global reclaim will select cgroup in a round-robin fashion. */ > > > do { > > > - ret = zswap_reclaim_entry(pool); > > > - if (ret) { > > > - zswap_reject_reclaim_fail++; > > > - if (ret != -EAGAIN) > > > + spin_lock(_pools_lock); > > > + pool->next_shrink = mem_cgroup_iter(NULL, > > > pool->next_shrink, NULL); > > > + memcg = pool->next_shrink; > > > + > > > + /* > > > +* We need to retry if we have gone through a full round > > > trip, or if we > > > +* got an offline memcg (or else we risk undoing the > > > effect of the > > > +* zswap memcg offlining cleanup callback). This is not > > > catastrophic > > > +* per se, but it will keep the now offlined memcg > > > hostage for a while. > > > +* > > > +* Note that if we got an online memcg, we will keep the > > > extra > > > +* reference in case the original reference obtained by > > > mem_cgroup_iter > > > +* is dropped by the zswap memcg offlining callback, > > > ensuring that the > > > +* memcg is not killed when we are reclaiming. > > > +*/ > > > + if (!memcg) { > > > + spin_unlock(_pools_lock); > > > + if (++failures == MAX_RECLAIM_RETRIES) > > > break; > > > + > > > + goto resched; > > > + } > > > + > > > + if (!mem_cgroup_online(memcg)) { > > > + /* drop the reference from mem_cgroup_iter() */ > > > + mem_cgroup_put(memcg); > > > > Probably better to use mem_cgroup_iter_break() here? > > mem_cgroup_iter_break(NULL, memcg) seems to perform the same thing, right? Yes, but it's better to break the iteration with the documented API (e.g. if mem_cgroup_iter_break() changes to do extra work). > > > > > Also, I don't see mem_cgroup_tryget_online() being used here (where I > > expected it to be used), did I miss it? 
> > Oh shoot yeah that was a typo - it should be > mem_cgroup_tryget_online(). Let me send a fix to that. > > > > > > + pool->next_shrink = NULL; > > > + spin_unlock(_pools_lock); > > > + > > > if (++failures == MAX_RECLAIM_RETRIES) > > > break; > > > + > > > + goto resched; > > > } > > > + spin_unlock(_pools_lock); > > > + > > > + ret = shrink_memcg(memcg); > > > > We just checked for online-ness above, and then shrink_memcg() checks > > it again. Is this intentional? > > Hmm these two checks are for two different purposes. The check above > is mainly to prevent accidentally undoing the offline cleanup callback > during memcg selection step. Inside shrink_memcg(), we check > onlineness again to prevent reclaiming from offlined memcgs - which in > effect will trigger the reclaim of the parent's memcg. Right, but two checks in close proximity are not doing a lot. Especially that the memcg online-ness can change right after the check inside shrink_memcg() anyway, so it's a best effort thing. Anyway, it shouldn't matter much. We can leave it. > > > > > > + /* drop the extra reference */ > > > > Where does the extra reference come from? > > The extra reference is from mem_cgroup_tryget_online(). We get two > references in the dance above - one from mem_cgroup_iter() (which can > be dropped) and one extra from mem_cgroup_tryget_online(). I kept the > second one in case the first one was dropped by the zswap memcg > offlining callback, but after reclaiming it is safe to just drop it. Right. I was confused by the missing mem_cgroup_tryget_online(). > > > > > > + mem_cgroup_put(memcg); > > > + > > > + if (ret == -EINVAL) > > > + break; > > > + if (ret && ++failures == MAX_RECLAIM_RETRIES) > > > + break; > > > + > > > +resched: > > > cond_resched(); > > > } while (!zswap_can_accept()); > > > - zswap_pool_put(pool); > > > } > > > > > > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) [..] 
> > > @@ -1240,15 +1395,15 @@ bool zswap_store(struct folio *folio) > > > zswap_invalidate_entry(tree, dupentry); > > > } > > > spin_unlock(>lock); > > > - > > > - /* > > > -* XXX: zswap reclaim does not work with
Re: [RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
Nikolai Kondrashov writes: > Introduce a new tag, 'Tested-with:', documented in the > Documentation/process/submitting-patches.rst file. > > The tag is expected to contain the test suite command which was executed > for the commit, and to certify it passed. Additionally, it can contain a > URL pointing to the execution results, after a '#' character. > > Prohibit the V: field from containing the '#' character correspondingly. > > Signed-off-by: Nikolai Kondrashov > --- > Documentation/process/submitting-patches.rst | 10 ++ > MAINTAINERS | 2 +- > scripts/checkpatch.pl| 4 ++-- > 3 files changed, 13 insertions(+), 3 deletions(-) I have to ask whether we *really* need to introduce yet another tag for this. How are we going to use this information? Are we going to try to make a tag for every way in which somebody might test a patch? Thanks, jon
Re: [RFC PATCH v2 02/10] MAINTAINERS: Introduce V: entry for tests
On Tue, 2023-12-05 at 20:02 +0200, Nikolai Kondrashov wrote:
> Require the entry values to not contain the '@' character, so they could
> be distinguished from emails (always) output by get_maintainer.pl.

Why is this useful? Why the need to distinguish?
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat
On Tue, Dec 5, 2023 at 10:22 AM Yosry Ahmed wrote: > > On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > > > From: Domenico Cerasuolo > > > > Since zswap now writes back pages from memcg-specific LRUs, we now need a > > new stat to show writebacks count for each memcg. > > > > Suggested-by: Nhat Pham > > Signed-off-by: Domenico Cerasuolo > > Signed-off-by: Nhat Pham > > --- > > include/linux/vm_event_item.h | 1 + > > mm/memcontrol.c | 1 + > > mm/vmstat.c | 1 + > > mm/zswap.c| 4 > > 4 files changed, 7 insertions(+) > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > index d1b847502f09..f4569ad98edf 100644 > > --- a/include/linux/vm_event_item.h > > +++ b/include/linux/vm_event_item.h > > @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > #ifdef CONFIG_ZSWAP > > ZSWPIN, > > ZSWPOUT, > > + ZSWP_WB, > > I think you dismissed Johannes's comment from v7 about ZSWPWB and > "zswpwb" being more consistent with the existing events. I missed that entirely. Oops. Yeah I prefer ZSWPWB too. Let me send a fix. 
> > > #endif > > #ifdef CONFIG_X86 > > DIRECT_MAP_LEVEL2_SPLIT, > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 792ca21c5815..21d79249c8b4 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > ZSWPIN, > > ZSWPOUT, > > + ZSWP_WB, > > #endif > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > THP_FAULT_ALLOC, > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index afa5a38fcc9c..2249f85e4a87 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = { > > #ifdef CONFIG_ZSWAP > > "zswpin", > > "zswpout", > > + "zswp_wb", > > #endif > > #ifdef CONFIG_X86 > > "direct_map_level2_splits", > > diff --git a/mm/zswap.c b/mm/zswap.c > > index f323e45cbdc7..49b79393e472 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct > > list_head *item, struct list_lru_o > > } > > zswap_written_back_pages++; > > > > + if (entry->objcg) > > + count_objcg_event(entry->objcg, ZSWP_WB); > > + > > + count_vm_event(ZSWP_WB); > > /* > > * Writeback started successfully, the page now belongs to the > > * swapcache. Drop the entry from zswap - unless invalidate already > > -- > > 2.34.1
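The counting pattern in the hunk above — always bump the global vmstat event, and additionally bump the per-memcg event only when the entry is charged to an obj_cgroup — can be shown with a trivial stand-alone model (stub types; `count_writeback()` is hypothetical and only mirrors the two calls in `shrink_memcg_cb()`):

```c
#include <stddef.h>

/* Minimal model: one global counter plus an optional per-cgroup one. */
struct objcg { unsigned long zswpwb; };

static unsigned long vm_zswpwb;	/* stands in for count_vm_event(ZSWPWB) */

/*
 * The global event is always counted; the per-memcg event only when
 * the entry is charged to a cgroup (entry->objcg may be NULL, e.g.
 * when memcg accounting is disabled).
 */
static void count_writeback(struct objcg *objcg)
{
	if (objcg)
		objcg->zswpwb++;	/* count_objcg_event(objcg, ZSWPWB) */
	vm_zswpwb++;
}
```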
Re: [RFC PATCH v2 01/10] get_maintainer: Survive querying missing files
On Tue, 2023-12-05 at 20:02 +0200, Nikolai Kondrashov wrote:
> Do not die, but only warn when scripts/get_maintainer.pl is asked to
> retrieve information about a missing file.
>
> This allows scripts/checkpatch.pl to query MAINTAINERS while processing
> patches which are removing files.

Why is this useful? Give a for-instance example please.
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 10:21 AM Yosry Ahmed wrote:
>
> On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
> >
> > From: Domenico Cerasuolo
> >
> > Currently, we only have a single global LRU for zswap. This makes it
> > impossible to perform workload-specific shrinking - a memcg cannot
> > determine which pages in the pool it owns, and often ends up writing
> > pages from other memcgs. This issue has been previously observed in
> > practice and mitigated by simply disabling memcg-initiated shrinking:
> >
> > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
> >
> > This patch fully resolves the issue by replacing the global zswap LRU
> > with memcg- and NUMA-specific LRUs, and modifies the reclaim logic:
> >
> > a) When a store attempt hits a memcg limit, it now triggers a
> >    synchronous reclaim attempt that, if successful, allows the new
> >    hotter page to be accepted by zswap.
> > b) If the store attempt instead hits the global zswap limit, it will
> >    trigger an asynchronous reclaim attempt, in which a memcg is
> >    selected for reclaim in a round-robin-like fashion.
> > > > Signed-off-by: Domenico Cerasuolo > > Co-developed-by: Nhat Pham > > Signed-off-by: Nhat Pham > > --- > > include/linux/memcontrol.h | 5 + > > include/linux/zswap.h | 2 + > > mm/memcontrol.c| 2 + > > mm/swap.h | 3 +- > > mm/swap_state.c| 24 +++- > > mm/zswap.c | 269 + > > 6 files changed, 245 insertions(+), 60 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 2bd7d14ace78..a308c8eacf20 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup > > *page_memcg_check(struct page *page) > > return NULL; > > } > > > > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct > > obj_cgroup *objcg) > > +{ > > + return NULL; > > +} > > + > > static inline bool folio_memcg_kmem(struct folio *folio) > > { > > return false; > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > > index 2a60ce39cfde..e571e393669b 100644 > > --- a/include/linux/zswap.h > > +++ b/include/linux/zswap.h > > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio); > > void zswap_invalidate(int type, pgoff_t offset); > > void zswap_swapon(int type); > > void zswap_swapoff(int type); > > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > > > > #else > > > > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio) > > static inline void zswap_invalidate(int type, pgoff_t offset) {} > > static inline void zswap_swapon(int type) {} > > static inline void zswap_swapoff(int type) {} > > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > > > > #endif > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 470821d1ba1a..792ca21c5815 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct > > cgroup_subsys_state *css) > > page_counter_set_min(>memory, 0); > > page_counter_set_low(>memory, 0); > > > > + 
zswap_memcg_offline_cleanup(memcg); > > + > > memcg_offline_kmem(memcg); > > reparent_shrinker_deferred(memcg); > > wb_memcg_offline(memcg); > > diff --git a/mm/swap.h b/mm/swap.h > > index 73c332ee4d91..c0dc73e10e91 100644 > > --- a/mm/swap.h > > +++ b/mm/swap.h > > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, > > gfp_t gfp_mask, > >struct swap_iocb **plug); > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated); > > +bool *new_page_allocated, > > +bool skip_if_exists); > > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > > struct mempolicy *mpol, pgoff_t ilx); > > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 85d9e5806a6a..6c84236382f3 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > > address_space *mapping, > > > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated) > > +bool *new_page_allocated, > > +bool skip_if_exists) > > { > > struct swap_info_struct *si; > > struct folio *folio; > > @@ -470,6
[RFC PATCH v2 10/10] MAINTAINERS: Add proposal strength to V: entries
Require the MAINTAINERS V: entries to begin with a keyword, one of SUGGESTED/RECOMMENDED/REQUIRED, signifying how strongly the test is proposed for verifying the subsystem changes, prompting scripts/checkpatch.pl to produce CHECK/WARNING/ERROR messages respectively, whenever the commit message doesn't have the corresponding Tested-with: tag. Signed-off-by: Nikolai Kondrashov --- Documentation/process/submitting-patches.rst | 11 ++- MAINTAINERS | 20 +++-- scripts/checkpatch.pl| 83 3 files changed, 71 insertions(+), 43 deletions(-) diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index 45bd1a713ef33..199fadc50cf62 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -233,18 +233,21 @@ Test your changes Test the patch to the best of your ability. Check the MAINTAINERS file for the subsystem(s) you are changing to see if there are any **V:** entries -proposing particular test suites, either directly as commands, or via -documentation references. +proposing particular test suites. + +The **V:** entries start with a proposal strength keyword +(SUGGESTED/RECOMMENDED/REQUIRED), followed either by a command, or a +documentation reference. Test suite references start with a ``*`` (similar to C pointer dereferencing), followed by the name of the test suite, which would be documented in the Documentation/process/tests.rst under the corresponding heading. E.g.:: - V: *xfstests + V: SUGGESTED *xfstests Anything not starting with a ``*`` is considered a command. E.g.:: - V: tools/testing/kunit/run_checks.py + V: RECOMMENDED tools/testing/kunit/run_checks.py Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:** entries to its output. 
diff --git a/MAINTAINERS b/MAINTAINERS index 84e90ec015090..3a35e320b5a5b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -59,15 +59,19 @@ Descriptions of section entries and preferred order matches patches or files that contain one or more of the words printk, pr_info or pr_err One regex pattern per line. Multiple K: lines acceptable. - V: *Test suite* proposed for execution. The command that could be - executed to verify changes to the maintained subsystem, or a reference - to a test suite documented in Documentation/process/tests.txt. + V: *Test suite* proposed for execution for verifying changes to the + maintained subsystem. Must start with a proposal strength keyword: + (SUGGESTED/RECOMMENDED/REQUIRED), followed by the test suite command, + or a reference to a test suite documented in + Documentation/process/tests.txt. + Proposal strengths correspond to checkpatch.pl message levels + (CHECK/WARNING/ERROR respectively, whenever Tested-with: is missing). Commands must be executed from the root of the source tree. Commands must support the -h/--help option. References must be preceded with a '*'. Cannot contain '@' or '#' characters. - V: tools/testing/kunit/run_checks.py - V: *xfstests + V: SUGGESTED tools/testing/kunit/run_checks.py + V: RECOMMENDED *xfstests One test suite per line. 
Maintainers List @@ -7978,7 +7982,7 @@ L:linux-e...@vger.kernel.org S: Maintained W: http://ext4.wiki.kernel.org Q: http://patchwork.ozlabs.org/project/linux-ext4/list/ -V: *kvm-xfstests smoke +V: RECOMMENDED *kvm-xfstests smoke T: git git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git F: Documentation/filesystems/ext4/ F: fs/ext4/ @@ -11628,7 +11632,7 @@ L: linux-kselftest@vger.kernel.org L: kunit-...@googlegroups.com S: Maintained W: https://google.github.io/kunit-docs/third_party/kernel/docs/ -V: tools/testing/kunit/run_checks.py +V: RECOMMENDED tools/testing/kunit/run_checks.py T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit-fixes F: Documentation/dev-tools/kunit/ @@ -18367,7 +18371,7 @@ REGISTER MAP ABSTRACTION M: Mark Brown L: linux-ker...@vger.kernel.org S: Supported -V: *kunit +V: RECOMMENDED *kunit T: git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git F: Documentation/devicetree/bindings/regmap/ F: drivers/base/regmap/ diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index bfeb4c33b5424..9438e4f452a6c 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -1181,39 +1181,57 @@ sub is_maintained_obsolete { return $maintained_status{$filename} =~ /obsolete/i; } -# Test suites proposed per changed file +# A list of test proposal strength
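The proposal-strength policy this patch adds — SUGGESTED/RECOMMENDED/REQUIRED mapping to checkpatch CHECK/WARNING/ERROR when the Tested-with: tag is missing — is a plain lookup. checkpatch.pl itself is Perl; the C sketch below only illustrates the mapping the patch describes:

```c
#include <string.h>
#include <stddef.h>

/*
 * Map a V: proposal strength keyword to the checkpatch.pl message
 * level emitted when the commit lacks a matching Tested-with: tag.
 * Returns NULL for anything that is not a valid strength keyword.
 */
static const char *missing_tag_level(const char *strength)
{
	if (!strcmp(strength, "SUGGESTED"))
		return "CHECK";
	if (!strcmp(strength, "RECOMMENDED"))
		return "WARNING";
	if (!strcmp(strength, "REQUIRED"))
		return "ERROR";
	return NULL;
}
```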
[RFC PATCH v2 09/10] MAINTAINERS: Propose kunit tests for regmap
From: Mark Brown

The regmap core and especially cache code have reasonable kunit
coverage, ask people to use that to test regmap changes.

Signed-off-by: Mark Brown
Signed-off-by: Nikolai Kondrashov
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 669b5ff571730..84e90ec015090 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18367,6 +18367,7 @@ REGISTER MAP ABSTRACTION
 M:	Mark Brown
 L:	linux-ker...@vger.kernel.org
 S:	Supported
+V:	*kunit
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git
 F:	Documentation/devicetree/bindings/regmap/
 F:	drivers/base/regmap/
--
2.42.0
[RFC PATCH v2 08/10] docs: tests: Document kunit in general
Add an entry on the complete set of kunit tests to the Documentation/process/tests.rst, so that it could be referenced in MAINTAINERS, and is catalogued in general. Signed-off-by: Nikolai Kondrashov --- Documentation/process/tests.rst | 23 +++ 1 file changed, 23 insertions(+) diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst index cfaf937dc4d5f..0760229fc32b0 100644 --- a/Documentation/process/tests.rst +++ b/Documentation/process/tests.rst @@ -71,3 +71,26 @@ kvm-xfstests smoke The "kvm-xfstests smoke" is a minimal subset of xfstests for testing all major file systems, running under KVM. + +kunit +- + +:Summary: complete set of KUnit unit tests +:Command: tools/testing/kunit/kunit.py run --alltests +:Docs: https://docs.kernel.org/dev-tools/kunit/ + +KUnit tests are part of the kernel, written in the C (programming) language, +and test parts of the Kernel implementation (example: a C language function). +Excluding build time, from invocation to completion, KUnit can run around 100 +tests in less than 10 seconds. KUnit can test any kernel component, for +example: file system, system calls, memory management, device drivers and so +on. + +KUnit follows the white-box testing approach. The test has access to internal +system functionality. KUnit runs in kernel space and is not restricted to +things exposed to user-space. + +In addition, KUnit has kunit_tool, a script (tools/testing/kunit/kunit.py) +that configures the Linux kernel, runs KUnit tests under QEMU or UML (User +Mode Linux), parses the test results and displays them in a user friendly +manner. -- 2.42.0
[RFC PATCH v2 07/10] MAINTAINERS: Propose kvm-xfstests smoke for ext4
Propose the "kvm-xfstests smoke" test suite for changes to the EXT4 FILE SYSTEM subsystem, as discussed previously with maintainers. Signed-off-by: Nikolai Kondrashov --- Documentation/process/tests.rst | 32 MAINTAINERS | 1 + 2 files changed, 33 insertions(+) diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst index 4ae5000e811c8..cfaf937dc4d5f 100644 --- a/Documentation/process/tests.rst +++ b/Documentation/process/tests.rst @@ -39,3 +39,35 @@ following ones recognized by the tools (regardless of the case): (even if only to report what else needs setting up) Any other entries are accepted, but not processed. + +xfstests + + +:Summary: file system regression test suite +:Source: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git +:Docs: https://github.com/tytso/xfstests-bld/blob/master/Documentation/what-is-xfstests.md + +As the name might imply, xfstests is a file system regression test suite which +was originally developed by Silicon Graphics (SGI) for the XFS file system. +Originally, xfstests, like XFS was only supported on the SGI's Irix operating +system. When XFS was ported to Linux, so was xfstests, and now xfstests is +only supported on Linux. + +Today, xfstests is used as a file system regression test suite for all of +Linux's major file systems: xfs, ext2, ext4, cifs, btrfs, f2fs, reiserfs, gfs, +jfs, udf, nfs, and tmpfs. Many file system maintainers will run a full set of +xfstests before sending patches to Linus, and will require that any major +changes be tested using xfstests before they are submitted for integration. 
+ +The easiest way to start running xfstests is under KVM with xfstests-bld: +https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md + +kvm-xfstests smoke +-- + +:Summary: file system smoke test suite +:Superset: xfstests +:Docs: https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md + +The "kvm-xfstests smoke" is a minimal subset of xfstests for testing all major +file systems, running under KVM. diff --git a/MAINTAINERS b/MAINTAINERS index 3ed15d8327919..669b5ff571730 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7978,6 +7978,7 @@ L:linux-e...@vger.kernel.org S: Maintained W: http://ext4.wiki.kernel.org Q: http://patchwork.ozlabs.org/project/linux-ext4/list/ +V: *kvm-xfstests smoke T: git git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git F: Documentation/filesystems/ext4/ F: fs/ext4/ -- 2.42.0
[RFC PATCH v2 06/10] MAINTAINERS: Support referencing test docs in V:
Support referencing test suite documentation in the V: entries of the
MAINTAINERS file. Use the '*<name>' syntax (like C pointer dereference),
where '<name>' is a second-level heading in the new
Documentation/process/tests.rst file, with the suite's description. This
syntax allows distinguishing the references from test commands.

Add a boiler-plate Documentation/process/tests.rst file, describing a
way to add structured info to the test suites in the form of field
lists. Apart from the "summary" and "command" fields, they can also
contain a "superset" field specifying the superset of the test suite,
helping reuse documentation and express both wider and narrower test
sets.

Make scripts/checkpatch.pl load the tests from the file, along with the
structured data, validate the references in MAINTAINERS, dereference
them, and output the test suite information in the CHECK messages
whenever the corresponding subsystems are changed - but only if there
was no corresponding Tested-with: tag in the commit message, certifying
it was executed successfully already.

This is supposed to help propose executing test suites which cannot be
executed immediately and need extra setup, as well as provide a place
for extra documentation and information on directly-available suites.
Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/index.rst              |   1 +
 Documentation/process/submitting-patches.rst |  21 +++-
 Documentation/process/tests.rst              |  41 +++
 MAINTAINERS                                  |   9 +-
 scripts/checkpatch.pl                        | 122 +--
 5 files changed, 177 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/process/tests.rst

diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst
index a1daa309b58d0..3eda2e7432fdb 100644
--- a/Documentation/process/index.rst
+++ b/Documentation/process/index.rst
@@ -49,6 +49,7 @@ Other guides to the community that are of interest to most developers are:
    :maxdepth: 1

    changes
+   tests
    stable-api-nonsense
    management-style
    stable-kernel-rules

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 2004df2ac1b39..45bd1a713ef33 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -233,27 +233,42 @@ Test your changes

 Test the patch to the best of your ability. Check the MAINTAINERS file for the
 subsystem(s) you are changing to see if there are any **V:** entries
-proposing particular test suite commands. E.g.::
+proposing particular test suites, either directly as commands, or via
+documentation references.
+
+Test suite references start with a ``*`` (similar to C pointer dereferencing),
+followed by the name of the test suite, which would be documented in the
+Documentation/process/tests.rst under the corresponding heading. E.g.::
+
+  V: *xfstests
+
+Anything not starting with a ``*`` is considered a command. E.g.::

   V: tools/testing/kunit/run_checks.py

 Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:**
 entries to its output.

-Execute the commands, if any, to test your changes.
+Execute the (referenced) test suites, if any, to test your changes.

 All commands must be executed from the root of the source tree.
 Each command outputs usage information, if an -h/--help option is specified.

 If a test suite you've executed completed successfully, add a ``Tested-with:
-<command>`` to the message of the commit you tested. E.g.::
+<command>`` or ``Tested-with: *<suite>`` to the message of the commit you
+tested. E.g.::

   Tested-with: tools/testing/kunit/run_checks.py

+or::
+
+  Tested-with: *xfstests
+
 Optionally, add a '#' character followed by a publicly-accessible URL
 containing the test results, if you make them available. E.g.::

   Tested-with: tools/testing/kunit/run_checks.py # https://kernelci.org/test/2239874
+  Tested-with: *xfstests # https://kernelci.org/test/2239324

 Select the recipients for your patch

diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst
new file mode 100644
index 0..4ae5000e811c8
--- /dev/null
+++ b/Documentation/process/tests.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _tests:
+
+Tests you can run
+=================
+
+There are many automated tests available for the Linux kernel, and some
+userspace tests which happen to also test the kernel. Here are some of them,
+along with the instructions on where to get them and how to run them for
+various purposes.
+
+This document has to follow a certain structure to allow tool access.
+Second-level headers (underscored with dashes '-') must contain test suite
+names, and the corresponding section must contain the test description.
+
+The test suites can be referenced by name, preceded with a '*', in the "V:"
+lines in the MAINTAINERS file, as well as in the "Tested-with:" tag in commit
+messages. E.g::
+
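[Editor's illustration, not part of the patch: the V:/Tested-with: value syntax described in this patch — a leading '*' marking a catalog reference vs. a literal command, and an optional '#' separating a results URL — is simple to parse mechanically. A minimal sketch; the function name and the (kind, name, url) tuple shape are assumptions for illustration only.]

```python
def parse_test_entry(value):
    """Split a V:/Tested-with: value into (kind, name, url).

    Mirrors the rules in the patch: a leading '*' marks a reference
    to a suite documented in Documentation/process/tests.rst,
    anything else is a literal command, and an optional '#'
    separates a results URL (which is why V: values may not
    contain '#').
    """
    body, _, url = value.partition("#")
    body = body.strip()
    kind = "reference" if body.startswith("*") else "command"
    name = body[1:] if kind == "reference" else body
    return (kind, name, url.strip() or None)

print(parse_test_entry("*xfstests"))
print(parse_test_entry("tools/testing/kunit/run_checks.py"))
print(parse_test_entry("*xfstests # https://kernelci.org/test/2239324"))
```

A suite name containing '#' would be rejected by the checkpatch validation added earlier in the series, so splitting on the first '#' is safe.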
[RFC PATCH v2 03/10] MAINTAINERS: Propose kunit core tests for framework changes
DONOTMERGE: The command in question should support -h/--help option.

Signed-off-by: Nikolai Kondrashov
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e6d0777e21657..68821eecf61cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11624,6 +11624,7 @@ L:	linux-kselftest@vger.kernel.org
 L:	kunit-...@googlegroups.com
 S:	Maintained
 W:	https://google.github.io/kunit-docs/third_party/kernel/docs/
+V:	tools/testing/kunit/run_checks.py
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit-fixes
 F:	Documentation/dev-tools/kunit/
--
2.42.0
[RFC PATCH v2 05/10] checkpatch: Propose tests to execute
Make scripts/checkpatch.pl output a 'CHECK' advertising any test suites
proposed for the changed subsystems, and prompting their execution.

Using 'CHECK', instead of 'WARNING' or 'ERROR', because test suite commands
executed for testing can generally be off by an option/argument or two,
depending on the situation, while still satisfying the maintainer
requirements, but failing the comparison with the V: entry and raising alarm
unnecessarily. However, see the later patch adding the proposal strength to
the V: entry and allowing raising the severity of the message for those who'd
like that.

Signed-off-by: Nikolai Kondrashov
---
 scripts/checkpatch.pl | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index bea602c30df5d..1da617e1edb5f 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1144,6 +1144,29 @@ sub is_maintained_obsolete {
 	return $maintained_status{$filename} =~ /obsolete/i;
 }

+# Test suites proposed per changed file
+our %files_proposed_tests = ();
+
+# Return a list of test suites proposed for execution for a particular file
+sub get_file_proposed_tests {
+	my ($filename) = @_;
+	my $file_proposed_tests;
+
+	return () if (!$tree || !(-e "$root/scripts/get_maintainer.pl"));
+
+	if (!exists($files_proposed_tests{$filename})) {
+		my $command = "perl $root/scripts/get_maintainer.pl --test --multiline --nogit --nogit-fallback -f $filename";
+		# Ignore warnings on stderr
+		my $output = `$command 2>/dev/null`;
+		# But regenerate stderr on failure
+		die "Failed retrieving tests proposed for changes to \"$filename\":\n" .
+		    `$command 2>&1 >/dev/null` if ($?);
+		$files_proposed_tests{$filename} = [grep { !/@/ } split("\n", $output)]
+	}
+
+	$file_proposed_tests = $files_proposed_tests{$filename};
+	return @$file_proposed_tests;
+}
+
 sub is_SPDX_License_valid {
 	my ($license) = @_;

@@ -2689,6 +2712,9 @@ sub process {
 	my @setup_docs = ();
 	my $setup_docs = 0;

+	# Test suites which should not be proposed for execution
+	my %dont_propose_tests = ();
+
 	my $camelcase_file_seeded = 0;

 	my $checklicenseline = 1;

@@ -2907,6 +2933,17 @@ sub process {
 			}
 		}

+		# Check if tests are proposed for changes to the file
+		foreach my $test (get_file_proposed_tests($realfile)) {
+			next if exists $dont_propose_tests{$test};
+			CHK("TEST_PROPOSAL",
+			    "Running the following test suite is proposed for changes to $realfile:\n" .
+			    "$test\n" .
+			    "Add the following to the tested commit's message, IF IT PASSES:\n" .
+			    "Tested-with: $test\n");
+			$dont_propose_tests{$test} = 1;
+		}
+
 		next;
 	}

@@ -3233,6 +3270,12 @@ sub process {
 			}
 		}

+# Check and accumulate executed test suites (stripping URLs off the end)
+		if (!$in_commit_log && $line =~ /^\s*Tested-with:\s*(.*?)\s*#.*$/i) {
+			# Do not propose this certified-passing test suite
+			$dont_propose_tests{$1} = 1;
+		}
+
 # Check email subject for common tools that don't need to be mentioned
 		if ($in_header_lines &&
 		    $line =~ /^Subject:.*\b(?:checkpatch|sparse|smatch)\b[^:]/i) {
--
2.42.0
[RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
Introduce a new tag, 'Tested-with:', documented in the
Documentation/process/submitting-patches.rst file. The tag is expected to
contain the test suite command which was executed for the commit, and to
certify it passed. Additionally, it can contain a URL pointing to the
execution results, after a '#' character.

Prohibit the V: field from containing the '#' character correspondingly.

Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/submitting-patches.rst | 10 ++
 MAINTAINERS                                  |  2 +-
 scripts/checkpatch.pl                        |  4 ++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index f034feaf1369e..2004df2ac1b39 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -245,6 +245,16 @@ Execute the commands, if any, to test your changes.
 All commands must be executed from the root of the source tree. Each command
 outputs usage information, if an -h/--help option is specified.

+If a test suite you've executed completed successfully, add a ``Tested-with:
+<command>`` to the message of the commit you tested. E.g.::
+
+  Tested-with: tools/testing/kunit/run_checks.py
+
+Optionally, add a '#' character followed by a publicly-accessible URL
+containing the test results, if you make them available. E.g.::
+
+  Tested-with: tools/testing/kunit/run_checks.py # https://kernelci.org/test/2239874
+
 Select the recipients for your patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 68821eecf61cf..28fbb0eb335ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -63,7 +63,7 @@ Descriptions of section entries and preferred order
 	   executed to verify changes to the maintained subsystem.
 	   Must be executed from the root of the source tree.
 	   Must support the -h/--help option.
-	   Cannot contain '@' character.
+	   Cannot contain '@' or '#' characters.
 	   V:	tools/testing/kunit/run_checks.py
 	   One test suite per line.

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a184e576c980b..bea602c30df5d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3686,9 +3686,9 @@ sub process {
 # check MAINTAINERS V: entries are valid
 		if ($rawline =~ /^\+V:\s*(.*)/) {
 			my $name = $1;
-			if ($name =~ /@/) {
+			if ($name =~ /[@#]/) {
 				ERROR("TEST_PROPOSAL_INVALID",
-				      "Test proposal cannot contain '\@' character\n" . $herecurr);
+				      "Test proposal cannot contain '\@' or '#' characters\n" . $herecurr);
 			}
 		}
 	}
--
2.42.0
[RFC PATCH v2 01/10] get_maintainer: Survive querying missing files
Do not die, but only warn when scripts/get_maintainer.pl is asked to retrieve
information about a missing file. This allows scripts/checkpatch.pl to query
MAINTAINERS while processing patches which are removing files.

Signed-off-by: Nikolai Kondrashov
---
 scripts/get_maintainer.pl | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 16d8ac6005b6f..37901c2298388 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -541,7 +541,11 @@ foreach my $file (@ARGV) {
 	if ((-d $file)) {
 	    $file =~ s@([^/])$@$1/@;
 	} elsif (!(-f $file)) {
-	    die "$P: file '${file}' not found\n";
+	    if ($from_filename) {
+		warn "$P: file '${file}' not found\n";
+	    } else {
+		die "$P: file '${file}' not found\n";
+	    }
 	}
     }
     if ($from_filename && (vcs_exists() && !vcs_file_exists($file))) {
--
2.42.0
[RFC PATCH v2 02/10] MAINTAINERS: Introduce V: entry for tests
Introduce a new 'V:' ("Verify") entry to MAINTAINERS. The entry accepts a
test suite command which is proposed to be executed for each contribution to
the subsystem.

Extend scripts/get_maintainer.pl to support retrieving the V: entries when
'--test' option is specified.

Require the entry values to not contain the '@' character, so they could be
distinguished from emails (always) output by get_maintainer.pl. Make
scripts/checkpatch.pl check that they don't.

Update entry ordering in both scripts/checkpatch.pl and
scripts/parse-maintainers.pl.

Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/submitting-patches.rst | 18 ++
 MAINTAINERS                                  |  7 +++
 scripts/checkpatch.pl                        | 10 +-
 scripts/get_maintainer.pl                    | 17 +++--
 scripts/parse-maintainers.pl                 |  3 ++-
 5 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 86d346bcb8ef0..f034feaf1369e 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -228,6 +228,24 @@ You should be able to justify all violations that remain in your patch.

+Test your changes
+-----------------
+
+Test the patch to the best of your ability. Check the MAINTAINERS file for the
+subsystem(s) you are changing to see if there are any **V:** entries
+proposing particular test suite commands. E.g.::
+
+  V: tools/testing/kunit/run_checks.py
+
+Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:**
+entries to its output.
+
+Execute the commands, if any, to test your changes.
+
+All commands must be executed from the root of the source tree. Each command
+outputs usage information, if an -h/--help option is specified.
+
+
 Select the recipients for your patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 788be9ab5b733..e6d0777e21657 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -59,6 +59,13 @@ Descriptions of section entries and preferred order
 	   matches patches or files that contain one or more of the words
 	   printk, pr_info or pr_err
 	   One regex pattern per line.
+	V: *Test suite* proposed for execution. The command that could be
+	   executed to verify changes to the maintained subsystem.
+	   Must be executed from the root of the source tree.
+	   Must support the -h/--help option.
+	   Cannot contain '@' character.
+	   V:	tools/testing/kunit/run_checks.py
+	   One test suite per line.

 Maintainers List

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda1128..a184e576c980b 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3657,7 +3657,7 @@ sub process {
 			}
 		}
 # check MAINTAINERS entries for the right ordering too
-		my $preferred_order = 'MRLSWQBCPTFXNK';
+		my $preferred_order = 'MRLSWQBCPVTFXNK';
 		if ($rawline =~ /^\+[A-Z]:/ &&
 		    $prevrawline =~ /^[\+ ][A-Z]:/) {
 			$rawline =~ /^\+([A-Z]):\s*(.*)/;
@@ -3683,6 +3683,14 @@ sub process {
 				}
 			}
 		}
+# check MAINTAINERS V: entries are valid
+		if ($rawline =~ /^\+V:\s*(.*)/) {
+			my $name = $1;
+			if ($name =~ /@/) {
+				ERROR("TEST_PROPOSAL_INVALID",
+				      "Test proposal cannot contain '\@' character\n" . $herecurr);
+			}
+		}
 	}

 	if (($realfile =~ /Makefile.*/ || $realfile =~ /Kbuild.*/) &&

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 37901c2298388..804215a7477db 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -53,6 +53,7 @@ my $output_section_maxlen = 50;
 my $scm = 0;
 my $tree = 1;
 my $web = 0;
+my $test = 0;
 my $subsystem = 0;
 my $status = 0;
 my $letters = "";
@@ -270,6 +271,7 @@ if (!GetOptions(
 	'scm!' => \$scm,
 	'tree!' => \$tree,
 	'web!' => \$web,
+	'test!' => \$test,
 	'letters=s' => \$letters,
 	'pattern-depth=i' => \$pattern_depth,
 	'k|keywords!' => \$keywords,
@@ -319,13 +321,14 @@ if ($sections || $letters ne "") {
     $status = 0;
     $subsystem = 0;
     $web = 0;
+    $test = 0;
     $keywords = 0;
     $keywords_in_file = 0;
     $interactive = 0;
 } else {
-    my $selections = $email + $scm + $status + $subsystem + $web;
+    my
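[Editor's illustration, not part of the patch: the reason for banning '@' from V: values shows up in how the series consumes get_maintainer.pl output — everything containing '@' is an email address and gets dropped, everything else is a test entry (checkpatch does this with a perl `grep { !/@/ }`). A rough Python equivalent; the function name and sample output are assumptions for illustration only.]

```python
def extract_test_entries(output):
    """Keep only the test entries from get_maintainer.pl output.

    V: values are guaranteed not to contain '@', while every email
    address necessarily does, so filtering on '@' separates them.
    """
    return [line for line in output.splitlines() if line and "@" not in line]

# Hypothetical --test output: addresses plus one V: entry.
sample = "\n".join([
    "Shuah Khan <skhan@linuxfoundation.org>",
    "linux-kselftest@vger.kernel.org",
    "tools/testing/kunit/run_checks.py",
])
print(extract_test_entries(sample))  # → ['tools/testing/kunit/run_checks.py']
```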
[RFC PATCH v2 00/10] MAINTAINERS: Introduce V: entry for tests
Alright, here's a second version, attempting to address as many concerns as
possible. It's likely I've missed something, though.

Changes from v1:

* Make scripts/get_maintainer.pl survive querying missing files, giving a
  warning instead. This is necessary to enable scripts/checkpatch.pl to query
  MAINTAINERS about files being deleted.

* Start with the minimal change just documenting the V: entry, which accepts
  test commands directly, and tweaking the tools to deal with that.

* However, require the commands accept the -h/--help option so that users
  have an easier time getting *some* help. The run_checks.py missing that is
  the reason why the patch proposing it for kunit subsystem is marked
  "DONOTMERGE" in this version. We can drop that requirement, or soften the
  language, if there's opposition.

* Have a *separate* patch documenting 'Tested-with:' as the next (early)
  change. Mention that you can add a '#' followed by a results URL, on the
  end. Adjust the V: docs/checks to exclude '#'.

* Have a *separate* patch making scripts/checkpatch.pl propose the execution
  of the test suite defined in MAINTAINERS whenever the corresponding
  subsystem is changed.

* However, use 'CHECK', instead of 'WARNING', to allow submitters specify the
  exact (and potentially slightly different) command they used, and not have
  checkpatch.pl complain too loudly that they didn't run the (exact
  MAINTAINERS-specified) command. This unfortunately means that unless you use
  --strict, you won't see the message. We'll try to address that in a new
  change at the end.

* Have a *separate* patch introducing the test catalog and accepting
  references to that everywhere, with a special syntax to distinguish them
  from verbatim/direct commands. The syntax is prepending the test name with
  a '*' (just like C pointer dereference). Make checkpatch.pl handle that.
* Drop the recommendation to have the "Docs" and "Sources" fields in test
  descriptions, as the description text should focus on giving a good
  introduction and not prompt the user to go somewhere else immediately. They
  both can be referenced in the text where and how is appropriate.

* Generally keep the previous changes adding V: entries and test suite docs,
  and try to accommodate all the requests, but refine the "Summary" fields to
  fit the checkpatch.pl messages better.

* Have a separate patch cataloguing the complete kunit suite.

* Finally, add a patch introducing the "proposal strength" keywords
  (SUGGESTED/RECOMMENDED/REQUIRED) to the syntax of V: entries, which
  directly affect which level of checkpatch.pl message missing 'Tested-with:'
  tags would generate: CHECK/WARNING/ERROR respectively. This allows
  subsystems to disable checkpatch.pl WARNINGS/ERRORS, and keep their test
  proposals inobtrusive, if they so wish. E.g. if they expect people to
  change their commands often. At the same time allow stricter workflows for
  subsystems with more uniform testing. Or e.g. for subsystems which expect
  the tests to explain their parameters in their output, and the submitters
  to upload and link their results in their 'Tested-with:' tags.

That seems to be all, but I'm sure I forgot something :D

Anyway, send me more corrections and I'll try to address them, but it's
likely going to happen next year only.
Nick
---
Nikolai Kondrashov (9):
      get_maintainer: Survive querying missing files
      MAINTAINERS: Introduce V: entry for tests
      MAINTAINERS: Propose kunit core tests for framework changes
      docs: submitting-patches: Introduce Tested-with:
      checkpatch: Propose tests to execute
      MAINTAINERS: Support referencing test docs in V:
      MAINTAINERS: Propose kvm-xfstests smoke for ext4
      docs: tests: Document kunit in general
      MAINTAINERS: Add proposal strength to V: entries

Mark Brown (1):
      MAINTAINERS: Propose kunit tests for regmap

 Documentation/process/index.rst              |   1 +
 Documentation/process/submitting-patches.rst |  46 +++
 Documentation/process/tests.rst              |  96 +++
 MAINTAINERS                                  |  17 +++
 scripts/checkpatch.pl                        | 174 ++-
 scripts/get_maintainer.pl                    |  23 +++-
 scripts/parse-maintainers.pl                 |   3 +-
 7 files changed, 355 insertions(+), 5 deletions(-)
---
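[Editor's illustration, not part of the series: the proposal-strength keywords described in the cover letter map one-to-one onto checkpatch message levels. A minimal sketch of that mapping; the names below come from the cover letter's description, the function and the default are assumptions for illustration only.]

```python
# Proposal strength in a V: entry -> checkpatch message level, per the
# cover letter: SUGGESTED -> CHECK, RECOMMENDED -> WARNING,
# REQUIRED -> ERROR.
SEVERITY = {
    "SUGGESTED": "CHECK",
    "RECOMMENDED": "WARNING",
    "REQUIRED": "ERROR",
}

def message_level(strength="SUGGESTED"):
    """Return the checkpatch level for a missing Tested-with: tag.

    Assumes plain V: entries behave like SUGGESTED (the series
    uses CHECK by default).
    """
    return SEVERITY[strength]

print(message_level("RECOMMENDED"))  # → WARNING
```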
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Since zswap now writes back pages from memcg-specific LRUs, we now need a > new stat to show writebacks count for each memcg. > > Suggested-by: Nhat Pham > Signed-off-by: Domenico Cerasuolo > Signed-off-by: Nhat Pham > --- > include/linux/vm_event_item.h | 1 + > mm/memcontrol.c | 1 + > mm/vmstat.c | 1 + > mm/zswap.c| 4 > 4 files changed, 7 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index d1b847502f09..f4569ad98edf 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_ZSWAP > ZSWPIN, > ZSWPOUT, > + ZSWP_WB, I think you dismissed Johannes's comment from v7 about ZSWPWB and "zswpwb" being more consistent with the existing events. > #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 792ca21c5815..21d79249c8b4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > ZSWPIN, > ZSWPOUT, > + ZSWP_WB, > #endif > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > THP_FAULT_ALLOC, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index afa5a38fcc9c..2249f85e4a87 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_ZSWAP > "zswpin", > "zswpout", > + "zswp_wb", > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > diff --git a/mm/zswap.c b/mm/zswap.c > index f323e45cbdc7..49b79393e472 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct list_head > *item, struct list_lru_o > } > zswap_written_back_pages++; > > + if (entry->objcg) > + count_objcg_event(entry->objcg, ZSWP_WB); > + > + count_vm_event(ZSWP_WB); > /* > * 
Writeback started successfully, the page now belongs to the > * swapcache. Drop the entry from zswap - unless invalidate already > -- > 2.34.1
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On Mon, Dec 4, 2023 at 2:14 PM Andrii Nakryiko wrote: > > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey wrote: > > > > The test sets a hardware breakpoint and uses a bpf program to suppress the > > I/O availability signal if the ip matches the expected value. > > > > Signed-off-by: Kyle Huey > > --- > > .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ > > .../selftests/bpf/progs/test_perf_skip.c | 23 + > > 2 files changed, 118 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > new file mode 100644 > > index ..b269a31669b7 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > @@ -0,0 +1,95 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#define _GNU_SOURCE > > +#include > > +#include "test_perf_skip.skel.h" > > +#include > > +#include > > + > > +#define BPF_OBJECT"test_perf_skip.bpf.o" > > leftover? Indeed. Fixed. > > + > > +static void handle_sig(int) > > +{ > > + ASSERT_OK(1, "perf event not skipped"); > > +} > > + > > +static noinline int test_function(void) > > +{ > > please add > > asm volatile (""); > > here to prevent compiler from actually inlining at the call site Ok. 
> > +	return 0;
> > +}
> > +
> > +void serial_test_perf_skip(void)
> > +{
> > +	sighandler_t previous;
> > +	int duration = 0;
> > +	struct test_perf_skip *skel = NULL;
> > +	int map_fd = -1;
> > +	long page_size = sysconf(_SC_PAGE_SIZE);
> > +	uintptr_t *ip = NULL;
> > +	int prog_fd = -1;
> > +	struct perf_event_attr attr = {0};
> > +	int perf_fd = -1;
> > +	struct f_owner_ex owner;
> > +	int err;
> > +
> > +	previous = signal(SIGIO, handle_sig);
> > +
> > +	skel = test_perf_skip__open_and_load();
> > +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> > +		goto cleanup;
> > +
> > +	prog_fd = bpf_program__fd(skel->progs.handler);
> > +	if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd"))
> > +		goto cleanup;
> > +
> > +	map_fd = bpf_map__fd(skel->maps.ip);
> > +	if (!ASSERT_OK(map_fd < 0, "bpf_map__fd"))
> > +		goto cleanup;
> > +
> > +	ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
> > +	if (!ASSERT_OK_PTR(ip, "mmap bpf map"))
> > +		goto cleanup;
> > +
> > +	*ip = (uintptr_t)test_function;
> > +
> > +	attr.type = PERF_TYPE_BREAKPOINT;
> > +	attr.size = sizeof(attr);
> > +	attr.bp_type = HW_BREAKPOINT_X;
> > +	attr.bp_addr = (uintptr_t)test_function;
> > +	attr.bp_len = sizeof(long);
> > +	attr.sample_period = 1;
> > +	attr.sample_type = PERF_SAMPLE_IP;
> > +	attr.pinned = 1;
> > +	attr.exclude_kernel = 1;
> > +	attr.exclude_hv = 1;
> > +	attr.precise_ip = 3;
> > +
> > +	perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
> > +	if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd))
>
> please don't use CHECK() macro, stick to ASSERT_xxx()

Done.

> also, we are going to run all this on different hardware and VMs, see
> how we skip tests if hardware support is not there. See test__skip
> usage in prog_tests/perf_branches.c, as one example

Hmm I suppose it should be conditioned on CONFIG_HAVE_HW_BREAKPOINT.

> > +		goto cleanup;
> > +
> > +	err = fcntl(perf_fd, F_SETFL, O_ASYNC);
>
> I assume this is what will send SIGIO, right? Can you add a small
> comment explicitly saying this?

Done.

> > +	if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)"))
> > +		goto cleanup;
> > +
> > +	owner.type = F_OWNER_TID;
> > +	owner.pid = gettid();
> > +	err = fcntl(perf_fd, F_SETOWN_EX, &owner);
> > +	if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)"))
> > +		goto cleanup;
> > +
> > +	err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> > +	if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)"))
> > +		goto cleanup;
>
> we have a better way to do this, please use
> bpf_program__attach_perf_event() instead

Done.

> > +
> > +	test_function();
> > +
> > +cleanup:
> > +	if (perf_fd >= 0)
> > +		close(perf_fd);
> > +	if (ip)
> > +		munmap(ip, page_size);
> > +	if (skel)
> > +		test_perf_skip__destroy(skel);
> > +
> > +	signal(SIGIO, previous);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > b/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > new file mode 100644
> > index ..ef01a9161afe
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > @@ -0,0 +1,23 @@
> > +// SPDX-License-Identifier: GPL-2.0
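[Editor's illustration, not part of the patch: the F_SETOWN/O_ASYNC sequence discussed above is generic signal-driven I/O, not perf-specific, so it can be demonstrated with a plain socketpair. A sketch in Python (the test itself uses the thread-targeted F_SETOWN_EX with F_OWNER_TID; plain F_SETOWN is the single-threaded approximation used here). Linux-only behavior is assumed.]

```python
import fcntl
import os
import signal
import socket
import time

got_sigio = []
signal.signal(signal.SIGIO, lambda sig, frame: got_sigio.append(sig))

rd, wr = socket.socketpair()

# Direct SIGIO at this process (F_SETOWN_EX + F_OWNER_TID in the test
# targets a specific thread instead).
fcntl.fcntl(rd, fcntl.F_SETOWN, os.getpid())

# O_ASYNC is what makes the kernel raise SIGIO when rd becomes readable --
# the same role fcntl(perf_fd, F_SETFL, O_ASYNC) plays in the test.
flags = fcntl.fcntl(rd, fcntl.F_GETFL)
fcntl.fcntl(rd, fcntl.F_SETFL, flags | os.O_ASYNC)

wr.send(b"x")  # make rd readable; kernel should send SIGIO

deadline = time.monotonic() + 1.0
while not got_sigio and time.monotonic() < deadline:
    time.sleep(0.01)
print("SIGIO delivered" if got_sigio else "no SIGIO")
```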
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Currently, we only have a single global LRU for zswap. This makes it > impossible to perform worload-specific shrinking - an memcg cannot > determine which pages in the pool it owns, and often ends up writing > pages from other memcgs. This issue has been previously observed in > practice and mitigated by simply disabling memcg-initiated shrinking: > > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u > > This patch fully resolves the issue by replacing the global zswap LRU > with memcg- and NUMA-specific LRUs, and modify the reclaim logic: > > a) When a store attempt hits an memcg limit, it now triggers a >synchronous reclaim attempt that, if successful, allows the new >hotter page to be accepted by zswap. > b) If the store attempt instead hits the global zswap limit, it will >trigger an asynchronous reclaim attempt, in which an memcg is >selected for reclaim in a round-robin-like fashion. 
> >
> > Signed-off-by: Domenico Cerasuolo
> > Co-developed-by: Nhat Pham
> > Signed-off-by: Nhat Pham
> > ---
> >  include/linux/memcontrol.h |   5 +
> >  include/linux/zswap.h      |   2 +
> >  mm/memcontrol.c            |   2 +
> >  mm/swap.h                  |   3 +-
> >  mm/swap_state.c            |  24 +++-
> >  mm/zswap.c                 | 269 +
> >  6 files changed, 245 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2bd7d14ace78..a308c8eacf20 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> >  	return NULL;
> >  }
> >
> > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline bool folio_memcg_kmem(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 2a60ce39cfde..e571e393669b 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(int type, pgoff_t offset);
> >  void zswap_swapon(int type);
> >  void zswap_swapoff(int type);
> > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> >
> >  #else
> >
> > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
> >  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> >  static inline void zswap_swapon(int type) {}
> >  static inline void zswap_swapoff(int type) {}
> > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> >
> >  #endif
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 470821d1ba1a..792ca21c5815 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >  	page_counter_set_min(&memcg->memory, 0);
> >  	page_counter_set_low(&memcg->memory, 0);
> >
> > +	zswap_memcg_offline_cleanup(memcg);
> > +
> >  	memcg_offline_kmem(memcg);
> >  	reparent_shrinker_deferred(memcg);
> >  	wb_memcg_offline(memcg);
> > diff
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) > { > struct swap_info_struct *si; > struct folio *folio; > @@ -470,6 +471,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, > gfp_t gfp_mask, > if (err != -EEXIST) > goto fail_put_swap; > > + /* > +* Protect against a recursive call to > __read_swap_cache_async() > +
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
Stanislav Fomichev wrote: > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > wrote: > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > This patch enables Launch Time (Time-Based Scheduling) support to > > > > > > XDP zero > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > --- > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > As requested before, I think we need to see another driver > > > > > implementing > > > > > this. > > > > > > > > > > I propose driver igc and chip i225. > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 second > > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver > > > > > that > > > > > mentions/documents these different hardware limitations. It is > > > > > natural > > > > > that different types of hardware have limitations. This is a close-to > > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > > limitations we can expose this API without too many limitations for > > > > > more > > > > > capable hardware. > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > cannot be programmed. > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > > So user won't know if their request is fail. > > > It is complex to inform user which request is failing. > > > Therefore, IMHO, it is good that we let driver handle the error silently. 
> > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > never make it to the wire, right? Programmable behavior is to either drop or cap to some boundary value, such as the farthest programmable time in the future: the horizon. In fq: /* Check if packet timestamp is too far in the future. */ if (fq_packet_beyond_horizon(skb, q, now)) { if (q->horizon_drop) { q->stat_horizon_drops++; return qdisc_drop(skb, sch, to_free); } q->stat_horizon_caps++; skb->tstamp = now + q->horizon; } fq_skb_cb(skb)->time_to_send = skb->tstamp; Drop is the more obviously correct mode. Programming with a clock source that the driver does not support will then be a persistent failure. Preferably, this driver capability can be queried beforehand (rather than only through reading error counters afterwards). Perhaps it should not be a driver task to convert from possibly multiple clock sources to the device native clock. Right now, we do use per-device timecounters for this, implemented in the driver. As for which clocks are relevant. For PTP, I suppose the device PHC, converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. > > > > That is clearly a situation that the user should be informed about. For > > RT systems this normally means that something is really wrong regarding > > timing / cycle overflow. Such systems have to react on that situation. > > In general, af_xdp is a bit lacking in this 'notify the user that they > somehow messed up' area :-( > For example, pushing a tx descriptor with a wrong addr/len in zc mode > will not give any visible signal back (besides driver potentially > spilling something into dmesg as it was in the mlx case). > We can probably start with having some counters for these events? This is because the AF_XDP completion queue descriptor format is only a u64 address? Could error conditions be reported on tx completion in the metadata, using xsk_tx_metadata_complete?
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > This patch implements a helper function that tries to get a reference to > a memcg's css, as well as checking if it is online. This new function > is almost exactly the same as the existing mem_cgroup_tryget(), except > for the onlineness check. In the !CONFIG_MEMCG case, it always returns > true, analogous to mem_cgroup_tryget(). This is useful, e.g., for the > new zswap writeback scheme, where we need to select the next online > memcg as a candidate for the global limit reclaim. > > Signed-off-by: Nhat Pham Reviewed-by: Yosry Ahmed > --- > include/linux/memcontrol.h | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 7bdcf3020d7a..2bd7d14ace78 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > *memcg) > return !memcg || css_tryget(&memcg->css); > } > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > +{ > + return !memcg || css_tryget_online(&memcg->css); > +} > + > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > { > if (memcg) > css_put(&memcg->css); > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > *memcg) > return true; > } > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > +{ > + return true; > +} > + > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > { > } > -- > 2.34.1
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On Tue, Dec 5, 2023 at 8:54 AM Yonghong Song wrote: > > > On 12/4/23 3:14 PM, Kyle Huey wrote: > > The test sets a hardware breakpoint and uses a bpf program to suppress the > > I/O availability signal if the ip matches the expected value. > > > > Signed-off-by: Kyle Huey > > --- > > .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ > > .../selftests/bpf/progs/test_perf_skip.c | 23 + > > 2 files changed, 118 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > new file mode 100644 > > index ..b269a31669b7 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > @@ -0,0 +1,95 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#define _GNU_SOURCE > > +#include > > +#include "test_perf_skip.skel.h" > > +#include > > +#include > > + > > +#define BPF_OBJECT"test_perf_skip.bpf.o" > > + > > +static void handle_sig(int) > > I hit a warning here: > home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:10:27: > error: omitting the parameter name in a function definition is a C23 > extension [-Werror,-Wc23-extensions] Yeah, Meta's kernel-ci bot sent me off-list email about this one. > > 10 | static void handle_sig(int) >| > > Add a parameter and marked as unused can resolve the issue. 
> > #define __always_unused __attribute__((__unused__)) > > static void handle_sig(int unused __always_unused) > { > ASSERT_OK(1, "perf event not skipped"); > } > > > > +{ > > + ASSERT_OK(1, "perf event not skipped"); > > +} > > + > > +static noinline int test_function(void) > > +{ > > + return 0; > > +} > > + > > +void serial_test_perf_skip(void) > > +{ > > + sighandler_t previous; > > + int duration = 0; > > + struct test_perf_skip *skel = NULL; > > + int map_fd = -1; > > + long page_size = sysconf(_SC_PAGE_SIZE); > > + uintptr_t *ip = NULL; > > + int prog_fd = -1; > > + struct perf_event_attr attr = {0}; > > + int perf_fd = -1; > > + struct f_owner_ex owner; > > + int err; > > + > > + previous = signal(SIGIO, handle_sig); > > + > > + skel = test_perf_skip__open_and_load(); > > + if (!ASSERT_OK_PTR(skel, "skel_load")) > > + goto cleanup; > > + > > + prog_fd = bpf_program__fd(skel->progs.handler); > > + if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd")) > > + goto cleanup; > > + > > + map_fd = bpf_map__fd(skel->maps.ip); > > + if (!ASSERT_OK(map_fd < 0, "bpf_map__fd")) > > + goto cleanup; > > + > > + ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, > > map_fd, 0); > > + if (!ASSERT_OK_PTR(ip, "mmap bpf map")) > > + goto cleanup; > > + > > + *ip = (uintptr_t)test_function; > > + > > + attr.type = PERF_TYPE_BREAKPOINT; > > + attr.size = sizeof(attr); > > + attr.bp_type = HW_BREAKPOINT_X; > > + attr.bp_addr = (uintptr_t)test_function; > > + attr.bp_len = sizeof(long); > > + attr.sample_period = 1; > > + attr.sample_type = PERF_SAMPLE_IP; > > + attr.pinned = 1; > > + attr.exclude_kernel = 1; > > + attr.exclude_hv = 1; > > + attr.precise_ip = 3; > > + > > + perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); > > + if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd)) > > + goto cleanup; > > + > > + err = fcntl(perf_fd, F_SETFL, O_ASYNC); > > + if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)")) > > + goto cleanup; > > + > > + owner.type = F_OWNER_TID; > > + owner.pid = gettid(); > > I hit a compilation failure here: > > /home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:75:14: > error: call to undeclared function 'gettid'; ISO C99 and later do not > support implicit function declarations [-Wimplicit-function-declaration] > 75 | owner.pid = gettid(); >| ^ > > If you looked at some other examples, the common usage is to do > 'syscall(SYS_gettid)'. Not clear why this works for me but sure I'll change that. > > So the following patch should fix the compilation error: > > #include > ... > owner.pid = syscall(SYS_gettid); > ... > > > + err = fcntl(perf_fd, F_SETOWN_EX, &owner); > > + if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)")) > > + goto cleanup; > > + > > + err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); > > + if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)")) > > + goto cleanup; > > + > > + test_function(); > > As Andrii has mentioned in previous comments, we will have an > issue if the RELEASE version of selftest is built >RELEASE=1 make ... > > See >
Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
On Mon, Dec 04, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > > Or am I missing some point here? > > It sounds Ok, we just have to understand what userspace should be > doing and how much of this the kernel should implement. > > It seems to me that the error code should return the gerror and the > req_num should indicate the halted cons. The vmm should relay both > into the virtual registers. I see your concern. I will take a closer look and see if we can add to the initial version of arm_smmu_cache_invalidate_user(). Otherwise, we can add later. Btw, VT-d seems to want the error_code and reports in the VT-d specific invalidate entry structure, as Kevin and Yi had that discussion in the other side of the thread. Thanks Nicolin
Re: [PATCH 1/4] kunit: Add APIs for managing devices
On Tue, Dec 05, 2023 at 03:31:33PM +0800, david...@google.com wrote: > Tests for drivers often require a struct device to pass to other > functions. While it's possible to create these with > root_device_register(), or to use something like a platform device, this > is both a misuse of those APIs, and can be difficult to clean up after, > for example, a failed assertion. > > Add some KUnit-specific functions for registering and unregistering a > struct device: > - kunit_device_register() > - kunit_device_register_with_driver() > - kunit_device_unregister() > > These helpers allocate a device on a 'kunit' bus which will either probe the > driver passed in (kunit_device_register_with_driver), or will create a > stub driver (kunit_device_register) which is cleaned up on test shutdown. > > Devices are automatically unregistered on test shutdown, but can be > manually unregistered earlier with kunit_device_unregister() in order > to, for example, test device release code. At first glance, nice work. But looks like 0-day doesn't like it that much, so I'll wait for the next version to review it properly. One nit I did notice: > +// For internal use only -- registers the kunit_bus. > +int kunit_bus_init(void); Put stuff like this in a local .h file, don't pollute the include/linux/ files for things that you do not want any other part of the kernel to call. > +/** > + * kunit_device_register_with_driver() - Create a struct device for use in > KUnit tests > + * @test: The test context object. > + * @name: The name to give the created device. > + * @drv: The struct device_driver to associate with the device. > + * > + * Creates a struct kunit_device (which is a struct device) with the given > + * name, and driver. The device will be cleaned up on test exit, or when > + * kunit_device_unregister is called. 
See also kunit_device_register, if you > + * wish KUnit to create and manage a driver for you > + */ > +struct device *kunit_device_register_with_driver(struct kunit *test, > + const char *name, > + struct device_driver *drv); Shouldn't "struct device_driver *" be a constant pointer? But really, why is this a "raw" device_driver pointer and not a pointer to the driver type for your bus? Oh heck, let's point out the other issues as I'm already here... > @@ -7,7 +7,8 @@ kunit-objs += test.o \ > assert.o \ > try-catch.o \ > executor.o \ > - attributes.o > + attributes.o \ > + device.o Shouldn't this file be "bus.c" as you are creating a kunit bus? > > ifeq ($(CONFIG_KUNIT_DEBUGFS),y) > kunit-objs +=debugfs.o > diff --git a/lib/kunit/device.c b/lib/kunit/device.c > new file mode 100644 > index ..93ace1a2297d > --- /dev/null > +++ b/lib/kunit/device.c > @@ -0,0 +1,176 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * KUnit basic device implementation "basic bus/driver implementation", not device, right? > + * > + * Implementation of struct kunit_device helpers. > + * > + * Copyright (C) 2023, Google LLC. > + * Author: David Gow > + */ > + > +#include > + > +#include > +#include > +#include > + > + > +/* Wrappers for use with kunit_add_action() */ > +KUNIT_DEFINE_ACTION_WRAPPER(device_unregister_wrapper, device_unregister, > struct device *); > +KUNIT_DEFINE_ACTION_WRAPPER(driver_unregister_wrapper, driver_unregister, > struct device_driver *); > + > +static struct device kunit_bus = { > + .init_name = "kunit" > +}; A static device as a bus? This feels wrong, what is it for? And where does this live? If you _REALLY_ want a single device for the root of your bus (which is a good idea), then make it a dynamic variable (as it is reference counted), NOT a static struct device which should not be done if at all possible. > + > +/* A device owned by a KUnit test. 
*/ > +struct kunit_device { > + struct device dev; > + struct kunit *owner; > + /* Force binding to a specific driver. */ > + struct device_driver *driver; > + /* The driver is managed by KUnit and unique to this device. */ > + bool cleanup_driver; > +}; Wait, why isn't your "kunit" device above a struct kunit_device structure? Why is it ok to be a "raw" struct device (hint, that's almost never a good idea.) > +static inline struct kunit_device *to_kunit_device(struct device *d) > +{ > + return container_of(d, struct kunit_device, dev); container_of_const()? And to use that properly, why not make this a #define? > +} > + > +static int kunit_bus_match(struct device *dev, struct device_driver *driver) > +{ > + struct kunit_device *kunit_dev = to_kunit_device(dev); > + > + if (kunit_dev->driver == driver) > +
Re: [PATCH v8 1/6] list_lru: allows explicit memcg and NUMA node selection
On Mon, Dec 04, 2023 at 04:30:44PM -0800, Chris Li wrote: > On Thu, Nov 30, 2023 at 12:35 PM Johannes Weiner wrote: > > > > On Thu, Nov 30, 2023 at 12:07:41PM -0800, Nhat Pham wrote: > > > On Thu, Nov 30, 2023 at 11:57 AM Matthew Wilcox > > > wrote: > > > > > > > > On Thu, Nov 30, 2023 at 11:40:18AM -0800, Nhat Pham wrote: > > > > > This patch changes list_lru interface so that the caller must > > > > > explicitly > > > > > specify numa node and memcg when adding and removing objects. The old > > > > > list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() > > > > > and > > > > > list_lru_del_obj(), respectively. > > > > > > > > Wouldn't it be better to add list_lru_add_memcg() and > > > > list_lru_del_memcg() and have: > > That is my first thought as well. If we are having two different > flavors of LRU add, one has memcg and one without. The list_lru_add() > vs list_lru_add_memcg() is the common way to do it. > > > > > > > > +bool list_lru_del(struct list_lru *lru, struct list_head *item) > > > > +{ > > > > + int nid = page_to_nid(virt_to_page(item)); > > > > + struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ? > > > > + mem_cgroup_from_slab_obj(item) : NULL; > > > > + > > > > + return list_lru_del_memcg(lru, item, nid, memcg); > > > > +} > > > > > > > > Seems like _most_ callers will want the original versions and only > > > > a few will want the explicit memcg/nid versions. No? > > > > > > > > > > I actually did something along that line in earlier iterations of this > > > patch series (albeit with poorer naming - __list_lru_add() instead of > > > list_lru_add_memcg()). The consensus after some back and forth was > > > that the original list_lru_add() was not a very good design (the > > > better one was this new version that allows for explicit numa/memcg > > > selection). So I agreed to fix it everywhere as a prep patch. 
> > > > > > I don't have strong opinions here to be completely honest, but I do > > > think this new API makes more sense (at the cost of quite a bit of > > > elbow grease to fix every callsite and extra reviewing). > > > > Maybe I can shed some light since I was pushing for doing it this way. > > > > The quiet assumption that 'struct list_head *item' is (embedded in) a > > slab object that is also charged to a cgroup is a bit much, given that > > nothing in the name or documentation of the function points to that. > > We can add it to the document if that is desirable. It would help, but it still violates the "easy to use, hard to misuse" principle. And I think it does the API layering backwards. list_lru_add() is the "default" API function. It makes sense to keep that simple and robust, then add convenience wrappers for additional, specialized functionality like memcg lookups for charged slab objects - even if that's a common usecase. It's better for a new user to be paused by the required memcg argument in the default function and then go and find list_lru_add_obj(), than it is for somebody to quietly pass an invalid object to list_lru_add() and have subtle runtime problems and crashes (which has happened twice now already).
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka wrote: > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > This patch enables Launch Time (Time-Based Scheduling) support to XDP > > > > > zero > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > --- > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > As requested before, I think we need to see another driver implementing > > > > this. > > > > > > > > I propose driver igc and chip i225. > > > > Sure. I will include igc patches in next version. > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 second > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver that > > > > mentions/documents these different hardware limitations. It is natural > > > > that different types of hardware have limitations. This is a close-to > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > limitations we can expose this API without too many limitations for more > > > > capable hardware. > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > cannot be programmed. > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > So user won't know if their request is fail. > > It is complex to inform user which request is failing. > > Therefore, IMHO, it is good that we let driver handle the error silently. > > > > If the programmed value is invalid, the packet will be "dropped" / will > never make it to the wire, right? > > That is clearly a situation that the user should be informed about. 
For > RT systems this normally means that something is really wrong regarding > timing / cycle overflow. Such systems have to react on that situation. In general, af_xdp is a bit lacking in this 'notify the user that they somehow messed up' area :-( For example, pushing a tx descriptor with a wrong addr/len in zc mode will not give any visible signal back (besides driver potentially spilling something into dmesg as it was in the mlx case). We can probably start with having some counters for these events?
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On 12/4/23 3:14 PM, Kyle Huey wrote: The test sets a hardware breakpoint and uses a bpf program to suppress the I/O availability signal if the ip matches the expected value. Signed-off-by: Kyle Huey --- .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ .../selftests/bpf/progs/test_perf_skip.c | 23 + 2 files changed, 118 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c b/tools/testing/selftests/bpf/prog_tests/perf_skip.c new file mode 100644 index ..b269a31669b7 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c @@ -0,0 +1,95 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include "test_perf_skip.skel.h" +#include +#include + +#define BPF_OBJECT"test_perf_skip.bpf.o" + +static void handle_sig(int) I hit a warning here: home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:10:27: error: omitting the parameter name in a function definition is a C23 extension [-Werror,-Wc23-extensions] 10 | static void handle_sig(int) | Add a parameter and marked as unused can resolve the issue. 
#define __always_unused __attribute__((__unused__)) static void handle_sig(int unused __always_unused) { ASSERT_OK(1, "perf event not skipped"); } +{ + ASSERT_OK(1, "perf event not skipped"); +} + +static noinline int test_function(void) +{ + return 0; +} + +void serial_test_perf_skip(void) +{ + sighandler_t previous; + int duration = 0; + struct test_perf_skip *skel = NULL; + int map_fd = -1; + long page_size = sysconf(_SC_PAGE_SIZE); + uintptr_t *ip = NULL; + int prog_fd = -1; + struct perf_event_attr attr = {0}; + int perf_fd = -1; + struct f_owner_ex owner; + int err; + + previous = signal(SIGIO, handle_sig); + + skel = test_perf_skip__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_load")) + goto cleanup; + + prog_fd = bpf_program__fd(skel->progs.handler); + if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd")) + goto cleanup; + + map_fd = bpf_map__fd(skel->maps.ip); + if (!ASSERT_OK(map_fd < 0, "bpf_map__fd")) + goto cleanup; + + ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0); + if (!ASSERT_OK_PTR(ip, "mmap bpf map")) + goto cleanup; + + *ip = (uintptr_t)test_function; + + attr.type = PERF_TYPE_BREAKPOINT; + attr.size = sizeof(attr); + attr.bp_type = HW_BREAKPOINT_X; + attr.bp_addr = (uintptr_t)test_function; + attr.bp_len = sizeof(long); + attr.sample_period = 1; + attr.sample_type = PERF_SAMPLE_IP; + attr.pinned = 1; + attr.exclude_kernel = 1; + attr.exclude_hv = 1; + attr.precise_ip = 3; + + perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); + if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd)) + goto cleanup; + + err = fcntl(perf_fd, F_SETFL, O_ASYNC); + if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)")) + goto cleanup; + + owner.type = F_OWNER_TID; + owner.pid = gettid(); I hit a compilation failure here: /home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:75:14: error: call to undeclared function 'gettid'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 75 | owner.pid = gettid(); | ^ If you looked at some other examples, the common usage is to do 'syscall(SYS_gettid)'. So the following patch should fix the compilation error: #include ... owner.pid = syscall(SYS_gettid); ... + err = fcntl(perf_fd, F_SETOWN_EX, &owner); + if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)")) + goto cleanup; + + err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); + if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)")) + goto cleanup; + + test_function(); As Andrii has mentioned in previous comments, we will have an issue if the RELEASE version of selftest is built: RELEASE=1 make ... See https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.s...@linux.dev + +cleanup: + if (perf_fd >= 0) + close(perf_fd); + if (ip) + munmap(ip, page_size); + if (skel) + test_perf_skip__destroy(skel); + + signal(SIGIO, previous); +} diff --git a/tools/testing/selftests/bpf/progs/test_perf_skip.c b/tools/testing/selftests/bpf/progs/test_perf_skip.c new file mode 100644 index ..ef01a9161afe --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_perf_skip.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0
[PATCH v3 19/21] kselftest/arm64: Add 2023 DPISA hwcap test coverage
Add the hwcaps added for the 2023 DPISA extensions to the hwcaps test program. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/abi/hwcap.c | 217 ++ 1 file changed, 217 insertions(+) diff --git a/tools/testing/selftests/arm64/abi/hwcap.c b/tools/testing/selftests/arm64/abi/hwcap.c index 1189e77c8152..d8909b2b535a 100644 --- a/tools/testing/selftests/arm64/abi/hwcap.c +++ b/tools/testing/selftests/arm64/abi/hwcap.c @@ -58,11 +58,46 @@ static void cssc_sigill(void) asm volatile(".inst 0xdac01c00" : : : "x0"); } +static void f8cvt_sigill(void) +{ + /* FSCALE V0.4H, V0.4H, V0.4H */ + asm volatile(".inst 0x2ec03c00"); +} + +static void f8dp2_sigill(void) +{ + /* FDOT V0.4H, V0.4H, V0.5H */ + asm volatile(".inst 0xe40fc00"); +} + +static void f8dp4_sigill(void) +{ + /* FDOT V0.2S, V0.2S, V0.2S */ + asm volatile(".inst 0xe00fc00"); +} + +static void f8fma_sigill(void) +{ + /* FMLALB V0.8H, V0.16B, V0.16B */ + asm volatile(".inst 0xec0fc00"); +} + +static void faminmax_sigill(void) +{ + /* FAMIN V0.4H, V0.4H, V0.4H */ + asm volatile(".inst 0x2ec01c00"); +} + static void fp_sigill(void) { asm volatile("fmov s0, #1"); } +static void fpmr_sigill(void) +{ + asm volatile("mrs x0, S3_3_C4_C4_2" : : : "x0"); +} + static void ilrcpc_sigill(void) { /* LDAPUR W0, [SP, #8] */ @@ -95,6 +130,12 @@ static void lse128_sigill(void) : "cc", "memory"); } +static void lut_sigill(void) +{ + /* LUTI2 V0.16B, { V0.16B }, V[0] */ + asm volatile(".inst 0x4e801000"); +} + static void mops_sigill(void) { char dst[1], src[1]; @@ -216,6 +257,78 @@ static void smef16f16_sigill(void) asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); } +static void smef8f16_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT ZA.H[W0, 0], Z0.B-Z1.B, Z0.B-Z1.B */ + asm volatile(".inst 0xc1a01020" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smef8f32_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + 
+ /* FDOT ZA.S[W0, 0], { Z0.B-Z1.B }, Z0.B[0] */ + asm volatile(".inst 0xc1500038" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smelutv2_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* LUTI4 { Z0.B-Z3.B }, ZT0, { Z0-Z1 } */ + asm volatile(".inst 0xc08b" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8dp2_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT Z0.H, Z0.B, Z0.B[0] */ + asm volatile(".inst 0x64204400" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8dp4_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT Z0.S, Z0.B, Z0.B[0] */ + asm volatile(".inst 0xc1a41C00" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8fma_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FMLALB V0.8H, V0.16B, V0.16B */ + asm volatile(".inst 0xec0fc00"); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + static void sve_sigill(void) { /* RDVL x0, #0 */ @@ -353,6 +466,53 @@ static const struct hwcap_data { .cpuinfo = "cssc", .sigill_fn = cssc_sigill, }, + { + .name = "F8CVT", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8CVT, + .cpuinfo = "f8cvt", + .sigill_fn = f8cvt_sigill, + }, + { + .name = "F8DP4", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8DP4, + .cpuinfo = "f8dp4", + .sigill_fn = f8dp4_sigill, + }, + { + .name = "F8DP2", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8DP2, + .cpuinfo = "f8dp4", + .sigill_fn = f8dp2_sigill, + }, + { + .name = "F8E5M2", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8E5M2, + .cpuinfo = "f8e5m2", + }, + { + .name = "F8E4M3", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8E4M3, + .cpuinfo = "f8e4m3", + }, + { + .name = "F8FMA", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = 
HWCAP2_F8FMA, + .cpuinfo = "f8fma", + .sigill_fn = f8fma_sigill, + }, +
[PATCH v3 20/21] KVM: arm64: selftests: Document feature registers added in 2023 extensions
The 2023 architecture extensions allocated some previously unused feature registers; add comments mapping the names in get-reg-list as we do for the other allocated registers. Signed-off-by: Mark Brown --- tools/testing/selftests/kvm/aarch64/get-reg-list.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/aarch64/get-reg-list.c b/tools/testing/selftests/kvm/aarch64/get-reg-list.c index 709d7d721760..71ea6ecec7ce 100644 --- a/tools/testing/selftests/kvm/aarch64/get-reg-list.c +++ b/tools/testing/selftests/kvm/aarch64/get-reg-list.c @@ -428,7 +428,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 0, 0, 4, 4), /* ID_AA64ZFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 4, 5), /* ID_AA64SMFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 4, 6), - ARM64_SYS_REG(3, 0, 0, 4, 7), + ARM64_SYS_REG(3, 0, 0, 4, 7), /* ID_AA64FPFR_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 0), /* ID_AA64DFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 1), /* ID_AA64DFR1_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 2), @@ -440,7 +440,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 0, 0, 6, 0), /* ID_AA64ISAR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 1), /* ID_AA64ISAR1_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 2), /* ID_AA64ISAR2_EL1 */ - ARM64_SYS_REG(3, 0, 0, 6, 3), + ARM64_SYS_REG(3, 0, 0, 6, 3), /* ID_AA64ISAR3_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 4), ARM64_SYS_REG(3, 0, 0, 6, 5), ARM64_SYS_REG(3, 0, 0, 6, 6), -- 2.30.2
[PATCH v3 21/21] KVM: arm64: selftests: Teach get-reg-list about FPMR
FEAT_FPMR defines a new register FPMR which is available at all ELs and is discovered via ID_AA64PFR2_EL1.FPMR; add this to the set of registers that get-reg-list knows to check for, with the required identification register dependency. Signed-off-by: Mark Brown --- tools/testing/selftests/kvm/aarch64/get-reg-list.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/kvm/aarch64/get-reg-list.c b/tools/testing/selftests/kvm/aarch64/get-reg-list.c index 71ea6ecec7ce..1e43511d1440 100644 --- a/tools/testing/selftests/kvm/aarch64/get-reg-list.c +++ b/tools/testing/selftests/kvm/aarch64/get-reg-list.c @@ -40,6 +40,12 @@ static struct feature_id_reg feat_id_regs[] = { ARM64_SYS_REG(3, 0, 0, 7, 3), /* ID_AA64MMFR3_EL1 */ 4, 1 + }, + { + ARM64_SYS_REG(3, 3, 4, 4, 2), /* FPMR */ + ARM64_SYS_REG(3, 0, 0, 4, 2), /* ID_AA64PFR2_EL1 */ + 32, + 1 } }; @@ -481,6 +487,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 3, 14, 2, 1), /* CNTP_CTL_EL0 */ ARM64_SYS_REG(3, 3, 14, 2, 2), /* CNTP_CVAL_EL0 */ ARM64_SYS_REG(3, 4, 3, 0, 0), /* DACR32_EL2 */ + ARM64_SYS_REG(3, 3, 4, 4, 2), /* FPMR */ ARM64_SYS_REG(3, 4, 5, 0, 1), /* IFSR32_EL2 */ ARM64_SYS_REG(3, 4, 5, 3, 0), /* FPEXC32_EL2 */ }; -- 2.30.2
[PATCH v3 18/21] kselftest/arm64: Add basic FPMR test
Verify that a FPMR frame is generated on systems that support FPMR and not generated otherwise. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/signal/.gitignore| 1 + .../arm64/signal/testcases/fpmr_siginfo.c | 82 ++ 2 files changed, 83 insertions(+) diff --git a/tools/testing/selftests/arm64/signal/.gitignore b/tools/testing/selftests/arm64/signal/.gitignore index 839e3a252629..1ce5b5eac386 100644 --- a/tools/testing/selftests/arm64/signal/.gitignore +++ b/tools/testing/selftests/arm64/signal/.gitignore @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only mangle_* fake_sigreturn_* +fpmr_* sme_* ssve_* sve_* diff --git a/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c b/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c new file mode 100644 index ..e9d24685e741 --- /dev/null +++ b/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c @@ -0,0 +1,82 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2023 ARM Limited + * + * Verify that the FPMR register context in signal frames is set up as + * expected. 
+ */ + +#include +#include +#include +#include +#include +#include + +#include "test_signals_utils.h" +#include "testcases.h" + +static union { + ucontext_t uc; + char buf[1024 * 128]; +} context; + +#define SYS_FPMR "S3_3_C4_C4_2" + +static uint64_t get_fpmr(void) +{ + uint64_t val; + + asm volatile ( + "mrs%0, " SYS_FPMR "\n" + : "=r"(val) + : + : "cc"); + + return val; +} + +int fpmr_present(struct tdescr *td, siginfo_t *si, ucontext_t *uc) +{ + struct _aarch64_ctx *head = GET_BUF_RESV_HEAD(context); + struct fpmr_context *fpmr_ctx; + size_t offset; + bool in_sigframe; + bool have_fpmr; + __u64 orig_fpmr; + + have_fpmr = getauxval(AT_HWCAP2) & HWCAP2_FPMR; + if (have_fpmr) + orig_fpmr = get_fpmr(); + + if (!get_current_context(td, , sizeof(context))) + return 1; + + fpmr_ctx = (struct fpmr_context *) + get_header(head, FPMR_MAGIC, td->live_sz, ); + + in_sigframe = fpmr_ctx != NULL; + + fprintf(stderr, "FPMR sigframe %s on system %s FPMR\n", + in_sigframe ? "present" : "absent", + have_fpmr ? "with" : "without"); + + td->pass = (in_sigframe == have_fpmr); + + if (have_fpmr && fpmr_ctx) { + if (fpmr_ctx->fpmr != orig_fpmr) { + fprintf(stderr, "FPMR in frame is %llx, was %llx\n", + fpmr_ctx->fpmr, orig_fpmr); + td->pass = false; + } + } + + return 0; +} + +struct tdescr tde = { + .name = "FPMR", + .descr = "Validate that FPMR is present as expected", + .timeout = 3, + .run = fpmr_present, +}; -- 2.30.2
[PATCH v3 17/21] kselftest/arm64: Handle FPMR context in generic signal frame parser
Teach the generic signal frame parsing code about the newly added FPMR frame, avoiding warnings every time one is generated. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/signal/testcases/testcases.c | 8 tools/testing/selftests/arm64/signal/testcases/testcases.h | 1 + 2 files changed, 9 insertions(+) diff --git a/tools/testing/selftests/arm64/signal/testcases/testcases.c b/tools/testing/selftests/arm64/signal/testcases/testcases.c index 9f580b55b388..674b88cc8c39 100644 --- a/tools/testing/selftests/arm64/signal/testcases/testcases.c +++ b/tools/testing/selftests/arm64/signal/testcases/testcases.c @@ -209,6 +209,14 @@ bool validate_reserved(ucontext_t *uc, size_t resv_sz, char **err) zt = (struct zt_context *)head; new_flags |= ZT_CTX; break; + case FPMR_MAGIC: + if (flags & FPMR_CTX) + *err = "Multiple FPMR_MAGIC"; + else if (head->size != +sizeof(struct fpmr_context)) + *err = "Bad size for fpmr_context"; + new_flags |= FPMR_CTX; + break; case EXTRA_MAGIC: if (flags & EXTRA_CTX) *err = "Multiple EXTRA_MAGIC"; diff --git a/tools/testing/selftests/arm64/signal/testcases/testcases.h b/tools/testing/selftests/arm64/signal/testcases/testcases.h index a08ab0d6207a..7727126347e0 100644 --- a/tools/testing/selftests/arm64/signal/testcases/testcases.h +++ b/tools/testing/selftests/arm64/signal/testcases/testcases.h @@ -19,6 +19,7 @@ #define ZA_CTX (1 << 2) #define EXTRA_CTX (1 << 3) #define ZT_CTX (1 << 4) +#define FPMR_CTX (1 << 5) #define KSFT_BAD_MAGIC 0xdeadbeef -- 2.30.2
[PATCH v3 14/21] KVM: arm64: Add newly allocated ID registers to register descriptions
The 2023 architecture extensions have allocated some new ID registers, add them to the KVM system register descriptions so that they are visible to guests. Signed-off-by: Mark Brown --- arch/arm64/kvm/sys_regs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 4735e1b37fb3..b843da5e4bb9 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2139,12 +2139,12 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_AA64PFR0_EL1_AdvSIMD | ID_AA64PFR0_EL1_FP), }, ID_SANITISED(ID_AA64PFR1_EL1), - ID_UNALLOCATED(4,2), + ID_SANITISED(ID_AA64PFR2_EL1), ID_UNALLOCATED(4,3), ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0), ID_HIDDEN(ID_AA64SMFR0_EL1), ID_UNALLOCATED(4,6), - ID_UNALLOCATED(4,7), + ID_SANITISED(ID_AA64FPFR0_EL1), /* CRm=5 */ { SYS_DESC(SYS_ID_AA64DFR0_EL1), @@ -2171,7 +2171,7 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_WRITABLE(ID_AA64ISAR2_EL1, ~(ID_AA64ISAR2_EL1_RES0 | ID_AA64ISAR2_EL1_APA3 | ID_AA64ISAR2_EL1_GPA3)), - ID_UNALLOCATED(6,3), + ID_WRITABLE(ID_AA64ISAR3_EL1, ~ID_AA64ISAR3_EL1_RES0), ID_UNALLOCATED(6,4), ID_UNALLOCATED(6,5), ID_UNALLOCATED(6,6), -- 2.30.2
[PATCH v3 16/21] arm64/hwcap: Define hwcaps for 2023 DPISA features
The 2023 architecture extensions include a large number of floating point features, most of which simply add new instructions. Add hwcaps so that userspace can enumerate these features. Signed-off-by: Mark Brown --- Documentation/arch/arm64/elf_hwcaps.rst | 49 + arch/arm64/include/asm/hwcap.h | 15 ++ arch/arm64/include/uapi/asm/hwcap.h | 15 ++ arch/arm64/kernel/cpufeature.c | 35 +++ arch/arm64/kernel/cpuinfo.c | 15 ++ 5 files changed, 129 insertions(+) diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst index ced7b335e2e0..448c1664879b 100644 --- a/Documentation/arch/arm64/elf_hwcaps.rst +++ b/Documentation/arch/arm64/elf_hwcaps.rst @@ -317,6 +317,55 @@ HWCAP2_LRCPC3 HWCAP2_LSE128 Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011. +HWCAP2_FPMR +Functionality implied by ID_AA64PFR2_EL1.FMR == 0b0001. + +HWCAP2_LUT +Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001. + +HWCAP2_FAMINMAX +Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001. + +HWCAP2_F8CVT +Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1. + +HWCAP2_F8FMA +Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1. + +HWCAP2_F8DP4 +Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1. + +HWCAP2_F8DP2 +Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1. + +HWCAP2_F8E4M3 +Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1. + +HWCAP2_F8E5M2 +Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1. + +HWCAP2_SME_LUTV2 +Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1. + +HWCAP2_SME_F8F16 +Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1. + +HWCAP2_SME_F8F32 +Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1. + +HWCAP2_SME_SF8FMA +Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1. + +HWCAP2_SME_SF8DP4 +Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1. + +HWCAP2_SME_SF8DP2 +Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1. 
+ + 4. Unused AT_HWCAP bits --- diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h index cd71e09ea14d..4edd3b61df11 100644 --- a/arch/arm64/include/asm/hwcap.h +++ b/arch/arm64/include/asm/hwcap.h @@ -142,6 +142,21 @@ #define KERNEL_HWCAP_SVE_B16B16__khwcap2_feature(SVE_B16B16) #define KERNEL_HWCAP_LRCPC3__khwcap2_feature(LRCPC3) #define KERNEL_HWCAP_LSE128__khwcap2_feature(LSE128) +#define KERNEL_HWCAP_FPMR __khwcap2_feature(FPMR) +#define KERNEL_HWCAP_LUT __khwcap2_feature(LUT) +#define KERNEL_HWCAP_FAMINMAX __khwcap2_feature(FAMINMAX) +#define KERNEL_HWCAP_F8CVT __khwcap2_feature(F8CVT) +#define KERNEL_HWCAP_F8FMA __khwcap2_feature(F8FMA) +#define KERNEL_HWCAP_F8DP4 __khwcap2_feature(F8DP4) +#define KERNEL_HWCAP_F8DP2 __khwcap2_feature(F8DP2) +#define KERNEL_HWCAP_F8E4M3__khwcap2_feature(F8E4M3) +#define KERNEL_HWCAP_F8E5M2__khwcap2_feature(F8E5M2) +#define KERNEL_HWCAP_SME_LUTV2 __khwcap2_feature(SME_LUTV2) +#define KERNEL_HWCAP_SME_F8F16 __khwcap2_feature(SME_F8F16) +#define KERNEL_HWCAP_SME_F8F32 __khwcap2_feature(SME_F8F32) +#define KERNEL_HWCAP_SME_SF8FMA__khwcap2_feature(SME_SF8FMA) +#define KERNEL_HWCAP_SME_SF8DP4__khwcap2_feature(SME_SF8DP4) +#define KERNEL_HWCAP_SME_SF8DP2__khwcap2_feature(SME_SF8DP2) /* * This yields a mask that user programs can use to figure out what diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h index 5023599fa278..285610e626f5 100644 --- a/arch/arm64/include/uapi/asm/hwcap.h +++ b/arch/arm64/include/uapi/asm/hwcap.h @@ -107,5 +107,20 @@ #define HWCAP2_SVE_B16B16 (1UL << 45) #define HWCAP2_LRCPC3 (1UL << 46) #define HWCAP2_LSE128 (1UL << 47) +#define HWCAP2_FPMR(1UL << 48) +#define HWCAP2_LUT (1UL << 49) +#define HWCAP2_FAMINMAX(1UL << 50) +#define HWCAP2_F8CVT (1UL << 51) +#define HWCAP2_F8FMA (1UL << 52) +#define HWCAP2_F8DP4 (1UL << 53) +#define HWCAP2_F8DP2 (1UL << 54) +#define 
HWCAP2_F8E4M3 (1UL << 55) +#define HWCAP2_F8E5M2 (1UL << 56) +#define HWCAP2_SME_LUTV2 (1UL << 57) +#define HWCAP2_SME_F8F16 (1UL << 58) +#define HWCAP2_SME_F8F32 (1UL << 59) +#define HWCAP2_SME_SF8FMA (1UL << 60) +#define HWCAP2_SME_SF8DP4 (1UL << 61) +#define HWCAP2_SME_SF8DP2 (1UL << 62) #endif /* _UAPI__ASM_HWCAP_H */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index ea0b680792de..33e301b6e31e 100644 ---
[PATCH v3 15/21] KVM: arm64: Support FEAT_FPMR for guests
FEAT_FPMR introduces a new system register FPMR which allows configuration of floating point behaviour, currently for FP8 specific features. Allow use of this in guests, disabling the trap while guests are running and saving and restoring the value along with the rest of the floating point state. Since FPMR is stored immediately after the main floating point state we share it with the hypervisor by adjusting the size of the shared region. Access to FPMR is covered by both a register specific trap HCRX_EL2.EnFPM and the overall floating point access trap so we just unconditionally enable the FPMR specific trap and rely on the floating point access trap to detect guest floating point usage. Signed-off-by: Mark Brown --- arch/arm64/include/asm/kvm_arm.h| 2 +- arch/arm64/include/asm/kvm_host.h | 4 +++- arch/arm64/kvm/emulate-nested.c | 9 + arch/arm64/kvm/fpsimd.c | 20 +--- arch/arm64/kvm/hyp/include/hyp/switch.h | 7 ++- arch/arm64/kvm/sys_regs.c | 11 +++ 6 files changed, 47 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 9f9239d86900..95f3b44e7c3a 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -103,7 +103,7 @@ #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H) #define HCRX_GUEST_FLAGS \ - (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | \ + (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM | \ (cpus_have_final_cap(ARM64_HAS_MOPS) ? 
(HCRX_EL2_MSCEn | HCRX_EL2_MCE2) : 0)) #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index f8d98985a39c..9885adff06fa 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -391,6 +391,8 @@ enum vcpu_sysreg { CNTP_CVAL_EL0, CNTP_CTL_EL0, + FPMR, + /* Memory Tagging Extension registers */ RGSR_EL1, /* Random Allocation Tag Seed Register */ GCR_EL1,/* Tag Control Register */ @@ -517,7 +519,6 @@ struct kvm_vcpu_arch { enum fp_type fp_type; unsigned int sve_max_vl; u64 svcr; - u64 fpmr; /* Stage 2 paging state used by the hardware on next switch */ struct kvm_s2_mmu *hw_mmu; @@ -576,6 +577,7 @@ struct kvm_vcpu_arch { struct kvm_guest_debug_arch external_debug_state; struct user_fpsimd_state *host_fpsimd_state;/* hyp VA */ + u64 *host_fpmr; /* hyp VA */ struct task_struct *parent_task; struct { diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c index 06185216a297..802e5cde696f 100644 --- a/arch/arm64/kvm/emulate-nested.c +++ b/arch/arm64/kvm/emulate-nested.c @@ -67,6 +67,8 @@ enum cgt_group_id { CGT_HCR_TTLBIS, CGT_HCR_TTLBOS, + CGT_HCRX_EnFPM, + CGT_MDCR_TPMCR, CGT_MDCR_TPM, CGT_MDCR_TDE, @@ -279,6 +281,12 @@ static const struct trap_bits coarse_trap_bits[] = { .mask = HCR_TTLBOS, .behaviour = BEHAVE_FORWARD_ANY, }, + [CGT_HCRX_EnFPM] = { + .index = HCRX_EL2, + .value = HCRX_EL2_EnFPM, + .mask = HCRX_EL2_EnFPM, + .behaviour = BEHAVE_FORWARD_ANY, + }, [CGT_MDCR_TPMCR] = { .index = MDCR_EL2, .value = MDCR_EL2_TPMCR, @@ -478,6 +486,7 @@ static const struct encoding_to_trap_config encoding_to_cgt[] __initconst = { SR_TRAP(SYS_AIDR_EL1, CGT_HCR_TID1), SR_TRAP(SYS_SMIDR_EL1, CGT_HCR_TID1), SR_TRAP(SYS_CTR_EL0,CGT_HCR_TID2), + SR_TRAP(SYS_FPMR, CGT_HCRX_EnFPM), SR_TRAP(SYS_CCSIDR_EL1, CGT_HCR_TID2_TID4), SR_TRAP(SYS_CCSIDR2_EL1,CGT_HCR_TID2_TID4), SR_TRAP(SYS_CLIDR_EL1, CGT_HCR_TID2_TID4), diff --git 
a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index e3e611e30e91..dee078625d0d 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -14,6 +14,16 @@ #include #include +static void *fpsimd_share_end(struct user_fpsimd_state *fpsimd) +{ + void *share_end = fpsimd + 1; + + if (cpus_have_final_cap(ARM64_HAS_FPMR)) + share_end += sizeof(u64); + + return share_end; +} + void kvm_vcpu_unshare_task_fp(struct kvm_vcpu *vcpu) { struct task_struct *p = vcpu->arch.parent_task; @@ -23,7 +33,7 @@ void kvm_vcpu_unshare_task_fp(struct kvm_vcpu *vcpu) return; fpsimd = >thread.uw.fpsimd_state; - kvm_unshare_hyp(fpsimd, fpsimd + 1); + kvm_unshare_hyp(fpsimd, fpsimd_share_end(fpsimd)); put_task_struct(p); } @@ -45,11 +55,15 @@ int
[PATCH v3 13/21] arm64/ptrace: Expose FPMR via ptrace
Add a new regset to expose FPMR via ptrace. It is not added to the FPSIMD registers since that structure is exposed elsewhere without any allowance for extension, so we don't add it there. Signed-off-by: Mark Brown --- arch/arm64/kernel/ptrace.c | 42 ++ include/uapi/linux/elf.h | 1 + 2 files changed, 43 insertions(+) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 20d7ef82de90..cfb8a4d213be 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -697,6 +697,39 @@ static int tls_set(struct task_struct *target, const struct user_regset *regset, return ret; } +static int fpmr_get(struct task_struct *target, const struct user_regset *regset, + struct membuf to) +{ + if (!system_supports_fpmr()) + return -EINVAL; + + if (target == current) + fpsimd_preserve_current_state(); + + return membuf_store(&to, target->thread.fpmr); +} + +static int fpmr_set(struct task_struct *target, const struct user_regset *regset, + unsigned int pos, unsigned int count, + const void *kbuf, const void __user *ubuf) +{ + int ret; + unsigned long fpmr; + + if (!system_supports_fpmr()) + return -EINVAL; + + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &fpmr, 0, count); + if (ret) + return ret; + + target->thread.fpmr = fpmr; + + fpsimd_flush_task_state(target); + + return 0; +} + static int system_call_get(struct task_struct *target, const struct user_regset *regset, struct membuf to) @@ -1417,6 +1450,7 @@ enum aarch64_regset { REGSET_HW_BREAK, REGSET_HW_WATCH, #endif + REGSET_FPMR, REGSET_SYSTEM_CALL, #ifdef CONFIG_ARM64_SVE REGSET_SVE, @@ -1495,6 +1529,14 @@ static const struct user_regset aarch64_regsets[] = { .regset_get = system_call_get, .set = system_call_set, }, + [REGSET_FPMR] = { + .core_note_type = NT_ARM_FPMR, + .n = 1, + .size = sizeof(u64), + .align = sizeof(u64), + .regset_get = fpmr_get, + .set = fpmr_set, + }, #ifdef CONFIG_ARM64_SVE [REGSET_SVE] = { /* Scalable Vector Extension */ .core_note_type = NT_ARM_SVE, diff --git a/include/uapi/linux/elf.h 
b/include/uapi/linux/elf.h index 9417309b7230..b54b313bcf07 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -440,6 +440,7 @@ typedef struct elf64_shdr { #define NT_ARM_SSVE 0x40b /* ARM Streaming SVE registers */ #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ +#define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ -- 2.30.2
[PATCH v3 12/21] arm64/signal: Add FPMR signal handling
Expose FPMR in the signal context on systems where it is supported. The kernel validates the exact size of the FPSIMD registers so we can't readily add it to fpsimd_context without disruption. Signed-off-by: Mark Brown --- arch/arm64/include/uapi/asm/sigcontext.h | 8 + arch/arm64/kernel/signal.c | 59 2 files changed, 67 insertions(+) diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h index f23c1dc3f002..8a45b7a411e0 100644 --- a/arch/arm64/include/uapi/asm/sigcontext.h +++ b/arch/arm64/include/uapi/asm/sigcontext.h @@ -152,6 +152,14 @@ struct tpidr2_context { __u64 tpidr2; }; +/* FPMR context */ +#define FPMR_MAGIC 0x46504d52 + +struct fpmr_context { + struct _aarch64_ctx head; + __u64 fpmr; +}; + #define ZA_MAGIC 0x54366345 struct za_context { diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 0e8beb3349ea..e8c808afcc8a 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -60,6 +60,7 @@ struct rt_sigframe_user_layout { unsigned long tpidr2_offset; unsigned long za_offset; unsigned long zt_offset; + unsigned long fpmr_offset; unsigned long extra_offset; unsigned long end_offset; }; @@ -182,6 +183,8 @@ struct user_ctxs { u32 za_size; struct zt_context __user *zt; u32 zt_size; + struct fpmr_context __user *fpmr; + u32 fpmr_size; }; static int preserve_fpsimd_context(struct fpsimd_context __user *ctx) @@ -227,6 +230,33 @@ static int restore_fpsimd_context(struct user_ctxs *user) return err ? 
-EFAULT : 0; } +static int preserve_fpmr_context(struct fpmr_context __user *ctx) +{ + int err = 0; + + current->thread.fpmr = read_sysreg_s(SYS_FPMR); + + __put_user_error(FPMR_MAGIC, >head.magic, err); + __put_user_error(sizeof(*ctx), >head.size, err); + __put_user_error(current->thread.fpmr, >fpmr, err); + + return err; +} + +static int restore_fpmr_context(struct user_ctxs *user) +{ + u64 fpmr; + int err = 0; + + if (user->fpmr_size != sizeof(*user->fpmr)) + return -EINVAL; + + __get_user_error(fpmr, >fpmr->fpmr, err); + if (!err) + write_sysreg_s(fpmr, SYS_FPMR); + + return err; +} #ifdef CONFIG_ARM64_SVE @@ -590,6 +620,7 @@ static int parse_user_sigframe(struct user_ctxs *user, user->tpidr2 = NULL; user->za = NULL; user->zt = NULL; + user->fpmr = NULL; if (!IS_ALIGNED((unsigned long)base, 16)) goto invalid; @@ -684,6 +715,17 @@ static int parse_user_sigframe(struct user_ctxs *user, user->zt_size = size; break; + case FPMR_MAGIC: + if (!system_supports_fpmr()) + goto invalid; + + if (user->fpmr) + goto invalid; + + user->fpmr = (struct fpmr_context __user *)head; + user->fpmr_size = size; + break; + case EXTRA_MAGIC: if (have_extra_context) goto invalid; @@ -806,6 +848,9 @@ static int restore_sigframe(struct pt_regs *regs, if (err == 0 && system_supports_tpidr2() && user.tpidr2) err = restore_tpidr2_context(); + if (err == 0 && system_supports_fpmr() && user.fpmr) + err = restore_fpmr_context(); + if (err == 0 && system_supports_sme() && user.za) err = restore_za_context(); @@ -928,6 +973,13 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user, } } + if (system_supports_fpmr()) { + err = sigframe_alloc(user, >fpmr_offset, +sizeof(struct fpmr_context)); + if (err) + return err; + } + return sigframe_alloc_end(user); } @@ -983,6 +1035,13 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user, err |= preserve_tpidr2_context(tpidr2_ctx); } + /* FPMR if supported */ + if (system_supports_fpmr() && err == 0) { + struct fpmr_context 
__user *fpmr_ctx = + apply_user_offset(user, user->fpmr_offset); + err |= preserve_fpmr_context(fpmr_ctx); + } + /* ZA state if present */ if (system_supports_sme() && err == 0 && user->za_offset) { struct za_context __user *za_ctx = -- 2.30.2
[PATCH v3 11/21] arm64/fpsimd: Support FEAT_FPMR
FEAT_FPMR defines a new EL0 accessible register FPMR used to configure the FP8 related features added to the architecture at the same time. Detect support for this register and context switch it for EL0 when present. Due to the sharing of responsibility for saving floating point state between the host kernel and KVM, FP8 support is not yet implemented in KVM and a stub similar to that used for SVCR is provided for FPMR in order to avoid bisection issues. To make it easier to share host state with the hypervisor we store FPMR immediately after the base floating point state; existing usage means that it is not practical to extend that directly. Signed-off-by: Mark Brown --- arch/arm64/include/asm/cpufeature.h | 5 + arch/arm64/include/asm/fpsimd.h | 2 ++ arch/arm64/include/asm/kvm_host.h | 1 + arch/arm64/include/asm/processor.h | 2 ++ arch/arm64/kernel/cpufeature.c | 9 + arch/arm64/kernel/fpsimd.c | 13 + arch/arm64/kvm/fpsimd.c | 1 + arch/arm64/tools/cpucaps| 1 + 8 files changed, 34 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index f6d416fe49b0..8e83cb1e6c7c 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -767,6 +767,11 @@ static __always_inline bool system_supports_tpidr2(void) return system_supports_sme(); } +static __always_inline bool system_supports_fpmr(void) +{ + return alternative_has_cap_unlikely(ARM64_HAS_FPMR); +} + static __always_inline bool system_supports_cnp(void) { return alternative_has_cap_unlikely(ARM64_HAS_CNP); diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 50e5f25d3024..74afca3bd312 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -89,6 +89,7 @@ struct cpu_fp_state { void *sve_state; void *sme_state; u64 *svcr; + u64 *fpmr; unsigned int sve_vl; unsigned int sme_vl; enum fp_type *fp_type; @@ -154,6 +155,7 @@ extern void cpu_enable_sve(const struct arm64_cpu_capabilities 
*__unused); extern void cpu_enable_sme(const struct arm64_cpu_capabilities *__unused); extern void cpu_enable_sme2(const struct arm64_cpu_capabilities *__unused); extern void cpu_enable_fa64(const struct arm64_cpu_capabilities *__unused); +extern void cpu_enable_fpmr(const struct arm64_cpu_capabilities *__unused); extern u64 read_smcr_features(void); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 824f29f04916..f8d98985a39c 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -517,6 +517,7 @@ struct kvm_vcpu_arch { enum fp_type fp_type; unsigned int sve_max_vl; u64 svcr; + u64 fpmr; /* Stage 2 paging state used by the hardware on next switch */ struct kvm_s2_mmu *hw_mmu; diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index e5bc54522e71..dd3a5b29f76e 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -158,6 +158,8 @@ struct thread_struct { struct user_fpsimd_state fpsimd_state; } uw; + u64 fpmr; /* Adjacent to fpsimd_state for KVM */ + enum fp_typefp_type;/* registers FPSIMD or SVE? 
*/ unsigned intfpsimd_cpu; void*sve_state; /* SVE registers, if any */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index c8d38e5ce997..ea0b680792de 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -272,6 +272,7 @@ static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = { }; static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = { + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_FPMR_SHIFT, 4, 0), ARM64_FTR_END, }; @@ -2759,6 +2760,14 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .matches = has_cpuid_feature, ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, EVT, IMP) }, + { + .desc = "FPMR", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_HAS_FPMR, + .matches = has_cpuid_feature, + .cpu_enable = cpu_enable_fpmr, + ARM64_CPUID_FIELDS(ID_AA64PFR2_EL1, FPMR, IMP) + }, {}, }; diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 1559c706d32d..2a6abd6423f7 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -385,6 +385,9 @@ static void task_fpsimd_load(void) WARN_ON(!system_supports_fpsimd()); WARN_ON(!have_cpu_fpsimd_context()); + if (system_supports_fpmr()) +
[PATCH v3 09/21] arm64/cpufeature: Hook new identification registers up to cpufeature
The 2023 architecture extensions have defined several new ID registers, hook them up to the cpufeature code so we can add feature checks and hwcaps based on their contents. Signed-off-by: Mark Brown --- arch/arm64/include/asm/cpu.h | 3 +++ arch/arm64/kernel/cpufeature.c | 28 arch/arm64/kernel/cpuinfo.c| 3 +++ 3 files changed, 34 insertions(+) diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h index f3034099fd95..b99138bc3d4a 100644 --- a/arch/arm64/include/asm/cpu.h +++ b/arch/arm64/include/asm/cpu.h @@ -53,14 +53,17 @@ struct cpuinfo_arm64 { u64 reg_id_aa64isar0; u64 reg_id_aa64isar1; u64 reg_id_aa64isar2; + u64 reg_id_aa64isar3; u64 reg_id_aa64mmfr0; u64 reg_id_aa64mmfr1; u64 reg_id_aa64mmfr2; u64 reg_id_aa64mmfr3; u64 reg_id_aa64pfr0; u64 reg_id_aa64pfr1; + u64 reg_id_aa64pfr2; u64 reg_id_aa64zfr0; u64 reg_id_aa64smfr0; + u64 reg_id_aa64fpfr0; struct cpuinfo_32bitaarch32; }; diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 646591c67e7a..c8d38e5ce997 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -234,6 +234,10 @@ static const struct arm64_ftr_bits ftr_id_aa64isar2[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_id_aa64isar3[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64pfr0[] = { ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV3_SHIFT, 4, 0), ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV2_SHIFT, 4, 0), @@ -267,6 +271,10 @@ static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64zfr0[] = { ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_SVE), FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ZFR0_EL1_F64MM_SHIFT, 4, 0), @@ -319,6 +327,10 @@ static const struct arm64_ftr_bits ftr_id_aa64smfr0[] = { ARM64_FTR_END, }; +static const 
struct arm64_ftr_bits ftr_id_aa64fpfr0[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64mmfr0[] = { ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR0_EL1_ECV_SHIFT, 4, 0), ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR0_EL1_FGT_SHIFT, 4, 0), @@ -702,10 +714,12 @@ static const struct __ftr_reg_entry { _aa64pfr0_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64PFR1_EL1, ftr_id_aa64pfr1, _aa64pfr1_override), + ARM64_FTR_REG(SYS_ID_AA64PFR2_EL1, ftr_id_aa64pfr2), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64ZFR0_EL1, ftr_id_aa64zfr0, _aa64zfr0_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64SMFR0_EL1, ftr_id_aa64smfr0, _aa64smfr0_override), + ARM64_FTR_REG(SYS_ID_AA64FPFR0_EL1, ftr_id_aa64fpfr0), /* Op1 = 0, CRn = 0, CRm = 5 */ ARM64_FTR_REG(SYS_ID_AA64DFR0_EL1, ftr_id_aa64dfr0), @@ -717,6 +731,7 @@ static const struct __ftr_reg_entry { _aa64isar1_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64ISAR2_EL1, ftr_id_aa64isar2, _aa64isar2_override), + ARM64_FTR_REG(SYS_ID_AA64ISAR3_EL1, ftr_id_aa64isar3), /* Op1 = 0, CRn = 0, CRm = 7 */ ARM64_FTR_REG(SYS_ID_AA64MMFR0_EL1, ftr_id_aa64mmfr0), @@ -1043,14 +1058,17 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info) init_cpu_ftr_reg(SYS_ID_AA64ISAR0_EL1, info->reg_id_aa64isar0); init_cpu_ftr_reg(SYS_ID_AA64ISAR1_EL1, info->reg_id_aa64isar1); init_cpu_ftr_reg(SYS_ID_AA64ISAR2_EL1, info->reg_id_aa64isar2); + init_cpu_ftr_reg(SYS_ID_AA64ISAR3_EL1, info->reg_id_aa64isar3); init_cpu_ftr_reg(SYS_ID_AA64MMFR0_EL1, info->reg_id_aa64mmfr0); init_cpu_ftr_reg(SYS_ID_AA64MMFR1_EL1, info->reg_id_aa64mmfr1); init_cpu_ftr_reg(SYS_ID_AA64MMFR2_EL1, info->reg_id_aa64mmfr2); init_cpu_ftr_reg(SYS_ID_AA64MMFR3_EL1, info->reg_id_aa64mmfr3); init_cpu_ftr_reg(SYS_ID_AA64PFR0_EL1, info->reg_id_aa64pfr0); init_cpu_ftr_reg(SYS_ID_AA64PFR1_EL1, info->reg_id_aa64pfr1); + init_cpu_ftr_reg(SYS_ID_AA64PFR2_EL1, info->reg_id_aa64pfr2); init_cpu_ftr_reg(SYS_ID_AA64ZFR0_EL1, info->reg_id_aa64zfr0); 
init_cpu_ftr_reg(SYS_ID_AA64SMFR0_EL1, info->reg_id_aa64smfr0); + init_cpu_ftr_reg(SYS_ID_AA64FPFR0_EL1, info->reg_id_aa64fpfr0); if
[PATCH v3 10/21] arm64/fpsimd: Enable host kernel access to FPMR
FEAT_FPMR provides a new generally accessible architectural register FPMR. This is only accessible to EL0 and EL1 when HCRX_EL2.EnFPM is set to 1, do this when the host is running. The guest part will be done along with context switching the new register and exposing it via guest management. Signed-off-by: Mark Brown --- arch/arm64/include/asm/kvm_arm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index b85f46a73e21..9f9239d86900 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -105,7 +105,7 @@ #define HCRX_GUEST_FLAGS \ (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | \ (cpus_have_final_cap(ARM64_HAS_MOPS) ? (HCRX_EL2_MSCEn | HCRX_EL2_MCE2) : 0)) -#define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En) +#define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM) /* TCR_EL2 Registers bits */ #define TCR_EL2_RES1 ((1U << 31) | (1 << 23)) -- 2.30.2
[PATCH v3 08/21] arm64/sysreg: Add definition for FPMR
DDI0601 2023-09 defines a new system register FPMR (Floating Point Mode Register) which configures the new FP8 features. Add a definition of this register. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 0b1a33a77074..67173576115a 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2138,6 +2138,29 @@ Field 1 ZA Field 0 SM EndSysreg +Sysreg FPMR 3 3 4 4 2 +Res0 63:38 +Field 37:32 LSCALE2 +Field 31:24 NSCALE +Res0 23 +Field 22:16 LSCALE +Field 15 OSC +Field 14 OSM +Res0 13:9 +UnsignedEnum 8:6 F8D + 0b000 E5M2 + 0b001 E4M3 +EndEnum +UnsignedEnum 5:3 F8S2 + 0b000 E5M2 + 0b001 E4M3 +EndEnum +UnsignedEnum 2:0 F8S1 + 0b000 E5M2 + 0b001 E4M3 +EndEnum +EndSysreg + SysregFields HFGxTR_EL2 Field 63 nAMAIR2_EL1 Field 62 nMAIR2_EL1 -- 2.30.2
[PATCH v3 07/21] arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
DDI0601 2023-09 defines new fields in HCRX_EL2 controlling access to new system registers, update our definition of HCRX_EL2 to reflect this. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index eea69bb48fa7..0b1a33a77074 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2412,7 +2412,9 @@ Fields ZCR_ELx EndSysreg Sysreg HCRX_EL2 3 4 1 2 2 -Res0 63:23 +Res0 63:25 +Field 24 PACMEn +Field 23 EnFPM Field 22 GCSEn Field 21 EnIDCP128 Field 20 EnSDERR -- 2.30.2
[PATCH v3 06/21] arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
DDI0601 2023-09 defines some new fields in SCTLR_EL1 controlling new MTE and floating point features. Update our sysreg definition to reflect these. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index aee9ab4087c1..eea69bb48fa7 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1791,7 +1791,8 @@ Field 63 TIDCP Field 62 SPINTMASK Field 61 NMI Field 60 EnTP2 -Res0 59:58 +Field 59 TCSO +Field 58 TCSO0 Field 57 EPAN Field 56 EnALS Field 55 EnAS0 @@ -1820,7 +1821,7 @@ EndEnum Field 37 ITFSB Field 36 BT1 Field 35 BT0 -Res0 34 +Field 34 EnFPM Field 33 MSCEn Field 32 CMOW Field 31 EnIA -- 2.30.2
[PATCH v3 05/21] arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
The 2023-09 release of DDI0601 defines a number of new feature enumeration fields in ID_AA64SMFR0_EL1. Add these fields. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 30 +++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index c9bb49d0ea03..aee9ab4087c1 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1079,7 +1079,11 @@ UnsignedEnum 63 FA64 0b0 NI 0b1 IMP EndEnum -Res0 62:60 +Res0 62:61 +UnsignedEnum 60 LUTv2 + 0b0 NI + 0b1 IMP +EndEnum UnsignedEnum 59:56 SMEver 0b0000 SME 0b0001 SME2 @@ -1107,7 +1111,14 @@ UnsignedEnum 42 F16F16 0b0 NI 0b1 IMP EndEnum -Res0 41:40 +UnsignedEnum 41 F8F16 + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 40 F8F32 + 0b0 NI + 0b1 IMP +EndEnum UnsignedEnum 39:36 I8I32 0b0000 NI 0b1111 IMP @@ -1128,7 +1139,20 @@ UnsignedEnum 32 F32F32 0b0 NI 0b1 IMP EndEnum -Res0 31:0 +Res0 31 +UnsignedEnum 30 SF8FMA + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 29 SF8DP4 + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 28 SF8DP2 + 0b0 NI + 0b1 IMP +EndEnum +Res0 27:0 EndSysreg Sysreg ID_AA64FPFR0_EL1 3 0 0 4 7 -- 2.30.2
[PATCH v3 03/21] arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
DDI0601 2023-09 adds a new system register ID_AA64ISAR3_EL1 enumerating new floating point and TLB invalidation features. Add a definition for it. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 27d79644e1a0..3d623a04934c 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1433,6 +1433,23 @@ UnsignedEnum 3:0 WFxT EndEnum EndSysreg +Sysreg ID_AA64ISAR3_EL1 3 0 0 6 3 +Res0 63:12 +UnsignedEnum 11:8 TLBIW + 0b0000 NI + 0b0001 IMP +EndEnum +UnsignedEnum 7:4 FAMINMAX + 0b0000 NI + 0b0001 IMP +EndEnum +UnsignedEnum 3:0 CPA + 0b0000 NI + 0b0001 IMP + 0b0010 CPA2 +EndEnum +EndSysreg + Sysreg ID_AA64MMFR0_EL1 3 0 0 7 0 UnsignedEnum 63:60 ECV 0b0000 NI -- 2.30.2
[PATCH v3 04/21] arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
DDI0601 2023-09 defines a new feature register ID_AA64FPFR0_EL1 which
enumerates a number of FP8 related features. Add a definition for it.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 3d623a04934c..c9bb49d0ea03 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1131,6 +1131,35 @@ EndEnum
 Res0	31:0
 EndSysreg
 
+Sysreg	ID_AA64FPFR0_EL1	3	0	0	4	7
+Res0	63:32
+UnsignedEnum	31	F8CVT
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	30	F8FMA
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	29	F8DP4
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	28	F8DP2
+	0b0	NI
+	0b1	IMP
+EndEnum
+Res0	27:2
+UnsignedEnum	1	F8E4M3
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	0	F8E5M2
+	0b0	NI
+	0b1	IMP
+EndEnum
+EndSysreg
+
 Sysreg	ID_AA64DFR0_EL1	3	0	0	5	0
 Enum	63:60	HPMN0
 	0b0000	UNPREDICTABLE

-- 
2.30.2
[PATCH v3 01/21] arm64/sysreg: Add definition for ID_AA64PFR2_EL1
DDI0601 2023-09 defines a new system register ID_AA64PFR2_EL1 which
enumerates FPMR and some new MTE features. Add a definition of this
register.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 96cbeeab4eec..f22ade8f1fa7 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1002,6 +1002,27 @@ UnsignedEnum	3:0	BT
 EndEnum
 EndSysreg
 
+Sysreg	ID_AA64PFR2_EL1	3	0	0	4	2
+Res0	63:36
+UnsignedEnum	35:32	FPMR
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+Res0	31:12
+UnsignedEnum	11:8	MTEFAR
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	7:4	MTESTOREONLY
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	3:0	MTEPERM
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+EndSysreg
+
 Sysreg	ID_AA64ZFR0_EL1	3	0	0	4	4
 Res0	63:60
 UnsignedEnum	59:56	F64MM

-- 
2.30.2
[PATCH v3 02/21] arm64/sysreg: Update ID_AA64ISAR2_EL1 definition for DDI0601 2023-09
DDI0601 2023-09 defines some new fields in previously RES0 space in
ID_AA64ISAR2_EL1, together with one new enum value. Update the system
register definition to reflect this.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index f22ade8f1fa7..27d79644e1a0 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1365,7 +1365,14 @@ EndEnum
 EndSysreg
 
 Sysreg	ID_AA64ISAR2_EL1	3	0	0	6	2
-Res0	63:56
+UnsignedEnum	63:60	ATS1A
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	59:56	LUT
+	0b0000	NI
+	0b0001	IMP
+EndEnum
 UnsignedEnum	55:52	CSSC
 	0b0000	NI
 	0b0001	IMP
@@ -1374,7 +1381,19 @@ UnsignedEnum	51:48	RPRFM
 	0b0000	NI
 	0b0001	IMP
 EndEnum
-Res0	47:32
+Res0	47:44
+UnsignedEnum	43:40	PRFMSLC
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	39:36	SYSINSTR_128
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	35:32	SYSREG_128
+	0b0000	NI
+	0b0001	IMP
+EndEnum
 UnsignedEnum	31:28	CLRBHB
 	0b0000	NI
 	0b0001	IMP
@@ -1398,6 +1417,7 @@ UnsignedEnum	15:12	APA3
 	0b0011	PAuth2
 	0b0100	FPAC
 	0b0101	FPACCOMBINE
+	0b0110	PAUTH_LR
 EndEnum
 UnsignedEnum	11:8	GPA3
 	0b0000	NI

-- 
2.30.2
[PATCH v3 00/21] arm64: Support for 2023 DPISA extensions
This series enables support for the data processing extensions in the
newly released 2023 architecture; this is mainly support for 8 bit
floating point formats.

Most of the extensions only introduce new instructions and therefore only
require hwcaps, but there is a new EL0 visible control register FPMR used
to control the 8 bit floating point formats; we need to manage traps for
this and context switch it. The sharing of floating point save code
between the host and guest kernels slightly complicates the introduction
of KVM support; we first introduce host support with some placeholders
for KVM, then replace those with the actual KVM support.

I've not added test coverage for ptrace. I've got a not quite finished
test program which exercises all the FP ptrace interfaces and their
interactions together; my plan is to cover it there rather than add
another tiny test program that duplicates the boilerplate for tracing a
target and doesn't actually run the traced program.

Signed-off-by: Mark Brown
---
Changes in v3:
- Rebase onto v6.7-rc3.
- Hook up traps for FPMR in emulate-nested.c.
- Link to v2: https://lore.kernel.org/r/20231114-arm64-2023-dpisa-v2-0-47251894f...@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Link to v1: https://lore.kernel.org/r/20231026-arm64-2023-dpisa-v1-0-8470dd989...@kernel.org

---
Mark Brown (21):
      arm64/sysreg: Add definition for ID_AA64PFR2_EL1
      arm64/sysreg: Update ID_AA64ISAR2_EL1 definition for DDI0601 2023-09
      arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
      arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
      arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
      arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
      arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
      arm64/sysreg: Add definition for FPMR
      arm64/cpufeature: Hook new identification registers up to cpufeature
      arm64/fpsimd: Enable host kernel access to FPMR
      arm64/fpsimd: Support FEAT_FPMR
      arm64/signal: Add FPMR signal handling
      arm64/ptrace: Expose FPMR via ptrace
      KVM: arm64: Add newly allocated ID registers to register descriptions
      KVM: arm64: Support FEAT_FPMR for guests
      arm64/hwcap: Define hwcaps for 2023 DPISA features
      kselftest/arm64: Handle FPMR context in generic signal frame parser
      kselftest/arm64: Add basic FPMR test
      kselftest/arm64: Add 2023 DPISA hwcap test coverage
      KVM: arm64: selftests: Document feature registers added in 2023 extensions
      KVM: arm64: selftests: Teach get-reg-list about FPMR

 Documentation/arch/arm64/elf_hwcaps.rst            |  49 +
 arch/arm64/include/asm/cpu.h                       |   3 +
 arch/arm64/include/asm/cpufeature.h                |   5 +
 arch/arm64/include/asm/fpsimd.h                    |   2 +
 arch/arm64/include/asm/hwcap.h                     |  15 ++
 arch/arm64/include/asm/kvm_arm.h                   |   4 +-
 arch/arm64/include/asm/kvm_host.h                  |   3 +
 arch/arm64/include/asm/processor.h                 |   2 +
 arch/arm64/include/uapi/asm/hwcap.h                |  15 ++
 arch/arm64/include/uapi/asm/sigcontext.h           |   8 +
 arch/arm64/kernel/cpufeature.c                     |  72 +++
 arch/arm64/kernel/cpuinfo.c                        |  18 ++
 arch/arm64/kernel/fpsimd.c                         |  13 ++
 arch/arm64/kernel/ptrace.c                         |  42
 arch/arm64/kernel/signal.c                         |  59 ++
 arch/arm64/kvm/emulate-nested.c                    |   9 +
 arch/arm64/kvm/fpsimd.c                            |  19 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h            |   7 +-
 arch/arm64/kvm/sys_regs.c                          |  17 +-
 arch/arm64/tools/cpucaps                           |   1 +
 arch/arm64/tools/sysreg                            | 153 ++-
 include/uapi/linux/elf.h                           |   1 +
 tools/testing/selftests/arm64/abi/hwcap.c          | 217 +
 tools/testing/selftests/arm64/signal/.gitignore    |   1 +
 .../arm64/signal/testcases/fpmr_siginfo.c          |  82
 .../selftests/arm64/signal/testcases/testcases.c   |   8 +
 .../selftests/arm64/signal/testcases/testcases.h   |   1 +
 tools/testing/selftests/kvm/aarch64/get-reg-list.c |  11 +-
 28 files changed, 819 insertions(+), 18 deletions(-)
---
base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
change-id: 20231003-arm64-2023-dpisa-2f3d25746474

Best regards,
-- 
Mark Brown
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, Dec 05, 2023 at 04:01:50PM +0000, Edgecombe, Rick P wrote:

> Hmm, I didn't realize you were planning to have the kernel support
> upstream before the libc support was in testable shape.

It's not a "could someone run it" thing - it's about trying to ensure
that we get coverage from people who are just running the selftests as
part of general testing coverage rather than with the specific goal of
testing this one feature. Even when things start to land there will be a
considerable delay before they filter out so that all the enablement is
in CI systems off the shelf, and it'd be good to have coverage in that
interval.

> > What's the issue with working around the missing support? My
> > understanding was that there should be no ill effects from repeated
> > attempts to enable. We could add a check for things already being
> > enabled
>
> Normally the loader enables shadow stack and glibc then knows to do
> things in special ways when it is successful. If it instead manually
> enables in the app:
> - The app can't return from main() without disabling shadow stack
>   beforehand. Luckily this test directly calls exit()
> - The app can't do longjmp()
> - The app can't do ucontext stuff
> - The enabling code needs to be carefully crafted (the inline problem
>   you hit)
>
> I guess it's not a huge list, and mostly tests will run ok. But it
> doesn't seem right to add somewhat hacky shadow stack crud into generic
> tests.

Right, it's a small and fairly easily auditable list - it's more about
the app than the double enable, which was what I thought your concern
was. It's a bit annoying definitely and not something we want to do in
general, but for something like this where we're adding specific
coverage for API extensions for the feature it seems like a reasonable
tradeoff.
If the x86 toolchain/libc support is widely enough deployed (or you just
don't mind any missing coverage) we could use the toolchain support there
and only have the manual enable for arm64; it'd be inconsistent but not
wildly so.

> So you were planning to enable GCS in this test manually as well? How
> many tests were you planning to add it like this?

Yes, the current version of the arm64 series has the equivalent support
for GCS. I was only planning to do this along with adding specific
coverage for shadow stacks/GCS; general stuff that doesn't have any
specific support can get covered as part of system testing with the
toolchain and libc support. The only case beyond that I've done is some
arm64 specific stress tests which are written as standalone assembler
programs; those wouldn't get enabled by the toolchain anyway and have
some chance of catching context switch or signal handling issues should
they occur. It seemed worth it for the few lines of assembly it takes.
Re: [PATCH 1/4] kunit: Add APIs for managing devices
Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on c8613be119892ccceffbc550b9b9d7d68b995c9e]

url: https://github.com/intel-lab-lkp/linux/commits/davidgow-google-com/kunit-Add-APIs-for-managing-devices/20231205-153349
base: c8613be119892ccceffbc550b9b9d7d68b995c9e
patch link: https://lore.kernel.org/r/20231205-kunit_bus-v1-1-635036d3bc13%40google.com
patch subject: [PATCH 1/4] kunit: Add APIs for managing devices
config: x86_64-buildonly-randconfig-001-20231205 (https://download.01.org/0day-ci/archive/20231205/202312052341.feujgbbc-...@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231205/202312052341.feujgbbc-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new
version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202312052341.feujgbbc-...@intel.com/

All errors (new ones prefixed by >>):

   ld: lib/kunit/device.o: in function `kunit_bus_init':
>> device.c:(.text+0x40): multiple definition of `init_module'; lib/kunit/test.o:test.c:(.init.text+0x0): first defined here

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, 2023-12-05 at 15:05 +0000, Mark Brown wrote:
> > But I wonder if the clone3 test should get its shadow stack enabled
> > the conventional elf bit way. So if it's all there (HW, kernel,
> > glibc) then the test will run with shadow stack. Otherwise the test
> > will run without shadow stack.
>
> This creates bootstrapping issues if we do it for arm64 where nothing
> is merged yet except for the model and EL3 support - in order to get
> any test coverage you need to be using an OS with the libc and
> toolchain support available and that's not going to be something we can
> rely on for a while (and even when things are merged a lot of the CI
> systems use Debian). There is a small risk that the toolchain will
> generate incompatible code if it doesn't know it's specifically
> targeting shadow stacks but the toolchain people didn't seem concerned
> about that risk and we've not been running into problems.
>
> It looks x86 is in better shape here with the userspace having run
> ahead of the kernel support, though I'm not 100% clear if everything is
> fully lined up? -mshstk -fcf-protection appears to build fine with gcc
> 8 but I'm a bit less clear on glibc and any ABI variations.

Right, you would need a shadow stack enabled compiler too. The
check_cc.sh piece in the Makefile will detect that. Hmm, I didn't
realize you were planning to have the kernel support upstream before the
libc support was in testable shape.

> > The other reason is that the shadow stack test in the x86 selftest
> > manual enabling is designed to work without a shadow stack enabled
> > glibc and has to be specially crafted to work around the missing
> > support. I'm not sure the more generic selftests should have to know
> > how to do this. So what about something like this instead:
>
> What's the issue with working around the missing support? My
> understanding was that there should be no ill effects from repeated
> attempts to enable.
> We could add a check for things already being
> enabled

Normally the loader enables shadow stack and glibc then knows to do
things in special ways when it is successful. If it instead manually
enables in the app:

- The app can't return from main() without disabling shadow stack
  beforehand. Luckily this test directly calls exit()
- The app can't do longjmp()
- The app can't do ucontext stuff
- The enabling code needs to be carefully crafted (the inline problem
  you hit)

I guess it's not a huge list, and mostly tests will run ok. But it
doesn't seem right to add somewhat hacky shadow stack crud into generic
tests.

So you were planning to enable GCS in this test manually as well? How
many tests were you planning to add it like this?
Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
On Tue, Dec 5, 2023 at 4:07 AM Michal Hocko wrote:
> > This behavior is particularly useful for work scheduling systems that
> > need to track memory usage of worker processes/cgroups per-work-item.
> > Since memory can't be squeezed like CPU can (the OOM-killer has
> > opinions), these systems need to track the peak memory usage to
> > compute system/container fullness when binpacking workitems.
>
> I do not understand the OOM-killer reference here but I do understand
> that your worker reuses a cgroup and you want a peak memory consumption
> of a single run to better profile/configure the memcg configuration for
> the specific worker type. Correct?

To a certain extent, yes. At the moment, we're only using the inner
memcg cgroups for accounting/profiling, and using a larger (k8s
container) cgroup for enforcement. The OOM-killer is involved because
we're not configuring any memory limits on these individual "worker"
cgroups, so we need to provision for multiple workloads using their peak
memory at the same time to minimize OOM-killing.

In case you're curious, this is the job/queue-work scheduling system we
wrote in-house called Quickset that's mentioned in this blog post about
our new transcoder system:
https://medium.com/vimeo-engineering-blog/riding-the-dragon-e328a3dfd39d

> > Signed-off-by: David Finkel
>
> Makes sense to me
> Acked-by: Michal Hocko
>
> Thanks!

Thank you!

--
David Finkel
Senior Principal Software Engineer, Core Services
Re: [PATCH RFT v4 2/5] fork: Add shadow stack support to clone3()
On Tue, Dec 05, 2023 at 12:26:57AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2023-11-28 at 18:22 +0000, Mark Brown wrote:
> > -	size = adjust_shstk_size(stack_size);
> > +	size = adjust_shstk_size(size);
> > 	addr = alloc_shstk(0, size, 0, false);
>
> Hmm. I didn't test this, but in copy_process(), copy_mm() happens
> before this point. So the shadow stack would get mapped in current's MM
> (i.e. the parent). So in the !CLONE_VM case with shadow_stack_size != 0
> the SSP in the child will be updated to an area that is not mapped in
> the child. I think we need to pass tsk->mm into alloc_shstk(). But such
> an exotic clone usage does give me pause, regarding whether all of this
> is premature.

Hrm, right. And we then can't use do_mmap() either. I'd be somewhat
tempted to disallow that specific case for now rather than deal with it,
though that's not really in the spirit of just always following what the
user asked for.
Re: [PATCH v3 00/25] Permission Overlay Extension
Hi Marc,

On Mon, Dec 04, 2023 at 11:03:24AM +0000, Marc Zyngier wrote:
> Hi Joey,
>
> On Fri, 24 Nov 2023 16:34:45 +0000, Joey Gouly wrote:
> >
> > Hello everyone,
> >
> > This series implements the Permission Overlay Extension introduced in
> > the 2022 VMSA enhancements [1]. It is based on v6.7-rc2.
> >
> > Changes since v2[2]:
> >  # Added ptrace support and selftest
> >  # Add missing POR_EL0 initialisation in fork/clone
> >  # Rebase onto v6.7-rc2
> >  # Add r-bs
> >
> > The Permission Overlay Extension allows constraining permissions on
> > memory regions. This can be used from userspace (EL0) without a
> > system call or TLB invalidation.
>
> I have given this series a few more thoughts, and came to the
> conclusion that it is still incomplete on the KVM front:
>
> * FEAT_S1POE often comes together with FEAT_S2POE. For obvious
>   reasons, we cannot afford to let the guest play with S2POR_EL1, nor
>   do we want to advertise FEAT_S2POE to the guest.
>
>   You will need to add some additional FGT for this, and mask out
>   FEAT_S2POE from the guest's view of the ID registers.

I found out last week that I had misunderstood S2POR_EL1, so yes it
looks like we need to trap that. I will add that in.

> * letting the guest play with POE comes with some interesting strings
>   attached: a guest that has started on a POE-enabled host cannot be
>   migrated to one that doesn't have POE. Which means that the POE
>   registers should only be visible to the host userspace if enabled in
>   the guest's ID registers, and thus only context-switched in these
>   conditions. They should otherwise UNDEF.

Can you give me some clarification here?

- By visible to the host userspace, do you mean via the GET_ONE_REG API?
- Currently the ID register (ID_AA64MMFR3_EL1) is not ID_WRITABLE;
  should this series or another make it so? Currently if the host has
  POE it's enabled in the guest, so I believe migration to a non-POE
  host will fail?
- For the context switch, do you mean something like:

	if (system_supports_poe() && (ID_REG(MMFR3_EL1) & S1POE))
		ctxt_sys_reg(ctxt, POR_EL0) = read_sysreg_s(SYS_POR_EL0);

  That would need some refactoring, since I don't see how to access
  id_regs from struct kvm_cpu_context.

Thanks,
Joey
Re: [PATCH v3 3/3] selftests: livepatch: Test livepatching a heavily called syscall
On 12/5/23 05:52, mpdeso...@suse.com wrote:
> On Fri, 2023-12-01 at 16:38 +0000, Shuah Khan wrote:
> > 0003-selftests-livepatch-Test-livepatching-a-heavily-call.patch has
> > style problems, please review.
> >
> > NOTE: If any of the errors are false positives, please report them to
> > the maintainer, see CHECKPATCH in MAINTAINERS.
>
> I couldn't find any mention about "missing module name". Is your script
> showing more warnings than these ones? Can you please share your
> output?
>
> I'll fix the MAINTAINERS file but I'll wait until I understand what's
> missing in your checkpatch script to resend the patchset.

Looks like it is coming from a script - still my question stands on
whether or not you would need a module name for this module? I am not
too concerned about the MAINTAINERS file warns.

I am assuming you will be sending a new version to address Joe
Lawrence's comments?

thanks,
-- Shuah
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Tue, 2023-12-05 at 15:25 +0000, Song, Yoong Siang wrote:
> On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote:
> > Jesper Dangaard Brouer wrote:
> > >
> > > On 12/3/23 17:51, Song Yoong Siang wrote:
> > > > This patch enables Launch Time (Time-Based Scheduling) support to
> > > > XDP zero copy via XDP Tx metadata framework.
> > > >
> > > > Signed-off-by: Song Yoong Siang
> > > > ---
> > > >  drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++
> > >
> > > As requested before, I think we need to see another driver
> > > implementing this.
> > >
> > > I propose driver igc and chip i225.
>
> Sure. I will include igc patches in next version.
>
> > > The interesting thing for me is to see how the LaunchTime max 1
> > > second into the future[1] is handled code wise. One suggestion is
> > > to add a section to Documentation/networking/xsk-tx-metadata.rst
> > > per driver that mentions/documents these different hardware
> > > limitations. It is natural that different types of hardware have
> > > limitations. This is a close-to hardware-level abstraction/API, and
> > > IMHO as long as we document the limitations we can expose this API
> > > without too many limitations for more capable hardware.
>
> Sure. I will try to add hardware limitations in documentation.
>
> > I would assume that the kfunc will fail when a value is passed that
> > cannot be programmed.
>
> In the current design, xsk_tx_metadata_request() didn't get a return
> value. So the user won't know if their request has failed. It is
> complex to inform the user which request is failing. Therefore, IMHO,
> it is good that we let the driver handle the error silently.

If the programmed value is invalid, the packet will be "dropped" / will
never make it to the wire, right? That is clearly a situation that the
user should be informed about. For RT systems this normally means that
something is really wrong regarding timing / cycle overflow. Such
systems have to react on that situation.
> > What is being implemented here already exists for qdiscs. The FQ
> > qdisc takes a horizon attribute and
> >
> > "
> > when a packet is beyond the horizon at enqueue() time:
> > - either drop the packet (default policy)
> > - or cap its delivery time to the horizon.
> > "
> > commit 39d010504e6b ("net_sched: sch_fq: add horizon attribute")
> >
> > Having the admin manually configure this on the qdisc based on
> > off-line knowledge of the device is more fragile than if the device
> > would somehow signal its limit to the stack.
> >
> > But I don't think we should add enforcement of that as a requirement
> > for this xdp extension of pacing.
RE: [PATCH bpf-next v2 2/3] net: stmmac: Add txtime support to XDP ZC
On Tuesday, December 5, 2023 10:55 PM, Willem de Bruijn wrote:
>Song, Yoong Siang wrote:
>> On Monday, December 4, 2023 10:58 PM, Willem de Bruijn wrote:
>> >Song, Yoong Siang wrote:
>> >> On Friday, December 1, 2023 11:02 PM, Jesper Dangaard Brouer wrote:
>> >> >On 12/1/23 07:24, Song Yoong Siang wrote:
>> >> >> This patch enables txtime support to XDP zero copy via XDP Tx
>> >> >> metadata framework.
>> >> >>
>> >> >> Signed-off-by: Song Yoong Siang
>> >> >> ---
>> >> >>  drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  2 ++
>> >> >>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 13 +
>> >> >>  2 files changed, 15 insertions(+)
>> >> >
>> >> >I think we need to see other drivers using this new feature to
>> >> >evaluate if the API is sane.
>> >> >
>> >> >I suggest implementing this for the igc driver (chip i225) and also
>> >> >for igb (i210 chip); both support this kind of LaunchTime feature
>> >> >in HW.
>> >> >
>> >> >The API and stmmac driver take a u64 as time.
>> >> >I'm wondering how this applies to i210 that[1] has 25 bits for
>> >> >LaunchTime (with 32 nanosec granularity), limiting LaunchTime to
>> >> >max 0.5 second into the future.
>> >> >And i225 that[1] has 30 bits, max 1 second into the future.
>> >> >
>> >> >[1] https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org
>> >>
>> >> I am using u64 for launch time because the existing EDT framework is
>> >> using it. Refer to struct sk_buff below. Both u64 and ktime_t can be
>> >> used as launch time. I choose u64 because ktime_t often requires
>> >> additional type conversion and we don't expect negative values of
>> >> time.
>> >> include/linux/skbuff.h-744-  * @tstamp: Time we arrived/left
>> >> include/linux/skbuff.h:745-  * @skb_mstamp_ns: (aka @tstamp) earliest departure time; start point
>> >> include/linux/skbuff.h-746-  * for retransmit timer
>> >> --
>> >> include/linux/skbuff.h-880-  union {
>> >> include/linux/skbuff.h-881-          ktime_t         tstamp;
>> >> include/linux/skbuff.h:882-          u64             skb_mstamp_ns; /* earliest departure time */
>> >> include/linux/skbuff.h-883-  };
>> >>
>> >> tstamp/skb_mstamp_ns are used by various drivers for launch time
>> >> support on normal packets, so I think u64 should be "friendly" to
>> >> all the drivers. For an example, the igc driver will take the launch
>> >> time from tstamp and recalculate it accordingly (i225 expects the
>> >> user to program a "delta time" instead of a "time" into the HW
>> >> register).
>> >>
>> >> drivers/net/ethernet/intel/igc/igc_main.c-1602-  txtime = skb->tstamp;
>> >> drivers/net/ethernet/intel/igc/igc_main.c-1603-  skb->tstamp = ktime_set(0, 0);
>> >> drivers/net/ethernet/intel/igc/igc_main.c:1604-  launch_time = igc_tx_launchtime(tx_ring, txtime, &first_flag, &insert_empty);
>> >>
>> >> Do you think this is enough to say the API is sane?
>> >
>> >u64 nsec sounds sane to me. It must be made explicit which clock
>> >source it is against.
>>
>> The u64 launch time should be based on the NIC PTP hardware clock
>> (PHC). I will add documentation saying which clock source it is
>> against.
>
>It's not that obvious to me that that is the right and only choice.
>See below.
>
>> >Some applications could want to do the conversion from a clock source
>> >to raw NIC cycle counter in userspace or BPF and program the raw
>> >value. So it may be worthwhile to add a clock source argument -- even
>> >if initially only CLOCK_MONOTONIC is supported.
>>
>> Sorry, I don't quite understand your suggestion on adding a clock
>> source argument. Are you suggesting to add a clock source for the
>> selftest xdp_hw_metadata apps?
>> IMHO, no need to add a clock source, as the clock source for launch
>> time should always be based on the NIC PHC.

>This is not how FQ and ETF qdiscs pass timestamps to drivers today.
>
>Those are in CLOCK_MONOTONIC or CLOCK_TAI. The driver is expected to
>convert from that to its descriptor format, both to the reduced bit
>width and the NIC PHC.
>
>See also for instance how sch_etf has an explicit q->clock_id match,
>and SO_TXTIME added an sk_clock_id for the same purpose: to agree on
>which clock source is being used.

I see. Thanks for the explanation. I will try to add clock source
arguments in the next version.
RE: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote:
>Jesper Dangaard Brouer wrote:
>>
>> On 12/3/23 17:51, Song Yoong Siang wrote:
>> > This patch enables Launch Time (Time-Based Scheduling) support to
>> > XDP zero copy via XDP Tx metadata framework.
>> >
>> > Signed-off-by: Song Yoong Siang
>> > ---
>> >  drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++
>>
>> As requested before, I think we need to see another driver
>> implementing this.
>>
>> I propose driver igc and chip i225.

Sure. I will include igc patches in next version.

>> The interesting thing for me is to see how the LaunchTime max 1 second
>> into the future[1] is handled code wise. One suggestion is to add a
>> section to Documentation/networking/xsk-tx-metadata.rst per driver
>> that mentions/documents these different hardware limitations. It is
>> natural that different types of hardware have limitations. This is a
>> close-to hardware-level abstraction/API, and IMHO as long as we
>> document the limitations we can expose this API without too many
>> limitations for more capable hardware.

Sure. I will try to add hardware limitations in documentation.

>I would assume that the kfunc will fail when a value is passed that
>cannot be programmed.

In the current design, xsk_tx_metadata_request() didn't get a return
value. So the user won't know if their request has failed. It is complex
to inform the user which request is failing. Therefore, IMHO, it is good
that we let the driver handle the error silently.
> >But I don't think we should add enforcement of that as a requirement >for this xdp extension of pacing.