Re: [PATCH 1/4] kunit: Add APIs for managing devices
Hey Greg,

On Wed, 6 Dec 2023 at 01:31, Greg Kroah-Hartman wrote:
> On Tue, Dec 05, 2023 at 03:31:33PM +0800, david...@google.com wrote:
> > Tests for drivers often require a struct device to pass to other
> > functions. While it's possible to create these with
> > root_device_register(), or to use something like a platform device,
> > this is both a misuse of those APIs, and can be difficult to clean up
> > after, for example, a failed assertion.
> >
> > Add some KUnit-specific functions for registering and unregistering a
> > struct device:
> > - kunit_device_register()
> > - kunit_device_register_with_driver()
> > - kunit_device_unregister()
> >
> > These helpers allocate a device on a 'kunit' bus which will either
> > probe the driver passed in (kunit_device_register_with_driver), or
> > will create a stub driver (kunit_device_register) which is cleaned up
> > on test shutdown.
> >
> > Devices are automatically unregistered on test shutdown, but can be
> > manually unregistered earlier with kunit_device_unregister() in order
> > to, for example, test device release code.
>
> At first glance, nice work. But looks like 0-day doesn't like it that
> much, so I'll wait for the next version to review it properly.

Thanks very much for taking a look. I'll send v2 with the 0-day (and
other) issues fixed sometime tomorrow.

In the meantime, I've tried to explain some of the weirder decisions
below -- it mostly boils down to the existing use-cases only wanting an
opaque 'struct device *' they can pass around, and my attempt to find a
minimal (but still sensible) implementation of that. I'm definitely
happy to tweak this to make it a more 'normal' use of the device model
where that makes sense, though, especially if it doesn't require too
much boilerplate on the part of test authors.

> One nit I did notice:
>
> > +// For internal use only -- registers the kunit_bus.
> > +int kunit_bus_init(void);
>
> Put stuff like this in a local .h file, don't pollute the include/linux/
> files for things that you do not want any other part of the kernel to
> call.

v2 will have this in lib/kunit/device-impl.h

> > +/**
> > + * kunit_device_register_with_driver() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + * @drv: The struct device_driver to associate with the device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given
> > + * name, and driver. The device will be cleaned up on test exit, or when
> > + * kunit_device_unregister is called. See also kunit_device_register, if you
> > + * wish KUnit to create and manage a driver for you
> > + */
> > +struct device *kunit_device_register_with_driver(struct kunit *test,
> > +						 const char *name,
> > +						 struct device_driver *drv);
>
> Shouldn't "struct device_driver *" be a constant pointer?

Done (and for the internal functions) for v2.

> But really, why is this a "raw" device_driver pointer and not a pointer
> to the driver type for your bus?

So, this is where the more difficult questions start (and where my
knowledge of the driver model gets a bit shakier). At the moment,
there's no struct kunit_driver; the goal was to have whatever the
minimal amount of infrastructure needed to get a 'struct device *' that
could be plumbed through existing code which accepts it. (Read: mostly
devres resource management stuff, get_device(), etc.)

So, in this version, there's no:
- struct kunit_driver: we've no extra data to store / function pointers
  other than what's in struct device_driver.
- The kunit_bus is as minimal as I could get it: each device matches
  exactly one driver pointer (which is passed as struct
  kunit_device->driver).
- The 'struct kunit_device' type is kept private, and 'struct device'
  is used instead, as this is supposed to only be passed to generic
  device functions (KUnit is just managing its lifecycle).

I've no problem adding these extra types to flesh this out into a more
'normal' setup, though I'd rather keep the boilerplate on the user side
minimal if possible. I suspect if we were to return a struct
kunit_device, everyone would be quickly grabbing and passing around a
raw 'struct device *' anyway, which is what the existing tests with
fake devices do (via root_device_register, which returns struct device,
or by returning _device->dev from a helper).

Similarly, the kunit_bus is not ever exposed to test code, nor really
is the driver (except via kunit_device_register_with_driver(), which
isn't actually being used anywhere yet, so it may make sense to cut it
out from the next version). So, ideally tests won't even be aware that
their devices are attached to the kunit_bus, nor that they have drivers
attached: it's mostly just to make these normal enough that they show
up nicely in sysfs and play well with the devm_ resource management.
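To make the intended usage concrete, here is a hypothetical test sketch based on the descriptions in this thread -- it is not code from the patch, and the device name and error-checking conventions (ERR_PTR-or-NULL on failure) are my assumptions:

```c
#include <kunit/device.h>
#include <kunit/test.h>
#include <linux/device.h>
#include <linux/gfp.h>

static void example_fake_device_test(struct kunit *test)
{
	struct device *dev;
	void *buf;

	/* An opaque struct device on the kunit_bus, with a stub driver. */
	dev = kunit_device_register(test, "example-fake-device");
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev);

	/* devm_ allocations are tied to the device's lifetime... */
	buf = devm_kzalloc(dev, 16, GFP_KERNEL);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);

	/*
	 * ...and everything is torn down automatically on test exit, or
	 * explicitly here, e.g. to exercise release/devres paths early.
	 */
	kunit_device_unregister(test, dev);
}
```

The point of the design discussed above is visible here: the test only ever sees a plain 'struct device *', and never touches the kunit_bus or the stub driver directly.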
Re: [PATCH 1/4] kunit: Add APIs for managing devices
On Tue, 5 Dec 2023 at 16:30, Matti Vaittinen wrote:
> On 12/5/23 09:31, david...@google.com wrote:
> > Tests for drivers often require a struct device to pass to other
> > functions. While it's possible to create these with
> > root_device_register(), or to use something like a platform device,
> > this is both a misuse of those APIs, and can be difficult to clean up
> > after, for example, a failed assertion.
> >
> > Add some KUnit-specific functions for registering and unregistering a
> > struct device:
> > - kunit_device_register()
> > - kunit_device_register_with_driver()
> > - kunit_device_unregister()
>
> Thanks a lot David! I have been missing these!
>
> I love the explanation you added under Documentation. Very helpful I'd
> say. I only have very minor comments which you can ignore if they don't
> make sense to you or the kunit-subsystem.
>
> With or without the suggested changes:
>
> Reviewed-by: Matti Vaittinen
>
> > --- /dev/null
> > +++ b/include/kunit/device.h
> > @@ -0,0 +1,76 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * KUnit basic device implementation
> > + *
> > + * Helpers for creating and managing fake devices for KUnit tests.
> > + *
> > + * Copyright (C) 2023, Google LLC.
> > + * Author: David Gow
> > + */
> > +
> > +#ifndef _KUNIT_DEVICE_H
> > +#define _KUNIT_DEVICE_H
> > +
> > +#if IS_ENABLED(CONFIG_KUNIT)
> > +
> > +#include
> > +
> > +struct kunit_device;
> > +struct device;
> > +struct device_driver;
> > +
> > +// For internal use only -- registers the kunit_bus.
> > +int kunit_bus_init(void);
> > +
> > +/**
> > + * kunit_driver_create() - Create a struct device_driver attached to the kunit_bus
> > + * @test: The test context object.
> > + * @name: The name to give the created driver.
> > + *
> > + * Creates a struct device_driver attached to the kunit_bus, with the name @name.
> > + * This driver will automatically be cleaned up on test exit.
> > + */
> > +struct device_driver *kunit_driver_create(struct kunit *test, const char *name);
> > +
> > +/**
> > + * kunit_device_register() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given name,
> > + * and a corresponding driver. The device and driver will be cleaned up on test
> > + * exit, or when kunit_device_unregister is called. See also
> > + * kunit_device_register_with_driver, if you wish to provide your own
> > + * struct device_driver.
> > + */
> > +struct device *kunit_device_register(struct kunit *test, const char *name);
> > +
> > +/**
> > + * kunit_device_register_with_driver() - Create a struct device for use in KUnit tests
> > + * @test: The test context object.
> > + * @name: The name to give the created device.
> > + * @drv: The struct device_driver to associate with the device.
> > + *
> > + * Creates a struct kunit_device (which is a struct device) with the given
> > + * name, and driver. The device will be cleaned up on test exit, or when
> > + * kunit_device_unregister is called. See also kunit_device_register, if you
> > + * wish KUnit to create and manage a driver for you
> > + */
> > +struct device *kunit_device_register_with_driver(struct kunit *test,
> > +						 const char *name,
> > +						 struct device_driver *drv);
> > +
> > +/**
> > + * kunit_device_unregister() - Unregister a KUnit-managed device
> > + * @test: The test context object which created the device
> > + * @dev: The device.
> > + *
> > + * Unregisters and destroys a struct device which was created with
> > + * kunit_device_register or kunit_device_register_with_driver. If KUnit created
> > + * a driver, cleans it up as well.
> > + */
> > +void kunit_device_unregister(struct kunit *test, struct device *dev);
>
> I wish the return values for error case(s) were also mentioned.
> But please, see my next comment as well.

I'll add these for v2.

> > +
> > +#endif
> > +
> > +#endif
>
> ...
>
> > diff --git a/lib/kunit/device.c b/lib/kunit/device.c
> > new file mode 100644
> > index ..93ace1a2297d
> > --- /dev/null
> > +++ b/lib/kunit/device.c
> > @@ -0,0 +1,176 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * KUnit basic device implementation
> > + *
> > + * Implementation of struct kunit_device helpers.
> > + *
> > + * Copyright (C) 2023, Google LLC.
> > + * Author: David Gow
> > + */
> > +
> > ...
> >
> > +static void kunit_device_release(struct device *d)
> > +{
> > +	kfree(to_kunit_device(d));
> > +}
>
> I see you added the function documentation to the header. I assume this
> is the kunit style(?) I may be heretical, but I'd love to see at least a
> very short documentation for (all) exported functions here. I think the
> arguments are mostly self-explanatory, but at least for me the return
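For the _with_driver variant discussed in the quoted kernel-doc, a usage sketch might look like the following -- this is my illustration, not code from the patch, and the error-return convention (ERR_PTR-or-NULL, the detail Matti asks to have documented) is an assumption pending v2:

```c
#include <kunit/device.h>
#include <kunit/test.h>
#include <linux/device.h>

static void example_custom_driver_test(struct kunit *test)
{
	struct device_driver *drv;
	struct device *dev;

	/* A KUnit-managed driver on the kunit_bus, freed on test exit. */
	drv = kunit_driver_create(test, "example-driver");
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, drv);

	/* A device bound to that specific driver, also cleaned up for us. */
	dev = kunit_device_register_with_driver(test, "example-device", drv);
	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev);
}
```

As noted upthread, nothing in-tree uses this variant yet, so whether it survives to v2 (and whether @drv becomes const) is an open question in this thread.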
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 15:36, Yosry Ahmed wrote:
> On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou wrote:
>> On 2023/12/6 13:59, Yosry Ahmed wrote:
>>> [..]
>>>>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
>>>>> 	return entry;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> +* shrinker functions
>>>>> +**/
>>>>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
>>>>> +				       spinlock_t *lock, void *arg);
>>>>> +
>>>>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>>>>> +					 struct shrink_control *sc)
>>>>> +{
>>>>> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>>>>> +	unsigned long shrink_ret, nr_protected, lru_size;
>>>>> +	struct zswap_pool *pool = shrinker->private_data;
>>>>> +	bool encountered_page_in_swapcache = false;
>>>>> +
>>>>> +	nr_protected =
>>>>> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>>>>> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
>>>>> +
>>>>> +	/*
>>>>> +	 * Abort if the shrinker is disabled or if we are shrinking into the
>>>>> +	 * protected region.
>>>>> +	 *
>>>>> +	 * This short-circuiting is necessary because if we have too many multiple
>>>>> +	 * concurrent reclaimers getting the freeable zswap object counts at the
>>>>> +	 * same time (before any of them made reasonable progress), the total
>>>>> +	 * number of reclaimed objects might be more than the number of unprotected
>>>>> +	 * objects (i.e the reclaimers will reclaim into the protected area of the
>>>>> +	 * zswap LRU).
>>>>> +	 */
>>>>> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
>>>>> +		sc->nr_scanned = 0;
>>>>> +		return SHRINK_STOP;
>>>>> +	}
>>>>> +
>>>>> +	shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
>>>>> +					  &encountered_page_in_swapcache);
>>>>> +
>>>>> +	if (encountered_page_in_swapcache)
>>>>> +		return SHRINK_STOP;
>>>>> +
>>>>> +	return shrink_ret ? shrink_ret : SHRINK_STOP;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>>>>> +					  struct shrink_control *sc)
>>>>> +{
>>>>> +	struct zswap_pool *pool = shrinker->private_data;
>>>>> +	struct mem_cgroup *memcg = sc->memcg;
>>>>> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>>>>> +	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>>>>> +
>>>>> +#ifdef CONFIG_MEMCG_KMEM
>>>>> +	cgroup_rstat_flush(memcg->css.cgroup);
>>>>> +	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
>>>>> +	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
>>>>> +#else
>>>>> +	/* use pool stats instead of memcg stats */
>>>>> +	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
>>>>> +	nr_stored = atomic_read(&pool->nr_stored);
>>>>> +#endif
>>>>> +
>>>>> +	if (!zswap_shrinker_enabled || !nr_stored)
>>>>
>>>> When I tested with this series, with !zswap_shrinker_enabled in the
>>>> default case, I found the performance is much worse than that without
>>>> this patch.
>>>>
>>>> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs
>>>> directory.
>>>>
>>>> The reason seems the above cgroup_rstat_flush(), caused much rstat
>>>> lock contention to the zswap_store() path. And if I put the
>>>> "zswap_shrinker_enabled" check above the cgroup_rstat_flush(), the
>>>> performance become much better.
>>>>
>>>> Maybe we can put the "zswap_shrinker_enabled" check above
>>>> cgroup_rstat_flush()?
>>>
>>> Yes, we should do nothing if !zswap_shrinker_enabled. We should also
>>> use mem_cgroup_flush_stats() here like other places unless accuracy is
>>> crucial, which I doubt given that reclaim uses
>>> mem_cgroup_flush_stats().
>>
>> Yes. After changing to use mem_cgroup_flush_stats() here, the
>> performance become much better.
>>
>>> mem_cgroup_flush_stats() has some thresholding to make sure we don't
>>> do flushes unnecessarily, and I have a pending series in mm-unstable
>>> that makes that thresholding per-memcg. Keep in mind that adding a
>>> call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>>
>> My test branch is linux-next 20231205, and it's all good after changing
>> to use mem_cgroup_flush_stats(memcg).
>
> Thanks for reporting back. We should still move the
> zswap_shrinker_enabled check ahead, no need to even call
> mem_cgroup_flush_stats() if we will do nothing anyway.

Yes, agree!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On Tue, Dec 5, 2023 at 10:43 PM Chengming Zhou wrote:
> On 2023/12/6 13:59, Yosry Ahmed wrote:
> > [..]
> >>> @@ -526,6 +582,102 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >>> 	return entry;
> >>>  }
> >>>
> >>> +/*
> >>> +* shrinker functions
> >>> +**/
> >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> >>> +				       spinlock_t *lock, void *arg);
> >>> +
> >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >>> +					 struct shrink_control *sc)
> >>> +{
> >>> +	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >>> +	unsigned long shrink_ret, nr_protected, lru_size;
> >>> +	struct zswap_pool *pool = shrinker->private_data;
> >>> +	bool encountered_page_in_swapcache = false;
> >>> +
> >>> +	nr_protected =
> >>> +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> >>> +	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> >>> +
> >>> +	/*
> >>> +	 * Abort if the shrinker is disabled or if we are shrinking into the
> >>> +	 * protected region.
> >>> +	 *
> >>> +	 * This short-circuiting is necessary because if we have too many multiple
> >>> +	 * concurrent reclaimers getting the freeable zswap object counts at the
> >>> +	 * same time (before any of them made reasonable progress), the total
> >>> +	 * number of reclaimed objects might be more than the number of unprotected
> >>> +	 * objects (i.e the reclaimers will reclaim into the protected area of the
> >>> +	 * zswap LRU).
> >>> +	 */
> >>> +	if (!zswap_shrinker_enabled || nr_protected >= lru_size - sc->nr_to_scan) {
> >>> +		sc->nr_scanned = 0;
> >>> +		return SHRINK_STOP;
> >>> +	}
> >>> +
> >>> +	shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> >>> +					  &encountered_page_in_swapcache);
> >>> +
> >>> +	if (encountered_page_in_swapcache)
> >>> +		return SHRINK_STOP;
> >>> +
> >>> +	return shrink_ret ? shrink_ret : SHRINK_STOP;
> >>> +}
> >>> +
> >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >>> +					  struct shrink_control *sc)
> >>> +{
> >>> +	struct zswap_pool *pool = shrinker->private_data;
> >>> +	struct mem_cgroup *memcg = sc->memcg;
> >>> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> >>> +	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> >>> +
> >>> +#ifdef CONFIG_MEMCG_KMEM
> >>> +	cgroup_rstat_flush(memcg->css.cgroup);
> >>> +	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> >>> +	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> >>> +#else
> >>> +	/* use pool stats instead of memcg stats */
> >>> +	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> >>> +	nr_stored = atomic_read(&pool->nr_stored);
> >>> +#endif
> >>> +
> >>> +	if (!zswap_shrinker_enabled || !nr_stored)
> >>
> >> When I tested with this series, with !zswap_shrinker_enabled in the
> >> default case, I found the performance is much worse than that without
> >> this patch.
> >>
> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs
> >> directory.
> >>
> >> The reason seems the above cgroup_rstat_flush(), caused much rstat
> >> lock contention to the zswap_store() path. And if I put the
> >> "zswap_shrinker_enabled" check above the cgroup_rstat_flush(), the
> >> performance become much better.
> >>
> >> Maybe we can put the "zswap_shrinker_enabled" check above
> >> cgroup_rstat_flush()?
> >
> > Yes, we should do nothing if !zswap_shrinker_enabled. We should also
> > use mem_cgroup_flush_stats() here like other places unless accuracy is
> > crucial, which I doubt given that reclaim uses
> > mem_cgroup_flush_stats().
>
> Yes. After changing to use mem_cgroup_flush_stats() here, the
> performance become much better.
>
> > mem_cgroup_flush_stats() has some thresholding to make sure we don't
> > do flushes unnecessarily, and I have a pending series in mm-unstable
> > that makes that thresholding per-memcg. Keep in mind that adding a
> > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable,
>
> My test branch is linux-next 20231205, and it's all good after changing
> to use mem_cgroup_flush_stats(memcg).

Thanks for reporting back. We should still move the
zswap_shrinker_enabled check ahead, no need to even call
mem_cgroup_flush_stats() if we will do nothing anyway.

> > because the series there adds a memcg argument to
> > mem_cgroup_flush_stats(). That should be easily amenable though, I can
> > post a fixlet for my series to add the memcg argument there on top of
> > users if needed.
>
> It's great. Thanks!
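Putting the agreed changes together, the reordered count callback would take roughly the following shape. This is a sketch of the fix discussed above, not the final committed code; the elided tail and the no-argument mem_cgroup_flush_stats() signature (which the pending mm-unstable series changes to take a memcg) are assumptions:

```c
static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
					  struct shrink_control *sc)
{
	struct zswap_pool *pool = shrinker->private_data;
	struct mem_cgroup *memcg = sc->memcg;
	unsigned long nr_backing, nr_stored;

	/*
	 * Check the knob before flushing anything: when the shrinker is
	 * disabled this avoids all rstat lock traffic, which was
	 * contending with the zswap_store() path in the reported test.
	 */
	if (!zswap_shrinker_enabled)
		return 0;

#ifdef CONFIG_MEMCG_KMEM
	/* Ratelimited/thresholded flush, unlike cgroup_rstat_flush(). */
	mem_cgroup_flush_stats();
	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
#else
	/* use pool stats instead of memcg stats */
	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
	nr_stored = atomic_read(&pool->nr_stored);
#endif

	if (!nr_stored)
		return 0;

	/* ... remainder of the freeable-object accounting as in the patch ... */
	return 0;
}
```

The key design point from the thread is simply ordering: the cheap boolean test must come before any stats flush, so a disabled shrinker costs nothing on the reclaim path.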
[PATCH net-next 9/9] selftests/net: convert vrf-xfrm-tests.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf-xfrm-tests.sh No qdisc on VRF device TEST: IPv4 no xfrm policy [ OK ] TEST: IPv6 no xfrm policy [ OK ] TEST: IPv4 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy with VRF in selector [ OK ] TEST: IPv4 xfrm policy with xfrm device [ OK ] TEST: IPv6 xfrm policy with xfrm device [ OK ] netem qdisc on VRF device TEST: IPv4 no xfrm policy [ OK ] TEST: IPv6 no xfrm policy [ OK ] TEST: IPv4 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy based on address [ OK ] TEST: IPv6 xfrm policy with VRF in selector [ OK ] TEST: IPv4 xfrm policy with xfrm device [ OK ] TEST: IPv6 xfrm policy with xfrm device [ OK ] Tests passed: 14 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- tools/testing/selftests/net/vrf-xfrm-tests.sh | 77 +-- 1 file changed, 36 insertions(+), 41 deletions(-) diff --git a/tools/testing/selftests/net/vrf-xfrm-tests.sh b/tools/testing/selftests/net/vrf-xfrm-tests.sh index 452638ae8aed..b64dd891699d 100755 --- a/tools/testing/selftests/net/vrf-xfrm-tests.sh +++ b/tools/testing/selftests/net/vrf-xfrm-tests.sh @@ -3,9 +3,7 @@ # # Various combinations of VRF with xfrms and qdisc. -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 - +source lib.sh PAUSE_ON_FAIL=no VERBOSE=0 ret=0 @@ -67,7 +65,7 @@ run_cmd_host1() printf "COMMAND: $cmd\n" fi - out=$(eval ip netns exec host1 $cmd 2>&1) + out=$(eval ip netns exec $host1 $cmd 2>&1) rc=$? 
if [ "$VERBOSE" = "1" ]; then if [ -n "$out" ]; then @@ -116,9 +114,6 @@ create_ns() [ -z "${addr}" ] && addr="-" [ -z "${addr6}" ] && addr6="-" - ip netns add ${ns} - - ip -netns ${ns} link set lo up if [ "${addr}" != "-" ]; then ip -netns ${ns} addr add dev lo ${addr} fi @@ -177,25 +172,25 @@ connect_ns() cleanup() { - ip netns del host1 - ip netns del host2 + cleanup_ns $host1 $host2 } setup() { - create_ns "host1" - create_ns "host2" + setup_ns host1 host2 + create_ns "$host1" + create_ns "$host2" - connect_ns "host1" eth0 ${HOST1_4}/24 ${HOST1_6}/64 \ - "host2" eth0 ${HOST2_4}/24 ${HOST2_6}/64 + connect_ns "$host1" eth0 ${HOST1_4}/24 ${HOST1_6}/64 \ + "$host2" eth0 ${HOST2_4}/24 ${HOST2_6}/64 - create_vrf "host1" ${VRF} ${TABLE} - ip -netns host1 link set dev eth0 master ${VRF} + create_vrf "$host1" ${VRF} ${TABLE} + ip -netns $host1 link set dev eth0 master ${VRF} } cleanup_xfrm() { - for ns in host1 host2 + for ns in $host1 $host2 do for x in state policy do @@ -218,57 +213,57 @@ setup_xfrm() # # host1 - IPv4 out - ip -netns host1 xfrm policy add \ + ip -netns $host1 xfrm policy add \ src ${h1_4} dst ${h2_4} ${devarg} dir out \ tmpl src ${HOST1_4} dst ${HOST2_4} proto esp mode tunnel # host2 - IPv4 in - ip -netns host2 xfrm policy add \ + ip -netns $host2 xfrm policy add \ src ${h1_4} dst ${h2_4} dir in \ tmpl src ${HOST1_4} dst ${HOST2_4} proto esp mode tunnel # host1 - IPv4 in - ip -netns host1 xfrm policy add \ + ip -netns $host1 xfrm policy add \ src ${h2_4} dst ${h1_4} ${devarg} dir in \ tmpl src ${HOST2_4} dst ${HOST1_4} proto esp mode tunnel # host2 - IPv4 out - ip -netns host2 xfrm policy add \ + ip -netns $host2 xfrm policy add \ src ${h2_4} dst ${h1_4} dir out \ tmpl src ${HOST2_4} dst ${HOST1_4} proto esp mode tunnel # host1 - IPv6 out - ip -6 -netns host1 xfrm policy add \ + ip -6 -netns $host1 xfrm policy add \ src ${h1_6} dst ${h2_6} ${devarg} dir out \ tmpl src ${HOST1_6} dst ${HOST2_6} proto esp mode tunnel # host2 - IPv6 in - ip -6 -netns 
host2 xfrm policy add \ + ip -6 -netns $host2 xfrm policy add \ src ${h1_6} dst ${h2_6} dir in \ tmpl src ${HOST1_6} dst ${HOST2_6} proto esp mode tunnel # host1 - IPv6 in - ip -6 -netns host1 xfrm policy add \ + ip -6 -netns $host1 xfrm policy add \ src ${h2_6} dst ${h1_6} ${devarg} dir in \ tmpl src ${HOST2_6} dst ${HOST1_6} proto esp mode tunnel # host2 - IPv6 out -
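The conversions in this series lean on the net selftest lib.sh helpers (setup_ns/cleanup_ns), which suffix each requested name with random characters so concurrent test runs cannot collide on a fixed namespace name like "host1". A standalone sketch of that naming idea (my illustration, with a hypothetical gen_ns_name helper -- the real setup_ns also creates the netns and brings up its loopback device):

```shell
#!/bin/sh
# Derive a collision-resistant namespace name from a base like "host1".
gen_ns_name() {
	# mktemp -u only generates a unique-looking name; it creates nothing.
	echo "$1-$(mktemp -u XXXXXX)"
}

host1=$(gen_ns_name host1)
host2=$(gen_ns_name host2)

# Teardown can now target exactly the namespaces this run created,
# instead of deleting a shared, hard-coded "host1"/"host2".
echo "would create/delete: $host1 $host2"
```

This is why the diffs above replace every literal `host1`/`host2` with `$host1`/`$host2`: the name is only known at runtime once setup_ns has picked the suffix.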
[PATCH net-next 8/9] selftests/net: convert vrf_strict_mode_test.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf_strict_mode_test.sh TEST SECTION: VRF strict_mode test on init network namespace TEST: init: net.vrf.strict_mode is available[ OK ] TEST: init: strict_mode=0 by default, 0 vrfs[ OK ] ... TEST: init: check strict_mode=1 [ OK ] TEST: testns-HvoZkB: check strict_mode=0[ OK ] Tests passed: 37 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/vrf_strict_mode_test.sh | 47 +-- 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/net/vrf_strict_mode_test.sh b/tools/testing/selftests/net/vrf_strict_mode_test.sh index 417d214264f3..01552b542544 100755 --- a/tools/testing/selftests/net/vrf_strict_mode_test.sh +++ b/tools/testing/selftests/net/vrf_strict_mode_test.sh @@ -3,9 +3,7 @@ # This test is designed for testing the new VRF strict_mode functionality. -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 - +source lib.sh ret=0 # identifies the "init" network namespace which is often called root network @@ -247,13 +245,12 @@ setup() { modprobe vrf - ip netns add testns - ip netns exec testns ip link set lo up + setup_ns testns } cleanup() { - ip netns del testns 2>/dev/null + ip netns del $testns 2>/dev/null ip link del vrf100 2>/dev/null ip link del vrf101 2>/dev/null @@ -298,28 +295,28 @@ vrf_strict_mode_tests_testns() { log_section "VRF strict_mode test on testns network namespace" - vrf_strict_mode_check_support testns + vrf_strict_mode_check_support $testns - strict_mode_check_default testns + strict_mode_check_default $testns - enable_strict_mode_and_check testns + enable_strict_mode_and_check $testns - add_vrf_and_check testns vrf100 100 - config_vrf_and_check testns 10.0.100.1/24 vrf100 + add_vrf_and_check $testns vrf100 100 + config_vrf_and_check $testns 10.0.100.1/24 vrf100 - add_vrf_and_check_fail testns vrf101 100 + add_vrf_and_check_fail $testns vrf101 100 - add_vrf_and_check_fail testns vrf102 100 + add_vrf_and_check_fail 
$testns vrf102 100 - add_vrf_and_check testns vrf200 200 + add_vrf_and_check $testns vrf200 200 - disable_strict_mode_and_check testns + disable_strict_mode_and_check $testns - add_vrf_and_check testns vrf101 100 + add_vrf_and_check $testns vrf101 100 - add_vrf_and_check testns vrf102 100 + add_vrf_and_check $testns vrf102 100 - #the strict_mode is disabled in the testns + #the strict_mode is disabled in the $testns } vrf_strict_mode_tests_mix() @@ -328,25 +325,25 @@ vrf_strict_mode_tests_mix() read_strict_mode_compare_and_check init 1 - read_strict_mode_compare_and_check testns 0 + read_strict_mode_compare_and_check $testns 0 - del_vrf_and_check testns vrf101 + del_vrf_and_check $testns vrf101 - del_vrf_and_check testns vrf102 + del_vrf_and_check $testns vrf102 disable_strict_mode_and_check init - enable_strict_mode_and_check testns + enable_strict_mode_and_check $testns enable_strict_mode_and_check init enable_strict_mode_and_check init - disable_strict_mode_and_check testns - disable_strict_mode_and_check testns + disable_strict_mode_and_check $testns + disable_strict_mode_and_check $testns read_strict_mode_compare_and_check init 1 - read_strict_mode_compare_and_check testns 0 + read_strict_mode_compare_and_check $testns 0 } -- 2.43.0
[PATCH net-next 7/9] selftests/net: convert vrf_route_leaking.sh to run it in unique namespace
Here is the test result after conversion. ]# ./vrf_route_leaking.sh ### IPv4 (sym route): VRF ICMP ttl error route lookup ping ### TEST: Basic IPv4 connectivity [ OK ] TEST: Ping received ICMP ttl exceeded [ OK ] ... TEST: Basic IPv6 connectivity [ OK ] TEST: Traceroute6 reports a hop on r1 [ OK ] Tests passed: 18 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/vrf_route_leaking.sh| 201 +- 1 file changed, 96 insertions(+), 105 deletions(-) diff --git a/tools/testing/selftests/net/vrf_route_leaking.sh b/tools/testing/selftests/net/vrf_route_leaking.sh index dedc52562b4f..2da32f4c479b 100755 --- a/tools/testing/selftests/net/vrf_route_leaking.sh +++ b/tools/testing/selftests/net/vrf_route_leaking.sh @@ -58,6 +58,7 @@ # to send an ICMP error back to the source when the ttl of a packet reaches 1 # while it is forwarded between different vrfs. +source lib.sh VERBOSE=0 PAUSE_ON_FAIL=no DEFAULT_TTYPE=sym @@ -171,11 +172,7 @@ run_cmd_grep() cleanup() { - local ns - - for ns in h1 h2 r1 r2; do - ip netns del $ns 2>/dev/null - done + cleanup_ns $h1 $h2 $r1 $r2 } setup_vrf() @@ -212,72 +209,69 @@ setup_sym() # # create nodes as namespaces - # - for ns in h1 h2 r1; do - ip netns add $ns - ip -netns $ns link set lo up - - case "${ns}" in - h[12]) ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0 - ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 - ;; - r1)ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1 - ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1 - esac + setup_ns h1 h2 r1 + for ns in $h1 $h2 $r1; do + if echo $ns | grep -q h[12]-; then + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=0 + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.keep_addr_on_down=1 + else + ip netns exec $ns sysctl -q -w net.ipv4.ip_forward=1 + ip netns exec $ns sysctl -q -w net.ipv6.conf.all.forwarding=1 + fi done # # create interconnects # - ip -netns h1 link add eth0 type veth peer name r1h1 - 
ip -netns h1 link set r1h1 netns r1 name eth0 up + ip -netns $h1 link add eth0 type veth peer name r1h1 + ip -netns $h1 link set r1h1 netns $r1 name eth0 up - ip -netns h2 link add eth0 type veth peer name r1h2 - ip -netns h2 link set r1h2 netns r1 name eth1 up + ip -netns $h2 link add eth0 type veth peer name r1h2 + ip -netns $h2 link set r1h2 netns $r1 name eth1 up # # h1 # - ip -netns h1 addr add dev eth0 ${H1_N1_IP}/24 - ip -netns h1 -6 addr add dev eth0 ${H1_N1_IP6}/64 nodad - ip -netns h1 link set eth0 up + ip -netns $h1 addr add dev eth0 ${H1_N1_IP}/24 + ip -netns $h1 -6 addr add dev eth0 ${H1_N1_IP6}/64 nodad + ip -netns $h1 link set eth0 up # h1 to h2 via r1 - ip -netns h1route add ${H2_N2} via ${R1_N1_IP} dev eth0 - ip -netns h1 -6 route add ${H2_N2_6} via "${R1_N1_IP6}" dev eth0 + ip -netns $h1route add ${H2_N2} via ${R1_N1_IP} dev eth0 + ip -netns $h1 -6 route add ${H2_N2_6} via "${R1_N1_IP6}" dev eth0 # # h2 # - ip -netns h2 addr add dev eth0 ${H2_N2_IP}/24 - ip -netns h2 -6 addr add dev eth0 ${H2_N2_IP6}/64 nodad - ip -netns h2 link set eth0 up + ip -netns $h2 addr add dev eth0 ${H2_N2_IP}/24 + ip -netns $h2 -6 addr add dev eth0 ${H2_N2_IP6}/64 nodad + ip -netns $h2 link set eth0 up # h2 to h1 via r1 - ip -netns h2 route add default via ${R1_N2_IP} dev eth0 - ip -netns h2 -6 route add default via ${R1_N2_IP6} dev eth0 + ip -netns $h2 route add default via ${R1_N2_IP} dev eth0 + ip -netns $h2 -6 route add default via ${R1_N2_IP6} dev eth0 # # r1 # - setup_vrf r1 - create_vrf r1 blue 1101 - create_vrf r1 red 1102 - ip -netns r1 link set mtu 1400 dev eth1 - ip -netns r1 link set eth0 vrf blue up - ip -netns r1 link set eth1 vrf red up - ip -netns r1 addr add dev eth0 ${R1_N1_IP}/24 - ip -netns r1 -6 addr add dev eth0 ${R1_N1_IP6}/64 nodad - ip -netns r1 addr add dev eth1 ${R1_N2_IP}/24 - ip -netns r1 -6 addr add dev eth1 ${R1_N2_IP6}/64 nodad + setup_vrf $r1 +
[PATCH net-next 6/9] selftests/net: convert test_vxlan_vnifiltering.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_vnifiltering.sh TEST: Create traditional vxlan device [ OK ] TEST: Cannot create vnifilter device without external flag [ OK ] TEST: Creating external vxlan device with vnifilter flag[ OK ] ... TEST: VM connectivity over traditional vxlan (ipv6 default rdst)[ OK ] TEST: VM connectivity over metadata nonfiltering vxlan (ipv4 default rdst) [ OK ] Tests passed: 27 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_vnifiltering.sh | 154 +++--- 1 file changed, 95 insertions(+), 59 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_vnifiltering.sh b/tools/testing/selftests/net/test_vxlan_vnifiltering.sh index 8c3ac0a72545..6127a78ee988 100755 --- a/tools/testing/selftests/net/test_vxlan_vnifiltering.sh +++ b/tools/testing/selftests/net/test_vxlan_vnifiltering.sh @@ -78,10 +78,8 @@ # # # This test tests the new vxlan vnifiltering api - +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # all tests in this script. Can be overridden with -t option TESTS=" @@ -148,18 +146,18 @@ run_cmd() } check_hv_connectivity() { - ip netns exec hv-1 ping -c 1 -W 1 $1 &>/dev/null + ip netns exec $hv_1 ping -c 1 -W 1 $1 &>/dev/null sleep 1 - ip netns exec hv-1 ping -c 1 -W 1 $2 &>/dev/null + ip netns exec $hv_1 ping -c 1 -W 1 $2 &>/dev/null return $? } check_vm_connectivity() { - run_cmd "ip netns exec vm-11 ping -c 1 -W 1 10.0.10.12" + run_cmd "ip netns exec $vm_11 ping -c 1 -W 1 10.0.10.12" log_test $? 0 "VM connectivity over $1 (ipv4 default rdst)" - run_cmd "ip netns exec vm-21 ping -c 1 -W 1 10.0.10.22" + run_cmd "ip netns exec $vm_21 ping -c 1 -W 1 10.0.10.22" log_test $? 
0 "VM connectivity over $1 (ipv6 default rdst)" } @@ -167,26 +165,23 @@ cleanup() { ip link del veth-hv-1 2>/dev/null || true ip link del vethhv-11 vethhv-12 vethhv-21 vethhv-22 2>/dev/null || true - for ns in hv-1 hv-2 vm-11 vm-21 vm-12 vm-22 vm-31 vm-32; do - ip netns del $ns 2>/dev/null || true - done + cleanup_ns $hv_1 $hv_2 $vm_11 $vm_21 $vm_12 $vm_22 $vm_31 $vm_32 } trap cleanup EXIT setup-hv-networking() { - hv=$1 + id=$1 local1=$2 mask1=$3 local2=$4 mask2=$5 - ip netns add hv-$hv - ip link set veth-hv-$hv netns hv-$hv - ip -netns hv-$hv link set veth-hv-$hv name veth0 - ip -netns hv-$hv addr add $local1/$mask1 dev veth0 - ip -netns hv-$hv addr add $local2/$mask2 dev veth0 - ip -netns hv-$hv link set veth0 up + ip link set veth-hv-$id netns ${hv[$id]} + ip -netns ${hv[$id]} link set veth-hv-$id name veth0 + ip -netns ${hv[$id]} addr add $local1/$mask1 dev veth0 + ip -netns ${hv[$id]} addr add $local2/$mask2 dev veth0 + ip -netns ${hv[$id]} link set veth0 up } # Setups a "VM" simulated by a netns an a veth pair @@ -208,21 +203,20 @@ setup-vm() { lastvxlandev="" # create bridge - ip -netns hv-$hvid link add br$brid type bridge vlan_filtering 1 vlan_default_pvid 0 \ + ip -netns ${hv[$hvid]} link add br$brid type bridge vlan_filtering 1 vlan_default_pvid 0 \ mcast_snooping 0 - ip -netns hv-$hvid link set br$brid up + ip -netns ${hv[$hvid]} link set br$brid up # create vm namespace and interfaces and connect to hypervisor # namespace - ip netns add vm-$vmid hvvethif="vethhv-$vmid" vmvethif="veth-$vmid" ip link add $hvvethif type veth peer name $vmvethif - ip link set $hvvethif netns hv-$hvid - ip link set $vmvethif netns vm-$vmid - ip -netns hv-$hvid link set $hvvethif up - ip -netns vm-$vmid link set $vmvethif up - ip -netns hv-$hvid link set $hvvethif master br$brid + ip link set $hvvethif netns ${hv[$hvid]} + ip link set $vmvethif netns ${vm[$vmid]} + ip -netns ${hv[$hvid]} link set $hvvethif up + ip -netns ${vm[$vmid]} link set $vmvethif up + ip -netns 
${hv[$hvid]} link set $hvvethif master br$brid # configure VM vlan/vni filtering on hypervisor for vmap in $(echo $vattrs | cut -d "," -f1- --output-delimiter=' ') @@ -234,9 +228,9 @@ setup-vm() { local vtype=$(echo $vmap | awk -F'-' '{print ($5)}') local port=$(echo $vmap | awk -F'-' '{print ($6)}') - ip -netns vm-$vmid link add name $vmvethif.$vid link $vmvethif type vlan id $vid - ip -netns vm-$vmid addr add 10.0.$vid.$vmid/24 dev $vmvethif.$vid - ip -netns vm-$vmid link set $vmvethif.$vid up + ip -netns ${vm[$vmid]} link add name $vmvethif.$vid link $vmvethif type vlan id $vid + ip -netns ${vm[$vmid]} addr add 10.0.$vid.$vmid/24 dev
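The vlan/vni map entries in the loop above are plain "-"-delimited strings: cut expands the comma-separated list and awk pulls out individual fields. The mechanics can be exercised standalone; the entry below is made up, and only the field positions used in the loop ($1, $5) are assumed:

```shell
# Hypothetical vattrs value; only the cut/awk mechanics matter here.
vattrs="10-v4-239.1.1.100-2-vnifilterg-4789,20-v4-239.1.1.200-3-vnifilterg-4789"

for vmap in $(echo $vattrs | cut -d "," -f1- --output-delimiter=' ')
do
	vid=$(echo $vmap | awk -F'-' '{print ($1)}')
	vtype=$(echo $vmap | awk -F'-' '{print ($5)}')
	echo "vid=$vid vtype=$vtype"
done
# prints: vid=10 vtype=vnifilterg
#         vid=20 vtype=vnifilterg
```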
[PATCH net-next 5/9] selftests/net: convert test_vxlan_under_vrf.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_under_vrf.sh Checking HV connectivity [ OK ] Check VM connectivity through VXLAN (underlay in the default VRF) [ OK ] Check VM connectivity through VXLAN (underlay in a VRF)[ OK ] Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_under_vrf.sh | 70 ++- 1 file changed, 36 insertions(+), 34 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_under_vrf.sh b/tools/testing/selftests/net/test_vxlan_under_vrf.sh index 1fd1250ebc66..ae8fbe3f0779 100755 --- a/tools/testing/selftests/net/test_vxlan_under_vrf.sh +++ b/tools/testing/selftests/net/test_vxlan_under_vrf.sh @@ -43,15 +43,14 @@ # This tests both the connectivity between vm-1 and vm-2, and that the underlay # can be moved in and out of the vrf by unsetting and setting veth0's master. +source lib.sh set -e cleanup() { ip link del veth-hv-1 2>/dev/null || true ip link del veth-tap 2>/dev/null || true -for ns in hv-1 hv-2 vm-1 vm-2; do -ip netns del $ns 2>/dev/null || true -done +cleanup_ns $hv_1 $hv_2 $vm_1 $vm_2 } # Clean start @@ -60,72 +59,75 @@ cleanup &> /dev/null [[ $1 == "clean" ]] && exit 0 trap cleanup EXIT +setup_ns hv_1 hv_2 vm_1 vm_2 +hv[1]=$hv_1 +hv[2]=$hv_2 +vm[1]=$vm_1 +vm[2]=$vm_2 # Setup "Hypervisors" simulated with netns ip link add veth-hv-1 type veth peer name veth-hv-2 setup-hv-networking() { -hv=$1 +id=$1 -ip netns add hv-$hv -ip link set veth-hv-$hv netns hv-$hv -ip -netns hv-$hv link set veth-hv-$hv name veth0 +ip link set veth-hv-$id netns ${hv[$id]} +ip -netns ${hv[$id]} link set veth-hv-$id name veth0 -ip -netns hv-$hv link add vrf-underlay type vrf table 1 -ip -netns hv-$hv link set vrf-underlay up -ip -netns hv-$hv addr add 172.16.0.$hv/24 dev veth0 -ip -netns hv-$hv link set veth0 up +ip -netns ${hv[$id]} link add vrf-underlay type vrf table 1 +ip -netns ${hv[$id]} link set vrf-underlay up +ip -netns ${hv[$id]} addr add 172.16.0.$id/24 dev veth0 +ip -netns ${hv[$id]} link set veth0 up -ip 
-netns hv-$hv link add br0 type bridge -ip -netns hv-$hv link set br0 up +ip -netns ${hv[$id]} link add br0 type bridge +ip -netns ${hv[$id]} link set br0 up -ip -netns hv-$hv link add vxlan0 type vxlan id 10 local 172.16.0.$hv dev veth0 dstport 4789 -ip -netns hv-$hv link set vxlan0 master br0 -ip -netns hv-$hv link set vxlan0 up +ip -netns ${hv[$id]} link add vxlan0 type vxlan id 10 local 172.16.0.$id dev veth0 dstport 4789 +ip -netns ${hv[$id]} link set vxlan0 master br0 +ip -netns ${hv[$id]} link set vxlan0 up } setup-hv-networking 1 setup-hv-networking 2 # Check connectivity between HVs by pinging hv-2 from hv-1 echo -n "Checking HV connectivity " -ip netns exec hv-1 ping -c 1 -W 1 172.16.0.2 &> /dev/null || (echo "[FAIL]"; false) +ip netns exec $hv_1 ping -c 1 -W 1 172.16.0.2 &> /dev/null || (echo "[FAIL]"; false) echo "[ OK ]" # Setups a "VM" simulated by a netns an a veth pair setup-vm() { id=$1 -ip netns add vm-$id ip link add veth-tap type veth peer name veth-hv -ip link set veth-tap netns hv-$id -ip -netns hv-$id link set veth-tap master br0 -ip -netns hv-$id link set veth-tap up +ip link set veth-tap netns ${hv[$id]} +ip -netns ${hv[$id]} link set veth-tap master br0 +ip -netns ${hv[$id]} link set veth-tap up ip link set veth-hv address 02:1d:8d:dd:0c:6$id -ip link set veth-hv netns vm-$id -ip -netns vm-$id addr add 10.0.0.$id/24 dev veth-hv -ip -netns vm-$id link set veth-hv up +ip link set veth-hv netns ${vm[$id]} +ip -netns ${vm[$id]} addr add 10.0.0.$id/24 dev veth-hv +ip -netns ${vm[$id]} link set veth-hv up } setup-vm 1 setup-vm 2 # Setup VTEP routes to make ARP work -bridge -netns hv-1 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.2 self permanent -bridge -netns hv-2 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.1 self permanent +bridge -netns $hv_1 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.2 self permanent +bridge -netns $hv_2 fdb add 00:00:00:00:00:00 dev vxlan0 dst 172.16.0.1 self permanent echo -n "Check VM connectivity 
through VXLAN (underlay in the default VRF) " -ip netns exec vm-1 ping -c 1 -W 1 10.0.0.2 &> /dev/null || (echo "[FAIL]"; false) +ip netns exec $vm_1 ping -c 1 -W 1 10.0.0.2 &> /dev/null || (echo "[FAIL]"; false) echo "[ OK ]" # Move the underlay to a non-default VRF -ip -netns hv-1 link set veth0 vrf vrf-underlay -ip -netns hv-1 link set vxlan0 down -ip -netns hv-1 link set vxlan0 up -ip -netns hv-2 link set veth0 vrf vrf-underlay -ip -netns hv-2 link set vxlan0 down -ip -netns hv-2 link set vxlan0 up +ip -netns $hv_1 link set veth0 vrf vrf-underlay +ip -netns $hv_1 link set vxlan0 down +ip -netns
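The ids the script passes around (1, 2) are kept by mapping them to the unique namespace names through a bash array, so call sites only change from "hv-$id" to "${hv[$id]}". A standalone sketch of that mapping (placeholder names; no namespaces are created here):

```shell
# Placeholder unique names standing in for what setup_ns would assign.
hv[1]=hv-aB3xYz
hv[2]=hv-Qw9Lmn

setup_hv_networking() {
	local id=$1
	# Stands in for the real "ip -netns ${hv[$id]} ..." commands.
	echo "would run: ip -netns ${hv[$id]} link set veth0 up"
}

setup_hv_networking 1
setup_hv_networking 2
```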
[PATCH net-next 4/9] selftests/net: convert test_vxlan_nolocalbypass.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_nolocalbypass.sh TEST: localbypass enabled [ OK ] TEST: Packet received by local VXLAN device - localbypass [ OK ] TEST: localbypass disabled [ OK ] TEST: Packet not received by local VXLAN device - nolocalbypass [ OK ] TEST: localbypass enabled [ OK ] TEST: Packet received by local VXLAN device - localbypass [ OK ] Tests passed: 6 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_vxlan_nolocalbypass.sh | 48 +-- 1 file changed, 23 insertions(+), 25 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh b/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh index f75212bf142c..b8805983b728 100755 --- a/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh +++ b/tools/testing/selftests/net/test_vxlan_nolocalbypass.sh @@ -9,9 +9,8 @@ # option and verifies that packets are no longer received by the second VXLAN # device. +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. 
-ksft_skip=4 TESTS=" nolocalbypass @@ -98,20 +97,19 @@ tc_check_packets() setup() { - ip netns add ns1 + setup_ns ns1 - ip -n ns1 link set dev lo up - ip -n ns1 address add 192.0.2.1/32 dev lo - ip -n ns1 address add 198.51.100.1/32 dev lo + ip -n $ns1 address add 192.0.2.1/32 dev lo + ip -n $ns1 address add 198.51.100.1/32 dev lo - ip -n ns1 link add name vx0 up type vxlan id 100 local 198.51.100.1 \ + ip -n $ns1 link add name vx0 up type vxlan id 100 local 198.51.100.1 \ dstport 4789 nolearning - ip -n ns1 link add name vx1 up type vxlan id 100 dstport 4790 + ip -n $ns1 link add name vx1 up type vxlan id 100 dstport 4790 } cleanup() { - ip netns del ns1 &> /dev/null + cleanup_ns $ns1 } @@ -122,40 +120,40 @@ nolocalbypass() local smac=00:01:02:03:04:05 local dmac=00:0a:0b:0c:0d:0e - run_cmd "bridge -n ns1 fdb add $dmac dev vx0 self static dst 192.0.2.1 port 4790" + run_cmd "bridge -n $ns1 fdb add $dmac dev vx0 self static dst 192.0.2.1 port 4790" - run_cmd "tc -n ns1 qdisc add dev vx1 clsact" - run_cmd "tc -n ns1 filter add dev vx1 ingress pref 1 handle 101 proto all flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $ns1 qdisc add dev vx1 clsact" + run_cmd "tc -n $ns1 filter add dev vx1 ingress pref 1 handle 101 proto all flower src_mac $smac dst_mac $dmac action pass" - run_cmd "tc -n ns1 qdisc add dev lo clsact" - run_cmd "tc -n ns1 filter add dev lo ingress pref 1 handle 101 proto ip flower ip_proto udp dst_port 4790 action drop" + run_cmd "tc -n $ns1 qdisc add dev lo clsact" + run_cmd "tc -n $ns1 filter add dev lo ingress pref 1 handle 101 proto ip flower ip_proto udp dst_port 4790 action drop" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" log_test $? 
0 "localbypass enabled" - run_cmd "ip netns exec ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" + run_cmd "ip netns exec $ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" - tc_check_packets "ns1" "dev vx1 ingress" 101 1 + tc_check_packets "$ns1" "dev vx1 ingress" 101 1 log_test $? 0 "Packet received by local VXLAN device - localbypass" - run_cmd "ip -n ns1 link set dev vx0 type vxlan nolocalbypass" + run_cmd "ip -n $ns1 link set dev vx0 type vxlan nolocalbypass" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == false'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == false'" log_test $? 0 "localbypass disabled" - run_cmd "ip netns exec ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" + run_cmd "ip netns exec $ns1 mausezahn vx0 -a $smac -b $dmac -c 1 -p 100 -q" - tc_check_packets "ns1" "dev vx1 ingress" 101 1 + tc_check_packets "$ns1" "dev vx1 ingress" 101 1 log_test $? 0 "Packet not received by local VXLAN device - nolocalbypass" - run_cmd "ip -n ns1 link set dev vx0 type vxlan localbypass" + run_cmd "ip -n $ns1 link set dev vx0 type vxlan localbypass" - run_cmd "ip -n ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'" + run_cmd "ip -n $ns1 -d -j link show dev vx0 | jq -e '.[][\"linkinfo\"][\"info_data\"][\"localbypass\"] == true'"
[PATCH net-next 3/9] selftests/net: convert test_vxlan_mdb.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_vxlan_mdb.sh Control path: Basic (*, G) operations - IPv4 overlay / IPv4 underlay TEST: MDB entry addition[ OK ] ... Data path: MDB torture test - IPv6 overlay / IPv6 underlay -- TEST: Torture test [ OK ] Tests passed: 620 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- tools/testing/selftests/net/test_vxlan_mdb.sh | 202 +- 1 file changed, 99 insertions(+), 103 deletions(-) diff --git a/tools/testing/selftests/net/test_vxlan_mdb.sh b/tools/testing/selftests/net/test_vxlan_mdb.sh index 6e996f8063cd..6725fd9157b9 100755 --- a/tools/testing/selftests/net/test_vxlan_mdb.sh +++ b/tools/testing/selftests/net/test_vxlan_mdb.sh @@ -55,9 +55,8 @@ # | ns2_v4 | | ns2_v6 | # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 CONTROL_PATH_TESTS=" basic_star_g_ipv4_ipv4 @@ -260,9 +259,6 @@ setup_common() local local_addr1=$1; shift local local_addr2=$1; shift - ip netns add $ns1 - ip netns add $ns2 - ip link add name veth0 type veth peer name veth1 ip link set dev veth0 netns $ns1 name veth0 ip link set dev veth1 netns $ns2 name veth0 @@ -273,36 +269,36 @@ setup_common() setup_v4() { - setup_common ns1_v4 ns2_v4 192.0.2.1 192.0.2.2 + setup_ns ns1_v4 ns2_v4 + setup_common $ns1_v4 $ns2_v4 192.0.2.1 192.0.2.2 - ip -n ns1_v4 address add 192.0.2.17/28 dev veth0 - ip -n ns2_v4 address add 192.0.2.18/28 dev veth0 + ip -n $ns1_v4 address add 192.0.2.17/28 dev veth0 + ip -n $ns2_v4 address add 192.0.2.18/28 dev veth0 - ip -n ns1_v4 route add default via 192.0.2.18 - ip -n ns2_v4 route add default via 192.0.2.17 + ip -n $ns1_v4 route add default via 192.0.2.18 + ip -n $ns2_v4 route add default via 192.0.2.17 } cleanup_v4() { - ip netns del ns2_v4 - ip netns del ns1_v4 + cleanup_ns $ns2_v4 $ns1_v4 } setup_v6() { - setup_common ns1_v6 ns2_v6 2001:db8:1::1 2001:db8:1::2 + setup_ns ns1_v6 ns2_v6 + setup_common $ns1_v6 $ns2_v6 2001:db8:1::1 2001:db8:1::2 - ip -n ns1_v6 address add 
2001:db8:2::1/64 dev veth0 nodad - ip -n ns2_v6 address add 2001:db8:2::2/64 dev veth0 nodad + ip -n $ns1_v6 address add 2001:db8:2::1/64 dev veth0 nodad + ip -n $ns2_v6 address add 2001:db8:2::2/64 dev veth0 nodad - ip -n ns1_v6 route add default via 2001:db8:2::2 - ip -n ns2_v6 route add default via 2001:db8:2::1 + ip -n $ns1_v6 route add default via 2001:db8:2::2 + ip -n $ns2_v6 route add default via 2001:db8:2::1 } cleanup_v6() { - ip netns del ns2_v6 - ip netns del ns1_v6 + cleanup_ns $ns2_v6 $ns1_v6 } setup() @@ -433,7 +429,7 @@ basic_common() basic_star_g_ipv4_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp 239.1.1.1" local vtep_ip=198.51.100.100 @@ -446,7 +442,7 @@ basic_star_g_ipv4_ipv4() basic_star_g_ipv6_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp ff0e::1" local vtep_ip=198.51.100.100 @@ -459,7 +455,7 @@ basic_star_g_ipv6_ipv4() basic_star_g_ipv4_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp 239.1.1.1" local vtep_ip=2001:db8:1000::1 @@ -472,7 +468,7 @@ basic_star_g_ipv4_ipv6() basic_star_g_ipv6_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp ff0e::1" local vtep_ip=2001:db8:1000::1 @@ -485,7 +481,7 @@ basic_star_g_ipv6_ipv6() basic_sg_ipv4_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp 239.1.1.1 src 192.0.2.129" local vtep_ip=198.51.100.100 @@ -498,7 +494,7 @@ basic_sg_ipv4_ipv4() basic_sg_ipv6_ipv4() { - local ns1=ns1_v4 + local ns1=$ns1_v4 local grp_key="grp ff0e::1 src 2001:db8:100::1" local vtep_ip=198.51.100.100 @@ -511,7 +507,7 @@ basic_sg_ipv6_ipv4() basic_sg_ipv4_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp 239.1.1.1 src 192.0.2.129" local vtep_ip=2001:db8:1000::1 @@ -524,7 +520,7 @@ basic_sg_ipv4_ipv6() basic_sg_ipv6_ipv6() { - local ns1=ns1_v6 + local ns1=$ns1_v6 local grp_key="grp ff0e::1 src 2001:db8:100::1" local vtep_ip=2001:db8:1000::1 @@ -694,7 +690,7 @@ star_g_common() star_g_ipv4_ipv4() { - local ns1=ns1_v4 + local 
ns1=$ns1_v4 local grp=239.1.1.1 local
[PATCH net-next 2/9] selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace
Here is the test result after conversion. ]# ./test_bridge_neigh_suppress.sh Per-port ARP suppression - VLAN 10 -- TEST: arping[ OK ] TEST: ARP suppression [ OK ] ... TEST: NS suppression (VLAN 20) [ OK ] Tests passed: 148 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../net/test_bridge_neigh_suppress.sh | 331 +- 1 file changed, 162 insertions(+), 169 deletions(-) diff --git a/tools/testing/selftests/net/test_bridge_neigh_suppress.sh b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh index d80f2cd87614..8533393a4f18 100755 --- a/tools/testing/selftests/net/test_bridge_neigh_suppress.sh +++ b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh @@ -45,9 +45,8 @@ # | sw1| | sw2| # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # All tests in this script. Can be overridden with -t option. TESTS=" @@ -140,9 +139,6 @@ setup_topo_ns() { local ns=$1; shift - ip netns add $ns - ip -n $ns link set dev lo up - ip netns exec $ns sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1 ip netns exec $ns sysctl -qw net.ipv6.conf.default.ignore_routes_with_linkdown=1 ip netns exec $ns sysctl -qw net.ipv6.conf.all.accept_dad=0 @@ -153,21 +149,22 @@ setup_topo() { local ns - for ns in h1 h2 sw1 sw2; do + setup_ns h1 h2 sw1 sw2 + for ns in $h1 $h2 $sw1 $sw2; do setup_topo_ns $ns done ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns h1 name eth0 - ip link set dev veth1 netns sw1 name swp1 + ip link set dev veth0 netns $h1 name eth0 + ip link set dev veth1 netns $sw1 name swp1 ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns sw1 name veth0 - ip link set dev veth1 netns sw2 name veth0 + ip link set dev veth0 netns $sw1 name veth0 + ip link set dev veth1 netns $sw2 name veth0 ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns h2 name eth0 - ip link set dev veth1 netns sw2 name swp1 + ip link set dev veth0 netns $h2 name eth0 
+ ip link set dev veth1 netns $sw2 name swp1 } setup_host_common() @@ -190,7 +187,7 @@ setup_host_common() setup_h1() { - local ns=h1 + local ns=$h1 local v4addr1=192.0.2.1/28 local v4addr2=192.0.2.17/28 local v6addr1=2001:db8:1::1/64 @@ -201,7 +198,7 @@ setup_h1() setup_h2() { - local ns=h2 + local ns=$h2 local v4addr1=192.0.2.2/28 local v4addr2=192.0.2.18/28 local v6addr1=2001:db8:1::2/64 @@ -254,7 +251,7 @@ setup_sw_common() setup_sw1() { - local ns=sw1 + local ns=$sw1 local local_addr=192.0.2.33 local remote_addr=192.0.2.34 local veth_addr=192.0.2.49 @@ -265,7 +262,7 @@ setup_sw1() setup_sw2() { - local ns=sw2 + local ns=$sw2 local local_addr=192.0.2.34 local remote_addr=192.0.2.33 local veth_addr=192.0.2.50 @@ -291,11 +288,7 @@ setup() cleanup() { - local ns - - for ns in h1 h2 sw1 sw2; do - ip netns del $ns &> /dev/null - done + cleanup_ns $h1 $h2 $sw1 $sw2 } @@ -312,80 +305,80 @@ neigh_suppress_arp_common() echo "Per-port ARP suppression - VLAN $vid" echo "--" - run_cmd "tc -n sw1 qdisc replace dev vx0 clsact" - run_cmd "tc -n sw1 filter replace dev vx0 egress pref 1 handle 101 proto 0x0806 flower indev swp1 arp_tip $tip arp_sip $sip arp_op request action pass" + run_cmd "tc -n $sw1 qdisc replace dev vx0 clsact" + run_cmd "tc -n $sw1 filter replace dev vx0 egress pref 1 handle 101 proto 0x0806 flower indev swp1 arp_tip $tip arp_sip $sip arp_op request action pass" # Initial state - check that ARP requests are not suppressed and that # ARP replies are received. - run_cmd "ip netns exec h1 arping -q -b -c 1 -w 5 -s $sip -I eth0.$vid $tip" + run_cmd "ip netns exec $h1 arping -q -b -c 1 -w 5 -s $sip -I eth0.$vid $tip" log_test $? 0 "arping" - tc_check_packets sw1 "dev vx0 egress" 101 1 + tc_check_packets $sw1 "dev vx0 egress" 101 1 log_test $? 0 "ARP suppression" # Enable neighbor suppression and check that nothing changes compared # to the initial state. - run_cmd "bridge -n sw1 link set dev vx0 neigh_suppress on" - run_cmd "bridge -n sw1 -d
[PATCH net-next 1/9] selftests/net: convert test_bridge_backup_port.sh to run it in unique namespace
There is no h1 h2 actually. Remove it. Here is the test result after conversion. ]# ./test_bridge_backup_port.sh Backup port --- TEST: Forwarding out of swp1[ OK ] TEST: No forwarding out of vx0 [ OK ] TEST: swp1 carrier off [ OK ] TEST: No forwarding out of swp1 [ OK ] ... Backup nexthop ID - ping TEST: Ping with backup nexthop ID [ OK ] TEST: Ping after disabling backup nexthop ID[ OK ] Backup nexthop ID - torture test TEST: Torture test [ OK ] Tests passed: 83 Tests failed: 0 Acked-by: David Ahern Signed-off-by: Hangbin Liu --- .../selftests/net/test_bridge_backup_port.sh | 371 +- 1 file changed, 182 insertions(+), 189 deletions(-) diff --git a/tools/testing/selftests/net/test_bridge_backup_port.sh b/tools/testing/selftests/net/test_bridge_backup_port.sh index 112cfd8a10ad..70a7d87ba2d2 100755 --- a/tools/testing/selftests/net/test_bridge_backup_port.sh +++ b/tools/testing/selftests/net/test_bridge_backup_port.sh @@ -35,9 +35,8 @@ # | sw1| | sw2| # ++ ++ +source lib.sh ret=0 -# Kselftest framework requirement - SKIP code is 4. -ksft_skip=4 # All tests in this script. Can be overridden with -t option. 
TESTS=" @@ -132,9 +131,6 @@ setup_topo_ns() { local ns=$1; shift - ip netns add $ns - ip -n $ns link set dev lo up - ip netns exec $ns sysctl -qw net.ipv6.conf.all.keep_addr_on_down=1 ip netns exec $ns sysctl -qw net.ipv6.conf.default.ignore_routes_with_linkdown=1 ip netns exec $ns sysctl -qw net.ipv6.conf.all.accept_dad=0 @@ -145,13 +141,14 @@ setup_topo() { local ns - for ns in sw1 sw2; do + setup_ns sw1 sw2 + for ns in $sw1 $sw2; do setup_topo_ns $ns done ip link add name veth0 type veth peer name veth1 - ip link set dev veth0 netns sw1 name veth0 - ip link set dev veth1 netns sw2 name veth0 + ip link set dev veth0 netns $sw1 name veth0 + ip link set dev veth1 netns $sw2 name veth0 } setup_sw_common() @@ -190,7 +187,7 @@ setup_sw_common() setup_sw1() { - local ns=sw1 + local ns=$sw1 local local_addr=192.0.2.33 local remote_addr=192.0.2.34 local veth_addr=192.0.2.49 @@ -203,7 +200,7 @@ setup_sw1() setup_sw2() { - local ns=sw2 + local ns=$sw2 local local_addr=192.0.2.34 local remote_addr=192.0.2.33 local veth_addr=192.0.2.50 @@ -229,11 +226,7 @@ setup() cleanup() { - local ns - - for ns in h1 h2 sw1 sw2; do - ip netns del $ns &> /dev/null - done + cleanup_ns $sw1 $sw2 } @@ -248,85 +241,85 @@ backup_port() echo "Backup port" echo "---" - run_cmd "tc -n sw1 qdisc replace dev swp1 clsact" - run_cmd "tc -n sw1 filter replace dev swp1 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $sw1 qdisc replace dev swp1 clsact" + run_cmd "tc -n $sw1 filter replace dev swp1 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" - run_cmd "tc -n sw1 qdisc replace dev vx0 clsact" - run_cmd "tc -n sw1 filter replace dev vx0 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" + run_cmd "tc -n $sw1 qdisc replace dev vx0 clsact" + run_cmd "tc -n $sw1 filter replace dev vx0 egress pref 1 handle 101 proto ip flower src_mac $smac dst_mac $dmac action pass" - run_cmd "bridge -n sw1 
fdb replace $dmac dev swp1 master static vlan 10" + run_cmd "bridge -n $sw1 fdb replace $dmac dev swp1 master static vlan 10" # Initial state - check that packets are forwarded out of swp1 when it # has a carrier and not forwarded out of any port when it does not have # a carrier. - run_cmd "ip netns exec sw1 mausezahn br0.10 -a $smac -b $dmac -A 198.51.100.1 -B 198.51.100.2 -t ip -p 100 -q -c 1" - tc_check_packets sw1 "dev swp1 egress" 101 1 + run_cmd "ip netns exec $sw1 mausezahn br0.10 -a $smac -b $dmac -A 198.51.100.1 -B 198.51.100.2 -t ip -p 100 -q -c 1" + tc_check_packets $sw1 "dev swp1 egress" 101 1 log_test $? 0 "Forwarding out of swp1" - tc_check_packets sw1 "dev vx0 egress" 101 0 + tc_check_packets $sw1 "dev vx0 egress" 101 0 log_test $? 0 "No forwarding out of vx0" - run_cmd "ip -n sw1 link set dev swp1 carrier off" +
[PATCH net-next 0/9] Convert net selftests to run in unique namespace (Part 2)
Here is the 2nd part of converting net selftests to run in unique namespace. This part converts all bridge, vxlan, vrf tests. Here is the part 1 link: https://lore.kernel.org/netdev/20231202020110.362433-1-liuhang...@gmail.com Hangbin Liu (9): selftests/net: convert test_bridge_backup_port.sh to run it in unique namespace selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace selftests/net: convert test_vxlan_mdb.sh to run it in unique namespace selftests/net: convert test_vxlan_nolocalbypass.sh to run it in unique namespace selftests/net: convert test_vxlan_under_vrf.sh to run it in unique namespace selftests/net: convert test_vxlan_vnifiltering.sh to run it in unique namespace selftests/net: convert vrf_route_leaking.sh to run it in unique namespace selftests/net: convert vrf_strict_mode_test.sh to run it in unique namespace selftests/net: convert vrf-xfrm-tests.sh to run it in unique namespace .../selftests/net/test_bridge_backup_port.sh | 371 +- .../net/test_bridge_neigh_suppress.sh | 331 tools/testing/selftests/net/test_vxlan_mdb.sh | 202 +- .../selftests/net/test_vxlan_nolocalbypass.sh | 48 ++- .../selftests/net/test_vxlan_under_vrf.sh | 70 ++-- .../selftests/net/test_vxlan_vnifiltering.sh | 154 +--- tools/testing/selftests/net/vrf-xfrm-tests.sh | 77 ++-- .../selftests/net/vrf_route_leaking.sh| 201 +- .../selftests/net/vrf_strict_mode_test.sh | 47 ++- 9 files changed, 751 insertions(+), 750 deletions(-) -- 2.43.0
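The common thread in all nine conversions is lib.sh's setup_ns/cleanup_ns pair, which replaces hard-coded names like "ns1" with unique, per-run names exported through shell variables. A minimal model of the naming scheme (simplified sketch; the real helper also runs "ip netns add" and brings up loopback, which needs root):

```shell
# Simplified model of lib.sh's setup_ns: each requested name gets a
# unique random suffix and is exported via a variable of that name.
setup_ns() {
	local ns_name
	for ns_name in "$@"; do
		# e.g. h1 -> h1-Xq3bZk; callers then use "$h1" everywhere
		eval "${ns_name}=${ns_name}-$(mktemp -u XXXXXX)"
	done
}

setup_ns h1 sw1
echo "$h1 $sw1"
```

Because every run gets fresh names, two instances of the same selftest no longer collide on namespace names, which is the point of the series.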
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/6 13:59, Yosry Ahmed wrote: > [..] >>> @@ -526,6 +582,102 @@ static struct zswap_entry >>> *zswap_entry_find_get(struct rb_root *root, >>> return entry; >>> } >>> >>> +/* >>> +* shrinker functions >>> +**/ >>> +static enum lru_status shrink_memcg_cb(struct list_head *item, struct >>> list_lru_one *l, >>> +spinlock_t *lock, void *arg); >>> + >>> +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, >>> NODE_DATA(sc->nid)); >>> + unsigned long shrink_ret, nr_protected, lru_size; >>> + struct zswap_pool *pool = shrinker->private_data; >>> + bool encountered_page_in_swapcache = false; >>> + >>> + nr_protected = >>> + >>> atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); >>> + lru_size = list_lru_shrink_count(&pool->list_lru, sc); >>> + >>> + /* >>> + * Abort if the shrinker is disabled or if we are shrinking into the >>> + * protected region. >>> + * >>> + * This short-circuiting is necessary because if we have too many >>> multiple >>> + * concurrent reclaimers getting the freeable zswap object counts at >>> the >>> + * same time (before any of them made reasonable progress), the total >>> + * number of reclaimed objects might be more than the number of >>> unprotected >>> + * objects (i.e the reclaimers will reclaim into the protected area >>> of the >>> + * zswap LRU). >>> + */ >>> + if (!zswap_shrinker_enabled || nr_protected >= lru_size - >>> sc->nr_to_scan) { >>> + sc->nr_scanned = 0; >>> + return SHRINK_STOP; >>> + } >>> + >>> + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, >>> &shrink_memcg_cb, >>> + &encountered_page_in_swapcache); >>> + >>> + if (encountered_page_in_swapcache) >>> + return SHRINK_STOP; >>> + >>> + return shrink_ret ? 
shrink_ret : SHRINK_STOP; >>> +} >>> + >>> +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, >>> + struct shrink_control *sc) >>> +{ >>> + struct zswap_pool *pool = shrinker->private_data; >>> + struct mem_cgroup *memcg = sc->memcg; >>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); >>> + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; >>> + >>> +#ifdef CONFIG_MEMCG_KMEM >>> + cgroup_rstat_flush(memcg->css.cgroup); >>> + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; >>> + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); >>> +#else >>> + /* use pool stats instead of memcg stats */ >>> + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; >>> + nr_stored = atomic_read(&pool->nr_stored); >>> +#endif >>> + >>> + if (!zswap_shrinker_enabled || !nr_stored) >> When I tested with this series, with !zswap_shrinker_enabled in the default >> case, >> I found the performance is much worse than that without this patch. >> >> Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs >> directory. >> >> The reason seems the above cgroup_rstat_flush(), caused much rstat lock >> contention >> to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check >> above >> the cgroup_rstat_flush(), the performance become much better. >> >> Maybe we can put the "zswap_shrinker_enabled" check above >> cgroup_rstat_flush()? > Yes, we should do nothing if !zswap_shrinker_enabled. We should also > use mem_cgroup_flush_stats() here like other places unless accuracy is > crucial, which I doubt given that reclaim uses > mem_cgroup_flush_stats(). > Yes. After changing to use mem_cgroup_flush_stats() here, the performance became much better. > mem_cgroup_flush_stats() has some thresholding to make sure we don't > do flushes unnecessarily, and I have a pending series in mm-unstable > that makes that thresholding per-memcg. 
Keep in mind that adding a > call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable, My test branch is linux-next 20231205, and it's all good after changing to use mem_cgroup_flush_stats(memcg). > because the series there adds a memcg argument to > mem_cgroup_flush_stats(). That should be easily amenable though, I can > post a fixlet for my series to add the memcg argument there on top of > users if needed. > It's great. Thanks!
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
[..] > > @@ -526,6 +582,102 @@ static struct zswap_entry > > *zswap_entry_find_get(struct rb_root *root, > > return entry; > > } > > > > +/* > > +* shrinker functions > > +**/ > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct > > list_lru_one *l, > > +spinlock_t *lock, void *arg); > > + > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > > + struct shrink_control *sc) > > +{ > > + struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, > > NODE_DATA(sc->nid)); > > + unsigned long shrink_ret, nr_protected, lru_size; > > + struct zswap_pool *pool = shrinker->private_data; > > + bool encountered_page_in_swapcache = false; > > + > > + nr_protected = > > + > > atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected); > > + lru_size = list_lru_shrink_count(&pool->list_lru, sc); > > + > > + /* > > + * Abort if the shrinker is disabled or if we are shrinking into the > > + * protected region. > > + * > > + * This short-circuiting is necessary because if we have too many > > multiple > > + * concurrent reclaimers getting the freeable zswap object counts at > > the > > + * same time (before any of them made reasonable progress), the total > > + * number of reclaimed objects might be more than the number of > > unprotected > > + * objects (i.e the reclaimers will reclaim into the protected area > > of the > > + * zswap LRU). > > + */ > > + if (!zswap_shrinker_enabled || nr_protected >= lru_size - > > sc->nr_to_scan) { > > + sc->nr_scanned = 0; > > + return SHRINK_STOP; > > + } > > + > > + shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, > > &shrink_memcg_cb, > > + &encountered_page_in_swapcache); > > + > > + if (encountered_page_in_swapcache) > > + return SHRINK_STOP; > > + > > + return shrink_ret ? 
shrink_ret : SHRINK_STOP; > > +} > > + > > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker, > > + struct shrink_control *sc) > > +{ > > + struct zswap_pool *pool = shrinker->private_data; > > + struct mem_cgroup *memcg = sc->memcg; > > + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid)); > > + unsigned long nr_backing, nr_stored, nr_freeable, nr_protected; > > + > > +#ifdef CONFIG_MEMCG_KMEM > > + cgroup_rstat_flush(memcg->css.cgroup); > > + nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT; > > + nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED); > > +#else > > + /* use pool stats instead of memcg stats */ > > + nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT; > > + nr_stored = atomic_read(&pool->nr_stored); > > +#endif > > + > > + if (!zswap_shrinker_enabled || !nr_stored) > When I tested with this series, with !zswap_shrinker_enabled in the default > case, > I found the performance is much worse than that without this patch. > > Testcase: memory.max=2G, zswap enabled, kernel build -j32 in a tmpfs > directory. > > The reason seems the above cgroup_rstat_flush(), caused much rstat lock > contention > to the zswap_store() path. And if I put the "zswap_shrinker_enabled" check > above > the cgroup_rstat_flush(), the performance become much better. > > Maybe we can put the "zswap_shrinker_enabled" check above > cgroup_rstat_flush()? Yes, we should do nothing if !zswap_shrinker_enabled. We should also use mem_cgroup_flush_stats() here like other places unless accuracy is crucial, which I doubt given that reclaim uses mem_cgroup_flush_stats(). mem_cgroup_flush_stats() has some thresholding to make sure we don't do flushes unnecessarily, and I have a pending series in mm-unstable that makes that thresholding per-memcg. Keep in mind that adding a call to mem_cgroup_flush_stats() will cause a conflict in mm-unstable, because the series there adds a memcg argument to mem_cgroup_flush_stats(). 
That should be easy to amend, though; I can post a fixlet for my series to
add the memcg argument there on top of yours if needed.

> Thanks!

> > +		return 0;
> > +
> > +	nr_protected =
> > +		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> > +	nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > +	/*
> > +	 * Subtract the lru size by an estimate of the number of pages
> > +	 * that should be protected.
> > +	 */
> > +	nr_freeable = nr_freeable > nr_protected ? nr_freeable - nr_protected : 0;
> > +
> > +	/*
> > +	 * Scale the number of freeable pages by the memory saving factor.
> > +	 * This ensures that the better zswap compresses memory, the fewer
> > +	 * pages we will evict to swap (as it will otherwise incur IO for
> > +	 * relatively small memory saving).
> > +	 */
> > +	return mult_frac(nr_freeable,
Re: [PATCH v8 6/6] zswap: shrinks zswap pool based on memory pressure
On 2023/12/1 03:40, Nhat Pham wrote:
> Currently, we only shrink the zswap pool when the user-defined limit is
> hit. This means that if we set the limit too high, cold data that are
> unlikely to be used again will reside in the pool, wasting precious
> memory. It is hard to predict how much zswap space will be needed ahead
> of time, as this depends on the workload (specifically, on factors such
> as memory access patterns and compressibility of the memory pages).
>
> This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> is initiated when there is memory pressure. The shrinker does not
> have any parameter that must be tuned by the user, and can be opted in
> or out on a per-memcg basis.
>
> Furthermore, to make it more robust for many workloads and prevent
> overshrinking (i.e evicting warm pages that might be refaulted into
> memory), we build in the following heuristics:
>
> * Estimate the number of warm pages residing in zswap, and attempt to
>   protect this region of the zswap LRU.
> * Scale the number of freeable objects by an estimate of the memory
>   saving factor. The better zswap compresses the data, the fewer pages
>   we will evict to swap (as we will otherwise incur IO for relatively
>   small memory saving).
> * During reclaim, if the shrinker encounters a page that is also being
>   brought into memory, the shrinker will cautiously terminate its
>   shrinking action, as this is a sign that it is touching the warmer
>   region of the zswap LRU.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improve the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Signed-off-by: Nhat Pham
> Acked-by: Johannes Weiner
> ---
>  Documentation/admin-guide/mm/zswap.rst |  10 ++
>  include/linux/mmzone.h                 |   2 +
>  include/linux/zswap.h                  |  25 +++-
>  mm/Kconfig                             |  14 ++
>  mm/mmzone.c                            |   1 +
>  mm/swap_state.c                        |   2 +
>  mm/zswap.c                             | 185 -
>  7 files changed, 233 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 45b98390e938..62fc244ec702 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,16 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +When there is a sizable amount of cold memory residing in the zswap pool, it
> +can be advantageous to proactively write these cold pages to swap and reclaim
> +the memory for other use cases. By default, the zswap shrinker is disabled.
> +Users can enable it as follows:
> +
> +  echo Y > /sys/module/zswap/parameters/shrinker_enabled
> +
> +This can be enabled at boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
> +selected.
> +
>  A debugfs interface is provided for various statistics about pool size, number
>  of pages stored, same-value filled pages and various counters for the reasons
>  pages are rejected.
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7b1816450bfc..b23bc5390240 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -22,6 +22,7 @@
>  #include
>  #include
>  #include
> +#include <linux/zswap.h>
>  #include
>
>  /* Free memory management - zoned buddy allocator.
>   */
> @@ -641,6 +642,7 @@ struct lruvec {
>  #ifdef CONFIG_MEMCG
>  	struct pglist_data *pgdat;
>  #endif
> +	struct zswap_lruvec_state zswap_lruvec_state;
>  };
>
>  /* Isolate for asynchronous migration */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index e571e393669b..08c240e16a01 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -5,20 +5,40 @@
>  #include
>  #include
>
> +struct lruvec;
> +
>  extern u64 zswap_pool_total_size;
>  extern atomic_t zswap_stored_pages;
>
>  #ifdef CONFIG_ZSWAP
>
> +struct zswap_lruvec_state {
> +	/*
> +	 * Number of pages in zswap that should be protected from the shrinker.
> +	 * This number is an estimate of the following counts:
> +	 *
> +	 * a) Recent page faults.
> +	 * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> +	 *    as well as recent zswap LRU rotations.
> +	 *
> +	 * These pages are likely to be warm, and might incur IO if they are
> +	 * written to swap.
> +	 */
> +	atomic_long_t nr_zswap_protected;
> +};
> +
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
>  void
Re: [PATCH v8 0/6] workload-specific and memory pressure-driven zswap writeback
On Thu, Nov 30, 2023 at 11:40:17AM -0800, Nhat Pham wrote:
> Changelog:
> v8:
>    * Fixed a couple of build errors in the case of !CONFIG_MEMCG
>    * Simplified the online memcg selection scheme for the zswap global
>      limit reclaim (suggested by Michal Hocko and Johannes Weiner)
>      (patch 2 and patch 3)
>    * Added a new kconfig to allow users to enable the zswap shrinker by
>      default. (suggested by Johannes Weiner) (patch 6)
> v7:
>    * Added the mem_cgroup_iter_online() function to the API for the new
>      behavior (suggested by Andrew Morton) (patch 2)
>    * Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
> v6:
>    * Rebase on top of latest mm-unstable.
>    * Fix/improve the in-code documentation of the new list_lru
>      manipulation functions (patch 1)
> v5:
>    * Replace reference getting with an rcu_read_lock() section for
>      zswap lru modifications (suggested by Yosry)
>    * Add a new prep patch that allows mem_cgroup_iter() to return
>      online cgroup.
>    * Add a callback that updates pool->next_shrink when the cgroup is
>      offlined (suggested by Yosry Ahmed, Johannes Weiner)
> v4:
>    * Rename list_lru_add to list_lru_add_obj and __list_lru_add to
>      list_lru_add (patch 1) (suggested by Johannes Weiner and
>      Yosry Ahmed)
>    * Some cleanups on the memcg aware LRU patch (patch 2)
>      (suggested by Yosry Ahmed)
>    * Use event interface for the new per-cgroup writeback counters.
>      (patch 3) (suggested by Yosry Ahmed)
>    * Abstract zswap's lruvec states and handling into
>      zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
> v3:
>    * Add a patch to export per-cgroup zswap writeback counters
>    * Add a patch to update zswap's kselftest
>    * Separate the new list_lru functions into its own prep patch
>    * Do not start from the top of the hierarchy when encountering a memcg
>      that is not online for the global limit zswap writeback (patch 2)
>      (suggested by Yosry Ahmed)
>    * Do not remove the swap entry from list_lru in
>      __read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
>    * Removed a redundant zswap pool getting (patch 2)
>      (reported by Ryan Roberts)
>    * Use atomic for the nr_zswap_protected (instead of lruvec's lock)
>      (patch 5) (suggested by Yosry Ahmed)
>    * Remove the per-cgroup zswap shrinker knob (patch 5)
>      (suggested by Yosry Ahmed)
> v2:
>    * Fix loongarch compiler errors
>    * Use pool stats instead of memcg stats when !CONFIG_MEMCG_KMEM
>
> There are currently several issues with zswap writeback:
>
> 1. There is only a single global LRU for zswap, making it impossible to
>    perform workload-specific shrinking - a memcg under memory pressure
>    cannot determine which pages in the pool it owns, and often ends up
>    writing pages from other memcgs. This issue has been previously
>    observed in practice and mitigated by simply disabling
>    memcg-initiated shrinking:
>
>    https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
>    But this solution leaves a lot to be desired, as we still do not
>    have an avenue for a memcg to free up its own memory locked up in
>    the zswap pool.
>
> 2. We only shrink the zswap pool when the user-defined limit is hit.
>    This means that if we set the limit too high, cold data that are
>    unlikely to be used again will reside in the pool, wasting precious
>    memory.
>    It is hard to predict how much zswap space will be needed
>    ahead of time, as this depends on the workload (specifically, on
>    factors such as memory access patterns and compressibility of the
>    memory pages).
>
> This patch series solves these issues by separating the global zswap
> LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
> (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
> new shrinker does not have any parameter that must be tuned by the
> user, and can be opted in or out on a per-memcg basis.
>
> As a proof of concept, we ran the following synthetic benchmark:
> build the linux kernel in a memory-limited cgroup, and allocate some
> cold data in tmpfs to see if the shrinker could write them out and
> improve the overall performance. Depending on the amount of cold data
> generated, we observe from 14% to 35% reduction in kernel CPU time used
> in the kernel builds.
>
> Domenico Cerasuolo (3):
>   zswap: make shrinking memcg-aware
>   mm: memcg: add per-memcg zswap writeback stat
>   selftests: cgroup: update per-memcg zswap writeback selftest
>
> Nhat Pham (3):
>   list_lru: allows explicit memcg and NUMA node selection
>   memcontrol: implement mem_cgroup_tryget_online()
>   zswap: shrinks zswap pool based on memory pressure
>
>  Documentation/admin-guide/mm/zswap.rst |  10 +
>  drivers/android/binder_alloc.c         |   7 +-
>  fs/dcache.c                            |   8 +-
>  fs/gfs2/quota.c                        |   6
[PATCH v1] selftests/sgx: Skip non X86_64 platform
From: Zhao Mengmeng

When building the whole selftests on arm64, rsync gives an error about sgx:

  rsync: [sender] link_stat "/root/linux-next/tools/testing/selftests/sgx/test_encl.elf" failed: No such file or directory (2)
  rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1327) [sender=3.2.5]

The root cause is that sgx is only used on X86_64 and shall be skipped on
other platforms. Fix this by moving TEST_CUSTOM_PROGS and TEST_FILES inside
the if check, so that the build result will be "Skipping non-existent dir:
sgx".

Fixes: 2adcba79e69d ("selftests/x86: Add a selftest for SGX")
Signed-off-by: Zhao Mengmeng
---
 tools/testing/selftests/sgx/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile
index 50aab6b57da3..01abe4969b0f 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -16,10 +16,10 @@ HOST_CFLAGS := -Wall -Werror -g $(INCLUDES) -fPIC -z noexecstack
 ENCL_CFLAGS := -Wall -Werror -static -nostdlib -nostartfiles -fPIC \
 	       -fno-stack-protector -mrdrnd $(INCLUDES)

+ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
 TEST_FILES := $(OUTPUT)/test_encl.elf

-ifeq ($(CAN_BUILD_X86_64), 1)
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
--
2.38.1
[PATCH v8 3/6] zswap: make shrinking memcg-aware (fix 2)
Drop the pool's reference at the end of the writeback step. Apply on top of the first fixlet: https://lore.kernel.org/linux-mm/20231130203522.gc543...@cmpxchg.org/T/#m6ba8efd2205486b1b333a29f5a890563b45c7a7e Signed-off-by: Nhat Pham --- mm/zswap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/zswap.c b/mm/zswap.c index 7a84c1454988..56d4a8cc461d 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -859,6 +859,7 @@ static void shrink_worker(struct work_struct *w) resched: cond_resched(); } while (!zswap_can_accept()); + zswap_pool_put(pool); } static struct zswap_pool *zswap_pool_create(char *type, char *compressor) -- 2.34.1
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
>
> From: Domenico Cerasuolo
>
> Currently, we only have a single global LRU for zswap. This makes it
> impossible to perform workload-specific shrinking - a memcg cannot
> determine which pages in the pool it owns, and often ends up writing
> pages from other memcgs. This issue has been previously observed in
> practice and mitigated by simply disabling memcg-initiated shrinking:
>
> https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
>
> This patch fully resolves the issue by replacing the global zswap LRU
> with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
>
> a) When a store attempt hits a memcg limit, it now triggers a
>    synchronous reclaim attempt that, if successful, allows the new
>    hotter page to be accepted by zswap.
> b) If the store attempt instead hits the global zswap limit, it will
>    trigger an asynchronous reclaim attempt, in which a memcg is
>    selected for reclaim in a round-robin-like fashion.
>
> Signed-off-by: Domenico Cerasuolo
> Co-developed-by: Nhat Pham
> Signed-off-by: Nhat Pham
> ---
>  include/linux/memcontrol.h |   5 +
>  include/linux/zswap.h      |   2 +
>  mm/memcontrol.c            |   2 +
>  mm/swap.h                  |   3 +-
>  mm/swap_state.c            |  24 +++-
>  mm/zswap.c                 | 269 +
>  6 files changed, 245 insertions(+), 60 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2bd7d14ace78..a308c8eacf20 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
>  	return NULL;
>  }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> +{
> +	return NULL;
> +}
> +
>  static inline bool folio_memcg_kmem(struct folio *folio)
>  {
>  	return false;
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 2a60ce39cfde..e571e393669b 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
>  void zswap_invalidate(int type, pgoff_t offset);
>  void zswap_swapon(int type);
>  void zswap_swapoff(int type);
> +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>
>  #else
>
> @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
>  static inline void zswap_invalidate(int type, pgoff_t offset) {}
>  static inline void zswap_swapon(int type) {}
>  static inline void zswap_swapoff(int type) {}
> +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>
>  #endif
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 470821d1ba1a..792ca21c5815 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	page_counter_set_min(&memcg->memory, 0);
>  	page_counter_set_low(&memcg->memory, 0);
>
> +	zswap_memcg_offline_cleanup(memcg);
> +
>  	memcg_offline_kmem(memcg);
>  	reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
> diff
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) > { > struct swap_info_struct *si; > struct folio *folio; > @@ -470,6 +471,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, > gfp_t gfp_mask, > if (err != -EEXIST) > goto fail_put_swap; > > + /* > +* Protect against a recursive call to > __read_swap_cache_async() > +
Re: [PATCHv3 net-next 01/14] selftests/net: add lib.sh
On Tue, Dec 05, 2023 at 01:00:29PM +0100, Paolo Abeni wrote:
> > +cleanup_ns()
> > +{
> > +	local ns=""
> > +	local errexit=0
> > +	local ret=0
> > +
> > +	# disable errexit temporarily
> > +	if [[ $- =~ "e" ]]; then
> > +		errexit=1
> > +		set +e
> > +	fi
> > +
> > +	for ns in "$@"; do
> > +		ip netns delete "${ns}" &> /dev/null
> > +		if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
> > +			echo "Warn: Failed to remove namespace $ns"
> > +			ret=1
> > +		fi
> > +	done
> > +
> > +	[ $errexit -eq 1 ] && set -e
> > +	return $ret
> > +}
> > +
> > +# setup netns with given names as prefix. e.g
> > +# setup_ns local remote
> > +setup_ns()
> > +{
> > +	local ns=""
> > +	local ns_name=""
> > +	local ns_list=""
> > +	for ns_name in "$@"; do
> > +		# Some tests may setup/remove the same netns multiple times
> > +		if unset ${ns_name} 2> /dev/null; then
> > +			ns="${ns_name,,}-$(mktemp -u XX)"
> > +			eval readonly ${ns_name}="$ns"
> > +		else
> > +			eval ns='$'${ns_name}
> > +			cleanup_ns "$ns"
> > +		fi
> > +
> > +		if ! ip netns add "$ns"; then
> > +			echo "Failed to create namespace $ns_name"
> > +			cleanup_ns "$ns_list"
> > +			return $ksft_skip
> > +		fi
> > +		ip -n "$ns" link set lo up
> > +		ns_list="$ns_list $ns"
>
> Side note for a possible follow-up: if you maintain $ns_list as a global
> variable, and remove from that list the namespaces deleted by cleanup_ns,
> you could remove the cleanup trap from the individual tests with something
> like:
>
> final_cleanup_ns()
> {
> 	cleanup_ns $ns_list
> }
>
> trap final_cleanup_ns EXIT
>
> No respin needed for the above, could be a follow-up if agreed upon.

Hi Paolo,

I did something similar in the first version. But Petr said[1] we should let
the client do the cleanup explicitly. I agree that the client scripts should
keep this in mind. On the other hand, maybe we can add this final cleanup
and let the client call it directly. What do you think?

[1] https://lore.kernel.org/netdev/878r6nf9x5@nvidia.com/

Thanks
Hangbin
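Paolo's suggested trap-based catch-all can be prototyped without root or
netns at all. The sketch below uses plain temp directories as stand-ins for
namespaces (the helper names mirror the patch but the bodies are otherwise
hypothetical): every setup appends to a global list, explicit cleanups remove
entries from it, and a single EXIT trap tears down whatever remains.

```shell
#!/bin/bash
# Sketch of the global-list + EXIT-trap cleanup pattern, with temp
# directories standing in for network namespaces (ip netns needs root).
ns_list=""
work=$(mktemp -d)

setup_ns() {                 # create a "namespace" and track it globally
	local ns="$work/$1"
	mkdir "$ns"
	ns_list="$ns_list $ns"
}

cleanup_ns() {               # remove the given "namespaces" from the list
	local ns
	for ns in "$@"; do
		rmdir "$ns"
		ns_list="${ns_list// $ns/}"
	done
}

final_cleanup_ns() {         # catch-all, as suggested in the thread
	[ -n "$ns_list" ] && cleanup_ns $ns_list
	rmdir "$work"
}
trap final_cleanup_ns EXIT

setup_ns local1
setup_ns remote1
cleanup_ns "$work/local1"    # a test may still clean up one ns explicitly
echo "remaining:$ns_list"    # remote1 is left for the EXIT trap
```

Because cleanup_ns removes what it deletes from `$ns_list`, the EXIT trap
only ever touches leftovers, so explicit cleanup and the catch-all can
coexist without double-delete warnings.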
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 4:10 PM Chris Li wrote:
>
> Hi Nhat,
>
> Still working my way through your patch series.
>
> On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
> >
> > From: Domenico Cerasuolo
> >
> > Currently, we only have a single global LRU for zswap. This makes it
> > impossible to perform workload-specific shrinking - a memcg cannot
> > determine which pages in the pool it owns, and often ends up writing
> > pages from other memcgs. This issue has been previously observed in
> > practice and mitigated by simply disabling memcg-initiated shrinking:
> >
> > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
> >
> > This patch fully resolves the issue by replacing the global zswap LRU
> > with memcg- and NUMA-specific LRUs, and modifying the reclaim logic:
> >
> > a) When a store attempt hits a memcg limit, it now triggers a
> >    synchronous reclaim attempt that, if successful, allows the new
> >    hotter page to be accepted by zswap.
> > b) If the store attempt instead hits the global zswap limit, it will
> >    trigger an asynchronous reclaim attempt, in which a memcg is
> >    selected for reclaim in a round-robin-like fashion.
> >
> > Signed-off-by: Domenico Cerasuolo
> > Co-developed-by: Nhat Pham
> > Signed-off-by: Nhat Pham
> > ---
> >  include/linux/memcontrol.h |   5 +
> >  include/linux/zswap.h      |   2 +
> >  mm/memcontrol.c            |   2 +
> >  mm/swap.h                  |   3 +-
> >  mm/swap_state.c            |  24 +++-
> >  mm/zswap.c                 | 269 +
> >  6 files changed, 245 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2bd7d14ace78..a308c8eacf20 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> >  	return NULL;
> >  }
> >
> > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline bool folio_memcg_kmem(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 2a60ce39cfde..e571e393669b 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(int type, pgoff_t offset);
> >  void zswap_swapon(int type);
> >  void zswap_swapoff(int type);
> > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> >
> >  #else
> >
> > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
> >  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> >  static inline void zswap_swapon(int type) {}
> >  static inline void zswap_swapoff(int type) {}
> > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> >
> >  #endif
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 470821d1ba1a..792ca21c5815 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >  	page_counter_set_min(&memcg->memory, 0);
> >  	page_counter_set_low(&memcg->memory, 0);
> >
> > +
zswap_memcg_offline_cleanup(memcg); > > + > > memcg_offline_kmem(memcg); > > reparent_shrinker_deferred(memcg); > > wb_memcg_offline(memcg); > > diff --git a/mm/swap.h b/mm/swap.h > > index 73c332ee4d91..c0dc73e10e91 100644 > > --- a/mm/swap.h > > +++ b/mm/swap.h > > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, > > gfp_t gfp_mask, > >struct swap_iocb **plug); > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated); > > +bool *new_page_allocated, > > +bool skip_if_exists); > > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > > struct mempolicy *mpol, pgoff_t ilx); > > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 85d9e5806a6a..6c84236382f3 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > > address_space *mapping, > > > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated) > > +bool *new_page_allocated, > > +bool skip_if_exists) > > I think this skip_if_exists is
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Peter, On 12/5/2023 4:33 PM, Peter Newman wrote: > On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre > wrote: >> On 12/1/2023 12:56 PM, Peter Newman wrote: >>> On Tue, May 16, 2023 at 5:06 PM Reinette Chatre I think it may be optimistic to view this as a replacement of a PQR write. As you point out, that requires that a CPU switches between tasks with the same CLOSID. You demonstrate that resctrl already contributes a significant delay to __switch_to - this work will increase that much more, it has to be clear about this impact and motivate that it is acceptable. >>> >>> We were operating under the assumption that if the overhead wasn't >>> acceptable, we would have heard complaints about it by now, but we >>> ultimately learned that this feature wasn't deployed as much as we had >>> originally thought on AMD hardware and that the overhead does need to >>> be addressed. >>> >>> I am interested in your opinion on two options I'm exploring to >>> mitigate the overhead, both of which depend on an API like the one >>> Babu recently proposed for the AMD ABMC feature [1], where a new file >>> interface will allow the user to indicate which mon_groups are >>> actively being measured. I will refer to this as "assigned" for now, >>> as that's the current proposal. >>> >>> The first is likely the simpler approach: only read MBM event counters >>> which have been marked as "assigned" in the filesystem to avoid paying >>> the context switch cost on tasks in groups which are not actively >>> being measured. In our use case, we calculate memory bandwidth on >>> every group every few minutes by reading the counters twice, 5 seconds >>> apart. We would just need counters read during this 5-second window. >> >> I assume that tasks within a monitoring group can be scheduled on any >> CPU and from the cover letter of this work I understand that only an >> RMID assigned to a processor can be guaranteed to be tracked by hardware. 
>>
>> Are you proposing for this option that you keep this "soft RMID" approach
>> with CPUs permanently assigned a "hard RMID" but only update the counts
>> for a "soft RMID" that is "assigned"?
>
> Yes
>
>> I think that means that the context
>> switch cost for the monitored group would increase even more than with the
>> implementation in this series since the counters need to be read on context
>> switch in as well as context switch out.
>>
>> If I understand correctly then only one monitoring group can be measured
>> at a time. If such a measurement takes 5 seconds then theoretically 12 groups
>> can be measured in one minute. It may be possible to create many more
>> monitoring groups than this. Would it be possible to reach monitoring
>> goals in your environment?
>
> We actually measure all of the groups at the same time, so thinking
> about this more, the proposed ABMC fix isn't actually a great fit: the
> user would have to assign all groups individually when a global
> setting would have been fine.
>
> Ignoring any present-day resctrl interfaces, what we minimally need is...
>
> 1. global "start measurement", which enables a
> read-counters-on-context switch flag, and broadcasts an IPI to all
> CPUs to read their current count
> 2. wait 5 seconds
> 3. global "end measurement", to IPI all CPUs again for final counts
> and clear the flag from step 1
>
> Then the user could read at their leisure all the (frozen) event
> counts from memory until the next measurement begins.
>
> In our case, if we're measuring as often as 5 seconds for every
> minute, that will already be a 12x aggregate reduction in overhead,
> which would be worthwhile enough.

The "con" here would be that during those 5 seconds (which I assume would be
controlled via user space so potentially shorter or longer) all tasks in the
system are expected to see a significant (but yet to be measured) impact on
context switch delay.
I expect the overflow handler should only be run during the measurement timeframe, to not defeat the "at their leisure" reading of counters. >>> The second involves avoiding the situation where a hardware counter >>> could be deallocated: Determine the number of simultaneous RMIDs >>> supported, reduce the effective number of RMIDs available to that >>> number. Use the default RMID (0) for all "unassigned" monitoring >> >> hmmm ... so on the one side there is "only the RMID within the PQR >> register can be guaranteed to be tracked by hardware" and on the >> other side there is "A given implementation may have insufficient >> hardware to simultaneously track the bandwidth for all RMID values >> that the hardware supports." >> >> From the above there seems to be something in the middle where >> some subset of the RMID values supported by hardware can be used >> to simultaneously track bandwidth? How can it be determined >> what this number of RMID values is? > > In the context of AMD, we could use the smallest number of CPUs in any > L3 domain as a lower bound of the
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Tue, Dec 5, 2023 at 4:16 PM Chris Li wrote:
>
> On Mon, Dec 4, 2023 at 5:39 PM Nhat Pham wrote:
> > >
> > > memcg as a candidate for the global limit reclaim.
> > >
> > > Very minor nitpick. This patch can fold with the later patch that uses
> > > it. That makes the review easier, no need to cross reference different
> > > patches. It will also make it harder to introduce API that nobody
> > > uses.
> >
> > I don't have a strong preference one way or the other :) Probably not
> > worth the churn tho.
>
> Squashing a patch is very easy. If you are refreshing a new series, it
> is worthwhile to do it. I notice on the other thread Yosry pointed out
> you did not use the function "mem_cgroup_tryget_online" in patch 3,
> that is exactly the situation my suggestion is trying to prevent.

I doubt squashing it would solve the issue - in fact, I think Yosry noticed
it precisely because he had to stare at a separate patch detailing the
addition of the new function in the first place :P

In general though, I'm hesitant to extend this API silently in a patch that
uses it. Is it not better to have a separate patch announcing this API
extension? list_lru_add() was originally part of this series too - we
separated it out into its own patch because it was getting confusing.

Another benefit is that there will be less work in the future if we want to
revert the per-cgroup zswap LRU patch: there's already another
mem_cgroup_tryget_online() user, so we can keep this patch.

But yeah, we'll see - I'll think about it if I actually have to send v9. If
not, let's not add unnecessary churn.

> If you don't have a strong preference, it sounds like you should squash it.
> > Chris
> >
> > >
> > > Chris
> > >
> > > >
> > > > Signed-off-by: Nhat Pham
> > > > ---
> > > >  include/linux/memcontrol.h | 10 ++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 7bdcf3020d7a..2bd7d14ace78 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
> > > >  	return !memcg || css_tryget(&memcg->css);
> > > >  }
> > > >
> > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > > +{
> > > > +	return !memcg || css_tryget_online(&memcg->css);
> > > > +}
> > > > +
> > > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > >  {
> > > >  	if (memcg)
> > > >  		css_put(&memcg->css);
> > > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
> > > >  	return true;
> > > >  }
> > > >
> > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg)
> > > > +{
> > > > +	return true;
> > > > +}
> > > > +
> > > >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> > > >  {
> > > >  }
> > > > --
> > > > 2.34.1
> > > >
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Reinette, On Tue, Dec 5, 2023 at 1:57 PM Reinette Chatre wrote: > On 12/1/2023 12:56 PM, Peter Newman wrote: > > On Tue, May 16, 2023 at 5:06 PM Reinette Chatre > >> I think it may be optimistic to view this as a replacement of a PQR write. > >> As you point out, that requires that a CPU switches between tasks with the > >> same CLOSID. You demonstrate that resctrl already contributes a significant > >> delay to __switch_to - this work will increase that much more, it has to > >> be clear about this impact and motivate that it is acceptable. > > > > We were operating under the assumption that if the overhead wasn't > > acceptable, we would have heard complaints about it by now, but we > > ultimately learned that this feature wasn't deployed as much as we had > > originally thought on AMD hardware and that the overhead does need to > > be addressed. > > > > I am interested in your opinion on two options I'm exploring to > > mitigate the overhead, both of which depend on an API like the one > > Babu recently proposed for the AMD ABMC feature [1], where a new file > > interface will allow the user to indicate which mon_groups are > > actively being measured. I will refer to this as "assigned" for now, > > as that's the current proposal. > > > > The first is likely the simpler approach: only read MBM event counters > > which have been marked as "assigned" in the filesystem to avoid paying > > the context switch cost on tasks in groups which are not actively > > being measured. In our use case, we calculate memory bandwidth on > > every group every few minutes by reading the counters twice, 5 seconds > > apart. We would just need counters read during this 5-second window. > > I assume that tasks within a monitoring group can be scheduled on any > CPU and from the cover letter of this work I understand that only an > RMID assigned to a processor can be guaranteed to be tracked by hardware. 
> > Are you proposing for this option that you keep this "soft RMID" approach > with CPUs permanently assigned a "hard RMID" but only update the counts for a > "soft RMID" that is "assigned"?

Yes

> I think that means that the context > switch cost for the monitored group would increase even more than with the > implementation in this series since the counters need to be read on context > switch in as well as context switch out. > > If I understand correctly then only one monitoring group can be measured > at a time. If such a measurement takes 5 seconds then theoretically 12 groups > can be measured in one minute. It may be possible to create many more > monitoring groups than this. Would it be possible to reach monitoring > goals in your environment?

We actually measure all of the groups at the same time, so thinking about this more, the proposed ABMC fix isn't actually a great fit: the user would have to assign all groups individually when a global setting would have been fine. Ignoring any present-day resctrl interfaces, what we minimally need is...

1. global "start measurement", which enables a read-counters-on-context switch flag, and broadcasts an IPI to all CPUs to read their current count
2. wait 5 seconds
3. global "end measurement", to IPI all CPUs again for final counts and clear the flag from step 1

Then the user could read at their leisure all the (frozen) event counts from memory until the next measurement begins. In our case, if we're measuring as often as 5 seconds for every minute, that will already be a 12x aggregate reduction in overhead, which would be worthwhile enough.

> > > > > The second involves avoiding the situation where a hardware counter > > could be deallocated: Determine the number of simultaneous RMIDs > > supported, reduce the effective number of RMIDs available to that > > number. Use the default RMID (0) for all "unassigned" monitoring > hmmm ...
so on the one side there is "only the RMID within the PQR > register can be guaranteed to be tracked by hardware" and on the > other side there is "A given implementation may have insufficient > hardware to simultaneously track the bandwidth for all RMID values > that the hardware supports." > > From the above there seems to be something in the middle where > some subset of the RMID values supported by hardware can be used > to simultaneously track bandwidth? How can it be determined > what this number of RMID values is? In the context of AMD, we could use the smallest number of CPUs in any L3 domain as a lower bound of the number of counters. If the number is actually higher, it's not too difficult to probe at runtime. The technique used by the test script[1] reliably identifies the number of counters, but some experimentation would be needed to see how quickly the hardware will repurpose a counter, as the script today is using way too long of a workload for the kernel to be invoking. Maybe a reasonable compromise would be to initialize the HW counter estimate at the CPUs-per-domain value and add a file node to let the user
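The start/end measurement flow described in this thread is easy to model outside the kernel. Below is a minimal Python sketch of the proposed semantics only; all names are hypothetical, and the per-CPU "hardware counters" are simulated. It shows the two properties the proposal relies on: the read-on-context-switch flag is only set inside the window, and the deltas stay frozen after "end measurement" so userspace can read them at leisure.

```python
# Hypothetical model of the proposed global start/end measurement window.
# start() models the first IPI broadcast (snapshot every CPU's count and
# set the read-counters-on-context-switch flag); end() models the second
# IPI (final read, freeze the deltas, clear the flag).

class MeasurementWindow:
    def __init__(self, ncpus):
        self.ncpus = ncpus
        self.reading_on_switch = False      # flag consulted in __switch_to
        self.start_counts = [0] * ncpus
        self.frozen_deltas = [0] * ncpus

    def start(self, read_counter):
        # "start measurement": IPI all CPUs to read their current count.
        self.start_counts = [read_counter(cpu) for cpu in range(self.ncpus)]
        self.reading_on_switch = True

    def end(self, read_counter):
        # "end measurement": final counts; deltas stay frozen until the
        # next start(), so userspace can read them whenever it likes.
        self.frozen_deltas = [read_counter(cpu) - self.start_counts[cpu]
                              for cpu in range(self.ncpus)]
        self.reading_on_switch = False


# Toy hardware counters for two CPUs.
counts = [100, 250]
win = MeasurementWindow(ncpus=2)
win.start(lambda cpu: counts[cpu])
counts = [160, 400]            # traffic during the 5-second window
win.end(lambda cpu: counts[cpu])
print(win.frozen_deltas)       # -> [60, 150]
```

With a 5-second window out of every 60, the flag is set less than a tenth of the time, which is where the quoted "12x aggregate reduction" in context-switch overhead comes from.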
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Mon, Dec 4, 2023 at 5:39 PM Nhat Pham wrote: > > > > memcg as a candidate for the global limit reclaim. > > > > Very minor nitpick. This patch can fold with the later patch that uses > > it. That makes the review easier, no need to cross reference different > > patches. It will also make it harder to introduce API that nobody > > uses. > > I don't have a strong preference one way or the other :) Probably not > worth the churn tho. Squashing a patch is very easy. If you are refreshing a new series, it is worthwhile to do it. I notice on the other thread Yosry pointed out you did not use the function "mem_cgroup_tryget_online" in patch 3, that is exactly the situation my suggestion is trying to prevent. If you don't have a strong preference, it sounds like you should squash it. Chris > > > > > Chris > > > > > > > > Signed-off-by: Nhat Pham > > > --- > > > include/linux/memcontrol.h | 10 ++ > > > 1 file changed, 10 insertions(+) > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 7bdcf3020d7a..2bd7d14ace78 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct > > > mem_cgroup *memcg) > > > return !memcg || css_tryget(&memcg->css); > > > } > > > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > > +{ > > > + return !memcg || css_tryget_online(&memcg->css); > > > +} > > > + > > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > > { > > > if (memcg) > > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct > > > mem_cgroup *memcg) > > > return true; > > > } > > > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > > +{ > > > + return true; > > > +} > > > + > > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > > { > > > } > > > -- > > > 2.34.1 > > > >
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
Hi Nhat, Still working my way up your patch series. On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Currently, we only have a single global LRU for zswap. This makes it > impossible to perform workload-specific shrinking - a memcg cannot > determine which pages in the pool it owns, and often ends up writing > pages from other memcgs. This issue has been previously observed in > practice and mitigated by simply disabling memcg-initiated shrinking: > > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u > > This patch fully resolves the issue by replacing the global zswap LRU > with memcg- and NUMA-specific LRUs, and modifies the reclaim logic: > > a) When a store attempt hits a memcg limit, it now triggers a >synchronous reclaim attempt that, if successful, allows the new >hotter page to be accepted by zswap. > b) If the store attempt instead hits the global zswap limit, it will >trigger an asynchronous reclaim attempt, in which a memcg is >selected for reclaim in a round-robin-like fashion. 
> > Signed-off-by: Domenico Cerasuolo > Co-developed-by: Nhat Pham > Signed-off-by: Nhat Pham > --- > include/linux/memcontrol.h | 5 + > include/linux/zswap.h | 2 + > mm/memcontrol.c| 2 + > mm/swap.h | 3 +- > mm/swap_state.c| 24 +++- > mm/zswap.c | 269 + > 6 files changed, 245 insertions(+), 60 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 2bd7d14ace78..a308c8eacf20 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup > *page_memcg_check(struct page *page) > return NULL; > } > > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup > *objcg) > +{ > + return NULL; > +} > + > static inline bool folio_memcg_kmem(struct folio *folio) > { > return false; > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > index 2a60ce39cfde..e571e393669b 100644 > --- a/include/linux/zswap.h > +++ b/include/linux/zswap.h > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio); > void zswap_invalidate(int type, pgoff_t offset); > void zswap_swapon(int type); > void zswap_swapoff(int type); > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > > #else > > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio) > static inline void zswap_invalidate(int type, pgoff_t offset) {} > static inline void zswap_swapon(int type) {} > static inline void zswap_swapoff(int type) {} > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > > #endif > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 470821d1ba1a..792ca21c5815 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct > cgroup_subsys_state *css) > page_counter_set_min(&memcg->memory, 0); > page_counter_set_low(&memcg->memory, 0); > > + zswap_memcg_offline_cleanup(memcg); > + > memcg_offline_kmem(memcg); > reparent_shrinker_deferred(memcg); > wb_memcg_offline(memcg); > diff 
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) I think this skip_if_exists is problematic here, and you might need to redesign it. First of all, with skip_if_exists as the argument name, the meaning to the caller is not clear. When I saw this, I wondered: what does the function return when this condition is triggered? Unlike "*new_page_allocated", which is a
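One way to make the outcome the reviewer is asking about explicit is to return a status alongside the page, rather than folding it into a boolean out-parameter. The sketch below is an illustrative Python model of that idea only, not the actual __read_swap_cache_async() API; the names and the toy dict-backed cache are invented for the example.

```python
from enum import Enum, auto

class LookupStatus(Enum):
    ALLOCATED = auto()          # a new entry was allocated and inserted
    FOUND = auto()              # an existing cache entry was returned
    SKIPPED_EXISTING = auto()   # caller asked to skip already-present entries

def cache_lookup(cache, key, skip_if_exists):
    """Toy model of a read-or-allocate cache lookup.

    Returning an explicit status makes the skip_if_exists outcome
    visible to the caller instead of being implied by a bool flag.
    """
    if key in cache:
        if skip_if_exists:
            return None, LookupStatus.SKIPPED_EXISTING
        return cache[key], LookupStatus.FOUND
    cache[key] = f"page-for-{key}"
    return cache[key], LookupStatus.ALLOCATED

cache = {}
page, status = cache_lookup(cache, 42, skip_if_exists=True)
assert status is LookupStatus.ALLOCATED
page, status = cache_lookup(cache, 42, skip_if_exists=True)
assert status is LookupStatus.SKIPPED_EXISTING and page is None
```

In C the same effect could be had with an enum return plus a page out-parameter; the point is only that the caller can distinguish "skipped" from "not found" without guessing.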
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, 2023-12-05 at 16:43 +, Mark Brown wrote: > Right, it's a small and fairly easily auditable list - it's more > about > the app than the double enable which was what I thought your concern > was. It's a bit annoying definitely and not something we want to do > in > general but for something like this where we're adding specific > coverage > for API extensions for the feature it seems like a reasonable > tradeoff. > > If the x86 toolchain/libc support is widely enough deployed (or you > just > don't mind any missing coverage) we could use the toolchain support > there and only have the manual enable for arm64, it'd be inconsistent > but not wildly so. > > > I'm hoping there is not too much of a gap before the glibc support starts filtering out. Long term, ELF bit enabling is probably the right thing for the generic tests. Short term, manual enabling is OK with me if no one else minds. Maybe we could add my "don't do" list as a comment if we do manual enabling? I'll have to check your new series, but I also wonder if we could cram the manual enabling and status checking pieces into some headers and not have to have "if x86" "if arm" logic in the tests themselves.
Re: [PATCH RFT v4 2/5] fork: Add shadow stack support to clone3()
On Tue, 2023-12-05 at 15:51 +, Mark Brown wrote: > On Tue, Dec 05, 2023 at 12:26:57AM +, Edgecombe, Rick P wrote: > > On Tue, 2023-11-28 at 18:22 +, Mark Brown wrote: > > > > - size = adjust_shstk_size(stack_size); > > > + size = adjust_shstk_size(size); > > > addr = alloc_shstk(0, size, 0, false); > > > Hmm. I didn't test this, but in the copy_process(), copy_mm() > > happens > > before this point. So the shadow stack would get mapped in > > current's MM > > (i.e. the parent). So in the !CLONE_VM case with > > shadow_stack_size!=0 > > the SSP in the child will be updated to an area that is not mapped > > in > > the child. I think we need to pass tsk->mm into alloc_shstk(). But > > such > > an exotic clone usage does give me pause, regarding whether all of > > this > > is premature. > > Hrm, right. And we then can't use do_mmap() either. I'd be somewhat > tempted to disallow that specific case for now rather than deal with > it > though that's not really in the spirit of just always following what > the > user asked for. Oh, yea. What a pain. It doesn't seem like we could easily even add a do_mmap() variant that takes an mm either. I did a quick logging test on a Fedora userspace. systemd (I think) appears to do a clone(!CLONE_VM) with a stack passed. So maybe the combo might actually get used with a shadow_stack_size if it used clone3 some day. At the same time, fixing clone to mmap() in the child doesn't seem straightforward at all. Checking with some of our MM folks, the suggestion was to look at doing the child's shadow stack mapping in dup_mm() to avoid tripping over complications that happen when a remote MM becomes more "live". If we just punt on this combination for now, then the documented rules for args->shadow_stack_size would be something like: clone3 will use the parent's shadow stack when CLONE_VM is not present. If CLONE_VFORK is set then it will use the parent's shadow stack only when args->shadow_stack_size is non-zero. 
In the cases when the parent's shadow stack is not used, args->shadow_stack_size is used for the size whenever non-zero. I guess it doesn't seem overly complicated, but I don't think any of the options are great. I'd unhappily lean towards not supporting shadow_stack_size!=0 && !CLONE_VM for now. But it seems like there may be a user for the unsupported case, so this would just be improving things a little and kicking the can down the road. I also wonder if this is a sign to reconsider the earlier token-consuming design.
Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
Hi Peter, On 12/1/2023 12:56 PM, Peter Newman wrote: > Hi Reinette, > > On Tue, May 16, 2023 at 5:06 PM Reinette Chatre > wrote: >> On 5/15/2023 7:42 AM, Peter Newman wrote: >>> >>> I used a simple parent-child pipe loop benchmark with the parent in >>> one monitoring group and the child in another to trigger 2M >>> context-switches on the same CPU and compared the sample-based >>> profiles on an AMD and Intel implementation. I used perf diff to >>> compare the samples between hard and soft RMID switches. >>> >>> Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz: >>> >>> +44.80% [kernel.kallsyms] [k] __rmid_read >>> 10.43% -9.52% [kernel.kallsyms] [k] __switch_to >>> >>> AMD EPYC 7B12 64-Core Processor: >>> >>> +28.27% [kernel.kallsyms] [k] __rmid_read >>> 13.45%-13.44% [kernel.kallsyms] [k] __switch_to >>> >>> Note that a soft RMID switch that doesn't change CLOSID skips the >>> PQR_ASSOC write completely, so from this data I can roughly say that >>> __rmid_read() is a little over 2x the length of a PQR_ASSOC write that >>> changes the current RMID on the AMD implementation and about 4.5x >>> longer on Intel. >>> >>> Let me know if this clarifies the cost enough or if you'd like to also >>> see instrumented measurements on the individual WRMSR/RDMSR >>> instructions. >> >> I can see from the data the portion of total time spent in __rmid_read(). >> It is not clear to me what the impact on a context switch is. Is it >> possible to say with this data that: this solution makes a context switch >> x% slower? >> >> I think it may be optimistic to view this as a replacement of a PQR write. >> As you point out, that requires that a CPU switches between tasks with the >> same CLOSID. You demonstrate that resctrl already contributes a significant >> delay to __switch_to - this work will increase that much more, it has to >> be clear about this impact and motivate that it is acceptable. 
> > We were operating under the assumption that if the overhead wasn't > acceptable, we would have heard complaints about it by now, but we > ultimately learned that this feature wasn't deployed as much as we had > originally thought on AMD hardware and that the overhead does need to > be addressed. > > I am interested in your opinion on two options I'm exploring to > mitigate the overhead, both of which depend on an API like the one > Babu recently proposed for the AMD ABMC feature [1], where a new file > interface will allow the user to indicate which mon_groups are > actively being measured. I will refer to this as "assigned" for now, > as that's the current proposal. > > The first is likely the simpler approach: only read MBM event counters > which have been marked as "assigned" in the filesystem to avoid paying > the context switch cost on tasks in groups which are not actively > being measured. In our use case, we calculate memory bandwidth on > every group every few minutes by reading the counters twice, 5 seconds > apart. We would just need counters read during this 5-second window. I assume that tasks within a monitoring group can be scheduled on any CPU and from the cover letter of this work I understand that only an RMID assigned to a processor can be guaranteed to be tracked by hardware. Are you proposing for this option that you keep this "soft RMID" approach with CPUs permanently assigned a "hard RMID" but only update the counts for a "soft RMID" that is "assigned"? I think that means that the context switch cost for the monitored group would increase even more than with the implementation in this series since the counters need to be read on context switch in as well as context switch out. If I understand correctly then only one monitoring group can be measured at a time. If such a measurement takes 5 seconds then theoretically 12 groups can be measured in one minute. It may be possible to create many more monitoring groups than this. 
Would it be possible to reach monitoring goals in your environment? > > The second involves avoiding the situation where a hardware counter > could be deallocated: Determine the number of simultaneous RMIDs > supported, reduce the effective number of RMIDs available to that > number. Use the default RMID (0) for all "unassigned" monitoring hmmm ... so on the one side there is "only the RMID within the PQR register can be guaranteed to be tracked by hardware" and on the other side there is "A given implementation may have insufficient hardware to simultaneously track the bandwidth for all RMID values that the hardware supports." From the above there seems to be something in the middle where some subset of the RMID values supported by hardware can be used to simultaneously track bandwidth? How can it be determined what this number of RMID values is? > groups and report "Unavailable" on all counter reads (and address the > default monitoring group's counts being unreliable). When assigned, > attempt to allocate one of the
[PATCH] kunit: tool: fix parsing of test attributes
Add parsing of attributes as diagnostic data. Fixes issue with test plan being parsed incorrectly as diagnostic data when located after suite-level attributes. Note that if there does not exist a test plan line, the diagnostic lines between the suite header and the first result will be saved in the suite log rather than the first test case log. Signed-off-by: Rae Moar --- Note this patch is a resend but I removed the second patch in the series so now it is a standalone patch. tools/testing/kunit/kunit_parser.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/kunit/kunit_parser.py b/tools/testing/kunit/kunit_parser.py index 79d8832c862a..ce34be15c929 100644 --- a/tools/testing/kunit/kunit_parser.py +++ b/tools/testing/kunit/kunit_parser.py @@ -450,7 +450,7 @@ def parse_diagnostic(lines: LineStream) -> List[str]: Log of diagnostic lines """ log = [] # type: List[str] - non_diagnostic_lines = [TEST_RESULT, TEST_HEADER, KTAP_START, TAP_START] + non_diagnostic_lines = [TEST_RESULT, TEST_HEADER, KTAP_START, TAP_START, TEST_PLAN] while lines and not any(re.match(lines.peek()) for re in non_diagnostic_lines): log.append(lines.pop()) @@ -726,6 +726,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest: # test plan test.name = "main" ktap_line = parse_ktap_header(lines, test) + test.log.extend(parse_diagnostic(lines)) parse_test_plan(lines, test) parent_test = True else: @@ -737,6 +738,7 @@ def parse_test(lines: LineStream, expected_num: int, log: List[str], is_subtest: if parent_test: # If KTAP version line and/or subtest header is found, attempt # to parse test plan and print test header + test.log.extend(parse_diagnostic(lines)) parse_test_plan(lines, test) print_test_header(test) expected_count = test.expected_count base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86 -- 2.43.0.rc2.451.g8631bc7472-goog
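The failure mode this patch fixes can be reproduced with a toy version of the diagnostic scan (simplified regexes for illustration; the real patterns and LineStream live in kunit_parser.py): if the test-plan pattern is not in the stop list, a plan line sitting after suite-level attribute lines gets swallowed as diagnostic data instead of being left for parse_test_plan().

```python
import re

# Simplified stand-ins for the parser's line patterns.
TEST_RESULT = re.compile(r'^(ok|not ok) ')
TEST_HEADER = re.compile(r'^# Subtest:')
TEST_PLAN   = re.compile(r'^1\.\.[0-9]+')

def collect_diagnostic(lines, stop_patterns):
    """Pop lines into the diagnostic log until a structural line is seen."""
    log = []
    while lines and not any(p.match(lines[0]) for p in stop_patterns):
        log.append(lines.pop(0))
    return log

# Suite-level attribute line followed by the test plan.
stream = ['# module: example', '1..2', 'ok 1 case_one']

# Before the fix: the plan line is eaten as diagnostic data.
log = collect_diagnostic(list(stream), [TEST_RESULT, TEST_HEADER])
assert '1..2' in log

# After the fix (TEST_PLAN added to the stop list): scanning stops at
# the plan line, so it remains in the stream to be parsed as a plan.
log = collect_diagnostic(list(stream), [TEST_RESULT, TEST_HEADER, TEST_PLAN])
assert log == ['# module: example']
```

The real patch does exactly this, adding TEST_PLAN to non_diagnostic_lines and calling parse_diagnostic() before each parse_test_plan() call so the attribute lines end up in the right log.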
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)
On Tue, Dec 5, 2023 at 11:33 AM Nhat Pham wrote: > > Rename ZSWP_WB to ZSWPWB to better match the existing counters naming > scheme. > > Suggested-by: Johannes Weiner > Signed-off-by: Nhat Pham For the original patch + this fix: Reviewed-by: Yosry Ahmed > --- > include/linux/vm_event_item.h | 2 +- > mm/memcontrol.c | 2 +- > mm/vmstat.c | 2 +- > mm/zswap.c| 4 ++-- > 4 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index f4569ad98edf..747943bc8cc2 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_ZSWAP > ZSWPIN, > ZSWPOUT, > - ZSWP_WB, > + ZSWPWB, > #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 21d79249c8b4..0286b7d38832 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > ZSWPIN, > ZSWPOUT, > - ZSWP_WB, > + ZSWPWB, > #endif > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > THP_FAULT_ALLOC, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 2249f85e4a87..cfd8d8256f8e 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_ZSWAP > "zswpin", > "zswpout", > - "zswp_wb", > + "zswpwb", > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > diff --git a/mm/zswap.c b/mm/zswap.c > index c65b8ccc6b72..0fb0945c0031 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head > *item, struct list_lru_o > zswap_written_back_pages++; > > if (entry->objcg) > - count_objcg_event(entry->objcg, ZSWP_WB); > + count_objcg_event(entry->objcg, ZSWPWB); > > - count_vm_event(ZSWP_WB); > + count_vm_event(ZSWPWB); > /* > * Writeback started successfully, the page now 
belongs to the > * swapcache. Drop the entry from zswap - unless invalidate already > -- > 2.34.1
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
Stanislav Fomichev wrote: > On 12/05, Willem de Bruijn wrote: > > Stanislav Fomichev wrote: > > > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > > > wrote: > > > > > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > > > This patch enables Launch Time (Time-Based Scheduling) support > > > > > > > > to XDP zero > > > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > > > --- > > > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > > > > > As requested before, I think we need to see another driver > > > > > > > implementing > > > > > > > this. > > > > > > > > > > > > > > I propose driver igc and chip i225. > > > > > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 > > > > > > > second > > > > > > > into the future[1] is handled code wise. One suggestion is to add > > > > > > > a > > > > > > > section to Documentation/networking/xsk-tx-metadata.rst per > > > > > > > driver that > > > > > > > mentions/documents these different hardware limitations. It is > > > > > > > natural > > > > > > > that different types of hardware have limitations. This is a > > > > > > > close-to > > > > > > > hardware-level abstraction/API, and IMHO as long as we document > > > > > > > the > > > > > > > limitations we can expose this API without too many limitations > > > > > > > for more > > > > > > > capable hardware. > > > > > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > > > cannot be programmed. 
> > > > > > > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return > > > > > value. > > > > > So user won't know if their request is fail. > > > > > It is complex to inform user which request is failing. > > > > > Therefore, IMHO, it is good that we let driver handle the error > > > > > silently. > > > > > > > > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > > > never make it to the wire, right? > > > > Programmable behavior is to either drop or cap to some boundary > > value, such as the farthest programmable time in the future: the > > horizon. In fq: > > > > /* Check if packet timestamp is too far in the future. */ > > if (fq_packet_beyond_horizon(skb, q, now)) { > > if (q->horizon_drop) { > > q->stat_horizon_drops++; > > return qdisc_drop(skb, sch, > > to_free); > > } > > q->stat_horizon_caps++; > > skb->tstamp = now + q->horizon; > > } > > fq_skb_cb(skb)->time_to_send = skb->tstamp; > > > > Drop is the more obviously correct mode. > > > > Programming with a clock source that the driver does not support will > > then be a persistent failure. > > > > Preferably, this driver capability can be queried beforehand (rather > > than only through reading error counters afterwards). > > > > Perhaps it should not be a driver task to convert from possibly > > multiple clock sources to the device native clock. Right now, we do > > use per-device timecounters for this, implemented in the driver. > > > > As for which clocks are relevant. For PTP, I suppose the device PHC, > > converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. > > Do we need to expose some generic netdev netlink apis to query/adjust > nic clock sources (or maybe there is something existing already)? > Then the userspace can be responsible for syncing/converting the > timestamps to the internal nic clocks. +1 to trying to avoid doing > this in the drivers. Perhaps. 
I'm just a bit hesitant since that is UAPI and this is all quite hand-wavy still. Some of the conversion necessarily has to be in the driver. Only the driver knows the descriptor format, and limitations of that, such as the bit-width that can be encoded. If we cannot move anything out of the drivers (quite likely), then agreed that a netdev/ethtool netlink query approach is helpful. To be clear: I don't mean that that should be part of this series. This is not an XSK specific concern. > > > > That is clearly a situation that the user should be informed about. For > > > > RT systems this normally means that something is really wrong regarding > > > > timing / cycle overflow. Such systems have to react on that situation. > > > > > > In general, af_xdp is a bit lacking in this 'notify the user that they
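The fq logic quoted earlier in this thread reduces to a small per-packet decision, restated here in Python with the same semantics (toy nanosecond values; this mirrors the quoted C, it is not a new proposal):

```python
def apply_horizon(tstamp, now, horizon, horizon_drop):
    """Mirror of the quoted fq check: a launch time beyond now + horizon
    is either dropped or capped to the farthest programmable time."""
    if tstamp > now + horizon:              # fq_packet_beyond_horizon()
        if horizon_drop:
            return None                     # qdisc_drop(): never hits the wire
        return now + horizon                # cap: stat_horizon_caps++
    return tstamp                           # becomes time_to_send

NSEC_PER_SEC = 1_000_000_000
now, horizon = 1_000 * NSEC_PER_SEC, 10 * NSEC_PER_SEC

# Within the horizon: sent at the requested time.
assert apply_horizon(now + 5 * NSEC_PER_SEC, now, horizon, True) == now + 5 * NSEC_PER_SEC
# Beyond the horizon: dropped, or capped to now + horizon.
assert apply_horizon(now + 60 * NSEC_PER_SEC, now, horizon, True) is None
assert apply_horizon(now + 60 * NSEC_PER_SEC, now, horizon, False) == now + horizon
```

For a driver with a hardware LaunchTime limit (e.g. the "max 1 second into the future" mentioned above), the open question in the thread is which of these two behaviors the driver should pick, and how the user learns about it.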
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Tue, Dec 5, 2023 at 10:03 AM Yosry Ahmed wrote: > > On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > > > This patch implements a helper function that tries to get a reference to > > a memcg's css, as well as checking if it is online. This new function > > is almost exactly the same as the existing mem_cgroup_tryget(), except > > for the onlineness check. In the !CONFIG_MEMCG case, it always returns > > true, analogous to mem_cgroup_tryget(). This is useful for, e.g., the > > new zswap writeback scheme, where we need to select the next online > > memcg as a candidate for the global limit reclaim. > > > > Signed-off-by: Nhat Pham > > Reviewed-by: Yosry Ahmed Thanks for the review, Yosry :) Really appreciate the effort and your comments so far. > > > --- > > include/linux/memcontrol.h | 10 ++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 7bdcf3020d7a..2bd7d14ace78 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > > *memcg) > > return !memcg || css_tryget(&memcg->css); > > } > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > +{ > > + return !memcg || css_tryget_online(&memcg->css); > > +} > > + > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > { > > if (memcg) > > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct > > mem_cgroup *memcg) > > return true; > > } > > > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > > +{ > > + return true; > > +} > > + > > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > > { > > } > > -- > > 2.34.1
[PATCH v8 3/6] zswap: make shrinking memcg-aware (fix)
Use the correct function for the onlineness check for the memcg selection, and use mem_cgroup_iter_break() to break the iteration. Suggested-by: Yosry Ahmed Signed-off-by: Nhat Pham --- mm/zswap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index f323e45cbdc7..7a84c1454988 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -834,9 +834,9 @@ static void shrink_worker(struct work_struct *w) goto resched; } - if (!mem_cgroup_online(memcg)) { + if (!mem_cgroup_tryget_online(memcg)) { /* drop the reference from mem_cgroup_iter() */ - mem_cgroup_put(memcg); + mem_cgroup_iter_break(NULL, memcg); pool->next_shrink = NULL; spin_unlock(&zswap_pools_lock); @@ -985,7 +985,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool) list_lru_destroy(&pool->list_lru); spin_lock(&zswap_pools_lock); - mem_cgroup_put(pool->next_shrink); + mem_cgroup_iter_break(NULL, pool->next_shrink); pool->next_shrink = NULL; spin_unlock(&zswap_pools_lock); -- 2.34.1
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On 12/05, Willem de Bruijn wrote: > Stanislav Fomichev wrote: > > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > > wrote: > > > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > > This patch enables Launch Time (Time-Based Scheduling) support to > > > > > > > XDP zero > > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > > --- > > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > > > As requested before, I think we need to see another driver > > > > > > implementing > > > > > > this. > > > > > > > > > > > > I propose driver igc and chip i225. > > > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 > > > > > > second > > > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver > > > > > > that > > > > > > mentions/documents these different hardware limitations. It is > > > > > > natural > > > > > > that different types of hardware have limitations. This is a > > > > > > close-to > > > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > > > limitations we can expose this API without too many limitations for > > > > > > more > > > > > > capable hardware. > > > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > > cannot be programmed. > > > > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > > > So user won't know if their request is fail. 
> > > > It is complex to inform user which request is failing. > > > > Therefore, IMHO, it is good that we let driver handle the error > > > > silently. > > > > > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > > never make it to the wire, right? > > Programmable behavior is to either drop or cap to some boundary > value, such as the farthest programmable time in the future: the > horizon. In fq: > > /* Check if packet timestamp is too far in the future. */ > if (fq_packet_beyond_horizon(skb, q, now)) { > if (q->horizon_drop) { > q->stat_horizon_drops++; > return qdisc_drop(skb, sch, to_free); > } > q->stat_horizon_caps++; > skb->tstamp = now + q->horizon; > } > fq_skb_cb(skb)->time_to_send = skb->tstamp; > > Drop is the more obviously correct mode. > > Programming with a clock source that the driver does not support will > then be a persistent failure. > > Preferably, this driver capability can be queried beforehand (rather > than only through reading error counters afterwards). > > Perhaps it should not be a driver task to convert from possibly > multiple clock sources to the device native clock. Right now, we do > use per-device timecounters for this, implemented in the driver. > > As for which clocks are relevant. For PTP, I suppose the device PHC, > converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. Do we need to expose some generic netdev netlink apis to query/adjust nic clock sources (or maybe there is something existing already)? Then the userspace can be responsible for syncing/converting the timestamps to the internal nic clocks. +1 to trying to avoid doing this in the drivers. > > > That is clearly a situation that the user should be informed about. For > > > RT systems this normally means that something is really wrong regarding > > > timing / cycle overflow. Such systems have to react on that situation. 
> > > > In general, af_xdp is a bit lacking in this 'notify the user that they > > somehow messed up' area :-( > > For example, pushing a tx descriptor with a wrong addr/len in zc mode > > will not give any visible signal back (besides driver potentially > > spilling something into dmesg as it was in the mlx case). > > We can probably start with having some counters for these events? > > This is because the AF_XDP completion queue descriptor format is only > a u64 address? Yeah. XDP_COPY mode has the descriptor validation which is exported via recvmsg errno, but zerocopy path seems to be too deep in the stack to report something back. And there is no place, as you mention, in the completion ring to report the status. > Could error conditions be reported on tx completion in the metadata, > using xsk_tx_metadata_complete? That would be
[PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat (fix)
Rename ZSWP_WB to ZSWPWB to better match the existing counters naming
scheme.

Suggested-by: Johannes Weiner
Signed-off-by: Nhat Pham
---
 include/linux/vm_event_item.h | 2 +-
 mm/memcontrol.c               | 2 +-
 mm/vmstat.c                   | 2 +-
 mm/zswap.c                    | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f4569ad98edf..747943bc8cc2 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -142,7 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_ZSWAP
 	ZSWPIN,
 	ZSWPOUT,
-	ZSWP_WB,
+	ZSWPWB,
 #endif
 #ifdef CONFIG_X86
 	DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21d79249c8b4..0286b7d38832 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -703,7 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	ZSWPIN,
 	ZSWPOUT,
-	ZSWP_WB,
+	ZSWPWB,
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2249f85e4a87..cfd8d8256f8e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,7 +1401,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_ZSWAP
 	"zswpin",
 	"zswpout",
-	"zswp_wb",
+	"zswpwb",
 #endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index c65b8ccc6b72..0fb0945c0031 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -761,9 +761,9 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
 	zswap_written_back_pages++;
 
 	if (entry->objcg)
-		count_objcg_event(entry->objcg, ZSWP_WB);
+		count_objcg_event(entry->objcg, ZSWPWB);
 
-	count_vm_event(ZSWP_WB);
+	count_vm_event(ZSWPWB);
 	/*
 	 * Writeback started successfully, the page now belongs to the
 	 * swapcache. Drop the entry from zswap - unless invalidate already
--
2.34.1
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 11:00 AM Yosry Ahmed wrote: > > [..] > > > > static void shrink_worker(struct work_struct *w) > > > > { > > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > > shrink_work); > > > > + struct mem_cgroup *memcg; > > > > int ret, failures = 0; > > > > > > > > + /* global reclaim will select cgroup in a round-robin fashion. > > > > */ > > > > do { > > > > - ret = zswap_reclaim_entry(pool); > > > > - if (ret) { > > > > - zswap_reject_reclaim_fail++; > > > > - if (ret != -EAGAIN) > > > > + spin_lock(_pools_lock); > > > > + pool->next_shrink = mem_cgroup_iter(NULL, > > > > pool->next_shrink, NULL); > > > > + memcg = pool->next_shrink; > > > > + > > > > + /* > > > > +* We need to retry if we have gone through a full > > > > round trip, or if we > > > > +* got an offline memcg (or else we risk undoing the > > > > effect of the > > > > +* zswap memcg offlining cleanup callback). This is not > > > > catastrophic > > > > +* per se, but it will keep the now offlined memcg > > > > hostage for a while. > > > > +* > > > > +* Note that if we got an online memcg, we will keep > > > > the extra > > > > +* reference in case the original reference obtained by > > > > mem_cgroup_iter > > > > +* is dropped by the zswap memcg offlining callback, > > > > ensuring that the > > > > +* memcg is not killed when we are reclaiming. > > > > +*/ > > > > + if (!memcg) { > > > > + spin_unlock(_pools_lock); > > > > + if (++failures == MAX_RECLAIM_RETRIES) > > > > break; > > > > + > > > > + goto resched; > > > > + } > > > > + > > > > + if (!mem_cgroup_online(memcg)) { > > > > + /* drop the reference from mem_cgroup_iter() */ > > > > + mem_cgroup_put(memcg); > > > > > > Probably better to use mem_cgroup_iter_break() here? > > > > mem_cgroup_iter_break(NULL, memcg) seems to perform the same thing, right? > > Yes, but it's better to break the iteration with the documented API > (e.g. if mem_cgroup_iter_break() changes to do extra work). 
Hmm, a mostly aesthetic fix to me, but I don't have a strong opinion otherwise. > > > > > > > > > Also, I don't see mem_cgroup_tryget_online() being used here (where I > > > expected it to be used), did I miss it? > > > > Oh shoot yeah that was a typo - it should be > > mem_cgroup_tryget_online(). Let me send a fix to that. > > > > > > > > > + pool->next_shrink = NULL; > > > > + spin_unlock(_pools_lock); > > > > + > > > > if (++failures == MAX_RECLAIM_RETRIES) > > > > break; > > > > + > > > > + goto resched; > > > > } > > > > + spin_unlock(_pools_lock); > > > > + > > > > + ret = shrink_memcg(memcg); > > > > > > We just checked for online-ness above, and then shrink_memcg() checks > > > it again. Is this intentional? > > > > Hmm these two checks are for two different purposes. The check above > > is mainly to prevent accidentally undoing the offline cleanup callback > > during memcg selection step. Inside shrink_memcg(), we check > > onlineness again to prevent reclaiming from offlined memcgs - which in > > effect will trigger the reclaim of the parent's memcg. > > Right, but two checks in close proximity are not doing a lot. > Especially that the memcg online-ness can change right after the check > inside shrink_memcg() anyway, so it's a best effort thing. > > Anyway, it shouldn't matter much. We can leave it. > > > > > > > > > > + /* drop the extra reference */ > > > > > > Where does the extra reference come from? > > > > The extra reference is from mem_cgroup_tryget_online(). We get two > > references in the dance above - one from mem_cgroup_iter() (which can > > be dropped) and one extra from mem_cgroup_tryget_online(). I kept the > > second one in case the first one was dropped by the zswap memcg > > offlining callback, but after reclaiming it is safe to just drop it. > > Right. I was confused by the missing mem_cgroup_tryget_online(). 
> > > > > > > > > > + mem_cgroup_put(memcg); > > > > + > > > > + if (ret == -EINVAL) > > > > + break; > > > > + if (ret && ++failures == MAX_RECLAIM_RETRIES) > > > > + break; > > > > + > > > > +resched: > > > > cond_resched(); > > > > } while (!zswap_can_accept()); > > > > -
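The retry logic being discussed — skip a NULL iterator result (end of a round trip) and offline memcgs, but count both as failures so an all-offline hierarchy cannot spin forever — can be modeled in isolation. The sketch below replaces `mem_cgroup_iter()` and `shrink_memcg()` with stubs over an array; it illustrates only the control flow of `shrink_worker()`, none of the locking or reference counting:

```c
#include <stddef.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 5

struct fake_memcg { bool online; };

/*
 * Stand-in for mem_cgroup_iter(): walks an array round-robin and
 * returns NULL once per full round trip, like the real iterator does
 * at the end of the hierarchy walk.
 */
static struct fake_memcg *iter_next(struct fake_memcg **list, size_t n,
				    size_t *pos)
{
	if (*pos == n) {
		*pos = 0;
		return NULL;	/* end of one full round */
	}
	return list[(*pos)++];
}

/*
 * Control flow of shrink_worker(): NULL and offline memcgs both count
 * as failures, capped at MAX_RECLAIM_RETRIES; online memcgs are
 * "reclaimed" (here: counted). Returns the number of successful
 * shrink calls before giving up or reaching the target.
 */
static int shrink_rounds(struct fake_memcg **list, size_t n, int wanted)
{
	size_t pos = 0;
	int failures = 0, done = 0;

	while (done < wanted) {
		struct fake_memcg *m = iter_next(list, n, &pos);

		if (!m || !m->online) {
			if (++failures == MAX_RECLAIM_RETRIES)
				break;
			continue;	/* stands in for "goto resched" */
		}
		done++;	/* stands in for a successful shrink_memcg() */
	}
	return done;
}
```

With every memcg offline the loop terminates after MAX_RECLAIM_RETRIES failures; with at least one online memcg it keeps making progress across round trips.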
Re: [RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
On Tue, 2023-12-05 at 11:59 -0700, Jonathan Corbet wrote:
> Nikolai Kondrashov writes:
>
> > Introduce a new tag, 'Tested-with:', documented in the
> > Documentation/process/submitting-patches.rst file.
[]
> I have to ask whether we *really* need to introduce yet another tag for
> this. How are we going to use this information? Are we going to try to
> make a tag for every way in which somebody might test a patch?

In general, I think Link: would be good enough. And remember that all
this goes stale after a while, and that includes old test suites.
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
[..] > > > static void shrink_worker(struct work_struct *w) > > > { > > > struct zswap_pool *pool = container_of(w, typeof(*pool), > > > shrink_work); > > > + struct mem_cgroup *memcg; > > > int ret, failures = 0; > > > > > > + /* global reclaim will select cgroup in a round-robin fashion. */ > > > do { > > > - ret = zswap_reclaim_entry(pool); > > > - if (ret) { > > > - zswap_reject_reclaim_fail++; > > > - if (ret != -EAGAIN) > > > + spin_lock(_pools_lock); > > > + pool->next_shrink = mem_cgroup_iter(NULL, > > > pool->next_shrink, NULL); > > > + memcg = pool->next_shrink; > > > + > > > + /* > > > +* We need to retry if we have gone through a full round > > > trip, or if we > > > +* got an offline memcg (or else we risk undoing the > > > effect of the > > > +* zswap memcg offlining cleanup callback). This is not > > > catastrophic > > > +* per se, but it will keep the now offlined memcg > > > hostage for a while. > > > +* > > > +* Note that if we got an online memcg, we will keep the > > > extra > > > +* reference in case the original reference obtained by > > > mem_cgroup_iter > > > +* is dropped by the zswap memcg offlining callback, > > > ensuring that the > > > +* memcg is not killed when we are reclaiming. > > > +*/ > > > + if (!memcg) { > > > + spin_unlock(_pools_lock); > > > + if (++failures == MAX_RECLAIM_RETRIES) > > > break; > > > + > > > + goto resched; > > > + } > > > + > > > + if (!mem_cgroup_online(memcg)) { > > > + /* drop the reference from mem_cgroup_iter() */ > > > + mem_cgroup_put(memcg); > > > > Probably better to use mem_cgroup_iter_break() here? > > mem_cgroup_iter_break(NULL, memcg) seems to perform the same thing, right? Yes, but it's better to break the iteration with the documented API (e.g. if mem_cgroup_iter_break() changes to do extra work). > > > > > Also, I don't see mem_cgroup_tryget_online() being used here (where I > > expected it to be used), did I miss it? 
> > Oh shoot yeah that was a typo - it should be > mem_cgroup_tryget_online(). Let me send a fix to that. > > > > > > + pool->next_shrink = NULL; > > > + spin_unlock(_pools_lock); > > > + > > > if (++failures == MAX_RECLAIM_RETRIES) > > > break; > > > + > > > + goto resched; > > > } > > > + spin_unlock(_pools_lock); > > > + > > > + ret = shrink_memcg(memcg); > > > > We just checked for online-ness above, and then shrink_memcg() checks > > it again. Is this intentional? > > Hmm these two checks are for two different purposes. The check above > is mainly to prevent accidentally undoing the offline cleanup callback > during memcg selection step. Inside shrink_memcg(), we check > onlineness again to prevent reclaiming from offlined memcgs - which in > effect will trigger the reclaim of the parent's memcg. Right, but two checks in close proximity are not doing a lot. Especially that the memcg online-ness can change right after the check inside shrink_memcg() anyway, so it's a best effort thing. Anyway, it shouldn't matter much. We can leave it. > > > > > > + /* drop the extra reference */ > > > > Where does the extra reference come from? > > The extra reference is from mem_cgroup_tryget_online(). We get two > references in the dance above - one from mem_cgroup_iter() (which can > be dropped) and one extra from mem_cgroup_tryget_online(). I kept the > second one in case the first one was dropped by the zswap memcg > offlining callback, but after reclaiming it is safe to just drop it. Right. I was confused by the missing mem_cgroup_tryget_online(). > > > > > > + mem_cgroup_put(memcg); > > > + > > > + if (ret == -EINVAL) > > > + break; > > > + if (ret && ++failures == MAX_RECLAIM_RETRIES) > > > + break; > > > + > > > +resched: > > > cond_resched(); > > > } while (!zswap_can_accept()); > > > - zswap_pool_put(pool); > > > } > > > > > > static struct zswap_pool *zswap_pool_create(char *type, char *compressor) [..] 
> > > @@ -1240,15 +1395,15 @@ bool zswap_store(struct folio *folio) > > > zswap_invalidate_entry(tree, dupentry); > > > } > > > spin_unlock(>lock); > > > - > > > - /* > > > -* XXX: zswap reclaim does not work with
Re: [RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
Nikolai Kondrashov writes: > Introduce a new tag, 'Tested-with:', documented in the > Documentation/process/submitting-patches.rst file. > > The tag is expected to contain the test suite command which was executed > for the commit, and to certify it passed. Additionally, it can contain a > URL pointing to the execution results, after a '#' character. > > Prohibit the V: field from containing the '#' character correspondingly. > > Signed-off-by: Nikolai Kondrashov > --- > Documentation/process/submitting-patches.rst | 10 ++ > MAINTAINERS | 2 +- > scripts/checkpatch.pl| 4 ++-- > 3 files changed, 13 insertions(+), 3 deletions(-) I have to ask whether we *really* need to introduce yet another tag for this. How are we going to use this information? Are we going to try to make a tag for every way in which somebody might test a patch? Thanks, jon
Re: [RFC PATCH v2 02/10] MAINTAINERS: Introduce V: entry for tests
On Tue, 2023-12-05 at 20:02 +0200, Nikolai Kondrashov wrote:
> Require the entry values to not contain the '@' character, so they could
> be distinguished from emails (always) output by get_maintainer.pl.

Why is this useful? Why the need to distinguish?
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat
On Tue, Dec 5, 2023 at 10:22 AM Yosry Ahmed wrote: > > On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > > > From: Domenico Cerasuolo > > > > Since zswap now writes back pages from memcg-specific LRUs, we now need a > > new stat to show writebacks count for each memcg. > > > > Suggested-by: Nhat Pham > > Signed-off-by: Domenico Cerasuolo > > Signed-off-by: Nhat Pham > > --- > > include/linux/vm_event_item.h | 1 + > > mm/memcontrol.c | 1 + > > mm/vmstat.c | 1 + > > mm/zswap.c| 4 > > 4 files changed, 7 insertions(+) > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > index d1b847502f09..f4569ad98edf 100644 > > --- a/include/linux/vm_event_item.h > > +++ b/include/linux/vm_event_item.h > > @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > #ifdef CONFIG_ZSWAP > > ZSWPIN, > > ZSWPOUT, > > + ZSWP_WB, > > I think you dismissed Johannes's comment from v7 about ZSWPWB and > "zswpwb" being more consistent with the existing events. I missed that entirely. Oops. Yeah I prefer ZSWPWB too. Let me send a fix. 
> > > #endif > > #ifdef CONFIG_X86 > > DIRECT_MAP_LEVEL2_SPLIT, > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 792ca21c5815..21d79249c8b4 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > ZSWPIN, > > ZSWPOUT, > > + ZSWP_WB, > > #endif > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > THP_FAULT_ALLOC, > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index afa5a38fcc9c..2249f85e4a87 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = { > > #ifdef CONFIG_ZSWAP > > "zswpin", > > "zswpout", > > + "zswp_wb", > > #endif > > #ifdef CONFIG_X86 > > "direct_map_level2_splits", > > diff --git a/mm/zswap.c b/mm/zswap.c > > index f323e45cbdc7..49b79393e472 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct > > list_head *item, struct list_lru_o > > } > > zswap_written_back_pages++; > > > > + if (entry->objcg) > > + count_objcg_event(entry->objcg, ZSWP_WB); > > + > > + count_vm_event(ZSWP_WB); > > /* > > * Writeback started successfully, the page now belongs to the > > * swapcache. Drop the entry from zswap - unless invalidate already > > -- > > 2.34.1
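The counting pattern in the hunk above — always bump the global vmstat event, and additionally bump the per-memcg event only when the entry is charged to an obj_cgroup — can be shown with a trivial stand-alone model (stub types; `count_writeback()` is hypothetical and only mirrors the two calls in `shrink_memcg_cb()`):

```c
#include <stddef.h>

/* Minimal model: one global counter plus an optional per-cgroup one. */
struct objcg { unsigned long zswpwb; };

static unsigned long vm_zswpwb;	/* stands in for count_vm_event(ZSWPWB) */

/*
 * The global event is always counted; the per-memcg event only when
 * the entry is charged to a cgroup (entry->objcg may be NULL, e.g.
 * when memcg accounting is disabled).
 */
static void count_writeback(struct objcg *objcg)
{
	if (objcg)
		objcg->zswpwb++;	/* count_objcg_event(objcg, ZSWPWB) */
	vm_zswpwb++;
}
```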
Re: [RFC PATCH v2 01/10] get_maintainer: Survive querying missing files
On Tue, 2023-12-05 at 20:02 +0200, Nikolai Kondrashov wrote:
> Do not die, but only warn when scripts/get_maintainer.pl is asked to
> retrieve information about a missing file.
>
> This allows scripts/checkpatch.pl to query MAINTAINERS while processing
> patches which are removing files.

Why is this useful? Give a for-instance example please.
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Tue, Dec 5, 2023 at 10:21 AM Yosry Ahmed wrote:
>
> On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote:
> >
> > From: Domenico Cerasuolo
> >
> > Currently, we only have a single global LRU for zswap. This makes it
> > impossible to perform workload-specific shrinking - a memcg cannot
> > determine which pages in the pool it owns, and often ends up writing
> > pages from other memcgs. This issue has been previously observed in
> > practice and mitigated by simply disabling memcg-initiated shrinking:
> >
> > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u
> >
> > This patch fully resolves the issue by replacing the global zswap LRU
> > with memcg- and NUMA-specific LRUs, and modifies the reclaim logic:
> >
> > a) When a store attempt hits a memcg limit, it now triggers a
> >    synchronous reclaim attempt that, if successful, allows the new
> >    hotter page to be accepted by zswap.
> > b) If the store attempt instead hits the global zswap limit, it will
> >    trigger an asynchronous reclaim attempt, in which a memcg is
> >    selected for reclaim in a round-robin-like fashion.
> > > > Signed-off-by: Domenico Cerasuolo > > Co-developed-by: Nhat Pham > > Signed-off-by: Nhat Pham > > --- > > include/linux/memcontrol.h | 5 + > > include/linux/zswap.h | 2 + > > mm/memcontrol.c| 2 + > > mm/swap.h | 3 +- > > mm/swap_state.c| 24 +++- > > mm/zswap.c | 269 + > > 6 files changed, 245 insertions(+), 60 deletions(-) > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 2bd7d14ace78..a308c8eacf20 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup > > *page_memcg_check(struct page *page) > > return NULL; > > } > > > > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct > > obj_cgroup *objcg) > > +{ > > + return NULL; > > +} > > + > > static inline bool folio_memcg_kmem(struct folio *folio) > > { > > return false; > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > > index 2a60ce39cfde..e571e393669b 100644 > > --- a/include/linux/zswap.h > > +++ b/include/linux/zswap.h > > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio); > > void zswap_invalidate(int type, pgoff_t offset); > > void zswap_swapon(int type); > > void zswap_swapoff(int type); > > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > > > > #else > > > > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio) > > static inline void zswap_invalidate(int type, pgoff_t offset) {} > > static inline void zswap_swapon(int type) {} > > static inline void zswap_swapoff(int type) {} > > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {} > > > > #endif > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 470821d1ba1a..792ca21c5815 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct > > cgroup_subsys_state *css) > > page_counter_set_min(>memory, 0); > > page_counter_set_low(>memory, 0); > > > > + 
zswap_memcg_offline_cleanup(memcg); > > + > > memcg_offline_kmem(memcg); > > reparent_shrinker_deferred(memcg); > > wb_memcg_offline(memcg); > > diff --git a/mm/swap.h b/mm/swap.h > > index 73c332ee4d91..c0dc73e10e91 100644 > > --- a/mm/swap.h > > +++ b/mm/swap.h > > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, > > gfp_t gfp_mask, > >struct swap_iocb **plug); > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated); > > +bool *new_page_allocated, > > +bool skip_if_exists); > > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > > struct mempolicy *mpol, pgoff_t ilx); > > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 85d9e5806a6a..6c84236382f3 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > > address_space *mapping, > > > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > struct mempolicy *mpol, pgoff_t ilx, > > -bool *new_page_allocated) > > +bool *new_page_allocated, > > +bool skip_if_exists) > > { > > struct swap_info_struct *si; > > struct folio *folio; > > @@ -470,6
[RFC PATCH v2 10/10] MAINTAINERS: Add proposal strength to V: entries
Require the MAINTAINERS V: entries to begin with a keyword, one of SUGGESTED/RECOMMENDED/REQUIRED, signifying how strongly the test is proposed for verifying the subsystem changes, prompting scripts/checkpatch.pl to produce CHECK/WARNING/ERROR messages respectively, whenever the commit message doesn't have the corresponding Tested-with: tag. Signed-off-by: Nikolai Kondrashov --- Documentation/process/submitting-patches.rst | 11 ++- MAINTAINERS | 20 +++-- scripts/checkpatch.pl| 83 3 files changed, 71 insertions(+), 43 deletions(-) diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index 45bd1a713ef33..199fadc50cf62 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -233,18 +233,21 @@ Test your changes Test the patch to the best of your ability. Check the MAINTAINERS file for the subsystem(s) you are changing to see if there are any **V:** entries -proposing particular test suites, either directly as commands, or via -documentation references. +proposing particular test suites. + +The **V:** entries start with a proposal strength keyword +(SUGGESTED/RECOMMENDED/REQUIRED), followed either by a command, or a +documentation reference. Test suite references start with a ``*`` (similar to C pointer dereferencing), followed by the name of the test suite, which would be documented in the Documentation/process/tests.rst under the corresponding heading. E.g.:: - V: *xfstests + V: SUGGESTED *xfstests Anything not starting with a ``*`` is considered a command. E.g.:: - V: tools/testing/kunit/run_checks.py + V: RECOMMENDED tools/testing/kunit/run_checks.py Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:** entries to its output. 
diff --git a/MAINTAINERS b/MAINTAINERS index 84e90ec015090..3a35e320b5a5b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -59,15 +59,19 @@ Descriptions of section entries and preferred order matches patches or files that contain one or more of the words printk, pr_info or pr_err One regex pattern per line. Multiple K: lines acceptable. - V: *Test suite* proposed for execution. The command that could be - executed to verify changes to the maintained subsystem, or a reference - to a test suite documented in Documentation/process/tests.txt. + V: *Test suite* proposed for execution for verifying changes to the + maintained subsystem. Must start with a proposal strength keyword: + (SUGGESTED/RECOMMENDED/REQUIRED), followed by the test suite command, + or a reference to a test suite documented in + Documentation/process/tests.txt. + Proposal strengths correspond to checkpatch.pl message levels + (CHECK/WARNING/ERROR respectively, whenever Tested-with: is missing). Commands must be executed from the root of the source tree. Commands must support the -h/--help option. References must be preceded with a '*'. Cannot contain '@' or '#' characters. - V: tools/testing/kunit/run_checks.py - V: *xfstests + V: SUGGESTED tools/testing/kunit/run_checks.py + V: RECOMMENDED *xfstests One test suite per line. 
Maintainers List @@ -7978,7 +7982,7 @@ L:linux-e...@vger.kernel.org S: Maintained W: http://ext4.wiki.kernel.org Q: http://patchwork.ozlabs.org/project/linux-ext4/list/ -V: *kvm-xfstests smoke +V: RECOMMENDED *kvm-xfstests smoke T: git git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git F: Documentation/filesystems/ext4/ F: fs/ext4/ @@ -11628,7 +11632,7 @@ L: linux-kselftest@vger.kernel.org L: kunit-...@googlegroups.com S: Maintained W: https://google.github.io/kunit-docs/third_party/kernel/docs/ -V: tools/testing/kunit/run_checks.py +V: RECOMMENDED tools/testing/kunit/run_checks.py T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit T: git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit-fixes F: Documentation/dev-tools/kunit/ @@ -18367,7 +18371,7 @@ REGISTER MAP ABSTRACTION M: Mark Brown L: linux-ker...@vger.kernel.org S: Supported -V: *kunit +V: RECOMMENDED *kunit T: git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git F: Documentation/devicetree/bindings/regmap/ F: drivers/base/regmap/ diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index bfeb4c33b5424..9438e4f452a6c 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -1181,39 +1181,57 @@ sub is_maintained_obsolete { return $maintained_status{$filename} =~ /obsolete/i; } -# Test suites proposed per changed file +# A list of test proposal strength
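The proposal-strength policy this patch adds — SUGGESTED/RECOMMENDED/REQUIRED mapping to checkpatch CHECK/WARNING/ERROR when the Tested-with: tag is missing — is a plain lookup. checkpatch.pl itself is Perl; the C sketch below only illustrates the mapping the patch describes:

```c
#include <string.h>
#include <stddef.h>

/*
 * Map a V: proposal strength keyword to the checkpatch.pl message
 * level emitted when the commit lacks a matching Tested-with: tag.
 * Returns NULL for anything that is not a valid strength keyword.
 */
static const char *missing_tag_level(const char *strength)
{
	if (!strcmp(strength, "SUGGESTED"))
		return "CHECK";
	if (!strcmp(strength, "RECOMMENDED"))
		return "WARNING";
	if (!strcmp(strength, "REQUIRED"))
		return "ERROR";
	return NULL;
}
```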
[RFC PATCH v2 09/10] MAINTAINERS: Propose kunit tests for regmap
From: Mark Brown

The regmap core and especially cache code have reasonable kunit
coverage, ask people to use that to test regmap changes.

Signed-off-by: Mark Brown
Signed-off-by: Nikolai Kondrashov
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 669b5ff571730..84e90ec015090 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18367,6 +18367,7 @@ REGISTER MAP ABSTRACTION
 M:	Mark Brown
 L:	linux-ker...@vger.kernel.org
 S:	Supported
+V:	*kunit
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git
 F:	Documentation/devicetree/bindings/regmap/
 F:	drivers/base/regmap/
--
2.42.0
[RFC PATCH v2 08/10] docs: tests: Document kunit in general
Add an entry on the complete set of kunit tests to the Documentation/process/tests.rst, so that it could be referenced in MAINTAINERS, and is catalogued in general. Signed-off-by: Nikolai Kondrashov --- Documentation/process/tests.rst | 23 +++ 1 file changed, 23 insertions(+) diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst index cfaf937dc4d5f..0760229fc32b0 100644 --- a/Documentation/process/tests.rst +++ b/Documentation/process/tests.rst @@ -71,3 +71,26 @@ kvm-xfstests smoke The "kvm-xfstests smoke" is a minimal subset of xfstests for testing all major file systems, running under KVM. + +kunit +- + +:Summary: complete set of KUnit unit tests +:Command: tools/testing/kunit/kunit.py run --alltests +:Docs: https://docs.kernel.org/dev-tools/kunit/ + +KUnit tests are part of the kernel, written in the C (programming) language, +and test parts of the Kernel implementation (example: a C language function). +Excluding build time, from invocation to completion, KUnit can run around 100 +tests in less than 10 seconds. KUnit can test any kernel component, for +example: file system, system calls, memory management, device drivers and so +on. + +KUnit follows the white-box testing approach. The test has access to internal +system functionality. KUnit runs in kernel space and is not restricted to +things exposed to user-space. + +In addition, KUnit has kunit_tool, a script (tools/testing/kunit/kunit.py) +that configures the Linux kernel, runs KUnit tests under QEMU or UML (User +Mode Linux), parses the test results and displays them in a user friendly +manner. -- 2.42.0
[RFC PATCH v2 07/10] MAINTAINERS: Propose kvm-xfstests smoke for ext4
Propose the "kvm-xfstests smoke" test suite for changes to the EXT4 FILE SYSTEM subsystem, as discussed previously with maintainers. Signed-off-by: Nikolai Kondrashov --- Documentation/process/tests.rst | 32 MAINTAINERS | 1 + 2 files changed, 33 insertions(+) diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst index 4ae5000e811c8..cfaf937dc4d5f 100644 --- a/Documentation/process/tests.rst +++ b/Documentation/process/tests.rst @@ -39,3 +39,35 @@ following ones recognized by the tools (regardless of the case): (even if only to report what else needs setting up) Any other entries are accepted, but not processed. + +xfstests + + +:Summary: file system regression test suite +:Source: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git +:Docs: https://github.com/tytso/xfstests-bld/blob/master/Documentation/what-is-xfstests.md + +As the name might imply, xfstests is a file system regression test suite which +was originally developed by Silicon Graphics (SGI) for the XFS file system. +Originally, xfstests, like XFS was only supported on the SGI's Irix operating +system. When XFS was ported to Linux, so was xfstests, and now xfstests is +only supported on Linux. + +Today, xfstests is used as a file system regression test suite for all of +Linux's major file systems: xfs, ext2, ext4, cifs, btrfs, f2fs, reiserfs, gfs, +jfs, udf, nfs, and tmpfs. Many file system maintainers will run a full set of +xfstests before sending patches to Linus, and will require that any major +changes be tested using xfstests before they are submitted for integration. 
+ +The easiest way to start running xfstests is under KVM with xfstests-bld: +https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md + +kvm-xfstests smoke +-- + +:Summary: file system smoke test suite +:Superset: xfstests +:Docs: https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md + +The "kvm-xfstests smoke" is a minimal subset of xfstests for testing all major +file systems, running under KVM. diff --git a/MAINTAINERS b/MAINTAINERS index 3ed15d8327919..669b5ff571730 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7978,6 +7978,7 @@ L:linux-e...@vger.kernel.org S: Maintained W: http://ext4.wiki.kernel.org Q: http://patchwork.ozlabs.org/project/linux-ext4/list/ +V: *kvm-xfstests smoke T: git git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git F: Documentation/filesystems/ext4/ F: fs/ext4/ -- 2.42.0
[RFC PATCH v2 06/10] MAINTAINERS: Support referencing test docs in V:
Support referencing test suite documentation in the V: entries of the
MAINTAINERS file. Use the '*<name>' syntax (like C pointer dereference),
where '<name>' is a second-level heading in the new
Documentation/process/tests.rst file, with the suite's description. This
syntax allows distinguishing the references from test commands.

Add a boiler-plate Documentation/process/tests.rst file, describing a
way to add structured info to the test suites in the form of field
lists. Apart from the "summary" and "command" fields, they can also
contain a "superset" field specifying the superset of the test suite,
helping reuse documentation and express both wider and narrower test
sets.

Make scripts/checkpatch.pl load the tests from the file, along with the
structured data, validate the references in MAINTAINERS, dereference
them, and output the test suite information in the CHECK messages
whenever the corresponding subsystems are changed - but only if there
was no corresponding Tested-with: tag in the commit message, certifying
it was executed successfully already.

This is supposed to help propose executing test suites which cannot be
executed immediately and need extra setup, as well as provide a place
for extra documentation and information on directly-available suites.
Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/index.rst              |   1 +
 Documentation/process/submitting-patches.rst |  21 +++-
 Documentation/process/tests.rst              |  41 +++
 MAINTAINERS                                  |   9 +-
 scripts/checkpatch.pl                        | 122 +--
 5 files changed, 177 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/process/tests.rst

diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst
index a1daa309b58d0..3eda2e7432fdb 100644
--- a/Documentation/process/index.rst
+++ b/Documentation/process/index.rst
@@ -49,6 +49,7 @@ Other guides to the community that are of interest to most developers are:
    :maxdepth: 1

    changes
+   tests
    stable-api-nonsense
    management-style
    stable-kernel-rules

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 2004df2ac1b39..45bd1a713ef33 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -233,27 +233,42 @@ Test your changes

 Test the patch to the best of your ability. Check the MAINTAINERS file for the
 subsystem(s) you are changing to see if there are any **V:** entries
-proposing particular test suite commands. E.g.::
+proposing particular test suites, either directly as commands, or via
+documentation references.
+
+Test suite references start with a ``*`` (similar to C pointer dereferencing),
+followed by the name of the test suite, which would be documented in the
+Documentation/process/tests.rst under the corresponding heading. E.g.::
+
+  V: *xfstests
+
+Anything not starting with a ``*`` is considered a command. E.g.::

   V: tools/testing/kunit/run_checks.py

 Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:**
 entries to its output.

-Execute the commands, if any, to test your changes.
+Execute the (referenced) test suites, if any, to test your changes.

 All commands must be executed from the root of the source tree.
 Each command outputs usage information, if an -h/--help option is specified.

 If a test suite you've executed completed successfully, add a ``Tested-with:
-<command>`` to the message of the commit you tested. E.g.::
+<command>`` or ``Tested-with: *<suite>`` to the message of the commit you
+tested. E.g.::

   Tested-with: tools/testing/kunit/run_checks.py

+or::
+
+  Tested-with: *xfstests
+
 Optionally, add a '#' character followed by a publicly-accessible URL
 containing the test results, if you make them available. E.g.::

   Tested-with: tools/testing/kunit/run_checks.py # https://kernelci.org/test/2239874
+  Tested-with: *xfstests # https://kernelci.org/test/2239324

 Select the recipients for your patch

diff --git a/Documentation/process/tests.rst b/Documentation/process/tests.rst
new file mode 100644
index 0..4ae5000e811c8
--- /dev/null
+++ b/Documentation/process/tests.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _tests:
+
+Tests you can run
+=================
+
+There are many automated tests available for the Linux kernel, and some
+userspace tests which happen to also test the kernel. Here are some of them,
+along with the instructions on where to get them and how to run them for
+various purposes.
+
+This document has to follow a certain structure to allow tool access.
+Second-level headers (underscored with dashes '-') must contain test suite
+names, and the corresponding section must contain the test description.
+
+The test suites can be referenced by name, preceded with a '*', in the "V:"
+lines in the MAINTAINERS file, as well as in the "Tested-with:" tag in commit
+messages. E.g::
+
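[Editor's illustration, not part of the patch: the V:/Tested-with: value syntax described in this patch — a leading '*' marking a catalog reference vs. a literal command, and an optional '#' separating a results URL — is simple to parse mechanically. A minimal sketch; the function name and the (kind, name, url) tuple shape are assumptions for illustration only.]

```python
def parse_test_entry(value):
    """Split a V:/Tested-with: value into (kind, name, url).

    Mirrors the rules in the patch: a leading '*' marks a reference
    to a suite documented in Documentation/process/tests.rst,
    anything else is a literal command, and an optional '#'
    separates a results URL (which is why V: values may not
    contain '#').
    """
    body, _, url = value.partition("#")
    body = body.strip()
    kind = "reference" if body.startswith("*") else "command"
    name = body[1:] if kind == "reference" else body
    return (kind, name, url.strip() or None)

print(parse_test_entry("*xfstests"))
print(parse_test_entry("tools/testing/kunit/run_checks.py"))
print(parse_test_entry("*xfstests # https://kernelci.org/test/2239324"))
```

A suite name containing '#' would be rejected by the checkpatch validation added earlier in the series, so splitting on the first '#' is safe.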
[RFC PATCH v2 03/10] MAINTAINERS: Propose kunit core tests for framework changes
DONOTMERGE: The command in question should support -h/--help option.

Signed-off-by: Nikolai Kondrashov
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e6d0777e21657..68821eecf61cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11624,6 +11624,7 @@ L:	linux-kselftest@vger.kernel.org
 L:	kunit-...@googlegroups.com
 S:	Maintained
 W:	https://google.github.io/kunit-docs/third_party/kernel/docs/
+V:	tools/testing/kunit/run_checks.py
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git kunit-fixes
 F:	Documentation/dev-tools/kunit/
--
2.42.0
[RFC PATCH v2 05/10] checkpatch: Propose tests to execute
Make scripts/checkpatch.pl output a 'CHECK' advertising any test suites
proposed for the changed subsystems, and prompting their execution.

Using 'CHECK', instead of 'WARNING' or 'ERROR', because test suite commands
executed for testing can generally be off by an option/argument or two,
depending on the situation, while still satisfying the maintainer
requirements, but failing the comparison with the V: entry and raising alarm
unnecessarily. However, see the later patch adding the proposal strength to
the V: entry and allowing raising the severity of the message for those who'd
like that.

Signed-off-by: Nikolai Kondrashov
---
 scripts/checkpatch.pl | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index bea602c30df5d..1da617e1edb5f 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -1144,6 +1144,29 @@ sub is_maintained_obsolete {
 	return $maintained_status{$filename} =~ /obsolete/i;
 }

+# Test suites proposed per changed file
+our %files_proposed_tests = ();
+
+# Return a list of test suites proposed for execution for a particular file
+sub get_file_proposed_tests {
+	my ($filename) = @_;
+	my $file_proposed_tests;
+
+	return () if (!$tree || !(-e "$root/scripts/get_maintainer.pl"));
+
+	if (!exists($files_proposed_tests{$filename})) {
+		my $command = "perl $root/scripts/get_maintainer.pl --test --multiline --nogit --nogit-fallback -f $filename";
+		# Ignore warnings on stderr
+		my $output = `$command 2>/dev/null`;
+		# But regenerate stderr on failure
+		die "Failed retrieving tests proposed for changes to \"$filename\":\n" .
+		    `$command 2>&1 >/dev/null` if ($?);
+		$files_proposed_tests{$filename} = [grep { !/@/ } split("\n", $output)]
+	}
+
+	$file_proposed_tests = $files_proposed_tests{$filename};
+	return @$file_proposed_tests;
+}
+
 sub is_SPDX_License_valid {
 	my ($license) = @_;

@@ -2689,6 +2712,9 @@ sub process {
 	my @setup_docs = ();
 	my $setup_docs = 0;

+	# Test suites which should not be proposed for execution
+	my %dont_propose_tests = ();
+
 	my $camelcase_file_seeded = 0;

 	my $checklicenseline = 1;

@@ -2907,6 +2933,17 @@ sub process {
 			}
 		}

+		# Check if tests are proposed for changes to the file
+		foreach my $test (get_file_proposed_tests($realfile)) {
+			next if exists $dont_propose_tests{$test};
+			CHK("TEST_PROPOSAL",
+			    "Running the following test suite is proposed for changes to $realfile:\n" .
+			    "$test\n" .
+			    "Add the following to the tested commit's message, IF IT PASSES:\n" .
+			    "Tested-with: $test\n");
+			$dont_propose_tests{$test} = 1;
+		}
+
 		next;
 	}

@@ -3233,6 +3270,12 @@ sub process {
 			}
 		}

+# Check and accumulate executed test suites (stripping URLs off the end)
+		if (!$in_commit_log && $line =~ /^\s*Tested-with:\s*(.*?)\s*#.*$/i) {
+			# Do not propose this certified-passing test suite
+			$dont_propose_tests{$1} = 1;
+		}
+
 # Check email subject for common tools that don't need to be mentioned
 		if ($in_header_lines &&
 		    $line =~ /^Subject:.*\b(?:checkpatch|sparse|smatch)\b[^:]/i) {
--
2.42.0
[RFC PATCH v2 04/10] docs: submitting-patches: Introduce Tested-with:
Introduce a new tag, 'Tested-with:', documented in the
Documentation/process/submitting-patches.rst file. The tag is expected to
contain the test suite command which was executed for the commit, and to
certify it passed. Additionally, it can contain a URL pointing to the
execution results, after a '#' character.

Prohibit the V: field from containing the '#' character correspondingly.

Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/submitting-patches.rst | 10 ++
 MAINTAINERS                                  |  2 +-
 scripts/checkpatch.pl                        |  4 ++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index f034feaf1369e..2004df2ac1b39 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -245,6 +245,16 @@ Execute the commands, if any, to test your changes.
 All commands must be executed from the root of the source tree. Each command
 outputs usage information, if an -h/--help option is specified.

+If a test suite you've executed completed successfully, add a ``Tested-with:
+<command>`` to the message of the commit you tested. E.g.::
+
+  Tested-with: tools/testing/kunit/run_checks.py
+
+Optionally, add a '#' character followed by a publicly-accessible URL
+containing the test results, if you make them available. E.g.::
+
+  Tested-with: tools/testing/kunit/run_checks.py # https://kernelci.org/test/2239874
+
 Select the recipients for your patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 68821eecf61cf..28fbb0eb335ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -63,7 +63,7 @@ Descriptions of section entries and preferred order
 	   executed to verify changes to the maintained subsystem.
 	   Must be executed from the root of the source tree.
 	   Must support the -h/--help option.
-	   Cannot contain '@' character.
+	   Cannot contain '@' or '#' characters.
 	   V:	tools/testing/kunit/run_checks.py
 	   One test suite per line.

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a184e576c980b..bea602c30df5d 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3686,9 +3686,9 @@ sub process {
 # check MAINTAINERS V: entries are valid
 		if ($rawline =~ /^\+V:\s*(.*)/) {
 			my $name = $1;
-			if ($name =~ /@/) {
+			if ($name =~ /[@#]/) {
 				ERROR("TEST_PROPOSAL_INVALID",
-				      "Test proposal cannot contain '\@' character\n" . $herecurr);
+				      "Test proposal cannot contain '\@' or '#' characters\n" . $herecurr);
 			}
 		}
 	}
--
2.42.0
[RFC PATCH v2 01/10] get_maintainer: Survive querying missing files
Do not die, but only warn when scripts/get_maintainer.pl is asked to retrieve
information about a missing file. This allows scripts/checkpatch.pl to query
MAINTAINERS while processing patches which are removing files.

Signed-off-by: Nikolai Kondrashov
---
 scripts/get_maintainer.pl | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 16d8ac6005b6f..37901c2298388 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -541,7 +541,11 @@ foreach my $file (@ARGV) {
 	if ((-d $file)) {
 	    $file =~ s@([^/])$@$1/@;
 	} elsif (!(-f $file)) {
-	    die "$P: file '${file}' not found\n";
+	    if ($from_filename) {
+		warn "$P: file '${file}' not found\n";
+	    } else {
+		die "$P: file '${file}' not found\n";
+	    }
 	}
     }
     if ($from_filename && (vcs_exists() && !vcs_file_exists($file))) {
--
2.42.0
[RFC PATCH v2 02/10] MAINTAINERS: Introduce V: entry for tests
Introduce a new 'V:' ("Verify") entry to MAINTAINERS. The entry accepts a
test suite command which is proposed to be executed for each contribution to
the subsystem.

Extend scripts/get_maintainer.pl to support retrieving the V: entries when
'--test' option is specified.

Require the entry values to not contain the '@' character, so they could be
distinguished from emails (always) output by get_maintainer.pl. Make
scripts/checkpatch.pl check that they don't.

Update entry ordering in both scripts/checkpatch.pl and
scripts/parse-maintainers.pl.

Signed-off-by: Nikolai Kondrashov
---
 Documentation/process/submitting-patches.rst | 18 ++
 MAINTAINERS                                  |  7 +++
 scripts/checkpatch.pl                        | 10 +-
 scripts/get_maintainer.pl                    | 17 +++--
 scripts/parse-maintainers.pl                 |  3 ++-
 5 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 86d346bcb8ef0..f034feaf1369e 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -228,6 +228,24 @@ You should be able to justify all violations that remain in your patch.

+Test your changes
+-----------------
+
+Test the patch to the best of your ability. Check the MAINTAINERS file for the
+subsystem(s) you are changing to see if there are any **V:** entries
+proposing particular test suite commands. E.g.::
+
+  V: tools/testing/kunit/run_checks.py
+
+Supplying the ``--test`` option to ``scripts/get_maintainer.pl`` adds **V:**
+entries to its output.
+
+Execute the commands, if any, to test your changes.
+
+All commands must be executed from the root of the source tree. Each command
+outputs usage information, if an -h/--help option is specified.
+
+
 Select the recipients for your patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 788be9ab5b733..e6d0777e21657 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -59,6 +59,13 @@ Descriptions of section entries and preferred order
 	   matches patches or files that contain one or more of the words
 	   printk, pr_info or pr_err
 	   One regex pattern per line.
+	V: *Test suite* proposed for execution. The command that could be
+	   executed to verify changes to the maintained subsystem.
+	   Must be executed from the root of the source tree.
+	   Must support the -h/--help option.
+	   Cannot contain '@' character.
+	   V:	tools/testing/kunit/run_checks.py
+	   One test suite per line.

 Maintainers List

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 25fdb7fda1128..a184e576c980b 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3657,7 +3657,7 @@ sub process {
 			}
 		}
 # check MAINTAINERS entries for the right ordering too
-		my $preferred_order = 'MRLSWQBCPTFXNK';
+		my $preferred_order = 'MRLSWQBCPVTFXNK';
 		if ($rawline =~ /^\+[A-Z]:/ &&
 		    $prevrawline =~ /^[\+ ][A-Z]:/) {
 			$rawline =~ /^\+([A-Z]):\s*(.*)/;
@@ -3683,6 +3683,14 @@ sub process {
 				}
 			}
 		}
+# check MAINTAINERS V: entries are valid
+		if ($rawline =~ /^\+V:\s*(.*)/) {
+			my $name = $1;
+			if ($name =~ /@/) {
+				ERROR("TEST_PROPOSAL_INVALID",
+				      "Test proposal cannot contain '\@' character\n" . $herecurr);
+			}
+		}
 	}

 	if (($realfile =~ /Makefile.*/ || $realfile =~ /Kbuild.*/) &&

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 37901c2298388..804215a7477db 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -53,6 +53,7 @@ my $output_section_maxlen = 50;
 my $scm = 0;
 my $tree = 1;
 my $web = 0;
+my $test = 0;
 my $subsystem = 0;
 my $status = 0;
 my $letters = "";
@@ -270,6 +271,7 @@ if (!GetOptions(
 	'scm!' => \$scm,
 	'tree!' => \$tree,
 	'web!' => \$web,
+	'test!' => \$test,
 	'letters=s' => \$letters,
 	'pattern-depth=i' => \$pattern_depth,
 	'k|keywords!' => \$keywords,
@@ -319,13 +321,14 @@ if ($sections || $letters ne "") {
     $status = 0;
     $subsystem = 0;
     $web = 0;
+    $test = 0;
     $keywords = 0;
     $keywords_in_file = 0;
     $interactive = 0;
 } else {
-    my $selections = $email + $scm + $status + $subsystem + $web;
+    my
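[Editor's illustration, not part of the patch: the reason for banning '@' from V: values shows up in how the series consumes get_maintainer.pl output — everything containing '@' is an email address and gets dropped, everything else is a test entry (checkpatch does this with a perl `grep { !/@/ }`). A rough Python equivalent; the function name and sample output are assumptions for illustration only.]

```python
def extract_test_entries(output):
    """Keep only the test entries from get_maintainer.pl output.

    V: values are guaranteed not to contain '@', while every email
    address necessarily does, so filtering on '@' separates them.
    """
    return [line for line in output.splitlines() if line and "@" not in line]

# Hypothetical --test output: addresses plus one V: entry.
sample = "\n".join([
    "Shuah Khan <skhan@linuxfoundation.org>",
    "linux-kselftest@vger.kernel.org",
    "tools/testing/kunit/run_checks.py",
])
print(extract_test_entries(sample))  # → ['tools/testing/kunit/run_checks.py']
```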
[RFC PATCH v2 00/10] MAINTAINERS: Introduce V: entry for tests
Alright, here's a second version, attempting to address as many concerns as
possible. It's likely I've missed something, though.

Changes from v1:

* Make scripts/get_maintainer.pl survive querying missing files, giving a
  warning instead. This is necessary to enable scripts/checkpatch.pl to query
  MAINTAINERS about files being deleted.

* Start with the minimal change just documenting the V: entry, which accepts
  test commands directly, and tweaking the tools to deal with that.

* However, require the commands accept the -h/--help option so that users
  have an easier time getting *some* help. The run_checks.py missing that is
  the reason why the patch proposing it for kunit subsystem is marked
  "DONOTMERGE" in this version. We can drop that requirement, or soften the
  language, if there's opposition.

* Have a *separate* patch documenting 'Tested-with:' as the next (early)
  change. Mention that you can add a '#' followed by a results URL, on the
  end. Adjust the V: docs/checks to exclude '#'.

* Have a *separate* patch making scripts/checkpatch.pl propose the execution
  of the test suite defined in MAINTAINERS whenever the corresponding
  subsystem is changed.

* However, use 'CHECK', instead of 'WARNING', to allow submitters specify the
  exact (and potentially slightly different) command they used, and not have
  checkpatch.pl complain too loudly that they didn't run the (exact
  MAINTAINERS-specified) command. This unfortunately means that unless you use
  --strict, you won't see the message. We'll try to address that in a new
  change at the end.

* Have a *separate* patch introducing the test catalog and accepting
  references to that everywhere, with a special syntax to distinguish them
  from verbatim/direct commands. The syntax is prepending the test name with
  a '*' (just like C pointer dereference). Make checkpatch.pl handle that.
* Drop the recommendation to have the "Docs" and "Sources" fields in test
  descriptions, as the description text should focus on giving a good
  introduction and not prompt the user to go somewhere else immediately. They
  both can be referenced in the text where and how is appropriate.

* Generally keep the previous changes adding V: entries and test suite docs,
  and try to accommodate all the requests, but refine the "Summary" fields to
  fit the checkpatch.pl messages better.

* Have a separate patch cataloguing the complete kunit suite.

* Finally, add a patch introducing the "proposal strength" keywords
  (SUGGESTED/RECOMMENDED/REQUIRED) to the syntax of V: entries, which
  directly affect which level of checkpatch.pl message missing 'Tested-with:'
  tags would generate: CHECK/WARNING/ERROR respectively. This allows
  subsystems to disable checkpatch.pl WARNINGS/ERRORS, and keep their test
  proposals inobtrusive, if they so wish. E.g. if they expect people to
  change their commands often. At the same time allow stricter workflows for
  subsystems with more uniform testing. Or e.g. for subsystems which expect
  the tests to explain their parameters in their output, and the submitters
  to upload and link their results in their 'Tested-with:' tags.

That seems to be all, but I'm sure I forgot something :D

Anyway, send me more corrections and I'll try to address them, but it's
likely going to happen next year only.
Nick
---
Nikolai Kondrashov (9):
      get_maintainer: Survive querying missing files
      MAINTAINERS: Introduce V: entry for tests
      MAINTAINERS: Propose kunit core tests for framework changes
      docs: submitting-patches: Introduce Tested-with:
      checkpatch: Propose tests to execute
      MAINTAINERS: Support referencing test docs in V:
      MAINTAINERS: Propose kvm-xfstests smoke for ext4
      docs: tests: Document kunit in general
      MAINTAINERS: Add proposal strength to V: entries

Mark Brown (1):
      MAINTAINERS: Propose kunit tests for regmap

 Documentation/process/index.rst              |   1 +
 Documentation/process/submitting-patches.rst |  46 +++
 Documentation/process/tests.rst              |  96 +++
 MAINTAINERS                                  |  17 +++
 scripts/checkpatch.pl                        | 174 ++-
 scripts/get_maintainer.pl                    |  23 +++-
 scripts/parse-maintainers.pl                 |   3 +-
 7 files changed, 355 insertions(+), 5 deletions(-)
---
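[Editor's illustration, not part of the series: the proposal-strength keywords described in the cover letter map one-to-one onto checkpatch message levels. A minimal sketch of that mapping; the names below come from the cover letter's description, the function and the default are assumptions for illustration only.]

```python
# Proposal strength in a V: entry -> checkpatch message level, per the
# cover letter: SUGGESTED -> CHECK, RECOMMENDED -> WARNING,
# REQUIRED -> ERROR.
SEVERITY = {
    "SUGGESTED": "CHECK",
    "RECOMMENDED": "WARNING",
    "REQUIRED": "ERROR",
}

def message_level(strength="SUGGESTED"):
    """Return the checkpatch level for a missing Tested-with: tag.

    Assumes plain V: entries behave like SUGGESTED (the series
    uses CHECK by default).
    """
    return SEVERITY[strength]

print(message_level("RECOMMENDED"))  # → WARNING
```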
Re: [PATCH v8 4/6] mm: memcg: add per-memcg zswap writeback stat
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Since zswap now writes back pages from memcg-specific LRUs, we now need a > new stat to show writebacks count for each memcg. > > Suggested-by: Nhat Pham > Signed-off-by: Domenico Cerasuolo > Signed-off-by: Nhat Pham > --- > include/linux/vm_event_item.h | 1 + > mm/memcontrol.c | 1 + > mm/vmstat.c | 1 + > mm/zswap.c| 4 > 4 files changed, 7 insertions(+) > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index d1b847502f09..f4569ad98edf 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -142,6 +142,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_ZSWAP > ZSWPIN, > ZSWPOUT, > + ZSWP_WB, I think you dismissed Johannes's comment from v7 about ZSWPWB and "zswpwb" being more consistent with the existing events. > #endif > #ifdef CONFIG_X86 > DIRECT_MAP_LEVEL2_SPLIT, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 792ca21c5815..21d79249c8b4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -703,6 +703,7 @@ static const unsigned int memcg_vm_event_stat[] = { > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > ZSWPIN, > ZSWPOUT, > + ZSWP_WB, > #endif > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > THP_FAULT_ALLOC, > diff --git a/mm/vmstat.c b/mm/vmstat.c > index afa5a38fcc9c..2249f85e4a87 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = { > #ifdef CONFIG_ZSWAP > "zswpin", > "zswpout", > + "zswp_wb", > #endif > #ifdef CONFIG_X86 > "direct_map_level2_splits", > diff --git a/mm/zswap.c b/mm/zswap.c > index f323e45cbdc7..49b79393e472 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -760,6 +760,10 @@ static enum lru_status shrink_memcg_cb(struct list_head > *item, struct list_lru_o > } > zswap_written_back_pages++; > > + if (entry->objcg) > + count_objcg_event(entry->objcg, ZSWP_WB); > + > + count_vm_event(ZSWP_WB); > /* > * 
Writeback started successfully, the page now belongs to the > * swapcache. Drop the entry from zswap - unless invalidate already > -- > 2.34.1
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On Mon, Dec 4, 2023 at 2:14 PM Andrii Nakryiko wrote: > > On Mon, Dec 4, 2023 at 12:14 PM Kyle Huey wrote: > > > > The test sets a hardware breakpoint and uses a bpf program to suppress the > > I/O availability signal if the ip matches the expected value. > > > > Signed-off-by: Kyle Huey > > --- > > .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ > > .../selftests/bpf/progs/test_perf_skip.c | 23 + > > 2 files changed, 118 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > new file mode 100644 > > index ..b269a31669b7 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > @@ -0,0 +1,95 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#define _GNU_SOURCE > > +#include > > +#include "test_perf_skip.skel.h" > > +#include > > +#include > > + > > +#define BPF_OBJECT"test_perf_skip.bpf.o" > > leftover? Indeed. Fixed. > > + > > +static void handle_sig(int) > > +{ > > + ASSERT_OK(1, "perf event not skipped"); > > +} > > + > > +static noinline int test_function(void) > > +{ > > please add > > asm volatile (""); > > here to prevent compiler from actually inlining at the call site Ok. 
> > +	return 0;
> > +}
> > +
> > +void serial_test_perf_skip(void)
> > +{
> > +	sighandler_t previous;
> > +	int duration = 0;
> > +	struct test_perf_skip *skel = NULL;
> > +	int map_fd = -1;
> > +	long page_size = sysconf(_SC_PAGE_SIZE);
> > +	uintptr_t *ip = NULL;
> > +	int prog_fd = -1;
> > +	struct perf_event_attr attr = {0};
> > +	int perf_fd = -1;
> > +	struct f_owner_ex owner;
> > +	int err;
> > +
> > +	previous = signal(SIGIO, handle_sig);
> > +
> > +	skel = test_perf_skip__open_and_load();
> > +	if (!ASSERT_OK_PTR(skel, "skel_load"))
> > +		goto cleanup;
> > +
> > +	prog_fd = bpf_program__fd(skel->progs.handler);
> > +	if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd"))
> > +		goto cleanup;
> > +
> > +	map_fd = bpf_map__fd(skel->maps.ip);
> > +	if (!ASSERT_OK(map_fd < 0, "bpf_map__fd"))
> > +		goto cleanup;
> > +
> > +	ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
> > +	if (!ASSERT_OK_PTR(ip, "mmap bpf map"))
> > +		goto cleanup;
> > +
> > +	*ip = (uintptr_t)test_function;
> > +
> > +	attr.type = PERF_TYPE_BREAKPOINT;
> > +	attr.size = sizeof(attr);
> > +	attr.bp_type = HW_BREAKPOINT_X;
> > +	attr.bp_addr = (uintptr_t)test_function;
> > +	attr.bp_len = sizeof(long);
> > +	attr.sample_period = 1;
> > +	attr.sample_type = PERF_SAMPLE_IP;
> > +	attr.pinned = 1;
> > +	attr.exclude_kernel = 1;
> > +	attr.exclude_hv = 1;
> > +	attr.precise_ip = 3;
> > +
> > +	perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
> > +	if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd))
>
> please don't use CHECK() macro, stick to ASSERT_xxx()

Done.

> also, we are going to run all this on different hardware and VMs, see
> how we skip tests if hardware support is not there. See test__skip
> usage in prog_tests/perf_branches.c, as one example

Hmm I suppose it should be conditioned on CONFIG_HAVE_HW_BREAKPOINT.

> > +		goto cleanup;
> > +
> > +	err = fcntl(perf_fd, F_SETFL, O_ASYNC);
>
> I assume this is what will send SIGIO, right? Can you add a small
> comment explicitly saying this?

Done.

> > +	if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)"))
> > +		goto cleanup;
> > +
> > +	owner.type = F_OWNER_TID;
> > +	owner.pid = gettid();
> > +	err = fcntl(perf_fd, F_SETOWN_EX, &owner);
> > +	if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)"))
> > +		goto cleanup;
> > +
> > +	err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> > +	if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)"))
> > +		goto cleanup;
>
> we have a better way to do this, please use
> bpf_program__attach_perf_event() instead

Done.

> > +
> > +	test_function();
> > +
> > +cleanup:
> > +	if (perf_fd >= 0)
> > +		close(perf_fd);
> > +	if (ip)
> > +		munmap(ip, page_size);
> > +	if (skel)
> > +		test_perf_skip__destroy(skel);
> > +
> > +	signal(SIGIO, previous);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > b/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > new file mode 100644
> > index ..ef01a9161afe
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_perf_skip.c
> > @@ -0,0 +1,23 @@
> > +// SPDX-License-Identifier: GPL-2.0
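[Editor's illustration, not part of the patch: the F_SETOWN/O_ASYNC sequence discussed above is generic signal-driven I/O, not perf-specific, so it can be demonstrated with a plain socketpair. A sketch in Python (the test itself uses the thread-targeted F_SETOWN_EX with F_OWNER_TID; plain F_SETOWN is the single-threaded approximation used here). Linux-only behavior is assumed.]

```python
import fcntl
import os
import signal
import socket
import time

got_sigio = []
signal.signal(signal.SIGIO, lambda sig, frame: got_sigio.append(sig))

rd, wr = socket.socketpair()

# Direct SIGIO at this process (F_SETOWN_EX + F_OWNER_TID in the test
# targets a specific thread instead).
fcntl.fcntl(rd, fcntl.F_SETOWN, os.getpid())

# O_ASYNC is what makes the kernel raise SIGIO when rd becomes readable --
# the same role fcntl(perf_fd, F_SETFL, O_ASYNC) plays in the test.
flags = fcntl.fcntl(rd, fcntl.F_GETFL)
fcntl.fcntl(rd, fcntl.F_SETFL, flags | os.O_ASYNC)

wr.send(b"x")  # make rd readable; kernel should send SIGIO

deadline = time.monotonic() + 1.0
while not got_sigio and time.monotonic() < deadline:
    time.sleep(0.01)
print("SIGIO delivered" if got_sigio else "no SIGIO")
```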
Re: [PATCH v8 3/6] zswap: make shrinking memcg-aware
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > From: Domenico Cerasuolo > > Currently, we only have a single global LRU for zswap. This makes it > impossible to perform worload-specific shrinking - an memcg cannot > determine which pages in the pool it owns, and often ends up writing > pages from other memcgs. This issue has been previously observed in > practice and mitigated by simply disabling memcg-initiated shrinking: > > https://lore.kernel.org/all/20230530232435.3097106-1-npha...@gmail.com/T/#u > > This patch fully resolves the issue by replacing the global zswap LRU > with memcg- and NUMA-specific LRUs, and modify the reclaim logic: > > a) When a store attempt hits an memcg limit, it now triggers a >synchronous reclaim attempt that, if successful, allows the new >hotter page to be accepted by zswap. > b) If the store attempt instead hits the global zswap limit, it will >trigger an asynchronous reclaim attempt, in which an memcg is >selected for reclaim in a round-robin-like fashion. 
> >
> > Signed-off-by: Domenico Cerasuolo
> > Co-developed-by: Nhat Pham
> > Signed-off-by: Nhat Pham
> > ---
> >  include/linux/memcontrol.h |   5 +
> >  include/linux/zswap.h      |   2 +
> >  mm/memcontrol.c            |   2 +
> >  mm/swap.h                  |   3 +-
> >  mm/swap_state.c            |  24 +++-
> >  mm/zswap.c                 | 269 +
> >  6 files changed, 245 insertions(+), 60 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2bd7d14ace78..a308c8eacf20 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1192,6 +1192,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> >  	return NULL;
> >  }
> >
> > +static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline bool folio_memcg_kmem(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 2a60ce39cfde..e571e393669b 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -15,6 +15,7 @@ bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(int type, pgoff_t offset);
> >  void zswap_swapon(int type);
> >  void zswap_swapoff(int type);
> > +void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
> >
> >  #else
> >
> > @@ -31,6 +32,7 @@ static inline bool zswap_load(struct folio *folio)
> >  static inline void zswap_invalidate(int type, pgoff_t offset) {}
> >  static inline void zswap_swapon(int type) {}
> >  static inline void zswap_swapoff(int type) {}
> > +static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
> >
> >  #endif
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 470821d1ba1a..792ca21c5815 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5614,6 +5614,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >  	page_counter_set_min(&memcg->memory, 0);
> >  	page_counter_set_low(&memcg->memory, 0);
> >
> > +	zswap_memcg_offline_cleanup(memcg);
> > +
> >  	memcg_offline_kmem(memcg);
> >  	reparent_shrinker_deferred(memcg);
> >  	wb_memcg_offline(memcg);
> > diff
--git a/mm/swap.h b/mm/swap.h > index 73c332ee4d91..c0dc73e10e91 100644 > --- a/mm/swap.h > +++ b/mm/swap.h > @@ -51,7 +51,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t > gfp_mask, >struct swap_iocb **plug); > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated); > +bool *new_page_allocated, > +bool skip_if_exists); > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, > struct mempolicy *mpol, pgoff_t ilx); > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 85d9e5806a6a..6c84236382f3 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -412,7 +412,8 @@ struct folio *filemap_get_incore_folio(struct > address_space *mapping, > > struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > struct mempolicy *mpol, pgoff_t ilx, > -bool *new_page_allocated) > +bool *new_page_allocated, > +bool skip_if_exists) > { > struct swap_info_struct *si; > struct folio *folio; > @@ -470,6 +471,17 @@ struct page *__read_swap_cache_async(swp_entry_t entry, > gfp_t gfp_mask, > if (err != -EEXIST) > goto fail_put_swap; > > + /* > +* Protect against a recursive call to > __read_swap_cache_async() > +
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
Stanislav Fomichev wrote: > On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka > wrote: > > > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > > This patch enables Launch Time (Time-Based Scheduling) support to > > > > > > XDP zero > > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > > --- > > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > > > As requested before, I think we need to see another driver > > > > > implementing > > > > > this. > > > > > > > > > > I propose driver igc and chip i225. > > > > > > Sure. I will include igc patches in next version. > > > > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 second > > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver > > > > > that > > > > > mentions/documents these different hardware limitations. It is > > > > > natural > > > > > that different types of hardware have limitations. This is a close-to > > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > > limitations we can expose this API without too many limitations for > > > > > more > > > > > capable hardware. > > > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > > cannot be programmed. > > > > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > > So user won't know if their request is fail. > > > It is complex to inform user which request is failing. > > > Therefore, IMHO, it is good that we let driver handle the error silently. 
> > > > > > > If the programmed value is invalid, the packet will be "dropped" / will > > never make it to the wire, right? Programmable behavior is to either drop or cap to some boundary value, such as the farthest programmable time in the future: the horizon. In fq: /* Check if packet timestamp is too far in the future. */ if (fq_packet_beyond_horizon(skb, q, now)) { if (q->horizon_drop) { q->stat_horizon_drops++; return qdisc_drop(skb, sch, to_free); } q->stat_horizon_caps++; skb->tstamp = now + q->horizon; } fq_skb_cb(skb)->time_to_send = skb->tstamp; Drop is the more obviously correct mode. Programming with a clock source that the driver does not support will then be a persistent failure. Preferably, this driver capability can be queried beforehand (rather than only through reading error counters afterwards). Perhaps it should not be a driver task to convert from possibly multiple clock sources to the device native clock. Right now, we do use per-device timecounters for this, implemented in the driver. As for which clocks are relevant. For PTP, I suppose the device PHC, converted to nsec. For pacing offload, TCP uses CLOCK_MONOTONIC. > > > > That is clearly a situation that the user should be informed about. For > > RT systems this normally means that something is really wrong regarding > > timing / cycle overflow. Such systems have to react on that situation. > > In general, af_xdp is a bit lacking in this 'notify the user that they > somehow messed up' area :-( > For example, pushing a tx descriptor with a wrong addr/len in zc mode > will not give any visible signal back (besides driver potentially > spilling something into dmesg as it was in the mlx case). > We can probably start with having some counters for these events? This is because the AF_XDP completion queue descriptor format is only a u64 address? Could error conditions be reported on tx completion in the metadata, using xsk_tx_metadata_complete?
Re: [PATCH v8 2/6] memcontrol: implement mem_cgroup_tryget_online()
On Thu, Nov 30, 2023 at 11:40 AM Nhat Pham wrote: > > This patch implements a helper function that tries to get a reference to > a memcg's css, as well as checking if it is online. This new function > is almost exactly the same as the existing mem_cgroup_tryget(), except > for the onlineness check. In the !CONFIG_MEMCG case, it always returns > true, analogous to mem_cgroup_tryget(). This is useful, e.g., for the > new zswap writeback scheme, where we need to select the next online > memcg as a candidate for the global limit reclaim. > > Signed-off-by: Nhat Pham Reviewed-by: Yosry Ahmed > --- > include/linux/memcontrol.h | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 7bdcf3020d7a..2bd7d14ace78 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -821,6 +821,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > *memcg) > return !memcg || css_tryget(&memcg->css); > } > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > +{ > + return !memcg || css_tryget_online(&memcg->css); > +} > + > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > { > if (memcg) > css_put(&memcg->css); > @@ -1349,6 +1354,11 @@ static inline bool mem_cgroup_tryget(struct mem_cgroup > *memcg) > return true; > } > > +static inline bool mem_cgroup_tryget_online(struct mem_cgroup *memcg) > +{ > + return true; > +} > + > static inline void mem_cgroup_put(struct mem_cgroup *memcg) > { > } > -- > 2.34.1
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On Tue, Dec 5, 2023 at 8:54 AM Yonghong Song wrote: > > > On 12/4/23 3:14 PM, Kyle Huey wrote: > > The test sets a hardware breakpoint and uses a bpf program to suppress the > > I/O availability signal if the ip matches the expected value. > > > > Signed-off-by: Kyle Huey > > --- > > .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ > > .../selftests/bpf/progs/test_perf_skip.c | 23 + > > 2 files changed, 118 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > new file mode 100644 > > index ..b269a31669b7 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c > > @@ -0,0 +1,95 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#define _GNU_SOURCE > > +#include > > +#include "test_perf_skip.skel.h" > > +#include > > +#include > > + > > +#define BPF_OBJECT"test_perf_skip.bpf.o" > > + > > +static void handle_sig(int) > > I hit a warning here: > home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:10:27: > error: omitting the parameter name in a function definition is a C23 > extension [-Werror,-Wc23-extensions] Yeah, Meta's kernel-ci bot sent me off-list email about this one. > > 10 | static void handle_sig(int) >| > > Add a parameter and marked as unused can resolve the issue. 
> > #define __always_unused __attribute__((__unused__)) > > static void handle_sig(int unused __always_unused) > { > ASSERT_OK(1, "perf event not skipped"); > } > > > > +{ > > + ASSERT_OK(1, "perf event not skipped"); > > +} > > + > > +static noinline int test_function(void) > > +{ > > + return 0; > > +} > > + > > +void serial_test_perf_skip(void) > > +{ > > + sighandler_t previous; > > + int duration = 0; > > + struct test_perf_skip *skel = NULL; > > + int map_fd = -1; > > + long page_size = sysconf(_SC_PAGE_SIZE); > > + uintptr_t *ip = NULL; > > + int prog_fd = -1; > > + struct perf_event_attr attr = {0}; > > + int perf_fd = -1; > > + struct f_owner_ex owner; > > + int err; > > + > > + previous = signal(SIGIO, handle_sig); > > + > > + skel = test_perf_skip__open_and_load(); > > + if (!ASSERT_OK_PTR(skel, "skel_load")) > > + goto cleanup; > > + > > + prog_fd = bpf_program__fd(skel->progs.handler); > > + if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd")) > > + goto cleanup; > > + > > + map_fd = bpf_map__fd(skel->maps.ip); > > + if (!ASSERT_OK(map_fd < 0, "bpf_map__fd")) > > + goto cleanup; > > + > > + ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, > > map_fd, 0); > > + if (!ASSERT_OK_PTR(ip, "mmap bpf map")) > > + goto cleanup; > > + > > + *ip = (uintptr_t)test_function; > > + > > + attr.type = PERF_TYPE_BREAKPOINT; > > + attr.size = sizeof(attr); > > + attr.bp_type = HW_BREAKPOINT_X; > > + attr.bp_addr = (uintptr_t)test_function; > > + attr.bp_len = sizeof(long); > > + attr.sample_period = 1; > > + attr.sample_type = PERF_SAMPLE_IP; > > + attr.pinned = 1; > > + attr.exclude_kernel = 1; > > + attr.exclude_hv = 1; > > + attr.precise_ip = 3; > > + > > + perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); > > + if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd)) > > + goto cleanup; > > + > > + err = fcntl(perf_fd, F_SETFL, O_ASYNC); > > + if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)")) > > + goto cleanup; > > + > > + owner.type = F_OWNER_TID; > > + owner.pid = gettid(); > > I hit a compilation failure here: > > /home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:75:14: > error: call to undeclared function 'gettid'; ISO C99 and later do not > support implicit function declarations [-Wimplicit-function-declaration] > 75 | owner.pid = gettid(); >| ^ > > If you looked at some other examples, the common usage is to do > 'syscall(SYS_gettid)'. Not clear why this works for me but sure I'll change that. > > So the following patch should fix the compilation error: > > #include > ... > owner.pid = syscall(SYS_gettid); > ... > > > + err = fcntl(perf_fd, F_SETOWN_EX, &owner); > > + if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)")) > > + goto cleanup; > > + > > + err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); > > + if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)")) > > + goto cleanup; > > + > > + test_function(); > > As Andrii has mentioned in previous comments, we will have an > issue if the RELEASE version of selftest is built >RELEASE=1 make ... > > See >
Re: [PATCH v6 2/6] iommufd: Add IOMMU_HWPT_INVALIDATE
On Mon, Dec 04, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote: > > Or am I missing some point here? > > It sounds Ok, we just have to understand what userspace should be > doing and how much of this the kernel should implement. > > It seems to me that the error code should return the gerror and the > req_num should indicate the halted cons. The vmm should relay both > into the virtual registers. I see your concern. I will take a closer look and see if we can add to the initial version of arm_smmu_cache_invalidate_user(). Otherwise, we can add later. Btw, VT-d seems to want the error_code and reports in the VT-d specific invalidate entry structure, as Kevin and Yi had that discussion in the other side of the thread. Thanks Nicolin
Re: [PATCH 1/4] kunit: Add APIs for managing devices
On Tue, Dec 05, 2023 at 03:31:33PM +0800, david...@google.com wrote: > Tests for drivers often require a struct device to pass to other > functions. While it's possible to create these with > root_device_register(), or to use something like a platform device, this > is both a misuse of those APIs, and can be difficult to clean up after, > for example, a failed assertion. > > Add some KUnit-specific functions for registering and unregistering a > struct device: > - kunit_device_register() > - kunit_device_register_with_driver() > - kunit_device_unregister() > > These helpers allocate a device on a 'kunit' bus which will either probe the > driver passed in (kunit_device_register_with_driver), or will create a > stub driver (kunit_device_register) which is cleaned up on test shutdown. > > Devices are automatically unregistered on test shutdown, but can be > manually unregistered earlier with kunit_device_unregister() in order > to, for example, test device release code. At first glance, nice work. But looks like 0-day doesn't like it that much, so I'll wait for the next version to review it properly. One nit I did notice: > +// For internal use only -- registers the kunit_bus. > +int kunit_bus_init(void); Put stuff like this in a local .h file, don't pollute the include/linux/ files for things that you do not want any other part of the kernel to call. > +/** > + * kunit_device_register_with_driver() - Create a struct device for use in > KUnit tests > + * @test: The test context object. > + * @name: The name to give the created device. > + * @drv: The struct device_driver to associate with the device. > + * > + * Creates a struct kunit_device (which is a struct device) with the given > + * name, and driver. The device will be cleaned up on test exit, or when > + * kunit_device_unregister is called. 
See also kunit_device_register, if you > + * wish KUnit to create and manage a driver for you > + */ > +struct device *kunit_device_register_with_driver(struct kunit *test, > + const char *name, > + struct device_driver *drv); Shouldn't "struct device_driver *" be a constant pointer? But really, why is this a "raw" device_driver pointer and not a pointer to the driver type for your bus? Oh heck, let's point out the other issues as I'm already here... > @@ -7,7 +7,8 @@ kunit-objs += test.o \ > assert.o \ > try-catch.o \ > executor.o \ > - attributes.o > + attributes.o \ > + device.o Shouldn't this file be "bus.c" as you are creating a kunit bus? > > ifeq ($(CONFIG_KUNIT_DEBUGFS),y) > kunit-objs +=debugfs.o > diff --git a/lib/kunit/device.c b/lib/kunit/device.c > new file mode 100644 > index ..93ace1a2297d > --- /dev/null > +++ b/lib/kunit/device.c > @@ -0,0 +1,176 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * KUnit basic device implementation "basic bus/driver implementation", not device, right? > + * > + * Implementation of struct kunit_device helpers. > + * > + * Copyright (C) 2023, Google LLC. > + * Author: David Gow > + */ > + > +#include > + > +#include > +#include > +#include > + > + > +/* Wrappers for use with kunit_add_action() */ > +KUNIT_DEFINE_ACTION_WRAPPER(device_unregister_wrapper, device_unregister, > struct device *); > +KUNIT_DEFINE_ACTION_WRAPPER(driver_unregister_wrapper, driver_unregister, > struct device_driver *); > + > +static struct device kunit_bus = { > + .init_name = "kunit" > +}; A static device as a bus? This feels wrong, what is it for? And where does this live? If you _REALLY_ want a single device for the root of your bus (which is a good idea), then make it a dynamic variable (as it is reference counted), NOT a static struct device which should not be done if at all possible. > + > +/* A device owned by a KUnit test. 
*/ > +struct kunit_device { > + struct device dev; > + struct kunit *owner; > + /* Force binding to a specific driver. */ > + struct device_driver *driver; > + /* The driver is managed by KUnit and unique to this device. */ > + bool cleanup_driver; > +}; Wait, why isn't your "kunit" device above a struct kunit_device structure? Why is it ok to be a "raw" struct device (hint, that's almost never a good idea.) > +static inline struct kunit_device *to_kunit_device(struct device *d) > +{ > + return container_of(d, struct kunit_device, dev); container_of_const()? And to use that properly, why not make this a #define? > +} > + > +static int kunit_bus_match(struct device *dev, struct device_driver *driver) > +{ > + struct kunit_device *kunit_dev = to_kunit_device(dev); > + > + if (kunit_dev->driver == driver) > +
Re: [PATCH v8 1/6] list_lru: allows explicit memcg and NUMA node selection
On Mon, Dec 04, 2023 at 04:30:44PM -0800, Chris Li wrote: > On Thu, Nov 30, 2023 at 12:35 PM Johannes Weiner wrote: > > > > On Thu, Nov 30, 2023 at 12:07:41PM -0800, Nhat Pham wrote: > > > On Thu, Nov 30, 2023 at 11:57 AM Matthew Wilcox > > > wrote: > > > > > > > > On Thu, Nov 30, 2023 at 11:40:18AM -0800, Nhat Pham wrote: > > > > > This patch changes list_lru interface so that the caller must > > > > > explicitly > > > > > specify numa node and memcg when adding and removing objects. The old > > > > > list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() > > > > > and > > > > > list_lru_del_obj(), respectively. > > > > > > > > Wouldn't it be better to add list_lru_add_memcg() and > > > > list_lru_del_memcg() and have: > > That is my first thought as well. If we are having two different > flavors of LRU add, one has memcg and one without. The list_lru_add() > vs list_lru_add_memcg() is the common way to do it. > > > > > > > > +bool list_lru_del(struct list_lru *lru, struct list_head *item) > > > > +{ > > > > + int nid = page_to_nid(virt_to_page(item)); > > > > + struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ? > > > > + mem_cgroup_from_slab_obj(item) : NULL; > > > > + > > > > + return list_lru_del_memcg(lru, item, nid, memcg); > > > > +} > > > > > > > > Seems like _most_ callers will want the original versions and only > > > > a few will want the explicit memcg/nid versions. No? > > > > > > > > > > I actually did something along that line in earlier iterations of this > > > patch series (albeit with poorer naming - __list_lru_add() instead of > > > list_lru_add_memcg()). The consensus after some back and forth was > > > that the original list_lru_add() was not a very good design (the > > > better one was this new version that allows for explicit numa/memcg > > > selection). So I agreed to fix it everywhere as a prep patch. 
> > > > > > I don't have strong opinions here to be completely honest, but I do > > > think this new API makes more sense (at the cost of quite a bit of > > > elbow grease to fix every callsite and extra reviewing). > > > > Maybe I can shed some light since I was pushing for doing it this way. > > > > The quiet assumption that 'struct list_head *item' is (embedded in) a > > slab object that is also charged to a cgroup is a bit much, given that > > nothing in the name or documentation of the function points to that. > > We can add it to the document if that is desirable. It would help, but it still violates the "easy to use, hard to misuse" principle. And I think it does the API layering backwards. list_lru_add() is the "default" API function. It makes sense to keep that simple and robust, then add convenience wrappers for additional, specialized functionality like memcg lookups for charged slab objects - even if that's a common usecase. It's better for a new user to be paused by the required memcg argument in the default function and then go and find list_lru_add_obj(), than it is for somebody to quietly pass an invalid object to list_lru_add() and have subtle runtime problems and crashes (which has happened twice now already).
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Tue, Dec 5, 2023 at 7:34 AM Florian Bezdeka wrote: > > On Tue, 2023-12-05 at 15:25 +, Song, Yoong Siang wrote: > > On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote: > > > Jesper Dangaard Brouer wrote: > > > > > > > > > > > > On 12/3/23 17:51, Song Yoong Siang wrote: > > > > > This patch enables Launch Time (Time-Based Scheduling) support to XDP > > > > > zero > > > > > copy via XDP Tx metadata framework. > > > > > > > > > > Signed-off-by: Song Yoong Siang > > > > > --- > > > > > drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++ > > > > > > > > As requested before, I think we need to see another driver implementing > > > > this. > > > > > > > > I propose driver igc and chip i225. > > > > Sure. I will include igc patches in next version. > > > > > > > > > > The interesting thing for me is to see how the LaunchTime max 1 second > > > > into the future[1] is handled code wise. One suggestion is to add a > > > > section to Documentation/networking/xsk-tx-metadata.rst per driver that > > > > mentions/documents these different hardware limitations. It is natural > > > > that different types of hardware have limitations. This is a close-to > > > > hardware-level abstraction/API, and IMHO as long as we document the > > > > limitations we can expose this API without too many limitations for more > > > > capable hardware. > > > > Sure. I will try to add hardware limitations in documentation. > > > > > > > > I would assume that the kfunc will fail when a value is passed that > > > cannot be programmed. > > > > > > > In current design, the xsk_tx_metadata_request() dint got return value. > > So user won't know if their request is fail. > > It is complex to inform user which request is failing. > > Therefore, IMHO, it is good that we let driver handle the error silently. > > > > If the programmed value is invalid, the packet will be "dropped" / will > never make it to the wire, right? > > That is clearly a situation that the user should be informed about. 
For > RT systems this normally means that something is really wrong regarding > timing / cycle overflow. Such systems have to react on that situation. In general, af_xdp is a bit lacking in this 'notify the user that they somehow messed up' area :-( For example, pushing a tx descriptor with a wrong addr/len in zc mode will not give any visible signal back (besides driver potentially spilling something into dmesg as it was in the mlx case). We can probably start with having some counters for these events?
Re: [PATCH 2/2] selftest/bpf: Test returning zero from a perf bpf program suppresses SIGIO.
On 12/4/23 3:14 PM, Kyle Huey wrote: The test sets a hardware breakpoint and uses a bpf program to suppress the I/O availability signal if the ip matches the expected value. Signed-off-by: Kyle Huey --- .../selftests/bpf/prog_tests/perf_skip.c | 95 +++ .../selftests/bpf/progs/test_perf_skip.c | 23 + 2 files changed, 118 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/perf_skip.c create mode 100644 tools/testing/selftests/bpf/progs/test_perf_skip.c diff --git a/tools/testing/selftests/bpf/prog_tests/perf_skip.c b/tools/testing/selftests/bpf/prog_tests/perf_skip.c new file mode 100644 index ..b269a31669b7 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/perf_skip.c @@ -0,0 +1,95 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include "test_perf_skip.skel.h" +#include +#include + +#define BPF_OBJECT"test_perf_skip.bpf.o" + +static void handle_sig(int) I hit a warning here: home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:10:27: error: omitting the parameter name in a function definition is a C23 extension [-Werror,-Wc23-extensions] 10 | static void handle_sig(int) | Add a parameter and marked as unused can resolve the issue. 
#define __always_unused __attribute__((__unused__)) static void handle_sig(int unused __always_unused) { ASSERT_OK(1, "perf event not skipped"); } +{ + ASSERT_OK(1, "perf event not skipped"); +} + +static noinline int test_function(void) +{ + return 0; +} + +void serial_test_perf_skip(void) +{ + sighandler_t previous; + int duration = 0; + struct test_perf_skip *skel = NULL; + int map_fd = -1; + long page_size = sysconf(_SC_PAGE_SIZE); + uintptr_t *ip = NULL; + int prog_fd = -1; + struct perf_event_attr attr = {0}; + int perf_fd = -1; + struct f_owner_ex owner; + int err; + + previous = signal(SIGIO, handle_sig); + + skel = test_perf_skip__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_load")) + goto cleanup; + + prog_fd = bpf_program__fd(skel->progs.handler); + if (!ASSERT_OK(prog_fd < 0, "bpf_program__fd")) + goto cleanup; + + map_fd = bpf_map__fd(skel->maps.ip); + if (!ASSERT_OK(map_fd < 0, "bpf_map__fd")) + goto cleanup; + + ip = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0); + if (!ASSERT_OK_PTR(ip, "mmap bpf map")) + goto cleanup; + + *ip = (uintptr_t)test_function; + + attr.type = PERF_TYPE_BREAKPOINT; + attr.size = sizeof(attr); + attr.bp_type = HW_BREAKPOINT_X; + attr.bp_addr = (uintptr_t)test_function; + attr.bp_len = sizeof(long); + attr.sample_period = 1; + attr.sample_type = PERF_SAMPLE_IP; + attr.pinned = 1; + attr.exclude_kernel = 1; + attr.exclude_hv = 1; + attr.precise_ip = 3; + + perf_fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); + if (CHECK(perf_fd < 0, "perf_event_open", "err %d\n", perf_fd)) + goto cleanup; + + err = fcntl(perf_fd, F_SETFL, O_ASYNC); + if (!ASSERT_OK(err, "fcntl(F_SETFL, O_ASYNC)")) + goto cleanup; + + owner.type = F_OWNER_TID; + owner.pid = gettid(); I hit a compilation failure here: /home/yhs/work/bpf-next/tools/testing/selftests/bpf/prog_tests/perf_skip.c:75:14: error: call to undeclared function 'gettid'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 75 | owner.pid = gettid(); | ^ If you looked at some other examples, the common usage is to do 'syscall(SYS_gettid)'. So the following patch should fix the compilation error: #include ... owner.pid = syscall(SYS_gettid); ... + err = fcntl(perf_fd, F_SETOWN_EX, &owner); + if (!ASSERT_OK(err, "fcntl(F_SETOWN_EX)")) + goto cleanup; + + err = ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); + if (!ASSERT_OK(err, "ioctl(PERF_EVENT_IOC_SET_BPF)")) + goto cleanup; + + test_function(); As Andrii has mentioned in previous comments, we will have an issue if the RELEASE version of selftest is built: RELEASE=1 make ... See https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.s...@linux.dev + +cleanup: + if (perf_fd >= 0) + close(perf_fd); + if (ip) + munmap(ip, page_size); + if (skel) + test_perf_skip__destroy(skel); + + signal(SIGIO, previous); +} diff --git a/tools/testing/selftests/bpf/progs/test_perf_skip.c b/tools/testing/selftests/bpf/progs/test_perf_skip.c new file mode 100644 index ..ef01a9161afe --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_perf_skip.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0
[PATCH v3 19/21] kselftest/arm64: Add 2023 DPISA hwcap test coverage
Add the hwcaps added for the 2023 DPISA extensions to the hwcaps test program. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/abi/hwcap.c | 217 ++ 1 file changed, 217 insertions(+) diff --git a/tools/testing/selftests/arm64/abi/hwcap.c b/tools/testing/selftests/arm64/abi/hwcap.c index 1189e77c8152..d8909b2b535a 100644 --- a/tools/testing/selftests/arm64/abi/hwcap.c +++ b/tools/testing/selftests/arm64/abi/hwcap.c @@ -58,11 +58,46 @@ static void cssc_sigill(void) asm volatile(".inst 0xdac01c00" : : : "x0"); } +static void f8cvt_sigill(void) +{ + /* FSCALE V0.4H, V0.4H, V0.4H */ + asm volatile(".inst 0x2ec03c00"); +} + +static void f8dp2_sigill(void) +{ + /* FDOT V0.4H, V0.4H, V0.5H */ + asm volatile(".inst 0xe40fc00"); +} + +static void f8dp4_sigill(void) +{ + /* FDOT V0.2S, V0.2S, V0.2S */ + asm volatile(".inst 0xe00fc00"); +} + +static void f8fma_sigill(void) +{ + /* FMLALB V0.8H, V0.16B, V0.16B */ + asm volatile(".inst 0xec0fc00"); +} + +static void faminmax_sigill(void) +{ + /* FAMIN V0.4H, V0.4H, V0.4H */ + asm volatile(".inst 0x2ec01c00"); +} + static void fp_sigill(void) { asm volatile("fmov s0, #1"); } +static void fpmr_sigill(void) +{ + asm volatile("mrs x0, S3_3_C4_C4_2" : : : "x0"); +} + static void ilrcpc_sigill(void) { /* LDAPUR W0, [SP, #8] */ @@ -95,6 +130,12 @@ static void lse128_sigill(void) : "cc", "memory"); } +static void lut_sigill(void) +{ + /* LUTI2 V0.16B, { V0.16B }, V[0] */ + asm volatile(".inst 0x4e801000"); +} + static void mops_sigill(void) { char dst[1], src[1]; @@ -216,6 +257,78 @@ static void smef16f16_sigill(void) asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); } +static void smef8f16_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT ZA.H[W0, 0], Z0.B-Z1.B, Z0.B-Z1.B */ + asm volatile(".inst 0xc1a01020" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smef8f32_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + 
+ /* FDOT ZA.S[W0, 0], { Z0.B-Z1.B }, Z0.B[0] */ + asm volatile(".inst 0xc1500038" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smelutv2_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* LUTI4 { Z0.B-Z3.B }, ZT0, { Z0-Z1 } */ + asm volatile(".inst 0xc08b" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8dp2_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT Z0.H, Z0.B, Z0.B[0] */ + asm volatile(".inst 0x64204400" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8dp4_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FDOT Z0.S, Z0.B, Z0.B[0] */ + asm volatile(".inst 0xc1a41C00" : : : ); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + +static void smesf8fma_sigill(void) +{ + /* SMSTART */ + asm volatile("msr S0_3_C4_C7_3, xzr" : : : ); + + /* FMLALB V0.8H, V0.16B, V0.16B */ + asm volatile(".inst 0xec0fc00"); + + /* SMSTOP */ + asm volatile("msr S0_3_C4_C6_3, xzr" : : : ); +} + static void sve_sigill(void) { /* RDVL x0, #0 */ @@ -353,6 +466,53 @@ static const struct hwcap_data { .cpuinfo = "cssc", .sigill_fn = cssc_sigill, }, + { + .name = "F8CVT", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8CVT, + .cpuinfo = "f8cvt", + .sigill_fn = f8cvt_sigill, + }, + { + .name = "F8DP4", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8DP4, + .cpuinfo = "f8dp4", + .sigill_fn = f8dp4_sigill, + }, + { + .name = "F8DP2", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8DP2, + .cpuinfo = "f8dp4", + .sigill_fn = f8dp2_sigill, + }, + { + .name = "F8E5M2", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8E5M2, + .cpuinfo = "f8e5m2", + }, + { + .name = "F8E4M3", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = HWCAP2_F8E4M3, + .cpuinfo = "f8e4m3", + }, + { + .name = "F8FMA", + .at_hwcap = AT_HWCAP2, + .hwcap_bit = 
HWCAP2_F8FMA, + .cpuinfo = "f8fma", + .sigill_fn = f8fma_sigill, + }, +
[PATCH v3 20/21] KVM: arm64: selftests: Document feature registers added in 2023 extensions
The 2023 architecture extensions allocated some previously unused feature registers; add comments mapping the names in get-reg-list as we do for the other allocated registers. Signed-off-by: Mark Brown --- tools/testing/selftests/kvm/aarch64/get-reg-list.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/aarch64/get-reg-list.c b/tools/testing/selftests/kvm/aarch64/get-reg-list.c index 709d7d721760..71ea6ecec7ce 100644 --- a/tools/testing/selftests/kvm/aarch64/get-reg-list.c +++ b/tools/testing/selftests/kvm/aarch64/get-reg-list.c @@ -428,7 +428,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 0, 0, 4, 4), /* ID_AA64ZFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 4, 5), /* ID_AA64SMFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 4, 6), - ARM64_SYS_REG(3, 0, 0, 4, 7), + ARM64_SYS_REG(3, 0, 0, 4, 7), /* ID_AA64FPFR_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 0), /* ID_AA64DFR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 1), /* ID_AA64DFR1_EL1 */ ARM64_SYS_REG(3, 0, 0, 5, 2), @@ -440,7 +440,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 0, 0, 6, 0), /* ID_AA64ISAR0_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 1), /* ID_AA64ISAR1_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 2), /* ID_AA64ISAR2_EL1 */ - ARM64_SYS_REG(3, 0, 0, 6, 3), + ARM64_SYS_REG(3, 0, 0, 6, 3), /* ID_AA64ISAR3_EL1 */ ARM64_SYS_REG(3, 0, 0, 6, 4), ARM64_SYS_REG(3, 0, 0, 6, 5), ARM64_SYS_REG(3, 0, 0, 6, 6), -- 2.30.2
[PATCH v3 21/21] KVM: arm64: selftests: Teach get-reg-list about FPMR
FEAT_FPMR defines a new register FPMR which is available at all ELs and is discovered via ID_AA64PFR2_EL1.FPMR; add this to the set of registers that get-reg-list knows to check for, with the required identification register dependency. Signed-off-by: Mark Brown --- tools/testing/selftests/kvm/aarch64/get-reg-list.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/kvm/aarch64/get-reg-list.c b/tools/testing/selftests/kvm/aarch64/get-reg-list.c index 71ea6ecec7ce..1e43511d1440 100644 --- a/tools/testing/selftests/kvm/aarch64/get-reg-list.c +++ b/tools/testing/selftests/kvm/aarch64/get-reg-list.c @@ -40,6 +40,12 @@ static struct feature_id_reg feat_id_regs[] = { ARM64_SYS_REG(3, 0, 0, 7, 3), /* ID_AA64MMFR3_EL1 */ 4, 1 + }, + { + ARM64_SYS_REG(3, 3, 4, 4, 2), /* FPMR */ + ARM64_SYS_REG(3, 0, 0, 4, 2), /* ID_AA64PFR2_EL1 */ + 32, + 1 } }; @@ -481,6 +487,7 @@ static __u64 base_regs[] = { ARM64_SYS_REG(3, 3, 14, 2, 1), /* CNTP_CTL_EL0 */ ARM64_SYS_REG(3, 3, 14, 2, 2), /* CNTP_CVAL_EL0 */ ARM64_SYS_REG(3, 4, 3, 0, 0), /* DACR32_EL2 */ + ARM64_SYS_REG(3, 3, 4, 4, 2), /* FPMR */ ARM64_SYS_REG(3, 4, 5, 0, 1), /* IFSR32_EL2 */ ARM64_SYS_REG(3, 4, 5, 3, 0), /* FPEXC32_EL2 */ }; -- 2.30.2
[PATCH v3 18/21] kselftest/arm64: Add basic FPMR test
Verify that a FPMR frame is generated on systems that support FPMR and not generated otherwise. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/signal/.gitignore| 1 + .../arm64/signal/testcases/fpmr_siginfo.c | 82 ++ 2 files changed, 83 insertions(+) diff --git a/tools/testing/selftests/arm64/signal/.gitignore b/tools/testing/selftests/arm64/signal/.gitignore index 839e3a252629..1ce5b5eac386 100644 --- a/tools/testing/selftests/arm64/signal/.gitignore +++ b/tools/testing/selftests/arm64/signal/.gitignore @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only mangle_* fake_sigreturn_* +fpmr_* sme_* ssve_* sve_* diff --git a/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c b/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c new file mode 100644 index ..e9d24685e741 --- /dev/null +++ b/tools/testing/selftests/arm64/signal/testcases/fpmr_siginfo.c @@ -0,0 +1,82 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2023 ARM Limited + * + * Verify that the FPMR register context in signal frames is set up as + * expected. 
+ */ + +#include +#include +#include +#include +#include +#include + +#include "test_signals_utils.h" +#include "testcases.h" + +static union { + ucontext_t uc; + char buf[1024 * 128]; +} context; + +#define SYS_FPMR "S3_3_C4_C4_2" + +static uint64_t get_fpmr(void) +{ + uint64_t val; + + asm volatile ( + "mrs%0, " SYS_FPMR "\n" + : "=r"(val) + : + : "cc"); + + return val; +} + +int fpmr_present(struct tdescr *td, siginfo_t *si, ucontext_t *uc) +{ + struct _aarch64_ctx *head = GET_BUF_RESV_HEAD(context); + struct fpmr_context *fpmr_ctx; + size_t offset; + bool in_sigframe; + bool have_fpmr; + __u64 orig_fpmr; + + have_fpmr = getauxval(AT_HWCAP2) & HWCAP2_FPMR; + if (have_fpmr) + orig_fpmr = get_fpmr(); + + if (!get_current_context(td, , sizeof(context))) + return 1; + + fpmr_ctx = (struct fpmr_context *) + get_header(head, FPMR_MAGIC, td->live_sz, ); + + in_sigframe = fpmr_ctx != NULL; + + fprintf(stderr, "FPMR sigframe %s on system %s FPMR\n", + in_sigframe ? "present" : "absent", + have_fpmr ? "with" : "without"); + + td->pass = (in_sigframe == have_fpmr); + + if (have_fpmr && fpmr_ctx) { + if (fpmr_ctx->fpmr != orig_fpmr) { + fprintf(stderr, "FPMR in frame is %llx, was %llx\n", + fpmr_ctx->fpmr, orig_fpmr); + td->pass = false; + } + } + + return 0; +} + +struct tdescr tde = { + .name = "FPMR", + .descr = "Validate that FPMR is present as expected", + .timeout = 3, + .run = fpmr_present, +}; -- 2.30.2
[PATCH v3 17/21] kselftest/arm64: Handle FPMR context in generic signal frame parser
Teach the generic signal frame parsing code about the newly added FPMR frame, avoiding warnings every time one is generated. Signed-off-by: Mark Brown --- tools/testing/selftests/arm64/signal/testcases/testcases.c | 8 tools/testing/selftests/arm64/signal/testcases/testcases.h | 1 + 2 files changed, 9 insertions(+) diff --git a/tools/testing/selftests/arm64/signal/testcases/testcases.c b/tools/testing/selftests/arm64/signal/testcases/testcases.c index 9f580b55b388..674b88cc8c39 100644 --- a/tools/testing/selftests/arm64/signal/testcases/testcases.c +++ b/tools/testing/selftests/arm64/signal/testcases/testcases.c @@ -209,6 +209,14 @@ bool validate_reserved(ucontext_t *uc, size_t resv_sz, char **err) zt = (struct zt_context *)head; new_flags |= ZT_CTX; break; + case FPMR_MAGIC: + if (flags & FPMR_CTX) + *err = "Multiple FPMR_MAGIC"; + else if (head->size != +sizeof(struct fpmr_context)) + *err = "Bad size for fpmr_context"; + new_flags |= FPMR_CTX; + break; case EXTRA_MAGIC: if (flags & EXTRA_CTX) *err = "Multiple EXTRA_MAGIC"; diff --git a/tools/testing/selftests/arm64/signal/testcases/testcases.h b/tools/testing/selftests/arm64/signal/testcases/testcases.h index a08ab0d6207a..7727126347e0 100644 --- a/tools/testing/selftests/arm64/signal/testcases/testcases.h +++ b/tools/testing/selftests/arm64/signal/testcases/testcases.h @@ -19,6 +19,7 @@ #define ZA_CTX (1 << 2) #define EXTRA_CTX (1 << 3) #define ZT_CTX (1 << 4) +#define FPMR_CTX (1 << 5) #define KSFT_BAD_MAGIC 0xdeadbeef -- 2.30.2
[PATCH v3 14/21] KVM: arm64: Add newly allocated ID registers to register descriptions
The 2023 architecture extensions have allocated some new ID registers, add them to the KVM system register descriptions so that they are visible to guests. Signed-off-by: Mark Brown --- arch/arm64/kvm/sys_regs.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c index 4735e1b37fb3..b843da5e4bb9 100644 --- a/arch/arm64/kvm/sys_regs.c +++ b/arch/arm64/kvm/sys_regs.c @@ -2139,12 +2139,12 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_AA64PFR0_EL1_AdvSIMD | ID_AA64PFR0_EL1_FP), }, ID_SANITISED(ID_AA64PFR1_EL1), - ID_UNALLOCATED(4,2), + ID_SANITISED(ID_AA64PFR2_EL1), ID_UNALLOCATED(4,3), ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0), ID_HIDDEN(ID_AA64SMFR0_EL1), ID_UNALLOCATED(4,6), - ID_UNALLOCATED(4,7), + ID_SANITISED(ID_AA64FPFR0_EL1), /* CRm=5 */ { SYS_DESC(SYS_ID_AA64DFR0_EL1), @@ -2171,7 +2171,7 @@ static const struct sys_reg_desc sys_reg_descs[] = { ID_WRITABLE(ID_AA64ISAR2_EL1, ~(ID_AA64ISAR2_EL1_RES0 | ID_AA64ISAR2_EL1_APA3 | ID_AA64ISAR2_EL1_GPA3)), - ID_UNALLOCATED(6,3), + ID_WRITABLE(ID_AA64ISAR3_EL1, ~ID_AA64ISAR3_EL1_RES0), ID_UNALLOCATED(6,4), ID_UNALLOCATED(6,5), ID_UNALLOCATED(6,6), -- 2.30.2
[PATCH v3 16/21] arm64/hwcap: Define hwcaps for 2023 DPISA features
The 2023 architecture extensions include a large number of floating point features, most of which simply add new instructions. Add hwcaps so that userspace can enumerate these features. Signed-off-by: Mark Brown --- Documentation/arch/arm64/elf_hwcaps.rst | 49 + arch/arm64/include/asm/hwcap.h | 15 ++ arch/arm64/include/uapi/asm/hwcap.h | 15 ++ arch/arm64/kernel/cpufeature.c | 35 +++ arch/arm64/kernel/cpuinfo.c | 15 ++ 5 files changed, 129 insertions(+) diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst index ced7b335e2e0..448c1664879b 100644 --- a/Documentation/arch/arm64/elf_hwcaps.rst +++ b/Documentation/arch/arm64/elf_hwcaps.rst @@ -317,6 +317,55 @@ HWCAP2_LRCPC3 HWCAP2_LSE128 Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011. +HWCAP2_FPMR +Functionality implied by ID_AA64PFR2_EL1.FMR == 0b0001. + +HWCAP2_LUT +Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001. + +HWCAP2_FAMINMAX +Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001. + +HWCAP2_F8CVT +Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1. + +HWCAP2_F8FMA +Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1. + +HWCAP2_F8DP4 +Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1. + +HWCAP2_F8DP2 +Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1. + +HWCAP2_F8E4M3 +Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1. + +HWCAP2_F8E5M2 +Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1. + +HWCAP2_SME_LUTV2 +Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1. + +HWCAP2_SME_F8F16 +Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1. + +HWCAP2_SME_F8F32 +Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1. + +HWCAP2_SME_SF8FMA +Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1. + +HWCAP2_SME_SF8DP4 +Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1. + +HWCAP2_SME_SF8DP2 +Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1. 
+ + 4. Unused AT_HWCAP bits --- diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h index cd71e09ea14d..4edd3b61df11 100644 --- a/arch/arm64/include/asm/hwcap.h +++ b/arch/arm64/include/asm/hwcap.h @@ -142,6 +142,21 @@ #define KERNEL_HWCAP_SVE_B16B16__khwcap2_feature(SVE_B16B16) #define KERNEL_HWCAP_LRCPC3__khwcap2_feature(LRCPC3) #define KERNEL_HWCAP_LSE128__khwcap2_feature(LSE128) +#define KERNEL_HWCAP_FPMR __khwcap2_feature(FPMR) +#define KERNEL_HWCAP_LUT __khwcap2_feature(LUT) +#define KERNEL_HWCAP_FAMINMAX __khwcap2_feature(FAMINMAX) +#define KERNEL_HWCAP_F8CVT __khwcap2_feature(F8CVT) +#define KERNEL_HWCAP_F8FMA __khwcap2_feature(F8FMA) +#define KERNEL_HWCAP_F8DP4 __khwcap2_feature(F8DP4) +#define KERNEL_HWCAP_F8DP2 __khwcap2_feature(F8DP2) +#define KERNEL_HWCAP_F8E4M3__khwcap2_feature(F8E4M3) +#define KERNEL_HWCAP_F8E5M2__khwcap2_feature(F8E5M2) +#define KERNEL_HWCAP_SME_LUTV2 __khwcap2_feature(SME_LUTV2) +#define KERNEL_HWCAP_SME_F8F16 __khwcap2_feature(SME_F8F16) +#define KERNEL_HWCAP_SME_F8F32 __khwcap2_feature(SME_F8F32) +#define KERNEL_HWCAP_SME_SF8FMA__khwcap2_feature(SME_SF8FMA) +#define KERNEL_HWCAP_SME_SF8DP4__khwcap2_feature(SME_SF8DP4) +#define KERNEL_HWCAP_SME_SF8DP2__khwcap2_feature(SME_SF8DP2) /* * This yields a mask that user programs can use to figure out what diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h index 5023599fa278..285610e626f5 100644 --- a/arch/arm64/include/uapi/asm/hwcap.h +++ b/arch/arm64/include/uapi/asm/hwcap.h @@ -107,5 +107,20 @@ #define HWCAP2_SVE_B16B16 (1UL << 45) #define HWCAP2_LRCPC3 (1UL << 46) #define HWCAP2_LSE128 (1UL << 47) +#define HWCAP2_FPMR(1UL << 48) +#define HWCAP2_LUT (1UL << 49) +#define HWCAP2_FAMINMAX(1UL << 50) +#define HWCAP2_F8CVT (1UL << 51) +#define HWCAP2_F8FMA (1UL << 52) +#define HWCAP2_F8DP4 (1UL << 53) +#define HWCAP2_F8DP2 (1UL << 54) +#define 
HWCAP2_F8E4M3 (1UL << 55) +#define HWCAP2_F8E5M2 (1UL << 56) +#define HWCAP2_SME_LUTV2 (1UL << 57) +#define HWCAP2_SME_F8F16 (1UL << 58) +#define HWCAP2_SME_F8F32 (1UL << 59) +#define HWCAP2_SME_SF8FMA (1UL << 60) +#define HWCAP2_SME_SF8DP4 (1UL << 61) +#define HWCAP2_SME_SF8DP2 (1UL << 62) #endif /* _UAPI__ASM_HWCAP_H */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index ea0b680792de..33e301b6e31e 100644 ---
[PATCH v3 15/21] KVM: arm64: Support FEAT_FPMR for guests
FEAT_FPMR introduces a new system register FPMR which allows configuration of floating point behaviour, currently for FP8 specific features. Allow use of this in guests, disabling the trap while guests are running and saving and restoring the value along with the rest of the floating point state. Since FPMR is stored immediately after the main floating point state we share it with the hypervisor by adjusting the size of the shared region. Access to FPMR is covered by both a register specific trap HCRX_EL2.EnFPM and the overall floating point access trap so we just unconditionally enable the FPMR specific trap and rely on the floating point access trap to detect guest floating point usage. Signed-off-by: Mark Brown --- arch/arm64/include/asm/kvm_arm.h| 2 +- arch/arm64/include/asm/kvm_host.h | 4 +++- arch/arm64/kvm/emulate-nested.c | 9 + arch/arm64/kvm/fpsimd.c | 20 +--- arch/arm64/kvm/hyp/include/hyp/switch.h | 7 ++- arch/arm64/kvm/sys_regs.c | 11 +++ 6 files changed, 47 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 9f9239d86900..95f3b44e7c3a 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -103,7 +103,7 @@ #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H) #define HCRX_GUEST_FLAGS \ - (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | \ + (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM | \ (cpus_have_final_cap(ARM64_HAS_MOPS) ? 
(HCRX_EL2_MSCEn | HCRX_EL2_MCE2) : 0)) #define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index f8d98985a39c..9885adff06fa 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -391,6 +391,8 @@ enum vcpu_sysreg { CNTP_CVAL_EL0, CNTP_CTL_EL0, + FPMR, + /* Memory Tagging Extension registers */ RGSR_EL1, /* Random Allocation Tag Seed Register */ GCR_EL1,/* Tag Control Register */ @@ -517,7 +519,6 @@ struct kvm_vcpu_arch { enum fp_type fp_type; unsigned int sve_max_vl; u64 svcr; - u64 fpmr; /* Stage 2 paging state used by the hardware on next switch */ struct kvm_s2_mmu *hw_mmu; @@ -576,6 +577,7 @@ struct kvm_vcpu_arch { struct kvm_guest_debug_arch external_debug_state; struct user_fpsimd_state *host_fpsimd_state;/* hyp VA */ + u64 *host_fpmr; /* hyp VA */ struct task_struct *parent_task; struct { diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c index 06185216a297..802e5cde696f 100644 --- a/arch/arm64/kvm/emulate-nested.c +++ b/arch/arm64/kvm/emulate-nested.c @@ -67,6 +67,8 @@ enum cgt_group_id { CGT_HCR_TTLBIS, CGT_HCR_TTLBOS, + CGT_HCRX_EnFPM, + CGT_MDCR_TPMCR, CGT_MDCR_TPM, CGT_MDCR_TDE, @@ -279,6 +281,12 @@ static const struct trap_bits coarse_trap_bits[] = { .mask = HCR_TTLBOS, .behaviour = BEHAVE_FORWARD_ANY, }, + [CGT_HCRX_EnFPM] = { + .index = HCRX_EL2, + .value = HCRX_EL2_EnFPM, + .mask = HCRX_EL2_EnFPM, + .behaviour = BEHAVE_FORWARD_ANY, + }, [CGT_MDCR_TPMCR] = { .index = MDCR_EL2, .value = MDCR_EL2_TPMCR, @@ -478,6 +486,7 @@ static const struct encoding_to_trap_config encoding_to_cgt[] __initconst = { SR_TRAP(SYS_AIDR_EL1, CGT_HCR_TID1), SR_TRAP(SYS_SMIDR_EL1, CGT_HCR_TID1), SR_TRAP(SYS_CTR_EL0,CGT_HCR_TID2), + SR_TRAP(SYS_FPMR, CGT_HCRX_EnFPM), SR_TRAP(SYS_CCSIDR_EL1, CGT_HCR_TID2_TID4), SR_TRAP(SYS_CCSIDR2_EL1,CGT_HCR_TID2_TID4), SR_TRAP(SYS_CLIDR_EL1, CGT_HCR_TID2_TID4), diff --git 
a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c index e3e611e30e91..dee078625d0d 100644 --- a/arch/arm64/kvm/fpsimd.c +++ b/arch/arm64/kvm/fpsimd.c @@ -14,6 +14,16 @@ #include #include +static void *fpsimd_share_end(struct user_fpsimd_state *fpsimd) +{ + void *share_end = fpsimd + 1; + + if (cpus_have_final_cap(ARM64_HAS_FPMR)) + share_end += sizeof(u64); + + return share_end; +} + void kvm_vcpu_unshare_task_fp(struct kvm_vcpu *vcpu) { struct task_struct *p = vcpu->arch.parent_task; @@ -23,7 +33,7 @@ void kvm_vcpu_unshare_task_fp(struct kvm_vcpu *vcpu) return; fpsimd = >thread.uw.fpsimd_state; - kvm_unshare_hyp(fpsimd, fpsimd + 1); + kvm_unshare_hyp(fpsimd, fpsimd_share_end(fpsimd)); put_task_struct(p); } @@ -45,11 +55,15 @@ int
[PATCH v3 13/21] arm64/ptrace: Expose FPMR via ptrace
Add a new regset to expose FPMR via ptrace. It is not added to the FPSIMD registers since that structure is exposed elsewhere without any allowance for extension, so we don't add it there. Signed-off-by: Mark Brown --- arch/arm64/kernel/ptrace.c | 42 ++ include/uapi/linux/elf.h | 1 + 2 files changed, 43 insertions(+) diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 20d7ef82de90..cfb8a4d213be 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -697,6 +697,39 @@ static int tls_set(struct task_struct *target, const struct user_regset *regset, return ret; } +static int fpmr_get(struct task_struct *target, const struct user_regset *regset, + struct membuf to) +{ + if (!system_supports_fpmr()) + return -EINVAL; + + if (target == current) + fpsimd_preserve_current_state(); + + return membuf_store(&to, target->thread.fpmr); +} + +static int fpmr_set(struct task_struct *target, const struct user_regset *regset, + unsigned int pos, unsigned int count, + const void *kbuf, const void __user *ubuf) +{ + int ret; + unsigned long fpmr; + + if (!system_supports_fpmr()) + return -EINVAL; + + ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &fpmr, 0, count); + if (ret) + return ret; + + target->thread.fpmr = fpmr; + + fpsimd_flush_task_state(target); + + return 0; +} + static int system_call_get(struct task_struct *target, const struct user_regset *regset, struct membuf to) @@ -1417,6 +1450,7 @@ enum aarch64_regset { REGSET_HW_BREAK, REGSET_HW_WATCH, #endif + REGSET_FPMR, REGSET_SYSTEM_CALL, #ifdef CONFIG_ARM64_SVE REGSET_SVE, @@ -1495,6 +1529,14 @@ static const struct user_regset aarch64_regsets[] = { .regset_get = system_call_get, .set = system_call_set, }, + [REGSET_FPMR] = { + .core_note_type = NT_ARM_FPMR, + .n = 1, + .size = sizeof(u64), + .align = sizeof(u64), + .regset_get = fpmr_get, + .set = fpmr_set, + }, #ifdef CONFIG_ARM64_SVE [REGSET_SVE] = { /* Scalable Vector Extension */ .core_note_type = NT_ARM_SVE, diff --git a/include/uapi/linux/elf.h 
b/include/uapi/linux/elf.h index 9417309b7230..b54b313bcf07 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -440,6 +440,7 @@ typedef struct elf64_shdr { #define NT_ARM_SSVE 0x40b /* ARM Streaming SVE registers */ #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ +#define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ -- 2.30.2
[PATCH v3 12/21] arm64/signal: Add FPMR signal handling
Expose FPMR in the signal context on systems where it is supported. The kernel validates the exact size of the FPSIMD registers so we can't readily add it to fpsimd_context without disruption. Signed-off-by: Mark Brown --- arch/arm64/include/uapi/asm/sigcontext.h | 8 + arch/arm64/kernel/signal.c | 59 2 files changed, 67 insertions(+) diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h index f23c1dc3f002..8a45b7a411e0 100644 --- a/arch/arm64/include/uapi/asm/sigcontext.h +++ b/arch/arm64/include/uapi/asm/sigcontext.h @@ -152,6 +152,14 @@ struct tpidr2_context { __u64 tpidr2; }; +/* FPMR context */ +#define FPMR_MAGIC 0x46504d52 + +struct fpmr_context { + struct _aarch64_ctx head; + __u64 fpmr; +}; + #define ZA_MAGIC 0x54366345 struct za_context { diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c index 0e8beb3349ea..e8c808afcc8a 100644 --- a/arch/arm64/kernel/signal.c +++ b/arch/arm64/kernel/signal.c @@ -60,6 +60,7 @@ struct rt_sigframe_user_layout { unsigned long tpidr2_offset; unsigned long za_offset; unsigned long zt_offset; + unsigned long fpmr_offset; unsigned long extra_offset; unsigned long end_offset; }; @@ -182,6 +183,8 @@ struct user_ctxs { u32 za_size; struct zt_context __user *zt; u32 zt_size; + struct fpmr_context __user *fpmr; + u32 fpmr_size; }; static int preserve_fpsimd_context(struct fpsimd_context __user *ctx) @@ -227,6 +230,33 @@ static int restore_fpsimd_context(struct user_ctxs *user) return err ? 
-EFAULT : 0; } +static int preserve_fpmr_context(struct fpmr_context __user *ctx) +{ + int err = 0; + + current->thread.fpmr = read_sysreg_s(SYS_FPMR); + + __put_user_error(FPMR_MAGIC, >head.magic, err); + __put_user_error(sizeof(*ctx), >head.size, err); + __put_user_error(current->thread.fpmr, >fpmr, err); + + return err; +} + +static int restore_fpmr_context(struct user_ctxs *user) +{ + u64 fpmr; + int err = 0; + + if (user->fpmr_size != sizeof(*user->fpmr)) + return -EINVAL; + + __get_user_error(fpmr, >fpmr->fpmr, err); + if (!err) + write_sysreg_s(fpmr, SYS_FPMR); + + return err; +} #ifdef CONFIG_ARM64_SVE @@ -590,6 +620,7 @@ static int parse_user_sigframe(struct user_ctxs *user, user->tpidr2 = NULL; user->za = NULL; user->zt = NULL; + user->fpmr = NULL; if (!IS_ALIGNED((unsigned long)base, 16)) goto invalid; @@ -684,6 +715,17 @@ static int parse_user_sigframe(struct user_ctxs *user, user->zt_size = size; break; + case FPMR_MAGIC: + if (!system_supports_fpmr()) + goto invalid; + + if (user->fpmr) + goto invalid; + + user->fpmr = (struct fpmr_context __user *)head; + user->fpmr_size = size; + break; + case EXTRA_MAGIC: if (have_extra_context) goto invalid; @@ -806,6 +848,9 @@ static int restore_sigframe(struct pt_regs *regs, if (err == 0 && system_supports_tpidr2() && user.tpidr2) err = restore_tpidr2_context(); + if (err == 0 && system_supports_fpmr() && user.fpmr) + err = restore_fpmr_context(); + if (err == 0 && system_supports_sme() && user.za) err = restore_za_context(); @@ -928,6 +973,13 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user, } } + if (system_supports_fpmr()) { + err = sigframe_alloc(user, >fpmr_offset, +sizeof(struct fpmr_context)); + if (err) + return err; + } + return sigframe_alloc_end(user); } @@ -983,6 +1035,13 @@ static int setup_sigframe(struct rt_sigframe_user_layout *user, err |= preserve_tpidr2_context(tpidr2_ctx); } + /* FPMR if supported */ + if (system_supports_fpmr() && err == 0) { + struct fpmr_context 
__user *fpmr_ctx = + apply_user_offset(user, user->fpmr_offset); + err |= preserve_fpmr_context(fpmr_ctx); + } + /* ZA state if present */ if (system_supports_sme() && err == 0 && user->za_offset) { struct za_context __user *za_ctx = -- 2.30.2
[PATCH v3 11/21] arm64/fpsimd: Support FEAT_FPMR
FEAT_FPMR defines a new EL0 accessible register FPMR used to configure the FP8 related features added to the architecture at the same time. Detect support for this register and context switch it for EL0 when present. Due to the sharing of responsibility for saving floating point state between the host kernel and KVM, FP8 support is not yet implemented in KVM and a stub similar to that used for SVCR is provided for FPMR in order to avoid bisection issues. To make it easier to share host state with the hypervisor we store FPMR immediately after the base floating point state; existing usage means that it is not practical to extend that directly. Signed-off-by: Mark Brown --- arch/arm64/include/asm/cpufeature.h | 5 + arch/arm64/include/asm/fpsimd.h | 2 ++ arch/arm64/include/asm/kvm_host.h | 1 + arch/arm64/include/asm/processor.h | 2 ++ arch/arm64/kernel/cpufeature.c | 9 + arch/arm64/kernel/fpsimd.c | 13 + arch/arm64/kvm/fpsimd.c | 1 + arch/arm64/tools/cpucaps| 1 + 8 files changed, 34 insertions(+) diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index f6d416fe49b0..8e83cb1e6c7c 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -767,6 +767,11 @@ static __always_inline bool system_supports_tpidr2(void) return system_supports_sme(); } +static __always_inline bool system_supports_fpmr(void) +{ + return alternative_has_cap_unlikely(ARM64_HAS_FPMR); +} + static __always_inline bool system_supports_cnp(void) { return alternative_has_cap_unlikely(ARM64_HAS_CNP); diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h index 50e5f25d3024..74afca3bd312 100644 --- a/arch/arm64/include/asm/fpsimd.h +++ b/arch/arm64/include/asm/fpsimd.h @@ -89,6 +89,7 @@ struct cpu_fp_state { void *sve_state; void *sme_state; u64 *svcr; + u64 *fpmr; unsigned int sve_vl; unsigned int sme_vl; enum fp_type *fp_type; @@ -154,6 +155,7 @@ extern void cpu_enable_sve(const struct arm64_cpu_capabilities 
*__unused); extern void cpu_enable_sme(const struct arm64_cpu_capabilities *__unused); extern void cpu_enable_sme2(const struct arm64_cpu_capabilities *__unused); extern void cpu_enable_fa64(const struct arm64_cpu_capabilities *__unused); +extern void cpu_enable_fpmr(const struct arm64_cpu_capabilities *__unused); extern u64 read_smcr_features(void); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 824f29f04916..f8d98985a39c 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -517,6 +517,7 @@ struct kvm_vcpu_arch { enum fp_type fp_type; unsigned int sve_max_vl; u64 svcr; + u64 fpmr; /* Stage 2 paging state used by the hardware on next switch */ struct kvm_s2_mmu *hw_mmu; diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index e5bc54522e71..dd3a5b29f76e 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -158,6 +158,8 @@ struct thread_struct { struct user_fpsimd_state fpsimd_state; } uw; + u64 fpmr; /* Adjacent to fpsimd_state for KVM */ + enum fp_typefp_type;/* registers FPSIMD or SVE? 
*/ unsigned intfpsimd_cpu; void*sve_state; /* SVE registers, if any */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index c8d38e5ce997..ea0b680792de 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -272,6 +272,7 @@ static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = { }; static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = { + ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR2_EL1_FPMR_SHIFT, 4, 0), ARM64_FTR_END, }; @@ -2759,6 +2760,14 @@ static const struct arm64_cpu_capabilities arm64_features[] = { .matches = has_cpuid_feature, ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, EVT, IMP) }, + { + .desc = "FPMR", + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .capability = ARM64_HAS_FPMR, + .matches = has_cpuid_feature, + .cpu_enable = cpu_enable_fpmr, + ARM64_CPUID_FIELDS(ID_AA64PFR2_EL1, FPMR, IMP) + }, {}, }; diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c index 1559c706d32d..2a6abd6423f7 100644 --- a/arch/arm64/kernel/fpsimd.c +++ b/arch/arm64/kernel/fpsimd.c @@ -385,6 +385,9 @@ static void task_fpsimd_load(void) WARN_ON(!system_supports_fpsimd()); WARN_ON(!have_cpu_fpsimd_context()); + if (system_supports_fpmr()) +
[PATCH v3 09/21] arm64/cpufeature: Hook new identification registers up to cpufeature
The 2023 architecture extensions have defined several new ID registers, hook them up to the cpufeature code so we can add feature checks and hwcaps based on their contents. Signed-off-by: Mark Brown --- arch/arm64/include/asm/cpu.h | 3 +++ arch/arm64/kernel/cpufeature.c | 28 arch/arm64/kernel/cpuinfo.c| 3 +++ 3 files changed, 34 insertions(+) diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h index f3034099fd95..b99138bc3d4a 100644 --- a/arch/arm64/include/asm/cpu.h +++ b/arch/arm64/include/asm/cpu.h @@ -53,14 +53,17 @@ struct cpuinfo_arm64 { u64 reg_id_aa64isar0; u64 reg_id_aa64isar1; u64 reg_id_aa64isar2; + u64 reg_id_aa64isar3; u64 reg_id_aa64mmfr0; u64 reg_id_aa64mmfr1; u64 reg_id_aa64mmfr2; u64 reg_id_aa64mmfr3; u64 reg_id_aa64pfr0; u64 reg_id_aa64pfr1; + u64 reg_id_aa64pfr2; u64 reg_id_aa64zfr0; u64 reg_id_aa64smfr0; + u64 reg_id_aa64fpfr0; struct cpuinfo_32bitaarch32; }; diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 646591c67e7a..c8d38e5ce997 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -234,6 +234,10 @@ static const struct arm64_ftr_bits ftr_id_aa64isar2[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_id_aa64isar3[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64pfr0[] = { ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV3_SHIFT, 4, 0), ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV2_SHIFT, 4, 0), @@ -267,6 +271,10 @@ static const struct arm64_ftr_bits ftr_id_aa64pfr1[] = { ARM64_FTR_END, }; +static const struct arm64_ftr_bits ftr_id_aa64pfr2[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64zfr0[] = { ARM64_FTR_BITS(FTR_VISIBLE_IF_IS_ENABLED(CONFIG_ARM64_SVE), FTR_STRICT, FTR_LOWER_SAFE, ID_AA64ZFR0_EL1_F64MM_SHIFT, 4, 0), @@ -319,6 +327,10 @@ static const struct arm64_ftr_bits ftr_id_aa64smfr0[] = { ARM64_FTR_END, }; +static const 
struct arm64_ftr_bits ftr_id_aa64fpfr0[] = { + ARM64_FTR_END, +}; + static const struct arm64_ftr_bits ftr_id_aa64mmfr0[] = { ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR0_EL1_ECV_SHIFT, 4, 0), ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64MMFR0_EL1_FGT_SHIFT, 4, 0), @@ -702,10 +714,12 @@ static const struct __ftr_reg_entry { _aa64pfr0_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64PFR1_EL1, ftr_id_aa64pfr1, _aa64pfr1_override), + ARM64_FTR_REG(SYS_ID_AA64PFR2_EL1, ftr_id_aa64pfr2), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64ZFR0_EL1, ftr_id_aa64zfr0, _aa64zfr0_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64SMFR0_EL1, ftr_id_aa64smfr0, _aa64smfr0_override), + ARM64_FTR_REG(SYS_ID_AA64FPFR0_EL1, ftr_id_aa64fpfr0), /* Op1 = 0, CRn = 0, CRm = 5 */ ARM64_FTR_REG(SYS_ID_AA64DFR0_EL1, ftr_id_aa64dfr0), @@ -717,6 +731,7 @@ static const struct __ftr_reg_entry { _aa64isar1_override), ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64ISAR2_EL1, ftr_id_aa64isar2, _aa64isar2_override), + ARM64_FTR_REG(SYS_ID_AA64ISAR3_EL1, ftr_id_aa64isar3), /* Op1 = 0, CRn = 0, CRm = 7 */ ARM64_FTR_REG(SYS_ID_AA64MMFR0_EL1, ftr_id_aa64mmfr0), @@ -1043,14 +1058,17 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info) init_cpu_ftr_reg(SYS_ID_AA64ISAR0_EL1, info->reg_id_aa64isar0); init_cpu_ftr_reg(SYS_ID_AA64ISAR1_EL1, info->reg_id_aa64isar1); init_cpu_ftr_reg(SYS_ID_AA64ISAR2_EL1, info->reg_id_aa64isar2); + init_cpu_ftr_reg(SYS_ID_AA64ISAR3_EL1, info->reg_id_aa64isar3); init_cpu_ftr_reg(SYS_ID_AA64MMFR0_EL1, info->reg_id_aa64mmfr0); init_cpu_ftr_reg(SYS_ID_AA64MMFR1_EL1, info->reg_id_aa64mmfr1); init_cpu_ftr_reg(SYS_ID_AA64MMFR2_EL1, info->reg_id_aa64mmfr2); init_cpu_ftr_reg(SYS_ID_AA64MMFR3_EL1, info->reg_id_aa64mmfr3); init_cpu_ftr_reg(SYS_ID_AA64PFR0_EL1, info->reg_id_aa64pfr0); init_cpu_ftr_reg(SYS_ID_AA64PFR1_EL1, info->reg_id_aa64pfr1); + init_cpu_ftr_reg(SYS_ID_AA64PFR2_EL1, info->reg_id_aa64pfr2); init_cpu_ftr_reg(SYS_ID_AA64ZFR0_EL1, info->reg_id_aa64zfr0); 
init_cpu_ftr_reg(SYS_ID_AA64SMFR0_EL1, info->reg_id_aa64smfr0); + init_cpu_ftr_reg(SYS_ID_AA64FPFR0_EL1, info->reg_id_aa64fpfr0); if
[PATCH v3 10/21] arm64/fpsimd: Enable host kernel access to FPMR
FEAT_FPMR provides a new generally accessible architectural register FPMR. This is only accessible to EL0 and EL1 when HCRX_EL2.EnFPM is set to 1, do this when the host is running. The guest part will be done along with context switching the new register and exposing it via guest management. Signed-off-by: Mark Brown --- arch/arm64/include/asm/kvm_arm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index b85f46a73e21..9f9239d86900 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -105,7 +105,7 @@ #define HCRX_GUEST_FLAGS \ (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | \ (cpus_have_final_cap(ARM64_HAS_MOPS) ? (HCRX_EL2_MSCEn | HCRX_EL2_MCE2) : 0)) -#define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En) +#define HCRX_HOST_FLAGS (HCRX_EL2_MSCEn | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM) /* TCR_EL2 Registers bits */ #define TCR_EL2_RES1 ((1U << 31) | (1 << 23)) -- 2.30.2
[PATCH v3 08/21] arm64/sysreg: Add definition for FPMR
DDI0601 2023-09 defines a new system register FPMR (Floating Point Mode Register) which configures the new FP8 features. Add a definition of this register. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 0b1a33a77074..67173576115a 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2138,6 +2138,29 @@ Field 1 ZA Field 0 SM EndSysreg +Sysreg FPMR 3 3 4 4 2 +Res0 63:38 +Field 37:32 LSCALE2 +Field 31:24 NSCALE +Res0 23 +Field 22:16 LSCALE +Field 15 OSC +Field 14 OSM +Res0 13:9 +UnsignedEnum 8:6 F8D + 0b000 E5M2 + 0b001 E4M3 +EndEnum +UnsignedEnum 5:3 F8S2 + 0b000 E5M2 + 0b001 E4M3 +EndEnum +UnsignedEnum 2:0 F8S1 + 0b000 E5M2 + 0b001 E4M3 +EndEnum +EndSysreg + SysregFields HFGxTR_EL2 Field 63 nAMAIR2_EL1 Field 62 nMAIR2_EL1 -- 2.30.2
[PATCH v3 07/21] arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
DDI0601 2023-09 defines new fields in HCRX_EL2 controlling access to new system registers, update our definition of HCRX_EL2 to reflect this. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index eea69bb48fa7..0b1a33a77074 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -2412,7 +2412,9 @@ Fields ZCR_ELx EndSysreg Sysreg HCRX_EL2 3 4 1 2 2 -Res0 63:23 +Res0 63:25 +Field 24 PACMEn +Field 23 EnFPM Field 22 GCSEn Field 21 EnIDCP128 Field 20 EnSDERR -- 2.30.2
[PATCH v3 06/21] arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
DDI0601 2023-09 defines some new fields in SCTLR_EL1 controlling new MTE and floating point features. Update our sysreg definition to reflect these. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index aee9ab4087c1..eea69bb48fa7 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1791,7 +1791,8 @@ Field 63 TIDCP Field 62 SPINTMASK Field 61 NMI Field 60 EnTP2 -Res0 59:58 +Field 59 TCSO +Field 58 TCSO0 Field 57 EPAN Field 56 EnALS Field 55 EnAS0 @@ -1820,7 +1821,7 @@ EndEnum Field 37 ITFSB Field 36 BT1 Field 35 BT0 -Res0 34 +Field 34 EnFPM Field 33 MSCEn Field 32 CMOW Field 31 EnIA -- 2.30.2
[PATCH v3 05/21] arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
The 2023-09 release of DDI0601 defines a number of new feature enumeration fields in ID_AA64SMFR0_EL1. Add these fields. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 30 +++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index c9bb49d0ea03..aee9ab4087c1 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1079,7 +1079,11 @@ UnsignedEnum 63 FA64 0b0 NI 0b1 IMP EndEnum -Res0 62:60 +Res0 62:61 +UnsignedEnum 60 LUTv2 + 0b0 NI + 0b1 IMP +EndEnum UnsignedEnum 59:56 SMEver 0b0000 SME 0b0001 SME2 @@ -1107,7 +1111,14 @@ UnsignedEnum 42 F16F16 0b0 NI 0b1 IMP EndEnum -Res0 41:40 +UnsignedEnum 41 F8F16 + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 40 F8F32 + 0b0 NI + 0b1 IMP +EndEnum UnsignedEnum 39:36 I8I32 0b0000 NI 0b1111 IMP @@ -1128,7 +1139,20 @@ UnsignedEnum 32 F32F32 0b0 NI 0b1 IMP EndEnum -Res0 31:0 +Res0 31 +UnsignedEnum 30 SF8FMA + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 29 SF8DP4 + 0b0 NI + 0b1 IMP +EndEnum +UnsignedEnum 28 SF8DP2 + 0b0 NI + 0b1 IMP +EndEnum +Res0 27:0 EndSysreg Sysreg ID_AA64FPFR0_EL1 3 0 0 4 7 -- 2.30.2
[PATCH v3 03/21] arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
DDI0601 2023-09 adds a new system register ID_AA64ISAR3_EL1 enumerating new floating point and TLB invalidation features. Add a definition for it. Signed-off-by: Mark Brown --- arch/arm64/tools/sysreg | 17 + 1 file changed, 17 insertions(+) diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg index 27d79644e1a0..3d623a04934c 100644 --- a/arch/arm64/tools/sysreg +++ b/arch/arm64/tools/sysreg @@ -1433,6 +1433,23 @@ UnsignedEnum 3:0 WFxT EndEnum EndSysreg +Sysreg ID_AA64ISAR3_EL1 3 0 0 6 3 +Res0 63:12 +UnsignedEnum 11:8 TLBIW + 0b0000 NI + 0b0001 IMP +EndEnum +UnsignedEnum 7:4 FAMINMAX + 0b0000 NI + 0b0001 IMP +EndEnum +UnsignedEnum 3:0 CPA + 0b0000 NI + 0b0001 IMP + 0b0010 CPA2 +EndEnum +EndSysreg + Sysreg ID_AA64MMFR0_EL1 3 0 0 7 0 UnsignedEnum 63:60 ECV 0b0000 NI -- 2.30.2
[PATCH v3 04/21] arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
DDI0601 2023-09 defines a new feature register ID_AA64FPFR0_EL1 which
enumerates a number of FP8 related features. Add a definition for it.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 3d623a04934c..c9bb49d0ea03 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1131,6 +1131,35 @@ EndEnum
 Res0	31:0
 EndSysreg
 
+Sysreg	ID_AA64FPFR0_EL1	3	0	0	4	7
+Res0	63:32
+UnsignedEnum	31	F8CVT
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	30	F8FMA
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	29	F8DP4
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	28	F8DP2
+	0b0	NI
+	0b1	IMP
+EndEnum
+Res0	27:2
+UnsignedEnum	1	F8E4M3
+	0b0	NI
+	0b1	IMP
+EndEnum
+UnsignedEnum	0	F8E5M2
+	0b0	NI
+	0b1	IMP
+EndEnum
+EndSysreg
+
 Sysreg	ID_AA64DFR0_EL1	3	0	0	5	0
 Enum	63:60	HPMN0
 	0b0000	UNPREDICTABLE

-- 
2.30.2
[PATCH v3 01/21] arm64/sysreg: Add definition for ID_AA64PFR2_EL1
DDI0601 2023-09 defines a new system register ID_AA64PFR2_EL1 which
enumerates FPMR and some new MTE features. Add a definition of this
register.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 96cbeeab4eec..f22ade8f1fa7 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1002,6 +1002,27 @@ UnsignedEnum	3:0	BT
 EndEnum
 EndSysreg
 
+Sysreg	ID_AA64PFR2_EL1	3	0	0	4	2
+Res0	63:36
+UnsignedEnum	35:32	FPMR
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+Res0	31:12
+UnsignedEnum	11:8	MTEFAR
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	7:4	MTESTOREONLY
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	3:0	MTEPERM
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+EndSysreg
+
 Sysreg	ID_AA64ZFR0_EL1	3	0	0	4	4
 Res0	63:60
 UnsignedEnum	59:56	F64MM

-- 
2.30.2
[PATCH v3 02/21] arm64/sysreg: Update ID_AA64ISAR2_EL1 definition for DDI0601 2023-09
DDI0601 2023-09 defines some new fields in previously RES0 space in
ID_AA64ISAR2_EL1, together with one new enum value. Update the system
register definition to reflect this.

Signed-off-by: Mark Brown
---
 arch/arm64/tools/sysreg | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index f22ade8f1fa7..27d79644e1a0 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1365,7 +1365,14 @@ EndEnum
 EndSysreg
 
 Sysreg	ID_AA64ISAR2_EL1	3	0	0	6	2
-Res0	63:56
+UnsignedEnum	63:60	ATS1A
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	59:56	LUT
+	0b0000	NI
+	0b0001	IMP
+EndEnum
 UnsignedEnum	55:52	CSSC
 	0b0000	NI
 	0b0001	IMP
@@ -1374,7 +1381,19 @@ UnsignedEnum	51:48	RPRFM
 	0b0000	NI
 	0b0001	IMP
 EndEnum
-Res0	47:32
+Res0	47:44
+UnsignedEnum	43:40	PRFMSLC
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	39:36	SYSINSTR_128
+	0b0000	NI
+	0b0001	IMP
+EndEnum
+UnsignedEnum	35:32	SYSREG_128
+	0b0000	NI
+	0b0001	IMP
+EndEnum
 UnsignedEnum	31:28	CLRBHB
 	0b0000	NI
 	0b0001	IMP
@@ -1398,6 +1417,7 @@ UnsignedEnum	15:12	APA3
 	0b0011	PAuth2
 	0b0100	FPAC
 	0b0101	FPACCOMBINE
+	0b0110	PAUTH_LR
 EndEnum
 UnsignedEnum	11:8	GPA3
 	0b0000	NI

-- 
2.30.2
[PATCH v3 00/21] arm64: Support for 2023 DPISA extensions
This series enables support for the data processing extensions in the
newly released 2023 architecture; this is mainly support for 8 bit
floating point formats.

Most of the extensions only introduce new instructions and therefore only
require hwcaps, but there is a new EL0 visible control register FPMR used
to control the 8 bit floating point formats; we need to manage traps for
this and context switch it. The sharing of floating point save code
between the host and guest kernels slightly complicates the introduction
of KVM support; we first introduce host support with some placeholders
for KVM, then replace those with the actual KVM support.

I've not added test coverage for ptrace. I've got a not quite finished
test program which exercises all the FP ptrace interfaces and their
interactions together; my plan is to cover it there rather than add
another tiny test program that duplicates the boilerplate for tracing a
target and doesn't actually run the traced program.

Signed-off-by: Mark Brown
---
Changes in v3:
- Rebase onto v6.7-rc3.
- Hook up traps for FPMR in emulate-nested.c.
- Link to v2: https://lore.kernel.org/r/20231114-arm64-2023-dpisa-v2-0-47251894f...@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Link to v1: https://lore.kernel.org/r/20231026-arm64-2023-dpisa-v1-0-8470dd989...@kernel.org

---
Mark Brown (21):
      arm64/sysreg: Add definition for ID_AA64PFR2_EL1
      arm64/sysreg: Update ID_AA64ISAR2_EL1 definition for DDI0601 2023-09
      arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
      arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
      arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
      arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
      arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
      arm64/sysreg: Add definition for FPMR
      arm64/cpufeature: Hook new identification registers up to cpufeature
      arm64/fpsimd: Enable host kernel access to FPMR
      arm64/fpsimd: Support FEAT_FPMR
      arm64/signal: Add FPMR signal handling
      arm64/ptrace: Expose FPMR via ptrace
      KVM: arm64: Add newly allocated ID registers to register descriptions
      KVM: arm64: Support FEAT_FPMR for guests
      arm64/hwcap: Define hwcaps for 2023 DPISA features
      kselftest/arm64: Handle FPMR context in generic signal frame parser
      kselftest/arm64: Add basic FPMR test
      kselftest/arm64: Add 2023 DPISA hwcap test coverage
      KVM: arm64: selftests: Document feature registers added in 2023 extensions
      KVM: arm64: selftests: Teach get-reg-list about FPMR

 Documentation/arch/arm64/elf_hwcaps.rst            |  49 +
 arch/arm64/include/asm/cpu.h                       |   3 +
 arch/arm64/include/asm/cpufeature.h                |   5 +
 arch/arm64/include/asm/fpsimd.h                    |   2 +
 arch/arm64/include/asm/hwcap.h                     |  15 ++
 arch/arm64/include/asm/kvm_arm.h                   |   4 +-
 arch/arm64/include/asm/kvm_host.h                  |   3 +
 arch/arm64/include/asm/processor.h                 |   2 +
 arch/arm64/include/uapi/asm/hwcap.h                |  15 ++
 arch/arm64/include/uapi/asm/sigcontext.h           |   8 +
 arch/arm64/kernel/cpufeature.c                     |  72 +++
 arch/arm64/kernel/cpuinfo.c                        |  18 ++
 arch/arm64/kernel/fpsimd.c                         |  13 ++
 arch/arm64/kernel/ptrace.c                         |  42
 arch/arm64/kernel/signal.c                         |  59 ++
 arch/arm64/kvm/emulate-nested.c                    |   9 +
 arch/arm64/kvm/fpsimd.c                            |  19 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h            |   7 +-
 arch/arm64/kvm/sys_regs.c                          |  17 +-
 arch/arm64/tools/cpucaps                           |   1 +
 arch/arm64/tools/sysreg                            | 153 ++-
 include/uapi/linux/elf.h                           |   1 +
 tools/testing/selftests/arm64/abi/hwcap.c          | 217 +
 tools/testing/selftests/arm64/signal/.gitignore    |   1 +
 .../arm64/signal/testcases/fpmr_siginfo.c          |  82
 .../selftests/arm64/signal/testcases/testcases.c   |   8 +
 .../selftests/arm64/signal/testcases/testcases.h   |   1 +
 tools/testing/selftests/kvm/aarch64/get-reg-list.c |  11 +-
 28 files changed, 819 insertions(+), 18 deletions(-)
---
base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
change-id: 20231003-arm64-2023-dpisa-2f3d25746474

Best regards,
-- 
Mark Brown
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, Dec 05, 2023 at 04:01:50PM +0000, Edgecombe, Rick P wrote:

> Hmm, I didn't realize you were planning to have the kernel support
> upstream before the libc support was in testable shape.

It's not a "could someone run it" thing - it's about trying to ensure
that we get coverage from people who are just running the selftests as
part of general testing coverage rather than with the specific goal of
testing this one feature. Even when things start to land there will be a
considerable delay before they filter out so that all the enablement is
in CI systems off the shelf, and it'd be good to have coverage in that
interval.

> > What's the issue with working around the missing support? My
> > understanding was that there should be no ill effects from repeated
> > attempts to enable. We could add a check for things already being
> > enabled
>
> Normally the loader enables shadow stack and glibc then knows to do
> things in special ways when it is successful. If it instead manually
> enables in the app:
> - The app can't return from main() without disabling shadow stack
>   beforehand. Luckily this test directly calls exit()
> - The app can't do longjmp()
> - The app can't do ucontext stuff
> - The enabling code needs to be carefully crafted (the inline problem
>   you hit)
>
> I guess it's not a huge list, and mostly tests will run ok. But it
> doesn't seem right to add somewhat hacky shadow stack crud into generic
> tests.

Right, it's a small and fairly easily auditable list - it's more about
the app than the double enable, which was what I thought your concern
was. It's a bit annoying definitely and not something we want to do in
general, but for something like this where we're adding specific
coverage for API extensions for the feature it seems like a reasonable
tradeoff.
If the x86 toolchain/libc support is widely enough deployed (or you just
don't mind any missing coverage) we could use the toolchain support there
and only have the manual enable for arm64; it'd be inconsistent but not
wildly so.

> So you were planning to enable GCS in this test manually as well? How
> many tests were you planning to add it like this?

Yes, the current version of the arm64 series has the equivalent support
for GCS. I was only planning to do this along with adding specific
coverage for shadow stacks/GCS; general stuff that doesn't have any
specific support can get covered as part of system testing with the
toolchain and libc support. The only case beyond that I've done is some
arm64 specific stress tests which are written as standalone assembler
programs; those wouldn't get enabled by the toolchain anyway and have
some chance of catching context switch or signal handling issues should
they occur. It seemed worth it for the few lines of assembly it takes.
Re: [PATCH 1/4] kunit: Add APIs for managing devices
Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on c8613be119892ccceffbc550b9b9d7d68b995c9e]

url: https://github.com/intel-lab-lkp/linux/commits/davidgow-google-com/kunit-Add-APIs-for-managing-devices/20231205-153349
base: c8613be119892ccceffbc550b9b9d7d68b995c9e
patch link: https://lore.kernel.org/r/20231205-kunit_bus-v1-1-635036d3bc13%40google.com
patch subject: [PATCH 1/4] kunit: Add APIs for managing devices
config: x86_64-buildonly-randconfig-001-20231205 (https://download.01.org/0day-ci/archive/20231205/202312052341.feujgbbc-...@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231205/202312052341.feujgbbc-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new
version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot
| Closes: https://lore.kernel.org/oe-kbuild-all/202312052341.feujgbbc-...@intel.com/

All errors (new ones prefixed by >>):

   ld: lib/kunit/device.o: in function `kunit_bus_init':
>> device.c:(.text+0x40): multiple definition of `init_module'; lib/kunit/test.o:test.c:(.init.text+0x0): first defined here

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH RFT v4 5/5] kselftest/clone3: Test shadow stack support
On Tue, 2023-12-05 at 15:05 +0000, Mark Brown wrote:
> > But I wonder if the clone3 test should get its shadow stack enabled
> > the conventional elf bit way. So if it's all there (HW, kernel,
> > glibc) then the test will run with shadow stack. Otherwise the test
> > will run without shadow stack.
>
> This creates bootstrapping issues if we do it for arm64 where nothing
> is merged yet except for the model and EL3 support - in order to get
> any test coverage you need to be using an OS with the libc and
> toolchain support available and that's not going to be something we can
> rely on for a while (and even when things are merged a lot of the CI
> systems use Debian). There is a small risk that the toolchain will
> generate incompatible code if it doesn't know it's specifically
> targeting shadow stacks but the toolchain people didn't seem concerned
> about that risk and we've not been running into problems.
>
> It looks x86 is in better shape here with the userspace having run
> ahead of the kernel support, though I'm not 100% clear if everything is
> fully lined up? -mshstk -fcf-protection appears to build fine with gcc
> 8 but I'm a bit less clear on glibc and any ABI variations.

Right, you would need a shadow stack enabled compiler too. The
check_cc.sh piece in the Makefile will detect that. Hmm, I didn't
realize you were planning to have the kernel support upstream before the
libc support was in testable shape.

> > The other reason is that the shadow stack test in the x86 selftest
> > manual enabling is designed to work without a shadow stack enabled
> > glibc and has to be specially crafted to work around the missing
> > support. I'm not sure the more generic selftests should have to know
> > how to do this. So what about something like this instead:
>
> What's the issue with working around the missing support? My
> understanding was that there should be no ill effects from repeated
> attempts to enable.
> We could add a check for things already being
> enabled

Normally the loader enables shadow stack and glibc then knows to do
things in special ways when it is successful. If it instead manually
enables in the app:

- The app can't return from main() without disabling shadow stack
  beforehand. Luckily this test directly calls exit()
- The app can't do longjmp()
- The app can't do ucontext stuff
- The enabling code needs to be carefully crafted (the inline problem
  you hit)

I guess it's not a huge list, and mostly tests will run ok. But it
doesn't seem right to add somewhat hacky shadow stack crud into generic
tests.

So you were planning to enable GCS in this test manually as well? How
many tests were you planning to add it like this?
Re: [PATCH] mm, memcg: cg2 memory{.swap,}.peak write handlers
On Tue, Dec 5, 2023 at 4:07 AM Michal Hocko wrote:
> > This behavior is particularly useful for work scheduling systems that
> > need to track memory usage of worker processes/cgroups per-work-item.
> > Since memory can't be squeezed like CPU can (the OOM-killer has
> > opinions), these systems need to track the peak memory usage to
> > compute system/container fullness when binpacking workitems.
>
> I do not understand the OOM-killer reference here but I do understand
> that your worker reuses a cgroup and you want a peak memory consumption
> of a single run to better profile/configure the memcg configuration for
> the specific worker type. Correct?

To a certain extent, yes. At the moment, we're only using the inner
memcg cgroups for accounting/profiling, and using a larger (k8s
container) cgroup for enforcement. The OOM-killer is involved because
we're not configuring any memory limits on these individual "worker"
cgroups, so we need to provision for multiple workloads using their peak
memory at the same time to minimize OOM-killing.

In case you're curious, this is the job/queue-work scheduling system we
wrote in-house called Quickset that's mentioned in this blog post about
our new transcoder system:
https://medium.com/vimeo-engineering-blog/riding-the-dragon-e328a3dfd39d

> > Signed-off-by: David Finkel
>
> Makes sense to me
> Acked-by: Michal Hocko
>
> Thanks!

Thank you!

--
David Finkel
Senior Principal Software Engineer, Core Services
Re: [PATCH RFT v4 2/5] fork: Add shadow stack support to clone3()
On Tue, Dec 05, 2023 at 12:26:57AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2023-11-28 at 18:22 +0000, Mark Brown wrote:
> > -	size = adjust_shstk_size(stack_size);
> > +	size = adjust_shstk_size(size);
> > 	addr = alloc_shstk(0, size, 0, false);
>
> Hmm. I didn't test this, but in copy_process(), copy_mm() happens
> before this point. So the shadow stack would get mapped in current's MM
> (i.e. the parent). So in the !CLONE_VM case with shadow_stack_size != 0
> the SSP in the child will be updated to an area that is not mapped in
> the child. I think we need to pass tsk->mm into alloc_shstk(). But such
> an exotic clone usage does give me pause, regarding whether all of this
> is premature.

Hrm, right. And we then can't use do_mmap() either. I'd be somewhat
tempted to disallow that specific case for now rather than deal with it,
though that's not really in the spirit of just always following what the
user asked for.
Re: [PATCH v3 00/25] Permission Overlay Extension
Hi Marc,

On Mon, Dec 04, 2023 at 11:03:24AM +0000, Marc Zyngier wrote:
> Hi Joey,
>
> On Fri, 24 Nov 2023 16:34:45 +0000, Joey Gouly wrote:
> >
> > Hello everyone,
> >
> > This series implements the Permission Overlay Extension introduced in
> > the 2022 VMSA enhancements [1]. It is based on v6.7-rc2.
> >
> > Changes since v2[2]:
> >  # Added ptrace support and selftest
> >  # Add missing POR_EL0 initialisation in fork/clone
> >  # Rebase onto v6.7-rc2
> >  # Add r-bs
> >
> > The Permission Overlay Extension allows constraining permissions on
> > memory regions. This can be used from userspace (EL0) without a
> > system call or TLB invalidation.
>
> I have given this series a few more thoughts, and came to the
> conclusion that it is still incomplete on the KVM front:
>
> * FEAT_S1POE often comes together with FEAT_S2POE. For obvious
>   reasons, we cannot afford to let the guest play with S2POR_EL1, nor
>   do we want to advertise FEAT_S2POE to the guest.
>
>   You will need to add some additional FGT for this, and mask out
>   FEAT_S2POE from the guest's view of the ID registers.

I found out last week that I had misunderstood S2POR_EL1, so yes it
looks like we need to trap that. I will add that in.

> * letting the guest play with POE comes with some interesting strings
>   attached: a guest that has started on a POE-enabled host cannot be
>   migrated to one that doesn't have POE. Which means that the POE
>   registers should only be visible to the host userspace if enabled in
>   the guest's ID registers, and thus only context-switched in these
>   conditions. They should otherwise UNDEF.

Can you give me some clarification here?

- By visible to the host userspace, do you mean via the GET_ONE_REG API?
- Currently the ID register (ID_AA64MMFR3_EL1) is not ID_WRITABLE;
  should this series or another make it so? Currently if the host has
  POE it's enabled in the guest, so I believe migration to a non-POE
  host will fail?
- For the context switch, do you mean something like:

	if (system_supports_poe() && (ID_REG(MMFR3_EL1) & S1POE))
		ctxt_sys_reg(ctxt, POR_EL0) = read_sysreg_s(SYS_POR_EL0);

  That would need some refactoring, since I don't see how to access
  id_regs from struct kvm_cpu_context.

Thanks,
Joey
Re: [PATCH v3 3/3] selftests: livepatch: Test livepatching a heavily called syscall
On 12/5/23 05:52, mpdeso...@suse.com wrote:
> On Fri, 2023-12-01 at 16:38 +0000, Shuah Khan wrote:
> > 0003-selftests-livepatch-Test-livepatching-a-heavily-call.patch has
> > style problems, please review.
> >
> > NOTE: If any of the errors are false positives, please report them to
> > the maintainer, see CHECKPATCH in MAINTAINERS.
>
> I couldn't find any mention about "missing module name". Is your script
> showing more warnings than these ones? Can you please share your
> output?
>
> I'll fix the MAINTAINERS file but I'll wait until I understand what's
> missing in your checkpatch script to resend the patchset.

Looks like it is coming from a script - still my question stands on
whether or not you would need a module name for this module? I am not
too concerned about the MAINTAINERS file warns.

I am assuming you will be sending a new version to address Joe
Lawrence's comments?

thanks,
-- Shuah
Re: [xdp-hints] Re: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Tue, 2023-12-05 at 15:25 +0000, Song, Yoong Siang wrote:
> On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote:
> > Jesper Dangaard Brouer wrote:
> > >
> > > On 12/3/23 17:51, Song Yoong Siang wrote:
> > > > This patch enables Launch Time (Time-Based Scheduling) support to
> > > > XDP zero copy via XDP Tx metadata framework.
> > > >
> > > > Signed-off-by: Song Yoong Siang
> > > > ---
> > > >  drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++
> > >
> > > As requested before, I think we need to see another driver
> > > implementing this.
> > >
> > > I propose driver igc and chip i225.
>
> Sure. I will include igc patches in next version.
>
> > > The interesting thing for me is to see how the LaunchTime max 1
> > > second into the future[1] is handled code wise. One suggestion is
> > > to add a section to Documentation/networking/xsk-tx-metadata.rst
> > > per driver that mentions/documents these different hardware
> > > limitations. It is natural that different types of hardware have
> > > limitations. This is a close-to hardware-level abstraction/API, and
> > > IMHO as long as we document the limitations we can expose this API
> > > without too many limitations for more capable hardware.
>
> Sure. I will try to add hardware limitations in documentation.
>
> > I would assume that the kfunc will fail when a value is passed that
> > cannot be programmed.
>
> In the current design, xsk_tx_metadata_request() didn't get a return
> value. So the user won't know if their request has failed. It is
> complex to inform the user which request is failing. Therefore, IMHO,
> it is good that we let the driver handle the error silently.

If the programmed value is invalid, the packet will be "dropped" / will
never make it to the wire, right? That is clearly a situation that the
user should be informed about. For RT systems this normally means that
something is really wrong regarding timing / cycle overflow. Such
systems have to react on that situation.
> > What is being implemented here already exists for qdiscs. The FQ
> > qdisc takes a horizon attribute and
> >
> > "
> > when a packet is beyond the horizon at enqueue() time:
> > - either drop the packet (default policy)
> > - or cap its delivery time to the horizon.
> > "
> > commit 39d010504e6b ("net_sched: sch_fq: add horizon attribute")
> >
> > Having the admin manually configure this on the qdisc based on
> > off-line knowledge of the device is more fragile than if the device
> > would somehow signal its limit to the stack.
> >
> > But I don't think we should add enforcement of that as a requirement
> > for this xdp extension of pacing.
RE: [PATCH bpf-next v2 2/3] net: stmmac: Add txtime support to XDP ZC
On Tuesday, December 5, 2023 10:55 PM, Willem de Bruijn wrote:
>Song, Yoong Siang wrote:
>> On Monday, December 4, 2023 10:58 PM, Willem de Bruijn wrote:
>> >Song, Yoong Siang wrote:
>> >> On Friday, December 1, 2023 11:02 PM, Jesper Dangaard Brouer wrote:
>> >> >On 12/1/23 07:24, Song Yoong Siang wrote:
>> >> >> This patch enables txtime support to XDP zero copy via XDP Tx
>> >> >> metadata framework.
>> >> >>
>> >> >> Signed-off-by: Song Yoong Siang
>> >> >> ---
>> >> >>  drivers/net/ethernet/stmicro/stmmac/stmmac.h      |  2 ++
>> >> >>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 13 +
>> >> >>  2 files changed, 15 insertions(+)
>> >> >
>> >> >I think we need to see other drivers using this new feature to
>> >> >evaluate if the API is sane.
>> >> >
>> >> >I suggest implementing this for the igc driver (chip i225) and also
>> >> >for igb (i210 chip); both support this kind of LaunchTime feature
>> >> >in HW.
>> >> >
>> >> >The API and stmmac driver take a u64 as time.
>> >> >I'm wondering how this applies to i210 that[1] has 25 bits for
>> >> >LaunchTime (with 32 nanosec granularity), limiting LaunchTime to
>> >> >max 0.5 second into the future.
>> >> >And i225 that[1] has 30 bits, max 1 second into the future.
>> >> >
>> >> >[1] https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org
>> >>
>> >> I am using u64 for launch time because the existing EDT framework is
>> >> using it. Refer to struct sk_buff below. Both u64 and ktime_t can be
>> >> used as launch time. I choose u64 because ktime_t often requires
>> >> additional type conversion and we don't expect negative values of
>> >> time.
>> >> include/linux/skbuff.h-744-  * @tstamp: Time we arrived/left
>> >> include/linux/skbuff.h:745-  * @skb_mstamp_ns: (aka @tstamp) earliest departure time; start point
>> >> include/linux/skbuff.h-746-  * for retransmit timer
>> >> --
>> >> include/linux/skbuff.h-880-  union {
>> >> include/linux/skbuff.h-881-          ktime_t         tstamp;
>> >> include/linux/skbuff.h:882-          u64             skb_mstamp_ns; /* earliest departure time */
>> >> include/linux/skbuff.h-883-  };
>> >>
>> >> tstamp/skb_mstamp_ns are used by various drivers for launch time
>> >> support on normal packets, so I think u64 should be "friendly" to
>> >> all the drivers. For an example, the igc driver will take the launch
>> >> time from tstamp and recalculate it accordingly (i225 expects the
>> >> user to program a "delta time" instead of a "time" into the HW
>> >> register).
>> >>
>> >> drivers/net/ethernet/intel/igc/igc_main.c-1602-  txtime = skb->tstamp;
>> >> drivers/net/ethernet/intel/igc/igc_main.c-1603-  skb->tstamp = ktime_set(0, 0);
>> >> drivers/net/ethernet/intel/igc/igc_main.c:1604-  launch_time = igc_tx_launchtime(tx_ring, txtime, &first_flag, &insert_empty);
>> >>
>> >> Do you think this is enough to say the API is sane?
>> >
>> >u64 nsec sounds sane to me. It must be made explicit which clock
>> >source it is against.
>>
>> The u64 launch time should be based on the NIC PTP hardware clock
>> (PHC). I will add documentation saying which clock source it is
>> against.
>
>It's not that obvious to me that that is the right and only choice.
>See below.
>
>> >Some applications could want to do the conversion from a clock source
>> >to raw NIC cycle counter in userspace or BPF and program the raw
>> >value. So it may be worthwhile to add a clock source argument -- even
>> >if initially only CLOCK_MONOTONIC is supported.
>>
>> Sorry, I don't quite understand your suggestion on adding a clock
>> source argument. Are you suggesting to add a clock source for the
>> selftest xdp_hw_metadata apps?
>> IMHO, no need to add a clock source, as the clock source for launch
>> time should always be based on the NIC PHC.

>This is not how FQ and ETF qdiscs pass timestamps to drivers today.
>
>Those are in CLOCK_MONOTONIC or CLOCK_TAI. The driver is expected to
>convert from that to its descriptor format, both to the reduced bit
>width and the NIC PHC.
>
>See also for instance how sch_etf has an explicit q->clock_id match,
>and SO_TXTIME added an sk_clock_id for the same purpose: to agree on
>which clock source is being used.

I see. Thanks for the explanation. I will try to add clock source
arguments in the next version.
RE: [PATCH bpf-next v3 2/3] net: stmmac: add Launch Time support to XDP ZC
On Monday, December 4, 2023 10:55 PM, Willem de Bruijn wrote:
>Jesper Dangaard Brouer wrote:
>>
>> On 12/3/23 17:51, Song Yoong Siang wrote:
>> > This patch enables Launch Time (Time-Based Scheduling) support to
>> > XDP zero copy via XDP Tx metadata framework.
>> >
>> > Signed-off-by: Song Yoong Siang
>> > ---
>> >  drivers/net/ethernet/stmicro/stmmac/stmmac.h | 2 ++
>>
>> As requested before, I think we need to see another driver
>> implementing this.
>>
>> I propose driver igc and chip i225.

Sure. I will include igc patches in next version.

>> The interesting thing for me is to see how the LaunchTime max 1 second
>> into the future[1] is handled code wise. One suggestion is to add a
>> section to Documentation/networking/xsk-tx-metadata.rst per driver
>> that mentions/documents these different hardware limitations. It is
>> natural that different types of hardware have limitations. This is a
>> close-to hardware-level abstraction/API, and IMHO as long as we
>> document the limitations we can expose this API without too many
>> limitations for more capable hardware.

Sure. I will try to add hardware limitations in documentation.

>I would assume that the kfunc will fail when a value is passed that
>cannot be programmed.

In the current design, xsk_tx_metadata_request() didn't get a return
value. So the user won't know if their request has failed. It is complex
to inform the user which request is failing. Therefore, IMHO, it is good
that we let the driver handle the error silently.
> >But I don't think we should add enforcement of that as a requirement >for this xdp extension of pacing.