Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> 3. vhost looping endlessly, waiting for kworker to be scheduled
> 
> I dug a little deeper on what the vhost is doing. I'm not an expert on
> virtio whatsoever, so these are just educated guesses that maybe
> someone can verify/correct. Please bear with me probably messing up 
> the terminology.
> 
> - vhost is looping through available queues.
> - vhost wants to wake up a kworker to process a found queue.
> - kworker does something with that queue and terminates quickly.
> 
> What I found by throwing in some very noisy trace statements was that,
> if the kworker is not woken up, the vhost just keeps looping across
> all available queues (and seems to repeat itself). So it essentially
> relies on the scheduler to schedule the kworker fast enough. Otherwise
> it will just keep on looping until it is migrated off the CPU.


Normally it takes the buffers off the queue and is done with it.
I am guessing that at the same time guest is running on some other
CPU and keeps adding available buffers?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
[...]
> 
> What are the weights of the two entities?
> 

Both entities have the same weights (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm=vhost-2961 
pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 
min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm= pid=0 
sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 
min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 
min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 
min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 
cpu=11 nr=0 vru=76600751247 lag=0
<...>
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:
kworker -> 1048576
vhost   -> 1048576
cgroup root -> 3876

kworker and vhost weights remain the same. The weights of the nodes in the 
cgroup vary.


I also spent some more thought on this and have some more observations:

1. kworker lag after short runtime

vhost-2961-2984[011] d  1158.884486: sched_waking: 
comm=kworker/11:2 pid=202 prio=120 target_cpu=011
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>   
^
vhost-2961-2984[011] d  1158.884490: sched_switch: 
prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> 
next_comm=kworker/11:2 next_pid=202 next_prio=120
   kworker/11:2-202[011] d  1158.884491: sched_waking: comm=CPU 0/KVM 
pid=2988 prio=120 target_cpu=009
   kworker/11:2-202[011] d  1158.884492: sched_stat_runtime: 
comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=7660391 [ns] 
deadline=76606905961 [ns] lag=76606905961

   
   kworker/11:2-202[011] d  1158.884492: sched_update: 
comm=kworker/11:2 pid=202 sev=7660391 sed=76606905961 sel=-1128 sew=1048576 
avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=1000

 ^
   kworker/11:2-202[011] d  1158.884494: sched_stat_wait: 
comm=vhost-2961 pid=2984 delay=5150 [ns]
   kworker/11:2-202[011] d  1158.884494: sched_switch: 
prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> 
next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed on the 
timeline with 0 lag.
The kworker then executes for 5150ns and returns control to the vhost.
Unfortunately, this short runtime earns the kworker a negative lag of -1128.
This, in turn, causes the kworker to not be selected by
check_preempt_wakeup_fair.

My naive understanding of lag is that only those entities which consume more
time than they should get negative lag. Why is the kworker being punished for
running only a tiny portion of time?

In the majority of cases, the kworker finishes after a 4-digit number of ns.
There are occasional outliers with 5-digit numbers. I would therefore not
expect negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was not.
kworker gets put on the queue with no lag, so it essentially has its vruntime
set to avg_vruntime.
After giving up its timeslice, the kworker now has a vruntime which is larger
than the avg_vruntime. Hence the negative lag might make sense here from an
algorithmic standpoint.
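
Putting the numbers from the trace above into that picture (assuming lag is
simply avg_vruntime - vruntime at update time):

  vruntime after the 5150ns run: 76603905961 + 5150 = 76603911111
  avg_vruntime at sched_update:  76603909983
  lag = 76603909983 - 76603911111 = -1128

So the -1128 is exactly the amount by which the kworker's vruntime overtook
the more slowly advancing queue average during those 5150ns.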


2a/b. vhost getting increased deadlines over time, no call of pick_eevdf

vhost-2961-2984[011] d.h..  1158.892878: sched_stat_runtime: 
comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] 
deadline=3606948448 [ns] lag=3598548386
vhost-2961-2984  

Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test

2023-12-06 Thread Jason Wang
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin  wrote:
>
> Introduce vhost_net_test based on virtio_test to test
> vhost_net changes in the kernel.
>
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/virtio/Makefile |   8 +-
>  tools/virtio/vhost_net_test.c | 441 ++
>  2 files changed, 446 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;  \
> if ($(1)) >/dev/null 2>&1;  \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -  vhost_test/Module.symvers vhost_test/modules.order *.d
> +   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +  vhost_test/.*.cmd vhost_test/Module.symvers \
> +  vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index ..7e7b7aba3668
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,441 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RANDOM_BATCH -1
> +
> +static int tun_alloc(void)
> +{
> +   struct ifreq ifr;
> +   int fd, e;
> +
> +   fd = open("/dev/net/tun", O_RDWR);
> +   if (fd < 0) {
> +   perror("Cannot open /dev/net/tun");
> +   return fd;
> +   }
> +
> +   memset(&ifr, 0, sizeof(ifr));
> +
> +   ifr.ifr_flags = IFF_TUN | IFF_NO_PI;

Why did you use IFF_TUN but not IFF_TAP here?

> +   strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);

tun0 is pretty common if there's a VPN. Do we need some randomized name here?
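
Something along these lines would avoid the clash (untested sketch; if I read
drivers/net/tun.c correctly, a "%d" template, or an empty name, lets the kernel
pick a free index and it copies the resulting name back into ifr_name):

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
	strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ - 1);

	e = ioctl(fd, TUNSETIFF, (void *)&ifr);
	if (e < 0) {
		perror("ioctl[TUNSETIFF]");
		close(fd);
		return e;
	}
	/* ifr.ifr_name now holds the allocated name, e.g. "tap0" */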


> +
> +   e = ioctl(fd, TUNSETIFF, (void *)&ifr);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETIFF]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   return fd;
> +}
> +
> +/* Unused */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;

Why do we need trick like these here?

> +
> +struct vq_info {
> +   int kick;
> +   int call;
> +   int num;
> +   int idx;
> +   void *ring;
> +   /* copy used for control */
> +   struct vring vring;
> +   struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +   struct virtio_device vdev;
> +   int control;
> +   struct pollfd fds[1];
> +   struct vq_info vqs[1];
> +   int nvqs;
> +   void *buf;
> +   size_t buf_size;
> +   struct vhost_memory *mem;
> +};
> +
> +static struct vhost_vring_file no_backend = { .index = 1, .fd = -1 },
> +backend = { .index = 1, .fd = 1 };

A magic number like fd = 1 is pretty confusing.

And I don't see why we need global variables here.

> +static const struct vhost_vring_state null_state = {};
> +
> +bool vq_notify(struct virtqueue *vq)
> +{
> +   struct vq_info *info = vq->priv;
> +   unsigned long long v = 1;
> +   int r;
> +   r = write(info->kick, &v, sizeof v);
> +   assert(r == sizeof v);
> +   return true;
> +}
> +
> +void vq_callback(struct virtqueue *vq)
> +{
> +}
> +
> +
> +void vhost_vq_setup(struct vdev_info *dev, struct vq_info *info)
> +{
> +   struct vhost_vring_state state = { .index = info->idx };
> +   struct vhost_vring_file file = { .index = info->idx };
> +   unsigned long long features = dev->vdev.features;
> +   struct vhost_vring_addr addr = {
> +   .index = info->idx,
> +   .desc_user_addr = (uint64_t)(unsigned long)info->vring.desc,
> +   .avail_user_addr = (uint64_t)(unsigned long)info->vring.avail,
> +   .used_user_addr = (uint64_t)(unsigned long)info->vring.used,
> +   };
> +   int r;
> +   r = ioctl(dev->control, VHOST_SET_FEATURES, &features);
> +   assert(r >= 0);
> +   state.num = info->vring.num;
> +   r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
> +   assert(r >= 0);
> +   state.num = 0;
> +   r = ioctl(dev->control, VHOST_SET_VRING_BASE, &state);
> +   assert(r >= 0);
> +   r = ioctl(dev->control, VHOST_SET_VRING_ADDR, &addr);
> +   assert(r >= 0);
> +   

Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()

2023-12-06 Thread Jason Wang
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin  wrote:
>
> The page frag in vhost_net_page_frag_refill() uses the
> 'struct page_frag' from skb_page_frag_refill(), but its
> implementation is similar to page_frag_alloc_align() now.
>
> This patch removes vhost_net_page_frag_refill() by using
> 'struct page_frag_cache' instead of 'struct page_frag',
> and allocating frag using page_frag_alloc_align().
>
> The added benefit is that it not only unifies the page frag
> implementation a little, but also gives about a 0.5% performance
> boost when testing with the vhost_net_test introduced in the
> last patch.
>
> Signed-off-by: Yunsheng Lin 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()

2023-12-06 Thread Jason Wang
On Tue, Dec 5, 2023 at 7:35 PM Yunsheng Lin  wrote:
>
> When draining a page_frag_cache, most users are doing
> similar steps, so introduce an API to avoid code
> duplication.
>
> Signed-off-by: Yunsheng Lin 
> ---

For vhost part:

Acked-by: Jason Wang 

Thanks
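
For reference, a minimal sketch of such a drain helper (assuming it reuses the
existing __page_frag_cache_drain() and the 'va'/'pagecnt_bias' fields of
struct page_frag_cache; the actual patch may differ in detail):

	void page_frag_cache_drain(struct page_frag_cache *nc)
	{
		if (!nc->va)
			return;

		__page_frag_cache_drain(virt_to_head_page(nc->va),
					nc->pagecnt_bias);
		nc->va = NULL;
	}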




Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

2023-12-06 Thread Dan Carpenter
Hi Lev,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information

url:
https://github.com/intel-lab-lkp/linux/commits/Lev-Pantiukhin/ipvs-add-a-stateless-type-of-service-and-a-stateless-Maglev-hashing-scheduler/20231204-232344
base:   https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git 
master
patch link:
https://lore.kernel.org/r/20231204152020.472247-1-kndrvt%40yandex-team.ru
patch subject: [PATCH] ipvs: add a stateless type of service and a stateless 
Maglev hashing scheduler
config: i386-randconfig-141-20231207 
(https://download.01.org/0day-ci/archive/20231207/202312070849.i9gwwsh0-...@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce: 
(https://download.01.org/0day-ci/archive/20231207/202312070849.i9gwwsh0-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Reported-by: Dan Carpenter 
| Closes: https://lore.kernel.org/r/202312070849.i9gwwsh0-...@intel.com/

New smatch warnings:
net/netfilter/ipvs/ip_vs_core.c:545 ip_vs_schedule() error: uninitialized 
symbol 'need_state'.
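
(Illustrative only, not the actual ipvs code: the usual shape of this warning
is a variable assigned on some branches but read on a path that never wrote
it.)

	bool need_state;

	if (some_condition)		/* hypothetical condition */
		need_state = true;
	/* no else: need_state is untouched on this path */

	if (need_state)			/* smatch: uninitialized read */
		do_something();

The usual fix is to initialize it at declaration (bool need_state = false;)
or to make sure every path assigns it before the read.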

vim +/need_state +545 net/netfilter/ipvs/ip_vs_core.c

^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  440  struct ip_vs_conn *
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov   2010-10-17  441  ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb,
d4383f04d145cc net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26  442  struct ip_vs_proto_data *pd, int *ignored,
d4383f04d145cc net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26  443  struct ip_vs_iphdr *iph)
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  444  {
9330419d9aa4f9 net/netfilter/ipvs/ip_vs_core.c Hans Schillstrom   2011-01-03  445   struct ip_vs_protocol *pp = pd->pp;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  446   struct ip_vs_conn *cp = NULL;
ceec4c38168184 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov   2013-03-22  447   struct ip_vs_scheduler *sched;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  448   struct ip_vs_dest *dest;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  449   __be16 _ports[2], *pptr, cport, vport;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  450   const void *caddr, *vaddr;
3575792e005dc9 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov   2010-09-17  451   unsigned int flags;
b276d504bee439 net/netfilter/ipvs/ip_vs_core.c Lev Pantiukhin 2023-12-04  452   bool need_state;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  453  
190ecd27cd7294 net/netfilter/ipvs/ip_vs_core.c Julian Anastasov   2010-10-17  454   *ignored = 1;
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26  455   /*
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26  456    * IPv6 frags, only the first hit here.
2f74713d1436b7 net/netfilter/ipvs/ip_vs_core.c Jesper Dangaard Brouer 2012-09-26  457    */
6b3d933000cbe5 net/netfilter/ipvs/ip_vs_core.c Gao Feng   2017-11-13  458   pptr = frag_safe_skb_hp(skb, iph->len, sizeof(_ports), _ports);
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  459   if (pptr == NULL)
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  460   return NULL;
^1da177e4c3f41 net/ipv4/ipvs/ip_vs_core.c  Linus Torvalds 2005-04-16  461  
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  462   if (likely(!ip_vs_iph_inverse(iph))) {
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  463   cport = pptr[0];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  464   caddr = &iph->saddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  465   vport = pptr[1];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  466   vaddr = &iph->daddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  467   } else {
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  468   cport = pptr[1];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  469   caddr = &iph->daddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  470   vport = pptr[0];
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell  2015-08-26  471   vaddr = &iph->saddr;
ee78378f976488 net/netfilter/ipvs/ip_vs_core.c Alex Gartrell

[PATCH v2 1/2] Documentation/ABI: Add ABI documentation for sys-bus-dax

2023-12-06 Thread Vishal Verma
Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Documentation/ABI/testing/sysfs-bus-dax | 151 
 1 file changed, 151 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
new file mode 100644
index ..a61a7b186017
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -0,0 +1,151 @@
+What:  /sys/bus/dax/devices/daxX.Y/align
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Provides a way to specify an alignment for a dax device.
+   Values allowed are constrained by the physical address ranges
+   that back the dax device, and also by arch requirements.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (WO) Provides a way to allocate a mapping range under a dax
+   device. Specified in the format <start>-<end>.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'start' attribute indicates the
+   start physical address for the given range.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'end' attribute indicates the
+   end physical address for the given range.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'page_offset' attribute 
indicates the
+   offset of the current range in the dax device.
+
+What:  /sys/bus/dax/devices/daxX.Y/resource
+Date:  June, 2019
+KernelVersion: v5.3
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The resource attribute indicates the starting physical
+   address of a dax device. In case of a device with multiple
+   constituent ranges, it indicates the starting address of the
+   first range.
+
+What:  /sys/bus/dax/devices/daxX.Y/size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) The size attribute indicates the total size of a dax
+   device. For creating subdivided dax devices, or for resizing
+   an existing device, the new size can be written to this as
+   part of the reconfiguration process.
+
+What:  /sys/bus/dax/devices/daxX.Y/numa_node
+Date:  November, 2019
+KernelVersion: v5.5
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) If NUMA is enabled and the platform has affinitized the
+   backing device for this dax device, emit the CPU node
+   affinity for this device.
+
+What:  /sys/bus/dax/devices/daxX.Y/target_node
+Date:  February, 2019
+KernelVersion: v5.1
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The target-node attribute is the Linux numa-node that a
+   device-dax instance may create when it is online. Prior to
+   being online the device's 'numa_node' property reflects the
+   closest online cpu node which is the typical expectation of a
+   device 'numa_node'. Once it is online it becomes its own
+   distinct numa node.
+
+What:  $(readlink -f 
/sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The available_size attribute tracks available dax region
+   capacity. This only applies to volatile hmem devices, not pmem
+   devices, since pmem devices are defined by nvdimm namespace
+   boundaries.
+
+What:  $(readlink -f 

[PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-06 Thread Vishal Verma
Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via
pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above
defaults if needed.

Cc: David Hildenbrand 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Dave Hansen 
Cc: Huang Ying 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c   | 40 +
 Documentation/ABI/testing/sysfs-bus-dax | 13 +++
 2 files changed, 53 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..11abb57cc031 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1270,6 +1270,45 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t memmap_on_memory_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+
+   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct dax_region *dax_region = dev_dax->region;
+   ssize_t rc;
+   bool val;
+
+   rc = kstrtobool(buf, &val);
+   if (rc)
+   return rc;
+
+   if (dev_dax->memmap_on_memory == val)
+   return len;
+
+   device_lock(dax_region->dev);
+   if (!dax_region->dev->driver) {
+   device_unlock(dax_region->dev);
+   return -ENXIO;
+   }
+
+   device_lock(dev);
+   dev_dax->memmap_on_memory = val;
+   device_unlock(dev);
+
+   device_unlock(dax_region->dev);
+   return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int 
n)
 {
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1296,6 +1335,7 @@ static struct attribute *dev_dax_attributes[] = {
	&dev_attr_align.attr,
	&dev_attr_resource.attr,
	&dev_attr_numa_node.attr,
+	&dev_attr_memmap_on_memory.attr,
NULL,
 };
 
diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
index a61a7b186017..bb063a004e41 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -149,3 +149,16 @@ KernelVersion: v5.1
 Contact:   nvd...@lists.linux.dev
 Description:
(RO) The id attribute indicates the region id of a dax region.
+
+What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
+Date:  October, 2023
+KernelVersion: v6.8
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Control the memmap_on_memory setting if the dax device
+   were to be hotplugged as system memory. This determines whether
+   the 'altmap' for the hotplugged memory will be placed on the
+   device being hotplugged (memmap_on_memory=1) or if it will be
+   placed on regular memory (memmap_on_memory=0). This attribute
+   must be set before the device is handed over to the 'kmem'
+   driver (i.e.  hotplugged into system-ram).

-- 
2.41.0
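
For illustration, toggling the new attribute from userspace before handing
the device to the 'kmem' driver might look like this (sketch; "dax0.0" is a
placeholder instance name, and the write is rejected if the region driver is
not bound):

#include <stdio.h>

int main(void)
{
	/* path layout as documented above; dax0.0 is only an example */
	const char *attr = "/sys/bus/dax/devices/dax0.0/memmap_on_memory";
	FILE *f = fopen(attr, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* "1": place the altmap on the hotplugged device memory itself */
	if (fputs("1", f) == EOF || fclose(f) == EOF) {
		perror("write");
		return 1;
	}

	return 0;
}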




[PATCH v2 0/2] Add DAX ABI for memmap_on_memory

2023-12-06 Thread Vishal Verma
The DAX drivers were missing sysfs ABI documentation entirely.  Add this
missing documentation for the sysfs ABI for DAX regions and Dax devices
in patch 1. Add a new ABI for toggling memmap_on_memory semantics in
patch 2.

The missing ABI was spotted in [1], this series is a split of the new
ABI additions behind the initial documentation creation.

[1]: 
https://lore.kernel.org/linux-cxl/651f27b728fef_ae7e729...@dwillia2-xfh.jf.intel.com.notmuch/

Cc: Dan Williams 
Cc:  
Cc:  
Cc:  
To: Dave Jiang 
Signed-off-by: Vishal Verma 

Changes in v2:
- Fix CC lists, patch 1/2 didn't get sent correctly in v1
- Link to v1: 
https://lore.kernel.org/r/20231206-vv-dax_abi-v1-0-474eb88e2...@intel.com

---
Vishal Verma (2):
  Documentation/ABI: Add ABI documentation for sys-bus-dax
  dax: add a sysfs knob to control memmap_on_memory behavior

 drivers/dax/bus.c   |  40 
 Documentation/ABI/testing/sysfs-bus-dax | 164 
 2 files changed, 204 insertions(+)
---
base-commit: c4e1ccfad42352918810802095a8ace8d1c744c9
change-id: 20231025-vv-dax_abi-17a219c46076

Best regards,
-- 
Vishal Verma 




[PATCH v8] bus: mhi: host: Add tracing support

2023-12-06 Thread Krishna chaitanya chundru
This change adds ftrace support for the following functions, which
helps in debugging issues when there are channel state and MHI
state changes and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever trace events are added, the corresponding debug messages are removed.

Signed-off-by: Krishna chaitanya chundru 
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign as
suggested by Steve
- Link to v7: 
https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

Changes in v7:
- change log format as pointed by mani.
- Link to v6: 
https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com

Changes in v6:
- use 'rp' directly as suggested by jeffrey.
- Link to v5: 
https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com

Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by steve.
- Instead of converting to u64 to print address, use %px to print the address 
to avoid
- warnings in some platforms.
- Link to v4: 
https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com

Changes in v4:
- Fix compilation issues in the previous patch which happened due to rebasing.
- In the defconfig the FTRACE config is not enabled, which is why the
compilation issue was not seen in my workspace.
- Link to v3: 
https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com

Changes in v3:
- Move trace header file from include/trace/events to drivers/bus/mhi/host/
so that we can include driver header files.
- Use macros directly in the trace events as suggested by Jeffrey Hugo.
- Reorder the structure in the events as suggested by Steve to avoid holes in
the buffer.
- removed the mhi_to_physical function as this can give security issues.
- removed macros to define strings as we can get those from driver headers.
- Link to v2: 
https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com

Changes in v2:
- Passing the raw state into the trace event and using  __print_symbolic() as 
suggested by bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn.
- Fixed the kernel test rebot issues.
- Link to v1: 
https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com
---
 drivers/bus/mhi/host/init.c  |   3 +
 drivers/bus/mhi/host/main.c  |  19 ++--
 drivers/bus/mhi/host/pm.c|   7 +-
 drivers/bus/mhi/host/trace.h | 205 +++
 4 files changed, 221 insertions(+), 13 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 static DEFINE_IDA(mhi_controller_ida);
 
 const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..189f4786403e 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include "internal.h"
+#include "trace.h"
 
 int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
  void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,8 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, 
void *priv)
 
state = mhi_get_mhi_state(mhi_cntrl);
ee = mhi_get_exec_env(mhi_cntrl);
-   dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-   TO_MHI_EXEC_STR(mhi_cntrl->ee),
-   mhi_state_str(mhi_cntrl->dev_state),
-   TO_MHI_EXEC_STR(ee), mhi_state_str(state));
 
+   trace_mhi_intvec_states(mhi_cntrl, ee, state);
if (state == MHI_STATE_SYS_ERR) {
dev_dbg(dev, "System error detected\n");
pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +830,8 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
*mhi_cntrl,
while (dev_rp != local_rp) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
 
+   trace_mhi_ctrl_event(mhi_cntrl, local_rp);
+
switch (type) {
case MHI_PKT_TYPE_BW_REQ_EVENT:
{
@@ -997,6 +997,8 @@ int mhi_process_data_event_ring(struct mhi_controller 
*mhi_cntrl,
while (dev_rp != local_rp && event_quota > 0) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
 
+   trace_mhi_data_event(mhi_cntrl, local_rp);
+
chan = MHI_TRE_GET_EV_CHID(local_rp);
 
WARN_ON(chan >= mhi_cntrl->max_chan);
@@ -1235,6 +1237,7 @@ int mhi_gen_tre(struct mhi_controller *mhi_

Re: [PATCH v7] bus: mhi: host: Add tracing support

2023-12-06 Thread Krishna Chaitanya Chundru



On 12/6/2023 9:25 PM, Steven Rostedt wrote:

On Wed, 6 Dec 2023 21:12:57 +0530
Krishna chaitanya chundru  wrote:


diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
  #include 
  #include "internal.h"
  
+#define CREATE_TRACE_POINTS

+#include "trace.h"
+
  static DEFINE_IDA(mhi_controller_ida);
  
  const char * const mhi_ee_str[MHI_EE_MAX] = {

diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..507cf3351a97 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include "internal.h"
+#include "trace.h"
  
  int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,

  void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,9 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, 
void *priv)
  
  	state = mhi_get_mhi_state(mhi_cntrl);

ee = mhi_get_exec_env(mhi_cntrl);
-   dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-   TO_MHI_EXEC_STR(mhi_cntrl->ee),
-   mhi_state_str(mhi_cntrl->dev_state),
-   TO_MHI_EXEC_STR(ee), mhi_state_str(state));
  
+	trace_mhi_intvec_states(mhi_cntrl->mhi_dev->name, mhi_cntrl->ee,

+   mhi_cntrl->dev_state, ee, state);

I'm not sure if this was discussed before, but why not just pass in the
mhi_cntrl pointer and do the assigning in the TRACE_EVENT()
TP_fast_assign() portion?

It will save on setting up the parameters.


if (state == MHI_STATE_SYS_ERR) {
dev_dbg(dev, "System error detected\n");
pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +831,10 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
*mhi_cntrl,
while (dev_rp != local_rp) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
  
+		trace_mhi_ctrl_event(mhi_cntrl->mhi_dev->name, local_rp,

+local_rp->ptr, local_rp->dword[0],
+local_rp->dword[1]);

And here..


+
switch (type) {
case MHI_PKT_TYPE_BW_REQ_EVENT:
{
@@ -997,6 +1000,9 @@ int mhi_process_data_event_ring(struct mhi_controller 
*mhi_cntrl,
while (dev_rp != local_rp && event_quota > 0) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
  
+		trace_mhi_data_event(mhi_cntrl->mhi_dev->name, local_rp, local_rp->ptr,

+local_rp->dword[0], local_rp->dword[1]);

and here..

for the local_rp. Especially since you are just repeating what you send
into the two events. You can consolidate it down to just adding that in the
TP_fast_assign() of the shared DECLARE_EVENT_CLASS().


+
chan = MHI_TRE_GET_EV_CHID(local_rp);
  
  		WARN_ON(chan >= mhi_cntrl->max_chan);

@@ -1235,6 +1241,8 @@ int mhi_gen_tre(struct mhi_controller *mhi_cntrl, struct 
mhi_chan *mhi_chan,
mhi_tre->dword[0] = MHI_TRE_DATA_DWORD0(info->len);
mhi_tre->dword[1] = MHI_TRE_DATA_DWORD1(bei, eot, eob, chain);
  
+	trace_mhi_gen_tre(mhi_cntrl->mhi_dev->name, mhi_chan->chan, mhi_tre,

+ mhi_tre->ptr, mhi_tre->dword[0], mhi_tre->dword[1]);

Same here. You can pass in the mhi_tre pointer.


/* increment WP */
mhi_add_ring_element(mhi_cntrl, tre_ring);
mhi_add_ring_element(mhi_cntrl, buf_ring);
@@ -1327,9 +1335,7 @@ static int mhi_update_channel_state(struct mhi_controller 
*mhi_cntrl,
enum mhi_cmd_type cmd = MHI_CMD_NOP;
int ret;
  
-	dev_dbg(dev, "%d: Updating channel state to: %s\n", mhi_chan->chan,

-   TO_CH_STATE_TYPE_STR(to_state));
-
+   trace_mhi_channel_command_start(mhi_cntrl->mhi_dev->name, 
mhi_chan->chan, to_state);

And here..


switch (to_state) {
case MHI_CH_STATE_TYPE_RESET:
write_lock_irq(&mhi_chan->lock);
@@ -1396,9 +1402,7 @@ static int mhi_update_channel_state(struct mhi_controller 
*mhi_cntrl,
write_unlock_irq(&mhi_chan->lock);
}
  
-	dev_dbg(dev, "%d: Channel state change to %s successful\n",

-   mhi_chan->chan, TO_CH_STATE_TYPE_STR(to_state));
-
+   trace_mhi_channel_command_end(mhi_cntrl->mhi_dev->name, mhi_chan->chan, 
to_state);

and here..

Where you can also update the DECLARE_EVENT_CLASS() to directly access the
mhi_cntrl and mhi_chan pointers. Sometimes it's better to do the
dereference from inside the TP_fast_assign. That way, things like eprobes
and bpf tracing can also have access to the full structure if needed.
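
I.e. roughly something like this (untested sketch; it assumes the
struct mhi_ring_element layout from internal.h, and the event names are
just illustrative):

DECLARE_EVENT_CLASS(mhi_process_event_ring,

	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_ring_element *rp),

	TP_ARGS(mhi_cntrl, rp),

	TP_STRUCT__entry(
		__string(name, mhi_cntrl->mhi_dev->name)
		__field(u64, ptr)
		__field(u32, dword0)
		__field(u32, dword1)
	),

	TP_fast_assign(
		__assign_str(name, mhi_cntrl->mhi_dev->name);
		__entry->ptr = le64_to_cpu(rp->ptr);
		__entry->dword0 = le32_to_cpu(rp->dword[0]);
		__entry->dword1 = le32_to_cpu(rp->dword[1]);
	),

	TP_printk("%s: TRE: 0x%llx 0x%08x 0x%08x", __get_str(name),
		  __entry->ptr, __entry->dword0, __entry->dword1)
);

DEFINE_EVENT(mhi_process_event_ring, mhi_ctrl_event,
	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_ring_element *rp),
	TP_ARGS(mhi_cntrl, rp)
);

so the callers just do trace_mhi_ctrl_event(mhi_cntrl, local_rp);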

-- Steve


Hi Steve,

I will make changes as suggested in the next patch series.

- Krishna Chaitanya.




[PATCH] nvdimm-btt: fix a potential memleak in btt_freelist_init

2023-12-06 Thread Dinghao Liu
When an error happens in btt_freelist_init(), its caller
discover_arenas() will directly free arena, which makes
arena->freelist allocated in btt_freelist_init() a leaked
memory. Fix this by freeing arena->freelist in all error
handling paths of btt_freelist_init().

Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
Signed-off-by: Dinghao Liu 
---
 drivers/nvdimm/btt.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index d5593b0dc700..d8c4ba8bfdda 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -544,8 +544,10 @@ static int btt_freelist_init(struct arena_info *arena)
 
for (i = 0; i < arena->nfree; i++) {
new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT);
-   if (new < 0)
-   return new;
+   if (new < 0) {
+   ret = new;
+   goto out_free;
+   }
 
/* old and new map entries with any flags stripped out */
log_oldmap = ent_lba(le32_to_cpu(log_new.old_map));
@@ -577,7 +579,7 @@ static int btt_freelist_init(struct arena_info *arena)
ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry,
NULL, NULL, 0);
if (ret)
-   return ret;
+   goto out_free;
 
/*
 * The map_entry from btt_read_map is stripped of any flag bits,
@@ -594,11 +596,16 @@ static int btt_freelist_init(struct arena_info *arena)
ret = btt_map_write(arena, le32_to_cpu(log_new.lba),
le32_to_cpu(log_new.new_map), 0, 0, 0);
if (ret)
-   return ret;
+   goto out_free;
}
}
 
return 0;
+
+out_free:
+   kfree(arena->freelist);
+   arena->freelist = NULL;
+   return ret;
 }
 
 static bool ent_is_padding(struct log_entry *ent)
-- 
2.17.1




Re: [PATCH net-next 2/6] page_frag: unify gfp bit for order 3 page allocation

2023-12-06 Thread Jakub Kicinski
On Tue, 5 Dec 2023 19:34:40 +0800 Yunsheng Lin wrote:
> __GFP_DIRECT_RECLAIM is xor'd to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> xor'd in __page_frag_cache_refill().

xor is not the same thing as masking a bit off.
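
For example (nothing vhost-specific, just the flag arithmetic):

	/* GFP_KERNEL has __GFP_DIRECT_RECLAIM set, GFP_ATOMIC does not */
	gfp_t a = GFP_KERNEL ^ __GFP_DIRECT_RECLAIM;  /* bit cleared */
	gfp_t b = GFP_ATOMIC ^ __GFP_DIRECT_RECLAIM;  /* bit now *set* */
	gfp_t c = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM; /* bit cleared */
	gfp_t d = GFP_ATOMIC & ~__GFP_DIRECT_RECLAIM; /* still clear */
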
The patch itself LGTM.



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-06 Thread Xuan Zhuo
On Tue, 5 Dec 2023 20:39:28 +0100, Eric Dumazet  wrote:
> On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo  wrote:
> >
> > On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> > > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> > > >
> > > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > > which will be called every time a tcp data packet is sent, received, and
> > > > acked.
> > > > tcp_data_send: called after a data packet is sent.
> > > > tcp_data_recv: called after a data packet is received.
> > > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > > data are acknowledged).
> > > >
> > > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > > and analyses every tcp request/response event information. The whole
> > > > system has been described in SIGMOD'18 (see
> > > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > > achieve this with bpf, we require hooks for data events that call bpf
> > > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> > >
> > > This I do not understand.
> > >
> > > >
> > > > Though kretprobe/fexit can also be used to collect these information,
> > > > they will not work if the kernel functions get inlined. Considering the
> > > > stability, we prefer tracepoint as the solution.
> > >
> > > I dunno, this seems quite weak to me. I see many patches coming to add
> > > tracing in the stack, but no patches fixing any issues.
> >
> >
> > We have implemented a mechanism to split the request and response from the 
> > TCP
> > connection using these "hookers", which can handle various protocols such as
> > HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> > information about each request and response, including the amount of data
> > uploaded, the time taken by the server to handle the request, and the time 
> > taken
> > for the client to receive the response. This mechanism has been running
> > internally for many years and has proven to be very useful.
> >
> > One of the main benefits of this mechanism is that it helps in locating the
> > source of any issues or problems that may arise. For example, if there is a
> > problem with the network, the application, or the machine, we can use this
> > mechanism to identify and isolate the issue.
> >
> > TCP has long been a challenge when it comes to tracking the transmission of 
> > data
> > on the network. The application can only confirm that it has sent a certain
> > amount of data to the kernel, but it has limited visibility into whether the
> > client has actually received this data. Our mechanism addresses this issue 
> > by
> > providing insights into the amount of data received by the client and the 
> > time
> > it was received. Furthermore, we can also detect any packet loss or delays
> > caused by the server.
> >
> > https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
> >
> > So, we do not want to add some tracepoint to do some unknow debug.
> > We have a clear goal. debugging is just an incidental capability.
> >
>
> We have powerful mechanisms in the stack already that ordinary (no
> privilege requested) applications can readily use.
>
> We have been using them for a while.
>
> If existing mechanisms are missing something you need, please expand them.
>
> For reference, start looking at tcp_get_timestamping_opt_stats() history.
>
> Sender side can for instance get precise timestamps.
>
> Combinations of these timestamps reveal different parts of the overall
> network latency,
>
> T0: sendmsg() enters TCP
> T1: first byte enters qdisc
> T2: first byte sent to the NIC
> T3: first byte ACKed in TCP
> T4: last byte sent to the NIC
> T5: last byte ACKed
> T1 - T0: how long the first byte was blocked in the TCP layer ("Head
> of Line Blocking" latency).
> T2 - T1: how long the first byte was blocked in the Linux traffic
> shaping layer (known as QDisc).
> T3 - T2: the network ‘distance’ (propagation delay + current queuing
> delay along the network path and at the receiver).
> T5 - T2: how fast the sent chunk was delivered.
> Message Size / (T5 - T0): goodput (from application’s perspective)


The key point is that using our mechanism, the application does not need to be
modified.

As long as the app's network protocol is request-response, we can trace the tcp
connection at any time to analyze the requests and responses, and record the
start and end times of each request and response. Of course, there is also some
TTL and other information.

Thanks.



Re: [PATCH net-next v7 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-06 Thread Arseniy Krasnov



On 07.12.2023 01:08, Michael S. Tsirkin wrote:
> On Thu, Dec 07, 2023 at 12:52:51AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 07.12.2023 00:53, Michael S. Tsirkin wrote:
>>> On Thu, Dec 07, 2023 at 12:18:48AM +0300, Arseniy Krasnov wrote:
 Add one more condition for sending credit update during dequeue from
 stream socket: when number of bytes in the rx queue is smaller than
 SO_RCVLOWAT value of the socket. This is actual for non-default value
 of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
 transmission, because we need at least SO_RCVLOWAT bytes in our rx
 queue to wake up user for reading data (in corner case it is also
 possible to stuck both tx and rx sides, this is why 'Fixes' is used).
 Also handle case when 'fwd_cnt' wraps, while 'last_fwd_cnt' is still
 not.

 Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
 Signed-off-by: Arseniy Krasnov 
 ---
  Changelog:
  v6 -> v7:
   * Handle wrap of 'fwd_cnt'.
   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.

  net/vmw_vsock/virtio_transport_common.c | 18 +++---
  1 file changed, 15 insertions(+), 3 deletions(-)

 diff --git a/net/vmw_vsock/virtio_transport_common.c 
 b/net/vmw_vsock/virtio_transport_common.c
 index e137d740804e..39f8660d825d 100644
 --- a/net/vmw_vsock/virtio_transport_common.c
 +++ b/net/vmw_vsock/virtio_transport_common.c
 @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
struct virtio_vsock_sock *vvs = vsk->trans;
size_t bytes, total = 0;
struct sk_buff *skb;
 +  u32 fwd_cnt_delta;
 +  bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;
  
 @@ -601,7 +603,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
}
}
  
 -  free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
 +  /* Handle wrap of 'fwd_cnt'. */
 +  if (vvs->fwd_cnt < vvs->last_fwd_cnt)
 +  fwd_cnt_delta = vvs->fwd_cnt + (U32_MAX - vvs->last_fwd_cnt);
>>>
>>> Are you sure there's no off by one here? for example if fwd_cnt is 0
>>> and last_fwd_cnt is 0xffffffff then apparently delta is 0.
>>
>> Seems yes, I need +1 here
> 
> And then you will get a nop, because assigning U32_MAX + 1 to u32
> gives you 0. Adding () does nothing to change the result,
> + and - are commutative.

Ahh, unsigned here, yes.

@Stefano, what did You mean about wrapping here?

I think Michael is right, for example

vvs->fwd_cnt wraps and now == 5
vvs->last_fwd_cnt == 0xffffffff

now delta before this patch will be 6 - correct value

Maybe I didn't get your idea, so I implemented it very naively?

Thanks, Arseniy

> 
> 
>>>
>>>
 +  else
 +  fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
>>>
>>> I actually don't see what is wrong with just
>>> fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt
>>> 32 bit unsigned math will I think handle wrap around correctly.
>>>
>>> And given buf_alloc is also u32 - I don't see where the bug is in
>>> the original code.
>>
>> I think the problem is when fwd_cnt wraps while last_fwd_cnt does not. In this
>> case fwd_cnt_delta will be too big, so we won't send a credit update, which
>> leads to a stall for the sender
>>
>> Thanks, Arseniy
> 
> Care coming up with an example?
> 
> 
>>>
>>>
 +
 +  free_space = vvs->buf_alloc - fwd_cnt_delta;
 +  low_rx_bytes = (vvs->rx_bytes <
 +  sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
  
spin_unlock_bh(&vvs->rx_lock);
  
 @@ -611,9 +621,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
 * too high causes extra messages. Too low causes transmitter
 * stalls. As stalls are in theory more expensive than extra
 * messages, we set the limit to a high value. TODO: experiment
 -   * with different values.
 +   * with different values. Also send credit update message when
 +   * number of bytes in rx queue is not enough to wake up reader.
 */
 -  if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
 +  if (fwd_cnt_delta &&
 +  (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
virtio_transport_send_credit_update(vsk);
  
return total;
 -- 
 2.25.1
>>>
> 



Re: [PATCH net-next v7 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2023 at 12:52:51AM +0300, Arseniy Krasnov wrote:
> 
> 
> On 07.12.2023 00:53, Michael S. Tsirkin wrote:
> > On Thu, Dec 07, 2023 at 12:18:48AM +0300, Arseniy Krasnov wrote:
> >> Add one more condition for sending credit update during dequeue from
> >> stream socket: when number of bytes in the rx queue is smaller than
> >> SO_RCVLOWAT value of the socket. This is actual for non-default value
> >> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
> >> transmission, because we need at least SO_RCVLOWAT bytes in our rx
> >> queue to wake up user for reading data (in corner case it is also
> >> possible to stuck both tx and rx sides, this is why 'Fixes' is used).
> >> Also handle case when 'fwd_cnt' wraps, while 'last_fwd_cnt' is still
> >> not.
> >>
> >> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
> >> Signed-off-by: Arseniy Krasnov 
> >> ---
> >>  Changelog:
> >>  v6 -> v7:
> >>   * Handle wrap of 'fwd_cnt'.
> >>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
> >>
> >>  net/vmw_vsock/virtio_transport_common.c | 18 +++---
> >>  1 file changed, 15 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> >> b/net/vmw_vsock/virtio_transport_common.c
> >> index e137d740804e..39f8660d825d 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >>struct virtio_vsock_sock *vvs = vsk->trans;
> >>size_t bytes, total = 0;
> >>struct sk_buff *skb;
> >> +  u32 fwd_cnt_delta;
> >> +  bool low_rx_bytes;
> >>int err = -EFAULT;
> >>u32 free_space;
> >>  
> >> @@ -601,7 +603,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >>}
> >>}
> >>  
> >> -  free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >> +  /* Handle wrap of 'fwd_cnt'. */
> >> +  if (vvs->fwd_cnt < vvs->last_fwd_cnt)
> >> +  fwd_cnt_delta = vvs->fwd_cnt + (U32_MAX - vvs->last_fwd_cnt);
> > 
> > Are you sure there's no off by one here? for example if fwd_cnt is 0
> > and last_fwd_cnt is 0xffffffff then apparently delta is 0.
> 
> Seems yes, I need +1 here

And then you will get a nop, because assigning U32_MAX + 1 to u32
gives you 0. Adding () does nothing to change the result,
+ and - are commutative.


> > 
> > 
> >> +  else
> >> +  fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
> > 
> > I actually don't see what is wrong with just
> > fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt
> > 32 bit unsigned math will I think handle wrap around correctly.
> > 
> > And given buf_alloc is also u32 - I don't see where the bug is in
> > the original code.
> 
> I think the problem is when fwd_cnt wraps while last_fwd_cnt does not. In this
> case fwd_cnt_delta will be too big, so we won't send a credit update, which
> leads to a stall for the sender
> 
> Thanks, Arseniy

Care coming up with an example?


> > 
> > 
> >> +
> >> +  free_space = vvs->buf_alloc - fwd_cnt_delta;
> >> +  low_rx_bytes = (vvs->rx_bytes <
> >> +  sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
> >>  
> >>  spin_unlock_bh(&vvs->rx_lock);
> >>  
> >> @@ -611,9 +621,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >> * too high causes extra messages. Too low causes transmitter
> >> * stalls. As stalls are in theory more expensive than extra
> >> * messages, we set the limit to a high value. TODO: experiment
> >> -   * with different values.
> >> +   * with different values. Also send credit update message when
> >> +   * number of bytes in rx queue is not enough to wake up reader.
> >> */
> >> -  if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >> +  if (fwd_cnt_delta &&
> >> +  (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
> >>virtio_transport_send_credit_update(vsk);
> >>  
> >>return total;
> >> -- 
> >> 2.25.1
> > 




Re: [PATCH net-next v7 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-06 Thread Arseniy Krasnov



On 07.12.2023 00:53, Michael S. Tsirkin wrote:
> On Thu, Dec 07, 2023 at 12:18:48AM +0300, Arseniy Krasnov wrote:
>> Add one more condition for sending credit update during dequeue from
>> stream socket: when number of bytes in the rx queue is smaller than
>> SO_RCVLOWAT value of the socket. This is actual for non-default value
>> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
>> transmission, because we need at least SO_RCVLOWAT bytes in our rx
>> queue to wake up user for reading data (in corner case it is also
>> possible to stuck both tx and rx sides, this is why 'Fixes' is used).
>> Also handle case when 'fwd_cnt' wraps, while 'last_fwd_cnt' is still
>> not.
>>
>> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
>> Signed-off-by: Arseniy Krasnov 
>> ---
>>  Changelog:
>>  v6 -> v7:
>>   * Handle wrap of 'fwd_cnt'.
>>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
>>
>>  net/vmw_vsock/virtio_transport_common.c | 18 +++---
>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
>> b/net/vmw_vsock/virtio_transport_common.c
>> index e137d740804e..39f8660d825d 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
>> *vsk,
>>  struct virtio_vsock_sock *vvs = vsk->trans;
>>  size_t bytes, total = 0;
>>  struct sk_buff *skb;
>> +u32 fwd_cnt_delta;
>> +bool low_rx_bytes;
>>  int err = -EFAULT;
>>  u32 free_space;
>>  
>> @@ -601,7 +603,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
>> *vsk,
>>  }
>>  }
>>  
>> -free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
>> +/* Handle wrap of 'fwd_cnt'. */
>> +if (vvs->fwd_cnt < vvs->last_fwd_cnt)
>> +fwd_cnt_delta = vvs->fwd_cnt + (U32_MAX - vvs->last_fwd_cnt);
> 
> Are you sure there's no off by one here? for example if fwd_cnt is 0
> and last_fwd_cnt is 0xffffffff then apparently delta is 0.

Seems yes, I need +1 here

> 
> 
>> +else
>> +fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
> 
> I actually don't see what is wrong with just
>   fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt
> 32 bit unsigned math will I think handle wrap around correctly.
> 
> And given buf_alloc is also u32 - I don't see where the bug is in
> the original code.

I think the problem is when fwd_cnt wraps while last_fwd_cnt does not. In this
case fwd_cnt_delta will be too big, so we won't send a credit update, which
leads to a stall for the sender

Thanks, Arseniy

> 
> 
>> +
>> +free_space = vvs->buf_alloc - fwd_cnt_delta;
>> +low_rx_bytes = (vvs->rx_bytes <
>> +sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>>  
>>  spin_unlock_bh(&vvs->rx_lock);
>>  
>> @@ -611,9 +621,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
>> *vsk,
>>   * too high causes extra messages. Too low causes transmitter
>>   * stalls. As stalls are in theory more expensive than extra
>>   * messages, we set the limit to a high value. TODO: experiment
>> - * with different values.
>> + * with different values. Also send credit update message when
>> + * number of bytes in rx queue is not enough to wake up reader.
>>   */
>> -if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
>> +if (fwd_cnt_delta &&
>> +(free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
>>  virtio_transport_send_credit_update(vsk);
>>  
>>  return total;
>> -- 
>> 2.25.1
> 



Re: [PATCH net-next v7 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2023 at 12:18:48AM +0300, Arseniy Krasnov wrote:
> Add one more condition for sending credit update during dequeue from
> stream socket: when number of bytes in the rx queue is smaller than
> SO_RCVLOWAT value of the socket. This is actual for non-default value
> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
> transmission, because we need at least SO_RCVLOWAT bytes in our rx
> queue to wake up user for reading data (in corner case it is also
> possible to stuck both tx and rx sides, this is why 'Fixes' is used).
> Also handle case when 'fwd_cnt' wraps, while 'last_fwd_cnt' is still
> not.
> 
> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
> Signed-off-by: Arseniy Krasnov 
> ---
>  Changelog:
>  v6 -> v7:
>   * Handle wrap of 'fwd_cnt'.
>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
> 
>  net/vmw_vsock/virtio_transport_common.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index e137d740804e..39f8660d825d 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>   struct virtio_vsock_sock *vvs = vsk->trans;
>   size_t bytes, total = 0;
>   struct sk_buff *skb;
> + u32 fwd_cnt_delta;
> + bool low_rx_bytes;
>   int err = -EFAULT;
>   u32 free_space;
>  
> @@ -601,7 +603,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> *vsk,
>   }
>   }
>  
> - free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> + /* Handle wrap of 'fwd_cnt'. */
> + if (vvs->fwd_cnt < vvs->last_fwd_cnt)
> + fwd_cnt_delta = vvs->fwd_cnt + (U32_MAX - vvs->last_fwd_cnt);

Are you sure there's no off by one here? for example if fwd_cnt is 0
and last_fwd_cnt is 0xffffffff then apparently delta is 0.


> + else
> + fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;

I actually don't see what is wrong with just
fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt
32 bit unsigned math will I think handle wrap around correctly.

And given buf_alloc is also u32 - I don't see where the bug is in
the original code.
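
For example, with plain u32 subtraction the delta already comes out right
across the wrap:

	u32 fwd_cnt      = 5;		/* wrapped past zero */
	u32 last_fwd_cnt = 0xffffffff;
	u32 delta        = fwd_cnt - last_fwd_cnt;	/* == 6, as expected */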


> +
> + free_space = vvs->buf_alloc - fwd_cnt_delta;
> + low_rx_bytes = (vvs->rx_bytes <
> + sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>  
>   spin_unlock_bh(&vvs->rx_lock);
>  
> @@ -611,9 +621,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> *vsk,
>* too high causes extra messages. Too low causes transmitter
>* stalls. As stalls are in theory more expensive than extra
>* messages, we set the limit to a high value. TODO: experiment
> -  * with different values.
> +  * with different values. Also send credit update message when
> +  * number of bytes in rx queue is not enough to wake up reader.
>*/
> - if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> + if (fwd_cnt_delta &&
> + (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
>   virtio_transport_send_credit_update(vsk);
>  
>   return total;
> -- 
> 2.25.1




[PATCH net-next v7 4/4] vsock/test: two tests to check credit update logic

2023-12-06 Thread Arseniy Krasnov
Both tests are almost the same, differing only in two 'if' conditions, so they
are implemented in a single function. The tests check that a credit update
message is sent:

1) During setting the SO_RCVLOWAT value of the socket.
2) When the number of 'rx_bytes' becomes smaller than the SO_RCVLOWAT value.

Signed-off-by: Arseniy Krasnov 
Reviewed-by: Stefano Garzarella 
---
 Changelog:
 v1 -> v2:
  * Update commit message by removing 'This patch adds XXX' manner.
  * Update commit message by adding details about dependency for this
test from kernel internal define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
  * Add comment for this dependency in 'vsock_test.c' where this define
is duplicated.
 v2 -> v3:
  * Replace synchronization based on control TCP socket with vsock
data socket - this is needed to allow sender transmit data only
when new buffer size of receiver is visible to sender. Otherwise
there is race and test fails sometimes.
 v3 -> v4:
  * Replace 'recv_buf()' to 'recv(MSG_DONTWAIT)' in last read operation
in server part. This is needed to ensure that 'poll()' wake up us
when number of bytes ready to read is equal to SO_RCVLOWAT value.
 v4 -> v5:
  * Use 'recv_buf(MSG_DONTWAIT)' instead of 'recv(MSG_DONTWAIT)'.
 v5 -> v6:
  * Add second test which checks, that credit update is sent during
reading data from socket.
  * Update commit message.

 tools/testing/vsock/vsock_test.c | 175 +++
 1 file changed, 175 insertions(+)

diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index 01fa816868bc..66246d81d654 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -1232,6 +1232,171 @@ static void test_double_bind_connect_client(const 
struct test_opts *opts)
}
 }
 
+#define RCVLOWAT_CREDIT_UPD_BUF_SIZE   (1024 * 128)
+/* This define is the same as in 'include/linux/virtio_vsock.h':
+ * it is used to decide when to send a credit update message during
+ * reading from the rx queue of a socket. Its value and usage in the
+ * kernel are important for this test.
+ */
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
+
+static void test_stream_rcvlowat_def_cred_upd_client(const struct test_opts 
*opts)
+{
+   size_t buf_size;
+   void *buf;
+   int fd;
+
+   fd = vsock_stream_connect(opts->peer_cid, 1234);
+   if (fd < 0) {
+   perror("connect");
+   exit(EXIT_FAILURE);
+   }
+
+   /* Send 1 byte more than peer's buffer size. */
+   buf_size = RCVLOWAT_CREDIT_UPD_BUF_SIZE + 1;
+
+   buf = malloc(buf_size);
+   if (!buf) {
+   perror("malloc");
+   exit(EXIT_FAILURE);
+   }
+
+   /* Wait until peer sets needed buffer size. */
+   recv_byte(fd, 1, 0);
+
+   if (send(fd, buf, buf_size, 0) != buf_size) {
+   perror("send failed");
+   exit(EXIT_FAILURE);
+   }
+
+   free(buf);
+   close(fd);
+}
+
+static void test_stream_credit_update_test(const struct test_opts *opts,
+  bool low_rx_bytes_test)
+{
+   size_t recv_buf_size;
+   struct pollfd fds;
+   size_t buf_size;
+   void *buf;
+   int fd;
+
+   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+   if (fd < 0) {
+   perror("accept");
+   exit(EXIT_FAILURE);
+   }
+
+   buf_size = RCVLOWAT_CREDIT_UPD_BUF_SIZE;
+
+   if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
+  &buf_size, sizeof(buf_size))) {
+   perror("setsockopt(SO_VM_SOCKETS_BUFFER_SIZE)");
+   exit(EXIT_FAILURE);
+   }
+
+   if (low_rx_bytes_test) {
+   /* Set new SO_RCVLOWAT here. This enables sending a credit
+    * update when the number of bytes in our rx queue becomes <
+    * the SO_RCVLOWAT value.
+    */
+   recv_buf_size = 1 + VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+
+   if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
+  &recv_buf_size, sizeof(recv_buf_size))) {
+   perror("setsockopt(SO_RCVLOWAT)");
+   exit(EXIT_FAILURE);
+   }
+   }
+
+   /* Send one dummy byte here, because 'setsockopt()' above also
+    * sends a special packet which tells the sender to update our
+    * buffer size. This 'send_byte()' serializes such a packet with
+    * the data reads in the loop below. The sender starts transmission
+    * only when it receives this single byte.
+    */
+   send_byte(fd, 1, 0);
+
+   buf = malloc(buf_size);
+   if (!buf) {
+   perror("malloc");
+   exit(EXIT_FAILURE);
+   }
+
+   /* Wait until there is 128KB of data in the rx queue. */
+   while (1) {
+   ssize_t res;
+
+   res = recv(fd, buf, buf_size, MSG_PEEK);
+   if (res == buf_size)
+   break;
+
+   if (res 

[PATCH net-next v7 2/4] virtio/vsock: send credit update during setting SO_RCVLOWAT

2023-12-06 Thread Arseniy Krasnov
Send a credit update message when SO_RCVLOWAT is updated and it is bigger
than the number of bytes in the rx queue. This is needed because 'poll()'
waits until the number of bytes in the rx queue is not smaller than
SO_RCVLOWAT, so kick the sender to send more data. Otherwise a mutual
hangup of tx and rx is possible: the sender waits for free space and the
receiver waits for data in 'poll()'.

Signed-off-by: Arseniy Krasnov 
---
 Changelog:
 v1 -> v2:
  * Update commit message by removing 'This patch adds XXX' manner.
  * Do not initialize 'send_update' variable - set it directly during
first usage.
 v3 -> v4:
  * Fit comment in 'virtio_transport_notify_set_rcvlowat()' to 80 chars.
 v4 -> v5:
  * Do not change callbacks order in transport structures.
 v5 -> v6:
  * Reorder callbacks in transport structures.
  * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.

 drivers/vhost/vsock.c   |  1 +
 include/linux/virtio_vsock.h|  1 +
 net/vmw_vsock/virtio_transport.c|  1 +
 net/vmw_vsock/virtio_transport_common.c | 30 +
 net/vmw_vsock/vsock_loopback.c  |  1 +
 5 files changed, 34 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f75731396b7e..ec20ecff85c7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -449,6 +449,7 @@ static struct virtio_transport vhost_transport = {
.notify_send_pre_enqueue  = 
virtio_transport_notify_send_pre_enqueue,
.notify_send_post_enqueue = 
virtio_transport_notify_send_post_enqueue,
.notify_buffer_size   = virtio_transport_notify_buffer_size,
+   .notify_set_rcvlowat  = 
virtio_transport_notify_set_rcvlowat,
 
.read_skb = virtio_transport_read_skb,
},
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index ebb3ce63d64d..c82089dee0c8 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -256,4 +256,5 @@ void virtio_transport_put_credit(struct virtio_vsock_sock 
*vvs, u32 credit);
 void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
 int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *list);
 int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t 
read_actor);
+int virtio_transport_notify_set_rcvlowat(struct vsock_sock *vsk, int val);
 #endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index af5bab1acee1..f495b9e5186b 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -537,6 +537,7 @@ static struct virtio_transport virtio_transport = {
.notify_send_pre_enqueue  = 
virtio_transport_notify_send_pre_enqueue,
.notify_send_post_enqueue = 
virtio_transport_notify_send_post_enqueue,
.notify_buffer_size   = virtio_transport_notify_buffer_size,
+   .notify_set_rcvlowat  = 
virtio_transport_notify_set_rcvlowat,
 
.read_skb = virtio_transport_read_skb,
},
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index f6dc896bf44c..e137d740804e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1684,6 +1684,36 @@ int virtio_transport_read_skb(struct vsock_sock *vsk, 
skb_read_actor_t recv_acto
 }
 EXPORT_SYMBOL_GPL(virtio_transport_read_skb);
 
+int virtio_transport_notify_set_rcvlowat(struct vsock_sock *vsk, int val)
+{
+   struct virtio_vsock_sock *vvs = vsk->trans;
+   bool send_update;
+
+   spin_lock_bh(&vvs->rx_lock);
+
+   /* If number of available bytes is less than new SO_RCVLOWAT value,
+* kick sender to send more data, because sender may sleep in its
+* 'send()' syscall waiting for enough space at our side. Also
+* don't send credit update when peer already knows actual value -
+* such transmission will be useless.
+*/
+   send_update = (vvs->rx_bytes < val) &&
+ (vvs->fwd_cnt != vvs->last_fwd_cnt);
+
+   spin_unlock_bh(&vvs->rx_lock);
+
+   if (send_update) {
+   int err;
+
+   err = virtio_transport_send_credit_update(vsk);
+   if (err < 0)
+   return err;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_set_rcvlowat);
+
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Asias He");
 MODULE_DESCRIPTION("common code for virtio vsock");
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 048640167411..6dea6119f5b2 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -96,6 +96,7 @@ static struct virtio_transport loopback_transport = {
.notify_send_pre_enqueue  = 
virtio_transport_notify_send_pre_enqueue,
.notify_send_post_enqueue = 
virtio_transport_notify_send_post_enqueue,

[PATCH net-next v7 0/4] send credit update during setting SO_RCVLOWAT

2023-12-06 Thread Arseniy Krasnov
Hello,

   DESCRIPTION

This patchset fixes an old problem where both the rx and tx sides hang,
and adds a test for it. The hang happens due to a non-default SO_RCVLOWAT
value combined with the deferred credit update in virtio/vsock. Link to
the previous old patchset:
https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/

Here is what happens step by step:

  TEST

INITIAL CONDITIONS

1) Vsock buffer size is 128KB.
2) Maximum packet size is 64KB as defined in the header (yes, it is
   hardcoded - just a reminder of that value).
3) SO_RCVLOWAT is the default, i.e. 1 byte.


 STEPS

SENDER  RECEIVER
1) sends 128KB + 1 byte in a
   single buffer. 128KB will
   be sent, but for 1 byte
   sender will wait for free
   space at peer. Sender goes
   to sleep.


2) reads 64KB, credit update not sent
3) sets SO_RCVLOWAT to 64KB + 1
4) poll() -> wait forever, there is
   only 64KB available to read.

So in step 4) the receiver also goes to sleep, waiting for enough data or
a connection shutdown message from the sender. The idea of the fix is that
rx kicks the tx side to continue transmission (and maybe close the
connection) when rx changes the number of bytes needed to wake it up
(i.e. SO_RCVLOWAT) and this value is bigger than the number of bytes
currently available to read.
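For illustration, a minimal userspace sketch of the receiver side of
steps 3)-4) above (the function name and sizes are illustrative, not part
of the patchset):

#include <poll.h>
#include <sys/socket.h>

/* 'fd' is assumed to be a connected AF_VSOCK stream socket that still
 * has 64KB queued after step 2). Raising SO_RCVLOWAT above that amount
 * makes poll() block until more data arrives - but without a credit
 * update the sender never learns that space was freed, so both sides
 * sleep forever.
 */
static void receiver_steps_3_and_4(int fd)
{
	int lowat = 64 * 1024 + 1;
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
	poll(&pfd, 1, -1);	/* hangs unless rx kicks tx with a credit update */
}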

I've added a small test for this, but I'm not fully sure about it, since
it uses a hardcoded value for the maximum packet length; that value is
defined in a kernel header and used to control the deferred credit update.
As it is not available to userspace, I can't control the test parameters
precisely (if this define is ever changed, the test may become useless).

Head for this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=021b0c952f226236f2edf89c737efb9a28d1422d

Link to v1:
https://lore.kernel.org/netdev/20231108072004.1045669-1-avkras...@salutedevices.com/
Link to v2:
https://lore.kernel.org/netdev/20231119204922.2251912-1-avkras...@salutedevices.com/
Link to v3:
https://lore.kernel.org/netdev/20231122180510.2297075-1-avkras...@salutedevices.com/
Link to v4:
https://lore.kernel.org/netdev/20231129212519.2938875-1-avkras...@salutedevices.com/
Link to v5:
https://lore.kernel.org/netdev/20231130130840.253733-1-avkras...@salutedevices.com/
Link to v6:
https://lore.kernel.org/netdev/20231205064806.2851305-1-avkras...@salutedevices.com/

Changelog:
v1 -> v2:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
   callback in 'af_vsock.c' when transport callback is set - with that
   we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
   not copy-paste it to every transport. It was discussed in v1.
 * See per-patch changelog after ---.
v2 -> v3:
 * See changelog after --- in 0003 only (0001 and 0002 still same).
v3 -> v4:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.
v4 -> v5:
 * Change patchset tag 'RFC' -> 'net-next'.
 * See per-patch changelog after ---.
v5 -> v6:
 * New patch 0003 which sends credit update during reading bytes from
   socket.
 * See per-patch changelog after ---.
v6 -> v7:
 * Patchset rebased and tested on new HEAD of net-next (see hash above).
 * See per-patch changelog after ---.

Arseniy Krasnov (4):
  vsock: update SO_RCVLOWAT setting callback
  virtio/vsock: send credit update during setting SO_RCVLOWAT
  virtio/vsock: fix logic which reduces credit update messages
  vsock/test: two tests to check credit update logic

 drivers/vhost/vsock.c   |   1 +
 include/linux/virtio_vsock.h|   1 +
 include/net/af_vsock.h  |   2 +-
 net/vmw_vsock/af_vsock.c|   9 +-
 net/vmw_vsock/hyperv_transport.c|   4 +-
 net/vmw_vsock/virtio_transport.c|   1 +
 net/vmw_vsock/virtio_transport_common.c |  48 ++-
 net/vmw_vsock/vsock_loopback.c  |   1 +
 tools/testing/vsock/vsock_test.c| 175 
 9 files changed, 234 insertions(+), 8 deletions(-)

-- 
2.25.1




[PATCH net-next v7 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-06 Thread Arseniy Krasnov
Add one more condition for sending a credit update during dequeue from a
stream socket: when the number of bytes in the rx queue is smaller than
the SO_RCVLOWAT value of the socket. This matters for non-default values
of SO_RCVLOWAT (i.e. not 1) - the idea is to "kick" the peer to continue
data transmission, because we need at least SO_RCVLOWAT bytes in our rx
queue to wake up the user for reading data (in a corner case both the tx
and rx sides can also get stuck, which is why 'Fixes' is used).
Also handle the case when 'fwd_cnt' wraps while 'last_fwd_cnt' has not
wrapped yet.

Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov 
---
 Changelog:
 v6 -> v7:
  * Handle wrap of 'fwd_cnt'.
  * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.

 net/vmw_vsock/virtio_transport_common.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index e137d740804e..39f8660d825d 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct virtio_vsock_sock *vvs = vsk->trans;
size_t bytes, total = 0;
struct sk_buff *skb;
+   u32 fwd_cnt_delta;
+   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;
 
@@ -601,7 +603,15 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}
}
 
-   free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+   /* Handle wrap of 'fwd_cnt'. */
+   if (vvs->fwd_cnt < vvs->last_fwd_cnt)
+   fwd_cnt_delta = vvs->fwd_cnt + (U32_MAX - vvs->last_fwd_cnt);
+   else
+   fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
+
+   free_space = vvs->buf_alloc - fwd_cnt_delta;
+   low_rx_bytes = (vvs->rx_bytes <
+   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
 
	spin_unlock_bh(&vvs->rx_lock);
 
@@ -611,9 +621,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * too high causes extra messages. Too low causes transmitter
 * stalls. As stalls are in theory more expensive than extra
 * messages, we set the limit to a high value. TODO: experiment
-* with different values.
+* with different values. Also send credit update message when
+* number of bytes in rx queue is not enough to wake up reader.
 */
-   if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+   if (fwd_cnt_delta &&
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
virtio_transport_send_credit_update(vsk);
 
return total;
-- 
2.25.1




[PATCH net-next v7 1/4] vsock: update SO_RCVLOWAT setting callback

2023-12-06 Thread Arseniy Krasnov
Do not return when the transport callback for SO_RCVLOWAT is set (return
only in the error case). This way we don't need to set the 'sk_rcvlowat'
field in each transport - only in 'vsock_set_rcvlowat()'. Also, since
'sk_rcvlowat' is now set only in af_vsock.c, rename the callback from
'set_rcvlowat' to 'notify_set_rcvlowat'.

Signed-off-by: Arseniy Krasnov 
Reviewed-by: Stefano Garzarella 
---
 Changelog:
 v3 -> v4:
  * Rename 'set_rcvlowat' to 'notify_set_rcvlowat'.
  * Commit message updated.

 include/net/af_vsock.h   | 2 +-
 net/vmw_vsock/af_vsock.c | 9 +++--
 net/vmw_vsock/hyperv_transport.c | 4 ++--
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index e302c0e804d0..535701efc1e5 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -137,7 +137,6 @@ struct vsock_transport {
u64 (*stream_rcvhiwat)(struct vsock_sock *);
bool (*stream_is_active)(struct vsock_sock *);
bool (*stream_allow)(u32 cid, u32 port);
-   int (*set_rcvlowat)(struct vsock_sock *vsk, int val);
 
/* SEQ_PACKET. */
ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
@@ -168,6 +167,7 @@ struct vsock_transport {
struct vsock_transport_send_notify_data *);
/* sk_lock held by the caller */
void (*notify_buffer_size)(struct vsock_sock *, u64 *);
+   int (*notify_set_rcvlowat)(struct vsock_sock *vsk, int val);
 
/* Shutdown. */
int (*shutdown)(struct vsock_sock *, int);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 816725af281f..54ba7316f808 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -2264,8 +2264,13 @@ static int vsock_set_rcvlowat(struct sock *sk, int val)
 
transport = vsk->transport;
 
-   if (transport && transport->set_rcvlowat)
-   return transport->set_rcvlowat(vsk, val);
+   if (transport && transport->notify_set_rcvlowat) {
+   int err;
+
+   err = transport->notify_set_rcvlowat(vsk, val);
+   if (err)
+   return err;
+   }
 
WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
return 0;
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 7cb1a9d2cdb4..e2157e387217 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -816,7 +816,7 @@ int hvs_notify_send_post_enqueue(struct vsock_sock *vsk, 
ssize_t written,
 }
 
 static
-int hvs_set_rcvlowat(struct vsock_sock *vsk, int val)
+int hvs_notify_set_rcvlowat(struct vsock_sock *vsk, int val)
 {
return -EOPNOTSUPP;
 }
@@ -856,7 +856,7 @@ static struct vsock_transport hvs_transport = {
.notify_send_pre_enqueue  = hvs_notify_send_pre_enqueue,
.notify_send_post_enqueue = hvs_notify_send_post_enqueue,
 
-   .set_rcvlowat = hvs_set_rcvlowat
+   .notify_set_rcvlowat  = hvs_notify_set_rcvlowat
 };
 
 static bool hvs_check_transport(struct vsock_sock *vsk)
-- 
2.25.1




[PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter

2023-12-06 Thread John Groves
From: John Groves 

Add CONFIG_DEV_DAX_IOMAP kernel build parameter
---
 drivers/dax/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
  Say N if unsure.
 
+config DEV_DAX_IOMAP
+   depends on DEV_DAX && DAX
+   def_bool y
+   help
+ Support iomap mapping of devdax devices (for FS-DAX file
+ systems that reside on character /dev/dax devices)
 endif
-- 
2.40.1




[PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device

2023-12-06 Thread John Groves
From: John Groves 

This is the primary content of this rfc. Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
  drivers/nvdimm/pmem.c

* dev_dax_direct_access() is physaddr based

* dev_dax_direct_access() works for mmap, but fails dax_copy_to_iter()
  on posix read

* dev_dax_recovery_write()  and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.

I'm hoping somebody (Dan?) can point the way to getting this working
with posix I/O. Does this need to go the memremap route?

Thanks,
John
---
 drivers/dax/bus.c | 105 ++
 1 file changed, 105 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1b55fd7aabaf..8f8c2991c7c2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include 
+#include 
+#include 
+#endif
+
 static DEFINE_MUTEX(dax_bus_lock);
 
 #define DAX_NAME_LEN 30
@@ -1374,6 +1380,100 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, 
pgoff_t pgoff,
 }
 
 
+/* the phys address approach */
+long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+long nr_pages, enum dax_access_mode mode, void 
**kaddr,
+pfn_t *pfn)
+{
+   struct dev_dax *dev_dax = dax_get_private(dax_dev);
+   size_t size = nr_pages << PAGE_SHIFT;
+   size_t offset = pgoff << PAGE_SHIFT;
+   long range_remainder = 0;
+   phys_addr_t phys;
+   int i;
+
+   /*
+    * pmem hides dax ranges by mapping them to a contiguous
+    * pmem->virt_addr = devm_memremap_pages() (in pmem_attach_disk()).
+    * Is it legal to avoid the vmap overhead (and resource consumption)
+    * and just return a (potentially partial) phys range? This function
+    * does this, returning the phys_addr with the length truncated if
+    * necessary to the range remainder.
+    */
+   phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+   if (kaddr)
+   *kaddr = (void *)phys;
+
+   if (pfn)
+   *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); /* are flags 
correct? */
+
+   /*
+* If dax_pgoff_to_phys() also returned the range remainder (range_len 
- range_offset)
+* this loop would not be necessary
+*/
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   size_t rlen = range_len(&(dev_dax->ranges[i].range));
+
+   if (offset < rlen) {
+   range_remainder = rlen - offset;
+   break;
+   }
+   offset -= rlen;
+   }
+
+   /*
+* Return length valid at phys. Hoping callers can deal with len < 
entire_dax_device
+* (or < npages). This returns the remaining length in the applicable 
dax region.
+*/
+   return PHYS_PFN(min_t(size_t, range_remainder, size));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+   size_t nr_pages)
+{
+   long resid = nr_pages << PAGE_SHIFT;
+   long offset = pgoff << PAGE_SHIFT;
+
+   /* Break into one write per dax region */
+   while (resid > 0) {
+   void *kaddr;
+   pgoff_t poff = offset >> PAGE_SHIFT;
+   long len = __dev_dax_direct_access(dax_dev, poff,
+  nr_pages, DAX_ACCESS,
+  &kaddr, NULL);
+   len = min_t(long, len, PAGE_SIZE);
+   write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+   offset += len;
+   resid  -= len;
+   }
+   return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+   void **kaddr, pfn_t *pfn)
+{
+   return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, 
pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   size_t len, off;
+
+   off = offset_in_page(addr);
+   len = PFN_PHYS(PFN_UP(off + bytes));
+
+   return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+   .direct_access = dev_dax_direct_access,
+   .zero_page_range = dev_dax_zero_page_range,
+   .recovery_write = dev_dax_recovery_write,
+};
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
@@ -1429,11 +1529,16 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
*data)
}
}
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+   /* holder_ops currently populated separately in a slightly hacky way */
+   dax_dev = 

[PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues

2023-12-06 Thread John Groves
From: John Groves 

These are functions that should be called from outside, but I had
linkage issues and did this hack instead. Will fix in the "real"
patches...
---
 drivers/dax/bus.c | 50 +++
 1 file changed, 50 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1659b787b65f..1b55fd7aabaf 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1324,6 +1324,56 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
 };
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * This is write_pmem() from pmem.c
+ */
+static void write_dax(void *pmem_addr, struct page *page,
+   unsigned int off, unsigned int len)
+{
+   unsigned int chunk;
+   void *mem;
+
+   while (len) {
+   mem = kmap_atomic(page);
+   chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+   memcpy_flushcache(pmem_addr, mem + off, chunk);
+   kunmap_atomic(mem);
+   len -= chunk;
+   off = 0;
+   page++;
+   pmem_addr += chunk;
+   }
+}
+
+/*
+ * This function is from drivers/dax/device.c
+ * For some reason EXPORT_SYMBOL(dax_pgoff_to_phys) didn't result in linkable 
code
+ */
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+   int i;
+
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+   struct range *range = &dax_range->range;
+   unsigned long long pgoff_end;
+   phys_addr_t phys;
+
+   pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+   if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+   continue;
+   phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+   if (phys + size - 1 <= range->end)
+   return phys;
+   break;
+   }
+   return -1;
+}
+
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
-- 
2.40.1





[PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops

2023-12-06 Thread John Groves
From: John Groves 

This is clearly not the right way to set the holder_ops; where &
how should this be done?
---
 drivers/dax/super.c | 16 
 include/linux/dax.h |  5 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..3d4e205c1854 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -467,6 +467,22 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/* famfs calls this to add the holder_ops. There is probably a more elegant 
approach */
+int add_dax_ops(
+   struct dax_device   *dax_dev,
+   const struct dax_holder_operations *hops)
+{
+   /* dax_dev->ops should have been populated by devm_create_dev_dax() */
+   WARN_ON(!dax_dev->ops);
+
+   /* Use cmpxchg? */
+   dax_dev->holder_ops = hops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(add_dax_ops);
+#endif /* DEV_DAX_IOMAP */
+
 void put_dax(struct dax_device *dax_dev)
 {
if (!dax_dev)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..31b68667b3cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,11 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int add_dax_ops(struct dax_device *dax_dev,
+   const struct dax_holder_operations *hops);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-- 
2.40.1





[PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax

2023-12-06 Thread John Groves
From: John Groves 

This patch set is not intended to be merged; I'm hoping to get some
clarification as to the correct approach (especially from Dan).

This work is related to famfs, which is a dax file system for shared
fabric-attached memory (FAM). Famfs is "coming soon" as an RFC, but
the concept and requirements were presented at LPC 2023. See
https://lpc.events/event/17/contributions/1455/ and
https://www.youtube.com/watch?v=aA_DgO95gLo. My expectation is that
a future (fully working) version of this patch will be folded into the
famfs
patches.

Unlike the current set of fs-dax file systems, famfs does not need a block
(pmem) device, and should really run on a /dev/dax character device since
that's how shareable fabric-attached CXL memory will surface. But
/dev/dax character devices are missing some functionality that is provided
by the block /dev/pmem driver - specifically the struct dax_operations
pointer in struct dax_device.

This patch, when CONFIG_DEV_DAX_IOMAP=y, populates dax_dev->ops for
character dax devices. The added operations differ (currently) from
/dev/pmem's dev_dax->ops in that they don't use memremap but instead
provide a physical address in response to the dev_dax->direct_access()
method. 

The dax_operations are direct_access() (which resolves a dax dev offset
to an address), zero_page_range() and recovery_write(). I'm not sure yet
how to test the latter two, but the direct_access() method works in
conjunction with famfs - but only for mmapped files.

But POSIX reads fail. Specifically, dax_iomap_iter() calls
dax_copy_to_iter(), which declines to copy the data for some reason in
one of the lower level copy_to_user variants. I've tried to isolate the
reason for the failure with a VM under gdb, but no luck there yet. I'm
wondering if there is some flag or attribute that needs to be applied to
these addresses/pages somehow to allow this to work.

The failing copy_to_user() differs from the same path with pmem fs-dax,
in that pmem does a memremap (which I think generates one contiguous
range, even if the device had more than one range - is this right, and
does this mean it's consuming some of the vmap/vmalloc range?)

I spent some time attempting a memremap, but I haven't figured out the
magic for that. However, I like the simplicity of resolving to phys if
that's not a non-starter for some reason.

I hope this is enough context for a meaningful review and suggestions as to
what a working dev_dax->dax_operations implementation should look like.

Thanks for any tips!


John Groves (4):
  Add add_dax_ops() func for fs-dax to provide dax holder_ops
  Temporary hacks due to linkage issues
  Add dax_operations to /dev/dax struct dax_device
  Add CONFIG_DEV_DAX_IOMAP kernel build parameter

 drivers/dax/Kconfig |   6 ++
 drivers/dax/bus.c   | 155 
 drivers/dax/super.c |  16 +
 include/linux/dax.h |   5 ++
 4 files changed, 182 insertions(+)

-- 
2.40.1




[PATCH v6 1/2] depmod: Handle installing modules under a different directory

2023-12-06 Thread Michal Suchanek
Some distributions aim at shipping all files in /usr.

The path under which kernel modules are installed is hardcoded to /lib
which conflicts with this goal.

When kmod provides kmod.pc, use it to determine the correct module
installation path.

With a kmod that does not provide the config, /lib/modules is used as
before.

While pkg-config does not return an error when a variable does not exist,
the kmod configure script puts some effort into ensuring that
module_directory is non-empty. Given that, an empty module_directory
returned by pkg-config can be used to detect the absence of the variable.

Signed-off-by: Michal Suchanek 
---
v6:
 - use ?= instead of := to make it easier to override the value
 - use shorter expression for determining the module directory assuming
   it's non-empty
---
 Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 511b5616aa41..84f32bd563d4 100644
--- a/Makefile
+++ b/Makefile
@@ -1081,7 +1081,9 @@ export INSTALL_DTBS_PATH ?= 
$(INSTALL_PATH)/dtbs/$(KERNELRELEASE)
 # makefile but the argument can be passed to make if needed.
 #
 
-MODLIB = $(INSTALL_MOD_PATH)/lib/modules/$(KERNELRELEASE)
+export KERNEL_MODULE_DIRECTORY ?= $(or $(shell pkg-config 
--variable=module_directory kmod 2>/dev/null),/lib/modules)
+
+MODLIB = $(INSTALL_MOD_PATH)$(KERNEL_MODULE_DIRECTORY)/$(KERNELRELEASE)
 export MODLIB
 
 PHONY += prepare0
-- 
2.42.0




[PATCH v6 2/2] kbuild: rpm-pkg: Fix build with non-default MODLIB

2023-12-06 Thread Michal Suchanek
The default MODLIB value is composed of three variables

MODLIB = $(INSTALL_MOD_PATH)$(KERNEL_MODULE_DIRECTORY)/$(KERNELRELEASE)

However, kernel.spec hardcodes the default value of
$(KERNEL_MODULE_DIRECTORY), so a changed value is not reflected when
building the package.

Pass KERNEL_MODULE_DIRECTORY to kernel.spec to fix this problem.

Signed-off-by: Michal Suchanek 
---
Build on top of the previous patch adding KERNEL_MODULE_DIRECTORY
---
 scripts/package/kernel.spec | 8 
 scripts/package/mkspec  | 1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/scripts/package/kernel.spec b/scripts/package/kernel.spec
index 3eee0143e0c5..12996ed365f8 100644
--- a/scripts/package/kernel.spec
+++ b/scripts/package/kernel.spec
@@ -67,7 +67,7 @@ cp $(%{make} %{makeflags} -s image_name) 
%{buildroot}/boot/vmlinuz-%{KERNELRELEA
 %{make} %{makeflags} INSTALL_HDR_PATH=%{buildroot}/usr headers_install
 cp System.map %{buildroot}/boot/System.map-%{KERNELRELEASE}
 cp .config %{buildroot}/boot/config-%{KERNELRELEASE}
-ln -fns /usr/src/kernels/%{KERNELRELEASE} 
%{buildroot}/lib/modules/%{KERNELRELEASE}/build
+ln -fns /usr/src/kernels/%{KERNELRELEASE} 
%{buildroot}%{KERNEL_MODULE_DIRECTORY}/%{KERNELRELEASE}/build
 %if %{with_devel}
 %{make} %{makeflags} run-command 
KBUILD_RUN_COMMAND='${srctree}/scripts/package/install-extmod-build 
%{buildroot}/usr/src/kernels/%{KERNELRELEASE}'
 %endif
@@ -98,8 +98,8 @@ fi
 
 %files
 %defattr (-, root, root)
-/lib/modules/%{KERNELRELEASE}
-%exclude /lib/modules/%{KERNELRELEASE}/build
+%{KERNEL_MODULE_DIRECTORY}/%{KERNELRELEASE}
+%exclude %{KERNEL_MODULE_DIRECTORY}/%{KERNELRELEASE}/build
 /boot/*
 
 %files headers
@@ -110,5 +110,5 @@ fi
 %files devel
 %defattr (-, root, root)
 /usr/src/kernels/%{KERNELRELEASE}
-/lib/modules/%{KERNELRELEASE}/build
+%{KERNEL_MODULE_DIRECTORY}/%{KERNELRELEASE}/build
 %endif
diff --git a/scripts/package/mkspec b/scripts/package/mkspec
index ce201bfa8377..e952fa4f2937 100755
--- a/scripts/package/mkspec
+++ b/scripts/package/mkspec
@@ -24,6 +24,7 @@ fi
 cat<

Re: [PATCH] init: move THIS_MODULE from <linux/export.h> to <linux/init.h>

2023-12-06 Thread Paul Gortmaker
[Re: [PATCH] init: move THIS_MODULE from <linux/export.h> to <linux/init.h>] On 
03/12/2023 (Sun 19:06) Masahiro Yamada wrote:

> On Sun, Nov 26, 2023 at 4:19???PM Masahiro Yamada  
> wrote:
> >
> > Commit f50169324df4 ("module.h: split out the EXPORT_SYMBOL into
> > export.h") appropriately separated EXPORT_SYMBOL into 
> > because modules and EXPORT_SYMBOL are orthogonal; modules are symbol
> > consumers, while EXPORT_SYMBOL are used by symbol providers, which
> > may not be necessarily a module.
> >
> > However, that commit also relocated THIS_MODULE. As explained in the
> > commit description, the intention was to define THIS_MODULE in a
> > lightweight header, but I do not believe  was the
> > suitable location because EXPORT_SYMBOL and THIS_MODULE are unrelated.
> >
> > Move it to another lightweight header, . The reason for
> > choosing  is to make  self-contained
> > without relying on  incorrectly including
> > .
> >
> > With this adjustment, the role of  becomes clearer as
> > it only defines EXPORT_SYMBOL.
> >
> > Signed-off-by: Masahiro Yamada 
> > ---
> 
> 
> Applied to kbuild.
> 
> I did not get any report from the 0day bot so far,
> but I hope it will get a little more compile tests
> before getting into linux-next.

Haven't touched that kind of header shuffle for over 10 years?

But yeah, it is near impossible to not trip over some implicit header
inclusion somewhere in some driver or a less common arch and hence break
the build at least once when doing this kind of stuff.

Paul.
--

> 
> 
> 
> >
> >  include/linux/export.h | 18 --
> >  include/linux/init.h   |  7 +++
> >  2 files changed, 7 insertions(+), 18 deletions(-)
> >
> > diff --git a/include/linux/export.h b/include/linux/export.h
> > index 9911508a9604..0bbd02fd351d 100644
> > --- a/include/linux/export.h
> > +++ b/include/linux/export.h
> > @@ -6,15 +6,6 @@
> >  #include 
> >  #include 
> >
> > -/*
> > - * Export symbols from the kernel to modules.  Forked from module.h
> > - * to reduce the amount of pointless cruft we feed to gcc when only
> > - * exporting a simple symbol or two.
> > - *
> > - * Try not to add #includes here.  It slows compilation and makes kernel
> > - * hackers place grumpy comments in header files.
> > - */
> > -
> >  /*
> >   * This comment block is used by fixdep. Please do not remove.
> >   *
> > @@ -23,15 +14,6 @@
> >   * side effect of the *.o build rule.
> >   */
> >
> > -#ifndef __ASSEMBLY__
> > -#ifdef MODULE
> > -extern struct module __this_module;
> > -#define THIS_MODULE (&__this_module)
> > -#else
> > -#define THIS_MODULE ((struct module *)0)
> > -#endif
> > -#endif /* __ASSEMBLY__ */
> > -
> >  #ifdef CONFIG_64BIT
> >  #define __EXPORT_SYMBOL_REF(sym)   \
> > .balign 8   ASM_NL  \
> > diff --git a/include/linux/init.h b/include/linux/init.h
> > index 01b52c9c7526..3fa3f6241350 100644
> > --- a/include/linux/init.h
> > +++ b/include/linux/init.h
> > @@ -179,6 +179,13 @@ extern void (*late_time_init)(void);
> >
> >  extern bool initcall_debug;
> >
> > +#ifdef MODULE
> > +extern struct module __this_module;
> > +#define THIS_MODULE (&__this_module)
> > +#else
> > +#define THIS_MODULE ((struct module *)0)
> > +#endif
> > +
> >  #endif
> >
> >  #ifndef MODULE
> > --
> > 2.40.1
> >
> 
> 
> -- 
> Best Regards
> Masahiro Yamada



Re: [PATCH net] vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning

2023-12-06 Thread Arseniy Krasnov



On 06.12.2023 19:41, Stefano Garzarella wrote:
> After backporting commit 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY
> flag support") in CentOS Stream 9, CI reported the following error:
> 
> In file included from ./include/linux/kernel.h:17,
>  from ./include/linux/list.h:9,
>  from ./include/linux/preempt.h:11,
>  from ./include/linux/spinlock.h:56,
>  from net/vmw_vsock/virtio_transport_common.c:9:
> net/vmw_vsock/virtio_transport_common.c: In function 
> ‘virtio_transport_can_zcopy‘:
> ./include/linux/minmax.h:20:35: error: comparison of distinct pointer 
> types lacks a cast [-Werror]
>20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
>   |   ^~
> ./include/linux/minmax.h:26:18: note: in expansion of macro ‘__typecheck‘
>26 | (__typecheck(x, y) && __no_side_effects(x, y))
>   |  ^~~
> ./include/linux/minmax.h:36:31: note: in expansion of macro ‘__safe_cmp‘
>36 | __builtin_choose_expr(__safe_cmp(x, y), \
>   |   ^~
> ./include/linux/minmax.h:45:25: note: in expansion of macro 
> ‘__careful_cmp‘
>45 | #define min(x, y)   __careful_cmp(x, y, <)
>   | ^
> net/vmw_vsock/virtio_transport_common.c:63:37: note: in expansion of 
> macro ‘min‘
>63 | int pages_to_send = min(pages_in_iov, 
> MAX_SKB_FRAGS);
> 
> We could solve it by using min_t(), but this operation seems entirely
> unnecessary, because we also pass MAX_SKB_FRAGS to iov_iter_npages(),
> which performs almost the same check, returning at most MAX_SKB_FRAGS
> elements. So, let's eliminate this unnecessary comparison.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Cc: avkras...@salutedevices.com
> Signed-off-by: Stefano Garzarella 
> ---

Reviewed-by: Arseniy Krasnov 

>  net/vmw_vsock/virtio_transport_common.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index f6dc896bf44c..c8e162c9d1df 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -59,8 +59,7 @@ static bool virtio_transport_can_zcopy(const struct 
> virtio_transport *t_ops,
>   t_ops = virtio_transport_get_ops(info->vsk);
>  
>   if (t_ops->can_msgzerocopy) {
> - int pages_in_iov = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
> - int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);
> + int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
>  
>   /* +1 is for packet header. */
>   return t_ops->can_msgzerocopy(pages_to_send + 1);



Re: [PATCH 0/2] kmod /usr support

2023-12-06 Thread Lucas De Marchi

On Fri, Nov 10, 2023 at 01:13:53PM +0100, Michal Suchanek wrote:

Hello,

This is resend of the last patch in the series that adds prefix support
to kernel module location together with additional patch for validating
the user supplied input to options that are interpreted as directories.

Thanks


applied, thanks

Lucas De Marchi



Re: [PATCH net] vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning

2023-12-06 Thread Michael S. Tsirkin
On Wed, Dec 06, 2023 at 05:41:43PM +0100, Stefano Garzarella wrote:
> After backporting commit 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY
> flag support") in CentOS Stream 9, CI reported the following error:
> 
> In file included from ./include/linux/kernel.h:17,
>  from ./include/linux/list.h:9,
>  from ./include/linux/preempt.h:11,
>  from ./include/linux/spinlock.h:56,
>  from net/vmw_vsock/virtio_transport_common.c:9:
> net/vmw_vsock/virtio_transport_common.c: In function 
> ‘virtio_transport_can_zcopy‘:
> ./include/linux/minmax.h:20:35: error: comparison of distinct pointer 
> types lacks a cast [-Werror]
>20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
>   |   ^~
> ./include/linux/minmax.h:26:18: note: in expansion of macro 
> ‘__typecheck‘
>26 | (__typecheck(x, y) && __no_side_effects(x, y))
>   |  ^~~
> ./include/linux/minmax.h:36:31: note: in expansion of macro 
> ‘__safe_cmp‘
>36 | __builtin_choose_expr(__safe_cmp(x, y), \
>   |   ^~
> ./include/linux/minmax.h:45:25: note: in expansion of macro 
> ‘__careful_cmp‘
>45 | #define min(x, y)   __careful_cmp(x, y, <)
>   | ^
> net/vmw_vsock/virtio_transport_common.c:63:37: note: in expansion of 
> macro ‘min‘
>63 | int pages_to_send = min(pages_in_iov, 
> MAX_SKB_FRAGS);
> 
> We could solve it by using min_t(), but this operation seems entirely
> unnecessary, because we also pass MAX_SKB_FRAGS to iov_iter_npages(),
> which performs almost the same check, returning at most MAX_SKB_FRAGS
> elements. So, let's eliminate this unnecessary comparison.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Cc: avkras...@salutedevices.com
> Signed-off-by: Stefano Garzarella 
> ---


Acked-by: Michael S. Tsirkin 

>  net/vmw_vsock/virtio_transport_common.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index f6dc896bf44c..c8e162c9d1df 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -59,8 +59,7 @@ static bool virtio_transport_can_zcopy(const struct 
> virtio_transport *t_ops,
>   t_ops = virtio_transport_get_ops(info->vsk);
>  
>   if (t_ops->can_msgzerocopy) {
> - int pages_in_iov = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
> - int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);
> + int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
>  
>   /* +1 is for packet header. */
>   return t_ops->can_msgzerocopy(pages_to_send + 1);
> -- 
> 2.43.0




Re: [PATCH v7 3/4] remoteproc: zynqmp: add pm domains support

2023-12-06 Thread Tanmay Shah


On 12/6/23 9:43 AM, Mathieu Poirier wrote:
> On Fri, 1 Dec 2023 at 11:10, Tanmay Shah  wrote:
> >
> >
> > On 11/29/23 11:10 AM, Mathieu Poirier wrote:
> > > On Mon, Nov 27, 2023 at 10:33:05AM -0600, Tanmay Shah wrote:
> > > >
> > > > On 11/23/23 12:11 PM, Mathieu Poirier wrote:
> > > > > On Wed, Nov 22, 2023 at 03:00:36PM -0600, Tanmay Shah wrote:
> > > > > > Hi Mathieu,
> > > > > >
> > > > > > Please find my comments below.
> > > > > >
> > > > > > On 11/21/23 4:59 PM, Mathieu Poirier wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 09:42:37AM -0800, Tanmay Shah wrote:
> > > > > > > > Use TCM pm domains extracted from device-tree
> > > > > > > > to power on/off TCM using general pm domain framework.
> > > > > > > >
> > > > > > > > Signed-off-by: Tanmay Shah 
> > > > > > > > ---
> > > > > > > >
> > > > > > > > Changes in v7:
> > > > > > > >   - %s/pm_dev1/pm_dev_core0/r
> > > > > > > >   - %s/pm_dev_link1/pm_dev_core0_link/r
> > > > > > > >   - %s/pm_dev2/pm_dev_core1/r
> > > > > > > >   - %s/pm_dev_link2/pm_dev_core1_link/r
> > > > > > > >   - remove pm_domain_id check to move next patch
> > > > > > > >   - add comment about how 1st entry in pm domain list is used
> > > > > > > >   - fix loop when jump to fail_add_pm_domains loop
> > > > > > > >
> > > > > > > >  drivers/remoteproc/xlnx_r5_remoteproc.c | 215 
> > > > > > > > +++-
> > > > > > > >  1 file changed, 212 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > > > > > > > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > > index 4395edea9a64..22bccc5075a0 100644
> > > > > > > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > > @@ -16,6 +16,7 @@
> > > > > > > >  #include 
> > > > > > > >  #include 
> > > > > > > >  #include 
> > > > > > > > +#include 
> > > > > > > >
> > > > > > > >  #include "remoteproc_internal.h"
> > > > > > > >
> > > > > > > > @@ -102,6 +103,12 @@ static const struct mem_bank_data 
> > > > > > > > zynqmp_tcm_banks_lockstep[] = {
> > > > > > > >   * @rproc: rproc handle
> > > > > > > >   * @pm_domain_id: RPU CPU power domain id
> > > > > > > >   * @ipi: pointer to mailbox information
> > > > > > > > + * @num_pm_dev: number of tcm pm domain devices for this core
> > > > > > > > + * @pm_dev_core0: pm domain virtual devices for power domain 
> > > > > > > > framework
> > > > > > > > + * @pm_dev_core0_link: pm domain device links after 
> > > > > > > > registration
> > > > > > > > + * @pm_dev_core1: used only in lockstep mode. second core's pm 
> > > > > > > > domain virtual devices
> > > > > > > > + * @pm_dev_core1_link: used only in lockstep mode. second 
> > > > > > > > core's pm device links after
> > > > > > > > + * registration
> > > > > > > >   */
> > > > > > > >  struct zynqmp_r5_core {
> > > > > > > > struct device *dev;
> > > > > > > > @@ -111,6 +118,11 @@ struct zynqmp_r5_core {
> > > > > > > > struct rproc *rproc;
> > > > > > > > u32 pm_domain_id;
> > > > > > > > struct mbox_info *ipi;
> > > > > > > > +   int num_pm_dev;
> > > > > > > > +   struct device **pm_dev_core0;
> > > > > > > > +   struct device_link **pm_dev_core0_link;
> > > > > > > > +   struct device **pm_dev_core1;
> > > > > > > > +   struct device_link **pm_dev_core1_link;
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  /**
> > > > > > > > @@ -651,7 +663,8 @@ static int 
> > > > > > > > add_tcm_carveout_lockstep_mode(struct rproc *rproc)
> > > > > > > >  
> > > > > > > > ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > > > > > > >  
> > > > > > > > ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > > > > > > > if (ret < 0) {
> > > > > > > > -   dev_err(dev, "failed to turn on TCM 0x%x", 
> > > > > > > > pm_domain_id);
> > > > > > > > +   dev_err(dev, "failed to turn on TCM 0x%x",
> > > > > > > > +   pm_domain_id);
> > > > > > >
> > > > > > > Spurious change, you should have caught that.
> > > > > >
> > > > > > Ack, need to observe changes more closely before sending them.
> > > > > >
> > > > > > >
> > > > > > > > goto release_tcm_lockstep;
> > > > > > > > }
> > > > > > > >
> > > > > > > > @@ -758,6 +771,189 @@ static int zynqmp_r5_parse_fw(struct 
> > > > > > > > rproc *rproc, const struct firmware *fw)
> > > > > > > > return ret;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static void zynqmp_r5_remove_pm_domains(struct rproc *rproc)
> > > > > > > > +{
> > > > > > > > +   struct zynqmp_r5_core *r5_core = rproc->priv;
> > > > > > > > +   struct device *dev = r5_core->dev;
> > > > > > > > +   struct zynqmp_r5_cluster *cluster;
> > > > > > > > +   int i;
> > > > > > > > +
> > > > > > > > +   cluster = 
> > > > > > > > platform_get_drvdata(to_platform_device(dev->parent));
> > > > > > > 

[PATCH net] vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning

2023-12-06 Thread Stefano Garzarella
After backporting commit 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY
flag support") in CentOS Stream 9, CI reported the following error:

In file included from ./include/linux/kernel.h:17,
 from ./include/linux/list.h:9,
 from ./include/linux/preempt.h:11,
 from ./include/linux/spinlock.h:56,
 from net/vmw_vsock/virtio_transport_common.c:9:
net/vmw_vsock/virtio_transport_common.c: In function 
‘virtio_transport_can_zcopy‘:
./include/linux/minmax.h:20:35: error: comparison of distinct pointer types 
lacks a cast [-Werror]
   20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
  |   ^~
./include/linux/minmax.h:26:18: note: in expansion of macro 
‘__typecheck‘
   26 | (__typecheck(x, y) && __no_side_effects(x, y))
  |  ^~~
./include/linux/minmax.h:36:31: note: in expansion of macro ‘__safe_cmp‘
   36 | __builtin_choose_expr(__safe_cmp(x, y), \
  |   ^~
./include/linux/minmax.h:45:25: note: in expansion of macro 
‘__careful_cmp‘
   45 | #define min(x, y)   __careful_cmp(x, y, <)
  | ^
net/vmw_vsock/virtio_transport_common.c:63:37: note: in expansion of macro 
‘min‘
   63 | int pages_to_send = min(pages_in_iov, 
MAX_SKB_FRAGS);

We could solve it by using min_t(), but this operation seems entirely
unnecessary, because we also pass MAX_SKB_FRAGS to iov_iter_npages(),
which performs almost the same check, returning at most MAX_SKB_FRAGS
elements. So, let's eliminate this unnecessary comparison.
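For reference, a small userspace sketch of the min_t() alternative
mentioned above (a simplified stand-in for the kernel macro, shown only
to illustrate why casting both operands to one named type avoids the
strict type check that plain min() performs):

#include <stdio.h>

/* Simplified stand-in for the kernel's min_t(): cast both operands to
 * the named type before comparing, so mixed-type arguments (e.g. an
 * 'int' page count vs. an unsigned long MAX_SKB_FRAGS expansion) no
 * longer trip the distinct-types check that plain min() performs.
 */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	int pages_in_iov = 20;
	unsigned long max_frags = 17;	/* stand-in for MAX_SKB_FRAGS */

	printf("%d\n", min_t(int, pages_in_iov, max_frags));	/* prints 17 */
	return 0;
}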

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Cc: avkras...@salutedevices.com
Signed-off-by: Stefano Garzarella 
---
 net/vmw_vsock/virtio_transport_common.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index f6dc896bf44c..c8e162c9d1df 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -59,8 +59,7 @@ static bool virtio_transport_can_zcopy(const struct 
virtio_transport *t_ops,
t_ops = virtio_transport_get_ops(info->vsk);
 
if (t_ops->can_msgzerocopy) {
-   int pages_in_iov = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
-   int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);
+   int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
 
/* +1 is for packet header. */
return t_ops->can_msgzerocopy(pages_to_send + 1);
-- 
2.43.0




Re: [PATCH v7] bus: mhi: host: Add tracing support

2023-12-06 Thread Steven Rostedt
On Wed, 6 Dec 2023 21:12:57 +0530
Krishna chaitanya chundru  wrote:

> diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
> index f78aefd2d7a3..6acb85f4c5f8 100644
> --- a/drivers/bus/mhi/host/init.c
> +++ b/drivers/bus/mhi/host/init.c
> @@ -20,6 +20,9 @@
>  #include 
>  #include "internal.h"
>  
> +#define CREATE_TRACE_POINTS
> +#include "trace.h"
> +
>  static DEFINE_IDA(mhi_controller_ida);
>  
>  const char * const mhi_ee_str[MHI_EE_MAX] = {
> diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
> index dcf627b36e82..507cf3351a97 100644
> --- a/drivers/bus/mhi/host/main.c
> +++ b/drivers/bus/mhi/host/main.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include "internal.h"
> +#include "trace.h"
>  
>  int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
> void __iomem *base, u32 offset, u32 *out)
> @@ -491,11 +492,9 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, 
> void *priv)
>  
>   state = mhi_get_mhi_state(mhi_cntrl);
>   ee = mhi_get_exec_env(mhi_cntrl);
> - dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
> - TO_MHI_EXEC_STR(mhi_cntrl->ee),
> - mhi_state_str(mhi_cntrl->dev_state),
> - TO_MHI_EXEC_STR(ee), mhi_state_str(state));
>  
> + trace_mhi_intvec_states(mhi_cntrl->mhi_dev->name, mhi_cntrl->ee,
> + mhi_cntrl->dev_state, ee, state);

I'm not sure if this was discussed before, but why not just pass in the
mhi_cntrl pointer and do the assigning in the TRACE_EVENT()
TP_fast_assign() portion?

It will save on setting up the parameters.

>   if (state == MHI_STATE_SYS_ERR) {
>   dev_dbg(dev, "System error detected\n");
>   pm_state = mhi_tryset_pm_state(mhi_cntrl,
> @@ -832,6 +831,10 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
> *mhi_cntrl,
>   while (dev_rp != local_rp) {
>   enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
>  
> + trace_mhi_ctrl_event(mhi_cntrl->mhi_dev->name, local_rp,
> +  local_rp->ptr, local_rp->dword[0],
> +  local_rp->dword[1]);

And here..

> +
>   switch (type) {
>   case MHI_PKT_TYPE_BW_REQ_EVENT:
>   {
> @@ -997,6 +1000,9 @@ int mhi_process_data_event_ring(struct mhi_controller 
> *mhi_cntrl,
>   while (dev_rp != local_rp && event_quota > 0) {
>   enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
>  
> + trace_mhi_data_event(mhi_cntrl->mhi_dev->name, local_rp, 
> local_rp->ptr,
> +  local_rp->dword[0], local_rp->dword[1]);

and here..

for the local_rp. Especially since you are just repeating what you send
into the two events. You can consolidate it down to just adding that in the
TP_fast_assign() of the shared DECLARE_EVENT_CLASS().

> +
>   chan = MHI_TRE_GET_EV_CHID(local_rp);
>  
>   WARN_ON(chan >= mhi_cntrl->max_chan);
> @@ -1235,6 +1241,8 @@ int mhi_gen_tre(struct mhi_controller *mhi_cntrl, 
> struct mhi_chan *mhi_chan,
>   mhi_tre->dword[0] = MHI_TRE_DATA_DWORD0(info->len);
>   mhi_tre->dword[1] = MHI_TRE_DATA_DWORD1(bei, eot, eob, chain);
>  
> + trace_mhi_gen_tre(mhi_cntrl->mhi_dev->name, mhi_chan->chan, mhi_tre,
> +   mhi_tre->ptr, mhi_tre->dword[0], mhi_tre->dword[1]);

Same here. You can pass in the mhi_tre pointer.

>   /* increment WP */
>   mhi_add_ring_element(mhi_cntrl, tre_ring);
>   mhi_add_ring_element(mhi_cntrl, buf_ring);
> @@ -1327,9 +1335,7 @@ static int mhi_update_channel_state(struct 
> mhi_controller *mhi_cntrl,
>   enum mhi_cmd_type cmd = MHI_CMD_NOP;
>   int ret;
>  
> - dev_dbg(dev, "%d: Updating channel state to: %s\n", mhi_chan->chan,
> - TO_CH_STATE_TYPE_STR(to_state));
> -
> + trace_mhi_channel_command_start(mhi_cntrl->mhi_dev->name, 
> mhi_chan->chan, to_state);

And here..

>   switch (to_state) {
>   case MHI_CH_STATE_TYPE_RESET:
>   write_lock_irq(&mhi_chan->lock);
> @@ -1396,9 +1402,7 @@ static int mhi_update_channel_state(struct 
> mhi_controller *mhi_cntrl,
>   write_unlock_irq(&mhi_chan->lock);
>   }
>  
> - dev_dbg(dev, "%d: Channel state change to %s successful\n",
> - mhi_chan->chan, TO_CH_STATE_TYPE_STR(to_state));
> -
> + trace_mhi_channel_command_end(mhi_cntrl->mhi_dev->name, mhi_chan->chan, 
> to_state);

and here..

Where you can also update the DECLARE_EVENT_CLASS() to directly access the
mhi_cntrl and mhi_chan pointers. Sometimes it's better to do the
dereference from inside the TP_fast_assign. That way, things like eprobes
and bpf tracing can also have access to the full structure if needed.
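Something along these lines (just a sketch of the suggestion, not the
submitted patch - the class, event and field names are illustrative, and
it assumes a ring element type exposing 'ptr' and 'dword[2]' as in
struct mhi_ring_element):

/* Fragment for drivers/bus/mhi/host/trace.h: pass the pointers in and
 * do the dereferences in TP_fast_assign().
 */
DECLARE_EVENT_CLASS(mhi_process_event_ring,

	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_ring_element *rp),

	TP_ARGS(mhi_cntrl, rp),

	TP_STRUCT__entry(
		__string(name, mhi_cntrl->mhi_dev->name)
		__field(void *, rp)
		__field(__le64, ptr)
		__field(__le32, dword0)
		__field(__le32, dword1)
	),

	TP_fast_assign(
		__assign_str(name, mhi_cntrl->mhi_dev->name);
		__entry->rp = rp;
		__entry->ptr = rp->ptr;
		__entry->dword0 = rp->dword[0];
		__entry->dword1 = rp->dword[1];
	),

	TP_printk("%s: rp=%px ptr=0x%llx dword0=0x%x dword1=0x%x",
		  __get_str(name), __entry->rp,
		  le64_to_cpu(__entry->ptr),
		  le32_to_cpu(__entry->dword0),
		  le32_to_cpu(__entry->dword1))
);

DEFINE_EVENT(mhi_process_event_ring, mhi_ctrl_event,
	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_ring_element *rp),
	TP_ARGS(mhi_cntrl, rp)
);

DEFINE_EVENT(mhi_process_event_ring, mhi_data_event,
	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_ring_element *rp),
	TP_ARGS(mhi_cntrl, rp)
);

The call sites then shrink to trace_mhi_ctrl_event(mhi_cntrl, local_rp);
and trace_mhi_data_event(mhi_cntrl, local_rp);, and eprobes/BPF can still
reach the full structures through the pointers.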

-- Steve



Re: [PATCH v7 3/4] remoteproc: zynqmp: add pm domains support

2023-12-06 Thread Mathieu Poirier
On Fri, 1 Dec 2023 at 11:10, Tanmay Shah  wrote:
>
>
> On 11/29/23 11:10 AM, Mathieu Poirier wrote:
> > On Mon, Nov 27, 2023 at 10:33:05AM -0600, Tanmay Shah wrote:
> > >
> > > On 11/23/23 12:11 PM, Mathieu Poirier wrote:
> > > > On Wed, Nov 22, 2023 at 03:00:36PM -0600, Tanmay Shah wrote:
> > > > > Hi Mathieu,
> > > > >
> > > > > Please find my comments below.
> > > > >
> > > > > On 11/21/23 4:59 PM, Mathieu Poirier wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 09:42:37AM -0800, Tanmay Shah wrote:
> > > > > > > Use TCM pm domains extracted from device-tree
> > > > > > > to power on/off TCM using general pm domain framework.
> > > > > > >
> > > > > > > Signed-off-by: Tanmay Shah 
> > > > > > > ---
> > > > > > >
> > > > > > > Changes in v7:
> > > > > > >   - %s/pm_dev1/pm_dev_core0/r
> > > > > > >   - %s/pm_dev_link1/pm_dev_core0_link/r
> > > > > > >   - %s/pm_dev2/pm_dev_core1/r
> > > > > > >   - %s/pm_dev_link2/pm_dev_core1_link/r
> > > > > > >   - remove pm_domain_id check to move next patch
> > > > > > >   - add comment about how 1st entry in pm domain list is used
> > > > > > >   - fix loop when jump to fail_add_pm_domains loop
> > > > > > >
> > > > > > >  drivers/remoteproc/xlnx_r5_remoteproc.c | 215 
> > > > > > > +++-
> > > > > > >  1 file changed, 212 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > > > > > > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > index 4395edea9a64..22bccc5075a0 100644
> > > > > > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > > > > @@ -16,6 +16,7 @@
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > >  #include 
> > > > > > > +#include 
> > > > > > >
> > > > > > >  #include "remoteproc_internal.h"
> > > > > > >
> > > > > > > @@ -102,6 +103,12 @@ static const struct mem_bank_data 
> > > > > > > zynqmp_tcm_banks_lockstep[] = {
> > > > > > >   * @rproc: rproc handle
> > > > > > >   * @pm_domain_id: RPU CPU power domain id
> > > > > > >   * @ipi: pointer to mailbox information
> > > > > > > + * @num_pm_dev: number of tcm pm domain devices for this core
> > > > > > > + * @pm_dev_core0: pm domain virtual devices for power domain 
> > > > > > > framework
> > > > > > > + * @pm_dev_core0_link: pm domain device links after registration
> > > > > > > + * @pm_dev_core1: used only in lockstep mode. second core's pm 
> > > > > > > domain virtual devices
> > > > > > > + * @pm_dev_core1_link: used only in lockstep mode. second core's 
> > > > > > > pm device links after
> > > > > > > + * registration
> > > > > > >   */
> > > > > > >  struct zynqmp_r5_core {
> > > > > > > struct device *dev;
> > > > > > > @@ -111,6 +118,11 @@ struct zynqmp_r5_core {
> > > > > > > struct rproc *rproc;
> > > > > > > u32 pm_domain_id;
> > > > > > > struct mbox_info *ipi;
> > > > > > > +   int num_pm_dev;
> > > > > > > +   struct device **pm_dev_core0;
> > > > > > > +   struct device_link **pm_dev_core0_link;
> > > > > > > +   struct device **pm_dev_core1;
> > > > > > > +   struct device_link **pm_dev_core1_link;
> > > > > > >  };
> > > > > > >
> > > > > > >  /**
> > > > > > > @@ -651,7 +663,8 @@ static int 
> > > > > > > add_tcm_carveout_lockstep_mode(struct rproc *rproc)
> > > > > > >  
> > > > > > > ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > > > > > >  
> > > > > > > ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > > > > > > if (ret < 0) {
> > > > > > > -   dev_err(dev, "failed to turn on TCM 0x%x", 
> > > > > > > pm_domain_id);
> > > > > > > +   dev_err(dev, "failed to turn on TCM 0x%x",
> > > > > > > +   pm_domain_id);
> > > > > >
> > > > > > Spurious change, you should have caught that.
> > > > >
> > > > > Ack, need to observe changes more closely before sending them.
> > > > >
> > > > > >
> > > > > > > goto release_tcm_lockstep;
> > > > > > > }
> > > > > > >
> > > > > > > @@ -758,6 +771,189 @@ static int zynqmp_r5_parse_fw(struct rproc 
> > > > > > > *rproc, const struct firmware *fw)
> > > > > > > return ret;
> > > > > > >  }
> > > > > > >
> > > > > > > +static void zynqmp_r5_remove_pm_domains(struct rproc *rproc)
> > > > > > > +{
> > > > > > > +   struct zynqmp_r5_core *r5_core = rproc->priv;
> > > > > > > +   struct device *dev = r5_core->dev;
> > > > > > > +   struct zynqmp_r5_cluster *cluster;
> > > > > > > +   int i;
> > > > > > > +
> > > > > > > +   cluster = 
> > > > > > > platform_get_drvdata(to_platform_device(dev->parent));
> > > > > > > +
> > > > > > > +   for (i = 1; i < r5_core->num_pm_dev; i++) {
> > > > > > > +   device_link_del(r5_core->pm_dev_core0_link[i]);
> > > > > > > +   dev_pm_domain_detach(r5_core->pm_dev_core0[i], false);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   

[PATCH v7] bus: mhi: host: Add tracing support

2023-12-06 Thread Krishna chaitanya chundru
This change adds ftrace support for the following functions, which
helps in debugging issues when there is a channel state or MHI
state change, and also when data and control events are received:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever trace events are added, the corresponding debug messages are removed.

Signed-off-by: Krishna chaitanya chundru 
---
Changes in v7:
- change log format as pointed out by Mani.
- Link to v6: 
https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com

Changes in v6:
- use 'rp' directly as suggested by jeffrey.
- Link to v5: 
https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com

Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve.
- Instead of converting to u64 to print the address, use %px to print the address to avoid warnings on some platforms.
- Link to v4: 
https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com

Changes in v4:
- Fix compilation issues in the previous patch which happened due to rebasing.
- In the defconfig the FTRACE config is not enabled, so the compilation issue was not seen in my workspace.
- Link to v3: 
https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com

Changes in v3:
- Move the trace header file from include/trace/events to drivers/bus/mhi/host/ so that we can include driver header files.
- Use macros directly in the trace events as suggested by Jeffrey Hugo.
- Reorder the structure in the events as suggested by Steve to avoid holes in the buffer.
- Remove the mhi_to_physical function as it can give security issues.
- Remove macros to define strings as we can get those from driver headers.
- Link to v2: 
https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com

Changes in v2:
- Pass the raw state into the trace event and use __print_symbolic() as suggested by Bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn.
- Fix the kernel test robot issues.
- Link to v1: 
https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com
---
 drivers/bus/mhi/host/init.c  |   3 +
 drivers/bus/mhi/host/main.c  |  24 ++---
 drivers/bus/mhi/host/pm.c|   7 +-
 drivers/bus/mhi/host/trace.h | 208 +++
 4 files changed, 229 insertions(+), 13 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 static DEFINE_IDA(mhi_controller_ida);
 
 const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..507cf3351a97 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include "internal.h"
+#include "trace.h"
 
 int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
  void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,9 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
 
state = mhi_get_mhi_state(mhi_cntrl);
ee = mhi_get_exec_env(mhi_cntrl);
-   dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-   TO_MHI_EXEC_STR(mhi_cntrl->ee),
-   mhi_state_str(mhi_cntrl->dev_state),
-   TO_MHI_EXEC_STR(ee), mhi_state_str(state));
 
+   trace_mhi_intvec_states(mhi_cntrl->mhi_dev->name, mhi_cntrl->ee,
+   mhi_cntrl->dev_state, ee, state);
if (state == MHI_STATE_SYS_ERR) {
dev_dbg(dev, "System error detected\n");
pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +831,10 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller *mhi_cntrl,
while (dev_rp != local_rp) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
 
+   trace_mhi_ctrl_event(mhi_cntrl->mhi_dev->name, local_rp,
+local_rp->ptr, local_rp->dword[0],
+local_rp->dword[1]);
+
switch (type) {
case MHI_PKT_TYPE_BW_REQ_EVENT:
{
@@ -997,6 +1000,9 @@ int mhi_process_data_event_ring(struct mhi_controller *mhi_cntrl,
while (dev_rp != local_rp && event_quota > 0) {
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
 
+   trace_mhi_data_event(mhi_cntrl->mhi_dev->name, local_rp, local_rp->ptr,
+local_rp->dword[0], local_rp->dword[1]);
+
chan = MHI_TRE_GET_EV_CHID(local_rp);
 
WARN_ON(chan >= mhi_cntrl->max_chan);
@@ -1235,6 

Re: [PATCH] kmod: Add FIPS 202 SHA-3 support

2023-12-06 Thread Lucas De Marchi

On Sun, Oct 22, 2023 at 07:09:28PM +0100, Dimitri John Ledkov wrote:

Add support for parsing FIPS 202 SHA-3 signature hashes. Separately,
it is not clear why explicit hashes are re-encoded here, instead of
trying to generically show any digest openssl supports.

Signed-off-by: Dimitri John Ledkov 
---
libkmod/libkmod-signature.c | 12 
1 file changed, 12 insertions(+)

diff --git a/libkmod/libkmod-signature.c b/libkmod/libkmod-signature.c
index b749a818f9..a39059cd7c 100644
--- a/libkmod/libkmod-signature.c
+++ b/libkmod/libkmod-signature.c
@@ -57,6 +57,9 @@ enum pkey_hash_algo {
PKEY_HASH_SHA512,
PKEY_HASH_SHA224,
PKEY_HASH_SM3,
+   PKEY_HASH_SHA3_256,
+   PKEY_HASH_SHA3_384,
+   PKEY_HASH_SHA3_512,
PKEY_HASH__LAST
};

@@ -70,6 +73,9 @@ const char *const pkey_hash_algo[PKEY_HASH__LAST] = {
[PKEY_HASH_SHA512]  = "sha512",
[PKEY_HASH_SHA224]  = "sha224",
[PKEY_HASH_SM3] = "sm3",
+   [PKEY_HASH_SHA3_256]= "sha3-256",
+   [PKEY_HASH_SHA3_384]= "sha3-384",
+   [PKEY_HASH_SHA3_512]= "sha3-512",
};

enum pkey_id_type {
@@ -167,6 +173,12 @@ static int obj_to_hash_algo(const ASN1_OBJECT *o)
case NID_sm3:
return PKEY_HASH_SM3;
# endif
+   case NID_sha3_256:
+   return PKEY_HASH_SHA3_256;
+   case NID_sha3_384:
+   return PKEY_HASH_SHA3_384;
+   case NID_sha3_512:
+   return PKEY_HASH_SHA3_512;



With your other patch, "libkmod: remove pkcs7 obj_to_hash_algo()", this
hunk is not needed anymore. Do you want to send a new version of this
patch?

thanks
Lucas De Marchi


default:
return -1;
}
--
2.34.1
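
On the commit message's note about showing any digest generically, a sketch of what that could look like with OpenSSL's object helpers (illustrative only, not part of this patch):

#include <openssl/objects.h>

/*
 * Instead of maintaining a fixed table of known hash OIDs, ask OpenSSL for
 * the name of whatever digest OID is present in the PKCS#7 structure.
 * Long names are typically the lower-case spellings ("sha256", "sha3-256").
 */
static const char *obj_to_hash_name(const ASN1_OBJECT *o)
{
	int nid = OBJ_obj2nid(o);

	if (nid == NID_undef)
		return NULL;

	return OBJ_nid2ln(nid);
}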






[PATCH] ring-buffer: Force absolute timestamp on discard of event

2023-12-06 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's a race where if an event is discarded from the ring buffer and an
interrupt were to happen at that time and insert an event, the time stamp
is still used from the discarded event as an offset. This can screw up the
timings.

If the event is going to be discarded, set the "before_stamp" to zero.
When a new event comes in, it compares the "before_stamp" with the
"write_stamp" and if they are not equal, it will insert an absolute
timestamp. This will prevent the timings from getting out of sync due to
the discarded event.

Cc: sta...@vger.kernel.org
Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..a6da2d765c78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3030,22 +3030,19 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);
 
+   /*
+* For the before_stamp to be different than the write_stamp
+* to make sure that the next event adds an absolute
+* value and does not rely on the saved write stamp, which
+* is now going to be bogus.
+*/
+   rb_time_set(&cpu_buffer->before_stamp, 0);
+
/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
   write_stamp, write_stamp - delta))
return false;
 
-   /*
-* It's possible that the event time delta is zero
-* (has the same time stamp as the previous event)
-* in which case write_stamp and before_stamp could
-* be the same. In such a case, force before_stamp
-* to be different than write_stamp. It doesn't
-* matter what it is, as long as its different.
-*/
-   if (!delta)
-   rb_time_set(&cpu_buffer->before_stamp, 0);
-
/*
 * If an event were to come in now, it would see that the
 * write_stamp and the before_stamp are different, and assume
-- 
2.42.0




[PATCH] ring-buffer: Test last update in 32bit version of __rb_time_read()

2023-12-06 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Since 64 bit cmpxchg() is very expensive on 32bit architectures, the
timestamp used by the ring buffer does some interesting tricks to be able
to still have an atomic 64 bit number. It originally just used 60 bits and
broke it up into two 32 bit words where the extra 2 bits were used for
synchronization. But this was not enough for all use cases, and all 64
bits were required.

The 32bit version of the ring buffer timestamp was then broken up into 3
32bit words using the same counter trick. But one update was not done. The
check to see if the read operation was done without interruption only
checked the first two words and not the last one (like it had before this
update). Fix it by making sure all three updates happen without
interruption by comparing the initial counter with the last updated
counter.
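
A self-contained sketch of the counter trick described above; the bit layout is invented for illustration and is not the kernel's rb_time_t format:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A 64-bit value stored as three 32-bit words, each carrying up to 30 value
 * bits plus a 2-bit update counter in its low bits. A torn read is detected
 * by requiring all three words to carry the same counter -- the bug fixed
 * above was checking only two of the three.
 */
struct split64 {
	_Atomic uint32_t bottom;	/* value bits  0..29 */
	_Atomic uint32_t top;		/* value bits 30..59 */
	_Atomic uint32_t msb;		/* value bits 60..63 */
};

#define CNT_MASK 0x3u

static void split64_write(struct split64 *t, uint64_t val, uint32_t cnt)
{
	cnt &= CNT_MASK;
	atomic_store(&t->bottom, (uint32_t)((val & 0x3fffffffu) << 2) | cnt);
	atomic_store(&t->top, (uint32_t)(((val >> 30) & 0x3fffffffu) << 2) | cnt);
	atomic_store(&t->msb, (uint32_t)(((val >> 60) & 0xfu) << 2) | cnt);
}

static bool split64_read(struct split64 *t, uint64_t *ret)
{
	uint32_t bottom = atomic_load(&t->bottom);
	uint32_t top = atomic_load(&t->top);
	uint32_t msb = atomic_load(&t->msb);
	uint32_t cnt = bottom & CNT_MASK;

	/* All three counters must match, or a writer interrupted the read. */
	if ((top & CNT_MASK) != cnt || (msb & CNT_MASK) != cnt)
		return false;

	*ret = ((uint64_t)(bottom >> 2)) |
	       ((uint64_t)(top >> 2) << 30) |
	       ((uint64_t)(msb >> 2) << 60);
	return true;
}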

Cc: sta...@vger.kernel.org
Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a6da2d765c78..8d2a4f00eca9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 
*cnt = rb_time_cnt(top);
 
-   /* If top and bottom counts don't match, this interrupted a write */
-   if (*cnt != rb_time_cnt(bottom))
+   /* If top and msb counts don't match, this interrupted a write */
+   if (*cnt != rb_time_cnt(msb))
return false;
 
/* The shift to msb will lose its cnt bits */
-- 
2.42.0




Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

2023-12-06 Thread Julian Anastasov


Hello,

On Mon, 4 Dec 2023, Lev Pantiukhin wrote:

> +#define IP_VS_SVC_F_STATELESS 0x0040  /* stateless scheduling */

I have another idea for the traffic that does not
need per-client state. We need some per-dest cp to forward the packet.
If we replace the cp->caddr usage with iph->saddr/daddr usage we can try 
it. cp->caddr is used at the following places:

- tcp_snat_handler (iph->daddr), tcp_dnat_handler (iph->saddr): iph is 
already provided. tcp_snat_handler requires IP_VS_SVC_F_STATELESS
to be set for a service with present vaddr, i.e. non-fwmark based.
So, NAT+svc->fwmark is another restriction for IP_VS_SVC_F_STATELESS
because we do not know what VIP to use as saddr for outgoing traffic.

- ip_vs_nfct_expect_related
- we should investigate any problems when IP_VS_CONN_F_NFCT
is set; probably we cannot work with NFCT?

- ip_vs_conn_drop_conntrack

- FTP:
- sets IP_VS_CONN_F_NFCT, uses cp->app

Maybe IP_VS_CONN_F_NFCT should be a restriction for
IP_VS_SVC_F_STATELESS mode? cp->app for sure, because we keep TCP
seq/ack state for the app in cp->in_seq/out_seq.

We can keep some dest->cp_route (or another name) that will
hold our cp for such connections. The idea is to not allocate a cp for
every packet but to reuse this saved cp. It has all the info needed to
forward the skb to the real server. The first packet will create it and save
it with some locking into the dest, and subsequent packets will reuse it.

Probably it should be a ONE_PACKET entry (not hashed in the table), but
it can have a running timer if needed. One refcnt for attaching to the dest,
and a new temporary refcnt for every packet. But in this mode
__ip_vs_conn_put_timer uses a 0-second timer, so we have to handle that
somehow. It should be released when the dest is removed, and on edit_dest
if needed.
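
A sketch of that create-once, reuse-per-packet idea; every name below is hypothetical and only illustrates the locking and refcounting pattern, it is not IPVS code:

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct fwd_state {
	struct kref ref;		/* one ref held by the dest, one per in-flight packet */
	/* ... forwarding info needed to send the skb to the real server ... */
};

struct dest_obj {
	spinlock_t lock;		/* simplified: ignores bh/irq context details */
	struct fwd_state *cached;	/* created by the first packet, then reused */
};

static void fwd_state_release(struct kref *ref)
{
	kfree(container_of(ref, struct fwd_state, ref));
}

/* Per packet: reuse the cached state, or create it on first use. The caller
 * drops its temporary reference with kref_put() once the skb is forwarded. */
static struct fwd_state *dest_get_fwd_state(struct dest_obj *dest)
{
	struct fwd_state *st;

	spin_lock(&dest->lock);
	st = dest->cached;
	if (!st) {
		st = kzalloc(sizeof(*st), GFP_ATOMIC);
		if (st) {
			kref_init(&st->ref);	/* the dest's own reference */
			dest->cached = st;
		}
	}
	if (st)
		kref_get(&st->ref);		/* temporary per-packet reference */
	spin_unlock(&dest->lock);

	return st;
}

/* When the dest is removed or edited: drop the dest's reference. */
static void dest_drop_fwd_state(struct dest_obj *dest)
{
	struct fwd_state *st;

	spin_lock(&dest->lock);
	st = dest->cached;
	dest->cached = NULL;
	spin_unlock(&dest->lock);

	if (st)
		kref_put(&st->ref, fwd_state_release);
}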

There are other problems to solve, such as set_tcp_state()
changing dest->activeconns and dest->inactconns. They are also used
in ip_vs_bind_dest() and ip_vs_unbind_dest(). As we do not keep the previous
connection state, and as a conn can start in the established state, we should
avoid touching these counters. For UDP, ONE_PACKET has no such problem
with states, but for TCP/SCTP we should take care.

Regards

--
Julian Anastasov 




Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

2023-12-06 Thread kernel test robot
Hi Lev,

kernel test robot noticed the following build warnings:

[auto build test WARNING on horms-ipvs-next/master]
[also build test WARNING on horms-ipvs/master linus/master v6.7-rc4 
next-20231206]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Lev-Pantiukhin/ipvs-add-a-stateless-type-of-service-and-a-stateless-Maglev-hashing-scheduler/20231204-232344
base:   https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git 
master
patch link:
https://lore.kernel.org/r/20231204152020.472247-1-kndrvt%40yandex-team.ru
patch subject: [PATCH] ipvs: add a stateless type of service and a stateless 
Maglev hashing scheduler
config: microblaze-randconfig-r123-20231206 
(https://download.01.org/0day-ci/archive/20231206/202312061823.o99vpf15-...@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 13.2.0
reproduce: 
(https://download.01.org/0day-ci/archive/20231206/202312061823.o99vpf15-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312061823.o99vpf15-...@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/netfilter/ipvs/ip_vs_ctl.c:972:42: sparse: sparse: restricted __be32 
>> degrades to integer
   net/netfilter/ipvs/ip_vs_ctl.c:972:60: sparse: sparse: restricted __be32 
degrades to integer
   net/netfilter/ipvs/ip_vs_ctl.c:976:48: sparse: sparse: restricted __be32 
degrades to integer
   net/netfilter/ipvs/ip_vs_ctl.c:976:70: sparse: sparse: restricted __be32 
degrades to integer
>> net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: sparse: incorrect type in 
>> assignment (different base types) @@ expected restricted __be32 
>> [usertype] diff @@ got unsigned int @@
   net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: expected restricted 
__be32 [usertype] diff
   net/netfilter/ipvs/ip_vs_ctl.c:976:30: sparse: got unsigned int
>> net/netfilter/ipvs/ip_vs_ctl.c:978:41: sparse: sparse: cast from restricted 
>> __be32
   net/netfilter/ipvs/ip_vs_ctl.c:1532:27: sparse: sparse: dereference of 
noderef expression

vim +972 net/netfilter/ipvs/ip_vs_ctl.c

   962  
   963  static int __ip_vs_mh_compare_dests(struct list_head *a, struct 
list_head *b)
   964  {
   965  struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, 
n_list);
   966  struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, 
n_list);
   967  unsigned int i = 0;
   968  __be32 diff;
   969  
   970  switch (dest_a->af) {
   971  case AF_INET:
 > 972  return (int)(dest_a->addr.ip - dest_b->addr.ip);
   973  
   974  case AF_INET6:
   975  for (; i < ARRAY_SIZE(dest_a->addr.ip6); i++) {
  > 976  diff = dest_a->addr.ip6[i] - dest_b->addr.ip6[i];
   977  if (diff)
 > 978  return (int)diff;
   979  }
   980  }
   981  
   982  return 0;
   983  }
   984  
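
One way to avoid the __be32 arithmetic that sparse complains about is to compare the addresses as raw memory; a sketch (not part of the submitted patch):

#include <linux/string.h>
#include <net/ip_vs.h>

/* Comparing the addresses bytewise keeps a consistent ordering for the
 * list sort without ever treating __be32 as an ordinary integer. */
static int __ip_vs_mh_compare_dests(struct list_head *a, struct list_head *b)
{
	struct ip_vs_dest *dest_a = list_entry(a, struct ip_vs_dest, n_list);
	struct ip_vs_dest *dest_b = list_entry(b, struct ip_vs_dest, n_list);

	switch (dest_a->af) {
	case AF_INET:
		return memcmp(&dest_a->addr.ip, &dest_b->addr.ip,
			      sizeof(dest_a->addr.ip));
	case AF_INET6:
		return memcmp(dest_a->addr.ip6, dest_b->addr.ip6,
			      sizeof(dest_a->addr.ip6));
	}

	return 0;
}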

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH 12/27] tty: hvc: convert to u8 and size_t

2023-12-06 Thread Christophe Leroy


Le 06/12/2023 à 08:36, Jiri Slaby (SUSE) a écrit :
> Switch character types to u8 and sizes to size_t. To conform to
> characters/sizes in the rest of the tty layer.
> 
> Signed-off-by: Jiri Slaby (SUSE) 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: Amit Shah 
> Cc: Arnd Bergmann 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Albert Ou 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: virtualizat...@lists.linux.dev
> Cc: linux-ri...@lists.infradead.org
> ---
>   arch/powerpc/include/asm/hvconsole.h   |  4 ++--
>   arch/powerpc/include/asm/hvsi.h| 18 
>   arch/powerpc/include/asm/opal.h|  8 +---
>   arch/powerpc/platforms/powernv/opal.c  | 14 +++--
>   arch/powerpc/platforms/pseries/hvconsole.c |  4 ++--
>   drivers/char/virtio_console.c  | 10 -
>   drivers/tty/hvc/hvc_console.h  |  4 ++--
>   drivers/tty/hvc/hvc_dcc.c  | 24 +++---
>   drivers/tty/hvc/hvc_iucv.c | 18 
>   drivers/tty/hvc/hvc_opal.c |  5 +++--
>   drivers/tty/hvc/hvc_riscv_sbi.c|  9 
>   drivers/tty/hvc/hvc_rtas.c | 11 +-
>   drivers/tty/hvc/hvc_udbg.c |  9 
>   drivers/tty/hvc/hvc_vio.c  | 18 
>   drivers/tty/hvc/hvc_xen.c  | 23 +++--
>   drivers/tty/hvc/hvsi_lib.c | 20 ++
>   16 files changed, 107 insertions(+), 92 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hvconsole.h 
> b/arch/powerpc/include/asm/hvconsole.h
> index ccb2034506f0..d841a97010a0 100644
> --- a/arch/powerpc/include/asm/hvconsole.h
> +++ b/arch/powerpc/include/asm/hvconsole.h
> @@ -21,8 +21,8 @@
>* Vio firmware always attempts to fetch MAX_VIO_GET_CHARS chars.  The 
> 'count'
>* parm is included to conform to put_chars() function pointer template
>*/
> -extern int hvc_get_chars(uint32_t vtermno, char *buf, int count);
> -extern int hvc_put_chars(uint32_t vtermno, const char *buf, int count);
> +extern ssize_t hvc_get_chars(uint32_t vtermno, u8 *buf, size_t count);
> +extern ssize_t hvc_put_chars(uint32_t vtermno, const u8 *buf, size_t count);

Would be a good opportunity to drop this pointless deprecated 'extern' 
keyword on all function prototypes you are changing.
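
For reference, the two prototypes quoted above with the extern dropped would read:

ssize_t hvc_get_chars(uint32_t vtermno, u8 *buf, size_t count);
ssize_t hvc_put_chars(uint32_t vtermno, const u8 *buf, size_t count);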

Christophe