Re: [PATCH v4] virtio_pmem: support feature SHMEM_REGION
On Thu, Dec 21, 2023 at 4:49 AM Changyuan Lyu wrote:
>
> Thanks Michael for the feedback!
>
> On Tue, Dec 19, 2023 at 11:44 PM Michael S. Tsirkin wrote:
> >
> > On Tue, Dec 19, 2023 at 11:32:27PM -0800, Changyuan Lyu wrote:
> > > +       if (!have_shm) {
> > > +               dev_err(&vdev->dev, "failed to get shared memory region %d\n",
> > > +                       VIRTIO_PMEM_SHMEM_REGION_ID);
> > > +               err = -ENXIO;
> > > +               goto out_vq;
> > > +       }
> >
> > Maybe additionally, add a validate callback and clear
> > VIRTIO_PMEM_F_SHMEM_REGION if VIRTIO_PMEM_SHMEM_REGION_ID is not there.
>
> Done.
>
> > > +/* Feature bits */
> > > +#define VIRTIO_PMEM_F_SHMEM_REGION 0 /* guest physical address range will be
> > > +                                      * indicated as shared memory region 0
> > > +                                      */
> >
> > Either make this comment shorter to fit in one line, or put the
> > multi-line comment before the define.
>
> Done.
>
> ---8<---
>
> This patch adds the support for feature VIRTIO_PMEM_F_SHMEM_REGION
> (virtio spec v1.2 section 5.19.5.2 [1]).
>
> During feature negotiation, if VIRTIO_PMEM_F_SHMEM_REGION is offered
> by the device, the driver looks for a shared memory region of id 0.
> If it is found, this feature is understood. Otherwise, this feature
> bit is cleared.
>
> During probe, if VIRTIO_PMEM_F_SHMEM_REGION has been negotiated,
> virtio pmem ignores the `start` and `size` fields in device config
> and uses the physical address range of shared memory region 0.
>
> [1] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-6480002
>
> Signed-off-by: Changyuan Lyu

Acked-by: Jason Wang

Thanks

> ---
> v4:
>  * added virtio_pmem_validate callback.
> v3:
>  * updated the patch description.
> V2:
>  * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
>  * fixed the error handling when region 0 does not exist
> ---
>  drivers/nvdimm/virtio_pmem.c     | 36
>  include/uapi/linux/virtio_pmem.h |  7 +++
>  2 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..4ceced5cefcf 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -29,12 +29,27 @@ static int init_vq(struct virtio_pmem *vpmem)
>         return 0;
>  };
>
> +static int virtio_pmem_validate(struct virtio_device *vdev)
> +{
> +       struct virtio_shm_region shm_reg;
> +
> +       if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION) &&
> +               !virtio_get_shm_region(vdev, &shm_reg, (u8)VIRTIO_PMEM_SHMEM_REGION_ID)
> +       ) {
> +               dev_notice(&vdev->dev, "failed to get shared memory region %d\n",
> +                          VIRTIO_PMEM_SHMEM_REGION_ID);
> +               __virtio_clear_bit(vdev, VIRTIO_PMEM_F_SHMEM_REGION);
> +       }
> +       return 0;
> +}
> +
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
>         struct nd_region_desc ndr_desc = {};
>         struct nd_region *nd_region;
>         struct virtio_pmem *vpmem;
>         struct resource res;
> +       struct virtio_shm_region shm_reg;
>         int err = 0;
>
>         if (!vdev->config->get) {
> @@ -57,10 +72,16 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>                 goto out_err;
>         }
>
> -       virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -                       start, &vpmem->start);
> -       virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -                       size, &vpmem->size);
> +       if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> +               virtio_get_shm_region(vdev, &shm_reg, (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
> +               vpmem->start = shm_reg.addr;
> +               vpmem->size = shm_reg.len;
> +       } else {
> +               virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +                               start, &vpmem->start);
> +               virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +                               size, &vpmem->size);
> +       }
>
>         res.start = vpmem->start;
>         res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,10 +143,17 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
>         virtio_reset_device(vdev);
>  }
>
> +static unsigned int features[] = {
> +       VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> +       .feature_table = features,
> +       .feature_table_size = ARRAY_SIZE(features),
>         .driver.name = KBUILD_MODNAME,
>         .driver.owner = THIS_MODULE,
>         .id_table = id_table,
>         .validate = virtio_pmem_validate,
>         .probe = virtio_pmem_probe,
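The negotiation logic discussed above — accept the feature only when shared memory region 0 is actually present, otherwise clear the bit in the validate callback instead of failing probe — can be sketched as a standalone userspace model. This is not the kernel driver: `struct toy_vdev`, `toy_validate`, and `have_shm_region0` are invented names standing in for the virtio device, the `.validate` callback, and the result of `virtio_get_shm_region()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PMEM_F_SHMEM_REGION 0

/* Toy device model: the offered feature bits, plus a flag for whether
 * shared memory region 0 exists on this device. */
struct toy_vdev {
	uint64_t features;
	bool have_shm_region0;
};

/* Mirrors the validate-callback idea: if the feature is offered but
 * region 0 cannot be found, clear the bit rather than failing probe. */
static void toy_validate(struct toy_vdev *vdev)
{
	if ((vdev->features & (1ULL << VIRTIO_PMEM_F_SHMEM_REGION)) &&
	    !vdev->have_shm_region0)
		vdev->features &= ~(1ULL << VIRTIO_PMEM_F_SHMEM_REGION);
}
```

After validation, the probe path simply branches on the surviving bit, exactly as the patched `virtio_pmem_probe()` does.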
Re: [PATCH V1] net: qrtr: ns: Ignore ENODEV failures in ns
On 12/23/2023 5:56 AM, Simon Horman wrote:
> [Dropped bjorn.anders...@kernel.org, as the correct address seems to be
> anders...@kernel.org, which is already in the CC list. kernel.org
> rejected sending this email without that update.]
>
> On Thu, Dec 21, 2023 at 03:36:50PM +0530, Sarannya S wrote:
> > From: Chris Lew
> >
> > Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
> > indicate that either the local port has been closed or the remote has
> > gone down. Neither of these scenarios is fatal, and both will
> > eventually be handled through packets that are later queued on the
> > control port.
> >
> > Signed-off-by: Chris Lew
> > Signed-off-by: Sarannya Sasikumar
> > ---
> >  net/qrtr/ns.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
> > index abb0c70..8234339 100644
> > --- a/net/qrtr/ns.c
> > +++ b/net/qrtr/ns.c
> > @@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr *dest,
> >         msg.msg_namelen = sizeof(*dest);
> >
> >         ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));
> > -       if (ret < 0)
> > +       if (ret < 0 && ret != -ENODEV)
> >                 pr_err("failed to announce del service\n");
> >
> >         return ret;
>
> Hi,
>
> The caller of service_announce_del() ignores its return value, so the
> only action on error is the pr_err() call above, and with this patch
> -ENODEV is indeed ignored.
>
> However, I wonder if it would make things clearer to the reader (me?)
> if the return type of service_announce_del() was updated to void.
> Because as things stand -ENODEV may be returned, which implies that
> something might handle it, even though nothing does.
>
> The above notwithstanding, this change looks good to me.
>
> Reviewed-by: Simon Horman
>
> ...

Hi Simon, thanks for the review and suggestion. We weren't sure whether
we should change the function prototype in these patches, on the chance
that something will listen for and handle this value in the future. I
think it's a good idea to change it to void, and we can change it back
if such a use case appears later.
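The filtering pattern under discussion — log every send failure except the benign -ENODEV — is easy to isolate. A minimal standalone sketch (the helper name `should_log_send_error` is invented; the kernel code inlines this condition directly):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Returns true when a kernel_sendmsg()-style result should be logged:
 * any negative error except -ENODEV, which only means the local port
 * was closed or the remote went down -- both recoverable later via
 * packets queued on the control port. */
static bool should_log_send_error(int ret)
{
	return ret < 0 && ret != -ENODEV;
}
```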
Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100
On Tue, 26 Dec 2023 12:59:02 -0500
Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The tracefs file "buffer_percent" is to allow user space to set a
> water-mark on how much of the tracing ring buffer needs to be filled in
> order to wake up a blocked reader.
>
>  0   - is to wait until any data is in the buffer
>  1   - is to wait for 1% of the sub buffers to be filled
>  50  - would be half of the sub buffers are filled with data
>  100 - is not to wake the waiter until the ring buffer is completely full
>
> Unfortunately the test for being full was:
>
>	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
>	return (dirty * 100) > (full * nr_pages);
>
> Where "full" is the value for "buffer_percent".
>
> There are two issues with the above when full == 100.
>
> 1. dirty * 100 > 100 * nr_pages will never be true
>    That is, the above is basically saying that if the user sets
>    buffer_percent to 100, more pages need to be dirty than exist in the
>    ring buffer!
>
> 2. The page that the writer is on is never considered dirty, as dirty
>    pages are only those that are full. When the writer goes to a new
>    sub-buffer, it clears the contents of that sub-buffer.
>
>    That is, even if the check was ">=" it would still not be equal, as
>    the most pages that can be considered "dirty" is nr_pages - 1.
>
> To fix this, add one to dirty and use ">=" in the compare.
>
> Cc: sta...@vger.kernel.org
> Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
> Signed-off-by: Steven Rostedt (Google)
> ---
>  kernel/trace/ring_buffer.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 83eab547f1d1..32c0dd2fd1c3 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
>         if (!nr_pages || !full)
>                 return true;
>
> -       dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> +       /*
> +        * Add one as dirty will never equal nr_pages, as the sub-buffer
> +        * that the writer is on is not counted as dirty.
> +        * This is needed if "buffer_percent" is set to 100.
> +        */
> +       dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;

Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty, it
is 0.5% dirty. Consider @full = 1%: then @dirty = 1 + 1 = 2, so
@dirty * 100 = 200, while @full * @nr_pages = 1 * 200 = 200. Thus the
check hits (200 >= 200 is true) even though only 0.5% of the pages are
dirty.

>
> -       return (dirty * 100) > (full * nr_pages);
> +       return (dirty * 100) >= (full * nr_pages);

Thank you,

>  }
>
>  /*
> --
> 2.42.0
>

--
Masami Hiramatsu (Google)
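Both sides of this exchange are pure arithmetic and can be checked numerically. A standalone model of the old and patched comparisons (the names `full_hit_old`/`full_hit_new` are invented; this is not the kernel function, which reads `dirty` from the live buffer) confirms that the old check can never fire at full == 100, that the patched check does, and that the patched check also fires in Masami's 200-page, full == 1% case with a single dirty page:

```c
#include <assert.h>
#include <stdbool.h>

/* Old check: strict '>' with no "+ 1".  Since dirty can be at most
 * nr_pages - 1, this can never trigger when full == 100. */
static bool full_hit_old(unsigned long nr_pages, unsigned long dirty, int full)
{
	return (dirty * 100) > ((unsigned long)full * nr_pages);
}

/* Patched check: count the writer's sub-buffer too and compare '>='. */
static bool full_hit_new(unsigned long nr_pages, unsigned long dirty, int full)
{
	return ((dirty + 1) * 100) >= ((unsigned long)full * nr_pages);
}
```

With nr_pages = 200 and full = 100, even dirty = 199 (the maximum possible) gives 19900 > 20000 for the old check, which is false; the new check gives 20000 >= 20000, which is true. With full = 1 and dirty = 1, the new check gives 200 >= 200, which is true despite only 0.5% of pages being dirty — exactly the early wake-up Masami questions.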
Re: [PATCH v2] vhost-vdpa: account iommu allocations
On Tue, 26 Dec 2023, Pasha Tatashin wrote:

> iommu allocations should be accounted in order to allow admins to
> monitor and limit the amount of iommu memory.
>
> Signed-off-by: Pasha Tatashin
> Acked-by: Michael S. Tsirkin

Acked-by: David Rientjes
[PATCH v2] vhost-vdpa: account iommu allocations
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin
Acked-by: Michael S. Tsirkin
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Changelog:
v1: This patch is spun off from the series:
    https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com
v2: - Synced with v6.7-rc7
    - Added Acked-by Michael S. Tsirkin.

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
                r = ops->set_map(vdpa, asid, iotlb);
        } else {
                r = iommu_map(v->domain, iova, pa, size,
-                             perm_to_iommu_flags(perm), GFP_KERNEL);
+                             perm_to_iommu_flags(perm),
+                             GFP_KERNEL_ACCOUNT);
        }
        if (r) {
                vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
--
2.43.0.472.g3155946c3a-goog
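The point of switching to GFP_KERNEL_ACCOUNT is that allocations carrying the accounting flag are charged to the caller's memory cgroup, which is what lets an admin monitor and cap them. A toy userspace model of that idea (everything prefixed `TOY_` is invented; real charging happens inside the kernel's memcg code, not in the allocator caller):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_GFP_KERNEL         0x1u
#define TOY_GFP_ACCOUNT        0x2u
#define TOY_GFP_KERNEL_ACCOUNT (TOY_GFP_KERNEL | TOY_GFP_ACCOUNT)

/* Stand-in for a per-cgroup charge counter. */
static size_t accounted_bytes;

/* Allocations carrying the ACCOUNT flag are charged to the counter;
 * plain TOY_GFP_KERNEL allocations are invisible to it. */
static void *toy_alloc(size_t size, unsigned int flags)
{
	if (flags & TOY_GFP_ACCOUNT)
		accounted_bytes += size;
	return malloc(size);
}
```

In the patch, the only change needed was the flag at the iommu_map() call site — the accounting machinery is already in place behind the flag.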
[PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100
From: "Steven Rostedt (Google)"

The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.

 0   - is to wait until any data is in the buffer
 1   - is to wait for 1% of the sub buffers to be filled
 50  - would be half of the sub buffers are filled with data
 100 - is not to wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
	return (dirty * 100) > (full * nr_pages);

Where "full" is the value for "buffer_percent".

There are two issues with the above when full == 100.

1. dirty * 100 > 100 * nr_pages will never be true

   That is, the above is basically saying that if the user sets
   buffer_percent to 100, more pages need to be dirty than exist in the
   ring buffer!

2. The page that the writer is on is never considered dirty, as dirty
   pages are only those that are full. When the writer goes to a new
   sub-buffer, it clears the contents of that sub-buffer.

   That is, even if the check was ">=" it would still not be equal, as
   the most pages that can be considered "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.

Cc: sta...@vger.kernel.org
Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 83eab547f1d1..32c0dd2fd1c3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
	if (!nr_pages || !full)
		return true;

-	dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+	/*
+	 * Add one as dirty will never equal nr_pages, as the sub-buffer
+	 * that the writer is on is not counted as dirty.
+	 * This is needed if "buffer_percent" is set to 100.
+	 */
+	dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;

-	return (dirty * 100) > (full * nr_pages);
+	return (dirty * 100) >= (full * nr_pages);
 }

 /*
--
2.42.0
Re: [PATCH v5 06/34] function_graph: Allow multiple users to attach to function graph
Hi,

On Wed, 20 Dec 2023 00:45:40 +0900
Masami Hiramatsu (Google) wrote:

> OK, I think we need a "rsrv_ret_stack" index. Basically the new one
> will do;
>
> (1) increment rsrv_ret_stack
> (2) write a reserve type entry
> (3) set curr_ret_stack = rsrv_ret_stack
>
> And before those,
>
> (0) if rsrv_ret_stack != curr_ret_stack, write a reserve type entry at
>     rsrv_ret_stack for the previous frame (whose offset can be read
>     from curr_ret_stack)
>
> Then it will never be broken.
> (of course when curr_ret_stack is decremented, rsrv_ret_stack is also
> decremented)

So here is an additional patch for this issue. I'll make v6 with this.

Thanks,

From 4da1ec7b679052a131ecdeebd2e1a9db767c5c24 Mon Sep 17 00:00:00 2001
From: Masami Hiramatsu
Date: Wed, 27 Dec 2023 00:09:09 +0900
Subject: [PATCH] function_graph: Improve push operation for several interrupts

Improve the push and data reserve operations on the shadow stack for
several sequential interrupts.

To push a ret_stack or data entry on the shadow stack, we need to
prepare an index (offset) entry before updating the stack pointer
(curr_ret_stack) so that an unwinder running from interrupts can find
the next return address on the shadow stack. Currently we write the
index, update curr_ret_stack, and rewrite it again. But that is not
enough for the case where two interrupts happen and the first one
breaks it. For example,

 1. write reserved index entry at ret_stack[new_index - 1] and ret addr.
 2. interrupt comes.
    2.1. push new index and ret addr on ret_stack.
    2.2. pop it. (corrupt entries on new_index - 1)
 3. return from interrupt.
 4. update curr_ret_stack = new_index
 5. interrupt comes again.
    5.1. unwind <-- may not work.

To avoid this issue, this introduces a new rsrv_ret_stack reservation
pointer and new push code (slow path) to commit the previously reserved
entry forcibly.

 0. update rsrv_ret_stack = new_index.
 1. write reserved index entry at ret_stack[new_index - 1] and ret addr.
 2. interrupt comes.
    2.0. if rsrv_ret_stack != curr_ret_stack, add a reserved index
         entry at ret_stack[rsrv_ret_stack - 1] to point to the
         previous ret_stack pointed to by ret_stack[curr_ret_stack - 1],
         and update curr_ret_stack = rsrv_ret_stack.
    2.1. push new index and ret addr on ret_stack.
    2.2. pop it. (corrupt entries on new_index - 1)
 3. return from interrupt.
 4. update curr_ret_stack = new_index
 5. interrupt comes again.
    5.1. unwind works, because curr_ret_stack points to the previously
         saved ret_stack.
    5.2. this can do push/pop operations too.
 6. return from interrupt.
 7. rewrite the reserved index entry at ret_stack[new_index] again.

This may be a bit heavier but safer.

Signed-off-by: Masami Hiramatsu (Google)
---
 include/linux/sched.h |   1 +
 kernel/trace/fgraph.c | 133 ++
 2 files changed, 97 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dab30f00211..fda551e1aade 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1387,6 +1387,7 @@ struct task_struct {
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
	/* Index of current stored address in ret_stack: */
	int curr_ret_stack;
+	int rsrv_ret_stack;
	int curr_ret_depth;

	/* Stack of return addresses for return function tracing: */
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 47f389834b50..bf7a6eebff75 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -298,31 +298,47 @@ void *fgraph_reserve_data(int idx, int size_bytes)
	unsigned long val;
	void *data;
	int curr_ret_stack = current->curr_ret_stack;
+	int rsrv_ret_stack = current->rsrv_ret_stack;
	int data_size;

	if (size_bytes > FGRAPH_MAX_DATA_SIZE)
		return NULL;

+	/*
+	 * Since this API is used after pushing ret_stack, curr_ret_stack
+	 * should be synchronized with rsrv_ret_stack.
+	 */
+	if (WARN_ON_ONCE(curr_ret_stack != rsrv_ret_stack))
+		return NULL;
+
	/* Convert to number of longs + data word */
	data_size = DIV_ROUND_UP(size_bytes, sizeof(long));

	val = get_fgraph_entry(current, curr_ret_stack - 1);
	data = &current->ret_stack[curr_ret_stack];

-	curr_ret_stack += data_size + 1;
-	if (unlikely(curr_ret_stack >= SHADOW_STACK_MAX_INDEX))
+	rsrv_ret_stack += data_size + 1;
+	if (unlikely(rsrv_ret_stack >= SHADOW_STACK_MAX_INDEX))
		return NULL;

	val = make_fgraph_data(idx, data_size, __get_index(val) + data_size + 1);

-	/* Set the last word to be reserved */
-	current->ret_stack[curr_ret_stack - 1] = val;
-
-	/* Make sure interrupts see this */
+	/* Extend the reserved-ret_stack at first */
+	current->rsrv_ret_stack = rsrv_ret_stack;
+	/* And sync
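The reserve-then-commit ordering described in the commit message can be sketched as a standalone toy model. This is not the fgraph code: `toy_push`/`toy_pop` and `struct toy_stack` are invented names, and a real shadow-stack entry also encodes offset/type words that this model omits — only the two-pointer protocol is shown.

```c
#include <assert.h>

#define STACK_MAX 32

/* Toy shadow stack: each entry holds just a return address. */
struct toy_stack {
	unsigned long ret_stack[STACK_MAX];
	int curr_ret_stack;	/* committed: safe for an unwinder to read */
	int rsrv_ret_stack;	/* reserved: extended before the write */
};

/* Push following the ordering in the commit message: extend the
 * reservation pointer first, write the entry, then publish it via
 * curr_ret_stack. */
static int toy_push(struct toy_stack *s, unsigned long ret_addr)
{
	/* Step 2.0 (slow path): a previous push was interrupted between
	 * reserve and commit; commit it before reserving again so the
	 * unwinder's view stays consistent. */
	if (s->rsrv_ret_stack != s->curr_ret_stack)
		s->curr_ret_stack = s->rsrv_ret_stack;

	if (s->rsrv_ret_stack + 1 >= STACK_MAX)
		return -1;

	s->rsrv_ret_stack++;				/* (1) reserve  */
	s->ret_stack[s->rsrv_ret_stack - 1] = ret_addr;	/* (2) write    */
	s->curr_ret_stack = s->rsrv_ret_stack;		/* (3) publish  */
	return 0;
}

/* Pop shrinks both pointers together, as the commit message notes. */
static void toy_pop(struct toy_stack *s)
{
	s->curr_ret_stack--;
	s->rsrv_ret_stack--;
}
```

An unwinder that only trusts entries below curr_ret_stack never sees the half-written frame, because the frame becomes visible only at step (3), after the write at step (2) has landed.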