Re: [PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-12 Thread Viresh Kumar
On 12-12-23, 15:27, Vincent Guittot wrote:
> Provide the scheduler with feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure(), this doesn't need to
> be filtered, as the pressure will last for dozens of ms or more.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  drivers/cpufreq/cpufreq.c | 48 +++
>  include/linux/cpufreq.h   | 10 
>  2 files changed, 58 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, 
> unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>  
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @cpus: The related CPUs for which max capacity has been reduced
> + * @capped_freq : The maximum allowed frequency that CPUs can run at
> + *
> + * Update the value of cpufreq pressure for all @cpus in the mask. The
> + * cpumask should include all (online+offline) affected CPUs, to avoid
> + * operating on stale data when hot-plug is used for some CPUs. The
> + * @capped_freq reflects the currently allowed max CPUs frequency due to
> + * freq_qos capping. It might be also a boost frequency value, which is 
> bigger
> + * than the internal 'capacity_freq_ref' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no capping. The @capped_freq must be provided in kHz.
> + */
> +static void cpufreq_update_pressure(const struct cpumask *cpus,

Since this is defined as 'static', why not just pass policy here?

> +   unsigned long capped_freq)
> +{
> + unsigned long max_capacity, capacity, pressure;
> + u32 max_freq;
> + int cpu;
> +
> + cpu = cpumask_first(cpus);
> + max_capacity = arch_scale_cpu_capacity(cpu);

This anyway expects all of them to be from the same policy ..

> + max_freq = arch_scale_freq_ref(cpu);
> +
> + /*
> +  * Handle properly the boost frequencies, which should simply clean
> +  * the thermal pressure value.
> +  */
> + if (max_freq <= capped_freq)
> + capacity = max_capacity;
> + else
> + capacity = mult_frac(max_capacity, capped_freq, max_freq);
> +
> + pressure = max_capacity - capacity;
> +

Extra blank line here.

> +
> + for_each_cpu(cpu, cpus)
> + WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> +}
> +
>  /**
>   * cpufreq_set_policy - Modify cpufreq policy parameters.
>   * @policy: Policy object to modify.
> @@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> *policy,
>  {
>   struct cpufreq_policy_data new_data;
>   struct cpufreq_governor *old_gov;
> + struct cpumask *cpus;
>   int ret;
>  
>   memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> *policy,
>   policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>   trace_cpu_frequency_limits(policy);
>  
> + cpus = policy->related_cpus;

You don't need the extra variable anyway, but let's just pass policy
to the routine instead.

> + cpufreq_update_pressure(cpus, policy->max);
> +
>   policy->cached_target_freq = UINT_MAX;
>  
>   pr_debug("new min and max freqs are %u - %u kHz\n",
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index afda5f24d3dd..b1d97edd3253 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct 
> cpufreq_policy *policy);
>  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
>  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
>  bool has_target_index(void);
> +
> +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> + return per_cpu(cpufreq_pressure, cpu);
> +}
>  #else
>  static inline unsigned int cpufreq_get(unsigned int cpu)
>  {
> @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
>   return false;
>  }
>  static inline void disable_cpufreq(void) { }
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> + return 0;
> +}
>  #endif
>  
>  #ifdef CONFIG_CPU_FREQ_STAT
> -- 
> 2.34.1

-- 
viresh



Re: [PATCH -next 2/2] mm: vmscan: add new event to trace shrink lru

2023-12-12 Thread Bixuan Cui




On 2023/12/13 11:03, Andrew Morton wrote:

> -TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start,
> 
> Current kernels have a call to trace_mm_vmscan_lru_shrink_inactive() in
> evict_folios(), so this renaming broke the build.

Sorry, I did not enable CONFIG_LRU_GEN when compiling and testing.
I will double check my patches.

Thanks
Bixuan Cui



Re: [PATCH 1/1] hwspinlock: qcom: fix tcsr data for ipq6018

2023-12-12 Thread Vignesh Viswanathan



On 12/12/2023 7:55 PM, Chukun Pan wrote:
> The tcsr_mutex hwlock register of the ipq6018 SoC is 0x2[1], so it
> should not use the max_register configuration of older SoCs. Using that
> configuration causes the smem probe to fail, which in turn prevents
> devices that use smem-part to parse partitions from booting.
> 
> [2.118227] qcom-smem: probe of 4aa0.memory failed with error -110
> [   22.273120] platform 79b.nand-controller: deferred probe pending
> 
> Remove 'qcom,ipq6018-tcsr-mutex' setting from of_match to fix this.

Hi Chukun,

This patch was already posted [2] and Bjorn applied the same [3].

Hi Bjorn,

This patch is missing in linux-next. Could you please help check?

Thanks,
Vignesh

[2] 
https://lore.kernel.org/all/20230905095535.1263113-3-quic_viswa...@quicinc.com/
[3] 
https://lore.kernel.org/all/169522934567.2501390.111220106132298.b4...@kernel.org/
> 
> [1] commit 72fc3d58b87b ("arm64: dts: qcom: ipq6018: Fix tcsr_mutex register 
> size")
> Fixes: 5d4753f741d8 ("hwspinlock: qcom: add support for MMIO on older SoCs")
> Signed-off-by: Chukun Pan 
> ---
>  drivers/hwspinlock/qcom_hwspinlock.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/hwspinlock/qcom_hwspinlock.c 
> b/drivers/hwspinlock/qcom_hwspinlock.c
> index a0fd67fd2934..814dfe8697bf 100644
> --- a/drivers/hwspinlock/qcom_hwspinlock.c
> +++ b/drivers/hwspinlock/qcom_hwspinlock.c
> @@ -115,7 +115,6 @@ static const struct of_device_id 
> qcom_hwspinlock_of_match[] = {
>   { .compatible = "qcom,sfpb-mutex", .data = &of_sfpb_mutex },
>   { .compatible = "qcom,tcsr-mutex", .data = &of_tcsr_mutex },
>   { .compatible = "qcom,apq8084-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
> - { .compatible = "qcom,ipq6018-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
>   { .compatible = "qcom,msm8226-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
>   { .compatible = "qcom,msm8974-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
>   { .compatible = "qcom,msm8994-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },



Re: [PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-12 Thread Jason Wang
On Tue, Dec 12, 2023 at 9:17 PM Maxime Coquelin
 wrote:
>
> Virtio-net driver control queue implementation is not safe
> when used with VDUSE. If the VDUSE application does not
> reply to control queue messages, it currently ends up
> hanging the kernel thread sending this command.
>
> Some work is on-going to make the control queue
> implementation robust with VDUSE. Until it is completed,
> let's disable control virtqueue and features that depend on
> it.
>
> Signed-off-by: Maxime Coquelin 

I wonder if it's better to fail instead of masking, as a start.

Thanks

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 37 ++
>  1 file changed, 37 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0486ff672408..fe4b5c8203fd 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "iova_domain.h"
> @@ -46,6 +47,30 @@
>
>  #define IRQ_UNBOUND -1
>
> +#define VDUSE_NET_VALID_FEATURES_MASK   \
> +   (BIT_ULL(VIRTIO_NET_F_CSUM) |   \
> +BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | \
> +BIT_ULL(VIRTIO_NET_F_MTU) |\
> +BIT_ULL(VIRTIO_NET_F_MAC) |\
> +BIT_ULL(VIRTIO_NET_F_GUEST_TSO4) | \
> +BIT_ULL(VIRTIO_NET_F_GUEST_TSO6) | \
> +BIT_ULL(VIRTIO_NET_F_GUEST_ECN) |  \
> +BIT_ULL(VIRTIO_NET_F_GUEST_UFO) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_TSO4) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_TSO6) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_ECN) |   \
> +BIT_ULL(VIRTIO_NET_F_HOST_UFO) |   \
> +BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) |  \
> +BIT_ULL(VIRTIO_NET_F_STATUS) | \
> +BIT_ULL(VIRTIO_NET_F_HOST_USO) |   \
> +BIT_ULL(VIRTIO_F_ANY_LAYOUT) | \
> +BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC) | \
> +BIT_ULL(VIRTIO_RING_F_EVENT_IDX) |  \
> +BIT_ULL(VIRTIO_F_VERSION_1) |  \
> +BIT_ULL(VIRTIO_F_ACCESS_PLATFORM) | \
> +BIT_ULL(VIRTIO_F_RING_PACKED) |\
> +BIT_ULL(VIRTIO_F_IN_ORDER))
> +
>  struct vduse_virtqueue {
> u16 index;
> u16 num_max;
> @@ -1782,6 +1807,16 @@ static struct attribute *vduse_dev_attrs[] = {
>
>  ATTRIBUTE_GROUPS(vduse_dev);
>
> +static void vduse_dev_features_filter(struct vduse_dev_config *config)
> +{
> +   /*
> +* Temporarily filter out virtio-net's control virtqueue and features
> +* that depend on it while CVQ is being made more robust for VDUSE.
> +*/
> +   if (config->device_id == VIRTIO_ID_NET)
> +   config->features &= VDUSE_NET_VALID_FEATURES_MASK;
> +}
> +
>  static int vduse_create_dev(struct vduse_dev_config *config,
> void *config_buf, u64 api_version)
>  {
> @@ -1797,6 +1832,8 @@ static int vduse_create_dev(struct vduse_dev_config 
> *config,
> if (!dev)
> goto err;
>
> +   vduse_dev_features_filter(config);
> +
> dev->api_version = api_version;
> dev->device_features = config->features;
> dev->device_id = config->device_id;
> --
> 2.43.0
>




Re: [PATCH] nvdimm-btt: simplify code with the scope based resource management

2023-12-12 Thread dinghao . liu
> 
> On 12/10/23 03:27, Dinghao Liu wrote:
> > Use the scope based resource management (defined in
> > linux/cleanup.h) to automate resource lifetime
> > control on struct btt_sb *super in discover_arenas().
> > 
> > Signed-off-by: Dinghao Liu 
> > ---
> >  drivers/nvdimm/btt.c | 12 
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > index d5593b0dc700..ff42778b51de 100644
> > --- a/drivers/nvdimm/btt.c
> > +++ b/drivers/nvdimm/btt.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/cleanup.h>
> >  #include "btt.h"
> >  #include "nd.h"
> >  
> > @@ -847,7 +848,7 @@ static int discover_arenas(struct btt *btt)
> >  {
> > int ret = 0;
> > struct arena_info *arena;
> > -   struct btt_sb *super;
> > +   struct btt_sb *super __free(kfree) = NULL;
> > size_t remaining = btt->rawsize;
> > u64 cur_nlba = 0;
> > size_t cur_off = 0;
> > @@ -860,10 +861,8 @@ static int discover_arenas(struct btt *btt)
> > while (remaining) {
> > /* Alloc memory for arena */
> > arena = alloc_arena(btt, 0, 0, 0);
> > -   if (!arena) {
> > -   ret = -ENOMEM;
> > -   goto out_super;
> > -   }
> > +   if (!arena)
> > +   return -ENOMEM;
> >  
> > arena->infooff = cur_off;
> > ret = btt_info_read(arena, super);
> > @@ -919,14 +918,11 @@ static int discover_arenas(struct btt *btt)
> > btt->nlba = cur_nlba;
> > btt->init_state = INIT_READY;
> >  
> > -   kfree(super);
> > return ret;
> >  
> >   out:
> > kfree(arena);
> > free_arenas(btt);
> > - out_super:
> > -   kfree(super);
> > return ret;
> >  }
> >  
> 
> I would do the allocation like something below for the first chunk. Otherwise 
> the rest LGTM. 
> 
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index d5593b0dc700..143921e7f26c 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/cleanup.h>
>  #include "btt.h"
>  #include "nd.h"
>  
> @@ -845,25 +846,23 @@ static void parse_arena_meta(struct arena_info *arena, 
> struct btt_sb *super,
>  
>  static int discover_arenas(struct btt *btt)
>  {
> +   struct btt_sb *super __free(kfree) =
> +   kzalloc(sizeof(*super), GFP_KERNEL);
> int ret = 0;
> struct arena_info *arena;
> -   struct btt_sb *super;
> size_t remaining = btt->rawsize;
> u64 cur_nlba = 0;
> size_t cur_off = 0;
> int num_arenas = 0;
>  
> -   super = kzalloc(sizeof(*super), GFP_KERNEL);
> if (!super)
> return -ENOMEM;
>  
> while (remaining) {
> /* Alloc memory for arena */

It's a little strange that we do not check super immediately after allocation.
How about this:

 static int discover_arenas(struct btt *btt)
 {
int ret = 0;
struct arena_info *arena;
-   struct btt_sb *super;
size_t remaining = btt->rawsize;
u64 cur_nlba = 0;
size_t cur_off = 0;
int num_arenas = 0;
 
-   super = kzalloc(sizeof(*super), GFP_KERNEL);
+   struct btt_sb *super __free(kfree) = 
+   kzalloc(sizeof(*super), GFP_KERNEL);
if (!super)
return -ENOMEM;
 
while (remaining) {
 

[PATCH v2] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:

  trace_seq_printf(s, ": %s", field->buf);

The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.

  int max = iter->ent_size - offsetof(struct print_entry, buf);

  trace_seq_printf(s, ": %*.s", max, field->buf);

Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/2023121208.4619b...@gandalf.local.home

- Use "%.*s" and not "%*s" otherwise it right aligns the output.

[ Masami, I kept your review by. ]

 kernel/trace/trace_output.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..3e7fa44dc2b2 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct 
trace_iterator *iter,
 {
struct print_entry *field;
	struct trace_seq *s = &iter->seq;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
 
trace_assign_type(field, iter->ent);
 
seq_print_ip_sym(s, field->ip, flags);
-   trace_seq_printf(s, ": %s", field->buf);
+   trace_seq_printf(s, ": %.*s", max, field->buf);
 
return trace_handle_return(s);
 }
@@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct 
trace_iterator *iter, int flags,
 struct trace_event *event)
 {
struct print_entry *field;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
 
trace_assign_type(field, iter->ent);
 
-	trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+	trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
 
	return trace_handle_return(&iter->seq);
 }
-- 
2.42.0




Re: [PATCH] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 08:44:44 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> If for some reason the trace_marker write does not have a nul byte for the
> string, it will overflow the print:
> 
>   trace_seq_printf(s, ": %s", field->buf);
> 
> The field->buf could be missing the nul byte. To prevent overflow, add the
> max size that the buf can be by using the event size and the field
> location.
> 
>   int max = iter->ent_size - offsetof(struct print_entry, buf);
> 
>   trace_seq_printf(s, ": %*s", max, field->buf);

Bah, this needs to be:

   trace_seq_printf(s, ": %.*s", max, field->buf);

Note the '.' between % and *. Otherwise it right aligns the output.

This did fail the selftest for trace_printk(), but I modified the new one
to add " *" to accommodate it :-p

Sending out v2.

-- Steve




Re: [PATCH -next 2/2] mm: vmscan: add new event to trace shrink lru

2023-12-12 Thread Andrew Morton
On Mon, 11 Dec 2023 19:26:40 -0800 Bixuan Cui  wrote:

> -TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start,

Current kernels have a call to trace_mm_vmscan_lru_shrink_inactive() in
evict_folios(), so this renaming broke the build.




[PATCH v2] tracing: Fix uaf issue when open the hist or hist_debug file

2023-12-12 Thread Zheng Yejian
KASAN reports the following issue. The root cause is that opening the
'hist' file of an instance and accessing its 'trace_event_file' in
hist_show() can race with the instance being removed, which frees
'trace_event_file'. The 'hist_debug' file has the same problem. To fix
it, call tracing_{open,release}_file_tr() in the file_operations
callbacks to hold a reference and keep 'trace_event_file' from being
freed.

  BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
  Read of size 8 at addr 242541e336b8 by task head/190

  CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x98/0xf8
   show_stack+0x1c/0x30
   dump_stack_lvl+0x44/0x58
   print_report+0xf0/0x5a0
   kasan_report+0x80/0xc0
   __asan_report_load8_noabort+0x1c/0x28
   hist_show+0x11e0/0x1278
   seq_read_iter+0x344/0xd78
   seq_read+0x128/0x1c0
   vfs_read+0x198/0x6c8
   ksys_read+0xf4/0x1e0
   __arm64_sys_read+0x70/0xa8
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Allocated by task 188:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_alloc_info+0x20/0x30
   __kasan_slab_alloc+0x6c/0x80
   kmem_cache_alloc+0x15c/0x4a8
   trace_create_new_event+0x84/0x348
   __trace_add_new_event+0x18/0x88
   event_trace_add_tracer+0xc4/0x1a0
   trace_array_create_dir+0x6c/0x100
   trace_array_create+0x2e8/0x568
   instance_mkdir+0x48/0x80
   tracefs_syscall_mkdir+0x90/0xe8
   vfs_mkdir+0x3c4/0x610
   do_mkdirat+0x144/0x200
   __arm64_sys_mkdirat+0x8c/0xc0
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Freed by task 191:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_free_info+0x34/0x58
   __kasan_slab_free+0xe4/0x158
   kmem_cache_free+0x19c/0x508
   event_file_put+0xa0/0x120
   remove_event_file_dir+0x180/0x320
   event_trace_del_tracer+0xb0/0x180
   __remove_instance+0x224/0x508
   instance_rmdir+0x44/0x78
   tracefs_syscall_rmdir+0xbc/0x140
   vfs_rmdir+0x1cc/0x4c8
   do_rmdir+0x220/0x2b8
   __arm64_sys_unlinkat+0xc0/0x100
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

Suggested-by: Steven Rostedt 
Signed-off-by: Zheng Yejian 
---
 kernel/trace/trace_events_hist.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

Steve, thanks for your review!

v2:
  - Introduce tracing_single_release_file_tr() to add the missing call for
single_release() as suggested by Steve;
Link: 
https://lore.kernel.org/all/20231212113546.6a51d...@gandalf.local.home/
  - Slightly modify the commit message and comments.

v1:
  Link: 
https://lore.kernel.org/all/20231212113317.4159890-1-zhengyeji...@huawei.com/

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1abc07fba1b9..5296a08c0641 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5619,14 +5619,22 @@ static int hist_show(struct seq_file *m, void *v)
return ret;
 }
 
+static int tracing_single_release_file_tr(struct inode *inode, struct file 
*filp)
+{
+   tracing_release_file_tr(inode, filp);
+   return single_release(inode, filp);
+}
+
 static int event_hist_open(struct inode *inode, struct file *file)
 {
int ret;
 
-   ret = security_locked_down(LOCKDOWN_TRACEFS);
+   ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
 
+   /* Clear private_data to avoid warning in single_open() */
+   file->private_data = NULL;
return single_open(file, hist_show, file);
 }
 
@@ -5634,7 +5642,7 @@ const struct file_operations event_hist_fops = {
.open = event_hist_open,
.read = seq_read,
.llseek = seq_lseek,
-   .release = single_release,
+   .release = tracing_single_release_file_tr,
 };
 
 #ifdef CONFIG_HIST_TRIGGERS_DEBUG
@@ -5900,10 +5908,12 @@ static int event_hist_debug_open(struct inode *inode, 
struct file *file)
 {
int ret;
 
-   ret = security_locked_down(LOCKDOWN_TRACEFS);
+   ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
 
+   /* Clear private_data to avoid warning in single_open() */
+   file->private_data = NULL;
return single_open(file, hist_debug_show, file);
 }
 
@@ -5911,7 +5921,7 @@ const struct file_operations event_hist_debug_fops = {
.open = event_hist_debug_open,
.read = seq_read,
.llseek = seq_lseek,
-   .release = single_release,
+   .release = tracing_single_release_file_tr,
 };
 #endif
 
-- 
2.25.1




Re: [PATCH v2 13/33] kmsan: Introduce memset_no_sanitize_memory()

2023-12-12 Thread Ilya Leoshkevich
On Fri, 2023-12-08 at 16:25 +0100, Alexander Potapenko wrote:
> > A problem with __memset() is that, at least for me, it always ends
> > up being a call. There is a use case where we need to write only 1
> > byte, so I thought that introducing a call there (when compiling
> > without KMSAN) would be unacceptable.
> 
> Wonder what happens with that use case if we e.g. build with
> fortify-source.
> Calling memset() for a single byte might be indicating the code is
> not hot.

The original code has a simple assignment. Here is the relevant diff:

 	if (s->flags & __OBJECT_POISON) {
-		memset(p, POISON_FREE, poison_size - 1);
-		p[poison_size - 1] = POISON_END;
+		memset_no_sanitize_memory(p, POISON_FREE, poison_size - 1);
+		memset_no_sanitize_memory(p + poison_size - 1, POISON_END, 1);
 	}

[...]


> As stated above, I don't think this is more or less working as
> intended. If we really want the ability to inline __memset(), we could
> transform it into memset() in non-sanitizer builds, but perhaps having
> a call is also acceptable?

Thanks for the detailed explanation and analysis. I will post
a version with a __memset() and let the slab maintainers decide if
the additional overhead is acceptable.



Re: [PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-12 Thread Huang, Ying
Vishal Verma  writes:

> Add a sysfs knob for dax devices to control the memmap_on_memory setting
> if the dax device were to be hotplugged as system memory.
>
> The default memmap_on_memory setting for dax devices originating via
> pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Dave Hansen 
> Cc: Huang Ying 
> Tested-by: Li Zhijian 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: David Hildenbrand 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c   | 32 
>  Documentation/ABI/testing/sysfs-bus-dax | 17 +
>  2 files changed, 49 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index ce1356ac6dc2..423adee6f802 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1245,6 +1245,37 @@ static ssize_t numa_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(numa_node);
>  
> +static ssize_t memmap_on_memory_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> +
> + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
> +}
> +
> +static ssize_t memmap_on_memory_store(struct device *dev,
> +   struct device_attribute *attr,
> +   const char *buf, size_t len)
> +{
> + struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
> + struct dev_dax *dev_dax = to_dev_dax(dev);
> + ssize_t rc;
> + bool val;
> +
> +	rc = kstrtobool(buf, &val);
> + if (rc)
> + return rc;
> +
> + guard(device)(dev);
> + if (dev_dax->memmap_on_memory != val &&
> + dax_drv->type == DAXDRV_KMEM_TYPE)

Should we check "dev->driver != NULL" here, and should we move

dax_drv = to_dax_drv(dev->driver);

here with device lock held?

--
Best Regards,
Huang, Ying

> + return -EBUSY;
> + dev_dax->memmap_on_memory = val;
> +
> + return len;
> +}
> +static DEVICE_ATTR_RW(memmap_on_memory);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, 
> int n)
>  {
>   struct device *dev = container_of(kobj, struct device, kobj);
> @@ -1271,6 +1302,7 @@ static struct attribute *dev_dax_attributes[] = {
>   &dev_attr_align.attr,
>   &dev_attr_resource.attr,
>   &dev_attr_numa_node.attr,
> + &dev_attr_memmap_on_memory.attr,
>   NULL,
>  };
>  
> diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
> b/Documentation/ABI/testing/sysfs-bus-dax
> index a61a7b186017..b1fd8bf8a7de 100644
> --- a/Documentation/ABI/testing/sysfs-bus-dax
> +++ b/Documentation/ABI/testing/sysfs-bus-dax
> @@ -149,3 +149,20 @@ KernelVersion:   v5.1
>  Contact: nvd...@lists.linux.dev
>  Description:
>   (RO) The id attribute indicates the region id of a dax region.
> +
> +What:/sys/bus/dax/devices/daxX.Y/memmap_on_memory
> +Date:October, 2023
> +KernelVersion:   v6.8
> +Contact: nvd...@lists.linux.dev
> +Description:
> + (RW) Control the memmap_on_memory setting if the dax device
> + were to be hotplugged as system memory. This determines whether
> + the 'altmap' for the hotplugged memory will be placed on the
> + device being hotplugged (memmap_on_memory=1) or if it will be
> + placed on regular memory (memmap_on_memory=0). This attribute
> + must be set before the device is handed over to the 'kmem'
> + driver (i.e.  hotplugged into system-ram). Additionally, this
> + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
> + memmap_on_memory parameter for memory_hotplug. This is
> + typically set on the kernel command line -
> + memory_hotplug.memmap_on_memory set to 'true' or 'force'."



Re: [PATCH v2 18/33] lib/string: Add KMSAN support to strlcpy() and strlcat()

2023-12-12 Thread Ilya Leoshkevich
On Fri, 2023-12-08 at 17:50 +0100, Alexander Potapenko wrote:
> On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich 
> wrote:
> > 
> > Currently KMSAN does not fully propagate metadata in strlcpy() and
> > strlcat(), because they are built with -ffreestanding and call
> > memcpy(). In this combination memcpy() calls are not instrumented.
> 
> Is this something specific to s390?

Nice catch - I can't reproduce this behavior anymore. Even if I go
back to the clang version that first introduced KMSAN on s390x, the
memset() instrumentation with -ffreestanding is still there. I should
have written down more detailed notes after investigating this, but
here we are. I will drop this patch as well as 10/33.

[...]



Re: [PATCH] ring-buffer: Do not update before stamp when switching sub-buffers

2023-12-12 Thread Google
On Mon, 11 Dec 2023 11:44:20 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The ring buffer timestamps are synchronized by two timestamp placeholders.
> One is the "before_stamp" and the other is the "write_stamp" (sometimes
> referred to as the "after stamp", but only in the comments). These two
> stamps are key to knowing how to handle nested events coming in with a
> lockless system.
> 
> When moving across sub-buffers, the before stamp is updated but the write
> stamp is not. There's an effort to put back the before stamp to something
> that seems logical in case there's nested events. But as the current event
> is about to cross sub-buffers, and so will any new nested event that happens,
> updating the before stamp is useless, and could even introduce new race
> conditions.
> 
> The first event on a sub-buffer simply uses the sub-buffer's timestamp
> and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
> used in the algorithm in this case. There's no reason to try to fix the
> before_stamp when this happens.
> 
> As a bonus, it removes a cmpxchg() when crossing sub-buffers!
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you

> Cc: sta...@vger.kernel.org
> Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running 
> time stamp")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 2596fa7b748a..02bc9986fe0d 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3607,14 +3607,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu 
> *cpu_buffer,
>  
>   /* See if we shot pass the end of this buffer page */
>   if (unlikely(write > BUF_PAGE_SIZE)) {
> -	/* before and after may now different, fix it up*/
> -	b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
> -	a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> -	if (a_ok && b_ok && info->before != info->after)
> -		(void)rb_time_cmpxchg(&cpu_buffer->before_stamp,
> -				      info->before, info->after);
> -	if (a_ok && b_ok)
> -		check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
> + check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
>   return rb_move_tail(cpu_buffer, tail, info);
>   }
>  
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing: Have trace_marker break up by lines by size of trace_seq

2023-12-12 Thread Steven Rostedt
On Wed, 13 Dec 2023 09:19:33 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Dec 2023 19:04:22 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > If a trace_marker write is bigger than what trace_seq can hold, then it
> > will print "LINE TOO BIG" message and not what was written.
> > 
> > Instead, if check if the write is bigger than the trace_seq and break it  
> 
> Instead, check if ... ?

Ah yes, thank you.

> 
> > up by that size.
> > 
> > Ideally, we could make the trace_seq dynamic that could hold this. But
> > that's for another time.  
> 
> I think this is OK, but if possible it is better to be merged with the
> "LINE TOO BIG" patch (by updating the version).

What do you mean by "updating the version"?

Note, the LINE TOO BIG doesn't happen today. It only happens when applying
the sub buffer resize change, and then when I run the tests, it breaks when
the subbuffer is bigger than the trace_seq.

> 
> Reviewed-by: Masami Hiramatsu (Google) 

Thanks!

-- Steve



Re: [PATCH] tracing: Have trace_marker break up by lines by size of trace_seq

2023-12-12 Thread Google
On Tue, 12 Dec 2023 19:04:22 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> If a trace_marker write is bigger than what trace_seq can hold, then it
> will print "LINE TOO BIG" message and not what was written.
> 
> Instead, if check if the write is bigger than the trace_seq and break it

Instead, check if ... ?

> up by that size.
> 
> Ideally, we could make the trace_seq dynamic that could hold this. But
> that's for another time.

I think this is OK, but if possible it is better to be merged with the
"LINE TOO BIG" patch (by updating the version).

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/trace.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 893e749713d3..2a21bc840fe7 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7298,6 +7298,11 @@ tracing_mark_write(struct file *filp, const char 
> __user *ubuf,
>   if (cnt < FAULTED_SIZE)
>   size += FAULTED_SIZE - cnt;
>  
> + if (size > TRACE_SEQ_BUFFER_SIZE) {
> + cnt -= size - TRACE_SEQ_BUFFER_SIZE;
> + goto again;
> + }
> +
>   buffer = tr->array_buffer.buffer;
>   event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
>   tracing_gen_ctx());
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v4] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Google
On Tue, 12 Dec 2023 13:19:01 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Allow a trace write to be as big as the ring buffer tracing data will
> allow. Currently, it only allows writes of 1KB in size, but there's no
> reason that it cannot allow what the ring buffer can hold.
> 
> Signed-off-by: Steven Rostedt (Google) 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> ---
> Changes since v3: 
> https://lore.kernel.org/linux-trace-kernel/20231212110332.6fca5...@gandalf.local.home
> 
> - No greated cheese. (Mathieu Desnoyers)
> 
>  include/linux/ring_buffer.h |  1 +
>  kernel/trace/ring_buffer.c  | 15 +++
>  kernel/trace/trace.c| 31 ---
>  3 files changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 782e14f62201..b1b03b2c0f08 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -141,6 +141,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
>  bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter);
>  
>  unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu);
> +unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer);
>  
>  void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu);
>  void ring_buffer_reset_online_cpus(struct trace_buffer *buffer);
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 6b82c3398938..087f0f6b3409 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5255,6 +5255,21 @@ unsigned long ring_buffer_size(struct trace_buffer 
> *buffer, int cpu)
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_size);
>  
> +/**
> + * ring_buffer_max_event_size - return the max data size of an event
> + * @buffer: The ring buffer.
> + *
> + * Returns the maximum size an event can be.
> + */
> +unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer)
> +{
> + /* If abs timestamp is requested, events have a timestamp too */
> + if (ring_buffer_time_stamp_abs(buffer))
> + return BUF_MAX_DATA_SIZE - RB_LEN_TIME_EXTEND;
> + return BUF_MAX_DATA_SIZE;
> +}
> +EXPORT_SYMBOL_GPL(ring_buffer_max_event_size);
> +
>  static void rb_clear_buffer_page(struct buffer_page *page)
>  {
>   local_set(&page->write, 0);
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index ef86379555e4..a359783fede8 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7272,8 +7272,9 @@ tracing_mark_write(struct file *filp, const char __user 
> *ubuf,
>   enum event_trigger_type tt = ETT_NONE;
>   struct trace_buffer *buffer;
>   struct print_entry *entry;
> + int meta_size;
>   ssize_t written;
> - int size;
> + size_t size;
>   int len;
>  
>  /* Used in tracing_mark_raw_write() as well */
> @@ -7286,12 +7287,12 @@ tracing_mark_write(struct file *filp, const char 
> __user *ubuf,
>   if (!(tr->trace_flags & TRACE_ITER_MARKERS))
>   return -EINVAL;
>  
> - if (cnt > TRACE_BUF_SIZE)
> - cnt = TRACE_BUF_SIZE;
> -
> - BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
> + if ((ssize_t)cnt < 0)
> + return -EINVAL;
>  
> - size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
> + meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
> + again:
> + size = cnt + meta_size;
>  
>   /* If less than "<faulted>", then make sure we can still add that */
>   if (cnt < FAULTED_SIZE)
> @@ -7300,9 +7301,25 @@ tracing_mark_write(struct file *filp, const char 
> __user *ubuf,
>   buffer = tr->array_buffer.buffer;
>   event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
>   tracing_gen_ctx());
> - if (unlikely(!event))
> + if (unlikely(!event)) {
> + /*
> +  * If the size was greater than what was allowed, then
> +  * make it smaller and try again.
> +  */
> + if (size > ring_buffer_max_event_size(buffer)) {
> + /* cnt < FAULTED size should never be bigger than max */
> + if (WARN_ON_ONCE(cnt < FAULTED_SIZE))
> + return -EBADF;
> + cnt = ring_buffer_max_event_size(buffer) - meta_size;
> + /* The above should only happen once */
> + if (WARN_ON_ONCE(cnt + meta_size == size))
> + return -EBADF;
> + goto again;
> + }
> +
>   /* Ring buffer disabled, return as if not open for write */
>   return -EBADF;
> + }
>  
>   entry = ring_buffer_event_data(event);
>   entry->ip = _THIS_IP_;
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 
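The clamp-and-retry logic in the v4 patch above can be sketched as a small model. This is not the kernel code itself — reserve_with_retry() is a hypothetical stand-in, and meta models the per-event overhead (header, '\0' and possible '\n'):

```c
#include <assert.h>
#include <stddef.h>

/* If the first reservation would exceed the ring buffer's max event
 * size, shrink the user payload (cnt) so the total fits and retry
 * exactly once; a second failure means no progress was made and the
 * write bails out (-EBADF in the real code). */
static long reserve_with_retry(size_t cnt, size_t meta, size_t max_event)
{
	size_t size = cnt + meta;

	if (size > max_event) {
		size_t new_cnt = max_event - meta;

		/* the shrink must actually make the event smaller */
		if (new_cnt + meta == size)
			return -1;
		size = new_cnt + meta;
	}
	return (long)size;
}
```

The `WARN_ON_ONCE(cnt + meta_size == size)` in the real patch plays the same role as the no-progress check here.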



[PATCH] tracing: Have trace_marker break up by lines by size of trace_seq

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a trace_marker write is bigger than what trace_seq can hold, then it
will print "LINE TOO BIG" message and not what was written.

Instead, if check if the write is bigger than the trace_seq and break it
up by that size.

Ideally, we could make the trace_seq dynamic that could hold this. But
that's for another time.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 893e749713d3..2a21bc840fe7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7298,6 +7298,11 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if (cnt < FAULTED_SIZE)
size += FAULTED_SIZE - cnt;
 
+   if (size > TRACE_SEQ_BUFFER_SIZE) {
+   cnt -= size - TRACE_SEQ_BUFFER_SIZE;
+   goto again;
+   }
+
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());
-- 
2.42.0




Re: [PATCH v3] ring-buffer: Fix writing to the buffer with max_data_size

2023-12-12 Thread Google
On Tue, 12 Dec 2023 11:16:17 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The maximum ring buffer data size is the maximum size of data that can be
> recorded on the ring buffer. Events must be smaller than the sub buffer
> data size minus any meta data. This size is checked before trying to
> allocate from the ring buffer because the allocation assumes that the size
> will fit on the sub buffer.
> 
> The maximum size was calculated as the size of a sub buffer page (which is
> currently PAGE_SIZE minus the sub buffer header) minus the size of the
> meta data of an individual event. But it missed the possible adding of a
> time stamp for events that are added long enough apart that the event meta
> data can't hold the time delta.
> 
> When an event is added that is greater than the current BUF_MAX_DATA_SIZE
> minus the size of a time stamp, but still less than or equal to
> BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
> for a page that can hold the event. Luckily, there's a check for this loop
> after 1000 iterations a warning is emitted and the ring buffer is
> disabled. But this should never happen.
> 
> This can happen when a large event is added first, or after a long period
> where an absolute timestamp is prefixed to the event, increasing its size
> by 8 bytes. This passes the check and then goes into the algorithm that
> causes the infinite loop.
> 
> For events that are the first event on the sub-buffer, it does not need to
> add a timestamp, because the sub-buffer itself contains an absolute
> timestamp, and adding one is redundant.
> 
> The fix is to check if the event is to be the first event on the
> sub-buffer, and if it is, then do not add a timestamp.
> 
> This also fixes 32 bit adding a timestamp when a read of before_stamp or
> write_stamp is interrupted. There's still no need to add that timestamp if
> the event is going to be the first event on the sub buffer.
> 
> Also, if the buffer has "time_stamp_abs" set, then also check if the
> length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
> 
> Link: https://lore.kernel.org/all/20231212104549.58863...@gandalf.local.home/
> Link: 
> https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6...@gandalf.local.home
> 
> Cc: sta...@vger.kernel.org
> Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated")
> Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some 
> branches")
> Reported-by: Kent Overstreet  # (on IRC)
> Signed-off-by: Steven Rostedt (Google) 

This looks good to me :)

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> ---
> Changes since v2: 
> https://lore.kernel.org/linux-trace-kernel/20231212065922.05f28...@gandalf.local.home
> 
> - Just test 'w' first, and then do the rest of the checks.
> 
>  kernel/trace/ring_buffer.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 8d2a4f00eca9..b8986f82eccf 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3579,7 +3579,10 @@ __rb_reserve_next(struct ring_buffer_per_cpu 
> *cpu_buffer,
>* absolute timestamp.
>* Don't bother if this is the start of a new page (w == 0).
>*/
> - if (unlikely(!a_ok || !b_ok || (info->before != info->after && 
> w))) {
> + if (!w) {
> + /* Use the sub-buffer timestamp */
> + info->delta = 0;
> + } else if (unlikely(!a_ok || !b_ok || info->before != 
> info->after)) {
>   info->add_timestamp |= RB_ADD_STAMP_FORCE | 
> RB_ADD_STAMP_EXTEND;
>   info->length += RB_LEN_TIME_EXTEND;
>   } else {
> @@ -3737,6 +3740,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
>   if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
>   add_ts_default = RB_ADD_STAMP_ABSOLUTE;
>   info.length += RB_LEN_TIME_EXTEND;
> + if (info.length > BUF_MAX_DATA_SIZE)
> + goto out_fail;
>   } else {
>   add_ts_default = RB_ADD_STAMP_NONE;
>   }
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 
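The length check added at the end of the patch above amounts to the following arithmetic. The constants below are hypothetical stand-ins for BUF_MAX_DATA_SIZE and RB_LEN_TIME_EXTEND, chosen only for illustration:

```c
#include <assert.h>

#define BUF_MAX_DATA	4072	/* stand-in for BUF_MAX_DATA_SIZE */
#define TIME_EXTEND	8	/* stand-in for RB_LEN_TIME_EXTEND */

/* With time_stamp_abs set, every event carries an extra absolute
 * timestamp, so the size check must account for it up front instead
 * of discovering the overflow inside the reservation loop. */
static int reserve_len_ok(unsigned long length, int abs_ts)
{
	if (abs_ts)
		length += TIME_EXTEND;
	return length <= BUF_MAX_DATA;
}
```

An event that fits without the timestamp but not with it is exactly the case that previously slipped past the check and triggered the infinite loop.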



Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-12 Thread Si-Wei Liu




On 12/12/2023 11:21 AM, Eugenio Perez Martin wrote:

On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea  wrote:

Addresses get set by .set_vq_address. hw vq addresses will be updated on
next modify_virtqueue.

Signed-off-by: Dragos Tatulea 
Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 

I'm kind of ok with this patch and the next one about state, but I
didn't ack them in the previous series.

My main concern is that it is not valid to change the vq address after
DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are ok to
change at this moment. I'm not sure about vq state in vDPA, but vhost
forbids changing it with an active backend.

Suspend is not defined in VirtIO at this moment though, so maybe it is
ok to decide that all of these parameters may change during suspend.
Maybe the best thing is to protect this with a vDPA feature flag.
I think protecting this with a vDPA feature flag could work. On the other 
hand, vDPA means vendor-specific optimization around suspend and resume 
is possible (in case it helps performance), which doesn't have to be 
backed by the virtio spec. The same applies to vhost-user backend 
features; variations there were not backed by the spec either. Of course, 
we should try our best to make the default behavior backward compatible 
with a virtio-based backend, but that circles back to the lack of a 
suspend definition in the current virtio spec, for which I hope we don't 
cease development on vDPA indefinitely. After all, a virtio-based vdpa 
backend can well define its own feature flag to describe (minor 
differences in) the suspend behavior based on the later spec once it is 
formed in the future.


Regards,
-Siwei





Jason, what do you think?

Thanks!


---
  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
  include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
  2 files changed, 10 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f8f088cced50..80e066de0866 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 bool state_change = false;
 void *obj_context;
 void *cmd_hdr;
+   void *vq_ctx;
 void *in;
 int err;

@@ -1230,6 +1231,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);

 obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
+   vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
virtio_q_context);

 if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
 if (!is_valid_state_change(mvq->fw_state, state, 
is_resumable(ndev))) {
@@ -1241,6 +1243,12 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 state_change = true;
 }

+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS) {
+   MLX5_SET64(virtio_q, vq_ctx, desc_addr, mvq->desc_addr);
+   MLX5_SET64(virtio_q, vq_ctx, used_addr, mvq->device_addr);
+   MLX5_SET64(virtio_q, vq_ctx, available_addr, mvq->driver_addr);
+   }
+
 MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
 err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
 if (err)
@@ -2202,6 +2210,7 @@ static int mlx5_vdpa_set_vq_address(struct vdpa_device 
*vdev, u16 idx, u64 desc_
 mvq->desc_addr = desc_area;
 mvq->device_addr = device_area;
 mvq->driver_addr = driver_area;
+   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS;
 return 0;
  }

diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
b/include/linux/mlx5/mlx5_ifc_vdpa.h
index b86d51a855f6..9594ac405740 100644
--- a/include/linux/mlx5/mlx5_ifc_vdpa.h
+++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
@@ -145,6 +145,7 @@ enum {
 MLX5_VIRTQ_MODIFY_MASK_STATE= (u64)1 << 0,
 MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
 MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
 MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
  };

--
2.42.0
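The patch above follows a deferred-update pattern: setters only record which hardware vq fields changed, and the next modify command pushes them all at once. A minimal sketch of that pattern (names and bit values are illustrative only, mirroring the MLX5_VIRTQ_MODIFY_MASK_* enum in spirit):

```c
#include <assert.h>

#define MOD_STATE		(1ULL << 0)
#define MOD_VIRTIO_Q_ADDRS	(1ULL << 6)

struct vq_model {
	unsigned long long modified_fields;
	unsigned long long desc_addr;
};

/* Setter records the new value and marks the field dirty; nothing is
 * sent to the device yet. */
static void set_vq_address(struct vq_model *q, unsigned long long desc)
{
	q->desc_addr = desc;
	q->modified_fields |= MOD_VIRTIO_Q_ADDRS; /* applied on next modify */
}

/* The modify path consumes the dirty mask (modify_field_select in the
 * real command) and clears it. */
static unsigned long long modify_flush(struct vq_model *q)
{
	unsigned long long fields = q->modified_fields;

	q->modified_fields = 0;
	return fields;
}
```

This is why .set_vq_address can be called while the vq is suspended: the hardware only sees the new addresses when modify_virtqueue() runs.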






Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-12 Thread Casey Schaufler
On 12/12/2023 9:59 AM, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 08:33:39AM -0800, Casey Schaufler wrote:
>> On 12/12/2023 5:17 AM, Maxime Coquelin wrote:
>>> This patch introduces a LSM hook for devices creation,
>>> destruction (ioctl()) and opening (open()) operations,
>>> checking the application is allowed to perform these
>>> operations for the Virtio device type.
>> My earlier comments on a vduse specific LSM hook still hold.
>> I would much prefer to see a device permissions hook(s) that
>> are useful for devices in general. Not just vduse devices.
>> I know that there are already some very special purpose LSM
>> hooks, but the experience with maintaining them is why I don't
>> want more of them. 
> What exactly does this mean?

You have proposed an LSM hook that is only useful for vduse.
You want to implement a set of controls that only apply to vduse.
I can't help but think that if someone (i.e. you) wants to control
device creation for vduse that there could well be a use case for
control over device creation for some other set of devices. It is
quite possible that someone out there is desperately trying to
solve the same problem you have, but with a different device.

I have no desire to have to deal with
security_vduse_perm_check()
security_odddev_perm_check()
...
security_evendev_perm_check()

when we should be able to have
security_device_perm_check()

that can service them all.


>  Devices like tap etc? How do we
> find them all though?

I'm not suggesting you find them all. I'm suggesting that you provide
an interface that someone could use if they wanted to. I think you
will be surprised how many will appear (with complaints about the
interface you propose, of course) if you implement a generally useful
LSM hook.

>
>>> Signed-off-by: Maxime Coquelin 
>>> ---
>>>  MAINTAINERS |  1 +
>>>  drivers/vdpa/vdpa_user/vduse_dev.c  | 13 
>>>  include/linux/lsm_hook_defs.h   |  2 ++
>>>  include/linux/security.h|  6 ++
>>>  include/linux/vduse.h   | 14 +
>>>  security/security.c | 15 ++
>>>  security/selinux/hooks.c| 32 +
>>>  security/selinux/include/classmap.h |  2 ++
>>>  8 files changed, 85 insertions(+)
>>>  create mode 100644 include/linux/vduse.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index a0fb0df07b43..4e83b14358d2 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -23040,6 +23040,7 @@ F:  drivers/net/virtio_net.c
>>>  F: drivers/vdpa/
>>>  F: drivers/virtio/
>>>  F: include/linux/vdpa.h
>>> +F: include/linux/vduse.h
>>>  F: include/linux/virtio*.h
>>>  F: include/linux/vringh.h
>>>  F: include/uapi/linux/virtio_*.h
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
>>> b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> index fa62825be378..59ab7eb62e20 100644
>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -8,6 +8,7 @@
>>>   *
>>>   */
>>>  
>>> +#include "linux/security.h"
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -30,6 +31,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include "iova_domain.h"
>>>  
>>> @@ -1442,6 +1444,10 @@ static int vduse_dev_open(struct inode *inode, 
>>> struct file *file)
>>> if (dev->connected)
>>> goto unlock;
>>>  
>>> +   ret = -EPERM;
>>> +   if (security_vduse_perm_check(VDUSE_PERM_OPEN, dev->device_id))
>>> +   goto unlock;
>>> +
>>> ret = 0;
>>> dev->connected = true;
>>> file->private_data = dev;
>>> @@ -1664,6 +1670,9 @@ static int vduse_destroy_dev(char *name)
>>> if (!dev)
>>> return -EINVAL;
>>>  
>>> +   if (security_vduse_perm_check(VDUSE_PERM_DESTROY, dev->device_id))
>>> +   return -EPERM;
>>> +
> >>> mutex_lock(&dev->lock);
> >>> if (dev->vdev || dev->connected) {
> >>> mutex_unlock(&dev->lock);
>>> @@ -1828,6 +1837,10 @@ static int vduse_create_dev(struct vduse_dev_config 
>>> *config,
>>> int ret;
>>> struct vduse_dev *dev;
>>>  
>>> +   ret = -EPERM;
>>> +   if (security_vduse_perm_check(VDUSE_PERM_CREATE, config->device_id))
>>> +   goto err;
>>> +
>>> ret = -EEXIST;
>>> if (vduse_find_dev(config->name))
>>> goto err;
>>> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
>>> index ff217a5ce552..3930ab2ae974 100644
>>> --- a/include/linux/lsm_hook_defs.h
>>> +++ b/include/linux/lsm_hook_defs.h
>>> @@ -419,3 +419,5 @@ LSM_HOOK(int, 0, uring_override_creds, const struct 
>>> cred *new)
>>>  LSM_HOOK(int, 0, uring_sqpoll, void)
>>>  LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
>>>  #endif /* CONFIG_IO_URING */
>>> +
>>> +LSM_HOOK(int, 0, vduse_perm_check, enum vduse_op_perm op_perm, u32 
>>> device_id)
>>> diff --git a/include/linux/security.h b/include/linux/security.h
>>> index 1d1df326c881..2a2054172394 100644
>>> 

Re: [PATCH] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Google
On Tue, 12 Dec 2023 08:44:44 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> If for some reason the trace_marker write does not have a nul byte for the
> string, it will overflow the print:
> 
>   trace_seq_printf(s, ": %s", field->buf);
> 
> The field->buf could be missing the nul byte. To prevent overflow, add the
> max size that the buf can be by using the event size and the field
> location.
> 
>   int max = iter->ent_size - offsetof(struct print_entry, buf);
> 
>   trace_seq_printf(s, ": %*s", max, field->buf);
> 

This looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thanks!

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/trace_output.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index d8b302d01083..e11fb8996286 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct 
> trace_iterator *iter,
>  {
>   struct print_entry *field;
>   struct trace_seq *s = &iter->seq;
> + int max = iter->ent_size - offsetof(struct print_entry, buf);
>  
>   trace_assign_type(field, iter->ent);
>  
>   seq_print_ip_sym(s, field->ip, flags);
> - trace_seq_printf(s, ": %s", field->buf);
> + trace_seq_printf(s, ": %*s", max, field->buf);
>  
>   return trace_handle_return(s);
>  }
> @@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct 
> trace_iterator *iter, int flags,
>struct trace_event *event)
>  {
>   struct print_entry *field;
> + int max = iter->ent_size - offsetof(struct print_entry, buf);
>  
>   trace_assign_type(field, iter->ent);
>  
> - trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
> + trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf);
>  
>   return trace_handle_return(&iter->seq);
>  }
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 
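One general-C detail worth noting about the bounded print above: in the printf family, a *precision* ("%.*s") is what caps how many bytes are read from the string — and is therefore what protects a buffer that may lack a terminating NUL — while a bare *width* ("%*s") only pads. A user-space sketch using the precision form (print_marker() is a hypothetical name):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "%.*s" reads at most max bytes from buf, so buf need not be
 * NUL-terminated within those bytes. */
static int print_marker(char *out, size_t outsz, const char *buf, int max)
{
	return snprintf(out, outsz, ": %.*s", max, buf);
}
```

With max computed from the event size (iter->ent_size minus the offset of the buf field), the print can never run past the recorded entry.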



Re: [PATCH] kmod: Add FIPS 202 SHA-3 support

2023-12-12 Thread Dimitri John Ledkov
On Wed, 6 Dec 2023 at 15:26, Lucas De Marchi  wrote:
>
> On Sun, Oct 22, 2023 at 07:09:28PM +0100, Dimitri John Ledkov wrote:
> >Add support for parsing FIPS 202 SHA-3 signature hashes. Separately,
> >it is not clear why explicit hashes are re-encoded here, instead of
> >trying to generically show any digest openssl supports.
> >
> >Signed-off-by: Dimitri John Ledkov 

NACK

> >---
> > libkmod/libkmod-signature.c | 12 
> > 1 file changed, 12 insertions(+)
> >
> >diff --git a/libkmod/libkmod-signature.c b/libkmod/libkmod-signature.c
> >index b749a818f9..a39059cd7c 100644
> >--- a/libkmod/libkmod-signature.c
> >+++ b/libkmod/libkmod-signature.c
> >@@ -57,6 +57,9 @@ enum pkey_hash_algo {
> >   PKEY_HASH_SHA512,
> >   PKEY_HASH_SHA224,
> >   PKEY_HASH_SM3,
> >+  PKEY_HASH_SHA3_256,
> >+  PKEY_HASH_SHA3_384,
> >+  PKEY_HASH_SHA3_512,
> >   PKEY_HASH__LAST
> > };
> >
> >@@ -70,6 +73,9 @@ const char *const pkey_hash_algo[PKEY_HASH__LAST] = {
> >   [PKEY_HASH_SHA512]  = "sha512",
> >   [PKEY_HASH_SHA224]  = "sha224",
> >   [PKEY_HASH_SM3] = "sm3",
> >+  [PKEY_HASH_SHA3_256]= "sha3-256",
> >+  [PKEY_HASH_SHA3_384]= "sha3-384",
> >+  [PKEY_HASH_SHA3_512]= "sha3-512",
> > };
> >
> > enum pkey_id_type {
> >@@ -167,6 +173,12 @@ static int obj_to_hash_algo(const ASN1_OBJECT *o)
> >   case NID_sm3:
> >   return PKEY_HASH_SM3;
> > # endif
> >+  case NID_sha3_256:
> >+  return PKEY_HASH_SHA3_256;
> >+  case NID_sha3_384:
> >+  return PKEY_HASH_SHA3_384;
> >+  case NID_sha3_512:
> >+  return PKEY_HASH_SHA3_512;
>
>
> with your other patch, libkmod: remove pkcs7 obj_to_hash_algo(), this
> hunk is not needed anymore. Do you want to send a new version of this
> patch?

This patch is no longer required, given that
https://lore.kernel.org/all/20231029010319.157390-1-dimitri.led...@canonical.com/
is applied. Upgrade kmod to the one that has at least that patch
applied, and then pkcs7 signatures are parsed correctly with
everything that a runtime OpenSSL supports. Thus if you want to see
SHA3 signatures, ensure your runtime libssl has SHA3 support.

>
> thanks
> Lucas De Marchi
>
> >   default:
> >   return -1;
> >   }
> >--
> >2.34.1
> >
> >

-- 
Dimitri

Sent from Ubuntu Pro
https://ubuntu.com/pro



Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-12 Thread Dragos Tatulea
On Tue, 2023-12-12 at 20:21 +0100, Eugenio Perez Martin wrote:
> On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea  wrote:
> > 
> > Addresses get set by .set_vq_address. hw vq addresses will be updated on
> > next modify_virtqueue.
> > 
> > Signed-off-by: Dragos Tatulea 
> > Reviewed-by: Gal Pressman 
> > Acked-by: Eugenio Pérez 
> 
> I'm kind of ok with this patch and the next one about state, but I
> didn't ack them in the previous series.
> 
Sorry about the Ack misplacement. I got confused.

> My main concern is that it is not valid to change the vq address after
> DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are ok to
> change at this moment. I'm not sure about vq state in vDPA, but vhost
> forbids changing it with an active backend.
> 
> Suspend is not defined in VirtIO at this moment though, so maybe it is
> ok to decide that all of these parameters may change during suspend.
> Maybe the best thing is to protect this with a vDPA feature flag.
> 
> Jason, what do you think?
> 
> Thanks!
> 
> > ---
> >  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
> >  include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
> >  2 files changed, 10 insertions(+)
> > 
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index f8f088cced50..80e066de0866 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net 
> > *ndev,
> > bool state_change = false;
> > void *obj_context;
> > void *cmd_hdr;
> > +   void *vq_ctx;
> > void *in;
> > int err;
> > 
> > @@ -1230,6 +1231,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net 
> > *ndev,
> > MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
> > 
> > obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
> > +   vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
> > virtio_q_context);
> > 
> > if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
> > if (!is_valid_state_change(mvq->fw_state, state, 
> > is_resumable(ndev))) {
> > @@ -1241,6 +1243,12 @@ static int modify_virtqueue(struct mlx5_vdpa_net 
> > *ndev,
> > state_change = true;
> > }
> > 
> > +   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS) {
> > +   MLX5_SET64(virtio_q, vq_ctx, desc_addr, mvq->desc_addr);
> > +   MLX5_SET64(virtio_q, vq_ctx, used_addr, mvq->device_addr);
> > +   MLX5_SET64(virtio_q, vq_ctx, available_addr, 
> > mvq->driver_addr);
> > +   }
> > +
> > MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
> > mvq->modified_fields);
> > err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
> > if (err)
> > @@ -2202,6 +2210,7 @@ static int mlx5_vdpa_set_vq_address(struct 
> > vdpa_device *vdev, u16 idx, u64 desc_
> > mvq->desc_addr = desc_area;
> > mvq->device_addr = device_area;
> > mvq->driver_addr = driver_area;
> > +   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS;
> > return 0;
> >  }
> > 
> > diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
> > b/include/linux/mlx5/mlx5_ifc_vdpa.h
> > index b86d51a855f6..9594ac405740 100644
> > --- a/include/linux/mlx5/mlx5_ifc_vdpa.h
> > +++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
> > @@ -145,6 +145,7 @@ enum {
> > MLX5_VIRTQ_MODIFY_MASK_STATE= (u64)1 << 0,
> > MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
> > MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
> > +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
> > MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
> >  };
> > 
> > --
> > 2.42.0
> > 
> 



Re: [PATCH v4 2/3] dax/bus: Introduce guard(device) for device_{lock,unlock} flows

2023-12-12 Thread Ira Weiny
Vishal Verma wrote:
> Introduce a guard(device) macro to lock a 'struct device', and unlock it
> automatically when going out of scope using Scope Based Resource
> Management semantics. A lot of the sysfs attribute writes in
> drivers/dax/bus.c benefit from a cleanup using these, so change these
> where applicable.
> 
> Cc: Joao Martins 
> Suggested-by: Dan Williams 
> Signed-off-by: Vishal Verma 
>

Reviewed-by: Ira Weiny 



Re: [PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs

2023-12-12 Thread Mathieu Desnoyers

On 2023-12-12 11:53, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:

  static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
  {
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;

/* The cmpxchg always fails if it interrupted an update */
 if (!__rb_time_read(t, &val, &cnt2))
 return false;

 if (val != expect)
 return false;

 interrupted here!

 cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!

The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.

Link: 
https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/

Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 
32 bit")
Reported-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 


Reviewed-by: Mathieu Desnoyers 


---
  kernel/trace/ring_buffer.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
  
+	/* Any interruptions in this function should cause a failure */

+   cnt = local_read(&t->cnt);
+
/* The cmpxchg always fails if it interrupted an update */
 if (!__rb_time_read(t, &val, &cnt2))
 return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
 if (val != expect)
 return false;
  
-	 cnt = local_read(&t->cnt);

 if ((cnt & 3) != cnt2)
 return false;
  


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
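The ordering fix discussed above — read the sequence counter *before* the value, then re-check it — can be modeled in a few lines of C. This is a single-threaded sketch for illustration only (struct and function names are hypothetical, and no real atomics are involved):

```c
#include <assert.h>

struct rb_word {
	unsigned long cnt;
	unsigned long long val;
};

/* Snapshot the update counter before reading the value; any concurrent
 * update (e.g. from an interrupt) bumps the counter, so re-checking it
 * catches a stale read before the compare-and-exchange proceeds. */
static int time_cmpxchg(struct rb_word *w, unsigned long long expect,
			unsigned long long set)
{
	unsigned long cnt = w->cnt;		/* counter first (the fix) */
	unsigned long long val = w->val;

	if (val != expect)
		return 0;
	if (cnt != w->cnt)			/* an update slipped in */
		return 0;
	w->cnt++;
	w->val = set;
	return 1;
}
```

Reading the counter after the value, as the buggy code did, leaves a window where both value and counter change unnoticed between the two reads.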




[PATCH] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()

2023-12-12 Thread Mathieu Desnoyers
The following race can cause rb_time_read() to observe a corrupted time
stamp:

rb_time_cmpxchg()
[...]
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;

__rb_time_read()
[...]
do {
c = local_read(&t->cnt);
top = local_read(&t->top);
bottom = local_read(&t->bottom);
msb = local_read(&t->msb);
} while (c != local_read(&t->cnt));

*cnt = rb_time_cnt(top);

/* If top and msb counts don't match, this interrupted a write */
if (*cnt != rb_time_cnt(msb))
return false;
  ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.

The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: 
https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of 
__rb_time_read()")
Signed-off-by: Mathieu Desnoyers 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..71c225ca2a2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, 
unsigned long *cnt)
 
*cnt = rb_time_cnt(top);
 
-   /* If top and msb counts don't match, this interrupted a write */
-   if (*cnt != rb_time_cnt(msb))
+   /* If top, msb or bottom counts don't match, this interrupted a write */
+   if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
return false;
 
/* The shift to msb will lose its cnt bits */
-- 
2.39.2
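The all-three-counts rule from the patch above can be sketched as follows. Each of the three words (top, bottom, msb) that make up the 64-bit timestamp on 32-bit carries a 2-bit update count in its upper bits; a shift of 30 is assumed here purely for illustration:

```c
#include <assert.h>

static unsigned int cnt_bits(unsigned long w)
{
	return (w >> 30) & 3;
}

/* A read is only coherent when the 2-bit counts embedded in all three
 * words agree; checking only two of them (as before the fix) can miss
 * a word that a partially-failed cmpxchg left un-updated. */
static int read_coherent(unsigned long top, unsigned long bottom,
			 unsigned long msb)
{
	unsigned int c = cnt_bits(top);

	return cnt_bits(msb) == c && cnt_bits(bottom) == c;
}
```

As the commit message argues, checking all three fields is more robust than imposing an update order, because it also survives partial failures of rb_time_cmpxchg() interrupted by a nested rb_time_set().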




Re: [PATCH vhost v2 6/8] vdpa/mlx5: Use vq suspend/resume during .set_map

2023-12-12 Thread Eugenio Perez Martin
On Tue, Dec 5, 2023 at 11:47 AM Dragos Tatulea  wrote:
>
> Instead of tearing down and setting up vq resources, use vq
> suspend/resume during .set_map to speed things up a bit.
>
> The vq mr is updated with the new mapping while the vqs are suspended.
>
> If the device doesn't support resumable vqs, do the old teardown and
> setup dance.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 

I didn't ack it, but I'm ok with it, so:

Acked-by: Eugenio Pérez 

Thanks!

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 46 --
>  include/linux/mlx5/mlx5_ifc_vdpa.h |  1 +
>  2 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index d6c8506cec8f..6a21223d97a8 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1206,6 +1206,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
>  {
> int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
> u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
> +   struct mlx5_vdpa_dev *mvdev = &ndev->mvdev;
> bool state_change = false;
> void *obj_context;
> void *cmd_hdr;
> @@ -1255,6 +1256,24 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
> if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX)
> MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
> mvq->used_idx);
>
> +   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY) {
> +   struct mlx5_vdpa_mr *mr = 
> mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]];
> +
> +   if (mr)
> +   MLX5_SET(virtio_q, vq_ctx, virtio_q_mkey, mr->mkey);
> +   else
> +   mvq->modified_fields &= 
> ~MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY;
> +   }
> +
> +   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY) {
> +   struct mlx5_vdpa_mr *mr = 
> mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_DESC_GROUP]];
> +
> +   if (mr && MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, 
> desc_group_mkey_supported))
> +   MLX5_SET(virtio_q, vq_ctx, desc_group_mkey, mr->mkey);
> +   else
> +   mvq->modified_fields &= 
> ~MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
> +   }
> +
> MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
> mvq->modified_fields);
> err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
> if (err)
> @@ -2784,24 +2803,35 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
> *mvdev,
> unsigned int asid)
>  {
> struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
> +   bool teardown = !is_resumable(ndev);
> int err;
>
> suspend_vqs(ndev);
> -   err = save_channels_info(ndev);
> -   if (err)
> -   return err;
> +   if (teardown) {
> +   err = save_channels_info(ndev);
> +   if (err)
> +   return err;
>
> -   teardown_driver(ndev);
> +   teardown_driver(ndev);
> +   }
>
> mlx5_vdpa_update_mr(mvdev, new_mr, asid);
>
> +   for (int i = 0; i < ndev->cur_num_vqs; i++)
> +   ndev->vqs[i].modified_fields |= 
> MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY |
> +   
> MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
> +
> if (!(mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) || mvdev->suspended)
> return 0;
>
> -   restore_channels_info(ndev);
> -   err = setup_driver(mvdev);
> -   if (err)
> -   return err;
> +   if (teardown) {
> +   restore_channels_info(ndev);
> +   err = setup_driver(mvdev);
> +   if (err)
> +   return err;
> +   }
> +
> +   resume_vqs(ndev);
>
> return 0;
>  }
> diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
> b/include/linux/mlx5/mlx5_ifc_vdpa.h
> index 32e712106e68..40371c916cf9 100644
> --- a/include/linux/mlx5/mlx5_ifc_vdpa.h
> +++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
> @@ -148,6 +148,7 @@ enum {
> MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
> MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX   = (u64)1 << 7,
> MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX= (u64)1 << 8,
> +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY= (u64)1 << 11,
> MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
>  };
>
> --
> 2.42.0
>




Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-12 Thread Eugenio Perez Martin
On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea  wrote:
>
> Addresses get set by .set_vq_address. hw vq addresses will be updated on
> next modify_virtqueue.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 

I'm kind of ok with this patch and the next one about state, but I
didn't ack them in the previous series.

My main concern is that it is not valid to change the vq address after
DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are ok to
change at this moment. I'm not sure about vq state in vDPA, but vhost
forbids changing it with an active backend.

Suspend is not defined in VirtIO at this moment though, so maybe it is
ok to decide that all of these parameters may change during suspend.
Maybe the best thing is to protect this with a vDPA feature flag.

Jason, what do you think?

Thanks!

> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
>  include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index f8f088cced50..80e066de0866 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
> bool state_change = false;
> void *obj_context;
> void *cmd_hdr;
> +   void *vq_ctx;
> void *in;
> int err;
>
> @@ -1230,6 +1231,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
> MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
>
> obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
> +   vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
> virtio_q_context);
>
> if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
> if (!is_valid_state_change(mvq->fw_state, state, 
> is_resumable(ndev))) {
> @@ -1241,6 +1243,12 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
> state_change = true;
> }
>
> +   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS) {
> +   MLX5_SET64(virtio_q, vq_ctx, desc_addr, mvq->desc_addr);
> +   MLX5_SET64(virtio_q, vq_ctx, used_addr, mvq->device_addr);
> +   MLX5_SET64(virtio_q, vq_ctx, available_addr, 
> mvq->driver_addr);
> +   }
> +
> MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
> mvq->modified_fields);
> err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
> if (err)
> @@ -2202,6 +2210,7 @@ static int mlx5_vdpa_set_vq_address(struct vdpa_device 
> *vdev, u16 idx, u64 desc_
> mvq->desc_addr = desc_area;
> mvq->device_addr = device_area;
> mvq->driver_addr = driver_area;
> +   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS;
> return 0;
>  }
>
> diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
> b/include/linux/mlx5/mlx5_ifc_vdpa.h
> index b86d51a855f6..9594ac405740 100644
> --- a/include/linux/mlx5/mlx5_ifc_vdpa.h
> +++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
> @@ -145,6 +145,7 @@ enum {
> MLX5_VIRTQ_MODIFY_MASK_STATE= (u64)1 << 0,
> MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
> MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
> +   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
> MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
>  };
>
> --
> 2.42.0
>




[PATCH v4 3/3] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-12 Thread Vishal Verma
Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via
pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above
defaults if needed.

Cc: David Hildenbrand 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Dave Hansen 
Cc: Huang Ying 
Tested-by: Li Zhijian 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c   | 32 
 Documentation/ABI/testing/sysfs-bus-dax | 17 +
 2 files changed, 49 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index ce1356ac6dc2..423adee6f802 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1245,6 +1245,37 @@ static ssize_t numa_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(numa_node);
 
+static ssize_t memmap_on_memory_show(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+
+   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   ssize_t rc;
+   bool val;
+
+   rc = kstrtobool(buf, &val);
+   if (rc)
+   return rc;
+
+   guard(device)(dev);
+   if (dev_dax->memmap_on_memory != val &&
+   dax_drv->type == DAXDRV_KMEM_TYPE)
+   return -EBUSY;
+   dev_dax->memmap_on_memory = val;
+
+   return len;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int 
n)
 {
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1271,6 +1302,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
+   &dev_attr_memmap_on_memory.attr,
NULL,
 };
 
diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
index a61a7b186017..b1fd8bf8a7de 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -149,3 +149,20 @@ KernelVersion: v5.1
 Contact:   nvd...@lists.linux.dev
 Description:
(RO) The id attribute indicates the region id of a dax region.
+
+What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
+Date:  October, 2023
+KernelVersion: v6.8
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Control the memmap_on_memory setting if the dax device
+   were to be hotplugged as system memory. This determines whether
+   the 'altmap' for the hotplugged memory will be placed on the
+   device being hotplugged (memmap_on_memory=1) or if it will be
+   placed on regular memory (memmap_on_memory=0). This attribute
+   must be set before the device is handed over to the 'kmem'
+   driver (i.e. hotplugged into system-ram). Additionally, this
+   depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
+   memmap_on_memory parameter for memory_hotplug. This is
+   typically set on the kernel command line -
+   memory_hotplug.memmap_on_memory set to 'true' or 'force'.

-- 
2.41.0




[PATCH v4 2/3] dax/bus: Introduce guard(device) for device_{lock,unlock} flows

2023-12-12 Thread Vishal Verma
Introduce a guard(device) macro to lock a 'struct device', and unlock it
automatically when going out of scope using Scope Based Resource
Management semantics. A lot of the sysfs attribute writes in
drivers/dax/bus.c benefit from a cleanup using these, so change these
where applicable.

Cc: Joao Martins 
Suggested-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 include/linux/device.h |   2 +
 drivers/dax/bus.c  | 109 +++--
 2 files changed, 44 insertions(+), 67 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index d7a72a8749ea..a83efd9ae949 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1131,6 +1131,8 @@ void set_secondary_fwnode(struct device *dev, struct 
fwnode_handle *fwnode);
 void device_set_of_node_from_dev(struct device *dev, const struct device 
*dev2);
 void device_set_node(struct device *dev, struct fwnode_handle *fwnode);
 
+DEFINE_GUARD(device, struct device *, device_lock(_T), device_unlock(_T))
+
 static inline int dev_num_vf(struct device *dev)
 {
if (dev->bus && dev->bus->num_vf)
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..ce1356ac6dc2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -296,9 +296,8 @@ static ssize_t available_size_show(struct device *dev,
struct dax_region *dax_region = dev_get_drvdata(dev);
unsigned long long size;
 
-   device_lock(dev);
+   guard(device)(dev);
size = dax_region_avail_size(dax_region);
-   device_unlock(dev);
 
return sprintf(buf, "%llu\n", size);
 }
@@ -314,10 +313,9 @@ static ssize_t seed_show(struct device *dev,
if (is_static(dax_region))
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
seed = dax_region->seed;
rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : "");
-   device_unlock(dev);
 
return rc;
 }
@@ -333,10 +331,9 @@ static ssize_t create_show(struct device *dev,
if (is_static(dax_region))
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
youngest = dax_region->youngest;
rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");
-   device_unlock(dev);
 
return rc;
 }
@@ -345,7 +342,14 @@ static ssize_t create_store(struct device *dev, struct 
device_attribute *attr,
const char *buf, size_t len)
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
+   struct dev_dax_data data = {
+   .dax_region = dax_region,
+   .size = 0,
+   .id = -1,
+   .memmap_on_memory = false,
+   };
unsigned long long avail;
+   struct dev_dax *dev_dax;
ssize_t rc;
int val;
 
@@ -358,38 +362,26 @@ static ssize_t create_store(struct device *dev, struct 
device_attribute *attr,
if (val != 1)
return -EINVAL;
 
-   device_lock(dev);
+   guard(device)(dev);
avail = dax_region_avail_size(dax_region);
if (avail == 0)
-   rc = -ENOSPC;
-   else {
-   struct dev_dax_data data = {
-   .dax_region = dax_region,
-   .size = 0,
-   .id = -1,
-   .memmap_on_memory = false,
-   };
-   struct dev_dax *dev_dax = devm_create_dev_dax(&data);
+   return -ENOSPC;
 
-   if (IS_ERR(dev_dax))
-   rc = PTR_ERR(dev_dax);
-   else {
-   /*
-* In support of crafting multiple new devices
-* simultaneously multiple seeds can be created,
-* but only the first one that has not been
-* successfully bound is tracked as the region
-* seed.
-*/
-   if (!dax_region->seed)
-   dax_region->seed = &dev_dax->dev;
-   dax_region->youngest = &dev_dax->dev;
-   rc = len;
-   }
-   }
-   device_unlock(dev);
+   dev_dax = devm_create_dev_dax(&data);
+   if (IS_ERR(dev_dax))
+   return PTR_ERR(dev_dax);
 
-   return rc;
+   /*
+* In support of crafting multiple new devices
+* simultaneously multiple seeds can be created,
+* but only the first one that has not been
+* successfully bound is tracked as the region
+* seed.
+*/
+   if (!dax_region->seed)
+   dax_region->seed = &dev_dax->dev;
+   dax_region->youngest = &dev_dax->dev;
+   return len;
 }
 static DEVICE_ATTR_RW(create);
 
@@ -481,12 +473,9 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
 static int free_dev_dax_id(struct dev_dax *dev_dax)
 {
struct device *dev = &dev_dax->dev;
-   int rc;
 
-   device_lock(dev);
-   

[PATCH v4 1/3] Documentation/ABI: Add ABI documentation for sys-bus-dax

2023-12-12 Thread Vishal Verma
Add the missing sysfs ABI documentation for the device DAX subsystem.
Various ABI attributes under this have been present since v5.1, and more
have been added over time. In preparation for adding a new attribute,
add this file with the historical details.

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Documentation/ABI/testing/sysfs-bus-dax | 151 
 1 file changed, 151 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-dax 
b/Documentation/ABI/testing/sysfs-bus-dax
new file mode 100644
index ..a61a7b186017
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -0,0 +1,151 @@
+What:  /sys/bus/dax/devices/daxX.Y/align
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Provides a way to specify an alignment for a dax device.
+   Values allowed are constrained by the physical address ranges
+   that back the dax device, and also by arch requirements.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (WO) Provides a way to allocate a mapping range under a dax
+   device. Specified in the format -.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'start' attribute indicates the
+   start physical address for the given range.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'end' attribute indicates the
+   end physical address for the given range.
+
+What:  /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) A dax device may have multiple constituent discontiguous
+   address ranges. These are represented by the different
+   'mappingX' subdirectories. The 'page_offset' attribute 
indicates the
+   offset of the current range in the dax device.
+
+What:  /sys/bus/dax/devices/daxX.Y/resource
+Date:  June, 2019
+KernelVersion: v5.3
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The resource attribute indicates the starting physical
+   address of a dax device. In case of a device with multiple
+   constituent ranges, it indicates the starting address of the
+   first range.
+
+What:  /sys/bus/dax/devices/daxX.Y/size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) The size attribute indicates the total size of a dax
+   device. For creating subdivided dax devices, or for resizing
+   an existing device, the new size can be written to this as
+   part of the reconfiguration process.
+
+What:  /sys/bus/dax/devices/daxX.Y/numa_node
+Date:  November, 2019
+KernelVersion: v5.5
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) If NUMA is enabled and the platform has affinitized the
+   backing device for this dax device, emit the CPU node
+   affinity for this device.
+
+What:  /sys/bus/dax/devices/daxX.Y/target_node
+Date:  February, 2019
+KernelVersion: v5.1
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The target-node attribute is the Linux numa-node that a
+   device-dax instance may create when it is online. Prior to
+   being online the device's 'numa_node' property reflects the
+   closest online cpu node which is the typical expectation of a
+   device 'numa_node'. Once it is online it becomes its own
+   distinct numa node.
+
+What:  $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
+Date:  October, 2020
+KernelVersion: v5.10
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RO) The available_size attribute tracks available dax region
+   capacity. This only applies to volatile hmem devices, not pmem
+   devices, since pmem devices are defined by nvdimm namespace
+   boundaries.
+
+What:  $(readlink -f 

[PATCH v4 0/3] Add DAX ABI for memmap_on_memory

2023-12-12 Thread Vishal Verma
The DAX drivers were missing sysfs ABI documentation entirely.  Add this
missing documentation for the sysfs ABI for DAX regions and Dax devices
in patch 1. Define guard(device) semantics for Scope Based Resource
Management for device_lock, and convert device_{lock,unlock} flows in
drivers/dax/bus.c to use this in patch 2. Add a new ABI for toggling
memmap_on_memory semantics in patch 3.

The missing ABI was spotted in [1], this series is a split of the new
ABI additions behind the initial documentation creation.

[1]: 
https://lore.kernel.org/linux-cxl/651f27b728fef_ae7e729...@dwillia2-xfh.jf.intel.com.notmuch/

Changes in v4:
- Hold the device lock when checking if the dax_dev is bound to kmem
  (Ying, Dan)
- Remove dax region checks (and locks) as they were unnecessary.
- Introduce guard(device) for device lock/unlock (Dan)
- Convert the rest of drivers/dax/bus.c to guard(device)
- Link to v3: 
https://lore.kernel.org/r/20231211-vv-dax_abi-v3-0-acf6cc1bd...@intel.com

Changes in v3:
- Fix typo in ABI docs (Zhijian Li)
- Add kernel config and module parameter dependencies to the ABI docs
  entry (David Hildenbrand)
- Ensure kmem isn't active when setting the sysfs attribute (Ying
  Huang)
- Simplify returning from memmap_on_memory_store()
- Link to v2: 
https://lore.kernel.org/r/20231206-vv-dax_abi-v2-0-f4f4f2336...@intel.com

Changes in v2:
- Fix CC lists, patch 1/2 didn't get sent correctly in v1
- Link to v1: 
https://lore.kernel.org/r/20231206-vv-dax_abi-v1-0-474eb88e2...@intel.com

---
Vishal Verma (3):
  Documentation/ABI: Add ABI documentation for sys-bus-dax
  dax/bus: Introduce guard(device) for device_{lock,unlock} flows
  dax: add a sysfs knob to control memmap_on_memory behavior

 include/linux/device.h  |   2 +
 drivers/dax/bus.c   | 141 ++-
 Documentation/ABI/testing/sysfs-bus-dax | 168 
 3 files changed, 244 insertions(+), 67 deletions(-)
---
base-commit: c4e1ccfad42352918810802095a8ace8d1c744c9
change-id: 20231025-vv-dax_abi-17a219c46076

Best regards,
-- 
Vishal Verma 




Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory

2023-12-12 Thread Rob Herring
On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
 wrote:
>
> Hi Rob,
>
> Thank you so much for the feedback, I'm not very familiar with device tree,
> and any comments are very useful.
>
> On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> >  wrote:
> > >
> > > Allow the kernel to get the size and location of the MTE tag storage
> > > regions from the DTB. This memory is marked as reserved for now.
> > >
> > > The DTB node for the tag storage region is defined as:
> > >
> > > tags0: tag-storage@8f800 {
> > > compatible = "arm,mte-tag-storage";
> > > reg = <0x08 0xf800 0x00 0x400>;
> > > block-size = <0x1000>;
> > > memory = <&memory0>;   // Associated tagged memory node
> > > };
> >
> > I skimmed thru the discussion some. If this memory range is within
> > main RAM, then it definitely belongs in /reserved-memory.
>
> Ok, will do that.
>
> If you don't mind, why do you say that it definitely belongs in
> reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> motivation.

Simply so that /memory nodes describe all possible memory and
/reserved-memory is just adding restrictions. It's also because
/reserved-memory is what gets handled early, and we don't need
multiple things to handle early.

> Tag storage is not DMA and can live anywhere in memory.

Then why put it in DT at all? The only reason CMA is there is to set
the size. It's not even clear to me we need CMA in DT either. The
reasoning long ago was the kernel didn't do a good job of moving and
reclaiming contiguous space, but that's supposed to be better now (and
most h/w figured out they need IOMMUs).

But for tag storage you know the size as it is a function of the
memory size, right? After all, you are validating the size is correct.
I guess there is still the aspect of whether you want enable MTE or
not which could be done in a variety of ways.

> In
> arm64_memblock_init(), the kernel first removes the memory that it cannot
> address from memblock. For example, because it has been compiled with
> CONFIG_ARM64_VA_BITS_39=y. And then calls
> early_init_fdt_scan_reserved_mem().
>
> What happens if reserved memory is above what the kernel can address?

I would hope the kernel handles it. That's the kernel's problem unless
there's some h/w limitation to access some region. The DT can't have
things dependent on the kernel config.

> From my testing, when the kernel is compiled with 39 bit VA, if I use
> reserved memory to discover tag storage that lives above the virtual address
> limit and then I try to use CMA to manage the tag storage memory, I get a
> kernel panic:

Looks like we should handle that better...

> [0.00] Reserved memory: created CMA memory pool at 0x0100, size 64 MiB
> [0.00] OF: reserved mem: initialized node linux,cma, compatible id 
> shared-dma-pool
> [0.00] OF: reserved mem: 0x0100..0x010003ff 
> (65536 KiB) map reusable linux,cma
> [..]
> [0.806945] Unable to handle kernel paging request at virtual address 
> 0001fe00
> [0.807277] Mem abort info:
> [0.807277]   ESR = 0x9605
> [0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
> [0.808110]   SET = 0, FnV = 0
> [0.808443]   EA = 0, S1PTW = 0
> [0.808526]   FSC = 0x05: level 1 translation fault
> [0.808943] Data abort info:
> [0.808943]   ISV = 0, ISS = 0x0005, ISS2 = 0x
> [0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [0.810221] [0001fe00] user address but active_mm is swapper
> [..]
> [0.820887] Call trace:
> [0.821027]  cma_init_reserved_areas+0xc4/0x378
>
> >
> > You need a binding for this too.
>
> By binding you mean having a yaml file in dt-schema [1] describing the tag
> storage node, right?

Yes, but in the kernel tree is fine.

[...]

> > > +static int __init tag_storage_of_flat_get_range(unsigned long node, 
> > > const __be32 *reg,
> > > +   int reg_len, struct range 
> > > *range)
> > > +{
> > > +   int addr_cells = dt_root_addr_cells;
> > > +   int size_cells = dt_root_size_cells;
> > > +   u64 size;
> > > +
> > > +   if (reg_len / 4 > addr_cells + size_cells)
> > > +   return -EINVAL;
> > > +
> > > +   range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > +   size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > +   if (size == 0) {
> > > +   pr_err("Invalid node");
> > > +   return -EINVAL;
> > > +   }
> > > +   range->end = range->start + size - 1;
> >
> > We have a function to read (and translate which you forgot) addresses.
> > Add what's missing rather than open code your own.
>
> I must have missed that there's already a function to 

Re: [PATCH vhost v2 8/8] vdpa/mlx5: Add mkey leak detection

2023-12-12 Thread Eugenio Perez Martin
On Tue, Dec 5, 2023 at 11:47 AM Dragos Tatulea  wrote:
>
> Track allocated mrs in a list and show warning when leaks are detected
> on device free or reset.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Gal Pressman 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  2 ++
>  drivers/vdpa/mlx5/core/mr.c| 23 +++
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  |  2 ++
>  3 files changed, 27 insertions(+)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
> b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index 1a0d27b6e09a..50aac8fe57ef 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -37,6 +37,7 @@ struct mlx5_vdpa_mr {
> bool user_mr;
>
> refcount_t refcount;
> +   struct list_head mr_list;
>  };
>
>  struct mlx5_vdpa_resources {
> @@ -95,6 +96,7 @@ struct mlx5_vdpa_dev {
> u32 generation;
>
> struct mlx5_vdpa_mr *mr[MLX5_VDPA_NUM_AS];
> +   struct list_head mr_list_head;
> /* serialize mr access */
> struct mutex mr_mtx;
> struct mlx5_control_vq cvq;
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index c7dc8914354a..4758914ccf86 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -508,6 +508,8 @@ static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev 
> *mvdev, struct mlx5_vdpa_
>
> vhost_iotlb_free(mr->iotlb);
>
> +   list_del(&mr->mr_list);
> +
> kfree(mr);
>  }
>
> @@ -560,12 +562,31 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
> mutex_unlock(&mvdev->mr_mtx);
>  }
>
> +static void mlx5_vdpa_show_mr_leaks(struct mlx5_vdpa_dev *mvdev)
> +{
> +   struct mlx5_vdpa_mr *mr;
> +
> +   mutex_lock(&mvdev->mr_mtx);
> +
> +   list_for_each_entry(mr, &mvdev->mr_list_head, mr_list) {
> +
> +   mlx5_vdpa_warn(mvdev, "mkey still alive after resource delete: "
> +  "mr: %p, mkey: 0x%x, refcount: %u\n",
> +  mr, mr->mkey, refcount_read(&mr->refcount));
> +   }
> +
> +   mutex_unlock(&mvdev->mr_mtx);
> +
> +}
> +
>  void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
>  {
> for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
> mlx5_vdpa_update_mr(mvdev, NULL, i);
>
> prune_iotlb(mvdev->cvq.iotlb);
> +
> +   mlx5_vdpa_show_mr_leaks(mvdev);
>  }
>
>  static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
> @@ -592,6 +613,8 @@ static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev 
> *mvdev,
> if (err)
> goto err_iotlb;
>
> +   list_add_tail(&mr->mr_list, &mvdev->mr_list_head);
> +
> return 0;
>
>  err_iotlb:
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 133cbb66dcfe..778821bab7d9 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3722,6 +3722,8 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
> *v_mdev, const char *name,
> if (err)
> goto err_mpfs;
>
> +   INIT_LIST_HEAD(&mvdev->mr_list_head);
> +
> if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
> err = mlx5_vdpa_create_dma_mr(mvdev);
> if (err)
> --
> 2.42.0
>




Re: [PATCH vhost v2 7/8] vdpa/mlx5: Introduce reference counting to mrs

2023-12-12 Thread Eugenio Perez Martin
On Tue, Dec 5, 2023 at 11:47 AM Dragos Tatulea  wrote:
>
> Deleting the old mr during mr update (.set_map) and then modifying the
> vqs with the new mr is not a good flow for firmware. The firmware
> expects that mkeys are deleted after there are no more vqs referencing
> them.
>
> Introduce reference counting for mrs to fix this. It is the only way to
> make sure that mkeys are not in use by vqs.
>
> An mr reference is taken when the mr is associated to the mr asid table
> and when the mr is linked to the vq on create/modify. The reference is
> released when the mkey is unlinked from the vq (trough modify/destroy)
> and from the mr asid table.
>
> To make things consistent, get rid of mlx5_vdpa_destroy_mr and use
> get/put semantics everywhere.
>
> Signed-off-by: Dragos Tatulea 
> Reviewed-by: Gal Pressman 

Acked-by: Eugenio Pérez 

> ---
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  8 +++--
>  drivers/vdpa/mlx5/core/mr.c| 50 --
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 45 ++-
>  3 files changed, 78 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
> b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> index 84547d998bcf..1a0d27b6e09a 100644
> --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
> @@ -35,6 +35,8 @@ struct mlx5_vdpa_mr {
> struct vhost_iotlb *iotlb;
>
> bool user_mr;
> +
> +   refcount_t refcount;
>  };
>
>  struct mlx5_vdpa_resources {
> @@ -118,8 +120,10 @@ int mlx5_vdpa_destroy_mkey(struct mlx5_vdpa_dev *mvdev, 
> u32 mkey);
>  struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
>  struct vhost_iotlb *iotlb);
>  void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev);
> -void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
> - struct mlx5_vdpa_mr *mr);
> +void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr);
> +void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr);
>  void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
>  struct mlx5_vdpa_mr *mr,
>  unsigned int asid);
> diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
> index 2197c46e563a..c7dc8914354a 100644
> --- a/drivers/vdpa/mlx5/core/mr.c
> +++ b/drivers/vdpa/mlx5/core/mr.c
> @@ -498,32 +498,52 @@ static void destroy_user_mr(struct mlx5_vdpa_dev 
> *mvdev, struct mlx5_vdpa_mr *mr
>
>  static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, struct 
> mlx5_vdpa_mr *mr)
>  {
> +   if (WARN_ON(!mr))
> +   return;
> +
> if (mr->user_mr)
> destroy_user_mr(mvdev, mr);
> else
> destroy_dma_mr(mvdev, mr);
>
> vhost_iotlb_free(mr->iotlb);
> +
> +   kfree(mr);
>  }
>
> -void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
> - struct mlx5_vdpa_mr *mr)
> +static void _mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr)
>  {
> if (!mr)
> return;
>
> +   if (refcount_dec_and_test(&mr->refcount))
> +   _mlx5_vdpa_destroy_mr(mvdev, mr);
> +}
> +
> +void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr)
> +{
> > mutex_lock(&mvdev->mr_mtx);
> > +   _mlx5_vdpa_put_mr(mvdev, mr);
> > +   mutex_unlock(&mvdev->mr_mtx);
> +}
>
> -   _mlx5_vdpa_destroy_mr(mvdev, mr);
> +static void _mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr)
> +{
> +   if (!mr)
> +   return;
>
> -   for (int i = 0; i < MLX5_VDPA_NUM_AS; i++) {
> -   if (mvdev->mr[i] == mr)
> -   mvdev->mr[i] = NULL;
> -   }
> +   refcount_inc(&mr->refcount);
> +}
>
> +void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
> + struct mlx5_vdpa_mr *mr)
> +{
> > +   mutex_lock(&mvdev->mr_mtx);
> > +   _mlx5_vdpa_get_mr(mvdev, mr);
> > mutex_unlock(&mvdev->mr_mtx);
> -
> -   kfree(mr);
>  }
>
>  void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
> @@ -534,20 +554,16 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
>
> > mutex_lock(&mvdev->mr_mtx);
>
> +   _mlx5_vdpa_put_mr(mvdev, old_mr);
> mvdev->mr[asid] = new_mr;
> -   if (old_mr) {
> -   _mlx5_vdpa_destroy_mr(mvdev, old_mr);
> -   kfree(old_mr);
> -   }
>
> > mutex_unlock(&mvdev->mr_mtx);
> -
>  }
>
>  void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
>  {
> for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
> -   mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[i]);
> +   mlx5_vdpa_update_mr(mvdev, NULL, i);
>
> prune_iotlb(mvdev->cvq.iotlb);
>  }
> @@ -607,6 +623,8 @@ struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct 
> 

[PATCH v4] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow a trace write to be as big as the ring buffer tracing data will
allow. Currently, it only allows writes of 1KB in size, but there's no
reason that it cannot allow what the ring buffer can hold.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v3: 
https://lore.kernel.org/linux-trace-kernel/20231212110332.6fca5...@gandalf.local.home

- No greated cheese. (Mathieu Desnoyers)

 include/linux/ring_buffer.h |  1 +
 kernel/trace/ring_buffer.c  | 15 +++
 kernel/trace/trace.c| 31 ---
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..b1b03b2c0f08 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -141,6 +141,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
 bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter);
 
 unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu);
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer);
 
 void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu);
 void ring_buffer_reset_online_cpus(struct trace_buffer *buffer);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6b82c3398938..087f0f6b3409 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5255,6 +5255,21 @@ unsigned long ring_buffer_size(struct trace_buffer 
*buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_size);
 
+/**
+ * ring_buffer_max_event_size - return the max data size of an event
+ * @buffer: The ring buffer.
+ *
+ * Returns the maximum size an event can be.
+ */
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer)
+{
+   /* If abs timestamp is requested, events have a timestamp too */
+   if (ring_buffer_time_stamp_abs(buffer))
+   return BUF_MAX_DATA_SIZE - RB_LEN_TIME_EXTEND;
+   return BUF_MAX_DATA_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_max_event_size);
+
 static void rb_clear_buffer_page(struct buffer_page *page)
 {
 local_set(&page->write, 0);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ef86379555e4..a359783fede8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7272,8 +7272,9 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
enum event_trigger_type tt = ETT_NONE;
struct trace_buffer *buffer;
struct print_entry *entry;
+   int meta_size;
ssize_t written;
-   int size;
+   size_t size;
int len;
 
 /* Used in tracing_mark_raw_write() as well */
@@ -7286,12 +7287,12 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if (!(tr->trace_flags & TRACE_ITER_MARKERS))
return -EINVAL;
 
-   if (cnt > TRACE_BUF_SIZE)
-   cnt = TRACE_BUF_SIZE;
-
-   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
+   if ((ssize_t)cnt < 0)
+   return -EINVAL;
 
-   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
+   meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
+ again:
+   size = cnt + meta_size;
 
 /* If less than "<faulted>", then make sure we can still add that */
if (cnt < FAULTED_SIZE)
@@ -7300,9 +7301,25 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());
-   if (unlikely(!event))
+   if (unlikely(!event)) {
+   /*
+* If the size was greater than what was allowed, then
+* make it smaller and try again.
+*/
+   if (size > ring_buffer_max_event_size(buffer)) {
+   /* cnt < FAULTED size should never be bigger than max */
+   if (WARN_ON_ONCE(cnt < FAULTED_SIZE))
+   return -EBADF;
+   cnt = ring_buffer_max_event_size(buffer) - meta_size;
+   /* The above should only happen once */
+   if (WARN_ON_ONCE(cnt + meta_size == size))
+   return -EBADF;
+   goto again;
+   }
+
/* Ring buffer disabled, return as if not open for write */
return -EBADF;
+   }
 
entry = ring_buffer_event_data(event);
entry->ip = _THIS_IP_;
-- 
2.42.0
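[Editor's sketch] The clamp-and-retry logic in the hunk above (shrink the write to what the ring buffer can hold, then retry the reservation once) can be sketched in plain Python. `MAX_EVENT_SIZE` and `META_SIZE` are made-up stand-ins for `ring_buffer_max_event_size()` and `sizeof(*entry) + 2`; this is an illustration of the control flow, not kernel code:

```python
# Illustrative sketch of the clamp-and-retry in tracing_mark_write():
# if a reservation fails because the request exceeds the ring buffer's
# max event size, shrink the write to fit and try once more.
MAX_EVENT_SIZE = 4000   # stand-in for ring_buffer_max_event_size()
META_SIZE = 10          # stand-in for sizeof(*entry) + 2

def reserve(size):
    """Pretend buffer reservation: fails when the event is too large."""
    return size if size <= MAX_EVENT_SIZE else None

def mark_write(cnt):
    size = cnt + META_SIZE
    event = reserve(size)
    if event is None:
        if size > MAX_EVENT_SIZE:
            cnt = MAX_EVENT_SIZE - META_SIZE   # clamp to what fits
            if cnt + META_SIZE == size:        # should only shrink once
                return -1                      # models WARN_ON_ONCE + -EBADF
            return mark_write(cnt)             # the "goto again" retry
        return -1                              # buffer disabled: -EBADF
    return cnt                                 # bytes actually written

print(mark_write(100))     # 100: fits on the first try
print(mark_write(100000))  # 3990: clamped to MAX_EVENT_SIZE - META_SIZE
```

The `cnt + META_SIZE == size` guard mirrors the patch's second WARN_ON_ONCE: the clamp must actually shrink the request, otherwise the retry would loop forever.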




Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 08:33:39AM -0800, Casey Schaufler wrote:
> On 12/12/2023 5:17 AM, Maxime Coquelin wrote:
> > This patch introduces a LSM hook for devices creation,
> > destruction (ioctl()) and opening (open()) operations,
> > checking the application is allowed to perform these
> > operations for the Virtio device type.
> 
> My earlier comments on a vduse specific LSM hook still hold.
> I would much prefer to see a device permissions hook(s) that
> are useful for devices in general. Not just vduse devices.
> I know that there are already some very special purpose LSM
> hooks, but the experience with maintaining them is why I don't
> want more of them. 

What exactly does this mean? Devices like tap etc? How do we
find them all though?

> >
> > Signed-off-by: Maxime Coquelin 
> > ---
> >  MAINTAINERS |  1 +
> >  drivers/vdpa/vdpa_user/vduse_dev.c  | 13 
> >  include/linux/lsm_hook_defs.h   |  2 ++
> >  include/linux/security.h|  6 ++
> >  include/linux/vduse.h   | 14 +
> >  security/security.c | 15 ++
> >  security/selinux/hooks.c| 32 +
> >  security/selinux/include/classmap.h |  2 ++
> >  8 files changed, 85 insertions(+)
> >  create mode 100644 include/linux/vduse.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index a0fb0df07b43..4e83b14358d2 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -23040,6 +23040,7 @@ F:  drivers/net/virtio_net.c
> >  F: drivers/vdpa/
> >  F: drivers/virtio/
> >  F: include/linux/vdpa.h
> > +F: include/linux/vduse.h
> >  F: include/linux/virtio*.h
> >  F: include/linux/vringh.h
> >  F: include/uapi/linux/virtio_*.h
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index fa62825be378..59ab7eb62e20 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -8,6 +8,7 @@
> >   *
> >   */
> >  
> > +#include "linux/security.h"
> >  #include 
> >  #include 
> >  #include 
> > @@ -30,6 +31,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/vduse.h>
> >  
> >  #include "iova_domain.h"
> >  
> > @@ -1442,6 +1444,10 @@ static int vduse_dev_open(struct inode *inode, 
> > struct file *file)
> > if (dev->connected)
> > goto unlock;
> >  
> > +   ret = -EPERM;
> > +   if (security_vduse_perm_check(VDUSE_PERM_OPEN, dev->device_id))
> > +   goto unlock;
> > +
> > ret = 0;
> > dev->connected = true;
> > file->private_data = dev;
> > @@ -1664,6 +1670,9 @@ static int vduse_destroy_dev(char *name)
> > if (!dev)
> > return -EINVAL;
> >  
> > +   if (security_vduse_perm_check(VDUSE_PERM_DESTROY, dev->device_id))
> > +   return -EPERM;
> > +
> > mutex_lock(&dev->lock);
> > if (dev->vdev || dev->connected) {
> > mutex_unlock(&dev->lock);
> > @@ -1828,6 +1837,10 @@ static int vduse_create_dev(struct vduse_dev_config 
> > *config,
> > int ret;
> > struct vduse_dev *dev;
> >  
> > +   ret = -EPERM;
> > +   if (security_vduse_perm_check(VDUSE_PERM_CREATE, config->device_id))
> > +   goto err;
> > +
> > ret = -EEXIST;
> > if (vduse_find_dev(config->name))
> > goto err;
> > diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> > index ff217a5ce552..3930ab2ae974 100644
> > --- a/include/linux/lsm_hook_defs.h
> > +++ b/include/linux/lsm_hook_defs.h
> > @@ -419,3 +419,5 @@ LSM_HOOK(int, 0, uring_override_creds, const struct 
> > cred *new)
> >  LSM_HOOK(int, 0, uring_sqpoll, void)
> >  LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
> >  #endif /* CONFIG_IO_URING */
> > +
> > +LSM_HOOK(int, 0, vduse_perm_check, enum vduse_op_perm op_perm, u32 
> > device_id)
> > diff --git a/include/linux/security.h b/include/linux/security.h
> > index 1d1df326c881..2a2054172394 100644
> > --- a/include/linux/security.h
> > +++ b/include/linux/security.h
> > @@ -32,6 +32,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/vduse.h>
> >  
> >  struct linux_binprm;
> >  struct cred;
> > @@ -484,6 +485,7 @@ int security_inode_notifysecctx(struct inode *inode, 
> > void *ctx, u32 ctxlen);
> >  int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
> >  int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
> >  int security_locked_down(enum lockdown_reason what);
> > +int security_vduse_perm_check(enum vduse_op_perm op_perm, u32 device_id);
> >  #else /* CONFIG_SECURITY */
> >  
> >  static inline int call_blocking_lsm_notifier(enum lsm_event event, void 
> > *data)
> > @@ -1395,6 +1397,10 @@ static inline int security_locked_down(enum 
> > lockdown_reason what)
> >  {
> > return 0;
> >  }
> > +static inline int security_vduse_perm_check(enum vduse_op_perm op_perm, 
> > u32 device_id)
> > +{
> > +   return 0;
> > +}
> >  #endif /* CONFIG_SECURITY */
> >  
> >  #if 

Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-12 Thread Arseniy Krasnov



On 12.12.2023 19:12, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
>>> On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
 Hello,

DESCRIPTION

 This patchset fixes old problem with hungup of both rx/tx sides and adds
 test for it. This happens due to non-default SO_RCVLOWAT value and
 deferred credit update in virtio/vsock. Link to previous old patchset:
 https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
>>>
>>>
>>> Patchset:
>>>
>>> Acked-by: Michael S. Tsirkin 
>>
>> Thanks!
>>
>>>
>>>
>>> But I worry whether we actually need 3/8 in net not in net-next.
>>
>> Because of the "Fixes" tag? I think this problem is not critical and is
>> reproducible only in special cases, but I'm not familiar enough with the
>> netdev process, so I don't have a strong opinion. I guess @Stefano knows
>> better.
>>
>> Thanks, Arseniy
> 
> Fixes means "if you have that other commit then you need this commit
> too". I think as a minimum you need to rearrange patches to make the
> fix go in first. We don't want a regression followed by a fix.

I see, ok, @Stefano WDYT? I think rearranging doesn't break anything, because
this patch fixes a problem that is not related to the new patches in this
patchset.

Thanks, Arseniy

> 
>>>
>>> Thanks!
>>>
 Here is what happens step by step:

   TEST

 INITIAL CONDITIONS

 1) Vsock buffer size is 128KB.
 2) Maximum packet size is also 64KB as defined in header (yes it is
hardcoded, just to remind about that value).
 3) SO_RCVLOWAT is default, e.g. 1 byte.


  STEPS

 SENDER  RECEIVER
 1) sends 128KB + 1 byte in a
single buffer. 128KB will
be sent, but for 1 byte
sender will wait for free
space at peer. Sender goes
to sleep.


 2) reads 64KB, credit update not sent
 3) sets SO_RCVLOWAT to 64KB + 1
 4) poll() -> wait forever, there is
only 64KB available to read.

 So in step 4) receiver also goes to sleep, waiting for enough data or
 connection shutdown message from the sender. Idea to fix it is that rx
 kicks tx side to continue transmission (and may be close connection)
 when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
 this value is bigger than number of available bytes to read.

 I've added small test for this, but not sure as it uses hardcoded value
 for maximum packet length, this value is defined in kernel header and
 used to control deferred credit update. And as this is not available to
 userspace, I can't control test parameters correctly (if one day this
 define will be changed - test may become useless). 
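[Editor's sketch] For readers unfamiliar with the knob at the center of this deadlock: SO_RCVLOWAT is a generic SOL_SOCKET option on Linux, settable on any socket type, and poll()/read() will not wake the reader until at least that many bytes are queued. A minimal, purely illustrative Python sketch of step 3) above (a local socketpair stands in for the vsock connection):

```python
import socket

# Sketch of the SO_RCVLOWAT knob from the scenario above: the receiver
# asks not to be woken by poll() until this many bytes are available.
# A socketpair keeps the example self-contained; the real test uses vsock.
rx, tx = socket.socketpair()

threshold = 64 * 1024 + 1          # 64KB + 1, as in step 3) above
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, threshold)

# The kernel stores the value verbatim; poll() on rx would now block until
# `threshold` bytes are queued -- which is exactly where the pre-patch
# deadlock sits if the sender is never sent a credit update.
print(rx.getsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT))  # 65537

rx.close(); tx.close()
```

With only 64KB readable (step 2) and the threshold at 64KB + 1, the receiver sleeps in poll() while the sender sleeps waiting for credit: both sides are stuck until rx proactively kicks tx, which is what this patchset adds.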

 Head for this patchset is:
 https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=021b0c952f226236f2edf89c737efb9a28d1422d

 Link to v1:
 https://lore.kernel.org/netdev/20231108072004.1045669-1-avkras...@salutedevices.com/
 Link to v2:
 https://lore.kernel.org/netdev/20231119204922.2251912-1-avkras...@salutedevices.com/
 Link to v3:
 https://lore.kernel.org/netdev/20231122180510.2297075-1-avkras...@salutedevices.com/
 Link to v4:
 https://lore.kernel.org/netdev/20231129212519.2938875-1-avkras...@salutedevices.com/
 Link to v5:
 https://lore.kernel.org/netdev/20231130130840.253733-1-avkras...@salutedevices.com/
 Link to v6:
 https://lore.kernel.org/netdev/20231205064806.2851305-1-avkras...@salutedevices.com/
 Link to v7:
 https://lore.kernel.org/netdev/20231206211849.2707151-1-avkras...@salutedevices.com/

 Changelog:
 v1 -> v2:
  * Patchset rebased and tested on new HEAD of net-next (see hash above).
  * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
callback in 'af_vsock.c' when transport callback is set - with that
we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
not copy-paste it to every transport. It was discussed in v1.
  * See per-patch changelog after ---.
 v2 -> v3:
  * See changelog after --- in 0003 only (0001 and 0002 still same).
 v3 -> v4:
  * Patchset rebased and tested on new HEAD of net-next (see hash above).
  * See per-patch changelog after ---.
 v4 -> v5:
  * Change patchset tag 'RFC' -> 'net-next'.
  * See per-patch changelog after ---.
 v5 -> v6:
  * New patch 0003 which sends credit update during reading bytes from
socket.
  * See per-patch 

Re: [PATCH net-next v8 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-12 Thread Arseniy Krasnov



On 12.12.2023 19:11, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 06:50:39PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
>>> On Tue, Dec 12, 2023 at 12:16:57AM +0300, Arseniy Krasnov wrote:
 Add one more condition for sending credit update during dequeue from
 stream socket: when number of bytes in the rx queue is smaller than
 SO_RCVLOWAT value of the socket. This is actual for non-default value
 of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
 transmission, because we need at least SO_RCVLOWAT bytes in our rx
 queue to wake up user for reading data (in corner case it is also
 possible to stuck both tx and rx sides, this is why 'Fixes' is used).
>>>
>>> I don't get what does "to stuck both tx and rx sides" mean.
>>
>> I meant situation when tx waits for the free space, while rx doesn't send
>> credit update, just waiting for more data. Sorry for my English :)
>>
>>> Besides being agrammatical, is there a way to do this without
>>> playing with SO_RCVLOWAT?
>>
>> No, this may happen only with non-default SO_RCVLOWAT values (e.g. != 1)
>>
>> Thanks, Arseniy 
> 
> I am split on whether we need the Fixes tag. I guess if the other side
> is vhost with SO_RCVLOWAT then it might be stuck and it might apply
> without SO_RCVLOWAT on the local kernel?

IIUC your question, then this problem applies to any transport: g2h, h2g and
loopback.

Thanks, Arseniy

> 
> 
>>>

 Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
 Signed-off-by: Arseniy Krasnov 
 ---
  Changelog:
  v6 -> v7:
   * Handle wrap of 'fwd_cnt'.
   * Do to send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
  v7 -> v8:
   * Remove unneeded/wrong handling of wrap for 'fwd_cnt'.

  net/vmw_vsock/virtio_transport_common.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

 diff --git a/net/vmw_vsock/virtio_transport_common.c 
 b/net/vmw_vsock/virtio_transport_common.c
 index e137d740804e..8572f94bba88 100644
 --- a/net/vmw_vsock/virtio_transport_common.c
 +++ b/net/vmw_vsock/virtio_transport_common.c
 @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
struct virtio_vsock_sock *vvs = vsk->trans;
size_t bytes, total = 0;
struct sk_buff *skb;
 +  u32 fwd_cnt_delta;
 +  bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;
  
 @@ -601,7 +603,10 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
}
}
  
 -  free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
 +  fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
 +  free_space = vvs->buf_alloc - fwd_cnt_delta;
 +  low_rx_bytes = (vvs->rx_bytes <
 +  sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
  
 spin_unlock_bh(&vvs->rx_lock);
  
 @@ -611,9 +616,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
 *vsk,
 * too high causes extra messages. Too low causes transmitter
 * stalls. As stalls are in theory more expensive than extra
 * messages, we set the limit to a high value. TODO: experiment
 -   * with different values.
 +   * with different values. Also send credit update message when
 +   * number of bytes in rx queue is not enough to wake up reader.
 */
 -  if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
 +  if (fwd_cnt_delta &&
 +  (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
virtio_transport_send_credit_update(vsk);
  
return total;
 -- 
 2.25.1
>>>
> 



Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 11:49:20 -0500
Mathieu Desnoyers  wrote:


> >> So the old "bottom" value is returned, which is wrong.  
> > 
> > Ah, OK that makes more sense. Yeah, if I had the three words from the
> > beginning, I would have tested to make sure they all match and not just the
> > two :-p  
> 
> Technically just checking that the very first and last words which are
> updated by set/cmpxchg have the same cnt bits would suffice. Because
> this is just a scenario of __rb_time_read interrupting an update, the
> updates in between are fine if the first/last words to be updated have
> the same cnt.

Correct, but I'm paranoid ;-)


> >> rb_time_cmpxchg()
> >> [...]
> >>   /* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
> >>return false;
> >>
> >>if (val != expect)
> >>return false;
> >> 
> >> 
cnt = local_read(&t->cnt);
> >>if ((cnt & 3) != cnt2)
> >>return false;
> >>^ here (cnt & 3) == cnt2, but @val contains outdated data. This
> >>  means the piecewise rb_time_read_cmpxchg() that follow will
> >>  derive expected values from the outdated @val.  
> > 
> > Ah. Of course this would be fixed if we did the local_read(>cnt)
> > *before* everything else.  
> 
> But then we could be interrupted after that initial read, before reading
> val. I suspect we'd want to propagate the full 32-bit cnt that was
> read by __rb_time_read() to the caller, which is not the case today.
> With that, we would not have to read it again in rb_time_cmpxchg.

Not an issue, because before we do *any* update, we have:

 if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
return false;

Which does check the full 32 bits of cnt, and would fail if there was an
interruption, and everything before this point would be tossed.

> 
> It does leave the issue of having only 2 bits in the msb, top, bottom
> fields to detect races, which are subject to overflow.

Yeah, but requires all the bits to be exactly the same. Which is extremely
unlikely. But to fix that, I think the context bits and the toggle bit is
sufficient. Something for a KTODO.

> 
> >   
> >>
> >>cnt2 = cnt + 1;
> >>
> >>rb_time_split(val, &top, &bottom, &msb);
> >>top = rb_time_val_cnt(top, cnt);
> >>bottom = rb_time_val_cnt(bottom, cnt);
> >>   ^ top, bottom, and msb contain outdated data, which do not
> >> match cnt due to 2-bit overflow.
> >>
> >>rb_time_split(set, &top2, &bottom2, &msb2);
> >>top2 = rb_time_val_cnt(top2, cnt2);
> >>bottom2 = rb_time_val_cnt(bottom2, cnt2);
> >>
> >>   if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
> >>   return false;
> >>   ^ This @cnt cmpxchg succeeds because it uses the re-read cnt
> >> is used as expected value.  

And now it doesn't succeed! And here we exit if there was any interruption
between reading cnt and this cmpxchg.

> > 
> > Sure. And I believe you did find another bug. If we read the cnt first,
> > before reading val, then it would not be outdated.  
> 
> As stated above, I suspect we'd run into other issues if interrupted
> between read of cnt and reading val. Propagating the full 32-bit cnt
> value read from __rb_time_read() to the caller would be better I think.

And as I stated above, we do check the full cnt before we update anything.
So that would not cause any other race issue.

> 
> >   
> >>   if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
> >>   return false;
> >>   if (!rb_time_read_cmpxchg(&t->top, top, top2))
> >>   return false;
> >>   if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
> >>   return false;
> >>   ^ these cmpxchg have just used the outdated @val as expected
> >> values, even though the content of the rb_time was modified
> >> by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This
> >> means those cmpxchg can fail not only due to being interrupted
> >> by another write or cmpxchg, but also simply due to expected
> >> value mismatch in any of the fields, which will then cause  
> > 
> > Yes, it is expected that this will fail for being interrupted at any time
> > during this operation. So it can only fail for being interrupted. How else
> > would the value be mismatched if this function had not been interrupted?
> 
> Each of those cmpxchg can fail due to rb_time_cmpxchg() being interrupted
> 4 times by writes/cmpxchg _before the re-load of the cnt field_: this causes
> the 2-bit cnt to match, but each of the sub-fields may not match
> anymore, which can cause a situation where the rb_time_cmpxchg() only
> partially succeeds and leaves the first fields with a different 2-bit
> cnt value than the rest, thus causing following reads to fail.

Note, 

Re: [PATCH] iommu/qcom: restore IOMMU state if needed

2023-12-12 Thread Will Deacon
On Wed, 11 Oct 2023 19:57:26 +0200, Luca Weiss wrote:
> From: Vladimir Lypak 
> 
> If the IOMMU has a power domain then some state will be lost in
> qcom_iommu_suspend and TZ will reset device if we don't call
> qcom_scm_restore_sec_cfg before accessing it again.
> 
> 
> [...]

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] iommu/qcom: restore IOMMU state if needed
  https://git.kernel.org/will/c/268dd4edb748

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev



[PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;

/* The cmpxchg always fails if it interrupted an update */
 if (!__rb_time_read(t, &val, &cnt2))
 return false;

 if (val != expect)
 return false;

 interrupted here!

 cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!

The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.
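[Editor's sketch] The fixed ordering is the familiar seqcount-style read: snapshot the counter first, read the value, then re-check the counter to detect an intervening update. A toy model in Python (illustrative only, not the kernel's rb_time code):

```python
# Toy model of the fixed read ordering: a writer bumps `cnt` on every
# value update; a reader snapshots cnt, reads val, then re-checks cnt.
# If a simulated "interrupt" (write) lands in between, the read is
# rejected instead of returning a stale value.
class RbTime:
    def __init__(self):
        self.cnt = 0
        self.val = 0

    def write(self, v):            # models rb_time_set()
        self.cnt += 1
        self.val = v

    def read(self, interrupt=None):
        cnt = self.cnt             # read the counter FIRST (the fix)
        val = self.val
        if interrupt:
            interrupt()            # writer fires after val is read
        if cnt != self.cnt:        # counter moved: read was interrupted
            return None            # models __rb_time_read() failure
        return val

t = RbTime()
t.write(42)
print(t.read())                               # 42: no interruption
print(t.read(interrupt=lambda: t.write(99)))  # None: stale read detected
```

Reading the counter after the value, as the pre-patch code did, would report `cnt == self.cnt` in the interrupted case and happily return the stale 42.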

Link: 
https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/

Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for 
speeding up 32 bit")
Reported-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
 
+   /* Any interruptions in this function should cause a failure */
+   cnt = local_read(&t->cnt);
+
/* The cmpxchg always fails if it interrupted an update */
 if (!__rb_time_read(t, &val, &cnt2))
 return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
 if (val != expect)
 return false;
 
-cnt = local_read(&t->cnt);
 if ((cnt & 3) != cnt2)
 return false;
 
-- 
2.42.0




Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit

2023-12-12 Thread Mathieu Desnoyers

On 2023-12-11 23:38, Steven Rostedt wrote:

On Mon, 11 Dec 2023 22:51:04 -0500
Mathieu Desnoyers  wrote:

[...]


For this first issue, here is the race:

rb_time_cmpxchg()
[...]
  if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
  return false;
  if (!rb_time_read_cmpxchg(&t->top, top, top2))
  return false;

__rb_time_read()
[...]
  do {
  c = local_read(&t->cnt);
  top = local_read(&t->top);
  bottom = local_read(&t->bottom);
  msb = local_read(&t->msb);
  } while (c != local_read(&t->cnt));

  *cnt = rb_time_cnt(top);

  /* If top and msb counts don't match, this interrupted a write */
  if (*cnt != rb_time_cnt(msb))
  return false;
^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.


Ah, OK that makes more sense. Yeah, if I had the three words from the
beginning, I would have tested to make sure they all match and not just the
two :-p


Technically just checking that the very first and last words which are
updated by set/cmpxchg have the same cnt bits would suffice. Because
this is just a scenario of __rb_time_read interrupting an update, the
updates in between are fine if the first/last words to be updated have
the same cnt.



As this would fix a commit that tried to fix this before!

   f458a1453424e ("ring-buffer: Test last update in 32bit version of 
__rb_time_read()")

FYI, that would be the "Fixes" for this patch.


OK

[...]



   


- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter
and produces corrupted time stamps. This is _not_ fixed by this patch.


Except that it's not 4 bits that is compared, but 32 bits.

struct rb_time_struct {
local_t cnt;
local_t top;
local_t bottom;
local_t msb;
};

The full local_t (32 bits) is used for synchronization. But the other
elements do get extra bits and there still might be some issues, but not as
severe as you stated here.


Let's bring up the race scenario I spotted:

rb_time_cmpxchg()
[...]
  /* The cmpxchg always fails if it interrupted an update */
   if (!__rb_time_read(t, &val, &cnt2))
   return false;

   if (val != expect)
   return false;


   cnt = local_read(&t->cnt);
   if ((cnt & 3) != cnt2)
   return false;
   ^ here (cnt & 3) == cnt2, but @val contains outdated data. This
 means the piecewise rb_time_read_cmpxchg() that follow will
 derive expected values from the outdated @val.


Ah. Of course this would be fixed if we did the local_read(>cnt)
*before* everything else.


But then we could be interrupted after that initial read, before reading
val. I suspect we'd want to propagate the full 32-bit cnt that was
read by __rb_time_read() to the caller, which is not the case today.
With that, we would not have to read it again in rb_time_cmpxchg.

It does leave the issue of having only 2 bits in the msb, top, bottom
fields to detect races, which are subject to overflow.





   cnt2 = cnt + 1;

   rb_time_split(val, &top, &bottom, &msb);
   top = rb_time_val_cnt(top, cnt);
   bottom = rb_time_val_cnt(bottom, cnt);
  ^ top, bottom, and msb contain outdated data, which do not
match cnt due to 2-bit overflow.

   rb_time_split(set, &top2, &bottom2, &msb2);
   top2 = rb_time_val_cnt(top2, cnt2);
   bottom2 = rb_time_val_cnt(bottom2, cnt2);

  if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
  return false;
  ^ This @cnt cmpxchg succeeds because it uses the re-read cnt
is used as expected value.


Sure. And I believe you did find another bug. If we read the cnt first,
before reading val, then it would not be outdated.


As stated above, I suspect we'd run into other issues if interrupted
between read of cnt and reading val. Propagating the full 32-bit cnt
value read from __rb_time_read() to the caller would be better I think.




  if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
  return false;
  if (!rb_time_read_cmpxchg(&t->top, top, top2))
  return false;
  if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
  return false;
  ^ these cmpxchg have just used the outdated @val as expected
values, even though the content of the rb_time was modified
by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This
means those cmpxchg can fail not only due to being interrupted
by another write or cmpxchg, but also simply due to expected
value mismatch in any of the fields, which will then cause


Yes, it is expected that this will fail for being interrupted at any time during
this operation. So it can only fail for being interrupted. How else would the
value be mismatched if this function had not been interrupted?

Re: [PATCH v3] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 11:03:32 -0500
Steven Rostedt  wrote:

> @@ -7300,9 +7301,25 @@ tracing_mark_write(struct file *filp, const char 
> __user *ubuf,
>   buffer = tr->array_buffer.buffer;
>   event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
>   tracing_gen_ctx());
> - if (unlikely(!event))
> + if (unlikely(!event)) {
> + /*
> +  * If the size was greated than what was allowed, then
> +  * make it smaller and try again.
> +  */

And I forgot to fix the comment. :-p

-- Steve



Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory

2023-12-12 Thread Alexandru Elisei
Hi Rob,

Thank you so much for the feedback, I'm not very familiar with device tree,
and any comments are very useful.

On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
>  wrote:
> >
> > Allow the kernel to get the size and location of the MTE tag storage
> > regions from the DTB. This memory is marked as reserved for now.
> >
> > The DTB node for the tag storage region is defined as:
> >
> > tags0: tag-storage@8f800 {
> > compatible = "arm,mte-tag-storage";
> > reg = <0x08 0xf800 0x00 0x400>;
> > block-size = <0x1000>;
> > memory = <>;// Associated tagged memory node
> > };
> 
> I skimmed thru the discussion some. If this memory range is within
> main RAM, then it definitely belongs in /reserved-memory.

Ok, will do that.

If you don't mind, why do you say that it definitely belongs in
reserved-memory? I'm not trying to argue otherwise, I'm curious about the
motivation.

Tag storage is not DMA and can live anywhere in memory. In
arm64_memblock_init(), the kernel first removes the memory that it cannot
address from memblock. For example, because it has been compiled with
CONFIG_ARM64_VA_BITS_39=y. And then calls
early_init_fdt_scan_reserved_mem().

What happens if reserved memory is above what the kernel can address?

From my testing, when the kernel is compiled with a 39 bit VA, if I use
reserved memory to discover tag storage that lives above the virtual address
limit and then try to use CMA to manage the tag storage memory, I get a
kernel panic:

[0.00] Reserved memory: created CMA memory pool at 0x0100, 
size 64 MiB
[0.00] OF: reserved mem: initialized node linux,cma, compatible id 
shared-dma-pool
[0.00] OF: reserved mem: 0x0100..0x010003ff (65536 
KiB) map reusable linux,cma
[..]
[0.806945] Unable to handle kernel paging request at virtual address 
0001fe00
[0.807277] Mem abort info:
[0.807277]   ESR = 0x9605
[0.807693]   EC = 0x25: DABT (current EL), IL = 32 bits
[0.808110]   SET = 0, FnV = 0
[0.808443]   EA = 0, S1PTW = 0
[0.808526]   FSC = 0x05: level 1 translation fault
[0.808943] Data abort info:
[0.808943]   ISV = 0, ISS = 0x0005, ISS2 = 0x
[0.809360]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[0.809776]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[0.810221] [0001fe00] user address but active_mm is swapper
[..]
[0.820887] Call trace:
[0.821027]  cma_init_reserved_areas+0xc4/0x378

> 
> You need a binding for this too.

By binding you mean having a YAML file in dt-schema [1] describing the tag
storage node, right?

[1] https://github.com/devicetree-org/dt-schema

> 
> > The tag storage region represents the largest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is 1GB of contiguous memory, not two adjacent 512M of
> > tag storage memory.
> >
> > "block-size" represents the minimum multiple of 4K of tag storage where all
> > the tags stored in the block correspond to a contiguous memory region. This
> > is needed for platforms where the memory controller interleaves tag writes
> > to memory. For example, if the memory controller interleaves tag writes for
> > 256KB of contiguous memory across 8K of tag storage (2-way interleave),
> > then the correct value for "block-size" is 0x2000. This value is a hardware
> > property, independent of the selected kernel page size.
> >
> > Signed-off-by: Alexandru Elisei 
> > ---
> >  arch/arm64/Kconfig   |  12 ++
> >  arch/arm64/include/asm/mte_tag_storage.h |  15 ++
> >  arch/arm64/kernel/Makefile   |   1 +
> >  arch/arm64/kernel/mte_tag_storage.c  | 256 +++
> >  arch/arm64/kernel/setup.c|   7 +
> >  5 files changed, 291 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
> >  create mode 100644 arch/arm64/kernel/mte_tag_storage.c
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 7b071a00425d..fe8276fdc7a8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2062,6 +2062,18 @@ config ARM64_MTE
> >
> >   Documentation/arch/arm64/memory-tagging-extension.rst.
> >
> > +if ARM64_MTE
> > +config ARM64_MTE_TAG_STORAGE
> > +   bool "Dynamic MTE tag storage management"
> > +   help
> > + Adds support for dynamic management of the memory used by the 
> > hardware
> > + for storing MTE tags. This memory, unlike normal memory, cannot be
> > + tagged. When it is used to store tags for another memory location 
> > it
> > + cannot be used for any type of allocation.
> > +
> > + If unsure, say N
> > +endif # 

Re: [PATCH] tracing: Fix uaf issue when open the hist or hist_debug file

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 19:33:17 +0800
Zheng Yejian  wrote:

> diff --git a/kernel/trace/trace_events_hist.c 
> b/kernel/trace/trace_events_hist.c
> index 1abc07fba1b9..00447ea7dabd 100644
> --- a/kernel/trace/trace_events_hist.c
> +++ b/kernel/trace/trace_events_hist.c
> @@ -5623,10 +5623,12 @@ static int event_hist_open(struct inode *inode, struct file *file)
>  {
>   int ret;
>  
> - ret = security_locked_down(LOCKDOWN_TRACEFS);
> + ret = tracing_open_file_tr(inode, file);
>   if (ret)
>   return ret;
>  
> + /* Clear private_data to avoid warning in single_open */
> + file->private_data = NULL;
>   return single_open(file, hist_show, file);
>  }
>  
> @@ -5634,7 +5636,7 @@ const struct file_operations event_hist_fops = {
>   .open = event_hist_open,
>   .read = seq_read,
>   .llseek = seq_lseek,
> - .release = single_release,
> + .release = tracing_release_file_tr,

single_release() still needs to be called. This can't simply be replaced
with tracing_release_file_tr().

>  };
>  
>  #ifdef CONFIG_HIST_TRIGGERS_DEBUG
> @@ -5900,10 +5902,12 @@ static int event_hist_debug_open(struct inode *inode, struct file *file)
>  {
>   int ret;
>  
> - ret = security_locked_down(LOCKDOWN_TRACEFS);
> + ret = tracing_open_file_tr(inode, file);
>   if (ret)
>   return ret;
>  
> + /* Clear private_data to avoid warning in single_open */
> + file->private_data = NULL;
>   return single_open(file, hist_debug_show, file);
>  }
>  
> @@ -5911,7 +5915,7 @@ const struct file_operations event_hist_debug_fops = {
>   .open = event_hist_debug_open,
>   .read = seq_read,
>   .llseek = seq_lseek,
> - .release = single_release,
> + .release = tracing_release_file_tr,

Same here. This just causes a leak of the single resources.

What needs to be done is to add a:

tracing_single_release_file_tr()

That does both:

int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
{
struct trace_event_file *file = inode->i_private;

trace_array_put(file->tr);
event_file_put(file);

return single_release(inode, filp);
}

-- Steve

>  };
>  #endif
>  




Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-12 Thread Casey Schaufler
On 12/12/2023 5:17 AM, Maxime Coquelin wrote:
> This patch introduces a LSM hook for devices creation,
> destruction (ioctl()) and opening (open()) operations,
> checking the application is allowed to perform these
> operations for the Virtio device type.

My earlier comments on a vduse specific LSM hook still hold.
I would much prefer to see a device permissions hook(s) that
are useful for devices in general. Not just vduse devices.
I know that there are already some very special purpose LSM
hooks, but the experience with maintaining them is why I don't
want more of them. 

>
> Signed-off-by: Maxime Coquelin 
> ---
>  MAINTAINERS |  1 +
>  drivers/vdpa/vdpa_user/vduse_dev.c  | 13 
>  include/linux/lsm_hook_defs.h   |  2 ++
>  include/linux/security.h|  6 ++
>  include/linux/vduse.h   | 14 +
>  security/security.c | 15 ++
>  security/selinux/hooks.c| 32 +
>  security/selinux/include/classmap.h |  2 ++
>  8 files changed, 85 insertions(+)
>  create mode 100644 include/linux/vduse.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a0fb0df07b43..4e83b14358d2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23040,6 +23040,7 @@ F:drivers/net/virtio_net.c
>  F:   drivers/vdpa/
>  F:   drivers/virtio/
>  F:   include/linux/vdpa.h
> +F:   include/linux/vduse.h
>  F:   include/linux/virtio*.h
>  F:   include/linux/vringh.h
>  F:   include/uapi/linux/virtio_*.h
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index fa62825be378..59ab7eb62e20 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -8,6 +8,7 @@
>   *
>   */
>  
> +#include "linux/security.h"
>  #include 
>  #include 
>  #include 
> @@ -30,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "iova_domain.h"
>  
> @@ -1442,6 +1444,10 @@ static int vduse_dev_open(struct inode *inode, struct file *file)
>   if (dev->connected)
>   goto unlock;
>  
> + ret = -EPERM;
> + if (security_vduse_perm_check(VDUSE_PERM_OPEN, dev->device_id))
> + goto unlock;
> +
>   ret = 0;
>   dev->connected = true;
>   file->private_data = dev;
> @@ -1664,6 +1670,9 @@ static int vduse_destroy_dev(char *name)
>   if (!dev)
>   return -EINVAL;
>  
> + if (security_vduse_perm_check(VDUSE_PERM_DESTROY, dev->device_id))
> + return -EPERM;
> +
>   mutex_lock(&dev->lock);
>   if (dev->vdev || dev->connected) {
>   mutex_unlock(&dev->lock);
> @@ -1828,6 +1837,10 @@ static int vduse_create_dev(struct vduse_dev_config *config,
>   int ret;
>   struct vduse_dev *dev;
>  
> + ret = -EPERM;
> + if (security_vduse_perm_check(VDUSE_PERM_CREATE, config->device_id))
> + goto err;
> +
>   ret = -EEXIST;
>   if (vduse_find_dev(config->name))
>   goto err;
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index ff217a5ce552..3930ab2ae974 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -419,3 +419,5 @@ LSM_HOOK(int, 0, uring_override_creds, const struct cred 
> *new)
>  LSM_HOOK(int, 0, uring_sqpoll, void)
>  LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
>  #endif /* CONFIG_IO_URING */
> +
> +LSM_HOOK(int, 0, vduse_perm_check, enum vduse_op_perm op_perm, u32 device_id)
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 1d1df326c881..2a2054172394 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct linux_binprm;
>  struct cred;
> @@ -484,6 +485,7 @@ int security_inode_notifysecctx(struct inode *inode, void 
> *ctx, u32 ctxlen);
>  int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
>  int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
>  int security_locked_down(enum lockdown_reason what);
> +int security_vduse_perm_check(enum vduse_op_perm op_perm, u32 device_id);
>  #else /* CONFIG_SECURITY */
>  
>  static inline int call_blocking_lsm_notifier(enum lsm_event event, void 
> *data)
> @@ -1395,6 +1397,10 @@ static inline int security_locked_down(enum 
> lockdown_reason what)
>  {
>   return 0;
>  }
> +static inline int security_vduse_perm_check(enum vduse_op_perm op_perm, u32 
> device_id)
> +{
> + return 0;
> +}
>  #endif   /* CONFIG_SECURITY */
>  
>  #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
> diff --git a/include/linux/vduse.h b/include/linux/vduse.h
> new file mode 100644
> index ..7a20dcc43997
> --- /dev/null
> +++ b/include/linux/vduse.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_VDUSE_H
> +#define _LINUX_VDUSE_H
> +
> +/*
> + * The permission required for 

[PATCH v3] ring-buffer: Fix writing to the buffer with max_data_size

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.

The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.

When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop,
and after 1000 iterations a warning is emitted and the ring buffer is
disabled. But this should never happen.

This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.

For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.

The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.

This also fixes 32 bit adding a timestamp when a read of before_stamp or
write_stamp is interrupted. There's still no need to add that timestamp if
the event is going to be the first event on the sub buffer.

Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.

Link: https://lore.kernel.org/all/20231212104549.58863...@gandalf.local.home/
Link: 
https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6...@gandalf.local.home

Cc: sta...@vger.kernel.org
Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated")
Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
Reported-by: Kent Overstreet  # (on IRC)
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v2: 
https://lore.kernel.org/linux-trace-kernel/20231212065922.05f28...@gandalf.local.home

- Just test 'w' first, and then do the rest of the checks.

 kernel/trace/ring_buffer.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..b8986f82eccf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3579,7 +3579,10 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 * absolute timestamp.
 * Don't bother if this is the start of a new page (w == 0).
 */
-   if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
+   if (!w) {
+   /* Use the sub-buffer timestamp */
+   info->delta = 0;
+   } else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
} else {
@@ -3737,6 +3740,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
add_ts_default = RB_ADD_STAMP_ABSOLUTE;
info.length += RB_LEN_TIME_EXTEND;
+   if (info.length > BUF_MAX_DATA_SIZE)
+   goto out_fail;
} else {
add_ts_default = RB_ADD_STAMP_NONE;
}
-- 
2.42.0




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things 
> > > > any?
> > >
> > > Or a dirty hack like cond_resched() in translate_desc().
> >
> > what do you mean, exactly?
> 
> Ideally it should not matter, but Tobias said there's an unexpectedly
> long time spent on translate_desc() which may indicate that the
> might_sleep() or other doesn't work for some reason.
> 
> Thanks

You mean for debugging, add it with a patch to see what this does?

Sure - can you post the debugging patch pls?

> >
> > --
> > MST
> >




Re: [PATCH v2] virtio: Add support for no-reset virtio PCI PM

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 11:01:53AM +0800, Jason Wang wrote:
> On Tue, Dec 12, 2023 at 12:37 AM Michael S. Tsirkin  wrote:
> >
> > On Fri, Dec 08, 2023 at 04:07:54PM +0900, David Stevens wrote:
> > > If a virtio_pci_device supports native PCI power management and has the
> > > No_Soft_Reset bit set, then skip resetting and reinitializing the device
> > > when suspending and restoring the device. This allows system-wide low
> > > power states like s2idle to be used in systems with stateful virtio
> > > devices that can't simply be re-initialized (e.g. virtio-fs).
> > >
> > > Signed-off-by: David Stevens 
> >
> > tagged, thanks!
> > I'm still debating with myself whether we can classify this
> > as a bugfix ... better not risk it I guess.
> 
> It might be suitable if there's a hypervisor that advertises
> no_soft_reset (but it seems Qemu doesn't).
> 
> Thanks

Yea... so a bugfix but no rush to merge it I think.

> >
> > > ---
> > > v1 -> v2:
> > >  - Check the No_Soft_Reset bit
> > >
> > >  drivers/virtio/virtio_pci_common.c | 34 +-
> > >  1 file changed, 33 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/virtio/virtio_pci_common.c 
> > > b/drivers/virtio/virtio_pci_common.c
> > > index c2524a7207cf..3a95ecaf12dc 100644
> > > --- a/drivers/virtio/virtio_pci_common.c
> > > +++ b/drivers/virtio/virtio_pci_common.c
> > > @@ -492,8 +492,40 @@ static int virtio_pci_restore(struct device *dev)
> > >   return virtio_device_restore(&vp_dev->vdev);
> > >  }
> > >
> > > +static bool vp_supports_pm_no_reset(struct device *dev)
> > > +{
> > > + struct pci_dev *pci_dev = to_pci_dev(dev);
> > > + u16 pmcsr;
> > > +
> > > + if (!pci_dev->pm_cap)
> > > + return false;
> > > +
> > > + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > > + if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > > + dev_err(dev, "Unable to query pmcsr");
> > > + return false;
> > > + }
> > > +
> > > + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
> > > +}
> > > +
> > > +static int virtio_pci_suspend(struct device *dev)
> > > +{
> > > + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
> > > +}
> > > +
> > > +static int virtio_pci_resume(struct device *dev)
> > > +{
> > > + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
> > > +}
> > > +
> > >  static const struct dev_pm_ops virtio_pci_pm_ops = {
> > > - SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore)
> > > + .suspend = virtio_pci_suspend,
> > > + .resume = virtio_pci_resume,
> > > + .freeze = virtio_pci_freeze,
> > > + .thaw = virtio_pci_restore,
> > > + .poweroff = virtio_pci_freeze,
> > > + .restore = virtio_pci_restore,
> > >  };
> > >  #endif
> > >
> > > --
> > > 2.43.0.472.g3155946c3a-goog
> >




Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 06:59:03PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> >> Hello,
> >>
> >>DESCRIPTION
> >>
> >> This patchset fixes old problem with hungup of both rx/tx sides and adds
> >> test for it. This happens due to non-default SO_RCVLOWAT value and
> >> deferred credit update in virtio/vsock. Link to previous old patchset:
> >> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> > 
> > 
> > Patchset:
> > 
> > Acked-by: Michael S. Tsirkin 
> 
> Thanks!
> 
> > 
> > 
> > But I worry whether we actually need 3/8 in net not in net-next.
> 
> Because of "Fixes" tag ? I think this problem is not critical and reproducible
> only in special cases, but i'm not familiar with netdev process so good, so I
> don't have strong opinion. I guess @Stefano knows better.
> 
> Thanks, Arseniy

Fixes means "if you have that other commit then you need this commit
too". I think as a minimum you need to rearrange patches to make the
fix go in first. We don't want a regression followed by a fix.

> > 
> > Thanks!
> > 
> >> Here is what happens step by step:
> >>
> >>   TEST
> >>
> >> INITIAL CONDITIONS
> >>
> >> 1) Vsock buffer size is 128KB.
> >> 2) Maximum packet size is also 64KB as defined in header (yes it is
> >>hardcoded, just to remind about that value).
> >> 3) SO_RCVLOWAT is default, e.g. 1 byte.
> >>
> >>
> >>  STEPS
> >>
> >> SENDER  RECEIVER
> >> 1) sends 128KB + 1 byte in a
> >>single buffer. 128KB will
> >>be sent, but for 1 byte
> >>sender will wait for free
> >>space at peer. Sender goes
> >>to sleep.
> >>
> >>
> >> 2) reads 64KB, credit update not sent
> >> 3) sets SO_RCVLOWAT to 64KB + 1
> >> 4) poll() -> wait forever, there is
> >>only 64KB available to read.
> >>
> >> So in step 4) receiver also goes to sleep, waiting for enough data or
> >> connection shutdown message from the sender. Idea to fix it is that rx
> >> kicks tx side to continue transmission (and may be close connection)
> >> when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
> >> this value is bigger than number of available bytes to read.
> >>
> >> I've added small test for this, but not sure as it uses hardcoded value
> >> for maximum packet length, this value is defined in kernel header and
> >> used to control deferred credit update. And as this is not available to
> >> userspace, I can't control test parameters correctly (if one day this
> >> define will be changed - test may become useless). 
> >>
> >> Head for this patchset is:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=021b0c952f226236f2edf89c737efb9a28d1422d
> >>
> >> Link to v1:
> >> https://lore.kernel.org/netdev/20231108072004.1045669-1-avkras...@salutedevices.com/
> >> Link to v2:
> >> https://lore.kernel.org/netdev/20231119204922.2251912-1-avkras...@salutedevices.com/
> >> Link to v3:
> >> https://lore.kernel.org/netdev/20231122180510.2297075-1-avkras...@salutedevices.com/
> >> Link to v4:
> >> https://lore.kernel.org/netdev/20231129212519.2938875-1-avkras...@salutedevices.com/
> >> Link to v5:
> >> https://lore.kernel.org/netdev/20231130130840.253733-1-avkras...@salutedevices.com/
> >> Link to v6:
> >> https://lore.kernel.org/netdev/20231205064806.2851305-1-avkras...@salutedevices.com/
> >> Link to v7:
> >> https://lore.kernel.org/netdev/20231206211849.2707151-1-avkras...@salutedevices.com/
> >>
> >> Changelog:
> >> v1 -> v2:
> >>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
> >>  * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
> >>callback in 'af_vsock.c' when transport callback is set - with that
> >>we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
> >>not copy-paste it to every transport. It was discussed in v1.
> >>  * See per-patch changelog after ---.
> >> v2 -> v3:
> >>  * See changelog after --- in 0003 only (0001 and 0002 still same).
> >> v3 -> v4:
> >>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
> >>  * See per-patch changelog after ---.
> >> v4 -> v5:
> >>  * Change patchset tag 'RFC' -> 'net-next'.
> >>  * See per-patch changelog after ---.
> >> v5 -> v6:
> >>  * New patch 0003 which sends credit update during reading bytes from
> >>socket.
> >>  * See per-patch changelog after ---.
> >> v6 -> v7:
> >>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
> >>  * See per-patch changelog after ---.
> >> v7 -> v8:
> >>  * See per-patch changelog after ---.
> >>
> >> Arseniy Krasnov (4):
> 

Re: [PATCH net-next v8 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 06:50:39PM +0300, Arseniy Krasnov wrote:
> 
> 
> On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 12:16:57AM +0300, Arseniy Krasnov wrote:
> >> Add one more condition for sending credit update during dequeue from
> >> stream socket: when number of bytes in the rx queue is smaller than
> >> SO_RCVLOWAT value of the socket. This is actual for non-default value
> >> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
> >> transmission, because we need at least SO_RCVLOWAT bytes in our rx
> >> queue to wake up user for reading data (in corner case it is also
> >> possible to stuck both tx and rx sides, this is why 'Fixes' is used).
> > 
> > I don't get what does "to stuck both tx and rx sides" mean.
> 
> I meant situation when tx waits for the free space, while rx doesn't send
> credit update, just waiting for more data. Sorry for my English :)
> 
> > Besides being agrammatical, is there a way to do this without
> > playing with SO_RCVLOWAT?
> 
> No, this may happen only with non-default SO_RCVLOWAT values (e.g. != 1)
> 
> Thanks, Arseniy 

I am split on whether we need the Fixes tag. I guess if the other side
is vhost with SO_RCVLOWAT then it might be stuck and it might apply
without SO_RCVLOWAT on the local kernel?


> > 
> >>
> >> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
> >> Signed-off-by: Arseniy Krasnov 
> >> ---
> >>  Changelog:
> >>  v6 -> v7:
> >>   * Handle wrap of 'fwd_cnt'.
> >>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
> >>  v7 -> v8:
> >>   * Remove unneeded/wrong handling of wrap for 'fwd_cnt'.
> >>
> >>  net/vmw_vsock/virtio_transport_common.c | 13 ++---
> >>  1 file changed, 10 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> >> b/net/vmw_vsock/virtio_transport_common.c
> >> index e137d740804e..8572f94bba88 100644
> >> --- a/net/vmw_vsock/virtio_transport_common.c
> >> +++ b/net/vmw_vsock/virtio_transport_common.c
> >> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >>struct virtio_vsock_sock *vvs = vsk->trans;
> >>size_t bytes, total = 0;
> >>struct sk_buff *skb;
> >> +  u32 fwd_cnt_delta;
> >> +  bool low_rx_bytes;
> >>int err = -EFAULT;
> >>u32 free_space;
> >>  
> >> @@ -601,7 +603,10 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >>}
> >>}
> >>  
> >> -  free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> >> +  fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
> >> +  free_space = vvs->buf_alloc - fwd_cnt_delta;
> >> +  low_rx_bytes = (vvs->rx_bytes <
> >> +  sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
> >>  
> >>    spin_unlock_bh(&vvs->rx_lock);
> >>  
> >> @@ -611,9 +616,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
> >> *vsk,
> >> * too high causes extra messages. Too low causes transmitter
> >> * stalls. As stalls are in theory more expensive than extra
> >> * messages, we set the limit to a high value. TODO: experiment
> >> -   * with different values.
> >> +   * with different values. Also send credit update message when
> >> +   * number of bytes in rx queue is not enough to wake up reader.
> >> */
> >> -  if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> >> +  if (fwd_cnt_delta &&
> >> +  (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
> >>virtio_transport_send_credit_update(vsk);
> >>  
> >>return total;
> >> -- 
> >> 2.25.1
> > 




Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-12 Thread Arseniy Krasnov



On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>>DESCRIPTION
>>
>> This patchset fixes old problem with hungup of both rx/tx sides and adds
>> test for it. This happens due to non-default SO_RCVLOWAT value and
>> deferred credit update in virtio/vsock. Link to previous old patchset:
>> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> 
> 
> Patchset:
> 
> Acked-by: Michael S. Tsirkin 

Thanks!

> 
> 
> But I worry whether we actually need 3/8 in net not in net-next.

Because of "Fixes" tag ? I think this problem is not critical and reproducible
only in special cases, but i'm not familiar with netdev process so good, so I
don't have strong opinion. I guess @Stefano knows better.

Thanks, Arseniy

> 
> Thanks!
> 
>> Here is what happens step by step:
>>
>>   TEST
>>
>> INITIAL CONDITIONS
>>
>> 1) Vsock buffer size is 128KB.
>> 2) Maximum packet size is also 64KB as defined in header (yes it is
>>hardcoded, just to remind about that value).
>> 3) SO_RCVLOWAT is default, e.g. 1 byte.
>>
>>
>>  STEPS
>>
>> SENDER  RECEIVER
>> 1) sends 128KB + 1 byte in a
>>single buffer. 128KB will
>>be sent, but for 1 byte
>>sender will wait for free
>>space at peer. Sender goes
>>to sleep.
>>
>>
>> 2) reads 64KB, credit update not sent
>> 3) sets SO_RCVLOWAT to 64KB + 1
>> 4) poll() -> wait forever, there is
>>only 64KB available to read.
>>
>> So in step 4) receiver also goes to sleep, waiting for enough data or
>> connection shutdown message from the sender. Idea to fix it is that rx
>> kicks tx side to continue transmission (and may be close connection)
>> when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
>> this value is bigger than number of available bytes to read.
>>
>> I've added small test for this, but not sure as it uses hardcoded value
>> for maximum packet length, this value is defined in kernel header and
>> used to control deferred credit update. And as this is not available to
>> userspace, I can't control test parameters correctly (if one day this
>> define will be changed - test may become useless). 
>>
>> Head for this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=021b0c952f226236f2edf89c737efb9a28d1422d
>>
>> Link to v1:
>> https://lore.kernel.org/netdev/20231108072004.1045669-1-avkras...@salutedevices.com/
>> Link to v2:
>> https://lore.kernel.org/netdev/20231119204922.2251912-1-avkras...@salutedevices.com/
>> Link to v3:
>> https://lore.kernel.org/netdev/20231122180510.2297075-1-avkras...@salutedevices.com/
>> Link to v4:
>> https://lore.kernel.org/netdev/20231129212519.2938875-1-avkras...@salutedevices.com/
>> Link to v5:
>> https://lore.kernel.org/netdev/20231130130840.253733-1-avkras...@salutedevices.com/
>> Link to v6:
>> https://lore.kernel.org/netdev/20231205064806.2851305-1-avkras...@salutedevices.com/
>> Link to v7:
>> https://lore.kernel.org/netdev/20231206211849.2707151-1-avkras...@salutedevices.com/
>>
>> Changelog:
>> v1 -> v2:
>>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>>  * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
>>callback in 'af_vsock.c' when transport callback is set - with that
>>we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
>>not copy-paste it to every transport. It was discussed in v1.
>>  * See per-patch changelog after ---.
>> v2 -> v3:
>>  * See changelog after --- in 0003 only (0001 and 0002 still same).
>> v3 -> v4:
>>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>>  * See per-patch changelog after ---.
>> v4 -> v5:
>>  * Change patchset tag 'RFC' -> 'net-next'.
>>  * See per-patch changelog after ---.
>> v5 -> v6:
>>  * New patch 0003 which sends credit update during reading bytes from
>>socket.
>>  * See per-patch changelog after ---.
>> v6 -> v7:
>>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>>  * See per-patch changelog after ---.
>> v7 -> v8:
>>  * See per-patch changelog after ---.
>>
>> Arseniy Krasnov (4):
>>   vsock: update SO_RCVLOWAT setting callback
>>   virtio/vsock: send credit update during setting SO_RCVLOWAT
>>   virtio/vsock: fix logic which reduces credit update messages
>>   vsock/test: two tests to check credit update logic
>>
>>  drivers/vhost/vsock.c   |   1 +
>>  include/linux/virtio_vsock.h|   1 +
>>  include/net/af_vsock.h  |   2 +-
>>  net/vmw_vsock/af_vsock.c|   9 +-
>>  net/vmw_vsock/hyperv_transport.c|   4 +-
>>  

[PATCH v3] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow a trace write to be as big as the ring buffer tracing data will
allow. Currently, it only allows writes of 1KB in size, but there's no
reason that it cannot allow what the ring buffer can hold.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v2: 
https://lore.kernel.org/linux-trace-kernel/20231212090057.41b28...@gandalf.local.home

- Test (ssize_t)cnt for less than zero (Mathieu Desnoyers)

- Make "size" size_t to not overflow when copying cnt to it (Mathieu Desnoyers)

 include/linux/ring_buffer.h |  1 +
 kernel/trace/ring_buffer.c  | 15 +++
 kernel/trace/trace.c| 31 ---
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..b1b03b2c0f08 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -141,6 +141,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
 bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter);
 
 unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu);
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer);
 
 void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu);
 void ring_buffer_reset_online_cpus(struct trace_buffer *buffer);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a3eaa052f4de..882aab2bede3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5250,6 +5250,21 @@ unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_size);
 
+/**
+ * ring_buffer_max_event_size - return the max data size of an event
+ * @buffer: The ring buffer.
+ *
+ * Returns the maximum size an event can be.
+ */
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer)
+{
+   /* If abs timestamp is requested, events have a timestamp too */
+   if (ring_buffer_time_stamp_abs(buffer))
+   return BUF_MAX_DATA_SIZE - RB_LEN_TIME_EXTEND;
+   return BUF_MAX_DATA_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_max_event_size);
+
 static void rb_clear_buffer_page(struct buffer_page *page)
 {
	local_set(&page->write, 0);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ef86379555e4..9fec58a2d4be 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7272,8 +7272,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
enum event_trigger_type tt = ETT_NONE;
struct trace_buffer *buffer;
struct print_entry *entry;
+   int meta_size;
ssize_t written;
-   int size;
+   size_t size;
int len;
 
 /* Used in tracing_mark_raw_write() as well */
@@ -7286,12 +7287,12 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
if (!(tr->trace_flags & TRACE_ITER_MARKERS))
return -EINVAL;
 
-   if (cnt > TRACE_BUF_SIZE)
-   cnt = TRACE_BUF_SIZE;
-
-   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
+   if ((ssize_t)cnt < 0)
+   return -EINVAL;
 
-   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
+   meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
+ again:
+   size = cnt + meta_size;
 
	/* If less than "<faulted>", then make sure we can still add that */
if (cnt < FAULTED_SIZE)
@@ -7300,9 +7301,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());
-   if (unlikely(!event))
+   if (unlikely(!event)) {
+   /*
+* If the size was greater than what was allowed, then
+* make it smaller and try again.
+*/
+   if (size > ring_buffer_max_event_size(buffer)) {
+   /* cnt < FAULTED size should never be bigger than max */
+   if (WARN_ON_ONCE(cnt < FAULTED_SIZE))
+   return -EBADF;
+   cnt = ring_buffer_max_event_size(buffer) - meta_size;
+   /* The above should only happen once */
+   if (WARN_ON_ONCE(cnt + meta_size == size))
+   return -EBADF;
+   goto again;
+   }
+
/* Ring buffer disabled, return as if not open for write */
return -EBADF;
+   }
 
entry = ring_buffer_event_data(event);
entry->ip = _THIS_IP_;
-- 
2.42.0
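For illustration, the clamp-and-retry flow that the hunk above adds to tracing_mark_write() can be sketched as a standalone C function. This is a toy model, not the kernel code: META_SIZE and reserve_marker() are illustrative stand-ins for sizeof(*entry) + 2 and the 'again:' loop.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the retry logic added above: if a reservation would
 * exceed the ring buffer's max event size, clamp the payload once
 * and try again. META_SIZE stands in for sizeof(*entry) + 2 and is
 * an illustrative value, not the kernel's.
 */
#define META_SIZE 10

/* Returns the (possibly clamped) payload size and stores the size
 * actually reserved, mimicking the 'again:' loop above. */
static size_t reserve_marker(size_t cnt, size_t max_event_size,
			     size_t *reserved)
{
	size_t size;
again:
	size = cnt + META_SIZE;
	if (size > max_event_size) {
		cnt = max_event_size - META_SIZE;
		/* The clamp above must make progress exactly once. */
		if (cnt + META_SIZE == size)
			return 0;
		goto again;
	}
	*reserved = size;
	return cnt;
}
```

Note that only the no-progress path maps to the -EBADF return in the patch; a single oversized write is silently truncated to what the buffer can hold.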




Re: [PATCH net-next v8 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-12 Thread Arseniy Krasnov



On 12.12.2023 18:54, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 12:16:57AM +0300, Arseniy Krasnov wrote:
>> Add one more condition for sending credit update during dequeue from
>> stream socket: when number of bytes in the rx queue is smaller than
>> SO_RCVLOWAT value of the socket. This is actual for non-default value
>> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
>> transmission, because we need at least SO_RCVLOWAT bytes in our rx
>> queue to wake up user for reading data (in corner case it is also
>> possible to stuck both tx and rx sides, this is why 'Fixes' is used).
> 
> I don't get what does "to stuck both tx and rx sides" mean.

I meant situation when tx waits for the free space, while rx doesn't send
credit update, just waiting for more data. Sorry for my English :)

> Besides being agrammatical, is there a way to do this without
> playing with SO_RCVLOWAT?

No, this may happen only with non-default SO_RCVLOWAT values (e.g. != 1)

Thanks, Arseniy 
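As a side note for readers unfamiliar with the option: SO_RCVLOWAT is a generic socket-level option, so setting and reading a non-default value can be demonstrated without a vsock device. A minimal sketch, using AF_UNIX purely so it runs anywhere (the stall discussed above needs the vsock credit mechanism; set_rcvlowat_demo() is a hypothetical helper):

```c
#include <assert.h>
#include <sys/socket.h>

/* Set a non-default receive low-water mark and read it back.
 * AF_UNIX is used only so this runs without a vsock device; the
 * thread above concerns the same option on vsock sockets. */
static int set_rcvlowat_demo(int lowat)
{
	int sv[2], got = 0;
	socklen_t len = sizeof(got);

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	if (setsockopt(sv[0], SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) < 0)
		return -1;
	if (getsockopt(sv[0], SOL_SOCKET, SO_RCVLOWAT, &got, &len) < 0)
		return -1;
	return got;
}
```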

> 
>>
>> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
>> Signed-off-by: Arseniy Krasnov 
>> ---
>>  Changelog:
>>  v6 -> v7:
>>   * Handle wrap of 'fwd_cnt'.
>>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
>>  v7 -> v8:
>>   * Remove unneeded/wrong handling of wrap for 'fwd_cnt'.
>>
>>  net/vmw_vsock/virtio_transport_common.c | 13 ++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index e137d740804e..8572f94bba88 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>  struct virtio_vsock_sock *vvs = vsk->trans;
>>  size_t bytes, total = 0;
>>  struct sk_buff *skb;
>> +u32 fwd_cnt_delta;
>> +bool low_rx_bytes;
>>  int err = -EFAULT;
>>  u32 free_space;
>>  
>> @@ -601,7 +603,10 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>  }
>>  }
>>  
>> -free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
>> +fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
>> +free_space = vvs->buf_alloc - fwd_cnt_delta;
>> +low_rx_bytes = (vvs->rx_bytes <
>> +sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>>  
>>  spin_unlock_bh(&vvs->rx_lock);
>>  
>> @@ -611,9 +616,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>>   * too high causes extra messages. Too low causes transmitter
>>   * stalls. As stalls are in theory more expensive than extra
>>   * messages, we set the limit to a high value. TODO: experiment
>> - * with different values.
>> + * with different values. Also send credit update message when
>> + * number of bytes in rx queue is not enough to wake up reader.
>>   */
>> -if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
>> +if (fwd_cnt_delta &&
>> +(free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
>>  virtio_transport_send_credit_update(vsk);
>>  
>>  return total;
>> -- 
>> 2.25.1
> 



Re: [PATCH net-next v8 0/4] send credit update during setting SO_RCVLOWAT

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 12:16:54AM +0300, Arseniy Krasnov wrote:
> Hello,
> 
>DESCRIPTION
> 
> This patchset fixes old problem with hungup of both rx/tx sides and adds
> test for it. This happens due to non-default SO_RCVLOWAT value and
> deferred credit update in virtio/vsock. Link to previous old patchset:
> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/


Patchset:

Acked-by: Michael S. Tsirkin 


But I worry whether we actually need 3/8 in net not in net-next.

Thanks!

> Here is what happens step by step:
> 
>   TEST
> 
> INITIAL CONDITIONS
> 
> 1) Vsock buffer size is 128KB.
> 2) Maximum packet size is also 64KB as defined in header (yes it is
>hardcoded, just to remind about that value).
> 3) SO_RCVLOWAT is default, e.g. 1 byte.
> 
> 
>  STEPS
> 
> SENDER  RECEIVER
> 1) sends 128KB + 1 byte in a
>single buffer. 128KB will
>be sent, but for 1 byte
>sender will wait for free
>space at peer. Sender goes
>to sleep.
> 
> 
> 2) reads 64KB, credit update not sent
> 3) sets SO_RCVLOWAT to 64KB + 1
> 4) poll() -> wait forever, there is
>only 64KB available to read.
> 
> So in step 4) receiver also goes to sleep, waiting for enough data or
> connection shutdown message from the sender. Idea to fix it is that rx
> kicks tx side to continue transmission (and may be close connection)
> when rx changes number of bytes to be woken up (e.g. SO_RCVLOWAT) and
> this value is bigger than number of available bytes to read.
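The dequeue-side condition from patch 3/4 that resolves these steps can be modeled in plain C. This is a toy model with the constants from the scenario above (128KB buffer, 64KB max packet); needs_credit_update() and struct credit_state are hypothetical names, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants matching the scenario above. */
#define BUF_ALLOC		(128u * 1024)
#define MAX_PKT_BUF_SIZE	(64u * 1024)

struct credit_state {
	uint32_t fwd_cnt;	/* bytes forwarded to the user so far */
	uint32_t last_fwd_cnt;	/* fwd_cnt at the last credit update */
	uint32_t rx_bytes;	/* bytes queued, not yet read */
};

/* Mirrors the dequeue-side condition added by patch 3/4: send a
 * credit update when free space is low, or when fewer bytes than
 * SO_RCVLOWAT are queued -- but only if something was forwarded
 * since the last update (fwd_cnt_delta != 0). */
static bool needs_credit_update(const struct credit_state *s,
				uint32_t rcvlowat)
{
	uint32_t fwd_cnt_delta = s->fwd_cnt - s->last_fwd_cnt;
	uint32_t free_space = BUF_ALLOC - fwd_cnt_delta;
	bool low_rx_bytes = s->rx_bytes < rcvlowat;

	return fwd_cnt_delta &&
	       (free_space < MAX_PKT_BUF_SIZE || low_rx_bytes);
}
```

With the receiver having read 64KB of 128KB, the old condition alone (free space exactly 64KB) sends no update and the sender stays asleep; the SO_RCVLOWAT check kicks it.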
> 
> I've added small test for this, but not sure as it uses hardcoded value
> for maximum packet length, this value is defined in kernel header and
> used to control deferred credit update. And as this is not available to
> userspace, I can't control test parameters correctly (if one day this
> define will be changed - test may become useless). 
> 
> Head for this patchset is:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=021b0c952f226236f2edf89c737efb9a28d1422d
> 
> Link to v1:
> https://lore.kernel.org/netdev/20231108072004.1045669-1-avkras...@salutedevices.com/
> Link to v2:
> https://lore.kernel.org/netdev/20231119204922.2251912-1-avkras...@salutedevices.com/
> Link to v3:
> https://lore.kernel.org/netdev/20231122180510.2297075-1-avkras...@salutedevices.com/
> Link to v4:
> https://lore.kernel.org/netdev/20231129212519.2938875-1-avkras...@salutedevices.com/
> Link to v5:
> https://lore.kernel.org/netdev/20231130130840.253733-1-avkras...@salutedevices.com/
> Link to v6:
> https://lore.kernel.org/netdev/20231205064806.2851305-1-avkras...@salutedevices.com/
> Link to v7:
> https://lore.kernel.org/netdev/20231206211849.2707151-1-avkras...@salutedevices.com/
> 
> Changelog:
> v1 -> v2:
>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>  * New patch is added as 0001 - it removes return from SO_RCVLOWAT set
>callback in 'af_vsock.c' when transport callback is set - with that
>we can set 'sk_rcvlowat' only once in 'af_vsock.c' and in future do
>not copy-paste it to every transport. It was discussed in v1.
>  * See per-patch changelog after ---.
> v2 -> v3:
>  * See changelog after --- in 0003 only (0001 and 0002 still same).
> v3 -> v4:
>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>  * See per-patch changelog after ---.
> v4 -> v5:
>  * Change patchset tag 'RFC' -> 'net-next'.
>  * See per-patch changelog after ---.
> v5 -> v6:
>  * New patch 0003 which sends credit update during reading bytes from
>socket.
>  * See per-patch changelog after ---.
> v6 -> v7:
>  * Patchset rebased and tested on new HEAD of net-next (see hash above).
>  * See per-patch changelog after ---.
> v7 -> v8:
>  * See per-patch changelog after ---.
> 
> Arseniy Krasnov (4):
>   vsock: update SO_RCVLOWAT setting callback
>   virtio/vsock: send credit update during setting SO_RCVLOWAT
>   virtio/vsock: fix logic which reduces credit update messages
>   vsock/test: two tests to check credit update logic
> 
>  drivers/vhost/vsock.c   |   1 +
>  include/linux/virtio_vsock.h|   1 +
>  include/net/af_vsock.h  |   2 +-
>  net/vmw_vsock/af_vsock.c|   9 +-
>  net/vmw_vsock/hyperv_transport.c|   4 +-
>  net/vmw_vsock/virtio_transport.c|   1 +
>  net/vmw_vsock/virtio_transport_common.c |  43 +-
>  net/vmw_vsock/vsock_loopback.c  |   1 +
>  tools/testing/vsock/vsock_test.c| 175 
>  9 files changed, 229 insertions(+), 8 deletions(-)
> 
> -- 
> 2.25.1




Re: [PATCH net-next v8 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 12:16:57AM +0300, Arseniy Krasnov wrote:
> Add one more condition for sending credit update during dequeue from
> stream socket: when number of bytes in the rx queue is smaller than
> SO_RCVLOWAT value of the socket. This is actual for non-default value
> of SO_RCVLOWAT (e.g. not 1) - idea is to "kick" peer to continue data
> transmission, because we need at least SO_RCVLOWAT bytes in our rx
> queue to wake up user for reading data (in corner case it is also
> possible to stuck both tx and rx sides, this is why 'Fixes' is used).

I don't get what does "to stuck both tx and rx sides" mean.
Besides being agrammatical, is there a way to do this without
playing with SO_RCVLOWAT?

> 
> Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
> Signed-off-by: Arseniy Krasnov 
> ---
>  Changelog:
>  v6 -> v7:
>   * Handle wrap of 'fwd_cnt'.
>   * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
>  v7 -> v8:
>   * Remove unneeded/wrong handling of wrap for 'fwd_cnt'.
> 
>  net/vmw_vsock/virtio_transport_common.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index e137d740804e..8572f94bba88 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>   struct virtio_vsock_sock *vvs = vsk->trans;
>   size_t bytes, total = 0;
>   struct sk_buff *skb;
> + u32 fwd_cnt_delta;
> + bool low_rx_bytes;
>   int err = -EFAULT;
>   u32 free_space;
>  
> @@ -601,7 +603,10 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>   }
>   }
>  
> - free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
> + fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
> + free_space = vvs->buf_alloc - fwd_cnt_delta;
> + low_rx_bytes = (vvs->rx_bytes <
> + sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
>  
>   spin_unlock_bh(&vvs->rx_lock);
>  
> @@ -611,9 +616,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
>* too high causes extra messages. Too low causes transmitter
>* stalls. As stalls are in theory more expensive than extra
>* messages, we set the limit to a high value. TODO: experiment
> -  * with different values.
> +  * with different values. Also send credit update message when
> +  * number of bytes in rx queue is not enough to wake up reader.
>*/
> - if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> + if (fwd_cnt_delta &&
> + (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
>   virtio_transport_send_credit_update(vsk);
>  
>   return total;
> -- 
> 2.25.1




Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 09:33:11 -0500
Mathieu Desnoyers  wrote:

> On 2023-12-12 09:00, Steven Rostedt wrote:
> [...]
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
> > enum event_trigger_type tt = ETT_NONE;
> > struct trace_buffer *buffer;
> > struct print_entry *entry;
> > +   int meta_size;
> > ssize_t written;
> > int size;
> > int len;
> > @@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
> > if (!(tr->trace_flags & TRACE_ITER_MARKERS))
> > return -EINVAL;
> >   
> > -   if (cnt > TRACE_BUF_SIZE)
> > -   cnt = TRACE_BUF_SIZE;  
> 
> You're removing an early bound check for a size_t userspace input...
> 
> > -
> > -   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
> > -
> > -   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
> > +   meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
> > + again:
> > +   size = cnt + meta_size;  
> 
> ... and then implicitly casting it into a "int" size variable, which
> can therefore become a negative value.
> 
> Just for the sake of not having to rely on ring_buffer_lock_reserve
> catching (length > BUF_MAX_DATA_SIZE), I would recommend to add an
> early check for negative here.

size_t is not signed, so nothing should be negative. But you are right, I
need to have "size" be of size_t type too to prevent the overflow.

And I could make cnt of ssize_t type and check for negative and fail early
in such a case.

Thanks!
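The narrowing hazard discussed here is easy to reproduce outside the kernel. In this sketch, fits_naively() is a hypothetical stand-in for the v2 code path; the size_t-to-int conversion of out-of-range values is implementation-defined (modular on the usual gcc/clang ABIs):

```c
#include <assert.h>
#include <stddef.h>

/* A huge size_t from userspace, stored into an int, becomes negative
 * (on the common two's-complement ABIs) and sails past a naive upper
 * bound check -- the reason cnt needs an early negativity test. */
static int fits_naively(size_t cnt, int limit)
{
	int size = (int)(cnt + 10);	/* implicit narrowing, as in v2 */

	return size <= limit;		/* a negative size "passes" */
}
```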

> 
> >   
> > 	/* If less than "<faulted>", then make sure we can still add that */
> > 	if (cnt < FAULTED_SIZE)
> > @@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
> > 	buffer = tr->array_buffer.buffer;
> > event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
> > tracing_gen_ctx());
> > -   if (unlikely(!event))
> > +   if (unlikely(!event)) {
> > +   /*
> > +* If the size was greated than what was allowed, then
> >  
> 
> greater ?

Nah, the size is "greated" like "greated cheese" ;-)

Thanks for the review, I'll send out a v3.

-- Steve



Re: [PATCH v2] ring-buffer: Never use absolute timestamp for first event

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 23:20:08 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Dec 2023 07:18:37 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > On 32bit machines, the 64 bit timestamps are broken up into 32 bit words
> > to keep from using local64_cmpxchg(), as that is very expensive on 32 bit
> > architectures.
> > 
> > On 32 bit architectures, reading these timestamps can happen in a middle
> > of an update. In this case, the read returns "false", telling the caller
> > that the timestamp is in the middle of an update, and it needs to assume
> > it is corrupted. The code then accommodates this.
> > 
> > When first reserving space on the ring buffer, a "before_stamp" and
> > "write_stamp" are read. If they do not match, or if either is in the
> > process of being updated (false was returned from the read), an absolute
> > timestamp is added and the delta is not used, as that requires reading
> > theses timestamps without being corrupted.
> > 
> > The one case that this does not matter is if the event is the first event
> > on the sub-buffer, in which case, the event uses the sub-buffer's
> > timestamp and doesn't need the other stamps for calculating them.
> > 
> > After some work to consolidate the code, if the before or write stamps are
> > in the process of updating, an absolute timestamp will be added regardless
> > if the event is the first event on the sub-buffer. This is wrong as it
> > should not care about the success of these reads if it is the first event
> > on the sub-buffer.
> > 
> > Fix up the parenthesis so that even if the timestamps are corrupted, if
> > the event is the first event on the sub-buffer (w == 0) it still does not
> > force an absolute timestamp.
> > 
> > It's actually likely that w is not zero, but move it out of the unlikely()
> > and test it first. It should be in hot cache anyway, and there's no reason
> > to do the rest of the test for the first event on the sub-buffer. And this
> > prevents having to test all the 'or' statements in that case.
> > 
> > Cc: sta...@vger.kernel.org
> > Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
> > Signed-off-by: Steven Rostedt (Google) 
> > ---
> > Changes since v2: 
> > https://lore.kernel.org/linux-trace-kernel/2023125949.4692e...@gandalf.local.home
> > 
> > - Move the test to 'w' out of the unlikely and do it first.
> >   It's already in hot cache, and the rest of test shouldn't be done
> >   if 'w' is zero.
> > 
> >  kernel/trace/ring_buffer.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index b416bdf6c44a..095b86081ea8 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -3581,7 +3581,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> >  * absolute timestamp.
> >  * Don't bother if this is the start of a new page (w == 0).
> >  */
> > -   if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
> > +   if (w && unlikely(!a_ok || !b_ok || info->before != info->after)) {
> > info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
> > info->length += RB_LEN_TIME_EXTEND;
> > } else {  
> 
> After this else,
> 
> } else {
> info->delta = info->ts - info->after;
> 
> The code is using info->after, but it is not sure that 'a_ok' is true. Does
> this mean that if 'w == 0 && !a_ok' this doesn't work correctly?
> What will be the expected behavior when w == 0 here?
> 

Hmm, looking at this and

  
https://lore.kernel.org/linux-trace-kernel/20231212065922.05f28...@gandalf.local.home/

I think the proper solution is simply:

if (!w) {
/* Use the sub-buffer timestamp */
info->delta = 0;
} else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
} else {
info->delta = info->ts - info->after;
if (unlikely(test_time_stamp(info->delta))) {
info->add_timestamp |= RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
}
}

Thanks,

-- Steve
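The resulting decision order can be captured in a small standalone sketch. pick_stamp() is a hypothetical helper that only models the branch structure proposed above, not the kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum stamp { STAMP_SUBBUF, STAMP_DELTA, STAMP_ABSOLUTE };

/* Toy model of the proposed branch ordering: the first event on a
 * sub-buffer (w == 0) always uses the sub-buffer timestamp, so
 * corrupted before/write stamps only force an absolute timestamp
 * for later events. */
static enum stamp pick_stamp(uint64_t w, bool a_ok, bool b_ok,
			     uint64_t before, uint64_t after)
{
	if (!w)
		return STAMP_SUBBUF;	/* delta = 0, sub-buffer ts */
	if (!a_ok || !b_ok || before != after)
		return STAMP_ABSOLUTE;	/* stamps unusable, force abs */
	return STAMP_DELTA;		/* delta = ts - after */
}
```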




Re: [PATCH] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Steven Rostedt
On Tue, 12 Dec 2023 09:23:54 -0500
Mathieu Desnoyers  wrote:

> On 2023-12-12 08:44, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > If for some reason the trace_marker write does not have a nul byte for the
> > string, it will overflow the print:  
> 
> Does this result in leaking kernel memory to userspace ? If so, it
> should state "Fixes..." and CC stable.

No, it was triggered because of a bug elsewhere ;-)

  
https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76...@gandalf.local.home/

Which does have a Cc stable and Fixes tag.

The event truncated the trace_marker output and caused the buffer overflow
here. The trace_marker always adds a '\0', but that got dropped due to the
other bug. This is just hardening the kernel.

Note, this can only happen with the new code that allows trace_marker to
use the max size of the buffer, which is for the next kernel release.

-- Steve



Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Mathieu Desnoyers

On 2023-12-12 09:00, Steven Rostedt wrote:
[...]

--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
enum event_trigger_type tt = ETT_NONE;
struct trace_buffer *buffer;
struct print_entry *entry;
+   int meta_size;
ssize_t written;
int size;
int len;
@@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
if (!(tr->trace_flags & TRACE_ITER_MARKERS))
return -EINVAL;
  
-	if (cnt > TRACE_BUF_SIZE)
-		cnt = TRACE_BUF_SIZE;

You're removing an early bound check for a size_t userspace input...


-
-   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
-
-   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
+   meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
+ again:
+   size = cnt + meta_size;


... and then implicitly casting it into a "int" size variable, which
can therefore become a negative value.

Just for the sake of not having to rely on ring_buffer_lock_reserve
catching (length > BUF_MAX_DATA_SIZE), I would recommend to add an
early check for negative here.

  
	/* If less than "<faulted>", then make sure we can still add that */

if (cnt < FAULTED_SIZE)
@@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());
-   if (unlikely(!event))
+   if (unlikely(!event)) {
+   /*
+* If the size was greated than what was allowed, then


greater ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH 1/1] hwspinlock: qcom: fix tcsr data for ipq6018

2023-12-12 Thread Chukun Pan
The tcsr_mutex hwlock register size of the ipq6018 SoC is 0x20000[1], so it
should not use the max_register configuration of older SoCs. Using the wrong
configuration causes smem to fail to probe, which in turn prevents devices
that use smem-part to parse partitions from booting.

[2.118227] qcom-smem: probe of 4aa0.memory failed with error -110
[   22.273120] platform 79b.nand-controller: deferred probe pending

Remove 'qcom,ipq6018-tcsr-mutex' setting from of_match to fix this.

[1] commit 72fc3d58b87b ("arm64: dts: qcom: ipq6018: Fix tcsr_mutex register size")
Fixes: 5d4753f741d8 ("hwspinlock: qcom: add support for MMIO on older SoCs")
Signed-off-by: Chukun Pan 
---
 drivers/hwspinlock/qcom_hwspinlock.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/hwspinlock/qcom_hwspinlock.c b/drivers/hwspinlock/qcom_hwspinlock.c
index a0fd67fd2934..814dfe8697bf 100644
--- a/drivers/hwspinlock/qcom_hwspinlock.c
+++ b/drivers/hwspinlock/qcom_hwspinlock.c
@@ -115,7 +115,6 @@ static const struct of_device_id qcom_hwspinlock_of_match[] = {
	{ .compatible = "qcom,sfpb-mutex", .data = &of_sfpb_mutex },
	{ .compatible = "qcom,tcsr-mutex", .data = &of_tcsr_mutex },
	{ .compatible = "qcom,apq8084-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
-	{ .compatible = "qcom,ipq6018-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
	{ .compatible = "qcom,msm8226-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
	{ .compatible = "qcom,msm8974-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
	{ .compatible = "qcom,msm8994-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
-- 
2.25.1




[PATCH 4/4] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure

2023-12-12 Thread Vincent Guittot
Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into arch_update_hw_pressure to reflect that
it returns a pressure applied by HW at a high frequency, which needs
filtering. This pressure is not always related to thermal mitigation; it
can also be generated by max current limitation, for example.

Signed-off-by: Vincent Guittot 
---
 arch/arm/include/asm/topology.h   |  6 ++---
 arch/arm64/include/asm/topology.h |  6 ++---
 drivers/base/arch_topology.c  | 26 +--
 drivers/cpufreq/qcom-cpufreq-hw.c |  4 +--
 include/linux/arch_topology.h |  8 +++---
 include/linux/sched/topology.h|  8 +++---
 .../{thermal_pressure.h => hw_pressure.h} | 14 +-
 include/trace/events/sched.h  |  2 +-
 init/Kconfig  | 12 -
 kernel/sched/core.c   |  8 +++---
 kernel/sched/fair.c   | 12 -
 kernel/sched/pelt.c   | 18 ++---
 kernel/sched/pelt.h   | 16 ++--
 kernel/sched/sched.h  |  4 +--
 14 files changed, 72 insertions(+), 72 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 853c4f81ba4a..e175e8596b5d 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -22,9 +22,9 @@
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure   topology_update_thermal_pressure
+/* Replace task scheduler's default hw pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressuretopology_update_hw_pressure
 
 #else
 
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a323b109b9c4..a427650bdfba 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure   topology_update_thermal_pressure
+/* Replace task scheduler's default hw pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressuretopology_update_hw_pressure
 
 #include 
 
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 0906114963ff..3d8dc9d5c3ad 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -22,7 +22,7 @@
 #include 
 
 #define CREATE_TRACE_POINTS
-#include 
+#include 
 
 static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
 static struct cpumask scale_freq_counters_mask;
@@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned 
long capacity)
per_cpu(cpu_scale, cpu) = capacity;
 }
 
-DEFINE_PER_CPU(unsigned long, thermal_pressure);
+DEFINE_PER_CPU(unsigned long, hw_pressure);
 
 /**
- * topology_update_thermal_pressure() - Update thermal pressure for CPUs
+ * topology_update_hw_pressure() - Update hw pressure for CPUs
  * @cpus: The related CPUs for which capacity has been reduced
  * @capped_freq : The maximum allowed frequency that CPUs can run at
  *
- * Update the value of thermal pressure for all @cpus in the mask. The
+ * Update the value of hw pressure for all @cpus in the mask. The
  * cpumask should include all (online+offline) affected CPUs, to avoid
  * operating on stale data when hot-plug is used for some CPUs. The
  * @capped_freq reflects the currently allowed max CPUs frequency due to
- * thermal capping. It might be also a boost frequency value, which is bigger
+ * hw capping. It might be also a boost frequency value, which is bigger
  * than the internal 'capacity_freq_ref' max frequency. In such case the
  * pressure value should simply be removed, since this is an indication that
- * there is no thermal throttling. The @capped_freq must be provided in kHz.
+ * there is no hw throttling. The @capped_freq must be provided in kHz.
  */
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
  unsigned long capped_freq)
 {
-   unsigned long max_capacity, capacity, th_pressure;
+   unsigned long max_capacity, capacity, hw_pressure;
u32 max_freq;
int cpu;
 
@@ -189,21 +189,21 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
 

[PATCH 1/4] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-12 Thread Vincent Guittot
Provide the scheduler with feedback about the temporary max available
capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
filtered as the pressure will last for dozens of ms or more.

Signed-off-by: Vincent Guittot 
---
 drivers/cpufreq/cpufreq.c | 48 +++
 include/linux/cpufreq.h   | 10 
 2 files changed, 58 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 44db4f59c4cc..7d5f71be8d29 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
 }
 EXPORT_SYMBOL(cpufreq_get_policy);
 
+DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
+EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
+
+/**
+ * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
+ * @cpus: The related CPUs for which max capacity has been reduced
+ * @capped_freq : The maximum allowed frequency that CPUs can run at
+ *
+ * Update the value of cpufreq pressure for all @cpus in the mask. The
+ * cpumask should include all (online+offline) affected CPUs, to avoid
+ * operating on stale data when hot-plug is used for some CPUs. The
+ * @capped_freq reflects the currently allowed max CPUs frequency due to
+ * freq_qos capping. It might be also a boost frequency value, which is bigger
+ * than the internal 'capacity_freq_ref' max frequency. In such case the
+ * pressure value should simply be removed, since this is an indication that
+ * there is no capping. The @capped_freq must be provided in kHz.
+ */
+static void cpufreq_update_pressure(const struct cpumask *cpus,
+ unsigned long capped_freq)
+{
+   unsigned long max_capacity, capacity, pressure;
+   u32 max_freq;
+   int cpu;
+
+   cpu = cpumask_first(cpus);
+   max_capacity = arch_scale_cpu_capacity(cpu);
+   max_freq = arch_scale_freq_ref(cpu);
+
+   /*
+* Handle properly the boost frequencies, which should simply clean
+* the thermal pressure value.
+*/
+   if (max_freq <= capped_freq)
+   capacity = max_capacity;
+   else
+   capacity = mult_frac(max_capacity, capped_freq, max_freq);
+
+   pressure = max_capacity - capacity;
+
+
+   for_each_cpu(cpu, cpus)
+   WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
+}
+
 /**
  * cpufreq_set_policy - Modify cpufreq policy parameters.
  * @policy: Policy object to modify.
@@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 {
struct cpufreq_policy_data new_data;
struct cpufreq_governor *old_gov;
+   struct cpumask *cpus;
int ret;
 
memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
@@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy 
*policy,
policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
trace_cpu_frequency_limits(policy);
 
+   cpus = policy->related_cpus;
+   cpufreq_update_pressure(cpus, policy->max);
+
policy->cached_target_freq = UINT_MAX;
 
pr_debug("new min and max freqs are %u - %u kHz\n",
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index afda5f24d3dd..b1d97edd3253 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct 
cpufreq_policy *policy);
 void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
 void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
 bool has_target_index(void);
+
+DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+   return per_cpu(cpufreq_pressure, cpu);
+}
 #else
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
@@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
return false;
 }
 static inline void disable_cpufreq(void) { }
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+   return 0;
+}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_STAT
-- 
2.34.1




[PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure()

2023-12-12 Thread Vincent Guittot
arch_update_thermal_pressure() aims to update a fast-changing signal which
should be averaged using PELT filtering before being provided to the
scheduler, which can't make smart use of a fast-changing signal.
cpufreq now provides the maximum freq_qos pressure on the capacity to the
scheduler, which includes the cpufreq cooling device. Remove the call to
arch_update_thermal_pressure() in the cpufreq cooling device as this is
now handled by cpufreq_get_pressure().

Signed-off-by: Vincent Guittot 
---
 drivers/thermal/cpufreq_cooling.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c 
b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..e77d3b44903e 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -448,7 +448,6 @@ static int cpufreq_set_cur_state(struct 
thermal_cooling_device *cdev,
 unsigned long state)
 {
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
-   struct cpumask *cpus;
unsigned int frequency;
int ret;
 
@@ -465,8 +464,6 @@ static int cpufreq_set_cur_state(struct 
thermal_cooling_device *cdev,
ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
if (ret >= 0) {
cpufreq_cdev->cpufreq_state = state;
-   cpus = cpufreq_cdev->policy->related_cpus;
-   arch_update_thermal_pressure(cpus, frequency);
ret = 0;
}
 
-- 
2.34.1




[PATCH 2/4] sched: Take cpufreq feedback into account

2023-12-12 Thread Vincent Guittot
Aggregate the different pressures applied on the capacity of CPUs and
create a new function that returns the actual capacity of the CPU:
  get_actual_cpu_capacity()

Signed-off-by: Vincent Guittot 
---
 kernel/sched/fair.c | 43 +++
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..11d3be829302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4932,12 +4932,20 @@ static inline void util_est_update(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_se_tp(&p->se);
 }
 
+static inline unsigned long get_actual_cpu_capacity(int cpu)
+{
+   unsigned long capacity = arch_scale_cpu_capacity(cpu);
+
+   capacity -= max(thermal_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));
+
+   return capacity;
+}
 static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_min,
unsigned long uclamp_max,
int cpu)
 {
-   unsigned long capacity_orig, capacity_orig_thermal;
+   unsigned long capacity_orig;
unsigned long capacity = capacity_of(cpu);
bool fits, uclamp_max_fits;
 
@@ -4970,7 +4978,6 @@ static inline int util_fits_cpu(unsigned long util,
 * goal is to cap the task. So it's okay if it's getting less.
 */
capacity_orig = arch_scale_cpu_capacity(cpu);
-   capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
 
/*
 * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -5045,7 +5052,7 @@ static inline int util_fits_cpu(unsigned long util,
 * handle the case uclamp_min > uclamp_max.
 */
uclamp_min = min(uclamp_min, uclamp_max);
-   if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+   if (fits && (util < uclamp_min) && (uclamp_min > get_actual_cpu_capacity(cpu)))
return -1;
 
return fits;
@@ -7426,7 +7433,7 @@ select_idle_capacity(struct task_struct *p, struct 
sched_domain *sd, int target)
 * Look for the CPU with best capacity.
 */
else if (fits < 0)
-   cpu_cap = arch_scale_cpu_capacity(cpu) - thermal_load_avg(cpu_rq(cpu));
+   cpu_cap = get_actual_cpu_capacity(cpu);
 
/*
 * First, select CPU which fits better (-1 being better than 0).
@@ -7919,8 +7926,8 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
int prev_fits = -1, best_fits = -1;
-   unsigned long best_thermal_cap = 0;
-   unsigned long prev_thermal_cap = 0;
+   unsigned long best_actual_cap = 0;
+   unsigned long prev_actual_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -7950,7 +7957,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
 
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
-   unsigned long cpu_cap, cpu_thermal_cap, util;
+   unsigned long cpu_cap, cpu_actual_cap, util;
long prev_spare_cap = -1, max_spare_cap = -1;
unsigned long rq_util_min, rq_util_max;
unsigned long cur_delta, base_energy;
@@ -7962,18 +7969,17 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (cpumask_empty(cpus))
continue;
 
-   /* Account thermal pressure for the energy estimation */
+   /* Account external pressure for the energy estimation */
cpu = cpumask_first(cpus);
-   cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
-   cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+   cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-   eenv.cpu_cap = cpu_thermal_cap;
+   eenv.cpu_cap = cpu_actual_cap;
eenv.pd_cap = 0;
 
for_each_cpu(cpu, cpus) {
struct rq *rq = cpu_rq(cpu);
 
-   eenv.pd_cap += cpu_thermal_cap;
+   eenv.pd_cap += cpu_actual_cap;
 
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
@@ -8044,7 +8050,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
-   prev_thermal_cap = cpu_thermal_cap;
+   prev_actual_cap = cpu_actual_cap;
best_delta = min(best_delta, prev_delta);
}
 
@@ 

[PATCH 0/5] Rework system pressure interface to the scheduler

2023-12-12 Thread Vincent Guittot
Following the consolidation and cleanup of CPU capacity in [1], this series
reworks how the scheduler gets the pressures on CPUs. We need to take into
account all pressures applied by cpufreq on the compute capacity of a CPU
for dozens of ms or more, and not only the cpufreq cooling device or HW
mitigations. We split the pressure applied on a CPU's capacity in 2 parts:
- one from cpufreq and freq_qos
- one from HW high-frequency mitigations.

The next step will be to add a dedicated interface for long-standing
capping of the CPU capacity (i.e. for seconds or more), like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this series, but as a temporary pressure, which is not always
the best choice when we know that it will happen for seconds or more.

[1] 
https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guit...@linaro.org/

Vincent Guittot (4):
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched: Take cpufreq feedback into account
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched: Rename arch_update_thermal_pressure into
arch_update_hw_pressure

 arch/arm/include/asm/topology.h   |  6 +--
 arch/arm64/include/asm/topology.h |  6 +--
 drivers/base/arch_topology.c  | 26 -
 drivers/cpufreq/cpufreq.c | 48 +
 drivers/cpufreq/qcom-cpufreq-hw.c |  4 +-
 drivers/thermal/cpufreq_cooling.c |  3 --
 include/linux/arch_topology.h |  8 +--
 include/linux/cpufreq.h   | 10 
 include/linux/sched/topology.h|  8 +--
 .../{thermal_pressure.h => hw_pressure.h} | 14 ++---
 include/trace/events/sched.h  |  2 +-
 init/Kconfig  | 12 ++---
 kernel/sched/core.c   |  8 +--
 kernel/sched/fair.c   | 53 ++-
 kernel/sched/pelt.c   | 18 +++
 kernel/sched/pelt.h   | 16 +++---
 kernel/sched/sched.h  |  4 +-
 17 files changed, 152 insertions(+), 94 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

-- 
2.34.1
 



Re: [PATCH] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Mathieu Desnoyers

On 2023-12-12 08:44, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:


Does this result in leaking kernel memory to userspace ? If so, it
should state "Fixes..." and CC stable.

Thanks,

Mathieu



   trace_seq_printf(s, ": %s", field->buf);

The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.

   int max = iter->ent_size - offsetof(struct print_entry, buf);

   trace_seq_printf(s, ": %*s", max, field->buf);

Signed-off-by: Steven Rostedt (Google) 
---
  kernel/trace/trace_output.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..e11fb8996286 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct 
trace_iterator *iter,
  {
struct print_entry *field;
struct trace_seq *s = &iter->seq;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
  
  	trace_assign_type(field, iter->ent);
  
  	seq_print_ip_sym(s, field->ip, flags);

-   trace_seq_printf(s, ": %s", field->buf);
+   trace_seq_printf(s, ": %*s", max, field->buf);
  
  	return trace_handle_return(s);

  }
@@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct 
trace_iterator *iter, int flags,
 struct trace_event *event)
  {
struct print_entry *field;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
  
  	trace_assign_type(field, iter->ent);
  
-	trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+   trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf);
  
return trace_handle_return(&iter->seq);

  }


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v2] ring-buffer: Never use absolute timestamp for first event

2023-12-12 Thread Google
On Tue, 12 Dec 2023 07:18:37 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> On 32bit machines, the 64 bit timestamps are broken up into 32 bit words
> to keep from using local64_cmpxchg(), as that is very expensive on 32 bit
> architectures.
> 
> On 32 bit architectures, reading these timestamps can happen in a middle
> of an update. In this case, the read returns "false", telling the caller
> that the timestamp is in the middle of an update, and it needs to assume
> it is corrupted. The code then accommodates this.
> 
> When first reserving space on the ring buffer, a "before_stamp" and
> "write_stamp" are read. If they do not match, or if either is in the
> process of being updated (false was returned from the read), an absolute
> timestamp is added and the delta is not used, as that requires reading
> these timestamps without being corrupted.
> 
> The one case that this does not matter is if the event is the first event
> on the sub-buffer, in which case, the event uses the sub-buffer's
> timestamp and doesn't need the other stamps for calculating them.
> 
> After some work to consolidate the code, if the before or write stamps are
> in the process of updating, an absolute timestamp will be added regardless
> if the event is the first event on the sub-buffer. This is wrong as it
> should not care about the success of these reads if it is the first event
> on the sub-buffer.
> 
> Fix up the parenthesis so that even if the timestamps are corrupted, if
> the event is the first event on the sub-buffer (w == 0) it still does not
> force an absolute timestamp.
> 
> It's actually likely that w is not zero, but move it out of the unlikely()
> and test it first. It should be in hot cache anyway, and there's no reason
> to do the rest of the test for the first event on the sub-buffer. And this
> prevents having to test all the 'or' statements in that case.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
> Signed-off-by: Steven Rostedt (Google) 
> ---
> Changes since v1: 
> https://lore.kernel.org/linux-trace-kernel/2023125949.4692e...@gandalf.local.home
> 
> - Move the test of 'w' out of the unlikely() and do it first.
>   It's already in hot cache, and the rest of test shouldn't be done
>   if 'w' is zero.
> 
>  kernel/trace/ring_buffer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index b416bdf6c44a..095b86081ea8 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3581,7 +3581,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu 
> *cpu_buffer,
>* absolute timestamp.
>* Don't bother if this is the start of a new page (w == 0).
>*/
> - if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
> + if (w && unlikely(!a_ok || !b_ok || info->before != info->after)) {
>   info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
>   info->length += RB_LEN_TIME_EXTEND;
>   } else {

After this else,

} else {
info->delta = info->ts - info->after;

The code is using info->after, but it is not guaranteed that 'a_ok' is true.
Does this mean that if 'w == 0 && !a_ok' this doesn't work correctly?
What will be the expected behavior when w == 0 here?

Thank you,


> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH V1] remoteproc: virtio: Fix wdg cannot recovery remote processor

2023-12-12 Thread Joakim Zhang

Hello maintainers,

This patch may not fix it in the correct way. After applying this patch, in
rproc_add_virtio_dev():

1) If the allocation path is dma_declare_coherent_memory(), the memory will be
freed by dma_release_coherent_memory(), which is expected

2) If the allocation path is of_reserved_mem_device_init_by_idx(), the memory
will still be freed by dma_release_coherent_memory(), which is not expected

To fix this issue, I also introduced another patch: 
https://lore.kernel.org/lkml/20231212052130.2051333-1-joakim.zh...@cixtech.com/T/

Are there any suggestions? Thanks.

Joakim

> -----Original Message-----
> From: Joakim Zhang 
> Sent: December 12, 2023 13:24
> To: anders...@kernel.org; mathieu.poir...@linaro.org;
> arnaud.pouliq...@foss.st.com
> Cc: linux-remotep...@vger.kernel.org; linux-kernel@vger.kernel.org;
> cix-kernel-upstream ; Joakim Zhang
> 
> Subject: [PATCH V1] remoteproc: virtio: Fix wdg cannot recovery remote
> processor
> 
> From: Joakim Zhang 
> 
> Recovering the remote processor failed when a wdg (watchdog) IRQ was received:
> [0.842574] remoteproc remoteproc0: crash detected in cix-dsp-rproc:
> type watchdog
> [0.842750] remoteproc remoteproc0: handling crash #1 in cix-dsp-rproc
> [0.842824] remoteproc remoteproc0: recovering cix-dsp-rproc
> [0.843342] remoteproc remoteproc0: stopped remote processor
> cix-dsp-rproc
> [0.847901] rproc-virtio rproc-virtio.0.auto: Failed to associate buffer
> [0.847979] remoteproc remoteproc0: failed to probe subdevices for
> cix-dsp-rproc: -16
> 
> The reason is that the dma coherent memory is not released when recovering
> the remote processor, because rproc_virtio_remove(), where the memory is
> released, is not called. It then fails when trying to allocate and associate
> the buffer again.
> 
> We can see that dma coherent mem allocated from rproc_add_virtio_dev(), so
> should release it from rproc_remove_virtio_dev(). These functions should
> appear symmetrically:
> -rproc_vdev_do_start()->rproc_add_virtio_dev()->dma_declare_coherent_mem
> ory()
> -rproc_vdev_do_stop()->rproc_remove_virtio_dev()->dma_release_coherent_m
> emory()
> 
> Fixes: 1d7b61c06dc3 ("remoteproc: virtio: Create platform device for the
> remoteproc_virtio")
> Signed-off-by: Joakim Zhang 
> ---
>  drivers/remoteproc/remoteproc_virtio.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_virtio.c
> b/drivers/remoteproc/remoteproc_virtio.c
> index 83d76915a6ad..725b957ee226 100644
> --- a/drivers/remoteproc/remoteproc_virtio.c
> +++ b/drivers/remoteproc/remoteproc_virtio.c
> @@ -465,8 +465,12 @@ static int rproc_add_virtio_dev(struct rproc_vdev
> *rvdev, int id)  static int rproc_remove_virtio_dev(struct device *dev, void
> *data)  {
>   struct virtio_device *vdev = dev_to_virtio(dev);
> + struct rproc_vdev *rvdev = vdev_to_rvdev(vdev);
> 
>   unregister_virtio_device(vdev);
> +
> + dma_release_coherent_memory(&rvdev->pdev->dev);
> +
>   return 0;
>  }
> 
> @@ -585,7 +589,6 @@ static void rproc_virtio_remove(struct platform_device
> *pdev)
>   rproc_remove_rvdev(rvdev);
> 
>   of_reserved_mem_device_release(&pdev->dev);
> - dma_release_coherent_memory(&pdev->dev);
> 
>   put_device(&rproc->dev);
>  }
> --
> 2.25.1



[PATCH v2] tracing: Allow for max buffer data size trace_marker writes

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow a trace write to be as big as the ring buffer tracing data will
allow. Currently, it only allows writes of 1KB in size, but there's no
reason that it cannot allow what the ring buffer can hold.

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231209175003.63db4...@gandalf.local.home

- Now that there's a new fix for the max event size, there's no more
  BUF_MAX_EVENT_SIZE macro. Now the BUF_MAX_DATA_SIZE can be used again.

- Check if the buffer itself is requesting forced timestamps, and if so,
  decrement from the max size, the timestamp size.

- This no longer depends on the previous fix change, as it's now using
  existing macros.

 include/linux/ring_buffer.h |  1 +
 kernel/trace/ring_buffer.c  | 15 +++
 kernel/trace/trace.c| 28 +---
 3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 782e14f62201..b1b03b2c0f08 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -141,6 +141,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
 bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter);
 
 unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu);
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer);
 
 void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu);
 void ring_buffer_reset_online_cpus(struct trace_buffer *buffer);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a3eaa052f4de..882aab2bede3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5250,6 +5250,21 @@ unsigned long ring_buffer_size(struct trace_buffer 
*buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_size);
 
+/**
+ * ring_buffer_max_event_size - return the max data size of an event
+ * @buffer: The ring buffer.
+ *
+ * Returns the maximum size an event can be.
+ */
+unsigned long ring_buffer_max_event_size(struct trace_buffer *buffer)
+{
+   /* If abs timestamp is requested, events have a timestamp too */
+   if (ring_buffer_time_stamp_abs(buffer))
+   return BUF_MAX_DATA_SIZE - RB_LEN_TIME_EXTEND;
+   return BUF_MAX_DATA_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_max_event_size);
+
 static void rb_clear_buffer_page(struct buffer_page *page)
 {
local_set(&page->write, 0);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ef86379555e4..bd6d28dad05d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
enum event_trigger_type tt = ETT_NONE;
struct trace_buffer *buffer;
struct print_entry *entry;
+   int meta_size;
ssize_t written;
int size;
int len;
@@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if (!(tr->trace_flags & TRACE_ITER_MARKERS))
return -EINVAL;
 
-   if (cnt > TRACE_BUF_SIZE)
-   cnt = TRACE_BUF_SIZE;
-
-   BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
-
-   size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
+   meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
+ again:
+   size = cnt + meta_size;
 
/* If less than "<faulted>", then make sure we can still add that */
if (cnt < FAULTED_SIZE)
@@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());
-   if (unlikely(!event))
+   if (unlikely(!event)) {
+   /*
* If the size was greater than what was allowed, then
+* make it smaller and try again.
+*/
+   if (size > ring_buffer_max_event_size(buffer)) {
+   /* cnt < FAULTED size should never be bigger than max */
+   if (WARN_ON_ONCE(cnt < FAULTED_SIZE))
+   return -EBADF;
+   cnt = ring_buffer_max_event_size(buffer) - meta_size;
+   /* The above should only happen once */
+   if (WARN_ON_ONCE(cnt + meta_size == size))
+   return -EBADF;
+   goto again;
+   }
+
/* Ring buffer disabled, return as if not open for write */
return -EBADF;
+   }
 
entry = ring_buffer_event_data(event);
entry->ip = _THIS_IP_;
-- 
2.42.0




[linus:master] [ring] f458a14534: WARNING:at_kernel/trace/ring_buffer_benchmark.c:#ring_buffer_consumer

2023-12-12 Thread kernel test robot
benchmark.c:147) 
[ 38.107426][ T45] ? exc_overflow (arch/x86/kernel/traps.c:250) 
[ 38.108429][ T45] ? ring_buffer_consumer 
(kernel/trace/ring_buffer_benchmark.c:147) 
[ 38.109508][ T45] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:63) 
[ 38.110505][ T45] ? complete (kernel/sched/completion.c:48) 
[ 38.111212][ T45] ring_buffer_consumer_thread 
(arch/x86/include/asm/current.h:41 kernel/trace/ring_buffer_benchmark.c:388) 
[ 38.112335][ T45] kthread (kernel/kthread.c:390) 
[ 38.113096][ T45] ? rb_check_timestamp 
(kernel/trace/ring_buffer_benchmark.c:382) 
[ 38.114061][ T45] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 38.114962][ T45] ? kthread_unuse_mm (kernel/kthread.c:341) 
[ 38.115897][ T45] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 38.116892][ T45] ret_from_fork_asm (arch/x86/entry/entry_32.S:741) 
[ 38.117827][ T45] entry_INT80_32 (arch/x86/entry/entry_32.S:947) 
[   38.118756][   T45] irq event stamp: 45237315
[ 38.119660][ T45] hardirqs last enabled at (45237323): console_unlock 
(arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:67 
arch/x86/include/asm/irqflags.h:127 kernel/printk/printk.c:341 
kernel/printk/printk.c:2706 kernel/printk/printk.c:3038) 
[ 38.121456][ T45] hardirqs last disabled at (45237572): console_unlock 
(kernel/printk/printk.c:339) 
[ 38.123207][ T45] softirqs last enabled at (45237570): do_softirq_own_stack 
(arch/x86/kernel/irq_32.c:57 arch/x86/kernel/irq_32.c:147) 
[ 38.125116][ T45] softirqs last disabled at (45237589): do_softirq_own_stack 
(arch/x86/kernel/irq_32.c:57 arch/x86/kernel/irq_32.c:147) 
[   38.126991][   T45] ---[ end trace  ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231212/202312121655.f8f36552-oliver.s...@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




[PATCH] tracing: Add size check when printing trace_marker output

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:

  trace_seq_printf(s, ": %s", field->buf);

The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.

  int max = iter->ent_size - offsetof(struct print_entry, buf);

  trace_seq_printf(s, ": %*s", max, field->buf);

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..e11fb8996286 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct 
trace_iterator *iter,
 {
struct print_entry *field;
struct trace_seq *s = &iter->seq;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
 
trace_assign_type(field, iter->ent);
 
seq_print_ip_sym(s, field->ip, flags);
-   trace_seq_printf(s, ": %s", field->buf);
+   trace_seq_printf(s, ": %*s", max, field->buf);
 
return trace_handle_return(s);
 }
@@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct 
trace_iterator *iter, int flags,
 struct trace_event *event)
 {
struct print_entry *field;
+   int max = iter->ent_size - offsetof(struct print_entry, buf);
 
trace_assign_type(field, iter->ent);
 
-   trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+   trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf);
 
return trace_handle_return(&iter->seq);
 }
-- 
2.42.0




[PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-12 Thread Maxime Coquelin
This patch introduces an LSM hook for device creation,
destruction (ioctl()) and opening (open()) operations,
checking that the application is allowed to perform these
operations for the given Virtio device type.

Signed-off-by: Maxime Coquelin 
---
 MAINTAINERS |  1 +
 drivers/vdpa/vdpa_user/vduse_dev.c  | 13 
 include/linux/lsm_hook_defs.h   |  2 ++
 include/linux/security.h|  6 ++
 include/linux/vduse.h   | 14 +
 security/security.c | 15 ++
 security/selinux/hooks.c| 32 +
 security/selinux/include/classmap.h |  2 ++
 8 files changed, 85 insertions(+)
 create mode 100644 include/linux/vduse.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a0fb0df07b43..4e83b14358d2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23040,6 +23040,7 @@ F:  drivers/net/virtio_net.c
 F: drivers/vdpa/
 F: drivers/virtio/
 F: include/linux/vdpa.h
+F: include/linux/vduse.h
 F: include/linux/virtio*.h
 F: include/linux/vringh.h
 F: include/uapi/linux/virtio_*.h
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index fa62825be378..59ab7eb62e20 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -8,6 +8,7 @@
  *
  */
 
+#include "linux/security.h"
 #include 
 #include 
 #include 
@@ -30,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "iova_domain.h"
 
@@ -1442,6 +1444,10 @@ static int vduse_dev_open(struct inode *inode, struct 
file *file)
if (dev->connected)
goto unlock;
 
+   ret = -EPERM;
+   if (security_vduse_perm_check(VDUSE_PERM_OPEN, dev->device_id))
+   goto unlock;
+
ret = 0;
dev->connected = true;
file->private_data = dev;
@@ -1664,6 +1670,9 @@ static int vduse_destroy_dev(char *name)
if (!dev)
return -EINVAL;
 
+   if (security_vduse_perm_check(VDUSE_PERM_DESTROY, dev->device_id))
+   return -EPERM;
+
mutex_lock(&dev->lock);
if (dev->vdev || dev->connected) {
mutex_unlock(&dev->lock);
@@ -1828,6 +1837,10 @@ static int vduse_create_dev(struct vduse_dev_config 
*config,
int ret;
struct vduse_dev *dev;
 
+   ret = -EPERM;
+   if (security_vduse_perm_check(VDUSE_PERM_CREATE, config->device_id))
+   goto err;
+
ret = -EEXIST;
if (vduse_find_dev(config->name))
goto err;
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index ff217a5ce552..3930ab2ae974 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -419,3 +419,5 @@ LSM_HOOK(int, 0, uring_override_creds, const struct cred 
*new)
 LSM_HOOK(int, 0, uring_sqpoll, void)
 LSM_HOOK(int, 0, uring_cmd, struct io_uring_cmd *ioucmd)
 #endif /* CONFIG_IO_URING */
+
+LSM_HOOK(int, 0, vduse_perm_check, enum vduse_op_perm op_perm, u32 device_id)
diff --git a/include/linux/security.h b/include/linux/security.h
index 1d1df326c881..2a2054172394 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct linux_binprm;
 struct cred;
@@ -484,6 +485,7 @@ int security_inode_notifysecctx(struct inode *inode, void 
*ctx, u32 ctxlen);
 int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
 int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
 int security_locked_down(enum lockdown_reason what);
+int security_vduse_perm_check(enum vduse_op_perm op_perm, u32 device_id);
 #else /* CONFIG_SECURITY */
 
 static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1395,6 +1397,10 @@ static inline int security_locked_down(enum 
lockdown_reason what)
 {
return 0;
 }
+static inline int security_vduse_perm_check(enum vduse_op_perm op_perm, u32 
device_id)
+{
+   return 0;
+}
 #endif /* CONFIG_SECURITY */
 
 #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/include/linux/vduse.h b/include/linux/vduse.h
new file mode 100644
index ..7a20dcc43997
--- /dev/null
+++ b/include/linux/vduse.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VDUSE_H
+#define _LINUX_VDUSE_H
+
+/*
+ * The permission required for a VDUSE device operation.
+ */
+enum vduse_op_perm {
+   VDUSE_PERM_CREATE,
+   VDUSE_PERM_DESTROY,
+   VDUSE_PERM_OPEN,
+};
+
+#endif /* _LINUX_VDUSE_H */
diff --git a/security/security.c b/security/security.c
index dcb3e7014f9b..150abf85f97d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -5337,3 +5337,18 @@ int security_uring_cmd(struct io_uring_cmd *ioucmd)
return call_int_hook(uring_cmd, 0, ioucmd);
 }
 #endif /* CONFIG_IO_URING */
+
+/**
+ * security_vduse_perm_check() - Check if a VDUSE device type operation is 
allowed
+ * @op_perm: the operation 

[PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-12 Thread Maxime Coquelin
The Virtio-net driver's control queue implementation is not safe
when used with VDUSE. If the VDUSE application does not
reply to control queue messages, it currently ends up
hanging the kernel thread sending this command.

Some work is ongoing to make the control queue
implementation robust with VDUSE. Until it is completed,
let's disable the control virtqueue and the features that
depend on it.

Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 37 ++
 1 file changed, 37 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0486ff672408..fe4b5c8203fd 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "iova_domain.h"
@@ -46,6 +47,30 @@
 
 #define IRQ_UNBOUND -1
 
+#define VDUSE_NET_VALID_FEATURES_MASK   \
+   (BIT_ULL(VIRTIO_NET_F_CSUM) |   \
+BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | \
+BIT_ULL(VIRTIO_NET_F_MTU) |\
+BIT_ULL(VIRTIO_NET_F_MAC) |\
+BIT_ULL(VIRTIO_NET_F_GUEST_TSO4) | \
+BIT_ULL(VIRTIO_NET_F_GUEST_TSO6) | \
+BIT_ULL(VIRTIO_NET_F_GUEST_ECN) |  \
+BIT_ULL(VIRTIO_NET_F_GUEST_UFO) |  \
+BIT_ULL(VIRTIO_NET_F_HOST_TSO4) |  \
+BIT_ULL(VIRTIO_NET_F_HOST_TSO6) |  \
+BIT_ULL(VIRTIO_NET_F_HOST_ECN) |   \
+BIT_ULL(VIRTIO_NET_F_HOST_UFO) |   \
+BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) |  \
+BIT_ULL(VIRTIO_NET_F_STATUS) | \
+BIT_ULL(VIRTIO_NET_F_HOST_USO) |   \
+BIT_ULL(VIRTIO_F_ANY_LAYOUT) | \
+BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC) | \
+BIT_ULL(VIRTIO_RING_F_EVENT_IDX) |  \
+BIT_ULL(VIRTIO_F_VERSION_1) |  \
+BIT_ULL(VIRTIO_F_ACCESS_PLATFORM) | \
+BIT_ULL(VIRTIO_F_RING_PACKED) |\
+BIT_ULL(VIRTIO_F_IN_ORDER))
+
 struct vduse_virtqueue {
u16 index;
u16 num_max;
@@ -1782,6 +1807,16 @@ static struct attribute *vduse_dev_attrs[] = {
 
 ATTRIBUTE_GROUPS(vduse_dev);
 
+static void vduse_dev_features_filter(struct vduse_dev_config *config)
+{
+   /*
+* Temporarily filter out virtio-net's control virtqueue and features
+* that depend on it while CVQ is being made more robust for VDUSE.
+*/
+   if (config->device_id == VIRTIO_ID_NET)
+   config->features &= VDUSE_NET_VALID_FEATURES_MASK;
+}
+
 static int vduse_create_dev(struct vduse_dev_config *config,
void *config_buf, u64 api_version)
 {
@@ -1797,6 +1832,8 @@ static int vduse_create_dev(struct vduse_dev_config *config,
if (!dev)
goto err;
 
+   vduse_dev_features_filter(config);
+
dev->api_version = api_version;
dev->device_features = config->features;
dev->device_id = config->device_id;
-- 
2.43.0




[PATCH v5 3/4] vduse: enable Virtio-net device type

2023-12-12 Thread Maxime Coquelin
This patch adds the Virtio-net device type to the supported
device types. Initialization fails if the device does
not support the VIRTIO_F_VERSION_1 feature, in order to
guarantee the configuration space is read-only.

Acked-by: Jason Wang 
Reviewed-by: Xie Yongji 
Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index fe4b5c8203fd..fa62825be378 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -166,6 +166,7 @@ static struct workqueue_struct *vduse_irq_bound_wq;
 
 static u32 allowed_device_id[] = {
VIRTIO_ID_BLOCK,
+   VIRTIO_ID_NET,
 };
 
 static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
@@ -1706,6 +1707,10 @@ static bool features_is_valid(struct vduse_dev_config *config)
(config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
return false;
 
+   if ((config->device_id == VIRTIO_ID_NET) &&
+   !(config->features & (1ULL << VIRTIO_F_VERSION_1)))
+   return false;
+
return true;
 }
 
@@ -2068,6 +2073,7 @@ static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
 
 static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
+   { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
{ 0 },
 };
 
-- 
2.43.0




[PATCH v5 1/4] vduse: validate block features only with block devices

2023-12-12 Thread Maxime Coquelin
This patch is preliminary work to enable network device
type support in VDUSE.

As VIRTIO_BLK_F_CONFIG_WCE shares the same value as
VIRTIO_NET_F_HOST_TSO4, we need to restrict its check
to Virtio-blk device type.
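The collision can be shown with a minimal sketch. The feature bit and device ID values below are copied from the virtio headers and should be treated as assumptions of this example, and the function only mirrors the CONFIG_WCE slice of the patched features_is_valid(), not the whole check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Both feature bits have the value 11, so a type-blind check on
 * CONFIG_WCE would also reject any net device advertising HOST_TSO4. */
#define VIRTIO_ID_NET            1
#define VIRTIO_ID_BLOCK          2
#define VIRTIO_BLK_F_CONFIG_WCE  11
#define VIRTIO_NET_F_HOST_TSO4   11

/* Mirrors the relevant slice of the patched check: the CONFIG_WCE
 * rejection now only applies to block devices. */
static bool config_wce_check_ok(uint32_t device_id, uint64_t features)
{
	if (device_id == VIRTIO_ID_BLOCK &&
	    (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
		return false;
	return true;
}
```

With the old type-blind check, the second case below would have been rejected.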

Acked-by: Jason Wang 
Reviewed-by: Xie Yongji 
Signed-off-by: Maxime Coquelin 
---
 drivers/vdpa/vdpa_user/vduse_dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0ddd4b8abecb..0486ff672408 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -1671,13 +1671,14 @@ static bool device_is_allowed(u32 device_id)
return false;
 }
 
-static bool features_is_valid(u64 features)
+static bool features_is_valid(struct vduse_dev_config *config)
 {
-   if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
+   if (!(config->features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
return false;
 
/* Now we only support read-only configuration space */
-   if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
+   if ((config->device_id == VIRTIO_ID_BLOCK) &&
+   (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
return false;
 
return true;
@@ -1704,7 +1705,7 @@ static bool vduse_validate_config(struct vduse_dev_config *config)
if (!device_is_allowed(config->device_id))
return false;
 
-   if (!features_is_valid(config->features))
+   if (!features_is_valid(config))
return false;
 
return true;
-- 
2.43.0




[PATCH v5 0/4] vduse: add support for networking devices

2023-12-12 Thread Maxime Coquelin
This small series enables virtio-net device type in VDUSE.
With it, basic operations have been tested, both with
virtio-vdpa and vhost-vdpa using DPDK Vhost library series
adding VDUSE support using split rings layout (merged in
DPDK v23.07-rc1).

Control queue support (and so multiqueue) has also been
tested, but requires a Kernel series from Jason Wang
relaxing control queue polling [1] to function reliably,
so until Jason's rework is done, a patch is added to disable
CVQ and features that depend on it (tested also with DPDK
v23.07-rc1).

In this v5, the LSM hooks introduced in the previous revision
are unified into a single hook that covers the operations below:
- VDUSE_CREATE_DEV ioctl on VDUSE control file,
- VDUSE_DESTROY_DEV ioctl on VDUSE control file,
- open() on VDUSE device file.

In combination with the operations permission, a device type
permission has to be associated:
- block: Virtio block device type,
- net: Virtio networking device type.

Changes in v5:
==
- Move control queue disablement patch before Net
  devices enablement (Jason).
- Unify operations LSM hooks into a single hook.
- Rebase on latest master.

Maxime Coquelin (4):
  vduse: validate block features only with block devices
  vduse: Temporarily disable control queue features
  vduse: enable Virtio-net device type
  vduse: Add LSM hook to check Virtio device type

 MAINTAINERS |  1 +
 drivers/vdpa/vdpa_user/vduse_dev.c  | 65 +++--
 include/linux/lsm_hook_defs.h   |  2 +
 include/linux/security.h|  6 +++
 include/linux/vduse.h   | 14 +++
 security/security.c | 15 +++
 security/selinux/hooks.c| 32 ++
 security/selinux/include/classmap.h |  2 +
 8 files changed, 133 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/vduse.h

-- 
2.43.0




Re: [PATCH v6 2/2] kbuild: rpm-pkg: Fix build with non-default MODLIB

2023-12-12 Thread Michal Suchánek
On Mon, Dec 11, 2023 at 01:33:23PM +0900, Masahiro Yamada wrote:
> On Mon, Dec 11, 2023 at 6:09 AM Michal Suchánek  wrote:
> >
> > On Mon, Dec 11, 2023 at 03:44:35AM +0900, Masahiro Yamada wrote:
> > > On Thu, Dec 7, 2023 at 4:48 AM Michal Suchanek  wrote:
> > > >
> > > > The default MODLIB value is composed of three variables
> > > >
> > > > MODLIB = $(INSTALL_MOD_PATH)$(KERNEL_MODULE_DIRECTORY)/$(KERNELRELEASE)
> > > >
However, the kernel.spec hardcodes the default value of
> > > > $(KERNEL_MODULE_DIRECTORY), and changed value is not reflected when
> > > > building the package.
> > > >
> > > > Pass KERNEL_MODULE_DIRECTORY to kernel.spec to fix this problem.
> > > >
> > > > Signed-off-by: Michal Suchanek 
> > > > ---
> > > > Build on top of the previous patch adding KERNEL_MODULE_DIRECTORY
> > >
> > >
> > > The SRPM package created by 'make srcrpm-pkg' may not work
> > > if rpmbuild is executed in a different machine.
> >
> > That's why there is an option to override KERNEL_MODULE_DIRECTORY?
> 
> 
> Yes.
> But, as I pointed out in 1/2, depmod must follow the packager's decision.
> 
> 'make srcrpm-pkg' creates a SRPM on machine A.
> 'rpmbuild' builds it into binary RPMs on machine B.
> 
> If A and B disagree about kmod.pc, depmod will fail
> because there is no code to force the decision made
> on machine A.

There is. It's the ?= in the top Makefile.

Currently the test that determines the module directory uses make logic,
so it is not possible to extract the underlying shell magic and pass it
on so that it could also be executed inside the rpm spec file.

Outsourcing it into an external script would mean that the sources need
to be unpacked before the script can be executed. That would require
using dynamically generated file list in the spec file because the
module location would not be known at spec parse time. Possible but
convoluted.

In the end I do not think this is a problem that needs solving. Most
distributions that build kernel packages would use their own packaging
files, not rpm-pkg. That limits rpm-pkg to ad-hoc use when people want
to build a one-off test kernel. It's reasonable to do this on the same
distribution as the target system. The option to do so on a distribution
with different module directory is available if somebody really needs
that.

Thanks

Michal



Re: [PATCH v6 1/2] depmod: Handle installing modules under a different directory

2023-12-12 Thread Michal Suchánek
On Mon, Dec 11, 2023 at 01:29:15PM +0900, Masahiro Yamada wrote:
> On Mon, Dec 11, 2023 at 6:07 AM Michal Suchánek  wrote:
> >
> > Hello!
> >
> > On Mon, Dec 11, 2023 at 03:43:44AM +0900, Masahiro Yamada wrote:
> > > On Thu, Dec 7, 2023 at 4:48 AM Michal Suchanek  wrote:
> > > >
> > > > Some distributions aim at shipping all files in /usr.
> > > >
> > > > The path under which kernel modules are installed is hardcoded to /lib
> > > > which conflicts with this goal.
> > > >
> > > > When kmod provides kmod.pc, use it to determine the correct module
> > > > installation path.
> > > >
> > > > With kmod that does not provide the config /lib/modules is used as
> > > > before.
> > > >
> > > > While pkg-config does not return an error when a variable does not exist
> > > > the kmod configure script puts some effort into ensuring that
> > > > module_directory is non-empty. With that empty module_directory from
> > > > pkg-config can be used to detect absence of the variable.
> > > >
> > > > Signed-off-by: Michal Suchanek 
> > > > ---
> > > > v6:
> > > >  - use ?= instead of := to make it easier to override the value
> > >
> > >
> > > "KERNEL_MODULE_DIRECTORY=/local/usr/lib/modules make modules_install"
> > > will override the install destination, but
> > > depmod will not be not aware of it.
> >
> > At the same time if you know what you are doing you can build a src rpm
> > for another system that uses a different location.
> >
> > > How to avoid the depmod error?
> >
> > Not override the variable?
> 
> You are not answering my question.
> You intentionally changed := to ?=.
> 
> This implies that KERNEL_MODULE_DIRECTORY is an interface to users,
> and should be documented in Documentation/kbuild/kbuild.rst

That's reasonable

> However, it never works if it is overridden from the env variable
> or make command line because there is no way to let depmod know
> the fact that KERNEL_MODULE_DIRECTORY has been overridden.

And there should not be. kmod is not aware, kbuild is. That's the
direction of information flow: kmod defines where it looks for the
modules, and kbuild should install the modules there.

If the user knows better (e.g. the possibility you brought up of building
a src-rpm for a different system) they can override the autodetection.

> In my understanding, depmod does not provide an option to
> specify the module directory from a command line option, does it?

No it does not.

> If not, is it reasonable to add a new option to depmod?

I don't think so. The module directory is intentionally in a fixed
location. It can be set at compile time, and that's it.

Then when running depmod on the target distribution either kbuild and
kmod agree on the location or the build fails. That's the intended
outcome.

kmod recently grew the ability to use modules outside of module
directory. For that to work internally the path to these out-of-kernel
modules is stored as absolute path, and the path of modules that are in
the module directory is stored relative to the module directory.

Setting the module directory location dynamically still should not break
this but I am not sure it's a great idea. In the end modprobe needs to
find those modules, and if depmod puts the modules.dep in an arbitrary
location it will not.

> depmod provides the "-b basedir" option, but it only allows
> adding a prefix to the default "/lib/modules/".

Yes, that's for installation into a staging directory, and there again
the modules that are inside the module directory are considered
'in-kernel'. Not sure how well this even works with 'out-of-kernel'
modules.

> (My original idea to provide the prefix_part, it would have worked
> like  -b "${INSTALL_MOD_PATH}${MOD_PREFIX}", which you refused)

It's not clear that adding a prefix covers all use cases. It is an
arbitrary limitation that the module path must end with '/lib/modules'.

It may allow taking some shortcuts in some places but is unnecessarily
limiting.

Thanks

Michal



[PATCH] ring-buffer: Have saved event hold the entire event

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.

But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.

Cc: sta...@vger.kernel.org
Fixes: 785888c544e04 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 095b86081ea8..32e41ca41caf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2409,7 +2409,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 */
barrier();
 
-   if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE)
+   if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
/* Writer corrupted the read? */
goto reset;
 
@@ -5115,7 +5115,8 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
if (!iter)
return NULL;
 
-   iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags);
+   /* Holds the entire event: data and meta data */
+   iter->event = kmalloc(BUF_PAGE_SIZE, flags);
if (!iter->event) {
kfree(iter);
return NULL;
-- 
2.42.0




[PATCH v2] ring-buffer: Never use absolute timestamp for first event

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

On 32bit machines, the 64 bit timestamps are broken up into 32 bit words
to keep from using local64_cmpxchg(), as that is very expensive on 32 bit
architectures.

On 32 bit architectures, reading these timestamps can happen in a middle
of an update. In this case, the read returns "false", telling the caller
that the timestamp is in the middle of an update, and it needs to assume
it is corrupted. The code then accommodates this.
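The two-word scheme can be illustrated with a small userspace sketch. This is a hypothetical model only: the kernel's actual rb_time_t packs its counter bits into the words themselves, whereas this sketch uses a separate seqcount-style counter that is odd while an update is in flight:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit timestamp stored as two 32-bit words,
 * so 32-bit CPUs can avoid local64_cmpxchg(). A seqcount-style counter
 * (odd while an update is in flight) lets readers detect a torn read. */
struct split_ts {
	_Atomic uint32_t cnt;    /* odd => update in progress */
	_Atomic uint32_t top;    /* upper 32 bits */
	_Atomic uint32_t bottom; /* lower 32 bits */
};

static void ts_write(struct split_ts *t, uint64_t val)
{
	atomic_fetch_add(&t->cnt, 1);            /* enter update (cnt odd) */
	atomic_store(&t->top, (uint32_t)(val >> 32));
	atomic_store(&t->bottom, (uint32_t)val);
	atomic_fetch_add(&t->cnt, 1);            /* leave update (cnt even) */
}

/* Returns false when the read raced with a writer; the caller must
 * then treat the value as corrupted, which is what forces the
 * absolute-timestamp path described above. */
static bool ts_read(struct split_ts *t, uint64_t *val)
{
	uint32_t before = atomic_load(&t->cnt);
	uint64_t v = ((uint64_t)atomic_load(&t->top) << 32) |
		     atomic_load(&t->bottom);

	if ((before & 1) || before != atomic_load(&t->cnt))
		return false;                    /* torn read */
	*val = v;
	return true;
}
```

A reader that observes an odd counter, or a counter that changed across the read, reports the value as unusable rather than returning garbage.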

When first reserving space on the ring buffer, a "before_stamp" and
"write_stamp" are read. If they do not match, or if either is in the
process of being updated (false was returned from the read), an absolute
timestamp is added and the delta is not used, as that requires reading
these timestamps without corruption.

The one case that this does not matter is if the event is the first event
on the sub-buffer, in which case, the event uses the sub-buffer's
timestamp and doesn't need the other stamps for calculating them.

After some work to consolidate the code, if the before or write stamps are
in the process of updating, an absolute timestamp will be added regardless
of whether the event is the first event on the sub-buffer. This is wrong as it
should not care about the success of these reads if it is the first event
on the sub-buffer.

Fix up the parenthesis so that even if the timestamps are corrupted, if
the event is the first event on the sub-buffer (w == 0) it still does not
force an absolute timestamp.

It's actually likely that w is not zero, but move it out of the unlikely()
and test it first. It should be in hot cache anyway, and there's no reason
to do the rest of the test for the first event on the sub-buffer. And this
prevents having to test all the 'or' statements in that case.

Cc: sta...@vger.kernel.org
Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: https://lore.kernel.org/linux-trace-kernel/2023125949.4692e...@gandalf.local.home

- Move the test to 'w' out of the unlikely and do it first.
  It's already in hot cache, and the rest of test shouldn't be done
  if 'w' is zero.

 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b416bdf6c44a..095b86081ea8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3581,7 +3581,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 * absolute timestamp.
 * Don't bother if this is the start of a new page (w == 0).
 */
-   if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
+   if (w && unlikely(!a_ok || !b_ok || info->before != info->after)) {
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
} else {
-- 
2.42.0




[PATCH v2] ring-buffer: Fix writing to the buffer with max_data_size

2023-12-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.

The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.

When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop:
after 1000 iterations a warning is emitted and the ring buffer is
disabled. But this should never happen.

This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.

For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.

The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.

Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
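As a rough sanity sketch of the fixed admission logic, the constants below are illustrative only, assuming 4K pages and a 16-byte sub-buffer header; the authoritative definitions live in kernel/trace/ring_buffer.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants, assuming 4K pages and a 16-byte sub-buffer
 * header; real values are in kernel/trace/ring_buffer.c. */
#define BUF_PAGE_SIZE      4080u                       /* 4096 - header */
#define BUF_MAX_DATA_SIZE  (BUF_PAGE_SIZE - 2 * sizeof(uint32_t))
#define RB_LEN_TIME_EXTEND 8u

/* Sketch of the fixed check: an event needing an extended timestamp
 * must fit *with* the extra 8 bytes, except when it is the first
 * event on the sub-buffer (w == 0), which reuses the sub-buffer's
 * own absolute timestamp and needs no extension. */
static bool event_fits(size_t length, bool first_on_subbuf,
		       bool needs_ts_extend)
{
	if (!first_on_subbuf && needs_ts_extend)
		length += RB_LEN_TIME_EXTEND;
	return length <= BUF_MAX_DATA_SIZE;
}
```

The broken case was precisely a non-first event whose length passed the raw check but no longer fit once the 8-byte extension was added.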

Cc: sta...@vger.kernel.org
Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated")
Reported-by: Kent Overstreet  # (on IRC)
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: https://lore.kernel.org/linux-trace-kernel/20231209170139.33c1b...@gandalf.local.home

- Instead of subtracting the timestamp size from the max data, check if the
  event is the first event on the sub-buffer and, if it is, do not add a timestamp.

- If the ring buffer enabled adding timestamps for every event, then check
  if the added timestamp puts the length over BUF_MAX_DATA_SIZE.


 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..d8ce1dc5110e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3584,7 +3584,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
info->length += RB_LEN_TIME_EXTEND;
} else {
info->delta = info->ts - info->after;
-   if (unlikely(test_time_stamp(info->delta))) {
+   if (w && unlikely(test_time_stamp(info->delta))) {
info->add_timestamp |= RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
}
@@ -3737,6 +3737,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
add_ts_default = RB_ADD_STAMP_ABSOLUTE;
info.length += RB_LEN_TIME_EXTEND;
+   if (info.length > BUF_MAX_DATA_SIZE)
+   goto out_fail;
} else {
add_ts_default = RB_ADD_STAMP_NONE;
}
-- 
2.42.0




[PATCH] tracing: Fix uaf issue when open the hist or hist_debug file

2023-12-12 Thread Zheng Yejian
KASAN reports the following issue. The root cause is that when opening
the 'hist' file of an instance, hist_show() accesses 'trace_event_file',
which may already have been freed because the instance was removed. The
'hist_debug' file has the same problem. To fix it, use
tracing_{open,release}_file_tr() instead when opening these two files.
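The lifetime pattern the fix relies on can be modelled in a few lines. The names below are illustrative only, not the tracefs API: opens pin the file, and instance removal only frees it once the last reader releases it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the pinning that tracing_open_file_tr() /
 * tracing_release_file_tr() provide; names are illustrative. */
struct event_file {
	int refcount;  /* readers currently holding the file open */
	bool dying;    /* instance removal requested */
	bool freed;    /* stands in for kfree() in the kernel */
};

static void ef_put(struct event_file *f)
{
	if (f->refcount == 0 && f->dying)
		f->freed = true;   /* safe: no reader can touch it now */
}

static int ef_open(struct event_file *f)
{
	if (f->dying)
		return -1;         /* instance is going away: refuse */
	f->refcount++;
	return 0;
}

static void ef_release(struct event_file *f)
{
	f->refcount--;
	ef_put(f);
}

static void ef_remove_instance(struct event_file *f)
{
	f->dying = true;
	ef_put(f);
}
```

The UAF in the report corresponds to skipping the pin entirely: hist_show() dereferenced the file after remove had already freed it.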

  BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
  Read of size 8 at addr 242541e336b8 by task head/190

  CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x98/0xf8
   show_stack+0x1c/0x30
   dump_stack_lvl+0x44/0x58
   print_report+0xf0/0x5a0
   kasan_report+0x80/0xc0
   __asan_report_load8_noabort+0x1c/0x28
   hist_show+0x11e0/0x1278
   seq_read_iter+0x344/0xd78
   seq_read+0x128/0x1c0
   vfs_read+0x198/0x6c8
   ksys_read+0xf4/0x1e0
   __arm64_sys_read+0x70/0xa8
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Allocated by task 188:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_alloc_info+0x20/0x30
   __kasan_slab_alloc+0x6c/0x80
   kmem_cache_alloc+0x15c/0x4a8
   trace_create_new_event+0x84/0x348
   __trace_add_new_event+0x18/0x88
   event_trace_add_tracer+0xc4/0x1a0
   trace_array_create_dir+0x6c/0x100
   trace_array_create+0x2e8/0x568
   instance_mkdir+0x48/0x80
   tracefs_syscall_mkdir+0x90/0xe8
   vfs_mkdir+0x3c4/0x610
   do_mkdirat+0x144/0x200
   __arm64_sys_mkdirat+0x8c/0xc0
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Freed by task 191:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_free_info+0x34/0x58
   __kasan_slab_free+0xe4/0x158
   kmem_cache_free+0x19c/0x508
   event_file_put+0xa0/0x120
   remove_event_file_dir+0x180/0x320
   event_trace_del_tracer+0xb0/0x180
   __remove_instance+0x224/0x508
   instance_rmdir+0x44/0x78
   tracefs_syscall_rmdir+0xbc/0x140
   vfs_rmdir+0x1cc/0x4c8
   do_rmdir+0x220/0x2b8
   __arm64_sys_unlinkat+0xc0/0x100
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

Signed-off-by: Zheng Yejian 
---
 kernel/trace/trace_events_hist.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1abc07fba1b9..00447ea7dabd 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5623,10 +5623,12 @@ static int event_hist_open(struct inode *inode, struct file *file)
 {
int ret;
 
-   ret = security_locked_down(LOCKDOWN_TRACEFS);
+   ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
 
+   /* Clear private_data to avoid warning in single_open */
+   file->private_data = NULL;
return single_open(file, hist_show, file);
 }
 
@@ -5634,7 +5636,7 @@ const struct file_operations event_hist_fops = {
.open = event_hist_open,
.read = seq_read,
.llseek = seq_lseek,
-   .release = single_release,
+   .release = tracing_release_file_tr,
 };
 
 #ifdef CONFIG_HIST_TRIGGERS_DEBUG
@@ -5900,10 +5902,12 @@ static int event_hist_debug_open(struct inode *inode, struct file *file)
 {
int ret;
 
-   ret = security_locked_down(LOCKDOWN_TRACEFS);
+   ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
 
+   /* Clear private_data to avoid warning in single_open */
+   file->private_data = NULL;
return single_open(file, hist_debug_show, file);
 }
 
@@ -5911,7 +5915,7 @@ const struct file_operations event_hist_debug_fops = {
.open = event_hist_debug_open,
.read = seq_read,
.llseek = seq_lseek,
-   .release = single_release,
+   .release = tracing_release_file_tr,
 };
 #endif
 
-- 
2.25.1




Re: [PATCH] ring-buffer: Fix buffer max_data_size with max_event_size

2023-12-12 Thread Steven Rostedt
On Mon, 11 Dec 2023 20:40:33 +0900
Masami Hiramatsu (Google)  wrote:

> On Sat, 9 Dec 2023 17:09:25 -0500
> Steven Rostedt  wrote:
> 
> > On Sat, 9 Dec 2023 17:01:39 -0500
> > Steven Rostedt  wrote:
> >   
> > > From: "Steven Rostedt (Google)" 
> > > 
> > > The maximum ring buffer data size is the maximum size of data that can be
> > > recorded on the ring buffer. Events must be smaller than the sub buffer
> > > data size minus any meta data. This size is checked before trying to
> > > allocate from the ring buffer because the allocation assumes that the size
> > > will fit on the sub buffer.
> > > 
> > > The maximum size was calculated as the size of a sub buffer page (which is
> > > currently PAGE_SIZE minus the sub buffer header) minus the size of the
> > > meta data of an individual event. But it missed the possible adding of a
> > > time stamp for events that are added long enough apart that the event meta
> > > data can't hold the time delta.
> > > 
> > > When an event is added that is greater than the current BUF_MAX_DATA_SIZE
> > > minus the size of a time stamp, but still less than or equal to
> > > BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
> > > for a page that can hold the event. Luckily, there's a check for this loop
> > > and after 1000 iterations and a warning is emitted and the ring buffer is
> > > disabled. But this should never happen.
> > > 
> > > This can happen when a large event is added first, or after a long period
> > > where an absolute timestamp is prefixed to the event, increasing its size
> > > by 8 bytes. This passes the check and then goes into the algorithm that
> > > causes the infinite loop.
> > > 
> > > Fix this by creating a BUF_MAX_EVENT_SIZE to be used to determine if the
> > > passed in event is too big for the buffer.
> > >   
> > 
> > Forgot to add:
> > 
> > Cc: sta...@vger.kernel.org
> > Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is 
> > allocated")  
> 
> Looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

Actually, I found out that this can be fixed with:

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 86a60a0eb279..aaad104a1707 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3594,7 +3594,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
info->length += RB_LEN_TIME_EXTEND;
} else {
info->delta = info->ts - info->after;
-   if (unlikely(test_time_stamp(info->delta))) {
+   if (w && unlikely(test_time_stamp(info->delta))) {
info->add_timestamp |= RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
}

The bug that this patch fixed was that the acceptable size did not take
the added time stamp into account. But the time stamp should *not* be added
if it's the first event on the sub buffer. And once it goes to the next
buffer, it should be the first event.

I ran the same tests with this change, and it works just as well. I believe
this is the proper fix.

I'll send a v2.

Thanks!

-- Steve


> 
> Thanks,
> > 
> > -- Steve
> > 
> >   
> > > Reported-by: Kent Overstreet  # (on IRC)
> > > Signed-off-by: Steven Rostedt (Google) 
> > > ---
> > >  kernel/trace/ring_buffer.c | 7 +--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index 8d2a4f00eca9..a38e5a3c6803 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -378,6 +378,9 @@ static inline bool test_time_stamp(u64 delta)
> > >  /* Max payload is BUF_PAGE_SIZE - header (8bytes) */
> > >  #define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
> > >  
> > > +/* Events may have a time stamp attached to them */
> > > +#define BUF_MAX_EVENT_SIZE (BUF_MAX_DATA_SIZE - RB_LEN_TIME_EXTEND)
> > > +
> > >  int ring_buffer_print_page_header(struct trace_seq *s)
> > >  {
> > >   struct buffer_data_page field;
> > > @@ -3810,7 +3813,7 @@ ring_buffer_lock_reserve(struct trace_buffer 
> > > *buffer, unsigned long length)
> > >   if (unlikely(atomic_read(_buffer->record_disabled)))
> > >   goto out;
> > >  
> > > - if (unlikely(length > BUF_MAX_DATA_SIZE))
> > > + if (unlikely(length > BUF_MAX_EVENT_SIZE))
> > >   goto out;
> > >  
> > >   if (unlikely(trace_recursive_lock(cpu_buffer)))
> > > @@ -3960,7 +3963,7 @@ int ring_buffer_write(struct trace_buffer *buffer,
> > >   if (atomic_read(_buffer->record_disabled))
> > >   goto out;
> > >  
> > > - if (length > BUF_MAX_DATA_SIZE)
> > > + if (length > BUF_MAX_EVENT_SIZE)
> > >   goto out;
> > >  
> > >   if (unlikely(trace_recursive_lock(cpu_buffer)))  
> >   
> 
> 




[PATCH RESEND 04/11] tracing: tracing_buffers_splice_read: behave as-if non-blocking I/O

2023-12-12 Thread Ahelenia Ziemiańska
Otherwise we risk sleeping with the pipe locked for indeterminate
lengths of time.

Link: https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u
Signed-off-by: Ahelenia Ziemiańska 
---
 kernel/trace/trace.c | 32 
 1 file changed, 4 insertions(+), 28 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index abaaf516fcae..9be7a4c0a3ca 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8477,7 +8477,6 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
if (splice_grow_spd(pipe, ))
return -ENOMEM;
 
- again:
trace_access_lock(iter->cpu_file);
entries = ring_buffer_entries_cpu(iter->array_buffer->buffer, iter->cpu_file);
 
@@ -8528,35 +8527,12 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 
/* did we read anything? */
if (!spd.nr_pages) {
-   long wait_index;
-
-   if (ret)
-   goto out;
-
-   ret = -EAGAIN;
-   if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
-   goto out;
-
-   wait_index = READ_ONCE(iter->wait_index);
-
-   ret = wait_on_pipe(iter, iter->tr->buffer_percent);
-   if (ret)
-   goto out;
-
-   /* No need to wait after waking up when tracing is off */
-   if (!tracer_tracing_is_on(iter->tr))
-   goto out;
-
-   /* Make sure we see the new wait_index */
-   smp_rmb();
-   if (wait_index != iter->wait_index)
-   goto out;
-
-   goto again;
+   if (!ret)
+   ret = -EAGAIN;
+   } else {
+   ret = splice_to_pipe(pipe, );
}
 
-   ret = splice_to_pipe(pipe, );
-out:
splice_shrink_spd();
 
return ret;
-- 
2.39.2




[PATCH RESEND 00/11] splice(file<>pipe) I/O on file as-if O_NONBLOCK

2023-12-12 Thread Ahelenia Ziemiańska
(Originally sent on 2023-10-16 as
 ;
 received no replies, resending unchanged per
 Documentation/process/submitting-patches.rst#_resend_reminders).

Hi!

As it stands, splice(file -> pipe):
1. locks the pipe,
2. does a read from the file,
3. unlocks the pipe.

For reading from regular files and blockdevs this makes no difference.
But if the file is a tty or a socket, for example, this means that until
data appears, which it may never do, every process trying to read from
or open the pipe enters an uninterruptible sleep,
and will only exit it if the splicing process is killed.
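The resulting fix can be caricatured in a few lines of C. This is a toy model, not the real VFS code paths: the read issued while the pipe lock is held is forced to behave as-if O_NONBLOCK, so a source that never produces data returns -EAGAIN instead of sleeping with the lock held.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for reading from a tty/socket that has no data ready. */
static int source_read(int as_if_nonblock)
{
	if (!as_if_nonblock)
		return -2;       /* stand-in for "sleeps until data arrives" */
	return -EAGAIN;          /* fail fast instead of sleeping */
}

/* Patched flow: the pipe lock is only ever held around a read that
 * cannot sleep, so other readers/openers of the pipe are never stuck
 * behind a splice() waiting on a source that may never produce data. */
static int splice_file_to_pipe(void)
{
	int ret;

	/* pipe_lock(pipe); */
	ret = source_read(1 /* as-if O_NONBLOCK */);
	/* pipe_unlock(pipe); */
	return ret;
}
```

In the unpatched kernel the call under the lock takes the blocking branch, which is the uninterruptible sleep described above.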

This trivially denies service to:
* any hypothetical pipe-based log collexion system
* all nullmailer installations
* me, personally, when I'm pasting stuff into qemu -serial chardev:pipe

This follows:
1. 
https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u
2. a security@ thread rooted in
   
3. https://nabijaczleweli.xyz/content/blogn_t/011-linux-splice-exclusion.html

Patches were posted and then discarded on principle or functionality,
all in all terminating in Linus posting
> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".

This does that, effectively making splice(file -> pipe)
request (and require) O_NONBLOCK on reads from the file:
this doesn't affect splicing from regular files and blockdevs,
since they're always non-blocking
(and requesting the stronger "no kernel sleep" IOCB_NOWAIT is non-sensical),
but always returns -EINVAL for ttys.
Sockets behave as expected from O_NONBLOCK reads:
splice if there's data available else -EAGAIN.

This should all pretty much behave as-expected.

Mostly a re-based version of the summary diff from
.

Bisection yields commit 8924feff66f35fe22ce77aafe3f21eb8e5cff881
("splice: lift pipe_lock out of splice_to_pipe()") as first bad.


The patchset is made quite wide due to the many implementations
of the splice_read callback, and was based entirely on results from
  $ git grep '\.splice_read.*=' | cut -d= -f2 |
  tr -s ',;[:space:]' '\n' | sort -u

I'm assuming this is exhaustive, but it's 27 distinct implementations.
Of these, I've classified these as trivial delegating wrappers:
  nfs_file_splice_read   filemap_splice_read
  afs_file_splice_read   filemap_splice_read
  ceph_splice_read   filemap_splice_read
  ecryptfs_splice_read_update_atime  filemap_splice_read
  ext4_file_splice_read  filemap_splice_read
  f2fs_file_splice_read  filemap_splice_read
  ntfs_file_splice_read  filemap_splice_read
  ocfs2_file_splice_read filemap_splice_read
  orangefs_file_splice_read  filemap_splice_read
  v9fs_file_splice_read  filemap_splice_read
  xfs_file_splice_read   filemap_splice_read
  zonefs_file_splice_readfilemap_splice_read
  sock_splice_read   copy_splice_read or a socket-specific one
  coda_file_splice_read  vfs_splice_read
  ovl_splice_readvfs_splice_read


filemap_splice_read() is used for regular files and blockdevs,
and thus needs no changes, and is thus unchanged.

vfs_splice_read() delegates to copy_splice_read() or f_op->splice_read().


The rest are fixed, in patch order:
01. copy_splice_read() by simply doing the I/O with IOCB_NOWAIT;
diff from Linus:
  
https://lore.kernel.org/lkml/5osglsw36dla3mubtpsmdwdid4fsdacplyd6acx2igo4atogdg@yur3idyim3cc/t/#ee67de5a9ec18886c434113637d7eff6cd7acac4b
02. unix_stream_splice_read() by unconditionally passing MSG_DONTWAIT
03. fuse_dev_splice_read() by behaving as-if O_NONBLOCK
04. tracing_buffers_splice_read() by behaving as-if O_NONBLOCK
(this also removes the retry loop)
05. relay_file_splice_read() by behaving as-if SPLICE_F_NONBLOCK
(this just means EAGAINing unconditionally for an empty transfer)
06. smc_splice_read() by unconditionally passing MSG_DONTWAIT
07. kcm_splice_read() by unconditionally passing MSG_DONTWAIT
08. tls_sw_splice_read() by behaving as-if SPLICE_F_NONBLOCK
09. tcp_splice_read() by behaving as-if O_NONBLOCK
(this also removes the retry loop)

10. EINVALs on files that neither have FMODE_NOWAIT nor are S_ISREG

We don't want this to be just FMODE_NOWAIT, since most regular files
don't have it set and that's not the right semantic anyway,
as noted at the top of this mail.

But this allows blockdevs "by accident", effectively,
since they have FMODE_NOWAIT (at least the ones I tried),
even though they're just like regular files:
handled by filemap_splice_read(),
thus not dispatched with IOCB_NOWAIT, since always non-blocking.

Should this be a check for FMODE_NOWAIT && (S_ISREG || S_ISBLK)?
Should it remain FMODE_NOWAIT && S_ISREG?
Is there an even better way of spelling this?


In net/kcm, this also fixes kcm_splice_read() passing SPLICE_F_*-style

Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior

2023-12-12 Thread David Hildenbrand

On 11.12.23 23:52, Vishal Verma wrote:

Add a sysfs knob for dax devices to control the memmap_on_memory setting
if the dax device were to be hotplugged as system memory.

The default memmap_on_memory setting for dax devices originating via
pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to
preserve legacy behavior. For dax devices via CXL, the default is on.
The sysfs control allows the administrator to override the above
defaults if needed.

Cc: David Hildenbrand 
Cc: Dan Williams 
Cc: Dave Jiang 
Cc: Dave Hansen 
Cc: Huang Ying 
Tested-by: Li Zhijian 
Reviewed-by: Jonathan Cameron 
Reviewed-by: David Hildenbrand 
Signed-off-by: Vishal Verma 
---
  drivers/dax/bus.c   | 47 +
  Documentation/ABI/testing/sysfs-bus-dax | 17 
  2 files changed, 64 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1ff1ab5fa105..2871e5188f0d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev,
  }
  static DEVICE_ATTR_RO(numa_node);
  
+static ssize_t memmap_on_memory_show(struct device *dev,

+struct device_attribute *attr, char *buf)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+
+   return sprintf(buf, "%d\n", dev_dax->memmap_on_memory);
+}
+
+static ssize_t memmap_on_memory_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   struct device_driver *drv = dev->driver;
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct dax_region *dax_region = dev_dax->region;
+   struct dax_device_driver *dax_drv = to_dax_drv(drv);
+   ssize_t rc;
+   bool val;
+
+   rc = kstrtobool(buf, &val);
+   if (rc)
+   return rc;
+
+   if (dev_dax->memmap_on_memory == val)
+   return len;
+
+   device_lock(dax_region->dev);
+   if (!dax_region->dev->driver) {
+   device_unlock(dax_region->dev);
+   return -ENXIO;
+   }
+
+   if (dax_drv->type == DAXDRV_KMEM_TYPE) {
+   device_unlock(dax_region->dev);
+   return -EBUSY;
+   }
+
+   device_lock(dev);
+   dev_dax->memmap_on_memory = val;
+   device_unlock(dev);
+
+   device_unlock(dax_region->dev);
+   return len;
+}
+static DEVICE_ATTR_RW(memmap_on_memory);
+
  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int 
n)
  {
struct device *dev = container_of(kobj, struct device, kobj);
@@ -1296,6 +1342,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
+   &dev_attr_memmap_on_memory.attr,
NULL,
  };
  
diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax

index a61a7b186017..b1fd8bf8a7de 100644
--- a/Documentation/ABI/testing/sysfs-bus-dax
+++ b/Documentation/ABI/testing/sysfs-bus-dax
@@ -149,3 +149,20 @@ KernelVersion: v5.1
  Contact:  nvd...@lists.linux.dev
  Description:
(RO) The id attribute indicates the region id of a dax region.
+
+What:  /sys/bus/dax/devices/daxX.Y/memmap_on_memory
+Date:  October, 2023
+KernelVersion: v6.8
+Contact:   nvd...@lists.linux.dev
+Description:
+   (RW) Control the memmap_on_memory setting if the dax device
+   were to be hotplugged as system memory. This determines whether
+   the 'altmap' for the hotplugged memory will be placed on the
+   device being hotplugged (memmap_on_memory=1) or if it will be
+   placed on regular memory (memmap_on_memory=0). This attribute
+   must be set before the device is handed over to the 'kmem'
+   driver (i.e.  hotplugged into system-ram). Additionally, this
+   depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
+   memmap_on_memory parameter for memory_hotplug. This is
+   typically set on the kernel command line -
+   memory_hotplug.memmap_on_memory set to 'true' or 'force'.



Thinking about it, I wonder if we could disallow setting that property 
to "true" if the current configuration does not allow it.


That is:

1) Removing the "size" parameter from mhp_supports_memmap_on_memory(), 
it doesn't make any sense anymore.


2) Exporting mhp_supports_memmap_on_memory() to modules.

3) When setting memmap_on_memory, check whether 
mhp_supports_memmap_on_memory() == true.


Then, the user really gets an error when trying to set it to "true".

--
Cheers,

David / dhildenb




Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints

2023-12-12 Thread Marco Elver
On Tue, 12 Dec 2023 at 10:19, Paul Heidekrüger  wrote:
>
> On 12.12.2023 00:37, Andrey Konovalov wrote:
> > On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger
> >  wrote:
> > >
> > > Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the same 
> > > error
> > > for me.
> > >
> > > So
> > >
> > > CONFIG_KUNIT=y
> > > CONFIG_KUNIT_ALL_TESTS=n
> > > CONFIG_FTRACE=y
> > > CONFIG_KASAN=y
> > > CONFIG_KASAN_GENERIC=y
> > > CONFIG_KASAN_KUNIT_TEST=y
> > >
> > > produces
> > >
> > > ➜   ./tools/testing/kunit/kunit.py run 
> > > --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
> > > Configuring KUnit Kernel ...
> > > Regenerating .config ...
> > > Populating config with:
> > > $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
> > > ERROR:root:Not all Kconfig options selected in kunitconfig were 
> > > in the generated .config.
> > > This is probably due to unsatisfied dependencies.
> > > Missing: CONFIG_KASAN_KUNIT_TEST=y
> > >
> > > By that error message, CONFIG_FTRACE appears to be present in the 
> > > generated
> > > config, but CONFIG_KASAN_KUNIT_TEST still isn't. Presumably,
> > > CONFIG_KASAN_KUNIT_TEST is missing because of an unsatisfied dependency, 
> > > which
> > > must be CONFIG_TRACEPOINTS, unless I'm missing something ...
> > >
> > > If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
> > > CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is 
> > > kunit.py-related
> > > then?
> > >
> > > Andrey, you said that the tests have been working for you; are you 
> > > running them
> > > with kunit.py?
> >
> > No, I just run the kernel built with a config file that I put together
> > based on defconfig.
>
> Ah. I believe I've figured it out.
>
> When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it works.

CONFIG_FTRACE should be enough - maybe also check x86 vs. arm64 to debug more.

> CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_TRACER
> selects CONFIG_GENERIC_TRACER, CONFIG_GENERIC_TRACER selects CONFIG_TRACING, 
> and
> CONFIG_TRACING selects CONFIG_TRACEPOINTS.
>
> CONFIG_BLK_DEV_IO_TRACE=y also works instead of CONFIG_STACK_TRACER=y, as it
> directly selects CONFIG_TRACEPOINTS.
>
> CONFIG_FTRACE=y on its own does not appear to suffice for kunit.py on arm64.

When you build manually with just CONFIG_FTRACE, is CONFIG_TRACEPOINTS enabled?

> I believe the reason my .kunitconfig as well as the existing
> mm/kfence/.kunitconfig work on X86 is because CONFIG_TRACEPOINTS=y is present 
> in
> an X86 defconfig.
>
> Does this make sense?
>
> Would you welcome a patch addressing this for the existing
> mm/kfence/.kunitconfig?
>
> I would also like to submit a patch for an mm/kasan/.kunitconfig. Do you think
> that would be helpful too?
>
> FWICT, kernel/kcsan/.kunitconfig might also be affected since
> CONFIG_KCSAN_KUNIT_TEST also depends on CONFIG_TRACEPOINTS, but I would have 
> to
> test that. That could be a third patch.

I'd support figuring out the minimal config (CONFIG_FTRACE or
something else?) that satisfies the TRACEPOINTS dependency. I always
thought CONFIG_FTRACE ought to be the one config option, but maybe
something changed.

Also maybe one of the tracing maintainers can help untangle what's
going on here.

Thanks,
-- Marco



Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints

2023-12-12 Thread Paul Heidekrüger
On 12.12.2023 00:37, Andrey Konovalov wrote:
> On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger
>  wrote:
> >
> > Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the same 
> > error
> > for me.
> >
> > So
> >
> > CONFIG_KUNIT=y
> > CONFIG_KUNIT_ALL_TESTS=n
> > CONFIG_FTRACE=y
> > CONFIG_KASAN=y
> > CONFIG_KASAN_GENERIC=y
> > CONFIG_KASAN_KUNIT_TEST=y
> >
> > produces
> >
> > ➜   ./tools/testing/kunit/kunit.py run 
> > --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
> > Configuring KUnit Kernel ...
> > Regenerating .config ...
> > Populating config with:
> > $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
> > ERROR:root:Not all Kconfig options selected in kunitconfig were in 
> > the generated .config.
> > This is probably due to unsatisfied dependencies.
> > Missing: CONFIG_KASAN_KUNIT_TEST=y
> >
> > By that error message, CONFIG_FTRACE appears to be present in the generated
> > config, but CONFIG_KASAN_KUNIT_TEST still isn't. Presumably,
> > CONFIG_KASAN_KUNIT_TEST is missing because of an unsatisfied dependency, 
> > which
> > must be CONFIG_TRACEPOINTS, unless I'm missing something ...
> >
> > If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
> > CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is 
> > kunit.py-related
> > then?
> >
> > Andrey, you said that the tests have been working for you; are you running 
> > them
> > with kunit.py?
> 
> No, I just run the kernel built with a config file that I put together
> based on defconfig.

Ah. I believe I've figured it out.

When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it works.

CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_TRACER 
selects CONFIG_GENERIC_TRACER, CONFIG_GENERIC_TRACER selects CONFIG_TRACING, 
and 
CONFIG_TRACING selects CONFIG_TRACEPOINTS. 

CONFIG_BLK_DEV_IO_TRACE=y also works instead of CONFIG_STACK_TRACER=y, as it 
directly selects CONFIG_TRACEPOINTS. 

CONFIG_FTRACE=y on its own does not appear to suffice for kunit.py on arm64.

I believe the reason my .kunitconfig as well as the existing 
mm/kfence/.kunitconfig work on X86 is because CONFIG_TRACEPOINTS=y is present 
in 
an X86 defconfig.
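A minimal mm/kasan/.kunitconfig along those lines might look like this (illustrative; the option set is inferred from this thread, not from a posted patch):

```
CONFIG_KUNIT=y
CONFIG_KUNIT_ALL_TESTS=n
CONFIG_FTRACE=y
# FTRACE alone did not pull in TRACEPOINTS on arm64 here;
# BLK_DEV_IO_TRACE selects TRACEPOINTS directly, satisfying
# KASAN_KUNIT_TEST's dependency.
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_KUNIT_TEST=y
```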

Does this make sense?

Would you welcome a patch addressing this for the existing 
mm/kfence/.kunitconfig?

I would also like to submit a patch for an mm/kasan/.kunitconfig. Do you think 
that would be helpful too?

FWICT, kernel/kcsan/.kunitconfig might also be affected since 
CONFIG_KCSAN_KUNIT_TEST also depends on CONFIG_TRACEPOINTS, but I would have to 
test that. That could be a third patch.

What do you think?

Many thanks,
Paul




Re: [PATCH net-next v8 3/4] virtio/vsock: fix logic which reduces credit update messages

2023-12-12 Thread Stefano Garzarella

On Tue, Dec 12, 2023 at 12:16:57AM +0300, Arseniy Krasnov wrote:

Add one more condition for sending a credit update during dequeue from a
stream socket: when the number of bytes in the rx queue is smaller than
the SO_RCVLOWAT value of the socket. This is relevant for non-default
values of SO_RCVLOWAT (i.e. not 1) - the idea is to "kick" the peer to
continue data transmission, because we need at least SO_RCVLOWAT bytes
in our rx queue to wake up the user for reading data (in a corner case
it is also possible for both tx and rx sides to get stuck, which is why
'Fixes' is used).

Fixes: b89d882dc9fc ("vsock/virtio: reduce credit update messages")
Signed-off-by: Arseniy Krasnov 
---
Changelog:
v6 -> v7:
 * Handle wrap of 'fwd_cnt'.
 * Do not send credit update when 'fwd_cnt' == 'last_fwd_cnt'.
v7 -> v8:
 * Remove unneeded/wrong handling of wrap for 'fwd_cnt'.

net/vmw_vsock/virtio_transport_common.c | 13 ++---
1 file changed, 10 insertions(+), 3 deletions(-)


Reviewed-by: Stefano Garzarella 

Thanks!
Stefano



diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index e137d740804e..8572f94bba88 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -558,6 +558,8 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct virtio_vsock_sock *vvs = vsk->trans;
size_t bytes, total = 0;
struct sk_buff *skb;
+   u32 fwd_cnt_delta;
+   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;

@@ -601,7 +603,10 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}
}

-   free_space = vvs->buf_alloc - (vvs->fwd_cnt - vvs->last_fwd_cnt);
+   fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
+   free_space = vvs->buf_alloc - fwd_cnt_delta;
+   low_rx_bytes = (vvs->rx_bytes <
+   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));

spin_unlock_bh(&vvs->rx_lock);

@@ -611,9 +616,11 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * too high causes extra messages. Too low causes transmitter
 * stalls. As stalls are in theory more expensive than extra
 * messages, we set the limit to a high value. TODO: experiment
-* with different values.
+* with different values. Also send credit update message when
+* number of bytes in rx queue is not enough to wake up reader.
 */
-   if (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
+   if (fwd_cnt_delta &&
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
virtio_transport_send_credit_update(vsk);

return total;
--
2.25.1






[PATCH v5 3/3] remoteproc: qcom: pas: Add SM8650 remoteproc support

2023-12-12 Thread Neil Armstrong
Add DSP Peripheral Authentication Service support for the SM8650 platform.

Reviewed-by: Dmitry Baryshkov 
Signed-off-by: Neil Armstrong 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 46d744fbe8ad..83dcde2dec61 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -1197,6 +1197,53 @@ static const struct adsp_data sm8550_mpss_resource = {
.region_assign_vmid = QCOM_SCM_VMID_MSS_MSA,
 };
 
+static const struct adsp_data sm8650_cdsp_resource = {
+   .crash_reason_smem = 601,
+   .firmware_name = "cdsp.mdt",
+   .dtb_firmware_name = "cdsp_dtb.mdt",
+   .pas_id = 18,
+   .dtb_pas_id = 0x25,
+   .minidump_id = 7,
+   .auto_boot = true,
+   .proxy_pd_names = (char*[]){
+   "cx",
+   "mxc",
+   "nsp",
+   NULL
+   },
+   .load_state = "cdsp",
+   .ssr_name = "cdsp",
+   .sysmon_name = "cdsp",
+   .ssctl_id = 0x17,
+   .region_assign_idx = 2,
+   .region_assign_count = 1,
+   .region_assign_shared = true,
+   .region_assign_vmid = QCOM_SCM_VMID_CDSP,
+};
+
+static const struct adsp_data sm8650_mpss_resource = {
+   .crash_reason_smem = 421,
+   .firmware_name = "modem.mdt",
+   .dtb_firmware_name = "modem_dtb.mdt",
+   .pas_id = 4,
+   .dtb_pas_id = 0x26,
+   .minidump_id = 3,
+   .auto_boot = false,
+   .decrypt_shutdown = true,
+   .proxy_pd_names = (char*[]){
+   "cx",
+   "mss",
+   NULL
+   },
+   .load_state = "modem",
+   .ssr_name = "mpss",
+   .sysmon_name = "modem",
+   .ssctl_id = 0x12,
+   .region_assign_idx = 2,
+   .region_assign_count = 2,
+   .region_assign_vmid = QCOM_SCM_VMID_MSS_MSA,
+};
+
 static const struct of_device_id adsp_of_match[] = {
{ .compatible = "qcom,msm8226-adsp-pil", .data = &adsp_resource_init},
{ .compatible = "qcom,msm8953-adsp-pil", .data = 
&msm8996_adsp_resource},
@@ -1249,6 +1296,9 @@ static const struct of_device_id adsp_of_match[] = {
{ .compatible = "qcom,sm8550-adsp-pas", .data = &sm8550_adsp_resource},
{ .compatible = "qcom,sm8550-cdsp-pas", .data = &sm8550_cdsp_resource},
{ .compatible = "qcom,sm8550-mpss-pas", .data = &sm8550_mpss_resource},
+   { .compatible = "qcom,sm8650-adsp-pas", .data = &sm8650_adsp_resource},
+   { .compatible = "qcom,sm8650-cdsp-pas", .data = &sm8650_cdsp_resource},
+   { .compatible = "qcom,sm8650-mpss-pas", .data = &sm8650_mpss_resource},
{ },
 };
 MODULE_DEVICE_TABLE(of, adsp_of_match);

-- 
2.34.1




[PATCH v5 2/3] remoteproc: qcom: pas: make region assign more generic

2023-12-12 Thread Neil Armstrong
The current memory region assign only supports a single
memory region.

But new platforms introduce more regions to make the
memory requirements more flexible for various use cases.
These new platforms also share the memory region between the
DSP and HLOS.

To handle this, make the region assign more generic in order
to support more than a single memory region and also permit
marking the regions' permissions as shared.

Reviewed-by: Mukesh Ojha 
Signed-off-by: Neil Armstrong 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 100 -
 1 file changed, 66 insertions(+), 34 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 913a5d2068e8..46d744fbe8ad 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -33,6 +33,8 @@
 
 #define ADSP_DECRYPT_SHUTDOWN_DELAY_MS 100
 
+#define MAX_ASSIGN_COUNT 2
+
 struct adsp_data {
int crash_reason_smem;
const char *firmware_name;
@@ -51,6 +53,9 @@ struct adsp_data {
int ssctl_id;
 
int region_assign_idx;
+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
 };
 
 struct qcom_adsp {
@@ -87,15 +92,18 @@ struct qcom_adsp {
phys_addr_t dtb_mem_phys;
phys_addr_t mem_reloc;
phys_addr_t dtb_mem_reloc;
-   phys_addr_t region_assign_phys;
+   phys_addr_t region_assign_phys[MAX_ASSIGN_COUNT];
void *mem_region;
void *dtb_mem_region;
size_t mem_size;
size_t dtb_mem_size;
-   size_t region_assign_size;
+   size_t region_assign_size[MAX_ASSIGN_COUNT];
 
int region_assign_idx;
-   u64 region_assign_perms;
+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
+   u64 region_assign_owners[MAX_ASSIGN_COUNT];
 
struct qcom_rproc_glink glink_subdev;
struct qcom_rproc_subdev smd_subdev;
@@ -590,37 +598,53 @@ static int adsp_alloc_memory_region(struct qcom_adsp 
*adsp)
 
 static int adsp_assign_memory_region(struct qcom_adsp *adsp)
 {
-   struct reserved_mem *rmem = NULL;
-   struct qcom_scm_vmperm perm;
+   struct qcom_scm_vmperm perm[MAX_ASSIGN_COUNT];
struct device_node *node;
+   unsigned int perm_size;
+   int offset;
int ret;
 
if (!adsp->region_assign_idx)
return 0;
 
-   node = of_parse_phandle(adsp->dev->of_node, "memory-region", 
adsp->region_assign_idx);
-   if (node)
-   rmem = of_reserved_mem_lookup(node);
-   of_node_put(node);
-   if (!rmem) {
-   dev_err(adsp->dev, "unable to resolve shareable 
memory-region\n");
-   return -EINVAL;
-   }
+   for (offset = 0; offset < adsp->region_assign_count; ++offset) {
+   struct reserved_mem *rmem = NULL;
+
+   node = of_parse_phandle(adsp->dev->of_node, "memory-region",
+   adsp->region_assign_idx + offset);
+   if (node)
+   rmem = of_reserved_mem_lookup(node);
+   of_node_put(node);
+   if (!rmem) {
+   dev_err(adsp->dev, "unable to resolve shareable 
memory-region index %d\n",
+   offset);
+   return -EINVAL;
+   }
 
-   perm.vmid = QCOM_SCM_VMID_MSS_MSA;
-   perm.perm = QCOM_SCM_PERM_RW;
+   if (adsp->region_assign_shared)  {
+   perm[0].vmid = QCOM_SCM_VMID_HLOS;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm[1].vmid = adsp->region_assign_vmid;
+   perm[1].perm = QCOM_SCM_PERM_RW;
+   perm_size = 2;
+   } else {
+   perm[0].vmid = adsp->region_assign_vmid;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm_size = 1;
+   }
 
-   adsp->region_assign_phys = rmem->base;
-   adsp->region_assign_size = rmem->size;
-   adsp->region_assign_perms = BIT(QCOM_SCM_VMID_HLOS);
+   adsp->region_assign_phys[offset] = rmem->base;
+   adsp->region_assign_size[offset] = rmem->size;
+   adsp->region_assign_owners[offset] = BIT(QCOM_SCM_VMID_HLOS);
 
-   ret = qcom_scm_assign_mem(adsp->region_assign_phys,
- adsp->region_assign_size,
- &adsp->region_assign_perms,
- &perm, 1);
-   if (ret < 0) {
-   dev_err(adsp->dev, "assign memory failed\n");
-   return ret;
+   ret = qcom_scm_assign_mem(adsp->region_assign_phys[offset],
+ adsp->region_assign_size[offset],
+ &adsp->region_assign_owners[offset],
+ perm, perm_size);
+   

[PATCH v5 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS

2023-12-12 Thread Neil Armstrong
Document the DSP Peripheral Authentication Service on the SM8650 Platform.

Reviewed-by: Krzysztof Kozlowski 
Signed-off-by: Neil Armstrong 
---
 .../bindings/remoteproc/qcom,sm8550-pas.yaml   | 44 +-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
index 58120829fb06..4e8ce9e7e9fa 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
@@ -19,6 +19,9 @@ properties:
   - qcom,sm8550-adsp-pas
   - qcom,sm8550-cdsp-pas
   - qcom,sm8550-mpss-pas
+  - qcom,sm8650-adsp-pas
+  - qcom,sm8650-cdsp-pas
+  - qcom,sm8650-mpss-pas
 
   reg:
 maxItems: 1
@@ -49,6 +52,7 @@ properties:
   - description: Memory region for main Firmware authentication
   - description: Memory region for Devicetree Firmware authentication
   - description: DSM Memory region
+  - description: DSM Memory region 2
 
 required:
   - compatible
@@ -63,6 +67,7 @@ allOf:
   enum:
 - qcom,sm8550-adsp-pas
 - qcom,sm8550-cdsp-pas
+- qcom,sm8650-adsp-pas
 then:
   properties:
 interrupts:
@@ -71,7 +76,26 @@ allOf:
   maxItems: 5
 memory-region:
   maxItems: 2
-else:
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-cdsp-pas
+then:
+  properties:
+interrupts:
+  maxItems: 5
+interrupt-names:
+  maxItems: 5
+memory-region:
+  minItems: 3
+  maxItems: 3
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8550-mpss-pas
+then:
   properties:
 interrupts:
   minItems: 6
@@ -79,12 +103,28 @@ allOf:
   minItems: 6
 memory-region:
   minItems: 3
+  maxItems: 3
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-mpss-pas
+then:
+  properties:
+interrupts:
+  minItems: 6
+interrupt-names:
+  minItems: 6
+memory-region:
+  minItems: 4
+  maxItems: 4
 
   - if:
   properties:
 compatible:
   enum:
 - qcom,sm8550-adsp-pas
+- qcom,sm8650-adsp-pas
 then:
   properties:
 power-domains:
@@ -101,6 +141,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-mpss-pas
+- qcom,sm8650-mpss-pas
 then:
   properties:
 power-domains:
@@ -116,6 +157,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-cdsp-pas
+- qcom,sm8650-cdsp-pas
 then:
   properties:
 power-domains:

-- 
2.34.1




[PATCH v5 0/3] remoteproc: qcom: Introduce DSP support for SM8650

2023-12-12 Thread Neil Armstrong
Add the bindings and driver changes for DSP support on the
SM8650 platform in order to enable the aDSP, cDSP and MPSS
subsystems to boot.

Compared to SM8550, SM8650 uses the same dual firmware files
(dtb file and main firmware), but the memory zone requirements have changed:
- cDSP: now requires 2 memory zones to be configured as shared
  between the cDSP and the HLOS subsystem
- MPSS: In addition to the memory zone required for the SM8550
  MPSS, another one is required to be configured for MPSS
  usage only.

In order to handle this and avoid code duplication, the region_assign_*
code path has been made more generic and is able to handle multiple
DSP-only memory zones (for MPSS) or DSP-HLOS shared memory zones (cDSP)
in the same region_assign functions.

Dependencies: None

For convenience, a regularly refreshed linux-next based git tree containing
all the SM8650 related work is available at:
https://git.codelinaro.org/neil.armstrong/linux/-/tree/topic/sm8650/upstream/integ

Signed-off-by: Neil Armstrong 
---
Changes in v5:
- Rename _perms to _owners per Konrad suggestion
- Link to v4: 
https://lore.kernel.org/r/20231208-topic-sm8650-upstream-remoteproc-v4-0-a96c3e5f0...@linaro.org

Changes in v4:
- Collected review from Mukesh Ojha
- Fixed adsp_unassign_memory_region() as suggested by Mukesh Ojha
- Link to v3: 
https://lore.kernel.org/r/20231106-topic-sm8650-upstream-remoteproc-v3-0-dbd4cabae...@linaro.org

Changes in v3:
- Collected bindings review tags
- Small fixes suggested by Mukesh Ojha
- Link to v2: 
https://lore.kernel.org/r/20231030-topic-sm8650-upstream-remoteproc-v2-0-609ee572e...@linaro.org

Changes in v2:
- Fixed sm8650 entries in allOf:if:then to match Krzysztof's comments
- Collected reviewed-by on patch 3
- Link to v1: 
https://lore.kernel.org/r/20231025-topic-sm8650-upstream-remoteproc-v1-0-a8d20e4ce...@linaro.org

---
Neil Armstrong (3):
  dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS
  remoteproc: qcom: pas: make region assign more generic
  remoteproc: qcom: pas: Add SM8650 remoteproc support

 .../bindings/remoteproc/qcom,sm8550-pas.yaml   |  44 +-
 drivers/remoteproc/qcom_q6v5_pas.c | 150 -
 2 files changed, 159 insertions(+), 35 deletions(-)
---
base-commit: bbd220ce4e29ed55ab079007cff0b550895258eb
change-id: 20231016-topic-sm8650-upstream-remoteproc-66a87eeb6fee

Best regards,
-- 
Neil Armstrong