Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote:

On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:

On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote:

In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2], however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this we could capture

Presumably TLS == Thread Local Storage.

Yes, the initial idea is to use thread local storage (TLS). It seems to
be the fastest option to save a per-thread value that changes at a fast
rate.

upon each sample. We could then just state, Open Telemetry SDKs should
have a TLS value for span correlation. However, we need a way to sample
the TLS or other value(s) when a sampling event is generated. This is
supported today on Windows via EventActivityIdControl() [3]. Since Open
Telemetry works on both Windows and Linux, ideally we can do something
as efficient for Linux based workloads.

This series is to explore how it would be best possible to collect
supporting data from a user process when a profile sample is collected.
Having a value stored in TLS makes a lot of sense for this, however
there are other ways to explore. Whatever is chosen, kernel samples
taken in process context should be able to get this supporting data. In
these patches on X64 the fsbase and gsbase are used for this.

An option to explore suggested by Mathieu Desnoyers is to utilize rseq
for processes to register a value location that can be included when
profiling if desired. This would allow a tighter contract between user
processes and a profiler. It would allow better labeling/categorizing
the correlation values.

It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity?

Restartable sequences are about preemption on a CPU, not of a thread,
so at least my intuition is that they feel different. You could stash
information like this today by changing the thread name, which
generates comm events. I've wondered about having similar information
in some form of reserved-for-profiling stack slot, for example, to
stash a pointer to the name of a function being interpreted.
Snapshotting all of a stack is bad performance wise and for security. A
stack slot would be able to deal with nesting.

You are getting the idea. A slot or tag for a thread would be great!
I'm not a fan of overriding the thread comm name (as that already has a
use). TLS would be fine, if we could also pass an offset + size + type.
Maybe a stack slot that just points to parts of TLS? That way you could
have a set of slots that don't require much memory and selectively copy
them out of TLS (or wherever those slots point to in user memory).

When I was talking to Mathieu about this, it seems that rseq already
had a place to potentially put these slots. I'm unsure though how the
per-thread aspects would work.

Mathieu, can you post your ideas here about that?

Sure. I'll try to summarize my thoughts here. By all means, let me know
if I'm missing important pieces of the puzzle.

First of all, here is my understanding of what information we want to
share between userspace and kernel.

A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answer a HTTP request) across one or many participating
hosts. Activity IDs have a parent/child relationship: a parent activity
ID can create children activity IDs.

For instance, if one host has the service "dispatch", another host has
a "web server", and a third host has a SQL database, we should be able
to follow the chain of activities needed to answer a web query by
following those activity IDs, linking them together through
parent/child relationships. This usually requires the communication
protocols to convey those activity IDs across hosts.

The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.

With tracing, taking a full trace of the activity ID spans begin/end
from all hosts allows reconstructing the activity IDs parent/child
relationships, so we typically only need to extract information about
activity ID span begin/end with parent/child info to a tracer.

Using activity IDs from a kernel profiler is trickier, because we do
not have access to the complete span begin/end trace to reconstruct the
activity ID parent/child relationship. This is where I suspect we'd
want to introduce a notion of "activity ID stack", so a profiler could
reconstruct the currently active stack of activity IDs for the current
thread by walking that stack. This profiling could be triggered either
from an interrupt (s
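The "activity ID stack" notion discussed above can be sketched in plain
userspace C. This is purely illustrative: the names, the fixed depth,
and keeping the stack directly in TLS are assumptions for the sketch,
not an API anyone proposed in the thread.

```c
/* Hypothetical sketch of a per-thread "activity ID stack": each thread
 * keeps a small TLS stack of 128-bit activity IDs that a profiler could
 * sample and walk to recover the active parent/child chain. */
#include <stdint.h>
#include <string.h>

#define ACTIVITY_STACK_DEPTH 16

struct activity_id {
	uint8_t bytes[16];	/* 128-bit ID, e.g. a random UUID */
};

static __thread struct activity_id activity_stack[ACTIVITY_STACK_DEPTH];
static __thread int activity_depth;

/* Push a child activity ID when a new span begins. */
static int activity_push(const struct activity_id *id)
{
	if (activity_depth >= ACTIVITY_STACK_DEPTH)
		return -1;
	activity_stack[activity_depth++] = *id;
	return 0;
}

/* Pop when the span ends. */
static int activity_pop(void)
{
	if (activity_depth == 0)
		return -1;
	activity_depth--;
	return 0;
}

/* What a profiler would sample for the current thread: the depth plus
 * the stack entries, giving the chain of currently active spans. */
static int activity_current_depth(void)
{
	return activity_depth;
}
```

A real design would still need a registration mechanism (e.g. via rseq,
as suggested above) so the kernel knows where this stack lives.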
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 13:58, Steven Rostedt wrote:

On Wed, 20 Mar 2024 13:15:39 -0400 Mathieu Desnoyers wrote:

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Because need_resched() itself does not mean it's going to schedule
immediately.

I looked at a few locations that need_resched() is called. Most are in
idle code where the critical timings are already handled. I'm not sure
I'd add it for places like mm/memory.c or drivers/md/bcache/btree.c. A
lot of places look to use it more for PREEMPT_NONE situations as an
open-coded cond_resched().

The main reason this one is particularly an issue, is that it spins as
long as the owner is still running. Which may be some time, as here it
was 7ms.

What I think we should be discussing here is how calling need_resched()
should interact with the latency tracked by critical timings.

AFAIU, when code explicitly calls need_resched() in a loop, there are
two cases:

- need_resched() returns false: This means the loop can continue
  without causing long latency on the system. Technically we could
  restart the critical timings at this point.

- need_resched() returns true: This means the loop should exit quickly
  and call the scheduler. I would not reset the critical timings there,
  as whatever code is executed between need_resched() returning true
  and calling the scheduler is adding to latency.

Having stop/start critical timings around idle loops seems to just be
an optimization over that.

As for mm and driver/md code, what is wrong with doing a critical
timings reset when need_resched() returns false ? It would prevent a
whole class of false-positives rather than playing whack-a-mole with
those that pop up.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
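The two need_resched() cases above can be modeled in plain userspace
code. Everything here is a stand-in: need_resched(),
restart_critical_timings() and the tick counter are stubs that only
model the policy argued for (restart the latency window while
need_resched() is false, stop restarting once it returns true).

```c
/* Userspace model of the spin pattern under discussion. The kernel
 * APIs are replaced by trivial stubs; only the accounting shape of
 * "which part of the spin counts as latency" is illustrated. */
#include <stdbool.h>

static int resched_after = 3;		/* pretend need_resched() flips after 3 polls */
static int critical_window_start;	/* models where the tracked window begins */
static int now;				/* fake time, one "tick" per loop iteration */

static bool need_resched(void) { return now >= resched_after; }
static void restart_critical_timings(void) { critical_window_start = now; }

/* Spin until a resched is needed. While need_resched() is false the
 * spin adds no real latency, so the window is restarted each pass;
 * once it returns true, everything from the last restart until the
 * (elided) call to the scheduler counts toward reported latency. */
static int spin_window_length(void)
{
	for (now = 0; ; now++) {
		if (need_resched())
			break;			/* exit and schedule */
		restart_critical_timings();	/* spin itself is not latency */
	}
	return now - critical_window_start;	/* latency attributed to this spin */
}
```

With this policy, a long optimistic spin contributes at most one poll
interval to the tracked latency, which is the class of false-positive
the reset is meant to eliminate.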
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google)

I'm debugging some latency issues on a Chromebook and the
preemptirqsoff tracer hit this:

  # tracer: preemptirqsoff
  #
  # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
  # --------------------------------------------------------------------
  # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
  #    -----------------
  #    | task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
  #    -----------------
  #  => started at: rwsem_optimistic_spin
  #  => ended at:   rwsem_optimistic_spin
  #
  #                    _------=> CPU#
  #                   / _-----=> irqs-off
  #                  | / _----=> need-resched
  #                  || / _---=> hardirq/softirq
  #                  ||| / _--=> preempt-depth
  #                  |||| / _-=> migrate-disable
  #                  ||||| /     delay
  #  cmd     pid     |||||| time  |   caller
  #     \   /        ||||||  \    |    /
     <...>-1         1...1.    0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194
     <...>-1         1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194
     <...>-1         1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194
     <...>-1         1.N.1. 7824us : <stack trace>
   => rwsem_optimistic_spin+0x140/0x194
   => rwsem_down_write_slowpath+0xc9/0x3fe
   => copy_process+0xd08/0x1b8a
   => kernel_clone+0x94/0x256
   => __x64_sys_clone+0x7a/0x9a
   => do_syscall_64+0x51/0xa1
   => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code
that spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules. Hence, it's
hiding real issues, because it can spin for a very long time and this
is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do { \
 # define start_critical_timings() do { } while (0)
 #endif
 
+/* Used in spins that check need_resched() with preemption disabled */
+static inline void restart_critical_timings(void)
+{
+	stop_critical_timings();
+	start_critical_timings();
+}
+
 #ifdef CONFIG_DEBUG_IRQFLAGS
 extern void warn_bogus_irq_restore(void);
 #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifndef CONFIG_PREEMPT_RT
@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 */
 	barrier();
 
+	restart_critical_timings();
 	if (need_resched() || !owner_on_cpu(owner)) {
 		state = OWNER_NONSPINNABLE;
 		break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * a writer, need_resched() check needs to be done here.
 	 */
 	if (owner_state != OWNER_WRITER) {
+		restart_critical_timings();
 		if (need_resched())
 			break;
 		if (rt_task(current) &&

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K
On 2024-03-04 22:34, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Limit the max print event of trace_marker to just 4K string size. This
must also be less than the amount that can be held by a trace_seq along
with the text that is before the output (like the task name, PID, CPU,
state, etc). As trace_seq is made to handle large events (some greater
than 4K), make the max size of a trace_marker write event be 4K, which
is guaranteed to fit in the trace_seq buffer.

Suggested-by: Mathieu Desnoyers

From my perspective I only attempted to clarify the point Linus made
about limiting the trace_marker input to 4kB. Feel free to adapt the
Suggested-by tag accordingly.

Reviewed-by: Mathieu Desnoyers

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)
---
Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/

- Just make the max limit 4K and not half of the trace_seq size. The
  trace_seq is already made to handle events greater than 4k.

 kernel/trace/trace.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d16b95ca58a7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+#define TRACE_MARKER_MAX_SIZE		4096
+
 static ssize_t
 tracing_mark_write(struct file *filp, const char __user *ubuf,
 					size_t cnt, loff_t *fpos)
@@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if ((ssize_t)cnt < 0)
 		return -EINVAL;
 
+	if (cnt > TRACE_MARKER_MAX_SIZE)
+		cnt = TRACE_MARKER_MAX_SIZE;
+
 	meta_size = sizeof(*entry) + 2;	/* add '\0' and possible '\n' */
 again:
 	size = cnt + meta_size;
@@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if (cnt < FAULTED_SIZE)
 		size += FAULTED_SIZE - cnt;
 
-	if (size > TRACE_SEQ_BUFFER_SIZE) {
-		cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-		goto again;
-	}
-
 	buffer = tr->array_buffer.buffer;
 	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					    tracing_gen_ctx());

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
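The clamp introduced by the v2 patch reduces to a one-line policy. Here
is a minimal userspace rendition; the constant mirrors the patch, but
the function name is made up for illustration:

```c
#include <stddef.h>

#define TRACE_MARKER_MAX_SIZE 4096	/* same fixed 4K cap as the patch */

/* Silently truncate an oversized trace_marker write to the documented
 * input limit, instead of deriving the limit from the output buffer. */
static size_t clamp_marker_write(size_t cnt)
{
	return cnt > TRACE_MARKER_MAX_SIZE ? TRACE_MARKER_MAX_SIZE : cnt;
}
```

Clamping up front is what removes the `goto again` shrink-and-retry
path deleted by the diff above.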
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 21:37, Steven Rostedt wrote:

On Mon, 4 Mar 2024 21:35:38 -0500 Steven Rostedt wrote:

And it's not for debugging, it's for validation of assumptions made
about an upper bound limit defined for a compile-time check, so as the
code evolves issues are caught early.

validating is debugging. Did Linus put you up to this? To test me to
see if I'm learning how to say "No" ;-)

No, he did not. I genuinely think that validating size limits like
this, either at compile time or, when they can vary at runtime like in
this case, with a dynamic check, decreases the cognitive load on the
reviewers. We can then assume that whatever limit was put in place is
actually enforced and not just wishful thinking.

If the "header" size upper bound is not validated at runtime, there is
not much point in adding the BUILD_BUG_ON() based on that value in the
first place, and you should then just add a runtime check that you
don't overflow the output buffer before writing the output to it.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
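The pairing argued for above, a compile-time bound plus a runtime check
that the bound is actually respected, can be sketched as follows. The
sizes and the header format are invented for illustration; only the
shape of the two checks matters.

```c
/* Sketch: a compile-time check built on an assumed header budget, plus
 * a runtime check that the generated header stays within that budget.
 * All constants and the format string are illustrative, not the
 * kernel's actual values. */
#include <stdio.h>

#define META_MAX_LEN 80				/* assumed header budget */
#define SEQ_BUF_SIZE 1024			/* assumed output buffer size */
#define MARKER_MAX   (SEQ_BUF_SIZE - 128)	/* payload limit */

/* Compile-time: payload plus worst-case meta data must fit the buffer.
 * This is only trustworthy if the 128-byte assumption is enforced. */
_Static_assert(MARKER_MAX + 128 <= SEQ_BUF_SIZE,
	       "marker cannot fit in seq buffer");

/* Runtime: validate the assumption the static check relies on, so a
 * header format that outgrows its budget fails early and loudly. */
static int emit_header(char *buf, int pid, int cpu)
{
	int len = snprintf(buf, META_MAX_LEN + 1,
			   "<...>-%d [%03d] ...: ", pid, cpu);

	return (len < 0 || len > META_MAX_LEN) ? -1 : len;
}
```

Without the runtime half, the BUILD_BUG_ON()-style check only documents
wishful thinking, which is the point being made in the reply.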
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:59, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:42:39 -0500 Mathieu Desnoyers wrote:

	#define TRACE_OUTPUT_META_DATA_MAX_LEN 80

and a runtime check in the code generating this header. This would
avoid adding an unchecked upper limit.

That would be a lot of complex code that is for debugging something
that has never happened in the past 16 years and Linus has already
reprimanded me on doing such things.

Is it more complex than remembering the iterator position in
print_trace_fmt() right before:

	if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) {
		if (iter->iter_flags & TRACE_FILE_LAT_FMT)
			trace_print_lat_context(iter);
		else
			trace_print_context(iter);
	}

and then checking just after that the offset is not beyond 128 bytes ?
Perhaps there is even something internal to "iter" that already knows
about this offset (?).

This would catch any context going beyond its planned upper limit
early. Failing early and not just in rare corner cases is good.

And it's not for debugging, it's for validation of assumptions made
about an upper bound limit defined for a compile-time check, so as the
code evolves issues are caught early.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:41, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:35:16 -0500 Steven Rostedt wrote:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

That's not the meta size I'm worried about. The sizeof(meta data) is
the raw event binary data, which is not related to the size of the
event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
  [..]
     <...>-999     [001] .    2296.140373: tracing_mark_write: hello
     ^^^
     This is the meta data that is added to trace_seq

That said, the meta data is most likely not going to be more than 128
bytes (it shouldn't be more than 80). I could do as you suggest and
create a separate TRACE_MARKER_SIZE and just make sure that it's less
than TRACE_SEQ_BUFFER_SIZE (as that's the size of the content) by 128
bytes.

	/* Added meta data should not be more than 128 bytes */
	BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE);

Bonus points if you add

	#define TRACE_OUTPUT_META_DATA_MAX_LEN 80

and a runtime check in the code generating this header. This would
avoid adding an unchecked upper limit.

Thanks,

Mathieu

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:35, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:15:57 -0500 Mathieu Desnoyers wrote:

On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Since the size of trace_seq's buffer is the max an event can output,
have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is
4K. That will keep writes that have meta data written from being
dropped (but reported), because the total output of the print event is
greater than what the trace_seq can hold.

Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems
backwards. It's basically using a define of the maximum buffer size for
the pretty-printing output and defining the maximum input size of a
system call to half of that.

I'd rather see, in a header file shared between tracing mark write
implementation and output implementation:

	#define TRACING_MARK_MAX_SIZE 4096

and then a static validation that this input fits within your pretty
printing output in the output implementation file:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

That's not the meta size I'm worried about. The sizeof(meta data) is
the raw event binary data, which is not related to the size of the
event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
  [..]
     <...>-999     [001] .    2296.140373: tracing_mark_write: hello
     ^^^
     This is the meta data that is added to trace_seq

If this header has a known well-defined upper-limit length, then use
that in the BUILD_BUG_ON().

Thanks,

Mathieu

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

Since the size of trace_seq's buffer is the max an event can output,
have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is
4K. That will keep writes that have meta data written from being
dropped (but reported), because the total output of the print event is
greater than what the trace_seq can hold.

Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems
backwards. It's basically using a define of the maximum buffer size for
the pretty-printing output and defining the maximum input size of a
system call to half of that.

I'd rather see, in a header file shared between tracing mark write
implementation and output implementation:

	#define TRACING_MARK_MAX_SIZE 4096

and then a static validation that this input fits within your pretty
printing output in the output implementation file:

	BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE);

This way we clearly document that the tracing mark write input limit is
4kB, rather than something derived from the size of an output buffer.

Thanks,

Mathieu

Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d68544aef65f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if ((ssize_t)cnt < 0)
 		return -EINVAL;
 
+	/*
+	 * TRACE_SEQ_SIZE is the total size of trace_seq buffer used
+	 * for output. As the print event outputs more than just
+	 * the string written, keep it smaller than the trace_seq
+	 * as it could drop the event if the extra data makes it bigger
+	 * than what the trace_seq can hold. Half the TRACE_SEQ_SIZE
+	 * is more than enough.
+	 */
+	if (cnt > TRACE_SEQ_SIZE / 2)
+		cnt = TRACE_SEQ_SIZE / 2;
+
 	meta_size = sizeof(*entry) + 2;	/* add '\0' and possible '\n' */
 again:
 	size = cnt + meta_size;
@@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	if (cnt < FAULTED_SIZE)
 		size += FAULTED_SIZE - cnt;
 
-	if (size > TRACE_SEQ_BUFFER_SIZE) {
-		cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-		goto again;
-	}
-
 	buffer = tr->array_buffer.buffer;
 	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					    tracing_gen_ctx());

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] tracing: Remove precision vsnprintf() check from print event
On 2024-03-04 17:43, Steven Rostedt wrote:

From: "Steven Rostedt (Google)"

This reverts 60be76eeabb3d ("tracing: Add size check when printing
trace_marker output"). The only reason the precision check was added
was because of a bug that miscalculated the write size of the string
into the ring buffer and it truncated it removing the terminating nul
byte. On reading the trace it crashed the kernel. But this was due to
the bug in the code that happened during development and should never
happen in practice. If anything, the precision can hide bugs where the
string in the ring buffer isn't nul terminated and it will not be
checked.

Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home
Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/
Reported-by: Sachin Sant
Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output")
Signed-off-by: Steven Rostedt (Google)

This is a step in the right direction IMHO.

Reviewed-by: Mathieu Desnoyers

Just out of curiosity, is there anything to prevent trace_marker from
writing a huge string into the ring buffer in the first place ? Is this
limit implicit and based on the page size of the architecture or is it
a known fixed limit ? (e.g. 4kB strings).

It appears to currently be limited by

	#define TRACE_SEQ_BUFFER_SIZE	(PAGE_SIZE * 2 - \
		(sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int)))

checked within tracing_mark_write().

I would have hoped for a simpler limit (e.g. 4kB) consistent across
architectures. But that would belong to a separate change.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
On 2024-03-01 10:49, Steven Rostedt wrote:

On Fri, 1 Mar 2024 13:37:18 +0800 linke wrote:

So basically you are worried about read-tearing? That wasn't mentioned
in the change log.

Yes. Sorry for making this confused, I am not very familiar with this
and still learning.

No problem. We all have to learn this anyway.

Funny part is, if the above timestamp read did a tear, then this would
definitely not match, and would return the correct value. That is, the
buffer is not empty because the only way for this to get corrupted is
if something is in the process of writing to it.

I agree with you here.

	commit = rb_page_commit(commit_page);

But if commit_page above is the result of a torn read, the commit field
read by rb_page_commit() may not represent a valid value.

But commit_page is a word length, and I will argue that any compiler
that tears "long" words is broken. ;-)

[ For those tuning in, we are discussing the ring_buffer_iter_empty()
  "commit_page = cpu_buffer->commit_page;" racy load. ]

I counter-argue that real-world compilers *are* broken based on your
personal definition, but we have to deal with them, as documented in
Documentation/memory-barriers.txt (see below). What is the added
overhead of using a READ_ONCE() there ? Why are we wasting effort
trying to guess the compiler behavior if the real-world performance
impact is insignificant ?

Quote from memory-barriers.txt explaining the purpose of
{READ,WRITE}_ONCE():

"(*) For aligned memory locations whose size allows them to be accessed
with a single memory-reference instruction, prevents "load tearing" and
"store tearing," in which a single large access is replaced by multiple
smaller accesses."

I agree that {READ,WRITE}_ONCE() are really not needed at
initialization, when there are demonstrably no concurrent accesses to
the data. But trying to eliminate {READ,WRITE}_ONCE() on concurrently
accessed fields just adds complexity, prevents static analyzers from
properly understanding the code and reporting issues, and just
obfuscates the code.

Thanks,

Mathieu

In this case, READ_ONCE() is only needed for the commit_page. But we
can at least keep the READ_ONCE() on the commit_page just because it is
used in the next instruction.

-- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
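The load-tearing point can be demonstrated with a userspace rendition
of the kernel's READ_ONCE()/WRITE_ONCE() macros. The variable name
mirrors the discussion; this is a sketch, not the kernel's actual
implementation (which also handles sizes larger than a word):

```c
/* Minimal userspace versions of the kernel macros: a volatile access
 * forces a single load/store of the object, so the compiler may not
 * split ("tear") an aligned word-sized access into smaller ones. */
#define READ_ONCE(x)	   (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*((volatile __typeof__(x) *)&(x)) = (val))

static unsigned long commit_page;	/* stands in for cpu_buffer->commit_page */

static unsigned long read_commit_page(void)
{
	/* One word-sized load: no load tearing, per memory-barriers.txt. */
	return READ_ONCE(commit_page);
}
```

On a single thread this behaves like a plain load; the value of the
annotation is that it also constrains the compiler on the concurrent
path, and documents the racy access for static analyzers.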
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote:

On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote:

Instead of using local_add_return() to reserve the ring buffer data,
Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify
the reservation with the time keeping code. Although it does not get
rid of the double time stamps (before_stamp and write_stamp), using
cmpxchg() does get rid of the more complex case when an interrupting
event occurs between getting the timestamps and reserving the data, as
when that happens, it just tries again instead of dealing with it.

Before we had:

	w = local_read(&tail_page->write);
	/* get time stamps */
	write = local_add_return(length, &tail_page->write);
	if (write - length == w) {
		/* do simple case */
	} else {
		/* do complex case */
	}

By switching the local_add_return() to a local_try_cmpxchg() it can now
be:

	w = local_read(&tail_page->write);
 again:
	/* get time stamps */
	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
		goto again;
	/* do simple case */

Something about this logic is causing __rb_reserve_next() to sometimes
always return -EAGAIN and triggering the:

	RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer. I'm not sure what it is, but until I
do, I'm removing the patch from my queue.

Try resetting the info->add_timestamp flags to add_ts_default on goto
again within __rb_reserve_next().

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
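The retry shape being discussed can be modeled with C11 atomics
standing in for the kernel's local_try_cmpxchg(). Timestamps and the
add_timestamp flags are elided, so this only illustrates why the
"complex case" disappears; names are illustrative.

```c
/* Userspace model of the cmpxchg-based reservation: read the write
 * position, then atomically claim [w, w + length). If another event
 * slipped in between the read and the claim, the CAS fails, w is
 * reloaded, and we simply try again. */
#include <stdatomic.h>

static _Atomic unsigned long write_pos;	/* stands in for tail_page->write */

static unsigned long reserve(unsigned long length)
{
	unsigned long w = atomic_load(&write_pos);

	/* A real implementation would re-read timestamps here on every
	 * retry (and reset any add_timestamp flags to their defaults,
	 * per the suggestion above). */
	while (!atomic_compare_exchange_weak(&write_pos, &w, w + length))
		;	/* w was reloaded by the failed CAS; retry */
	return w;	/* offset of the reserved region: always the simple case */
}
```

The add_return variant instead commits the advance unconditionally and
must then detect and repair interference after the fact, which is the
complex path the loop above avoids.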
Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers
On 2024-02-16 15:22, Arnd Bergmann wrote:

From: Arnd Bergmann

In some randconfig builds, the IS_ERR() check appears to not get
completely eliminated, resulting in the compiler inserting references
to these two functions that cause a link failure:

  ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined!
  ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined!

Add more stub functions for the dax-disabled case here to make it build
again.

Hi Arnd,

Note that this is a duplicate of:

https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/

now present in Andrew's tree. The only differences are the subject,
commit message and a newline between "set_dax_nomc" and
"set_dax_synchronous" in your change.

Thanks,

Mathieu

Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal")
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/
Signed-off-by: Arnd Bergmann
---
 include/linux/dax.h | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index df2d52b8a245..4527c10016fb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
 void set_dax_synchronous(struct dax_device *dax_dev);
+void set_dax_nocache(struct dax_device *dax_dev);
+void set_dax_nomc(struct dax_device *dax_dev);
+
 size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
 		void *addr, size_t bytes, struct iov_iter *i);
 /*
@@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev)
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
+static inline void set_dax_nocache(struct dax_device *dax_dev)
+{
+}
+static inline void set_dax_nomc(struct dax_device *dax_dev)
+{
+}
 static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 		struct dax_device *dax_dev)
 {
@@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev,
 }
 #endif
 
-void set_dax_nocache(struct dax_device *dax_dev);
-void set_dax_nomc(struct dax_device *dax_dev);
-
 struct writeback_control;
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
 int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: CPU data cache across reboot/kexec for pmem/dax devices
On 2024-02-09 15:15, Dan Williams wrote: Mathieu Desnoyers wrote: Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. I would favor the explicit cache flushing approach over write-through semantics for performance reasons: typically the ring buffers are overwritten, so always storing the data to memory would be wasteful. 
But I would rather do the cache flushing only on crash (die oops/panic notifiers) rather than require the tracer to flush cache lines immediately after each event is produced, again for performance reasons. I have done a crude attempt at registering die oops/panic notifiers from the pmem driver, and flush all pmem areas to memory when the die oops/panic notifiers are called (see the untested patch below). I wonder if this is something that would be generally useful? - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state. Good to know that this scenario is expected to "Just Work". Thanks, Mathieu

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e9898457a7bd..b92f18607d39 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,12 +26,18 @@
 #include
 #include
 #include
+#include
 #include
 #include "pmem.h"
 #include "btt.h"
 #include "pfn.h"
 #include "nd.h"

+static int pmem_die_handler(struct notifier_block *self,
+		unsigned long ev, void *unused);
+static int pmem_panic_handler(struct notifier_block *self,
+		unsigned long ev, void *unused);
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
	/*
@@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem)
 {
	struct pmem_device *pmem = __pmem;

+	atomic_notifier_chain_unregister(&panic_notifier_list,
+			&pmem->panic_notifier);
+	unregister_die_notifier(&pmem->die_notifier);
	dax_remove_host(pmem->disk);
	kill_dax(pmem->dax_dev);
	put_dax(pmem->dax_dev);
@@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev,
			goto out_cleanup_dax;
		dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
	}
-	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	pmem->die_notifier.notifier_call = pmem_die_handler;
+	pmem->die_notifier.priority = -INT_MAX;
+	rc = register_die_notifier(&pmem->die_notifier);
	if (rc)
		goto out_remove_host;
+	pmem->panic_notifier.notifier_call = pmem_panic_handler;
+	pmem->panic_notifier.priority = -INT_MAX;
+	rc = atomic_notifier_chain_register(&panic_notifier_list,
+			&pmem->panic_notifier);
+	if (rc)
+		goto out_unregister_die;
+	rc = device_add_disk(dev, disk, pmem_attribute_groups);
+	if (rc)
+		goto out_unregister_panic;

	if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
		return -ENOMEM;
@@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev,
		dev_warn(dev, "'badblocks' notification disabled\n");
	return 0;

+out_unregister_panic:
+	atomic_notifier_chain_unregister(&panic_notifi
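The patch quoted above registers the notifiers but its handler bodies are cut off. As a hedged sketch (not the submitted code), a panic callback along these lines would write back the CPU data cache for the whole pmem range; `panic_notifier` is assumed to be a field the patch adds to `struct pmem_device`, and `arch_wb_cache_pmem()` is used here as the write-back primitive:

```c
/*
 * Hypothetical sketch of the truncated notifier callback: write back
 * (without invalidating) all cache lines covering the pmem range so
 * the last trace events survive a warm reboot. Field names follow the
 * quoted patch; this is illustrative only.
 */
static int pmem_panic_handler(struct notifier_block *self,
			      unsigned long ev, void *unused)
{
	struct pmem_device *pmem =
		container_of(self, struct pmem_device, panic_notifier);

	arch_wb_cache_pmem(pmem->virt_addr, pmem->size);
	return NOTIFY_DONE;
}
```

The die handler would presumably do the same; running at -INT_MAX priority, as in the patch, makes these callbacks run last so other notifiers still see a fast path.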
CPU data cache across reboot/kexec for pmem/dax devices
Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, - kexec/kdump. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
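A minimal setup sketch for the workflow described above; addresses, sizes, and mount points are placeholders, and the `memmap` carve-out must match a usable physical range on the machine:

```sh
# Kernel command line: reserve 2G of volatile RAM at physical offset 4G
# as a legacy pmem device (it shows up as /dev/pmem0):
#     memmap=2G!4G

# Create and mount a DAX-capable filesystem on it:
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/tracing

# LTTng-UST shared-memory files placed under /mnt/tracing are then
# mmap()ed through the DAX mapping, so the ring buffers live in the
# reserved region and can survive a warm reboot.
```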
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote: On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: [...] How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches ? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We are in the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. You might want to use LTTng's ring buffer approach instead. See https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 lib_ring_buffer_flush_read_subbuf_dcache() Basically, whenever user-space grabs a sub-buffer for reading (through lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls flush_dcache_page() on all pages of this subbuffer (I should really change this for a call to flush_dcache_folio() which would be more efficient). Note that doing this is not very efficient on architectures which have coherent data caches and incoherent dcache vs icache: in that case, we issue the flush_dcache_page uselessly. I plan on using the new cpu_dcache_is_aliasing() check once/if it makes its way upstream to remove those useless flushes on architectures which define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data cache. The equivalent of LTTng's "get subbuf" operation would be the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
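The flush-on-read scheme described above is small enough to sketch; this is an illustrative reduction of LTTng's `lib_ring_buffer_flush_read_subbuf_dcache()`, not a copy of it:

```c
/*
 * Illustrative kernel-side sketch: when user space takes a sub-buffer
 * for reading, write back the kernel's dcache lines for each of its
 * pages so the user-space mapping (which may be virtually aliased)
 * observes the data. flush_dcache_page() is a no-op on architectures
 * that do not define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
 */
static void flush_read_subbuf_dcache(struct page **pages,
				     unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		flush_dcache_page(pages[i]);
}
```

As discussed above, a `cpu_dcache_is_aliasing()` check could skip this loop entirely on architectures that implement flush_dcache_page() only for dcache-vs-icache coherency.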
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index = vmf->pgoff;
+
+	return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+	struct trace_array *tr = iter->tr;
+
+	ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->mapped))
+		tr->mapped--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+#endif
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+	.open		= tracing_buffers_mmap_open,
+	.close		= tracing_buffers_mmap_close,
+	.fault		= tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = filp->private_data;
+	struct trace_iterator *iter = &info->iter;
+	struct trace_array *tr = iter->tr;
+	int ret = 0;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+	vma->vm_ops = &tracing_buffers_vmops;
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+	/*
+	 * We hold mmap_lock here. lockdep would be unhappy if we would now take
+	 * trace_types_lock. Instead use the specific snapshot_trigger_lock.
+	 */
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (tr->snapshot || tr->mapped == UINT_MAX) {
+		spin_unlock(&tr->snapshot_trigger_lock);
+		return -EBUSY;
+	}
+	tr->mapped++;
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	/* Wait for update_max_tr() to observe iter->tr->mapped */
+	if (tr->mapped == 1)
+		synchronize_rcu();
+#endif
+	ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+#ifdef CONFIG_TRACER_MAX_TRACE
+	if (ret) {
+		spin_lock(&iter->tr->snapshot_trigger_lock);
+		iter->tr->mapped--;
+		spin_unlock(&iter->tr->snapshot_trigger_lock);
+	}
+#endif
+	return ret;
+}
+
 static const struct file_operations tracing_buffers_fops = {
	.open		= tracing_buffers_open,
	.read		= tracing_buffers_read,
@@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops = {
	.splice_read	= tracing_buffers_splice_read,
	.unlocked_ioctl	= tracing_buffers_ioctl,
	.llseek		= no_llseek,
+	.mmap		= tracing_buffers_mmap,
 };

 static ssize_t
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index bd312e9afe25..8a96e7a89e6b 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -336,6 +336,7 @@ struct trace_array {
	bool			allocated_snapshot;
	spinlock_t		snapshot_trigger_lock;
	unsigned int		snapshot;
+	unsigned int		mapped;
	unsigned long		max_latency;
 #ifdef CONFIG_FSNOTIFY
	struct dentry		*d_max_latency;
-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
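From user space, the new `.mmap` handler above would be exercised roughly as follows. The tracefs path is real, but the mapping length and the commented-out ioctl depend on the series' UAPI header, so treat this as a sketch against a not-yet-merged interface:

```c
/* Hypothetical consumer of the mmap'ed ring buffer. The mapping is
 * read-only by design: the kernel side above rejects VM_WRITE with
 * -EPERM. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw";
	long page = sysconf(_SC_PAGESIZE);
	int fd = open(path, O_RDONLY | O_NONBLOCK);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Meta page plus data pages; the real length would come from the
	 * meta page defined by the series' UAPI header (assumed here). */
	void *map = mmap(NULL, 4 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}
	/* ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);  -- advance the reader
	 * page, the equivalent of LTTng's "get subbuf" (ioctl defined by
	 * the same series). */
	munmap(map, 4 * page);
	close(fd);
	return 0;
}
```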
Re: [RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
On 2024-01-30 22:13, Dan Williams wrote: Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:55AM -0500, Mathieu Desnoyers wrote: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change: return !IS_ENABLED(CONFIG_ARM) && !IS_ENABLED(CONFIG_MIPS) && !IS_ENABLED(CONFIG_SPARC); Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- include/linux/dax.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index cfc8cd4a3eae..f59e604662e4 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include typedef unsigned long dax_entry_t; @@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, } static inline bool dax_is_supported(void) { - return !IS_ENABLED(CONFIG_ARM) && - !IS_ENABLED(CONFIG_MIPS) && - !IS_ENABLED(CONFIG_SPARC); + return !dcache_is_aliasing(); Yeah, if this is just a one liner should go into 
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the start of the function. I also noticed that device mapper uses fs_dax_get_by_bdev() to determine if it can support DAX, but this patch set does not address that case. Hence it really seems to me like fs_dax_get_by_bdev() is the right place to put this check. Oh, good catch. Yes, I agree this can definitely be pushed down, but then I think it should be pushed down all the way to make alloc_dax() fail. That will need some additional fixups like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8dcabf84d866..a35e60e62440 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2126,12 +2126,12 @@ static struct mapped_device *alloc_dev(int minor)
 		md->dax_dev = alloc_dax(md, &dm_dax_ops);
 		if (IS_ERR(md->dax_dev)) {
 			md->dax_dev = NULL;
-			goto bad;
+		} else {
+			set_dax_nocache(md->dax_dev);
+			set_dax_nomc(md->dax_dev);
+			if (dax_add_host(md->dax_dev, md->disk))
+				goto bad;
 		}
-		set_dax_nocache(md->dax_dev);
-		set_dax_nomc(md->dax_dev);
-		if (dax_add_host(md->dax_dev, md->disk))
-			goto bad;
 	}

 	format_dev_t(md->name, MKDEV(_major, minor));

...to make it not fatal to fail to register the dax_dev. I've had a quick look at other users of alloc_dax() and alloc_dax_region(), and so far I figure that all of those really want to bail out on alloc_dax() failure. Is dm.c the only special case we need to fix to make it non-fatal? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
On 2024-01-30 21:48, Dave Chinner wrote: On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote: Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

/me wonders why the dentry cache aliasing has problems on these systems. Oh, dcache != fs/dcache.c (the VFS dentry cache). Can you please rename this function appropriately so us dumb filesystem people don't confuse CPU data cache configurations with the VFS dentry cache aliasing when we read this code? Something like cpu_dcache_is_aliased(), perhaps? Good point, will do. I'm planning to rename as follows for v3 to eliminate confusion with the dentry cache (and with "page cache" in general): ARCH_HAS_CACHE_ALIASING -> ARCH_HAS_CPU_CACHE_ALIASING dcache_is_aliasing() -> cpu_dcache_is_aliasing() I noticed that you suggested "aliased" rather than "aliasing", but I followed what arm64 did for icache_is_aliasing(). Do you have a strong preference one way or another? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing is statically known or depends on querying CPU state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and implement "dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates to "false". Note that this leaves "icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing dcache-icache but not dcache-dcache. 
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h           |  6 ++
 mm/Kconfig                          |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..969e6740bcf7 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@ config ARC
	def_bool y
	select ARC_TIMERS
+	select ARCH_HAS_CACHE_ALIASING
	select ARCH_HAS_CACHE_LINE_SIZE
	select ARCH_HAS_DEBUG_VM_PGTABLE
	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..290e3cc85845
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include
+
+#define dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..5adeee5e421f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
	select ARCH_32BIT_OFF_T
	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CACHE_ALIASING
	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
	select ARCH_HAS_CURRENT_STACK_POINTER
	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/c
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. This was turned into the following implementation of dax_is_supported() by a preparatory change:

  return !IS_ENABLED(CONFIG_ARM) &&
         !IS_ENABLED(CONFIG_MIPS) &&
         !IS_ENABLED(CONFIG_SPARC);

Use dcache_is_aliasing() instead to figure out whether the environment has aliasing dcaches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 include/linux/dax.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index cfc8cd4a3eae..f59e604662e4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include

 typedef unsigned long dax_entry_t;

@@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 }
 static inline bool dax_is_supported(void)
 {
-	return !IS_ENABLED(CONFIG_ARM) &&
-		!IS_ENABLED(CONFIG_MIPS) &&
-		!IS_ENABLED(CONFIG_SPARC);
+	return !dcache_is_aliasing();
 }
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
-- 
2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Chandan Babu R
Cc: Darrick J. Wong
Cc: linux-...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..360f640159b0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1208,7 +1208,7 @@ static bool
 xfs_inode_should_enable_dax(
	struct xfs_inode *ip)
 {
-	if (!IS_ENABLED(CONFIG_FS_DAX))
+	if (!dax_is_supported())
		return false;
	if (xfs_has_dax_never(ip->i_mount))
		return false;
-- 
2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Miklos Szeredi
Cc: linux-fsde...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/fuse/dax.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 12ef91d170bb..36e1c1abbf8e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags)
	if (dax_mode == FUSE_DAX_NEVER)
		return false;

+	/*
+	 * Silently fallback to 'never' mode if the architecture does
+	 * not support DAX.
+	 */
+	if (!dax_is_supported())
+		return false;
+
	/*
	 * fc->dax may be NULL in 'inode' mode when filesystem device doesn't
	 * support DAX, in which case it will silently fallback to 'never' mode.
-- 
2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext4/super.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c5fcf377ab1f..f2c11ae3ec29 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb,
	}

	if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) {
-		if (ext4_has_feature_inline_data(sb)) {
+		if (!dax_is_supported()) {
+			ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture.");
+			return -EINVAL;
+		} else if (ext4_has_feature_inline_data(sb)) {
			ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
					" that may contain inline data");
			return -EINVAL;
-- 
2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext2/super.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 01f9addc8b1f..30ff57d47ed4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

	if (test_opt(sb, DAX)) {
-		if (!sbi->s_daxdev) {
+		if (!dax_is_supported()) {
+			ext2_msg(sb, KERN_ERR,
+				"DAX unsupported by architecture. Turning off DAX.");
+			clear_opt(sbi->s_mount_opt, DAX);
+		} else if (!sbi->s_daxdev) {
			ext2_msg(sb, KERN_ERR,
				"DAX unsupported by block device. Turning off DAX.");
			clear_opt(sbi->s_mount_opt, DAX);
-- 
2.39.2
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome,

Thanks,

Mathieu

Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-fsde...@vger.kernel.org

Mathieu Desnoyers (8):
  dax: Introduce dax_is_supported()
  erofs: Use dax_is_supported()
  ext2: Use dax_is_supported()
  ext4: Use dax_is_supported()
  fuse: Use dax_is_supported()
  xfs: Use dax_is_supported()
  Introduce dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of dcache aliasing architectures

 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 fs/Kconfig                          |  1 -
 fs/erofs/super.c                    |  5 -
 fs/ext2/super.c                     |  6 +-
 fs/ext4/super.c                     |  5 -
 fs/fuse/dax.c                       |  7 +++
 fs/xfs/xfs_iops.c                   |  2 +-
 include/linux/cacheinfo.h           |  6 ++
 include/linux/dax.h                 |  9 +
 mm/Kconfig                          |  6 ++
 29 files changed, 142 insertions(+), 5 deletions(-)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

-- 
2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/Kconfig          |  1 -
 include/linux/dax.h | 10 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
	bool "File system based Direct Access (DAX) support"
	depends on MMU
-	depends on !(ARM || MIPS || SPARC)
	depends on ZONE_DEVICE || FS_DAX_LIMITED
	select FS_IOMAP
	select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..cfc8cd4a3eae 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
		return false;
	return dax_synchronous(dax_dev);
 }
+static inline bool dax_is_supported(void)
+{
+	return !IS_ENABLED(CONFIG_ARM) &&
+		!IS_ENABLED(CONFIG_MIPS) &&
+		!IS_ENABLED(CONFIG_SPARC);
+}
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
 {
@@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev,
 {
	return 0;
 }
+static inline bool dax_is_supported(void)
+{
+	return false;
+}
 #endif

 void set_dax_nocache(struct dax_device *dax_dev);
-- 
2.39.2
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza

---
 fs/ext2/super.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 01f9addc8b1f..0398e7a90eb6 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb,
			set_opt(opts->s_mount_opt, XIP);
		fallthrough;
	case Opt_dax:
-#ifdef CONFIG_FS_DAX
-		ext2_msg(sb, KERN_WARNING,
-			"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
-		set_opt(opts->s_mount_opt, DAX);
-#else
-		ext2_msg(sb, KERN_INFO, "dax option not supported");
-#endif
+		if (dax_is_supported()) {
+			ext2_msg(sb, KERN_WARNING,
+				 "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
+			set_opt(opts->s_mount_opt, DAX);
+		} else {
+			ext2_msg(sb, KERN_INFO, "dax option not supported");
+		}
		break;
 #if defined(CONFIG_QUOTA)
-- 
2.39.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here, we don't validate them until we've parsed all options and eliminated conflicts and rejected conflicting options. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support, we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and DAX is supportable by the hardware, then we turn on DAX for that inode. 
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH 5/7] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext4/super.c | 52 - 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..9e0606289239 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2359,34 +2359,32 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param) return ext4_parse_test_dummy_encryption(param, ctx); case Opt_dax: case Opt_dax_type: -#ifdef CONFIG_FS_DAX - { - int type = (token == Opt_dax) ? - Opt_dax : result.uint_32; - - switch (type) { - case Opt_dax: - case Opt_dax_always: - ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - break; - case Opt_dax_never: - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - break; - case Opt_dax_inode: - ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); - ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); - /* Strictly for printing options */ - ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); - break; + if (dax_is_supported()) { + int type = (token == Opt_dax) ? 
+ Opt_dax : result.uint_32; + + switch (type) { + case Opt_dax: + case Opt_dax_always: + ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + break; + case Opt_dax_never: + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + break; + case Opt_dax_inode: + ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS); + ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER); + /* Strictly for printing options */ + ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE); + break; + } + return 0; + } else { + ext4_msg(NULL, KERN_INFO, "dax option not supported"); + return -EINVAL; } - return 0; - } -#else - ext4_msg(NULL, KERN_INFO, "dax option not supported"); - return -EINVAL; -#endif case Opt_data_err: if (result.uint_32 == Opt_data_err_abort) ctx_set_mount_opt(ctx, m->mount_opt); -- 2.39.2
[RFC PATCH 7/7] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_super.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 764304595e8b..b27ecb11db66 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1376,14 +1376,22 @@ xfs_fs_parse_param( case Opt_nodiscard: parsing_mp->m_features &= ~XFS_FEAT_DISCARD; return 0; -#ifdef CONFIG_FS_DAX case Opt_dax: - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); - return 0; + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } case Opt_dax_enum: - xfs_mount_set_dax_mode(parsing_mp, result.uint_32); - return 0; -#endif + if (dax_is_supported()) { + xfs_mount_set_dax_mode(parsing_mp, result.uint_32); + return 0; + } else { + xfs_warn(parsing_mp, "dax option not supported."); + return -EINVAL; + } /* Following mount options will be removed in September 2025 */ case Opt_ikeep: xfs_fs_warn_deprecated(fc, param, XFS_FEAT_IKEEP, true); -- 2.39.2
[RFC PATCH 6/7] fuse: Introduce fuse_dax_is_supported()
Use dax_is_supported() in addition to IS_ENABLED(CONFIG_FUSE_DAX) to validate whether CONFIG_FUSE_DAX is enabled and the architecture does not have virtually aliased caches. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +- fs/fuse/inode.c | 47 +++-- fs/fuse/virtio_fs.c | 4 ++-- 4 files changed, 62 insertions(+), 27 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a660f1f21540..133ac8524064 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -3247,6 +3247,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags) init_waitqueue_head(&fi->page_waitq); fi->writepages = RB_ROOT; - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_inode_init(inode, flags); } diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 1df83eebda92..1cbe37106669 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -31,6 +31,7 @@ #include #include #include +#include <linux/dax.h> /** Default max number of pages that can be used in a single read request */ #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32 @@ -979,6 +980,38 @@ static inline void fuse_sync_bucket_dec(struct fuse_sync_bucket *bucket) rcu_read_unlock(); } +#ifdef CONFIG_FUSE_DAX +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return inode->dax; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return fc->dax_mode; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn
*fc) +{ + return fc->dax; +} +#else +static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode *inode) +{ + return NULL; +} + +static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc) +{ + return FUSE_DAX_INODE_DEFAULT; +} + +static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc) +{ + return NULL; +} +#endif + /** Device operations */ extern const struct file_operations fuse_dev_operations; @@ -1324,7 +1357,8 @@ void fuse_free_conn(struct fuse_conn *fc); /* dax.c */ -#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode)) +#define fuse_dax_is_supported() (IS_ENABLED(CONFIG_FUSE_DAX) && dax_is_supported()) +#define FUSE_IS_DAX(inode) (fuse_dax_is_supported() && IS_DAX(inode)) ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to); ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 2a6d44f91729..030e6ce5486d 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -108,7 +108,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb) if (!fi->forget) goto out_free; - if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi)) + if (fuse_dax_is_supported() && !fuse_dax_inode_alloc(sb, fi)) goto out_free_forget; return &fi->inode; @@ -126,9 +126,8 @@ static void fuse_free_inode(struct inode *inode) mutex_destroy(&fi->mutex); kfree(fi->forget); -#ifdef CONFIG_FUSE_DAX - kfree(fi->dax); -#endif + if (fuse_dax_is_supported()) + kfree(fuse_inode_get_dax(fi)); kmem_cache_free(fuse_inode_cachep, fi); } @@ -361,7 +360,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, invalidate_inode_pages2(inode->i_mapping); } - if (IS_ENABLED(CONFIG_FUSE_DAX)) + if (fuse_dax_is_supported()) fuse_dax_dontcache(inode, attr->flags); } @@ -856,14 +855,16 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root) if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE) seq_printf(m,
",blksize=%lu", sb->s_blocksize); } -#ifdef CONFIG_FUSE_DAX - if (fc->dax_mode == FUSE_DAX_ALWAYS) - seq_puts(m, ",dax=always"); - else if (fc->dax_mode == FUSE_DAX_NEVER) - seq_puts(m, ",dax=never"); - else if (fc->dax_mode == FUSE_DAX_INODE_USER) - seq_puts(m, ",dax=inode"); -#endif + if
[RFC PATCH 2/7] dax: Fix incorrect list of cache aliasing architectures
fs/Kconfig:FS_DAX prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased dcache. This is a regression introduced in the v5.13 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no dcache aliasing, and therefore should work fine with FS_DAX. Use this instead in Kconfig to prevent FS_DAX from being built on architectures with virtually aliased dcache: depends on !ARCH_HAS_CACHE_ALIASING For architectures which detect dcache aliasing at runtime, introduce a new dax_is_supported() static inline which uses "cache_is_aliasing()" to figure out whether the environment has aliasing dcaches. This new dax_is_supported() helper will be used in each filesystem supporting the dax mount option to validate whether dax is indeed supported. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/Kconfig | 2 +- include/linux/dax.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..6746fe403761 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,7 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) + depends on !ARCH_HAS_CACHE_ALIASING depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..8c595b04deeb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -5,6 +5,7 @@ #include #include #include +#include <linux/cacheinfo.h> typedef unsigned long
dax_entry_t; @@ -78,6 +79,10 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !cache_is_aliasing(); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +127,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH 4/7] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2
[RFC PATCH 1/7] Introduce cache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased caches (e.g. FS_DAX) can reliably query this. For dcache aliasing, there are three scenarios depending on the architecture. Here is a breakdown based on my understanding: A) The dcache is always aliasing: (ARCH_HAS_CACHE_ALIASING=y) * arm V4, V5 (CPU_CACHE_VIVT) * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The dcache aliasing depends on querying CPU state at runtime: (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y) * arm V6, V6K (CPU_CACHE_VIPT) (cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The dcache is never aliasing: * arm V7, V7M (unless ARM V6 or V6K are also present) (CPU_CACHE_VIPT) * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 +++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 +
arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 8 mm/Kconfig | 10 ++ 17 files changed, 75 insertions(+) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..969e6740bcf7 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h index e8c30430be33..b03054b35c74 100644 --- a/arch/arm/include/asm/cachetype.h +++ b/arch/arm/include/asm/cachetype.h @@ -16,6 +16,9 @@ extern unsigned int cacheid; #define cache_is_vipt()cacheid_is(CACHEID_VIPT) #define cache_is_vipt_nonaliasing()cacheid_is(CACHEID_VIPT_NONALIASING) #define cache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_ALIASING) +#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC +#define cache_is_aliasing()cache_is_vipt_aliasing() +#endif #define icache_is_vivt_asid_tagged() cacheid_is(CACHEID_ASID_TAGGED) #define icache_is_vipt_aliasing() cacheid_is(CACHEID_VIPT_I_ALIASING) #define icache_is_pipt() cacheid_is(CACHEID_PIPT) diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index c164cde50243..23af93cdc03d 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -539,9 +539,11 @@ config CPU_CACHE_NOP bool config CPU_CACHE_VIVT + select ARCH_HAS_CACHE_ALIASING bool config CPU_CACHE_VIPT + select ARCH_HAS_CACHE_ALIASING_DYNAMIC if CPU_V6 || CPU_V6K bool config CPU_CACHE_FA diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig index cf2a6fd7dff8..439d7640deb8 100644 --- a/arch/csky/Kconfig +++ b/arch/csky/Kconfig @@ -2,6 +2,7 @@ config CSKY def_bool y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING 
select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_SYNC_DMA_FOR_CPU diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 4b3e93cac723..216338704f0a 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -3,6 +3,7 @@ config M68K bool default y select ARCH_32BIT_OFF_T + select ARCH_HAS_CACHE_ALIASING select ARCH_HAS_BINFMT_FLAT select ARCH_HAS_CP
[RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Mathieu Desnoyers (7): Introduce cache_is_aliasing() across all architectures dax: Fix incorrect list of cache aliasing architectures erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Introduce fuse_dax_is_supported() xfs: Use dax_is_supported() arch/arc/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 3 ++ arch/arm/mm/Kconfig | 2 ++ arch/csky/Kconfig | 1 + arch/m68k/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/sh/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 2 +- fs/erofs/super.c| 10 +++--- fs/ext2/super.c | 14 fs/ext4/super.c | 52 ++--- fs/fuse/file.c | 2 +- fs/fuse/fuse_i.h| 36 +++- fs/fuse/inode.c | 47 +- fs/fuse/virtio_fs.c | 4 +-- fs/xfs/xfs_super.c | 20 +++ include/linux/cacheinfo.h | 8 + include/linux/dax.h | 9 + mm/Kconfig | 10 ++ 27 files changed, 198 insertions(+), 73 deletions(-) create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h -- 2.39.2
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On 2024-01-26 15:12, Steven Rostedt wrote: [...] diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..2187be6d7b23 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = { .setattr= tracefs_setattr, }; +/* Copied from get_next_ino() but adds allocation for multiple inodes */ +#define LAST_INO_BATCH 1024 +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) +static DEFINE_PER_CPU(unsigned int, last_ino); + +unsigned int tracefs_get_next_ino(int files) +{ + unsigned int *p = &get_cpu_var(last_ino); + unsigned int res = *p; + +#ifdef CONFIG_SMP + /* Check if adding files+1 overflows */ How does it handle a @files input where: * (files+1 > LAST_INO_BATCH) ? * (files+1 == LAST_INO_BATCH) ? + if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { + static atomic_t shared_last_ino; + int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino); + + res = next - LAST_INO_BATCH; + } +#endif + + res++; + /* get_next_ino should not provide a 0 inode number */ + if (unlikely(!res)) + res++; I suspect that bumping this res++ in the 0 case can cause inode range reservation issues at (files+1 == LAST_INO_BATCH-1). Thanks, Mathieu + *p = res + files; + put_cpu_var(last_ino); + return res; +} -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[REGRESSION] v5.13: FS_DAX unavailable on 32-bit ARM
Hi, This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. I can see a few ways forward to address this: A) I have prepared a patch series introducing cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implemented it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected, without actually solving the issue for architectures with virtually aliased dcaches. B) Another approach would be to dig into what exactly DAX is doing with the linear mapping, and try to fix this. I see two options there: B.1) Either we extend vmap to allow vmap'd pages to be aligned on specific multiples, and use a coloring trick based on SHMLBA like userspace mappings do for all DAX internal pages accesses, or B.2) We introduce flush_dcache_folio() at relevant spots (perhaps dax_flush() ?) to synchronize the linear mapping wrt userspace mappings. (would this work ?) Any thoughts on how to best move forward with this issue are welcome. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-20 08:47, Steven Rostedt wrote: On Fri, 19 Jan 2024 20:49:36 -0500 Mathieu Desnoyers wrote: Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex) Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234FFFF 16-bit: 0xFFFF (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit) Event 3 0x12350000 16-bit: 0x0000 (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x12370000 64-bit: 0x12370000 (Encoding over 16-bit not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream. I will take the point of view of the trace reader for the rest of the discussion. The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored. Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read.
In the packet header you have 0x12345678 in the first event you have 0x5678 how does that get you the timestamp? If that event had 0xFFFF, when the reader reads this packet, it would take the header 0x12345678 chop off (ignore) the 5678, and add the FFFF, right? We need to consider not only what happens when the 16 low bits increase, but also what happens when they end up with a value smaller than the previous 16 low bits. As a summary from our video meeting discussion: There are 3 cases we care about here: packet header timestamp: 0x12345678 followed by either: A) first event delta from packet header timestamp is 0: 16-bit value 0x5678 B) first event delta from packet header timestamp is <= 0xFFFF: B.1) 16-bit value example 0x5699 (no 16-bit overflow from previous value) B.2) 16-bit value example 0x0000 (exactly one 16-bit overflow from previous value) C) first event delta from packet header timestamp is larger than 0xFFFF, which would cause the low-order 16 bits to have more than one 16-bit overflow from the previous value. The tracer detects this and uses a full 64-bit timestamp instead of the compact 16 bits. [...] But how do you detect the overflow? That last timestamp to know if the tsc overflowed or not needs to be saved and compared. I would assume you have a similar race that I have. Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg updating the reserved position. The algorithm looks like: do { - read current reserved position (old pos) - read time - compute new reserved position (new pos) } while (cmpxchg(reserved pos, old pos, new pos) != old pos); [A] save_last_tsc() So the last_tsc that is saved is from the timestamp read before the cmpxchg. Yes. If interrupted at [A] by another trace producer, it will compare with an older "last tsc" than the tsc of the event physically located just before the nested event.
This stale "last tsc" has a value which is necessarily lower than the one we would be saving with the save_last_tsc immediately afterwards, which means in the worse case we end up using a full 64-bit timestamp when in fact we could use a more compact representation. But this race is rare and therefore it does not matter for size. That's equivalent to me "injecting" an absolute value for the same race. Yes. The fact that I only need this last_tsc value for the sake of optimization, and not for computation of a time delta from a previous injected timestamp, makes it possible to handle the race gracefully without
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 16:42, Steven Rostedt wrote: On Fri, 19 Jan 2024 15:56:21 -0500 Mathieu Desnoyers wrote: So when an overflow happens, you just insert a timestamp, and then events after that are based on that? No. Let's use an example to show how it works. For reference, LTTng uses 5-bit for event ID and 27-bit for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16-bit for timestamps in the compact representation. Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events: Timestamp (Hex) Encoding in the trace Packet header timestamp begin 0x12345678 64-bit: 0x12345678 Event 1 0x12345678 16-bit: 0x5678 (When decoded, same value as previous timestamp, no overflow) Event 2 0x1234FFFF 16-bit: 0xFFFF (When decoded, going from "0x5678" to "0xFFFF" does not overflow 16-bit) Event 3 0x12350000 16-bit: 0x0000 (When decoded, going from "0xFFFF" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.) Event 4 0x12370000 64-bit: 0x12370000 (Encoding over 16-bit not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore use the full 64-bit timestamp in the "large" event header representation.) I think that's basically what I said, but you are just looking at it differently ;-) Or should I say, you are using bits for optimization. Based on your explanation below, we are really talking about different things here. Let me try to reply to your explanation to try to show where what I am doing completely differs from what you have in mind. This will help explain how I handle 16-bit overflow as well. The events are based off of the last injected timestamp. Incorrect. There is no "injected timestamp". There is only a concept of the "current timestamp" as we either write to or read from the event stream.
I will take the point of view of the trace reader for the rest of the discussion.

The above example starts with a timestamp injection into the packet header: 0x12345678, with the lsb 16 bits ignored.

Wrong again. The 16 least significant bits are not ignored. The "current timestamp" is really 0x12345678 when the packet header is read.

So in actuality, it inserts 0x1234, plus adds a 5678 that represents the creation of the header (like we discussed about in the last tracing meeting).

There is no "0x1234" reference time. There is only the actual 0x12345678 current time at packet begin, including those 16 low-order bits.

The first event has: 0x5678 which is based on the previous injected timestamp of 0x1234.

It is not "based" on the previous injected timestamp. It just represents the low-order 16 bits of the timestamp at event 1. The trace reader takes two pieces of information to compute the complete current timestamp for event 1:

1) The current timestamp just before event 1 is read (0x12345678),
2) The low-order 16 bits of event 1 (0x5678).

Given that there is no 16-bit overflow when comparing:

  0x12345678 & 0xffff and 0x5678

We know that the resulting current timestamp for event 1 is:

  (0x12345678 & ~0xffff) + 0x5678 = 0x12345678

Or basically that time did not move between the packet header and event 1.

the events go on just using a delta from that until you see it overflow (or too big of a delta to fit in the 16 bits), and you insert a new full timestamp that everything will be based on: 0x12370000

No. The following events use the same algorithm I just described: they use this notion of "current timestamp" and the information provided by the new timestamp field in the event header to figure out the updated current timestamp. It is _never_ based on some kind of arbitrary reference, it is always absolute, and is always computed based on the current timestamp and the timestamp field encountered.

Now events following that are just a delta from that timestamp.
But you calculate the delta simply by masking out the lower bits.

No, again, there is no delta from an arbitrary injected time. It is always computed from this virtual "current time", which applies to an event stream.

But how do you detect the overflow? That last timestamp, to know whether the tsc overflowed or not, needs to be saved and compared. I would assume you have a similar race that I have.

Yes, I save a "last timestamp" per buffer, but the race does not matter because of the order in which it is saved wrt local cmpxchg up
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-19 10:37, Steven Rostedt wrote: On Fri, 19 Jan 2024 09:40:27 -0500 Mathieu Desnoyers wrote: On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)" [...]

Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it.

I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event).

There's both. An extended timestamp, which is added when the delta is too big, and that too is just a delta from the previous event. And there is the absolute timestamp as well. I could always just use the absolute one. That event came much later.

OK

This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity. The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings achieved with timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.

That points to this: -8<- 6.3.
Clock value update procedure

To update DEF_CLK_VAL from an unsigned integer field F having the unsigned integer value V and the class C:

- Let L be an unsigned integer initialized to, depending on the type property of C:
    "fixed-length-unsigned-integer": the value of the length property of C.
    "variable-length-unsigned-integer": S × 7, where S is the number of bytes which F occupies within the data stream.
- Let MASK be an unsigned integer initialized to 2^L − 1.
- Let H be an unsigned integer initialized to DEF_CLK_VAL & ~MASK, where "&" is the bitwise AND operator and "~" is the bitwise NOT operator.
- Let CUR be an unsigned integer initialized to DEF_CLK_VAL & MASK, where "&" is the bitwise AND operator.
- Set DEF_CLK_VAL to:
    If V ≥ CUR: H + V
    Else: H + MASK + 1 + V
->8-

There's a lot of missing context there, so I don't see how it relates.

This explains how the "current time" is reconstructed by a trace reader when loading an event header timestamp field. But for the sake of this discussion we can focus on the less formal explanation of how the tracer generates this timestamp encoding provided below.

The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of overflows of the low-order bits since the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency.

So when an overflow happens, you just insert a timestamp, and then events after that are based on that?

No. Let's use an example to show how it works. For reference, LTTng uses 5 bits for the event ID and 27 bits for timestamps in the compact event header representation. But for the sake of making this discussion easier, let's assume a tracer would use 16 bits for timestamps in the compact representation.
Let's say we have the following ktime_get() values (monotonic timestamp value) for a sequence of events:

                                 Timestamp (hex)   Encoding in the trace

 Packet header timestamp begin   0x12345678        64-bit: 0x12345678

 Event 1                         0x12345678        16-bit: 0x5678
   (When decoded, same value as previous timestamp, no overflow.)

 Event 2                         0x1234ffff        16-bit: 0xffff
   (When decoded, going from "0x5678" to "0xffff" does not overflow 16-bit.)

 Event 3                         0x12350000        16-bit: 0x0000
   (When decoded, going from "0xffff" to "0x0000" overflows 16-bit exactly once, which allows the trace reader to reconstruct timestamp 0x12350000 from the previous timestamp and the 16-bit timestamp encoding.)

 Event 4                         0x12370000        64-bit: 0x12370000
   (Encoding over 16-bit is not possible because going from 0x12350000 to 0x12370000 would overflow 16-bit twice, which cannot be detected by a trace reader. Therefore, use the full 64-bit timestamp in the "large" event header representation.)
Re: [PATCH] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-01-18 18:12, Steven Rostedt wrote: From: "Steven Rostedt (Google)"

Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code.

I admire the effort of trying to simplify the Ftrace ring buffer by bringing over ideas that worked well for LTTng. :-) As reviewer of the tracing subsystem, I certainly welcome the simplifications. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it.

I understand that the reason why you need the before/after stamps and their associated complexity is because the Ftrace ring buffer ABI encodes event timestamps as deltas from the previous event within the buffer as a means of compressing the timestamp fields. If the delta cannot be represented in a given number of bits, then it inserts a 64-bit timestamp (not sure if that one is absolute or a delta from the previous event). This timestamp encoding as deltas between events introduces a strong inter-dependency between consecutive (nested) events, and is the reason why you are stuck with all this timestamp before/after complexity.

The Common Trace Format specifies (and LTTng implements) a different way to achieve the same ring buffer space savings achieved with timestamp deltas while keeping the timestamps semantically absolute from a given reference, hence without all the before/after timestamp complexity. You can see the clock value decoding procedure in the CTF2 SPEC RC9 [1] document.
The basic idea on the producer side is to record the low-order bits of the current timestamp in the event header (truncating the higher-order bits), and fall back on a full 64-bit value if the number of overflows of the low-order bits since the previous timestamp is more than one, or if it is impossible to figure out precisely the timestamp of the previous event due to a race. This achieves the same space savings as delta timestamp encoding without introducing the strong event inter-dependency.

The fact that Ftrace exposes this ring buffer binary layout as a user-space ABI makes it tricky to move to the Common Trace Format timestamp encoding. There are clearly huge simplifications that could be made by moving to this scheme though. Is there any way to introduce a different timestamp encoding scheme as an extension to the Ftrace ring buffer ABI ? This would allow us to introduce this simpler scheme and gradually phase out the more complex delta encoding when no users are left. Thoughts ?

Thanks, Mathieu

[1] https://diamon.org/ctf/files/CTF2-SPECRC-9.0.html#clk-val-update

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 2024-01-11 11:17, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  - A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA.
  - A meta-page, where are stored ring-buffer statistics and a description for the current reader.

Hi Vincent,

The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1], and has a lot of similarities with this patch series. LTTng has the notion of "subbuffer id" to allow atomically exchanging a "reader" extra subbuffer with the subbuffer to be read. It implements "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader) to move the currently read subbuffer position. [2]

It would not hurt to compare your approach to LTTng and highlight similarities/differences, and the rationale for the differences. Especially when it comes to designing kernel ABIs, it's good to make sure that all bases are covered, because those choices will have lasting impacts.

Thanks, Mathieu

[1] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c
[2] https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 13:43, Steven Rostedt wrote: On Fri, 15 Dec 2023 13:25:07 -0500 Mathieu Desnoyers wrote: I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. These are no different than the knobs for sched_debug These are very much different. The sched features are enabled at build-time by modifying kernel/sched/features.h. There is no userspace ABI involved. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. There's already lots of exposure using options. The options are just knobs, nothing more. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. No it is not. There is no such thing as "immutable ABI". The rule is "don't break user space" If this functionality in the kernel goes away, the knob could become a nop, and I doubt any user space will break because of it. That is, the only burden is keeping this option exposed. But it could be just like that light switch that has nothing connected to it. It's still there, but does nothing if you switch it. This knob can act the same way. This does not in anyway prevent future innovation. I am not comfortable with exposing internal ring buffer implementation details to userspace which may or may not be deprecated as no-ops in the future. This will lead to useless clutter. One approach that might be somewhat better would be to only expose those files when a new CONFIG_TRACING_DEBUG option is enabled. 
This way userspace cannot rely on having those files present as part of the ABI, but you would still be free to use them for selftests by skipping the specific selftests if the config option is not enabled. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:34, Steven Rostedt wrote: On Fri, 15 Dec 2023 12:24:14 -0500 Mathieu Desnoyers wrote: On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. What's wrong with a way to stop the copying ? I am not against exposing an ABI that allows userspace to alter the filter behavior. I disagree on the way you plan to expose the ABI. Exposing this option as an ABI in this way exposes too much internal ring buffer implementation details to userspace. It exposes the following details which IMHO should be hidden or configurable in a way that allows moving to a whole new mechanism which will have significantly different characteristics in the future: It exposes that: - filtering uses a copy to a temporary buffer, and - that this copy is enabled by default. Once exposed, those design constraints become immutable due to ABI. The option is just a way to say "I don't want to do the copy into the buffer, I want to go directly into it" My main concern is how this concept, once exposed to userspace, becomes not only an internal implementation detail, but a fundamental part of the design which cannot ever go away without breaking the ABI or making parts of the ABI irrelevant. 
I can make a parallel with the scheduler: this is as if the sched feature knobs (which are there only for development/debugging purposes) would all be exposed as userspace ABI. This would seriously limit the evolution of the scheduler design in the future. I see this as the same problem at the ring buffer level. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Maybe I don't understand your concern. It's an on/off switch (like all options are). This is just a way to say "I want to indirect copying of the event before filtering or not". Not all tracefs options are booleans. The "current_tracer" file ABI exposes a string input/output parameter. I would recommend the same for the equivalent of a "current_filter" file. The "input-argument" part above may never happen. What's different between tracefs and LTTng is that all events are created by the subsystem, not by me. You don't use the TRACE_EVENT() macro, but you need to manually create each event you care about yourself. It's more of a burden but you also then have the luxury of hooking to the input portion. That is not exposed, and that is by design. As that could possibly make all tracepoints an ABI, and you'll have people like peterz nacking any new tracepoint in the scheduler. He doesn't allow trace events anymore because of that exposure. I'm not arguing for moving to the input-argument scheme, I just used this as a hypothetical example to show why we should not expose internal implementation details to userspace which will prevent future evolution only for the sake of having debugging knobs. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 12:04, Steven Rostedt wrote: On Fri, 15 Dec 2023 10:53:39 -0500 Mathieu Desnoyers wrote: [...] So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" If I add other ways to filter, it will be a separate file to control that, but all options are on/off switches. Even if I add other functionality to the way buffers are created, this will still have the same functionality to turn the entire thing on or off. I'll be clearer then: I think this is a bad ABI. In your reply, you justify this bad ABI by implementation motivations. I don't care about the implementation, I care about the ABI, and I feel that your reply is not addressing my concern at all. Moreover, double-negative boolean options (disable-X=false) are confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add disable-filter-buf option
On 2023-12-15 10:26, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Normally, when the filter is enabled, a temporary buffer is created to copy the event data into it to perform the filtering logic. If the filter passes and the event should be recorded, then the event is copied from the temporary buffer into the ring buffer. If the event is to be discarded then it is simply dropped. If another event comes in via an interrupt, it will not use the temporary buffer as it is busy and will write directly into the ring buffer. The disable-filter-buf option will disable the temporary buffer and always write into the ring buffer. This will avoid the copy when the event is to be recorded, but also adds a bit more overhead on the discard, and if another event were to interrupt the event that is to be discarded, then the event will not be removed from the ring buffer but instead converted to padding that will not be read by the reader. Padding will still take up space on the ring buffer. This option can be beneficial if most events are recorded and not discarded, or simply for debugging the discard functionality of the ring buffer. Also fix some whitespace (that was fixed by editing this in vscode). I'm not convinced that a boolean state is what you need here. Yes, today you are in a position where you have two options: a) use the filter buffer, which falls back on copy to ring buffer if nested, b) disable the filter buffer, and thus always copy to ring buffer before filtering, But I think it would not be far-fetched to improve the implementation of the filter-buffer to have one filter-buffer per nesting level (per execution context), or just implement the filter buffer as a per-cpu stack, which would remove the need to fall back on copy to ring buffer when nested. Or you could even do like LTTng and filter on the input arguments directly. 
But each of these changes would add yet another boolean tunable, which can quickly make the mix hard to understand from a user perspective. So rather than stacking tons of "on/off" switches for filter features, how about you let users express the mechanism they want to use for filtering with a string instead ? e.g. filter-method="temp-buffer" filter-method="ring-buffer" filter-method="input-arguments" ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Remove 32bit timestamp logic
On 2023-12-13 21:11, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Each event has a 27-bit timestamp delta that is used to hold the delta from the last event. If the time between events is greater than 2^27, then a timestamp is added that holds a 59-bit absolute timestamp. Until a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp"), if an interrupt interrupted an event in progress, all the events' deltas would be zero to not deal with the races that need to be handled. The commit a389d86f7fd09 changed that to handle the races so that all events, even those that preempt other events, still have an accurate timestamp. To handle those races requires performing 64-bit cmpxchg on the timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered very slow. To try to deal with this, the timestamp logic was broken into two and then three 32-bit cmpxchgs, with the thought that two (or three) 32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit architectures. Part of the problem with this is that I didn't have any 32-bit architectures to test on. After hitting several subtle bugs in this code, an effort was made to try and see if three 32-bit cmpxchgs are indeed faster than a single 64-bit. After a few people brushed off the dust of their old 32-bit machines, tests were done, and even though 32-bit cmpxchg was faster than a single 64-bit, it was on the order of 50% at best, not 300%. I literally had to dust off my old Eee PC for this :) This means that this complex code is not only complex but also not even faster than just using 64-bit cmpxchg. Nuke it!
Acked-by: Mathieu Desnoyers @@ -3720,6 +3517,7 @@ rb_reserve_next_event(struct trace_buffer *buffer, struct rb_event_info info; int nr_loops = 0; int add_ts_default; + static int once; (as you spotted, should be removed) Thanks, Mathieu /* ring buffer does cmpxchg, make sure it is safe in NMI context */ if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
On 2023-12-12 11:53, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32-bit architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
 	unsigned long cnt, top, bottom, msb;
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;

 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;

 	if (val != expect)
 		return false;

 <<<< interrupted here!

 	cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read *after* the value of the timestamp is read. That means if an interrupt were to come in between the value being read and the counter being read, it can change the value and the counter and the interrupted process would be clueless about it! The counter needs to be read first and then the value. That way it is easy to tell if the value is stale or not. If the counter hasn't been updated, then the value is still good.

Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Cc: sta...@vger.kernel.org
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers
Signed-off-by: Steven Rostedt (Google)
Reviewed-by: Mathieu Desnoyers
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..e110cde685ea 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	if (val != expect)
 		return false;
 
-	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
The following race can cause rb_time_read() to observe a corrupted time stamp:

 rb_time_cmpxchg()
 [...]
         if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                 return false;
         if (!rb_time_read_cmpxchg(&t->top, top, top2))
                 return false;
 __rb_time_read()
 [...]
         do {
                 c = local_read(&t->cnt);
                 top = local_read(&t->top);
                 bottom = local_read(&t->bottom);
                 msb = local_read(&t->msb);
         } while (c != local_read(&t->cnt));

         *cnt = rb_time_cnt(top);

         /* If top and msb counts don't match, this interrupted a write */
         if (*cnt != rb_time_cnt(msb))
                 return false;
 ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of the msb, top, and bottom 2-bit cnt values match. The reason to favor checking all three fields over requiring a specific update order for both rb_time_set() and rb_time_cmpxchg() is because checking all three fields is more robust to handle partial failures of rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoy...@efficios.com/
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..71c225ca2a2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
--
2.39.2
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 23:38, Steven Rostedt wrote: On Mon, 11 Dec 2023 22:51:04 -0500 Mathieu Desnoyers wrote: [...]

For this first issue, here is the race:

 rb_time_cmpxchg()
 [...]
         if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                 return false;
         if (!rb_time_read_cmpxchg(&t->top, top, top2))
                 return false;
 __rb_time_read()
 [...]
         do {
                 c = local_read(&t->cnt);
                 top = local_read(&t->top);
                 bottom = local_read(&t->bottom);
                 msb = local_read(&t->msb);
         } while (c != local_read(&t->cnt));

         *cnt = rb_time_cnt(top);

         /* If top and msb counts don't match, this interrupted a write */
         if (*cnt != rb_time_cnt(msb))
                 return false;
 ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong.

Ah, OK that makes more sense. Yeah, if I had the three words from the beginning, I would have tested to make sure they all match and not just the two :-p

Technically, just checking that the very first and last words which are updated by set/cmpxchg have the same cnt bits would suffice. Because this is just a scenario of __rb_time_read() interrupting an update, the updates in between are fine if the first/last words to be updated have the same cnt.

As this would fix a commit that tried to fix this before! f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()") FYI, that would be the "Fixes" for this patch.

OK

[...]

- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch.

Except that it's not 4 bits that is compared, but 32 bits.

 struct rb_time_struct {
 	local_t cnt;
 	local_t top;
 	local_t bottom;
 	local_t msb;
 };

The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here.

Let's bring up the race scenario I spotted:

 rb_time_cmpxchg()
 [...]
 	/* The cmpxchg always fails if it interrupted an update */
 	if (!__rb_time_read(t, &val, &cnt2))
 		return false;

 	if (val != expect)
 		return false;

 	cnt = local_read(&t->cnt);
 	if ((cnt & 3) != cnt2)
 		return false;
 ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() that follow will derive expected values from the outdated @val.

Ah. Of course this would be fixed if we did the local_read(&t->cnt) *before* everything else.

But then we could be interrupted after that initial read, before reading val. I suspect we'd want to propagate the full 32-bit cnt that was read by __rb_time_read() to the caller, which is not the case today. With that, we would not have to read it again in rb_time_cmpxchg(). It does leave the issue of having only 2 bits in the msb, top, bottom fields to detect races, which are subject to overflow.

 	cnt2 = cnt + 1;

 	rb_time_split(val, &top, &bottom, &msb);
 	top = rb_time_val_cnt(top, cnt);
 	bottom = rb_time_val_cnt(bottom, cnt);
 ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow.

 	rb_time_split(set, &top2, &bottom2, &msb2);
 	top2 = rb_time_val_cnt(top2, cnt2);
 	bottom2 = rb_time_val_cnt(bottom2, cnt2);

 	if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
 		return false;
 ^ This cmpxchg on @cnt succeeds because it uses the re-read cnt as expected value.

Sure.

And I believe you did find another bug. If we read the cnt first, before reading val, then it would not be outdated.

As stated above, I suspect we'd run into other issues if interrupted between the read of cnt and reading val. Propagating the full 32-bit cnt value read from __rb_time_read() to the caller would be better I think.

 	if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
 		return false;
 	if (!rb_time_read_cmpxchg(&t->top, top, top2))
 		return false;
 	if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
 		return false;
 ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause

Yes, it is expected that this will fail for being i
Re: [PATCH v2] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-12 09:00, Steven Rostedt wrote: [...]

 --- a/kernel/trace/trace.c
 +++ b/kernel/trace/trace.c
 @@ -7272,6 +7272,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	enum event_trigger_type tt = ETT_NONE;
  	struct trace_buffer *buffer;
  	struct print_entry *entry;
 +	int meta_size;
  	ssize_t written;
  	int size;
  	int len;
 @@ -7286,12 +7287,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	if (!(tr->trace_flags & TRACE_ITER_MARKERS))
  		return -EINVAL;
 
 -	if (cnt > TRACE_BUF_SIZE)
 -		cnt = TRACE_BUF_SIZE;

You're removing an early bound check for a size_t userspace input...

 -
 -	BUILD_BUG_ON(TRACE_BUF_SIZE >= PAGE_SIZE);
 -
 -	size = sizeof(*entry) + cnt + 2; /* add '\0' and possible '\n' */
 +	meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */
 +again:
 +	size = cnt + meta_size;

... and then implicitly casting it into an "int" size variable, which can therefore become a negative value. Just for the sake of not having to rely on ring_buffer_lock_reserve catching (length > BUF_MAX_DATA_SIZE), I would recommend adding an early check for negative here.

  	/* If less than "<faulted>", then make sure we can still add that */
  	if (cnt < FAULTED_SIZE)
 @@ -7300,9 +7298,25 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
  	buffer = tr->array_buffer.buffer;
  	event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
  					    tracing_gen_ctx());
 -	if (unlikely(!event))
 +	if (unlikely(!event)) {
 +		/*
 +		 * If the size was greated than what was allowed, then

greater ?

Thanks, Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Add size check when printing trace_marker output
On 2023-12-12 08:44, Steven Rostedt wrote: From: "Steven Rostedt (Google)" If for some reason the trace_marker write does not have a nul byte for the string, it will overflow the print: Does this result in leaking kernel memory to userspace ? If so, it should state "Fixes..." and CC stable. Thanks, Mathieu trace_seq_printf(s, ": %s", field->buf); The field->buf could be missing the nul byte. To prevent overflow, add the max size that the buf can be by using the event size and the field location. int max = iter->ent_size - offsetof(struct print_entry, buf); trace_seq_printf(s, ": %*s", max, field->buf); Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..e11fb8996286 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter, { struct print_entry *field; struct trace_seq *s = &iter->seq; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); seq_print_ip_sym(s, field->ip, flags); - trace_seq_printf(s, ": %s", field->buf); + trace_seq_printf(s, ": %*s", max, field->buf); return trace_handle_return(s); } @@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags, struct trace_event *event) { struct print_entry *field; + int max = iter->ent_size - offsetof(struct print_entry, buf); trace_assign_type(field, iter->ent); - trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf); + trace_seq_printf(&iter->seq, "# %lx %*s", field->ip, max, field->buf); return trace_handle_return(&iter->seq); } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
On 2023-12-11 17:59, Steven Rostedt wrote: On Mon, 11 Dec 2023 15:13:24 -0500 Mathieu Desnoyers wrote: Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: Thanks Mathieu! - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. Hmm, does it? This is not parallel programming, it's only protecting against interrupts. Understood, this is indeed the model I had in mind during my review (nested interruption only from local cpu) because preemption is disabled around use of the ring buffer. For this first issue, here is the race: rb_time_cmpxchg() [...] if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; __rb_time_read() [...] do { c = local_read(&t->cnt); top = local_read(&t->top); bottom = local_read(&t->bottom); msb = local_read(&t->msb); } while (c != local_read(&t->cnt)); *cnt = rb_time_cnt(top); /* If top and msb counts don't match, this interrupted a write */ if (*cnt != rb_time_cnt(msb)) return false; ^ this check fails to catch that "bottom" is still not updated. So the old "bottom" value is returned, which is wrong. BTW, it's best not to have a fix like this with a comment change this big, as the comment change is highly likely to cause conflicts in any backport. I wanted to start the discussion without having a N-patches series, but I agree that this first fix should be split into a separate patch. Although, for consistency, I wonder if everything else should be changed to go: bottom, top, msb as it would match the order of the bits, as msb has the highest order, top the next, and bottom the least. Doing it as: top, bottom, msb seems out of order. I did that because msb was an after thought :-/ Agreed on the order change, but I suspect this would belong to another patch (not the fix).
- A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. Except that it's not 4 bits that is compared, but 32 bits. struct rb_time_struct { local_t cnt; local_t top; local_t bottom; local_t msb; }; The full local_t (32 bits) is used for synchronization. But the other elements do get extra bits and there still might be some issues, but not as severe as you stated here. Let's bring up the race scenario I spotted: rb_time_cmpxchg() [...] /* The cmpxchg always fails if it interrupted an update */ if (!__rb_time_read(t, &val, &cnt2)) return false; if (val != expect) return false; cnt = local_read(&t->cnt); if ((cnt & 3) != cnt2) return false; ^ here (cnt & 3) == cnt2, but @val contains outdated data. This means the piecewise rb_time_read_cmpxchg() that follow will derive expected values from the outdated @val. cnt2 = cnt + 1; rb_time_split(val, &top, &bottom, &msb); top = rb_time_val_cnt(top, cnt); bottom = rb_time_val_cnt(bottom, cnt); ^ top, bottom, and msb contain outdated data, which do not match cnt due to 2-bit overflow. rb_time_split(set, &top2, &bottom2, &msb2); top2 = rb_time_val_cnt(top2, cnt2); bottom2 = rb_time_val_cnt(bottom2, cnt2); if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2)) return false; ^ This @cnt cmpxchg succeeds because the re-read cnt is used as the expected value. if (!rb_time_read_cmpxchg(&t->msb, msb, msb2)) return false; if (!rb_time_read_cmpxchg(&t->top, top, top2)) return false; if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2)) return false; ^ these cmpxchg have just used the outdated @val as expected values, even though the content of the rb_time was modified by 4 consecutive rb_time_set() or rb_time_cmpxchg(). This means those cmpxchg can fail not only due to being interrupted by another write or cmpxchg, but also simply due to expected value mismatch in any of the fields, which will then cause following __rb_time_read() to fail until a rb_time_set() is done.
return true; So this overflow scenario on top of cmpxchg does not cause corrupted time stamps, but does cause subsequent __rb_time_read() and rb_time_cmpxchg() to fail until an eventual rb_time_set(). Although, I should also change this to be: struct rb_time_struct { local_t
[RFC PATCH] ring-buffer: Fix and comment ring buffer rb_time functions on 32-bit
Going through a review of the ring buffer rb_time functions for 32-bit architectures, I updated the comments to match the code, and identified the following issues: - rb_time_cmpxchg() needs to update the msb last, so it matches the validation of top and msb by __rb_time_read(). This is fixed by this patch. - A cmpxchg interrupted by 4 writes or cmpxchg overflows the counter and produces corrupted time stamps. This is _not_ fixed by this patch. - After a cmpxchg fails between updates to top and msb, a write is needed before read and cmpxchg can succeed again. I am not entirely sure the rest of the ring buffer handles this correctly. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- kernel/trace/ring_buffer.c | 64 +++--- 1 file changed, 46 insertions(+), 18 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..f6ed699947cd 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -576,34 +576,50 @@ struct ring_buffer_iter { #ifdef RB_TIME_32 /* - * On 32 bit machines, local64_t is very expensive. As the ring - * buffer doesn't need all the features of a true 64 bit atomic, - * on 32 bit, it uses these functions (64 still uses local64_t). + * On 32-bit machines, local64_t is very expensive. As the ring + * buffer doesn't need all the features of a true 64-bit atomic, + * on 32-bit, it uses these functions (64-bit still uses local64_t). * - * For the ring buffer, 64 bit required operations for the time is - * the following: + * For the ring buffer, the operations required to manipulate 64-bit + * time stamps are the following: * - * - Reads may fail if it interrupted a modification of the time stamp. + * - Read may fail if it interrupted a modification of the time stamp. * It will succeed if it did not interrupt another write even if * the read itself is interrupted by a write. 
+ * A read will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * It returns whether it was successful or not. * - * - Writes always succeed and will overwrite other writes and writes + * - Write always succeeds and will overwrite other writes and writes * that were done by events interrupting the current write. * * - A write followed by a read of the same time stamp will always succeed, * but may not contain the same value. * * - A cmpxchg will fail if it interrupted another write or cmpxchg. + * A cmpxchg will fail if it follows a cmpxchg which failed between + * updates to its top and msb bits, until a write is performed. + * (note: this limitation may be unexpected in parts of the + * ring buffer algorithm) * Other than that, it acts like a normal cmpxchg. * - * The 60 bit time stamp is broken up by 30 bits in a top and bottom half - * (bottom being the least significant 30 bits of the 60 bit time stamp). + * The 64-bit time stamp is broken up, from most to least significant, + * in: msb, top and bottom fields, of respectively 4, 30, and 30 bits. * - * The two most significant bits of each half holds a 2 bit counter (0-3). + * The two most significant bits of each field hold a 2-bit counter (0-3). * Each update will increment this counter by one. - * When reading the top and bottom, if the two counter bits match then the - * top and bottom together make a valid 60 bit number. + * When reading the top, bottom, and msb fields, if the two counter bits + * match, then the combined values make a valid 64-bit number. + * + * Counter limits. The following situations can generate overflows that + * produce corrupted time stamps: + * + * - A read or a write interrupted by 2^32 writes or cmpxchg. + * + * - A cmpxchg interrupted by 4 writes or cmpxchg. 
+ * (note: this is not sufficient and should be fixed) */ #define RB_TIME_SHIFT 30 #define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1) @@ -632,7 +648,7 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) /* * If the read is interrupted by a write, then the cnt will -* be different. Loop until both top and bottom have been read +* be different. Loop until top, bottom and msb have been read * without interruption. */ do { @@ -644,7 +660,12 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt) *cnt = rb_time_cnt(top); - /* If top and msb counts don't match, this interrupted a write */ + /* +* If top and msb counts don't match, this either interrupted a +* write or follows a failed cmpxchg. +* This requires the update to bottom to be enclosed between +* updates
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 11:38, Steven Rostedt wrote: On Sun, 10 Dec 2023 11:07:22 -0500 Mathieu Desnoyers wrote: It just allows more to be written in one go. I don't see why the tests need to cover this or detect this change. If the purpose of this change is to ensure that the entire trace marker payload is shown within a single event, then there should be a test which exercises this, and which validates that the end result is that the entire payload is indeed shown within a single event record. No, the purpose of the change is not to do that, because there can always be a bigger trace marker write than a single event can hold. This is the way it has always worked. This is an optimization or "enhancement". The 1KB restriction was actually because of a previous implementation years ago (before selftests even existed) that wrote into a temp buffer before copying into the ring buffer. But since we now can copy directly into the ring buffer, there's no reason not to use the maximum that the ring buffer can accept. My point is that the difference between the new "enhanced" behavior and the previous behavior is not tested for. Otherwise there is no permanent validation that this change indeed does what it is intended to do, so it can regress at any time without any test noticing it. What regress? The amount of a trace_marker write that can make it into the buffer in one go? Now, I agree that we should have a test to make sure that all of the trace marker write gets into the buffer. Yes. This is pretty much my point. But it's always been allowed to break up that write however it wanted to. And the enhanced behavior extends the amount of data that can get written into a single sub-buffer, and this is not tested. Note, because different architectures have different page sizes, how much that can make it in one go is architecture dependent. So you can have a "regression" by simply running your application on a different architecture.
Which is why, in the following patches you have, expressing the subbuffer size as bytes rather than pages is important at the ABI level. It facilitates portability of tests, and decreases documentation / user burden. Again, it's not a requirement, it's just an enhancement. How does this have anything to do with dispensing from testing the new behavior ? If the new behavior has a bug that causes it to silently truncate the trace marker payloads, how do you catch it with the current tests ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-10 10:30, Steven Rostedt wrote: On Sun, 10 Dec 2023 09:09:06 -0500 Mathieu Desnoyers wrote: On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Before this change we had: ~# echo -n '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 
7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123 4567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567 89012345678901234567890123456789012345678901234567890123456789012345678901234' > /sys/kernel/tracing/trace_marker ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:8 # #_-=> irqs-off/BH-disabled # / _---
Re: [PATCH 13/14] ring-buffer: Add documentation on the buffer_subbuf_order file
On 2023-12-09 22:54, Steven Rostedt wrote: [...] + buffer_subbuf_order: + + This sets or displays the sub buffer page size order. The ring buffer + is broken up into several same size "sub buffers". An event can not be + bigger than the size of the sub buffer. Normally, the sub buffer is + the size of the architecture's page (4K on x86). The sub buffer also + contains meta data at the start which also limits the size of an event. + That means when the sub buffer is a page size, no event can be larger + than the page size minus the sub buffer meta data. The fact that the user ABI documentation for this tracer parameter needs to dig into details about architecture page size is a good indicator that this ABI is not at the abstraction level it should be (pages vs bytes). Thanks, Mathieu + + The buffer_subbuf_order allows the user to change the size of the sub + buffer. As the sub buffer is a set of pages by the power of 2, thus + the sub buffer total size is defined by the order: + + order size + + 0 PAGE_SIZE + 1 PAGE_SIZE * 2 + 2 PAGE_SIZE * 4 + 3 PAGE_SIZE * 8 + + Changing the order will change the sub buffer size allowing for events + to be larger than the page size. + + Note: When changing the order, tracing is stopped and any data in the + ring buffer and the snapshot buffer will be discarded. + free_buffer: If a process is performing tracing, and the ring buffer should be -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 14/14] ringbuffer/selftest: Add basic selftest to test chaning subbuf order
On 2023-12-09 22:54, Steven Rostedt wrote: [...]
+get_buffer_data_size() {
+	sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+a="1234567890"
+
+make_str() {
+	cnt=$1
+	s=""
+	while [ $cnt -gt 10 ]; do
+		s="${s}${a}"
+		cnt=$((cnt-10))
+	done
+	while [ $cnt -gt 0 ]; do
+		s="${s}X"
+		cnt=$((cnt-1))
+	done
+	echo -n $s
+}
+
+test_buffer() {
+
+	size=`get_buffer_data_size`
+
+	str=`make_str $size`
+
+	echo $str > trace_marker
+
+	grep -q $a trace
This test has no clue if the record was truncated or not. It basically repeats the string "1234567890" until it fills the subbuffer size and pads with "X" as needed as trace marker payload, but the grep looks for the "1234567890" pattern only. The test should be extended to validate whether the trace marker payload was truncated or not, otherwise it is of limited value. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 00/14] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers
On 2023-12-09 22:54, Steven Rostedt wrote: [...] Basically, events to the tracing subsystem are limited to just under a PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page size, and an event can not be bigger than a sub buffer. This allows users to change the size of a sub buffer by the order: echo 3 > /sys/kernel/tracing/buffer_subbuf_order Will make each sub buffer a size of 8 pages, allowing events to be almost as big as 8 pages in size (sub buffers do have meta data on them as well, keeping an event from reaching the same size as a sub buffer). Specifying the "order" of subbuffer size as a power of two of number of pages is a poor UX choice for a user-facing ABI. I would recommend allowing the user to specify the size in bytes, and internally bump the size to the next power of 2, with a minimum of PAGE_SIZE. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
On 2023-12-09 17:10, Steven Rostedt wrote: [...] <...>-852 [001] . 121.550551: tracing_mark_write[LINE TOO BIG] <...>-852 [001] . 121.550581: tracing_mark_write: 78901234 Failing to print an entire message because it does not fit in the buffer size is rather inconvenient. It would be better to print the partial line, and end the line with a tag. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Allow for max buffer data size trace_marker writes
On 2023-12-09 17:50, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. I would expect the tests under tools/testing/selftests/ftrace/* to be affected by those changes. If they are not, then it's a hole in the test coverage and should be taken care of as this behavior is modified. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: Add offset of events in dump on mismatch
On 2023-12-07 17:16, Steven Rostedt wrote: [...] diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 8d2a4f00eca9..b10deb8a5647 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3424,23 +3424,27 @@ static void dump_buffer_page(struct buffer_data_page *bpage, case RINGBUF_TYPE_TIME_EXTEND: delta = rb_event_time_stamp(event); ts += delta; - pr_warn(" [%lld] delta:%lld TIME EXTEND\n", ts, delta); + pr_warn(" %x: [%lld] delta:%lld TIME EXTEND\n", Please prefix hex values with "0x", as in: pr_warn(" 0x%x: [%lld] delta:%lld TIME EXTEND\n" Otherwise it can be confusing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] MAINTAINERS: TRACING: Add Mathieu Desnoyers as Reviewer
In order to make sure I get CC'd on tracing changes for which my input would be relevant, add my name as reviewer of the TRACING subsystem. Signed-off-by: Mathieu Desnoyers Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: linux-trace-ker...@vger.kernel.org --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 4cc6bf79fdd8..a7c2092d0063 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -21622,6 +21622,7 @@ F: drivers/hwmon/pmbus/tps546d24.c TRACING M: Steven Rostedt M: Masami Hiramatsu +R: Mathieu Desnoyers L: linux-kernel@vger.kernel.org L: linux-trace-ker...@vger.kernel.org S: Maintained -- 2.39.2
Re: [PATCH RFC] selftests/rseq: fix kselftest Clang build warnings
On 9/8/23 19:03, Justin Stitt wrote: Hi, I am experiencing many warnings when trying to build tools/testing/selftests. Here's one such example from rseq tree: | param_test.c:1234:10: error: address argument to atomic operation must be a pointer to _Atomic type ('intptr_t *' (aka 'long *') invalid) | 1234 | while (!atomic_load(&args->percpu_list_ptr)) {} || ^ ~~ | /usr/local/google/home/justinstitt/repos/tc-build/build/llvm/final/lib/clang/18/include/stdatomic.h:140:29: note: expanded from macro 'atomic_load' |140 | #define atomic_load(object) __c11_atomic_load(object, __ATOMIC_SEQ_CST) || ^ ~~ I added the _Atomic type in various locations to silence _all_ (10) of these warnings. I'm wondering, though, perhaps the absence of these _Atomic types in the first place is on purpose? Am I on the right track to fix these warnings without damaging the legitimacy of the tests at hand? I'd like some feedback about where to go from here and if others are experiencing the same issues. Thanks! FWIW here's my specific build incantation on Clang-18 (49d41de57896e935cd5726719c5208bce22694ae): $ make LLVM=1 -j128 ARCH=x86_64 mrproper headers defconfig kselftest-merge $ make LLVM=1 ARCH=x86_64 -C tools/testing/selftests I should have used the __atomic_load_n() compiler builtin rather than atomic_load(), mainly because it does not require adding the _Atomic annotation to each type it touches.
Thanks, Mathieu Link: https://github.com/ClangBuiltLinux/linux/issues/1698 Link: https://github.com/ClangBuiltLinux/continuous-integration2/issues/61 Signed-off-by: Justin Stitt --- tools/testing/selftests/rseq/param_test.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c index bf951a490bb4..94802aeed2c6 100644 --- a/tools/testing/selftests/rseq/param_test.c +++ b/tools/testing/selftests/rseq/param_test.c @@ -356,7 +356,7 @@ struct inc_thread_test_data { }; struct percpu_list_node { - intptr_t data; + _Atomic intptr_t data; struct percpu_list_node *next; }; @@ -1212,8 +1212,8 @@ static int set_signal_handler(void) /* Test MEMBARRIER_CMD_PRIVATE_RESTART_RSEQ_ON_CPU membarrier command. */ #ifdef TEST_MEMBARRIER struct test_membarrier_thread_args { - int stop; - intptr_t percpu_list_ptr; + _Atomic int stop; + _Atomic intptr_t percpu_list_ptr; }; /* Worker threads modify data in their "active" percpu lists. */ @@ -1240,7 +1240,7 @@ void *test_membarrier_worker_thread(void *arg) int cpu = get_current_cpu_id(); ret = rseq_offset_deref_addv(RSEQ_MO_RELAXED, RSEQ_PERCPU, - &args->percpu_list_ptr, + (intptr_t *)&args->percpu_list_ptr, sizeof(struct percpu_list_entry) * cpu, 1, cpu); } while (rseq_unlikely(ret)); } --- base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c change-id: 20230908-kselftest-param_test-c-1763b62e762f Best regards, -- Justin Stitt -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH][RFC] tracing: Enable tracepoints via module parameters
- On Apr 20, 2021, at 10:55 AM, rostedt rost...@goodmis.org wrote: > On Tue, 20 Apr 2021 09:29:27 -0400 (EDT) > Mathieu Desnoyers wrote: > >> - On Apr 20, 2021, at 8:55 AM, rostedt rost...@goodmis.org wrote: >> [...] >> > >> > Would adding automatic module parameters be an issue? That is, you can add >> > in the insmod command line a parameter that will enable tracepoints. We >> > could have a way to even see them from the modinfo. I think I had that >> > working once, and it wasn't really that hard to do. >> >> There is one thing we should consider here in terms of namespacing: those >> module >> command line parameters should be specific to each tracer (e.g. ftrace, perf, >> ebpf). >> >> LTTng for instance already tackles early module load tracing in a different >> way: users can enable instrumentation of yet-to-be loaded kernel modules. So >> it would not make sense in that scheme to have module load parameters. >> >> It's a different trade-off in terms of error reporting though: for instance, >> LTTng won't report an error if a user does a typo when entering an event >> name. >> >> So I think those command line parameters should be tracer-specific, do you >> agree >> ? > > > No, I do not agree. I would like to make it consistent with the kernel > command line. As you can put in: "trace_event=sched_switch" and the > sched_switch trace point will be enable (for the tracefs directory) on boot > up. The same should be for modules as well. > > It shouldn't affect LTTng, as you already have a way to enable them as they > get loaded. That sounds fine. So that would be within the "trace_event" (and not tracepoint) namespace for module load parameters as well ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH][RFC] tracing: Enable tracepoints via module parameters
- On Apr 20, 2021, at 8:55 AM, rostedt rost...@goodmis.org wrote: [...] > > Would adding automatic module parameters be an issue? That is, you can add > in the insmod command line a parameter that will enable tracepoints. We > could have a way to even see them from the modinfo. I think I had that > working once, and it wasn't really that hard to do. There is one thing we should consider here in terms of namespacing: those module command line parameters should be specific to each tracer (e.g. ftrace, perf, ebpf). LTTng for instance already tackles early module load tracing in a different way: users can enable instrumentation of yet-to-be loaded kernel modules. So it would not make sense in that scheme to have module load parameters. It's a different trade-off in terms of error reporting though: for instance, LTTng won't report an error if a user does a typo when entering an event name. So I think those command line parameters should be tracer-specific, do you agree ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
- On Apr 16, 2021, at 3:02 PM, paulmck paul...@kernel.org wrote: [...] > > If it can be done reasonably, I suggest also having some way for the > person building userspace RCU to say "I know what I am doing, so do > it with volatile rather than memory_order_consume." Like so ? #define CMM_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x)) #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p) /* * By defining URCU_DEREFERENCE_USE_VOLATILE, the user requires use of * volatile access to implement rcu_dereference rather than * memory_order_consume load from the C11/C++11 standards. * * This may improve performance on weakly-ordered architectures where * the compiler implements memory_order_consume as a * memory_order_acquire, which is stricter than required by the * standard. * * Note that using volatile accesses for rcu_dereference may cause * LTO to generate incorrectly ordered code starting from C11/C++11. */ #ifdef URCU_DEREFERENCE_USE_VOLATILE # define rcu_dereference(x) CMM_LOAD_SHARED(x) #else # if defined (__cplusplus) # if __cplusplus >= 201103L # include <atomic> # define rcu_dereference(x) ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif # else # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L) # include <stdatomic.h> # define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif # endif #endif Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
- On Apr 16, 2021, at 12:01 PM, paulmck paul...@kernel.org wrote: > On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote: >> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote: >> > Hi Paul, Will, Peter, >> > >> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO >> > is able to break rcu_dereference. This seems to be taken care of by >> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree. >> > >> > In the liburcu user-space library, we have this comment near >> > rcu_dereference() >> > in >> > include/urcu/static/pointer.h: >> > >> > * The compiler memory barrier in CMM_LOAD_SHARED() ensures that >> > value-speculative >> > * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform >> > the >> > * data read before the pointer read by speculating the value of the >> > pointer. >> > * Correct ordering is ensured because the pointer is read as a volatile >> > access. >> > * This acts as a global side-effect operation, which forbids reordering of >> > * dependent memory operations. Note that such concern about >> > dependency-breaking >> > * optimizations will eventually be taken care of by the >> > "memory_order_consume" >> > * addition to forthcoming C++ standard. >> > >> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was >> > introduced in >> > liburcu as a public API before READ_ONCE() existed in the Linux kernel) >> > >> > Peter tells me the "memory_order_consume" is not something which can be >> > used >> > today. >> > Any information on its status at C/C++ standard levels and >> > implementation-wise ? > > Actually, you really can use memory_order_consume. All current > implementations will compile it as if it was memory_order_acquire. > This will work correctly, but may be slower than you would like on ARM, > PowerPC, and so on. > > On things like x86, the penalty is forgone optimizations, so less > of a problem there. 
OK > >> > Pragmatically speaking, what should we change in liburcu to ensure we don't >> > generate >> > broken code when LTO is enabled ? I suspect there are a few options here: >> > >> > 1) Fail to build if LTO is enabled, >> > 2) Generate slower code for rcu_dereference, either on all architectures >> > or only >> >on weakly-ordered architectures, >> > 3) Generate different code depending on whether LTO is enabled or not. >> > AFAIU >> > this would only >> >work if every compile unit is aware that it will end up being optimized >> > with >> >LTO. Not sure >> >how this could be done in the context of user-space. >> > 4) [ Insert better idea here. ] > > Use memory_order_consume if LTO is enabled. That will work now, and > might generate good code in some hoped-for future. In the context of a user-space library, how does one check whether LTO is enabled with preprocessor directives ? A quick test with gcc seems to show that both with and without -flto cannot be distinguished from a preprocessor POV, e.g. the output of both gcc --std=c11 -O2 -dM -E - < /dev/null and gcc --std=c11 -O2 -flto -dM -E - < /dev/null is exactly the same. Am I missing something here ? If we accept to use memory_order_consume all the time in both C and C++ code starting from C11 and C++11, the following code snippet could do the trick: #define CMM_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x)) #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p) #if defined (__cplusplus) # if __cplusplus >= 201103L # include <atomic> # define rcu_dereference(x) ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif #else # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L) # include <stdatomic.h> # define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume) # else # define rcu_dereference(x) CMM_LOAD_SHARED(x) # endif #endif This uses the volatile approach prior to C11/C++11, and moves to memory_order_consume afterwards.
This will bring a performance penalty on weakly-ordered architectures even when -flto is not specified though. Then the burden is pushed on the compiler people to eventually implement an efficient memory_order_consume. Is that acceptable ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?
Hi Paul, Will, Peter, I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO is able to break rcu_dereference. This seems to be taken care of by arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree. In the liburcu user-space library, we have this comment near rcu_dereference() in include/urcu/static/pointer.h: * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the * data read before the pointer read by speculating the value of the pointer. * Correct ordering is ensured because the pointer is read as a volatile access. * This acts as a global side-effect operation, which forbids reordering of * dependent memory operations. Note that such concern about dependency-breaking * optimizations will eventually be taken care of by the "memory_order_consume" * addition to forthcoming C++ standard. (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in liburcu as a public API before READ_ONCE() existed in the Linux kernel) Peter tells me the "memory_order_consume" is not something which can be used today. Any information on its status at C/C++ standard levels and implementation-wise ? Pragmatically speaking, what should we change in liburcu to ensure we don't generate broken code when LTO is enabled ? I suspect there are a few options here: 1) Fail to build if LTO is enabled, 2) Generate slower code for rcu_dereference, either on all architectures or only on weakly-ordered architectures, 3) Generate different code depending on whether LTO is enabled or not. AFAIU this would only work if every compile unit is aware that it will end up being optimized with LTO. Not sure how this could be done in the context of user-space. 4) [ Insert better idea here. ] Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v3 0/3] rseq: minor optimizations
- On Apr 13, 2021, at 4:33 PM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > rseq is a heavy user of copy to/from user data in fast paths. > This series tries to reduce the cost. > > v3: Third patch going back to v1 (only deal with 64bit arches) > v2: Addressed Peter and Mathieu feedbacks, thanks ! For the whole series: Reviewed-by: Mathieu Desnoyers Thanks Eric! Mathieu > > Eric Dumazet (3): > rseq: optimize rseq_update_cpu_id() > rseq: remove redundant access_ok() > rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() > > kernel/rseq.c | 29 + > 1 file changed, 21 insertions(+), 8 deletions(-) > > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 2:22 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers > wrote: >> > >> As long as the ifdefs are localized within clearly identified wrappers in the >> rseq code I don't mind doing the special-casing there. >> >> The point which remains is that I don't think we want to optimize for speed >> on 32-bit architectures when it adds special-casing and complexity to the >> 32-bit >> build. I suspect there is less and less testing performed on 32-bit >> architectures >> nowadays, and it's good that as much code as possible is shared between >> 32-bit >> and >> 64-bit builds to share the test coverage. >> > > Quite frankly V1 was fine, I can't really make it looking better. Yes, I'm OK with V1 of that patch. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 1:33 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers > wrote: >> >> - On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote: >> >> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet wrote: >> >> >> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet wrote: >> >> > >> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers >> >> > wrote: >> >> > > >> >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet >> >> > > eric.duma...@gmail.com wrote: >> >> > > >> >> > > > From: Eric Dumazet >> >> > > > >> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> >> > > > update includes") added regressions for our servers. >> >> > > > >> >> > > > Using copy_from_user() and clear_user() for 64bit values >> >> > > > is suboptimal. >> >> > > > >> >> > > > We can use faster put_user() and get_user(). >> >> > > > >> >> > > > 32bit arches can be changed to use the ptr32 field, >> >> > > > since the padding field must always be zero. >> >> > > > >> >> > > > v2: added ideas from Peter and Mathieu about making this >> >> > > >generic, since my initial patch was only dealing with >> >> > > >64bit arches. >> >> > > >> >> > > Ah, now I remember the reason why reading and clearing the entire >> >> > > 64-bit >> >> > > is important: it's because we don't want to allow user-space >> >> > > processes to >> >> > > use this change in behavior to figure out whether they are running on >> >> > > a >> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> >> > > >> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to >> >> > > keep >> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well. >> >> > > >> >> > > Thanks, >> >> > > >> >> > >> >> > So... back to V1 then ? >> >> >> >> Or add more stuff as in : >> > >> > diff against v2, WDYT ? 
>> >> I like this approach slightly better, because it moves the preprocessor >> ifdefs >> into >> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a >> 32-bit >> process running on native 32-bit kernel and as compat task on a 64-bit >> kernel. >> >> That being said, I don't expect anyone to care much about performance of >> 32-bit >> kernels, so we could use copy_from_user() on 32-bit kernels to remove >> special-cases >> in 32-bit specific code. This would eliminate the 32-bit specific "padding" >> read, and >> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit >> kernels. >> >> As for clear_user(), I wonder whether we could simply keep using it, but >> change >> the >> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I >> find it >> odd that performance optimizations which would be relevant elsewhere creep >> into >> the >> rseq code. > > > clear_user() is a maze of arch-dependent macros/functions/assembly > > I guess the same could be said from copy_in_user(), but apparently we removed > special-casing, like in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22 > > Definitely it seems odd having to carefully choose between multiple methods. As long as the ifdefs are localized within clearly identified wrappers in the rseq code I don't mind doing the special-casing there. The point which remains is that I don't think we want to optimize for speed on 32-bit architectures when it adds special-casing and complexity to the 32-bit build. I suspect there is less and less testing performed on 32-bit architectures nowadays, and it's good that as much code as possible is shared between 32-bit and 64-bit builds to share the test coverage. Thanks, Mathieu > > >> >> Thanks, >> >> Mathieu >> >> > >> > diff --git a/kernel/rseq.c b/kernel/rseq.c >> > index >> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 >>
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet wrote: >> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet wrote: >> > >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers >> > wrote: >> > > >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com >> > > wrote: >> > > >> > > > From: Eric Dumazet >> > > > >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > > > update includes") added regressions for our servers. >> > > > >> > > > Using copy_from_user() and clear_user() for 64bit values >> > > > is suboptimal. >> > > > >> > > > We can use faster put_user() and get_user(). >> > > > >> > > > 32bit arches can be changed to use the ptr32 field, >> > > > since the padding field must always be zero. >> > > > >> > > > v2: added ideas from Peter and Mathieu about making this >> > > >generic, since my initial patch was only dealing with >> > > >64bit arches. >> > > >> > > Ah, now I remember the reason why reading and clearing the entire 64-bit >> > > is important: it's because we don't want to allow user-space processes to >> > > use this change in behavior to figure out whether they are running on a >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> > > >> > > So although I'm fine with making 64-bit kernels faster, we'll want to >> > > keep >> > > updating the entire 64-bit ptr field on 32-bit kernels as well. >> > > >> > > Thanks, >> > > >> > >> > So... back to V1 then ? >> >> Or add more stuff as in : > > diff against v2, WDYT ? I like this approach slightly better, because it moves the preprocessor ifdefs into rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 32-bit process running on native 32-bit kernel and as compat task on a 64-bit kernel. 
That being said, I don't expect anyone to care much about performance of 32-bit kernels, so we could use copy_from_user() on 32-bit kernels to remove special-cases in 32-bit specific code. This would eliminate the 32-bit specific "padding" read, and let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit kernels. As for clear_user(), I wonder whether we could simply keep using it, but change the clear_user() macro to figure out that it can use a faster 8-byte put_user ? I find it odd that performance optimizations which would be relevant elsewhere creep into the rseq code. Thanks, Mathieu > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > { >u32 ptr; > > + if (get_user(ptr, &rseq->rseq_cs.ptr.padding)) > + return -EFAULT; > + if (ptr) > + return -EINVAL; >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32)) >return -EFAULT; >*uptrp = (struct rseq_cs __user *)ptr; > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, > struct rseq_cs *rseq_cs) >u32 sig; >int ret; > > - if (rseq_get_cs_ptr(&urseq_cs, t->rseq)) > - return -EFAULT; > + ret = rseq_get_cs_ptr(&urseq_cs, t->rseq); > + if (ret) > + return ret; >if (!urseq_cs) { >memset(rseq_cs, 0, sizeof(*rseq_cs)); >return 0; > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t) > #ifdef CONFIG_64BIT >return put_user(0UL, &t->rseq->rseq_cs.ptr64); > #else > - return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32); > + return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) | > + put_user(0UL, &t->rseq->rseq_cs.ptr.padding); > #endif > } -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 12:57 PM, Eric Dumazet eduma...@google.com wrote: > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers > wrote: >> >> - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com >> wrote: >> >> > From: Eric Dumazet >> > >> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > update includes") added regressions for our servers. >> > >> > Using copy_from_user() and clear_user() for 64bit values >> > is suboptimal. >> > >> > We can use faster put_user() and get_user(). >> > >> > 32bit arches can be changed to use the ptr32 field, >> > since the padding field must always be zero. >> > >> > v2: added ideas from Peter and Mathieu about making this >> >generic, since my initial patch was only dealing with >> >64bit arches. >> >> Ah, now I remember the reason why reading and clearing the entire 64-bit >> is important: it's because we don't want to allow user-space processes to >> use this change in behavior to figure out whether they are running on a >> 32-bit or in a 32-bit compat mode on a 64-bit kernel. >> >> So although I'm fine with making 64-bit kernels faster, we'll want to keep >> updating the entire 64-bit ptr field on 32-bit kernels as well. >> >> Thanks, >> > > So... back to V1 then ? In terms of behavior, yes. And it's probably the "easy" fix, but I hate that it adds lots of preprocessor ifdefs into the rseq code. But this would require auditing get_user()/put_user() for each architecture supported by rseq to ensure they support 8-byte load/store. And it would become an added burden on architecture maintainers wishing to add rseq support for their architecture. One alternative would be to implement rseq_get_user_u64 and rseq_put_user_u64 wrappers as static functions within rseq.c to hide the preprocessor ifdeffery from the higher-level code. I try very hard to avoid mixing preprocessor ifdefs with C code logic whenever I can. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, > update includes") added regressions for our servers. > > Using copy_from_user() and clear_user() for 64bit values > is suboptimal. > > We can use faster put_user() and get_user(). > > 32bit arches can be changed to use the ptr32 field, > since the padding field must always be zero. > > v2: added ideas from Peter and Mathieu about making this >generic, since my initial patch was only dealing with >64bit arches. Ah, now I remember the reason why reading and clearing the entire 64-bit is important: it's because we don't want to allow user-space processes to use this change in behavior to figure out whether they are running on a 32-bit or in a 32-bit compat mode on a 64-bit kernel. So although I'm fine with making 64-bit kernels faster, we'll want to keep updating the entire 64-bit ptr field on 32-bit kernels as well. Thanks, Mathieu > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. 
McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 41 + > 1 file changed, 33 insertions(+), 8 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t) > return 0; > } > > +#ifdef CONFIG_64BIT > +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > +const struct rseq __user *rseq) > +{ > + u64 ptr; > + > + if (get_user(ptr, &rseq->rseq_cs.ptr64)) > + return -EFAULT; > + *uptrp = (struct rseq_cs __user *)ptr; > + return 0; > +} > +#else > +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, > +const struct rseq __user *rseq) > +{ > + u32 ptr; > + > + if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32)) > + return -EFAULT; > + *uptrp = (struct rseq_cs __user *)ptr; > + return 0; > +} > +#endif > + > static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) > { > struct rseq_cs __user *urseq_cs; > - u64 ptr; > u32 __user *usig; > u32 sig; > int ret; > > - if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr))) > + if (rseq_get_cs_ptr(&urseq_cs, t->rseq)) > return -EFAULT; > - if (!ptr) { > + if (!urseq_cs) { > memset(rseq_cs, 0, sizeof(*rseq_cs)); > return 0; > } > - if (ptr >= TASK_SIZE) > + if ((unsigned long)urseq_cs >= TASK_SIZE) > return -EINVAL; > - urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr; > + > if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) > return -EFAULT; > > @@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t) >* >* Set rseq_cs to NULL. 
>*/ > - if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64))) > - return -EFAULT; > - return 0; > +#ifdef CONFIG_64BIT > + return put_user(0UL, &t->rseq->rseq_cs.ptr64); > +#else > + return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32); > +#endif > } > > /* > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 3/3] rseq: optimise for 64bit arches
- On Apr 13, 2021, at 11:06 AM, David Laight david.lai...@aculab.com wrote: [...] > > Hmmm... too much replication. > You could do: > #ifdef CONFIG_64BIT > #define PTR_TYPE u64 > #define PTR_FLD ptr64 > #else > #define PTR_TYPE u32 > #define PTR_FLD ptr32 > #endif > > Then have one copy of the code and the #undefs. Good point, agreed. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 2/3] rseq: remove redundant access_ok()
- On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > After commit 8f2817701492 ("rseq: Use get_user/put_user rather > than __get_user/__put_user") we no longer need > an access_ok() call from __rseq_handle_notify_resume() While we are doing that, should we also remove the access_ok() check in rseq_syscall() ? It looks like it can also be removed for the exact same reason outlined here. Thanks, Mathieu > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > > if (unlikely(t->flags & PF_EXITING)) > return; > - if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq)))) > - goto error; > ret = rseq_ip_fixup(regs); > if (unlikely(ret < 0)) > goto error; > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 1/3] rseq: optimize rseq_update_cpu_id()
- On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote: > From: Eric Dumazet > > Two put_user() in rseq_update_cpu_id() are replaced > by a pair of unsafe_put_user() with appropriate surroundings. > > This removes one stac/clac pair on x86 in fast path. > > Signed-off-by: Eric Dumazet > Cc: Mathieu Desnoyers > Cc: Peter Zijlstra > Cc: "Paul E. McKenney" > Cc: Boqun Feng > Cc: Arjun Roy > Cc: Ingo Molnar > --- > kernel/rseq.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index > a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e > 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -84,13 +84,20 @@ > static int rseq_update_cpu_id(struct task_struct *t) > { > u32 cpu_id = raw_smp_processor_id(); > + struct rseq *r = t->rseq; AFAIU the variable above should be a struct rseq __user *. Elsewhere in the file we use "rseq" rather than "r" for struct rseq __user * variable name, it would be better to keep the naming consistent across the file if possible. Thanks, Mathieu > > - if (put_user(cpu_id, &t->rseq->cpu_id_start)) > - return -EFAULT; > - if (put_user(cpu_id, &t->rseq->cpu_id)) > - return -EFAULT; > + if (!user_write_access_begin(r, sizeof(*r))) > + goto efault; > + unsafe_put_user(cpu_id, &r->cpu_id_start, efault_end); > + unsafe_put_user(cpu_id, &r->cpu_id, efault_end); > + user_write_access_end(); > trace_rseq_update(t); > return 0; > + > +efault_end: > + user_write_access_end(); > +efault: > + return -EFAULT; > } > > static int rseq_reset_rseq_cpu_id(struct task_struct *t) > -- > 2.31.1.295.g9ea45b61b8-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 3/3] rseq: optimise for 64bit arches
- On Apr 13, 2021, at 6:36 AM, David Laight david.lai...@aculab.com wrote: > From: Peter Zijlstra >> Sent: 13 April 2021 10:10 >> >> On Tue, Apr 13, 2021 at 12:36:57AM -0700, Eric Dumazet wrote: >> > From: Eric Dumazet >> > >> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, >> > update includes") added regressions for our servers. >> > >> > Using copy_from_user() and clear_user() for 64bit values >> > on 64bit arches is suboptimal. >> > >> > We might revisit this patch once all 32bit arches support >> > get_user() and/or put_user() for 8 bytes values. >> >> Argh, what a mess :/ afaict only nios32 lacks put_user_8, but get_user_8 >> is missing in a fair number of archs. >> >> That said; 32bit archs never have to actually set the top bits in that >> word, so they _could_ only set the low 32 bits. That works provided >> userspace itself keeps the high bits clear. > > Does that work for 32bit BE ? Yes, because uapi/linux/rseq.h defines the ptr32 as: #if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN) __u32 padding; /* Initialized to zero. */ __u32 ptr32; #else /* LITTLE */ __u32 ptr32; __u32 padding; /* Initialized to zero. */ #endif /* ENDIAN */ which takes care of BE vs LE. > > David > >> So I suppose that if we're going to #ifdef this, we might as well do the >> whole thing. >> >> Mathieu; did I forget a reason why this cannot work? The only difference it brings on 32-bit is that the truncation of high bits will be done before the following validation: if (!ptr) { memset(rseq_cs, 0, sizeof(*rseq_cs)); return 0; } if (ptr >= TASK_SIZE) return -EINVAL; The question is whether we really want to issue a segmentation fault if 32-bit user-space has set non-zero high bits, or if silently ignoring those high bits is acceptable. 
Nits below: >> >> diff --git a/kernel/rseq.c b/kernel/rseq.c >> index a4f86a9d6937..94006190b8eb 100644 >> --- a/kernel/rseq.c >> +++ b/kernel/rseq.c >> @@ -115,20 +115,25 @@ static int rseq_reset_rseq_cpu_id(struct task_struct >> *t) >> static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs) >> { >> struct rseq_cs __user *urseq_cs; >> -u64 ptr; >> +unsigned long ptr; I am always reluctant to use long/unsigned long type as type for the get/put_user (x) argument, because it hides the cast deep within architecture-specific macros. I understand that in this specific case it happens that on 64-bit archs we end up casting a u64 to unsigned long (same size), and on 32-bit archs we end up casting a u32 to unsigned long (also same size), so there is no practical concern about type promotion and sign-extension, but I think it would be better to have something explicit, e.g.: #ifdef CONFIG_64BIT static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, struct rseq_cs *rseq_cs) { u64 ptr; if (get_user(ptr, &rseq_cs->ptr64)) return -EFAULT; *uptrp = (struct rseq_cs __user *)ptr; return 0; } #else static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp, struct rseq_cs *rseq_cs) { u32 ptr; if (get_user(ptr, &rseq_cs->ptr.ptr32)) return -EFAULT; *uptrp = (struct rseq_cs __user *)ptr; return 0; } #endif And use those helpers to get the ptr value. >> u32 __user *usig; >> u32 sig; >> int ret; >> >> -if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr))) >> return -EFAULT; >> +#ifdef CONFIG_64BIT >> +if (get_user(ptr, &t->rseq->rseq_cs.ptr64)) >> return -EFAULT; >> +#else >> +if (get_user(ptr, &t->rseq->rseq_cs.ptr32)) Note that this is also not right. It should be &t->rseq->rseq_cs.ptr.ptr32. 
Thanks, Mathieu >> +return -EFAULT; >> +#endif >> if (!ptr) { >> memset(rseq_cs, 0, sizeof(*rseq_cs)); >> return 0; >> } >> if (ptr >= TASK_SIZE) >> return -EINVAL; >> -urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr; >> +urseq_cs = (struct rseq_cs __user *)ptr; >> if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs))) >> return -EFAULT; >> > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, > UK > Registration No: 1397386 (Wales) -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: about seg fault inside rseq critical section
4f\n\t" >> ".popsection\n\t" >> "1:\n\t" >> "leaq cs_obj(%%rip), %%rax\n\t" >> "movq %%rax, %[rseq_cs]\n\t" >> "cmpl %[cpu_id], %[current_cpu_id]\n\t" >> "jnz 4f\n\t" >> /* enable signal testing */ >> "movq $5, %%rdi\n\t" >> "call sleep@plt\n\t" >> /*/ >> "jmp %l[committed]\n\t" >> "2:\n\t" >> /* Disassembler-friendly signature: nopl (%rip). */ >> ".byte 0x0f, 0x1f, 0x05\n\t" >> ".long 0x53053053\n\t" /* RSEQ_SIG */ >> "4:\n\t" >> "jmp %l[aborted]\n\t" >> : /* no outputs */ >> : [cpu_id] "r" (cpu), >> [current_cpu_id] "m" (__rseq_abi.cpu_id), >> [rseq_cs] "m" (__rseq_abi.rseq_cs) >> : "memory", "cc", "rax" >> : aborted, committed >> ); >> committed: >> printf("committed\n"); >> return 0; >> aborted: >> printf("aborted\n"); >> return -1; >> } >> void signal_callback_handler(int signum) { >> printf("Caught signal %d\n", signum); >> } >> int main(int argc, char **argv) { >> signal(SIGINT, signal_callback_handler); >> int cpu, ret; >> register_thread(); >> cpu = RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start); >> printf("ret = %d\n", do_test(cpu)); >> return 0; >> } > As the screenshot shows, the program executed the signal handler instead of > the rseq abort handler after the program was interrupted. > I am confused about how rseq deals with signal delivery, as I assumed the abort > handler will be triggered anyway when flags permit it. Could you please > explain > such two cases or could you please share any references (code, article, etc) > here? > Thanks in advance! > My virtual machine environment: >> $ cat /etc/os-release >> NAME="Ubuntu" >> VERSION="20.04.2 LTS (Focal Fossa)" >> ... 
>> $ uname -r >> 5.4.0-66-generic > Best regards, > Mingyi > From: Mathieu Desnoyers > Sent: Wednesday, April 7, 2021 9:25 AM > To: Liu, Mingyi > Cc: linux-kernel ; Peter Zijlstra > ; paulmck ; Boqun Feng > > Subject: Re: about seg fault inside rseq critical section > (re-sent in plain-text for lkml) > - On Apr 6, 2021, at 6:24 PM, Mingyi Liu mingyi...@gatech.edu wrote: > > Hi Mathieu, > > I notice that the program will be terminated with segmentation fault when > > it is > > even seg faulted inside the rseq critical section. In Linux kernel rseq.c, I > > haven't found comment or code regarding this. Could you share any > > references on > > why it doesn't execute user defined abort handler but uses the OS handler > > instead? > > Thanks in advance! > Hi Mingyi, > Please let me add the other rseq maintainers and LKML in CC. I'm a bit > stretched > on time > here, so maybe they will have time to answer before I do. > Meanwhile, if you could provide details about your architecture, kernel > .config, > and a > small reproducer program, it would help bootstrapping the discussion. > Thanks, > Mathieu > > Best, > > Mingyi > -- > Mathieu Desnoyers > EfficiOS Inc. > [ http://www.efficios.com/ | http://www.efficios.com ] -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: about seg fault inside rseq critical section
(re-sent in plain-text for lkml) - On Apr 6, 2021, at 6:24 PM, Mingyi Liu mingyi...@gatech.edu wrote: > Hi Mathieu, > I notice that the program will be terminated with segmentation fault when it > is > even seg faulted inside the rseq critical section. In Linux kernel rseq.c, I > haven't found comment or code regarding this. Could you share any references > on > why it doesn't execute user defined abort handler but uses the OS handler > instead? > Thanks in advance! Hi Mingyi, Please let me add the other rseq maintainers and LKML in CC. I'm a bit stretched on time here, so maybe they will have time to answer before I do. Meanwhile, if you could provide details about your architecture, kernel .config, and a small reproducer program, it would help bootstrapping the discussion. Thanks, Mathieu > Best, > Mingyi -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Mar 11, 2021, at 11:51 AM, Peter Zijlstra pet...@infradead.org wrote: > On Thu, Mar 11, 2021 at 09:51:56AM -0500, Mathieu Desnoyers wrote: >> >> >> - On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote: >> >> > For userspace checkpoint and restore (C/R) a way of getting process state >> > containing RSEQ configuration is needed. >> > >> > There are two ways this information is going to be used: >> > - to re-enable RSEQ for threads which had it enabled before C/R >> > - to detect if a thread was in a critical section during C/R >> > >> > Since C/R preserves TLS memory and addresses RSEQ ABI will be restored >> > using the address registered before C/R. >> > >> > Detection whether the thread is in a critical section during C/R is needed >> > to enforce behavior of RSEQ abort during C/R. Attaching with ptrace() >> > before registers are dumped itself doesn't cause RSEQ abort. >> > Restoring the instruction pointer within the critical section is >> > problematic because rseq_cs may get cleared before the control is passed >> > to the migrated application code leading to RSEQ invariants not being >> > preserved. C/R code will use RSEQ ABI address to find the abort handler >> > to which the instruction pointer needs to be set. >> > >> > To achieve above goals expose the RSEQ ABI address and the signature value >> > with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. >> > >> > This new ptrace request can also be used by debuggers so they are aware >> > of stops within restartable sequences in progress. >> > >> > Signed-off-by: Piotr Figiel >> > Reviewed-by: Michal Miroslaw >> >> Reviewed-by: Mathieu Desnoyers > > How do we route this? Do I stick this in tip/sched/core as being an rseq > patch? Sure, it's fine with me, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote: > For userspace checkpoint and restore (C/R) a way of getting process state > containing RSEQ configuration is needed. > > There are two ways this information is going to be used: > - to re-enable RSEQ for threads which had it enabled before C/R > - to detect if a thread was in a critical section during C/R > > Since C/R preserves TLS memory and addresses RSEQ ABI will be restored > using the address registered before C/R. > > Detection whether the thread is in a critical section during C/R is needed > to enforce behavior of RSEQ abort during C/R. Attaching with ptrace() > before registers are dumped itself doesn't cause RSEQ abort. > Restoring the instruction pointer within the critical section is > problematic because rseq_cs may get cleared before the control is passed > to the migrated application code leading to RSEQ invariants not being > preserved. C/R code will use RSEQ ABI address to find the abort handler > to which the instruction pointer needs to be set. > > To achieve above goals expose the RSEQ ABI address and the signature value > with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION. > > This new ptrace request can also be used by debuggers so they are aware > of stops within restartable sequences in progress. > > Signed-off-by: Piotr Figiel > Reviewed-by: Michal Miroslaw Reviewed-by: Mathieu Desnoyers Thanks! 
Mathieu

> ---
> v2:
> Applied review comments:
>  - changed return value from the ptrace request to the size of the
>    configuration structure
>  - expanded configuration structure with the flags field and
>    the rseq abi structure size
>
> v1:
> https://lore.kernel.org/lkml/20210222100443.4155938-1-fig...@google.com/
>
> ---
>  include/uapi/linux/ptrace.h | 10 ++
>  kernel/ptrace.c             | 25 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..3747bf816f9a 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,16 @@ struct ptrace_syscall_info {
> 	};
> };
>
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 rseq_abi_size;
> +	__u32 signature;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
> /*
>  * These values are stored in task->ptrace_message
>  * by tracehook_report_syscall_* to describe the current syscall-stop.
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 61db50f7ca86..76f09456ec4b 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -31,6 +31,7 @@
> #include
> #include
> #include
> +#include
>
> #include	/* for syscall_get_* */
>
> @@ -779,6 +780,24 @@ static int ptrace_peek_siginfo(struct task_struct *child,
> 	return ret;
> }
>
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.rseq_abi_size = sizeof(*task->rseq),
> +		.signature = task->rseq_sig,
> +		.flags = 0,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return sizeof(conf);
> +}
> +#endif
> +
> #ifdef PTRACE_SINGLESTEP
> #define is_singlestep(request)	((request) == PTRACE_SINGLESTEP)
> #else
>
> @@ -1222,6 +1241,12 @@ int ptrace_request(struct task_struct *child, long request,
> 	ret = seccomp_get_metadata(child, addr, datavp);
> 	break;
>
> +#ifdef CONFIG_RSEQ
> +	case PTRACE_GET_RSEQ_CONFIGURATION:
> +		ret = ptrace_get_rseq_configuration(child, addr, datavp);
> +		break;
> +#endif
> +
> 	default:
> 		break;
> 	}
> --
> 2.30.1.766.gb4fecdf3b7-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[tip: sched/core] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/core branch of tip: Commit-ID: ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825 Gitweb: https://git.kernel.org/tip/ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825 Author:Mathieu Desnoyers AuthorDate:Wed, 17 Feb 2021 11:56:51 -05:00 Committer: Ingo Molnar CommitterDate: Sat, 06 Mar 2021 12:40:21 +01:00 sched/membarrier: fix missing local execution of ipi_sync_rq_state() The function sync_runqueues_membarrier_state() should copy the membarrier state from the @mm received as parameter to each runqueue currently running tasks using that mm. However, the use of smp_call_function_many() skips the current runqueue, which is unintended. Replace by a call to on_each_cpu_mask(). Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load") Reported-by: Nadav Amit Signed-off-by: Mathieu Desnoyers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Cc: sta...@vger.kernel.org # 5.4.x+ Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com --- kernel/sched/membarrier.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index acdae62..b5add64 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm) } rcu_read_unlock(); - preempt_disable(); - smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); - preempt_enable(); + on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true); free_cpumask_var(tmpmask); cpus_read_unlock();
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 9:11 AM, Piotr Figiel fig...@google.com wrote: > Hi, > > On Mon, Feb 22, 2021 at 09:53:17AM -0500, Mathieu Desnoyers wrote: > >> I notice that other structures defined in this UAPI header are not >> packed as well. Should we add an attribute packed on new structures ? >> It seems like it is generally a safer course of action, even though >> each field is naturally aligned here (there is no padding/hole in the >> structure). > > I considered this for quite a while. There are some gains for this > approach, i.e. it's safer towards the ISO C, as theoretically compiler > can generate arbitrary offsets as long as struct elements have correct > order in memory. > Also with packed attribute it would be harder to make it incorrect in > future modifications. > User code also could theoretically put the structure on any misaligned > address. > > But the drawback is that all accesses to the structure contents are > inefficient and some compilers may generate large chunks of code > whenever the structure elements are accessed (I recall at least one ARM > compiler which generates series of single-byte accesses for those). For > kernel it doesn't matter much because the structure type is used in one > place, but it may be different for the application code. > > The change would be also inconsistent with the rest of the file and IMO > the gains are only theoretical. > > If there are more opinions on this or you have some argument I'm missing > please let me know I can send v3 with packed and explicit padding > removed. I think this is rather borderline trade off. I personally don't have a strong opinion on this and completely agree with your analysis. Maybe for pre-existing system calls adding more non-packed structures might be kind-of OK if some were already exposed, even though it seems rather fragile wrt ISO C. Thanks, Mathieu > > Best regards and thanks for looking at this, > Piotr. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 11:06 AM, Piotr Figiel fig...@google.com wrote:

> Hi,
>
> On Fri, Feb 26, 2021 at 10:32:35AM -0500, Mathieu Desnoyers wrote:
>> > +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> > +					  unsigned long size, void __user *data)
>> > +{
>> > +	struct ptrace_rseq_configuration conf = {
>> > +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> > +		.rseq_abi_size = sizeof(*task->rseq),
>> > +		.signature = task->rseq_sig,
>> > +		.flags = 0,
>> > +	};
>> > +
>> > +	size = min_t(unsigned long, size, sizeof(conf));
>> > +	if (copy_to_user(data, &conf, size))
>> > +		return -EFAULT;
>> > +	return sizeof(conf);
>> > +}
>>
>> I think what Florian was after would be:
>>
>> struct ptrace_rseq_configuration {
>> 	__u32 size;	/* size of struct ptrace_rseq_configuration */
>> 	__u32 flags;
>> 	__u64 rseq_abi_pointer;
>> 	__u32 signature;
>> 	__u32 pad;
>> };
>>
>> where:
>>
>> 	.size = sizeof(struct ptrace_rseq_configuration),
>>
>> This way, the configuration structure can be expanded in the future. The
>> rseq ABI structure is by definition fixed-size, so there is no point in
>> having its size here.
>
> Still, the rseq syscall accepts the rseq ABI structure size as a parameter.
> I think this way the information returned from ptrace is consistent with
> the userspace view of the rseq state and allows expansion in case the
> ABI structure would have to be extended (in spite of its current
> definition).
>
> The configuration structure still can be expanded, as its size is
> reported to userspace as the return value from the request (in line with
> Dmitry's comments).

Fair enough. And now with the reply from Florian I see that I misunderstood
his point.

Thanks,

Mathieu

> Best regards, Piotr.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
[tip: sched/urgent] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/urgent branch of tip: Commit-ID: fba111913e51a934eaad85734254eab801343836 Gitweb: https://git.kernel.org/tip/fba111913e51a934eaad85734254eab801343836 Author:Mathieu Desnoyers AuthorDate:Wed, 17 Feb 2021 11:56:51 -05:00 Committer: Peter Zijlstra CommitterDate: Mon, 01 Mar 2021 11:02:15 +01:00 sched/membarrier: fix missing local execution of ipi_sync_rq_state() The function sync_runqueues_membarrier_state() should copy the membarrier state from the @mm received as parameter to each runqueue currently running tasks using that mm. However, the use of smp_call_function_many() skips the current runqueue, which is unintended. Replace by a call to on_each_cpu_mask(). Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load") Reported-by: Nadav Amit Signed-off-by: Mathieu Desnoyers Signed-off-by: Peter Zijlstra (Intel) Cc: sta...@vger.kernel.org # 5.4.x+ Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com --- kernel/sched/membarrier.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index acdae62..b5add64 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm) } rcu_read_unlock(); - preempt_disable(); - smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1); - preempt_enable(); + on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true); free_cpumask_var(tmpmask); cpus_read_unlock();
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 11:04 AM, emmir em...@google.com wrote:

> On Fri, 26 Feb 2021 at 16:32, Mathieu Desnoyers wrote:
>>
>> - On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote:
>> [...]
>> > ---
>> > v2:
>> > Applied review comments:
>> > - changed return value from the ptrace request to the size of the
>> >   configuration structure
>> > - expanded configuration structure with the flags field and
>> >   the rseq abi structure size
>> >
>> [...]
>> > +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
>> > +
>> > +struct ptrace_rseq_configuration {
>> > +	__u64 rseq_abi_pointer;
>> > +	__u32 rseq_abi_size;
>> > +	__u32 signature;
>> > +	__u32 flags;
>> > +	__u32 pad;
>> > +};
>> > +
>> [...]
>> > +#ifdef CONFIG_RSEQ
>> > +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> > +					  unsigned long size, void __user *data)
>> > +{
>> > +	struct ptrace_rseq_configuration conf = {
>> > +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> > +		.rseq_abi_size = sizeof(*task->rseq),
>> > +		.signature = task->rseq_sig,
>> > +		.flags = 0,
>> > +	};
>> > +
>> > +	size = min_t(unsigned long, size, sizeof(conf));
>> > +	if (copy_to_user(data, &conf, size))
>> > +		return -EFAULT;
>> > +	return sizeof(conf);
>> > +}
>>
>> I think what Florian was after would be:
>>
>> struct ptrace_rseq_configuration {
>> 	__u32 size;	/* size of struct ptrace_rseq_configuration */
>> 	__u32 flags;
>> 	__u64 rseq_abi_pointer;
>> 	__u32 signature;
>> 	__u32 pad;
>> };
>>
>> where:
>>
>> 	.size = sizeof(struct ptrace_rseq_configuration),
>>
>> This way, the configuration structure can be expanded in the future. The
>> rseq ABI structure is by definition fixed-size, so there is no point in
>> having its size here.
>>
>> Florian, did I understand your request correctly, or am I missing your
>> point?
>
> In this case returning sizeof(conf) would serve the same purpose, wouldn't it?
If the size is received as input from user-space as well, it can be used to
make sure the kernel detects what size is expected by user-space and acts
accordingly.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 26, 2021, at 8:51 AM, Piotr Figiel fig...@google.com wrote:

[...]
> ---
> v2:
> Applied review comments:
> - changed return value from the ptrace request to the size of the
>   configuration structure
> - expanded configuration structure with the flags field and
>   the rseq abi structure size
>
[...]
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 rseq_abi_size;
> +	__u32 signature;
> +	__u32 flags;
> +	__u32 pad;
> +};
> +
[...]
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.rseq_abi_size = sizeof(*task->rseq),
> +		.signature = task->rseq_sig,
> +		.flags = 0,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return sizeof(conf);
> +}

I think what Florian was after would be:

struct ptrace_rseq_configuration {
	__u32 size;	/* size of struct ptrace_rseq_configuration */
	__u32 flags;
	__u64 rseq_abi_pointer;
	__u32 signature;
	__u32 pad;
};

where:

	.size = sizeof(struct ptrace_rseq_configuration),

This way, the configuration structure can be expanded in the future. The
rseq ABI structure is by definition fixed-size, so there is no point in
having its size here.

Florian, did I understand your request correctly, or am I missing your point?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- On Feb 24, 2021, at 1:14 PM, rostedt rost...@goodmis.org wrote: > On Wed, 24 Feb 2021 11:59:35 -0500 (EST) > Mathieu Desnoyers wrote: >> >> As a prototype solution, what I've done currently is to copy the user-space >> data into a kmalloc'd buffer in a preparation step before disabling >> preemption >> and copying data over into the per-cpu buffers. It works, but I think we >> should >> be able to do it without the needless copy. >> >> What I have in mind as an efficient solution (not implemented yet) for the >> LTTng >> kernel tracer goes as follows: >> >> #define COMMIT_LOCAL 0 >> #define COMMIT_REMOTE 1 >> >> - faultable probe is called from system call tracepoint [ >> preemption/blocking/migration is allowed ] >> - probe code calculate the length which needs to be reserved to store the >> event >> (e.g. user strlen), >> >> - preempt disable -> [ preemption/blocking/migration is not allowed from >> here ] >> - reserve_cpu = smp_processor_id() >> - reserve space in the ring buffer for reserve_cpu >> [ from that point on, we have _exclusive_ access to write into the >> ring buffer >> "slot" >> from any cpu until we commit. ] >> - preempt enable -> [ preemption/blocking/migration is allowed from here ] >> > > So basically the commit position here doesn't move until this task is > scheduled back in and the commit (remote or local) is updated. Indeed. > To put it in terms of the ftrace ring buffer, where we have both a commit > page and a commit index, and it only gets moved by the first one to start a > commit stack (that is, interrupts that interrupted a write will not > increment the commit). The tricky part for ftrace is its reliance on the fact that the concurrent users of the per-cpu ring buffer are all nested contexts. LTTng does not assume that and has been designed to be used both in kernel and user-space: lttng-modules and lttng-ust share a lot of ring buffer code. Therefore, LTTng's ring buffer supports preemption/migration of concurrent contexts. 
The fact that LTTng uses local-atomic-ops on its kernel ring buffers is just an optimization on an overall ring buffer design meant to allow preemption. > Now, I'm not sure how LTTng does it, but I could see issues for ftrace to > try to move the commit pointer (the pointer to the new commit page), as the > design is currently dependent on the fact that it can't happen while > commits are taken place. Indeed, what makes it easy for LTTng is because the ring buffer has been designed to support preemption/migration from the ground up. > Are the pages of the LTTng indexed by an array of pages? Yes, they are. Handling the initial page allocation and then the tracer copy of data to/from the ring buffer pages is the responsibility of the LTTng lib ring buffer "backend". The LTTng lib ring buffer backend is somewhat similar to a page table done in software, where the top level of the page table can be dynamically updated when doing flight recorder tracing. It is however completely separate from the space reservation/commit scheme which is handled by the lib ring buffer "frontend". The algorithm I described in my prior email is specifically targeted at the frontend layer, leaving the "backend" unchanged. For some reasons I suspect Ftrace ring buffer combined those two layers into a single algorithm, which may have its advantages, but seems to strengthen its dependency on only having nested contexts sharing a given per-cpu ring buffer. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: tasks-trace RCU: question about grace period forward progress
- On Feb 25, 2021, at 1:33 PM, paulmck paul...@kernel.org wrote: [...] > commit 581f79546b6be406a9c7280b2d3511b60821efe0 > Author: Paul E. McKenney > Date: Thu Feb 25 10:26:00 2021 -0800 > >rcu-tasks: Add block comment laying out RCU Tasks Trace design > >This commit adds a block comment that gives a high-level overview of >how RCU tasks trace grace periods progress. It also adds a note about >how exiting tasks are handles, plus it gives an overview of the memory handles -> handled >ordering. > >Reported-by: Peter Zijlstra >Reported-by: Mathieu Desnoyers >Signed-off-by: Paul E. McKenney > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h > index 17c8ebe..f818357 100644 > --- a/kernel/rcu/tasks.h > +++ b/kernel/rcu/tasks.h > @@ -726,6 +726,42 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); > // flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace > // readers can operate from idle, offline, and exception entry/exit in no > // way allows rcu_preempt and rcu_sched readers to also do so. > +// > +// The implementation uses rcu_tasks_wait_gp(), which relies on function > +// pointers in the rcu_tasks structure. The rcu_spawn_tasks_trace_kthread() > +// function sets these function pointers up so that rcu_tasks_wait_gp() > +// invokes these functions in this order: > +// > +// rcu_tasks_trace_pregp_step(): > +// Initialize the count of readers and block CPU-hotplug operations. > +// rcu_tasks_trace_pertask(), invoked on every non-idle task: > +// Initialize per-task state and attempt to identify an immediate > +// quiescent state for that task, or, failing that, attempt to set > +// that task's .need_qs flag so that that task's next outermost > +// rcu_read_unlock_trace() will report the quiescent state (in which > +// case the count of readers is incremented). If both attempts fail, > +// the task is added to a "holdout" list. 
> +// rcu_tasks_trace_postscan(): > +// Initialize state and attempt to identify an immediate quiescent > +// state as above (but only for idle tasks), unblock CPU-hotplug > +// operations, and wait for an RCU grace period to avoid races with > +// tasks that are in the process of exiting. > +// check_all_holdout_tasks_trace(), repeatedly until holdout list is empty: > +// Scans the holdout list, attempting to identify a quiescent state > +// for each task on the list. If there is a quiescent state, the > +// corresponding task is removed from the holdout list. > +// rcu_tasks_trace_postgp(): > +// Wait for the count of readers do drop to zero, reporting any stalls. > +// Also execute full memory barriers to maintain ordering with code > +// executing after the grace period. > +// > +// The exit_tasks_rcu_finish_trace() synchronizes with exiting tasks. > +// > +// Pre-grace-period update-side code is ordered before the grace > +// period via the ->cbs_lock and barriers in rcu_tasks_kthread(). > +// Pre-grace-period read-side code is ordered before the grace period by > +// atomic_dec_and_test() of the count of readers (for IPIed readers) and by > +// scheduler context-switch ordering (for locked-down non-running readers). The rest looks good, thanks! Mathieu > > // The lockdep state must be outside of #ifdef to be useful. > #ifdef CONFIG_DEBUG_LOCK_ALLOC -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: tasks-trace RCU: question about grace period forward progress
- On Feb 25, 2021, at 10:36 AM, paulmck paul...@kernel.org wrote: > On Thu, Feb 25, 2021 at 09:22:48AM -0500, Mathieu Desnoyers wrote: >> Hi Paul, >> >> Answering a question from Peter on IRC got me to look at >> rcu_read_lock_trace(), >> and I see this: >> >> static inline void rcu_read_lock_trace(void) >> { >> struct task_struct *t = current; >> >> WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + >> 1); >> barrier(); >> if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) && >> t->trc_reader_special.b.need_mb) >> smp_mb(); // Pairs with update-side barriers >> rcu_lock_acquire(_trace_lock_map); >> } >> >> static inline void rcu_read_unlock_trace(void) >> { >> int nesting; >> struct task_struct *t = current; >> >> rcu_lock_release(_trace_lock_map); >> nesting = READ_ONCE(t->trc_reader_nesting) - 1; >> barrier(); // Critical section before disabling. >> // Disable IPI-based setting of .need_qs. >> WRITE_ONCE(t->trc_reader_nesting, INT_MIN); >> if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) { >> WRITE_ONCE(t->trc_reader_nesting, nesting); >> return; // We assume shallow reader nesting. >> } >> rcu_read_unlock_trace_special(t, nesting); >> } >> >> AFAIU, each thread keeps track of whether it is nested within a RCU read-side >> critical >> section with a counter, and grace periods iterate over all threads to make >> sure >> they >> are not within a read-side critical section before they can complete: >> >> # define rcu_tasks_trace_qs(t) \ >> do {\ >> if (!likely(READ_ONCE((t)->trc_reader_checked)) && \ >> !unlikely(READ_ONCE((t)->trc_reader_nesting))) {\ >> smp_store_release(&(t)->trc_reader_checked, true); \ >> smp_mb(); /* Readers partitioned by store. 
*/ \ >> } \ >> } while (0) >> >> It reminds me of the liburcu urcu-mb flavor which also deals with per-thread >> state to track whether threads are nested within a critical section: >> >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90 >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125 >> >> static inline void _urcu_mb_read_lock_update(unsigned long tmp) >> { >> if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) { >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, >> _CMM_LOAD_SHARED(urcu_mb_gp.ctr)); >> cmm_smp_mb(); >> } else >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, tmp + >> URCU_GP_COUNT); >> } >> >> static inline void _urcu_mb_read_lock(void) >> { >> unsigned long tmp; >> >> urcu_assert(URCU_TLS(urcu_mb_reader).registered); >> cmm_barrier(); >> tmp = URCU_TLS(urcu_mb_reader).ctr; >> urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK); >> _urcu_mb_read_lock_update(tmp); >> } >> >> The main difference between the two algorithm is that task-trace within the >> kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is either >> incremented or flipped between 0 and 1 by the grace period. This allow RCU >> readers >> outermost nesting starting after the beginning of the grace period not to >> prevent >> progress of the grace period. >> >> Without this, a steady flow of incoming tasks-trace-RCU readers can prevent >> the >> grace period from ever completing. >> >> Or is this handled in a clever way that I am missing here ? > > There are several mechanisms designed to handle this. The following > paragraphs describe these at a high level. > > The trc_wait_for_one_reader() is invoked on each task. It uses the > try_invoke_on_locked_down_task(), which, if the task is currently not > running, keeps it that way and invokes trc_inspect_reader(). 
If the > locked-down task is in a read-side critical section, the need_qs field > is set, which will cause the task's next rcu_read_lock_trace() to report > the quiescent st
tasks-trace RCU: question about grace period forward progress
Hi Paul,

Answering a question from Peter on IRC got me to look at
rcu_read_lock_trace(), and I see this:

static inline void rcu_read_lock_trace(void)
{
	struct task_struct *t = current;

	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
	barrier();
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
	    t->trc_reader_special.b.need_mb)
		smp_mb(); // Pairs with update-side barriers
	rcu_lock_acquire(&rcu_trace_lock_map);
}

static inline void rcu_read_unlock_trace(void)
{
	int nesting;
	struct task_struct *t = current;

	rcu_lock_release(&rcu_trace_lock_map);
	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
	barrier(); // Critical section before disabling.
	// Disable IPI-based setting of .need_qs.
	WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
		WRITE_ONCE(t->trc_reader_nesting, nesting);
		return;  // We assume shallow reader nesting.
	}
	rcu_read_unlock_trace_special(t, nesting);
}

AFAIU, each thread keeps track of whether it is nested within an RCU
read-side critical section with a counter, and grace periods iterate over
all threads to make sure they are not within a read-side critical section
before they can complete:

# define rcu_tasks_trace_qs(t)						\
	do {								\
		if (!likely(READ_ONCE((t)->trc_reader_checked)) &&	\
		    !unlikely(READ_ONCE((t)->trc_reader_nesting))) {	\
			smp_store_release(&(t)->trc_reader_checked, true); \
			smp_mb(); /* Readers partitioned by store. 
*/ \
		}							\
	} while (0)

It reminds me of the liburcu urcu-mb flavor, which also deals with
per-thread state to track whether threads are nested within a critical
section:

https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90
https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125

static inline void _urcu_mb_read_lock_update(unsigned long tmp)
{
	if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) {
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
				  _CMM_LOAD_SHARED(urcu_mb_gp.ctr));
		cmm_smp_mb();
	} else
		_CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
				  tmp + URCU_GP_COUNT);
}

static inline void _urcu_mb_read_lock(void)
{
	unsigned long tmp;

	urcu_assert(URCU_TLS(urcu_mb_reader).registered);
	cmm_barrier();
	tmp = URCU_TLS(urcu_mb_reader).ctr;
	urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK);
	_urcu_mb_read_lock_update(tmp);
}

The main difference between the two algorithms is that tasks-trace within
the kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is
either incremented or flipped between 0 and 1 by the grace period. This
allows RCU readers whose outermost nesting starts after the beginning of
the grace period not to prevent progress of the grace period.

Without this, a steady flow of incoming tasks-trace-RCU readers can prevent
the grace period from ever completing.

Or is this handled in a clever way that I am missing here?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- Mathieu Desnoyers wrote:

> - On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjean...@efficios.com wrote:
>
>> [ Adding Mathieu Desnoyers in CC ]
>>
>> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>>> On Thu, 18 Feb 2021 17:21:19 -0500
>>> Michael Jeanson wrote:
>>>
>>>> This series only implements the tracepoint infrastructure required to
>>>> allow tracers to handle page faults. Modifying each tracer to handle
>>>> those page faults would be a next step after we all agree on this piece
>>>> of instrumentation infrastructure.
>>>
>>> I started taking a quick look at this, and came up with the question: how
>>> do you allow preemption when dealing with per-cpu buffers or storage to
>>> record the data?
>>>
>>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>>> this is the reason for the need to disable preemption. What's the solution
>>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>>> it have some kind of "per task" buffer that is being used to extract the
>>> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling
> preemption and copying data over into the per-cpu buffers. It works, but I
> think we should be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the
> LTTng kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint
>   [ preemption/blocking/migration is allowed ]
> - probe code calculates the length which needs to be reserved to store the
>   event (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   - reserve_cpu = smp_processor_id()
>   - reserve space in the ring buffer for reserve_cpu
>     [ from that point on, we have _exclusive_ access to write into the ring
>       buffer "slot" from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>   commit_cpu = smp_processor_id()
>   if (commit_cpu == reserve_cpu)
>     use local_add to increment the
>       buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
>   else
>     use atomic_add to increment the
>       buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]

The line above should increment reserve_cpu's buffer commit count, of course.

Thanks,

Mathieu

> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that lttng uses per-buffer/per-sub-buffer commit counters as simple
> free-running accumulators, the trick here is to use two commit counters
> rather than a single one for each sub-buffer. Whenever we need to read a
> commit count value, we always sum the total of the LOCAL and REMOTE counters.
>
> This allows dealing with migration between reserve and commit without
> requiring the overhead of an atomic operation on the fast-path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space
> use of restartable sequences. It looks like it may have a role to play in
> the kernel as well. :)
>
> Or am I missing something important that would not survive real-life ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)
- On Feb 24, 2021, at 11:22 AM, Michael Jeanson mjean...@efficios.com wrote:

> [ Adding Mathieu Desnoyers in CC ]
>
> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>> On Thu, 18 Feb 2021 17:21:19 -0500
>> Michael Jeanson wrote:
>>
>>> This series only implements the tracepoint infrastructure required to
>>> allow tracers to handle page faults. Modifying each tracer to handle
>>> those page faults would be a next step after we all agree on this piece
>>> of instrumentation infrastructure.
>>
>> I started taking a quick look at this, and came up with the question: how
>> do you allow preemption when dealing with per-cpu buffers or storage to
>> record the data?
>>
>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>> this is the reason for the need to disable preemption. What's the solution
>> that LTTng is using for this? I know it has per-cpu buffers too, but does
>> it have some kind of "per task" buffer that is being used to extract the
>> data that can fault?

As a prototype solution, what I've done currently is to copy the user-space
data into a kmalloc'd buffer in a preparation step before disabling preemption
and copying data over into the per-cpu buffers. It works, but I think we
should be able to do it without the needless copy.

What I have in mind as an efficient solution (not implemented yet) for the
LTTng kernel tracer goes as follows:

#define COMMIT_LOCAL 0
#define COMMIT_REMOTE 1

- faultable probe is called from system call tracepoint
  [ preemption/blocking/migration is allowed ]
- probe code calculates the length which needs to be reserved to store the
  event (e.g. user strlen),

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  - reserve_cpu = smp_processor_id()
  - reserve space in the ring buffer for reserve_cpu
    [ from that point on, we have _exclusive_ access to write into the ring
      buffer "slot" from any cpu until we commit. ]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

- copy data from user-space to the ring buffer "slot",

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  commit_cpu = smp_processor_id()
  if (commit_cpu == reserve_cpu)
    use local_add to increment the
      buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
  else
    use atomic_add to increment the
      buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

Given that lttng uses per-buffer/per-sub-buffer commit counters as simple
free-running accumulators, the trick here is to use two commit counters
rather than a single one for each sub-buffer. Whenever we need to read a
commit count value, we always sum the total of the LOCAL and REMOTE counters.

This allows dealing with migration between reserve and commit without
requiring the overhead of an atomic operation on the fast-path (LOCAL case).

I had to design this kind of dual-counter trick in the context of user-space
use of restartable sequences. It looks like it may have a role to play in the
kernel as well. :)

Or am I missing something important that would not survive real-life ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
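[Editor's note] The dual-counter scheme above can be sketched in a few lines of userspace C. The names `subbuffer`, `commit_record` and `commit_count` are made up for this sketch; the kernel version would use `local_add()` on per-CPU buffers rather than a plain add, and the atomic op would target the *reserving* CPU's buffer, per the correction in the follow-up mail:

```c
#include <stdatomic.h>

/* Two free-running commit accumulators per sub-buffer, as in the email:
 * LOCAL is bumped with a cheap plain add when the committing CPU is also
 * the reserving CPU (the fast path), REMOTE with an atomic add when a
 * migration happened between reserve and commit. */
struct subbuffer {
	unsigned long commit_local;   /* local_add() stand-in: owner CPU only */
	atomic_ulong  commit_remote;  /* atomic_add() stand-in: any CPU */
};

static void commit_record(struct subbuffer *sb, int reserve_cpu,
			  int commit_cpu, unsigned long len)
{
	if (commit_cpu == reserve_cpu)
		sb->commit_local += len;                    /* COMMIT_LOCAL */
	else
		atomic_fetch_add(&sb->commit_remote, len);  /* COMMIT_REMOTE */
}

/* Readers never look at one counter alone: the committed position is
 * always the sum of the LOCAL and REMOTE accumulators. */
static unsigned long commit_count(struct subbuffer *sb)
{
	return sb->commit_local + atomic_load(&sb->commit_remote);
}
```

Because both accumulators only ever grow and readers always sum them, it does not matter which of the two a given commit landed in; migration merely moves some increments from the cheap counter to the atomic one.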
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 22, 2021, at 5:04 AM, Piotr Figiel fig...@google.com wrote:

> For userspace checkpoint and restore (C/R) a way of getting process state
> containing RSEQ configuration is needed.
>
> There are two ways this information is going to be used:
> - to re-enable RSEQ for threads which had it enabled before C/R
> - to detect if a thread was in a critical section during C/R
>
> Since C/R preserves TLS memory and addresses, RSEQ ABI will be restored
> using the address registered before C/R.
>
> Detection whether the thread is in a critical section during C/R is needed
> to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
> before registers are dumped itself doesn't cause RSEQ abort.
> Restoring the instruction pointer within the critical section is
> problematic because rseq_cs may get cleared before the control is passed
> to the migrated application code leading to RSEQ invariants not being
> preserved. C/R code will use RSEQ ABI address to find the abort handler
> to which the instruction pointer needs to be set.
>
> To achieve above goals expose the RSEQ ABI address and the signature value
> with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
>
> This new ptrace request can also be used by debuggers so they are aware
> of stops within restartable sequences in progress.
>
> Signed-off-by: Piotr Figiel
> Reviewed-by: Michal Miroslaw
>
> ---
>  include/uapi/linux/ptrace.h | 8
>  kernel/ptrace.c             | 23 +++
>  2 files changed, 31 insertions(+)
>
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..d54cf6b6ce7c 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,14 @@ struct ptrace_syscall_info {
>  	};
>  };
>
> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
> +
> +struct ptrace_rseq_configuration {
> +	__u64 rseq_abi_pointer;
> +	__u32 signature;
> +	__u32 pad;
> +};

I notice that other structures defined in this UAPI header are not packed as
well. Should we add an attribute packed on new structures ? It seems like it
is generally a safer course of action, even though each field is naturally
aligned here (there is no padding/hole in the structure).

> +
>  /*
>   * These values are stored in task->ptrace_message
>   * by tracehook_report_syscall_* to describe the current syscall-stop.
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 61db50f7ca86..a936af66cf6f 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include	/* for syscall_get_* */
>
> @@ -779,6 +780,22 @@ static int ptrace_peek_siginfo(struct task_struct *child,
>  	return ret;
>  }
>
> +#ifdef CONFIG_RSEQ
> +static long ptrace_get_rseq_configuration(struct task_struct *task,
> +					  unsigned long size, void __user *data)
> +{
> +	struct ptrace_rseq_configuration conf = {
> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
> +		.signature = task->rseq_sig,
> +	};
> +
> +	size = min_t(unsigned long, size, sizeof(conf));
> +	if (copy_to_user(data, &conf, size))
> +		return -EFAULT;
> +	return size;

See other email about returning 0 here.

Thanks,

Mathieu

> +
>  	default:
>  		break;
>  	}
> --
> 2.30.0.617.g56c4b15f3c-goog

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request
- On Feb 22, 2021, at 6:57 AM, Dmitry V. Levin l...@altlinux.org wrote:

> On Mon, Feb 22, 2021 at 11:04:43AM +0100, Piotr Figiel wrote:
> [...]
>> --- a/include/uapi/linux/ptrace.h
>> +++ b/include/uapi/linux/ptrace.h
>> @@ -102,6 +102,14 @@ struct ptrace_syscall_info {
>>  	};
>>  };
>>
>> +#define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
>> +
>> +struct ptrace_rseq_configuration {
>> +	__u64 rseq_abi_pointer;
>> +	__u32 signature;
>> +	__u32 pad;
>> +};
>> +
>>  /*
>>   * These values are stored in task->ptrace_message
>>   * by tracehook_report_syscall_* to describe the current syscall-stop.
>> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
>> index 61db50f7ca86..a936af66cf6f 100644
>> --- a/kernel/ptrace.c
>> +++ b/kernel/ptrace.c
>> @@ -31,6 +31,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include	/* for syscall_get_* */
>>
>> @@ -779,6 +780,22 @@ static int ptrace_peek_siginfo(struct task_struct *child,
>>  	return ret;
>>  }
>>
>> +#ifdef CONFIG_RSEQ
>> +static long ptrace_get_rseq_configuration(struct task_struct *task,
>> +					  unsigned long size, void __user *data)
>> +{
>> +	struct ptrace_rseq_configuration conf = {
>> +		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
>> +		.signature = task->rseq_sig,
>> +	};
>> +
>> +	size = min_t(unsigned long, size, sizeof(conf));
>> +	if (copy_to_user(data, &conf, size))
>> +		return -EFAULT;
>> +	return size;
>> +}
>> +#endif
>
> From an API perspective I suggest for such interfaces to return the amount
> of data that could have been written if there was enough room specified,
> e.g. in this case it's sizeof(conf) instead of size.

Looking at the ptrace(2) man page:

RETURN VALUE
       On success, the PTRACE_PEEK* requests return the requested data (but
       see NOTES), the PTRACE_SECCOMP_GET_FILTER request returns the number
       of instructions in the BPF program, and other requests return zero.

       On error, all requests return -1, and errno is set appropriately.
       Since the value returned by a successful PTRACE_PEEK* request may be
       -1, the caller must clear errno before the call, and then check it
       afterward to determine whether or not an error occurred.

It looks like the usual behavior for ptrace requests would be to return 0
when everything is OK. Unless there is a strong motivation for doing
differently for this new request, I would be tempted to use the same expected
behavior as other requests on success: return 0.

Unless there is a strong motivation for returning either size or
sizeof(conf) ? If we return sizeof(conf) to user-space, it means it should
check it and deal with the size mismatch. Is that size ever expected to
change ?

Thanks,

Mathieu

> --
> ldv

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
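[Editor's note] The trade-off Dmitry is suggesting can be illustrated generically: if the kernel copies min(user size, kernel size) bytes but returns its *full* struct size, an old caller with a short buffer can detect that newer fields exist. Everything below (`conf_v2`, `fill_conf`, the field values) is invented to show the pattern; it is not the ptrace code:

```c
#include <string.h>

/* A v2 of some hypothetical configuration struct: a newer kernel knows
 * about more fields than an old userspace caller does. */
struct conf_v2 {
	unsigned long abi_pointer;
	unsigned long flags;   /* field an old caller does not know about */
};

/* Copy what fits, but return the full kernel-side size (the suggested
 * convention), rather than the truncated byte count or plain 0. */
static long fill_conf(void *dst, unsigned long user_size)
{
	struct conf_v2 conf = { .abi_pointer = 0x1000, .flags = 1 };
	unsigned long n = user_size < sizeof(conf) ? user_size : sizeof(conf);

	memcpy(dst, &conf, n);
	return (long)sizeof(conf);
}
```

The cost, as Mathieu notes, is that every caller now has to compare the return value against its own notion of the struct size; returning 0 on success keeps the ABI simpler if the size is never expected to change.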
lttng-trace: a new strace-alike LTTng command
Hi there!

I just pushed a new command line tool named `lttng-trace`, which allows
tracing a specific command and its sub-processes using the LTTng system call
tracer. It focuses on simplicity. It can be found here:

Git website: http://git.lttng.org/?p=lttng-trace.git
Clone with:  git clone https://git.lttng.org/lttng-trace.git

For instance, tracing the command "date" is as simple as:

  lttng-trace date

The tracing session can be optionally configured by using the usual lttng
control interfaces after this message appears:

  [lttng-trace] Tracing session `date-20210218-170545` created. It can be
  customized using the `lttng` command.
  [lttng-trace] Press key when ready to run the child process.

After execution of the command, the following message details how to view the
trace:

  [lttng-trace] Sub-process hierarchy traced successfully. View trace with
  `babeltrace2 /tmp/lttng-trace/date-20210218-170906`

See `lttng-trace --help` for more options.

This new command is in early alpha stage. Feedback is welcome! For instance,
we are wondering whether the default behavior should just be tracing system
calls, or if some level of user-space tracing should be enabled by default as
well. Or if some new options should enable "typical" usage scenarios.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com