[PATCH v2] x86/paravirt: Disable virt spinlock on bare metal

2024-05-25 Thread Chen Yu
The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which
decrease the performance on bare metal.

Fix this by disabling virt_spin_lock_key if it is on bare metal,
regardless of CONFIG_PARAVIRT_SPINLOCKS.

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() 
function warning")
Suggested-by: Dave Hansen 
Suggested-by: Qiuxu Zhuo 
Reported-by: Prem Nath Dey 
Reported-by: Xiaoping Zhou 
Reviewed-by: Juergen Gross 
Signed-off-by: Chen Yu 
---
v1->v2:
  Refine the commit log per Dave's suggestion.
  Simplify the fix by directly disabling the virt_spin_lock_key on bare metal.
  Collect Reviewed-by from Juergen.
---
 arch/x86/kernel/paravirt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..c193c9e60a1b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,8 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
-   !boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
static_branch_disable(_spin_lock_key);
 }
 
-- 
2.25.1




Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-05-25 Thread Dave Chinner
On Fri, May 24, 2024 at 09:55:48AM +0200, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 02:47, John Groves  wrote:
> 
> > Apologies, but I'm short on time at the moment - going into a long holiday
> > weekend in the US with family plans. I should be focused again by middle of
> > next week.
> 
> NP.
> 
> Obviously I'll need to test it before anything is merged, other than
> that this is not urgent at all...
> 
> > But can you check /proc/cmdline to see of the memmap arg got through without
> > getting mangled? The '$' tends to get fubar'd. You might need \$, or I've 
> > seen
> > the need for \\\$. If it's un-mangled, there should be a dax device.
> 
> /proc/cmdline shows the option correctly:
> 
> root@kvm:~# cat /proc/cmdline
> root=/dev/vda console=hvc0 memmap=4G$4G
> 
> > If that doesn't work, it's worth trying '!' instead, which I think would 
> > give
> > you a pmem device - if the arg gets through (but ! is less likely to get
> > horked). That pmem device can be converted to devdax...
> 
> That doesn't work either.  No device created in /dev  (dax or pmem).

I think you need to do some ndctl magic to get the memory to be
namespaced correctly for the correct devices to appear.

https://docs.pmem.io/ndctl-user-guide/managing-namespaces

IIRC, need to set the type to pmem and the mode to fsdax, devdax or
raw to get the relevant device nodes to be created for the range..

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com



[PATCH] tracing/probes: fix error check in parse_btf_field()

2024-05-25 Thread Carlos López
btf_find_struct_member() might return NULL or an error via the
ERR_PTR() macro. However, its caller in parse_btf_field() only checks
for the NULL condition. Fix this by using IS_ERR() and returning the
error up the stack.

Fixes: c440adfbe3025 ("tracing/probes: Support BTF based data structure field 
access")
Signed-off-by: Carlos López 
---
 kernel/trace/trace_probe.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 5e263c141574..5417e9712157 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -554,6 +554,8 @@ static int parse_btf_field(char *fieldname, const struct 
btf_type *type,
anon_offs = 0;
field = btf_find_struct_member(ctx->btf, type, 
fieldname,
   _offs);
+   if (IS_ERR(field))
+   return PTR_ERR(field);
if (!field) {
trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
return -ENOENT;
-- 
2.35.3




Re: [PATCH RFC 1/2] dt-bindings: soc: qcom,smsm: Allow specifying mboxes instead of qcom,ipc

2024-05-25 Thread Krzysztof Kozlowski
On 24/05/2024 19:55, Luca Weiss wrote:
> On Donnerstag, 23. Mai 2024 08:19:11 MESZ Krzysztof Kozlowski wrote:
>> On 23/05/2024 08:16, Luca Weiss wrote:
>>> On Donnerstag, 23. Mai 2024 08:02:13 MESZ Krzysztof Kozlowski wrote:
 On 22/05/2024 19:34, Luca Weiss wrote:
> On Mittwoch, 22. Mai 2024 08:49:43 MESZ Krzysztof Kozlowski wrote:
>> On 21/05/2024 22:35, Luca Weiss wrote:
>>> On Dienstag, 21. Mai 2024 10:58:07 MESZ Krzysztof Kozlowski wrote:
 On 20/05/2024 17:11, Luca Weiss wrote:
> Hi Krzysztof
>
> Ack, sounds good.
>
> Maybe also from you, any opinion between these two binding styles?
>
> So first using index of mboxes for the numbering, where for the known
> usages the first element (and sometimes the 3rd - ipc-2) are empty <>.
>
> The second variant is using mbox-names to get the correct channel-mbox
> mapping.
>
> -   qcom,ipc-1 = < 8 13>;
> -   qcom,ipc-2 = < 8 9>;
> -   qcom,ipc-3 = < 8 19>;
> +   mboxes = <0>, < 13>, < 9>, < 19>;
>
> vs.
>
> -   qcom,ipc-1 = < 8 13>;
> -   qcom,ipc-2 = < 8 9>;
> -   qcom,ipc-3 = < 8 19>;
> +   mboxes = < 13>, < 9>, < 19>;
> +   mbox-names = "ipc-1", "ipc-2", "ipc-3";

 Sorry, don't get, ipc-1 is the first mailbox, so why would there be <0>
 in first case?
>>>
>>> Actually not, ipc-0 would be permissible by the driver, used for the 
>>> 0th host
>>>
>>> e.g. from:
>>>
>>> /* Iterate over all hosts to check whom wants a kick */
>>> for (host = 0; host < smsm->num_hosts; host++) {
>>> hostp = >hosts[host];
>>>
>>> Even though no mailbox is specified in any upstream dts for this 0th 
>>> host I
>>> didn't want the bindings to restrict that, that's why in the first 
>>> example
>>> there's an empty element (<0>) for the 0th smsm host
>>>
 Anyway, the question is if you need to know that some
 mailbox is missing. But then it is weird to name them "ipc-1" etc.
>>>
>>> In either case we'd just query the mbox (either by name or index) and 
>>> then
>>> see if it's there? Not quite sure I understand the sentence..
>>> Pretty sure either binding would work the same way.
>>
>> The question is: does the driver care only about having some mailboxes
>> or the driver cares about each specific mailbox? IOW, is skipping ipc-0
>> important for the driver?
>
> There's nothing special from driver side about any mailbox. Some SoCs have
> a mailbox for e.g. hosts 1&2&3, some have only 1&3, and apq8064 even has
> 1&2&3&4.
>
> And if the driver doesn't find a mailbox for a host, it just ignores it
> but then of course it can't 'ring' the mailbox for that host when 
> necessary.
>
> Not sure how much more I can add here, to be fair I barely understand what
> this driver is doing myself apart from the obvious.

 From what you said, it looks like it is enough to just list mailboxes,
 e.g. for ipc-1, ipc-2 and ipc-4 (so no ipc-0 and ipc-3):
>>>
>>> No, for sure we need also the possibility to list ipc-3.
>>
>> ? You can list it, what's the problem>
> 
> Maybe we're talking past each other...
> 
> You asked why this wouldn't work:
> 
>   e.g. for ipc-1, ipc-2 and ipc-4 (so no ipc-0 and ipc-3):
>   mboxes = < 13>, < 9>, < 19>;
> 
> How would we know that the 3rd mailbox ( 19) is for the 4th host
> (previous ipc-4)?
> 
> 1. If we use mboxes with indexes we'd need to have <0> values for
> "smsm hosts" where we don't have a mailbox for - this is at least
> for the 2nd smsm host (qcom,ipc-2) for a bunch of SoCs.
> 
> 2. If we use mboxes with mbox-names then we could skip that since we
> can directly specify which "smsm host" a given mailbox is for.
> 
> My only question really is whether 1. or 2. is a better idea.
> 
> Is this clearer now or still not?


So again, does the driver care about missing entry? If so, why?

Best regards,
Krzysztof




Re: [PATCH v10 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-05-25 Thread Google
On Fri, 24 May 2024 18:41:56 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:00 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Steven Rostedt (VMware) (15):
> >   function_graph: Convert ret_stack to a series of longs
> >   fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible 
> > by long
> >   function_graph: Add an array structure that will allow multiple 
> > callbacks
> >   function_graph: Allow multiple users to attach to function graph
> >   function_graph: Remove logic around ftrace_graph_entry and return
> >   ftrace/function_graph: Pass fgraph_ops to function graph callbacks
> >   ftrace: Allow function_graph tracer to be enabled in instances
> >   ftrace: Allow ftrace startup flags exist without dynamic ftrace
> >   function_graph: Have the instances use their own ftrace_ops for 
> > filtering
> >   function_graph: Add "task variables" per task for fgraph_ops
> >   function_graph: Move set_graph_function tests to shadow stack global 
> > var
> >   function_graph: Move graph depth stored data to shadow stack global 
> > var
> >   function_graph: Move graph notrace bit to shadow stack global var
> >   function_graph: Implement fgraph_reserve_data() and 
> > fgraph_retrieve_data()
> >   function_graph: Add selftest for passing local variables
> 
> Hi Masami,
> 
> While reviewing these patches, I realized there's several things I dislike
> about the patches I wrote. So I took these patches and started cleaning
> them up a little. Mostly renaming functions and adding comments.

Thanks for cleaning up the patches!!

> 
> As this is a major change to the function graph tracer, and I feel nervous
> about building something on top of this, how about I take over these
> patches and push them out for the next merge window. I'm hoping to get them
> into linux-next by v6.10-rc2 (I spent the day working on them, and it's
> mostly minor tweaks).

OK.

> Then I can push it out to 6.11 and get some good testing against it. Then
> we can add your stuff on top and get that merged in 6.12.

Yeah, it is reasonable plan. I also concerns about the stability. Especially,
this involves fprobe side changes too. If we introduce both at once, it may
mess up many things.

> 
> If all goes well, I'm hoping to get a series on just these patches (and
> your selftest addition) by tonight.
> 
> Thoughts?

I agree with you.

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 07/36] function_graph: Allow multiple users to attach to function graph

2024-05-25 Thread Google
On Fri, 24 May 2024 21:32:08 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:09:22 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > @@ -109,6 +244,21 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > if (!current->ret_stack)
> > return -EBUSY;
> >  
> > +   /*
> > +* At first, check whether the previous fgraph callback is pushed by
> > +* the fgraph on the same function entry.
> > +* But if @func is the self tail-call function, we also need to ensure
> > +* the ret_stack is not for the previous call by checking whether the
> > +* bit of @fgraph_idx is set or not.
> > +*/
> > +   ret_stack = get_ret_stack(current, current->curr_ret_stack, );
> > +   if (ret_stack && ret_stack->func == func &&
> > +   get_fgraph_type(current, offset + FGRAPH_FRAME_OFFSET) == 
> > FGRAPH_TYPE_BITMAP &&
> > +   !is_fgraph_index_set(current, offset + FGRAPH_FRAME_OFFSET, 
> > fgraph_idx))
> > +   return offset + FGRAPH_FRAME_OFFSET;
> > +
> > +   val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_FRAME_OFFSET;
> > +
> > BUILD_BUG_ON(SHADOW_STACK_SIZE % sizeof(long));
> 
> I'm trying to figure out what the above is trying to do. This gets called
> once in function_graph_enter() (or function_graph_enter_ops()). What
> exactly are you trying to catch here?

Aah, good catch! This was originally for catching the self tail-call case with
multiple fgraph callback on the same function, but it was my misread.
In later patch ([12/36]), we introduced function_graph_enter_ops() so that
we can skip checking hash table and directly pass the fgraph_ops to user
callback. I thought this function_graph_enter_ops() is used even if multiple
fgraph is set on the same function. In this case, we always need to check the
stack can be reused(pushed by other fgraph_ops on the same function) or not.
But as we discussed, the function_graph_enter_ops() is used only when only
one fgraph is set on the function (if there are multiple fgraphs are set on
the same function, use function_graph_enter() ), we are sure that 
ftrace_push_return_trace() is called only once on hooking the function entry.
Thus we don't need to reuse it.

> 
> Is it from this email:
> 
>   
> https://lore.kernel.org/all/20231110105154.df937bf9f200a0c16806c...@kernel.org/
> 
> As that's the last version before you added the above code.
> 
> But you also noticed it may not be needed, but triggered a crash without it
> in v3:
> 
>   
> https://lore.kernel.org/all/20231205234511.3839128259dfec153ea7d...@kernel.org/
> 
> I removed this code in my version and it runs just fine. Perhaps there was
> another bug that this was hiding that you fixed in later versions?

No problem. I think we can remove this block safely.

Thank you,

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-25 Thread Chen Yu
On 2024-05-23 at 09:30:59 -0700, Dave Hansen wrote:
> On 5/16/24 06:02, Chen Yu wrote:
> > Performance drop is reported when running encode/decode workload and
> > BenchSEE cache sub-workload.
> > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> > is disabled the virt_spin_lock_key is set to true on bare-metal.
> > The qspinlock degenerates to test-and-set spinlock, which decrease the
> > performance on bare-metal.
> > 
> > Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
> > is not set, or it is on bare-metal.
> 
> This is missing some background:
> 
> The kernel can change spinlock behavior when running as a guest.  But
> this guest-friendly behavior causes performance problems on bare metal.
> So there's a 'virt_spin_lock_key' static key to switch between the two
> modes.
> 
> The static key is always enabled by default (run in guest mode) and
> should be disabled for bare metal (and in some guests that want native
> behavior).
> 
> ... then describe the regression and the fix
>
Thanks Juergen for your review.

And thanks Dave for the write up, I'll refine the log according to your 
suggestion. 

> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index 5358d43886ad..ee51c0949ed8 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> >  
> >  void __init native_pv_lock_init(void)
> >  {
> > -   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
> > +   if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
> > !boot_cpu_has(X86_FEATURE_HYPERVISOR))
> > static_branch_disable(_spin_lock_key);
> >  }
> This gets used at a single site:
> 
> if (pv_enabled())
> goto pv_queue;
> 
> if (virt_spin_lock(lock))
> return;
> 
> which is logically:
> 
>   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS))
>   goto ...; // don't look at virt_spin_lock_key
> 
>   if (virt_spin_lock_key)
>   return; // On virt, but non-paravirt.  Did Test-and-Set
>   // spinlock.
>

Thanks for the description in detail, my original change might break the
"X86_FEATURE_HYPERVISOR + NO_CONFIG_PARAVIRT_SPINLOCKS " case that, the guest
can not fall into test-and-set.
 
> So I _think_ Arnd was trying to optimize native_pv_lock_init() away when
> it's going to get skipped over anyway by the 'goto'.
> 
> But this took me at least 30 minutes of scratching my head and trying to
> untangle the whole thing.  It's all far too subtle for my taste, and all
> of that to save a few bytes of init text in a configuration that's
> probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n).
> 
> Let's just keep it simple.  How about the attached patch?

Yes, this one works, I'll refine it.

thanks,
Chenyu