Re: [PATCH v8 0/3] ARM: Implement MODULE_PLT support in FTRACE

2021-04-19 Thread Qais Yousef
On 04/19/21 14:54, Florian Fainelli wrote:
> 
> 
> On 4/12/2021 4:08 AM, Qais Yousef wrote:
> > Hi Alexander
> > 
> > Fixing Ard's email as the Linaro one keeps bouncing back. Please fix that in
> > your future postings.
> > 
> > On 04/12/21 08:28, Alexander Sverdlin wrote:
> >> Hi!
> >>
> >> On 09/04/2021 17:33, Qais Yousef wrote:
> >>> I still think the ifdefery in patch 3 is ugly. Any reason my suggestion 
> >>> didn't
> >>> work out for you? I struggle to see how this is better and why it was 
> >>> hard to
> >>> incorporate my suggestion.
> >>>
> >>> For example
> >>>
> >>>   -   old = ftrace_call_replace(ip, adjust_address(rec, addr));
> >>>   +#ifdef CONFIG_ARM_MODULE_PLTS
> >>>   +   /* mod is only supplied during module loading */
> >>>   +   if (!mod)
> >>>   +   mod = rec->arch.mod;
> >>>   +   else
> >>>   +   rec->arch.mod = mod;
> >>>   +#endif
> >>>   +
> >>>   +   old = ftrace_call_replace(ip, aaddr,
> >>>   + !IS_ENABLED(CONFIG_ARM_MODULE_PLTS) 
> >>> || !mod);
> >>>   +#ifdef CONFIG_ARM_MODULE_PLTS
> >>>   +   if (!old && mod) {
> >>>   +   aaddr = get_module_plt(mod, ip, aaddr);
> >>>   +   old = ftrace_call_replace(ip, aaddr, true);
> >>>   +   }
> >>>   +#endif
> >>>   +
> >>>
> >>> There's an ifdef, followed by a code that embeds
> >>> !IS_ENABLED(CONFIG_ARM_MODULE_PLTS) followed by another ifdef :-/
> >>
> >> No, it's actually two small ifdefed blocks added before and after an 
> >> original call,
> >> which parameters have been modified as well. The issue with arch.mod was 
> >> explained
> >> by Steven Rostedt, maybe you've missed his email.
> > 
> > If you're referring to arch.mod having to be protected by the ifdef I did
> > address that. Please look at my patch.
> > 
> > My comment here refers to the ugliness of this ifdefery. Introducing 2 
> > simple
> > wrapper functions would address that as I've demonstrated in my
> > suggestion/patch.
> 
> What is the plan to move forward with this patch series, should v8 be
> submitted into RMK's patch tracker and improved upon from there, or do
> you feel like your suggestion needs to be addressed right away?

There's no objection from my side to submitting this and improving it later.

Thanks

--
Qais Yousef


Re: [PATCH v2 1/1] kernel/cpu: to track which CPUHP callback is failed

2021-04-13 Thread Qais Yousef
On 04/08/21 22:53, Dongli Zhang wrote:
> During bootup or cpu hotplug, the cpuhp_up_callbacks() or
> cpuhp_down_callbacks() call many CPUHP callbacks (e.g., perf, mm,
> workqueue, RCU, kvmclock and more) for each cpu to online/offline. It may
> roll back to its previous state if any of callbacks is failed. As a result,
> the user will not be able to know which callback is failed and usually the
> only symptom is cpu online/offline failure.
> 
> This patch is to print more debug log to help user narrow down where is the
> root cause.
> 
> Below is the example that how the patch helps narrow down the root cause
> for the issue fixed by commit d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64
> can't be online/hotpluged").
> 
> We will have below dynamic debug log once we add
> dyndbg="file kernel/cpu.c +p" to kernel command line and when issue is
> reproduced.

You can also enable it at runtime

echo "file kernel/cpu.c +p" > /sys/kernel/debug/dynamic_debug/control

> 
> "CPUHP up callback failure (-12) for cpu 64 at kvmclock:setup_percpu (66)"
> 
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---

I don't see the harm in adding the debug if some find it useful.

FWIW

Reviewed-by: Qais Yousef 

Cheers

--
Qais Yousef

> Changed since v1 RFC:
>   - use pr_debug() but not pr_err_once() (suggested by Qais Yousef)
>   - print log for cpuhp_down_callbacks() as well (suggested by Qais Yousef)
> 
>  kernel/cpu.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 1b6302ecbabe..bcd4dd7de9c3 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -621,6 +621,10 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct 
> cpuhp_cpu_state *st,
>   st->state++;
>   ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
>   if (ret) {
> + pr_debug("CPUHP up callback failure (%d) for cpu %u at 
> %s (%d)\n",
> +  ret, cpu, cpuhp_get_step(st->state)->name,
> +  st->state);
> +
>   if (can_rollback_cpu(st)) {
>   st->target = prev_state;
>   undo_cpu_up(cpu, st);
> @@ -990,6 +994,10 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct 
> cpuhp_cpu_state *st,
>   for (; st->state > target; st->state--) {
>   ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
>   if (ret) {
> + pr_debug("CPUHP down callback failure (%d) for cpu %u 
> at %s (%d)\n",
> +  ret, cpu, cpuhp_get_step(st->state)->name,
> +  st->state);
> +
>   st->target = prev_state;
>   if (st->state < prev_state)
>   undo_cpu_down(cpu, st);
> -- 
> 2.17.1
> 


Re: [PATCH 2/3] cpumask: Introduce DYING mask

2021-04-12 Thread Qais Yousef
On 04/12/21 12:55, Peter Zijlstra wrote:
> On Sun, Mar 21, 2021 at 07:30:37PM +0000, Qais Yousef wrote:
> > On 03/10/21 15:53, Peter Zijlstra wrote:
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -160,6 +160,9 @@ static int cpuhp_invoke_callback(unsigne
> > >   int (*cb)(unsigned int cpu);
> > >   int ret, cnt;
> > >  
> > > + if (bringup != !cpu_dying(cpu))
> > 
> > nit: this condition is hard to read
> > 
> > > + set_cpu_dying(cpu, !bringup);
> 
> How's:
> 
>   if (cpu_dying(cpu) != !bringup)
>   set_cpu_dying(cpu, !bringup);

Slightly better I suppose :)

> 
> > since cpu_dying() will do cpumask_test_cpu(), are we saving  much if we
> > unconditionally call set_cpu_dying(cpu, !bringup) which performs
> > cpumask_{set, clear}_cpu()?
> 
> This is hotplug, it's all slow, endlessly rewriting that bit shouldn't
> be a problem I suppose.

True. Besides, I doubt there's a performance hit really; cpu_dying() will read
the bit in the cpumask anyway, so unconditionally writing will be just as fast
since both will fetch the cacheline anyway?

Regardless, not really a big deal. It's not really the hardest thing to stare
at here ;-)

Thanks

--
Qais Yousef


Re: [PATCH v8 0/3] ARM: Implement MODULE_PLT support in FTRACE

2021-04-12 Thread Qais Yousef
Hi Alexander

Fixing Ard's email as the Linaro one keeps bouncing back. Please fix that in
your future postings.

On 04/12/21 08:28, Alexander Sverdlin wrote:
> Hi!
> 
> On 09/04/2021 17:33, Qais Yousef wrote:
> > I still think the ifdefery in patch 3 is ugly. Any reason my suggestion 
> > didn't
> > work out for you? I struggle to see how this is better and why it was hard 
> > to
> > incorporate my suggestion.
> > 
> > For example
> > 
> > -   old = ftrace_call_replace(ip, adjust_address(rec, addr));
> > +#ifdef CONFIG_ARM_MODULE_PLTS
> > +   /* mod is only supplied during module loading */
> > +   if (!mod)
> > +   mod = rec->arch.mod;
> > +   else
> > +   rec->arch.mod = mod;
> > +#endif
> > +
> > +   old = ftrace_call_replace(ip, aaddr,
> > + !IS_ENABLED(CONFIG_ARM_MODULE_PLTS) 
> > || !mod);
> > +#ifdef CONFIG_ARM_MODULE_PLTS
> > +   if (!old && mod) {
> > +   aaddr = get_module_plt(mod, ip, aaddr);
> > +   old = ftrace_call_replace(ip, aaddr, true);
> > +   }
> > +#endif
> > +
> > 
> > There's an ifdef, followed by a code that embeds
> > !IS_ENABLED(CONFIG_ARM_MODULE_PLTS) followed by another ifdef :-/
> 
> No, it's actually two small ifdefed blocks added before and after an original 
> call,
> which parameters have been modified as well. The issue with arch.mod was 
> explained
> by Steven Rostedt, maybe you've missed his email.

If you're referring to arch.mod having to be protected by the ifdef I did
address that. Please look at my patch.

My comment here refers to the ugliness of this ifdefery. Introducing 2 simple
wrapper functions would address that as I've demonstrated in my
suggestion/patch.

Thanks

--
Qais Yousef


Re: [PATCH v8 0/3] ARM: Implement MODULE_PLT support in FTRACE

2021-04-09 Thread Qais Yousef
Adding Linus Walleij back to CC

On 03/30/21 13:40, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin 
> 
> FTRACE's function tracer currently doesn't always work on ARM with
> MODULE_PLT option enabled. If the module is loaded too far, FTRACE's
> code modifier cannot cope with introduced veneers and turns the
> function tracer off globally.
> 
> ARM64 already has a solution for the problem, refer to the following
> patches:
> 
> arm64: ftrace: emit ftrace-mod.o contents through code
> arm64: module-plts: factor out PLT generation code for ftrace
> arm64: ftrace: fix !CONFIG_ARM64_MODULE_PLTS kernels
> arm64: ftrace: fix building without CONFIG_MODULES
> arm64: ftrace: add support for far branches to dynamic ftrace
> arm64: ftrace: don't validate branch via PLT in ftrace_make_nop()
> 
> But the presented ARM variant has just a half of the footprint in terms of
> the changed LoCs. It also retains the code validation-before-modification
> instead of switching it off.
> 
> Changelog:
> v8:
> * Add warn suppress parameter to arm_gen_branch_link()

I still think the ifdefery in patch 3 is ugly. Any reason my suggestion didn't
work out for you? I struggle to see how this is better and why it was hard to
incorporate my suggestion.

For example

-   old = ftrace_call_replace(ip, adjust_address(rec, addr));
+#ifdef CONFIG_ARM_MODULE_PLTS
+   /* mod is only supplied during module loading */
+   if (!mod)
+   mod = rec->arch.mod;
+   else
+   rec->arch.mod = mod;
+#endif
+
+   old = ftrace_call_replace(ip, aaddr,
+ !IS_ENABLED(CONFIG_ARM_MODULE_PLTS) 
|| !mod);
+#ifdef CONFIG_ARM_MODULE_PLTS
+   if (!old && mod) {
+   aaddr = get_module_plt(mod, ip, aaddr);
+   old = ftrace_call_replace(ip, aaddr, true);
+   }
+#endif
+

There's an ifdef, followed by code that embeds
!IS_ENABLED(CONFIG_ARM_MODULE_PLTS), followed by another ifdef :-/

And there was no need to make the new warn arg visible all the way to
ftrace_call_replace() and all of its users.

FWIW

Tested-by: Qais Yousef 

If this gets accepted as-is, I'll send a patch to improve on this.


Thanks

--
Qais Yousef


Re: [PATCH RFC 1/1] kernel/cpu: to track which CPUHP callback is failed

2021-04-06 Thread Qais Yousef
On 04/05/21 14:22, Dongli Zhang wrote:
> May I have if there is any feedback on this? To pr_err_once() here helps 
> narrow
> down what is the root cause of cpu online failure.
> 
> 
> The issue fixed by d7eb79c6290c ("KVM: kvmclock: Fix vCPUs > 64 can't be
> online/hotpluged") is able to demonstrate how this pr_err_once() helps.
> 
> Due to VM kernel issue, at most 64 VCPUs can online if host clocksource is
> switched to hpet.
> 
> # echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> 
> By default, we have no idea why only 64 out of 100 VCPUs are online in VM. We
> need to customize kernel to debug why some CPUs are not able to online.
> 
> 
> We will have below and know the root cause is with kvmclock, if we add the
> pr_err_once().

Isn't pr_debug() more appropriate here? Don't you want one on the
cpuhp_down_callbacks() too?

I *think* it's common now to have CONFIG_DYNAMIC_DEBUG enabled so one can
enable that line when they need to


https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

I can see the appeal of pr_err_once() but I think this can fail for legitimate
reasons and is not necessarily strictly always an error?

Thanks

--
Qais Yousef


Re: [PATCH v3] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-04-01 Thread Qais Yousef
On 03/27/21 22:01, Thomas Gleixner wrote:
> And while you carefully reworded the comment, did you actually read what
> it said and what is says now?
> 
> > -* cpu_down() which takes cpu maps lock. cpu maps lock
> > -* needs to be held as this might race against in kernel
> > -* abusers of the hotplug machinery (thermal management).
> 
> vs.
> 
> > +* cpu_down() which takes cpu maps lock. cpu maps lock
> > +* needed to be held as this might race against in-kernel
> > +* abusers of the hotplug machinery (thermal management).
> 
> The big fat hint is: "cpu maps lock needs to be held as this " and
> it still needs to be held for the above loop to work correctly at
> all. See also below.
> 
> So just moving comments blindly around and making them past tense is not
> cutting it. Quite the contrary the comments make no sense anymore. They
> became uncomprehensible word salad.
> 
> Now for the second part of that comment:
> 
> > +  *    This is
> > +* called under the sysfs hotplug lock, so it is properly
> > +* serialized against the regular offline usage.
> 
> So there are two layers of protection:
> 
>    cpu_maps_lock and sysfs hotplug lock
> 
> One protects surprisingly against concurrent sysfs writes and the other
> is required to serialize in kernel usage.
> 
> Now lets look at the original protection:
> 
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>          dev->offline = new_state
>          uevent()
>      unlock(cpu_maps)
>    unlock(sysfs)
> 
> and the one you created:
> 
>    lock(sysfs)
>      lock(cpu_maps)
>        hotplug
>      unlock(cpu_maps)
>      dev->offline = new_state
>      uevent()
>    unlock(sysfs)
> 
> Where is that protection scope change mentioned in the change log and
> where is the analysis that it is correct?

The comment does still need updating though. Its reference to thermal management
is cryptic. I couldn't see who performs hotplug operations in thermal code.

I also think generally that comment is no longer valid after the refactor I did
to 'prevent' in-kernel users from calling cpu_up/down() directly and force them
all to go via device_offline/online(), which is wrapped by add/remove_cpu():

33c3736ec888 cpu/hotplug: Hide cpu_up/down()

So I think except for suspend/resume/hibernate/kexec, all in-kernel users
should be serialized by lock(hotplug) now. Since uevents() don't matter for
suspend/resume/hibernate/kexec, I think moving it outside of lock(cpu_maps) is
fine.

So IIUC in the past we had the race

userspace                               in-kernel users

lock(hotplug)
cpuhp_smt_disable()
  lock(cpu_maps)                        cpu_down()
                                          lock(cpu_maps)

So they were serialized by cpu_maps lock. But that should not happen now since
in-kernel (except for the aforementioned) users should use remove_cpu().

userspace                               in-kernel users

lock(hotplug)
cpuhp_smt_disable()
  lock(cpu_maps)                        remove_cpu()
                                          lock(hotplug)
                                          device_offline()
                                            cpu_down()
                                              lock(cpu_maps)

Which forces the serialization at lock(hotplug), which is what
lock_device_hotplug_sysfs() actually tries to hold.

So I think that race condition should not happen now. Or did I miss something?

The only loophole is that cpu_device_up() could be abused if not caught by
reviewers. I didn't find a way to warn/error if someone other than
cpu_subsys_online() uses it. We rely on a comment explaining it..

I think cpuhp_smt_disable/enable can safely call device_offline/online now.
Although it might still be more efficient to hold lock(cpu_maps) once than
repeatedly in a loop.

If we do that, then cpu_up_down_serialize_trainwrecks() can be called from
cpu_device_up/down() which implies !task_frozen.
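
To illustrate what calling device_offline() from cpuhp_smt_disable() could look
like, here's a rough, untested sketch. It assumes the sysfs store path still
holds the device hotplug lock around it, and the cpu_smt_control update plus
error unwinding are elided:

static int cpuhp_smt_disable_sketch(enum cpuhp_smt_control ctrlval)
{
    int cpu, ret = 0;

    for_each_online_cpu(cpu) {
        if (topology_is_primary_thread(cpu))
            continue;
        /*
         * device_offline() updates dev->offline, emits the uevent and
         * ends up in cpu_down() via cpu_subsys_offline(), so the
         * separate cpuhp_offline_cpu_device() dance goes away. The
         * cost is taking the cpu maps lock per iteration instead of
         * once around the loop.
         */
        ret = device_offline(get_cpu_device(cpu));
        if (ret)
            break;
    }

    return ret;
}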

Can't remember now if Alexey moved the uevent() handling out of the loop for
efficiency reasons or was seeing something else. I doubt it was the latter.


Thanks

--
Qais Yousef


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-24 Thread Qais Yousef
On 03/24/21 17:33, Alexander Sverdlin wrote:
> Hello Qais,
> 
> On 24/03/2021 16:57, Qais Yousef wrote:
> >>> FWIW my main concern is about duplicating the range check in
> >>> ftrace_call_replace() and using magic values that already exist in
> >>> __arm_gen_branch_{arm, thumb2}() and better remain encapsulated there.
> >> could you please check the negative limits? I have an opinion, my limits 
> >> are
> >> correct. I could add extra parameter to arm_gen_branch_link(), but for this
> >> I first need to fix its negative limits, which, I believe, well... 
> >> Approximate :)
> > Can you elaborate please?
> > 
> > If you look at Arm ARM [1] the ranges are defined in page 347
> > 
> > Encoding T1 Even numbers in the range –16777216 to 16777214.
> > Encoding T2 Multiples of 4 in the range –16777216 to 16777212.
> > Encoding A1 Multiples of 4 in the range –33554432 to 33554428.
> > Encoding A2 Even numbers in the range –33554432 to 33554430.
> > 
> > which matches what's in the code (T1 for thumb2 and A1 for arm).
> > 
> > Why do you think it's wrong?
> 
> thanks for checking this! I'll re-send v8 with your proposal.

If you feel some details need tweaking I don't mind; my proposal was an attempt
to help rather than impose.

Thanks!

--
Qais Yousef


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-24 Thread Qais Yousef
Hi Florian

On 03/23/21 20:37, Florian Fainelli wrote:
> Hi Qais,
> 
> On 3/23/2021 3:22 PM, Qais Yousef wrote:
> > Hi Alexander
> > 
> > On 03/22/21 18:02, Alexander Sverdlin wrote:
> >> Hi Qais,
> >>
> >> On 22/03/2021 17:32, Qais Yousef wrote:
> >>> Yes you're right. I was a bit optimistic on CONFIG_DYNAMIC_FTRACE will 
> >>> imply
> >>> CONFIG_ARM_MODULE_PLTS is enabled too.
> >>>
> >>> It only has an impact on reducing ifdefery when calling
> >>>
> >>>   ftrace_call_replace_mod(rec->arch.mod, ...)
> >>>
> >>> Should be easy to wrap rec->arch.mod with its own accessor that will 
> >>> return
> >>> NULL if !CONFIG_ARM_MODULE_PLTS or just ifdef the functions.
> >>>
> >>> Up to Alexander to pick what he prefers :-)
> >>
> >> well, I of course prefer v7 as-is, because this review is running longer 
> >> than two
> >> years and I actually hope these patches to be finally merged at some point.
> >> But you are welcome to optimize them with follow up patches :)
> > 
> > I appreciate that and thanks a lot for your effort. My attempt to review and
> > test here is to help in getting this merged.
> > 
> > FWIW my main concern is about duplicating the range check in
> > ftrace_call_replace() and using magic values that already exist in
> > __arm_gen_branch_{arm, thumb2}() and better remain encapsulated there.
> 
> Your patch in addition to Alexander's patch work for me as well, so feel
> free to add a:
> 
> Tested-by: Florian Fainelli 
> 
> FWIW, what is nice about Alexander's original patch is that it applies
> relatively cleanly to older kernels as well where this is equally

How old are we talking? Was the conflict that bad for the stable maintainers to
deal with? i.e. would it require sending the backport separately?

> needed. There is not currently any Fixes: tag being provided but maybe
> we should amend the second patch with one?

I'm not sure if this will be considered a new feature or a bug fix. FWIW,
tagging it for stable sounds reasonable to me.
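
i.e. the usual

    Cc: stable@vger.kernel.org # vX.Y+

on the second patch, with whatever the oldest kernel that actually needs it in
place of vX.Y.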

Thanks!

--
Qais Yousef


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-24 Thread Qais Yousef
Hey Alexander

On 03/24/21 10:04, Alexander Sverdlin wrote:
> Hi Qais,
> 
> On 23/03/2021 23:22, Qais Yousef wrote:
> >>> Yes you're right. I was a bit optimistic on CONFIG_DYNAMIC_FTRACE will 
> >>> imply
> >>> CONFIG_ARM_MODULE_PLTS is enabled too.
> >>>
> >>> It only has an impact on reducing ifdefery when calling
> >>>
> >>>   ftrace_call_replace_mod(rec->arch.mod, ...)
> >>>
> >>> Should be easy to wrap rec->arch.mod with its own accessor that will 
> >>> return
> >>> NULL if !CONFIG_ARM_MODULE_PLTS or just ifdef the functions.
> >>>
> >>> Up to Alexander to pick what he prefers :-)
> >> well, I of course prefer v7 as-is, because this review is running longer 
> >> than two
> >> years and I actually hope these patches to be finally merged at some point.
> >> But you are welcome to optimize them with follow up patches :)
> > I appreciate that and thanks a lot for your effort. My attempt to review and
> > test here is to help in getting this merged.
> > 
> > FWIW my main concern is about duplicating the range check in
> > ftrace_call_replace() and using magic values that already exist in
> > __arm_gen_branch_{arm, thumb2}() and better remain encapsulated there.
> 
> could you please check the negative limits? I have an opinion, my limits are
> correct. I could add extra parameter to arm_gen_branch_link(), but for this
> I first need to fix its negative limits, which, I believe, well... 
> Approximate :)

Can you elaborate please?

If you look at Arm ARM [1] the ranges are defined in page 347

Encoding T1 Even numbers in the range –16777216 to 16777214.
Encoding T2 Multiples of 4 in the range –16777216 to 16777212.
    Encoding A1 Multiples of 4 in the range –33554432 to 33554428.
Encoding A2 Even numbers in the range –33554432 to 33554430.

which matches what's in the code (T1 for thumb2 and A1 for arm).

Why do you think it's wrong?

Thanks

--
Qais Yousef

[1] https://developer.arm.com/documentation/ddi0406/latest/


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-23 Thread Qais Yousef
Hi Alexander

On 03/22/21 18:02, Alexander Sverdlin wrote:
> Hi Qais,
> 
> On 22/03/2021 17:32, Qais Yousef wrote:
> > Yes you're right. I was a bit optimistic on CONFIG_DYNAMIC_FTRACE will imply
> > CONFIG_ARM_MODULE_PLTS is enabled too.
> > 
> > It only has an impact on reducing ifdefery when calling
> > 
> > ftrace_call_replace_mod(rec->arch.mod, ...)
> > 
> > Should be easy to wrap rec->arch.mod with its own accessor that will return
> > NULL if !CONFIG_ARM_MODULE_PLTS or just ifdef the functions.
> > 
> > Up to Alexander to pick what he prefers :-)
> 
> well, I of course prefer v7 as-is, because this review is running longer than 
> two
> years and I actually hope these patches to be finally merged at some point.
> But you are welcome to optimize them with follow up patches :)

I appreciate that and thanks a lot for your effort. My attempt to review and
test here is to help in getting this merged.

FWIW my main concern is about duplicating the range check in
ftrace_call_replace() and using magic values that already exist in
__arm_gen_branch_{arm, thumb2}() and better remain encapsulated there.

Thanks

--
Qais Yousef

->8--


diff --git a/arch/arm/include/asm/ftrace.h b/arch/arm/include/asm/ftrace.h
index a4dbac07e4ef..8545b3ff8317 100644
--- a/arch/arm/include/asm/ftrace.h
+++ b/arch/arm/include/asm/ftrace.h
@@ -25,6 +25,27 @@ static inline unsigned long ftrace_call_adjust(unsigned long 
addr)
/* With Thumb-2, the recorded addresses have the lsb set */
return addr & ~1;
 }
+
+#ifdef CONFIG_ARM_MODULE_PLTS
+static inline void ftrace_set_mod(struct dyn_arch_ftrace *arch, struct module 
*mod)
+{
+   arch->mod = mod;
+}
+
+static inline struct module *ftrace_get_mod(struct dyn_arch_ftrace *arch)
+{
+   return arch->mod;
+}
+#else
+static inline void ftrace_set_mod(struct dyn_arch_ftrace *arch, struct module 
*mod)
+{
+}
+
+static inline struct module *ftrace_get_mod(struct dyn_arch_ftrace *arch)
+{
+   return NULL;
+}
+#endif
 #endif
 
 #endif
diff --git a/arch/arm/include/asm/insn.h b/arch/arm/include/asm/insn.h
index f20e08ac85ae..71c3edefe629 100644
--- a/arch/arm/include/asm/insn.h
+++ b/arch/arm/include/asm/insn.h
@@ -13,18 +13,24 @@ arm_gen_nop(void)
 }
 
 unsigned long
-__arm_gen_branch(unsigned long pc, unsigned long addr, bool link);
+__arm_gen_branch(unsigned long pc, unsigned long addr, bool link, bool check);
 
 static inline unsigned long
 arm_gen_branch(unsigned long pc, unsigned long addr)
 {
-   return __arm_gen_branch(pc, addr, false);
+   return __arm_gen_branch(pc, addr, false, true);
 }
 
 static inline unsigned long
 arm_gen_branch_link(unsigned long pc, unsigned long addr)
 {
-   return __arm_gen_branch(pc, addr, true);
+   return __arm_gen_branch(pc, addr, true, true);
+}
+
+static inline unsigned long
+arm_gen_branch_link_nocheck(unsigned long pc, unsigned long addr)
+{
+   return __arm_gen_branch(pc, addr, true, false);
 }
 
 #endif
diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
index fa867a57100f..63ea34edd222 100644
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -70,20 +70,28 @@ int ftrace_arch_code_modify_post_process(void)
 
 static unsigned long ftrace_call_replace(unsigned long pc, unsigned long addr)
 {
-   s32 offset = addr - pc;
-   s32 blim = 0xfe000008;
-   s32 flim = 0x02000004;
+   return arm_gen_branch_link(pc, addr);
+}
 
-   if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
-   blim = 0xff000004;
-   flim = 0x01000002;
-   }
+static unsigned long
+ftrace_call_replace_mod(struct module *mod, unsigned long pc, unsigned long 
addr)
+{
+#ifdef CONFIG_ARM_MODULE_PLTS
+   unsigned long new;
 
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS) &&
-   (offset < blim || offset > flim))
-   return 0;
+   if (likely(!mod))
+   return arm_gen_branch_link(pc, addr);
 
+   new = arm_gen_branch_link_nocheck(pc, addr);
+   if (!new) {
+   addr = get_module_plt(mod, pc, addr);
+   new = arm_gen_branch_link(pc, addr);
+   }
+
+   return new;
+#else
return arm_gen_branch_link(pc, addr);
+#endif
 }
 
 static int ftrace_modify_code(unsigned long pc, unsigned long old,
@@ -141,18 +149,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
 
old = ftrace_nop_replace(rec);
 
-   new = ftrace_call_replace(ip, aaddr);
-
-#ifdef CONFIG_ARM_MODULE_PLTS
-   if (!new) {
-   struct module *mod = rec->arch.mod;
-
-   if (mod) {
-   aaddr = get_module_plt(mod, ip, aaddr);
-   new = ftrace_call_replace(ip, aaddr);
-   }
-   }
-#endif
+   new = ftrace_call_replace_mod(ftrace_get_mod(&rec->arch), ip, aaddr);
 
return ftrace_modify_code(rec-

Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-22 Thread Qais Yousef
On 03/22/21 11:01, Steven Rostedt wrote:
> On Sun, 21 Mar 2021 19:06:11 +
> Qais Yousef  wrote:
> 
> >  #ifdef CONFIG_DYNAMIC_FTRACE
> >  struct dyn_arch_ftrace {
> > -#ifdef CONFIG_ARM_MODULE_PLTS
> > struct module *mod;
> > -#endif
> >  };
> >  
> 
> I know you want to reduce the "ifdefery", but please note that the
> dyn_arch_ftrace is defined once for every function that can be traced. If
> you have 40,000 functions that can be traced, that pointer is created
> 40,000 times. Thus, you really only want fields in the struct
> dyn_arch_ftrace if you really need them, otherwise, that's a lot of memory
> that is wasted.

Yes you're right. I was a bit optimistic that CONFIG_DYNAMIC_FTRACE would imply
CONFIG_ARM_MODULE_PLTS is enabled too.

It only has an impact on reducing ifdefery when calling

ftrace_call_replace_mod(rec->arch.mod, ...)

Should be easy to wrap rec->arch.mod with its own accessor that will return
NULL if !CONFIG_ARM_MODULE_PLTS or just ifdef the functions.

Up to Alexander to pick what he prefers :-)

Thanks

--
Qais Yousef


Re: [PATCH 2/3] cpumask: Introduce DYING mask

2021-03-21 Thread Qais Yousef
On 03/10/21 15:53, Peter Zijlstra wrote:
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -160,6 +160,9 @@ static int cpuhp_invoke_callback(unsigne
>   int (*cb)(unsigned int cpu);
>   int ret, cnt;
>  
> + if (bringup != !cpu_dying(cpu))

nit: this condition is hard to read

> + set_cpu_dying(cpu, !bringup);

since cpu_dying() will do cpumask_test_cpu(), are we saving much if we
unconditionally call set_cpu_dying(cpu, !bringup) which performs
cpumask_{set, clear}_cpu()?

> +
>   if (st->fail == state) {
>   st->fail = CPUHP_INVALID;
>       return -EAGAIN;

Thanks

--
Qais Yousef


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-21 Thread Qais Yousef
Hi Alexander

On 03/14/21 22:02, Qais Yousef wrote:
> I fixed Ard's email as it kept bouncing back.
> 
> +CC Linus Walleij
> 
> On 03/12/21 10:35, Florian Fainelli wrote:
> > On 3/12/21 9:24 AM, Qais Yousef wrote:
> > > Hi Alexander
> > > 
> > > On 03/10/21 18:17, Alexander Sverdlin wrote:
> > >> Hi!
> > >>
> > >> On 10/03/2021 17:14, Florian Fainelli wrote:
> > >>>>>>> I tried on 5.12-rc2 and 5.11 but couldn't reproduce the problem 
> > >>>>>>> using your
> > >>>>> I still can't reproduce on 5.12-rc2.
> > >>>>>
> > >>>>> I do have CONFIG_ARM_MODULE_PLTS=y. Do you need to do something else 
> > >>>>> after
> > >>>>> loading the module? I tried starting ftrace, but maybe there's a 
> > >>>>> particular
> > >>>>> combination required?
> > >>>> You need to load a BIG module, so big that it has no place in the 
> > >>>> modules area
> > >>>> any more and goes to vmalloc area.
> > >>> You absolutely need a very big module maybe more than one. When I tested
> > >>> this, I could use the two proprietary modules (*sigh*) that I needed to
> > >>> exercise against and loading one but not the other was not enough to
> > >>> make the second module loading spill into vmalloc space.
> > >>
> > >> Here is what I use instead of these real world "proprietary" modules 
> > >> (which of course
> > >> were the real trigger for the patch):
> > >>
> > >> https://www.spinics.net/lists/arm-kernel/msg878599.html
> > > 
> > > I am testing with your module. I can't reproduce the problem you describe 
> > > with
> > > it as I stated.
> > > 
> > > I will try to spend more time on it on the weekend.
> > 
> > Alexander, do you load one or multiple instances of that fat module?
> > 
> > The test module does a 6 * 1024 * 1024 / 2 = 3 million repetitions of
> > the "nop" instruction which should be 32-bits wide in ARM mode and
> > 16-bits wide in Thumb mode, right?
> > 
> > In ARM mode we have a 14MB module space, so 3 * 1024 * 1024 * 4 = 12MB,
> > which should still fit within if you have no module loaded, however a
> > second instance of the module should make us spill into vmalloc space.
> > 
> > In Thumb mode, we have a 6MB module space, so 3 * 1024 * 1024 * 2 = 6MB
> > so we may spill, but maybe not.
> > 
> > I was not able to reproduce the warning with just one module, but with
> > two (cannot have the same name BTW), it kicked in.
> 
> Thanks Florian. With 2 modules I can indeed reproduce the issue. Though for
> thumb I still hit the other problem so I couldn't properly test it.
> 
> I have come up with an alternative solution (hacky patch at the end) which 
> seem
> to fix both problems for me. That is
> 
>   * sed -i 's/=m/=y/ .config' which causes a lot of veneers to be
> generated is handled correctly now AFAICT
>   * Loading a fat module works the same.
> 
> I have tested it in both arm and thumb modes. Running
> 
>   trace-cmd start -p function
> 
> executes successfully and I can see functions being traced.
> 
> The solution makes a lot of assumptions that I yet to verify though, but I 
> hope
> it could form the basis for a proper one that fixes both issues. It's faster
> too. I'm just not sure if I need to take this handling one layer above or it's
> okay to be done at __arm_gen_branch(). Also I had a version that was more
> strict about verifying the veneer @pc points to @addr before using the veneer
> address but that fails for modules. The logic didn't work to verify for 
> module,
> but we do get the correct veneer function as returned by get_module_plt().
> 
> Since for ftrace we really care about mcount, I considered reading the plt 
> once
> at load time to get the address and save it in rec->arch and use that later
> instead of continuously reading it + having to store the plt. We could revisit
> that if the below has a fundamental flaw.
> 
> Alexander, have you considered these options before? I could have easily 
> missed
> something :-)

So the answer is no, that wouldn't have worked.

After spending more time on this, your approach looks good to me now. If you
send v8 addressing my comment about removing the range check from
ftrace_call_replace(), as it is redundant to what arm_gen_branch_link() is
already doing, I'd be happy to give this series 

Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-14 Thread Qais Yousef
I fixed Ard's email as it kept bouncing back.

+CC Linus Walleij

On 03/12/21 10:35, Florian Fainelli wrote:
> On 3/12/21 9:24 AM, Qais Yousef wrote:
> > Hi Alexander
> > 
> > On 03/10/21 18:17, Alexander Sverdlin wrote:
> >> Hi!
> >>
> >> On 10/03/2021 17:14, Florian Fainelli wrote:
> >>>>>>> I tried on 5.12-rc2 and 5.11 but couldn't reproduce the problem using 
> >>>>>>> your
> >>>>> I still can't reproduce on 5.12-rc2.
> >>>>>
> >>>>> I do have CONFIG_ARM_MODULE_PLTS=y. Do you need to do something else 
> >>>>> after
> >>>>> loading the module? I tried starting ftrace, but maybe there's a 
> >>>>> particular
> >>>>> combination required?
> >>>> You need to load a BIG module, so big that it has no place in the 
> >>>> modules area
> >>>> any more and goes to vmalloc area.
> >>> You absolutely need a very big module maybe more than one. When I tested
> >>> this, I could use the two proprietary modules (*sigh*) that I needed to
> >>> exercise against and loading one but not the other was not enough to
> >>> make the second module loading spill into vmalloc space.
> >>
> >> Here is what I use instead of these real world "proprietary" modules 
> >> (which of course
> >> were the real trigger for the patch):
> >>
> >> https://www.spinics.net/lists/arm-kernel/msg878599.html
> > 
> > I am testing with your module. I can't reproduce the problem you describe 
> > with
> > it as I stated.
> > 
> > I will try to spend more time on it on the weekend.
> 
> Alexander, do you load one or multiple instances of that fat module?
> 
> The test module does a 6 * 1024 * 1024 / 2 = 3 million repetitions of
> the "nop" instruction which should be 32-bits wide in ARM mode and
> 16-bits wide in Thumb mode, right?
> 
> In ARM mode we have a 14MB module space, so 3 * 1024 * 1024 * 4 = 12MB,
> which should still fit within if you have no module loaded, however a
> second instance of the module should make us spill into vmalloc space.
> 
> In Thumb mode, we have a 6MB module space, so 3 * 1024 * 1024 * 2 = 6MB
> so we may spill, but maybe not.
> 
> I was not able to reproduce the warning with just one module, but with
> two (cannot have the same name BTW), it kicked in.

Thanks Florian. With 2 modules I can indeed reproduce the issue. Though for
thumb I still hit the other problem so I couldn't properly test it.

I have come up with an alternative solution (hacky patch at the end) which seems
to fix both problems for me. That is

* sed -i 's/=m/=y/' .config which causes a lot of veneers to be
  generated is handled correctly now AFAICT
* Loading a fat module works the same.

I have tested it in both arm and thumb modes. Running

trace-cmd start -p function

executes successfully and I can see functions being traced.

The solution makes a lot of assumptions that I have yet to verify, but I hope
it could form the basis for a proper one that fixes both issues. It's faster
too. I'm just not sure if I need to take this handling one layer above or it's
okay to be done at __arm_gen_branch(). Also I had a version that was more
strict about verifying the veneer @pc points to @addr before using the veneer
address but that fails for modules. The verification logic didn't work for modules,
but we do get the correct veneer function as returned by get_module_plt().

Since for ftrace we really care about mcount, I considered reading the plt once
at load time to get the address and save it in rec->arch and use that later
instead of continuously reading it + having to store the plt. We could revisit
that if the below has a fundamental flaw.

Alexander, have you considered these options before? I could have easily missed
something :-)

Thanks

--
Qais Yousef


--->8-


diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
index 9a79ef6b1876..70f6e197c306 100644
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -157,6 +157,9 @@ int ftrace_make_nop(struct module *mod,
unsigned long new;
int ret;
 
+   if (mod)
+   pr_info("VENEER_DBG: mcount_veneedr_addr = 0x%x\n", 
get_module_plt(mod, ip, addr));
+
old = ftrace_call_replace(ip, adjust_address(rec, addr));
new = ftrace_nop_replace(rec);
ret = ftrace_modify_code(ip, old, new, true);
diff --git a/arch/arm/kernel/insn.c b/arch/arm/kernel/insn.c
index 2e844b70386b..0d6abf8d726f 100644
--- a/arch/arm/kernel/insn.c
+++ b/arch/arm/kernel/insn.c
@@ -3,17 +3,53 @@
 #include 
 #include 
 
+/*
+ *  Checks if

Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-12 Thread Qais Yousef
Hi Alexander

On 03/10/21 18:17, Alexander Sverdlin wrote:
> Hi!
> 
> On 10/03/2021 17:14, Florian Fainelli wrote:
> >>>>> I tried on 5.12-rc2 and 5.11 but couldn't reproduce the problem using 
> >>>>> your
> >>> I still can't reproduce on 5.12-rc2.
> >>>
> >>> I do have CONFIG_ARM_MODULE_PLTS=y. Do you need to do something else after
> >>> loading the module? I tried starting ftrace, but maybe there's a 
> >>> particular
> >>> combination required?
> >> You need to load a BIG module, so big that it has no place in the modules 
> >> area
> >> any more and goes to vmalloc area.
> > You absolutely need a very big module maybe more than one. When I tested
> > this, I could use the two proprietary modules (*sigh*) that I needed to
> > exercise against and loading one but not the other was not enough to
> > make the second module loading spill into vmalloc space.
> 
> Here is what I use instead of these real world "proprietary" modules (which 
> of course
> were the real trigger for the patch):
> 
> https://www.spinics.net/lists/arm-kernel/msg878599.html

I am testing with your module. I can't reproduce the problem you describe with
it as I stated.

I will try to spend more time on it on the weekend.

Thanks

--
Qais Yousef


Re: [PATCH v4 15/19] sched: Fix migrate_disable() vs rt/dl balancing

2021-03-10 Thread Qais Yousef
On 03/05/21 15:41, Valentin Schneider wrote:
> On 05/03/21 15:56, Peter Zijlstra wrote:
> > On Sat, Dec 26, 2020 at 01:54:45PM +0000, Qais Yousef wrote:
> >>
> >> > +static inline struct task_struct *get_push_task(struct rq *rq)
> >> > +{
> >> > +struct task_struct *p = rq->curr;
> >>
> >> Shouldn't we verify the class of the task here? The RT task in migration
> >> disabled could have been preempted by a dl or stopper task. Similarly, the 
> >> dl
> >> task could have been preempted by a stopper task.
> >>
> >> I don't think an RT task should be allowed to push a dl task under any
> >> circumstances?
> >
> > Hmm, quite. Fancy doing a patch?
> 
> Last time we talked about this, I looked into
> 
>   push_rt_task() + find_lowest_rq()
> 
> IIRC, with how
> 
>   find_lowest_rq() + cpupri_find_fitness()
> 
> currently work, find_lowest_rq() should return -1 in push_rt_task() if
> rq->curr is DL (CPUPRI_INVALID). IOW, Migration-Disabled RT tasks shouldn't
> actually interfere with DL tasks (unless a DL task gets scheduled after we
> drop the rq lock and kick the stopper, but we have that problem everywhere
> including CFS active balance).

This makes it less of a problem, true, but AFAICT this can still happen in the
pull path.

Anyways, here's the patch as extra bolts and braces to be considered.

Thanks

--
Qais Yousef

--->8

>From 2df733d381f636cc185944c7eda86c824a9a785e Mon Sep 17 00:00:00 2001
From: Qais Yousef 
Date: Tue, 12 Jan 2021 11:54:16 +0000
Subject: [PATCH] sched: Don't push a higher priority class in get_push_task()

Commit a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing")
will attempt to push/pull a higher priority task if the candidate task
is in migrate_disable() section. This is an attempt to prevent
starvation of these lower priority task that, in theory at least, could
end up in a situation where they're forever in migrate disable section
with no CPU time to run.

One issue with that is get_push_task() assumes rq->curr is of the same
sched_class, which AFAICT is not guaranteed to be true.

This patch adds extra bolts and braces to ensure that this voluntary
push operation is performed on a task of the same scheduling class only.

Otherwise an RT task could end up causing a DL task to be pushed away.
Which breaks the strict priority between sched classes.

We could also end up trying to push the migration task. Which I think is
harmless and is nothing but a wasted effort.

Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing")
Signed-off-by: Qais Yousef 
---
 kernel/sched/deadline.c |  2 +-
 kernel/sched/rt.c       |  4 ++--
 kernel/sched/sched.h    | 17 ++++++++++++++++-
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index aac3539aa0fe..afadc7e1f968 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2276,7 +2276,7 @@ static void pull_dl_task(struct rq *this_rq)
goto skip;
 
if (is_migration_disabled(p)) {
-   push_task = get_push_task(src_rq);
+   push_task = get_push_task(src_rq, 
SCHED_DEADLINE);
} else {
deactivate_task(src_rq, p, 0);
set_task_cpu(p, this_cpu);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8f720b71d13d..c2c5c08e3030 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1892,7 +1892,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 * to this other CPU, instead attempt to push the current
 * running task on this CPU away.
 */
-   push_task = get_push_task(rq);
+   push_task = get_push_task(rq, SCHED_FIFO);
if (push_task) {
raw_spin_unlock(&rq->lock);
stop_one_cpu_nowait(rq->cpu, push_cpu_stop,
@@ -2225,7 +2225,7 @@ static void pull_rt_task(struct rq *this_rq)
goto skip;
 
if (is_migration_disabled(p)) {
-   push_task = get_push_task(src_rq);
+   push_task = get_push_task(src_rq, SCHED_FIFO);
} else {
deactivate_task(src_rq, p, 0);
set_task_cpu(p, this_cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..4e156f008d22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1954,12 +1954,27 @@ extern void trigger_load_balance(struct rq *rq);
 
 extern void set_cpus_allowed_common(struct task_struct *p, const struct 
cpumask *new_mask, u32 flags);
 

Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-09 Thread Qais Yousef
On 03/08/21 08:58, Alexander Sverdlin wrote:
> Hi!
> 
> On 07/03/2021 18:26, Qais Yousef wrote:
> > I tried on 5.12-rc2 and 5.11 but couldn't reproduce the problem using your

I still can't reproduce on 5.12-rc2.

I do have CONFIG_ARM_MODULE_PLTS=y. Do you need to do something else after
loading the module? I tried starting ftrace, but maybe there's a particular
combination required?

> > instructions on the other email. But most likely because I'm hitting another
> > problem that could be masking it. I'm not sure it is related or just 
> > randomly
> > happened to hit it.
> > 
> > Did you see something similar?
> 
> [...]
> 
> > [0.00] [] (ftrace_bug) from [] 
> > (ftrace_process_locs+0x2b0/0x518)
> > [0.00]  r7:c3817ac4 r6:c38040c0 r5:0a3c r4:000134e4
> > [0.00] [] (ftrace_process_locs) from [] 
> > (ftrace_init+0xc8/0x174)
> > [0.00]  r10:c2ffa000 r9:c2be8a78 r8:c2c5d1fc r7:c2c0c208 
> > r6:0001 r5:c2d0908c
> > [0.00]  r4:c362f518
> > [0.00] [] (ftrace_init) from [] 
> > (start_kernel+0x2f4/0x5b8)
> > [0.00]  r9:c2be8a78 r8:dbfffec0 r7: r6:c36385cc 
> > r5:c2d08f00 r4:c2ffa000
> > [0.00] [] (start_kernel) from [<>] (0x0)
> 
> This means, FTRACE has more problems with your kernel/compiler/platform, I've 
> addressed similar issue
> in the past, but my patch should be long merged:
> 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1817963.html
> 
> Could it be the same problem as here:
> https://www.spinics.net/lists/arm-kernel/msg854022.html
> 
> Seems that the size check deserves something line BUILD_BUG_ON() with 
> FTRACE...

So I only see this when I convert all modules to be built-in

sed -i 's/=m/=y/' .config

FWIW, I see the problem with your patch applied too. Trying to dig more into
it..

> 
> >> diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
> >> index 9a79ef6..fa867a5 100644
> >> --- a/arch/arm/kernel/ftrace.c
> >> +++ b/arch/arm/kernel/ftrace.c
> >> @@ -70,6 +70,19 @@ int ftrace_arch_code_modify_post_process(void)
> >>  
> >>  static unsigned long ftrace_call_replace(unsigned long pc, unsigned long 
> >> addr)
> >>  {
> >> +  s32 offset = addr - pc;
> >> +  s32 blim = 0xfe000008;
> >> +  s32 flim = 0x02000004;
> > 
> > This look like magic numbers to me..
> 
> These magic numbers are most probably the reason for your FTRACE to resign...
> Those are backward- and forward-branch limits. I didn't find the matching 
> DEFINEs
> in the kernel, but I would be happy to learn them. I can also put some 
> comments,
> but I actually thought the purpose would be obvious from the code...

So I did dig more into it. The range is asymmetrical indeed. And the strange
offset is to cater for the pc being incremented by +8 (+4 for thumb2).
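
Spelling out my own arithmetic for the record (please double check me):

    ARM (A1):    branch offsets span -0x02000000 .. 0x01fffffc relative to
                 pc + 8, so in terms of addr - pc that is
                 -0x02000000 + 8 .. 0x01fffffc + 8, i.e. 0xfe000008 (as s32)
                 .. 0x02000004 -- which matches blim/flim above.
    Thumb2 (T1): -0x01000000 .. 0x00fffffe relative to pc + 4, giving
                 0xff000004 (as s32) .. 0x01000002.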

You're duplicating the checks in __arm_gen_branch_{thumb2, arm}(). As you noted,
__arm_gen_branch(), which is called by arm_gen_branch_link(), will end up doing
the exact same check and return 0. So why do you need to duplicate the check
here? We can do something about the WARN_ON_ONCE(1).

[...]

> >> +
> >>return arm_gen_branch_link(pc, addr);
> >>  }
> >>  
> >> @@ -124,10 +137,22 @@ int ftrace_make_call(struct dyn_ftrace *rec, 
> >> unsigned long addr)
> >>  {
> >>unsigned long new, old;
> >>unsigned long ip = rec->ip;
> >> +  unsigned long aaddr = adjust_address(rec, addr);
> >>  
> >>old = ftrace_nop_replace(rec);
> >>  
> >> -  new = ftrace_call_replace(ip, adjust_address(rec, addr));
> >> +  new = ftrace_call_replace(ip, aaddr);
> >> +
> >> +#ifdef CONFIG_ARM_MODULE_PLTS
> >> +  if (!new) {
> >> +  struct module *mod = rec->arch.mod;
> >> +
> >> +  if (mod) {
> > 
> > What would happen if !new and !mod?
> 
> I believe, that's exactly what happens in the dump you experience with your 
> kernel.
> This is not covered by this patch, this patch covers the issue with modules 
> in vmalloc area.
> 
> >> +  aaddr = get_module_plt(mod, ip, aaddr);
> >> +  new = ftrace_call_replace(ip, aaddr);
> > 
> > I assume we're guaranteed to have a sensible value returned in 'new' here?
> 
> Otherwise you'd see the dump you see :)
> It relies on the already existing error handling.

I understand from this there are still loose ends to be handled in this area of
the code.

I admit I need to spend more time to understand why I get the failure above and
how this overlaps with your proposal. But as it stands it seems there's more
work to be done here.

Thanks

--
Qais Yousef


Re: [PATCH v7 2/2] ARM: ftrace: Add MODULE_PLTS support

2021-03-07 Thread Qais Yousef
.00] hardirqs last disabled at (0): [<>] 0x0
[0.00] softirqs last  enabled at (0): [<>] 0x0
[0.00] softirqs last disabled at (0): [<>] 0x0
[0.00] random: get_random_bytes called from 
print_oops_end_marker+0x34/0xa0 with crng_init=0
[0.00] ---[ end trace  ]---
[0.00] ftrace: allocated 243 pages with 6 groups


> diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
> index 9a79ef6..fa867a5 100644
> --- a/arch/arm/kernel/ftrace.c
> +++ b/arch/arm/kernel/ftrace.c
> @@ -70,6 +70,19 @@ int ftrace_arch_code_modify_post_process(void)
>  
>  static unsigned long ftrace_call_replace(unsigned long pc, unsigned long 
> addr)
>  {
> + s32 offset = addr - pc;
> + s32 blim = 0xfe000008;
> + s32 flim = 0x02000004;

This look like magic numbers to me..

> +
> + if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
> + blim = 0xff000004;
> + flim = 0x01000002;

.. ditto ..

> + }
> +
> + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS) &&
> + (offset < blim || offset > flim))
> + return 0;

.. I could have missed something, but wouldn't something like below be clearer?
Only compile tested. I think abs() will do the right thing here given the
passed types. I admit I don't understand why you have the '4' and '8' at the
lowest nibble..

diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
index fa867a57100f..b44aee87c53a 100644
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -70,17 +70,13 @@ int ftrace_arch_code_modify_post_process(void)

 static unsigned long ftrace_call_replace(unsigned long pc, unsigned 
long addr)
 {
-   s32 offset = addr - pc;
-   s32 blim = 0xfe000008;
-   s32 flim = 0x02000004;
+   u32 offset = abs(addr - pc);
+   u32 range = 0x02000000; /* +-32MiB */

-   if (IS_ENABLED(CONFIG_THUMB2_KERNEL)) {
-   blim = 0xff000004;
-   flim = 0x01000002;
-   }
+   if (IS_ENABLED(CONFIG_THUMB2_KERNEL))
+   range = 0x01000000; /* +-16MiB */

-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS) &&
-   (offset < blim || offset > flim))
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS) && offset > range)
return 0;

return arm_gen_branch_link(pc, addr);

In case CONFIG_ARM_MODULE_PLTS is not enabled, what would happen? Is it
impossible to hit this corner case, or could we fail one way or another? IOW,
should this check always be compiled in?

> +
>   return arm_gen_branch_link(pc, addr);
>  }
>  
> @@ -124,10 +137,22 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned 
> long addr)
>  {
>   unsigned long new, old;
>   unsigned long ip = rec->ip;
> + unsigned long aaddr = adjust_address(rec, addr);
>  
>   old = ftrace_nop_replace(rec);
>  
> - new = ftrace_call_replace(ip, adjust_address(rec, addr));
> + new = ftrace_call_replace(ip, aaddr);
> +
> +#ifdef CONFIG_ARM_MODULE_PLTS
> + if (!new) {
> + struct module *mod = rec->arch.mod;
> +
> + if (mod) {

What would happen if !new and !mod?

> + aaddr = get_module_plt(mod, ip, aaddr);
> + new = ftrace_call_replace(ip, aaddr);

I assume we're guaranteed to have a sensible value returned in 'new' here?

Thanks

--
Qais Yousef

> + }
> + }
> +#endif


Re: [PATCH v4 15/19] sched: Fix migrate_disable() vs rt/dl balancing

2021-03-05 Thread Qais Yousef
On 03/05/21 15:41, Valentin Schneider wrote:
> On 05/03/21 15:56, Peter Zijlstra wrote:
> > On Sat, Dec 26, 2020 at 01:54:45PM +0000, Qais Yousef wrote:
> >>
> >> > +static inline struct task_struct *get_push_task(struct rq *rq)
> >> > +{
> >> > +struct task_struct *p = rq->curr;
> >>
> >> Shouldn't we verify the class of the task here? The RT task in migration
> >> disabled could have been preempted by a dl or stopper task. Similarly, the 
> >> dl
> >> task could have been preempted by a stopper task.
> >>
> >> I don't think an RT task should be allowed to push a dl task under any
> >> circumstances?
> >
> > Hmm, quite. Fancy doing a patch?
> 
> Last time we talked about this, I looked into
> 
>   push_rt_task() + find_lowest_rq()
> 
> IIRC, with how
> 
>   find_lowest_rq() + cpupri_find_fitness()
> 
> currently work, find_lowest_rq() should return -1 in push_rt_task() if
> rq->curr is DL (CPUPRI_INVALID). IOW, Migration-Disabled RT tasks shouldn't

[...]

> If you look closely, this is exactly the same as the previous spread
> modulo CPU numbers. IOW, this is (again) a CPU renumbering exercise.

I don't see it as a renumbering exercise. The way I understand it, a system
designer doesn't expect their DL task to move because of an RT task. I think we
should try to keep it this way; that's why I asked.

To be fair, I need to look at the code again and understand where I missed that
3rd condition Peter mentioned.

Thanks

--
Qais Yousef


Re: [PATCH v4 15/19] sched: Fix migrate_disable() vs rt/dl balancing

2021-03-05 Thread Qais Yousef
On 03/05/21 15:56, Peter Zijlstra wrote:
> On Sat, Dec 26, 2020 at 01:54:45PM +0000, Qais Yousef wrote:
> > Hi Peter
> > 
> > Apologies for the late comments on the patch.
> 
> Ha!, it seems I too need to apologize for never having actually found
> your reply ;-)

No worries, thanks for taking the time to answer! :-)

> 
> > On 10/23/20 12:12, Peter Zijlstra wrote:
> > 
> > [...]
> > 
> > > + * When a preempted task becomes elegible to run under the ideal model 
> > > (IOW it
> > > + * becomes one of the M highest priority tasks), it might still have to 
> > > wait
> > > + * for the preemptee's migrate_disable() section to complete. Thereby 
> > > suffering
> > > + * a reduction in bandwidth in the exact duration of the 
> > > migrate_disable()
> > > + * section.
> > > + *
> > > + * Per this argument, the change from preempt_disable() to 
> > > migrate_disable()
> > > + * gets us:
> > > + *
> > > + * - a higher priority tasks gains reduced wake-up latency; with 
> > > preempt_disable()
> > > + *   it would have had to wait for the lower priority task.
> > > + *
> > > + * - a lower priority tasks; which under preempt_disable() could've 
> > > instantly
> > > + *   migrated away when another CPU becomes available, is now constrained
> > > + *   by the ability to push the higher priority task away, which might 
> > > itself be
> > > + *   in a migrate_disable() section, reducing it's available bandwidth.
> > > + *
> > > + * IOW it trades latency / moves the interference term, but it stays in 
> > > the
> > > + * system, and as long as it remains unbounded, the system is not fully
> > > + * deterministic.
> > 
> > The idea makes sense but I'm worried about some implementation details.
> > 
> > Specifically:
> > 
> > * There's no guarantee the target CPU we're pushing to doesn't have
> >   a lower priority task in migration_disabled too. So we could end up
> >   having to push the task again. 
> 
> I'm not sure I follow, if the CPU we're pushing to has a
> migrate_disable() task of lower priority we'll simply preempt it.
> 
> IIRC there's conditions for this push:
> 
>  1) we just did migrate_enable();
>  2) the task below us also has migrate_disable();
>  3) the task below us is actually higher priority than
> the lowest priority task currently running.
> 
> So at that point we shoot our high prio task away, and we aim it at the
> lowest prio task.
> 
> In order to then shoot it away again, someone else needs to block to
> make lower prio task we just preempted elegible again.

Okay. I missed that 3rd condition. I understood only 1 and 2 are required.
So we have to have 3 tasks of different priorities on the rq, the middle being
in migrate_disabled.

It is less of a problem in that case.

> 
> Still, possible I suppose.
> 
> > Although unlikely in practice, but as
> >   I see it the worst case scenario is unbounded here. The planets could
> >   align perfectly for the higher priority task to spend the majority of
> >   its time migrating between cpus that have low priority RT tasks in
> >   migration_disabled regions.
> 
> I'm thinking it might be limited by the range of priorities. You need to
> drop the prio on every round, and you can't keep on dropping priority
> levels, at some point we've reached bottom.

With that 3rd condition in mind, there has to be an element of bad design to
continuously end up with 3 tasks of different priorities on 1 rq. The
system has to be in some sort of overloaded state, which is a bigger problem to
address first.

> > > +static inline struct task_struct *get_push_task(struct rq *rq)
> > > +{
> > > + struct task_struct *p = rq->curr;
> > 
> > Shouldn't we verify the class of the task here? The RT task in migration
> > disabled could have been preempted by a dl or stopper task. Similarly, the 
> > dl
> > task could have been preempted by a stopper task.
> > 
> > I don't think an RT task should be allowed to push a dl task under any
> > circumstances?
> 
> Hmm, quite. Fancy doing a patch?

I had one. Let me revive and post it next week.

Thanks

--
Qais Yousef


Re: [PATCH 3/7 v4] sched/fair: remove unused parameter of update_nohz_stats

2021-03-05 Thread Qais Yousef
On 02/24/21 14:30, Vincent Guittot wrote:
> idle load balance is the only user of update_nohz_stats and doesn't use
> force parameter. Remove it
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e23709f6854b..f52f4dd3fb9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8352,7 +8352,7 @@ group_type group_classify(unsigned int imbalance_pct,
>   return group_has_spare;
>  }
>  
> -static bool update_nohz_stats(struct rq *rq, bool force)
> +static bool update_nohz_stats(struct rq *rq)
>  {
>  #ifdef CONFIG_NO_HZ_COMMON
>   unsigned int cpu = rq->cpu;
> @@ -8363,7 +8363,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
>   if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>   return false;
>  
> - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> + if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>   return true;
>  
>   update_blocked_averages(cpu);
> @@ -10401,7 +10401,7 @@ static void _nohz_idle_balance(struct rq *this_rq, 
> unsigned int flags,
>  
>   rq = cpu_rq(balance_cpu);
>  
> - has_blocked_load |= update_nohz_stats(rq, true);
> + has_blocked_load |= update_nohz_stats(rq);

I think Dietmar commented on this on v1. There's a change in behavior here
AFAICT. Worth expanding the changelog to explain that this will be rate limited
and why it's okay? It'll help a lost soul like me who doesn't have the ins and
outs of this code carved in their head :-)

Thanks

--
Qais Yousef


>  
>   /*
>* If time for next balance is due,
> -- 
> 2.17.1
> 


Re: [PATCH 6/7 v4] sched/fair: trigger the update of blocked load on newly idle cpu

2021-03-05 Thread Qais Yousef
On 02/24/21 14:30, Vincent Guittot wrote:
> +/*
> + * Check if we need to run the ILB for updating blocked load before entering
> + * idle state.
> + */
> +void nohz_run_idle_balance(int cpu)
> +{
> + unsigned int flags;
> +
> + flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));
> +
> + /*
> +  * Update the blocked load only if no SCHED_SOFTIRQ is about to happen
> +  * (ie NOHZ_STATS_KICK set) and will do the same.
> +  */
> + if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
> + _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> +}

nit: need_resched() implies this_cpu, but the function signature implies you
could operate on any CPU. Do need_resched() outside this function or make
the function read smp_processor_id() itself without taking an arg?
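
i.e. something like the below untested sketch, keeping your logic as-is and
just letting the function pick the cpu itself (with the caller updated to not
pass the cpu, obviously):

	void nohz_run_idle_balance(void)
	{
		int cpu = smp_processor_id();
		unsigned int flags;

		flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));

		/*
		 * Update the blocked load only if no SCHED_SOFTIRQ is about to
		 * happen (ie NOHZ_STATS_KICK set) and will do the same.
		 */
		if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
			_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
	}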

Thanks

--
Qais Yousef


Re: [PATCH RESEND v2 2/2] virt: acrn: Make remove_cpu sysfs invisible with !CONFIG_HOTPLUG_CPU

2021-02-23 Thread Qais Yousef
On 02/21/21 21:43, shuo.a@intel.com wrote:
> From: Shuo Liu 
> 
> Without cpu hotplug support, vCPU cannot be removed from a Service VM.
> Don't expose remove_cpu sysfs when CONFIG_HOTPLUG_CPU disabled.
> 
> Signed-off-by: Shuo Liu 
> Acked-by: Randy Dunlap  # build-tested
> Cc: Stephen Rothwell 
> Cc: Thomas Gleixner 
> Cc: Greg Kroah-Hartman 
> Cc: Qais Yousef 
> ---
>  drivers/virt/acrn/hsm.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/virt/acrn/hsm.c b/drivers/virt/acrn/hsm.c
> index 1f6b7c54a1a4..6996ea6219e5 100644
> --- a/drivers/virt/acrn/hsm.c
> +++ b/drivers/virt/acrn/hsm.c
> @@ -404,6 +404,14 @@ static ssize_t remove_cpu_store(struct device *dev,
>  }
>  static DEVICE_ATTR_WO(remove_cpu);
>  
> +static umode_t acrn_attr_visible(struct kobject *kobj, struct attribute *a, 
> int n)
> +{
> +   if (a == &dev_attr_remove_cpu.attr)
> +   return IS_ENABLED(CONFIG_HOTPLUG_CPU) ? a->mode : 0;
> +
> +   return a->mode;
> +}
> +

I can't find this code in Linus' master. But this looks fine from my narrow
PoV. Protecting the attribute with ifdef CONFIG_HOTPLUG_CPU is easier to read
for me, but this doesn't mean this approach is not fine.

Thanks

--
Qais Yousef

>  static struct attribute *acrn_attrs[] = {
>   &dev_attr_remove_cpu.attr,
>   NULL
> @@ -411,6 +419,7 @@ static struct attribute *acrn_attrs[] = {
>  
>  static struct attribute_group acrn_attr_group = {
>   .attrs = acrn_attrs,
> + .is_visible = acrn_attr_visible,
>  };
>  
>  static const struct attribute_group *acrn_attr_groups[] = {
> -- 
> 2.28.0
> 


Re: [PATCH RESEND v2 1/2] cpu/hotplug: Fix build error of using {add,remove}_cpu() with !CONFIG_SMP

2021-02-23 Thread Qais Yousef
On 02/21/21 21:43, shuo.a@intel.com wrote:
> From: Shuo Liu 
> 
> 279dcf693ac7 ("virt: acrn: Introduce an interface for Service VM to
> control vCPU") introduced {add,remove}_cpu() usage and it hit below
> error with !CONFIG_SMP:
> 
> ../drivers/virt/acrn/hsm.c: In function ‘remove_cpu_store’:
> ../drivers/virt/acrn/hsm.c:389:3: error: implicit declaration of function 
> ‘remove_cpu’; [-Werror=implicit-function-declaration]
>remove_cpu(cpu);
> 
> ../drivers/virt/acrn/hsm.c:402:2: error: implicit declaration of function 
> ‘add_cpu’; [-Werror=implicit-function-declaration]
>add_cpu(cpu);
> 
> Add add_cpu() function prototypes with !CONFIG_SMP and remove_cpu() with
> !CONFIG_HOTPLUG_CPU for such usage.
> 
> Fixes: 279dcf693ac7 ("virt: acrn: Introduce an interface for Service VM to 
> control vCPU")
> Reported-by: Randy Dunlap 
> Signed-off-by: Shuo Liu 
> Acked-by: Randy Dunlap  # build-tested
> Cc: Stephen Rothwell 
> Cc: Thomas Gleixner 
> Cc: Greg Kroah-Hartman 
> Cc: Qais Yousef 
> ---

Reviewed-by: Qais Yousef 

Thanks!

--
Qais Yousef

>  include/linux/cpu.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 3aaa0687e8df..94a578a96202 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -108,6 +108,8 @@ static inline void cpu_maps_update_done(void)
>  {
>  }
>  
> +static inline int add_cpu(unsigned int cpu) { return 0;}
> +
>  #endif /* CONFIG_SMP */
>  extern struct bus_type cpu_subsys;
>  
> @@ -137,6 +139,7 @@ static inline int  cpus_read_trylock(void) { return true; 
> }
>  static inline void lockdep_assert_cpus_held(void) { }
>  static inline void cpu_hotplug_disable(void) { }
>  static inline void cpu_hotplug_enable(void) { }
> +static inline int remove_cpu(unsigned int cpu) { return -EPERM; }
>  static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { }
>  #endif   /* !CONFIG_HOTPLUG_CPU */
>  
> 
> base-commit: abaf6f60176f1ae9d946d63e4db63164600b7b1a
> -- 
> 2.28.0
> 


Re: [PATCH v2] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-02-16 Thread Qais Yousef
On 02/12/21 00:30, Alexey Klimov wrote:
> When a CPU offlined and onlined via device_offline() and device_online()
> the userspace gets uevent notification. If, after receiving "online" uevent,
> userspace executes sched_setaffinity() on some task trying to move it
> to a recently onlined CPU, then it often fails with -EINVAL. Userspace needs
> to wait around 5..30 ms before sched_setaffinity() will succeed for recently
> onlined CPU after receiving uevent.
> 
> If in_mask argument for sched_setaffinity() has only recently onlined CPU,
> it often fails with such flow:
> 
>   sched_setaffinity()
> cpuset_cpus_allowed()
>   guarantee_online_cpus()   <-- cs->effective_cpus mask does not
> contain recently onlined cpu
> cpumask_and()   <-- final new_mask is empty
> __set_cpus_allowed_ptr()
>   cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>   returns -EINVAL
> 
> Cpusets used in guarantee_online_cpus() are updated using workqueue from
> cpuset_update_active_cpus() which in its turn is called from cpu hotplug 
> callback
> sched_cpu_activate() hence it may not be observable by sched_setaffinity() if
> it is called immediately after uevent.

nit: newline

> Out of line uevent can be avoided if we will ensure that cpuset_hotplug_work
> has run to completion using cpuset_wait_for_hotplug() after onlining the
> cpu in cpu_device_up() and in cpuhp_smt_enable().
> 
> Co-analyzed-by: Joshua Baker 
> Signed-off-by: Alexey Klimov 
> ---

This looks good to me.

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef


Re: [PATCH] docs: reporting-issues.rst: explain how to decode stack traces

2021-02-15 Thread Qais Yousef
Hi Thorsten

On 02/15/21 06:55, Thorsten Leemhuis wrote:
> Hi! Many thx for looking into this, much appreciated!
> 
> Am 14.02.21 um 17:00 schrieb Qais Yousef:
> > On 02/10/21 06:48, Thorsten Leemhuis wrote:
> >
> >> - * If the failure includes a stack dump, like an Oops does, consider 
> >> decoding
> >> -   it to find the offending line of code.
> >> + * If your failure involves a 'panic', 'oops', or 'warning', consider 
> >> decoding
> > or 'BUG'? There are similar other places below that could benefit from this
> > addition too.
> 
> Good point. In fact there are other places in the document where this is
> needed as well. Will address those in another patch.
> 
> >> +   the kernel log to find the line of code that trigger the error.
> >>  
> >>   * If your problem is a regression, try to narrow down when the issue was
> >> introduced as much as possible.
> >> @@ -869,6 +869,15 @@ pick up the configuration of your current kernel and 
> >> then tries to adjust it
> >>  somewhat for your system. That does not make the resulting kernel any 
> >> better,
> >>  but quicker to compile.
> >>  
> >> +Note: If you are dealing with a kernel panic, oops, or warning, please 
> >> make
> >> +sure to enable CONFIG_KALLSYMS when configuring your kernel. Additionally,
> > 
> > s/make sure/try/
> 
> I did that, but ignored...
> 
> > s/kernel./kernel if you can./
> 
> ...this. Yes, you have a point with...
> 
> > Less demanding wording in case the user doesn't have the capability to 
> > rebuild
> > or deploy such a kernel where the problem happens. Maybe you can tweak it 
> > more
> > if you like too :-)
> 
> ...that, but that section in the document is about building your own
> kernel, so I'd say we don't have to be that careful here.

Cool. Works for me.

Thanks

--
Qais Yousef


Re: [PATCH] docs: reporting-issues.rst: explain how to decode stack traces

2021-02-14 Thread Qais Yousef
out
> +the executed code. This makes it possible to pinpoint the exact line in the
> +source code that triggered the issue and shows how it was called. But that 
> only
> +works if you enabled CONFIG_DEBUG_INFO and CONFIG_KALLSYMS when configuring
> +your kernel. If you did so, consider to decode the information from the
> +kernel's log. That will make it a lot easier to understand what lead to the
> +'panic', 'oops', or 'warning', which increases the chances enormously that
> +someone can provide a fix.

I suggest removing the word enormously. It helps, but it all depends on the
particular circumstances. Sometimes it does, others it doesn't.

>  
> -   This section in the end should answer questions like "when is this 
> actually
> -   needed", "what .config options to ideally set earlier to make this step 
> easy
> -   or unnecessary?" (likely CONFIG_UNWINDER_ORC when it's available, 
> otherwise
> -   CONFIG_UNWINDER_FRAME_POINTER; but is there anything else needed?).
> +Decoding can be done with a script you find in the Linux source tree. If you
> +are running a kernel you compiled yourself earlier, call it like this::
>  
> -..
> +   [user@something ~]$ sudo dmesg | 
> ./linux-5.10.5/scripts/decode_stacktrace.sh ./linux-5.10.5/vmlinux
> +
> +If you are running a packaged vanilla kernel, you will likely have to install
> +the corresponding packages with debug symbols. Then call the script (which 
> you
> +might need to get from the Linux sources if your distro does not package it)
> +like this::
> +
> +   [user@something ~]$ sudo dmesg | 
> ./linux-5.10.5/scripts/decode_stacktrace.sh \
> +/usr/lib/debug/lib/modules/5.10.10-4.1.x86_64/vmlinux 
> /usr/src/kernels/5.10.10-4.1.x86_64/
> +
> +The script will work on log lines like the following, which show the address 
> of
> +the code the kernel was executing when the error occurred::
> +
> +   [   68.387301] RIP: 0010:test_module_init+0x5/0xffa [test_module]
> +
> +Once decoded, these lines will look like this::
> +
> +   [   68.387301] RIP: 0010:test_module_init 
> (/home/username/linux-5.10.5/test-module/test-module.c:16) test_module
> +
> +In this case the executed code was built from the file
> +'~/linux-5.10.5/test-module/test-module.c' and the error occurred by the
> +instructions found in line '16'.
>  
> -*If the failure includes a stack dump, like an Oops does, consider 
> decoding
> -it to find the offending line of code.*
> +The script will similarly decode the addresses mentioned in the section
> +starting with 'Call trace', which show the path to the function where the
> +problem occurred. Additionally, the script will show the assembler output for
> +the code section the kernel was executing.
>  
> -When the kernel detects an error, it will print a stack dump that allows to
> -identify the exact line of code where the issue happens. But that information
> -sometimes needs to get decoded to be readable, which is explained in
> -admin-guide/bug-hunting.rst.
> +Note, if you can't get this to work, simply skip this step and mention the
> +reason for it in the report. If you're lucky, it might not be needed. And if 
> it
> +is, someone might help you to get things going. Also be aware this is just 
> one
> +of several ways to decode kernel stack traces. Sometimes different steps will
> +be required to retrieve the relevant details. Don't worry about that, if 
> that's
> +needed in your case, developers will tell you what to do.

Ah, you already clarify nicely here that this is a good-to-have rather than
a must-have, as I was trying to allude to above :-)

This looks good to me in general. With the above minor nits fixed, feel free to
add my

Reviewed-by: Qais Yousef 

Thanks!

--
Qais Yousef

>  
>  
>  Special care for regressions
> -- 
> 2.29.2
> 


Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

2021-02-05 Thread Qais Yousef
On 02/04/21 10:46, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 01:01:57AM +, Alexey Klimov wrote:
> > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state 
> > target)
> > err = _cpu_up(cpu, 0, target);
> >  out:
> > cpu_maps_update_done();
> > +
> > +   /* To avoid out of line uevent */
> > +   if (!err)
> > +   cpuset_wait_for_hotplug();
> > +
> > return err;
> >  }
> >  
> 
> > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int 
> > cpu)
> > struct device *dev = get_cpu_device(cpu);
> >  
> > dev->offline = false;
> > -   /* Tell user space about the state change */
> > -   kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> >  }
> >  
> 
> One concequence of this is that you'll now get a bunch of notifications
> across things like suspend/hybernate.

And the resume latency will incur 5-30ms * nr_cpu_ids.

Since you just care about device_online(), isn't cpu_device_up() a better place
for the wait? That function is a special helper for device_online(), which would
leave the suspend/resume and kexec paths free from having to do this unnecessary
wait.
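
i.e. something like the below completely untested sketch (assuming
cpuset_wait_for_hotplug() is safe to call from here):

	int cpu_device_up(struct device *dev)
	{
		int err = cpu_up(dev->id, CPUHP_ONLINE);

		/*
		 * Wait for the cpuset cpumasks update to finish before user
		 * space sees the ONLINE uevent. Only device_online() callers
		 * pay the price; suspend/resume and kexec are left alone.
		 */
		if (!err)
			cpuset_wait_for_hotplug();

		return err;
	}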

Thanks

--
Qais Yousef


Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery

2021-02-04 Thread Qais Yousef
On 02/03/21 18:42, Valentin Schneider wrote:
> >> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq 
> >> *this_rq,
> >>active_load_balance_cpu_stop, busiest,
> >>   &busiest->active_balance_work);
> >>}
> >> -
> >> -  /* We've kicked active balancing, force task migration. 
> >> */
> >> -  sd->nr_balance_failed = sd->cache_nice_tries+1;
> >
> > This has an impact on future calls to need_active_balance() too, no? We 
> > enter
> > this path because need_active_balance() returned true; one of the 
> > conditions it
> > checks for is
> >
> > return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
> >
> > So since we used to reset nr_balanced_failed to cache_nice_tries+1, the 
> > above
> > condition would be false in the next call or two IIUC. But since we remove
> > that, we could end up here again soon.
> >
> > Was this intentional?
> >
> 
> Partially, I'd say :-)
> 
> If you look at active_load_balance_cpu_stop(), it does
> 
>   sd->nr_balance_failed = 0;
> 
> when it successfully pulls a task. So we get a reset of the failed counter
> on pull, which I've preserved. As for interactions with later
> need_active_balance(), the commit that introduced the current counter write
> (which is over 15 years old!):  
> 
>   3950745131e2 ("[PATCH] sched: fix SMT scheduling problems")
> 
> only states the task_hot() issue; thus I'm doubtful whether said
> interaction was intentional.

The '+1' was added in that commit. The 'original' code was just resetting
nr_balance_failed to cache_nice_tries, so that we don't do another active
balance too soon, I think.

With this change, no active balance is allowed until later. Which makes sense.
I can't see why we would have allowed another kick sooner tbh. But as you say,
this is an ancient piece of logic.

I agree I can't see a reason to worry about this (potential) change of
behavior.

Thanks

--
Qais Yousef


Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

2021-02-04 Thread Qais Yousef
On 02/03/21 18:59, Valentin Schneider wrote:
> On 03/02/21 17:23, Qais Yousef wrote:
> > On 01/27/21 19:30, Valentin Schneider wrote:
> >> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> >> unearthed one more outstanding issue. This doesn't even involve
> >> migrate_disable(), but rather affinity changes and execution of the stopper
> >> racing with each other.
> >> 
> >> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> >> errors at each level) is:
> >> 
> >>   Initial conditions:
> >> victim.cpus_mask = {CPU0, CPU1}
> >> 
> >>   CPU0 CPU1 
> >> CPU
> >> 
> >>   switch_to(victim)
> >>
> >> set_cpus_allowed(victim, {CPU1})
> >>  kick CPU0 
> >> migration_cpu_stop({.dest_cpu = CPU1})
> >>   switch_to(stopper/0)
> >>// e.g. CFS 
> >> load balance
> >>
> >> move_queued_task(CPU0, victim, CPU1);
> >>   switch_to(victim)
> >>
> >> set_cpus_allowed(victim, {CPU0});
> >>  
> >> task_rq_unlock();
> >>   migration_cpu_stop(dest_cpu=CPU1)
> >
> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
> >
> 
> Right
> 
> >> task_rq(p) != rq && pending
> >>   kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> >> 
> >>   switch_to(stopper/1)
> >>   migration_cpu_stop(dest_cpu=CPU1)
> >
> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
> >
> 
> Nein! This is a retriggering of the "current" stopper (triggered by
> set_cpus_allowed(victim, {CPU1})), see the tail of that
> 
>   else if (dest_cpu < 0 || pending)
> 
> branch in migration_cpu_stop(), is what I'm trying to hint at with that 
> 
> task_rq(p) != rq && pending

Okay I see. But AFAIU, the work will be queued in order. So we should first
handle the set_cpus_allowed_ptr(victim, {CPU0}) before the retrigger, no?

So I see migration_cpu_stop() running 3 times

1. because of set_cpus_allowed(victim, {CPU1}) on CPU0
2. because of set_cpus_allowed(victim, {CPU0}) on CPU1
3. because of retrigger of '1' on CPU0

Thanks

--
Qais Yousef


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-04 Thread Qais Yousef
On 02/03/21 18:35, Vincent Guittot wrote:
> > >   raw_spin_unlock(&this_rq->lock);
> > > - /*
> > > -  * This CPU is going to be idle and blocked load of idle CPUs
> > > -  * need to be updated. Run the ilb locally as it is a good
> > > -  * candidate for ilb instead of waking up another idle CPU.
> > > -  * Kick an normal ilb if we failed to do the update.
> > > -  */
> > > - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> >
> > Since we removed the call to this function (which uses this_rq)
> >
> > > - kick_ilb(NOHZ_STATS_KICK);
> > > + kick_ilb(NOHZ_STATS_KICK);
> >
> > And unconditionally call kick_ilb() which will find a suitable cpu to run 
> > the
> > lb at regardless what this_rq is.
> >
> > Doesn't the below become unnecessary now?
> 
> The end goal is to keep running on this cpu that is about to become idle.
> 
> This patch is mainly  there to check that Joel's problem disappears if
> the update of the blocked load of the cpus is not done in the
> newidle_balance context. I'm preparing few other patches on top to
> clean up the full path

+1

> >
> >   10494 /*
> >   10495  * This CPU doesn't want to be disturbed by 
> > scheduler
> >   10496  * housekeeping
> >   10497  */
> >   10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
> >   10499 return;
> >   10500
> >   10501 /* Will wake up very soon. No time for doing 
> > anything else*/
> >   10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
> >   10503 return;
> >
> > And we can drop this_rq arg altogether?
> >
> > >   raw_spin_lock(&this_rq->lock);
> > >  }
> > >
> > > @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, 
> > > struct rq_flags *rf)
> > >   update_next_balance(sd, &next_balance);
> > >   rcu_read_unlock();
> > >
> > > - nohz_newidle_balance(this_rq);
> > > -
> > >   goto out;
> > >   }
> > >
> > > @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, 
> > > struct rq_flags *rf)
> > >
> > >   if (pulled_task)
> > >   this_rq->idle_stamp = 0;
> > > + else
> > > + nohz_newidle_balance(this_rq);
> >
> > Since nohz_newidle_balance() will not do any real work now, I couldn't 
> > figure
> > out what moving this here achieves. Fault from my end to parse the change 
> > most
> > likely :-)
> 
> The goal is to schedule the update only if we are about to be idle and
> nothing else has been queued in the meantime

I see. This shortcoming already existed and is not *strictly* related to moving
the update of blocked load out of newidle balance.

Thanks

--
Qais Yousef


Re: [RFC PATCH] sched/core: Fix premature p->migration_pending completion

2021-02-03 Thread Qais Yousef
On 01/27/21 19:30, Valentin Schneider wrote:
> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> unearthed one more outstanding issue. This doesn't even involve
> migrate_disable(), but rather affinity changes and execution of the stopper
> racing with each other.
> 
> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> errors at each level) is:
> 
>   Initial conditions:
> victim.cpus_mask = {CPU0, CPU1}
> 
>   CPU0 CPU1 CPU<don't care>
> 
>   switch_to(victim)
>   
> set_cpus_allowed(victim, {CPU1})
> kick CPU0 
> migration_cpu_stop({.dest_cpu = CPU1})
>   switch_to(stopper/0)
>   // e.g. CFS 
> load balance
>   
> move_queued_task(CPU0, victim, CPU1);
>  switch_to(victim)
>   
> set_cpus_allowed(victim, {CPU0});
> 
> task_rq_unlock();
>   migration_cpu_stop(dest_cpu=CPU1)

This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?

> task_rq(p) != rq && pending
>   kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> 
>  switch_to(stopper/1)
>  migration_cpu_stop(dest_cpu=CPU1)

And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?

Unless I missed something, dest_cpu should be CPU0 too, not CPU1, and the
task should be moved back to CPU0 as expected?

Thanks

--
Qais Yousef

>task_rq(p) == rq && pending
>  __migrate_task(dest_cpu) // no-op
>complete_all() <-- !!! affinity is {CPU0} 
> !!!
> 
> I believe there are two issues there:
> - retriggering of migration_cpu_stop() from within migration_cpu_stop()
>   itself doesn't change arg.dest_cpu
> - we'll issue a complete_all() in the task_rq(p) == rq path of
>   migration_cpu_stop() even if the dest_cpu has been superseded by a
>   further affinity change.
> 
> Something similar could happen with NUMA's migrate_task_to(), and arguably
> any other user of migration_cpu_stop() with a .dest_cpu >= 0.
> Consider:
> 
>   CPU0    CPUX
> 
>   switch_to(victim)
>   migrate_task_to(victim, CPU1)
> kick CPU0 
> migration_cpu_stop({.dest_cpu = CPU1})
> 
>   set_cpus_allowed(victim, {CPU42})
> task_rq_unlock();
>   switch_to(stopper/0)
>   migration_cpu_stop(dest_cpu=CPU1)
> task_rq(p) == rq && pending
>   __migrate_task(dest_cpu)
> complete_all() <-- !!! affinity is {CPU42} !!!
> 
> Prevent such premature completions by ensuring the dest_cpu in
> migration_cpu_stop() is in the task's allowed cpumask.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/core.c | 32 
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06b449942adf..b57326b0a742 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1923,20 +1923,28 @@ static int migration_cpu_stop(void *data)
>   complete = true;
>   }
>  
> - /* migrate_enable() --  we must not race against SCA */
> - if (dest_cpu < 0) {
> - /*
> -  * When this was migrate_enable() but we no longer
> -  * have a @pending, a concurrent SCA 'fixed' things
> -  * and we should be valid again. Nothing to do.
> -  */
> - if (!pending) {
> - WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), 
> &p->cpus_mask));
> - goto out;
> - }
> +/*
> + * When this was migrate_enable() but we no longer
> + * have a @pending, a concurrent SCA 'fixed' things
> + * and we should be valid again.
> + *
> + * This can also be a stopper invocation that was 'fixed' by an
> + * earlier one.
> + *
> +

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-02-03 Thread Qais Yousef
On 01/29/21 18:27, Vincent Guittot wrote:
> The patch below moves the update of the blocked load of CPUs outside 
> newidle_balance().
> 
> Instead, the update is done with the usual idle load balance update. I'm 
> working on an
> additonnal patch that will select this cpu that is about to become idle, 
> instead of a
> random idle cpu but this 1st step fixe the problem of lot of update in newly 
> idle.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 32 +++-
>  1 file changed, 3 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..8200b1d4df3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7421,8 +7421,6 @@ enum migration_type {
>  #define LBF_NEED_BREAK   0x02
>  #define LBF_DST_PINNED  0x04
>  #define LBF_SOME_PINNED  0x08
> -#define LBF_NOHZ_STATS   0x10
> -#define LBF_NOHZ_AGAIN   0x20
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   struct rq *rq = cpu_rq(i);
>  
> - if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, 
> false))
> - env->flags |= LBF_NOHZ_AGAIN;
> -
>   sgs->group_load += cpu_load(rq);
>   sgs->group_util += cpu_util(i);
>   sgs->group_runnable += cpu_runnable(rq);
> @@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   struct sg_lb_stats tmp_sgs;
>   int sg_status = 0;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
> - env->flags |= LBF_NOHZ_STATS;
> -#endif
> -
>   do {
> struct sg_lb_stats *sgs = &tmp_sgs;
>   int local_group;
> @@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
>   /* Tag domain that child domain prefers tasks go to siblings first */
>   sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
>  
> -#ifdef CONFIG_NO_HZ_COMMON
> - if ((env->flags & LBF_NOHZ_AGAIN) &&
> - cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
> -
> - WRITE_ONCE(nohz.next_blocked,
> -jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
> - }
> -#endif
> -
>   if (env->sd->flags & SD_NUMA)
>   env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
> @@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>   return;
>  
>   raw_spin_unlock(&this_rq->lock);
> - /*
> -  * This CPU is going to be idle and blocked load of idle CPUs
> -  * need to be updated. Run the ilb locally as it is a good
> -  * candidate for ilb instead of waking up another idle CPU.
> -  * Kick an normal ilb if we failed to do the update.
> -  */
> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))

Since we removed the call to this function (which uses this_rq)

> - kick_ilb(NOHZ_STATS_KICK);
> + kick_ilb(NOHZ_STATS_KICK);

And unconditionally call kick_ilb(), which will find a suitable cpu to run the
lb on regardless of what this_rq is.

Doesn't the below become unnecessary now?

  10494 /*
  10495  * This CPU doesn't want to be disturbed by scheduler
  10496  * housekeeping
  10497  */
  10498 if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
  10499 return;
  10500
  10501 /* Will wake up very soon. No time for doing anything 
else*/
  10502 if (this_rq->avg_idle < sysctl_sched_migration_cost)
  10503 return;

And we can drop this_rq arg altogether?

>   raw_spin_lock(&this_rq->lock);
>  }
>  
> @@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>   update_next_balance(sd, &next_balance);
>   rcu_read_unlock();
>  
> - nohz_newidle_balance(this_rq);
> -
>   goto out;
>   }
>  
> @@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct 
> rq_flags *rf)
>  
>   if (pulled_task)
>   this_rq->idle_stamp = 0;
> + else
> + nohz_newidle_balance(this_rq);

Since nohz_newidle_balance() will not do any real work now, I couldn't figure
out what moving this here achieves. Fault from my end to parse the change most
likely :-)

Joel can still test this patch as is of course. This is just an early review
since I already spent the time trying to understand it.

Thanks

--
Qais Yousef

>  
>   rq_repin_lock(this_rq, rf);
>  
> -- 
> 2.17.1


Re: [PATCH 1/8] sched/fair: Clean up active balance nr_balance_failed trickery

2021-02-03 Thread Qais Yousef
Hi Valentin

On 01/28/21 18:31, Valentin Schneider wrote:
> When triggering an active load balance, sd->nr_balance_failed is set to
> such a value that any further can_migrate_task() using said sd will ignore
> the output of task_hot().
> 
> This behaviour makes sense, as active load balance intentionally preempts a
> rq's running task to migrate it right away, but this asynchronous write is
> a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
> before the sd->nr_balance_failed write either becomes visible to the
> stopper's CPU or even happens on the CPU that appended the stopper work.
> 
> Add a struct lb_env flag to denote active balancing, and use it in
> can_migrate_task(). Remove the sd->nr_balance_failed write that served the
> same purpose.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 197a51473e0c..0f6a4e58ce3c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7423,6 +7423,7 @@ enum migration_type {
>  #define LBF_SOME_PINNED  0x08
>  #define LBF_NOHZ_STATS   0x10
>  #define LBF_NOHZ_AGAIN   0x20
> +#define LBF_ACTIVE_LB0x40
>  
>  struct lb_env {
>   struct sched_domain *sd;
> @@ -7608,10 +7609,14 @@ int can_migrate_task(struct task_struct *p, struct 
> lb_env *env)
>  
>   /*
>* Aggressive migration if:
> -  * 1) destination numa is preferred
> -  * 2) task is cache cold, or
> -  * 3) too many balance attempts have failed.
> +  * 1) active balance
> +  * 2) destination numa is preferred
> +  * 3) task is cache cold, or
> +  * 4) too many balance attempts have failed.
>*/
> + if (env->flags & LBF_ACTIVE_LB)
> + return 1;
> +
>   tsk_cache_hot = migrate_degrades_locality(p, env);
>   if (tsk_cache_hot == -1)
>   tsk_cache_hot = task_hot(p, env);
> @@ -9805,9 +9810,6 @@ static int load_balance(int this_cpu, struct rq 
> *this_rq,
>   active_load_balance_cpu_stop, busiest,
>   &busiest->active_balance_work);
>   }
> -
> - /* We've kicked active balancing, force task migration. 
> */
> - sd->nr_balance_failed = sd->cache_nice_tries+1;

This has an impact on future calls to need_active_balance() too, no? We enter
this path because need_active_balance() returned true; one of the conditions it
checks for is

return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);

So since we used to reset nr_balance_failed to cache_nice_tries+1, the above
condition would be false in the next call or two IIUC. But since we removed
that, we could end up here again soon.

Was this intentional?

Thanks

--
Qais Yousef

>   }
>   } else {
>   sd->nr_balance_failed = 0;
> @@ -9963,7 +9965,8 @@ static int active_load_balance_cpu_stop(void *data)
>* @dst_grpmask we need to make that test go away with 
> lying
>* about DST_PINNED.
>*/
> - .flags  = LBF_DST_PINNED,
> + .flags  = LBF_DST_PINNED |
> +   LBF_ACTIVE_LB,
>   };
>  
>   schedstat_inc(sd->alb_count);
> -- 
> 2.27.0
> 


Re: [PATCH 2/8] sched/fair: Add more sched_asym_cpucapacity static branch checks

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> Rik noted a while back that a handful of
> 
>   sd->flags & SD_ASYM_CPUCAPACITY
> 
> & family in the CFS load-balancer code aren't guarded by the
> sched_asym_cpucapacity static branch.
> 
> The load-balancer is already doing a humongous amount of work, but turning
> those checks into NOPs for those who don't need it is fairly
> straightforward, so do that.
> 
> Suggested-by: Rik van Riel 
> Signed-off-by: Valentin Schneider 
> ---

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>  kernel/sched/fair.c  | 11 ++-
>  kernel/sched/sched.h |  6 ++
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f6a4e58ce3c..7d01fa0bfc7e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8465,7 +8465,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   continue;
>  
>   /* Check for a misfit task on the cpu */
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
>   sgs->group_misfit_task_load < rq->misfit_task_load) {
>   sgs->group_misfit_task_load = rq->misfit_task_load;
>   *sg_status |= SG_OVERLOAD;
> @@ -8522,7 +8522,8 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>* CPUs in the group should either be possible to resolve
>* internally or be covered by avg_load imbalance (eventually).
>*/
> - if (sgs->group_type == group_misfit_task &&
> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
> + sgs->group_type == group_misfit_task &&
>   (!group_smaller_max_cpu_capacity(sg, sds->local) ||
>sds->local_stat.group_type != group_has_spare))
>   return false;
> @@ -8605,7 +8606,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>* throughput. Maximize throughput, power/energy consequences are not
>* considered.
>*/
> - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
>   (sgs->group_type <= group_fully_busy) &&
>   (group_smaller_min_cpu_capacity(sds->local, sg)))
>   return false;
> @@ -8728,7 +8729,7 @@ static inline void update_sg_wakeup_stats(struct 
> sched_domain *sd,
>   }
>  
>   /* Check if task fits in the group */
> - if (sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(sd) &&
>   !task_fits_capacity(p, group->sgc->max_capacity)) {
>   sgs->group_misfit_task_load = 1;
>   }
> @@ -9419,7 +9420,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>* Higher per-CPU capacity is considered better than balancing
>* average load.
>*/
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + if (sd_has_asym_cpucapacity(env->sd) &&
>   capacity_of(env->dst_cpu) < capacity &&
>   nr_running == 1)
>   continue;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 045b01064c1e..21bd71f58c06 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1482,6 +1482,12 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, 
> sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  extern struct static_key_false sched_asym_cpucapacity;
>  
> +static inline bool sd_has_asym_cpucapacity(struct sched_domain *sd)
> +{
> + return static_branch_unlikely(&sched_asym_cpucapacity) &&
> + sd->flags & SD_ASYM_CPUCAPACITY;
> +}
> +
>  struct sched_group_capacity {
>   atomic_t ref;
>   /*
> -- 
> 2.27.0
> 


Re: [PATCH 7/8] sched/fair: Attempt misfit active balance when migration_type != migrate_misfit

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and CPUs 2-3 as
> bigs. The resulting sched_domain hierarchy is:
> 
>   DIE [  ]
>   MC  [][]
>0  1  2  3
> 
> When running a multithreaded CPU-bound workload (i.e. 1 hog per CPU), the
> expected behaviour is to have the about-to-idle big CPUs pull a hog from
> the LITTLEs, since bigs will complete their work sooner than LITTLEs.
> 
> Further Consider a scenario where:
> - CPU0 is idle (e.g. its hog got migrated to one of the big CPUs)
> - CPU1 is currently executing a per-CPU kworker, preempting the CPU hog
> - CPU2 and CPU3 are executing CPU-hogs
> 
> CPU0 goes through load_balance() at MC level, and tries to pick stuff from
> CPU1, but:
> - the hog can't be pulled, because it's task_hot()
> - the kworker can't be pulled, because it's pinned to CPU1, which sets
>   LBF_SOME_PINNED
> 
> This load balance attempts ends with no load pulled, LBF_SOME_PINNED set,
> and as a consequence we set the imbalance flag of DIE's [0, 1]
> sched_group_capacity.
> 
> Shortly after, CPU2 completes its work and is about to go idle. It goes
> through the newidle_balance(), and we would really like it to active
> balance the hog running on CPU1 (which is a misfit task). However,
> sgc->imbalance is set for the LITTLE group at DIE level, so the group gets
> classified as group_imbalanced rather than group_misfit_task.
> 
> Unlike group_misfit_task (via migrate_misfit), the active balance logic
> doesn't have any specific case for group_imbalanced, so CPU2 ends up going
> idle. We'll have to wait for a load balance on CPU0 or CPU1 to happen and
> clear the imbalance flag, and then for another DIE-level load-balance on
> CPU2 to happen to pull the task off of CPU1. That's several precious
> milliseconds wasted down the drain.

I think that might help in games where some worker threads could end up on
little cores and this delay in upmigration could cause frames to drop.

> 
> Giving group_misfit_task a higher group_classify() priority than
> group_imbalance doesn't seem like the right thing to do. Instead, make
> need_active_balance() return true for any migration_type when the

s/need_active_balance()/voluntary_active_balance()/?

> destination CPU is idle and the source CPU has a misfit task.
> 
> While at it, add an sd_has_asym_cpucapacity() guard in
> need_active_balance().

ditto.

> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0ac2f876b86f..cba9f97d9beb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9557,9 +9557,22 @@ static int need_active_balance(struct lb_env *env)
>   return 1;
>   }
>  
> + if (!sd_has_asym_cpucapacity(sd))
> + return 0;
> +
>   if (env->migration_type == migrate_misfit)
>   return 1;
>  
> + /*
> +  * If we failed to pull anything and the src_rq has a misfit task, but
> +  * the busiest group_type was higher than group_misfit_task, try to
> +  * go for a misfit active balance anyway.
> +  */
> + if ((env->idle != CPU_NOT_IDLE) &&
> + env->src_rq->misfit_task_load &&
> + cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> + return 1;
> +

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>   return 0;
>  }
>  
> -- 
> 2.27.0
> 


Re: [PATCH 6/8] sched/fair: Filter out locally-unsolvable misfit imbalances

2021-02-03 Thread Qais Yousef
load,
> + rq->misfit_task_load);

nit: missing curly braces around the if.

>   }
>  
>   /* Check if dst CPU is idle and preferred to this group */
> @@ -8504,7 +8521,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   /* Don't try to pull misfit tasks we can't help */
>   if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>   sgs->group_type == group_misfit_task &&
> - (!capacity_greater(capacity_of(env->dst_cpu), 
> sg->sgc->max_capacity) ||
> + (!sgs->group_misfit_task_load ||
>sds->local_stat.group_type != group_has_spare))
>   return false;
>  
> @@ -9464,15 +9481,18 @@ static struct rq *find_busiest_queue(struct lb_env 
> *env,
>   case migrate_misfit:
>   /*
>* For ASYM_CPUCAPACITY domains with misfit tasks we
> -      * simply seek the "biggest" misfit task.
> +  * simply seek the "biggest" misfit task we can
> +  * accommodate.
>*/
> + if (!cpu_capacity_greater(env->dst_cpu, i))
> + continue;

Both this hunk and the one above mean we will end up searching harder to pull
the task into the right cpu taking actual capacity into account. Which is
a good improvement.

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

> +
>   if (rq->misfit_task_load > busiest_load) {
>   busiest_load = rq->misfit_task_load;
>   busiest = rq;
>   }
>  
>   break;
> -
>   }
>   }
>  
> -- 
> 2.27.0
> 


Re: [PATCH 8/8] sched/fair: Relax task_hot() for misfit tasks

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> Misfit tasks can and will be preempted by the stopper to migrate them over
> to a higher-capacity CPU. However, when runnable but not current misfit
> tasks are scanned by the load balancer (i.e. detach_tasks()), the
> task_hot() ratelimiting logic may prevent us from enqueuing said task onto
> a higher-capacity CPU.
> 
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.

Good catch.

> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cba9f97d9beb..c2351b87824f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7484,6 +7484,17 @@ static int task_hot(struct task_struct *p, struct 
> lb_env *env)
>   if (env->sd->flags & SD_SHARE_CPUCAPACITY)
>   return 0;
>  
> + /*
> +  * On a (sane) asymmetric CPU capacity system, the increase in compute
> +  * capacity should offset any potential performance hit caused by a
> +  * migration.
> +  */
> + if (sd_has_asym_cpucapacity(env->sd) &&
> + env->idle != CPU_NOT_IDLE &&
> + !task_fits_capacity(p, capacity_of(env->src_cpu)) &&

Note that for a very busy task that is running on the biggest cpu, this check
will always be true.

> + cpu_capacity_greater(env->dst_cpu, env->src_cpu))

But this will save us from triggering unnecessary migration.

We could swap them and optimize for this particular case, but tbh this is the
type of micro optimization that is hard to know whether it makes a real
difference or not..
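
FWIW the swap would be something like (untested):

	if (sd_has_asym_cpucapacity(env->sd) &&
	    env->idle != CPU_NOT_IDLE &&
	    cpu_capacity_greater(env->dst_cpu, env->src_cpu) &&
	    !task_fits_capacity(p, capacity_of(env->src_cpu)))
		return 0;

so the capacity comparison short-circuits the fits check for the always-busy
task running on the biggest cpu.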

Anyways, this looks good to me.

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

> + return 0;
> +
>   /*
>* Buddy candidates are cache hot:
>*/
> -- 
> 2.27.0
> 


Re: [PATCH 5/8] sched/fair: Make check_misfit_status() only compare dynamic capacities

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> check_misfit_status() checks for both capacity pressure & available CPUs
> with higher capacity. Now that we have a sane(ish) capacity comparison
> margin which is used throughout load-balance, this can be condensed into a
> single check:
> 
>   capacity_greater(, );
> 
> This has the added benefit of returning false if the misfit task CPU's is
> heavily pressured, but there are no better candidates for migration.
> 
> Signed-off-by: Valentin Schneider 
> ---

check_cpu_capacity() call looks redundant now, yes.

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>  kernel/sched/fair.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0959a770ecc0..ef44474b8fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8254,14 +8254,12 @@ check_cpu_capacity(struct rq *rq, struct sched_domain 
> *sd)
>  
>  /*
>   * Check whether a rq has a misfit task and if it looks like we can actually
> - * help that task: we can migrate the task to a CPU of higher capacity, or
> - * the task's current CPU is heavily pressured.
> + * help that task: we can migrate the task to a CPU of higher capacity.
>   */
> -static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> +static inline int check_misfit_status(struct rq *rq)
>  {
>   return rq->misfit_task_load &&
> - (capacity_greater(rq->rd->max_cpu_capacity, 
> rq->cpu_capacity_orig) ||
> -  check_cpu_capacity(rq, sd));
> + capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity);
>  }
>  
>  /*
> @@ -10238,7 +10236,7 @@ static void nohz_balancer_kick(struct rq *rq)
>* When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>* to run the misfit task on.
>*/
> - if (check_misfit_status(rq, sd)) {
> + if (check_misfit_status(rq)) {
>   flags = NOHZ_KICK_MASK;
>   goto unlock;
>   }
> -- 
> 2.27.0
> 


Re: [PATCH 4/8] sched/fair: Use dst_cpu's capacity rather than group {min, max} capacity

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> Comparing capacity extrema of local and source sched_group's doesn't make
> much sense when at the day of the day the imbalance will be pulled by a
> known env->dst_cpu, whose capacity can be anywhere within the local group's
> capacity extrema.
> 
> Replace group_smaller_{min, max}_cpu_capacity() with comparisons of the
> source group's min/max capacity and the destination CPU's capacity.
> 
> Signed-off-by: Valentin Schneider 
> ---

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>  kernel/sched/fair.c | 31 +++
>  1 file changed, 3 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 58ce0b22fcb0..0959a770ecc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8352,26 +8352,6 @@ group_is_overloaded(unsigned int imbalance_pct, struct 
> sg_lb_stats *sgs)
>   return false;
>  }
>  
> -/*
> - * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity than sched_group ref.
> - */
> -static inline bool
> -group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group 
> *ref)
> -{
> - return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
> -}
> -
> -/*
> - * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
> - * per-CPU capacity_orig than sched_group ref.
> - */
> -static inline bool
> -group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group 
> *ref)
> -{
> - return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
> -}
> -
>  static inline enum
>  group_type group_classify(unsigned int imbalance_pct,
> struct sched_group *group,
> @@ -8523,15 +8503,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   if (!sgs->sum_h_nr_running)
>   return false;
>  
> - /*
> -  * Don't try to pull misfit tasks we can't help.
> -  * We can use max_capacity here as reduction in capacity on some
> -  * CPUs in the group should either be possible to resolve
> -  * internally or be covered by avg_load imbalance (eventually).
> -  */
> + /* Don't try to pull misfit tasks we can't help */
>   if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>   sgs->group_type == group_misfit_task &&
> - (!group_smaller_max_cpu_capacity(sg, sds->local) ||
> + (!capacity_greater(capacity_of(env->dst_cpu), 
> sg->sgc->max_capacity) ||
>sds->local_stat.group_type != group_has_spare))
>   return false;
>  
> @@ -8615,7 +8590,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>*/
>   if (sd_has_asym_cpucapacity(env->sd) &&
>   (sgs->group_type <= group_fully_busy) &&
> - (group_smaller_min_cpu_capacity(sds->local, sg)))
> + (capacity_greater(sg->sgc->min_capacity, 
> capacity_of(env->dst_cpu))))
>   return false;
>  
>   return true;
> -- 
> 2.27.0
> 


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
> 
>   group_smaller_max_cpu_capacity(, );
> 
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
> 
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
> 
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
> 
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)  ((cap) * 1280 < (max) * 1024)
>  
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

nit: can we use cap1 and cap2 and make the implementation use '>' instead of
'<'? ie:

#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

this is more intuitive to read IMHO. Especially since a few lines below we have

return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);

which passes 'ref->...' as cap, which can be confusing when looking at the
function signature of @ref.

Either way, this LGTM

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain 
> *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>   return rq->misfit_task_load &&
> - (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> + (capacity_greater(rq->rd->max_cpu_capacity, 
> rq->cpu_capacity_orig) ||
>check_cpu_capacity(rq, sd));
>  }
>  
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct 
> sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group 
> *ref)
>  {
> - return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> + return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>  
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, 
> struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group 
> *ref)
>  {
> - return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> + return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>  
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>* average load.
>*/
>   if (sd_has_asym_cpucapacity(env->sd) &&
> - capacity_of(env->dst_cpu) < capacity &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>   nr_running == 1)
>   continue;
>  
> -- 
> 2.27.0
> 


Re: [PATCH 0/8] sched/fair: misfit task load-balance tweaks

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> Hi folks,
> 
> Here is this year's series of misfit changes. On the menu:
> 
> o Patch 1 is an independent active balance cleanup
> o Patch 2 adds some more sched_asym_cpucapacity static branches
> o Patch 3 introduces yet another margin for capacity to capacity
>   comparisons
> o Patches 4-6 build on top of patch 3 and change capacity comparisons
>   throughout misfit load balancing  
> o Patches 7-8 fix some extra misfit issues I've been seeing on "real"
>   workloads.
> 
> IMO the somewhat controversial bit is patch 3, because it attempts to solve
> margin issues by... Adding another margin. This does solve issues on
> existing platforms (e.g. Pixel4), but we'll be back to square one the day
> some "clever" folks spin a platform with two different CPU capacities less 
> than
> 5% apart.

One more margin is a cause of apprehension to me. But in this case I think it
is the appropriate thing to do now. I can't think of a scenario where this
could hurt.

Thanks

--
Qais Yousef


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-28 Thread Qais Yousef
On 01/28/21 10:09, Joel Fernandes wrote:
> > > > Qais mentioned half of the time being used by
> > > > sugov_next_freq_shared(). Are there any frequency changes resulting in
> > > > this call ?
> > >
> > > I do not see a frequency update happening at the time of the problem. 
> > > However
> > > note that sugov_iowait_boost() does run even if frequency is not being
> > > updated. IIRC, this function is also not that light weight and I am not 
> > > sure
> > > if it is a good idea to call this that often.
> >
> > Scheduler can't make any assumption about how often schedutil/cpufreq
> > wants to be called. Some are fast and straightforward and can be
> > called very often to adjust frequency; Others can't handle much
> > updates. The rate limit mechanism in schedutil and io-boost should be
> > there for such purpose.
> 
> Sure, I know that's the intention.

How about the below patch to help with reducing the impact of
sugov_update_shared()?

I basically made sure it'll bail out immediately if it gets 2 calls within the
defined rate_limit_us value.

And tweaked the way we call cpufreq_update_util() from
update_blocked_averages() too so that we first update blocked load on all cpus,
then we ask for the frequency update. Combined with the above, this should
result in a single call to sugov_update_shared() for each policy. Each call
should see the final state of what the load really is.

Only compile tested!

Thanks

--
Qais Yousef

->8-


diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6931f0cdeb80..bd83e9343749 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -109,7 +109,6 @@ static bool sugov_update_next_freq(struct sugov_policy 
*sg_policy, u64 time,
return false;
 
sg_policy->next_freq = next_freq;
-   sg_policy->last_freq_update_time = time;
 
return true;
 }
@@ -554,20 +553,23 @@ sugov_update_shared(struct update_util_data *hook, u64 
time, unsigned int flags)
 
raw_spin_lock(&sg_policy->update_lock);
 
+   ignore_dl_rate_limit(sg_cpu, sg_policy);
+
+   if (!sugov_should_update_freq(sg_policy, time))
+   goto out;
+
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
 
-   ignore_dl_rate_limit(sg_cpu, sg_policy);
-
-   if (sugov_should_update_freq(sg_policy, time)) {
-   next_f = sugov_next_freq_shared(sg_cpu, time);
+   next_f = sugov_next_freq_shared(sg_cpu, time);
 
-   if (sg_policy->policy->fast_switch_enabled)
-   sugov_fast_switch(sg_policy, time, next_f);
-   else
-   sugov_deferred_update(sg_policy, time, next_f);
-   }
+   if (sg_policy->policy->fast_switch_enabled)
+   sugov_fast_switch(sg_policy, time, next_f);
+   else
+   sugov_deferred_update(sg_policy, time, next_f);
 
+   sg_policy->last_freq_update_time = time;
+out:
raw_spin_unlock(&sg_policy->update_lock);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..47bd8be03a6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8033,7 +8033,7 @@ static unsigned long task_h_load(struct task_struct *p)
 }
 #endif
 
-static void update_blocked_averages(int cpu)
+static void update_blocked_averages(int cpu, bool update_freq)
 {
bool decayed = false, done = true;
struct rq *rq = cpu_rq(cpu);
@@ -8046,7 +8046,7 @@ static void update_blocked_averages(int cpu)
decayed |= __update_blocked_fair(rq, &done);
 
update_blocked_load_status(rq, !done);
-   if (decayed)
+   if (decayed && update_freq)
cpufreq_update_util(rq, 0);
rq_unlock_irqrestore(rq, &rf);
 }
@@ -8384,7 +8384,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
return true;
 
-   update_blocked_averages(cpu);
+   update_blocked_averages(cpu, false);
 
return rq->has_blocked_load;
 #else
@@ -8454,6 +8454,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
 
+   /*
+* We did a bunch a blocked average updates, let cpufreq know about
+* them in one go. For system with a lot of cpus on a frequency domain
+* this will make sure we take all the changes that affects the policy
+* in one go.
+*/
+   for_each_cpu_and(i, sched_group_span(group), env->cpus)
+   cpufreq_update_util(cpu_rq(i), 0);
+
/* Check if dst CPU is idle and preferred to this group */
if (env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
@@ -1046

Re: [PATCH 0/2] Fix BUG: Invalid wait context in hrtimer_interrupt()

2021-01-26 Thread Qais Yousef
On 01/26/21 17:23, Peter Zijlstra wrote:
> On Wed, Jan 27, 2021 at 12:58:33AM +0900, Sergey Senozhatsky wrote:
> > On (21/01/26 14:59), Qais Yousef wrote:
> 
> > >   # [67628.388606] hrtimer: interrupt took 304720 ns
> > >   [67628.393546]
> > >   [67628.393550] =
> > >   [67628.393554] [ BUG: Invalid wait context ]
> > >   [67628.393557] 5.11.0-rc3-00019-g86be331946f7 #37 Not tainted
> > >   [67628.393560] -
> > >   [67628.393563] sugov:0/192 is trying to lock:
> > >   [67628.393566] 000800b1d898 (&port_lock_key){-.-.}-{3:3}, at: 
> > > pl011_console_write+0x138/0x218
> > > pl011_console_write+0x138/0x218
> > >   [67628.393581] other info that might help us debug this:
> > >   [67628.393584] context-{2:2}
> > >   [67628.393586] 4 locks held by sugov:0/192:
> > >   [67628.393589]  #0: 0008059cb720 
> > > (&sg_policy->work_lock){+.+.}-{4:4}, at: sugov_work+0x58/0x88
> > >   [67628.393603]  #1: 800015446f20 (prepare_lock){+.+.}-{4:4}, at: 
> > > clk_prepare_lock+0x34/0xb0
> > >   [67628.393618]  #2: 8000152aaa60 (console_lock){+.+.}-{0:0}, at: 
> > > vprintk_emit+0x12c/0x310
> > >   [67628.393632]  #3: 8000152aab88 (console_owner){-.-.}-{0:0}, at: 
> > > console_unlock+0x190/0x6d8
> 
> > > Did I miss something?
> > 
> > printk() is not permitted to sleep/schedule/etc and it never does.
> > Generally it should be OK to call it from IRQ (module recursion paths).
> 
> The report is that it is trying to acquire spin_lock() while holding
> raw_spin_lock(), which is an invalid lock nesting.
> 
> Note that this is CONFIG_PROVE_RAW_LOCK_NESTING=y which specifically
> checks for this.
> 
> On current (mainline) kernel configs this is not yet a problem, but the
> moment we do land PREEMPT_RT this order will be problematic.

I should have dug more into the history of printk() and the meaning of the
splat. Sorry for the noise.

Looking at v5.10.8-rt24 the following fix is applied in RT


https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/drivers/tty/serial/amba-pl011.c?h=linux-5.10.y-rt=008cc77aff249e830e5eb90b7ae3a6784597b8cf

which is what John suggested.

Looking at the locks held

> > >   [67628.393589]  #0: 0008059cb720 
> > > (_policy->work_lock){+.+.}-{4:4}, at: sugov_work+0x58/0x88
> > >   [67628.393603]  #1: 800015446f20 (prepare_lock){+.+.}-{4:4}, at: 
> > > clk_prepare_lock+0x34/0xb0

These two are mutexes.

> > >   [67628.393618]  #2: 8000152aaa60 (console_lock){+.+.}-{0:0}, at: 
> > > vprintk_emit+0x12c/0x310

This is a semaphore.

> > >   [67628.393632]  #3: 8000152aab88 (console_owner){-.-.}-{0:0}, at: 
> > > console_unlock+0x190/0x6d8

I think this is acquired by console_lock_spinning_enable() which has acquiring
syntax I'm not familiar with. console_owner_lock is defined as RAW_SPINLOCK, so
regardless of how it is acquired, it must be the problem.

Looks like John has reworked this code in RT too. So maybe this is just a red
herring after all..


https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/kernel/printk/printk.c?h=linux-5.10.y-rt=0097798fd99948d3ffea535005eee7eb3b14fd06

Thanks

--
Qais Yousef


Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-26 Thread Qais Yousef
On 01/25/21 14:23, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 19:39, Qais Yousef  wrote:
> >
> > On 01/22/21 17:56, Vincent Guittot wrote:
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 04a3ce20da67..fe2dc0024db5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > > > force)
> > > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > > return false;
> > > >
> > > > -   if (!force && !time_after(jiffies, 
> > > > rq->last_blocked_load_update_tick))
> > > > +   if (!force && !time_after(jiffies, 
> > > > rq->last_blocked_load_update_tick + (HZ/20)))
> > >
> > > This condition is there to make sure to update blocked load at most
> > > once a tick in order to filter newly idle case otherwise the rate
> > > limit is already done by load balance interval
> > > This hard coded (HZ/20) looks really like an ugly hack
> >
> > This was meant as an RFC patch to discuss the problem really.
> >
> > Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
> > processing __update_blocked_fair() and the other half in 
> > sugov_update_shared().
> > So roughly 50us each. Note that each function is calling an iterator in
> 
> Can I assume that a freq change happens if sugov_update_shared() takes 50us ?
> which would mean that the update was useful at the end ?

I couldn't reproduce his problem on Juno. But I think it is not actually doing
any frequency update. sugov_update_shared() is rate limited by
sugov_should_update_freq(). Joel has a 1ms rate limit for schedutil in sysfs.
The function traces showed that it is continuously doing the full scan which
indicates that sugov_update_next_freq() is continuously bailing out at

	else if (sg_policy->next_freq == next_freq)
return false;
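
For reference, that rate limit boils down to something like this (a simplified
sketch of sugov_should_update_freq(), not the exact code):

	static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
	{
		s64 delta_ns;

		if (!cpufreq_this_cpu_can_update(sg_policy->policy))
			return false;

		if (unlikely(sg_policy->limits_changed)) {
			sg_policy->limits_changed = false;
			sg_policy->need_freq_update = true;
			return true;
		}

		/* rate_limit_us from sysfs ends up in freq_update_delay_ns */
		delta_ns = time - sg_policy->last_freq_update_time;

		return delta_ns >= sg_policy->freq_update_delay_ns;
	}

So with a 1ms rate_limit_us we can still end up doing the full scan roughly
every 1ms even when sugov_update_next_freq() ends up changing nothing.
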

> 
> > return. Correct me if my numbers are wrong Joel.
> >
> > Running on a little core on low frequency these numbers don't look too odd.
> > So I'm not seeing how we can speed these functions up.
> >
> > But since update_sg_lb_stats() will end up with multiple calls to
> > update_blocked_averages() in one go, this latency adds up quickly.
> >
> > One noticeable factor in Joel's system is the presence of a lot of cgroups.
> > Which is essentially what makes __update_blocked_fair() expensive, and it 
> > seems
> > to always return something has decayed so we end up with a call to
> > sugov_update_shared() in every call.
> >
> > I think we should limit the expensive call to update_blocked_averages() but
> 
> At the opposite, some will complain that block values  stay stall to
> high value and prevent any useful adjustment.
> 
> Also update_blocked average is already rate limited with idle and busy
> load_balance
> 
> Seems like the problem raised by Joel is the number of newly idle load balance

It could be. When Joel comments out the update_blocked_averages() call or rate
limits it, the big schedule delay disappears. Which is just an indication of where
we could do better. Either by making update_blocked_averages() faster, or by
controlling how often we do it in a row. I thought the latter made more sense -
though implementation-wise I'm not sure how we should do that.

We might actually help make update_blocked_averages() a bit faster by not
doing cpufreq_update_util() in every call and doing it once in the last call to
update_blocked_averages(). Since it seemed the for_each_leaf_cfs_rq_safe() loop
in __update_blocked_fair() is expensive too, not sure if reducing the calls to
cpufreq_update_util() will be enough.
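
Something like this is what I have in mind (untested sketch; names as in
current mainline):

	static void update_blocked_averages(int cpu, bool update_freq)
	{
		bool decayed = false, done = true;
		struct rq *rq = cpu_rq(cpu);
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);

		decayed |= __update_blocked_others(rq, &done);
		decayed |= __update_blocked_fair(rq, &done);

		update_blocked_load_status(rq, !done);
		/* let the caller decide when cpufreq gets poked */
		if (decayed && update_freq)
			cpufreq_update_util(rq, 0);
		rq_unlock_irqrestore(rq, &rf);
	}

with update_nohz_stats() passing false and update_sg_lb_stats() issuing one
cpufreq_update_util() per cpu of the group once the scan is done.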

Thanks

--
Qais Yousef


Re: [PATCH 0/2] Fix BUG: Invalid wait context in hrtimer_interrupt()

2021-01-26 Thread Qais Yousef
On 01/26/21 13:46, Sergey Senozhatsky wrote:
> On (21/01/23 23:37), Qais Yousef wrote:
> > 
> > I hit a pr_warn() inside hrtimer_interrupt() which lead to a BUG: Invalid 
> > wait
> > context splat.
> > 
> > The problem wasn't reproducible but I think the cause is obvious, printk 
> > can't
> > be called from interrupt context.
> > 
> > AFAICU printk_deferred() is safe from NMI, so I assumed it is safe to be 
> > called
> > from hrtimer_interrupt() too. Adding a pr_warn_once() inside
> > hrtimer_interrupt() in a location where it is always hit produces the BUG
> > splat. Replacing it with pr_warn_deferred_once() generates the printk 
> > warning
> > without any splat.
> 
> Could you please post the lockdep splat?

Sorry, I should have done that. I thought it was obvious that printk() in
interrupt context is not allowed.

I reported it here: 
https://lore.kernel.org/lkml/20210114124512.mg3vexudfmoadef5@e107158-lin/

# [67628.388606] hrtimer: interrupt took 304720 ns
[67628.393546]
[67628.393550] =
[67628.393554] [ BUG: Invalid wait context ]
[67628.393557] 5.11.0-rc3-00019-g86be331946f7 #37 Not tainted
[67628.393560] -
[67628.393563] sugov:0/192 is trying to lock:
[67628.393566] 000800b1d898 (_lock_key){-.-.}-{3:3}, at: 
pl011_console_write+0x138/0x218
[67628.393581] other info that might help us debug this:
[67628.393584] context-{2:2}
[67628.393586] 4 locks held by sugov:0/192:
[67628.393589]  #0: 0008059cb720 
(_policy->work_lock){+.+.}-{4:4}, at: sugov_work+0x58/0x88
[67628.393603]  #1: 800015446f20 (prepare_lock){+.+.}-{4:4}, at: 
clk_prepare_lock+0x34/0xb0
[67628.393618]  #2: 8000152aaa60 (console_lock){+.+.}-{0:0}, at: 
vprintk_emit+0x12c/0x310
[67628.393632]  #3: 8000152aab88 (console_owner){-.-.}-{0:0}, at: 
console_unlock+0x190/0x6d8
[67628.393646] stack backtrace:
[67628.393649] CPU: 0 PID: 192 Comm: sugov:0 Not tainted 
5.11.0-rc3-00019-g86be331946f7 #37
[67628.393653] Hardware name: ARM Juno development board (r2) (DT)
[67628.393656] Call trace:
[67628.393659]  dump_backtrace+0x0/0x1b0
[67628.393661]  show_stack+0x20/0x70
[67628.393664]  dump_stack+0xf8/0x168
[67628.393666]  __lock_acquire+0x964/0x1c88
[67628.393669]  lock_acquire+0x26c/0x500
[67628.393671]  _raw_spin_lock+0x64/0x88
[67628.393674]  pl011_console_write+0x138/0x218
[67628.393677]  console_unlock+0x2c4/0x6d8
[67628.393680]  vprintk_emit+0x134/0x310
[67628.393682]  vprintk_default+0x40/0x50
[67628.393685]  vprintk_func+0xfc/0x2b0
[67628.393687]  printk+0x68/0x90
[67628.393690]  hrtimer_interrupt+0x24c/0x250
[67628.393693]  arch_timer_handler_phys+0x3c/0x50
[67628.393695]  handle_percpu_devid_irq+0xd8/0x460
[67628.393698]  generic_handle_irq+0x38/0x50
[67628.393701]  __handle_domain_irq+0x6c/0xc8
[67628.393704]  gic_handle_irq+0xb0/0xf0
[67628.393706]  el1_irq+0xb4/0x180
[67628.393709]  _raw_spin_unlock_irqrestore+0x58/0xa8
[67628.393712]  hrtimer_start_range_ns+0x1b0/0x420
[67628.393715]  msg_submit+0x100/0x108
[67628.393717]  mbox_send_message+0x84/0x128
[67628.393720]  scpi_send_message+0x11c/0x2a8
[67628.393723]  scpi_dvfs_set_idx+0x48/0x70
[67628.393726]  scpi_dvfs_set_rate+0x60/0x78
[67628.393728]  clk_change_rate+0x184/0x8a8
[67628.393731]  clk_core_set_rate_nolock+0x1c0/0x280
[67628.393734]  clk_set_rate+0x40/0x98
[67628.393736]  scpi_cpufreq_set_target+0x40/0x68
[67628.393739]  __cpufreq_driver_target+0x1a0/0x5d0

> Why is it invalid? Is this... -rt kernel, perhaps?

No I was running on Linus master at the time.

AFAIU printk will hold the console lock which could sleep in interrupt context.

Did I miss something?

Thanks

--
Qais Yousef


Re: [PATCH 1/2] printk: Add new pr_*_deferred_once() variants

2021-01-25 Thread Qais Yousef
On 01/25/21 12:04, John Ogness wrote:
> On 2021-01-25, Peter Zijlstra  wrote:
> > On Sat, Jan 23, 2021 at 11:37:40PM +0000, Qais Yousef wrote:
> >> To allow users in code where printk is not allowed.
> >> 
> >> Signed-off-by: Qais Yousef 
> >> ---
> >>  include/linux/printk.h | 24 
> >>  1 file changed, 24 insertions(+)
> >> 
> >> diff --git a/include/linux/printk.h b/include/linux/printk.h
> >> index fe7eb2351610..92c0978c7b44 100644
> >> --- a/include/linux/printk.h
> >> +++ b/include/linux/printk.h
> >> @@ -480,21 +480,45 @@ extern int kptr_restrict;
> >>printk_once(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> >>  /* no pr_cont_once, don't do that... */
> >>  
> >> +#define pr_emerg_deferred_once(fmt, ...)  \
> >> +  printk_deferred_once(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_alert_deferred_once(fmt, ...)  \
> >> +  printk_deferred_once(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_crit_deferred_once(fmt, ...)   
> >> \
> >> +  printk_deferred_once(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_err_deferred_once(fmt, ...)
> >> \
> >> +  printk_deferred_once(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_warn_deferred_once(fmt, ...)   
> >> \
> >> +  printk_deferred_once(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_notice_deferred_once(fmt, ...) \
> >> +  printk_deferred_once(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
> >> +#define pr_info_deferred_once(fmt, ...)   
> >> \
> >> +  printk_deferred_once(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> >> +/* no pr_cont_deferred_once, don't do that... */
> >
> > I absolutely detest this.. the whole _deferred thing is an
> > abomination.
> 
> And it will disappear at some point.
> 
> > We should be very close to printk not needing this anymore, printk
> > people?
> 
> It will disappear once console printing threads are introduced. We
> probably still have a few kernel releases until we see that. First we
> need to finish merging full lockless access, remove the safe buffers,
> and merge the atomic consoles.

Okay. As I said in the cover letter, I didn't think the addition of these new
macros looked like a win overall.

I will drop this patch and just open code the use of printk_deferred_once() in
hrtimer_interrupt(). That should also be easier to clean up later when the
_deferred variants disappear.
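
i.e. replace the pr_warn_once() in hrtimer_interrupt() with something like:

	printk_deferred_once(KERN_WARNING "hrtimer: interrupt took %llu ns\n",
			     ktime_to_ns(delta));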

Thanks

--
Qais Yousef


[PATCH 2/2] hrtimer: Use pr_warn_deferred_once() in hrtimer_interrupt()

2021-01-23 Thread Qais Yousef
printk is not allowed in this context and causes a BUG: Invalid wait context.

Signed-off-by: Qais Yousef 
---
 kernel/time/hrtimer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..2d9b7cf1d5e2 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1700,7 +1700,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
else
expires_next = ktime_add(now, delta);
tick_program_event(expires_next, 1);
-   pr_warn_once("hrtimer: interrupt took %llu ns\n", ktime_to_ns(delta));
+   pr_warn_deferred_once("hrtimer: interrupt took %llu ns\n",
+ ktime_to_ns(delta));
 }
 
 /* called with interrupts disabled */
-- 
2.25.1



[PATCH 0/2] Fix BUG: Invalid wait context in hrtimer_interrupt()

2021-01-23 Thread Qais Yousef
I hit a pr_warn() inside hrtimer_interrupt() which led to a BUG: Invalid wait
context splat.

The problem wasn't reproducible but I think the cause is obvious: printk can't
be called from interrupt context.

AFAICU printk_deferred() is safe from NMI, so I assumed it is safe to be called
from hrtimer_interrupt() too. Adding a pr_warn_once() inside
hrtimer_interrupt() in a location where it is always hit produces the BUG
splat. Replacing it with pr_warn_deferred_once() generates the printk warning
without any splat.

I added new pr_*_deferred_once() variants to avoid open coding; but the names
ended up not much shorter and I'm not sure if the wrappers are a win overall.
Since I've already done it, I'm sticking to it in this post. But will be happy
to drop it and just open code the printk_deferred_once(KERN_WARNING, ...) in
hrtimer_interrupt() instead.

Thanks

Qais Yousef (2):
  printk: Add new pr_*_deferred_once() variants
  hrtimer: Use pr_warn_deferred_once() in hrtimer_interrupt()

 include/linux/printk.h | 24 
 kernel/time/hrtimer.c  |  3 ++-
 2 files changed, 26 insertions(+), 1 deletion(-)

-- 
2.25.1



[PATCH 1/2] printk: Add new pr_*_deferred_once() variants

2021-01-23 Thread Qais Yousef
To allow users in code where printk is not allowed.

Signed-off-by: Qais Yousef 
---
 include/linux/printk.h | 24 
 1 file changed, 24 insertions(+)

diff --git a/include/linux/printk.h b/include/linux/printk.h
index fe7eb2351610..92c0978c7b44 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -480,21 +480,45 @@ extern int kptr_restrict;
printk_once(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
 /* no pr_cont_once, don't do that... */
 
+#define pr_emerg_deferred_once(fmt, ...)   \
+   printk_deferred_once(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_alert_deferred_once(fmt, ...)   \
+   printk_deferred_once(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_crit_deferred_once(fmt, ...)
\
+   printk_deferred_once(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_err_deferred_once(fmt, ...) \
+   printk_deferred_once(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_warn_deferred_once(fmt, ...)
\
+   printk_deferred_once(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_notice_deferred_once(fmt, ...)  \
+   printk_deferred_once(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_info_deferred_once(fmt, ...)
\
+   printk_deferred_once(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
+/* no pr_cont_deferred_once, don't do that... */
+
 #if defined(DEBUG)
 #define pr_devel_once(fmt, ...)\
printk_once(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_devel_deferred_once(fmt, ...)   \
+   printk_deferred_once(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define pr_devel_once(fmt, ...)\
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_devel_deferred_once(fmt, ...)   \
+   no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #endif
 
 /* If you are writing a driver, please use dev_dbg instead */
 #if defined(DEBUG)
 #define pr_debug_once(fmt, ...)\
printk_once(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_deferred_once(fmt, ...)   \
+   printk_deferred_once(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #else
 #define pr_debug_once(fmt, ...)\
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
+#define pr_debug_deferred_once(fmt, ...)   \
+   no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
 #endif
 
 /*
-- 
2.25.1



Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

2021-01-22 Thread Qais Yousef
On 01/22/21 17:56, Vincent Guittot wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool 
> > force)
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > -   if (!force && !time_after(jiffies, 
> > rq->last_blocked_load_update_tick))
> > +   if (!force && !time_after(jiffies, 
> > rq->last_blocked_load_update_tick + (HZ/20)))
> 
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval
> This hard coded (HZ/20) looks really like an ugly hack

This was meant as an RFC patch to discuss the problem really.

Joel is seeing update_blocked_averages() taking ~100us. Half of it seems in
processing __update_blocked_fair() and the other half in sugov_update_shared().
So roughly 50us each. Note that each function is calling an iterator in
turn. Correct me if my numbers are wrong, Joel.

Running on a little core on low frequency these numbers don't look too odd.
So I'm not seeing how we can speed these functions up.

But since update_sg_lb_stats() will end up with multiple calls to
update_blocked_averages() in one go, this latency adds up quickly.

One noticeable factor in Joel's system is the presence of a lot of cgroups.
Which is essentially what makes __update_blocked_fair() expensive, and it seems
to always report that something has decayed, so we end up with a call to
sugov_update_shared() in every call.

I think we should limit the expensive call to update_blocked_averages() but
I honestly don't know what would be the right way to do it :-/

Or maybe there's another better alternative too..

Thanks

--
Qais Yousef

> 
> > return true;
> >
> > update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >


Re: [PATCH] arm64: kprobes: Fix Uexpected kernel BRK exception at EL1

2021-01-22 Thread Qais Yousef
On 01/22/21 22:36, Masami Hiramatsu wrote:
> > Further analysis showed that kcb->kprobe_status is set to
> > KPROBE_REENTER when the error occurs. By teaching
> > kprobe_breakpoint_ss_handler() to handle this status I can no  longer
> > reproduce the problem.
> 
> Very good catch! Yes, this missed the reentered kprobe case.
> 
> Acked-by: Masami Hiramatsu 

Thanks!

> 
> > 
> > Fixes: ba090f9cafd5 ("arm64: kprobes: Remove redundant kprobe_step_ctx")
> > Signed-off-by: Qais Yousef 
> > ---
> > 
> > Another change in behavior I noticed is that before ba090f9cafd5 ("arm64:
> > kprobes: Remove redundant kprobe_step_ctx") if 'cur' was NULL we wouldn't
> > return DBG_HOOK_ERROR, but after the change we do.
> 
> It should not happen, since the KPROBES_BRK_SS_IMM must be used only for
> kprobes's second break which must happen on the trampoline instruction
> buffer, which must set current kprobes before execution.

I see. Thanks for the explanation!

Cheers

--
Qais Yousef


[PATCH] arm64: kprobes: Fix Uexpected kernel BRK exception at EL1

2021-01-22 Thread Qais Yousef
I was hitting the below panic continuously when attaching kprobes to
scheduler functions

[  159.045212] Unexpected kernel BRK exception at EL1
[  159.053753] Internal error: BRK handler: f206 [#1] PREEMPT SMP
[  159.059954] Modules linked in:
[  159.063025] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 
5.11.0-rc4-8-g1e2a199f6ccd #56
[rt-app]  [1] Exiting.[  159.071166] Hardware name: ARM Juno 
development board (r2) (DT)
[  159.079689] pstate: 63c5 (nZCv DAIF -PAN -UAO -TCO BTYPE=--)

[  159.085723] pc : 0x80001624501c
[  159.089377] lr : attach_entity_load_avg+0x2ac/0x350
[  159.094271] sp : 80001622b640
[rt-app]  [0] Exiting.[  159.097591] x29: 80001622b640 x28: 
0001
[  159.105515] x27: 0049 x26: 000800b79980

[  159.110847] x25: 00097ef37840 x24: 
[  159.116331] x23: 0024eacec1ec x22: 00097ef12b90
[  159.121663] x21: 00097ef37700 x20: 800010119170
[rt-app]  [11] Exiting.[  159.126995] x19: 00097ef37840 
x18: 000e
[  159.135003] x17: 0001 x16: 0019
[  159.140335] x15:  x14: 
[  159.145666] x13: 0002 x12: 0002
[  159.150996] x11: 80001592f9f0 x10: 0060
[  159.156327] x9 : 8000100f6f9c x8 : be618290de0999a1
[  159.161659] x7 : 80096a4b1000 x6 : 
[  159.166990] x5 : 00097ef37840 x4 : 
[  159.172321] x3 : 000800328948 x2 : 
[  159.177652] x1 : 002507d52fec x0 : 00097ef12b90
[  159.182983] Call trace:
[  159.185433]  0x80001624501c
[  159.188581]  update_load_avg+0x2d0/0x778
[  159.192516]  enqueue_task_fair+0x134/0xe20
[  159.196625]  enqueue_task+0x4c/0x2c8
[  159.200211]  ttwu_do_activate+0x70/0x138
[  159.204147]  sched_ttwu_pending+0xbc/0x160
[  159.208253]  flush_smp_call_function_queue+0x16c/0x320
[  159.213408]  generic_smp_call_function_single_interrupt+0x1c/0x28
[  159.219521]  ipi_handler+0x1e8/0x3c8
[  159.223106]  handle_percpu_devid_irq+0xd8/0x460
[  159.227650]  generic_handle_irq+0x38/0x50
[  159.231672]  __handle_domain_irq+0x6c/0xc8
[  159.235781]  gic_handle_irq+0xcc/0xf0
[  159.239452]  el1_irq+0xb4/0x180
[  159.242600]  rcu_is_watching+0x28/0x70
[  159.246359]  rcu_read_lock_held_common+0x44/0x88
[  159.250991]  rcu_read_lock_any_held+0x30/0xc0
[  159.255360]  kretprobe_dispatcher+0xc4/0xf0
[  159.259555]  __kretprobe_trampoline_handler+0xc0/0x150
[  159.264710]  trampoline_probe_handler+0x38/0x58
[  159.269255]  kretprobe_trampoline+0x70/0xc4
[  159.273450]  run_rebalance_domains+0x54/0x80
[  159.277734]  __do_softirq+0x164/0x684
[  159.281406]  irq_exit+0x198/0x1b8
[  159.284731]  __handle_domain_irq+0x70/0xc8
[  159.288840]  gic_handle_irq+0xb0/0xf0
[  159.292510]  el1_irq+0xb4/0x180
[  159.295658]  arch_cpu_idle+0x18/0x28
[  159.299245]  default_idle_call+0x9c/0x3e8
[  159.303265]  do_idle+0x25c/0x2a8
[  159.306502]  cpu_startup_entry+0x2c/0x78
[  159.310436]  secondary_start_kernel+0x160/0x198
[  159.314984] Code: d42000c0 aa1e03e9 d42000c0 aa1e03e9 (d42000c0)

After a bit of head scratching and debugging it turned out that it is
due to the kprobe handler being interrupted by a tick that causes us to go
into (I think) another kprobe handler.

The culprit was kprobe_breakpoint_ss_handler() returning DBG_HOOK_ERROR
which leads to the Unexpected kernel BRK exception.

Reverting commit ba090f9cafd5 ("arm64: kprobes: Remove redundant
kprobe_step_ctx") seemed to fix the problem for me.

Further analysis showed that kcb->kprobe_status is set to
KPROBE_REENTER when the error occurs. By teaching
kprobe_breakpoint_ss_handler() to handle this status I can no longer
reproduce the problem.

Fixes: ba090f9cafd5 ("arm64: kprobes: Remove redundant kprobe_step_ctx")
Signed-off-by: Qais Yousef 
---

Another change in behavior I noticed is that before ba090f9cafd5 ("arm64:
kprobes: Remove redundant kprobe_step_ctx") if 'cur' was NULL we wouldn't
return DBG_HOOK_ERROR, but after the change we do.

I didn't hit a problem because of that it's just something I noticed when
I realized that this commit was causing my problem.
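
For clarity, the essence of the change below is to also accept a re-entered
kprobe when deciding whether the single-step BRK is ours, roughly (the address
check is unchanged and elided here):

	-	if (cur && (kcb->kprobe_status == KPROBE_HIT_SS) && ...) {
	+	if (cur && (kcb->kprobe_status & (KPROBE_HIT_SS | KPROBE_REENTER)) && ...) {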


 arch/arm64/kernel/probes/kprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index 89c64ada8732..66aac2881ba8 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -352,8 +352,8 @@ kp

Re: hrtimer_interrupt::pr_warn_once() causes BUG: Invalid wait context

2021-01-20 Thread Qais Yousef
On 01/14/21 12:45, Qais Yousef wrote:
> Hi
> 
> I hit this splat today
> 
>   # [67628.388606] hrtimer: interrupt took 304720 ns
>   [67628.393546]
>   [67628.393550] =
>   [67628.393554] [ BUG: Invalid wait context ]
>   [67628.393557] 5.11.0-rc3-00019-g86be331946f7 #37 Not tainted
>   [67628.393560] -
>   [67628.393563] sugov:0/192 is trying to lock:
>   [67628.393566] 000800b1d898 (_lock_key){-.-.}-{3:3}, at: 
> pl011_console_write+0x138/0x218
>   [67628.393581] other info that might help us debug this:
>   [67628.393584] context-{2:2}
>   [67628.393586] 4 locks held by sugov:0/192:
>   [67628.393589]  #0: 0008059cb720 
> (_policy->work_lock){+.+.}-{4:4}, at: sugov_work+0x58/0x88
>   [67628.393603]  #1: 800015446f20 (prepare_lock){+.+.}-{4:4}, at: 
> clk_prepare_lock+0x34/0xb0
>   [67628.393618]  #2: 8000152aaa60 (console_lock){+.+.}-{0:0}, at: 
> vprintk_emit+0x12c/0x310
>   [67628.393632]  #3: 8000152aab88 (console_owner){-.-.}-{0:0}, at: 
> console_unlock+0x190/0x6d8
>   [67628.393646] stack backtrace:
>   [67628.393649] CPU: 0 PID: 192 Comm: sugov:0 Not tainted 
> 5.11.0-rc3-00019-g86be331946f7 #37
>   [67628.393653] Hardware name: ARM Juno development board (r2) (DT)
>   [67628.393656] Call trace:
>   [67628.393659]  dump_backtrace+0x0/0x1b0
>   [67628.393661]  show_stack+0x20/0x70
>   [67628.393664]  dump_stack+0xf8/0x168
>   [67628.393666]  __lock_acquire+0x964/0x1c88
>   [67628.393669]  lock_acquire+0x26c/0x500
>   [67628.393671]  _raw_spin_lock+0x64/0x88
>   [67628.393674]  pl011_console_write+0x138/0x218
>   [67628.393677]  console_unlock+0x2c4/0x6d8
>   [67628.393680]  vprintk_emit+0x134/0x310
>   [67628.393682]  vprintk_default+0x40/0x50
>   [67628.393685]  vprintk_func+0xfc/0x2b0
>   [67628.393687]  printk+0x68/0x90
>   [67628.393690]  hrtimer_interrupt+0x24c/0x250
>   [67628.393693]  arch_timer_handler_phys+0x3c/0x50
>   [67628.393695]  handle_percpu_devid_irq+0xd8/0x460
>   [67628.393698]  generic_handle_irq+0x38/0x50
>   [67628.393701]  __handle_domain_irq+0x6c/0xc8
>   [67628.393704]  gic_handle_irq+0xb0/0xf0
>   [67628.393706]  el1_irq+0xb4/0x180
>   [67628.393709]  _raw_spin_unlock_irqrestore+0x58/0xa8
>   [67628.393712]  hrtimer_start_range_ns+0x1b0/0x420
>   [67628.393715]  msg_submit+0x100/0x108
>   [67628.393717]  mbox_send_message+0x84/0x128
>   [67628.393720]  scpi_send_message+0x11c/0x2a8
>   [67628.393723]  scpi_dvfs_set_idx+0x48/0x70
>   [67628.393726]  scpi_dvfs_set_rate+0x60/0x78
>   [67628.393728]  clk_change_rate+0x184/0x8a8
>   [67628.393731]  clk_core_set_rate_nolock+0x1c0/0x280
>   [67628.393734]  clk_set_rate+0x40/0x98
>   [67628.393736]  scpi_cpufreq_set_target+0x40/0x68
>   [67628.393739]  __cpufreq_driver_target+0x1a0/0x5d0
> 
> AFAICT the call to pr_warn_once() is how we end up here. Not sure if there's 
> an
> appropriate or easy fix for that.

Oh I just accidentally came across printk_deferred_once() which looks like the
right thing to do. I can't reproduce but I'll send a patch over the weekend to
see if anyone will complain why it shouldn't work.

Thanks

--
Qais Yousef


Re: [PATCH] sched/eas: Don't update misfit status if the task is pinned

2021-01-20 Thread Qais Yousef
On 01/19/21 17:50, Quentin Perret wrote:
> On Tuesday 19 Jan 2021 at 17:42:44 (+), Qais Yousef wrote:
> > Hmm IIUC you want to still tag it as misfit so it'll be balanced within the
> > little cores in case there's another core with more spare capacity, right?
> 
> Well yes but that's just a special case. But even you have big CPUs in
> the affinity mask, you may find that the task fits on none of the CPUs
> because they're currently under pressure. But in this case, you may
> still want to mark the task as misfit because being under pressure may
> be a relatively transient state.

Okay. So your thoughts are that if the utilization is above capacity_orig_of()
then marking it as misfit is meaningless (taking into account the cpus it is
affined to). Which I agree with. But if it is less than capacity_orig_of() but
doesn't fit because of pressure, i.e.:

util <= capacity_orig_of(cpu) && util > capacity_of(cpu)

then we should mark it as misfit as it currently does. I think this makes sense
too. There's the margin to consider in the mix here too. And util clamp
effects. And the fact this gets called from pick_next_task_fair() which is
a hot path :-)
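
Roughly what I have in mind is something like the below (untested sketch, the
helper name is made up, and it ignores the hot path concern for now):

	/* does the task fit on at least one cpu it is allowed to run on? */
	static inline bool task_fits_somewhere(struct task_struct *p)
	{
		int cpu;

		for_each_cpu(cpu, p->cpus_ptr) {
			if (fits_capacity(uclamp_task_util(p), capacity_orig_of(cpu)))
				return true;
		}

		return false;
	}

with update_misfit_status() clearing rq->misfit_task_load when this returns
false, similar to the nr_cpus_allowed == 1 check.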

Unless someone else beats me to it, I'll send a patch eventually :-)

Thanks

--
Qais Yousef


Re: [PATCH] sched/eas: Don't update misfit status if the task is pinned

2021-01-19 Thread Qais Yousef
On 01/19/21 16:55, Quentin Perret wrote:
> On Tuesday 19 Jan 2021 at 16:40:27 (+), Qais Yousef wrote:
> > On 01/19/21 15:35, Quentin Perret wrote:
> > > Do you mean failing the sched_setaffinity syscall if e.g. the task
> > > has a min clamp that is higher than the capacity of the CPUs to which it
> > > will be pinned? If so, I'm not sure if we really want that.
> > 
> > No. In Android for instance, I'm worried a background task affined to little
> > cores that has a utilization > capacity_of(little) will trigger the same
> > problem. It'll be affined to more than just 1 cpu, but none of the little 
> > cpus
> > will actually fit.
> > 
> > Makes sense?
> 
> Now yes.
> 
> I agree this may be a real problem, but capacity_of() very much is a
> per-CPU thing, because of RT pressure and such, and that is not a static
> thing by any mean. So, even if the task doesn't fit on any CPU _now_ we
> might still want to mark it misfit, just so it can be picked up by a
> potential idle balance on another CPU later on. Maybe capacity_orig_of
> would be preferable?

Hmm IIUC you want to still tag it as misfit so it'll be balanced within the
little cores in case there's another core with more spare capacity, right?

This needs more thinking. Misfit doesn't seem the right mechanism to handle
this. If there are multiple tasks crammed on the same CPU, then we should try
to distribute yes. If it is the only task I can't see this being useful unless
the pressure is very high. Which could be an indication of another problem in
the system..

Thanks

--
Qais Yousef


Re: [PATCH] sched/eas: Don't update misfit status if the task is pinned

2021-01-19 Thread Qais Yousef
On 01/19/21 15:35, Quentin Perret wrote:
> On Tuesday 19 Jan 2021 at 12:07:55 (+), Qais Yousef wrote:
> > If the task is pinned to a cpu, setting the misfit status means that
> > we'll unnecessarily continuously attempt to migrate the task but fail.
> > 
> > This continuous failure will cause the balance_interval to increase to
> > a high value, and eventually cause unnecessary significant delays in
> > balancing the system when real imbalance happens.
> > 
> > Caught while testing uclamp where rt-app calibration loop was pinned to
> > cpu 0, shortly after which we spawn another task with high util_clamp
> > value. The task was failing to migrate after over 40ms of runtime due to
> > balance_interval unnecessary expanded to a very high value from the
> > calibration loop.
> > 
> > Not done here, but it could be useful to extend the check for pinning to
> > verify that the affinity of the task has a cpu that fits. We could end
> > up in a similar situation otherwise.
> 
> Do you mean failing the sched_setaffinity syscall if e.g. the task
> has a min clamp that is higher than the capacity of the CPUs to which it
> will be pinned? If so, I'm not sure if we really want that.

No. In Android for instance, I'm worried a background task affined to little
cores that has a utilization > capacity_of(little) will trigger the same
problem. It'll be affined to more than just 1 cpu, but none of the little cpus
will actually fit.

Makes sense?

> But this patch makes sense on its own for sure, so:
> 
> Reviewed-by: Quentin Perret 

Thanks

--
Qais Yousef


[PATCH v3 bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints

2021-01-19 Thread Qais Yousef
Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

From bpf's perspective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Update Documentation/bpf/bpf_design_QA.rst to document this contract.

Acked-by: Yonghong Song 
Signed-off-by: Qais Yousef 
---
 Documentation/bpf/bpf_design_QA.rst |  6 ++
 include/trace/bpf_probe.h   | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/bpf/bpf_design_QA.rst 
b/Documentation/bpf/bpf_design_QA.rst
index 2df7b067ab93..0e15f9b05c9d 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. 
Both of these
 kernel internals are subject to change and can break with newer kernels
 such that the program needs to be adapted accordingly.
 
+Q: Are tracepoints part of the stable ABI?
+--
+A: NO. Tracepoints are tied to internal implementation details hence they are
+subject to change and can break with newer kernels. BPF programs need to change
+accordingly when this happens.
+
 Q: How much stack space a BPF program uses?
 ---
 A: Currently all program types are limited to 512 bytes of stack
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..a23be89119aa 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,8 +55,7 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, 
COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#define __BPF_DECLARE_TRACE(call, proto, args) \
 static notrace void\
 __bpf_trace_##call(void *__data, proto)
\
 {  \
@@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
\
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));  
\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
@@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
PARAMS(args), size)
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
+#undef DECLARE_TRACE
+#define DECLARE_TRACE(call, proto, args)   \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \
+   __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 #undef DEFINE_EVENT_WRITABLE
-- 
2.25.1



[PATCH v3 bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-19 Thread Qais Yousef
Reuse module_attach infrastructure to add a new bare tracepoint to check
we can attach to it as a raw tracepoint.

Signed-off-by: Qais Yousef 
---
 .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 +
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 21 ++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  6 +
 .../selftests/bpf/prog_tests/module_attach.c  | 27 +++
 .../selftests/bpf/progs/test_module_attach.c  | 10 +++
 5 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
index b83ea448bc79..89c6d58e5dd6 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
@@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
  __entry->pid, __entry->comm, __entry->off, __entry->len)
 );
 
+/* A bare tracepoint with no event associated with it */
+DECLARE_TRACE(bpf_testmod_test_write_bare,
+   TP_PROTO(struct task_struct *task, struct bpf_testmod_test_write_ctx 
*ctx),
+   TP_ARGS(task, ctx)
+);
+
 #endif /* _BPF_TESTMOD_EVENTS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 2df19d73ca49..e900adad2276 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -28,9 +28,28 @@ bpf_testmod_test_read(struct file *file, struct kobject 
*kobj,
 EXPORT_SYMBOL(bpf_testmod_test_read);
 ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO);
 
+noinline ssize_t
+bpf_testmod_test_write(struct file *file, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t len)
+{
+   struct bpf_testmod_test_write_ctx ctx = {
+   .buf = buf,
+   .off = off,
+   .len = len,
+   };
+
+   trace_bpf_testmod_test_write_bare(current, &ctx);
+
+   return -EIO; /* always fail */
+}
+EXPORT_SYMBOL(bpf_testmod_test_write);
+ALLOW_ERROR_INJECTION(bpf_testmod_test_write, ERRNO);
+
 static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
-   .attr = { .name = "bpf_testmod", .mode = 0444, },
+   .attr = { .name = "bpf_testmod", .mode = 0666, },
.read = bpf_testmod_test_read,
+   .write = bpf_testmod_test_write,
 };
 
 static int bpf_testmod_init(void)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index b81adfedb4f6..b3892dc40111 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -11,4 +11,10 @@ struct bpf_testmod_test_read_ctx {
size_t len;
 };
 
+struct bpf_testmod_test_write_ctx {
+   char *buf;
+   loff_t off;
+   size_t len;
+};
+
 #endif /* _BPF_TESTMOD_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c 
b/tools/testing/selftests/bpf/prog_tests/module_attach.c
index 50796b651f72..5bc53d53d86e 100644
--- a/tools/testing/selftests/bpf/prog_tests/module_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c
@@ -21,9 +21,34 @@ static int trigger_module_test_read(int read_sz)
return 0;
 }
 
+static int trigger_module_test_write(int write_sz)
+{
+   int fd, err;
+   char *buf = malloc(write_sz);
+
+   if (!buf)
+   return -ENOMEM;
+
+   memset(buf, 'a', write_sz);
+   buf[write_sz-1] = '\0';
+
+   fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
+   err = -errno;
+   if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err)) {
+   free(buf);
+   return err;
+   }
+
+   write(fd, buf, write_sz);
+   close(fd);
+   free(buf);
+   return 0;
+}
+
 void test_module_attach(void)
 {
const int READ_SZ = 456;
+   const int WRITE_SZ = 457;
struct test_module_attach* skel;
struct test_module_attach__bss *bss;
int err;
@@ -48,8 +73,10 @@ void test_module_attach(void)
 
/* trigger tracepoint */
ASSERT_OK(trigger_module_test_read(READ_SZ), "trigger_read");
+   ASSERT_OK(trigger_module_test_write(WRITE_SZ), "trigger_write");
 
ASSERT_EQ(bss->raw_tp_read_sz, READ_SZ, "raw_tp");
+   ASSERT_EQ(bss->raw_tp_bare_write_sz, WRITE_SZ, "raw_tp_bare");
ASSERT_EQ(bss->tp_btf_read_sz, READ_SZ, "tp_btf");
ASSERT_EQ(bss->fentry_read_sz, READ_SZ, "fentry");
ASSERT_EQ(bss->fentry_manual_read_sz, READ_SZ, "fentry_manual");
diff --git a/tools/testing/selftests/bpf/progs/test_module_attach.c 
b/tools/testing/selftests/bpf/progs/test_

[PATCH v3 bpf-next 0/2] Allow attaching to bare tracepoints

2021-01-19 Thread Qais Yousef
Changes in v3:
* Fix not returning error value correctly in
  trigger_module_test_write() (Yonghong)
* Add Yonghong acked-by to patch 1.

Changes in v2:
* Fix compilation error. (Andrii)
* Make the new test use write() instead of read() (Andrii)

Add some missing glue logic to teach bpf about bare tracepoints - tracepoints
without any trace event associated with them.

Bare tracepoints are declared with DECLARE_TRACE(). Full tracepoints are declared
with TRACE_EVENT().

BPF can attach to these tracepoints as RAW_TRACEPOINT() only as there're no
events in tracefs created with them.
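
For illustration, attaching to the bare tracepoint added in patch 2 looks
roughly like this from the bpf side (sketch only; the real test lives in the
module_attach selftest):

	#include <linux/types.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* forward declarations are enough, the args are not dereferenced here */
	struct task_struct;
	struct bpf_testmod_test_write_ctx;

	__u32 bare_tp_hits = 0;

	SEC("raw_tp/bpf_testmod_test_write_bare")
	int BPF_PROG(handle_write_bare, struct task_struct *task,
		     struct bpf_testmod_test_write_ctx *write_ctx)
	{
		bare_tp_hits++;
		return 0;
	}

	char _license[] SEC("license") = "GPL";

Loading and attaching works as for any other raw tracepoint program; the only
difference is that there is no corresponding event under tracefs.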

Qais Yousef (2):
  trace: bpf: Allow bpf to attach to bare tracepoints
  selftests: bpf: Add a new test for bare tracepoints

 Documentation/bpf/bpf_design_QA.rst   |  6 +
 include/trace/bpf_probe.h | 12 +++--
 .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 +
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 21 ++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  6 +
 .../selftests/bpf/prog_tests/module_attach.c  | 27 +++
 .../selftests/bpf/progs/test_module_attach.c  | 10 +++
 7 files changed, 85 insertions(+), 3 deletions(-)

-- 
2.25.1



[PATCH] sched/eas: Don't update misfit status if the task is pinned

2021-01-19 Thread Qais Yousef
If the task is pinned to a cpu, setting the misfit status means that
we'll unnecessarily continuously attempt to migrate the task but fail.

This continuous failure will cause the balance_interval to increase to
a high value, and eventually cause unnecessary significant delays in
balancing the system when real imbalance happens.

Caught while testing uclamp where rt-app calibration loop was pinned to
cpu 0, shortly after which we spawn another task with high util_clamp
value. The task was failing to migrate after over 40ms of runtime due to
balance_interval being unnecessarily expanded to a very high value by the
calibration loop.

Not done here, but it could be useful to extend the check for pinning to
verify that the affinity of the task has a cpu that fits. We could end
up in a similar situation otherwise.

Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")
Signed-off-by: Qais Yousef 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..9379a481dd8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4060,7 +4060,7 @@ static inline void update_misfit_status(struct 
task_struct *p, struct rq *rq)
if (!static_branch_unlikely(_asym_cpucapacity))
return;
 
-   if (!p) {
+   if (!p || p->nr_cpus_allowed == 1) {
rq->misfit_task_load = 0;
return;
}
-- 
2.25.1



Re: [PATCH v2 bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-19 Thread Qais Yousef
Hi Yonghong

On 01/18/21 09:48, Yonghong Song wrote:
> The original patch code:
> 
> +static int trigger_module_test_write(int write_sz)
> +{
> + int fd, err;
> + char *buf = malloc(write_sz);
> +
> + if (!buf)
> + return -ENOMEM;
> +
> + memset(buf, 'a', write_sz);
> + buf[write_sz-1] = '\0';
> +
> + fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
> + err = -errno;
> + if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
> + goto out;
> +
> + write(fd, buf, write_sz);
> + close(fd);
> +out:
> + free(buf);
> +
> + return 0;
> +}
> 
> Even for "fd < 0" case, it "goto out" and "return 0". We should return
> error code here instead of 0.
> 
> Second, "err = -errno" is set before checking fd < 0. If fd >= 0, err might
> inherit an postive errno from previous failure.
> In trigger_module_test_write(), it is okay since the err is only used
> when fd < 0:
> err = -errno;
> if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
> return err;
> 
> My above rewrite intends to use "err" during final "return" statement,
> so I put assignment of "err = -errno" inside the CHECK branch.
> But there are different ways to implement this properly.

Okay I see now. Sorry I missed your point initially. I will fix and send v3.
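
i.e. for v3 the error path will become something like:

	fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
	err = -errno;
	if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err)) {
		free(buf);
		return err;	/* propagate the error instead of returning 0 */
	}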

Thanks

--
Qais Yousef


Re: [PATCH v2 bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-18 Thread Qais Yousef
On 01/16/21 18:11, Yonghong Song wrote:
> 
> 
> On 1/16/21 10:21 AM, Qais Yousef wrote:
> > Reuse module_attach infrastructure to add a new bare tracepoint to check
> > we can attach to it as a raw tracepoint.
> > 
> > Signed-off-by: Qais Yousef 
> > ---
> >   .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 +
> >   .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 21 ++-
> >   .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  6 +
> >   .../selftests/bpf/prog_tests/module_attach.c  | 27 +++
> >   .../selftests/bpf/progs/test_module_attach.c  | 10 +++
> >   5 files changed, 69 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
> > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > index b83ea448bc79..89c6d58e5dd6 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > @@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
> >   __entry->pid, __entry->comm, __entry->off, __entry->len)
> >   );
> > +/* A bare tracepoint with no event associated with it */
> > +DECLARE_TRACE(bpf_testmod_test_write_bare,
> > +   TP_PROTO(struct task_struct *task, struct bpf_testmod_test_write_ctx 
> > *ctx),
> > +   TP_ARGS(task, ctx)
> > +);
> > +
> >   #endif /* _BPF_TESTMOD_EVENTS_H */
> >   #undef TRACE_INCLUDE_PATH
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
> > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > index 2df19d73ca49..e900adad2276 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > @@ -28,9 +28,28 @@ bpf_testmod_test_read(struct file *file, struct kobject 
> > *kobj,
> >   EXPORT_SYMBOL(bpf_testmod_test_read);
> >   ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO);
> > +noinline ssize_t
> > +bpf_testmod_test_write(struct file *file, struct kobject *kobj,
> > + struct bin_attribute *bin_attr,
> > + char *buf, loff_t off, size_t len)
> > +{
> > +   struct bpf_testmod_test_write_ctx ctx = {
> > +   .buf = buf,
> > +   .off = off,
> > +   .len = len,
> > +   };
> > +
> > > +   trace_bpf_testmod_test_write_bare(current, &ctx);
> > +
> > +   return -EIO; /* always fail */
> > +}
> > +EXPORT_SYMBOL(bpf_testmod_test_write);
> > +ALLOW_ERROR_INJECTION(bpf_testmod_test_write, ERRNO);
> > +
> >   static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
> 
> Do we need to remove __ro_after_init?

I don't think so. The structure should still remain RO AFAIU.

> 
> > -   .attr = { .name = "bpf_testmod", .mode = 0444, },
> > +   .attr = { .name = "bpf_testmod", .mode = 0666, },
> > .read = bpf_testmod_test_read,
> > +   .write = bpf_testmod_test_write,
> >   };
> >   static int bpf_testmod_init(void)
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h 
> > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > index b81adfedb4f6..b3892dc40111 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > @@ -11,4 +11,10 @@ struct bpf_testmod_test_read_ctx {
> > size_t len;
> >   };
> > +struct bpf_testmod_test_write_ctx {
> > +   char *buf;
> > +   loff_t off;
> > +   size_t len;
> > +};
> > +
> >   #endif /* _BPF_TESTMOD_H */
> > diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c 
> > b/tools/testing/selftests/bpf/prog_tests/module_attach.c
> > index 50796b651f72..e4605c0b5af1 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/module_attach.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c
> > @@ -21,9 +21,34 @@ static int trigger_module_test_read(int read_sz)
> > return 0;
> >   }
> > +static int trigger_module_test_write(int write_sz)
> > +{
> > +   int fd, err;
> 
> Init err = 0?

I don't see what difference this makes.

> 
> > +   char *buf = malloc(write_sz);
> > +
> > +   if (!buf)
> > +   return -ENOMEM;
> 
> Looks like we already non-negative value, so return ENOMEM?

We already set err=-errno. So shouldn't we return negative too?

> 
> > +
> > +   memset(buf, 'a', write_sz);
> > +   buf[write_sz-1] = '\0';
> > +
> > +   fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
> > +   err = -errno;
> > +   if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
> > +   goto out;
> 
> Change the above to
>   fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
>   if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", errno)) {
>   err = -errno;
>   goto out;
>   }

I kept the code consistent with the definition of trigger_module_test_read().

I'll leave it up to the maintainer to pick up the style changes if they prefer
it this way.

Thanks for the ack and for the review.

Cheers

--
Qais Yousef


[PATCH v2 bpf-next 0/2] Allow attaching to bare tracepoints

2021-01-16 Thread Qais Yousef
Changes in v2:
* Fix compilation error.
* Make the new test use write() instead of read()

Add some missing glue logic to teach bpf about bare tracepoints - tracepoints
without any trace event associated with them.

Bare tracepoints are declared with DECLARE_TRACE(). Full tracepoints are declared
with TRACE_EVENT().

BPF can attach to these tracepoints as RAW_TRACEPOINT() only as there're no
events in tracefs created with them.

Qais Yousef (2):
  trace: bpf: Allow bpf to attach to bare tracepoints
  selftests: bpf: Add a new test for bare tracepoints

 Documentation/bpf/bpf_design_QA.rst   |  6 +
 include/trace/bpf_probe.h | 12 +++--
 .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 +
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 21 ++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  6 +
 .../selftests/bpf/prog_tests/module_attach.c  | 27 +++
 .../selftests/bpf/progs/test_module_attach.c  | 10 +++
 7 files changed, 85 insertions(+), 3 deletions(-)

-- 
2.25.1



[PATCH v2 bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-16 Thread Qais Yousef
Reuse module_attach infrastructure to add a new bare tracepoint to check
we can attach to it as a raw tracepoint.

Signed-off-by: Qais Yousef 
---
 .../bpf/bpf_testmod/bpf_testmod-events.h  |  6 +
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 21 ++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  6 +
 .../selftests/bpf/prog_tests/module_attach.c  | 27 +++
 .../selftests/bpf/progs/test_module_attach.c  | 10 +++
 5 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
index b83ea448bc79..89c6d58e5dd6 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
@@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
  __entry->pid, __entry->comm, __entry->off, __entry->len)
 );
 
+/* A bare tracepoint with no event associated with it */
+DECLARE_TRACE(bpf_testmod_test_write_bare,
+   TP_PROTO(struct task_struct *task, struct bpf_testmod_test_write_ctx 
*ctx),
+   TP_ARGS(task, ctx)
+);
+
 #endif /* _BPF_TESTMOD_EVENTS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 2df19d73ca49..e900adad2276 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -28,9 +28,28 @@ bpf_testmod_test_read(struct file *file, struct kobject 
*kobj,
 EXPORT_SYMBOL(bpf_testmod_test_read);
 ALLOW_ERROR_INJECTION(bpf_testmod_test_read, ERRNO);
 
+noinline ssize_t
+bpf_testmod_test_write(struct file *file, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t len)
+{
+   struct bpf_testmod_test_write_ctx ctx = {
+   .buf = buf,
+   .off = off,
+   .len = len,
+   };
+
+   trace_bpf_testmod_test_write_bare(current, &ctx);
+
+   return -EIO; /* always fail */
+}
+EXPORT_SYMBOL(bpf_testmod_test_write);
+ALLOW_ERROR_INJECTION(bpf_testmod_test_write, ERRNO);
+
 static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = {
-   .attr = { .name = "bpf_testmod", .mode = 0444, },
+   .attr = { .name = "bpf_testmod", .mode = 0666, },
.read = bpf_testmod_test_read,
+   .write = bpf_testmod_test_write,
 };
 
 static int bpf_testmod_init(void)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index b81adfedb4f6..b3892dc40111 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -11,4 +11,10 @@ struct bpf_testmod_test_read_ctx {
size_t len;
 };
 
+struct bpf_testmod_test_write_ctx {
+   char *buf;
+   loff_t off;
+   size_t len;
+};
+
 #endif /* _BPF_TESTMOD_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c 
b/tools/testing/selftests/bpf/prog_tests/module_attach.c
index 50796b651f72..e4605c0b5af1 100644
--- a/tools/testing/selftests/bpf/prog_tests/module_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c
@@ -21,9 +21,34 @@ static int trigger_module_test_read(int read_sz)
return 0;
 }
 
+static int trigger_module_test_write(int write_sz)
+{
+   int fd, err;
+   char *buf = malloc(write_sz);
+
+   if (!buf)
+   return -ENOMEM;
+
+   memset(buf, 'a', write_sz);
+   buf[write_sz-1] = '\0';
+
+   fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
+   err = -errno;
+   if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
+   goto out;
+
+   write(fd, buf, write_sz);
+   close(fd);
+out:
+   free(buf);
+
+   return 0;
+}
+
 void test_module_attach(void)
 {
const int READ_SZ = 456;
+   const int WRITE_SZ = 457;
struct test_module_attach* skel;
struct test_module_attach__bss *bss;
int err;
@@ -48,8 +73,10 @@ void test_module_attach(void)
 
/* trigger tracepoint */
ASSERT_OK(trigger_module_test_read(READ_SZ), "trigger_read");
+   ASSERT_OK(trigger_module_test_write(WRITE_SZ), "trigger_write");
 
ASSERT_EQ(bss->raw_tp_read_sz, READ_SZ, "raw_tp");
+   ASSERT_EQ(bss->raw_tp_bare_write_sz, WRITE_SZ, "raw_tp_bare");
ASSERT_EQ(bss->tp_btf_read_sz, READ_SZ, "tp_btf");
ASSERT_EQ(bss->fentry_read_sz, READ_SZ, "fentry");
ASSERT_EQ(bss->fentry_manual_read_sz, READ_SZ, "fentry_manual");
diff --git a/tools/testing/selftests/bpf/progs/test_module_attach.c 
b/tools/testing/selftests/bpf/progs/test_module_attach.c
index efd1e

[PATCH v2 bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints

2021-01-16 Thread Qais Yousef
Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

From bpf's perspective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Update Documentation/bpf/bpf_design_QA.rst to document this contract.

Signed-off-by: Qais Yousef 
---
 Documentation/bpf/bpf_design_QA.rst |  6 ++
 include/trace/bpf_probe.h   | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/bpf/bpf_design_QA.rst 
b/Documentation/bpf/bpf_design_QA.rst
index 2df7b067ab93..0e15f9b05c9d 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. 
Both of these
 kernel internals are subject to change and can break with newer kernels
 such that the program needs to be adapted accordingly.
 
+Q: Are tracepoints part of the stable ABI?
+--
+A: NO. Tracepoints are tied to internal implementation details hence they are
+subject to change and can break with newer kernels. BPF programs need to change
+accordingly when this happens.
+
 Q: How much stack space a BPF program uses?
 ---
 A: Currently all program types are limited to 512 bytes of stack
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..a23be89119aa 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,8 +55,7 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, 
COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#define __BPF_DECLARE_TRACE(call, proto, args) \
 static notrace void\
 __bpf_trace_##call(void *__data, proto)
\
 {  \
@@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
\
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));  
\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
@@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
PARAMS(args), size)
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
+#undef DECLARE_TRACE
+#define DECLARE_TRACE(call, proto, args)   \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \
+   __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 #undef DEFINE_EVENT_WRITABLE
-- 
2.25.1



Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-14 Thread Qais Yousef
On 01/13/21 17:40, Jean-Philippe Brucker wrote:
> On Wed, Jan 13, 2021 at 10:21:31AM +0000, Qais Yousef wrote:
> > On 01/12/21 12:07, Andrii Nakryiko wrote:
> > > > > > $ sudo ./test_progs -v -t module_attach
> > > > >
> > > > > use -vv when debugging stuff like that with test_progs, it will output
> > > > > libbpf detailed logs, that often are very helpful
> > > >
> > > > I tried that but it didn't help me. Full output is here
> > > >
> > > > https://paste.debian.net/1180846
> > > >
> > > 
> > > It did help a bit for me to make sure that you have bpf_testmod
> > > properly loaded and its BTF was found, so the problem is somewhere
> > > else. Also, given load succeeded and attach failed with OPNOTSUPP, I
> > > suspect you are missing some of FTRACE configs, which seems to be
> > > missing from selftests's config as well. Check that you have
> > > CONFIG_FTRACE=y and CONFIG_DYNAMIC_FTRACE=y, and you might need some
> > > more. See [0] for a real config we are using to run all tests in
> > > libbpf CI. If you figure out what you were missing, please also
> > > contribute a patch to selftests' config.
> > > 
> > >   [0] 
> > > https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/latest.config
> > 
> > Yeah that occurred to me too. I do have all necessary FTRACE options 
> > enabled,
> > including DYNAMIC_FTRACE. I think I did try enabling fault injection too 
> > just
> > in case. I have CONFIG_FAULT_INJECTION=y and 
> > CONFIG_FUNCTION_ERROR_INJECTION=y.
> 
> Could it come from lack of fentry support on arm64 (or are you testing on
> x86?) Since the arm64 JIT doesn't have trampoline support at the moment, a
> lot of bpf selftests fail with ENOTSUPP.

I am on arm64. I honestly have no clue about this. I'll try to dig into it.

Thanks

--
Qais Yousef


hrtimer_interrupt::pr_warn_once() causes BUG: Invalid wait context

2021-01-14 Thread Qais Yousef
Hi

I hit this splat today

# [67628.388606] hrtimer: interrupt took 304720 ns
[67628.393546]
[67628.393550] =============================
[67628.393554] [ BUG: Invalid wait context ]
[67628.393557] 5.11.0-rc3-00019-g86be331946f7 #37 Not tainted
[67628.393560] -----------------------------
[67628.393563] sugov:0/192 is trying to lock:
[67628.393566] 000800b1d898 (_lock_key){-.-.}-{3:3}, at: 
pl011_console_write+0x138/0x218
[67628.393581] other info that might help us debug this:
[67628.393584] context-{2:2}
[67628.393586] 4 locks held by sugov:0/192:
[67628.393589]  #0: 0008059cb720 
(_policy->work_lock){+.+.}-{4:4}, at: sugov_work+0x58/0x88
[67628.393603]  #1: 800015446f20 (prepare_lock){+.+.}-{4:4}, at: 
clk_prepare_lock+0x34/0xb0
[67628.393618]  #2: 8000152aaa60 (console_lock){+.+.}-{0:0}, at: 
vprintk_emit+0x12c/0x310
[67628.393632]  #3: 8000152aab88 (console_owner){-.-.}-{0:0}, at: 
console_unlock+0x190/0x6d8
[67628.393646] stack backtrace:
[67628.393649] CPU: 0 PID: 192 Comm: sugov:0 Not tainted 
5.11.0-rc3-00019-g86be331946f7 #37
[67628.393653] Hardware name: ARM Juno development board (r2) (DT)
[67628.393656] Call trace:
[67628.393659]  dump_backtrace+0x0/0x1b0
[67628.393661]  show_stack+0x20/0x70
[67628.393664]  dump_stack+0xf8/0x168
[67628.393666]  __lock_acquire+0x964/0x1c88
[67628.393669]  lock_acquire+0x26c/0x500
[67628.393671]  _raw_spin_lock+0x64/0x88
[67628.393674]  pl011_console_write+0x138/0x218
[67628.393677]  console_unlock+0x2c4/0x6d8
[67628.393680]  vprintk_emit+0x134/0x310
[67628.393682]  vprintk_default+0x40/0x50
[67628.393685]  vprintk_func+0xfc/0x2b0
[67628.393687]  printk+0x68/0x90
[67628.393690]  hrtimer_interrupt+0x24c/0x250
[67628.393693]  arch_timer_handler_phys+0x3c/0x50
[67628.393695]  handle_percpu_devid_irq+0xd8/0x460
[67628.393698]  generic_handle_irq+0x38/0x50
[67628.393701]  __handle_domain_irq+0x6c/0xc8
[67628.393704]  gic_handle_irq+0xb0/0xf0
[67628.393706]  el1_irq+0xb4/0x180
[67628.393709]  _raw_spin_unlock_irqrestore+0x58/0xa8
[67628.393712]  hrtimer_start_range_ns+0x1b0/0x420
[67628.393715]  msg_submit+0x100/0x108
[67628.393717]  mbox_send_message+0x84/0x128
[67628.393720]  scpi_send_message+0x11c/0x2a8
[67628.393723]  scpi_dvfs_set_idx+0x48/0x70
[67628.393726]  scpi_dvfs_set_rate+0x60/0x78
[67628.393728]  clk_change_rate+0x184/0x8a8
[67628.393731]  clk_core_set_rate_nolock+0x1c0/0x280
[67628.393734]  clk_set_rate+0x40/0x98
[67628.393736]  scpi_cpufreq_set_target+0x40/0x68
[67628.393739]  __cpufreq_driver_target+0x1a0/0x5d0

AFAICT the call to pr_warn_once() is how we end up here. Not sure if there's an
appropriate or easy fix for that.

But for the sake of documenting at least, sending this report to LKML.

It was a random occurrence and not something I can reproduce.

Thanks

--
Qais Yousef


Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-13 Thread Qais Yousef
On 01/12/21 12:07, Andrii Nakryiko wrote:
> > > > $ sudo ./test_progs -v -t module_attach
> > >
> > > use -vv when debugging stuff like that with test_progs, it will output
> > > libbpf detailed logs, that often are very helpful
> >
> > I tried that but it didn't help me. Full output is here
> >
> > https://paste.debian.net/1180846
> >
> 
> It did help a bit for me to make sure that you have bpf_testmod
> properly loaded and its BTF was found, so the problem is somewhere
> else. Also, given load succeeded and attach failed with OPNOTSUPP, I
> suspect you are missing some of FTRACE configs, which seems to be
> missing from selftests's config as well. Check that you have
> CONFIG_FTRACE=y and CONFIG_DYNAMIC_FTRACE=y, and you might need some
> more. See [0] for a real config we are using to run all tests in
> libbpf CI. If you figure out what you were missing, please also
> contribute a patch to selftests' config.
> 
>   [0] 
> https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/latest.config

Yeah that occurred to me too. I do have all necessary FTRACE options enabled,
including DYNAMIC_FTRACE. I think I did try enabling fault injection too just
in case. I have CONFIG_FAULT_INJECTION=y and CONFIG_FUNCTION_ERROR_INJECTION=y.

I will look at the CI config and see if I can figure it out.

I will likely get a chance to look at all of this and send v2  over the
weekend.

Thanks

--
Qais Yousef


Re: [PATCH bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints

2021-01-13 Thread Qais Yousef
On 01/12/21 12:19, Yonghong Song wrote:
> I applied the patch to my local bpf-next repo, and got the following
> compilation error:

[...]

> 
> I dumped preprecessor result but after macro expansion, the code
> becomes really complex and I have not figured out why it failed.
> Do you know what is the possible reason?

Yeah, I did a last-minute fix to address a checkpatch.pl error and obviously my
verification of the change wasn't good enough.

If you're keen to try it out I can send you a patch with the fix. I should send v2
by the weekend too.

Thanks for having a look.

Cheers

--
Qais Yousef


Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-12 Thread Qais Yousef
On 01/11/21 23:26, Andrii Nakryiko wrote:
> On Mon, Jan 11, 2021 at 10:20 AM Qais Yousef  wrote:
> >
> > Reuse module_attach infrastructure to add a new bare tracepoint to check
> > we can attach to it as a raw tracepoint.
> >
> > Signed-off-by: Qais Yousef 
> > ---
> >
> > Andrii
> >
> > I was getting the error below when I was trying to run the test.
> > I had to comment out all related fentry* code to be able to test the raw_tp
> > stuff. Not sure something I've done wrong or it's broken for some reason.
> > I was on v5.11-rc2.
> 
> Check that you have all the required Kconfig options from
> tools/testing/selftests/bpf/config. And also you will need to build

Yep I have merged this config snippet using merge_config.sh script.

> pahole from master, 1.19 doesn't have some fixes that add kernel
> module support. I think pahole is the reasons why you have the failure
> below.

I am using pahole 1.19. I have built it from tip of master though.

/me tries using the v1.19 tag

Still fails the same.

> 
> >
> > $ sudo ./test_progs -v -t module_attach
> 
> use -vv when debugging stuff like that with test_progs, it will output
> libbpf detailed logs, that often are very helpful

I tried that but it didn't help me. Full output is here

https://paste.debian.net/1180846

> 
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > test_module_attach:PASS:skel_open 0 nsec
> > test_module_attach:PASS:set_attach_target 0 nsec
> > test_module_attach:PASS:skel_load 0 nsec
> > libbpf: prog 'handle_fentry': failed to attach: ERROR: 
> > strerror_r(-524)=22
> > libbpf: failed to auto-attach program 'handle_fentry': -524
> > test_module_attach:FAIL:skel_attach skeleton attach failed: -524
> > #58 module_attach:FAIL
> > Successfully unloaded bpf_testmod.ko.
> > Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
> >
> 
> But even apart from test failure, there seems to be kernel build
> failure. See [0] for what fails in kernel-patches CI.
> 
>[0] https://travis-ci.com/github/kernel-patches/bpf/builds/212730017

Sorry about that. I did a last-minute change because of a checkpatch.pl error and
it seems I either forgot to rebuild or missed that the rebuild failed :/

> 
> 
> >
> >  .../selftests/bpf/bpf_testmod/bpf_testmod-events.h |  6 ++
> >  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c  |  2 ++
> >  tools/testing/selftests/bpf/prog_tests/module_attach.c |  1 +
> >  tools/testing/selftests/bpf/progs/test_module_attach.c | 10 ++
> >  4 files changed, 19 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
> > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > index b83ea448bc79..e1ada753f10c 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > @@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
> >   __entry->pid, __entry->comm, __entry->off, __entry->len)
> >  );
> >
> > +/* A bare tracepoint with no event associated with it */
> > +DECLARE_TRACE(bpf_testmod_test_read_bare,
> > +   TP_PROTO(struct task_struct *task, struct bpf_testmod_test_read_ctx 
> > *ctx),
> > +   TP_ARGS(task, ctx)
> > +);
> > +
> >  #endif /* _BPF_TESTMOD_EVENTS_H */
> >
> >  #undef TRACE_INCLUDE_PATH
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
> > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > index 2df19d73ca49..d63cebdaca44 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > @@ -22,6 +22,8 @@ bpf_testmod_test_read(struct file *file, struct kobject 
> > *kobj,
> > };
> >
> > trace_bpf_testmod_test_read(current, );
> > +   ctx.len++;
> > +   trace_bpf_testmod_test_read_bare(current, );
> 
> It's kind of boring to have two read tracepoints :) Do you mind adding

Hehe boring is good :p

> a write tracepoint and use bare tracepoint there? You won't need this
> ctx.len++ hack as well. Feel free to add identical
> bpf_testmod_test_write_ctx (renaming it is more of a pain).

It was easy to get this done. So I think it should be easy to make it a write
too :)
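
Something along these lines, I assume (just a sketch of what the write-side
bare tracepoint could look like, mirroring the read one; the struct layout
below is my guess for illustration, not taken from a posted patch):

    struct bpf_testmod_test_write_ctx {
            char *buf;
            loff_t off;
            size_t len;
    };

    /* A bare tracepoint for the write path, no event associated with it */
    DECLARE_TRACE(bpf_testmod_test_write_bare,
            TP_PROTO(struct task_struct *task, struct bpf_testmod_test_write_ctx *ctx),
            TP_ARGS(task, ctx)
    );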

Thanks

--
Qais Yousef

> 
> >
> > return -EIO; /* always fail */
>

[PATCH bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints

2021-01-11 Thread Qais Yousef
Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

From bpf's perspective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Update Documentation/bpf/bpf_design_QA.rst to document this contract.

Signed-off-by: Qais Yousef 
---
 Documentation/bpf/bpf_design_QA.rst |  6 ++
 include/trace/bpf_probe.h   | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/bpf/bpf_design_QA.rst 
b/Documentation/bpf/bpf_design_QA.rst
index 2df7b067ab93..0e15f9b05c9d 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. 
Both of these
 kernel internals are subject to change and can break with newer kernels
 such that the program needs to be adapted accordingly.
 
+Q: Are tracepoints part of the stable ABI?
+------------------------------------------
+A: NO. Tracepoints are tied to internal implementation details hence they are
+subject to change and can break with newer kernels. BPF programs need to change
+accordingly when this happens.
+
 Q: How much stack space a BPF program uses?
 -------------------------------------------
 A: Currently all program types are limited to 512 bytes of stack
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..cf1496b162b1 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,8 +55,7 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, 
COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#define __BPF_DECLARE_TRACE(call, proto, args) \
 static notrace void\
 __bpf_trace_##call(void *__data, proto)
\
 {  \
@@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
\
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));  
\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
@@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
PARAMS(args), size)
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
+#undef DECLARE_TRACE
+#define DECLARE_TRACE(call, proto, args)   \
+   (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
+__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0))
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 #undef DEFINE_EVENT_WRITABLE
-- 
2.25.1



[PATCH bpf-next 0/2] Allow attaching to bare tracepoints

2021-01-11 Thread Qais Yousef
Add some missing glue logic to teach bpf about bare tracepoints - tracepoints
without any trace event associated with them.

Bare tracepoints are declared with DECLARE_TRACE(). Full tracepoints are
declared with TRACE_EVENT().

BPF can attach to these tracepoints as RAW_TRACEPOINT() only, as there are no
events created for them in tracefs.
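
For illustration only (the names below are made up for the example, they are
not from these patches), the kernel side and the bpf side look roughly like
this:

    /* kernel: a bare tracepoint - no trace event, nothing under tracefs */
    DECLARE_TRACE(foo_bare_tp,
            TP_PROTO(struct task_struct *p, int val),
            TP_ARGS(p, val)
    );

    /* bpf (needs vmlinux.h, bpf_helpers.h, bpf_tracing.h): attach by name */
    SEC("raw_tp/foo_bare_tp")
    int BPF_PROG(handle_foo, struct task_struct *p, int val)
    {
            return 0;
    }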

Qais Yousef (2):
  trace: bpf: Allow bpf to attach to bare tracepoints
  selftests: bpf: Add a new test for bare tracepoints

 Documentation/bpf/bpf_design_QA.rst  |  6 ++
 include/trace/bpf_probe.h| 12 ++--
 .../selftests/bpf/bpf_testmod/bpf_testmod-events.h   |  6 ++
 .../testing/selftests/bpf/bpf_testmod/bpf_testmod.c  |  2 ++
 .../testing/selftests/bpf/prog_tests/module_attach.c |  1 +
 .../testing/selftests/bpf/progs/test_module_attach.c | 10 ++
 6 files changed, 35 insertions(+), 2 deletions(-)

-- 
2.25.1



[PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-11 Thread Qais Yousef
Reuse module_attach infrastructure to add a new bare tracepoint to check
we can attach to it as a raw tracepoint.

Signed-off-by: Qais Yousef 
---

Andrii

I was getting the error below when I was trying to run the test.
I had to comment out all related fentry* code to be able to test the raw_tp
stuff. Not sure if it's something I've done wrong or if it's broken for some reason.
I was on v5.11-rc2.

$ sudo ./test_progs -v -t module_attach
bpf_testmod.ko is already unloaded.
Loading bpf_testmod.ko...
Successfully loaded bpf_testmod.ko.
test_module_attach:PASS:skel_open 0 nsec
test_module_attach:PASS:set_attach_target 0 nsec
test_module_attach:PASS:skel_load 0 nsec
libbpf: prog 'handle_fentry': failed to attach: ERROR: 
strerror_r(-524)=22
libbpf: failed to auto-attach program 'handle_fentry': -524
test_module_attach:FAIL:skel_attach skeleton attach failed: -524
#58 module_attach:FAIL
Successfully unloaded bpf_testmod.ko.
Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED


 .../selftests/bpf/bpf_testmod/bpf_testmod-events.h |  6 ++
 tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c  |  2 ++
 tools/testing/selftests/bpf/prog_tests/module_attach.c |  1 +
 tools/testing/selftests/bpf/progs/test_module_attach.c | 10 ++
 4 files changed, 19 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
index b83ea448bc79..e1ada753f10c 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
@@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
  __entry->pid, __entry->comm, __entry->off, __entry->len)
 );
 
+/* A bare tracepoint with no event associated with it */
+DECLARE_TRACE(bpf_testmod_test_read_bare,
+   TP_PROTO(struct task_struct *task, struct bpf_testmod_test_read_ctx 
*ctx),
+   TP_ARGS(task, ctx)
+);
+
 #endif /* _BPF_TESTMOD_EVENTS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 2df19d73ca49..d63cebdaca44 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -22,6 +22,8 @@ bpf_testmod_test_read(struct file *file, struct kobject *kobj,
};
 
trace_bpf_testmod_test_read(current, );
+   ctx.len++;
+   trace_bpf_testmod_test_read_bare(current, );
 
return -EIO; /* always fail */
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c 
b/tools/testing/selftests/bpf/prog_tests/module_attach.c
index 50796b651f72..7085a118f38c 100644
--- a/tools/testing/selftests/bpf/prog_tests/module_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c
@@ -50,6 +50,7 @@ void test_module_attach(void)
ASSERT_OK(trigger_module_test_read(READ_SZ), "trigger_read");
 
ASSERT_EQ(bss->raw_tp_read_sz, READ_SZ, "raw_tp");
+   ASSERT_EQ(bss->raw_tp_bare_read_sz, READ_SZ+1, "raw_tp_bare");
ASSERT_EQ(bss->tp_btf_read_sz, READ_SZ, "tp_btf");
ASSERT_EQ(bss->fentry_read_sz, READ_SZ, "fentry");
ASSERT_EQ(bss->fentry_manual_read_sz, READ_SZ, "fentry_manual");
diff --git a/tools/testing/selftests/bpf/progs/test_module_attach.c 
b/tools/testing/selftests/bpf/progs/test_module_attach.c
index efd1e287ac17..08aa157afa1d 100644
--- a/tools/testing/selftests/bpf/progs/test_module_attach.c
+++ b/tools/testing/selftests/bpf/progs/test_module_attach.c
@@ -17,6 +17,16 @@ int BPF_PROG(handle_raw_tp,
return 0;
 }
 
+__u32 raw_tp_bare_read_sz = 0;
+
+SEC("raw_tp/bpf_testmod_test_read_bare")
+int BPF_PROG(handle_raw_tp_bare,
+struct task_struct *task, struct bpf_testmod_test_read_ctx 
*read_ctx)
+{
+   raw_tp_bare_read_sz = BPF_CORE_READ(read_ctx, len);
+   return 0;
+}
+
 __u32 tp_btf_read_sz = 0;
 
 SEC("tp_btf/bpf_testmod_test_read")
-- 
2.25.1



Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-11 Thread Qais Yousef
On 01/11/21 15:04, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 06:26:42PM +0000, Qais Yousef wrote:
> 
> > So I have a proper patch for that now, that actually turned out to be really
> > tiny once you untangle exactly what is missing.
> > 
> > Peter, bpf programs aren't considered ABIs AFAIK, do you have concerns about
> > that?
> 
> In order to use these you need to rely on BTF to get anything useful
> done right? And anything that relies on BTF cannot be ABI.

Yes. To decode struct rq for instance one has to either hardcode it in their
program or use BTF to get the definition dynamically.
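
For example, with CO-RE and a vmlinux.h generated from BTF, a program hooked
to the capacity tracepoint can read rq fields roughly like this (a sketch;
treat the tracepoint name and the field access as illustrative):

    /* needs vmlinux.h, bpf_helpers.h, bpf_tracing.h and bpf_core_read.h */
    SEC("raw_tp/sched_cpu_capacity_tp")
    int BPF_PROG(handle_cap, struct rq *rq)
    {
            /* struct rq layout comes from BTF, not from any UAPI header */
            unsigned int nr = BPF_CORE_READ(rq, nr_running);

            bpf_printk("nr_running=%u", nr);
            return 0;
    }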

The worry is if we modify the function signature of the tracepoint. Alexei did
confirm this can't be an ABI and I'm adding additional documentation to make
this crystal clear.

Thanks

--
Qais Yousef


Re: [RFC][PATCH 1/5] sched/fair: Fix select_idle_cpu()s cost accounting

2021-01-08 Thread Qais Yousef
On 01/08/21 10:27, Mel Gorman wrote:
>   for_each_cpu_wrap(cpu, cpus, target) {
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
> + /* Adjust cost of a successful scan */
> + loops <<= 2;
> +
>   break;
> + }
>  
> - if (loops >= nr) {
> + if (++loops >= nr) {
>   cpu = -1;
>   break;
>   }
> - loops++;

Random (out of the blue) comment.

Now this will increment loops before the comparison/break, i.e. we're
effectively doing one iteration fewer, IIRC. Should loops be initialized to
0 instead of 1?

Thanks

--
Qais Yousef


Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-07 Thread Qais Yousef
On 01/06/21 15:42, Andrii Nakryiko wrote:
> On Wed, Jan 6, 2021 at 3:27 AM Qais Yousef  wrote:
> >
> > On 01/05/21 08:44, Alexei Starovoitov wrote:
> > > > Any pointer to an example test I could base this on?
> > >
> > > selftests/bpf/
> >
> > I was hoping for something more elaborate. I thought there's something 
> > already
> > there that do some verification for raw tracepoint that I could either 
> > extend
> > or replicate. Otherwise this could end up being a time sink for me and I'm 
> > not
> > keen on jumping down this rabbit hole.
> 
> One way would be to add either another custom tracepoint definition to
> a test module or modify the existing one to be a bare tracepoint. See
> links below.
> 
> If it's easy to trigger those tracepoints from user-space on demand,
> writing a similar (to module_attach) selftest for in-kernel tracepoint
> is trivial.
> 
>   [0] 
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
>   [1] 
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/test_module_attach.c#L12-L18
>   [2] 
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/prog_tests/module_attach.c

Thanks a lot Andrii. That will make it much easier to figure out something.

Cheers

--
Qais Yousef


Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-06 Thread Qais Yousef
On 01/05/21 08:44, Alexei Starovoitov wrote:
> > Any pointer to an example test I could base this on?
> 
> selftests/bpf/

I was hoping for something more elaborate. I thought there's something already
there that does some verification for raw tracepoints that I could either extend
or replicate. Otherwise this could end up being a time sink for me and I'm not
keen on jumping down this rabbit hole.

> > > - add a doc with contents from commit log.
> >
> > You're referring to the ABI part of the changelog, right?
> >
> > > The "Does bpf make things into an abi ?" question keeps coming back
> > > over and over again.
> > > Everytime we have the same answer that No, bpf cannot bake things into 
> > > abi.
> > > I think once it's spelled out somewhere in Documentation/ it would be 
> > > easier to
> > > repeat this message.
> >
> > How about a new Documentation/bpf/ABI.rst? I can write something up 
> > initially
> > for us to discuss in detail when I post.
> 
> There is Documentation/bpf/bpf_design_QA.rst
> and we already have this text in there that was added back in 2017:
> 
> Q: Does BPF have a stable ABI?
> --
> A: YES. BPF instructions, arguments to BPF programs, set of helper
> functions and their arguments, recognized return codes are all part
> of ABI. However there is one specific exception to tracing programs
> which are using helpers like bpf_probe_read() to walk kernel internal
> data structures and compile with kernel internal headers. Both of these
> kernel internals are subject to change and can break with newer kernels
> such that the program needs to be adapted accordingly.
> 
> I'm suggesting to add an additional section to this Q/A doc to include
> more or less
> the same text you had in the commit log.

Works for me.

Thanks

--
Qais Yousef


Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-05 Thread Qais Yousef
On 01/04/21 10:59, Alexei Starovoitov wrote:
> > > > I did have a patch that allowed that. It might be worth trying to 
> > > > upstream it.
> > > > It just required a new macro which could be problematic.
> > > >
> > > > https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > >
> > > > With the above I could attach using bpf::RAW_TRACEPOINT mechanism.
> > > >
> > >
> > > Yeah, that could work. I meant there was no way to do it with what was 
> > > there :)
> > >
> > > In our initial attempts at using BPF to get at nr_running (which I was not
> > > involved in and don't have all the details...) there were issues being 
> > > able to
> > > keep up and losing events.  That may have been an implementation issue, 
> > > but
> > > using the module and trace-cmd doesn't have that problem. Hopefully you 
> > > don't
> > > see that using RAW_TRACEPOINTs.
> >
> > So I have a proper patch for that now, that actually turned out to be really
> > tiny once you untangle exactly what is missing.
> >
> > Peter, bpf programs aren't considered ABIs AFAIK, do you have concerns about
> > that?
> >
> > Thanks
> >
> > --
> > Qais Yousef
> >
> > -->8--
> >
> > From cf81de8c7db03d62730939aa902579339e2fc859 Mon Sep 17 00:00:00 2001
> > From: Qais Yousef 
> > Date: Wed, 30 Dec 2020 22:20:34 +
> > Subject: [PATCH] trace: bpf: Allow bpf to attach to bare tracepoints
> >
> > Some subsystems only have bare tracepoints (a tracepoint with no
> > associated trace event) to avoid the problem of trace events being an
> > ABI that can't be changed.
> >
> > From bpf presepective, bare tracepoints are what it calls
> > RAW_TRACEPOINT().
> >
> > Since bpf assumed there's 1:1 mapping, it relied on hooking to
> > DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
> > bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
> > no knowledge about their existence.
> >
> > By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
> > DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.
> >
> > Enabling that comes with the contract that changes to raw tracepoints
> > don't constitute a regression if they break existing bpf programs.
> > We need the ability to continue to morph and modify these raw
> > tracepoints without worrying about any ABI.
> >
> > Signed-off-by: Qais Yousef 
> > ---
> >  include/trace/bpf_probe.h | 12 ++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> > index cd74bffed5c6..a23be89119aa 100644
> > --- a/include/trace/bpf_probe.h
> > +++ b/include/trace/bpf_probe.h
> > @@ -55,8 +55,7 @@
> >  /* tracepoints with more than 12 arguments will hit build error */
> >  #define CAST_TO_U64(...) CONCATENATE(__CAST, 
> > COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> >
> > -#undef DECLARE_EVENT_CLASS
> > -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> > +#define __BPF_DECLARE_TRACE(call, proto, args) \
> >  static notrace void\
> >  __bpf_trace_##call(void *__data, proto)
> > \
> >  {  \
> > @@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
> > \
> > CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, 
> > CAST_TO_U64(args));  \
> >  }
> >
> > +#undef DECLARE_EVENT_CLASS
> > +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> > +   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> > +
> >  /*
> >   * This part is compiled out, it is only here as a build time check
> >   * to make sure that if the tracepoint handling changes, the
> > @@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
> > PARAMS(args), size)
> >  #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
> > DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
> >
> > +#undef DECLARE_TRACE
> > +#define DECLARE_TRACE(call, proto, args)   \
> > +   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \
> > +   __DEFINE_EVENT(call,

Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-04 Thread Qais Yousef
On 09/08/20 09:19, Phil Auld wrote:
> Hi Quais,
> 
> On Mon, Sep 07, 2020 at 12:02:24PM +0100 Qais Yousef wrote:
> > On 09/02/20 09:54, Phil Auld wrote:
> > > > 
> > > > I think this decoupling is not necessary. The natural place for those
> > > > scheduler trace_event based on trace_points extension files is
> > > > kernel/sched/ and here the internal sched.h can just be included.
> > > > 
> > > > If someone really wants to build this as an out-of-tree module there is
> > > > an easy way to make kernel/sched/sched.h visible.
> > > >
> > > 
> > > It's not so much that we really _want_ to do this in an external module.
> > > But we aren't adding more trace events and my (limited) knowledge of
> > > BPF let me to the conclusion that its raw tracepoint functionality
> > > requires full events.  I didn't see any other way to do it.
> > 
> > I did have a patch that allowed that. It might be worth trying to upstream 
> > it.
> > It just required a new macro which could be problematic.
> > 
> > https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > 
> > With the above I could attach using bpf::RAW_TRACEPOINT mechanism.
> >
> 
> Yeah, that could work. I meant there was no way to do it with what was there 
> :)
> 
> In our initial attempts at using BPF to get at nr_running (which I was not
> involved in and don't have all the details...) there were issues being able to
> keep up and losing events.  That may have been an implementation issue, but
> using the module and trace-cmd doesn't have that problem. Hopefully you don't
> see that using RAW_TRACEPOINTs.

So I have a proper patch for that now, that actually turned out to be really
tiny once you untangle exactly what is missing.

Peter, bpf programs aren't considered ABIs AFAIK, do you have concerns about
that?

Thanks

--
Qais Yousef

-->8--

From cf81de8c7db03d62730939aa902579339e2fc859 Mon Sep 17 00:00:00 2001
From: Qais Yousef 
Date: Wed, 30 Dec 2020 22:20:34 +
Subject: [PATCH] trace: bpf: Allow bpf to attach to bare tracepoints

Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

From bpf's perspective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Signed-off-by: Qais Yousef 
---
 include/trace/bpf_probe.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..a23be89119aa 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,8 +55,7 @@
 /* tracepoints with more than 12 arguments will hit build error */
 #define CAST_TO_U64(...) CONCATENATE(__CAST, 
COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
 
-#undef DECLARE_EVENT_CLASS
-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#define __BPF_DECLARE_TRACE(call, proto, args) \
 static notrace void\
 __bpf_trace_##call(void *__data, proto)
\
 {  \
@@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
\
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));  
\
 }
 
+#undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+
 /*
  * This part is compiled out, it is only here as a build time check
  * to make sure that if the tracepoint handling changes, the
@@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
PARAMS(args), size)
 #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
 
+#undef DECLARE_TRACE
+#define DECLARE_TRACE(call, proto, args)   \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \
+   __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
+
 #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
 
 #undef DEFINE_EVENT_WRITABLE
-- 
2.25.1



Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-04 Thread Qais Yousef
On 09/07/20 13:13, pet...@infradead.org wrote:
> On Mon, Sep 07, 2020 at 11:48:45AM +0100, Qais Yousef wrote:
> > IMHO the above is a hack. Out-of-tree modules should rely on public headers 
> > and
> > exported functions only. What you propose means that people who want to use
> > these tracepoints in meaningful way must have a prebuilt kernel handy. 
> > Which is
> > maybe true for us who work in the embedded world. But users who run normal
> > distro kernels (desktop/servers) will fail to build against
> 
> But this isn't really aimed at regular users. We're aiming this at
> developers (IIUC) so I dont really see this as a problem.
> 
> > FWIW, I did raise this concern with Peter in 2019 OSPM and he was okay with 
> > the
> > exports as it's still not a contract and they can disappear anytime we want.
> > Migrating to using BTF is the right way forward IMO. I don't think what we 
> > have
> > here is out-of-control yet. Though I agree they're annoying.
> 
> Right, we're hiding behind the explicit lack of ABI for modules.
> 
> Anyway, CTF/BTF/random other crap that isn't DWARFs should work fine to
> replace all this muck. Just no idea what the state of any of that is.

So I have updated my reference module to remove the dependency on those
functions:


https://github.com/qais-yousef/tracepoints-helpers/tree/pelt-tps-v3-create-events/sched_tp

When building against a prebuilt kernel tree, I use pahole + vmlinux to generate
vmlinux.h with all the internal structs we depend on. The kernel must be
compiled with CONFIG_DEBUG_INFO for pahole to work.

When building against a running kernel (ie: exported headers only available),
we try to use /sys/kernel/btf/vmlinux to generate vmlinux.h.

One outstanding issue is that pahole + BTF don't generate alignment info, so
while the module compiles, I get a crash when I load the module and enable
one of the tracepoints. bpftool generates alignment info with BTF, but it dumps
everything which leads to another set of problems.

Barring fixing the BTF issue which I'm talking with Arnaldo about, I think we
should be in a good position to remove these exported functions soon.

In the module we have to maintain helper functions (see sched_tp_helpers.h) but
I think that's much easier than maintaining an out-of-tree copy of the structs.
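
For example, they're tiny accessors along these lines (illustrative only, not
the exact contents of the file):

    /* sched_tp_helpers.h: small accessors for fields the events need */
    static inline int sched_tp_rq_cpu(struct rq *rq)
    {
            /* struct rq definition comes from the generated vmlinux.h */
            return rq ? rq->cpu : -1;
    }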

It also enabled me to add uclamp trace events which had dependency on internal
details that I wasn't keen on exporting in mainline.

Is this a good/better compromise?

Thanks

--
Qais Yousef


Re: BTFIDS: FAILED unresolved symbol udp6_sock

2020-12-30 Thread Qais Yousef
On 12/30/20 14:28, Jiri Olsa wrote:
> On Wed, Dec 30, 2020 at 02:28:02PM +0100, Jiri Olsa wrote:
> > On Wed, Dec 30, 2020 at 10:03:37AM +0100, Jiri Olsa wrote:
> > > On Tue, Dec 29, 2020 at 11:28:35PM +, Qais Yousef wrote:
> > > > Hi Jiri
> > > > 
> > > > On 12/29/20 18:34, Jiri Olsa wrote:
> > > > > On Tue, Dec 29, 2020 at 03:13:52PM +, Qais Yousef wrote:
> > > > > > Hi
> > > > > > 
> > > > > > When I enable CONFIG_DEBUG_INFO_BTF I get the following error in 
> > > > > > the BTFIDS
> > > > > > stage
> > > > > > 
> > > > > > FAILED unresolved symbol udp6_sock
> > > > > > 
> > > > > > I cross compile for arm64. My .config is attached.
> > > > > > 
> > > > > > I managed to reproduce the problem on v5.9 and v5.10. Plus 5.11-rc1.
> > > > > > 
> > > > > > Have you seen this before? I couldn't find a specific report about 
> > > > > > this
> > > > > > problem.
> > > > > > 
> > > > > > Let me know if you need more info.
> > > > > 
> > > > > hi,
> > > > > this looks like symptom of the gcc DWARF bug we were
> > > > > dealing with recently:
> > > > > 
> > > > >   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97060
> > > > >   
> > > > > https://lore.kernel.org/lkml/CAE1WUT75gu9G62Q9uAALGN6vLX=o7vz9uhqtvwnbuv81dgm...@mail.gmail.com/#r
> > > > > 
> > > > > what pahole/gcc version are you using?
> > > > 
> > > > I'm on gcc 9.3.0
> > > > 
> > > > aarch64-linux-gnu-gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> > > > 
> > > > I was on pahole v1.17. I moved to v1.19 but I still see the same 
> > > > problem.
> > > 
> > > I can reproduce with your .config, but make 'defconfig' works,
> > > so I guess it's some config option issue, I'll check later today

My .config was a defconfig + CONFIG_DEBUG_INFO_BTF + converting all modules to
built-in.

> > 
> > so your .config has
> >   CONFIG_CRYPTO_DEV_BCM_SPU=y

Ah, how random. I removed this config and indeed it does work then, thanks!

> > 
> > and that defines 'struct device_private' which
> > clashes with the same struct defined in drivers/base/base.h
> > 
> > so several networking structs will be doubled, like net_device:
> > 
> > $ bpftool btf dump file ../vmlinux.config | grep net_device\' | grep 
> > STRUCT
> > [2731] STRUCT 'net_device' size=2240 vlen=133
> > [113981] STRUCT 'net_device' size=2240 vlen=133
> > 
> > each is using different 'struct device_private' when it's unwinded
> > 
> > and that will confuse BTFIDS logic, becase we have multiple structs
> > with the same name, and we can't be sure which one to pick

We can't tell which object/subsystem the struct comes from?

Or maybe we can introduce some annotation to help BTFIDS to pick the right one?

> > 
> > perhaps we should check on this in pahole and warn earlier with
> > better error message.. I'll check, but I'm not sure if pahole can
> > survive another hastab ;-)
> > 
> > Andrii, any ideas on this? ;-)
> > 
> > easy fix is the patch below that renames the bcm's structs,
> > it makes the kernel to compile.. but of course the new name
> > is probably wrong and we should push this through that code
> > authors
> 
> also another quick fix is to switch it to module

This works too.

Thanks for your speedy response.

Cheers

--
Qais Yousef


Re: BTFIDS: FAILED unresolved symbol udp6_sock

2020-12-29 Thread Qais Yousef
Hi Jiri

On 12/29/20 18:34, Jiri Olsa wrote:
> On Tue, Dec 29, 2020 at 03:13:52PM +0000, Qais Yousef wrote:
> > Hi
> > 
> > When I enable CONFIG_DEBUG_INFO_BTF I get the following error in the BTFIDS
> > stage
> > 
> > FAILED unresolved symbol udp6_sock
> > 
> > I cross compile for arm64. My .config is attached.
> > 
> > I managed to reproduce the problem on v5.9 and v5.10. Plus 5.11-rc1.
> > 
> > Have you seen this before? I couldn't find a specific report about this
> > problem.
> > 
> > Let me know if you need more info.
> 
> hi,
> this looks like symptom of the gcc DWARF bug we were
> dealing with recently:
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97060
>   
> https://lore.kernel.org/lkml/CAE1WUT75gu9G62Q9uAALGN6vLX=o7vz9uhqtvwnbuv81dgm...@mail.gmail.com/#r
> 
> what pahole/gcc version are you using?

I'm on gcc 9.3.0

aarch64-linux-gnu-gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

I was on pahole v1.17. I moved to v1.19 but I still see the same problem.

Thanks

--
Qais Yousef

> 
> we fixed pahole (v1.19) to workaround the issue and AFAIK
> there's already gcc fix as well


Re: [PATCH 4/5] irqtime: Move irqtime entry accounting after irq offset incrementation

2020-12-29 Thread Qais Yousef
On 12/29/20 15:30, Frederic Weisbecker wrote:
> On Tue, Dec 29, 2020 at 02:12:31PM +0000, Qais Yousef wrote:
> > On 12/29/20 14:41, Frederic Weisbecker wrote:
> > > > > -void vtime_account_irq(struct task_struct *tsk)
> > > > > +void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
> > > > >  {
> > > > > - if (hardirq_count()) {
> > > > > + unsigned int pc = preempt_count() - offset;
> > > > > +
> > > > > + if (pc & HARDIRQ_OFFSET) {
> > > > 
> > > > Shouldn't this be HARDIRQ_MASK like above?
> > > 
> > > In the rare cases of nested hardirqs happening with broken drivers, Only 
> > > the outer hardirq
> > > does matter. All the time spent in the inner hardirqs is included in the 
> > > outer
> > > one.
> > 
> > Ah I see. The original code was doing hardirq_count(), which apparently 
> > wasn't
> > right either.
> > 
> > Shouldn't it be pc == HARDIRQ_OFFSET then? All odd nest counts will trigger
> > this otherwise, and IIUC we want this to trigger once on first entry only.
> 
> Right but we must also handle hardirqs interrupting either preempt disabled 
> sections
> or softirq servicing/disabled section.
> 
> 3 stacking hardirqs should be rare enough that we don't really care. In the
> worst case we are going to account the third IRQ seperately. Not a correctness
> issue, just a rare unoptimized case.

I admit I need to wrap my head around some more details to fully comprehend
that, but that's my own confusion to clear out :-)

Thanks for your answer.

Cheers

--
Qais Yousef


Re: [PATCH 4/5] irqtime: Move irqtime entry accounting after irq offset incrementation

2020-12-29 Thread Qais Yousef
On 12/29/20 14:41, Frederic Weisbecker wrote:
> On Mon, Dec 28, 2020 at 02:15:29AM +0000, Qais Yousef wrote:
> > Hi Frederic
> > 
> > On 12/02/20 12:57, Frederic Weisbecker wrote:
> > > @@ -66,9 +68,9 @@ void irqtime_account_irq(struct task_struct *curr)
> > >* in that case, so as not to confuse scheduler with a special task
> > >* that do not consume any time, but still wants to run.
> > >*/
> > > - if (hardirq_count())
> > > + if (pc & HARDIRQ_MASK)
> > >   irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
> > > - else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
> > > + else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
> > 
> > Noob question. Why for SOFTIRQs we do sofirq_count() & *SOFTIRQ_OFFSET*? It
> > seems we're in-softirq only if the count is odd numbered.
> > 
> > /me tries to dig more
> > 
> > Hmm could it be because the softirq count is actually 1 bit and the rest is
> > for SOFTIRQ_DISABLE_OFFSET (BH disabled)?
> 
> Exactly!
> 
> > 
> > IOW, 1 bit is for we're in softirq context, and the remaining 7 bits are to
> > count BH disable nesting, right?
> > 
> > I guess this would make sense; we don't nest softirqs processing AFAIK. But
> > I could be misreading the code too :-)
> 
> You got it right!
> 
> This is commented in softirq.c somewhere:
> 
> /*
>  * preempt_count and SOFTIRQ_OFFSET usage:
>  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
>  *   softirq processing.
>  * - preempt_count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
>  *   on local_bh_disable or local_bh_enable.
>  * This lets us distinguish between whether we are currently processing
>  * softirq and whether we just have bh disabled.
>  */
> 
> But we should elaborate on the fact that, indeed, softirq processing can't 
> nest,
> while softirq disablement can. I should try to send a patch and comment more
> thoroughly on the subtleties of preempt mask in preempt.h.

Thanks for the info!

> 
> > 
> > >   irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
> > >  }
> > >  
> > > @@ -417,11 +419,13 @@ void vtime_task_switch(struct task_struct *prev)
> > >  }
> > >  # endif
> > >  
> > > -void vtime_account_irq(struct task_struct *tsk)
> > > +void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
> > >  {
> > > - if (hardirq_count()) {
> > > + unsigned int pc = preempt_count() - offset;
> > > +
> > > + if (pc & HARDIRQ_OFFSET) {
> > 
> > Shouldn't this be HARDIRQ_MASK like above?
> 
> In the rare cases of nested hardirqs happening with broken drivers, Only the 
> outer hardirq
> does matter. All the time spent in the inner hardirqs is included in the outer
> one.

Ah I see. The original code was doing hardirq_count(), which apparently wasn't
right either.

Shouldn't it be pc == HARDIRQ_OFFSET then? All odd nest counts will trigger
this otherwise, and IIUC we want this to trigger once on first entry only.

Thanks

--
Qais Yousef


Re: [PATCH 4/5] irqtime: Move irqtime entry accounting after irq offset incrementation

2020-12-27 Thread Qais Yousef
Hi Frederic

On 12/02/20 12:57, Frederic Weisbecker wrote:
>  #endif /* _LINUX_KERNEL_VTIME_H */
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 02163d4260d7..5f611658eeab 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -44,12 +44,13 @@ static void irqtime_account_delta(struct irqtime 
> *irqtime, u64 delta,
>  }
>  
>  /*
> - * Called before incrementing preempt_count on {soft,}irq_enter
> + * Called after incrementing preempt_count on {soft,}irq_enter
>   * and before decrementing preempt_count on {soft,}irq_exit.
>   */
> -void irqtime_account_irq(struct task_struct *curr)
> +void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
>  {
>   struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
> + unsigned int pc;
>   s64 delta;
>   int cpu;
>  
> @@ -59,6 +60,7 @@ void irqtime_account_irq(struct task_struct *curr)
>   cpu = smp_processor_id();
>   delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
>   irqtime->irq_start_time += delta;
> + pc = preempt_count() - offset;
>  
>   /*
>* We do not account for softirq time from ksoftirqd here.
> @@ -66,9 +68,9 @@ void irqtime_account_irq(struct task_struct *curr)
>* in that case, so as not to confuse scheduler with a special task
>* that do not consume any time, but still wants to run.
>*/
> - if (hardirq_count())
> + if (pc & HARDIRQ_MASK)
>   irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
> - else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
> + else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())

Noob question. Why, for SOFTIRQs, do we do softirq_count() & *SOFTIRQ_OFFSET*? It
seems we're in-softirq only if the count is odd numbered.

/me tries to dig more

Hmm could it be because the softirq count is actually 1 bit and the rest is
for SOFTIRQ_DISABLE_OFFSET (BH disabled)?

IOW, 1 bit is for we're in softirq context, and the remaining 7 bits are to
count BH disable nesting, right?

I guess this would make sense; we don't nest softirq processing AFAIK. But
I could be misreading the code too :-)
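
For my own reference, the layout I'm reading out of preempt.h is roughly the
below (values as in recent kernels, they can differ between versions/configs):

    /*
     * PREEMPT_MASK: 0x000000ff - preempt_disable() nesting
     * SOFTIRQ_MASK: 0x0000ff00 - bit 8 (SOFTIRQ_OFFSET) set while serving
     *                            softirq, bits 9-15 count local_bh_disable()
     *                            nesting (SOFTIRQ_DISABLE_OFFSET = 2 * SOFTIRQ_OFFSET)
     * HARDIRQ_MASK: 0x000f0000 - hardirq nesting
     * NMI_MASK:     0x00f00000 - NMI nesting
     */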

>   irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
>  }
>  
> @@ -417,11 +419,13 @@ void vtime_task_switch(struct task_struct *prev)
>  }
>  # endif
>  
> -void vtime_account_irq(struct task_struct *tsk)
> +void vtime_account_irq(struct task_struct *tsk, unsigned int offset)
>  {
> - if (hardirq_count()) {
> + unsigned int pc = preempt_count() - offset;
> +
> + if (pc & HARDIRQ_OFFSET) {

Shouldn't this be HARDIRQ_MASK like above?

>   vtime_account_hardirq(tsk);
> - } else if (in_serving_softirq()) {
> + } else if (pc & SOFTIRQ_OFFSET) {
>   vtime_account_softirq(tsk);
>   } else if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
>  is_idle_task(tsk)) {

Thanks

--
Qais Yousef


Re: [PATCH v4 15/19] sched: Fix migrate_disable() vs rt/dl balancing

2020-12-26 Thread Qais Yousef
..]

> +static inline struct task_struct *get_push_task(struct rq *rq)
> +{
> + struct task_struct *p = rq->curr;

Shouldn't we verify the class of the task here? The RT task, while migration
disabled, could have been preempted by a dl or stopper task. Similarly, the dl
task could have been preempted by a stopper task.

I don't think an RT task should be allowed to push a dl task under any
circumstances?
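
One way would be to let the caller pass in the class it expects to push, e.g.
(only a sketch of the idea, not a tested patch):

    static inline struct task_struct *get_push_task(struct rq *rq,
                                                    const struct sched_class *class)
    {
            struct task_struct *p = rq->curr;

            lockdep_assert_held(&rq->lock);

            /* only push curr if it belongs to the class doing the pushing */
            if (p->sched_class != class)
                    return NULL;

            if (rq->push_busy)
                    return NULL;

            if (p->nr_cpus_allowed == 1)
                    return NULL;

            rq->push_busy = true;
            return get_task_struct(p);
    }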

Cheers

--
Qais Yousef

> +
> + lockdep_assert_held(&rq->lock);
> +
> + if (rq->push_busy)
> + return NULL;
> +
> + if (p->nr_cpus_allowed == 1)
> + return NULL;
> +
> + rq->push_busy = true;
> + return get_task_struct(p);
> +}
> +
> +extern int push_cpu_stop(void *arg);
> +
>  #endif
>  
>  #ifdef CONFIG_CPU_IDLE
> 
> 


Re: [PATCH] kernel/cpu: fix: use scnprintf or sprintf.

2020-12-23 Thread Qais Yousef
Hi Yang

'or sprintf' in the subject line doesn't make much sense for what's done in
this patch. Perhaps you meant "Use scnprintf instead of snprintf"?

On 12/22/20 17:11, YANG LI wrote:
> The snprintf() function returns the number of characters which would
> have been printed if there were enough space, but the scnprintf()
> returns the number of characters which were actually printed. If the
> buffer is not large enough, then using snprintf() would result in a
> read overflow and an information leak.
> 
> Signed-off-by: YANG LI 
> Reported-by: Abaci 

Two different yet very similar email addresses, it seems both are you? The
Reported-by is unnecessary.

> ---
>  kernel/cpu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 4e11e91..c123741 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2345,7 +2345,7 @@ static ssize_t show_cpuhp_states(struct device *dev,
>  {
>   const char *state = smt_states[cpu_smt_control];
>  
> - return snprintf(buf, PAGE_SIZE - 2, "%s\n", state);
> + return scnprintf(buf, PAGE_SIZE - 2, "%s\n", state);

show_cpuhp_states() doesn't have snprintf() in Linus' master. Which tree is
this based on?

I can see two snprintf() in cpu.c, show_smt_active/control().

Mind resending to fix them both?

Thanks

--
Qais Yousef


Re: sched: Reenable interrupts in do sched_yield()

2020-12-23 Thread Qais Yousef
On 10/21/20 10:07, Steven Rostedt wrote:
> On Wed, 21 Oct 2020 09:27:22 +0200
> Thomas Gleixner  wrote:
> 
> > On Tue, Oct 20 2020 at 16:07, Steven Rostedt wrote:
> > > On Tue, 20 Oct 2020 20:02:55 +0200
> > > Thomas Gleixner  wrote:
> > > What I wrote wasn't exactly what I meant. What I meant to have:
> > >
> > >   /*
> > >* Since we are going to call schedule() anyways, there's
> > >* no need to do the preemption check when the rq_lock is released.
> > >*/
> > >
> > > That is, to document why we have the preempt_disable() before the unlock: 
> > >  
> > 
> > which is pretty obvious, but I let Peter decide on that.
> 
> To us maybe, but I like to have comments that explain why things are done to
> average people. ;-)
> 
> If I go to another kernel developer outside the core kernel, would they know
> why there's a preempt_disable() there?
> 
> 
>   preempt_disable();
>   rq_unlock_irq(rq, );
>   sched_preempt_enable_no_resched();
>  
>   schedule();
> 
> 
> Not everyone knows that the rq_unlock_irq() would trigger a schedule if an
> interrupt happened as soon as irqs were enabled again and need_resched was
> set.

Sorry a bit late to the party.

Personally, what is actually tripping me up is that rq_unlock_irq() will end
up calling preempt_enable(), and then we do sched_preempt_enable_no_resched().
Was there an earlier preempt_disable() somewhere up the chain that I couldn't
spot, which is why it's okay to do the two? Otherwise I see we have an
imbalanced preempt_disable/enable.

preempt_disable()
rq_unlock_irq()
__raw_spin_unlock_irq()
local_irq_enable()
preempt_enable()// first preempt_count_dec()
sched_preempt_enable_no_resched()   // second preempt_count_dec()

Thanks

--
Qais Yousef


Re: [PATCH v5 07/15] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

2020-12-17 Thread Qais Yousef
On 12/17/20 14:44, Peter Zijlstra wrote:
> On Thu, Dec 17, 2020 at 12:15:52PM +0000, Qais Yousef wrote:
> > On 12/08/20 13:28, Will Deacon wrote:
> > > If the scheduler cannot find an allowed CPU for a task,
> > > cpuset_cpus_allowed_fallback() will widen the affinity to 
> > > cpu_possible_mask
> > > if cgroup v1 is in use.
> > > 
> > > In preparation for allowing architectures to provide their own fallback
> > > mask, just return early if we're not using cgroup v2 and allow
> > > select_fallback_rq() to figure out the mask by itself.
> > > 
> > > Cc: Li Zefan 
> > > Cc: Tejun Heo 
> > > Cc: Johannes Weiner 
> > > Reviewed-by: Quentin Perret 
> > > Signed-off-by: Will Deacon 
> > > ---
> > >  kernel/cgroup/cpuset.c | 6 --
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index 57b5b5d0a5fd..e970737c3ed2 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -3299,9 +3299,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> > > struct cpumask *pmask)
> > >  
> > >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > >  {
> > > + if (!is_in_v2_mode())
> > > + return; /* select_fallback_rq will try harder */
> > > +
> > >   rcu_read_lock();
> > > - do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> > > - task_cs(tsk)->cpus_allowed : cpu_possible_mask);
> > > + do_set_cpus_allowed(tsk, task_cs(tsk)->cpus_allowed);
> > 
> > Why is it safe to return that for cpuset v2?
> 
> v1
> 
> Because in that case it does cpu_possible_mask, which, if you look at
> select_fallback_rq(), is exactly what happens when cpuset 'fails' to
> find a candidate.
> 
> Or at least, that's how I read the patch.

Okay, I can see that if v2 has an effectively empty mask for the 32bit tasks,
then we'll fall back to the 'possible' switch case where we set
task_cpu_possible_mask().

But how about when task_cs(tsk)->cpus_allowed contains partially invalid cpus?

The search for a candidate cpu will return a correct dest_cpu, but the actual
cpu_mask of the task will contain invalid cpus that could be picked up later,
no? Shouldn't we

cpumask_and(mask, task_cs(tsk)->cpus_allowed, task_cpu_possible_mask())

to remove those invalid cpus?
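
I mean something like this (only a sketch: new_mask is hypothetical and a real
version would need a properly allocated temporary cpumask plus error handling):

    void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
    {
            cpumask_var_t new_mask;     /* hypothetical, needs real allocation */

            if (!is_in_v2_mode())
                    return; /* select_fallback_rq will try harder */

            rcu_read_lock();
            cpumask_and(new_mask, task_cs(tsk)->cpus_allowed,
                        task_cpu_possible_mask(tsk));
            do_set_cpus_allowed(tsk, new_mask);
            rcu_read_unlock();
    }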

Thanks

--
Qais Yousef


Re: [PATCH v5 07/15] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

2020-12-17 Thread Qais Yousef
On 12/08/20 13:28, Will Deacon wrote:
> If the scheduler cannot find an allowed CPU for a task,
> cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
> if cgroup v1 is in use.
> 
> In preparation for allowing architectures to provide their own fallback
> mask, just return early if we're not using cgroup v2 and allow
> select_fallback_rq() to figure out the mask by itself.
> 
> Cc: Li Zefan 
> Cc: Tejun Heo 
> Cc: Johannes Weiner 
> Reviewed-by: Quentin Perret 
> Signed-off-by: Will Deacon 
> ---
>  kernel/cgroup/cpuset.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 57b5b5d0a5fd..e970737c3ed2 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3299,9 +3299,11 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> struct cpumask *pmask)
>  
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
> + if (!is_in_v2_mode())
> + return; /* select_fallback_rq will try harder */
> +
>   rcu_read_lock();
> - do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> - task_cs(tsk)->cpus_allowed : cpu_possible_mask);
> + do_set_cpus_allowed(tsk, task_cs(tsk)->cpus_allowed);

Why is it safe to return that for cpuset v2? task_cs(tsk)->cpus_allowed is the
original user-configured setting of cpuset.cpus, which could have an empty
intersection with task_cpu_possible_mask(), no?

do_set_cpus_allowed() will call set_cpus_allowed_common() which will end up
copying the mask as-is.
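
For reference, IIRC that's all set_cpus_allowed_common() does (quoting from
memory, so double check):

	void set_cpus_allowed_common(struct task_struct *p,
				     const struct cpumask *new_mask)
	{
		cpumask_copy(&p->cpus_mask, new_mask);
		p->nr_cpus_allowed = cpumask_weight(new_mask);
	}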

So unless I missed something, there's a risk that a 32bit task ends up with a
64bit-only cpu_mask when using cpuset v2.

Thanks

--
Qais Yousef

>   rcu_read_unlock();
>  
>   /*
> -- 
> 2.29.2.576.ga3fc446d84-goog
> 


Re: RCU stall leading to deadlock warning

2020-12-16 Thread Qais Yousef
On 12/16/20 10:00, Paul E. McKenney wrote:
> On Wed, Dec 16, 2020 at 09:54:42AM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 16, 2020 at 05:29:39PM +0000, Qais Yousef wrote:
> > > Hi Paul
> > > 
> > > We hit the below splat a couple of days ago in our testing. Sadly I can't
> > > reproduce it. And it was on android-mainline branch..
> > > 
> > > It's the deadlock message that bothers me. I can't see how we could have 
> > > ended
> > > there. We detect a stall and when trying to dump the stack LOCKDEP spits 
> > > the
> > > warning.
> > > 
> > > Maybe should take this report with a pinch of salt since it wasn't on 
> > > mainline.
> > > I just thought it might be something worth sharing in case you can 
> > > actually
> > > spot something obvious that I can't see. If I got more info or a 
> > > reproducer
> > > I will share them.
> > > 
> > > The failure was triggered twice on that day running 2 different tests.
> > 
> > This looks like the same problem that Mark Rutland's recent patch series
> > was fixing.  Do you have this series applied?
> > 
> > lore.kernel.org/lkml/20201126123602.23454-1-mark.rutl...@arm.com

Oh yeah I remember this one. Yes it could be relevant. I don't see the series
in the tree. If it wasn't merged (which AFAICS it isn't), it won't appear
there.

> I would not expect the patch below to help given what your RCU CPU stall
> warning looks like, but just in case...

Thanks. If I manage to find a reproducer I will give it a go. Still no luck in
triggering it in my test env :(

Thanks

--
Qais Yousef

> 
> (Full disclosure: Peter fixed a bug of mine, filenames notwithstanding.)
> 
>   Thanx, Paul
> 
> 
> 
> commit f355d19f94bf4361d641fb3dbb9ece0fbac766f8
> Author: Peter Zijlstra 
> Date:   Sat Aug 29 10:22:24 2020 -0700
> 
> sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled
> 
> The try_invoke_on_locked_down_task() function currently requires
> that interrupts be enabled, but it is called with interrupts
> disabled from rcu_print_task_stall(), resulting in an "IRQs not
> enabled as expected" diagnostic.  This commit therefore updates
> try_invoke_on_locked_down_task() to use raw_spin_lock_irqsave() instead
> of raw_spin_lock_irq(), thus allowing use from either context.
> 
> Link: 
> https://lore.kernel.org/lkml/903d5805ab908...@google.com/
> Link: 
> https://lore.kernel.org/lkml/20200928075729.gc2...@hirez.programming.kicks-ass.net/
> Reported-by: syzbot+cb3b69ae80afd6535...@syzkaller.appspotmail.com
> Signed-off-by: Peter Zijlstra 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b2d6898..4abf041 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2989,7 +2989,7 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> state, int wake_flags)
>  
>  /**
>   * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
> - * @p: Process for which the function is to be invoked.
> + * @p: Process for which the function is to be invoked, can be @current.
>   * @func: Function to invoke.
>   * @arg: Argument to function.
>   *
> @@ -3007,12 +3007,11 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> state, int wake_flags)
>   */
>  bool try_invoke_on_locked_down_task(struct task_struct *p, bool 
> (*func)(struct task_struct *t, void *arg), void *arg)
>  {
> - bool ret = false;
>   struct rq_flags rf;
> + bool ret = false;
>   struct rq *rq;
>  
> - lockdep_assert_irqs_enabled();
> - raw_spin_lock_irq(&p->pi_lock);
> + raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>   if (p->on_rq) {
> rq = __task_rq_lock(p, &rf);
>   if (task_rq(p) == rq)
> @@ -3029,7 +3028,7 @@ bool try_invoke_on_locked_down_task(struct task_struct 
> *p, bool (*func)(struct t
>   ret = func(p, arg);
>   }
>   }
> - raw_spin_unlock_irq(&p->pi_lock);
> + raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
>   return ret;
>  }
>  


RCU stall leading to deadlock warning

2020-12-16 Thread Qais Yousef
Hi Paul

We hit the below splat a couple of days ago in our testing. Sadly I can't
reproduce it. And it was on android-mainline branch..

It's the deadlock message that bothers me. I can't see how we could have ended
there. We detect a stall and when trying to dump the stack LOCKDEP spits the
warning.

Maybe should take this report with a pinch of salt since it wasn't on mainline.
I just thought it might be something worth sharing in case you can actually
spot something obvious that I can't see. If I got more info or a reproducer
I will share them.

The failure was triggered twice on that day running 2 different tests.

[  310.073379] LTP: starting leapsec01
[  345.070123] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  345.076717] rcu: 0-...!: (1 ticks this GP) idle=798/0/0x0 
softirq=19187/19187 fqs=0  (false positive?)
[  345.087533](detected by 2, t=6502 jiffies, g=57249, q=184)
[  345.093284]
[  345.094797] 
[  345.100139] WARNING: possible recursive locking detected
[  345.105485] 5.10.0-rc7-02296-g3f43bd6f2c3b-ab89 #1 Not tainted
[  345.111349] 
[  345.116693] swapper/2/0 is trying to acquire lock:
[  345.121515] a00013b50c58 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_dump_cpu_stacks+0x7c/0x14c
[  345.129813]
[  345.129813] but task is already holding lock:
[  345.135678] a00013b50c58 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_sched_clock_irq+0x68c/0x11c0
[  345.144143]
[  345.144143] other info that might help us debug this:
[  345.150702]  Possible unsafe locking scenario:
[  345.150702]
[  345.156651]CPU0
[  345.159119]
[  345.161585]   lock(rcu_node_0);
[  345.164779]   lock(rcu_node_0);
[  345.167973]
[  345.167973]  *** DEADLOCK ***
[  345.167973]
[  345.173923]  May be due to missing lock nesting notation
[  345.173923]
[  345.180746] 1 lock held by swapper/2/0:
[  345.184607]  #0: a00013b50c58 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_sched_clock_irq+0x68c/0x11c0
[  345.193517]
[  345.193517] stack backtrace:
[  345.197910] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 
5.10.0-rc7-02296-g3f43bd6f2c3b-ab89 #1
[  345.206389] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno 
Development Platform, BIOS EDK II Oct 19 2019
[  345.217215] Call trace:
[  345.219691]  dump_backtrace+0x0/0x2b8
[  345.223383]  show_stack+0x18/0x68
[  345.226731]  dump_stack+0x110/0x188
[  345.230255]  __lock_acquire+0x23f0/0x2410
[  345.234300]  lock_acquire+0x3b8/0x730
[  345.237997]  _raw_spin_lock_irqsave+0x80/0x168
[  345.242476]  rcu_dump_cpu_stacks+0x7c/0x14c
[  345.246693]  rcu_sched_clock_irq+0xfd4/0x11c0
[  345.251087]  update_process_times+0x84/0xe0
[  345.255306]  tick_sched_handle.isra.0+0x68/0x98
[  345.259871]  tick_sched_timer+0x60/0xd8
[  345.263742]  __hrtimer_run_queues+0x534/0x9e0
[  345.268134]  hrtimer_interrupt+0x1a8/0x398
[  345.272264]  tick_receive_broadcast+0x60/0x88
[  345.276657]  ipi_handler+0x250/0x4b8
[  345.280270]  handle_percpu_devid_fasteoi_ipi+0x138/0x4f0
[  345.285619]  generic_handle_irq+0x4c/0x68
[  345.289661]  __handle_domain_irq+0x9c/0x118
[  345.293880]  gic_handle_irq+0xdc/0x118
[  345.297661]  el1_irq+0xc8/0x180
[  345.300835]  cpuidle_enter_state+0x16c/0x810
[  345.305139]  cpuidle_enter+0x4c/0x78
[  345.308749]  call_cpuidle+0x44/0x88
[  345.312271]  do_idle+0x2d4/0x338
[  345.315532]  cpu_startup_entry+0x24/0x68
[  345.319491]  secondary_start_kernel+0x1d4/0x2d8

Thanks

--
Qais Yousef


Re: [PATCH v5 00/15] An alternative series for asymmetric AArch32 systems

2020-12-16 Thread Qais Yousef
On 12/16/20 14:14, Will Deacon wrote:
> Hi Qais,
> 
> On Wed, Dec 16, 2020 at 11:16:46AM +0000, Qais Yousef wrote:
> > On 12/08/20 13:28, Will Deacon wrote:
> > > Changes in v5 include:
> > > 
> > >   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> > > we can avoid returning incompatible CPUs for a given task. This
> > > means that sched_setaffinity() can be used with larger masks (like
> > > the online mask) from userspace and also allows us to take into
> > > account the cpuset hierarchy when forcefully overriding the affinity
> > > for a task on execve().
> > > 
> > >   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> > > so that the resulting affinity mask does not contain any incompatible
> > > CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> > > 
> > >   * Moved overriding of the affinity mask into the scheduler core rather
> > > than munge affinity masks directly in the architecture backend.
> > > 
> > >   * Extended comments and documentation.
> > > 
> > >   * Some renaming and cosmetic changes.
> > > 
> > > I'm pretty happy with this now, although it still needs review and will
> > > require rebasing to play nicely with the SCA changes in -next.
> > 
> > I still have concerns about the cpuset v1 handling. Specifically:
> > 
> > 1. Attaching a 32bit task to 64bit only cpuset is allowed.
> > 
> >I think the right behavior here is to prevent that as the
> >intersection will appear as offline cpus for the 32bit tasks. So it
> >shouldn't be allowed to move there.
> 
> Suren or Quentin can correct me if I'm wrong here, but I think Android
> relies on this working so it's not an option for us to prevent the attach.

I don't think so. It's just a matter of who handles the error. i.e.: either the
kernel fixes it up silently and effectively makes the cpuset a NOP since we
don't respect the affinity of the cpuset, or user space picks the next best
thing. Since this could return an error anyway, user space likely already
handles it.

> I also don't think it really achieves much, since as you point out, the same
> problem exists in other cases such as execve() of a 32-bit binary, or
> hotplugging off all 32-bit CPUs within a mixed cpuset. Allowing the attach
> and immediately reparenting would probably be better, but see below.

I am just wary that we're introducing generic asymmetric ISA support, so my
concerns have been about making sure the behavior is sane in general. When this
gets merged, I can bet more 'fun' hardware will appear all over the place.
We're opening the flood gates, I'm afraid :p

> > 2. Modifying cpuset.cpus could result with empty set for 32bit tasks.
> > 
> >It is a variation of the above, it's just the cpuset transforms into
> >64bit only after we attach.
> > 
> >I think the right behavior here is to move the 32bit tasks to the
> >nearest ancestor like we do when all cpuset.cpus are hotplugged out.
> > 
> >We could too return an error if the new set will result an empty set
> >for the 32bit tasks. In a similar manner to how it fails if you
> >write a cpu that is offline.
> > 
> > 3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit binary,
> >the 32bit task will inherit the cgroup setting.
> > 
> >Like above, we should move this to the nearest ancestor.
> 
> I considered this when I was writing the patches, but the reality is that
> by allowing 32-bit tasks to attach to a 64-bit only cpuset (which is required
> by Android), we have no choice but to expose a new ABI to userspace. This is
> all gated behind a command-line option, so I think that's fine, but then why
> not just have the same behaviour as cgroup v2? I don't see the point in
> creating two new ABIs (for cgroup v1 and v2 respectively) if we don't need

Ultimately it's up to Tejun and Peter I guess. I thought we needed to preserve
the v1 behavior for the new class of tasks. I won't object to the new ABI
myself. Maybe we just need to make the commit messages and cgroup-v1
documentation reflect that explicitly.

> to. If it was _identical_ to the hotplug case, then we would surely just
> follow the existing behaviour, but it's really quite different in this
> situation because the cpuset is not empty.

It is actually effectively empty for those tasks. But I see that one could look
at it from two different angles.

> One thing we should definitely do though is add this to the documentation
> for the command-line option.

+1

By the way, should the command-line option be renamed to something more
generic? This has already grown beyond just enabling the support for one
isolated case. No strong opinion, just a suggestion.

Thanks

--
Qais Yousef


Re: [PATCH v5 00/15] An alternative series for asymmetric AArch32 systems

2020-12-16 Thread Qais Yousef
Hi Will

On 12/08/20 13:28, Will Deacon wrote:
> Hi all,
> 
> Christmas has come early: it's time for version five of these patches
> which have previously appeared here:
> 
>   v1: https://lore.kernel.org/r/20201027215118.27003-1-w...@kernel.org
>   v2: https://lore.kernel.org/r/20201109213023.15092-1-w...@kernel.org
>   v3: https://lore.kernel.org/r/20201113093720.21106-1-w...@kernel.org
>   v4: https://lore.kernel.org/r/20201124155039.13804-1-w...@kernel.org
> 
> and which started life as a reimplementation of some patches from Qais:
> 
>   https://lore.kernel.org/r/20201021104611.2744565-1-qais.you...@arm.com
> 
> There's also now a nice writeup on LWN:
> 
>   https://lwn.net/Articles/838339/
> 
> and rumours of a feature film are doing the rounds.
> 
> [subscriber-only, but if you're reading this then you should really
>  subscribe.]
> 
> The aim of this series is to allow 32-bit ARM applications to run on
> arm64 SoCs where not all of the CPUs support the 32-bit instruction set.
> Unfortunately, such SoCs are real and will continue to be productised
> over the next few years at least. I can assure you that I'm not just
> doing this for fun.
> 
> Changes in v5 include:
> 
>   * Teach cpuset_cpus_allowed() about task_cpu_possible_mask() so that
> we can avoid returning incompatible CPUs for a given task. This
> means that sched_setaffinity() can be used with larger masks (like
> the online mask) from userspace and also allows us to take into
> account the cpuset hierarchy when forcefully overriding the affinity
> for a task on execve().
> 
>   * Honour task_cpu_possible_mask() when attaching a task to a cpuset,
> so that the resulting affinity mask does not contain any incompatible
> CPUs (since it would be rejected by set_cpus_allowed_ptr() otherwise).
> 
>   * Moved overriding of the affinity mask into the scheduler core rather
> than munge affinity masks directly in the architecture backend.
> 
>   * Extended comments and documentation.
> 
>   * Some renaming and cosmetic changes.
> 
> I'm pretty happy with this now, although it still needs review and will
> require rebasing to play nicely with the SCA changes in -next.

I still have concerns about the cpuset v1 handling. Specifically:

1. Attaching a 32bit task to 64bit only cpuset is allowed.

   I think the right behavior here is to prevent that as the
   intersection will appear as offline cpus for the 32bit tasks. So it
   shouldn't be allowed to move there.

2. Modifying cpuset.cpus could result with empty set for 32bit tasks.

   It is a variation of the above, it's just the cpuset transforms into
   64bit only after we attach.

   I think the right behavior here is to move the 32bit tasks to the
   nearest ancestor like we do when all cpuset.cpus are hotplugged out.

   We could too return an error if the new set will result an empty set
   for the 32bit tasks. In a similar manner to how it fails if you
   write a cpu that is offline.

3. If a 64bit task that belongs to a 64bit-only cpuset execs a 32bit binary,
   the 32bit task will inherit the cgroup setting.

   Like above, we should move this to the nearest ancestor.

I was worried about the case where, in a hierarchy, the parent cpuset.cpus is
modified such that the children no longer have a valid cpu for 32bit tasks. But
I checked for v1 and
this isn't a problem. You'll get an error if you try to change it in a way that
ends up with an empty cpuset.

I played with v2, and the model allows tasks to remain attached even if cpus
are hotplugged, or cpusets.cpus is modified in such a way we end up with an
empty cpuset. So I think breaking the affinity of the cpuset for v2 is okay.

To simplify the problem for v1, we could say that asym ISA tasks can only live
in the root cpuset for v1. This will simplify the solution too since we will
only need to ensure that these tasks are moved to the root group on exec and
block any future move to anything else. Of course this dictates that such
systems must use cpuset v2 if they care. Not a terrible restriction IMO.

I hacked a patch to fix the exec scenario and it was easy to do. I just need to
block clone3() (cgroup_post_fork()) and task_can_attach() from allowing these
tasks to move anywhere else.
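
The check I have in mind on the attach side is roughly the below (untested
sketch for cpuset v1 only; the helper name and exactly where it gets wired in,
task_can_attach()/cpuset_can_attach(), are placeholders for illustration):

	/*
	 * Reject moving a task that can't run on every cpu out of the root
	 * cpuset on v1, since a child cpuset might end up with no usable
	 * cpus for it.
	 */
	static int cpuset_can_attach_asym(struct cpuset *cs,
					  struct task_struct *task)
	{
		if (is_in_v2_mode() || cs == &top_cpuset)
			return 0;

		if (!cpumask_equal(task_cpu_possible_mask(task),
				   cpu_possible_mask))
			return -EINVAL;

		return 0;
	}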

Thanks

--
Qais Yousef


Re: [PATCH v4 12/14] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

2020-12-02 Thread Qais Yousef
On 12/02/20 17:42, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 12:59:52PM +0000, Qais Yousef wrote:
> > On 12/01/20 22:13, Will Deacon wrote:
> On Fri, Nov 27, 2020 at 01:41:22PM +0000, Qais Yousef wrote:
> > > > On 11/24/20 15:50, Will Deacon wrote:
> > > > > diff --git a/arch/arm64/kernel/cpufeature.c 
> > > > > b/arch/arm64/kernel/cpufeature.c
> > > > > index 29017cbb6c8e..fe470683b43e 100644
> > > > > --- a/arch/arm64/kernel/cpufeature.c
> > > > > +++ b/arch/arm64/kernel/cpufeature.c
> > > > > @@ -1237,6 +1237,8 @@ has_cpuid_feature(const struct 
> > > > > arm64_cpu_capabilities *entry, int scope)
> > > > >  
> > > > >  static int enable_mismatched_32bit_el0(unsigned int cpu)
> > > > >  {
> > > > > + static int lucky_winner = -1;
> > > > > +
> > > > >   struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > > > >   bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> > > > >  
> > > > > @@ -1245,6 +1247,22 @@ static int 
> > > > > enable_mismatched_32bit_el0(unsigned int cpu)
> > > > >   
> > > > > static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> > > > >   }
> > > > >  
> > > > > + if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> > > > > + return 0;
> > > > 
> > > > Hmm I'm struggling to get what you're doing here. You're treating CPU0 
> > > > (the
> > > > boot CPU) specially here, but I don't get why?
> > > 
> > > If our ability to execute 32-bit code is the same as the boot CPU then we
> > > don't have to do anything. That way, we can postpone nominating the lucky
> > > winner until we really need to.
> > 
> > Okay I see what you're doing now. The '== cpu_32bit' part of the check gave 
> > me
> > trouble. If the first N cpus are 64bit only, we'll skip them here. Worth
> > a comment?
> > 
> > Wouldn't it be better to replace this with a check if cpu_32bit_el0_mask is
> > empty instead? That would be a lot easier to read.
> 
> Sorry, but I don't follow. What if all the CPUs are 32-bit capable?

You're right I missed this case.

--
Qais Yousef


Re: [PATCH v4 08/14] arm64: exec: Adjust affinity for compat tasks with mismatched 32-bit EL0

2020-12-02 Thread Qais Yousef
On 12/01/20 16:55, Will Deacon wrote:
> > > +static void adjust_compat_task_affinity(struct task_struct *p)
> > > +{
> > > + cpumask_var_t cpuset_mask;
> > > + const struct cpumask *possible_mask = system_32bit_el0_cpumask();
> > > + const struct cpumask *newmask = possible_mask;
> > > +
> > > + /*
> > > +  * Restrict the CPU affinity mask for a 32-bit task so that it contains
> > > +  * only the 32-bit-capable subset of its original CPU mask. If this is
> > > +  * empty, then try again with the cpuset allowed mask. If that fails,
> > > +  * forcefully override it with the set of all 32-bit-capable CPUs that
> > > +  * we know about.
> > > +  *
> > > +  * From the perspective of the task, this looks similar to what would
> > > +  * happen if the 64-bit-only CPUs were hot-unplugged at the point of
> > > +  * execve().
> > > +  */
> > > + if (!restrict_cpus_allowed_ptr(p, possible_mask))
> > > + goto out;
> > > +
> > > + if (alloc_cpumask_var(&cpuset_mask, GFP_KERNEL)) {
> > > + cpuset_cpus_allowed(p, cpuset_mask);
> > > + if (cpumask_and(cpuset_mask, cpuset_mask, possible_mask)) {
> > > + newmask = cpuset_mask;
> > > + goto out_set_mask;
> > > + }
> > > + }
> > 
> > Wouldn't it be better to move this logic to restrict_cpus_allowed_ptr()?
> > I think it should always take cpusets into account and it's not special to
> > this particular handling here, no?
> 
> I did actually try this but didn't pursue it further because I was worried
> that I was putting too much of the "can't run a 32-bit task on a 64-bit-only
> CPU" logic into what would otherwise be a potentially useful library function
> if/when other architectures want something similar. But I'll have another
> look because there were a couple of ideas I didn't try out.

If we improve the cpuset handling to take arch_task_cpu_possible_mask() into
account, as discussed in the other thread, I think we can drop the cpuset
handling here.
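
i.e. if the cpuset side already handed back a mask clamped to what the task can
actually run on, along the lines of the below (sketch only; the helper name is
mine and arch_task_cpu_possible_mask() is the hook from this series):

	/*
	 * What the task may use according to cpusets, clamped to what the
	 * architecture says it can ever run on.
	 */
	static void cpuset_cpus_allowed_clamped(struct task_struct *tsk,
						struct cpumask *pmask)
	{
		cpuset_cpus_allowed(tsk, pmask);

		/* empty intersection: fall back to the arch mask */
		if (!cpumask_and(pmask, pmask, arch_task_cpu_possible_mask(tsk)))
			cpumask_copy(pmask, arch_task_cpu_possible_mask(tsk));
	}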

> 
> > > + if (printk_ratelimit()) {
> > > + printk_deferred("Overriding affinity for 32-bit process %d (%s) 
> > > to CPUs %*pbl\n",
> > > + task_pid_nr(p), p->comm, 
> > > cpumask_pr_args(newmask));
> > > + }
> > 
> > We have 2 cases where the affinity could have been overridden but we won't
> > print anything:
> > 
> > 1. restrict_cpus_allowed_ptr()
> > 2. intersection of cpuset_mask and possible mask drops some cpus.
> > 
> > Shouldn't we print something in these cases too?
> 
> I don't think so: in these cases we've found a subset of CPUs that we can
> run on, and so there's no need to warn. Nothing says we _have_ to use all
> the CPUs available to us. The case where we override the affinity mask
> altogether, however, does warrant a warning. This is very similar to the
> hotplug behaviour in select_fallback_rq().

Okay. So it is just to warn when we actually break the affinity because we
ended up with an empty mask; not just because we changed the affinity to an
intersecting one.

I think this makes sense, yes. We might be able to drop this too if we improve
cpuset handling. The devil is in the details I guess.

Thanks

--
Qais Yousef


Re: [PATCH v4 04/14] arm64: Kill 32-bit applications scheduled on 64-bit-only CPUs

2020-12-02 Thread Qais Yousef
On 12/01/20 16:56, Will Deacon wrote:
> On Fri, Nov 27, 2020 at 01:12:17PM +0000, Qais Yousef wrote:
> > On 11/24/20 15:50, Will Deacon wrote:
> > > Scheduling a 32-bit application on a 64-bit-only CPU is a bad idea.
> > > 
> > > Ensure that 32-bit applications always take the slow-path when returning
> > > to userspace on a system with mismatched support at EL0, so that we can
> > > avoid trying to run on a 64-bit-only CPU and force a SIGKILL instead.
> > > 
> > > Signed-off-by: Will Deacon 
> > > ---
> > 
> > nit: We drop this patch at the end. Can't we avoid it altogether instead?
> 
> I did it like this so that the last patch can be reverted for
> testing/debugging, but also because I think it helps the structure of the
> series.

Cool. I had a comment about the barrier(): you were worried about
cpu_affinity_invalid() being inlined by the compiler and things then getting
mangled such that the TIF_NOTIFY_RESUME clearing is moved after the call, as
you described? Can the compiler still move things around if
cpu_affinity_invalid() is a proper function call (not inlined)?

> 
> > > diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> > > index a8184cad8890..bcb6ca2d9a7c 100644
> > > --- a/arch/arm64/kernel/signal.c
> > > +++ b/arch/arm64/kernel/signal.c
> > > @@ -911,6 +911,19 @@ static void do_signal(struct pt_regs *regs)
> > >   restore_saved_sigmask();
> > >  }
> > >  
> > > +static bool cpu_affinity_invalid(struct pt_regs *regs)
> > > +{
> > > + if (!compat_user_mode(regs))
> > > + return false;
> > 
> > Silly question. Is there an advantage of using compat_user_mode() vs
> > is_compat_task()? I see the latter used in the file although struct pt_regs
> > *regs is passed to the functions calling it.
> > 
> > Nothing's wrong with it, just curious.
> 
> Not sure about advantages, but is_compat_task() is available in core code,
> whereas compat_user_mode() is specific to arm64. The former implicitly
> operates on 'current' and just checks thread flag, whereas the latter
> actually goes and looks at mode field of the spsr to see what we're
> going to be returning into.

Okay, so just 2 different ways to do the same thing and you happened to pick
the one that first came to mind.

Thanks

--
Qais Yousef


Re: [PATCH v4 02/14] arm64: Allow mismatched 32-bit EL0 support

2020-12-02 Thread Qais Yousef
On 12/01/20 16:56, Will Deacon wrote:
> On Fri, Nov 27, 2020 at 01:09:41PM +0000, Qais Yousef wrote:
> > On 11/24/20 15:50, Will Deacon wrote:
> > > When confronted with a mixture of CPUs, some of which support 32-bit
> > 
> > Confronted made me laugh, well chosen word! :D
> > 
> > For some reason made me think of this :p
> > 
> > https://www.youtube.com/watch?v=NJbXPzSPzxc=1m33s
> 
> I think it just about sums it up!
> 
> > > applications and others which don't, we quite sensibly treat the system
> > > as 64-bit only for userspace and prevent execve() of 32-bit binaries.
> > > 
> > > Unfortunately, some crazy folks have decided to build systems like this
> > > with the intention of running 32-bit applications, so relax our
> > > sanitisation logic to continue to advertise 32-bit support to userspace
> > > on these systems and track the real 32-bit capable cores in a cpumask
> > > instead. For now, the default behaviour remains but will be tied to
> > > a command-line option in a later patch.
> > > 
> > > Signed-off-by: Will Deacon 
> > > ---
> > >  arch/arm64/include/asm/cpucaps.h|   2 +-
> > >  arch/arm64/include/asm/cpufeature.h |   8 ++-
> > >  arch/arm64/kernel/cpufeature.c  | 106 ++--
> > >  3 files changed, 107 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/cpucaps.h 
> > > b/arch/arm64/include/asm/cpucaps.h
> > > index e7d98997c09c..e6f0eb4643a0 100644
> > > --- a/arch/arm64/include/asm/cpucaps.h
> > > +++ b/arch/arm64/include/asm/cpucaps.h
> > > @@ -20,7 +20,7 @@
> > >  #define ARM64_ALT_PAN_NOT_UAO 10
> > >  #define ARM64_HAS_VIRT_HOST_EXTN 11
> > >  #define ARM64_WORKAROUND_CAVIUM_27456 12
> > > -#define ARM64_HAS_32BIT_EL0  13
> > > +#define ARM64_HAS_32BIT_EL0_DO_NOT_USE   13
> > 
> > nit: would UNUSED be better here? Worth adding a comment as to why too?
> 
> UNUSED sounds like you could delete it, but I'll add a comment.

+1, thanks.

> 
> > >  #define ARM64_HARDEN_EL2_VECTORS 14
> > >  #define ARM64_HAS_CNP 15
> > >  #define ARM64_HAS_NO_FPSIMD  16
> > 
> > [...]
> > 
> > > +static bool has_32bit_el0(const struct arm64_cpu_capabilities *entry, 
> > > int scope)
> > > +{
> > > + if (!has_cpuid_feature(entry, scope))
> > > + return allow_mismatched_32bit_el0;
> > 
> > If a user passes the command line option by mistake on a 64bit-only system,
> > this will return true. I'll be honest, I'm not entirely sure what the impact
> > is. I get lost in the features maze. It is nicely encapsulated, but hard to
> > navigate for the non-initiated :-)
> 
> The thing is, we can't generally detect a 64-bit-only system because a
> 32-bit-capable CPU could be hotplugged on late. So passing this option
> just controls what the behaviour is at the point that the 32-bit-capable
> CPU appears. If one doesn't appear, then there won't be a difference.

Okay, thanks for confirming.

Cheers

--
Qais Yousef

