Re: [PATCH 1/1] genirq/cpuhotplug: retry with online CPUs on irq_do_set_affinity failure

2024-04-22 Thread Thomas Gleixner
On Mon, Apr 22 2024 at 16:09, Dongli Zhang wrote:
> On 4/22/24 13:58, Thomas Gleixner wrote:
>> On Thu, Apr 18 2024 at 18:33, Dongli Zhang wrote:
> Would you mind suggesting if the below commit message is fine to you?
>
>
> genirq/cpuhotplug: retry with cpu_online_mask when irq_do_set_affinity returns -ENOSPC
>
> When a CPU goes offline, the interrupts pinned to that CPU are
> re-configured.
>
> Its managed interrupts undergo either migration to other CPUs or shutdown
> if all CPUs listed in the affinity are offline. This patch doesn't affect
> managed interrupts.
>
> For regular interrupts, they are migrated to other selected online CPUs.
> The target CPUs are chosen from either desc->pending_mask (assuming
> CONFIG_GENERIC_PENDING_IRQ) or d->common->affinity (assuming CONFIG_SMP).
> The cpu_online_mask is used as target CPUs only when CPUs in both
> desc->pending_mask and d->common->affinity are offline.
>
> However, there is a bad corner case, when desc->pending_mask or
> d->common->affinity is selected as the target cpumask, but none of their
> CPUs has any available vectors.

Up to here it's fine.

> As a result, an -ENOSPC error happens:
>
>   "IRQ151: set affinity failed(-28)."
>
> This is from the debugfs. The allocation fails although other online CPUs
> (except CPU=2) have many free vectors.

The debugfs output is not really providing more information than the
last sentence. It just occupies space :)

> The steps to reproduce the issue are in [1]. The core idea is:
>
> 1. Create a KVM guest with many virtio-net PCI devices, each configured
> with a very large number of queues/vectors.
>
> 2. Set the affinity of all virtio-net interrupts to "2,3".

That makes absolutely no sense at all. :)

But yes, I can see the non-real world problem with that.

> For regular interrupts, if irq_do_set_affinity() returns -ENOSPC, retry it
> with all online CPUs. The issue does not happen for managed interrupts
> because the vectors are always reserved (in cm->managed_map) before the CPU
> offline operation.
>
> [1]
> https://lore.kernel.org/all/20240419013322.58500-1-dongli.zh...@oracle.com/

The reproduction instructions are just additional information and not
necessarily change log material.

So I'd just say after the above:
> However, there is a bad corner case, when desc->pending_mask or
> d->common->affinity is selected as the target cpumask, but none of their
> CPUs has any available vectors.

   In this case the migration fails and the device interrupt becomes
   stale. This is not any different from the case where the affinity
   mask does not contain any online CPU, but there is no fallback
   operation for this.

   Instead of giving up retry the migration attempt with the online CPU
   mask if the interrupt is not managed as managed interrupts cannot be
   affected by this problem.

Hmm?

> I will change it to a single line.
>
> Would you mind suggesting which is preferred? !cpumask_equal(affinity,
> cpu_online_mask) or (affinity != cpu_online_mask)?

If at all you want !cpumask_subset(cpu_online_mask, affinity), but as
this is a corner case 'affinity != cpu_online_mask' should be good
enough.
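
For illustration, a minimal sketch of how the resulting hunk in
migrate_one_irq() could look, with the condition on a single line and the
plain pointer comparison. This is not taken from the posted patch; the
exact placement and the pr_debug() wording are assumptions:

	err = irq_do_set_affinity(d, affinity, false);

	/* Sketch: retry with the online mask if the chosen mask ran out of vectors */
	if (err == -ENOSPC && !irqd_affinity_is_managed(d) && affinity != cpu_online_mask) {
		/* Log the mask which actually failed, before replacing it */
		pr_debug("IRQ%u: set affinity failed for %*pbl, retrying with online CPUs\n",
			 d->irq, cpumask_pr_args(affinity));
		affinity = cpu_online_mask;
		brokeaff = true;
		err = irq_do_set_affinity(d, affinity, false);
	}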

Thanks,

tglx



Re: [PATCH 1/1] genirq/cpuhotplug: retry with online CPUs on irq_do_set_affinity failure

2024-04-22 Thread Thomas Gleixner
On Thu, Apr 18 2024 at 18:33, Dongli Zhang wrote:

> When a CPU is offline, its IRQs may migrate to other CPUs. For managed
> IRQs, they are migrated, or shutdown (if all CPUs of the managed IRQ
> affinity are offline). For regular IRQs, there will only be a
> migration.

Please write out interrupts. There is enough space for it and IRQ is
just not a regular word.

> The migrate_one_irq() first uses pending_mask or affinity_mask of the IRQ.
>
> 104 if (irq_fixup_move_pending(desc, true))
> 105 affinity = irq_desc_get_pending_mask(desc);
> 106 else
> 107 affinity = irq_data_get_affinity_mask(d);
>
> The migrate_one_irq() may use all online CPUs, if all CPUs in
> pending_mask/affinity_mask are already offline.
>
> 113 if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
> 114 /*
> 115  * If the interrupt is managed, then shut it down and leave
> 116  * the affinity untouched.
> 117  */
> 118 if (irqd_affinity_is_managed(d)) {
> 119 irqd_set_managed_shutdown(d);
> 120 irq_shutdown_and_deactivate(desc);
> 121 return false;
> 122 }
> 123 affinity = cpu_online_mask;
> 124 brokeaff = true;
> 125 }

Please don't copy code into the change log. Describe the problem in
text.

> However, there is a corner case. Although some CPUs in
> pending_mask/affinity_mask are still online, they lack available
> vectors. If the kernel continues calling irq_do_set_affinity() with those
> CPUs, there will be an -ENOSPC error.
>
> This is not reasonable as other online CPUs still have many available
> vectors.

Reasonable is not the question here. It's either correct or not.

> name:   VECTOR
>  size:   0
>  mapped: 529
>  flags:  0x0103
> Online bitmaps:7
> Global available:884
> Global reserved:   6
> Total allocated: 539
> System: 36: 0-19,21,50,128,236,243-244,246-255
>  | CPU | avl | man | mac | act | vectors
>  0   147 0 0   55  32-49,51-87
>  1   147 0 0   55  32-49,51-87
>  2 0 0 0  202  32-49,51-127,129-235

Just out of curiosity. How did this end up with CPU2 completely
occupied?

>  4   147 0 0   55  32-49,51-87
>  5   147 0 0   55  32-49,51-87
>  6   148 0 0   54  32-49,51-86
>  7   148 0 0   54  32-49,51-86
>
> This issue should not happen for managed IRQs because the vectors are already
> reserved before CPU hotplug.

Should not? It either does or it does not.

> For regular IRQs, do a retry with all online
> CPUs if the prior irq_do_set_affinity() failed with -ENOSPC.
>
> Cc: Joe Jin 
> Signed-off-by: Dongli Zhang 
> ---
>  kernel/irq/cpuhotplug.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
> index 1ed2b1739363..d1666a6b73f4 100644
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -130,6 +130,19 @@ static bool migrate_one_irq(struct irq_desc *desc)
>* CPU.
>*/
>   err = irq_do_set_affinity(d, affinity, false);
> +
> + if (err == -ENOSPC &&
> + !irqd_affinity_is_managed(d) &&
> + affinity != cpu_online_mask) {

This really wants to be a single line conditional.

> + affinity = cpu_online_mask;
> + brokeaff = true;
> +
> + pr_debug("IRQ%u: set affinity failed for %*pbl, re-try with all online CPUs\n",
> +  d->irq, cpumask_pr_args(affinity));

How is it useful to print cpu_online_mask here?

Thanks,

tglx



Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros

2021-12-08 Thread Thomas Gleixner
Ira,

On Tue, Dec 07 2021 at 16:51, Ira Weiny wrote:
> On Thu, Nov 25, 2021 at 03:25:09PM +0100, Thomas Gleixner wrote:
>
>  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
>  {
> -  int shift = pkey * PKR_BITS_PER_PKEY;
> +  int shift = PKR_PKEY_SHIFT(pkey);
>
>if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
>accessbits &= PKEY_ACCESS_MASK;
>
> Better?

Let me postpone this question.

> As to the reason of why to put this patch after the other one.  Why would I
> improve the old pre-refactoring code only to throw it away when moving it to
> pkey_update_pkval()?  This reasoning is even stronger when pkey_update_pkval()
> is implemented.

Which refactoring? We seem to have fundamentally different definitions of that
term. Let me illustrate why.

The original version of this was:

  u32 get_new_pkr(u32 old_pkr, int pkey, unsigned long init_val)
  {
int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkr_bits = 0;
  
/* Set the bits we need in the register:  */
if (init_val & PKEY_DISABLE_ACCESS)
new_pkr_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
new_pkr_bits |= PKR_WD_BIT;
  
/* Shift the bits in to the correct place: */
new_pkr_bits <<= pkey_shift;
  
/* Mask off any old bits in place: */
old_pkr &= ~((PKR_AD_BIT | PKR_WD_BIT) << pkey_shift);
  
/* Return the old part along with the new part: */
return old_pkr | new_pkr_bits;
  }

IOW, mechanical Cut & Paste.

Then PeterZ came along and suggested to improve it this way:

  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
  {
  int pkey_shift = pkey * PKR_BITS_PER_PKEY;

  /*  Mask out old bit values */
  pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

  /*  Or in new values */
  if (flags & PKEY_DISABLE_ACCESS)
  pk_reg |= PKR_AD_BIT << pkey_shift;
  if (flags & PKEY_DISABLE_WRITE)
  pk_reg |= PKR_WD_BIT << pkey_shift;

  return pk_reg;
  }

which is already better. So you changed your approach from Cut & Paste
to Copy & Paste.

But neither Cut & Paste nor Copy & Paste match what refactoring is
really about. Just throwing the term refactoring at it does not make it
so.

Refactoring is about improving the code in design and implementation.
The keyword is: improving.

There are obviously cases where you can take the code as is and split it
out into a new helper function.

You really have to look at it and answer the question whether it's good
code or not, whether it could be written in better ways and with
improved functionality.

I could have given you this minimalistic one:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
  int shift = pkey * PKR_BITS_PER_PKEY;

  pkval &= ~(PKEY_ACCESS_MASK << shift);
  return pkval | (accessbits & PKEY_ACCESS_MASK) << shift;
  }

But I gave you this:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
  int shift = pkey * PKR_BITS_PER_PKEY;

  if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
  accessbits &= PKEY_ACCESS_MASK;

  pkval &= ~(PKEY_ACCESS_MASK << shift);
  return pkval | accessbits << shift;
  }

This is what refactoring is about. See?

> I agree with Dan regarding the macros though.  I think they make it easier to
> see what is going on without dealing with masks and shifts directly.  But I can
> remove this patch if you feel that strongly about it.

I'm not against macros per se, but not everything is automatically
better when it is hidden behind a macro.

What I'm arguing against is the claim that macros are an improvement by
definition. Especially when they are just blindly thrown into code which
should not exist in the first place.

Also versus ordering. What's wrong with doing it this way:

  1) Define the macros first without changing the code

  2) Implement pkey_update_pkval() in a sensible way and use the macros
 where appropriate. Thereby replacing the existing version in the
 other function.

Which would end up in the obviously even simpler code:

  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
  {
  if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
  accessbits &= PKEY_ACCESS_MASK;

  pkval &= ~PKR_PKEY_VALUE(pkey, PKEY_ACCESS_MASK);
  return pkval | PKR_PKEY_VALUE(pkey, accessbits);
  }

That fits the goal of that macro exercise to make it easy to read and
obvious what's going on, no?

Instead of:

>  u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
>  {
> -  int shift = pkey * PKR_BITS_PER_PKEY;
> +  int shift = PKR_PKEY_SHIFT(pkey);
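
The helper macros themselves are not spelled out in the quoted hunks; a
plausible sketch, inferred purely from how PKR_PKEY_SHIFT(), PKR_PKEY_MASK()
and PKR_PKEY_VALUE() are used in this thread (so treat the exact spelling as
an assumption), would be:

	/* Sketch only -- inferred from usage, not copied from the patch */
	#define PKR_PKEY_SHIFT(pkey)		((pkey) * PKR_BITS_PER_PKEY)
	#define PKR_PKEY_MASK(pkey)		(((1 << PKR_BITS_PER_PKEY) - 1) << PKR_PKEY_SHIFT(pkey))
	#define PKR_PKEY_VALUE(pkey, val)	((val) << PKR_PKEY_SHIFT(pkey))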

Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-12-07 Thread Thomas Gleixner
Ira,

On Mon, Dec 06 2021 at 17:54, Ira Weiny wrote:
> On Thu, Nov 25, 2021 at 03:12:47PM +0100, Thomas Gleixner wrote:
>> > +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +  /* add space for extended_pt_regs */
>> > +  subq$EXTENDED_PT_REGS_SIZE, %rsp
>> > +#endif
>> > +  .if \annotate_retpoline_safe == 1
>> > +  ANNOTATE_RETPOLINE_SAFE
>> > +  .endif
>> 
>> This annotation is new and nowhere mentioned why it is part of this
>> patch.
>
> I don't understand.  ANNOTATE_RETPOLINE_SAFE has been around since:
>
> 9e0e3c5130e9 x86/speculation, objtool: Annotate indirect calls/jumps
> for objtool

Sorry, I misread that macro maze. It's conditional obviously.

> I can split it if you prefer.  How about a patch with just the x86 extended
> pt_regs stuff but that would leave a zero size for the extended stuff?  Then
> followed by the pks bits?

Whatever makes sense and does one thing per patch.

>> I really have to ask the question whether this #ifdeffery has any value
>> at all. 8 bytes extra stack usage is not going to be the end of the
>> world and distro kernels will enable that config anyway.
>
> My goal with this has always been 0 overhead if turned off.  So this seemed
> like a logical addition.  Furthermore, ARCH_ENABLE_SUPERVISOR_PKEYS is
> predicated on ARCH_HAS_SUPERVISOR_PKEYS which is only available with x86_64.
> This removes the space for x86 when not needed.

The question is not about 64 vs. 32bit. The question is whether the
conditional makes sense for 64bit in the first place. Whether this
matters for 32bit has to be determined. It makes some sense, but less
#ifdeffery and less obfuscation makes sense too.

>> If we really want to save the space then certainly not by sprinkling
>> CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
>> extra sized ptregs in the pkrs header.
>> 
>> You are changing generic architecture code so you better think about
>> making such a change generic and extensible.
>
> I agree.  And I tried to do so.  The generic entry code is modified only by the
> addition of pkrs_[save|restore]_irq().  These are only defined if the arch
> defines ARCH_HAS_SUPERVISOR_PKEYS and furthermore, if something triggers
> enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

I'm talking about generic _architecture_ code, i.e. the code in
arch/x86/ which affects all vendors and systems.

> ARCH_HAS_SUPERVISOR_PKEYS is restricted to x86_64 at the moment.  All other
> arch's including x86 should not see any changes in the generic code.

That was not the question and I'm well aware of that.

>> If the next feature comes around which needs to save something in that
>> extended area then we are going to change the world again, right?
>
> I'm not sure what you mean by 'change the world'.  I would anticipate the entry
> code to be modified with something similar to pks_[save|restore]_irq() and let
> the arch deal with the specifics.

If on X86 the next X86 specific feature comes around which needs extra
reg space then someone has to change the world in arch/x86 again, replace
all the ARCH_ENABLE_SUPERVISOR_PKEYS #ifdefs with something else, right?

Instead of adding a new field to pt_regs_aux and be done with it.

> Also in [1] I thought Peter and Andy agreed that placing additional generic
> state in the extended pt_regs was not needed and does not buy us anything.  I
> specifically asked if that was something we wanted to do in [2].

This was about a generic representation which affects the common entry
code in kernel/entry/... Can you spot the difference?

What I suggested is _solely_ x86 specific and does not trickle into
anything outside of arch/x86.

>> See? No magic hardcoded constant, no build time error checking for that
>> constant. Nothing, it just works.
>
> Yes agreed definitely an improvement.

It's the right thing to do.

>> That's one part, but let me come back to this:
>> 
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +  /* add space for extended_pt_regs */
>> > +  subq$EXTENDED_PT_REGS_SIZE, %rsp
>> 
>> What guarantees that RSP points to pt_regs at this point?  Nothing at
>> all. It's just pure luck and a question of time until this explodes in
>> hard to diagnose ways.
>
> It took me a bit to wrap my head around what I think you mean.  My initial
> response was that rsp should be the stack pointer for __call_ext_ptregs() just
> like it was for call.  But I think I see that it is better to open code this
> since others may want to play the same trick without using this code and
> therefore we may not be getting the extended pt_regs structure on t

Re: [PATCH V7 05/18] x86/pks: Add PKS setup code

2021-11-26 Thread Thomas Gleixner
On Thu, Nov 25 2021 at 16:15, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> Aside of that, the function which sets up the init value is really
> bogus. As you explained in the cover letter a kernel user has to:
>
>1) Claim an index in the enum
>2) Add a default value to the array in that function
>
> Seriously? How is that any better than doing:
>
> #define PKS_KEY0_DEFAULT  PKR_RW_ENABLE
> #define PKS_KEY1_DEFAULT  PKR_ACCESS_DISABLE
> ...
> #define PKS_KEY15_DEFAULT PKR_ACCESS_DISABLE
>
> and let the compiler construct pkrs_init_value?
>
> TBH, it's not and this function has to be ripped out in case that you
> need a dynamic allocation of keys some day. So what is this buying us
> aside of horrible to read and utterly pointless code?

And as Taoyi confirmed, it's broken.

It surely takes a reviewer to spot that and an external engineer to run
rdmsr -a to validate that this is not working as expected, right?

Sigh...
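
For illustration, the compile-time construction referred to above could look
like the sketch below (three keys shown; PKR_VALUE() is the helper from the
quoted patch, the PKS_KEYn_DEFAULT and PKS_INIT_VALUE names follow the quoted
suggestion and are otherwise assumptions):

	/* Sketch: per-key defaults, combined by the compiler at build time */
	#define PKS_KEY0_DEFAULT	PKR_RW_ENABLE
	#define PKS_KEY1_DEFAULT	PKR_ACCESS_DISABLE
	#define PKS_KEY2_DEFAULT	PKR_ACCESS_DISABLE

	#define PKS_INIT_VALUE	(PKR_VALUE(0, PKS_KEY0_DEFAULT) | \
				 PKR_VALUE(1, PKS_KEY1_DEFAULT) | \
				 PKR_VALUE(2, PKS_KEY2_DEFAULT))

No early_initcall(), no runtime loop, and a missing default shows up as a
build error instead of a silently wrong MSR value.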



Re: [PATCH V7 05/18] x86/pks: Add PKS setup code

2021-11-26 Thread Thomas Gleixner
On Fri, Nov 26 2021 at 11:11, taoyi ty wrote:
> On 11/25/21 11:15 PM, Thomas Gleixner wrote:
>>> +void setup_pks(void)
>>> +{
>>> +   if (!cpu_feature_enabled(X86_FEATURE_PKS))
>>> +   return;
>>> +
>>> +   write_pkrs(pkrs_init_value);
>> 
>> Is the init value set up _before_ this function is invoked for the first
>> time?
>> 
> Setting up for cpu0 happens before create_initial_pkrs_value(), therefore the
> pkrs value of cpu0 won't be set correctly.
>
> [root@AliYun ~]# rdmsr -a 0x06E1
> 0
> 5554
> 5554
> 5554
> 5554
> 5554
> 5554
> 5554
> 5554
> 5554
>
> Here are my test results after applying the patches

Thanks for confirming what I assumed from looking at the patches!

   tglx





Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros

2021-11-25 Thread Thomas Gleixner
On Thu, Nov 25 2021 at 15:25, Thomas Gleixner wrote:
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
>> @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
>>   */
>>  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>>  {
>> -int pkey_shift = pkey * PKR_BITS_PER_PKEY;
>> -
>>  /*  Mask out old bit values */
>> -pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
>> +pk_reg &= ~PKR_PKEY_MASK(pkey);
>>  
>>  /*  Or in new values */
>>  if (flags & PKEY_DISABLE_ACCESS)
>> -pk_reg |= PKR_AD_BIT << pkey_shift;
>> +pk_reg |= PKR_AD_KEY(pkey);
>>  if (flags & PKEY_DISABLE_WRITE)
>> -pk_reg |= PKR_WD_BIT << pkey_shift;
>> +pk_reg |= PKR_WD_KEY(pkey);
>
> I'm not seeing how this is improving that code. Quite the contrary.

Aside of that why are you ordering it the wrong way around, i.e.

   1) implement turd
   2) polish turd

instead of implementing the required helpers first if they are really
providing value.

Thanks,

tglx





Re: [PATCH V7 07/18] x86/pks: Preserve the PKRS MSR on context switch

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> @@ -658,6 +659,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>   /* Load the Intel cache allocation PQR MSR. */
>   resctrl_sched_in();
>  
> + pkrs_write_current();

This is invoked from switch_to() and does this extra get/put_cpu_ptr()
dance in the write function for no reason. Can you please stop sticking
overhead into the hotpath just because?
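
A sketch of the shape that was meant here: no get/put_cpu_ptr() dance, relying
on __switch_to() running with preemption disabled. The current->thread.pkrs
field name is an assumption for illustration; pkrs_cache and MSR_IA32_PKRS are
taken from the quoted series:

	void pkrs_write_current(void)
	{
		/* Sketch: __switch_to() is not preemptible, plain per-CPU ops suffice */
		u32 pkrs = current->thread.pkrs;	/* field name assumed */

		if (__this_cpu_read(pkrs_cache) != pkrs) {
			__this_cpu_write(pkrs_cache, pkrs);
			wrmsrl(MSR_IA32_PKRS, pkrs);
		}
	}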

Thanks,

tglx



Re: [PATCH V7 05/18] x86/pks: Add PKS setup code

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +void setup_pks(void);

pks_setup()

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +static DEFINE_PER_CPU(u32, pkrs_cache);
> +u32 __read_mostly pkrs_init_value;
> +
> +/*
> + * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
> + * be checked quickly.
> + *
> + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> + * serializing but still maintains ordering properties similar to WRPKRU.
> + * The current SDM section on PKRS needs updating but should be the same as
> + * that of WRPKRU.  So to quote from the WRPKRU text:
> + *
> + * WRPKRU will never execute transiently. Memory accesses
> + * affected by PKRU register will not execute (even transiently)
> + * until all prior executions of WRPKRU have completed execution
> + * and updated the PKRU register.
> + */
> +void write_pkrs(u32 new_pkrs)

pkrs_write()

> +{
> + u32 *pkrs;
> +
> + if (!static_cpu_has(X86_FEATURE_PKS))
> + return;

  cpu_feature_enabled() if at all. Why is this function even invoked
  when PKS is off?

> +
> +	pkrs = get_cpu_ptr(&pkrs_cache);

As far as I've seen this is mostly called from non-preemptible
regions. So that get/put pair is pointless. Stick a lockdep assert into
the code and disable preemption at the maybe one callsite which needs
it.

> + if (*pkrs != new_pkrs) {
> + *pkrs = new_pkrs;
> + wrmsrl(MSR_IA32_PKRS, new_pkrs);
> + }
> + put_cpu_ptr(pkrs);
> +}
> +
> +/*
> + * Build a default PKRS value from the array specified by consumers
> + */
> +static int __init create_initial_pkrs_value(void)
> +{
> + /* All users get Access Disabled unless changed below */
> + u8 consumer_defaults[PKS_NUM_PKEYS] = {
> + [0 ... PKS_NUM_PKEYS-1] = PKR_AD_BIT
> + };
> + int i;
> +
> + consumer_defaults[PKS_KEY_DEFAULT] = PKR_RW_BIT;
> +
> + /* Ensure the number of consumers is less than the number of keys */
> + BUILD_BUG_ON(PKS_KEY_NR_CONSUMERS > PKS_NUM_PKEYS);
> +
> + pkrs_init_value = 0;

This needs to be cleared because the BSS might be non zero?

> + /* Fill the defaults for the consumers */
> + for (i = 0; i < PKS_NUM_PKEYS; i++)
> + pkrs_init_value |= PKR_VALUE(i, consumer_defaults[i]);

Also PKR_RW_BIT is a horrible invention:

> +#define PKR_RW_BIT 0x0
>  #define PKR_AD_BIT 0x1
>  #define PKR_WD_BIT 0x2
>  #define PKR_BITS_PER_PKEY 2

This makes my brain spin. How do you fit 3 bits into 2 bits per key?
That's really non-intuitive.

PKR_RW_ENABLE   0x0
PKR_ACCESS_DISABLE  0x1
PKR_WRITE_DISABLE   0x2

makes it obvious what this is about, no?

Aside of that, the function which sets up the init value is really
bogus. As you explained in the cover letter a kernel user has to:

   1) Claim an index in the enum
   2) Add a default value to the array in that function

Seriously? How is that any better than doing:

#define PKS_KEY0_DEFAULT	PKR_RW_ENABLE
#define PKS_KEY1_DEFAULT	PKR_ACCESS_DISABLE
...
#define PKS_KEY15_DEFAULT   PKR_ACCESS_DISABLE

and let the compiler construct pkrs_init_value?

TBH, it's not and this function has to be ripped out in case that you
need a dynamic allocation of keys some day. So what is this buying us
aside of horrible to read and utterly pointless code?

> + return 0;
> +}
> +early_initcall(create_initial_pkrs_value);
> +
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + * Configure PKS if the CPU supports the feature.
> + */
> +void setup_pks(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_PKS))
> + return;
> +
> + write_pkrs(pkrs_init_value);

Is the init value set up _before_ this function is invoked for the first
time?

Thanks,

tglx



Re: [PATCH V7 03/18] x86/pks: Add additional PKEY helper macros

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> @@ -200,16 +200,14 @@ __setup("init_pkru=", setup_init_pkru);
>   */
>  u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
>  {
> - int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> -
>   /*  Mask out old bit values */
> - pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> + pk_reg &= ~PKR_PKEY_MASK(pkey);
>  
>   /*  Or in new values */
>   if (flags & PKEY_DISABLE_ACCESS)
> - pk_reg |= PKR_AD_BIT << pkey_shift;
> + pk_reg |= PKR_AD_KEY(pkey);
>   if (flags & PKEY_DISABLE_WRITE)
> - pk_reg |= PKR_WD_BIT << pkey_shift;
> + pk_reg |= PKR_WD_KEY(pkey);

I'm not seeing how this is improving that code. Quite the contrary.

Thanks,

tglx



Re: [PATCH V7 02/18] x86/fpu: Refactor arch_set_user_pkey_access()

2021-11-25 Thread Thomas Gleixner
On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * Replace disable bits for @pkey with values from @flags
> + *
> + * Kernel users use the same flags as user space:
> + * PKEY_DISABLE_ACCESS
> + * PKEY_DISABLE_WRITE
> + */
> +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)

pkey_ please.

> +{
> + int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> +
> + /*  Mask out old bit values */
> + pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> +
> + /*  Or in new values */
> + if (flags & PKEY_DISABLE_ACCESS)
> + pk_reg |= PKR_AD_BIT << pkey_shift;
> + if (flags & PKEY_DISABLE_WRITE)
> + pk_reg |= PKR_WD_BIT << pkey_shift;
> +
> + return pk_reg;

Also this code is silly.

#define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)

u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
{
int shift = pkey * PKR_BITS_PER_PKEY;

if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
accessbits &= PKEY_ACCESS_MASK;

pkval &= ~(PKEY_ACCESS_MASK << shift);
	return pkval | accessbits << shift;
}

See? It does not even need comments because it's self explaining and
uses sensible argument names matching the function name.





Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-11-25 Thread Thomas Gleixner
Ira,

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:   C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* add space for extended_pt_regs */
> +	subq	$EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> + .if \annotate_retpoline_safe == 1
> + ANNOTATE_RETPOLINE_SAFE
> + .endif

This annotation is new and nowhere mentioned why it is part of this
patch.

Can you please do _ONE_ functional change per patch and not an
unreviewable pile of changes in one go? Same applies for the ASM and the
C code changes. The ASM change has to go first and then the C code can
build upon it.

> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* remove space for extended_pt_regs */
> +	addq	$EXTENDED_PT_REGS_SIZE, %rsp
> +#endif

I really have to ask the question whether this #ifdeffery has any value
at all. 8 bytes extra stack usage is not going to be the end of the
world and distro kernels will enable that config anyway.

If we really want to save the space then certainly not by sprinkling
CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
extra sized ptregs in the pkrs header.

You are changing generic architecture code so you better think about
making such a change generic and extensible. Can folks please start
thinking beyond the brim of their teacup and not pretend that the
feature they are working on is the unicorn which requires unique magic
brandnamed after the unicorn of the day.

If the next feature comes around which needs to save something in that
extended area then we are going to change the world again, right?
Certainly not.

This wants to go into asm/ptrace.h:

struct pt_regs_aux {
u32 pkrs;
};

struct pt_regs_extended {
struct pt_regs_aux  aux;
struct pt_regs  regs __attribute__((aligned(8)));
};

and then have in asm-offset:

   DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));

which does the right thing whatever the size of pt_regs_aux is. So for
the above it will have:

 #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

Even, if you do

struct pt_regs_aux {
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
u32 pkrs;
#endif
};

and the config switch is disabled. It's still correct:

 #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

See? No magic hardcoded constant, no build time error checking for that
constant. Nothing, it just works.

That's one part, but let me come back to this:

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> + /* add space for extended_pt_regs */
> +	subq	$EXTENDED_PT_REGS_SIZE, %rsp

What guarantees that RSP points to pt_regs at this point?  Nothing at
all. It's just pure luck and a question of time until this explodes in
hard to diagnose ways.

Because between

	movq	%rsp, %rdi
and
	call

can legitimately be other code which causes the stack pointer to
change. It's not the case today, but nothing prevents this in the
future.

The correct thing to do is:

	movq	%rsp, %rdi
	RSP_MAKE_PT_REGS_AUX_SPACE
	call	...
	RSP_REMOVE_PT_REGS_AUX_SPACE

The few extra macro lines in the actual code are way better as they make
it completely obvious what's going on and any misuse can be spotted
easily.

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection 
> for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the 
> current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)

That's a misnomer as this is invoked for _any_ exception not just
interrupts.

>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
> pt_regs *regs)
>  
>   inhcall = get_and_clear_inhcall();
>   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> + /* Normally called by irqentry_exit, restore pkrs here */
> + pkrs_restore_irq(regs);
>   irqentry_exit_cond_resched();

Sigh. Consistency is overrated

> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);

So we have pkrs_write_current() and write_pkrs() now. Can you please
stick to a common 

Re: [PATCH V7 08/18] x86/entry: Preserve PKRS MSR across exceptions

2021-11-25 Thread Thomas Gleixner
On Fri, Nov 12 2021 at 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny 
>> 
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>> 
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
>
> Andy,
>
> I'm preparing to send V8 of this PKS work.  But I have not seen any feed back
> since I originally implemented this in V4[1].
>
> Does this meets your expectations?  Are there any issues you can see with this
> code?
>
> I would appreciate your feedback.

Not Andy here, but I'll respond to the patch in a minute.

Thanks,

tglx



Re: [signal] 4bad58ebc8: will-it-scale.per_thread_ops -3.3% regression

2021-04-20 Thread Thomas Gleixner
On Tue, Apr 20 2021 at 11:08, kernel test robot wrote:
> FYI, we noticed a -3.3% regression of will-it-scale.per_thread_ops due to 
> commit:
>
> commit: 4bad58ebc8bc4f20d89cff95417c9b4674769709 ("signal: Allow tasks to 
> cache one sigqueue struct")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/core
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz 
> with 192G memory
> with following parameters:
>
>   nr_task: 100%
>   mode: thread
>   test: futex3
>   cpufreq_governor: performance
>   ucode: 0x5003006
>
> test-description: Will It Scale takes a testcase and runs it from 1 through 
> to n parallel copies to see if the testcase will scale. It builds both a 
> process and threads based test in order to see any differences between the 
> two.
> test-url: https://github.com/antonblanchard/will-it-scale
> commit: 
>   69995ebbb9 ("signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()")
>   4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
>
> 69995ebbb9d37173            4bad58ebc8bc4f20d89cff95417
> ----------------            ---------------------------
>          %stddev      %change          %stddev
>              \            |                 \
>  1.273e+09             -3.3%    1.231e+09     will-it-scale.192.threads
>    6630224             -3.3%      6409738     will-it-scale.per_thread_ops
>  1.273e+09             -3.3%    1.231e+09     will-it-scale.workload
>       1638 ±  3%       -7.8%         1510 ±  5%  sched_debug.cfs_rq:/.runnable_avg.max
>     297.83 ± 68%    +1747.6%         5502 ±152%  interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
>     297.83 ± 68%    +1747.6%         5502 ±152%  interrupts.CPU12.33:PCI-MSI.524291-edge.eth0-TxRx-2

This change is definitely not causing more network traffic

>       8200           -33.4%         5459 ± 35%  interrupts.CPU27.NMI:Non-maskable_interrupts
>       8200           -33.4%         5459 ± 35%  interrupts.CPU27.PMI:Performance_monitoring_interrupts
>       8199           -33.4%         5459 ± 35%  interrupts.CPU28.NMI:Non-maskable_interrupts
>       8199           -33.4%         5459 ± 35%  interrupts.CPU28.PMI:Performance_monitoring_interrupts
>       6148 ± 33%     -11.2%         5459 ± 35%  interrupts.CPU29.NMI:Non-maskable_interrupts
>       6148 ± 33%     -11.2%         5459 ± 35%  interrupts.CPU29.PMI:Performance_monitoring_interrupts
>       4287 ±  8%     +33.6%         5730 ± 15%  interrupts.CPU49.CAL:Function_call_interrupts
>       6356 ± 19%     +49.6%         9509 ± 19%  interrupts.CPU97.CAL:Function_call_interrupts

Neither does it increase the number of function calls

> 407730 ±  8% +37.5% 560565 ±  7%  perf-stat.i.dTLB-load-misses
> 415959 ±  8% +40.4% 583928 ±  7%  perf-stat.ps.dTLB-load-misses

And this massive increase does not make sense either.

Confused.

Thanks,

tglx


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-20 Thread Thomas Gleixner
On Tue, Apr 20 2021 at 17:15, Lorenzo Colitti wrote:
> On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner  wrote:
>> Enable tracing and enable the following tracepoints:
>> [...]
>
> Sorry for the delay. I had to learn a bit about how to use the tracing
> infrastructure. I don't know if I can post here, but to my untrained
> eye, one big difference between the old (fast) code and the new (slow)
> code is that the new code calls tick_program_event() much more. It
> looks like that makes most of the difference.
>
> With the old code, hrtimer_start_range_ns almost never calls
> tick_program_event at all, but the new code seems to call it twice on
> every timer update.

Yes, I figured out why that happens by now, but the old behaviour was
just incorrect. So now there are clearly two issues:

  1) hrtimer is contrary to timer_list not really suited for high
 frequency start/cancel/start/... cycles of a timer. It's optimized
 for the start and expire precisely case.

  2) The cost for reprogramming depends on the underlying hardware. With
 halfways sane timer hardware it's minimal and as I measured in a
 micro benchmark in the 1% range. Now when your system ends up
 having one of the real timer trainwrecks which can be found in
 drivers/clocksource/ which are even worse than the x86 HPET horror,
 then it's going to be painful and a performance issue.

 I assume that's an ARM64 system. ARM64 CPUs have an architected per
 CPU timer where the reprogramming is pretty fast as it's next to
 the CPU, but who knows what your system is using.

Now in the meantime I looked into __hrtimer_start_range_ns() whether
that double reprogram can be avoided without creating a total trainwreck
and imposing penalty on all sensible use cases. Completely untested
patch below should do the trick and it's not ugly enough that I hate it
with a passion.

Even if that makes your problem go away #1 is still beyond suboptimal
and #2 is something you really want to look into.

Thanks,

tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+  bool restart, bool keep_local)
 {
u8 state = timer->state;
 
if (state & HRTIMER_STATE_ENQUEUED) {
-   int reprogram;
+   bool reprogram;
 
/*
 * Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
debug_deactivate(timer);
        reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
+   /*
+* If the timer is not restarted then reprogramming is
+* required if the timer is local. If it is local and about
+* to be restarted, avoid programming it twice (on removal
+* and a moment later when it's requeued).
+*/
if (!restart)
state = HRTIMER_STATE_INACTIVE;
+   else
+   reprogram &= !keep_local;
 
__remove_hrtimer(timer, base, state, reprogram);
return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
struct hrtimer_clock_base *base)
 {
struct hrtimer_clock_base *new_base;
+   bool force_local, first;
 
-   /* Remove an active timer from the queue: */
-   remove_hrtimer(timer, base, true);
+   /*
+* If the timer is on the local cpu base and is the first expiring
+* timer then this might end up reprogramming the hardware twice
+* (on removal and on enqueue). To avoid that by prevent the
+* reprogram on removal, keep the timer local to the current CPU
+* and enforce reprogramming after it is queued no matter whether
+* it is the new first expiring timer again or not.
+*/
+   force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+   force_local &= base->cpu_base->next_timer == timer;
+
+   /*
+* Remove an active timer from the queue. In case it is not queued
+* on the current CPU, make sure that remove_hrtimer() updates the
+* remote data correctly.
+*
+* If it's on the current CPU and the first expiring timer, then
+* skip reprogramming, keep the timer local and enforce
+* reprogramming later if it was the first expiring timer.  This
+* avoids programming the underlying clock event twice (once at
+* removal and once after enqueue).
+*/
+   remove_hrtimer(timer, base, true, force_local);
 
if (mode &

Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-20 Thread Thomas Gleixner
On Mon, Apr 19 2021 at 20:12, Maciej Żenczykowski wrote:
> On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner  wrote:
>> Run the test on a kernels with and without that commit and collect trace
>> data for both.
>>
>> That should give me a pretty clear picture what's going on.
>
> Lorenzo is trying to get the traces you asked for, or rather he’s
> trying to get confirmation he can post them.
>
> Our initial observation of these results seems to suggest that
> updating the timer (hrtimer_start, which seems to also call
> hrtimer_cancel) takes twice as long as it used to.

Which contradicts my measurements. The change in complexity is marginal
and the overhead in cycles/instructions is close to noise. It's
measurable in a microbenchmark, but it's in the < 1% range which is far
away from the 60% you are seeing.

> My gut feeling is that softirq_activated is usually false, and the old
> code in such a case calls just __hrtimer_get_next_event(,
> HRTIMER_ACTIVE_ALL).  While the new code will first call
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD)
>
> Perhaps __hrtimer_get_next_event() should return both soft and hard
> event times in one function call?
> Or perhaps hrtimer_start should not call hrtimer_cancel?

Perhaps we do a proper analysis first :)

Thanks,

tglx


Re: [PATCH v6] hrtimer: avoid retrigger_next_event IPI

2021-04-19 Thread Thomas Gleixner
On Mon, Apr 19 2021 at 16:39, Marcelo Tosatti wrote:
>  
> +static void clock_was_set_force_reprogram_work(struct work_struct *work)
> +{
> + clock_was_set(true);
> +}
> +
> +static DECLARE_WORK(hrtimer_force_reprogram_work, clock_was_set_force_reprogram_work);
> +
> +
>  static void clock_was_set_work(struct work_struct *work)
>  {
> - clock_was_set();
> + clock_was_set(false);
>  }
>  
>  static DECLARE_WORK(hrtimer_work, clock_was_set_work);
> @@ -769,9 +777,12 @@ static DECLARE_WORK(hrtimer_work, clock_was_set_work);
>   * Called from timekeeping and resume code to reprogram the hrtimer
>   * interrupt device on all cpus.
>   */
> -void clock_was_set_delayed(void)
> +void clock_was_set_delayed(bool force_reprogram)
>  {
> -	schedule_work(&hrtimer_work);
> +	if (force_reprogram)
> +		schedule_work(&hrtimer_force_reprogram_work);
> +	else
> +		schedule_work(&hrtimer_work);
>  }
>  
>  #else
> @@ -871,6 +882,18 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
>   tick_program_event(expires, 1);
>  }
>  
> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) | \
> +  (1U << HRTIMER_BASE_REALTIME_SOFT) |   \
> +  (1U << HRTIMER_BASE_TAI) | \
> +  (1U << HRTIMER_BASE_TAI_SOFT) |\
> +  (1U << HRTIMER_BASE_BOOTTIME) |\
> +  (1U << HRTIMER_BASE_BOOTTIME_SOFT))
> +
> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> +{
> + return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
> +}
> +
>  /*
>   * Clock realtime was set
>   *
> @@ -882,11 +905,42 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
>   * resolution timer interrupts. On UP we just disable interrupts and
>   * call the high resolution interrupt code.
>   */
> -void clock_was_set(void)
> +void clock_was_set(bool force_reprogram)
>  {
>  #ifdef CONFIG_HIGH_RES_TIMERS
> - /* Retrigger the CPU local events everywhere */
> - on_each_cpu(retrigger_next_event, NULL, 1);
> + cpumask_var_t mask;
> + int cpu;
> +
> + if (force_reprogram == true) {
> + on_each_cpu(retrigger_next_event, NULL, 1);
> + goto set_timerfd;
> + }
> +
> +	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> + on_each_cpu(retrigger_next_event, NULL, 1);
> + goto set_timerfd;
> + }

This is really horrible and the proper thing to do is to separate the
s2idle/resume related functionality which allows to do some other
cleanups as well.

I already started to look at that over the weekend, but got stuck due to
a couple of correctness issues I found with the whole clock_was_set()
mechanism while looking. That stuff needs to be fixed first.

Back to the above. For the s2idle resume path there is no reason for an
IPI at all. It's just that way because it has been bolted on the
existing machinery.

So today it does:

   tick_unfreeze() {
        if (first_cpu_to_unfreeze()) {
                timekeeping_resume();
                    tick_resume();
                        tick_resume_local();
                    clock_was_set_delayed();
        } else {
                tick_resume_local();
        }
   }

Then after everything is unfrozen including workqueues the delayed
clock_was_set() runs and does the IPI dance. That's just silly.

We can be smarter than that:

   tick_unfreeze() {
        if (first_cpu_to_unfreeze()) {
                timekeeping_resume();
                    tick_resume();
                        tick_resume_local();
                            hrtimer_resume_local();
        } else {
                tick_resume_local();
                    hrtimer_resume_local();
        }
   }

See? No IPI required at all and no magic force argument and as a bonus
we get also understandable code.
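
A sketch of what such a local helper could look like. The name follows the
sketch above; it assumes retrigger_next_event() stays the per-CPU offset
re-evaluation routine and that the resume path runs with interrupts still
disabled:

	void hrtimer_resume_local(void)
	{
		/* Sketch: re-evaluate offsets and reprogram this CPU only, no IPI */
		lockdep_assert_irqs_disabled();

		retrigger_next_event(NULL);
	}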

Splitting this properly allows to be smarter about the affected clocks
because most of the clock_was_set() events only care about
CLOCK_REALTIME and CLOCK_TAI and just the sleep time injection affects
CLOCK_BOOTTIME as well.

There are a few other things which can be done to avoid even more IPIs,
but let me address the correctness issues of clock_was_set() first.

Thanks,

tglx


Re: [syzbot] WARNING in kthread_is_per_cpu

2021-04-19 Thread Thomas Gleixner
On Mon, Apr 19 2021 at 03:36, syzbot wrote:

> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:1216f02e Add linux-next specific files for 20210415
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1032ba29d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3491b04113499f81
> dashboard link: https://syzkaller.appspot.com/bug?extid=9362b31a2e0cad8b749d
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+9362b31a2e0cad8b7...@syzkaller.appspotmail.com
>
> [ cut here ]
> WARNING: CPU: 1 PID: 23550 at kernel/kthread.c:83 to_kthread kernel/kthread.c:83 [inline]
> WARNING: CPU: 1 PID: 23550 at kernel/kthread.c:83 kthread_is_per_cpu+0xc4/0xf0 kernel/kthread.c:519
> Modules linked in:
> CPU: 1 PID: 23550 Comm: syz-executor.3 Not tainted 
> 5.12.0-rc7-next-20210415-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:to_kthread kernel/kthread.c:83 [inline]
> RIP: 0010:kthread_is_per_cpu+0xc4/0xf0 kernel/kthread.c:519
> Code: 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 75 2e 4c 8b 23 41 83 e4 01 
> e8 89 d3 27 00 44 89 e0 5b 5d 41 5c c3 e8 7c d3 27 00 <0f> 0b eb 88 e8 33 90 
> 6c 00 e9 68 ff ff ff e8 39 90 6c 00 eb 9a 48
> RSP: 0018:c9dc0c08 EFLAGS: 00010046
> RAX:  RBX: 88802533d580 RCX: 0100
> RDX: 8880549bb900 RSI: 814ca4c4 RDI: 0003
> RBP:  R08:  R09: 88802533d580
> R10: 814ca44c R11: 018a3b90 R12: 0001
> R13: c9dc0d90 R14: 0001 R15: 88802533d580
> FS:  7f4be57d3700() GS:8880b9d0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 001b2cd24000 CR3: 24626000 CR4: 001506e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  can_migrate_task+0x124/0x1630 kernel/sched/fair.c:7610

So this is:

if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))

The warning in to_kthread() is:

WARN_ON(!(k->flags & PF_KTHREAD));

IOW, the p->flags lost PF_KTHREAD within at max. 50 instructions.

Magic, cosmic rays or memory corruption / stray pointer in some other
place?
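
For reference, the code involved looked roughly like this at the time
(paraphrased, so treat it as an approximation): can_migrate_task() only calls
kthread_is_per_cpu() after checking PF_KTHREAD, and to_kthread() immediately
rechecks the very same flag:

	/* kernel/kthread.c -- approximate, for context only */
	static inline struct kthread *to_kthread(struct task_struct *k)
	{
		WARN_ON(!(k->flags & PF_KTHREAD));
		return (__force void *)k->set_child_tid;
	}

	bool kthread_is_per_cpu(struct task_struct *p)
	{
		struct kthread *kthread = to_kthread(p);

		if (!kthread)
			return false;
		return test_bit(KTHREAD_IS_PER_CPU, &kthread->flags);
	}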

Thanks,

tglx


Re: [PATCH] genirq/irqaction: Move dev_id and percpu_dev_id in a union

2021-04-18 Thread Thomas Gleixner
On Sat, Apr 10 2021 at 13:00, Marc Zyngier wrote:
> dev_id and percpu_dev_id are mutually exclusive in struct irqaction,
> as they conceptually represent the same thing, only in a per-cpu
> fashion.
>
> Move them into an anonymous union, saving a few bytes on the way.

The reason why they are not in an anonymous union is that any misuse of
interfaces will result in an instantaneous explosion, while with your
variant it will cause hard to diagnose side effects.

I'd rather waste the extra 4/8 bytes unless there is a compelling reason
not to do so.

Thanks,

tglx


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-18 Thread Thomas Gleixner
On Sat, Apr 17 2021 at 17:11, Andy Lutomirski wrote:
> On Sat, Apr 17, 2021 at 4:53 PM Thomas Gleixner  wrote:
>> which works for
>>
>>   foo = function_nocfi(bar);
>
> I agree in general.  But right now, we have, in asm/proto.h:
>
> void entry_SYSCALL_64(void);
>
> and that's pure nonsense.  Depending on your point of view,
> entry_SYSCALL_64 is a symbol that resolves to an integer or it's an
> array of bytes containing instructions, but it is most definitely not
> a function void (void).  So, regardless of any CFI stuff, I propose
> that we standardize our handling of prototypes of symbols that are
> opaque to the C compiler.  Here are a couple of choices:
>
> Easy one:
>
> extern u8 entry_SYSCALL_64[];
>
> Slightly more complicated:
>
> struct opaque_symbol;
> extern struct opaque_symbol entry_SYSCALL_64;
>
> The opaque_symbol variant avoids any possible confusion over the weird
> status of arrays in C, and it's hard to misuse, since struct
> opaque_symbol is an incomplete type.

Makes sense.
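
A minimal sketch of the second variant in use (the MSR_LSTAR consumer is just
an illustrative example, not taken from this thread, and the function name is
made up for the sketch):

	struct opaque_symbol;				/* deliberately incomplete */
	extern struct opaque_symbol entry_SYSCALL_64;

	static void __init sketch_setup_lstar(void)	/* illustrative name */
	{
		/* Taking the address is fine; calling or dereferencing it cannot compile */
		wrmsrl(MSR_LSTAR, (unsigned long)&entry_SYSCALL_64);
	}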


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-17 Thread Thomas Gleixner
On Sat, Apr 17 2021 at 16:19, Andy Lutomirski wrote:
> On Fri, Apr 16, 2021 at 4:40 PM Kees Cook  wrote:
>> Okay, you're saying you want __builtin_gimme_body_p() to be a constant
>> expression for the compiler, not inline asm?
>
> Yes.
>
> I admit that, in the trivial case where the asm code is *not* a
> C-ABI-compliant function, giving a type that doesn't fool the compiler
> into thinking that it might be is probably the best fix.  Maybe we
> should standardize something, e.g.:
>
> struct raw_symbol;  /* not defined anywhere */
> #define DECLARE_RAW_SYMBOL(x) struct raw_symbol x[]
>
> and then we write this:
>
> DECLARE_RAW_SYMBOL(entry_SYSCALL_64);
>
> wrmsrl(..., (unsigned long)entry_SYSCALL_64);
>
> It would be a bit nifty if we didn't need a forward declaration, but
> I'm not immediately seeing a way to do this without hacks that we'll
> probably regret;
>
> But this doesn't help the case in which the symbol is an actual
> C-callable function and we want to be able to call it, too.

The right way to solve this is that the compiler provides a builtin

 function_nocfi() +/- the naming bikeshed

which works for

  foo = function_nocfi(bar);

and

static unsigned long foo[] = {
   function_nocfi(bar1),
   function_nocfi(bar2),
};

which are pretty much the only two possible use cases. If the compiler
wants to have function_nocfi_in_code() and function_nocfi_const()
because it can't figure it out on its own, then I can live with that,
but that's still several orders of magnitude better than having to work
around it by whatever nasty macro/struct magic.

I don't care about the slightly more unreadable code, but if that
builtin has a descriptive name, then it's even useful for documentary
purposes. And it can be easily grepped for which makes it subject to
static code analysers which can help to detect abuse.

Anything which requires coming up with half-baked constructs to work
around the shortcomings of the compiler is wrong to begin with.

Thanks,

tglx


Re: [PATCH v8 clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-04-17 Thread Thomas Gleixner
On Sat, Apr 17 2021 at 15:54, Paul E. McKenney wrote:
> On Sat, Apr 17, 2021 at 02:24:23PM +0200, Thomas Gleixner wrote:
>> I so wish we could just delete all of this horror instead of making it
>> more horrible.
>
> Revisit deleting it in five years if there are no issues, whatever
> "issue" might mean in this context?  Five years should give time for
> this to propagate to the enterprise distros.

That does not help with the fact that the broken hardware will still be
around in 5 years from now. Which is what prevents me from deleting the
whole watchdog muck ...

Thanks,

tglx


Re: [PATCH] Fix set apic mode from x2apic enabled bit patch

2021-04-17 Thread Thomas Gleixner
Mike!

On Sun, Apr 18 2021 at 00:39, Thomas Gleixner wrote:
> If you can't come up with something sensible anytime soon before the
> merge window opens then I'm simply going to revert 41e2da9b5e67 and you
> can try again for the next cycle.

so I just figured out that Boris wasted his time once more to fix that
up and redo the commit in question.

Won't happen again.

Thanks,

tglx


Re: [PATCH] Fix set apic mode from x2apic enabled bit patch

2021-04-17 Thread Thomas Gleixner
Mike!

On Thu, Apr 15 2021 at 17:06, Mike Travis wrote:

I'm slowly getting tired of the fact that every patch coming from you
fails to comply with the minimal requirements of the documented
procedures.

$subject: [PATCH] Fix set apic mode from x2apic enabled bit patch

Documentation clearly states, that there has to be a subsystem
prefix. It's not rocket science to figure it out:

# git log arch/x86/kernel/apic/x2apic_uv_x.c

gives you a pretty good hint that the prefix should be:

 x86/platform/uv:

It's not that hard and it's not optional.

Also what the heck means:

 Fix set apic mode from x2apic enabled bit patch

Again documentation is very clear about what the subject line, aka git
shortlog of a patch should look like.

This sentence is just word salad and does not give any clue of what this
is about. Even you won't be able to decode this 3 months from now. Simply
because it's word salad.

I don't care what kind of standards you have @hpe, but I very much care
about the standards of the kernel.

> Do not set uv_system_type for hubless UV systems as it tricks the
> is_uv_system function into thinking it's a UV hubbed system and includes
> a UV HUB RTC.  This causes UV RTC init to panic on UV hubless systems.

And yet more word salad. Can you please describe the precise technical
problem and not use metaphors of functions which are thinking?

Aside of that this does not describe the change at all. The change is:

> - early_set_apic_mode();

but your changelog blurbs about "thinking it's a UV hubbed system".

How the heck can this make sense for someone who is not part of the @hpe
universe mindset?

> Fixes: 41e2da9b5e67 ("x86/platform/uv: Use x2apic enabled bit as set by BIOS 
> to indicate APIC mode")
>
> [41e2da9b5e67 was accepted into tip x86/platform branch but not yet
> pulled into the linus tree.]

Truly useful. The only value of that information is that the maintainer
has to manualy remove it because it's irrelevant for the final commit
message of the change which ends up on top of that commit in that
branch. You can stick this into the section below '---' if you think
it's helpful, but then maintainers and tools can just ignore it.

TBH, it's not helpful at all because tooling already tells us that
41e2da9b5e67 is not in Linus tree and only queued in tip x86/platform.

Commit SHAs are supposed to be unique for a reason...

> Signed-off-by: Mike Travis 
> Reviewed-by: Steve Wahl 
> Reviewed-by: Dimitri Sivanich 

The value of these reviewed-by tags is close to zero because they are
just documenting that the change is ok at the @hpe universe level...

> ---
>  arch/x86/kernel/apic/x2apic_uv_x.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c 
> b/arch/x86/kernel/apic/x2apic_uv_x.c
> index 2e99605f9a05..68ef9abc91f7 100644
> --- a/arch/x86/kernel/apic/x2apic_uv_x.c
> +++ b/arch/x86/kernel/apic/x2apic_uv_x.c
> @@ -413,9 +413,8 @@ static int __init uv_set_system_type(char *_oem_id, char 
> *_oem_table_id)
>   else
>   uv_hubless_system |= 0x8;
>  
> - /* Copy OEM Table ID and set APIC Mode */
> + /* Copy OEM Table ID */
>   uv_stringify(sizeof(oem_table_id), oem_table_id, _oem_table_id);
> - early_set_apic_mode();

So the patch itself tells me more about what's going on than the
changelog:

  Setting up the APIC mode at this place is wrong.

But it does not tell me WHY. Neither does the changelog because of
useless word salad...

If you can't come up with something sensible anytime soon before the
merge window opens then I'm simply going to revert 41e2da9b5e67 and you
can try again for the next cycle.

Thanks,

tglx



Re: [PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-17 Thread Thomas Gleixner
On Sat, Apr 17 2021 at 18:24, Thomas Gleixner wrote:
> On Fri, Apr 16 2021 at 13:13, Peter Xu wrote:
>> On Fri, Apr 16, 2021 at 01:00:23PM -0300, Marcelo Tosatti wrote:
>>>  
>>> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
>>> +(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
>>> +(1U << HRTIMER_BASE_TAI) | \
>>> +(1U << HRTIMER_BASE_TAI_SOFT))
>>> +
>>> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
>>> +{
>>> +   if (cpu_base->softirq_activated)
>>> +   return true;
>>
>> A pure question on whether this check is needed...
>>
>> Here even if softirq_activated==1 (as softirq is going to happen), as long as
>> (cpu_base->active_bases & CLOCK_SET_BASES)==0, shouldn't it already mean that
>> "yes indeed clock was set, but no need to kick this cpu as no relevant 
>> timer"?
>> As that question seems to be orthogonal to whether a softirq is going to
>> trigger on that cpu.
>
> That's correct and it's not any different from firing the IPI because in
> both cases the update happens with the base lock of the CPU in question
> held. And if there are no active timers in any of the affected bases,
> then there is no need to reevaluate the next expiry because the offset
> update does not affect any armed timers. It just makes sure that the
> next enqueue of a timer on such a base will see the correct offset.
>
> I'll just zap it.

But the whole thing is still wrong in two aspects:

1) BOOTTIME can be one of the affected clocks when sleep time
   (suspended time) is injected because that uses the same mechanism.

   Sorry for missing that earlier when I asked to remove it, but
   that's trivial to fix by adding the BOOTTIME base back.

2) What's worse is that on resume this might break because that
   mechanism is also used to enforce the reprogramming of the clock
   event devices and there we cannot be selective on clock bases.

   I need to dig deeper into that because suspend/resume has changed
   a lot over time, so this might be just a historical leftover. But
   without proper analysis we might end up with subtle and hard to
   debug wreckage.

Thanks,

tglx


Re: [PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-17 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 13:13, Peter Xu wrote:
> On Fri, Apr 16, 2021 at 01:00:23PM -0300, Marcelo Tosatti wrote:
>>  
>> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |\
>> + (1U << HRTIMER_BASE_REALTIME_SOFT) |   \
>> + (1U << HRTIMER_BASE_TAI) | \
>> + (1U << HRTIMER_BASE_TAI_SOFT))
>> +
>> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
>> +{
>> +if (cpu_base->softirq_activated)
>> +return true;
>
> A pure question on whether this check is needed...
>
> Here even if softirq_activated==1 (as softirq is going to happen), as long as
> (cpu_base->active_bases & CLOCK_SET_BASES)==0, shouldn't it already mean that
> "yes indeed clock was set, but no need to kick this cpu as no relevant timer"?
> As that question seems to be orthogonal to whether a softirq is going to
> trigger on that cpu.

That's correct and it's not any different from firing the IPI because in
both cases the update happens with the base lock of the CPU in question
held. And if there are no active timers in any of the affected bases,
then there is no need to reevaluate the next expiry because the offset
update does not affect any armed timers. It just makes sure that the
next enqueue of a timer on such a base will see the correct offset.

I'll just zap it.

Thanks,

tglx



Re: [PATCH v8 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-17 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 21:36, Paul E. McKenney wrote:

Bah, hit send too quick.

> + cpumask_clear(&cpus_ahead);
> + cpumask_clear(&cpus_behind);
> + preempt_disable();

Daft. 

> + testcpu = smp_processor_id();
> + pr_warn("Checking clocksource %s synchronization from CPU %d.\n", 
> cs->name, testcpu);
> + for_each_online_cpu(cpu) {
> + if (cpu == testcpu)
> + continue;
> + csnow_begin = cs->read(cs);
> + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 
> 1);
> + csnow_end = cs->read(cs);

As this must run with interrupts enabled, that's a pretty rough
approximation like measuring wind speed with a wet thumb.

Wouldn't it be smarter to let the remote CPU do the watchdog dance and
take that result? i.e. split out more of the watchdog code so that you
can get the nanoseconds delta on that remote CPU to the watchdog.
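
For illustration only, a rough sketch of that idea (the names below are
invented, not from the patch): let the remote CPU sample the clocksource
under test and the watchdog back to back, so the caller compares deltas
instead of sandwiching the IPI between two local reads:

struct verify_sample {
        struct clocksource      *cs;
        u64                     csnow;
        u64                     wdnow;
};

static void clocksource_verify_sample_remote(void *arg)
{
        struct verify_sample *s = arg;

        /* Both reads happen on the remote CPU, close together */
        s->csnow = s->cs->read(s->cs);
        s->wdnow = watchdog->read(watchdog);
}

        /* Caller side, per CPU to be checked */
        struct verify_sample s = { .cs = cs };

        smp_call_function_single(cpu, clocksource_verify_sample_remote, &s, 1);
        /* Convert s.csnow/s.wdnow to ns and compare against the local baseline */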

> + delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
> + if (delta < 0)
> + cpumask_set_cpu(cpu, &cpus_behind);
> + delta = (csnow_end - csnow_mid) & cs->mask;
> + if (delta < 0)
> + cpumask_set_cpu(cpu, &cpus_ahead);
> + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
> + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);

> + if (firsttime || cs_nsec > cs_nsec_max)
> + cs_nsec_max = cs_nsec;
> + if (firsttime || cs_nsec < cs_nsec_min)
> + cs_nsec_min = cs_nsec;
> + firsttime = 0;

  int64_t cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;

and then the firsttime muck is not needed at all.
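
For illustration, the comparison in the loop then simply becomes (sketch):

        int64_t cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
        ...
        if (cs_nsec > cs_nsec_max)
                cs_nsec_max = cs_nsec;
        if (cs_nsec < cs_nsec_min)
                cs_nsec_min = cs_nsec;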

Thanks,

tglx



Re: [PATCH v8 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-17 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 21:36, Paul E. McKenney wrote:
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1fc0962c89c0..97eeaf164296 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -169,7 +169,7 @@ struct clocksource kvm_clock = {
>   .read   = kvm_clock_get_cycles,
>   .rating = 400,
>   .mask   = CLOCKSOURCE_MASK(64),
> - .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> + .flags  = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU,

kvm_clock is not marked with CLOCK_SOURCE_MUST_VERIFY, so what's the
point of adding this here? It's not subject to be monitored by the
watchdog muck.

> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index f70dffc2771f..56289170753c 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = {
>   .mask   = CLOCKSOURCE_MASK(64),
>   .flags  = CLOCK_SOURCE_IS_CONTINUOUS |
> CLOCK_SOURCE_VALID_FOR_HRES |
> -   CLOCK_SOURCE_MUST_VERIFY,
> +   CLOCK_SOURCE_MUST_VERIFY |
> +   CLOCK_SOURCE_VERIFY_PERCPU,

While this one is part of the horror show.

> +static u64 csnow_mid;
> +static cpumask_t cpus_ahead;
> +static cpumask_t cpus_behind;
> +
> +static void clocksource_verify_one_cpu(void *csin)
> +{
> + struct clocksource *cs = (struct clocksource *)csin;
> +
> + csnow_mid = cs->read(cs);
> +}
> +
> +static void clocksource_verify_percpu(struct clocksource *cs)
> +{
> + int64_t cs_nsec, cs_nsec_max, cs_nsec_min;
> + u64 csnow_begin, csnow_end;
> + bool firsttime = 1;
> + int testcpu;
> + s64 delta;
> + int cpu;

int testcpu, cpu; :)



Re: [PATCH v8 clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-04-17 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 21:35, Paul E. McKenney wrote:
>  #define WATCHDOG_INTERVAL (HZ >> 1)
>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)

That's ~15ms which is a tad large I'd say...
  
>  static void clocksource_watchdog_work(struct work_struct *work)
>  {
> @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void)
>  static void clocksource_watchdog(struct timer_list *unused)
>  {
>   struct clocksource *cs;
> - u64 csnow, wdnow, cslast, wdlast, delta;
> - int64_t wd_nsec, cs_nsec;
> + u64 csnow, wdnow, wdagain, cslast, wdlast, delta;
> + int64_t wd_nsec, wdagain_delta, wderr_nsec = 0, cs_nsec;
>   int next_cpu, reset_pending;
> + int nretries;
>  
>   spin_lock(&watchdog_lock);
>   if (!watchdog_running)
> @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>   reset_pending = atomic_read(&watchdog_reset_pending);
>  
>   list_for_each_entry(cs, &watchdog_list, wd_list) {
> + nretries = 0;
>  
>   /* Clocksource already marked unstable? */
>   if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
> @@ -232,11 +235,24 @@ static void clocksource_watchdog(struct timer_list 
> *unused)
>   continue;
>   }
>  
> +retry:
>   local_irq_disable();
> - csnow = cs->read(cs);
> - clocksource_watchdog_inject_delay();
>   wdnow = watchdog->read(watchdog);
> + clocksource_watchdog_inject_delay();
> + csnow = cs->read(cs);
> + wdagain = watchdog->read(watchdog);
>   local_irq_enable();
> + delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
> + wdagain_delta = clocksource_cyc2ns(delta, watchdog->mult, 
> watchdog->shift);
> + if (wdagain_delta > WATCHDOG_MAX_SKEW) {
> + wderr_nsec = wdagain_delta;
> + if (nretries++ < max_read_retries)
> + goto retry;
> + }
> + if (nretries) {
> + pr_warn("timekeeping watchdog on CPU%d: %s read-back 
> delay of %lldns, attempt %d\n",
> + smp_processor_id(), watchdog->name, wderr_nsec, 
> nretries);
> + }
>  
>   /* Clocksource initialized ? */
>   if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||

This can nicely be split out into a read function which avoids brain
overload when reading. Something like the uncompiled below.

I so wish we could just delete all of this horror instead of making it
more horrible.

Thanks,

tglx
---
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -124,6 +124,12 @@ static void __clocksource_change_rating(
 #define WATCHDOG_INTERVAL (HZ >> 1)
 #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
 
+/*
+ * The maximum delay between two consecutive readouts of the watchdog
+ * clocksource to detect SMI, NMI, vCPU preemption.
+ */
+#define WATCHDOG_MAX_DELAY (100 * NSEC_PER_USEC)
+
 static void clocksource_watchdog_work(struct work_struct *work)
 {
/*
@@ -184,12 +190,37 @@ void clocksource_mark_unstable(struct cl
        spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
+static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
+{
+   unsigned int nretries;
+   u64 wd_end, wd_delta;
+   int64_t wd_delay;
+
+   for (nretries = 0; nretries < max_read_retries; nretries++) {
+   local_irq_disable();
+   *wdnow = watchdog->read(watchdog);
+   clocksource_watchdog_inject_delay();
+   *csnow = cs->read(cs);
+   wd_end = watchdog->read(watchdog);
+   local_irq_enable();
+
+   wd_delta = clocksource_delta(wd_end, *wdnow, watchdog->mask);
+   wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult, watchdog->shift);
+   if (wd_delay < WATCHDOG_MAX_DELAY)
+   return true;
+   }
+
+   pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, %d attempts\n",
+   smp_processor_id(), watchdog->name, wd_delay, nretries);
+   return false;
+}
+
 static void clocksource_watchdog(struct timer_list *unused)
 {
-   struct clocksource *cs;
u64 csnow, wdnow, cslast, wdlast, delta;
-   int64_t wd_nsec, cs_nsec;
int next_cpu, reset_pending;
+   int64_t wd_nsec, cs_nsec;
+   struct clocksource *cs;
 
        spin_lock(&watchdog_lock);
if (!watchdog_running)
@@ -206,10 +237,14 @@ static void clocksource_watchdog(struct
continue;
}
 
-   local_irq_disable();
-   csnow = cs->read(cs);
-   wdnow = watchdog->read(watchdog);
-   local_irq_enable();
+   if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
+   /*
+* No point to continue if the watchdog 

Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-17 Thread Thomas Gleixner
On Sat, Apr 17 2021 at 01:02, Thomas Gleixner wrote:
> On Fri, Apr 16 2021 at 15:37, Kees Cook wrote:
>
>> On Fri, Apr 16, 2021 at 03:20:17PM -0700, Andy Lutomirski wrote:
>>> But obviously there is code that needs real function pointers.  How
>>> about making this a first-class feature, or at least hacking around it
>>> more cleanly.  For example, what does this do:
>>> 
>>> char entry_whatever[];
>>> wrmsrl(..., (unsigned long)entry_whatever);
>>
>> This is just casting. It'll still resolve to the jump table entry.
>>
>>> or, alternatively,
>>> 
>>> extern void func() __attribute__((nocfi));
>>
>> __nocfi says func() should not perform checking of correct jump table
>> membership for indirect calls.
>>
>> But we don't want a global marking for a function to be ignored by CFI;
>> we don't want functions to escape CFI -- we want specific _users_ to
>> either not check CFI for indirect calls (__nocfi) or we want specific
>> passed addresses to avoid going through the jump table
>> (function_nocfi()).
>
> And that's why you mark entire files to be exempt without any rationale
> why it makes sense.

The reason why you have to do that is because function_nocfi() is not
provided by the compiler.

So you need to hack around that with that macro which fails to work
e.g. for the idt data arrays.

Is there any fundamental reason why the compiler does not provide that
in a form which allows to use it everywhere?

It's not too much asked from a tool which provides new functionality to
provide it in a way which is usable.

Thanks,

tglx


Re: [PATCH 04/15] static_call: Use global functions for the self-test

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 23:37, Thomas Gleixner wrote:
> On Fri, Apr 16 2021 at 13:38, Sami Tolvanen wrote:
>>  #ifdef CONFIG_STATIC_CALL_SELFTEST
>>  
>> -static int func_a(int x)
>> +int func_a(int x)
>>  {
>>  return x+1;
>>  }
>>  
>> -static int func_b(int x)
>> +int func_b(int x)
>>  {
>>  return x+2;
>>  }
>
> Did you even compile that?
>
> Global functions without a prototype are generating warnings, but we can
> ignore them just because of sekurity, right?
>
> Aside of that polluting the global namespace with func_a/b just to work
> around a tool shortcoming is beyond hilarious.
>
> Fix the tool not the perfectly correct code.

That said, I wouldn't mind a  __dont_dare_to_rename annotation to help
the compiler, but anything else is just wrong.

Thanks,

tglx


Re: [PATCH 06/15] x86: Avoid CFI jump tables in IDT and entry points

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 16:56, Kees Cook wrote:
> On Sat, Apr 17, 2021 at 12:26:56AM +0200, Thomas Gleixner wrote:
>> Where is the analysis why excluding 
>> 
>> > +CFLAGS_REMOVE_idt.o   := $(CC_FLAGS_CFI)
>> > +CFLAGS_REMOVE_paravirt.o  := $(CC_FLAGS_CFI)
>> 
>> all of idt.c and paravirt.c is correct and how that is going to be
>> correct in the future?
>> 
>> These files are excluded from CFI, so I can add whatever I want to them
>> and circumvent the purpose of CFI, right?
>> 
>> Brilliant plan that. But I know, sekurity ...
>
> *sigh* we're on the same side. :P I will choose to understand your
> comments here as:
>
> "How will enforcement of CFI policy be correctly maintained here if
> the justification for disabling it for whole compilation units is not
> clearly understandable by other developers not familiar with the nuances
> of its application?"

Plus, if there is a justification for disabling it for a whole
compilation unit:

 Where is the tooling which makes sure that this compilation unit is not
 later on filled with code which should be subject to CFI?

Thanks,

tglx




Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 15:37, Kees Cook wrote:

> On Fri, Apr 16, 2021 at 03:20:17PM -0700, Andy Lutomirski wrote:
>> But obviously there is code that needs real function pointers.  How
>> about making this a first-class feature, or at least hacking around it
>> more cleanly.  For example, what does this do:
>> 
>> char entry_whatever[];
>> wrmsrl(..., (unsigned long)entry_whatever);
>
> This is just casting. It'll still resolve to the jump table entry.
>
>> or, alternatively,
>> 
>> extern void func() __attribute__((nocfi));
>
> __nocfi says func() should not perform checking of correct jump table
> membership for indirect calls.
>
> But we don't want a global marking for a function to be ignored by CFI;
> we don't want functions to escape CFI -- we want specific _users_ to
> either not check CFI for indirect calls (__nocfi) or we want specific
> passed addresses to avoid going through the jump table
> (function_nocfi()).

And that's why you mark entire files to be exempt without any rationale
why it makes sense.

Thanks,

tglx


Re: [PATCH 06/15] x86: Avoid CFI jump tables in IDT and entry points

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 13:38, Sami Tolvanen wrote:
> With CONFIG_CFI_CLANG, the compiler replaces function addresses in C
> code with jump table addresses.

Fine.

> To avoid referring to jump tables in entry code with PTI,

What has this to do with PTI?

> disable CFI for IDT and paravirt code, and use function_nocfi() to
> prevent jump table addresses from being added to the IDT or system
> call entry points.

How does this changelog make sense for anyone not familiar with the
matter at hand?

Where is the analysis why excluding 

> +CFLAGS_REMOVE_idt.o  := $(CC_FLAGS_CFI)
> +CFLAGS_REMOVE_paravirt.o := $(CC_FLAGS_CFI)

all of idt.c and paravirt.c is correct and how that is going to be
correct in the future?

These files are excluded from CFI, so I can add whatever I want to them
and circumvent the purpose of CFI, right?

Brilliant plan that. But I know, sekurity ...

Thanks,

tglx


Re: [PATCH 05/15] x86: Implement function_nocfi

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 14:49, Sami Tolvanen wrote:
> On Fri, Apr 16, 2021 at 2:18 PM Borislav Petkov  wrote:
>> In file included from ./include/linux/ftrace.h:22:0,
>>  from ./include/linux/init_task.h:9,
>>  from init/init_task.c:2:
>> ./include/linux/ftrace.h: In function ‘ftrace_init_nop’:
>> ./arch/x86/include/asm/ftrace.h:9:40: error: implicit declaration of 
>> function ‘function_nocfi’ [-Werror=implicit-function-declaration]
>
> This is defined in linux-next, but I do see another issue, which I'll
> fix in v2. Note that CFI_CLANG itself cannot be selected on 32-bit
> x86.

Sure and because of that it's overrated to make sure that it does not
break the build. I know, sekurity ...

But aside of that when looking at the rest of the series, then I really
have to ask whether the only way to address this is to make a large
amount of code unreadable like this:

-   wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+   wrmsrl(MSR_LSTAR, (unsigned long)function_nocfi(entry_SYSCALL_64));

plus a gazillion of similar changes.

This is unreadable garbage.

Thanks,

tglx






Re: [PATCH v2] X86: Makefile: Replace -pg with CC_FLAGS_FTRACE

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 13:39, zhaoxiao wrote:
> In preparation for x86 supporting ftrace built on other compiler
> options, let's have the x86 Makefiles remove the $(CC_FLAGS_FTRACE)
> flags, whatever these may be, rather than assuming '-pg'.

s/let's have the/make the/

> There should be no functional change as a result of this patch.

Should be? Either you know that there is no functional change or not. If
you're unsure, why are you posting it at all?

So this wants to be:

   No functional change.

Other than that and the fact that this is incomplete as it fails to
catch _all_ instances of -pg in arch/x86 and also leaves stale comments
around, I'm ok with the intent.

# git grep '\-pg' arch/x86/
arch/x86/entry/vdso/Makefile:# vDSO code runs in userspace and -pg doesn't help 
with profiling anyway.
arch/x86/kernel/ftrace_64.S: * gcc -pg option adds a call to 'mcount' in most 
functions.
arch/x86/purgatory/Makefile:# Default KBUILD_CFLAGS can have -pg option set 
when FTRACE is enabled. That
arch/x86/um/vdso/Makefile:# vDSO code runs in userspace and -pg doesn't help 
with profiling anyway.
arch/x86/um/vdso/Makefile:CFLAGS_REMOVE_vdso-note.o = -pg -fprofile-arcs 
-ftest-coverage
arch/x86/um/vdso/Makefile:CFLAGS_REMOVE_um_vdso.o = -pg -fprofile-arcs 
-ftest-coverage

grep is not rocket science ...

Thanks,

tglx


Re: [patch] x86/crash: fix crash_setup_memmap_entries() out-of-bounds access

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 17:13, Mike Galbraith wrote:
> On Fri, 2021-04-16 at 16:44 +0200, Borislav Petkov wrote:
>> On Fri, Apr 16, 2021 at 03:16:07PM +0200, Mike Galbraith wrote:
>> > On Fri, 2021-04-16 at 14:16 +0200, Borislav Petkov wrote:
>> > >
>> > > Please be more verbose and structure your commit message like this:
>> >
>> > Hrmph, I thought it was too verbose for dinky one-liner if anything.
>>
>> Please look at how other commit messages in tip have free text - not
>> only tools output.
>>
>> Also, this looks like a fix for some previous commit. Please dig out
>> which commit introduced the issue and put its commit ID in a Fixes: tag
>> above your S-o-B.
>>
>> If you don't have time or desire to do that, you can say so and I'll do
>> it myself when I get a chance.
>
> Ok, bin it for the nonce.

Can all of you involved stop this sandpit fight and do something useful
to fix that obvious bug already?

OMG




Re: [PATCH 04/15] static_call: Use global functions for the self-test

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 13:38, Sami Tolvanen wrote:

> With CONFIG_CFI_CLANG, the compiler renames static functions. This
> breaks static_call users using static functions, because the current
> implementation assumes functions have stable names by hardcoding them
> in inline assembly. Make the self-test functions global to prevent the
> compiler from renaming them.

Sorry, no.

> Signed-off-by: Sami Tolvanen 
> ---
>  kernel/static_call.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/static_call.c b/kernel/static_call.c
> index 723fcc9d20db..d09f500c2d2a 100644
> --- a/kernel/static_call.c
> +++ b/kernel/static_call.c
> @@ -503,12 +503,12 @@ long __static_call_return0(void)
>  
>  #ifdef CONFIG_STATIC_CALL_SELFTEST
>  
> -static int func_a(int x)
> +int func_a(int x)
>  {
>   return x+1;
>  }
>  
> -static int func_b(int x)
> +int func_b(int x)
>  {
>   return x+2;
>  }

Did you even compile that?

Global functions without a prototype are generating warnings, but we can
ignore them just because of sekurity, right?

Aside of that polluting the global namespace with func_a/b just to work
around a tool shortcoming is beyond hilarious.

Fix the tool not the perfectly correct code.

Thanks,

tglx


Re: [PATCH v8 clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-04-16 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 21:35, Paul E. McKenney wrote:
>  #define WATCHDOG_INTERVAL (HZ >> 1)
>  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)

Didn't we discuss that the threshold is too big ?



Re: [PATCH v8 clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog

2021-04-16 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 21:35, Paul E. McKenney wrote:
>  
> +static int inject_delay_freq;
> +module_param(inject_delay_freq, int, 0644);
> +static int inject_delay_run = 1;
> +module_param(inject_delay_run, int, 0644);

int? Can't we just make them 'unsigned int'? Negative values are not
that useful.

> +static int max_read_retries = 3;
> +module_param(max_read_retries, int, 0644);

max_read_retries is unused here. Should be in the patch which actually
uses it.

> +static void clocksource_watchdog_inject_delay(void)
> +{
> + int i;
> + static int injectfail = -1;
> +
> + if (inject_delay_freq <= 0 || inject_delay_run <= 0)
> + return;
> + if (injectfail < 0 || injectfail > INT_MAX / 2)
> + injectfail = inject_delay_run;
> + if (!(++injectfail / inject_delay_run % inject_delay_freq)) {

Operator precedence based cleverness is really easy to parse - NOT!

> + pr_warn("%s(): Injecting delay.\n", __func__);
> + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++)
> + udelay(1000);
> + pr_warn("%s(): Done injecting delay.\n", __func__);
> + }
> +
> + WARN_ON_ONCE(injectfail < 0);
> +}

Brain melt stage reached by now.

        static unsigned int invocations, injections;

        if (!inject_delay_period || !inject_delay_repeat)
                return;

        if (!(invocations % inject_delay_period)) {
                mdelay(2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC);
                if (++injections < inject_delay_repeat)
                        return;
                injections = 0;
        }

        invocations++;
}

Hmm?

Thanks,

tglx


Re: [PATCH] kernel:irq:manage: request threaded irq with a specified priority

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 12:57, chensong wrote:
> On 2021/4/13 下午4:39, Thomas Gleixner wrote:
>> It breaks because the system designer failed to assign proper priorities
>> to the irq threads int_a, int_b and to the user space process task_a.
>
> yes, it's designers' responsibility to assign proper priorities, but 
> kernel is also obliged to provide interfaces for those designers.

The interface exists. sched_setscheduler(2)
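
For illustration, the same thing chrt does, done from C (the priority value
and the way the irq thread PID is obtained are made up for the example):

#include <sched.h>
#include <stdio.h>

/* Give the irq thread identified by its PID a SCHED_FIFO priority */
static int set_irq_thread_prio(pid_t irq_thread_pid, int prio)
{
        struct sched_param param = { .sched_priority = prio };

        if (sched_setscheduler(irq_thread_pid, SCHED_FIFO, &param)) {
                perror("sched_setscheduler");
                return -1;
        }
        return 0;
}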

> chrt can help designers in this case, however, the truth is a lot of 
> customers are not familiar with it.

The truth is that real-time systems need to be carefully designed and
parametrized. And that's only possible when _all_ of the system
components and their constraints are known. Trying to solve it at the
device driver level of a single device is impossible and a guarantee for
fail.

If the customer does not know how to do it, then I really have to ask
why he's trying to use a real-time system at all. There is no real-time
system which magically tunes itself by pulling the overall system
constraints out of thin air.
 
> what's more, chrt can also apply to userspace rt task, but userspace
> also has sched_setscheduler to assign proper priority inside code like
> cyclictest, why can't driver writers have another choice?

There is a very simple reason: The driver writer cannot know about the
requirements of the complete system which is composed of kernel, drivers
and user space applications, unless the driver writer is fully aware of
the overall system design and constraints.

How is that supposed to work on a general purpose kernel which is
utilized for a gazillion of different use cases which all have different
expectations?

It simply cannot work because default A will only work for usecase A and
be completely wrong for all others.

> Further, what if the irq handler thread has to run at the expected priority 
> at the very beginning? This patch helps.

There is no such thing as the expected priority of an interrupt thread
which can be applied upfront.

There are ~5400 instances of request*irq() in the kernel source and
there is no way to make priority decisions for them which work for every
RT system out there.

The kernel sets a default and the system designer, admin, user has to
take care of tuning it to match the expectations and constraints of his
particular application scenario.

The kernel provides an userspace interface to do that. That interface
might be a bit awkward to use, but there are tools out there which help
with that, and if at all we can think about providing a better and
easier to use interface for this.

Trying to solve that at the kernel level is putting the cart before the
horse.

Thanks,

tglx


Re: [PATCH v3] hrtimer: avoid retrigger_next_event IPI

2021-04-15 Thread Thomas Gleixner
On Thu, Apr 15 2021 at 12:39, Marcelo Tosatti wrote:
> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> +{
> + unsigned int active = 0;
> +
> + if (cpu_base->softirq_activated)
> + return true;
> +
> + active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
> + active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);

It's not entirely clear to me what that 'active' variable business above
is doing and why it's needed at all. But I might be missing something.

> + return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
> +}


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-15 Thread Thomas Gleixner
On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> On Wed, Apr 14, 2021 at 2:14 AM Greg KH  wrote:
>> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
>> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
>> attached below.
>>
>> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
>> HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
>> not be happening, we can easily fix it but I guess no one has actually
>> had fast USB devices that noticed this until now :)
>
> AIUI the purpose of this timer is to support packet aggregation. USB
> transfers are relatively expensive/high latency. 6 Gbps is 500k
> 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> many packets as will fit into 16k (usually, 10 1500-byte packets), and
> only initiates a USB transfer when those packets have arrived. That
> ends up doing only one transfer every 20us. It sets a 300us timer to
> ensure that if the 10 packets haven't arrived, it still sends out
> whatever it has when the timer fires. The timer is set/updated on
> every packet buffered by ncm.
>
> Is this somehow much more expensive in 5.10.24 than it was before?
> Even if this driver is somehow "holding it wrong", might there not be
> other workloads that have a similar problem? What about regressions on
> those workloads?

Let's put the question of whether this hrtimer usage is sensible or not
aside for now.

I stared at the change for a while and did some experiments to recreate
the problem, but that didn't get me anywhere.

Could you please do the following?

Enable tracing and enable the following tracepoints:

timers/hrtimer_cancel
timers/hrtimer_start
timers/hrtimer_expire_entry
irq/softirq_raise
irq/softirq_enter
irq/softirq_exit

and function tracing filtered on ncm_wrap_ntb() and
package_for_tx() only (to reduce the noise).

Run the test on a kernels with and without that commit and collect trace
data for both.

That should give me a pretty clear picture what's going on.

Thanks,

tglx


[tip: sched/core] signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()

2021-04-15 Thread tip-bot2 for Thomas Gleixner
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 69995ebbb9d3717306a165db88a1292b63f77a37
Gitweb:
https://git.kernel.org/tip/69995ebbb9d3717306a165db88a1292b63f77a37
Author:Thomas Gleixner 
AuthorDate:Mon, 22 Mar 2021 10:19:42 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:08 +02:00

signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()

There is no point in having the conditional at the callsite.

Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
initialize sigqueue::flags.

No functional change.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210322092258.898677...@linutronix.de
---
 kernel/signal.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index ba4d1ef..568a2e2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -410,7 +410,8 @@ void task_join_group_stop(struct task_struct *task)
  *   appropriate lock must be held to stop the target task from exiting
  */
 static struct sigqueue *
-__sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int 
override_rlimit)
+__sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
+int override_rlimit, const unsigned int sigqueue_flags)
 {
struct sigqueue *q = NULL;
struct user_struct *user;
@@ -432,7 +433,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t 
flags, int override_rlimi
rcu_read_unlock();
 
if (override_rlimit || likely(sigpending <= task_rlimit(t, 
RLIMIT_SIGPENDING))) {
-   q = kmem_cache_alloc(sigqueue_cachep, flags);
+   q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
} else {
print_dropped_signal(sig);
}
@@ -442,7 +443,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t 
flags, int override_rlimi
free_uid(user);
} else {
   INIT_LIST_HEAD(&q->list);
-   q->flags = 0;
+   q->flags = sigqueue_flags;
q->user = user;
}
 
@@ -1113,7 +1114,8 @@ static int __send_signal(int sig, struct kernel_siginfo 
*info, struct task_struc
else
override_rlimit = 0;
 
-   q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit);
+   q = __sigqueue_alloc(sig, t, GFP_ATOMIC, override_rlimit, 0);
+
if (q) {
   list_add_tail(&q->list, &pending->list);
switch ((unsigned long) info) {
@@ -1807,12 +1809,7 @@ EXPORT_SYMBOL(kill_pid);
  */
 struct sigqueue *sigqueue_alloc(void)
 {
-   struct sigqueue *q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
-
-   if (q)
-   q->flags |= SIGQUEUE_PREALLOC;
-
-   return q;
+   return __sigqueue_alloc(-1, current, GFP_KERNEL, 0, SIGQUEUE_PREALLOC);
 }
 
 void sigqueue_free(struct sigqueue *q)


[tip: sched/core] signal: Allow tasks to cache one sigqueue struct

2021-04-15 Thread tip-bot2 for Thomas Gleixner
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 4bad58ebc8bc4f20d89cff95417c9b4674769709
Gitweb:
https://git.kernel.org/tip/4bad58ebc8bc4f20d89cff95417c9b4674769709
Author:Thomas Gleixner 
AuthorDate:Tue, 23 Mar 2021 22:05:39 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:08 +02:00

signal: Allow tasks to cache one sigqueue struct

The idea for this originates from the real time tree to make signal
delivery for realtime applications more efficient. In quite some of these
application scenarios a control task signals workers to start their
computations. There is usually only one signal per worker in flight.  This
works nicely as long as the kmem cache allocations do not hit the slow path
and cause latencies.

To cure this an optimistic caching was introduced (limited to RT tasks)
which allows a task to cache a single sigqueue in a pointer in task_struct
instead of handing it back to the kmem cache after consuming a signal. When
the next signal is sent to the task then the cached sigqueue is used
instead of allocating a new one. This solved the problem for this set of
application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task
always goes to the kmem cache allocator. The cached sigqueue stays around until the
task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether
this would be useful in general and should not be limited to realtime
tasks: https://lore.kernel.org/r/m11rcu7nbr@fess.ebiederm.org

One concern leading to the original limitation was to avoid a large amount
of pointlessly cached sigqueues in alive tasks. The other concern was
vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic.
After gathering some statistics it turned out that after boot of a regular
distro install there are less than 10 sigqueues cached in ~1500 tasks.

In case of a 'mass fork and fire signal to child' scenario the extra 80
bytes of memory per task are well in the noise of the overall memory
consumption of the fork bomb.

If this should be limited then this would need an extra counter in struct
user, more atomic instructions and a separate rlimit. Yet another tunable
which is mostly unused.

The caching is actually used. After boot and a full kernel compile on a
64CPU machine with make -j128 the number of 'allocations' looks like this:

  From slab:   23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%.

A typical pattern there is:

<...>-58490 __sigqueue_alloc:  for 58488 from slab 8881132df460
<...>-58488 __sigqueue_free:   cache 8881132df460
<...>-58488 __sigqueue_alloc:  for 1149 from cache 8881103dc550
  bash-1149 exit_task_sighand: free 8881132df460
  bash-1149 __sigqueue_free:   cache 8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue
from bash's task cache to signal exit and bash sticks it back into it's own
cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the
benefit for latency sensitive applications is clear. While kmem caches are
usually just serving from the fast path the slab merging (default) can
depending on the usage pattern of the merged slabs cause occasional slow
path allocations.

The time spared per cached entry is a few micro seconds per signal which is
not relevant for e.g. a kernel build, but for signal heavy workloads it's
measurable.

As there is no real downside of this caching mechanism making it
unconditionally available is preferred over more conditional code or new
magic tunables.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Oleg Nesterov 
Link: https://lkml.kernel.org/r/87sg4lbmxo@nanos.tec.linutronix.de
---
 include/linux/sched.h  |  1 +-
 include/linux/signal.h |  1 +-
 kernel/exit.c  |  1 +-
 kernel/fork.c  |  1 +-
 kernel/signal.c| 44 +++--
 5 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05572e2..f5ca798 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
/* Signal handlers: */
struct signal_struct*signal;
struct sighand_struct __rcu *sighand;
+   struct sigqueue *sigqueue_cache;
sigset_tblocked;
sigset_treal_blocked;
/* Restored if set_restore_sigmask() was used: */
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 205526c..c3cbea2 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -265,6 +265,7 @@ static inline void init_sigpending(stru

Re: [tip: core/rcu] softirq: Don't try waking ksoftirqd before it has been spawned

2021-04-14 Thread Thomas Gleixner
Paul,

On Wed, Apr 14 2021 at 11:11, Paul E. McKenney wrote:
> On Wed, Apr 14, 2021 at 10:57:57AM +0200, Uladzislau Rezki wrote:
>> On Wed, Apr 14, 2021 at 09:13:22AM +0200, Sebastian Andrzej Siewior wrote:
>> At the same time Paul made another patch:
>> 
>> softirq: Don't try waking ksoftirqd before it has been spawned
>> 
>> it allows us to keep RCU-tasks initialization before even
>> early_initcall() where it is now and let our rcu-self-test
>> to be completed without any hanging.
>
> In short, this window of time in which it is not possible to reliably
> wait on a softirq handler has caused trouble, just as several other
> similar boot-sequence time windows have caused trouble in the past.
> It therefore makes sense to just eliminate this problem, and prevent
> future developers from facing inexplicable silent boot-time hangs.
>
> We can move the spawning of ksoftirqd kthreads earlier, but that
> simply narrows the window.  It does not eliminate the problem.
>
> I can easily believe that this might have -rt consequences that need
> attention.  For your amusement, I will make a few guesses as to what
> these might be:
>
> o Back-of-interrupt softirq handlers degrade real-time response.
>   This should not be a problem this early in boot, and once the
>   ksoftirqd kthreads are spawned, there will never be another
>   back-of-interrupt softirq handler in kernels that have
>   force_irqthreads set, which includes -rt kernels.

Not a problem obviously.

> o That !__this_cpu_read(ksoftirqd) check remains at runtime, even
>   though it always evaluates to false.  I would be surprised if
>   this overhead is measurable at the system level, but if it is,
>   static branches should take care of this.

Agreed.

> o There might be a -rt lockdep check that isn't happy with
>   back-of-interrupt softirq handlers.  But such a lockdep check
>   could be conditioned on __this_cpu_read(ksoftirqd), thus
>   preventing it from firing during that short window at boot time.

It's not like there are only a handful of lockdep invocations which need
to be taken care of. The lockdep checks are mostly inside of lock
operations and if lockdep has recorded back-of-interrupt context once
during boot it will complain about irqs enabled context usage later on
no matter what.

If you can come up with a reasonable implementation of that without
losing valuable lockdep coverage and without creating a major mess in
the code then I'm all ears.

But lockdep is just one of the problems

> o The -rt kernels might be using locks to implement things like
>   local_bh_disable(), in which case back-of-interrupt softirq
>   handlers could result in self-deadlock.  This could be addressed
>   by disabling bh the old way up to the time that the ksoftirqd
>   kthreads are created.  Because these are created while the system
>   is running on a single CPU (right?), a simple flag (or static
>   branch) could be used to switch this behavior into lock-only
>   mode long before the first real-time application can be spawned.

That has absolutely nothing to do with the first real-time application
at all and just looking at the local_bh_disable() part does not cut it
either.

The point is that the fundamental assumption of RT to break the non-rt
semantics of interrupts and soft interrupts in order to rely on always
thread context based serialization is going down the drain with that.

Surely we can come up with yet another pile of horrible hacks to work
around that, which in turn will cause more problems than they solve.

But what for? Just to make it possible to issue rcu_whatever() during
early boot just because?

Give me _one_ serious technical reason why this is absolutely required
and why the accidental misplacement which might happen due to
unawareness, sloppiness or incompetence is reason enough to create a
horrible clusterfail just for purely academic reasons.

None of the usecases at hand did have any reason to depend on the early
boot behaviour of softirqs and unless you can come up with something
really compelling the proper solution is to revert this commit and add
the following instead:

--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -74,7 +74,10 @@ static void wakeup_softirqd(void)
/* Interrupts are disabled: no need to stop preemption */
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-   if (tsk && tsk->state != TASK_RUNNING)
+   if (WARN_ON_ONCE(!tsk))
+   return;
+
+   if (tsk->state != TASK_RUNNING)
wake_up_process(tsk);
 }
 
which will trigger on any halfway useful CI infrastructure which cares
about the !default use cases of the kernel.

Putting a restriction like that onto the early boot process is not the
end of the world and might teach people who think that their oh so
important stuff is not that important at all especially not during the
early boot phase which is fragile enough 

Re: [RFC PATCH 0/7] KVM: Fix tick-based vtime accounting on x86

2021-04-14 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 11:29, Sean Christopherson wrote:
> This is an alternative to Wanpeng's series[*] to fix tick-based accounting
> on x86.  The approach for fixing the bug is identical: defer accounting
> until after tick IRQs are handled.  The difference is purely in how the
> context tracking and vtime code is refactored in order to give KVM the
> hooks it needs to fix the x86 bug.
>
> x86 compile tested only, hence the RFC.  If folks like the direction and
> there are no unsolvable issues, I'll cross-compile, properly test on x86,
> and post an "official" series.

I like the final outcome of this, but we really want a small set of
patches first which actually fix the bug and is easy to backport and
then the larger consolidation on top.

Can you sort that out with Wanpeng please?

Thanks,

tglx


Re: [PATCH v2] hrtimer: avoid retrigger_next_event IPI

2021-04-14 Thread Thomas Gleixner
Marcelo,

On Tue, Apr 13 2021 at 14:04, Marcelo Tosatti wrote:
> Setting the realtime clock triggers an IPI to all CPUs to reprogram
> hrtimers.

s/hrtimers/clock event device/

> However, only realtime and TAI clocks have their offsets updated
> (and therefore potentially require a reprogram).
>
> Check if it only has monotonic active timers, and in that case

boottime != monotonic 

> update the realtime and TAI base offsets remotely, skipping the IPI.

Something like this perhaps:

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Hmm?

> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|  \

Missing space between ) and |

> +  (1U << HRTIMER_BASE_REALTIME_SOFT)|\
> +  (1U << HRTIMER_BASE_TAI)|  \
> +  (1U << HRTIMER_BASE_TAI_SOFT))
> +
> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> +{
> + unsigned int active = 0;
> +
> + if (cpu_base->softirq_activated)
> + return true;
> +
> + active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
> +
> + active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> +
> + if ((active & CLOCK_SET_BASES) == 0)
> + return false;
> +
> + return true;

That's a pretty elaborate way of writing:

   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;

> +}
> +
>  /*
>   * Clock realtime was set
>   *
> @@ -885,9 +907,31 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> bool reprogram)
>  void clock_was_set(void)
>  {
>  #ifdef CONFIG_HIGH_RES_TIMERS
> - /* Retrigger the CPU local events everywhere */
> - on_each_cpu(retrigger_next_event, NULL, 1);
> + cpumask_var_t mask;
> + int cpu;
> +
> + if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> + on_each_cpu(retrigger_next_event, NULL, 1);
> + goto set_timerfd;
> + }
> +
> + /* Avoid interrupting CPUs if possible */
> + preempt_disable();

That preempt disable is only required around the function call, for the
loop cpus_read_lock() is sufficient.
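
A rough sketch of that rearrangement (illustrative only, untested):

        cpus_read_lock();
        for_each_online_cpu(cpu) {
                struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
                unsigned long flags;

                raw_spin_lock_irqsave(&cpu_base->lock, flags);
                if (need_reprogram_timer(cpu_base))
                        cpumask_set_cpu(cpu, mask);
                raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
        }

        /* Preemption only needs to be disabled around the IPI itself */
        preempt_disable();
        smp_call_function_many(mask, retrigger_next_event, NULL, 1);
        preempt_enable();
        cpus_read_unlock();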

> + for_each_online_cpu(cpu) {
> + unsigned long flags;
> + struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> +
> + raw_spin_lock_irqsave(&cpu_base->lock, flags);
> + if (need_reprogram_timer(cpu_base))
> + cpumask_set_cpu(cpu, mask);
> + raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> + }
> +
> + smp_call_function_many(mask, retrigger_next_event, NULL, 1);
> + preempt_enable();
> + free_cpumask_var(mask);
>  #endif
> +set_timerfd:

My brain compiler tells me that with CONFIG_HIGH_RES_TIMERS=n a real
compiler will emit a warning about a defined, but unused label

>   timerfd_clock_was_set();
>  }

Thanks,

tglx


Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock

2021-04-14 Thread Thomas Gleixner
Fenghua,

On Tue, Apr 13 2021 at 23:40, Fenghua Yu wrote:
> On Mon, Apr 12, 2021 at 09:15:08AM +0200, Thomas Gleixner wrote:
>> Aside of that why are you trying to make this throttling in any way
>> accurate? It does not matter at all, really. Limit reached, put it to
>> sleep for some time and be done with it. No point in trying to be clever
>> for no value.
>
> Is it OK to set bld_ratelimit between 1 and 1,000 bus locks/sec for
> bld_ratelimit?
>
> Can I do the throttling like this?
>
>/* Enforce no more than bld_ratelimit bus locks/sec. */
>while (!__ratelimit(&bld_ratelimit))
>msleep(10);
>
> On one machine, if bld_ratelimit=1,000, that's about 5msec for a busy
> bus lock loop, i.e. bus is locked for about 5msec and then the process
> sleeps for 10msec and thus won't generate any bus lock.
> "dd" command running on other cores doesn't have noticeable degradation
> with bld_ratelimit=1,000.

Something like this makes sense. Add some rationale for the upper limit
you finally end up with so sysadmins can make informed decisions.
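
A minimal sketch of that shape (names are illustrative, not from the patch):

        static struct ratelimit_state bld_rs;

        /* Once, after the sysadmin-provided bus locks/sec limit is known */
        ratelimit_state_init(&bld_rs, HZ, bld_ratelimit);

        /* In the #DB bus lock path: sleep until the budget allows another one */
        while (!__ratelimit(&bld_rs))
                msleep(10);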

Thanks,

tglx


Re: bl_list and lockdep

2021-04-13 Thread Thomas Gleixner
Dave,

On Tue, Apr 13 2021 at 19:58, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 01:18:35AM +0200, Thomas Gleixner wrote:
> So for solving the inode cache scalability issue with RT in mind,
> we're left with these choices:
>
> a) increase memory consumption and cacheline misses for everyone by
>adding a spinlock per hash chain so that RT kernels can do their
>substitution magic and make the memory footprint and scalability
>for RT kernels worse
>
> b) convert the inode hash table to something different (rhashtable,
>radix tree, Xarray, etc) that is more scalable and more "RT
>friendly".
>
> c) have RT kernel substitute hlist-bl with hlist_head and a spinlock
>so that it all works correctly on RT kernels and only RT kernels
>take the memory footprint and cacheline miss penalties...
>
> We rejected a) for the dentry hash table, so it is not an appropriate
> soltion for the inode hash table for the same reasons.
>
> There is a lot of downside to b). Firstly there's the time and
> resources needed for experimentation to find an appropriate
> algorithm for both scalability and RT. Then all the insert, removal
> and search facilities will have to be rewritten, along with all the
subtleties like "fake hashing" to allow filesystems to provide their
> own inode caches.  The changes in behaviour and, potentially, API
> semantics will greatly increase the risk of regressions and adverse
> behaviour on both vanilla and RT kernels compared to option a) or
> c).
>
> It is clear that option c) is of minimal risk to vanilla kernels,
> and low risk to RT kernels. It's pretty straight forward to do for
> both configs, and only the RT kernels take the memory footprint
> penalty.
>
> So a technical analysis points to c) being the most reasonable
> resolution of the problem.

I agree with that analysis for technical reasons and I'm not entirely
unfamiliar how to solve hlist_bl conversions on RT either as you might
have guessed.

Having a technical argument to discuss and agree on is far simpler
than going along with "I don't care".

Thanks for taking the time to put a technical rationale on this!

   tglx


Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-13 Thread Thomas Gleixner
Paul,

On Mon, Apr 12 2021 at 16:18, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 10:37:10PM +0200, Thomas Gleixner wrote:
>> On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
>> > On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
>> >> > I will send a new series out later today, Pacific Time.
>> >> 
>> >> Can you do me a favour and send it standalone and not as yet another
>> >> reply to this existing thread maze. A trivial lore link to the previous
>> >> version gives enough context.
>> >
>> > Will do!
>> >
>> > Of course, it turns out that lockdep also doesn't like waited-on
>> > smp_call_function_single() invocations from timer handlers,
>> > so I am currently looking at other options for dealing with that
>> > potential use-after-free.  I am starting to like the looks of "only set
>> > CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
>> > and let KASAN enforce this restriction", but I have not quite given up
>> > on making it more general.
>> 
>> The simplest point is in the thread under the clocksource_mutex which
>> prevents anything from vanishing under your feet.
>
> And lockdep is -much- happier with the setup shown below, so thank
> you again!

But it is too simple now :) ...

> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index f047c6cb056c..34dc38b6b923 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -519,6 +515,13 @@ static int __clocksource_watchdog_kthread(void)
>   unsigned long flags;
>   int select = 0;
>  
> + /* Do any required per-CPU skew verification. */
> + list_for_each_entry(cs, &watchdog_list, wd_list) {
> + if ((cs->flags & (CLOCK_SOURCE_UNSTABLE | 
> CLOCK_SOURCE_VERIFY_PERCPU)) ==
> + (CLOCK_SOURCE_UNSTABLE | CLOCK_SOURCE_VERIFY_PERCPU))
> + clocksource_verify_percpu(cs);
> + }

because that list is _NOT_ protected by the clocksource_mutex as you
noticed yourself already.

But you don't have to walk that list at all because the only interesting
thing is the currently active clocksource, which is about to be changed
in case the watchdog marked it unstable and cannot be changed by any
other code concurrently because clocksource_mutex is held.

So all you need is:

if (curr_clocksource &&
curr_clocksource->flags & CLOCK_SOURCE_UNSTABLE &&
curr_clocksource->flags & CLOCK_SOURCE_VERIFY_PERCPU)
clocksource_verify_percpu_wreckage(curr_clocksource);

Hmm?

Thanks,

tglx




Re: [PATCH] genirq: reduce irqdebug bouncing cachelines

2021-04-13 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 14:16, Cédric Le Goater wrote:
>>> We could test irq_settings_no_debug() directly under handle_nested_irq() 
>>> and handle_irq_event_percpu() to avoid calling note_interrupt(), just 
>>> like we do for noirqdebug.
>> 
>> We can do that, but then we should not just make it:
>> 
>>if (!irqnodebug && !irq_settings_no_debug(desc))
>>  note_interrupt(...);
>> 
>> Instead have only one condition:
>> 
>>if (!irq_settings_no_debug(desc))
>>  note_interrupt(...);
>> 
>> See the uncompiled delta patch below.
>
> I merged this second part with the first and gave IRQF_NO_DEBUG a try 
> on P8 and P9 systems and all looked fine. I should send both patches 
> after IRQF_NO_AUTOEN is merged in mainline. 

Does having that NODEBUG flag set on the IPI irqs make a measurable
difference or is it just too small to matter?

Thanks,

tglx






Re: [PATCH] kernel:irq:manage: request threaded irq with a specified priority

2021-04-13 Thread Thomas Gleixner
On Tue, Apr 13 2021 at 14:19, Song Chen wrote:
> In general, irq handler thread will be assigned a default priority which
> is MAX_RT_PRIO/2, as a result, no one can preempt others.
>
> Here is the case I found in a real project, an interrupt int_a is
> coming, wakes up its handler handler_a and handler_a wakes up a
> userspace RT process task_a.
>
> However, if another irq handler handler_b which has nothing to do
> with any RT tasks is running when int_a is coming, handler_a can't
> preempt handler_b, as a result, task_a can't be woken up immediately
> as expected until handler_b gives up cpu voluntarily. In this case,
> determinism breaks.

It breaks because the system designer failed to assign proper priorities
to the irq threads int_a, int_b and to the user space process task_a.

That's not solvable at the kernel level.

Thanks,

tglx



Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Thomas Gleixner
On Mon, Apr 12 2021 at 19:46, Len Brown wrote:
> On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski  wrote:
>> Even putting aside all kernel and ABI issues, is it actually a good
>> idea for user libraries to transparently use these new features?  I'm
>> not really convinced.  I think that serious discussion among userspace
>> people is needed.
>
> At the risk of stating the obvious...
> Intel's view is that libraries that deliver the most value from the
> hardware are a "good thing",
> and that anything preventing libraries from getting the most value
> from the hardware is a "bad thing":-)

Sure, and as a consequence the kernel is the problem when creative
libraries cause wreckage along the way.

I'm fine with that as long the kernel has a way to detect that and can
kill the offending application/library combo with an excplicit
-SIG_NICE_TRY.

Thanks,

tglx



Re: bl_list and lockdep

2021-04-12 Thread Thomas Gleixner
Dave,

On Tue, Apr 13 2021 at 08:15, Dave Chinner wrote:
> On Mon, Apr 12, 2021 at 05:20:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
>> > And, FWIW, I'm also aware of the problems that RT kernels have with
>> > the use of bit spinlocks and being unable to turn them into sleeping
>> > mutexes by preprocessor magic. I don't care about that either,
>> > because dentry cache...
>> 
>> In the dentry cache it's a non-issue.
>
> Incorrect.

I'm impressed about your detailed knowledge of something you do not care
about in the first place.

>> RT does not have a problem with bit spinlocks per se, it depends on how
>> they are used and what nests inside. Most of them are just kept as bit
>> spinlocks because the lock held, and therefore preempt disabled times
>> are small and no other on RT conflicting operations happen inside.
>> 
>> In the case at hand this is going to be a problem because inode->i_lock
>> nests inside the bit spinlock and we can't make inode->i_lock a raw
>> spinlock because it protects way heavier weight code pathes as well.
>
> Yes, that's exactly the "problem" I'm refering to. And I don't care,
> precisely because, well, dentry cache
>
> THat is, the dcache calls wake_up_all() from under the
> hlist_bl_lock() in __d_lookup_done(). That ends up in
> __wake_up_common_lock() which takes a spin lock embedded inside a
> wait_queue_head.  That's not a raw spinlock, either, so we already
> have this "spinlock inside bit lock" situation with the dcache usage
> of hlist_bl.

Sure, but you are missing that RT solves that by substituting the
wait_queue with a swait_queue, which does not suffer from that. But that
can't be done for the inode::i_lock case for various reasons.

> FYI, this dentry cache behaviour was added to the dentry cache in
> 2016 by commit d9171b934526 ("parallel lookups machinery, part 4
> (and last)"), so it's not like it's a new thing, either.

Really? I wasn't aware of that. Thanks for the education.

> If you want to make hlist_bl RT safe, then re-implement it behind
> the scenes for RT enabled kernels. All it takes is more memory
> usage for the hash table + locks, but that's something that non-RT
> people should not be burdened with caring about

I'm well aware that anything outside of @fromorbit universe is not
interesting to you, but I nevertheless want to take the opportunity to
express my appreciation for your truly caring and collaborative attitude
versus the interests of others who unfortunately do not share that universe.

Thanks,

tglx


Re: [PATCH 02/11] mm/page_alloc: Convert per-cpu list protection to local_lock

2021-04-12 Thread Thomas Gleixner
On Mon, Apr 12 2021 at 12:56, Mel Gorman wrote:
> On Fri, Apr 09, 2021 at 08:55:39PM +0200, Peter Zijlstra wrote:
> I'll update the changelog and comment accordingly. I'll decide later
> whether to leave it or move the location of the lock at the end of the
> series. If the patch is added, it'll either incur the double lookup (not
> that expensive, might be optimised by the compiler) or come up with a
> helper that takes the lock and returns the per-cpu structure. The double
> lookup probably makes more sense initially because there are multiple
> potential users of a helper that says "pin to CPU, lookup, lock and return
> a per-cpu structure" for both IRQ-safe and IRQ-unsafe variants with the
> associated expansion of the local_lock API. It might be better to introduce
> such a helper with multiple users converted at the same time and there are
> other local_lock users in preempt-rt that could do with upstreaming first.

We had such helpers in RT a while ago but it turned into an helper
explosion pretty fast. But that was one of the early versions of local
locks which could not be embedded into a per CPU data structure due to
raisins (my stupidity).

But with the more thought out approach of today we can have (+/- the
obligatory naming bikeshedding):

--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,4 +51,35 @@
 #define local_unlock_irqrestore(lock, flags)   \
__local_unlock_irqrestore(lock, flags)
 
+/**
+ * local_lock_get_cpu_ptr - Acquire a per CPU local lock and return
+ * a pointer to the per CPU data which
+ * contains the local lock.
+ * @pcp:   Per CPU data structure
+ * @lock:  The local lock member of @pcp
+ */
+#define local_lock_get_cpu_ptr(pcp, lock)  \
+   __local_lock_get_cpu_ptr(pcp, typeof(*(pcp)), lock)
+
+/**
+ * local_lock_irq_get_cpu_ptr - Acquire a per CPU local lock, disable
+ * interrupts and return a pointer to the
+ * per CPU data which contains the local lock.
+ * @pcp:   Per CPU data structure
+ * @lock:  The local lock member of @pcp
+ */
+#define local_lock_irq_get_cpu_ptr(pcp, lock)  \
+   __local_lock_irq_get_cpu_ptr(pcp, typeof(*(pcp)), lock)
+
+/**
+ * local_lock_irqsave_get_cpu_ptr - Acquire a per CPU local lock, save and
+ * disable interrupts and return a pointer to
+ * the CPU data which contains the local lock.
+ * @pcp:   Per CPU data structure
+ * @lock:  The local lock member of @pcp
+ * @flags: Storage for interrupt flags
+ */
+#define local_lock_irqsave_get_cpu_ptr(pcp, lock, flags)   \
+   __local_lock_irqsave_get_cpu_ptr(pcp, typeof(*(pcp)), lock, flags)
+
 #endif
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -91,3 +91,33 @@ static inline void local_lock_release(lo
local_lock_release(this_cpu_ptr(lock)); \
local_irq_restore(flags);   \
} while (0)
+
+#define __local_lock_get_cpu_ptr(pcp, type, lock)  \
+   ({  \
+   type *__pcp;\
+   \
+   preempt_disable();  \
+   __pcp = this_cpu_ptr(pcp);  \
+   local_lock_acquire(&__pcp->lock);   \
+   __pcp;  \
+   })
+
+#define __local_lock_irq_get_cpu_ptr(pcp, type, lock)  \
+   ({  \
+   type *__pcp;\
+   \
+   local_irq_disable();\
+   __pcp = this_cpu_ptr(pcp);  \
+   local_lock_acquire(&__pcp->lock);   \
+   __pcp;  \
+   })
+
+#define __local_lock_irqsave_get_cpu_ptr(pcp, type, lock, flags)\
+   ({  \
+   type *__pcp;\
+   \
+   local_irq_save(flags);  \
+   __pcp = this_cpu_ptr(pcp);  \
+   local_lock_acquire(&__pcp->lock);   \
+   __pcp;  \
+   })


and RT will then change that to:

--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -96,7 +96,7 @@ static inline void local_lock_release(lo
({  
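
For illustration, a rough usage sketch (not part of the thread) of how a caller would use the proposed getter together with the existing unlock API. The per-CPU structure, its members and the function name are made-up for this example:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct foo_pcp {
	local_lock_t	lock;
	int		count;
};

static DEFINE_PER_CPU(struct foo_pcp, foo_pcps) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void foo_account(void)
{
	struct foo_pcp *pcp;
	unsigned long flags;

	/* Disable interrupts and acquire this CPU's local lock */
	pcp = local_lock_irqsave_get_cpu_ptr(&foo_pcps, lock, flags);
	pcp->count++;
	/* Unlock via the existing API, pointing at the embedded lock */
	local_unlock_irqrestore(&foo_pcps.lock, flags);
}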

Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-12 Thread Thomas Gleixner
On Mon, Apr 12 2021 at 12:57, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 08:54:03PM +0200, Thomas Gleixner wrote:
>> > I will send a new series out later today, Pacific Time.
>> 
>> Can you do me a favour and send it standalone and not as yet another
>> reply to this existing thread maze. A trivial lore link to the previous
>> version gives enough context.
>
> Will do!
>
> Of course, it turns out that lockdep also doesn't like waited-on
> smp_call_function_single() invocations from timer handlers,
> so I am currently looking at other options for dealing with that
> potential use-after-free.  I am starting to like the looks of "only set
> CLOCK_SOURCE_VERIFY_PERCPU on statically allocated clocksource structures
> and let KASAN enforce this restriction", but I have not quite given up
> on making it more general.

The simplest point is in the thread under the clocksource_mutex which
prevents anything from vanishing under your feet.
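
For reference, a rough sketch of that spot as it looks in mainline kernel/time/clocksource.c (modulo details): any per-CPU verification issued from the watchdog kthread runs with clocksource_mutex held and is therefore serialized against unregistration.

static int clocksource_watchdog_kthread(void *data)
{
	mutex_lock(&clocksource_mutex);
	/*
	 * A clocksource_verify_percpu(cs) call issued from here cannot
	 * race with clocksource_unregister(), which also takes the mutex.
	 */
	if (__clocksource_watchdog_kthread())
		clocksource_select();
	mutex_unlock(&clocksource_mutex);
	return 0;
}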

Thanks,

tglx


Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-12 Thread Thomas Gleixner
Paul,

On Mon, Apr 12 2021 at 11:20, Paul E. McKenney wrote:
> On Mon, Apr 12, 2021 at 03:08:16PM +0200, Thomas Gleixner wrote:
>> The reason for irqsave is again historical AFAICT and nobody bothered to
>> clean it up. spin_lock_bh() should be sufficient to serialize against
>> the watchdog timer, though I haven't looked at all possible scenarios.
>
> Though if BH is disabled, there is not so much advantage to
> invoking it from __clocksource_watchdog_kthread().  Might as
> well just invoke it directly from clocksource_watchdog().
>
>> > 2. Invoke clocksource_verify_percpu() from its original
>> >location in clocksource_watchdog(), just before the call to
>> >__clocksource_unstable().  This relies on the fact that
>> >clocksource_watchdog() acquires watchdog_lock without
>> >disabling interrupts.
>> 
>> That should be fine, but this might cause the softirq to 'run' for a
>> very long time which is not pretty either.
>> 
>> Aside of that, do we really need to check _all_ online CPUs? What you
>> are trying to figure out is whether the wreckage is CPU local or global,
>> right?
>> 
>> Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
>> enough? Either the other CPU has the same wreckage, then it's global or
>> it hasn't which points to a per CPU local issue.
>> 
>> Sure it does not catch the case where a subset (>1) of all CPUs is
>> affected, but I'm not seeing how that really buys us anything.
>
> Good point!  My thought is to randomly pick eight CPUs to keep the
> duration reasonable while having a good chance of hitting "interesting"
> CPU choices in multicore and multisocket systems.
>
> However, if a hard-to-reproduce problem occurred, it would be good to take
> the hit and scan all the CPUs.  Additionally, there are some workloads
> for which the switch from TSC to HPET is fatal anyway due to increased
> overhead.  For these workloads, the full CPU scan is no additional pain.
>
> So I am thinking in terms of a default that probes eight randomly selected
> CPUs without worrying about duplicates (as in there would be some chance
> that fewer CPUs would actually be probed), but with a boot-time flag
> that does all CPUs.  I would add the (default) random selection as a
> separate patch.

You can't do without making it complex, right? Keep it simple is not an
option for an RCU hacker, it seems :)

> I will send a new series out later today, Pacific Time.

Can you do me a favour and send it standalone and not as yet another
reply to this existing thread maze. A trivial lore link to the previous
version gives enough context.

Thanks,

tglx
 


Re: bl_list and lockdep

2021-04-12 Thread Thomas Gleixner
Dave,

On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
> On Tue, Apr 06, 2021 at 02:28:34PM +0100, Matthew Wilcox wrote:
>> On Tue, Apr 06, 2021 at 10:33:43PM +1000, Dave Chinner wrote:
>> > +++ b/fs/inode.c
>> > @@ -57,8 +57,7 @@
>> >  
>> >  static unsigned int i_hash_mask __read_mostly;
>> >  static unsigned int i_hash_shift __read_mostly;
>> > -static struct hlist_head *inode_hashtable __read_mostly;
>> > -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>> > +static struct hlist_bl_head *inode_hashtable __read_mostly;
>> 
>> I'm a little concerned that we're losing a lockdep map here.  
>> 
>> Nobody seems to have done this for list_bl yet, and I'd be reluctant
>> to gate your patch on "Hey, Dave, solve this problem nobody else has
>> done yet".
>
> I really don't care about lockdep. Adding lockdep support to
> hlist_bl is somebody else's problem - I'm just using infrastructure
> that already exists. Also, the dentry cache usage of hlist_bl is
> vastly more complex and so if lockdep coverage was really necessary,
> it would have already been done
>
> And, FWIW, I'm also aware of the problems that RT kernels have with
> the use of bit spinlocks and being unable to turn them into sleeping
> mutexes by preprocessor magic. I don't care about that either,
> because dentry cache...

In the dentry cache it's a non-issue.

RT does not have a problem with bit spinlocks per se, it depends on how
they are used and what nests inside. Most of them are just kept as bit
spinlocks because the lock held times, and therefore the preempt disabled
times, are small and no other operations that conflict on RT happen inside.

In the case at hand this is going to be a problem because inode->i_lock
nests inside the bit spinlock and we can't make inode->i_lock a raw
spinlock because it protects way heavier weight code paths as well.

Thanks,

tglx




Re: [PATCH 02/17] locking: Add split_lock

2021-04-12 Thread Thomas Gleixner
On Mon, Apr 12 2021 at 15:45, Matthew Wilcox wrote:
> On Mon, Apr 12, 2021 at 04:29:28PM +0200, Thomas Gleixner wrote:
>> On Fri, Apr 09 2021 at 03:51, Matthew Wilcox wrote:
>> > Bitlocks do not currently participate in lockdep.  Conceptually, a
>> > bit_spinlock is a split lock, eg across each bucket in a hash table.
>> > The struct split_lock gives us somewhere to record the lockdep_map.
>> 
>> I like the concept, but the name is strange. The only purpose of 
>> 
>> > +struct split_lock {
>> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> > +  struct lockdep_map dep_map;
>> > +#endif
>> > +};
>> 
>> is to have a place to stick the lockdep map into. So it's not a lock
>> construct as the name suggests, it's just auxiliary data when lockdep is
>> enabled.
>
> That's the implementation _today_, but conceptually, it's a single lock.
> I was thinking that for non-RT, we could put a qspinlock in there for a
> thread to spin on if the bit is contended.  It'd need a bit of ingenuity
> to make sure that a thread unlocking a bitlock made sure that a thread
> spinning on the qspinlock saw the wakeup, but it should be doable.

Ah, that's what you have in mind.

> Anyway, from the point of view of the user, they should be declaring
> "this is the lock", not "this is the lock tracking structure", right?
>
>> I know you hinted that RT could make use of that data structure and the
>> fact that it's mandatory for the various lock functions, but that's not
>> really feasible if this is related to a hash with a bit spinlock per
>> bucket as the data structure is hash global.
>> 
>> Sure, we can do pointer math to find out the bucket index and do
>> something from there, but I'm not sure whether that really helps. Need
>> to stare at the remaining few places where bit spinlocks are an issue on
>> RT.
>
> I obviously don't understand exactly what the RT patchset does.  My
> thinking was that you could handle the bit locks like rw sems, and
> sacrifice the scalability of per-bucket-lock for the determinism of
> a single PI lock.

That'd suck for most bit spinlocks where the lock is just protecting
minimal hashlist operations and these preeempt disabled protections are
actually shorter than the overhead of a heavier lock.

For situations where the bit spinlock was actually an issue (long
traversals or such) in older kernel versions we just bit the bullet and
bloated the hash data structure with an actual spinlock and had some
wrappers to hide the mess from the actual code. That still preserved the
scalability while making the lock held section preemptible which we
obviously cannot have with real bit spinlocks even on RT.
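
For illustration, a rough sketch (not from any tree) of that approach; the structure and wrapper names are made up:

struct bl_bucket {
	struct hlist_bl_head	head;
#ifdef CONFIG_PREEMPT_RT
	spinlock_t		lock;	/* sleeping lock on RT, keeps the section preemptible */
#endif
};

static inline void bl_bucket_lock(struct bl_bucket *b)
{
#ifdef CONFIG_PREEMPT_RT
	spin_lock(&b->lock);
#else
	hlist_bl_lock(&b->head);
#endif
}

static inline void bl_bucket_unlock(struct bl_bucket *b)
{
#ifdef CONFIG_PREEMPT_RT
	spin_unlock(&b->lock);
#else
	hlist_bl_unlock(&b->head);
#endif
}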

But your idea of having a qspinlock for the contended case might
actually be something which might be worth to exploit RT wise -
obviously not with a qspinlock :) - but conceptually.

Need to think more about it.

Thanks,

tglx


Re: [PATCH 02/17] locking: Add split_lock

2021-04-12 Thread Thomas Gleixner
On Fri, Apr 09 2021 at 03:51, Matthew Wilcox wrote:
> Bitlocks do not currently participate in lockdep.  Conceptually, a
> bit_spinlock is a split lock, eg across each bucket in a hash table.
> The struct split_lock gives us somewhere to record the lockdep_map.

I like the concept, but the name is strange. The only purpose of 

> +struct split_lock {
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> + struct lockdep_map dep_map;
> +#endif
> +};

is to have a place to stick the lockdep map into. So it's not a lock
construct as the name suggests, it's just auxiliary data when lockdep is
enabled.

I know you hinted that RT could make use of that data structure and the
fact that it's mandatory for the various lock functions, but that's not
really feasible if this is related to a hash with a bit spinlock per
bucket as the data structure is hash global.

Sure, we can do pointer math to find out the bucket index and do
something from there, but I'm not sure whether that really helps. Need
to stare at the remaining few places where bit spinlocks are an issue on
RT.

Thanks,

tglx




Re: [tip: core/rcu] softirq: Don't try waking ksoftirqd before it has been spawned

2021-04-12 Thread Thomas Gleixner
On Sun, Apr 11 2021 at 13:43, tip-bot wrote:
> The following commit has been merged into the core/rcu branch of tip:
>
> Commit-ID: 1c0c4bc1ceb580851b2d76fdef9712b3bdae134b
> Gitweb:
> https://git.kernel.org/tip/1c0c4bc1ceb580851b2d76fdef9712b3bdae134b
> Author:Paul E. McKenney 
> AuthorDate:Fri, 12 Feb 2021 16:20:40 -08:00
> Committer: Paul E. McKenney 
> CommitterDate: Mon, 15 Mar 2021 13:51:48 -07:00
>
> softirq: Don't try waking ksoftirqd before it has been spawned
>
> If there is heavy softirq activity, the softirq system will attempt
> to awaken ksoftirqd and will stop the traditional back-of-interrupt
> softirq processing.  This is all well and good, but only if the
> ksoftirqd kthreads already exist, which is not the case during early
> boot, in which case the system hangs.
>
> One reproducer is as follows:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 
> --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y 
> CONFIG_PROVE_LOCKING=y CONFIG_NO_HZ_IDLE=y CONFIG_HZ_PERIODIC=n" --bootargs 
> "threadirqs=1" --trust-make
>
> This commit therefore adds a couple of existence checks for ksoftirqd
> and forces back-of-interrupt softirq processing when ksoftirqd does not
> yet exist.  With this change, the above test passes.

Color me confused. I did not follow the discussion around this
completely, but wasn't it agreed on that this rcu torture muck can wait
until the threads are brought up?

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 9908ec4..bad14ca 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -211,7 +211,7 @@ static inline void invoke_softirq(void)
>   if (ksoftirqd_running(local_softirq_pending()))
>   return;
>  
> - if (!force_irqthreads) {
> + if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>   /*
>* We can safely execute softirq on the current stack if

This still breaks RT which forces force_irqthreads to a compile time
const which makes the compiler optimize out the direct invocation.

Surely RT can work around that, but how is that rcu torture muck
supposed to work then? We're back to square one then.

Thanks,

tglx


Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-12 Thread Thomas Gleixner
On Sun, Apr 11 2021 at 21:21, Paul E. McKenney wrote:
> On Sun, Apr 11, 2021 at 09:46:12AM -0700, Paul E. McKenney wrote:
>> So what I need to do is inline clocksource_verify_percpu_wq()
>> into clocksource_verify_percpu() and then move the call to
>> clocksource_verify_percpu() to __clocksource_watchdog_kthread(), right
>> before the existing call to list_del_init().  Will do!
>
> Except that this triggers the WARN_ON_ONCE() in smp_call_function_single()
> due to interrupts being disabled across that list_del_init().
>
> Possibilities include:
>
> 1.Figure out why interrupts must be disabled only sometimes while
>   holding watchdog_lock, in the hope that they need not be across
>   the entire critical section for __clocksource_watchdog_kthread().
>   As in:
>
>   local_irq_restore(flags);
>   clocksource_verify_percpu(cs);
>   local_irq_save(flags);
>
>   Trying this first with lockdep enabled.  Might be spectacular.

Yes, it's a possible deadlock against the watchdog timer firing ...

The reason for irqsave is again historical AFAICT and nobody bothered to
clean it up. spin_lock_bh() should be sufficient to serialize against
the watchdog timer, though I haven't looked at all possible scenarios.

> 2.Invoke clocksource_verify_percpu() from its original
>   location in clocksource_watchdog(), just before the call to
>   __clocksource_unstable().  This relies on the fact that
>   clocksource_watchdog() acquires watchdog_lock without
>   disabling interrupts.

That should be fine, but this might cause the softirq to 'run' for a
very long time which is not pretty either.

Aside of that, do we really need to check _all_ online CPUs? What you
are trying to figure out is whether the wreckage is CPU local or global,
right?

Wouldn't a shirt-sleeve approach of just querying _one_ CPU be good
enough? Either the other CPU has the same wreckage, then it's global or
it hasn't which points to a per CPU local issue.

Sure it does not catch the case where a subset (>1) of all CPUs is
affected, but I'm not seing how that really buys us anything.

Thanks,

tglx


Re: [PATCH RESEND v1 2/4] lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()

2021-04-12 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 16:48, Christophe Leroy wrote:
> For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow
> architectures to provide the vdso data pointer"), powerpc wants to
> avoid calculation of relative position to code.
>
> As the timens_vdso_data is next page to vdso_data, provide
> vdso_data pointer to __arch_get_timens_vdso_data() in order
> to ease the calculation on powerpc in following patches.
>
> Signed-off-by: Christophe Leroy 

Reviewed-by: Thomas Gleixner 


Re: [PATCH RESEND v1 0/4] powerpc/vdso: Add support for time namespaces

2021-04-12 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 16:48, Christophe Leroy wrote:
> [Sorry, resending with complete destination list, I used the wrong script on 
> the first delivery]
>
> This series adds support for time namespaces on powerpc.
>
> All timens selftests are successfull.

If PPC people want to pick up the whole lot, no objections from my side.

Thanks,

tglx


Re: [PATCH RESEND v1 1/4] lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()

2021-04-12 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 16:48, Christophe Leroy wrote:
> In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres()
> and do_coarse() as __always_inline"), mark do_hres_timens() and
> do_coarse_timens() __always_inline.
>
> The measurement below is on a non-timens process, i.e. on the fastest path.
>
> On powerpc32, without the patch:
>
> clock-gettime-monotonic-raw:vdso: 1155 nsec/call
> clock-gettime-monotonic-coarse:vdso: 813 nsec/call
> clock-gettime-monotonic:vdso: 1076 nsec/call
>
> With the patch:
>
> clock-gettime-monotonic-raw:vdso: 1100 nsec/call
> clock-gettime-monotonic-coarse:vdso: 667 nsec/call
> clock-gettime-monotonic:vdso: 1025 nsec/call
>
> Signed-off-by: Christophe Leroy 

Reviewed-by: Thomas Gleixner 


Re: [PATCH] genirq: reduce irqdebug bouncing cachelines

2021-04-12 Thread Thomas Gleixner
Cédric,

On Mon, Apr 12 2021 at 11:06, Cédric Le Goater wrote:
> On 4/10/21 1:58 PM, Thomas Gleixner wrote:
>> --- a/kernel/irq/spurious.c
>> +++ b/kernel/irq/spurious.c
>> @@ -274,7 +274,7 @@ void note_interrupt(struct irq_desc *des
>>  unsigned int irq;
>>  
>>  if (desc->istate & IRQS_POLL_INPROGRESS ||
>> -irq_settings_is_polled(desc))
>> +irq_settings_is_polled(desc) | irq_settings_no_debug(desc))
>
> Shouldn't it be '||' instead  ?

It could. But that's intentionally '|'. Why?

Because that lets the compiler merge the bit checks into one and
therefore spares one conditional branch.
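
Both helpers test bits in desc->status_use_accessors, so with '|' the two checks are roughly equivalent to a single mask test:

	if (desc->status_use_accessors & (_IRQ_IS_POLLED | _IRQ_NO_DEBUG))
		return;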

>>  return;
>>  
>>  if (bad_action_ret(action_ret)) {
>> 
>
> We could test irq_settings_no_debug() directly under handle_nested_irq() 
> and handle_irq_event_percpu() to avoid calling note_interrupt(), just 
> like we do for noirqdebug.

We can do that, but then we should not just make it:

   if (!irqnodebug && !irq_settings_no_debug(desc))
note_interrupt(...);

Instead have only one condition:

   if (!irq_settings_no_debug(desc))
note_interrupt(...);

See the uncompiled delta patch below.

Thanks,

tglx
---
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1690,6 +1690,9 @@ static int
irq_settings_set_no_debug(desc);
}
 
+   if (noirqdebug)
+   irq_settings_set_no_debug(desc);
+
if (new->flags & IRQF_ONESHOT)
desc->istate |= IRQS_ONESHOT;
 
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -274,7 +274,7 @@ void note_interrupt(struct irq_desc *des
unsigned int irq;
 
if (desc->istate & IRQS_POLL_INPROGRESS ||
-   irq_settings_is_polled(desc) | irq_settings_no_debug(desc))
+   irq_settings_is_polled(desc))
return;
 
if (bad_action_ret(action_ret)) {
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -481,7 +481,7 @@ void handle_nested_irq(unsigned int irq)
for_each_action_of_desc(desc, action)
action_ret |= action->thread_fn(action->irq, action->dev_id);
 
-   if (!noirqdebug)
+   if (!irq_settings_no_debug(desc))
note_interrupt(desc, action_ret);
 
	raw_spin_lock_irq(&desc->lock);
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -197,7 +197,7 @@ irqreturn_t handle_irq_event_percpu(stru
 
add_interrupt_randomness(desc->irq_data.irq, flags);
 
-   if (!noirqdebug)
+   if (!irq_settings_no_debug(desc))
note_interrupt(desc, retval);
return retval;
 }


Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock

2021-04-12 Thread Thomas Gleixner
On Sat, Apr 03 2021 at 01:04, Fenghua Yu wrote:
> On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote:
>> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote:
>> And even with throttling the injection rate further down to 25k per
>> second the impact on the workload is still significant in the 10% range.
>
> Thank you for your insight!
>
> So I can change the ratelimit to system wide and call usleep_range()
> to sleep: 
>while (!__ratelimit(_bld_ratelimit))
>usleep_range(100 / bld_ratelimit,
> 100 / bld_ratelimit);
>
> The max bld_ratelimit is 1000,000/s because the max sleeping time is 1
> usec.

Maximum sleep time is 1usec?

> The min bld_ratelimit is 1/s.

Again. This does not make sense at all. 1Mio bus lock events per second
are way beyond the point where the machine does anything else than being
stuck in buslocks.

Aside of that why are you trying to make this throttling in any way
accurate? It does not matter at all, really. Limit reached, put it to
sleep for some time and be done with it. No point in trying to be clever
for no value.
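
For illustration, a rough sketch of that simpler shape (bld_ratelimit is the ratelimit state from the quoted snippet; the sleep period is an arbitrary example):

	if (!__ratelimit(&bld_ratelimit)) {
		/* Over budget: back off for a while, no precise accounting needed */
		msleep(20);
	}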

Thanks,

tglx


Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

2021-04-11 Thread Thomas Gleixner
On Sat, Apr 10 2021 at 16:26, Paul E. McKenney wrote:
> On Sat, Apr 10, 2021 at 10:01:58AM +0200, Thomas Gleixner wrote:
>> On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
>> I buy the vCPU preemption part and TBH guests should not have that
>> watchdog thing active at all for exactly this reason.
>
> Agreed, one approch is to enable the the clocksource watchdog only in
> the hypervisor, and have some action on the guests triggered when the
> host detects clock skew.
>
> This works quite well, at least until something breaks in a way that
> messes up clock reads from the guest but not from the host.  And I
> am sure that any number of hardware guys will tell me that this just
> isn't possible, but if failing hardware operated according to their
> expectations, that hardware wouldn't be considered to be failing.
> Or it wouldn't be hardware, firmware, or clock-driver bringup, as the
> case may be.

Don't tell me. The fact that this code exists at all is a horror on it's
own.

>> SMI, NMI injecting 62.5ms delay? If that happens then the performance of
>> the clocksource is the least of your worries.
>
> I was kind of hoping that you would tell me why the skew must be all the
> way up to 62.5ms before the clock is disabled.  The watchdog currently
> is quite happy with more than 10% skew between clocks.
>
> 100HZ clocks or some such?

Histerical raisins. When the clocksource watchdog was introduced it
replaced a x86 specific validation which was jiffies based. I have faint
memories that we wanted to have at least jiffies based checks preserved
in absence of other hardware, which had other problems and we gave up on
it. But obviously nobody thought about revisiting the threshold.

Yes, it's way too big. The slowest watchdog frequency on x86 is ~3.5 MHz
(ACPI PMtimer). Don't know about the reference frequency on MIPS which
is the only other user of this.

Thanks,

tglx



Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-11 Thread Thomas Gleixner
On Sat, Apr 10 2021 at 17:20, Paul E. McKenney wrote:
> On Sat, Apr 10, 2021 at 11:00:25AM +0200, Thomas Gleixner wrote:
>> > +  if (WARN_ON_ONCE(!cs))
>> > +  return;
>> > +  pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
>> > +  cs->name, smp_processor_id());
>> > +  cpumask_clear(&cpus_ahead);
>> > +  cpumask_clear(&cpus_behind);
>> > +  csnow_begin = cs->read(cs);
>> 
>> So this is invoked via work and the actual clocksource change is done
>> via work too. Once the clocksource is no longer actively used for
>> timekeeping it can go away. What's guaranteeing that this runs prior to
>> the clocksource change and 'cs' is valid throughout this function?
>
> From what I can see, cs->read() doesn't care whether or not the
> clocksource has been marked unstable.  So it should be OK to call
> cs->read() before, during, or after the call to __clocksource_unstable().
>
> Also, this is only done on clocksources marked CLOCK_SOURCE_VERIFY_PERCPU,
> so any clocksource that did not like cs->read() being invoked during
> or after the call to __clocksource_unstable() should leave off the
> CLOCK_SOURCE_VERIFY_PERCPU bit.
>
> Or did I take a wrong turn somewhere in the pointers to functions?

Right. cs->read() does not care, but what guarantees that cs is valid
and not freed yet? It's not an issue with TSC and KVMCLOCK, but
conceptually the following is possible:

watchdog()   
  queue_work(synccheck);
  queue_work(clocksource_change);

work:   
  synccheck() clocksource_change()
preemption...
  ...
  some_other_code():
 unregister_clocksource(cs)
 free(cs)
  cs->read()   <- UAF

>> > +  queue_work(system_highpri_wq, &clocksource_verify_work);
>> 
>> This does not guarantee anything. So why does this need an extra work
>> function which is scheduled separately?
>
> Because I was concerned about doing smp_call_function() while holding
> watchdog_lock, which is also acquired elsewhere using spin_lock_irqsave().
> And it still looks like on x86 that spin_lock_irqsave() spins with irqs
> disabled, which could result in deadlock.  The smp_call_function_single()
> would wait for the target CPU to enable interrupts, which would not
> happen until after the smp_call_function_single() returned due to its
> caller holding watchdog_lock.
>
> Or is there something that I am missing that prevents this deadlock
> from occurring?

The unstable mechanism is:

watchdog()
  __clocksource_unstable()
    schedule_work(&watchdog_work);

watchdog_work()
  kthread_run(clocksource_watchdog_thread);

cs_watchdog_thread()
  mutex_lock(&clocksource_mutex);
  if (__clocksource_watchdog_kthread())
clocksource_select();
  mutex_unlock(&clocksource_mutex);

So what prevents you from doing that right in watchdog_work() or even in
cs_watchdog_thread() properly ordered against the actual clocksource
switch?

Hmm?

Thanks,

tglx


Re: [RFC 1/2] x86/tsc: add a timer to make sure tsc_adjust is always checked

2021-04-10 Thread Thomas Gleixner
Feng,

On Sat, Apr 10 2021 at 22:38, Feng Tang wrote:
> On Sat, Apr 10, 2021 at 11:27:11AM +0200, Thomas Gleixner wrote:
>> > +static int __init start_sync_check_timer(void)
>> > +{
>> > +  if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
>> > +  return 0;
>> > +
>> > +  timer_setup(&tsc_sync_check_timer, tsc_sync_check_timer_fn, 0);
>> > +  tsc_sync_check_timer.expires = jiffies + SYNC_CHECK_INTERVAL;
>> > +  add_timer(&tsc_sync_check_timer);
>> > +
>> > +  return 0;
>> > +}
>> > +late_initcall(start_sync_check_timer);
>> 
>> So right now, if someone adds 'tsc=reliable' on the kernel command line
>> then all of the watchdog checking, except for the idle enter TSC_ADJUST
>> check is disabled. The NOHZ full people are probably going to be pretty
>> unhappy about yet another unconditional timer they have to chase down.
>> 
>> So this needs some more thought.
>
> 'tsc=reliable' in cmdline will set 'tsc_clocksource_reliable' to 1, so
> we can skip starting this timer if 'tsc_clocksource_reliable==1' ?

Then we can just ignore that patch alltogether because of 2/2 doing:

+   if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+   boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+   boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
+   nr_online_nodes <= 2)
+   tsc_clocksource_reliable = 1;



I said for a reason:

>> So this needs some more thought.

Thanks,

tglx


Re: [RFC 1/2] x86/tsc: add a timer to make sure tsc_adjust is always checked

2021-04-10 Thread Thomas Gleixner
On Sat, Apr 10 2021 at 11:47, Borislav Petkov wrote:

> On Sat, Apr 10, 2021 at 11:27:11AM +0200, Thomas Gleixner wrote:
>> On Tue, Mar 30 2021 at 16:25, Feng Tang wrote:
>> > Normally the tsc_sync will be checked every time system enters idle state,
>> > but there is still caveat that a system won't enter idle, either because
>> > it's too busy or configured purposely to not enter idle. Setup a periodic
>> > timer to make sure the check is always on.
>> 
>> Bah. I really hate the fact that we don't have a knob to disable writes
>> to the TSC/TSC_ADJUST msrs. That would spare this business alltogether.
>
> We have the MSR filtering and I'd *love* to add those MSRs to a
> permanent ban list of MSRs which will never ever be written to from
> luserspace.

That's good, but what I really want is a knob which prevents BIOS/SMM
from writing to it. The only reason why BIOS ever needs to write is for
physical hotplug and perhaps for 4+ socket machines on boot. After that
every write is a bug.

Thanks,

tglx


Re: [PATCH] genirq: reduce irqdebug bouncing cachelines

2021-04-10 Thread Thomas Gleixner
Nicholas,

On Fri, Apr 02 2021 at 23:20, Nicholas Piggin wrote:
> note_interrupt increments desc->irq_count for each interrupt even for
> percpu interrupt handlers, even when they are handled successfully. This
> causes cacheline bouncing and limits scalability.
>
> Instead of incrementing irq_count every time, only start incrementing it
> after seeing an unhandled irq, which should avoid the cache line
> bouncing in the common path.
>
> This actually should give better consistency in handling misbehaving
> irqs too, because instead of the first unhandled irq arriving at an
> arbitrary point in the irq_count cycle, its arrival will begin the
> irq_count cycle.

I've applied that because it makes sense in general, but I think the whole
call to note_interrupt() can be avoided or return early when interrupts
would be marked accordingly. For IPI handlers which always return
HANDLED the whole procedure is pretty pointless to begin with.

Something like the completely untested below.

Thanks,

tglx
---
 include/linux/interrupt.h |3 +++
 include/linux/irq.h   |2 ++
 kernel/irq/manage.c   |2 ++
 kernel/irq/settings.h |   12 
 kernel/irq/spurious.c |2 +-
 5 files changed, 20 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -64,6 +64,8 @@
  * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request 
it.
  *Users will enable it explicitly by enable_irq() or 
enable_nmi()
  *later.
+ * IRQF_NO_DEBUG - Exclude from runaway detection for IPI and similar 
handlers,
+ *depends on IRQF_PERCPU.
  */
 #define IRQF_SHARED0x0080
 #define IRQF_PROBE_SHARED  0x0100
@@ -78,6 +80,7 @@
 #define IRQF_EARLY_RESUME  0x0002
 #define IRQF_COND_SUSPEND  0x0004
 #define IRQF_NO_AUTOEN 0x0008
+#define IRQF_NO_DEBUG  0x0010
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | 
IRQF_NO_THREAD)
 
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -72,6 +72,7 @@ enum irqchip_irq_state;
  *   mechanism and from core side polling.
  * IRQ_DISABLE_UNLAZY  - Disable lazy irq disable
  * IRQ_HIDDEN  - Don't show up in /proc/interrupts
+ * IRQ_NO_DEBUG- Exclude from note_interrupt() 
debugging
  */
 enum {
IRQ_TYPE_NONE   = 0x,
@@ -99,6 +100,7 @@ enum {
IRQ_IS_POLLED   = (1 << 18),
IRQ_DISABLE_UNLAZY  = (1 << 19),
IRQ_HIDDEN  = (1 << 20),
+   IRQ_NO_DEBUG= (1 << 21),
 };
 
 #define IRQF_MODIFY_MASK   \
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1682,6 +1682,8 @@ static int
if (new->flags & IRQF_PERCPU) {
irqd_set(>irq_data, IRQD_PER_CPU);
irq_settings_set_per_cpu(desc);
+   if (new->flags & IRQF_NO_DEBUG)
+   irq_settings_set_no_debug(desc);
}
 
if (new->flags & IRQF_ONESHOT)
--- a/kernel/irq/settings.h
+++ b/kernel/irq/settings.h
@@ -18,6 +18,7 @@ enum {
_IRQ_IS_POLLED  = IRQ_IS_POLLED,
_IRQ_DISABLE_UNLAZY = IRQ_DISABLE_UNLAZY,
_IRQ_HIDDEN = IRQ_HIDDEN,
+   _IRQ_NO_DEBUG   = IRQ_NO_DEBUG,
_IRQF_MODIFY_MASK   = IRQF_MODIFY_MASK,
 };
 
@@ -33,6 +34,7 @@ enum {
 #define IRQ_IS_POLLED  GOT_YOU_MORON
 #define IRQ_DISABLE_UNLAZY GOT_YOU_MORON
 #define IRQ_HIDDEN GOT_YOU_MORON
+#define IRQ_NO_DEBUG   GOT_YOU_MORON
 #undef IRQF_MODIFY_MASK
 #define IRQF_MODIFY_MASK   GOT_YOU_MORON
 
@@ -174,3 +176,13 @@ static inline bool irq_settings_is_hidde
 {
return desc->status_use_accessors & _IRQ_HIDDEN;
 }
+
+static inline void irq_settings_set_no_debug(struct irq_desc *desc)
+{
+   desc->status_use_accessors |= _IRQ_NO_DEBUG;
+}
+
+static inline bool irq_settings_no_debug(struct irq_desc *desc)
+{
+   return desc->status_use_accessors & _IRQ_NO_DEBUG;
+}
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -274,7 +274,7 @@ void note_interrupt(struct irq_desc *des
unsigned int irq;
 
if (desc->istate & IRQS_POLL_INPROGRESS ||
-   irq_settings_is_polled(desc))
+   irq_settings_is_polled(desc) | irq_settings_no_debug(desc))
return;
 
if (bad_action_ret(action_ret)) {


Re: [PATCH] arch/x86: Encourage rather than discourage x2apic support

2021-04-10 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 05:16, Luke Dashjr wrote:

> Multi-core SMP doesn't work on modern Intel CPUs (at least Comet Lake) without
> x2apic. Unsure users should say Y.

I'd rather make it explicit by also adding

default y
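
For illustration, the suggestion amounts to something like the below in arch/x86/Kconfig (surrounding lines reproduced from memory and possibly inexact):

config X86_X2APIC
	bool "Support x2apic"
	depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
	default y
	help
	  This enables x2apic support on CPUs that have this feature.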
   
Thanks,

tglx


Re: [RFC 1/2] x86/tsc: add a timer to make sure tsc_adjust is always checked

2021-04-10 Thread Thomas Gleixner
On Tue, Mar 30 2021 at 16:25, Feng Tang wrote:
> Normally the tsc_sync will be checked every time system enters idle state,
> but there is still caveat that a system won't enter idle, either because
> it's too busy or configured purposely to not enter idle. Setup a periodic
> timer to make sure the check is always on.

Bah. I really hate the fact that we don't have a knob to disable writes
to the TSC/TSC_ADJUST msrs. That would spare this business alltogether.

> +/*
> + * Normally the tsc_sync will be checked every time system enters idle state,
> + * but there is still caveat that a system won't enter idle, either because
> + * it's too busy or configured purposely to not enter idle.
> + *
> + * So setup a periodic timer to make sure the check is always on.
> + */
> +
> +#define SYNC_CHECK_INTERVAL  (HZ * 600)
> +static void tsc_sync_check_timer_fn(struct timer_list *unused)

I've surely mentioned this before that glueing a define without an empty
newline to a function definition is horrible to read.

> +{
> + int next_cpu;
> +
> + tsc_verify_tsc_adjust(false);
> +
> + /* Loop to do the check for all onlined CPUs */

I don't see a loop here.

> + next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);

Why raw_smp_processor_id()? What's wrong with smp_processor_id()?

> + if (next_cpu >= nr_cpu_ids)
> + next_cpu = cpumask_first(cpu_online_mask);
> +
> + tsc_sync_check_timer.expires += SYNC_CHECK_INTERVAL;
> + add_timer_on(&tsc_sync_check_timer, next_cpu);
> +}
> +
> +static int __init start_sync_check_timer(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
> + return 0;
> +
> + timer_setup(&tsc_sync_check_timer, tsc_sync_check_timer_fn, 0);
> + tsc_sync_check_timer.expires = jiffies + SYNC_CHECK_INTERVAL;
> + add_timer(&tsc_sync_check_timer);
> +
> + return 0;
> +}
> +late_initcall(start_sync_check_timer);

So right now, if someone adds 'tsc=reliable' on the kernel command line
then all of the watchdog checking, except for the idle enter TSC_ADJUST
check is disabled. The NOHZ full people are probably going to be pretty
unhappy about yet another unconditional timer they have to chase down.

So this needs some more thought.

Thanks,

tglx


Re: [PATCH v7 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking

2021-04-10 Thread Thomas Gleixner
On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> From: "Paul E. McKenney" 
>
> Although smp_call_function() has the advantage of simplicity, using
> it to check for cross-CPU clock desynchronization means that any CPU
> being slow reduces the sensitivity of the checking across all CPUs.
> And it is not uncommon for smp_call_function() latencies to be in the
> hundreds of microseconds.
>
> This commit therefore switches to smp_call_function_single(), so that
> delays from a given CPU affect only those measurements involving that
> particular CPU.

Is there any reason I'm missing why this is not done right in patch 3/5
which introduces this synchronization check?

Thanks,

tglx




Re: [PATCH v7 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

2021-04-10 Thread Thomas Gleixner
On Fri, Apr 02 2021 at 15:49, paulmck wrote:
>
> +static void clocksource_verify_percpu_wq(struct work_struct *unused)
> +{
> + int cpu;
> + struct clocksource *cs;
> + int64_t cs_nsec;
> + u64 csnow_begin;
> + u64 csnow_end;
> + u64 delta;

Please use reverse fir tree ordering and stick variables of the same
type together:

u64 csnow_begin, csnow_end, delta;
struct clocksource *cs;
s64 cs_nsec;
int cpu;
> +
> + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release

Please don't use tail comments. They are a horrible distraction.

> + if (WARN_ON_ONCE(!cs))
> + return;
> + pr_warn("Checking clocksource %s synchronization from CPU %d.\n",
> + cs->name, smp_processor_id());
> + cpumask_clear(&cpus_ahead);
> + cpumask_clear(&cpus_behind);
> + csnow_begin = cs->read(cs);

So this is invoked via work and the actual clocksource change is done
via work too. Once the clocksource is no longer actively used for
timekeeping it can go away. What's guaranteeing that this runs prior to
the clocksource change and 'cs' is valid throughout this function?

> + queue_work(system_highpri_wq, &clocksource_verify_work);

This does not guarantee anything. So why does this need an extra work
function which is scheduled separately?

Thanks,

tglx


Re: [PATCH v7 clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-04-10 Thread Thomas Gleixner
On Fri, Apr 02 2021 at 15:49, paulmck wrote:
> This commit therefore re-reads the watchdog clock on either side of

'This commit' is not any better than 'This patch' and this sentence
makes no sense. I might be missing something, but how exactly does "the
commit" re-read the watchdog clock?

 git grep 'This patch' Documentation/process/

> the read from the clock under test.  If the watchdog clock shows an
> +retry:
>   local_irq_disable();
> - csnow = cs->read(cs);
> - clocksource_watchdog_inject_delay();
>   wdnow = watchdog->read(watchdog);
> + clocksource_watchdog_inject_delay();
> + csnow = cs->read(cs);
> + wdagain = watchdog->read(watchdog);
>   local_irq_enable();
> + delta = clocksource_delta(wdagain, wdnow, watchdog->mask);
> + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, 
> watchdog->shift);

That variable naming is confusing as hell. This is about the delta and
not about the second readout of the watchdog.

> + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) {

How exactly is this going negative especially with clocksources which
have a limited bitwidth? See clocksource_delta().
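
For reference, clocksource_delta() in kernel/time/timekeeping_internal.h looked roughly like this at the time: the delta is masked, and with CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE an apparently negative (wrapped) delta is clamped to 0 instead of ever going negative.

static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
{
	u64 ret = (now - last) & mask;

#ifdef CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE
	/* Top bit of the masked delta set: treat as time going backwards */
	return ret & ~(mask >> 1) ? 0 : ret;
#else
	return ret;
#endif
}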

> + wderr_nsec = wdagain_nsec;
> + if (nretries++ < max_read_retries)
> + goto retry;
> + }
> + if (nretries)
> + pr_warn("timekeeping watchdog on CPU%d: %s read-back 
> delay of %lldns, attempt %d\n",
> + smp_processor_id(), watchdog->name, wderr_nsec, 
> nretries);

Lacks curly braces around the pr_warn() simply because it's not a single
line. Breaks my parser :)

But if this ever happens to exceed max_read_retries, then what's the
point of continuing at all? The data is known to be crap already.

>   /* Clocksource initialized ? */
>   if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||

Thanks,

tglx


Re: [PATCH v7 clocksource] Do not mark clocks unstable due to delays for v5.13

2021-04-10 Thread Thomas Gleixner
On Fri, Apr 02 2021 at 15:48, Paul E. McKenney wrote:
> Hello!
>
> If there is a sufficient delay between reading the watchdog clock and the
> clock under test, the clock under test will be marked unstable through no
> fault of its own.  This series checks for this, doing limited retries
> to get a good set of clock reads.  If the clock is marked unstable
> and is marked as being per-CPU, cross-CPU synchronization is checked.
> This series also provides delay injection, which may be enabled via
> kernel boot parameters to test the checking for delays.
>
> Note that "sufficient delay" can be provided by SMIs, NMIs, and of course
> vCPU preemption.

I buy the vCPU preemption part and TBH guests should not have that
watchdog thing active at all for exactly this reason.

SMI, NMI injecting 62.5ms delay? If that happens then the performance of
the clocksource is the least of your worries.

Thanks,

tglx


Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-10 Thread Thomas Gleixner
On Fri, Apr 09 2021 at 13:51, Marcelo Tosatti wrote:
> On Fri, Apr 09, 2021 at 04:15:13PM +0200, Thomas Gleixner wrote:
>> On Wed, Apr 07 2021 at 10:53, Marcelo Tosatti wrote:
>> ---> fail because that newly started timer is on the old offset.
>
> CPU0						CPU1
>
> clock_was_set()
>		Case-1: CPU-1 grabs base->lock before CPU-0:
>		CPU-0 sees active_mask[CPU1] and IPIs.
>
>						base = lock_hrtimer_base(timer, &flags);
>						if (__hrtimer_start_range_ns(timer, tim, ...
>							hrtimer_reprogram(timer, true);
>
>						unlock_hrtimer_base(timer, &flags);
>
> raw_spin_lock_irqsave(&cpu_base->lock, flags);
> if (need_reprogram_timer(cpu_base))
>	cpumask_set_cpu(cpu, mask);
> else
>	hrtimer_update_base(cpu_base);
> raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
>
>		Case-2: CPU-1 grabs base->lock after CPU-0:
>		CPU-0 will have updated the offsets remotely.
>
>						base = lock_hrtimer_base(timer, &flags);
>						if (__hrtimer_start_range_ns(timer, tim, ...
>							hrtimer_reprogram(timer, true);
>
>						unlock_hrtimer_base(timer, &flags);
>
>
> No?

Yeah, you're right. I misread the loop logic.

Can we please make that unconditional independent of nohz full. There is
no reason to special case it.

Thanks,

tglx


Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-09 Thread Thomas Gleixner
On Wed, Apr 07 2021 at 10:53, Marcelo Tosatti wrote:
> Setting the realtime clock triggers an IPI to all CPUs to reprogram
> hrtimers.
>
> However, only base, boottime and tai clocks have their offsets updated

base clock? And why boottime? Boottime is not affected by a clock
realtime set. It's clock REALTIME and TAI, nothing else.

> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|  \
> +  (1U << HRTIMER_BASE_REALTIME_SOFT)|\
> +  (1U << HRTIMER_BASE_BOOTTIME)| \
> +  (1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> +  (1U << HRTIMER_BASE_TAI)|  \
> +  (1U << HRTIMER_BASE_TAI_SOFT))
> +
> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> +{
> + unsigned int active = 0;
> +
> + if (!cpu_base->softirq_activated)
> + active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
> +
> + active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> +
> + if ((active & CLOCK_SET_BASES) == 0)
> + return false;
> +
> + return true;
> +}

Errm. 

> + /* Avoid interrupting nohz_full CPUs if possible */
> + preempt_disable();
> + for_each_online_cpu(cpu) {
> + if (tick_nohz_full_cpu(cpu)) {
> + unsigned long flags;
> +			struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> +
> +			raw_spin_lock_irqsave(&cpu_base->lock, flags);
> + if (need_reprogram_timer(cpu_base))
> + cpumask_set_cpu(cpu, mask);
> + else
> + hrtimer_update_base(cpu_base);
> +			raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> + }
> + }

How is that supposed to be correct?

CPU0CPU1

clock_was_set() hrtimer_start(CLOCK_REALTIME)

  if (!active_mask[CPU1] & XXX)
continue;
active_mask |= REALTIME;

---> fail because that newly started timer is on the old offset.

Thanks,

tglx




Re: [RFC PATCH 06/10] genirq: Don't mask IRQ within flow handler if IRQ is flow-masked

2021-04-09 Thread Thomas Gleixner
On Thu, Apr 08 2021 at 16:43, Valentin Schneider wrote:
> + /*
> +  * Masking is required if IRQ is ONESHOT and we can't rely on the
> +  * flow-masking persisting down to irq_finalize_oneshot()
> +  * (in the IRQ thread).
> +  */
> + if ((desc->istate & IRQS_ONESHOT) &&
> + (!(chip->flags & IRQCHIP_AUTOMASKS_FLOW) ||
> +  !(chip->flags & IRQCHIP_EOI_THREADED)))

#define  (IRQCHIP_AUTOMASKS_FLOW | IRQCHIP_EOI_THREADED)

((chip->flags & ) != )

Hmm?

Thanks,

tglx


Re: [RFC PATCH 05/10] genirq: Let purely flow-masked ONESHOT irqs through unmask_threaded_irq()

2021-04-09 Thread Thomas Gleixner
On Thu, Apr 08 2021 at 16:43, Valentin Schneider wrote:
> A subsequent patch will let IRQs end up in irq_finalize_oneshot() without
> IRQD_IRQ_MASKED, but with IRQD_IRQ_FLOW_MASKED set instead. Let such IRQs
> receive their final ->irq_eoi().
>
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/irq/manage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index e976c4927b25..59c8056d6714 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1076,7 +1076,7 @@ static void irq_finalize_oneshot(struct irq_desc *desc,
>   desc->threads_oneshot &= ~action->thread_mask;
>  
>   if (!desc->threads_oneshot && !irqd_irq_disabled(&desc->irq_data) &&
> -     irqd_irq_masked(&desc->irq_data))
> +     (irqd_irq_masked(&desc->irq_data) || irqd_irq_flow_masked(&desc->irq_data)))
>   unmask_threaded_irq(desc);

This creates better code when you have '|' instead of '||' because the
compiler can spare a conditional and combine the bit check into one.

Thanks,

tglx


Re: [RFC PATCH] Add split_lock

2021-04-09 Thread Thomas Gleixner
On Thu, Apr 08 2021 at 12:18, Peter Zijlstra wrote:
> On Thu, Apr 08, 2021 at 06:23:38AM +0100, Matthew Wilcox (Oracle) wrote:
>> bit_spinlocks are horrible on RT because there's absolutely nowhere
>> to put the mutex to sleep on.  They also do not participate in lockdep
>> because there's nowhere to put the map.
>> 
>> Most (all?) bit spinlocks are actually a split lock; logically they
>> could be treated as a single spinlock, but for performance, we want to
>> split the lock over many objects.  Introduce the split_lock as somewhere
>> to store the lockdep map and as somewhere that the RT kernel can put
>> a mutex.  It may also let us store a ticket lock for better performance
>> on non-RT kernels in the future, but I have left the current cpu_relax()
>> implementation intact for now.
>
> I think I like it, but I'm not sure it'll work for RT as is. It's a bit
> like qrwlock in that it only uses the internal (split) lock for
> contention, but that doesn't work for PI.
>
> I've not recently looked at RT, but I think they simply used to bloat a
> number of the data structures with a real lock. Sebastian and Thomas
> will know better.

Correct, but it depends on the nature of the bit spinlock and the lock
held times. Some of them we just keep as is as it's not worth the
trouble.

Thanks,

tglx


Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

2021-04-07 Thread Thomas Gleixner
Greg,

On Fri, Apr 02 2021 at 09:54, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +, Luis Chamberlain wrote:
>> As for the sysfs deadlock possible with drivers, this fixes it in a generic 
>> way:
>> 
>> commit fac43d8025727a74f80a183cc5eb74ed902a5d14
>> Author: Luis Chamberlain 
>> Date:   Sat Mar 27 14:58:15 2021 +
>> 
>> sysfs: add optional module_owner to attribute
>> 
>> This is needed as otherwise the owner of the attribute
>> or group read/store might have a shared lock used on driver removal,
>> and deadlock if we race with driver removal.
>> 
>> Signed-off-by: Luis Chamberlain 
>
> No, please no.  Module removal is a "best effort", if the system dies
> when it happens, that's on you.  I am not willing to expend extra energy
> and maintenance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.
>
> Lock data, not code please.  Trying to tie data structure's lifespans
> to the lifespan of code is a tangled mess, and one that I do not want to
> add to in any form.
>
> sorry,

Sorry, but you are fundamentally off track here. This has absolutely
nothing to do with module removal.

The point is that module removal is the reverse operation of module
insertion. So far so good.

But module insertion can fail. So if you have nested functionalities
which hang off or are enabled by module insertion then any failure in that
sequence has to be able to roll back and clean up properly no matter
what.

Which in turn makes module removal a reverse operation of module
insertion.

If you think otherwise, then please provide a proper plan for how nested
operations like sysfs - not to talk about more complex things like multi
instance discovery which can happen inside a module insertion sequence -
can be properly rolled back.

Just declaring that rmmod is evil does not cut it. rmmod is the least of
the problems. If that fails, then a lot of rollback, failure handling
mechanisms are missing in the setup path already.

Anything which cannot cleanly rollback no matter whether the fail or
rollback request happens at insertion time or later is broken by design.

So either you declare module removal as dysfunctional or you stop making
up semantically ill-defined and therefore useless claims about it.

Your argument in:

 https://lore.kernel.org/linux-block/ygbnplkxfwpy0...@kroah.com/

 "Lock data, not code please.  Trying to tie data structure's lifespans
  to the lifespan of code is a tangled mess, and one that I do not want to
  add to in any form"

is just useless blurb because the fundamental purpose of discovery code
is to create the data structures which are tied to the code which is
associated to it.

Please stop this 'module removal is not supported' nonsense unless you
can prove a complete independence of module init/discovery code from the
subsequently discovered entities depending on it.

Thanks,

tglx



Re: [PATCH] kernel/time: Feedback reply for hr_sleep syscall, a fine-grained sleep service

2021-04-07 Thread Thomas Gleixner
Marco!

On Wed, Apr 07 2021 at 11:32, Marco Faltelli wrote:

> Current sleep services (nanosleep) provide sleep periods very far from
> the expectations when scheduling microsecond-scale timers. On our
> testbed, using rdtscp() before and after a nanosleep() syscall to
> measure the effective elapsed time with a 1us timer, we got ~59us.
> Even with larger timeout periods, the difference is still evident
> (e.g., with a 100us timer, we measured ~158us of elapsed time).

So the delta is a constant of ~50us, right?

> We believe that one of the reasons is the use of the timespec
> structure, that needs to be copied for user to kernel and then
> converted into a single-value representation.

Interesting.

> In our work Metronome
> (https://dl.acm.org/doi/pdf/10.1145/3386367.3432730) we had the need
> for a precise microsecond-granularity sleep service, as nanosleep()
> was far from our needs, so we developed hr_sleep(), a new sleep
> service.

The above 'interesting' assumption made me curious about the deeper
insight, so I went to read. Let me give you a few comments on that
paper.

> In current conventional implementations of the Linux kernel, the support
> for (fine-grain) sleep periods of threads is based on the nanosleep()
> system call, which has index 35 in the current specification of the
> x86-64/Linux system call table.

There is also clock_nanosleep(2) for completeness' sake...

> The actual execution path of this system call, particularly at kernel
> side, is shown in Figure 1a. When entering kernel mode the thread exploits 
> two main kernel
> level subsystems. One is the scheduling subsystem, which allows managing
> the run-queue of threads that can be rescheduled in CPU. The other one
> is the high-resolution timers subsystem, which allows posting
> timer-expiration requests to the Linux kernel timer wheel.

The timer wheel is not involved in this at all. If your timer would end
up on the timer wheel your observed latencies would be in the
milliseconds range not in the 50usec range.

> The latter is a data structure that keeps the ordered set of timer
> expiration requests, so that each time one of these timers expires the
> subsequent timer expiration request is activated.

Not exactly how the timer wheel in the kernel works, but that's
irrelevant because it is not involved in this.

> The expiration of a timer is associated with the interrupt coming from
> the High Precision Event Timer (HPET) on board of x86 processors.

You must have a really old and crappy machine to test on. HPET is
avoided on any CPU less than 10 years old and pretty much irrelevant or
a performance horror on any machine which has more than 4 cores.

> In any case, independently of whether preemption will occur, the CPU
> cycles spent for that preamble lead to a delay for the post of the
> timer- expiration request to the timer wheel, leading the thread to
> actually start its timed-sleep phase with a delay.

I would have expected a proper measurement of the delay which is caused
by that processing in the paper, but in absence of that I instrumented
it for you:

First of all I implemented the thing myself, because the crud you posted
fails to compile (see below) and for other reasons which I spare myself
to explain because of that.

The regular clock_nanosleep(), compared to the hacked up nanosleep_u64()
which just takes a u64 nanosecond value as argument instead of the
timespec pointer, has an overhead of ~64 CPU clock cycles on average
according to 'perf stat', which amounts to a whopping 32 nanoseconds per
syscall on my test machine running at 2 GHz.

That's _three_ orders of magnitude off from 50us. There goes the theory.

So now where are these 50 microseconds coming from?

There is no massive software/hardware induced overhead caused by the
timespec pointer handling at all; the 50 microseconds are simply the
default timer slack value, which is added to the expiry value to allow
better batching of timer expiries.

That slack is automatically zero for tasks in real time scheduling
classes and can also be modified by a system wide setting and per
process via prctl(PR_SET_TIMERSLACK, ...), unless a system policy
prevents that.

That prctl unfortunately has a severe limitation: it does not allow
setting the slack value to 0. The minimum value is 1 nanosecond, and I'm
happy to discuss that when you come up with a proper scientific proof
that that _one_ nanosecond matters.
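
For completeness, a minimal userspace sketch (not from the thread) of using that prctl before sleeping; error handling is omitted and the delay is assumed to be below one second:

#include <time.h>
#include <sys/prctl.h>

static void sleep_us_tight(long us)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = us * 1000L };

	/* 1ns is the effective minimum slack; 0 resets to the default */
	prctl(PR_SET_TIMERSLACK, 1UL, 0UL, 0UL, 0UL);
	clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
}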

As a limited excuse I concede that the timer slack is barely
documented, but I'm thoroughly surprised that this has not been figured
out, and that instead weird theories about the syscall entry code
implications make up several pages of handwaving content in a published
and 'reviewed' academic paper.

So here is the comparison between the regular clock_nanosleep() with the
prctl() used and the u64 based variant which sets the slack to 0 with
1e6 samples each for a delay of 50us.

sys_clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL)

Re: [syzbot] WARNING: suspicious RCU usage in get_timespec64

2021-04-04 Thread Thomas Gleixner
On Sun, Apr 04 2021 at 12:05, syzbot wrote:

Cc + ...

> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:5e46d1b7 reiserfs: update reiserfs_xattrs_initialized() co..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1125f831d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> dashboard link: https://syzkaller.appspot.com/bug?extid=88e4f02896967fe1ab0d
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+88e4f02896967fe1a...@syzkaller.appspotmail.com
>
> =
> WARNING: suspicious RCU usage
> 5.12.0-rc5-syzkaller #0 Not tainted
> -
> kernel/sched/core.c:8294 Illegal context switch in RCU-sched read-side 
> critical section!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 0
> 3 locks held by syz-executor.4/8418:
>  #0: 8880751d2b28 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x98/0x14a0 kernel/sched/core.c:3345
>  #1: 8880b9d35258 (&rq->lock){-.-.}-{2:2}, at: rq_lock kernel/sched/sched.h:1321 [inline]
>  #1: 8880b9d35258 (&rq->lock){-.-.}-{2:2}, at: ttwu_queue kernel/sched/core.c:3184 [inline]
>  #1: 8880b9d35258 (&rq->lock){-.-.}-{2:2}, at: try_to_wake_up+0x5e6/0x14a0 kernel/sched/core.c:3464
>  #2: 8880b9d1f948 (&per_cpu_ptr(group->pcpu, cpu)->seq){-.-.}-{0:0}, at: psi_task_change+0x142/0x220 kernel/sched/psi.c:807
>
> stack backtrace:
> CPU: 0 PID: 8418 Comm: syz-executor.4 Not tainted 5.12.0-rc5-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  ___might_sleep+0x266/0x2c0 kernel/sched/core.c:8294
>  __might_fault+0x6e/0x180 mm/memory.c:5018
>  _copy_from_user+0x27/0x180 lib/usercopy.c:13
>  copy_from_user include/linux/uaccess.h:192 [inline]
>  get_timespec64+0x75/0x220 kernel/time/time.c:787
>  __do_sys_clock_nanosleep kernel/time/posix-timers.c:1257 [inline]
>  __se_sys_clock_nanosleep kernel/time/posix-timers.c:1245 [inline]
>  __x64_sys_clock_nanosleep+0x1bb/0x430 kernel/time/posix-timers.c:1245
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x48a621
> Code: 24 0c 89 3c 24 48 89 4c 24 18 e8 aa e7 ff ff 4c 8b 54 24 18 48 8b 54 24 
> 10 41 89 c0 8b 74 24 0c 8b 3c 24 b8 e6 00 00 00 0f 05 <44> 89 c7 48 89 04 24 
> e8 e3 e7 ff ff 48 8b 04 24 eb 97 66 2e 0f 1f
> RSP: 002b:7fffe59fbd50 EFLAGS: 0293 ORIG_RAX: 00e6
> RAX: ffda RBX: 0294 RCX: 0048a621
> RDX: 7fffe59fbd90 RSI:  RDI: 
> RBP: 7fffe59fbe2c R08:  R09: 7fffe5b8a090
> R10:  R11: 0293 R12: 0032
> R13: 0005717a R14: 0003 R15: 7fffe59fbe90
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: [PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide module parameters to inject delays in watchdog

2021-04-02 Thread Thomas Gleixner
On Fri, Apr 02 2021 at 13:31, paulmck wrote:

The subsystem prefix does not parse:

[PATCH v6 clocksource] Do not mark clocks unstable dueclocksource: Provide 
module parameters to inject delays in watchdog

I'll look at the actual code changes after the Easter break.

Thanks,

tglx


Re: [PATCH v9 0/6] Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Thomas Gleixner
On Thu, Apr 01 2021 at 09:37, Will Deacon wrote:
> On Wed, Mar 31, 2021 at 01:54:52PM -0700, Kees Cook wrote:
>> Hi Will (and Mark and Catalin),
>> 
>> Can you take this via the arm64 tree for v5.13 please? Thomas has added
>> his Reviewed-by, so it only leaves arm64's. :)
>
> Sorry, these got mixed up in my inbox so I just replied to v7 and v8 and
> then noticed v9. Argh!
>
> However, I think the comments still apply: namely that the dummy "=m" looks
> dangerous to me

> https://lore.kernel.org/r/20210401083034.GA8554@willie-the-truck

Hrmpf, didn't think about that.

> and I think you're accessing pcpu variables with preemption enabled
> for the arm64 part:

That's my fault. On x86 this is invoked right after coming up into C
code and _before_ enabling interrupts, which is why I suggested not to
use the wrapped one. raw_cpu_read() should be fine everywhere.
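
A sketch of that point, assuming the v8 macro quoted elsewhere in this
thread (the shape of the fix, not the actual v9 hunk): with preemption
possibly enabled, the raw accessor avoids the debug checks, and
occasionally reading another CPU's value merely picks a different
random offset, which is harmless:

  /* raw access: no preemption/debug check required */
  u32 offset = raw_cpu_read(kstack_offset);
  u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));

  /* keep the allocation alive, as in the quoted macro */
  asm volatile("" : "=m"(*ptr) :: "memory");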

Thanks,

tglx


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 14:54, Kees Cook wrote:
> On Wed, Mar 31, 2021 at 09:53:26AM +0200, Thomas Gleixner wrote:
>> On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
>> > +/*
>> > + * Do not use this anywhere else in the kernel. This is used here because
>> > + * it provides an arch-agnostic way to grow the stack with correct
>> > + * alignment. Also, since this use is being explicitly masked to a max of
>> > + * 10 bits, stack-clash style attacks are unlikely. For more details see
>> > + * "VLAs" in Documentation/process/deprecated.rst
>> > + * The asm statement is designed to convince the compiler to keep the
>> > + * allocation around even after "ptr" goes out of scope.
>> 
>> Nit. That explanation of "ptr" might be better placed right at the
>> add_random...() macro.
>
> Ah, yes! Fixed in v9.

Hmm, looking at V9 the "ptr" thing got lost 

> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> +#define add_random_kstack_offset() do {					\

> Do you want to take this via -tip (and leave off the arm64 patch until
> it is acked), or would you rather it go via arm64? (I've sent v9 now...)

Either way is fine.

Thanks,

tglx


Re: [PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to local_lock

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 19:42, Thomas Gleixner wrote:
> On Wed, Mar 31 2021 at 12:01, Mel Gorman wrote:
>> On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
>> @@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
>>  
>>  pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>>  
>> -preempt_disable();
>>  for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>>  if (pzstats->vm_stat_diff[i]) {
>>  int v;
>>  
>> -v = pzstats->vm_stat_diff[i];
>> -pzstats->vm_stat_diff[i] = 0;
>> +v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);
>
> Confused. pzstats is not a percpu pointer. zone->per_cpu_zonestats is.
>
> But @cpu is not necessarily the current CPU.

@cpu _is_ definitely NOT the current CPU as this is invoked from the
hotplug callback _after_ @cpu went dead. @cpu is dead and won't update
these things concurrently. So I'm even more confused :)
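
A sketch to illustrate the concern (not a proposed patch): the
this_cpu_*() accessors always act on the CPU executing the fold, while
the deltas being folded belong to @cpu and are reached through a plain
pointer obtained via per_cpu_ptr():

  struct per_cpu_zonestat *pzstats;
  int v;

  /* @cpu's deltas, not those of the CPU running this code */
  pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

  /*
   * pzstats is a plain pointer, so this_cpu_xchg() is the wrong
   * accessor here. As @cpu is already dead, nothing updates the
   * deltas concurrently and a plain read-and-clear suffices.
   */
  v = pzstats->vm_stat_diff[i];
  pzstats->vm_stat_diff[i] = 0;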

Thanks,

tglx




Re: [PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to local_lock

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 12:01, Mel Gorman wrote:
> On Wed, Mar 31, 2021 at 11:55:56AM +0200, Thomas Gleixner wrote:
> @@ -887,13 +887,11 @@ void cpu_vm_stats_fold(int cpu)
>  
>   pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>  
> - preempt_disable();
>   for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>   if (pzstats->vm_stat_diff[i]) {
>   int v;
>  
> - v = pzstats->vm_stat_diff[i];
> - pzstats->vm_stat_diff[i] = 0;
> + v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

Confused. pzstats is not a percpu pointer. zone->per_cpu_zonestats is.

But @cpu is not necessarily the current CPU.

Thanks,

tglx


Re: [PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to local_lock

2021-03-31 Thread Thomas Gleixner
On Mon, Mar 29 2021 at 13:06, Mel Gorman wrote:
> There is a lack of clarity of what exactly local_irq_save/local_irq_restore
> protects in page_alloc.c . It conflates the protection of per-cpu page
> allocation structures with per-cpu vmstat deltas.
>
> This patch protects the PCP structure using local_lock which
> for most configurations is identical to IRQ enabling/disabling.
> The scope of the lock is still wider than it should be but this is
> decreased in later patches. The per-cpu vmstat deltas are protected by
> preempt_disable/preempt_enable where necessary instead of relying on
> IRQ disable/enable.

Yes, this goes into the right direction and I really appreciate the
scoped protection for clarity sake.
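
For reference, a minimal sketch of the local_lock pattern (the struct
and function names are made up, not taken from the patch): the lock
lives in the per-CPU data it protects, and on non-RT configurations the
lock/unlock pair boils down to plain IRQ disable/enable:

  #include <linux/local_lock.h>

  struct pcp_pages {
          local_lock_t lock;
          /* per-cpu page lists ... */
  };

  static DEFINE_PER_CPU(struct pcp_pages, pcp_pages) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void touch_pcp(void)
  {
          unsigned long flags;

          /* scoped protection of this CPU's lists only */
          local_lock_irqsave(&pcp_pages.lock, flags);
          /* ... manipulate this_cpu_ptr(&pcp_pages) ... */
          local_unlock_irqrestore(&pcp_pages.lock, flags);
  }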

>  #ifdef CONFIG_MEMORY_HOTREMOVE
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8a8f1a26b231..01b74ff73549 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -887,6 +887,7 @@ void cpu_vm_stats_fold(int cpu)
>  
>   pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>  
> + preempt_disable();

What's the reason for the preempt_disable() here? A comment would be
appreciated.

>   for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>   if (pzstats->vm_stat_diff[i]) {
>   int v;
> @@ -908,6 +909,7 @@ void cpu_vm_stats_fold(int cpu)
>   global_numa_diff[i] += v;
>   }
>  #endif
> + preempt_enable();
>   }
>  
>   for_each_online_pgdat(pgdat) {
> @@ -941,6 +943,7 @@ void drain_zonestat(struct zone *zone, struct 
> per_cpu_zonestat *pzstats)
>  {
>   int i;
>  
> + preempt_disable();

Same here.

>   for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
>   if (pzstats->vm_stat_diff[i]) {
>   int v = pzstats->vm_stat_diff[i];
> @@ -959,6 +962,7 @@ void drain_zonestat(struct zone *zone, struct 
> per_cpu_zonestat *pzstats)
>   atomic_long_add(v, _numa_stat[i]);
>   }
>  #endif
> + preempt_enable();

Thanks,

tglx


Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 09:52, Mel Gorman wrote:
> Ingo, Thomas or Peter, is there any chance one of you could take a look
> at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
> local_lock" from this series? It's partially motivated by PREEMPT_RT. More
> details below.

Sure.


Re: [PATCH v2] tick/broadcast: Allow later registered device enter oneshot mode

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 09:10, Jindong Yue wrote:
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -47,6 +47,7 @@ static inline void tick_resume_broadcast_oneshot(struct 
> clock_event_device *bc)
>  static inline void tick_broadcast_oneshot_offline(unsigned int cpu) { }
>  # endif
>  #endif
> +static void tick_handle_oneshot_broadcast(struct clock_event_device *dev);

Leftover ...
  
>  /*
>   * Debugging: see timer_list.c
> @@ -107,6 +108,19 @@ void tick_install_broadcast_device(struct 
> clock_event_device *dev)
>   tick_broadcast_device.evtdev = dev;
>   if (!cpumask_empty(tick_broadcast_mask))
>   tick_broadcast_start_periodic(dev);
> +
> + if (!(dev->features & CLOCK_EVT_FEAT_ONESHOT))
> + return;
> +
> + /*
> +  * If system already runs in oneshot mode, switch new registered

s/If system/If the system/ and s/new registered/the newly registered/

> +  * broadcast device to oneshot mode explicitly if possiable.

s/possiable/possible/

But the 'if possible' makes no sense here. The above check for
CLOCK_EVT_FEAT_ONESHOT established that it _is_ possible. So just remove
the 'if ...'.

> +  */
> + if (tick_broadcast_oneshot_active()) {
> + tick_broadcast_switch_to_oneshot();
> + return;
> + }

Thanks,

tglx


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-31 Thread Thomas Gleixner
On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + * The asm statement is designed to convince the compiler to keep the
> + * allocation around even after "ptr" goes out of scope.

Nit. That explanation of "ptr" might be better placed right at the
add_random...() macro.

> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> +#define add_random_kstack_offset() do {					\
> +	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
> +				&randomize_kstack_offset)) {		\
> + u32 offset = __this_cpu_read(kstack_offset);\
> + u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));  \
> + asm volatile("" : "=m"(*ptr) :: "memory");  \
> + }   \
> +} while (0)

Other than that.

Reviewed-by: Thomas Gleixner 


Re: [PATCH v8 4/6] x86/entry: Enable random_kstack_offset support

2021-03-31 Thread Thomas Gleixner
On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:

> Allow for a randomized stack offset on a per-syscall basis, with roughly
> 5-6 bits of entropy, depending on compiler and word size. Since the
> method of offsetting uses macros, this cannot live in the common entry
> code (the stack offset needs to be retained for the life of the syscall,
> which means it needs to happen at the actual entry point).
>
> Signed-off-by: Kees Cook 

Reviewed-by: Thomas Gleixner 


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-30 Thread Thomas Gleixner
Len,

On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:
> Let me know if this problem description is fair:
>
> Many-core Xeon servers will support AMX, and when I run an AMX application
> on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
> If Linux cpuidle requests C6, the hardware will demote to C1E.
>
> The concern is that a core in C1E will negatively impact power of
> self, or performance
> of a neighboring core.
>
> This is what we are talking about, right?

Correct.

> I'm delighted that there are Xeon customers, who care about this power 
> savings.
> Unfortunately, they are the exception, not the rule.

That may be true or not. The point is that there is some side effect and
from a correctness point of view it needs to be addressed.

>>- Use TILERELEASE on context switch after XSAVES?
>
> Yes, that would be perfectly reasonable.
>
>>- Any other mechanism on context switch
>
> XRESTOR of a context with INIT=1 would also do it.
>
>>- Clear XFD[18] when going idle and issue TILERELEASE depending
>>  on the last state
>
> I think you mean to *set* XFD.
> When the task touched AMX, he took a #NM, and we cleared XFD for that task.
> So when we get here, XFD is already clear (unarmed).
> Nevertheless, the setting of XFD is moot here.

No. We set XFD when the task which used AMX schedules out. If the CPU
reaches idle without going back to user space then XFD is still set and
AMX INIT still 0. So my assumption was that in order to issue
TILERELEASE before going idle, you need to clear XFD[18] first, but I
just saw in the spec that it is not necessary.

>>- Use any other means to set the thing back into INIT=1 state when
>>  going idle
>
> TILERELEASE and XRESTOR are the tools in the toolbox, if necessary.
>
>> There is no option 'shrug and ignore' unfortunately.
>
> I'm not going to say it is impossible that this path will matter.
> If some terrible things go wrong with the hardware, and the hardware
> is configured and used in a very specific way, yes, this could matter.

So then let me summarize for the bare metal case:

   1) The paragraph in the specification is unclear and needs to be
  rephrased.

  What I understood from your explanations so far:

  When AMX is disabled by clearing XCR0[18:17], by clearing
  CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
  negative side effects due to AMX INIT=0 as long as the CPU is
  executing.

  The only possible side effect is when the CPU goes idle because
  AMX INIT=0 limits C states.

   2) As a consequence of #1 there is no further action required on
  context switch when XFD[18] is set.

   3) When the CPU goes idle with AMX INIT=0 then the idle code should
  invoke TILERELEASE. Maybe the condition is not even necessary and
  TILERELEASE can be invoked unconditionally before trying to enter
  idle.

If that's correct, then this should be part of the next series.
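
A rough sketch of #3 under those assumptions (the helper names and the
feature check are illustrative, not taken from this thread):

  static inline void tile_release(void)
  {
          /* TILERELEASE; raw opcode in case the assembler lacks the mnemonic */
          asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");
  }

  static void amx_idle_prepare(void)
  {
          /* unconditional, as discussed: cheap and avoids tracking INIT state */
          if (boot_cpu_has(X86_FEATURE_AMX_TILE))
                  tile_release();
  }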

> In the grand scheme of things, this is a pretty small issue, say,
> compared to the API discussion.

No argument about that.

Thanks,

tglx


[tip: x86/apic] x86/vector: Add a sanity check to prevent IRQ2 allocations

2021-03-29 Thread tip-bot2 for Thomas Gleixner
The following commit has been merged into the x86/apic branch of tip:

Commit-ID: 9a98bc2cf08a095367449b3548c3d9ad4ad2cd20
Gitweb:
https://git.kernel.org/tip/9a98bc2cf08a095367449b3548c3d9ad4ad2cd20
Author:Thomas Gleixner 
AuthorDate:Thu, 18 Mar 2021 20:26:48 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 30 Mar 2021 00:39:12 +02:00

x86/vector: Add a sanity check to prevent IRQ2 allocations

To prevent another incidental removal of the IRQ2 ignore logic in the
IO/APIC code going unnoticed, add a sanity check. Add some commentary at
the other place which ignores IRQ2 while at it.

Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20210318192819.795280...@linutronix.de

---
 arch/x86/kernel/apic/vector.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3c9c749..9b75a70 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -543,6 +543,14 @@ static int x86_vector_alloc_irqs(struct irq_domain 
*domain, unsigned int virq,
if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
return -ENOSYS;
 
+   /*
+* Catch any attempt to touch the cascade interrupt on a PIC
+* equipped system.
+*/
+   if (WARN_ON_ONCE(info->flags & X86_IRQ_ALLOC_LEGACY &&
+virq == PIC_CASCADE_IR))
+   return -EINVAL;
+
for (i = 0; i < nr_irqs; i++) {
irqd = irq_domain_get_irq_data(domain, virq + i);
BUG_ON(!irqd);
@@ -745,6 +753,11 @@ void __init lapic_assign_system_vectors(void)
 
/* Mark the preallocated legacy interrupts */
for (i = 0; i < nr_legacy_irqs(); i++) {
+   /*
+* Don't touch the cascade interrupt. It's unusable
+* on PIC equipped machines. See the large comment
+* in the IO/APIC code.
+*/
if (i != PIC_CASCADE_IR)
irq_matrix_assign(vector_matrix, ISA_IRQ_VECTOR(i));
}


Re: [PATCH] tick/broadcast: Allow later probed device enter oneshot mode

2021-03-29 Thread Thomas Gleixner
On Mon, Mar 29 2021 at 13:25, Jindong Yue wrote:
 
>  /*
>   * Debugging: see timer_list.c
> @@ -115,8 +116,20 @@ void tick_install_broadcast_device(struct 
> clock_event_device *dev)
>* notification the systems stays stuck in periodic mode
>* forever.
>*/
> - if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
> + if (dev->features & CLOCK_EVT_FEAT_ONESHOT) {
>   tick_clock_notify();

If the kernel runs in oneshot mode already, then calling
tick_clock_notify() does not make sense.

> + /*
> +  * If new broadcast device is installed after high resolution
> +  * timers enabled, it can not switch to oneshot mode anymore.

This is not restricted to high resolution timers. The point is that
the mode is ONESHOT which might be either HIGHRES or NOHZ or both.

> +  */
> + if (tick_broadcast_oneshot_active() &&
> + dev->event_handler != tick_handle_oneshot_broadcast) {

How would that condition ever be false for a newly installed device?

> + tick_broadcast_switch_to_oneshot();
> + }

So what you really want is:

if (!(dev->features & CLOCK_EVT_FEAT_ONESHOT))
return;

if (tick_broadcast_oneshot_active()) {
tick_broadcast_switch_to_oneshot();
return;
}

tick_clock_notify();

Thanks,

tglx


  1   2   3   4   5   6   7   8   9   10   >