Re: [PATCH] powerpc/powernv/prd: Validate whether address to be mapped is part of system RAM

2019-10-03 Thread Jeremy Kerr

Hi Vaidy,


The current topic is who owns setting up the ATT bits for that piece
of memory.  It is the kernel today.  Kernel decides to set this up as
normal memory or I/O memory and sets the bits in page table entry.


Or, what if there's a range of address-space that isn't backed by system
RAM (say, some MMIO-mapped hardware) that we want to expose to a future
HBRT implementation? This change will block that.

The kernel doesn't know what is and is not valid for a HBRT mapping, so
it has no reason to override what's specified in the device tree. We've
designed this so that the kernel provides the mechanism for mapping
pages, and not the policy of which pages can be mapped.


The features altered are cache inhibit and guarding which affects
ability to fetch instructions.  If we allow HBRT to reside in an I/O
memory, the we need to tell kernel that it is ok to allow caching and
instruction execution in that region and accordingly change the ATT
bits.


But this isn't only about the HBRT range itself (ie., the memory 
containing the HBRT binary). Everything that HBRT needs to map will come 
through this path. We may not need to fetch instructions from those ranges.



This patch does not block a working function, but actually makes
debugging a failed case easier.  The failing scenario without this
check is such that HBRT cannot fetch from the region of memory and
loops in minor page faults doing nothing.


Yep, that's not great, but the alternative means applying this kernel 
policy, which we can't guarantee is correct.


That is, unless the page protection bits mean that this won't work 
anyway, but we can probably fix that without a kernel policy, by 
applying the appropriate pgprot_t, perhaps.



As Vasant mentioned hostboot team will add code to relocate the HBRT
to the right place.  Addressing your concern, if we end up allowing
HBRT in non system-RAM area


Not just HBRT, but anything that HBRT maps too.


we need to add some more flags in device
tree to instruct the driver to force change the page protection bits
as page_prot = pgprot_cached(page_prot);


Doesn't phys_mem_access_prot() handle that for us? Or do I have my 
_noncached/_cached logic inverted?


Cheers,


Jeremy


Re: [PATCH] powerpc/kvm: Fix kvmppc_vcore->in_guest value in kvmhv_switch_to_host

2019-10-03 Thread Alistair Popple
Reviewed-by: Alistair Popple 

On Friday, 4 October 2019 12:53:17 PM AEST Jordan Niethe wrote:
> kvmhv_switch_to_host() in arch/powerpc/kvm/book3s_hv_rmhandlers.S needs
> to set kvmppc_vcore->in_guest to 0 to signal secondary CPUs to continue.
> This happens after resetting the PCR. Before commit 13c7bb3c57dc
> ("powerpc/64s: Set reserved PCR bits"), r0 would always be 0 before it
> was stored to kvmppc_vcore->in_guest. However because of this change in
> the commit:
> 
> /* Reset PCR */
> ld  r0, VCORE_PCR(r5)
> -   cmpdi   r0, 0
> +   LOAD_REG_IMMEDIATE(r6, PCR_MASK)
> +   cmpld   r0, r6
> beq 18f
> -   li  r0, 0
> -   mtspr   SPRN_PCR, r0
> +   mtspr   SPRN_PCR, r6
>  18:
> /* Signal secondary CPUs to continue */
> stb r0,VCORE_IN_GUEST(r5)
> 
> We are no longer comparing r0 against 0 and loading it with 0 if it
> contains something else. Hence when we store r0 to
> kvmppc_vcore->in_guest, it might not be 0.  This means that secondary
> CPUs will not be signalled to continue. Those CPUs get stuck and errors
> like the following are logged:
> 
> KVM: CPU 1 seems to be stuck
> KVM: CPU 2 seems to be stuck
> KVM: CPU 3 seems to be stuck
> KVM: CPU 4 seems to be stuck
> KVM: CPU 5 seems to be stuck
> KVM: CPU 6 seems to be stuck
> KVM: CPU 7 seems to be stuck
> 
> This can be reproduced with:
> $ for i in `seq 1 7` ; do chcpu -d $i ; done ;
> $ taskset -c 0 qemu-system-ppc64 -smp 8,threads=8 \
>-M pseries,accel=kvm,kvm-type=HV -m 1G -nographic -vga none \
>-kernel vmlinux -initrd initrd.cpio.xz
> 
> Fix by making sure r0 is 0 before storing it to kvmppc_vcore->in_guest.
> 
> Fixes: 13c7bb3c57dc ("powerpc/64s: Set reserved PCR bits")
> Reported-by: Alexey Kardashevskiy 
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 74a9cfe84aee..faebcbb8c4db 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1921,6 +1921,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   mtspr   SPRN_PCR, r6
>  18:
>   /* Signal secondary CPUs to continue */
> + li  r0, 0
>   stb r0,VCORE_IN_GUEST(r5)
>  19:  lis r8,0x7fff   /* MAX_INT@h */
>   mtspr   SPRN_HDEC,r8
> 






[PATCH] powerpc/kvm: Fix kvmppc_vcore->in_guest value in kvmhv_switch_to_host

2019-10-03 Thread Jordan Niethe
kvmhv_switch_to_host() in arch/powerpc/kvm/book3s_hv_rmhandlers.S needs
to set kvmppc_vcore->in_guest to 0 to signal secondary CPUs to continue.
This happens after resetting the PCR. Before commit 13c7bb3c57dc
("powerpc/64s: Set reserved PCR bits"), r0 would always be 0 before it
was stored to kvmppc_vcore->in_guest. However because of this change in
the commit:

/* Reset PCR */
ld  r0, VCORE_PCR(r5)
-   cmpdi   r0, 0
+   LOAD_REG_IMMEDIATE(r6, PCR_MASK)
+   cmpld   r0, r6
beq 18f
-   li  r0, 0
-   mtspr   SPRN_PCR, r0
+   mtspr   SPRN_PCR, r6
 18:
/* Signal secondary CPUs to continue */
stb r0,VCORE_IN_GUEST(r5)

We are no longer comparing r0 against 0 and loading it with 0 if it
contains something else. Hence when we store r0 to
kvmppc_vcore->in_guest, it might not be 0.  This means that secondary
CPUs will not be signalled to continue. Those CPUs get stuck and errors
like the following are logged:

KVM: CPU 1 seems to be stuck
KVM: CPU 2 seems to be stuck
KVM: CPU 3 seems to be stuck
KVM: CPU 4 seems to be stuck
KVM: CPU 5 seems to be stuck
KVM: CPU 6 seems to be stuck
KVM: CPU 7 seems to be stuck

This can be reproduced with:
$ for i in `seq 1 7` ; do chcpu -d $i ; done ;
$ taskset -c 0 qemu-system-ppc64 -smp 8,threads=8 \
   -M pseries,accel=kvm,kvm-type=HV -m 1G -nographic -vga none \
   -kernel vmlinux -initrd initrd.cpio.xz

Fix by making sure r0 is 0 before storing it to kvmppc_vcore->in_guest.

Fixes: 13c7bb3c57dc ("powerpc/64s: Set reserved PCR bits")
Reported-by: Alexey Kardashevskiy 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 74a9cfe84aee..faebcbb8c4db 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1921,6 +1921,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
mtspr   SPRN_PCR, r6
 18:
/* Signal secondary CPUs to continue */
+   li  r0, 0
stb r0,VCORE_IN_GUEST(r5)
 19:lis r8,0x7fff   /* MAX_INT@h */
mtspr   SPRN_HDEC,r8
-- 
2.20.1



Re: [PATCH] selftests/powerpc: Fix compiling error on tlbie_test due to newer gcc

2019-10-03 Thread Michael Ellerman
On Thu, 2019-10-03 at 21:10:10 UTC, "Desnes A. Nunes do Rosario" wrote:
> Newer versions of GCC demand that the size of the string to be copied must
> be explicitly smaller than the size of the destination. Thus, the NULL
> char has to be taken into account on strncpy.
> 
> This will avoid the following compiling error:
> 
>   tlbie_test.c: In function 'main':
>   tlbie_test.c:639:4: error: 'strncpy' specified bound 100 equals destination 
> size [-Werror=stringop-truncation]
>   strncpy(logdir, optarg, LOGDIR_NAME_SIZE);
>   ^
>   cc1: all warnings being treated as errors
> 
> Signed-off-by: Desnes A. Nunes do Rosario 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/3439595d5b85f0af565f5a58e495d885537fb227

cheers


Re: [PATCH] powerpc/pseries: Remove confusing warning message.

2019-10-03 Thread Michael Ellerman
On Tue, 2019-10-01 at 13:29:28 UTC, Laurent Dufour wrote:
> Since the commit 1211ee61b4a8 ("powerpc/pseries: Read TLB Block Invalidate
> Characteristics"), a warning message is displayed when booting a guest on
> top of KVM:
> 
> lpar: arch/powerpc/platforms/pseries/lpar.c 
> pseries_lpar_read_hblkrm_characteristics Error calling get-system-parameter 
> (0xfffd)
> 
> This message is displayed because this hypervisor is not supporting the
> H_BLOCK_REMOVE hcall and thus is not exposing the corresponding feature.
> 
> Reading the TLB Block Invalidate Characteristics should not be done if the
> feature is not exposed.
> 
> Fixes: 1211ee61b4a8 ("powerpc/pseries: Read TLB Block Invalidate 
> Characteristics")
> Reported-by: Stephen Rothwell 
> Signed-off-by: Laurent Dufour 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/9123b7914823a7d81dd0e5d8ae40bc6f1ee869c8

cheers


Re: linux-next: build failure after merge of the powerpc tree

2019-10-03 Thread Michael Ellerman
On Mon, 2019-09-30 at 00:13:42 UTC, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the powerpc tree, today's linux-next build (powerpc64
> allnoconfig) failed like this:
> 
> arch/powerpc/mm/book3s64/pgtable.c: In function 'flush_partition':
> arch/powerpc/mm/book3s64/pgtable.c:216:3: error: implicit declaration of fu=
> nction 'radix__flush_all_lpid_guest'; did you mean 'radix__flush_all_lpid'?=
>  [-Werror=3Dimplicit-function-declaration]
>   216 |   radix__flush_all_lpid_guest(lpid);
>   |   ^~~
>   |   radix__flush_all_lpid
> 
> Caused by commit
> 
>   99161de3a283 ("powerpc/64s/radix: tidy up TLB flushing code")
> 
> radix__flush_all_lpid_guest() is only declared for CONFIG_PPC_RADIX_MMU
> which is not set for this build.
> 
> I am not sure why this did not show up earlier (maybe a Kconfig
> change?).
> 
> I added the following hack for today.
> 
> From: Stephen Rothwell 
> Date: Mon, 30 Sep 2019 10:09:17 +1000
> Subject: [PATCH] powerpc/64s/radix: fix for "tidy up TLB flushing code" and
>  !CONFIG_PPC_RADIX_MMU
> 
> Signed-off-by: Stephen Rothwell 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/8996ae8f05a1cc5559120aaec36183edb9c68c50

cheers


Re: [PATCH 1/3] powerpc/book3s64/hash/4k: 4k supports only 16TB linear mapping

2019-10-03 Thread Michael Ellerman
On Tue, 2019-09-17 at 14:57:00 UTC, "Aneesh Kumar K.V" wrote:
> With commit: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in 
> the
> same 0xc range"), we now split the 64TB address range into 4 contexts each of
> 16TB. That implies we can do only 16TB linear mapping. Make sure we don't
> add physical memory above 16TB if that is present in the system.
> 
> Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in 
> thesame 0xc range")
> Reported-by: Cameron Berkenpas 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/e422bd3330e5bc5cd76b063d9fc3a21cd06d9957

cheers


Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Thu, 2019-10-03 at 13:49 -0700, John Hubbard wrote:
> Yes. And to clarify, I was assuming that the changes to mm/gup.c were 
> required in order to accomplish your goals. Given that assumption, I 
> wanted the generic code to be "proper", and that's what that feedback
> is about.

You assumed right. On my counting approach it's necessary count all
'lockless pagetable walks', including the ones in generic code.

And, I think even without the counting approach, it was a good way
focus this 'lockless pagetable walk' routine in a single place.

> 
> Although the other questions about file-backed THP
> make it sound like some rethinking across the board is required now.

Yeap, I need to better understand how the file-backed THP problem
works.

Thanks,

Leonardo Brás


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-03 Thread Leonardo Bras
Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > index 818691846c90..3043ea9812d5 100644
> > > --- a/include/asm-generic/pgtable.h
> > > +++ b/include/asm-generic/pgtable.h
> > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > >  #endif
> > >  #endif
> > >  
> > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct 
> > > *mm)
> > > +{
> > > + unsigned long irq_mask;
> > > +
> > > + if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > + atomic_inc(>lockless_pgtbl_walkers);
> > 
> > This will not work for file backed THP. Also, this is a fairly serious
> > contention point all on its own.
> 
> Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> your (PowerPC) use of mm_cpumask() for that IPI.

Could you please explain it?
I mean, why this breaks tmpfs-thp?

Also, why mm_cpumask() is also broken?

> 
> > > + /*
> > > +  * Interrupts must be disabled during the lockless page table walk.
> > > +  * That's because the deleting or splitting involves flushing TLBs,
> > > +  * which in turn issues interrupts, that will block when disabled.
> > > +  */
> > > + local_irq_save(irq_mask);
> > > +
> > > + /*
> > > +  * This memory barrier pairs with any code that is either trying to
> > > +  * delete page tables, or split huge pages. Without this barrier,
> > > +  * the page tables could be read speculatively outside of interrupt
> > > +  * disabling.
> > > +  */
> > > + smp_mb();
> > 
> > I don't think this is something smp_mb() can guarantee. smp_mb() is
> > defined to order memory accesses, in this case the store of the old
> > flags vs whatever comes after this.
> > 
> > It cannot (in generic) order against completion of prior instructions,
> > like clearing the interrupt enabled flags.
> > 
> > Possibly you want barrier_nospec().
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 
> If there isn't an interrupt (or it happens after disable) it is
> irrelevant.
> 
> Specifically, that serialize-IPI thing wants to ensure in-progress
> lookups are complete, and I can't find a scenario where
> local_irq_disable/enable() needs additional help vs IPIs. The moment an
> interrupt lands it kills speculation and forces things into
> program-order.
> 
> Did you perhaps want something like:
> 
>   if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
>   atomic_inc();
>   smp_mb__after_atomic();
>   }
> 
>   ...
> 
>   if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
>   smp_mb__before_atomic();
>   atomic_dec();
>   }
> 
> To ensure everything happens inside of the increment?
> 

I need to rethink this barrier, but yes. I think that's it. 
It's how it was on v4.

I have changed it because I thought it would be better this way. Well,
it was probably a mistake of my part.

> And I still think all that wrong, you really shouldn't need to wait on
> munmap().

That is something I need to better understand. I mean, before coming
with this patch, I thought exactly this: not serialize when on munmap. 

But on the way I was convinced it would not work on munmap. I need to
recall why, and if it was false to assume this, re-think the whole
solution.

Best regards,

Leonardo Brás


signature.asc
Description: This is a digitally signed message part


[PATCH] selftests/powerpc: Fix compiling error on tlbie_test due to newer gcc

2019-10-03 Thread Desnes A. Nunes do Rosario
Newer versions of GCC demand that the size of the string to be copied must
be explicitly smaller than the size of the destination. Thus, the NULL
char has to be taken into account on strncpy.

This will avoid the following compiling error:

  tlbie_test.c: In function 'main':
  tlbie_test.c:639:4: error: 'strncpy' specified bound 100 equals destination 
size [-Werror=stringop-truncation]
  strncpy(logdir, optarg, LOGDIR_NAME_SIZE);
  ^
  cc1: all warnings being treated as errors

Signed-off-by: Desnes A. Nunes do Rosario 
---
 tools/testing/selftests/powerpc/mm/tlbie_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/mm/tlbie_test.c 
b/tools/testing/selftests/powerpc/mm/tlbie_test.c
index 9868a5ddd847..0d0aee462f8e 100644
--- a/tools/testing/selftests/powerpc/mm/tlbie_test.c
+++ b/tools/testing/selftests/powerpc/mm/tlbie_test.c
@@ -636,7 +636,7 @@ int main(int argc, char *argv[])
nrthreads = strtoul(optarg, NULL, 10);
break;
case 'l':
-   strncpy(logdir, optarg, LOGDIR_NAME_SIZE);
+   strncpy(logdir, optarg, LOGDIR_NAME_SIZE-1);
break;
case 't':
run_time = strtoul(optarg, NULL, 10);
-- 
2.21.0



Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread John Hubbard
On 10/3/19 1:36 PM, Leonardo Bras wrote:
> On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
...
>> This is something entirely specific to Power, you shouldn't be touching
>> generic code at all.
> 
> Up to v4, I was declaring dummy functions so it would not mess up with
> other archs: http://patchwork.ozlabs.org/patch/1168779/
> 
> But I was recommended to create a generic function that could guide the
> way to archs: http://patchwork.ozlabs.org/patch/1168775/

Yes. And to clarify, I was assuming that the changes to mm/gup.c were 
required in order to accomplish your goals. Given that assumption, I 
wanted the generic code to be "proper", and that's what that feedback
is about.

If you can somehow do it entirely as an arch-specific thing, then probably
that's even better. Although the other questions about file-backed THP
make it sound like some rethinking across the board is required now.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Thu, 2019-10-03 at 09:44 +0200, Peter Zijlstra wrote:
> This shouldn't be a user visible option at all. Either the arch needs
> it and selects it or not.

You are right. I will do that on v6.
Thanks for the feedback!


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-03 Thread John Hubbard
On 10/3/19 4:51 AM, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
...
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 

Hi Peter,

So, would that imply that it's correct to apply approximately the following
patch:

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 1adbb8a371c7..cf41eff37e24 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2099,9 +2099,9 @@ INTERRUPT DISABLING FUNCTIONS
 -
 
 Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
-(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
-barriers are required in such a situation, they must be provided from some
-other means.
+(RELEASE equivalent) will act as full memory barriers. This is because, for
+all supported CPU architectures, interrupt arrival causes all speculative
+memory accesses to be discarded.
 
?

We're also discussing this over in [1] ("mm: don't expose non-hugetlb page to
fast gup prematurely"), so I'm adding Paul to this thread here as well.

[1] https://lore.kernel.org/r/20191002092447.gc9...@quack2.suse.cz

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread Leonardo Bras
Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> > If a process (qemu) with a lot of CPUs (128) try to munmap() a large
> > chunk of memory (496GB) mapped with THP, it takes an average of 275
> > seconds, which can cause a lot of problems to the load (in qemu case,
> > the guest will lock for this time).
> > 
> > Trying to find the source of this bug, I found out most of this time is
> > spent on serialize_against_pte_lookup(). This function will take a lot
> > of time in smp_call_function_many() if there is more than a couple CPUs
> > running the user process. Since it has to happen to all THP mapped, it
> > will take a very long time for large amounts of memory.
> > 
> > By the docs, serialize_against_pte_lookup() is needed in order to avoid
> > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> > pagetable walk, to happen concurrently with THP splitting/collapsing.
> > 
> > It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> > after interrupts are re-enabled.
> > Since, interrupts are (usually) disabled during lockless pagetable
> > walk, and serialize_against_pte_lookup will only return after
> > interrupts are enabled, it is protected.
> 
> This is something entirely specific to Power, you shouldn't be touching
> generic code at all.

Up to v4, I was declaring dummy functions so it would not mess up with
other archs: http://patchwork.ozlabs.org/patch/1168779/

But I was recommended to create a generic function that could guide the
way to archs: http://patchwork.ozlabs.org/patch/1168775/

The idea was to concentrate all routines of beginning/ending lockless
pagetable walks on these functions, and call them instead of
irq_disable/irq_enable.
Then it was easy to place the refcount-based tracking in these
functions. It should only be enabled in case the config chooses to do
so. 

> 
> Also, I'm not sure I understand things properly.
> 
> So serialize_against_pte_lookup() wants to wait for all currently
> out-standing __find_linux_pte() instances (which are very similar to
> gup_fast).
> 
> It seems to want to do this before flushing the THP TLB for some reason;
> why? Should not THP observe the normal page table freeing rules which
> includes a RCU-like grace period like this already.
> 
> Why is THP special here? This doesn't seem adequately explained.

"It's necessary to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them."

If a there is a THP split/collapse during the lockless pagetable walk,
the returned ptep can be a pointing to an invalid pte. 

To avoid that, the pmd is updated, then serialize_against_pte_lookup is
ran. Serialize runs a do_nothing in all cpu in cpu_mask. 

So, after all cpus finish running do_nothing(), there is a guarantee
that if there is any 'lockless pagetable walk' it is running on top of
a updated version of this pmd, and so, collapsing/splitting THP is
safe.

> 
> Also, specifically to munmap(), this seems entirely superfluous,
> munmap() uses the normal page-table freeing code and should be entirely
> fine without additional waiting.

To be honest, I remember it being needed in munmap case, but I really
don't remember the details. I will take a deeper look and come back
with this answer. 

> Furthermore, Power never accurately tracks mm_cpumask(), so using that
> makes the whole thing more expensive than it needs to be. Also, I
> suppose that is buggered vs file backed THP.

That accuracy of mm_cpumask is above my knowledge right now. =)

I agree that it's to expensive to do that. That's why I suggested this
method, that can check if there is any 'lockless pagetable walk'
running before trying to serialize. It reduced the waiting time a lot
for large amounts of memory. (more details on cover letter)

Best regards,

Leonardo Brás


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] powerpc/vio: use simple dummy struct device as bus parent

2019-10-03 Thread Greg Kroah-Hartman
On Thu, Oct 03, 2019 at 03:10:03PM -0400, Dan Streetman wrote:
> On Fri, Sep 27, 2019 at 2:19 PM Greg Kroah-Hartman
>  wrote:
> >
> > On Fri, Sep 27, 2019 at 09:04:02AM -0400, Dan Streetman wrote:
> > > The dummy vio_bus_device creates the /sys/devices/vio directory, which
> > > contains real vio devices under it; since it represents itself as having
> > > a bus = _bus_type, its /sys/devices/vio/uevent does call the bus's
> > > .uevent function, vio_hotplug(), and as that function won't find a real
> > > device for the dummy vio_dev, it will return -ENODEV.
> > >
> > > One of the main users of the uevent node is udevadm, e.g. when it is 
> > > called
> > > with 'udevadm trigger --devices'.  Up until recently, it would ignore any
> > > errors returned when writing to devices' uevent file, but it was recently
> > > changed to start returning error if it gets an error writing to any uevent
> > > file:
> > > https://github.com/systemd/systemd/commit/97afc0351a96e0daa83964df33937967c75c644f
> > >
> > > since the /sys/devices/vio/uevent file has always returned ENODEV from
> > > any write to it, this now causes the udevadm trigger command to return
> > > an error.  This may be fixed in udevadm to ignore ENODEV errors, but the
> > > vio driver should still be fixed.
> > >
> > > This patch changes the arch/powerpc/platform/pseries/vio.c 'dummy'
> > > parent device into a real dummy device with no .bus, so its uevent
> > > file will stop returning ENODEV and simply do nothing and return 0.
> > >
> > > Signed-off-by: Dan Streetman 
> > > ---
> > >  arch/powerpc/platforms/pseries/vio.c | 11 ---
> > >  1 file changed, 4 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/powerpc/platforms/pseries/vio.c 
> > > b/arch/powerpc/platforms/pseries/vio.c
> > > index 79e2287991db..63bc16631680 100644
> > > --- a/arch/powerpc/platforms/pseries/vio.c
> > > +++ b/arch/powerpc/platforms/pseries/vio.c
> > > @@ -32,11 +32,8 @@
> > >  #include 
> > >  #include 
> > >
> > > -static struct vio_dev vio_bus_device  = { /* fake "parent" device */
> > > - .name = "vio",
> > > - .type = "",
> > > - .dev.init_name = "vio",
> > > - .dev.bus = _bus_type,
> > > +static struct device vio_bus = {
> > > + .init_name  = "vio",
> >
> > Eeek, no!  Why are you creating a static device that will then be
> > reference counted?  Not nice :(
> 
> so, I looked again and it seems quite a few places appear to do
> exactly this, is it something that should be fixed?

Yes.


Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Thu, 2019-10-03 at 16:04 -0300, Leonardo Bras wrote:
> On Wed, 2019-10-02 at 22:08 -0400, Qian Cai wrote:
> > Can’t this name and all those new *lockless* function names be shorter? 
> > There are many functions name with *_locked, so how about dropping 
> > lockless at all, i.e., PAGE_TABLE_WALK_TRACKING blah blah?
> 
> Thanks for the feedback!
> 
> Well, in this case it only tracks the 'lockless pagetable walks'. In
> this approach, the 'locked pagetable walks' don't need to be tracked.

So, using PAGE_TABLE_WALK_TRACKING would not be very accurate (as said
before, there are the locked ones).


signature.asc
Description: This is a digitally signed message part


Re: [PATCH] powerpc/vio: use simple dummy struct device as bus parent

2019-10-03 Thread Dan Streetman
On Fri, Sep 27, 2019 at 2:19 PM Greg Kroah-Hartman
 wrote:
>
> On Fri, Sep 27, 2019 at 09:04:02AM -0400, Dan Streetman wrote:
> > The dummy vio_bus_device creates the /sys/devices/vio directory, which
> > contains real vio devices under it; since it represents itself as having
> > a bus = _bus_type, its /sys/devices/vio/uevent does call the bus's
> > .uevent function, vio_hotplug(), and as that function won't find a real
> > device for the dummy vio_dev, it will return -ENODEV.
> >
> > One of the main users of the uevent node is udevadm, e.g. when it is called
> > with 'udevadm trigger --devices'.  Up until recently, it would ignore any
> > errors returned when writing to devices' uevent file, but it was recently
> > changed to start returning error if it gets an error writing to any uevent
> > file:
> > https://github.com/systemd/systemd/commit/97afc0351a96e0daa83964df33937967c75c644f
> >
> > since the /sys/devices/vio/uevent file has always returned ENODEV from
> > any write to it, this now causes the udevadm trigger command to return
> > an error.  This may be fixed in udevadm to ignore ENODEV errors, but the
> > vio driver should still be fixed.
> >
> > This patch changes the arch/powerpc/platform/pseries/vio.c 'dummy'
> > parent device into a real dummy device with no .bus, so its uevent
> > file will stop returning ENODEV and simply do nothing and return 0.
> >
> > Signed-off-by: Dan Streetman 
> > ---
> >  arch/powerpc/platforms/pseries/vio.c | 11 ---
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/vio.c 
> > b/arch/powerpc/platforms/pseries/vio.c
> > index 79e2287991db..63bc16631680 100644
> > --- a/arch/powerpc/platforms/pseries/vio.c
> > +++ b/arch/powerpc/platforms/pseries/vio.c
> > @@ -32,11 +32,8 @@
> >  #include 
> >  #include 
> >
> > -static struct vio_dev vio_bus_device  = { /* fake "parent" device */
> > - .name = "vio",
> > - .type = "",
> > - .dev.init_name = "vio",
> > - .dev.bus = _bus_type,
> > +static struct device vio_bus = {
> > + .init_name  = "vio",
>
> Eeek, no!  Why are you creating a static device that will then be
> reference counted?  Not nice :(

so, I looked again and it seems quite a few places appear to do
exactly this, is it something that should be fixed?

$ git grep 'static struct device [^*{]*{'
arch/arm/kernel/dma-isa.c:static struct device isa_dma_dev = {
arch/arm/mach-rpc/dma.c:static struct device isa_dma_dev = {
arch/arm/mach-s3c24xx/s3c2410.c:static struct device s3c2410_dev = {
arch/arm/mach-s3c24xx/s3c2412.c:static struct device s3c2412_dev = {
arch/arm/mach-s3c24xx/s3c2416.c:static struct device s3c2416_dev = {
arch/arm/mach-s3c24xx/s3c2440.c:static struct device s3c2440_dev = {
arch/arm/mach-s3c24xx/s3c2442.c:static struct device s3c2442_dev = {
arch/arm/mach-s3c24xx/s3c2443.c:static struct device s3c2443_dev = {
arch/arm/mach-s3c64xx/common.c:static struct device s3c64xx_dev = {
arch/arm/mach-s3c64xx/s3c6400.c:static struct device s3c6400_dev = {
arch/arm/mach-s3c64xx/s3c6410.c:static struct device s3c6410_dev = {
arch/mips/sgi-ip22/ip22-gio.c:static struct device gio_bus = {
arch/parisc/kernel/drivers.c:static struct device root = {
arch/powerpc/platforms/ps3/system-bus.c:static struct device ps3_system_bus = {
arch/powerpc/platforms/pseries/ibmebus.c:static struct device
ibmebus_bus_device = { /* fake "parent" device */
arch/powerpc/platforms/pseries/vio.c:static struct device vio_bus = {
arch/um/drivers/virtio_uml.c:static struct device vu_cmdline_parent = {
drivers/base/isa.c:static struct device isa_bus = {
drivers/block/rbd.c:static struct device rbd_root_dev = {
drivers/gpu/drm/ttm/ttm_module.c:static struct device ttm_drm_class_device = {
drivers/iio/dummy/iio_dummy_evgen.c:static struct device iio_evgen_dev = {
drivers/iio/trigger/iio-trig-sysfs.c:static struct device iio_sysfs_trig_dev = {
drivers/misc/sgi-gru/grumain.c:static struct device gru_device = {
drivers/nubus/bus.c:static struct device nubus_parent = {
drivers/sh/maple/maple.c:static struct device maple_bus = {
drivers/sh/superhyway/superhyway.c:static struct device
superhyway_bus_device = {
drivers/soc/fsl/qe/qe_ic.c:static struct device device_qe_ic = {
drivers/virtio/virtio_mmio.c:static struct device vm_cmdline_parent = {
kernel/time/clockevents.c:static struct device tick_bc_dev = {
kernel/time/clocksource.c:static struct device device_clocksource = {


>
> What's wrong with a simple call to device_create() for your "fake"
> device you want to make here?  That's what it is there for :)
>
> thanks,
>
> greg k-h


Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks

2019-10-03 Thread Leonardo Bras
On Wed, 2019-10-02 at 22:08 -0400, Qian Cai wrote:
> Can’t this name and all those new *lockless* function names be shorter? 
> There are many functions name with *_locked, so how about dropping 
> lockless at all, i.e., PAGE_TABLE_WALK_TRACKING blah blah?

Thanks for the feedback!

Well, in this case it only tracks the 'lockless pagetable walks'. In
this approach, the 'locked pagetable walks' don't need to be tracked.




signature.asc
Description: This is a digitally signed message part


Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone

2019-10-03 Thread Aneesh Kumar K.V

On 10/1/19 8:33 PM, David Hildenbrand wrote:

On 01.10.19 16:57, David Hildenbrand wrote:

On 01.10.19 16:40, David Hildenbrand wrote:

From: "Aneesh Kumar K.V" 

With altmap, all the resource pfns are not initialized. While initializing
pfn, altmap reserve space is skipped. Hence when removing pfn from zone
skip pfns that were never initialized.

Update memunmap_pages to calculate start and end pfn based on altmap
values. This fixes a kernel crash that is observed when destroying
a namespace.

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
 pc: c04b9728: memunmap_pages+0x238/0x340
 lr: c04b9724: memunmap_pages+0x234/0x340
...
 pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c00274087ba0] c09e3500 devm_action_release+0x30/0x50
[c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
[c00274087c30] c09dd144 device_release_driver_internal+0x174/0x240
[c00274087c70] c09d9dfc unbind_store+0x13c/0x190
[c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
[c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
[c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
[c00274087d60] c04be45c __vfs_write+0x3c/0x70
[c00274087d80] c04c26e4 vfs_write+0xe4/0x200
[c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
[c00274087e20] c000bbd0 system_call+0x5c/0x68

Cc: Dan Williams 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Reviewed-by: Pankaj Gupta 
Signed-off-by: Aneesh Kumar K.V 
[ move all pfn-realted declarations into a single line ]
Signed-off-by: David Hildenbrand 
---
  mm/memremap.c | 13 -
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..026788b2ac69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
  void memunmap_pages(struct dev_pagemap *pgmap)
  {
struct resource *res = >res;
-   unsigned long pfn;
+   unsigned long pfn, nr_pages, start_pfn, end_pfn;
int nid;
  
  	dev_pagemap_kill(pgmap);

@@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
put_page(pfn_to_page(pfn));
dev_pagemap_cleanup(pgmap);
  
+	start_pfn = pfn_first(pgmap);

+   end_pfn = pfn_end(pgmap);
+   nr_pages = end_pfn - start_pfn;
+
/* pages are dead and unused, undo the arch mapping */
-   nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+   nid = page_to_nid(pfn_to_page(start_pfn));
  
  	mem_hotplug_begin();

if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-   pfn = PHYS_PFN(res->start);
-   __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-PHYS_PFN(resource_size(res)), NULL);
+   __remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
+  nr_pages, NULL);
} else {
arch_remove_memory(nid, res->start, resource_size(res),
pgmap_altmap(pgmap));



Aneesh, I was wondering why the use of "res->start" is correct (and we
shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
could review.



To be more precise, I wonder if it should actually be

__remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
resource_size(res))



yes, that would be make it much clear.

But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?



IOW, keep calling __remove_pages() with the same parameters but read
nid/zone from the offset one.

Hope some memunmap_pages() expert can clarify.



-aneesh


Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-03 Thread Peter Zijlstra
On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > index 818691846c90..3043ea9812d5 100644
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> >  #endif
> >  #endif
> >  
> > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > +{
> > +   unsigned long irq_mask;
> > +
> > +   if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > +   atomic_inc(>lockless_pgtbl_walkers);
> 
> This will not work for file backed THP. Also, this is a fairly serious
> contention point all on its own.

Kiryl says we have tmpfs-thp, this would be broken vs that, as would
your (PowerPC) use of mm_cpumask() for that IPI.

> > +   /*
> > +* Interrupts must be disabled during the lockless page table walk.
> > +* That's because the deleting or splitting involves flushing TLBs,
> > +* which in turn issues interrupts, that will block when disabled.
> > +*/
> > +   local_irq_save(irq_mask);
> > +
> > +   /*
> > +* This memory barrier pairs with any code that is either trying to
> > +* delete page tables, or split huge pages. Without this barrier,
> > +* the page tables could be read speculatively outside of interrupt
> > +* disabling.
> > +*/
> > +   smp_mb();
> 
> I don't think this is something smp_mb() can guarantee. smp_mb() is
> defined to order memory accesses, in this case the store of the old
> flags vs whatever comes after this.
> 
> It cannot (in generic) order against completion of prior instructions,
> like clearing the interrupt enabled flags.
> 
> Possibly you want barrier_nospec().

I'm still really confused about this barrier. It just doesn't make
sense.

If an interrupt happens before the local_irq_disable()/save(), then it
will discard any and all speculation that would be in progress to handle
the exception.

If there isn't an interrupt (or it happens after disable) it is
irrelevant.

Specifically, that serialize-IPI thing wants to ensure in-progress
lookups are complete, and I can't find a scenario where
local_irq_disable/enable() needs additional help vs IPIs. The moment an
interrupt lands it kills speculation and forces things into
program-order.

Did you perhaps want something like:

if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
atomic_inc();
smp_mb__after_atomic();
}

...

if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
smp_mb__before_atomic();
atomic_dec();
}

To ensure everything happens inside of the increment?

And I still think all that wrong, you really shouldn't need to wait on
munmap().


[PATCH v2] powerpc/kernel/sysfs: Add PERF_EVENT config option check to PMU SPRs

2019-10-03 Thread Kajol Jain
Perf is the primary interface to program performance monitoring
unit (pmu) and collect counter data in system.
But currently pmu register files are created in the
/sys/devices/system/cpu/cpu* without checking CONFIG_PERF_EVENTS
option. These includes PMC* and MMCR* sprs.
Patch ties sysfs pmu spr file creation with CONFIG_PERF_EVENTS options.

Tested this patch with enable/disable CONFIG_PERF_EVENTS option
in powernv and pseries machines.
Also did compilation testing with book3s_32.config.

Reviewed-by: Madhavan Srinivasan 

Signed-off-by: Kajol Jain 
---
 arch/powerpc/kernel/sysfs.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e2147d7c9e72..263023cc6308 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -456,16 +456,21 @@ static ssize_t __used \
 
 #if defined(CONFIG_PPC64)
 #define HAS_PPC_PMC_CLASSIC1
+#if defined(CONFIG_PERF_EVENTS)
 #define HAS_PPC_PMC_IBM1
+#endif
 #define HAS_PPC_PMC_PA6T   1
 #elif defined(CONFIG_PPC_BOOK3S_32)
 #define HAS_PPC_PMC_CLASSIC1
+#if defined(CONFIG_PERF_EVENTS)
 #define HAS_PPC_PMC_IBM1
 #define HAS_PPC_PMC_G4 1
 #endif
+#endif
 
 
 #ifdef HAS_PPC_PMC_CLASSIC
+#ifdef HAS_PPC_PMC_IBM
 SYSFS_PMCSETUP(mmcr0, SPRN_MMCR0);
 SYSFS_PMCSETUP(mmcr1, SPRN_MMCR1);
 SYSFS_PMCSETUP(pmc1, SPRN_PMC1);
@@ -484,6 +489,10 @@ SYSFS_PMCSETUP(pmc7, SPRN_PMC7);
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
 
 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+#endif /* CONFIG_PPC64 */
+#endif /* HAS_PPC_PMC_IBM */
+
+#ifdef CONFIG_PPC64
 SYSFS_SPRSETUP(purr, SPRN_PURR);
 SYSFS_SPRSETUP(spurr, SPRN_SPURR);
 SYSFS_SPRSETUP(pir, SPRN_PIR);
@@ -494,7 +503,9 @@ SYSFS_SPRSETUP(tscr, SPRN_TSCR);
   enable write when needed with a separate function.
   Lets be conservative and default to pseries.
 */
+#ifdef HAS_PPC_PMC_IBM
 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+#endif /* HAS_PPC_PMC_IBM */
 static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
 static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
 static DEVICE_ATTR(pir, 0400, show_pir, NULL);
@@ -605,12 +616,14 @@ static void sysfs_create_dscr_default(void)
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
+#ifdef HAS_PPC_PMC_IBM
 SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
 SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
 SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
+#endif /* HAS_PPC_PMC_IBM */
 #ifdef CONFIG_DEBUG_MISC
 SYSFS_SPRSETUP(hid0, SPRN_HID0);
 SYSFS_SPRSETUP(hid1, SPRN_HID1);
@@ -648,7 +661,6 @@ static struct device_attribute ibm_common_attrs[] = {
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
 };
-#endif /* HAS_PPC_PMC_G4 */
 
 #ifdef HAS_PPC_PMC_G4
 static struct device_attribute g4_common_attrs[] = {
@@ -670,9 +682,11 @@ static struct device_attribute classic_pmc_attrs[] = {
__ATTR(pmc8, 0600, show_pmc8, store_pmc8),
 #endif
 };
+#endif /* HAS_PPC_PMC_IBM */
 
 #ifdef HAS_PPC_PMC_PA6T
 static struct device_attribute pa6t_attrs[] = {
+#ifdef HAS_PPC_PMC_IBM
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
__ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
@@ -681,6 +695,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
+#endif /* HAS_PPC_PMC_IBM */
 #ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
@@ -769,8 +784,10 @@ static int register_cpu_online(unsigned int cpu)
device_create_file(s, _attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, _attr_mmcra);
+#endif
 
if (cpu_has_feature(CPU_FTR_PURR)) {
if (!firmware_has_feature(FW_FEATURE_LPAR))
@@ -858,8 +875,10 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, _attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, _attr_mmcra);
+#endif
 
if (cpu_has_feature(CPU_FTR_PURR))
device_remove_file(s, _attr_purr);
-- 
2.21.0



Re: [PATCH] powerpc/powernv/prd: Validate whether address to be mapped is part of system RAM

2019-10-03 Thread Vaidyanathan Srinivasan
* Jeremy Kerr  [2019-10-03 15:07:24]:

> Hi Vasant,
> 
> > > OK. How about we just don't do that?
> > 
> > Yes. Hostboot will fix that. It will make sure that HBRT is loaded
> > into regular memory.
> 
> Super.
> 
> > > It sounds like we're just trying to work around an invalid
> > > representation of the mappings.
> > 
> > Its not workaround. Its additional check.
> 
> The issue is that you've added a check for stuff that the kernel doesn't
> (and shouldn't) know about, and assumed that the kernel knows better
> than the device tree. It may be the correct thing to do in this case,
> but we can't guarantee that it's always correct.

Good point on the policy ownership.

> For example, what if there is a future HBRT range that is fine to go
> into NVRAM? With this change, that's not possible.

The current topic is who owns setting up the ATT bits for that piece
of memory.  It is the kernel today.  Kernel decides to set this up as
normal memory or I/O memory and sets the bits in page table entry.

> Or, what if there's a range of address-space that isn't backed by system
> RAM (say, some MMIO-mapped hardware) that we want to expose to a future
> HBRT implementation? This change will block that.
> 
> The kernel doesn't know what is and is not valid for a HBRT mapping, so
> it has no reason to override what's specified in the device tree. We've
> designed this so that the kernel provides the mechanism for mapping
> pages, and not the policy of which pages can be mapped.

The features altered are cache inhibit and guarding which affects
ability to fetch instructions.  If we allow HBRT to reside in an I/O
memory, the we need to tell kernel that it is ok to allow caching and
instruction execution in that region and accordingly change the ATT
bits.

This patch does not block a working function, but actually makes
debugging a failed case easier.  The failing scenario without this
check is such that HBRT cannot fetch from the region of memory and
loops in minor page faults doing nothing.  

As Vasant mentioned hostboot team will add code to relocate the HBRT
to the right place.  Addressing your concern, if we end up allowing
HBRT in non system-RAM area, we need to add some more flags in device
tree to instruct the driver to force change the page protection bits
as page_prot = pgprot_cached(page_prot); It does not make sense to
keep the region cache inhibited and just clear the Guard bit
(ATT=0b10 - non-idempotent I/O) so we should force to normal memory
ATT=0b00.

In summary, this check does not block any working function today.  We
fail to execute HBRT code after we successfully mmap() the memory
anyway.

When we need to move firmware components to other types of memory, we
should do a future patch to indicate in device tree that this is non
system-RAM and kernel should change PTE permissions and then setup the
mmap(). Or HBRT really has a use for NVRAM in which case we explicitly
need to indicate (via device-tree) the required attribute for this
mapping.

--Vaidy



Re: [PATCH v6 3/3] PCI: layerscape: Add LS1028a support

2019-10-03 Thread Shawn Guo
On Mon, Sep 02, 2019 at 11:43:19AM +0800, Xiaowei Bao wrote:
> Add support for the LS1028a PCIe controller.
> 
> Signed-off-by: Xiaowei Bao 
> Signed-off-by: Hou Zhiqiang 
> ---
> v2:
>  - No change.
> v3:
>  - Reuse the ls2088 driver data structurt.
> v4:
>  - No change.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pci-layerscape.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> b/drivers/pci/controller/dwc/pci-layerscape.c
> index 3a5fa26..f24f79a 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -263,6 +263,7 @@ static const struct ls_pcie_drvdata ls2088_drvdata = {
>  static const struct of_device_id ls_pcie_of_match[] = {
>   { .compatible = "fsl,ls1012a-pcie", .data = _drvdata },
>   { .compatible = "fsl,ls1021a-pcie", .data = _drvdata },
> + { .compatible = "fsl,ls1028a-pcie", .data = _drvdata },

I think you can save this driver change by using "fsl,ls2088a-pcie" as
compatible fallback like below.

  compatible = "fsl,ls1028a-pcie", "fsl,ls2088a-pcie";

Shawn

>   { .compatible = "fsl,ls1043a-pcie", .data = _drvdata },
>   { .compatible = "fsl,ls1046a-pcie", .data = _drvdata },
>   { .compatible = "fsl,ls2080a-pcie", .data = _drvdata },
> -- 
> 2.9.5
> 


Re: [PATCH] powerpc/perf: Add functionality to check PERF_EVENTS config option

2019-10-03 Thread maddy



On 10/1/19 2:51 PM, Kajol Jain wrote:

Perf is the primary interface to program performance monitoring
unit (pmu) and collect counter data in system.
But currently pmu register files are created in the
/sys/devices/system/cpu/cpu* without checking CONFIG_PERF_EVENTS
option. These includes PMC* and MMCR* sprs.
Patch ties sysfs pmu spr file creation with CONFIG_PERF_EVENTS options.

Tested this patch with enable/disable CONFIG_PERF_EVENTS option
in powernv and pseries machines.
Also did compilation testing with book3s_32.config.



Title of the patch should be

powerpc/kernel/sysfs: Add PERF_EVENT config option check to PMU SPRs

Rest looks fine.

Reviewed-by: Madhavan Srinivasan 



Signed-off-by: Kajol Jain 
---
  arch/powerpc/kernel/sysfs.c | 21 -
  1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e2147d7c9e72..263023cc6308 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -456,16 +456,21 @@ static ssize_t __used \

  #if defined(CONFIG_PPC64)
  #define HAS_PPC_PMC_CLASSIC   1
+#if defined(CONFIG_PERF_EVENTS)
  #define HAS_PPC_PMC_IBM   1
+#endif
  #define HAS_PPC_PMC_PA6T  1
  #elif defined(CONFIG_PPC_BOOK3S_32)
  #define HAS_PPC_PMC_CLASSIC   1
+#if defined(CONFIG_PERF_EVENTS)
  #define HAS_PPC_PMC_IBM   1
  #define HAS_PPC_PMC_G41
  #endif
+#endif


  #ifdef HAS_PPC_PMC_CLASSIC
+#ifdef HAS_PPC_PMC_IBM
  SYSFS_PMCSETUP(mmcr0, SPRN_MMCR0);
  SYSFS_PMCSETUP(mmcr1, SPRN_MMCR1);
  SYSFS_PMCSETUP(pmc1, SPRN_PMC1);
@@ -484,6 +489,10 @@ SYSFS_PMCSETUP(pmc7, SPRN_PMC7);
  SYSFS_PMCSETUP(pmc8, SPRN_PMC8);

  SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+#endif /* CONFIG_PPC64 */
+#endif /* HAS_PPC_PMC_IBM */
+
+#ifdef CONFIG_PPC64
  SYSFS_SPRSETUP(purr, SPRN_PURR);
  SYSFS_SPRSETUP(spurr, SPRN_SPURR);
  SYSFS_SPRSETUP(pir, SPRN_PIR);
@@ -494,7 +503,9 @@ SYSFS_SPRSETUP(tscr, SPRN_TSCR);
enable write when needed with a separate function.
Lets be conservative and default to pseries.
  */
+#ifdef HAS_PPC_PMC_IBM
  static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+#endif /* HAS_PPC_PMC_IBM */
  static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
  static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
  static DEVICE_ATTR(pir, 0400, show_pir, NULL);
@@ -605,12 +616,14 @@ static void sysfs_create_dscr_default(void)
  #endif /* CONFIG_PPC64 */

  #ifdef HAS_PPC_PMC_PA6T
+#ifdef HAS_PPC_PMC_IBM
  SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
  SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
  SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
  SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
  SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
  SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
+#endif /* HAS_PPC_PMC_IBM */
  #ifdef CONFIG_DEBUG_MISC
  SYSFS_SPRSETUP(hid0, SPRN_HID0);
  SYSFS_SPRSETUP(hid1, SPRN_HID1);
@@ -648,7 +661,6 @@ static struct device_attribute ibm_common_attrs[] = {
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
  };
-#endif /* HAS_PPC_PMC_G4 */

  #ifdef HAS_PPC_PMC_G4
  static struct device_attribute g4_common_attrs[] = {
@@ -670,9 +682,11 @@ static struct device_attribute classic_pmc_attrs[] = {
__ATTR(pmc8, 0600, show_pmc8, store_pmc8),
  #endif
  };
+#endif /* HAS_PPC_PMC_IBM */

  #ifdef HAS_PPC_PMC_PA6T
  static struct device_attribute pa6t_attrs[] = {
+#ifdef HAS_PPC_PMC_IBM
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
__ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
@@ -681,6 +695,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
+#endif /* HAS_PPC_PMC_IBM */
  #ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
@@ -769,8 +784,10 @@ static int register_cpu_online(unsigned int cpu)
device_create_file(s, _attrs[i]);

  #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, _attr_mmcra);
+#endif

if (cpu_has_feature(CPU_FTR_PURR)) {
if (!firmware_has_feature(FW_FEATURE_LPAR))
@@ -858,8 +875,10 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, _attrs[i]);

  #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, _attr_mmcra);
+#endif

if (cpu_has_feature(CPU_FTR_PURR))
device_remove_file(s, _attr_purr);




Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks

2019-10-03 Thread Peter Zijlstra
On Wed, Oct 02, 2019 at 10:33:24PM -0300, Leonardo Bras wrote:
> Adds config option LOCKLESS_PAGE_TABLE_WALK_TRACKING to make possible
> enabling tracking lockless pagetable walks directly from kernel config.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  mm/Kconfig | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a5dae9a7eb51..00f487a0122f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
>   bool
>  
> +config LOCKLESS_PAGE_TABLE_WALK_TRACKING
> + bool "Track (and optimize) lockless page table walks"
> + default n
> +
> + help
> +   Maintain a reference count of active lockless page table
> +   walkers. This adds 4 bytes to struct mm size, and two atomic
> +   operations to calls such as get_user_pages_fast(). Some
> +   architectures can optimize lockless page table operations if
> +   this is enabled.

This shouldn't be a user visible option at all. Either the arch needs
it and selects it or not.


Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks

2019-10-03 Thread Peter Zijlstra
On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) try to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems to the load (in qemu case,
> the guest will lock for this time).
> 
> Trying to find the source of this bug, I found out most of this time is
> spent on serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there is more than a couple CPUs
> running the user process. Since it has to happen to all THP mapped, it
> will take a very long time for large amounts of memory.
> 
> By the docs, serialize_against_pte_lookup() is needed in order to avoid
> pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> pagetable walk, to happen concurrently with THP splitting/collapsing.
> 
> It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> after interrupts are re-enabled.
> Since, interrupts are (usually) disabled during lockless pagetable
> walk, and serialize_against_pte_lookup will only return after
> interrupts are enabled, it is protected.

This is something entirely specific to Power, you shouldn't be touching
generic code at all.

Also, I'm not sure I understand things properly.

So serialize_against_pte_lookup() wants to wait for all currently
out-standing __find_linux_pte() instances (which are very similar to
gup_fast).

It seems to want to do this before flushing the THP TLB for some reason;
why? Should not THP observe the normal page table freeing rules which
includes a RCU-like grace period like this already.

Why is THP special here? This doesn't seem adequately explained.

Also, specifically to munmap(), this seems entirely superfluous,
munmap() uses the normal page-table freeing code and should be entirely
fine without additional waiting.

Furthermore, Power never accurately tracks mm_cpumask(), so using that
makes the whole thing more expensive than it needs to be. Also, I
suppose that is buggered vs file backed THP.


Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks

2019-10-03 Thread Peter Zijlstra
On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 818691846c90..3043ea9812d5 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
>  #endif
>  #endif
>  
> +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> + unsigned long irq_mask;
> +
> + if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> + atomic_inc(>lockless_pgtbl_walkers);

This will not work for file backed THP. Also, this is a fairly serious
contention point all on its own.

> + /*
> +  * Interrupts must be disabled during the lockless page table walk.
> +  * That's because the deleting or splitting involves flushing TLBs,
> +  * which in turn issues interrupts, that will block when disabled.
> +  */
> + local_irq_save(irq_mask);
> +
> + /*
> +  * This memory barrier pairs with any code that is either trying to
> +  * delete page tables, or split huge pages. Without this barrier,
> +  * the page tables could be read speculatively outside of interrupt
> +  * disabling.
> +  */
> + smp_mb();

I don't think this is something smp_mb() can guarantee. smp_mb() is
defined to order memory accesses, in this case the store of the old
flags vs whatever comes after this.

It cannot (in generic) order against completion of prior instructions,
like clearing the interrupt enabled flags.

Possibly you want barrier_nospec().

> +
> + return irq_mask;
> +}


Re: [PATCH] powerpc/powernv/prd: Validate whether address to be mapped is part of system RAM

2019-10-03 Thread Jeremy Kerr
Hi Vasant,

> > OK. How about we just don't do that?
> 
> Yes. Hostboot will fix that. It will make sure that HBRT is loaded
> into regular memory.

Super.

> > It sounds like we're just trying to work around an invalid
> > representation of the mappings.
> 
> Its not workaround. Its additional check.

The issue is that you've added a check for stuff that the kernel doesn't
(and shouldn't) know about, and assumed that the kernel knows better
than the device tree. It may be the correct thing to do in this case,
but we can't guarantee that it's always correct.

For example, what if there is a future HBRT range that is fine to go
into NVRAM? With this change, that's not possible.

Or, what if there's a range of address-space that isn't backed by system
RAM (say, some MMIO-mapped hardware) that we want to expose to a future
HBRT implementation? This change will block that.

The kernel doesn't know what is and is not valid for a HBRT mapping, so
it has no reason to override what's specified in the device tree. We've
designed this so that the kernel provides the mechanism for mapping
pages, and not the policy of which pages can be mapped.

Regards,


Jeremy




[RFC v5 3/3] cpuidle-powernv : Recompute the idle-state timeouts when state usage is enabled/disabled

2019-10-03 Thread Abhishek Goel
The disable callback can be used to compute timeout for other states
whenever a state is enabled or disabled. We store the computed timeout
in "timeout" defined in cpuidle state strucure. So, we compute timeout
only when some state is enabled or disabled and not every time in the
fast idle path.
We also use the computed timeout to get timeout for snooze, thus getting
rid of get_snooze_timeout for snooze loop.

Signed-off-by: Abhishek Goel 
---
 drivers/cpuidle/cpuidle-powernv.c | 35 +++
 include/linux/cpuidle.h   |  1 +
 2 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index d7686ce6e..a75226f52 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -45,7 +45,6 @@ struct stop_psscr_table {
 static struct stop_psscr_table stop_psscr_table[CPUIDLE_STATE_MAX] 
__read_mostly;
 
 static u64 default_snooze_timeout __read_mostly;
-static bool snooze_timeout_en __read_mostly;
 
 static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
 struct cpuidle_driver *drv,
@@ -67,26 +66,13 @@ static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
return 0;
 }
 
-static u64 get_snooze_timeout(struct cpuidle_device *dev,
- struct cpuidle_driver *drv,
- int index)
+static void pnv_disable_callback(struct cpuidle_device *dev,
+struct cpuidle_driver *drv)
 {
int i;
 
-   if (unlikely(!snooze_timeout_en))
-   return default_snooze_timeout;
-
-   for (i = index + 1; i < drv->state_count; i++) {
-   struct cpuidle_state *s = >states[i];
-   struct cpuidle_state_usage *su = >states_usage[i];
-
-   if (s->disabled || su->disable)
-   continue;
-
-   return s->target_residency * tb_ticks_per_usec;
-   }
-
-   return default_snooze_timeout;
+   for (i = 0; i < drv->state_count; i++)
+   drv->states[i].timeout = forced_wakeup_timeout(dev, drv, i);
 }
 
 static int snooze_loop(struct cpuidle_device *dev,
@@ -94,16 +80,20 @@ static int snooze_loop(struct cpuidle_device *dev,
int index)
 {
u64 snooze_exit_time;
+   u64 snooze_timeout = drv->states[index].timeout;
+
+   if (!snooze_timeout)
+   snooze_timeout = default_snooze_timeout;
 
set_thread_flag(TIF_POLLING_NRFLAG);
 
local_irq_enable();
 
-   snooze_exit_time = get_tb() + get_snooze_timeout(dev, drv, index);
+   snooze_exit_time = get_tb() + snooze_timeout;
ppc64_runlatch_off();
HMT_very_low();
while (!need_resched()) {
-   if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
+   if (get_tb() > snooze_exit_time) {
/*
 * Task has not woken up but we are exiting the polling
 * loop anyway. Require a barrier after polling is
@@ -168,7 +158,7 @@ static int stop_loop(struct cpuidle_device *dev,
u64 timeout_tb;
bool forced_wakeup = false;
 
-   timeout_tb = forced_wakeup_timeout(dev, drv, index);
+   timeout_tb = drv->states[index].timeout;
 
if (timeout_tb) {
/* Ensure that the timeout is at least one microsecond
@@ -263,6 +253,7 @@ static int powernv_cpuidle_driver_init(void)
 */
 
drv->cpumask = (struct cpumask *)cpu_present_mask;
+   drv->disable_callback = pnv_disable_callback;
 
return 0;
 }
@@ -422,8 +413,6 @@ static int powernv_idle_probe(void)
/* Device tree can indicate more idle states */
max_idle_state = powernv_add_idle_states();
default_snooze_timeout = TICK_USEC * tb_ticks_per_usec;
-   if (max_idle_state > 1)
-   snooze_timeout_en = true;
} else
return -ENODEV;
 
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 1729a497b..64195861b 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -50,6 +50,7 @@ struct cpuidle_state {
int power_usage; /* in mW */
unsigned inttarget_residency; /* in US */
booldisabled; /* disabled on all CPUs */
+   unsigned long long timeout; /* timeout for exiting out of a state */
 
int (*enter)(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
-- 
2.17.1



[RFC v5 2/3] cpuidle : Add callback whenever a state usage is enabled/disabled

2019-10-03 Thread Abhishek Goel
To force wakeup a cpu, we need to compute the timeout in the fast idle
path as a state may be enabled or disabled but there did not exist a
feedback to driver when a state is enabled or disabled.
This patch adds a callback whenever a state_usage records a store for
disable attribute

Signed-off-by: Abhishek Goel 
---
 drivers/cpuidle/sysfs.c | 15 ++-
 include/linux/cpuidle.h |  3 +++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index 2bb2683b4..6c9bf2f7b 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -418,8 +418,21 @@ static ssize_t cpuidle_state_store(struct kobject *kobj, 
struct attribute *attr,
struct cpuidle_state_attr *cattr = attr_to_stateattr(attr);
struct cpuidle_device *dev = kobj_to_device(kobj);
 
-   if (cattr->store)
+   if (cattr->store) {
ret = cattr->store(state, state_usage, buf, size);
+   if (ret == size &&
+   strncmp(cattr->attr.name, "disable",
+   strlen("disable"))) {
+   struct kobject *cpuidle_kobj = kobj->parent;
+   struct cpuidle_device *dev =
+   to_cpuidle_device(cpuidle_kobj);
+   struct cpuidle_driver *drv =
+   cpuidle_get_cpu_driver(dev);
+
+   if (drv->disable_callback)
+   drv->disable_callback(dev, drv);
+   }
+   }
 
/* reset poll time cache */
dev->poll_limit_ns = 0;
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 4b6b5bea8..1729a497b 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -122,6 +122,9 @@ struct cpuidle_driver {
/* the driver handles the cpus in cpumask */
struct cpumask  *cpumask;
 
+   void (*disable_callback)(struct cpuidle_device *dev,
+struct cpuidle_driver *drv);
+
/* preferred governor to switch at register time */
const char  *governor;
 };
-- 
2.17.1



[PATCH v5 1/3] cpuidle-powernv : forced wakeup for stop states

2019-10-03 Thread Abhishek Goel
Currently, the cpuidle governors determine what idle state a idling CPU
should enter into based on heuristics that depend on the idle history on
that CPU. Given that no predictive heuristic is perfect, there are cases
where the governor predicts a shallow idle state, hoping that the CPU will
be busy soon. However, if no new workload is scheduled on that CPU in the
near future, the CPU may end up in the shallow state.

This is problematic, when the predicted state in the aforementioned
scenario is a shallow stop state on a tickless system. As we might get
stuck into shallow states for hours, in absence of ticks or interrupts.

To address this, We forcefully wakeup the cpu by setting the
decrementer. The decrementer is set to a value that corresponds with the
residency of the next available state. Thus firing up a timer that will
forcefully wakeup the cpu. Few such iterations will essentially train the
governor to select a deeper state for that cpu, as the timer here
corresponds to the next available cpuidle state residency. Thus, cpu will
eventually end up in the deepest possible state.

Signed-off-by: Abhishek Goel 
---

Auto-promotion
  v1 : started as auto promotion logic for cpuidle states in generic
driver
  v2 : Removed timeout_needed and rebased the code to upstream kernel
Forced-wakeup
  v1 : New patch with name of forced wakeup started
  v2 : Extending the forced wakeup logic for all states. Setting the
decrementer instead of queuing up a hrtimer to implement the logic.
  v3 : Cleanly handle setting/resetting of decrementer so as to not break
irq work
  v4 : Changed type and name of set/reset decrementer fucntion
   Handled irq_work_pending in try_set_dec_before_idle
  v5 : Removed forced wakeup for last stop state by changing the
   checking conditon of timeout_tb

 arch/powerpc/include/asm/time.h   |  2 ++
 arch/powerpc/kernel/time.c| 43 +++
 drivers/cpuidle/cpuidle-powernv.c | 40 
 3 files changed, 85 insertions(+)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 08dbe3e68..06a6a2314 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -184,6 +184,8 @@ static inline unsigned long tb_ticks_since(unsigned long 
tstamp)
 extern u64 mulhdu(u64, u64);
 #endif
 
+extern bool try_set_dec_before_idle(u64 timeout);
+extern void try_reset_dec_after_idle(void);
 extern void div128_by_32(u64 dividend_high, u64 dividend_low,
 unsigned divisor, struct div_result *dr);
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 694522308..d004c0d8e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -576,6 +576,49 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+/*
+ * This function tries setting decrementer before entering into idle.
+ * Returns true if we have reprogrammed the decrementer for idle.
+ * Returns false if the decrementer is unchanged.
+ */
+bool try_set_dec_before_idle(u64 timeout)
+{
+   u64 *next_tb = this_cpu_ptr(_next_tb);
+   u64 now = get_tb_or_rtc();
+
+   if (now + timeout > *next_tb)
+   return false;
+
+   set_dec(timeout);
+   if (test_irq_work_pending())
+   set_dec(1);
+
+   return true;
+}
+
+/*
+ * This function gets called if we have set decrementer before
+ * entering into idle. It tries to reset/restore the decrementer
+ * to its original value.
+ */
+void try_reset_dec_after_idle(void)
+{
+   u64 now;
+   u64 *next_tb;
+
+   if (test_irq_work_pending())
+   return;
+
+   now = get_tb_or_rtc();
+   next_tb = this_cpu_ptr(_next_tb);
+   if (now >= *next_tb)
+   return;
+
+   set_dec(*next_tb - now);
+   if (test_irq_work_pending())
+   set_dec(1);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index 84b1ebe21..d7686ce6e 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Expose only those Hardware idle states via the cpuidle framework
@@ -46,6 +47,26 @@ static struct stop_psscr_table 
stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly
 static u64 default_snooze_timeout __read_mostly;
 static bool snooze_timeout_en __read_mostly;
 
+static u64 forced_wakeup_timeout(struct cpuidle_device *dev,
+struct cpuidle_driver *drv,
+int index)
+{
+   int i;
+
+   for (i = index + 1; i < drv->state_count; i++) {
+   struct cpuidle_state *s = >states[i];
+   struct cpuidle_state_usage *su = >states_usage[i];
+
+   if (s->disabled || su->disable)
+   continue;
+
+   return 

[PATCH v5 0/3] Forced-wakeup for stop states on Powernv

2019-10-03 Thread Abhishek Goel
Currently, the cpuidle governors determine what idle state a idling CPU
should enter into based on heuristics that depend on the idle history on
that CPU. Given that no predictive heuristic is perfect, there are cases
where the governor predicts a shallow idle state, hoping that the CPU will
be busy soon. However, if no new workload is scheduled on that CPU in the
near future, the CPU will end up in the shallow state.

Motivation
--
In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a shallow stop state on a tickless system. As
we might get stuck into shallow states even for hours, in absence of ticks
or interrupts.

To address this, We forcefully wakeup the cpu by setting the decrementer.
The decrementer is set to a value that corresponds with the residency of
the next available state. Thus firing up a timer that will forcefully
wakeup the cpu. Few such iterations will essentially train the governor to
select a deeper state for that cpu, as the timer here corresponds to the
next available cpuidle state residency. Thus, cpu will eventually end up
in the deepest possible state and we won't get stuck in a shallow state
for long duration.

Experiment
--
For earlier versions when this feature was meat to be only for shallow lite
states, I performed experiments for three scenarios to collect some data.

case 1 :
Without this patch and without tick retained, i.e. in a upstream kernel,
It would spend more than even a second to get out of stop0_lite.

case 2 : With tick retained in a upstream kernel -

Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
it to take 8 sched tick to get out of stop0_lite. Experimentally,
observation was

=
sample  minmax   99percentile
20  4ms12ms  4ms
=

It would take atleast one sched tick to get out of stop0_lite.

case 2 :  With this patch (not stopping tick, but explicitly queuing a
  timer)


sample  min max 99percentile

20  144us   192us   144us



Description of current implementation
-

We calculate timeout for the current idle state as the residency value
of the next available idle state. If the decrementer is set to be
greater than this timeout, we update the decrementer value with the
residency of next available idle state. Thus, essentially training the
governor to select the next available deeper state until we reach the
deepest state. Hence, we won't get stuck unnecessarily in shallow states
for longer duration.


v1 of auto-promotion : https://lkml.org/lkml/2019/3/22/58 This patch was
implemented only for shallow lite state in generic cpuidle driver.

v2 : Removed timeout_needed and rebased to current
upstream kernel

Then,
v1 of forced-wakeup : Moved the code to cpuidle powernv driver and started
as forced wakeup instead of auto-promotion

v2 : Extended the forced wakeup logic for all states.
Setting the decrementer instead of queuing up a hrtimer to implement the
logic.

v3 : 1) Cleanly handle setting the decrementer after exiting out of stop
   states.
 2) Added a disable_callback feature to compute timeout whenever a
state is enbaled or disabled instead of computing everytime in fast
idle path.
 3) Use disable callback to recompute timeout whenever state usage
is changed for a state. Also, cleaned up the get_snooze_timeout
function.

v4 :Changed the type and name of set/reset decrementer function.
Handled irq work pending in try_set_dec_before_idle.
No change in patch 2 and 3.

v5 :Removed forced wakeup for the last state. We dont want to wakeup
unnecessarily when already in deepest state. It was a mistake in
previous patches that was found out in recent experiments.
No change in patch 2 and 3.

Abhishek Goel (3):
  cpuidle-powernv : forced wakeup for stop states
  cpuidle : Add callback whenever a state usage is enabled/disabled
  cpuidle-powernv : Recompute the idle-state timeouts when state usage
is enabled/disabled

 arch/powerpc/include/asm/time.h   |  2 ++
 arch/powerpc/kernel/time.c| 43 
 drivers/cpuidle/cpuidle-powernv.c | 55 +++
 drivers/cpuidle/sysfs.c   | 15 -
 include/linux/cpuidle.h   |  4 +++
 5 files changed, 105 insertions(+), 14 deletions(-)

-- 
2.17.1