Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 14:12 +0100, Robin Murphy wrote:
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/

There is 0 value in trying to prevent the guest kernel from shooting
itself in the foot. There are so many other ways it can do it that I
fail to see the point of even attempting it here.

In addition, this actually harms performance on some devices. There
are cases where the MSI-X table shares a page with other registers
that are used during normal device operation. This is especially
problematic on architectures such as powerpc that use 64K pages.

Those devices thus suffer a massive performance loss, for the sake of
something that never happens in practice (especially on pseries where
the MSI configuration is done by paravirt calls, thus by qemu itself).
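
To make the page-size point concrete, here is a minimal sketch in plain
userspace C. It is not vfio-pci code: the helper name and the offsets used
below are hypothetical, chosen only to show how a register that sits on its
own page with 4K pages can land on the MSI-X table's page with 64K pages.

```c
#include <assert.h>   /* for the sanity checks mentioned in the note below */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of vfio-pci: returns true when a register
 * at reg_off lies outside the MSI-X table yet on a page the table touches.
 * table_len must be non-zero.
 */
static bool shares_page_with_msix(uint64_t table_off, uint64_t table_len,
                                  uint64_t reg_off, uint64_t page_size)
{
	uint64_t first_page = table_off / page_size;
	uint64_t last_page = (table_off + table_len - 1) / page_size;
	uint64_t reg_page = reg_off / page_size;
	bool in_table = reg_off >= table_off &&
			reg_off < table_off + table_len;

	return !in_table && reg_page >= first_page && reg_page <= last_page;
}
```

With an assumed table at BAR offset 0x2000 (0x100 bytes long) and a register
at 0x8000, the helper reports no sharing for a 4096-byte page and sharing for
a 65536-byte page, which is the case where every access to that register has
to be emulated.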

Cheers,
Ben.



Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Benjamin Herrenschmidt
On Tue, 2017-08-15 at 09:47 +0800, Jike Song wrote:
> On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > > Taking a step back, though, why does vfio-pci perform this check in the
> > > > first place? If a malicious guest already has control of a device, any
> > > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > > message address/data it could simply do with a DMA write anyway, so the
> > > > security argument doesn't stand up in general (sure, not all PCIe
> > > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > > tenuous security-by-obscurity angle to me).
> > 
> > I tried to make that point for years, thanks for re-iterating it :-)
> > 
> > > Hi Robin,
> > > 
> > > DMA writes will be translated (thereby censored) by DMA Remapping 
> > > hardware,
> > > while MSI/MSI-X will not. Is this different for non-x86?
> > 
> > There is no way your DMA remapping HW can differentiate. The only
> > difference between a DMA write and an MSI is ... the address. So if I
> > can make my device DMA to the MSI address range, I've defeated your
> > security.
> 
> I don't think with IRQ remapping enabled, you can make your device DMA to
> MSI address, without being treated as an IRQ and remapped. If so, the IRQ
> remapping hardware is simply broken :)

You are mixing things here.

Robin's point is that there is no security provided by the obfuscating
of the MSI-X table by qemu because whatever qemu does to "filter" the
MSI-X target addresses can be worked around by making the device DMA
wherever you want.

None of what you say invalidates that basic fact.

Now, as far as your remapping HW goes, either it filters interrupts or
it doesn't. If it does then yes, it can't be spoofed, and thus you
don't need the filtering of the table in qemu.

If it doesn't, then the guest can spoof any interrupt using DMAs and
whatever qemu does to filter the table is not going to fix it.

Thus the point remains that the only value in qemu filtering the table
is to enable already insecure use cases to work, without actually
making them any more secure.

Ben.



Re: [PATCH v7 7/9] mm: Add address parameter to arch_validate_prot()

2017-08-14 Thread Michael Ellerman
Khalid Aziz  writes:

> On 08/10/2017 07:20 AM, Michael Ellerman wrote:
>> Khalid Aziz  writes:
>> 
>>> A protection flag may not be valid across entire address space and
>>> hence arch_validate_prot() might need the address a protection bit is
>>> being set on to ensure it is a valid protection flag. For example, sparc
>>> processors support memory corruption detection (as part of ADI feature)
>>> flag on memory addresses mapped on to physical RAM but not on PFN mapped
>>> pages or addresses mapped on to devices. This patch adds address to the
>>> parameters being passed to arch_validate_prot() so protection bits can
>>> be validated in the relevant context.
>>>
>>> Signed-off-by: Khalid Aziz 
>>> Cc: Khalid Aziz 
>>> ---
>>> v7:
>>> - new patch
>>>
>>>   arch/powerpc/include/asm/mman.h | 2 +-
>>>   arch/powerpc/kernel/syscalls.c  | 2 +-
>>>   include/linux/mman.h| 2 +-
>>>   mm/mprotect.c   | 2 +-
>>>   4 files changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
>>> index 30922f699341..bc74074304a2 100644
>>> --- a/arch/powerpc/include/asm/mman.h
>>> +++ b/arch/powerpc/include/asm/mman.h
>>> @@ -40,7 +40,7 @@ static inline bool arch_validate_prot(unsigned long prot)
>>> return false;
>>> return true;
>>>   }
>>> -#define arch_validate_prot(prot) arch_validate_prot(prot)
>>> +#define arch_validate_prot(prot, addr) arch_validate_prot(prot)
>> 
>> This can be simpler, as just:
>> 
>> #define arch_validate_prot arch_validate_prot
>> 
>
> Hi Michael,
>
> Thanks for reviewing!
>
> My patch expands parameter list for arch_validate_prot() from one to two 
> parameters. Existing powerpc version of arch_validate_prot() is written 
> with one parameter. If I use the above #define, compilation fails with:
>
> mm/mprotect.c: In function ‘do_mprotect_pkey’:
> mm/mprotect.c:399: error: too many arguments to function 
> ‘arch_validate_prot’
>
> Another way to solve it would be to add the new addr parameter to 
> powerpc version of arch_validate_prot() but I chose the less disruptive 
> solution of tackling it through #define and expanded the existing 
> #define to include the new parameter. Make sense?

Yes, it makes sense. But it's a bit gross.

At first glance it looks like our arch_validate_prot() has an incorrect
signature.

I'd prefer you just updated it to have the correct signature, I think
you'll have to change one more line in do_mmap2(). So it's not very
intrusive.
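
A rough sketch of what I mean, as a standalone mock rather than the real
header. The DEMO_* constants and demo_cpu_has_sao are stand-ins I made up so
this compiles outside the kernel; only the shape of the function and the
self-referencing #define are the point.

```c
#include <assert.h>   /* for the sanity checks mentioned in the note below */
#include <stdbool.h>

/*
 * Illustrative stand-ins only: the real flag values and the CPU feature
 * test come from kernel headers.
 */
#define DEMO_PROT_BASE 0x0fUL	/* read/write/exec/sem, lumped together */
#define DEMO_PROT_SAO  0x10UL
static bool demo_cpu_has_sao = true;

/*
 * Two-argument form: addr is accepted so the signature matches what the
 * generic code passes, but this architecture does not use it.
 */
static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
{
	(void)addr;
	if (prot & ~(DEMO_PROT_BASE | DEMO_PROT_SAO))
		return false;
	if ((prot & DEMO_PROT_SAO) && !demo_cpu_has_sao)
		return false;
	return true;
}
/* Tells the generic header that an arch override exists. */
#define arch_validate_prot arch_validate_prot
```

With that, callers pass (prot, addr) everywhere and the macro no longer has
to hide a mismatched parameter list.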

cheers


[PATCH] KVM: PPC: Book3S HV: Fix invalid use of register expression

2017-08-14 Thread Michael Ellerman
From: Andreas Schwab 

binutils >= 2.26 now warns about misuse of register expressions in
assembler operands that are actually literals. In this instance r0 is
being used where a literal 0 should be used.

Signed-off-by: Andreas Schwab 
[mpe: Split into separate KVM patch, tweak change log]
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c52184a8efdf..0bc400f882f4 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -976,7 +976,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
 #ifdef CONFIG_KVM_XICS
/* We are entering the guest on that thread, push VCPU to XIVE */
ld  r10, HSTATE_XIVE_TIMA_PHYS(r13)
-   cmpldi  cr0, r10, r0
+   cmpldi  cr0, r10, 0
beq no_xive
ld  r11, VCPU_XIVE_SAVED_STATE(r4)
li  r9, TM_QW1_OS
-- 
2.7.4



[PATCH] powerpc/mm/nohash: add definition of PGALLOC_GFP

2017-08-14 Thread Balbir Singh
Fixes de3b876 ("powerpc/mm/book(e)(3s)/64: Add page table accounting"),
which missed adding PGALLOC_GFP for nohash/64.

Reported-by: Michael Ellerman 
Signed-off-by: Balbir Singh 
---
 arch/powerpc/include/asm/nohash/64/pgalloc.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index 9721c78..ce1d34e 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -41,6 +41,8 @@ extern struct kmem_cache *pgtable_cache[];
pgtable_cache[(shift) - 1]; \
})
 
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE),
-- 
2.9.4



Re: [PATCH v3 1/2] powerpc/xmon: Dump ftrace buffers for the current CPU only

2017-08-14 Thread Michael Ellerman
Breno Leitao  writes:

> Hello Michael,
>
> On Mon, Aug 14, 2017 at 11:00:07PM +1000, Michael Ellerman wrote:
>> Breno Leitao  writes:
>> > @@ -2231,6 +2232,19 @@ static void xmon_rawdump (unsigned long adrs, long 
>> > ndump)
>> >printf("\n");
>> >  }
>> >  
>> > +static void dump_tracing(void)
>> > +{
>> > +  int c;
>> > +
>> > +  c = inchar();
>> > +  if (c == 'c')
>> > +  ftrace_dump(DUMP_ORIG);
>> > +  else
>> > +  ftrace_dump(DUMP_ALL);
>> > +
>> > +  tracing_on();
>> > +}
>> 
>> Thinking about this some more, two things that would make this *really*
>> useful.
>> 
>> Firstly, it would be great if we could dump the buffer for *another*
>> CPU. 
>
> Well, you can do it with this new 'dtc' option on xmon. You just need to
> change to that CPU prior to call 'dtc'.

But that's the problem. If the other CPU is stuck then you can't change
to it in xmon.

If it *isn't* stuck, then you can just change to it and backtrace like
normal.

eg, if you use the HARDLOCKUP lkdtm test:

  # cd /sys/kernel/debug/provoke-crash
  # echo HARDLOCKUP > DIRECT
  sysrq: SysRq : Entering xmon
  cpu 0x1: Vector: 501 (Hardware Interrupt) at [c000fe9af9a0]
  pc: c00f15d0: plpar_hcall_norets+0x1c/0x28
  lr: c0daaf04: check_and_cede_processor+0x34/0x50
  sp: c000fe9afc20
 msr: 80009033
current = 0xc000fe95a200
paca= 0xcfd40580 softe: 0irq_happened: 0x09
  pid   = 0, comm = swapper/1
  Linux version 4.13.0-rc2-gcc-6.3.1-00101-ged49f7fd6438 
(mich...@ka3.ozlabs.ibm.com) (gcc version 6.3.1 20170214 (Custom 
e9096cb27f4bd642)) #455 SMP Mon Aug 14 22:19:37 AEST 2017
  enter ? for help
  [link register   ] c0daaf04 check_and_cede_processor+0x34/0x50
  [c000fe9afc20] c0daaef0 check_and_cede_processor+0x20/0x50 
(unreliable)
  [c000fe9afc80] c0daaf94 shared_cede_loop+0x74/0x2b0
  [c000fe9afcd0] c0da6fa0 cpuidle_enter_state+0xe0/0x6b0
  [c000fe9afd50] c01ff17c call_cpuidle+0x7c/0x110
  [c000fe9afd90] c01ff800 do_idle+0x350/0x460
  [c000fe9afe20] c01ffbe8 cpu_startup_entry+0x38/0x40
  [c000fe9afe50] c0060108 start_secondary+0x528/0xb90
  [c000fe9aff90] c000af6c start_secondary_prolog+0x10/0x14
  1:mon> c
  cpus stopped: 0x0-0x5 0x7-0xf
  1:mon> 
  1:mon> c 6
  cpu 0x6 isn't in xmon


Notice that CPU 6 hasn't called in, so we can't switch to it, and we
have no (easy) way of working out where it is.

If we could dump the ftrace buffer for CPU 6 from another CPU, we'd be
able to see eg:

  6)   |  SyS_write() {
  6)   |__fdget_pos() {
  6)   0.088 us|  __fget_light();
  6)   1.168 us|}
  6)   |vfs_write() {
  6)   0.092 us|  rw_verify_area();
  6)   0.062 us|  __sb_start_write();
  6)   |  __vfs_write() {
  ...
  6)   |lkdtm_do_action() {
  6)   |  lkdtm_HARDLOCKUP() {
-


Which would be helpful.

cheers


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Jike Song
On 08/15/2017 09:33 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
>>> Taking a step back, though, why does vfio-pci perform this check in the
>>> first place? If a malicious guest already has control of a device, any
>>> kind of interrupt spoofing it could do by fiddling with the MSI-X
>>> message address/data it could simply do with a DMA write anyway, so the
>>> security argument doesn't stand up in general (sure, not all PCIe
>>> devices may be capable of arbitrary DMA, but that seems like more of a
>>> tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)
> 
>> Hi Robin,
>>
>> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
>> while MSI/MSI-X will not. Is this different for non-x86?
> 
> There is no way your DMA remapping HW can differentiate. The only
> difference between a DMA write and an MSI is ... the address. So if I
> can make my device DMA to the MSI address range, I've defeated your
> security.

I don't think with IRQ remapping enabled, you can make your device DMA to
MSI address, without being treated as an IRQ and remapped. If so, the IRQ
remapping hardware is simply broken :)

--
Thanks,
Jike


Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Benjamin Herrenschmidt
On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > Taking a step back, though, why does vfio-pci perform this check in the
> > first place? If a malicious guest already has control of a device, any
> > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > message address/data it could simply do with a DMA write anyway, so the
> > security argument doesn't stand up in general (sure, not all PCIe
> > devices may be capable of arbitrary DMA, but that seems like more of a
> > tenuous security-by-obscurity angle to me).

I tried to make that point for years, thanks for re-iterating it :-)

> Hi Robin,
> 
> DMA writes will be translated (thereby censored) by DMA Remapping hardware,
> while MSI/MSI-X will not. Is this different for non-x86?

There is no way your DMA remapping HW can differentiate. The only
difference between a DMA write and an MSI is ... the address. So if I
can make my device DMA to the MSI address range, I've defeated your
security.

The table obfuscating in qemu is only useful as an insecure way of
"making things sort-of-work" for HW that doesn't have proper remapping
or filtering.

On pseries we don't have that problem because:

 1) Our hypervisor (which is qemu) provides the DMA address for MSIs/X
so there is no need for "magic remapping" to give the guest a value
that works.

 2) Our HW (configured by VFIO/KVM) filters which device can DMA to
what address (including which MSIs/X) thus even if the guest doesn't
use the address passed and messes around, it can only shoot itself in
the foot.

So all we need is a way to tell qemu to stop doing that filtering on
our platform. This is *one bit* of information, it's taken 3 years of
arguments and we still don't have a solution. In the meantime,
workloads *are* being hurt by significant performance degradation due
to the MSI-X table sharing a 64K page (our page size) with other MMIOs.

Yay !

Ben.



Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Jike Song
On 08/14/2017 09:12 PM, Robin Murphy wrote:
> On 14/08/17 10:45, Alexey Kardashevskiy wrote:
>> Folks,
>>
>> Is there anything to change besides those compiler errors and David's
>> comment in 5/5? Or is the whole patchset too bad? Thanks.
> 
> While I now understand it's not the low-level thing I first thought it
> was, so my reasoning has changed, personally I don't like this approach
> any more than the previous one - it still smells of abusing external
> APIs to pass information from one part of VFIO to another (and it has
> the same conceptual problem of attributing something to interrupt
> sources that is actually a property of the interrupt target).
> 
> Taking a step back, though, why does vfio-pci perform this check in the
> first place? If a malicious guest already has control of a device, any
> kind of interrupt spoofing it could do by fiddling with the MSI-X
> message address/data it could simply do with a DMA write anyway, so the
> security argument doesn't stand up in general (sure, not all PCIe
> devices may be capable of arbitrary DMA, but that seems like more of a
> tenuous security-by-obscurity angle to me).

Hi Robin,

DMA writes will be translated (thereby censored) by DMA Remapping hardware,
while MSI/MSI-X will not. Is this different for non-x86?

--
Thanks,
Jike

> Besides, with Type1 IOMMU
> the fact that we've let a device be assigned at all means that this is
> already a non-issue (because either the hardware provides isolation or
> the user has explicitly accepted the consequences of an unsafe
> configuration) - from patch #4 that's apparently the same for SPAPR TCE,
> in which case it seems this flag doesn't even need to be propagated and
> could simply be assumed always.
> 
> On the other hand, if the check is not so much to mitigate malicious
> guests attacking the system as to prevent dumb guests breaking
> themselves (e.g. if some or all of the MSI-X capability is actually
> emulated), then allowing things to sometimes go wrong on the grounds of
> an irrelevant hardware feature doesn't seem correct :/
> 
> Robin.
> 
>> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for 
>>> mmapping MSI-X table"
>>> http://www.spinics.net/lists/kvm/msg152232.html
>>>
>>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>>> bus flags or IOMMU domains are still better (and which one).
>>
>>>
>>>
>>>
>>> Here is some background:
>>>
>>> Current vfio-pci implementation disallows to mmap the page
>>> containing MSI-X table in case that users can write directly
>>> to MSI-X table and generate an incorrect MSIs.
>>>
>>> However, this will cause some performance issue when there
>>> are some critical device registers in the same page as the
>>> MSI-X table. We have to handle the mmio access to these
>>> registers in QEMU emulation rather than in guest.
>>>
>>> To solve this issue, this series allows to expose MSI-X table
>>> to userspace when hardware enables the capability of interrupt
>>> remapping which can ensure that a given PCI device can only
>>> shoot the MSIs assigned for it. And we introduce a new bus_flags
>>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>>> for different archs.
>>>
>>>
>>> This is based on sha1
>>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>>
>>> Please comment. Thanks.
>>>
>>> Changelog:
>>>
>>> v5:
>>> * redid the whole thing via so-called IOMMU group capabilities
>>>
>>> v4:
>>> * rebased on recent upstream
>>> * got all 6 patches from v2 (v3 was missing some)
>>>
>>>
>>>
>>>
>>> Alexey Kardashevskiy (5):
>>>   iommu: Add capabilities to a group
>>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>>> remapping
>>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>>> enabled
>>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>>
>>>  include/linux/iommu.h| 20 
>>>  include/linux/vfio.h |  1 +
>>>  arch/powerpc/kernel/iommu.c  |  1 +
>>>  drivers/iommu/amd_iommu.c|  3 +++
>>>  drivers/iommu/intel-iommu.c  |  3 +++
>>>  drivers/iommu/iommu.c| 35 +++
>>>  drivers/vfio/pci/vfio_pci.c  | 20 +---
>>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 -
>>>  drivers/vfio/vfio.c  | 15 +++
>>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>>
>>
>>
> 


Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 13:03 -0700, Sukadev Bhattiprolu wrote:
> As Ben pointed out, we are going to have to limit the number of TIDs (to
> be within the size limits), so we won't be able to use task_pid_nr()? But
> if we assign the TIDs in the RX_WIN_OPEN ioctl, then only the FTW processes
> will need the TIDR value.

But you'll have to assign it for all present and future threads of that
process which is somewhat hard to do without races.

> Can we then assign new, globally-unique TID values for now and have the ioctl
> fail with -EAGAIN if all TIDs are in use? We can extend to per-process TID
> values, later?

Why would you want to do that ?

Ben.



Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 21:16 +1000, Michael Ellerman wrote:
> Sukadev Bhattiprolu  writes:
> 
> > We need the SPRN_TIDR to be set for use with fast thread-wakeup
> > (core-to-core wakeup).  Each thread in a process needs to have a
> > unique id within the process but as explained below, for now, we
> > assign globally unique thread ids to all threads in the system.
> 
> Each thread in a process already has a unique id, ie. its pid (in the
> init PID namespace), accessible in the kernel as task_pid_nr(task).
> 
> So if that's all we need, we don't need a new allocator, and we don't
> need to store it in the thread_struct.

We need an allocator, I think, due to size restriction on the HW TID.
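
As a rough illustration of the kind of allocator I have in mind, here is a
minimal userspace sketch. Everything in it is hypothetical: DEMO_MAX_TID is
an arbitrary small number standing in for whatever the HW TID width allows,
the demo_* names are made up, and there is no locking, which a real
implementation would need.

```c
#include <assert.h>   /* for the sanity checks mentioned in the note below */
#include <errno.h>
#include <stdbool.h>

#define DEMO_MAX_TID 8	/* hypothetical; the real limit is the HW TID width */

/* Index 0 is reserved to mean "no TID assigned". */
static bool demo_tid_used[DEMO_MAX_TID + 1];

/* Returns a free id in 1..DEMO_MAX_TID, or -EAGAIN when all are taken. */
static int demo_tid_alloc(void)
{
	for (int id = 1; id <= DEMO_MAX_TID; id++) {
		if (!demo_tid_used[id]) {
			demo_tid_used[id] = true;
			return id;
		}
	}
	return -EAGAIN;
}

/* Releases an id so a later allocation can reuse it. */
static void demo_tid_free(int id)
{
	if (id >= 1 && id <= DEMO_MAX_TID)
		demo_tid_used[id] = false;
}
```

Leaving 0 as "unassigned" fits with the suggestion quoted below that the
TIDR should start out zero and only be set for threads that need it.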

> Also 99.99% of processes aren't going to care about the TIDR, so we
> should avoid setting it in the common case. ie. it should start out zero
> and only be initialised in the FTW code, or a helper that it calls.


> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 9f3e2c9..6123859 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1213,6 +1213,16 @@ struct task_struct *__switch_to(struct task_struct 
> > *prev,
> > hard_irq_disable();
> > }
> >  
> > +#ifdef CONFIG_PPC_VAS
> > +   mtspr(SPRN_TIDR, new->thread.tidr);
> > +#endif
> 
> That should be in restore_sprs().
> 
> It should also check that the TIDR is initialised, and only switch it
> when necessary.
> 
> > +   /*
> > +* We can't take a PMU exception inside _switch() since there is a
> > +* window where the kernel stack SLB and the kernel stack are out
> > +* of sync. Hard disable here.
> > +*/
> > +   hard_irq_disable();
> 
> We removed that in June in:
> 
>  e4c0fc5f72bc ("powerpc/64s: Leave interrupts hard enabled in context switch 
> for radix")
> 
> You've obviously picked it up somewhere along the line during a rebase,
> please be more careful!
> 
> cheers


Re: [PATCH v3] powerpc/mm: Implemented default_hugepagesz verification for powerpc

2017-08-14 Thread Victor Aoqui

On 2017-08-04 15:17, Mike Kravetz wrote:

On 07/24/2017 04:52 PM, Victor Aoqui wrote:

Implemented default hugepage size verification (default_hugepagesz=)
in order to allow allocation of defined number of pages (hugepages=)
only for supported hugepage sizes.

Signed-off-by: Victor Aoqui 
---
v2:
- Renamed default_hugepage_setup_sz function to hugetlb_default_size_setup;
- Added powerpc string to error message.

v3:
- Renamed hugetlb_default_size_setup() to hugepage_default_setup_sz();
- Implemented hugetlb_bad_default_size();
- Reimplemented hugepage_setup_sz() to just parse default_hugepagesz= and
  check if it's a supported size;
- Added verification of default_hugepagesz= value on hugetlb_nrpages_setup()
  before allocating hugepages.

 arch/powerpc/mm/hugetlbpage.c | 15 +++
 include/linux/hugetlb.h   |  1 +
 mm/hugetlb.c  | 17 +++--
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index e1bf5ca..5990381 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -780,6 +780,21 @@ static int __init hugepage_setup_sz(char *str)
 }
 __setup("hugepagesz=", hugepage_setup_sz);

+static int __init hugepage_default_setup_sz(char *str)
+{
+   unsigned long long size;
+
+   size = memparse(str, &str);
+
+   if (add_huge_page_size(size) != 0) {
+   hugetlb_bad_default_size();
+   pr_err("Invalid ppc default huge page size specified(%llu)\n", size);
+   }
+
+   return 1;
+}
+__setup("default_hugepagesz=", hugepage_default_setup_sz);
+
 struct kmem_cache *hugepte_cache;
 static int __init hugetlbpage_init(void)
 {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0ed8e41..2927200 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -361,6 +361,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 int __init alloc_bootmem_huge_page(struct hstate *h);

 void __init hugetlb_bad_size(void);
+void __init hugetlb_bad_default_size(void);
 void __init hugetlb_add_hstate(unsigned order);
 struct hstate *size_to_hstate(unsigned long size);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc48ee7..3c24266 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -54,6 +54,7 @@
 static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
 static bool __initdata parsed_valid_hugepagesz = true;
+static bool __initdata parsed_valid_default_hugepagesz = true;

 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2804,6 +2805,12 @@ void __init hugetlb_bad_size(void)
parsed_valid_hugepagesz = false;
 }

+/* Should be called on processing a default_hugepagesz=... option */
+void __init hugetlb_bad_default_size(void)
+{
+   parsed_valid_default_hugepagesz = false;
+}
+
 void __init hugetlb_add_hstate(unsigned int order)
 {
struct hstate *h;
@@ -2846,8 +2853,14 @@ static int __init hugetlb_nrpages_setup(char *s)
  * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
  * so this hugepages= parameter goes to the "default hstate".
  */
-   else if (!hugetlb_max_hstate)
-   mhp = &default_hstate_max_huge_pages;
+   else if (!hugetlb_max_hstate) {
+   if (!parsed_valid_default_hugepagesz) {
+   pr_warn("hugepages = %s cannot be allocated for "
+   "unsupported default_hugepagesz, ignoring\n", s);
+   parsed_valid_default_hugepagesz = true;
+   } else
+   mhp = &default_hstate_max_huge_pages;
+   }
 else
 mhp = &parsed_hstate->max_huge_pages;




My compiler tells me,

mm/hugetlb.c: In function ‘hugetlb_nrpages_setup’:
mm/hugetlb.c:2873:8: warning: ‘mhp’ may be used uninitialized in
this function [-Wmaybe-uninitialized]

You have added a way of getting out of that big if/else if statement without
setting mhp.  mhp will be examined later in the code, so this is indeed a bug.
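
For illustration, here is a small standalone mock of that control-flow
problem and one possible way out. It is not the mm/hugetlb.c code: the
demo_* names are hypothetical, and returning early on a NULL pointer is just
one option, assumed here for the sake of the example.

```c
#include <assert.h>   /* for the sanity checks mentioned in the note below */
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the default hstate's page count. */
static unsigned long demo_default_max;

/* Returns the counter to update, or NULL when the request must be ignored. */
static unsigned long *demo_pick_counter(bool default_size_ok)
{
	unsigned long *mhp = NULL;	/* initialised on every path */

	if (default_size_ok)
		mhp = &demo_default_max;
	/* else: unsupported default size, mhp deliberately stays NULL */

	return mhp;
}

/* The caller checks for NULL before the pointer is examined or written. */
static int demo_set_pages(bool default_size_ok, unsigned long n)
{
	unsigned long *mhp = demo_pick_counter(default_size_ok);

	if (!mhp)
		return 0;	/* request ignored */
	*mhp = n;
	return 1;
}
```

The point is only that every branch either assigns mhp or leaves the
function before mhp is used.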


Like Aneesh, I am not sure if there is great benefit in this patch.

You added this change in functionality only for powerpc.  IMO, it would be
best if behavior was consistent in all architectures.  So, if we change it
for powerpc we may want to change everywhere.


Hi Mike,

Yes, the patch mentioned by Aneesh solves the issue.

Thanks

--
Victor Aoqui



Re: [PATCH v3] powerpc/mm: Implemented default_hugepagesz verification for powerpc

2017-08-14 Thread Victor Aoqui

On 2017-08-04 02:57, Aneesh Kumar K.V wrote:

Victor Aoqui  writes:


Implemented default hugepage size verification (default_hugepagesz=)
in order to allow allocation of defined number of pages (hugepages=)
only for supported hugepage sizes.

Signed-off-by: Victor Aoqui 


I am still not sure about this. With current upstream we get

 PCI: Probing PCI hardware
 PCI: Probing PCI hardware done


 HugeTLB: unsupported default_hugepagesz 2097152. Reverting to
16777216

 HugeTLB registered 16.0 MiB page size, pre-allocated 0 pages


 HugeTLB registered 16.0 GiB page size, pre-allocated 0 pages

That warning is added by

d715cf804a0318e83c75c0a7abd1a4b9ce13e8da

Which should be good enough right ?

-aneesh


Hi Aneesh,

Sorry for the delay. I was on vacation.
Yes, that solves the issue. This patch was accepted when I was fixing the
last version.

Sorry for the inconvenience.

--
Victor Aoqui



Re: [PATCH v3 1/2] powerpc/xmon: Dump ftrace buffers for the current CPU only

2017-08-14 Thread Breno Leitao
Hello Michael,

On Mon, Aug 14, 2017 at 11:00:07PM +1000, Michael Ellerman wrote:
> Breno Leitao  writes:
> > @@ -2231,6 +2232,19 @@ static void xmon_rawdump (unsigned long adrs, long 
> > ndump)
> > printf("\n");
> >  }
> >  
> > +static void dump_tracing(void)
> > +{
> > +   int c;
> > +
> > +   c = inchar();
> > +   if (c == 'c')
> > +   ftrace_dump(DUMP_ORIG);
> > +   else
> > +   ftrace_dump(DUMP_ALL);
> > +
> > +   tracing_on();
> > +}
> 
> Thinking about this some more, two things that would make this *really*
> useful.
> 
> Firstly, it would be great if we could dump the buffer for *another*
> CPU. 

Well, you can do it with this new 'dtc' option on xmon. You just need to
change to that CPU prior to calling 'dtc'.

Here is an example, where the exception is hit on cpu '0xa', but I want to dump
ftrace from CPU '0x1'.

   cpu 0xa: Vector: 700 (Program Check) at [c0003ff47d40]
   pc: c000c318: fast_exception_return+0xac/0x170
   lr: 7fff9c4680dc
   sp: 7fff9c29e710
  msr: 800102a03031
 current = 0xc004216d9f80
 paca= 0xcfe02800softe: 0irq_happened: 0x01
   pid   = 893, comm = bad_kernel_stac
   Linux version 4.12.0+ (root@unstable) (gcc version 6.3.0 20170628 (Debian 
6.3.0-21)) #18 SMP Mon Aug 7 20:18:39 EDT 2017
   WARNING: exception is not recoverable, can't continue
   enter ? for help
   SP (7fff9c29e710) is in userspace
   a:mon> dtc
   [  299.770536] Dumping ftrace buffer:
   [  299.770619] -
   [  299.770747] CPU:10 [LOST 7923885 EVENTS]
   [  299.770747]  10)   0.060 us|} /* msr_check_and_set */

   a:mon> c 1
   [link register   ] c06f40e8 check_and_cede_processor+0x48/0x80
   [c004285dfd60] c0065378 return_to_handler+0x0/0x40 (unreliable)
   [c004285dfdc0] c0065378 return_to_handler+0x0/0x40
   [c004285dfdf0] c0065378 return_to_handler+0x0/0x40
   [c004285dfe50] c0065378 return_to_handler+0x0/0x40
   [c004285dfe90] c0065378 return_to_handler+0x0/0x40
   [c004285dff00] c0065378 return_to_handler+0x0/0x40
   [c004285dff30] c00495a8 start_secondary+0x338/0x380
   [c004285dff90] c000b46c start_secondary_prolog+0x10/0x14
   1:mon> dtc
   [  308.629183] Dumping ftrace buffer:
   [  308.629236] -
   [  308.629302] CPU:1 [LOST 16378 EVENTS]
   [  308.629302]   1)   0.044 us| } /* __accumulate_pelt_segments */
   [  308.629388]   1)   4.326 us| } /* __update_load_avg_cfs_rq.isra.3 */
   [  308.629459]   1)   0.058 us| update_cfs_shares();
   [  308.629522]   1)   0.108 us| account_entity_enqueue();
   [  308.629583]   1)   0.048 us| place_entity();
   [  308.629637]   1)   0.078 us| __enqueue_entity();
   .
   

> Currently ftrace_dump() doesn't support that, and maybe it can't
> because of the ring buffer design (?), but it would be really great if
> you could dump another CPU's buffer.

Well, you can just dump the buffer for all CPUs, using ftrace_dump(DUMP_ALL) or
just the current CPU, using ftrace_dump(DUMP_ORIG). Since it can dump a CPU
buffer specifically, I am wondering what would happen if we assign any CPU to
iter.cpu_file.

> That would be great eg. when a CPU is stuck and doesn't come into xmon,
> you could use the trace buffer to work out where it is. You can do it
> now, by dumping the whole trace buffer, but it's quite tricky to spot
> that one CPU amongst all the others.

Well, with this new 'dtc' feature you can do it easily. Just invoke 'xmon',
change to the stuck CPU and print the ftrace for that CPU. Not very
straightforward but doable.

But if we can do it from a ftrace userspace that would be easier, yes! :-)

> The second thing that would be good is if dumping the trace buffer from
> xmon didn't consume the trace. Currently if you do 'dt' to dump the
> trace buffer, and then realise actually you should have just dumped it
> for one CPU then you're out of luck.
> 
> So it'd be nice if we could dump but leave the trace intact. That would
> also be good from an "xmon doesn't perturb the system" (too much) point
> of view, ie. if you drop to xmon and dump the trace then currently the
> trace is no longer available.

I agree with this concern and definitely it concerns me when I try to dump the
buffer again and I get "(ftrace buffer empty)". I will try to address this
also.


Re: [PATCH 2/2] of: Restrict DMA configuration

2017-08-14 Thread Rob Herring
+linuxppc-dev

On Fri, Aug 11, 2017 at 11:29 AM, Robin Murphy  wrote:
> Moving DMA configuration to happen later at driver probe time had the
> unnoticed side-effect that we now perform DMA configuration for *every*
> device represented in DT, rather than only those explicitly created by
> the of_platform and PCI code.
>
> As Christoph points out, this is not really the best thing to do. Whilst
> there may well be other DMA-capable buses that can benefit from having
> their children automatically configured after the bridge has probed,
> there are also plenty of others like USB, MDIO, etc. that definitely do
> not support DMA and should not be indiscriminately processed.
>
> The good news is that DT already gives us the ammunition to do the right
> thing - anything lacking a "dma-ranges" property should be considered
> not to have a mapping of DMA address space from its children to its
> parent, thus anything for which of_dma_get_range() does not succeed does
> not need DMA configuration.
>
> The bad news is that strictly enforcing that would likely break just
> about every FDT platform out there, since most authors have either not
> considered the property at all or have mistakenly assumed that omitting
> "dma-ranges" is equivalent to including the empty property. Thus we have
> little choice but to special-case platform, AMBA and PCI devices so they
> continue to receive configuration unconditionally as before. At least
> anything new will have to get it right in future...

By "anything new", you mean new buses, not new platforms, right?
What's a platform bus device today could be a different kernel bus
type tomorrow with no DT change. So this isn't really enforceable.

I don't completely agree that omitting dma-ranges is wrong and that
new DTs have to have dma-ranges simply because there is much precedent
of DTs with dma-ranges omitted (just go look at PPC). If a bus has no
bus to cpu address translation nor size restrictions, then no
dma-ranges should be allowed. Of course, DT standards can and do
evolve and we could decide to be stricter here, but that hasn't
happened. If it does, then we need to make that clear in the spec and
enforce it.

Rob
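
For readers following along, the special-casing in the quoted patch can be
modelled roughly as below. This is a userspace sketch in Python, not kernel
code: the function name, the bus strings and the returned labels are
assumptions for illustration, and the CONFIG_ARM_AMBA conditional is ignored.

```python
import errno

# Buses that keep receiving DMA configuration unconditionally (per the patch).
LEGACY_BUSES = {"platform", "amba", "pci"}


def dma_config_action(bus, range_ret):
    """Model the decision taken after of_dma_get_range() returns range_ret."""
    if range_ret == 0:
        return "configure from dma-ranges"
    if bus in LEGACY_BUSES:
        # Legacy behaviour: configure anyway, with default offset and size.
        return "configure with defaults"
    if range_ret == -errno.ENODEV:
        # No "dma-ranges": treated as "no DMA", of_dma_configure() returns 0.
        return "skip"
    # Any other failure is passed back to the caller.
    return "fail"
```

So a device on, say, a USB or MDIO bus with no "dma-ranges" is skipped, while
a platform device in the same situation is still configured with defaults.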


>
> Fixes: 09515ef5ddad ("of/acpi: Configure dma operations at probe time for platform/amba/pci bus devices")
> Reported-by: Christoph Hellwig 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/of/device.c | 48 
>  1 file changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index e0a28ea341fe..04c4c952dc57 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -9,6 +9,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>
>  #include 
>  #include "of_private.h"
> @@ -84,31 +87,28 @@ int of_device_add(struct platform_device *ofdev)
>   */
>  int of_dma_configure(struct device *dev, struct device_node *np)
>  {
> -   u64 dma_addr, paddr, size;
> +   u64 dma_addr, paddr, size = 0;
> int ret;
> bool coherent;
> unsigned long offset;
> const struct iommu_ops *iommu;
> u64 mask;
>
> -   /*
> -* Set default coherent_dma_mask to 32 bit.  Drivers are expected to
> -* setup the correct supported mask.
> -*/
> -   if (!dev->coherent_dma_mask)
> -   dev->coherent_dma_mask = DMA_BIT_MASK(32);
> -
> -   /*
> -* Set it to coherent_dma_mask by default if the architecture
> -* code has not set it.
> -*/
> -   if (!dev->dma_mask)
> -   dev->dma_mask = &dev->coherent_dma_mask;
> -
> ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
> if (ret < 0) {
> +   /*
> +* For legacy reasons, we have to assume some devices need
> +* DMA configuration regardless of whether "dma-ranges" is
> +* correctly specified or not.
> +*/
> +   if (!dev_is_pci(dev) &&
> +#ifdef CONFIG_ARM_AMBA
> +   dev->bus != &amba_bustype &&
> +#endif
> +   dev->bus != &platform_bus_type)
> +   return ret == -ENODEV ? 0 : ret;
> +
> dma_addr = offset = 0;
> -   size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
> } else {
> offset = PFN_DOWN(paddr - dma_addr);
>
> @@ -129,6 +129,22 @@ int of_dma_configure(struct device *dev, struct device_node *np)
> dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", offset);
> }
>
> +   /*
> +* Set default coherent_dma_mask to 32 bit.  Drivers are expected to
> +* setup the correct supported mask.
> +*/
> +   if (!dev->coherent_dma_mask)
> +   dev->coherent_dma_mask = DMA_BIT_MASK(32);
> +   /*
> +* Set it to coherent_dma_mask by default if the architecture
> +* code has not set 

Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Sukadev Bhattiprolu
Michael Ellerman [m...@ellerman.id.au] wrote:
> Sukadev Bhattiprolu  writes:
> 
> > We need the SPRN_TIDR to be set for use with fast thread-wakeup
> > (core-to-core wakeup).  Each thread in a process needs to have a
> > unique id within the process but as explained below, for now, we
> > assign globally unique thread ids to all threads in the system.
> 
> Each thread in a process already has a unique id, ie. its pid (in the
> init PID namespace), accessible in the kernel as task_pid_nr(task).
> 
> So if that's all we need, we don't need a new allocator, and we don't
> need to store it in the thread_struct.
> 
> Also 99.99% of processes aren't going to care about the TIDR, so we
> should avoid setting it in the common case. ie. it should start out zero
> and only be initialised in the FTW code, or a helper that it calls.

Good point. So, should we just set it when the RX_WIN_OPEN ioctl is called
rather than at the time of clone()?

__switch_to() (restore_sprs()) could check for non-zero and save/restore
the value.

As Ben pointed out, we are going to have to limit the number of TIDs (to
be within the size limits), so we won't be able to use task_pid_nr()? But
if we assign the TIDs in the RX_WIN_OPEN ioctl, then only the FTW processes
will need the TIDR value.

Can we then assign new, globally-unique TID values for now and have the ioctl
fail with -EAGAIN if all TIDs are in use? We can extend to per-process TID
values, later?
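
As a rough illustration of what I mean, here is a userspace model in Python
of a bounded, globally-unique TID allocator. It is a sketch only: the
TidAllocator name and the max_tid parameter are hypothetical, and the real
limit would come from the TIDR size constraints Ben mentioned.

```python
import errno


class TidAllocator:
    """Toy allocator for globally unique thread ids in the range 1..max_tid."""

    def __init__(self, max_tid):
        self.max_tid = max_tid  # hypothetical limit, stands in for the TIDR size
        self.in_use = set()

    def alloc(self):
        # 0 is never handed out so that "TIDR == 0" can mean "not assigned".
        for tid in range(1, self.max_tid + 1):
            if tid not in self.in_use:
                self.in_use.add(tid)
                return tid
        # All ids are taken: the ioctl would fail with -EAGAIN.
        return -errno.EAGAIN

    def free(self, tid):
        self.in_use.discard(tid)
```

A caller that gets -EAGAIN can retry later, once some other thread has
released its TID.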

> 
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 9f3e2c9..6123859 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1213,6 +1213,16 @@ struct task_struct *__switch_to(struct task_struct *prev,
> > hard_irq_disable();
> > }
> >  
> > +#ifdef CONFIG_PPC_VAS
> > +   mtspr(SPRN_TIDR, new->thread.tidr);
> > +#endif
> 
> That should be in restore_sprs().

ok.
> 
> It should also check that the TIDR is initialised, and only switch it
> when necessary.
> 
> > +   /*
> > +* We can't take a PMU exception inside _switch() since there is a
> > +* window where the kernel stack SLB and the kernel stack are out
> > +* of sync. Hard disable here.
> > +*/
> > +   hard_irq_disable();
> 
> We removed that in June in:
> 
>  e4c0fc5f72bc ("powerpc/64s: Leave interrupts hard enabled in context switch for radix")
> 
> You've obviously picked it up somewhere along the line during a rebase,
> please be more careful!

Yeah, that was stupid. I picked it up on a recent rebase. Will be more careful.

> 
> cheers



Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Sukadev Bhattiprolu
Benjamin Herrenschmidt [b...@au1.ibm.com] wrote:
> On Mon, 2017-08-14 at 17:02 +1000, Michael Neuling wrote:
> > > +/*
> > > + * We need to assign an unique thread id to each thread in a process. This
> > > + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> > > + * to-core wakeup) mechanism being implemented on top of Virtual Accelerator
> > > + * Switchboard (VAS).
> > > + *
> > > + * To get a unique thread-id per process we could simply use task_pid_nr()
> > > + * but the problem is that task_pid_nr() is not yet available for the thread
> > > + * when copy_thread() is called. Fixing that would require changing more
> > > + * intrusive arch-neutral code in code path in copy_process()?.
> > > + *
> > > + * Further, to assign unique thread ids within each process, we need an
> > > + * atomic field (or an IDR) in task_struct, which again intrudes into the
> > > + * arch-neutral code.
> > 
> > Really?
> > 
> > > + * So try to assign globally unique thread ids for now.
> > 
> > Yuck!

I know :-) copy_process() has:

	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
	if (retval)
		goto bad_fork_cleanup_io;

	if (pid != &init_struct_pid) {
		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
		if (IS_ERR(pid)) {


so copy_thread() is called before a pid_nr is assigned to the task.

But see also response to Michael Ellerman.

> 
> Also CAPI has size limits for the TIDR afaik

Ok.

> 
> Ben.



Re: [PATCH 03/11] ASoC: fsl: make snd_soc_platform_driver const

2017-08-14 Thread Nicolin Chen
On Mon, Aug 14, 2017 at 05:08:42PM +0530, Bhumika Goyal wrote:
> Make these const as they are only passed as the 2nd argument to the function
> snd_soc_register_platform, which is of type const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 

Acked-by: Nicolin Chen 

Thanks

> ---
>  sound/soc/fsl/imx-pcm-fiq.c | 2 +-
>  sound/soc/fsl/mpc5200_dma.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/sound/soc/fsl/imx-pcm-fiq.c b/sound/soc/fsl/imx-pcm-fiq.c
> index 92410f7..3fcc7b5 100644
> --- a/sound/soc/fsl/imx-pcm-fiq.c
> +++ b/sound/soc/fsl/imx-pcm-fiq.c
> @@ -341,7 +341,7 @@ static void imx_pcm_fiq_free(struct snd_pcm *pcm)
>   imx_pcm_free(pcm);
>  }
>  
> -static struct snd_soc_platform_driver imx_soc_platform_fiq = {
> +static const struct snd_soc_platform_driver imx_soc_platform_fiq = {
>   .ops        = &imx_pcm_ops,
>   .pcm_new    = imx_pcm_fiq_new,
>   .pcm_free   = imx_pcm_fiq_free,
> diff --git a/sound/soc/fsl/mpc5200_dma.c b/sound/soc/fsl/mpc5200_dma.c
> index 1f7e70b..cdd848c 100644
> --- a/sound/soc/fsl/mpc5200_dma.c
> +++ b/sound/soc/fsl/mpc5200_dma.c
> @@ -356,7 +356,7 @@ static void psc_dma_free(struct snd_pcm *pcm)
>   }
>  }
>  
> -static struct snd_soc_platform_driver mpc5200_audio_dma_platform = {
> +static const struct snd_soc_platform_driver mpc5200_audio_dma_platform = {
>   .ops        = &psc_dma_ops,
>   .pcm_new    = &psc_dma_new,
>   .pcm_free   = &psc_dma_free,
> -- 
> 1.9.1
> 


Re: [PATCH v6 01/17] powerpc/vas: Define macros, register fields and structures

2017-08-14 Thread Sukadev Bhattiprolu
Nicholas Piggin [npig...@gmail.com] wrote:
> On Mon, 14 Aug 2017 15:21:48 +1000
> Michael Ellerman  wrote:
> 
> > Sukadev Bhattiprolu  writes:
> 
> > >  arch/powerpc/include/asm/vas.h   |  35 
> > >  arch/powerpc/include/uapi/asm/vas.h  |  25 +++  
> > 
> > I thought we weren't exposing VAS to userspace yet?
> > 
> > If we are then we need to get things straight WRT copy/paste abort.
> 

> No we should not be. This might be just a leftover hunk that should
> be moved to a future series.

Yes, I should have posted patches 14..17 separately as an RFC that goes
on top of the VAS kernel patches 1..13.

> 
> At the moment (as far as I understand) it should be limited to
> preempt-disabled, process context, kernel users which avoids any
> concern for switch_to.
> 

In the FTW case, there is no data transfer from user space to the hardware,
i.e. the copy/paste submits a NULL CRB and the hardware will be configured (see
the ->fifo_disable setting in winctx) to ignore any data specified in the CRB.

Would we be able to allow copy/paste from user space in that case?

Sukadev



Re: [PATCH 08/11] ASoC: samsung: make snd_soc_platform_driver const

2017-08-14 Thread Krzysztof Kozlowski
On Mon, Aug 14, 2017 at 05:08:47PM +0530, Bhumika Goyal wrote:
> Make this const as it is only passed as the 2nd argument to the function
> devm_snd_soc_register_platform, which is of type const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 
> ---
>  sound/soc/samsung/idma.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof



Re: [PATCH] powerpc: fix invalid use of register expressions

2017-08-14 Thread Andreas Schwab
This fixes another invalid use of register expressions.

Signed-off-by: Andreas Schwab 
---
 arch/powerpc/kernel/l2cr_6xx.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/l2cr_6xx.S b/arch/powerpc/kernel/l2cr_6xx.S
index 97ec8557f9..6408f09dbb 100644
--- a/arch/powerpc/kernel/l2cr_6xx.S
+++ b/arch/powerpc/kernel/l2cr_6xx.S
@@ -181,7 +181,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_SPEC7450)
         mtctr   r4
         li      r4,0
 1:
-        lwzx    r0,r0,r4
+        lwzx    r0,0,r4
         addi    r4,r4,32        /* Go to start of next cache line */
         bdnz    1b
         isync
@@ -328,7 +328,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_L3CR)
         mtctr   r4
         li      r4,0
 1:
-        lwzx    r0,r0,r4
+        lwzx    r0,0,r4
         dcbf    0,r4
         addi    r4,r4,32        /* Go to start of next cache line */
         bdnz    1b
-- 
2.14.1

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


Re: [GIT PULL 00/19] perf/core improvements and fixes

2017-08-14 Thread Arnaldo Carvalho de Melo
Em Mon, Aug 14, 2017 at 07:39:48PM +0200, Ingo Molnar escreveu:
> * Arnaldo Carvalho de Melo  wrote:
> > Infrastructure:

> > - Add support for shell based tests in 'perf test', add a few that
> >   run 'perf probe', 'perf trace', using kprobes, uprobes to check
> >   the output of those tools and the effects on the system, checking,
> >   for instance, DWARF backtraces from uprobes (Arnaldo Carvalho de Melo)

> >  create mode 100644 tools/perf/tests/shell/lib/probe_vfs_getname.sh
> >  create mode 100755 tools/perf/tests/shell/probe_vfs_getname.sh
> >  create mode 100755 tools/perf/tests/shell/record+script_probe_vfs_getname.sh
> >  create mode 100755 tools/perf/tests/shell/trace+probe_libc_inet_pton.sh
> >  create mode 100755 tools/perf/tests/shell/trace+probe_vfs_getname.sh
> 
> Pulled, thanks a lot Arnaldo!

Thanks! I'm working with Kim Phillips to fix some issues he noticed
while testing on his ARM systems where 'perf probe' is not available, my
perf/core branch has several fixes to handle this that will be in my
next pull request.

- Arnaldo


Re: [GIT PULL 00/19] perf/core improvements and fixes

2017-08-14 Thread Ingo Molnar

* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> Test results at the end of this message, as usual.
> 
> 
> The following changes since commit 82119cbe8e1e32cc2a941393e59816e731681310:
> 
>   Merge tag 'perf-core-for-mingo-4.14-20170801' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> (2017-08-10 17:07:02 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-4.14-20170814
> 
> for you to fetch changes up to 8fc375d7d36c72b4c2d55f5c24be022a939295d4:
> 
>   perf test shell: Add uprobes + backtrace ping test (2017-08-11 16:18:49 
> -0300)
> 
> 
> perf/core improvements and fixes:
> 
> Infrastructure:
> 
> - Do not consider empty files as valid srclines (Milian Wolff)
> 
> - Fix wrong size in perf_record_mmap for last kernel module,
>   which resulted in erroneous symbol resolution in at least s390x (Thomas 
> Richter)
> 
> - Add missing newline to expr parser error messages (Andi Kleen)
> 
> - Fix saved values rbtree lookup in 'perf stat' (Andi Kleen)
> 
> - Add support for shell based tests in 'perf test', add a few that
>   run 'perf probe', 'perf trace', using kprobes, uprobes to check
>   the output of those tools and the effects on the system, checking,
>   for instance, DWARF backtraces from uprobes (Arnaldo Carvalho de Melo)
> 
> Arch specific:
> 
> - Add ppc64le to audit uname list in the python scripting support (Naveen N. 
> Rao)
> 
> - Update POWER9 vendor events tables (Sukadev Bhattiprolu)
> 
> - Fix module symbol adjustment for s390x (Thomas Richter)
> 
> Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
> 
> 
> Andi Kleen (2):
>   perf stat: Fix saved values rbtree lookup
>   perf tools: Add missing newline to expr parser error messages
> 
> Arnaldo Carvalho de Melo (10):
>   perf test: Make 'list' subcommand match main 'perf test' 
> numbering/matching
>   perf test: Add 'struct test *' to the test functions
>   perf test: Add infrastructure to run shell based tests
>   perf test: Make 'list' use same filtering code as main 'perf test'
>   perf test shell: Add 'probe_vfs_getname' shell test
>   perf test shell: Install shell tests
>   perf test shell: Move vfs_getname probe function to lib
>   perf test shell: Add test using probe:vfs_getname and verifying results
>   perf test shell: Add test using vfs_getname + 'perf trace'
>   perf test shell: Add uprobes + backtrace ping test
> 
> Milian Wolff (2):
>   perf util: Take elf_name as const string in dso__demangle_sym
>   perf srcline: Do not consider empty files as valid srclines
> 
> Naveen N. Rao (1):
>   perf scripting python: Add ppc64le to audit uname list
> 
> Sukadev Bhattiprolu (2):
>   perf vendor events powerpc: remove suffix in mapfile
>   perf vendor events powerpc: Update POWER9 events
> 
> Thomas Richter (2):
>   perf record: Fix wrong size in perf_record_mmap for last kernel module
>   perf report: Fix module symbol adjustment for s390x
> 
>  tools/perf/Makefile.perf   |6 +-
>  tools/perf/arch/s390/util/sym-handling.c   |7 +
>  tools/perf/arch/x86/include/arch-tests.h   |   11 +-
>  tools/perf/arch/x86/tests/insn-x86.c   |2 +-
>  tools/perf/arch/x86/tests/intel-cqm.c  |2 +-
>  tools/perf/arch/x86/tests/perf-time-to-tsc.c   |2 +-
>  tools/perf/arch/x86/tests/rdpmc.c  |2 +-
>  tools/perf/pmu-events/arch/powerpc/mapfile.csv |   20 +-
>  .../perf/pmu-events/arch/powerpc/power9/cache.json |  191 +-
>  .../arch/powerpc/power9/floating-point.json|   42 +-
>  .../pmu-events/arch/powerpc/power9/frontend.json   |  517 ++--
>  .../pmu-events/arch/powerpc/power9/marked.json |  905 +++
>  .../pmu-events/arch/powerpc/power9/memory.json |  178 +-
>  .../perf/pmu-events/arch/powerpc/power9/other.json | 2768 
> 
>  .../pmu-events/arch/powerpc/power9/pipeline.json   |  779 +++---
>  tools/perf/pmu-events/arch/powerpc/power9/pmc.json |  167 +-
>  .../arch/powerpc/power9/translation.json   |  314 +--
>  .../python/Perf-Trace-Util/lib/Perf/Trace/Util.py  |1 +
>  tools/perf/tests/attr.c|2 +-
>  tools/perf/tests/backward-ring-buffer.c|2 +-
>  tools/perf/tests/bitmap.c  |   

[PATCH 01/19] perf scripting python: Add ppc64le to audit uname list

2017-08-14 Thread Arnaldo Carvalho de Melo
From: "Naveen N. Rao" 

Before patch:

  $ uname -m
  ppc64le
  $ ./perf script -s ./scripts/python/syscall-counts.py
  Install the audit-libs-python package to get syscall names.
  For example:
# apt-get install python-audit (Ubuntu)
# yum install audit-libs-python (Fedora)
etc.

  Press control+C to stop and show the summary
  ^CWarning:
  4 out of order events recorded.

  syscall events:

  event                      count
  --------------------  ----------
  4                         504638
  54                          1206
  221                           42
  55                            21
  3                             12
  167                           10
  11                             8
  6                              7
  125                            6
  5                              6
  108                            5
  162                            4
  90                             4
  45                             3
  33                             3
  311                            1
  246                            1
  238                            1
  93                             1
  91                             1

After patch:
  ./perf script -s ./scripts/python/syscall-counts.py
  Press control+C to stop and show the summary
  ^CWarning:
  5 out of order events recorded.

  syscall events:

  event                      count
  --------------------  ----------
  write                     643411
  ioctl                       1206
  futex                         54
  fcntl                         27
  poll                          14
  read                          12
  execve                         8
  close                          7
  mprotect                       6
  open                           6
  nanosleep                      5
  fstat                          5
  mmap                           4
  inotify_add_watch              3
  brk                            3
  access                         3
  timerfd_settime                1
  clock_gettime                  1
  epoll_wait                     1
  ftruncate                      1
  munmap                         1

Signed-off-by: Naveen N. Rao 
Acked-by: Paul Clarke 
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/n/tip-bnl67p1alkvx97pn9moxz...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
index 1d95009592eb..f6c84966e4f8 100644
--- a/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
+++ b/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace/Util.py
@@ -57,6 +57,7 @@ try:
'ia64'  : audit.MACH_IA64,
'ppc'   : audit.MACH_PPC,
'ppc64' : audit.MACH_PPC64,
+   'ppc64le' : audit.MACH_PPC64LE,
's390'  : audit.MACH_S390,
's390x' : audit.MACH_S390X,
'i386'  : audit.MACH_X86,
-- 
2.13.4
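
The effect of the missing dictionary entry can be illustrated with a small
standalone model. This is a sketch under assumptions, not the code in
Util.py: the MACH_* values are stand-in strings (the real constants come
from the 'audit' Python module), PPC_SYSCALLS is a tiny hand-written table,
and make_syscall_name() is a hypothetical helper.

```python
# Stand-in values: the real constants come from the 'audit' Python module.
MACH_PPC64 = "MACH_PPC64"
MACH_PPC64LE = "MACH_PPC64LE"

# Tiny hand-written syscall table, for illustration only.
PPC_SYSCALLS = {3: "read", 4: "write", 54: "ioctl"}


def make_syscall_name(machine, machine_to_id):
    """Build a syscall_name() helper for the given 'uname -m' string."""
    machine_id = machine_to_id.get(machine)  # None if the machine is unknown

    def syscall_name(nr):
        if machine_id is None:
            return str(nr)  # no mapping: fall back to the raw number
        return PPC_SYSCALLS.get(nr, str(nr))

    return syscall_name
```

With a table that lacks 'ppc64le', every syscall is reported by number, which
matches the "Before patch" output; once the key is present, names such as
"write" and "ioctl" show up instead.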



[GIT PULL 00/19] perf/core improvements and fixes

2017-08-14 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling,

- Arnaldo

Test results at the end of this message, as usual.


The following changes since commit 82119cbe8e1e32cc2a941393e59816e731681310:

  Merge tag 'perf-core-for-mingo-4.14-20170801' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2017-08-10 17:07:02 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
tags/perf-core-for-mingo-4.14-20170814

for you to fetch changes up to 8fc375d7d36c72b4c2d55f5c24be022a939295d4:

  perf test shell: Add uprobes + backtrace ping test (2017-08-11 16:18:49 -0300)


perf/core improvements and fixes:

Infrastructure:

- Do not consider empty files as valid srclines (Milian Wolff)

- Fix wrong size in perf_record_mmap for last kernel module,
  which resulted in erroneous symbol resolution in at least s390x (Thomas 
Richter)

- Add missing newline to expr parser error messages (Andi Kleen)

- Fix saved values rbtree lookup in 'perf stat' (Andi Kleen)

- Add support for shell based tests in 'perf test', add a few that
  run 'perf probe', 'perf trace', using kprobes, uprobes to check
  the output of those tools and the effects on the system, checking,
  for instance, DWARF backtraces from uprobes (Arnaldo Carvalho de Melo)

Arch specific:

- Add ppc64le to audit uname list in the python scripting support (Naveen N. 
Rao)

- Update POWER9 vendor events tables (Sukadev Bhattiprolu)

- Fix module symbol adjustment for s390x (Thomas Richter)

Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>


Andi Kleen (2):
  perf stat: Fix saved values rbtree lookup
  perf tools: Add missing newline to expr parser error messages

Arnaldo Carvalho de Melo (10):
  perf test: Make 'list' subcommand match main 'perf test' 
numbering/matching
  perf test: Add 'struct test *' to the test functions
  perf test: Add infrastructure to run shell based tests
  perf test: Make 'list' use same filtering code as main 'perf test'
  perf test shell: Add 'probe_vfs_getname' shell test
  perf test shell: Install shell tests
  perf test shell: Move vfs_getname probe function to lib
  perf test shell: Add test using probe:vfs_getname and verifying results
  perf test shell: Add test using vfs_getname + 'perf trace'
  perf test shell: Add uprobes + backtrace ping test

Milian Wolff (2):
  perf util: Take elf_name as const string in dso__demangle_sym
  perf srcline: Do not consider empty files as valid srclines

Naveen N. Rao (1):
  perf scripting python: Add ppc64le to audit uname list

Sukadev Bhattiprolu (2):
  perf vendor events powerpc: remove suffix in mapfile
  perf vendor events powerpc: Update POWER9 events

Thomas Richter (2):
  perf record: Fix wrong size in perf_record_mmap for last kernel module
  perf report: Fix module symbol adjustment for s390x

 tools/perf/Makefile.perf   |6 +-
 tools/perf/arch/s390/util/sym-handling.c   |7 +
 tools/perf/arch/x86/include/arch-tests.h   |   11 +-
 tools/perf/arch/x86/tests/insn-x86.c   |2 +-
 tools/perf/arch/x86/tests/intel-cqm.c  |2 +-
 tools/perf/arch/x86/tests/perf-time-to-tsc.c   |2 +-
 tools/perf/arch/x86/tests/rdpmc.c  |2 +-
 tools/perf/pmu-events/arch/powerpc/mapfile.csv |   20 +-
 .../perf/pmu-events/arch/powerpc/power9/cache.json |  191 +-
 .../arch/powerpc/power9/floating-point.json|   42 +-
 .../pmu-events/arch/powerpc/power9/frontend.json   |  517 ++--
 .../pmu-events/arch/powerpc/power9/marked.json |  905 +++
 .../pmu-events/arch/powerpc/power9/memory.json |  178 +-
 .../perf/pmu-events/arch/powerpc/power9/other.json | 2768 
 .../pmu-events/arch/powerpc/power9/pipeline.json   |  779 +++---
 tools/perf/pmu-events/arch/powerpc/power9/pmc.json |  167 +-
 .../arch/powerpc/power9/translation.json   |  314 +--
 .../python/Perf-Trace-Util/lib/Perf/Trace/Util.py  |1 +
 tools/perf/tests/attr.c|2 +-
 tools/perf/tests/backward-ring-buffer.c|2 +-
 tools/perf/tests/bitmap.c  |2 +-
 tools/perf/tests/bp_signal.c   |2 +-
 tools/perf/tests/bp_signal_overflow.c  |2 +-
 tools/perf/tests/bpf.c |4 +-
 tools/perf/tests/builtin-test.c|  184 +-
 tools/perf/tests/clang.c   |4 +-
 tools/perf/tests/code-reading.c|2 +-
 tools/perf/tests/cpumap.c  |4 +-
 tools/perf/tests/dso-data.c|6 +-
 tools/perf/tests/dwarf-unwind.c|2 +-
 tools/perf/tests/event-times.c 

Re: [PATCH v2 3/9] powerpc/powernv: Remove real mode access limit for early allocations

2017-08-14 Thread Nicholas Piggin
On Mon, 14 Aug 2017 23:10:50 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2017-08-14 at 22:49 +1000, Michael Ellerman wrote:
> > Nicholas Piggin  writes:
> >   
> > > This removes the RMA limit on powernv platform, which constrains
> > > early allocations such as PACAs and stacks. There are still other
> > > restrictions that must be followed, such as bolted SLB limits, but
> > > real mode addressing has no constraints.  
> 
> For radix, should we consider making the PACAs chip/node local ?

Yes that's the main goal of the series. I had NUMAization patches
at the end. Just dropped them for now because some of them need
topology information that's not available (that's why I was asking
about moving unflattening earlier in boot, but we may be able to
move allocations later too).

Thanks,
Nick


Re: [PATCH v2 2/3] livepatch: send a fake signal to all blocking tasks

2017-08-14 Thread Miroslav Benes
On Fri, 11 Aug 2017, Josh Poimboeuf wrote:

> On Thu, Aug 10, 2017 at 12:48:14PM +0200, Miroslav Benes wrote:
> > Last, sending the fake signal is not automatic. It is done only when
> > admin requests it by writing 1 to force sysfs attribute in livepatch
> > sysfs directory.
> 
> 'writing 1' -> 'writing "signal"'
> 
> (unless you take my suggestion to change to two separate sysfs files)

I'll take two separate sysfs files instead.
 
> > @@ -468,7 +468,12 @@ static ssize_t force_store(struct kobject *kobj, struct kobj_attribute *attr,
> > return -EINVAL;
> > }
> >  
> > -   return -EINVAL;
> > +   if (!memcmp("signal", buf, min(sizeof("signal")-1, count)))
> > +   klp_force_signals();
> 
> Any reason why you can't just do a strcmp()?

Not really IIRC. I just borrowed the code from 
mm/huge_memory.c:enabled_store().
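
For what it's worth, the two comparison styles do behave differently on
sysfs input. The following is a userspace model in Python, a sketch only:
the function names are made up, and exact_match() merely approximates a
comparison that tolerates the trailing newline written by echo (similar in
spirit to what sysfs_streq() does in the kernel).

```python
KEYWORD = b"signal"


def memcmp_prefix_match(buf):
    """Model of !memcmp("signal", buf, min(sizeof("signal") - 1, count))."""
    n = min(len(KEYWORD), len(buf))
    # Only the first n bytes are compared, so short or long inputs can match.
    return KEYWORD[:n] == buf[:n]


def exact_match(buf):
    """Exact comparison that tolerates a single trailing newline."""
    if buf.endswith(b"\n"):
        buf = buf[:-1]
    return buf == KEYWORD
```

The memcmp() form accepts "signal\n" as intended, but in this model it also
accepts a truncated "sig", a longer "signalling" and even an empty write,
while the exact form accepts only "signal" with or without the newline.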

> > +++ b/kernel/livepatch/transition.c
> > @@ -577,3 +577,43 @@ void klp_copy_process(struct task_struct *child)
> >  
> > /* TIF_PATCH_PENDING gets copied in setup_thread_stack() */
> >  }
> > +
> > +/*
> > + * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
> > + * Kthreads with TIF_PATCH_PENDING set are woken up. Only admin can request this
> > + * action currently.
> > + */
> > +void klp_force_signals(void)
> > +{
> > +   struct task_struct *g, *task;
> > +
> > +   pr_notice("signalling remaining tasks\n");
> 
> As a native US speaker with possible OCD spelling tendencies, it bothers
> me to see "signalling" with two l's instead of one.  According to
> Google, the UK spelling of the word has two l's, so maybe it's not a
> typo.  I'll forgive you if you don't fix it :-)

If it bothers you, I'll fix it. As a non-native speaker, I can live with 
both.

> > +
> > +   read_lock(&tasklist_lock);
> > +   for_each_process_thread(g, task) {
> > +   if (!klp_patch_pending(task))
> > +   continue;
> > +
> > +   /*
> > +* There is a small race here. We could see TIF_PATCH_PENDING
> > +* set and decide to wake up a kthread or send a fake signal.
> > +* Meanwhile the task could migrate itself and the action
> > +* would be meaningless. It is not serious though.
> > +*/
> > +   if (task->flags & PF_KTHREAD) {
> > +   /*
> > +* Wake up a kthread which still has not been migrated.
> > +*/
> > +   wake_up_process(task);
> > +   } else {
> > +   /*
> > +* Send fake signal to all non-kthread tasks which are
> > +* still not migrated.
> > +*/
> > +   spin_lock_irq(&task->sighand->siglock);
> > +   signal_wake_up(task, 0);
> > +   spin_unlock_irq(&task->sighand->siglock);
> > +   }
> > +   }
> > +   read_unlock(&tasklist_lock);
> 
> I can't remember if we talked about this before, is it possible to also
> signal/wake the idle tasks?

Jiri mentioned that in his email. It is not that easy. Take a look at 
pick_next_task() in kernel/sched/core.c. idle_sched_class is always the 
last one to be checked. Of course we could do something like this 
there...

	if (klp_patch_pending(rq->idle)) {
		p = idle_sched_class.pick_next_task(rq, prev);

		return p;
	}

... but people may be watching, so I didn't say anything.

Thanks,
Miroslav


Re: [v6 15/15] mm: debug for raw alloctor

2017-08-14 Thread Pasha Tatashin

However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
and let users decide what other debugging configs need to be enabled, as
this is also OK.


Actually the more I think about it the more I am convinced that a kernel
boot parameter would be better because it doesn't need the kernel to be
recompiled and it is a single branch in not so hot path.


The main reason I do not like a kernel parameter is that automated test 
suites for every platform would need to be updated to include this new 
parameter in order to test it.


Yet, I think it is important at least initially to test it on every 
platform unconditionally when certain debug configs are enabled.


This patch series allows the boot allocator to return uninitialized memory, 
a behavior Linux never had before, but all too often firmware explicitly 
zeroes all the memory before starting the OS. Therefore, it would be hard 
to debug issues that might only be seen during kinit type of reboots.


In the future, when memory sizes increase so that this memset becomes 
unacceptable even on debug kernels, it can always be removed, but at 
least by then we will know that the code has been tested for many years.


Linux 4.13: Reported regressions as of Monday, 2017-08-14

2017-08-14 Thread Thorsten Leemhuis
Hi! Find below my third regression report for Linux 4.13. It lists 11
regressions I'm currently aware of (or 10 if you count the two scsi-mq
regressions discussions as one). 4 regressions are new; 3 got fixed
since last week's report (two others didn't even make it to the report,
as they were quickly fixed); 1 gets removed. You can also find the
report at http://bit.ly/lnxregrep413 where I try to update it every now
and then.

As always: Are you aware of any other regressions? Then please let me
know. For details see http://bit.ly/lnxregtrackid And please tell me if
there is anything in the report that shouldn't be there.

Ciao, Thorsten

P.S.: Thx to all those that CCed me on regression reports or provided
other input, it makes compiling these reports a whole lot easier!

P.P.S.: Sorry, I adjusted the report structure again because I added a
new field that shows the date when a proper kernel developer (normally:
one that is working in the affected subsystem) looked into the issue. That
should hopefully make it easier to spot regressions that are getting
ignored or got stuck somehow.


== Current regressions ==

[x86/mm/gup] e585513b76: will-it-scale.per_thread_ops -6.9% regression
Status: Asked on the list, but looks like issue gets ignored by everyone
Note: I'm a bit unsure if adding this issue to this list was a good
idea. Side note: Was reported against linux-next in May already
Reported: 2017-07-10 Developer activity: none
http://lkml.kernel.org/r/20170710024020.GA26389@yexl-desktop
Cause: https://git.kernel.org/torvalds/c/e585513b76

Null dereference in rt5677_i2c_probe()
Status: Patch is available in asoc-next as commit ddc9e69b9dc2
Reported: 2017-07-17 Developer activity: 2017-07-27
https://bugzilla.kernel.org/show_bug.cgi?id=196397
https://bugzilla.kernel.org/show_bug.cgi?id=196397#c6
Cause: https://git.kernel.org/torvalds/c/a36afb0ab6
Linux-Regression-ID: lr#96bd63

[Dell xps13 9630] Could not be woken up from suspend-to-idle via usb
keyboard
Status: it's a tracking bug for an issue that seems to get handled by
Intel devs already
Note: suspend-to-idle is rare
Reported: 2017-07-24 Developer activity: 2017-07-24
https://bugzilla.kernel.org/show_bug.cgi?id=196459
Cause: https://git.kernel.org/torvalds/c/33e4f80ee6
Linux-Regression-ID: lr#bd29ab

[lkp-robot] [Btrfs] 28785f70ef: xfstests.generic.273.fail
Status: Jeff: "We're not ignoring it. […] collection of bugs that
approximate a correct result, and we're addressing them individually.[…]"
Reported: 2017-07-26 Developer activity: 2017-08-10
https://lkml.kernel.org/r/20170726062352.GC4877@yexl-desktop
https://lkml.kernel.org/r/bcd49705-e63a-4439-1620-57cd16f5b...@suse.com
Cause: https://git.kernel.org/torvalds/c/28785f70ef
Linux-Regression-ID: lr#a7d273

SCSI-MQ performance regression due to blk-mq scheduler
Status: Revert planned
https://lkml.kernel.org/r/20170813174422.16197-1-...@lst.de
Note: see also "Switching to MQ by default may generate some bug reports"
Reported: 2017-07-31 Developer activity: 2017-08-13
https://lkml.kernel.org/r/20170731165111.11536-2-ming@redhat.com
https://lkml.kernel.org/r/20170813174422.16197-1-...@lst.de
Cause: https://git.kernel.org/torvalds/c/5c279bd9e4

Switching to MQ by default may generate some bug reports
Status: Revert planned
https://lkml.kernel.org/r/20170813174422.16197-1-...@lst.de
Note: see also "SCSI-MQ performance regression due to blk-mq scheduler"
Reported: 2017-08-03 Developer activity: 2017-08-13
https://lkml.kernel.org/r/20170803085115.r2jfz2lofy5sp...@techsingularity.net
https://lkml.kernel.org/r/20170813174422.16197-1-...@lst.de
Cause: https://git.kernel.org/torvalds/c/5c279bd9e4

CIFS mount error -112 due to "SMB3 by default for security reasons"
Status: Reminded people they need to get the issue to the mailing list
Note: Due to the changes in  908b852df1d5d27d289e915fea7bfc16d38b8a76
That's a security change, but one that IMHO at least could have been
handled a lot better by giving users a hint what's wrong
Reported: 2017-08-06 Developer activity: none
https://bugzilla.kernel.org/show_bug.cgi?id=196599
https://bugzilla.kernel.org/show_bug.cgi?id=196599#c6
Cause: https://git.kernel.org/torvalds/c/eef914a9eb
Linux-Regression-ID: lr#60efe5

clang build regression in ext4
Status: report contains patch to fix issue
Reported: 2017-08-07 Developer activity: 2017-08-12
https://lkml.kernel.org/r/20170807105701.3835991-1-a...@arndb.de
Cause: https://git.kernel.org/torvalds/c/2df2c3402f

ACPI/IORT: fix build regression without IOMMU
Status: report contains patch to fix issue
Reported: 2017-08-10 Developer activity: 2017-08-10
https://lkml.kernel.org/r/20170810121114.2509560-1-a...@arndb.de
Cause: https://git.kernel.org/torvalds/c/bc8648d49a

Build regression: cc1: error: '-march=r3000' requires '-mfp32'
Status: brand new
Reported: 2017-08-13 Developer activity: none
https://lkml.kernel.org/r/59901cdb.b0ndvwhnqacjcnum%fengguang...@intel.com
Cause: https://git.kernel.org/torvalds/c/89a55278de

Lockdep: 

Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-14 Thread Michal Hocko
Let's CC Hpa on this one. I am still not sure it is correct. The full
series is here
http://lkml.kernel.org/r/1502138329-123460-1-git-send-email-pasha.tatas...@oracle.com

On Mon 07-08-17 16:38:35, Pavel Tatashin wrote:
> Struct pages are initialized by going through __init_single_page(). Since
> the existing physical memory in memblock is represented in memblock.memory
> list, struct page for every page from this list goes through
> __init_single_page().
> 
> The second memblock list, memblock.reserved, manages the allocated memory:
> memory that won't be available to the kernel allocator. So, every page from
> this list goes through reserve_bootmem_region(), where certain struct page
> fields are set, the assumption being that the struct pages have been
> initialized beforehand.
> 
> In trim_low_memory_range() we unconditionally reserve memory from PFN 0, but
> memblock.memory might start at a later PFN. For example, in QEMU,
> e820__memblock_setup() can use PFN 1 as the first PFN in memblock.memory,
> so PFN 0 is not on memblock.memory (and hence isn't initialized via
> __init_single_page) but is on memblock.reserved (and hence we set fields in
> the uninitialized struct page).
> 
> Currently, the struct page memory is always zeroed during allocation,
> which prevents this problem from being detected. But, if some asserts
> provided by CONFIG_DEBUG_VM_PGFLAGS are tightened, this problem may become
> visible in existing kernels.
> 
> In this patchset we will stop zeroing struct page memory during allocation.
> Therefore, this bug must be fixed in order to avoid random assert failures
> caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> 
> The fix is to reserve memory from the first existing PFN.
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  arch/x86/kernel/setup.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3486d0498800..489cdc141bcb 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -790,7 +790,10 @@ early_param("reservelow", parse_reservelow);
>  
>  static void __init trim_low_memory_range(void)
>  {
> - memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
> + unsigned long min_pfn = find_min_pfn_with_active_regions();
> + phys_addr_t base = min_pfn << PAGE_SHIFT;
> +
> + memblock_reserve(base, ALIGN(reserve_low, PAGE_SIZE));
>  }
>   
>  /*
> -- 
> 2.14.0

-- 
Michal Hocko
SUSE Labs
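
For readers following along, here is a small userspace sketch of what the quoted hunk computes. It is not the kernel code: struct mem_region, min_existing_pfn() and low_reserve_range() are made-up names, PAGE_SHIFT is assumed to be 12, and min_existing_pfn() only models the role that find_min_pfn_with_active_regions() plays in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Hypothetical stand-in for one entry of memblock.memory. */
struct mem_region {
	uint64_t base;	/* physical start address, page aligned */
	uint64_t size;	/* length in bytes */
};

/* Lowest PFN covered by any region (assumption: models the min-PFN lookup). */
static uint64_t min_existing_pfn(const struct mem_region *r, size_t n)
{
	uint64_t min_pfn = UINT64_MAX;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t pfn = r[i].base >> PAGE_SHIFT;

		if (pfn < min_pfn)
			min_pfn = pfn;
	}
	return min_pfn;
}

/* Compute the [base, base + size) range the patched function would reserve. */
static void low_reserve_range(const struct mem_region *r, size_t n,
			      uint64_t reserve_low,
			      uint64_t *base, uint64_t *size)
{
	*base = min_existing_pfn(r, n) << PAGE_SHIFT;
	*size = ALIGN_UP(reserve_low, PAGE_SIZE);
}
```

With memory starting at PFN 1 and a 64K low reserve, this gives base 0x1000 and size 0x10000: the reserved range keeps its length but starts at the first existing page instead of at PFN 0.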


Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-14 Thread Pasha Tatashin

> > mem_init()
> >  free_all_bootmem()
> >   free_low_memory_core_early()
> >    for_each_reserved_mem_region()
> >     reserve_bootmem_region()
> >      init_reserved_page() <- if this is deferred reserved page
> >       __init_single_pfn()
> >        __init_single_page()
> >
> > So, currently, we are using the value of page->flags to figure out if this
> > page has been initialized while being part of deferred page, but this is not
> > going to work for this project, as we do not zero the memory that is backing
> > the struct pages, so the value of page->flags can be anything.
>
> True, this is the initialization part I've missed in one of the previous
> patches already. Would it be possible to only iterate over !reserved
> memory blocks instead? Now that we discard all the metadata later it
> should be quite easy to do for_each_memblock_type, no?


Hi Michal,

Clever suggestion to add a new iterator that goes through unreserved
existing memory. I do not think such an iterator is available, so it
would need to be implemented, using an approach similar to what I have
done with a callback.


However, there is a different reason why I took the current approach.

Daniel Jordan is working on ktask support:
https://lkml.org/lkml/2017/7/14/666

He and I discussed how to multi-thread struct page initialization
within memory nodes using ktasks. Having this callback interface makes
that multi-threading quite easy and improves boot performance further:
with his prototype we saw 4-6x improvements (using 4-8 threads per
node), reducing the total time it takes to initialize all struct pages
on machines with terabytes of memory to less than one second.


Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Michal Hocko
On Mon 14-08-17 09:39:17, Pasha Tatashin wrote:
> >>#ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
> >>nobootmem header file.
> >
> >This is the standard way to do this. And it is usually preferred over
> >proliferating ifdefs in the code.
> 
> Hi Michal,
> 
> As you suggested, I sent out this patch separately. If you feel strongly
> that this should be updated to have stubs for platforms that do not
> implement memblock, please send a reply to that e-mail, so those who do not
> follow this thread will see it. Otherwise, I can leave it as is; the page_alloc
> file already has a number of memblock-related ifdefs, all of which can be
> cleaned out once every platform implements it (is it even achievable?)

I do not think this needs a repost just for this. This was merely a
JFYI, in case you would need to repost for other reasons then just
update that as well. But nothing really earth shattering.
-- 
Michal Hocko
SUSE Labs
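
As an illustration of the "stubs instead of ifdefs" pattern discussed above, here is a minimal userspace sketch. HAVE_MEMBLOCK_MODEL, memblock_discard_model() and finish_boot_model() are invented names for this example and are not the kernel's symbols; the point is only that the conditional lives in one place and the caller stays unconditional.

```c
#include <assert.h>

/* Counts how often the real (non-stub) discard ran; for demonstration only. */
static int discard_calls;

/*
 * Build with -DHAVE_MEMBLOCK_MODEL to model a platform that has memblock.
 * Without it, the stub below is used and the call compiles to nothing.
 */
#ifdef HAVE_MEMBLOCK_MODEL
static void memblock_discard_model(void)
{
	discard_calls++;	/* stands in for freeing the memblock arrays */
}
#else
static inline void memblock_discard_model(void)
{
	/* stub: nothing to discard on platforms without memblock */
}
#endif

/* The caller needs no #ifdef, whichever variant was selected above. */
static int finish_boot_model(void)
{
	memblock_discard_model();
	return discard_calls;
}
```

The alternative is to wrap every call site in its own #ifdef, which is what the reply above calls proliferating ifdefs in the code.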


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Pasha Tatashin

> > #ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
> > nobootmem header file.
>
> This is the standard way to do this. And it is usually preferred over
> proliferating ifdefs in the code.


Hi Michal,

As you suggested, I sent out this patch separately. If you feel
strongly that this should be updated to have stubs for platforms that
do not implement memblock, please send a reply to that e-mail, so those
who do not follow this thread will see it. Otherwise, I can leave it as
is; the page_alloc file already has a number of memblock-related ifdefs,
all of which can be cleaned out once every platform implements it (is it
even achievable?)


Thank you,
Pasha


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Pasha Tatashin

> > OK, I will post it separately. No it does not depend on the rest, but the
> > rest depends on this. So, I am not sure how to enforce that this comes
> > before the rest.
>
> Andrew will take care of that. Just make it explicit that some of the
> patch depends on an earlier work when reposting.


Ok.


> > Yes, they said that the problem was bisected down to this patch. Do you know
> > if there is a way to submit a patch to this test robot?
>
> You can ask them for re testing with an updated patch by replying to
> their report. Anyway I fail to see how the change could lead to this
> patch.


I have already done that. Anyway, I think it is unrelated. I have used
their scripts to test the patch alone, with the number of elements in
the memblock array reduced down to 4. I verified that my freeing code is
called, and never hit the problem that they reported.


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-14 Thread Pasha Tatashin



On 08/14/2017 07:43 AM, Michal Hocko wrote:

> > register_page_bootmem_info
> >  register_page_bootmem_info_node
> >   get_page_bootmem
> >    .. setting fields here ..
> >    such as: page->freelist = (void *)type;
> >
> > free_all_bootmem()
> >  free_low_memory_core_early()
> >   for_each_reserved_mem_region()
> >    reserve_bootmem_region()
> >     init_reserved_page() <- Only if this is deferred reserved page
> >      __init_single_pfn()
> >       __init_single_page()
> >       memset(0) <-- Lose the set fields here!
>
> OK, I have missed that part. Please make it explicit in the changelog.
> It is quite easy to get lost in the deep call chains.


Ok, will update comment.


Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-14 Thread Pasha Tatashin

> > Correct, the pgflags asserts were triggered when we were setting reserved
> > flags to struct page for PFN 0, which was never initialized through
> > __init_single_page(). The reason they were triggered is because we set all
> > uninitialized memory to ones in one of the debug patches.
>
> And why don't we need the same treatment for other architectures?



I have not seen similar issues on other architectures. At least this low 
memory reserve is x86 specific for BIOS purposes:


Documentation/admin-guide/kernel-parameters.txt
    reservelow=     [X86]
                    Format: nn[K]
                    Set the amount of memory to reserve for BIOS at
                    the bottom of the address space.

If there are similar cases with other architectures, they will be caught 
by the last patch in this series, where all allocated memory is set to 
ones, and page flags asserts will be triggered. I have boot-tested on 
SPARC, ARM, and x86.


Pasha


Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 17:02 +1000, Michael Neuling wrote:
> > +/*
> > + * We need to assign an unique thread id to each thread in a process. This
> > + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> > + * to-core wakeup) mechanism being implemented on top of Virtual 
> > Accelerator
> > + * Switchboard (VAS).
> > + *
> > + * To get a unique thread-id per process we could simply use task_pid_nr()
> > + * but the problem is that task_pid_nr() is not yet available for the 
> > thread
> > + * when copy_thread() is called. Fixing that would require changing more
> > + * intrusive arch-neutral code in code path in copy_process()?.
> > + *
> > + * Further, to assign unique thread ids within each process, we need an
> > + * atomic field (or an IDR) in task_struct, which again intrudes into the
> > + * arch-neutral code.
> 
> Really?
> 
> > + * So try to assign globally unique thread ids for now.
> 
> Yuck!

Also CAPI has size limits for the TIDR afaik

Ben.



Re: [PATCH v2 3/9] powerpc/powernv: Remove real mode access limit for early allocations

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 22:49 +1000, Michael Ellerman wrote:
> > - /*
> > -  * We limit the allocation that depend on ppc64_rma_size
> > -  * to first_memblock_size. We also clamp it to 1GB to
> > -  * avoid some funky things such as RTAS bugs.
> 
> That comment about RTAS is 7 years old, and I'm pretty sure it was a
> historical note when it was written.
> 
> I'm inclined to drop it and if we discover new bugs with RTAS on Power9
> then we can always put it back.

Aren't we using a 32-bit RTAS? (Afaik there's a 64-bit one, we just
never used it ...) In this case we need to at least clamp to 2G (I don't
trust RTAS to handle unsigned properly).

> > -  *
> > -  * On radix config we really don't have a limitation
> > -  * on real mode access. But keeping it as above works
> > -  * well enough.
> 
> Ergh.
> 
> > -  */
> > - ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
> > - /*
> > -  * Finally limit subsequent allocations. We really don't want
> > -  * to limit the memblock allocations to rma_size. FIXME!! should
> > -  * we even limit at all ?
> > -  */
> 
> So I think we should just delete this function entirely.
> 
> Any objections?

Well.. RTAS is quite sucky ... 

Ben.

> cheers



Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Robin Murphy
On 14/08/17 10:45, Alexey Kardashevskiy wrote:
> Folks,
> 
> Is there anything to change besides those compiler errors and David's
> comment in 5/5? Or is the whole patchset too bad? Thanks.

While I now understand it's not the low-level thing I first thought it
was, so my reasoning has changed, personally I don't like this approach
any more than the previous one - it still smells of abusing external
APIs to pass information from one part of VFIO to another (and it has
the same conceptual problem of attributing something to interrupt
sources that is actually a property of the interrupt target).

Taking a step back, though, why does vfio-pci perform this check in the
first place? If a malicious guest already has control of a device, any
kind of interrupt spoofing it could do by fiddling with the MSI-X
message address/data it could simply do with a DMA write anyway, so the
security argument doesn't stand up in general (sure, not all PCIe
devices may be capable of arbitrary DMA, but that seems like more of a
tenuous security-by-obscurity angle to me). Besides, with Type1 IOMMU
the fact that we've let a device be assigned at all means that this is
already a non-issue (because either the hardware provides isolation or
the user has explicitly accepted the consequences of an unsafe
configuration) - from patch #4 that's apparently the same for SPAPR TCE,
in which case it seems this flag doesn't even need to be propagated and
could simply be assumed always.

On the other hand, if the check is not so much to mitigate malicious
guests attacking the system as to prevent dumb guests breaking
themselves (e.g. if some or all of the MSI-X capability is actually
emulated), then allowing things to sometimes go wrong on the grounds of
an irrelevant hardware feature doesn't seem correct :/

Robin.

> On 07/08/17 17:25, Alexey Kardashevskiy wrote:
>> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for 
>> mmapping MSI-X table"
>> http://www.spinics.net/lists/kvm/msg152232.html
>>
>> This time it is using "caps" in IOMMU groups. The main question is if PCI
>> bus flags or IOMMU domains are still better (and which one).
> 
>>
>>
>>
>> Here is some background:
>>
>> The current vfio-pci implementation disallows mmapping the page
>> containing the MSI-X table, so that users cannot write directly
>> to the MSI-X table and generate incorrect MSIs.
>>
>> However, this causes a performance issue when there are
>> critical device registers in the same page as the MSI-X
>> table: MMIO accesses to these registers have to be handled
>> by QEMU emulation rather than in the guest.
>>
>> To solve this issue, this series allows exposing the MSI-X table
>> to userspace when the hardware enables interrupt remapping,
>> which ensures that a given PCI device can only shoot
>> the MSIs assigned to it. And we introduce a new bus_flags
>> PCI_BUS_FLAGS_MSI_REMAP to test this capability on PCI side
>> for different archs.
>>
>>
>> This is based on sha1
>> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of 
>> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
>>
>> Please comment. Thanks.
>>
>> Changelog:
>>
>> v5:
>> * redid the whole thing via so-called IOMMU group capabilities
>>
>> v4:
>> * rebased on recent upstream
>> * got all 6 patches from v2 (v3 was missing some)
>>
>>
>>
>>
>> Alexey Kardashevskiy (5):
>>   iommu: Add capabilities to a group
>>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
>> remapping
>>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
>> enabled
>>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>>   vfio-pci: Allow to expose MSI-X table to userspace when safe
>>
>>  include/linux/iommu.h            | 20 ++++++++++++++++++++
>>  include/linux/vfio.h             |  1 +
>>  arch/powerpc/kernel/iommu.c      |  1 +
>>  drivers/iommu/amd_iommu.c        |  3 +++
>>  drivers/iommu/intel-iommu.c      |  3 +++
>>  drivers/iommu/iommu.c            | 35 +++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci.c      | 20 +++++++++++++++++---
>>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 ++++-
>>  drivers/vfio/vfio.c              | 15 +++++++++++++++
>>  9 files changed, 99 insertions(+), 4 deletions(-)
>>
> 
> 



Re: [PATCH v2 3/9] powerpc/powernv: Remove real mode access limit for early allocations

2017-08-14 Thread Benjamin Herrenschmidt
On Mon, 2017-08-14 at 22:49 +1000, Michael Ellerman wrote:
> Nicholas Piggin  writes:
> 
> > This removes the RMA limit on powernv platform, which constrains
> > early allocations such as PACAs and stacks. There are still other
> > restrictions that must be followed, such as bolted SLB limits, but
> > real mode addressing has no constraints.

For radix, should we consider making the PACAs chip/node local ?

> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/mm/hash_utils_64.c | 24 +++-
> >  arch/powerpc/mm/pgtable-radix.c | 33 +
> 
> I missed that we'd duplicated this logic for radix vs hash [yes I know I
> merged the commit that did it :)]
> 
> > diff --git a/arch/powerpc/mm/pgtable-radix.c 
> > b/arch/powerpc/mm/pgtable-radix.c
> > index 671a45d86c18..61ca17d81737 100644
> > --- a/arch/powerpc/mm/pgtable-radix.c
> > +++ b/arch/powerpc/mm/pgtable-radix.c
> > @@ -598,22 +598,23 @@ void radix__setup_initial_memory_limit(phys_addr_t 
> > first_memblock_base,
> >  * physical on those processors
> >  */
> > BUG_ON(first_memblock_base != 0);
> > -   /*
> > -* We limit the allocation that depend on ppc64_rma_size
> > -* to first_memblock_size. We also clamp it to 1GB to
> > -* avoid some funky things such as RTAS bugs.
> 
> That comment about RTAS is 7 years old, and I'm pretty sure it was a
> historical note when it was written.
> 
> I'm inclined to drop it and if we discover new bugs with RTAS on Power9
> then we can always put it back.
> 
> > -*
> > -* On radix config we really don't have a limitation
> > -* on real mode access. But keeping it as above works
> > -* well enough.
> 
> Ergh.
> 
> > -*/
> > -   ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
> > -   /*
> > -* Finally limit subsequent allocations. We really don't want
> > -* to limit the memblock allocations to rma_size. FIXME!! should
> > -* we even limit at all ?
> > -*/
> 
> So I think we should just delete this function entirely.
> 
> Any objections?
> 
> cheers



Re: [PATCH v3 1/2] powerpc/xmon: Dump ftrace buffers for the current CPU only

2017-08-14 Thread Michael Ellerman
Breno Leitao  writes:
> @@ -2231,6 +2232,19 @@ static void xmon_rawdump (unsigned long adrs, long 
> ndump)
>   printf("\n");
>  }
>  
> +static void dump_tracing(void)
> +{
> + int c;
> +
> + c = inchar();
> + if (c == 'c')
> + ftrace_dump(DUMP_ORIG);
> + else
> + ftrace_dump(DUMP_ALL);
> +
> + tracing_on();
> +}

Thinking about this some more, two things that would make this *really*
useful.

Firstly, it would be great if we could dump the buffer for *another*
CPU. Currently ftrace_dump() doesn't support that, and maybe it can't
because of the ring buffer design (?), but it would be really great if
you could dump another CPU's buffer.

That would be great eg. when a CPU is stuck and doesn't come into xmon,
you could use the trace buffer to work out where it is. You can do it
now, by dumping the whole trace buffer, but it's quite tricky to spot
that one CPU amongst all the others.


The second thing that would be good is if dumping the trace buffer from
xmon didn't consume the trace. Currently if you do 'dt' to dump the
trace buffer, and then realise actually you should have just dumped it
for one CPU then you're out of luck.

So it'd be nice if we could dump but leave the trace intact. That would
also be good from an "xmon doesn't perturb the system" (too much) point
of view, ie. if you drop to xmon and dump the trace then currently the
trace is no longer available.

cheers


Re: [PATCH v2 3/9] powerpc/powernv: Remove real mode access limit for early allocations

2017-08-14 Thread Michael Ellerman
Nicholas Piggin  writes:

> This removes the RMA limit on powernv platform, which constrains
> early allocations such as PACAs and stacks. There are still other
> restrictions that must be followed, such as bolted SLB limits, but
> real mode addressing has no constraints.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/mm/hash_utils_64.c | 24 +++-
>  arch/powerpc/mm/pgtable-radix.c | 33 +

I missed that we'd duplicated this logic for radix vs hash [yes I know I
merged the commit that did it :)]

> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 671a45d86c18..61ca17d81737 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -598,22 +598,23 @@ void radix__setup_initial_memory_limit(phys_addr_t 
> first_memblock_base,
>* physical on those processors
>*/
>   BUG_ON(first_memblock_base != 0);
> - /*
> -  * We limit the allocation that depend on ppc64_rma_size
> -  * to first_memblock_size. We also clamp it to 1GB to
> -  * avoid some funky things such as RTAS bugs.

That comment about RTAS is 7 years old, and I'm pretty sure it was a
historical note when it was written.

I'm inclined to drop it and if we discover new bugs with RTAS on Power9
then we can always put it back.

> -  *
> -  * On radix config we really don't have a limitation
> -  * on real mode access. But keeping it as above works
> -  * well enough.

Ergh.

> -  */
> - ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
> - /*
> -  * Finally limit subsequent allocations. We really don't want
> -  * to limit the memblock allocations to rma_size. FIXME!! should
> -  * we even limit at all ?
> -  */

So I think we should just delete this function entirely.

Any objections?

cheers


Re: [v6 15/15] mm: debug for raw alloctor

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 12:18:24, Pasha Tatashin wrote:
> >>When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
> >>returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
> >>places expect zeroed memory.
> >
> >Please fold this into the patch which introduces
> >memblock_virt_alloc_try_nid_raw.
> 
> OK
> 
>  I am not sure CONFIG_DEBUG_VM is the
> >best config because that tends to be enabled quite often. Maybe
> >CONFIG_MEMBLOCK_DEBUG? Or even make it kernel command line parameter?
> >
> 
> Initially, I did not want to make it CONFIG_MEMBLOCK_DEBUG because we really
> benefit from this debugging code when VM debug is enabled, and especially
> struct page debugging asserts which also depend on CONFIG_DEBUG_VM.
> 
> However, now thinking about it, I will change it to CONFIG_MEMBLOCK_DEBUG,
> and let users decide what other debugging configs need to be enabled, as
> this is also OK.

Actually the more I think about it the more I am convinced that a kernel
boot parameter would be better because it doesn't need the kernel to be
recompiled and it is a single branch in a not-so-hot path.
-- 
Michal Hocko
SUSE Labs


Re: [v6 05/15] mm: don't accessed uninitialized struct pages

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 11:55:39, Pasha Tatashin wrote:
> On 08/11/2017 05:37 AM, Michal Hocko wrote:
> >On Mon 07-08-17 16:38:39, Pavel Tatashin wrote:
> >>In deferred_init_memmap() where all deferred struct pages are initialized
> >>we have a check like this:
> >>
> >> if (page->flags) {
> >> VM_BUG_ON(page_zone(page) != zone);
> >> goto free_range;
> >> }
> >>
> >>This way we are checking if the current deferred page has already been
> >>initialized. It works, because memory for struct pages has been zeroed, and
> >>the only way flags are not zero is if it went through __init_single_page()
> >>before.  But, once we change the current behavior and won't zero the memory
> >>in memblock allocator, we cannot trust anything inside "struct page"es
> >>until they are initialized. This patch fixes this.
> >>
> >>This patch defines a new accessor memblock_get_reserved_pfn_range()
> >>which returns successive ranges of reserved PFNs.  deferred_init_memmap()
> >>calls it to determine if a PFN and its struct page has already been
> >>initialized.
> >
> >Why don't we simply check the pfn against pgdat->first_deferred_pfn?
> 
> Because we are initializing deferred pages, and all of them have pfn greater
> than pgdat->first_deferred_pfn. However, some of deferred pages were already
> initialized if they were reserved, in this path:
> 
> mem_init()
>  free_all_bootmem()
>   free_low_memory_core_early()
>for_each_reserved_mem_region()
> reserve_bootmem_region()
>  init_reserved_page() <- if this is deferred reserved page
>   __init_single_pfn()
>__init_single_page()
> 
> So, currently, we are using the value of page->flags to figure out if this
> page has been initialized while being part of deferred page, but this is not
> going to work for this project, as we do not zero the memory that is backing
> the struct pages, so the value of page->flags can be anything.

True, this is the initialization part I've missed in one of the previous
patches already. Would it be possible to only iterate over !reserved
memory blocks instead? Now that we discard all the metadata later it
should be quite easy to do for_each_memblock_type, no?

-- 
Michal Hocko
SUSE Labs
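
The question in this thread is how to tell whether a deferred page was already initialized without reading page->flags. A rough userspace sketch of the range-based answer follows. It assumes the reserved PFN ranges are available as a sorted, non-overlapping array; struct pfn_range, struct reserved_cursor, pfn_is_reserved() and count_uninitialized() are hypothetical names and do not reproduce the accessor proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open PFN range [start, end). */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

/* Cursor over reserved ranges, which must be sorted and non-overlapping. */
struct reserved_cursor {
	const struct pfn_range *ranges;
	size_t count;
	size_t idx;
};

/*
 * Returns true if pfn lies in a reserved range. PFNs must be passed in
 * non-decreasing order so the cursor only ever moves forward.
 */
static bool pfn_is_reserved(struct reserved_cursor *c, uint64_t pfn)
{
	while (c->idx < c->count && c->ranges[c->idx].end <= pfn)
		c->idx++;
	return c->idx < c->count && c->ranges[c->idx].start <= pfn;
}

/* Count the PFNs in [start, end) that still need to be initialized. */
static uint64_t count_uninitialized(const struct pfn_range *reserved, size_t n,
				    uint64_t start, uint64_t end)
{
	struct reserved_cursor c = { reserved, n, 0 };
	uint64_t todo = 0;

	assert(start <= end);
	for (uint64_t pfn = start; pfn < end; pfn++)
		if (!pfn_is_reserved(&c, pfn))
			todo++;
	return todo;
}
```

The decision never reads the struct page itself, so it does not matter what the unzeroed memory happens to contain.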


Re: [v6 02/15] x86/mm: setting fields in deferred pages

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 11:39:41, Pasha Tatashin wrote:
> >AFAIU register_page_bootmem_info_node is only about struct pages backing
> >pgdat, usemap and memmap. Those should be in reserved memblocks and we
> >do not initialize those at later times, they are not relevant to the
> >deferred initialization as your changelog suggests so the ordering with
> >get_page_bootmem shouldn't matter. Or am I missing something here?
> 
> The pages for pgdat, usemap, and memmap are part of reserved memory, and thus
> get initialized when free_all_bootmem() is called.
> 
> So, we have something like this in mem_init()
> 
> register_page_bootmem_info
>  register_page_bootmem_info_node
>   get_page_bootmem
>.. setting fields here ..
>such as: page->freelist = (void *)type;
> 
> free_all_bootmem()
>  free_low_memory_core_early()
>   for_each_reserved_mem_region()
>reserve_bootmem_region()
> init_reserved_page() <- Only if this is deferred reserved page
>  __init_single_pfn()
>   __init_single_page()
>   memset(0) <-- Lose the set fields here!

OK, I have missed that part. Please make it explicit in the changelog.
It is quite easy to get lost in the deep call chains.
-- 
Michal Hocko
SUSE Labs


[PATCH] Fix for supporting nest events on muti socket system

2017-08-14 Thread Anju T Sudhakar
In a multi-node system with discontiguous node ids, nest event values
are not showing up properly. For example,

snip from lscpu output:

..
NUMA node0 CPU(s): 0-15
NUMA node8 CPU(s): 16-31
..

Nest event values on such systems are broken:

$./perf stat -e 'nest_powerbus0_imc/PM_PB_CYC/' -C 0-14 -I 1000 sleep 1000
#           time             counts unit events
     1.000294577    30,17,24,42,880      nest_powerbus0_imc/PM_PB_CYC/
     2.000528938    29,92,08,53,760      nest_powerbus0_imc/PM_PB_CYC/
     3.000713925    29,92,08,00,000      nest_powerbus0_imc/PM_PB_CYC/
     4.000901944    29,95,08,63,360      nest_powerbus0_imc/PM_PB_CYC/
     5.001089119    29,92,07,92,320      nest_powerbus0_imc/PM_PB_CYC/
     6.001276106    29,92,08,11,520      nest_powerbus0_imc/PM_PB_CYC/

$./perf stat -e 'nest_powerbus0_imc/PM_PB_CYC/' -C 16-28 -I 1000 sleep 1000
#   time counts unit events
 1.000049902 nest_powerbus0_imc/PM_PB_CYC/
 2.000147269 nest_powerbus0_imc/PM_PB_CYC/
 3.000219730 nest_powerbus0_imc/PM_PB_CYC/
 4.000288098 nest_powerbus0_imc/PM_PB_CYC/
 5.000358716 nest_powerbus0_imc/PM_PB_CYC/
 6.000435615 nest_powerbus0_imc/PM_PB_CYC/
 7.000508481 nest_powerbus0_imc/PM_PB_CYC/

This is because, when fetching the reference count, the node id is used
as the array index, which is not how the structure is indexed when it is
initialized. Fix this by using the right index to get the
nest_imc_refc.

$./perf stat -e 'nest_powerbus0_imc/PM_PB_CYC/' -C 16-28 -I 1000 sleep 1000
#           time             counts unit events
     1.000241961    26,12,35,28,704      nest_powerbus0_imc/PM_PB_CYC/
     2.000451678    25,95,72,48,512      nest_powerbus0_imc/PM_PB_CYC/
     3.000634963    25,93,13,96,608      nest_powerbus0_imc/PM_PB_CYC/
     4.000821186    25,95,74,38,208      nest_powerbus0_imc/PM_PB_CYC/
     5.001005221    25,93,13,30,048      nest_powerbus0_imc/PM_PB_CYC/

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 46cd912..bbcce29 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1064,7 +1064,7 @@ static int init_nest_pmu_ref(void)
 */
for_each_possible_cpu(cpu) {
nid = cpu_to_node(cpu);
-   for_each_online_node(i) {
+   for (i = 0; i < num_possible_nodes(); i++) {
if (nest_imc_refc[i].id == nid) {
per_cpu(local_nest_imc_refc, cpu) = &nest_imc_refc[i];
break;
-- 
2.7.4
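
The indexing issue described in the changelog can be sketched in userspace as follows. With nodes 0 and 8 present, the slots are filled densely (slot 0 and slot 1), so a lookup has to walk the slots and compare the stored id rather than use the node id as an index. struct nest_ref_model and find_ref_by_node() are hypothetical names for this illustration, not the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced version of a per-node reference-count slot. */
struct nest_ref_model {
	int id;		/* NUMA node id recorded when the slot was set up */
	int refc;	/* reference count for that node */
};

/*
 * Find the slot for node nid by scanning the densely packed slots.
 * Indexing with nid directly would run past the populated slots when
 * node ids are sparse (for example 0 and 8).
 */
static struct nest_ref_model *find_ref_by_node(struct nest_ref_model *refs,
					       size_t nr_slots, int nid)
{
	for (size_t i = 0; i < nr_slots; i++)
		if (refs[i].id == nid)
			return &refs[i];
	return NULL;
}
```

For the lscpu layout quoted above, node 8 resolves to slot 1, which is what the loop over slot indexes in the patch is meant to achieve.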



[PATCH 11/11] ASoC: soc-utils: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the function
snd_soc_register_platform, which is of type const.
Done using Coccinelle

Signed-off-by: Bhumika Goyal 
---
 sound/soc/soc-utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/soc-utils.c b/sound/soc/soc-utils.c
index 644d9a9..e30aacb 100644
--- a/sound/soc/soc-utils.c
+++ b/sound/soc/soc-utils.c
@@ -284,7 +284,7 @@ static int dummy_dma_open(struct snd_pcm_substream 
*substream)
.ioctl  = snd_pcm_lib_ioctl,
 };
 
-static struct snd_soc_platform_driver dummy_platform = {
+static const struct snd_soc_platform_driver dummy_platform = {
.ops = _dma_ops,
 };
 
-- 
1.9.1



[PATCH 10/11] ASoC: txx9: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the function
devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/txx9/txx9aclc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/txx9/txx9aclc.c b/sound/soc/txx9/txx9aclc.c
index 7912bf0..7df95df 100644
--- a/sound/soc/txx9/txx9aclc.c
+++ b/sound/soc/txx9/txx9aclc.c
@@ -403,7 +403,7 @@ static int txx9aclc_pcm_remove(struct snd_soc_platform 
*platform)
return 0;
 }
 
-static struct snd_soc_platform_driver txx9aclc_soc_platform = {
+static const struct snd_soc_platform_driver txx9aclc_soc_platform = {
.probe  = txx9aclc_pcm_probe,
.remove = txx9aclc_pcm_remove,
.ops= _pcm_ops,
-- 
1.9.1



[PATCH 09/11] ASoC: sh: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make these const as they are either passed as the 2nd argument to
the function devm_snd_soc_register_platform or snd_soc_register_platform,
and the arguments are of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/sh/dma-sh7760.c | 2 +-
 sound/soc/sh/fsi.c| 2 +-
 sound/soc/sh/rcar/core.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/sound/soc/sh/dma-sh7760.c b/sound/soc/sh/dma-sh7760.c
index 8fad444..81c7433 100644
--- a/sound/soc/sh/dma-sh7760.c
+++ b/sound/soc/sh/dma-sh7760.c
@@ -320,7 +320,7 @@ static int camelot_pcm_new(struct snd_soc_pcm_runtime *rtd)
return 0;
 }
 
-static struct snd_soc_platform_driver sh7760_soc_platform = {
+static const struct snd_soc_platform_driver sh7760_soc_platform = {
.ops= _pcm_ops,
.pcm_new= camelot_pcm_new,
 };
diff --git a/sound/soc/sh/fsi.c b/sound/soc/sh/fsi.c
index 005b215..60bb23f 100644
--- a/sound/soc/sh/fsi.c
+++ b/sound/soc/sh/fsi.c
@@ -1818,7 +1818,7 @@ static int fsi_pcm_new(struct snd_soc_pcm_runtime *rtd)
},
 };
 
-static struct snd_soc_platform_driver fsi_soc_platform = {
+static const struct snd_soc_platform_driver fsi_soc_platform = {
.ops= _pcm_ops,
.pcm_new= fsi_pcm_new,
 };
diff --git a/sound/soc/sh/rcar/core.c b/sound/soc/sh/rcar/core.c
index 650cc28..361afc0 100644
--- a/sound/soc/sh/rcar/core.c
+++ b/sound/soc/sh/rcar/core.c
@@ -1318,7 +1318,7 @@ static int rsnd_pcm_new(struct snd_soc_pcm_runtime *rtd)
PREALLOC_BUFFER, PREALLOC_BUFFER_MAX);
 }
 
-static struct snd_soc_platform_driver rsnd_soc_platform = {
+static const struct snd_soc_platform_driver rsnd_soc_platform = {
.ops= _pcm_ops,
.pcm_new= rsnd_pcm_new,
 };
-- 
1.9.1



[PATCH 08/11] ASoC: samsung: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the function
devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/samsung/idma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/samsung/idma.c b/sound/soc/samsung/idma.c
index 3e40815..ee16e4c 100644
--- a/sound/soc/samsung/idma.c
+++ b/sound/soc/samsung/idma.c
@@ -399,7 +399,7 @@ void idma_reg_addr_init(void __iomem *regs, dma_addr_t addr)
 }
 EXPORT_SYMBOL_GPL(idma_reg_addr_init);
 
-static struct snd_soc_platform_driver asoc_idma_platform = {
+static const struct snd_soc_platform_driver asoc_idma_platform = {
.ops = _ops,
.pcm_new = idma_new,
.pcm_free = idma_free,
-- 
1.9.1



[PATCH 07/11] ASoC: qcom: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the
function devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/qcom/lpass-platform.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/qcom/lpass-platform.c b/sound/soc/qcom/lpass-platform.c
index 7aabf08..fb3576a 100644
--- a/sound/soc/qcom/lpass-platform.c
+++ b/sound/soc/qcom/lpass-platform.c
@@ -557,7 +557,7 @@ static void lpass_platform_pcm_free(struct snd_pcm *pcm)
}
 }
 
-static struct snd_soc_platform_driver lpass_platform_driver = {
+static const struct snd_soc_platform_driver lpass_platform_driver = {
.pcm_new= lpass_platform_pcm_new,
.pcm_free   = lpass_platform_pcm_free,
.ops= _platform_pcm_ops,
-- 
1.9.1



[PATCH 06/11] ASoC: pxa: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make these const as they are only passed as the 2nd argument to the
function devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/pxa/mmp-pcm.c| 2 +-
 sound/soc/pxa/pxa2xx-pcm.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/soc/pxa/mmp-pcm.c b/sound/soc/pxa/mmp-pcm.c
index 5b5f1a4..38d5999 100644
--- a/sound/soc/pxa/mmp-pcm.c
+++ b/sound/soc/pxa/mmp-pcm.c
@@ -211,7 +211,7 @@ static int mmp_pcm_new(struct snd_soc_pcm_runtime *rtd)
return ret;
 }
 
-static struct snd_soc_platform_driver mmp_soc_platform = {
+static const struct snd_soc_platform_driver mmp_soc_platform = {
.ops= _pcm_ops,
.pcm_new= mmp_pcm_new,
.pcm_free   = mmp_pcm_free_dma_buffers,
diff --git a/sound/soc/pxa/pxa2xx-pcm.c b/sound/soc/pxa/pxa2xx-pcm.c
index b51d7a0..636895a 100644
--- a/sound/soc/pxa/pxa2xx-pcm.c
+++ b/sound/soc/pxa/pxa2xx-pcm.c
@@ -84,7 +84,7 @@ static int pxa2xx_soc_pcm_new(struct snd_soc_pcm_runtime *rtd)
return ret;
 }
 
-static struct snd_soc_platform_driver pxa2xx_soc_platform = {
+static const struct snd_soc_platform_driver pxa2xx_soc_platform = {
.ops= _pcm_ops,
.pcm_new= pxa2xx_soc_pcm_new,
.pcm_free   = pxa2xx_pcm_free_dma_buffers,
-- 
1.9.1



[PATCH 05/11] ASoC: omap: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the
function devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/omap/omap-pcm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/omap/omap-pcm.c b/sound/soc/omap/omap-pcm.c
index 94e9ff7..1283473 100644
--- a/sound/soc/omap/omap-pcm.c
+++ b/sound/soc/omap/omap-pcm.c
@@ -243,7 +243,7 @@ static int omap_pcm_new(struct snd_soc_pcm_runtime *rtd)
return ret;
 }
 
-static struct snd_soc_platform_driver omap_soc_platform = {
+static const struct snd_soc_platform_driver omap_soc_platform = {
.ops= _pcm_ops,
.pcm_new= omap_pcm_new,
.pcm_free   = omap_pcm_free_dma_buffers,
-- 
1.9.1



Re: [v6 01/15] x86/mm: reserve only exiting low pages

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 11:24:55, Pasha Tatashin wrote:
[...]
> >>In this patchset we will stop zeroing struct page memory during allocation.
> >>Therefore, this bug must be fixed in order to avoid random assert failures
> >>caused by CONFIG_DEBUG_VM_PGFLAGS triggers.
> >>
> >>The fix is to reserve memory from the first existing PFN.
> >
> >Hmm, I assume this is a result of some assert triggering, right? Which
> >one? Why don't we need the same treatment for other than x86 arch?
> 
> Correct, the pgflags asserts were triggered when we were setting reserved
> flags to struct page for PFN 0, which was never initialized through
> __init_single_page(). The reason they were triggered is because we set all
> uninitialized memory to ones in one of the debug patches.

And why don't we need the same treatment for other architectures?
-- 
Michal Hocko
SUSE Labs


[PATCH 04/11] ASoC: nuc900: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make this const as it is only passed as the 2nd argument to the
function devm_snd_soc_register_platform, which is of type const.
Done using Coccinelle

Signed-off-by: Bhumika Goyal 
---
 sound/soc/nuc900/nuc900-pcm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/nuc900/nuc900-pcm.c b/sound/soc/nuc900/nuc900-pcm.c
index 2cca055..cd0486c 100644
--- a/sound/soc/nuc900/nuc900-pcm.c
+++ b/sound/soc/nuc900/nuc900-pcm.c
@@ -299,7 +299,7 @@ static int nuc900_dma_new(struct snd_soc_pcm_runtime *rtd)
return 0;
 }
 
-static struct snd_soc_platform_driver nuc900_soc_platform = {
+static const struct snd_soc_platform_driver nuc900_soc_platform = {
.ops= _dma_ops,
.pcm_new= nuc900_dma_new,
 };
-- 
1.9.1



[PATCH 03/11] ASoC: fsl: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make these const as they are only passed as the 2nd argument to the function
snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/fsl/imx-pcm-fiq.c | 2 +-
 sound/soc/fsl/mpc5200_dma.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/imx-pcm-fiq.c b/sound/soc/fsl/imx-pcm-fiq.c
index 92410f7..3fcc7b5 100644
--- a/sound/soc/fsl/imx-pcm-fiq.c
+++ b/sound/soc/fsl/imx-pcm-fiq.c
@@ -341,7 +341,7 @@ static void imx_pcm_fiq_free(struct snd_pcm *pcm)
imx_pcm_free(pcm);
 }
 
-static struct snd_soc_platform_driver imx_soc_platform_fiq = {
+static const struct snd_soc_platform_driver imx_soc_platform_fiq = {
.ops= _pcm_ops,
.pcm_new= imx_pcm_fiq_new,
.pcm_free   = imx_pcm_fiq_free,
diff --git a/sound/soc/fsl/mpc5200_dma.c b/sound/soc/fsl/mpc5200_dma.c
index 1f7e70b..cdd848c 100644
--- a/sound/soc/fsl/mpc5200_dma.c
+++ b/sound/soc/fsl/mpc5200_dma.c
@@ -356,7 +356,7 @@ static void psc_dma_free(struct snd_pcm *pcm)
}
 }
 
-static struct snd_soc_platform_driver mpc5200_audio_dma_platform = {
+static const struct snd_soc_platform_driver mpc5200_audio_dma_platform = {
.ops= _dma_ops,
.pcm_new= _dma_new,
.pcm_free   = _dma_free,
-- 
1.9.1



[PATCH 02/11] ASoC: Intel: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make these const as they are only passed as the 2nd argument to the
function snd_soc_register_platform, which is of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/intel/atom/sst-mfld-platform-pcm.c | 2 +-
 sound/soc/intel/baytrail/sst-baytrail-pcm.c  | 2 +-
 sound/soc/intel/haswell/sst-haswell-pcm.c| 2 +-
 sound/soc/intel/skylake/skl-pcm.c| 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/sound/soc/intel/atom/sst-mfld-platform-pcm.c 
b/sound/soc/intel/atom/sst-mfld-platform-pcm.c
index 49c7b88..b272df5 100644
--- a/sound/soc/intel/atom/sst-mfld-platform-pcm.c
+++ b/sound/soc/intel/atom/sst-mfld-platform-pcm.c
@@ -705,7 +705,7 @@ static int sst_soc_probe(struct snd_soc_platform *platform)
return sst_dsp_init_v2_dpcm(platform);
 }
 
-static struct snd_soc_platform_driver sst_soc_platform_drv  = {
+static const struct snd_soc_platform_driver sst_soc_platform_drv  = {
.probe  = sst_soc_probe,
.ops= _platform_ops,
.compr_ops  = _platform_compr_ops,
diff --git a/sound/soc/intel/baytrail/sst-baytrail-pcm.c 
b/sound/soc/intel/baytrail/sst-baytrail-pcm.c
index 4765ad4..84cb568 100644
--- a/sound/soc/intel/baytrail/sst-baytrail-pcm.c
+++ b/sound/soc/intel/baytrail/sst-baytrail-pcm.c
@@ -395,7 +395,7 @@ static int sst_byt_pcm_remove(struct snd_soc_platform 
*platform)
return 0;
 }
 
-static struct snd_soc_platform_driver byt_soc_platform = {
+static const struct snd_soc_platform_driver byt_soc_platform = {
.probe  = sst_byt_pcm_probe,
.remove = sst_byt_pcm_remove,
.ops= _byt_pcm_ops,
diff --git a/sound/soc/intel/haswell/sst-haswell-pcm.c 
b/sound/soc/intel/haswell/sst-haswell-pcm.c
index 9e4094e..c044400 100644
--- a/sound/soc/intel/haswell/sst-haswell-pcm.c
+++ b/sound/soc/intel/haswell/sst-haswell-pcm.c
@@ -1135,7 +1135,7 @@ static int hsw_pcm_remove(struct snd_soc_platform 
*platform)
return 0;
 }
 
-static struct snd_soc_platform_driver hsw_soc_platform = {
+static const struct snd_soc_platform_driver hsw_soc_platform = {
.probe  = hsw_pcm_probe,
.remove = hsw_pcm_remove,
.ops= _pcm_ops,
diff --git a/sound/soc/intel/skylake/skl-pcm.c 
b/sound/soc/intel/skylake/skl-pcm.c
index debdaac..e98d825 100644
--- a/sound/soc/intel/skylake/skl-pcm.c
+++ b/sound/soc/intel/skylake/skl-pcm.c
@@ -1310,7 +1310,7 @@ static int skl_platform_soc_probe(struct snd_soc_platform 
*platform)
 
return 0;
 }
-static struct snd_soc_platform_driver skl_platform_drv  = {
+static const struct snd_soc_platform_driver skl_platform_drv  = {
.probe  = skl_platform_soc_probe,
.ops= _platform_ops,
.pcm_new= skl_pcm_new,
-- 
1.9.1



[PATCH 01/11] ASoC: codecs: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make these const as they are either passed as the 2nd argument to the
function devm_snd_soc_register_platform or snd_soc_register_platform,
and the arguments are of type const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal 
---
 sound/soc/codecs/cs47l24.c| 2 +-
 sound/soc/codecs/rt5514-spi.c | 2 +-
 sound/soc/codecs/wm5102.c | 2 +-
 sound/soc/codecs/wm5110.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/sound/soc/codecs/cs47l24.c b/sound/soc/codecs/cs47l24.c
index d323caa..505dbc9 100644
--- a/sound/soc/codecs/cs47l24.c
+++ b/sound/soc/codecs/cs47l24.c
@@ -1213,7 +1213,7 @@ static struct regmap *cs47l24_get_regmap(struct device 
*dev)
.copy = wm_adsp_compr_copy,
 };
 
-static struct snd_soc_platform_driver cs47l24_compr_platform = {
+static const struct snd_soc_platform_driver cs47l24_compr_platform = {
.compr_ops = _compr_ops,
 };
 
diff --git a/sound/soc/codecs/rt5514-spi.c b/sound/soc/codecs/rt5514-spi.c
index 640193d..ed6e537 100644
--- a/sound/soc/codecs/rt5514-spi.c
+++ b/sound/soc/codecs/rt5514-spi.c
@@ -272,7 +272,7 @@ static int rt5514_spi_pcm_probe(struct snd_soc_platform 
*platform)
return 0;
 }
 
-static struct snd_soc_platform_driver rt5514_spi_platform = {
+static const struct snd_soc_platform_driver rt5514_spi_platform = {
.probe = rt5514_spi_pcm_probe,
.ops = _spi_pcm_ops,
 };
diff --git a/sound/soc/codecs/wm5102.c b/sound/soc/codecs/wm5102.c
index 1fe358e..f500692 100644
--- a/sound/soc/codecs/wm5102.c
+++ b/sound/soc/codecs/wm5102.c
@@ -2027,7 +2027,7 @@ static struct regmap *wm5102_get_regmap(struct device 
*dev)
.copy = wm_adsp_compr_copy,
 };
 
-static struct snd_soc_platform_driver wm5102_compr_platform = {
+static const struct snd_soc_platform_driver wm5102_compr_platform = {
.compr_ops = _compr_ops,
 };
 
diff --git a/sound/soc/codecs/wm5110.c b/sound/soc/codecs/wm5110.c
index 1bc9421..d6fae139 100644
--- a/sound/soc/codecs/wm5110.c
+++ b/sound/soc/codecs/wm5110.c
@@ -2382,7 +2382,7 @@ static struct regmap *wm5110_get_regmap(struct device 
*dev)
.copy = wm_adsp_compr_copy,
 };
 
-static struct snd_soc_platform_driver wm5110_compr_platform = {
+static const struct snd_soc_platform_driver wm5110_compr_platform = {
.compr_ops = _compr_ops,
 };
 
-- 
1.9.1



[PATCH 00/11] ASoC: make snd_soc_platform_driver const

2017-08-14 Thread Bhumika Goyal
Make snd_soc_platform_driver const.

Bhumika Goyal (11):
  ASoC: codecs: make snd_soc_platform_driver const
  ASoC: Intel: make snd_soc_platform_driver const
  ASoC: fsl: make snd_soc_platform_driver const
  ASoC: nuc900: make snd_soc_platform_driver const
  ASoC: omap: make snd_soc_platform_driver const
  ASoC: pxa: make snd_soc_platform_driver const
  ASoC: qcom: make snd_soc_platform_driver const
  ASoC: samsung: make snd_soc_platform_driver const
  ASoC: sh: make snd_soc_platform_driver const
  ASoC: txx9: make snd_soc_platform_driver const
  ASoC: soc-utils: make snd_soc_platform_driver const

 sound/soc/codecs/cs47l24.c   | 2 +-
 sound/soc/codecs/rt5514-spi.c| 2 +-
 sound/soc/codecs/wm5102.c| 2 +-
 sound/soc/codecs/wm5110.c| 2 +-
 sound/soc/fsl/imx-pcm-fiq.c  | 2 +-
 sound/soc/fsl/mpc5200_dma.c  | 2 +-
 sound/soc/intel/atom/sst-mfld-platform-pcm.c | 2 +-
 sound/soc/intel/baytrail/sst-baytrail-pcm.c  | 2 +-
 sound/soc/intel/haswell/sst-haswell-pcm.c| 2 +-
 sound/soc/intel/skylake/skl-pcm.c| 2 +-
 sound/soc/nuc900/nuc900-pcm.c| 2 +-
 sound/soc/omap/omap-pcm.c| 2 +-
 sound/soc/pxa/mmp-pcm.c  | 2 +-
 sound/soc/pxa/pxa2xx-pcm.c   | 2 +-
 sound/soc/qcom/lpass-platform.c  | 2 +-
 sound/soc/samsung/idma.c | 2 +-
 sound/soc/sh/dma-sh7760.c| 2 +-
 sound/soc/sh/fsi.c   | 2 +-
 sound/soc/sh/rcar/core.c | 2 +-
 sound/soc/soc-utils.c| 2 +-
 sound/soc/txx9/txx9aclc.c| 2 +-
 21 files changed, 21 insertions(+), 21 deletions(-)

-- 
1.9.1



Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 12:22:52, Pasha Tatashin wrote:
> >>I will address your comment, and send out a new patch. Should I send it out
> >>separately from the series or should I keep it inside?
> >
> >I would post it separately. It doesn't depend on the rest.
> 
> OK, I will post it separately. No it does not depend on the rest, but the
> rest depends on this. So, I am not sure how to enforce that this comes
> before the rest.

Andrew will take care of that. Just make it explicit that some of the
patch depends on an earlier work when reposting.
 
> >>Also, before I send out a new patch, I will need to root cause and resolve
> >>problem found by kernel test robot , and bisected
> >>down to this patch.
> >>
> >>[  156.659400] BUG: Bad page state in process swapper  pfn:03147
> >>[  156.660051] page:88001ed8a1c0 count:0 mapcount:-127 mapping:
> >>(null) index:0x1
> >>[  156.660917] flags: 0x0()
> >>[  156.661198] raw:   0001
> >>ff80
> >>[  156.662006] raw: 88001f4a8120 88001ed85ce0 
> >>
> >>[  156.662811] page dumped because: nonzero mapcount
> >>[  156.663307] CPU: 0 PID: 1 Comm: swapper Not tainted
> >>4.13.0-rc3-00220-g1aad694 #1
> >>[  156.664077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >>1.9.3-20161025_171302-gandalf 04/01/2014
> >>[  156.665129] Call Trace:
> >>[  156.665422]  dump_stack+0x1e/0x20
> >>[  156.665802]  bad_page+0x122/0x148
> >
> >Was the report related with this patch?
> 
> Yes, they said that the problem was bisected down to this patch. Do you know
> if there is a way to submit a patch to this test robot?

You can ask them for re testing with an updated patch by replying to
their report. Anyway, I fail to see how the change could lead to this
patch.
-- 
Michal Hocko
SUSE Labs


Re: [v6 04/15] mm: discard memblock data later

2017-08-14 Thread Michal Hocko
On Fri 11-08-17 15:00:47, Pasha Tatashin wrote:
> Hi Michal,
> 
> This suggestion won't work, because there are arches without memblock
> support: tile, sh...
> 
> So, I would still need to have:
> 
> #ifdef CONFIG_MEMBLOCK in page_alloc, or define memblock_discard() stubs in
> nobootmem headfile. 

This is the standard way to do this. And it is usually preferred over
proliferating ifdefs in the code.
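
For readers unfamiliar with the stub pattern being discussed, here is a
minimal userspace sketch of it. DEMO_HAVE_MEMBLOCK is a hypothetical stand-in
for CONFIG_MEMBLOCK and the demo_* functions are illustrative only; this is
not the actual memblock or nobootmem header.

```c
#include <assert.h>

/*
 * DEMO_HAVE_MEMBLOCK is a hypothetical stand-in for CONFIG_MEMBLOCK.
 * Build with -DDEMO_HAVE_MEMBLOCK=1 to select the real implementation.
 */
#ifndef DEMO_HAVE_MEMBLOCK
#define DEMO_HAVE_MEMBLOCK 0
#endif

/* Counts how often the real implementation ran (for demonstration). */
static int demo_discard_calls;

#if DEMO_HAVE_MEMBLOCK
/* "Real" implementation, only built when the feature is configured in. */
static inline void demo_memblock_discard(void)
{
	demo_discard_calls++;
}
#else
/* Empty stub: callers still compile, and the call optimises away. */
static inline void demo_memblock_discard(void)
{
}
#endif

/* The caller contains no #ifdef; the header decides what the call does. */
static void demo_free_init_data(void)
{
	demo_memblock_discard();
}
```

The conditional lives in exactly one place (the header), so the call site
reads the same on every architecture.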
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 1/2] powerpc/xmon: Dump ftrace buffers for the current CPU only

2017-08-14 Thread Michael Ellerman
Breno Leitao  writes:

> Current xmon 'dt' command dumps the tracing buffer for all the CPUs,
> which makes it very hard to read due to the fact that most of
> powerpc machines currently have many CPUs.

> Other than that, the CPU
> lines are interleaved in the ftrace log.

That's because they're ordered by time. And the CPU number is included
in each line.

> This new option just dumps the ftrace buffer for the current CPU.

This is still a good idea though :)

> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 08e367e3e8c3..e0522f60f0ee 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -234,6 +234,7 @@ Commands:\n\
>"\
>dr dump stream of raw bytes\n\
>dt dump the tracing buffers (uses printk)\n\
> +  dtc dump the tracing buffers for current CPU (uses printk)\n\

Hopefully people won't be disappointed that this doesn't invoke dtc
(Device Tree Compiler) in xmon :P

cheers


Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:

> We need the SPRN_TIDR to be set for use with fast thread-wakeup
> (core-to-core wakeup).  Each thread in a process needs to have a
> unique id within the process but as explained below, for now, we
> assign globally unique thread ids to all threads in the system.

Each thread in a process already has a unique id, ie. its pid (in the
init PID namespace), accessible in the kernel as task_pid_nr(task).

So if that's all we need, we don't need a new allocator, and we don't
need to store it in the thread_struct.

Also 99.99% of processes aren't going to care about the TIDR, so we
should avoid setting it in the common case. ie. it should start out zero
and only be initialised in the FTW code, or a helper that it calls.

> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 9f3e2c9..6123859 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1213,6 +1213,16 @@ struct task_struct *__switch_to(struct task_struct 
> *prev,
>   hard_irq_disable();
>   }
>  
> +#ifdef CONFIG_PPC_VAS
> + mtspr(SPRN_TIDR, new->thread.tidr);
> +#endif

That should be in restore_sprs().

It should also check that the TIDR is initialised, and only switch it
when necessary.
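
As a rough illustration of the two points made here (leave the TIDR at zero
until a thread actually needs one, and only write the register on a context
switch when the value changes), here is a userspace model. All demo_* names
are hypothetical; demo_mtspr_tidr() merely simulates the register write and
none of this is the powerpc implementation.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct demo_thread {
	unsigned long tidr;	/* 0 means "never assigned" */
};

static unsigned long demo_hw_tidr;	/* models the hardware register */
static unsigned int demo_spr_writes;	/* counts simulated register writes */

static void demo_mtspr_tidr(unsigned long val)
{
	demo_hw_tidr = val;
	demo_spr_writes++;
}

/*
 * Lazy assignment: called only by code that needs a thread id, on behalf
 * of the currently running thread. Threads that never call it keep 0.
 */
static void demo_assign_tidr(struct demo_thread *cur, unsigned long id)
{
	if (cur->tidr != 0)
		return;		/* already initialised */
	cur->tidr = id;
	demo_mtspr_tidr(id);
}

/* Context switch: touch the register only when the value differs. */
static void demo_restore_sprs(const struct demo_thread *prev,
			      const struct demo_thread *next)
{
	if (prev->tidr != next->tidr)
		demo_mtspr_tidr(next->tidr);
}
```

In this model, switching between two threads that never asked for an id costs
no register write at all, which is the common case the review is concerned
about.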

> + /*
> +  * We can't take a PMU exception inside _switch() since there is a
> +  * window where the kernel stack SLB and the kernel stack are out
> +  * of sync. Hard disable here.
> +  */
> + hard_irq_disable();

We removed that in June in:

 e4c0fc5f72bc ("powerpc/64s: Leave interrupts hard enabled in context switch 
for radix")

You've obviously picked it up somewhere along the line during a rebase,
please be more careful!

cheers


Re: [PATCH v6 01/17] powerpc/vas: Define macros, register fields and structures

2017-08-14 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Mon, 14 Aug 2017 15:21:48 +1000
> Michael Ellerman  wrote:
>
>> Sukadev Bhattiprolu  writes:
>
>> >  arch/powerpc/include/asm/vas.h   |  35 
>> >  arch/powerpc/include/uapi/asm/vas.h  |  25 +++  
>> 
>> I thought we weren't exposing VAS to userspace yet?
>> 
>> If we are then we need to get things straight WRT copy/paste abort.
>
> No we should not be. This might be just a leftover hunk that should
> be moved to a future series.
>
> At the moment (as far as I understand) it should be limited to
> preempt-disabled, process context, kernel users which avoids any
> concern for switch_to.

I think that comment applied to a previous version, see patch 16.

cheers


Re: [PATCH v6 17/17] powerpc/vas: Document FTW API/usage

2017-08-14 Thread Michael Neuling
On Tue, 2017-08-08 at 16:07 -0700, Sukadev Bhattiprolu wrote:
> Document the usage of the VAS Fast thread-wakeup API.
> 
> Thanks for input/comments from Benjamin Herrenschmidt, Michael Neuling,
> Michael Ellerman, Robert Blackmore, Ian Munsie, Haren Myneni, Paul Mackerras.
> 
> Cc:Ian Munsie 
> Cc:Paul Mackerras 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  Documentation/powerpc/ftw-api.txt | 373
> ++
>  1 file changed, 373 insertions(+)
>  create mode 100644 Documentation/powerpc/ftw-api.txt
> 
> diff --git a/Documentation/powerpc/ftw-api.txt b/Documentation/powerpc/ftw-
> api.txt
> new file mode 100644
> index 000..0b3f16f
> --- /dev/null
> +++ b/Documentation/powerpc/ftw-api.txt
> @@ -0,0 +1,373 @@
> +Virtual Accelerator Switchboard and Fast Thread-Wakeup API
> +
> +Power9 processor supports a hardware subsystem known as the Virtual
> +Accelerator Switchboard (VAS) which allows two entities in the Power9
> +system to efficiently exchange messages. Messages must be formatted as
> +Coprocessor Request Blocks (CRB) and be submitted using the COPY/PASTE
> +instructions (new in Power9).
> +
> +Usage of VAS depends on the entities exchanging the messages and
> +currently two usages have been identified.
> +
> +First usage of VAS, referred to as VAS/NX involves a software thread
> +submitting data compression requests to a co-processor (hardware/nest
> +accelerator) aka NX engine. The API for this usage is described in the
> +VAS/NX API document.
> +
> +Alternatively, VAS can be used by two software threads to efficiently
> +exchange messages. Initially, this mechanism is intended to wake up a
> +waiting thread quickly - i.e "fast thread wake-up (FTW)". This document
> +describes the user API for this VAS/FTW mechanism.
> +
> +Application access to the FTW mechanism is provided through the NX-FTW
> +device node (/dev/crypto/nx-ftw) implemented by the VAS/FTW device
> +driver.

crypto?

> +
> +A software thread T1 that intends to wait for an event must first setup
> +a receive window, by opening the NX-FTW device and using the
> +VAS_RX_WIN_OPEN ioctl. Upon successful return from the VAS_RX_WIN_OPEN
> +ioctl, an rx_win_handle is returned.

I realise there is a window here as part of the hardware implementation, but the
users don't care about the window on the receive side. It's hidden from them. 
It's just an rx handle IMHO.

The sender certainly has a window that users care about since they have to mmap
it.

> +
> +A software thread T2 that intends to wake up T1 at some point, must first
> +set up a "send window" using the VAS_TX_WIN_OPEN ioctl and specify the
> +rx_win_handle obtained by T1. After a successful VAS_TX_WIN_OPEN ioctl
> the
> +send window of T2 is considered paired with the receive window of T1. The
> +thread T2 must then use mmap() to obtain a "paste address" for the send
> +window.


> +With this set up, thread T1 can wait for an event using the WAIT
> +instruction.
> +
> +Thread T2 can wake up T1 by using the "COPY/PASTE" instructions and
> +submitting an empty/NULL CRB to the send window's paste address. The
> +wait/wake up process can be repeated as long as the threads have the
> +send/receive windows open.



> +1. NX-FTW Device Node
> +
> +There is one /dev/crypto/nx-ftw node in the system and it provides
> +access to the VAS/FTW functionality.


> +The only valid operations on the NX-FTW node are:
> +
> +- open() the device for read and write.
> +
> +- issue either VAS_RX_WIN_OPEN or VAS_TX_WIN_OPEN ioctls to set up
> +  receive or send (only one of them per open).
> +
> +- if the open is associated with send window (i.e VAS_TX_WIN_OPEN
> +  ioctl was issued) mmap() the send window into the application's
> +  virtual address space. (i.e get a 'paste_address' for the send
> +  window).
> +
> +- close the device node.
> +
> +Other file operations on the NX-FTW node are undefined.
> +
> +Note that the COPY and PASTE operations go directly to the hardware
> +and do not go through the NX-FTW device.

I don't understand this statement

> +
> +Although a system may have several instances of the VAS in the system
> +(typically, one per P9 chip) there is just one NX-FTW device node in
> +the system.

> + When the NX-FTW device node is opened, the kernel assigns a suitable
> + instance of VAS to the process. Kernel will make a best-effort
> attempt
> + to assign an optimal instance of VAS for the process. In the initial
> +release, the kernel does not support migrating the VAS instance if the
> +process migrates from a processor on one chip to a processor on another
> +chip.

How is it "optimal"?

> +Applications may chose a 

Re: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-14 Thread Alexey Kardashevskiy
Folks,

Is there anything to change besides those compiler errors and David's
comment in 5/5? Or is the whole patchset too bad? Thanks.



On 07/08/17 17:25, Alexey Kardashevskiy wrote:
> This is a followup for "[PATCH kernel v4 0/6] vfio-pci: Add support for 
> mmapping MSI-X table"
> http://www.spinics.net/lists/kvm/msg152232.html
> 
> This time it is using "caps" in IOMMU groups. The main question is if PCI
> bus flags or IOMMU domains are still better (and which one).

> 
> 
> 
> Here is some background:
> 
> The current vfio-pci implementation disallows mmapping the page
> containing the MSI-X table, so that users cannot write directly
> to the MSI-X table and generate incorrect MSIs.
> 
> However, this causes a performance issue when there are
> critical device registers in the same page as the MSI-X
> table: the MMIO accesses to these registers have to be
> handled by QEMU emulation rather than directly in the guest.
> 
> To solve this issue, this series allows exposing the MSI-X table
> to userspace when the hardware enables interrupt remapping,
> which ensures that a given PCI device can only shoot the MSIs
> assigned to it. We introduce a new bus_flags value,
> PCI_BUS_FLAGS_MSI_REMAP, to test this capability on the PCI side
> for different archs.
> 
> 
> This is based on sha1
> 26c5cebfdb6c "Merge branch 'parisc-4.13-4' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux"
> 
> Please comment. Thanks.
> 
> Changelog:
> 
> v5:
> * redid the whole thing via so-called IOMMU group capabilities
> 
> v4:
> * rebased on recent upstream
> * got all 6 patches from v2 (v3 was missing some)
> 
> 
> 
> 
> Alexey Kardashevskiy (5):
>   iommu: Add capabilities to a group
>   iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if MSI controller enables IRQ
> remapping
>   iommu/intel/amd: Set IOMMU_GROUP_CAP_ISOLATE_MSIX if IRQ remapping is
> enabled
>   powerpc/iommu: Set IOMMU_GROUP_CAP_ISOLATE_MSIX
>   vfio-pci: Allow to expose MSI-X table to userspace when safe
> 
>  include/linux/iommu.h| 20 
>  include/linux/vfio.h |  1 +
>  arch/powerpc/kernel/iommu.c  |  1 +
>  drivers/iommu/amd_iommu.c|  3 +++
>  drivers/iommu/intel-iommu.c  |  3 +++
>  drivers/iommu/iommu.c| 35 +++
>  drivers/vfio/pci/vfio_pci.c  | 20 +---
>  drivers/vfio/pci/vfio_pci_rdwr.c |  5 -
>  drivers/vfio/vfio.c  | 15 +++
>  9 files changed, 99 insertions(+), 4 deletions(-)
> 


-- 
Alexey


Re: [PATCH v6 02/17] powerpc/vas: Move GET_FIELD/SET_FIELD to vas.h

2017-08-14 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:

> Move the GET_FIELD and SET_FIELD macros to vas.h as VAS and other
> users of VAS, including NX-842 can use those macros.
>
> There is a lot of related code between the VAS/NX kernel drivers
> and skiboot. For consistency switch the order of parameters in
> SET_FIELD to match the order in skiboot.
>
> Signed-off-by: Sukadev Bhattiprolu 
> Reviewed-by: Dan Streetman 

> diff --git a/arch/powerpc/include/uapi/asm/vas.h 
> b/arch/powerpc/include/uapi/asm/vas.h
> index ddfe046..21249f5 100644
> --- a/arch/powerpc/include/uapi/asm/vas.h
> +++ b/arch/powerpc/include/uapi/asm/vas.h
> @@ -22,4 +22,12 @@
>  #define VAS_THRESH_FIFO_GT_QTR_FULL  2
>  #define VAS_THRESH_FIFO_GT_EIGHTH_FULL   3
>  
> +/*
> + * Get/Set bit fields
> + */
> +#define GET_FIELD(m, v)  (((v) & (m)) >> MASK_LSH(m))
> +#define MASK_LSH(m)  (__builtin_ffsl(m) - 1)
> +#define SET_FIELD(m, v, val) \
> + (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_LSH(m)) & (m)))

This has no business being in a uapi header for VAS.

Put it in asm/vas.h if you must.

Personally I really dislike these sort of macros because they completely
obscure what the final value should end up being, and it's the final
value you'll see when you're debugging it.

> + ccw = SET_FIELD(CCW_CT, ccw, nx842_ct);
> + ccw = SET_FIELD(CCW_CI_842, ccw, 0); /* use 0 for hw auto-selection */
> + ccw = SET_FIELD(CCW_FC_842, ccw, fc);

eg. that could also be written:

   ccw = (nx842_ct << 16) | (fc & 7);

cheers
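
To make the readability argument concrete, here is a small standalone sketch
comparing the macro form with an open-coded shift-and-mask. The macros follow
the parameter order proposed in the patch (mask first); DEMO_CT and DEMO_FC
are hypothetical masks picked for illustration and are not the real CCW field
layout. __typeof__ is used instead of typeof so the sketch also builds in
strict ISO mode with GCC or Clang.

```c
#include <assert.h>

/* Shift count of the lowest set bit in mask m (GCC/Clang builtin). */
#define MASK_LSH(m)		(__builtin_ffsl(m) - 1)
#define GET_FIELD(m, v)		(((v) & (m)) >> MASK_LSH(m))
#define SET_FIELD(m, v, val)	\
	(((v) & ~(m)) | ((((__typeof__(v))(val)) << MASK_LSH(m)) & (m)))

/* Hypothetical masks for illustration only; not the real CCW layout. */
#define DEMO_CT	0x00ff0000UL	/* bits 16..23 */
#define DEMO_FC	0x00000007UL	/* bits 0..2 */

/* Builds the word field by field through the macros. */
static unsigned long demo_with_macros(unsigned long ct, unsigned long fc)
{
	unsigned long w = 0;

	w = SET_FIELD(DEMO_CT, w, ct);
	w = SET_FIELD(DEMO_FC, w, fc);
	return w;
}

/* Builds the same word with the shifts and masks written out. */
static unsigned long demo_open_coded(unsigned long ct, unsigned long fc)
{
	return ((ct << 16) & DEMO_CT) | (fc & DEMO_FC);
}
```

Both functions produce the same word for the same inputs; the open-coded
version shows the bit positions directly, while the macro version keeps them
in the mask definitions.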


Re: [PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-14 Thread Suraj Jitindar Singh
On Wed, 2017-08-09 at 20:30 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> 
> > The host process table base is stored in the partition table by
> > calling
> > the function native_register_process_table(). Currently this just
> > sets
> > the entry in memory and is missing a proceeding cache invalidation
> > instruction. Any update to the partition table should be followed
> > by a
> > cache invalidation instruction specifying invalidation of the
> > caching of
> > any partition table entries (RIC = 2, PRS = 0).
> > 
> > We already have a function to update the partition table with the
> > required cache invalidation instructions -
> > mmu_partition_table_set_entry().
> > Update the native_register_process_table() function to call
> > mmu_partition_table_set_entry(), this ensures all appropriate
> > invalidation will be performed.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/mm/pgtable-radix.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/mm/pgtable-radix.c
> > b/arch/powerpc/mm/pgtable-radix.c
> > index 671a45d..1d5178f 100644
> > --- a/arch/powerpc/mm/pgtable-radix.c
> > +++ b/arch/powerpc/mm/pgtable-radix.c
> > @@ -33,7 +33,8 @@ static int native_register_process_table(unsigned
> > long base, unsigned long pg_sz
> >  {
> >     unsigned long patb1 = base | table_size | PATB_GR;
> >  
> > -   partition_tb->patb1 = cpu_to_be64(patb1);
> > +   mmu_partition_table_set_entry(0, be64_to_cpu(partition_tb-
> > >patb0),
> > +     patb1);
> 
> This is really a bit gross.
> 
> Can we agree on whether partition_tb is an array or not?

Well it is an array, it's just we only ever want the first element in
this function. That being said we might as well access it as an array
to make that clear.

> 
> How about ...
> 
> cheers
> 
> diff --git a/arch/powerpc/mm/pgtable-radix.c
> b/arch/powerpc/mm/pgtable-radix.c
> index c1185c8ecb22..5d8be076f8e5 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -28,9 +28,13 @@
>  static int native_register_process_table(unsigned long base,
> unsigned long pg_sz,
>  unsigned long table_size)
>  {
> -   unsigned long patb1 = base | table_size | PATB_GR;
> +   unsigned long patb0, patb1;
> +
> +   patb0 = be64_to_cpu(partition_tb[0].patb0);
> +   patb1 = base | table_size | PATB_GR;
> +
> +   mmu_partition_table_set_entry(0, patb0, patb1);
>  
> -   partition_tb->patb1 = cpu_to_be64(patb1);
> return 0;
>  }

Looks good :)
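As a rough user-space model of why routing the update through a single helper is attractive: every writer then gets the invalidation for free. Everything here is a stand-in. ex_partition_tb, ex_set_entry() and the invalidation counter are hypothetical names, the PATB_GR flag is left out, and the real helper issues a cache-invalidation instruction rather than bumping a counter.

```c
#include <assert.h>
#include <stdint.h>

struct ex_patb_entry {
	uint64_t patb0;	/* stored big-endian, as the hardware expects */
	uint64_t patb1;
};

static struct ex_patb_entry ex_partition_tb[4];
static unsigned int ex_invalidations;

/* Convert between CPU byte order and big-endian (self-inverse). */
static uint64_t ex_swab_be64(uint64_t v)
{
	const union { uint16_t u; uint8_t b[2]; } probe = { .u = 1 };
	uint64_t r = 0;
	int i;

	if (!probe.b[0])	/* big-endian host: nothing to do */
		return v;
	for (i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return r;
}

/* Single place that writes an entry and then invalidates cached copies. */
static void ex_set_entry(unsigned int lpid, uint64_t dw0, uint64_t dw1)
{
	ex_partition_tb[lpid].patb0 = ex_swab_be64(dw0);
	ex_partition_tb[lpid].patb1 = ex_swab_be64(dw1);
	ex_invalidations++;	/* stand-in for the real invalidation */
}

/* Model of the reworked function: keep patb0, replace patb1. */
static int ex_register_process_table(uint64_t base, uint64_t table_size)
{
	uint64_t patb0 = ex_swab_be64(ex_partition_tb[0].patb0);
	uint64_t patb1 = base | table_size;

	ex_set_entry(0, patb0, patb1);
	return 0;
}
```

The point of the model is only that the caller of ex_register_process_table() cannot forget the invalidation, because it never touches the table directly.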


[PATCH] powerpc/8xx: fix two CONFIG_8xx left behind

2017-08-14 Thread Christophe Leroy
Commit 968159c0031ac ("powerpc/8xx: Getting rid of remaining
use of CONFIG_8xx") removed all but 2 references to 8xx in
Kconfigs.

This patch removes the two remaining ones.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig   | 2 +-
 arch/powerpc/platforms/8xx/Kconfig | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c8334a5b0eb7..2a290561e851 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1176,7 +1176,7 @@ config CONSISTENT_SIZE
 
 config PIN_TLB
bool "Pinned Kernel TLBs (860 ONLY)"
-   depends on ADVANCED_OPTIONS && 8xx
+   depends on ADVANCED_OPTIONS && PPC_8xx
 
 config PIN_TLB_IMMR
bool "Pinned TLB for IMMR"
diff --git a/arch/powerpc/platforms/8xx/Kconfig 
b/arch/powerpc/platforms/8xx/Kconfig
index d3f664f5166b..536b0c5d5ce3 100644
--- a/arch/powerpc/platforms/8xx/Kconfig
+++ b/arch/powerpc/platforms/8xx/Kconfig
@@ -91,7 +91,7 @@ endmenu
 #
 
 menu "MPC8xx CPM Options"
-   depends on 8xx
+   depends on PPC_8xx
 
 # This doesn't really belong here, but it is convenient to ask
 # 8xx specific questions.
-- 
2.13.3



Re: [PATCH v6 14/17] powerpc: Add support for setting SPRN_TIDR

2017-08-14 Thread Michael Neuling
On Tue, 2017-08-08 at 16:06 -0700, Sukadev Bhattiprolu wrote:
> We need the SPRN_TIDR to be set for use with fast thread-wakeup
> (core-to-core wakeup).  Each thread in a process needs to have a
> unique id within the process but as explained below, for now, we
> assign globally unique thread ids to all threads in the system.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  arch/powerpc/include/asm/processor.h |  4 ++
>  arch/powerpc/kernel/process.c| 74
> 
>  2 files changed, 78 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/processor.h
> b/arch/powerpc/include/asm/processor.h
> index fab7ff8..bf6ba63 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -232,6 +232,10 @@ struct debug_reg {
>  struct thread_struct {
>   unsigned long   ksp;/* Kernel stack pointer */
>  
> +#ifdef CONFIG_PPC_VAS

I'm tempted to have this always, or a new feature CONFIG_PPC_TID that PPC_VAS
depends on.

> + unsigned long   tidr;

> +#endif
> +
>  #ifdef CONFIG_PPC64
>   unsigned long   ksp_vsid;
>  #endif
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 9f3e2c9..6123859 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1213,6 +1213,16 @@ struct task_struct *__switch_to(struct task_struct
> *prev,
>   hard_irq_disable();
>   }
>  
> +#ifdef CONFIG_PPC_VAS
> + mtspr(SPRN_TIDR, new->thread.tidr);

How much does this hurt our context_switch benchmark in
tools/testing/selftests/powerpc/benchmarks/context_switch.c?

Also you need a CPU_FTR_ARCH_300 test here (and elsewhere).

> +#endif
> + /*
> +  * We can't take a PMU exception inside _switch() since there is a
> +  * window where the kernel stack SLB and the kernel stack are out
> +  * of sync. Hard disable here.
> +  */
> + hard_irq_disable();
> +

What is this?

>   /*
>    * Call restore_sprs() before calling _switch(). If we move it after
>    * _switch() then we miss out on calling it for new tasks. The reason
> @@ -1449,9 +1459,70 @@ void flush_thread(void)
>  #endif /* CONFIG_HAVE_HW_BREAKPOINT */
>  }
>  
> +#ifdef CONFIG_PPC_VAS
> +static DEFINE_SPINLOCK(vas_thread_id_lock);
> +static DEFINE_IDA(vas_thread_ida);

This IDA should be per process, not global.
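For illustration only, a per-process allocator could look roughly like the user-space sketch below. struct ex_process, its bitmap and the limit of 64 ids are hypothetical; a kernel version would presumably hang an IDA off some per-process structure and take care of locking, which this sketch ignores. Ids start at 1 only to mirror the starting value passed in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX_TIDS 64

/* Hypothetical per-process state: one bit per thread id in use. */
struct ex_process {
	uint64_t tid_bitmap;
};

/* Return the lowest free id >= 1, or -1 if the process has none left. */
static int ex_assign_thread_id(struct ex_process *p)
{
	int id;

	for (id = 1; id < EX_MAX_TIDS; id++) {
		if (!(p->tid_bitmap & (UINT64_C(1) << id))) {
			p->tid_bitmap |= UINT64_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Give an id back to its owning process; out-of-range ids are ignored. */
static void ex_free_thread_id(struct ex_process *p, int id)
{
	if (id >= 1 && id < EX_MAX_TIDS)
		p->tid_bitmap &= ~(UINT64_C(1) << id);
}
```

With state kept per process, two unrelated processes can both hold id 1, which keeps the values small compared with one global id space.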

> +
> +/*
> + * We need to assign an unique thread id to each thread in a process. This
> + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> + * to-core wakeup) mechanism being implemented on top of Virtual Accelerator
> + * Switchboard (VAS).
> + *
> + * To get a unique thread-id per process we could simply use task_pid_nr()
> + * but the problem is that task_pid_nr() is not yet available for the thread
> + * when copy_thread() is called. Fixing that would require changing more
> + * intrusive arch-neutral code in code path in copy_process()?.
> + *
> + * Further, to assign unique thread ids within each process, we need an
> + * atomic field (or an IDR) in task_struct, which again intrudes into the
> + * arch-neutral code.

Really?

> + * So try to assign globally unique thread ids for now.

Yuck!

> + */
> +static int assign_thread_id(void)
> +{
> + int index;
> + int err;
> +
> +again:
> + if (!ida_pre_get(&vas_thread_ida, GFP_KERNEL))
> + return -ENOMEM;
> +
> + spin_lock(&vas_thread_id_lock);
> + err = ida_get_new_above(&vas_thread_ida, 1, &index);

We can't use 0 or 1?

> + spin_unlock(&vas_thread_id_lock);
> +
> + if (err == -EAGAIN)
> + goto again;
> + else if (err)
> + return err;
> +
> + if (index > MAX_USER_CONTEXT) {
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, index);
> + spin_unlock(&vas_thread_id_lock);
> + return -ENOMEM;
> + }
> +
> + return index;
> +}
> +
> +static void free_thread_id(int id)
> +{
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, id);
> + spin_unlock(&vas_thread_id_lock);
> +}
> +#endif /* CONFIG_PPC_VAS */
> +
> +
>  void
>  release_thread(struct task_struct *t)
>  {
> +#ifdef CONFIG_PPC_VAS
> + free_thread_id(t->thread.tidr);
> +#endif

Can you restructure this to avoid the #ifdef ugliness?

>  }
>  
>  /*
> @@ -1587,6 +1658,9 @@ int copy_thread(unsigned long clone_flags, unsigned long
> usp,
>  #endif
>  
>   setup_ksp_vsid(p, sp);
> +#ifdef CONFIG_PPC_VAS
> + p->thread.tidr = assign_thread_id();
> +#endif

Same here... 
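One common way to do this is to confine the conditional to a single place and provide empty inline stubs when the option is off, so the call sites stay unconditional. The sketch below is a stand-alone illustration; EX_CONFIG_PPC_VAS, struct ex_thread and the ex_* helpers are made-up names, and the id source is a plain counter rather than an IDA.

```c
#include <assert.h>

#define EX_CONFIG_PPC_VAS 1	/* set to 0 to build the stubbed variant */

struct ex_thread {
	int tidr;
};

#if EX_CONFIG_PPC_VAS
static int ex_next_id = 1;

/* Feature enabled: hand out the next id. */
static inline void ex_set_thread_tidr(struct ex_thread *t)
{
	t->tidr = ex_next_id++;
}

static inline void ex_clear_thread_tidr(struct ex_thread *t)
{
	t->tidr = 0;
}
#else
/* Feature compiled out: the helpers collapse to nothing. */
static inline void ex_set_thread_tidr(struct ex_thread *t) { (void)t; }
static inline void ex_clear_thread_tidr(struct ex_thread *t) { (void)t; }
#endif

/* Call sites need no #ifdef of their own. */
static void ex_copy_thread(struct ex_thread *t)
{
	ex_set_thread_tidr(t);
}

static void ex_release_thread(struct ex_thread *t)
{
	ex_clear_thread_tidr(t);
}
```

The compiler drops the empty stubs entirely, so the disabled configuration pays nothing at run time while the callers read the same in both builds.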

>  
>  #ifdef CONFIG_PPC64 
>   if (cpu_has_feature(CPU_FTR_DSCR)) {


Re: [PATCH v6 16/17] powerpc/vas: Implement a simple FTW driver

2017-08-14 Thread Michael Ellerman
Hi Suka,

Some comments inline ...


Sukadev Bhattiprolu  writes:

> The Fast Thread Wake-up (FTW) driver provides user space applications an
> interface to the Core-to-Core functionality in POWER9. The driver provides
> the device node/ioctl API to applications and uses the external interfaces
> to the VAS driver to interact with the VAS hardware.
>
> A follow-on patch provides detailed description of the API for the driver.
>
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  MAINTAINERS |   1 +
>  arch/powerpc/platforms/powernv/Kconfig  |  16 ++
>  arch/powerpc/platforms/powernv/Makefile |   1 +
>  arch/powerpc/platforms/powernv/nx-ftw.c | 486 
> 

AFAICS this has nothing to do with NX, so why is it called nx-ftw ?

Also aren't we going to want to use this on pseries eventually? If so
should it go in arch/powerpc/sysdev ?

> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index e4db292..dc60046 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -13,3 +13,4 @@ obj-$(CONFIG_MEMORY_FAILURE)+= opal-memory-errors.o
>  obj-$(CONFIG_TRACEPOINTS)+= opal-tracepoints.o
>  obj-$(CONFIG_OPAL_PRD)   += opal-prd.o
>  obj-$(CONFIG_PPC_VAS)+= vas.o vas-window.o
> +obj-$(CONFIG_PPC_FTW)+= nx-ftw.o
> diff --git a/arch/powerpc/platforms/powernv/nx-ftw.c 
> b/arch/powerpc/platforms/powernv/nx-ftw.c
> new file mode 100644
> index 000..a0b6388
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/nx-ftw.c
> @@ -0,0 +1,486 @@

Missing license header.

> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Please try and trim the list to what you need.

> +
> +/*
> + * NX-FTW is a device driver used to provide user space access to the
> + * Core-to-Core aka Fast Thread Wakeup (FTW) functionality provided by
> + * the Virtual Accelerator Subsystem (VAS) in POWER9 systems. See also
> + * arch/powerpc/platforms/powernv/vas*.
> + *
> + * The driver creates the device node /dev/crypto/nx-ftw that can be
> + * used as follows:
> + *
> + *   fd = open("/dev/crypto/nx-ftw", O_RDWR);
> + *   rc = ioctl(fd, VAS_RX_WIN_OPEN, );
> + *   rc = ioctl(fd, VAS_TX_WIN_OPEN, );
> + *   paste_addr = mmap(NULL, PAGE_SIZE, prot, MAP_SHARED, fd, 0ULL).
> + *   vas_copy(, 0, 1);
> + *   vas_paste(paste_addr, 0, 1);
> + *
> + * where "vas_copy" and "vas_paste" are defined in copy-paste.h.
> + */
> +
> +static char  *nxftw_dev_name = "nx-ftw";
> +static atomic_t  nxftw_instid = ATOMIC_INIT(0);
> +static dev_t nxftw_devt;
> +static struct dentry *nxftw_debugfs;
> +static struct class  *nxftw_dbgfs_class;

The class doesn't go in debugfs, which is what "dbgfs" says to me.

> +/*
> + * Wrapper object for the nx-ftw device node - there is just one

Just "device".

"device node" is ambiguous vs device tree.

> + * instance of this node for the whole system.

So why not put the globals above in here also?

> + */
> +struct nxftw_dev {
> + struct cdev cdev;
> + struct device *device;
> + char *name;
> + atomic_t refcount;
> +} nxftw_device;
> +
> +/*
> + * One instance per open of a nx-ftw device. Each nxftw_instance is
> + * associated with a VAS window, after the caller issues VAS_RX_WIN_OPEN
> + * or VAS_TX_WIN_OPEN ioctl.
> + */
> +struct nxftw_instance {
> + int instance;
> + bool tx_win;
> + struct vas_window *window;
> +};
> +
> +#define VAS_DEFAULT_VAS_ID   0
> +#define POWERNV_LPID 0   /* TODO: For VM/KVM guests? */

mfspr(SPRN_LPID)

would seem to do the trick?

> +static char *nxftw_devnode(struct device *dev, umode_t *mode)
> +{
> + return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));

This isn't a crypto device?

> +}
> +
> +static int nxftw_open(struct inode *inode, struct file *fp)
> +{
> + int minor;
> + struct nxftw_instance *nxti;

instance would be a better name.

> + minor = MINOR(inode->i_rdev);

Not used?

> + nxti = kzalloc(sizeof(*nxti), GFP_KERNEL);
> + if (!nxti)
> + return -ENOMEM;
> +
> + nxti->instance = atomic_inc_return(&nxftw_instid);

And this would read better if the variable was "id". eg.

instance->id = atomic_inc_return(_instance_id);

> + nxti->window = NULL;
> +
> + fp->private_data = nxti;
> + return 0;
> +}
> +
> +static int validate_txwin_user_attr(struct vas_tx_win_open_attr *uattr)
> +{
> + int i;
> +
> + if (uattr->version != 1)
> + return -EINVAL;
> +
> + if (uattr->flags & ~VAS_FLAGS_HIGH_PRI)
> + return -EINVAL;
> +
> + if (uattr->reserved1 || uattr->reserved2)
> + return -EINVAL;
> +
> + 

Re: [PATCH] powerpc/64s: Add support for ASB_Notify on POWER9

2017-08-14 Thread Michael Neuling
On Sat, 2017-08-05 at 14:28 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-08-04 at 16:56 +0200, Christophe Lombard wrote:
> > The POWER9 core supports a new feature: ASB_Notify which requires the
> > support of the Special Purpose Register: TIDR.
> > 
> > The ASB_Notify command, generated by the AFU, will attempt to
> > wake-up the host thread identified by the particular LPID:PID:TID.
> > 
> > The special register TIDR has to be updated with the same value
> > present in the process element.
> > 
> > If the length of the register TIDR is 64bits, the CAPI Translation
> > Service Layer core (XSL9) for Power9 systems limits the size (16bits) of
> > the Thread ID when it generates the ASB_Notify message adding
> > PID:LPID:TID information from the context.
> > 
> > The content of the internal kernel Thread ID (32bits) can not therefore
> > be used to fulfill the register TIDR.
> > 
> > This patch allows to avoid this limitation by adding a new interface
> > for the user. The instructions mfspr/mtspr SPRN_TIDR are emulated,
> > save/restore SPRs (context switch) are updated and a new feature
> > (CPU_FTR_TIDR) is added to POWER9 system.
> 
> Those CPU_FTR_* are internal to the kernel. You probably also need a
> feature in AT_HWCAP2 to indicate to userspace that this is supported.
> 
> Also you put the onus of allocating the TIDs onto userspace which is a
> bit tricky. What happens if there are duplicate TIDs for example ? (ie,
> userspace doesn't allocate it or uses a library that spawns a thread)

I tend to agree.  I don't want userspace knowing anything about TIDR
allocations.  If we want userspace to receive one of these as_notifys, there
should be some abstract handle (like a file descriptor) that the kernel gives
out. That handle should be associated with an LPID/PID/TID tuple by the kernel.

This is similar to the PE number in CAPI. Most of the time userspace doesn't
need to know its PE number (there are some cases where a master needs to know a
slave PE, but that's the exception, not the rule).  Similarly, guests don't know
their LPID.  Processes don't know their PID. Threads shouldn't know the TID.
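A minimal sketch of that idea, as a user-space model: the kernel side keeps a table of tuples and hands out only an index into it. The table size, struct ex_notify_target and the ex_* functions are hypothetical and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_HANDLES 16

/* What the kernel knows; none of this is shown to userspace. */
struct ex_notify_target {
	uint32_t lpid;
	uint32_t pid;
	uint16_t tid;
	int in_use;
};

static struct ex_notify_target ex_targets[EX_MAX_HANDLES];

/* Bind a tuple to a fresh handle; returns the handle, or -1 if full. */
static int ex_handle_alloc(uint32_t lpid, uint32_t pid, uint16_t tid)
{
	int h;

	for (h = 0; h < EX_MAX_HANDLES; h++) {
		if (!ex_targets[h].in_use) {
			ex_targets[h] = (struct ex_notify_target){
				.lpid = lpid, .pid = pid, .tid = tid, .in_use = 1,
			};
			return h;
		}
	}
	return -1;
}

/* Kernel-side lookup when a notification has to be routed. */
static const struct ex_notify_target *ex_handle_lookup(int h)
{
	if (h < 0 || h >= EX_MAX_HANDLES || !ex_targets[h].in_use)
		return NULL;
	return &ex_targets[h];
}

/* Release a handle; the slot can then be reused. */
static void ex_handle_free(int h)
{
	if (h >= 0 && h < EX_MAX_HANDLES)
		ex_targets[h].in_use = 0;
}
```

Userspace would only ever hold the small integer, in the same way it holds a file descriptor, so the kernel stays free to choose and change the TID behind it.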

Also, Suka has posted a patch that does TID allocation in the kernel... 
http://patchwork.ozlabs.org/patch/799494/

Regards,
Mikey


Re: [PATCH v2 2/8] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-14 Thread David Gibson
On Fri, Aug 11, 2017 at 04:23:35PM +0200, Cédric Le Goater wrote:
> This is the framework for using XIVE in a PowerVM guest. The support
> is very similar to the native one in a much simpler form.
> 
> Instead of OPAL calls, a set of hypervisor calls is used to configure
> the interrupt sources and the event/notification queues of the guest:
> 
>  - H_INT_GET_SOURCE_INFO
> 
>    used to obtain the address of the MMIO page of the Event State
>    Buffer (PQ bits) entry associated with the source.
> 
>  - H_INT_SET_SOURCE_CONFIG
> 
>    assigns a source to a "target".
> 
>  - H_INT_GET_SOURCE_CONFIG
> 
>    determines which "target" and "priority" are assigned to a source.
> 
>  - H_INT_GET_QUEUE_INFO
> 
>    returns the address of the notification management page associated
>    with the specified "target" and "priority".
> 
>  - H_INT_SET_QUEUE_CONFIG
> 
>    sets or resets the event queue for a given "target" and "priority".
>    It is also used to set the notification config associated with the
>    queue, only unconditional notification for the moment.  Reset is
>    performed with a queue size of 0 and queueing is disabled in that
>    case.
> 
>  - H_INT_GET_QUEUE_CONFIG
> 
>    returns the queue settings for a given "target" and "priority".
> 
>  - H_INT_RESET
> 
>    resets all of the partition's interrupt exploitation structures to
>    their initial state, losing all configuration set via the hcalls
>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> 
>  - H_INT_SYNC
> 
>    issues a synchronisation on a source to make sure all
>    notifications have reached their queue.
> 
> As for XICS, the XIVE interface for the guest is described in the
> device tree under the "interrupt-controller" node. A couple of new
> properties are specific to XIVE:
> 
>  - "reg"
> 
>    contains the base address and size of the thread interrupt
>    management areas (TIMA) for the user level and for the OS level.
>    Only the OS level is taken into account.
> 
>  - "ibm,xive-eq-sizes"
> 
>    the size of the event queues.
> 
>  - "ibm,xive-lisn-ranges"
> 
>    the interrupt number ranges assigned to the guest. These are
>    allocated using a simple bitmap.
> 
> and also:
> 
>  - "/ibm,plat-res-int-priorities"
> 
>    contains a list of priorities that the hypervisor has reserved for
>    its own use.
> 
> Tested with a QEMU XIVE model for pseries and with the Power
> hypervisor
> 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Changes since v1 :
> 
>  - added a xive_teardown_cpu() routine
>  - removed P9 doorbell support when xive is enabled.
>  - merged in patch for "ibm,plat-res-int-priorities" support
>  - added some comments on the usage of raw I/O accessors.
>  
>  Changes since RFC :
> 
>  - renamed backend to spapr
>  - fixed hotplug support
>  - fixed kexec support
>  - fixed src_chip value (XIVE_INVALID_CHIP_ID)
>  - added doorbell support 
>  - added some hcall debug logs
> 
>  arch/powerpc/include/asm/hvcall.h|  13 +-
>  arch/powerpc/include/asm/xive.h  |   3 +
>  arch/powerpc/platforms/pseries/Kconfig   |   1 +
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  11 +-
>  arch/powerpc/platforms/pseries/kexec.c   |   6 +-
>  arch/powerpc/platforms/pseries/setup.c   |   8 +-
>  arch/powerpc/platforms/pseries/smp.c |  27 +-
>  arch/powerpc/sysdev/xive/Kconfig |   5 +
>  arch/powerpc/sysdev/xive/Makefile|   1 +
>  arch/powerpc/sysdev/xive/common.c|  13 +
>  arch/powerpc/sysdev/xive/spapr.c | 617 
> +++
>  11 files changed, 697 insertions(+), 8 deletions(-)
>  create mode 100644 arch/powerpc/sysdev/xive/spapr.c
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index 57d38b504ff7..3d34dc0869f6 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -280,7 +280,18 @@
>  #define H_RESIZE_HPT_COMMIT  0x370
>  #define H_REGISTER_PROC_TBL  0x37C
>  #define H_SIGNAL_SYS_RESET   0x380
> -#define MAX_HCALL_OPCODE H_SIGNAL_SYS_RESET
> +#define H_INT_GET_SOURCE_INFO   0x3A8
> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> +#define H_INT_GET_QUEUE_INFO0x3B4
> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> +#define H_INT_ESB   0x3C8
> +#define H_INT_SYNC  0x3CC
> +#define H_INT_RESET 0x3D0
> +#define MAX_HCALL_OPCODE H_INT_RESET
>  
>  /* H_VIOCTL functions */
>  #define H_GET_VIOA_DUMP_SIZE 0x01
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index c23ff4389ca2..473f133a8555 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -110,11 +110,13 @@ extern bool __xive_enabled;
>  
>  static inline bool xive_enabled(void) { 

Re: [PATCH v2 1/8] powerpc/xive: introduce a common routine xive_queue_page_alloc()

2017-08-14 Thread David Gibson
On Fri, Aug 11, 2017 at 04:23:34PM +0200, Cédric Le Goater wrote:
> This routine will be used in the spapr backend. Also introduce a short
> xive_alloc_order() helper.
> 
> Signed-off-by: Cédric Le Goater 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/sysdev/xive/common.c| 16 
>  arch/powerpc/sysdev/xive/native.c| 16 +---
>  arch/powerpc/sysdev/xive/xive-internal.h |  6 ++
>  3 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 6e0c9dee724f..26999ceae20e 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1424,6 +1424,22 @@ bool xive_core_init(const struct xive_ops *ops, void 
> __iomem *area, u32 offset,
>   return true;
>  }
>  
> +__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift)
> +{
> + unsigned int alloc_order;
> + struct page *pages;
> + __be32 *qpage;
> +
> + alloc_order = xive_alloc_order(queue_shift);
> + pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
> + if (!pages)
> + return ERR_PTR(-ENOMEM);
> + qpage = (__be32 *)page_address(pages);
> + memset(qpage, 0, 1 << queue_shift);
> +
> + return qpage;
> +}
> +
>  static int __init xive_off(char *arg)
>  {
>   xive_cmdline_disabled = true;
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 0f95476b01f6..ef92a83090e1 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -202,17 +202,12 @@ EXPORT_SYMBOL_GPL(xive_native_disable_queue);
>  static int xive_native_setup_queue(unsigned int cpu, struct xive_cpu *xc, u8 
> prio)
>  {
>   struct xive_q *q = &xc->queue[prio];
> - unsigned int alloc_order;
> - struct page *pages;
>   __be32 *qpage;
>  
> - alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
> - (xive_queue_shift - PAGE_SHIFT) : 0;
> - pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
> - if (!pages)
> - return -ENOMEM;
> - qpage = (__be32 *)page_address(pages);
> - memset(qpage, 0, 1 << xive_queue_shift);
> + qpage = xive_queue_page_alloc(cpu, xive_queue_shift);
> + if (IS_ERR(qpage))
> + return PTR_ERR(qpage);
> +
>   return xive_native_configure_queue(get_hard_smp_processor_id(cpu),
>  q, prio, qpage, xive_queue_shift, 
> false);
>  }
> @@ -227,8 +222,7 @@ static void xive_native_cleanup_queue(unsigned int cpu, 
> struct xive_cpu *xc, u8
>* from an IPI and iounmap isn't safe
>*/
>   __xive_native_disable_queue(get_hard_smp_processor_id(cpu), q, prio);
> - alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
> - (xive_queue_shift - PAGE_SHIFT) : 0;
> + alloc_order = xive_alloc_order(xive_queue_shift);
>   free_pages((unsigned long)q->qpage, alloc_order);
>   q->qpage = NULL;
>  }
> diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
> b/arch/powerpc/sysdev/xive/xive-internal.h
> index d07ef2d29caf..dd1e2022cce4 100644
> --- a/arch/powerpc/sysdev/xive/xive-internal.h
> +++ b/arch/powerpc/sysdev/xive/xive-internal.h
> @@ -56,6 +56,12 @@ struct xive_ops {
>  
>  bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 
> offset,
>   u8 max_prio);
> +__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift);
> +
> +static inline u32 xive_alloc_order(u32 queue_shift)
> +{
> + return (queue_shift > PAGE_SHIFT) ? (queue_shift - PAGE_SHIFT) : 0;
> +}
>  
>  extern bool xive_cmdline_disabled;
>  
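As a quick worked example of the xive_alloc_order() helper in the patch above, the stand-alone sketch below evaluates the same expression for a few queue sizes. EX_PAGE_SHIFT of 16 (64K pages) is an assumption made for the example, and ex_alloc_bytes() is a hypothetical helper added only to show how much memory the resulting order covers.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT 16	/* assumed: 64K pages */

/* Same expression as xive_alloc_order() in the patch. */
static uint32_t ex_alloc_order(uint32_t queue_shift)
{
	return (queue_shift > EX_PAGE_SHIFT) ? (queue_shift - EX_PAGE_SHIFT) : 0;
}

/* Bytes covered by the allocation for a queue of 2^queue_shift bytes. */
static uint64_t ex_alloc_bytes(uint32_t queue_shift)
{
	return UINT64_C(1) << (EX_PAGE_SHIFT + ex_alloc_order(queue_shift));
}
```

With these assumptions a 4K queue (shift 12) still takes one whole 64K page at order 0, while a 256K queue (shift 18) needs order 2, i.e. four contiguous pages. Note that the memset() in xive_queue_page_alloc() clears only 1 << queue_shift bytes, not the whole allocation.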

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v6 01/17] powerpc/vas: Define macros, register fields and structures

2017-08-14 Thread Nicholas Piggin
On Mon, 14 Aug 2017 15:21:48 +1000
Michael Ellerman  wrote:

> Sukadev Bhattiprolu  writes:

> >  arch/powerpc/include/asm/vas.h   |  35 
> >  arch/powerpc/include/uapi/asm/vas.h  |  25 +++  
> 
> I thought we weren't exposing VAS to userspace yet?
> 
> If we are then we need to get things straight WRT copy/paste abort.

No, we should not be. This might be just a leftover hunk that should
be moved to a future series.

At the moment (as far as I understand) it should be limited to
preempt-disabled, process context, kernel users which avoids any
concern for switch_to.

Thanks,
Nick