Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-11 Thread Balbir Singh



On 12/12/19 1:24 am, Daniel Axtens wrote:
> Hi Balbir,
> 
> +Discontiguous memory can occur when you have a machine with memory spread
> +across multiple nodes. For example, on a Talos II with 64GB of RAM:
> +
> + - 32GB runs from 0x0 to 0x0000_0008_0000_0000,
> + - then there's a gap,
> + - then the final 32GB runs from 0x0000_2000_0000_0000 to 0x0000_2008_0000_0000
> +
> +This can create _significant_ issues:
> +
> + - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
> +   assume that ran from 0x0 to 0x0000_0010_0000_0000. We'd then reserve the
> +   last 1/8th - 0x0000_000e_0000_0000 to 0x0000_0010_0000_0000 - as the shadow
> +   region. But when we try to access any of that, we'll try to access pages
> +   that are not physically present.
> +
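As a concrete illustration of the arithmetic above (a sketch with made-up helper names, not the patch's code): generic KASAN maps every 8 bytes of memory to one shadow byte, which is where the 1/8th reservation comes from.

```c
/* Sketch (not kernel code) of the generic KASAN mem-to-shadow mapping.
 * Every 8 bytes of memory map to 1 shadow byte, so the shadow region
 * for N bytes of RAM is N/8 -- the "last 1/8th" reserved above.  The
 * shadow offset here is a parameter, not the real powerpc value. */
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3  /* 1 shadow byte per 8 bytes */

static inline uint64_t kasan_mem_to_shadow(uint64_t addr, uint64_t shadow_offset)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset;
}

/* Size of shadow needed to cover a span of physical memory. */
static inline uint64_t kasan_shadow_size(uint64_t mem_size)
{
	return mem_size >> KASAN_SHADOW_SCALE_SHIFT;
}
```

For the 64GB machine above, the shadow works out to 8GB, exactly the "last 1/8th" span 0x0000_000e_0000_0000..0x0000_0010_0000_0000.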

 If we reserved memory for KASAN from each node (discontig region), we
 might survive this, no? Maybe we need NUMA-aware KASAN? That might be a
 generic change, just thinking out loud.
>>>
>>> The challenge is that - AIUI - in inline instrumentation, the compiler
>>> doesn't generate calls to things like __asan_loadN and
>>> __asan_storeN. Instead it uses -fasan-shadow-offset to compute the
>>> checks, and only calls the __asan_report* family of functions if it
>>> detects an issue. This also matches what I can observe with objdump
>>> across outline and inline instrumentation settings.
>>>
>>> This means that for this sort of thing to work we would need to either
>>> drop back to out-of-line calls, or teach the compiler how to use a
>>> nonlinear, NUMA aware mem-to-shadow mapping.
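A rough sketch (ours, not actual compiler output) of the point being made: with inline instrumentation the shadow lookup is computed against a single compile-time constant offset, which is exactly what a nonlinear, per-node mapping cannot provide.

```c
/* Sketch of what inline KASAN instrumentation reduces to for an 8-byte
 * load: shadow address = (addr >> 3) + OFFSET, where OFFSET is the
 * compile-time constant passed via -fasan-shadow-offset.  A nonlinear,
 * NUMA-aware mapping has no such single constant, hence the problem.
 * The offset value below is an arbitrary example, not powerpc's. */
#include <stdint.h>

#define EXAMPLE_SHADOW_OFFSET 0xa80e000000000000ULL /* illustrative only */

static inline uint64_t inline_shadow_addr(uint64_t addr)
{
	/* Inline mode: the compiler loads the shadow byte at this address
	 * and branches to __asan_report_* only if it is nonzero.  Outline
	 * mode instead emits a call to __asan_load8(addr), whose body
	 * could in principle use any mapping, including a NUMA-aware one. */
	return (addr >> 3) + EXAMPLE_SHADOW_OFFSET;
}
```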
>>
>> Yes, out of line is expensive, but seems to work well for all use cases.
> 
> I'm not sure this is true. Looking at scripts/Makefile.kasan, allocas,
> stacks and globals will only be instrumented if you can provide
> KASAN_SHADOW_OFFSET. In the case you're proposing, we can't provide a
> static offset. I _think_ this is a compiler limitation, where some of
> those instrumentations only work/make sense with a static offset, but
> perhaps that's not right? Dmitry and Andrey, can you shed some light on
> this?
> 

From what I can read, everything should still be supported; the info page
for gcc states that globals and stack ASAN should be enabled by default.
allocas may have limited meaning if stack-protector is turned on (no?)

> Also, as it currently stands, the speed difference between inline and
> outline is approximately 2x, and given that we'd like to run this
> full-time in syzkaller I think there is value in trading off speed for
> some limitations.
> 

Full speed vs actually working across different configurations?

>> BTW, the current set of patches just hang if I try to make the default
>> mode as out of line
> 
> Do you have CONFIG_RELOCATABLE?
> 
> I've tested the following process:
> 
> # 1) apply patches on a fresh linux-next
> # 2) output dir
> mkdir ../out-3s-kasan
> 
> # 3) merge in the relevant config snippets
> cat > kasan.config << EOF
> CONFIG_EXPERT=y
> CONFIG_LD_HEAD_STUB_CATCH=y
> 
> CONFIG_RELOCATABLE=y
> 
> CONFIG_KASAN=y
> CONFIG_KASAN_GENERIC=y
> CONFIG_KASAN_OUTLINE=y
> 
> CONFIG_PHYS_MEM_SIZE_FOR_KASAN=2048
> EOF
> 

I think I got CONFIG_PHYS_MEM_SIZE_FOR_KASAN wrong; honestly, I don't get why
we need this size. The size is in MB and the default is 0.

Why does the powerpc port of KASAN need the SIZE to be explicitly specified?
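For what it's worth, the 1:8 shadow ratio suggests what the size is for: presumably the shadow must be reserved out of physical memory at very early boot, before the memory size is otherwise discoverable. A sketch of the arithmetic (our reading, not a statement from the patch author):

```c
/* Hedged sketch: generic KASAN shadows memory at a 1:8 ratio, so a
 * patch reserving the shadow from physical memory at early boot needs
 * to know the physical memory size up front.  The option value is in
 * MB, so for CONFIG_PHYS_MEM_SIZE_FOR_KASAN=2048: */
#include <stdint.h>

static inline uint64_t kasan_reservation_mb(uint64_t phys_mem_mb)
{
	return phys_mem_mb / 8;   /* 2048 MB of RAM -> 256 MB of shadow */
}
```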

Balbir Singh.


Re: Call for report - G5/PPC970 status

2019-12-11 Thread jjhdiederen
The PowerMac 7,3 G5 2.5 DP PCI-X Mid-2004 is affected by this bug. The
machine freezes at boot due to the new ppc64 kernel.



Regards,
Jeroen Diederen

Romain Dolbeau wrote on 2019-12-11 08:19:

On Wed, 11 Dec 2019 at 03:20, Aneesh Kumar K.V wrote:

The PowerMac system we have internally was not able to recreate this.


To narrow down the issue - is that a PCI/PCI-X (7,3 [1]) or PCIe G5 (11,2 [1])?

Single, dual or quad ?

Same question to anyone else with a G5 / PPC970 - what is it, and does
it boot a recent PPC64 Linux kernel?

Christian from the original report has a quad, like me (so powermac11,2).

There was also a report of a powermac7,3 working in the original
discussion, single or dual unspecified.

So this might be a Quad thing, or a more general 11,2 thing...


At this point, I am not sure what would cause the Machine check with
that patch series, because we have not changed the VA bits in that patch.


Any test I could run that would help you track down the bug?

Cordially,

Romain

[1] 




--
Romain Dolbeau


Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-11 Thread Ram Pai
On Tue, Dec 10, 2019 at 07:43:24PM -0600, Michael Roth wrote:
> Quoting Ram Pai (2019-12-06 19:12:39)
> > Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> > secure guests")
> > disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> > path for secure VMs, helped enable dma_direct path.  This enabled
> > support for bounce-buffering through SWIOTLB.  However it fails to
> > operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> > 
> > Re-enable the dma_iommu_ops path for pseries Secure VMs.  It handles all
> > cases, including TCE mapping I/O pages in the presence of an
> > IOMMU.
> 
> Wasn't clear to me at first, but I guess the main gist of this series is
> that we want to continue to use SWIOTLB, but also need to create mappings
> of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> checks in DMA API functions to do the same.
> 
> That makes sense, but one issue I see with that is that
> dma_iommu_map_bypass() only tests true if all the following are true:
> 
> 1) the device requests a 64-bit DMA mask via
>dma_set_mask/dma_set_coherent_mask
> 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> 
> dma_is_direct() checks don't have this limitation, so I think for
> any other cases, such as devices that use a smaller DMA mask, we'll
> end up falling back to the non-bypass functions in dma_iommu_ops, which
> will likely break for things like dma_alloc_coherent/dma_map_single
> since they won't use SWIOTLB pages and won't do the necessary calls to
> set_memory_unencrypted() to share those non-SWIOTLB buffers with
> hypervisor.
> 
> Maybe that's ok, but I think we should be clearer about how to
> fail/handle these cases.

Yes, makes sense. Devices that cannot handle a 64-bit DMA mask will not work.
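A sketch of the arithmetic behind "devices that use a smaller DMA mask" (illustrative only, not the kernel's dma_capable()): a buffer is directly addressable by a device only if its highest bus address fits under the device's DMA mask.

```c
/* Sketch: the addressability test underlying the DMA-mask discussion.
 * A device that set a 32-bit mask cannot reach buffers whose bus
 * addresses extend past 4GB, so it cannot take the bypass/direct path. */
#include <stdbool.h>
#include <stdint.h>

static inline bool addr_fits_mask(uint64_t bus_addr, uint64_t size,
				  uint64_t dma_mask)
{
	return bus_addr + size - 1 <= dma_mask;
}
```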

> 
> Though I also agree with some concerns Alexey stated earlier: it seems
> wasteful to map the entire DDW window just so these bounce buffers can be
> mapped.  Especially if you consider the lack of a mapping to be an additional
> safe-guard against things like buggy device implementations on the QEMU
> side. E.g. if we leaked pages to the hypervisor on accident, those pages
> wouldn't be immediately accessible to a device, and would still require
> additional work to get past the IOMMU.

Well, an accidental, unintended page leak to the hypervisor is a very
bad thing, regardless of any DMA mapping. The device may not be able to
access it, but the hypervisor still can.

> 
> What would it look like if we try to make all this work with disable_ddw 
> passed
> to kernel command-line (or forced for is_secure_guest())?
> 
>   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* 
> ops,
>  but an additional case or hook that considers is_secure_guest() might do
>  it.
>  
>   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
>  io_tlb_start/io_tlb_end. We could do it once, on-demand via
>  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
>  maybe in some init function.

Hmm... I am not sure how to accomplish (2).  We need to use some DDW window
to set up the mappings, right?  If disable_ddw is set, there won't be any
DDW.  What am I missing?

> 
> That also has the benefit of not requiring devices to support 64-bit DMA.
> 
> Alternatively, we could continue to rely on the 64-bit DDW window, but
> modify call to enable_ddw() to only map the io_tlb_start/end range in
> the case of is_secure_guest(). This is a little cleaner implementation-wise
> since we can rely on the existing dma_iommu_{alloc,map}_bypass() hooks.

I have been experimenting with this.  Trying to map only the memory
range from io_tlb_start/io_tlb_end through the 64-bit DDW window.  But
for some reason, it wants io_tlb_start to be aligned to some
boundary. It looks like a 2^28 boundary. Not sure what dictates that
boundary.
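The observed constraint can be written down as a simple check. Note the 2^28 (256MB) boundary below is taken from the experiment reported above, not from any specification; it is plausibly the TCE mapping granularity of the window, but that is an assumption.

```c
/* Sketch of the alignment requirement observed for io_tlb_start.
 * TCE_ALIGN_SHIFT = 28 is the empirical 2^28 boundary from the email,
 * not an architected constant. */
#include <stdbool.h>
#include <stdint.h>

#define TCE_ALIGN_SHIFT 28   /* 256MB; assumption, see above */

static inline bool aligned_for_window(uint64_t addr)
{
	return (addr & ((1ULL << TCE_ALIGN_SHIFT) - 1)) == 0;
}
```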

> , but
> devices that don't support 64-bit will fail back to not using dma_direct_* ops
> and fail miserably. We'd probably want to handle that more gracefully.

Yes, I will put a warning message to indicate the failure.

> 
> Or we handle both cases gracefully. To me it makes more sense to enable
> non-DDW case, then consider adding DDW case later if there's some reason
> why 64-bit DMA is needed. But would be good to hear if there are other
> opinions.

Educate me a bit here: what is a non-DDW case?  Is it possible for a
device to access memory, in the presence of an IOMMU, without a window mapping?

> 
> > 
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 11 +--
> >  1 file changed, 1 insertion(+), 10 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 

Re: [PATCH v5 2/5] powerpc/kprobes: Mark newly allocated probes as RO

2019-12-11 Thread Russell Currey
On Fri, 2019-12-06 at 10:47 +1100, Michael Ellerman wrote:
> Michael Ellerman  writes:
> > Russell Currey  writes:
> > > With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will
> > > be one
> > > W+X page at boot by default.  This can be tested with
> > > CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking
> > > the
> > > kernel log during boot.
> > > 
> > > powerpc doesn't implement its own alloc() for kprobes like other
> > > architectures do, but we couldn't immediately mark RO anyway
> > > since we do
> > > a memcpy to the page we allocate later.  After that, nothing
> > > should be
> > > allowed to modify the page, and write permissions are removed
> > > well
> > > before the kprobe is armed.
> > > 
> > > Thus mark newly allocated probes as read-only once it's safe to
> > > do so.
> > > 
> > > Signed-off-by: Russell Currey 
> > > ---
> > >  arch/powerpc/kernel/kprobes.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/kernel/kprobes.c
> > > b/arch/powerpc/kernel/kprobes.c
> > > index 2d27ec4feee4..2610496de7c7 100644
> > > --- a/arch/powerpc/kernel/kprobes.c
> > > +++ b/arch/powerpc/kernel/kprobes.c
> > > @@ -24,6 +24,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
> > >  DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
> > > @@ -131,6 +132,8 @@ int arch_prepare_kprobe(struct kprobe *p)
> > >   (unsigned long)p->ainsn.insn +
> > > sizeof(kprobe_opcode_t));
> > >   }
> > >  
> > > + set_memory_ro((unsigned long)p->ainsn.insn, 1);
> > > +
> > 
> > That comes from:
> > p->ainsn.insn = get_insn_slot();
> > 
> > 
> > Which ends up in __get_insn_slot() I think. And that looks very
> > much
> > like it's going to hand out multiple slots per page, which isn't
> > going
> > to work because you've just marked the whole page RO.
> > 
> > So I would expect this to crash on the 2nd kprobe that's installed.
> > Have
> > you tested it somehow?
> 
> I'm not sure if this is the issue I was talking about, but it doesn't
> survive ftracetest:
> 
>   [ 1139.576047] [ cut here ]
>   [ 1139.576322] kernel BUG at mm/memory.c:2036!
>   cpu 0x1f: Vector: 700 (Program Check) at [c01fd6c675d0]
>   pc: c035d018: apply_to_page_range+0x318/0x610
>   lr: c00900bc: change_memory_attr+0x4c/0x70
>   sp: c01fd6c67860
>  msr: 90029033
> current = 0xc01fa4a47880
> paca= 0xc01e5c80   irqmask: 0x03   irq_happened: 0x01
>   pid   = 7168, comm = ftracetest
>   kernel BUG at mm/memory.c:2036!
>   Linux version 5.4.0-gcc-8.2.0-11694-gf1f9aa266811 (
> mich...@raptor-2.ozlabs.ibm.com) (gcc version 8.2.0 (crosstool-NG
> 1.24.0-rc1.16-9627a04)) #1384 SMP Thu Dec 5 22:11:09 AEDT 2019
>   enter ? for help
>   [c01fd6c67940] c00900bc change_memory_attr+0x4c/0x70
>   [c01fd6c67970] c0053c48 arch_prepare_kprobe+0xb8/0x120
>   [c01fd6c679e0] c022f718 register_kprobe+0x608/0x790
>   [c01fd6c67a40] c022fc50 register_kretprobe+0x230/0x350
>   [c01fd6c67a80] c02849b4
> __register_trace_kprobe+0xf4/0x1a0
>   [c01fd6c67af0] c0285b18 trace_kprobe_create+0x738/0xf70
>   [c01fd6c67c30] c0286378
> create_or_delete_trace_kprobe+0x28/0x70
>   [c01fd6c67c50] c025f024 trace_run_command+0xc4/0xe0
>   [c01fd6c67ca0] c025f128
> trace_parse_run_command+0xe8/0x230
>   [c01fd6c67d40] c02845d0 probes_write+0x20/0x40
>   [c01fd6c67d60] c03eef4c __vfs_write+0x3c/0x70
>   [c01fd6c67d80] c03f26a0 vfs_write+0xd0/0x200
>   [c01fd6c67dd0] c03f2a3c ksys_write+0x7c/0x140
>   [c01fd6c67e20] c000b9e0 system_call+0x5c/0x68
>   --- Exception: c01 (System Call) at 7fff8f06e420
>   SP (793d6830) is in userspace
>   1f:mon> client_loop: send disconnect: Broken pipe
> 
> 
> Sorry I didn't get any more info on the crash, I lost the console and
> then some CI bot stole the machine 8)
> 
> You should be able to reproduce just by running ftracetest.

The test that blew it up was test.d/kprobe/probepoint.tc for the
record.  It goes away when replacing the memcpy with a
patch_instruction().

> 
> cheers



Re: [PATCH v9 23/25] mm/gup: track FOLL_PIN pages

2019-12-11 Thread John Hubbard

On 12/11/19 3:28 AM, Jan Kara wrote:
...


The patch looks mostly good to me now. Just a few smaller comments below.


Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 


I think you inherited here the Reviewed-by tags from the "add flags" patch
you've merged into this one but that's not really fair since this patch
does much more... In particular I didn't give my Reviewed-by tag for this
patch yet.


OK, I've removed those reviewed-by's. (I felt bad about dropping them, after
people had devoted time to reviewing, but I do see that it's wrong to imply
that they've reviewed this much, much larger thing.)

...


I somewhat wonder about the asymmetry of try_grab_compound_head() vs
try_grab_page() in the treatment of 'flags'. How costly would it be to make
them symmetric (i.e., either set FOLL_GET for try_grab_compound_head()
callers or make sure one of FOLL_GET, FOLL_PIN is set for try_grab_page())?

Because this difference looks like a subtle catch in the long run...


Done. It is only a modest code-level change, at least the way I've done it, 
which is
setting FOLL_GET for try_grab_compound_head(). In order to do that, I set
it at the top of the internal gup fast calling stacks, which is actually a good
design anyway: gup fast is logically doing FOLL_GET in all cases. So setting
the flag internally is accurate and consistent with the overall design.
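The symmetric contract being adopted can be sketched as a simple predicate: both grab paths require exactly one of FOLL_GET or FOLL_PIN. The flag values below are illustrative, not the kernel's definitions.

```c
/* Sketch of the flags symmetry discussed above: callers of either
 * try_grab_compound_head() or try_grab_page() must pass exactly one of
 * FOLL_GET / FOLL_PIN.  Values are illustrative stand-ins. */
#include <stdbool.h>

#define FOLL_GET 0x04u      /* illustrative */
#define FOLL_PIN 0x40000u   /* illustrative */

static inline bool grab_flags_valid(unsigned int flags)
{
	unsigned int g = flags & (FOLL_GET | FOLL_PIN);

	/* exactly one of the two must be set */
	return g == FOLL_GET || g == FOLL_PIN;
}
```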



...


@@ -1522,8 +1536,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
  skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-   if (flags & FOLL_GET)
-   get_page(page);
+   if (!try_grab_page(page, flags))
+   page = ERR_PTR(-EFAULT);


I think you need to also move the try_grab_page() earlier in the function.
At this point the page may be marked as mlocked and you'd need to undo that
in case try_grab_page() fails.



OK, I've moved it up, adding a "subpage" variable in order to make that work.




diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..0aab6fe0072f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4356,7 +4356,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
  same_page:
if (pages) {
pages[i] = mem_map_offset(page, pfn_offset);
-   get_page(pages[i]);
+   if (!try_grab_page(pages[i], flags)) {
+   spin_unlock(ptl);
+   remainder = 0;
+   err = -ENOMEM;
+   WARN_ON_ONCE(1);
+   break;
+   }
}


This function does a refcount overflow check early so that it doesn't have
to do try_get_page() here. So that check can now be removed when you do
try_grab_page() here anyway, since that early check seems to be just a tiny
optimization AFAICT.

Honza



Yes. I've removed it, good spot.


thanks,
--
John Hubbard
NVIDIA


READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-11 Thread Michael Ellerman
[ trimmed CC a bit ]

Peter Zijlstra  writes:
> On Fri, Dec 06, 2019 at 11:46:11PM +1100, Michael Ellerman wrote:
...
> you write:
>
>   "Currently bitops-instrumented.h assumes that the architecture provides
> atomic, non-atomic and locking bitops (e.g. both set_bit and __set_bit).
> This is true on x86 and s390, but is not always true: there is a
> generic bitops/non-atomic.h header that provides generic non-atomic
> operations, and also a generic bitops/lock.h for locking operations."
>
> Is there any actual benefit for PPC to using their own atomic bitops
> over bitops/lock.h ? I'm thinking that the generic code is fairly
> optimal for most LL/SC architectures.

Yes and no :)

Some of the generic versions don't generate good code compared to our
versions, but that's because READ_ONCE() is triggering stack protector
to be enabled.

For example, comparing an out-of-line copy of the generic and ppc
versions of test_and_set_bit_lock():

   1 :   1 :
   2 addis   r2,r12,361
   3 addir2,r2,-4240
   4 stdur1,-48(r1)
   5 rlwinm  r8,r3,29,3,28
   6 clrlwi  r10,r3,26   2 rldicl  r10,r3,58,6
   7 ld  r9,3320(r13)
   8 std r9,40(r1)
   9 li  r9,0
  10 li  r9,13 li  r9,1
 4 clrlwi  r3,r3,26
 5 rldicr  r10,r10,3,60
  11 sld r9,r9,r10   6 sld r3,r9,r3
  12 add r10,r4,r8   7 add r4,r4,r10
  13 ldx r8,r4,r8
  14 and.r8,r9,r8
  15 bne 34f
  16 ldarx   r7,0,r108 ldarx   r9,0,r4,1
  17 or  r8,r9,r79 or  r10,r9,r3
  18 stdcx.  r8,0,r10   10 stdcx.  r10,0,r4
  19 bne-16b11 bne-8b
  20 isync  12 isync
  21 and r9,r7,r9   13 and r3,r3,r9
  22 addic   r7,r9,-1   14 addic   r9,r3,-1
  23 subfe   r7,r7,r9   15 subfe   r3,r9,r3
  24 ld  r9,40(r1)
  25 ld  r10,3320(r13)
  26 xor.r9,r9,r10
  27 li  r10,0
  28 mr  r3,r7
  29 bne 36f
  30 addir1,r1,48
  31 blr16 blr
  32 nop
  33 nop
  34 li  r7,1
  35 b   24b
  36 mflrr0
  37 std r0,64(r1)
  38 bl  <__stack_chk_fail+0x8>


If you squint, the generated code for the actual logic is pretty similar, but
the stack protector gunk makes a big mess. It's particularly bad here
because the ppc version doesn't even need a stack frame.

I've also confirmed that even when test_and_set_bit_lock() is inlined
into an actual call site the stack protector logic still triggers.

eg, if I make two versions of ext4_resize_begin() which call the generic or ppc
version of test_and_set_bit_lock(), the generic version gets a bunch of extra
stack protector code.
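For reference, a userspace rendition of the generic test_and_set_bit_lock() logic being compared in the listings below: test the bit first, then do the atomic fetch-or with acquire ordering only if it was clear. This is a sketch following the generic bitops convention (bit N of an unsigned long array), not the kernel header itself.

```c
/* Userspace sketch of the generic test_and_set_bit_lock() shape:
 * an optimistic relaxed read avoids the RMW when the bit is already
 * set; otherwise an acquire fetch-or provides the lock ordering. */
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

static inline bool test_and_set_bit_lock(unsigned int nr,
					 _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	_Atomic unsigned long *p = addr + nr / BITS_PER_LONG;

	if (atomic_load_explicit(p, memory_order_relaxed) & mask)
		return true;                      /* already set: no RMW */
	return atomic_fetch_or_explicit(p, mask, memory_order_acquire) & mask;
}
```

The READ_ONCE()-equivalent load at the top is the part that, in the kernel build described above, drags in the stack protector machinery.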

   1 c05336e0 :   1 c05335b0 
:
   2 addis   r2,r12,281  2 addis   
r2,r12,281
   3 addir2,r2,-122563 addi
r2,r2,-11952
   4 mflrr0  4 mflr
r0
   5 bl  <_mcount>   5 bl  
<_mcount>
   6 mflrr0  6 mflr
r0
   7 std r31,-8(r1)  7 std 
r31,-8(r1)
   8 std r30,-16(r1) 8 std 
r30,-16(r1)
   9 mr  r31,r3  9 mr  
r31,r3
  10 li  r3,24  10 li  
r3,24
  11 std r0,16(r1)  11 std 
r0,16(r1)
  12 stdur1,-128(r1)12 stdu
r1,-112(r1)
  13 ld  r9,3320(r13)
  14 std r9,104(r1)
  15 li  r9,0
  16 ld  r30,920(r31)   13 ld  
r30,920(r31)
  17 bl14 bl  

  18 nop15 nop
  19 cmpdi   cr7,r3,0   16 cmpdi   
cr7,r3,0
  20 beq cr7,   17 beq 
cr7,
  21 ld  r9,920(r31)18 ld  
r9,920(r31)
  22 ld  r10,96(r30)19 ld  

[PATCH V2 00/13] powerpc/vas: Page fault handling for user space NX requests

2019-12-11 Thread Haren Myneni
[PATCH V2 00/13] powerpc/vas: Page fault handling for user space NX requests

On power9, the Virtual Accelerator Switchboard (VAS) allows user space or
the kernel to communicate with the Nest Accelerator (NX) directly using
COPY/PASTE instructions. NX provides various functionalities such as
compression and encryption, but only compression (842 and GZIP formats) is
supported in the Linux kernel on power9.

842 compression driver (drivers/crypto/nx/nx-842-powernv.c)
is already included in Linux. Only GZIP support will be available from
user space.

Applications can issue GZIP compression / decompression requests to NX with
COPY/PASTE instructions. When NX is processing these requests, it can hit a
fault on the request buffer (not in memory). It issues an interrupt and
pastes a fault CRB in the fault FIFO, and expects the kernel to handle the
fault and return credits for both send and fault windows after processing.

This patch series adds IRQ and fault window setup, and NX fault handling:
- Read the IRQ# from the "interrupts" property and configure an IRQ per VAS
  instance.
- Set the port# for each window to generate an interrupt when a fault is
  noticed.
- Set up the fault window and FIFO on which NX pastes fault CRBs.
- Set up an IRQ-thread fault handler per VAS instance.
- When receiving an interrupt, read CRBs from the fault FIFO and update the
  coprocessor_status_block (CSB) in the corresponding CRB with a translation
  failure (CSB_CC_TRANSLATION). After issuing NX requests, the process polls
  on the CSB address. When it sees a translation error, it can touch the
  request buffer to bring the page into memory and reissue the NX request.
- If copy_to_user fails on a user space CSB address, the OS sends a SEGV
  signal.
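The user-space side of that poll-and-retry protocol can be sketched as follows. The structure layout, the valid-bit position, and the completion-code values here are illustrative stand-ins, not the NX architecture definitions.

```c
/* Hedged sketch of the user-space pattern described above: poll the
 * CSB until NX marks it valid; on a translation failure, touch the
 * faulting buffer to fault the page in and resubmit the request.
 * All constants and field layouts are illustrative. */
#include <stdint.h>

#define CSB_CC_SUCCESS      0
#define CSB_CC_TRANSLATION  5   /* illustrative value */

enum nx_action { NX_DONE, NX_RETRY_AFTER_TOUCH, NX_ERROR };

/* What the poller should do with a given completion code. */
static inline enum nx_action nx_next_action(uint8_t cc)
{
	if (cc == CSB_CC_SUCCESS)
		return NX_DONE;
	if (cc == CSB_CC_TRANSLATION)
		return NX_RETRY_AFTER_TOUCH;  /* touch buffer, reissue */
	return NX_ERROR;
}

struct csb {
	volatile uint8_t flags;  /* valid bit (0x80 here) set by NX */
	volatile uint8_t cc;     /* completion code */
};

/* retry_submit() and buf stand in for the caller's request machinery. */
static int wait_nx(struct csb *csb, uint8_t *buf, void (*retry_submit)(void))
{
	for (;;) {
		while (!(csb->flags & 0x80))
			;                        /* poll for CSB valid */
		if (nx_next_action(csb->cc) != NX_RETRY_AFTER_TOUCH)
			return csb->cc;
		(void)*buf;                      /* touch: fault page in */
		csb->flags = 0;
		retry_submit();                  /* reissue the NX request */
	}
}
```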

Tested these patches with NX-GZIP support and will be posting this series
soon.

Patch 2: Define nx_fault_stamp on which NX writes fault status for the fault
 CRB
Patch 3: Read interrupts and port properties per VAS instance
Patch 4: Set up the fault window per VAS instance. This window is used by
 NX to paste fault CRBs into the FIFO.
Patches 5 & 6: Set up a threaded IRQ per VAS instance and register NX with the
 fault window ID and port number for each send window so that NX can paste
 fault CRBs in this window.
Patch 7: Take a reference to the pid and mm so that the pid is not reused
 until the window is closed. Needed for multi-threaded applications where a
 child can open a window that is used by the parent later.
Patches 8 and 9: Process CRBs from fault FIFO and notify tasks by
 updating CSB or through signals.
Patches 10 and 11: Return credits for send and fault windows after handling
faults.
Patch 13: Fix closing the send window after all credits are returned. This
 issue happens only for user space requests; there are no page faults on
 kernel request buffers.

Changelog:
V2:
  - Use a threaded IRQ instead of our own kernel thread handler
  - Use pswid instead of the user space CSB address to find a valid CRB
  - Removed unused macros and other changes as suggested by Christoph Hellwig

Haren Myneni (13):
  powerpc/vas: Describe vas-port and interrupts properties
  powerpc/vas: Define nx_fault_stamp in coprocessor_request_block
  powerpc/vas: Read interrupts and vas-port device tree properties
  powerpc/vas: Setup fault window per VAS instance
  powerpc/vas: Setup thread IRQ handler per VAS instance
  powerpc/vas: Register NX with fault window ID and IRQ port value
  powerpc/vas: Take reference to PID and mm for user space windows
  powerpc/vas: Update CSB and notify process for fault CRBs
  powerpc/vas: Print CRB and FIFO values
  powerpc/vas: Do not use default credits for receive window
  powerpc/VAS: Return credits after handling fault
  powerpc/vas: Display process stuck message
  powerpc/vas: Free send window in VAS instance after credits returned

 .../devicetree/bindings/powerpc/ibm,vas.txt|   5 +
 arch/powerpc/include/asm/icswx.h   |  18 +-
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/vas-debug.c |   2 +-
 arch/powerpc/platforms/powernv/vas-fault.c | 337 +
 arch/powerpc/platforms/powernv/vas-window.c| 173 ++-
 arch/powerpc/platforms/powernv/vas.c   |  77 -
 arch/powerpc/platforms/powernv/vas.h   |  38 ++-
 8 files changed, 627 insertions(+), 25 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/vas-fault.c

-- 
1.8.3.1





Re: [PATCH V2 00/13] powerpc/vas: Page fault handling for user space NX requests

2019-12-11 Thread Haren Myneni
On Mon, 2019-12-09 at 02:38 -0600, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Dec 09, 2019 at 06:37:09AM +0100, Christophe Leroy wrote:
> > What do you mean by NX ?
> 
> It is the Power9 "Nest Accelerator".  The patch series should ideally
> mention that right at the start, yeah.

Thanks. NX (Nest Accelerator) has been around since power7+
(drivers/crypto/nx/).

On power9, VAS (Virtual Accelerator Switchboard) was introduced, which
allows opening multiple channels to communicate with NX. The kernel or
user space can interact with NX directly using copy/paste instructions.
Kernel support with NX-842 compression is already included in the kernel.

In the case of user space, NX can see a page fault on the request buffer
and interrupts the OS to handle this fault. This patch series adds page
fault handling in VAS for user space requests.

I will repost this patch with more explanation on NX.

> 
> > Up to now, NX has been standing for No-eXecute. That's a bit in segment 
> > registers on book3s/32 to forbid executing code.
> 
> That bit is called just N fwiw (and not really specific to 32-bit -- on
> 64-bit implementations it was part of segment table entries, and of SLB
> entries on newer machines).
> 
> 
> Segher




[RFC PATCH] OPAL v4 cpu idle driver skeleton

2019-12-11 Thread Nicholas Piggin
With OPAL using the same endianness, same stack, and with OS
callbacks, it looks relatively easy to provide a CPU idle driver.

The Linux sreset interrupt won't have to change, if it registers
almost like isa300_idle_stop_mayloss as the os_stop function,
then skiboot will call that to stop, and it will return like a
normal function call returning the srr1 wakeup value.

This allows the firmware to deal with supported stop states and
psscr and consequences for saving and restoring various resources,
and the kernel can implement a simple OPAL idle driver which has
some interface like wakeup latency requested or something.

Calls in and out of OPAL (once it's running with MMU=on) are not
much more expensive than calling a function in a kernel module, so
performance should be okay. The kernel can still choose to implement
an optimised CPU-specific driver as it does today.

The patch is just a hack with no actual policy or SPR saving in
it at the moment and only does stop0, but illustrates the mechanism.

Thanks,
Nick
---
 core/Makefile.inc   |  2 +-
 core/opal.c |  3 +++
 core/stop.c | 35 +++
 include/opal-api.h  | 10 ++
 include/opal-internal.h |  1 +
 5 files changed, 46 insertions(+), 5 deletions(-)
 create mode 100644 core/stop.c

diff --git a/core/Makefile.inc b/core/Makefile.inc
index cc90fb958..653ca544e 100644
--- a/core/Makefile.inc
+++ b/core/Makefile.inc
@@ -7,7 +7,7 @@ CORE_OBJS = relocate.o console.o stack.o init.o chip.o 
mem_region.o vm.o
 CORE_OBJS += malloc.o lock.o cpu.o utils.o fdt.o opal.o interrupts.o timebase.o
 CORE_OBJS += opal-msg.o pci.o pci-virt.o pci-slot.o pcie-slot.o
 CORE_OBJS += pci-opal.o fast-reboot.o device.o exceptions.o trace.o affinity.o
-CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o mce.o
+CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o mce.o stop.o
 CORE_OBJS += console-log.o ipmi.o time-utils.o pel.o pool.o errorlog.o
 CORE_OBJS += timer.o i2c.o rtc.o flash.o sensor.o ipmi-opal.o
 CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o powercap.o psr.o
diff --git a/core/opal.c b/core/opal.c
index bb88d7710..d5c1d057b 100644
--- a/core/opal.c
+++ b/core/opal.c
@@ -444,6 +444,9 @@ static int64_t opal_register_opal_ops(struct opal_os_ops 
*__os_ops)
/* v4 must provide printf */
os_ops.os_printf = (void *)be64_to_cpu(__os_ops->os_printf);
 
+   /* v4 may provide stop (or NULL) */
+   os_ops.os_stop = (void *)be64_to_cpu(__os_ops->os_stop);
+
set_opal_console_to_raw();
 
checksum_romem();
diff --git a/core/stop.c b/core/stop.c
new file mode 100644
index 0..6d98d68e6
--- /dev/null
+++ b/core/stop.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: Apache-2.0
+/*
+ * Stop idle driver
+ *
+ * Copyright 2019 IBM Corp.
+ */
+
+#define pr_fmt(fmt)"IDLE: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int64_t opal_cpu_idle(uint64_t latency, bool radix, __be64 *srr1)
+{
+   uint64_t psscr;
+
+   if (!os_ops.os_stop)
+   return OPAL_UNSUPPORTED;
+
+   if (proc_gen != proc_gen_p9)
+   return OPAL_UNSUPPORTED;
+
+   (void)latency;
+   (void)radix;
+   psscr = OPAL_PM_PSSCR_RL(0)
+| OPAL_PM_PSSCR_MTL(3)
+| OPAL_PM_PSSCR_TR(3);
+   *srr1 = os_ops.os_stop(psscr, true);
+
+   return OPAL_SUCCESS;
+}
+opal_call(OPAL_CPU_IDLE, opal_cpu_idle, 3);
diff --git a/include/opal-api.h b/include/opal-api.h
index 169061a26..03f323628 100644
--- a/include/opal-api.h
+++ b/include/opal-api.h
@@ -231,6 +231,7 @@
 #define OPAL_LOOKUP_SYMBOL 182
 #define OPAL_REGISTER_OS_OPS   183
 #define OPAL_HANDLE_MCE184
+#define OPAL_CPU_IDLE  185
-#define OPAL_LAST  184
+#define OPAL_LAST  185
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
@@ -1259,10 +1260,11 @@ struct opal_mpipl_fadump {
 };
 
 struct opal_os_ops {
-__be16  version;
-__be16  reserved0;
-__be32  reserved1;
-__be64  os_printf;  /* void printf(int32_t level, const char *str) */
+   __be16  version;
+   __be16  reserved0;
+   __be32  reserved1;
+   __be64  os_printf;  /* void printf(int32_t level, const char *str) */
+   __be64  os_stop;/* uint64_t stop(uint64_t psscr, bool save_gprs) */
 };
 
 #define MCE_HANDLE_CORRECT 0x0001  /* Attempt to correct */
diff --git a/include/opal-internal.h b/include/opal-internal.h
index cd968a0fe..2baf79a53 100644
--- a/include/opal-internal.h
+++ b/include/opal-internal.h
@@ -20,6 +20,7 @@ struct opal_table_entry {
 
 struct os_ops {
 void (*os_printf)(uint32_t log_level, const char *str);
+   uint64_t (*os_stop)(uint64_t psscr, bool save_gprs);
 };
 
 extern bool opal_v4_os;
-- 
2.23.0



RE: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Ram Pai
On Wed, Dec 11, 2019 at 07:15:44PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 11/12/2019 02:35, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>
..snip..
> >> As discussed in slack, by default we do not need to clear the entire TCE
> >> table and we only have to map swiotlb buffer using the small window. It
> >> is a guest kernel change only. Thanks,
> > 
> > Can you tell me what code you are talking about here.  Where is the TCE
> > table getting cleared? What code needs to be changed to not clear it?
> 
> 
> pci_dma_bus_setup_pSeriesLP()
>   iommu_init_table()
>   iommu_table_clear()
>   for () tbl->it_ops->get()
> 
> We do not really need to clear it there, we only need it for VFIO with
> IOMMU SPAPR TCE v1 which reuses these tables but there are
> iommu_take_ownership/iommu_release_ownership to clear these tables. I'll
> send a patch for this.

Did some experiments. It spent the first 9s in tce_free_pSeriesLP()
clearing the TCE entries, and the next 13s in
tce_setrange_multi_pSeriesLP_walk().  BTW: the code in
tce_setrange_multi_pSeriesLP_walk() is modified to use DIRECT_TCE.

So it looks like the amount of time spent in
tce_setrange_multi_pSeriesLP_walk() is a function of the size of the
memory that is mapped in the DDW.
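That scaling is easy to see from the TCE count: one TCE maps one IOMMU page, so the number of H_PUT_TCE operations grows linearly with the window size. The page sizes below are the common 4K/64K cases; the actual window geometry comes from firmware, so treat the figures as illustrative.

```c
/* Why clearing/mapping time scales with window size: a window of S
 * bytes with an IOMMU page size of 2^page_shift needs S >> page_shift
 * TCEs.  A 1GB default window at 4K pages is 256K entries; a 30GB DDW
 * at 64K pages is ~492K entries. */
#include <stdint.h>

static inline uint64_t tce_count(uint64_t window_bytes, unsigned int page_shift)
{
	return window_bytes >> page_shift;
}
```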


> 
..snip..
> 
> > But before I close, you have not told me clearly what the problem
> > is with 'share the page, make the H_PUT_INDIRECT_TCE hcall, unshare the
> > page'.
> 
> Between share and unshare you have a (tiny) window of opportunity to
> attack the guest. No, I do not know how exactly.
> 
> For example, the hypervisor does a lot of PHB+PCI hotplug-unplug with
> 64bit devices - each time this will create a huge window which will
> share/unshare the same page.  No, I do not know exactly how this can
> be exploited either; we cannot rely on what you or I know today. My
> point is that we should not be sharing pages at all unless we really
> really have to, and this does not seem to be the case.
> 
> But since this seems to an acceptable compromise anyway,
> 
> Reviewed-by: Alexey Kardashevskiy 
> 

Thanks!
RP



Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Alexey Kardashevskiy



On 12/12/2019 09:47, Alexey Kardashevskiy wrote:
> 
> 
> On 12/12/2019 07:31, Michael Roth wrote:
>> Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
>>>
>>>
>>> On 11/12/2019 02:35, Ram Pai wrote:
 On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
>
>
> On 10/12/2019 16:12, Ram Pai wrote:
>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 07/12/2019 12:12, Ram Pai wrote:
 H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
 its parameters.  On secure VMs, hypervisor cannot access the contents 
 of
 this page since it gets encrypted.  Hence share the page with the
 hypervisor, and unshare when done.
>>>
>>>
>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>>> pages. There is a small problem that when DDW is enabled,
>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
>>> about the performance on slack but this is caused by initial cleanup of
>>> the default TCE window (which we do not use anyway) and to battle this
>>> we can simply reduce its size by adding
>>
>> something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
>> 13secs per device for H_PUT_TCE approach, during boot. This is with a
>> 30GB guest. With a larger guest, the time will further deteriorate.
>
>
> No it will not, I checked. The time is the same for 2GB and 32GB guests-
> the delay is caused by clearing the small DMA window, which is small in
> terms of space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
> (depends on the system) so the number of TCEs is much smaller.

 I can't get your results.  What changes did you make to get it?
>>>
>>>
>>> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
>>> in clearing the default window and the huge window took a fraction of a
>>> second to create and map.
>>
>> Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
>> of H_PUT_TCE everywhere?
> 
> 
> Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
> fixed (I have it in my local branch), this does not make a difference.
> 
> 
>>
>> In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
>> iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
>> instant),
> 
> PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
> exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
> and H_PUT_TCE_INDIRECT are present or neither.
> 
> 
>> and then force H_PUT_TCE for new mappings via something like:
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 6ba081dd61c9..85d092baf17d 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
>> *tbl, long tcenum,
>> unsigned long flags;
>>  
>> if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
>> +   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || 
>> is_secure_guest()) {
> 
> 
> Nobody (including myself) seems to like the idea of having
> is_secure_guest() all over the place.
> 
> And with KVM acceleration enabled, it is pretty fast anyway. Just now we
> do not have H_PUT_TCE in KVM/UV for secure guests but we will have to
> fix this for secure PCI passthrough anyway.
> 
> 
>> return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
>>direction, attrs);
>> }
>>
>> That seems like it would avoid the extra 13s.
> 
> Or move around iommu_table_clear() which imho is just the right thing to do.


Huh. It is not the right thing, as the firmware could have left mappings
there, so we need the cleanup. Even if I fixed SLOF, there is PowerVM, and I
do not know what it does about TCEs. Thanks,



-- 
Alexey


Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Michael Ellerman
Thadeu Lima de Souza Cascardo  writes:
> On Wed, Dec 11, 2019 at 09:33:53AM -0600, Justin Forbes wrote:
>> On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
>>  wrote:
>> >
>> > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
>> > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  
>> > > wrote:
>> > > >
>> > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
>> > > > > Aurelien Jarno  writes:
>> > > > > > On powerpc with recent versions of binutils, readelf outputs an 
>> > > > > > extra
>> > > > > > field when dumping the symbols of an object file. For example:
>> > > > > >
>> > > > > > 35: 083896 FUNC LOCAL  DEFAULT
>> > > > > > [<localentry>: 8] 1 btf_is_struct
>> > > > > >
>> > > > > > The extra "[<localentry>: 8]" prevents the GLOBAL_SYM_COUNT
>> > > > > > variable from being computed correctly and causes the checkabi
>> > > > > > target to fail.
>> > > > > >
>> > > > > > Fix that by looking for the symbol name in the last field instead 
>> > > > > > of the
>> > > > > > 8th one. This way it should also cope with future extra fields.
>> > > > > >
>> > > > > > Signed-off-by: Aurelien Jarno 
>> > > > > > ---
>> > > > > >  tools/lib/bpf/Makefile | 4 ++--
>> > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
>> > > > >
>> > > > > Thanks for fixing that, it's been on my very long list of test 
>> > > > > failures
>> > > > > for a while.
>> > > > >
>> > > > > Tested-by: Michael Ellerman 
>> > > >
>> > > > Looks good & also continues to work on x86. Applied, thanks!
>> > >
>> > > This actually seems to break horribly on PPC64le with binutils 2.33.1
>> > > resulting in:
>> > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
>> > > match with num of versioned symbols in libbpf.so (184). Please make
>> > > sure all LIBBPF_API symbols are versioned in libbpf.map.
>> > >
>> > > This is the only arch that fails, with x86/arm/aarch64/s390 all
>> > > building fine.  Reverting this patch allows successful build across
>> > > all arches.
>> > >
>> > > Justin
>> >
>> > Well, I ended up debugging this same issue and had the same fix as Jarno's 
>> > when
>> > I noticed his fix was already applied.
>> >
>> > I just installed a system with the latest binutils, 2.33.1, and it still 
>> > breaks
>> > without such fix. Can you tell what is the output of the following command 
>> > on
>> > your system?
>> >
>> > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | 
>> > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ 
>> > {print $0}'
>> >
>> 
>> readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
>> -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
>> !/UND/ {print $0}'
>>373: 000141bc  1376 FUNC GLOBAL DEFAULT 1
>> libbpf_num_possible_cpus [<localentry>: 8]
>>375: 0001869c   176 FUNC GLOBAL DEFAULT 1 btf__free
>> [<localentry>: 8]
> [...]
>
> This is a patch on binutils carried by Fedora:
>
> https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8
>
> " b8265c Have readelf display extra symbol information at the end of the 
> line. "
>
> It has the following comment:
>
> # FIXME: The proper fix would be to update the scripts that are expecting
> #   a fixed output from readelf.  But it seems that some of them are
> #   no longer being maintained.
>
> This commit is from 2017; had it been in binutils upstream, maybe the
> situation right now would be different.

Bleeping bleep.

Looks like it was actually ruby that was the original problem:

  https://bugzilla.redhat.com/show_bug.cgi?id=1479302


Why it wasn't hacked around in the ruby package I don't know; doing it in
the distro binutils package is not ideal.

cheers
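
As a concrete sketch of the field-position problem in this thread: the upstream-style readelf prints the extra powerpc field before the symbol name, so the name stays in the last field and the $NF fix works, while the Fedora-patched readelf appends the field after the name, so even $NF mis-parses it. The sample lines below are reconstructed from the outputs quoted above; the addresses and sizes are illustrative.

```shell
# Symbol line as printed by upstream-style binutils on powerpc:
# the extra "[<localentry>: 8]" field comes *before* the symbol name.
upstream='35: 0000000000083896    96 FUNC LOCAL DEFAULT [<localentry>: 8] 1 btf_is_struct'

# Symbol line as printed by the Fedora-patched readelf:
# the extra field is appended *after* the symbol name.
fedora='375: 000000000001869c   176 FUNC GLOBAL DEFAULT 1 btf__free [<localentry>: 8]'

# The libbpf Makefile fix takes the last awk field as the symbol name:
name_upstream=$(printf '%s\n' "$upstream" | awk '{print $NF}')
name_fedora=$(printf '%s\n' "$fedora" | awk '{print $NF}')

echo "upstream: $name_upstream"   # prints "upstream: btf_is_struct" (correct)
echo "fedora:   $name_fedora"     # prints "fedora:   8]" (wrong)
```

This is why a fixed column index ($8) and the last-field approach ($NF) each break on one of the two output shapes; only stripping the bracketed field (or matching the name by pattern) would cope with both.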


Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Alexey Kardashevskiy



On 12/12/2019 07:31, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
>>
>>
>> On 11/12/2019 02:35, Ram Pai wrote:
>>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:


 On 10/12/2019 16:12, Ram Pai wrote:
> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 07/12/2019 12:12, Ram Pai wrote:
>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>> its parameters.  On secure VMs, hypervisor cannot access the contents of
>>> this page since it gets encrypted.  Hence share the page with the
>>> hypervisor, and unshare when done.
>>
>>
>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>> pages. There is a small problem that when DDW is enabled,
>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
>> about the performance on slack but this is caused by initial cleanup of
>> the default TCE window (which we do not use anyway) and to battle this
>> we can simply reduce its size by adding
>
> something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> 13secs per device for H_PUT_TCE approach, during boot. This is with a
> 30GB guest. With a larger guest, the time will further deteriorate.


 No it will not, I checked. The time is the same for 2GB and 32GB guests-
 the delay is caused by clearing the small DMA window, which is small in
 terms of space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
 for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
 (depends on the system) so the number of TCEs is much smaller.
>>>
>>> I can't get your results.  What changes did you make to get it?
>>
>>
>> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
>> in clearing the default window and the huge window took a fraction of a
>> second to create and map.
> 
> Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
> of H_PUT_TCE everywhere?


Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
fixed (I have it in my local branch), this does not make a difference.


> 
> In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
> iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
> instant),

PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
and H_PUT_TCE_INDIRECT are present or neither.


> and then force H_PUT_TCE for new mappings via something like:
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 6ba081dd61c9..85d092baf17d 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
> *tbl, long tcenum,
> unsigned long flags;
>  
> if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> +   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || 
> is_secure_guest()) {


Nobody (including myself) seems to like the idea of having
is_secure_guest() all over the place.

And with KVM acceleration enabled, it is pretty fast anyway. Just now we
do not have H_PUT_TCE in KVM/UV for secure guests but we will have to
fix this for secure PCI passthrough anyway.


> return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
>direction, attrs);
> }
> 
> That seems like it would avoid the extra 13s.

Or move around iommu_table_clear() which imho is just the right thing to do.


> If we take the additional step of only mapping SWIOTLB range in
> enable_ddw() for is_secure_guest() that might further improve things
> (though the bigger motivation with that is the extra isolation it would
> grant us for stuff behind the IOMMU, since it apparently doesn't affect
> boot-time all that much)


Sure, we just need to confirm how many of these swiotlb banks we are
going to have (just one or many and at what location). Thanks,



> 
>>
>>
>>
>> -global
>> spapr-pci-host-bridge.dma_win_size=0x400
>
> This option speeds it up tremendously.  But then should this option be
> enabled in qemu by default?  only for secure VMs? for both VMs?


 As discussed in slack, by default we do not need to clear the entire TCE
 table and we only have to map swiotlb buffer using the small window. It
 is a guest kernel change only. Thanks,
>>>
>>> Can you tell me what code you are talking about here.  Where is the TCE
>>> table getting cleared? What code needs to be changed to not clear it?
>>
>>
>> pci_dma_bus_setup_pSeriesLP()
>> iommu_init_table()
>> iommu_table_clear()
>> for () tbl->it_ops->get()
>>
>> We do not really 

Re: [PATCH v9 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-11 Thread John Hubbard
On 12/11/19 12:57 PM, Jonathan Corbet wrote:
> On Tue, 10 Dec 2019 18:53:03 -0800
> John Hubbard  wrote:
> 
>> Introduce pin_user_pages*() variations of get_user_pages*() calls,
>> and also pin_longterm_pages*() variations.
> 
> Just a couple of nits on the documentation patch
> 
>> +++ b/Documentation/core-api/pin_user_pages.rst
>> @@ -0,0 +1,232 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==================================
>> +pin_user_pages() and related calls
>> +==================================
>> +
>> +.. contents:: :local:
>> +
>> +Overview
>> +========
>> +
>> +This document describes the following functions: ::
>> +
>> + pin_user_pages
>> + pin_user_pages_fast
>> + pin_user_pages_remote
> 
> You could just say "the following functions::" and get the result you're
> after with a slightly less alien plain-text reading experience.

I see. That works nicely: same result with fewer :'s. 

> 
> Of course, you could also just say "This document describes
> pin_user_pages(), pin_user_pages_fast(), and pin_user_pages_remote()." But
> that's a matter of personal taste, I guess.  Using the function() notation
> will cause the docs system to automatically link to the kerneldoc info,
> though.  

OK. I did try the single-sentence approach just now, but to me the one-per-line
seems to make both the text and the generated HTML slightly easier to look at. 
Of course, like you say, different people will have different preferences. So 
in the end I've combined the tips, like this:

+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()


> 
>> +Basic description of FOLL_PIN
>> +=============================
>> +
>> +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
>> get_user_pages*()
>> +("gup") family of functions. FOLL_PIN has significant interactions and
>> +interdependencies with FOLL_LONGTERM, so both are covered here.
>> +
>> +FOLL_PIN is internal to gup, meaning that it should not appear at the gup 
>> call
>> +sites. This allows the associated wrapper functions  (pin_user_pages*() and
>> +others) to set the correct combination of these flags, and to check for 
>> problems
>> +as well.
>> +
>> +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call 
>> sites.
>> +This is in order to avoid creating a large number of wrapper functions to 
>> cover
>> +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
>> +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() 
>> APIs, so
>> +that's a natural dividing line, and a good point to make separate wrapper 
>> calls.
>> +In other words, use pin_user_pages*() for DMA-pinned pages, and
>> +get_user_pages*() for other cases. There are four cases described later on 
>> in
>> +this document, to further clarify that concept.
>> +
>> +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
>> +multiple threads and call sites are free to pin the same struct pages, via 
>> both
>> +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or 
>> the
>> +other, not the struct page(s).
>> +
>> +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
>> FOLL_PIN
>> +uses a different reference counting technique.
>> +
>> +FOLL_PIN is a prerequisite to FOLL_LONGTGERM. Another way of saying that is,
> 
> FOLL_LONGTERM typoed there.
> 

Good catch. Fixed.

thanks,
-- 
John Hubbard
NVIDIA




Re: [PATCH v9 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-11 Thread Jonathan Corbet
On Tue, 10 Dec 2019 18:53:03 -0800
John Hubbard  wrote:

> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.

Just a couple of nits on the documentation patch

> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -0,0 +1,232 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================
> +pin_user_pages() and related calls
> +==================================
> +
> +.. contents:: :local:
> +
> +Overview
> +========
> +
> +This document describes the following functions: ::
> +
> + pin_user_pages
> + pin_user_pages_fast
> + pin_user_pages_remote

You could just say "the following functions::" and get the result you're
after with a slightly less alien plain-text reading experience.

Of course, you could also just say "This document describes
pin_user_pages(), pin_user_pages_fast(), and pin_user_pages_remote()." But
that's a matter of personal taste, I guess.  Using the function() notation
will cause the docs system to automatically link to the kerneldoc info,
though.  

> +Basic description of FOLL_PIN
> +=============================
> +
> +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
> get_user_pages*()
> +("gup") family of functions. FOLL_PIN has significant interactions and
> +interdependencies with FOLL_LONGTERM, so both are covered here.
> +
> +FOLL_PIN is internal to gup, meaning that it should not appear at the gup 
> call
> +sites. This allows the associated wrapper functions  (pin_user_pages*() and
> +others) to set the correct combination of these flags, and to check for 
> problems
> +as well.
> +
> +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call 
> sites.
> +This is in order to avoid creating a large number of wrapper functions to 
> cover
> +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
> +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, 
> so
> +that's a natural dividing line, and a good point to make separate wrapper 
> calls.
> +In other words, use pin_user_pages*() for DMA-pinned pages, and
> +get_user_pages*() for other cases. There are four cases described later on in
> +this document, to further clarify that concept.
> +
> +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
> +multiple threads and call sites are free to pin the same struct pages, via 
> both
> +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or 
> the
> +other, not the struct page(s).
> +
> +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
> FOLL_PIN
> +uses a different reference counting technique.
> +
> +FOLL_PIN is a prerequisite to FOLL_LONGTGERM. Another way of saying that is,

FOLL_LONGTERM typoed there.

Thanks,

jon


Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> 
> 
> On 11/12/2019 02:35, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 10/12/2019 16:12, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 07/12/2019 12:12, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > its parameters.  On secure VMs, hypervisor cannot access the contents of
> > this page since it gets encrypted.  Hence share the page with the
> > hypervisor, and unshare when done.
> 
> 
>  I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>  pages. There is a small problem that when DDW is enabled,
>  FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
>  about the performance on slack but this is caused by initial cleanup of
>  the default TCE window (which we do not use anyway) and to battle this
>  we can simply reduce its size by adding
> >>>
> >>> something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> >>> 13secs per device for H_PUT_TCE approach, during boot. This is with a
> >>> 30GB guest. With a larger guest, the time will further deteriorate.
> >>
> >>
> >> No it will not, I checked. The time is the same for 2GB and 32GB guests-
> >> the delay is caused by clearing the small DMA window, which is small in
> >> terms of space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
> >> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
> >> (depends on the system) so the number of TCEs is much smaller.
> > 
> > I can't get your results.  What changes did you make to get it?
> 
> 
> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
> in clearing the default window and the huge window took a fraction of a
> second to create and map.

Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
of H_PUT_TCE everywhere?

In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
instant), and then force H_PUT_TCE for new mappings via something like:

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081dd61c9..85d092baf17d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
unsigned long flags;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
+   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || 
is_secure_guest()) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
   direction, attrs);
}

That seems like it would avoid the extra 13s.

If we take the additional step of only mapping SWIOTLB range in
enable_ddw() for is_secure_guest() that might further improve things
(though the bigger motivation with that is the extra isolation it would
grant us for stuff behind the IOMMU, since it apparently doesn't affect
boot-time all that much)

> 
> 
> 
>  -global
>  spapr-pci-host-bridge.dma_win_size=0x400
> >>>
> >>> This option speeds it up tremendously.  But then should this option be
> >>> enabled in qemu by default?  only for secure VMs? for both VMs?
> >>
> >>
> >> As discussed in slack, by default we do not need to clear the entire TCE
> >> table and we only have to map swiotlb buffer using the small window. It
> >> is a guest kernel change only. Thanks,
> > 
> > Can you tell me what code you are talking about here.  Where is the TCE
> > table getting cleared? What code needs to be changed to not clear it?
> 
> 
> pci_dma_bus_setup_pSeriesLP()
> iommu_init_table()
> iommu_table_clear()
> for () tbl->it_ops->get()
> 
> We do not really need to clear it there, we only need it for VFIO with
> IOMMU SPAPR TCE v1 which reuses these tables but there are
> iommu_take_ownership/iommu_release_ownership to clear these tables. I'll
> send a patch for this.


> 
> 
> > Is the code in tce_buildmulti_pSeriesLP(), the one that does the clear
> > aswell?
> 
> 
> This one does not need to clear TCEs as this creates a window of known
> size and maps it all.
> 
> Well, actually, it only maps actual guest RAM; if there are gaps in RAM,
> then TCEs for the gaps will have what hypervisor had there (which is
> zeroes, qemu/kvm clears it anyway).
> 
> 
> > But before I close, you have not told me clearly what the problem is
> > with 'share the page, make the H_PUT_TCE_INDIRECT hcall, unshare the
> > page'.
> 
> Between share and unshare you have a (tiny) window of opportunity to
> attack the guest. No, I do not know how exactly.
> 
> For example, the hypervisor does a lot of PHB+PCI hotplug-unplug 

[Bug 205099] KASAN hit at raid6_pq: BUG: Unable to handle kernel data access at 0x00f0fd0d

2019-12-11 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205099

Christophe Leroy (christophe.le...@c-s.fr) changed:

           What    |Removed |Added
----------------------------------------
                 CC|        |christophe.le...@c-s.fr

--- Comment #6 from Christophe Leroy (christophe.le...@c-s.fr) ---
Obviously, r9 is wrong

2538:   13 04 c4 c4     vxor     v24,v4,v24
253c:   7d 20 48 ce     lvx      v9,0,r9
2540:   39 21 00 90     addi     r9,r1,144
2544:   13 25 cc c4     vxor     v25,v5,v25
2548:   7d 60 48 ce     lvx      v11,0,r9
254c:   13 46 d4 c4     vxor     v26,v6,v26
2550:   81 21 00 88     lwz      r9,136(r1)  <== r9 is loaded here
2554:   13 67 dc c4     vxor     v27,v7,v27
2558:   7d 11 a8 ce     lvx      v8,r17,r21
255c:   11 5f 5b 06     vcmpgtsb v10,v31,v11
2560:   11 6b 58 00     vaddubm  v11,v11,v11
2564:   81 41 00 8c     lwz      r10,140(r1)
==> 2568:   7c 00 48 ce     lvx      v0,0,r9
256c:   39 21 00 a0     addi     r9,r1,160
2570:   7d 80 48 ce     lvx      v12,0,r9
2574:   39 21 00 b0     addi     r9,r1,176

So the stack must be clobbered somewhere

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-11 Thread Christoph Hellwig
On Wed, Dec 11, 2019 at 12:07:17PM -0600, Michael Roth wrote:
> > io_tlb_start/io_tlb_end are only guaranteed to stay within 4GB and our
> > default DMA window is 1GB (KVM) or 2GB (PowerVM), ok, we can define
> > ARCH_LOW_ADDRESS_LIMIT as 1GB.
> 
> True, and limiting allocations to under 1GB might be brittle (also saw a
> > patch floating around that increased IO_TLB_DEFAULT_SIZE to 1GB,
> which obviously wouldn't work out with this approach, but not sure if
> that's still needed or not: "powerpc/svm: Increase SWIOTLB buffer size")

FYI, there is a patch out there that allocates the powerpc swiotlb
from the bottom of the memblock area instead of the top to fix an 85xx
regression.

Also the AMD folks have been asking about non-GFP_DMA32 swiotlb pools
as they have the same bounce buffer issue with SEV.  I think it is
entirely doable to have multiple swiotlb pool, I just need a volunteer
to implement that.

> 
> However that's only an issue if we insist on using an identity mapping
> in the IOMMU, which would be nice because non-IOMMU virtio would
> magically work, but since that's not a goal of this series I think we do
> have the option of mapping io_tlb_start at DMA address 0 (or
> thereabouts).
> 
> We'd probably need to modify __phys_to_dma to treat archdata.dma_offset
> as a negative offset in this case, but it seems like it would work about
> the same as with DDW offset.

Or switch to the generic version of __phys_to_dma that has a negative
offset.  We'd just need to look into a signed value for dma_pfn_offset
to allow for the existing platforms that need the current positive
offset.


Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-11 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 02:36:29)
> 
> 
> On 11/12/2019 12:43, Michael Roth wrote:
> > Quoting Ram Pai (2019-12-06 19:12:39)
> >> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> >> secure guests")
> >> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> >> path for secure VMs, helped enable dma_direct path.  This enabled
> >> support for bounce-buffering through SWIOTLB.  However it fails to
> >> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> >>
> >> Re-enable dma_iommu_ops path for pseries Secure VMs.  It handles all
> >> cases including, TCE mapping I/O pages, in the presence of a
> >> IOMMU.
> > 
> > Wasn't clear to me at first, but I guess the main gist of this series is
> > that we want to continue to use SWIOTLB, but also need to create mappings
> > of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> > checks in DMA API functions to do the same.
> 
> 
> Correct. Took me a bit of time to realize what we got here :) We only
> rely on  dma_iommu_ops::.dma_supported to write the DMA offset to a
> device (when creating a huge window), and after that we know it is
> mapped directly and swiotlb gets this 1<<59 offset via __phys_to_dma().
> 
> 
> > That makes sense, but one issue I see with that is that
> > dma_iommu_map_bypass() only tests true if all the following are true:
> > 
> > 1) the device requests a 64-bit DMA mask via
> >dma_set_mask/dma_set_coherent_mask
> > 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> > 
> > dma_is_direct() checks don't have this limitation, so I think for
> > cases such as devices that use a smaller DMA mask, we'll
> > end up falling back to the non-bypass functions in dma_iommu_ops, which
> > will likely break for things like dma_alloc_coherent/dma_map_single
> > since they won't use SWIOTLB pages and won't do the necessary calls to
> > set_memory_unencrypted() to share those non-SWIOTLB buffers with
> > hypervisor.
> > 
> > Maybe that's ok, but I think we should be clearer about how to
> > fail/handle these cases.
> > 
> > Though I also agree with some concerns Alexey stated earlier: it seems
> > wasteful to map the entire DDW window just so these bounce buffers can be
> > mapped.  Especially if you consider the lack of a mapping to be an 
> > additional
> > safeguard against things like buggy device implementations on the QEMU
> > side. E.g. if we leaked pages to the hypervisor by accident, those pages
> > wouldn't be immediately accessible to a device, and would still require
> > additional work to get past the IOMMU.
> > 
> > What would it look like if we try to make all this work with disable_ddw 
> > passed
> > to kernel command-line (or forced for is_secure_guest())?
> > 
> >   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* 
> > ops,
> >  but an additional case or hook that considers is_secure_guest() might 
> > do
> >  it.
> >  
> >   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
> >  io_tlb_start/io_tlb_end. We could do it once, on-demand via
> >  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
> >  maybe in some init function.
> 
> 
> io_tlb_start/io_tlb_end are only guaranteed to stay within 4GB and our
> default DMA window is 1GB (KVM) or 2GB (PowerVM), ok, we can define
> ARCH_LOW_ADDRESS_LIMIT as 1GB.

True, and limiting allocations to under 1GB might be brittle (also saw a
patch floating around that increased IO_TLB_DEFAULT_SIZE to 1GB,
which obviously wouldn't work out with this approach, but not sure if
that's still needed or not: "powerpc/svm: Increase SWIOTLB buffer size")

However that's only an issue if we insist on using an identity mapping
in the IOMMU, which would be nice because non-IOMMU virtio would
magically work, but since that's not a goal of this series I think we do
have the option of mapping io_tlb_start at DMA address 0 (or
thereabouts).

We'd probably need to modify __phys_to_dma to treat archdata.dma_offset
as a negative offset in this case, but it seems like it would work about
the same as with DDW offset.

But yah, it does make things a bit less appealing than what I was initially
thinking with that approach...

> 
> But it has also been mentioned that we are likely to be having swiotlb
> buffers outside of the first 4GB as they are not just for crippled
> devices any more. So we are likely to have a 64bit window; I'd just ditch
> the default window then. I have patches for this, but every time I
> thought I had a use case, it turned out that I did not.

Not sure I've seen this discussion, maybe it was on slack? By crippled
devices do you mean virtio with IOMMU off? Isn't the swiotlb buffer limited
to under ARCH_LOW_ADDRESS_LIMIT in any 

Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Justin Forbes
On Wed, Dec 11, 2019 at 10:01 AM Thadeu Lima de Souza Cascardo
 wrote:
>
> On Wed, Dec 11, 2019 at 09:33:53AM -0600, Justin Forbes wrote:
> > On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
> >  wrote:
> > >
> > > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> > > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  
> > > > wrote:
> > > > >
> > > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > > > > Aurelien Jarno  writes:
> > > > > > > On powerpc with recent versions of binutils, readelf outputs an 
> > > > > > > extra
> > > > > > > field when dumping the symbols of an object file. For example:
> > > > > > >
> > > > > > > 35: 083896 FUNC LOCAL  DEFAULT 
> > > > > > > [<localentry>: 8] 1 btf_is_struct
> > > > > > >
> > > > > > > The extra "[<localentry>: 8]" prevents the GLOBAL_SYM_COUNT 
> > > > > > > variable from
> > > > > > > being computed correctly and causes the checkabi target to fail.
> > > > > > >
> > > > > > > Fix that by looking for the symbol name in the last field instead 
> > > > > > > of the
> > > > > > > 8th one. This way it should also cope with future extra fields.
> > > > > > >
> > > > > > > Signed-off-by: Aurelien Jarno 
> > > > > > > ---
> > > > > > >  tools/lib/bpf/Makefile | 4 ++--
> > > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > Thanks for fixing that, it's been on my very long list of test 
> > > > > > failures
> > > > > > for a while.
> > > > > >
> > > > > > Tested-by: Michael Ellerman 
> > > > >
> > > > > Looks good & also continues to work on x86. Applied, thanks!
> > > >
> > > > This actually seems to break horribly on PPC64le with binutils 2.33.1
> > > > resulting in:
> > > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> > > > match with num of versioned symbols in libbpf.so (184). Please make
> > > > sure all LIBBPF_API symbols are versioned in libbpf.map.
> > > >
> > > > This is the only arch that fails, with x86/arm/aarch64/s390 all
> > > > building fine.  Reverting this patch allows successful build across
> > > > all arches.
> > > >
> > > > Justin
> > >
> > > Well, I ended up debugging this same issue and had the same fix as 
> > > Jarno's when
> > > I noticed his fix was already applied.
> > >
> > > I just installed a system with the latest binutils, 2.33.1, and it still 
> > > breaks
> > > without such fix. Can you tell what is the output of the following 
> > > command on
> > > your system?
> > >
> > > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | 
> > > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ 
> > > {print $0}'
> > >
> >
> > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
> > -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
> > !/UND/ {print $0}'
>373: 000141bc  1376 FUNC GLOBAL DEFAULT 1
> > libbpf_num_possible_cpus [<localentry>: 8]
>375: 0001869c   176 FUNC GLOBAL DEFAULT 1 btf__free
> > [<localentry>: 8]
> [...]
>
> This is a patch on binutils carried by Fedora:
>
> https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8
>
> " b8265c Have readelf display extra symbol information at the end of the 
> line. "
>
> It has the following comment:
>
> # FIXME: The proper fix would be to update the scripts that are expecting
> #   a fixed output from readelf.  But it seems that some of them are
> #   no longer being maintained.
>
> This commit is from 2017, had it been on binutils upstream, maybe the 
> situation
> right now would be different.
>
> Honestly, it seems the best way out is to filter the other information in the
> libbpf Makefile.
>
> Does the following patch work for you?
>
>
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index 56ce6292071b..e6f99484d7d5 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -145,6 +145,7 @@ PC_FILE := $(addprefix $(OUTPUT),$(PC_FILE))
>
>  GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \
>cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | 
> \
> +  sed 's/\[.*\]//' | \
>awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}' 
> | \
>sort -u | wc -l)
>  VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \
> @@ -217,6 +218,7 @@ check_abi: $(OUTPUT)libbpf.so
>  "versioned in $(VERSION_SCRIPT)." >&2;  \
> readelf -s --wide $(OUTPUT)libbpf-in.o | \
> cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' |   \
> +   sed 's/\[.*\]//' |   \
> awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}'|   \
> sort -u > $(OUTPUT)libbpf_global_syms.tmp;   \
> readelf -s --wide 

[PATCH v6 4/6] powerpc/powernv: move core and fadump_release_opalcore under new kobject

2019-12-11 Thread Sourabh Jain
The /sys/firmware/opal/core and /sys/kernel/fadump_release_opalcore sysfs
files are used to export and release the OPAL memory on the PowerNV
platform. Let's organize them into a new kobject under the
/sys/firmware/opal/mpipl/ directory.

A symlink is added to maintain the backward compatibility for
/sys/firmware/opal/core sysfs file.

Signed-off-by: Sourabh Jain 
---
 .../sysfs-kernel-fadump_release_opalcore  |  2 +
 .../powerpc/firmware-assisted-dump.rst| 15 +++--
 arch/powerpc/platforms/powernv/opal-core.c| 55 ++-
 3 files changed, 51 insertions(+), 21 deletions(-)
 rename Documentation/ABI/{testing => 
removed}/sysfs-kernel-fadump_release_opalcore (82%)

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore 
b/Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore
similarity index 82%
rename from Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore
rename to Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore
index 53313c1d4e7a..a8d46cd0f4e6 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore
+++ b/Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore
@@ -1,3 +1,5 @@
+This ABI is moved to /sys/firmware/opal/mpipl/release_core.
+
 What:  /sys/kernel/fadump_release_opalcore
 Date:  Sep 2019
 Contact:   linuxppc-dev@lists.ozlabs.org
diff --git a/Documentation/powerpc/firmware-assisted-dump.rst 
b/Documentation/powerpc/firmware-assisted-dump.rst
index 0455a78486d5..345a3405206e 100644
--- a/Documentation/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/powerpc/firmware-assisted-dump.rst
@@ -112,13 +112,13 @@ to ensure that crash data is preserved to process later.
 
 -- On OPAL based machines (PowerNV), if the kernel is build with
CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also
-   exported as /sys/firmware/opal/core file. This procfs file is
+   exported as /sys/firmware/opal/mpipl/core file. This procfs file is
helpful in debugging OPAL crashes with GDB. The kernel memory
used for exporting this procfs file can be released by echo'ing
-   '1' to /sys/kernel/fadump_release_opalcore node.
+   '1' to /sys/firmware/opal/mpipl/release_core node.
 
e.g.
- # echo 1 > /sys/kernel/fadump_release_opalcore
+ # echo 1 > /sys/firmware/opal/mpipl/release_core
 
 Implementation details:
 ---
@@ -283,14 +283,17 @@ Here is the list of files under kernel sysfs:
 enhanced to use this interface to release the memory reserved for
 dump and continue without 2nd reboot.
 
- /sys/kernel/fadump_release_opalcore
+Note: /sys/kernel/fadump_release_opalcore sysfs has moved to
+  /sys/firmware/opal/mpipl/release_core
+
+ /sys/firmware/opal/mpipl/release_core
 
 This file is available only on OPAL based machines when FADump is
 active during capture kernel. This is used to release the memory
-used by the kernel to export /sys/firmware/opal/core file. To
+used by the kernel to export /sys/firmware/opal/mpipl/core file. To
 release this memory, echo '1' to it:
 
-echo 1  > /sys/kernel/fadump_release_opalcore
+echo 1  > /sys/firmware/opal/mpipl/release_core
 
 Here is the list of files under powerpc debugfs:
 (Assuming debugfs is mounted on /sys/kernel/debug directory.)
diff --git a/arch/powerpc/platforms/powernv/opal-core.c 
b/arch/powerpc/platforms/powernv/opal-core.c
index ed895d82c048..6dba3b62269f 100644
--- a/arch/powerpc/platforms/powernv/opal-core.c
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -71,6 +71,7 @@ static LIST_HEAD(opalcore_list);
 static struct opalcore_config *oc_conf;
 static const struct opal_mpipl_fadump *opalc_metadata;
 static const struct opal_mpipl_fadump *opalc_cpu_metadata;
+struct kobject *mpipl_kobj;
 
 /*
  * Set crashing CPU's signal to SIGUSR1. if the kernel is triggered
@@ -428,7 +429,7 @@ static void opalcore_cleanup(void)
return;
 
/* Remove OPAL core sysfs file */
-   sysfs_remove_bin_file(opal_kobj, &opal_core_attr);
+   sysfs_remove_bin_file(mpipl_kobj, &opal_core_attr);
oc_conf->ptload_phdr = NULL;
oc_conf->ptload_cnt = 0;
 
@@ -563,9 +564,9 @@ static void __init opalcore_config_init(void)
of_node_put(np);
 }
 
-static ssize_t fadump_release_opalcore_store(struct kobject *kobj,
-struct kobj_attribute *attr,
-const char *buf, size_t count)
+static ssize_t release_core_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
 {
int input = -1;
 
@@ -589,9 +590,23 @@ static ssize_t fadump_release_opalcore_store(struct 
kobject *kobj,
return count;
 }
 
-static struct kobj_attribute opalcore_rel_attr = 
__ATTR(fadump_release_opalcore,
-   0200, NULL,
- 

[PATCH v6 6/6] powerpc/fadump: sysfs for fadump memory reservation

2019-12-11 Thread Sourabh Jain
Add a sysfs interface to allow querying the memory reserved by FADump for
saving the crash dump.

Also add Documentation/ABI entries for the new sysfs file.

Signed-off-by: Sourabh Jain 
---
 Documentation/ABI/testing/sysfs-kernel-fadump| 7 +++
 Documentation/powerpc/firmware-assisted-dump.rst | 5 +
 arch/powerpc/kernel/fadump.c | 9 +
 3 files changed, 21 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump 
b/Documentation/ABI/testing/sysfs-kernel-fadump
index 5d988b919e81..8f7a64a81783 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump
@@ -31,3 +31,10 @@ Description: write only
the system is booted to capture the vmcore using FADump.
It is used to release the memory reserved by FADump to
save the crash dump.
+
+What:  /sys/kernel/fadump/mem_reserved
+Date:  Dec 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read only
+   Provide information about the amount of memory reserved by
+   FADump to save the crash dump in bytes.
diff --git a/Documentation/powerpc/firmware-assisted-dump.rst 
b/Documentation/powerpc/firmware-assisted-dump.rst
index 365c10209ef3..04993eaf3113 100644
--- a/Documentation/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/powerpc/firmware-assisted-dump.rst
@@ -268,6 +268,11 @@ Here is the list of files under kernel sysfs:
 be handled and vmcore will not be captured. This interface can be
 easily integrated with kdump service start/stop.
 
+ /sys/kernel/fadump/mem_reserved
+
+   This is used to display the memory reserved by FADump for saving the
+   crash dump.
+
  /sys/kernel/fadump_release_mem
 This file is available only when FADump is active during
 second kernel. This is used to release the reserved memory
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 23f15bfea512..1c9b23c5b296 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1369,6 +1369,13 @@ static ssize_t enabled_show(struct kobject *kobj,
return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
 }
 
+static ssize_t mem_reserved_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *buf)
+{
+   return sprintf(buf, "%ld\n", fw_dump.reserve_dump_area_size);
+}
+
 static ssize_t registered_show(struct kobject *kobj,
   struct kobj_attribute *attr,
   char *buf)
@@ -1433,10 +1440,12 @@ static int fadump_region_show(struct seq_file *m, void 
*private)
 static struct kobj_attribute release_attr = __ATTR_WO(release_mem);
 static struct kobj_attribute enable_attr = __ATTR_RO(enabled);
 static struct kobj_attribute register_attr = __ATTR_RW(registered);
+static struct kobj_attribute mem_reserved_attr = __ATTR_RO(mem_reserved);
 
 static struct attribute *fadump_attrs[] = {
   &enable_attr.attr,
   &register_attr.attr,
+   &mem_reserved_attr.attr,
NULL,
 };
 
-- 
2.17.2



[PATCH v6 5/6] Documentation/ABI: mark /sys/kernel/fadump_* sysfs files deprecated

2019-12-11 Thread Sourabh Jain
Add a deprecation note in FADump sysfs ABI documentation files and move
them from ABI/testing to ABI/obsolete directory.

Signed-off-by: Sourabh Jain 
---
 .../ABI/{testing => obsolete}/sysfs-kernel-fadump_enabled | 2 ++
 .../{testing => obsolete}/sysfs-kernel-fadump_registered  | 2 ++
 .../{testing => obsolete}/sysfs-kernel-fadump_release_mem | 2 ++
 Documentation/powerpc/firmware-assisted-dump.rst  | 8 
 4 files changed, 14 insertions(+)
 rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-fadump_enabled 
(73%)
 rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-fadump_registered 
(77%)
 rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-fadump_release_mem 
(78%)

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_enabled 
b/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled
similarity index 73%
rename from Documentation/ABI/testing/sysfs-kernel-fadump_enabled
rename to Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled
index f73632b1c006..e9c2de8b3688 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump_enabled
+++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled
@@ -1,3 +1,5 @@
+This ABI is renamed and moved to a new location /sys/kernel/fadump/enabled.
+
 What:  /sys/kernel/fadump_enabled
 Date:  Feb 2012
 Contact:   linuxppc-dev@lists.ozlabs.org
diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_registered 
b/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered
similarity index 77%
rename from Documentation/ABI/testing/sysfs-kernel-fadump_registered
rename to Documentation/ABI/obsolete/sysfs-kernel-fadump_registered
index dcf925e53f0f..0360be39c98e 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump_registered
+++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered
@@ -1,3 +1,5 @@
+This ABI is renamed and moved to a new location 
/sys/kernel/fadump/registered.
+
 What:  /sys/kernel/fadump_registered
 Date:  Feb 2012
 Contact:   linuxppc-dev@lists.ozlabs.org
diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_release_mem 
b/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem
similarity index 78%
rename from Documentation/ABI/testing/sysfs-kernel-fadump_release_mem
rename to Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem
index 9c20d64ab48d..6ce0b129ab12 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump_release_mem
+++ b/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem
@@ -1,3 +1,5 @@
+This ABI is renamed and moved to a new location 
/sys/kernel/fadump/release_mem.
+
 What:  /sys/kernel/fadump_release_mem
 Date:  Feb 2012
 Contact:   linuxppc-dev@lists.ozlabs.org
diff --git a/Documentation/powerpc/firmware-assisted-dump.rst 
b/Documentation/powerpc/firmware-assisted-dump.rst
index 345a3405206e..365c10209ef3 100644
--- a/Documentation/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/powerpc/firmware-assisted-dump.rst
@@ -295,6 +295,14 @@ Note: /sys/kernel/fadump_release_opalcore sysfs has moved 
to
 
 echo 1  > /sys/firmware/opal/mpipl/release_core
 
+Note: The following FADump sysfs files are deprecated.
+
+Deprecated   Alternative
+
+/sys/kernel/fadump_enabled   /sys/kernel/fadump/enabled
+/sys/kernel/fadump_registered/sys/kernel/fadump/registered
+/sys/kernel/fadump_release_mem   /sys/kernel/fadump/release_mem
+
 Here is the list of files under powerpc debugfs:
 (Assuming debugfs is mounted on /sys/kernel/debug directory.)
 
-- 
2.17.2



[PATCH v6 3/6] powerpc/fadump: reorganize /sys/kernel/fadump_* sysfs files

2019-12-11 Thread Sourabh Jain
As the number of FADump sysfs files increases, it is hard to manage all of
them inside the /sys/kernel directory. It's better to have all the FADump
related sysfs files in a dedicated directory, /sys/kernel/fadump. But in
order to maintain backward compatibility, a symlink has been added for every
sysfs file that has moved to the new location.

As the FADump sysfs files are now part of a dedicated directory, there is no
need to prefix their names with fadump_, hence the sysfs file names are also
updated. For example, the fadump_enabled sysfs file is now simply enabled.

Also consolidate ABI documentation for all the FADump sysfs files in a
single file Documentation/ABI/testing/sysfs-kernel-fadump.

Signed-off-by: Sourabh Jain 
---
 Documentation/ABI/testing/sysfs-kernel-fadump |  33 +
 arch/powerpc/kernel/fadump.c  | 118 +-
 2 files changed, 117 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump 
b/Documentation/ABI/testing/sysfs-kernel-fadump
new file mode 100644
index ..5d988b919e81
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump
@@ -0,0 +1,33 @@
+What:  /sys/kernel/fadump/*
+Date:  Dec 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:
+   The /sys/kernel/fadump/* is a collection of FADump sysfs
+   file provide information about the configuration status
+   of Firmware Assisted Dump (FADump).
+
+What:  /sys/kernel/fadump/enabled
+Date:  Dec 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read only
+   Primarily used to identify whether the FADump is enabled in
+   the kernel or not.
+User:  Kdump service
+
+What:  /sys/kernel/fadump/registered
+Date:  Dec 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read/write
+   Helps to control the dump collect feature from userspace.
+   Setting 1 to this file enables the system to collect the
+   dump and 0 to disable it.
+User:  Kdump service
+
+What:  /sys/kernel/fadump/release_mem
+Date:  Dec 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   write only
+   This is a special sysfs file and only available when
+   the system is booted to capture the vmcore using FADump.
+   It is used to release the memory reserved by FADump to
+   save the crash dump.
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ed59855430b9..23f15bfea512 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -36,6 +36,8 @@ static struct fw_dump fw_dump;
 
 static void __init fadump_reserve_crash_area(u64 base);
 
+struct kobject *fadump_kobj;
+
 #ifndef CONFIG_PRESERVE_FA_DUMP
 static DEFINE_MUTEX(fadump_mutex);
 struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0 };
@@ -1323,9 +1325,9 @@ static void fadump_invalidate_release_mem(void)
fw_dump.ops->fadump_init_mem_struct(&fw_dump);
 }
 
-static ssize_t fadump_release_memory_store(struct kobject *kobj,
-   struct kobj_attribute *attr,
-   const char *buf, size_t count)
+static ssize_t release_mem_store(struct kobject *kobj,
+struct kobj_attribute *attr,
+const char *buf, size_t count)
 {
int input = -1;
 
@@ -1350,23 +1352,33 @@ static ssize_t fadump_release_memory_store(struct 
kobject *kobj,
return count;
 }
 
-static ssize_t fadump_enabled_show(struct kobject *kobj,
-   struct kobj_attribute *attr,
-   char *buf)
+/* Release the reserved memory and disable the FADump */
+static void unregister_fadump(void)
+{
+   fadump_cleanup();
+   fadump_release_memory(fw_dump.reserve_dump_area_start,
+ fw_dump.reserve_dump_area_size);
+   fw_dump.fadump_enabled = 0;
+   kobject_put(fadump_kobj);
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   char *buf)
 {
return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
 }
 
-static ssize_t fadump_register_show(struct kobject *kobj,
-   struct kobj_attribute *attr,
-   char *buf)
+static ssize_t registered_show(struct kobject *kobj,
+  struct kobj_attribute *attr,
+  char *buf)
 {
return sprintf(buf, "%d\n", fw_dump.dump_registered);
 }
 
-static ssize_t fadump_register_store(struct kobject *kobj,
-   struct kobj_attribute *attr,
-

[PATCH v6 2/6] sysfs: wrap __compat_only_sysfs_link_entry_to_kobj function to change the symlink name

2019-12-11 Thread Sourabh Jain
The __compat_only_sysfs_link_entry_to_kobj function creates a symlink to a
kobject but doesn't provide an option to change the symlink file name.

This patch adds a wrapper function, compat_only_sysfs_link_entry_to_kobj,
that extends the __compat_only_sysfs_link_entry_to_kobj functionality and
allows the caller to customize the symlink name.

Signed-off-by: Sourabh Jain 
---
 fs/sysfs/group.c  | 28 +---
 include/linux/sysfs.h | 12 
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index d41c21fef138..0993645f0b59 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -424,6 +424,25 @@ EXPORT_SYMBOL_GPL(sysfs_remove_link_from_group);
 int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
  struct kobject *target_kobj,
  const char *target_name)
+{
+   return compat_only_sysfs_link_entry_to_kobj(kobj, target_kobj,
+   target_name, NULL);
+}
+EXPORT_SYMBOL_GPL(__compat_only_sysfs_link_entry_to_kobj);
+
+/**
+ * compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
+ * to a group or an attribute
+ * @kobj:  The kobject containing the group.
+ * @target_kobj:   The target kobject.
+ * @target_name:   The name of the target group or attribute.
+ * @symlink_name:  The name of the symlink file (target_name will be
+ * considered if symlink_name is NULL).
+ */
+int compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+struct kobject *target_kobj,
+const char *target_name,
+const char *symlink_name)
 {
struct kernfs_node *target;
struct kernfs_node *entry;
@@ -448,12 +467,15 @@ int __compat_only_sysfs_link_entry_to_kobj(struct kobject 
*kobj,
return -ENOENT;
}
 
-   link = kernfs_create_link(kobj->sd, target_name, entry);
+   if (!symlink_name)
+   symlink_name = target_name;
+
+   link = kernfs_create_link(kobj->sd, symlink_name, entry);
if (IS_ERR(link) && PTR_ERR(link) == -EEXIST)
-   sysfs_warn_dup(kobj->sd, target_name);
+   sysfs_warn_dup(kobj->sd, symlink_name);
 
kernfs_put(entry);
kernfs_put(target);
return PTR_ERR_OR_ZERO(link);
 }
-EXPORT_SYMBOL_GPL(__compat_only_sysfs_link_entry_to_kobj);
+EXPORT_SYMBOL_GPL(compat_only_sysfs_link_entry_to_kobj);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 5420817ed317..15b195a4529d 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -300,6 +300,10 @@ void sysfs_remove_link_from_group(struct kobject *kobj, 
const char *group_name,
 int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
  struct kobject *target_kobj,
  const char *target_name);
+int compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+struct kobject *target_kobj,
+const char *target_name,
+const char *symlink_name);
 
 void sysfs_notify(struct kobject *kobj, const char *dir, const char *attr);
 
@@ -508,6 +512,14 @@ static inline int __compat_only_sysfs_link_entry_to_kobj(
return 0;
 }
 
+static inline int compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+   struct kobject *target_kobj,
+   const char *target_name,
+   const char *symlink_name)
+{
+   return 0;
+}
+
 static inline void sysfs_notify(struct kobject *kobj, const char *dir,
const char *attr)
 {
-- 
2.17.2



[PATCH v6 0/6] reorganize and add FADump sysfs files

2019-12-11 Thread Sourabh Jain
Currently, the FADump sysfs files are present inside the /sys/kernel
directory. But as the number of FADump sysfs files increases, it is not a
good idea to push all of them into /sys/kernel; it is better to have a
separate directory to keep all the FADump sysfs files.

This patch series reorganizes the FADump sysfs files by moving all the
existing ones from /sys/kernel into a new directory, /sys/kernel/fadump.
Backward compatibility is maintained by adding a symlink for every sysfs
file that has moved to the new location. Also, a new FADump sysfs interface
is added to report the amount of memory reserved by FADump for saving the
crash dump.
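The backward-compatibility scheme can be mimicked in plain shell to see how
the old flat names keep working: the real files live in a fadump/
subdirectory and the old names become symlinks. The paths below are a
scratch mock-up in a temporary directory, not the real sysfs.

```shell
# Mock of the symlink compatibility scheme: old /sys/kernel/fadump_enabled
# becomes a symlink to the new /sys/kernel/fadump/enabled location.
mock=$(mktemp -d)
mkdir "$mock/fadump"
echo 1 > "$mock/fadump/enabled"               # the file at its new location

ln -s fadump/enabled "$mock/fadump_enabled"   # old flat name -> new location

val=$(cat "$mock/fadump_enabled")             # reading the old path still works
echo "$val"                                   # prints: 1
rm -rf "$mock"
```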

Changelog:
v1 -> v5:
  - https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-December/201642.html

v5 -> v6
  - Unregister FADump if fadump group creation fails.
  - Remove fadump_enabled symlink if fadump_registered symlink
creation fails.
  - Removed CREATE_SYMLINK macro.

Sourabh Jain (6):
  Documentation/ABI: add ABI documentation for /sys/kernel/fadump_*
  sysfs: wrap __compat_only_sysfs_link_entry_to_kobj function to change
the symlink name
  powerpc/fadump: reorganize /sys/kernel/fadump_* sysfs files
  powerpc/powernv: move core and fadump_release_opalcore under new
kobject
  Documentation/ABI: mark /sys/kernel/fadump_* sysfs files deprecated
  powerpc/fadump: sysfs for fadump memory reservation

 .../ABI/obsolete/sysfs-kernel-fadump_enabled  |   9 ++
 .../obsolete/sysfs-kernel-fadump_registered   |  10 ++
 .../obsolete/sysfs-kernel-fadump_release_mem  |  10 ++
 .../sysfs-kernel-fadump_release_opalcore  |   9 ++
 Documentation/ABI/testing/sysfs-kernel-fadump |  40 ++
 .../powerpc/firmware-assisted-dump.rst|  28 +++-
 arch/powerpc/kernel/fadump.c  | 127 +-
 arch/powerpc/platforms/powernv/opal-core.c|  55 +---
 fs/sysfs/group.c  |  28 +++-
 include/linux/sysfs.h |  12 ++
 10 files changed, 270 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled
 create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_registered
 create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem
 create mode 100644 
Documentation/ABI/removed/sysfs-kernel-fadump_release_opalcore
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump

-- 
2.17.2



[PATCH v6 1/6] Documentation/ABI: add ABI documentation for /sys/kernel/fadump_*

2019-12-11 Thread Sourabh Jain
Add missing ABI documentation for existing FADump sysfs files.

Signed-off-by: Sourabh Jain 
---
 Documentation/ABI/testing/sysfs-kernel-fadump_enabled | 7 +++
 Documentation/ABI/testing/sysfs-kernel-fadump_registered  | 8 
 Documentation/ABI/testing/sysfs-kernel-fadump_release_mem | 8 
 .../ABI/testing/sysfs-kernel-fadump_release_opalcore  | 7 +++
 4 files changed, 30 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump_enabled
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump_registered
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-fadump_release_mem
 create mode 100644 
Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_enabled 
b/Documentation/ABI/testing/sysfs-kernel-fadump_enabled
new file mode 100644
index ..f73632b1c006
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump_enabled
@@ -0,0 +1,7 @@
+What:  /sys/kernel/fadump_enabled
+Date:  Feb 2012
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read only
+   Primarily used to identify whether the FADump is enabled in
+   the kernel or not.
+User:  Kdump service
diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_registered 
b/Documentation/ABI/testing/sysfs-kernel-fadump_registered
new file mode 100644
index ..dcf925e53f0f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump_registered
@@ -0,0 +1,8 @@
+What:  /sys/kernel/fadump_registered
+Date:  Feb 2012
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read/write
+   Helps to control the dump collect feature from userspace.
+   Setting 1 to this file enables the system to collect the
+   dump and 0 to disable it.
+User:  Kdump service
diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_release_mem 
b/Documentation/ABI/testing/sysfs-kernel-fadump_release_mem
new file mode 100644
index ..9c20d64ab48d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump_release_mem
@@ -0,0 +1,8 @@
+What:  /sys/kernel/fadump_release_mem
+Date:  Feb 2012
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   write only
+   This is a special sysfs file and only available when
+   the system is booted to capture the vmcore using FADump.
+   It is used to release the memory reserved by FADump to
+   save the crash dump.
diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore 
b/Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore
new file mode 100644
index ..53313c1d4e7a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump_release_opalcore
@@ -0,0 +1,7 @@
+What:  /sys/kernel/fadump_release_opalcore
+Date:  Sep 2019
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   write only
+   The sysfs file is available when the system is booted to
+   collect the dump on an OPAL based machine. It is used to
+   release the memory used to collect the opalcore.
-- 
2.17.2



Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Thadeu Lima de Souza Cascardo
On Wed, Dec 11, 2019 at 09:33:53AM -0600, Justin Forbes wrote:
> On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
>  wrote:
> >
> > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  
> > > wrote:
> > > >
> > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > > > Aurelien Jarno  writes:
> > > > > > On powerpc with recent versions of binutils, readelf outputs an 
> > > > > > extra
> > > > > > field when dumping the symbols of an object file. For example:
> > > > > >
> > > > > > 35: 083896 FUNC LOCAL  DEFAULT 
> > > > > > [<localentry>: 8] 1 btf_is_struct
> > > > > >
> > > > > > The extra "[<localentry>: 8]" prevents the GLOBAL_SYM_COUNT 
> > > > > > variable from
> > > > > > being computed correctly and causes the checkabi target to fail.
> > > > > >
> > > > > > Fix that by looking for the symbol name in the last field instead 
> > > > > > of the
> > > > > > 8th one. This way it should also cope with future extra fields.
> > > > > >
> > > > > > Signed-off-by: Aurelien Jarno 
> > > > > > ---
> > > > > >  tools/lib/bpf/Makefile | 4 ++--
> > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > Thanks for fixing that, it's been on my very long list of test 
> > > > > failures
> > > > > for a while.
> > > > >
> > > > > Tested-by: Michael Ellerman 
> > > >
> > > > Looks good & also continues to work on x86. Applied, thanks!
> > >
> > > This actually seems to break horribly on PPC64le with binutils 2.33.1
> > > resulting in:
> > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> > > match with num of versioned symbols in libbpf.so (184). Please make
> > > sure all LIBBPF_API symbols are versioned in libbpf.map.
> > >
> > > This is the only arch that fails, with x86/arm/aarch64/s390 all
> > > building fine.  Reverting this patch allows successful build across
> > > all arches.
> > >
> > > Justin
> >
> > Well, I ended up debugging this same issue and had the same fix as Jarno's 
> > when
> > I noticed his fix was already applied.
> >
> > I just installed a system with the latest binutils, 2.33.1, and it still 
> > breaks
> > without such fix. Can you tell what is the output of the following command 
> > on
> > your system?
> >
> > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | 
> > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ 
> > {print $0}'
> >
> 
> readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
> -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
> !/UND/ {print $0}'
>373: 000141bc  1376 FUNC GLOBAL DEFAULT 1
> libbpf_num_possible_cpus [<localentry>: 8]
>375: 0001869c   176 FUNC GLOBAL DEFAULT 1 btf__free
> [<localentry>: 8]
[...]

This is a patch on binutils carried by Fedora:

https://src.fedoraproject.org/rpms/binutils/c/b8265c46f7ddae23a792ee8306fbaaeacba83bf8

" b8265c Have readelf display extra symbol information at the end of the line. "

It has the following comment:

# FIXME:The proper fix would be to update the scripts that are expecting
#   a fixed output from readelf.  But it seems that some of them are
#   no longer being maintained.

This commit is from 2017; had it been merged into upstream binutils, the
situation right now might be different.

Honestly, it seems the best way out is to filter out the extra information in
the libbpf Makefile.

Does the following patch work for you?


diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 56ce6292071b..e6f99484d7d5 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -145,6 +145,7 @@ PC_FILE := $(addprefix $(OUTPUT),$(PC_FILE))
 
 GLOBAL_SYM_COUNT = $(shell readelf -s --wide $(BPF_IN_SHARED) | \
   cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | \
+  sed 's/\[.*\]//' | \
   awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}' | \
   sort -u | wc -l)
 VERSIONED_SYM_COUNT = $(shell readelf -s --wide $(OUTPUT)libbpf.so | \
@@ -217,6 +218,7 @@ check_abi: $(OUTPUT)libbpf.so
 "versioned in $(VERSION_SCRIPT)." >&2;  \
readelf -s --wide $(OUTPUT)libbpf-in.o | \
cut -d "@" -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' |   \
+   sed 's/\[.*\]//' |   \
awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $$8}'|   \
sort -u > $(OUTPUT)libbpf_global_syms.tmp;   \
readelf -s --wide $(OUTPUT)libbpf.so |   \


Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Aurelien Jarno
On 2019-12-11 09:33, Justin Forbes wrote:
> On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
>  wrote:
> >
> > On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> > > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  
> > > wrote:
> > > >
> > > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > > > Aurelien Jarno  writes:
> > > > > > On powerpc with recent versions of binutils, readelf outputs an 
> > > > > > extra
> > > > > > field when dumping the symbols of an object file. For example:
> > > > > >
> > > > > > 35: 083896 FUNC LOCAL  DEFAULT [<localentry>: 8] 1 btf_is_struct
> > > > > >
> > > > > > The extra "[<localentry>: 8]" prevents the GLOBAL_SYM_COUNT variable to
> > > > > > be computed correctly and causes the checkabi target to fail.
> > > > > >
> > > > > > Fix that by looking for the symbol name in the last field instead 
> > > > > > of the
> > > > > > 8th one. This way it should also cope with future extra fields.
> > > > > >
> > > > > > Signed-off-by: Aurelien Jarno 
> > > > > > ---
> > > > > >  tools/lib/bpf/Makefile | 4 ++--
> > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > Thanks for fixing that, it's been on my very long list of test 
> > > > > failures
> > > > > for a while.
> > > > >
> > > > > Tested-by: Michael Ellerman 
> > > >
> > > > Looks good & also continues to work on x86. Applied, thanks!
> > >
> > > This actually seems to break horribly on PPC64le with binutils 2.33.1
> > > resulting in:
> > > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> > > match with num of versioned symbols in libbpf.so (184). Please make
> > > sure all LIBBPF_API symbols are versioned in libbpf.map.
> > >
> > > This is the only arch that fails, with x86/arm/aarch64/s390 all
> > > building fine.  Reverting this patch allows successful build across
> > > all arches.
> > >
> > > Justin
> >
> > Well, I ended up debugging this same issue and had the same fix as Jarno's 
> > when
> > I noticed his fix was already applied.
> >
> > I just installed a system with the latest binutils, 2.33.1, and it still 
> > breaks
> > without such fix. Can you tell what is the output of the following command 
> > on
> > your system?
> >
> > readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | 
> > sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ 
> > {print $0}'
> >
> 
> readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
> -f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
> !/UND/ {print $0}'
>373: 000141bc  1376 FUNC GLOBAL DEFAULT 1 libbpf_num_possible_cpus [<localentry>: 8]
>375: 0001869c   176 FUNC GLOBAL DEFAULT 1 btf__free [<localentry>: 8]

It seems that in your case the localentry part is added after the symbol
name. That doesn't match what is done in upstream binutils:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=binutils/readelf.c;hb=refs/heads/master#l11485

Which version of binutils are you using? It looks like your version has
been modified to workaround this exact issue.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net


Re: [PATCH] libbpf: fix readelf output parsing on powerpc with recent binutils

2019-12-11 Thread Justin Forbes
On Tue, Dec 10, 2019 at 4:26 PM Thadeu Lima de Souza Cascardo
 wrote:
>
> On Tue, Dec 10, 2019 at 12:58:33PM -0600, Justin Forbes wrote:
> > On Mon, Dec 2, 2019 at 3:37 AM Daniel Borkmann  wrote:
> > >
> > > On Mon, Dec 02, 2019 at 04:53:26PM +1100, Michael Ellerman wrote:
> > > > Aurelien Jarno  writes:
> > > > > On powerpc with recent versions of binutils, readelf outputs an extra
> > > > > field when dumping the symbols of an object file. For example:
> > > > >
> > > > > 35: 083896 FUNC LOCAL  DEFAULT [<localentry>: 8] 1 btf_is_struct
> > > > >
> > > > > The extra "[<localentry>: 8]" prevents the GLOBAL_SYM_COUNT variable to
> > > > > be computed correctly and causes the checkabi target to fail.
> > > > >
> > > > > Fix that by looking for the symbol name in the last field instead of 
> > > > > the
> > > > > 8th one. This way it should also cope with future extra fields.
> > > > >
> > > > > Signed-off-by: Aurelien Jarno 
> > > > > ---
> > > > >  tools/lib/bpf/Makefile | 4 ++--
> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > Thanks for fixing that, it's been on my very long list of test failures
> > > > for a while.
> > > >
> > > > Tested-by: Michael Ellerman 
> > >
> > > Looks good & also continues to work on x86. Applied, thanks!
> >
> > This actually seems to break horribly on PPC64le with binutils 2.33.1
> > resulting in:
> > Warning: Num of global symbols in sharedobjs/libbpf-in.o (32) does NOT
> > match with num of versioned symbols in libbpf.so (184). Please make
> > sure all LIBBPF_API symbols are versioned in libbpf.map.
> >
> > This is the only arch that fails, with x86/arm/aarch64/s390 all
> > building fine.  Reverting this patch allows successful build across
> > all arches.
> >
> > Justin
>
> Well, I ended up debugging this same issue and had the same fix as Jarno's 
> when
> I noticed his fix was already applied.
>
> I just installed a system with the latest binutils, 2.33.1, and it still 
> breaks
> without such fix. Can you tell what is the output of the following command on
> your system?
>
> readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@" -f1 | sed 
> 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ && !/UND/ {print $0}'
>

readelf -s --wide tools/lib/bpf/sharedobjs/libbpf-in.o | cut -d "@"
-f1 | sed 's/_v[0-9]_[0-9]_[0-9].*//' | awk '/GLOBAL/ && /DEFAULT/ &&
!/UND/ {print $0}'
   373: 000141bc  1376 FUNC GLOBAL DEFAULT 1 libbpf_num_possible_cpus [<localentry>: 8]
   375: 0001869c   176 FUNC GLOBAL DEFAULT 1 btf__free [<localentry>: 8]
   377: 0001093c    84 FUNC GLOBAL DEFAULT 1 bpf_object__find_map_by_offset [<localentry>: 8]
   378: 00016288   100 FUNC GLOBAL DEFAULT 1 bpf_prog_get_next_id [<localentry>: 8]
   379: 000103c0   104 FUNC GLOBAL DEFAULT 1 bpf_map__priv [<localentry>: 8]
   380: e158   180 FUNC GLOBAL DEFAULT 1 bpf_object__pin [<localentry>: 8]
   381: 000102f8   200 FUNC GLOBAL DEFAULT 1 bpf_map__set_priv [<localentry>: 8]
   382: 0001874c   380 FUNC GLOBAL DEFAULT 1 btf__new [<localentry>: 8]
   384: 0002238c  1372 FUNC GLOBAL DEFAULT 1 xsk_umem__create
   385: 000106fc   116 FUNC GLOBAL DEFAULT 1 bpf_map__next [<localentry>: 8]
   387: 000162ec   100 FUNC GLOBAL DEFAULT 1 bpf_map_get_next_id [<localentry>: 8]
   389: f574    84 FUNC GLOBAL DEFAULT 1 bpf_program__is_xdp [<localentry>: 8]
   390: 00011e14   392 FUNC GLOBAL DEFAULT 1 bpf_program__attach_tracepoint [<localentry>: 8]
   391: 00016534   196 FUNC GLOBAL DEFAULT 1 bpf_obj_get_info_by_fd [<localentry>: 8]
   392: cf64   324 FUNC GLOBAL DEFAULT 1 bpf_program__unpin_instance [<localentry>: 8]
   393: d818   456 FUNC GLOBAL DEFAULT 1 bpf_map__unpin [<localentry>: 8]
   395: efe0    64 FUNC GLOBAL DEFAULT 1 bpf_program__set_type
   396: 00010e94   696 FUNC GLOBAL DEFAULT 1 bpf_program__attach_perf_event [<localentry>: 8]
   397: 0001a774   136 FUNC GLOBAL DEFAULT 1 btf_ext__reloc_func_info [<localentry>: 8]
   398: 00014bc8   236 FUNC GLOBAL DEFAULT 1 bpf_create_map_name [<localentry>: 8]
   402: 000228e8   160 FUNC GLOBAL DEFAULT 1 xsk_umem__create
   403: 00021f1c    72 FUNC GLOBAL DEFAULT 1 xsk_socket__fd
   404: 0001a8ec   536 FUNC GLOBAL DEFAULT 1 btf__dedup [<localentry>: 8]
   405: eadc   180 FUNC GLOBAL DEFAULT 1 bpf_program__set_priv [<localentry>: 8]
   409: c540   144 FUNC GLOBAL DEFAULT 1 bpf_object__open_file [<localentry>: 8]
   410: 000121a8   416 FUNC GLOBAL DEFAULT 1 bpf_program__attach_trace [<localentry>: 8]
   415: d51c   764 FUNC GLOBAL DEFAULT 1 bpf_map__pin [<localentry>: 8]
   416: 000154d0   212 FUNC GLOBAL DEFAULT 1 bpf_load_program [<localentry>: 8]
   418: 00010810   192 FUNC GLOBAL DEFAULT 1 bpf_object__find_map_by_name [<localentry>: 8]
   420: 00012348   580 FUNC GLOBAL DEFAULT 1 bpf_perf_event_read_simple [<localentry>: 8]
   421: 000191e8   220 FUNC GLOBAL 

[PATCH AUTOSEL 4.4 34/37] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-12-11 Thread Sasha Levin
From: Masahiro Yamada 

[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ]

The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since  defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to  any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
Signed-off-by: Rob Herring 
Signed-off-by: Sasha Levin 
---
 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index 005bf4ff1b4cb..f3ddd4f599e3e 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -1,11 +1,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include <linux/limits.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
-#define INT_MAX ((int)(~0U>>1))
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 0b3db6322c793..5f2cb1c53e151 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -5,6 +5,8 @@
 #include <string.h>
 
 #define INT_MAX ((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index 8850e243c9406..bd0a55821177a 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -6,6 +6,9 @@
 
 #include <asm/byteorder.h>
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.20.1



[PATCH net v2] net/ibmvnic: Fix typo in retry check

2019-12-11 Thread Thomas Falcon
This conditional is missing a bang, with the intent
being to break when the retry count reaches zero.

Fixes: 476d96ca9c ("ibmvnic: Bound waits for device queries")
Suggested-by: Juliet Kim 
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index efb0f10..2d84523 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -184,7 +184,7 @@ static int ibmvnic_wait_for_completion(struct 
ibmvnic_adapter *adapter,
netdev_err(netdev, "Device down!\n");
return -ENODEV;
}
-   if (retry--)
+   if (!retry--)
break;
if (wait_for_completion_timeout(comp_done, div_timeout))
return 0;
-- 
1.8.3.1



[PATCH AUTOSEL 4.4 23/37] powerpc/security: Fix wrong message when RFI Flush is disable

2019-12-11 Thread Sasha Levin
From: "Gustavo L. F. Walbon" 

[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ]

The issue was showing "Mitigation" message via sysfs whatever the
state of "RFI Flush", but it should show "Vulnerable" when it is
disabled.

If you have "L1D private" feature enabled and not "RFI Flush" you are
vulnerable to meltdown attacks.

"RFI Flush" is the key feature to mitigate the meltdown whatever the
"L1D private" state.

SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.

So the message should be as the truth table shows:

  CPU | L1D private | RFI Flush |sysfs
  |-|---|-
   P9 |False|   False   | Vulnerable
   P9 |False|   True| Mitigation: RFI Flush
   P9 |True |   False   | Vulnerable: L1D private per thread
   P9 |True |   True| Mitigation: RFI Flush, L1D private per thread
   P8 |False|   False   | Vulnerable
   P8 |False|   True| Mitigation: RFI Flush

Output before this fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: L1D private per thread

Output after fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Vulnerable: L1D private per thread

Signed-off-by: Gustavo L. F. Walbon 
Signed-off-by: Mauro S. M. Rodrigues 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwal...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index fc5c49046aa7d..45778c83038f8 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -135,26 +135,22 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
thread_priv = security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);
 
-   if (rfi_flush || thread_priv) {
+   if (rfi_flush) {
struct seq_buf s;
seq_buf_init(&s, buf, PAGE_SIZE - 1);
 
-   seq_buf_printf(&s, "Mitigation: ");
-
-   if (rfi_flush)
-   seq_buf_printf(&s, "RFI Flush");
-
-   if (rfi_flush && thread_priv)
-   seq_buf_printf(&s, ", ");
-
+   seq_buf_printf(&s, "Mitigation: RFI Flush");
if (thread_priv)
-   seq_buf_printf(&s, "L1D private per thread");
+   seq_buf_printf(&s, ", L1D private per thread");
 
seq_buf_printf(&s, "\n");
 
return s.len;
}
 
+   if (thread_priv)
+   return sprintf(buf, "Vulnerable: L1D private per thread\n");
+
if (!security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV) &&
!security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR))
return sprintf(buf, "Not affected\n");
-- 
2.20.1



[PATCH AUTOSEL 4.4 22/37] powerpc/pseries/cmm: Implement release() function for sysfs device

2019-12-11 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ]

When unloading the module, one gets
  [ cut here ]
  Device 'cmm0' does not have a release() function, it is broken and must be 
fixed. See Documentation/kobject.txt.
  WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 
.device_release+0xcc/0xf0
  ...

We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).

Signed-off-by: David Hildenbrand 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191031142933.10779-2-da...@redhat.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/cmm.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index fc44ad0475f84..b126ce49ae7bb 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -391,6 +391,10 @@ static struct bus_type cmm_subsys = {
.dev_name = "cmm",
 };
 
+static void cmm_release_device(struct device *dev)
+{
+}
+
 /**
  * cmm_sysfs_register - Register with sysfs
  *
@@ -406,6 +410,7 @@ static int cmm_sysfs_register(struct device *dev)
 
dev->id = 0;
dev->bus = &cmm_subsys;
+   dev->release = cmm_release_device;
 
if ((rc = device_register(dev)))
goto subsys_unregister;
-- 
2.20.1



[PATCH AUTOSEL 4.4 11/37] powerpc/security/book3s64: Report L1TF status in sysfs

2019-12-11 Thread Sasha Levin
From: Anthony Steinhauser 

[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ]

Some PowerPC CPUs are vulnerable to L1TF to the same extent as to
Meltdown. It is also mitigated by flushing the L1D on privilege
transition.

Currently the sysfs gives a false negative on L1TF on CPUs that I
verified to be vulnerable, a Power9 Talos II Boston 004e 1202, PowerNV
T2P9D01.

Signed-off-by: Anthony Steinhauser 
Signed-off-by: Michael Ellerman 
[mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly]
Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhau...@google.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 156cfe6d23b09..fc5c49046aa7d 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -161,6 +161,11 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
return sprintf(buf, "Vulnerable\n");
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+   return cpu_show_meltdown(dev, attr, buf);
+}
 #endif
 
 ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.20.1



[PATCH AUTOSEL 4.4 08/37] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ]

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

  WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 
.arch_local_irq_restore+0x9c/0x130
  ...
  NIP .arch_local_irq_restore+0x9c/0x130
  LR  .rb_start_commit+0x38/0x80
  Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024055932.27940-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9baba9576e998..2bcd0cfb82e0b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -245,7 +245,7 @@ static u64 scan_dispatch_log(u64 stop_tb)
  * Accumulate stolen time by scanning the dispatch trace log.
  * Called on entry from user mode.
  */
-void accumulate_stolen_time(void)
+void notrace accumulate_stolen_time(void)
 {
u64 sst, ust;
 
-- 
2.20.1



[PATCH AUTOSEL 4.9 39/42] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-12-11 Thread Sasha Levin
From: Masahiro Yamada 

[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ]

The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since  defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to  any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
Signed-off-by: Rob Herring 
Signed-off-by: Sasha Levin 
---
 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index 005bf4ff1b4cb..f3ddd4f599e3e 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -1,11 +1,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include <linux/limits.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
-#define INT_MAX ((int)(~0U>>1))
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 0b3db6322c793..5f2cb1c53e151 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -5,6 +5,8 @@
 #include <string.h>
 
 #define INT_MAX ((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index 8850e243c9406..bd0a55821177a 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -6,6 +6,9 @@
 
 #include <asm/byteorder.h>
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.20.1



[PATCH AUTOSEL 4.9 24/42] powerpc/pseries/cmm: Implement release() function for sysfs device

2019-12-11 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ]

When unloading the module, one gets
  [ cut here ]
  Device 'cmm0' does not have a release() function, it is broken and must be 
fixed. See Documentation/kobject.txt.
  WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 
.device_release+0xcc/0xf0
  ...

We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).

Signed-off-by: David Hildenbrand 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191031142933.10779-2-da...@redhat.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/cmm.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index 66e7227469b8c..b5ff5ee3e39cb 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -391,6 +391,10 @@ static struct bus_type cmm_subsys = {
.dev_name = "cmm",
 };
 
+static void cmm_release_device(struct device *dev)
+{
+}
+
 /**
  * cmm_sysfs_register - Register with sysfs
  *
@@ -406,6 +410,7 @@ static int cmm_sysfs_register(struct device *dev)
 
dev->id = 0;
dev->bus = &cmm_subsys;
+   dev->release = cmm_release_device;
 
if ((rc = device_register(dev)))
goto subsys_unregister;
-- 
2.20.1



[PATCH AUTOSEL 4.9 25/42] powerpc/security: Fix wrong message when RFI Flush is disable

2019-12-11 Thread Sasha Levin
From: "Gustavo L. F. Walbon" 

[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ]

The issue was showing "Mitigation" message via sysfs whatever the
state of "RFI Flush", but it should show "Vulnerable" when it is
disabled.

If you have "L1D private" feature enabled and not "RFI Flush" you are
vulnerable to meltdown attacks.

"RFI Flush" is the key feature to mitigate the meltdown whatever the
"L1D private" state.

SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.

So the message should be as the truth table shows:

  CPU | L1D private | RFI Flush |sysfs
  |-|---|-
   P9 |False|   False   | Vulnerable
   P9 |False|   True| Mitigation: RFI Flush
   P9 |True |   False   | Vulnerable: L1D private per thread
   P9 |True |   True| Mitigation: RFI Flush, L1D private per thread
   P8 |False|   False   | Vulnerable
   P8 |False|   True| Mitigation: RFI Flush

Output before this fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: L1D private per thread

Output after fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Vulnerable: L1D private per thread

Signed-off-by: Gustavo L. F. Walbon 
Signed-off-by: Mauro S. M. Rodrigues 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwal...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index db66f25c190c9..ff85fc8001836 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -135,26 +135,22 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
thread_priv = security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);
 
-   if (rfi_flush || thread_priv) {
+   if (rfi_flush) {
struct seq_buf s;
seq_buf_init(&s, buf, PAGE_SIZE - 1);
 
-   seq_buf_printf(&s, "Mitigation: ");
-
-   if (rfi_flush)
-   seq_buf_printf(&s, "RFI Flush");
-
-   if (rfi_flush && thread_priv)
-   seq_buf_printf(&s, ", ");
-
+   seq_buf_printf(&s, "Mitigation: RFI Flush");
if (thread_priv)
-   seq_buf_printf(&s, "L1D private per thread");
+   seq_buf_printf(&s, ", L1D private per thread");
 
seq_buf_printf(&s, "\n");
 
return s.len;
}
 
+   if (thread_priv)
+   return sprintf(buf, "Vulnerable: L1D private per thread\n");
+
if (!security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV) &&
!security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR))
return sprintf(buf, "Not affected\n");
-- 
2.20.1



[PATCH AUTOSEL 4.9 13/42] powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ]

With large memory (8TB and more) hotplug, we can get soft lockup
warnings as below. These were caused by a long loop without any
explicit cond_resched which is a problem for !PREEMPT kernels.

Avoid this using cond_resched() while inserting hash page table
entries. We already do similar cond_resched() in __add_pages(), see
commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to
__add_pages").

  rcu: 3-: (24002 ticks this GP) idle=13e/1/0x4002 
softirq=722/722 fqs=12001
   (t=24003 jiffies g=4285 q=2002)
  NMI backtrace for cpu 3
  CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
  Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
nmi_cpu_backtrace+0x124/0x130
nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
arch_trigger_cpumask_backtrace+0x28/0x3c
rcu_dump_cpu_stacks+0xf8/0x154
rcu_sched_clock_irq+0x878/0xb40
update_process_times+0x48/0x90
tick_sched_handle.isra.16+0x4c/0x80
tick_sched_timer+0x68/0xe0
__hrtimer_run_queues+0x180/0x430
hrtimer_interrupt+0x110/0x300
timer_interrupt+0x108/0x2f0
decrementer_common+0x114/0x120
  --- interrupt: 901 at arch_add_memory+0xc0/0x130
  LR = arch_add_memory+0x74/0x130
memremap_pages+0x494/0x650
devm_memremap_pages+0x3c/0xa0
pmem_attach_disk+0x188/0x750
nvdimm_bus_probe+0xac/0x2c0
really_probe+0x148/0x570
driver_probe_device+0x19c/0x1d0
device_driver_attach+0xcc/0x100
bind_store+0x134/0x1c0
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1a0/0x270
__vfs_write+0x3c/0x70
vfs_write+0xd0/0x260
ksys_write+0xdc/0x130
system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191001084656.31277-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index de1d8cdd29915..2dc1fc445f357 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -300,6 +300,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
if (ret < 0)
break;
 
+   cond_resched();
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
-- 
2.20.1



[PATCH AUTOSEL 4.9 12/42] powerpc/security/book3s64: Report L1TF status in sysfs

2019-12-11 Thread Sasha Levin
From: Anthony Steinhauser 

[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ]

Some PowerPC CPUs are vulnerable to L1TF to the same extent as to
Meltdown. It is also mitigated by flushing the L1D on privilege
transition.

Currently the sysfs gives a false negative on L1TF on CPUs that I
verified to be vulnerable, a Power9 Talos II Boston 004e 1202, PowerNV
T2P9D01.

Signed-off-by: Anthony Steinhauser 
Signed-off-by: Michael Ellerman 
[mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly]
Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhau...@google.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 11fff9669cfdf..db66f25c190c9 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -161,6 +161,11 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
return sprintf(buf, "Vulnerable\n");
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+   return cpu_show_meltdown(dev, attr, buf);
+}
 #endif
 
 ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.20.1



[PATCH AUTOSEL 4.9 09/42] powerpc/pseries: Don't fail hash page table insert for bolted mapping

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ]

If the hypervisor returned H_PTEG_FULL for H_ENTER hcall, retry a hash page 
table
insert by removing a random entry from the group.

After some runtime, it is very well possible to find all the 8 hash page table
entry slot in the hpte group used for mapping. Don't fail a bolted entry insert
in that case. With Storage class memory a user can find this error easily since
a namespace enable/disable is equivalent to memory add/remove.

This results in failures as reported below:

$ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
libndctl: ndctl_dax_enable: dax1.3: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

In kernel log we find the details as below:

Unable to create mapping for hot added memory 
0xc4200600..0xc4200d00: -1
dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for direct-map
address backing the namespace.

We also observe failures such that not all namespaces will be enabled with
ndctl enable-namespace all command.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191024093542.29777-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index bd666287c5eda..de1d8cdd29915 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -289,7 +289,14 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
ret = mmu_hash_ops.hpte_insert(hpteg, vpn, paddr, tprot,
   HPTE_V_BOLTED, psize, psize,
   ssize);
-
+   if (ret == -1) {
+   /* Try to remove a non bolted entry */
+   ret = mmu_hash_ops.hpte_remove(hpteg);
+   if (ret != -1)
+   ret = mmu_hash_ops.hpte_insert(hpteg, vpn, 
paddr, tprot,
+  HPTE_V_BOLTED, 
psize, psize,
+  ssize);
+   }
if (ret < 0)
break;
 
-- 
2.20.1



[PATCH AUTOSEL 4.9 08/42] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ]

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

  WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 
.arch_local_irq_restore+0x9c/0x130
  ...
  NIP .arch_local_irq_restore+0x9c/0x130
  LR  .rb_start_commit+0x38/0x80
  Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024055932.27940-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index ab7b661b6da3a..412ac5d45160b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -257,7 +257,7 @@ static u64 scan_dispatch_log(u64 stop_tb)
  * Accumulate stolen time by scanning the dispatch trace log.
  * Called on entry from user mode.
  */
-void accumulate_stolen_time(void)
+void notrace accumulate_stolen_time(void)
 {
u64 sst, ust;
u8 save_soft_enabled = local_paca->soft_enabled;
-- 
2.20.1



Re: [PATCH] net/ibmvnic: Fix typo in retry check

2019-12-11 Thread Thomas Falcon



On 12/11/19 9:32 AM, Thomas Falcon wrote:

This conditional is missing a bang, with the intent
being to break when the retry count reaches zero.

Fixes: 476d96ca9c ("ibmvnic: Bound waits for device queries")
Suggested-by: Juliet Kim 
Signed-off-by: Thomas Falcon 
---


Excuse me, disregard this patch. I used the wrong email address for 
Juliet. And forgot the intended branch.  I will resend a v2 soon.


Tom



  drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index efb0f10..2d84523 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -184,7 +184,7 @@ static int ibmvnic_wait_for_completion(struct 
ibmvnic_adapter *adapter,
netdev_err(netdev, "Device down!\n");
return -ENODEV;
}
-   if (retry--)
+   if (!retry--)
break;
if (wait_for_completion_timeout(comp_done, div_timeout))
return 0;


[PATCH] net/ibmvnic: Fix typo in retry check

2019-12-11 Thread Thomas Falcon
This conditional is missing a bang, with the intent
being to break when the retry count reaches zero.

Fixes: 476d96ca9c ("ibmvnic: Bound waits for device queries")
Suggested-by: Juliet Kim 
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index efb0f10..2d84523 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -184,7 +184,7 @@ static int ibmvnic_wait_for_completion(struct 
ibmvnic_adapter *adapter,
netdev_err(netdev, "Device down!\n");
return -ENODEV;
}
-   if (retry--)
+   if (!retry--)
break;
if (wait_for_completion_timeout(comp_done, div_timeout))
return 0;
-- 
1.8.3.1



[PATCH AUTOSEL 4.14 54/58] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-12-11 Thread Sasha Levin
From: Masahiro Yamada 

[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ]

The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since <stdint.h> defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to <linux/limits.h> any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
Signed-off-by: Rob Herring 
Signed-off-by: Sasha Levin 
---
 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index b36c0289a308e..6a0f1f524466e 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -2,11 +2,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include <linux/limits.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
-#define INT_MAX			((int)(~0U>>1))
+#define INT32_MAX		S32_MAX
+#define UINT32_MAX		U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 39155d3b2cefa..ac5d3c947e04e 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -6,6 +6,8 @@
 #include 
 
 #define INT_MAX			((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index 1aa707ab19bbf..8b54c591678e1 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -7,6 +7,9 @@
 
 #include <asm/byteorder.h>
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.20.1



[PATCH AUTOSEL 4.14 28/58] powerpc/pseries/cmm: Implement release() function for sysfs device

2019-12-11 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ]

When unloading the module, one gets
  [ cut here ]
  Device 'cmm0' does not have a release() function, it is broken and must be 
fixed. See Documentation/kobject.txt.
  WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 
.device_release+0xcc/0xf0
  ...

We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).

Signed-off-by: David Hildenbrand 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191031142933.10779-2-da...@redhat.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/cmm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index 4ac419c7eb4c9..25224c9e1dc0b 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -425,6 +425,10 @@ static struct bus_type cmm_subsys = {
.dev_name = "cmm",
 };
 
+static void cmm_release_device(struct device *dev)
+{
+}
+
 /**
  * cmm_sysfs_register - Register with sysfs
  *
@@ -440,6 +444,7 @@ static int cmm_sysfs_register(struct device *dev)
 
dev->id = 0;
	dev->bus = &cmm_subsys;
+   dev->release = cmm_release_device;
 
if ((rc = device_register(dev)))
goto subsys_unregister;
-- 
2.20.1



[PATCH AUTOSEL 4.14 29/58] powerpc/security: Fix wrong message when RFI Flush is disable

2019-12-11 Thread Sasha Levin
From: "Gustavo L. F. Walbon" 

[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ]

The issue was showing "Mitigation" message via sysfs whatever the
state of "RFI Flush", but it should show "Vulnerable" when it is
disabled.

If you have "L1D private" feature enabled and not "RFI Flush" you are
vulnerable to meltdown attacks.

"RFI Flush" is the key feature to mitigate the meltdown whatever the
"L1D private" state.

SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.

So the message should be as the truth table shows:

  CPU | L1D private | RFI Flush | sysfs
  ----|-------------|-----------|----------------------------------------------
   P9 |    False    |   False   | Vulnerable
   P9 |    False    |   True    | Mitigation: RFI Flush
   P9 |    True     |   False   | Vulnerable: L1D private per thread
   P9 |    True     |   True    | Mitigation: RFI Flush, L1D private per thread
   P8 |    False    |   False   | Vulnerable
   P8 |    False    |   True    | Mitigation: RFI Flush

Output before this fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: L1D private per thread

Output after fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Vulnerable: L1D private per thread

Signed-off-by: Gustavo L. F. Walbon 
Signed-off-by: Mauro S. M. Rodrigues 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwal...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index fef3f09fc238b..b3f540c9f4109 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -134,26 +134,22 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
thread_priv = security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);
 
-	if (rfi_flush || thread_priv) {
+	if (rfi_flush) {
 		struct seq_buf s;
 		seq_buf_init(&s, buf, PAGE_SIZE - 1);
 
-		seq_buf_printf(&s, "Mitigation: ");
-
-		if (rfi_flush)
-			seq_buf_printf(&s, "RFI Flush");
-
-		if (rfi_flush && thread_priv)
-			seq_buf_printf(&s, ", ");
-
+		seq_buf_printf(&s, "Mitigation: RFI Flush");
 		if (thread_priv)
-			seq_buf_printf(&s, "L1D private per thread");
+			seq_buf_printf(&s, ", L1D private per thread");
 
 		seq_buf_printf(&s, "\n");
 
 		return s.len;
 	}
 
+   if (thread_priv)
+   return sprintf(buf, "Vulnerable: L1D private per thread\n");
+
if (!security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV) &&
!security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR))
return sprintf(buf, "Not affected\n");
-- 
2.20.1



[PATCH AUTOSEL 4.14 15/58] powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ]

With large memory (8TB and more) hotplug, we can get soft lockup
warnings as below. These were caused by a long loop without any
explicit cond_resched which is a problem for !PREEMPT kernels.

Avoid this using cond_resched() while inserting hash page table
entries. We already do similar cond_resched() in __add_pages(), see
commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to
__add_pages").

  rcu: 3-: (24002 ticks this GP) idle=13e/1/0x4002 
softirq=722/722 fqs=12001
   (t=24003 jiffies g=4285 q=2002)
  NMI backtrace for cpu 3
  CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
  Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
nmi_cpu_backtrace+0x124/0x130
nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
arch_trigger_cpumask_backtrace+0x28/0x3c
rcu_dump_cpu_stacks+0xf8/0x154
rcu_sched_clock_irq+0x878/0xb40
update_process_times+0x48/0x90
tick_sched_handle.isra.16+0x4c/0x80
tick_sched_timer+0x68/0xe0
__hrtimer_run_queues+0x180/0x430
hrtimer_interrupt+0x110/0x300
timer_interrupt+0x108/0x2f0
decrementer_common+0x114/0x120
  --- interrupt: 901 at arch_add_memory+0xc0/0x130
  LR = arch_add_memory+0x74/0x130
memremap_pages+0x494/0x650
devm_memremap_pages+0x3c/0xa0
pmem_attach_disk+0x188/0x750
nvdimm_bus_probe+0xac/0x2c0
really_probe+0x148/0x570
driver_probe_device+0x19c/0x1d0
device_driver_attach+0xcc/0x100
bind_store+0x134/0x1c0
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1a0/0x270
__vfs_write+0x3c/0x70
vfs_write+0xd0/0x260
ksys_write+0xdc/0x130
system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191001084656.31277-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index cf1d76e036359..387600ecea60a 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -303,6 +303,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
if (ret < 0)
break;
 
+   cond_resched();
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
-- 
2.20.1



[PATCH AUTOSEL 4.14 14/58] powerpc/security/book3s64: Report L1TF status in sysfs

2019-12-11 Thread Sasha Levin
From: Anthony Steinhauser 

[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ]

Some PowerPC CPUs are vulnerable to L1TF to the same extent as to
Meltdown. It is also mitigated by flushing the L1D on privilege
transition.

Currently the sysfs gives a false negative on L1TF on CPUs that I
verified to be vulnerable, a Power9 Talos II Boston 004e 1202, PowerNV
T2P9D01.

Signed-off-by: Anthony Steinhauser 
Signed-off-by: Michael Ellerman 
[mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly]
Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhau...@google.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index f5d6541bf8c27..fef3f09fc238b 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -160,6 +160,11 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
return sprintf(buf, "Vulnerable\n");
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+   return cpu_show_meltdown(dev, attr, buf);
+}
 #endif
 
 ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.20.1



[PATCH AUTOSEL 4.14 11/58] powerpc/tools: Don't quote $objdump in scripts

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e44ff9ea8f4c8a90c82f7b85bd4f5e497c841960 ]

Some of our scripts are passed $objdump and then call it as
"$objdump". This doesn't work if it contains spaces because we're
using ccache, for example you get errors such as:

  ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No 
such file or directory
  ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache ppc64le-objdump: 
No such file or directory

Fix it by not quoting the string when we expand it, allowing the shell
to do the right thing for us.

Fixes: a71aa05e1416 ("powerpc: Convert relocs_check to a shell script using 
grep")
Fixes: 4ea80652dc75 ("powerpc/64s: Tool to flag direct branches from 
unrelocated interrupt vectors")
Signed-off-by: Michael Ellerman 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024004730.32135-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/tools/relocs_check.sh   | 2 +-
 arch/powerpc/tools/unrel_branch_check.sh | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/tools/relocs_check.sh 
b/arch/powerpc/tools/relocs_check.sh
index ec2d5c835170a..d6c16e7faa387 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -23,7 +23,7 @@ objdump="$1"
 vmlinux="$2"
 
 bad_relocs=$(
-"$objdump" -R "$vmlinux" |
+$objdump -R "$vmlinux" |
# Only look at relocation lines.
	grep -E '\<R_' |
	# These relocations are okay
diff --git a/arch/powerpc/tools/unrel_branch_check.sh 
b/arch/powerpc/tools/unrel_branch_check.sh
--- a/arch/powerpc/tools/unrel_branch_check.sh
+++ b/arch/powerpc/tools/unrel_branch_check.sh
@@ -18,14 +18,14 @@ vmlinux="$2"
 #__end_interrupts should be located within the first 64K
 
 end_intr=0x$(
-"$objdump" -R "$vmlinux" -d --start-address=0xc000000000000000	\
+$objdump -R "$vmlinux" -d --start-address=0xc000000000000000	\
		 --stop-address=0xc000000000010000 |
 grep '\<__end_interrupts>:' |
 awk '{print $1}'
 )
 
 BRANCHES=$(
-"$objdump" -R "$vmlinux" -D --start-address=0xc000000000000000	\
+$objdump -R "$vmlinux" -D --start-address=0xc000000000000000	\
		--stop-address=${end_intr} |
grep -e 
"^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]]*b" |
grep -v '\<__start_initialization_multiplatform>' |
-- 
2.20.1



[PATCH AUTOSEL 4.14 10/58] powerpc/pseries: Don't fail hash page table insert for bolted mapping

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ]

If the hypervisor returned H_PTEG_FULL for H_ENTER hcall, retry a hash page 
table
insert by removing a random entry from the group.

After some runtime, it is very well possible to find all the 8 hash page table
entry slot in the hpte group used for mapping. Don't fail a bolted entry insert
in that case. With Storage class memory a user can find this error easily since
a namespace enable/disable is equivalent to memory add/remove.

This results in failures as reported below:

$ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
libndctl: ndctl_dax_enable: dax1.3: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

In kernel log we find the details as below:

Unable to create mapping for hot added memory 
0xc4200600..0xc4200d00: -1
dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for direct-map
address backing the namespace.

We also observe failures such that not all namespaces will be enabled with
ndctl enable-namespace all command.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191024093542.29777-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 58c14749bb0c1..cf1d76e036359 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -292,7 +292,14 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
ret = mmu_hash_ops.hpte_insert(hpteg, vpn, paddr, tprot,
   HPTE_V_BOLTED, psize, psize,
   ssize);
-
+   if (ret == -1) {
+   /* Try to remove a non bolted entry */
+   ret = mmu_hash_ops.hpte_remove(hpteg);
+   if (ret != -1)
+   ret = mmu_hash_ops.hpte_insert(hpteg, vpn, 
paddr, tprot,
+  HPTE_V_BOLTED, 
psize, psize,
+  ssize);
+   }
if (ret < 0)
break;
 
-- 
2.20.1



[PATCH AUTOSEL 4.14 09/58] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ]

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

  WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 
.arch_local_irq_restore+0x9c/0x130
  ...
  NIP .arch_local_irq_restore+0x9c/0x130
  LR  .rb_start_commit+0x38/0x80
  Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024055932.27940-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7c7c5a16284d2..808f71f9fe615 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -241,7 +241,7 @@ static u64 scan_dispatch_log(u64 stop_tb)
  * Accumulate stolen time by scanning the dispatch trace log.
  * Called on entry from user mode.
  */
-void accumulate_stolen_time(void)
+void notrace accumulate_stolen_time(void)
 {
u64 sst, ust;
u8 save_soft_enabled = local_paca->soft_enabled;
-- 
2.20.1



[PATCH AUTOSEL 4.19 74/79] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-12-11 Thread Sasha Levin
From: Masahiro Yamada 

[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ]

The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since <stdint.h> defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to <linux/limits.h> any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
Signed-off-by: Rob Herring 
Signed-off-by: Sasha Levin 
---
 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index b36c0289a308e..6a0f1f524466e 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -2,11 +2,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include <linux/limits.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
-#define INT_MAX			((int)(~0U>>1))
+#define INT32_MAX		S32_MAX
+#define UINT32_MAX		U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 2abc8e83b95e9..9757d4f6331e7 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -6,6 +6,8 @@
 #include 
 
 #define INT_MAX			((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index edb0f0c309044..1adf54aad2df1 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -7,6 +7,9 @@
 
 #include <asm/byteorder.h>
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.20.1



[PATCH AUTOSEL 4.19 66/79] powerpc: Don't add -mabi= flags when building with Clang

2019-12-11 Thread Sasha Levin
From: Nathan Chancellor 

[ Upstream commit 465bfd9c44dea6b55962b5788a23ac87a467c923 ]

When building pseries_defconfig, building vdso32 errors out:

  error: unknown target ABI 'elfv1'

This happens because -m32 in clang changes the target to 32-bit,
which does not allow the ABI to be changed.

Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a
powerpc64le toolchain") added these flags to fix building big endian
kernels with a little endian GCC.

Clang doesn't need -mabi because the target triple controls the
default value. -mlittle-endian and -mbig-endian manipulate the triple
into either powerpc64-* or powerpc64le-*, which properly sets the
default ABI.

Adding a debug print out in the PPC64TargetInfo constructor after line
383 above shows this:

  $ echo | ./clang -E --target=powerpc64-linux -mbig-endian -o /dev/null -
  Default ABI: elfv1

  $ echo | ./clang -E --target=powerpc64-linux -mlittle-endian -o /dev/null -
  Default ABI: elfv2

  $ echo | ./clang -E --target=powerpc64le-linux -mbig-endian -o /dev/null -
  Default ABI: elfv1

  $ echo | ./clang -E --target=powerpc64le-linux -mlittle-endian -o /dev/null -
  Default ABI: elfv2

Don't specify -mabi when building with clang to avoid the build error
with -m32 and not change any code generation.

-mcall-aixdesc is not an implemented flag in clang so it can be safely
excluded as well, see commit 238abecde8ad ("powerpc: Don't use gcc
specific options on clang").

pseries_defconfig successfully builds after this patch and
powernv_defconfig and ppc44x_defconfig don't regress.

Reviewed-by: Daniel Axtens 
Signed-off-by: Nathan Chancellor 
[mpe: Trim clang links in change log]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191119045712.39633-2-natechancel...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/Makefile | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index dfcb698ec8f3b..e43321f46a3be 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -90,11 +90,13 @@ MULTIPLEWORD:= -mmultiple
 endif
 
 ifdef CONFIG_PPC64
+ifndef CONFIG_CC_IS_CLANG
 cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
 cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call 
cc-option,-mcall-aixdesc)
 aflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
 aflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mabi=elfv2
 endif
+endif
 
 ifneq ($(cc-name),clang)
   cflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mno-strict-align
@@ -134,6 +136,7 @@ endif
 endif
 
 CFLAGS-$(CONFIG_PPC64) := $(call cc-option,-mtraceback=no)
+ifndef CONFIG_CC_IS_CLANG
 ifdef CONFIG_CPU_LITTLE_ENDIAN
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv2,$(call 
cc-option,-mcall-aixdesc))
 AFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv2)
@@ -142,6 +145,7 @@ CFLAGS-$(CONFIG_PPC64)  += $(call cc-option,-mabi=elfv1)
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcall-aixdesc)
 AFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv1)
 endif
+endif
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcmodel=medium,$(call 
cc-option,-mminimal-toc))
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mno-pointers-to-nested-functions)
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 43/79] powerpc/security: Fix wrong message when RFI Flush is disable

2019-12-11 Thread Sasha Levin
From: "Gustavo L. F. Walbon" 

[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ]

The issue was showing "Mitigation" message via sysfs whatever the
state of "RFI Flush", but it should show "Vulnerable" when it is
disabled.

If you have "L1D private" feature enabled and not "RFI Flush" you are
vulnerable to meltdown attacks.

"RFI Flush" is the key feature to mitigate the meltdown whatever the
"L1D private" state.

SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.

So the message should be as the truth table shows:

  CPU | L1D private | RFI Flush | sysfs
  ----|-------------|-----------|----------------------------------------------
   P9 |    False    |   False   | Vulnerable
   P9 |    False    |   True    | Mitigation: RFI Flush
   P9 |    True     |   False   | Vulnerable: L1D private per thread
   P9 |    True     |   True    | Mitigation: RFI Flush, L1D private per thread
   P8 |    False    |   False   | Vulnerable
   P8 |    False    |   True    | Mitigation: RFI Flush

Output before this fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: L1D private per thread

Output after fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Vulnerable: L1D private per thread

Signed-off-by: Gustavo L. F. Walbon 
Signed-off-by: Mauro S. M. Rodrigues 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwal...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index a4354c4f6bc50..6a3dde9587ccb 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -134,26 +134,22 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
thread_priv = security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);
 
-	if (rfi_flush || thread_priv) {
+	if (rfi_flush) {
 		struct seq_buf s;
 		seq_buf_init(&s, buf, PAGE_SIZE - 1);
 
-		seq_buf_printf(&s, "Mitigation: ");
-
-		if (rfi_flush)
-			seq_buf_printf(&s, "RFI Flush");
-
-		if (rfi_flush && thread_priv)
-			seq_buf_printf(&s, ", ");
-
+		seq_buf_printf(&s, "Mitigation: RFI Flush");
 		if (thread_priv)
-			seq_buf_printf(&s, "L1D private per thread");
+			seq_buf_printf(&s, ", L1D private per thread");
 
 		seq_buf_printf(&s, "\n");
 
 		return s.len;
 	}
 
+   if (thread_priv)
+   return sprintf(buf, "Vulnerable: L1D private per thread\n");
+
if (!security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV) &&
!security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR))
return sprintf(buf, "Not affected\n");
-- 
2.20.1



[PATCH AUTOSEL 4.19 39/79] powerpc/pseries/cmm: Implement release() function for sysfs device

2019-12-11 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ]

When unloading the module, one gets
  [ cut here ]
  Device 'cmm0' does not have a release() function, it is broken and must be 
fixed. See Documentation/kobject.txt.
  WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 
.device_release+0xcc/0xf0
  ...

We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).

Signed-off-by: David Hildenbrand 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191031142933.10779-2-da...@redhat.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/cmm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index 25427a48feae3..502ebcc6c3cbe 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -425,6 +425,10 @@ static struct bus_type cmm_subsys = {
.dev_name = "cmm",
 };
 
+static void cmm_release_device(struct device *dev)
+{
+}
+
 /**
  * cmm_sysfs_register - Register with sysfs
  *
@@ -440,6 +444,7 @@ static int cmm_sysfs_register(struct device *dev)
 
dev->id = 0;
	dev->bus = &cmm_subsys;
+   dev->release = cmm_release_device;
 
if ((rc = device_register(dev)))
goto subsys_unregister;
-- 
2.20.1



[PATCH AUTOSEL 4.19 22/79] powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ]

With large memory (8TB and more) hotplug, we can get soft lockup
warnings as below. These were caused by a long loop without any
explicit cond_resched which is a problem for !PREEMPT kernels.

Avoid this using cond_resched() while inserting hash page table
entries. We already do similar cond_resched() in __add_pages(), see
commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to
__add_pages").

  rcu: 3-: (24002 ticks this GP) idle=13e/1/0x4002 
softirq=722/722 fqs=12001
   (t=24003 jiffies g=4285 q=2002)
  NMI backtrace for cpu 3
  CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
  Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
nmi_cpu_backtrace+0x124/0x130
nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
arch_trigger_cpumask_backtrace+0x28/0x3c
rcu_dump_cpu_stacks+0xf8/0x154
rcu_sched_clock_irq+0x878/0xb40
update_process_times+0x48/0x90
tick_sched_handle.isra.16+0x4c/0x80
tick_sched_timer+0x68/0xe0
__hrtimer_run_queues+0x180/0x430
hrtimer_interrupt+0x110/0x300
timer_interrupt+0x108/0x2f0
decrementer_common+0x114/0x120
  --- interrupt: 901 at arch_add_memory+0xc0/0x130
  LR = arch_add_memory+0x74/0x130
memremap_pages+0x494/0x650
devm_memremap_pages+0x3c/0xa0
pmem_attach_disk+0x188/0x750
nvdimm_bus_probe+0xac/0x2c0
really_probe+0x148/0x570
driver_probe_device+0x19c/0x1d0
device_driver_attach+0xcc/0x100
bind_store+0x134/0x1c0
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1a0/0x270
__vfs_write+0x3c/0x70
vfs_write+0xd0/0x260
ksys_write+0xdc/0x130
system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191001084656.31277-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 11b41383e1672..8894c8f300eac 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -307,6 +307,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
if (ret < 0)
break;
 
+   cond_resched();
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
-- 
2.20.1



[PATCH AUTOSEL 4.19 21/79] powerpc/security/book3s64: Report L1TF status in sysfs

2019-12-11 Thread Sasha Levin
From: Anthony Steinhauser 

[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ]

Some PowerPC CPUs are vulnerable to L1TF to the same extent as to
Meltdown. It is also mitigated by flushing the L1D on privilege
transition.

Currently sysfs gives a false negative for L1TF on CPUs that I
verified to be vulnerable: a Power9 Talos II (Boston 004e 1202) and a
PowerNV T2P9D01.

Signed-off-by: Anthony Steinhauser 
Signed-off-by: Michael Ellerman 
[mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly]
Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhau...@google.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index a5c5940d970ab..a4354c4f6bc50 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -160,6 +160,11 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
return sprintf(buf, "Vulnerable\n");
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+   return cpu_show_meltdown(dev, attr, buf);
+}
 #endif
 
 ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.20.1



[PATCH AUTOSEL 4.19 15/79] powerpc/tools: Don't quote $objdump in scripts

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e44ff9ea8f4c8a90c82f7b85bd4f5e497c841960 ]

Some of our scripts are passed $objdump and then call it as
"$objdump". This doesn't work if the value contains spaces, as it does
when we're using ccache; for example, you get errors such as:

  ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No 
such file or directory
  ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache ppc64le-objdump: 
No such file or directory

Fix it by not quoting the string when we expand it, allowing the shell
to do the right thing for us.
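The quoting difference can be demonstrated with any two-word command; here `echo dummy-objdump` stands in for `ccache ppc64le-objdump` as an illustrative value only:

```shell
#!/bin/sh
# Stand-in for objdump="ccache ppc64le-objdump".
objdump="echo dummy-objdump"

# Quoted: the whole string is treated as a single command name,
# which does not exist, so the invocation fails.
if "$objdump" >/dev/null 2>&1; then quoted=ok; else quoted=fail; fi

# Unquoted: the shell word-splits, running `echo` with an argument.
if $objdump >/dev/null 2>&1; then unquoted=ok; else unquoted=fail; fi

echo "quoted=$quoted unquoted=$unquoted"
```

The trade-off is that unquoted expansion also subjects the value to globbing, which is acceptable here because the value is a command line under the caller's control.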

Fixes: a71aa05e1416 ("powerpc: Convert relocs_check to a shell script using 
grep")
Fixes: 4ea80652dc75 ("powerpc/64s: Tool to flag direct branches from 
unrelocated interrupt vectors")
Signed-off-by: Michael Ellerman 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024004730.32135-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/tools/relocs_check.sh   | 2 +-
 arch/powerpc/tools/unrel_branch_check.sh | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/tools/relocs_check.sh 
b/arch/powerpc/tools/relocs_check.sh
index ec2d5c835170a..d6c16e7faa387 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -23,7 +23,7 @@ objdump="$1"
 vmlinux="$2"
 
 bad_relocs=$(
-"$objdump" -R "$vmlinux" |
+$objdump -R "$vmlinux" |
# Only look at relocation lines.
grep -E '\:' |
 awk '{print $1}'
 )
 
 BRANCHES=$(
-"$objdump" -R "$vmlinux" -D --start-address=0xc000 \
+$objdump -R "$vmlinux" -D --start-address=0xc000   \
--stop-address=${end_intr} |
 grep -e 
"^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]]*b" |
 grep -v '\<__start_initialization_multiplatform>' |
-- 
2.20.1



[PATCH AUTOSEL 4.19 14/79] powerpc/pseries: Don't fail hash page table insert for bolted mapping

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ]

If the hypervisor returns H_PTEG_FULL for the H_ENTER hcall, retry the hash
page table insert by removing a random entry from the group.

After some runtime, it is entirely possible that all 8 hash page table entry
slots in the HPTE group used for a mapping are in use. Don't fail a bolted
entry insert in that case. With storage class memory a user can hit this error
easily, since a namespace enable/disable is equivalent to a memory add/remove.

This results in failures as reported below:

$ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
libndctl: ndctl_dax_enable: dax1.3: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

In kernel log we find the details as below:

Unable to create mapping for hot added memory 
0xc4200600..0xc4200d00: -1
dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for the
direct-map address backing the namespace.

We also observe failures such that not all namespaces are enabled by the
'ndctl enable-namespace all' command.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191024093542.29777-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/hash_utils_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b1007e9a31ba7..11b41383e1672 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -296,7 +296,14 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
ret = mmu_hash_ops.hpte_insert(hpteg, vpn, paddr, tprot,
   HPTE_V_BOLTED, psize, psize,
   ssize);
-
+   if (ret == -1) {
+   /* Try to remove a non bolted entry */
+   ret = mmu_hash_ops.hpte_remove(hpteg);
+   if (ret != -1)
+   ret = mmu_hash_ops.hpte_insert(hpteg, vpn, 
paddr, tprot,
+  HPTE_V_BOLTED, 
psize, psize,
+  ssize);
+   }
if (ret < 0)
break;
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 13/79] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ]

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

  WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 
.arch_local_irq_restore+0x9c/0x130
  ...
  NIP .arch_local_irq_restore+0x9c/0x130
  LR  .rb_start_commit+0x38/0x80
  Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024055932.27940-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7707990c4c169..02ae92c224800 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -235,7 +235,7 @@ static u64 scan_dispatch_log(u64 stop_tb)
  * Accumulate stolen time by scanning the dispatch trace log.
  * Called on entry from user mode.
  */
-void accumulate_stolen_time(void)
+void notrace accumulate_stolen_time(void)
 {
u64 sst, ust;
unsigned long save_irq_soft_mask = irq_soft_mask_return();
-- 
2.20.1



[PATCH AUTOSEL 5.4 125/134] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-12-11 Thread Sasha Levin
From: Masahiro Yamada 

[ Upstream commit a8de1304b7df30e3a14f2a8b9709bb4ff31a0385 ]

The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since  defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to  any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
Signed-off-by: Rob Herring 
Signed-off-by: Sasha Levin 
---
 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index b36c0289a308e..6a0f1f524466e 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -2,11 +2,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include 
 #include 
 #include 
 #include 
 
-#define INT_MAX((int)(~0U>>1))
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 2abc8e83b95e9..9757d4f6331e7 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -6,6 +6,8 @@
 #include 
 
 #define INT_MAX((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index edb0f0c309044..1adf54aad2df1 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -7,6 +7,9 @@
 
 #include 
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.20.1



[PATCH AUTOSEL 5.4 111/134] powerpc: Don't add -mabi= flags when building with Clang

2019-12-11 Thread Sasha Levin
From: Nathan Chancellor 

[ Upstream commit 465bfd9c44dea6b55962b5788a23ac87a467c923 ]

When building pseries_defconfig, building vdso32 errors out:

  error: unknown target ABI 'elfv1'

This happens because -m32 in clang changes the target to 32-bit,
which does not allow the ABI to be changed.

Commit 4dc831aa8813 ("powerpc: Fix compiling a BE kernel with a
powerpc64le toolchain") added these flags to fix building big endian
kernels with a little endian GCC.

Clang doesn't need -mabi because the target triple controls the
default value. -mlittle-endian and -mbig-endian manipulate the triple
into either powerpc64-* or powerpc64le-*, which properly sets the
default ABI.

Adding a debug print out in the PPC64TargetInfo constructor after line
383 above shows this:

  $ echo | ./clang -E --target=powerpc64-linux -mbig-endian -o /dev/null -
  Default ABI: elfv1

  $ echo | ./clang -E --target=powerpc64-linux -mlittle-endian -o /dev/null -
  Default ABI: elfv2

  $ echo | ./clang -E --target=powerpc64le-linux -mbig-endian -o /dev/null -
  Default ABI: elfv1

  $ echo | ./clang -E --target=powerpc64le-linux -mlittle-endian -o /dev/null -
  Default ABI: elfv2

Don't specify -mabi when building with clang, to avoid the build error
with -m32; this doesn't change any code generation.

-mcall-aixdesc is not an implemented flag in clang so it can be safely
excluded as well, see commit 238abecde8ad ("powerpc: Don't use gcc
specific options on clang").

pseries_defconfig successfully builds after this patch and
powernv_defconfig and ppc44x_defconfig don't regress.

Reviewed-by: Daniel Axtens 
Signed-off-by: Nathan Chancellor 
[mpe: Trim clang links in change log]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191119045712.39633-2-natechancel...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/Makefile | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 83522c9fc7b66..37ac731a556b8 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -91,11 +91,13 @@ MULTIPLEWORD:= -mmultiple
 endif
 
 ifdef CONFIG_PPC64
+ifndef CONFIG_CC_IS_CLANG
 cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
 cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call 
cc-option,-mcall-aixdesc)
 aflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
 aflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mabi=elfv2
 endif
+endif
 
 ifndef CONFIG_CC_IS_CLANG
   cflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mno-strict-align
@@ -141,6 +143,7 @@ endif
 endif
 
 CFLAGS-$(CONFIG_PPC64) := $(call cc-option,-mtraceback=no)
+ifndef CONFIG_CC_IS_CLANG
 ifdef CONFIG_CPU_LITTLE_ENDIAN
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv2,$(call 
cc-option,-mcall-aixdesc))
 AFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv2)
@@ -149,6 +152,7 @@ CFLAGS-$(CONFIG_PPC64)  += $(call cc-option,-mabi=elfv1)
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcall-aixdesc)
 AFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=elfv1)
 endif
+endif
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcmodel=medium,$(call 
cc-option,-mminimal-toc))
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mno-pointers-to-nested-functions)
 
-- 
2.20.1



[PATCH AUTOSEL 5.4 088/134] powerpc/fixmap: Use __fix_to_virt() instead of fix_to_virt()

2019-12-11 Thread Sasha Levin
From: Christophe Leroy 

[ Upstream commit 77693a5fb57be4606a6024ec8e3076f9499b906b ]

Change __set_fixmap() back to using __fix_to_virt() instead of
fix_to_virt(), otherwise the following happens, apparently because
GCC doesn't see idx as a built-in constant:

  CC  mm/early_ioremap.o
In file included from ./include/linux/kernel.h:11:0,
 from mm/early_ioremap.c:11:
In function ‘fix_to_virt’,
inlined from ‘__set_fixmap’ at ./arch/powerpc/include/asm/fixmap.h:87:2,
inlined from ‘__early_ioremap’ at mm/early_ioremap.c:156:4:
./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_32’ 
declared with attribute error: BUILD_BUG_ON failed: idx >= 
__end_of_fixed_addresses
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^
./include/linux/compiler.h:331:4: note: in definition of macro 
‘__compiletime_assert’
prefix ## suffix();\
^
./include/linux/compiler.h:350:2: note: in expansion of macro 
‘_compiletime_assert’
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^
./include/linux/build_bug.h:39:37: note: in expansion of macro 
‘compiletime_assert’
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
 ^
./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
  ^
./include/asm-generic/fixmap.h:32:2: note: in expansion of macro ‘BUILD_BUG_ON’
  BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
  ^

Signed-off-by: Christophe Leroy 
Fixes: 4cfac2f9c7f1 ("powerpc/mm: Simplify __set_fixmap()")
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/f4984c615f90caa325a68849afeea846850d.1568295907.git.christophe.le...@c-s.fr
Signed-off-by: Sasha Levin 
---
 arch/powerpc/include/asm/fixmap.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/fixmap.h 
b/arch/powerpc/include/asm/fixmap.h
index 0cfc365d814ba..722289a1d000e 100644
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -77,7 +77,12 @@ enum fixed_addresses {
 static inline void __set_fixmap(enum fixed_addresses idx,
phys_addr_t phys, pgprot_t flags)
 {
-   map_kernel_page(fix_to_virt(idx), phys, flags);
+   if (__builtin_constant_p(idx))
+   BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
+   else if (WARN_ON(idx >= __end_of_fixed_addresses))
+   return;
+
+   map_kernel_page(__fix_to_virt(idx), phys, flags);
 }
 
 #endif /* !__ASSEMBLY__ */
-- 
2.20.1



[PATCH AUTOSEL 5.4 071/134] powerpc/book3s/mm: Update Oops message to print the correct translation in use

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit d7e02f7b7991dbe14a2acfb0e53d675cd149001c ]

Avoid confusion when printing an Oops message like the one below:

 Faulting instruction address: 0xc008bdb4
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV

This happens because we never clear the MMU_FTR_HPTE_TABLE feature flag,
even when running with radix translation. It was discussed that the flag
should be treated as an indication of the capability to run hash
translation, and hence should not be cleared even when running in radix
mode. All code paths check radix_enabled() and, if it returns true,
assume we are running with radix translation. Follow the same sequence
when choosing the MMU translation string used in the Oops message.

Signed-off-by: Aneesh Kumar K.V 
Acked-by: Nicholas Piggin 
Reviewed-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20190711145814.17970-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/traps.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 82f43535e6867..014ff0701f245 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -250,15 +250,22 @@ static void oops_end(unsigned long flags, struct pt_regs 
*regs,
 }
 NOKPROBE_SYMBOL(oops_end);
 
+static char *get_mmu_str(void)
+{
+   if (early_radix_enabled())
+   return " MMU=Radix";
+   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
+   return " MMU=Hash";
+   return "";
+}
+
 static int __die(const char *str, struct pt_regs *regs, long err)
 {
printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter);
 
-   printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s%s %s\n",
+   printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s %s\n",
   IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
-  PAGE_SIZE / 1024,
-  early_radix_enabled() ? " MMU=Radix" : "",
-  early_mmu_has_feature(MMU_FTR_HPTE_TABLE) ? " MMU=Hash" : "",
+  PAGE_SIZE / 1024, get_mmu_str(),
   IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",
   IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
   IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
-- 
2.20.1



[PATCH AUTOSEL 5.4 070/134] powerpc/eeh: differentiate duplicate detection message

2019-12-11 Thread Sasha Levin
From: Sam Bobroff 

[ Upstream commit de84ffc3ccbeec3678f95a3d898fc188efa0d9c5 ]

Currently when an EEH error is detected, the system log receives the
same (or almost the same) message twice:

  EEH: PHB#0 failure detected, location: N/A
  EEH: PHB#0 failure detected, location: N/A
or
  EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
  EEH: Frozen PHB#0-PE#0 detected

This looks like a bug, but in fact the messages are from different
functions and mean slightly different things.  So keep both but change
one of the messages slightly, so that it's clear they are different:

  EEH: PHB#0 failure detected, location: N/A
  EEH: Recovering PHB#0, location: N/A
or
  EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
  EEH: Recovering PHB#0-PE#0

Signed-off-by: Sam Bobroff 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/43817cb6e6631b0828b9a6e266f60d1f8ca8eb22.1571288375.git.sbobr...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/eeh_driver.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index d9279d0ee9f54..c031be8d41ffd 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -897,12 +897,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 
/* Log the event */
if (pe->type & EEH_PE_PHB) {
-   pr_err("EEH: PHB#%x failure detected, location: %s\n",
+   pr_err("EEH: Recovering PHB#%x, location: %s\n",
pe->phb->global_number, eeh_pe_loc_get(pe));
} else {
struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
 
-   pr_err("EEH: Frozen PHB#%x-PE#%x detected\n",
+   pr_err("EEH: Recovering PHB#%x-PE#%x\n",
   pe->phb->global_number, pe->addr);
pr_err("EEH: PE location: %s, PHB location: %s\n",
   eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
-- 
2.20.1



[PATCH AUTOSEL 5.4 069/134] powerpc/security: Fix wrong message when RFI Flush is disabled

2019-12-11 Thread Sasha Levin
From: "Gustavo L. F. Walbon" 

[ Upstream commit 4e706af3cd8e1d0503c25332b30cad33c97ed442 ]

The sysfs file was showing a "Mitigation" message regardless of the
state of "RFI Flush", but it should show "Vulnerable" when RFI Flush
is disabled.

If you have the "L1D private" feature enabled but not "RFI Flush", you
are still vulnerable to Meltdown attacks.

"RFI Flush" is the key feature for mitigating Meltdown, whatever the
"L1D private" state.

SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.

So the message should be as the truth table shows:

  CPU | L1D private | RFI Flush |sysfs
  |-|---|-
   P9 |False|   False   | Vulnerable
   P9 |False|   True| Mitigation: RFI Flush
   P9 |True |   False   | Vulnerable: L1D private per thread
   P9 |True |   True| Mitigation: RFI Flush, L1D private per thread
   P8 |False|   False   | Vulnerable
   P8 |False|   True| Mitigation: RFI Flush

Output before this fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: L1D private per thread

Output after fix:
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Mitigation: RFI Flush, L1D private per thread
  # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
  # cat /sys/devices/system/cpu/vulnerabilities/meltdown
  Vulnerable: L1D private per thread

Signed-off-by: Gustavo L. F. Walbon 
Signed-off-by: Mauro S. M. Rodrigues 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwal...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 298a2e3ad6f4c..d341b464f23c6 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -142,26 +142,22 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
thread_priv = security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);
 
-   if (rfi_flush || thread_priv) {
+   if (rfi_flush) {
struct seq_buf s;
seq_buf_init(&s, buf, PAGE_SIZE - 1);
 
-   seq_buf_printf(&s, "Mitigation: ");
-
-   if (rfi_flush)
-   seq_buf_printf(&s, "RFI Flush");
-
-   if (rfi_flush && thread_priv)
-   seq_buf_printf(&s, ", ");
-
+   seq_buf_printf(&s, "Mitigation: RFI Flush");
if (thread_priv)
-   seq_buf_printf(&s, "L1D private per thread");
+   seq_buf_printf(&s, ", L1D private per thread");
 
seq_buf_printf(&s, "\n");
 
return s.len;
}
 
+   if (thread_priv)
+   return sprintf(buf, "Vulnerable: L1D private per thread\n");
+
if (!security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV) &&
!security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR))
return sprintf(buf, "Not affected\n");
-- 
2.20.1



[PATCH AUTOSEL 5.4 065/134] powerpc/pseries/cmm: Implement release() function for sysfs device

2019-12-11 Thread Sasha Levin
From: David Hildenbrand 

[ Upstream commit 7d8212747435c534c8d564fbef4541a463c976ff ]

When unloading the module, one gets
  [ cut here ]
  Device 'cmm0' does not have a release() function, it is broken and must be 
fixed. See Documentation/kobject.txt.
  WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 
.device_release+0xcc/0xf0
  ...

We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).

Signed-off-by: David Hildenbrand 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191031142933.10779-2-da...@redhat.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/cmm.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index b33251d75927b..572651a5c87bb 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -411,6 +411,10 @@ static struct bus_type cmm_subsys = {
.dev_name = "cmm",
 };
 
+static void cmm_release_device(struct device *dev)
+{
+}
+
 /**
  * cmm_sysfs_register - Register with sysfs
  *
@@ -426,6 +430,7 @@ static int cmm_sysfs_register(struct device *dev)
 
dev->id = 0;
dev->bus = &cmm_subsys;
+   dev->release = cmm_release_device;
 
if ((rc = device_register(dev)))
goto subsys_unregister;
-- 
2.20.1



[PATCH AUTOSEL 5.4 040/134] powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 16f6b67cf03cb43db7104acb2ca877bdc2606c92 ]

With large memory (8TB and more) hotplug, we can get soft lockup
warnings as below. These were caused by a long loop without any
explicit cond_resched which is a problem for !PREEMPT kernels.

Avoid this using cond_resched() while inserting hash page table
entries. We already do similar cond_resched() in __add_pages(), see
commit f64ac5e6e306 ("mm, memory_hotplug: add scheduling point to
__add_pages").

  rcu: 3-: (24002 ticks this GP) idle=13e/1/0x4002 
softirq=722/722 fqs=12001
   (t=24003 jiffies g=4285 q=2002)
  NMI backtrace for cpu 3
  CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
  Call Trace:
dump_stack+0xb0/0xf4 (unreliable)
nmi_cpu_backtrace+0x124/0x130
nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
arch_trigger_cpumask_backtrace+0x28/0x3c
rcu_dump_cpu_stacks+0xf8/0x154
rcu_sched_clock_irq+0x878/0xb40
update_process_times+0x48/0x90
tick_sched_handle.isra.16+0x4c/0x80
tick_sched_timer+0x68/0xe0
__hrtimer_run_queues+0x180/0x430
hrtimer_interrupt+0x110/0x300
timer_interrupt+0x108/0x2f0
decrementer_common+0x114/0x120
  --- interrupt: 901 at arch_add_memory+0xc0/0x130
  LR = arch_add_memory+0x74/0x130
memremap_pages+0x494/0x650
devm_memremap_pages+0x3c/0xa0
pmem_attach_disk+0x188/0x750
nvdimm_bus_probe+0xac/0x2c0
really_probe+0x148/0x570
driver_probe_device+0x19c/0x1d0
device_driver_attach+0xcc/0x100
bind_store+0x134/0x1c0
drv_attr_store+0x44/0x60
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1a0/0x270
__vfs_write+0x3c/0x70
vfs_write+0xd0/0x260
ksys_write+0xdc/0x130
system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191001084656.31277-1-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6e5a769ebcb80..83c51a7d7eee6 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -305,6 +305,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
if (ret < 0)
break;
 
+   cond_resched();
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
-- 
2.20.1



[PATCH AUTOSEL 5.4 039/134] powerpc/security/book3s64: Report L1TF status in sysfs

2019-12-11 Thread Sasha Levin
From: Anthony Steinhauser 

[ Upstream commit 8e6b6da91ac9b9ec5a925b6cb13f287a54bd547d ]

Some PowerPC CPUs are vulnerable to L1TF to the same extent as to
Meltdown. It is also mitigated by flushing the L1D on privilege
transition.

Currently sysfs gives a false negative for L1TF on CPUs that I
verified to be vulnerable: a Power9 Talos II (Boston 004e 1202) and a
PowerNV T2P9D01.

Signed-off-by: Anthony Steinhauser 
Signed-off-by: Michael Ellerman 
[mpe: Just have cpu_show_l1tf() call cpu_show_meltdown() directly]
Link: https://lore.kernel.org/r/20191029190759.84821-1-asteinhau...@google.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/security.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index bd91dceb70105..298a2e3ad6f4c 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -168,6 +168,11 @@ ssize_t cpu_show_meltdown(struct device *dev, struct 
device_attribute *attr, cha
 
return sprintf(buf, "Vulnerable\n");
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char 
*buf)
+{
+   return cpu_show_meltdown(dev, attr, buf);
+}
 #endif
 
 ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, 
char *buf)
-- 
2.20.1



[PATCH AUTOSEL 5.4 027/134] powerpc/tools: Don't quote $objdump in scripts

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e44ff9ea8f4c8a90c82f7b85bd4f5e497c841960 ]

Some of our scripts are passed $objdump and then call it as
"$objdump". This doesn't work if the value contains spaces, as it does
when we're using ccache; for example, you get errors such as:

  ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No 
such file or directory
  ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache ppc64le-objdump: 
No such file or directory

Fix it by not quoting the string when we expand it, allowing the shell
to do the right thing for us.

Fixes: a71aa05e1416 ("powerpc: Convert relocs_check to a shell script using 
grep")
Fixes: 4ea80652dc75 ("powerpc/64s: Tool to flag direct branches from 
unrelocated interrupt vectors")
Signed-off-by: Michael Ellerman 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024004730.32135-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/tools/relocs_check.sh   | 2 +-
 arch/powerpc/tools/unrel_branch_check.sh | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/tools/relocs_check.sh 
b/arch/powerpc/tools/relocs_check.sh
index 2b4e959caa365..7b9fe0a567cf3 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -20,7 +20,7 @@ objdump="$1"
 vmlinux="$2"
 
 bad_relocs=$(
-"$objdump" -R "$vmlinux" |
+$objdump -R "$vmlinux" |
# Only look at relocation lines.
grep -E '\:' |
 awk '{print $1}'
 )
 
 BRANCHES=$(
-"$objdump" -R "$vmlinux" -D --start-address=0xc000 \
+$objdump -R "$vmlinux" -D --start-address=0xc000   \
--stop-address=${end_intr} |
 grep -e 
"^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]]*b" |
 grep -v '\<__start_initialization_multiplatform>' |
-- 
2.20.1



[PATCH AUTOSEL 5.4 024/134] powerpc/pseries: Don't fail hash page table insert for bolted mapping

2019-12-11 Thread Sasha Levin
From: "Aneesh Kumar K.V" 

[ Upstream commit 75838a3290cd4ebbd1f567f310ba04b6ef017ce4 ]

If the hypervisor returns H_PTEG_FULL for the H_ENTER hcall, retry the hash
page table insert by removing a random entry from the group.

After some runtime, it is entirely possible that all 8 hash page table entry
slots in the HPTE group used for a mapping are in use. Don't fail a bolted
entry insert in that case. With storage class memory a user can hit this error
easily, since a namespace enable/disable is equivalent to a memory add/remove.

This results in failures as reported below:

$ ndctl create-namespace -r region1 -t pmem -m devdax -a 65536 -s 100M
libndctl: ndctl_dax_enable: dax1.3: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

In kernel log we find the details as below:

Unable to create mapping for hot added memory 
0xc4200600..0xc4200d00: -1
dax_pmem: probe of dax1.3 failed with error -14

This indicates that we failed to create a bolted hash table entry for the
direct-map address backing the namespace.

We also observe failures such that not all namespaces are enabled by the
'ndctl enable-namespace all' command.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20191024093542.29777-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e8..6e5a769ebcb80 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -294,7 +294,14 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
ret = mmu_hash_ops.hpte_insert(hpteg, vpn, paddr, tprot,
   HPTE_V_BOLTED, psize, psize,
   ssize);
-
+   if (ret == -1) {
+   /* Try to remove a non bolted entry */
+   ret = mmu_hash_ops.hpte_remove(hpteg);
+   if (ret != -1)
+   ret = mmu_hash_ops.hpte_insert(hpteg, vpn, paddr, tprot,
+  HPTE_V_BOLTED, psize, psize,
+  ssize);
+   }
if (ret < 0)
break;
 
-- 
2.20.1



[PATCH AUTOSEL 5.4 023/134] powerpc/pseries: Mark accumulate_stolen_time() as notrace

2019-12-11 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit eb8e20f89093b64f48975c74ccb114e6775cee22 ]

accumulate_stolen_time() is called prior to interrupt state being
reconciled, which can trip the warning in arch_local_irq_restore():

  WARNING: CPU: 5 PID: 1017 at arch/powerpc/kernel/irq.c:258 .arch_local_irq_restore+0x9c/0x130
  ...
  NIP .arch_local_irq_restore+0x9c/0x130
  LR  .rb_start_commit+0x38/0x80
  Call Trace:
.ring_buffer_lock_reserve+0xe4/0x620
.trace_function+0x44/0x210
.function_trace_call+0x148/0x170
.ftrace_ops_no_ops+0x180/0x1d0
ftrace_call+0x4/0x8
.accumulate_stolen_time+0x1c/0xb0
decrementer_common+0x124/0x160

For now just mark it as notrace. We may change the ordering to call it
after interrupt state has been reconciled, but that is a larger
change.

Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191024055932.27940-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 694522308cd51..968ae97382b4e 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -232,7 +232,7 @@ static u64 scan_dispatch_log(u64 stop_tb)
  * Accumulate stolen time by scanning the dispatch trace log.
  * Called on entry from user mode.
  */
-void accumulate_stolen_time(void)
+void notrace accumulate_stolen_time(void)
 {
u64 sst, ust;
unsigned long save_irq_soft_mask = irq_soft_mask_return();
-- 
2.20.1



[PATCH AUTOSEL 5.4 010/134] powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set}

2019-12-11 Thread Sasha Levin
From: Vaibhav Jain 

[ Upstream commit 612ee81b9461475b5a5612c2e8d71559dd3c7920 ]

A validation check meant to prevent out-of-bounds reads/writes inside
papr_scm_meta_{get,set}() is off by one, which prevents reads and
writes to the last byte of the label area.

This bug manifests as a failure to probe a dimm when libnvdimm is
unable to read the entire config-area as advertised by
ND_CMD_GET_CONFIG_SIZE. This usually happens when there is a large
number of namespaces created in the region backed by the dimm and the
label-index spans the max possible config-area. An error of the form
below is usually reported in the kernel logs:

[  255.293912] nvdimm: probe of nmem0 failed with error -22

The patch fixes these validation checks, thereby letting libnvdimm
access the entire config-area.
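The fix is easiest to see as a pair of bounds checks over a half-open range (the `access_ok_*` helpers below are illustrative, not kernel functions): with `size` valid bytes at offsets `[0, size - 1]`, an access of `len` bytes at `off` is in bounds exactly when `off + len <= size` — the old `>=` comparison wrongly rejects any access that touches the final byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* the pre-fix check: rejects any access touching the last valid byte */
static bool access_ok_old(uint64_t off, uint64_t len, uint64_t size)
{
	return !((off + len) >= size);
}

/* the fixed check: the half-open range [off, off + len) must fit in size */
static bool access_ok_new(uint64_t off, uint64_t len, uint64_t size)
{
	return !((off + len) > size);
}
```

With a 128-byte label area, reading 1 byte at offset 127 is legal, but only the fixed check permits it.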

Fixes: 53e80bd042773 ('powerpc/nvdimm: Add support for multibyte read/write for metadata')
Signed-off-by: Vaibhav Jain 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20190927062002.3169-1-vaib...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 61883291defc3..ee07d0718bf1a 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -152,7 +152,7 @@ static int papr_scm_meta_get(struct papr_scm_priv *p,
int len, read;
int64_t ret;
 
-   if ((hdr->in_offset + hdr->in_length) >= p->metadata_size)
+   if ((hdr->in_offset + hdr->in_length) > p->metadata_size)
return -EINVAL;
 
for (len = hdr->in_length; len; len -= read) {
@@ -206,7 +206,7 @@ static int papr_scm_meta_set(struct papr_scm_priv *p,
__be64 data_be;
int64_t ret;
 
-   if ((hdr->in_offset + hdr->in_length) >= p->metadata_size)
+   if ((hdr->in_offset + hdr->in_length) > p->metadata_size)
return -EINVAL;
 
for (len = hdr->in_length; len; len -= wrote) {
-- 
2.20.1



Re: [PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-11 Thread Waiman Long
On 12/5/19 3:32 AM, Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), scheduler avoids preempted vCPUs to schedule tasks on wakeup.
> This leads to a wrong choice of CPU, which in turn leads to larger wakeup
> latencies. Eventually, it leads to performance regression in latency
> sensitive benchmarks like soltp, schbench etc.
>
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
>
> Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
> right of first use, since the LPAR is supposed to own the vCPU.
>
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):   16
> NUMA node(s):2
> Model:   2.2 (pvr 004e 0202)
> Model name:  POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:512K
> L3 cache:10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
>
>   # perf stat -a -r 5 ./schbench
> v5.4 v5.4 + patch
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.0000th: 45   50.0000th: 39
>   75.0000th: 62   75.0000th: 53
>   90.0000th: 71   90.0000th: 67
>   95.0000th: 77   95.0000th: 76
>   *99.0000th: 91  *99.0000th: 89
>   99.5000th: 707  99.5000th: 93
>   99.9000th: 6920 99.9000th: 118
>   min=0, max=10048min=0, max=211
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.0000th: 45   50.0000th: 34
>   75.0000th: 61   75.0000th: 45
>   90.0000th: 72   90.0000th: 53
>   95.0000th: 79   95.0000th: 56
>   *99.0000th: 691 *99.0000th: 61
>   99.5000th: 3972 99.5000th: 63
>   99.9000th: 8368 99.9000th: 78
>   min=0, max=16606min=0, max=228
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.0000th: 45   50.0000th: 34
>   75.0000th: 61   75.0000th: 45
>   90.0000th: 71   90.0000th: 53
>   95.0000th: 77   95.0000th: 57
>   *99.0000th: 106 *99.0000th: 63
>   99.5000th: 2364 99.5000th: 68
>   99.9000th: 7480 99.9000th: 100
>   min=0, max=10001min=0, max=134
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.0000th: 45   50.0000th: 34
>   75.0000th: 62   75.0000th: 46
>   90.0000th: 72   90.0000th: 53
>   95.0000th: 78   95.0000th: 56
>   *99.0000th: 93  *99.0000th: 61
>   99.5000th: 108  99.5000th: 64
>   99.9000th: 6792 99.9000th: 85
>   min=0, max=17681min=0, max=121
> Latency percentiles (usec)   Latency percentiles (usec)
>   50.0000th: 46   50.0000th: 33
>   75.0000th: 62   75.0000th: 44
>   90.0000th: 73   90.0000th: 51
>   95.0000th: 79   95.0000th: 54
>   *99.0000th: 113 *99.0000th: 61
>   99.5000th: 2724 99.5000th: 64
>   99.9000th: 6184 99.9000th: 82
>   min=0, max=9887 min=0, max=121
>
>  Performance counter stats for 'system wide' (5 runs):
>
> context-switches  43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
> cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
> page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )
>
> Waiman Long suggested using static_keys.

Since this patch is fixing a performance regression, maybe we should add:

Fixes: 41946c86876e ("locking/core, powerpc: Implement vcpu_is_preempted(cpu)")

Cheers,
Longman




Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-11 Thread Rob Herring
On Tue, Dec 10, 2019 at 2:17 AM Frank Rowand  wrote:
>
> On 12/9/19 7:51 PM, Rob Herring wrote:
> > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
> >  wrote:
> >>
> >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> >>> Is there a memory usage issue for the systems that led to this thread?
> >>
> >> No, no memory issue led to this thread. I was just testing my patch and
> >> I assumed that I did something wrong in the counting/lock drop/lock
> >> acquire/allocate path because the array was hardly used. So I started to
> >> look deeper…
> >> Once I figured out everything was fine, I was curious if everyone is
> >> aware of the different phandle creation by dtc vs POWER. And I posted
> >> the mail in the thread.
> >> Once you confirmed that everything is "known / not an issue" I was ready
> >> to take off [0].
> >>
> >> Later more replies came in such as one mail [1] from Rob describing the
> >> original reason with 814 phandles. _Here_ I was just surprised that 1024
> >> were used over 64 entries for a benefit of 60ms. I understand that this
> >> is low concern for you because that memory is released if modules are
> >> not enabled. I usually see that module support is left enabled.
> >>
> >> However, Rob suggested / asked about the fixed size array (this is how I
> >> understood it):
> >> |And yes, as mentioned earlier I don't like the complexity. I didn't
> >> |from the start and I'm still of the opinion we should have a
> >> |fixed or 1 time sized true cache (i.e. smaller than total # of
> >> |phandles). That would solve the RT memory allocation and locking issue
> >> |too.
> >>
> >> so I attempted to ask if we should have the fixed size array maybe
> >> with the hash_32() instead the mask. This would make my other patch
> >> obsolete because the fixed size array should not have a RT issue. The
> >> hash_32() part here would address the POWER issue where the cache is
> >> currently not used efficiently.
> >>
> >> If you want instead to keep things as-is then this is okay from my side.
> >> If you want to keep this cache off on POWER then I could contribute a
> >> patch doing so.
> >
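The difference between mask indexing and hash_32() indexing that Sebastian describes can be sketched like this (a toy model: the 64-entry size and the golden-ratio constant follow include/linux/hash.h, but the helper names are illustrative). dtc allocates dense phandles 1..n, for which a simple mask works well; SLOF hands out sparse, address-like phandles whose low bits collide, so a multiplicative hash spreads them across the cache:

```c
#include <assert.h>
#include <stdint.h>

#define PHANDLE_CACHE_SZ 64	/* fixed-size cache, as proposed */

/* mask indexing: fine for dtc-style dense phandles 1..n */
static unsigned int cache_idx_mask(uint32_t phandle)
{
	return phandle & (PHANDLE_CACHE_SZ - 1);
}

/* hash_32()-style indexing: multiply by GOLDEN_RATIO_32, keep top 6 bits */
static unsigned int cache_idx_hash(uint32_t phandle)
{
	return (phandle * 0x61C88647u) >> (32 - 6);	/* 6 = log2(64) */
}
```

SLOF-style phandles such as 0x1000 and 0x2000 alias to the same mask slot but hash to different slots, which is the efficiency gap discussed above.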
> > It turns out there's actually a bug in the current implementation. If
> > we have multiple phandles with the same mask, then we leak node
> > references if we miss in the cache and re-assign the cache entry.
>
> Aaargh.  Patch sent.
>
> > Easily fixed I suppose, but holding a ref count for a cached entry
> > seems wrong. That means we never have a ref count of 0 on every node
> > with a phandle.
>
> It will go to zero when the cache is freed.
>
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

Yes, as part of having entries for every phandle we release and
realloc when number of phandles changes. If the size is fixed, then we
can stop doing that. We only need to remove entries in
of_detach_node() as that should always happen before nodes are
removed, right?

Rob


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-11 Thread Daniel Axtens
Hi Balbir,

 +Discontiguous memory can occur when you have a machine with memory spread
 +across multiple nodes. For example, on a Talos II with 64GB of RAM:
 +
 + - 32GB runs from 0x0 to 0x0000_0008_0000_0000,
 + - then there's a gap,
 + - then the final 32GB runs from 0x0000_2000_0000_0000 to 0x0000_2008_0000_0000
 +
 +This can create _significant_ issues:
 +
 + - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
 +   assume that ran from 0x0 to 0x0000_0010_0000_0000. We'd then reserve the
 +   last 1/8th - 0x0000_000e_0000_0000 to 0x0000_0010_0000_0000 - as the shadow
 +   region. But when we try to access any of that, we'll try to access pages
 +   that are not physically present.
 +
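The arithmetic behind the quoted example can be sketched with generic KASAN's linear mem-to-shadow mapping (one shadow byte covers 8 bytes of memory, so the shadow is 1/8th of the address range; the helper below is a simplification of the real macro):

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3	/* 1 shadow byte per 8 bytes */

/* generic KASAN: shadow address = (addr >> 3) + KASAN_SHADOW_OFFSET */
static uint64_t mem_to_shadow(uint64_t addr, uint64_t shadow_offset)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset;
}
```

With the shadow placed in the last 1/8th of an assumed-flat 64GB map (offset 0xe_0000_0000), the shadow for address 0 lands at 0xe_0000_0000 and the shadow for the 64GB boundary lands exactly at 0x10_0000_0000 — but the shadow for a second node at 0x2000_0000_0000 lands far beyond any real RAM, which is the failure mode described above.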
>>>
>>> If we reserved memory for KASAN from each node (discontig region), we might 
>>> survive
>>> this no? May be we need NUMA aware KASAN? That might be a generic change, 
>>> just thinking
>>> out loud.
>> 
>> The challenge is that - AIUI - in inline instrumentation, the compiler
>> doesn't generate calls to things like __asan_loadN and
>> __asan_storeN. Instead it uses -fasan-shadow-offset to compute the
>> checks, and only calls the __asan_report* family of functions if it
>> detects an issue. This also matches what I can observe with objdump
>> across outline and inline instrumentation settings.
>> 
>> This means that for this sort of thing to work we would need to either
>> drop back to out-of-line calls, or teach the compiler how to use a
>> nonlinear, NUMA aware mem-to-shadow mapping.
>
> Yes, out of line is expensive, but seems to work well for all use cases.

I'm not sure this is true. Looking at scripts/Makefile.kasan, allocas,
stacks and globals will only be instrumented if you can provide
KASAN_SHADOW_OFFSET. In the case you're proposing, we can't provide a
static offset. I _think_ this is a compiler limitation, where some of
those instrumentations only work/make sense with a static offset, but
perhaps that's not right? Dmitry and Andrey, can you shed some light on
this?

Also, as it currently stands, the speed difference between inline and
outline is approximately 2x, and given that we'd like to run this
full-time in syzkaller I think there is value in trading off speed for
some limitations.

> BTW, the current set of patches just hang if I try to make the default
> mode as out of line

Do you have CONFIG_RELOCATABLE?

I've tested the following process:

# 1) apply patches on a fresh linux-next
# 2) output dir
mkdir ../out-3s-kasan

# 3) merge in the relevant config snippets
cat > kasan.config << EOF
CONFIG_EXPERT=y
CONFIG_LD_HEAD_STUB_CATCH=y

CONFIG_RELOCATABLE=y

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_OUTLINE=y

CONFIG_PHYS_MEM_SIZE_FOR_KASAN=2048
EOF

ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- \
    ./scripts/kconfig/merge_config.sh -O ../out-3s-kasan/ \
    arch/powerpc/configs/pseries_defconfig arch/powerpc/configs/le.config \
    kasan.config

# 4) make
make O=../out-3s-kasan/ ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- -j8 vmlinux

# 5) test
qemu-system-ppc64 -m 2G -M pseries -cpu power9 -kernel ../out-3s-kasan/vmlinux \
    -nographic -chardev stdio,id=charserial0,mux=on \
    -device spapr-vty,chardev=charserial0,reg=0x3000 \
    -initrd ./rootfs-le.cpio.xz \
    -mon chardev=charserial0,mode=readline -nodefaults -smp 4

This boots fine for me under TCG and KVM, with both CONFIG_KASAN_OUTLINE
and CONFIG_KASAN_INLINE. You do still need to supply the size even in
outline mode - I don't have code that switches over to vmalloced space
when in outline mode. I will clarify the docs on that.


 +  if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_PPC_BOOK3S_64)) {
 +  kasan_memory_size =
 +  ((phys_addr_t)CONFIG_PHYS_MEM_SIZE_FOR_KASAN << 20);
 +
 +  if (top_phys_addr < kasan_memory_size) {
 +  /*
 +   * We are doomed. Attempts to call e.g. panic() are
 +   * likely to fail because they call out into
 +   * instrumented code, which will almost certainly
 +   * access memory beyond the end of physical
 +   * memory. Hang here so that at least the NIP points
 +   * somewhere that will help you debug it if you look at
 +   * it in qemu.
 +   */
 +  while (true)
 +  ;
>>>
>>> Again with the right hooks in check_memory_region_inline() these are 
>>> recoverable,
>>> or so I think
>> 
>> So unless I misunderstand the circumstances in which
>> check_memory_region_inline is used, this isn't going to help with inline
>> instrumentation.
>> 
>
> Yes, I understand. Same as above?

Yes.

>>> NOTE: I can't test any of these, well may be with qemu, let me see if I can 
>>> spin
>>> the series and provide more feedback
>> 
>> It's 

MODPOST warnings on ppc64le

2019-12-11 Thread Roman Bolshakov
Hello,

I'm seeing a set of build warnings on the 5.5-rc1 kernel:

WARNING: vmlinux.o(.text+0x31e4): Section mismatch in reference from the variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x33c8): Section mismatch in reference from the variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

There was a patch sent a year ago that addresses the issue:
https://patchwork.ozlabs.org/patch/895442/

What's the fate of it? Could it be merged?

Thank you,
Roman


Re: [PATCH v9 23/25] mm/gup: track FOLL_PIN pages

2019-12-11 Thread Jan Kara
On Tue 10-12-19 18:53:16, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
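The FOLL_PIN accounting described above can be sketched as a toy refcount model (the `toy_*` helpers are illustrative; the real code operates on page->_refcount and, in this series, treats a refcount of at least GUP_PIN_COUNTING_BIAS as "dma-pinned"):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * FOLL_PIN pins share the refcount field with ordinary FOLL_GET
 * references, but each pin is counted in units of
 * GUP_PIN_COUNTING_BIAS. A page is considered dma-pinned when at
 * least one bias-sized chunk is present.
 */
#define GUP_PIN_COUNTING_BIAS 1024

static int refcount;	/* stand-in for page->_refcount */

static void toy_get(void)	{ refcount += 1; }
static void toy_pin(void)	{ refcount += GUP_PIN_COUNTING_BIAS; }
static void toy_unpin(void)	{ refcount -= GUP_PIN_COUNTING_BIAS; }

static bool toy_page_dma_pinned(void)
{
	return refcount >= GUP_PIN_COUNTING_BIAS;
}
```

The trade-off, as the series notes, is that many plain FOLL_GET references could in principle masquerade as a pin; the large bias makes that false positive rare in practice.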

The patch looks mostly good to me now. Just a few smaller comments below.

> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Reviewed-by: Jan Kara 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 

I think you inherited here the Reviewed-by tags from the "add flags" patch
you've merged into this one but that's not really fair since this patch
does much more... In particular I didn't give my Reviewed-by tag for this
patch yet.

> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * This has a default assumption of "use FOLL_GET behavior, if FOLL_PIN is not
> + * set".
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + */
> +static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> +   int refs,
> +   unsigned int flags)
> +{
> + if (flags & FOLL_PIN)
> + return try_pin_compound_head(page, refs);
> +
> + return try_get_compound_head(page, refs);
> +}

I somewhat wonder about the asymmetry of try_grab_compound_head() vs
try_grab_page() in the treatment of 'flags'. How costly would it be to make
them symmetric (i.e., either set FOLL_GET for try_grab_compound_head()
callers or make sure one of FOLL_GET, FOLL_PIN is set for try_grab_page())?

Because this difference looks like a subtle catch in the long run...

> +
> +/**
> + * try_grab_page() - elevate a page's refcount by a flag-dependent amount
> + *
> + * This might not do anything at all, depending on the flags argument.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * @page:pointer to page to be grabbed
> + * @flags:   gup flags: these are the FOLL_* flag values.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
> + * time. (That's true throughout the get_user_pages*() and pin_user_pages*()
> + * APIs.) Cases:
> + *
> + *   FOLL_GET: page's refcount will be incremented by 1.
> + *   FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
> + *
> + * Return: true for success, or if no action was required (if neither FOLL_PIN
> + * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
> + * FOLL_PIN was set, but the page could not be grabbed.
> + */
> +bool __must_check try_grab_page(struct page *page, unsigned int flags)
> +{
> + if (flags & FOLL_GET)
> + return try_get_page(page);
> + else if (flags & FOLL_PIN) {
> + page = compound_head(page);
> + WARN_ON_ONCE(flags & FOLL_GET);
> +
> + if (WARN_ON_ONCE(page_ref_zero_or_close_to_bias_overflow(page)))
> + return false;
> +
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
> + }
> +
> + return true;
> +}

...

> @@ -1522,8 +1536,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  skip_mlock:
>   page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>   VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> - if (flags & FOLL_GET)
> - get_page(page);
> + if (!try_grab_page(page, flags))
> + page = ERR_PTR(-EFAULT);

I think you need to also move the try_grab_page() earlier in the function.
At this point the page may be marked as mlocked and you'd need to undo that
in case try_grab_page() fails.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ac65bb5e38ac..0aab6fe0072f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4356,7 +4356,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  same_page:
> 

Re: [PATCH v9 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-11 Thread Jan Kara
On Tue 10-12-19 18:53:13, John Hubbard wrote:
> 1. Convert from get_user_pages() to pin_user_pages().
> 
> 2. As required by pin_user_pages(), release these pages via
> put_user_page().
> 
> Cc: Jan Kara 
> Signed-off-by: John Hubbard 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

I'd just note that mm_iommu_do_alloc() has a pre-existing bug that the last
jump to 'free_exit' (at line 157) happens already after converting page
pointers to physical addresses so put_page() calls there will just crash.
But that's completely unrelated to your change. I'll send a fix separately.

Honza

> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..a86547822034 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>   for (entry = 0; entry < entries; entry += chunk) {
>   unsigned long n = min(entries - entry, chunk);
>  
> - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
> + ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
>   FOLL_WRITE | FOLL_LONGTERM,
>   mem->hpages + entry, NULL);
>   if (ret == n) {
> @@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
>   return 0;
>  
>  free_exit:
> - /* free the reference taken */
> - for (i = 0; i < pinned; i++)
> - put_page(mem->hpages[i]);
> + /* free the references taken */
> + put_user_pages(mem->hpages, pinned);
>  
>   vfree(mem->hpas);
>   kfree(mem);
> @@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
>   if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
>   SetPageDirty(page);
>  
> - put_page(page);
> + put_user_page(page);
> +
>   mem->hpas[i] = 0;
>   }
>  }
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-11 Thread Jan Kara
The last jump to free_exit in mm_iommu_do_alloc() happens after page
pointers in struct mm_iommu_table_group_mem_t were already converted to
physical addresses. Thus calling put_page() on these physical addresses
will likely crash. Convert physical addresses back to page pointers
during the error cleanup.

Signed-off-by: Jan Kara 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

 Beware, this is completely untested, spotted just by code audit.

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..06c403381c9c 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
    (mem2->entries << PAGE_SHIFT)))) {
ret = -EINVAL;
mutex_unlock(&mem_list_mutex);
-   goto free_exit;
+   goto convert_exit;
}
}
 
@@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
return 0;
 
+convert_exit:
+   for (i = 0; i < pinned; i++)
+   mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
 free_exit:
/* free the reference taken */
for (i = 0; i < pinned; i++)
-- 
2.16.4



[RFC PATCH] skiboot machine check handler

2019-12-11 Thread Nicholas Piggin
Provide facilities to decode machine checks into human readable
strings, with only sufficient information required to deal with
them sanely.

The old machine check stuff was over engineered. The philosophy
here is that OPAL should correct anything it possibly can, what
it can't handle but the OS might be able to do something with
(e.g., uncorrected memory error or SLB multi-hit), it passes back
to Linux. Anything else, the OS doesn't care. It doesn't want a
huge struct of severities and levels and originators etc that it
can't do anything with -- just provide human readable strings
for what happened and what was done with it.

A Linux driver for this will be able to cope with new processors.

This also uses the same facility to decode machine checks in OPAL
boot.

The code is a bit in flux because it's sitting on top of a few
other RFC patches and not quite complete, just wanted opinions
about it.
---
 core/Makefile.inc  |   2 +-
 core/exceptions.c  |  28 -
 core/mce.c | 306 +
 include/opal-api.h |  41 +-
 include/skiboot.h  |   6 +
 5 files changed, 379 insertions(+), 4 deletions(-)
 create mode 100644 core/mce.c

diff --git a/core/Makefile.inc b/core/Makefile.inc
index c2b5251d7..cc90fb958 100644
--- a/core/Makefile.inc
+++ b/core/Makefile.inc
CORE_OBJS = relocate.o console.o stack.o init.o chip.o mem_region.o vm.o
 CORE_OBJS += malloc.o lock.o cpu.o utils.o fdt.o opal.o interrupts.o timebase.o
 CORE_OBJS += opal-msg.o pci.o pci-virt.o pci-slot.o pcie-slot.o
 CORE_OBJS += pci-opal.o fast-reboot.o device.o exceptions.o trace.o affinity.o
-CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o
+CORE_OBJS += vpd.o platform.o nvram.o nvram-format.o hmi.o mce.o
 CORE_OBJS += console-log.o ipmi.o time-utils.o pel.o pool.o errorlog.o
 CORE_OBJS += timer.o i2c.o rtc.o flash.o sensor.o ipmi-opal.o
 CORE_OBJS += flash-subpartition.o bitmap.o buddy.o pci-quirk.o powercap.o psr.o
diff --git a/core/exceptions.c b/core/exceptions.c
index 66e8953ce..b04d15125 100644
--- a/core/exceptions.c
+++ b/core/exceptions.c
@@ -32,6 +32,7 @@ static void dump_regs(struct stack_frame *stack)
 
 #define EXCEPTION_MAX_STR 320
 
+#if 0
static void print_recoverable_mce_vm(struct stack_frame *stack, uint64_t nip, uint64_t msr)
 {
char buf[EXCEPTION_MAX_STR];
@@ -46,6 +47,7 @@ static void print_recoverable_mce_vm(struct stack_frame *stack, uint64_t nip, ui
dump_regs(stack);
prerror("Continuing with VM off\n");
 }
+#endif
 
 void exception_entry(struct stack_frame *stack)
 {
@@ -103,7 +105,11 @@ void exception_entry(struct stack_frame *stack)
}
break;
 
-   case 0x200:
+   case 0x200: {
+   uint64_t mce_flags, mce_addr;
+   const char *mce_err;
+
+#if 0
if (this_cpu()->vm_local_map_inuse)
fatal = true; /* local map is non-linear */
 
@@ -114,12 +120,29 @@ void exception_entry(struct stack_frame *stack)
stack->srr1 &= ~(MSR_IR|MSR_DR);
goto out;
}
+#endif
 
fatal = true;
prerror("***\n");
l += snprintf(buf + l, EXCEPTION_MAX_STR - l,
"Fatal MCE at "REG"   ", nip);
-   break;
+   l += snprintf_symbol(buf + l, EXCEPTION_MAX_STR - l, nip);
+   l += snprintf(buf + l, EXCEPTION_MAX_STR - l, "  MSR "REG, msr);
+   prerror("%s\n", buf);
+
+   decode_mce(stack->srr0, stack->srr1, stack->dsisr, stack->dar,
+   &mce_flags, &mce_err, &mce_addr);
+   l = 0;
+   l += snprintf(buf + l, EXCEPTION_MAX_STR - l,
+   "Cause: %s", mce_err);
+   prerror("%s\n", buf);
+   if (mce_flags & MCE_INVOLVED_EA) {
+   l += snprintf(buf + l, EXCEPTION_MAX_STR - l,
+   "Effective address: 0x%016llx", mce_addr);
+   prerror("%s\n", buf);
+   }
+   goto no_symbol;
+   }
 
case 0x300:
if (vm_dsi(nip, stack->dar, !!(stack->dsisr & DSISR_ISSTORE)))
@@ -195,6 +218,7 @@ void exception_entry(struct stack_frame *stack)
l += snprintf_symbol(buf + l, EXCEPTION_MAX_STR - l, nip);
l += snprintf(buf + l, EXCEPTION_MAX_STR - l, "  MSR "REG, msr);
prerror("%s\n", buf);
+no_symbol:
dump_regs(stack);
backtrace_r1((uint64_t)stack);
if (fatal) {
diff --git a/core/mce.c b/core/mce.c
new file mode 100644
index 0..0ebf98380
--- /dev/null
+++ b/core/mce.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: Apache-2.0
+/*
+ * Deal with Machine Check Exceptions
+ *
+ * Copyright 2019 IBM Corp.
+ */
+
+#define pr_fmt(fmt)	"MCE: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static 

Re: Call for report - G5/PPC970 status

2019-12-11 Thread jjhdiederen
I have an iMac iSight with a 2.1 GHz PowerPC 970fx (G5) processor that boots fine with the latest ppc64 kernel.


Romain Dolbeau wrote on 2019-12-11 08:19:

On Wed, 11 Dec 2019 at 03:20, Aneesh Kumar K.V wrote:

The PowerMac system we have internally was not able to recreate this.


To narrow down the issue - is that a PCI/PCI-X (7,3 [1]) or PCIe G5 (11,2 [1])?

Single, dual or quad ?

Same question to anyone else with a G5 / PPC970 - what is it and does
it boot recent PPC64 Linux kernel ?

Christian from the original report has a quad, like me (so powermac11,2).


There was also a report of a powermac7,3 working in the original discussion,

single or dual unspecified.

So this might be a Quad thing, or a more general 11,2 thing...


At this point, I am not sure what would cause the Machine check with
that patch series because we have not changed the VA bits in that 
patch.


Any test I could run that would help you tracking the bug ?

Cordially,

Romain

[1] 




--
Romain Dolbeau


Re: [PATCH] powerpc/fault: kernel can extend a user process's stack

2019-12-11 Thread Daniel Black
On Wed, 11 Dec 2019 12:43:37 +1100
Daniel Axtens  wrote:

> If a process page-faults trying to write beyond the end of its
> stack, we attempt to grow the stack.
> 
> However, if the kernel attempts to write beyond the end of a
> process's stack, it takes a bad fault. This can occur when the
> kernel is trying to set up a signal frame.
> 
> Permit the kernel to grow a process's stack. The same general
> limits as to how and when the stack can be grown apply: the kernel
> code still passes through expand_stack(), so anything that goes
> beyond e.g. the rlimit should still be blocked.
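The condition being added can be sketched in isolation (flag values below match the 5.x-era definitions in include/linux/mm.h, but the helper name is made up): a write fault without FAULT_FLAG_USER is a kernel-mode write, which is now allowed past bad_stack_expansion() and on to expand_stack(), where rlimit checks still apply.

```c
#include <assert.h>
#include <stdbool.h>

#define FAULT_FLAG_WRITE 0x01
#define FAULT_FLAG_USER  0x40

/*
 * Sketch of the new early-out in bad_stack_expansion(): a write fault
 * below the stack pointer that did NOT come from user mode (e.g. the
 * kernel setting up a signal frame) is not treated as a bad expansion.
 */
static bool kernel_may_expand(unsigned int flags)
{
	return (flags & FAULT_FLAG_WRITE) && !(flags & FAULT_FLAG_USER);
}
```

User-mode faults keep going through the existing store_updates_sp() heuristics; only the kernel-originated write takes the new path.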


Thanks Daniel.

Looks good from a function perspective.

danielgb@ozrom2:~$ gcc -g -Wall -O stactest.c 
danielgb@ozrom2:~$ ./a.out 124 &
[1] 4223
danielgb@ozrom2:~$ cat /proc/$(pidof a.out)/maps | grep stack
714d-7160 rw-p  00:00 0  [stack]
danielgb@ozrom2:~$  kill -USR1 %1
danielgb@ozrom2:~$ signal delivered, stack base 0x7160 top 0x714d1427 (1240025 used)

[1]+  Done./a.out 124
danielgb@ozrom2:~$  ./a.out 1241000 &
[1] 4227
danielgb@ozrom2:~$ kill -USR1 %1
danielgb@ozrom2:~$ signal delivered, stack base 0x7f63 top 0x7f501057 (1241001 used)

[1]+  Done./a.out 1241000
danielgb@ozrom2:~$ uname -a
Linux ozrom2 5.5.0-rc1-1-g83ab444248c1 #1 SMP Wed Dec 11 17:01:50 AEDT 2019 ppc64le ppc64le ppc64le GNU/Linux

Tested-by: Daniel Black 

> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=205183
> Reported-by: Tom Lane 
> Cc: Daniel Black 
> Signed-off-by: Daniel Axtens 
> ---
>  arch/powerpc/mm/fault.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index b5047f9b5dec..00183731ea22 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -287,7 +287,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
>   if (!res)
>   return !store_updates_sp(inst);
>   *must_retry = true;
> + } else if ((flags & FAULT_FLAG_WRITE) &&
> +!(flags & FAULT_FLAG_USER)) {
> + /*
> +  * the kernel can also attempt to write beyond the end
> +  * of a process's stack - for example setting up a
> +  * signal frame. We assume this is valid, subject to
> +  * the checks in expand_stack() later.
> +  */
> + return false;
>   }
> +
>   return true;
>   }
>   return false;



Re: [PATCH] powerpc/fault: kernel can extend a user process's stack

2019-12-11 Thread Daniel Axtens
> Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in
> arch/powerpc.")

Wow, that's pretty ancient! I'm also not sure it's right - in that same
patch, arch/ppc64/mm/fault.c contains:

^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 213)    if (address + 2048 < uregs->gpr[1]
^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 214)        && (!user_mode(regs) || !store_updates_sp(regs)))
^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 215)            goto bad_area;

Which is the same as the new arch/powerpc/mm/fault.c code:

14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 234)    if (address + 2048 < uregs->gpr[1]
14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 235)        && (!user_mode(regs) || !store_updates_sp(regs)))
14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 236)            goto bad_area;

So either they're both right or they're both wrong, either way I'm not
sure how this patch is to blame.

I guess we should also cc stable@...

Regards,
Daniel

>> Reported-by: Tom Lane 
>> Cc: Daniel Black 
>> Signed-off-by: Daniel Axtens 
>> ---
>>  arch/powerpc/mm/fault.c | 10 ++
>>  1 file changed, 10 insertions(+)
>> 
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index b5047f9b5dec..00183731ea22 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -287,7 +287,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
>> unsigned long address,
>>  if (!res)
>>  return !store_updates_sp(inst);
>>  *must_retry = true;
>> +} else if ((flags & FAULT_FLAG_WRITE) &&
>> +   !(flags & FAULT_FLAG_USER)) {
>> +/*
>> + * the kernel can also attempt to write beyond the end
>> + * of a process's stack - for example setting up a
>> + * signal frame. We assume this is valid, subject to
>> + * the checks in expand_stack() later.
>> + */
>> +return false;
>>  }
>> +
>>  return true;
>>  }
>>  return false;
>> -- 
>> 2.20.1
>> 


Re: [PATCH v1 3/4] arm64: dts: ls1028a: fix little-big endian issue for dcfg

2019-12-11 Thread Shawn Guo
On Tue, Dec 10, 2019 at 02:34:30AM +, Y.b. Lu wrote:
> + Shawn,
> 
> > -Original Message-
> > From: Michael Walle 
> > Sent: Tuesday, December 10, 2019 8:06 AM
> > To: Yinbo Zhu 
> > Cc: Ashish Kumar ; Alexandru Marginean
> > ; Alison Wang ;
> > Amit Jain (aj) ; catalin.horghi...@nxp.com; Claudiu
> > Manoil ; devicet...@vger.kernel.org; Jiafei Pan
> > ; Leo Li ;
> > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org;
> > linuxppc-dev@lists.ozlabs.org; mark.rutl...@arm.com;
> > rajat.srivast...@nxp.com; Rajesh Bhagat ;
> > robh...@kernel.org; Vabhav Sharma ; Xiaobo Xie
> > ; Y.b. Lu ; Michael Walle
> > 
> > Subject: Re: [PATCH v1 3/4] arm64: dts: ls1028a: fix little-big endian 
> > issue for
> > dcfg
> > 
> 
> [Y.b. Lu] Acked-by: Yangbo Lu 
> 
> Hi Shawn, could you help to review and merge the two dts patches of this 
> patch-set?
> Thanks.

Please resend them with me on the recipient list.

Shawn


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-11 Thread Balbir Singh



On 11/12/19 4:21 pm, Daniel Axtens wrote:
> Hi Balbir,
> 
>>> +Discontiguous memory can occur when you have a machine with memory spread
>>> +across multiple nodes. For example, on a Talos II with 64GB of RAM:
>>> +
>>> + - 32GB runs from 0x0 to 0x_0008__,
>>> + - then there's a gap,
>>> + - then the final 32GB runs from 0x_2000__ to 0x_2008__
>>> +
>>> +This can create _significant_ issues:
>>> +
>>> + - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
>>> +   assume that it ran from 0x0 to 0x_0010__. We'd then reserve the
>>> +   last 1/8th - 0x_000e__ to 0x_0010__ - as the shadow
>>> +   region. But when we try to access any of that, we'll try to access pages
>>> +   that are not physically present.
>>> +
>>
>> If we reserved memory for KASAN from each node (discontig region), we might
>> survive this, no? Maybe we need NUMA-aware KASAN? That might be a generic
>> change, just thinking out loud.
> 
> The challenge is that - AIUI - in inline instrumentation, the compiler
> doesn't generate calls to things like __asan_loadN and
> __asan_storeN. Instead it uses -fasan-shadow-offset to compute the
> checks, and only calls the __asan_report* family of functions if it
> detects an issue. This also matches what I can observe with objdump
> across outline and inline instrumentation settings.
> 
> This means that for this sort of thing to work we would need to either
> drop back to out-of-line calls, or teach the compiler how to use a
> nonlinear, NUMA aware mem-to-shadow mapping.

Yes, out of line is expensive, but seems to work well for all use cases.
BTW, the current set of patches just hangs if I try to make out-of-line
instrumentation the default mode.


> 
> I'll document this a bit better in the next spin.
> 
>>> +   if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_PPC_BOOK3S_64)) {
>>> +   kasan_memory_size =
>>> +   ((phys_addr_t)CONFIG_PHYS_MEM_SIZE_FOR_KASAN << 20);
>>> +
>>> +   if (top_phys_addr < kasan_memory_size) {
>>> +   /*
>>> +* We are doomed. Attempts to call e.g. panic() are
>>> +* likely to fail because they call out into
>>> +* instrumented code, which will almost certainly
>>> +* access memory beyond the end of physical
>>> +* memory. Hang here so that at least the NIP points
>>> +* somewhere that will help you debug it if you look at
>>> +* it in qemu.
>>> +*/
>>> +   while (true)
>>> +   ;
>>
>> Again, with the right hooks in check_memory_region_inline(), these are
>> recoverable, or so I think.
> 
> So unless I misunderstand the circumstances in which
> check_memory_region_inline is used, this isn't going to help with inline
> instrumentation.
> 

Yes, I understand. Same as above?


>>> +void __init kasan_init(void)
>>> +{
>>> +   int i;
>>> +   void *k_start = kasan_mem_to_shadow((void *)RADIX_KERN_VIRT_START);
>>> +   void *k_end = kasan_mem_to_shadow((void *)RADIX_VMEMMAP_END);
>>> +
>>> +   pte_t pte = __pte(__pa(kasan_early_shadow_page) |
>>> + pgprot_val(PAGE_KERNEL) | _PAGE_PTE);
>>> +
>>> +   if (!early_radix_enabled())
>>> +   panic("KASAN requires radix!");
>>> +
>>
>> I think this is avoidable, we could use a static key for disabling kasan in
>> the generic code. I wonder what happens if someone tries to boot this
>> image on a Power8 box and keeps panic'ing with no easy way of recovering.
> 
> Again, assuming I understand correctly that the compiler generates raw
> IR->asm for these checks rather than calling out to a function, then I
> don't think we get a way to intercept those checks. It's too late to do
> anything at the __asan report stage because that will already have
> accessed memory that's not set up properly.
> 
> If you try to boot this on a Power8 box it will panic and you'll have to
> boot into another kernel from the bootloader. I don't think it's
> avoidable without disabling inline instrumentation, but I'd love to be
> proven wrong.
> 
>>
>> NOTE: I can't test any of these, well, maybe with qemu; let me see if I can
>> spin the series and provide more feedback.
> 
> It's actually super easy to do simple boot tests with qemu; it works fine in
> TCG. Michael's wiki page at
> https://github.com/linuxppc/wiki/wiki/Booting-with-Qemu is very helpful.
> 
> I did this a lot in development.
> 
> My full commandline, fwiw, is:
> 
> qemu-system-ppc64  -m 8G -M pseries -cpu power9  -kernel 
> ../out-3s-radix/vmlinux  -nographic -chardev stdio,id=charserial0,mux=on 
> -device spapr-vty,chardev=charserial0,reg=0x3000 -initrd 
> ./rootfs-le.cpio.xz -mon chardev=charserial0,mode=readline -nodefaults -smp 4

qemu has been crashing with KASAN enabled, with both the inline and
out-of-line options. I

Re: [PATCH 5/6] mm, memory_hotplug: Provide argument for the pgprot_t in arch_add_memory()

2019-12-11 Thread Michal Hocko
On Tue 10-12-19 16:52:31, Logan Gunthorpe wrote:
[...]
> In my opinion, having a coder and reviewer see PAGE_KERNEL and ask if
> that makes sense is a benefit. Having it hidden because we don't want
> people to think about it is worse, harder to understand and results in
> bugs that are more difficult to spot.

My experience would disagree here. We have several examples in the MM
where overly complex and versatile APIs led to subtle bugs and a lot of
copy and cargo-cult programming (just look at the page allocator
as a shiny example - e.g. gfp_flags). So I am always trying to be
careful here.

> Though, we may be overthinking this: arch_add_memory() is a low level
> non-exported API that's currently used in exactly two places.

This is a fair argument. Most users are and should be using
add_memory().

> I don't
> think there's going to be many, if any, valid new use cases coming up
> for it in the future. That's more what memremap_pages() is for.

OK, fair enough. If this is indeed the simplest way forward then I will
not stand in the way.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-11 Thread Alexey Kardashevskiy



On 11/12/2019 12:43, Michael Roth wrote:
> Quoting Ram Pai (2019-12-06 19:12:39)
>> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
>> secure guests")
>> disabled the dma_iommu_ops path for secure VMs. Disabling the
>> dma_iommu_ops path for secure VMs helped enable the dma_direct path. This
>> enabled support for bounce-buffering through SWIOTLB. However it fails to
>> operate when an IOMMU is enabled, since I/O pages are not TCE-mapped.
>>
>> Re-enable the dma_iommu_ops path for pseries Secure VMs. It handles all
>> cases, including TCE-mapping I/O pages in the presence of an
>> IOMMU.
> 
> Wasn't clear to me at first, but I guess the main gist of this series is
> that we want to continue to use SWIOTLB, but also need to create mappings
> of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> checks in DMA API functions to do the same.


Correct. Took me a bit of time to realize what we got here :) We only
rely on dma_iommu_ops::dma_supported to write the DMA offset to a
device (when creating a huge window), and after that we know it is
mapped directly and swiotlb gets this 1<<59 offset via __phys_to_dma().


> That makes sense, but one issue I see with that is that
> dma_iommu_map_bypass() only tests true if all the following are true:
> 
> 1) the device requests a 64-bit DMA mask via
>dma_set_mask/dma_set_coherent_mask
> 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> 
> dma_is_direct() checks don't have this limitation, so I think for
> cases such as devices that use a smaller DMA mask, we'll
> end up falling back to the non-bypass functions in dma_iommu_ops, which
> will likely break for things like dma_alloc_coherent/dma_map_single
> since they won't use SWIOTLB pages and won't do the necessary calls to
> set_memory_unencrypted() to share those non-SWIOTLB buffers with
> hypervisor.
> 
> Maybe that's ok, but I think we should be clearer about how to
> fail/handle these cases.
> 
> Though I also agree with some concerns Alexey stated earlier: it seems
> wasteful to map the entire DDW window just so these bounce buffers can be
> mapped.  Especially if you consider the lack of a mapping to be an additional
> safeguard against things like buggy device implementations on the QEMU
> side. E.g. if we leaked pages to the hypervisor by accident, those pages
> wouldn't be immediately accessible to a device, and would still require
> additional work get past the IOMMU.
> 
> What would it look like if we try to make all this work with disable_ddw
> passed on the kernel command-line (or forced for is_secure_guest())?
> 
>   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_*
>  ops, but an additional case or hook that considers is_secure_guest()
>  might do it.
>  
>   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
>  io_tlb_start/io_tlb_end. We could do it once, on-demand via
>  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
>  maybe in some init function.


io_tlb_start/io_tlb_end are only guaranteed to stay within 4GB, and our
default DMA window is 1GB (KVM) or 2GB (PowerVM); ok, we can define
ARCH_LOW_ADDRESS_LIMIT as 1GB.

But it has also been mentioned that we are likely to be having swiotlb
buffers outside of the first 4GB, as they are not just for crippled
devices any more. So we are likely to have a 64bit window; I'd just ditch
the default window then. I have patches for this, but every time I
thought I had a use case, it turned out that I did not.


> That also has the benefit of not requiring devices to support 64-bit DMA.
> 
> Alternatively, we could continue to rely on the 64-bit DDW window, but
> modify call to enable_ddw() to only map the io_tlb_start/end range in
> the case of is_secure_guest(). This is a little cleaner implementation-wise
> since we can rely on the existing dma_iommu_{alloc,map}_bypass() hooks, but
> devices that don't support 64-bit will fail back to not using dma_direct_* ops
> and fail miserably. We'd probably want to handle that more gracefully.
> 
> Or we handle both cases gracefully. To me it makes more sense to enable
> non-DDW case, then consider adding DDW case later if there's some reason
> why 64-bit DMA is needed. But would be good to hear if there are other
> opinions.


For now we need to do something with the H_PUT_TCE_INDIRECT page -
either disable multitce (but boot time increases) or share the page. The
patch does the latter. Thanks,


> 
>>
>> Signed-off-by: Ram Pai 
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 11 +--
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 67b5009..4e27d66 100644
>> --- 

Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Alexey Kardashevskiy



On 11/12/2019 02:35, Ram Pai wrote:
> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 10/12/2019 16:12, Ram Pai wrote:
>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:


 On 07/12/2019 12:12, Ram Pai wrote:
> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> its parameters.  On secure VMs, hypervisor cannot access the contents of
> this page since it gets encrypted.  Hence share the page with the
> hypervisor, and unshare when done.


 I thought the idea was to use H_PUT_TCE and avoid sharing any extra
 pages. There is a small problem that when DDW is enabled,
 FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
 about the performance on slack but this is caused by initial cleanup of
 the default TCE window (which we do not use anyway) and to battle this
 we can simply reduce its size by adding
>>>
>>> something that takes hardly any time with H_PUT_TCE_INDIRECT takes
>>> 13 seconds per device with the H_PUT_TCE approach, during boot. This is
>>> with a 30GB guest. With a larger guest, the time will deteriorate further.
>>
>>
>> No it will not, I checked. The time is the same for 2GB and 32GB guests -
>> the delay is caused by clearing the small DMA window which is small by
>> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
>> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
>> (depends on the system) so the number of TCEs is much smaller.
> 
> I can't get your results. What changes did you make to get them?


Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
in clearing the default window and the huge window took a fraction of a
second to create and map.



 -global
 spapr-pci-host-bridge.dma_win_size=0x400
>>>
>>> This option speeds it up tremendously. But then, should this option be
>>> enabled in qemu by default? Only for secure VMs? For both VMs?
>>
>>
>> As discussed in slack, by default we do not need to clear the entire TCE
>> table and we only have to map swiotlb buffer using the small window. It
>> is a guest kernel change only. Thanks,
> 
> Can you tell me what code you are talking about here.  Where is the TCE
> table getting cleared? What code needs to be changed to not clear it?


pci_dma_bus_setup_pSeriesLP()
  iommu_init_table()
    iommu_table_clear()
      for () tbl->it_ops->get()

We do not really need to clear it there, we only need it for VFIO with
IOMMU SPAPR TCE v1 which reuses these tables but there are
iommu_take_ownership/iommu_release_ownership to clear these tables. I'll
send a patch for this.


> Is the code in tce_buildmulti_pSeriesLP(), the one that does the clear
> aswell?


This one does not need to clear TCEs as this creates a window of known
size and maps it all.

Well, actually, it only maps actual guest RAM, if there are gaps in RAM,
then TCEs for the gaps will have what hypervisor had there (which is
zeroes, qemu/kvm clears it anyway).


> But before I close, you have not told me clearly what the problem is
> with 'share the page, make the H_PUT_INDIRECT_TCE hcall, unshare the page'.

Between share and unshare you have a (tiny) window of opportunity to
attack the guest. No, I do not know how exactly.

For example, the hypervisor does a lot of PHB+PCI hotplug-unplug with
64bit devices - each time this will create a huge window which will
share/unshare the same page. No, I do not know exactly how this can
be exploited either; we cannot rely on what you or I know today. My
point is that we should not be sharing pages at all unless we really
really have to, and this does not seem to be the case.

But since this seems to an acceptable compromise anyway,

Reviewed-by: Alexey Kardashevskiy 





> Remember this is the same page that is earmarked for doing
> H_PUT_INDIRECT_TCE, not by my patch, but its already earmarked by the
> existing code. So it not some random buffer that is picked. Second 
> this page is temporarily shared and unshared, it does not stay shared
> for life. It does not slow the boot. It does not need any
> special command line options in qemu.
> Shared pages technology was put in place exactly for the purpose of
> sharing data with the hypervisor.  We are using this technology exactly
> for that purpose.  And finally I agreed with your concern of having
> shared pages staying around.  Hence i addressed that concern, by
> unsharing the page.  At this point, I fail to understand your concern.




-- 
Alexey