Re: [PATCH 1/2] powerpc/pkeys: preallocate execute_only key only if the key is available.

2018-06-28 Thread Gabriel Paubert
On Thu, Jun 28, 2018 at 11:56:34PM -0300, Thiago Jung Bauermann wrote:
> 
> Hello,
> 
> Ram Pai  writes:
> 
> > Key 2 is preallocated and reserved for execute-only key. In rare
> > cases if key-2 is unavailable, mprotect(PROT_EXEC) will behave
> > incorrectly. NOTE: mprotect(PROT_EXEC) uses execute-only key.
> >
> > Ensure key 2 is available for preallocation before reserving it for
> > execute_only purpose.  Problem noticed by Michael Ellerman.
> 
> Since "powerpc/pkeys: Preallocate execute-only key" isn't upstream yet,
> this patch could be squashed into it.
> 
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/mm/pkeys.c |   14 +-
> >  1 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> > index cec990c..0b03914 100644
> > --- a/arch/powerpc/mm/pkeys.c
> > +++ b/arch/powerpc/mm/pkeys.c
> > @@ -19,6 +19,7 @@
> >  u64  pkey_amr_mask;/* Bits in AMR not to be touched */
> >  u64  pkey_iamr_mask;   /* Bits in AMR not to be touched */
> >  u64  pkey_uamor_mask;  /* Bits in UMOR not to be touched */
> > +int  execute_only_key = 2;
> >
> >  #define AMR_BITS_PER_PKEY 2
> >  #define AMR_RD_BIT 0x1UL
> > @@ -26,7 +27,6 @@
> >  #define IAMR_EX_BIT 0x1UL
> >  #define PKEY_REG_BITS (sizeof(u64)*8)
> >  #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
> > -#define EXECUTE_ONLY_KEY 2
> >
> >  static void scan_pkey_feature(void)
> >  {
> > @@ -122,8 +122,12 @@ int pkey_initialize(void)
> >  #else
> > os_reserved = 0;
> >  #endif
> > +
> > +   if ((pkeys_total - os_reserved) <= execute_only_key)
> > +   execute_only_key = -1;
> > +
> > /* Bits are in LE format. */
> > -   reserved_allocation_mask = (0x1 << 1) | (0x1 << EXECUTE_ONLY_KEY);
> > +   reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);
> 
> My understanding is that left-shifting by a negative amount is undefined
> behavior in C. A quick test tells me that at least on the couple of
> machines I tested, 1 << -1 = 0. Does GCC guarantee that behavior?

Not in general. It probably always works on Power because of the definition
of the machine instruction for shifts with a variable amount (the shift
amount is treated as unsigned and taken modulo twice the width of the
operand), but it fails for example on x86, where 1 << -1 gives 0x80000000.

> If so, a comment pointing this out would make this less confusing.

Unless I miss something, this code is run once at boot, so its
performance is irrelevant.

In this case simply rewrite it as:

	reserved_allocation_mask = 0x1 << 1;
	if ((pkeys_total - os_reserved) <= execute_only_key) {
		execute_only_key = -1;
	} else {
		reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);
	}

Caveats: I have assumed that this code only runs once, and that
pkeys_total and os_reserved are int, not unsigned.

> 
> > initial_allocation_mask  = reserved_allocation_mask | (0x1 << PKEY_0);
> >
> > /* register mask is in BE format */
> > @@ -132,11 +136,11 @@ int pkey_initialize(void)
> >
> > pkey_iamr_mask = ~0x0ul;
> > pkey_iamr_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> > -   pkey_iamr_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> > +   pkey_iamr_mask &= ~(0x3ul << pkeyshift(execute_only_key));
> >
> > pkey_uamor_mask = ~0x0ul;
> > pkey_uamor_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> > -   pkey_uamor_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> > +   pkey_uamor_mask &= ~(0x3ul << pkeyshift(execute_only_key));
> 
> Here the behaviour is undefined in C as well, given that pkeyshift(-1) =
> 64, which is the total number of bits in the left operand. Does GCC
> guarantee that the result will be 0 here as well?

Same answer: very likely on Power, not portable.

Gabriel


Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread Alexey Kardashevskiy
On Fri, 29 Jun 2018 14:57:02 +1000
David Gibson  wrote:

> On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:
> > On Fri, 29 Jun 2018 14:12:41 +1000
> > David Gibson  wrote:
> >   
> > > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:  
> > > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > > an IOMMU page is contained in the physical page so the PCI hardware 
> > > > won't
> > > > get access to unassigned host memory.
> > > > 
> > > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > > code) so the user space can pin memory backed with 64k pages and create
> > > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > > did not hit this yet as the very first time the mapping happens
> > > > we do not have tbl::it_userspace allocated yet and fall back to
> > > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > > because of the check in vfio_iommu_spapr_tce.c which is the really
> > > > sustainable solution.
> > > > 
> > > > This stores the smallest preregistered page size in the preregistered
> > > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > > the IOMMU page size.
> > > > 
> > > > Signed-off-by: Alexey Kardashevskiy 
> > > > ---
> > > > Changes:
> > > > v2:
> > > > * explicitly check for compound pages before calling compound_order()
> > > > 
> > > > ---
> > > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > > at 
> > > > https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > > 
> > > > With the change, mapping will fail in KVM and the guest will print:
> > > > 
> > > > mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 
> > > > 2000 18 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> > > > mlx5_core :00:00.0: created tce table LIOBN 0x8001 for 
> > > > /pci@8002000/ethernet@0
> > > > mlx5_core :00:00.0: failed to map direct window for
> > > > /pci@8002000/ethernet@0: -1
> > > 
> > > [snip]  
> > > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned 
> > > > long ua, unsigned long entries,
> > > > struct mm_iommu_table_group_mem_t **pmem)
> > > >  {
> > > > struct mm_iommu_table_group_mem_t *mem;
> > > > -   long i, j, ret = 0, locked_entries = 0;
> > > > +   long i, j, ret = 0, locked_entries = 0, pageshift;
> > > > struct page *page = NULL;
> > > >  
> > > > mutex_lock(&mem_list_mutex);
> > > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned 
> > > > long ua, unsigned long entries,
> > > > goto unlock_exit;
> > > > }
> > > >  
> > > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */
> > > 
> > > What about 16G pages on an HPT system?  
> > 
> > 
> > Below in the loop mem->pageshift will reduce to the biggest actual size
> > which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> > pinned, no loss there.  
> 
> Are you saying that 16G IOMMU pages aren't supported?  Or that there's
> some reason a guest can never use them?


Ah, 16_G_, not _M_. My bad. I just never tried such huge pages. I will
lift the limit up to 64 then, easier this way.

> 
> > > > for (i = 0; i < entries; ++i) {
> > > > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > > 1/* pages */, 1/* iswrite */, 
> > > > &page)) {
> > > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned 
> > > > long ua, unsigned long entries,
> > > > }
> > > > }
> > > >  populate:
> > > > +   pageshift = PAGE_SHIFT;
> > > > +   if (PageCompound(page))
> > > > +   pageshift += 
> > > > compound_order(compound_head(page));
> > > > +   mem->pageshift = min_t(unsigned int, mem->pageshift, 
> > > > pageshift);
> > > 
> > > Why not make mem->pageshift and pageshift local the same type to avoid
> > > the min_t() ?  
> > 
> > I was under the impression that min() is deprecated (maybe I
> > misinterpreted checkpatch.pl) and therefore did not pay attention to
> > it. I can fix this and repost if there is no other question.  
> 
> Hm, it's possible.


Nah, tried min(), compiles fine.



--
Alexey




Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread David Gibson
On Fri, Jun 29, 2018 at 02:51:21PM +1000, Alexey Kardashevskiy wrote:
> On Fri, 29 Jun 2018 14:12:41 +1000
> David Gibson  wrote:
> 
> > On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:
> > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > get access to unassigned host memory.
> > > 
> > > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > > code) so the user space can pin memory backed with 64k pages and create
> > > a hardware TCE table with a bigger page size. We were lucky so far and
> > > did not hit this yet as the very first time the mapping happens
> > > we do not have tbl::it_userspace allocated yet and fall back to
> > > the userspace which in turn calls VFIO IOMMU driver and that fails
> > > because of the check in vfio_iommu_spapr_tce.c which is the really
> > > sustainable solution.
> > > 
> > > This stores the smallest preregistered page size in the preregistered
> > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > the IOMMU page size.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy 
> > > ---
> > > Changes:
> > > v2:
> > > * explicitly check for compound pages before calling compound_order()
> > > 
> > > ---
> > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > at 
> > > https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > 
> > > With the change, mapping will fail in KVM and the guest will print:
> > > 
> > > mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 
> > > 18 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> > > mlx5_core :00:00.0: created tce table LIOBN 0x8001 for 
> > > /pci@8002000/ethernet@0
> > > mlx5_core :00:00.0: failed to map direct window for
> > > /pci@8002000/ethernet@0: -1  
> > 
> > [snip]
> > > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> > > ua, unsigned long entries,
> > >   struct mm_iommu_table_group_mem_t **pmem)
> > >  {
> > >   struct mm_iommu_table_group_mem_t *mem;
> > > - long i, j, ret = 0, locked_entries = 0;
> > > + long i, j, ret = 0, locked_entries = 0, pageshift;
> > >   struct page *page = NULL;
> > >  
> > >   mutex_lock(&mem_list_mutex);
> > > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> > > ua, unsigned long entries,
> > >   goto unlock_exit;
> > >   }
> > >  
> > > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */
> > 
> > What about 16G pages on an HPT system?
> 
> 
> Below in the loop mem->pageshift will reduce to the biggest actual size
> which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
> pinned, no loss there.

Are you saying that 16G IOMMU pages aren't supported?  Or that there's
some reason a guest can never use them?

> > >   for (i = 0; i < entries; ++i) {
> > >   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > >   1/* pages */, 1/* iswrite */, &page)) {
> > > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned 
> > > long ua, unsigned long entries,
> > >   }
> > >   }
> > >  populate:
> > > + pageshift = PAGE_SHIFT;
> > > + if (PageCompound(page))
> > > + pageshift += compound_order(compound_head(page));
> > > + mem->pageshift = min_t(unsigned int, mem->pageshift, 
> > > pageshift);  
> > 
> > Why not make mem->pageshift and pageshift local the same type to avoid
> > the min_t() ?
> 
> I was under the impression that min() is deprecated (maybe I
> misinterpreted checkpatch.pl) and therefore did not pay attention to
> it. I can fix this and repost if there is no other question.

Hm, it's possible.


-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread Alexey Kardashevskiy
On Fri, 29 Jun 2018 14:12:41 +1000
David Gibson  wrote:

> On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:
> > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > an IOMMU page is contained in the physical page so the PCI hardware won't
> > get access to unassigned host memory.
> > 
> > However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> > code) so the user space can pin memory backed with 64k pages and create
> > a hardware TCE table with a bigger page size. We were lucky so far and
> > did not hit this yet as the very first time the mapping happens
> > we do not have tbl::it_userspace allocated yet and fall back to
> > the userspace which in turn calls VFIO IOMMU driver and that fails
> > because of the check in vfio_iommu_spapr_tce.c which is the really
> > sustainable solution.
> > 
> > This stores the smallest preregistered page size in the preregistered
> > region descriptor and changes the mm_iommu_xxx API to check this against
> > the IOMMU page size.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> > Changes:
> > v2:
> > * explicitly check for compound pages before calling compound_order()
> > 
> > ---
> > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > for IOMMU pages without checking the mmu pagesize and this will fail
> > at 
> > https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > 
> > With the change, mapping will fail in KVM and the guest will print:
> > 
> > mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 
> > 18 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> > mlx5_core :00:00.0: created tce table LIOBN 0x8001 for 
> > /pci@8002000/ethernet@0
> > mlx5_core :00:00.0: failed to map direct window for
> > /pci@8002000/ethernet@0: -1  
> 
> [snip]
> > @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> > ua, unsigned long entries,
> > struct mm_iommu_table_group_mem_t **pmem)
> >  {
> > struct mm_iommu_table_group_mem_t *mem;
> > -   long i, j, ret = 0, locked_entries = 0;
> > +   long i, j, ret = 0, locked_entries = 0, pageshift;
> > struct page *page = NULL;
> >  
> > mutex_lock(&mem_list_mutex);
> > @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> > ua, unsigned long entries,
> > goto unlock_exit;
> > }
> >  
> > +	mem->pageshift = 30; /* start from 1G pages - the biggest we have */
> 
> What about 16G pages on an HPT system?


Below in the loop mem->pageshift will reduce to the biggest actual size
which will be 16mb/64k/4k. Or remain 1GB if no memory is actually
pinned, no loss there.


> 
> > for (i = 0; i < entries; ++i) {
> > if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > 1/* pages */, 1/* iswrite */, &page)) {
> > @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> > ua, unsigned long entries,
> > }
> > }
> >  populate:
> > +   pageshift = PAGE_SHIFT;
> > +   if (PageCompound(page))
> > +   pageshift += compound_order(compound_head(page));
> > +   mem->pageshift = min_t(unsigned int, mem->pageshift, 
> > pageshift);  
> 
> Why not make mem->pageshift and pageshift local the same type to avoid
> the min_t() ?

I was under the impression that min() is deprecated (maybe I
misinterpreted checkpatch.pl) and therefore did not pay attention to
it. I can fix this and repost if there is no other question.


> 
> > +
> > mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> > }
> >  
> > @@ -349,7 +357,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct 
> > mm_struct *mm,
> >  EXPORT_SYMBOL_GPL(mm_iommu_find);
> >  
> >  long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> > -   unsigned long ua, unsigned long *hpa)
> > +   unsigned long ua, unsigned int pageshift, unsigned long *hpa)
> >  {
> > const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> > u64 *va = &mem->hpas[entry];
> > @@ -357,6 +365,9 @@ long mm_iommu_ua_to_hpa(struct 
> > mm_iommu_table_group_mem_t *mem,
> > if (entry >= mem->entries)
> > return -EFAULT;
> >  
> > +   if (pageshift > mem->pageshift)
> > +   return -EFAULT;
> > +
> > *hpa = *va | (ua & ~PAGE_MASK);
> >  
> > return 0;
> > @@ -364,7 +375,7 @@ long mm_iommu_ua_to_hpa(struct 
> > mm_iommu_table_group_mem_t *mem,
> >  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
> >  
> >  long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> > -   unsigned long ua, unsigned long *hpa)
> > +   unsigned long ua, unsigned int pageshift, unsigned long *hpa)
> >  {
> > const long entry = (ua - mem->ua) >> PAGE_SHIFT;
> >

Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Mahesh Jagannath Salgaonkar
On 06/29/2018 02:35 AM, kbuild test robot wrote:
> Hi Mahesh,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v4.18-rc2 next-20180628]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-defconfig (attached as .config)
> compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=powerpc 
> 
> Note: the 
> linux-review/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101
>  HEAD 3496ae1afd6528103d508528e25bfca82c60f4ee builds fine.
>   It only hurts bisectibility.
> 
> All errors (new ones prefixed by >>):
> 
>arch/powerpc/platforms/pseries/ras.c: In function 
> 'mce_process_errlog_event':
>>> arch/powerpc/platforms/pseries/ras.c:433:8: error: implicit declaration of 
>>> function 'fwnmi_get_errlog'; did you mean 'fwnmi_get_errinfo'? 
>>> [-Werror=implicit-function-declaration]
>  err = fwnmi_get_errlog();
>^~~~
>fwnmi_get_errinfo
>>> arch/powerpc/platforms/pseries/ras.c:433:6: error: assignment makes pointer 
>>> from integer without a cast [-Werror=int-conversion]
>  err = fwnmi_get_errlog();
>  ^
>cc1: all warnings being treated as errors

Ouch... Looks like I pushed down the function definition while
rearranging the hunks. Will fix it in next revision. Thanks for catching
this.

Thanks,
-Mahesh.



Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Mahesh Jagannath Salgaonkar
On 06/28/2018 06:49 PM, Laurent Dufour wrote:
> On 28/06/2018 13:10, Mahesh J Salgaonkar wrote:
>> From: Mahesh Salgaonkar 
>>
>> rtas_log_buf is a buffer to hold RTAS event data that are communicated
>> to kernel by hypervisor. This buffer is then used to pass RTAS event
>> data to user through proc fs. This buffer is allocated from vmalloc
>> (non-linear mapping) area.
>>
>> On Machine check interrupt, register r3 points to RTAS extended event
>> log passed by hypervisor that contains the MCE event. The pseries
>> machine check handler then logs this error into rtas_log_buf. The
>> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
>> page fault (vector 0x300) while accessing it. Since machine check
>> interrupt handler runs in NMI context we can not afford to take any
>> page fault. Page faults are not honored in NMI context and causes
>> kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
>> also takes a spin_lock while logging error which is not safe in NMI
>> context. It may endup in deadlock if we get another MCE before releasing
>> the lock. Fix this by deferring the logging of rtas error to irq work queue.
>>
>> Current implementation uses two different buffers to hold rtas error log
>> depending on whether an extended log is provided or not. This makes it a
>> bit difficult to identify which buffer has valid data that needs to be
>> logged later in irq work. Simplify this by using a single buffer, one per
>> paca, and
>> copy rtas log to it irrespective of whether extended log is provided or
>> not. Allocate this buffer below RMA region so that it can be accessed
>> in real mode mce handler.
>>
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
>> interrupt")
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Mahesh Salgaonkar 
>> ---
>>  arch/powerpc/include/asm/paca.h|3 ++
>>  arch/powerpc/platforms/pseries/ras.c   |   39 
>> +---
>>  arch/powerpc/platforms/pseries/setup.c |   16 +
>>  3 files changed, 45 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/paca.h 
>> b/arch/powerpc/include/asm/paca.h
>> index 3f109a3e3edb..b441fef53077 100644
>> --- a/arch/powerpc/include/asm/paca.h
>> +++ b/arch/powerpc/include/asm/paca.h
>> @@ -251,6 +251,9 @@ struct paca_struct {
>>  void *rfi_flush_fallback_area;
>>  u64 l1d_flush_size;
>>  #endif
>> +#ifdef CONFIG_PPC_PSERIES
>> +u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
>> +#endif /* CONFIG_PPC_PSERIES */
>>  } cacheline_aligned;
>>
>>  extern void copy_mm_to_paca(struct mm_struct *mm);
>> diff --git a/arch/powerpc/platforms/pseries/ras.c 
>> b/arch/powerpc/platforms/pseries/ras.c
>> index 5e1ef9150182..f6ba9a2a4f84 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -22,6 +22,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #include 
>>  #include 
>> @@ -32,11 +33,13 @@
>>  static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
>>  static DEFINE_SPINLOCK(ras_log_buf_lock);
>>
>> -static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
>> -static DEFINE_PER_CPU(__u64, mce_data_buf);
>> -
>>  static int ras_check_exception_token;
>>
>> +static void mce_process_errlog_event(struct irq_work *work);
>> +static struct irq_work mce_errlog_process_work = {
>> +.func = mce_process_errlog_event,
>> +};
>> +
>>  #define EPOW_SENSOR_TOKEN   9
>>  #define EPOW_SENSOR_INDEX   0
>>
>> @@ -336,10 +339,9 @@ static irqreturn_t ras_error_interrupt(int irq, void 
>> *dev_id)
>>   * the actual r3 if possible, and a ptr to the error log entry
>>   * will be returned if found.
>>   *
>> - * If the RTAS error is not of the extended type, then we put it in a per
>> - * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
>> + * Use one buffer mce_data_buf per cpu to store RTAS error.
>>   *
>> - * The global_mce_data_buf does not have any locks or protection around it,
>> + * The mce_data_buf does not have any locks or protection around it,
>>   * if a second machine check comes in, or a system reset is done
>>   * before we have logged the error, then we will get corruption in the
>>   * error log.  This is preferable over holding off on calling
>> @@ -362,20 +364,19 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
>> pt_regs *regs)
>>  savep = __va(regs->gpr[3]);
>>  regs->gpr[3] = savep[0];/* restore original r3 */
>>
>> -/* If it isn't an extended log we can use the per cpu 64bit buffer */
>>  h = (struct rtas_error_log *)&savep[1];
>> +/* Use the per cpu buffer from paca to store rtas error log */
>> +memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
>>  if (!rtas_error_extended(h)) {
>> -memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
>> -errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
>> +memcpy(local_paca->mc

Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread David Gibson
On Fri, Jun 29, 2018 at 01:00:07PM +1000, Alexey Kardashevskiy wrote:
> On Fri, 29 Jun 2018 11:55:40 +1000
> Michael Ellerman  wrote:
> 
> > Alexey Kardashevskiy  writes:
> > 
> > > This is to improve page boundaries checking and should probably
> > > be cc:stable. I came across this while debugging nvlink2 passthrough
> > > but the lack of checking might be exploited by the existing userspace.  
> > 
> > Do you really mean "exploited" ? As in there's a security issue?
> > 
> > Your change log for patch 2 sort of suggests that but then says that
> > without the fix you just hit an error in vfio code.
> 
> 
> The bug is that I can easily make unmodified guest use 16MB IOMMU pages
> while guest RAM is backed with system 64K pages so unless the guest RAM
> is allocated contiguously (which is unlikely), a 16MB TCE will provide
> the hardware access to the host physical memory it is not supposed to
> have access to, which is 16MB minus first 64K.
> 
> There is a fast path for H_PUT_TCE - via KVM - where there is no
> containment test.
> 
> There is a slow path for H_PUT_TCE - via VFIO ioctl() - where there is a
> containment test.
> 
> Because of a different feature of VFIO on sPAPR (it stores an array of
> userspace addresses which we received from QEMU and translated to host
> physical addresses and programmed those to the TCE table) we do not take
> the fast path on the very first H_PUT_TCE (because I allocate the
> array when the slow path is taken the very first time), fail there,
> pass the failure to the guest, and the guest decides that it is over.
> 
> But a modified guest could ignore that initial H_PUT_TCE failure and
> simply repeat H_PUT_TCE again - this time it will take the fast path
> and allow the bad mapping.

In short, yes, it's an exploitable security hole in the host.  An
unmodified Linux guest kernel just doesn't happen to exploit it, even
if the guest userspace tries to get it to.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v2 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread David Gibson
On Tue, Jun 26, 2018 at 03:59:26PM +1000, Alexey Kardashevskiy wrote:
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory.
> 
> However we do not have this check in KVM fastpath (H_PUT_TCE accelerated
> code) so the user space can pin memory backed with 64k pages and create
> a hardware TCE table with a bigger page size. We were lucky so far and
> did not hit this yet as the very first time the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls VFIO IOMMU driver and that fails
> because of the check in vfio_iommu_spapr_tce.c which is the really
> sustainable solution.
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * explicitly check for compound pages before calling compound_order()
> 
> ---
> The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> for IOMMU pages without checking the mmu pagesize and this will fail
> at 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> 
> With the change, mapping will fail in KVM and the guest will print:
> 
> mlx5_core :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 18 
> 1f returned 0 (liobn = 0x8001 starting addr = 800 0)
> mlx5_core :00:00.0: created tce table LIOBN 0x8001 for 
> /pci@8002000/ethernet@0
> mlx5_core :00:00.0: failed to map direct window for
> /pci@8002000/ethernet@0: -1

[snip]
> @@ -124,7 +125,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
> unsigned long entries,
>   struct mm_iommu_table_group_mem_t **pmem)
>  {
>   struct mm_iommu_table_group_mem_t *mem;
> - long i, j, ret = 0, locked_entries = 0;
> + long i, j, ret = 0, locked_entries = 0, pageshift;
>   struct page *page = NULL;
>  
>   mutex_lock(&mem_list_mutex);
> @@ -166,6 +167,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
> unsigned long entries,
>   goto unlock_exit;
>   }
>  
> + mem->pageshift = 30; /* start from 1G pages - the biggest we have */

What about 16G pages on an HPT system?

>   for (i = 0; i < entries; ++i) {
>   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
>   1/* pages */, 1/* iswrite */, &page)) {
> @@ -199,6 +202,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long 
> ua, unsigned long entries,
>   }
>   }
>  populate:
> + pageshift = PAGE_SHIFT;
> + if (PageCompound(page))
> + pageshift += compound_order(compound_head(page));
> + mem->pageshift = min_t(unsigned int, mem->pageshift, pageshift);

Why not make mem->pageshift and pageshift local the same type to avoid
the min_t() ?

> +
>   mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>   }
>  
> @@ -349,7 +357,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct 
> mm_struct *mm,
>  EXPORT_SYMBOL_GPL(mm_iommu_find);
>  
>  long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> - unsigned long ua, unsigned long *hpa)
> + unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
>   u64 *va = &mem->hpas[entry];
> @@ -357,6 +365,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t 
> *mem,
>   if (entry >= mem->entries)
>   return -EFAULT;
>  
> + if (pageshift > mem->pageshift)
> + return -EFAULT;
> +
>   *hpa = *va | (ua & ~PAGE_MASK);
>  
>   return 0;
> @@ -364,7 +375,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t 
> *mem,
>  EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
>  
>  long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> - unsigned long ua, unsigned long *hpa)
> + unsigned long ua, unsigned int pageshift, unsigned long *hpa)
>  {
>   const long entry = (ua - mem->ua) >> PAGE_SHIFT;
>   void *va = &mem->hpas[entry];
> @@ -373,6 +384,9 @@ long mm_iommu_ua_to_hpa_rm(struct 
> mm_iommu_table_group_mem_t *mem,
>   if (entry >= mem->entries)
>   return -EFAULT;
>  
> + if (pageshift > mem->pageshift)
> + return -EFAULT;
> +
>   pa = (void *) vmalloc_to_phys(va);
>   if (!pa)
>   return -EFAULT;
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 2da5f05..7cd63b0 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spa

Re: [PATCH v2 4/6] powerpc/pkeys: Preallocate execute-only key

2018-06-28 Thread Thiago Jung Bauermann


Hello,

My understanding is that this patch isn't upstream yet and it's not too
late for bikeshedding. Please ignore if this is not the case.

Ram Pai  writes:

> @@ -326,48 +330,7 @@ static inline bool pkey_allows_readwrite(int pkey)
>
>  int __execute_only_pkey(struct mm_struct *mm)
>  {
> - bool need_to_set_mm_pkey = false;
> - int execute_only_pkey = mm->context.execute_only_pkey;
> - int ret;
> -
> - /* Do we need to assign a pkey for mm's execute-only maps? */
> - if (execute_only_pkey == -1) {
> - /* Go allocate one to use, which might fail */
> - execute_only_pkey = mm_pkey_alloc(mm);
> - if (execute_only_pkey < 0)
> - return -1;
> - need_to_set_mm_pkey = true;
> - }
> -
> - /*
> -  * We do not want to go through the relatively costly dance to set AMR
> -  * if we do not need to. Check it first and assume that if the
> -  * execute-only pkey is readwrite-disabled than we do not have to set it
> -  * ourselves.
> -  */
> - if (!need_to_set_mm_pkey && !pkey_allows_readwrite(execute_only_pkey))
> - return execute_only_pkey;
> -
> - /*
> -  * Set up AMR so that it denies access for everything other than
> -  * execution.
> -  */
> - ret = __arch_set_user_pkey_access(current, execute_only_pkey,
> -   PKEY_DISABLE_ACCESS |
> -   PKEY_DISABLE_WRITE);
> - /*
> -  * If the AMR-set operation failed somehow, just return 0 and
> -  * effectively disable execute-only support.
> -  */
> - if (ret) {
> - mm_pkey_free(mm, execute_only_pkey);
> - return -1;
> - }
> -
> - /* We got one, store it and use it from here on out */
> - if (need_to_set_mm_pkey)
> - mm->context.execute_only_pkey = execute_only_pkey;
> - return execute_only_pkey;
> + return mm->context.execute_only_pkey;
>  }

There's no reason to have a separate __execute_only_pkey() function
anymore. Its single line can go directly in execute_only_pkey(), defined
in the header.
-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread Alexey Kardashevskiy
On Fri, 29 Jun 2018 11:55:40 +1000
Michael Ellerman  wrote:

> Alexey Kardashevskiy  writes:
> 
> > This is to improve page boundaries checking and should probably
> > be cc:stable. I came across this while debugging nvlink2 passthrough
> > but the lack of checking might be exploited by the existing userspace.  
> 
> Do you really mean "exploited" ? As in there's a security issue?
> 
> Your change log for patch 2 sort of suggests that but then says that
> without the fix you just hit an error in vfio code.


The bug is that I can easily make an unmodified guest use 16MB IOMMU
pages while guest RAM is backed with 64K system pages, so unless the
guest RAM is allocated contiguously (which is unlikely), a 16MB TCE
will give the hardware access to host physical memory it is not
supposed to have access to, namely 16MB minus the first 64K.

There is a fast path for H_PUT_TCE - via KVM - which has no
containment test.

There is a slow path for H_PUT_TCE - via the VFIO ioctl() - which
does have a containment test.

Because of a different feature of VFIO on sPAPR (it stores an array of
userspace addresses which we received from QEMU, translated to host
physical addresses, and programmed into the TCE table), we do not take
the fast path on the very first H_PUT_TCE (because I allocate the
array when the slow path is taken for the first time). That first call
fails, the failure is passed to the guest, and the guest decides it is
over.

But a modified guest could ignore that initial H_PUT_TCE failure and
simply repeat H_PUT_TCE again - this time it will take the fast path
and allow the bad mapping.


> So I'm not clear on what the exposure is.
> 
> cheers



--
Alexey


Re: [PATCH 1/2] powerpc/pkeys: preallocate execute_only key only if the key is available.

2018-06-28 Thread Thiago Jung Bauermann


Hello,

Ram Pai  writes:

> Key 2 is preallocated and reserved as the execute-only key. In rare
> cases, if key 2 is unavailable, mprotect(PROT_EXEC) will behave
> incorrectly. NOTE: mprotect(PROT_EXEC) uses the execute-only key.
>
> Ensure key 2 is available for preallocation before reserving it for
> execute_only purposes.  Problem noticed by Michael Ellerman.

Since "powerpc/pkeys: Preallocate execute-only key" isn't upstream yet,
this patch could be squashed into it.

> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/mm/pkeys.c |   14 +-
>  1 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> index cec990c..0b03914 100644
> --- a/arch/powerpc/mm/pkeys.c
> +++ b/arch/powerpc/mm/pkeys.c
> @@ -19,6 +19,7 @@
>  u64  pkey_amr_mask;  /* Bits in AMR not to be touched */
>  u64  pkey_iamr_mask; /* Bits in AMR not to be touched */
>  u64  pkey_uamor_mask;/* Bits in UMOR not to be touched */
> +int  execute_only_key = 2;
>
>  #define AMR_BITS_PER_PKEY 2
>  #define AMR_RD_BIT 0x1UL
> @@ -26,7 +27,6 @@
>  #define IAMR_EX_BIT 0x1UL
>  #define PKEY_REG_BITS (sizeof(u64)*8)
>  #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
> -#define EXECUTE_ONLY_KEY 2
>
>  static void scan_pkey_feature(void)
>  {
> @@ -122,8 +122,12 @@ int pkey_initialize(void)
>  #else
>   os_reserved = 0;
>  #endif
> +
> + if ((pkeys_total - os_reserved) <= execute_only_key)
> + execute_only_key = -1;
> +
>   /* Bits are in LE format. */
> - reserved_allocation_mask = (0x1 << 1) | (0x1 << EXECUTE_ONLY_KEY);
> + reserved_allocation_mask = (0x1 << 1) | (0x1 << execute_only_key);

My understanding is that left-shifting by a negative amount is undefined
behavior in C. A quick test tells me that at least on the couple of
machines I tested, 1 << -1 = 0. Does GCC guarantee that behavior? If so,
a comment pointing this out would make this less confusing.

>   initial_allocation_mask  = reserved_allocation_mask | (0x1 << PKEY_0);
>
>   /* register mask is in BE format */
> @@ -132,11 +136,11 @@ int pkey_initialize(void)
>
>   pkey_iamr_mask = ~0x0ul;
>   pkey_iamr_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> - pkey_iamr_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> + pkey_iamr_mask &= ~(0x3ul << pkeyshift(execute_only_key));
>
>   pkey_uamor_mask = ~0x0ul;
>   pkey_uamor_mask &= ~(0x3ul << pkeyshift(PKEY_0));
> - pkey_uamor_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
> + pkey_uamor_mask &= ~(0x3ul << pkeyshift(execute_only_key));

Here the behaviour is undefined in C as well, given that pkeyshift(-1) =
64, which is the total number of bits in the left operand. Does GCC
guarantee that the result will be 0 here as well?

--
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH] powerpc/mm: fix always true/false warning in slice.c

2018-06-28 Thread Michael Ellerman
Christophe Leroy  writes:

> This patch fixes the following warnings (obtained with make W=1).
>
> arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
> arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to 
> limited range of data type [-Werror=type-limits]
>   if (start < SLICE_LOW_TOP) {

Presumably only on 32-bit ?

> diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
> index 9530c6db406a..17c57760e06c 100644
> --- a/arch/powerpc/mm/slice.c
> +++ b/arch/powerpc/mm/slice.c
> @@ -79,7 +86,7 @@ static void slice_range_to_mask(unsigned long start, 
> unsigned long len,
>   - (1u << GET_LOW_SLICE_INDEX(start));
>   }
>  
> - if ((start + len) > SLICE_LOW_TOP) {
> + if (!slice_addr_is_low(end)) {
>   unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
>   unsigned long align_end = ALIGN(end, (1UL << SLICE_HIGH_SHIFT));
>   unsigned long count = GET_HIGH_SLICE_INDEX(align_end) - 
> start_index;

This worries me.

By casting before the comparison in the helper you squash the compiler
warning, but the code is still broken if (start + len) overflows.

Presumably that "never happens", but it just seems fishy.

The other similar check in that file does:

  if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {

Where SLICE_NUM_HIGH == 0 on 32-bit.


Could we fix the less than comparisons with SLICE_LOW_TOP with something
similar, eg:

if (!SLICE_NUM_HIGH || start < SLICE_LOW_TOP) {

ie. limit them to the 64-bit code?

cheers


Re: [PATCH kernel] powerpc/powernv/ioda2: Reduce upper limit for DMA window size

2018-06-28 Thread Michael Ellerman
Alexey Kardashevskiy  writes:

> We use PHB in mode1, which uses bit 59 to select the correct DMA
> window. However, there is mode2, which uses bits 59:55 and allows up
> to 32 DMA windows per PE.

Do we ever use mode2?

> Even though the documentation does not clearly specify this, it seems
> that the actual hardware does not support bits 59:55 even in mode1; in
> other words, we can create a window as big as 1<<58 but DMA simply
> won't work.

Can we get anything more solid than "seems that" ?

Is this documented somewhere to not work or you just found this by
testing?

> This reduces the upper limit from 59 to 55 bits to let the userspace know
> about the hardware limits.
>
> Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE 
> levels requested"

Stable?

cheers

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 92ca662..50e21d7 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2839,7 +2839,7 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, 
> __u64 bus_offset,
>   level_shift = entries_shift + 3;
>   level_shift = max_t(unsigned, level_shift, PAGE_SHIFT);
>  
> - if ((level_shift - 3) * levels + page_shift >= 60)
> + if ((level_shift - 3) * levels + page_shift >= 55)
>   return -EINVAL;
>  
>   /* Allocate TCE table */
> -- 
> 2.11.0


Re: [PATCH kernel v2 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page

2018-06-28 Thread Michael Ellerman
Alexey Kardashevskiy  writes:

> This is to improve page boundaries checking and should probably
> be cc:stable. I came across this while debugging nvlink2 passthrough
> but the lack of checking might be exploited by the existing userspace.

Do you really mean "exploited" ? As in there's a security issue?

Your change log for patch 2 sort of suggests that but then says that
without the fix you just hit an error in vfio code.

So I'm not clear on what the exposure is.

cheers


[PATCH] powerpc: Remove memtrace mmap

2018-06-28 Thread Michael Neuling
debugfs doesn't support mmap, so this code is never used.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/platforms/powernv/memtrace.c | 29 ---
 1 file changed, 29 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index b99283df85..f73101119e 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -47,38 +47,9 @@ static ssize_t memtrace_read(struct file *filp, char __user 
*ubuf,
return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
 }
 
-static bool valid_memtrace_range(struct memtrace_entry *dev,
-unsigned long start, unsigned long size)
-{
-   if ((start >= dev->start) &&
-   ((start + size) <= (dev->start + dev->size)))
-   return true;
-
-   return false;
-}
-
-static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
-{
-   unsigned long size = vma->vm_end - vma->vm_start;
-   struct memtrace_entry *dev = filp->private_data;
-
-   if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
-   return -EINVAL;
-
-   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-
-   if (remap_pfn_range(vma, vma->vm_start,
-   vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
-   size, vma->vm_page_prot))
-   return -EAGAIN;
-
-   return 0;
-}
-
 static const struct file_operations memtrace_fops = {
.llseek = default_llseek,
.read   = memtrace_read,
-   .mmap   = memtrace_mmap,
.open   = simple_open,
 };
 
-- 
2.17.1



Re: [PATCH 65/65] powerpc/mm/radix: Cputable update for radix

2018-06-28 Thread Benjamin Herrenschmidt
On Fri, 2016-04-01 at 15:04 +0530, Aneesh Kumar K.V wrote:
> 
> commit 9c9d8b4f6a2c2210c90cbb3f5c6d33b2a642e8d2
> Author: Aneesh Kumar K.V 
> Date:   Mon Feb 15 13:44:01 2016 +0530
> 
> powerpc/mm/radix: Cputable update for radix
> 
> With P9 Radix we need to do
> 
> * set UPRT = 1
> * set different TLB set count
> 
> In this patch we delay the UPRT=1 to early mmu init. We also update
> other cpu_spec callback there. The restore cpu callback is used to
> init secondary cpus and also during opal init. So we do a full
> radix variant for that, even though the only difference is UPRT=1
> 
> Signed-off-by: Aneesh Kumar K.V 

How are things working in the absence of a cputable/PVR match?

Remember we have a requirement to be able to boot existing OSes on
future chips, so Nick's new cpu-features node needs to be what we
test against.

> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index b546e6f28d44..3400ed884f10 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -347,6 +347,10 @@
>  #define   LPCR_LPES_SH   2
>  #define   LPCR_RMI 0x0002  /* real mode is cache inhibit */
>  #define   LPCR_HDICE   0x0001  /* Hyp Decr enable (HV,PR,EE) */
> +/*
> + * Used in asm code, hence we don't want to use PPC_BITCOUNT
> + */
> +#defineLPCR_UPRT (ASM_CONST(0x1) << 22)
>  #ifndef SPRN_LPID
>  #define SPRN_LPID0x13F   /* Logical Partition Identifier */
>  #endif
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
> b/arch/powerpc/kernel/cpu_setup_power.S
> index 584e119fa8b0..8d717954d0ca 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -117,6 +117,24 @@ _GLOBAL(__restore_cpu_power9)
>   mtlrr11
>   blr
>  
> +_GLOBAL(__restore_cpu_power9_uprt)
> + mflrr11
> + bl  __init_FSCR
> + mfmsr   r3
> + rldicl. r0,r3,4,63
> + mtlrr11
> + beqlr
> + li  r0,0
> + mtspr   SPRN_LPID,r0
> + mfspr   r3,SPRN_LPCR
> + ori r3, r3, LPCR_PECEDH
> + orisr3,r3,LPCR_UPRT@h
> + bl  __init_LPCR
> + bl  __init_HFSCR
> + bl  __init_tlb_power7
> + mtlrr11
> + blr
> +
>  __init_hvmode_206:
>   /* Disable CPU_FTR_HVMODE and exit if MSR:HV is not set */
>   mfmsr   r3
> diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
> index 6c662b8de90d..e009722d5914 100644
> --- a/arch/powerpc/kernel/cputable.c
> +++ b/arch/powerpc/kernel/cputable.c
> @@ -514,7 +514,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
>   .cpu_features   = CPU_FTRS_POWER9,
>   .cpu_user_features  = COMMON_USER_POWER9,
>   .cpu_user_features2 = COMMON_USER2_POWER9,
> - .mmu_features   = MMU_FTRS_POWER9,
> + .mmu_features   = MMU_FTRS_POWER9 | MMU_FTR_RADIX,
>   .icache_bsize   = 128,
>   .dcache_bsize   = 128,
>   .num_pmcs   = 6,
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index 92a66a2a9b85..f902ede263ab 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -75,6 +75,10 @@ void __flush_tlb_power9(unsigned int action)
>   flush_tlb_206(POWER9_TLB_SETS_HASH, action);
>  }
>  
> +void __flush_tlb_power9_radix(unsigned int action)
> +{
> + flush_tlb_206(POWER9_TLB_SETS_RADIX, action);
> +}
>  
>  /* flush SLBs and reload */
>  #ifdef CONFIG_PPC_MMU_STD_64
> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index bb1eb7d0911c..6e56051bf825 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -294,8 +294,20 @@ found:
>   return;
>  }
>  
> +extern void __restore_cpu_power9_uprt(void);
> +extern void __flush_tlb_power9_radix(unsigned int action);
>  void __init rearly_init_mmu(void)
>  {
> + unsigned long lpcr;
> + /*
> +  * setup LPCR UPRT based on mmu_features
> +  */
> + lpcr = mfspr(SPRN_LPCR);
> + mtspr(SPRN_LPCR, lpcr | LPCR_UPRT);
> + /* update cpu_spec to point to radix enabled callbacks */
> + cur_cpu_spec->cpu_restore = __restore_cpu_power9_uprt;
> + cur_cpu_spec->flush_tlb   = __flush_tlb_power9_radix;
> +
>  #ifdef CONFIG_PPC_64K_PAGES
>   /* PAGE_SIZE mappings */
>   mmu_virtual_psize = MMU_PAGE_64K;


Re: [PATCH v2 10/10] cxl: Remove abandonned capi support for the Mellanox CX4, final cleanup

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

Remove a few XSL/CX4 oddities which are no longer needed. A simple
revert of the initial commits was not possible (or not worth it) due
to the history of the code.

Signed-off-by: Frederic Barrat 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/context.c |  2 +-
  drivers/misc/cxl/cxl.h | 12 --
  drivers/misc/cxl/debugfs.c |  5 ---
  drivers/misc/cxl/pci.c | 75 +++---
  4 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 0355d42d367f..5fe529b43ebe 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -95,7 +95,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
 */
mutex_lock(&afu->contexts_lock);
idr_preload(GFP_KERNEL);
-   i = idr_alloc(&ctx->afu->contexts_idr, ctx, ctx->afu->adapter->min_pe,
+   i = idr_alloc(&ctx->afu->contexts_idr, ctx, 0,
  ctx->afu->num_procs, GFP_NOWAIT);
idr_preload_end();
mutex_unlock(&afu->contexts_lock);
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index aa453448201d..44bcfafbb579 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -93,11 +93,6 @@ static const cxl_p1_reg_t CXL_PSL_FIR_CNTL  = {0x0148};
  static const cxl_p1_reg_t CXL_PSL_DSNDCTL   = {0x0150};
  static const cxl_p1_reg_t CXL_PSL_SNWRALLOC = {0x0158};
  static const cxl_p1_reg_t CXL_PSL_TRACE = {0x0170};
-/* XSL registers (Mellanox CX4) */
-static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
-static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
-static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
-static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
  /* PSL registers - CAIA 2 */
  static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
  static const cxl_p1_reg_t CXL_XSL9_INV  = {0x0110};
@@ -695,7 +690,6 @@ struct cxl {
struct bin_attribute cxl_attr;
int adapter_num;
int user_irqs;
-   int min_pe;
u64 ps_size;
u16 psl_rev;
u16 base_image;
@@ -934,7 +928,6 @@ int cxl_debugfs_afu_add(struct cxl_afu *afu);
  void cxl_debugfs_afu_remove(struct cxl_afu *afu);
  void cxl_debugfs_add_adapter_regs_psl9(struct cxl *adapter, struct dentry 
*dir);
  void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry 
*dir);
-void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
  void cxl_debugfs_add_afu_regs_psl9(struct cxl_afu *afu, struct dentry *dir);
  void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir);
  
@@ -977,11 +970,6 @@ static inline void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter,

  {
  }
  
-static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter,

-   struct dentry *dir)
-{
-}
-
  static inline void cxl_debugfs_add_afu_regs_psl9(struct cxl_afu *afu, struct 
dentry *dir)
  {
  }
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c
index 1643850d2302..a1921d81593a 100644
--- a/drivers/misc/cxl/debugfs.c
+++ b/drivers/misc/cxl/debugfs.c
@@ -58,11 +58,6 @@ void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, 
struct dentry *dir)
debugfs_create_io_x64("trace", S_IRUSR | S_IWUSR, dir, 
_cxl_p1_addr(adapter, CXL_PSL_TRACE));
  }
  
-void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir)

-{
-   debugfs_create_io_x64("fec", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_XSL_FEC));
-}
-
  int cxl_debugfs_adapter_add(struct cxl *adapter)
  {
struct dentry *dir;
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 0ca818396524..6dfb4ed345d3 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -593,27 +593,7 @@ static int init_implementation_adapter_regs_psl8(struct 
cxl *adapter, struct pci
return 0;
  }
  
-static int init_implementation_adapter_regs_xsl(struct cxl *adapter, struct pci_dev *dev)

-{
-   u64 xsl_dsnctl;
-   u64 chipid;
-   u32 phb_index;
-   u64 capp_unit_id;
-   int rc;
-
-   rc = cxl_calc_capp_routing(dev, &chipid, &phb_index, &capp_unit_id);
-   if (rc)
-   return rc;
-
-   /* Tell XSL where to route data to */
-   xsl_dsnctl = 0x6000ULL | (chipid << (63-5));
-   xsl_dsnctl |= (capp_unit_id << (63-13));
-   cxl_p1_write(adapter, CXL_XSL_DSNCTL, xsl_dsnctl);
-
-   return 0;
-}
-
-/* PSL & XSL */
+/* PSL */
  #define TBSYNC_CAL(n) (((u64)n & 0x7) << (63-3))
  #define TBSYNC_CNT(n) (((u64)n & 0x7) << (63-6))
  /* For the PSL this is a multiple for 0 < n <= 7: */
@@ -625,21 +605,6 @@ static void write_timebase_ctrl_psl8(struct cxl *adapter)
 TBSYNC_CNT(2 * PSL_2048_250MHZ_CYCLES));
  }
  
-/* XSL */

-#define TBSYNC_ENA (1ULL << 63)
-/* For the XSL this is 2**n * 2000 clocks for 0 < n <= 6: *

Re: [PATCH v2 09/10] Revert "cxl: Allow a default context to be associated with an external pci_dev"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

Remove abandoned capi support for the Mellanox CX4.

This reverts commit a19bd79e31769626d288cc016e21a31b6f47bf6f.

Signed-off-by: Frederic Barrat 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/Makefile |  2 +-
  drivers/misc/cxl/base.c   | 35 ---
  drivers/misc/cxl/cxl.h|  6 --
  drivers/misc/cxl/main.c   |  2 --
  drivers/misc/cxl/phb.c| 44 ---
  drivers/misc/cxl/vphb.c   | 30 +++---
  include/misc/cxl-base.h   |  6 --
  7 files changed, 28 insertions(+), 97 deletions(-)
  delete mode 100644 drivers/misc/cxl/phb.c

diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile
index 502d41fc9ea5..5eea61b9584f 100644
--- a/drivers/misc/cxl/Makefile
+++ b/drivers/misc/cxl/Makefile
@@ -4,7 +4,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
  
  cxl-y+= main.o file.o irq.o fault.o native.o

  cxl-y += context.o sysfs.o pci.o trace.o
-cxl-y  += vphb.o phb.o api.o cxllib.o
+cxl-y  += vphb.o api.o cxllib.o
  cxl-$(CONFIG_PPC_PSERIES) += flash.o guest.o of.o hcalls.o
  cxl-$(CONFIG_DEBUG_FS)+= debugfs.o
  obj-$(CONFIG_CXL) += cxl.o
diff --git a/drivers/misc/cxl/base.c b/drivers/misc/cxl/base.c
index e1e80cb99ad9..7557835cdfcd 100644
--- a/drivers/misc/cxl/base.c
+++ b/drivers/misc/cxl/base.c
@@ -106,41 +106,6 @@ int cxl_update_properties(struct device_node *dn,
  }
  EXPORT_SYMBOL_GPL(cxl_update_properties);
  
-/*

- * API calls into the driver that may be called from the PHB code and must be
- * built in.
- */
-bool cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu)
-{
-   bool ret;
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return false;
-
-   ret = calls->cxl_pci_associate_default_context(dev, afu);
-
-   cxl_calls_put(calls);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(cxl_pci_associate_default_context);
-
-void cxl_pci_disable_device(struct pci_dev *dev)
-{
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return;
-
-   calls->cxl_pci_disable_device(dev);
-
-   cxl_calls_put(calls);
-}
-EXPORT_SYMBOL_GPL(cxl_pci_disable_device);
-
  static int __init cxl_base_init(void)
  {
struct device_node *np;
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index d95c2c98f2ab..aa453448201d 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -867,15 +867,9 @@ static inline bool cxl_is_power9_dd1(void)
  ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
loff_t off, size_t count);
  
-/* Internal functions wrapped in cxl_base to allow PHB to call them */

-bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu);
-void _cxl_pci_disable_device(struct pci_dev *dev);
  
  struct cxl_calls {

void (*cxl_slbia)(struct mm_struct *mm);
-   bool (*cxl_pci_associate_default_context)(struct pci_dev *dev, struct 
cxl_afu *afu);
-   void (*cxl_pci_disable_device)(struct pci_dev *dev);
-
struct module *owner;
  };
  int register_cxl_calls(struct cxl_calls *calls);
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index a7e83624034b..334223b802ee 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -104,8 +104,6 @@ static inline void cxl_slbia_core(struct mm_struct *mm)
  
  static struct cxl_calls cxl_calls = {

.cxl_slbia = cxl_slbia_core,
-   .cxl_pci_associate_default_context = _cxl_pci_associate_default_context,
-   .cxl_pci_disable_device = _cxl_pci_disable_device,
.owner = THIS_MODULE,
  };
  
diff --git a/drivers/misc/cxl/phb.c b/drivers/misc/cxl/phb.c

deleted file mode 100644
index 6ec69ada19f4..
--- a/drivers/misc/cxl/phb.c
+++ /dev/null
@@ -1,44 +0,0 @@
-/*
- * Copyright 2014-2016 IBM Corp.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include 
-#include "cxl.h"
-
-bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu)
-{
-   struct cxl_context *ctx;
-
-   /*
-* Allocate a context to do cxl things to. This is used for interrupts
-* in the peer model using a real phb, and if we eventually do DMA ops
-* in the virtual phb, we'll need a default context to attach them to.
-*/
-   ctx = cxl_dev_context_init(dev);
-   if (IS_ERR(ctx))
-   return false;
-   dev->dev.archdata.cxl_ctx = ctx;
-
-   return (cxl_ops->afu_check_and_enable(afu) == 0);
-}
-/* exported via cxl_base */
-
-void _cxl_pci_disable

Re: [PATCH v2 08/10] Revert "cxl: Add cxl_slot_is_supported API"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

Remove abandoned capi support for the Mellanox CX4.

This reverts commit 4e56f858bdde5cbfb70f61baddfaa56a8ed851bf.

Signed-off-by: Frederic Barrat 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/pci.c | 37 -
  include/misc/cxl.h | 15 ---
  2 files changed, 52 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 193ff22f610b..0ca818396524 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1808,43 +1808,6 @@ int cxl_slot_is_switched(struct pci_dev *dev)
return (depth > CXL_MAX_PCIEX_PARENT);
  }
  
-bool cxl_slot_is_supported(struct pci_dev *dev, int flags)

-{
-   if (!cpu_has_feature(CPU_FTR_HVMODE))
-   return false;
-
-   if ((flags & CXL_SLOT_FLAG_DMA) && (!pvr_version_is(PVR_POWER8NVL))) {
-   /*
-* CAPP DMA mode is technically supported on regular P8, but
-* will EEH if the card attempts to access memory < 4GB, which
-* we cannot realistically avoid. We might be able to work
-* around the issue, but until then return unsupported:
-*/
-   return false;
-   }
-
-   if (cxl_slot_is_switched(dev))
-   return false;
-
-   /*
-* XXX: This gets a little tricky on regular P8 (not POWER8NVL) since
-* the CAPP can be connected to PHB 0, 1 or 2 on a first come first
-* served basis, which is racy to check from here. If we need to
-* support this in future we might need to consider having this
-* function effectively reserve it ahead of time.
-*
-* Currently, the only user of this API is the Mellanox CX4, which is
-* only supported on P8NVL due to the above mentioned limitation of
-* CAPP DMA mode and therefore does not need to worry about this. If the
-* issue with CAPP DMA mode is later worked around on P8 we might need
-* to revisit this.
-*/
-
-   return true;
-}
-EXPORT_SYMBOL_GPL(cxl_slot_is_supported);
-
-
  static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
  {
struct cxl *adapter;
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index 74da2e440763..ea9ff4a1a9ca 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -24,21 +24,6 @@
   * generic PCI API. This API is agnostic to the actual AFU.
   */
  
-#define CXL_SLOT_FLAG_DMA 0x1

-
-/*
- * Checks if the given card is in a cxl capable slot. Pass CXL_SLOT_FLAG_DMA if
- * the card requires CAPP DMA mode to also check if the system supports it.
- * This is intended to be used by bi-modal devices to determine if they can use
- * cxl mode or if they should continue running in PCI mode.
- *
- * Note that this only checks if the slot is cxl capable - it does not
- * currently check if the CAPP is currently available for chips where it can be
- * assigned to different PHBs on a first come first serve basis (i.e. P8)
- */
-bool cxl_slot_is_supported(struct pci_dev *dev, int flags);
-
-
  /* Get the AFU associated with a pci_dev */
  struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev);
  



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v2 07/10] Revert "powerpc/powernv: Add support for the cxl kernel api on the real phb"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit 4361b03430d685610e5feea3ec7846e8b9ae795f.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  arch/powerpc/include/asm/pnv-pci.h|   7 --
  arch/powerpc/platforms/powernv/pci-cxl.c  | 115 --
  arch/powerpc/platforms/powernv/pci-ioda.c |  18 +---
  arch/powerpc/platforms/powernv/pci.h  |  13 ---
  4 files changed, 1 insertion(+), 152 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index d2d8c28db336..7f627e3f4da4 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -50,13 +50,6 @@ int pnv_cxl_alloc_hwirq_ranges(struct cxl_irq_ranges *irqs,
   struct pci_dev *dev, int num);
  void pnv_cxl_release_hwirq_ranges(struct cxl_irq_ranges *irqs,
  struct pci_dev *dev);
-
-/* Support for the cxl kernel api on the real PHB (instead of vPHB) */
-int pnv_cxl_enable_phb_kernel_api(struct pci_controller *hose, bool enable);
-bool pnv_pci_on_cxl_phb(struct pci_dev *dev);
-struct cxl_afu *pnv_cxl_phb_to_afu(struct pci_controller *hose);
-void pnv_cxl_phb_set_peer_afu(struct pci_dev *dev, struct cxl_afu *afu);
-
  #endif
  
  struct pnv_php_slot {

diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c 
b/arch/powerpc/platforms/powernv/pci-cxl.c
index c447b7f03c09..1b18111453d7 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -8,10 +8,8 @@
   */
  
  #include 

-#include 
  #include 
  #include 
-#include 
  
  #include "pci.h"
  
@@ -178,116 +176,3 @@ static inline int get_cxl_module(void)

  #else
  static inline int get_cxl_module(void) { return 0; }
  #endif
-
-/*
- * Sets flags and switches the controller ops to enable the cxl kernel api.
- * Originally the cxl kernel API operated on a virtual PHB, but certain cards
- * such as the Mellanox CX4 use a peer model instead and for these cards the
- * cxl kernel api will operate on the real PHB.
- */
-int pnv_cxl_enable_phb_kernel_api(struct pci_controller *hose, bool enable)
-{
-   struct pnv_phb *phb = hose->private_data;
-   int rc;
-
-   if (!enable) {
-   /*
-* Once cxl mode is enabled on the PHB, there is currently no
-* known safe method to disable it again, and trying risks a
-* checkstop. If we can find a way to safely disable cxl mode
-* in the future we can revisit this, but for now the only sane
-* thing to do is to refuse to disable cxl mode:
-*/
-   return -EPERM;
-   }
-
-   /*
-* Hold a reference to the cxl module since several PHB operations now
-* depend on it, and it would be insane to allow it to be removed so
-* long as we are in this mode (and since we can't safely disable this
-* mode once enabled...).
-*/
-   rc = get_cxl_module();
-   if (rc)
-   return rc;
-
-   phb->flags |= PNV_PHB_FLAG_CXL;
-   hose->controller_ops = pnv_cxl_cx4_ioda_controller_ops;
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_enable_phb_kernel_api);
-
-bool pnv_pci_on_cxl_phb(struct pci_dev *dev)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-
-   return !!(phb->flags & PNV_PHB_FLAG_CXL);
-}
-EXPORT_SYMBOL_GPL(pnv_pci_on_cxl_phb);
-
-struct cxl_afu *pnv_cxl_phb_to_afu(struct pci_controller *hose)
-{
-   struct pnv_phb *phb = hose->private_data;
-
-   return (struct cxl_afu *)phb->cxl_afu;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_phb_to_afu);
-
-void pnv_cxl_phb_set_peer_afu(struct pci_dev *dev, struct cxl_afu *afu)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-
-   phb->cxl_afu = afu;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_phb_set_peer_afu);
-
-/*
- * In the peer cxl model, the XSL/PSL is physical function 0, and will be used
- * by other functions on the device for memory access and interrupts. When the
- * other functions are enabled we explicitly take a reference on the cxl
- * function since they will use it, and allocate a default context associated
- * with that function just like the vPHB model of the cxl kernel API.
- */
-bool pnv_cxl_enable_device_hook(struct pci_dev *dev)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct cxl_afu *afu = phb->cxl_afu;
-
-   if (!pnv_pci_enable_device_hook(dev))
-   return false;
-
-
-   /* No special handling for the cxl function, which is always PF 0 */
-   if (PCI_FUNC(dev->devfn) == 0)
-   return true;
-
-   if (!afu) {
-   dev_WARN(&dev->dev, "Attempted to enable

Re: [PATCH v2 06/10] Revert "cxl: Add support for using the kernel API with a real PHB"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit 317f5ef1b363417b6f1e93b90dfd2ffd6be6e867.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/pci.c  |  3 ---
  drivers/misc/cxl/vphb.c | 16 ++--
  2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 9c5a21fee835..193ff22f610b 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1886,9 +1886,6 @@ static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
dev_err(&dev->dev, "AFU %i failed to start: %i\n", 
slice, rc);
}
  
-	if (pnv_pci_on_cxl_phb(dev) && adapter->slices >= 1)

-   pnv_cxl_phb_set_peer_afu(dev, adapter->afu[0]);
-
return 0;
  }
  
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c

index 7fd0bdc1436a..1a99c9c7a6fb 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -9,7 +9,6 @@
  
  #include 

  #include 
-#include 
  #include "cxl.h"
  
  static int cxl_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)

@@ -284,18 +283,13 @@ void cxl_pci_vphb_remove(struct cxl_afu *afu)
 */
  }
  
-static bool _cxl_pci_is_vphb_device(struct pci_controller *phb)

-{
-   return (phb->ops == &cxl_pcie_pci_ops);
-}
-
  bool cxl_pci_is_vphb_device(struct pci_dev *dev)
  {
struct pci_controller *phb;
  
  	phb = pci_bus_to_host(dev->bus);
  
-	return _cxl_pci_is_vphb_device(phb);

+   return (phb->ops == &cxl_pcie_pci_ops);
  }
  
  struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev)

@@ -304,13 +298,7 @@ struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev)
  
  	phb = pci_bus_to_host(dev->bus);
  
-	if (_cxl_pci_is_vphb_device(phb))

-   return (struct cxl_afu *)phb->private_data;
-
-   if (pnv_pci_on_cxl_phb(dev))
-   return pnv_cxl_phb_to_afu(phb);
-
-   return ERR_PTR(-ENODEV);
+   return (struct cxl_afu *)phb->private_data;
  }
  EXPORT_SYMBOL_GPL(cxl_pci_to_afu);
  



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v2 05/10] Revert "cxl: Add cxl_check_and_switch_mode() API to switch bi-modal cards"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit b0b5e5918ad1babfd1d43d98c7281926a7b57b9f.

Signed-off-by: Alastair D'Silva 


I was kinda proud of how dodgy this was and yet how it actually worked...

(Hmm, I should go back and see if there's anything we can rip out of 
pnv_php now...)


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/Kconfig |   8 --
  drivers/misc/cxl/pci.c   | 236 +++
  include/misc/cxl.h   |  25 -
  3 files changed, 18 insertions(+), 251 deletions(-)

diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index 93397cb05b15..3ce933707828 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -33,11 +33,3 @@ config CXL
  CAPI adapters are found in POWER8 based systems.
  
  	  If unsure, say N.

-
-config CXL_BIMODAL
-   bool "Support for bi-modal CAPI cards"
-   depends on HOTPLUG_PCI_POWERNV = y && CXL || HOTPLUG_PCI_POWERNV = m && CXL = m
-   default y
-   help
- Select this option to enable support for bi-modal CAPI cards, such as
- the Mellanox CX-4.
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 429d6de1dde7..9c5a21fee835 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -55,8 +55,6 @@
pci_read_config_byte(dev, vsec + 0xa, dest)
  #define CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val) \
pci_write_config_byte(dev, vsec + 0xa, val)
-#define CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, vsec, val) \
-   pci_bus_write_config_byte(bus, devfn, vsec + 0xa, val)
  #define CXL_VSEC_PROTOCOL_MASK   0xe0
  #define CXL_VSEC_PROTOCOL_1024TB 0x80
  #define CXL_VSEC_PROTOCOL_512TB  0x40
@@ -800,234 +798,36 @@ static int setup_cxl_bars(struct pci_dev *dev)
return 0;
  }
  
-#ifdef CONFIG_CXL_BIMODAL

-
-struct cxl_switch_work {
-   struct pci_dev *dev;
-   struct work_struct work;
-   int vsec;
-   int mode;
-};
-
-static void switch_card_to_cxl(struct work_struct *work)
+/* pciex node: ibm,opal-m64-window = <0x3d058 0x0 0x3d058 0x0 0x8 0x0>; */
+static int switch_card_to_cxl(struct pci_dev *dev)
  {
-   struct cxl_switch_work *switch_work =
-   container_of(work, struct cxl_switch_work, work);
-   struct pci_dev *dev = switch_work->dev;
-   struct pci_bus *bus = dev->bus;
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pci_dev *bridge;
-   struct pnv_php_slot *php_slot;
-   unsigned int devfn;
+   int vsec;
u8 val;
int rc;
  
-	dev_info(&bus->dev, "cxl: Preparing for mode switch...\n");

-   bridge = list_first_entry_or_null(&hose->bus->devices, struct pci_dev,
- bus_list);
-   if (!bridge) {
-   dev_WARN(&bus->dev, "cxl: Couldn't find root port!\n");
-   goto err_dev_put;
-   }
+   dev_info(&dev->dev, "switch card to CXL\n");
  
-	php_slot = pnv_php_find_slot(pci_device_to_OF_node(bridge));

-   if (!php_slot) {
-   dev_err(&bus->dev, "cxl: Failed to find slot hotplug "
-  "information. You may need to upgrade "
-  "skiboot. Aborting.\n");
-   goto err_dev_put;
-   }
-
-   rc = CXL_READ_VSEC_MODE_CONTROL(dev, switch_work->vsec, &val);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to read CAPI mode control: %i\n", rc);
-   goto err_dev_put;
-   }
-   devfn = dev->devfn;
-
-   /* Release the reference obtained in cxl_check_and_switch_mode() */
-   pci_dev_put(dev);
-
-   dev_dbg(&bus->dev, "cxl: Removing PCI devices from kernel\n");
-   pci_lock_rescan_remove();
-   pci_hp_remove_devices(bridge->subordinate);
-   pci_unlock_rescan_remove();
-
-   /* Switch the CXL protocol on the card */
-   if (switch_work->mode == CXL_BIMODE_CXL) {
-   dev_info(&bus->dev, "cxl: Switching card to CXL mode\n");
-   val &= ~CXL_VSEC_PROTOCOL_MASK;
-   val |= CXL_VSEC_PROTOCOL_256TB | CXL_VSEC_PROTOCOL_ENABLE;
-   rc = pnv_cxl_enable_phb_kernel_api(hose, true);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to enable kernel API"
-  " on real PHB, aborting\n");
-   goto err_free_work;
-   }
-   } else {
-   dev_WARN(&bus->dev, "cxl: Switching card to PCI mode not supported!\n");
-   goto err_free_work;
-   }
-
-   rc = CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, switch_work->vsec, val);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to configure CXL protocol: %i\n", rc);
-   goto err_free_work;
-   }
-
-   /*
-* The CAIA spec (v1.1, Section 10.6 Bi-modal Device Support) 

Re: [PATCH v2 04/10] Revert "cxl: Add kernel APIs to get & set the max irqs per context"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit 79384e4b71240abf50c375eea56060b0d79c242a.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/api.c | 27 ---
  1 file changed, 27 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 34ba67bc41bd..a535c1e6aa92 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -552,30 +552,3 @@ ssize_t cxl_read_adapter_vpd(struct pci_dev *dev, void *buf, size_t count)
return cxl_ops->read_adapter_vpd(afu->adapter, buf, count);
  }
  EXPORT_SYMBOL_GPL(cxl_read_adapter_vpd);
-
-int cxl_set_max_irqs_per_process(struct pci_dev *dev, int irqs)
-{
-   struct cxl_afu *afu = cxl_pci_to_afu(dev);
-   if (IS_ERR(afu))
-   return -ENODEV;
-
-   if (irqs > afu->adapter->user_irqs)
-   return -EINVAL;
-
-   /* Limit user_irqs to prevent the user increasing this via sysfs */
-   afu->adapter->user_irqs = irqs;
-   afu->irqs_max = irqs;
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(cxl_set_max_irqs_per_process);
-
-int cxl_get_max_irqs_per_process(struct pci_dev *dev)
-{
-   struct cxl_afu *afu = cxl_pci_to_afu(dev);
-   if (IS_ERR(afu))
-   return -ENODEV;
-
-   return afu->irqs_max;
-}
-EXPORT_SYMBOL_GPL(cxl_get_max_irqs_per_process);



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v2 03/10] Revert "cxl: Add preliminary workaround for CX4 interrupt limitation"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit cbce0917e2e47d4bf5aa3b5fd6b1247f33e1a126.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/api.c | 15 ---
  drivers/misc/cxl/base.c| 17 -
  drivers/misc/cxl/context.c |  1 -
  drivers/misc/cxl/cxl.h | 10 --
  drivers/misc/cxl/main.c|  1 -
  include/misc/cxl.h | 20 
  6 files changed, 64 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 2e5862b7a074..34ba67bc41bd 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -181,21 +181,6 @@ static irq_hw_number_t cxl_find_afu_irq(struct cxl_context *ctx, int num)
return 0;
  }
  
-int _cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq)

-{
-   if (*ctx == NULL || *afu_irq == 0) {
-   *afu_irq = 1;
-   *ctx = cxl_get_context(pdev);
-   } else {
-   (*afu_irq)++;
-   if (*afu_irq > cxl_get_max_irqs_per_process(pdev)) {
-   *ctx = list_next_entry(*ctx, extra_irq_contexts);
-   *afu_irq = 1;
-   }
-   }
-   return cxl_find_afu_irq(*ctx, *afu_irq);
-}
-/* Exported via cxl_base */
  
  int cxl_set_priv(struct cxl_context *ctx, void *priv)

  {
diff --git a/drivers/misc/cxl/base.c b/drivers/misc/cxl/base.c
index fe90f895bb10..e1e80cb99ad9 100644
--- a/drivers/misc/cxl/base.c
+++ b/drivers/misc/cxl/base.c
@@ -141,23 +141,6 @@ void cxl_pci_disable_device(struct pci_dev *dev)
  }
  EXPORT_SYMBOL_GPL(cxl_pci_disable_device);
  
-int cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq)

-{
-   int ret;
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return -EBUSY;
-
-   ret = calls->cxl_next_msi_hwirq(pdev, ctx, afu_irq);
-
-   cxl_calls_put(calls);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(cxl_next_msi_hwirq);
-
  static int __init cxl_base_init(void)
  {
struct device_node *np;
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index c6ec872800a2..0355d42d367f 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -74,7 +74,6 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master)
ctx->pending_afu_err = false;
  
  	INIT_LIST_HEAD(&ctx->irq_names);

-   INIT_LIST_HEAD(&ctx->extra_irq_contexts);
  
  	/*

 * When we have to destroy all contexts in cxl_context_detach_all() we
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 9688fe8b4d80..d95c2c98f2ab 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -623,14 +623,6 @@ struct cxl_context {
  
  	struct rcu_head rcu;
  
-	/*

-* Only used when more interrupts are allocated via
-* pci_enable_msix_range than are supported in the default context, to
-* use additional contexts to overcome the limitation. i.e. Mellanox
-* CX4 only:
-*/
-   struct list_head extra_irq_contexts;
-
struct mm_struct *mm;
  
  	u16 tidr;

@@ -878,13 +870,11 @@ ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
  /* Internal functions wrapped in cxl_base to allow PHB to call them */
bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu *afu);
  void _cxl_pci_disable_device(struct pci_dev *dev);
-int _cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq);
  
  struct cxl_calls {

void (*cxl_slbia)(struct mm_struct *mm);
	bool (*cxl_pci_associate_default_context)(struct pci_dev *dev, struct cxl_afu *afu);
void (*cxl_pci_disable_device)(struct pci_dev *dev);
-   int (*cxl_next_msi_hwirq)(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq);
  
  	struct module *owner;

  };
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index 59a904efd104..a7e83624034b 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -106,7 +106,6 @@ static struct cxl_calls cxl_calls = {
.cxl_slbia = cxl_slbia_core,
.cxl_pci_associate_default_context = _cxl_pci_associate_default_context,
.cxl_pci_disable_device = _cxl_pci_disable_device,
-   .cxl_next_msi_hwirq = _cxl_next_msi_hwirq,
.owner = THIS_MODULE,
  };
  
diff --git a/include/misc/cxl.h b/include/misc/cxl.h

index 82cc6ffafe2d..6a3711a2e217 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -183,26 +183,6 @@ void cxl_psa_unmap(void __iomem *addr);
  /*  Get the process element for this context */
  int cxl_process_element(struct cxl_context *ctx);
  
-/*

- * Limit the number of interrupts that a single context can allocate via
- * cxl_start_work. If using the api with a real phb, this may be used to
- * request that a

Re: [PATCH v2 02/10] Revert "cxl: Add support for interrupts on the Mellanox CX4"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit a2f67d5ee8d950caaa7a6144cf0bfb256500b73e.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  arch/powerpc/platforms/powernv/pci-cxl.c  | 84 ---
  arch/powerpc/platforms/powernv/pci-ioda.c |  4 --
  arch/powerpc/platforms/powernv/pci.h  |  2 -
  drivers/misc/cxl/api.c| 71 ---
  drivers/misc/cxl/base.c   | 31 -
  drivers/misc/cxl/cxl.h|  4 --
  drivers/misc/cxl/main.c   |  2 -
  include/misc/cxl-base.h   |  4 --
  8 files changed, 202 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c b/arch/powerpc/platforms/powernv/pci-cxl.c
index cee003de63af..c447b7f03c09 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -8,7 +8,6 @@
   */
  
  #include 

-#include 
  #include 
  #include 
  #include 
@@ -292,86 +291,3 @@ void pnv_cxl_disable_device(struct pci_dev *dev)
cxl_pci_disable_device(dev);
cxl_afu_put(afu);
  }
-
-/*
- * This is a special version of pnv_setup_msi_irqs for cards in cxl mode. This
- * function handles setting up the IVTE entries for the XSL to use.
- *
- * We are currently not filling out the MSIX table, since the only currently
- * supported adapter (CX4) uses a custom MSIX table format in cxl mode and it
- * is up to their driver to fill that out. In the future we may fill out the
- * MSIX table (and change the IVTE entries to be an index to the MSIX table)
- * for adapters implementing the Full MSI-X mode described in the CAIA.
- */
-int pnv_cxl_cx4_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct msi_desc *entry;
-   struct cxl_context *ctx = NULL;
-   unsigned int virq;
-   int hwirq;
-   int afu_irq = 0;
-   int rc;
-
-   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
-   return -ENODEV;
-
-   if (pdev->no_64bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   rc = cxl_cx4_setup_msi_irqs(pdev, nvec, type);
-   if (rc)
-   return rc;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->msi_attrib.is_64 && !phb->msi32_support) {
-   pr_warn("%s: Supports only 64-bit MSIs\n",
-   pci_name(pdev));
-   return -ENXIO;
-   }
-
-   hwirq = cxl_next_msi_hwirq(pdev, &ctx, &afu_irq);
-   if (WARN_ON(hwirq <= 0))
-   return (hwirq ? hwirq : -ENOMEM);
-
-   virq = irq_create_mapping(NULL, hwirq);
-   if (!virq) {
-   pr_warn("%s: Failed to map cxl mode MSI to linux irq\n",
-   pci_name(pdev));
-   return -ENOMEM;
-   }
-
-   rc = pnv_cxl_ioda_msi_setup(pdev, hwirq, virq);
-   if (rc) {
-   pr_warn("%s: Failed to setup cxl mode MSI\n", pci_name(pdev));
-   irq_dispose_mapping(virq);
-   return rc;
-   }
-
-   irq_set_msi_desc(virq, entry);
-   }
-
-   return 0;
-}
-
-void pnv_cxl_cx4_teardown_msi_irqs(struct pci_dev *pdev)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct msi_desc *entry;
-   irq_hw_number_t hwirq;
-
-   if (WARN_ON(!phb))
-   return;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->irq)
-   continue;
-   hwirq = virq_to_hw(entry->irq);
-   irq_set_msi_desc(entry->irq, NULL);
-   irq_dispose_mapping(entry->irq);
-   }
-
-   cxl_cx4_teardown_msi_irqs(pdev);
-}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6681bc..41f8f0ff4a55 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3847,10 +3847,6 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
  const struct pci_controller_ops pnv_cxl_cx4_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_dma_dev_setup,
.dma_bus_setup  = pnv_pci_dma_bus_setup,
-#ifdef CONFIG_PCI_MSI
-   .setup_msi_irqs = pnv_cxl_cx4_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_cxl_cx4_teardown_msi_irqs,
-#endif
.enable_device_hook = pnv_cxl_enable_device_hook,
.disable_device = pnv_cxl_disable_device,
.release_device = pnv_pci_release_device,
diff --git a/arch/powerpc/platforms/powernv/pc

Re: [PATCH v2 01/10] Revert "cxl: Add kernel API to allow a context to operate with relocate disabled"

2018-06-28 Thread Andrew Donnellan

On 28/06/18 20:05, Frederic Barrat wrote:

From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.
The symbol 'cxl_set_translation_mode' is never called, so
ctx->real_mode is always false.

This reverts commit 7a0d85d313c2066712e530e668bc02bb741a685c.

Signed-off-by: Alastair D'Silva 


Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/api.c| 19 ---
  drivers/misc/cxl/cxl.h|  1 -
  drivers/misc/cxl/guest.c  |  3 ---
  drivers/misc/cxl/native.c |  3 ++-
  include/misc/cxl.h|  8 
  5 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 753b1a698fc4..21d620e29fea 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -324,7 +324,6 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
if (task) {
ctx->pid = get_task_pid(task, PIDTYPE_PID);
kernel = false;
-   ctx->real_mode = false;
  
  		/* acquire a reference to the task's mm */

ctx->mm = get_task_mm(current);
@@ -388,24 +387,6 @@ void cxl_set_master(struct cxl_context *ctx)
  }
  EXPORT_SYMBOL_GPL(cxl_set_master);
  
-int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode)

-{
-   if (ctx->status == STARTED) {
-   /*
-* We could potentially update the PE and issue an update LLCMD
-* to support this, but it doesn't seem to have a good use case
-* since it's trivial to just create a second kernel context
-* with different translation modes, so until someone convinces
-* me otherwise:
-*/
-   return -EBUSY;
-   }
-
-   ctx->real_mode = real_mode;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(cxl_set_translation_mode);
-
  /* wrappers around afu_* file ops which are EXPORTED */
  int cxl_fd_open(struct inode *inode, struct file *file)
  {
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 918d4fb742d1..af8794719956 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -613,7 +613,6 @@ struct cxl_context {
bool pe_inserted;
bool master;
bool kernel;
-   bool real_mode;
bool pending_irq;
bool pending_fault;
bool pending_afu_err;
diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 4644f16606a3..f5dc740fcd13 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -623,9 +623,6 @@ static int guest_attach_process(struct cxl_context *ctx, bool kernel, u64 wed, u
  {
pr_devel("in %s\n", __func__);
  
-	if (ctx->real_mode)

-   return -EPERM;
-
ctx->kernel = kernel;
if (ctx->afu->current_mode == CXL_MODE_DIRECTED)
return attach_afu_directed(ctx, wed, amr);
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 98f867fcef24..c9d5d82dce8e 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -605,6 +605,7 @@ u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9)
sr |= CXL_PSL_SR_An_MP;
if (mfspr(SPRN_LPCR) & LPCR_TC)
sr |= CXL_PSL_SR_An_TC;
+
if (kernel) {
if (!real_mode)
sr |= CXL_PSL_SR_An_R;
@@ -629,7 +630,7 @@ u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9)
  
  static u64 calculate_sr(struct cxl_context *ctx)

  {
-   return cxl_calculate_sr(ctx->master, ctx->kernel, ctx->real_mode,
+   return cxl_calculate_sr(ctx->master, ctx->kernel, false,
cxl_is_power9());
  }
  
diff --git a/include/misc/cxl.h b/include/misc/cxl.h

index b712be544f8c..82cc6ffafe2d 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -173,14 +173,6 @@ int cxl_afu_reset(struct cxl_context *ctx);
   */
  void cxl_set_master(struct cxl_context *ctx);
  
-/*

- * Sets the context to use real mode memory accesses to operate with
- * translation disabled. Note that this only makes sense for kernel contexts
- * under bare metal, and will not work with virtualisation. May only be
- * performed on stopped contexts.
- */
-int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode);
-
  /*
   * Map and unmap the AFU Problem Space area. The amount and location mapped
   * depends on if this context is a master or slave.



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread kbuild test robot
Hi Mahesh,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.18-rc2 next-20180628]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

Note: the linux-review/Mahesh-J-Salgaonkar/powerpc-pseries-Defer-the-logging-of-rtas-error-to-irq-work-queue/20180628-224101 HEAD 3496ae1afd6528103d508528e25bfca82c60f4ee builds fine.
It only hurts bisectability.

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/pseries/ras.c: In function 'mce_process_errlog_event':
>> arch/powerpc/platforms/pseries/ras.c:433:8: error: implicit declaration of function 'fwnmi_get_errlog'; did you mean 'fwnmi_get_errinfo'? [-Werror=implicit-function-declaration]
 err = fwnmi_get_errlog();
   ^~~~
   fwnmi_get_errinfo
>> arch/powerpc/platforms/pseries/ras.c:433:6: error: assignment makes pointer from integer without a cast [-Werror=int-conversion]
 err = fwnmi_get_errlog();
 ^
   cc1: all warnings being treated as errors

vim +433 arch/powerpc/platforms/pseries/ras.c

   425  
   426  /*
   427   * Process MCE rtas errlog event.
   428   */
   429  static void mce_process_errlog_event(struct irq_work *work)
   430  {
   431  struct rtas_error_log *err;
   432  
 > 433  err = fwnmi_get_errlog();
   434  log_error((char *)err, ERR_TYPE_RTAS_LOG, 0);
   435  }
   436  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: Oops in kmem_cache_free() via bioset_exit() (was Re: [next-20180601][nvme][ppc] Kernel Oops is triggered when creating lvm snapshots on nvme disks)

2018-06-28 Thread Jens Axboe
On 6/28/18 8:42 AM, Michael Ellerman wrote:
> Kent, Jens,
> 
> This looks like it might be related to the recent bioset changes?
> 
> cheers
> 
> Abdul Haleem  writes:
>> On Tue, 2018-06-26 at 23:36 +1000, Michael Ellerman wrote:
>>> Abdul Haleem  writes:
> ...
>> I was able to reproduce again with slub_debug=FZP and DEBUG_INFO enabled
>> on 4.17.0-rc7-next-20180601, but not much traces other than the Oops stack 
>> trace
> 
> Are you still testing on that revision? It's nearly a month old.
> 
> Please try to reproduce on mainline or today's linux-next.
> 
> 
>> the faulty instruction points to below code path :
>>
>> gdb -batch vmlinux -ex 'list *(0xc0304fe0)'
>> 0xc0304fe0 is in kmem_cache_free (mm/slab.h:231).
>> 226  }
>> 227  
>> 228  static inline bool slab_equal_or_root(struct kmem_cache *s,
>> 229struct kmem_cache *p)
>> 230  {
>> 231  return p == s || p == s->memcg_params.root_cache;
>> 232  }
> 
> And s is NULL.
> 
> Called via:
>   kmem_cache_free+0x210/0x2a0
>   mempool_free_slab+0x24/0x40
>   mempool_exit+0x50/0x90
>   bioset_exit+0x40/0x1d0
>   dm_io_client_destroy+0x2c/0x50
>   dm_bufio_client_destroy+0x1fc/0x2d0 [dm_bufio]
>   persistent_read_metadata+0x430/0x660 [dm_snapshot]
>   snapshot_ctr+0x5c8/0x7a0 [dm_snapshot]
>   dm_table_add_target+0x19c/0x3c0
>   table_load+0x104/0x450
>   ctl_ioctl+0x1f8/0x570
>   dm_ctl_ioctl+0x18/0x30
>   do_vfs_ioctl+0xcc/0x9e0
>   ksys_ioctl+0x5c/0xe0
>   sys_ioctl+0x20/0x80
>   system_call+0x58/0x6c
> 
> So looks like we did:
> 
>   kmem_cache_free(NULL, x)
> 
> Probably a bad error path that frees before the cache has been allocated.
> 
> mempool_init_node() calls mempool_exit() on a partially initialised
> mempool, which looks fishy, though you're not hitting that patch AFAICS.

The slab cache is set up elsewhere: it's pending_cache. So if pending_cache
is NULL, then yeah, an exit there will barf. I'd try something like the
below, but from the trace, we already basically see the path.


diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..ebfa2f89ffdd 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -59,6 +59,7 @@ void mempool_free_slab(void *element, void *pool_data);
 static inline int
 mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
 {
+   BUG_ON(!kc);
return mempool_init(pool, min_nr, mempool_alloc_slab,
mempool_free_slab, (void *) kc);
 }
diff --git a/mm/mempool.c b/mm/mempool.c
index b54f2c20e5e0..060f44acd0df 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -508,7 +508,9 @@ EXPORT_SYMBOL(mempool_alloc_slab);
 void mempool_free_slab(void *element, void *pool_data)
 {
struct kmem_cache *mem = pool_data;
-   kmem_cache_free(mem, element);
+
+   if (!WARN_ON(!mem))
+   kmem_cache_free(mem, element);
 }
 EXPORT_SYMBOL(mempool_free_slab);
 

-- 
Jens Axboe



Oops in kmem_cache_free() via bioset_exit() (was Re: [next-20180601][nvme][ppc] Kernel Oops is triggered when creating lvm snapshots on nvme disks)

2018-06-28 Thread Michael Ellerman
Kent, Jens,

This looks like it might be related to the recent bioset changes?

cheers

Abdul Haleem  writes:
> On Tue, 2018-06-26 at 23:36 +1000, Michael Ellerman wrote:
>> Abdul Haleem  writes:
...
> I was able to reproduce again with slub_debug=FZP and DEBUG_INFO enabled
> on 4.17.0-rc7-next-20180601, but not much traces other than the Oops stack 
> trace

Are you still testing on that revision? It's nearly a month old.

Please try to reproduce on mainline or today's linux-next.


> the faulty instruction points to below code path :
>
> gdb -batch vmlinux -ex 'list *(0xc0304fe0)'
> 0xc0304fe0 is in kmem_cache_free (mm/slab.h:231).
> 226   }
> 227   
> 228   static inline bool slab_equal_or_root(struct kmem_cache *s,
> 229 struct kmem_cache *p)
> 230   {
> 231   return p == s || p == s->memcg_params.root_cache;
> 232   }

And s is NULL.

Called via:
  kmem_cache_free+0x210/0x2a0
  mempool_free_slab+0x24/0x40
  mempool_exit+0x50/0x90
  bioset_exit+0x40/0x1d0
  dm_io_client_destroy+0x2c/0x50
  dm_bufio_client_destroy+0x1fc/0x2d0 [dm_bufio]
  persistent_read_metadata+0x430/0x660 [dm_snapshot]
  snapshot_ctr+0x5c8/0x7a0 [dm_snapshot]
  dm_table_add_target+0x19c/0x3c0
  table_load+0x104/0x450
  ctl_ioctl+0x1f8/0x570
  dm_ctl_ioctl+0x18/0x30
  do_vfs_ioctl+0xcc/0x9e0
  ksys_ioctl+0x5c/0xe0
  sys_ioctl+0x20/0x80
  system_call+0x58/0x6c

So looks like we did:

  kmem_cache_free(NULL, x)


Probably a bad error path that frees before the cache has been allocated.

mempool_init_node() calls mempool_exit() on a partially initialised
mempool, which looks fishy, though you're not hitting that patch AFAICS.


This patch should hopefully catch it earlier:

diff --git a/mm/mempool.c b/mm/mempool.c
index b54f2c20e5e0..6e23d7a119d4 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -508,6 +508,10 @@ EXPORT_SYMBOL(mempool_alloc_slab);
 void mempool_free_slab(void *element, void *pool_data)
 {
struct kmem_cache *mem = pool_data;
+
+   if (WARN_ON_ONCE(!mem))
+   return;
+
kmem_cache_free(mem, element);
 }
 EXPORT_SYMBOL(mempool_free_slab);


cheers



> [0.00] dt-cpu-ftrs: setup for ISA 3000
> [0.00] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or 
> unsupported by kernel)
> [0.00] dt-cpu-ftrs: final cpu/mmu features = 0x786f8f5fb1a7 
> 0x3c006041
> [0.00] radix-mmu: Page sizes from device-tree:
> [0.00] radix-mmu: Page size shift = 12 AP=0x0
> [0.00] radix-mmu: Page size shift = 16 AP=0x5
> [0.00] radix-mmu: Page size shift = 21 AP=0x1
> [0.00] radix-mmu: Page size shift = 30 AP=0x2
> [0.00] radix-mmu: Initializing Radix MMU
> [0.00] radix-mmu: Partition table (ptrval)
> [0.00] radix-mmu: Mapped 0x-0x0010 with 
> 1.00 GiB pages
> [0.00] radix-mmu: Mapped 0x2000-0x2010 with 
> 1.00 GiB pages
> [0.00] radix-mmu: Process table (ptrval) and radix root for 
> kernel: (ptrval)
> [0.00] Linux version 4.17.0-rc7-next-20180601-autotest 
> (root@ltc-boston21) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #3 SMP Thu 
> Jun 28 03:01:06 CDT 2018
> [0.00] Found initrd at 0xc2d5:0xc9265921
> [0.00] OPAL: Found memory mapped LPC bus on chip 0
> [0.00] ISA: Non-PCI bridge is /lpcm-opb@60300/lpc@0
> [0.00] Using PowerNV machine description
> [0.00] bootconsole [udbg0] enabled
> [0.00] CPU maps initialized for 4 threads per core
> [0.00]  (thread shift is 2)
> [0.00] Allocated 4352 bytes for 128 pacas
> [0.00] -
> [0.00] ppc64_pft_size= 0x0
> [0.00] phys_mem_size = 0x20
> [0.00] dcache_bsize  = 0x80
> [0.00] icache_bsize  = 0x80
> [0.00] cpu_features  = 0x786f8f5fb1a7
> [0.00]   possible= 0x7fffcf5fb1a7
> [0.00]   always  = 0x0003800081a1
> [0.00] cpu_user_features = 0xdc0065c2 0xaee0
> [0.00] mmu_features  = 0x3c006041
> [0.00] firmware_features = 0x00011000
> [0.00] -
> [0.00] cma: Reserved 6560 MiB at 0x200e6200
> [0.00] numa:   NODE_DATA [mem 0xfffabe300-0xfffac7fff]
> [0.00] numa:   NODE_DATA [mem 0x200fff1a0300-0x200fff1a9fff]
> [0.00] rfi-flush: mttrig type flush available
> [0.00] rfi-flush: patched 9 locations (mttrig type flush)
> [0.00] stf-barrier: eieio barrier available
> [0.00] stf-barrier: patched 61 entry locations (eieio barrier)
> [0.00] stf-barrier: patched 9 exit locations (eieio barrier)
> [0.00] Initializing IODA2 PHB (/pciex@600c3c000)
> [0.00] PCI host bridge /pciex@600c3c000

Re: [PATCH 2/3] drivers/base: reorder consumer and its children behind suppliers

2018-06-28 Thread Pingfan Liu
On Wed, Jun 27, 2018 at 4:35 PM Dan Carpenter  wrote:
>
> On Wed, Jun 27, 2018 at 10:34:54AM +0800, Pingfan Liu wrote:
> > > 1b2a1e63 Pingfan Liu 2018-06-25  243}
> > > 1b2a1e63 Pingfan Liu 2018-06-25  244}
> > > 1b2a1e63 Pingfan Liu 2018-06-25 @245BUG_ON(!ret);
> > >
> > > If the list is empty then "ret" can be unitialized.  We test a different
> > > list "dev->links.suppliers" to see if that's empty.  I wrote a bunch of
> > > code to make Smatch try to understand about empty lists, but I don't
> > > think it works...
> > >
> > Yes, if the list is empty, then the code cannot touch ret. But ret is
> > unused in that case. Does it matter?
> >
>
> I'm not sure I understand what you're asking?  Of course, it matters?
>
Oh, I misunderstood your original comment. Yes, you are right. I will
fix it in the next version, if this code section is still used.

Thanks and regards,
Pingfan


Re: [PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Laurent Dufour
On 28/06/2018 13:10, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> rtas_log_buf is a buffer to hold RTAS event data that is communicated
> to the kernel by the hypervisor. This buffer is then used to pass RTAS
> event data to user space through procfs. It is allocated from the
> vmalloc (non-linear mapping) area.
> 
> On a machine check interrupt, register r3 points to the RTAS extended
> event log passed by the hypervisor that contains the MCE event. The
> pseries machine check handler then logs this error into rtas_log_buf.
> Since rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking
> a page fault (vector 0x300) while accessing it. Since the machine check
> interrupt handler runs in NMI context, we cannot afford to take any page
> fault. Page faults are not honored in NMI context and cause a kernel
> panic. Apart from that, as Nick pointed out, pSeries_log_error() also
> takes a spin_lock while logging the error, which is not safe in NMI
> context. It may end up in a deadlock if we get another MCE before
> releasing the lock. Fix this by deferring the logging of the rtas error
> to an irq work queue.
> 
> The current implementation uses two different buffers to hold the rtas
> error log depending on whether an extended log is provided or not. This
> makes it a bit difficult to identify which buffer has valid data that
> needs to be logged later in irq work. Simplify this by using a single
> buffer, one per paca, and copy the rtas log to it irrespective of
> whether an extended log is provided or not. Allocate this buffer below
> the RMA region so that it can be accessed in the real mode mce handler.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/paca.h|3 ++
>  arch/powerpc/platforms/pseries/ras.c   |   39 
> +---
>  arch/powerpc/platforms/pseries/setup.c |   16 +
>  3 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 3f109a3e3edb..b441fef53077 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -251,6 +251,9 @@ struct paca_struct {
>   void *rfi_flush_fallback_area;
>   u64 l1d_flush_size;
>  #endif
> +#ifdef CONFIG_PPC_PSERIES
> + u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
> +#endif /* CONFIG_PPC_PSERIES */
>  } cacheline_aligned;
> 
>  extern void copy_mm_to_paca(struct mm_struct *mm);
> diff --git a/arch/powerpc/platforms/pseries/ras.c 
> b/arch/powerpc/platforms/pseries/ras.c
> index 5e1ef9150182..f6ba9a2a4f84 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -32,11 +33,13 @@
>  static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
>  static DEFINE_SPINLOCK(ras_log_buf_lock);
> 
> -static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
> -static DEFINE_PER_CPU(__u64, mce_data_buf);
> -
>  static int ras_check_exception_token;
> 
> +static void mce_process_errlog_event(struct irq_work *work);
> +static struct irq_work mce_errlog_process_work = {
> + .func = mce_process_errlog_event,
> +};
> +
>  #define EPOW_SENSOR_TOKEN9
>  #define EPOW_SENSOR_INDEX0
> 
> @@ -336,10 +339,9 @@ static irqreturn_t ras_error_interrupt(int irq, void 
> *dev_id)
>   * the actual r3 if possible, and a ptr to the error log entry
>   * will be returned if found.
>   *
> - * If the RTAS error is not of the extended type, then we put it in a per
> - * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
> + * Use one buffer mce_data_buf per cpu to store RTAS error.
>   *
> - * The global_mce_data_buf does not have any locks or protection around it,
> + * The mce_data_buf does not have any locks or protection around it,
>   * if a second machine check comes in, or a system reset is done
>   * before we have logged the error, then we will get corruption in the
>   * error log.  This is preferable over holding off on calling
> @@ -362,20 +364,19 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
> pt_regs *regs)
>   savep = __va(regs->gpr[3]);
>   regs->gpr[3] = savep[0];/* restore original r3 */
> 
> - /* If it isn't an extended log we can use the per cpu 64bit buffer */
>   h = (struct rtas_error_log *)&savep[1];
> + /* Use the per cpu buffer from paca to store rtas error log */
> + memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
>   if (!rtas_error_extended(h)) {
> - memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
> - errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
> + memcpy(local_paca->mce_data_buf, h, sizeof(__u64));
>   } else {
>   int len, error_log_length;
> 
>   error_log_length = 8 + r

[PATCH v4 6/6] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

If we get a machine check exception due to SLB errors, dump the
current SLB contents, which will be very helpful in debugging the
root cause of the SLB errors. Introduce a dedicated per-cpu buffer to
hold the faulty SLB entries. The real mode mce handler saves the old SLB
contents into this buffer, accessible through the paca, and prints it
out later in virtual mode.

With this patch the console will log SLB contents like below on SLB MCE
errors:

[ 3022.938065] SLB contents of cpu 0x3
[ 3022.938066] 00 c800 400ea1b217000500
[ 3022.938067]   1T  ESID=   c0  VSID=  ea1b217 LLP:100
[ 3022.938068] 01 d800 400d43642f000510
[ 3022.938069]   1T  ESID=   d0  VSID=  d43642f LLP:110
[ 3022.938070] 05 f800 400a86c85f000500
[ 3022.938071]   1T  ESID=   f0  VSID=  a86c85f LLP:100
[ 3022.938072] 06 7f000800 400a628b13000d90
[ 3022.938073]   1T  ESID=   7f  VSID=  a628b13 LLP:110
[ 3022.938074] 07 1800 000b7979f523fd90
[ 3022.938075]  256M ESID=1  VSID=   b7979f523f LLP:110
[ 3022.938076] 08 c800 400ea1b217000510
[ 3022.938076]   1T  ESID=   c0  VSID=  ea1b217 LLP:110
[ 3022.938077] 09 c800 400ea1b217000510
[ 3022.938078]   1T  ESID=   c0  VSID=  ea1b217 LLP:110

Suggested-by: Aneesh Kumar K.V 
Suggested-by: Michael Ellerman 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |7 +++
 arch/powerpc/include/asm/paca.h   |1 
 arch/powerpc/mm/slb.c |   57 +
 arch/powerpc/platforms/pseries/ras.c  |   10 
 arch/powerpc/platforms/pseries/setup.c|   10 
 5 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cc00a7088cf3..5a3fe282076d 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -485,9 +485,16 @@ static inline void hpte_init_pseries(void) { }
 
 extern void hpte_init_native(void);
 
+struct slb_entry {
+   u64 esid;
+   u64 vsid;
+};
+
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
 extern void slb_flush_and_rebolt_realmode(void);
+extern void slb_save_contents(struct slb_entry *slb_ptr);
+extern void slb_dump_contents(struct slb_entry *slb_ptr);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index b441fef53077..653f87c69423 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -253,6 +253,7 @@ struct paca_struct {
 #endif
 #ifdef CONFIG_PPC_PSERIES
u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
+   struct slb_entry *mce_faulty_slbs;
 #endif /* CONFIG_PPC_PSERIES */
 } cacheline_aligned;
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 5b1813b98358..476ab0b1d4e8 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -151,6 +151,63 @@ void slb_flush_and_rebolt_realmode(void)
get_paca()->slb_cache_ptr = 0;
 }
 
+void slb_save_contents(struct slb_entry *slb_ptr)
+{
+   int i;
+   unsigned long e, v;
+
+   if (!slb_ptr)
+   return;
+
+   for (i = 0; i < mmu_slb_size; i++) {
+   asm volatile("slbmfee  %0,%1" : "=r" (e) : "r" (i));
+   asm volatile("slbmfev  %0,%1" : "=r" (v) : "r" (i));
+   slb_ptr->esid = e;
+   slb_ptr->vsid = v;
+   slb_ptr++;
+   }
+}
+
+void slb_dump_contents(struct slb_entry *slb_ptr)
+{
+   int i;
+   unsigned long e, v;
+   unsigned long llp;
+
+   if (!slb_ptr)
+   return;
+
+   pr_err("SLB contents of cpu 0x%x\n", smp_processor_id());
+
+   for (i = 0; i < mmu_slb_size; i++) {
+   e = slb_ptr->esid;
+   v = slb_ptr->vsid;
+   slb_ptr++;
+
+   if (!e && !v)
+   continue;
+
+   pr_err("%02d %016lx %016lx\n", i, e, v);
+
+   if (!(e & SLB_ESID_V)) {
+   pr_err("\n");
+   continue;
+   }
+   llp = v & SLB_VSID_LLP;
+   if (v & SLB_VSID_B_1T) {
+   pr_err("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID_1T(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+   llp);
+   } else {
+   pr_err(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx\n",
+   GET_ESID(e),
+   (v & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+   llp);
+   }
+   }
+}
+
 void slb_vmalloc_update(void)
 {
unsigned long vflags

[PATCH v4 5/6] powerpc/pseries: Display machine check error details.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Extract the MCE error details from the RTAS extended log and display
them on the console.

With this patch you should now see mce logs like below:

[  142.371818] Severe Machine check interrupt [Recovered]
[  142.371822]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[  142.371822]   Initiator: CPU
[  142.371823]   Error type: SLB [Multihit]
[  142.371824] Effective address: dca7

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h  |5 +
 arch/powerpc/platforms/pseries/ras.c |  131 ++
 2 files changed, 136 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ceeed2dd489b..26bc3d5c4992 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -197,6 +197,11 @@ static inline uint8_t rtas_error_extended(const struct 
rtas_error_log *elog)
return (elog->byte1 & 0x04) >> 2;
 }
 
+static inline uint8_t rtas_error_initiator(const struct rtas_error_log *elog)
+{
+   return (elog->byte2 & 0xf0) >> 4;
+}
+
 #define rtas_error_type(x) ((x)->byte3)
 
 static inline
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index ae08263daa24..be665eeb97df 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -428,6 +428,135 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
return 0; /* need to perform reset */
 }
 
+#define VAL_TO_STRING(ar, val) ((val < ARRAY_SIZE(ar)) ? ar[val] : "Unknown")
+
+static void pseries_print_mce_info(struct pt_regs *regs,
+   struct rtas_error_log *errp)
+{
+   const char *level, *sevstr;
+   struct pseries_errorlog *pseries_log;
+   struct pseries_mc_errorlog *mce_log;
+   uint8_t error_type, err_sub_type;
+   uint64_t addr;
+   uint8_t initiator = rtas_error_initiator(errp);
+   int disposition = rtas_error_disposition(errp);
+
+   static const char * const initiators[] = {
+   "Unknown",
+   "CPU",
+   "PCI",
+   "ISA",
+   "Memory",
+   "Power Mgmt",
+   };
+   static const char * const mc_err_types[] = {
+   "UE",
+   "SLB",
+   "ERAT",
+   "TLB",
+   "D-Cache",
+   "Unknown",
+   "I-Cache",
+   };
+   static const char * const mc_ue_types[] = {
+   "Indeterminate",
+   "Instruction fetch",
+   "Page table walk ifetch",
+   "Load/Store",
+   "Page table walk Load/Store",
+   };
+
+   /* SLB sub errors valid values are 0x0, 0x1, 0x2 */
+   static const char * const mc_slb_types[] = {
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   /* TLB and ERAT sub errors valid values are 0x1, 0x2, 0x3 */
+   static const char * const mc_soft_types[] = {
+   "Unknown",
+   "Parity",
+   "Multihit",
+   "Indeterminate",
+   };
+
+   if (!rtas_error_extended(errp)) {
+   pr_err("Machine check interrupt: Missing extended error log\n");
+   return;
+   }
+
+   pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
+   if (pseries_log == NULL)
+   return;
+
+   mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
+
+   error_type = rtas_mc_error_type(mce_log);
+   err_sub_type = rtas_mc_error_sub_type(mce_log);
+
+   switch (rtas_error_severity(errp)) {
+   case RTAS_SEVERITY_NO_ERROR:
+   level = KERN_INFO;
+   sevstr = "Harmless";
+   break;
+   case RTAS_SEVERITY_WARNING:
+   level = KERN_WARNING;
+   sevstr = "";
+   break;
+   case RTAS_SEVERITY_ERROR:
+   case RTAS_SEVERITY_ERROR_SYNC:
+   level = KERN_ERR;
+   sevstr = "Severe";
+   break;
+   case RTAS_SEVERITY_FATAL:
+   default:
+   level = KERN_ERR;
+   sevstr = "Fatal";
+   break;
+   }
+
+   printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
+   disposition == RTAS_DISP_FULLY_RECOVERED ?
+   "Recovered" : "Not recovered");
+   if (user_mode(regs)) {
+   printk("%s  NIP: [%016lx] PID: %d Comm: %s\n", level,
+   regs->nip, current->pid, current->comm);
+   } else {
+   printk("%s  NIP [%016lx]: %pS\n", level, regs->nip,
+   (void *)regs->nip);
+   }
+   printk("%s  Initiator: %s\n", level,
+   VAL_TO_STRING(initiators, initiator));
+
+   switch (error_type) {
+   case PSERIES_MC_ERROR_TYPE_UE:
+   printk("%s  Error

[PATCH v4 4/6] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

On pseries, as of today the system crashes if we get a machine check
exception due to SLB errors. These are soft errors and can be fixed by
flushing the SLBs, so the kernel can continue to function instead of
crashing. We do this in real mode, before turning on the MMU; otherwise
we would run into nested machine checks. This patch fetches the
rtas error log in real mode and flushes the SLBs on SLB errors.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |1 
 arch/powerpc/include/asm/machdep.h|1 
 arch/powerpc/kernel/exceptions-64s.S  |   42 +++
 arch/powerpc/kernel/mce.c |   16 ++-
 arch/powerpc/mm/slb.c |6 +++
 arch/powerpc/platforms/powernv/opal.c |1 
 arch/powerpc/platforms/pseries/pseries.h  |1 
 arch/powerpc/platforms/pseries/ras.c  |   56 +
 arch/powerpc/platforms/pseries/setup.c|1 
 9 files changed, 121 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..cc00a7088cf3 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -487,6 +487,7 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_and_rebolt_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ffe7c71e1132..fe447e0d4140 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -108,6 +108,7 @@ struct machdep_calls {
 
/* Early exception handlers called in realmode */
int (*hmi_exception_early)(struct pt_regs *regs);
+   int (*machine_check_early)(struct pt_regs *regs);
 
/* Called during machine check exception to retrive fixup address. */
bool(*mce_check_early_recovery)(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index f283958129f2..0038596b7906 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -332,6 +332,9 @@ TRAMP_REAL_BEGIN(machine_check_pSeries)
 machine_check_fwnmi:
SET_SCRATCH0(r13)   /* save r13 */
EXCEPTION_PROLOG_0(PACA_EXMC)
+BEGIN_FTR_SECTION
+   b   machine_check_pSeries_early
+END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
 machine_check_pSeries_0:
EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST_PR, 0x200)
/*
@@ -343,6 +346,45 @@ machine_check_pSeries_0:
 
 TRAMP_KVM_SKIP(PACA_EXMC, 0x200)
 
+TRAMP_REAL_BEGIN(machine_check_pSeries_early)
+BEGIN_FTR_SECTION
+   EXCEPTION_PROLOG_1(PACA_EXMC, NOTEST, 0x200)
+   mr  r10,r1  /* Save r1 */
+   ld  r1,PACAMCEMERGSP(r13)   /* Use MC emergency stack */
+   subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
+   mfspr   r11,SPRN_SRR0   /* Save SRR0 */
+   mfspr   r12,SPRN_SRR1   /* Save SRR1 */
+   EXCEPTION_PROLOG_COMMON_1()
+   EXCEPTION_PROLOG_COMMON_2(PACA_EXMC)
+   EXCEPTION_PROLOG_COMMON_3(0x200)
+   addir3,r1,STACK_FRAME_OVERHEAD
+   BRANCH_LINK_TO_FAR(machine_check_early) /* Function call ABI */
+
+   /* Move original SRR0 and SRR1 into the respective regs */
+   ld  r9,_MSR(r1)
+   mtspr   SPRN_SRR1,r9
+   ld  r3,_NIP(r1)
+   mtspr   SPRN_SRR0,r3
+   ld  r9,_CTR(r1)
+   mtctr   r9
+   ld  r9,_XER(r1)
+   mtxer   r9
+   ld  r9,_LINK(r1)
+   mtlrr9
+   REST_GPR(0, r1)
+   REST_8GPRS(2, r1)
+   REST_GPR(10, r1)
+   ld  r11,_CCR(r1)
+   mtcrr11
+   REST_GPR(11, r1)
+   REST_2GPRS(12, r1)
+   /* restore original r1. */
+   ld  r1,GPR1(r1)
+   SET_SCRATCH0(r13)   /* save r13 */
+   EXCEPTION_PROLOG_0(PACA_EXMC)
+   b   machine_check_pSeries_0
+END_FTR_SECTION_IFCLR(CPU_FTR_HVMODE)
+
 EXC_COMMON_BEGIN(machine_check_common)
/*
 * Machine check is different because we use a different
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..221271c96a57 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
 {
long handled = 0;
 
-   __this_cpu_inc(irq_stat.mce_exceptions);
+   /*
+* For pSeries we count mce when we go into virtual mode machine
+* check handler. Hence skip it. Also, We can't access per cpu
+* variables in real mode for LPAR.
+*/
+   if (early_cpu_has_feature(CPU_FTR_HVMODE))
+   __this_cpu_inc(i

[PATCH v4 3/6] powerpc/pseries: Define MCE error event section.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

On pseries, the machine check error details are part of the RTAS
extended event log, passed under the machine check exception section.
This patch adds the definition of the rtas MCE event section and
related helper functions.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/rtas.h |  111 +++
 1 file changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index ec9dd79398ee..ceeed2dd489b 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -185,6 +185,13 @@ static inline uint8_t rtas_error_disposition(const struct 
rtas_error_log *elog)
return (elog->byte1 & 0x18) >> 3;
 }
 
+static inline
+void rtas_set_disposition_recovered(struct rtas_error_log *elog)
+{
+   elog->byte1 &= ~0x18;
+   elog->byte1 |= (RTAS_DISP_FULLY_RECOVERED << 3);
+}
+
 static inline uint8_t rtas_error_extended(const struct rtas_error_log *elog)
 {
return (elog->byte1 & 0x04) >> 2;
@@ -275,6 +282,7 @@ inline uint32_t rtas_ext_event_company_id(struct 
rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_CALL_HOME (('C' << 8) | 'H')
 #define PSERIES_ELOG_SECT_ID_USER_DEF  (('U' << 8) | 'D')
 #define PSERIES_ELOG_SECT_ID_HOTPLUG   (('H' << 8) | 'P')
+#define PSERIES_ELOG_SECT_ID_MCE   (('M' << 8) | 'C')
 
 /* Vendor specific Platform Event Log Format, Version 6, section header */
 struct pseries_errorlog {
@@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_ID_DRC_COUNT   3
 #define PSERIES_HP_ELOG_ID_DRC_IC  4
 
+/* RTAS pseries MCE errorlog section */
+#pragma pack(push, 1)
+struct pseries_mc_errorlog {
+   __be32  fru_id;
+   __be32  proc_id;
+   uint8_t error_type;
+   union {
+   struct {
+   uint8_t ue_err_type;
+   /* 
+* X1: Permanent or Transient UE.
+*  X   1: Effective address provided.
+*   X  1: Logical address provided.
+*XX2: Reserved.
+*  XXX 3: Type of UE error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   __be64  logical_address;
+   } ue_error;
+   struct {
+   uint8_t soft_err_type;
+   /* 
+* X1: Effective address provided.
+*  X   5: Reserved.
+*   XX 2: Type of SLB/ERAT/TLB error.
+*/
+   uint8_t reserved_1[6];
+   __be64  effective_address;
+   uint8_t reserved_2[8];
+   } soft_error;
+   } u;
+};
+#pragma pack(pop)
+
+/* RTAS pseries MCE error types */
+#define PSERIES_MC_ERROR_TYPE_UE   0x00
+#define PSERIES_MC_ERROR_TYPE_SLB  0x01
+#define PSERIES_MC_ERROR_TYPE_ERAT 0x02
+#define PSERIES_MC_ERROR_TYPE_TLB  0x04
+#define PSERIES_MC_ERROR_TYPE_D_CACHE  0x05
+#define PSERIES_MC_ERROR_TYPE_I_CACHE  0x07
+
+/* RTAS pseries MCE error sub types */
+#define PSERIES_MC_ERROR_UE_INDETERMINATE  0
+#define PSERIES_MC_ERROR_UE_IFETCH 1
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH 2
+#define PSERIES_MC_ERROR_UE_LOAD_STORE 3
+#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE 4
+
+#define PSERIES_MC_ERROR_SLB_PARITY0
+#define PSERIES_MC_ERROR_SLB_MULTIHIT  1
+#define PSERIES_MC_ERROR_SLB_INDETERMINATE 2
+
+#define PSERIES_MC_ERROR_ERAT_PARITY   1
+#define PSERIES_MC_ERROR_ERAT_MULTIHIT 2
+#define PSERIES_MC_ERROR_ERAT_INDETERMINATE3
+
+#define PSERIES_MC_ERROR_TLB_PARITY1
+#define PSERIES_MC_ERROR_TLB_MULTIHIT  2
+#define PSERIES_MC_ERROR_TLB_INDETERMINATE 3
+
+static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog 
*mlog)
+{
+   return mlog->error_type;
+}
+
+static inline uint8_t rtas_mc_error_sub_type(
+   const struct pseries_mc_errorlog *mlog)
+{
+   switch (mlog->error_type) {
+   casePSERIES_MC_ERROR_TYPE_UE:
+   return (mlog->u.ue_error.ue_err_type & 0x07);
+   casePSERIES_MC_ERROR_TYPE_SLB:
+   casePSERIES_MC_ERROR_TYPE_ERAT:
+   casePSERIES_MC_ERROR_TYPE_TLB:
+   return (mlog->u.soft_error.soft_err_type & 0x03);
+   default:
+   return 0;
+   }
+}
+
+static inline uint64_t rtas_mc_get_effective_addr(
+   const struct pseries_mc_errorlog *mlog)
+{
+   uint64_t addr = 0;
+
+   switch (mlog->error_ty

[PATCH v4 2/6] powerpc/pseries: Fix endianness while restoring r3 in MCE handler.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

During a machine check interrupt on the pseries platform, register r3
points to the RTAS extended event log passed by the hypervisor. Since
the hypervisor uses r3 to pass the pointer to the rtas log, it stores
the original r3 value at the start of the memory (first 8 bytes)
pointed to by r3. Since the hypervisor stores this value, like the rest
of the rtas log, in BE format, Linux must restore the r3 value in the
correct endianness.

Without this patch, when the MCE handler returns after recovery to the
code that caused the MCE, it may end up with a Data SLB access interrupt
for an invalid address, followed by a kernel panic or hang.

[   62.878965] Severe Machine check interrupt [Recovered]
[   62.878968]   NIP [dca301b8]: init_module+0x1b8/0x338 [bork_kernel]
[   62.878969]   Initiator: CPU
[   62.878970]   Error type: SLB [Multihit]
[   62.878971] Effective address: dca7
cpu 0xa: Vector: 380 (Data SLB Access) at [c000fc7775b0]
pc: c09694c0: vsnprintf+0x80/0x480
lr: c09698e0: vscnprintf+0x20/0x60
sp: c000fc777830
   msr: 82009033
   dar: a803a30c00d0
  current = 0xcbc9ef00
  paca= 0xc0001eca5c00   softe: 3irq_happened: 0x01
pid   = 8860, comm = insmod
[c000fc7778b0] c09698e0 vscnprintf+0x20/0x60
[c000fc7778e0] c016b6c4 vprintk_emit+0xb4/0x4b0
[c000fc777960] c016d40c vprintk_func+0x5c/0xd0
[c000fc777980] c016cbb4 printk+0x38/0x4c
[c000fc7779a0] dca301c0 init_module+0x1c0/0x338 [bork_kernel]
[c000fc777a40] c000d9c4 do_one_initcall+0x54/0x230
[c000fc777b00] c01b3b74 do_init_module+0x8c/0x248
[c000fc777b90] c01b2478 load_module+0x12b8/0x15b0
[c000fc777d30] c01b29e8 sys_finit_module+0xa8/0x110
[c000fc777e30] c000b204 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 7fff8bda0644
SP (7fffdfbfe980) is in userspace

This patch fixes this issue.

Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
Cc: sta...@vger.kernel.org
Reviewed-by: Nicholas Piggin 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/pseries/ras.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index f6ba9a2a4f84..e3bd849141de 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -362,7 +362,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
pt_regs *regs)
}
 
savep = __va(regs->gpr[3]);
-   regs->gpr[3] = savep[0];/* restore original r3 */
+   regs->gpr[3] = be64_to_cpu(savep[0]);   /* restore original r3 */
 
h = (struct rtas_error_log *)&savep[1];
/* Use the per cpu buffer from paca to store rtas error log */



[PATCH v4 1/6] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-06-28 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

rtas_log_buf is a buffer to hold RTAS event data that is communicated
to the kernel by the hypervisor. This buffer is then used to pass RTAS
event data to user space through procfs. It is allocated from the
vmalloc (non-linear mapping) area.

On a machine check interrupt, register r3 points to the RTAS extended
event log passed by the hypervisor that contains the MCE event. The
pseries machine check handler then logs this error into rtas_log_buf.
Since rtas_log_buf is a vmalloc-ed (non-linear) buffer, we end up taking
a page fault (vector 0x300) while accessing it. Since the machine check
interrupt handler runs in NMI context, we cannot afford to take any page
fault. Page faults are not honored in NMI context and cause a kernel
panic. Apart from that, as Nick pointed out, pSeries_log_error() also
takes a spin_lock while logging the error, which is not safe in NMI
context. It may end up in a deadlock if we get another MCE before
releasing the lock. Fix this by deferring the logging of the rtas error
to an irq work queue.

The current implementation uses two different buffers to hold the rtas
error log depending on whether an extended log is provided or not. This
makes it a bit difficult to identify which buffer has valid data that
needs to be logged later in irq work. Simplify this by using a single
buffer, one per paca, and copy the rtas log to it irrespective of
whether an extended log is provided or not. Allocate this buffer below
the RMA region so that it can be accessed in the real mode mce handler.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
Cc: sta...@vger.kernel.org
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/paca.h|3 ++
 arch/powerpc/platforms/pseries/ras.c   |   39 +---
 arch/powerpc/platforms/pseries/setup.c |   16 +
 3 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 3f109a3e3edb..b441fef53077 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -251,6 +251,9 @@ struct paca_struct {
void *rfi_flush_fallback_area;
u64 l1d_flush_size;
 #endif
+#ifdef CONFIG_PPC_PSERIES
+   u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
+#endif /* CONFIG_PPC_PSERIES */
 } cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 5e1ef9150182..f6ba9a2a4f84 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -32,11 +33,13 @@
 static unsigned char ras_log_buf[RTAS_ERROR_LOG_MAX];
 static DEFINE_SPINLOCK(ras_log_buf_lock);
 
-static char global_mce_data_buf[RTAS_ERROR_LOG_MAX];
-static DEFINE_PER_CPU(__u64, mce_data_buf);
-
 static int ras_check_exception_token;
 
+static void mce_process_errlog_event(struct irq_work *work);
+static struct irq_work mce_errlog_process_work = {
+   .func = mce_process_errlog_event,
+};
+
 #define EPOW_SENSOR_TOKEN  9
 #define EPOW_SENSOR_INDEX  0
 
@@ -336,10 +339,9 @@ static irqreturn_t ras_error_interrupt(int irq, void 
*dev_id)
  * the actual r3 if possible, and a ptr to the error log entry
  * will be returned if found.
  *
- * If the RTAS error is not of the extended type, then we put it in a per
- * cpu 64bit buffer. If it is the extended type we use global_mce_data_buf.
+ * Use one buffer mce_data_buf per cpu to store RTAS error.
  *
- * The global_mce_data_buf does not have any locks or protection around it,
+ * The mce_data_buf does not have any locks or protection around it,
  * if a second machine check comes in, or a system reset is done
  * before we have logged the error, then we will get corruption in the
  * error log.  This is preferable over holding off on calling
@@ -362,20 +364,19 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
pt_regs *regs)
savep = __va(regs->gpr[3]);
regs->gpr[3] = savep[0];/* restore original r3 */
 
-   /* If it isn't an extended log we can use the per cpu 64bit buffer */
h = (struct rtas_error_log *)&savep[1];
+   /* Use the per cpu buffer from paca to store rtas error log */
+   memset(local_paca->mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
if (!rtas_error_extended(h)) {
-   memcpy(this_cpu_ptr(&mce_data_buf), h, sizeof(__u64));
-   errhdr = (struct rtas_error_log *)this_cpu_ptr(&mce_data_buf);
+   memcpy(local_paca->mce_data_buf, h, sizeof(__u64));
} else {
int len, error_log_length;
 
error_log_length = 8 + rtas_error_extended_log_length(h);
len = max_t(int, error_log_length, RTAS_ERROR_LOG_MAX);
-   memset(global_mce_data_buf, 0, RTAS_ERROR_LOG_MAX);
-   memcpy(global_mce_data_buf, h, len);
-  

[PATCH v4 0/6] powerpc/pseries: Machine check handler improvements.

2018-06-28 Thread Mahesh J Salgaonkar
This patch series includes some improvements to the machine check
handler for pseries. Patch 1 fixes an issue where the machine check
handler crashes the kernel while accessing a vmalloc-ed buffer in NMI
context.
Patch 2 fixes an endianness bug while restoring r3 in the MCE handler.
Patch 4 implements a real mode mce handler and flushes the SLBs on SLB errors.
Patch 5 displays the MCE error details on the console.
Patch 6 saves and dumps the SLB contents on SLB MCE errors to improve
debuggability.

Change in V4:
- Flush the SLBs in real mode mce handler to handle SLB errors for entry 0.
- Allocate buffers per cpu to hold rtas error log and old slb contents.
- Defer the logging of rtas error log to irq work queue.

Change in V3:
- Moved patch 5 to patch 2

Change in V2:
- patch 3: Display additional info (NIP and task info) in MCE error details.
- patch 5: Fix endianness bug while restoring r3 in MCE handler.


---

Mahesh Salgaonkar (6):
  powerpc/pseries: Defer the logging of rtas error to irq work queue.
  powerpc/pseries: Fix endianness while restoring r3 in MCE handler.
  powerpc/pseries: Define MCE error event section.
  powerpc/pseries: flush SLB contents on SLB MCE errors.
  powerpc/pseries: Display machine check error details.
  powerpc/pseries: Dump the SLB contents on SLB MCE errors.


 arch/powerpc/include/asm/book3s/64/mmu-hash.h |8 +
 arch/powerpc/include/asm/machdep.h|1 
 arch/powerpc/include/asm/paca.h   |4 
 arch/powerpc/include/asm/rtas.h   |  116 
 arch/powerpc/kernel/exceptions-64s.S  |   42 
 arch/powerpc/kernel/mce.c |   16 +-
 arch/powerpc/mm/slb.c |   63 +++
 arch/powerpc/platforms/powernv/opal.c |1 
 arch/powerpc/platforms/pseries/pseries.h  |1 
 arch/powerpc/platforms/pseries/ras.c  |  236 +++--
 arch/powerpc/platforms/pseries/setup.c|   27 +++
 11 files changed, 497 insertions(+), 18 deletions(-)

--
Signature



[PATCH v2 01/10] Revert "cxl: Add kernel API to allow a context to operate with relocate disabled"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.
The symbol 'cxl_set_translation_mode' is never called, so
ctx->real_mode is always false.

This reverts commit 7a0d85d313c2066712e530e668bc02bb741a685c.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/cxl/api.c| 19 ---
 drivers/misc/cxl/cxl.h|  1 -
 drivers/misc/cxl/guest.c  |  3 ---
 drivers/misc/cxl/native.c |  3 ++-
 include/misc/cxl.h|  8 
 5 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 753b1a698fc4..21d620e29fea 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -324,7 +324,6 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
if (task) {
ctx->pid = get_task_pid(task, PIDTYPE_PID);
kernel = false;
-   ctx->real_mode = false;
 
/* acquire a reference to the task's mm */
ctx->mm = get_task_mm(current);
@@ -388,24 +387,6 @@ void cxl_set_master(struct cxl_context *ctx)
 }
 EXPORT_SYMBOL_GPL(cxl_set_master);
 
-int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode)
-{
-   if (ctx->status == STARTED) {
-   /*
-* We could potentially update the PE and issue an update LLCMD
-* to support this, but it doesn't seem to have a good use case
-* since it's trivial to just create a second kernel context
-* with different translation modes, so until someone convinces
-* me otherwise:
-*/
-   return -EBUSY;
-   }
-
-   ctx->real_mode = real_mode;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(cxl_set_translation_mode);
-
 /* wrappers around afu_* file ops which are EXPORTED */
 int cxl_fd_open(struct inode *inode, struct file *file)
 {
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 918d4fb742d1..af8794719956 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -613,7 +613,6 @@ struct cxl_context {
bool pe_inserted;
bool master;
bool kernel;
-   bool real_mode;
bool pending_irq;
bool pending_fault;
bool pending_afu_err;
diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 4644f16606a3..f5dc740fcd13 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -623,9 +623,6 @@ static int guest_attach_process(struct cxl_context *ctx, 
bool kernel, u64 wed, u
 {
pr_devel("in %s\n", __func__);
 
-   if (ctx->real_mode)
-   return -EPERM;
-
ctx->kernel = kernel;
if (ctx->afu->current_mode == CXL_MODE_DIRECTED)
return attach_afu_directed(ctx, wed, amr);
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 98f867fcef24..c9d5d82dce8e 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -605,6 +605,7 @@ u64 cxl_calculate_sr(bool master, bool kernel, bool 
real_mode, bool p9)
sr |= CXL_PSL_SR_An_MP;
if (mfspr(SPRN_LPCR) & LPCR_TC)
sr |= CXL_PSL_SR_An_TC;
+
if (kernel) {
if (!real_mode)
sr |= CXL_PSL_SR_An_R;
@@ -629,7 +630,7 @@ u64 cxl_calculate_sr(bool master, bool kernel, bool 
real_mode, bool p9)
 
 static u64 calculate_sr(struct cxl_context *ctx)
 {
-   return cxl_calculate_sr(ctx->master, ctx->kernel, ctx->real_mode,
+   return cxl_calculate_sr(ctx->master, ctx->kernel, false,
cxl_is_power9());
 }
 
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index b712be544f8c..82cc6ffafe2d 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -173,14 +173,6 @@ int cxl_afu_reset(struct cxl_context *ctx);
  */
 void cxl_set_master(struct cxl_context *ctx);
 
-/*
- * Sets the context to use real mode memory accesses to operate with
- * translation disabled. Note that this only makes sense for kernel contexts
- * under bare metal, and will not work with virtualisation. May only be
- * performed on stopped contexts.
- */
-int cxl_set_translation_mode(struct cxl_context *ctx, bool real_mode);
-
 /*
  * Map and unmap the AFU Problem Space area. The amount and location mapped
  * depends on if this context is a master or slave.
-- 
2.17.1



[PATCH v2 10/10] cxl: Remove abandonned capi support for the Mellanox CX4, final cleanup

2018-06-28 Thread Frederic Barrat
Remove a few XSL/CX4 oddities which are no longer needed. A simple
revert of the initial commits was not possible (or not worth it) due
to the history of the code.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/context.c |  2 +-
 drivers/misc/cxl/cxl.h | 12 --
 drivers/misc/cxl/debugfs.c |  5 ---
 drivers/misc/cxl/pci.c | 75 +++---
 4 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 0355d42d367f..5fe529b43ebe 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -95,7 +95,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
 */
mutex_lock(&afu->contexts_lock);
idr_preload(GFP_KERNEL);
-   i = idr_alloc(&ctx->afu->contexts_idr, ctx, ctx->afu->adapter->min_pe,
+   i = idr_alloc(&ctx->afu->contexts_idr, ctx, 0,
  ctx->afu->num_procs, GFP_NOWAIT);
idr_preload_end();
mutex_unlock(&afu->contexts_lock);
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index aa453448201d..44bcfafbb579 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -93,11 +93,6 @@ static const cxl_p1_reg_t CXL_PSL_FIR_CNTL  = {0x0148};
 static const cxl_p1_reg_t CXL_PSL_DSNDCTL   = {0x0150};
 static const cxl_p1_reg_t CXL_PSL_SNWRALLOC = {0x0158};
 static const cxl_p1_reg_t CXL_PSL_TRACE = {0x0170};
-/* XSL registers (Mellanox CX4) */
-static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
-static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
-static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
-static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
 /* PSL registers - CAIA 2 */
 static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
 static const cxl_p1_reg_t CXL_XSL9_INV  = {0x0110};
@@ -695,7 +690,6 @@ struct cxl {
struct bin_attribute cxl_attr;
int adapter_num;
int user_irqs;
-   int min_pe;
u64 ps_size;
u16 psl_rev;
u16 base_image;
@@ -934,7 +928,6 @@ int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
 void cxl_debugfs_add_adapter_regs_psl9(struct cxl *adapter, struct dentry 
*dir);
 void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry 
*dir);
-void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
 void cxl_debugfs_add_afu_regs_psl9(struct cxl_afu *afu, struct dentry *dir);
 void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir);
 
@@ -977,11 +970,6 @@ static inline void 
cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter,
 {
 }
 
-static inline void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter,
-   struct dentry *dir)
-{
-}
-
 static inline void cxl_debugfs_add_afu_regs_psl9(struct cxl_afu *afu, struct 
dentry *dir)
 {
 }
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c
index 1643850d2302..a1921d81593a 100644
--- a/drivers/misc/cxl/debugfs.c
+++ b/drivers/misc/cxl/debugfs.c
@@ -58,11 +58,6 @@ void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, 
struct dentry *dir)
debugfs_create_io_x64("trace", S_IRUSR | S_IWUSR, dir, 
_cxl_p1_addr(adapter, CXL_PSL_TRACE));
 }
 
-void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir)
-{
-   debugfs_create_io_x64("fec", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_XSL_FEC));
-}
-
 int cxl_debugfs_adapter_add(struct cxl *adapter)
 {
struct dentry *dir;
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 0ca818396524..6dfb4ed345d3 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -593,27 +593,7 @@ static int init_implementation_adapter_regs_psl8(struct 
cxl *adapter, struct pci
return 0;
 }
 
-static int init_implementation_adapter_regs_xsl(struct cxl *adapter, struct 
pci_dev *dev)
-{
-   u64 xsl_dsnctl;
-   u64 chipid;
-   u32 phb_index;
-   u64 capp_unit_id;
-   int rc;
-
-   rc = cxl_calc_capp_routing(dev, &chipid, &phb_index, &capp_unit_id);
-   if (rc)
-   return rc;
-
-   /* Tell XSL where to route data to */
-   xsl_dsnctl = 0x6000ULL | (chipid << (63-5));
-   xsl_dsnctl |= (capp_unit_id << (63-13));
-   cxl_p1_write(adapter, CXL_XSL_DSNCTL, xsl_dsnctl);
-
-   return 0;
-}
-
-/* PSL & XSL */
+/* PSL */
 #define TBSYNC_CAL(n) (((u64)n & 0x7) << (63-3))
 #define TBSYNC_CNT(n) (((u64)n & 0x7) << (63-6))
 /* For the PSL this is a multiple for 0 < n <= 7: */
@@ -625,21 +605,6 @@ static void write_timebase_ctrl_psl8(struct cxl *adapter)
 TBSYNC_CNT(2 * PSL_2048_250MHZ_CYCLES));
 }
 
-/* XSL */
-#define TBSYNC_ENA (1ULL << 63)
-/* For the XSL this is 2**n * 2000 clocks for 0 < n <= 6: */
-#define XSL_2000_CLOCKS 1
-#define XSL_4000_CLOCKS 2
-#define XSL_8000_CLOCKS 3
-
-static void write_timebase_

[PATCH v2 09/10] Revert "cxl: Allow a default context to be associated with an external pci_dev"

2018-06-28 Thread Frederic Barrat
Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit a19bd79e31769626d288cc016e21a31b6f47bf6f.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/Makefile |  2 +-
 drivers/misc/cxl/base.c   | 35 ---
 drivers/misc/cxl/cxl.h|  6 --
 drivers/misc/cxl/main.c   |  2 --
 drivers/misc/cxl/phb.c| 44 ---
 drivers/misc/cxl/vphb.c   | 30 +++---
 include/misc/cxl-base.h   |  6 --
 7 files changed, 28 insertions(+), 97 deletions(-)
 delete mode 100644 drivers/misc/cxl/phb.c

diff --git a/drivers/misc/cxl/Makefile b/drivers/misc/cxl/Makefile
index 502d41fc9ea5..5eea61b9584f 100644
--- a/drivers/misc/cxl/Makefile
+++ b/drivers/misc/cxl/Makefile
@@ -4,7 +4,7 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 
 cxl-y  += main.o file.o irq.o fault.o native.o
 cxl-y  += context.o sysfs.o pci.o trace.o
-cxl-y  += vphb.o phb.o api.o cxllib.o
+cxl-y  += vphb.o api.o cxllib.o
 cxl-$(CONFIG_PPC_PSERIES)  += flash.o guest.o of.o hcalls.o
 cxl-$(CONFIG_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_CXL)  += cxl.o
diff --git a/drivers/misc/cxl/base.c b/drivers/misc/cxl/base.c
index e1e80cb99ad9..7557835cdfcd 100644
--- a/drivers/misc/cxl/base.c
+++ b/drivers/misc/cxl/base.c
@@ -106,41 +106,6 @@ int cxl_update_properties(struct device_node *dn,
 }
 EXPORT_SYMBOL_GPL(cxl_update_properties);
 
-/*
- * API calls into the driver that may be called from the PHB code and must be
- * built in.
- */
-bool cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu)
-{
-   bool ret;
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return false;
-
-   ret = calls->cxl_pci_associate_default_context(dev, afu);
-
-   cxl_calls_put(calls);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(cxl_pci_associate_default_context);
-
-void cxl_pci_disable_device(struct pci_dev *dev)
-{
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return;
-
-   calls->cxl_pci_disable_device(dev);
-
-   cxl_calls_put(calls);
-}
-EXPORT_SYMBOL_GPL(cxl_pci_disable_device);
-
 static int __init cxl_base_init(void)
 {
struct device_node *np;
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index d95c2c98f2ab..aa453448201d 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -867,15 +867,9 @@ static inline bool cxl_is_power9_dd1(void)
 ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
loff_t off, size_t count);
 
-/* Internal functions wrapped in cxl_base to allow PHB to call them */
-bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu);
-void _cxl_pci_disable_device(struct pci_dev *dev);
 
 struct cxl_calls {
void (*cxl_slbia)(struct mm_struct *mm);
-   bool (*cxl_pci_associate_default_context)(struct pci_dev *dev, struct 
cxl_afu *afu);
-   void (*cxl_pci_disable_device)(struct pci_dev *dev);
-
struct module *owner;
 };
 int register_cxl_calls(struct cxl_calls *calls);
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index a7e83624034b..334223b802ee 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -104,8 +104,6 @@ static inline void cxl_slbia_core(struct mm_struct *mm)
 
 static struct cxl_calls cxl_calls = {
.cxl_slbia = cxl_slbia_core,
-   .cxl_pci_associate_default_context = _cxl_pci_associate_default_context,
-   .cxl_pci_disable_device = _cxl_pci_disable_device,
.owner = THIS_MODULE,
 };
 
diff --git a/drivers/misc/cxl/phb.c b/drivers/misc/cxl/phb.c
deleted file mode 100644
index 6ec69ada19f4..
--- a/drivers/misc/cxl/phb.c
+++ /dev/null
@@ -1,44 +0,0 @@
-/*
- * Copyright 2014-2016 IBM Corp.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#include 
-#include "cxl.h"
-
-bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu 
*afu)
-{
-   struct cxl_context *ctx;
-
-   /*
-* Allocate a context to do cxl things to. This is used for interrupts
-* in the peer model using a real phb, and if we eventually do DMA ops
-* in the virtual phb, we'll need a default context to attach them to.
-*/
-   ctx = cxl_dev_context_init(dev);
-   if (IS_ERR(ctx))
-   return false;
-   dev->dev.archdata.cxl_ctx = ctx;
-
-   return (cxl_ops->afu_check_and_enable(afu) == 0);
-}
-/* exported via cxl_base */
-
-void _cxl_pci_disable_device(struct pci_dev *dev)
-{
-   struct cxl_context *ctx = cxl_get_context(dev);
-
-   

[PATCH v2 08/10] Revert "cxl: Add cxl_slot_is_supported API"

2018-06-28 Thread Frederic Barrat
Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit 4e56f858bdde5cbfb70f61baddfaa56a8ed851bf.

Signed-off-by: Frederic Barrat 
---
 drivers/misc/cxl/pci.c | 37 -
 include/misc/cxl.h | 15 ---
 2 files changed, 52 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 193ff22f610b..0ca818396524 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1808,43 +1808,6 @@ int cxl_slot_is_switched(struct pci_dev *dev)
return (depth > CXL_MAX_PCIEX_PARENT);
 }
 
-bool cxl_slot_is_supported(struct pci_dev *dev, int flags)
-{
-   if (!cpu_has_feature(CPU_FTR_HVMODE))
-   return false;
-
-   if ((flags & CXL_SLOT_FLAG_DMA) && (!pvr_version_is(PVR_POWER8NVL))) {
-   /*
-* CAPP DMA mode is technically supported on regular P8, but
-* will EEH if the card attempts to access memory < 4GB, which
-* we cannot realistically avoid. We might be able to work
-* around the issue, but until then return unsupported:
-*/
-   return false;
-   }
-
-   if (cxl_slot_is_switched(dev))
-   return false;
-
-   /*
-* XXX: This gets a little tricky on regular P8 (not POWER8NVL) since
-* the CAPP can be connected to PHB 0, 1 or 2 on a first come first
-* served basis, which is racy to check from here. If we need to
-* support this in future we might need to consider having this
-* function effectively reserve it ahead of time.
-*
-* Currently, the only user of this API is the Mellanox CX4, which is
-* only supported on P8NVL due to the above mentioned limitation of
-* CAPP DMA mode and therefore does not need to worry about this. If the
-* issue with CAPP DMA mode is later worked around on P8 we might need
-* to revisit this.
-*/
-
-   return true;
-}
-EXPORT_SYMBOL_GPL(cxl_slot_is_supported);
-
-
 static int cxl_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
struct cxl *adapter;
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index 74da2e440763..ea9ff4a1a9ca 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -24,21 +24,6 @@
  * generic PCI API. This API is agnostic to the actual AFU.
  */
 
-#define CXL_SLOT_FLAG_DMA 0x1
-
-/*
- * Checks if the given card is in a cxl capable slot. Pass CXL_SLOT_FLAG_DMA if
- * the card requires CAPP DMA mode to also check if the system supports it.
- * This is intended to be used by bi-modal devices to determine if they can use
- * cxl mode or if they should continue running in PCI mode.
- *
- * Note that this only checks if the slot is cxl capable - it does not
- * currently check if the CAPP is currently available for chips where it can be
- * assigned to different PHBs on a first come first serve basis (i.e. P8)
- */
-bool cxl_slot_is_supported(struct pci_dev *dev, int flags);
-
-
 /* Get the AFU associated with a pci_dev */
 struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev);
 
-- 
2.17.1



[PATCH v2 07/10] Revert "powerpc/powernv: Add support for the cxl kernel api on the real phb"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit 4361b03430d685610e5feea3ec7846e8b9ae795f.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/include/asm/pnv-pci.h|   7 --
 arch/powerpc/platforms/powernv/pci-cxl.c  | 115 --
 arch/powerpc/platforms/powernv/pci-ioda.c |  18 +---
 arch/powerpc/platforms/powernv/pci.h  |  13 ---
 4 files changed, 1 insertion(+), 152 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index d2d8c28db336..7f627e3f4da4 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -50,13 +50,6 @@ int pnv_cxl_alloc_hwirq_ranges(struct cxl_irq_ranges *irqs,
   struct pci_dev *dev, int num);
 void pnv_cxl_release_hwirq_ranges(struct cxl_irq_ranges *irqs,
  struct pci_dev *dev);
-
-/* Support for the cxl kernel api on the real PHB (instead of vPHB) */
-int pnv_cxl_enable_phb_kernel_api(struct pci_controller *hose, bool enable);
-bool pnv_pci_on_cxl_phb(struct pci_dev *dev);
-struct cxl_afu *pnv_cxl_phb_to_afu(struct pci_controller *hose);
-void pnv_cxl_phb_set_peer_afu(struct pci_dev *dev, struct cxl_afu *afu);
-
 #endif
 
 struct pnv_php_slot {
diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c 
b/arch/powerpc/platforms/powernv/pci-cxl.c
index c447b7f03c09..1b18111453d7 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -8,10 +8,8 @@
  */
 
 #include 
-#include 
 #include 
 #include 
-#include 
 
 #include "pci.h"
 
@@ -178,116 +176,3 @@ static inline int get_cxl_module(void)
 #else
 static inline int get_cxl_module(void) { return 0; }
 #endif
-
-/*
- * Sets flags and switches the controller ops to enable the cxl kernel api.
- * Originally the cxl kernel API operated on a virtual PHB, but certain cards
- * such as the Mellanox CX4 use a peer model instead and for these cards the
- * cxl kernel api will operate on the real PHB.
- */
-int pnv_cxl_enable_phb_kernel_api(struct pci_controller *hose, bool enable)
-{
-   struct pnv_phb *phb = hose->private_data;
-   int rc;
-
-   if (!enable) {
-   /*
-* Once cxl mode is enabled on the PHB, there is currently no
-* known safe method to disable it again, and trying risks a
-* checkstop. If we can find a way to safely disable cxl mode
-* in the future we can revisit this, but for now the only sane
-* thing to do is to refuse to disable cxl mode:
-*/
-   return -EPERM;
-   }
-
-   /*
-* Hold a reference to the cxl module since several PHB operations now
-* depend on it, and it would be insane to allow it to be removed so
-* long as we are in this mode (and since we can't safely disable this
-* mode once enabled...).
-*/
-   rc = get_cxl_module();
-   if (rc)
-   return rc;
-
-   phb->flags |= PNV_PHB_FLAG_CXL;
-   hose->controller_ops = pnv_cxl_cx4_ioda_controller_ops;
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_enable_phb_kernel_api);
-
-bool pnv_pci_on_cxl_phb(struct pci_dev *dev)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-
-   return !!(phb->flags & PNV_PHB_FLAG_CXL);
-}
-EXPORT_SYMBOL_GPL(pnv_pci_on_cxl_phb);
-
-struct cxl_afu *pnv_cxl_phb_to_afu(struct pci_controller *hose)
-{
-   struct pnv_phb *phb = hose->private_data;
-
-   return (struct cxl_afu *)phb->cxl_afu;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_phb_to_afu);
-
-void pnv_cxl_phb_set_peer_afu(struct pci_dev *dev, struct cxl_afu *afu)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-
-   phb->cxl_afu = afu;
-}
-EXPORT_SYMBOL_GPL(pnv_cxl_phb_set_peer_afu);
-
-/*
- * In the peer cxl model, the XSL/PSL is physical function 0, and will be used
- * by other functions on the device for memory access and interrupts. When the
- * other functions are enabled we explicitly take a reference on the cxl
- * function since they will use it, and allocate a default context associated
- * with that function just like the vPHB model of the cxl kernel API.
- */
-bool pnv_cxl_enable_device_hook(struct pci_dev *dev)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct cxl_afu *afu = phb->cxl_afu;
-
-   if (!pnv_pci_enable_device_hook(dev))
-   return false;
-
-
-   /* No special handling for the cxl function, which is always PF 0 */
-   if (PCI_FUNC(dev->devfn) == 0)
-   return true;
-
-   if (!afu) {
-   dev_WARN(&dev->dev, "Attempted to enable function > 0 on CXL 
PHB without a peer AFU\n");
-   return false;
-   }
-
-  

[PATCH v2 06/10] Revert "cxl: Add support for using the kernel API with a real PHB"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit 317f5ef1b363417b6f1e93b90dfd2ffd6be6e867.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/cxl/pci.c  |  3 ---
 drivers/misc/cxl/vphb.c | 16 ++--
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 9c5a21fee835..193ff22f610b 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1886,9 +1886,6 @@ static int cxl_probe(struct pci_dev *dev, const struct 
pci_device_id *id)
dev_err(&dev->dev, "AFU %i failed to start: %i\n", 
slice, rc);
}
 
-   if (pnv_pci_on_cxl_phb(dev) && adapter->slices >= 1)
-   pnv_cxl_phb_set_peer_afu(dev, adapter->afu[0]);
-
return 0;
 }
 
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 7fd0bdc1436a..1a99c9c7a6fb 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -9,7 +9,6 @@
 
 #include 
 #include 
-#include 
 #include "cxl.h"
 
 static int cxl_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
@@ -284,18 +283,13 @@ void cxl_pci_vphb_remove(struct cxl_afu *afu)
 */
 }
 
-static bool _cxl_pci_is_vphb_device(struct pci_controller *phb)
-{
-   return (phb->ops == &cxl_pcie_pci_ops);
-}
-
 bool cxl_pci_is_vphb_device(struct pci_dev *dev)
 {
struct pci_controller *phb;
 
phb = pci_bus_to_host(dev->bus);
 
-   return _cxl_pci_is_vphb_device(phb);
+   return (phb->ops == &cxl_pcie_pci_ops);
 }
 
 struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev)
@@ -304,13 +298,7 @@ struct cxl_afu *cxl_pci_to_afu(struct pci_dev *dev)
 
phb = pci_bus_to_host(dev->bus);
 
-   if (_cxl_pci_is_vphb_device(phb))
-   return (struct cxl_afu *)phb->private_data;
-
-   if (pnv_pci_on_cxl_phb(dev))
-   return pnv_cxl_phb_to_afu(phb);
-
-   return ERR_PTR(-ENODEV);
+   return (struct cxl_afu *)phb->private_data;
 }
 EXPORT_SYMBOL_GPL(cxl_pci_to_afu);
 
-- 
2.17.1



[PATCH v2 05/10] Revert "cxl: Add cxl_check_and_switch_mode() API to switch bi-modal cards"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit b0b5e5918ad1babfd1d43d98c7281926a7b57b9f.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/cxl/Kconfig |   8 --
 drivers/misc/cxl/pci.c   | 236 +++
 include/misc/cxl.h   |  25 -
 3 files changed, 18 insertions(+), 251 deletions(-)

diff --git a/drivers/misc/cxl/Kconfig b/drivers/misc/cxl/Kconfig
index 93397cb05b15..3ce933707828 100644
--- a/drivers/misc/cxl/Kconfig
+++ b/drivers/misc/cxl/Kconfig
@@ -33,11 +33,3 @@ config CXL
  CAPI adapters are found in POWER8 based systems.
 
  If unsure, say N.
-
-config CXL_BIMODAL
-   bool "Support for bi-modal CAPI cards"
-   depends on HOTPLUG_PCI_POWERNV = y && CXL || HOTPLUG_PCI_POWERNV = m && 
CXL = m
-   default y
-   help
- Select this option to enable support for bi-modal CAPI cards, such as
- the Mellanox CX-4.
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 429d6de1dde7..9c5a21fee835 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -55,8 +55,6 @@
pci_read_config_byte(dev, vsec + 0xa, dest)
 #define CXL_WRITE_VSEC_MODE_CONTROL(dev, vsec, val) \
pci_write_config_byte(dev, vsec + 0xa, val)
-#define CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, vsec, val) \
-   pci_bus_write_config_byte(bus, devfn, vsec + 0xa, val)
 #define CXL_VSEC_PROTOCOL_MASK   0xe0
 #define CXL_VSEC_PROTOCOL_1024TB 0x80
 #define CXL_VSEC_PROTOCOL_512TB  0x40
@@ -800,234 +798,36 @@ static int setup_cxl_bars(struct pci_dev *dev)
return 0;
 }
 
-#ifdef CONFIG_CXL_BIMODAL
-
-struct cxl_switch_work {
-   struct pci_dev *dev;
-   struct work_struct work;
-   int vsec;
-   int mode;
-};
-
-static void switch_card_to_cxl(struct work_struct *work)
+/* pciex node: ibm,opal-m64-window = <0x3d058 0x0 0x3d058 0x0 0x8 0x0>; */
+static int switch_card_to_cxl(struct pci_dev *dev)
 {
-   struct cxl_switch_work *switch_work =
-   container_of(work, struct cxl_switch_work, work);
-   struct pci_dev *dev = switch_work->dev;
-   struct pci_bus *bus = dev->bus;
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pci_dev *bridge;
-   struct pnv_php_slot *php_slot;
-   unsigned int devfn;
+   int vsec;
u8 val;
int rc;
 
-   dev_info(&bus->dev, "cxl: Preparing for mode switch...\n");
-   bridge = list_first_entry_or_null(&hose->bus->devices, struct pci_dev,
- bus_list);
-   if (!bridge) {
-   dev_WARN(&bus->dev, "cxl: Couldn't find root port!\n");
-   goto err_dev_put;
-   }
+   dev_info(&dev->dev, "switch card to CXL\n");
 
-   php_slot = pnv_php_find_slot(pci_device_to_OF_node(bridge));
-   if (!php_slot) {
-   dev_err(&bus->dev, "cxl: Failed to find slot hotplug "
-  "information. You may need to upgrade "
-  "skiboot. Aborting.\n");
-   goto err_dev_put;
-   }
-
-   rc = CXL_READ_VSEC_MODE_CONTROL(dev, switch_work->vsec, &val);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to read CAPI mode control: 
%i\n", rc);
-   goto err_dev_put;
-   }
-   devfn = dev->devfn;
-
-   /* Release the reference obtained in cxl_check_and_switch_mode() */
-   pci_dev_put(dev);
-
-   dev_dbg(&bus->dev, "cxl: Removing PCI devices from kernel\n");
-   pci_lock_rescan_remove();
-   pci_hp_remove_devices(bridge->subordinate);
-   pci_unlock_rescan_remove();
-
-   /* Switch the CXL protocol on the card */
-   if (switch_work->mode == CXL_BIMODE_CXL) {
-   dev_info(&bus->dev, "cxl: Switching card to CXL mode\n");
-   val &= ~CXL_VSEC_PROTOCOL_MASK;
-   val |= CXL_VSEC_PROTOCOL_256TB | CXL_VSEC_PROTOCOL_ENABLE;
-   rc = pnv_cxl_enable_phb_kernel_api(hose, true);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to enable kernel API"
-  " on real PHB, aborting\n");
-   goto err_free_work;
-   }
-   } else {
-   dev_WARN(&bus->dev, "cxl: Switching card to PCI mode not 
supported!\n");
-   goto err_free_work;
-   }
-
-   rc = CXL_WRITE_VSEC_MODE_CONTROL_BUS(bus, devfn, switch_work->vsec, 
val);
-   if (rc) {
-   dev_err(&bus->dev, "cxl: Failed to configure CXL protocol: 
%i\n", rc);
-   goto err_free_work;
-   }
-
-   /*
-* The CAIA spec (v1.1, Section 10.6 Bi-modal Device Support) states
-* we must wait 100ms after this mode switch before touching PCIe config
-* space.
-*/
-   msleep(100);
-
-   /*
-* Hot reset to cause the card to come back in cxl mode. A
-* OPAL_RESET

[PATCH v2 04/10] Revert "cxl: Add kernel APIs to get & set the max irqs per context"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit 79384e4b71240abf50c375eea56060b0d79c242a.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/cxl/api.c | 27 ---
 1 file changed, 27 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 34ba67bc41bd..a535c1e6aa92 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -552,30 +552,3 @@ ssize_t cxl_read_adapter_vpd(struct pci_dev *dev, void 
*buf, size_t count)
return cxl_ops->read_adapter_vpd(afu->adapter, buf, count);
 }
 EXPORT_SYMBOL_GPL(cxl_read_adapter_vpd);
-
-int cxl_set_max_irqs_per_process(struct pci_dev *dev, int irqs)
-{
-   struct cxl_afu *afu = cxl_pci_to_afu(dev);
-   if (IS_ERR(afu))
-   return -ENODEV;
-
-   if (irqs > afu->adapter->user_irqs)
-   return -EINVAL;
-
-   /* Limit user_irqs to prevent the user increasing this via sysfs */
-   afu->adapter->user_irqs = irqs;
-   afu->irqs_max = irqs;
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(cxl_set_max_irqs_per_process);
-
-int cxl_get_max_irqs_per_process(struct pci_dev *dev)
-{
-   struct cxl_afu *afu = cxl_pci_to_afu(dev);
-   if (IS_ERR(afu))
-   return -ENODEV;
-
-   return afu->irqs_max;
-}
-EXPORT_SYMBOL_GPL(cxl_get_max_irqs_per_process);
-- 
2.17.1



[PATCH v2 03/10] Revert "cxl: Add preliminary workaround for CX4 interrupt limitation"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned CAPI support for the Mellanox CX4.

This reverts commit cbce0917e2e47d4bf5aa3b5fd6b1247f33e1a126.

Signed-off-by: Alastair D'Silva 
---
 drivers/misc/cxl/api.c | 15 ---
 drivers/misc/cxl/base.c| 17 -
 drivers/misc/cxl/context.c |  1 -
 drivers/misc/cxl/cxl.h | 10 --
 drivers/misc/cxl/main.c|  1 -
 include/misc/cxl.h | 20 
 6 files changed, 64 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 2e5862b7a074..34ba67bc41bd 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -181,21 +181,6 @@ static irq_hw_number_t cxl_find_afu_irq(struct cxl_context 
*ctx, int num)
return 0;
 }
 
-int _cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int 
*afu_irq)
-{
-   if (*ctx == NULL || *afu_irq == 0) {
-   *afu_irq = 1;
-   *ctx = cxl_get_context(pdev);
-   } else {
-   (*afu_irq)++;
-   if (*afu_irq > cxl_get_max_irqs_per_process(pdev)) {
-   *ctx = list_next_entry(*ctx, extra_irq_contexts);
-   *afu_irq = 1;
-   }
-   }
-   return cxl_find_afu_irq(*ctx, *afu_irq);
-}
-/* Exported via cxl_base */
 
 int cxl_set_priv(struct cxl_context *ctx, void *priv)
 {
diff --git a/drivers/misc/cxl/base.c b/drivers/misc/cxl/base.c
index fe90f895bb10..e1e80cb99ad9 100644
--- a/drivers/misc/cxl/base.c
+++ b/drivers/misc/cxl/base.c
@@ -141,23 +141,6 @@ void cxl_pci_disable_device(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(cxl_pci_disable_device);
 
-int cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int 
*afu_irq)
-{
-   int ret;
-   struct cxl_calls *calls;
-
-   calls = cxl_calls_get();
-   if (!calls)
-   return -EBUSY;
-
-   ret = calls->cxl_next_msi_hwirq(pdev, ctx, afu_irq);
-
-   cxl_calls_put(calls);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(cxl_next_msi_hwirq);
-
 static int __init cxl_base_init(void)
 {
struct device_node *np;
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index c6ec872800a2..0355d42d367f 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -74,7 +74,6 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
ctx->pending_afu_err = false;
 
INIT_LIST_HEAD(&ctx->irq_names);
-   INIT_LIST_HEAD(&ctx->extra_irq_contexts);
 
/*
 * When we have to destroy all contexts in cxl_context_detach_all() we
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 9688fe8b4d80..d95c2c98f2ab 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -623,14 +623,6 @@ struct cxl_context {
 
struct rcu_head rcu;
 
-   /*
-* Only used when more interrupts are allocated via
-* pci_enable_msix_range than are supported in the default context, to
-* use additional contexts to overcome the limitation. i.e. Mellanox
-* CX4 only:
-*/
-   struct list_head extra_irq_contexts;
-
struct mm_struct *mm;
 
u16 tidr;
@@ -878,13 +870,11 @@ ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
 /* Internal functions wrapped in cxl_base to allow PHB to call them */
 bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu *afu);
 void _cxl_pci_disable_device(struct pci_dev *dev);
-int _cxl_next_msi_hwirq(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq);
 
 struct cxl_calls {
void (*cxl_slbia)(struct mm_struct *mm);
 bool (*cxl_pci_associate_default_context)(struct pci_dev *dev, struct cxl_afu *afu);
void (*cxl_pci_disable_device)(struct pci_dev *dev);
-   int (*cxl_next_msi_hwirq)(struct pci_dev *pdev, struct cxl_context **ctx, int *afu_irq);
 
struct module *owner;
 };
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index 59a904efd104..a7e83624034b 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -106,7 +106,6 @@ static struct cxl_calls cxl_calls = {
.cxl_slbia = cxl_slbia_core,
.cxl_pci_associate_default_context = _cxl_pci_associate_default_context,
.cxl_pci_disable_device = _cxl_pci_disable_device,
-   .cxl_next_msi_hwirq = _cxl_next_msi_hwirq,
.owner = THIS_MODULE,
 };
 
diff --git a/include/misc/cxl.h b/include/misc/cxl.h
index 82cc6ffafe2d..6a3711a2e217 100644
--- a/include/misc/cxl.h
+++ b/include/misc/cxl.h
@@ -183,26 +183,6 @@ void cxl_psa_unmap(void __iomem *addr);
 /*  Get the process element for this context */
 int cxl_process_element(struct cxl_context *ctx);
 
-/*
- * Limit the number of interrupts that a single context can allocate via
- * cxl_start_work. If using the api with a real phb, this may be used to
- * request that additional default contexts be created when allocating
- * interrupts via pci_enable_msix

[PATCH v2 02/10] Revert "cxl: Add support for interrupts on the Mellanox CX4"

2018-06-28 Thread Frederic Barrat
From: Alastair D'Silva 

Remove abandoned capi support for the Mellanox CX4.

This reverts commit a2f67d5ee8d950caaa7a6144cf0bfb256500b73e.

Signed-off-by: Alastair D'Silva 
---
 arch/powerpc/platforms/powernv/pci-cxl.c  | 84 ---
 arch/powerpc/platforms/powernv/pci-ioda.c |  4 --
 arch/powerpc/platforms/powernv/pci.h  |  2 -
 drivers/misc/cxl/api.c| 71 ---
 drivers/misc/cxl/base.c   | 31 -
 drivers/misc/cxl/cxl.h|  4 --
 drivers/misc/cxl/main.c   |  2 -
 include/misc/cxl-base.h   |  4 --
 8 files changed, 202 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c b/arch/powerpc/platforms/powernv/pci-cxl.c
index cee003de63af..c447b7f03c09 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -8,7 +8,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -292,86 +291,3 @@ void pnv_cxl_disable_device(struct pci_dev *dev)
cxl_pci_disable_device(dev);
cxl_afu_put(afu);
 }
-
-/*
- * This is a special version of pnv_setup_msi_irqs for cards in cxl mode. This
- * function handles setting up the IVTE entries for the XSL to use.
- *
- * We are currently not filling out the MSIX table, since the only currently
- * supported adapter (CX4) uses a custom MSIX table format in cxl mode and it
- * is up to their driver to fill that out. In the future we may fill out the
- * MSIX table (and change the IVTE entries to be an index to the MSIX table)
- * for adapters implementing the Full MSI-X mode described in the CAIA.
- */
-int pnv_cxl_cx4_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct msi_desc *entry;
-   struct cxl_context *ctx = NULL;
-   unsigned int virq;
-   int hwirq;
-   int afu_irq = 0;
-   int rc;
-
-   if (WARN_ON(!phb) || !phb->msi_bmp.bitmap)
-   return -ENODEV;
-
-   if (pdev->no_64bit_msi && !phb->msi32_support)
-   return -ENODEV;
-
-   rc = cxl_cx4_setup_msi_irqs(pdev, nvec, type);
-   if (rc)
-   return rc;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->msi_attrib.is_64 && !phb->msi32_support) {
-   pr_warn("%s: Supports only 64-bit MSIs\n",
-   pci_name(pdev));
-   return -ENXIO;
-   }
-
-   hwirq = cxl_next_msi_hwirq(pdev, &ctx, &afu_irq);
-   if (WARN_ON(hwirq <= 0))
-   return (hwirq ? hwirq : -ENOMEM);
-
-   virq = irq_create_mapping(NULL, hwirq);
-   if (!virq) {
-   pr_warn("%s: Failed to map cxl mode MSI to linux irq\n",
-   pci_name(pdev));
-   return -ENOMEM;
-   }
-
-   rc = pnv_cxl_ioda_msi_setup(pdev, hwirq, virq);
-   if (rc) {
-   pr_warn("%s: Failed to setup cxl mode MSI\n", pci_name(pdev));
-   irq_dispose_mapping(virq);
-   return rc;
-   }
-
-   irq_set_msi_desc(virq, entry);
-   }
-
-   return 0;
-}
-
-void pnv_cxl_cx4_teardown_msi_irqs(struct pci_dev *pdev)
-{
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
-   struct msi_desc *entry;
-   irq_hw_number_t hwirq;
-
-   if (WARN_ON(!phb))
-   return;
-
-   for_each_pci_msi_entry(entry, pdev) {
-   if (!entry->irq)
-   continue;
-   hwirq = virq_to_hw(entry->irq);
-   irq_set_msi_desc(entry->irq, NULL);
-   irq_dispose_mapping(entry->irq);
-   }
-
-   cxl_cx4_teardown_msi_irqs(pdev);
-}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5bd0eb6681bc..41f8f0ff4a55 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3847,10 +3847,6 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
 const struct pci_controller_ops pnv_cxl_cx4_ioda_controller_ops = {
.dma_dev_setup  = pnv_pci_dma_dev_setup,
.dma_bus_setup  = pnv_pci_dma_bus_setup,
-#ifdef CONFIG_PCI_MSI
-   .setup_msi_irqs = pnv_cxl_cx4_setup_msi_irqs,
-   .teardown_msi_irqs  = pnv_cxl_cx4_teardown_msi_irqs,
-#endif
.enable_device_hook = pnv_cxl_enable_device_hook,
.disable_device = pnv_cxl_disable_device,
.release_device = pnv_pci_release_device,
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index eada4b6068cb..ba41913c7e21 100644
--- a/arch

[PATCH v2 00/10] cxl: Remove abandoned capi support for the Mellanox CX4

2018-06-28 Thread Frederic Barrat
An attempt was made to add capi support for the Mellanox CX4 card, so
that it could operate in "traditional" PCI mode or capi mode. The
project ended up being canceled and a different approach was taken for
CX5. Mellanox never upstreamed any capi support in their CX4 driver.

CX4 did not follow the CAIA 1.0 model exactly, so some CX4-specific
code was added. That code is now dead and hasn't been tested for a
while, so it's probably broken anyway. Let's remove it. This has been
agreed with engineers at Mellanox, of course.

No user API is affected. Some (unused) symbols exported by cxl are removed.

Most of the patch set consists of reverts of older patches, with
sometimes very minor tweaking. The last patch aggregates a few changes
where reverting was not possible easily.

Tests run to prevent regressions:
-  memcpy tests on POWER8 and POWER9
-  Mellanox CX5, which uses a different code path, based on cxllib
-  cxlflash tests on POWER8


Changelog:
v2: add missing signed-offs for the revert patches


Alastair D'Silva (7):
  Revert "cxl: Add kernel API to allow a context to operate with
relocate disabled"
  Revert "cxl: Add support for interrupts on the Mellanox CX4"
  Revert "cxl: Add preliminary workaround for CX4 interrupt limitation"
  Revert "cxl: Add kernel APIs to get & set the max irqs per context"
  Revert "cxl: Add cxl_check_and_switch_mode() API to switch bi-modal
cards"
  Revert "cxl: Add support for using the kernel API with a real PHB"
  Revert "powerpc/powernv: Add support for the cxl kernel api on the
real phb"

Frederic Barrat (3):
  Revert "cxl: Add cxl_slot_is_supported API"
  Revert "cxl: Allow a default context to be associated with an external
pci_dev"
  cxl: Remove abandoned capi support for the Mellanox CX4, final
cleanup

 arch/powerpc/include/asm/pnv-pci.h|   7 -
 arch/powerpc/platforms/powernv/pci-cxl.c  | 199 
 arch/powerpc/platforms/powernv/pci-ioda.c |  22 +-
 arch/powerpc/platforms/powernv/pci.h  |  15 -
 drivers/misc/cxl/Kconfig  |   8 -
 drivers/misc/cxl/Makefile |   2 +-
 drivers/misc/cxl/api.c| 132 
 drivers/misc/cxl/base.c   |  83 -
 drivers/misc/cxl/context.c|   3 +-
 drivers/misc/cxl/cxl.h|  33 --
 drivers/misc/cxl/debugfs.c|   5 -
 drivers/misc/cxl/guest.c  |   3 -
 drivers/misc/cxl/main.c   |   5 -
 drivers/misc/cxl/native.c |   3 +-
 drivers/misc/cxl/pci.c| 351 ++
 drivers/misc/cxl/phb.c|  44 ---
 drivers/misc/cxl/vphb.c   |  46 +--
 include/misc/cxl-base.h   |  10 -
 include/misc/cxl.h|  68 -
 19 files changed, 58 insertions(+), 981 deletions(-)
 delete mode 100644 drivers/misc/cxl/phb.c

-- 
2.17.1



Re: [next-20180601][nvme][ppc] Kernel Oops is triggered when creating lvm snapshots on nvme disks

2018-06-28 Thread Abdul Haleem
On Tue, 2018-06-26 at 23:36 +1000, Michael Ellerman wrote:
> Abdul Haleem  writes:
> 
> > Greeting's
> >
> > Kernel Oops is seen on 4.17.0-rc7-next-20180601 kernel on a bare-metal
> > machine when running lvm snapshot tests on nvme disks.
> >
> > Machine Type: Power 8 bare-metal
> > kernel : 4.17.0-rc7-next-20180601
> > test:  
> > $ pvcreate -y /dev/nvme0n1
> > $ vgcreate avocado_vg /dev/nvme0n1
> > $ lvcreate --size 1.4T --name avocado_lv avocado_vg -y
> > $ mkfs.ext2 /dev/avocado_vg/avocado_lv
> > $ lvcreate --size 1G --snapshot --name avocado_sn /dev/avocado_vg/avocado_lv -y
> > $ lvconvert --merge /dev/avocado_vg/avocado_sn
> 
> > the last command results in Oops:
> >
> > Unable to handle kernel paging request for data at address 0x00d0
> > Faulting instruction address: 0xc02dced4
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > LE SMP NR_CPUS=2048 NUMA PowerNV
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Modules linked in: dm_snapshot dm_bufio nvme bnx2x iptable_mangle
> > ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
> > nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4
> > xt_tcpudp tun bridge stp llc iptable_filter dm_mirror dm_region_hash
> > dm_log dm_service_time vmx_crypto powernv_rng rng_core dm_multipath
> > kvm_hv binfmt_misc kvm nfsd ip_tables x_tables autofs4 xfs lpfc
> > crc_t10dif crct10dif_generic mdio nvme_fc libcrc32c nvme_fabrics
> > nvme_core crct10dif_common [last unloaded: nvme]
> > CPU: 70 PID: 157763 Comm: lvconvert Not tainted 4.17.0-rc7-next-20180601-autotest-autotest #1
> > NIP:  c02dced4 LR: c0244d14 CTR: c0244cf0
> > REGS: c01f81d6b5a0 TRAP: 0300   Not tainted  (4.17.0-rc7-next-20180601-autotest-autotest)
> > MSR:  90010280b033   CR: 22442444  XER: 2000
> > CFAR: c0008934 DAR: 00d0 DSISR: 4000 SOFTE: 0
> > GPR00: c0244d14 c01f81d6b820 c109c400 c03c9d080180
> > GPR04: 0001 c01fad51 c01fad51 0001
> > GPR08:  f000 f008 
> > GPR12: c0244cf0 c01c4f80 7fffa0e31090 7fffd9d9b470
> > GPR16:  005c 7fffa0e3a5b0 7fffa0e62040
> > GPR20: 010014ad7d50 010014ad7d20 7fffa0e64210 0001
> > GPR24:  c081bae0 c01ed2461b00 df859d08
> > GPR28: c03c9d080180 c0244d14 0001 
> > NIP [c02dced4] kmem_cache_free+0x1a4/0x2b0
> > LR [c0244d14] mempool_free_slab+0x24/0x40
> 
> Are you running with slub debugging enabled?
> Try booting with slub_debug=FZP

I was able to reproduce again with slub_debug=FZP and DEBUG_INFO enabled
on 4.17.0-rc7-next-20180601, but not much traces other than the Oops
stack trace

cat /proc/cmdline
rw,slub_debug=FZP root=UUID=e62c58bb-2824-4075-a31d-455f1bb62504 

.config
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y


the faulty instruction points to below code path :

gdb -batch vmlinux -ex 'list *(0xc0304fe0)'
0xc0304fe0 is in kmem_cache_free (mm/slab.h:231).
226 }
227 
228 static inline bool slab_equal_or_root(struct kmem_cache *s,
229   struct kmem_cache *p)
230 {
231 return p == s || p == s->memcg_params.root_cache;
232 }
233 
234 /*
235  * We use suffixes to the name in memcg because we can't have caches

detailed dmesg logs attached.

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


[0.00] dt-cpu-ftrs: setup for ISA 3000
[0.00] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or unsupported by kernel)
[0.00] dt-cpu-ftrs: final cpu/mmu features = 0x786f8f5fb1a7 0x3c006041
[0.00] radix-mmu: Page sizes from device-tree:
[0.00] radix-mmu: Page size shift = 12 AP=0x0
[0.00] radix-mmu: Page size shift = 16 AP=0x5
[0.00] radix-mmu: Page size shift = 21 AP=0x1
[0.00] radix-mmu: Page size shift = 30 AP=0x2
[0.00] radix-mmu: Initializing Radix MMU
[0.00] radix-mmu: Partition table (ptrval)
[0.00] radix-mmu: Mapped 0x-0x0010 with 1.00 GiB pages
[0.00] radix-mmu: Mapped 0x2000-0x2010 with 1.00 GiB pages
[0.00] radix-mmu: Process table (ptrval) and radix root for kernel: (ptrval)
[0.00] Linux version 4.17.0-rc7-next-20180601-autotest (root@ltc-boston21) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #3 SMP Thu Jun 28 03:01:06 CDT 2018
[0.00] Found initrd at 0xc2d5:0xc9265921
[0.00] OPAL: Found memory mapped LPC bus on chip 0
[0.00] ISA: Non-PCI bridge is /lpcm-opb@60300/lpc@0
[0.00] Using PowerNV machine description
[0.