Re: [PATCH 1/2] powerpc/tracing: Trace TLBIE(L)

2017-04-11 Thread Balbir Singh
On Tue, 2017-04-11 at 15:23 +1000, Balbir Singh wrote:
> Just a quick patch to trace tlbie(l)'s. The idea being that it can be
> enabled when we suspect corruption or when we need to see if we are doing
> the right thing during flush. I think the format can be enhanced to
> make it nicer (expand the RB/RS/IS/L cases in more detail if we ever
> need that level of detail).

The subject is misleading: this is not PATCH 1/2, it should read
[PATCH v2]. Sorry! I can resend this if required.

Balbir


Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Peter Zijlstra
On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote:
> Perf already has support for disassembling the branch instruction
> and using the branch type for filtering. The patch just records
> the branch type in perf_branch_entry.
> 
> Before recording, the patch converts the x86 branch classification
> to common branch classification.

This is still a completely inadequate changelog. I really will not
accept patches like this.

> 
> Signed-off-by: Jin Yao 
> ---
>  arch/x86/events/intel/lbr.c | 53 -
>  1 file changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index 81b321a..6968c63 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -109,6 +109,9 @@ enum {
>   X86_BR_ZERO_CALL= 1 << 15,/* zero length call */
>   X86_BR_CALL_STACK   = 1 << 16,/* call stack */
>   X86_BR_IND_JMP  = 1 << 17,/* indirect jump */
> +
> + X86_BR_TYPE_SAVE= 1 << 18,/* indicate to save branch type */
> +
>  };
>  
>  #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
> @@ -670,6 +673,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>  
>   if (br_type & PERF_SAMPLE_BRANCH_CALL)
>   mask |= X86_BR_CALL | X86_BR_ZERO_CALL;
> +
> + if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE)
> + mask |= X86_BR_TYPE_SAVE;
> +
>   /*
>* stash actual user request into reg, it may
>* be used by fixup code for some CPU
> @@ -923,6 +930,44 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
>   return ret;
>  }
>  
> +#define X86_BR_TYPE_MAP_MAX  16
> +
> +static int
> +common_branch_type(int type)
> +{
> + int i, mask;
> + const int branch_map[X86_BR_TYPE_MAP_MAX] = {
> + PERF_BR_CALL,   /* X86_BR_CALL */
> + PERF_BR_RET,/* X86_BR_RET */
> + PERF_BR_SYSCALL,/* X86_BR_SYSCALL */
> + PERF_BR_SYSRET, /* X86_BR_SYSRET */
> + PERF_BR_INT,/* X86_BR_INT */
> + PERF_BR_IRET,   /* X86_BR_IRET */
> + PERF_BR_JCC,/* X86_BR_JCC */
> + PERF_BR_JMP,/* X86_BR_JMP */
> + PERF_BR_IRQ,/* X86_BR_IRQ */
> + PERF_BR_IND_CALL,   /* X86_BR_IND_CALL */
> + PERF_BR_NONE,   /* X86_BR_ABORT */
> + PERF_BR_NONE,   /* X86_BR_IN_TX */
> + PERF_BR_NONE,   /* X86_BR_NO_TX */
> + PERF_BR_CALL,   /* X86_BR_ZERO_CALL */
> + PERF_BR_NONE,   /* X86_BR_CALL_STACK */
> + PERF_BR_IND_JMP,/* X86_BR_IND_JMP */
> + };
> +
> + type >>= 2; /* skip X86_BR_USER and X86_BR_KERNEL */
> + mask = ~(~0 << 1);
> +
> + for (i = 0; i < X86_BR_TYPE_MAP_MAX; i++) {
> + if (type & mask)
> + return branch_map[i];
> +
> + type >>= 1;
> + }
> +
> + return PERF_BR_NONE;
> +}
> +
>  /*
>   * implement actual branch filter based on user demand.
>   * Hardware may not exactly satisfy that request, thus
> @@ -939,7 +984,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>   bool compress = false;
>  
>   /* if sampling all branches, then nothing to filter */
> - if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
> + if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
> + ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
>   return;
>  
>   for (i = 0; i < cpuc->lbr_stack.nr; i++) {
> @@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>   cpuc->lbr_entries[i].from = 0;
>   compress = true;
>   }
> +
> + if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
> + cpuc->lbr_entries[i].type = common_branch_type(type);
> + else
> + cpuc->lbr_entries[i].type = PERF_BR_NONE;
>   }
>  
>   if (!compress)
> -- 
> 2.7.4
> 


[PATCH kernel v2] powerpc/iommu: Do not call PageTransHuge() on tail pages

2017-04-11 Thread Alexey Kardashevskiy
The CMA pages migration code does not support compound pages at
the moment, so it performs a few tests before proceeding to the actual
page migration.

One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
it is designed to be called on head pages only. Since we also test for
PageCompound(), which covers both PageTail() and PageHead(), we can
simplify the check by leaving just PageCompound() and therefore avoid
a possible VM_BUG_ON_PAGE.

Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
Cc: sta...@vger.kernel.org # v4.9+
Signed-off-by: Alexey Kardashevskiy 
Acked-by: Balbir Singh 
---

Changes:
v2:
* instead of moving PageCompound() to the beginning, this just drops
PageHuge() and PageTransHuge()

---
 arch/powerpc/mm/mmu_context_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 497130c5c742..96f835cbf212 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -81,7 +81,7 @@ struct page *new_iommu_non_cma_page(struct page *page, unsigned long private,
gfp_t gfp_mask = GFP_USER;
struct page *new_page;
 
-   if (PageHuge(page) || PageTransHuge(page) || PageCompound(page))
+   if (PageCompound(page))
return NULL;
 
if (PageHighMem(page))
@@ -100,7 +100,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
LIST_HEAD(cma_migrate_pages);
 
/* Ignore huge pages for now */
-   if (PageHuge(page) || PageTransHuge(page) || PageCompound(page))
+   if (PageCompound(page))
return -EBUSY;
 
lru_add_drain();
-- 
2.11.0



Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Jin, Yao



On 4/11/2017 3:52 PM, Peter Zijlstra wrote:

This is still a completely inadequate changelog. I really will not
accept patches like this.


Hi,

The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report: Show branch type").

Does the changelog need to be added in each patch's description?

That's fine, I can add and resend this patch.

Thanks
Jin Yao



Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Peter Zijlstra
On Tue, Apr 11, 2017 at 09:52:19AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote:

> > @@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
> > cpuc->lbr_entries[i].from = 0;
> > compress = true;
> > }
> > +
> > +   if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
> > +   cpuc->lbr_entries[i].type = common_branch_type(type);
> > +   else
> > +   cpuc->lbr_entries[i].type = PERF_BR_NONE;
> > }

I was wondering WTH you did that else; because it should already be 0
(aka, BR_NONE). Then I found intel_pmu_lbr_read_32() is already broken,
and you just broke intel_pmu_lbr_read_64().

Arguably we should add a union on the last __u64 with a name for the
entire thing, but the below is the minimal fix.
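
Something along these lines is what that union could look like (a rough
sketch only; the name "flags" for the whole word is made up here, it is not
an actual uapi field):

	struct perf_branch_entry {
		__u64	from;
		__u64	to;
		union {
			__u64	flags;		/* whole flag word, for one-shot init */
			struct {
				__u64	mispred:1,
					predicted:1,
					in_tx:1,
					abort:1,
					cycles:16,
					reserved:44;
			};
		};
	};

With a name for the entire word, the read paths could clear every flag with
a single store (cpuc->lbr_entries[i].flags = 0) instead of one per field.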

---
Subject: perf,x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
From: Peter Zijlstra 
Date: Tue Apr 11 10:10:28 CEST 2017

When the perf_branch_entry::{in_tx,abort,cycles} fields were added,
intel_pmu_lbr_read_32() wasn't updated to initialize them.

Fixes: 135c5612c460 ("perf/x86/intel: Support Haswell/v4 LBR format")
Signed-off-by: Peter Zijlstra (Intel) 
---
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -507,6 +507,9 @@ static void intel_pmu_lbr_read_32(struct
cpuc->lbr_entries[i].to = msr_lastbranch.to;
cpuc->lbr_entries[i].mispred= 0;
cpuc->lbr_entries[i].predicted  = 0;
+   cpuc->lbr_entries[i].in_tx  = 0;
+   cpuc->lbr_entries[i].abort  = 0;
+   cpuc->lbr_entries[i].cycles = 0;
cpuc->lbr_entries[i].reserved   = 0;
}
cpuc->lbr_stack.nr = i;


Re: [PATCH kernel v2] powerpc/iommu: Do not call PageTransHuge() on tail pages

2017-04-11 Thread Balbir Singh
On Tue, 2017-04-11 at 17:54 +1000, Alexey Kardashevskiy wrote:
> The CMA pages migration code does not support compound pages at
> the moment, so it performs a few tests before proceeding to the actual
> page migration.
> 
> One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
> it is designed to be called on head pages only. Since we also test for
> PageCompound(), which covers both PageTail() and PageHead(), we can
> simplify the check by leaving just PageCompound() and therefore avoid
> a possible VM_BUG_ON_PAGE.
> 
> Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
> Cc: sta...@vger.kernel.org # v4.9+
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Balbir Singh 
> ---
> 
> Changes:
> v2:
> * instead of moving PageCompound() to the beginning, this just drops
> PageHuge() and PageTransHuge()
>

Looks good! My Acked-by is already present

Balbir Singh. 


Re: [PATCH kernel v2] powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc

2017-04-11 Thread Alexey Kardashevskiy
On 27/03/17 19:27, Alexey Kardashevskiy wrote:
> pnv_pci_table_alloc() ignores a possible failure from kzalloc_node(),
> so this adds a check. There are 2 callers of pnv_pci_table_alloc();
> one already checks for tbl != NULL, and this adds a WARN_ON() to the other
> path, which only happens during boot time in IODA1 and is not expected to fail.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * s/BUG_ON/WARN_ON/

Bad/good?


> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 +++
>  arch/powerpc/platforms/powernv/pci.c  | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index e36738291c32..04ef03a5201b 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2128,6 +2128,9 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>  
>  found:
>   tbl = pnv_pci_table_alloc(phb->hose->node);
> + if (WARN_ON(!tbl))
> + return;
> +
>   iommu_register_group(&pe->table_group, phb->hose->global_number,
>   pe->pe_number);
>   pnv_pci_link_table_and_group(phb->hose->node, 0, tbl, &pe->table_group);
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index eb835e977e33..9acdf6889c0d 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -766,6 +766,9 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>   struct iommu_table *tbl;
>  
>   tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
> + if (!tbl)
> + return NULL;
> +
>   INIT_LIST_HEAD_RCU(&tbl->it_group_list);
>  
>   return tbl;
> 


-- 
Alexey


Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Peter Zijlstra
On Tue, Apr 11, 2017 at 04:11:21PM +0800, Jin, Yao wrote:
> 
> 
> On 4/11/2017 3:52 PM, Peter Zijlstra wrote:
> > This is still a completely inadequate changelog. I really will not
> > accept patches like this.
> > 
> Hi,
> 
> The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report: Show branch type").
> 
> Does the changelog need to be added in each patch's description?
> 
> That's fine, I can add and resend this patch.

The cover letter is not retained; it is throwaway information.

Each patch should have a coherent changelog that explains why the patch
was done and explains any non-trivial things in the implementation.

Simply copy/pasting the same story into multiple patches is not right
either, for the simple fact that the patches are not the same. Each one
does a different thing, so each needs a different story.





Re: [PATCH v2] ppc64/kprobe: Fix oops when kprobed on 'stdu' instruction

2017-04-11 Thread Balbir Singh
On Tue, 2017-04-11 at 10:38 +0530, Ravi Bangoria wrote:
> If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel 
> OOPS:
> 
>   [ 1275.165932] Bad kernel stack pointer cd93c840 at c0009868
>   [ 1275.166378] Oops: Bad kernel stack pointer, sig: 6 [#1]
>   ...
>   GPR00: c01fcd93cb30 cd93c840 c15c5e00 cd93c840
>   ...
>   [ 1275.178305] NIP [c0009868] resume_kernel+0x2c/0x58
>   [ 1275.178594] LR [c0006208] program_check_common+0x108/0x180
> 
> Basically, on 64 bit system, when user probes on 'stdu' instruction,
> kernel does not emulate actual store in emulate_step itself because it
> may corrupt exception frame. So kernel does actual store operation in
> exception return code i.e. resume_kernel().
> 
> resume_kernel() loads the saved stack pointer from memory using lwz,
> effectively loading a corrupt (32bit) address, causing the kernel crash.
> 
> Fix this by loading the 64bit value instead.
> 
> Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of emulate_step()")
> Signed-off-by: Ravi Bangoria 
> Reviewed-by: Naveen N. Rao  
> ---

The patch looks correct to me from the description and code. I have not
validated that the write to GPR1(r1) via store of r8 to 0(r5) is indeed correct.
I would assume r8 should contain regs->gpr[r1] with the updated ea that
is written down to the GPR1(r1) which will be what we restore when we return
from the exception.

The conversion of lwz to ld indeed looks correct

Balbir Singh.



Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-11 Thread Michael Ellerman
Tyrel Datwyler  writes:

> On 04/06/2017 09:04 PM, Michael Ellerman wrote:
>> Tyrel Datwyler  writes:
>> 
>>> On 04/06/2017 03:27 AM, Sachin Sant wrote:
 On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
 any I/O adapter results in the following warning

 This problem has been in the code for some time now. I had first seen
 this in the -next tree.

>
> 
>
 Have attached the dmesg log from the system. Let me know if any additional
 information is required to help debug this problem.
>>>
>>> I remember you mentioning this when the issue was brought up for CPUs. I
>>> assume the case is the same here where the issue is only seen with
>>> adapters that were hot-added after boot (ie. hot-remove of adapter
>>> present at boot doesn't trip the warning)?
>> 
>> So who's fixing this?
>
> I started looking at it when Bharata submitted a patch trying to fix the
> issue for CPUs, but got sidetracked by other things. I suspect that
> this underflow has actually been an issue for quite some time, and we
> are just now becoming aware of it thanks to the refcount_t patchset being
> merged.

Yes I agree. Which means it might be broken in existing distros.

> I'll look into it again this week.

Thanks.

cheers


Re: EEH error in doing DMA with PEX 8619

2017-04-11 Thread IanJiang
I did another test:
- Call dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32)) in probe;
- Use DMA address or BUS address in DMA
But the EEH error remains.

All sources are based on PLX SDK 7.25.
Note: Sample test is in user space. It allocates memory and starts DMA
through PLX API.
The original sample NT_DmaTest does DMA between BARx and Host memory.
I change this for simple: Allocate two host memory buffers and try to do DMA
between them.

Device probe
===
(Driver/Source.Plx8000_DMA/Driver.c)

int
AddDevice(
DRIVER_OBJECT  *pDriverObject,
struct pci_dev *pPciDev
)
{
U8channel;
int   status;
U32   RegValue;
DEVICE_OBJECT*fdo;
DEVICE_OBJECT*pDevice;
DEVICE_EXTENSION *pdx;


// Allocate memory for the device object
fdo =
kmalloc(
sizeof(DEVICE_OBJECT),
GFP_KERNEL
);

if (fdo == NULL)
{
ErrorPrintf(("ERROR - memory allocation for device object
failed\n"));
return (-ENOMEM);
}

// Initialize device object
RtlZeroMemory( fdo, sizeof(DEVICE_OBJECT) );

fdo->DriverObject= pDriverObject; // Save parent driver
object
fdo->DeviceExtension = &(fdo->DeviceInfo);

// Enable the device
if (pci_enable_device( pPciDev ) == 0)
{
DebugPrintf(("Enabled PCI device\n"));
}
else
{
ErrorPrintf(("WARNING - PCI device enable failed\n"));
}

#if 1
    /* Newly added: Set DMA mask as suggested on linuxppc */
{
int err;
printk("Debug %s: dma_set_mask_and_coherent()...\n", __func__);
err = dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32));
if (err != 0) {
printk("Error %s: Failed dma_set_mask_and_coherent(). ret = %d\n",
__func__, err);
return err;
}

}
#endif
// Enable bus mastering
pci_set_master( pPciDev );

//
// Initialize the device extension
//

pdx = fdo->DeviceExtension;

// Clear device extension
RtlZeroMemory( pdx, sizeof(DEVICE_EXTENSION) );

// Store parent device object
pdx->pDeviceObject = fdo;

// Save the OS-supplied PCI object
pdx->pPciDevice = pPciDev;

// Set initial device device state
pdx->State = PLX_STATE_STOPPED;

// Set initial power state
pdx->PowerState = PowerDeviceD0;

// Store device location information
pdx->Key.domain   = pci_domain_nr(pPciDev->bus);
pdx->Key.bus  = pPciDev->bus->number;
pdx->Key.slot = PCI_SLOT(pPciDev->devfn);
pdx->Key.function = PCI_FUNC(pPciDev->devfn);
pdx->Key.DeviceId = pPciDev->device;
pdx->Key.VendorId = pPciDev->vendor;
pdx->Key.SubVendorId  = pPciDev->subsystem_vendor;
pdx->Key.SubDeviceId  = pPciDev->subsystem_device;
pdx->Key.DeviceNumber = pDriverObject->DeviceCount;

// Set API access mode
pdx->Key.ApiMode = PLX_API_MODE_PCI;

// Update Revision ID
PLX_PCI_REG_READ( pdx, PCI_REG_CLASS_REV, &RegValue );
pdx->Key.Revision = (U8)(RegValue & 0xFF);

// Set device mode
pdx->Key.DeviceMode = PLX_CHIP_MODE_STANDARD;

// Set PLX-specific port type
pdx->Key.PlxPortType = PLX_SPEC_PORT_DMA;

// Build device name
sprintf(
pdx->LinkName,
PLX_DRIVER_NAME "-%d",
pDriverObject->DeviceCount
);

// Initialize work queue for ISR DPC queueing
PLX_INIT_WORK(
&(pdx->Task_DpcForIsr),
DpcForIsr,// DPC routine
&(pdx->Task_DpcForIsr)// DPC parameter (pre-2.6.20 only)
);

// Initialize ISR spinlock
spin_lock_init( &(pdx->Lock_Isr) );

// Initialize interrupt wait list
INIT_LIST_HEAD( &(pdx->List_WaitObjects) );
spin_lock_init( &(pdx->Lock_WaitObjectsList) );

// Initialize physical memories list
INIT_LIST_HEAD( &(pdx->List_PhysicalMem) );
spin_lock_init( &(pdx->Lock_PhysicalMemList) );

// Set the DMA mask
if (Plx_dma_set_mask( pdx, PLX_DMA_BIT_MASK(48) ) == 0)
{
DebugPrintf(("Set DMA bit mask to 48-bits\n"));
}
else
{
DebugPrintf(("ERROR - Unable to set DMA mask to 48-bits, revert to
32-bit\n"));
Plx_dma_set_mask( pdx, PLX_DMA_BIT_MASK(32) );
}

// Set buffer allocation mask
if (Plx_dma_set_coherent_mask( pdx, PLX_DMA_BIT_MASK(32) ) != 0)
{
ErrorPrintf(("WARNING - Set DMA coherent mask failed\n"));
}

// Initialize DMA spinlocks
for (channel = 0; channel < MAX_DMA_CHANNELS; channel++)
{
spin_lock_init( &(pdx->Lock_Dma[channel]) );
}

//
// Add to driver device list
//

// Acquire Device List lock
spin_lock( &(pDriverObject->Lock_DeviceList) );

// Get device list head
pDevice = pDriverObject->DeviceObject;

if (pDevice == NULL)
{
// Add device as first in list
pDriverObject->DeviceObject = fdo;
}
else
{
// Go to end of list
while (pDe

Re: EEH error in doing DMA with PEX 8619

2017-04-11 Thread Benjamin Herrenschmidt
On Tue, 2017-04-11 at 02:26 -0700, IanJiang wrote:
> I did another test:
> - Call dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32)) in
> probe;
> - Use DMA address or BUS address in DMA
> But the EEH error remains.

We need to dig out the details of the EEH error. It will tell us
more precisely what is happening.

Note also that if your device can do 64-bit addresses, you should use
a 64-bit mask; it will result in more efficient transfers. However, we
should first investigate the problem with 32-bit because it seems to
indicate that you might be DMA'ing beyond your buffer.
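
The usual pattern for that looks roughly like this (a sketch only, not PLX
SDK code; the helper name is made up):

	#include <linux/pci.h>
	#include <linux/dma-mapping.h>

	/* Prefer a 64-bit DMA mask, fall back to 32-bit if the platform
	 * or device rejects it.
	 */
	static int example_setup_dma_mask(struct pci_dev *pdev)
	{
		int err;

		err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
		if (err) {
			dev_warn(&pdev->dev,
				 "64-bit DMA unavailable, trying 32-bit\n");
			err = dma_set_mask_and_coherent(&pdev->dev,
							DMA_BIT_MASK(32));
		}
		return err;
	}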

Another possibility would be if the requests from the PLX have a
different initiator ID on the bus than the device you are setting up
the DMA for.

> All sources are based on PLX SDK 7.25.
> Note: Sample test is in user space. It allocates memory and starts
> DMA
> through PLX API.
> The original sample NT_DmaTest does DMA between BARx and Host memory.
> I change this for simple: Allocate two host memory buffers and try to
> do DMA
> between them.
> 
> Device probe
> ===
> (Driver/Source.Plx8000_DMA/Driver.c)
> 
> int
> AddDevice(
> DRIVER_OBJECT  *pDriverObject,
> struct pci_dev *pPciDev
> )
> {
> U8channel;
> int   status;
> U32   RegValue;
> DEVICE_OBJECT*fdo;
> DEVICE_OBJECT*pDevice;
> DEVICE_EXTENSION *pdx;
> 
> 
> // Allocate memory for the device object
> fdo =
> kmalloc(
> sizeof(DEVICE_OBJECT),
> GFP_KERNEL
> );
> 
> if (fdo == NULL)
> {
> ErrorPrintf(("ERROR - memory allocation for device object
> failed\n"));
> return (-ENOMEM);
> }
> 
> // Initialize device object
> RtlZeroMemory( fdo, sizeof(DEVICE_OBJECT) );
> 
> fdo->DriverObject= pDriverObject; // Save parent
> driver
> object
> fdo->DeviceExtension = &(fdo->DeviceInfo);
> 
> // Enable the device
> if (pci_enable_device( pPciDev ) == 0)
> {
> DebugPrintf(("Enabled PCI device\n"));
> }
> else
> {
> ErrorPrintf(("WARNING - PCI device enable failed\n"));
> }
> 
> #if 1
> /* New added: Set DMA mask as suggestied on linuxppc */
> {
> int err;
> printk("Debug %s: dma_set_mask_and_coherent()...\n", __func__);
> err = dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32));
> if (err != 0) {
> printk("Error %s: Failed dma_set_mask_and_coherent(). ret =
> %d\n",
> __func__, err);
> return err;
> }
> 
> }
> #endif
> // Enable bus mastering
> pci_set_master( pPciDev );
> 
> //
> // Initialize the device extension
> //
> 
> pdx = fdo->DeviceExtension;
> 
> // Clear device extension
> RtlZeroMemory( pdx, sizeof(DEVICE_EXTENSION) );
> 
> // Store parent device object
> pdx->pDeviceObject = fdo;
> 
> // Save the OS-supplied PCI object
> pdx->pPciDevice = pPciDev;
> 
> // Set initial device device state
> pdx->State = PLX_STATE_STOPPED;
> 
> // Set initial power state
> pdx->PowerState = PowerDeviceD0;
> 
> // Store device location information
> pdx->Key.domain   = pci_domain_nr(pPciDev->bus);
> pdx->Key.bus  = pPciDev->bus->number;
> pdx->Key.slot = PCI_SLOT(pPciDev->devfn);
> pdx->Key.function = PCI_FUNC(pPciDev->devfn);
> pdx->Key.DeviceId = pPciDev->device;
> pdx->Key.VendorId = pPciDev->vendor;
> pdx->Key.SubVendorId  = pPciDev->subsystem_vendor;
> pdx->Key.SubDeviceId  = pPciDev->subsystem_device;
> pdx->Key.DeviceNumber = pDriverObject->DeviceCount;
> 
> // Set API access mode
> pdx->Key.ApiMode = PLX_API_MODE_PCI;
> 
> // Update Revision ID
> PLX_PCI_REG_READ( pdx, PCI_REG_CLASS_REV, &RegValue );
> pdx->Key.Revision = (U8)(RegValue & 0xFF);
> 
> // Set device mode
> pdx->Key.DeviceMode = PLX_CHIP_MODE_STANDARD;
> 
> // Set PLX-specific port type
> pdx->Key.PlxPortType = PLX_SPEC_PORT_DMA;
> 
> // Build device name
> sprintf(
> pdx->LinkName,
> PLX_DRIVER_NAME "-%d",
> pDriverObject->DeviceCount
> );
> 
> // Initialize work queue for ISR DPC queueing
> PLX_INIT_WORK(
> &(pdx->Task_DpcForIsr),
> DpcForIsr,// DPC routine
> &(pdx->Task_DpcForIsr)// DPC parameter (pre-2.6.20 only)
> );
> 
> // Initialize ISR spinlock
> spin_lock_init( &(pdx->Lock_Isr) );
> 
> // Initialize interrupt wait list
> INIT_LIST_HEAD( &(pdx->List_WaitObjects) );
> spin_lock_init( &(pdx->Lock_WaitObjectsList) );
> 
> // Initialize physical memories list
> INIT_LIST_HEAD( &(pdx->List_PhysicalMem) );
> spin_lock_init( &(pdx->Lock_PhysicalMemList) );
> 
> // Set the DMA mask
> if (Plx_dma_set_mask( pdx, PLX_DMA_BIT_MASK(48) ) == 0)
> {
> DebugPrintf(("Set DMA bit mask to 48-bits\n"));
> }
> else
>  

Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-11 Thread Michael Ellerman
Madhavan Srinivasan  writes:

> On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
>> Sachin Sant  writes:
>>
>>> I have run into few instances where the lost_exception_test from
>>> powerpc kselftest fails with SIGABRT. Following o/p is against
>>> 4.11.0-rc5. The failure is intermittent.
>> What hardware are you on?
>>
>> How long does it take to run when it fails? I assume ~2 minutes?
>
> Started a run on a POWER8 host (habanero); it has been running for more than
> 24 hrs and hasn't failed yet. So this should be a guest/VM scenario then?

Aha good point. I never tested this much (at all?) on VMs because it was
about verifying a workaround for a hardware bug.

So does it happen on both KVM and PowerVM or just one or the other?

cheers


Re: [PATCH 1/5] powerpc/pseries: do not use msgsndp doorbells on POWER9 guests

2017-04-11 Thread Michael Ellerman
Nicholas Piggin  writes:

> POWER9 hypervisors will not necessarily run guest threads together on
> the same core at the same time, so msgsndp should not be used.

I'm worried this is encoding the behaviour of a particular hypervisor in
the guest kernel.

If we *can't* use msgsndp then the hypervisor better do something to
stop us from using it.

If it would be preferable for us not to use msgsndp, then the hypervisor
can tell us that somehow, eg. in the device tree.

?

cheers


Re: [RFC PATCH 6/7] powerpc/hugetlb: Add code to support to follow huge page directory entries

2017-04-11 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> Add follow_huge_pd implementation for ppc64.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/hugetlbpage.c | 42 ++
>  1 file changed, 42 insertions(+)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 80f6d2ed551a..9d66d4f810aa 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -17,6 +17,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -618,6 +620,10 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  }
>  
>  /*
> + * 64 bit book3s use generic follow_page_mask
> + */
> +#ifndef CONFIG_PPC_BOOK3S_64

I think it's always easier to follow if you use:

  #ifdef x
  ...
  #else /* !x */
  ...
  #endif

ie. in this case put the Book3S 64 case first and the existing code in the
#else.
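
Roughly like this (a sketch of the suggested arrangement, not the actual
patch):

  #ifdef CONFIG_PPC_BOOK3S_64
  /*
   * 64-bit Book3S uses the generic follow_page_mask, so only the new
   * follow_huge_pd() bits would go here.
   */
  #else /* !CONFIG_PPC_BOOK3S_64 */
  /* the existing code stays here, unchanged */
  #endif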

cheers


Re: [PATCH 2/2] powerpc/book3s: mce: Use add_taint_no_warn() in machine_check_early().

2017-04-11 Thread Michael Ellerman
Mahesh J Salgaonkar  writes:

> From: Mahesh Salgaonkar 
>
> machine_check_early() gets called in real mode. The very first time
> add_taint() is called, it prints a warning which ends up making an OPAL
> call (via the OPAL_CALL wrapper) to write it to the console. If we get the
> very first machine check while we are in OPAL, we are doomed: OPAL_CALL
> overwrites PACASAVEDMSR in r13, and in this case, when we are done with
> MCE handling, the original OPAL call will use this new MSR on its way
> back to opal_return. This usually leads to unexpected behaviour or a
> kernel panic. Instead, use add_taint_no_warn(), which does not call printk.
>
> This is broken with the current FW level. We have been lucky so far not to
> get the very first MCE while in OPAL, but it is easily reproducible on Mambo.
> This should go to stable as well, along with patch 1/2.

This is not a good way to fix a bug that needs to go back to stable.
Changing generic code means I need to sync up with the right maintainer,
get acks, etc. And then convince people that it should go to stable also.

So can you please fix this a different way for stable?

Can we just do the tainting later, once we're in virtual mode?

cheers

> Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors")
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/kernel/traps.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 62b587f..4a048dc 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -306,7 +306,7 @@ long machine_check_early(struct pt_regs *regs)
>  
>   __this_cpu_inc(irq_stat.mce_exceptions);
>  
> - add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> + add_taint_no_warn(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>  
>   /*
>* See if platform is capable of handling machine check. (e.g. PowerNV


Re: [PATCH v4] cxl: Force context lock during EEH flow

2017-04-11 Thread Michael Ellerman
Frederic Barrat  writes:

> Le 05/04/2017 à 13:35, Vaibhav Jain a écrit :
>> During an EEH event, when the cxl card is fenced and the card sysfs attr
>> perst_reloads_same_image is set, the following warning message is seen in the
>> kernel logs:
>>
>>  [   60.622727] Adapter context unlocked with 0 active contexts
>>  [   60.622762] [ cut here ]
>>  [   60.622771] WARNING: CPU: 12 PID: 627 at
>>  ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
>>
>> Even though this warning is harmless, it clutters the kernel log
>> during an eeh event. This warning is triggered as the EEH callback
>> cxl_pci_error_detected doesn't obtain a context-lock before forcibly
>> detaching all active contexts, and when the context-lock is released during
>> the call to cxl_configure_adapter from cxl_pci_slot_reset, a warning in
>> cxl_adapter_context_unlock is triggered.
>>
>> To fix this warning, we acquire the adapter context-lock via
>> cxl_adapter_context_lock() in the eeh callback
>> cxl_pci_error_detected() once all the virtual AFU PHBs are notified
>> and their contexts detached. The context-lock is released in
>> cxl_pci_slot_reset() after the adapter is successfully reconfigured
>> and before we call slot_reset callback on slice attached device-drivers.
>>
>> Cc: sta...@vger.kernel.org
>> Fixes: 70b565bbdb91("cxl: Prevent adapter reset if an active context exists")
>> Reported-by: Andrew Donnellan 
>> Signed-off-by: Vaibhav Jain 
>> ---
>
> Pending test result from cxl-flash:
> Acked-by: Frederic Barrat 

Still pending ... ?

cheers


[BUG][next-20170410][PPC] WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500

2017-04-11 Thread Abdul Haleem
Hi,

Warning while booting next-20170410 on PowerPC.

We did not see warnings with next-20170407.

In the meantime I will update with the bad commit once my automated bisect
run finishes.

Machine type: Power7 LPAR
Kernel : 4.11.0-rc6-next-20170410
Config : file attached.


IPv6: ADDRCONF(NETDEV_UP): net0: link is not ready
Starting Authorization Manager...

Starting WPA Supplicant daemon...

[ cut here ]
WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request
+0x4f8/0x500
Modules linked in: sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E)
sunrpc(E) grace(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E)
jbd2(E) sd_mod(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E)
CPU: 22 PID: 0 Comm: swapper/22 Tainted: GE
4.11.0-rc6-next-20170410-autotest #1
task: c009f82a0400 task.stack: c009f8324000
NIP: c0512a08 LR: c05125ec CTR: c0518270
REGS: c013fff23740 TRAP: 0700   Tainted: GE
(4.11.0-rc6-next-20170410-autotest)
MSR: 8282b032 
  CR: 48042048  XER: 0001  
CFAR: c0512784 SOFTE: 1 
GPR00: c05125ec c013fff239c0 c1396300 c000fd55a000 
GPR04:   0001 00b0 
GPR08: 00067887  c000fd55a000 d85f7dc0 
GPR12: 88044044 ce97dc00 c009f8327f90 00200042 
GPR16: 9239 c013fff2  c0de4100 
GPR20: c13d3b00 c0de4100  0005 
GPR24: 2ee0 c1788018   
GPR28:   c009f212e400 c000fd55a000 
NIP [c0512a08] .blk_update_request+0x4f8/0x500
LR [c05125ec] .blk_update_request+0xdc/0x500
Call Trace:
[c013fff239c0] [c05125ec] .blk_update_request+0xdc/0x500 
(unreliable)
[c013fff23a60] [c06b462c] .scsi_end_request+0x4c/0x240
[c013fff23b10] [c06b84a4] .scsi_io_completion+0x1d4/0x6c0
[c013fff23be0] [c06acbd0] .scsi_finish_command+0x100/0x1b0
[c013fff23c70] [c06b78b8] .scsi_softirq_done+0x188/0x1e0
[c013fff23d00] [c051d8b4] .blk_done_softirq+0xc4/0xf0
[c013fff23d90] [c00dd758] .__do_softirq+0x158/0x3a0
[c013fff23e90] [c00dde08] .irq_exit+0x1a8/0x1c0
[c013fff23f10] [c0014f84] .__do_irq+0x94/0x1f0
[c013fff23f90] [c0026d1c] .call_do_irq+0x14/0x24
[c009f83277f0] [c001516c] .do_IRQ+0x8c/0x100
[c009f8327890] [c0008bf4] hardware_interrupt_common+0x114/0x120
--- interrupt: 501 at .arch_local_irq_restore+0x74/0x90
LR = .arch_local_irq_restore+0x74/0x90
[c009f8327b80] [0002] 0x2 (unreliable)
[c009f8327bf0] [c07c08a8] .dedicated_cede_loop+0xc8/0x150
[c009f8327c70] [c07be280] .cpuidle_enter_state+0xb0/0x380
[c009f8327d20] [c012fd5c] .call_cpuidle+0x3c/0x70
[c009f8327d90] [c01300f0] .do_idle+0x280/0x2e0
[c009f8327e50] [c0130308] .cpu_startup_entry+0x28/0x40
[c009f8327ed0] [c0042364] .start_secondary+0x304/0x350
[c009f8327f90] [c000aa6c] start_secondary_prolog+0x10/0x14
Instruction dump:
3f82ff8e 3b9cd308 4b50 3f82ff8e 3b9cd320 4b44 61290040 b13f0018 
4bfffbe8 3cc2ff89 38c64258 4b60 <0fe0> 4bfffd7c 7c0802a6 fba1ffe8 
---[ end trace 1fdfef416a071a8e ]---
EXT4-fs (sda3): Delayed block allocation failed for inode 11011452 at
logical offset 0 with max blocks 5 with error 121 
EXT4-fs (sda3): This should not happen!! Data will be lost

 Starting Network Manager Script Dispatcher Service...


-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.10.0-rc5 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
CONFIG_GENERIC_CPU=y
# CONFIG_CELL_CPU is not set
# CONFIG_POWER4_CPU is not set
# CONFIG_POWER5_CPU is not set
# CONFIG_POWER6_CPU is not set
# CONFIG_POWER7_CPU is not set
# CONFIG_POWER8_CPU is not set
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
CONFIG_PPC_ICSWX=y
# CONFIG_PPC_ICSWX_PID is not set
# CONFIG_PPC_ICSWX_USE_SIGILL is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
CONFIG_VDSO32=y
CONFIG_CPU_BIG_ENDIAN=y
# CONFIG_CPU_LITTLE_ENDIAN is not set
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SE

Re: [PATCH v2] ppc64/kprobe: Fix oops when kprobed on 'stdu' instruction

2017-04-11 Thread Ravi Bangoria
Thanks Balbir for the review,

On Tuesday 11 April 2017 02:25 PM, Balbir Singh wrote:
> On Tue, 2017-04-11 at 10:38 +0530, Ravi Bangoria wrote:
>> If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel 
>> OOPS:
>>
>>   [ 1275.165932] Bad kernel stack pointer cd93c840 at c0009868
>>   [ 1275.166378] Oops: Bad kernel stack pointer, sig: 6 [#1]
>>   ...
>>   GPR00: c01fcd93cb30 cd93c840 c15c5e00 cd93c840
>>   ...
>>   [ 1275.178305] NIP [c0009868] resume_kernel+0x2c/0x58
>>   [ 1275.178594] LR [c0006208] program_check_common+0x108/0x180
>>
>> Basically, on 64 bit system, when user probes on 'stdu' instruction,
>> kernel does not emulate actual store in emulate_step itself because it
>> may corrupt exception frame. So kernel does actual store operation in
>> exception return code i.e. resume_kernel().
>>
>> resume_kernel() loads the saved stack pointer from memory using lwz,
>> effectively loading a corrupt (32bit) address, causing the kernel crash.
>>
>> Fix this by loading the 64bit value instead.
>>
>> Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of emulate_step()")
>> Signed-off-by: Ravi Bangoria 
>> Reviewed-by: Naveen N. Rao  
>> ---
> The patch looks correct to me from the description and code. I have not
> validated that the write to GPR1(r1) via store of r8 to 0(r5) is indeed 
> correct.
> I would assume r8 should contain regs->gpr[r1] with the updated ea that
> is written down to the GPR1(r1) which will be what we restore when we return
> from the exception.

emulate_step() updates regs->gpr[r1] with the new value. So,
regs->gpr[r1] and GPR1(r1) are the same at resume_kernel.

At resume_kernel, r1 points to the exception frame. The address
of the frame preceding the exception frame gets loaded into r8 with:

addi    r8,r1,INT_FRAME_SIZE

Let me know if you need more details.

Ravi



Re: [PATCH v4] cxl: Force context lock during EEH flow

2017-04-11 Thread Frederic Barrat



Le 11/04/2017 à 12:40, Michael Ellerman a écrit :

Frederic Barrat  writes:


Le 05/04/2017 à 13:35, Vaibhav Jain a écrit :

During an EEH event, when the cxl card is fenced and the card sysfs attr
perst_reloads_same_image is set, the following warning message is seen in the
kernel logs:

 [   60.622727] Adapter context unlocked with 0 active contexts
 [   60.622762] [ cut here ]
 [   60.622771] WARNING: CPU: 12 PID: 627 at
 ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]

Even though this warning is harmless, it clutters the kernel log
during an eeh event. This warning is triggered as the EEH callback
cxl_pci_error_detected doesn't obtain a context-lock before forcibly
detaching all active contexts, and when the context-lock is released during
the call to cxl_configure_adapter from cxl_pci_slot_reset, a warning in
cxl_adapter_context_unlock is triggered.

To fix this warning, we acquire the adapter context-lock via
cxl_adapter_context_lock() in the eeh callback
cxl_pci_error_detected() once all the virtual AFU PHBs are notified
and their contexts detached. The context-lock is released in
cxl_pci_slot_reset() after the adapter is successfully reconfigured
and before we call slot_reset callback on slice attached device-drivers.

Cc: sta...@vger.kernel.org
Fixes: 70b565bbdb91("cxl: Prevent adapter reset if an active context exists")
Reported-by: Andrew Donnellan 
Signed-off-by: Vaibhav Jain 
---


Pending test result from cxl-flash:
Acked-by: Frederic Barrat 


Still pending ... ?


Yes, still waiting. It was mentioned in a call with the cxlflash team 
yesterday.


  Fred





cheers





Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Jin, Yao



On 4/11/2017 4:35 PM, Peter Zijlstra wrote:

On Tue, Apr 11, 2017 at 04:11:21PM +0800, Jin, Yao wrote:


On 4/11/2017 3:52 PM, Peter Zijlstra wrote:

This is still a completely inadequate changelog. I really will not
accept patches like this.


Hi,

The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report: Show branch type").

Does the changelog need to be added in each patch's description?

That's fine, I can add and resend this patch.

The cover letter is not retained; it is throwaway information.

Each patch should have a coherent changelog that explains why the patch
was done and explains any non-trivial things in the implementation.

Simply copy/pasting the same story into multiple patches is not right
either, for the simple fact that the patches are not the same. Each one
does a different thing, so each needs a different story.




Thanks so much for the suggestion!

I accept this and have decided to make changes to my patch descriptions.
Rather than adding a full changelog, I will add a section in each patch
description to describe the major changes from the previous version.


Thanks
Jin Yao



Re: [PATCH v3 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Jin, Yao



On 4/11/2017 4:18 PM, Peter Zijlstra wrote:

On Tue, Apr 11, 2017 at 09:52:19AM +0200, Peter Zijlstra wrote:

On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote:

@@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].from = 0;
compress = true;
}
+
+   if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
+   cpuc->lbr_entries[i].type = common_branch_type(type);
+   else
+   cpuc->lbr_entries[i].type = PERF_BR_NONE;
}

I was wondering WTH you did that else; because it should already be 0
(aka, BR_NONE).


Yes. I will remove the else code. Thanks!


Then I found intel_pmu_lbr_read_32() is already broken,
and you just broke intel_pmu_lbr_read_64().

Arguably we should add a union on the last __u64 with a name for the
entire thing, but the below is the minimal fix.

---
Subject: perf,x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
From: Peter Zijlstra 
Date: Tue Apr 11 10:10:28 CEST 2017

When the perf_branch_entry::{in_tx,abort,cycles} fields were added,
intel_pmu_lbr_read_32() wasn't updated to initialize them.

Fixes: 135c5612c460 ("perf/x86/intel: Support Haswell/v4 LBR format")
Signed-off-by: Peter Zijlstra (Intel) 
---
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -507,6 +507,9 @@ static void intel_pmu_lbr_read_32(struct
cpuc->lbr_entries[i].to  = msr_lastbranch.to;
cpuc->lbr_entries[i].mispred = 0;
cpuc->lbr_entries[i].predicted   = 0;
+   cpuc->lbr_entries[i].in_tx   = 0;
+   cpuc->lbr_entries[i].abort   = 0;
+   cpuc->lbr_entries[i].cycles  = 0;
cpuc->lbr_entries[i].reserved= 0;
}
cpuc->lbr_stack.nr = i;


I will add cpuc->lbr_entries[i].type = 0 in my patch.



Re: [PATCH 1/5] powerpc/pseries: do not use msgsndp doorbells on POWER9 guests

2017-04-11 Thread Nicholas Piggin
cc'ing Paul

On Tue, 11 Apr 2017 20:10:17 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > POWER9 hypervisors will not necessarily run guest threads together on
> > the same core at the same time, so msgsndp should not be used.  
> 
> I'm worried this is encoding the behaviour of a particular hypervisor in
> the guest kernel.

Yeah, it's not ideal.

> If we *can't* use msgsndp then the hypervisor better do something to
> stop us from using it.

The POWER9 hypervisor has an HFSCR bit for this and should clear it if it
does not gang threads like POWER8 does. The guest still needs to know not to
use it though...

> If it would be preferable for us not to use msgsndp, then the hypervisor
> can tell us that somehow, eg. in the device tree.

I don't know that we have a really good way to do that other than having
guests clear the doorbell feature for POWER9.

Does the hypervisor set any relevant DT property we can use today that says
virtual sibling != physical sibling? If not, then we'll just have to clear it
from all POWER9 guests until we get a DT property from phyp.

Thanks,
Nick


Re: [PATCH v5 13/15] livepatch: change to a per-task consistency model

2017-04-11 Thread Petr Mladek
On Mon 2017-02-13 19:42:40, Josh Poimboeuf wrote:
> Change livepatch to use a basic per-task consistency model.  This is the
> foundation which will eventually enable us to patch those ~10% of
> security patches which change function or data semantics.  This is the
> biggest remaining piece needed to make livepatch more generally useful.
> 
> 
> Signed-off-by: Josh Poimboeuf 

Just for the record, this last version looks fine to me. I do not see
problems any longer. Everything looks consistent now ;-)
It is great work. Feel free to use:

Reviewed-by: Petr Mladek 

Thanks a lot for patience.

Best Regards,
Petr


[PATCH] powerpc/mm: Update mm context addr limit correctly.

2017-04-11 Thread Aneesh Kumar K.V
We added the addr < TASK_SIZE check to avoid updating addr_limit unnecessarily
and also to avoid calling slice_flush_segments on all the CPUs. This had the
side effect of different behaviour when using an addr value above TASK_SIZE
before updating addr_limit versus after updating addr_limit, as shown by the
output below:

requesting with hint 0x0
Addr returned 0x7fff893a
requesting with hint 0x
Addr returned 0x7fff891b  <= 1st return
requesting with hint 0x1
Addr returned 0x1
requesting with hint 0x
Addr returned 0x18941< second return

After fix:
requesting with hint 0x0
Addr returned 0x7fff8bc0
requesting with hint 0x
Addr returned 0x18bc8< 1st return
requesting with hint 0x1
Addr returned 0x1
requesting with hint 0x
Addr returned 0x18bc6< second return
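
For reference, output like the above can be produced by a small test along
these lines (a sketch only; the hint values and mapping size are assumptions,
not the original test):

	#include <stdio.h>
	#include <sys/mman.h>

	/* mmap() with a hint address and print what the kernel returns */
	static void request(unsigned long hint)
	{
		void *p = mmap((void *)hint, 1UL << 16, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		printf("requesting with hint 0x%lx\n", hint);
		printf("Addr returned %p\n", p);
	}

	int main(void)
	{
		request(0x0);				/* no hint */
		request((1UL << 47) + (1UL << 30));	/* hint above 128TB */
		return 0;
	}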

Fixes: 1b49451ebd3e9 ("powerpc/mm: Enable mappings above 128TB")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/mmap.c  | 6 --
 arch/powerpc/mm/slice.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index b2111baa0da6..355b6fe8a1e6 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -97,7 +97,8 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_area_struct *vma;
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE))
mm->context.addr_limit = TASK_SIZE;
 
if (len > mm->context.addr_limit - mmap_min_addr)
@@ -139,7 +140,8 @@ radix__arch_get_unmapped_area_topdown(struct file *filp,
unsigned long addr = addr0;
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE))
mm->context.addr_limit = TASK_SIZE;
 
/* requested length too big for entire address space */
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 251b6bae7023..2d2d9760d057 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -419,7 +419,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
/*
 * Check if we need to expland slice area.
 */
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE)) {
+   if (unlikely(addr > mm->context.addr_limit &&
+mm->context.addr_limit != TASK_SIZE)) {
mm->context.addr_limit = TASK_SIZE;
on_each_cpu(slice_flush_segments, mm, 1);
}
-- 
2.7.4



[PATCH v4 0/5] perf report: Show branch type

2017-04-11 Thread Jin Yao
v4:
---
1. Describe the major changes in each patch description.
   Thanks to Peter Zijlstra for the reminder.

2. Initialize the branch type to 0 in intel_pmu_lbr_read_32 and
   intel_pmu_lbr_read_64. Remove the redundant else code in
   intel_pmu_lbr_filter.

v3:
---
1. Move the JCC forward/backward and cross-page computation from
   the kernel to userspace.

2. Use lookup table to replace original switch/case processing.

Changed:
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

Not changed:
  perf record: Create a new option save_type in --branch-filter

v2:
---
1. Use 4 bits in perf_branch_entry to record branch type.

2. Pull out some common branch types from FAR_BRANCH. Now the branch
   types defined in perf_event.h:

PERF_BR_NONE  : unknown
PERF_BR_JCC_FWD   : conditional forward jump
PERF_BR_JCC_BWD   : conditional backward jump
PERF_BR_JMP   : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL  : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET   : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET: syscall return
PERF_BR_IRQ   : hw interrupt/trap/fault
PERF_BR_INT   : sw interrupt
PERF_BR_IRET  : return from interrupt
PERF_BR_FAR_BRANCH: others not generic far branch type

3. Use 2 bits in perf_branch_entry for "cross" metrics that check
   whether a branch crosses a 4K or 2MB area. It is an approximate
   computation of whether the branch crosses a 4K page or a 2MB page;
   a rough sketch of the idea follows.
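
   The userspace check is along these lines (illustrative only, not the
   actual perf tool code):

	#include <stdbool.h>
	#include <stdint.h>

	/* A branch "crosses" when the source and target addresses fall in
	 * different pages of the given size (4K or 2M).
	 */
	static bool branch_crosses(uint64_t from, uint64_t to,
				   uint64_t page_size)
	{
		return (from / page_size) != (to / page_size);
	}

	/* e.g. branch_crosses(from, to, 4096) for CROSS_4K,
	 * branch_crosses(from, to, 2UL << 20) for CROSS_2M
	 */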

For example:

perf record -g --branch-filter any,save_type 

perf report --stdio

 JCC forward:  27.7%
JCC backward:   9.8%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%
IRET:   0.0%
CROSS_4K:   0.0%
CROSS_2M:  14.3%

perf report --branch-history --stdio --no-children

-23.60%--main div.c:42 (RET cycles:2)
 compute_flag div.c:28 (RET cycles:2)
 compute_flag div.c:27 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET cycles:1)
 __random random.c:298 (RET cycles:1)
 __random random.c:297 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (RET cycles:9)

Changed:
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

Not changed:
  perf record: Create a new option save_type in --branch-filter

v1:
---
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently we have to look it up in the binary, but the binary may not be
available later, and even when it is available the user has to spend
extra time on it. It is very useful for the user to check it directly in
perf report.

Perf already has support for disassembling the branch instruction
to get the branch type.

The patch series records the branch type and show the branch type with
other LBR information in callchain entry via perf report. The patch
series also adds the branch type summary at the end of
perf report --stdio.

To keep the kernel and userspace consistent and make the classification
more generic, the patch adds the common branch type classification
in perf_event.h.

The common branch types are:

 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
 JMP: Jump imm
 IND_JMP: Jump reg/mem
CALL: Call imm
IND_CALL: Call reg/mem
 RET: Ret
  FAR_BRANCH: SYSCALL/SYSRET, IRQ, IRET, TSX Abort

An example:

1. Record branch type (new option "save_type")

perf record -g --branch-filter any,save_type 

2. Show the branch type statistics at the end of perf report --stdio

perf report --stdio

 JCC forward:  34.0%
JCC backward:   3.6%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%
  FAR_BRANCH:   0.0%

3. Show branch type in callchain entry

perf report --branch-history --stdio --no-children

--23.91%--main div.c:42 (RET cycles:2)
  compute_flag div.c:28 (RET cycles:2)
  compute_flag div.c:27 (RET cycles:1)
  rand rand.c:28 (RET cycles:1)
  rand rand.c:28 (RET cycles:1)
  __random random.c:298 (RET cycles:1)
  __random random.c:297 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (RET cycles:9)


Jin Yao (5

[PATCH v4 1/5] perf/core: Define the common branch type classification

2017-04-11 Thread Jin Yao
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently we have to look it up in the binary, but the binary may not be
available later, and even when it is available the user has to spend
extra time on it. It is very useful for the user to check it directly in
perf report.

Perf already has support for disassembling the branch instruction
to get the x86 branch type.

To keep the kernel and userspace consistent and make the classification
more generic, the patch adds the common branch type classification
in perf_event.h.

PERF_BR_NONE  : unknown
PERF_BR_JCC   : conditional jump
PERF_BR_JMP   : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL  : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET   : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET: syscall return
PERF_BR_IRQ   : hw interrupt/trap/fault
PERF_BR_INT   : sw interrupt
PERF_BR_IRET  : return from interrupt
PERF_BR_FAR_BRANCH: not generic far branch type

The patch also adds a new field type (4 bits) in perf_branch_entry
to record the branch type.

Since disassembling the branch instruction has some overhead,
a new PERF_SAMPLE_BRANCH_TYPE_SAVE flag is introduced to indicate whether
the kernel needs to disassemble the branch instruction and record the
branch type.
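
A profiler would request it roughly like this (a sketch only, assuming the
new enum value from this patch is present in the installed header):

	#include <linux/perf_event.h>
	#include <string.h>

	/* Request an LBR branch stack with branch types recorded */
	static void setup_branch_sampling(struct perf_event_attr *attr)
	{
		memset(attr, 0, sizeof(*attr));
		attr->size = sizeof(*attr);
		attr->type = PERF_TYPE_HARDWARE;
		attr->config = PERF_COUNT_HW_CPU_CYCLES;
		attr->sample_type = PERF_SAMPLE_BRANCH_STACK;
		attr->branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
					   PERF_SAMPLE_BRANCH_TYPE_SAVE;
	}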

Compared to the previous version, the major changes are:

1. Remove the PERF_BR_JCC_FWD/PERF_BR_JCC_BWD, they will be
   computed later in userspace.

2. Remove the "cross" field in perf_branch_entry. The cross page
   computing will be done later in userspace.

Signed-off-by: Jin Yao 
---
 include/uapi/linux/perf_event.h   | 29 -
 tools/include/uapi/linux/perf_event.h | 29 -
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */
 };
 
@@ -198,9 +200,32 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_NO_FLAGS = 1U << 
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT,
PERF_SAMPLE_BRANCH_NO_CYCLES= 1U << 
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT,
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE=
+   1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX  = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
+/*
+ * Common flow change classification
+ */
+enum {
+   PERF_BR_NONE= 0,/* unknown */
+   PERF_BR_JCC = 1,/* conditional jump */
+   PERF_BR_JMP = 2,/* jump */
+   PERF_BR_IND_JMP = 3,/* indirect jump */
+   PERF_BR_CALL= 4,/* call */
+   PERF_BR_IND_CALL= 5,/* indirect call */
+   PERF_BR_RET = 6,/* return */
+   PERF_BR_SYSCALL = 7,/* syscall */
+   PERF_BR_SYSRET  = 8,/* syscall return */
+   PERF_BR_IRQ = 9,/* hw interrupt/trap/fault */
+   PERF_BR_INT = 10,   /* sw interrupt */
+   PERF_BR_IRET= 11,   /* return from interrupt */
+   PERF_BR_FAR_BRANCH  = 12,   /* not generic far branch type */
+   PERF_BR_MAX,
+};
+
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
(PERF_SAMPLE_BRANCH_USER|\
 PERF_SAMPLE_BRANCH_KERNEL|\
@@ -999,6 +1024,7 @@ union perf_mem_data_src {
  * in_tx: running in a hardware transaction
  * abort: aborting a hardware transaction
  *cycles: cycles from last branch (or 0 if not supported)
+ *  type: branch type
  */
 struct perf_branch_entry {
__u64   from;
@@ -1008,7 +1034,8 @@ struct perf_branch_entry {
in_tx:1,/* in transaction */
abort:1,/* transaction abort */
cycles:16,  /* cycle count to last branch */
-   reserved:44;
+   type:4, /* branch type */
+   reserved:40;
 };
 
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */
 };
 
@@ -198,9 +200,32 @@ enum perf_branch_sample_type {

[PATCH v4 2/5] perf/x86/intel: Record branch type

2017-04-11 Thread Jin Yao
Perf already has support for disassembling the branch instruction
and using the branch type for filtering. The patch just records
the branch type in perf_branch_entry.

Before recording, the patch converts the x86 branch type to the
common branch type.

Compared to the previous version, the major changes are:

1. Use a lookup table to convert the x86 branch type to the common
   branch type.

2. Move the JCC forward/backward and cross-page computation to
   user space.

3. Initialize the branch type to 0 in intel_pmu_lbr_read_32 and
   intel_pmu_lbr_read_64.

Signed-off-by: Jin Yao 
---
 arch/x86/events/intel/lbr.c | 53 -
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 81b321a..d3b1dd6 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -109,6 +109,9 @@ enum {
X86_BR_ZERO_CALL= 1 << 15,/* zero length call */
X86_BR_CALL_STACK   = 1 << 16,/* call stack */
X86_BR_IND_JMP  = 1 << 17,/* indirect jump */
+
+   X86_BR_TYPE_SAVE= 1 << 18,/* indicate to save branch type */
+
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -507,6 +510,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].to = msr_lastbranch.to;
cpuc->lbr_entries[i].mispred= 0;
cpuc->lbr_entries[i].predicted  = 0;
+   cpuc->lbr_entries[i].type   = 0;
cpuc->lbr_entries[i].reserved   = 0;
}
cpuc->lbr_stack.nr = i;
@@ -593,6 +597,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[out].in_tx = in_tx;
cpuc->lbr_entries[out].abort = abort;
cpuc->lbr_entries[out].cycles= cycles;
+   cpuc->lbr_entries[out].type  = 0;
cpuc->lbr_entries[out].reserved  = 0;
out++;
}
@@ -670,6 +675,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 
if (br_type & PERF_SAMPLE_BRANCH_CALL)
mask |= X86_BR_CALL | X86_BR_ZERO_CALL;
+
+   if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE)
+   mask |= X86_BR_TYPE_SAVE;
+
/*
 * stash actual user request into reg, it may
 * be used by fixup code for some CPU
@@ -923,6 +932,44 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
return ret;
 }
 
+#define X86_BR_TYPE_MAP_MAX  16
+
+static int
+common_branch_type(int type)
+{
+   int i, mask;
+   const int branch_map[X86_BR_TYPE_MAP_MAX] = {
+   PERF_BR_CALL,   /* X86_BR_CALL */
+   PERF_BR_RET,/* X86_BR_RET */
+   PERF_BR_SYSCALL,/* X86_BR_SYSCALL */
+   PERF_BR_SYSRET, /* X86_BR_SYSRET */
+   PERF_BR_INT,/* X86_BR_INT */
+   PERF_BR_IRET,   /* X86_BR_IRET */
+   PERF_BR_JCC,/* X86_BR_JCC */
+   PERF_BR_JMP,/* X86_BR_JMP */
+   PERF_BR_IRQ,/* X86_BR_IRQ */
+   PERF_BR_IND_CALL,   /* X86_BR_IND_CALL */
+   PERF_BR_NONE,   /* X86_BR_ABORT */
+   PERF_BR_NONE,   /* X86_BR_IN_TX */
+   PERF_BR_NONE,   /* X86_BR_NO_TX */
+   PERF_BR_CALL,   /* X86_BR_ZERO_CALL */
+   PERF_BR_NONE,   /* X86_BR_CALL_STACK */
+   PERF_BR_IND_JMP,/* X86_BR_IND_JMP */
+   };
+
+   type >>= 2; /* skip X86_BR_USER and X86_BR_KERNEL */
+   mask = ~(~0 << 1);
+
+   for (i = 0; i < X86_BR_TYPE_MAP_MAX; i++) {
+   if (type & mask)
+   return branch_map[i];
+
+   type >>= 1;
+   }
+
+   return PERF_BR_NONE;
+}
+
 /*
  * implement actual branch filter based on user demand.
  * Hardware may not exactly satisfy that request, thus
@@ -939,7 +986,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
bool compress = false;
 
/* if sampling all branches, then nothing to filter */
-   if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+   if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
+   ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
return;
 
for (i = 0; i < cpuc->lbr_stack.nr; i++) {
@@ -960,6 +1008,9 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].from = 0;
compress = true;
}
+
+   if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
+   cpuc->lbr_entries[i].type = common_branch_type(type);
}
 
if (!compress)
-- 
2.7.4



[PATCH v4 3/5] perf record: Create a new option save_type in --branch-filter

2017-04-11 Thread Jin Yao
The option tells the kernel to save the branch type during sampling.

One example:
perf record -g --branch-filter any,save_type 

Signed-off-by: Jin Yao 
---
 tools/perf/Documentation/perf-record.txt | 1 +
 tools/perf/util/parse-branch-options.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index ea3789d..e2f5a4f 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -332,6 +332,7 @@ following filters are defined:
- no_tx: only when the target is not in a hardware transaction
- abort_tx: only when the target is a hardware transaction abort
- cond: conditional branches
+   - save_type: save branch type during sampling in case binary is not 
available later
 
 +
 The option requires at least one branch type among any, any_call, any_ret, 
ind_call, cond.
diff --git a/tools/perf/util/parse-branch-options.c 
b/tools/perf/util/parse-branch-options.c
index 38fd115..e71fb5f 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -28,6 +28,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
+   BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
BRANCH_END
 };
 
-- 
2.7.4



[PATCH v4 4/5] perf report: Show branch type statistics for stdio mode

2017-04-11 Thread Jin Yao
Show the branch type statistics at the end of perf report --stdio.

For example:
perf report --stdio

 JCC forward:  27.8%
JCC backward:   9.7%
    CROSS_4K:   0.0%
    CROSS_2M:  14.3%
         JCC:  37.6%
         JMP:   0.0%
     IND_JMP:   6.5%
        CALL:  26.6%
         RET:  29.3%
        IRET:   0.0%

The branch types are:
-
 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
         JMP: Jump imm
     IND_JMP: Jump reg/mem
        CALL: Call imm
    IND_CALL: Call reg/mem
         RET: Ret
     SYSCALL: Syscall
      SYSRET: Syscall return
         IRQ: HW interrupt/trap/fault
         INT: SW interrupt
        IRET: Return from interrupt
  FAR_BRANCH: Others not generic branch type

CROSS_4K and CROSS_2M:
--
These metrics count branches that cross a 4K or 2MB page boundary.
The computation is approximate: we don't know whether the area is
actually mapped with 4K or 2MB pages, so both are always computed.

To keep the output simple, if a branch crosses a 2MB area, CROSS_4K
is not incremented.
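
A minimal sketch of how this can be derived in user space from the from
and to addresses (illustrative only; the helper name below is an
assumption, not the actual util.c code):

#define ADDR_4K_MASK	(~((1ULL << 12) - 1))
#define ADDR_2M_MASK	(~((1ULL << 21) - 1))

/* Sketch: classify one branch using only its from/to addresses. A
 * conditional branch counts as forward when the target is above the
 * source; CROSS_4K is only counted when the branch does not already
 * cross a 2MB area, matching the description above.
 */
static void branch_addr_count(struct branch_type_stat *st, bool is_jcc,
			      u64 from, u64 to)
{
	if (is_jcc) {
		if (to > from)
			st->jcc_fwd++;
		else
			st->jcc_bwd++;
	}

	if ((from & ADDR_2M_MASK) != (to & ADDR_2M_MASK))
		st->cross_2m++;
	else if ((from & ADDR_4K_MASK) != (to & ADDR_4K_MASK))
		st->cross_4k++;
}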

Compared to the previous version, the major change is:

Add the computation of JCC forward/JCC backward and cross-page checking
using the from and to addresses.

Signed-off-by: Jin Yao 
---
 tools/perf/builtin-report.c | 70 +
 tools/perf/util/event.h |  3 +-
 tools/perf/util/hist.c  |  5 +---
 tools/perf/util/util.c  | 59 ++
 tools/perf/util/util.h  | 17 +++
 5 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index c18158b..c2889eb 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -66,6 +66,7 @@ struct report {
u64 queue_size;
int socket_filter;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+   struct branch_type_stat brtype_stat;
 };
 
 static int report__config(const char *var, const char *value, void *cb)
@@ -144,6 +145,24 @@ static int hist_iter__report_callback(struct 
hist_entry_iter *iter,
return err;
 }
 
+static int hist_iter__branch_callback(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused,
+ bool single __maybe_unused,
+ void *arg)
+{
+   struct hist_entry *he = iter->he;
+   struct report *rep = arg;
+   struct branch_info *bi;
+
+   if (sort__mode == SORT_MODE__BRANCH) {
+   bi = he->branch_info;
+   branch_type_count(&rep->brtype_stat, &bi->flags,
+ bi->from.addr, bi->to.addr);
+   }
+
+   return 0;
+}
+
 static int process_sample_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -182,6 +201,8 @@ static int process_sample_event(struct perf_tool *tool,
 */
if (!sample->branch_stack)
goto out_put;
+
+   iter.add_entry_cb = hist_iter__branch_callback;
iter.ops = &hist_iter_branch;
} else if (rep->mem_mode) {
iter.ops = &hist_iter_mem;
@@ -369,6 +390,50 @@ static size_t hists__fprintf_nr_sample_events(struct hists 
*hists, struct report
return ret + fprintf(fp, "\n#\n");
 }
 
+static void branch_type_stat_display(FILE *fp, struct branch_type_stat *stat)
+{
+   u64 total = 0;
+   int i;
+
+   for (i = 0; i < PERF_BR_MAX; i++)
+   total += stat->counts[i];
+
+   if (total == 0)
+   return;
+
+   fprintf(fp, "\n#");
+   fprintf(fp, "\n# Branch Statistics:");
+   fprintf(fp, "\n#");
+
+   if (stat->jcc_fwd > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "JCC forward",
+   100.0 * (double)stat->jcc_fwd / (double)total);
+
+   if (stat->jcc_bwd > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "JCC backward",
+   100.0 * (double)stat->jcc_bwd / (double)total);
+
+   if (stat->cross_4k > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "CROSS_4K",
+   100.0 * (double)stat->cross_4k / (double)total);
+
+   if (stat->cross_2m > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "CROSS_2M",
+   100.0 * (double)stat->cross_2m / (double)total);
+
+   for (i = 0; i < PERF_BR_MAX; i++) {
+   if (stat->counts[i] > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   branch_type_name(i),
+   100.0 *
+   (double)stat->counts[i] / (double)total);
+   }
+}
+
 static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 

[PATCH v4 5/5] perf report: Show branch type in callchain entry

2017-04-11 Thread Jin Yao
Show branch type in callchain entry. The branch type is printed
with other LBR information (such as cycles/abort/...).

One example:
perf report --branch-history --stdio --no-children

--23.54%--main div.c:42 (CROSS_2M RET cycles:2)
  compute_flag div.c:28 (RET cycles:2)
  compute_flag div.c:27 (CROSS_2M RET cycles:1)
  rand rand.c:28 (CROSS_4K RET cycles:1)
  rand rand.c:28 (CROSS_2M RET cycles:1)
  __random random.c:298 (CROSS_4K RET cycles:1)
  __random random.c:297 (JCC backward CROSS_2M cycles:1)
  __random random.c:295 (JCC forward CROSS_4K cycles:1)
  __random random.c:295 (JCC backward CROSS_2M cycles:1)
  __random random.c:295 (JCC forward CROSS_4K cycles:1)
  __random random.c:295 (CROSS_2M RET cycles:9)

Compared to the previous version, the major change is:

JCC forward/JCC backward and cross-page checking must now be computed
in user space from the from and to addresses, but each callchain entry
contains only one ip (either from or to). This patch therefore appends
the branch's from address to the callchain entry that contains the to ip.

Signed-off-by: Jin Yao 
---
 tools/perf/util/callchain.c | 195 ++--
 tools/perf/util/callchain.h |   4 +-
 tools/perf/util/machine.c   |  26 --
 3 files changed, 152 insertions(+), 73 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 2e5eff5..3c875b1 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -467,6 +467,11 @@ fill_node(struct callchain_node *node, struct 
callchain_cursor *cursor)
call->cycles_count = cursor_node->branch_flags.cycles;
call->iter_count = cursor_node->nr_loop_iter;
call->samples_count = cursor_node->samples;
+
+   branch_type_count(&call->brtype_stat,
+ &cursor_node->branch_flags,
+ cursor_node->branch_from,
+ cursor_node->ip);
}
 
list_add_tail(&call->list, &node->val);
@@ -579,6 +584,11 @@ static enum match_result match_chain(struct 
callchain_cursor_node *node,
cnode->cycles_count += node->branch_flags.cycles;
cnode->iter_count += node->nr_loop_iter;
cnode->samples_count += node->samples;
+
+   branch_type_count(&cnode->brtype_stat,
+ &node->branch_flags,
+ node->branch_from,
+ node->ip);
}
 
return MATCH_EQ;
@@ -813,7 +823,7 @@ merge_chain_branch(struct callchain_cursor *cursor,
list_for_each_entry_safe(list, next_list, &src->val, list) {
callchain_cursor_append(cursor, list->ip,
list->ms.map, list->ms.sym,
-   false, NULL, 0, 0);
+   false, NULL, 0, 0, 0);
list_del(&list->list);
map__zput(list->ms.map);
free(list);
@@ -853,7 +863,7 @@ int callchain_merge(struct callchain_cursor *cursor,
 int callchain_cursor_append(struct callchain_cursor *cursor,
u64 ip, struct map *map, struct symbol *sym,
bool branch, struct branch_flags *flags,
-   int nr_loop_iter, int samples)
+   int nr_loop_iter, int samples, u64 branch_from)
 {
struct callchain_cursor_node *node = *cursor->last;
 
@@ -877,6 +887,7 @@ int callchain_cursor_append(struct callchain_cursor *cursor,
memcpy(&node->branch_flags, flags,
sizeof(struct branch_flags));
 
+   node->branch_from = branch_from;
cursor->nr++;
 
cursor->last = &node->next;
@@ -1105,95 +1116,151 @@ int callchain_branch_counts(struct callchain_root 
*root,
  cycles_count);
 }
 
+static int branch_type_str(struct branch_type_stat *stat,
+  char *bf, int bfsize)
+{
+   int i, j = 0, printed = 0;
+   u64 total = 0;
+
+   for (i = 0; i < PERF_BR_MAX; i++)
+   total += stat->counts[i];
+
+   if (total == 0)
+   return 0;
+
+   printed += scnprintf(bf + printed, bfsize - printed, " (");
+
+   if (stat->jcc_fwd > 0) {
+   j++;
+   printed += scnprintf(bf + printed, bfsize - printed,
+"JCC forward");
+   }
+
+   if (stat->jcc_bwd > 0) {
+   if (j++)
+   printed += scnprintf(bf + printed, bfsize - printed,
+" JCC backward");
+  

Re: [PATCH V4 7/7] cxl: Add psl9 specific code

2017-04-11 Thread Frederic Barrat



Le 07/04/2017 à 16:11, Christophe Lombard a écrit :

The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
If a valid mm_struct is found the memory context id is used for each
transaction associated with the process handle. The PSL uses the
context ID to find the corresponding process element.

Signed-off-by: Christophe Lombard 
---



I'm ok with the code. However checkpatch is complaining about a 
tab/space error in native.c


If you have a quick respin, I also have a comment below about the 
documentation.




 Documentation/powerpc/cxl.txt |  11 +-
 drivers/misc/cxl/context.c|  16 ++-
 drivers/misc/cxl/cxl.h| 137 +++
 drivers/misc/cxl/debugfs.c|  19 
 drivers/misc/cxl/fault.c  |  64 +++
 drivers/misc/cxl/guest.c  |   8 +-
 drivers/misc/cxl/irq.c|  53 +
 drivers/misc/cxl/native.c | 225 +++---
 drivers/misc/cxl/pci.c| 246 +++---
 drivers/misc/cxl/trace.h  |  43 
 10 files changed, 748 insertions(+), 74 deletions(-)

diff --git a/Documentation/powerpc/cxl.txt b/Documentation/powerpc/cxl.txt
index d5506ba0..4a77462 100644
--- a/Documentation/powerpc/cxl.txt
+++ b/Documentation/powerpc/cxl.txt
@@ -21,7 +21,7 @@ Introduction
 Hardware overview
 =

-  POWER8   FPGA
+ POWER8/9 FPGA
+--++-+
|  || |
|   CPU||   AFU   |
@@ -34,7 +34,7 @@ Hardware overview
|   | CAPP |<-->| |
+---+--+  PCIE  +-+

-The POWER8 chip has a Coherently Attached Processor Proxy (CAPP)
+The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
 unit which is part of the PCIe Host Bridge (PHB). This is managed
 by Linux by calls into OPAL. Linux doesn't directly program the
 CAPP.
@@ -59,6 +59,13 @@ Hardware overview
 the fault. The context to which this fault is serviced is based on
 who owns that acceleration function.




+POWER8 <-> PSL Version 8 is compliant to the CAIA Version 1.0.
+POWER9 <-> PSL Version 9 is compliant to the CAIA Version 2.0.
+This PSL Version 9 provides new features as:
+* Native DMA support.
+* Supports sending ASB_Notify messages for host thread wakeup.
+* Supports Atomic operations.
+* 



I think one of the most important differences is missing: the PSL on
power9 uses the new nest MMU on the power9 chip and no longer has its
own MMU.


  Fred



 AFU Modes
 =
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index ac2531e..45363be 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct 
vm_area_struct *vma)
if (ctx->afu->current_mode == CXL_MODE_DEDICATED) {
if (start + len > ctx->afu->adapter->ps_size)
return -EINVAL;
+
+   if (cxl_is_psl9(ctx->afu)) {
+   /* make sure there is a valid problem state
+* area space for this AFU
+*/
+   if (ctx->master && !ctx->afu->psa) {
+   pr_devel("AFU doesn't support mmio space\n");
+   return -EINVAL;
+   }
+
+   /* Can't mmap until the AFU is enabled */
+   if (!ctx->afu->enabled)
+   return -EBUSY;
+   }
} else {
if (start + len > ctx->psn_size)
return -EINVAL;
-   }

-   if (ctx->afu->current_mode != CXL_MODE_DEDICATED) {
/* make sure there is a valid per process space for this AFU */
if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) {
pr_devel("AFU doesn't support mmio space\n");
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 82335c0..df40e6e 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -63,7 +63,7 @@ typedef struct {
 /* Memory maps. Ref CXL Appendix A */

 /* PSL Privilege 1 Memory Map */
-/* Configuration and Control area */
+/* Configuration and Control area - CAIA 1&2 */
 static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x};
 static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008};
 static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010};
@@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
 static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
 static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static 

Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-11 Thread Tyrel Datwyler
On 04/11/2017 02:00 AM, Michael Ellerman wrote:
> Tyrel Datwyler  writes:
> 
>> On 04/06/2017 09:04 PM, Michael Ellerman wrote:
>>> Tyrel Datwyler  writes:
>>>
 On 04/06/2017 03:27 AM, Sachin Sant wrote:
> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
> any I/O adapter results in the following warning
>
> This problem has been in the code for some time now. I had first seen 
> this in
> -next tree.
>
>>
>> 
>>
> Have attached the dmesg log from the system. Let me know if any additional
> information is required to help debug this problem.

 I remember you mentioning this when the issue was brought up for CPUs. I
 assume the case is the same here where the issue is only seen with
 adapters that were hot-added after boot (ie. hot-remove of adapter
 present at boot doesn't trip the warning)?
>>>
>>> So who's fixing this?
>>
>> I started looking at it when Bharata submitted a patch trying to fix the
>> issue for CPUs, but got side tracked by other things. I suspect that
>> this underflow has actually been an issue for quite some time, and we
>> are just now becoming aware of it thanks to the recount_t patchset being
>> merged.
> 
> Yes I agree. Which means it might be broken in existing distros.

Definitely. I did some profiling last night, and I understand the
hotplug case. It turns out to be as I suggested in the original thread
about CPUs. When the devicetree code was reworked to move the tree out of
proc and into sysfs, the sysfs detach code added an of_node_put to remove
the original of_init reference. pSeries, being the sole original
*dynamic* device tree user, had always issued an of_node_put in its
dlpar-specific detach function to achieve that end. So, this should be a
pretty straightforward, trivial fix.
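
Roughly, the imbalance looks like the hypothetical sketch below (names
are illustrative, not the actual pseries code): both paths now drop the
original of_init reference, so the second put underflows the refcount
and trips the WARN in lib/refcount.c.

static int dlpar_detach_node(struct device_node *dn)
{
	int rc;

	rc = of_detach_node(dn);	/* sysfs detach path already drops
					 * the original of_init reference */
	if (rc)
		return rc;

	of_node_put(dn);		/* legacy extra put -> refcount underflow */
	return 0;
}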

However, for the case where devices are present at boot it appears we are
leaking a lot of references, resulting in the device nodes never actually
being released/freed after a dlpar remove. In the boot-time CPU case I
count 8 more references taken than in the hotplug case, and the corresponding
of_node_put's are not called at dlpar remove time either. That will take
some time to track down, review and clean up.

-Tyrel

> 
>> I'll look into it again this week.
> 
> Thanks.
> 
> cheers
> 



ZONE_DEVICE and pmem API support for powerpc

2017-04-11 Thread Oliver O'Halloran
Hi all,

This series adds support for ZONE_DEVICE and the pmem api on powerpc. Namely,
support for altmaps and the various bits and pieces required for DAX PMD faults.
The first two patches touch generic mm/ code, but otherwise this is fairly well
contained in arch/powerpc.

If the nvdimm folks could sanity check this series I'd appreciate it.

Series is based on next-20170411, but it should apply elsewhere with minor
fixups to arch_{add|remove}_memory due to conflicts with HMM.  For those
interested in testing this, there is a driver and matching firmware that carves
out some system memory for use as an emulated Con Tutto memory card.

Driver: https://github.com/oohal/linux/tree/contutto-next
Firmware: https://github.com/oohal/skiboot/tree/fake-contutto

Edit core/init.c:686 to control the amount of memory borrowed for the emulated
device. I'm keeping the driver out of tree until 4.13 since I plan on
reworking the firmware interface anyway and there's at least one showstopper
bug.


Thanks,
Oliver



[PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more

2017-04-11 Thread Oliver O'Halloran
Depending on the flags of the PMD being zapped, there may or may not be a
deposited pgtable to be freed. In two of the three cases this is open
coded, while the third uses the zap_deposited_table() helper. This patch
converts the other two to use the helper and clean things up a bit.

Cc: "Aneesh Kumar K.V" 
Cc: "Kirill A. Shutemov" 
Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran 
---
For reference:

void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
pgtable_t pgtable;

pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pte_free(mm, pgtable);
atomic_long_dec(&mm->nr_ptes);
}
---
 mm/huge_memory.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b787c4cfda0e..aa01dd47cc65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1615,8 +1615,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd))
tlb_remove_page_size(tlb, pmd_page(orig_pmd), 
HPAGE_PMD_SIZE);
} else if (is_huge_zero_pmd(orig_pmd)) {
-   pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
-   atomic_long_dec(&tlb->mm->nr_ptes);
+   zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
} else {
@@ -1625,10 +1624,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page)) {
-   pgtable_t pgtable;
-   pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-   pte_free(tlb->mm, pgtable);
-   atomic_long_dec(&tlb->mm->nr_ptes);
+   zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
if (arch_needs_pgtable_deposit())
-- 
2.9.3



[PATCH 2/9] mm/huge_memory: Deposit a pgtable for DAX PMD faults when required

2017-04-11 Thread Oliver O'Halloran
Although all architectures use a deposited page table for THP on anonymous VMAs,
some architectures (s390 and powerpc) require the deposited storage even for
file-backed VMAs due to quirks of their MMUs. This patch adds support for
depositing a table in the DAX PMD fault handling path for the architectures
that require it. Other architectures should see no functional changes.

Cc: "Aneesh Kumar K.V" 
Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran 
---
 mm/huge_memory.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa01dd47cc65..a84909cf20d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -715,7 +715,8 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 }
 
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
-   pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write)
+   pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
+   pgtable_t pgtable)
 {
struct mm_struct *mm = vma->vm_mm;
pmd_t entry;
@@ -729,6 +730,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
unsigned long addr,
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
}
+
+   if (pgtable) {
+   pgtable_trans_huge_deposit(mm, pmd, pgtable);
+   atomic_long_inc(&mm->nr_ptes);
+   }
+
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
spin_unlock(ptl);
@@ -738,6 +745,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned 
long addr,
pmd_t *pmd, pfn_t pfn, bool write)
 {
pgprot_t pgprot = vma->vm_page_prot;
+   pgtable_t pgtable = NULL;
/*
 * If we had pmd_special, we could avoid all these restrictions,
 * but we need to be consistent with PTEs and architectures that
@@ -752,9 +760,15 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, 
unsigned long addr,
if (addr < vma->vm_start || addr >= vma->vm_end)
return VM_FAULT_SIGBUS;
 
+   if (arch_needs_pgtable_deposit()) {
+   pgtable = pte_alloc_one(vma->vm_mm, addr);
+   if (!pgtable)
+   return VM_FAULT_OOM;
+   }
+
track_pfn_insert(vma, &pgprot, pfn);
 
-   insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+   insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write, pgtable);
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
@@ -1611,6 +1625,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
tlb->fullmm);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
if (vma_is_dax(vma)) {
+   if (arch_needs_pgtable_deposit())
+   zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
if (is_huge_zero_pmd(orig_pmd))
tlb_remove_page_size(tlb, pmd_page(orig_pmd), 
HPAGE_PMD_SIZE);
-- 
2.9.3



[PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.

2017-04-11 Thread Oliver O'Halloran
From: "Aneesh Kumar K.V" 

Add a _PAGE_DEVMAP bit for PTE and DAX PMD entries. PowerPC doesn't
currently support PUD faults so we haven't extended it to the PUD
level.

Cc: Aneesh Kumar K.V 
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 37 +++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index fb72ff6b98e6..b5fc6337649e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -78,6 +78,9 @@
 
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
+#define _PAGE_DEVMAP   _RPAGE_SW1
+#define __HAVE_ARCH_PTE_DEVMAP
+
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
  * Instead of fixing all of them, add an alternate define which
@@ -602,6 +605,16 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+   return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP);
+}
+
+static inline int pte_devmap(pte_t pte)
+{
+   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
+}
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
/* FIXME!! check whether this need to be a conditional */
@@ -966,6 +979,9 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mk_savedwrite(pmd) pte_pmd(pte_mk_savedwrite(pmd_pte(pmd)))
 #define pmd_clear_savedwrite(pmd)  
pte_pmd(pte_clear_savedwrite(pmd_pte(pmd)))
 
+#define pud_pfn(...) (0)
+#define pgd_pfn(...) (0)
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd)pte_soft_dirty(pmd_pte(pmd))
 #define pmd_mksoft_dirty(pmd)  pte_pmd(pte_mksoft_dirty(pmd_pte(pmd)))
@@ -1140,7 +1156,6 @@ static inline int pmd_move_must_withdraw(struct spinlock 
*new_pmd_ptl,
return true;
 }
 
-
 #define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
 static inline bool arch_needs_pgtable_deposit(void)
 {
@@ -1149,6 +1164,26 @@ static inline bool arch_needs_pgtable_deposit(void)
return true;
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+   return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
+}
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+   return pte_devmap(pmd_pte(pmd));
+}
+
+static inline int pud_devmap(pud_t pud)
+{
+   return 0;
+}
+
+static inline int pgd_devmap(pgd_t pgd)
+{
+   return 0;
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
-- 
2.9.3



[PATCH 4/9] powerpc/mm: Reshuffle vmemmap_free()

2017-04-11 Thread Oliver O'Halloran
Removes an indentation level and shuffles some code around to make the
following patch cleaner. No functional changes.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/mm/init_64.c | 47 +--
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index ec84b31c6c86..f8124edb6ffa 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -234,12 +234,15 @@ static unsigned long vmemmap_list_free(unsigned long 
start)
 void __ref vmemmap_free(unsigned long start, unsigned long end)
 {
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long page_order = get_order(page_size);
 
start = _ALIGN_DOWN(start, page_size);
 
pr_debug("vmemmap_free %lx...%lx\n", start, end);
 
for (; start < end; start += page_size) {
+   struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+   unsigned int nr_pages;
unsigned long addr;
 
/*
@@ -251,29 +254,29 @@ void __ref vmemmap_free(unsigned long start, unsigned 
long end)
continue;
 
addr = vmemmap_list_free(start);
-   if (addr) {
-   struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
-
-   if (PageReserved(page)) {
-   /* allocated from bootmem */
-   if (page_size < PAGE_SIZE) {
-   /*
-* this shouldn't happen, but if it is
-* the case, leave the memory there
-*/
-   WARN_ON_ONCE(1);
-   } else {
-   unsigned int nr_pages =
-   1 << get_order(page_size);
-   while (nr_pages--)
-   free_reserved_page(page++);
-   }
-   } else
-   free_pages((unsigned long)(__va(addr)),
-   get_order(page_size));
-
-   vmemmap_remove_mapping(start, page_size);
+   if (!addr)
+   continue;
+
+   page = pfn_to_page(addr >> PAGE_SHIFT);
+   nr_pages = 1 << page_order;
+
+   if (PageReserved(page)) {
+   /* allocated from bootmem */
+   if (page_size < PAGE_SIZE) {
+   /*
+* this shouldn't happen, but if it is
+* the case, leave the memory there
+*/
+   WARN_ON_ONCE(1);
+   } else {
+   while (nr_pages--)
+   free_reserved_page(page++);
+   }
+   } else {
+   free_pages((unsigned long)(__va(addr)), page_order);
}
+
+   vmemmap_remove_mapping(start, page_size);
}
 }
 #endif
-- 
2.9.3



[PATCH 5/9] powerpc/vmemmap: Add altmap support

2017-04-11 Thread Oliver O'Halloran
Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
altmap is a driver-provided region used as the backing storage for the
struct pages of ZONE_DEVICE memory. In situations where a large amount of
ZONE_DEVICE memory is being added to the system, the altmap reduces
pressure on main system memory by allowing the mm/ metadata to be stored
on the device itself rather than in main memory.
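
For illustration, a driver might describe its altmap roughly as in the
sketch below when mapping the device range (written against the 4.11-era
devm_memremap_pages() interface; the resource and percpu_ref handling are
assumptions, not this patch's code):

/* Sketch: back the vmemmap for a device range out of the range itself. */
static void *map_device_memory(struct device *dev, struct resource *res,
			       struct percpu_ref *ref)
{
	struct vmem_altmap altmap = {
		.base_pfn = PHYS_PFN(res->start),
		.free	  = PHYS_PFN(resource_size(res)),
	};

	/* struct pages for this range are allocated from the altmap region
	 * on the device rather than from main memory
	 * (devm_memremap_pages() copies the altmap description).
	 */
	return devm_memremap_pages(dev, res, ref, &altmap);
}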

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/mm/init_64.c | 20 +++-
 arch/powerpc/mm/mem.c | 16 +---
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index f8124edb6ffa..225fbb8034e6 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -171,13 +172,17 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
pr_debug("vmemmap_populate %lx..%lx, node %d\n", start, end, node);
 
for (; start < end; start += page_size) {
+   struct vmem_altmap *altmap;
void *p;
int rc;
 
if (vmemmap_populated(start, page_size))
continue;
 
-   p = vmemmap_alloc_block(page_size, node);
+   /* altmap lookups only work at section boundaries */
+   altmap = to_vmem_altmap(SECTION_ALIGN_DOWN(start));
+
+   p =  __vmemmap_alloc_block_buf(page_size, node, altmap);
if (!p)
return -ENOMEM;
 
@@ -241,9 +246,10 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end)
pr_debug("vmemmap_free %lx...%lx\n", start, end);
 
for (; start < end; start += page_size) {
-   struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
-   unsigned int nr_pages;
-   unsigned long addr;
+   unsigned long nr_pages, addr;
+   struct vmem_altmap *altmap;
+   struct page *section_base;
+   struct page *page;
 
/*
 * the section has already be marked as invalid, so
@@ -258,9 +264,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end)
continue;
 
page = pfn_to_page(addr >> PAGE_SHIFT);
+   section_base = pfn_to_page(vmemmap_section_start(start));
nr_pages = 1 << page_order;
 
-   if (PageReserved(page)) {
+   altmap = to_vmem_altmap((unsigned long) section_base);
+   if (altmap) {
+   vmem_altmap_free(altmap, nr_pages);
+   } else if (PageReserved(page)) {
/* allocated from bootmem */
if (page_size < PAGE_SIZE) {
/*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3bbba178b464..6f7b64eaa9d8 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -176,7 +177,8 @@ int arch_remove_memory(u64 start, u64 size, enum 
memory_type type)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
+   struct vmem_altmap *altmap;
+   struct page *page;
int ret;
 
/*
@@ -193,8 +195,16 @@ int arch_remove_memory(u64 start, u64 size, enum 
memory_type type)
return -EINVAL;
}
 
-   zone = page_zone(pfn_to_page(start_pfn));
-   ret = __remove_pages(zone, start_pfn, nr_pages);
+   /*
+* If we have an altmap then we need to skip over any reserved PFNs
+* when querying the zone.
+*/
+   page = pfn_to_page(start_pfn);
+   altmap = to_vmem_altmap((unsigned long) page);
+   if (altmap)
+   page += vmem_altmap_offset(altmap);
+
+   ret = __remove_pages(page_zone(page), start_pfn, nr_pages);
if (ret)
return ret;
 
-- 
2.9.3



[PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc

2017-04-11 Thread Oliver O'Halloran
Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
but recommended.

Signed-off-by: Oliver O'Halloran 
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 43d000e44424..d696af58f97f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -724,7 +724,7 @@ config ZONE_DEVICE
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on (X86_64 || PPC_BOOK3S_64)  #arch_add_memory() comprehends 
device memory
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3



[PATCH 7/9] powerpc/mm: Wire up ioremap_cache

2017-04-11 Thread Oliver O'Halloran
The default implementation of ioremap_cache() is aliased to ioremap().
On powerpc, ioremap() creates cache-inhibited mappings by default, which
is almost certainly not what you want.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/io.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 5ed292431b5b..839eb031857f 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -757,6 +757,8 @@ extern void __iomem *ioremap_prot(phys_addr_t address, 
unsigned long size,
 extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
 #define ioremap_nocache(addr, size)ioremap((addr), (size))
 #define ioremap_uc(addr, size) ioremap((addr), (size))
+#define ioremap_cache(addr, size) \
+   ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL))
 
 extern void iounmap(volatile void __iomem *addr);
 
-- 
2.9.3



[PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Oliver O'Halloran
From: Rashmica Gupta 

Adds support for removing bolted (i.e. kernel linear mapping) mappings on
powernv. This is needed to support memory hot unplug operations which
are required for the teardown of DAX/PMEM devices.

Cc: Rashmica Gupta 
Cc: Anton Blanchard 
Signed-off-by: Oliver O'Halloran 
---
Could the original author of this add their S-o-b? I pulled it out of
Rashmica's memtrace patch, but I remember someone saying Anton wrote
it originally.
---
 arch/powerpc/mm/hash_native_64.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 65bb8f33b399..9ba91d4905a4 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long 
newpp, unsigned long ea,
tlbie(vpn, psize, psize, ssize, 0);
 }
 
+/*
+ * Remove a bolted kernel entry. Memory hotplug uses this.
+ *
+ * No need to lock here because we should be the only user.
+ */
+static int native_hpte_removebolted(unsigned long ea, int psize, int ssize)
+{
+   unsigned long vpn;
+   unsigned long vsid;
+   long slot;
+   struct hash_pte *hptep;
+
+   vsid = get_kernel_vsid(ea, ssize);
+   vpn = hpt_vpn(ea, vsid, ssize);
+
+   slot = native_hpte_find(vpn, psize, ssize);
+   if (slot == -1)
+   return -ENOENT;
+
+   hptep = htab_address + slot;
+
+   /* Invalidate the hpte */
+   hptep->v = 0;
+
+   /* Invalidate the TLB */
+   tlbie(vpn, psize, psize, ssize, 0);
+   return 0;
+}
+
+
 static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
   int bpsize, int apsize, int ssize, int local)
 {
@@ -725,6 +755,7 @@ void __init hpte_init_native(void)
mmu_hash_ops.hpte_invalidate= native_hpte_invalidate;
mmu_hash_ops.hpte_updatepp  = native_hpte_updatepp;
mmu_hash_ops.hpte_updateboltedpp = native_hpte_updateboltedpp;
+   mmu_hash_ops.hpte_removebolted = native_hpte_removebolted;
mmu_hash_ops.hpte_insert= native_hpte_insert;
mmu_hash_ops.hpte_remove= native_hpte_remove;
mmu_hash_ops.hpte_clear_all = native_hpte_clear;
-- 
2.9.3



[PATCH 9/9] powerpc: Add pmem API support

2017-04-11 Thread Oliver O'Halloran
Initial powerpc support for the arch-specific bit of the persistent
memory API. Nothing fancy here.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/Kconfig|   1 +
 arch/powerpc/include/asm/pmem.h | 109 
 arch/powerpc/kernel/misc_64.S   |   2 +-
 3 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/pmem.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7413ed700b8..cf84d0db49ab 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -87,6 +87,7 @@ config PPC
select ARCH_HAS_DMA_SET_COHERENT_MASK
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_PMEM_API
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/pmem.h b/arch/powerpc/include/asm/pmem.h
new file mode 100644
index ..27da9594040f
--- /dev/null
+++ b/arch/powerpc/include/asm/pmem.h
@@ -0,0 +1,109 @@
+/*
+ * Copyright(c) 2017 IBM Corporation. All rights reserved.
+ *
+ * Based on the x86 version.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ASM_POWERPC_PMEM_H__
+#define __ASM_POWERPC_PMEM_H__
+
+#include 
+#include 
+#include 
+
+/*
+ * See include/linux/pmem.h for API documentation
+ *
+ * PPC specific notes:
+ *
+ * 1. PPC has no non-temporal (cache bypassing) stores so we're stuck with
+ *doing cache writebacks.
+ *
+ * 2. DCBST is a suggestion. DCBF *will* force a writeback.
+ *
+ */
+
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+   unsigned long iaddr = (unsigned long) addr;
+
+   /* NB: contains a barrier */
+   flush_inval_dcache_range(iaddr, iaddr + size);
+}
+
+/* invalidate and writeback are functionally identical */
+#define arch_invalidate_pmem arch_wb_cache_pmem
+
+static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
+{
+   int unwritten;
+
+   /*
+* We are copying between two kernel buffers, if
+* __copy_from_user_inatomic_nocache() returns an error (page
+* fault) we would have already reported a general protection fault
+* before the WARN+BUG.
+*
+* XXX: replace this with a hand-rolled memcpy+dcbf
+*/
+   unwritten = __copy_from_user_inatomic(dst, (void __user *) src, n);
+   if (WARN(unwritten, "%s: fault copying %p <- %p unwritten: %d\n",
+   __func__, dst, src, unwritten))
+   BUG();
+
+   arch_wb_cache_pmem(dst, n);
+}
+
+static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
+{
+   /*
+* TODO: We should have most of the infrastructure for MCE handling
+*   but it needs to be made slightly smarter.
+*/
+   memcpy(dst, src, n);
+   return 0;
+}
+
+static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
+   struct iov_iter *i)
+{
+   size_t len;
+
+   /* XXX: under what conditions would this return len < size? */
+   len = copy_from_iter(addr, bytes, i);
+   arch_wb_cache_pmem(addr, bytes - len);
+
+   return len;
+}
+
+static inline void arch_clear_pmem(void *addr, size_t size)
+{
+   void *start = addr;
+
+   /*
+* XXX: A hand rolled dcbz+dcbf loop would probably be better.
+*/
+
+   if (((uintptr_t) addr & ~PAGE_MASK) == 0) {
+   while (size >= PAGE_SIZE) {
+   clear_page(addr);
+   addr += PAGE_SIZE;
+   size -= PAGE_SIZE;
+   }
+   }
+
+   if (size)
+   memset(addr, 0, size);
+
+   arch_wb_cache_pmem(start, size);
+}
+
+#endif /* __ASM_POWERPC_PMEM_H__ */
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index c119044cad0d..1378a8d61faf 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -182,7 +182,7 @@ _GLOBAL(flush_dcache_phys_range)
isync
blr
 
-_GLOBAL(flush_inval_dcache_range)
+_GLOBAL_TOC(flush_inval_dcache_range)
ld  r10,PPC64_CACHES@toc(r2)
lwz r7,DCACHEL1BLOCKSIZE(r10)   /* Get dcache block size */
addir5,r7,-1
-- 
2.9.3
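
As a side note on the XXX comments in arch_memcpy_to_pmem() and
arch_clear_pmem() above: a hand-rolled writeback loop might look roughly
like the sketch below (an assumption-laden sketch that flushes at
L1_CACHE_BYTES granularity and ends with a sync; this is not part of the
patch):

static inline void pmem_flush_range(void *addr, size_t size)
{
	unsigned long start = (unsigned long)addr & ~(unsigned long)(L1_CACHE_BYTES - 1);
	unsigned long end = (unsigned long)addr + size;
	unsigned long p;

	/* dcbf forces each dirty cache block out to memory */
	for (p = start; p < end; p += L1_CACHE_BYTES)
		asm volatile("dcbf 0,%0" : : "r" (p) : "memory");

	/* order the flushes against subsequent stores */
	asm volatile("sync" : : : "memory");
}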



Re: ZONE_DEVICE and pmem API support for powerpc

2017-04-11 Thread Dan Williams
On Tue, Apr 11, 2017 at 10:42 AM, Oliver O'Halloran  wrote:
> Hi all,
>
> This series adds support for ZONE_DEVICE and the pmem api on powerpc. Namely,
> support for altmaps and the various bits and pieces required for DAX PMD 
> faults.
> The first two patches touch generic mm/ code, but otherwise this is fairly 
> well
> contained in arch/powerpc.
>
> If the nvdimm folks could sanity check this series I'd appreciate it.

Quick feedback: I'm in the process of cleaning up and resubmitting my
patch set to push the pmem api down into the driver directly.

https://lwn.net/Articles/713064/

I'm also reworking memory hotplug to allow sub-section allocations
which has collided with Michal Hocko's hotplug reworks. It will be
good to have some more eyes on that work to understand the cross-arch
implications.

https://lkml.org/lkml/2017/3/19/146

> Series is based on next-20170411, but it should apply elsewhere with minor
> fixups to arch_{add|remove}_memory due to conflicts with HMM.  For those
> interested in testing this, there is a driver and matching firmware that 
> carves
> out some system memory for use as an emulated Con Tutto memory card.
>
> Driver: https://github.com/oohal/linux/tree/contutto-next
> Firmware: https://github.com/oohal/skiboot/tree/fake-contutto
>
> Edit core/init.c:686 to control the amount of memory borrowed for the emulated
> device.  I'm keeping the driver out of tree for a until 4.13 since I plan on
> reworking the firmware interface anyway and There's at least one showstopper
> bug.

Is this memory card I/O-cache coherent? I.e. existing dma mapping api
can hand out mappings to it? Just trying to figure out if this the
existing pmem-definition of ZONE_DEVICE or a new one.


Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Anton Blanchard
Hi Oliver,

> From: Rashmica Gupta 
> 
> Adds support for removing bolted (i.e kernel linear mapping) mappings
> on powernv. This is needed to support memory hot unplug operations
> which are required for the teardown of DAX/PMEM devices.
> 
> Cc: Rashmica Gupta 
> Cc: Anton Blanchard 
> Signed-off-by: Oliver O'Halloran 
> ---
> Could the original author of this add their S-o-b? I pulled it out of
> Rashmica's memtrace patch, but I remember someone saying Anton wrote
> it originally.

I did.

Signed-off-by: Anton Blanchard 

Anton

> ---
>  arch/powerpc/mm/hash_native_64.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/powerpc/mm/hash_native_64.c
> b/arch/powerpc/mm/hash_native_64.c index 65bb8f33b399..9ba91d4905a4
> 100644 --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned
> long newpp, unsigned long ea, tlbie(vpn, psize, psize, ssize, 0);
>  }
>  
> +/*
> + * Remove a bolted kernel entry. Memory hotplug uses this.
> + *
> + * No need to lock here because we should be the only user.
> + */
> +static int native_hpte_removebolted(unsigned long ea, int psize, int
> ssize) +{
> + unsigned long vpn;
> + unsigned long vsid;
> + long slot;
> + struct hash_pte *hptep;
> +
> + vsid = get_kernel_vsid(ea, ssize);
> + vpn = hpt_vpn(ea, vsid, ssize);
> +
> + slot = native_hpte_find(vpn, psize, ssize);
> + if (slot == -1)
> + return -ENOENT;
> +
> + hptep = htab_address + slot;
> +
> + /* Invalidate the hpte */
> + hptep->v = 0;
> +
> + /* Invalidate the TLB */
> + tlbie(vpn, psize, psize, ssize, 0);
> + return 0;
> +}
> +
> +
>  static void native_hpte_invalidate(unsigned long slot, unsigned long
> vpn, int bpsize, int apsize, int ssize, int local)
>  {
> @@ -725,6 +755,7 @@ void __init hpte_init_native(void)
>   mmu_hash_ops.hpte_invalidate= native_hpte_invalidate;
>   mmu_hash_ops.hpte_updatepp  = native_hpte_updatepp;
>   mmu_hash_ops.hpte_updateboltedpp =
> native_hpte_updateboltedpp;
> + mmu_hash_ops.hpte_removebolted = native_hpte_removebolted;
>   mmu_hash_ops.hpte_insert= native_hpte_insert;
>   mmu_hash_ops.hpte_remove= native_hpte_remove;
>   mmu_hash_ops.hpte_clear_all = native_hpte_clear;



Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Stephen Rothwell
Hi Oliver,

On Wed, 12 Apr 2017 08:50:56 +1000 Anton Blanchard  wrote:
>
> > From: Rashmica Gupta 
> > 
> > Adds support for removing bolted (i.e kernel linear mapping) mappings
> > on powernv. This is needed to support memory hot unplug operations
> > which are required for the teardown of DAX/PMEM devices.
> > 
> > Cc: Rashmica Gupta 
> > Cc: Anton Blanchard 
> > Signed-off-by: Oliver O'Halloran 
> > ---
> > Could the original author of this add their S-o-b? I pulled it out of
> > Rashmica's memtrace patch, but I remember someone saying Anton wrote
> > it originally.  
> 
> I did.
> 
> Signed-off-by: Anton Blanchard 

If you are going to claim that Rashmica authored this patch (and you do
with the From: line above), then you need her Signed-off-by as well.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.

2017-04-11 Thread Stephen Rothwell
Hi Oliver,

On Wed, 12 Apr 2017 03:42:27 +1000 Oliver O'Halloran  wrote:
>
> From: "Aneesh Kumar K.V" 
> 
> Add a _PAGE_DEVMAP bit for PTE and DAX PMD entires. PowerPC doesn't
> currently support PUD faults so we haven't extended it to the PUD
> level.
> 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: Oliver O'Halloran 

This needs Aneesh's Signed-off-by.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH 5/9] powerpc/vmemmap: Add altmap support

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
> altmap is a driver provided region that is used to provide the backing
> storage for the struct pages of ZONE_DEVICE memory. In situations where
> large amount of ZONE_DEVICE memory is being added to the system the
> altmap reduces pressure on main system memory by allowing the mm/
> metadata to be stored on the device itself rather in main memory.
> 
> Signed-off-by: Oliver O'Halloran 
> ---

Reviewed-by: Balbir Singh 


Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
> but recommended.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 43d000e44424..d696af58f97f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -724,7 +724,7 @@ config ZONE_DEVICE
>   depends on MEMORY_HOTPLUG
>   depends on MEMORY_HOTREMOVE
>   depends on SPARSEMEM_VMEMMAP
> - depends on X86_64 #arch_add_memory() comprehends device memory
> + depends on (X86_64 || PPC_BOOK3S_64)  #arch_add_memory() comprehends 
> device memory

Reviewed-by: Balbir Singh 


Re: [PATCH 4/9] powerpc/mm: Reshuffle vmemmap_free()

2017-04-11 Thread Stephen Rothwell
Hi Oliver,

On Wed, 12 Apr 2017 03:42:28 +1000 Oliver O'Halloran  wrote:
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index ec84b31c6c86..f8124edb6ffa 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -234,12 +234,15 @@ static unsigned long vmemmap_list_free(unsigned long 
> start)
>  void __ref vmemmap_free(unsigned long start, unsigned long end)
>  {
>   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
> + unsigned long page_order = get_order(page_size);
>  
>   start = _ALIGN_DOWN(start, page_size);
>  
>   pr_debug("vmemmap_free %lx...%lx\n", start, end);
>  
>   for (; start < end; start += page_size) {
> + struct page *page = pfn_to_page(addr >> PAGE_SHIFT);

The declaration of addr is below here and, even so, it would be
uninitialised ...

> + unsigned int nr_pages;
>   unsigned long addr;

-- 
Cheers,
Stephen Rothwell


Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc

2017-04-11 Thread Stephen Rothwell
Hi Oliver,

On Wed, 12 Apr 2017 03:42:30 +1000 Oliver O'Halloran  wrote:
>
> Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
> but recommended.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 43d000e44424..d696af58f97f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -724,7 +724,7 @@ config ZONE_DEVICE
>   depends on MEMORY_HOTPLUG
>   depends on MEMORY_HOTREMOVE
>   depends on SPARSEMEM_VMEMMAP
> - depends on X86_64 #arch_add_memory() comprehends device memory
> + depends on (X86_64 || PPC_BOOK3S_64)  #arch_add_memory() comprehends 
> device memory
>  

That's fine, but at what point do we create
CONFIG_ARCH_HAVE_ZONE_DEVICE, replace the "depends on
" above with "depends on ARCH_HAVE_ZONE_DEVICE" and
select that from the appropriate places?

-- 
Cheers,
Stephen Rothwell


Re: ZONE_DEVICE and pmem API support for powerpc

2017-04-11 Thread Stephen Rothwell
Hi Oliver,

On Wed, 12 Apr 2017 03:42:24 +1000 Oliver O'Halloran  wrote:
>
> Series is based on next-20170411, but it should apply elsewhere with minor
> fixups to arch_{add|remove}_memory due to conflicts with HMM.  For those

Just to make life fun for you, Andrew has dropped the HMM patches from
his quilt series today (and so they will not be in next-20170412).

-- 
Cheers,
Stephen Rothwell


[PATCH 1/2] lib/raid6: Build proper files on corresponding arch

2017-04-11 Thread Matt Brown
Previously the raid6 test Makefile did not correctly build the files for
testing on PowerPC. This patch fixes the bug, so that all appropriate files
for PowerPC are built.

Signed-off-by: Matt Brown 
---
 lib/raid6/test/Makefile | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 9c333e9..62b26d1 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes)
 CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
 else
 HAS_ALTIVEC := $(shell printf '\#include \nvector int a;\n' 
|\
- gcc -c -x c - >&/dev/null && \
- rm ./-.o && echo yes)
+gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
 ifeq ($(HAS_ALTIVEC),yes)
-OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+   CFLAGS += -I../../../arch/powerpc/include
+   CFLAGS += -DCONFIG_ALTIVEC
+   OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+   vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 endif
 endif
 ifeq ($(ARCH),tilegx)
-- 
2.9.3



[PATCH v3 2/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

2017-04-11 Thread Matt Brown
The raid6 Q syndrome check has been optimised using the vpermxor
instruction. This instruction was made available with POWER8, ISA version
2.07. It allows for both vperm and vxor instructions to be done in a single
instruction. This has been tested for correctness on a ppc64le vm with a
basic RAID6 setup containing 5 drives.
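
For reference, the scalar operation being vectorised here is the GF(2^8)
multiply-by-2 used to build the Q syndrome (RAID-6 uses the 0x11d
polynomial). The existing altivec code implements it as a shift/mask/XOR
sequence; vpermxor collapses the equivalent nibble-table lookup and XOR
into a single instruction. A plain-C sketch of the scalar step
(illustrative, not the generated vpermxor*.c code):

#include <stdint.h>

/* Multiply one byte by the RAID-6 generator (2) in GF(2^8), reducing with
 * 0x1d (the low byte of the 0x11d polynomial) when the top bit is set.
 */
static inline uint8_t gf_mul2(uint8_t b)
{
	return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1d : 0x00));
}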

The performance benchmarks are from the raid6test in the /lib/raid6/test
directory. These results are from an IBM Firestone machine with ppc64le
architecture. The benchmark results show a 35% speed increase over the best
existing algorithm for powerpc (altivec). The raid6test has also been run
on a big-endian ppc64 vm to ensure it also works for big-endian
architectures.

Performance benchmarks:
raid6: altivecx4 gen() 18773 MB/s
raid6: altivecx8 gen() 19438 MB/s

raid6: vpermxor4 gen() 25112 MB/s
raid6: vpermxor8 gen() 26279 MB/s

Note: Fixed minor bug in altivec.uc regarding missing and mismatched ifdef
statements.

Signed-off-by: Matt Brown 
---
Changelog
v2
- Change CONFIG_ALTIVEC to CPU_FTR_ALTIVEC_COMP
- Separate the bug fix into a different patch
---
 include/linux/raid/pq.h |   4 ++
 lib/raid6/Makefile  |  27 -
 lib/raid6/algos.c   |   4 ++
 lib/raid6/altivec.uc|   3 ++
 lib/raid6/test/Makefile |  14 ++-
 lib/raid6/vpermxor.uc   | 104 
 6 files changed, 154 insertions(+), 2 deletions(-)
 create mode 100644 lib/raid6/vpermxor.uc

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..3df9aa6 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
 extern const struct raid6_calls raid6_avx512x4;
 extern const struct raid6_calls raid6_tilegx8;
 extern const struct raid6_calls raid6_s390vx8;
+extern const struct raid6_calls raid6_vpermxor1;
+extern const struct raid6_calls raid6_vpermxor2;
+extern const struct raid6_calls raid6_vpermxor4;
+extern const struct raid6_calls raid6_vpermxor8;
 
 struct raid6_recov_calls {
void (*data2)(int, size_t, int, int, void **);
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3057011..7775aad 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -4,7 +4,8 @@ raid6_pq-y  += algos.o recov.o tables.o int1.o int2.o 
int4.o \
   int8.o int16.o int32.o
 
 raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o 
avx2.o avx512.o recov_avx512.o
-raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
+   vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
 raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
@@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
 $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
$(call if_changed,unroll)
 
+CFLAGS_vpermxor1.o += $(altivec_flags)
+targets += vpermxor1.c
+$(obj)/vpermxor1.c: UNROLL := 1
+$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor2.o += $(altivec_flags)
+targets += vpermxor2.c
+$(obj)/vpermxor2.c: UNROLL := 2
+$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor4.o += $(altivec_flags)
+targets += vpermxor4.c
+$(obj)/vpermxor4.c: UNROLL := 4
+$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor8.o += $(altivec_flags)
+targets += vpermxor8.c
+$(obj)/vpermxor8.c: UNROLL := 8
+$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
 CFLAGS_neon1.o += $(NEON_FLAGS)
 targets += neon1.c
 $(obj)/neon1.c:   UNROLL := 1
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 7857049..edd4f69 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_altivec2,
&raid6_altivec4,
&raid6_altivec8,
+   &raid6_vpermxor1,
+   &raid6_vpermxor2,
+   &raid6_vpermxor4,
+   &raid6_vpermxor8,
 #endif
 #if defined(CONFIG_TILEGX)
&raid6_tilegx8,
diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 682aae8..d20ed0d 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -24,10 +24,13 @@
 
 #include 
 
+#ifdef CONFIG_ALTIVEC
+
 #include 
 #ifdef __KERNEL__
 # include 
 # include 
+#endif /* __KERNEL__ */
 
 /*
  * This is the C data type to use.  We use a vector of
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 2c7b60e..9c333e9 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -97,6 +97,18 @@ altivec4.c: altivec.uc ../unroll.awk
 altivec8.c: altivec.uc .

Re: EEH error in doing DMA with PEX 8619

2017-04-11 Thread IanJiang
On Tue, Apr 11, 2017 at 5:37 PM, Benjamin Herrenschmidt [via linuxppc]
 wrote:

> Another possibility would be if the requests from the PLX have a 
> different initiator ID on the bus than the device you are setting up 
> the DMA for. 

Is there a way to check out the initiator ID in the driver? I'd like to make
sure of this.



--
View this message in context: 
http://linuxppc.10917.n7.nabble.com/EEH-error-in-doing-DMA-with-PEX-8619-tp121121p121224.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.


Re: EEH error in doing DMA with PEX 8619

2017-04-11 Thread Benjamin Herrenschmidt
On Tue, 2017-04-11 at 18:39 -0700, IanJiang wrote:
> On Tue, Apr 11, 2017 at 5:37 PM, Benjamin Herrenschmidt [via
> linuxppc]
>  wrote:
> 
> > Another possibility would be if the requests from the PLX have a 
> > different initiator ID on the bus than the device you are setting
> > up 
> > the DMA for. 
> 
> Is there a way to check out the initiator ID in the driver? I'd like
> to make sure of this.

If you are running bare metal (i.e. not under any hypervisor, aka the
"powernv" platform), the EEH error log will contain a register dump. If
you paste that to us, we might be able to decode it; it will tell us
more about the cause of the failure, including possibly the
initiator of the failing transaction.

The initiator ID (aka RID, aka bus/device/fn) of the DMA packets must
match that of the struct pci_dev you are using to establish the
mapping.
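
One quick way to check that from the driver is to print the requester ID
of the pci_dev you pass to the DMA mapping calls and compare it against
what the EEH log reports, e.g. (sketch, using the standard PCI helpers):

	dev_info(&pdev->dev, "DMA mapping RID %04x:%02x:%02x.%d\n",
		 pci_domain_nr(pdev->bus), pdev->bus->number,
		 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));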

Cheers,
Ben.

> 
> --
> View this message in context: http://linuxppc.10917.n7.nabble.com/EEH
> -error-in-doing-DMA-with-PEX-8619-tp121121p121224.html
> Sent from the linuxppc-dev mailing list archive at Nabble.com.



Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Balbir Singh
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> From: Rashmica Gupta 
> 
> Adds support for removing bolted (i.e. kernel linear mapping) mappings on
> powernv. This is needed to support memory hot unplug operations which
> are required for the teardown of DAX/PMEM devices.
> 
> Cc: Rashmica Gupta 
> Cc: Anton Blanchard 
> Signed-off-by: Oliver O'Halloran 
> ---
> Could the original author of this add their S-o-b? I pulled it out of
> Rashmica's memtrace patch, but I remember someone saying Anton wrote
> it originally.
> ---
>  arch/powerpc/mm/hash_native_64.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/arch/powerpc/mm/hash_native_64.c 
> b/arch/powerpc/mm/hash_native_64.c
> index 65bb8f33b399..9ba91d4905a4 100644
> --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long 
> newpp, unsigned long ea,
>   tlbie(vpn, psize, psize, ssize, 0);
>  }
>  
> +/*
> + * Remove a bolted kernel entry. Memory hotplug uses this.
> + *
> + * No need to lock here because we should be the only user.

As long as this is after the necessary isolation and is called from
arch_remove_memory(), I think we should be fine.

> + */
> +static int native_hpte_removebolted(unsigned long ea, int psize, int ssize)
> +{
> + unsigned long vpn;
> + unsigned long vsid;
> + long slot;
> + struct hash_pte *hptep;
> +
> + vsid = get_kernel_vsid(ea, ssize);
> + vpn = hpt_vpn(ea, vsid, ssize);
> +
> + slot = native_hpte_find(vpn, psize, ssize);
> + if (slot == -1)
> + return -ENOENT;

If slot == -1, it means someone else removed the HPTE entry? Are we racing?
I suspect we should never hit this situation during hotunplug, specifically
since this is bolted.

> +
> + hptep = htab_address + slot;
> +
> + /* Invalidate the hpte */
> + hptep->v = 0;

Under DEBUG or otherwise, I would add more checks like

1. Was hpte_v & HPTE_V_VALID and BOLTED set? If not, we've already invalidated
that hpte and we can skip the tlbie. Since this was bolted, you might be right
that it is always valid and bolted.
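
Something along these lines at the top of the function, for example (sketch
only, untested; assumes hptep has already been looked up as in the hunk above):

	/* Sketch of the suggested check: if the entry is already invalid
	 * there is nothing to do and we can skip the tlbie entirely;
	 * a bolted entry should always still be marked bolted. */
	if (!(be64_to_cpu(hptep->v) & HPTE_V_VALID))
		return 0;
	WARN_ON(!(be64_to_cpu(hptep->v) & HPTE_V_BOLTED));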



> +
> + /* Invalidate the TLB */
> + tlbie(vpn, psize, psize, ssize, 0);

The API also does not clear linear_map_hash_slots[] under DEBUG_PAGEALLOC

> + return 0;
> +}
> +
> +

Balbir Singh.


Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc

2017-04-11 Thread Michael Ellerman
Stephen Rothwell  writes:

> Hi Oliver,
>
> On Wed, 12 Apr 2017 03:42:30 +1000 Oliver O'Halloran  wrote:
>>
>> Flip the switch. Running around and screaming "IT'S ALIVE" is optional,
>> but recommended.
>> 
>> Signed-off-by: Oliver O'Halloran 
>> ---
>>  mm/Kconfig | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 43d000e44424..d696af58f97f 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -724,7 +724,7 @@ config ZONE_DEVICE
>>  depends on MEMORY_HOTPLUG
>>  depends on MEMORY_HOTREMOVE
>>  depends on SPARSEMEM_VMEMMAP
>> -depends on X86_64 #arch_add_memory() comprehends device memory
>> +depends on (X86_64 || PPC_BOOK3S_64)  #arch_add_memory() comprehends 
>> device memory
>>  
>
> That's fine, but at what point do we create
> CONFIG_ARCH_HAVE_ZONE_DEVICE, replace the "depends on
> " above with "depends on ARCH_HAVE_ZONE_DEVICE" and
> select that from the appropriate places?

You mean CONFIG_HAVE_ZONE_DEVICE :)

A patch to do that, and update x86, would be a good precursor to this
series. It could probably go in right now, and be in place for when this
series lands.
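
Something like the following, perhaps (sketch only; exact option name as
discussed above):

config ARCH_HAS_ZONE_DEVICE
	bool

config ZONE_DEVICE
	bool "Device memory (pmem, etc...) hotplug support"
	depends on MEMORY_HOTPLUG
	depends on MEMORY_HOTREMOVE
	depends on SPARSEMEM_VMEMMAP
	depends on ARCH_HAS_ZONE_DEVICE

with arch/x86/Kconfig (and later the powerpc Kconfig) doing a
"select ARCH_HAS_ZONE_DEVICE" rather than mm/Kconfig listing architectures.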

cheers


Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-11 Thread Michael Ellerman
Tyrel Datwyler  writes:
> On 04/11/2017 02:00 AM, Michael Ellerman wrote:
>> Tyrel Datwyler  writes:
>>> I started looking at it when Bharata submitted a patch trying to fix the
>>> issue for CPUs, but got side tracked by other things. I suspect that
>>> this underflow has actually been an issue for quite some time, and we
>>> are just now becoming aware of it thanks to the refcount_t patchset being
>>> merged.
>> 
>> Yes I agree. Which means it might be broken in existing distros.
>
> Definitely. I did some profiling last night, and I understand the
> hotplug case. It turns out to be as I suggested in the original thread
> about CPUs. When the devicetree code was reworked to move the tree out of
> proc and into sysfs, the sysfs detach code added an of_node_put to remove
> the original of_init reference. pSeries, being the sole original
> *dynamic* device tree user, had always issued an of_node_put in our
> dlpar-specific detach function to achieve that end. So, this should be a
> pretty straightforward, trivial fix.

Excellent, thanks.

> However, for the case where devices are present at boot, it appears we are
> leaking a lot of references, resulting in the device nodes never actually
> being released/freed after a dlpar remove. In the CPU case after boot I
> count 8 more references taken than in the hotplug case, and the corresponding
> of_node_put's are not called at dlpar remove time either. It will take
> some time to track them down, review them, and clean up.

Yes that is a perennial problem unfortunately which we've never come up
with a good solution for.

The (old) patch below might help track some of them down. I remember
having a script to process the output of the trace and find mismatches,
but I can't find it right now - but I'm sure you can hack up something
:)

cheers


diff --git a/arch/powerpc/include/asm/trace.h b/arch/powerpc/include/asm/trace.h
index 32e36b16773f..ad32365082a0 100644
--- a/arch/powerpc/include/asm/trace.h
+++ b/arch/powerpc/include/asm/trace.h
@@ -168,6 +168,44 @@ TRACE_EVENT(hash_fault,
  __entry->addr, __entry->access, __entry->trap)
 );
 
+TRACE_EVENT(of_node_get,
+
+   TP_PROTO(struct device_node *dn, int val),
+
+   TP_ARGS(dn, val),
+
+   TP_STRUCT__entry(
+   __field(struct device_node *, dn)
+   __field(int, val)
+   ),
+
+   TP_fast_assign(
+   __entry->dn = dn;
+   __entry->val = val;
+   ),
+
+   TP_printk("get %d -> %d %s", __entry->val - 1, __entry->val, 
__entry->dn->full_name)
+);
+
+TRACE_EVENT(of_node_put,
+
+   TP_PROTO(struct device_node *dn, int val),
+
+   TP_ARGS(dn, val),
+
+   TP_STRUCT__entry(
+   __field(struct device_node *, dn)
+   __field(int, val)
+   ),
+
+   TP_fast_assign(
+   __entry->dn = dn;
+   __entry->val = val;
+   ),
+
+   TP_printk("put %d -> %d %s", __entry->val + 1, __entry->val, 
__entry->dn->full_name)
+);
+
 #endif /* _TRACE_POWERPC_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index c647bd1b6903..f5c3d761f3cd 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -14,6 +14,8 @@
 
 #include "of_private.h"
 
+#include 
+
 /**
  * of_node_get() - Increment refcount of a node
  * @node:  Node to inc refcount, NULL is supported to simplify writing of
@@ -23,8 +25,12 @@
  */
 struct device_node *of_node_get(struct device_node *node)
 {
-   if (node)
+   if (node) {
kobject_get(&node->kobj);
+
+   trace_of_node_get(node, atomic_read(&node->kobj.kref.refcount));
+   }
+
return node;
 }
 EXPORT_SYMBOL(of_node_get);
@@ -36,8 +42,10 @@ EXPORT_SYMBOL(of_node_get);
  */
 void of_node_put(struct device_node *node)
 {
-   if (node)
+   if (node) {
kobject_put(&node->kobj);
+   trace_of_node_put(node, atomic_read(&node->kobj.kref.refcount));
+   }
 }
 EXPORT_SYMBOL(of_node_put);
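
Note: the patch above predates the kref conversion to refcount_t (which is
what produces the WARN in this thread), so on such kernels the two
atomic_read() calls would need to become roughly:

	trace_of_node_get(node, refcount_read(&node->kobj.kref.refcount));
	trace_of_node_put(node, refcount_read(&node->kobj.kref.refcount));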
 


Re: [PATCH V4 7/7] cxl: Add psl9 specific code

2017-04-11 Thread Michael Ellerman
Frederic Barrat  writes:

> Le 07/04/2017 à 16:11, Christophe Lombard a écrit :
>> The new Coherent Accelerator Interface Architecture, level 2, for the
>> IBM POWER9 brings new content and features:
>> - POWER9 Service Layer
>> - Registers
>> - Radix mode
>> - Process element entry
>> - Dedicated-Shared Process Programming Model
>> - Translation Fault Handling
>> - CAPP
>> - Memory Context ID
>> If a valid mm_struct is found the memory context id is used for each
>> transaction associated with the process handle. The PSL uses the
>> context ID to find the corresponding process element.
>>
>> Signed-off-by: Christophe Lombard 
>> ---
>
>
> I'm ok with the code. However checkpatch is complaining about a 
> tab/space error in native.c

I already fixed it up when I applied them (and a bunch of other things).

> If you have a quick respin, I also have a comment below about the 
> documentation.

So please send me an incremental patch to update the doco and I'll
squash it before merging the series.

cheers


Re: [PATCH V4 6/7] cxl: Isolate few psl8 specific calls

2017-04-11 Thread Michael Ellerman
Frederic Barrat  writes:

> Le 07/04/2017 à 16:11, Christophe Lombard a écrit :
>> Point out the specific Coherent Accelerator Interface Architecture,
>> level 1, registers.
>> Code and functions specific to PSL8 (CAIA1) must be framed.
>>
>> Signed-off-by: Christophe Lombard 
>> ---
>
> There are a few changes in native.c which are about splitting long 
> strings, but that's minor. And the rest looks ok.

It is minor, so I fixed it up when applying. But in future please don't
split long strings, as it makes them harder to grep for.

cheers


Re: [PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.

2017-04-11 Thread Aneesh Kumar K.V



On Wednesday 12 April 2017 05:49 AM, Stephen Rothwell wrote:

Hi Oliver,

On Wed, 12 Apr 2017 03:42:27 +1000 Oliver O'Halloran  wrote:


From: "Aneesh Kumar K.V" 

Add a _PAGE_DEVMAP bit for PTE and DAX PMD entires. PowerPC doesn't
currently support PUD faults so we haven't extended it to the PUD
level.

Cc: Aneesh Kumar K.V 
Signed-off-by: Oliver O'Halloran 


This needs Aneesh's Signed-off-by.



Signed-off-by: Aneesh Kumar K.V 

-aneesh



Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv

2017-04-11 Thread Rashmica Gupta



On 12/04/17 10:18, Stephen Rothwell wrote:

Hi Oliver,

On Wed, 12 Apr 2017 08:50:56 +1000 Anton Blanchard  wrote:

From: Rashmica Gupta 

Adds support for removing bolted (i.e. kernel linear mapping) mappings
on powernv. This is needed to support memory hot unplug operations
which are required for the teardown of DAX/PMEM devices.

Cc: Rashmica Gupta 
Cc: Anton Blanchard 
Signed-off-by: Oliver O'Halloran 
---
Could the original author of this add their S-o-b? I pulled it out of
Rashmica's memtrace patch, but I remember someone saying Anton wrote
it originally.

I did.

Signed-off-by: Anton Blanchard 

If you are going to claim that Rashmica authored this patch (and you do
with the From: line above), then you need her Signed-off-by as well.


Oliver, can you change the 'From' to a 'Reviewed By'?


Re: powerpc/crypto/crc32c-vpmsum: Fix missing preempt_disable()

2017-04-11 Thread Michael Ellerman
On Thu, 2017-04-06 at 13:34:38 UTC, Michael Ellerman wrote:
> In crc32c_vpmsum() we call enable_kernel_altivec() without first
> disabling preemption, which is not allowed:
> 
>   WARNING: CPU: 9 PID: 2949 at ../arch/powerpc/kernel/process.c:277 
> enable_kernel_altivec+0x100/0x120
>   Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio 
> libcrc32c vmx_crypto ...
>   CPU: 9 PID: 2949 Comm: docker Not tainted 
> 4.11.0-rc5-compiler_gcc-6.3.1-00033-g308ac7563944 #381
>   ...
>   NIP [c001e320] enable_kernel_altivec+0x100/0x120
>   LR [d3df0910] crc32c_vpmsum+0x108/0x150 [crc32c_vpmsum]
>   Call Trace:
> 0xc138fd09 (unreliable)
> crc32c_vpmsum+0x108/0x150 [crc32c_vpmsum]
> crc32c_vpmsum_update+0x3c/0x60 [crc32c_vpmsum]
> crypto_shash_update+0x88/0x1c0
> crc32c+0x64/0x90 [libcrc32c]
> dm_bm_checksum+0x48/0x80 [dm_persistent_data]
> sb_check+0x84/0x120 [dm_thin_pool]
> dm_bm_validate_buffer.isra.0+0xc0/0x1b0 [dm_persistent_data]
> dm_bm_read_lock+0x80/0xf0 [dm_persistent_data]
> __create_persistent_data_objects+0x16c/0x810 [dm_thin_pool]
> dm_pool_metadata_open+0xb0/0x1a0 [dm_thin_pool]
> pool_ctr+0x4cc/0xb60 [dm_thin_pool]
> dm_table_add_target+0x16c/0x3c0
> table_load+0x184/0x400
> ctl_ioctl+0x2f0/0x560
> dm_ctl_ioctl+0x38/0x50
> do_vfs_ioctl+0xd8/0x920
> SyS_ioctl+0x68/0xc0
> system_call+0x38/0xfc
> 
> It used to be sufficient just to call pagefault_disable(), because that
> also disabled preemption. But the two were decoupled in commit 8222dbe21e79
> ("sched/preempt, mm/fault: Decouple preemption from the page fault
> logic") in mid 2015.
> 
> So add the missing preempt_disable/enable(). We should also call
> disable_kernel_fp(), although it does nothing by default, there is a
> debug switch to make it active and all enables should be paired with
> disables.
> 
> Fixes: 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> Cc: sta...@vger.kernel.org # v4.8+
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/4749228f022893faf54a3dbc70796f

cheers
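
For reference, the resulting enable/disable pairing in the vpmsum glue code
looks roughly like this (sketch only, not the literal diff; the
__crc32c_vpmsum prototype and header choices are assumed from the module):

#include <linux/preempt.h>
#include <linux/uaccess.h>	/* pagefault_disable()/pagefault_enable() */
#include <asm/switch_to.h>	/* enable/disable_kernel_altivec() */

u32 __crc32c_vpmsum(u32 crc, unsigned char const *p, size_t len);

static u32 crc32c_vpmsum_fixed(u32 crc, unsigned char const *p, size_t len)
{
	/* enable_kernel_altivec() must not be preempted, and every
	 * enable should be paired with a disable. */
	preempt_disable();
	pagefault_disable();
	enable_kernel_altivec();
	crc = __crc32c_vpmsum(crc, p, len);
	disable_kernel_altivec();
	pagefault_enable();
	preempt_enable();
	return crc;
}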


Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags

2017-04-11 Thread Rashmica Gupta


On 31/03/17 12:37, Oliver O'Halloran wrote:

On Book3s we have two PTE flags used to mark cache-inhibited mappings:
_PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page
table dumper only looks at the generic _PAGE_NO_CACHE which is
defined to be _PAGE_TOLERANT. This patch modifies the dumper so
both flags are shown in the dump.

Cc: Rashmica Gupta 
Signed-off-by: Oliver O'Halloran 

Should we also add in _PAGE_SAO  that is in Book3s?



---
  arch/powerpc/mm/dump_linuxpagetables.c | 13 +
  1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index 49abaf4dc8e3..e7cbfd5a0940 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -154,11 +154,24 @@ static const struct flag_info flag_array[] = {
.clear  = " ",
}, {
  #endif
+#ifndef CONFIG_PPC_BOOK3S_64
.mask   = _PAGE_NO_CACHE,
.val= _PAGE_NO_CACHE,
.set= "no cache",
.clear  = "",
}, {
+#else
+   .mask   = _PAGE_NON_IDEMPOTENT,
+   .val= _PAGE_NON_IDEMPOTENT,
+   .set= "non-idempotent",
+   .clear  = "  ",
+   }, {
+   .mask   = _PAGE_TOLERANT,
+   .val= _PAGE_TOLERANT,
+   .set= "tolerant",
+   .clear  = "",
+   }, {
+#endif
  #ifdef CONFIG_PPC_BOOK3S_64
.mask   = H_PAGE_BUSY,
.val= H_PAGE_BUSY,




Re: [PATCH 2/2] powerpc/mm: add phys addr to linux page table dump

2017-04-11 Thread Rashmica Gupta



On 31/03/17 12:37, Oliver O'Halloran wrote:

The current page table dumper scans the linux page tables and coalesces
mappings with adjacent virtual addresses and similar PTE flags. This
behaviour is somewhat broken when you consider the IOREMAP space where
entirely unrelated mappings will appear to be contiguous.  This patch
modifies the range coalescing so that only ranges that are both physically
and virtually contiguous are combined. This patch also adds to the dump
output the physical address at the start of each range.

Cc: Rashmica Gupta 
Signed-off-by: Oliver O'Halloran 
---
  arch/powerpc/mm/dump_linuxpagetables.c | 18 --
  1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c 
b/arch/powerpc/mm/dump_linuxpagetables.c
index e7cbfd5a0940..85e6a45bd7ee 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -56,6 +56,8 @@ struct pg_state {
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
+   unsigned long start_pa;
+   unsigned long last_pa;
unsigned int level;
u64 current_flags;
  };
@@ -265,7 +267,9 @@ static void dump_addr(struct pg_state *st, unsigned long 
addr)
const char *unit = units;
unsigned long delta;
  
-	seq_printf(st->seq, "0x%016lx-0x%016lx   ", st->start_address, addr-1);

+   seq_printf(st->seq, "0x%016lx-0x%016lx ", st->start_address, addr-1);
+   seq_printf(st->seq, "%016lx ", st->start_pa);
+
delta = (addr - st->start_address) >> 10;
/* Work out what appropriate unit to use */
while (!(delta & 1023) && unit[1]) {
@@ -280,11 +284,15 @@ static void note_page(struct pg_state *st, unsigned long 
addr,
   unsigned int level, u64 val)
  {
u64 flag = val & pg_level[level].mask;
+   u64 pa = val & PTE_RPN_MASK;
+
/* At first no level is set */
if (!st->level) {
st->level = level;
st->current_flags = flag;
st->start_address = addr;
+   st->start_pa = pa;
+   st->last_pa = pa;
seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
/*
 * Dump the section of virtual memory when:
@@ -292,9 +300,11 @@ static void note_page(struct pg_state *st, unsigned long 
addr,
 *   - we change levels in the tree.
 *   - the address is in a different section of memory and is thus
 *   used for a different purpose, regardless of the flags.
+*   - the pa of this page is not adjacent to the last inspected page
 */
} else if (flag != st->current_flags || level != st->level ||
-  addr >= st->marker[1].start_address) {
+  addr >= st->marker[1].start_address ||
+  pa != st->last_pa + PAGE_SIZE) {
  
  		/* Check the PTE flags */

if (st->current_flags) {
@@ -318,8 +328,12 @@ static void note_page(struct pg_state *st, unsigned long 
addr,
seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
}
st->start_address = addr;
+   st->start_pa = pa;
+   st->last_pa = pa;
st->current_flags = flag;
st->level = level;
+   } else {
+   st->last_pa = pa;
}
  }
  

Makes sense to me!
Reviewed-by: Rashmica Gupta 


[PATCH] powerpc/mm/hash: don't opencode VMALLOC_INDEX

2017-04-11 Thread Aneesh Kumar K.V
Also remove wrong indentation to fix checkpatch.pl warning.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 98ae810b8c21..3d580ccf4b71 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -131,9 +131,9 @@ static void __slb_flush_and_rebolt(void)
 "slbmte%2,%3\n"
 "isync"
 :: "r"(mk_vsid_data(VMALLOC_START, mmu_kernel_ssize, 
vflags)),
-   "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 1)),
-   "r"(ksp_vsid_data),
-   "r"(ksp_esid_data)
+   "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 
VMALLOC_INDEX)),
+   "r"(ksp_vsid_data),
+   "r"(ksp_esid_data)
 : "memory");
 }
 
-- 
2.7.4
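
For context, VMALLOC_INDEX comes from the bolted SLB slot enum already
defined near the top of slb.c, roughly:

enum slb_index {
	LINEAR_INDEX	= 0, /* Kernel linear map  (0xc000000000000000) */
	VMALLOC_INDEX	= 1, /* Kernel virtual map (0xd000000000000000) */
	KSTACK_INDEX	= 2, /* Kernel stack map */
};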



Re: [RFC PATCH 6/7] powerpc/hugetlb: Add code to support to follow huge page directory entries

2017-04-11 Thread Anshuman Khandual
On 04/11/2017 03:55 PM, Michael Ellerman wrote:
> "Aneesh Kumar K.V"  writes:
> 
>> Add follow_huge_pd implementation for ppc64.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/mm/hugetlbpage.c | 42 
>> ++
>>  1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>> index 80f6d2ed551a..9d66d4f810aa 100644
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c
>> @@ -17,6 +17,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -618,6 +620,10 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>>  }
>>  
>>  /*
>> + * 64 bit book3s use generic follow_page_mask
>> + */
>> +#ifndef CONFIG_PPC_BOOK3S_64
> I think it's always easier to follow if you use:
> 
>   #ifdef x
>   ...
>   #else /* !x */
>   ...
>   #endif
> 
> ie. in this case put the Book3S 64 case first and the existing code in the
> #else.

Yeah, it was difficult to read at first glance.



Re: [PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more

2017-04-11 Thread Aneesh Kumar K.V
Oliver O'Halloran  writes:

> Depending on the flags of the PMD being zapped there may or may not be a
> deposited pgtable to be freed. In two of the three cases this is open
> coded while the third uses the zap_deposited_table() helper. This patch
> converts the other two to use the helper to clean things up a bit.
>
> Cc: "Aneesh Kumar K.V" 
> Cc: "Kirill A. Shutemov" 
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran 

Reviewed-by: Aneesh Kumar K.V 

> ---
> For reference:
>
> void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> {
> pgtable_t pgtable;
>
> pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> pte_free(mm, pgtable);
> atomic_long_dec(&mm->nr_ptes);
> }
> ---
>  mm/huge_memory.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b787c4cfda0e..aa01dd47cc65 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1615,8 +1615,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   if (is_huge_zero_pmd(orig_pmd))
>   tlb_remove_page_size(tlb, pmd_page(orig_pmd), 
> HPAGE_PMD_SIZE);
>   } else if (is_huge_zero_pmd(orig_pmd)) {
> - pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
> - atomic_long_dec(&tlb->mm->nr_ptes);
> + zap_deposited_table(tlb->mm, pmd);
>   spin_unlock(ptl);
>   tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
>   } else {
> @@ -1625,10 +1624,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>   if (PageAnon(page)) {
> - pgtable_t pgtable;
> - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> - pte_free(tlb->mm, pgtable);
> - atomic_long_dec(&tlb->mm->nr_ptes);
> + zap_deposited_table(tlb->mm, pmd);
>   add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>   } else {
>   if (arch_needs_pgtable_deposit())
> -- 
> 2.9.3



Re: [PATCH 2/9] mm/huge_memory: Deposit a pgtable for DAX PMD faults when required

2017-04-11 Thread Aneesh Kumar K.V
Oliver O'Halloran  writes:

> Although all architectures use a deposited page table for THP on anonymous
> VMAs, some architectures (s390 and powerpc) require the deposited storage
> even for file-backed VMAs due to quirks of their MMUs. This patch adds
> support for depositing a table in the DAX PMD fault-handling path for
> archs that require it. Other architectures should see no functional
> changes.
>
> Cc: "Aneesh Kumar K.V" 
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran 

Reviewed-by: Aneesh Kumar K.V 


> ---
>  mm/huge_memory.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aa01dd47cc65..a84909cf20d3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -715,7 +715,8 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  }
>
>  static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> - pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write)
> + pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
> + pgtable_t pgtable)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   pmd_t entry;
> @@ -729,6 +730,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   entry = pmd_mkyoung(pmd_mkdirty(entry));
>   entry = maybe_pmd_mkwrite(entry, vma);
>   }
> +
> + if (pgtable) {
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + atomic_long_inc(&mm->nr_ptes);
> + }
> +
>   set_pmd_at(mm, addr, pmd, entry);
>   update_mmu_cache_pmd(vma, addr, pmd);
>   spin_unlock(ptl);
> @@ -738,6 +745,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   pmd_t *pmd, pfn_t pfn, bool write)
>  {
>   pgprot_t pgprot = vma->vm_page_prot;
> + pgtable_t pgtable = NULL;
>   /*
>* If we had pmd_special, we could avoid all these restrictions,
>* but we need to be consistent with PTEs and architectures that
> @@ -752,9 +760,15 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (addr < vma->vm_start || addr >= vma->vm_end)
>   return VM_FAULT_SIGBUS;
>
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm, addr);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
>   track_pfn_insert(vma, &pgprot, pfn);
>
> - insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
> + insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write, pgtable);
>   return VM_FAULT_NOPAGE;
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
> @@ -1611,6 +1625,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>   tlb->fullmm);
>   tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>   if (vma_is_dax(vma)) {
> + if (arch_needs_pgtable_deposit())
> + zap_deposited_table(tlb->mm, pmd);
>   spin_unlock(ptl);
>   if (is_huge_zero_pmd(orig_pmd))
>   tlb_remove_page_size(tlb, pmd_page(orig_pmd), 
> HPAGE_PMD_SIZE);
> -- 
> 2.9.3



[PATCH] powerpc/syscalls/trace: Fix mmap in syscalls_trace

2017-04-11 Thread Balbir Singh
This patch uses SYSCALL_DEFINE6 for sys_mmap and sys_mmap2
so that the metadata associated with these syscalls is
visible to the syscall tracer. Without this, generic
syscalls (defined outside the arch) like munmap, etc. are
visible in available_events, but sys_enter_mmap and
sys_exit_mmap are not.

A side-effect of this change is that the return type has
changed from unsigned long to long.

Prior to these changes, we had, under /sys/kernel/tracing

cat available_events  | grep syscalls | grep map
syscalls:sys_exit_remap_file_pages
syscalls:sys_enter_remap_file_pages
syscalls:sys_exit_munmap
syscalls:sys_enter_munmap
syscalls:sys_exit_mremap
syscalls:sys_enter_mremap

After these changes we have mmap in available_events.

cat available_events  | grep syscalls | grep map
syscalls:sys_exit_mmap
syscalls:sys_enter_mmap
syscalls:sys_exit_remap_file_pages
syscalls:sys_enter_remap_file_pages
syscalls:sys_exit_munmap
syscalls:sys_enter_munmap
syscalls:sys_exit_mremap
syscalls:sys_enter_mremap

Sample trace:
 cat-3399  [001]    196.542410: sys_mmap(addr: 7fff922a, len: 
2, prot: 3, flags: 812, fd: 3, offset: 1b)
 cat-3399  [001]    196.542443: sys_mmap -> 0x7fff922a
 cat-3399  [001]    196.542668: sys_munmap(addr: 7fff922c, len: 
6d2c)
 cat-3399  [001]    196.542677: sys_munmap -> 0x0

Signed-off-by: Balbir Singh 
---

 Changelog:
   Removed RFC
   Fixed len from unsigned long to size_t
   Added some examples of use of mmap trace

 arch/powerpc/include/asm/syscalls.h |  4 ++--
 arch/powerpc/kernel/syscalls.c  | 16 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h 
b/arch/powerpc/include/asm/syscalls.h
index 23be8f1..16fab68 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -8,10 +8,10 @@
 
 struct rtas_args;
 
-asmlinkage unsigned long sys_mmap(unsigned long addr, size_t len,
+asmlinkage long sys_mmap(unsigned long addr, size_t len,
unsigned long prot, unsigned long flags,
unsigned long fd, off_t offset);
-asmlinkage unsigned long sys_mmap2(unsigned long addr, size_t len,
+asmlinkage long sys_mmap2(unsigned long addr, size_t len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff);
 asmlinkage long ppc64_personality(unsigned long personality);
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index de04c9f..a877bf8 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -42,11 +42,11 @@
 #include 
 #include 
 
-static inline unsigned long do_mmap2(unsigned long addr, size_t len,
+static inline long do_mmap2(unsigned long addr, size_t len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long off, int shift)
 {
-   unsigned long ret = -EINVAL;
+   long ret = -EINVAL;
 
if (!arch_validate_prot(prot))
goto out;
@@ -62,16 +62,16 @@ static inline unsigned long do_mmap2(unsigned long addr, 
size_t len,
return ret;
 }
 
-unsigned long sys_mmap2(unsigned long addr, size_t len,
-   unsigned long prot, unsigned long flags,
-   unsigned long fd, unsigned long pgoff)
+SYSCALL_DEFINE6(mmap2, unsigned long, addr, size_t, len,
+   unsigned long, prot, unsigned long, flags,
+   unsigned long, fd, unsigned long, pgoff)
 {
return do_mmap2(addr, len, prot, flags, fd, pgoff, PAGE_SHIFT-12);
 }
 
-unsigned long sys_mmap(unsigned long addr, size_t len,
-  unsigned long prot, unsigned long flags,
-  unsigned long fd, off_t offset)
+SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
+   unsigned long, prot, unsigned long, flags,
+   unsigned long, fd, off_t, offset)
 {
return do_mmap2(addr, len, prot, flags, fd, offset, PAGE_SHIFT);
 }
-- 
2.9.3



Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags

2017-04-11 Thread Michael Ellerman
Rashmica Gupta  writes:

> On 31/03/17 12:37, Oliver O'Halloran wrote:
>> On Book3s we have two PTE flags used to mark cache-inhibited mappings:
>> _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page
>> table dumper only looks at the generic _PAGE_NO_CACHE which is
>> defined to be _PAGE_TOLERANT. This patch modifies the dumper so
>> both flags are shown in the dump.
>>
>> Cc: Rashmica Gupta 
>> Signed-off-by: Oliver O'Halloran 

> Should we also add in _PAGE_SAO  that is in Book3s?

I don't think we ever expect to see it in the kernel page tables. But if
we did that would be "interesting".

I've forgotten what the code does with unknown bits; does it already
print them in some way?

If not we should either add that or add _PAGE_SAO and everything else
that could possibly ever be there.
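
If we do want it, the entry would look something like this (sketch only; it
assumes _PAGE_SAO shares the two cache-control PTE bits with
_PAGE_TOLERANT/_PAGE_NON_IDEMPOTENT, so the mask must cover both):

	}, {
		.mask	= _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT,
		.val	= _PAGE_SAO,
		.set	= "sao",
		.clear	= "   ",
	}, {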

cheers


Re: [PATCH] powerpc/mm/hash: don't opencode VMALLOC_INDEX

2017-04-11 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> powerpc/mm/hash: don't opencode VMALLOC_INDEX

OK.

> Also remove wrong indentation to fix checkpatch.pl warning.

No thanks :)

Or at least do it as a separate patch.

I'll fix it up this time.

cheers

> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 98ae810b8c21..3d580ccf4b71 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -131,9 +131,9 @@ static void __slb_flush_and_rebolt(void)
>"slbmte%2,%3\n"
>"isync"
>:: "r"(mk_vsid_data(VMALLOC_START, mmu_kernel_ssize, 
> vflags)),
> - "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 1)),
> - "r"(ksp_vsid_data),
> - "r"(ksp_esid_data)
> + "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 
> VMALLOC_INDEX)),
> + "r"(ksp_vsid_data),
> + "r"(ksp_esid_data)
>: "memory");
>  }
>  
> -- 
> 2.7.4