Re: [PATCH 1/2] powerpc/tracing: Trace TLBIE(L)
On Tue, 2017-04-11 at 15:23 +1000, Balbir Singh wrote:
> Just a quick patch to trace tlbie(l)'s. The idea being that it can be
> enabled when we suspect corruption or when we need to see if we are doing
> the right thing during flush. I think the format can be enhanced to
> make it nicer (expand the RB/RS/IS/L cases in more detail if we ever
> need that level of detail).

The subject is misleading: this is not [PATCH 1/2], it should read
[PATCH v2]. Sorry! I can resend this if required.

Balbir
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote:
> Perf already has support for disassembling the branch instruction
> and using the branch type for filtering. The patch just records
> the branch type in perf_branch_entry.
>
> Before recording, the patch converts the x86 branch classification
> to common branch classification.

This is still a completely inadequate changelog. I really will not
accept patches like this.

> Signed-off-by: Jin Yao
> ---
>  arch/x86/events/intel/lbr.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 52 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index 81b321a..6968c63 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -109,6 +109,9 @@ enum {
>  	X86_BR_ZERO_CALL	= 1 << 15, /* zero length call */
>  	X86_BR_CALL_STACK	= 1 << 16, /* call stack */
>  	X86_BR_IND_JMP		= 1 << 17, /* indirect jump */
> +
> +	X86_BR_TYPE_SAVE	= 1 << 18, /* indicate to save branch type */
> +
>  };
>
>  #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
> @@ -670,6 +673,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>
>  	if (br_type & PERF_SAMPLE_BRANCH_CALL)
>  		mask |= X86_BR_CALL | X86_BR_ZERO_CALL;
> +
> +	if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE)
> +		mask |= X86_BR_TYPE_SAVE;
> +
>  	/*
>  	 * stash actual user request into reg, it may
>  	 * be used by fixup code for some CPU
> @@ -923,6 +930,44 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
>  	return ret;
>  }
>
> +#define X86_BR_TYPE_MAP_MAX	16
> +
> +static int
> +common_branch_type(int type)
> +{
> +	int i, mask;
> +	const int branch_map[X86_BR_TYPE_MAP_MAX] = {
> +		PERF_BR_CALL,		/* X86_BR_CALL */
> +		PERF_BR_RET,		/* X86_BR_RET */
> +		PERF_BR_SYSCALL,	/* X86_BR_SYSCALL */
> +		PERF_BR_SYSRET,		/* X86_BR_SYSRET */
> +		PERF_BR_INT,		/* X86_BR_INT */
> +		PERF_BR_IRET,		/* X86_BR_IRET */
> +		PERF_BR_JCC,		/* X86_BR_JCC */
> +		PERF_BR_JMP,		/* X86_BR_JMP */
> +		PERF_BR_IRQ,		/* X86_BR_IRQ */
> +		PERF_BR_IND_CALL,	/* X86_BR_IND_CALL */
> +		PERF_BR_NONE,		/* X86_BR_ABORT */
> +		PERF_BR_NONE,		/* X86_BR_IN_TX */
> +		PERF_BR_NONE,		/* X86_BR_NO_TX */
> +		PERF_BR_CALL,		/* X86_BR_ZERO_CALL */
> +		PERF_BR_NONE,		/* X86_BR_CALL_STACK */
> +		PERF_BR_IND_JMP,	/* X86_BR_IND_JMP */
> +	};
> +
> +	type >>= 2; /* skip X86_BR_USER and X86_BR_KERNEL */
> +	mask = ~(~0 << 1);
> +
> +	for (i = 0; i < X86_BR_TYPE_MAP_MAX; i++) {
> +		if (type & mask)
> +			return branch_map[i];
> +
> +		type >>= 1;
> +	}
> +
> +	return PERF_BR_NONE;
> +}
> +
>  /*
>   * implement actual branch filter based on user demand.
>   * Hardware may not exactly satisfy that request, thus
> @@ -939,7 +984,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>  	bool compress = false;
>
>  	/* if sampling all branches, then nothing to filter */
> -	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
> +	if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
> +	    ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
>  		return;
>
>  	for (i = 0; i < cpuc->lbr_stack.nr; i++) {
> @@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>  			cpuc->lbr_entries[i].from = 0;
>  			compress = true;
>  		}
> +
> +		if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
> +			cpuc->lbr_entries[i].type = common_branch_type(type);
> +		else
> +			cpuc->lbr_entries[i].type = PERF_BR_NONE;
>  	}
>
>  	if (!compress)
> --
> 2.7.4
>
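[Editorial note on the lookup loop in common_branch_type() above: the two low bits of the classification are the USER/KERNEL privilege bits, and the lowest set class bit above them selects the map entry. A standalone userspace sketch of that loop — the enum names and values below are illustrative stand-ins, not the kernel's PERF_BR_*/X86_BR_* definitions:]

```c
#include <assert.h>

/* Illustrative stand-ins mirroring the patch's layout: bits 0-1 are the
 * USER/KERNEL privilege bits, class bits start at bit 2. The values are
 * made up for this sketch. */
enum { BR_NONE = 0, BR_CALL, BR_RET, BR_SYSCALL };

static const int branch_map[] = {
	BR_CALL,	/* class bit 2 */
	BR_RET,		/* class bit 3 */
	BR_SYSCALL,	/* class bit 4 */
};
#define MAP_MAX ((int)(sizeof(branch_map) / sizeof(branch_map[0])))

static int common_branch_type(int type)
{
	int i, mask;

	type >>= 2;		/* skip the USER and KERNEL privilege bits */
	mask = ~(~0 << 1);	/* == 1: test one class bit per iteration */

	for (i = 0; i < MAP_MAX; i++) {
		if (type & mask)	/* lowest set class bit wins */
			return branch_map[i];
		type >>= 1;
	}
	return BR_NONE;
}
```

Note how a type value with only the privilege bits set maps to BR_NONE, and extra privilege bits don't disturb the class lookup.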
[PATCH kernel v2] powerpc/iommu: Do not call PageTransHuge() on tail pages
The CMA pages migration code does not support compound pages at
the moment so it performs a few tests before proceeding to the actual
page migration.

One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
it is designed to be called on head pages only. Since we also test for
PageCompound(), and it covers both PageTail() and PageHead(), we can
simplify the check by leaving just PageCompound() and therefore avoid
a possible VM_BUG_ON_PAGE.

Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
Cc: sta...@vger.kernel.org # v4.9+
Signed-off-by: Alexey Kardashevskiy
Acked-by: Balbir Singh
---
Changes:
v2:
* instead of moving PageCompound() to the beginning, this just drops
PageHuge() and PageTransHuge()
---
 arch/powerpc/mm/mmu_context_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 497130c5c742..96f835cbf212 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -81,7 +81,7 @@ struct page *new_iommu_non_cma_page(struct page *page, unsigned long private,
 	gfp_t gfp_mask = GFP_USER;
 	struct page *new_page;
 
-	if (PageHuge(page) || PageTransHuge(page) || PageCompound(page))
+	if (PageCompound(page))
 		return NULL;
 
 	if (PageHighMem(page))
@@ -100,7 +100,7 @@ static int mm_iommu_move_page_from_cma(struct page *page)
 	LIST_HEAD(cma_migrate_pages);
 
 	/* Ignore huge pages for now */
-	if (PageHuge(page) || PageTransHuge(page) || PageCompound(page))
+	if (PageCompound(page))
 		return -EBUSY;
 
 	lru_add_drain();
-- 
2.11.0
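[Editorial note: the simplification above works because PageCompound() is true for both head and tail pages of a compound page, so a single PageCompound() test rejects everything the dropped head-only checks did, without ever being invoked on a bare tail page. A toy userspace model of that predicate relationship — mock struct and flags, not the kernel's struct page:]

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the page-flag predicates: a compound page is one head page
 * followed by tail pages; a base page is neither. */
struct mock_page { bool head; bool tail; };

static bool PageHead(const struct mock_page *p) { return p->head; }
static bool PageTail(const struct mock_page *p) { return p->tail; }

/* The key property: PageCompound() holds for head AND tail pages, so it
 * subsumes the head-page-only checks and is safe on tail pages. */
static bool PageCompound(const struct mock_page *p)
{
	return p->head || p->tail;
}
```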
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On 4/11/2017 3:52 PM, Peter Zijlstra wrote:
> This is still a completely inadequate changelog. I really will not
> accept patches like this.

Hi,

The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report:
Show branch type"). Does the changelog need to be added in each patch's
description?

That's fine, I can add and resend this patch.

Thanks
Jin Yao
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On Tue, Apr 11, 2017 at 09:52:19AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote:
> > @@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
> >  			cpuc->lbr_entries[i].from = 0;
> >  			compress = true;
> >  		}
> > +
> > +		if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
> > +			cpuc->lbr_entries[i].type = common_branch_type(type);
> > +		else
> > +			cpuc->lbr_entries[i].type = PERF_BR_NONE;
> >  	}

I was wondering WTH you did that else; because it should already be 0
(aka, BR_NONE). Then I found intel_pmu_lbr_read_32() is already broken,
and you just broke intel_pmu_lbr_read_64().

Arguably we should add a union on the last __u64 with a name for the
entire thing, but the below is the minimal fix.

---
Subject: perf,x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
From: Peter Zijlstra
Date: Tue Apr 11 10:10:28 CEST 2017

When the perf_branch_entry::{in_tx,abort,cycles} fields were added,
intel_pmu_lbr_read_32() wasn't updated to initialize them.

Fixes: 135c5612c460 ("perf/x86/intel: Support Haswell/v4 LBR format")
Signed-off-by: Peter Zijlstra (Intel)
---
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -507,6 +507,9 @@ static void intel_pmu_lbr_read_32(struct
 		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
 		cpuc->lbr_entries[i].mispred	= 0;
 		cpuc->lbr_entries[i].predicted	= 0;
+		cpuc->lbr_entries[i].in_tx	= 0;
+		cpuc->lbr_entries[i].abort	= 0;
+		cpuc->lbr_entries[i].cycles	= 0;
 		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
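[Editorial note: the failure mode Peter points at is generic to reused entry buffers — fields a reader path never assigns keep whatever a previous reader left there. A userspace sketch of that stale-field leak and its fix; the struct below is a simplified stand-in, not the real perf_branch_entry:]

```c
#include <assert.h>

/* Simplified stand-in for perf_branch_entry (not the real layout). */
struct lbr_entry {
	unsigned long from, to;
	unsigned mispred:1, in_tx:1, abort_:1;
	unsigned cycles:16;
};

/* Fills every field, like a reader for a newer LBR format can. */
static void read_full(struct lbr_entry *e)
{
	e->from = 1; e->to = 2;
	e->mispred = 1; e->in_tx = 1; e->abort_ = 1; e->cycles = 99;
}

/* Buggy legacy reader: never assigns in_tx/abort_/cycles, so a reused
 * entry leaks whatever the previous reader stored there. */
static void read_legacy_buggy(struct lbr_entry *e)
{
	e->from = 3; e->to = 4; e->mispred = 0;
}

/* Fixed legacy reader: explicitly zeroes the fields it has no data for,
 * which is what the patch above does for intel_pmu_lbr_read_32(). */
static void read_legacy_fixed(struct lbr_entry *e)
{
	e->from = 3; e->to = 4; e->mispred = 0;
	e->in_tx = 0; e->abort_ = 0; e->cycles = 0;
}
```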
Re: [PATCH kernel v2] powerpc/iommu: Do not call PageTransHuge() on tail pages
On Tue, 2017-04-11 at 17:54 +1000, Alexey Kardashevskiy wrote:
> The CMA pages migration code does not support compound pages at
> the moment so it performs a few tests before proceeding to actual page
> migration.
>
> One of the tests - PageTransHuge() - has VM_BUG_ON_PAGE(PageTail()) as
> it is designed to be called on head pages only. Since we also test for
> PageCompound(), and it covers PageTail() and PageHead(), we can
> simplify the check by leaving just PageCompound() and therefore avoid
> a possible VM_BUG_ON_PAGE.
>
> Fixes: 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate pinned pages out of CMA")
> Cc: sta...@vger.kernel.org # v4.9+
> Signed-off-by: Alexey Kardashevskiy
> Acked-by: Balbir Singh
> ---
>
> Changes:
> v2:
> * instead of moving PageCompound() to the beginning, this just drops
> PageHuge() and PageTransHuge()

Looks good! My Acked-by is already present.

Balbir Singh.
Re: [PATCH kernel v2] powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
On 27/03/17 19:27, Alexey Kardashevskiy wrote:
> pnv_pci_table_alloc() ignores possible failure from kzalloc_node(),
> this adds a check. There are 2 callers of pnv_pci_table_alloc(),
> one already checks for tbl!=NULL, this adds WARN_ON() to the other path
> which only happens during boot time in IODA1 and is not expected to fail.
>
> Signed-off-by: Alexey Kardashevskiy
> ---
> Changes:
> v2:
> * s/BUG_ON/WARN_ON/

Bad/good?

> ---
> arch/powerpc/platforms/powernv/pci-ioda.c | 3 +++
> arch/powerpc/platforms/powernv/pci.c      | 3 +++
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index e36738291c32..04ef03a5201b 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2128,6 +2128,9 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>
>  found:
>  	tbl = pnv_pci_table_alloc(phb->hose->node);
> +	if (WARN_ON(!tbl))
> +		return;
> +
>  	iommu_register_group(&pe->table_group, phb->hose->global_number,
>  			pe->pe_number);
>  	pnv_pci_link_table_and_group(phb->hose->node, 0, tbl, &pe->table_group);
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index eb835e977e33..9acdf6889c0d 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -766,6 +766,9 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>  	struct iommu_table *tbl;
>
>  	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
> +	if (!tbl)
> +		return NULL;
> +
>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
>
>  	return tbl;
> --

Alexey
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On Tue, Apr 11, 2017 at 04:11:21PM +0800, Jin, Yao wrote:
> On 4/11/2017 3:52 PM, Peter Zijlstra wrote:
> > This is still a completely inadequate changelog. I really will not
> > accept patches like this.
>
> Hi,
>
> The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report:
> Show branch type").
>
> Does the changelog need to be added in each patch's description?
>
> That's fine, I can add and resend this patch.

The cover letter is not retained; it is throw-away information. Each
patch should have a coherent changelog that explains why the patch was
done and explains anything non-trivial in the implementation.

Simply copy/pasting the same story into multiple patches is not right
either, for the simple fact that the patches are not the same. You did
a different thing, so you need a different story.
Re: [PATCH v2] ppc64/kprobe: Fix oops when kprobed on 'stdu' instruction
On Tue, 2017-04-11 at 10:38 +0530, Ravi Bangoria wrote:
> If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel
> OOPS:
>
> [ 1275.165932] Bad kernel stack pointer cd93c840 at c0009868
> [ 1275.166378] Oops: Bad kernel stack pointer, sig: 6 [#1]
> ...
> GPR00: c01fcd93cb30 cd93c840 c15c5e00 cd93c840
> ...
> [ 1275.178305] NIP [c0009868] resume_kernel+0x2c/0x58
> [ 1275.178594] LR [c0006208] program_check_common+0x108/0x180
>
> Basically, on 64 bit system, when user probes on 'stdu' instruction,
> kernel does not emulate actual store in emulate_step itself because it
> may corrupt exception frame. So kernel does actual store operation in
> exception return code i.e. resume_kernel().
>
> resume_kernel() loads the saved stack pointer from memory using lwz,
> effectively loading a corrupt (32bit) address, causing the kernel crash.
>
> Fix this by loading the 64bit value instead.
>
> Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of
> emulate_step()")
> Signed-off-by: Ravi Bangoria
> Reviewed-by: Naveen N. Rao
> ---

The patch looks correct to me from the description and code. I have not
validated that the write to GPR1(r1) via store of r8 to 0(r5) is indeed
correct. I would assume r8 should contain regs->gpr[r1] with the updated
ea that is written down to the GPR1(r1), which will be what we restore
when we return from the exception.

The conversion of lwz to ld indeed looks correct.

Balbir Singh.
Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.
Tyrel Datwyler writes:
> On 04/06/2017 09:04 PM, Michael Ellerman wrote:
>> Tyrel Datwyler writes:
>>> On 04/06/2017 03:27 AM, Sachin Sant wrote:
>>>> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on any
>>>> I/O adapter results in the following warning. This problem has been
>>>> in the code for some time now. I had first seen this in the -next tree.
>>>>
>>>> Have attached the dmesg log from the system. Let me know if any
>>>> additional information is required to help debug this problem.
>>>
>>> I remember you mentioning this when the issue was brought up for CPUs. I
>>> assume the case is the same here where the issue is only seen with
>>> adapters that were hot-added after boot (ie. hot-remove of adapter
>>> present at boot doesn't trip the warning)?
>>
>> So who's fixing this?
>
> I started looking at it when Bharata submitted a patch trying to fix the
> issue for CPUs, but got side tracked by other things. I suspect that
> this underflow has actually been an issue for quite some time, and we
> are just now becoming aware of it thanks to the refcount_t patchset being
> merged.

Yes I agree. Which means it might be broken in existing distros.

> I'll look into it again this week.

Thanks.

cheers
Re: EEH error in doing DMA with PEX 8619
I did another test:
- Call dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32)) in probe;
- Use DMA address or BUS address in DMA

But the EEH error remains.

All sources are based on PLX SDK 7.25.

Note: the sample test is in user space. It allocates memory and starts
DMA through the PLX API. The original sample NT_DmaTest does DMA between
BARx and host memory. I changed this to something simpler: allocate two
host memory buffers and try to do DMA between them.

Device probe
============
(Driver/Source.Plx8000_DMA/Driver.c)

int
AddDevice(
    DRIVER_OBJECT  *pDriverObject,
    struct pci_dev *pPciDev
    )
{
    U8                channel;
    int               status;
    U32               RegValue;
    DEVICE_OBJECT    *fdo;
    DEVICE_OBJECT    *pDevice;
    DEVICE_EXTENSION *pdx;

    // Allocate memory for the device object
    fdo = kmalloc( sizeof(DEVICE_OBJECT), GFP_KERNEL );

    if (fdo == NULL)
    {
        ErrorPrintf(("ERROR - memory allocation for device object failed\n"));
        return (-ENOMEM);
    }

    // Initialize device object
    RtlZeroMemory( fdo, sizeof(DEVICE_OBJECT) );

    fdo->DriverObject    = pDriverObject;       // Save parent driver object
    fdo->DeviceExtension = &(fdo->DeviceInfo);

    // Enable the device
    if (pci_enable_device( pPciDev ) == 0)
    {
        DebugPrintf(("Enabled PCI device\n"));
    }
    else
    {
        ErrorPrintf(("WARNING - PCI device enable failed\n"));
    }

#if 1
    /* New added: Set DMA mask as suggested on linuxppc */
    {
        int err;
        printk("Debug %s: dma_set_mask_and_coherent()...\n", __func__);
        err = dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32));
        if (err != 0) {
            printk("Error %s: Failed dma_set_mask_and_coherent(). ret = %d\n",
                   __func__, err);
            return err;
        }
    }
#endif

    // Enable bus mastering
    pci_set_master( pPciDev );

    //
    // Initialize the device extension
    //

    pdx = fdo->DeviceExtension;

    // Clear device extension
    RtlZeroMemory( pdx, sizeof(DEVICE_EXTENSION) );

    // Store parent device object
    pdx->pDeviceObject = fdo;

    // Save the OS-supplied PCI object
    pdx->pPciDevice = pPciDev;

    // Set initial device state
    pdx->State = PLX_STATE_STOPPED;

    // Set initial power state
    pdx->PowerState = PowerDeviceD0;

    // Store device location information
    pdx->Key.domain       = pci_domain_nr(pPciDev->bus);
    pdx->Key.bus          = pPciDev->bus->number;
    pdx->Key.slot         = PCI_SLOT(pPciDev->devfn);
    pdx->Key.function     = PCI_FUNC(pPciDev->devfn);
    pdx->Key.DeviceId     = pPciDev->device;
    pdx->Key.VendorId     = pPciDev->vendor;
    pdx->Key.SubVendorId  = pPciDev->subsystem_vendor;
    pdx->Key.SubDeviceId  = pPciDev->subsystem_device;
    pdx->Key.DeviceNumber = pDriverObject->DeviceCount;

    // Set API access mode
    pdx->Key.ApiMode = PLX_API_MODE_PCI;

    // Update Revision ID
    PLX_PCI_REG_READ( pdx, PCI_REG_CLASS_REV, &RegValue );
    pdx->Key.Revision = (U8)(RegValue & 0xFF);

    // Set device mode
    pdx->Key.DeviceMode = PLX_CHIP_MODE_STANDARD;

    // Set PLX-specific port type
    pdx->Key.PlxPortType = PLX_SPEC_PORT_DMA;

    // Build device name
    sprintf( pdx->LinkName, PLX_DRIVER_NAME "-%d", pDriverObject->DeviceCount );

    // Initialize work queue for ISR DPC queueing
    PLX_INIT_WORK(
        &(pdx->Task_DpcForIsr),
        DpcForIsr,               // DPC routine
        &(pdx->Task_DpcForIsr)   // DPC parameter (pre-2.6.20 only)
        );

    // Initialize ISR spinlock
    spin_lock_init( &(pdx->Lock_Isr) );

    // Initialize interrupt wait list
    INIT_LIST_HEAD( &(pdx->List_WaitObjects) );
    spin_lock_init( &(pdx->Lock_WaitObjectsList) );

    // Initialize physical memories list
    INIT_LIST_HEAD( &(pdx->List_PhysicalMem) );
    spin_lock_init( &(pdx->Lock_PhysicalMemList) );

    // Set the DMA mask
    if (Plx_dma_set_mask( pdx, PLX_DMA_BIT_MASK(48) ) == 0)
    {
        DebugPrintf(("Set DMA bit mask to 48-bits\n"));
    }
    else
    {
        DebugPrintf(("ERROR - Unable to set DMA mask to 48-bits, revert to 32-bit\n"));
        Plx_dma_set_mask( pdx, PLX_DMA_BIT_MASK(32) );
    }

    // Set buffer allocation mask
    if (Plx_dma_set_coherent_mask( pdx, PLX_DMA_BIT_MASK(32) ) != 0)
    {
        ErrorPrintf(("WARNING - Set DMA coherent mask failed\n"));
    }

    // Initialize DMA spinlocks
    for (channel = 0; channel < MAX_DMA_CHANNELS; channel++)
    {
        spin_lock_init( &(pdx->Lock_Dma[channel]) );
    }

    //
    // Add to driver device list
    //

    // Acquire Device List lock
    spin_lock( &(pDriverObject->Lock_DeviceList) );

    // Get device list head
    pDevice = pDriverObject->DeviceObject;

    if (pDevice == NULL)
    {
        // Add device as first in list
        pDriverObject->DeviceObject = fdo;
    }
    else
    {
        // Go to end of list
        while (pDe
Re: EEH error in doing DMA with PEX 8619
On Tue, 2017-04-11 at 02:26 -0700, IanJiang wrote:
> I did another test:
> - Call dma_set_mask_and_coherent(&pPciDev->dev, DMA_BIT_MASK(32)) in
> probe;
> - Use DMA address or BUS address in DMA
> But the EEH error remains.

We need to dig out the details of the EEH error. It will tell us more
precisely what is happening.

Note also that if your device can do 64-bit addresses, you should use a
64-bit mask, it will result in more efficient transfers.

However, we should first investigate the problem with 32-bit because it
seems to indicate that you might be DMA'ing beyond your buffer.

Another possibility would be if the requests from the PLX have a
different initiator ID on the bus than the device you are setting up
the DMA for.

> All sources are based on PLX SDK 7.25.
> Note: Sample test is in user space. It allocates memory and starts DMA
> through PLX API.
> The original sample NT_DmaTest does DMA between BARx and Host memory.
> I change this for simple: Allocate two host memory buffers and try to
> do DMA between them.
>
> Device probe
> ============
> (Driver/Source.Plx8000_DMA/Driver.c)
>
> [... AddDevice() source quoted in full in the previous message, snipped ...]
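[Editorial note on the 64-bit mask suggestion above: the usual Linux idiom is to ask for the widest mask the device supports and fall back to 32-bit if the platform rejects it. A userspace sketch of that fallback logic — the stub below stands in for dma_set_mask_and_coherent(), and the -5 return is a stand-in errno, not the kernel API:]

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK() macro. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Stub standing in for dma_set_mask_and_coherent(): this pretend
 * platform only accepts masks of 32 bits or narrower. */
static int stub_set_mask(uint64_t mask)
{
	return (mask > DMA_BIT_MASK(32)) ? -5 /* stand-in for -EIO */ : 0;
}

/* The common idiom: prefer the widest mask the device can address,
 * then fall back to 32-bit if the platform refuses. */
static int setup_dma_mask(uint64_t *chosen)
{
	if (stub_set_mask(DMA_BIT_MASK(64)) == 0) {
		*chosen = DMA_BIT_MASK(64);
		return 0;
	}
	if (stub_set_mask(DMA_BIT_MASK(32)) == 0) {
		*chosen = DMA_BIT_MASK(32);
		return 0;
	}
	return -5;
}
```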
Re: kselftest:lost_exception_test failure with 4.11.0-rc5
Madhavan Srinivasan writes:
> On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
>> Sachin Sant writes:
>>> I have run into a few instances where the lost_exception_test from
>>> powerpc kselftest fails with SIGABRT. Following o/p is against
>>> 4.11.0-rc5. The failure is intermittent.
>>
>> What hardware are you on?
>>
>> How long does it take to run when it fails? I assume ~2 minutes?
>
> Started a run on a power8 host (habanero) and it is more than 24hrs and
> hasn't failed yet. So this should be a guest/VM scenario then?

Aha good point. I never tested this much (at all?) on VMs because it was
about verifying a workaround for a hardware bug.

So does it happen on both KVM and PowerVM or just one or the other?

cheers
Re: [PATCH 1/5] powerpc/pseries: do not use msgsndp doorbells on POWER9 guests
Nicholas Piggin writes:
> POWER9 hypervisors will not necessarily run guest threads together on
> the same core at the same time, so msgsndp should not be used.

I'm worried this is encoding the behaviour of a particular hypervisor
in the guest kernel.

If we *can't* use msgsndp then the hypervisor better do something to
stop us from using it.

If it would be preferable for us not to use msgsndp, then the
hypervisor can tell us that somehow, eg. in the device tree.

?

cheers
Re: [RFC PATCH 6/7] powerpc/hugetlb: Add code to support to follow huge page directory entries
"Aneesh Kumar K.V" writes:
> Add follow_huge_pd implementation for ppc64.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
> arch/powerpc/mm/hugetlbpage.c | 42 ++
> 1 file changed, 42 insertions(+)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 80f6d2ed551a..9d66d4f810aa 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -17,6 +17,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>  #include
>  #include
>  #include
> @@ -618,6 +620,10 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
>  }
>
>  /*
> + * 64 bit book3s use generic follow_page_mask
> + */
> +#ifndef CONFIG_PPC_BOOK3S_64

I think it's always easier to follow if you use:

  #ifdef x
  ...
  #else /* !x */
  ...
  #endif

ie. in this case put the Book3S 64 case first and the existing code in
the #else.

cheers
Re: [PATCH 2/2] powerpc/book3s: mce: Use add_taint_no_warn() in machine_check_early().
Mahesh J Salgaonkar writes:
> From: Mahesh Salgaonkar
>
> machine_check_early() gets called in real mode. The very first time
> add_taint() is called, it prints a warning which ends up making an opal
> call (that uses the OPAL_CALL wrapper) to write it to the console. If we
> get a very first machine check while we are in opal we are doomed.
> OPAL_CALL overwrites the PACASAVEDMSR in r13 and in this case, when we
> are done with MCE handling, the original opal call will use this new MSR
> on its way back to opal_return. This usually leads to unexpected
> behaviour or a kernel panic. Instead use add_taint_no_warn(), which does
> not call printk.
>
> This is broken with the current FW level. We got lucky so far in not
> getting the very first MCE hit while in OPAL, but it is easily
> reproducible on Mambo. This should go to stable as well, along with
> patch 1/2.

This is not a good way to fix a bug that needs to go back to stable.

Changing generic code means I need to sync up with the right maintainer,
get acks, etc. And then convince people that it should go to stable also.

So can you please fix this a different way for stable?

Can we just do the tainting later, once we're in virtual mode?

cheers

> Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors.")
> Signed-off-by: Mahesh Salgaonkar
> ---
> arch/powerpc/kernel/traps.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 62b587f..4a048dc 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -306,7 +306,7 @@ long machine_check_early(struct pt_regs *regs)
>
>  	__this_cpu_inc(irq_stat.mce_exceptions);
>
> -	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +	add_taint_no_warn(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
>
>  	/*
>  	 * See if platform is capable of handling machine check. (e.g. PowerNV
Re: [PATCH v4] cxl: Force context lock during EEH flow
Frederic Barrat writes:
> Le 05/04/2017 à 13:35, Vaibhav Jain a écrit :
>> During an eeh event when the cxl card is fenced and the card sysfs attr
>> perst_reloads_same_image is set, the following warning message is seen
>> in the kernel logs:
>>
>> [ 60.622727] Adapter context unlocked with 0 active contexts
>> [ 60.622762] [ cut here ]
>> [ 60.622771] WARNING: CPU: 12 PID: 627 at
>> ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
>>
>> Even though this warning is harmless, it clutters the kernel log
>> during an eeh event. This warning is triggered because the EEH callback
>> cxl_pci_error_detected doesn't obtain a context-lock before forcibly
>> detaching all active contexts, so when the context-lock is released
>> during the call to cxl_configure_adapter from cxl_pci_slot_reset, a
>> warning in cxl_adapter_context_unlock is triggered.
>>
>> To fix this warning, we acquire the adapter context-lock via
>> cxl_adapter_context_lock() in the eeh callback
>> cxl_pci_error_detected() once all the virtual AFU PHBs are notified
>> and their contexts detached. The context-lock is released in
>> cxl_pci_slot_reset() after the adapter is successfully reconfigured
>> and before we call the slot_reset callback on slice attached
>> device-drivers.
>>
>> Cc: sta...@vger.kernel.org
>> Fixes: 70b565bbdb91 ("cxl: Prevent adapter reset if an active context exists")
>> Reported-by: Andrew Donnellan
>> Signed-off-by: Vaibhav Jain
>> ---
>
> Pending test result from cxl-flash:
> Acked-by: Frederic Barrat

Still pending ... ?

cheers
[BUG][next-20170410][PPC] WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500
Hi,

Warning while booting next-20170410 on PowerPC. We did not see warnings
with next-20170407. In the meantime I will update with the bad commit
once my automated bisect run finishes.

Machine type: Power7 LPAR
Kernel : 4.11.0-rc6-next-20170410
Config : file attached.

IPv6: ADDRCONF(NETDEV_UP): net0: link is not ready
Starting Authorization Manager...
Starting WPA Supplicant daemon...
[ cut here ]
WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500
Modules linked in: sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) sunrpc(E) grace(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E)
CPU: 22 PID: 0 Comm: swapper/22 Tainted: G E 4.11.0-rc6-next-20170410-autotest #1
task: c009f82a0400 task.stack: c009f8324000
NIP: c0512a08 LR: c05125ec CTR: c0518270
REGS: c013fff23740 TRAP: 0700 Tainted: G E (4.11.0-rc6-next-20170410-autotest)
MSR: 8282b032 CR: 48042048 XER: 0001
CFAR: c0512784 SOFTE: 1
GPR00: c05125ec c013fff239c0 c1396300 c000fd55a000
GPR04: 0001 00b0
GPR08: 00067887 c000fd55a000 d85f7dc0
GPR12: 88044044 ce97dc00 c009f8327f90 00200042
GPR16: 9239 c013fff2 c0de4100
GPR20: c13d3b00 c0de4100 0005
GPR24: 2ee0 c1788018
GPR28: c009f212e400 c000fd55a000
NIP [c0512a08] .blk_update_request+0x4f8/0x500
LR [c05125ec] .blk_update_request+0xdc/0x500
Call Trace:
[c013fff239c0] [c05125ec] .blk_update_request+0xdc/0x500 (unreliable)
[c013fff23a60] [c06b462c] .scsi_end_request+0x4c/0x240
[c013fff23b10] [c06b84a4] .scsi_io_completion+0x1d4/0x6c0
[c013fff23be0] [c06acbd0] .scsi_finish_command+0x100/0x1b0
[c013fff23c70] [c06b78b8] .scsi_softirq_done+0x188/0x1e0
[c013fff23d00] [c051d8b4] .blk_done_softirq+0xc4/0xf0
[c013fff23d90] [c00dd758] .__do_softirq+0x158/0x3a0
[c013fff23e90] [c00dde08] .irq_exit+0x1a8/0x1c0
[c013fff23f10] [c0014f84] .__do_irq+0x94/0x1f0
[c013fff23f90] [c0026d1c] .call_do_irq+0x14/0x24
[c009f83277f0] [c001516c] .do_IRQ+0x8c/0x100
[c009f8327890] [c0008bf4] hardware_interrupt_common+0x114/0x120
--- interrupt: 501 at .arch_local_irq_restore+0x74/0x90
LR = .arch_local_irq_restore+0x74/0x90
[c009f8327b80] [0002] 0x2 (unreliable)
[c009f8327bf0] [c07c08a8] .dedicated_cede_loop+0xc8/0x150
[c009f8327c70] [c07be280] .cpuidle_enter_state+0xb0/0x380
[c009f8327d20] [c012fd5c] .call_cpuidle+0x3c/0x70
[c009f8327d90] [c01300f0] .do_idle+0x280/0x2e0
[c009f8327e50] [c0130308] .cpu_startup_entry+0x28/0x40
[c009f8327ed0] [c0042364] .start_secondary+0x304/0x350
[c009f8327f90] [c000aa6c] start_secondary_prolog+0x10/0x14
Instruction dump:
3f82ff8e 3b9cd308 4b50 3f82ff8e 3b9cd320 4b44 61290040 b13f0018
4bfffbe8 3cc2ff89 38c64258 4b60 <0fe0> 4bfffd7c 7c0802a6 fba1ffe8
---[ end trace 1fdfef416a071a8e ]---
EXT4-fs (sda3): Delayed block allocation failed for inode 11011452 at logical offset 0 with max blocks 5 with error 121
EXT4-fs (sda3): This should not happen!! Data will be lost
Starting Network Manager Script Dispatcher Service...

--
Regards,
Abdul Haleem
IBM Linux Technology Centre

#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.10.0-rc5 Kernel Configuration
#
CONFIG_PPC64=y
#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
CONFIG_GENERIC_CPU=y
# CONFIG_CELL_CPU is not set
# CONFIG_POWER4_CPU is not set
# CONFIG_POWER5_CPU is not set
# CONFIG_POWER6_CPU is not set
# CONFIG_POWER7_CPU is not set
# CONFIG_POWER8_CPU is not set
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
CONFIG_PPC_ICSWX=y
# CONFIG_PPC_ICSWX_PID is not set
# CONFIG_PPC_ICSWX_USE_SIGILL is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2048
CONFIG_PPC_DOORBELL=y
CONFIG_VDSO32=y
CONFIG_CPU_BIG_ENDIAN=y
# CONFIG_CPU_LITTLE_ENDIAN is not set
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SE
Re: [PATCH v2] ppc64/kprobe: Fix oops when kprobed on 'stdu' instruction
Thanks Balbir for the review, On Tuesday 11 April 2017 02:25 PM, Balbir Singh wrote: > On Tue, 2017-04-11 at 10:38 +0530, Ravi Bangoria wrote: >> If we set a kprobe on a 'stdu' instruction on powerpc64, we see a kernel >> OOPS: >> >> [ 1275.165932] Bad kernel stack pointer cd93c840 at c0009868 >> [ 1275.166378] Oops: Bad kernel stack pointer, sig: 6 [#1] >> ... >> GPR00: c01fcd93cb30 cd93c840 c15c5e00 cd93c840 >> ... >> [ 1275.178305] NIP [c0009868] resume_kernel+0x2c/0x58 >> [ 1275.178594] LR [c0006208] program_check_common+0x108/0x180 >> >> Basically, on a 64-bit system, when a user probes on a 'stdu' instruction, >> the kernel does not emulate the actual store in emulate_step itself because it >> may corrupt the exception frame. So the kernel does the actual store operation in >> the exception return code, i.e. resume_kernel(). >> >> resume_kernel() loads the saved stack pointer from memory using lwz, >> effectively loading a corrupt (32bit) address, causing the kernel crash. >> >> Fix this by loading the 64bit value instead. >> >> Fixes: be96f63375a1 ("powerpc: Split out instruction analysis part of >> emulate_step()") >> Signed-off-by: Ravi Bangoria >> Reviewed-by: Naveen N. Rao >> --- > The patch looks correct to me from the description and code. I have not > validated that the write to GPR1(r1) via store of r8 to 0(r5) is indeed > correct. > I would assume r8 should contain regs->gpr[r1] with the updated ea that > is written down to the GPR1(r1) which will be what we restore when we return > from the exception. emulate_step() updates regs->gpr[r1] with the new value. So, regs->gpr[r1] and GPR(r1) are the same at resume_kernel. At resume_kernel, r1 points to the exception frame. The address of the frame preceding the exception frame gets loaded into r8 with: addi r8,r1,INT_FRAME_SIZE Let me know if you need more details. Ravi
Re: [PATCH v4] cxl: Force context lock during EEH flow
On 11/04/2017 at 12:40, Michael Ellerman wrote: Frederic Barrat writes: On 05/04/2017 at 13:35, Vaibhav Jain wrote: During an EEH event, when the cxl card is fenced and the card sysfs attr perst_reloads_same_image is set, the following warning message is seen in the kernel logs: [ 60.622727] Adapter context unlocked with 0 active contexts [ 60.622762] [ cut here ] [ 60.622771] WARNING: CPU: 12 PID: 627 at ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl] Even though this warning is harmless, it clutters the kernel log during an EEH event. This warning is triggered because the EEH callback cxl_pci_error_detected doesn't obtain a context-lock before forcibly detaching all active contexts; when the context-lock is then released during the call to cxl_configure_adapter from cxl_pci_slot_reset, the warning in cxl_adapter_context_unlock is triggered. To fix this warning, we acquire the adapter context-lock via cxl_adapter_context_lock() in the EEH callback cxl_pci_error_detected() once all the virtual AFU PHBs are notified and their contexts detached. The context-lock is released in cxl_pci_slot_reset() after the adapter is successfully reconfigured and before we call the slot_reset callback on slice-attached device-drivers. Cc: sta...@vger.kernel.org Fixes: 70b565bbdb91 ("cxl: Prevent adapter reset if an active context exists") Reported-by: Andrew Donnellan Signed-off-by: Vaibhav Jain --- Pending test result from cxl-flash: Acked-by: Frederic Barrat Still pending ... ? Yes, still waiting. It was mentioned in a call with the cxlflash team yesterday. Fred cheers
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On 4/11/2017 4:35 PM, Peter Zijlstra wrote: On Tue, Apr 11, 2017 at 04:11:21PM +0800, Jin, Yao wrote: On 4/11/2017 3:52 PM, Peter Zijlstra wrote: This is still a completely inadequate changelog. I really will not accept patches like this. Hi, The changelog is added in the cover-letter ("[PATCH v3 0/5] perf report: Show branch type"). Does the changelog need to be added in each patch's description? That's fine, I can add and resend this patch. The cover letter is not retained; it is throw-away information. Each patch should have a coherent changelog that explains why the patch was done and explains non-trivial things in the implementation. Simply copy/pasting the same story in multiple patches is not right either, for the simple fact that the patches are not the same. You did a different thing, so you need a different story. Thanks so much for the suggestion! I accept it and will change my patch descriptions. Rather than a full changelog, I will add a section to each patch description describing the major changes from the previous version. Thanks Jin Yao
Re: [PATCH v3 2/5] perf/x86/intel: Record branch type
On 4/11/2017 4:18 PM, Peter Zijlstra wrote: On Tue, Apr 11, 2017 at 09:52:19AM +0200, Peter Zijlstra wrote: On Tue, Apr 11, 2017 at 06:56:30PM +0800, Jin Yao wrote: @@ -960,6 +1006,11 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].from = 0; compress = true; } + + if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE) + cpuc->lbr_entries[i].type = common_branch_type(type); + else + cpuc->lbr_entries[i].type = PERF_BR_NONE; } I was wondering WTH you did that else; because it should already be 0 (aka, BR_NONE). Yes. I will remove the else code. Thanks! Then I found intel_pmu_lbr_read_32() is already broken, and you just broke intel_pmu_lbr_read_64(). Arguably we should add a union on the last __u64 with a name for the entire thing, but the below is the minimal fix. --- Subject: perf,x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32() From: Peter Zijlstra Date: Tue Apr 11 10:10:28 CEST 2017 When the perf_branch_entry::{in_tx,abort,cycles} fields were added, intel_pmu_lbr_read_32() wasn't updated to initialize them. Fixes: 135c5612c460 ("perf/x86/intel: Support Haswell/v4 LBR format") Signed-off-by: Peter Zijlstra (Intel) --- --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -507,6 +507,9 @@ static void intel_pmu_lbr_read_32(struct cpuc->lbr_entries[i].to = msr_lastbranch.to; cpuc->lbr_entries[i].mispred = 0; cpuc->lbr_entries[i].predicted = 0; + cpuc->lbr_entries[i].in_tx = 0; + cpuc->lbr_entries[i].abort = 0; + cpuc->lbr_entries[i].cycles = 0; cpuc->lbr_entries[i].reserved= 0; } cpuc->lbr_stack.nr = i; I will add cpuc->lbr_entries[i].type = 0 in my patch.
Re: [PATCH 1/5] powerpc/pseries: do not use msgsndp doorbells on POWER9 guests
cc'ing Paul On Tue, 11 Apr 2017 20:10:17 +1000 Michael Ellerman wrote: > Nicholas Piggin writes: > > > POWER9 hypervisors will not necessarily run guest threads together on > > the same core at the same time, so msgsndp should not be used. > > I'm worried this is encoding the behaviour of a particular hypervisor in > the guest kernel. Yeah, it's not ideal. > If we *can't* use msgsndp then the hypervisor better do something to > stop us from using it. A POWER9 hypervisor has an HFSCR bit for this and should clear it if it does not gang threads like POWER8 does. The guest still needs to know not to use it though... > If it would be preferable for us not to use msgsndp, then the hypervisor > can tell us that somehow, eg. in the device tree. I don't know that we have a really good way to do that other than guests clearing the doorbell feature for POWER9. Does the hypervisor set any relevant DT we can use today that says virtual sibling != physical sibling? If not, then we'll just have to clear it from all POWER9 guests until we get a DT property from phyp. Thanks, Nick
Re: [PATCH v5 13/15] livepatch: change to a per-task consistency model
On Mon 2017-02-13 19:42:40, Josh Poimboeuf wrote: > Change livepatch to use a basic per-task consistency model. This is the > foundation which will eventually enable us to patch those ~10% of > security patches which change function or data semantics. This is the > biggest remaining piece needed to make livepatch more generally useful. > > > Signed-off-by: Josh Poimboeuf Just for record, this last version looks fine to me. I do not see problems any longer. Everything looks consistent now ;-) It is a great work. Feel free to use: Reviewed-by: Petr Mladek Thanks a lot for patience. Best Regards, Petr
[PATCH] powerpc/mm: Update mm context addr limit correctly.
We added the addr < TASK_SIZE check to avoid updating addr_limit unnecessarily and also to avoid calling slice_flush_segments on all the cpus. This had the side effect of having different behaviour when using an addr value above TASK_SIZE before updating addr_limit and after updating addr_limit, as shown by the output below: requesting with hint 0x0 Addr returned 0x7fff893a requesting with hint 0x Addr returned 0x7fff891b <= 1st return requesting with hint 0x1 Addr returned 0x1 requesting with hint 0x Addr returned 0x18941< second return After fix: requesting with hint 0x0 Addr returned 0x7fff8bc0 requesting with hint 0x Addr returned 0x18bc8< 1st return requesting with hint 0x1 Addr returned 0x1 requesting with hint 0x Addr returned 0x18bc6< second return Fixes: 1b49451ebd3e9 ("powerpc/mm: Enable mappings above 128TB") Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/mmap.c | 6 -- arch/powerpc/mm/slice.c | 3 ++- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c index b2111baa0da6..355b6fe8a1e6 100644 --- a/arch/powerpc/mm/mmap.c +++ b/arch/powerpc/mm/mmap.c @@ -97,7 +97,8 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info; - if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE)) + if (unlikely(addr > mm->context.addr_limit && +mm->context.addr_limit != TASK_SIZE)) mm->context.addr_limit = TASK_SIZE; if (len > mm->context.addr_limit - mmap_min_addr) @@ -139,7 +140,8 @@ radix__arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr = addr0; struct vm_unmapped_area_info info; - if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE)) + if (unlikely(addr > mm->context.addr_limit && +mm->context.addr_limit != TASK_SIZE)) mm->context.addr_limit = TASK_SIZE; /* requested length too big for entire address space */ diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index
251b6bae7023..2d2d9760d057 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -419,7 +419,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* * Check if we need to expland slice area. */ - if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE)) { + if (unlikely(addr > mm->context.addr_limit && +mm->context.addr_limit != TASK_SIZE)) { mm->context.addr_limit = TASK_SIZE; on_each_cpu(slice_flush_segments, mm, 1); } -- 2.7.4
[PATCH v4 0/5] perf report: Show branch type
v4: --- 1. Describe the major changes in each patch description. Thanks to Peter Zijlstra for the reminder. 2. Initialize branch type to 0 in intel_pmu_lbr_read_32 and intel_pmu_lbr_read_64. Remove the invalid else code in intel_pmu_lbr_filter. v3: --- 1. Move the JCC forward/backward and cross page computing from kernel to userspace. 2. Use a lookup table to replace the original switch/case processing. Changed: perf/core: Define the common branch type classification perf/x86/intel: Record branch type perf report: Show branch type statistics for stdio mode perf report: Show branch type in callchain entry Not changed: perf record: Create a new option save_type in --branch-filter v2: --- 1. Use 4 bits in perf_branch_entry to record branch type. 2. Pull out some common branch types from FAR_BRANCH. Now the branch types defined in perf_event.h: PERF_BR_NONE : unknown PERF_BR_JCC_FWD : conditional forward jump PERF_BR_JCC_BWD : conditional backward jump PERF_BR_JMP : jump PERF_BR_IND_JMP : indirect jump PERF_BR_CALL : call PERF_BR_IND_CALL : indirect call PERF_BR_RET : return PERF_BR_SYSCALL : syscall PERF_BR_SYSRET: syscall return PERF_BR_IRQ : hw interrupt/trap/fault PERF_BR_INT : sw interrupt PERF_BR_IRET : return from interrupt PERF_BR_FAR_BRANCH: others not generic far branch type 3. Use 2 bits in perf_branch_entry for "cross" metrics that check whether a branch crosses a 4K or 2M area. It is an approximate computation for checking whether the branch crosses a 4K page or a 2MB page.
For example: perf record -g --branch-filter any,save_type perf report --stdio JCC forward: 27.7% JCC backward: 9.8% JMP: 0.0% IND_JMP: 6.5% CALL: 26.6% IND_CALL: 0.0% RET: 29.3% IRET: 0.0% CROSS_4K: 0.0% CROSS_2M: 14.3% perf report --branch-history --stdio --no-children -23.60%--main div.c:42 (RET cycles:2) compute_flag div.c:28 (RET cycles:2) compute_flag div.c:27 (RET CROSS_2M cycles:1) rand rand.c:28 (RET CROSS_2M cycles:1) rand rand.c:28 (RET cycles:1) __random random.c:298 (RET cycles:1) __random random.c:297 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (RET cycles:9) Changed: perf/core: Define the common branch type classification perf/x86/intel: Record branch type perf report: Show branch type statistics for stdio mode perf report: Show branch type in callchain entry Not changed: perf record: Create a new option save_type in --branch-filter v1: --- It is often useful to know the branch types while analyzing branch data. For example, a call is very different from a conditional branch. Currently we have to look it up in binary while the binary may later not be available and even the binary is available but user has to take some time. It is very useful for user to check it directly in perf report. Perf already has support for disassembling the branch instruction to get the branch type. The patch series records the branch type and show the branch type with other LBR information in callchain entry via perf report. The patch series also adds the branch type summary at the end of perf report --stdio. To keep consistent on kernel and userspace and make the classification more common, the patch adds the common branch type classification in perf_event.h. 
The common branch types are: JCC forward: Conditional forward jump JCC backward: Conditional backward jump JMP: Jump imm IND_JMP: Jump reg/mem CALL: Call imm IND_CALL: Call reg/mem RET: Ret FAR_BRANCH: SYSCALL/SYSRET, IRQ, IRET, TSX Abort An example: 1. Record branch type (new option "save_type") perf record -g --branch-filter any,save_type 2. Show the branch type statistics at the end of perf report --stdio perf report --stdio JCC forward: 34.0% JCC backward: 3.6% JMP: 0.0% IND_JMP: 6.5% CALL: 26.6% IND_CALL: 0.0% RET: 29.3% FAR_BRANCH: 0.0% 3. Show branch type in callchain entry perf report --branch-history --stdio --no-children --23.91%--main div.c:42 (RET cycles:2) compute_flag div.c:28 (RET cycles:2) compute_flag div.c:27 (RET cycles:1) rand rand.c:28 (RET cycles:1) rand rand.c:28 (RET cycles:1) __random random.c:298 (RET cycles:1) __random random.c:297 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (JCC forward cycles:1) __random random.c:295 (RET cycles:9) Jin Yao (5
[PATCH v4 1/5] perf/core: Define the common branch type classification
It is often useful to know the branch types while analyzing branch data. For example, a call is very different from a conditional branch. Currently we have to look the type up in the binary, but the binary may no longer be available later, and even when it is available the lookup takes time. It is very useful for the user to check it directly in perf report. Perf already has support for disassembling the branch instruction to get the x86 branch type. To keep the kernel and userspace consistent and make the classification more generic, the patch adds a common branch type classification in perf_event.h. PERF_BR_NONE : unknown PERF_BR_JCC : conditional jump PERF_BR_JMP : jump PERF_BR_IND_JMP : indirect jump PERF_BR_CALL : call PERF_BR_IND_CALL : indirect call PERF_BR_RET : return PERF_BR_SYSCALL : syscall PERF_BR_SYSRET: syscall return PERF_BR_IRQ : hw interrupt/trap/fault PERF_BR_INT : sw interrupt PERF_BR_IRET : return from interrupt PERF_BR_FAR_BRANCH: not generic far branch type The patch also adds a new 4-bit field, type, in perf_branch_entry to record the branch type. Since disassembling the branch instruction incurs some overhead, a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate whether the branch instruction should be disassembled and the branch type recorded. Compared to the previous version, the major changes are: 1. Remove PERF_BR_JCC_FWD/PERF_BR_JCC_BWD; they will be computed later in userspace. 2. Remove the "cross" field in perf_branch_entry. The cross-page computation will be done later in userspace.
Signed-off-by: Jin Yao --- include/uapi/linux/perf_event.h | 29 - tools/include/uapi/linux/perf_event.h | 29 - 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index d09a9cd..69af012 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift { PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT = 14, /* no flags */ PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT = 15, /* no cycles */ + PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */ + PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */ }; @@ -198,9 +200,32 @@ enum perf_branch_sample_type { PERF_SAMPLE_BRANCH_NO_FLAGS = 1U << PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT, PERF_SAMPLE_BRANCH_NO_CYCLES= 1U << PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT, + PERF_SAMPLE_BRANCH_TYPE_SAVE= + 1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT, + PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT, }; +/* + * Common flow change classification + */ +enum { + PERF_BR_NONE= 0,/* unknown */ + PERF_BR_JCC = 1,/* conditional jump */ + PERF_BR_JMP = 2,/* jump */ + PERF_BR_IND_JMP = 3,/* indirect jump */ + PERF_BR_CALL= 4,/* call */ + PERF_BR_IND_CALL= 5,/* indirect call */ + PERF_BR_RET = 6,/* return */ + PERF_BR_SYSCALL = 7,/* syscall */ + PERF_BR_SYSRET = 8,/* syscall return */ + PERF_BR_IRQ = 9,/* hw interrupt/trap/fault */ + PERF_BR_INT = 10, /* sw interrupt */ + PERF_BR_IRET= 11, /* return from interrupt */ + PERF_BR_FAR_BRANCH = 12, /* not generic far branch type */ + PERF_BR_MAX, +}; + #define PERF_SAMPLE_BRANCH_PLM_ALL \ (PERF_SAMPLE_BRANCH_USER|\ PERF_SAMPLE_BRANCH_KERNEL|\ @@ -999,6 +1024,7 @@ union perf_mem_data_src { * in_tx: running in a hardware transaction * abort: aborting a hardware transaction *cycles: cycles from last branch (or 0 if not supported) + * type: branch type */ struct perf_branch_entry { __u64 from; @@ -1008,7 +1034,8 @@ struct perf_branch_entry { in_tx:1,/* in transaction */ abort:1,/* 
transaction abort */ cycles:16, /* cycle count to last branch */ - reserved:44; + type:4, /* branch type */ + reserved:40; }; #endif /* _UAPI_LINUX_PERF_EVENT_H */ diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index d09a9cd..69af012 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift { PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT = 14, /* no flags */ PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT = 15, /* no cycles */ + PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */ + PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */ }; @@ -198,9 +200,32 @@ enum perf_branch_sample_type {
[PATCH v4 2/5] perf/x86/intel: Record branch type
Perf already has support for disassembling the branch instruction and using the branch type for filtering. The patch just records the branch type in perf_branch_entry. Before recording, the patch converts the x86 branch type to the common branch type. Compared to the previous version, the major changes are: 1. Use a lookup table to convert the x86 branch type to the common branch type. 2. Move the JCC forward/JCC backward and cross page computing to user space. 3. Initialize branch type to 0 in intel_pmu_lbr_read_32 and intel_pmu_lbr_read_64. Signed-off-by: Jin Yao --- arch/x86/events/intel/lbr.c | 53 - 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 81b321a..d3b1dd6 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -109,6 +109,9 @@ enum { X86_BR_ZERO_CALL= 1 << 15,/* zero length call */ X86_BR_CALL_STACK = 1 << 16,/* call stack */ X86_BR_IND_JMP = 1 << 17,/* indirect jump */ + + X86_BR_TYPE_SAVE= 1 << 18,/* indicate to save branch type */ + }; #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL) @@ -507,6 +510,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].to = msr_lastbranch.to; cpuc->lbr_entries[i].mispred= 0; cpuc->lbr_entries[i].predicted = 0; + cpuc->lbr_entries[i].type = 0; cpuc->lbr_entries[i].reserved = 0; } cpuc->lbr_stack.nr = i; @@ -593,6 +597,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) cpuc->lbr_entries[out].in_tx = in_tx; cpuc->lbr_entries[out].abort = abort; cpuc->lbr_entries[out].cycles= cycles; + cpuc->lbr_entries[out].type = 0; cpuc->lbr_entries[out].reserved = 0; out++; } @@ -670,6 +675,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event) if (br_type & PERF_SAMPLE_BRANCH_CALL) mask |= X86_BR_CALL | X86_BR_ZERO_CALL; + + if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE) + mask |= X86_BR_TYPE_SAVE; + /* * stash actual user request into reg, it may * be used by fixup code for some CPU @@
-923,6 +932,44 @@ static int branch_type(unsigned long from, unsigned long to, int abort) return ret; } +#define X86_BR_TYPE_MAP_MAX16 + +static int +common_branch_type(int type) +{ + int i, mask; + const int branch_map[X86_BR_TYPE_MAP_MAX] = { + PERF_BR_CALL, /* X86_BR_CALL */ + PERF_BR_RET,/* X86_BR_RET */ + PERF_BR_SYSCALL,/* X86_BR_SYSCALL */ + PERF_BR_SYSRET, /* X86_BR_SYSRET */ + PERF_BR_INT,/* X86_BR_INT */ + PERF_BR_IRET, /* X86_BR_IRET */ + PERF_BR_JCC,/* X86_BR_JCC */ + PERF_BR_JMP,/* X86_BR_JMP */ + PERF_BR_IRQ,/* X86_BR_IRQ */ + PERF_BR_IND_CALL, /* X86_BR_IND_CALL */ + PERF_BR_NONE, /* X86_BR_ABORT */ + PERF_BR_NONE, /* X86_BR_IN_TX */ + PERF_BR_NONE, /* X86_BR_NO_TX */ + PERF_BR_CALL, /* X86_BR_ZERO_CALL */ + PERF_BR_NONE, /* X86_BR_CALL_STACK */ + PERF_BR_IND_JMP,/* X86_BR_IND_JMP */ + }; + + type >>= 2; /* skip X86_BR_USER and X86_BR_KERNEL */ + mask = ~(~0 << 1); + + for (i = 0; i < X86_BR_TYPE_MAP_MAX; i++) { + if (type & mask) + return branch_map[i]; + + type >>= 1; + } + + return PERF_BR_NONE; +} + /* * implement actual branch filter based on user demand. * Hardware may not exactly satisfy that request, thus @@ -939,7 +986,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc) bool compress = false; /* if sampling all branches, then nothing to filter */ - if ((br_sel & X86_BR_ALL) == X86_BR_ALL) + if (((br_sel & X86_BR_ALL) == X86_BR_ALL) && + ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE)) return; for (i = 0; i < cpuc->lbr_stack.nr; i++) { @@ -960,6 +1008,9 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].from = 0; compress = true; } + + if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE) + cpuc->lbr_entries[i].type = common_branch_type(type); } if (!compress) -- 2.7.4
[PATCH v4 3/5] perf record: Create a new option save_type in --branch-filter
This option tells the kernel to save the branch type during sampling. One example: perf record -g --branch-filter any,save_type Signed-off-by: Jin Yao --- tools/perf/Documentation/perf-record.txt | 1 + tools/perf/util/parse-branch-options.c | 1 + 2 files changed, 2 insertions(+) diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index ea3789d..e2f5a4f 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -332,6 +332,7 @@ following filters are defined: - no_tx: only when the target is not in a hardware transaction - abort_tx: only when the target is a hardware transaction abort - cond: conditional branches + - save_type: save branch type during sampling in case binary is not available later + The option requires at least one branch type among any, any_call, any_ret, ind_call, cond. diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c index 38fd115..e71fb5f 100644 --- a/tools/perf/util/parse-branch-options.c +++ b/tools/perf/util/parse-branch-options.c @@ -28,6 +28,7 @@ static const struct branch_mode branch_modes[] = { BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND), BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP), BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL), + BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE), BRANCH_END }; -- 2.7.4
[PATCH v4 4/5] perf report: Show branch type statistics for stdio mode
Show the branch type statistics at the end of perf report --stdio. For example: perf report --stdio JCC forward: 27.8% JCC backward: 9.7% CROSS_4K: 0.0% CROSS_2M: 14.3% JCC: 37.6% JMP: 0.0% IND_JMP: 6.5% CALL: 26.6% RET: 29.3% IRET: 0.0% The branch types are: - JCC forward: Conditional forward jump JCC backward: Conditional backward jump JMP: Jump imm IND_JMP: Jump reg/mem CALL: Call imm IND_CALL: Call reg/mem RET: Ret SYSCALL: Syscall SYSRET: Syscall return IRQ: HW interrupt/trap/fault INT: SW interrupt IRET: Return from interrupt FAR_BRANCH: Others not generic branch type CROSS_4K and CROSS_2M: -- These are metrics that check whether a branch crosses a 4K or 2MB page. It is an approximate computation: we don't know whether the area uses 4K or 2MB pages, so both are always computed. To keep the output simple, if a branch crosses a 2M area, CROSS_4K is not incremented. Compared to the previous version, the major change is: add the computation of JCC forward/JCC backward and the cross-page check using the from and to addresses.
Signed-off-by: Jin Yao --- tools/perf/builtin-report.c | 70 + tools/perf/util/event.h | 3 +- tools/perf/util/hist.c | 5 +--- tools/perf/util/util.c | 59 ++ tools/perf/util/util.h | 17 +++ 5 files changed, 149 insertions(+), 5 deletions(-) diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c index c18158b..c2889eb 100644 --- a/tools/perf/builtin-report.c +++ b/tools/perf/builtin-report.c @@ -66,6 +66,7 @@ struct report { u64 queue_size; int socket_filter; DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS); + struct branch_type_stat brtype_stat; }; static int report__config(const char *var, const char *value, void *cb) @@ -144,6 +145,24 @@ static int hist_iter__report_callback(struct hist_entry_iter *iter, return err; } +static int hist_iter__branch_callback(struct hist_entry_iter *iter, + struct addr_location *al __maybe_unused, + bool single __maybe_unused, + void *arg) +{ + struct hist_entry *he = iter->he; + struct report *rep = arg; + struct branch_info *bi; + + if (sort__mode == SORT_MODE__BRANCH) { + bi = he->branch_info; + branch_type_count(&rep->brtype_stat, &bi->flags, + bi->from.addr, bi->to.addr); + } + + return 0; +} + static int process_sample_event(struct perf_tool *tool, union perf_event *event, struct perf_sample *sample, @@ -182,6 +201,8 @@ static int process_sample_event(struct perf_tool *tool, */ if (!sample->branch_stack) goto out_put; + + iter.add_entry_cb = hist_iter__branch_callback; iter.ops = &hist_iter_branch; } else if (rep->mem_mode) { iter.ops = &hist_iter_mem; @@ -369,6 +390,50 @@ static size_t hists__fprintf_nr_sample_events(struct hists *hists, struct report return ret + fprintf(fp, "\n#\n"); } +static void branch_type_stat_display(FILE *fp, struct branch_type_stat *stat) +{ + u64 total = 0; + int i; + + for (i = 0; i < PERF_BR_MAX; i++) + total += stat->counts[i]; + + if (total == 0) + return; + + fprintf(fp, "\n#"); + fprintf(fp, "\n# Branch Statistics:"); + fprintf(fp, "\n#"); + + if (stat->jcc_fwd > 0) + fprintf(fp, 
"\n%12s: %5.1f%%", + "JCC forward", + 100.0 * (double)stat->jcc_fwd / (double)total); + + if (stat->jcc_bwd > 0) + fprintf(fp, "\n%12s: %5.1f%%", + "JCC backward", + 100.0 * (double)stat->jcc_bwd / (double)total); + + if (stat->cross_4k > 0) + fprintf(fp, "\n%12s: %5.1f%%", + "CROSS_4K", + 100.0 * (double)stat->cross_4k / (double)total); + + if (stat->cross_2m > 0) + fprintf(fp, "\n%12s: %5.1f%%", + "CROSS_2M", + 100.0 * (double)stat->cross_2m / (double)total); + + for (i = 0; i < PERF_BR_MAX; i++) { + if (stat->counts[i] > 0) + fprintf(fp, "\n%12s: %5.1f%%", + branch_type_name(i), + 100.0 * + (double)stat->counts[i] / (double)total); + } +} + static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
[PATCH v4 5/5] perf report: Show branch type in callchain entry
Show branch type in callchain entry. The branch type is printed with other LBR information (such as cycles/abort/...). One example: perf report --branch-history --stdio --no-children --23.54%--main div.c:42 (CROSS_2M RET cycles:2) compute_flag div.c:28 (RET cycles:2) compute_flag div.c:27 (CROSS_2M RET cycles:1) rand rand.c:28 (CROSS_4K RET cycles:1) rand rand.c:28 (CROSS_2M RET cycles:1) __random random.c:298 (CROSS_4K RET cycles:1) __random random.c:297 (JCC backward CROSS_2M cycles:1) __random random.c:295 (JCC forward CROSS_4K cycles:1) __random random.c:295 (JCC backward CROSS_2M cycles:1) __random random.c:295 (JCC forward CROSS_4K cycles:1) __random random.c:295 (CROSS_2M RET cycles:9) Compared to the previous version, the major change is: since the JCC forward/JCC backward and cross-page checks are computed in user space from the from and to addresses, while each callchain entry only contains one ip (either from or to), this patch appends the branch's from address to the callchain entry that contains only the to ip.
Signed-off-by: Jin Yao --- tools/perf/util/callchain.c | 195 ++-- tools/perf/util/callchain.h | 4 +- tools/perf/util/machine.c | 26 -- 3 files changed, 152 insertions(+), 73 deletions(-) diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c index 2e5eff5..3c875b1 100644 --- a/tools/perf/util/callchain.c +++ b/tools/perf/util/callchain.c @@ -467,6 +467,11 @@ fill_node(struct callchain_node *node, struct callchain_cursor *cursor) call->cycles_count = cursor_node->branch_flags.cycles; call->iter_count = cursor_node->nr_loop_iter; call->samples_count = cursor_node->samples; + + branch_type_count(&call->brtype_stat, + &cursor_node->branch_flags, + cursor_node->branch_from, + cursor_node->ip); } list_add_tail(&call->list, &node->val); @@ -579,6 +584,11 @@ static enum match_result match_chain(struct callchain_cursor_node *node, cnode->cycles_count += node->branch_flags.cycles; cnode->iter_count += node->nr_loop_iter; cnode->samples_count += node->samples; + + branch_type_count(&cnode->brtype_stat, + &node->branch_flags, + node->branch_from, + node->ip); } return MATCH_EQ; @@ -813,7 +823,7 @@ merge_chain_branch(struct callchain_cursor *cursor, list_for_each_entry_safe(list, next_list, &src->val, list) { callchain_cursor_append(cursor, list->ip, list->ms.map, list->ms.sym, - false, NULL, 0, 0); + false, NULL, 0, 0, 0); list_del(&list->list); map__zput(list->ms.map); free(list); @@ -853,7 +863,7 @@ int callchain_merge(struct callchain_cursor *cursor, int callchain_cursor_append(struct callchain_cursor *cursor, u64 ip, struct map *map, struct symbol *sym, bool branch, struct branch_flags *flags, - int nr_loop_iter, int samples) + int nr_loop_iter, int samples, u64 branch_from) { struct callchain_cursor_node *node = *cursor->last; @@ -877,6 +887,7 @@ int callchain_cursor_append(struct callchain_cursor *cursor, memcpy(&node->branch_flags, flags, sizeof(struct branch_flags)); + node->branch_from = branch_from; cursor->nr++; cursor->last = &node->next; @@ 
-1105,95 +1116,151 @@ int callchain_branch_counts(struct callchain_root *root, cycles_count); } +static int branch_type_str(struct branch_type_stat *stat, + char *bf, int bfsize) +{ + int i, j = 0, printed = 0; + u64 total = 0; + + for (i = 0; i < PERF_BR_MAX; i++) + total += stat->counts[i]; + + if (total == 0) + return 0; + + printed += scnprintf(bf + printed, bfsize - printed, " ("); + + if (stat->jcc_fwd > 0) { + j++; + printed += scnprintf(bf + printed, bfsize - printed, +"JCC forward"); + } + + if (stat->jcc_bwd > 0) { + if (j++) + printed += scnprintf(bf + printed, bfsize - printed, +" JCC backward"); +
Re: [PATCH V4 7/7] cxl: Add psl9 specific code
Le 07/04/2017 à 16:11, Christophe Lombard a écrit : The new Coherent Accelerator Interface Architecture, level 2, for the IBM POWER9 brings new content and features: - POWER9 Service Layer - Registers - Radix mode - Process element entry - Dedicated-Shared Process Programming Model - Translation Fault Handling - CAPP - Memory Context ID If a valid mm_struct is found the memory context id is used for each transaction associated with the process handle. The PSL uses the context ID to find the corresponding process element. Signed-off-by: Christophe Lombard --- I'm ok with the code. However checkpatch is complaining about a tab/space error in native.c If you have a quick respin, I also have a comment below about the documentation. Documentation/powerpc/cxl.txt | 11 +- drivers/misc/cxl/context.c| 16 ++- drivers/misc/cxl/cxl.h| 137 +++ drivers/misc/cxl/debugfs.c| 19 drivers/misc/cxl/fault.c | 64 +++ drivers/misc/cxl/guest.c | 8 +- drivers/misc/cxl/irq.c| 53 + drivers/misc/cxl/native.c | 225 +++--- drivers/misc/cxl/pci.c| 246 +++--- drivers/misc/cxl/trace.h | 43 10 files changed, 748 insertions(+), 74 deletions(-) diff --git a/Documentation/powerpc/cxl.txt b/Documentation/powerpc/cxl.txt index d5506ba0..4a77462 100644 --- a/Documentation/powerpc/cxl.txt +++ b/Documentation/powerpc/cxl.txt @@ -21,7 +21,7 @@ Introduction Hardware overview = - POWER8 FPGA + POWER8/9 FPGA +--++-+ | || | | CPU|| AFU | @@ -34,7 +34,7 @@ Hardware overview | | CAPP |<-->| | +---+--+ PCIE +-+ -The POWER8 chip has a Coherently Attached Processor Proxy (CAPP) +The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP) unit which is part of the PCIe Host Bridge (PHB). This is managed by Linux by calls into OPAL. Linux doesn't directly program the CAPP. @@ -59,6 +59,13 @@ Hardware overview the fault. The context to which this fault is serviced is based on who owns that acceleration function. +POWER8 <-> PSL Version 8 is compliant to the CAIA Version 1.0. 
+POWER9 <-> PSL Version 9 is compliant to the CAIA Version 2.0. +This PSL Version 9 provides new features as: +* Native DMA support. +* Supports sending ASB_Notify messages for host thread wakeup. +* Supports Atomic operations. +* I think one of the most important difference is missing: the PSL on power9 uses the new nest MMU on the power9 chip and no longer has its own MMU. Fred AFU Modes = diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index ac2531e..45363be 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -188,12 +188,24 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma) if (ctx->afu->current_mode == CXL_MODE_DEDICATED) { if (start + len > ctx->afu->adapter->ps_size) return -EINVAL; + + if (cxl_is_psl9(ctx->afu)) { + /* make sure there is a valid problem state +* area space for this AFU +*/ + if (ctx->master && !ctx->afu->psa) { + pr_devel("AFU doesn't support mmio space\n"); + return -EINVAL; + } + + /* Can't mmap until the AFU is enabled */ + if (!ctx->afu->enabled) + return -EBUSY; + } } else { if (start + len > ctx->psn_size) return -EINVAL; - } - if (ctx->afu->current_mode != CXL_MODE_DEDICATED) { /* make sure there is a valid per process space for this AFU */ if ((ctx->master && !ctx->afu->psa) || (!ctx->afu->pp_psa)) { pr_devel("AFU doesn't support mmio space\n"); diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 82335c0..df40e6e 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -63,7 +63,7 @@ typedef struct { /* Memory maps. 
Ref CXL Appendix A */ /* PSL Privilege 1 Memory Map */ -/* Configuration and Control area */ +/* Configuration and Control area - CAIA 1&2 */ static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x}; static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008}; static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010}; @@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase = {0x0100}; static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108}; static const cxl_p1_reg_t CXL_XSL_FEC = {0x0158}; static
Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.
On 04/11/2017 02:00 AM, Michael Ellerman wrote: > Tyrel Datwyler writes: > >> On 04/06/2017 09:04 PM, Michael Ellerman wrote: >>> Tyrel Datwyler writes: >>> On 04/06/2017 03:27 AM, Sachin Sant wrote: > On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on > any I/O adapter results in the following warning > > This problem has been in the code for some time now. I had first seen > this in > -next tree. > >> >> >> > Have attached the dmesg log from the system. Let me know if any additional > information is required to help debug this problem. I remember you mentioning this when the issue was brought up for CPUs. I assume the case is the same here where the issue is only seen with adapters that were hot-added after boot (ie. hot-remove of adapter present at boot doesn't trip the warning)? >>> >>> So who's fixing this? >> >> I started looking at it when Bharata submitted a patch trying to fix the >> issue for CPUs, but got side tracked by other things. I suspect that >> this underflow has actually been an issue for quite some time, and we >> are just now becoming aware of it thanks to the refcount_t patchset being >> merged. > > Yes I agree. Which means it might be broken in existing distros.

Definitely. I did some profiling last night, and I understand the hotplug case. It turns out to be as I suggested in the original thread about CPUs. When the devicetree code was reworked to move the tree out of proc and into sysfs, the sysfs detach code added an of_node_put to remove the original of_init reference. pSeries, being the sole original *dynamic* device tree user, had always issued an of_node_put in our dlpar-specific detach function to achieve that end. So, this should be a pretty straightforward, trivial fix.

However, for the case where devices are present at boot it appears we are leaking a lot of references, resulting in the device nodes never actually being released/freed after a dlpar remove. In the CPU case after boot I count 8 more references taken than in the hotplug case, and the corresponding of_node_put's are not called at dlpar remove time either. It will take some time to track them down, review, and clean up.

-Tyrel

>
>> I'll look into it again this week.
>
> Thanks.
>
> cheers
>
ZONE_DEVICE and pmem API support for powerpc
Hi all,

This series adds support for ZONE_DEVICE and the pmem api on powerpc. Namely, support for altmaps and the various bits and pieces required for DAX PMD faults. The first two patches touch generic mm/ code, but otherwise this is fairly well contained in arch/powerpc.

If the nvdimm folks could sanity check this series I'd appreciate it.

Series is based on next-20170411, but it should apply elsewhere with minor fixups to arch_{add|remove}_memory due to conflicts with HMM. For those interested in testing this, there is a driver and matching firmware that carves out some system memory for use as an emulated Con Tutto memory card.

Driver: https://github.com/oohal/linux/tree/contutto-next
Firmware: https://github.com/oohal/skiboot/tree/fake-contutto

Edit core/init.c:686 to control the amount of memory borrowed for the emulated device. I'm keeping the driver out of tree until 4.13 since I plan on reworking the firmware interface anyway and there's at least one showstopper bug.

Thanks,
Oliver
[PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more
Depending on the flags of the PMD being zapped there may or may not be a deposited pgtable to be freed. In two of the three cases this is open coded while the third uses the zap_deposited_table() helper. This patch converts the others to use the helper to clean things up a bit.

Cc: "Aneesh Kumar K.V"
Cc: "Kirill A. Shutemov"
Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran
---
For reference:

void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
{
	pgtable_t pgtable;

	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pte_free(mm, pgtable);
	atomic_long_dec(&mm->nr_ptes);
}
---
 mm/huge_memory.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b787c4cfda0e..aa01dd47cc65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1615,8 +1615,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
-		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
-		atomic_long_dec(&tlb->mm->nr_ptes);
+		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
@@ -1625,10 +1624,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 		VM_BUG_ON_PAGE(!PageHead(page), page);
 		if (PageAnon(page)) {
-			pgtable_t pgtable;
-			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
-			pte_free(tlb->mm, pgtable);
-			atomic_long_dec(&tlb->mm->nr_ptes);
+			zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
-- 
2.9.3
[PATCH 2/9] mm/huge_memory: Deposit a pgtable for DAX PMD faults when required
Although all architectures use a deposited page table for THP on anonymous VMAs some architectures (s390 and powerpc) require the deposited storage even for file backed VMAs due to quirks of their MMUs. This patch adds support for depositing a table in DAX PMD fault handling path for archs that require it. Other architectures should see no functional changes. Cc: "Aneesh Kumar K.V" Cc: linux...@kvack.org Signed-off-by: Oliver O'Halloran --- mm/huge_memory.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index aa01dd47cc65..a84909cf20d3 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -715,7 +715,8 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf) } static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write) + pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write, + pgtable_t pgtable) { struct mm_struct *mm = vma->vm_mm; pmd_t entry; @@ -729,6 +730,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); } + + if (pgtable) { + pgtable_trans_huge_deposit(mm, pmd, pgtable); + atomic_long_inc(&mm->nr_ptes); + } + set_pmd_at(mm, addr, pmd, entry); update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); @@ -738,6 +745,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, pfn_t pfn, bool write) { pgprot_t pgprot = vma->vm_page_prot; + pgtable_t pgtable = NULL; /* * If we had pmd_special, we could avoid all these restrictions, * but we need to be consistent with PTEs and architectures that @@ -752,9 +760,15 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, if (addr < vma->vm_start || addr >= vma->vm_end) return VM_FAULT_SIGBUS; + if (arch_needs_pgtable_deposit()) { + pgtable = pte_alloc_one(vma->vm_mm, addr); + if (!pgtable) + return VM_FAULT_OOM; + } + track_pfn_insert(vma, 
&pgprot, pfn); - insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write); + insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write, pgtable); return VM_FAULT_NOPAGE; } EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd); @@ -1611,6 +1625,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, tlb->fullmm); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); if (vma_is_dax(vma)) { + if (arch_needs_pgtable_deposit()) + zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); if (is_huge_zero_pmd(orig_pmd)) tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); -- 2.9.3
[PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.
From: "Aneesh Kumar K.V" Add a _PAGE_DEVMAP bit for PTE and DAX PMD entries. PowerPC doesn't currently support PUD faults so we haven't extended it to the PUD level. Cc: Aneesh Kumar K.V Signed-off-by: Oliver O'Halloran --- arch/powerpc/include/asm/book3s/64/pgtable.h | 37 +++- 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index fb72ff6b98e6..b5fc6337649e 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -78,6 +78,9 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ +#define _PAGE_DEVMAP _RPAGE_SW1 +#define __HAVE_ARCH_PTE_DEVMAP + /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE * Instead of fixing all of them, add an alternate define which @@ -602,6 +605,16 @@ static inline pte_t pte_mkhuge(pte_t pte) return pte; } +static inline pte_t pte_mkdevmap(pte_t pte) +{ + return __pte(pte_val(pte) | _PAGE_SPECIAL|_PAGE_DEVMAP); +} + +static inline int pte_devmap(pte_t pte) +{ + return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP)); +} + static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { /* FIXME!! check whether this need to be a conditional */ @@ -966,6 +979,9 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd) #define pmd_mk_savedwrite(pmd) pte_pmd(pte_mk_savedwrite(pmd_pte(pmd))) #define pmd_clear_savedwrite(pmd) pte_pmd(pte_clear_savedwrite(pmd_pte(pmd))) +#define pud_pfn(...) (0) +#define pgd_pfn(...)
(0) + #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY #define pmd_soft_dirty(pmd)pte_soft_dirty(pmd_pte(pmd)) #define pmd_mksoft_dirty(pmd) pte_pmd(pte_mksoft_dirty(pmd_pte(pmd))) @@ -1140,7 +1156,6 @@ static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, return true; } - #define arch_needs_pgtable_deposit arch_needs_pgtable_deposit static inline bool arch_needs_pgtable_deposit(void) { @@ -1149,6 +1164,26 @@ static inline bool arch_needs_pgtable_deposit(void) return true; } +static inline pmd_t pmd_mkdevmap(pmd_t pmd) +{ + return pte_pmd(pte_mkdevmap(pmd_pte(pmd))); +} + +static inline int pmd_devmap(pmd_t pmd) +{ + return pte_devmap(pmd_pte(pmd)); +} + +static inline int pud_devmap(pud_t pud) +{ + return 0; +} + +static inline int pgd_devmap(pgd_t pgd) +{ + return 0; +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* __ASSEMBLY__ */ #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */ -- 2.9.3
[PATCH 4/9] powerpc/mm: Reshuffle vmemmap_free()
Removes an indentation level and shuffles some code around to make the following patch cleaner. No functional changes. Signed-off-by: Oliver O'Halloran --- arch/powerpc/mm/init_64.c | 47 +-- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index ec84b31c6c86..f8124edb6ffa 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -234,12 +234,15 @@ static unsigned long vmemmap_list_free(unsigned long start) void __ref vmemmap_free(unsigned long start, unsigned long end) { unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long page_order = get_order(page_size); start = _ALIGN_DOWN(start, page_size); pr_debug("vmemmap_free %lx...%lx\n", start, end); for (; start < end; start += page_size) { + struct page *page = pfn_to_page(addr >> PAGE_SHIFT); + unsigned int nr_pages; unsigned long addr; /* @@ -251,29 +254,29 @@ void __ref vmemmap_free(unsigned long start, unsigned long end) continue; addr = vmemmap_list_free(start); - if (addr) { - struct page *page = pfn_to_page(addr >> PAGE_SHIFT); - - if (PageReserved(page)) { - /* allocated from bootmem */ - if (page_size < PAGE_SIZE) { - /* -* this shouldn't happen, but if it is -* the case, leave the memory there -*/ - WARN_ON_ONCE(1); - } else { - unsigned int nr_pages = - 1 << get_order(page_size); - while (nr_pages--) - free_reserved_page(page++); - } - } else - free_pages((unsigned long)(__va(addr)), - get_order(page_size)); - - vmemmap_remove_mapping(start, page_size); + if (!addr) + continue; + + page = pfn_to_page(addr >> PAGE_SHIFT); + nr_pages = 1 << page_order; + + if (PageReserved(page)) { + /* allocated from bootmem */ + if (page_size < PAGE_SIZE) { + /* +* this shouldn't happen, but if it is +* the case, leave the memory there +*/ + WARN_ON_ONCE(1); + } else { + while (nr_pages--) + free_reserved_page(page++); + } + } else { + free_pages((unsigned long)(__va(addr)), page_order); } + + 
vmemmap_remove_mapping(start, page_size); } } #endif -- 2.9.3
[PATCH 5/9] powerpc/vmemmap: Add altmap support
Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An altmap is a driver provided region that is used to provide the backing storage for the struct pages of ZONE_DEVICE memory. In situations where a large amount of ZONE_DEVICE memory is being added to the system the altmap reduces pressure on main system memory by allowing the mm/ metadata to be stored on the device itself rather than in main memory. Signed-off-by: Oliver O'Halloran --- arch/powerpc/mm/init_64.c | 20 +++- arch/powerpc/mm/mem.c | 16 +--- 2 files changed, 28 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index f8124edb6ffa..225fbb8034e6 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -44,6 +44,7 @@ #include #include #include +#include #include #include @@ -171,13 +172,17 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node) pr_debug("vmemmap_populate %lx..%lx, node %d\n", start, end, node); for (; start < end; start += page_size) { + struct vmem_altmap *altmap; void *p; int rc; if (vmemmap_populated(start, page_size)) continue; - p = vmemmap_alloc_block(page_size, node); + /* altmap lookups only work at section boundaries */ + altmap = to_vmem_altmap(SECTION_ALIGN_DOWN(start)); + + p = __vmemmap_alloc_block_buf(page_size, node, altmap); if (!p) return -ENOMEM; @@ -241,9 +246,10 @@ void __ref vmemmap_free(unsigned long start, unsigned long end) pr_debug("vmemmap_free %lx...%lx\n", start, end); for (; start < end; start += page_size) { - struct page *page = pfn_to_page(addr >> PAGE_SHIFT); - unsigned int nr_pages; - unsigned long addr; + unsigned long nr_pages, addr; + struct vmem_altmap *altmap; + struct page *section_base; + struct page *page; /* * the section has already be marked as invalid, so @@ -258,9 +264,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end) continue; page = pfn_to_page(addr >> PAGE_SHIFT); + section_base =
pfn_to_page(vmemmap_section_start(start)); nr_pages = 1 << page_order; - if (PageReserved(page)) { + altmap = to_vmem_altmap((unsigned long) section_base); + if (altmap) { + vmem_altmap_free(altmap, nr_pages); + } else if (PageReserved(page)) { /* allocated from bootmem */ if (page_size < PAGE_SIZE) { /* diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 3bbba178b464..6f7b64eaa9d8 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -176,7 +177,8 @@ int arch_remove_memory(u64 start, u64 size, enum memory_type type) { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - struct zone *zone; + struct vmem_altmap *altmap; + struct page *page; int ret; /* @@ -193,8 +195,16 @@ int arch_remove_memory(u64 start, u64 size, enum memory_type type) return -EINVAL; } - zone = page_zone(pfn_to_page(start_pfn)); - ret = __remove_pages(zone, start_pfn, nr_pages); + /* +* If we have an altmap then we need to skip over any reserved PFNs +* when querying the zone. +*/ + page = pfn_to_page(start_pfn); + altmap = to_vmem_altmap((unsigned long) page); + if (altmap) + page += vmem_altmap_offset(altmap); + + ret = __remove_pages(page_zone(page), start_pfn, nr_pages); if (ret) return ret; -- 2.9.3
[PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc
Flip the switch. Running around and screaming "IT'S ALIVE" is optional, but recommended.

Signed-off-by: Oliver O'Halloran
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 43d000e44424..d696af58f97f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -724,7 +724,7 @@ config ZONE_DEVICE
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on SPARSEMEM_VMEMMAP
-	depends on X86_64 #arch_add_memory() comprehends device memory
+	depends on (X86_64 || PPC_BOOK3S_64) #arch_add_memory() comprehends device memory
 	help
 	  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3
[PATCH 7/9] powerpc/mm: Wire up ioremap_cache
The default implementation of ioremap_cache() is aliased to ioremap(). On powerpc ioremap() creates cache-inhibited mappings by default which is almost certainly not what you wanted.

Signed-off-by: Oliver O'Halloran
---
 arch/powerpc/include/asm/io.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 5ed292431b5b..839eb031857f 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -757,6 +757,8 @@ extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size,
 extern void __iomem *ioremap_wc(phys_addr_t address, unsigned long size);
 #define ioremap_nocache(addr, size)	ioremap((addr), (size))
 #define ioremap_uc(addr, size)		ioremap((addr), (size))
+#define ioremap_cache(addr, size) \
+	ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL))
 
 extern void iounmap(volatile void __iomem *addr);
-- 
2.9.3
[PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv
From: Rashmica Gupta Adds support for removing bolted (i.e kernel linear mapping) mappings on powernv. This is needed to support memory hot unplug operations which are required for the teardown of DAX/PMEM devices. Cc: Rashmica Gupta Cc: Anton Blanchard Signed-off-by: Oliver O'Halloran --- Could the original author of this add their S-o-b? I pulled it out of Rashmica's memtrace patch, but I remember someone saying Anton wrote it originally. --- arch/powerpc/mm/hash_native_64.c | 31 +++ 1 file changed, 31 insertions(+) diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c index 65bb8f33b399..9ba91d4905a4 100644 --- a/arch/powerpc/mm/hash_native_64.c +++ b/arch/powerpc/mm/hash_native_64.c @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long newpp, unsigned long ea, tlbie(vpn, psize, psize, ssize, 0); } +/* + * Remove a bolted kernel entry. Memory hotplug uses this. + * + * No need to lock here because we should be the only user. + */ +static int native_hpte_removebolted(unsigned long ea, int psize, int ssize) +{ + unsigned long vpn; + unsigned long vsid; + long slot; + struct hash_pte *hptep; + + vsid = get_kernel_vsid(ea, ssize); + vpn = hpt_vpn(ea, vsid, ssize); + + slot = native_hpte_find(vpn, psize, ssize); + if (slot == -1) + return -ENOENT; + + hptep = htab_address + slot; + + /* Invalidate the hpte */ + hptep->v = 0; + + /* Invalidate the TLB */ + tlbie(vpn, psize, psize, ssize, 0); + return 0; +} + + static void native_hpte_invalidate(unsigned long slot, unsigned long vpn, int bpsize, int apsize, int ssize, int local) { @@ -725,6 +755,7 @@ void __init hpte_init_native(void) mmu_hash_ops.hpte_invalidate= native_hpte_invalidate; mmu_hash_ops.hpte_updatepp = native_hpte_updatepp; mmu_hash_ops.hpte_updateboltedpp = native_hpte_updateboltedpp; + mmu_hash_ops.hpte_removebolted = native_hpte_removebolted; mmu_hash_ops.hpte_insert= native_hpte_insert; mmu_hash_ops.hpte_remove= native_hpte_remove; 
mmu_hash_ops.hpte_clear_all = native_hpte_clear; -- 2.9.3
[PATCH 9/9] powerpc: Add pmem API support
Initial powerpc support for the arch-specific bit of the persistent memory API. Nothing fancy here. Signed-off-by: Oliver O'Halloran --- arch/powerpc/Kconfig| 1 + arch/powerpc/include/asm/pmem.h | 109 arch/powerpc/kernel/misc_64.S | 2 +- 3 files changed, 111 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/pmem.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d7413ed700b8..cf84d0db49ab 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -87,6 +87,7 @@ config PPC select ARCH_HAS_DMA_SET_COHERENT_MASK select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_GCOV_PROFILE_ALL + select ARCH_HAS_PMEM_API select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE select ARCH_HAS_SG_CHAIN select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST diff --git a/arch/powerpc/include/asm/pmem.h b/arch/powerpc/include/asm/pmem.h new file mode 100644 index ..27da9594040f --- /dev/null +++ b/arch/powerpc/include/asm/pmem.h @@ -0,0 +1,109 @@ +/* + * Copyright(c) 2017 IBM Corporation. All rights reserved. + * + * Based on the x86 version. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#ifndef __ASM_POWERPC_PMEM_H__ +#define __ASM_POWERPC_PMEM_H__ + +#include +#include +#include + +/* + * See include/linux/pmem.h for API documentation + * + * PPC specific notes: + * + * 1. PPC has no non-temporal (cache bypassing) stores so we're stuck with + *doing cache writebacks. + * + * 2. DCBST is a suggestion. DCBF *will* force a writeback. 
+ * + */ + +static inline void arch_wb_cache_pmem(void *addr, size_t size) +{ + unsigned long iaddr = (unsigned long) addr; + + /* NB: contains a barrier */ + flush_inval_dcache_range(iaddr, iaddr + size); +} + +/* invalidate and writeback are functionally identical */ +#define arch_invalidate_pmem arch_wb_cache_pmem + +static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n) +{ + int unwritten; + + /* +* We are copying between two kernel buffers, if +* __copy_from_user_inatomic_nocache() returns an error (page +* fault) we would have already reported a general protection fault +* before the WARN+BUG. +* +* XXX: replace this with a hand-rolled memcpy+dcbf +*/ + unwritten = __copy_from_user_inatomic(dst, (void __user *) src, n); + if (WARN(unwritten, "%s: fault copying %p <- %p unwritten: %d\n", + __func__, dst, src, unwritten)) + BUG(); + + arch_wb_cache_pmem(dst, n); +} + +static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n) +{ + /* +* TODO: We should have most of the infrastructure for MCE handling +* but it needs to be made slightly smarter. +*/ + memcpy(dst, src, n); + return 0; +} + +static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes, + struct iov_iter *i) +{ + size_t len; + + /* XXX: under what conditions would this return len < size? */ + len = copy_from_iter(addr, bytes, i); + arch_wb_cache_pmem(addr, bytes - len); + + return len; +} + +static inline void arch_clear_pmem(void *addr, size_t size) +{ + void *start = addr; + + /* +* XXX: A hand rolled dcbz+dcbf loop would probably be better. 
+*/ + + if (((uintptr_t) addr & ~PAGE_MASK) == 0) { + while (size >= PAGE_SIZE) { + clear_page(addr); + addr += PAGE_SIZE; + size -= PAGE_SIZE; + } + } + + if (size) + memset(addr, 0, size); + + arch_wb_cache_pmem(start, size); +} + +#endif /* __ASM_POWERPC_PMEM_H__ */ diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S index c119044cad0d..1378a8d61faf 100644 --- a/arch/powerpc/kernel/misc_64.S +++ b/arch/powerpc/kernel/misc_64.S @@ -182,7 +182,7 @@ _GLOBAL(flush_dcache_phys_range) isync blr -_GLOBAL(flush_inval_dcache_range) +_GLOBAL_TOC(flush_inval_dcache_range) ld r10,PPC64_CACHES@toc(r2) lwz r7,DCACHEL1BLOCKSIZE(r10) /* Get dcache block size */ addir5,r7,-1 -- 2.9.3
Re: ZONE_DEVICE and pmem API support for powerpc
On Tue, Apr 11, 2017 at 10:42 AM, Oliver O'Halloran wrote:
> Hi all,
>
> This series adds support for ZONE_DEVICE and the pmem api on powerpc. Namely,
> support for altmaps and the various bits and pieces required for DAX PMD
> faults.
> The first two patches touch generic mm/ code, but otherwise this is fairly
> well contained in arch/powerpc.
>
> If the nvdimm folks could sanity check this series I'd appreciate it.

Quick feedback: I'm in the process of cleaning up and resubmitting my patch set to push the pmem api down into the driver directly.

https://lwn.net/Articles/713064/

I'm also reworking memory hotplug to allow sub-section allocations which has collided with Michal Hocko's hotplug reworks. It will be good to have some more eyes on that work to understand the cross-arch implications.

https://lkml.org/lkml/2017/3/19/146

> Series is based on next-20170411, but it should apply elsewhere with minor
> fixups to arch_{add|remove}_memory due to conflicts with HMM. For those
> interested in testing this, there is a driver and matching firmware that
> carves out some system memory for use as an emulated Con Tutto memory card.
>
> Driver: https://github.com/oohal/linux/tree/contutto-next
> Firmware: https://github.com/oohal/skiboot/tree/fake-contutto
>
> Edit core/init.c:686 to control the amount of memory borrowed for the
> emulated device. I'm keeping the driver out of tree until 4.13 since I plan
> on reworking the firmware interface anyway and there's at least one
> showstopper bug.

Is this memory card I/O-cache coherent? I.e. existing dma mapping api can hand out mappings to it? Just trying to figure out if this is the existing pmem-definition of ZONE_DEVICE or a new one.
Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv
Hi Oliver, > From: Rashmica Gupta > > Adds support for removing bolted (i.e kernel linear mapping) mappings > on powernv. This is needed to support memory hot unplug operations > which are required for the teardown of DAX/PMEM devices. > > Cc: Rashmica Gupta > Cc: Anton Blanchard > Signed-off-by: Oliver O'Halloran > --- > Could the original author of this add their S-o-b? I pulled it out of > Rashmica's memtrace patch, but I remember someone saying Anton wrote > it originally. I did. Signed-off-by: Anton Blanchard Anton > --- > arch/powerpc/mm/hash_native_64.c | 31 +++ > 1 file changed, 31 insertions(+) > > diff --git a/arch/powerpc/mm/hash_native_64.c > b/arch/powerpc/mm/hash_native_64.c index 65bb8f33b399..9ba91d4905a4 > 100644 --- a/arch/powerpc/mm/hash_native_64.c > +++ b/arch/powerpc/mm/hash_native_64.c > @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned > long newpp, unsigned long ea, tlbie(vpn, psize, psize, ssize, 0); > } > > +/* > + * Remove a bolted kernel entry. Memory hotplug uses this. > + * > + * No need to lock here because we should be the only user. 
> + */ > +static int native_hpte_removebolted(unsigned long ea, int psize, int > ssize) +{ > + unsigned long vpn; > + unsigned long vsid; > + long slot; > + struct hash_pte *hptep; > + > + vsid = get_kernel_vsid(ea, ssize); > + vpn = hpt_vpn(ea, vsid, ssize); > + > + slot = native_hpte_find(vpn, psize, ssize); > + if (slot == -1) > + return -ENOENT; > + > + hptep = htab_address + slot; > + > + /* Invalidate the hpte */ > + hptep->v = 0; > + > + /* Invalidate the TLB */ > + tlbie(vpn, psize, psize, ssize, 0); > + return 0; > +} > + > + > static void native_hpte_invalidate(unsigned long slot, unsigned long > vpn, int bpsize, int apsize, int ssize, int local) > { > @@ -725,6 +755,7 @@ void __init hpte_init_native(void) > mmu_hash_ops.hpte_invalidate= native_hpte_invalidate; > mmu_hash_ops.hpte_updatepp = native_hpte_updatepp; > mmu_hash_ops.hpte_updateboltedpp = > native_hpte_updateboltedpp; > + mmu_hash_ops.hpte_removebolted = native_hpte_removebolted; > mmu_hash_ops.hpte_insert= native_hpte_insert; > mmu_hash_ops.hpte_remove= native_hpte_remove; > mmu_hash_ops.hpte_clear_all = native_hpte_clear;
Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv
Hi Oliver, On Wed, 12 Apr 2017 08:50:56 +1000 Anton Blanchard wrote: > > > From: Rashmica Gupta > > > > Adds support for removing bolted (i.e kernel linear mapping) mappings > > on powernv. This is needed to support memory hot unplug operations > > which are required for the teardown of DAX/PMEM devices. > > > > Cc: Rashmica Gupta > > Cc: Anton Blanchard > > Signed-off-by: Oliver O'Halloran > > --- > > Could the original author of this add their S-o-b? I pulled it out of > > Rashmica's memtrace patch, but I remember someone saying Anton wrote > > it originally. > > I did. > > Signed-off-by: Anton Blanchard If you are going to claim that Rashmica authored this patch (and you do with the From: line above), then you need her Signed-off-by as well. -- Cheers, Stephen Rothwell
Re: [PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.
Hi Oliver,

On Wed, 12 Apr 2017 03:42:27 +1000 Oliver O'Halloran wrote:
>
> From: "Aneesh Kumar K.V"
>
> Add a _PAGE_DEVMAP bit for PTE and DAX PMD entries. PowerPC doesn't
> currently support PUD faults so we haven't extended it to the PUD
> level.
>
> Cc: Aneesh Kumar K.V
> Signed-off-by: Oliver O'Halloran

This needs Aneesh's Signed-off-by.

--
Cheers,
Stephen Rothwell
Re: [PATCH 5/9] powerpc/vmemmap: Add altmap support
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote:
> Adds support to powerpc for the altmap feature of ZONE_DEVICE memory. An
> altmap is a driver provided region that is used to provide the backing
> storage for the struct pages of ZONE_DEVICE memory. In situations where
> large amounts of ZONE_DEVICE memory are being added to the system the
> altmap reduces pressure on main system memory by allowing the mm/
> metadata to be stored on the device itself rather than in main memory.
>
> Signed-off-by: Oliver O'Halloran
> ---

Reviewed-by: Balbir Singh
Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote: > Flip the switch. Running around and screaming "IT'S ALIVE" is optional, > but recommended. > > Signed-off-by: Oliver O'Halloran > --- > mm/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 43d000e44424..d696af58f97f 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -724,7 +724,7 @@ config ZONE_DEVICE > depends on MEMORY_HOTPLUG > depends on MEMORY_HOTREMOVE > depends on SPARSEMEM_VMEMMAP > - depends on X86_64 #arch_add_memory() comprehends device memory > + depends on (X86_64 || PPC_BOOK3S_64) #arch_add_memory() comprehends > device memory Reviewed-by: Balbir Singh
Re: [PATCH 4/9] powerpc/mm: Reshuffle vmemmap_free()
Hi Oliver, On Wed, 12 Apr 2017 03:42:28 +1000 Oliver O'Halloran wrote: > > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c > index ec84b31c6c86..f8124edb6ffa 100644 > --- a/arch/powerpc/mm/init_64.c > +++ b/arch/powerpc/mm/init_64.c > @@ -234,12 +234,15 @@ static unsigned long vmemmap_list_free(unsigned long > start) > void __ref vmemmap_free(unsigned long start, unsigned long end) > { > unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; > + unsigned long page_order = get_order(page_size); > > start = _ALIGN_DOWN(start, page_size); > > pr_debug("vmemmap_free %lx...%lx\n", start, end); > > for (; start < end; start += page_size) { > + struct page *page = pfn_to_page(addr >> PAGE_SHIFT); The declaration of addr is below here and, even so, it would be uninitialised ... > + unsigned int nr_pages; > unsigned long addr; -- Cheers, Stephen Rothwell
Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc
Hi Oliver, On Wed, 12 Apr 2017 03:42:30 +1000 Oliver O'Halloran wrote: > > Flip the switch. Running around and screaming "IT'S ALIVE" is optional, > but recommended. > > Signed-off-by: Oliver O'Halloran > --- > mm/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 43d000e44424..d696af58f97f 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -724,7 +724,7 @@ config ZONE_DEVICE > depends on MEMORY_HOTPLUG > depends on MEMORY_HOTREMOVE > depends on SPARSEMEM_VMEMMAP > - depends on X86_64 #arch_add_memory() comprehends device memory > + depends on (X86_64 || PPC_BOOK3S_64) #arch_add_memory() comprehends > device memory > That's fine, but at what point do we create CONFIG_ARCH_HAVE_ZONE_DEVICE, replace the "depends on " above with "depends on ARCH_HAVE_ZONE_DEVICE" and select that from the appropriate places? -- Cheers, Stephen Rothwell
Re: ZONE_DEVICE and pmem API support for powerpc
Hi Oliver, On Wed, 12 Apr 2017 03:42:24 +1000 Oliver O'Halloran wrote: > > Series is based on next-20170411, but it should apply elsewhere with minor > fixups to arch_{add|remove}_memory due to conflicts with HMM. For those Just to make life fun for you, Andrew has dropped the HMM patches from his quilt series today (and so they will not be in next-20170412). -- Cheers, Stephen Rothwell
[PATCH 1/2] lib/raid6: Build proper files on corresponding arch
Previously the raid6 test Makefile did not correctly build the files for testing on PowerPC. This patch fixes the bug, so that all appropriate files for PowerPC are built. Signed-off-by: Matt Brown --- lib/raid6/test/Makefile | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile index 9c333e9..62b26d1 100644 --- a/lib/raid6/test/Makefile +++ b/lib/raid6/test/Makefile @@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes) CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1 else HAS_ALTIVEC := $(shell printf '\#include \nvector int a;\n' |\ - gcc -c -x c - >&/dev/null && \ - rm ./-.o && echo yes) +gcc -c -x c - >/dev/null && rm ./-.o && echo yes) ifeq ($(HAS_ALTIVEC),yes) -OBJS += altivec1.o altivec2.o altivec4.o altivec8.o + CFLAGS += -I../../../arch/powerpc/include + CFLAGS += -DCONFIG_ALTIVEC + OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \ + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o endif endif ifeq ($(ARCH),tilegx) -- 2.9.3
[PATCH v3 2/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome
The raid6 Q syndrome check has been optimised using the vpermxor instruction. This instruction was made available with POWER8, ISA version 2.07. It allows for both vperm and vxor instructions to be done in a single instruction. This has been tested for correctness on a ppc64le vm with a basic RAID6 setup containing 5 drives. The performance benchmarks are from the raid6test in the /lib/raid6/test directory. These results are from an IBM Firestone machine with ppc64le architecture. The benchmark results show a 35% speed increase over the best existing algorithm for powerpc (altivec). The raid6test has also been run on a big-endian ppc64 vm to ensure it also works for big-endian architectures. Performance benchmarks: raid6: altivecx4 gen() 18773 MB/s raid6: altivecx8 gen() 19438 MB/s raid6: vpermxor4 gen() 25112 MB/s raid6: vpermxor8 gen() 26279 MB/s Note: Fixed minor bug in altivec.uc regarding missing and mismatched ifdef statements. Signed-off-by: Matt Brown --- Changelog v2 - Change CONFIG_ALTIVEC to CPU_FTR_ALTIVEC_COMP - Seperate bug fix into different patch --- include/linux/raid/pq.h | 4 ++ lib/raid6/Makefile | 27 - lib/raid6/algos.c | 4 ++ lib/raid6/altivec.uc| 3 ++ lib/raid6/test/Makefile | 14 ++- lib/raid6/vpermxor.uc | 104 6 files changed, 154 insertions(+), 2 deletions(-) create mode 100644 lib/raid6/vpermxor.uc diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h index 4d57bba..3df9aa6 100644 --- a/include/linux/raid/pq.h +++ b/include/linux/raid/pq.h @@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2; extern const struct raid6_calls raid6_avx512x4; extern const struct raid6_calls raid6_tilegx8; extern const struct raid6_calls raid6_s390vx8; +extern const struct raid6_calls raid6_vpermxor1; +extern const struct raid6_calls raid6_vpermxor2; +extern const struct raid6_calls raid6_vpermxor4; +extern const struct raid6_calls raid6_vpermxor8; struct raid6_recov_calls { void (*data2)(int, size_t, int, int, void **); diff --git 
a/lib/raid6/Makefile b/lib/raid6/Makefile index 3057011..7775aad 100644 --- a/lib/raid6/Makefile +++ b/lib/raid6/Makefile @@ -4,7 +4,8 @@ raid6_pq-y += algos.o recov.o tables.o int1.o int2.o int4.o \ int8.o int16.o int32.o raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o -raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o +raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \ + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o raid6_pq-$(CONFIG_TILEGX) += tilegx8.o raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o @@ -88,6 +89,30 @@ $(obj)/altivec8.c: UNROLL := 8 $(obj)/altivec8.c: $(src)/altivec.uc $(src)/unroll.awk FORCE $(call if_changed,unroll) +CFLAGS_vpermxor1.o += $(altivec_flags) +targets += vpermxor1.c +$(obj)/vpermxor1.c: UNROLL := 1 +$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor2.o += $(altivec_flags) +targets += vpermxor2.c +$(obj)/vpermxor2.c: UNROLL := 2 +$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor4.o += $(altivec_flags) +targets += vpermxor4.c +$(obj)/vpermxor4.c: UNROLL := 4 +$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor8.o += $(altivec_flags) +targets += vpermxor8.c +$(obj)/vpermxor8.c: UNROLL := 8 +$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + CFLAGS_neon1.o += $(NEON_FLAGS) targets += neon1.c $(obj)/neon1.c: UNROLL := 1 diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c index 7857049..edd4f69 100644 --- a/lib/raid6/algos.c +++ b/lib/raid6/algos.c @@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = { &raid6_altivec2, &raid6_altivec4, &raid6_altivec8, + &raid6_vpermxor1, + &raid6_vpermxor2, + &raid6_vpermxor4, + 
&raid6_vpermxor8, #endif #if defined(CONFIG_TILEGX) &raid6_tilegx8, diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc index 682aae8..d20ed0d 100644 --- a/lib/raid6/altivec.uc +++ b/lib/raid6/altivec.uc @@ -24,10 +24,13 @@ #include +#ifdef CONFIG_ALTIVEC + #include #ifdef __KERNEL__ # include # include +#endif /* __KERNEL__ */ /* * This is the C data type to use. We use a vector of diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile index 2c7b60e..9c333e9 100644 --- a/lib/raid6/test/Makefile +++ b/lib/raid6/test/Makefile @@ -97,6 +97,18 @@ altivec4.c: altivec.uc ../unroll.awk altivec8.c: altivec.uc .
Re: EEH error in doing DMA with PEX 8619
On Tue, Apr 11, 2017 at 5:37 PM, Benjamin Herrenschmidt [via linuxppc] wrote:
> Another possibility would be if the requests from the PLX have a
> different initiator ID on the bus than the device you are setting up
> the DMA for.

Is there a way to check out the initiator ID in the driver? I'd like to make sure of this.

--
View this message in context: http://linuxppc.10917.n7.nabble.com/EEH-error-in-doing-DMA-with-PEX-8619-tp121121p121224.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.
Re: EEH error in doing DMA with PEX 8619
On Tue, 2017-04-11 at 18:39 -0700, IanJiang wrote: > On Tue, Apr 11, 2017 at 5:37 PM, Benjamin Herrenschmidt [via > linuxppc] > wrote: > > > Another possibility would be if the requests from the PLX have a > > different initiator ID on the bus than the device you are setting > > up > > the DMA for. > > Is there a way to check out the initiator ID in the driver? I'd like > to make sure of this. If you are running bare metal (ie, not under any hypervisor, aka "powernv" platform), the EEH error log will contain a register dump. If you paste that to us, we might be able to decode it, it will tell us more data about the cause of the failure, including possibly the initiator of the failing transaction. The initiator ID (aka RID, aka bus/device/fn) of the DMA packets must match the ones of the struct pci_dev you are using to establish the mapping. Cheers, Ben. > > -- > View this message in context: http://linuxppc.10917.n7.nabble.com/EEH > -error-in-doing-DMA-with-PEX-8619-tp121121p121224.html > Sent from the linuxppc-dev mailing list archive at Nabble.com.
Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv
On Wed, 2017-04-12 at 03:42 +1000, Oliver O'Halloran wrote: > From: Rashmica Gupta > > Adds support for removing bolted (i.e kernel linear mapping) mappings on > powernv. This is needed to support memory hot unplug operations which > are required for the teardown of DAX/PMEM devices. > > Cc: Rashmica Gupta > Cc: Anton Blanchard > Signed-off-by: Oliver O'Halloran > --- > Could the original author of this add their S-o-b? I pulled it out of > Rashmica's memtrace patch, but I remember someone saying Anton wrote > it originally. > --- > arch/powerpc/mm/hash_native_64.c | 31 +++ > 1 file changed, 31 insertions(+) > > diff --git a/arch/powerpc/mm/hash_native_64.c > b/arch/powerpc/mm/hash_native_64.c > index 65bb8f33b399..9ba91d4905a4 100644 > --- a/arch/powerpc/mm/hash_native_64.c > +++ b/arch/powerpc/mm/hash_native_64.c > @@ -407,6 +407,36 @@ static void native_hpte_updateboltedpp(unsigned long > newpp, unsigned long ea, > tlbie(vpn, psize, psize, ssize, 0); > } > > +/* > + * Remove a bolted kernel entry. Memory hotplug uses this. > + * > + * No need to lock here because we should be the only user. As long as this is after the necessary isolation and is called from arch_remove_memory(), I think we should be fine > + */ > +static int native_hpte_removebolted(unsigned long ea, int psize, int ssize) > +{ > + unsigned long vpn; > + unsigned long vsid; > + long slot; > + struct hash_pte *hptep; > + > + vsid = get_kernel_vsid(ea, ssize); > + vpn = hpt_vpn(ea, vsid, ssize); > + > + slot = native_hpte_find(vpn, psize, ssize); > + if (slot == -1) > + return -ENOENT; If slot == -1, it means someone else removed the HPTE entry? Are we racing? I suspect we should never hit this situation during hotunplug, specifically since this is bolted. > + > + hptep = htab_address + slot; > + > + /* Invalidate the hpte */ > + hptep->v = 0; Under DEBUG or otherwise, I would add more checks like 1. was hpte_v & HPTE_V_VALID and BOLTED set? 
If not, we've already invalidated that hpte and we can skip the tlbie. Since this was bolted you might be right that it is always valid and bolted > + > + /* Invalidate the TLB */ > + tlbie(vpn, psize, psize, ssize, 0); The API also does not clear linear_map_hash_slots[] under DEBUG_PAGEALLOC > + return 0; > +} > + > + Balbir Singh.
Re: [PATCH 6/9] powerpc, mm: Enable ZONE_DEVICE on powerpc
Stephen Rothwell writes: > Hi Oliver, > > On Wed, 12 Apr 2017 03:42:30 +1000 Oliver O'Halloran wrote: >> >> Flip the switch. Running around and screaming "IT'S ALIVE" is optional, >> but recommended. >> >> Signed-off-by: Oliver O'Halloran >> --- >> mm/Kconfig | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/Kconfig b/mm/Kconfig >> index 43d000e44424..d696af58f97f 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -724,7 +724,7 @@ config ZONE_DEVICE >> depends on MEMORY_HOTPLUG >> depends on MEMORY_HOTREMOVE >> depends on SPARSEMEM_VMEMMAP >> -depends on X86_64 #arch_add_memory() comprehends device memory >> +depends on (X86_64 || PPC_BOOK3S_64) #arch_add_memory() comprehends >> device memory >> > > That's fine, but at what point do we create > CONFIG_ARCH_HAVE_ZONE_DEVICE, replace the "depends on > " above with "depends on ARCH_HAVE_ZONE_DEVICE" and > select that from the appropriate places? You mean CONFIG_HAVE_ZONE_DEVICE :) A patch to do that, and update x86, would be a good precursor to this series. It could probably go in right now, and be in place for when this series lands. cheers
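A sketch of what that split might look like — the symbol name and the exact wording here are assumptions, and whatever eventually lands may be spelled differently:

```kconfig
# mm/Kconfig — hypothetical sketch
config ARCH_HAS_ZONE_DEVICE
	bool

config ZONE_DEVICE
	bool "Device memory (pmem, etc...) hotplug support"
	depends on MEMORY_HOTPLUG
	depends on MEMORY_HOTREMOVE
	depends on SPARSEMEM_VMEMMAP
	depends on ARCH_HAS_ZONE_DEVICE

# arch/x86/Kconfig would then carry:
#	select ARCH_HAS_ZONE_DEVICE if X86_64
# and arch/powerpc/Kconfig:
#	select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
```

That keeps the arch-specific knowledge in the arch Kconfig files instead of growing an OR-list in mm/Kconfig.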
Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.
Tyrel Datwyler writes: > On 04/11/2017 02:00 AM, Michael Ellerman wrote: >> Tyrel Datwyler writes: >>> I started looking at it when Bharata submitted a patch trying to fix the >>> issue for CPUs, but got side tracked by other things. I suspect that >>> this underflow has actually been an issue for quite some time, and we >>> are just now becoming aware of it thanks to the recount_t patchset being >>> merged. >> >> Yes I agree. Which means it might be broken in existing distros. > > Definitely. I did some profiling last night, and I understand the > hotplug case. It turns out to be as I suggested in the original thread > about CPUs. When the devicetree code was worked to move the tree out of > proc and into sysfs the sysfs detach code added a of_node_put to remove > the original of_init reference. pSeries Being the sole original > *dynamic* device tree user we had always issued a of_node_put in our > dlpar specific detach function to achieve that end. So, this should be a > pretty straight forward trivial fix. Excellent, thanks. > However, for the case where devices are present at boot it appears we a > leaking a lot of references resulting in the device nodes never actually > being released/freed after a dlpar remove. In the CPU case after boot I > count 8 more references taken than the hotplug case, and corresponding > of_node_put's are not called at dlpar remove time either. That will take > some time to track them down, review and clean up. Yes that is a perennial problem unfortunately which we've never come up with a good solution for. The (old) patch below might help track some of them down. 
I remember having a script to process the output of the trace and find mismatches, but I can't find it right now - but I'm sure you can hack up something :) cheers diff --git a/arch/powerpc/include/asm/trace.h b/arch/powerpc/include/asm/trace.h index 32e36b16773f..ad32365082a0 100644 --- a/arch/powerpc/include/asm/trace.h +++ b/arch/powerpc/include/asm/trace.h @@ -168,6 +168,44 @@ TRACE_EVENT(hash_fault, __entry->addr, __entry->access, __entry->trap) ); +TRACE_EVENT(of_node_get, + + TP_PROTO(struct device_node *dn, int val), + + TP_ARGS(dn, val), + + TP_STRUCT__entry( + __field(struct device_node *, dn) + __field(int, val) + ), + + TP_fast_assign( + __entry->dn = dn; + __entry->val = val; + ), + + TP_printk("get %d -> %d %s", __entry->val - 1, __entry->val, __entry->dn->full_name) +); + +TRACE_EVENT(of_node_put, + + TP_PROTO(struct device_node *dn, int val), + + TP_ARGS(dn, val), + + TP_STRUCT__entry( + __field(struct device_node *, dn) + __field(int, val) + ), + + TP_fast_assign( + __entry->dn = dn; + __entry->val = val; + ), + + TP_printk("put %d -> %d %s", __entry->val + 1, __entry->val, __entry->dn->full_name) +); + #endif /* _TRACE_POWERPC_H */ #undef TRACE_INCLUDE_PATH diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c index c647bd1b6903..f5c3d761f3cd 100644 --- a/drivers/of/dynamic.c +++ b/drivers/of/dynamic.c @@ -14,6 +14,8 @@ #include "of_private.h" +#include + /** * of_node_get() - Increment refcount of a node * @node: Node to inc refcount, NULL is supported to simplify writing of @@ -23,8 +25,12 @@ */ struct device_node *of_node_get(struct device_node *node) { - if (node) + if (node) { kobject_get(&node->kobj); + + trace_of_node_get(node, atomic_read(&node->kobj.kref.refcount)); + } + return node; } EXPORT_SYMBOL(of_node_get); @@ -36,8 +42,10 @@ EXPORT_SYMBOL(of_node_get); */ void of_node_put(struct device_node *node) { - if (node) + if (node) { kobject_put(&node->kobj); + trace_of_node_put(node, atomic_read(&node->kobj.kref.refcount)); + } } 
EXPORT_SYMBOL(of_node_put);
Re: [PATCH V4 7/7] cxl: Add psl9 specific code
Frederic Barrat writes: > Le 07/04/2017 à 16:11, Christophe Lombard a écrit : >> The new Coherent Accelerator Interface Architecture, level 2, for the >> IBM POWER9 brings new content and features: >> - POWER9 Service Layer >> - Registers >> - Radix mode >> - Process element entry >> - Dedicated-Shared Process Programming Model >> - Translation Fault Handling >> - CAPP >> - Memory Context ID >> If a valid mm_struct is found the memory context id is used for each >> transaction associated with the process handle. The PSL uses the >> context ID to find the corresponding process element. >> >> Signed-off-by: Christophe Lombard >> --- > > > I'm ok with the code. However checkpatch is complaining about a > tab/space error in native.c I already fixed it up when I applied them (and a bunch of other things). > If you have a quick respin, I also have a comment below about the > documentation. So please send me an incremental patch to update the doco and I'll squash it before merging the series. cheers
Re: [PATCH V4 6/7] cxl: Isolate few psl8 specific calls
Frederic Barrat writes: > Le 07/04/2017 à 16:11, Christophe Lombard a écrit : >> Point out the specific Coherent Accelerator Interface Architecture, >> level 1, registers. >> Code and functions specific to PSL8 (CAIA1) must be framed. >> >> Signed-off-by: Christophe Lombard >> --- > > There are a few changes in native.c which are about splitting long > strings, but that's minor. And the rest looks ok. It is minor, so I fixed it up when applying. But in future please don't split long strings, it makes them harder to grep for. cheers
Re: [PATCH 3/9] powerpc/mm: Add _PAGE_DEVMAP for ppc64.
On Wednesday 12 April 2017 05:49 AM, Stephen Rothwell wrote:
> Hi Oliver,
>
> On Wed, 12 Apr 2017 03:42:27 +1000 Oliver O'Halloran wrote:
>> From: "Aneesh Kumar K.V"
>>
>> Add a _PAGE_DEVMAP bit for PTE and DAX PMD entries. PowerPC doesn't
>> currently support PUD faults so we haven't extended it to the PUD
>> level.
>>
>> Cc: Aneesh Kumar K.V
>> Signed-off-by: Oliver O'Halloran
>
> This needs Aneesh's Signed-off-by.

Signed-off-by: Aneesh Kumar K.V

-aneesh
Re: [PATCH 8/9] powerpc/mm: Wire up hpte_removebolted for powernv
On 12/04/17 10:18, Stephen Rothwell wrote:
> Hi Oliver,
>
> On Wed, 12 Apr 2017 08:50:56 +1000 Anton Blanchard wrote:
>>> From: Rashmica Gupta
>>>
>>> Adds support for removing bolted (i.e kernel linear mapping) mappings
>>> on powernv. This is needed to support memory hot unplug operations
>>> which are required for the teardown of DAX/PMEM devices.
>>>
>>> Cc: Rashmica Gupta
>>> Cc: Anton Blanchard
>>> Signed-off-by: Oliver O'Halloran
>>> ---
>>> Could the original author of this add their S-o-b? I pulled it out of
>>> Rashmica's memtrace patch, but I remember someone saying Anton wrote
>>> it originally.
>>
>> I did.
>>
>> Signed-off-by: Anton Blanchard
>
> If you are going to claim that Rashmica authored this patch (and you do
> with the From: line above), then you need her Signed-off-by as well.

Oliver, can you change the 'From' to a 'Reviewed By'?
Re: powerpc/crypto/crc32c-vpmsum: Fix missing preempt_disable()
On Thu, 2017-04-06 at 13:34:38 UTC, Michael Ellerman wrote: > In crc32c_vpmsum() we call enable_kernel_altivec() without first > disabling preemption, which is not allowed: > > WARNING: CPU: 9 PID: 2949 at ../arch/powerpc/kernel/process.c:277 > enable_kernel_altivec+0x100/0x120 > Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio > libcrc32c vmx_crypto ... > CPU: 9 PID: 2949 Comm: docker Not tainted > 4.11.0-rc5-compiler_gcc-6.3.1-00033-g308ac7563944 #381 > ... > NIP [c001e320] enable_kernel_altivec+0x100/0x120 > LR [d3df0910] crc32c_vpmsum+0x108/0x150 [crc32c_vpmsum] > Call Trace: > 0xc138fd09 (unreliable) > crc32c_vpmsum+0x108/0x150 [crc32c_vpmsum] > crc32c_vpmsum_update+0x3c/0x60 [crc32c_vpmsum] > crypto_shash_update+0x88/0x1c0 > crc32c+0x64/0x90 [libcrc32c] > dm_bm_checksum+0x48/0x80 [dm_persistent_data] > sb_check+0x84/0x120 [dm_thin_pool] > dm_bm_validate_buffer.isra.0+0xc0/0x1b0 [dm_persistent_data] > dm_bm_read_lock+0x80/0xf0 [dm_persistent_data] > __create_persistent_data_objects+0x16c/0x810 [dm_thin_pool] > dm_pool_metadata_open+0xb0/0x1a0 [dm_thin_pool] > pool_ctr+0x4cc/0xb60 [dm_thin_pool] > dm_table_add_target+0x16c/0x3c0 > table_load+0x184/0x400 > ctl_ioctl+0x2f0/0x560 > dm_ctl_ioctl+0x38/0x50 > do_vfs_ioctl+0xd8/0x920 > SyS_ioctl+0x68/0xc0 > system_call+0x38/0xfc > > It used to be sufficient just to call pagefault_disable(), because that > also disabled preemption. But the two were decoupled in commit 8222dbe21e79 > ("sched/preempt, mm/fault: Decouple preemption from the page fault > logic") in mid 2015. > > So add the missing preempt_disable/enable(). We should also call > disable_kernel_fp(), although it does nothing by default, there is a > debug switch to make it active and all enables should be paired with > disables. > > Fixes: 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c") > Cc: sta...@vger.kernel.org # v4.8+ > Signed-off-by: Michael Ellerman Applied to powerpc fixes. 
https://git.kernel.org/powerpc/c/4749228f022893faf54a3dbc70796f cheers
Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags
On 31/03/17 12:37, Oliver O'Halloran wrote: On Book3s we have two PTE flags used to mark cache-inhibited mappings: _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page table dumper only looks at the generic _PAGE_NO_CACHE which is defined to be _PAGE_TOLERANT. This patch modifies the dumper so both flags are shown in the dump. Cc: Rashmica Gupta Signed-off-by: Oliver O'Halloran Should we also add in _PAGE_SAO that is in Book3s? --- arch/powerpc/mm/dump_linuxpagetables.c | 13 + 1 file changed, 13 insertions(+) diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c index 49abaf4dc8e3..e7cbfd5a0940 100644 --- a/arch/powerpc/mm/dump_linuxpagetables.c +++ b/arch/powerpc/mm/dump_linuxpagetables.c @@ -154,11 +154,24 @@ static const struct flag_info flag_array[] = { .clear = " ", }, { #endif +#ifndef CONFIG_PPC_BOOK3S_64 .mask = _PAGE_NO_CACHE, .val= _PAGE_NO_CACHE, .set= "no cache", .clear = "", }, { +#else + .mask = _PAGE_NON_IDEMPOTENT, + .val= _PAGE_NON_IDEMPOTENT, + .set= "non-idempotent", + .clear = " ", + }, { + .mask = _PAGE_TOLERANT, + .val= _PAGE_TOLERANT, + .set= "tolerant", + .clear = "", + }, { +#endif #ifdef CONFIG_PPC_BOOK3S_64 .mask = H_PAGE_BUSY, .val= H_PAGE_BUSY,
Re: [PATCH 2/2] powerpc/mm: add phys addr to linux page table dump
On 31/03/17 12:37, Oliver O'Halloran wrote: The current page table dumper scans the linux page tables and coalesces mappings with adjacent virtual addresses and similar PTE flags. This behaviour is somewhat broken when you consider the IOREMAP space where entirely unrelated mappings will appear to be contiguous. This patch modifies the range coalescing so that only ranges that are both physically and virtually contiguous are combined. This patch also adds to the dump output the physical address at the start of each range. Cc: Rashmica Gupta Signed-off-by: Oliver O'Halloran --- arch/powerpc/mm/dump_linuxpagetables.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c index e7cbfd5a0940..85e6a45bd7ee 100644 --- a/arch/powerpc/mm/dump_linuxpagetables.c +++ b/arch/powerpc/mm/dump_linuxpagetables.c @@ -56,6 +56,8 @@ struct pg_state { struct seq_file *seq; const struct addr_marker *marker; unsigned long start_address; + unsigned long start_pa; + unsigned long last_pa; unsigned int level; u64 current_flags; }; @@ -265,7 +267,9 @@ static void dump_addr(struct pg_state *st, unsigned long addr) const char *unit = units; unsigned long delta; - seq_printf(st->seq, "0x%016lx-0x%016lx ", st->start_address, addr-1); + seq_printf(st->seq, "0x%016lx-0x%016lx ", st->start_address, addr-1); + seq_printf(st->seq, "%016lx ", st->start_pa); + delta = (addr - st->start_address) >> 10; /* Work out what appropriate unit to use */ while (!(delta & 1023) && unit[1]) { @@ -280,11 +284,15 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned int level, u64 val) { u64 flag = val & pg_level[level].mask; + u64 pa = val & PTE_RPN_MASK; + /* At first no level is set */ if (!st->level) { st->level = level; st->current_flags = flag; st->start_address = addr; + st->start_pa = pa; + st->last_pa = pa; seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); /* * Dump the section of 
virtual memory when: @@ -292,9 +300,11 @@ static void note_page(struct pg_state *st, unsigned long addr, * - we change levels in the tree. * - the address is in a different section of memory and is thus * used for a different purpose, regardless of the flags. +* - the pa of this page is not adjacent to the last inspected page */ } else if (flag != st->current_flags || level != st->level || - addr >= st->marker[1].start_address) { + addr >= st->marker[1].start_address || + pa != st->last_pa + PAGE_SIZE) { /* Check the PTE flags */ if (st->current_flags) { @@ -318,8 +328,12 @@ static void note_page(struct pg_state *st, unsigned long addr, seq_printf(st->seq, "---[ %s ]---\n", st->marker->name); } st->start_address = addr; + st->start_pa = pa; + st->last_pa = pa; st->current_flags = flag; st->level = level; + } else { + st->last_pa = pa; } } Makes sense to me! Reviewed-by: Rashmica Gupta
[PATCH] powerpc/mm/hash: don't opencode VMALLOC_INDEX
Also remove wrong indentation to fix checkpatch.pl warning. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c index 98ae810b8c21..3d580ccf4b71 100644 --- a/arch/powerpc/mm/slb.c +++ b/arch/powerpc/mm/slb.c @@ -131,9 +131,9 @@ static void __slb_flush_and_rebolt(void) "slbmte%2,%3\n" "isync" :: "r"(mk_vsid_data(VMALLOC_START, mmu_kernel_ssize, vflags)), - "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 1)), - "r"(ksp_vsid_data), - "r"(ksp_esid_data) + "r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, VMALLOC_INDEX)), + "r"(ksp_vsid_data), + "r"(ksp_esid_data) : "memory"); } -- 2.7.4
Re: [RFC PATCH 6/7] powerpc/hugetlb: Add code to support to follow huge page directory entries
On 04/11/2017 03:55 PM, Michael Ellerman wrote: > "Aneesh Kumar K.V" writes: > >> Add follow_huge_pd implementation for ppc64. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/mm/hugetlbpage.c | 42 >> ++ >> 1 file changed, 42 insertions(+) >> >> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c >> index 80f6d2ed551a..9d66d4f810aa 100644 >> --- a/arch/powerpc/mm/hugetlbpage.c >> +++ b/arch/powerpc/mm/hugetlbpage.c >> @@ -17,6 +17,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> #include >> #include >> @@ -618,6 +620,10 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, >> } >> >> /* >> + * 64 bit book3s use generic follow_page_mask >> + */ >> +#ifndef CONFIG_PPC_BOOK3S_64 > I think it's always easier to follow if you use: > > #ifdef x > ... > #else /* !x */ > ... > #endif > > ie. in this case put the Book3S 64 case first and the existing code in the > #else. Yeah, it was difficult to read in the first glance.
Re: [PATCH 1/9] mm/huge_memory: Use zap_deposited_table() more
Oliver O'Halloran writes: > Depending flags of the PMD being zapped there may or may not be a > deposited pgtable to be freed. In two of the three cases this is open > coded while the third uses the zap_deposited_table() helper. This patch > converts the others to use the helper to clean things up a bit. > > Cc: "Aneesh Kumar K.V" > Cc: "Kirill A. Shutemov" > Cc: linux...@kvack.org > Signed-off-by: Oliver O'Halloran Reviewed-by: Aneesh Kumar K.V > --- > For reference: > > void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) > { > pgtable_t pgtable; > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > pte_free(mm, pgtable); > atomic_long_dec(&mm->nr_ptes); > } > --- > mm/huge_memory.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index b787c4cfda0e..aa01dd47cc65 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1615,8 +1615,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, > if (is_huge_zero_pmd(orig_pmd)) > tlb_remove_page_size(tlb, pmd_page(orig_pmd), > HPAGE_PMD_SIZE); > } else if (is_huge_zero_pmd(orig_pmd)) { > - pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd)); > - atomic_long_dec(&tlb->mm->nr_ptes); > + zap_deposited_table(tlb->mm, pmd); > spin_unlock(ptl); > tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); > } else { > @@ -1625,10 +1624,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct > vm_area_struct *vma, > VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); > VM_BUG_ON_PAGE(!PageHead(page), page); > if (PageAnon(page)) { > - pgtable_t pgtable; > - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd); > - pte_free(tlb->mm, pgtable); > - atomic_long_dec(&tlb->mm->nr_ptes); > + zap_deposited_table(tlb->mm, pmd); > add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); > } else { > if (arch_needs_pgtable_deposit()) > -- > 2.9.3
Re: [PATCH 2/9] mm/huge_memory: Deposit a pgtable for DAX PMD faults when required
Oliver O'Halloran writes:

> Although all architectures use a deposited page table for THP on anonymous VMAs
> some architectures (s390 and powerpc) require the deposited storage even for
> file backed VMAs due to quirks of their MMUs. This patch adds support for
> depositing a table in DAX PMD fault handling path for archs that require it.
> Other architectures should see no functional changes.
>
> Cc: "Aneesh Kumar K.V"
> Cc: linux...@kvack.org
> Signed-off-by: Oliver O'Halloran

Reviewed-by: Aneesh Kumar K.V

> ---
>  mm/huge_memory.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aa01dd47cc65..a84909cf20d3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -715,7 +715,8 @@ int do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  }
>
>  static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> -		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write)
> +		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
> +		pgtable_t pgtable)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pmd_t entry;
> @@ -729,6 +730,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		entry = pmd_mkyoung(pmd_mkdirty(entry));
>  		entry = maybe_pmd_mkwrite(entry, vma);
>  	}
> +
> +	if (pgtable) {
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		atomic_long_inc(&mm->nr_ptes);
> +	}
> +
>  	set_pmd_at(mm, addr, pmd, entry);
>  	update_mmu_cache_pmd(vma, addr, pmd);
>  	spin_unlock(ptl);
> @@ -738,6 +745,7 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, pfn_t pfn, bool write)
>  {
>  	pgprot_t pgprot = vma->vm_page_prot;
> +	pgtable_t pgtable = NULL;
>  	/*
>  	 * If we had pmd_special, we could avoid all these restrictions,
>  	 * but we need to be consistent with PTEs and architectures that
> @@ -752,9 +760,15 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return VM_FAULT_SIGBUS;
>
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm, addr);
> +		if (!pgtable)
> +			return VM_FAULT_OOM;
> +	}
> +
>  	track_pfn_insert(vma, &pgprot, pfn);
>
> -	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
> +	insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write, pgtable);
>  	return VM_FAULT_NOPAGE;
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
> @@ -1611,6 +1625,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			  tlb->fullmm);
>  	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	if (vma_is_dax(vma)) {
> +		if (arch_needs_pgtable_deposit())
> +			zap_deposited_table(tlb->mm, pmd);
>  		spin_unlock(ptl);
>  		if (is_huge_zero_pmd(orig_pmd))
>  			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
> --
> 2.9.3
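The shape of this patch — allocate a backing table up front only when the architecture asks for one, thread it through the insert path, and let the zap path free it later — can be sketched in plain C. Everything here is a stand-in (the function names, the error code); it is an illustration, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_FAULT_OOM (-1)	/* stand-in for VM_FAULT_OOM */

/* Stand-in for arch_needs_pgtable_deposit(); assume a powerpc-like arch
 * that always wants a deposit, for the purposes of the demo. */
static bool toy_arch_needs_deposit(void)
{
	return true;
}

/* Stand-in for insert_pfn_pmd(): stash the pre-allocated table, if any,
 * alongside the mapping so the teardown path can find and free it. */
static int toy_insert(void **deposit_slot, void *table)
{
	if (table)
		*deposit_slot = table;
	return 0;
}

/* Stand-in for vmf_insert_pfn_pmd(): the allocation happens before any
 * state is modified, so an allocation failure can bail out cleanly. */
static int toy_fault(void **deposit_slot)
{
	void *table = NULL;

	if (toy_arch_needs_deposit()) {
		table = malloc(64);
		if (!table)
			return TOY_FAULT_OOM;
	}
	return toy_insert(deposit_slot, table);
}
```

Allocating before taking any locks or publishing any state is the same ordering the patch uses: pte_alloc_one() runs before track_pfn_insert() and insert_pfn_pmd().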
[PATCH] powerpc/syscalls/trace: Fix mmap in syscalls_trace
This patch uses SYSCALL_DEFINE6 for sys_mmap and sys_mmap2 so that the
meta-data associated with these syscalls is visible to the syscall
tracer. In the absence of this, generic syscalls (defined outside arch)
like munmap etc. are visible in available_events, but
syscall_enter_mmap and syscall_exit_mmap are not. A side-effect of this
change is that the return type has changed from unsigned long to long.

Prior to these changes, we had, under /sys/kernel/tracing:

  cat available_events | grep syscalls | grep map
  syscalls:sys_exit_remap_file_pages
  syscalls:sys_enter_remap_file_pages
  syscalls:sys_exit_munmap
  syscalls:sys_enter_munmap
  syscalls:sys_exit_mremap
  syscalls:sys_enter_mremap

After these changes we have mmap in available_events:

  cat available_events | grep syscalls | grep map
  syscalls:sys_exit_mmap
  syscalls:sys_enter_mmap
  syscalls:sys_exit_remap_file_pages
  syscalls:sys_enter_remap_file_pages
  syscalls:sys_exit_munmap
  syscalls:sys_enter_munmap
  syscalls:sys_exit_mremap
  syscalls:sys_enter_mremap

Sample trace:

  cat-3399 [001] 196.542410: sys_mmap(addr: 7fff922a, len: 2, prot: 3, flags: 812, fd: 3, offset: 1b)
  cat-3399 [001] 196.542443: sys_mmap -> 0x7fff922a
  cat-3399 [001] 196.542668: sys_munmap(addr: 7fff922c, len: 6d2c)
  cat-3399 [001] 196.542677: sys_munmap -> 0x0

Signed-off-by: Balbir Singh
---
Changelog:
  Removed RFC
  Fixed len from unsigned long to size_t
  Added some examples of use of mmap trace

 arch/powerpc/include/asm/syscalls.h |  4 ++--
 arch/powerpc/kernel/syscalls.c      | 16
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 23be8f1..16fab68 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -8,10 +8,10 @@
 struct rtas_args;
 
-asmlinkage unsigned long sys_mmap(unsigned long addr, size_t len,
+asmlinkage long sys_mmap(unsigned long addr, size_t len,
 		unsigned long prot, unsigned long flags,
 		unsigned long fd, off_t offset);
-asmlinkage unsigned long sys_mmap2(unsigned long addr, size_t len,
+asmlinkage long sys_mmap2(unsigned long addr, size_t len,
 		unsigned long prot, unsigned long flags,
 		unsigned long fd, unsigned long pgoff);
 asmlinkage long ppc64_personality(unsigned long personality);
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index de04c9f..a877bf8 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -42,11 +42,11 @@
 #include
 #include
 
-static inline unsigned long do_mmap2(unsigned long addr, size_t len,
+static inline long do_mmap2(unsigned long addr, size_t len,
 			unsigned long prot, unsigned long flags,
 			unsigned long fd, unsigned long off, int shift)
 {
-	unsigned long ret = -EINVAL;
+	long ret = -EINVAL;
 
 	if (!arch_validate_prot(prot))
 		goto out;
@@ -62,16 +62,16 @@ static inline unsigned long do_mmap2(unsigned long addr, size_t len,
 	return ret;
 }
 
-unsigned long sys_mmap2(unsigned long addr, size_t len,
-			unsigned long prot, unsigned long flags,
-			unsigned long fd, unsigned long pgoff)
+SYSCALL_DEFINE6(mmap2, unsigned long, addr, size_t, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, pgoff)
 {
 	return do_mmap2(addr, len, prot, flags, fd, pgoff, PAGE_SHIFT-12);
 }
 
-unsigned long sys_mmap(unsigned long addr, size_t len,
-		unsigned long prot, unsigned long flags,
-		unsigned long fd, off_t offset)
+SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, off_t, offset)
 {
 	return do_mmap2(addr, len, prot, flags, fd, offset, PAGE_SHIFT);
 }
-- 
2.9.3
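The reason SYSCALL_DEFINE6 makes the syscall visible to the tracer is that the macro emits per-syscall metadata alongside the function body; the real macro lives in include/linux/syscalls.h. The toy below shows only the token-pasting half of the idea — how "name plus typed argument list" becomes a long-returning sys_* function, which is also why the patch's return type changes to long. The macro name and two-argument arity here are invented for the illustration:

```c
#include <assert.h>

/* Toy variant of the kernel's SYSCALL_DEFINEn token pasting. The real
 * macro additionally records argument names and types so the ftrace
 * syscall tracer can decode the arguments at runtime. */
#define TOY_SYSCALL_DEFINE2(name, t1, a1, t2, a2) \
	long toy_sys_##name(t1 a1, t2 a2)

/* Expands to: long toy_sys_shift(unsigned long off, int shift) { ... } */
TOY_SYSCALL_DEFINE2(shift, unsigned long, off, int, shift)
{
	return (long)(off >> shift);	/* stand-in for a real syscall body */
}
```

Because the macro always emits a `long` return type, converting sys_mmap/sys_mmap2 to SYSCALL_DEFINE6 necessarily changes their declared return type, matching the header change in the patch.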
Re: [PATCH 1/2] powerpc/mm: fix up pgtable dump flags
Rashmica Gupta writes:

> On 31/03/17 12:37, Oliver O'Halloran wrote:
>> On Book3s we have two PTE flags used to mark cache-inhibited mappings:
>> _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page
>> table dumper only looks at the generic _PAGE_NO_CACHE which is
>> defined to be _PAGE_TOLERANT. This patch modifies the dumper so
>> both flags are shown in the dump.
>>
>> Cc: Rashmica Gupta
>> Signed-off-by: Oliver O'Halloran
>
> Should we also add in _PAGE_SAO that is in Book3s?

I don't think we ever expect to see it in the kernel page tables. But if
we did that would be "interesting".

I've forgotten what the code does with unknown bits, does it already
print them in some way?

If not we should either add that or add _PAGE_SAO and everything else
that could possibly ever be there.

cheers
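The "unknown bits" idea raised here — decode the flags the dumper knows about and report whatever is left over, so a bit like _PAGE_SAO would at least show up as an unrecognised value — can be sketched in a few lines. The bit positions below are invented for the example and are not the real Book3s PTE layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag positions, NOT the real Book3s definitions. */
#define TOY_PAGE_TOLERANT	(UINT64_C(1) << 0)
#define TOY_PAGE_NON_IDEMPOTENT	(UINT64_C(1) << 1)

#define TOY_KNOWN_MASK	(TOY_PAGE_TOLERANT | TOY_PAGE_NON_IDEMPOTENT)

/* Return the bits the dumper has no name for; a dumper could print
 * these raw so unexpected flags remain visible in the output. */
static uint64_t toy_unknown_bits(uint64_t pte)
{
	return pte & ~TOY_KNOWN_MASK;
}
```

Masking off the named bits and printing the remainder is the cheap way to make "everything else that could possibly ever be there" visible without enumerating every flag.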
Re: [PATCH] powerpc/mm/hash: don't opencode VMALLOC_INDEX
"Aneesh Kumar K.V" writes:

> powerpc/mm/hash: don't opencode VMALLOC_INDEX

OK.

> Also remove wrong indentation to fix checkpatch.pl warning.

No thanks :) Or at least do it as a separate patch. I'll fix it up this
time.

cheers

> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 98ae810b8c21..3d580ccf4b71 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -131,9 +131,9 @@ static void __slb_flush_and_rebolt(void)
>  		     "slbmte	%2,%3\n"
>  		     "isync"
>  		     :: "r"(mk_vsid_data(VMALLOC_START, mmu_kernel_ssize, vflags)),
> -			"r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, 1)),
> -			"r"(ksp_vsid_data),
> -			"r"(ksp_esid_data)
> +			"r"(mk_esid_data(VMALLOC_START, mmu_kernel_ssize, VMALLOC_INDEX)),
> +			"r"(ksp_vsid_data),
> +			"r"(ksp_esid_data)
>  		     : "memory");
>  }
>
> --
> 2.7.4