Re: [PATCH kernel v3 11/22] powerpc/pseries/npu: Enable platform support

2018-11-18 Thread Alexey Kardashevskiy



On 16/11/2018 16:25, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:12PM +1100, Alexey Kardashevskiy wrote:
>> We already changed NPU API for GPUs to not to call OPAL and the remaining
>> bit is initializing NPU structures.
>>
>> This uses a new QEMU capability which marks NPU-enabled vPHBs as
>> "IBM,npu-vphb" and initializes an NPU structure per vPHB.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/platforms/pseries/pci.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/pseries/pci.c 
>> b/arch/powerpc/platforms/pseries/pci.c
>> index 41d8a4d..a50d5e4 100644
>> --- a/arch/powerpc/platforms/pseries/pci.c
>> +++ b/arch/powerpc/platforms/pseries/pci.c
>> @@ -29,6 +29,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include "pseries.h"
>>  
>>  #if 0
>> @@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
>>  
>>  void __init pSeries_final_fixup(void)
>>  {
>> +struct pci_controller *hose;
>> +
>>  pSeries_request_regions();
>>  
>>  eeh_probe_devices();
>> @@ -246,6 +249,9 @@ void __init pSeries_final_fixup(void)
>>  ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
>>  ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
>>  #endif
>> +list_for_each_entry(hose, _list, list_node)
>> +if (of_device_is_compatible(hose->dn, "IBM,npu-vphb"))
>> +pnv_npu2_init(hose);
> 
> I take it from this the NPUs are showing up with a compatible property
> that lists the normal PHB value as well as IBM,npu-vphb.  Since AIUI
> the NPUs act quite differently from other (real) PHBs this seems
> bogus.  Shouldn't they be probed separately?

First, bad naming, will think of better one.

"IBM,npu-vphb" is an extra compatible type for otherwise usual pseries
PHB. The  differences are:

1. Initialize an "NPU" (not a NVLink2 bridge but a proper NPU) per a PHB
for context manipulation for a GPU;

2. Kill the default DMA window.

When a GPU is passed to a guest, it looks like:

aik@u1804kvm:~$ lspci
00:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host
Controller (rev 03)
00:01.0 Ethernet controller: Red Hat, Inc Virtio network device
00:02.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 SXM2] (rev a1)
00:03.0 Bridge: IBM Device 04ea (rev 01)
00:04.0 Bridge: IBM Device 04ea (rev 01)

So there are:
- one "struct npu" associated with the pseries PHB (not presented in
lspci but there /proc/device-tree/npuphb0/ with link@0 and link@1)

- 2 NVLink2 bridges, presented in lspci.


-- 
Alexey


Re: [PATCH kernel v3 10/22] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation

2018-11-18 Thread Alexey Kardashevskiy



On 16/11/2018 16:23, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:11PM +1100, Alexey Kardashevskiy wrote:
>> We might have memory@ nodes with "linux,usable-memory" set to zero
>> (for example, to replicate powernv's behaviour for GPU coherent memory)
>> which means that the memory needs an extra initialization but since
>> it can be used afterwards, the pseries platform will try mapping it
>> for DMA so the DMA window needs to cover those memory regions too.
>>
>> This walks through the memory nodes to find the highest RAM address to
>> let a huge DMA window cover that too in case this memory gets onlined
>> later.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 43 +-
>>  1 file changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 78473ac..f818737 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -967,6 +967,47 @@ struct failed_ddw_pdn {
>>  
>>  static LIST_HEAD(failed_ddw_pdn_list);
>>  
>> +static unsigned long read_n_cells(int n, const __be32 **buf)
>> +{
>> +unsigned long result = 0;
>> +
>> +while (n--) {
>> +result = (result << 32) | of_read_number(*buf, 1);
>> +(*buf)++;
>> +}
>> +return result;
>> +}
> 
> Um.. this appears to be re-implementing of_read_number() in terms of
> of_read_number().   Wat!?


This is a cut-n-paste from arch/powerpc/mm/numa.c :) My bad, I did not
think much when I did this.


> 
>> +static phys_addr_t ddw_memory_hotplug_max(void)
>> +{
>> +phys_addr_t max_addr = memory_hotplug_max();
>> +struct device_node *memory;
>> +
>> +for_each_node_by_type(memory, "memory") {
>> +unsigned long start, size;
>> +int ranges, n_mem_addr_cells, n_mem_size_cells, len;
>> +const __be32 *memcell_buf;
>> +
>> +memcell_buf = of_get_property(memory, "reg", );
>> +if (!memcell_buf || len <= 0)
>> +continue;
>> +
>> +n_mem_addr_cells = of_n_addr_cells(memory);
>> +n_mem_size_cells = of_n_size_cells(memory);
>> +
>> +/* ranges in cell */
>> +ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
>> +
>> +/* these are order-sensitive, and modify the buffer pointer */
>> +start = read_n_cells(n_mem_addr_cells, _buf);
>> +size = read_n_cells(n_mem_size_cells, _buf);
>> +
>> +max_addr = max_t(phys_addr_t, max_addr, start + size);
>> +}
>> +
>> +return max_addr;
>> +}
> 
> Is there really no existing place we keep track of maxmimum possible
> memory address?

There are:

1. memblocks from mm/memblock.c - populated at the boot time from
"usable" memory@ nodes and mine are not "usable";

2. drmem from mm/drmem.c - populated from ibm,dynamic-memory-v2 - these
things do not support sparse regions so when I tried these with a GPU
RAM region mapped at 0x2440 - the device tree became quickly
over 1 MB and then qemu crashed, I did not debug any further as this
memory is not hotpluggable anyway from the rtas/qemu prospective, in
other words it is not something the user can hotplug or unplug.

And that is it afaict.


> 
>>  /*
>>   * If the PE supports dynamic dma windows, and there is space for a table
>>   * that can map all pages in a linear offset, then setup such a table,
>> @@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>> device_node *pdn,
>>  }
>>  /* verify the window * number of ptes will map the partition */
>>  /* check largest block * page size > max memory hotplug addr */
>> -max_addr = memory_hotplug_max();
>> +max_addr = ddw_memory_hotplug_max();
>>  if (query.largest_available_block < (max_addr >> page_shift)) {
>>  dev_dbg(>dev, "can't map partition max 0x%llx with %u "
>>"%llu-sized pages\n", max_addr,  
>> query.largest_available_block,
> 

-- 
Alexey


Re: [PATCH kernel v3 09/22] powerpc/pseries/iommu: Force default DMA window removal

2018-11-18 Thread Alexey Kardashevskiy



On 16/11/2018 15:54, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:10PM +1100, Alexey Kardashevskiy wrote:
>> It is quite common for a device to support more than 32bit but less than
>> 64bit for DMA, for example, GPUs often support 42..50bits. However
>> the pseries platform only allows huge DMA window (the one which allows
>> the use of more than 2GB of DMA space) for 64bit-capable devices mostly
>> because:
>>
>> 1. we may have 32bit and >32bit devices on the same IOMMU domain and
>> we cannot place the new big window where the 32bit one is located;
>>
>> 2. the existing hardware only supports the second window at very high
>> offset of 1<<59 == 0x0800....
>>
>> So in order to allow 33..59bit DMA, we have to remove the default DMA
>> window and place a huge one there instead.
>>
>> The PAPR spec says that the platform may decide not to use the default
>> window and remove it using DDW RTAS calls. There are few possible ways
>> for the platform to decide:
>>
>> 1. look at the device IDs and decide in advance that such and such
>> devices are capable of more than 32bit DMA (powernv's sketchy bypass
>> does something like this - it drops the default window if all devices
>> on the PE are from the same vendor) - this is not great as involves
>> guessing because, unlike sketchy bypass, the GPU case involves 2 vendor
>> ids and does not scale;
>>
>> 2. advertise 1 available DMA window in the hypervisor via
>> ibm,query-pe-dma-window so the pseries platform could take it as a clue
>> that if more bits for DMA are needed, it has to remove the default
>> window - this is not great as it is implicit clue rather than direct
>> instruction;
>>
>> 3. removing the default DMA window at all it not really an option as
>> PAPR mandates its presense at the guest boot time;
>>
>> 4. make the hypervisor explicitly tell the guest that the default window
>> is better be removed so the guest does not have to think hard and can
>> simply do what requested and this is what this patch does.
> 
> This approach only makes sense if the hypervisor has better
> information as to what to do that the guest does.  It's not clear to
> me why that would be the case.  Isn't the DMA capabilities of the
> device something the driver should know, in which case it can decide
> based on that?

The device knows it can do 42bits so it will request DMA mask for 42bits
and then the platform has to deal with it, the device has no control
over DMA windows.

Then the platform tries to make everything work, which sadly includes
32bit-DMA devices so the default DMA window stays there and for 42bit
devices there is no other way than to go via the smaller window as the
only other window we can create is beyond the reach of the GPU.

We have so called "sketchy bypass" hack for some other GPUs (which
Christoph is trying to get rid of) at
https://github.com/aik/linux/blob/nv2/arch/powerpc/platforms/powernv/pci-ioda.c#L1885

which is powernv and which seemed a solution there and which I am trying
to reimplement here.


> 
>>
>> This makes use of the latter approach and exploits a new
>> "qemu,dma-force-remove-default" flag in a vPHB.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/platforms/pseries/iommu.c | 28 +++---
>>  1 file changed, 25 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>> b/arch/powerpc/platforms/pseries/iommu.c
>> index 9ece42f..78473ac 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -54,6 +54,7 @@
>>  #include "pseries.h"
>>  
>>  #define DDW_INVALID_OFFSET  ((uint64_t)-1)
>> +#define DDW_INVALID_LIOBN   ((uint32_t)-1)
>>  
>>  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
>>  {
>> @@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
>>   *
>>   * returns the dma offset for use by dma_set_mask
>>   */
>> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>> +static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
>> +u32 default_liobn)
>>  {
>>  int len, ret;
>>  struct ddw_query_response query;
>> @@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>> device_node *pdn)
>>  if (ret)
>>  goto out_failed;
>>  
>> +/*
>> + * The device tree has a request to force remove the default window,
>> + * do this.
>> + */
>> +if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
>> +rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
>> +dev_dbg(>dev, "Could not remove window");
>> +goto out_failed;
>> +}
>> +
>> /*
>>   * Query if there is a second window of size to map the
>>   * whole partition.  Query returns number of windows, largest
>> @@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, 
>> u64 dma_mask)
>>  pdev = to_pci_dev(dev);
>>  
>>  /* 

Re: [PATCH kernel v3 06/22] powerpc/powernv: Detach npu struct from pnv_phb

2018-11-18 Thread Alexey Kardashevskiy



On 14/11/2018 15:28, Alistair Popple wrote:
> Hi Alexey,
> 
> On Tuesday, 13 November 2018 7:28:07 PM AEDT Alexey Kardashevskiy wrote:
>>  static struct npu *npdev_to_npu(struct pci_dev *npdev)
>>  {
>> -struct pnv_phb *nphb;
>> +struct pci_controller *hose = pci_bus_to_host(npdev->bus);
>> +struct npu *npu;
>>
>> -nphb = pci_bus_to_host(npdev->bus)->private_data;
>> +list_for_each_entry(npu, _devices, next)
> 
> This is called from the ATSD path which is (or at least has been) quite a 
> performance critical path so searching through all the NPUs in a list may be 
> problematic.
> 
> I guess currently it wont make any practical difference as we only ever have 
> 2 
> NPUs, but in future they may get divided into more logical NPUs. Would it be 
> possible to store a back-pointer somewhere so we can avoid the lookup?


It is quite possible even now with iommu_group_get() + container_of() +
iommu_group_put(), I'll try that in respin.


> 
>> +if (hose == npu->hose)
>> +return npu;
>>
>> -return >npu;
>> +WARN_ON_ONCE(1);
>> +return NULL;
>>  }
>>
>>  /* Maximum number of nvlinks per npu */
>> @@ -505,6 +531,9 @@ static void acquire_atsd_reg(struct npu_context
>> *npu_context, continue;
>>
>>  npu = npdev_to_npu(npdev);
>> +if (!npu)
>> +continue;
>> +
>>  mmio_atsd_reg[i].npu = npu;
>>  mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
>>  while (mmio_atsd_reg[i].reg < 0) {
>> @@ -701,6 +730,8 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev
>> *gpdev,
>>
>>  nphb = pci_bus_to_host(npdev->bus)->private_data;
>>  npu = npdev_to_npu(npdev);
>> +if (!npu)
>> +return ERR_PTR(-ENODEV);
>>
>>  /*
>>   * Setup the NPU context table for a particular GPU. These need to be
>> @@ -821,6 +852,8 @@ void pnv_npu2_destroy_context(struct npu_context
>> *npu_context,
>>
>>  nphb = pci_bus_to_host(npdev->bus)->private_data;
>>  npu = npdev_to_npu(npdev);
>> +if (!npu)
>> +return;
>>  nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
>>  if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
>>  _index)))
>> @@ -898,9 +931,15 @@ int pnv_npu2_init(struct pnv_phb *phb)
>>  struct pci_dev *gpdev;
>>  static int npu_index;
>>  uint64_t rc = 0;
>> +struct pci_controller *hose = phb->hose;
>> +struct npu *npu;
>> +int ret;
>>
>> -phb->npu.nmmu_flush =
>> -of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush");
>> +npu = kzalloc(sizeof(*npu), GFP_KERNEL);
>> +if (!npu)
>> +return -ENOMEM;
>> +
>> +npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
>>  for_each_child_of_node(phb->hose->dn, dn) {
>>  gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn));
>>  if (gpdev) {
>> @@ -914,18 +953,31 @@ int pnv_npu2_init(struct pnv_phb *phb)
>>  }
>>  }
>>
>> -for (i = 0; !of_property_read_u64_index(phb->hose->dn, "ibm,mmio-atsd",
>> +for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
>>  i, _atsd); i++)
>> -phb->npu.mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
>> +npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
>>
>> -pr_info("NPU%lld: Found %d MMIO ATSD registers", phb->opal_id, i);
>> -phb->npu.mmio_atsd_count = i;
>> -phb->npu.mmio_atsd_usage = 0;
>> +pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i);
>> +npu->mmio_atsd_count = i;
>> +npu->mmio_atsd_usage = 0;
>>  npu_index++;
>> -if (WARN_ON(npu_index >= NV_MAX_NPUS))
>> -return -ENOSPC;
>> +if (WARN_ON(npu_index >= NV_MAX_NPUS)) {
>> +ret = -ENOSPC;
>> +goto fail_exit;
>> +}
>>  max_npu2_index = npu_index;
>> -phb->npu.index = npu_index;
>> +npu->index = npu_index;
>> +npu->hose = hose;
>> +
>> +list_add(>next, _devices);
> 
> Guess we don't need any locking here as the list gets setup once during boot 
> long before loading the driver and is never modified right?


Correct.


> 
> - Alistair
> 
>>  return 0;
>> +
>> +fail_exit:
>> +for (i = 0; i < npu->mmio_atsd_count; ++i)
>> +iounmap(npu->mmio_atsd_regs[i]);
>> +
>> +kfree(npu);
>> +
>> +return ret;
>>  }
> 
> 

-- 
Alexey


RE: [PATCH v2 2/2] dpaa_eth: add ethtool coalesce control

2018-11-18 Thread Madalin-cristian Bucur
> -Original Message-
> From: David Miller 
> Sent: Saturday, November 17, 2018 5:42 AM
> To: Madalin-cristian Bucur 
> Subject: Re: [PATCH v2 2/2] dpaa_eth: add ethtool coalesce control
> 
> From: Madalin Bucur 
> Date: Tue, 13 Nov 2018 18:29:51 +0200
> 
> > +   for_each_cpu(cpu, cpus) {
> > +   portal = qman_get_affine_portal(cpu);
> > +   res = qman_portal_set_iperiod(portal, period);
> > +   if (res)
> > +   return res;
> > +   res = qman_dqrr_set_ithresh(portal, thresh);
> > +   if (res)
> > +   return res;
> 
> Nope, you can't do it like this.
> 
> If any intermediate change fails, you have to unwind all of the
> changes made up until that point.
> 
> Which means you'll have to store the previous setting somewhere
> and reinstall those saved values in the error path.

Thank you, I'll come back with a v3.

Madalin


[PATCH kernel] powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status

2018-11-18 Thread Alexey Kardashevskiy
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
skiboot's NPU driver does not touch the pci_error_type parameter so
it might have garbage but the powernv code analyzes it nevertheless.

This initializes pcierr and fstate to zero in all call sites.

Signed-off-by: Alexey Kardashevskiy 
---

Without this, this happens:

pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
EEH: PHB#6 failure detected, location: N/A
CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 
4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
Call Trace:
[c03fea9df9c0] [c0a990ec] dump_stack+0xb0/0xf4 (unreliable)
[c03fea9dfa00] [c0038480] eeh_dev_check_failure+0x1f0/0x5f0
[c03fea9dfaa0] [c00a2768] pnv_pci_read_config+0x128/0x160
[c03fea9dfae0] [c05d2b0c] pci_bus_read_config_dword+0x9c/0xf0
[c03fea9dfb40] [c05df3d4] pci_save_state+0x64/0x250
[c03fea9dfbc0] [c05e0730] pci_dev_save_and_disable+0x70/0xa0
[c03fea9dfbf0] [c05e4078] pci_try_reset_function+0x48/0xc0
[c03fea9dfc20] [c0081cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
[c03fea9dfcf0] [c0081ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 
[vfio]
[c03fea9dfd10] [c039fd84] do_vfs_ioctl+0xd4/0xa00
[c03fea9dfdb0] [c03a07b4] ksys_ioctl+0x104/0x120
[c03fea9dfe00] [c03a07f8] sys_ioctl+0x28/0x80
[c03fea9dfe20] [c000b3a4] system_call+0x5c/0x70
EEH: Detected error on PHB#6
EEH: This PCI device has failed 1 times in the last hour and will be 
permanently disabled after 5 fail
ures.
EEH: Notify device drivers to shutdown
EEH: Beginning: 'error_detected(IO frozen)'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can 
recover'
EEH: Collect temporary log
pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
EEH: Reset without hotplug activity
iommu: Removing device 0006:00:01.0 from group 4
iommu: Removing device 0006:00:01.1 from group 4
iommu: Removing device 0006:00:02.0 from group 4
iommu: Removing device 0006:00:02.1 from group 4
pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
EEH: Sleep 5s ahead of partial hotplug
pci 0004:04 : [PE# 00] Setting up window#0 0..3fff pg=1000
pci 0004:05 : [PE# 18] Setting up window#0 0..3fff pg=1000
pci 0004:06 : [PE# 30] Setting up window#0 0..3fff pg=1000
pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fff pg=1000
pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fff pg=1000
pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fff pg=1000
EEH: Beginning: 'slot_reset'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: Finished:'slot_reset' with aggregate recovery state:'none'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: Finished:'resume'
EEH: Recovery successful.
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 8 
 arch/powerpc/platforms/powernv/pci-ioda.c| 4 ++--
 arch/powerpc/platforms/powernv/pci.c | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index abc0be7..f380789 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
 static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
 {
struct pnv_phb *phb = pe->phb->private_data;
-   u8 fstate;
-   __be16 pcierr;
+   u8 

Re: [PATCH 1/7] drivers/cpufreq: change CONFIG_6xx to CONFIG_PPC_BOOK3S_32

2018-11-18 Thread Viresh Kumar
On 17-11-18, 10:24, Christophe Leroy wrote:
> Today, powerpc has three CONFIG labels which means exactly the same:
> - CONFIG_6xx
> - CONFIG_PPC_BOOK3S_32
> - CONFIG_PPC_STD_MMU_32
> 
> By consistency with PPC64, CONFIG_PPC_BOOK3S_32 is the preferred one.
> Using a label with includes _PPC_ also makes it clearer that it is
> linked to powerpc.
> 
> In preparation of the removal of CONFIG_6xx, this patch replaces it
> by CONFIG_PPC_BOOK3S_32
> 
> Signed-off-by: Christophe Leroy 
> ---
>  drivers/cpufreq/pmac32-cpufreq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/pmac32-cpufreq.c 
> b/drivers/cpufreq/pmac32-cpufreq.c
> index 61ae06ca008e..52f0d91d30c1 100644
> --- a/drivers/cpufreq/pmac32-cpufreq.c
> +++ b/drivers/cpufreq/pmac32-cpufreq.c
> @@ -128,7 +128,7 @@ static int cpu_750fx_cpu_speed(int low_speed)
>   mtspr(SPRN_HID2, hid2);
>   }
>   }
> -#ifdef CONFIG_6xx
> +#ifdef CONFIG_PPC_BOOK3S_32
>   low_choose_750fx_pll(low_speed);
>  #endif
>   if (low_speed == 1) {
> @@ -166,7 +166,7 @@ static int dfs_set_cpu_speed(int low_speed)
>   }
>  
>   /* set frequency */
> -#ifdef CONFIG_6xx
> +#ifdef CONFIG_PPC_BOOK3S_32
>   low_choose_7447a_dfs(low_speed);
>  #endif
>   udelay(100);

Acked-by: Viresh Kumar 

-- 
viresh


Re: [RFC PATCH v1 3/6] powerpc: Add skeleton for Kernel Userspace Execution Prevention

2018-11-18 Thread Russell Currey
On Wed, 2018-11-07 at 16:56 +, Christophe Leroy wrote:
> This patch adds a skeleton for Kernel Userspace Execution Prevention.
> 
> Then subarches implementing it have to define CONFIG_PPC_HAVE_KUEP
> and provide setup_kuep() function.
> 
> Signed-off-by: Christophe Leroy 

An open question (with nothing to do specifically with this patch):

For what reason would you ever disable execution prevention?  Clearly
there must be something since "nosmep" is a thing, but I don't know why
we'd ever do it.

- Russell



Re: [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups

2018-11-18 Thread Alexey Kardashevskiy



On 19/11/2018 12:12, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:28:19PM +1100, Alexey Kardashevskiy wrote:
>> At the moment powernv registers an IOMMU group for each PE. There is
>> an exception though - NPU (an emulated PCI bridge representing an NVLink);
>> powernv attaches these bridges to the GPU IOMMU group which becomes
>> a master.
>>
>> Now we have POWER9 systems with GPUs connected to each other directly,
>> bypassing PCI. At the moment powernv does not control these links so
>> it has to put such interconnected GPUs to the same IOMMU group which
>> means that the old scheme with a GPU as a master won't work - there will
>> be up to 3 GPUs in such group.
>>
>> This introduces a npu_comp struct which represents a compound IOMMU
>> group made of multiple PEs. This converts the existing NVLink1 code to
>> use the new scheme. From now on, each PE must have a valid
>> iommu_table_group_ops which will either be called directly (a single PE
>> group) or indirectly from a compound group.
>>
>> This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
>> For POWER8, this stores a new compound group pointer in a PE (so a GPU
>> is still a master); for POWER9 the new group pointer is stored in an NPU.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/include/asm/pci.h|   1 +
>>  arch/powerpc/platforms/powernv/pci.h  |   7 +
>>  arch/powerpc/platforms/powernv/npu-dma.c  | 286 --
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++--
>>  4 files changed, 308 insertions(+), 159 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
>> index baf2886..0c72f18 100644
>> --- a/arch/powerpc/include/asm/pci.h
>> +++ b/arch/powerpc/include/asm/pci.h
>> @@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct 
>> pci_dev *gpdev, int index);
>>  extern int pnv_npu2_init(struct pci_controller *hose);
>>  extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
>>  unsigned long msr);
>> +extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
>>  
>>  #endif /* __ASM_POWERPC_PCI_H */
>> diff --git a/arch/powerpc/platforms/powernv/pci.h 
>> b/arch/powerpc/platforms/powernv/pci.h
>> index cf9f748..aef4bb5 100644
>> --- a/arch/powerpc/platforms/powernv/pci.h
>> +++ b/arch/powerpc/platforms/powernv/pci.h
>> @@ -62,6 +62,7 @@ struct pnv_ioda_pe {
>>  
>>  /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>>  struct iommu_table_group table_group;
>> +struct npu_comp *npucomp;
>>  
>>  /* 64-bit TCE bypass region */
>>  booltce_bypass_enabled;
>> @@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>>  extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
>>  extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
>>  extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
>> +extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>> +__u64 window_size, __u32 levels);
>>  extern int pnv_eeh_post_init(void);
>>  
>>  extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>> @@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe 
>> *pe, const char *level,
>>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool 
>> rm);
>>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
>> +extern struct iommu_table_group *pnv_try_setup_npu_table_group(
>> +struct pnv_ioda_pe *pe);
>> +extern struct iommu_table_group *pnv_npu_compound_attach(
>> +struct pnv_ioda_pe *pe);
>>  
>>  /* pci-ioda-tce.c */
>>  #define POWERNV_IOMMU_DEFAULT_LEVELS1
>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
>> b/arch/powerpc/platforms/powernv/npu-dma.c
>> index 1792c7e..2231f4c 100644
>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
>> @@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
>>  .unset_window = pnv_npu_unset_window,
>>  .take_ownership = pnv_npu_take_ownership,
>>  };
>> -
>> -struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
>> -{
>> -struct pnv_phb *phb = npe->phb;
>> -struct pci_bus *pbus = phb->hose->bus;
>> -struct pci_dev *npdev, *gpdev = NULL, *gptmp;
>> -struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, );
>> -
>> -if (!gpe || !gpdev)
>> -return NULL;
>> -
>> -npe->table_group.ops = _pci_npu_ops;
>> -
>> -list_for_each_entry(npdev, >devices, bus_list) {
>> -gptmp = pnv_pci_get_gpu_dev(npdev);
>> -
>> -if (gptmp != gpdev)
>> -continue;
>> -
>> -pe_info(gpe, "Attached NPU %s\n", dev_name(>dev));
>> -

Re: [LKP] dad4f140ed [ 7.709376] WARNING:suspicious_RCU_usage

2018-11-18 Thread Matthew Wilcox
On Mon, Nov 19, 2018 at 09:08:20AM +0800, kernel test robot wrote:
> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is

Umm.  I don't see a 'suspicious RCU usage' message in here.  I see a
lot of vmalloc warnings ... ?

> [7.699777] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null)
> [7.724561] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null)
> [7.737925] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)
> [7.750283] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)
> [7.777903] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null)
> [7.787175] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null)
> [7.796181] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)
> [7.804841] swapper: vmalloc: allocation failure: 18446744073709551615 
> bytes, mode:0x6080c0(GFP_KERNEL|__GFP_ZERO), nodemask=(null)


Re: [PATCH kernel v3 18/22] powerpc/powernv/npu: Add compound IOMMU groups

2018-11-18 Thread David Gibson
On Tue, Nov 13, 2018 at 07:28:19PM +1100, Alexey Kardashevskiy wrote:
> At the moment powernv registers an IOMMU group for each PE. There is
> an exception though - NPU (an emulated PCI bridge representing an NVLink);
> powernv attaches these bridges to the GPU IOMMU group which becomes
> a master.
> 
> Now we have POWER9 systems with GPUs connected to each other directly,
> bypassing PCI. At the moment powernv does not control these links so
> it has to put such interconnected GPUs to the same IOMMU group which
> means that the old scheme with a GPU as a master won't work - there will
> be up to 3 GPUs in such group.
> 
> This introduces a npu_comp struct which represents a compound IOMMU
> group made of multiple PEs. This converts the existing NVLink1 code to
> use the new scheme. From now on, each PE must have a valid
> iommu_table_group_ops which will either be called directly (a single PE
> group) or indirectly from a compound group.
> 
> This moves IOMMU group registration for NPU-connected GPUs to npu-dma.c.
> For POWER8, this stores a new compound group pointer in a PE (so a GPU
> is still a master); for POWER9 the new group pointer is stored in an NPU.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/pci.h|   1 +
>  arch/powerpc/platforms/powernv/pci.h  |   7 +
>  arch/powerpc/platforms/powernv/npu-dma.c  | 286 --
>  arch/powerpc/platforms/powernv/pci-ioda.c | 173 +++--
>  4 files changed, 308 insertions(+), 159 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index baf2886..0c72f18 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev 
> *gpdev, int index);
>  extern int pnv_npu2_init(struct pci_controller *hose);
>  extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
>   unsigned long msr);
> +extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
>  
>  #endif /* __ASM_POWERPC_PCI_H */
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
> b/arch/powerpc/platforms/powernv/pci.h
> index cf9f748..aef4bb5 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -62,6 +62,7 @@ struct pnv_ioda_pe {
>  
>   /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
>   struct iommu_table_group table_group;
> + struct npu_comp *npucomp;
>  
>   /* 64-bit TCE bypass region */
>   booltce_bypass_enabled;
> @@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
>  extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
>  extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
>  extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
> +extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> + __u64 window_size, __u32 levels);
>  extern int pnv_eeh_post_init(void);
>  
>  extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
> @@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe 
> *pe, const char *level,
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool 
> rm);
>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
> +extern struct iommu_table_group *pnv_try_setup_npu_table_group(
> + struct pnv_ioda_pe *pe);
> +extern struct iommu_table_group *pnv_npu_compound_attach(
> + struct pnv_ioda_pe *pe);
>  
>  /* pci-ioda-tce.c */
>  #define POWERNV_IOMMU_DEFAULT_LEVELS 1
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index 1792c7e..2231f4c 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -317,31 +317,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
>   .unset_window = pnv_npu_unset_window,
>   .take_ownership = pnv_npu_take_ownership,
>  };
> -
> -struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
> -{
> - struct pnv_phb *phb = npe->phb;
> - struct pci_bus *pbus = phb->hose->bus;
> - struct pci_dev *npdev, *gpdev = NULL, *gptmp;
> - struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, );
> -
> - if (!gpe || !gpdev)
> - return NULL;
> -
> - npe->table_group.ops = _pci_npu_ops;
> -
> - list_for_each_entry(npdev, >devices, bus_list) {
> - gptmp = pnv_pci_get_gpu_dev(npdev);
> -
> - if (gptmp != gpdev)
> - continue;
> -
> - pe_info(gpe, "Attached NPU %s\n", dev_name(>dev));
> - iommu_group_add_device(gpe->table_group.group, >dev);
> - }
> -
> - return gpe;
> -}
>  #endif /* !CONFIG_IOMMU_API */
>  
>  /*
> 

Re: [PATCH] misc: cxl: Use device_type helpers to access the node type

2018-11-18 Thread Andrew Donnellan

On 17/11/18 9:11 am, Rob Herring wrote:

Remove directly accessing device_node.type pointer and use the accessors
instead. This will eventually allow removing the type pointer.



This description doesn't quite match what's being fixed - we're not 
accessing the .type pointer, we're getting the device_type property 
directly.



Cc: Frederic Barrat 
Cc: Andrew Donnellan 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring 


Apart from that:

Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/pci.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index b66d832d3233..c79ba1c699ad 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1718,7 +1718,6 @@ int cxl_slot_is_switched(struct pci_dev *dev)
  {
struct device_node *np;
int depth = 0;
-   const __be32 *prop;
  
  	if (!(np = pci_device_to_OF_node(dev))) {

pr_err("cxl: np = NULL\n");
@@ -1727,8 +1726,7 @@ int cxl_slot_is_switched(struct pci_dev *dev)
of_node_get(np);
while (np) {
np = of_get_next_parent(np);
-   prop = of_get_property(np, "device_type", NULL);
-   if (!prop || strcmp((char *)prop, "pciex"))
+   if (!of_node_is_type(np, "pciex"))
break;
depth++;
}



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH kernel v3 16/22] powerpc/powernv: Add purge cache OPAL call

2018-11-18 Thread David Gibson
On Tue, Nov 13, 2018 at 07:28:17PM +1100, Alexey Kardashevskiy wrote:
> Flushing caches using the dcbf instruction takes quite some time if we
> need to flush gigabytes (16GB takes more than 15s); OPAL just added
> a big hammer to flush all caches.
> 
> This adds opal_purge_cache() which will be used later to flush caches
> for coherent GPU memory which might suddenly become unavailable if a GPU
> is reset and NVLink is not (re)trained.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/opal-api.h| 3 ++-
>  arch/powerpc/include/asm/opal.h| 1 +
>  arch/powerpc/platforms/powernv/opal.c  | 1 +
>  arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
>  4 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 870fb7b..55bc640 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -210,7 +210,8 @@
>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164
>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165
>  #define  OPAL_NX_COPROC_INIT 167
> -#define OPAL_LAST167
> +#define OPAL_CLEAR_CACHE 170
> +#define OPAL_LAST170
>  
>  #define QUIESCE_HOLD 1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT   2 /* Fail all calls with 
> OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index ff38664..7db576e 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -294,6 +294,7 @@ int opal_set_power_shift_ratio(u32 handle, int token, u32 
> psr);
>  int opal_sensor_group_clear(u32 group_hndl, int token);
>  int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
>  int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
> +int opal_purge_cache(void);
>  
>  s64 opal_signal_system_reset(s32 cpu);
>  s64 opal_quiesce(u64 shutdown_type, s32 cpu);
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index beed86f..44ce824 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -1113,3 +1113,4 @@ EXPORT_SYMBOL_GPL(opal_int_eoi);
>  EXPORT_SYMBOL_GPL(opal_error_code);
>  /* Export the below symbol for NX compression */
>  EXPORT_SYMBOL(opal_nx_coproc_init);
> +EXPORT_SYMBOL(opal_purge_cache);
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
> b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 2515282..5b886a6 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -331,3 +331,4 @@ OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,   
> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_sensor_read_u64,  OPAL_SENSOR_READ_U64);
>  OPAL_CALL(opal_sensor_group_enable,  OPAL_SENSOR_GROUP_ENABLE);
>  OPAL_CALL(opal_nx_coproc_init,   OPAL_NX_COPROC_INIT);
> +OPAL_CALL(opal_purge_cache,  OPAL_CLEAR_CACHE);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH kernel v3 17/22] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops

2018-11-18 Thread David Gibson
On Tue, Nov 13, 2018 at 07:28:18PM +1100, Alexey Kardashevskiy wrote:
> At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
> PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
> IOMMU groups with several PEs from several different PHB (such as
> interconnected GPUs and NPUs) so there will be no single master but
> a one big IOMMU group.
> 
> This makes a first step and converts an NPU PE to a table group.
> 
> This should cause no behavioral change. Note that
> pnv_npu_release_ownership() has never been implemented.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/platforms/powernv/pci.h  |  5 
>  arch/powerpc/platforms/powernv/npu-dma.c  | 29 ++-
>  arch/powerpc/platforms/powernv/pci-ioda.c | 17 +++--
>  3 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
> b/arch/powerpc/platforms/powernv/pci.h
> index ddb4f02..cf9f748 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -216,11 +216,6 @@ extern void pe_level_printk(const struct pnv_ioda_pe 
> *pe, const char *level,
>  extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
>  extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool 
> rm);
>  extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
> -extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
> - struct iommu_table *tbl);
> -extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
> -extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
> -extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
>  
>  /* pci-ioda-tce.c */
>  #define POWERNV_IOMMU_DEFAULT_LEVELS 1
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index 4b60f43..1792c7e 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -121,9 +121,11 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
> pnv_ioda_pe *npe,
>   return pe;
>  }
>  
> -long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
> +static long pnv_npu_set_window(struct iommu_table_group *table_group, int 
> num,
>   struct iommu_table *tbl)
>  {
> + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
>   struct pnv_phb *phb = npe->phb;
>   int64_t rc;
>   const unsigned long size = tbl->it_indirect_levels ?
> @@ -155,8 +157,10 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
>   return 0;
>  }
>  
> -long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
> +static long pnv_npu_unset_window(struct iommu_table_group *table_group, int 
> num)
>  {
> + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
>   struct pnv_phb *phb = npe->phb;
>   int64_t rc;
>  
> @@ -198,7 +202,8 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
>   if (!gpe)
>   return;
>  
> - rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
> + rc = pnv_npu_set_window(>table_group, 0,
> + gpe->table_group.tables[0]);
>  
>   /*
>* NVLink devices use the same TCE table configuration as
> @@ -223,7 +228,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
>   if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev)
>   return -EINVAL;
>  
> - rc = pnv_npu_unset_window(npe, 0);
> + rc = pnv_npu_unset_window(>table_group, 0);
>   if (rc != OPAL_SUCCESS)
>   return rc;
>  
> @@ -276,9 +281,12 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, 
> bool bypass)
>   }
>  }
>  
> +#ifdef CONFIG_IOMMU_API
>  /* Switch ownership from platform code to external user (e.g. VFIO) */
> -void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
> +static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
>  {
> + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
> + table_group);
>   struct pnv_phb *phb = npe->phb;
>   int64_t rc;
>  
> @@ -289,7 +297,7 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
>* if it was enabled at the moment of ownership change.
>*/
>   if (npe->table_group.tables[0]) {
> - pnv_npu_unset_window(npe, 0);
> + pnv_npu_unset_window(>table_group, 0);
>   return;
>   }
>  
> @@ -304,6 +312,12 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
>   pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
>  }
>  
> +static struct iommu_table_group_ops pnv_pci_npu_ops = {
> + .set_window = pnv_npu_set_window,
> + .unset_window = pnv_npu_unset_window,
> + .take_ownership = pnv_npu_take_ownership,
> +};
> +
> 

Re: [PATCH kernel v3 14/22] powerpc/iommu_api: Move IOMMU groups setup to a single place

2018-11-18 Thread David Gibson
On Tue, Nov 13, 2018 at 07:28:15PM +1100, Alexey Kardashevskiy wrote:
> Registering new IOMMU groups and adding devices to them are separated in
> code and the latter is dug in the DMA setup code which it does not
> really belong to.
> 
> This moved IOMMU groups setup to a separate helper which registers a group
> and adds devices as before. This does not make a difference as IOMMU
> groups are not used anyway; the only dependency here is that
> iommu_add_device() requires a valid pointer to an iommu_table
> (set by set_iommu_table_base()).
> 
> To keep the old behaviour, this does not add new IOMMU groups for PEs
> with no DMA weigth and also skips NVLINK bridges which do not have
> pci_controller_ops::setup_bridge (the normal way of adding PEs).
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 80 +++
>  1 file changed, 66 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f36a802..7f4904a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1269,6 +1269,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
>   pnv_ioda_setup_npu_PE(pdev);
>  }
>  
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
> +
>  static void pnv_pci_ioda_setup_PEs(void)
>  {
>   struct pci_controller *hose;
> @@ -1591,6 +1593,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
> u16 num_vfs)
>   mutex_unlock(>ioda.pe_list_mutex);
>  
>   pnv_pci_ioda2_setup_dma_pe(phb, pe);
> + pnv_ioda_setup_bus_iommu_group(pe);
>   }
>  }
>  
> @@ -1930,21 +1933,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct 
> pci_dev *pdev)
>   return mask;
>  }
>  
> -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
> -struct pci_bus *bus,
> -bool add_to_group)
> +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus 
> *bus)
>  {
>   struct pci_dev *dev;
>  
>   list_for_each_entry(dev, >devices, bus_list) {
>   set_iommu_table_base(>dev, pe->table_group.tables[0]);
>   set_dma_offset(>dev, pe->tce_bypass_base);
> - if (add_to_group)
> - iommu_add_device(>table_group, >dev);
>  
>   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> - pnv_ioda_setup_bus_dma(pe, dev->subordinate,
> - add_to_group);
> + pnv_ioda_setup_bus_dma(pe, dev->subordinate);
>   }
>  }
>  
> @@ -2374,7 +2372,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
> *phb,
>   iommu_init_table(tbl, phb->hose->node);
>  
>   if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> - pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
> + pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  
>   return;
>   fail:
> @@ -2607,7 +2605,7 @@ static void pnv_ioda2_take_ownership(struct 
> iommu_table_group *table_group)
>   pnv_pci_ioda2_set_bypass(pe, false);
>   pnv_pci_ioda2_unset_window(>table_group, 0);
>   if (pe->pbus)
> - pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> + pnv_ioda_setup_bus_dma(pe, pe->pbus);
>   iommu_tce_table_put(tbl);
>  }
>  
> @@ -2618,7 +2616,7 @@ static void pnv_ioda2_release_ownership(struct 
> iommu_table_group *table_group)
>  
>   pnv_pci_ioda2_setup_default_config(pe);
>   if (pe->pbus)
> - pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> + pnv_ioda_setup_bus_dma(pe, pe->pbus);
>  }
>  
>  static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
> @@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops 
> pnv_pci_ioda2_npu_ops = {
>   .release_ownership = pnv_ioda2_release_ownership,
>  };
>  
> +static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe 
> *pe,
> + struct pci_bus *bus)
> +{
> + struct pci_dev *dev;
> +
> + list_for_each_entry(dev, >devices, bus_list) {
> + iommu_add_device(>table_group, >dev);
> +
> + if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> + pnv_ioda_setup_bus_iommu_group_add_devices(pe,
> + dev->subordinate);
> + }
> +}
> +
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
> +{
> + if (!pnv_pci_ioda_pe_dma_weight(pe))
> + return;
> +
> + iommu_register_group(>table_group, pe->phb->hose->global_number,
> + pe->pe_number);
> +
> + /*
> +  * set_iommu_table_base(>pdev->dev, tbl) should have been called
> +  * by now
> +  */
> + if (pe->flags & PNV_IODA_PE_DEV)
> + iommu_add_device(>table_group, >pdev->dev);
> 

Re: [PATCH kernel v3 15/22] powerpc/powernv: Reference iommu_table while it is linked to a group

2018-11-18 Thread David Gibson
On Tue, Nov 13, 2018 at 07:28:16PM +1100, Alexey Kardashevskiy wrote:
> The iommu_table pointer stored in iommu_table_group may get stale
> by accident, this adds referencing and removes a redundant comment
> about this.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 3 ++-
>  arch/powerpc/platforms/powernv/pci-ioda.c | 4 
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
> b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index 7639b21..697449a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -368,6 +368,7 @@ void pnv_pci_unlink_table_and_group(struct iommu_table 
> *tbl,
>   found = false;
>   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
>   if (table_group->tables[i] == tbl) {
> + iommu_tce_table_put(tbl);
>   table_group->tables[i] = NULL;
>   found = true;
>   break;
> @@ -393,7 +394,7 @@ long pnv_pci_link_table_and_group(int node, int num,
>   tgl->table_group = table_group;
>   list_add_rcu(>next, >it_group_list);
>  
> - table_group->tables[num] = tbl;
> + table_group->tables[num] = iommu_tce_table_get(tbl);
>  
>   return 0;
>  }
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 7f4904a..7caf373 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2716,10 +2716,6 @@ static long pnv_pci_ioda2_npu_unset_window(
>  
>  static void pnv_ioda2_npu_take_ownership(struct iommu_table_group 
> *table_group)
>  {
> - /*
> -  * Detach NPU first as pnv_ioda2_take_ownership() will destroy
> -  * the iommu_table if 32bit DMA is enabled.
> -  */
>   pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
>   pnv_ioda2_take_ownership(table_group);
>  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH 2/4] net/bpf: refactor freeing of executable allocations

2018-11-18 Thread Y Song
On Sun, Nov 18, 2018 at 3:55 PM Ard Biesheuvel
 wrote:
>
> On Sat, 17 Nov 2018 at 23:47, Y Song  wrote:
> >
> > On Sat, Nov 17, 2018 at 6:58 PM Ard Biesheuvel
> >  wrote:
> > >
> > > All arch overrides of the __weak bpf_jit_free() amount to the same
> > > thing: the allocated memory was never mapped read-only, and so
> > > it does not have to be remapped to read-write before being freed.
> > >
> > > So in preparation of permitting arches to serve allocations for BPF
> > > JIT programs from other regions than the module region, refactor
> > > the existing bpf_jit_free() implementations to use the shared code
> > > where possible, and only specialize the remap and free operations.
> > >
> > > Signed-off-by: Ard Biesheuvel 
> > > ---
> > >  arch/mips/net/bpf_jit.c   |  7 ++-
> > >  arch/powerpc/net/bpf_jit_comp.c   |  7 ++-
> > >  arch/powerpc/net/bpf_jit_comp64.c |  9 +++--
> > >  arch/sparc/net/bpf_jit_comp_32.c  |  7 ++-
> > >  kernel/bpf/core.c | 15 +--
> > >  5 files changed, 14 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
> > > index 1b69897274a1..5696bd7dccc7 100644
> > > --- a/arch/mips/net/bpf_jit.c
> > > +++ b/arch/mips/net/bpf_jit.c
> > > @@ -1261,10 +1261,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> > > kfree(ctx.offsets);
> > >  }
> > >
> > > -void bpf_jit_free(struct bpf_prog *fp)
> > > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > >  {
> > > -   if (fp->jited)
> > > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > > -
> > > -   bpf_prog_unlock_free(fp);
> > > +   module_memfree(hdr);
> > >  }
> > > diff --git a/arch/powerpc/net/bpf_jit_comp.c 
> > > b/arch/powerpc/net/bpf_jit_comp.c
> > > index a1ea1ea6b40d..5b5ce4a1b44b 100644
> > > --- a/arch/powerpc/net/bpf_jit_comp.c
> > > +++ b/arch/powerpc/net/bpf_jit_comp.c
> > > @@ -680,10 +680,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> > > return;
> > >  }
> > >
> > > -void bpf_jit_free(struct bpf_prog *fp)
> > > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > >  {
> > > -   if (fp->jited)
> > > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > > -
> > > -   bpf_prog_unlock_free(fp);
> > > +   module_memfree(hdr);
> > >  }
> > > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > > b/arch/powerpc/net/bpf_jit_comp64.c
> > > index 84c8f013a6c6..f64f1294bd62 100644
> > > --- a/arch/powerpc/net/bpf_jit_comp64.c
> > > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > > @@ -1021,11 +1021,8 @@ struct bpf_prog *bpf_int_jit_compile(struct 
> > > bpf_prog *fp)
> > > return fp;
> > >  }
> > >
> > > -/* Overriding bpf_jit_free() as we don't set images read-only. */
> > > -void bpf_jit_free(struct bpf_prog *fp)
> > > +/* Overriding bpf_jit_binary_free() as we don't set images read-only. */
> > > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > >  {
> > > -   if (fp->jited)
> > > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > > -
> > > -   bpf_prog_unlock_free(fp);
> > > +   module_memfree(hdr);
> > >  }
> > > diff --git a/arch/sparc/net/bpf_jit_comp_32.c 
> > > b/arch/sparc/net/bpf_jit_comp_32.c
> > > index 01bda6bc9e7f..589950d152cc 100644
> > > --- a/arch/sparc/net/bpf_jit_comp_32.c
> > > +++ b/arch/sparc/net/bpf_jit_comp_32.c
> > > @@ -756,10 +756,7 @@ cond_branch:   f_offset = 
> > > addrs[i + filter[i].jf];
> > > return;
> > >  }
> > >
> > > -void bpf_jit_free(struct bpf_prog *fp)
> > > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > >  {
> > > -   if (fp->jited)
> > > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > > -
> > > -   bpf_prog_unlock_free(fp);
> > > +   module_memfree(hdr);
> > >  }
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index 1a796e0799ec..29f766dac203 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -646,25 +646,20 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 
> > > **image_ptr,
> > > return hdr;
> > >  }
> > >
> > > -void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > > +void __weak bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > >  {
> > > -   u32 pages = hdr->pages;
> > > -
> > > +   bpf_jit_binary_unlock_ro(hdr);
> > > module_memfree(hdr);
> > > -   bpf_jit_uncharge_modmem(pages);
> > >  }
> > >
> > > -/* This symbol is only overridden by archs that have different
> > > - * requirements than the usual eBPF JITs, f.e. when they only
> > > - * implement cBPF JIT, do not set images read-only, etc.
> > > - */
> >
> > Do you want to move the above comments to
> > new weak function bpf_jit_binary_free?
> >
>
> Perhaps. But one thing I don't understand, looking at this again, is
> why we have these overrides in the first place. module_memfree() just
> calls vfree(), which takes down the mapping entirely (along with any
> 

Re: [PATCH 2/4] net/bpf: refactor freeing of executable allocations

2018-11-18 Thread Ard Biesheuvel
On Sat, 17 Nov 2018 at 23:47, Y Song  wrote:
>
> On Sat, Nov 17, 2018 at 6:58 PM Ard Biesheuvel
>  wrote:
> >
> > All arch overrides of the __weak bpf_jit_free() amount to the same
> > thing: the allocated memory was never mapped read-only, and so
> > it does not have to be remapped to read-write before being freed.
> >
> > So in preparation of permitting arches to serve allocations for BPF
> > JIT programs from other regions than the module region, refactor
> > the existing bpf_jit_free() implementations to use the shared code
> > where possible, and only specialize the remap and free operations.
> >
> > Signed-off-by: Ard Biesheuvel 
> > ---
> >  arch/mips/net/bpf_jit.c   |  7 ++-
> >  arch/powerpc/net/bpf_jit_comp.c   |  7 ++-
> >  arch/powerpc/net/bpf_jit_comp64.c |  9 +++--
> >  arch/sparc/net/bpf_jit_comp_32.c  |  7 ++-
> >  kernel/bpf/core.c | 15 +--
> >  5 files changed, 14 insertions(+), 31 deletions(-)
> >
> > diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
> > index 1b69897274a1..5696bd7dccc7 100644
> > --- a/arch/mips/net/bpf_jit.c
> > +++ b/arch/mips/net/bpf_jit.c
> > @@ -1261,10 +1261,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> > kfree(ctx.offsets);
> >  }
> >
> > -void bpf_jit_free(struct bpf_prog *fp)
> > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> >  {
> > -   if (fp->jited)
> > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > -
> > -   bpf_prog_unlock_free(fp);
> > +   module_memfree(hdr);
> >  }
> > diff --git a/arch/powerpc/net/bpf_jit_comp.c 
> > b/arch/powerpc/net/bpf_jit_comp.c
> > index a1ea1ea6b40d..5b5ce4a1b44b 100644
> > --- a/arch/powerpc/net/bpf_jit_comp.c
> > +++ b/arch/powerpc/net/bpf_jit_comp.c
> > @@ -680,10 +680,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> > return;
> >  }
> >
> > -void bpf_jit_free(struct bpf_prog *fp)
> > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> >  {
> > -   if (fp->jited)
> > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > -
> > -   bpf_prog_unlock_free(fp);
> > +   module_memfree(hdr);
> >  }
> > diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> > b/arch/powerpc/net/bpf_jit_comp64.c
> > index 84c8f013a6c6..f64f1294bd62 100644
> > --- a/arch/powerpc/net/bpf_jit_comp64.c
> > +++ b/arch/powerpc/net/bpf_jit_comp64.c
> > @@ -1021,11 +1021,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> > *fp)
> > return fp;
> >  }
> >
> > -/* Overriding bpf_jit_free() as we don't set images read-only. */
> > -void bpf_jit_free(struct bpf_prog *fp)
> > +/* Overriding bpf_jit_binary_free() as we don't set images read-only. */
> > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> >  {
> > -   if (fp->jited)
> > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > -
> > -   bpf_prog_unlock_free(fp);
> > +   module_memfree(hdr);
> >  }
> > diff --git a/arch/sparc/net/bpf_jit_comp_32.c 
> > b/arch/sparc/net/bpf_jit_comp_32.c
> > index 01bda6bc9e7f..589950d152cc 100644
> > --- a/arch/sparc/net/bpf_jit_comp_32.c
> > +++ b/arch/sparc/net/bpf_jit_comp_32.c
> > @@ -756,10 +756,7 @@ cond_branch:   f_offset = addrs[i 
> > + filter[i].jf];
> > return;
> >  }
> >
> > -void bpf_jit_free(struct bpf_prog *fp)
> > +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> >  {
> > -   if (fp->jited)
> > -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> > -
> > -   bpf_prog_unlock_free(fp);
> > +   module_memfree(hdr);
> >  }
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 1a796e0799ec..29f766dac203 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -646,25 +646,20 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 
> > **image_ptr,
> > return hdr;
> >  }
> >
> > -void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> > +void __weak bpf_jit_binary_free(struct bpf_binary_header *hdr)
> >  {
> > -   u32 pages = hdr->pages;
> > -
> > +   bpf_jit_binary_unlock_ro(hdr);
> > module_memfree(hdr);
> > -   bpf_jit_uncharge_modmem(pages);
> >  }
> >
> > -/* This symbol is only overridden by archs that have different
> > - * requirements than the usual eBPF JITs, f.e. when they only
> > - * implement cBPF JIT, do not set images read-only, etc.
> > - */
>
> Do you want to move the above comments to
> new weak function bpf_jit_binary_free?
>

Perhaps. But one thing I don't understand, looking at this again, is
why we have these overrides in the first place. module_memfree() just
calls vfree(), which takes down the mapping entirely (along with any
updated permissions), and so remapping it back to r/w right before
that seems rather pointless imo.

Can we get rid of bpf_jit_binary_unlock_ro() entirely, and along with
it, all these overrides for the free() path?


Re: linux-next: build warnings from Linus' tree

2018-11-18 Thread Alan Modra
On Wed, Nov 14, 2018 at 09:20:23PM +1100, Michael Ellerman wrote:
> Joel Stanley  writes:
> > Hello Alan,
> >
> > On Tue, 12 Jun 2018 at 07:44, Stephen Rothwell  
> > wrote:
> >
> >> Building Linus' tree, today's linux-next build (powerpc ppc64_defconfig)
> >> produced these warning:
> >>
> >> ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed 
> >> in section `.gnu.hash'.
> >> ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed 
> >> in section `.gnu.hash'.
> >> ld: warning: orphan section `.gnu.hash' from `linker stubs' being placed 
> >> in section `.gnu.hash'.
> >>
> >> This may just be because I have started building using the native Debian
> >> gcc for the powerpc builds ...
> >
> > Do you know why we started creating these?
> 
> It's controlled by the ld option --hash-style, which AFAICS still
> defaults to sysv (generating .hash).
> 
> But it seems gcc can be configured to have a different default, and at
> least my native ppc64le toolchains are passing gnu, eg:
> 
>  /usr/lib/gcc/powerpc64le-linux-gnu/6/collect2 -plugin
>  /usr/lib/gcc/powerpc64le-linux-gnu/6/liblto_plugin.so
>  -plugin-opt=/usr/lib/gcc/powerpc64le-linux-gnu/6/lto-wrapper
>  -plugin-opt=-fresolution=/tmp/ccw1U2fF.res
>  -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s
>  -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc
>  -plugin-opt=-pass-through=-lgcc_s --sysroot=/ --build-id --eh-frame-hdr
>  -V -shared -m elf64lppc
>  --hash-style=gnu
>  
> 
> So that's presumably why we're seeing it, some GCCs are configured to
> use it.
> 
> > If it's intentional, should we be putting including them in the same
> > way as .hash sections?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/vmlinux.lds.S#n282
> >
> >   .hash : AT(ADDR(.hash) - LOAD_OFFSET) { *(.hash) }
> 
> That would presumably work.
> 
> My question though is do we even need it?
> 
> >From what I can see for it to be useful you need the section as well as
> an entry in the dynamic section pointing at it, and we don't have a
> dynamic section at all:
> 
>   $ readelf -S vmlinux | grep gnu.hash
> [ 4] .gnu.hash GNU_HASH c0dbbdb0  00dcbdb0
>   $ readelf -d vmlinux
>   
>   There is no dynamic section in this file.
> 
> Compare to the vdso:
> 
> $ readelf -d arch/powerpc/kernel/vdso64/vdso64.so
> 
> Dynamic section at offset 0x868 contains 12 entries:
>   TagType Name/Value
>  0x000e (SONAME) Library soname: [linux-vdso64.so.1]
>  0x0004 (HASH)   0x120
>  0x6ef5 (GNU_HASH)   0x170
>  0x0005 (STRTAB) 0x320
>  0x0006 (SYMTAB) 0x1d0
>  0x000a (STRSZ)  269 (bytes)
>  0x000b (SYMENT) 24 (bytes)
>  0x7003 (PPC64_OPT)  0x0
>  0x6ffc (VERDEF) 0x450
>  0x6ffd (VERDEFNUM)  2
>  0x6ff0 (VERSYM) 0x42e
>  0x (NULL)   0x0
> 
> 
> So can't we just discard .gnu.hash? And in fact do we need .hash either?
> 
> Actually arm64 discards the latter, and parisc discards both.
> 
> Would still be good to hear from Alan or someone else who knows anything
> about toolchain stuff, ie. not me :)

.gnu.hash, like .hash, is used by glibc ld.so for dynamic symbol
lookup.  I imagine you don't need either section in a kernel, so
discarding both sounds reasonable.  Likely you could discard .interp
and .dynstr too, and .dynsym when !CONFIG_PPC32.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PATCH 0/4] bpf: permit JIT allocations to be served outside the module region

2018-11-18 Thread Y Song
On Sat, Nov 17, 2018 at 6:58 PM Ard Biesheuvel
 wrote:
>
> On arm64, modules are allocated from a 128 MB window which is close to
> the core kernel, so that relative direct branches are guaranteed to be
> in range (except in some KASLR configurations). Also, module_alloc()
> is in charge of allocating KASAN shadow memory when running with KASAN
> enabled.
>
> This means that the way BPF reuses module_alloc()/module_memfree() is
> undesirable on arm64 (and potentially other architectures as well),
> and so this series refactors BPF's use of those functions to permit
> architectures to change this behavior.
>
> Patch #1 fixes a bug introduced during the merge window, where the new
> alloc/free tracking does not account for memory that is freed by some
> arch code.
>
> Patch #2 refactors the freeing path so that architectures can switch to
> something other than module_memfree().
>
> Patch #3 does the same for module_alloc().
>
> Patch #4 implements the new alloc/free overrides for arm64

Except a minor comment, the whole patch set looks good to me.
Acked-by: Yonghong Song 

>
> Cc: Daniel Borkmann 
> Cc: Alexei Starovoitov 
> Cc: Rick Edgecombe 
> Cc: Eric Dumazet 
> Cc: Jann Horn 
> Cc: Kees Cook 
>
> Cc: Jessica Yu 
> Cc: Arnd Bergmann 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Ralf Baechle 
> Cc: Paul Burton 
> Cc: James Hogan 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: "David S. Miller" 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-m...@linux-mips.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparcli...@vger.kernel.org
> Cc: net...@vger.kernel.org
>
> Ard Biesheuvel (4):
>   bpf: account for freed JIT allocations in arch code
>   net/bpf: refactor freeing of executable allocations
>   bpf: add __weak hook for allocating executable memory
>   arm64/bpf: don't allocate BPF JIT programs in module memory
>
>  arch/arm64/net/bpf_jit_comp.c | 11 ++
>  arch/mips/net/bpf_jit.c   |  7 ++-
>  arch/powerpc/net/bpf_jit_comp.c   |  7 ++-
>  arch/powerpc/net/bpf_jit_comp64.c | 12 +++
>  arch/sparc/net/bpf_jit_comp_32.c  |  7 ++-
>  kernel/bpf/core.c | 22 ++--
>  6 files changed, 31 insertions(+), 35 deletions(-)
>
> --
> 2.17.1
>


Re: [PATCH 2/4] net/bpf: refactor freeing of executable allocations

2018-11-18 Thread Y Song
On Sat, Nov 17, 2018 at 6:58 PM Ard Biesheuvel
 wrote:
>
> All arch overrides of the __weak bpf_jit_free() amount to the same
> thing: the allocated memory was never mapped read-only, and so
> it does not have to be remapped to read-write before being freed.
>
> So in preparation of permitting arches to serve allocations for BPF
> JIT programs from other regions than the module region, refactor
> the existing bpf_jit_free() implementations to use the shared code
> where possible, and only specialize the remap and free operations.
>
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/mips/net/bpf_jit.c   |  7 ++-
>  arch/powerpc/net/bpf_jit_comp.c   |  7 ++-
>  arch/powerpc/net/bpf_jit_comp64.c |  9 +++--
>  arch/sparc/net/bpf_jit_comp_32.c  |  7 ++-
>  kernel/bpf/core.c | 15 +--
>  5 files changed, 14 insertions(+), 31 deletions(-)
>
> diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
> index 1b69897274a1..5696bd7dccc7 100644
> --- a/arch/mips/net/bpf_jit.c
> +++ b/arch/mips/net/bpf_jit.c
> @@ -1261,10 +1261,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> kfree(ctx.offsets);
>  }
>
> -void bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
>  {
> -   if (fp->jited)
> -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> -
> -   bpf_prog_unlock_free(fp);
> +   module_memfree(hdr);
>  }
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index a1ea1ea6b40d..5b5ce4a1b44b 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -680,10 +680,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> return;
>  }
>
> -void bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
>  {
> -   if (fp->jited)
> -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> -
> -   bpf_prog_unlock_free(fp);
> +   module_memfree(hdr);
>  }
> diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
> b/arch/powerpc/net/bpf_jit_comp64.c
> index 84c8f013a6c6..f64f1294bd62 100644
> --- a/arch/powerpc/net/bpf_jit_comp64.c
> +++ b/arch/powerpc/net/bpf_jit_comp64.c
> @@ -1021,11 +1021,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *fp)
> return fp;
>  }
>
> -/* Overriding bpf_jit_free() as we don't set images read-only. */
> -void bpf_jit_free(struct bpf_prog *fp)
> +/* Overriding bpf_jit_binary_free() as we don't set images read-only. */
> +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
>  {
> -   if (fp->jited)
> -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> -
> -   bpf_prog_unlock_free(fp);
> +   module_memfree(hdr);
>  }
> diff --git a/arch/sparc/net/bpf_jit_comp_32.c 
> b/arch/sparc/net/bpf_jit_comp_32.c
> index 01bda6bc9e7f..589950d152cc 100644
> --- a/arch/sparc/net/bpf_jit_comp_32.c
> +++ b/arch/sparc/net/bpf_jit_comp_32.c
> @@ -756,10 +756,7 @@ cond_branch:   f_offset = addrs[i + 
> filter[i].jf];
> return;
>  }
>
> -void bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_binary_free(struct bpf_binary_header *hdr)
>  {
> -   if (fp->jited)
> -   bpf_jit_binary_free(bpf_jit_binary_hdr(fp));
> -
> -   bpf_prog_unlock_free(fp);
> +   module_memfree(hdr);
>  }
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 1a796e0799ec..29f766dac203 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -646,25 +646,20 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 
> **image_ptr,
> return hdr;
>  }
>
> -void bpf_jit_binary_free(struct bpf_binary_header *hdr)
> +void __weak bpf_jit_binary_free(struct bpf_binary_header *hdr)
>  {
> -   u32 pages = hdr->pages;
> -
> +   bpf_jit_binary_unlock_ro(hdr);
> module_memfree(hdr);
> -   bpf_jit_uncharge_modmem(pages);
>  }
>
> -/* This symbol is only overridden by archs that have different
> - * requirements than the usual eBPF JITs, f.e. when they only
> - * implement cBPF JIT, do not set images read-only, etc.
> - */

Do you want to move the above comments to
new weak function bpf_jit_binary_free?

> -void __weak bpf_jit_free(struct bpf_prog *fp)
> +void bpf_jit_free(struct bpf_prog *fp)
>  {
> if (fp->jited) {
> struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
> +   u32 pages = hdr->pages;
>
> -   bpf_jit_binary_unlock_ro(hdr);
> bpf_jit_binary_free(hdr);
> +   bpf_jit_uncharge_modmem(pages);
>
> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
> }
> --
> 2.17.1
>


Re: [PATCH] powerpc/32: Include .branch_lt in data section

2018-11-18 Thread Alan Modra
On Thu, Nov 15, 2018 at 11:47:52PM +1100, Michael Ellerman wrote:
> Alan Modra  writes:
> 
> > On Wed, Nov 14, 2018 at 01:32:18PM +1030, Joel Stanley wrote:
> >> I wasn't sure where this should go or if the ordering matters.
> >
> > The usual answer is: "Look at where the section goes in the standard
> > linker scripts."   But that doesn't apply here.  The section will be
> > empty for a kernel build so it doesn't matter where it goes.
> 
> If it's empty why don't we just discard it?

That can be a recipe for finding linker bugs.  Not that I'm against
you finding linker bugs.  ;-)

-- 
Alan Modra
Australia Development Lab, IBM