Re: [PATCH] powerpc/powernv : Add support to enable sensor groups

2017-12-20 Thread Shilpasri G Bhat
Hi,

On 12/04/2017 10:11 AM, Stewart Smith wrote:
> Shilpasri G Bhat  writes:
>> On 11/28/2017 05:07 PM, Michael Ellerman wrote:
>>> Shilpasri G Bhat  writes:
>>>
>>>> Adds support to enable/disable a sensor group. This can be used to
>>>> select the sensor groups that need to be copied to main memory by
>>>> OCC. Sensor groups like power, temperature, current, voltage,
>>>> frequency, utilization can be enabled/disabled at runtime.
>>>>
>>>> Signed-off-by: Shilpasri G Bhat
>>>> ---
>>>> The skiboot patch for the opal call is posted below:
>>>> https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html
>>>
>>> Can you remind me why we're doing this with a completely bespoke sysfs
>>> API, rather than using some generic sensors API?
>>>
>>
>> Disabling/enabling sensor groups is not supported in the current generic
>> sensors API. Also, we don't export all types of sensors in HWMON, as not all
>> of them are environment sensors (for example, performance sensors).
> 
> Are there barriers to adding such concepts to the generic sensors API?
> 

Yes.

HWMON does not support attributes for a sensor-group. If we are to extend HWMON
to add new per-sensor attributes to disable/enable, then we need to do either of
the below:

1) If any one of the sensors is disabled, then all the sensors belonging to
that group will be disabled. OR

2) To disable a sensor group we need to disable all the sensors belonging to
that group.

Another problem is that hwmon categorizes sensor groups by sensor type, such
as power or temp. If OCC allows multiple groups of the same type, then this
approach makes it harder for the user to identify which sensors belong to
which group.

And lastly, HWMON does not allow platform-specific, non-standard sensor groups
like CSM, job-scheduler, profiler.

Thanks and Regards,
Shilpa
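
To make the two per-sensor options above concrete, here is a minimal userspace
sketch. It is not HWMON or OPAL code; the struct, the enum and every function
name are hypothetical and exist only to show how the two semantics differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: every sensor belongs to exactly one group. */
enum group_id { GROUP_POWER, GROUP_TEMP };

struct sensor {
    const char *name;
    enum group_id group;
    bool enabled;
};

/* Option 1: disabling any one sensor disables every sensor in its group. */
static void disable_sensor_opt1(struct sensor *s, size_t n, size_t idx)
{
    enum group_id g = s[idx].group;

    for (size_t i = 0; i < n; i++)
        if (s[i].group == g)
            s[i].enabled = false;
}

/* Option 2: only the named sensor is touched; the group is off only
 * once the caller has disabled each member individually. */
static void disable_sensor_opt2(struct sensor *s, size_t idx)
{
    s[idx].enabled = false;
}

/* A group counts as disabled when none of its sensors is enabled. */
static bool group_disabled(const struct sensor *s, size_t n, enum group_id g)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].group == g && s[i].enabled)
            return false;
    return true;
}
```

With option 1 a single write has side effects on sibling sensors; with option 2
userspace has to know the full membership of a group before it can switch the
group off.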



[PATCH] cxl: Check if vphb exists before iterating over AFU devices

2017-12-20 Thread Vaibhav Jain
commit 12841f87b7a8ceb3d54f171660f72a86941bfcb3 upstream, for 4.3.

During an EEH event a kernel oops is reported if no vPHB is allocated to
the AFU. This happens because, during AFU init, a failure to create the
vPHB is a non-fatal error. Hence afu->phb should always be checked for
NULL before iterating over it for the virtual AFU PCI devices.

This patch fixes the kernel oops by adding a NULL pointer check for
afu->phb before it is dereferenced.

Fixes: 9e8df8a21963 ("cxl: EEH support")
Cc: sta...@vger.kernel.org # v4.3+
Signed-off-by: Vaibhav Jain 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Signed-off-by: Michael Ellerman 
---
Changelog:
- Rebased the patch on 4.3 stable tree
---
 drivers/misc/cxl/pci.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 85761d7eb333..b982329f3837 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1328,6 +1328,9 @@ static pci_ers_result_t cxl_vphb_error_detected(struct cxl_afu *afu,
/* There should only be one entry, but go through the list
 * anyway
 */
+   if (afu->phb == NULL)
+   return result;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (!afu_dev->driver)
continue;
@@ -1368,6 +1371,10 @@ static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
 */
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
+   /*
+* Tell the AFU drivers; but we don't care what they
+* say, we're going away.
+*/
cxl_vphb_error_detected(afu, state);
}
return PCI_ERS_RESULT_DISCONNECT;
@@ -1491,6 +1498,9 @@ static pci_ers_result_t cxl_pci_slot_reset(struct pci_dev *pdev)
if (cxl_afu_select_best_mode(afu))
goto err;
 
+   if (afu->phb == NULL)
+   continue;
+
cxl_pci_vphb_reconfigure(afu);
 
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
@@ -1555,6 +1565,9 @@ static void cxl_pci_resume(struct pci_dev *pdev)
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
 
+   if (afu->phb == NULL)
+   continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (afu_dev->driver && afu_dev->driver->err_handler &&
afu_dev->driver->err_handler->resume)
-- 
2.14.3
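
The pattern the patch applies can be shown outside the kernel. The sketch
below is a simplified userspace model, not the cxl driver: the struct layouts
and the function name count_bound_devices() are assumptions made for
illustration, and a plain singly linked list stands in for the kernel's
list_for_each_entry().

```c
#include <stddef.h>

/* Simplified stand-ins for the driver structures (hypothetical layout). */
struct dev {
    int has_driver;
    struct dev *next;
};

struct phb {
    struct dev *devices;
};

struct afu {
    struct phb *phb;    /* may be NULL when vPHB creation failed at init */
};

/* Count the devices with a bound driver, tolerating a missing vPHB. */
static int count_bound_devices(const struct afu *afu)
{
    int n = 0;

    /* Without this check, afu->phb->devices dereferences NULL. */
    if (afu->phb == NULL)
        return 0;

    for (const struct dev *d = afu->phb->devices; d; d = d->next)
        if (d->has_driver)
            n++;
    return n;
}
```

The early return mirrors the hunks above: each loop over the vPHB bus is
skipped for an AFU that has no vPHB.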



Re: [PATCH] cxl: Check if vphb exists before iterating over AFU devices

2017-12-20 Thread Vaibhav Jain
Greg KH  writes:

> On Wed, Dec 20, 2017 at 03:07:06PM +0530, Vaibhav Jain wrote:
>> commit 12841f87b7a8ceb3d54f171660f72a86941bfcb3 upstream, for 4.9.
>
> Thanks, do we also need this for 4.4?  If so, can you provide a
> backport?
>
Thanks, Greg, for applying this patch to the 4.9 stable tree. I have done a
backport for 4.3+ and will send the backported patch across.

Cheers,
-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [PATCH v1 7/7] pseries/setup: Add Initialization of VF Bars

2017-12-20 Thread Juan Alvarez
On 12/19/17 12:38 AM, Alexey Kardashevskiy wrote:

> On 19/12/17 06:29, Juan Alvarez wrote:
>> This is a PF-only path. Yes, either we have a root returned or we
>> fall back to iomem_resource.
> You have removed context from my response, do not do that please.

My apologies. I will not do that.

>
> When will you have root and when you won't? imho it should always be either
> one or another.
>
Yes, you are correct. The resource is carved out of a different MMIO
space and will never be passed in the assigned-addresses property in
the device node of the PF.

We will remove that function call and the conditional check, and set root
accordingly.

>> On 12/18/17 1:21 AM, Alexey Kardashevskiy wrote:
>>> @dev here is a VF, right? I am not familiar with powervn much but from what
>>> I see - the devices are sitting on a root bus of their own PHB and they all
>>> either have a root returned from pci_find_parent_resource() or none of them
>>> has a root and will fall back to iomem_resource, or both cases are possible?
>
- Juan 



Re: [PATCH v2 2/7] powerpc/kernel: Add uevents in EEH error/resume

2017-12-20 Thread Juan Alvarez
On 12/19/17 12:27 AM, Benjamin Herrenschmidt wrote:

> On Mon, 2017-12-18 at 22:50 -0600, Bjorn Helgaas wrote:
>> [+cc Keith, Gabriele, Dongdong]
>>
>> On Mon, Dec 18, 2017 at 04:38:03PM -0600, Bryant G. Ly wrote:
>>> Devices can go offline when EEH is reported. This patch adds
>>> a change to the kernel object and lets udev know of error.
>>> When device resumes a change is also set reporting device as
>>> online. Therefore, EEH events are better propagated to user
>>> space for devices in powerpc arch.
>> I'm on vacation and can't review this in detail, but I wonder if you
>> can compare this with the uevents we emit for DPC, AER, and hotplug
>> events (if any).  I hope we don't end up with userspace having to be
>> aware of the differences between EEH, DPC, AER, etc.
>>
>> From a very quick look, I only see a few uevents even mentioned in
>> drivers/pci: KOBJ_ADD in __pci_hp_register() and KOBJ_CHANGE in the
>> SR-IOV code.  I'm worried that we're missing some important uevents in
>> the PCI core.  That's not an argument against what you're doing here;
>> it just would be nice to fill in any missing pieces in the core also,
>> and hopefully make them consistent with these EEH events.
> We also need to be careful about what specific EEH activity we are
> talking about, and if we bring into the picture things like DPDK, it
> gets even more murky...
>
> The basic way EEH is supposed to work for recovery (minus all sort of
> implementation nasties which hopefully Russell and Sam are trying to
> cleanup and fix) is that either:
>
>   - The driver of the device has recovery callbacks, in which
> case the driver participates in the recovery process, the device
> doesn't "go away" (though it shouldn't be accessed during that process
> by other entities, userspace originated config space could be a problem
> and needs to be blocked...). The recovery typically involves a reset of
> the device but in sync with the driver.
>
>   - The driver doesn't have the callbacks. In this case, we
> simulate an unplug, reset the device, and replug.
>
> So it makes sense for the second case to emit the same uevents as a
> normal PCI(e) hotplug.
>
> For the former case I'm less sure... Do we really need userspace to be
> notified? If yes, what for precisely?

In a pSeries SR-IOV environment the management console might need to apply
certain configuration changes to the PF driver after it has been recovered
and before the VF drivers are allowed to resume their recovery path.
I could not think of another way to notify user space of these events.
I made this assumption because I saw that no uevents are emitted when
the device goes offline and comes back online in the EEH code. My
intention was to make the event as generic as possible in the EEH
component, thereby keeping this change independent of pSeries SR-IOV.

- Juan



Re: [PATCH v2 2/7] powerpc/kernel: Add uevents in EEH error/resume

2017-12-20 Thread Juan Alvarez
On 12/18/17 10:59 PM, Russell Currey wrote:

> On Mon, 2017-12-18 at 22:50 -0600, Bjorn Helgaas wrote:
>> [+cc Keith, Gabriele, Dongdong]
>>
>> On Mon, Dec 18, 2017 at 04:38:03PM -0600, Bryant G. Ly wrote:
>>> Devices can go offline when EEH is reported. This patch adds
>>> a change to the kernel object and lets udev know of error.
>>> When device resumes a change is also set reporting device as
>>> online. Therefore, EEH events are better propagated to user
>>> space for devices in powerpc arch.
>> I'm on vacation and can't review this in detail, but I wonder if you
>> can compare this with the uevents we emit for DPC, AER, and hotplug
>> events (if any).  I hope we don't end up with userspace having to be
>> aware of the differences between EEH, DPC, AER, etc.
>>
>> From a very quick look, I only see a few uevents even mentioned in
>> drivers/pci: KOBJ_ADD in __pci_hp_register() and KOBJ_CHANGE in the
>> SR-IOV code.  I'm worried that we're missing some important uevents in
>> the PCI core.

The only place where I see KOBJ_REMOVE being used is when the device is
removed in pci_destroy_dev -> device_del, which will be called implicitly
in the permanent failure path of the EEH code.

>> That's not an argument against what you're doing here;
>> it just would be nice to fill in any missing pieces in the core also,
>> and hopefully make them consistent with these EEH events.
> I don't think this needs to be particularly complex, could we get away
> with events for when devices do the following?
>
> - begin recovery
> - successfully recover
> - fail recovery

If there are no objections in the ongoing review of this patch,
I can change them to these names:

  - BEGIN_RECOVERY
  - SUCCESSFUL_RECOVERY
  - FAILED_RECOVERY

>
> It might be worthwhile sorting out some consistent, non-EEH-specific
> naming, and then other device error recovery systems can do the same
> later.
>
Do you have more consistent naming in mind for these events?

- Juan
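
As a rough illustration of the three proposed event names, the sketch below
formats one "KEY=VALUE" string per recovery state, which is the shape of
string a kernel caller would place in the envp array passed to
kobject_uevent_env(). The key ERROR_EVENT and both helper names are
hypothetical; nothing in this thread has settled on them.

```c
#include <stdio.h>

/* The three recovery states proposed in this thread. */
enum recovery_event {
    BEGIN_RECOVERY,
    SUCCESSFUL_RECOVERY,
    FAILED_RECOVERY,
};

static const char *recovery_event_name(enum recovery_event ev)
{
    switch (ev) {
    case BEGIN_RECOVERY:
        return "BEGIN_RECOVERY";
    case SUCCESSFUL_RECOVERY:
        return "SUCCESSFUL_RECOVERY";
    case FAILED_RECOVERY:
        return "FAILED_RECOVERY";
    }
    return "UNKNOWN";
}

/* Format a uevent-style environment string. The ERROR_EVENT key is a
 * placeholder, not an agreed name. Returns the snprintf() result. */
static int format_recovery_uevent(char *buf, size_t len,
                                  enum recovery_event ev)
{
    return snprintf(buf, len, "ERROR_EVENT=%s", recovery_event_name(ev));
}
```

Keeping the state in a single variable, rather than inventing one kobject
action per state, would let udev rules match on one key regardless of whether
the error came from EEH or another recovery mechanism.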



RE: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Elliott, Robert (Persistent Memory)


> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> Ross Zwisler
...
> 
> On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
...
> > initiator is a CPU?  I'd have expected you to expose a memory controller
> > abstraction rather than re-use storage terminology.
> 
> Yea, I agree that at first blush it seems weird.  It turns out that
> looking at it in sort of a storage initiator/target way is beneficial,
> though, because it allows us to cut down on the number of data values
> we need to represent.
> 
> For example the SLIT, which doesn't differentiate between initiator and
> target proximity domains (and thus nodes) always represents a system
> with N proximity domains using a NxN distance table.  This makes sense
> if every node contains both CPUs and memory.
> 
> With the introduction of the HMAT, though, we can have memory-only
> initiator nodes and we can explicitly associate them with their local 
> CPU.  This is necessary so that we can separate memory with different
> performance characteristics (HBM vs normal memory vs persistent memory,
> for example) that are all attached to the same CPU.
> 
> So, say we now have a system with 4 CPUs, and each of those CPUs has 3
> different types of memory attached to it.  We now have 16 total proximity
> domains, 4 CPU and 12 memory.

The CPU cores that make up a node can have performance restrictions of
their own; for example, they might max out at 10 GB/s even though the
memory controller supports 120 GB/s (meaning you need to use 12 cores
on the node to fully exercise memory).  It'd be helpful to report this,
so software can decide how many cores to use for bandwidth-intensive work.

> If we represent this with the SLIT we end up with a 16 X 16 distance table
> (256 entries), most of which don't matter because they are memory-to-
> memory distances which don't make sense.
> 
> In the HMAT, though, we separate out the initiators and the targets and
> put them into separate lists.  (See 5.2.27.4 System Locality Latency and
> Bandwidth Information Structure in ACPI 6.2 for details.)  So, this same
> config in the HMAT only has 4*12=48 performance values of each type, all
> of which convey meaningful information.
> 
> The HMAT indeed even uses the storage "initiator" and "target"
> terminology. :)

Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem
driver") have performance differences too.  A CPU might include
CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and
memory controllers that reach 120 GB/s.  I guess these would be
represented as extra initiators on the node?


---
Robert Elliott, HPE Persistent Memory
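
The arithmetic in this message and in the quoted text can be checked with a
few lines of C. This is only a worked example of the numbers already given
(4 CPU domains, 12 memory domains, 10 GB/s per core, 120 GB/s per memory
controller); the function names are made up for the sketch.

```c
/* SLIT: one N x N table over all proximity domains, CPU and memory alike. */
static unsigned int slit_entries(unsigned int cpu_domains,
                                 unsigned int mem_domains)
{
    unsigned int n = cpu_domains + mem_domains;

    return n * n;
}

/* HMAT: one value per (initiator, target) pair for each data type. */
static unsigned int hmat_entries(unsigned int initiators,
                                 unsigned int targets)
{
    return initiators * targets;
}

/* Cores needed to reach the memory bandwidth, rounded up. */
static unsigned int cores_to_saturate(unsigned int mem_gbps,
                                      unsigned int core_gbps)
{
    return (mem_gbps + core_gbps - 1) / core_gbps;
}
```

For the example system, slit_entries(4, 12) gives 256, hmat_entries(4, 12)
gives 48, and cores_to_saturate(120, 10) gives 12.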





Re: [-next PATCH 4/4] treewide: Use DEVICE_ATTR_WO

2017-12-20 Thread Zhang Rui
On Tue, 2017-12-19 at 10:15 -0800, Joe Perches wrote:
> Convert DEVICE_ATTR uses to DEVICE_ATTR_WO where possible.
> 
> Done with perl script:
> 
> $ git grep -w --name-only DEVICE_ATTR | \
>   xargs perl -i -e 'local $/; while (<>) {
> s/\bDEVICE_ATTR\s*\(\s*(\w+)\s*,\s*\(?(?:\s*S_IWUSR\s*|\s*0200\s*)\)?
> \s*,\s*NULL\s*,\s*\1_store\s*\)/DEVICE_ATTR_WO(\1)/g; print;}'
> 
> Signed-off-by: Joe Perches 
> ---
>  arch/s390/kernel/smp.c | 2 +-
>  arch/x86/kernel/cpu/microcode/core.c   | 2 +-
>  drivers/input/touchscreen/elants_i2c.c | 2 +-
>  drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
>  drivers/net/wimax/i2400m/sysfs.c   | 3 +--
>  drivers/scsi/lpfc/lpfc_attr.c  | 3 +--
>  drivers/thermal/thermal_sysfs.c| 2 +-

For the thermal part,
Acked-by: Zhang Rui 

thanks,
rui

>  7 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> index b8c1a85bcf2d..a919b2f0141d 100644
> --- a/arch/s390/kernel/smp.c
> +++ b/arch/s390/kernel/smp.c
> @@ -1151,7 +1151,7 @@ static ssize_t __ref rescan_store(struct device
> *dev,
>   rc = smp_rescan_cpus();
>   return rc ? rc : count;
>  }
> -static DEVICE_ATTR(rescan, 0200, NULL, rescan_store);
> +static DEVICE_ATTR_WO(rescan);
>  #endif /* CONFIG_HOTPLUG_CPU */
>  
>  static int __init s390_smp_init(void)
> diff --git a/arch/x86/kernel/cpu/microcode/core.c
> b/arch/x86/kernel/cpu/microcode/core.c
> index c4fa4a85d4cb..09c74b0560dd 100644
> --- a/arch/x86/kernel/cpu/microcode/core.c
> +++ b/arch/x86/kernel/cpu/microcode/core.c
> @@ -560,7 +560,7 @@ static ssize_t pf_show(struct device *dev,
>   return sprintf(buf, "0x%x\n", uci->cpu_sig.pf);
>  }
>  
> -static DEVICE_ATTR(reload, 0200, NULL, reload_store);
> +static DEVICE_ATTR_WO(reload);
>  static DEVICE_ATTR(version, 0400, version_show, NULL);
>  static DEVICE_ATTR(processor_flags, 0400, pf_show, NULL);
>  
> diff --git a/drivers/input/touchscreen/elants_i2c.c
> b/drivers/input/touchscreen/elants_i2c.c
> index a458e5ec9e41..819213e88f32 100644
> --- a/drivers/input/touchscreen/elants_i2c.c
> +++ b/drivers/input/touchscreen/elants_i2c.c
> @@ -1000,7 +1000,7 @@ static ssize_t show_iap_mode(struct device
> *dev,
>   "Normal" : "Recovery");
>  }
>  
> -static DEVICE_ATTR(calibrate, S_IWUSR, NULL, calibrate_store);
> +static DEVICE_ATTR_WO(calibrate);
>  static DEVICE_ATTR(iap_mode, S_IRUGO, show_iap_mode, NULL);
>  static DEVICE_ATTR(update_fw, S_IWUSR, NULL, write_update_fw);
>  
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index 1dc4aef37d3a..42b96e1a1b13 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -4411,7 +4411,7 @@ static ssize_t failover_store(struct device
> *dev, struct device_attribute *attr,
>   return count;
>  }
>  
> -static DEVICE_ATTR(failover, 0200, NULL, failover_store);
> +static DEVICE_ATTR_WO(failover);
>  
>  static unsigned long ibmvnic_get_desired_dma(struct vio_dev *vdev)
>  {
> diff --git a/drivers/net/wimax/i2400m/sysfs.c
> b/drivers/net/wimax/i2400m/sysfs.c
> index 1237109f251a..8c67df11105c 100644
> --- a/drivers/net/wimax/i2400m/sysfs.c
> +++ b/drivers/net/wimax/i2400m/sysfs.c
> @@ -65,8 +65,7 @@ ssize_t i2400m_idle_timeout_store(struct device
> *dev,
>  }
>  
>  static
> -DEVICE_ATTR(i2400m_idle_timeout, S_IWUSR,
> - NULL, i2400m_idle_timeout_store);
> +DEVICE_ATTR_WO(i2400m_idle_timeout);
>  
>  static
>  struct attribute *i2400m_dev_attrs[] = {
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c
> b/drivers/scsi/lpfc/lpfc_attr.c
> index 517ff203cfde..6ddaf51a23f6 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -2418,8 +2418,7 @@ lpfc_soft_wwn_enable_store(struct device *dev,
> struct device_attribute *attr,
>  
>   return count;
>  }
> -static DEVICE_ATTR(lpfc_soft_wwn_enable, S_IWUSR, NULL,
> -    lpfc_soft_wwn_enable_store);
> +static DEVICE_ATTR_WO(lpfc_soft_wwn_enable);
>  
>  /**
>   * lpfc_soft_wwpn_show - Return the cfg soft ww port name of the
> adapter
> diff --git a/drivers/thermal/thermal_sysfs.c
> b/drivers/thermal/thermal_sysfs.c
> index 2bc964392924..ba81c9080f6e 100644
> --- a/drivers/thermal/thermal_sysfs.c
> +++ b/drivers/thermal/thermal_sysfs.c
> @@ -317,7 +317,7 @@ emul_temp_store(struct device *dev, struct
> device_attribute *attr,
>  
>   return ret ? ret : count;
>  }
> -static DEVICE_ATTR(emul_temp, S_IWUSR, NULL, emul_temp_store);
> +static DEVICE_ATTR_WO(emul_temp);
>  #endif
>  
>  static ssize_t
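
For readers unfamiliar with the macro, the following userspace mock shows why
the conversion only applies where the store function is named <attr>_store:
DEVICE_ATTR_WO() derives the callback name from the attribute name and fixes
the mode at 0200. The struct here is a flattened simplification of the
kernel's struct device_attribute, so treat it as a sketch rather than the
real definition.

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

struct device;  /* opaque in this mock */

/* Flattened stand-in for the kernel's struct device_attribute. */
struct device_attribute {
    const char *name;
    unsigned short mode;
    ssize_t (*show)(struct device *, struct device_attribute *, char *);
    ssize_t (*store)(struct device *, struct device_attribute *,
                     const char *, size_t);
};

/* Mock of the generic form: everything is spelled out by the caller. */
#define DEVICE_ATTR(_name, _mode, _show, _store) \
    struct device_attribute dev_attr_##_name = \
        { #_name, _mode, _show, _store }

/* Mock of the write-only form: mode 0200, no show, <name>_store implied. */
#define DEVICE_ATTR_WO(_name) \
    struct device_attribute dev_attr_##_name = \
        { #_name, 0200, NULL, _name##_store }

static ssize_t rescan_store(struct device *dev, struct device_attribute *attr,
                            const char *buf, size_t count)
{
    (void)dev;
    (void)attr;
    (void)buf;
    return (ssize_t)count;  /* pretend the whole write was consumed */
}

/* Both declarations produce the same mode and callbacks. */
static DEVICE_ATTR(rescan_old, 0200, NULL, rescan_store);
static DEVICE_ATTR_WO(rescan);
```

An attribute such as update_fw in elants_i2c.c, whose callback is named
write_update_fw, cannot be converted without first renaming the function,
which is why it is left untouched in the diff.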


Re: [PATCH v9 29/51] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-12-20 Thread Benjamin Herrenschmidt
On Wed, 2017-12-20 at 09:50 -0800, Ram Pai wrote:
> The argument against this patch is --  it should not be baked into
> the ABI as yet, since we do not have clarity on what applications need.
> 
> As it stands today the only way to figure out the information from
> userspace is by probing the kernel through calls to sys_pkey_alloc().
> 
> AT_HWCAP can be used, but that will certainly not be capable of
> providing all the information that userspace might expect.
> 
> Your thoughts?

Well, there's one well known application wanting that whole keys
business, so why not ask them what works for them ?

In the meantime, that shouldn't block the rest of the patches.

Cheers,
Ben.



Re: ps3: Improve a size determination in five functions

2017-12-20 Thread Geoff Levand
On 12/20/2017 01:20 PM, SF Markus Elfring wrote:
>>  o Your patch fixes no bug nor replaces any deprecated feature.
> 
> What do you think about the information in section “14) Allocating memory”
> of the document “coding-style.rst” regarding the shown source code transformation?

I would rank maintenance and user support above coding style.

Regarding Section 14 of coding-style.rst specifically, as I mentioned the
PS3 support is over 10 years old.  I don't expect a change to the type of
any structures.  If there are type changes, then we can update the
allocation size parameters at that time.

-Geoff
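
For context, the form that section of coding-style.rst prefers is
p = kmalloc(sizeof(*p), ...), so the allocation size follows the pointer's
type automatically. The userspace sketch below shows the same idiom with
calloc(); struct region and region_alloc() are hypothetical names, not PS3
code.

```c
#include <stdlib.h>

/* Hypothetical structure, used only to illustrate the idiom. */
struct region {
    unsigned long bus_addr;
    unsigned int len;
};

/* sizeof(*p) tracks the type of p, so a later change to the pointer's
 * type cannot leave a stale sizeof(struct ...) behind. */
static struct region *region_alloc(void)
{
    struct region *p = calloc(1, sizeof(*p));

    return p;   /* NULL on allocation failure; caller frees with free() */
}
```

Both spellings allocate the same number of bytes today; the difference only
matters if the type changes, which is the point being argued above.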


Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"

2017-12-20 Thread Rasmus Villemoes
On Tue, Dec 19 2017, Michael Ellerman  wrote:

> Hi Johannes,
>
>> From: Johannes Berg 
>> 
>> This reverts commit d6f295e9def0; some userspace (in the case
>
> This revert seems to have broken networking on one of my powerpc
> machines, according to git bisect.
>
> The symptom is DHCP fails and I don't get a link, I didn't dig any
> further than that. I can if it's helpful.
>
> I think the problem is that 87c320e51519 ("net: core: dev_get_valid_name
> is now the same as dev_alloc_name_ns") only makes sense while
> d6f295e9def0 remains in the tree.

I'm sorry about all of this, I really didn't think there would be such
consequences of changing an errno return. Indeed, d6f29 was preparation
for unifying the two functions that do the exact same thing (and how we
ever got into that situation is somewhat unclear), except for
their behaviour in the case the requested name already exists. So one of
the two interfaces had to change its return value, and as I wrote, I
thought EEXIST was the saner choice when an explicit name (no %d) had
been requested.

> ie. before the entire series, dev_get_valid_name() would return EEXIST,
> and that was retained when 87c320e51519 was merged, but now that
> d6f295e9def0 has been reverted dev_get_valid_name() is returning ENFILE.
>
> I can get the network up again if I also revert 87c320e51519 ("net:
> core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or with
> the gross patch below.

I don't think changing -ENFILE to -EEXIST would be right either, since
dev_get_valid_name() used to be able to return both (-EEXIST in the case
where there's no %d, -ENFILE in the case where we end up calling
dev_alloc_name_ns()). If anything, we could do the check for the old
-EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
fine with reverting.

Again, sorry :(

Rasmus
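
The ordering suggested in the last paragraph can be sketched in userspace as
follows. This is a model of the idea, not the net core code: the helper
names, the flat array of in-use names and the MAX_UNITS limit are all
assumptions made for the example.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_UNITS 4 /* hypothetical limit; kept tiny for the example */

static int name_in_use(const char *const *used, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(used[i], name) == 0)
            return 1;
    return 0;
}

/* Stand-in for the template path: fill "%d" with the first free unit.
 * The caller guarantees that tmpl contains "%d". */
static int alloc_name(const char *const *used, size_t n, const char *tmpl,
                      char *out, size_t len)
{
    const char *p = strstr(tmpl, "%d");

    for (int unit = 0; unit < MAX_UNITS; unit++) {
        snprintf(out, len, "%.*s%d%s", (int)(p - tmpl), tmpl, unit, p + 2);
        if (!name_in_use(used, n, out))
            return unit;
    }
    return -ENFILE; /* every unit is taken */
}

/* Check the explicit-name collision first, then try the template. */
static int get_valid_name(const char *const *used, size_t n,
                          const char *name, char *out, size_t len)
{
    if (!strstr(name, "%d")) {
        if (name_in_use(used, n, name))
            return -EEXIST;
        snprintf(out, len, "%s", name);
        return 0;
    }
    return alloc_name(used, n, name, out, len);
}
```

With this ordering an explicit name that is already taken reports -EEXIST,
while an exhausted template still reports -ENFILE, which matches the pair of
return values described above for the old dev_get_valid_name().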


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Ross Zwisler
On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
>  wrote:
> > On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
> >> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> >> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> >> > > I don't know what the right interface is, but my laptop has a set of
> >> > > /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> >> > > right place to expose write_bw (etc).
> >> >
> >> > Those directories are already too redundant and wasteful.  I think we'd
> >> > really rather not add to them.  In addition, it's technically possible
> >> > to have a memory section span NUMA nodes and have different performance
> >> > properties, which make it impossible to represent there.
> >> >
> >> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> >> > uniform performance properties in the HMAT, and we just so happen to
> >> > always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.
> >>
> >> I think you're missing my larger point which is that I don't think this
> >> should be exposed to userspace as an ACPI feature.  Because if you do,
> >> then it'll also be exposed to userspace as an openfirmware feature.
> >> And sooner or later a devicetree feature.  And then writing a portable
> >> program becomes an exercise in suffering.
> >>
> >> So, what's the right place in sysfs that isn't tied to ACPI?  A new
> >> directory or set of directories under /sys/devices/system/memory/ ?
> >
> > Oh, the current location isn't at all tied to acpi except that it happens to
> > be named 'hmat'.  When it was all named 'hmem' it was just:
> >
> > /sys/devices/system/hmem
> >
> > Which has no ACPI-isms at all.  I'm happy to move it under
> > /sys/devices/system/memory/hmat if that's helpful, but I think we still have
> > the issue that the data represented therein is still pulled right from the
> > HMAT, and I don't know how to abstract it into something more platform
> > agnostic until I know what data is provided by those other platforms.
> >
> > For example, the HMAT provides latency information and bandwidth information
> > for both reads and writes.  Will the devicetree/openfirmware/etc version have
> > this same info, or will it be just different enough that it won't translate
> > into whatever I choose to stick in sysfs?
> 
> For the initial implementation do we need to have a representation of
> all the performance data? Given that
> /sys/devices/system/node/nodeX/distance is the only generic
> performance attribute published by the kernel today it is already the
> case that applications that need to target specific memories need to
> go parse information that is not provided by the kernel by default.
> The question is can those specialized applications stay special and go
> parse the platform specific data sources, like raw HMAT, directly, or
> do we expect general purpose applications to make use of this data? I
> think a firmware-id to numa-node translation facility
> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
> build on with more information as specific use cases arise.

We don't represent all the performance data, we only represent the data for
local initiator/target pairs.  I do think that this is useful to have in sysfs
because it provides a way to easily answer the most commonly asked questions
(or at least what I'm guessing will be the most commonly asked questions),
i.e. "given a CPU, what are the speeds of the various types of memory attached
to it", and "given a chunk of memory, how fast is it and to which CPU is it
local"?  By providing this base level of information I'm hoping to prevent
most applications from having to parse the HMAT directly.

The question of whether or not to include this local performance information
was one of the main questions of the initial RFC patch series, and I did get
feedback (albeit off-list) that the local performance information was
valuable to at least some users.  I did intentionally structure my (now very
short) set so that the performance information was added as a separate patch,
so we can get to the place you're talking about where we only provide firmware
id <=> proximity domain mappings by just leaving off the last patch in the
series.

I'm personally still of the opinion though that this last patch does add
value.


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Dan Williams
On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
 wrote:
> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>> > > I don't know what the right interface is, but my laptop has a set of
>> > > /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
>> > > right place to expose write_bw (etc).
>> >
>> > Those directories are already too redundant and wasteful.  I think we'd
>> > really rather not add to them.  In addition, it's technically possible
>> > to have a memory section span NUMA nodes and have different performance
>> > properties, which make it impossible to represent there.
>> >
>> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>> > uniform performance properties in the HMAT, and we just so happen to
>> > always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.
>>
>> I think you're missing my larger point which is that I don't think this
>> should be exposed to userspace as an ACPI feature.  Because if you do,
>> then it'll also be exposed to userspace as an openfirmware feature.
>> And sooner or later a devicetree feature.  And then writing a portable
>> program becomes an exercise in suffering.
>>
>> So, what's the right place in sysfs that isn't tied to ACPI?  A new
>> directory or set of directories under /sys/devices/system/memory/ ?
>
> Oh, the current location isn't at all tied to acpi except that it happens to
> be named 'hmat'.  When it was all named 'hmem' it was just:
>
> /sys/devices/system/hmem
>
> Which has no ACPI-isms at all.  I'm happy to move it under
> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
> the issue that the data represented therein is still pulled right from the
> HMAT, and I don't know how to abstract it into something more platform
> agnostic until I know what data is provided by those other platforms.
>
> For example, the HMAT provides latency information and bandwidth information
> for both reads and writes.  Will the devicetree/openfirmware/etc version have
> this same info, or will it be just different enough that it won't translate
> into whatever I choose to stick in sysfs?

For the initial implementation do we need to have a representation of
all the performance data? Given that
/sys/devices/system/node/nodeX/distance is the only generic
performance attribute published by the kernel today it is already the
case that applications that need to target specific memories need to
go parse information that is not provided by the kernel by default.
The question is can those specialized applications stay special and go
parse the platform specific data sources, like raw HMAT, directly, or
do we expect general purpose applications to make use of this data? I
think a firmware-id to numa-node translation facility
(/sys/devices/system/node/nodeX/fwid) is a simple start that we can
build on with more information as specific use cases arise.
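
For reference, the existing generic attribute mentioned above is a single
line of space-separated integers, one per node. A minimal parser for such a
line might look like the sketch below; parse_distances() is a name invented
here, and the example input is illustrative rather than taken from a real
machine.

```c
#include <stddef.h>
#include <stdlib.h>

/* Parse a line such as "10 21\n", as found in
 * /sys/devices/system/node/nodeX/distance, into out[].
 * Returns the number of values stored, at most max. */
static size_t parse_distances(const char *line, int *out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        char *end;
        long v = strtol(line, &end, 10);

        if (end == line)    /* no further number on the line */
            break;
        out[n++] = (int)v;
        line = end;
    }
    return n;
}
```

An application would read the file for each node of interest and compare the
values; anything beyond relative distance, such as bandwidth or latency, is
what the proposed sysfs entries or a raw HMAT parser would have to supply.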


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Ross Zwisler
On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> > > I don't know what the right interface is, but my laptop has a set of
> > > /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> > > right place to expose write_bw (etc).
> > 
> > Those directories are already too redundant and wasteful.  I think we'd
> > really rather not add to them.  In addition, it's technically possible
> > to have a memory section span NUMA nodes and have different performance
> > properties, which make it impossible to represent there.
> > 
> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> > uniform performance properties in the HMAT, and we just so happen to
> > always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.
> 
> I think you're missing my larger point which is that I don't think this
> should be exposed to userspace as an ACPI feature.  Because if you do,
> then it'll also be exposed to userspace as an openfirmware feature.
> And sooner or later a devicetree feature.  And then writing a portable
> program becomes an exercise in suffering.
> 
> So, what's the right place in sysfs that isn't tied to ACPI?  A new
> directory or set of directories under /sys/devices/system/memory/ ?

Oh, the current location isn't at all tied to acpi except that it happens to
be named 'hmat'.  When it was all named 'hmem' it was just:

/sys/devices/system/hmem

Which has no ACPI-isms at all.  I'm happy to move it under
/sys/devices/system/memory/hmat if that's helpful, but I think we still have
the issue that the data represented therein is still pulled right from the
HMAT, and I don't know how to abstract it into something more platform
agnostic until I know what data is provided by those other platforms.

For example, the HMAT provides latency information and bandwidth information
for both reads and writes.  Will the devicetree/openfirmware/etc version have
this same info, or will it be just different enough that it won't translate
into whatever I choose to stick in sysfs?


Re: ps3: Improve a size determination in five functions

2017-12-20 Thread SF Markus Elfring
> Some observations:
> 
>  o Your patch fixes no bug nor replaces any deprecated feature.

What do you think about the information in section “14) Allocating memory”
of the document “coding-style.rst” regarding the shown source code transformation?


>  o There will be no functional change; …

Yes, the suggested adjustment should generally work this way.

Regards,
Markus


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Matthew Wilcox
On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> > I don't know what the right interface is, but my laptop has a set of
> > /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> > right place to expose write_bw (etc).
> 
> Those directories are already too redundant and wasteful.  I think we'd
> really rather not add to them.  In addition, it's technically possible
> to have a memory section span NUMA nodes and have different performance
> properties, which make it impossible to represent there.
> 
> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
> uniform performance properties in the HMAT, and we just so happen to
> always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.

I think you're missing my larger point which is that I don't think this
should be exposed to userspace as an ACPI feature.  Because if you do,
then it'll also be exposed to userspace as an openfirmware feature.
And sooner or later a devicetree feature.  And then writing a portable
program becomes an exercise in suffering.

So, what's the right place in sysfs that isn't tied to ACPI?  A new
directory or set of directories under /sys/devices/system/memory/ ?


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Ross Zwisler
On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> > What I'm hoping to do with this series is to just provide a sysfs
> > representation of the HMAT so that applications can know which NUMA nodes to
> > select with existing utilities like numactl.  This series does not currently
> > alter any kernel behavior, it only provides a sysfs interface.
> > 
> > Say for example you had a system with some high bandwidth memory (HBM), and
> > you wanted to use it for a specific application.  You could use the sysfs
> > representation of the HMAT to figure out which memory target held your HBM.
> > You could do this by looking at the local bandwidth values for the various
> > memory targets, so:
> > 
> > # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
> > /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
> > /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
> > /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
> > /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> > 
> > and look for the one that corresponds to your HBM speed. (These numbers are
> > made up, but you get the idea.)
> 
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Hey Matthew,

Yep, this is where I started as well.  My plan with my initial implementation
was to try and make the sysfs representation as platform agnostic as possible,
and just have the ACPI HMAT as one of the many places to gather the data
needed to populate sysfs.

However, as I began coding, the implementation became very specific to the
HMAT, probably because I don't know of a way that this type of info is
represented on another platform.  John Hubbard noticed the same thing and
asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to
prevent it from being confused with the HMM work:

https://lkml.org/lkml/2017/7/7/33
https://lkml.org/lkml/2017/7/7/442

I'm open to making it more platform agnostic if I can get my hands on a
parallel effort in another platform and tease out the commonality, but trying
to do that without a second example hasn't worked out.  If we don't have a
good second example right now I think maybe we should put this in and then
merge it with the second example when it comes along.

> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).
> 
> > Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> > its local initiator:
> > 
> > # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
> > /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> > 
> > So, in our made-up example our HBM is located in numa node 2, and the local
> > CPU for that HBM is at numa node 0.
> 
> initiator is a CPU?  I'd have expected you to expose a memory controller
> abstraction rather than re-use storage terminology.

Yea, I agree that at first blush it seems weird.  It turns out that looking at
it in sort of a storage initiator/target way is beneficial, though, because it
allows us to cut down on the number of data values we need to represent.

For example the SLIT, which doesn't differentiate between initiator and target
proximity domains (and thus nodes) always represents a system with N proximity
domains using a NxN distance table.  This makes sense if every node contains
both CPUs and memory.

With the introduction of the HMAT, though, we can have memory-only initiator
nodes and we can explicitly associate them with their local CPU.  This is
necessary so that we can separate memory with different performance
characteristics (HBM vs normal memory vs persistent memory, for example) that
are all attached to the same CPU.

So, say we now have a system with 4 CPUs, and each of those CPUs has 3
different types of memory attached to it.  We now have 16 total proximity
domains, 4 CPU and 12 memory.

If we represent this with the SLIT we end up with a 16 X 16 distance table
(256 entries), most of which don't matter because they are memory-to-memory
distances which don't make sense.

In the HMAT, though, we separate out the initiators and the targets and put
them into separate lists.  (See 5.2.27.4 System Locality Latency and Bandwidth
Information Structure in ACPI 6.2 for details.)  So, this same config in the
HMAT only has 4*12=48 performance values of each type, all of which convey
meaningful information.
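
The arithmetic above can be checked with a small sketch. The function names
are made up for illustration, and the sketch only counts table entries; it
does not parse any ACPI table.

```c
#include <assert.h>
#include <stddef.h>

/* SLIT: one distance per ordered pair of proximity domains, so N x N. */
size_t slit_entries(size_t initiators, size_t targets)
{
	size_t domains = initiators + targets;

	return domains * domains;
}

/* HMAT: one value per (initiator, target) pair for each data type. */
size_t hmat_entries(size_t initiators, size_t targets)
{
	return initiators * targets;
}

/* The 4 CPU / 12 memory configuration described above. */
void check_example_config(void)
{
	assert(slit_entries(4, 12) == 256);
	assert(hmat_entries(4, 12) == 48);
}
```
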

The HMAT indeed even uses the storage "initiator" and "target" terminology. :)


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Dave Hansen
On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
> I don't know what the right interface is, but my laptop has a set of
> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
> right place to expose write_bw (etc).

Those directories are already too redundant and wasteful.  I think we'd
really rather not add to them.  In addition, it's technically possible
to have a memory section span NUMA nodes and have different performance
properties, which make it impossible to represent there.

In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
uniform performance properties in the HMAT, and we just so happen to
always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.


Re: [PATCH 2/2] ps3: Improve a size determination in five functions

2017-12-20 Thread Geoff Levand
Hi,

On 12/16/2017 05:54 AM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sat, 16 Dec 2017 14:21:04 +0100
> 
> Replace the specification of data structures by variable references
> as the parameter for the operator "sizeof" to make the corresponding size
> determination a bit safer according to the Linux coding style convention.

After some thought, I've decided to reject this patch and others like
it because I feel it will make long term maintenance of the PS3 code
more difficult.

Some observations:

 o Your patch fixes no bug nor replaces any deprecated feature.
 o There will be no functional change; the generated binary
   will be nearly identical.
 o The PS3 kernel support is now over 10 years old.
 o I need to continue support for a few old kernel versions,
   specifically linux-3.15 and linux-2.6.30.  That includes
   keeping them working with new toolchain versions.  I need
   to back port fixes to these old kernels.
 o When problems arise I sometimes need to use git bisect
   back to old kernel versions.  When I do the bisect I often
   have fixes and local debug patches that I apply to the
   bisected tree before building.
 o Source code changes between versions cause patch conflicts
   that need to be manually resolved.  This can be error prone
   and very time consuming on a long bisect session.

My decision to reject this patch and others like it is an
attempt to minimize the code maintenance effort.  If you have
patches that fix bugs, upgrade deprecated features, or
generally improve functionality please submit them for
review.

-Geoff


Re: [PATCH V5] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread Frederic Barrat



--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -173,7 +173,7 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 * flags are set it's invalid
 */
if (work.reserved1 || work.reserved2 || work.reserved3 ||
-   work.reserved4 || work.reserved5 || work.reserved6 ||
+   work.reserved4 || work.reserved5 ||
(work.flags & ~CXL_START_WORK_ALL)) {
rc = -EINVAL;
goto out;
@@ -248,7 +248,19 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 */
smp_mb();

-   trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
+   /* Assign a unique TIDR (thread id) for the current thread */
+   if (work.flags & CXL_START_WORK_TID) {
+   rc = cxl_context_thread_tidr(ctx);
+   if (rc)


We're already pretty deep and have allocated quite a few resources, so 
we'd need to unwind (see error path below when the attach fails).


However, we cannot clear the thread TIDR register, so we need to be 
careful that a user process cannot exhaust our limited pool of TIDs by 
calling the attach ioctl with bogus arguments. Which should be easy to 
do: attach the max number of contexts, and keep calling attach!
So we're going to need to figure out something to prevent that (define a 
max allocation per context? with a value of 1 for now?)




diff --git a/include/uapi/misc/cxl.h b/include/uapi/misc/cxl.h
index 49e8fd0..3ea2d4b4 100644
--- a/include/uapi/misc/cxl.h
+++ b/include/uapi/misc/cxl.h
@@ -20,20 +20,22 @@ struct cxl_ioctl_start_work {
__u64 work_element_descriptor;
__u64 amr;
__s16 num_interrupts;
-   __s16 reserved1;
-   __s32 reserved2;
+   __s16 tid;


Should probably be unsigned.

  Fred
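
A minimal sketch of why the signedness matters, assuming thread ids may use
the full 16-bit range (an assumption, not something stated in the patch).
The struct and function names are hypothetical stand-ins for the ioctl field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two candidate field types. */
struct work_signed   { int16_t  tid; };
struct work_unsigned { uint16_t tid; };

/* Returns 1 if the id can be read back unchanged from a signed field. */
int tid_survives_signed(uint32_t tidr)
{
	struct work_signed w;

	/* Out-of-range conversion is implementation-defined in C. */
	w.tid = (int16_t)tidr;
	return w.tid >= 0 && (uint32_t)w.tid == tidr;
}

/* Returns 1 if the id can be read back unchanged from an unsigned field. */
int tid_survives_unsigned(uint32_t tidr)
{
	struct work_unsigned w;

	w.tid = (uint16_t)tidr;
	return (uint32_t)w.tid == tidr;
}
```

Under that assumption, any id above 32767 cannot be represented in the
signed field, while the unsigned field keeps it intact.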



+   __s32 reserved1;
+   __u64 reserved2;
__u64 reserved3;
__u64 reserved4;
__u64 reserved5;
-   __u64 reserved6;
  };

  #define CXL_START_WORK_AMR0x0001ULL
  #define CXL_START_WORK_NUM_IRQS   0x0002ULL
  #define CXL_START_WORK_ERR_FF 0x0004ULL
+#define CXL_START_WORK_TID 0x0008ULL
  #define CXL_START_WORK_ALL(CXL_START_WORK_AMR |\
 CXL_START_WORK_NUM_IRQS |\
-CXL_START_WORK_ERR_FF)
+CXL_START_WORK_ERR_FF |\
+CXL_START_WORK_TID)


  /* Possible modes that an afu can be in */





Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-20 Thread Matthew Wilcox
On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
> What I'm hoping to do with this series is to just provide a sysfs
> representation of the HMAT so that applications can know which NUMA nodes to
> select with existing utilities like numactl.  This series does not currently
> alter any kernel behavior, it only provides a sysfs interface.
> 
> Say for example you had a system with some high bandwidth memory (HBM), and
> you wanted to use it for a specific application.  You could use the sysfs
> representation of the HMAT to figure out which memory target held your HBM.
> You could do this by looking at the local bandwidth values for the various
> memory targets, so:
> 
>   # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
>   /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
>   /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
>   /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
>   /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
> 
> and look for the one that corresponds to your HBM speed. (These numbers are
> made up, but you get the idea.)

Presumably ACPI-based platforms will not be the only ones who have the
ability to expose different bandwidth memories in the future.  I think
we need a platform-agnostic way ... right, PowerPC people?

I don't know what the right interface is, but my laptop has a set of
/sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
right place to expose write_bw (etc).

> Once you know the NUMA node of your HBM, you can figure out the NUMA node of
> its local initiator:
> 
>   # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init*
>   /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0
> 
> So, in our made-up example our HBM is located in numa node 2, and the local
> CPU for that HBM is at numa node 0.

initiator is a CPU?  I'd have expected you to expose a memory controller
abstraction rather than re-use storage terminology.



Re: [RFC PATCH 3/8] powerpc/64s: put the per-cpu data_offset in r14

2017-12-20 Thread Gabriel Paubert
On Thu, Dec 21, 2017 at 12:52:01AM +1000, Nicholas Piggin wrote:
> Shifted left by 16 bits, so the low 16 bits of r14 remain available.
> This allows per-cpu pointers to be dereferenced with a single extra
> shift whereas previously it was a load and add.
> ---
>  arch/powerpc/include/asm/paca.h   |  5 +
>  arch/powerpc/include/asm/percpu.h |  2 +-
>  arch/powerpc/kernel/entry_64.S|  5 -
>  arch/powerpc/kernel/head_64.S |  5 +
>  arch/powerpc/kernel/setup_64.c| 11 +--
>  5 files changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index cd6a9a010895..4dd4ac69e84f 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -35,6 +35,11 @@
>  
>  register struct paca_struct *local_paca asm("r13");
>  #ifdef CONFIG_PPC_BOOK3S
> +/*
> + * The top 32-bits of r14 is used as the per-cpu offset, shifted by 
> PAGE_SHIFT.

Top 32, really? It's 48 in later comments.

Gabriel

> + * The per-cpu could be moved completely to vmalloc space if we had large
> + * vmalloc page mapping? (no, must access it in real mode).
> + */
>  register u64 local_r14 asm("r14");
>  #endif
>  
> diff --git a/arch/powerpc/include/asm/percpu.h 
> b/arch/powerpc/include/asm/percpu.h
> index dce863a7635c..1e0d79d30eac 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -12,7 +12,7 @@
>  
>  #include 
>  
> -#define __my_cpu_offset local_paca->data_offset
> +#define __my_cpu_offset (local_r14 >> 16)
>  
>  #endif /* CONFIG_SMP */
>  #endif /* __powerpc64__ */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 592e4b36065f..6b0e3ac311e8 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -262,11 +262,6 @@ system_call_exit:
>  BEGIN_FTR_SECTION
>   stdcx.  r0,0,r1 /* to clear the reservation */
>  END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
> - LOAD_REG_IMMEDIATE(r10, 0xdeadbeefULL << 32)
> - mfspr   r11,SPRN_PIR
> - or  r10,r10,r11
> - tdne    r10,r14
> -
>   andi.   r6,r8,MSR_PR
>   ld  r4,_LINK(r1)
>  
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 5a9ec06eab14..cdb710f43681 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -413,10 +413,7 @@ generic_secondary_common_init:
>   b   kexec_wait  /* next kernel might do better   */
>  
>  2:   SET_PACA(r13)
> - LOAD_REG_IMMEDIATE(r14, 0xdeadbeef << 32)
> - mfspr   r3,SPRN_PIR
> - or  r14,r14,r3
> - std r14,PACA_R14(r13)
> + ld  r14,PACA_R14(r13)
>  
>  #ifdef CONFIG_PPC_BOOK3E
>   addi    r12,r13,PACA_EXTLB  /* and TLB exc frame in another  */
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 9a4c5bf35d92..f4a96ebb523a 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -192,8 +192,8 @@ static void __init fixup_boot_paca(void)
>   get_paca()->data_offset = 0;
>   /* Mark interrupts disabled in PACA */
>   irq_soft_mask_set(IRQ_SOFT_MASK_STD);
> - /* Set r14 and paca_r14 to debug value */
> - get_paca()->r14 = (0xdeadbeefULL << 32) | mfspr(SPRN_PIR);
> + /* Set r14 and paca_r14 to zero */
> + get_paca()->r14 = 0;
>   local_r14 = get_paca()->r14;
>  }
>  
> @@ -761,7 +761,14 @@ void __init setup_per_cpu_areas(void)
>   for_each_possible_cpu(cpu) {
>  __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
>   paca[cpu].data_offset = __per_cpu_offset[cpu];
> +
> + BUG_ON(paca[cpu].data_offset & (PAGE_SIZE-1));
> + BUG_ON(paca[cpu].data_offset >= (1UL << (64 - 16)));
> +
> + /* The top 48 bits are used for per-cpu data */
> + paca[cpu].r14 |= paca[cpu].data_offset << 16;
>   }
> + local_r14 = paca[smp_processor_id()].r14;
>  }
>  #endif
>  
> -- 
> 2.15.0
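
The packing described in the changelog can be sketched in plain C as below.
This is a userspace illustration, not the kernel code: the helper names and
the R14_FLAG_* macros are made up, and the page-alignment check from the
patch is left out.

```c
#include <assert.h>
#include <stdint.h>

#define R14_FLAG_BITS 16
#define R14_FLAG_MASK ((UINT64_C(1) << R14_FLAG_BITS) - 1)

/* Put the per-cpu offset in the top 48 bits; keep the low 16 bits as flags. */
uint64_t r14_pack(uint64_t flags, uint64_t data_offset)
{
	/* Mirrors the BUG_ON range check: the offset must fit in 48 bits. */
	assert(data_offset < (UINT64_C(1) << (64 - R14_FLAG_BITS)));
	return (data_offset << R14_FLAG_BITS) | (flags & R14_FLAG_MASK);
}

/* Equivalent of the new __my_cpu_offset: a single shift. */
uint64_t r14_cpu_offset(uint64_t r14)
{
	return r14 >> R14_FLAG_BITS;
}

/* The low 16 bits stay available for other uses. */
uint64_t r14_flags(uint64_t r14)
{
	return r14 & R14_FLAG_MASK;
}
```

With this layout the register holds 48 bits of offset above 16 bits of
flags, which matches the "top 48 bits" comment in setup_64.c rather than the
"top 32-bits" comment in paca.h.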


Re: [PATCH v4 2/2] cxl: read PHB indications from the device tree

2017-12-20 Thread Frederic Barrat



Le 15/12/2017 à 14:48, Philippe Bergheaud a écrit :

Configure the P9 XSL_DSNCTL register with PHB indications found
in the device tree, or else use legacy hard-coded values.

Signed-off-by: Philippe Bergheaud 
---
Changelog:

v2: New patch. Use the new device tree property "ibm,phb-indications".

v3: No change.

v4: No functional change.
 Drop cosmetic fix in comment.

This patch depends on the following skiboot prerequisite:

https://patchwork.ozlabs.org/patch/849162/
---
  drivers/misc/cxl/cxl.h|  2 +-
  drivers/misc/cxl/cxllib.c |  2 +-
  drivers/misc/cxl/pci.c| 40 +++-
  3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index e46a4062904a..5a6e9a921c2b 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1062,7 +1062,7 @@ int cxl_psl_purge(struct cxl_afu *afu);
  int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
  u32 *phb_index, u64 *capp_unit_id);
  int cxl_slot_is_switched(struct pci_dev *dev);
-int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg);
+int cxl_get_xsl9_dsnctl(struct pci_dev *dev, u64 capp_unit_id, u64 *reg);
  u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9);

  void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx);
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index dc9bc1807fdf..61f80d586279 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -99,7 +99,7 @@ int cxllib_get_xsl_config(struct pci_dev *dev, struct 
cxllib_xsl_config *cfg)
if (rc)
return rc;

-   rc = cxl_get_xsl9_dsnctl(capp_unit_id, &cfg->dsnctl);
+   rc = cxl_get_xsl9_dsnctl(dev, capp_unit_id, &cfg->dsnctl);
if (rc)
return rc;
if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 19969ee86d6f..c58fb28685af 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -409,7 +409,36 @@ int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
return 0;
  }

-int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg)
+static u64 nbwind = 0;
+static u64 asnind = 0;
+static u64 capiind = 0;


Could we avoid the globals and keep the static within 
get_phb_indications() and have the function return them as out 
parameters? It would seem cleaner to me.


  Fred
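
Something along these lines, as a rough userspace sketch only. The prop
argument is a hypothetical stand-in for the "ibm,phb-indications" property
already converted to CPU endianness (NULL when absent); the device-tree
lookup itself is not modelled, and the cache is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Cached values live inside the function and are handed back through out
 * parameters, so no file-scope globals are needed.
 */
int get_phb_indications(const uint32_t *prop, uint64_t *capiind,
			uint64_t *asnind, uint64_t *nbwind)
{
	static bool cached;
	static uint64_t s_capiind, s_asnind, s_nbwind;

	if (!capiind || !asnind || !nbwind)
		return -1;

	if (!cached) {
		if (prop) {
			s_capiind = prop[0];
			s_asnind = prop[1];
			s_nbwind = prop[2];
		} else {
			/* legacy values, as in the patch */
			s_nbwind = 0x0300;
			s_asnind = 0x0400;
			s_capiind = 0x0200;
		}
		cached = true;
	}

	*capiind = s_capiind;
	*asnind = s_asnind;
	*nbwind = s_nbwind;
	return 0;
}
```

The caller would then use the returned values directly instead of reading
nbwind, asnind and capiind from file scope.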


+static int get_phb_indications(struct pci_dev *dev)
+{
+   struct device_node *np;
+   const __be32 *prop;
+
+   if (capiind)
+   return 0;
+
+   if (!(np = pnv_pci_get_phb_node(dev)))
+   return -1;
+
+   prop = of_get_property(np, "ibm,phb-indications", NULL);
+   if (!prop) {
+   nbwind = 0x0300UL; /* legacy values */
+   asnind = 0x0400UL;
+   capiind = 0x0200UL;
+   } else {
+   nbwind = (u64)be32_to_cpu(prop[2]);
+   asnind = (u64)be32_to_cpu(prop[1]);
+   capiind = (u64)be32_to_cpu(prop[0]);
+   }
+   of_node_put(np);
+   return 0;
+}
+
+int cxl_get_xsl9_dsnctl(struct pci_dev *dev, u64 capp_unit_id, u64 *reg)
  {
u64 xsl_dsnctl;

@@ -423,7 +452,8 @@ int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg)
 * Tell XSL where to route data to.
 * The field chipid should match the PHB CAPI_CMPM register
 */
-   xsl_dsnctl = ((u64)0x2 << (63-7)); /* Bit 57 */
+   get_phb_indications(dev);




+   xsl_dsnctl = (capiind << (63-15)); /* Bit 57 */
xsl_dsnctl |= (capp_unit_id << (63-15));

/* nMMU_ID Defaults to: b’01001’*/
@@ -437,14 +467,14 @@ int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg)
 * nbwind=0x03, bits [57:58], must include capi indicator.
 * Not supported on P9 DD1.
 */
-   xsl_dsnctl |= ((u64)0x03 << (63-47));
+   xsl_dsnctl |= (nbwind << (63-55));

/*
 * Upper 16b address bits of ASB_Notify messages sent to the
 * system. Need to match the PHB’s ASN Compare/Mask Register.
 * Not supported on P9 DD1.
 */
-   xsl_dsnctl |= ((u64)0x04 << (63-55));
+   xsl_dsnctl |= asnind;
}

*reg = xsl_dsnctl;
@@ -464,7 +494,7 @@ static int init_implementation_adapter_regs_psl9(struct cxl 
*adapter,
if (rc)
return rc;

-   rc = cxl_get_xsl9_dsnctl(capp_unit_id, &xsl_dsnctl);
+   rc = cxl_get_xsl9_dsnctl(dev, capp_unit_id, &xsl_dsnctl);
if (rc)
return rc;





Re: [PATCH v9 29/51] mm/mprotect, powerpc/mm/pkeys, x86/mm/pkeys: Add sysfs interface

2017-12-20 Thread Ram Pai
On Wed, Dec 20, 2017 at 08:34:56AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2017-12-18 at 14:28 -0800, Dave Hansen wrote:
> > > We do not have generic support for something like that on ppc.
> > > The kernel looks at the device tree to determine what hardware features
> > > are available. But does not have mechanism to tell the hardware to track
> > > which of its features are currently enabled/used by the kernel; at least
> > > not for the memory-key feature.
> > 
> > Bummer.  You're missing out.
> > 
> > But, you could still do this with a syscall.  "Hey, kernel, do you
> > support this feature?"
> 
> I'm not sure I understand Ram's original (quoted) point, but informing
> userspace of CPU features is what AT_HWCAP's are about.

Ben, my original point was -- we developed this patch to satisfy a concern
you raised back on July 11th;  cut-n-pasted below.

---
That leads to the question... How do you tell userspace.

(apologies if I missed that in an existing patch in the series)

How do we inform userspace of the key capabilities ? There are
at least two things userspace may want to know already:

 - What protection bits are supported for a key

 - How many keys exist

 - Which keys are available for use by userspace. On PowerPC,
 the kernel can reserve some keys for itself, so can the
 hypervisor. In fact, they do.



The argument against this patch is --  it should not be baked into
the ABI as yet, since we do not have clarity on what applications need.

As it stands today the only way to figure out the information from
userspace is by probing the kernel through calls to sys_pkey_alloc().
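
A minimal sketch of that kind of probing, assuming the raw pkey_alloc and
pkey_free syscalls are reachable through syscall(2). probe_pkeys() and
MAX_PROBE are made-up names, and the sketch only counts keys; it says nothing
about which protection bits a key supports.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MAX_PROBE 64	/* arbitrary upper bound for this sketch */

/* Count how many keys can be allocated right now, then release them all. */
int probe_pkeys(void)
{
#if defined(SYS_pkey_alloc) && defined(SYS_pkey_free)
	int keys[MAX_PROBE];
	int n = 0;
	int i;

	while (n < MAX_PROBE) {
		long k = syscall(SYS_pkey_alloc, 0UL, 0UL);

		if (k < 0)	/* no keys left, or pkeys not supported */
			break;
		keys[n++] = (int)k;
	}
	assert(n <= MAX_PROBE);

	for (i = 0; i < n; i++)
		syscall(SYS_pkey_free, keys[i]);

	return n;
#else
	return 0;	/* headers too old to know about the syscalls */
#endif
}
```

A result of zero cannot distinguish "no keys free" from "feature absent",
which is part of why probing alone is a weak interface.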

AT_HWCAP can be used, but that will certainly not be capable of
providing all the information that userspace might expect.

Your thoughts?
RP



[PATCH V5] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread Christophe Lombard
The POWER9 core supports a new feature: ASB_Notify which requires the
support of the Special Purpose Register: TIDR.

The ASB_Notify command, generated by the AFU, will attempt to
wake up the host thread identified by the particular LPID:PID:TID.

This patch assigns a unique TIDR (thread id) to the current thread, which
will be used in the process element entry.

A follow-up patch will handle a new kind of "compatible" property in the
device-tree (PHB DT node) indicating which version of CAPI and which
features are supported, instead of handling PVR values.

Signed-off-by: Christophe Lombard 
Reviewed-by: Philippe Bergheaud 

---
Changelog[v5]
 - Rebased to latest upstream.
 - Updated the ioctl interface.
 - Returned the tid in the ioctl structure.

Changelog[v4]
 - Rebased to latest upstream.
 - Updated the ioctl interface.
 - Removed the field tid in the context structure.

Changelog[v3]
 - Rebased to latest upstream.
 - Updated attr->tid field in cxllib_get_PE_attributes().

Changelog[v2]
 - Rebased to latest upstream.
 - Updated the ioctl interface.
 - Added a check to allow updating the TIDR if a P9 chip is present.
---
 arch/powerpc/kernel/process.c |  1 +
 drivers/misc/cxl/context.c| 15 +++
 drivers/misc/cxl/cxl.h|  3 +++
 drivers/misc/cxl/cxllib.c |  3 ++-
 drivers/misc/cxl/file.c   | 19 +--
 drivers/misc/cxl/native.c |  2 +-
 drivers/misc/cxl/trace.h  | 12 
 include/uapi/misc/cxl.h   | 10 ++
 8 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 5acb5a1..a6a70e2 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1589,6 +1589,7 @@ int set_thread_tidr(struct task_struct *t)
 
return 0;
 }
+EXPORT_SYMBOL_GPL(set_thread_tidr);
 
 #endif /* CONFIG_PPC64 */
 
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 12a41b2..e309d35 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cxl.h"
 
@@ -362,3 +363,17 @@ void cxl_context_mm_count_put(struct cxl_context *ctx)
if (ctx->mm)
mmdrop(ctx->mm);
 }
+
+int cxl_context_thread_tidr(struct cxl_context *ctx)
+{
+   int rc = 0;
+
+   if (!cxl_is_power9())
+   return -ENODEV;
+
+   rc = set_thread_tidr(current);
+   pr_devel("%s: current tidr: %ld\n", __func__,
+current->thread.tidr);
+
+   return rc;
+}
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index e46a406..1a5db0b 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1169,4 +1169,7 @@ void cxl_context_mm_count_get(struct cxl_context *ctx);
 /* Decrements the reference count to "struct mm_struct" */
 void cxl_context_mm_count_put(struct cxl_context *ctx);
 
+/* Handles an unique TIDR (thread id) for the current thread */
+int cxl_context_thread_tidr(struct cxl_context *ctx);
+
 #endif
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index dc9bc18..30ccba4 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -199,10 +199,11 @@ int cxllib_get_PE_attributes(struct task_struct *task,
 */
attr->pid = mm->context.id;
mmput(mm);
+   attr->tid = task->thread.tidr;
} else {
attr->pid = 0;
+   attr->tid = 0;
}
-   attr->tid = 0;
return 0;
 }
 EXPORT_SYMBOL_GPL(cxllib_get_PE_attributes);
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 76c0b0c..788b3af 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -173,7 +173,7 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 * flags are set it's invalid
 */
if (work.reserved1 || work.reserved2 || work.reserved3 ||
-   work.reserved4 || work.reserved5 || work.reserved6 ||
+   work.reserved4 || work.reserved5 ||
(work.flags & ~CXL_START_WORK_ALL)) {
rc = -EINVAL;
goto out;
@@ -248,7 +248,19 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 */
smp_mb();
 
-   trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
+   /* Assign a unique TIDR (thread id) for the current thread */
+   if (work.flags & CXL_START_WORK_TID) {
+   rc = cxl_context_thread_tidr(ctx);
+   if (rc)
+   goto out;
+   }
+   work.tid = current->thread.tidr;
+
+   trace_cxl_attach(ctx,
+work.work_element_descriptor,
+work.num_interrupts,
+amr,
+work.tid);
 
if ((rc = cxl_ops->attach_process(ctx, false, 
work.work_element_descriptor,
 

Re: [PATCH v4 00/11] ASoC: fsl_ssi: Clean up - coding style level

2017-12-20 Thread Nicolin Chen
On Wed, Dec 20, 2017 at 12:40:37PM +0100, Arnaud Mouiche wrote:
> >>>Ugh, so it's basically quite broken again -- before these patches.
> >>I remember Arnaud reviewed one of my changes back in September.
> >>So I suppose the test should be fine at that time -- so a change
> >>being merged recently might have impacted the test result.

> Sorry but I will be busy until mid January, I could help testing and
> fixing broken multi channel after.
> Anyway, I don't see specific issues with Nicolin patches.
> We can take time to fix what was broken before this patch set... after.

I won't be able to fix and re-submit patches during the holidays
either. So let's discuss this in early January. Thanks for
joining.

Nicolin


Re: [PATCH v4 1/2] powerpc/powernv: Enable tunneled operations

2017-12-20 Thread Frederic Barrat



Le 15/12/2017 à 14:48, Philippe Bergheaud a écrit :

P9 supports PCI tunneled operations (atomics and as_notify). This
patch adds support for tunneled operations on powernv, with a new
API, to be called by device drivers:

pnv_pci_get_tunnel_ind()
Tell driver the 16-bit ASN indication used by kernel.

pnv_pci_set_tunnel_bar()
Tell kernel the Tunnel BAR Response address used by driver.
This function uses two new OPAL calls, as the PBCQ Tunnel BAR
register is configured by skiboot.

void pnv_pci_get_as_notify_info()
Return the ASN info of the thread to be woken up.

Signed-off-by: Philippe Bergheaud 
---
Changelog:

v2: Do not set the ASN indication. Get it from the device tree.

v3: Make pnv_pci_get_phb_node() available when compiling without cxl.

v4: Add pnv_pci_get_as_notify_info().
 Rebase opal call numbers on skiboot 5.9.6.

This patch depends on the following skiboot prerequisites:

https://patchwork.ozlabs.org/patch/849162/
https://patchwork.ozlabs.org/patch/849163/
---
  arch/powerpc/include/asm/opal-api.h|  4 +-
  arch/powerpc/include/asm/opal.h|  2 +
  arch/powerpc/include/asm/pnv-pci.h |  5 ++
  arch/powerpc/platforms/powernv/opal-wrappers.S |  2 +
  arch/powerpc/platforms/powernv/pci-cxl.c   |  8 ---
  arch/powerpc/platforms/powernv/pci.c   | 93 ++
  6 files changed, 105 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 233c7504b1f2..b901f4d9f009 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -201,7 +201,9 @@
  #define OPAL_SET_POWER_SHIFT_RATIO155
  #define OPAL_SENSOR_GROUP_CLEAR   156
  #define OPAL_PCI_SET_P2P  157
-#define OPAL_LAST  157
+#define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   159
+#define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   160
+#define OPAL_LAST  160

  /* Device tree flags */

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 0c545f7fc77b..8705e422b893 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -198,6 +198,8 @@ int64_t opal_unregister_dump_region(uint32_t id);
  int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
  int64_t opal_config_cpu_idle_state(uint64_t state, uint64_t flag);
  int64_t opal_pci_set_phb_cxl_mode(uint64_t phb_id, uint64_t mode, uint64_t 
pe_number);
+int64_t opal_pci_get_pbcq_tunnel_bar(uint64_t phb_id, uint64_t *addr);
+int64_t opal_pci_set_pbcq_tunnel_bar(uint64_t phb_id, uint64_t addr);
  int64_t opal_ipmi_send(uint64_t interface, struct opal_ipmi_msg *msg,
uint64_t msg_len);
  int64_t opal_ipmi_recv(uint64_t interface, struct opal_ipmi_msg *msg,
diff --git a/arch/powerpc/include/asm/pnv-pci.h 
b/arch/powerpc/include/asm/pnv-pci.h
index 3e5cf251ad9a..4839e09663f2 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -29,6 +29,11 @@ extern int pnv_pci_set_power_state(uint64_t id, uint8_t 
state,
  extern int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
   u64 desc);

+extern int pnv_pci_get_tunnel_ind(struct pci_dev *dev, uint64_t *ind);
+extern int pnv_pci_set_tunnel_bar(struct pci_dev *dev, uint64_t addr,
+ int enable);
+extern void pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid,
+  u32 *pid, u32 *tid);
  int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
  int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
   unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 6f4b00a2ac46..5da790fb7fef 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -320,3 +320,5 @@ OPAL_CALL(opal_set_powercap,
OPAL_SET_POWERCAP);
  OPAL_CALL(opal_get_power_shift_ratio, OPAL_GET_POWER_SHIFT_RATIO);
  OPAL_CALL(opal_set_power_shift_ratio, OPAL_SET_POWER_SHIFT_RATIO);
  OPAL_CALL(opal_sensor_group_clear,OPAL_SENSOR_GROUP_CLEAR);
+OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,
OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
+OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,
OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c 
b/arch/powerpc/platforms/powernv/pci-cxl.c
index 94498a04558b..cee003de63af 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -16,14 +16,6 @@

  #include "pci.h"

-struct device_node *pnv_pci_get_phb_node(struct pci_dev *dev)
-{
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-
-   

Re: [PATCH] cxl: Check if vphb exists before iterating over AFU devices

2017-12-20 Thread Greg KH
On Wed, Dec 20, 2017 at 03:07:06PM +0530, Vaibhav Jain wrote:
> commit 12841f87b7a8ceb3d54f171660f72a86941bfcb3 upstream, for 4.9.

Thanks, do we also need this for 4.4?  If so, can you provide a
backport?

thanks,

greg k-h


[RFC PATCH 8/8] powerpc/64s: inline local_irq_enable/restore

2017-12-20 Thread Nicholas Piggin
This does increase kernel text size by about 0.4%, but code is often
improved by putting the interrupt-replay call out of line, and gcc
function "shrink wrapping" can more often avoid setting up a stack
frame, e.g., _raw_spin_unlock_irqrestore fastpath before:

<_raw_spin_unlock_irqrestore>:
    addis   r2,r12,63
    addi    r2,r2,24688
    mflr    r0
    andi.   r9,r14,256
    mr      r9,r3
    std     r0,16(r1)
    stdu    r1,-32(r1)
    bne     c09fd1e0 <_raw_spin_unlock_irqrestore+0x50>
    lwsync
    li      r10,0
    mr      r3,r4
    stw     r10,0(r9)
    bl      c0013f98

:
    addis   r2,r12,222
    addi    r2,r2,-3472
    rldimi  r14,r3,0,62
    cmpdi   cr7,r3,0
    bnelr   cr7
    andi.   r9,r14,252
    beqlr

    nop
    addi    r1,r1,32
    ld      r0,16(r1)
    mtlr    r0
    blr

And after:

<_raw_spin_unlock_irqrestore>:
    addis   r2,r12,64
    addi    r2,r2,-15200
    andi.   r9,r14,256
    bne     c0a06dd0 <_raw_spin_unlock_irqrestore+0x70>
    lwsync
    li      r9,0
    stw     r9,0(r3)
    rldimi  r14,r4,0,62
    cmpdi   cr7,r4,0
    bne     cr7,c0a06d90 <_raw_spin_unlock_irqrestore+0x30>
    andi.   r9,r14,252
    bne     c0a06da0 <_raw_spin_unlock_irqrestore+0x40>
    blr

GCC can still improve code size for the slow paths by avoiding aligning
branch targets too, so there is room to reduce the text size cost there.
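
For readers who want the shape of the change without the powerpc details,
here is a rough portable C model of the inline fast path and the out-of-line
replay call. It is only a sketch: the model_* names are made up, r14 is
replaced by an ordinary variable, and the bit layout (two soft-mask bits
followed by the irq_happened bits) is my reading of the "rldimi r14,r4,0,62"
and "andi. r9,r14,252" instructions in the listing above, not something the
patch states.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the disassembly; not taken from the patch. */
#define MODEL_SOFT_MASK_ALL    0x03ULL
#define MODEL_IRQ_HAPPENED_ALL 0xfcULL

static uint64_t model_r14;         /* stand-in for the real r14 register */
static unsigned int model_replays; /* counts entries into the slow path */

/* Out-of-line slow path: replay whatever was latched while soft-masked. */
static void model_irq_replay(void)
{
	model_replays++;
	model_r14 &= ~MODEL_IRQ_HAPPENED_ALL;
}

/*
 * Inline fast path: insert the saved mask bits, then leave only a test
 * and a rarely-taken call in the caller.
 */
static inline void model_irq_restore(uint64_t flags)
{
	model_r14 = (model_r14 & ~MODEL_SOFT_MASK_ALL) |
		    (flags & MODEL_SOFT_MASK_ALL);
	if (!flags && (model_r14 & MODEL_IRQ_HAPPENED_ALL))
		model_irq_replay();
}
```

Restoring a still-masked state never reaches model_irq_replay(), which is
the case the compiler can now handle without setting up a stack frame.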
---
 arch/powerpc/include/asm/hw_irq.h | 15 +--
 arch/powerpc/kernel/irq.c | 28 ++--
 2 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index f492a7779ea3..8690e0d5605d 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -132,11 +132,22 @@ static inline void arch_local_irq_disable(void)
irq_soft_mask_set(IRQ_SOFT_MASK_STD);
 }
 
-extern void arch_local_irq_restore(unsigned long);
+extern void __arch_local_irq_enable(void);
 
 static inline void arch_local_irq_enable(void)
 {
-   arch_local_irq_restore(0);
+   __irq_soft_mask_clear(IRQ_SOFT_MASK_ALL);
+   if (unlikely(local_r14 & R14_BIT_IRQ_HAPPENED_MASK))
+   __arch_local_irq_enable();
+}
+
+static inline void arch_local_irq_restore(unsigned long flags)
+{
+   __irq_soft_mask_insert(flags);
+   if (!flags) {
+   if (unlikely(local_r14 & R14_BIT_IRQ_HAPPENED_MASK))
+   __arch_local_irq_enable();
+   }
 }
 
 static inline unsigned long arch_local_irq_save(void)
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index ebaf210a7406..e2ff0210477e 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -97,11 +97,6 @@ extern int tau_interrupts(int);
 
 int distribute_irqs = 1;
 
-static inline notrace unsigned long get_irq_happened(void)
-{
-   return local_r14 & R14_BIT_IRQ_HAPPENED_MASK;
-}
-
 static inline notrace int decrementer_check_overflow(void)
 {
u64 now = get_tb_or_rtc();
@@ -210,19 +205,10 @@ notrace unsigned int __check_irq_replay(void)
return 0;
 }
 
-notrace void arch_local_irq_restore(unsigned long mask)
+notrace void __arch_local_irq_enable(void)
 {
-   unsigned char irq_happened;
unsigned int replay;
 
-   /* Write the new soft-enabled value */
-   __irq_soft_mask_insert(mask);
-   /* any bits still disabled */
-   if (mask)
-   return;
-
-   barrier();
-
/*
 * From this point onward, we can take interrupts, preempt,
 * etc... unless we got hard-disabled. We check if an event
@@ -236,9 +222,6 @@ notrace void arch_local_irq_restore(unsigned long mask)
 * be hard-disabled, so there is no problem, we
 * cannot have preempted.
 */
-   irq_happened = get_irq_happened();
-   if (!irq_happened)
-   return;
 
/*
 * We need to hard disable to get a trusted value from
@@ -252,10 +235,11 @@ notrace void arch_local_irq_restore(unsigned long mask)
 * (expensive) mtmsrd.
 * XXX: why not test & IRQ_HARD_DIS?
 */
-   if (unlikely(irq_happened != PACA_IRQ_HARD_DIS))
+   if (unlikely((local_r14 & R14_BIT_IRQ_HAPPENED_MASK) !=
+   PACA_IRQ_HARD_DIS)) {
__hard_irq_disable();
 #ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
-   else {
+   } else {
/*
 * We should already be hard disabled here. We had bugs
 * where that wasn't the case so let's dbl check it and
@@ -264,8 +248,8 @@ notrace void arch_local_irq_restore(unsigned long mask)
 */

[RFC PATCH 7/8] powerpc/64s: put irq_soft_mask and irq_happened bits into r14

2017-12-20 Thread Nicholas Piggin
This should be split into two patches, one for irq_happened and one for
soft_mask. It may not be worth putting all irq_happened bits into r14;
a single "an irq did happen" bit may be good enough to then load a
paca variable.
---
 arch/powerpc/include/asm/hw_irq.h| 23 +
 arch/powerpc/include/asm/irqflags.h  | 21 +---
 arch/powerpc/include/asm/kvm_ppc.h   |  4 +--
 arch/powerpc/include/asm/paca.h  |  7 ++--
 arch/powerpc/kernel/asm-offsets.c|  9 +-
 arch/powerpc/kernel/entry_64.S   | 13 
 arch/powerpc/kernel/exceptions-64s.S |  4 +--
 arch/powerpc/kernel/head_64.S| 15 ++---
 arch/powerpc/kernel/irq.c| 62 ++--
 arch/powerpc/kvm/book3s_hv.c |  6 ++--
 arch/powerpc/xmon/xmon.c |  4 +--
 11 files changed, 76 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 9ba445de989d..f492a7779ea3 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -22,6 +22,7 @@
  * and allow a proper replay. Additionally, PACA_IRQ_HARD_DIS
  * is set whenever we manually hard disable.
  */
+#ifdef CONFIG_PPC_BOOK3E
 #define PACA_IRQ_HARD_DIS  0x01
 #define PACA_IRQ_DBELL 0x02
 #define PACA_IRQ_EE0x04
@@ -30,14 +31,22 @@
 #define PACA_IRQ_HMI   0x20
 #define PACA_IRQ_PMI   0x40
 
+#else /* CONFIG_PPC_BOOK3E */
 /*
- * 64s uses r14 rather than paca for irq_soft_mask
+ * 64s uses r14 rather than paca for irq_soft_mask and irq_happened
  */
-#ifdef CONFIG_PPC_BOOK3S
+
+#define PACA_IRQ_HARD_DIS  (0x01 << R14_BIT_IRQ_HAPPENED_SHIFT)
+#define PACA_IRQ_DBELL (0x02 << R14_BIT_IRQ_HAPPENED_SHIFT)
+#define PACA_IRQ_EE(0x04 << R14_BIT_IRQ_HAPPENED_SHIFT)
+#define PACA_IRQ_DEC   (0x08 << R14_BIT_IRQ_HAPPENED_SHIFT)
+#define PACA_IRQ_HMI   (0x10 << R14_BIT_IRQ_HAPPENED_SHIFT)
+#define PACA_IRQ_PMI   (0x20 << R14_BIT_IRQ_HAPPENED_SHIFT)
+
 #define IRQ_SOFT_MASK_STD  (0x01 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
 #define IRQ_SOFT_MASK_PMU  (0x02 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
 #define IRQ_SOFT_MASK_ALL  (0x03 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
-#endif /* CONFIG_PPC_BOOK3S */
+#endif /* CONFIG_PPC_BOOK3E */
 
 #endif /* CONFIG_PPC64 */
 
@@ -206,14 +215,14 @@ static inline bool arch_irqs_disabled(void)
unsigned long flags;\
__hard_irq_disable();   \
flags = irq_soft_mask_set_return(IRQ_SOFT_MASK_ALL);\
-   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;  \
+   r14_set_bits(PACA_IRQ_HARD_DIS);\
if (!arch_irqs_disabled_flags(flags))   \
trace_hardirqs_off();   \
 } while(0)
 
 static inline bool lazy_irq_pending(void)
 {
-   return !!(get_paca()->irq_happened & ~PACA_IRQ_HARD_DIS);
+   return !!(local_r14 & R14_BIT_IRQ_HAPPENED_MASK & ~PACA_IRQ_HARD_DIS);
 }
 
 /*
@@ -223,8 +232,8 @@ static inline bool lazy_irq_pending(void)
  */
 static inline void may_hard_irq_enable(void)
 {
-   get_paca()->irq_happened &= ~PACA_IRQ_HARD_DIS;
-   if (!(get_paca()->irq_happened & PACA_IRQ_EE))
+   r14_clear_bits(PACA_IRQ_HARD_DIS);
+   if (!(local_r14 & PACA_IRQ_EE))
__hard_irq_enable();
 }
 
diff --git a/arch/powerpc/include/asm/irqflags.h 
b/arch/powerpc/include/asm/irqflags.h
index 19a2752868f8..140e51b9f436 100644
--- a/arch/powerpc/include/asm/irqflags.h
+++ b/arch/powerpc/include/asm/irqflags.h
@@ -45,26 +45,21 @@
  *
  * NB: This may call C code, so the caller must be prepared for volatiles to
  * be clobbered.
+ * XXX: could make this single-register now
  */
-#define RECONCILE_IRQ_STATE(__rA, __rB)\
-   lbz __rB,PACAIRQHAPPENED(r13);  \
-   andi.   __rA,r14,IRQ_SOFT_MASK_STD; \
-   ori r14,r14,IRQ_SOFT_MASK_STD;  \
-   ori __rB,__rB,PACA_IRQ_HARD_DIS;\
-   stb __rB,PACAIRQHAPPENED(r13);  \
-   bne 44f;\
-   TRACE_DISABLE_INTS; \
+#define RECONCILE_IRQ_STATE(__rA, __rB)
\
+   andi.   __rA,r14,IRQ_SOFT_MASK_STD; \
+   ori r14,r14,(PACA_IRQ_HARD_DIS | IRQ_SOFT_MASK_STD);\
+   bne 44f;\
+   TRACE_DISABLE_INTS; \
 44:
 
 #else
 #define TRACE_ENABLE_INTS
 #define TRACE_DISABLE_INTS
 
-#define RECONCILE_IRQ_STATE(__rA, __rB)\
-   lbz __rA,PACAIRQHAPPENED(r13);  \
-   ori r14,r14,IRQ_SOFT_MASK_STD;  \
-   ori __rA,__rA,PACA_IRQ_HARD_DIS;\
-   stb __rA,PACAIRQHAPPENED(r13)
+#define 

[RFC PATCH 6/8] powerpc/64s: put irq_soft_mask bits into r14

2017-12-20 Thread Nicholas Piggin
Put the STD and PMI interrupt mask bits into r14. This benefits
IRQ disabling (enabling to a lesser extent), and soft mask check
in the interrupt entry handler.
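
As a rough illustration of why this helps, here is a portable C model of
the two operations named above: disabling (set the mask bits and return the
old state) and the entry-time soft mask check. The model_* names and the
bit positions are assumptions; the patch derives the real values from
R14_BIT_IRQ_SOFT_MASK_SHIFT, and the real code updates r14 with a single
instruction rather than a C read-modify-write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define MODEL_SOFT_MASK_STD 0x1ULL
#define MODEL_SOFT_MASK_PMU 0x2ULL
#define MODEL_SOFT_MASK_ALL (MODEL_SOFT_MASK_STD | MODEL_SOFT_MASK_PMU)

/* Disable: OR in the mask bits and hand back the previous mask state. */
static inline uint64_t model_soft_mask_set_return(uint64_t *r14, uint64_t mask)
{
	uint64_t old = *r14 & MODEL_SOFT_MASK_ALL;

	*r14 |= mask;
	return old;
}

/*
 * Interrupt entry: nonzero means the interrupt is masked and has to be
 * replayed later. This mirrors the "andi. r10,r14,bitmask" test in
 * __SOFTEN_TEST, which no longer needs a load from the paca.
 */
static inline int model_soften_test(uint64_t r14, uint64_t bitmask)
{
	return (r14 & bitmask) != 0;
}
```

Neither operation touches memory in the model, which is the point: the
mask state is already in a register when it is needed.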
---
 arch/powerpc/include/asm/exception-64s.h |  6 +-
 arch/powerpc/include/asm/hw_irq.h| 98 
 arch/powerpc/include/asm/irqflags.h  |  9 +--
 arch/powerpc/include/asm/kvm_ppc.h   |  2 +-
 arch/powerpc/include/asm/paca.h  | 18 +-
 arch/powerpc/kernel/asm-offsets.c|  7 ++-
 arch/powerpc/kernel/entry_64.S   | 19 +++
 arch/powerpc/kernel/idle_book3s.S|  3 +
 arch/powerpc/kernel/irq.c| 12 ++--
 arch/powerpc/kernel/optprobes_head.S |  3 +-
 arch/powerpc/kernel/process.c|  2 +-
 arch/powerpc/kernel/setup_64.c   | 11 ++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  3 +-
 arch/powerpc/xmon/xmon.c |  5 +-
 14 files changed, 95 insertions(+), 103 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index dadaa7471755..5602454ae56f 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -459,9 +459,8 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
mflr    r9; /* Get LR, later save to stack  */ \
ld  r2,PACATOC(r13);/* get kernel TOC into r2   */ \
std r9,_LINK(r1);  \
-   lbz r10,PACAIRQSOFTMASK(r13);  \
mfspr   r11,SPRN_XER;   /* save XER in stackframe   */ \
-   std r10,SOFTE(r1); \
+   std r14,SOFTE(r1);  /* full r14 not just softe XXX  */ \
std r11,_XER(r1);  \
li  r9,(n)+1;  \
std r9,_TRAP(r1);   /* set trap number  */ \
@@ -526,8 +525,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define SOFTEN_VALUE_0xf00 PACA_IRQ_PMI
 
 #define __SOFTEN_TEST(h, vec, bitmask) \
-   lbz r10,PACAIRQSOFTMASK(r13);   \
-   andi.   r10,r10,bitmask;\
+   andi.   r10,r14,bitmask;\
li  r10,SOFTEN_VALUE_##vec; \
bne masked_##h##interrupt
 
diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index eea02cbf5699..9ba445de989d 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -12,8 +12,9 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC64
+#ifndef __ASSEMBLY__
 
+#ifdef CONFIG_PPC64
 /*
  * PACA flags in paca->irq_happened.
  *
@@ -30,21 +31,16 @@
 #define PACA_IRQ_PMI   0x40
 
 /*
- * flags for paca->irq_soft_mask
+ * 64s uses r14 rather than paca for irq_soft_mask
  */
-#define IRQ_SOFT_MASK_NONE 0x00
-#define IRQ_SOFT_MASK_STD  0x01 /* local_irq_disable() interrupts */
 #ifdef CONFIG_PPC_BOOK3S
-#define IRQ_SOFT_MASK_PMU  0x02
-#define IRQ_SOFT_MASK_ALL  0x03
-#else
-#define IRQ_SOFT_MASK_ALL  0x01
-#endif
+#define IRQ_SOFT_MASK_STD  (0x01 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
+#define IRQ_SOFT_MASK_PMU  (0x02 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
+#define IRQ_SOFT_MASK_ALL  (0x03 << R14_BIT_IRQ_SOFT_MASK_SHIFT)
+#endif /* CONFIG_PPC_BOOK3S */
 
 #endif /* CONFIG_PPC64 */
 
-#ifndef __ASSEMBLY__
-
 extern void replay_system_reset(void);
 extern void __replay_interrupt(unsigned int vector);
 
@@ -56,24 +52,16 @@ extern void unknown_exception(struct pt_regs *regs);
 #ifdef CONFIG_PPC64
 #include 
 
-static inline notrace unsigned long irq_soft_mask_return(void)
+/*
+ * __irq_soft_mask_set/clear do not have memory clobbers so they
+ * should not be used by themselves to disable/enable irqs.
+ */
+static inline notrace void __irq_soft_mask_set(unsigned long disable_mask)
 {
-   unsigned long flags;
-
-   asm volatile(
-   "lbz %0,%1(13)"
-   : "=r" (flags)
-   : "i" (offsetof(struct paca_struct, irq_soft_mask)));
-
-   return flags;
+   r14_set_bits(disable_mask);
 }
 
-/*
- * The "memory" clobber acts as both a compiler barrier
- * for the critical section and as a clobber because
- * we changed paca->irq_soft_mask
- */
-static inline notrace void irq_soft_mask_set(unsigned long mask)
+static inline notrace void __irq_soft_mask_insert(unsigned long new_mask)
 {
 #ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
/*
@@ -90,49 +78,37 @@ static inline notrace void irq_soft_mask_set(unsigned long 
mask)
 * unmasks to be replayed, among other things. For now, take
 * the simple approach.
 */
-   WARN_ON(mask && !(mask & IRQ_SOFT_MASK_STD));
+   WARN_ON(new_mask && !(new_mask & 

[RFC PATCH 5/8] powerpc/64s: put work_pending bit into r14

2017-12-20 Thread Nicholas Piggin
Similarly, may not be worth an r14 bit, but...
---
 arch/powerpc/include/asm/paca.h |  4 ++--
 arch/powerpc/kernel/time.c  | 15 +++
 arch/powerpc/xmon/xmon.c|  1 -
 3 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 408fa079e00d..cd3637f4ee4e 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -36,7 +36,8 @@
 register struct paca_struct *local_paca asm("r13");
 #ifdef CONFIG_PPC_BOOK3S
 
-#define R14_BIT_IO_SYNC0x0001
+#define R14_BIT_IO_SYNC0x0001
+#define R14_BIT_IRQ_WORK_PENDING   0x0002 /* IRQ_WORK interrupt while 
soft-disable */
 
 /*
  * The top 32-bits of r14 is used as the per-cpu offset, shifted by PAGE_SHIFT.
@@ -212,7 +213,6 @@ struct paca_struct {
u16 trap_save;  /* Used when bad stack is encountered */
u8 irq_soft_mask;   /* mask for irq soft masking */
u8 irq_happened;/* irq happened while soft-disabled */
-   u8 irq_work_pending;/* IRQ_WORK interrupt while 
soft-disable */
u8 nap_state_lost;  /* NV GPR values lost in power7_idle */
u64 sprg_vdso;  /* Saved user-visible sprg */
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 8d32ce95ec88..fac30152723f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -488,26 +488,17 @@ EXPORT_SYMBOL(profile_pc);
 #ifdef CONFIG_PPC64
 static inline unsigned long test_irq_work_pending(void)
 {
-   unsigned long x;
-
-   asm volatile("lbz %0,%1(13)"
-   : "=r" (x)
-   : "i" (offsetof(struct paca_struct, irq_work_pending)));
-   return x;
+   return local_r14 & R14_BIT_IRQ_WORK_PENDING;
 }
 
 static inline void set_irq_work_pending_flag(void)
 {
-   asm volatile("stb %0,%1(13)" : :
-   "r" (1),
-   "i" (offsetof(struct paca_struct, irq_work_pending)));
+   r14_set_bits(R14_BIT_IRQ_WORK_PENDING);
 }
 
 static inline void clear_irq_work_pending(void)
 {
-   asm volatile("stb %0,%1(13)" : :
-   "r" (0),
-   "i" (offsetof(struct paca_struct, irq_work_pending)));
+   r14_clear_bits(R14_BIT_IRQ_WORK_PENDING);
 }
 
 #else /* 32-bit */
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 40f0d02ae92d..7d2bb26ff333 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2393,7 +2393,6 @@ static void dump_one_paca(int cpu)
DUMP(p, trap_save, "x");
DUMP(p, irq_soft_mask, "x");
DUMP(p, irq_happened, "x");
-   DUMP(p, irq_work_pending, "x");
DUMP(p, nap_state_lost, "x");
DUMP(p, sprg_vdso, "llx");
 
-- 
2.15.0



[RFC PATCH 4/8] powerpc/64s: put io_sync bit into r14

2017-12-20 Thread Nicholas Piggin
This simplifies spin unlock code and mmio primitives. This
may not be the best use of an r14 bit, but it was a simple
first proof of concept after the per-cpu data_offset, and
so it can stay until we get low on bits.
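
A small C model of the flag's life cycle, for illustration only. It follows
the old SYNC_IO logic visible in the removed lines of this patch, with the
paca field swapped for a bit in a plain variable standing in for r14. The
model_* names are invented here, and the plain "|=" and "&=" are exactly
what real code must not do to r14; the patch uses single-instruction asm
helpers for that.

```c
#include <assert.h>
#include <stdint.h>

#define R14_BIT_IO_SYNC 0x0001ULL   /* value taken from this patch */

static uint64_t model_r14;          /* stand-in for the real r14 register */
static unsigned int model_barriers; /* counts full barriers issued */

/* MMIO write path: remember that an ordering barrier is owed. */
static inline void model_io_set_sync_flag(void)
{
	model_r14 |= R14_BIT_IO_SYNC;
}

/* Unlock path: pay for the barrier only if an MMIO write happened. */
static inline void model_sync_io(void)
{
	if (model_r14 & R14_BIT_IO_SYNC) {
		model_barriers++;   /* mb() in the kernel */
		model_r14 &= ~R14_BIT_IO_SYNC;
	}
}
```

An unlock with no preceding MMIO write costs one register test and no
barrier in this model.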
---
 arch/powerpc/include/asm/io.h   | 11 --
 arch/powerpc/include/asm/paca.h | 44 -
 arch/powerpc/include/asm/spinlock.h | 21 ++
 arch/powerpc/xmon/xmon.c|  1 -
 4 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 422f99cf9924..c817f3a83fcc 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -104,8 +104,8 @@ extern bool isa_io_special;
  *
  */
 
-#ifdef CONFIG_PPC64
-#define IO_SET_SYNC_FLAG() do { local_paca->io_sync = 1; } while(0)
+#if defined(CONFIG_PPC64) && defined(CONFIG_SMP)
+#define IO_SET_SYNC_FLAG() do { r14_set_bits(R14_BIT_IO_SYNC); } while(0)
 #else
 #define IO_SET_SYNC_FLAG()
 #endif
@@ -673,11 +673,8 @@ static inline void name at 
\
  */
 static inline void mmiowb(void)
 {
-   unsigned long tmp;
-
-   __asm__ __volatile__("sync; li %0,0; stb %0,%1(13)"
-   : "=&r" (tmp) : "i" (offsetof(struct paca_struct, io_sync))
-   : "memory");
+   __asm__ __volatile__("sync" : : : "memory");
+   r14_clear_bits(R14_BIT_IO_SYNC);
 }
 #endif /* !CONFIG_PPC32 */
 
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4dd4ac69e84f..408fa079e00d 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -35,12 +35,55 @@
 
 register struct paca_struct *local_paca asm("r13");
 #ifdef CONFIG_PPC_BOOK3S
+
+#define R14_BIT_IO_SYNC0x0001
+
 /*
  * The top 32-bits of r14 is used as the per-cpu offset, shifted by PAGE_SHIFT.
  * The per-cpu could be moved completely to vmalloc space if we had large
  * vmalloc page mapping? (no, must access it in real mode).
  */
 register u64 local_r14 asm("r14");
+
+/*
+ * r14 should not be modified by C code, because we can not guarantee it
+ * will be done with non-atomic (vs interrupts) read-modify-write sequences.
+ * All updates must be of the form `op r14,r14,xxx` or similar (i.e., atomic
+ * updates).
+ *
+ * Make asm statements have r14 for input and output so that the compiler
+ * does not re-order it with respect to other r14 manipulations.
+ */
+static inline void r14_set_bits(unsigned long mask)
+{
+   if (__builtin_constant_p(mask))
+   asm volatile("ori   %0,%0,%2\n"
+   : "=r" (local_r14)
+   : "0" (local_r14), "i" (mask));
+   else
+   asm volatile("or%0,%0,%2\n"
+   : "=r" (local_r14)
+   : "0" (local_r14), "r" (mask));
+}
+
+static inline void r14_flip_bits(unsigned long mask)
+{
+   if (__builtin_constant_p(mask))
+   asm volatile("xori  %0,%0,%2\n"
+   : "=r" (local_r14)
+   : "0" (local_r14), "i" (mask));
+   else
+   asm volatile("xor   %0,%0,%2\n"
+   : "=r" (local_r14)
+   : "0" (local_r14), "r" (mask));
+}
+
+static inline void r14_clear_bits(unsigned long mask)
+{
+   asm volatile("andc  %0,%0,%2\n"
+   : "=r" (local_r14)
+   : "0" (local_r14), "r" (mask));
+}
 #endif
 
 #if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_SMP)
@@ -169,7 +212,6 @@ struct paca_struct {
u16 trap_save;  /* Used when bad stack is encountered */
u8 irq_soft_mask;   /* mask for irq soft masking */
u8 irq_happened;/* irq happened while soft-disabled */
-   u8 io_sync; /* writel() needs spin_unlock sync */
u8 irq_work_pending;/* IRQ_WORK interrupt while 
soft-disable */
u8 nap_state_lost;  /* NV GPR values lost in power7_idle */
u64 sprg_vdso;  /* Saved user-visible sprg */
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index b9ebc3085fb7..182bb9304c79 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -40,16 +40,9 @@
 #endif
 
 #if defined(CONFIG_PPC64) && defined(CONFIG_SMP)
-#define CLEAR_IO_SYNC  (get_paca()->io_sync = 0)
-#define SYNC_IOdo {
\
-   if (unlikely(get_paca()->io_sync)) {\
-   mb();   \
-   get_paca()->io_sync = 0;\
-   }   \
-   } while (0)
+#define CLEAR_IO_SYNC  do { 

[RFC PATCH 3/8] powerpc/64s: put the per-cpu data_offset in r14

2017-12-20 Thread Nicholas Piggin
The offset is stored shifted left by 16 bits, so the low 16 bits of r14
remain available. This allows per-cpu pointers to be dereferenced with a
single extra shift, whereas previously it took a load and an add.
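
The encoding can be sketched in plain C as below. This is an illustration
under assumptions, not code from the series: R14_FLAG_BITS and the r14_*
helper names are made up, and r14 is passed around as an ordinary value
instead of living in a fixed register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: number of low r14 bits kept free for flags. */
#define R14_FLAG_BITS 16

/* Pack a per-cpu offset above the flag bits; it must fit in 48 bits. */
static inline uint64_t r14_pack(uint64_t data_offset, uint64_t flags)
{
	uint64_t flag_mask = (1ULL << R14_FLAG_BITS) - 1;

	return (data_offset << R14_FLAG_BITS) | (flags & flag_mask);
}

/* Counterpart of __my_cpu_offset in this patch: one shift, no load. */
static inline uint64_t r14_cpu_offset(uint64_t r14)
{
	return r14 >> R14_FLAG_BITS;
}

/* Resolve a per-cpu address from its link-time base address. */
static inline uintptr_t r14_this_cpu_ptr(uint64_t r14, uintptr_t base)
{
	return base + (uintptr_t)r14_cpu_offset(r14);
}
```

The BUG_ON checks added to setup_per_cpu_areas() guard the same property
the sketch relies on: the offset has to survive the 16-bit shift.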
---
 arch/powerpc/include/asm/paca.h   |  5 +
 arch/powerpc/include/asm/percpu.h |  2 +-
 arch/powerpc/kernel/entry_64.S|  5 -
 arch/powerpc/kernel/head_64.S |  5 +
 arch/powerpc/kernel/setup_64.c| 11 +--
 5 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index cd6a9a010895..4dd4ac69e84f 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -35,6 +35,11 @@
 
 register struct paca_struct *local_paca asm("r13");
 #ifdef CONFIG_PPC_BOOK3S
+/*
+ * The top 32-bits of r14 is used as the per-cpu offset, shifted by PAGE_SHIFT.
+ * The per-cpu could be moved completely to vmalloc space if we had large
+ * vmalloc page mapping? (no, must access it in real mode).
+ */
 register u64 local_r14 asm("r14");
 #endif
 
diff --git a/arch/powerpc/include/asm/percpu.h 
b/arch/powerpc/include/asm/percpu.h
index dce863a7635c..1e0d79d30eac 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -12,7 +12,7 @@
 
 #include 
 
-#define __my_cpu_offset local_paca->data_offset
+#define __my_cpu_offset (local_r14 >> 16)
 
 #endif /* CONFIG_SMP */
 #endif /* __powerpc64__ */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 592e4b36065f..6b0e3ac311e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -262,11 +262,6 @@ system_call_exit:
 BEGIN_FTR_SECTION
stdcx.  r0,0,r1 /* to clear the reservation */
 END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
-   LOAD_REG_IMMEDIATE(r10, 0xdeadbeefULL << 32)
-   mfspr   r11,SPRN_PIR
-   or  r10,r10,r11
-   tdne    r10,r14
-
andi.   r6,r8,MSR_PR
ld  r4,_LINK(r1)
 
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 5a9ec06eab14..cdb710f43681 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -413,10 +413,7 @@ generic_secondary_common_init:
b   kexec_wait  /* next kernel might do better   */
 
 2: SET_PACA(r13)
-   LOAD_REG_IMMEDIATE(r14, 0xdeadbeef << 32)
-   mfspr   r3,SPRN_PIR
-   or  r14,r14,r3
-   std r14,PACA_R14(r13)
+   ld  r14,PACA_R14(r13)
 
 #ifdef CONFIG_PPC_BOOK3E
addi    r12,r13,PACA_EXTLB  /* and TLB exc frame in another  */
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 9a4c5bf35d92..f4a96ebb523a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -192,8 +192,8 @@ static void __init fixup_boot_paca(void)
get_paca()->data_offset = 0;
/* Mark interrupts disabled in PACA */
irq_soft_mask_set(IRQ_SOFT_MASK_STD);
-   /* Set r14 and paca_r14 to debug value */
-   get_paca()->r14 = (0xdeadbeefULL << 32) | mfspr(SPRN_PIR);
+   /* Set r14 and paca_r14 to zero */
+   get_paca()->r14 = 0;
local_r14 = get_paca()->r14;
 }
 
@@ -761,7 +761,14 @@ void __init setup_per_cpu_areas(void)
for_each_possible_cpu(cpu) {
 __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
paca[cpu].data_offset = __per_cpu_offset[cpu];
+
+   BUG_ON(paca[cpu].data_offset & (PAGE_SIZE-1));
+   BUG_ON(paca[cpu].data_offset >= (1UL << (64 - 16)));
+
+   /* The top 48 bits are used for per-cpu data */
+   paca[cpu].r14 |= paca[cpu].data_offset << 16;
}
+   local_r14 = paca[smp_processor_id()].r14;
 }
 #endif
 
-- 
2.15.0



[RFC PATCH 2/8] powerpc/64s: poison r14 register while in kernel

2017-12-20 Thread Nicholas Piggin
Poison the r14 register with the PIR SPR and a magic number.
This means it must be treated like r13, saving and restoring the
register on kernel entry/exit, but not restoring it when returning
back to the kernel.

However r14 will not be a constant like r13, but may be modified by
the kernel, which means it must not be loaded on exception entry if
the exception is coming from the kernel.

This requires loading SRR1 earlier, before the exception mask/kvm
test. That's okay because SRR1 almost always gets loaded anyway.
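
For reference, the poison value can be written out in C roughly as follows.
The expression matches what fixup_boot_paca() stores in this patch; the
R14_POISON_MAGIC macro and the helper names are hypothetical and exist only
for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define R14_POISON_MAGIC 0xdeadbeefULL /* hypothetical name for the magic */

/* Build the debug value: magic in the top half, PIR in the bottom half. */
static inline uint64_t r14_poison_value(uint32_t pir)
{
	return (R14_POISON_MAGIC << 32) | pir;
}

/* Model of the check a trap-if-not-equal compare against r14 performs. */
static inline int r14_poison_intact(uint64_t r14, uint32_t pir)
{
	return r14 == r14_poison_value(pir);
}
```

Any kernel path that clobbers r14, or restores a stale copy on the wrong
CPU, makes the comparison fail, which is what makes the poison useful for
flushing out missed save/restore sites.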
---
 arch/powerpc/include/asm/exception-64s.h | 121 ---
 arch/powerpc/include/asm/paca.h  |   5 +-
 arch/powerpc/kernel/asm-offsets.c|   3 +-
 arch/powerpc/kernel/entry_64.S   |  23 ++
 arch/powerpc/kernel/exceptions-64s.S |  45 ++--
 arch/powerpc/kernel/head_64.S|   5 ++
 arch/powerpc/kernel/idle_book3s.S|   4 +
 arch/powerpc/kernel/paca.c   |   1 -
 arch/powerpc/kernel/setup_64.c   |   3 +
 arch/powerpc/lib/sstep.c |   1 +
 10 files changed, 128 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 54afd1f140a4..dadaa7471755 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -42,16 +42,18 @@
 #define EX_R11 16
 #define EX_R12 24
 #define EX_R13 32
-#define EX_DAR 40
-#define EX_DSISR   48
-#define EX_CCR 52
-#define EX_CFAR56
-#define EX_PPR 64
+#define EX_R14 40
+#define EX_DAR 48
+#define EX_DSISR   56
+#define EX_CCR 60
+#define EX_CFAR64
+#define EX_PPR 72
+
 #if defined(CONFIG_RELOCATABLE)
-#define EX_CTR 72
-#define EX_SIZE10  /* size in u64 units */
+#define EX_CTR 80
+#define EX_SIZE11  /* size in u64 units */
 #else
-#define EX_SIZE9   /* size in u64 units */
+#define EX_SIZE10  /* size in u64 units */
 #endif
 
 /*
@@ -77,9 +79,8 @@
 #ifdef CONFIG_RELOCATABLE
 #define __EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)   \
mfspr   r11,SPRN_##h##SRR0; /* save SRR0 */ \
-   LOAD_HANDLER(r12,label);\
-   mtctr   r12;\
-   mfspr   r12,SPRN_##h##SRR1; /* and SRR1 */  \
+   LOAD_HANDLER(r10,label);\
+   mtctr   r10;\
li  r10,MSR_RI; \
mtmsrd  r10,1;  /* Set RI (EE=0) */ \
bctr;
@@ -87,7 +88,6 @@
 /* If not relocatable, we can jump directly -- and save messing with LR */
 #define __EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)   \
mfspr   r11,SPRN_##h##SRR0; /* save SRR0 */ \
-   mfspr   r12,SPRN_##h##SRR1; /* and SRR1 */  \
li  r10,MSR_RI; \
mtmsrd  r10,1;  /* Set RI (EE=0) */ \
b   label;
@@ -102,7 +102,7 @@
  */
 #define EXCEPTION_RELON_PROLOG_PSERIES(area, label, h, extra, vec) \
EXCEPTION_PROLOG_0(area);   \
-   EXCEPTION_PROLOG_1(area, extra, vec);   \
+   EXCEPTION_PROLOG_1(area, h, extra, vec);
\
EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)
 
 /*
@@ -198,17 +198,21 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
std r10,area+EX_R10(r13);   /* save r10 - r12 */\
OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
 
-#define __EXCEPTION_PROLOG_1_PRE(area) \
+#define __EXCEPTION_PROLOG_1(area, h)  \
+   std r11,area+EX_R11(r13);   \
+   std r12,area+EX_R12(r13);   \
+   mfspr   r12,SPRN_##h##SRR1; /* and SRR1 */  \
+   GET_SCRATCH0(r11);  \
OPT_SAVE_REG_TO_PACA(area+EX_PPR, r9, CPU_FTR_HAS_PPR); \
OPT_SAVE_REG_TO_PACA(area+EX_CFAR, r10, CPU_FTR_CFAR);  \
SAVE_CTR(r10, area);\
-   mfcr    r9;
-
-#define __EXCEPTION_PROLOG_1_POST(area)
\
-   std r11,area+EX_R11(r13);   \
-   std r12,area+EX_R12(r13);   \
-   GET_SCRATCH0(r10);  \
-   std r10,area+EX_R13(r13)
+   mfcrr9; 

[RFC PATCH 1/8] powerpc/64s: stop using r14 register

2017-12-20 Thread Nicholas Piggin
---
 arch/powerpc/Makefile  |   1 +
 arch/powerpc/crypto/md5-asm.S  |  40 +++
 arch/powerpc/crypto/sha1-powerpc-asm.S |  10 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h  |   2 +-
 arch/powerpc/include/asm/ppc_asm.h |  21 +++-
 arch/powerpc/kernel/asm-offsets.c  |   4 +-
 arch/powerpc/kernel/entry_32.S |   4 +-
 arch/powerpc/kernel/entry_64.S |  45 
 arch/powerpc/kernel/exceptions-64s.S   |   3 +-
 arch/powerpc/kernel/head_64.S  |   8 +-
 arch/powerpc/kernel/idle_book3s.S  |  79 +++--
 arch/powerpc/kernel/kgdb.c |   8 +-
 arch/powerpc/kernel/process.c  |   4 +-
 arch/powerpc/kernel/tm.S   |  40 ---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  10 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S|   5 +-
 arch/powerpc/kvm/book3s_interrupts.S   |  93 +++
 arch/powerpc/kvm/book3s_pr.c   |   6 +
 arch/powerpc/lib/checksum_64.S |  66 +--
 arch/powerpc/lib/copypage_power7.S |  32 +++---
 arch/powerpc/lib/copyuser_power7.S | 152 -
 arch/powerpc/lib/crtsavres.S   |   3 +
 arch/powerpc/lib/memcpy_power7.S   |  80 ++---
 23 files changed, 396 insertions(+), 320 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 1381693a4a51..8dd38facc5f2 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -140,6 +140,7 @@ AFLAGS-$(CONFIG_PPC64)  += $(call cc-option,-mabi=elfv1)
 endif
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mcmodel=medium,$(call 
cc-option,-mminimal-toc))
 CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mno-pointers-to-nested-functions)
+CFLAGS-$(CONFIG_PPC64) += -ffixed-r13 -ffixed-r14
 CFLAGS-$(CONFIG_PPC32) := -ffixed-r2 $(MULTIPLEWORD)
 
 ifeq ($(CONFIG_PPC_BOOK3S_64),y)
diff --git a/arch/powerpc/crypto/md5-asm.S b/arch/powerpc/crypto/md5-asm.S
index 10cdf5bceebb..99e41af88e19 100644
--- a/arch/powerpc/crypto/md5-asm.S
+++ b/arch/powerpc/crypto/md5-asm.S
@@ -25,31 +25,31 @@
 #define rW02   r10
 #define rW03   r11
 #define rW04   r12
-#define rW05   r14
-#define rW06   r15
-#define rW07   r16
-#define rW08   r17
-#define rW09   r18
-#define rW10   r19
-#define rW11   r20
-#define rW12   r21
-#define rW13   r22
-#define rW14   r23
-#define rW15   r24
-
-#define rT0r25
-#define rT1r26
+#define rW05   r15
+#define rW06   r16
+#define rW07   r17
+#define rW08   r18
+#define rW09   r19
+#define rW10   r20
+#define rW11   r21
+#define rW12   r22
+#define rW13   r23
+#define rW14   r24
+#define rW15   r25
+
+#define rT0r26
+#define rT1r27
 
 #define INITIALIZE \
PPC_STLU r1,-INT_FRAME_SIZE(r1); \
-   SAVE_8GPRS(14, r1); /* push registers onto stack*/ \
-   SAVE_4GPRS(22, r1);\
-   SAVE_GPR(26, r1)
+   SAVE_8GPRS(15, r1); /* push registers onto stack*/ \
+   SAVE_4GPRS(23, r1);\
+   SAVE_GPR(27, r1)
 
 #define FINALIZE \
-   REST_8GPRS(14, r1); /* pop registers from stack */ \
-   REST_4GPRS(22, r1);\
-   REST_GPR(26, r1);  \
+   REST_8GPRS(15, r1); /* pop registers from stack */ \
+   REST_4GPRS(23, r1);\
+   REST_GPR(27, r1);  \
addi    r1,r1,INT_FRAME_SIZE;
 
 #ifdef __BIG_ENDIAN__
diff --git a/arch/powerpc/crypto/sha1-powerpc-asm.S 
b/arch/powerpc/crypto/sha1-powerpc-asm.S
index c8951ce0dcc4..6c38de214c11 100644
--- a/arch/powerpc/crypto/sha1-powerpc-asm.S
+++ b/arch/powerpc/crypto/sha1-powerpc-asm.S
@@ -42,10 +42,10 @@
or  r6,r6,r0;   \
add r0,RE(t),r15;   \
add RT(t),RT(t),r6; \
-   add r14,r0,W(t);\
+   add r6,r0,W(t); \
LWZ(W((t)+4),((t)+4)*4,r4); \
rotlwi  RB(t),RB(t),30; \
-   add RT(t),RT(t),r14
+   add RT(t),RT(t),r6
 
 #define STEPD0_UPDATE(t)   \
and r6,RB(t),RC(t); \
@@ -124,8 +124,7 @@
 
 _GLOBAL(powerpc_sha_transform)
PPC_STLU r1,-INT_FRAME_SIZE(r1)
-   SAVE_8GPRS(14, r1)
-   SAVE_10GPRS(22, r1)
+   SAVE_NVGPRS(r1)
 
/* Load up A - E */
lwz RA(0),0(r3) /* A */
@@ -183,7 +182,6 @@ _GLOBAL(powerpc_sha_transform)
stw RD(0),12(r3)
stw RE(0),16(r3)
 
-   REST_8GPRS(14, r1)
-   REST_10GPRS(22, r1)
+   REST_NVGPRS(r1)
addi    r1,r1,INT_FRAME_SIZE
blr
diff 

[RFC PATCH 0/8] use r14 for a per-cpu kernel register

2017-12-20 Thread Nicholas Piggin
This makes r14 a fixed register and uses it to store per-cpu stuff in
the kernel, including read-write fields that are retained over
interrupts. It ends up being most useful for speeding up per-cpu
pointer dereferencing and soft-irq masking and testing. But it can
also reduce the number of loads and stores in the interrupt entry
paths by moving several bits of interest into r14 (another bit I'm
looking at is adding a bit for HSTATE_IN_GUEST to speed up kvmtest).

The series goes on top of Maddy's soft-irq patches, it works on 64s,
but KVM and 64e are probably broken at the moment. So it's not
intended to merge yet, but if people like the result then maybe the
first patch can be merged to stop using r14 in preparation.

Nicholas Piggin (8):
  powerpc/64s: stop using r14 register
  powerpc/64s: poison r14 register while in kernel
  powerpc/64s: put the per-cpu data_offset in r14
  powerpc/64s: put io_sync bit into r14
  powerpc/64s: put work_pending bit into r14
  powerpc/64s: put irq_soft_mask bits into r14
  powerpc/64s: put irq_soft_mask and irq_happened bits into r14
  powerpc/64s: inline local_irq_enable/restore

 arch/powerpc/Makefile  |   1 +
 arch/powerpc/crypto/md5-asm.S  |  40 +++
 arch/powerpc/crypto/sha1-powerpc-asm.S |  10 +-
 arch/powerpc/include/asm/exception-64s.h   | 127 +++--
 arch/powerpc/include/asm/hw_irq.h  | 130 ++---
 arch/powerpc/include/asm/io.h  |  11 +-
 arch/powerpc/include/asm/irqflags.h|  24 ++--
 arch/powerpc/include/asm/kvm_book3s_asm.h  |   2 +-
 arch/powerpc/include/asm/kvm_ppc.h |   6 +-
 arch/powerpc/include/asm/paca.h|  73 +++-
 arch/powerpc/include/asm/percpu.h  |   2 +-
 arch/powerpc/include/asm/ppc_asm.h |  21 +++-
 arch/powerpc/include/asm/spinlock.h|  21 ++--
 arch/powerpc/kernel/asm-offsets.c  |  23 +++-
 arch/powerpc/kernel/entry_32.S |   4 +-
 arch/powerpc/kernel/entry_64.S |  95 ++--
 arch/powerpc/kernel/exceptions-64s.S   |  52 -
 arch/powerpc/kernel/head_64.S  |  25 ++--
 arch/powerpc/kernel/idle_book3s.S  |  86 +++---
 arch/powerpc/kernel/irq.c  |  92 ++-
 arch/powerpc/kernel/kgdb.c |   8 +-
 arch/powerpc/kernel/optprobes_head.S   |   3 +-
 arch/powerpc/kernel/paca.c |   1 -
 arch/powerpc/kernel/process.c  |   6 +-
 arch/powerpc/kernel/setup_64.c |  19 +++-
 arch/powerpc/kernel/time.c |  15 +--
 arch/powerpc/kernel/tm.S   |  40 ---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  10 +-
 arch/powerpc/kvm/book3s_hv.c   |   6 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S|   5 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|   3 +-
 arch/powerpc/kvm/book3s_interrupts.S   |  93 +++
 arch/powerpc/kvm/book3s_pr.c   |   6 +
 arch/powerpc/lib/checksum_64.S |  66 +--
 arch/powerpc/lib/copypage_power7.S |  32 +++---
 arch/powerpc/lib/copyuser_power7.S | 152 -
 arch/powerpc/lib/crtsavres.S   |   3 +
 arch/powerpc/lib/memcpy_power7.S   |  80 ++---
 arch/powerpc/lib/sstep.c   |   1 +
 arch/powerpc/xmon/xmon.c   |   7 +-
 40 files changed, 764 insertions(+), 637 deletions(-)

-- 
2.15.0
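
The idea in the cover letter above (one fixed register carrying both the per-cpu data offset and a few flag bits) can be sketched in portable C. This is only an illustration under stated assumptions: a plain variable `fake_r14` stands in for the real register, and the bit names, the 16-bit flag mask, and the helper names are hypothetical, not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits are flags, the rest is an aligned offset. */
#define R14_BIT_IO_SYNC      0x1ULL
#define R14_BIT_WORK_PENDING 0x2ULL
#define R14_FLAG_MASK        0xffffULL

/* Stand-in for the fixed register; the real series uses r14 itself. */
static uint64_t fake_r14;

/* Install the per-cpu offset; it must leave the flag bits free. */
static void r14_init(uint64_t data_offset)
{
	assert((data_offset & R14_FLAG_MASK) == 0);
	fake_r14 = data_offset;
}

/* Per-cpu pointer arithmetic only needs the offset part. */
static uint64_t r14_data_offset(void)
{
	return fake_r14 & ~R14_FLAG_MASK;
}

/* Flag updates touch only the low bits, with no load or store to memory. */
static void r14_set_bits(uint64_t bits)
{
	fake_r14 |= (bits & R14_FLAG_MASK);
}

static void r14_clear_bits(uint64_t bits)
{
	fake_r14 &= ~(bits & R14_FLAG_MASK);
}

static int r14_test_bits(uint64_t bits)
{
	return (fake_r14 & bits & R14_FLAG_MASK) != 0;
}
```

In this model, testing a flag such as work_pending is a single mask on a value that is already in a register, which is the kind of saving the cover letter describes for the interrupt entry paths.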



Re: [PATCH V4] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread Vaibhav Jain
christophe lombard  writes:

> Le 20/12/2017 à 09:46, Vaibhav Jain a écrit :
>>> In fact, it does not matter. I don't know what the userspace could do
>>> with this value.
>> Without libcxl knowing the tidr value, it cannot enforce the condition
>> that only threads that have called attach can issue 'wait' on the right
>> context.
>> 
>> Also AFU can selectively ask PSL to issue asb_notify to a specific
>> thread via the PSL interface. Without userspace knowing the tidr value
>> it might not be easy for it to give this value to AFU through a Problem
>> State Area register.
>> 
>
> Don't forget that the ASB_Notify will use LPID:PID:TID tuple found
> in the Process Element Entry.
> The AFU may optionally provide a TID on AxH_CEA[40:55] (AxH_CEA[39]
> must be set to indicate an AFU provided TID)
> If AxH_CEA[39] == 1’b0 then Process Element information
> (LPID:PID:TID) is used to generate the PCIe address.
> If AxH_CEA[39] == 1’b1 then the LPID:PID are taken from the PEE
> while the TID is taken from AxH_CEA[40:55]

Agree and that was the point I was trying to make when I said that AFU
can selectively issue asb_notify to a thread. Without userspace threads
knowing their tidr libcxl would let any thread issue the 'wait'
instruction that may cause unpredictable results.
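
The TID selection rule quoted above can be sketched as a small C helper. This is a sketch under assumptions: AxH_CEA is modelled as a 64-bit value with IBM bit numbering (bit 0 is the most significant bit), and the function name `asb_notify_tid` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-bit AxH_CEA with IBM numbering, so bit n is at shift 63 - n. */
#define CEA_TID_VALID_SHIFT (63 - 39)   /* AxH_CEA[39]: AFU provides a TID */
#define CEA_TID_SHIFT       (63 - 55)   /* AxH_CEA[40:55]: the 16-bit TID  */
#define CEA_TID_MASK        0xffffULL

/*
 * Return the TID used for the notify: the AFU-provided one when
 * AxH_CEA[39] is set, otherwise the TID from the Process Element Entry.
 */
static uint16_t asb_notify_tid(uint64_t axh_cea, uint16_t pee_tid)
{
	if ((axh_cea >> CEA_TID_VALID_SHIFT) & 1)
		return (uint16_t)((axh_cea >> CEA_TID_SHIFT) & CEA_TID_MASK);
	return pee_tid;
}
```

The second case is why userspace may need to know its tidr: the AFU can only target a specific thread if software has passed that value to it.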

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [PATCH V4] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread christophe lombard

Le 20/12/2017 à 09:46, Vaibhav Jain a écrit :

Hi Christophe,

christophe lombard  writes:


Le 20/12/2017 à 07:31, Vaibhav Jain a écrit :

EINVAL might be a better return value instead of ENODEV in this case.


This return code has been already discussed (with mpe) on the first
version of the patch. "Either ENODEV or ENXIO would be best that
can be distinguished and interpreted correctly by userspace"

Agreed. Please ignore the review comment.


+   /* Assign a unique TIDR (thread id) for the current thread */
+   if (work.flags & CXL_START_WORK_TID) {
+   rc = cxl_context_thread_tidr(ctx);
+   if (rc)
+   goto out;

May need to copy the cxl_ioctl_start_work struct back to userspace with
the value of tidr allocated.



In fact, it does not matter. I don't know what the userspace could do
with this value.

Without libcxl knowing the tidr value, it cannot enforce the condition
that only threads that have called attach can issue 'wait' on the right
context.

Also AFU can selectively ask PSL to issue asb_notify to a specific
thread via the PSL interface. Without userspace knowing the tidr value
it might not be easy for it to give this value to AFU through a Problem
State Area register.



Don't forget that the ASB_Notify will use LPID:PID:TID tuple found
in the Process Element Entry.
The AFU may optionally provide a TID on AxH_CEA[40:55] (AxH_CEA[39]
must be set to indicate an AFU provided TID)
If AxH_CEA[39] == 1’b0 then Process Element information
(LPID:PID:TID) is used to generate the PCIe address.
If AxH_CEA[39] == 1’b1 then the LPID:PID are taken from the PEE
while the TID is taken from AxH_CEA[40:55]



+   }
+
trace_cxl_attach(ctx, work.work_element_descriptor,
work.num_interrupts, amr);

should update the tracing here to also report the tidr



yep. I will provide a new patch to include this update.

Thanks





WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4

2017-12-20 Thread Christophe LEROY
Trying to malloc() with libhugetlbfs, it runs indefinitely, taking page
faults in do_page_fault()/hugetlb_fault().

When interrupting the blocked app with CTRL+C, I get the following WARNING:

Any idea what could be wrong? I'm on an 8xx with 512k huge pages.

[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W   4.14.6-s3k-dev-ga8e8e8b176-svn9134 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W    (4.14.6-s3k-dev-ga8e8e8b176-svn9134)
[162980.035854] MSR:  00029032   CR: 24044224 XER: 2000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 0010 c6869410 1008 0000 77fb4000
[162980.036003] GPR08: 0001 0683c001  ff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24:  73ff 7400 0001 77fb4000 100f 1010 1010
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 8137 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f89 419e0010 712a0ff0 408200e0 <0fe0> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1

Christophe


Re: [PATCH 09/13] ocxl: Add trace points

2017-12-20 Thread Frederic Barrat



Le 18/12/2017 à 17:48, Philippe Ombredanne a écrit :

--- /dev/null
+++ b/drivers/misc/ocxl/trace.h
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2017 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */

Would you mind using the new SPDX tags documented in Thomas patch set
[1] rather than this legalese?


ok, it will be in the next revision. Thanks!

  Fred



Re: [PATCH v4 00/11] ASoC: fsl_ssi: Clean up - coding style level

2017-12-20 Thread Arnaud Mouiche



On 19/12/2017 01:25, Caleb Crome wrote:

On Mon, Dec 18, 2017 at 3:02 PM, Nicolin Chen  wrote:

On Mon, Dec 18, 2017 at 02:19:08PM -0800, Caleb Crome wrote:


Acked-by: Timur Tabi 

--- To Mark ---

Mark, can you still take these changes first? Since this failed
test that Caleb reported here is already existing on the top of
the mainline tree, I would like to treat this mail as a separate
bug report and fix it with a separate patch.

Besides, this series of changes don't change any function flow.

Thank you


Sorry!  I should have created a separate thread for this subject.  My
comments have *nothing* to do with this patch set, except they are
about the same source files.


--- To Caleb ---


I'm re-setting up my loopback test to try to verify these most recent changes.

I really appreciate your verification and help.

Of course!  I have this wandboard permanently set up for this
verification test, so that I can easily repeat whenever I touch our
kernel.

It's a dead-simple hardware mod just to connect TX to RX.


warn:   11a0 11a1 1160 11a3 11a4 11a5 11a6 11a7
warn: Valid frame after 1 invalid frames
warn:   11c0 11c1 11c2 11c3 11c4 11c5 11c6 11c7
warn: first invalid frame while expecting frame 0x00a0
warn:   13e7 1400 1401 1402 1403 1404 1405 1404
warn:   1407 1420 1421 1422 1423 1424 1425 1426
warn:   1427 1440 1441 1442 1443 1444 1445 1484
warn:   1447 1460 1461 1462 1463 1464 1465 1466

Those last 4 lines are the channel slips -- the least significant
nibble should be the channel number:  i.e. should go 0, 1, 2, 3, 4, 5,
6, 7.
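
The check described above (the low nibble of each sample should cycle through the channel numbers) can be sketched in C. This is a minimal sketch, assuming 8 channels per frame and trusting the first sample's nibble as the starting channel; the function name is hypothetical and is not part of the test tool used in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CHANNELS 8

/*
 * Count samples whose low nibble is not the channel number expected
 * from their position in the stream. The first sample is taken as the
 * reference, so sample i should carry (first + i) mod NUM_CHANNELS.
 */
static size_t count_channel_mismatches(const uint16_t *samples, size_t n)
{
	size_t bad = 0;

	if (n == 0)
		return 0;

	unsigned int first = samples[0] & 0xf;

	for (size_t i = 1; i < n; i++) {
		unsigned int expected = (unsigned int)((first + i) % NUM_CHANNELS);

		if ((samples[i] & 0xfu) != expected)
			bad++;
	}
	return bad;
}
```

Run over the last four "warn:" lines above as one stream of 32 samples, this flags the two samples (0x1404 and 0x1484) that sit where channel 6 is expected.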

Ugh, so it's basically quite broken again -- before these patches.

I remember Arnaud reviewed one of my changes back in September.
So I suppose the test should be fine at that time -- so a change
being merged recently might have impacted the test result.


It's certainly possible that I'm doing something wrong again -- it
wouldn't be the first time :-)


Hi All,

Sorry but I will be busy until mid January, I could help testing and 
fixing broken multi channel after.

Anyway, I don't see specific issues with Nicolin patches.
We can take time to fix what was broken before this patch set... after.

Arnaud




I guess I need to go backwards in time and see what rev re-broke it.
I don't really have time to dig too deep on this again.

I'd be happy to provide the hardware to anybody that can diagnose and
debug this more quickly than I can.  I'm very inefficient at kernel
drivers I think.   My day job is acoustical and electrical
engineering.

Here's what the hardware looks like for anybody that's interested.
Just a single wire loopback on the wandboard header.

I would definitely like to take the hardware to debug it as long
as you are willing to provide me. Can you send me a private mail
to discuss about it?

Absolutely.
-Caleb



Thanks
Nicolin




Re: [-next PATCH 0/4] sysfs and DEVICE_ATTR_

2017-12-20 Thread Felipe Balbi

Hi,

Joe Perches  writes:
>  drivers/usb/phy/phy-tahvo.c|  2 +-

Acked-by: Felipe Balbi 

-- 
balbi


[PATCH] cxl: Check if vphb exists before iterating over AFU devices

2017-12-20 Thread Vaibhav Jain
commit 12841f87b7a8ceb3d54f171660f72a86941bfcb3 upstream, for 4.9.

During an eeh a kernel-oops is reported if no vPHB is allocated to the
AFU. This happens as during AFU init, an error in creation of vPHB is
a non-fatal error. Hence afu->phb should always be checked for NULL
before iterating over it for the virtual AFU pci devices.

This patch fixes the kernel-oops by adding a NULL pointer check for
afu->phb before it is dereferenced.

Fixes: 9e8df8a21963 ("cxl: EEH support")
Cc: sta...@vger.kernel.org
Signed-off-by: Vaibhav Jain 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Signed-off-by: Michael Ellerman 
---
Changelog:
Rebased the upstream patch over stable 4.9 tree
---
 drivers/misc/cxl/pci.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index eef202d4399b..a5422f483ad5 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1758,6 +1758,9 @@ static pci_ers_result_t cxl_vphb_error_detected(struct cxl_afu *afu,
/* There should only be one entry, but go through the list
 * anyway
 */
+   if (afu->phb == NULL)
+   return result;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (!afu_dev->driver)
continue;
@@ -1801,6 +1804,11 @@ static pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
/* Only participate in EEH if we are on a virtual PHB */
if (afu->phb == NULL)
return PCI_ERS_RESULT_NONE;
+
+   /*
+* Tell the AFU drivers; but we don't care what they
+* say, we're going away.
+*/
cxl_vphb_error_detected(afu, state);
}
return PCI_ERS_RESULT_DISCONNECT;
@@ -1941,6 +1949,9 @@ static pci_ers_result_t cxl_pci_slot_reset(struct pci_dev *pdev)
if (cxl_afu_select_best_mode(afu))
goto err;
 
+   if (afu->phb == NULL)
+   continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
/* Reset the device context.
 * TODO: make this less disruptive
@@ -2003,6 +2014,9 @@ static void cxl_pci_resume(struct pci_dev *pdev)
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
 
+   if (afu->phb == NULL)
+   continue;
+
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
if (afu_dev->driver && afu_dev->driver->err_handler &&
afu_dev->driver->err_handler->resume)
-- 
2.14.3



Re: [PATCH V4] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread Vaibhav Jain
Hi Christophe,

christophe lombard  writes:

> Le 20/12/2017 à 07:31, Vaibhav Jain a écrit :
>> EINVAL might be a better return value instead of ENODEV in this case.
>
> This return code has been already discussed (with mpe) on the first
> version of the patch. "Either ENODEV or ENXIO would be best that
> can be distinguished and interpreted correctly by userspace"
Agreed. Please ignore the review comment.

>>> +   /* Assign a unique TIDR (thread id) for the current thread */
>>> +   if (work.flags & CXL_START_WORK_TID) {
>>> +   rc = cxl_context_thread_tidr(ctx);
>>> +   if (rc)
>>> +   goto out;
>> May need to copy the cxl_ioctl_start_work struct back to userspace with
>> the value of tidr allocated.
>>
>
> In fact, it does not matter. I don't know what the userspace could do
> with this value.
Without libcxl knowing the tidr value, it cannot enforce the condition
that only threads that have called attach can issue 'wait' on the right
context.

Also AFU can selectively ask PSL to issue asb_notify to a specific
thread via the PSL interface. Without userspace knowing the tidr value
it might not be easy for it to give this value to AFU through a Problem
State Area register.

>>> +   }
>>> +
>>> trace_cxl_attach(ctx, work.work_element_descriptor,
>>> work.num_interrupts, amr);
>> should update the tracing here to also report the tidr
>> 
>
> yep. I will provide a new patch to include this update.
Thanks

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [PATCH V4] cxl: Add support for ASB_Notify on POWER9

2017-12-20 Thread christophe lombard

Le 20/12/2017 à 07:31, Vaibhav Jain a écrit :

Hi Christophe,

Thanks for the changes to the patch. Few minor review comments:



Thanks for the review.


Christophe Lombard  writes:


@@ -362,3 +363,17 @@ void cxl_context_mm_count_put(struct cxl_context *ctx)
if (ctx->mm)
mmdrop(ctx->mm);
  }
+
+int cxl_context_thread_tidr(struct cxl_context *ctx)
+{
+   int rc = 0;
+
+   if (!cxl_is_power9())
+   return -ENODEV;

EINVAL might be a better return value instead of ENODEV in this case.


This return code has been already discussed (with mpe) on the first
version of the patch. "Either ENODEV or ENXIO would be best that
can be distinguished and interpreted correctly by userspace"




--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -248,6 +248,13 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 */
smp_mb();

+   /* Assign a unique TIDR (thread id) for the current thread */
+   if (work.flags & CXL_START_WORK_TID) {
+   rc = cxl_context_thread_tidr(ctx);
+   if (rc)
+   goto out;

May need to copy the cxl_ioctl_start_work struct back to userspace with
the value of tidr allocated.



In fact, it does not matter. I don't know what the userspace could do
with this value.


+   }
+
trace_cxl_attach(ctx, work.work_element_descriptor,
work.num_interrupts, amr);

should update the tracing here to also report the tidr



yep. I will provide a new patch to include this update.


diff --git a/include/uapi/misc/cxl.h b/include/uapi/misc/cxl.h
index 49e8fd0..980ee8f 100644
--- a/include/uapi/misc/cxl.h
+++ b/include/uapi/misc/cxl.h
@@ -31,9 +31,11 @@ struct cxl_ioctl_start_work {

Should reserve a field in the cxl_ioctl_start_work struct to report the
tidr back to userspace.




We could do that but, as we discussed previously, we want to minimize
the impact on libcxl.



Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-20 Thread Laurent Vivier
On 12/12/2017 13:02, Cédric Le Goater wrote:
> When restoring a pending interrupt, we are setting the Q bit to force
> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> in this case to reach the same initial state : P=1, Q=0.
> 
> This can be done by not setting 'old_p' for pending interrupts which
> will inform xive_finish_unmask() that an EOI needs to be sent.
> 
> Suggested-by: Benjamin Herrenschmidt 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Tested with a guest running iozone.
> 
>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

We really need this patch to fix VM migration on POWER9.
When will it be merged?

Thanks,
Laurent


Re: [PATCH] cpufreq: powernv: Add support of frequency domain

2017-12-20 Thread Gautham R Shenoy
On Tue, Dec 19, 2017 at 09:21:52PM +1100, Balbir Singh wrote:
> On Tue, Dec 19, 2017 at 8:20 PM, Gautham R Shenoy
>  wrote:
> > Hi Viresh,
> > On Mon, Dec 18, 2017 at 01:59:35PM +0530, Viresh Kumar wrote:
> >> On 18-12-17, 10:41, Abhishek wrote:
> >> > We need to do it in this way as the current implementation takes the max 
> >> > of
> >> > the PMSR of the cores. Thus, when the frequency is required to be ramped 
> >> > up,
> >> > it suffices to write to just the local PMSR, but when the frequency is 
> >> > to be
> >> > ramped down, if we don't send the IPI it breaks the compatibility with 
> >> > P8.
> >>
> >> Looks strange really that you have to program this differently for 
> >> speeding up
> >> or down. These CPUs are part of one cpufreq policy and so I would normally
> >> expect changes to any CPU should reflect for other CPUs as well.
> >>
> >> @Goutham: Do you know why it is so ?
> >>
> >
> > These are due to some implementation quirks where the platform has
> > provided a PMCR per-core to be backward compatible with POWER8, but
> > controls the frequency at a quad-level, by taking the maximum of the
> > four PMCR values instead of the latest one. So, changes to any CPU in
> > the core will reflect on all the cores if the frequency is higher than
> > the current frequency, but not necessarily if the requested frequency
> > is lower than the current frequency.
> >
> > Without sending the extra IPIs, we will be breaking the ABI since if
> > we set userspace governor, and change the frequency of a core by
> > lowering it, then it will not reflect on the CPUs of the cores in the
> > quad.
> 
> 
> What about cpufreq_policy->cpus/related_cpus? Am I missing something?

The frequency indicator passed via the device tree is used to derive
the mask corresponding to the set of CPUs that share the same
frequency. It is this mask that is set to
cpufreq_policy->cpus/related_cpus.
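
The quirk described in the quoted text above (per-core requests, with the quad running at the maximum of the four) can be modelled in a few lines of C. This is a toy model for illustration only: the struct, the function names, and the frequency values are hypothetical, and nothing here touches real PMCR registers.

```c
#include <assert.h>
#include <stddef.h>

#define CORES_PER_QUAD 4

/* Hypothetical model: one frequency request per core in the quad. */
struct quad {
	unsigned int req[CORES_PER_QUAD];
};

/* The platform is described as taking the maximum of the four requests. */
static unsigned int quad_effective_freq(const struct quad *q)
{
	unsigned int max = q->req[0];

	for (size_t i = 1; i < CORES_PER_QUAD; i++)
		if (q->req[i] > max)
			max = q->req[i];
	return max;
}

/* Writing only the local request, as when no IPIs are sent. */
static void set_local(struct quad *q, size_t core, unsigned int freq)
{
	assert(core < CORES_PER_QUAD);
	q->req[core] = freq;
}

/* Writing every core's request, which is what the extra IPIs achieve. */
static void set_all(struct quad *q, unsigned int freq)
{
	for (size_t i = 0; i < CORES_PER_QUAD; i++)
		set_local(q, i, freq);
}
```

In this model a local write is enough to ramp the quad up, but a local write with a lower value leaves the effective frequency unchanged until the other three requests are lowered as well.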


> 
> >
> > Abhishek,
> > I think we can rework this by sending the extra IPIs only in the
> > presence of the quirk which can be indicated through a device-tree
> > parameter. If a future implementation fixes this, then we won't need
> > the extra IPIs.
> 
> Balbir Singh.
>