from:"Punit Agrawal"

Re: [RFC PATCH 2/4] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-17 Thread Punit Agrawal

Borislav Petkov  writes:

> On Mon, Dec 14, 2020 at 10:27:09PM +0900, Punit Agrawal wrote:
>> IIUC, this suggests that Linux booting on anything prior to Zen3 is down
>> to pure luck - I hope that wasn't the case.
>
> WTF does this have to do with linux booting?!

I guess I misunderstood the comment from your previous mail. Pasting
back for context (emphasis mine) -

VS the clear statement from AMD that from zen3 onwards, all BIOS
will be tested. *I hope they boot Linux at least before they ship.*

>> At the moment acpi thermals is bust on this and other affected AMD
>> system I have access to. That'll need fixing before any sensible
>> measurements can be run.
>
> Nope, still not answering my questions.
>
>> Tbh, I didn't quite expect the patch to the PSD exclusion list to be
>> so controversial
>
> It won't be if you explain properly what your patch is fixing. That is,
> if it fixes anything.

I stared at the driver code (and the ACPI tables for the platform) to
see if I could provide a better explanation.

That's when I realised that as the platform advertises hardware
frequency co-ordination, even without the override it still skips
setting the policy cpus -

/*
 * Will let policy->cpus know about dependency only when software
 * coordination is required.
 */
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
cpumask_copy(policy->cpus, perf->shared_cpu_map);
}

This ends up treating each CPU as an independent frequency domain
anyways. So even ignoring the override for the CPU, doesn't change
anything other than dropping the message from boot log -

overriding BIOS provided _PSD data

As such, there's no point in merging the patch as it is.

Apologies for the noise. I should've checked more thoroughly before
sending the patches. 

[ Aside: If Zen3 is using hardware co-ordination it'll probably face the
issue described above as well. ]

Re: [RFC PATCH 2/4] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-14 Thread Punit Agrawal

Borislav Petkov  writes:

> On Sat, Dec 12, 2020 at 08:36:48AM +0900, Punit Agrawal wrote:
>> To me it suggests, that there are likely more systems from the family
>> that show the characteristic described below.
>
> Until we find a *single* system with a broken BIOS which has those
> objects kaputt and then this heuristic would need an exception.
>
> VS the clear statement from AMD that from zen3 onwards, all BIOS will be
> tested. I hope they boot Linux at least before they ship.

IIUC, this suggests that Linux booting on anything prior to Zen3 is down
to pure luck - I hope that wasn't the case.

>> In all these systems, the override causes this topology information to
>> be ignored - treating each core to be a separate domain. The proposed
>> patch removes the override so that _PSD is taken into account.
>
> You're still not answering my question: what does the coupling of the
> SMT threads bring on those systems? Power savings? Perf improvement?
> Anything palpable or measurable?

At the moment acpi thermals is bust on this and other affected AMD
system I have access to. That'll need fixing before any sensible
measurements can be run.

Tbh, I didn't quite expect the patch to the PSD exclusion list to be so
controversial - especially when a similar change for zen3 had recently
been merged. If you're really not keen on the change, I will carry it
locally for the time being and revisit once the other issues have been
resolved.

Re: [RFC PATCH 2/4] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-11 Thread Punit Agrawal

Borislav Petkov  writes:

> On Wed, Dec 09, 2020 at 08:21:48AM +0900, Punit Agrawal wrote:
>> According to the commit log, acd316248205 seems to be only targeted at
>> powernow-K8 -
>
> No, it is not targeted at powernow-k8 - acpi-cpufreq.c is what is used
> on AMD hw. He means to make acpi-cpufreq's behavior consistent with
> powernow-k8.

So "powernow-k8" is not a cpu but a cpufreq driver. That doesn't change
the fact that the patch causes all AMD systems using acpi-cpufreq to
ignore processor frequency groupings and treat each processor to be an
in independent frequency domain from cpufreq's point of view.

>> But if that is not available, the only way we have is to include
>> systems that have been verified to not need the override
>
> You have verified exactly *one* system - yours. Or are you sure that
> *all* family 0x17, model 0x60, stepping 0x1 machines don't need the
> override?

Unfortunately, I only have access to one system with that F/M/S.

Since posting the non-RFC patches, I was able to inspect the ACPI tables
for more CPUs -

Family: 0x17h, Model: 0x71h (Ryzen 3950X)
Family: 0x17h, Model: 0x18h (Ryzen 3500u)

To me it suggests, that there are likely more systems from the family
that show the characteristic described below.

> Also, you still haven't explained what you're trying to fix here.

All the CPUs here are multi-threaded with 2 threads per core. The _PSD
for the system describes the cores as having a coupling that consist of
a frequency domain per core that contains both the threads. The firmware
description makes sense and seems to accurately describe the hardware
topology.

In all these systems, the override causes this topology information to
be ignored - treating each core to be a separate domain. The proposed
patch removes the override so that _PSD is taken into account.

Re: [PATCH 2/2] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-11 Thread Punit Agrawal

Borislav Petkov  writes:

> On Fri, Dec 11, 2020 at 07:56:40AM +0900, Punit Agrawal wrote:
>> Booting Linux on a Zen2 based processor (family: 0x17, model: 0x60,
>> stepping: 0x1) shows the following message in the logs -
>> 
>> acpi_cpufreq: overriding BIOS provided _PSD data
>> 
>> Although commit 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting
>> on new AMD CPUs") indicates that the override is not required for Zen3
>> onwards, it seems that domain information can be trusted even on
>> certain earlier systems. Update the check, to skip the override for
>> Zen2 processors known to work without the override.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Wei Huang 
>> ---
>>  drivers/cpufreq/acpi-cpufreq.c | 9 +++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> What about answers to those questions first?
>
> https://lkml.kernel.org/r/20201208233216.gh27...@zn.tnic

Oh.. sorry I missed that mail for some reason. Let's continue
there.

[PATCH 2/2] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-10 Thread Punit Agrawal

Booting Linux on a Zen2 based processor (family: 0x17, model: 0x60,
stepping: 0x1) shows the following message in the logs -

acpi_cpufreq: overriding BIOS provided _PSD data

Although commit 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting
on new AMD CPUs") indicates that the override is not required for Zen3
onwards, it seems that domain information can be trusted even on
certain earlier systems. Update the check, to skip the override for
Zen2 processors known to work without the override.

Signed-off-by: Punit Agrawal 
Cc: Wei Huang 
---
 drivers/cpufreq/acpi-cpufreq.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 126315ad225f..32b6449a438b 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -198,8 +198,13 @@ static bool amd_override_acpi_psd(unsigned int cpu_id)
if (c->x86_vendor == X86_VENDOR_AMD) {
if (!check_amd_hwpstate_cpu(cpu_id))
return false;
-
-   return c->x86 < 0x19;
+   /*
+* CPU's before Zen3 (except some Zen2) need the
+* override.
+*/
+   return (c->x86 < 0x19) &&
+   !(c->x86 == 0x17 && c->x86_model == 0x60 &&
+ c->x86_stepping == 0x1);
}
 
return false;
-- 
2.29.2

[PATCH 0/2] Add processor to the ignore PSD override list

2020-12-10 Thread Punit Agrawal

Hi,

Here's an update of the previously posted patches at [0].

Patch 1 refactors the test to ignore firmware provided frequency
domain into a separate function.

Patch 2 adds the APU (Family: 0x17, Model: 0x60, Stepping: 0x01) to
the list of CPUs for which the PSD override is ignored. In the absence
of clarity regarding the affected CPUs, the patch adds the CPU by
model numbers.

Thanks,
Punit

Changes since RFC:
* Dropped patches 3 and 4 to add macros for AMD processor families due
  to lack of interest
* Patch 1
  - renamed override_acpi_psd() to amd_override_acpi_psd()
  - Changed return value of the function to bool

Punit Agrawal (2):
  cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD
  cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

 drivers/cpufreq/acpi-cpufreq.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

-- 
2.29.2

[PATCH 1/2] cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD

2020-12-10 Thread Punit Agrawal

Re-factor the code to override the firmware provided frequency domain
information (via PSD) to localise the checks in one function.

No functional change intended.

Signed-off-by: Punit Agrawal 
Cc: Wei Huang 
---
 drivers/cpufreq/acpi-cpufreq.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 1e4fbb002a31..126315ad225f 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -191,6 +191,20 @@ static int check_amd_hwpstate_cpu(unsigned int cpuid)
return cpu_has(cpu, X86_FEATURE_HW_PSTATE);
 }
 
+static bool amd_override_acpi_psd(unsigned int cpu_id)
+{
+   struct cpuinfo_x86 *c = _cpu_data;
+
+   if (c->x86_vendor == X86_VENDOR_AMD) {
+   if (!check_amd_hwpstate_cpu(cpu_id))
+   return false;
+
+   return c->x86 < 0x19;
+   }
+
+   return false;
+}
+
 static unsigned extract_io(struct cpufreq_policy *policy, u32 value)
 {
struct acpi_cpufreq_data *data = policy->driver_data;
@@ -691,8 +705,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
cpumask_copy(policy->cpus, topology_core_cpumask(cpu));
}
 
-   if (check_amd_hwpstate_cpu(cpu) && boot_cpu_data.x86 < 0x19 &&
-   !acpi_pstate_strict) {
+   if (amd_override_acpi_psd(cpu) && !acpi_pstate_strict) {
cpumask_clear(policy->cpus);
cpumask_set_cpu(cpu, policy->cpus);
cpumask_copy(data->freqdomain_cpus,
-- 
2.29.2

Re: [PATCH v3 1/4] dt-bindings: gpio: Add bindings for Toshiba Visconti GPIO Controller

2020-12-09 Thread Punit Agrawal

Nobuhiro Iwamatsu  writes:

> Add bindings for the Toshiba Visconti GPIO Controller.
>
> Signed-off-by: Nobuhiro Iwamatsu 
> ---
>  .../bindings/gpio/toshiba,gpio-visconti.yaml  | 85 +++
>  1 file changed, 85 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
>
> diff --git 
> a/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml 
> b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
> new file mode 100644
> index ..5168a15b90e1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
> @@ -0,0 +1,85 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/gpio/toshiba,gpio-visconti.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Toshiba Visconti ARM SoCs GPIO controller
> +
> +maintainers:
> +  - Nobuhiro Iwamatsu 
> +
> +properties:
> +  compatible:
> +items:
> +  - const: toshiba,gpio-tmpv7708
> +
> +  reg:
> +maxItems: 1
> +
> +  "#gpio-cells":
> +const: 2
> +
> +  gpio-ranges: true
> +
> +  gpio-controller: true
> +
> +  interrupt-controller: true
> +
> +  "#interrupt-cells":
> +const: 2
> +
> +  interrupts:
> +description:
> +  interrupt mapping one per GPIO.
> +minItems: 16
> +maxItems: 16
> +
> +required:
> +  - compatible
> +  - reg
> +  - "#gpio-cells"
> +  - gpio-ranges
> +  - gpio-controller
> +  - interrupt-controller
> +  - "#interrupt-cells"
> +  - interrupts
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +  #include 
> +  #include 
> +
> +  soc {
> +#address-cells = <2>;
> +#size-cells = <2>;
> +
> +gpio: gpio@2802 {
> +  compatible = "toshiba,gpio-tmpv7708";
> +  reg = <0 0x2802 0 0x1000>;
> +  #gpio-cells = <0x2>;
> +  gpio-ranges = < 0 0 32>;
> +  gpio-controller;
> +  interrupt-controller;
> +  #interrupt-cells = <2>;
> +  interrupts = ,
> +      ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ;
> +};
> +  };
> +...

FWIW,

Reviewed-by: Punit Agrawal 

Thanks,
Punit

Re: [PATCH v3 1/4] dt-bindings: gpio: Add bindings for Toshiba Visconti GPIO Controller

2020-12-09 Thread Punit Agrawal

Rob Herring  writes:

[...]

>> > +  gpio-ranges: true
>> 
>> I am not sure I have a good handle on the yaml schema definitions but
>> "gpio-ranges" feels like it should be a list of ranges not a boolean.
>> 
>> Something like -
>> 
>> gpio-ranges:
>>   maxItems: 1
>> 
>> feels more appropriate.
>> 
>> I see both the usages in gpio bindings and for other range properties so
>> maybe it's OK. I hope Rob or somebody more knowledgeable on this can
>> clarify the usage.
>
> If you know how many (or a range) entries there are for gpio-ranges, 
> then maxItems is good. If you don't, then 'gpio-ranges: true' is fine. 
> That doesn't make the property a boolean, but just says the property can 
> be present.

Makes sense. Thanks for the explanation.

[...]

Re: [RFC PATCH 1/4] cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD

2020-12-08 Thread Punit Agrawal

Hi Wei,

Wei Huang  writes:

> On 11/25/20 8:48 AM, Punit Agrawal wrote:
>> Re-factor the code to override the firmware provided frequency domain
>> information (via PSD) to localise the checks in one function.
>> 
>> No functional change intended.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Wei Huang 
>> ---
>>  drivers/cpufreq/acpi-cpufreq.c | 17 +++--
>>  1 file changed, 15 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
>> index 1e4fbb002a31..b1e7df96d428 100644
>> --- a/drivers/cpufreq/acpi-cpufreq.c
>> +++ b/drivers/cpufreq/acpi-cpufreq.c
>> @@ -191,6 +191,20 @@ static int check_amd_hwpstate_cpu(unsigned int cpuid)
>>  return cpu_has(cpu, X86_FEATURE_HW_PSTATE);
>>  }
>>  
>> +static int override_acpi_psd(unsigned int cpu_id)
>  ^
> int is fine, but it might be better to use bool. Otherwise I don't see
> any issues with this patch.

Makes sense - I will switch to a boolean in the next update.

Thanks for taking a look.

Punit

>
>> +{
>> +struct cpuinfo_x86 *c = _cpu_data;
>> +
>> +if (c->x86_vendor == X86_VENDOR_AMD) {
>> +if (!check_amd_hwpstate_cpu(cpu_id))
>> +return false;
>> +
>> +return c->x86 < 0x19;
>> +}
>> +
>> +return false;
>> +}
>> +
>>  static unsigned extract_io(struct cpufreq_policy *policy, u32 value)
>>  {
>>  struct acpi_cpufreq_data *data = policy->driver_data;
>> @@ -691,8 +705,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy 
>> *policy)
>>  cpumask_copy(policy->cpus, topology_core_cpumask(cpu));
>>  }
>>  
>> -if (check_amd_hwpstate_cpu(cpu) && boot_cpu_data.x86 < 0x19 &&
>> -!acpi_pstate_strict) {
>> +if (override_acpi_psd(cpu) && !acpi_pstate_strict) {
>>  cpumask_clear(policy->cpus);
>>  cpumask_set_cpu(cpu, policy->cpus);
>>  cpumask_copy(data->freqdomain_cpus,
>>

Re: [RFC PATCH 0/4] Add processor to the ignore PSD override list

2020-12-08 Thread Punit Agrawal

Hi Rafael,

"Rafael J. Wysocki"  writes:

> On Fri, Dec 4, 2020 at 11:45 PM Punit Agrawal  wrote:
>>
>> Hi Rafael,
>>
>> Punit Agrawal  writes:
>>
>> > Hi,
>> >
>> > While looking into Giovanni's patches to enable frequency invariance
>> > on AMD systems[0], I noticed an issue with initialising frequency
>> > domain information on a recent AMD APU.
>> >
>> > Patch 1 refactors the test to ignore firmware provided frequency
>> > domain into a separate function.
>> >
>> > Patch 2 adds said APU (Family: 0x17, Model: 0x60, Stepping: 0x01) to
>> > the list of CPUs for which the PSD override is ignored. I am not quite
>> > happy with having to special case a particular CPU but also couldn't
>> > find any documentation to help identify the CPUs that don't need the
>> > override.
>>
>> Are you be OK to pick the first two patches if there are no issues?
>
> Please send them as non-RFC and change the name of override_acpi_psd()
> to indicate that it is AMD-specific.

Ack - I will incorporate your comments in the next version (once the
ongoing discussion finishes).

Thanks.

Re: [RFC PATCH 2/4] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-12-08 Thread Punit Agrawal

Borislav Petkov  writes:

> On Mon, Dec 07, 2020 at 04:07:52PM -0600, Wei Huang wrote:
>> I think we shouldn't override zen2 if _PSD is correct. In my opinion,
>> there are two approaches:
>> 
>> * Keep override_acpi_psd()
>> Let us keep the original quirk and override_acpi_psd() function. Over
>> the time, people may want to add new CPUs to override_acpi_psd(). The
>> maintainer may declare that only CPUs >= family 17h will be fixed, to
>> avoid exploding the check-list.
>> 
>> * Remove the quirk completely
>> We can completely remove commit acd316248205 ("acpi-cpufreq: Add quirk
>> to disable _PSD usage on all AMD CPUs")? I am not sure what "AMD desktop
>> boards" was referring to in the original commit message of acd316248205.
>> Maybe such machines aren't in use anymore.
>
> * Third option: do not do anything. Why?
>
> - Let sleeping dogs lie and leave the workaround acd316248205 for old
> machines.

According to the commit log, acd316248205 seems to be only targeted at
powernow-K8 -

"in order to use the hardware to its full potential and keep the
original powernow-k8 behavior, lets override the _PSD table setting on
AMD hardware."

It's unfortunate that it has continued to apply to all systems since. An
alternate to this and 5368512abe08 would be to restrict the original
workaround to the system mentioned in the commit log.

> - Make a clear cut that the override is not needed from Zen3 on, i.e.,
> your patch
>
>5368512abe08 ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs")
>
>
> Punit's commit message reads "...indicates that the override is not
> required for Zen3 onwards, it seems that domain information can be
> trusted even on certain earlier systems."
>
> That's not nearly a justification in my book to do this on anything <
> Zen3.
>
> This way you have a clear cut, you don't need to deal with adding any
> more models to override_acpi_psd() and all is good.

PSD (P-State Dependency) object is used to express the coupling of
processors that share the same frequency domain. Ignoring the hardware
topology information will cause both power management and scheduling to
make sub-optimal choices.

Wasn't this the motivation for taking it into account for Zen3 based
systems in 5368512abe08? Or the powernow-k8 in the original workaround
(though there it required ignoring firmware)? The exact same argument
applies here to the Zen2 system I have running in a thermally
constrained environment.

In an ideal case, I was hoping AMD folks would come back with
confirmation of how far back the override can be dropped and the
workaround condition could be relaxed further. But if that is not
available, the only way we have is to include systems that have been
verified to not need the override and somebody that cares to send the
patch.

Re: [RFC PATCH 0/4] Add processor to the ignore PSD override list

2020-12-04 Thread Punit Agrawal

Hi Rafael,

Punit Agrawal  writes:

> Hi,
>
> While looking into Giovanni's patches to enable frequency invariance
> on AMD systems[0], I noticed an issue with initialising frequency
> domain information on a recent AMD APU.
>
> Patch 1 refactors the test to ignore firmware provided frequency
> domain into a separate function.
>
> Patch 2 adds said APU (Family: 0x17, Model: 0x60, Stepping: 0x01) to
> the list of CPUs for which the PSD override is ignored. I am not quite
> happy with having to special case a particular CPU but also couldn't
> find any documentation to help identify the CPUs that don't need the
> override.

Are you be OK to pick the first two patches if there are no issues?

Thanks,
Punit


> Patch 3 and 4 are somewhat independent and a first step towards
> improving the situation with regards to the use of raw identifiers for
> AMD processors throughout the kernel.
>
> All feedback welcome.
>
> Thanks,
> Punit
>
> [0] 
> https://lore.kernel.org/linux-acpi/20201112182614.10700-1-ggherdov...@suse.cz/
>
> Punit Agrawal (4):
>   cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD
>   cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list
>   x86/cpu: amd: Define processor families
>   cpufreq: acpi-cpufreq: Use identifiers for AMD processor family
>
>  arch/x86/include/asm/amd-family.h| 18 ++
>  arch/x86/include/asm/cpu_device_id.h |  2 ++
>  drivers/cpufreq/acpi-cpufreq.c   | 24 +---
>  3 files changed, 41 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/include/asm/amd-family.h

Re: [PATCH] arm64: dts: visconti: Add watchdog support for TMPV7708 SoC

2020-12-03 Thread Punit Agrawal

Nobuhiro Iwamatsu  writes:

> Add watchdog node in TMPV7708's dtsi, and tmpv7708-rm-mbrc boards's
> dts.
>
> Signed-off-by: Nobuhiro Iwamatsu 
> ---
>  arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts |  5 +
>  arch/arm64/boot/dts/toshiba/tmpv7708.dtsi| 12 
>  2 files changed, 17 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts 
> b/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> index ed0bf7f13f54..37da418393e0 100644
> --- a/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> +++ b/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> @@ -41,3 +41,8 @@  {
>   clocks = <_clk>;
>   clock-names = "apb_pclk";
>  };
> +
> + {
> + status = "okay";
> + clocks = <_clk>;
> +};
> diff --git a/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi 
> b/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> index 242f25f4e12a..c360e68bef1d 100644
> --- a/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> +++ b/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> @@ -134,6 +134,12 @@ uart_clk: uart-clk {
>   #clock-cells = <0>;
>   };
>  
> + wdt_clk: wdt-clk {
> + compatible = "fixed-clock";
> + clock-frequency = <15000>;
> + #clock-cells = <0>;
> + };
> +
>   soc {
>   #address-cells = <2>;
>   #size-cells = <2>;
> @@ -384,6 +390,12 @@ spi6: spi@28146000 {
>   #size-cells = <0>;
>   status = "disabled";
>   };
> +
> + wdt: wdt@28330000 {
> + compatible = "toshiba,visconti-wdt";
> + reg = <0 0x2833 0 0x1000>;
> + status = "disabled";
> + };
>   };
>  };

FWIW,

Reviewed-by: Punit Agrawal 

Thanks,
Punit

Re: [PATCH v3 4/4] arm: dts: visconti: Add DT support for Toshiba Visconti5 GPIO driver

2020-12-03 Thread Punit Agrawal

Nobuhiro Iwamatsu  writes:

> Add the GPIO node in Toshiba Visconti5 SoC-specific DT file.
> And enable the GPIO node in TMPV7708 RM main board's board-specific DT file.
>
> Signed-off-by: Nobuhiro Iwamatsu 
> ---
>  .../boot/dts/toshiba/tmpv7708-rm-mbrc.dts |  4 +++
>  arch/arm64/boot/dts/toshiba/tmpv7708.dtsi | 27 +++
>  2 files changed, 31 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts 
> b/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> index ed0bf7f13f54..950010a290f0 100644
> --- a/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> +++ b/arch/arm64/boot/dts/toshiba/tmpv7708-rm-mbrc.dts
> @@ -41,3 +41,7 @@  {
>   clocks = <_clk>;
>   clock-names = "apb_pclk";
>  };
> +
> + {
> + status = "okay";
> +};
> diff --git a/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi 
> b/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> index 242f25f4e12a..ac9bddb35b0a 100644
> --- a/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> +++ b/arch/arm64/boot/dts/toshiba/tmpv7708.dtsi
> @@ -157,6 +157,33 @@ pmux: pmux@2419 {
>   reg = <0 0x2419 0 0x1>;
>   };
>  
> + gpio: gpio@2802 {
> + compatible = "toshiba,gpio-tmpv7708";
> + reg = <0 0x2802 0 0x1000>;
> + #gpio-cells = <0x2>;
> + gpio-ranges = < 0 0 32>;
> + gpio-controller;
> + interrupt-controller;
> + #interrupt-cells = <2>;
> + interrupts =
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> + ,
> +     ;
> + };
> +
>   uart0: serial@2820 {
>   compatible = "arm,pl011", "arm,primecell";
>   reg = <0 0x2820 0 0x1000>;

FWIW,

Reviewed-by: Punit Agrawal 

Thanks,
Punit

Re: [PATCH v3 2/4] gpio: visconti: Add Toshiba Visconti GPIO support

2020-12-03 Thread Punit Agrawal

> + struct visconti_gpio *priv;
> + struct irq_chip *irq_chip;
> + struct irq_desc *desc;
> + struct gpio_irq_chip *girq;
> + const char *name = dev_name(dev);
> + int i, ret, num_irq;
> +
> + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> +     priv->dev = dev;
> + spin_lock_init(>lock);
> +
> + priv->base = devm_platform_ioremap_resource(pdev, 0);
> + if (IS_ERR(priv->base))
> + return PTR_ERR(priv->base);
> +
> + ret = platform_irq_count(pdev);
> + if (!ret) {
> + dev_err(dev, "Couldn't determine # GPIO banks\n");
> + return -ENOENT;
> + }

platform_irq_count() can return -EPROBE_DEFER. Is that something that
should be handled?

With the above two comments addressed, feel free to add 

Reviewed-by: Punit Agrawal 

Thanks,
Punit


> + num_irq = ret;
> +
> + priv->irq = devm_kcalloc(dev, num_irq, sizeof(priv->irq), GFP_KERNEL);
> + if (!priv->irq)
> + return -ENOMEM;
> +
> + for (i = 0; i < num_irq; i++) {
> + priv->irq[i] = platform_get_irq(pdev, i);
> + if (priv->irq[i] < 0) {
> + dev_err(dev, "invalid IRQ[%d]\n", i);
> + return priv->irq[i];
> + }
> + }
> +
> + ret = bgpio_init(>gpio_chip, dev, 4,
> +  priv->base + GPIO_IDATA,
> +  priv->base + GPIO_OSET,
> +  priv->base + GPIO_OCLR,
> +  priv->base + GPIO_DIR,
> +  NULL,
> +  0);
> + if (ret) {
> + dev_err(dev, "unable to init generic GPIO\n");
> + return ret;
> + }
> +
> + priv->gpio_chip.irq.init_valid_mask = visconti_init_irq_valid_mask;
> +
> + irq_chip = >irq_chip;
> + irq_chip->name = "gpio-visconti";
> + irq_chip->irq_mask = visconti_gpio_irq_mask;
> + irq_chip->irq_unmask = visconti_gpio_irq_unmask;
> + irq_chip->irq_set_type = visconti_gpio_irq_set_type;
> + irq_chip->flags = IRQCHIP_SET_TYPE_MASKED | IRQCHIP_MASK_ON_SUSPEND;
> +
> + girq = >gpio_chip.irq;
> + girq->chip = irq_chip;
> + /* This will let us handle the parent IRQ in the driver */
> + girq->parent_handler = NULL;
> + girq->num_parents = 0;
> + girq->parents = NULL;
> + girq->default_type = IRQ_TYPE_NONE;
> + girq->handler = handle_level_irq;
> +
> + ret = devm_gpiochip_add_data(dev, >gpio_chip, priv);
> + if (ret) {
> + dev_err(dev, "failed to add GPIO chip\n");
> + return ret;
> + }
> +
> + for (i = 0; i < num_irq; i++) {
> + desc = irq_to_desc(priv->irq[i]);
> + desc->status_use_accessors |= IRQ_NOAUTOEN;
> + if (devm_request_irq(dev, priv->irq[i],
> +  visconti_gpio_irq_handler, 0, name, priv)) 
> {
> + dev_err(dev, "failed to request IRQ[%d]\n", i);
> + return -ENOENT;
> + }
> + }
> +
> + return ret;
> +}
> +
> +static const struct of_device_id visconti_gpio_of_match[] = {
> + { .compatible = "toshiba,gpio-tmpv7708", },
> + { /* end of table */ }
> +};
> +MODULE_DEVICE_TABLE(of, visconti_gpio_of_match);
> +
> +static struct platform_driver visconti_gpio_driver = {
> + .probe  = visconti_gpio_probe,
> + .driver = {
> + .name   = "visconti_gpio",
> + .of_match_table = of_match_ptr(visconti_gpio_of_match),
> + }
> +};
> +module_platform_driver(visconti_gpio_driver);
> +
> +MODULE_AUTHOR("Toshiba Electronic Devices & Storage Corporation");
> +MODULE_AUTHOR("Nobuhiro Iwamatsu ");
> +MODULE_DESCRIPTION("Toshiba Visconti GPIO Driver");
> +MODULE_LICENSE("GPL v2");

[...]

Re: [PATCH v3 1/4] dt-bindings: gpio: Add bindings for Toshiba Visconti GPIO Controller

2020-12-03 Thread Punit Agrawal

Iwamatsu-san,

Nobuhiro Iwamatsu  writes:

> Add bindings for the Toshiba Visconti GPIO Controller.
>
> Signed-off-by: Nobuhiro Iwamatsu 
> ---
>  .../bindings/gpio/toshiba,gpio-visconti.yaml  | 85 +++
>  1 file changed, 85 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
>
> diff --git 
> a/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml 
> b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
> new file mode 100644
> index ..5168a15b90e1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
> @@ -0,0 +1,85 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/gpio/toshiba,gpio-visconti.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Toshiba Visconti ARM SoCs GPIO controller
> +
> +maintainers:
> +  - Nobuhiro Iwamatsu 
> +
> +properties:
> +  compatible:
> +items:
> +  - const: toshiba,gpio-tmpv7708
> +
> +  reg:
> +maxItems: 1
> +
> +  "#gpio-cells":
> +const: 2
> +
> +  gpio-ranges: true

I am not sure I have a good handle on the yaml schema definitions but
"gpio-ranges" feels like it should be a list of ranges not a boolean.

Something like -

gpio-ranges:
  maxItems: 1

feels more appropriate.

I see both the usages in gpio bindings and for other range properties so
maybe it's OK. I hope Rob or somebody more knowledgeable on this can
clarify the usage.

Otherwise, the patch looks good.

Thanks,
Punit

> +
> +  gpio-controller: true
> +
> +  interrupt-controller: true
> +
> +  "#interrupt-cells":
> +const: 2
> +
> +  interrupts:
> +description:
> +  interrupt mapping one per GPIO.
> +minItems: 16
> +maxItems: 16
> +
> +required:
> +  - compatible
> +  - reg
> +  - "#gpio-cells"
> +  - gpio-ranges
> +  - gpio-controller
> +  - interrupt-controller
> +  - "#interrupt-cells"
> +  - interrupts
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +  #include 
> +  #include 
> +
> +  soc {
> +#address-cells = <2>;
> +#size-cells = <2>;
> +
> +gpio: gpio@2802 {
> +  compatible = "toshiba,gpio-tmpv7708";
> +  reg = <0 0x2802 0 0x1000>;
> +  #gpio-cells = <0x2>;
> +  gpio-ranges = < 0 0 32>;
> +  gpio-controller;
> +  interrupt-controller;
> +  #interrupt-cells = <2>;
> +  interrupts = ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ,
> +  ;
> +};
> +  };
> +...

Re: [RFC PATCH 4/4] cpufreq: acpi-cpufreq: Use identifiers for AMD processor family

2020-12-02 Thread Punit Agrawal

Borislav Petkov  writes:

> On Wed, Nov 25, 2020 at 11:48:47PM +0900, Punit Agrawal wrote:
>> Replace the raw values for AMD processor family with recently
>> introduced identifier macros to improve code readability and
>> maintainability.
>> 
>> Signed-off-by: Punit Agrawal 
>> ---
>>  drivers/cpufreq/acpi-cpufreq.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
>> index 29f1cd93541e..d8b8300ae9e0 100644
>> --- a/drivers/cpufreq/acpi-cpufreq.c
>> +++ b/drivers/cpufreq/acpi-cpufreq.c
>> @@ -202,8 +202,8 @@ static int override_acpi_psd(unsigned int cpu_id)
>>   * CPU's before Zen3 (except some Zen2) need the
>>   * override.
>>   */
>> -return (c->x86 < 0x19) &&
>> -!(c->x86 == 0x17 && c->x86_model == 0x60 &&
>> +return (c->x86 < AMD_FAM_ZEN3) &&
>> +!(c->x86 == AMD_FAM_ZEN && c->x86_model == 0x60 &&
>
> This is what I mean - that's Zen2 as the comment above says so having
>
>   c->x86 == AMD_FAM_ZEN
>
> is not enough. And you have a comment above it stating which CPUs are
> matched here so I'm not sure those family defines make it any better...

Hmm.. for this series, it probably doesn't add much value - especially
with the comment and macro mismatch.

The last two patches were posted to check if there's wider interest in
the changes. If the macro conversion is useful, I can split the patches
from this series into a separate set with more sites being updated. I'll
wait to see if there's any further feedback.

Thanks,
Punit

Re: [RFC PATCH 3/4] x86/cpu: amd: Define processor families

2020-12-02 Thread Punit Agrawal

Hi Boris,

Borislav Petkov  writes:

> On Wed, Nov 25, 2020 at 11:48:46PM +0900, Punit Agrawal wrote:
>> So far, the AMD processor identifier (family, models, stepping) are
>> referred to by raw values making it easy to make mistakes. It is also
>> harder to read and maintain. Additionally, these values are also being
>> used in subsystems outside the arch code where not everybody maybe be
>> as familiar with the processor identifiers.
>> 
>> As a first step towards improving the status quo, add macros for the
>> AMD processor families and propagate them through the existing
>> cpu_device_id.h header used for this purpose.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: x...@kernel.org
>> ---
>>  arch/x86/include/asm/amd-family.h| 18 ++
>>  arch/x86/include/asm/cpu_device_id.h |  2 ++
>>  2 files changed, 20 insertions(+)
>>  create mode 100644 arch/x86/include/asm/amd-family.h
>> 
>> diff --git a/arch/x86/include/asm/amd-family.h 
>> b/arch/x86/include/asm/amd-family.h
>> new file mode 100644
>> index ..dff4d13b8e74
>> --- /dev/null
>> +++ b/arch/x86/include/asm/amd-family.h
>> @@ -0,0 +1,18 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ASM_X86_AMD_FAMILY_H
>> +#define _ASM_X86_AMD_FAMILY_H
>> +
>> +#define AMD_FAM_K5  0x04
>> +#define AMD_FAM_K6  0x05
>> +#define AMD_FAM_K7  0x06
>> +#define AMD_FAM_K8  0x0F
>> +#define AMD_FAM_K10 0x10
>
> Fam 0x10 is Greyhound and a lot more core names. I'd let the AMD folks
> on Cc say what they wanna call it but I don't think K10 was used
> anywhere except outside of AMD. :)

Didn't realize the core was internal only. There are a couple of
instances where the family does shows up arch/x86/kernel/cpu/amd.c so
atleast some of the patches did make it upstream.

>> +#define AMD_FAM_K8_K10_HYBRID   0x11
>
> That was called Griffin so AMD_FAM_GRIFFIN I guess. If not used anywhere
> in the tree, no need to add the define.

I haven't looked to deeply but there's one instance in
arch/x86/kernel/cpu/amd.c - though I wonder if that could be re-written
to rely on a hardware / firmware interface instead.

> Same holds true for the rest of the defines - add them only when
> they're used.

Makes sense - I will follow your suggestion in the next version.

>> +#define AMD_FAM_LLANO   0x12
>> +#define AMD_FAM_BOBCAT  0x14
>> +#define AMD_FAM_BULLDOZER   0x15
>> +#define AMD_FAM_JAGUAR  0x16
>> +#define AMD_FAM_ZEN 0x17
>
> ZEN2 is also 0x17 but different models so this is where the family
> matching scheme doesn't work - you'd need the models too.

Yes, I wasn't sure the best way to handle this so went with the earlier
generation name. I think for such cases, something that looks at the
cpuinfo_x86 structure and decides the family / generation is probably
the way to go.

> 0x18 is HYGON

I missed this one. I'll add it to the list.

Thanks for the review and your comments. I'll wait a bit to see if
there's any further feedback.

Cheers,
Punit

[...]

[RFC PATCH 3/4] x86/cpu: amd: Define processor families

2020-11-25 Thread Punit Agrawal

So far, the AMD processor identifier (family, models, stepping) are
referred to by raw values making it easy to make mistakes. It is also
harder to read and maintain. Additionally, these values are also being
used in subsystems outside the arch code where not everybody maybe be
as familiar with the processor identifiers.

As a first step towards improving the status quo, add macros for the
AMD processor families and propagate them through the existing
cpu_device_id.h header used for this purpose.

Signed-off-by: Punit Agrawal 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
---
 arch/x86/include/asm/amd-family.h| 18 ++
 arch/x86/include/asm/cpu_device_id.h |  2 ++
 2 files changed, 20 insertions(+)
 create mode 100644 arch/x86/include/asm/amd-family.h

diff --git a/arch/x86/include/asm/amd-family.h 
b/arch/x86/include/asm/amd-family.h
new file mode 100644
index ..dff4d13b8e74
--- /dev/null
+++ b/arch/x86/include/asm/amd-family.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_AMD_FAMILY_H
+#define _ASM_X86_AMD_FAMILY_H
+
+#define AMD_FAM_K5 0x04
+#define AMD_FAM_K6 0x05
+#define AMD_FAM_K7 0x06
+#define AMD_FAM_K8 0x0F
+#define AMD_FAM_K100x10
+#define AMD_FAM_K8_K10_HYBRID  0x11
+#define AMD_FAM_LLANO  0x12
+#define AMD_FAM_BOBCAT 0x14
+#define AMD_FAM_BULLDOZER  0x15
+#define AMD_FAM_JAGUAR 0x16
+#define AMD_FAM_ZEN0x17
+#define AMD_FAM_ZEN3   0x19
+
+#endif /* _ASM_X86_AMD_FAMILY_H */
diff --git a/arch/x86/include/asm/cpu_device_id.h 
b/arch/x86/include/asm/cpu_device_id.h
index eb8fcede9e3b..bbb48ba4c7ff 100644
--- a/arch/x86/include/asm/cpu_device_id.h
+++ b/arch/x86/include/asm/cpu_device_id.h
@@ -12,6 +12,8 @@
 #include 
 /* Get the INTEL_FAM* model defines */
 #include 
+/* Get the AMD model defines */
+#include 
 /* And the X86_VENDOR_* ones */
 #include 
 
-- 
2.29.2

[RFC PATCH 2/4] cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list

2020-11-25 Thread Punit Agrawal

Booting Linux on a Zen2 based processor (family: 0x17, model: 0x60,
stepping: 0x1) shows the following message in the logs -

acpi_cpufreq: overriding BIOS provided _PSD data

Although commit 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting
on new AMD CPUs") indicates that the override is not required for Zen3
onwards, it seems that domain information can be trusted even on
certain earlier systems. Update the check, to skip the override for
Zen2 processors known to work without the override.

Signed-off-by: Punit Agrawal 
Cc: Wei Huang 
---
 drivers/cpufreq/acpi-cpufreq.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index b1e7df96d428..29f1cd93541e 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -198,8 +198,13 @@ static int override_acpi_psd(unsigned int cpu_id)
if (c->x86_vendor == X86_VENDOR_AMD) {
if (!check_amd_hwpstate_cpu(cpu_id))
return false;
-
-   return c->x86 < 0x19;
+   /*
+* CPU's before Zen3 (except some Zen2) need the
+* override.
+*/
+   return (c->x86 < 0x19) &&
+   !(c->x86 == 0x17 && c->x86_model == 0x60 &&
+ c->x86_stepping == 0x1);
}
 
return false;
-- 
2.29.2

[RFC PATCH 4/4] cpufreq: acpi-cpufreq: Use identifiers for AMD processor family

2020-11-25 Thread Punit Agrawal

Replace the raw values for AMD processor family with recently
introduced identifier macros to improve code readability and
maintainability.

Signed-off-by: Punit Agrawal 
---
 drivers/cpufreq/acpi-cpufreq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 29f1cd93541e..d8b8300ae9e0 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -202,8 +202,8 @@ static int override_acpi_psd(unsigned int cpu_id)
 * CPU's before Zen3 (except some Zen2) need the
 * override.
 */
-   return (c->x86 < 0x19) &&
-   !(c->x86 == 0x17 && c->x86_model == 0x60 &&
+   return (c->x86 < AMD_FAM_ZEN3) &&
+   !(c->x86 == AMD_FAM_ZEN && c->x86_model == 0x60 &&
  c->x86_stepping == 0x1);
}
 
@@ -735,7 +735,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
switch (perf->control_register.space_id) {
case ACPI_ADR_SPACE_SYSTEM_IO:
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-   boot_cpu_data.x86 == 0xf) {
+   boot_cpu_data.x86 == AMD_FAM_K8) {
pr_debug("AMD K8 systems must use native drivers.\n");
result = -ENODEV;
goto err_unreg;
-- 
2.29.2

[RFC PATCH 0/4] Add processor to the ignore PSD override list

2020-11-25 Thread Punit Agrawal

Hi,

While looking into Giovanni's patches to enable frequency invariance
on AMD systems[0], I noticed an issue with initialising frequency
domain information on a recent AMD APU.

Patch 1 refactors the test to ignore firmware provided frequency
domain into a separate function.

Patch 2 adds said APU (Family: 0x17, Model: 0x60, Stepping: 0x01) to
the list of CPUs for which the PSD override is ignored. I am not quite
happy with having to special case a particular CPU but also couldn't
find any documentation to help identify the CPUs that don't need the
override.

Patch 3 and 4 are somewhat independent and a first step towards
improving the situation with regards to the use of raw identifiers for
AMD processors throughout the kernel.

All feedback welcome.

Thanks,
Punit

[0] 
https://lore.kernel.org/linux-acpi/20201112182614.10700-1-ggherdov...@suse.cz/

Punit Agrawal (4):
  cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD
  cpufreq: acpi-cpufreq: Add processor to the ignore PSD override list
  x86/cpu: amd: Define processor families
  cpufreq: acpi-cpufreq: Use identifiers for AMD processor family

 arch/x86/include/asm/amd-family.h| 18 ++
 arch/x86/include/asm/cpu_device_id.h |  2 ++
 drivers/cpufreq/acpi-cpufreq.c   | 24 +---
 3 files changed, 41 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/amd-family.h

-- 
2.29.2

[RFC PATCH 1/4] cpufreq: acpi-cpufreq: Re-factor overriding ACPI PSD

2020-11-25 Thread Punit Agrawal

Re-factor the code to override the firmware provided frequency domain
information (via PSD) to localise the checks in one function.

No functional change intended.

Signed-off-by: Punit Agrawal 
Cc: Wei Huang 
---
 drivers/cpufreq/acpi-cpufreq.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 1e4fbb002a31..b1e7df96d428 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -191,6 +191,20 @@ static int check_amd_hwpstate_cpu(unsigned int cpuid)
return cpu_has(cpu, X86_FEATURE_HW_PSTATE);
 }
 
+static int override_acpi_psd(unsigned int cpu_id)
+{
+   struct cpuinfo_x86 *c = _cpu_data;
+
+   if (c->x86_vendor == X86_VENDOR_AMD) {
+   if (!check_amd_hwpstate_cpu(cpu_id))
+   return false;
+
+   return c->x86 < 0x19;
+   }
+
+   return false;
+}
+
 static unsigned extract_io(struct cpufreq_policy *policy, u32 value)
 {
struct acpi_cpufreq_data *data = policy->driver_data;
@@ -691,8 +705,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
cpumask_copy(policy->cpus, topology_core_cpumask(cpu));
}
 
-   if (check_amd_hwpstate_cpu(cpu) && boot_cpu_data.x86 < 0x19 &&
-   !acpi_pstate_strict) {
+   if (override_acpi_psd(cpu) && !acpi_pstate_strict) {
cpumask_clear(policy->cpus);
cpumask_set_cpu(cpu, policy->cpus);
cpumask_copy(data->freqdomain_cpus,
-- 
2.29.2

[PATCH] ACPI: processor: Drop duplicate setting of shared_cpu_map

2020-11-23 Thread Punit Agrawal

'shared_cpu_map', stored as part of the per-processor
acpi_processor_performance structre, is used to store cpus that share
a performance domain. By definition it contains the owning cpu.

While building the 'shared_cpu_map' it is being set twice - once while
initialising the performance domains and again when matching cpus
belonging to the same domain.

Drop the unnecessary initialisation.

Signed-off-by: Punit Agrawal 
---
 drivers/acpi/processor_perflib.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/acpi/processor_perflib.c b/drivers/acpi/processor_perflib.c
index b04a68950ff1..b0d320f18163 100644
--- a/drivers/acpi/processor_perflib.c
+++ b/drivers/acpi/processor_perflib.c
@@ -616,7 +616,6 @@ int acpi_processor_preregister_performance(
continue;
 
pr->performance = per_cpu_ptr(performance, i);
-   cpumask_set_cpu(i, pr->performance->shared_cpu_map);
pdomain = &(pr->performance->domain_info);
if (acpi_processor_get_psd(pr->handle, pdomain)) {
retval = -EINVAL;
-- 
2.29.2

Re: [PATCH v5] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-11-12 Thread Punit Agrawal

Smita Koralahalli Channabasappa  writes:

> Punit,
>
> On 11/9/20 1:05 PM, Smita Koralahalli Channabasappa wrote:
>
>> On 11/8/20 7:18 PM, Punit Agrawal wrote:
>>> Borislav Petkov  writes:
>>>> On Fri, Nov 06, 2020 at 02:36:46PM +0900, Punit Agrawal wrote:
>>>>>> diff --git a/drivers/firmware/efi/cper-x86.c 
>>>>>> b/drivers/firmware/efi/cper-x86.c
>>>>>> index 2531de49f56c..438ed9eff6d0 100644
>>>>>> --- a/drivers/firmware/efi/cper-x86.c
>>>>>> +++ b/drivers/firmware/efi/cper-x86.c
>>>>>> @@ -2,6 +2,7 @@
>>>>>>    // Copyright (C) 2018, Advanced Micro Devices, Inc.
>>>>>>      #include 
>>>>>> +#include 
>>>>> Did you mean to include ?
>>>> Why?
>>> Because arch_apei_report_x86_error() used in the patch is defined
>>> there. The indirect include works but pulls in additional definitions
>>> not needed by the patch.
>>>
>>> Do you prefer the more generic include?
>> I agree, it's generally a good practice to avoid pulling up additional
>> definitions. I had this when I made the declaration in generic header
>> file and may be I did not consider it changing initially as my build
>> didn't break after moving the declaration from generic header to arch
>> specific header file.
>> I will take care henceforth and make the changes as required.
>
> The asm specific include throws out a warning when I run checkpatch.pl
>
> WARNING: Use #include  instead of 
> #215: FILE: drivers/firmware/efi/cper-x86.c:5:
> +#include 
>
> Should I just keep the generic include?

Thanks for checking.

I had a quick look at checkpatch to understand the reason for the
warning. It seems to warn when "asm" includes are used when a suitable
"linux" include exists[0].

I am not convinced that the rationale for that check applies in this
case as the function being used is indeed an architecture specific one
but also don't feel strongly enough to object.

Feel free to pick up the "Reviewed-by" tag in either case.

Thanks,
Punit

[0] https://github.com/torvalds/linux/blob/master/scripts/checkpatch.pl#L5333

Re: [PATCH v5] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-11-08 Thread Punit Agrawal

Borislav Petkov  writes:

> On Fri, Nov 06, 2020 at 02:36:46PM +0900, Punit Agrawal wrote:
>> > diff --git a/drivers/firmware/efi/cper-x86.c 
>> > b/drivers/firmware/efi/cper-x86.c
>> > index 2531de49f56c..438ed9eff6d0 100644
>> > --- a/drivers/firmware/efi/cper-x86.c
>> > +++ b/drivers/firmware/efi/cper-x86.c
>> > @@ -2,6 +2,7 @@
>> >  // Copyright (C) 2018, Advanced Micro Devices, Inc.
>> >  
>> >  #include 
>> > +#include 
>> 
>> Did you mean to include ?
>
> Why?

Because arch_apei_report_x86_error() used in the patch is defined
there. The indirect include works but pulls in additional definitions
not needed by the patch.

Do you prefer the more generic include?

Thanks,
Punit

Re: [PATCH v5] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-11-05 Thread Punit Agrawal

Hi Smita,

Smita Koralahalli  writes:

> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
> errors that occurred in a previous boot. The MCA errors in the BERT are
> reported using the x86 Processor Error Common Platform Error Record (CPER)
> format. Currently, the record prints out the raw MSR values and AMD relies
> on the raw record to provide MCA information.
>
> Extract the raw MSR values of MCA registers from the BERT and feed it into
> the standard mce_log() function through the existing x86/MCA RAS
> infrastructure. This will result in better decoding from the EDAC MCE
> decoder or the default notifier.
>
> The implementation is SMCA specific as the raw MCA register values are
> given in the register offset order of the MCAX address space.
>
> [ Fix a build breakage in patch v1. ]
> Reported-by: kernel test robot 
> Signed-off-by: Smita Koralahalli 

[...]

> diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
> index 2531de49f56c..438ed9eff6d0 100644
> --- a/drivers/firmware/efi/cper-x86.c
> +++ b/drivers/firmware/efi/cper-x86.c
> @@ -2,6 +2,7 @@
>  // Copyright (C) 2018, Advanced Micro Devices, Inc.
>  
>  #include 
> +#include 

Did you mean to include ?

>  
>  /*
>   * We don't need a "CPER_IA" prefix since these are all locally defined.
> @@ -347,9 +348,13 @@ void cper_print_proc_ia(const char *pfx, const struct 
> cper_sec_proc_ia *proc)
>  ctx_info->mm_reg_addr);
>   }
>  
> - printk("%sRegister Array:\n", newpfx);
> - print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, groupsize,
> -(ctx_info + 1), ctx_info->reg_arr_size, 0);
> + if (ctx_info->reg_ctx_type != CTX_TYPE_MSR ||
> + arch_apei_report_x86_error(ctx_info, proc->lapic_id)) {
> + printk("%sRegister Array:\n", newpfx);
> + print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16,
> +groupsize, (ctx_info + 1),
> +        ctx_info->reg_arr_size, 0);
> + }
>  
>   ctx_info = (struct cper_ia_proc_ctx *)((long)ctx_info + size);
>   }

With that addressed,

FWIW,

Reviewed-by: Punit Agrawal 

Thanks,
Punit

Re: [PATCH v4] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-09-28 Thread Punit Agrawal

Yazen Ghannam  writes:

> On Fri, Sep 25, 2020 at 09:54:06AM +0900, Punit Agrawal wrote:
>> Borislav Petkov  writes:
>> 
>> > On Thu, Sep 24, 2020 at 12:23:27PM -0500, Smita Koralahalli Channabasappa 
>> > wrote:
>> >> > Even though it's not defined in the UEFI spec, it doesn't mean a
>> >> > structure definition cannot be created.
>> >
>> > Created for what? That structure better have a big fat comment above it, 
>> > what
>> > firmware generates its layout.
>> 
>> Maybe I could've used a better choice of words - I meant to define a
>> structure with meaningful member names to replace the *(ptr + i)
>> accesses in the patch.
>> 
>> The requirement for documenting the record layout doesn't change -
>> whether using raw pointer arithmetic vs a structure definition.
>> 
>> >> > After all, the patch is relying on some guarantee of the meaning of
>> >> > the values and their ordering.
>> >
>> > AFAICT, this looks like an ad-hoc definition and the moment they change
>> > it in some future revision, that struct of yours becomes invalid so we'd
>> > need to add another one.
>> 
>> If there's no spec backing the current layout, then it'll indeed be an
>> ad-hoc definition of a structure in the kernel. But considering that
>> it's part of firmware / OS interface for an important part of the RAS
>> story I would hope that the code is based on a spec - having that
>> reference included would help maintainability.
>> 
>> Incompatible changes will indeed break the assumptions in the kernel and
>> code will need to be updated - regardless of the choice of kernel
>> implementation; pointer arithmetic, structure definition - ad-hoc or
>> spec provided.
>> 
>> Having versioning will allow running older kernels on newer hardware and
>> vice versa - but I don't see why that is important only when using a
>> structure based access.
>>
>
> There is no versioning option for the x86 context info structure in the
> UEFI spec, so I don't think there'd be a clean way to include version
> information.
>
> The format of the data in the context info is not totally ad-hoc, and it
> does follow the UEFI spec. The "Register Array" field is raw data. This
> may follow one of the predefined formats in the UEFI spec like the "X64
> Register State", etc. Or, in the case of MSR and Memory Mapped
> Registers, this is a raw dump of the registers starting from the address
> shown in the structure. The two values that can be changed are the
> starting address and the array size. These two together provide a window
> to the registers. The registers are fixed, so a single context info
> struture should include a single contiguous range of registers. Multiple
> context info structures can be provided to include registers from
> different, non-contiguous ranges.
>
> This patch is checking if an MSR context info structure lines up with
> the MCAX register space used on Scalable MCA systems. This register
> space is defined in the AMD Processor Programming Reference for various
> products. This is considered a hardware feature extension, so the
> existing register layout won't change though new registers may be added.
> A layout change would require moving to another register space which is
> what happened going from legacy MCA (starting at address 0x400) to MCAX
> (starting at address 0xC0002000) registers.

Thanks for the SMCA related background.
>
> The only two things firmware can change are from what address does the
> info start and where does the info end. So the implementation-specific
> details here are that currently the starting address is MCA_STATUS (in
> MCAX space) for a bank and the remaining info includes the other MCA
> registers for this bank.
>
> So I think the kernel can be strict with this format, i.e. the two
> variables match what we're looking for. This patch already has a check
> on the starting address. It should also include a check that "Register
> Array Size" is large enough to include all the registers we want to
> extract. If the format doesn't match, then we fall back to a raw dump
> of the data like we have today.
>
> Or the kernel can be more flexible and try to find the window of
> registers based on the starting address. I think this is really
> open-ended though.

I think I understand the hesitancy here if the firmware can arbitrarily
move the starting address. Though I hope that doesn't happen as it would
break the feature introduced in $SUBJECT.

The way I read the code / spec led me to believe that the MSR context
info records in the SMCA space are just encoding the layout of MC Bank
registers[0] and making it explicit can only help.

But Boris seems to think the current approach is good enough. So no
objections from me.

Thanks,
Punit

[0] AMD Processor Programming Reference for Family 17H, Sec 3.1.5

Re: [PATCH v4] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-09-24 Thread Punit Agrawal

Borislav Petkov  writes:

> On Thu, Sep 24, 2020 at 12:23:27PM -0500, Smita Koralahalli Channabasappa 
> wrote:
>> > Even though it's not defined in the UEFI spec, it doesn't mean a
>> > structure definition cannot be created.
>
> Created for what? That structure better have a big fat comment above it, what
> firmware generates its layout.

Maybe I could've used a better choice of words - I meant to define a
structure with meaningful member names to replace the *(ptr + i)
accesses in the patch.

The requirement for documenting the record layout doesn't change -
whether using raw pointer arithmetic vs a structure definition.

>> > After all, the patch is relying on some guarantee of the meaning of
>> > the values and their ordering.
>
> AFAICT, this looks like an ad-hoc definition and the moment they change
> it in some future revision, that struct of yours becomes invalid so we'd
> need to add another one.

If there's no spec backing the current layout, then it'll indeed be an
ad-hoc definition of a structure in the kernel. But considering that
it's part of firmware / OS interface for an important part of the RAS
story I would hope that the code is based on a spec - having that
reference included would help maintainability.

Incompatible changes will indeed break the assumptions in the kernel and
code will need to be updated - regardless of the choice of kernel
implementation; pointer arithmetic, structure definition - ad-hoc or
spec provided.

Having versioning will allow running older kernels on newer hardware and
vice versa - but I don't see why that is important only when using a
structure based access.

>
>> > If the patch is relying on the definitions in the SMCA spec it is a good
>
> Yes, what SMCA spec is that?
>
>> > idea to reference it here - both for review and providing relevant
>> > context for future developers.
>> 
>> Okay, I agree the structure definition will make the code less arbitrary
>> and provides relevant context compared to pointer arithmetic. I did not
>> think this way. I can try this out if no objections.
>
> Again, this struct better have "versioning" info because the moment your
> fw people change it in some future platform, this code needs touching
> again.
>
> It probably would need touching even with the offsets if those offsets
> change but at least not having it adhere to some slow-moving spec is
> probably easier in case they wanna add/change fields.
>
> So Smita, you probably should talk to fw people about how stable that
> layout at ctx_info + 1 is going to be wrt future platforms so that
> we make sure we only access the correct offsets, now and on future
> platforms.
>
> Thx.

Re: [PATCH v4] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-09-23 Thread Punit Agrawal

Borislav Petkov  writes:

> Smita,
>
> pls sync the time of the box where you create the patch:
>
>  Date: Fri,  4 Sep 2020 09:04:44 -0500
>
> but your mail headers have:
>
>  Received: from ... with mapi id 15.20.3370.019; Fri, 18 Sep 2020 14:49:12 
> +
>   ^^
>
> On Wed, Sep 23, 2020 at 07:07:17PM +0900, Punit Agrawal wrote:
>> I know Boris asked you to add the reason for the Reported-by, but
>> usually we don't track version differences in the committed patch.
>> 
>> Boris, can you confirm if you want the Reported-by to be retained?
>
> How else would you explain what the Reported-by: tag is for on a patch
> which adds a feature?

As Ard clarified, I was questioning the inclusion of the Reported-by:
tag in the patch itself. But I also don't have enough of a strong
opinion to obsess about it.

[ Aside: One interesting consequence of this though is that by the same
argument, changes resulting from comments on earlier versions are also
legitimate content for the final patch. Not saying I agree. ]

>
>> > + * The first expected register in the register layout of MCAX address 
>> > space.
>> > + * The address defined must match with the first MSR address extracted 
>> > from
>> > + * BERT which in SMCA systems is the bank's MCA_STATUS register.
>> > + *
>> > + * Note that the decoding of the raw MSR values in BERT is implementation
>> > + * specific and follows register offset order of MCAX address space.
>> > + */
>> > +#define MASK_MCA_STATUS 0xC0002001
>> 
>> The macro value is already defined in mce.h as
>> MSR_AMD64_SMCA_MC0_STATUS.  Is there any reason to not use it?
>
> Good point.
>
>> You can move the comment to where you check the status register.
>
> No need if he really wants to use the first MCi_STATUS address.
>
>> > +  m.apicid = lapic_id;
>> > +  m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
>> > +  m.status = *i_mce;
>> > +  m.addr = *(i_mce + 1);
>> > +  m.misc = *(i_mce + 2);
>> > +  /* Skipping MCA_CONFIG */
>> > +  m.ipid = *(i_mce + 4);
>> > +  m.synd = *(i_mce + 5);
>> 
>> Instead of using the raw pointer arithmetic, it is better to define a
>> structure for the MCA registers? Something like -
>> 
>> struct {
>> u64 addr;
>> u64 misc;
>> u64 config;
>> u64 ipid;
>> ...
>> }
>> 
>> Checking back, this was mentioned in the previous review comments as
>> well. Please address all comments before posting a new version - either
>> by following the suggestion or explaining why it is not a good idea.
>
> Well, that was addressed in his reply last time:
>
> https://lkml.kernel.org/r/a28aa613-8353-0052-31f6-34bc733ab...@amd.com

Oops. My bad - sorry I missed the response.

Copying the relevant comment here for discussion -

>>> The registers here are implementation specific and applies only for
>>> SMCA systems. So I have used pointer arithmetic as it is not defined
>>> in the spec.

Even though it's not defined in the UEFI spec, it doesn't mean a
structure definition cannot be created. After all, the patch is relying
on some guarantee of the meaning of the values and their ordering.

If the patch is relying on the definitions in the SMCA spec it is a good
idea to reference it here - both for review and providing relevant
context for future developers.

> You might've missed it because you weren't CCed directly.

Indeed, I missed it. Thanks for the pointer.

Cheers,
Punit

Re: [PATCH v4] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-09-23 Thread Punit Agrawal

Hi Smita,

A few comments below.

Smita Koralahalli  writes:

> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
> errors that occurred in a previous boot. The MCA errors in the BERT are
> reported using the x86 Processor Error Common Platform Error Record (CPER)
> format. Currently, the record prints out the raw MSR values and AMD relies
> on the raw record to provide MCA information.
>
> Extract the raw MSR values of MCA registers from the BERT and feed it into
> the standard mce_log() function through the existing x86/MCA RAS
> infrastructure. This will result in better decoding from the EDAC MCE
> decoder or the default notifier.
>
> The implementation is SMCA specific as the raw MCA register values are
> given in the register offset order of the MCAX address space.
>
> [ Build error in patch v1. ]
>
> Reported-by: kernel test robot 

I know Boris asked you to add the reason for the Reported-by, but
usually we don't track version differences in the committed patch.

Boris, can you confirm if you want the Reported-by to be retained?

> Signed-off-by: Smita Koralahalli 
> ---
> Link:
> https://lkml.kernel.org/r/20200903234531.162484-2-smita.koralahallichannabasa...@amd.com
>
> v4:
>   Included what kernel test robot reported.
>   Changed function name from apei_mce_report_x86_error ->
>   apei_smca_report_x86_error.
>   Added comment for MASK_MCA_STATUS definition.
>   Wrapped apei_smca_report_x86_error() with CONFIG_X86_MCE in
>   arch/x86/include/asm/mce.h
> v3:
>   Moved arch specific declarations from generic headers to arch
>   specific headers.
>   Cleaned additional declarations which are unnecessary.
>   Included the check for context type.
>   Added additional check to verify for appropriate MSR address in
>   the register layout.
> v2:
>   Fixed build error reported by kernel test robot.
>   Passed struct variable as function argument instead of entire struct.
> ---
>  arch/x86/include/asm/acpi.h | 11 
>  arch/x86/include/asm/mce.h  |  5 
>  arch/x86/kernel/acpi/apei.c |  5 
>  arch/x86/kernel/cpu/mce/apei.c  | 49 +
>  drivers/firmware/efi/cper-x86.c | 10 +--
>  5 files changed, 77 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index 6d2df1ee427b..65064d9f7fa6 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -159,6 +159,8 @@ static inline u64 x86_default_get_root_pointer(void)
>  extern int x86_acpi_numa_init(void);
>  #endif /* CONFIG_ACPI_NUMA */
>  
> +struct cper_ia_proc_ctx;
> +
>  #ifdef CONFIG_ACPI_APEI
>  static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
>  {
> @@ -177,6 +179,15 @@ static inline pgprot_t 
> arch_apei_get_mem_attribute(phys_addr_t addr)
>*/
>   return PAGE_KERNEL_NOENC;
>  }
> +
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> +u64 lapic_id);
> +#else
> +static inline int arch_apei_report_x86_error(struct cper_ia_proc_ctx 
> *ctx_info,
> +  u64 lapic_id)
> +{
> + return -EINVAL;
> +}
>  #endif
>  
>  #define ACPI_TABLE_UPGRADE_MAX_PHYS (max_low_pfn_mapped << PAGE_SHIFT)
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 109af5c7f515..d07bd635acfd 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -173,17 +173,22 @@ extern void mce_unregister_decode_chain(struct 
> notifier_block *nb);
>  #include 
>  
>  extern int mce_p5_enabled;
> +struct cper_ia_proc_ctx;
>  
>  #ifdef CONFIG_X86_MCE
>  int mcheck_init(void);
>  void mcheck_cpu_init(struct cpuinfo_x86 *c);
>  void mcheck_cpu_clear(struct cpuinfo_x86 *c);
>  void mcheck_vendor_init_severity(void);
> +int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> +u64 lapic_id);
>  #else
>  static inline int mcheck_init(void) { return 0; }
>  static inline void mcheck_cpu_init(struct cpuinfo_x86 *c) {}
>  static inline void mcheck_cpu_clear(struct cpuinfo_x86 *c) {}
>  static inline void mcheck_vendor_init_severity(void) {}
> +static inline int apei_smca_report_x86_error(struct cper_ia_proc_ctx 
> *ctx_info,
> +  u64 lapic_id) { return -EINVAL; }
>  #endif
>  
>  #ifdef CONFIG_X86_ANCIENT_MCE
> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> index c22fb55abcfd..0916f00a992e 100644
> --- a/arch/x86/kernel/acpi/apei.c
> +++ b/arch/x86/kernel/acpi/apei.c
> @@ -43,3 +43,8 @@ void arch_apei_report_mem_error(int sev, struct 
> cper_sec_mem_err *mem_err)
>   apei_mce_report_mem_error(sev, mem_err);
>  #endif
>  }
> +
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 
> lapic_id)
> +{
> + return apei_smca_report_x86_error(ctx_info, lapic_id);
> +}
> diff --git

Re: [PATCH v14 5/5] remoteproc: Add initial zynqmp R5 remoteproc driver

2020-09-21 Thread Punit Agrawal

Ben Levinsky  writes:

> Hi All,
>
>> -Original Message-
>> From: Wendy Liang 
>> Sent: Friday, September 18, 2020 6:53 PM
>> To: Michael Auchter 
>> Cc: Ben Levinsky ; punit1.agra...@toshiba.co.jp;
>> devicet...@vger.kernel.org; linux-remotep...@vger.kernel.org; linux-
>> ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org
>> Subject: Re: RE: RE: [PATCH v14 5/5] remoteproc: Add initial zynqmp R5
>> remoteproc driver
>> 
>> HI Michael, Ben, Punit,
>> 
>> On Fri, Sep 18, 2020 at 12:08 PM Michael Auchter 
>> wrote:
>> >
>> > Hey Ben,
>> >
>> > On Fri, Sep 18, 2020 at 06:01:19PM +, Ben Levinsky wrote:
>> > > Hi Michael, Punit,
>> > >
>> > > > -Original Message-
>> > > > From: Michael Auchter 
>> > > > Sent: Friday, September 18, 2020 9:07 AM
>> > > > To: Ben Levinsky 
>> > > > Cc: devicet...@vger.kernel.org; linux-remotep...@vger.kernel.org;
>> linux-
>> > > > ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org
>> > > > Subject: Re: RE: [PATCH v14 5/5] remoteproc: Add initial zynqmp R5
>> > > > remoteproc driver
>> > > >
>> > > > On Thu, Sep 17, 2020 at 10:50:42PM +, Ben Levinsky wrote:

[...]

>> > > A suggestion that might clean up the driver so that the whole
>> > > rpu_mode, tcm_mode configuration can be simplified and pulled out of
>> > > the driver:
>> > > - as Punit suggested, remove the lockstep-mode property
>> > > - the zynqmp_remoteproc_r5 driver ONLY loads firmware and does
>> start/stop.
>> > > - the zynqmp_remoteproc_r5 driver does not configure and memory
>> regions or the RPU. Let the Xilinx firmware sysfs interface handle this.
>> >
>> > I don't think this is a good approach.

> [Ben Levinsky] ok, noted. Can keep the configuration but still as
> wendy said just have lockstep property to denote lockstep mode in RPU
> and otherwise be split, for simplicity?

That would be a better approach than the current proposal.

>> [Wendy] The TCMs are presented differently in the system depending on
>> if RPU is in
>> lockstep or split mode.
>> 
>> Not sure if it is allowed to list TCMs registers properties for both
>> split mode and lockstep
>> mode in the same device node.
>> 
>> Even though, driver can have this information in the code, but I feel
>> the device tree is a
>> better place for this information.
>> And also for predefined shared memories, you will need to know the RPU
>> op mode ahead,
>> so that you can specify which shared memories belong to which RPU.
>> 
>> To dynamic setup the RPU mode, besides sysfs, setup, if remoteproc can
>> support
>> device tree overlay, the RPUs can be described with dtbo and loaded at
>> runtime.
>> 
>> Just want to understand the case which needs to set  RPU mode at runtime?
>> I think testing can be one case.
> [Ben Levinsky] for testing, so far it has been r50/1 split and r5
> lockstep

>> Best Regards,
>> Wendy
>> 
>> > - How will someone know to configure the RPU mode and TCM mode via
>> sysfs?
>> > - What happens when someone changes the RPU mode after remoteproc
>> has
>> >   already booted some firmware on it?
>> > - What if the kernel is the one booting the R5, not the user?
>> >
>> > Split vs. lockstep, IMO, needs to be specified as part of the device
>> > tree, and this driver needs to handle configuring the RPU mode and TCM
>> > modes appropriately.
>> >

Typically, the device tree is expected to describe the hardware to the
kernel rather than telling it what the hardware should look like. More
below.

> [Ben Levinsky] Ok, as Wendy suggested would instead the presence of a
> "lockstep=mode" property indicate lockstep mode and otherwise imply
> split mode?

>> > Split vs. lockstep already necessitates different entries in the device
>> > tree:
>> > - In the binding, each core references its TCMs via the
>> >   meta-memory-regions phandles, and the referenced nodes necessarily
>> >   encode this size. In split mode, each core has access to 64K of
>> >   TCMA/TCMB, while in lockstep R5 0 has access to 128K of TCMA/TCMB. So,
>> >   the "xlnx,tcm" nodes' reg entries need to differ between lockstep and
>> >   split.

But considering the dependency between split/lockstep mode and available
memory sizes as described here it maybe OK to have the firmware (via DT)
specify the configured mode. Though IMO it is overloading the device
tree functionality (not the first time) because that's the hammer we've
got.

Even in this scenario, ideally it would be the boot firmware's
responsibility to configure the RPU in the desired mode and communicate
the mode to the kernel. The driver can use the mode information to
verify that the system is in the expected state during probe and error
out if not.

Though taking this approach will not help the "testing" usecase
mentioned by Wendy.

>> > - In lockstep mode, it does not make sense to have both r5@0 and r5@1
>> >   child nodes: only r5@0 makes sense. Though, I just realized that I
>> >   think this driver will currently permit that, and register two
>> >   remoteprocs even in

Re: [PATCH v14 4/5] dt-bindings: remoteproc: Add documentation for ZynqMP R5 rproc bindings

2020-09-18 Thread Punit Agrawal

Hi Ben,

One query below -

Ben Levinsky  writes:

> Add binding for ZynqMP R5 OpenAMP.
>
> Represent the RPU domain resources in one device node. Each RPU
> processor is a subnode of the top RPU domain node.
>
> Signed-off-by: Jason Wu 
> Signed-off-by: Wendy Liang 
> Signed-off-by: Michal Simek 
>
> Signed-off-by: Ben Levinsky 
> ---

[...]

>  .../xilinx,zynqmp-r5-remoteproc.yaml  | 119 ++
>  1 file changed, 119 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/remoteproc/xilinx,zynqmp-r5-remoteproc.yaml
>
> diff --git 
> a/Documentation/devicetree/bindings/remoteproc/xilinx,zynqmp-r5-remoteproc.yaml
>  
> b/Documentation/devicetree/bindings/remoteproc/xilinx,zynqmp-r5-remoteproc.yaml
> new file mode 100644
> index ..cd2406b4dc24
> --- /dev/null
> +++ 
> b/Documentation/devicetree/bindings/remoteproc/xilinx,zynqmp-r5-remoteproc.yaml
> @@ -0,0 +1,119 @@
> +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: 
> "http://devicetree.org/schemas/remoteproc/xilinx,zynqmp-r5-remoteproc.yaml#;
> +$schema: "http://devicetree.org/meta-schemas/core.yaml#;
> +
> +title: Xilinx R5 remote processor controller bindings
> +
> +description:
> +  This document defines the binding for the remoteproc component that loads 
> and
> +  boots firmwares on the Xilinx Zynqmp and Versal family chipset.
> +
> +  Note that the Linux has global addressing view of the R5-related memory 
> (TCM)
> +  so the absolute address ranges are provided in TCM reg's.
> +maintainers:
> +  - Ed Mooring 
> +  - Ben Levinsky 
> +
> +properties:
> +  compatible:
> +const: "xlnx,zynqmp-r5-remoteproc-1.0"
> +
> +  lockstep-mode:
> +description:
> +  R5 core configuration (split is 0 or lock-step and 1)
> +maxItems: 1

Looking at the driver, it seems that it is possible to set the R5s to
operate in split or lock-step mode dynamically.

If so, the device tree shouldn't really contain this property. Wouldn't
it be better to give the users flexibility to choose the mode at run
time?

Thanks,
Punit

> +
> +  interrupts:
> +description:
> +  Interrupt mapping for remoteproc IPI. It is required if the
> +  user uses the remoteproc driver with the RPMsg kernel driver.
> +maxItems: 6
> +
> +  memory-region:
> +description:
> +  collection of memory carveouts used for elf-loading and inter-processor
> +  communication.
> +maxItems: 4
> +minItems: 4
> +  meta-memory-regions:
> +description:
> +  collection of memories that are not present in the top level memory
> +  nodes' mapping. For example, R5s' TCM banks. These banks are needed
> +  for R5 firmware meta data such as the R5 firmware's heap and stack
> +  pnode-id:
> +maxItems: 1
> +  mboxes:
> +maxItems: 2
> +  mbox-names:
> +maxItems: 2
> +
> +examples:
> +  - |
> + reserved-memory {
> +  #address-cells = <1>;
> +  #size-cells = <1>;
> +  ranges;
> +  elf_load: rproc@3ed00 {
> +   no-map;
> +   reg = <0x3ed0 0x4>;
> +  };
> +
> +  rpu0vdev0vring0: rpu0vdev0vring0@3ed4 {
> +   no-map;
> +   reg = <0x3ed4 0x4000>;
> +  };
> +  rpu0vdev0vring1: rpu0vdev0vring1@3ed44000 {
> +   no-map;
> +   reg = <0x3ed44000 0x4000>;
> +  };
> +  rpu0vdev0buffer: rpu0vdev0buffer@3ed48000 {
> +   no-map;
> +   reg = <0x3ed48000 0x10>;
> +  };
> +
> + };
> +
> + /*
> +  * Below nodes are required if using TCM to load R5 firmware
> +  * if not, then either do not provide nodes are label as disabled in
> +  * status property
> +  */
> + tcm0a: tcm_0a@ffe0 {
> + reg = <0xffe0 0x1>;
> + pnode-id = <0xf>;
> + no-map;
> + status = "okay";
> + phandle = <0x40>;
> + compatible = "xlnx,tcm";
> + };
> + tcm0b: tcm_1a@ffe2 {
> + reg = <0xffe2 0x1>;
> + pnode-id = <0x10>;
> + no-map;
> + status = "okay";
> + compatible = "xlnx,tcm";
> + phandle = <0x41>;
> + };
> +
> + rpu {
> +  compatible = "xlnx,zynqmp-r5-remoteproc-1.0";
> +  #address-cells = <1>;
> +  #size-cells = <1>;
> +  ranges;
> +  lockstep-mode = <1>;
> +  r5_0 {
> +   ranges;
> +   #address-cells = <1>;
> +   #size-cells = <1>;
> +   memory-region = <_load>,
> +   <>,
> +   <>,
> +   <>;
> +   meta-memory-regions = <_0a>, <_0b>;
> +   pnode-id = <0x7>;
> +  };
> + };
> +
> +...

Re: [PATCH 1/3] firmware: xilinx: Add validation check for IOCTL

2020-09-17 Thread Punit Agrawal

Hi Amit,

Amit Sunil Dhamne  writes:

> From: Tejas Patel 
>
> Validate IOCTL ID for ZynqMP and Versal before calling
> zynqmp_pm_invoke_fn().
>
> Signed-off-by: Tejas Patel 
> Signed-off-by: Amit Sunil Dhamne 
> ---
>  drivers/firmware/xilinx/zynqmp.c | 117 
> +++
>  1 file changed, 95 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/firmware/xilinx/zynqmp.c 
> b/drivers/firmware/xilinx/zynqmp.c
> index 8d1ff24..8fe0912 100644
> --- a/drivers/firmware/xilinx/zynqmp.c
> +++ b/drivers/firmware/xilinx/zynqmp.c
> @@ -514,6 +514,89 @@ int zynqmp_pm_clock_getparent(u32 clock_id, u32 
> *parent_id)
>  EXPORT_SYMBOL_GPL(zynqmp_pm_clock_getparent);
>
>  /**
> + * versal_is_valid_ioctl() - Check whether IOCTL ID is valid or not for 
> versal
> + * @ioctl_id:  IOCTL ID
> + *
> + * Return: 1 if IOCTL is valid else 0
> + */
> +static inline int versal_is_valid_ioctl(u32 ioctl_id)
> +{
> +   switch (ioctl_id) {
> +   case IOCTL_SD_DLL_RESET:
> +   case IOCTL_SET_SD_TAPDELAY:
> +   case IOCTL_SET_PLL_FRAC_MODE:
> +   case IOCTL_GET_PLL_FRAC_MODE:
> +   case IOCTL_SET_PLL_FRAC_DATA:
> +   case IOCTL_GET_PLL_FRAC_DATA:
> +   case IOCTL_WRITE_GGS:
> +   case IOCTL_READ_GGS:
> +   case IOCTL_WRITE_PGGS:
> +   case IOCTL_READ_PGGS:
> +   case IOCTL_SET_BOOT_HEALTH_STATUS:
> +   return 1;
> +   default:
> +   return 0;
> +   }
> +}
> +
> +/**
> + * zynqmp_is_valid_ioctl() - Check whether IOCTL ID is valid or not
> + * @ioctl_id:  IOCTL ID
> + *
> + * Return: 1 if IOCTL is valid else 0
> + */
> +static inline int zynqmp_is_valid_ioctl(u32 ioctl_id)
> +{
> +   switch (ioctl_id) {
> +   case IOCTL_SD_DLL_RESET:
> +   case IOCTL_SET_SD_TAPDELAY:
> +   case IOCTL_SET_PLL_FRAC_MODE:
> +   case IOCTL_GET_PLL_FRAC_MODE:
> +   case IOCTL_SET_PLL_FRAC_DATA:
> +   case IOCTL_GET_PLL_FRAC_DATA:
> +   case IOCTL_WRITE_GGS:
> +   case IOCTL_READ_GGS:
> +   case IOCTL_WRITE_PGGS:
> +   case IOCTL_READ_PGGS:
> +   case IOCTL_SET_BOOT_HEALTH_STATUS:

Other than the additional IOCTL being added in the next patch, this list
is common between the two SoCs this driver runs on.

To avoid duplication , it would be better to refactor into common and
versal specific checks with the latter using the common function for
shared IOCTLs. e.g., 

common_is_valid_ioctl() containing the common parts

versal_is_valid_ioctl() using the above function

One more comment below.

> +   return 1;
> +   default:
> +   return 0;
> +   }
> +}
> +
> +/**
> + * zynqmp_pm_ioctl() - PM IOCTL API for device control and configs
> + * @node_id:   Node ID of the device
> + * @ioctl_id:  ID of the requested IOCTL
> + * @arg1:  Argument 1 to requested IOCTL call
> + * @arg2:  Argument 2 to requested IOCTL call
> + * @out:   Returned output value
> + *
> + * This function calls IOCTL to firmware for device control and 
> configuration.
> + *
> + * Return: Returns status, either success or error+reason
> + */
> +static int zynqmp_pm_ioctl(u32 node_id, u32 ioctl_id, u32 arg1, u32 arg2,
> +  u32 *out)
> +{
> +   struct device_node *np;
> +
> +   np = of_find_compatible_node(NULL, NULL, "xlnx,versal");

The driver is going to be probed either on a zyncmp or a versal based
platform.

You may want to consider stashing the result of the current platform at
driver probe time and avoid searching through the device tree for every
call.

Thanks,
Punit

> +   if (np) {
> +   if (!versal_is_valid_ioctl(ioctl_id))
> +   return -EINVAL;
> +   } else {
> +   if (!zynqmp_is_valid_ioctl(ioctl_id))
> +   return -EINVAL;
> +   }
> +   of_node_put(np);
> +
> +   return zynqmp_pm_invoke_fn(PM_IOCTL, node_id, ioctl_id, arg1, arg2,
> +  out);
> +}
> +
> +/**
>   * zynqmp_pm_set_pll_frac_mode() - PM API for set PLL mode
>   *
>   * @clk_id:PLL clock ID
> @@ -525,8 +608,7 @@ EXPORT_SYMBOL_GPL(zynqmp_pm_clock_getparent);
>   */
>  int zynqmp_pm_set_pll_frac_mode(u32 clk_id, u32 mode)
>  {
> -   return zynqmp_pm_invoke_fn(PM_IOCTL, 0, IOCTL_SET_PLL_FRAC_MODE,
> -  clk_id, mode, NULL);
> +   return zynqmp_pm_ioctl(0, IOCTL_SET_PLL_FRAC_MODE, clk_id, mode, 
> NULL);
>  }
>  EXPORT_SYMBOL_GPL(zynqmp_pm_set_pll_frac_mode);
>
> @@ -542,8 +624,7 @@ EXPORT_SYMBOL_GPL(zynqmp_pm_set_pll_frac_mode);
>   */
>  int zynqmp_pm_get_pll_frac_mode(u32 clk_id, u32 *mode)
>  {
> -   return zynqmp_pm_invoke_fn(PM_IOCTL, 0, IOCTL_GET_PLL_FRAC_MODE,
> -  clk_id, 0, mode);
> +   return zynqmp_pm_ioctl(0, IOCTL_GET_PLL_FRAC_MODE, clk_id, 0, mode);
>  }
>  EXPORT_SYMBOL_GPL(zynqmp_pm_get_pll_frac_mode);
>
> @@ -560,8 +641,7 @@ EXPORT_SYMBOL_GPL(zynqmp_pm_get_pll_frac_mode);
>   */
>  int

Re: [PATCH v13 5/5] remoteproc: Add initial zynqmp R5 remoteproc driver

2020-09-10 Thread Punit Agrawal

Hi Ben,

Thanks for fixing the commit ordering issue. Some additional comments
below.

Ben Levinsky  writes:

> R5 is included in Xilinx Zynq UltraScale MPSoC so by adding this
> remotproc driver, we can boot the R5 sub-system in different 2
> configurations: split or lock-step.

Suggest following edit for clarity -

"A Cortex R5 is included in the Xilinx Zync UltraScale MPSoC. Adding a
remoteproc driver, it's possible to boot the R5 sub-system in two
different configurations -

* split
* lock-step
"
>
> The Xilinx R5 Remoteproc Driver boots the R5's via calls to the Xilinx
> Platform Management Unit that handles the R5 configuration, memory access
> and R5 lifecycle management. The interface to this manager is done in this
> driver via zynqmp_pm_* function calls.
>
> Signed-off-by: Wendy Liang 
> Signed-off-by: Michal Simek 
> Signed-off-by: Ed Mooring 
> Signed-off-by: Jason Wu 
> Signed-off-by: Ben Levinsky 
> ---

The patch fails to apply on v5.9-rc3. What kernel are the patches based
on? Please rebase and test before sending another update.

[...]

>  drivers/remoteproc/Kconfig|  10 +
>  drivers/remoteproc/Makefile   |   1 +
>  drivers/remoteproc/zynqmp_r5_remoteproc.c | 898 ++
>  3 files changed, 909 insertions(+)
>  create mode 100644 drivers/remoteproc/zynqmp_r5_remoteproc.c
>
> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> index c4d1731295eb..dd9ed45654e0 100644
> --- a/drivers/remoteproc/Kconfig
> +++ b/drivers/remoteproc/Kconfig
> @@ -249,6 +249,16 @@ config STM32_RPROC
>  
> This can be either built-in or a loadable module.
>  
> +config ZYNQMP_R5_REMOTEPROC
> + tristate "ZynqMP_R5 remoteproc support"
> + depends on ARM64 && PM && ARCH_ZYNQMP

The ARM64 dependency seems superfluous. Please drop it.

> + select RPMSG_VIRTIO
> + select MAILBOX

ZYNQMP_IPI_MBOX below implies MAILBOX. I think it is safe to drop
MAILBOX as well.

> + select ZYNQMP_IPI_MBOX
> + help
> +   Say y or m here to support ZynqMP R5 remote processors via the remote
> +   processor framework.
> +
>  endif # REMOTEPROC
>  
>  endmenu
> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> index e8b886e511f0..04d1c95d06d7 100644
> --- a/drivers/remoteproc/Makefile
> +++ b/drivers/remoteproc/Makefile
> @@ -28,5 +28,6 @@ obj-$(CONFIG_QCOM_WCNSS_PIL)+= 
> qcom_wcnss_pil.o
>  qcom_wcnss_pil-y += qcom_wcnss.o
>  qcom_wcnss_pil-y += qcom_wcnss_iris.o
>  obj-$(CONFIG_ST_REMOTEPROC)  += st_remoteproc.o
> +obj-$(CONFIG_ZYNQMP_R5_REMOTEPROC)   += zynqmp_r5_remoteproc.o
>  obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o
>  obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o
> diff --git a/drivers/remoteproc/zynqmp_r5_remoteproc.c 
> b/drivers/remoteproc/zynqmp_r5_remoteproc.c
> new file mode 100644
> index ..4fc4098ae1ea
> --- /dev/null
> +++ b/drivers/remoteproc/zynqmp_r5_remoteproc.c
> @@ -0,0 +1,898 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Zynq R5 Remote Processor driver
> + *
> + * Based on origin OMAP and Zynq Remote Processor driver
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "remoteproc_internal.h"
> +
> +#define MAX_RPROCS   2 /* Support up to 2 RPU */
> +#define MAX_MEM_PNODES   4 /* Max power nodes for one RPU memory 
> instance */
> +
> +#define DEFAULT_FIRMWARE_NAME"rproc-rpu-fw"

This define doesn't seem to be used in the driver. Please drop it.

> +#define BANK_LIST_PROP "meta-memory-regions"
> +
> +/* PM proc states */
> +#define PM_PROC_STATE_ACTIVE 1U

Same for this macro as well.

> +
> +/* IPI buffer MAX length */
> +#define IPI_BUF_LEN_MAX  32U
> +/* RX mailbox client buffer max length */
> +#define RX_MBOX_CLIENT_BUF_MAX   (IPI_BUF_LEN_MAX + \
> +  sizeof(struct zynqmp_ipi_message))
> +
> +/**
> + * struct zynqmp_r5_mem - zynqmp rpu memory data
> + * @pnode_id: TCM power domain ids
> + * @res: memory resource
> + * @node: list node
> + */
> +struct zynqmp_r5_mem {
> + u32 pnode_id[MAX_MEM_PNODES];
> + struct resource res;
> + struct list_head node;
> +};
> +
> +/**
> + * struct zynqmp_r5_pdata - zynqmp rpu remote processor private data
> + * @dev: device of RPU instance
> + * @rproc: rproc handle
> + * @pnode_id: RPU CPU power domain id
> + * @is_r5_mode_set: indicate if r5 operation mode is set
> + * @tx_mc: tx mailbox client
> + * @rx_mc: rx mailbox client
> + * @tx_chan: tx mailbox channel
> + * @rx_chan: rx mailbox channel
> + * @mbox_work: mbox_work for the RPU remoteproc
> + * @tx_mc_skbs: socket buffers for tx mailbox client
> + * @rx_mc_buf: rx mailbox client buffer to save the rx message
> + */
> +struct zynqmp_r5_pdata {
> + unsigned char

Re: [PATCH v12 3/5] firmware: xilinx: Add RPU configuration APIs

2020-09-03 Thread Punit Agrawal

Hi Ben,

Thanks for addressing the comments on the previous version. One comment
below.

Ben Levinsky  writes:

> This patch adds APIs to access to configure RPU and its
> processor-specific memory.
>
> That is query the run-time mode of RPU as either split or lockstep as well
> as API to set this mode. In addition add APIs to access configuration of
> the RPUs' tightly coupled memory (TCM).
>
> Signed-off-by: Ben Levinsky 
> ---
> v3:
> - add xilinx-related platform mgmt fn's instead of wrapping around
>   function pointer in xilinx eemi ops struct
> v4:
> - add default values for enums
> v9:
> - update commit message
> - for zynqmp_pm_set_tcm_config and zynqmp_pm_get_rpu_mode update docs for
>   expected output, arguments as well removing unused args
> - remove unused fn zynqmp_pm_get_node_status
> v11:
> - update usage of zynqmp_pm_get_rpu_mode to return rpu mode in enum
> - update zynqmp_pm_set_tcm_config and zynqmp_pm_set_rpu_mode arguments to 
> remove unused args
> v12:
> - in drivers/firmware/zynqmp.c, update zynqmp_pm_set_rpu_mode so rpu_mode
>   is only set if no error
> - update args for zynqmp_pm_set_rpu_mode, zynqmp_pm_set_tcm_config fn arg's to
>   reflect what is expected in the function and the usage in
>   zynqmp_r5_remoteproc accordingly
> ---
>  drivers/firmware/xilinx/zynqmp.c | 60 
>  include/linux/firmware/xlnx-zynqmp.h | 18 +
>  2 files changed, 78 insertions(+)
>
> diff --git a/drivers/firmware/xilinx/zynqmp.c 
> b/drivers/firmware/xilinx/zynqmp.c
> index a966ee956573..916a0b15ab33 100644
> --- a/drivers/firmware/xilinx/zynqmp.c
> +++ b/drivers/firmware/xilinx/zynqmp.c
> @@ -846,6 +846,66 @@ int zynqmp_pm_release_node(const u32 node)
>  }
>  EXPORT_SYMBOL_GPL(zynqmp_pm_release_node);
>  
> +/**
> + * zynqmp_pm_get_rpu_mode() - Get RPU mode
> + * @node_id: Node ID of the device
> + * @rpu_mode:return by reference value
> + *   either split or lockstep
> + *
> + * Return:   return 0 on success or error+reason.
> + *   if success, then  rpu_mode will be set
> + *   to current rpu mode.
> + */
> +int zynqmp_pm_get_rpu_mode(u32 node_id, enum rpu_oper_mode *rpu_mode)
> +{
> + u32 ret_payload[PAYLOAD_ARG_CNT];
> + int ret;
> +
> + ret = zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +   IOCTL_GET_RPU_OPER_MODE, 0, 0, ret_payload);
> +
> + /* only set rpu_mode if no error */
> + *rpu_mode = ret_payload[0];

The comment and the statement do not match. rpu_mode is being
un-conditionally set even if there is an error.

It's not clear which is correct - the code or the comment?

Other than that, the rest of the patch looks fine.

Thanks,
Punit

> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_get_rpu_mode);
> +
> +/**
> + * zynqmp_pm_set_rpu_mode() - Set RPU mode
> + * @node_id: Node ID of the device
> + * @rpu_mode:Argument 1 to requested IOCTL call. either split or 
> lockstep
> + *
> + *   This function is used to set RPU mode to split or
> + *   lockstep
> + *
> + * Return:   Returns status, either success or error+reason
> + */
> +int zynqmp_pm_set_rpu_mode(u32 node_id, enum rpu_oper_mode rpu_mode)
> +{
> + return zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +IOCTL_SET_RPU_OPER_MODE, (u32)rpu_mode,
> +0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_set_rpu_mode);
> +
> +/**
> + * zynqmp_pm_set_tcm_config - configure TCM
> + * @tcm_mode:Argument 1 to requested IOCTL call
> + *  either PM_RPU_TCM_COMB or PM_RPU_TCM_SPLIT
> + *
> + * This function is used to set RPU mode to split or combined
> + *
> + * Return: status: 0 for success, else failure
> + */
> +int zynqmp_pm_set_tcm_config(u32 node_id, enum rpu_tcm_comb tcm_mode)
> +{
> + return zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +IOCTL_TCM_COMB_CONFIG, (u32)tcm_mode, 0,
> +NULL);
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_set_tcm_config);
> +
>  /**
>   * zynqmp_pm_force_pwrdwn - PM call to request for another PU or subsystem to
>   * be powered down forcefully
> diff --git a/include/linux/firmware/xlnx-zynqmp.h 
> b/include/linux/firmware/xlnx-zynqmp.h
> index 6241c5ac51b3..79aa2fcbcd54 100644
> --- a/include/linux/firmware/xlnx-zynqmp.h
> +++ b/include/linux/firmware/xlnx-zynqmp.h
> @@ -385,6 +385,9 @@ int zynqmp_pm_request_wake(const u32 node,
>  const bool set_addr,
>  const u64 address,
>  const enum zynqmp_pm_request_ack ack);
> +int zynqmp_pm_get_rpu_mode(u32 node_id, enum rpu_oper_mode *rpu_mode);
> +int zynqmp_pm_set_rpu_mode(u32 node_id, u32 arg1);
> +int zynqmp_pm_set_tcm_config(u32 node_id, u32 arg1);
>  #else
>  static inline struct zynqmp_eemi_ops *zynqmp_pm_get_eemi_ops(void)
>  {
> @@ -549,6 +552,21 @@ static inline int zynqmp_pm_request_wake(const u32

Re: [PATCH v11 5/5] remoteproc: Add initial zynqmp R5 remoteproc driver

2020-09-03 Thread Punit Agrawal

Ben Levinsky  writes:

> R5 is included in Xilinx Zynq UltraScale MPSoC so by adding this
> remotproc driver, we can boot the R5 sub-system in different 2
> configurations: split or lock-step.
>
> The Xilinx R5 Remoteproc Driver boots the R5's via calls to the Xilinx
> Platform Management Unit that handles the R5 configuration, memory access
> and R5 lifecycle management. The interface to this manager is done in this
> driver via zynqmp_pm_* function calls.
>
> Signed-off-by: Ben Levinsky 
> Signed-off-by: Wendy Liang 
> Signed-off-by: Michal Simek 
> Signed-off-by: Ed Mooring 
> Signed-off-by: Jason Wu 

The Signed-off-by chain looks wrong here. As the submitter, your tag
should be last in the chain.

See Documentation/process/submitting-patches.rst for more details.

Thanks,
Punit


[...]

Re: [PATCH v11 3/5] firmware: xilinx: Add RPU configuration APIs

2020-09-03 Thread Punit Agrawal

Hi Ben,

Noticed some issues while going through the code. A couple of queries
below.

Ben Levinsky  writes:

> This patch adds APIs to access to configure RPU and its
> processor-specific memory.
>
> That is query the run-time mode of RPU as either split or lockstep as well
> as API to set this mode. In addition add APIs to access configuration of
> the RPUs' tightly coupled memory (TCM).
>
> Signed-off-by: Ben Levinsky 
> ---
> v3:
> - add xilinx-related platform mgmt fn's instead of wrapping around
>   function pointer in xilinx eemi ops struct
> v4:
> - add default values for enums
> v9:
> - update commit message
> - for zynqmp_pm_set_tcm_config and zynqmp_pm_get_rpu_mode update docs for
>   expected output, arguments as well removing unused args
> - remove unused fn zynqmp_pm_get_node_status
> v11:
> - update usage of zynqmp_pm_get_rpu_mode to return rpu mode in enum
> - update zynqmp_pm_set_tcm_config and zynqmp_pm_set_rpu_mode arguments to 
> remove unused args
> ---
>  drivers/firmware/xilinx/zynqmp.c | 59 
>  include/linux/firmware/xlnx-zynqmp.h | 18 +
>  2 files changed, 77 insertions(+)
>
> diff --git a/drivers/firmware/xilinx/zynqmp.c 
> b/drivers/firmware/xilinx/zynqmp.c
> index a966ee956573..807e404589f8 100644
> --- a/drivers/firmware/xilinx/zynqmp.c
> +++ b/drivers/firmware/xilinx/zynqmp.c
> @@ -846,6 +846,65 @@ int zynqmp_pm_release_node(const u32 node)
>  }
>  EXPORT_SYMBOL_GPL(zynqmp_pm_release_node);
>  
> +/**
> + * zynqmp_pm_get_rpu_mode() - Get RPU mode
> + * @node_id: Node ID of the device
> + * @rpu_mode:return by reference value
> + *   either split or lockstep
> + *
> + * Return:   return 0 on success or error+reason.
> + *   if success, then  rpu_mode will be set
> + *   to current rpu mode.
> + */
> +int zynqmp_pm_get_rpu_mode(u32 node_id, enum rpu_oper_mode *rpu_mode)
> +{
> + u32 ret_payload[PAYLOAD_ARG_CNT];
> + int ret;
> +
> + ret = zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +   IOCTL_GET_RPU_OPER_MODE, 0, 0, ret_payload);
> + if (ret < 0)
> + (void)rpu_mode;

There seems to be something missing from this statement. What is
expected from "(void)rpu_mode" here.

> + else
> + *rpu_mode = ret_payload[0];
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_get_rpu_mode);
> +
> +/**
> + * zynqmp_pm_set_rpu_mode() - Set RPU mode
> + * @node_id: Node ID of the device
> + * @arg1:Argument 1 to requested IOCTL call. This is expeted to
> + *  to be value from enum rpu_oper_mode
> + *
> + * This function is used to set RPU mode to split or lockstep
> + *
> + * Return: Returns status, either success or error+reason
> + */
> +int zynqmp_pm_set_rpu_mode(u32 node_id, u32 arg1)

If arg1 is expected to be "enum rpu_oper_mode" please have the function
argument reflect that and do any conversion needed inside the function.

> +{
> + return zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +IOCTL_SET_RPU_OPER_MODE, arg1, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_set_rpu_mode);
> +
> +/**
> + * zynqmp_pm_set_tcm_config - configure TCM
> + * @arg1:Argument 1 to requested IOCTL call
> + *  either PM_RPU_TCM_COMB or PM_RPU_TCM_SPLIT
> + *
> + * This function is used to set RPU mode to split or combined
> + *
> + * Return: status: 0 for success, else failure
> + */
> +int zynqmp_pm_set_tcm_config(u32 node_id, u32 arg1)

Same comment as above - with the appropriate enum ofcourse.

Thanks,
Punit

> +{
> + return zynqmp_pm_invoke_fn(PM_IOCTL, node_id,
> +IOCTL_TCM_COMB_CONFIG, arg1, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(zynqmp_pm_set_tcm_config);
> +
>  /**
>   * zynqmp_pm_force_pwrdwn - PM call to request for another PU or subsystem to
>   * be powered down forcefully

[...]

Re: [PATCH v2 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-09-03 Thread Punit Agrawal

Hi Smita,

Smita Koralahalli Channabasappa  writes:

> On 8/31/20 12:05 AM, Punit Agrawal wrote:
>
>> Hi Smita,
>>
>> A couple of comments below -
>>
>> Smita Koralahalli  writes:
>>
>> [...]
>>
>>
>>> diff --git a/drivers/firmware/efi/cper-x86.c 
>>> b/drivers/firmware/efi/cper-x86.c
>>> index 2531de49f56c..374b8e18552a 100644
>>> --- a/drivers/firmware/efi/cper-x86.c
>>> +++ b/drivers/firmware/efi/cper-x86.c
>>> @@ -1,7 +1,7 @@
>>>   // SPDX-License-Identifier: GPL-2.0
>>>   // Copyright (C) 2018, Advanced Micro Devices, Inc.
>>>   -#include 
>> Why is the include dropped? AFAICT, the definitions from there are still
>> being used after this patch.
>
> Dropped because  already includes 

Generally, you want to follow the rule that if a declaration from a
header file is being used, it should show up in the includes. The same
applies to both source as well as header files.

It doesn't matter if another include in the source file in turn ends up
including the same header again; the #ifdef guards are there to prevent
duplicate declarations.

The rationale is that if future changes remove the usage of
, the C file can still be compiled after dropping the
include; there should be no need to then re-introduce  at
that point.

Hope that makes sense.

Thanks,
Punit

>>> +#include 
>
> [...]
>
>>> diff --git a/include/acpi/apei.h b/include/acpi/apei.h
>>> index 680f80960c3d..44d4d08acce0 100644
>>> --- a/include/acpi/apei.h
>>> +++ b/include/acpi/apei.h
>>> @@ -33,8 +33,15 @@ extern bool ghes_disable;
>>> #ifdef CONFIG_ACPI_APEI
>>>   void __init acpi_hest_init(void);
>>> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
>>> +  u64 lapic_id);
>>>   #else
>>>   static inline void acpi_hest_init(void) { return; }
>>> +static inline int arch_apei_report_x86_error(struct cper_ia_proc_ctx 
>>> *ctx_info,
>>> +u64 lapic_id)
>>> +{
>>> +   return -EINVAL;
>>> +}
>>>   #endif
>> Adding the declaration to this include violates the separation of
>> generic and architecture specific code.
>>
>> Can this be moved to the appropriate architecture specific header?
>> Perhaps arch/x86/include/asm/apei.h.
>
> Yes, I have fixed this and moved into arch/x86/include/asm/acpi.h.
>
>>>   typedef int (*apei_hest_func_t)(struct acpi_hest_header *hest_hdr, void 
>>> *data);
>>> @@ -51,6 +58,8 @@ int erst_clear(u64 record_id);
>>> int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr,
>>> void *data);
>>>   void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err 
>>> *mem_err);
>>> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
>>> +  u64 lapic_id);
>>
>> Why is the additional declaration needed?
>
> Will fix in the next revision.
>
> Thanks,
> Smita
>
> [...]

Re: [PATCH v2 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain

2020-08-30 Thread Punit Agrawal

Hi Smita,

A couple of comments below -

Smita Koralahalli  writes:

> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
> errors that occurred in a previous boot. The MCA errors in the BERT are
> reported using the x86 Processor Error Common Platform Error Record (CPER)
> format. Currently, the record prints out the raw MSR values and AMD relies
> on the raw record to provide MCA information.
>
> Extract the raw MSR values of MCA registers from the BERT and feed it into
> the standard mce_log() function through the existing x86/MCA RAS
> infrastructure. This will result in better decoding from the EDAC MCE
> decoder or the default notifier.
>
> The implementation is SMCA specific as the raw MCA register values are
> given in the register offset order of the MCAX address space.
>
> Reported-by: kernel test robot 
> Signed-off-by: Smita Koralahalli 
> ---

[...]


> diff --git a/drivers/firmware/efi/cper-x86.c b/drivers/firmware/efi/cper-x86.c
> index 2531de49f56c..374b8e18552a 100644
> --- a/drivers/firmware/efi/cper-x86.c
> +++ b/drivers/firmware/efi/cper-x86.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  // Copyright (C) 2018, Advanced Micro Devices, Inc.
>  
> -#include 

Why is the include dropped? AFAICT, the definitions from there are still
being used after this patch.

> +#include 
>  
>  /*
>   * We don't need a "CPER_IA" prefix since these are all locally defined.
> @@ -347,9 +347,11 @@ void cper_print_proc_ia(const char *pfx, const struct 
> cper_sec_proc_ia *proc)
>  ctx_info->mm_reg_addr);
>   }
>  
> - printk("%sRegister Array:\n", newpfx);
> - print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, groupsize,
> -(ctx_info + 1), ctx_info->reg_arr_size, 0);
> + if (arch_apei_report_x86_error(ctx_info, proc->lapic_id)) {
> + printk("%sRegister Array:\n", newpfx);
> + print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 
> groupsize,
> +(ctx_info + 1), ctx_info->reg_arr_size, 
> 0);
> + }
>  
>   ctx_info = (struct cper_ia_proc_ctx *)((long)ctx_info + size);
>   }
> diff --git a/include/acpi/apei.h b/include/acpi/apei.h
> index 680f80960c3d..44d4d08acce0 100644
> --- a/include/acpi/apei.h
> +++ b/include/acpi/apei.h
> @@ -33,8 +33,15 @@ extern bool ghes_disable;
>  
>  #ifdef CONFIG_ACPI_APEI
>  void __init acpi_hest_init(void);
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> +u64 lapic_id);
>  #else
>  static inline void acpi_hest_init(void) { return; }
> +static inline int arch_apei_report_x86_error(struct cper_ia_proc_ctx 
> *ctx_info,
> +  u64 lapic_id)
> +{
> + return -EINVAL;
> +}
>  #endif

Adding the declaration to this include violates the separation of
generic and architecture specific code.

Can this be moved to the appropriate architecture specific header?
Perhaps arch/x86/include/asm/apei.h.

>  typedef int (*apei_hest_func_t)(struct acpi_hest_header *hest_hdr, void 
> *data);
> @@ -51,6 +58,8 @@ int erst_clear(u64 record_id);
>  
>  int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
>  void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info,
> +u64 lapic_id);


Why is the additional declaration needed?

Thanks,
Punit

>  
>  #endif
>  #endif

[tip: efi/urgent] efi: cper: Add support for printing Firmware Error Record Reference

2020-05-22 Thread tip-bot2 for Punit Agrawal

The following commit has been merged into the efi/urgent branch of tip:

Commit-ID: 3d8c11efd528d56972d44ed0de51c4e11a9a4fa9
Gitweb:
https://git.kernel.org/tip/3d8c11efd528d56972d44ed0de51c4e11a9a4fa9
Author:Punit Agrawal 
AuthorDate:Tue, 12 May 2020 13:55:02 +09:00
Committer: Ard Biesheuvel 
CommitterDate: Thu, 14 May 2020 11:11:20 +02:00

efi: cper: Add support for printing Firmware Error Record Reference

While debugging a boot failure, the following unknown error record was
seen in the boot logs.

<...>
BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:   section type: unknown, 
81212a96-09ed-4996-9471-8d729c8e69ed
[Hardware Error]:   section length: 0x290
[Hardware Error]:   : 0001   00020002  

[Hardware Error]:   0010: 00020002 001f 0320    
...
[Hardware Error]:   0020:      

[Hardware Error]:   0030:      

<...>

On further investigation, it was found that the error record with
UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
UEFI Specification at least since v2.4 and has recently had additional
fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.

Add support for parsing and printing the defined fields to give users
a chance to figure out what went wrong.

Signed-off-by: Punit Agrawal 
Cc: Ard Biesheuvel 
Cc: "Rafael J. Wysocki" 
Cc: Borislav Petkov 
Cc: James Morse 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Link: 
https://lore.kernel.org/r/20200512045502.3810339-1-punit1.agra...@toshiba.co.jp
Signed-off-by: Ard Biesheuvel 
---
 drivers/firmware/efi/cper.c | 62 -
 include/linux/cper.h|  9 +-
 2 files changed, 71 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 9d25129..f564e15 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -407,6 +407,58 @@ static void cper_print_pcie(const char *pfx, const struct 
cper_sec_pcie *pcie,
}
 }
 
+static const char * const fw_err_rec_type_strs[] = {
+   "IPF SAL Error Record",
+   "SOC Firmware Error Record Type1 (Legacy CrashLog Support)",
+   "SOC Firmware Error Record Type2",
+};
+
+static void cper_print_fw_err(const char *pfx,
+ struct acpi_hest_generic_data *gdata,
+ const struct cper_sec_fw_err_rec_ref *fw_err)
+{
+   void *buf = acpi_hest_get_payload(gdata);
+   u32 offset, length = gdata->error_data_length;
+
+   printk("%s""Firmware Error Record Type: %s\n", pfx,
+  fw_err->record_type < ARRAY_SIZE(fw_err_rec_type_strs) ?
+  fw_err_rec_type_strs[fw_err->record_type] : "unknown");
+   printk("%s""Revision: %d\n", pfx, fw_err->revision);
+
+   /* Record Type based on UEFI 2.7 */
+   if (fw_err->revision == 0) {
+   printk("%s""Record Identifier: %08llx\n", pfx,
+  fw_err->record_identifier);
+   } else if (fw_err->revision == 2) {
+   printk("%s""Record Identifier: %pUl\n", pfx,
+  _err->record_identifier_guid);
+   }
+
+   /*
+* The FW error record may contain trailing data beyond the
+* structure defined by the specification. As the fields
+* defined (and hence the offset of any trailing data) vary
+* with the revision, set the offset to account for this
+* variation.
+*/
+   if (fw_err->revision == 0) {
+   /* record_identifier_guid not defined */
+   offset = offsetof(struct cper_sec_fw_err_rec_ref,
+ record_identifier_guid);
+   } else if (fw_err->revision == 1) {
+   /* record_identifier not defined */
+   offset = offsetof(struct cper_sec_fw_err_rec_ref,
+ record_identifier);
+   } else {
+   offset = sizeof(*fw_err);
+   }
+
+   buf += offset;
+   length -= offset;
+
+   print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, buf, length, true);
+}
+
 static void cper_print_tstamp(const char *pfx,
   struct acpi_hest_generic_data_v300 *gdata)
 {
@@ -494,6 +546,16 @@ cper_estatus_print_section(const char *pfx, struct 
acpi_hest_generic_data *gdata
else
goto err_section_too_small;
 #endif
+   } else if (guid_equal(sec_type, _SEC_FW_ERR_REC_REF)) {
+   struct cper_sec_fw_err_rec_ref *fw_e

Re: [PATCH] e1000e: Relax condition to trigger reset for ME workaround

2020-05-21 Thread Punit Agrawal

Hi Aaron,

"Brown, Aaron F"  writes:

>> From: netdev-ow...@vger.kernel.org  On
>> Behalf Of Punit Agrawal
>> Sent: Thursday, May 14, 2020 9:31 PM
>> To: Kirsher, Jeffrey T 
>> Cc: daniel.sangor...@toshiba.co.jp; Punit Agrawal
>> ; Alexander Duyck
>> ; David S. Miller ;
>> intel-wired-...@lists.osuosl.org; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org
>> Subject: [PATCH] e1000e: Relax condition to trigger reset for ME workaround
>> 
>> It's an error if the value of the RX/TX tail descriptor does not match
>> what was written. The error condition is true regardless the duration
>> of the interference from ME. But the driver only performs the reset if
>> E1000_ICH_FWSM_PCIM2PCI_COUNT (2000) iterations of 50us delay have
>> transpired. The extra condition can lead to inconsistency between the
>> state of hardware as expected by the driver.
>> 
>> Fix this by dropping the check for number of delay iterations.
>> 
>> While at it, also make __ew32_prepare() static as it's not used
>> anywhere else.
>> 
>> Signed-off-by: Punit Agrawal 
>> Reviewed-by: Alexander Duyck 
>> Cc: Jeff Kirsher 
>> Cc: "David S. Miller" 
>> Cc: intel-wired-...@lists.osuosl.org
>> Cc: net...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> Hi Jeff,
>> 
>> If there are no further comments please consider merging the patch.
>> 
>> Also, should it be marked for backport to stable?
>> 
>> Thanks,
>> Punit
>> 
>> RFC[0] -> v1:
>> * Dropped return value for __ew32_prepare() as it's not used
>> * Made __ew32_prepare() static
>> * Added tags
>> 
>> [0] https://lkml.org/lkml/2020/5/12/20
>> 
>>  drivers/net/ethernet/intel/e1000e/e1000.h  |  1 -
>>  drivers/net/ethernet/intel/e1000e/netdev.c | 12 +---
>>  2 files changed, 5 insertions(+), 8 deletions(-)
>> 
> Tested-by: Aaron Brown 

Thanks for taking the patch for a spin.

Jeff, let me know if you're okay to apply the tag or want me to send a
new version.

Thanks,
Punit

[PATCH] e1000e: Relax condition to trigger reset for ME workaround

2020-05-14 Thread Punit Agrawal

It's an error if the value of the RX/TX tail descriptor does not match
what was written. The error condition is true regardless the duration
of the interference from ME. But the driver only performs the reset if
E1000_ICH_FWSM_PCIM2PCI_COUNT (2000) iterations of 50us delay have
transpired. The extra condition can lead to inconsistency between the
state of hardware as expected by the driver.

Fix this by dropping the check for number of delay iterations.

While at it, also make __ew32_prepare() static as it's not used
anywhere else.

Signed-off-by: Punit Agrawal 
Reviewed-by: Alexander Duyck 
Cc: Jeff Kirsher 
Cc: "David S. Miller" 
Cc: intel-wired-...@lists.osuosl.org
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Hi Jeff,

If there are no further comments please consider merging the patch.

Also, should it be marked for backport to stable?

Thanks,
Punit

RFC[0] -> v1:
* Dropped return value for __ew32_prepare() as it's not used
* Made __ew32_prepare() static
* Added tags

[0] https://lkml.org/lkml/2020/5/12/20

 drivers/net/ethernet/intel/e1000e/e1000.h  |  1 -
 drivers/net/ethernet/intel/e1000e/netdev.c | 12 +---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h 
b/drivers/net/ethernet/intel/e1000e/e1000.h
index 37a2314d3e6b..944abd5eae11 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -576,7 +576,6 @@ static inline u32 __er32(struct e1000_hw *hw, unsigned long 
reg)
 
 #define er32(reg)  __er32(hw, E1000_##reg)
 
-s32 __ew32_prepare(struct e1000_hw *hw);
 void __ew32(struct e1000_hw *hw, unsigned long reg, u32 val);
 
 #define ew32(reg, val) __ew32(hw, E1000_##reg, (val))
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 177c6da80c57..e9aa47aba7eb 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -119,14 +119,12 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = 
{
  * has bit 24 set while ME is accessing MAC CSR registers, wait if it is set
  * and try again a number of times.
  **/
-s32 __ew32_prepare(struct e1000_hw *hw)
+static void __ew32_prepare(struct e1000_hw *hw)
 {
s32 i = E1000_ICH_FWSM_PCIM2PCI_COUNT;
 
while ((er32(FWSM) & E1000_ICH_FWSM_PCIM2PCI) && --i)
udelay(50);
-
-   return i;
 }
 
 void __ew32(struct e1000_hw *hw, unsigned long reg, u32 val)
@@ -607,11 +605,11 @@ static void e1000e_update_rdt_wa(struct e1000_ring 
*rx_ring, unsigned int i)
 {
struct e1000_adapter *adapter = rx_ring->adapter;
struct e1000_hw *hw = >hw;
-   s32 ret_val = __ew32_prepare(hw);
 
+   __ew32_prepare(hw);
writel(i, rx_ring->tail);
 
-   if (unlikely(!ret_val && (i != readl(rx_ring->tail {
+   if (unlikely(i != readl(rx_ring->tail))) {
u32 rctl = er32(RCTL);
 
ew32(RCTL, rctl & ~E1000_RCTL_EN);
@@ -624,11 +622,11 @@ static void e1000e_update_tdt_wa(struct e1000_ring 
*tx_ring, unsigned int i)
 {
struct e1000_adapter *adapter = tx_ring->adapter;
struct e1000_hw *hw = >hw;
-   s32 ret_val = __ew32_prepare(hw);
 
+   __ew32_prepare(hw);
writel(i, tx_ring->tail);
 
-   if (unlikely(!ret_val && (i != readl(tx_ring->tail {
+   if (unlikely(i != readl(tx_ring->tail))) {
u32 tctl = er32(TCTL);
 
ew32(TCTL, tctl & ~E1000_TCTL_EN);
-- 
2.26.2

Re: [RFC] e1000e: Relax condition to trigger reset for ME workaround

2020-05-14 Thread Punit Agrawal

Alexander Duyck  writes:

> On Mon, May 11, 2020 at 9:45 PM Punit Agrawal
>  wrote:
>>
>> It's an error if the value of the RX/TX tail descriptor does not match
>> what was written. The error condition is true regardless the duration
>> of the interference from ME. But the code only performs the reset if
>> E1000_ICH_FWSM_PCIM2PCI_COUNT (2000) iterations of 50us delay have
>> transpired. The extra condition can lead to inconsistency between the
>> state of hardware as expected by the driver.
>>
>> Fix this by dropping the check for number of delay iterations.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Jeff Kirsher 
>> Cc: "David S. Miller" 
>> Cc: intel-wired-...@lists.osuosl.org
>> Cc: net...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>> Hi,
>>
>> The issue was noticed through code inspection while backporting the
>> workaround for TDT corruption. Sending it out as an RFC as I am not
>> familiar with the hardware internals of the e1000e.
>>
>> Another unresolved question is the inherent racy nature of the
>> workaround - the ME could block access again after releasing the
>> device (i.e., BIT(E1000_ICH_FWSM_PCIM2PCI) clear) but before the
>> driver performs the write. Has this not been a problem?
>>
>> Any feedback on the patch or the more information on the issues
>> appreciated.
>>
>> Thanks,
>> Punit
>>
>>  drivers/net/ethernet/intel/e1000e/netdev.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
>> b/drivers/net/ethernet/intel/e1000e/netdev.c
>> index 177c6da80c57..5ed4d7ed35b3 100644
>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> @@ -607,11 +607,11 @@ static void e1000e_update_rdt_wa(struct e1000_ring 
>> *rx_ring, unsigned int i)
>>  {
>> struct e1000_adapter *adapter = rx_ring->adapter;
>> struct e1000_hw *hw = >hw;
>> -   s32 ret_val = __ew32_prepare(hw);
>>
>> +   __ew32_prepare(hw);
>> writel(i, rx_ring->tail);
>>
>> -   if (unlikely(!ret_val && (i != readl(rx_ring->tail {
>> +   if (unlikely(i != readl(rx_ring->tail))) {
>> u32 rctl = er32(RCTL);
>>
>> ew32(RCTL, rctl & ~E1000_RCTL_EN);
>> @@ -624,11 +624,11 @@ static void e1000e_update_tdt_wa(struct e1000_ring 
>> *tx_ring, unsigned int i)
>>  {
>> struct e1000_adapter *adapter = tx_ring->adapter;
>> struct e1000_hw *hw = >hw;
>> -   s32 ret_val = __ew32_prepare(hw);
>>
>> +   __ew32_prepare(hw);
>> writel(i, tx_ring->tail);
>>
>> -   if (unlikely(!ret_val && (i != readl(tx_ring->tail {
>> +   if (unlikely(i != readl(tx_ring->tail))) {
>> u32 tctl = er32(TCTL);
>>
>> ew32(TCTL, tctl & ~E1000_TCTL_EN);
>
> You are eliminating the timeout check in favor of just verifying if
> the write succeeded or not. Seems pretty straight forward to me.
>
> One other change you may consider making would be to drop the return
> value from __ew32_prepare since it doesn't appear to be used anywhere,
> make the function static, and maybe get rid of the prototype in
> e1000.h.
>
> Reviewed-by: Alexander Duyck 

Thanks! I will send out an update dropping the return and the prototype.

Re: [Patch v2] efi: cper: Add support for printing Firmware Error Record Reference

2020-05-12 Thread Punit Agrawal

Ard Biesheuvel  writes:

> On Tue, 12 May 2020 at 06:55, Punit Agrawal
>  wrote:
>>
>> While debugging a boot failure, the following unknown error record was
>> seen in the boot logs.
>>
>> <...>
>> BERT: Error records from previous boot:
>> [Hardware Error]: event severity: fatal
>> [Hardware Error]:  Error 0, type: fatal
>> [Hardware Error]:   section type: unknown, 
>> 81212a96-09ed-4996-9471-8d729c8e69ed
>> [Hardware Error]:   section length: 0x290
>> [Hardware Error]:   : 0001   00020002  
>> 
>> [Hardware Error]:   0010: 00020002 001f 0320   
>>  ...
>> [Hardware Error]:   0020:      
>> 
>> [Hardware Error]:   0030:      
>> 
>> <...>
>>
>> On further investigation, it was found that the error record with
>> UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
>> UEFI Specification at least since v2.4 and has recently had additional
>> fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.
>>
>> Add support for parsing and printing the defined fields to give users
>> a chance to figure out what went wrong.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Ard Biesheuvel 
>> Cc: "Rafael J. Wysocki" 
>> Cc: Borislav Petkov 
>> Cc: James Morse 
>> Cc: linux-a...@vger.kernel.org
>> Cc: linux-...@vger.kernel.org
>> ---
>> Hi Ard,
>>
>> I've updated the patch based on your feedback.
>>
>> As you noted, some aspects of the spec make it a bit tricky to support
>> all revisions in a nice way (e.g., size check) but this version should
>> fix existing issues.
>>
>> Thanks,
>> Punit
>>
>> v1[0] -> v2:
>> * Simplified error record structure definition
>> * Fixed size check
>> * Added comment to clarify offset calculation for dumped data
>> * Style fixes for multiline if blocks
>
> Thanks. I will queue this as a fix.

Thanks!

Just for my understanding - are you planning to send this for v5.7 or
v5.8? There's no rush, so I am fine either ways.

[...]

[Patch v2] efi: cper: Add support for printing Firmware Error Record Reference

2020-05-11 Thread Punit Agrawal

While debugging a boot failure, the following unknown error record was
seen in the boot logs.

<...>
BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:   section type: unknown, 
81212a96-09ed-4996-9471-8d729c8e69ed
[Hardware Error]:   section length: 0x290
[Hardware Error]:   : 0001   00020002  

[Hardware Error]:   0010: 00020002 001f 0320    
...
[Hardware Error]:   0020:      

[Hardware Error]:   0030:      

<...>

On further investigation, it was found that the error record with
UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
UEFI Specification at least since v2.4 and has recently had additional
fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.

Add support for parsing and printing the defined fields to give users
a chance to figure out what went wrong.

Signed-off-by: Punit Agrawal 
Cc: Ard Biesheuvel 
Cc: "Rafael J. Wysocki" 
Cc: Borislav Petkov 
Cc: James Morse 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
Hi Ard,

I've updated the patch based on your feedback.

As you noted, some aspects of the spec make it a bit tricky to support
all revisions in a nice way (e.g., size check) but this version should
fix existing issues.

Thanks,
Punit

v1[0] -> v2:
* Simplified error record structure definition
* Fixed size check
* Added comment to clarify offset calculation for dumped data
* Style fixes for multiline if blocks

[0] 
https://lkml.kernel.org/lkml/20200427085242.2380614-1-punit1.agra...@toshiba.co.jp/
---
 drivers/firmware/efi/cper.c | 62 +
 include/linux/cper.h|  9 ++
 2 files changed, 71 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 9d2512913d25..f564e15fbc7e 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -407,6 +407,58 @@ static void cper_print_pcie(const char *pfx, const struct 
cper_sec_pcie *pcie,
}
 }
 
+static const char * const fw_err_rec_type_strs[] = {
+   "IPF SAL Error Record",
+   "SOC Firmware Error Record Type1 (Legacy CrashLog Support)",
+   "SOC Firmware Error Record Type2",
+};
+
+static void cper_print_fw_err(const char *pfx,
+ struct acpi_hest_generic_data *gdata,
+ const struct cper_sec_fw_err_rec_ref *fw_err)
+{
+   void *buf = acpi_hest_get_payload(gdata);
+   u32 offset, length = gdata->error_data_length;
+
+   printk("%s""Firmware Error Record Type: %s\n", pfx,
+  fw_err->record_type < ARRAY_SIZE(fw_err_rec_type_strs) ?
+  fw_err_rec_type_strs[fw_err->record_type] : "unknown");
+   printk("%s""Revision: %d\n", pfx, fw_err->revision);
+
+   /* Record Type based on UEFI 2.7 */
+   if (fw_err->revision == 0) {
+   printk("%s""Record Identifier: %08llx\n", pfx,
+  fw_err->record_identifier);
+   } else if (fw_err->revision == 2) {
+   printk("%s""Record Identifier: %pUl\n", pfx,
+  _err->record_identifier_guid);
+   }
+
+   /*
+* The FW error record may contain trailing data beyond the
+* structure defined by the specification. As the fields
+* defined (and hence the offset of any trailing data) vary
+* with the revision, set the offset to account for this
+* variation.
+*/
+   if (fw_err->revision == 0) {
+   /* record_identifier_guid not defined */
+   offset = offsetof(struct cper_sec_fw_err_rec_ref,
+ record_identifier_guid);
+   } else if (fw_err->revision == 1) {
+   /* record_identifier not defined */
+   offset = offsetof(struct cper_sec_fw_err_rec_ref,
+ record_identifier);
+   } else {
+   offset = sizeof(*fw_err);
+   }
+
+   buf += offset;
+   length -= offset;
+
+   print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, buf, length, true);
+}
+
 static void cper_print_tstamp(const char *pfx,
   struct acpi_hest_generic_data_v300 *gdata)
 {
@@ -494,6 +546,16 @@ cper_estatus_print_section(const char *pfx, struct 
acpi_hest_generic_data *gdata
else
goto err_section_too_small;
 #endif
+   } else if (guid_equal(sec_type, _SEC_FW_ERR_REC_REF)) {
+   struct cper_sec_fw_err_rec_ref *fw_err = 
acpi_hest_get_pa

[RFC] e1000e: Relax condition to trigger reset for ME workaround

2020-05-11 Thread Punit Agrawal

It's an error if the value of the RX/TX tail descriptor does not match
what was written. The error condition is true regardless the duration
of the interference from ME. But the code only performs the reset if
E1000_ICH_FWSM_PCIM2PCI_COUNT (2000) iterations of 50us delay have
transpired. The extra condition can lead to inconsistency between the
state of hardware as expected by the driver.

Fix this by dropping the check for number of delay iterations.

Signed-off-by: Punit Agrawal 
Cc: Jeff Kirsher 
Cc: "David S. Miller" 
Cc: intel-wired-...@lists.osuosl.org
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
Hi,

The issue was noticed through code inspection while backporting the
workaround for TDT corruption. Sending it out as an RFC as I am not
familiar with the hardware internals of the e1000e.

Another unresolved question is the inherent racy nature of the
workaround - the ME could block access again after releasing the
device (i.e., BIT(E1000_ICH_FWSM_PCIM2PCI) clear) but before the
driver performs the write. Has this not been a problem?

Any feedback on the patch or the more information on the issues
appreciated.

Thanks,
Punit

 drivers/net/ethernet/intel/e1000e/netdev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 177c6da80c57..5ed4d7ed35b3 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -607,11 +607,11 @@ static void e1000e_update_rdt_wa(struct e1000_ring 
*rx_ring, unsigned int i)
 {
struct e1000_adapter *adapter = rx_ring->adapter;
struct e1000_hw *hw = >hw;
-   s32 ret_val = __ew32_prepare(hw);
 
+   __ew32_prepare(hw);
writel(i, rx_ring->tail);
 
-   if (unlikely(!ret_val && (i != readl(rx_ring->tail {
+   if (unlikely(i != readl(rx_ring->tail))) {
u32 rctl = er32(RCTL);
 
ew32(RCTL, rctl & ~E1000_RCTL_EN);
@@ -624,11 +624,11 @@ static void e1000e_update_tdt_wa(struct e1000_ring 
*tx_ring, unsigned int i)
 {
struct e1000_adapter *adapter = tx_ring->adapter;
struct e1000_hw *hw = >hw;
-   s32 ret_val = __ew32_prepare(hw);
 
+   __ew32_prepare(hw);
writel(i, tx_ring->tail);
 
-   if (unlikely(!ret_val && (i != readl(tx_ring->tail {
+   if (unlikely(i != readl(tx_ring->tail))) {
u32 tctl = er32(TCTL);
 
ew32(TCTL, tctl & ~E1000_TCTL_EN);
-- 
2.26.2

Re: [PATCH] efi: cper: Add support for printing Firmware Error Record Reference

2020-05-06 Thread Punit Agrawal

Hi Ard,

Ard Biesheuvel  writes:

> Hello Punit,
>
> On Mon, 27 Apr 2020 at 11:03, Punit Agrawal
>  wrote:
>>
>> While debugging a boot failure, the following unknown error record was
>> seen in the boot logs.
>>
>> <...>
>> BERT: Error records from previous boot:
>> [Hardware Error]: event severity: fatal
>> [Hardware Error]:  Error 0, type: fatal
>> [Hardware Error]:   section type: unknown, 
>> 81212a96-09ed-4996-9471-8d729c8e69ed
>> [Hardware Error]:   section length: 0x290
>> [Hardware Error]:   : 0001   00020002  
>> 
>> [Hardware Error]:   0010: 00020002 001f 0320   
>>  ...
>> [Hardware Error]:   0020:      
>> 
>> [Hardware Error]:   0030:      
>> 
>> <...>
>>
>> On further investigation, it was found that the error record with
>> UUID (81212a96-09ed-4996-9471-8d729c8e69ed) has been defined in the
>> UEFI Specification at least since v2.4 and has recently had additional
>> fields defined in v2.7 Section N.2.10 Firmware Error Record Reference.
>>
>> Add support for parsing and printing the defined fields to give users
>> a chance to figure out what's went wrong.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Ard Biesheuvel 
>> Cc: "Rafael J. Wysocki" 
>> Cc: Borislav Petkov 
>> Cc: James Morse 
>> Cc: linux-a...@vger.kernel.org
>> Cc: linux-...@vger.kernel.org

[...]

>>  drivers/firmware/efi/cper.c | 49 +
>>  include/linux/cper.h| 11 +
>>  2 files changed, 60 insertions(+)
>>
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index 9d2512913d25..153b95257e23 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -407,6 +407,46 @@ static void cper_print_pcie(const char *pfx, const 
>> struct cper_sec_pcie *pcie,
>> }
>>  }
>>
>> +static const char * const fw_err_rec_type_strs[] = {
>> +   "IPF SAL Error Record",
>> +   "SOC Firmware Error Record Type1 (Legacy CrashLog Support)",
>> +   "SOC Firmware Error Record Type2",
>> +};
>> +
>> +static void cper_print_fw_err(const char *pfx,
>> + struct acpi_hest_generic_data *gdata,
>> + const struct cper_sec_fw_err_rec_ref *fw_err)
>> +{
>> +   void *buf = acpi_hest_get_payload(gdata);
>> +   u32 offset, length = gdata->error_data_length;
>> +
>> +   printk("%s""Firmware Error Record Type: %s\n", pfx,
>> +  fw_err->record_type < ARRAY_SIZE(fw_err_rec_type_strs) ?
>> +  fw_err_rec_type_strs[fw_err->record_type] : "unknown");
>> +
>> +   /* Record Type based on UEFI 2.7 */
>> +   if (fw_err->revision == 0)
>> +   printk("%s""Record Identifier: %08llx\n", pfx,
>> +  fw_err->record_identifier);
>> +   else if (fw_err->revision == 2)
>> +   printk("%s""Record Identifier: %pUl\n", pfx,
>> +  _err->record_identifier_guid);
>> +
>
> Please use {} for multi-line statements between the ifs
>
>> +   if (fw_err->revision == 0)
>> +   offset = offsetof(struct cper_sec_fw_err_rec_ref,
>> + record_identifier_guid);
>> +   else if (fw_err->revision == 1)
>> +   offset = offsetof(struct cper_sec_fw_err_rec_ref,
>> + record_identifier);
>> +   else
>> +   offset = sizeof(*fw_err);
>> +
>
> This logic is slightly confusing, so it could do with a comment
> regarding which part of the structure is being dumped and why.
>
>
>> +   buf += offset;
>> +   length -= offset;
>> +
>> +   print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, buf, length, 
>> true);
>> +}
>> +
>>  static void cper_print_tstamp(const char *pfx,
>>struct acpi_hest_generic_data_v300 *gdata)
>>  {
>> @@ -494,6 +534,15 @@ cper_estatus_print_section(const char *pfx, struct 
>> acpi_hest_generic_data *gdata
>> else
>> goto err_section_to

Re: [PATCH v10 0/8] kvm: arm64: Support PUD hugepage at stage 2

2019-01-03 Thread Punit Agrawal

Christoffer Dall  writes:

> On Tue, Dec 11, 2018 at 05:10:33PM +, Suzuki K Poulose wrote:
>> This series is an update to the PUD hugepage support previously posted
>> at [0]. This patchset adds support for PUD hugepages at stage 2 a
>> feature that is useful on cores that have support for large sized TLB
>> mappings (e.g., 1GB for 4K granule).
>> 
>> The patches are based on v4.20-rc4
>> 
>> The patches have been tested on AMD Seattle system with the following
>> hugepage sizes - 2M and 1G.
>> 
>> Right now the PUD hugepage for stage2 is only supported if the stage2
>> has 4 levels. i.e, with an IPA size of minimum 44bits with 4K pages.
>> This could be relaxed to stage2 with 3 levels, with the stage1 PUD huge
>> page mapped in the entry level of the stage2 (i.e, pgd). I have not
>> added the change here to keep this version stable w.r.t the previous
>> version. I could post a patch later after further discussions in the
>> list.
>> 
>
> For the series:
>
> Reviewed-by: Christoffer Dall 

Thanks a lot for reviewing the patches and the tag. And to Suzuki for
picking up the patchset.

(I was happy to see this while catching up with the lists after an
extended break!)

[PATCH] mailmap: Update email for Punit Agrawal

2018-11-02 Thread Punit Agrawal

As I'll no longer be working with Arm, add a mailmap entry so any mail
directed towards me reaches the appropriate mailbox.

Signed-off-by: Punit Agrawal 
---
 .mailmap | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.mailmap b/.mailmap
index a76be45fef6c..28fecafa6506 100644
--- a/.mailmap
+++ b/.mailmap
@@ -159,6 +159,7 @@ Peter Oruba 
 Peter Oruba 
 Pratyush Anand  
 Praveen BP 
+Punit Agrawal  
 Qais Yousef  
 Oleksij Rempel  
 Oleksij Rempel  
-- 
2.19.1

[PATCH] mailmap: Update email for Punit Agrawal

2018-11-02 Thread Punit Agrawal

As I'll no longer be working with Arm, add a mailmap entry so any mail
directed towards me reaches the appropriate mailbox.

Signed-off-by: Punit Agrawal 
---
 .mailmap | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.mailmap b/.mailmap
index a76be45fef6c..28fecafa6506 100644
--- a/.mailmap
+++ b/.mailmap
@@ -159,6 +159,7 @@ Peter Oruba 
 Peter Oruba 
 Pratyush Anand  
 Praveen BP 
+Punit Agrawal  
 Qais Yousef  
 Oleksij Rempel  
 Oleksij Rempel  
-- 
2.19.1

Re: [PATCH v2] Documentation/arm64: HugeTLB page implementation

2018-10-09 Thread Punit Agrawal

Randy Dunlap  writes:

> On 10/8/18 3:03 AM, Punit Agrawal wrote:
>> Arm v8 architecture supports multiple page sizes - 4k, 16k and
>> 64k. Based on the active page size, the Linux port supports
>> corresponding hugepage sizes at PMD and PUD(4k only) levels.
>> 
>> In addition, the architecture also supports caching larger sized
>> ranges (composed of multiple entries) at the PTE and PMD level in the
>> TLBs using the contiguous bit. The Linux port makes use of this
>> architectural support to enable additional hugepage sizes.
>> 
>> Describe the two different types of hugepages supported by the arm64
>> kernel and the hugepage sizes enabled by each.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Jonathan Corbet 
>
> Acked-by: Randy Dunlap 

Thanks!

Catalin, Will - I assume you'll pick this up at some point? Or do arm64
documentation patches get routed by another tree?

>
> Thanks.
>
>> ---
>> Hi,
>> 
>> This version incorporates the feedback on v1.
>> 
>> Thanks,
>> Punit
>> 
>>  Documentation/arm64/hugetlbpage.txt | 38 +
>>  1 file changed, 38 insertions(+)
>>  create mode 100644 Documentation/arm64/hugetlbpage.txt
>> 
>> diff --git a/Documentation/arm64/hugetlbpage.txt 
>> b/Documentation/arm64/hugetlbpage.txt
>> new file mode 100644
>> index ..cfae87dc653b
>> --- /dev/null
>> +++ b/Documentation/arm64/hugetlbpage.txt
>> @@ -0,0 +1,38 @@
>> +HugeTLBpage on ARM64
>> +
>> +
>> +Hugepage relies on making efficient use of TLBs to improve performance of
>> +address translations. The benefit depends on both -
>> +
>> +  - the size of hugepages
>> +  - size of entries supported by the TLBs
>> +
>> +The ARM64 port supports two flavours of hugepages.
>> +
>> +1) Block mappings at the pud/pmd level
>> +--
>> +
>> +These are regular hugepages where a pmd or a pud page table entry points to 
>> a
>> +block of memory. Regardless of the supported size of entries in TLB, block
>> +mappings reduce the depth of page table walk needed to translate hugepage
>> +addresses.
>> +
>> +2) Using the Contiguous bit
>> +---
>> +
>> +The architecture provides a contiguous bit in the translation table entries
>> +(D4.5.3, ARM DDI 0487C.a) that hints to the MMU to indicate that it is one 
>> of a
>> +contiguous set of entries that can be cached in a single TLB entry.
>> +
>> +The contiguous bit is used in Linux to increase the mapping size at the pmd 
>> and
>> +pte (last) level. The number of supported contiguous entries varies by page 
>> size
>> +and level of the page table.
>> +
>> +
>> +The following hugepage sizes are supported -
>> +
>> + CONT PTEPMDCONT PMDPUD
>> + ------
>> +  4K: 64K 2M 32M 1G
>> +  16K: 2M32M  1G
>> +  64K: 2M   512M 16G
>>

Re: [PATCH v2] Documentation/arm64: HugeTLB page implementation

2018-10-09 Thread Punit Agrawal

Randy Dunlap  writes:

> On 10/8/18 3:03 AM, Punit Agrawal wrote:
>> Arm v8 architecture supports multiple page sizes - 4k, 16k and
>> 64k. Based on the active page size, the Linux port supports
>> corresponding hugepage sizes at PMD and PUD(4k only) levels.
>> 
>> In addition, the architecture also supports caching larger sized
>> ranges (composed of multiple entries) at the PTE and PMD level in the
>> TLBs using the contiguous bit. The Linux port makes use of this
>> architectural support to enable additional hugepage sizes.
>> 
>> Describe the two different types of hugepages supported by the arm64
>> kernel and the hugepage sizes enabled by each.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Jonathan Corbet 
>
> Acked-by: Randy Dunlap 

Thanks!

Catalin, Will - I assume you'll pick this up at some point? Or do arm64
documentation patches get routed by another tree?

>
> Thanks.
>
>> ---
>> Hi,
>> 
>> This version incorporates the feedback on v1.
>> 
>> Thanks,
>> Punit
>> 
>>  Documentation/arm64/hugetlbpage.txt | 38 +
>>  1 file changed, 38 insertions(+)
>>  create mode 100644 Documentation/arm64/hugetlbpage.txt
>> 
>> diff --git a/Documentation/arm64/hugetlbpage.txt 
>> b/Documentation/arm64/hugetlbpage.txt
>> new file mode 100644
>> index ..cfae87dc653b
>> --- /dev/null
>> +++ b/Documentation/arm64/hugetlbpage.txt
>> @@ -0,0 +1,38 @@
>> +HugeTLBpage on ARM64
>> +
>> +
>> +Hugepage relies on making efficient use of TLBs to improve performance of
>> +address translations. The benefit depends on both -
>> +
>> +  - the size of hugepages
>> +  - size of entries supported by the TLBs
>> +
>> +The ARM64 port supports two flavours of hugepages.
>> +
>> +1) Block mappings at the pud/pmd level
>> +--
>> +
>> +These are regular hugepages where a pmd or a pud page table entry points to 
>> a
>> +block of memory. Regardless of the supported size of entries in TLB, block
>> +mappings reduce the depth of page table walk needed to translate hugepage
>> +addresses.
>> +
>> +2) Using the Contiguous bit
>> +---
>> +
>> +The architecture provides a contiguous bit in the translation table entries
>> +(D4.5.3, ARM DDI 0487C.a) that hints to the MMU to indicate that it is one 
>> of a
>> +contiguous set of entries that can be cached in a single TLB entry.
>> +
>> +The contiguous bit is used in Linux to increase the mapping size at the pmd 
>> and
>> +pte (last) level. The number of supported contiguous entries varies by page 
>> size
>> +and level of the page table.
>> +
>> +
>> +The following hugepage sizes are supported -
>> +
>> + CONT PTEPMDCONT PMDPUD
>> + ------
>> +  4K: 64K 2M 32M 1G
>> +  16K: 2M32M  1G
>> +  64K: 2M   512M 16G
>>

Re: [PATCH v2] KVM: arm/arm64: Check memslot bounds before mapping hugepages

2018-10-04 Thread Punit Agrawal

Hi Lukas,

Lukas Braun  writes:

> Userspace can create a memslot with memory backed by (transparent)
> hugepages, but with bounds that do not align with hugepages.
> In that case, we cannot map the entire region in the guest as hugepages
> without exposing additional host memory to the guest and potentially
> interfering with other memslots.
> Consequently, this patch adds a bounds check when populating guest page
> tables and forces the creation of regular PTEs if mapping an entire
> hugepage would violate the memslots bounds.
>
> Signed-off-by: Lukas Braun 
> ---
>
> Hi everyone,
>
> for v2, in addition to writing the condition the way Marc suggested, I
> moved the whole check so it also catches the problem when the hugepage
> was allocated explicitly, not only for THPs.

Ok, that makes sense. Memslot bounds could exceed for hugetlbfs pages as
well.

> The second line is quite long, but splitting it up would make things
> rather ugly IMO, so I left it as it is.

Let's try to do better - user_mem_abort() is quite hard to follow as it
is.

>
>
> Regards,
> Lukas
>
>
>  virt/kvm/arm/mmu.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index ed162a6c57c5..ba77339e23ec 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1500,7 +1500,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> + if ((fault_ipa & S2_PMD_MASK) < (memslot->base_gfn << PAGE_SHIFT) ||
> + ALIGN(fault_ipa, S2_PMD_SIZE) >= ((memslot->base_gfn + 
> memslot->npages) << PAGE_SHIFT)) {
> + /* PMD entry would map something outside of the memslot */
> + force_pte = true;
> + } else if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>   hugetlb = true;
>   gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>   } else {

For the purpose of this fix, using a helper to check whether the mapping
fits in the memslot makes things clearer (imo) (untested patch below) -

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index ed162a6c57c5..8bca141eb45e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1466,6 +1466,18 @@ static void kvm_send_hwpoison_signal(unsigned long 
address,
send_sig_info(SIGBUS, , current);
 }
 
+static bool mapping_in_memslot(struct kvm_memory_slot *memslot,
+phys_addr_t fault_ipa, unsigned long mapping_size)
+{
+ gfn_t start_gfn = (fault_ipa & ~(mapping_size - 1)) >> PAGE_SHIFT;
+ gfn_t end_gfn = ALIGN(fault_ipa, mapping_size) >> PAGE_SHIFT;
+
+ WARN_ON(!is_power_of_2(mapping_size));
+
+ return memslot->base_gfn <= start_gfn &&
+ end_gfn < memslot->base_gfn + memslot->npages;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -1480,7 +1492,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
kvm_pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
bool logging_active = memslot_is_logging(memslot);
-   unsigned long flags = 0;
+ unsigned long vma_pagesize, flags = 0;
 
write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
@@ -1500,7 +1512,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
+ vma_pagesize = vma_kernel_pagesize(vma);
+ /* Is the mapping contained in the memslot? */
+ if (!mapping_in_memslot(memslot, fault_ipa, vma_pagesize)) {
+ /* memslot should be aligned to page size */
+ vma_pagesize = PAGE_SIZE;
+ force_pte = true;
+ }
+
+ if (vma_pagesize == PMD_SIZE && !logging_active) {
hugetlb = true;
gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
} else {

Thoughts?

Re: [PATCH v2] KVM: arm/arm64: Check memslot bounds before mapping hugepages

2018-10-04 Thread Punit Agrawal

Hi Lukas,

Lukas Braun  writes:

> Userspace can create a memslot with memory backed by (transparent)
> hugepages, but with bounds that do not align with hugepages.
> In that case, we cannot map the entire region in the guest as hugepages
> without exposing additional host memory to the guest and potentially
> interfering with other memslots.
> Consequently, this patch adds a bounds check when populating guest page
> tables and forces the creation of regular PTEs if mapping an entire
> hugepage would violate the memslots bounds.
>
> Signed-off-by: Lukas Braun 
> ---
>
> Hi everyone,
>
> for v2, in addition to writing the condition the way Marc suggested, I
> moved the whole check so it also catches the problem when the hugepage
> was allocated explicitly, not only for THPs.

Ok, that makes sense. Memslot bounds could exceed for hugetlbfs pages as
well.

> The second line is quite long, but splitting it up would make things
> rather ugly IMO, so I left it as it is.

Let's try to do better - user_mem_abort() is quite hard to follow as it
is.

>
>
> Regards,
> Lukas
>
>
>  virt/kvm/arm/mmu.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index ed162a6c57c5..ba77339e23ec 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1500,7 +1500,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return -EFAULT;
>   }
>  
> - if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> + if ((fault_ipa & S2_PMD_MASK) < (memslot->base_gfn << PAGE_SHIFT) ||
> + ALIGN(fault_ipa, S2_PMD_SIZE) >= ((memslot->base_gfn + 
> memslot->npages) << PAGE_SHIFT)) {
> + /* PMD entry would map something outside of the memslot */
> + force_pte = true;
> + } else if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>   hugetlb = true;
>   gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>   } else {

For the purpose of this fix, using a helper to check whether the mapping
fits in the memslot makes things clearer (imo) (untested patch below) -

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index ed162a6c57c5..8bca141eb45e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1466,6 +1466,18 @@ static void kvm_send_hwpoison_signal(unsigned long 
address,
send_sig_info(SIGBUS, , current);
 }
 
+static bool mapping_in_memslot(struct kvm_memory_slot *memslot,
+phys_addr_t fault_ipa, unsigned long mapping_size)
+{
+ gfn_t start_gfn = (fault_ipa & ~(mapping_size - 1)) >> PAGE_SHIFT;
+ gfn_t end_gfn = ALIGN(fault_ipa, mapping_size) >> PAGE_SHIFT;
+
+ WARN_ON(!is_power_of_2(mapping_size));
+
+ return memslot->base_gfn <= start_gfn &&
+ end_gfn < memslot->base_gfn + memslot->npages;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  struct kvm_memory_slot *memslot, unsigned long hva,
  unsigned long fault_status)
@@ -1480,7 +1492,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
kvm_pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
bool logging_active = memslot_is_logging(memslot);
-   unsigned long flags = 0;
+ unsigned long vma_pagesize, flags = 0;
 
write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
@@ -1500,7 +1512,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
return -EFAULT;
}
 
-   if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
+ vma_pagesize = vma_kernel_pagesize(vma);
+ /* Is the mapping contained in the memslot? */
+ if (!mapping_in_memslot(memslot, fault_ipa, vma_pagesize)) {
+ /* memslot should be aligned to page size */
+ vma_pagesize = PAGE_SIZE;
+ force_pte = true;
+ }
+
+ if (vma_pagesize == PMD_SIZE && !logging_active) {
hugetlb = true;
gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
} else {

Thoughts?

Re: [PATCH v8 2/9] KVM: arm/arm64: Share common code in user_mem_abort()

2018-10-03 Thread Punit Agrawal

Marc Zyngier  writes:

> Hi Punit,
>
> On 01/10/18 16:54, Punit Agrawal wrote:
>> The code for operations such as marking the pfn as dirty, and
>> dcache/icache maintenance during stage 2 fault handling is duplicated
>> between normal pages and PMD hugepages.
>>
>> Instead of creating another copy of the operations when we introduce
>> PUD hugepages, let's share them across the different pagesizes.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Suzuki K Poulose 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> ---
>>   virt/kvm/arm/mmu.c | 45 +
>>   1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index c23a1b323aad..5b76ee204000 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1490,7 +1490,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  kvm_pfn_t pfn;
>>  pgprot_t mem_type = PAGE_S2;
>>  bool logging_active = memslot_is_logging(memslot);
>> -unsigned long flags = 0;
>> +unsigned long vma_pagesize, flags = 0;
>>  write_fault = kvm_is_write_fault(vcpu);
>>  exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>> @@ -1510,10 +1510,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>   -  if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>> +vma_pagesize = vma_kernel_pagesize(vma);
>> +if (vma_pagesize == PMD_SIZE && !logging_active) {
>>  hugetlb = true;
>>  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>>  } else {
>> +/*
>> + * Fallback to PTE if it's not one of the Stage 2
>> + * supported hugepage sizes
>> + */
>> +vma_pagesize = PAGE_SIZE;
>> +
>>  /*
>>   * Pages belonging to memslots that don't have the same
>>   * alignment for userspace and IPA cannot be mapped using
>> @@ -1579,23 +1586,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (mmu_notifier_retry(kvm, mmu_seq))
>>  goto out_unlock;
>>   -  if (!hugetlb && !force_pte)
>> +if (!hugetlb && !force_pte) {
>> +/*
>> + * Only PMD_SIZE transparent hugepages(THP) are
>> + * currently supported. This code will need to be
>> + * updated to support other THP sizes.
>> + */
>>  hugetlb = transparent_hugepage_adjust(, _ipa);
>> +if (hugetlb)
>> +vma_pagesize = PMD_SIZE;
>> +}
>> +
>> +if (writable)
>> +kvm_set_pfn_dirty(pfn);
>>   -  if (hugetlb) {
>> +if (fault_status != FSC_PERM)
>> +clean_dcache_guest_page(pfn, vma_pagesize);
>> +
>> +if (exec_fault)
>> +invalidate_icache_guest_page(pfn, vma_pagesize);
>> +
>> +if (hugetlb && vma_pagesize == PMD_SIZE) {
>
> Can you end-up in a situation where hugetlb==false and vma_pagesize ==
> PMD_SIZE? If that's the case, then the above CMOs are not done on the
> same granularity as they were done before this patch. If that cannot
> happen, then the above condition can be simplified.
>
> Which one is it?

hugetlb is a hangover from when we didn't have vma_pagesize. I think we
can drop it and rely on the pagesize to control the size of mapping we
put down.

Let me give that a try.

Thanks for taking a look.

>
>
> Thanks,
>
>   M.

Re: [PATCH v8 2/9] KVM: arm/arm64: Share common code in user_mem_abort()

2018-10-03 Thread Punit Agrawal

Marc Zyngier  writes:

> Hi Punit,
>
> On 01/10/18 16:54, Punit Agrawal wrote:
>> The code for operations such as marking the pfn as dirty, and
>> dcache/icache maintenance during stage 2 fault handling is duplicated
>> between normal pages and PMD hugepages.
>>
>> Instead of creating another copy of the operations when we introduce
>> PUD hugepages, let's share them across the different pagesizes.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Suzuki K Poulose 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> ---
>>   virt/kvm/arm/mmu.c | 45 +
>>   1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index c23a1b323aad..5b76ee204000 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1490,7 +1490,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  kvm_pfn_t pfn;
>>  pgprot_t mem_type = PAGE_S2;
>>  bool logging_active = memslot_is_logging(memslot);
>> -unsigned long flags = 0;
>> +unsigned long vma_pagesize, flags = 0;
>>  write_fault = kvm_is_write_fault(vcpu);
>>  exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>> @@ -1510,10 +1510,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>   -  if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>> +vma_pagesize = vma_kernel_pagesize(vma);
>> +if (vma_pagesize == PMD_SIZE && !logging_active) {
>>  hugetlb = true;
>>  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>>  } else {
>> +/*
>> + * Fallback to PTE if it's not one of the Stage 2
>> + * supported hugepage sizes
>> + */
>> +vma_pagesize = PAGE_SIZE;
>> +
>>  /*
>>   * Pages belonging to memslots that don't have the same
>>   * alignment for userspace and IPA cannot be mapped using
>> @@ -1579,23 +1586,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (mmu_notifier_retry(kvm, mmu_seq))
>>  goto out_unlock;
>>   -  if (!hugetlb && !force_pte)
>> +if (!hugetlb && !force_pte) {
>> +/*
>> + * Only PMD_SIZE transparent hugepages(THP) are
>> + * currently supported. This code will need to be
>> + * updated to support other THP sizes.
>> + */
>>  hugetlb = transparent_hugepage_adjust(, _ipa);
>> +if (hugetlb)
>> +vma_pagesize = PMD_SIZE;
>> +}
>> +
>> +if (writable)
>> +kvm_set_pfn_dirty(pfn);
>>   -  if (hugetlb) {
>> +if (fault_status != FSC_PERM)
>> +clean_dcache_guest_page(pfn, vma_pagesize);
>> +
>> +if (exec_fault)
>> +invalidate_icache_guest_page(pfn, vma_pagesize);
>> +
>> +if (hugetlb && vma_pagesize == PMD_SIZE) {
>
> Can you end-up in a situation where hugetlb==false and vma_pagesize ==
> PMD_SIZE? If that's the case, then the above CMOs are not done on the
> same granularity as they were done before this patch. If that cannot
> happen, then the above condition can be simplified.
>
> Which one is it?

hugetlb is a hangover from when we didn't have vma_pagesize. I think we
can drop it and rely on the pagesize to control the size of mapping we
put down.

Let me give that a try.

Thanks for taking a look.

>
>
> Thanks,
>
>   M.

Re: [PATCH V2] mm/hugetlb: Add mmap() encodings for 32MB and 512MB page sizes

2018-09-25 Thread Punit Agrawal

Anshuman Khandual  writes:

> ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes.
> This just adds mmap() system call argument encoding for them.
>
> Signed-off-by: Anshuman Khandual 

Thanks for adding the encodings.

Acked-by: Punit Agrawal 

> ---
>
> Changes in V2:
> - Updated SHM and MFD definitions per Mike
>
>  include/uapi/asm-generic/hugetlb_encode.h | 2 ++
>  include/uapi/linux/memfd.h| 2 ++
>  include/uapi/linux/mman.h | 2 ++
>  include/uapi/linux/shm.h  | 2 ++
>  4 files changed, 8 insertions(+)
>
> diff --git a/include/uapi/asm-generic/hugetlb_encode.h 
> b/include/uapi/asm-generic/hugetlb_encode.h
> index e4732d3..b0f8e87 100644
> --- a/include/uapi/asm-generic/hugetlb_encode.h
> +++ b/include/uapi/asm-generic/hugetlb_encode.h
> @@ -26,7 +26,9 @@
>  #define HUGETLB_FLAG_ENCODE_2MB  (21 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_8MB  (23 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_16MB (24 << HUGETLB_FLAG_ENCODE_SHIFT)
> +#define HUGETLB_FLAG_ENCODE_32MB (25 << HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_256MB(28 << HUGETLB_FLAG_ENCODE_SHIFT)
> +#define HUGETLB_FLAG_ENCODE_512MB(29 << HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_1GB  (30 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_2GB  (31 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_16GB (34 << HUGETLB_FLAG_ENCODE_SHIFT)
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 015a4c0..7a8a267 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -25,7 +25,9 @@
>  #define MFD_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define MFD_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define MFD_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define MFD_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define MFD_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define MFD_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define MFD_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define MFD_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define MFD_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index bfd5938..d0f515d 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -28,7 +28,9 @@
>  #define MAP_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define MAP_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define MAP_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define MAP_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define MAP_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define MAP_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define MAP_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB
> diff --git a/include/uapi/linux/shm.h b/include/uapi/linux/shm.h
> index dde1344..6507ad0 100644
> --- a/include/uapi/linux/shm.h
> +++ b/include/uapi/linux/shm.h
> @@ -65,7 +65,9 @@ struct shmid_ds {
>  #define SHM_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define SHM_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define SHM_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define SHM_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define SHM_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define SHM_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define SHM_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define SHM_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define SHM_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB

Re: [PATCH V2] mm/hugetlb: Add mmap() encodings for 32MB and 512MB page sizes

2018-09-25 Thread Punit Agrawal

Anshuman Khandual  writes:

> ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes.
> This just adds mmap() system call argument encoding for them.
>
> Signed-off-by: Anshuman Khandual 

Thanks for adding the encodings.

Acked-by: Punit Agrawal 

> ---
>
> Changes in V2:
> - Updated SHM and MFD definitions per Mike
>
>  include/uapi/asm-generic/hugetlb_encode.h | 2 ++
>  include/uapi/linux/memfd.h| 2 ++
>  include/uapi/linux/mman.h | 2 ++
>  include/uapi/linux/shm.h  | 2 ++
>  4 files changed, 8 insertions(+)
>
> diff --git a/include/uapi/asm-generic/hugetlb_encode.h 
> b/include/uapi/asm-generic/hugetlb_encode.h
> index e4732d3..b0f8e87 100644
> --- a/include/uapi/asm-generic/hugetlb_encode.h
> +++ b/include/uapi/asm-generic/hugetlb_encode.h
> @@ -26,7 +26,9 @@
>  #define HUGETLB_FLAG_ENCODE_2MB  (21 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_8MB  (23 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_16MB (24 << HUGETLB_FLAG_ENCODE_SHIFT)
> +#define HUGETLB_FLAG_ENCODE_32MB (25 << HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_256MB(28 << HUGETLB_FLAG_ENCODE_SHIFT)
> +#define HUGETLB_FLAG_ENCODE_512MB(29 << HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_1GB  (30 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_2GB  (31 << 
> HUGETLB_FLAG_ENCODE_SHIFT)
>  #define HUGETLB_FLAG_ENCODE_16GB (34 << HUGETLB_FLAG_ENCODE_SHIFT)
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 015a4c0..7a8a267 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -25,7 +25,9 @@
>  #define MFD_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define MFD_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define MFD_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define MFD_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define MFD_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define MFD_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define MFD_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define MFD_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define MFD_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index bfd5938..d0f515d 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -28,7 +28,9 @@
>  #define MAP_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define MAP_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define MAP_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define MAP_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define MAP_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define MAP_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define MAP_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB
> diff --git a/include/uapi/linux/shm.h b/include/uapi/linux/shm.h
> index dde1344..6507ad0 100644
> --- a/include/uapi/linux/shm.h
> +++ b/include/uapi/linux/shm.h
> @@ -65,7 +65,9 @@ struct shmid_ds {
>  #define SHM_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB
>  #define SHM_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB
>  #define SHM_HUGE_16MBHUGETLB_FLAG_ENCODE_16MB
> +#define SHM_HUGE_32MBHUGETLB_FLAG_ENCODE_32MB
>  #define SHM_HUGE_256MB   HUGETLB_FLAG_ENCODE_256MB
> +#define SHM_HUGE_512MB   HUGETLB_FLAG_ENCODE_512MB
>  #define SHM_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB
>  #define SHM_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB
>  #define SHM_HUGE_16GBHUGETLB_FLAG_ENCODE_16GB

Re: [PATCH v2] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-30 Thread Punit Agrawal

Hi Vincent,

Vincent Guittot  writes:

> The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
> its usage in the code assumes that unit is uW/MHz/V^2
>
> In drivers/thermal/cpu_cooling.c, the code is :
>
> power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
> do_div(power, 10);
>
> which can be summarized as :
> power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2
> or
> power (mW) = (capacitance * freq_mhz * (voltage_mv/1000)^2) / 1000
> then
> power (mW) = power (uW) / 1000
> so
> power (uW) = capacitance * freq_mhz * (voltage_mv/1000)^2
>
> Furthermore, if we test basic values like :
> voltage_mv = 1000mV = 1V
> freq_mhz = 1000Mhz
>
> The minimum possible power, when dynamic-power-coefficient equals 1, will
> be with current unit:
> min power = 1 * 1000  * (100)^2 = 10^15 mW
> which is not realistic
>
> With the unit used by the code, the min power is
> min power =  1 * 1000 * 1^2 = 1000uW = 1mW which is far more realistic
>
> Signed-off-by: Vincent Guittot 

Acked-by: Punit Agrawal 

Thanks for fixing the mismatch.

> ---
>  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
> b/Documentation/devicetree/bindings/arm/cpus.txt
> index 29e1dc5..71d8cd0 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ b/Documentation/devicetree/bindings/arm/cpus.txt
> @@ -274,7 +274,7 @@ described below.
>   Usage: optional
>   Value type: 
>   Definition: A u32 value that represents the running time dynamic
> - power coefficient in units of mW/MHz/uV^2. The
> + power coefficient in units of uW/MHz/V^2. The
>   coefficient can either be calculated from power
>   measurements or derived by analysis.
>  
> @@ -285,7 +285,7 @@ described below.
>  
>   Pdyn = dynamic-power-coefficient * V^2 * f
>  
> - where voltage is in uV, frequency is in MHz.
> + where voltage is in V, frequency is in MHz.
>  
>  Example 1 (dual-cluster big.LITTLE system 32-bit):

Re: [PATCH v2] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-30 Thread Punit Agrawal

Hi Vincent,

Vincent Guittot  writes:

> The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
> its usage in the code assumes that unit is uW/MHz/V^2
>
> In drivers/thermal/cpu_cooling.c, the code is :
>
> power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
> do_div(power, 10);
>
> which can be summarized as :
> power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2
> or
> power (mW) = (capacitance * freq_mhz * (voltage_mv/1000)^2) / 1000
> then
> power (mW) = power (uW) / 1000
> so
> power (uW) = capacitance * freq_mhz * (voltage_mv/1000)^2
>
> Furthermore, if we test basic values like :
> voltage_mv = 1000mV = 1V
> freq_mhz = 1000Mhz
>
> The minimum possible power, when dynamic-power-coefficient equals 1, will
> be with current unit:
> min power = 1 * 1000  * (100)^2 = 10^15 mW
> which is not realistic
>
> With the unit used by the code, the min power is
> min power =  1 * 1000 * 1^2 = 1000uW = 1mW which is far more realistic
>
> Signed-off-by: Vincent Guittot 

Acked-by: Punit Agrawal 

Thanks for fixing the mismatch.

> ---
>  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
> b/Documentation/devicetree/bindings/arm/cpus.txt
> index 29e1dc5..71d8cd0 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ b/Documentation/devicetree/bindings/arm/cpus.txt
> @@ -274,7 +274,7 @@ described below.
>   Usage: optional
>   Value type: 
>   Definition: A u32 value that represents the running time dynamic
> - power coefficient in units of mW/MHz/uV^2. The
> + power coefficient in units of uW/MHz/V^2. The
>   coefficient can either be calculated from power
>   measurements or derived by analysis.
>  
> @@ -285,7 +285,7 @@ described below.
>  
>   Pdyn = dynamic-power-coefficient * V^2 * f
>  
> - where voltage is in uV, frequency is in MHz.
> + where voltage is in V, frequency is in MHz.
>  
>  Example 1 (dual-cluster big.LITTLE system 32-bit):

[PATCH 2/2] x86/PCI: Remove node-local allocation when initialising host controller

2018-08-28 Thread Punit Agrawal

Memory for host controller data structures is allocated local to the
node to which the controller is associated with. This has been the
behaviour since 965cd0e4a5e5 ("x86, PCI, ACPI: Use kmalloc_node() to
optimize for performance") where the node local allocation was added
without additional context.

Drop the node local allocation as there is no benefit from doing so -
the usage of these structures is independent from where the controller
is located.

Signed-off-by: Punit Agrawal 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
---
 arch/x86/pci/acpi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 5559dcaddd5e..948656069cdd 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -356,7 +356,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root 
*root)
} else {
struct pci_root_info *info;
 
-   info = kzalloc_node(sizeof(*info), GFP_KERNEL, node);
+   info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
dev_err(>device->dev,
"pci_bus %04x:%02x: ignored (out of memory)\n",
-- 
2.18.0

[PATCH 0/2] Drop node-local allocation during host controller initialisation

2018-08-28 Thread Punit Agrawal

Hi Bjorn,

As discussed before[0], here are a couple of patches to drop
node-local allocations during host contoller initialisation. This set
covers both arm64 and x86.

I'm posting early to give the patches time on the list as well in next
in case there are issues we've missed.

The patches are based on v4.19-rc1 and has been boot tested on arm64
and compile tested on x86.

Thanks,
Punit

[0] https://www.spinics.net/lists/arm-kernel/msg669746.html

Punit Agrawal (2):
  arm64: PCI: Remove node-local allocations when initialising host
controller
  x86/PCI: Remove node-local allocation when initialising host
controller

 arch/arm64/kernel/pci.c | 5 ++---
 arch/x86/pci/acpi.c | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

-- 
2.18.0

[PATCH 2/2] x86/PCI: Remove node-local allocation when initialising host controller

2018-08-28 Thread Punit Agrawal

Memory for host controller data structures is allocated local to the
node to which the controller is associated with. This has been the
behaviour since 965cd0e4a5e5 ("x86, PCI, ACPI: Use kmalloc_node() to
optimize for performance") where the node local allocation was added
without additional context.

Drop the node local allocation as there is no benefit from doing so -
the usage of these structures is independent from where the controller
is located.

Signed-off-by: Punit Agrawal 
Cc: Bjorn Helgaas 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
---
 arch/x86/pci/acpi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 5559dcaddd5e..948656069cdd 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -356,7 +356,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root 
*root)
} else {
struct pci_root_info *info;
 
-   info = kzalloc_node(sizeof(*info), GFP_KERNEL, node);
+   info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
dev_err(>device->dev,
"pci_bus %04x:%02x: ignored (out of memory)\n",
-- 
2.18.0

[PATCH 0/2] Drop node-local allocation during host controller initialisation

2018-08-28 Thread Punit Agrawal

Hi Bjorn,

As discussed before[0], here are a couple of patches to drop
node-local allocations during host contoller initialisation. This set
covers both arm64 and x86.

I'm posting early to give the patches time on the list as well in next
in case there are issues we've missed.

The patches are based on v4.19-rc1 and has been boot tested on arm64
and compile tested on x86.

Thanks,
Punit

[0] https://www.spinics.net/lists/arm-kernel/msg669746.html

Punit Agrawal (2):
  arm64: PCI: Remove node-local allocations when initialising host
controller
  x86/PCI: Remove node-local allocation when initialising host
controller

 arch/arm64/kernel/pci.c | 5 ++---
 arch/x86/pci/acpi.c | 2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

-- 
2.18.0

[PATCH 1/2] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-28 Thread Punit Agrawal

Memory for host controller data structures is allocated local to the
node to which the controller is associated with. This has been the
behaviour since support for ACPI was added in
commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host controller").

Drop the node local allocation as there is no benefit from doing so -
the usage of these structures is independent from where the controller
is located.

Signed-off-by: Punit Agrawal 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Lorenzo Pieralisi 
Cc: linux-arm-ker...@lists.infradead.org
---
 arch/arm64/kernel/pci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 0e2ea1c78542..bb85e2f4603f 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -165,16 +165,15 @@ static void pci_acpi_generic_release_info(struct 
acpi_pci_root_info *ci)
 /* Interface called from ACPI code to setup PCI host controller */
 struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 {
-   int node = acpi_get_node(root->device->handle);
struct acpi_pci_generic_root_info *ri;
struct pci_bus *bus, *child;
struct acpi_pci_root_ops *root_ops;
 
-   ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
+   ri = kzalloc(sizeof(*ri), GFP_KERNEL);
if (!ri)
return NULL;
 
-   root_ops = kzalloc_node(sizeof(*root_ops), GFP_KERNEL, node);
+   root_ops = kzalloc(sizeof(*root_ops), GFP_KERNEL);
if (!root_ops) {
kfree(ri);
return NULL;
-- 
2.18.0

[PATCH 1/2] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-28 Thread Punit Agrawal

Memory for host controller data structures is allocated local to the
node to which the controller is associated with. This has been the
behaviour since support for ACPI was added in
commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host controller").

Drop the node local allocation as there is no benefit from doing so -
the usage of these structures is independent from where the controller
is located.

Signed-off-by: Punit Agrawal 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Lorenzo Pieralisi 
Cc: linux-arm-ker...@lists.infradead.org
---
 arch/arm64/kernel/pci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 0e2ea1c78542..bb85e2f4603f 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -165,16 +165,15 @@ static void pci_acpi_generic_release_info(struct 
acpi_pci_root_info *ci)
 /* Interface called from ACPI code to setup PCI host controller */
 struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 {
-   int node = acpi_get_node(root->device->handle);
struct acpi_pci_generic_root_info *ri;
struct pci_bus *bus, *child;
struct acpi_pci_root_ops *root_ops;
 
-   ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
+   ri = kzalloc(sizeof(*ri), GFP_KERNEL);
if (!ri)
return NULL;
 
-   root_ops = kzalloc_node(sizeof(*root_ops), GFP_KERNEL, node);
+   root_ops = kzalloc(sizeof(*root_ops), GFP_KERNEL);
if (!root_ops) {
kfree(ri);
return NULL;
-- 
2.18.0

Re: [PATCH] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-28 Thread Punit Agrawal

Vincent Guittot  writes:

> Hi Amit,
>
> On Wed, 22 Aug 2018 at 12:11, Punit Agrawal  wrote:
>>
>> Hi Vincent,
>>
>> Thanks for the patch. One comment about the choice of units below.
>>
>> Vincent Guittot  writes:
>>
>> > The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
>> > its usage in the code assumes that unit is mW/GHz/V^2
>>
>> Instead of choosing GHz as the base, I'd prefer to use uW/MHz/V^2. It'll
>> avoid introducing fractional GHz value for frequency calculations.
>
> I don't understand your concern about fractional Ghz value for
> frequency calculation ?
> I mean, why it's a problem for frequency with Ghz vs Mhz but not a
> problem for voltage with V vs mV ?
> Don't we have the same "problem" in both case ?

You're right. It's the same problem in both cases.

>>
>> > In drivers/thermal/cpu_cooling.c, the code is :
>> >
>> > power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
>> > do_div(power, 10);
>> >
>> > which can be summarized as :
>> > power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2
>>
>> Which would then translate to -
>>
>> power (mW) = power (uW) / 1000 = capacitance * freq_mhz * (voltage_mv/1000)^2
>
> Not sure that the equation above is correct. If we consider uW/MHz/V^2
> for the unit, the equation becomes :
> power (mW) = power (uW) / 1000 = capacitance * freq_mhz *
> (voltage_mv/1000)^2 / 1000

Yes, I missed the "/ 1000" at the end.

> which can be rearranged as
> power (mW) = power (uW) / 1000 = capacitance * freq_mhz/ 1000 *
> (voltage_mv/1000)^2
>
> TBH, I don't really mind between  mW/GHz/V^2 or uW/MHz/V^2 as they are
> the same at the end
> but I don't catch your reasoning

The problem I was thinking of doesn't hold as it's the same issue with
voltage.

One benefit to go with uW/MHz/V^2 might be the extra resolution that it
provides. I'd prefer to go with uW/MHz/V^2 if there's no compelling
reason to go with anything else.


[...]

>> >
>> > Furthermore, if we test basic values like :
>> > voltage_mv = 1000mV = 1V
>> > freq_mhz = 1000Mhz = 1Ghz
>> >
>> > The minimum possible power, when dynamic-power-coefficient equals 1, will
>> > be :
>> > min power = 1 * 1000  * (100)^2 = 10^15 mW
>> > which is not realistic
>> >
>> > With the unit used by the code, the min power is
>> > min power =  1 * 1 * 1^2 = 1mW which is far more realistic
>> >
>> > Signed-off-by: Vincent Guittot 
>> > ---
>> >  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>> >  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
>> > b/Documentation/devicetree/bindings/arm/cpus.txt
>> > index 29e1dc5..0148d7d 100644
>> > --- a/Documentation/devicetree/bindings/arm/cpus.txt
>> > +++ b/Documentation/devicetree/bindings/arm/cpus.txt
>> > @@ -274,7 +274,7 @@ described below.
>> >   Usage: optional
>> >   Value type: 
>> >   Definition: A u32 value that represents the running time 
>> > dynamic
>> > - power coefficient in units of mW/MHz/uV^2. The
>> > + power coefficient in units of mW/GHz/V^2. The
>> >   coefficient can either be calculated from power
>> >   measurements or derived by analysis.
>> >
>> > @@ -285,7 +285,7 @@ described below.
>> >
>> >   Pdyn = dynamic-power-coefficient * V^2 * f
>> >
>> > - where voltage is in uV, frequency is in MHz.
>> > + where voltage is in V, frequency is in GHz.
>> >
>> >  Example 1 (dual-cluster big.LITTLE system 32-bit):

Re: [PATCH] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-28 Thread Punit Agrawal

Vincent Guittot  writes:

> Hi Amit,
>
> On Wed, 22 Aug 2018 at 12:11, Punit Agrawal  wrote:
>>
>> Hi Vincent,
>>
>> Thanks for the patch. One comment about the choice of units below.
>>
>> Vincent Guittot  writes:
>>
>> > The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
>> > its usage in the code assumes that unit is mW/GHz/V^2
>>
>> Instead of choosing GHz as the base, I'd prefer to use uW/MHz/V^2. It'll
>> avoid introducing fractional GHz value for frequency calculations.
>
> I don't understand your concern about fractional Ghz value for
> frequency calculation ?
> I mean, why it's a problem for frequency with Ghz vs Mhz but not a
> problem for voltage with V vs mV ?
> Don't we have the same "problem" in both case ?

You're right. It's the same problem in both cases.

>>
>> > In drivers/thermal/cpu_cooling.c, the code is :
>> >
>> > power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
>> > do_div(power, 10);
>> >
>> > which can be summarized as :
>> > power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2
>>
>> Which would then translate to -
>>
>> power (mW) = power (uW) / 1000 = capacitance * freq_mhz * (voltage_mv/1000)^2
>
> Not sure that the equation above is correct. If we consider uW/MHz/V^2
> for the unit, the equation becomes :
> power (mW) = power (uW) / 1000 = capacitance * freq_mhz *
> (voltage_mv/1000)^2 / 1000

Yes, I missed the "/ 1000" at the end.

> which can be rearranged as
> power (mW) = power (uW) / 1000 = capacitance * freq_mhz/ 1000 *
> (voltage_mv/1000)^2
>
> TBH, I don't really mind between  mW/GHz/V^2 or uW/MHz/V^2 as they are
> the same at the end
> but I don't catch your reasoning

The problem I was thinking of doesn't hold as it's the same issue with
voltage.

One benefit to go with uW/MHz/V^2 might be the extra resolution that it
provides. I'd prefer to go with uW/MHz/V^2 if there's no compelling
reason to go with anything else.


[...]

>> >
>> > Furthermore, if we test basic values like :
>> > voltage_mv = 1000mV = 1V
>> > freq_mhz = 1000Mhz = 1Ghz
>> >
>> > The minimum possible power, when dynamic-power-coefficient equals 1, will
>> > be :
>> > min power = 1 * 1000  * (100)^2 = 10^15 mW
>> > which is not realistic
>> >
>> > With the unit used by the code, the min power is
>> > min power =  1 * 1 * 1^2 = 1mW which is far more realistic
>> >
>> > Signed-off-by: Vincent Guittot 
>> > ---
>> >  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>> >  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
>> > b/Documentation/devicetree/bindings/arm/cpus.txt
>> > index 29e1dc5..0148d7d 100644
>> > --- a/Documentation/devicetree/bindings/arm/cpus.txt
>> > +++ b/Documentation/devicetree/bindings/arm/cpus.txt
>> > @@ -274,7 +274,7 @@ described below.
>> >   Usage: optional
>> >   Value type: 
>> >   Definition: A u32 value that represents the running time 
>> > dynamic
>> > - power coefficient in units of mW/MHz/uV^2. The
>> > + power coefficient in units of mW/GHz/V^2. The
>> >   coefficient can either be calculated from power
>> >   measurements or derived by analysis.
>> >
>> > @@ -285,7 +285,7 @@ described below.
>> >
>> >   Pdyn = dynamic-power-coefficient * V^2 * f
>> >
>> > - where voltage is in uV, frequency is in MHz.
>> > + where voltage is in V, frequency is in GHz.
>> >
>> >  Example 1 (dual-cluster big.LITTLE system 32-bit):

Re: [PATCH] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-22 Thread Punit Agrawal

Hi Vincent,

Thanks for the patch. One comment about the choice of units below.

Vincent Guittot  writes:

> The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
> its usage in the code assumes that unit is mW/GHz/V^2

Instead of choosing GHz as the base, I'd prefer to use uW/MHz/V^2. It'll
avoid introducing fractional GHz value for frequency calculations.

> In drivers/thermal/cpu_cooling.c, the code is :
>
> power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
> do_div(power, 10);
>
> which can be summarized as :
> power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2

Which would then translate to -

power (mW) = power (uW) / 1000 = capacitance * freq_mhz * (voltage_mv/1000)^2

Thanks,
Punit

>
> Furthermore, if we test basic values like :
> voltage_mv = 1000mV = 1V
> freq_mhz = 1000Mhz = 1Ghz
>
> The minimum possible power, when dynamic-power-coefficient equals 1, will
> be :
> min power = 1 * 1000  * (100)^2 = 10^15 mW
> which is not realistic
>
> With the unit used by the code, the min power is
> min power =  1 * 1 * 1^2 = 1mW which is far more realistic
>
> Signed-off-by: Vincent Guittot 
> ---
>  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
> b/Documentation/devicetree/bindings/arm/cpus.txt
> index 29e1dc5..0148d7d 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ b/Documentation/devicetree/bindings/arm/cpus.txt
> @@ -274,7 +274,7 @@ described below.
>   Usage: optional
>   Value type: 
>   Definition: A u32 value that represents the running time dynamic
> - power coefficient in units of mW/MHz/uV^2. The
> + power coefficient in units of mW/GHz/V^2. The
>   coefficient can either be calculated from power
>   measurements or derived by analysis.
>  
> @@ -285,7 +285,7 @@ described below.
>  
>   Pdyn = dynamic-power-coefficient * V^2 * f
>  
> - where voltage is in uV, frequency is in MHz.
> + where voltage is in V, frequency is in GHz.
>  
>  Example 1 (dual-cluster big.LITTLE system 32-bit):

Re: [PATCH] dt-binding: arm/cpus.txt: fix dynamic-power-coefficient unit

2018-08-22 Thread Punit Agrawal

Hi Vincent,

Thanks for the patch. One comment about the choice of units below.

Vincent Guittot  writes:

> The unit of dynamic-power-coefficient is described as mW/MHz/uV^2 whereas
> its usage in the code assumes that unit is mW/GHz/V^2

Instead of choosing GHz as the base, I'd prefer to use uW/MHz/V^2. It'll
avoid introducing fractional GHz value for frequency calculations.

> In drivers/thermal/cpu_cooling.c, the code is :
>
> power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
> do_div(power, 10);
>
> which can be summarized as :
> power (mW) = capacitance * freq_mhz/1000 * (voltage_mv/1000)^2

Which would then translate to -

power (mW) = power (uW) / 1000 = capacitance * freq_mhz * (voltage_mv/1000)^2

Thanks,
Punit

>
> Furthermore, if we test basic values like :
> voltage_mv = 1000mV = 1V
> freq_mhz = 1000Mhz = 1Ghz
>
> The minimum possible power, when dynamic-power-coefficient equals 1, will
> be :
> min power = 1 * 1000  * (100)^2 = 10^15 mW
> which is not realistic
>
> With the unit used by the code, the min power is
> min power =  1 * 1 * 1^2 = 1mW which is far more realistic
>
> Signed-off-by: Vincent Guittot 
> ---
>  Documentation/devicetree/bindings/arm/cpus.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/arm/cpus.txt 
> b/Documentation/devicetree/bindings/arm/cpus.txt
> index 29e1dc5..0148d7d 100644
> --- a/Documentation/devicetree/bindings/arm/cpus.txt
> +++ b/Documentation/devicetree/bindings/arm/cpus.txt
> @@ -274,7 +274,7 @@ described below.
>   Usage: optional
>   Value type: 
>   Definition: A u32 value that represents the running time dynamic
> - power coefficient in units of mW/MHz/uV^2. The
> + power coefficient in units of mW/GHz/V^2. The
>   coefficient can either be calculated from power
>   measurements or derived by analysis.
>  
> @@ -285,7 +285,7 @@ described below.
>  
>   Pdyn = dynamic-power-coefficient * V^2 * f
>  
> - where voltage is in uV, frequency is in MHz.
> + where voltage is in V, frequency is in GHz.
>  
>  Example 1 (dual-cluster big.LITTLE system 32-bit):

Re: [PATCH] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-09 Thread Punit Agrawal

Bjorn Helgaas  writes:

> On Wed, Aug 08, 2018 at 03:44:03PM +0100, Punit Agrawal wrote:
>> Bjorn Helgaas  writes:
>> > On Thu, Aug 2, 2018 at 9:33 AM Lorenzo Pieralisi
>> >  wrote:
>> >> On Wed, Aug 01, 2018 at 02:38:51PM -0500, Jeremy Linton wrote:
>> >>
>> >> Jiang Liu does not work on the kernel anymore so we won't know
>> >> anytime soon the reasoning behind commit 965cd0e4a5e5
>> >>
>> >> > On 08/01/2018 12:31 PM, Punit Agrawal wrote:
>> >> > >Memory for host controller data structures is allocated local to the
>> >> > >node to which the controller is associated with. This has been the
>> >> > >behaviour since support for ACPI was added in
>> >> > >commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host 
>> >> > >controller").
>> >> >
>> >> > Which was apparently influenced by:
>> >> >
>> >> > 965cd0e4a5e5 x86, PCI, ACPI: Use kmalloc_node() to optimize for 
>> >> > performance
>> >> >
>> >> > Was there an actual use-case behind that change?
>> >> >
>> >> > I think this fixes the immediate boot problem, but if there is any
>> >> > perf advantage it seems wise to keep it... Particularly since x86
>> >> > seems to be doing the node sanitation in pci_acpi_root_get_node().
>> >>
>> >> I am struggling to see the perf advantage of allocating a struct
>> >> that the PCI controller will never read/write from a NUMA node that
>> >> is local to the PCI controller, happy to be corrected if there is
>> >> a sound rationale behind that.
>> >
>> > If there is no reason to use kzalloc_node() here, we shouldn't use it.
>> >
>> > But we should use it (or not use it) consistently across arches.  I do
>> > not believe there is an arch-specific reason to be different.
>> > Currently, pci_acpi_scan_root() uses kzalloc_node() on x86 and arm64,
>> > but kzalloc() on ia64.  They all ought to be the same.
>> 
>> From my understanding, arm64 use of kzalloc_node() was derived from the
>> x86 version. Maybe somebody familiar with behaviour on x86 can provide
>> input here.
>
> If you want to remove use of kzalloc_node(), I'm fine with that as
> long as you do it for x86 at the same time (maybe separate patches,
> but at least in the same series).
>
> I don't see any evidence in 965cd0e4a5e5 ("x86, PCI, ACPI: Use
> kmalloc_node() to optimize for performance") that it actually improves
> performance, so I'd be inclined to just use kzalloc().

Thanks for confirming.

I'm happy to add a patch updating x86 use of kzalloc_node() as
well. I'll post something once the merge window closes.

>
> Bjorn

Re: [PATCH] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-09 Thread Punit Agrawal

Bjorn Helgaas  writes:

> On Wed, Aug 08, 2018 at 03:44:03PM +0100, Punit Agrawal wrote:
>> Bjorn Helgaas  writes:
>> > On Thu, Aug 2, 2018 at 9:33 AM Lorenzo Pieralisi
>> >  wrote:
>> >> On Wed, Aug 01, 2018 at 02:38:51PM -0500, Jeremy Linton wrote:
>> >>
>> >> Jiang Liu does not work on the kernel anymore so we won't know
>> >> anytime soon the reasoning behind commit 965cd0e4a5e5
>> >>
>> >> > On 08/01/2018 12:31 PM, Punit Agrawal wrote:
>> >> > >Memory for host controller data structures is allocated local to the
>> >> > >node to which the controller is associated with. This has been the
>> >> > >behaviour since support for ACPI was added in
>> >> > >commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host 
>> >> > >controller").
>> >> >
>> >> > Which was apparently influenced by:
>> >> >
>> >> > 965cd0e4a5e5 x86, PCI, ACPI: Use kmalloc_node() to optimize for 
>> >> > performance
>> >> >
>> >> > Was there an actual use-case behind that change?
>> >> >
>> >> > I think this fixes the immediate boot problem, but if there is any
>> >> > perf advantage it seems wise to keep it... Particularly since x86
>> >> > seems to be doing the node sanitation in pci_acpi_root_get_node().
>> >>
>> >> I am struggling to see the perf advantage of allocating a struct
>> >> that the PCI controller will never read/write from a NUMA node that
>> >> is local to the PCI controller, happy to be corrected if there is
>> >> a sound rationale behind that.
>> >
>> > If there is no reason to use kzalloc_node() here, we shouldn't use it.
>> >
>> > But we should use it (or not use it) consistently across arches.  I do
>> > not believe there is an arch-specific reason to be different.
>> > Currently, pci_acpi_scan_root() uses kzalloc_node() on x86 and arm64,
>> > but kzalloc() on ia64.  They all ought to be the same.
>> 
>> From my understanding, arm64 use of kzalloc_node() was derived from the
>> x86 version. Maybe somebody familiar with behaviour on x86 can provide
>> input here.
>
> If you want to remove use of kzalloc_node(), I'm fine with that as
> long as you do it for x86 at the same time (maybe separate patches,
> but at least in the same series).
>
> I don't see any evidence in 965cd0e4a5e5 ("x86, PCI, ACPI: Use
> kmalloc_node() to optimize for performance") that it actually improves
> performance, so I'd be inclined to just use kzalloc().

Thanks for confirming.

I'm happy to add a patch updating x86 use of kzalloc_node() as
well. I'll post something once the merge window closes.

>
> Bjorn

Re: [PATCH] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-08 Thread Punit Agrawal

[+cc linux-acpi, linux-mm]

Bjorn Helgaas  writes:

> [+cc linux-pci, linux-kernel]
>
> On Thu, Aug 2, 2018 at 9:33 AM Lorenzo Pieralisi
>  wrote:
>>
>> On Wed, Aug 01, 2018 at 02:38:51PM -0500, Jeremy Linton wrote:
>> > Hi,
>> >
>> > +CC Jiang Lui
>>
>> Jiang Liu does not work on the kernel anymore so we won't know
>> anytime soon the reasoning behind commit 965cd0e4a5e5
>>
>> > On 08/01/2018 12:31 PM, Punit Agrawal wrote:
>> > >Memory for host controller data structures is allocated local to the
>> > >node to which the controller is associated with. This has been the
>> > >behaviour since support for ACPI was added in
>> > >commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host 
>> > >controller").
>> >
>> > Which was apparently influenced by:
>> >
>> > 965cd0e4a5e5 x86, PCI, ACPI: Use kmalloc_node() to optimize for performance
>> >
>> > Was there an actual use-case behind that change?
>> >
>> > I think this fixes the immediate boot problem, but if there is any
>> > perf advantage it seems wise to keep it... Particularly since x86
>> > seems to be doing the node sanitation in pci_acpi_root_get_node().
>>
>> I am struggling to see the perf advantage of allocating a struct
>> that the PCI controller will never read/write from a NUMA node that
>> is local to the PCI controller, happy to be corrected if there is
>> a sound rationale behind that.
>
> If there is no reason to use kzalloc_node() here, we shouldn't use it.
>
> But we should use it (or not use it) consistently across arches.  I do
> not believe there is an arch-specific reason to be different.
> Currently, pci_acpi_scan_root() uses kzalloc_node() on x86 and arm64,
> but kzalloc() on ia64.  They all ought to be the same.

>From my understanding, arm64 use of kzalloc_node() was derived from the
x86 version. Maybe somebody familiar with behaviour on x86 can provide
input here.

>
>> > >Drop the node local allocation as there is no benefit from doing so -
>> > >the usage of these structures is independent from where the controller
>> > >is located. It also causes problem during probe if the associated numa
>> > >node hasn't been initialised due to booting with restricted cpus via
>> > >kernel command line or where the node doesn't have cpus or memory
>> > >associated with it.
>
> I do not support the avoidance of kzalloc_node() as a means of working
> around the problem of a NUMA node not being initialized correctly.

Agreed.

The patch is not meant as a work around for uninitialised NUMA nodes. I
mention it in the commit log as the context where this was
discovered. It seems to cause conflation of the issue addressed by the
patch - I'll drop it in future postings.

> We got that node number from acpi_get_node().  I think we should be
> able to pass it to kzalloc_node() and expect something reasonable,
> i.e., either a successful allocation from the desired node (or from a
> node that is present) or an error return.  I don't think the caller is
> in a position to figure out whether it's safe to use kzalloc_node() or
> not.

The problem is that acpi_get_node() just maps the proximity domain to a
node id, without initialising the node pglist_data or setting up the
zonelists. This happens in the case when the proximity domain is not
present in the SRAT (proximity domain has no attached cpus or memory)
but first encountered as part of _PXM() method for a device.

One approach to solve this problem would be to create
acpi_get_online_node() which attempts to online a NUMA node if it isn't
already or at the least return NUMA_NO_NODE.

Another would be to have kzalloc_node() perform the checks for online
node and return error if not.

It's not clear what the preferred approach is.

>
>> > >Signed-off-by: Punit Agrawal 
>> > >Cc: Catalin Marinas 
>> > >Cc: Will Deacon 
>> > >Cc: Lorenzo Pieralisi 
>> > >---
>> > >Hi,
>> > >
>> > >This came up in the context of investigating the boot issues reported
>> > >due to restricted cpus or buggy firmware. Part of the problem is fixed
>> > >by Lorenzo's rework of NUMA initialisation[0].
>> > >
>> > >But there also doesn't seem to be any justification for using
>> > >node-local allocation to begin with.
>> > >
>> > >Thanks,
>> > >Punit
>> > >
>> > >[0] https://patchwork.kernel.org/patch/10486001/
>> > >
>> > >  arch/arm64/kernel/pci.c | 5 ++---
>> > >  1 file changed, 2

Re: [PATCH] arm64: PCI: Remove node-local allocations when initialising host controller

2018-08-08 Thread Punit Agrawal

[+cc linux-acpi, linux-mm]

Bjorn Helgaas  writes:

> [+cc linux-pci, linux-kernel]
>
> On Thu, Aug 2, 2018 at 9:33 AM Lorenzo Pieralisi
>  wrote:
>>
>> On Wed, Aug 01, 2018 at 02:38:51PM -0500, Jeremy Linton wrote:
>> > Hi,
>> >
>> > +CC Jiang Lui
>>
>> Jiang Liu does not work on the kernel anymore so we won't know
>> anytime soon the reasoning behind commit 965cd0e4a5e5
>>
>> > On 08/01/2018 12:31 PM, Punit Agrawal wrote:
>> > >Memory for host controller data structures is allocated local to the
>> > >node to which the controller is associated with. This has been the
>> > >behaviour since support for ACPI was added in
>> > >commit 0cb0786bac15 ("ARM64: PCI: Support ACPI-based PCI host 
>> > >controller").
>> >
>> > Which was apparently influenced by:
>> >
>> > 965cd0e4a5e5 x86, PCI, ACPI: Use kmalloc_node() to optimize for performance
>> >
>> > Was there an actual use-case behind that change?
>> >
>> > I think this fixes the immediate boot problem, but if there is any
>> > perf advantage it seems wise to keep it... Particularly since x86
>> > seems to be doing the node sanitation in pci_acpi_root_get_node().
>>
>> I am struggling to see the perf advantage of allocating a struct
>> that the PCI controller will never read/write from a NUMA node that
>> is local to the PCI controller, happy to be corrected if there is
>> a sound rationale behind that.
>
> If there is no reason to use kzalloc_node() here, we shouldn't use it.
>
> But we should use it (or not use it) consistently across arches.  I do
> not believe there is an arch-specific reason to be different.
> Currently, pci_acpi_scan_root() uses kzalloc_node() on x86 and arm64,
> but kzalloc() on ia64.  They all ought to be the same.

>From my understanding, arm64 use of kzalloc_node() was derived from the
x86 version. Maybe somebody familiar with behaviour on x86 can provide
input here.

>
>> > >Drop the node local allocation as there is no benefit from doing so -
>> > >the usage of these structures is independent from where the controller
>> > >is located. It also causes problem during probe if the associated numa
>> > >node hasn't been initialised due to booting with restricted cpus via
>> > >kernel command line or where the node doesn't have cpus or memory
>> > >associated with it.
>
> I do not support the avoidance of kzalloc_node() as a means of working
> around the problem of a NUMA node not being initialized correctly.

Agreed.

The patch is not meant as a work around for uninitialised NUMA nodes. I
mention it in the commit log as the context where this was
discovered. It seems to cause conflation of the issue addressed by the
patch - I'll drop it in future postings.

> We got that node number from acpi_get_node().  I think we should be
> able to pass it to kzalloc_node() and expect something reasonable,
> i.e., either a successful allocation from the desired node (or from a
> node that is present) or an error return.  I don't think the caller is
> in a position to figure out whether it's safe to use kzalloc_node() or
> not.

The problem is that acpi_get_node() just maps the proximity domain to a
node id, without initialising the node pglist_data or setting up the
zonelists. This happens in the case when the proximity domain is not
present in the SRAT (proximity domain has no attached cpus or memory)
but first encountered as part of _PXM() method for a device.

One approach to solve this problem would be to create
acpi_get_online_node() which attempts to online a NUMA node if it isn't
already or at the least return NUMA_NO_NODE.

Another would be to have kzalloc_node() perform the checks for online
node and return error if not.

It's not clear what the preferred approach is.

>
>> > >Signed-off-by: Punit Agrawal 
>> > >Cc: Catalin Marinas 
>> > >Cc: Will Deacon 
>> > >Cc: Lorenzo Pieralisi 
>> > >---
>> > >Hi,
>> > >
>> > >This came up in the context of investigating the boot issues reported
>> > >due to restricted cpus or buggy firmware. Part of the problem is fixed
>> > >by Lorenzo's rework of NUMA initialisation[0].
>> > >
>> > >But there also doesn't seem to be any justification for using
>> > >node-local allocation to begin with.
>> > >
>> > >Thanks,
>> > >Punit
>> > >
>> > >[0] https://patchwork.kernel.org/patch/10486001/
>> > >
>> > >  arch/arm64/kernel/pci.c | 5 ++---
>> > >  1 file changed, 2

[PATCH v6 0/8] KVM: Support PUD hugepages at stage 2

2018-07-16 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage 2 a
feature that is useful on cores that have support for large sized TLB
mappings (e.g., 1GB for 4K granule).

This version adds tags and addresses feedback received on
v5. Additionally, Patch 1 (1 & 2 in this version) has been split to
help make it easy to review.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc5. The are a few conflicts with the support for 52
bit IPA[1] - I'll work with Suzuki to resolve this once the code is in
the right shape.

Thanks,
Punit

v5 -> v6

* Split Patch 1 to move out the refactoring of exec permissions on
  page table entries.
* Patch 4 - Initialise p*dpp pointers in stage2_get_leaf_entry()
* Patch 5 - Trigger a BUG() in kvm_pud_pfn() on arm

v4 -> v5:
* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1

v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg664276.html
[1] https://www.spinics.net/lists/kvm/msg171065.html

Punit Agrawal (8):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault
  KVM: arm/arm64: Introduce helpers to manipulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  61 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 294 ++---
 5 files changed, 341 insertions(+), 74 deletions(-)

-- 
2.17.1

[PATCH v6 0/8] KVM: Support PUD hugepages at stage 2

2018-07-16 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage 2 a
feature that is useful on cores that have support for large sized TLB
mappings (e.g., 1GB for 4K granule).

This version adds tags and addresses feedback received on
v5. Additionally, Patch 1 (1 & 2 in this version) has been split to
help make it easy to review.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc5. The are a few conflicts with the support for 52
bit IPA[1] - I'll work with Suzuki to resolve this once the code is in
the right shape.

Thanks,
Punit

v5 -> v6

* Split Patch 1 to move out the refactoring of exec permissions on
  page table entries.
* Patch 4 - Initialise p*dpp pointers in stage2_get_leaf_entry()
* Patch 5 - Trigger a BUG() in kvm_pud_pfn() on arm

v4 -> v5:
* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1

v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg664276.html
[1] https://www.spinics.net/lists/kvm/msg171065.html

Punit Agrawal (8):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Re-factor setting the Stage 2 entry to exec on fault
  KVM: arm/arm64: Introduce helpers to manipulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  61 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 294 ++---
 5 files changed, 341 insertions(+), 74 deletions(-)

-- 
2.17.1

[PATCH v5 0/7] KVM: Support PUD hugepages at stage 2

2018-07-09 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage
2. This feature is useful on cores that have support for large sized
TLB mappings (e.g., 1GB for 4K granule).

The biggest change in this version is to replace repeated instances of
walking the page tables to get to a leaf-entry with a function to
return the appropriate entry. This was suggested by Suzuki and should
help reduce the amount of churn resulting from future changes. It also
address other feedback on the previous version.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc4. The are a few conflicts with the support for 52
bit IPA[1] due to change in number of parameters for
stage2_pmd_offset().

Thanks,
Punit

v4 -> v5:

* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1


v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg663562.html
[1] https://www.spinics.net/lists/kvm/msg171065.html


Punit Agrawal (7):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Introduce helpers to manupulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  60 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 289 ++---
 5 files changed, 330 insertions(+), 79 deletions(-)

-- 
2.17.1

[PATCH v5 0/7] KVM: Support PUD hugepages at stage 2

2018-07-09 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage
2. This feature is useful on cores that have support for large sized
TLB mappings (e.g., 1GB for 4K granule).

The biggest change in this version is to replace repeated instances of
walking the page tables to get to a leaf-entry with a function to
return the appropriate entry. This was suggested by Suzuki and should
help reduce the amount of churn resulting from future changes. It also
address other feedback on the previous version.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc4. The are a few conflicts with the support for 52
bit IPA[1] due to change in number of parameters for
stage2_pmd_offset().

Thanks,
Punit

v4 -> v5:

* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1


v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg663562.html
[1] https://www.spinics.net/lists/kvm/msg171065.html


Punit Agrawal (7):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Introduce helpers to manupulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  60 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 289 ++---
 5 files changed, 330 insertions(+), 79 deletions(-)

-- 
2.17.1

[PATCH v5 0/7] KVM: Support PUD hugepages at stage 2

2018-07-09 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage
2. This feature is useful on cores that have support for large sized
TLB mappings (e.g., 1GB for 4K granule).

The biggest change in this version is to replace repeated instances of
walking the page tables to get to a leaf-entry with a function to
return the appropriate entry. This was suggested by Suzuki and should
help reduce the amount of churn resulting from future changes. It also
address other feedback on the previous version.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc4. The are a few conflicts with the support for 52
bit IPA[1] due to change in number of parameters for
stage2_pmd_offset().

Thanks,
Punit

v4 -> v5:
* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1

v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg663562.html
[1] https://www.spinics.net/lists/kvm/msg171065.html

Punit Agrawal (7):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Introduce helpers to manupulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  60 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 289 ++---
 5 files changed, 330 insertions(+), 79 deletions(-)

-- 
2.17.1

[PATCH v5 0/7] KVM: Support PUD hugepages at stage 2

2018-07-09 Thread Punit Agrawal

This series is an update to the PUD hugepage support previously posted
at [0]. This patchset adds support for PUD hugepages at stage
2. This feature is useful on cores that have support for large sized
TLB mappings (e.g., 1GB for 4K granule).

The biggest change in this version is to replace repeated instances of
walking the page tables to get to a leaf-entry with a function to
return the appropriate entry. This was suggested by Suzuki and should
help reduce the amount of churn resulting from future changes. It also
address other feedback on the previous version.

Support is added to code that is shared between arm and arm64. Dummy
helpers for arm are provided as the port does not support PUD hugepage
sizes.

The patches have been tested on an A57 based system. The patchset is
based on v4.18-rc4. The are a few conflicts with the support for 52
bit IPA[1] due to change in number of parameters for
stage2_pmd_offset().

Thanks,
Punit

v4 -> v5:
* Patch 1 - Drop helper stage2_should_exec() and refactor the
  condition to decide if a page table entry should be marked
  executable
* Patch 4-6 - Introduce stage2_get_leaf_entry() and use it in this and
  latter patches
* Patch 7 - Use stage 2 accessors instead of using the page table
  helpers directly
* Patch 7 - Add a note to update the PUD hugepage support when number
  of levels of stage 2 tables differs from stage 1

v3 -> v4:
* Patch 1 and 7 - Don't put down hugepages pte if logging is enabled
* Patch 4-5 - Add PUD hugepage support for exec and access faults
* Patch 6 - PUD hugepage support for aging page table entries

v2 -> v3:
* Update vma_pagesize directly if THP [1/4]. Previsouly this was done
  indirectly via hugetlb
* Added review tag [4/4]

v1 -> v2:
* Create helper to check if the page should have exec permission [1/4]
* Fix broken condition to detect THP hugepage [1/4]
* Fix in-correct hunk resulting from a rebase [4/4]

[0] https://www.spinics.net/lists/arm-kernel/msg663562.html
[1] https://www.spinics.net/lists/kvm/msg171065.html

Punit Agrawal (7):
  KVM: arm/arm64: Share common code in user_mem_abort()
  KVM: arm/arm64: Introduce helpers to manupulate page table entries
  KVM: arm64: Support dirty page tracking for PUD hugepages
  KVM: arm64: Support PUD hugepage in stage2_is_exec()
  KVM: arm64: Support handling access faults for PUD hugepages
  KVM: arm64: Update age handlers to support PUD hugepages
  KVM: arm64: Add support for creating PUD hugepages at stage 2

 arch/arm/include/asm/kvm_mmu.h |  60 +
 arch/arm64/include/asm/kvm_mmu.h   |  47 
 arch/arm64/include/asm/pgtable-hwdef.h |   4 +
 arch/arm64/include/asm/pgtable.h   |   9 +
 virt/kvm/arm/mmu.c | 289 ++---
 5 files changed, 330 insertions(+), 79 deletions(-)

-- 
2.17.1

Re: [PATCH v4 7/7] KVM: arm64: Add support for creating PUD hugepages at stage 2

2018-07-06 Thread Punit Agrawal

Suzuki K Poulose  writes:

> Hi Punit,
>
> On 05/07/18 15:08, Punit Agrawal wrote:
>> KVM only supports PMD hugepages at stage 2. Now that the various page
>> handling routines are updated, extend the stage 2 fault handling to
>> map in PUD hugepages.
>>
>> Addition of PUD hugepage support enables additional page sizes (e.g.,
>> 1G with 4K granule) which can be useful on cores that support mapping
>> larger block sizes in the TLB entries.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> Cc: Russell King 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 0c04c64e858c..5912210e94d9 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -116,6 +116,25 @@ static void stage2_dissolve_pmd(struct kvm *kvm, 
>> phys_addr_t addr, pmd_t *pmd)
>>  put_page(virt_to_page(pmd));
>>   }
>>   +/**
>> + * stage2_dissolve_pud() - clear and flush huge PUD entry
>> + * @kvm:pointer to kvm structure.
>> + * @addr:   IPA
>> + * @pud:pud pointer for IPA
>> + *
>> + * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. Marks 
>> all
>> + * pages in the range dirty.
>> + */
>> +static void stage2_dissolve_pud(struct kvm *kvm, phys_addr_t addr, pud_t 
>> *pud)
>> +{
>> +if (!pud_huge(*pud))
>> +return;
>> +
>> +pud_clear(pud);
>
> You need to use the stage2_ accessors here. The stage2_dissolve_pmd() uses
> "pmd_" helpers as the PTE entries (level 3) are always guaranteed to exist.

I've fixed this and other uses of the PUD helpers to go via the stage2_
accessors.

I've still not quite come to terms with the lack of certain levels at
stage 2 vis-a-vis stage 1. I'll be more careful about this going
forward.

>
>> +kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +put_page(virt_to_page(pud));
>> +}
>> +
>>   static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>>int min, int max)
>>   {
>> @@ -993,7 +1012,7 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct 
>> kvm_mmu_memory_cache *cache
>>  pmd_t *pmd;
>>  pud = stage2_get_pud(kvm, cache, addr);
>> -if (!pud)
>> +if (!pud || pud_huge(*pud))
>>  return NULL;
>
> Same here.
>
>>  if (stage2_pud_none(*pud)) {
>
> Like this ^
>
>> @@ -1038,6 +1057,26 @@ static int stage2_set_pmd_huge(struct kvm *kvm, 
>> struct kvm_mmu_memory_cache
>>  return 0;
>>   }
>>   +static int stage2_set_pud_huge(struct kvm *kvm, struct
>> kvm_mmu_memory_cache *cache,
>> +   phys_addr_t addr, const pud_t *new_pud)
>> +{
>> +pud_t *pud, old_pud;
>> +
>> +pud = stage2_get_pud(kvm, cache, addr);
>> +VM_BUG_ON(!pud);
>> +
>> +old_pud = *pud;
>> +if (pud_present(old_pud)) {
>> +pud_clear(pud);
>> +kvm_tlb_flush_vmid_ipa(kvm, addr);
>
> Same here.
>
>> +} else {
>> +get_page(virt_to_page(pud));
>> +}
>> +
>> +kvm_set_pud(pud, *new_pud);
>> +return 0;
>> +}
>> +
>>   static bool stage2_is_exec(struct kvm *kvm, phys_addr_t addr)
>>   {
>>  pud_t *pudp;

[...]

>> @@ -1572,7 +1631,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (exec_fault)
>>  invalidate_icache_guest_page(pfn, vma_pagesize);
>>   -  if (hugetlb && vma_pagesize == PMD_SIZE) {
>> +if (hugetlb && vma_pagesize == PUD_SIZE) {
>
> I think we may need to check if the stage2 indeed has 3 levels of
> tables to use stage2 PUD. Otherwise, fall back to PTE level mapping
> or even PMD huge pages. Also, this cannot be triggered right now,
> as we only get PUD hugepages with 4K and we are guaranteed to have
> at least 3 levels with 40bit IPA. May be I can take care of it in
> the Dynamic IPA series, when we run a guest with say 32bit IPA.
> So for now, it is worth adding a comment here.

Good point. I've added the following comment.

/*
 * PUD level may not exist if the guest boots with two
 * levels at Stage 2. This configuration is currently
 * not supported due to IPA size supported by KVM.
 *
 * Revisit the assumptions about PUD levels when
 * additional IPA sizes are supported by KVM.
 */

Let me know if looks OK to you.

Thanks a lot for reviewing the patches.

Punit

>
>> +pud_t new_pud = kvm_pfn_pud(pfn, mem_type);
>> +
>> +new_pud = kvm_pud_mkhuge(new_pud);
>> +if (writable)
>> +new_pud = kvm_s2pud_mkwrite(new_pud);
>> +
>> +if (stage2_should_exec(kvm, fault_ipa, exec_fault, 
>> fault_status))
>> +new_pud = kvm_s2pud_mkexec(new_pud);
>> +
>> +ret = stage2_set_pud_huge(kvm, memcache, fault_ipa, _pud);
>> +} else if (hugetlb && vma_pagesize == PMD_SIZE) {
>
> Suzuki
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Re: [PATCH v4 7/7] KVM: arm64: Add support for creating PUD hugepages at stage 2

2018-07-06 Thread Punit Agrawal

Suzuki K Poulose  writes:

> Hi Punit,
>
> On 05/07/18 15:08, Punit Agrawal wrote:
>> KVM only supports PMD hugepages at stage 2. Now that the various page
>> handling routines are updated, extend the stage 2 fault handling to
>> map in PUD hugepages.
>>
>> Addition of PUD hugepage support enables additional page sizes (e.g.,
>> 1G with 4K granule) which can be useful on cores that support mapping
>> larger block sizes in the TLB entries.
>>
>> Signed-off-by: Punit Agrawal 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> Cc: Russell King 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 0c04c64e858c..5912210e94d9 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -116,6 +116,25 @@ static void stage2_dissolve_pmd(struct kvm *kvm, 
>> phys_addr_t addr, pmd_t *pmd)
>>  put_page(virt_to_page(pmd));
>>   }
>>   +/**
>> + * stage2_dissolve_pud() - clear and flush huge PUD entry
>> + * @kvm:pointer to kvm structure.
>> + * @addr:   IPA
>> + * @pud:pud pointer for IPA
>> + *
>> + * Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. Marks 
>> all
>> + * pages in the range dirty.
>> + */
>> +static void stage2_dissolve_pud(struct kvm *kvm, phys_addr_t addr, pud_t 
>> *pud)
>> +{
>> +if (!pud_huge(*pud))
>> +return;
>> +
>> +pud_clear(pud);
>
> You need to use the stage2_ accessors here. The stage2_dissolve_pmd() uses
> "pmd_" helpers as the PTE entries (level 3) are always guaranteed to exist.

I've fixed this and other uses of the PUD helpers to go via the stage2_
accessors.

I've still not quite come to terms with the lack of certain levels at
stage 2 vis-a-vis stage 1. I'll be more careful about this going
forward.

>
>> +kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +put_page(virt_to_page(pud));
>> +}
>> +
>>   static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>>int min, int max)
>>   {
>> @@ -993,7 +1012,7 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct 
>> kvm_mmu_memory_cache *cache
>>  pmd_t *pmd;
>>  pud = stage2_get_pud(kvm, cache, addr);
>> -if (!pud)
>> +if (!pud || pud_huge(*pud))
>>  return NULL;
>
> Same here.
>
>>  if (stage2_pud_none(*pud)) {
>
> Like this ^
>
>> @@ -1038,6 +1057,26 @@ static int stage2_set_pmd_huge(struct kvm *kvm, 
>> struct kvm_mmu_memory_cache
>>  return 0;
>>   }
>>   +static int stage2_set_pud_huge(struct kvm *kvm, struct
>> kvm_mmu_memory_cache *cache,
>> +   phys_addr_t addr, const pud_t *new_pud)
>> +{
>> +pud_t *pud, old_pud;
>> +
>> +pud = stage2_get_pud(kvm, cache, addr);
>> +VM_BUG_ON(!pud);
>> +
>> +old_pud = *pud;
>> +if (pud_present(old_pud)) {
>> +pud_clear(pud);
>> +kvm_tlb_flush_vmid_ipa(kvm, addr);
>
> Same here.
>
>> +} else {
>> +get_page(virt_to_page(pud));
>> +}
>> +
>> +kvm_set_pud(pud, *new_pud);
>> +return 0;
>> +}
>> +
>>   static bool stage2_is_exec(struct kvm *kvm, phys_addr_t addr)
>>   {
>>  pud_t *pudp;

[...]

>> @@ -1572,7 +1631,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (exec_fault)
>>  invalidate_icache_guest_page(pfn, vma_pagesize);
>>   -  if (hugetlb && vma_pagesize == PMD_SIZE) {
>> +if (hugetlb && vma_pagesize == PUD_SIZE) {
>
> I think we may need to check if the stage2 indeed has 3 levels of
> tables to use stage2 PUD. Otherwise, fall back to PTE level mapping
> or even PMD huge pages. Also, this cannot be triggered right now,
> as we only get PUD hugepages with 4K and we are guaranteed to have
> at least 3 levels with 40bit IPA. May be I can take care of it in
> the Dynamic IPA series, when we run a guest with say 32bit IPA.
> So for now, it is worth adding a comment here.

Good point. I've added the following comment.

/*
 * PUD level may not exist if the guest boots with two
 * levels at Stage 2. This configuration is currently
 * not supported due to IPA size supported by KVM.
 *
 * Revisit the assumptions about PUD levels when
 * additional IPA sizes are supported by KVM.
 */

Let me know if looks OK to you.

Thanks a lot for reviewing the patches.

Punit

>
>> +pud_t new_pud = kvm_pfn_pud(pfn, mem_type);
>> +
>> +new_pud = kvm_pud_mkhuge(new_pud);
>> +if (writable)
>> +new_pud = kvm_s2pud_mkwrite(new_pud);
>> +
>> +if (stage2_should_exec(kvm, fault_ipa, exec_fault, 
>> fault_status))
>> +new_pud = kvm_s2pud_mkexec(new_pud);
>> +
>> +ret = stage2_set_pud_huge(kvm, memcache, fault_ipa, _pud);
>> +} else if (hugetlb && vma_pagesize == PMD_SIZE) {
>
> Suzuki
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Re: [PATCH v4 1/7] KVM: arm/arm64: Share common code in user_mem_abort()

2018-07-05 Thread Punit Agrawal

Marc Zyngier  writes:

> Hi Punit,
>
> On 05/07/18 15:08, Punit Agrawal wrote:
>> The code for operations such as marking the pfn as dirty, and
>> dcache/icache maintenance during stage 2 fault handling is duplicated
>> between normal pages and PMD hugepages.
>> 
>> Instead of creating another copy of the operations when we introduce
>> PUD hugepages, let's share them across the different pagesizes.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> ---
>>  virt/kvm/arm/mmu.c | 68 +++---
>>  1 file changed, 40 insertions(+), 28 deletions(-)
>> 
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 1d90d79706bd..dd14cc36c51c 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1398,6 +1398,21 @@ static void invalidate_icache_guest_page(kvm_pfn_t 
>> pfn, unsigned long size)
>>  __invalidate_icache_guest_page(pfn, size);
>>  }
>>  
>> +static bool stage2_should_exec(struct kvm *kvm, phys_addr_t addr,
>> +   bool exec_fault, unsigned long fault_status)
>
> I find this "should exec" very confusing.
>
>> +{
>> +/*
>> + * If we took an execution fault we will have made the
>> + * icache/dcache coherent and should now let the s2 mapping be
>> + * executable.
>> + *
>> + * Write faults (!exec_fault && FSC_PERM) are orthogonal to
>> + * execute permissions, and we preserve whatever we have.
>> + */
>> +return exec_fault ||
>> +(fault_status == FSC_PERM && stage2_is_exec(kvm, addr));
>> +}
>> +
>>  static void kvm_send_hwpoison_signal(unsigned long address,
>>   struct vm_area_struct *vma)
>>  {
>> @@ -1431,7 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  kvm_pfn_t pfn;
>>  pgprot_t mem_type = PAGE_S2;
>>  bool logging_active = memslot_is_logging(memslot);
>> -unsigned long flags = 0;
>> +unsigned long vma_pagesize, flags = 0;
>>  
>>  write_fault = kvm_is_write_fault(vcpu);
>>  exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>> @@ -1451,7 +1466,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>> +vma_pagesize = vma_kernel_pagesize(vma);
>> +if (vma_pagesize == PMD_SIZE && !logging_active) {
>>  hugetlb = true;
>>  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>>  } else {
>> @@ -1520,28 +1536,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (mmu_notifier_retry(kvm, mmu_seq))
>>  goto out_unlock;
>>  
>> -if (!hugetlb && !force_pte)
>> +if (!hugetlb && !force_pte) {
>> +/*
>> + * Only PMD_SIZE transparent hugepages(THP) are
>> + * currently supported. This code will need to be
>> + * updated to support other THP sizes.
>> + */
>>  hugetlb = transparent_hugepage_adjust(, _ipa);
>> +if (hugetlb)
>> +vma_pagesize = PMD_SIZE;
>> +}
>> +
>> +if (writable)
>> +kvm_set_pfn_dirty(pfn);
>>  
>> -if (hugetlb) {
>> +if (fault_status != FSC_PERM)
>> +clean_dcache_guest_page(pfn, vma_pagesize);
>> +
>> +if (exec_fault)
>> +invalidate_icache_guest_page(pfn, vma_pagesize);
>> +
>> +if (hugetlb && vma_pagesize == PMD_SIZE) {
>>  pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>>  new_pmd = pmd_mkhuge(new_pmd);
>> -if (writable) {
>> +if (writable)
>>  new_pmd = kvm_s2pmd_mkwrite(new_pmd);
>> -kvm_set_pfn_dirty(pfn);
>> -}
>>  
>> -if (fault_status != FSC_PERM)
>> -clean_dcache_guest_page(pfn, PMD_SIZE);
>> -
>> -if (exec_fault) {
>> +if (stage2_should_exec(kvm, fault_ipa, exec_fault, 
>> fault_status))
>>  new_pmd = kvm_s2pmd_mkexec(new_pmd);
>
> OK, I find this absolutely horrid... ;-)
>
> The rest of the function deals with discrete flags, and all of a sudden
> we have a function call with a bunch of seemingly unrelated parameters.
> And you are repeating it for each vma_pagesize...
>
> How about something like:
>
>   bool needs_exec;
>
>   [...]
>
>   needs_exec = exec_fault || (fault_status == FSC_PERM &&
>   stage2_is_exec(kvm, fault_ipa);
>
> And then you just check needs_exec to update the pte/pmd. And you drop
> this helper.

That does look a lot better. I'll roll the change into the next version.

Thanks,
Punit

[...]

Re: [PATCH v4 1/7] KVM: arm/arm64: Share common code in user_mem_abort()

2018-07-05 Thread Punit Agrawal

Marc Zyngier  writes:

> Hi Punit,
>
> On 05/07/18 15:08, Punit Agrawal wrote:
>> The code for operations such as marking the pfn as dirty, and
>> dcache/icache maintenance during stage 2 fault handling is duplicated
>> between normal pages and PMD hugepages.
>> 
>> Instead of creating another copy of the operations when we introduce
>> PUD hugepages, let's share them across the different pagesizes.
>> 
>> Signed-off-by: Punit Agrawal 
>> Cc: Christoffer Dall 
>> Cc: Marc Zyngier 
>> ---
>>  virt/kvm/arm/mmu.c | 68 +++---
>>  1 file changed, 40 insertions(+), 28 deletions(-)
>> 
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 1d90d79706bd..dd14cc36c51c 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1398,6 +1398,21 @@ static void invalidate_icache_guest_page(kvm_pfn_t 
>> pfn, unsigned long size)
>>  __invalidate_icache_guest_page(pfn, size);
>>  }
>>  
>> +static bool stage2_should_exec(struct kvm *kvm, phys_addr_t addr,
>> +   bool exec_fault, unsigned long fault_status)
>
> I find this "should exec" very confusing.
>
>> +{
>> +/*
>> + * If we took an execution fault we will have made the
>> + * icache/dcache coherent and should now let the s2 mapping be
>> + * executable.
>> + *
>> + * Write faults (!exec_fault && FSC_PERM) are orthogonal to
>> + * execute permissions, and we preserve whatever we have.
>> + */
>> +return exec_fault ||
>> +(fault_status == FSC_PERM && stage2_is_exec(kvm, addr));
>> +}
>> +
>>  static void kvm_send_hwpoison_signal(unsigned long address,
>>   struct vm_area_struct *vma)
>>  {
>> @@ -1431,7 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  kvm_pfn_t pfn;
>>  pgprot_t mem_type = PAGE_S2;
>>  bool logging_active = memslot_is_logging(memslot);
>> -unsigned long flags = 0;
>> +unsigned long vma_pagesize, flags = 0;
>>  
>>  write_fault = kvm_is_write_fault(vcpu);
>>  exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>> @@ -1451,7 +1466,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  return -EFAULT;
>>  }
>>  
>> -if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
>> +vma_pagesize = vma_kernel_pagesize(vma);
>> +if (vma_pagesize == PMD_SIZE && !logging_active) {
>>  hugetlb = true;
>>  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
>>  } else {
>> @@ -1520,28 +1536,34 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  if (mmu_notifier_retry(kvm, mmu_seq))
>>  goto out_unlock;
>>  
>> -if (!hugetlb && !force_pte)
>> +if (!hugetlb && !force_pte) {
>> +/*
>> + * Only PMD_SIZE transparent hugepages(THP) are
>> + * currently supported. This code will need to be
>> + * updated to support other THP sizes.
>> + */
>>  hugetlb = transparent_hugepage_adjust(, _ipa);
>> +if (hugetlb)
>> +vma_pagesize = PMD_SIZE;
>> +}
>> +
>> +if (writable)
>> +kvm_set_pfn_dirty(pfn);
>>  
>> -if (hugetlb) {
>> +if (fault_status != FSC_PERM)
>> +clean_dcache_guest_page(pfn, vma_pagesize);
>> +
>> +if (exec_fault)
>> +invalidate_icache_guest_page(pfn, vma_pagesize);
>> +
>> +if (hugetlb && vma_pagesize == PMD_SIZE) {
>>  pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>>  new_pmd = pmd_mkhuge(new_pmd);
>> -if (writable) {
>> +if (writable)
>>  new_pmd = kvm_s2pmd_mkwrite(new_pmd);
>> -kvm_set_pfn_dirty(pfn);
>> -}
>>  
>> -if (fault_status != FSC_PERM)
>> -clean_dcache_guest_page(pfn, PMD_SIZE);
>> -
>> -if (exec_fault) {
>> +if (stage2_should_exec(kvm, fault_ipa, exec_fault, 
>> fault_status))
>>  new_pmd = kvm_s2pmd_mkexec(new_pmd);
>
> OK, I find this absolutely horrid... ;-)
>
> The rest of the function deals with discrete flags, and all of a sudden
> we have a function call with a bunch of seemingly unrelated parameters.
> And you are repeating it for each vma_pagesize...
>
> How about something like:
>
>   bool needs_exec;
>
>   [...]
>
>   needs_exec = exec_fault || (fault_status == FSC_PERM &&
>   stage2_is_exec(kvm, fault_ipa);
>
> And then you just check needs_exec to update the pte/pmd. And you drop
> this helper.

That does look a lot better. I'll roll the change into the next version.

Thanks,
Punit

[...]

use-after-free in NFS

2018-06-29 Thread Punit Agrawal

Hi,

Nightly LTP runs are hitting a use-after-free on upstream kernels when
running with 64k pages. There isn't a specific test triggering the
issue. Also, the problem is not encountered with 4k pages.

The boards used for the nightly runs mount their filesystem (Debian
Jessie) via NFS with the following filesystem relevant command line
options -

"root=/dev/nfs rw nfsroot=:,tcp,hard,intr,vers=3 
rootwait"

I suspect that the larger page size is uncovering a latent race but I'm
not familiar with NFS client implementation and so could be barking up
the wrong tree. Frustratingly I've not been able to reproduce it locally
to create a reproducer.

Thanks,
Punit

 >8 

[ 4116.235125] LTP: starting data_space
[ 4116.497897] [ cut here ]
[ 4116.511558] refcount_t: increment on 0; use-after-free.
[ 4116.522421] WARNING: CPU: 4 PID: 6 at lib/refcount.c:153 
refcount_inc+0x44/0x50
[ 4116.537845] Modules linked in: overlay btrfs libcrc32c xor zstd_decompress 
zstd_compress xxhash raid6_pq fuse crc32_ce crct10dif_ce ipv6
[ 4116.562596] CPU: 4 PID: 6 Comm: kworker/4:0 Not tainted 
4.18.0-rc2-00119-g59ec39fe3826 #1
[ 4116.579770] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno 
Development Platform, BIOS EDK II Feb 13 2018
[ 4116.601505] Workqueue: nfsiod rpc_async_release
[ 4116.610627] pstate: 4005 (nZcv daif -PAN -UAO)
[ 4116.620273] pc : refcount_inc+0x44/0x50
[ 4116.627989] lr : refcount_inc+0x44/0x50
[ 4116.635705] sp : 0a7efba0
[ 4116.642369] x29: 0a7efba0 x28:  
[ 4116.653068] x27: 7fff x26: 0a7efca0 
[ 4116.663767] x25: 0008 x24: 80092054d6a0 
[ 4116.674465] x23: 800942ba7140 x22: 0a7efc90 
[ 4116.685162] x21: 800942ba7100 x20: 800942ba7b00 
[ 4116.695861] x19: 800942ba7100 x18: 0011 
[ 4116.706557] x17:  x16:  
[ 4116.717256] x15:  x14: 0400 
[ 4116.727953] x13: 0001 x12:  
[ 4116.738650] x11: 48ca x10: 0930 
[ 4116.749347] x9 : 0a7ef8a0 x8 : 8009416f0990 
[ 4116.760046] x7 : 8009416f00c0 x6 : 80097fba20c0 
[ 4116.770742] x5 : 80097fba20c0 x4 :  
[ 4116.781439] x3 : 80097fba8ea8 x2 : 8009416f 
[ 4116.792137] x1 : 2e35a06ccfe9d400 x0 :  
[ 4116.802834] Call trace:
[ 4116.807750]  refcount_inc+0x44/0x50
[ 4116.814770]  nfs_scan_commit_list+0x84/0x128
[ 4116.823362]  nfs_scan_commit.part.12+0x4c/0xe0
[ 4116.832305]  __nfs_commit_inode+0x100/0x190
[ 4116.840722]  nfs_io_completion_commit+0x14/0x20
[ 4116.849838]  nfs_io_completion_put+0x38/0x58
[ 4116.858430]  nfs_write_completion+0x108/0x1a0
[ 4116.867199]  nfs_pgio_release+0x14/0x20
[ 4116.874916]  rpc_free_task+0x28/0x48
[ 4116.882108]  rpc_async_release+0x10/0x18
[ 4116.890002]  process_one_work+0x1dc/0x330
[ 4116.898068]  worker_thread+0x48/0x430
[ 4116.905433]  kthread+0xf8/0x128
[ 4116.911750]  ret_from_fork+0x10/0x18
[ 4116.918939] ---[ end trace 9745b48d60ab22f6 ]---
[ 4116.931729] WARNING: CPU: 4 PID: 6 at fs/nfs/pagelist.c:426 
nfs_free_request+0x150/0x160
[ 4116.948730] Modules linked in: overlay btrfs libcrc32c xor zstd_decompress 
zstd_compress xxhash raid6_pq fuse crc32_ce crct10dif_ce ipv6
[ 4116.973478] CPU: 4 PID: 6 Comm: kworker/4:0 Tainted: GW 
4.18.0-rc2-00119-g59ec39fe3826 #1
[ 4116.993452] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno 
Development Platform, BIOS EDK II Feb 13 2018
[ 4117.015185] Workqueue: nfsiod rpc_async_release
[ 4117.024306] pstate: 6005 (nZCv daif -PAN -UAO)
[ 4117.033951] pc : nfs_free_request+0x150/0x160
[ 4117.042718] lr : nfs_release_request+0x84/0x98
[ 4117.051657] sp : 0a7efc40
[ 4117.058321] x29: 0a7efc40 x28:  
[ 4117.069021] x27: 80097fbac330 x26: 091280f8 
[ 4117.079720] x25:  x24: 09113000 
[ 4117.090419] x23: 8009432338c8 x22: 8009432338d8 
[ 4117.101119] x21: 800942bce480 x20: 800942ba4b80 
[ 4117.111817] x19: 800942bce480 x18:  
[ 4117.122514] x17:  x16: 0001 
[ 4117.133211] x15:  x14: 0400 
[ 4117.143908] x13: 0001 x12:  
[ 4117.154604] x11: 0040 x10: 0930 
[ 4117.165302] x9 : 0a7efd50 x8 :  
[ 4117.175999] x7 :  x6 : 0001 
[ 4117.186695] x5 :  x4 : 091136c8 
[ 4117.197392] x3 : 09111388 x2 : 0005 
[ 4117.208088] x1 : 800942bce480 x0 : 0c02 
[ 4117.218784] Call trace:
[ 4117.223700]  nfs_free_request+0x150/0x160
[ 4117.231766]  nfs_release_request+0x84/0x98
[ 4117.240009]  nfs_release_request+0x60/0x98
[ 4117.248252]  nfs_unlock_and_release_request+0x1c/0x28
[ 4117.258420]

use-after-free in NFS

2018-06-29 Thread Punit Agrawal

Hi,

Nightly LTP runs are hitting a use-after-free on upstream kernels when
running with 64k pages. There isn't a specific test triggering the
issue. Also, the problem is not encountered with 4k pages.

The boards used for the nightly runs mount their filesystem (Debian
Jessie) via NFS with the following filesystem relevant command line
options -

"root=/dev/nfs rw nfsroot=:,tcp,hard,intr,vers=3 
rootwait"

I suspect that the larger page size is uncovering a latent race but I'm
not familiar with NFS client implementation and so could be barking up
the wrong tree. Frustratingly I've not been able to reproduce it locally
to create a reproducer.

Thanks,
Punit

 >8 

[ 4116.235125] LTP: starting data_space
[ 4116.497897] [ cut here ]
[ 4116.511558] refcount_t: increment on 0; use-after-free.
[ 4116.522421] WARNING: CPU: 4 PID: 6 at lib/refcount.c:153 
refcount_inc+0x44/0x50
[ 4116.537845] Modules linked in: overlay btrfs libcrc32c xor zstd_decompress 
zstd_compress xxhash raid6_pq fuse crc32_ce crct10dif_ce ipv6
[ 4116.562596] CPU: 4 PID: 6 Comm: kworker/4:0 Not tainted 
4.18.0-rc2-00119-g59ec39fe3826 #1
[ 4116.579770] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno 
Development Platform, BIOS EDK II Feb 13 2018
[ 4116.601505] Workqueue: nfsiod rpc_async_release
[ 4116.610627] pstate: 4005 (nZcv daif -PAN -UAO)
[ 4116.620273] pc : refcount_inc+0x44/0x50
[ 4116.627989] lr : refcount_inc+0x44/0x50
[ 4116.635705] sp : 0a7efba0
[ 4116.642369] x29: 0a7efba0 x28:  
[ 4116.653068] x27: 7fff x26: 0a7efca0 
[ 4116.663767] x25: 0008 x24: 80092054d6a0 
[ 4116.674465] x23: 800942ba7140 x22: 0a7efc90 
[ 4116.685162] x21: 800942ba7100 x20: 800942ba7b00 
[ 4116.695861] x19: 800942ba7100 x18: 0011 
[ 4116.706557] x17:  x16:  
[ 4116.717256] x15:  x14: 0400 
[ 4116.727953] x13: 0001 x12:  
[ 4116.738650] x11: 48ca x10: 0930 
[ 4116.749347] x9 : 0a7ef8a0 x8 : 8009416f0990 
[ 4116.760046] x7 : 8009416f00c0 x6 : 80097fba20c0 
[ 4116.770742] x5 : 80097fba20c0 x4 :  
[ 4116.781439] x3 : 80097fba8ea8 x2 : 8009416f 
[ 4116.792137] x1 : 2e35a06ccfe9d400 x0 :  
[ 4116.802834] Call trace:
[ 4116.807750]  refcount_inc+0x44/0x50
[ 4116.814770]  nfs_scan_commit_list+0x84/0x128
[ 4116.823362]  nfs_scan_commit.part.12+0x4c/0xe0
[ 4116.832305]  __nfs_commit_inode+0x100/0x190
[ 4116.840722]  nfs_io_completion_commit+0x14/0x20
[ 4116.849838]  nfs_io_completion_put+0x38/0x58
[ 4116.858430]  nfs_write_completion+0x108/0x1a0
[ 4116.867199]  nfs_pgio_release+0x14/0x20
[ 4116.874916]  rpc_free_task+0x28/0x48
[ 4116.882108]  rpc_async_release+0x10/0x18
[ 4116.890002]  process_one_work+0x1dc/0x330
[ 4116.898068]  worker_thread+0x48/0x430
[ 4116.905433]  kthread+0xf8/0x128
[ 4116.911750]  ret_from_fork+0x10/0x18
[ 4116.918939] ---[ end trace 9745b48d60ab22f6 ]---
[ 4116.931729] WARNING: CPU: 4 PID: 6 at fs/nfs/pagelist.c:426 
nfs_free_request+0x150/0x160
[ 4116.948730] Modules linked in: overlay btrfs libcrc32c xor zstd_decompress 
zstd_compress xxhash raid6_pq fuse crc32_ce crct10dif_ce ipv6
[ 4116.973478] CPU: 4 PID: 6 Comm: kworker/4:0 Tainted: GW 
4.18.0-rc2-00119-g59ec39fe3826 #1
[ 4116.993452] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno 
Development Platform, BIOS EDK II Feb 13 2018
[ 4117.015185] Workqueue: nfsiod rpc_async_release
[ 4117.024306] pstate: 6005 (nZCv daif -PAN -UAO)
[ 4117.033951] pc : nfs_free_request+0x150/0x160
[ 4117.042718] lr : nfs_release_request+0x84/0x98
[ 4117.051657] sp : 0a7efc40
[ 4117.058321] x29: 0a7efc40 x28:  
[ 4117.069021] x27: 80097fbac330 x26: 091280f8 
[ 4117.079720] x25:  x24: 09113000 
[ 4117.090419] x23: 8009432338c8 x22: 8009432338d8 
[ 4117.101119] x21: 800942bce480 x20: 800942ba4b80 
[ 4117.111817] x19: 800942bce480 x18:  
[ 4117.122514] x17:  x16: 0001 
[ 4117.133211] x15:  x14: 0400 
[ 4117.143908] x13: 0001 x12:  
[ 4117.154604] x11: 0040 x10: 0930 
[ 4117.165302] x9 : 0a7efd50 x8 :  
[ 4117.175999] x7 :  x6 : 0001 
[ 4117.186695] x5 :  x4 : 091136c8 
[ 4117.197392] x3 : 09111388 x2 : 0005 
[ 4117.208088] x1 : 800942bce480 x0 : 0c02 
[ 4117.218784] Call trace:
[ 4117.223700]  nfs_free_request+0x150/0x160
[ 4117.231766]  nfs_release_request+0x84/0x98
[ 4117.240009]  nfs_release_request+0x60/0x98
[ 4117.248252]  nfs_unlock_and_release_request+0x1c/0x28
[ 4117.258420]

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-26 Thread Punit Agrawal

Jonathan Cameron  writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the 4
> 4th mode to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle, everything worked).
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares as see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi=152998665713983=2

>
> Jonathan
>
>> 
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-26 Thread Punit Agrawal

Jonathan Cameron  writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the 4
> 4th mode to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle, everything worked).
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares as see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi=152998665713983=2

>
> Jonathan
>
>> 
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-26 Thread Punit Agrawal

Jonathan Cameron  writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the 4
> 4th mode to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle, everything worked).
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares as see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi=152998665713983=2

>
> Jonathan
>
>> 
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-26 Thread Punit Agrawal

Jonathan Cameron  writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the 4
> 4th mode to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle, everything worked).
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares as see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi=152998665713983=2

>
> Jonathan
>
>> 
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Punit Agrawal

Michal Hocko  writes:

> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi  writes:
>> > 
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> >>>> Michal Hocko  writes:
>> >>>>
>> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> >>>>> [...]
>> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>> >>>>>> fixing these other issues and enabling memoryless nodes.
>> >>>>>
>> >>>>> Well, x86 already does that but copying this antipatern is not really
>> >>>>> nice. So it is good as a quick fix but it would be definitely much
>> >>>>> better to have a robust fix. Who knows how many other places might hit
>> >>>>> this. You certainly do not want to add a hack like this all over...
>> >>>>
>> >>>> Completely agree! I was only suggesting it as a temporary measure,
>> >>>> especially as it looked like a proper fix might be invasive.
>> >>>>
>> >>>> Another fix might be to change the node specific allocation to node
>> >>>> agnostic allocations. It isn't clear why the allocation is being
>> >>>> requested from a specific node. I think Lorenzo suggested this in one of
>> >>>> the threads.
>> >>>
>> >>> I think that code was just copypasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>> >>>> I've started putting together a set fixing the issues identified in this
>> >>>> thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry, feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle even though if the patch below would do delaying the fix for a
>> >>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> >>> not a disaster.
>> >>
>> >> I tested this patch on my arm board, it works.
>> > 
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> > 
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from NR_CPUS restriction. When it comes to building zonelists, the node
>> > referenced by the PCI controller also has zonelists initialised.
>> > 
>> > So it looks like a fallback node is setup even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>> 
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if no CPU and no memory on this node, correct me if I'm
>> wrong.
>
> Well, as long as there is a code which can explicitly ask for a specific
> node than it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement there.
> Zonelists should be cheep to configure for all possible nodes. It's not
> like we are talking about huge amount of resources.

I agree. The current problem stems from not configuring the zonelists
for nodes that don't have onlined cpu and memory. Lorenzo's patch fixes
the configuration of such nodes.

For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.

I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
near-by node with memory.

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Punit Agrawal

Michal Hocko  writes:

> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi  writes:
>> > 
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> >>>> Michal Hocko  writes:
>> >>>>
>> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> >>>>> [...]
>> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>> >>>>>> fixing these other issues and enabling memoryless nodes.
>> >>>>>
>> >>>>> Well, x86 already does that but copying this antipatern is not really
>> >>>>> nice. So it is good as a quick fix but it would be definitely much
>> >>>>> better to have a robust fix. Who knows how many other places might hit
>> >>>>> this. You certainly do not want to add a hack like this all over...
>> >>>>
>> >>>> Completely agree! I was only suggesting it as a temporary measure,
>> >>>> especially as it looked like a proper fix might be invasive.
>> >>>>
>> >>>> Another fix might be to change the node specific allocation to node
>> >>>> agnostic allocations. It isn't clear why the allocation is being
>> >>>> requested from a specific node. I think Lorenzo suggested this in one of
>> >>>> the threads.
>> >>>
>> >>> I think that code was just copypasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>> >>>> I've started putting together a set fixing the issues identified in this
>> >>>> thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry, feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle even though if the patch below would do delaying the fix for a
>> >>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> >>> not a disaster.
>> >>
>> >> I tested this patch on my arm board, it works.
>> > 
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> > 
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from NR_CPUS restriction. When it comes to building zonelists, the node
>> > referenced by the PCI controller also has zonelists initialised.
>> > 
>> > So it looks like a fallback node is setup even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>> 
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if no CPU and no memory on this node, correct me if I'm
>> wrong.
>
> Well, as long as there is a code which can explicitly ask for a specific
> node than it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement there.
> Zonelists should be cheep to configure for all possible nodes. It's not
> like we are talking about huge amount of resources.

I agree. The current problem stems from not configuring the zonelists
for nodes that don't have onlined cpu and memory. Lorenzo's patch fixes
the configuration of such nodes.

For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.

I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
near-by node with memory.

Re: [PATCH 1/4] drivers/bus: arm-cci: use SPDX license information

2018-06-22 Thread Punit Agrawal

Hi,

Mawanda Henry  writes:

> SPDX license helps developers and machines to know the license governing
> a particular file hence easing work
>
> Signed-off-by: Mawanda Henry 
> ---
>  drivers/bus/arm-cci.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
> index b8184a9..74a2e0a 100644
> --- a/drivers/bus/arm-cci.c
> +++ b/drivers/bus/arm-cci.c
> @@ -1,17 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later

Please use GPL-2.0 as per the below text.

While you're updating this can you also update the MODULE_LICENSE() at
the end of the file to "GPL v2".

>  /*
>   * CCI cache coherent interconnect driver
>   *
>   * Copyright (C) 2013 ARM Ltd.
>   * Author: Lorenzo Pieralisi 
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> - * kind, whether express or implied; without even the implied warranty
> - * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
>   */
>  
>  #include

Re: [PATCH 1/4] drivers/bus: arm-cci: use SPDX license information

2018-06-22 Thread Punit Agrawal

Hi,

Mawanda Henry  writes:

> SPDX license helps developers and machines to know the license governing
> a particular file hence easing work
>
> Signed-off-by: Mawanda Henry 
> ---
>  drivers/bus/arm-cci.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
> index b8184a9..74a2e0a 100644
> --- a/drivers/bus/arm-cci.c
> +++ b/drivers/bus/arm-cci.c
> @@ -1,17 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later

Please use GPL-2.0 as per the below text.

While you're updating this can you also update the MODULE_LICENSE() at
the end of the file to "GPL v2".

>  /*
>   * CCI cache coherent interconnect driver
>   *
>   * Copyright (C) 2013 ARM Ltd.
>   * Author: Lorenzo Pieralisi 
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> - * kind, whether express or implied; without even the implied warranty
> - * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
>   */
>  
>  #include

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-20 Thread Punit Agrawal

Xie XiuQi  writes:

> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko  writes:
>>>
>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>> [...]
>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>> fixing these other issues and enabling memoryless nodes.
>>>>
>>>> Well, x86 already does that but copying this antipatern is not really
>>>> nice. So it is good as a quick fix but it would be definitely much
>>>> better to have a robust fix. Who knows how many other places might hit
>>>> this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node specific allocation to node
>>> agnostic allocations. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>> 
>> I think that code was just copypasted but it is better to fix the
>> underlying issue.
>> 
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea on the best course of action.
>> 
>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry, feel free to pick it up if it works.
>> 
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle even though if the patch below would do delaying the fix for a
>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> not a disaster.
>
> I tested this patch on my arm board, it works.

I am assuming you tried the patch without enabling support for
memory-less nodes.

The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from NR_CPUS restriction. When it comes to building zonelists, the node
referenced by the PCI controller also has zonelists initialised.

So it looks like a fallback node is setup even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-20 Thread Punit Agrawal

Xie XiuQi  writes:

> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko  writes:
>>>
>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>> [...]
>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>> fixing these other issues and enabling memoryless nodes.
>>>>
>>>> Well, x86 already does that but copying this antipatern is not really
>>>> nice. So it is good as a quick fix but it would be definitely much
>>>> better to have a robust fix. Who knows how many other places might hit
>>>> this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node specific allocation to node
>>> agnostic allocations. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>> 
>> I think that code was just copypasted but it is better to fix the
>> underlying issue.
>> 
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea on the best course of action.
>> 
>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry, feel free to pick it up if it works.
>> 
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle even though if the patch below would do delaying the fix for a
>> couple of -rc (to get it tested across ACPI ARM64 NUMA platforms) is
>> not a disaster.
>
> I tested this patch on my arm board, it works.

I am assuming you tried the patch without enabling support for
memory-less nodes.

The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from NR_CPUS restriction. When it comes to building zonelists, the node
referenced by the PCI controller also has zonelists initialised.

So it looks like a fallback node is setup even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Punit Agrawal

Michal Hocko  writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that but copying this antipatern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.

Another fix might be to change the node specific allocation to node
agnostic allocations. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.

I've started putting together a set fixing the issues identified in this
thread. It should give a better idea on the best course of action.

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Punit Agrawal

Michal Hocko  writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that but copying this antipatern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.

Another fix might be to change the node specific allocation to node
agnostic allocations. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.

I've started putting together a set fixing the issues identified in this
thread. It should give a better idea on the best course of action.

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 992 matches

Mail list logo