Re: [PATCH -next] cpufreq: powernv: Make some symbols static

2020-07-14 Thread Lee Jones
On Wed, 15 Jul 2020, Viresh Kumar wrote:

> On 14-07-20, 22:23, Wei Yongjun wrote:
> > The sparse tool complains as follows:
> > 
> > drivers/cpufreq/powernv-cpufreq.c:88:1: warning:
> >  symbol 'pstate_revmap' was not declared. Should it be static?
> > drivers/cpufreq/powernv-cpufreq.c:383:18: warning:
> >  symbol 'cpufreq_freq_attr_cpuinfo_nominal_freq' was not declared. Should 
> > it be static?
> > drivers/cpufreq/powernv-cpufreq.c:669:6: warning:
> >  symbol 'gpstate_timer_handler' was not declared. Should it be static?
> > drivers/cpufreq/powernv-cpufreq.c:902:6: warning:
> >  symbol 'powernv_cpufreq_work_fn' was not declared. Should it be static?
> > 
> > Those symbols are not used outside of this file, so mark
> > them static.
> > 
> > Reported-by: Hulk Robot 
> > Signed-off-by: Wei Yongjun 
> > ---
> >  drivers/cpufreq/powernv-cpufreq.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> Lee also sent a fix for this, but yours looks complete :)
> 
> https://lore.kernel.org/lkml/20200714145049.2496163-7-lee.jo...@linaro.org/
> 
> Acked-by: Viresh Kumar 

Yes, looks good to me.

Reviewed-by: Lee Jones 

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Arnd Bergmann
On Wed, Jul 15, 2020 at 1:46 AM Bjorn Helgaas  wrote:

>  So something like:
>
>   void pci_read_config_word(struct pci_dev *dev, int where, u16 *val)
>
> and where we used to return anything non-zero, we just set *val = ~0
> instead?  I think we do that already in most, maybe all, cases.

Right, this is what I had in mind. If we start by removing the handling
of the return code in all files that clearly don't need it, looking at
whatever remains will give a much better idea of what a good interface
should be.
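As a minimal sketch of the accessor shape being discussed (assuming a
hypothetical void-returning wrapper; the in-tree accessors still return
an int, so this only illustrates the "*val = ~0 on error" convention):

    /* Hypothetical wrapper: no error code, failure is reported by filling
     * the value with all-ones, matching what reads from a missing device
     * return anyway. */
    void pci_read_config_word_noerr(struct pci_dev *dev, int where, u16 *val)
    {
            if (pci_read_config_word(dev, where, val))
                    *val = ~0;
    }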

>  git grep -E "(if|return|=).*\ finds about 400 matches.

Right, and this is some 112 files to look at.

I had a slightly different regex, which found more false-positives, but
also these:

arch/x86/kernel/amd_nb.c:  : pci_read_config_dword(root, 0x64, value));
drivers/i2c/busses/i2c-sis630.c: pci_write_config_byte(sis630_dev,
SIS630_BIOS_CTL_REG, b | 0x80)) {
drivers/i2c/busses/i2c-viapro.c: !pci_read_config_word(pdev,
SMBBA2, &vt596_smba) &&
drivers/ide/rz1000.c: !pci_write_config_word(dev, 0x40, reg & 0xdfff)) {
drivers/net/ethernet/realtek/r8169_main.c:
pci_write_config_byte(pdev, 0x070f, val) == PCIBIOS_SUCCESSFUL)
include/linux/rtsx_pci.h:#define rtsx_pci_read_config_dword(pcr,
where, val) pci_read_config_dword((pcr)->pci, where, val)
include/linux/rtsx_pci.h:#define rtsx_pci_write_config_dword(pcr,
where, val) pci_write_config_dword((pcr)->pci, where, val)
drivers/misc/cardreader/rts5261.c:  retval =
rtsx_pci_read_config_dword(pcr,
drivers/misc/cardreader/rts5261.c:  retval =
rtsx_pci_write_config_dword(pcr, PCR_SETTING_REG2, lval);

That last one is interesting because I think this is a case in which we
actually want to check for errors, as the driver seems to use it
to ensure that accessing extended config space at offset 0x814
works before relying on the value. Unfortunately the implementation
seems buggy as it a) keeps using the possibly uninitialized value after
printing a warning and b) returns the PCIBIOS_* value in place of a
negative errno and then ignores it in the caller.
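A rough sketch of what a corrected check could look like, keeping the
names quoted above (pcr, PCR_SETTING_REG2, lval) but with a made-up
helper name purely for illustration: bail out before using the value and
convert the PCIBIOS_* code to a negative errno for the caller.

    static int rts5261_check_cfg_access(struct rtsx_pcr *pcr, u32 *lval)
    {
            int retval = rtsx_pci_read_config_dword(pcr, PCR_SETTING_REG2, lval);

            if (retval != PCIBIOS_SUCCESSFUL) {
                    /* don't keep using the (possibly uninitialized) value */
                    pci_warn(pcr->pci, "extended config space not accessible\n");
                    return pcibios_err_to_errno(retval);
            }
            return 0;
    }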

  Arnd


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Olof Johansson
On Tue, Jul 14, 2020 at 11:33 PM Lee Jones  wrote:
>
> On Tue, 14 Jul 2020, Olof Johansson wrote:
>
> > On Tue, Jul 14, 2020 at 7:50 AM Lee Jones  wrote:
> > >
> > > If function callers and providers do not share the same prototypes the
> > > compiler complains of missing prototypes.  Fix this by moving the
> > > already existing prototypes out to a mutually convenient location.
> > >
> > > Fixes the following W=1 kernel build warning(s):
> > >
> > >  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype 
> > > for ‘check_astate’ [-Wmissing-prototypes]
> > >  109 | int check_astate(void)
> > >  | ^~~~
> > >  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype 
> > > for ‘restore_astate’ [-Wmissing-prototypes]
> > >  114 | void restore_astate(int cpu)
> > >  | ^~
> > >
> > > Cc: Olof Johansson 
> > > Cc: Michael Ellerman 
> > > Cc: Benjamin Herrenschmidt 
> > > Cc: Paul Mackerras 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Signed-off-by: Lee Jones 
> > > ---
> > >  arch/powerpc/platforms/pasemi/pasemi.h| 15 
> > >  arch/powerpc/platforms/pasemi/powersave.S |  2 ++
> > >  drivers/cpufreq/pasemi-cpufreq.c  |  1 +
> > >  include/linux/platform_data/pasemi.h  | 28 +++
> > >  4 files changed, 31 insertions(+), 15 deletions(-)
> > >  create mode 100644 include/linux/platform_data/pasemi.h
> > >
> > > diff --git a/arch/powerpc/platforms/pasemi/pasemi.h 
> > > b/arch/powerpc/platforms/pasemi/pasemi.h
> > > index 70b56048ed1be..528d81ef748ad 100644
> > > --- a/arch/powerpc/platforms/pasemi/pasemi.h
> > > +++ b/arch/powerpc/platforms/pasemi/pasemi.h
> > > @@ -15,21 +15,6 @@ extern void __init pasemi_map_registers(void);
> > >  extern void idle_spin(void);
> > >  extern void idle_doze(void);
> > >
> > > -/* Restore astate to last set */
> > > -#ifdef CONFIG_PPC_PASEMI_CPUFREQ
> > > -extern int check_astate(void);
> > > -extern void restore_astate(int cpu);
> > > -#else
> > > -static inline int check_astate(void)
> > > -{
> > > -   /* Always return >0 so we never power save */
> > > -   return 1;
> > > -}
> > > -static inline void restore_astate(int cpu)
> > > -{
> > > -}
> > > -#endif
> > > -
> > >  extern struct pci_controller_ops pasemi_pci_controller_ops;
> > >
> > >  #endif /* _PASEMI_PASEMI_H */
> > > diff --git a/arch/powerpc/platforms/pasemi/powersave.S 
> > > b/arch/powerpc/platforms/pasemi/powersave.S
> > > index d0215d5329ca7..7747b48963286 100644
> > > --- a/arch/powerpc/platforms/pasemi/powersave.S
> > > +++ b/arch/powerpc/platforms/pasemi/powersave.S
> > > @@ -5,6 +5,8 @@
> > >   * Maintained by: Olof Johansson 
> > >   */
> > >
> > > +#include 
> > > +
> > >  #include 
> > >  #include 
> > >  #include 
> > > diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> > > b/drivers/cpufreq/pasemi-cpufreq.c
> > > index c66f566a854cb..c6bb3ecc90ef3 100644
> > > --- a/drivers/cpufreq/pasemi-cpufreq.c
> > > +++ b/drivers/cpufreq/pasemi-cpufreq.c
> > > @@ -15,6 +15,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #include 
> > >  #include 
> > > diff --git a/include/linux/platform_data/pasemi.h 
> > > b/include/linux/platform_data/pasemi.h
> > > new file mode 100644
> > > index 0..3fed0687fcc9a
> > > --- /dev/null
> > > +++ b/include/linux/platform_data/pasemi.h
> > > @@ -0,0 +1,28 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > +/*
> > > + * Copyright (C) 2020 Linaro Ltd.
> > > + *
> > > + * Author: Lee Jones 
> > > + */
> >
> > Absolutely not. It's neither your copyright, nor your authorship.
>
> The file was new.  Anyway, the point is now moot.

The contents were copied and pasted from other material, not originally
produced by you.

I suggest you consult with Linaro lawyers on how to handle this if you
have to do something like it in the future.


-Olof


Re: [PATCH -next] powerpc/xive: Remove unused inline function xive_kexec_teardown_cpu()

2020-07-14 Thread Greg Kurz
On Wed, 15 Jul 2020 10:50:40 +0800
YueHaibing  wrote:

> commit e27e0a94651e ("powerpc/xive: Remove xive_kexec_teardown_cpu()")
> left behind this, remove it.
> 
> Signed-off-by: YueHaibing 
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/include/asm/xive.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index d08ea11b271c..309b4d65b74f 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -155,7 +155,6 @@ static inline void xive_smp_probe(void) { }
>  static inline int  xive_smp_prepare_cpu(unsigned int cpu) { return -EINVAL; }
>  static inline void xive_smp_setup_cpu(void) { }
>  static inline void xive_smp_disable_cpu(void) { }
> -static inline void xive_kexec_teardown_cpu(int secondary) { }
>  static inline void xive_shutdown(void) { }
>  static inline void xive_flush_interrupt(void) { }
>  



Re: [PATCH 3/3] ASoC: fsl-asoc-card: Support Headphone and Microphone Jack detection

2020-07-14 Thread Nicolin Chen
On Wed, Jul 15, 2020 at 12:14:01PM +0800, Shengjiu Wang wrote:
> On Wed, Jul 15, 2020 at 5:16 AM Nicolin Chen  wrote:
> >
> > Hi Shengjiu,
> >
> > The whole series looks good to me. Just a couple of small
> > questions inline:
> >
> > On Tue, Jul 14, 2020 at 05:05:36PM +0800, Shengjiu Wang wrote:
> > > Use asoc_simple_init_jack function from simple card to implement
> > > the Headphone and Microphone detection.
> > > Register notifier to disable Speaker when Headphone is plugged in
> > > and enable Speaker when Headphone is unplugged.
> > > Register notifier to disable Digital Microphone when Analog Microphone
> > > is plugged in and enable DMIC when Analog Microphone is unplugged.
> > >
> > > Signed-off-by: Shengjiu Wang 
> > > ---
> > >  sound/soc/fsl/Kconfig |  1 +
> > >  sound/soc/fsl/fsl-asoc-card.c | 69 ++-
> > >  2 files changed, 68 insertions(+), 2 deletions(-)
> >
> > >  static int fsl_asoc_card_late_probe(struct snd_soc_card *card)
> > >  {
> > >   struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(card);
> > > @@ -745,8 +789,29 @@ static int fsl_asoc_card_probe(struct 
> > > platform_device *pdev)
> > >   snd_soc_card_set_drvdata(&priv->card, priv);
> > >
> > >   ret = devm_snd_soc_register_card(&pdev->dev, &priv->card);
> > > - if (ret && ret != -EPROBE_DEFER)
> > > - dev_err(&pdev->dev, "snd_soc_register_card failed (%d)\n", 
> > > ret);
> > > + if (ret) {
> > > + if (ret != -EPROBE_DEFER)
> > > + dev_err(&pdev->dev, "snd_soc_register_card failed 
> > > (%d)\n", ret);
> >
> > I think we may move this EPROBE_DEFER to the asrc_fail label.
> 
> If we move this to the asrc_fail label, then it will be hard to define the
> error message.
> There are many places that goto asrc_fail.

Oh...good point...

> > > + goto asrc_fail;
> > > + }
> > > +
> > > + if (of_property_read_bool(np, "hp-det-gpio")) {
> >
> > Could we move this check inside asoc_simple_init_jack? There's no
> > problem with doing it here though, yet I got a bit confused by it
> > as I thought it's a boolean type property, which would be against
> > the DT bindings until I saw asoc_simple_init_jack() uses the same
> > string to get the GPIO. Just it probably would be a bit tricky as
> > we need it to be optional here.
> >
> > Otherwise, I think we may add a line of comments to indicate that
> > the API would use the same string to get the GPIO.
> 
> In asoc_simple_init_jack, gpio_is_valid() will fail when there is
> no "hp-det-gpio" property, and asoc_simple_init_jack will return 0.
> 
> The reason why I add a check here is mostly for
> snd_soc_jack_notifier_register(): when there is no jack created,
> there will be a kernel dump.
> 
> or I can use this code:
> 
> -   if (of_property_read_bool(np, "hp-det-gpio")) {
> -   ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
> -   1, NULL, "Headphone Jack");
> -   if (ret)
> -   goto asrc_fail;
> +   ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
> +   1, NULL, "Headphone Jack");
> +   if (ret)
> +   goto asrc_fail;
> 
> +   if (priv->hp_jack.jack.jack)
> snd_soc_jack_notifier_register(&priv->hp_jack.jack,

It's pretty clean but not very obvious for the "optional" part.
So I think that it'd be slightly better to go for your previous
solution, but with a line of comments to show: these properties
are optional and asoc_simple_init_jack() uses the same strings.

Please add to all three changes once the comments are added:

Acked-by: Nicolin Chen 

Thanks
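
For reference, a small sketch of what that previous solution plus a
comment might look like; the names below (np, priv, hp_jack_nb,
asrc_fail) are taken from the snippets quoted in this thread, and the
comment wording is only illustrative.

    /*
     * "hp-det-gpio" is optional.  asoc_simple_init_jack() looks the GPIO up
     * with the same property string, so this boolean check only decides
     * whether to create the jack at all; skipping it also skips the notifier
     * registration, which would crash on a non-existent jack.
     */
    if (of_property_read_bool(np, "hp-det-gpio")) {
            ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
                                        1, NULL, "Headphone Jack");
            if (ret)
                    goto asrc_fail;

            snd_soc_jack_notifier_register(&priv->hp_jack.jack, &hp_jack_nb);
    }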


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Viresh Kumar
On 15-07-20, 07:36, Lee Jones wrote:
> I searched for "include.*platforms/" in drivers/, and was scared off
> this method since no one else does this.

Yeah, it's not right for generic drivers, but this is very much platform
specific, so it is fine here.

-- 
viresh


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Lee Jones
On Tue, 14 Jul 2020, Olof Johansson wrote:

> On Tue, Jul 14, 2020 at 8:07 PM Viresh Kumar  wrote:
> >
> > On 14-07-20, 15:50, Lee Jones wrote:
> > > If function callers and providers do not share the same prototypes the
> > > compiler complains of missing prototypes.  Fix this by moving the
> > > already existing prototypes out to a mutually convenient location.
> > >
> > > Fixes the following W=1 kernel build warning(s):
> > >
> > >  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype 
> > > for ‘check_astate’ [-Wmissing-prototypes]
> > >  109 | int check_astate(void)
> > >  | ^~~~
> > >  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype 
> > > for ‘restore_astate’ [-Wmissing-prototypes]
> > >  114 | void restore_astate(int cpu)
> > >  | ^~
> > >
> > > Cc: Olof Johansson 
> > > Cc: Michael Ellerman 
> > > Cc: Benjamin Herrenschmidt 
> > > Cc: Paul Mackerras 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Signed-off-by: Lee Jones 
> > > ---
> > >  arch/powerpc/platforms/pasemi/pasemi.h| 15 
> >
> > Is there no sane way we can include this file directly to the cpufreq
> > file ?
> 
> Yep. arch/powerpc seems to be in the search path for modules on powerpc, so:
> 
> diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> b/drivers/cpufreq/pasemi-cpufreq.c
> index c66f566a854cb..815645170c4de 100644
> --- a/drivers/cpufreq/pasemi-cpufreq.c
> +++ b/drivers/cpufreq/pasemi-cpufreq.c
> @@ -22,6 +22,8 @@
>  #include 
>  #include 
> 
> +#include 
> +
>  #define SDCASR_REG 0x0100
>  #define SDCASR_REG_STRIDE  0x1000
>  #define SDCPWR_CFGA0_REG   0x0100

I searched for "include.*platforms/" in drivers/, and was scared off
this method since no one else does this.

But if it's a reasonable solution, great.  Will fix.

Thanks Olof.

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Lee Jones
On Tue, 14 Jul 2020, Olof Johansson wrote:

> On Tue, Jul 14, 2020 at 7:50 AM Lee Jones  wrote:
> >
> > If function callers and providers do not share the same prototypes the
> > compiler complains of missing prototypes.  Fix this by moving the
> > already existing prototypes out to a mutually convenient location.
> >
> > Fixes the following W=1 kernel build warning(s):
> >
> >  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype for 
> > ‘check_astate’ [-Wmissing-prototypes]
> >  109 | int check_astate(void)
> >  | ^~~~
> >  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype for 
> > ‘restore_astate’ [-Wmissing-prototypes]
> >  114 | void restore_astate(int cpu)
> >  | ^~
> >
> > Cc: Olof Johansson 
> > Cc: Michael Ellerman 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Lee Jones 
> > ---
> >  arch/powerpc/platforms/pasemi/pasemi.h| 15 
> >  arch/powerpc/platforms/pasemi/powersave.S |  2 ++
> >  drivers/cpufreq/pasemi-cpufreq.c  |  1 +
> >  include/linux/platform_data/pasemi.h  | 28 +++
> >  4 files changed, 31 insertions(+), 15 deletions(-)
> >  create mode 100644 include/linux/platform_data/pasemi.h
> >
> > diff --git a/arch/powerpc/platforms/pasemi/pasemi.h 
> > b/arch/powerpc/platforms/pasemi/pasemi.h
> > index 70b56048ed1be..528d81ef748ad 100644
> > --- a/arch/powerpc/platforms/pasemi/pasemi.h
> > +++ b/arch/powerpc/platforms/pasemi/pasemi.h
> > @@ -15,21 +15,6 @@ extern void __init pasemi_map_registers(void);
> >  extern void idle_spin(void);
> >  extern void idle_doze(void);
> >
> > -/* Restore astate to last set */
> > -#ifdef CONFIG_PPC_PASEMI_CPUFREQ
> > -extern int check_astate(void);
> > -extern void restore_astate(int cpu);
> > -#else
> > -static inline int check_astate(void)
> > -{
> > -   /* Always return >0 so we never power save */
> > -   return 1;
> > -}
> > -static inline void restore_astate(int cpu)
> > -{
> > -}
> > -#endif
> > -
> >  extern struct pci_controller_ops pasemi_pci_controller_ops;
> >
> >  #endif /* _PASEMI_PASEMI_H */
> > diff --git a/arch/powerpc/platforms/pasemi/powersave.S 
> > b/arch/powerpc/platforms/pasemi/powersave.S
> > index d0215d5329ca7..7747b48963286 100644
> > --- a/arch/powerpc/platforms/pasemi/powersave.S
> > +++ b/arch/powerpc/platforms/pasemi/powersave.S
> > @@ -5,6 +5,8 @@
> >   * Maintained by: Olof Johansson 
> >   */
> >
> > +#include 
> > +
> >  #include 
> >  #include 
> >  #include 
> > diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> > b/drivers/cpufreq/pasemi-cpufreq.c
> > index c66f566a854cb..c6bb3ecc90ef3 100644
> > --- a/drivers/cpufreq/pasemi-cpufreq.c
> > +++ b/drivers/cpufreq/pasemi-cpufreq.c
> > @@ -15,6 +15,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > diff --git a/include/linux/platform_data/pasemi.h 
> > b/include/linux/platform_data/pasemi.h
> > new file mode 100644
> > index 0..3fed0687fcc9a
> > --- /dev/null
> > +++ b/include/linux/platform_data/pasemi.h
> > @@ -0,0 +1,28 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2020 Linaro Ltd.
> > + *
> > + * Author: Lee Jones 
> > + */
> 
> Absolutely not. It's neither your copyright, nor your authorship.

The file was new.  Anyway, the point is now moot.

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog


[powerpc:merge] BUILD SUCCESS 93c756af92e11df5c1a8be7a8360f28a97f7ad3f

2020-07-14 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
merge
branch HEAD: 93c756af92e11df5c1a8be7a8360f28a97f7ad3f  Automatic merge of 
'master', 'next' and 'fixes' (2020-07-15 00:25)

elapsed time: 939m

configs tested: 104
configs skipped: 6

The following configs have been built successfully.
More configs may be tested in the coming days.

arm           defconfig
arm           allyesconfig
arm           allmodconfig
arm           allnoconfig
arm64         allyesconfig
arm64         defconfig
arm64         allmodconfig
arm64         allnoconfig
sh            se7724_defconfig
sh            sdk7786_defconfig
arm           tango4_defconfig
sh            migor_defconfig
arm           s3c6400_defconfig
m68k          amcore_defconfig
i386          allyesconfig
c6x           allnoconfig
xtensa        generic_kc705_defconfig
mips          ip27_defconfig
mips          mips_paravirt_defconfig
c6x           evmc6474_defconfig
powerpc       pasemi_defconfig
sh            dreamcast_defconfig
openrisc      alldefconfig
m68k          m5407c3_defconfig
arm           ep93xx_defconfig
arm           moxart_defconfig
sparc         defconfig
mips          maltaup_defconfig
mips          rb532_defconfig
powerpc       mpc7448_hpc2_defconfig
mips          nlm_xlp_defconfig
sh            kfr2r09-romimage_defconfig
openrisc      defconfig
xtensa        iss_defconfig
arm           colibri_pxa270_defconfig
ia64          gensparse_defconfig
i386          allnoconfig
i386          defconfig
i386          debian-10.3
ia64          allmodconfig
ia64          defconfig
ia64          allnoconfig
ia64          allyesconfig
m68k          allmodconfig
m68k          allnoconfig
m68k          sun3_defconfig
m68k          defconfig
m68k          allyesconfig
nios2         defconfig
nios2         allyesconfig
c6x           allyesconfig
openrisc      allyesconfig
nds32         defconfig
nds32         allnoconfig
csky          allyesconfig
csky          defconfig
alpha         defconfig
alpha         allyesconfig
xtensa        allyesconfig
h8300         allyesconfig
h8300         allmodconfig
xtensa        defconfig
arc           defconfig
arc           allyesconfig
sh            allmodconfig
sh            allnoconfig
microblaze    allnoconfig
mips          allyesconfig
mips          allnoconfig
mips          allmodconfig
parisc        allnoconfig
parisc        defconfig
parisc        allyesconfig
parisc        allmodconfig
powerpc       defconfig
powerpc       allyesconfig
powerpc       rhel-kconfig
powerpc       allmodconfig
powerpc       allnoconfig
i386          randconfig-a016-20200714
i386          randconfig-a011-20200714
i386          randconfig-a015-20200714
i386          randconfig-a012-20200714
i386          randconfig-a013-20200714
i386          randconfig-a014-20200714
riscv         allyesconfig
riscv         allnoconfig
riscv         defconfig
riscv         allmodconfig
s390          allyesconfig
s390          allnoconfig
s390          allmodconfig
s390          defconfig
sparc         allyesconfig
sparc64       defconfig
sparc64       allnoconfig
sparc64       allyesconfig
sparc64       allmodconfig
x86_64        rhel-7.6-kselftests
x86_64

Re: [PATCH v3] powerpc/perf: Use SIER_USER_MASK while updating SPRN_SIER for EBB events

2020-07-14 Thread Athira Rajeev


> On 14-Jul-2020, at 11:38 AM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>>> On 19-Mar-2020, at 4:22 PM, Michael Ellerman  wrote:
>>> 
>>> Hi Athira,
>>> 
>>> Athira Rajeev  writes:
 Sampled Instruction Event Register (SIER), is a PMU register,
>>>  ^
>>>  that
 captures architecture state for a given sample. And sier_user_mask
>>>  ^  ^
>>>  don't think we need "architecture" SIER_USER_MASK
>>> 
 defined in commit 330a1eb7775b ("powerpc/perf: Core EBB support for 64-bit
 book3s") defines the architected bits that needs to be saved from the SPR.
>>> 
>>> Not quite, it defines the bits that are visible to userspace.
>>> 
>>> And I think it's true that for EBB events the bits we need/want to save
>>> are only the user visible bits.
>>> 
 Currently all of the bits from SIER are saved for EBB events. Patch fixes
 this by ANDing the "sier_user_mask" to data from SIER in ebb_switch_out().
 This will force save only architected bits from the SIER.
>>> 
>>> s/architected/user visible/
>>> 
>>> 
>>> But, why does it matter? The kernel saves the user visible bits, as well
>>> as the kernel-only bits into the thread struct. And then later the
>>> kernel restores that value into the hardware before returning to
>>> userspace.
>>> 
>>> But the hardware enforces the visibility of the bits, so userspace can't
>>> observe any bits that it shouldn't.
>>> 
>>> Or is there some other mechanism whereby userspace can see those bits? ;)
>>> 
>>> If there was, what would the security implications of that be?
>> 
>> Hi Michael,
>> 
>> Thanks for your comments. 
>> 
>> In ebb_switch_in, we set PMCC bit [MMCR0 44:45 ] to 10 which means
>> SIER ( Group B ) register is readable in problem state. Hence the
>> intention of the patch was to make sure we are not exposing the bits
>> which the userspace shouldn't be reading.
>> 
>> But following your comment about "hardware enforcing the visibility of
>> bits", I did try an "ebb" experiment which showed that reading
>> SPRN_SIER didn't expose any bits other than the user visible bits.
>> Sorry for the confusion here.
> 
> That's OK. Thanks for following my trail of clues :)
> 

Sure, thanks for your inputs. I will come back on this after checking on
the information exposed via ptrace.

>> In that case, Can we drop the existing definition of SIER_USER_MASK if
>> it is no more needed ?
> 
> I think it is still needed, and I think this change to use it is good, because
> SIER is visible via ptrace.
> 
> What we need to do, is look at what information in SIER we are currently
> exposing to userspace via ptrace, and what the security implications (if
> any) of that are.
> 
> cheers
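
For context, the change under discussion is roughly the following
one-liner in the EBB switch-out path (a sketch against core-book3s.c,
assuming the existing SIER_USER_MASK definition; not the final patch):

    /* save only the user-visible SIER bits into the thread struct */
    current->thread.sier = mfspr(SPRN_SIER) & SIER_USER_MASK;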



Re: [PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 3:24 PM Alexey Kardashevskiy  wrote:
>
>
> > @@ -158,9 +157,9 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> > pci_dev *pdev)
> >   goto disable_iov;
> >   pdev->dev.archdata.iov_data = iov;
> >
> > + /* FIXME: totalvfs > phb->ioda.total_pe_num is going to be a problem 
> > */
>
>
> WARN_ON_ONCE() then?

can't hurt
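
Roughly, the suggested guard would be a one-liner like this (names taken
from the quoted diff; the exact placement is illustrative):

    /* totalvfs > total_pe_num isn't handled yet, so at least warn about it */
    WARN_ON_ONCE(pci_sriov_get_totalvfs(pdev) > phb->ioda.total_pe_num);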

> > @@ -173,50 +172,51 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> > pci_dev *pdev)
> >   goto disable_iov;
> >   }
> >
> > - total_vf_bar_sz += pci_iov_resource_size(pdev,
> > - i + PCI_IOV_RESOURCES);
> > + vf_bar_sz = pci_iov_resource_size(pdev, i + 
> > PCI_IOV_RESOURCES);
> >
> >   /*
> > -  * If bigger than quarter of M64 segment size, just round up
> > -  * power of two.
> > +  * Generally, one segmented M64 BAR maps one IOV BAR. However,
> > +  * if a VF BAR is too large we end up wasting a lot of space.
> > +  * If we've got a BAR that's bigger than greater than 1/4 of 
> > the
>
>
> bigger, greater, huger? :)
>
> Also, a nit: s/got a BAR/got a VF BAR/

whatever, it's just words

> > +  * default window's segment size then switch to using single 
> > PE
> > +  * windows. This limits the total number of VFs we can 
> > support.
>
> Just to get idea about absolute numbers here.
>
> On my P9:
>
> ./pciex@600c3c030/ibm,opal-m64-window
>  00060200  00060200  0040 
>
> so that default window's segment size is 0x40../512 = 512MB?

Yeah. It'll vary a bit since PHB3 and some PHB4s have 256.

> >*
> > -  * Generally, one M64 BAR maps one IOV BAR. To avoid conflict
> > -  * with other devices, IOV BAR size is expanded to be
> > -  * (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
> > -  * segment size , the expanded size would equal to half of the
> > -  * whole M64 space size, which will exhaust the M64 Space and
> > -  * limit the system flexibility.  This is a design decision to
> > -  * set the boundary to quarter of the M64 segment size.
> > +  * The 1/4 limit is arbitrary and can be tweaked.
> >*/
> > - if (total_vf_bar_sz > gate) {
> > - mul = roundup_pow_of_two(total_vfs);
> > - dev_info(&pdev->dev,
> > - "VF BAR Total IOV size %llx > %llx, roundup 
> > to %d VFs\n",
> > - total_vf_bar_sz, gate, mul);
> > - iov->m64_single_mode = true;
> > - break;
> > - }
> > - }
> > + if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
> > + /*
> > +  * On PHB3, the minimum size alignment of M64 BAR in
> > +  * single mode is 32MB. If this VF BAR is smaller than
> > +  * 32MB, but still too large for a segmented window
> > +  * then we can't map it and need to disable SR-IOV for
> > +  * this device.
>
>
> Why not use single PE mode for such BAR? Better than nothing.

Suppose you could, but I figured VFs were mainly interesting since you
could give each VF to a separate guest. If there's multiple VFs under
the same single PE BAR then they'd have to be assigned to the same
guest in order to retain the freeze/unfreeze behaviour that PAPR
requires. I guess that's how it used to work, but it seems better just
to disable them rather than having VFs which sort of work.


Re: [PATCH v2 01/10] powerpc/perf: Add support for ISA3.1 PMU SPRs

2020-07-14 Thread Athira Rajeev


> On 13-Jul-2020, at 6:20 PM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>>> On 08-Jul-2020, at 4:32 PM, Michael Ellerman  wrote:
>>> 
>>> Athira Rajeev  writes:
>>> ...
 diff --git a/arch/powerpc/perf/core-book3s.c 
 b/arch/powerpc/perf/core-book3s.c
 index cd6a742..5c64bd3 100644
 --- a/arch/powerpc/perf/core-book3s.c
 +++ b/arch/powerpc/perf/core-book3s.c
 @@ -39,10 +39,10 @@ struct cpu_hw_events {
unsigned int flags[MAX_HWEVENTS];
/*
 * The order of the MMCR array is:
 -   *  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2
 +   *  - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2, MMCR3
 *  - 32-bit, MMCR0, MMCR1, MMCR2
 */
 -  unsigned long mmcr[4];
 +  unsigned long mmcr[5];
struct perf_event *limited_counter[MAX_LIMITED_HWCOUNTERS];
u8  limited_hwidx[MAX_LIMITED_HWCOUNTERS];
u64 alternatives[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
>>> ...
 @@ -1310,6 +1326,10 @@ static void power_pmu_enable(struct pmu *pmu)
if (!cpuhw->n_added) {
mtspr(SPRN_MMCRA, cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE);
mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
 +#ifdef CONFIG_PPC64
 +  if (ppmu->flags & PPMU_ARCH_310S)
 +  mtspr(SPRN_MMCR3, cpuhw->mmcr[4]);
 +#endif /* CONFIG_PPC64 */
goto out_enable;
}
 
 @@ -1353,6 +1373,11 @@ static void power_pmu_enable(struct pmu *pmu)
if (ppmu->flags & PPMU_ARCH_207S)
mtspr(SPRN_MMCR2, cpuhw->mmcr[3]);
 
 +#ifdef CONFIG_PPC64
 +  if (ppmu->flags & PPMU_ARCH_310S)
 +  mtspr(SPRN_MMCR3, cpuhw->mmcr[4]);
 +#endif /* CONFIG_PPC64 */
>>> 
>>> I don't think you need the #ifdef CONFIG_PPC64?
>> 
>> Hi Michael
>> 
>> Thanks for reviewing this series.
>> 
>> SPRN_MMCR3 is not defined for PPC32 and we hit build failure for 
>> pmac32_defconfig.
>> The #ifdef CONFIG_PPC64 is to address this.
> 
> We like to avoid #ifdefs in the body of the code like that.
> 
> There's a bunch of existing #defines near the top of the file to make
> 32-bit work, I think you should just add another for this, so eg:
> 
> #ifdef CONFIG_PPC32
> ...
> #define SPRN_MMCR3  0

Ok. Found that we currently do it the same way as you mentioned for MMCRA,
which is not supported on 32-bit.
I will work on this change.

Thanks
Athira 
> 
> cheers
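
For reference, the suggestion amounts to something like this sketch near
the top of core-book3s.c (the value is never actually written on 32-bit
because the PPMU_ARCH_310S path is 64-bit only; it just has to compile):

    #ifdef CONFIG_PPC32
    /* ... existing 32-bit compatibility defines ... */
    #define SPRN_MMCR3	0	/* unused: PPMU_ARCH_310S is never set on 32-bit */
    #endif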



Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Masami Hiramatsu
On Tue, 14 Jul 2020 10:03:45 -0400
Steven Rostedt  wrote:

> On Tue, 14 Jul 2020 14:33:14 +0100
> Mark Rutland  wrote:
> 
> > > Should work for all cases. Yes, we might then want something like a per
> > > arch:
> > > 
> > >   {BPF,FTRACE,KPROBE}_TEXT_TYPE  
> > 
> > ... at that point why not:
> > 
> >   text_alloc_ftrace();
> >   text_alloc_module();
> >   text_alloc_bpf();
> >   text_alloc_kprobe();
> 
> I don't know about bpf and kprobes, but for ftrace, the only place that
> it allocates text happens to be in arch specific code.
> 
> If you want something special for ftrace, you could just add your own
> function. But for x86, a text_alloc_immediate() would work.
> (BTW, I like the function names over the enums)

kprobes will need the text_alloc_immediate() too, since it will use
the trampoline buffer where jumps to/from kernel code/modules.

Thanks,

-- 
Masami Hiramatsu 
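
As a rough sketch of how the per-user helpers mentioned above could look
on top of a single arch hook (the helper names come from this thread; the
common text_alloc()/text_alloc_immediate() hooks and which users map onto
which hook are assumptions for illustration only):

    /* Two arch-provided allocators, thin per-user wrappers on top. */
    void *text_alloc(unsigned long size);            /* assumed arch hook */
    void *text_alloc_immediate(unsigned long size);  /* assumed: near kernel text */

    static inline void *text_alloc_module(unsigned long size)
    {
            return text_alloc(size);
    }

    static inline void *text_alloc_bpf(unsigned long size)
    {
            return text_alloc(size);
    }

    static inline void *text_alloc_ftrace(unsigned long size)
    {
            /* needs to be reachable with near branches on x86 */
            return text_alloc_immediate(size);
    }

    static inline void *text_alloc_kprobe(unsigned long size)
    {
            /* trampolines jump to/from kernel code and modules */
            return text_alloc_immediate(size);
    }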


Re: [PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Using single PE BARs to map an SR-IOV BAR is really a choice about what
> strategy to use when mapping a BAR. It doesn't make much sense for this to
> be a global setting since a device might have one large BAR which needs to
> be mapped with single PE windows and another smaller BAR that can be mapped
> with a regular segmented window. Make the segmented vs single decision a
> per-BAR setting and clean up the logic that decides which mode to use.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 131 +++--
>  arch/powerpc/platforms/powernv/pci.h   |  10 +-
>  2 files changed, 75 insertions(+), 66 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 8de03636888a..87377d95d648 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -146,10 +146,9 @@
>  static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  {
>   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
> - const resource_size_t gate = phb->ioda.m64_segsize >> 2;
>   struct resource *res;
>   int i;
> - resource_size_t size, total_vf_bar_sz;
> + resource_size_t vf_bar_sz;
>   struct pnv_iov_data *iov;
>   int mul, total_vfs;
>  
> @@ -158,9 +157,9 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>   goto disable_iov;
>   pdev->dev.archdata.iov_data = iov;
>  
> + /* FIXME: totalvfs > phb->ioda.total_pe_num is going to be a problem */


WARN_ON_ONCE() then?


>   total_vfs = pci_sriov_get_totalvfs(pdev);
>   mul = phb->ioda.total_pe_num;
> - total_vf_bar_sz = 0;
>  
>   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
> @@ -173,50 +172,51 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>   goto disable_iov;
>   }
>  
> - total_vf_bar_sz += pci_iov_resource_size(pdev,
> - i + PCI_IOV_RESOURCES);
> + vf_bar_sz = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>  
>   /*
> -  * If bigger than quarter of M64 segment size, just round up
> -  * power of two.
> +  * Generally, one segmented M64 BAR maps one IOV BAR. However,
> +  * if a VF BAR is too large we end up wasting a lot of space.
> +  * If we've got a BAR that's bigger than greater than 1/4 of the


bigger, greater, huger? :)

Also, a nit: s/got a BAR/got a VF BAR/


> +  * default window's segment size then switch to using single PE
> +  * windows. This limits the total number of VFs we can support.

Just to get idea about absolute numbers here.

On my P9:

./pciex@600c3c030/ibm,opal-m64-window
 00060200  00060200  0040 

so that default window's segment size is 0x40../512 = 512MB?


>*
> -  * Generally, one M64 BAR maps one IOV BAR. To avoid conflict
> -  * with other devices, IOV BAR size is expanded to be
> -  * (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
> -  * segment size , the expanded size would equal to half of the
> -  * whole M64 space size, which will exhaust the M64 Space and
> -  * limit the system flexibility.  This is a design decision to
> -  * set the boundary to quarter of the M64 segment size.
> +  * The 1/4 limit is arbitrary and can be tweaked.
>*/
> - if (total_vf_bar_sz > gate) {
> - mul = roundup_pow_of_two(total_vfs);
> - dev_info(&pdev->dev,
> - "VF BAR Total IOV size %llx > %llx, roundup to 
> %d VFs\n",
> - total_vf_bar_sz, gate, mul);
> - iov->m64_single_mode = true;
> - break;
> - }
> - }
> + if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
> + /*
> +  * On PHB3, the minimum size alignment of M64 BAR in
> +  * single mode is 32MB. If this VF BAR is smaller than
> +  * 32MB, but still too large for a segmented window
> +  * then we can't map it and need to disable SR-IOV for
> +  * this device.


Why not use single PE mode for such BAR? Better than nothing.


> +  */
> + if (vf_bar_sz < SZ_32M) {
> + pci_err(pdev, "VF BAR%d: %pR can't be mapped in 
> single PE mode\n",
> + i, res);
> + goto disable_iov;
> + }
>  
> - for

Re: [v3 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START

2020-07-14 Thread Ram Pai
On Mon, Jul 13, 2020 at 10:59:41AM +0530, Bharata B Rao wrote:
> On Sat, Jul 11, 2020 at 02:13:43AM -0700, Ram Pai wrote:
> > Merging of pages associated with each memslot of a SVM is
> > disabled when the page is migrated in H_SVM_PAGE_IN handler.
> > 
> > This operation should have been done much earlier; the moment the VM
> > is initiated for secure-transition. Delaying this operation increases
> > the probability of those pages acquiring new references, making it
> > impossible to migrate those pages in H_SVM_PAGE_IN handler.
> > 
> > Disable page-migration in H_SVM_INIT_START handling.
> 
> While it is a good idea to disable KSM merging for all VMAs during
> H_SVM_INIT_START, I am curious if you did observe an actual case of
> ksm_madvise() failing which resulted in subsequent H_SVM_PAGE_IN
> failing to migrate?

No. I did not find any ksm_madvise() failing.  But it did not make sense
to ksm_madvise() every time a page_in was requested. Hence I proposed
this patch. H_SVM_INIT_START is the right place for ksm_madvise().

> 
> > 
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c | 96 
> > +-
> >  1 file changed, 74 insertions(+), 22 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> > b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > index 3d987b1..bfc3841 100644
> > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > @@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
> > struct kvm *kvm,
> > return false;
> >  }
> >  
> > +static int kvmppc_memslot_page_merge(struct kvm *kvm,
> > +   struct kvm_memory_slot *memslot, bool merge)
> > +{
> > +   unsigned long gfn = memslot->base_gfn;
> > +   unsigned long end, start = gfn_to_hva(kvm, gfn);
> > +   int ret = 0;
> > +   struct vm_area_struct *vma;
> > +   int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> > +
> > +   if (kvm_is_error_hva(start))
> > +   return H_STATE;
> 
> This and other cases below seem to be a new return value from
> H_SVM_INIT_START. May be update the documentation too along with
> this patch?

ok.

> 
> > +
> > +   end = start + (memslot->npages << PAGE_SHIFT);
> > +
> > +   down_write(&kvm->mm->mmap_sem);
> 
> When you rebase the patches against latest upstream you may want to
> replace the above and other instances by mmap_write/read_lock().

ok.

> 
> > +   do {
> > +   vma = find_vma_intersection(kvm->mm, start, end);
> > +   if (!vma) {
> > +   ret = H_STATE;
> > +   break;
> > +   }
> > +   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> > + merge_flag, &vma->vm_flags);
> > +   if (ret) {
> > +   ret = H_STATE;
> > +   break;
> > +   }
> > +   start = vma->vm_end + 1;
> > +   } while (end > vma->vm_end);
> > +
> > +   up_write(&kvm->mm->mmap_sem);
> > +   return ret;
> > +}
> > +
> > +static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
> > +{
> > +   struct kvm_memslots *slots;
> > +   struct kvm_memory_slot *memslot;
> > +   int ret = 0;
> > +
> > +   slots = kvm_memslots(kvm);
> > +   kvm_for_each_memslot(memslot, slots) {
> > +   ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
> > +   if (ret)
> > +   break;
> > +   }
> > +   return ret;
> > +}
> > +
> > +static inline int kvmppc_disable_page_merge(struct kvm *kvm)
> > +{
> > +   return __kvmppc_page_merge(kvm, false);
> > +}
> > +
> > +static inline int kvmppc_enable_page_merge(struct kvm *kvm)
> > +{
> > +   return __kvmppc_page_merge(kvm, true);
> > +}
> > +
> >  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> >  {
> > struct kvm_memslots *slots;
> > @@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> > return H_AUTHORITY;
> >  
> > srcu_idx = srcu_read_lock(&kvm->srcu);
> > +
> > +   /* disable page-merging for all memslot */
> > +   ret = kvmppc_disable_page_merge(kvm);
> > +   if (ret)
> > +   goto out;
> > +
> > +   /* register the memslot */
> > slots = kvm_memslots(kvm);
> > kvm_for_each_memslot(memslot, slots) {
> > if (kvmppc_uvmem_slot_init(kvm, memslot)) {
> > ret = H_PARAMETER;
> > -   goto out;
> > +   break;
> > }
> > ret = uv_register_mem_slot(kvm->arch.lpid,
> >memslot->base_gfn << PAGE_SHIFT,
> > @@ -245,9 +311,12 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> > if (ret < 0) {
> > kvmppc_uvmem_slot_free(kvm, memslot);
> > ret = H_PARAMETER;
> > -   goto out;
> > +   break;
> > }
> > }
> > +
> > +   if (ret)
> > +   kvmppc_enable_page_merge(kvm);
> 
> Is there any use of enabling KSM merging

Re: [v3 4/5] KVM: PPC: Book3S HV: retry page migration before erroring-out H_SVM_PAGE_IN

2020-07-14 Thread Ram Pai
On Mon, Jul 13, 2020 at 03:20:43PM +0530, Bharata B Rao wrote:
> On Sat, Jul 11, 2020 at 02:13:46AM -0700, Ram Pai wrote:
> > The page requested for page-in can sometimes have transient
> > references, and hence cannot be migrated immediately. Retry a few times
> > before returning an error.
> 
> As I noted in the previous patch, we need to understand what are these
> transient errors and they occur on what type of pages?

It's not clear when they occur. But they occur quite irregularly, and they
do occur on pages that were shared in the previous boot.

> 
> The previous patch also introduced a bit of retry logic in the
> page-in path. Can you consolidate the retry logic into a separate
> patch?

Yes. Having the retry logic in a separate patch gives us the flexibility
to drop it, if needed.  The retry patch is not the core element of this
patch series.

RP


Re: [v3 3/5] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-07-14 Thread Ram Pai
On Mon, Jul 13, 2020 at 03:15:06PM +0530, Bharata B Rao wrote:
> On Sat, Jul 11, 2020 at 02:13:45AM -0700, Ram Pai wrote:
> > The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the 
> > pages
> >  
> > if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
> > -   ret = -1;
> > +   ret = -2;
> 
> migrate_vma_setup() has marked that this pfn can't be migrated. What
> transient errors are you observing which will disappear within 10
> retries?
> 
> Also till now when UV used to pull in all the pages, we never seemed to
> have hit these transient errors. But now when HV is pushing the same
> pages, we see these errors which are disappearing after 10 retries.
> Can you explain this more please? What sort of pages are these?

We did see them even before this patch. The retry alleviates the
problem, but does not entirely eliminate it. If the chance of seeing
the issue without the patch is 1%,  the chance of seeing this issue
with this patch becomes 0.25%.

> 
> > goto out_finalize;
> > }
> > +   bool retry = 0;
...snip...
> > +
> > +   *ret = 0;
> > +   while (kvmppc_next_nontransitioned_gfn(memslot, kvm, &gfn)) {
> > +
> > +   down_write(&kvm->mm->mmap_sem);
> 
> Acquiring and releasing mmap_sem in a loop? Any reason?
> 
> Now that you have moved ksm_madvise() calls to init time, any specific
> reason to take write mmap_sem here?

The semaphore protects the vma, right?

And it's acquired/released in the loop to provide the ability to relinquish
the cpu without getting into a tight loop and causing softlockups.

> 
> > +   start = gfn_to_hva(kvm, gfn);
> > +   if (kvm_is_error_hva(start)) {
> > +   *ret = H_STATE;
> > +   goto next;
> > +   }
> > +
> > +   end = start + (1UL << PAGE_SHIFT);
> > +   vma = find_vma_intersection(kvm->mm, start, end);
> > +   if (!vma || vma->vm_start > start || vma->vm_end < end) {
> > +   *ret = H_STATE;
> > +   goto next;
> > +   }
> > +
> > +   mutex_lock(&kvm->arch.uvmem_lock);
> > +   *ret = kvmppc_svm_migrate_page(vma, start, end,
> > +   (gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
> > +   mutex_unlock(&kvm->arch.uvmem_lock);
> > +
> > +next:
> > +   up_write(&kvm->mm->mmap_sem);
> > +
> > +   if (*ret == -2) {
> > +   retry = 1;
> 
> Using true or false assignment makes it easy to understand the intention
> of this 'retry' variable.

ok. next version will have that change.

Thanks for the comments,
RP


Re: [PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-14 Thread Alexey Kardashevskiy



On 15/07/2020 14:46, Oliver O'Halloran wrote:
> On Wed, Jul 15, 2020 at 2:41 PM Alexey Kardashevskiy  wrote:
>>
>>
>>
>> On 15/07/2020 14:21, Oliver O'Halloran wrote:
>>> On Wed, Jul 15, 2020 at 2:00 PM Alexey Kardashevskiy  wrote:


 or we could just skip setting

 ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;

 for uninteresting platforms in pnv_pci_init_ioda_phb().
>>>
> > I don't think so. ppc_md is per-platform, not per-PHB and we still
>>> have to deal with a mixture of IODA/NVLink/OpenCAPI PHBs on a single
>>> system.
>>
>> NVLink/OpenCAPI won't have SRIOV devices.
> 
> ...OR WILL THEY?

NO!


>> Other types won't appear on
>> the same platform simultaneously. It is not too clean, yes.
> 
> Sure, my point is that's a per-PHB setting rather than a per-platform
> one so we should set it up like that.

and my point is that you did too good a job getting rid of the IODA1 vs IODA2
checks to keep this check. But ok.

> 
>>> We could make it a callback in pnv_phb, but  it seemed like
>>> more indirection than it's worth.
>>
>> I genuinely dislike how we use ppc_md so removing things from it is
>> definitely a good thing.
> 
> you wouldn't be able to get rid of it. We'd have something like what
> we have for the existing pcibios calls where there's a "generic" one
> that bounces it to a member of pci_controller_ops, which then bounces
> it to the pnv_phb method. It's bad and I hate it.

No argument here...


-- 
Alexey


Re: [PATCH 14/15] powerpc/powernv/sriov: Refactor M64 BAR setup

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Split up the logic so that we have one branch that handles setting up a
> segmented window and another that handles setting up single PE windows for
> each VF.
> 
> Signed-off-by: Oliver O'Halloran 


Reviewed-by: Alexey Kardashevskiy 



> ---
> This patch could be folded into the previous one. I've kept it
> separate mainly because the diff is *horrific* when they're merged.
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 57 ++
>  1 file changed, 27 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 2f967aa4fbf5..8de03636888a 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -441,52 +441,49 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   struct resource   *res;
>   inti, j;
>   int64_trc;
> - inttotal_vfs;
>   resource_size_tsize, start;
> - intm64_bars;
> + intbase_pe_num;
>  
>   phb = pci_bus_to_pnvhb(pdev->bus);
>   iov = pnv_iov_get(pdev);
> - total_vfs = pci_sriov_get_totalvfs(pdev);
> -
> - if (iov->m64_single_mode)
> - m64_bars = num_vfs;
> - else
> - m64_bars = 1;
>  
>   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
>   if (!res->flags || !res->parent)
>   continue;
>  
> - for (j = 0; j < m64_bars; j++) {
> + /* don't need single mode? map everything in one go! */
> + if (!iov->m64_single_mode) {
>   win = pnv_pci_alloc_m64_bar(phb, iov);
>   if (win < 0)
>   goto m64_failed;
>  
> - if (iov->m64_single_mode) {
> - int pe_num = iov->vf_pe_arr[j].pe_number;
> -
> - size = pci_iov_resource_size(pdev,
> - PCI_IOV_RESOURCES + i);
> - start = res->start + size * j;
> - rc = pnv_ioda_map_m64_single(phb, win,
> -  pe_num,
> -  start,
> -  size);
> - } else {
> - size = resource_size(res);
> - start = res->start;
> -
> - rc = pnv_ioda_map_m64_accordion(phb, win, start,
> - size);
> - }
> + size = resource_size(res);
> + start = res->start;
>  
> - if (rc != OPAL_SUCCESS) {
> - dev_err(&pdev->dev, "Failed to map M64 window 
> #%d: %lld\n",
> - win, rc);
> + rc = pnv_ioda_map_m64_accordion(phb, win, start, size);
> + if (rc)
> + goto m64_failed;
> +
> + continue;
> + }
> +
> + /* otherwise map each VF with single PE BARs */
> + size = pci_iov_resource_size(pdev, PCI_IOV_RESOURCES + i);
> + base_pe_num = iov->vf_pe_arr[0].pe_number;
> +
> + for (j = 0; j < num_vfs; j++) {
> + win = pnv_pci_alloc_m64_bar(phb, iov);
> + if (win < 0)
> + goto m64_failed;
> +
> + start = res->start + size * j;
> + rc = pnv_ioda_map_m64_single(phb, win,
> +  base_pe_num + j,
> +  start,
> +  size);
> + if (rc)
>   goto m64_failed;
> - }
>   }
>   }
>   return 0;
> 

-- 
Alexey


Re: [PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 2:41 PM Alexey Kardashevskiy  wrote:
>
>
>
> On 15/07/2020 14:21, Oliver O'Halloran wrote:
> > On Wed, Jul 15, 2020 at 2:00 PM Alexey Kardashevskiy  wrote:
> >>
> >>
> >> or we could just skip setting
> >>
> >> ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;
> >>
> >> for uninteresting platforms in pnv_pci_init_ioda_phb().
> >
> > I don't think so. ppc_md is per-platform, not per-PHB and we still
> > have to deal with a mixture of IODA/NVLink/OpenCAPI PHBs on a single
> > system.
>
> NVLink/OpenCAPI won't have SRIOV devices.

...OR WILL THEY?

> Other types won't appear on
> the same platform simultaneously. It is not too clean, yes.

Sure, my point is that's a per-PHB setting rather than a per-platform
one so we should set it up like that.

> > We could make it a callback in pnv_phb, but  it seemed like
> > more indirection than it's worth.
>
> I genuinely dislike how we use ppc_md so removing things from it is
> definitely a good thing.

you wouldn't be able to get rid of it. We'd have something like what
we have for the existing pcibios calls where there's a "generic" one
that bounces it to a member of pci_controller_ops, which then bounces
it to the pnv_phb method. It's bad and I hate it.

>
>
> --
> Alexey


Re: [PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-14 Thread Alexey Kardashevskiy



On 15/07/2020 14:21, Oliver O'Halloran wrote:
> On Wed, Jul 15, 2020 at 2:00 PM Alexey Kardashevskiy  wrote:
>>
>>
>> or we could just skip setting
>>
>> ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;
>>
>> for uninteresting platforms in pnv_pci_init_ioda_phb().
> 
> I don't think so. ppc_md is per-platform, not per-PHB and we still
> have to deal with a mixture of IODA/NVLink/OpenCAPI PHBs on a single
> system.

NVLink/OpenCAPI won't have SRIOV devices. Other types won't appear on
the same platform simultaneously. It is not too clean, yes.


> We could make it a callback in pnv_phb, but  it seemed like
> more indirection than it's worth.

I genuinely dislike how we use ppc_md so removing things from it is
definitely a good thing.


-- 
Alexey


Re: [PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 2:00 PM Alexey Kardashevskiy  wrote:
>
>
> or we could just skip setting
>
> ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;
>
> for uninteresting platforms in pnv_pci_init_ioda_phb().

I don't think so. ppc_md is per-platform, not per-PHB and we still
have to deal with a mixture of IODA/NVLink/OpenCAPI PHBs on a single
system. We could make it a callback in pnv_phb, but  it seemed like
more indirection than it's worth.


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 8:03 AM Arnd Bergmann  wrote:
>
> - Most error checking is static: PCIBIOS_BAD_REGISTER_NUMBER
>   only happens if you pass an invalid register number, but most
>   callers pass a compile-time constant register number that is
>   known to be correct, or the driver would never work. Similarly,
>   PCIBIOS_DEVICE_NOT_FOUND wouldn't normally happen
>   since you pass a valid pci_device pointer that was already
>   probed.

Having some feedback about obvious programming errors is still useful
when doing driver development. Reporting those via printk() would
probably be more useful to those who care though.

> - config space accesses are very rare compared to memory
>   space access and on the hardware side the error handling
>   would be similar, but readl/writel don't return errors, they just
>   access wrong registers or return 0x.
>   arch/powerpc/kernel/eeh.c has a ton extra code written to
>   deal with it, but no other architectures do.

TBH the EEH MMIO hooks were probably a mistake to begin with. Errors
detected via MMIO are almost always asynchronous to the error itself
so you usually just wind up with a misleading stack trace rather than
any kind of useful synchronous error reporting. It seems like most
drivers don't bother checking for 0xFFs either and rely on the
asynchronous reporting via .error_detected() instead, so I have to
wonder what the point is. I've been thinking of removing the MMIO
hooks and using a background poller to check for errors on each PHB
periodically (assuming we don't have an EEH interrupt) instead. That
would remove the requirement for eeh_dev_check_failure() to be
interrupt safe too, so it might even let us fix all the godawful races
in EEH.

> - If we add code to detect errors in pci_read_config_*
>   and do some of the stuff from powerpc's
>   eeh_dev_check_failure(), we are more likely to catch
>   intermittent failures when drivers don't check, or bugs
>   with invalid arguments in device drivers than relying on
>   drivers to get their error handling right when those code
>   paths don't ever get covered in normal testing.

Adding some kind of error detection to the generic config accessors
wouldn't hurt, but detection is only half the problem. The main job of
eeh_dev_check_failure() is waking up the EEH recovery thread which
actually handles notifying drivers, device resets, etc and you'd want
something in the PCI core. Right now there's two implementations of
that reporting logic: one for EEH in arch/powerpc/eeh_driver.c and one
for AER/DPC in drivers/pci/pcie/err.c. I think the latter could be
moved into the PCI core easily enough since there's not much about it
that's really specific to PCIe. Ideally we could drop the EEH specific
one too, but I'm not sure how to implement that without it devolving
into callback spaghetti.

Oliver
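
A minimal sketch of the "detect it in the generic accessor" idea discussed
above; pci_dev_check_failure() is a hypothetical hook standing in for what
eeh_dev_check_failure() does on powerpc, and the recovery/reporting side
remains the open question.

    int pci_read_config_dword_checked(struct pci_dev *dev, int where, u32 *val)
    {
            int ret = pci_read_config_dword(dev, where, val);

            /* all-ones or an explicit error may mean the device dropped off */
            if (ret != PCIBIOS_SUCCESSFUL || *val == ~0U)
                    pci_dev_check_failure(dev);     /* hypothetical hook */

            return ret;
    }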


Re: [PATCH 3/3] ASoC: fsl-asoc-card: Support Headphone and Microphone Jack detection

2020-07-14 Thread Shengjiu Wang
On Wed, Jul 15, 2020 at 5:16 AM Nicolin Chen  wrote:
>
> Hi Shengjiu,
>
> The whole series looks good to me. Just a couple of small
> questions inline:
>
> On Tue, Jul 14, 2020 at 05:05:36PM +0800, Shengjiu Wang wrote:
> > Use asoc_simple_init_jack function from simple card to implement
> > the Headphone and Microphone detection.
> > Register notifier to disable Speaker when Headphone is plugged in
> > and enable Speaker when Headphone is unplugged.
> > Register notifier to disable Digital Microphone when Analog Microphone
> > is plugged in and enable DMIC when Analog Microphone is unplugged.
> >
> > Signed-off-by: Shengjiu Wang 
> > ---
> >  sound/soc/fsl/Kconfig |  1 +
> >  sound/soc/fsl/fsl-asoc-card.c | 69 ++-
> >  2 files changed, 68 insertions(+), 2 deletions(-)
>
> >  static int fsl_asoc_card_late_probe(struct snd_soc_card *card)
> >  {
> >   struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(card);
> > @@ -745,8 +789,29 @@ static int fsl_asoc_card_probe(struct platform_device 
> > *pdev)
> >   snd_soc_card_set_drvdata(&priv->card, priv);
> >
> >   ret = devm_snd_soc_register_card(&pdev->dev, &priv->card);
> > - if (ret && ret != -EPROBE_DEFER)
> > - dev_err(&pdev->dev, "snd_soc_register_card failed (%d)\n", 
> > ret);
> > + if (ret) {
> > + if (ret != -EPROBE_DEFER)
> > + dev_err(&pdev->dev, "snd_soc_register_card failed 
> > (%d)\n", ret);
>
> I think we may move this EPROBE_DEFER to the asrc_fail label.

If we move this to the asrc_fail label, then it will be hard to define the
error message.
There are many places that goto asrc_fail.

>
> > + goto asrc_fail;
> > + }
> > +
> > + if (of_property_read_bool(np, "hp-det-gpio")) {
>
> Could we move this check inside asoc_simple_init_jack? There's no
> problem with doing it here though, yet I got a bit confused by it
> as I thought it's a boolean type property, which would be against
> the DT bindings until I saw asoc_simple_init_jack() uses the same
> string to get the GPIO. Just it probably would be a bit tricky as
> we need it to be optional here.
>
> Otherwise, I think we may add a line of comments to indicate that
> the API would use the same string to get the GPIO.

In asoc_simple_init_jack, gpio_is_valid() will fail when there is
no "hp-det-gpio" property, and asoc_simple_init_jack will return 0.

The reason why I add a check here is mostly for
snd_soc_jack_notifier_register(): when there is no jack created,
there will be a kernel dump.

or I can use this code:

-   if (of_property_read_bool(np, "hp-det-gpio")) {
-   ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
-   1, NULL, "Headphone Jack");
-   if (ret)
-   goto asrc_fail;
+   ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
+   1, NULL, "Headphone Jack");
+   if (ret)
+   goto asrc_fail;

+   if (priv->hp_jack.jack.jack)
snd_soc_jack_notifier_register(&priv->hp_jack.jack,
&hp_jack_nb);
-   }

what do you think?

best regards
wang shengjiu


Re: [PATCH v3 2/9] powerpc/watchpoint: Fix DAWR exception constraint

2020-07-14 Thread Ravi Bangoria

Hi Jordan,


@@ -536,7 +538,12 @@ static bool check_dawrx_constraints(struct pt_regs *regs, 
int type,
 if (OP_IS_LOAD(type) && !(info->type & HW_BRK_TYPE_READ))
 return false;

-   if (OP_IS_STORE(type) && !(info->type & HW_BRK_TYPE_WRITE))
+   /*
+* The Cache Management instructions other than dcbz never
+* cause a match. i.e. if type is CACHEOP, the instruction
+* is dcbz, and dcbz is treated as Store.
+*/
+   if ((OP_IS_STORE(type) || type == CACHEOP) && !(info->type & 
HW_BRK_TYPE_WRITE))
 return false;

This change seems separate from this commit?


I also thought about it but was not sure. See below ...



 if (is_kernel_addr(regs->nip) && !(info->type & HW_BRK_TYPE_KERNEL))
@@ -553,7 +560,8 @@ static bool check_dawrx_constraints(struct pt_regs *regs, 
int type,
   * including extraneous exception. Otherwise return false.
   */
  static bool check_constraints(struct pt_regs *regs, struct ppc_inst instr,
- int type, int size, struct arch_hw_breakpoint 
*info)
+ unsigned long ea, int type, int size,
+ struct arch_hw_breakpoint *info)
  {
 bool in_user_range = dar_in_user_range(regs->dar, info);
 bool dawrx_constraints;
@@ -569,11 +577,10 @@ static bool check_constraints(struct pt_regs *regs, 
struct ppc_inst instr,
 }

  if (unlikely(ppc_inst_equal(instr, ppc_inst(0)))) {
-   if (in_user_range)
-   return true;
-
-   if (dar_in_hw_range(regs->dar, info)) {
-   info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   if (dar_in_hw_range(regs->dar, info))
+   return true;
+   } else {
 return true;

I think this would be clearer as:
 if (cpu_has_feature(CPU_FTR_ARCH_31) &&
!(dar_in_hw_range(regs->dar, info)))
 return false;
 else
 return true;


ok




 }
 return false;
@@ -581,10 +588,20 @@ static bool check_constraints(struct pt_regs *regs, 
struct ppc_inst instr,

 dawrx_constraints = check_dawrx_constraints(regs, type, info);

-   if (dar_user_range_overlaps(regs->dar, size, info))
+   if (type == UNKNOWN) {
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   if (dar_in_hw_range(regs->dar, info))
+   return dawrx_constraints;
+   } else {
+   return dawrx_constraints;
+   }
+   return false;
+   }

Similar thing here, it could be:
 if ((cpu_has_feature(CPU_FTR_ARCH_31)) &&
!(dar_in_hw_range(regs->dar, info)))
 return false;
 else
 return dawrx_constraints;


ok


+
+   if (ea_user_range_overlaps(ea, size, info))
 return dawrx_constraints;

-   if (dar_hw_range_overlaps(regs->dar, size, info)) {
+   if (ea_hw_range_overlaps(ea, size, info)) {
 if (dawrx_constraints) {
 info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
 return true;
@@ -593,8 +610,17 @@ static bool check_constraints(struct pt_regs *regs, struct 
ppc_inst instr,
 return false;
  }

+static int cache_op_size(void)
+{
+#ifdef __powerpc64__
+   return ppc64_caches.l1d.block_size;
+#else
+   return L1_CACHE_BYTES;
+#endif
+}
+
  static void get_instr_detail(struct pt_regs *regs, struct ppc_inst *instr,
-int *type, int *size, bool *larx_stcx)
+int *type, int *size, unsigned long *ea)
  {
 struct instruction_op op;

@@ -602,16 +628,23 @@ static void get_instr_detail(struct pt_regs *regs, struct 
ppc_inst *instr,
 return;

 analyse_instr(&op, regs, *instr);
-
-   /*
-* Set size = 8 if analyse_instr() fails. If it's a userspace
-* watchpoint(valid or extraneous), we can notify user about it.
-* If it's a kernel watchpoint, instruction  emulation will fail
-* in stepping_handler() and watchpoint will be disabled.
-*/
 *type = GETTYPE(op.type);
-   *size = !(*type == UNKNOWN) ? GETSIZE(op.type) : 8;
-   *larx_stcx = (*type == LARX || *type == STCX);
+   *ea = op.ea;
+#ifdef __powerpc64__
+   if (!(regs->msr & MSR_64BIT))
+   *ea &= 0xUL;
+#endif
+
+   *size = GETSIZE(op.type);
+   if (*type == CACHEOP) {
+   *size = cache_op_size();
+   *ea &= ~(*size - 1);
+   }

Again related to CACHEOP, should these changes be mentioned in the
commit message?


For CACHEOP, ea returned by analyse_instr() needs to be aligned down to cache
block size manually. Also, for CACHEOP, size returned by analyse_instr() is 0
and thus

Re: [PATCH 13/15] powerpc/powernv/sriov: Move M64 BAR allocation into a helper

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> I want to refactor the loop this code is currently inside of. Hoist it on
> out.
> 
> Signed-off-by: Oliver O'Halloran 



Reviewed-by: Alexey Kardashevskiy 


> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 31 ++
>  1 file changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index d5699cd2ab7a..2f967aa4fbf5 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -416,6 +416,23 @@ static int64_t pnv_ioda_map_m64_single(struct pnv_phb 
> *phb,
>   return rc;
>  }
>  
> +static int pnv_pci_alloc_m64_bar(struct pnv_phb *phb, struct pnv_iov_data 
> *iov)
> +{
> + int win;
> +
> + do {
> + win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
> + phb->ioda.m64_bar_idx + 1, 0);
> +
> + if (win >= phb->ioda.m64_bar_idx + 1)
> + return -1;
> + } while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
> +
> + set_bit(win, iov->used_m64_bar_mask);
> +
> + return win;
> +}
> +
>  static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  {
>   struct pnv_iov_data   *iov;
> @@ -443,17 +460,9 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   continue;
>  
>   for (j = 0; j < m64_bars; j++) {
> -
> - /* allocate a window ID for this BAR */
> - do {
> - win = 
> find_next_zero_bit(&phb->ioda.m64_bar_alloc,
> - phb->ioda.m64_bar_idx + 1, 0);
> -
> - if (win >= phb->ioda.m64_bar_idx + 1)
> - goto m64_failed;
> - } while (test_and_set_bit(win, 
> &phb->ioda.m64_bar_alloc));
> - set_bit(win, iov->used_m64_bar_mask);
> -
> + win = pnv_pci_alloc_m64_bar(phb, iov);
> + if (win < 0)
> + goto m64_failed;
>  
>   if (iov->m64_single_mode) {
>   int pe_num = iov->vf_pe_arr[j].pe_number;
> 

-- 
Alexey


Re: [PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Remove the IODA2 PHB checks. We already assume IODA2 in several places so
> there's not much point in wrapping most of the setup and teardown process
> in an if block.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 86 --
>  1 file changed, 49 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 08f88187d65a..d5699cd2ab7a 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -610,16 +610,18 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>   num_vfs = iov->num_vfs;
>   base_pe = iov->vf_pe_arr[0].pe_number;
>  
> + if (WARN_ON(!iov))
> + return;
> +
>   /* Release VF PEs */
>   pnv_ioda_release_vf_PE(pdev);
>  
> - if (phb->type == PNV_PHB_IODA2) {
> - if (!iov->m64_single_mode)
> - pnv_pci_vf_resource_shift(pdev, -base_pe);
> + /* Un-shift the IOV BAR resources */
> + if (!iov->m64_single_mode)
> + pnv_pci_vf_resource_shift(pdev, -base_pe);
>  
> - /* Release M64 windows */
> - pnv_pci_vf_release_m64(pdev, num_vfs);
> - }
> + /* Release M64 windows */
> + pnv_pci_vf_release_m64(pdev, num_vfs);
>  }
>  
>  static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
> @@ -693,41 +695,51 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, 
> u16 num_vfs)
>   phb = pci_bus_to_pnvhb(pdev->bus);
>   iov = pnv_iov_get(pdev);
>  
> - if (phb->type == PNV_PHB_IODA2) {
> - if (!iov->vfs_expanded) {
> - dev_info(&pdev->dev, "don't support this SRIOV device"
> - " with non 64bit-prefetchable IOV BAR\n");
> - return -ENOSPC;
> - }
> + /*
> +  * There's a calls to IODA2 PE setup code littered throughout. We could
> +  * probably fix that, but we'd still have problems due to the
> +  * restriction inherent on IODA1 PHBs.
> +  *
> +  * NB: We class IODA3 as IODA2 since they're very similar.
> +  */
> + if (phb->type != PNV_PHB_IODA2) {
> + pci_err(pdev, "SR-IOV is not supported on this PHB\n");
> + return -ENXIO;
> + }

or we could just skip setting

ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;

for uninteresting platforms in pnv_pci_init_ioda_phb().
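
A minimal sketch of that alternative, assuming the hook assignment lives (or
could live) in pnv_pci_init_ioda_phb() and that a matching disable hook
exists:

	if (phb->type == PNV_PHB_IODA2) {
		/* only wire up SR-IOV on PHBs that can actually support it */
		ppc_md.pcibios_sriov_enable = pnv_pcibios_sriov_enable;
		ppc_md.pcibios_sriov_disable = pnv_pcibios_sriov_disable; /* assumed counterpart */
	}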


>  
> - /* allocate a contigious block of PEs for our VFs */
> - base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
> - if (!base_pe) {
> - pci_err(pdev, "Unable to allocate PEs for %d VFs\n", 
> num_vfs);
> - return -EBUSY;
> - }
> + if (!iov->vfs_expanded) {
> + dev_info(&pdev->dev, "don't support this SRIOV device"
> + " with non 64bit-prefetchable IOV BAR\n");
> + return -ENOSPC;
> + }
>  
> - iov->vf_pe_arr = base_pe;
> - iov->num_vfs = num_vfs;
> + /* allocate a contigious block of PEs for our VFs */
> + base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
> + if (!base_pe) {
> + pci_err(pdev, "Unable to allocate PEs for %d VFs\n", num_vfs);
> + return -EBUSY;
> + }
>  
> - /* Assign M64 window accordingly */
> - ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
> - if (ret) {
> - dev_info(&pdev->dev, "Not enough M64 window 
> resources\n");
> - goto m64_failed;
> - }
> + iov->vf_pe_arr = base_pe;
> + iov->num_vfs = num_vfs;
>  
> - /*
> -  * When using one M64 BAR to map one IOV BAR, we need to shift
> -  * the IOV BAR according to the PE# allocated to the VFs.
> -  * Otherwise, the PE# for the VF will conflict with others.
> -  */
> - if (!iov->m64_single_mode) {
> - ret = pnv_pci_vf_resource_shift(pdev,
> - base_pe->pe_number);
> - if (ret)
> - goto shift_failed;
> - }
> + /* Assign M64 window accordingly */
> + ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
> + if (ret) {
> + dev_info(&pdev->dev, "Not enough M64 window resources\n");
> + goto m64_failed;
> + }
> +
> + /*
> +  * When using one M64 BAR to map one IOV BAR, we need to shift
> +  * the IOV BAR according to the PE# allocated to the VFs.
> +  * Otherwise, the PE# for the VF will conflict with others.
> +  */
> + if (!iov->m64_single_mode) {
> + ret = pnv_pci_vf_resource_shift(pdev,
> + base_pe->pe_number);

This can be a single line no

Re: [PATCH] powerpc: Fix inconsistent function names

2020-07-14 Thread Michael Ellerman
YueHaibing  writes:

> The stub helpers name should be consistent with prototypes.
>
> mm_context_add_vas_windows() --> mm_context_add_vas_window()
> mm_context_remove_vas_windows() --> mm_context_remove_vas_window()
>
> Fixes: c420644c0a8f ("powerpc: Use mm_context vas_windows counter to issue 
> CP_ABORT")
> Signed-off-by: YueHaibing 
> ---
>  arch/powerpc/include/asm/mmu_context.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index 1a474f6b1992..00fd1d44731a 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -218,8 +218,8 @@ static inline void inc_mm_active_cpus(struct mm_struct 
> *mm) { }
>  static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
>  static inline void mm_context_add_copro(struct mm_struct *mm) { }
>  static inline void mm_context_remove_copro(struct mm_struct *mm) { }
> -static inline void mm_context_add_vas_windows(struct mm_struct *mm) { }
> -static inline void mm_context_remove_vas_windows(struct mm_struct *mm) { }
> +static inline void mm_context_add_vas_window(struct mm_struct *mm) { }
> +static inline void mm_context_remove_vas_window(struct mm_struct *mm) { }
>  #endif

Both of those functions are only called from 64-bit only code, so the
stubs should not be needed at all. Which explains why we haven't seen a
build break.

So just dropping them would be better IMO.

cheers


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Viresh Kumar
On 14-07-20, 20:49, Olof Johansson wrote:
> On Tue, Jul 14, 2020 at 8:07 PM Viresh Kumar  wrote:
> >
> > On 14-07-20, 15:50, Lee Jones wrote:
> > > If function callers and providers do not share the same prototypes the
> > > compiler complains of missing prototypes.  Fix this by moving the
> > > already existing prototypes out to a mutually convenient location.
> > >
> > > Fixes the following W=1 kernel build warning(s):
> > >
> > >  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype 
> > > for ‘check_astate’ [-Wmissing-prototypes]
> > >  109 | int check_astate(void)
> > >  | ^~~~
> > >  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype 
> > > for ‘restore_astate’ [-Wmissing-prototypes]
> > >  114 | void restore_astate(int cpu)
> > >  | ^~
> > >
> > > Cc: Olof Johansson 
> > > Cc: Michael Ellerman 
> > > Cc: Benjamin Herrenschmidt 
> > > Cc: Paul Mackerras 
> > > Cc: linuxppc-dev@lists.ozlabs.org
> > > Signed-off-by: Lee Jones 
> > > ---
> > >  arch/powerpc/platforms/pasemi/pasemi.h| 15 
> >
> > Is there no sane way we can include this file directly to the cpufreq
> > file ?
> 
> Yep. arch/powerpc seems to be in the search path for modules on powerpc, so:
> 
> diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> b/drivers/cpufreq/pasemi-cpufreq.c
> index c66f566a854cb..815645170c4de 100644
> --- a/drivers/cpufreq/pasemi-cpufreq.c
> +++ b/drivers/cpufreq/pasemi-cpufreq.c
> @@ -22,6 +22,8 @@
>  #include 
>  #include 
> 
> +#include 
> +
>  #define SDCASR_REG 0x0100
>  #define SDCASR_REG_STRIDE  0x1000
>  #define SDCPWR_CFGA0_REG   0x0100

Fantastic. Thanks.

-- 
viresh


Re: [PATCH v3 05/12] powerpc/drmem: make lmb walk a bit more flexible

2020-07-14 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> @@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned 
> long node,
>  #ifdef CONFIG_PPC_PSERIES
>   if (depth == 1 &&
>   strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
> - walk_drmem_lmbs_early(node, early_init_drmem_lmb);
> + walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);

walk_drmem_lmbs_early() can now fail. Should this failure be propagated
as a return value of early_init_dt_scan_memory_ppc()?

>   return 0;
>   }
>  #endif


> @@ -787,7 +790,7 @@ static int __init parse_numa_properties(void)
>*/
>   memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>   if (memory) {
> - walk_drmem_lmbs(memory, numa_setup_drmem_lmb);
> + walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);

Similarly here. Now that this call can fail, should
parse_numa_properties() handle or propagate the failure?

>   of_node_put(memory);
>   }
>


--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Olof Johansson
On Tue, Jul 14, 2020 at 8:07 PM Viresh Kumar  wrote:
>
> On 14-07-20, 15:50, Lee Jones wrote:
> > If function callers and providers do not share the same prototypes the
> > compiler complains of missing prototypes.  Fix this by moving the
> > already existing prototypes out to a mutually convenient location.
> >
> > Fixes the following W=1 kernel build warning(s):
> >
> >  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype for 
> > ‘check_astate’ [-Wmissing-prototypes]
> >  109 | int check_astate(void)
> >  | ^~~~
> >  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype for 
> > ‘restore_astate’ [-Wmissing-prototypes]
> >  114 | void restore_astate(int cpu)
> >  | ^~
> >
> > Cc: Olof Johansson 
> > Cc: Michael Ellerman 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Lee Jones 
> > ---
> >  arch/powerpc/platforms/pasemi/pasemi.h| 15 
>
> Is there no sane way we can include this file directly to the cpufreq
> file ?

Yep. arch/powerpc seems to be in the search path for modules on powerpc, so:

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index c66f566a854cb..815645170c4de 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -22,6 +22,8 @@
 #include 
 #include 

+#include 
+
 #define SDCASR_REG 0x0100
 #define SDCASR_REG_STRIDE  0x1000
 #define SDCPWR_CFGA0_REG   0x0100


-Olof


Re: [PATCH -next] cpufreq: powernv: Make some symbols static

2020-07-14 Thread Viresh Kumar
On 14-07-20, 22:23, Wei Yongjun wrote:
> The sparse tool complains as follows:
> 
> drivers/cpufreq/powernv-cpufreq.c:88:1: warning:
>  symbol 'pstate_revmap' was not declared. Should it be static?
> drivers/cpufreq/powernv-cpufreq.c:383:18: warning:
>  symbol 'cpufreq_freq_attr_cpuinfo_nominal_freq' was not declared. Should it 
> be static?
> drivers/cpufreq/powernv-cpufreq.c:669:6: warning:
>  symbol 'gpstate_timer_handler' was not declared. Should it be static?
> drivers/cpufreq/powernv-cpufreq.c:902:6: warning:
>  symbol 'powernv_cpufreq_work_fn' was not declared. Should it be static?
> 
> Those symbols are not used outside of this file, so mark
> them static.
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Lee also sent a fix for this, but yours look complete :)

https://lore.kernel.org/lkml/20200714145049.2496163-7-lee.jo...@linaro.org/

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

2020-07-14 Thread Alexey Kardashevskiy



On 15/07/2020 11:38, Oliver O'Halloran wrote:
> On Tue, Jul 14, 2020 at 5:21 PM Alexey Kardashevskiy  wrote:
>>
>> On 14/07/2020 15:58, Oliver O'Halloran wrote:
>>> On Tue, Jul 14, 2020 at 3:37 PM Alexey Kardashevskiy  wrote:

 On 10/07/2020 15:23, Oliver O'Halloran wrote:
> This also means the only remaining user of the old "DMA Weight" code is
> the IODA1 DMA setup code that it was originally added for, which is good.


 Is ditching IODA1 in the plan? :)
>>>
>>> That or separating out the pci_controller_ops for IODA1 and IODA2 so
>>> we can stop any IODA2 specific changes from breaking it.
>>
>> Is IODA1 tested at all these days? Or, is anyone running upstream
>> kernels anywhere and keeps shouting when it does not work on IODA1? Thanks,
> 
> Cedric has a P7 with OPAL. That's probably the one left though.

Has he tried these patches on that box? Or we hope for the best here? :)



-- 
Alexey


Re: [PATCH 11/15] powerpc/powernv/sriov: Drop iov->pe_num_map[]

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Currently the iov->pe_num_map[] does one of two things depending on
> whether single PE mode is being used or not. When it is, this contains an
> array which maps a vf_index to the corresponding PE number. When single PE
> mode is not being used this contains a scalar which is the base PE for the
> set of enabled VFs (for for VFn is base + n).
> 
> The array was necessary because when calling pnv_ioda_alloc_pe() there is
> no guarantee that the allocated PEs would be contigious. We can now


s/contigious/contiguous/ here and below.


> allocate contigious blocks of PEs so this is no longer an issue. This
> allows us to drop the if (single_mode) {} .. else {} block scattered
> through the SR-IOV code which is a nice clean up.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 109 +
>  arch/powerpc/platforms/powernv/pci.h   |   4 +-
>  2 files changed, 25 insertions(+), 88 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index d53a85ccb538..08f88187d65a 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -456,11 +456,13 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>  
>  
>   if (iov->m64_single_mode) {
> + int pe_num = iov->vf_pe_arr[j].pe_number;
> +
>   size = pci_iov_resource_size(pdev,
>   PCI_IOV_RESOURCES + i);
>   start = res->start + size * j;
>   rc = pnv_ioda_map_m64_single(phb, win,
> -  iov->pe_num_map[j],
> +  pe_num,
>start,
>size);
>   } else {
> @@ -599,38 +601,24 @@ static int pnv_pci_vf_resource_shift(struct pci_dev 
> *dev, int offset)
>  
>  static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  {
> + u16num_vfs, base_pe;
>   struct pnv_phb*phb;
> - struct pnv_ioda_pe*pe;
>   struct pnv_iov_data   *iov;
> - u16num_vfs, i;
>  
>   phb = pci_bus_to_pnvhb(pdev->bus);
>   iov = pnv_iov_get(pdev);
>   num_vfs = iov->num_vfs;
> + base_pe = iov->vf_pe_arr[0].pe_number;
>  
>   /* Release VF PEs */
>   pnv_ioda_release_vf_PE(pdev);
>  
>   if (phb->type == PNV_PHB_IODA2) {
>   if (!iov->m64_single_mode)
> - pnv_pci_vf_resource_shift(pdev, -*iov->pe_num_map);
> + pnv_pci_vf_resource_shift(pdev, -base_pe);
>  
>   /* Release M64 windows */
>   pnv_pci_vf_release_m64(pdev, num_vfs);
> -
> - /* Release PE numbers */
> - if (iov->m64_single_mode) {
> - for (i = 0; i < num_vfs; i++) {
> - if (iov->pe_num_map[i] == IODA_INVALID_PE)
> - continue;
> -
> - pe = &phb->ioda.pe_array[iov->pe_num_map[i]];
> - pnv_ioda_free_pe(pe);
> - }
> - } else
> - bitmap_clear(phb->ioda.pe_alloc, *iov->pe_num_map, 
> num_vfs);
> - /* Releasing pe_num_map */
> - kfree(iov->pe_num_map);
>   }
>  }
>  
> @@ -656,13 +644,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
> u16 num_vfs)
>   int vf_bus = pci_iov_virtfn_bus(pdev, vf_index);
>   struct pci_dn *vf_pdn;
>  
> - if (iov->m64_single_mode)
> - pe_num = iov->pe_num_map[vf_index];
> - else
> - pe_num = *iov->pe_num_map + vf_index;
> -
> - pe = &phb->ioda.pe_array[pe_num];
> - pe->pe_number = pe_num;
> + pe = &iov->vf_pe_arr[vf_index];
>   pe->phb = phb;
>   pe->flags = PNV_IODA_PE_VF;
>   pe->pbus = NULL;
> @@ -670,6 +652,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
> u16 num_vfs)
>   pe->mve_number = -1;
>   pe->rid = (vf_bus << 8) | vf_devfn;
>  
> + pe_num = pe->pe_number;
>   pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
>   pci_domain_nr(pdev->bus), pdev->bus->number,
>   PCI_SLOT(vf_devfn), PCI_FUNC(vf_devfn), pe_num);
> @@ -701,9 +684,9 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
> u16 num_vfs)
>  
>  static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  {
> + struct pnv_ioda_pe*base_pe;
>   struct pnv_iov_data   *iov

Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Olof Johansson
On Tue, Jul 14, 2020 at 7:50 AM Lee Jones  wrote:
>
> If function callers and providers do not share the same prototypes the
> compiler complains of missing prototypes.  Fix this by moving the
> already existing prototypes out to a mutually convenient location.
>
> Fixes the following W=1 kernel build warning(s):
>
>  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype for 
> ‘check_astate’ [-Wmissing-prototypes]
>  109 | int check_astate(void)
>  | ^~~~
>  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype for 
> ‘restore_astate’ [-Wmissing-prototypes]
>  114 | void restore_astate(int cpu)
>  | ^~
>
> Cc: Olof Johansson 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Lee Jones 
> ---
>  arch/powerpc/platforms/pasemi/pasemi.h| 15 
>  arch/powerpc/platforms/pasemi/powersave.S |  2 ++
>  drivers/cpufreq/pasemi-cpufreq.c  |  1 +
>  include/linux/platform_data/pasemi.h  | 28 +++
>  4 files changed, 31 insertions(+), 15 deletions(-)
>  create mode 100644 include/linux/platform_data/pasemi.h
>
> diff --git a/arch/powerpc/platforms/pasemi/pasemi.h 
> b/arch/powerpc/platforms/pasemi/pasemi.h
> index 70b56048ed1be..528d81ef748ad 100644
> --- a/arch/powerpc/platforms/pasemi/pasemi.h
> +++ b/arch/powerpc/platforms/pasemi/pasemi.h
> @@ -15,21 +15,6 @@ extern void __init pasemi_map_registers(void);
>  extern void idle_spin(void);
>  extern void idle_doze(void);
>
> -/* Restore astate to last set */
> -#ifdef CONFIG_PPC_PASEMI_CPUFREQ
> -extern int check_astate(void);
> -extern void restore_astate(int cpu);
> -#else
> -static inline int check_astate(void)
> -{
> -   /* Always return >0 so we never power save */
> -   return 1;
> -}
> -static inline void restore_astate(int cpu)
> -{
> -}
> -#endif
> -
>  extern struct pci_controller_ops pasemi_pci_controller_ops;
>
>  #endif /* _PASEMI_PASEMI_H */
> diff --git a/arch/powerpc/platforms/pasemi/powersave.S 
> b/arch/powerpc/platforms/pasemi/powersave.S
> index d0215d5329ca7..7747b48963286 100644
> --- a/arch/powerpc/platforms/pasemi/powersave.S
> +++ b/arch/powerpc/platforms/pasemi/powersave.S
> @@ -5,6 +5,8 @@
>   * Maintained by: Olof Johansson 
>   */
>
> +#include 
> +
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/cpufreq/pasemi-cpufreq.c 
> b/drivers/cpufreq/pasemi-cpufreq.c
> index c66f566a854cb..c6bb3ecc90ef3 100644
> --- a/drivers/cpufreq/pasemi-cpufreq.c
> +++ b/drivers/cpufreq/pasemi-cpufreq.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> diff --git a/include/linux/platform_data/pasemi.h 
> b/include/linux/platform_data/pasemi.h
> new file mode 100644
> index 0..3fed0687fcc9a
> --- /dev/null
> +++ b/include/linux/platform_data/pasemi.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2020 Linaro Ltd.
> + *
> + * Author: Lee Jones 
> + */

Absolutely not. It's neither your copyright, nor your authorship.


-Olof


[PATCH] powerpc: Fix inconsistent function names

2020-07-14 Thread YueHaibing
The stub helpers name should be consistent with prototypes.

mm_context_add_vas_windows() --> mm_context_add_vas_window()
mm_context_remove_vas_windows() --> mm_context_remove_vas_window()

Fixes: c420644c0a8f ("powerpc: Use mm_context vas_windows counter to issue 
CP_ABORT")
Signed-off-by: YueHaibing 
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 1a474f6b1992..00fd1d44731a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -218,8 +218,8 @@ static inline void inc_mm_active_cpus(struct mm_struct *mm) 
{ }
 static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
 static inline void mm_context_add_copro(struct mm_struct *mm) { }
 static inline void mm_context_remove_copro(struct mm_struct *mm) { }
-static inline void mm_context_add_vas_windows(struct mm_struct *mm) { }
-static inline void mm_context_remove_vas_windows(struct mm_struct *mm) { }
+static inline void mm_context_add_vas_window(struct mm_struct *mm) { }
+static inline void mm_context_remove_vas_window(struct mm_struct *mm) { }
 #endif
 
 
-- 
2.17.1




Re: [PATCH 10/15] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

2020-07-14 Thread Alexey Kardashevskiy



On 15/07/2020 12:53, Oliver O'Halloran wrote:
> On Wed, Jul 15, 2020 at 12:29 PM Alexey Kardashevskiy  wrote:
>>
>>
>>
>> On 10/07/2020 15:23, Oliver O'Halloran wrote:
>>> Rework the PE allocation logic to allow allocating blocks of PEs rather
>>> than individually. We'll use this to allocate contigious blocks of PEs for
>>> the SR-IOVs.
>>
>> The patch does not do just this, it also adds missing mutexes (which is
>> good) but still misses them in pnv_pci_sriov_disable() and
>> pnv_pci_ioda_pe_dump().
> 
> The current implementation doesn't need the mutex because alloc,
> reserve and free all use atomic bit ops.

Ah, ok.

> The mutex has been there
> forever with nothing actually using it, but with the change we need to
> prevent modifications to the bitmap while alloc() is scanning it. I
> probably should have mentioned that in the commit message.

but bitmap_clear() (from pnv_pci_sriov_disable()) is not atomic. It
probably does not matter as the next patch gets rid of it anyway.
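
If it were staying, a minimal sketch of bringing that clear under the same
lock (mutex and bitmap names as used elsewhere in the series):

	mutex_lock(&phb->ioda.pe_alloc_mutex);
	bitmap_clear(phb->ioda.pe_alloc, *iov->pe_num_map, num_vfs);
	mutex_unlock(&phb->ioda.pe_alloc_mutex);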


-- 
Alexey


[PATCH -next] powerpc/smp: Remove unused inline functions

2020-07-14 Thread YueHaibing
commit 441c19c8a290 ("powerpc/kvm/book3s_hv: Rework the secondary inhibit code")
left behind this, remove it.

Signed-off-by: YueHaibing 
---
 arch/powerpc/include/asm/smp.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 49a25e2400f2..5d5a7596d144 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -176,8 +176,6 @@ extern void __cpu_die(unsigned int cpu);
 /* for UP */
 #define hard_smp_processor_id()get_hard_smp_processor_id(0)
 #define smp_setup_cpu_maps()
-static inline void inhibit_secondary_onlining(void) {}
-static inline void uninhibit_secondary_onlining(void) {}
 static inline const struct cpumask *cpu_sibling_mask(int cpu)
 {
return cpumask_of(cpu);
-- 
2.17.1




Re: [PATCH 07/13] cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues

2020-07-14 Thread Viresh Kumar
On 14-07-20, 15:50, Lee Jones wrote:
> Repair problems with formatting and missing attributes/parameters, and
> demote header comments which do not meet the required standards
> applicable to kerneldoc.
> 
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
> 'last_lpstate_idx' not described in 'global_pstate_info'
>  drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
> 'last_gpstate_idx' not described in 'global_pstate_info'
>  drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
> 'policy' not described in 'global_pstate_info'
>  drivers/cpufreq/powernv-cpufreq.c:182: warning: Function parameter or member 
> 'i' not described in 'idx_to_pstate'
>  drivers/cpufreq/powernv-cpufreq.c:201: warning: Function parameter or member 
> 'pstate' not described in 'pstate_to_idx'
>  drivers/cpufreq/powernv-cpufreq.c:670: warning: Function parameter or member 
> 't' not described in 'gpstate_timer_handler'
>  drivers/cpufreq/powernv-cpufreq.c:670: warning: Excess function parameter 
> 'data' description in 'gpstate_timer_handler'
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 068cc53abe320..2e5a8b8a4abaa 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -64,13 +64,14 @@
>   *   highest_lpstate_idx
>   * @last_sampled_time:   Time from boot in ms when global 
> pstates were
>   *   last set
> - * @last_lpstate_idx,Last set value of local pstate and 
> global
> - * last_gpstate_idx  pstate in terms of cpufreq table index
> + * @last_lpstate_idx:Last set value of local pstate and 
> global
> + * @last_gpstate_idx:pstate in terms of cpufreq table index
>   * @timer:   Is used for ramping down if cpu goes idle for
>   *   a long time with global pstate held high
>   * @gpstate_lock:A spinlock to maintain synchronization between
>   *   routines called by the timer handler and
>   *   governer's target_index calls
> + * @policy:  Associated CPUFreq policy
>   */
>  struct global_pstate_info {
>   int highest_lpstate_idx;
> @@ -170,7 +171,7 @@ static inline u8 extract_pstate(u64 pmsr_val, unsigned 
> int shift)
>  
>  /* Use following functions for conversions between pstate_id and index */
>  
> -/**
> +/*
>   * idx_to_pstate : Returns the pstate id corresponding to the
>   *  frequency in the cpufreq frequency table
>   *  powernv_freqs indexed by @i.
> @@ -188,7 +189,7 @@ static inline u8 idx_to_pstate(unsigned int i)
>   return powernv_freqs[i].driver_data;
>  }
>  
> -/**
> +/*
>   * pstate_to_idx : Returns the index in the cpufreq frequencytable
>   *  powernv_freqs for the frequency whose corresponding
>   *  pstate id is @pstate.
> @@ -660,7 +661,7 @@ static inline void  queue_gpstate_timer(struct 
> global_pstate_info *gpstates)
>  /**
>   * gpstate_timer_handler
>   *
> - * @data: pointer to cpufreq_policy on which timer was queued
> + * @t: Timer context used to fetch global pstate info struct
>   *
>   * This handler brings down the global pstate closer to the local pstate
>   * according quadratic equation. Queues a new timer if it is still not equal

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH 06/13] cpufreq: powernv-cpufreq: Functions only used in call-backs should be static

2020-07-14 Thread Viresh Kumar
On 14-07-20, 15:50, Lee Jones wrote:
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/cpufreq/powernv-cpufreq.c:669:6: warning: no previous prototype for 
> ‘gpstate_timer_handler’ [-Wmissing-prototypes]
>  drivers/cpufreq/powernv-cpufreq.c:902:6: warning: no previous prototype for 
> ‘powernv_cpufreq_work_fn’ [-Wmissing-prototypes]
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 8646eb197cd96..068cc53abe320 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -666,7 +666,7 @@ static inline void  queue_gpstate_timer(struct 
> global_pstate_info *gpstates)
>   * according quadratic equation. Queues a new timer if it is still not equal
>   * to local pstate
>   */
> -void gpstate_timer_handler(struct timer_list *t)
> +static void gpstate_timer_handler(struct timer_list *t)
>  {
>   struct global_pstate_info *gpstates = from_timer(gpstates, t, timer);
>   struct cpufreq_policy *policy = gpstates->policy;
> @@ -899,7 +899,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>   .notifier_call = powernv_cpufreq_reboot_notifier,
>  };
>  
> -void powernv_cpufreq_work_fn(struct work_struct *work)
> +static void powernv_cpufreq_work_fn(struct work_struct *work)
>  {
>   struct chip *chip = container_of(work, struct chip, throttle);
>   struct cpufreq_policy *policy;

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Viresh Kumar
On 14-07-20, 15:50, Lee Jones wrote:
> If function callers and providers do not share the same prototypes the
> compiler complains of missing prototypes.  Fix this by moving the
> already existing prototypes out to a mutually convenient location.
> 
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype for 
> ‘check_astate’ [-Wmissing-prototypes]
>  109 | int check_astate(void)
>  | ^~~~
>  drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype for 
> ‘restore_astate’ [-Wmissing-prototypes]
>  114 | void restore_astate(int cpu)
>  | ^~
> 
> Cc: Olof Johansson 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Lee Jones 
> ---
>  arch/powerpc/platforms/pasemi/pasemi.h| 15 

Is there no sane way we can include this file directly to the cpufreq
file ?

>  arch/powerpc/platforms/pasemi/powersave.S |  2 ++
>  drivers/cpufreq/pasemi-cpufreq.c  |  1 +
>  include/linux/platform_data/pasemi.h  | 28 +++
>  4 files changed, 31 insertions(+), 15 deletions(-)
>  create mode 100644 include/linux/platform_data/pasemi.h

-- 
viresh


[PATCH -next] powerpc/xive: Remove unused inline function xive_kexec_teardown_cpu()

2020-07-14 Thread YueHaibing
commit e27e0a94651e ("powerpc/xive: Remove xive_kexec_teardown_cpu()")
left behind this, remove it.

Signed-off-by: YueHaibing 
---
 arch/powerpc/include/asm/xive.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index d08ea11b271c..309b4d65b74f 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -155,7 +155,6 @@ static inline void xive_smp_probe(void) { }
 static inline int  xive_smp_prepare_cpu(unsigned int cpu) { return -EINVAL; }
 static inline void xive_smp_setup_cpu(void) { }
 static inline void xive_smp_disable_cpu(void) { }
-static inline void xive_kexec_teardown_cpu(int secondary) { }
 static inline void xive_shutdown(void) { }
 static inline void xive_flush_interrupt(void) { }
 
-- 
2.17.1




Re: [PATCH 10/15] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 12:29 PM Alexey Kardashevskiy  wrote:
>
>
>
> On 10/07/2020 15:23, Oliver O'Halloran wrote:
> > Rework the PE allocation logic to allow allocating blocks of PEs rather
> > than individually. We'll use this to allocate contigious blocks of PEs for
> > the SR-IOVs.
>
> The patch does not do just this, it also adds missing mutexes (which is
> good) but still misses them in pnv_pci_sriov_disable() and
> pnv_pci_ioda_pe_dump().

The current implementation doesn't need the mutex because alloc,
reserve and free all use atomic bit ops. The mutex has been there
forever with nothing actually using it, but with the change we need to
prevent modifications to the bitmap while alloc() is scanning it. I
probably should have mentioned that in the commit message.


[powerpc:fixes-test] BUILD SUCCESS b710d27bf72068b15b2f0305d825988183e2ff28

2020-07-14 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
fixes-test
branch HEAD: b710d27bf72068b15b2f0305d825988183e2ff28  powerpc/pseries/svm: Fix 
incorrect check for shared_lppaca_size

elapsed time: 721m

configs tested: 110
configs skipped: 6

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm  allyesconfig
arm  allmodconfig
arm   allnoconfig
arm64allyesconfig
arm64   defconfig
arm64allmodconfig
arm64 allnoconfig
sh   se7724_defconfig
sh  sdk7786_defconfig
arm  tango4_defconfig
shmigor_defconfig
i386 allyesconfig
arm s3c6400_defconfig
m68k amcore_defconfig
c6x   allnoconfig
xtensageneric_kc705_defconfig
mips   ip27_defconfig
mips  mips_paravirt_defconfig
c6xevmc6474_defconfig
powerpc  pasemi_defconfig
shdreamcast_defconfig
openrisc alldefconfig
m68km5407c3_defconfig
arm  ep93xx_defconfig
arm  moxart_defconfig
sparc   defconfig
mipsmaltaup_defconfig
mips  rb532_defconfig
powerpcmpc7448_hpc2_defconfig
mipsnlm_xlp_defconfig
sh kfr2r09-romimage_defconfig
openriscdefconfig
xtensa  iss_defconfig
arm  colibri_pxa270_defconfig
ia64  gensparse_defconfig
i386  allnoconfig
i386defconfig
i386  debian-10.3
ia64 allmodconfig
ia64defconfig
ia64  allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k  allnoconfig
m68k   sun3_defconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
nios2allyesconfig
c6x  allyesconfig
openrisc allyesconfig
nds32   defconfig
nds32 allnoconfig
csky allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
h8300allmodconfig
xtensa  defconfig
arc defconfig
arc  allyesconfig
sh   allmodconfig
shallnoconfig
microblazeallnoconfig
mips allyesconfig
mips  allnoconfig
mips allmodconfig
pariscallnoconfig
parisc  defconfig
parisc   allyesconfig
parisc   allmodconfig
powerpc  allyesconfig
powerpc  rhel-kconfig
powerpc  allmodconfig
powerpc   allnoconfig
powerpc defconfig
i386 randconfig-a001-20200714
i386 randconfig-a005-20200714
i386 randconfig-a002-20200714
i386 randconfig-a006-20200714
i386 randconfig-a003-20200714
i386 randconfig-a004-20200714
i386 randconfig-a016-20200714
i386 randconfig-a011-20200714
i386 randconfig-a015-20200714
i386 randconfig-a012-20200714
i386 randconfig-a013-20200714
i386 randconfig-a014-20200714
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
s390 allyesconfig
s390  allnoconfig
s390 allmodconfig
s390defconfig
sparcallyesconfig
sparc64

[powerpc:next-test] BUILD SUCCESS 17babc2734a55cdf4a678d572f316280b820202b

2020-07-14 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next-test
branch HEAD: 17babc2734a55cdf4a678d572f316280b820202b  powerpc/fadump: fix race 
between pstore write and fadump crash trigger

elapsed time: 720m

configs tested: 104
configs skipped: 8

The following configs have been built successfully.
More configs may be tested in the coming days.

arm defconfig
arm  allyesconfig
arm  allmodconfig
arm   allnoconfig
arm64allyesconfig
arm64   defconfig
arm64allmodconfig
arm64 allnoconfig
sh   se7724_defconfig
sh  sdk7786_defconfig
arm  tango4_defconfig
shmigor_defconfig
sh   se7206_defconfig
mipsmalta_qemu_32r6_defconfig
sparc64 defconfig
mips  pistachio_defconfig
sh   alldefconfig
powerpc64   defconfig
armu300_defconfig
i386 allyesconfig
arm s3c6400_defconfig
m68k amcore_defconfig
c6x   allnoconfig
xtensageneric_kc705_defconfig
mips   ip27_defconfig
mips  mips_paravirt_defconfig
c6xevmc6474_defconfig
powerpc  pasemi_defconfig
shdreamcast_defconfig
openrisc alldefconfig
m68km5407c3_defconfig
arm  ep93xx_defconfig
arm  moxart_defconfig
sparc   defconfig
mipsmaltaup_defconfig
mips  rb532_defconfig
powerpcmpc7448_hpc2_defconfig
mipsnlm_xlp_defconfig
sh kfr2r09-romimage_defconfig
openriscdefconfig
xtensa  iss_defconfig
arm  colibri_pxa270_defconfig
ia64  gensparse_defconfig
i386  allnoconfig
i386defconfig
i386  debian-10.3
ia64 allmodconfig
ia64defconfig
ia64  allnoconfig
ia64 allyesconfig
m68k allmodconfig
m68k  allnoconfig
m68k   sun3_defconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
nios2allyesconfig
c6x  allyesconfig
openrisc allyesconfig
nds32   defconfig
nds32 allnoconfig
csky allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
h8300allmodconfig
xtensa  defconfig
arc defconfig
arc  allyesconfig
sh   allmodconfig
shallnoconfig
microblazeallnoconfig
mips allyesconfig
mips  allnoconfig
mips allmodconfig
pariscallnoconfig
parisc  defconfig
parisc   allyesconfig
parisc   allmodconfig
powerpc  allyesconfig
powerpc  rhel-kconfig
powerpc  allmodconfig
powerpc   allnoconfig
powerpc defconfig
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
s390 allyesconfig
s390  allnoconfig
s390 allmodconfig
s390defconfig
sparcallyesconfig
sparc64   allnoconfig
sparc64  allyesconfig
sparc64  allmodconfig
x86_64rhel-7.6-kselftests
x86_64   rhel-8.3
x86_64  

Re: [PATCH v3 04/12] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-14 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h 
> b/arch/powerpc/include/asm/crashdump-ppc64.h
> new file mode 100644
> index 000..90deb46
> --- /dev/null
> +++ b/arch/powerpc/include/asm/crashdump-ppc64.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _ASM_POWERPC_CRASHDUMP_PPC64_H
> +#define _ASM_POWERPC_CRASHDUMP_PPC64_H
> +
> +/* min & max addresses for kdump load segments */
> +#define KDUMP_BUF_MIN(crashk_res.start)
> +#define KDUMP_BUF_MAX((crashk_res.end < ppc64_rma_size) ? \
> +  crashk_res.end : (ppc64_rma_size - 1))
> +
> +#endif /* __ASM_POWERPC_CRASHDUMP_PPC64_H */
> diff --git a/arch/powerpc/include/asm/kexec.h 
> b/arch/powerpc/include/asm/kexec.h
> index 7008ea1..bf47a01 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -100,14 +100,16 @@ void relocate_new_kernel(unsigned long 
> indirection_page, unsigned long reboot_co
>  #ifdef CONFIG_KEXEC_FILE
>  extern const struct kexec_file_ops kexec_elf64_ops;
>
> -#ifdef CONFIG_IMA_KEXEC
>  #define ARCH_HAS_KIMAGE_ARCH
>
>  struct kimage_arch {
> + struct crash_mem *exclude_ranges;
> +
> +#ifdef CONFIG_IMA_KEXEC
>   phys_addr_t ima_buffer_addr;
>   size_t ima_buffer_size;
> -};
>  #endif
> +};
>
>  int setup_purgatory(struct kimage *image, const void *slave_code,
>   const void *fdt, unsigned long kernel_load_addr,
> @@ -125,6 +127,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
> *fdt,
>   unsigned long initrd_load_addr,
>   unsigned long initrd_len, const char *cmdline);
>  #endif /* CONFIG_PPC64 */
> +
>  #endif /* CONFIG_KEXEC_FILE */
>
>  #else /* !CONFIG_KEXEC_CORE */
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 23ad04c..c695f94 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  static void *elf64_load(struct kimage *image, char *kernel_buf,
>   unsigned long kernel_len, char *initrd,
> @@ -46,6 +47,12 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
>   if (ret)
>   goto out;
>
> + if (image->type == KEXEC_TYPE_CRASH) {
> + /* min & max buffer values for kdump case */
> + kbuf.buf_min = pbuf.buf_min = KDUMP_BUF_MIN;
> + kbuf.buf_max = pbuf.buf_max = KDUMP_BUF_MAX;

This is only my personal opinion and an actual maintainer may disagree,
but just looking at the lines above, I would assume that KDUMP_BUF_MIN
and KDUMP_BUF_MAX were constants, when in fact they aren't.

I suggest using static inline functions in <asm/crashdump-ppc64.h>, for
example:

static inline resource_size_t get_kdump_buf_min(void)
{
	return crashk_res.start;
}

static inline resource_size_t get_kdump_buf_max(void)
{
	return (crashk_res.end < ppc64_rma_size) ?
		crashk_res.end : (ppc64_rma_size - 1);
}
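
The call site in elf64_load() would then read something like (a sketch based
on the hunk quoted above):

	if (image->type == KEXEC_TYPE_CRASH) {
		/* min & max buffer values for the kdump case */
		kbuf.buf_min = pbuf.buf_min = get_kdump_buf_min();
		kbuf.buf_max = pbuf.buf_max = get_kdump_buf_max();
	}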


> + }
> +
>   ret = kexec_elf_load(image, &ehdr, &elf_info, &kbuf, &kernel_load_addr);
>   if (ret)
>   goto out;



> +/**
> + * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
> + *  in the memory regions between buf_min & 
> buf_max
> + *  for the buffer. If found, sets kbuf->mem.
> + * @kbuf:   Buffer contents and memory parameters.
> + * @buf_min:Minimum address for the buffer.
> + * @buf_max:Maximum address for the buffer.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +static int __locate_mem_hole_top_down(struct kexec_buf *kbuf,
> +   u64 buf_min, u64 buf_max)
> +{
> + int ret = -EADDRNOTAVAIL;
> + phys_addr_t start, end;
> + u64 i;
> +
> + for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE,
> +MEMBLOCK_NONE, &start, &end, NULL) {
> + if (start > buf_max)
> + continue;
> +
> + /* Memory hole not found */
> + if (end < buf_min)
> + break;
> +
> + /* Adjust memory region based on the given range */
> + if (start < buf_min)
> + start = buf_min;
> + if (end > buf_max)
> + end = buf_max;
> +
> + start = ALIGN(start, kbuf->buf_align);
> + if (start < end && (end - start + 1) >= kbuf->memsz) {

This is why I dislike using start and end to express address ranges:

While struct resource seems to use the [address, end] convention, my
reading of memblock code is that it uses [address, end). This is
guaranteed to lead to bugs. So the above has an off-by-one error. To
calculate the size of the current range, you need to use `end - start`.
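
That is, with the half-open [start, end) convention the check above would
become something like:

	start = ALIGN(start, kbuf->buf_align);
	if (start < end && (end - start) >= kbuf->memsz) {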

> +

Re: [PATCH 10/15] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Rework the PE allocation logic to allow allocating blocks of PEs rather
> than individually. We'll use this to allocate contigious blocks of PEs for
> the SR-IOVs.

The patch does not do just this, it also adds missing mutexes (which is
good) but still misses them in pnv_pci_sriov_disable() and
pnv_pci_ioda_pe_dump().




> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 41 ++-
>  arch/powerpc/platforms/powernv/pci.h  |  2 +-
>  2 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 2d36a9ebf0e9..c9c25fb0783c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -145,23 +145,45 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, 
> int pe_no)
>   return;
>   }
>  
> + mutex_lock(&phb->ioda.pe_alloc_mutex);
>   if (test_and_set_bit(pe_no, phb->ioda.pe_alloc))
>   pr_debug("%s: PE %x was reserved on PHB#%x\n",
>__func__, pe_no, phb->hose->global_number);
> + mutex_unlock(&phb->ioda.pe_alloc_mutex);
>  
>   pnv_ioda_init_pe(phb, pe_no);
>  }
>  
> -struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
> +struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count)
>  {
> - long pe;
> + struct pnv_ioda_pe *ret = NULL;
> + int run = 0, pe, i;
>  
> + mutex_lock(&phb->ioda.pe_alloc_mutex);
> +
> + /* scan backwards for a run of @count cleared bits */
>   for (pe = phb->ioda.total_pe_num - 1; pe >= 0; pe--) {
> - if (!test_and_set_bit(pe, phb->ioda.pe_alloc))
> - return pnv_ioda_init_pe(phb, pe);
> + if (test_bit(pe, phb->ioda.pe_alloc)) {
> + run = 0;
> + continue;
> + }
> +
> + run++;
> + if (run == count)
> + break;
>   }
> + if (run != count)
> + goto out;
>  
> - return NULL;
> + for (i = pe; i < pe + count; i++) {
> + set_bit(i, phb->ioda.pe_alloc);
> + pnv_ioda_init_pe(phb, i);
> + }
> + ret = &phb->ioda.pe_array[pe];
> +
> +out:
> + mutex_unlock(&phb->ioda.pe_alloc_mutex);
> + return ret;
>  }
>  
>  void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
> @@ -173,7 +195,10 @@ void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
>   WARN_ON(pe->npucomp); /* NPUs for nvlink are not supposed to be freed */
>   kfree(pe->npucomp);
>   memset(pe, 0, sizeof(struct pnv_ioda_pe));
> +
> + mutex_lock(&phb->ioda.pe_alloc_mutex);
>   clear_bit(pe_num, phb->ioda.pe_alloc);
> + mutex_unlock(&phb->ioda.pe_alloc_mutex);
>  }
>  
>  /* The default M64 BAR is shared by all PEs */
> @@ -976,7 +1001,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct 
> pci_dev *dev)
>   if (pdn->pe_number != IODA_INVALID_PE)
>   return NULL;
>  
> - pe = pnv_ioda_alloc_pe(phb);
> + pe = pnv_ioda_alloc_pe(phb, 1);
>   if (!pe) {
>   pr_warn("%s: Not enough PE# available, disabling device\n",
>   pci_name(dev));
> @@ -1047,7 +1072,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct 
> pci_bus *bus, bool all)
>  
>   /* The PE number isn't pinned by M64 */
>   if (!pe)
> - pe = pnv_ioda_alloc_pe(phb);
> + pe = pnv_ioda_alloc_pe(phb, 1);
>  
>   if (!pe) {
>   pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
> @@ -3065,7 +3090,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
> device_node *np,
>   pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_idx);
>   } else {
>   /* otherwise just allocate one */
> - root_pe = pnv_ioda_alloc_pe(phb);
> + root_pe = pnv_ioda_alloc_pe(phb, 1);
>   phb->ioda.root_pe_idx = root_pe->pe_number;
>   }
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.h 
> b/arch/powerpc/platforms/powernv/pci.h
> index 58c97e60c3db..b4c9bdba7217 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -223,7 +223,7 @@ int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct 
> pnv_ioda_pe *pe);
>  void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe);
>  void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
>  
> -struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb);
> +struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count);
>  void pnv_ioda_free_pe(struct pnv_ioda_pe *pe);
>  
>  #ifdef CONFIG_PCI_IOV
> 

-- 
Alexey


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt
On Tue, 2020-07-14 at 18:46 -0500, Bjorn Helgaas wrote:
> Yes.  I have no problem with that.  There are a few cases where it's
> important to check for errors, e.g., we read a status register and do
> something based on a bit being set.  A failure will return all bits
> set, and we may do the wrong thing.  But most of the errors we care
> about will be on MMIO reads, not config reads, so we can probably
> ignore most config read errors.

And in both cases, we don't have the plumbing to provide accurate
and reliable error returns for all platforms anyways (esp. not for
MMIO).

I think it makes sense to stick to the good old "if all 1's, then go
out of line" including for config space.
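
For illustration, the idiom being referred to is roughly (not tied to any
particular driver or register):

	u32 id;

	pci_read_config_dword(pdev, PCI_VENDOR_ID, &id);
	if (id == 0xffffffff) {
		/* all 1s: device gone or the access failed */
		return -ENODEV;
	}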

 ../..

> Yep, except for things like device removal or other PCI errors.

A whole bunch of these are reported asynchronously, esp for writes (and
yes, including config writes, they are supposed to be non-posted but
more often than not, the path  from the CPU to the PCI bridge remains
posted for writes including config ones).

> So maybe a good place to start is by removing some of the useless
> error checking for pci_read_config_*() and pci_write_config_*().
> That's a decent-sized but not impractical project that could be done
> per subsystem or something:
> 
>   git grep -E "(if|return|=).*\ 
> finds about 400 matches.
> 
> Some of those callers probably really *do* want to check for errors,
> and I guess we'd have to identify them and do them separately as you
> mentioned.

I'd be curious about these considering how unreliable our error return
is across the board.

Cheers,
Ben.




Re: [PATCH v3 2/9] powerpc/watchpoint: Fix DAWR exception constraint

2020-07-14 Thread Jordan Niethe
On Wed, Jul 8, 2020 at 2:52 PM Ravi Bangoria
 wrote:
>
> Pedro Miraglia Franco de Carvalho noticed that on p8, DAR value is
> inconsistent with different type of load/store. Like for byte,word
> etc. load/stores, DAR is set to the address of the first byte of
> overlap between watch range and real access. But for quadword load/
> store it's set to the address of the first byte of real access. This
> issue has been fixed in p10. In p10(ISA 3.1), DAR is always set to
> the address of the first byte of overlap. Commit 27985b2a640e
> ("powerpc/watchpoint: Don't ignore extraneous exceptions blindly")
> wrongly assumes that DAR is set to the address of the first byte of
> overlap for all load/stores on p8 as well. Fix that. With the fix,
> we now rely on 'ea' provided by analyse_instr(). If analyse_instr()
> fails, generate event unconditionally on p8, and on p10 generate
> event only if DAR is within a DAWR range.
>
> Note: 8xx is not affected.
>
> Fixes: 27985b2a640e ("powerpc/watchpoint: Don't ignore extraneous exceptions 
> blindly")
> Fixes: 74c6881019b7 ("powerpc/watchpoint: Prepare handler to handle more than 
> one watchpoint")
> Reported-by: Pedro Miraglia Franco de Carvalho 
> Signed-off-by: Ravi Bangoria 
> ---
>  arch/powerpc/kernel/hw_breakpoint.c | 93 +++--
>  1 file changed, 63 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
> b/arch/powerpc/kernel/hw_breakpoint.c
> index 031e6defc08e..7a66c370a105 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -498,11 +498,11 @@ static bool dar_in_user_range(unsigned long dar, struct 
> arch_hw_breakpoint *info
> return ((info->address <= dar) && (dar - info->address < info->len));
>  }
>
> -static bool dar_user_range_overlaps(unsigned long dar, int size,
> -   struct arch_hw_breakpoint *info)
> +static bool ea_user_range_overlaps(unsigned long ea, int size,
> +  struct arch_hw_breakpoint *info)
>  {
> -   return ((dar < info->address + info->len) &&
> -   (dar + size > info->address));
> +   return ((ea < info->address + info->len) &&
> +   (ea + size > info->address));
>  }
>
>  static bool dar_in_hw_range(unsigned long dar, struct arch_hw_breakpoint 
> *info)
> @@ -515,20 +515,22 @@ static bool dar_in_hw_range(unsigned long dar, struct 
> arch_hw_breakpoint *info)
> return ((hw_start_addr <= dar) && (hw_end_addr > dar));
>  }
>
> -static bool dar_hw_range_overlaps(unsigned long dar, int size,
> - struct arch_hw_breakpoint *info)
> +static bool ea_hw_range_overlaps(unsigned long ea, int size,
> +struct arch_hw_breakpoint *info)
>  {
> unsigned long hw_start_addr, hw_end_addr;
>
> hw_start_addr = ALIGN_DOWN(info->address, HW_BREAKPOINT_SIZE);
> hw_end_addr = ALIGN(info->address + info->len, HW_BREAKPOINT_SIZE);
>
> -   return ((dar < hw_end_addr) && (dar + size > hw_start_addr));
> +   return ((ea < hw_end_addr) && (ea + size > hw_start_addr));
>  }
>
>  /*
>   * If hw has multiple DAWR registers, we also need to check all
>   * dawrx constraint bits to confirm this is _really_ a valid event.
> + * If type is UNKNOWN, but privilege level matches, consider it as
> + * a positive match.
>   */
>  static bool check_dawrx_constraints(struct pt_regs *regs, int type,
> struct arch_hw_breakpoint *info)
> @@ -536,7 +538,12 @@ static bool check_dawrx_constraints(struct pt_regs 
> *regs, int type,
> if (OP_IS_LOAD(type) && !(info->type & HW_BRK_TYPE_READ))
> return false;
>
> -   if (OP_IS_STORE(type) && !(info->type & HW_BRK_TYPE_WRITE))
> +   /*
> +* The Cache Management instructions other than dcbz never
> +* cause a match. i.e. if type is CACHEOP, the instruction
> +* is dcbz, and dcbz is treated as Store.
> +*/
> +   if ((OP_IS_STORE(type) || type == CACHEOP) && !(info->type & 
> HW_BRK_TYPE_WRITE))
> return false;
This change seems separate from this commit?
>
> if (is_kernel_addr(regs->nip) && !(info->type & HW_BRK_TYPE_KERNEL))
> @@ -553,7 +560,8 @@ static bool check_dawrx_constraints(struct pt_regs *regs, 
> int type,
>   * including extraneous exception. Otherwise return false.
>   */
>  static bool check_constraints(struct pt_regs *regs, struct ppc_inst instr,
> - int type, int size, struct arch_hw_breakpoint 
> *info)
> + unsigned long ea, int type, int size,
> + struct arch_hw_breakpoint *info)
>  {
> bool in_user_range = dar_in_user_range(regs->dar, info);
> bool dawrx_constraints;
> @@ -569,11 +577,10 @@ static bool check_constraints(struct pt_regs *regs, 
> struct ppc_inst instr,
> }
>
> if (unlikely(pp

Re: [PATCH 09/15] powerpc/powernv/sriov: Factor out M64 BAR setup

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> The sequence required to use the single PE BAR mode is kinda janky and
> requires a little explanation. The API was designed with P7-IOC style
> windows where the setup process is something like:
> 
> 1. Configure the window start / end address
> 2. Enable the window
> 3. Map the segments of each window to the PE
> 
> For Single PE BARs the process is:
> 
> 1. Set the PE for segment zero on a disabled window
> 2. Set the range
> 3. Enable the window
> 
> Move the OPAL calls into their own helper functions where the quirks can be
> contained.
> 
> Signed-off-by: Oliver O'Halloran 


I'd use "segmented" instead of "accordion". Otherwise,

Reviewed-by: Alexey Kardashevskiy 




> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 132 -
>  1 file changed, 103 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index e4c65cb49757..d53a85ccb538 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -320,6 +320,102 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   return 0;
>  }
>  
> +
> +/*
> + * PHB3 and beyond support "accordion" windows. The window's address range
> + * is subdivided into phb->ioda.total_pe_num segments and there's a 1-1
> + * mapping between PEs and segments.
> + *
> + * They're called that because as the window size changes the segment sizes
> + * change with it. Sort of like an accordion, sort of.
> + */
> +static int64_t pnv_ioda_map_m64_accordion(struct pnv_phb *phb,
> +   int window_id,
> +   resource_size_t start,
> +   resource_size_t size)
> +{
> + int64_t rc;
> +
> + rc = opal_pci_set_phb_mem_window(phb->opal_id,
> +  OPAL_M64_WINDOW_TYPE,
> +  window_id,
> +  start,
> +  0, /* unused */
> +  size);
> + if (rc)
> + goto out;
> +
> + rc = opal_pci_phb_mmio_enable(phb->opal_id,
> +   OPAL_M64_WINDOW_TYPE,
> +   window_id,
> +   OPAL_ENABLE_M64_SPLIT);
> +out:
> + if (rc)
> + pr_err("Failed to map M64 window #%d: %lld\n", window_id, rc);
> +
> + return rc;
> +}
> +
> +static int64_t pnv_ioda_map_m64_single(struct pnv_phb *phb,
> +int pe_num,
> +int window_id,
> +resource_size_t start,
> +resource_size_t size)
> +{
> + int64_t rc;
> +
> + /*
> +  * The API for setting up m64 mmio windows seems to have been designed
> +  * with P7-IOC in mind. For that chip each M64 BAR (window) had a fixed
> +  * split of 8 equally sized segments each of which could individually
> +  * assigned to a PE.
> +  *
> +  * The problem with this is that the API doesn't have any way to
> +  * communicate the number of segments we want on a BAR. This wasn't
> +  * a problem for p7-ioc since you didn't have a choice, but the
> +  * single PE windows added in PHB3 don't map cleanly to this API.
> +  *
> +  * As a result we've got this slightly awkward process where we
> +  * call opal_pci_map_pe_mmio_window() to put the single in single
> +  * PE mode, and set the PE for the window before setting the address
> +  * bounds. We need to do it this way because the single PE windows
> +  * for PHB3 have different alignment requirements on PHB3.
> +  */
> + rc = opal_pci_map_pe_mmio_window(phb->opal_id,
> +  pe_num,
> +  OPAL_M64_WINDOW_TYPE,
> +  window_id,
> +  0);
> + if (rc)
> + goto out;
> +
> + /*
> +  * NB: In single PE mode the window needs to be aligned to 32MB
> +  */
> + rc = opal_pci_set_phb_mem_window(phb->opal_id,
> +  OPAL_M64_WINDOW_TYPE,
> +  window_id,
> +  start,
> +  0, /* ignored by FW, m64 is 1-1 */
> +  size);
> + if (rc)
> + goto out;
> +
> + /*
> +  * Now actually enable it. We specified the BAR should be in "non-split"
> +  * mode so FW will validate that the BAR is in single PE mode.
> +  */
> + rc = opal_pci_phb_mmio_enable(phb->opal_id,
> +   OPAL_M64_WINDOW_TYPE,
> +   

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt
On Tue, 2020-07-14 at 23:02 +0200, Kjetil Oftedal wrote:
> > 
> > > For b), it might be nice to also change other aspects of the
> > > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > > instead of a pci_bus pointer, or having the callback in the
> > > pci_host_bridge structure.
> > 
> > I like this idea a lot, too.  I think the fact that
> > pci_bus_read_config_word() requires a pci_bus * complicates things in
> > a few places.
> > 
> > I think it's completely separate, as you say, and we should defer it
> > for now because even part a) is a lot of work.  I added it to my list
> > of possible future projects.
> > 
> 
> What about strange PCI devices such as Non-Transparent bridges?
> They will require their own PCI Config space accessors that is not
> connected to a host bridge if one wants to do some sort of
> punch-through enumeration.
> I guess the kernel doesn't care much about them?

Well, today they would require a pci_bus anyway... so if you want to do
that sort of funny trick you may as well create a "virtual" host bridge.

Cheers,
Ben.




Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt
On Tue, 2020-07-14 at 13:45 -0500, Bjorn Helgaas wrote:
> 
> > fail for valid arguments on a valid pci_device* ?
> 
> I really like this idea.
> 
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting
> these
> errors is not really that valuable.
> 
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both. 

  .../...

I agree. It's a mess at the moment.

We have a separate mechanism to convey PCI errors (among other things the
channel state) which should apply to config space when detection is
possible.

I think returning all 1's is the right thing to do here and avoids odd
duplicate error detection logic which I bet you is never properly
tested.

> > For b), it might be nice to also change other aspects of the
> > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > instead of a pci_bus pointer, or having the callback in the
> > pci_host_bridge structure.
> 
> I like this idea a lot, too.  I think the fact that
> pci_bus_read_config_word() requires a pci_bus * complicates things in
> a few places.
> 
> I think it's completely separate, as you say, and we should defer it
> for now because even part a) is a lot of work.  I added it to my list
> of possible future projects.

Agreed on both points.

Cheers,
Ben.




Re: [PATCH 08/15] powerpc/powernv/sriov: Simplify used window tracking

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 11:34 AM Alexey Kardashevskiy  wrote:
>
> On 10/07/2020 15:23, Oliver O'Halloran wrote:
> > diff --git a/arch/powerpc/platforms/powernv/pci.h 
> > b/arch/powerpc/platforms/powernv/pci.h
> > index 0156d7d17f7d..58c97e60c3db 100644
> > --- a/arch/powerpc/platforms/powernv/pci.h
> > +++ b/arch/powerpc/platforms/powernv/pci.h
> > @@ -243,8 +243,11 @@ struct pnv_iov_data {
> >   /* Did we map the VF BARs with single-PE IODA BARs? */
> >   boolm64_single_mode;
> >
> > - int (*m64_map)[PCI_SRIOV_NUM_BARS];
> > -#define IODA_INVALID_M64(-1)
> > + /*
> > +  * Bit mask used to track which m64 windows that we used to map the
>
>
> Language question: either "which" or "that" but both?

U... I don't speak english


Re: [PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

2020-07-14 Thread Oliver O'Halloran
On Tue, Jul 14, 2020 at 5:21 PM Alexey Kardashevskiy  wrote:
>
> On 14/07/2020 15:58, Oliver O'Halloran wrote:
> > On Tue, Jul 14, 2020 at 3:37 PM Alexey Kardashevskiy  wrote:
> >>
> >> On 10/07/2020 15:23, Oliver O'Halloran wrote:
> >>> This also means the only remaining user of the old "DMA Weight" code is
> >>> the IODA1 DMA setup code that it was originally added for, which is good.
> >>
> >>
> >> Is ditching IODA1 in the plan? :)
> >
> > That or separating out the pci_controller_ops for IODA1 and IODA2 so
> > we can stop any IODA2 specific changes from breaking it.
>
> Is IODA1 tested at all these days? Or, is anyone running upstream
> kernels anywhere and keeps shouting when it does not work on IODA1? Thanks,

Cedric has a P7 with OPAL. That's probably the only one left though.


Re: [PATCH 08/15] powerpc/powernv/sriov: Simplify used window tracking

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> No need for the multi-dimensional arrays, just use a bitmap.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 48 +++---
>  arch/powerpc/platforms/powernv/pci.h   |  7 +++-
>  2 files changed, 20 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 216ceeff69b0..e4c65cb49757 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -303,28 +303,20 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>  {
>   struct pnv_iov_data   *iov;
>   struct pnv_phb*phb;
> - inti, j;
> - intm64_bars;
> + int window_id;
>  
>   phb = pci_bus_to_pnvhb(pdev->bus);
>   iov = pnv_iov_get(pdev);
>  
> - if (iov->m64_single_mode)
> - m64_bars = num_vfs;
> - else
> - m64_bars = 1;
> + for_each_set_bit(window_id, iov->used_m64_bar_mask, 64) {
> + opal_pci_phb_mmio_enable(phb->opal_id,
> +  OPAL_M64_WINDOW_TYPE,
> +  window_id,
> +  0);
>  
> - for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
> - for (j = 0; j < m64_bars; j++) {
> - if (iov->m64_map[j][i] == IODA_INVALID_M64)
> - continue;
> - opal_pci_phb_mmio_enable(phb->opal_id,
> - OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 0);
> - clear_bit(iov->m64_map[j][i], &phb->ioda.m64_bar_alloc);
> - iov->m64_map[j][i] = IODA_INVALID_M64;
> - }
> + clear_bit(window_id, &phb->ioda.m64_bar_alloc);
> + }
>  
> - kfree(iov->m64_map);
>   return 0;
>  }
>  
> @@ -350,23 +342,14 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   else
>   m64_bars = 1;
>  
> - iov->m64_map = kmalloc_array(m64_bars,
> -  sizeof(*iov->m64_map),
> -  GFP_KERNEL);
> - if (!iov->m64_map)
> - return -ENOMEM;
> - /* Initialize the m64_map to IODA_INVALID_M64 */
> - for (i = 0; i < m64_bars ; i++)
> - for (j = 0; j < PCI_SRIOV_NUM_BARS; j++)
> - iov->m64_map[i][j] = IODA_INVALID_M64;
> -
> -
>   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
>   if (!res->flags || !res->parent)
>   continue;
>  
>   for (j = 0; j < m64_bars; j++) {
> +
> + /* allocate a window ID for this BAR */
>   do {
>   win = 
> find_next_zero_bit(&phb->ioda.m64_bar_alloc,
>   phb->ioda.m64_bar_idx + 1, 0);
> @@ -374,8 +357,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   if (win >= phb->ioda.m64_bar_idx + 1)
>   goto m64_failed;
>   } while (test_and_set_bit(win, 
> &phb->ioda.m64_bar_alloc));
> -
> - iov->m64_map[j][i] = win;
> + set_bit(win, iov->used_m64_bar_mask);
>  
>   if (iov->m64_single_mode) {
>   size = pci_iov_resource_size(pdev,
> @@ -391,12 +373,12 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>   pe_num = iov->pe_num_map[j];
>   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
>   pe_num, OPAL_M64_WINDOW_TYPE,
> - iov->m64_map[j][i], 0);
> + win, 0);
>   }
>  
>   rc = opal_pci_set_phb_mem_window(phb->opal_id,
>OPAL_M64_WINDOW_TYPE,
> -  iov->m64_map[j][i],
> +  win,
>start,
>0, /* unused */
>size);
> @@ -410,10 +392,10 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
> u16 num_vfs)
>  
>   if (iov->m64_single_mode)
>   rc = opal_pci_phb_mmio_enable(phb->opal_id,
> -  OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 
> 2);
> +  OPAL_M64_WINDOW_TYPE, win, 2);
>   else
>   rc = opal_p

Re: [PATCH v8 5/8] powerpc/vdso: Prepare for switching VDSO to generic C implementation.

2020-07-14 Thread Michael Ellerman
Christophe Leroy  writes:
> Prepare for switching VDSO to generic C implementation in following
> patch. Here, we:
> - Modify __get_datapage() to take an offset
> - Prepare the helpers to call the C VDSO functions
> - Prepare the required callbacks for the C VDSO functions
> - Prepare the clocksource.h files to define VDSO_ARCH_CLOCKMODES
> - Add the C trampolines to the generic C VDSO functions
>
> powerpc is a bit special for VDSO as well as system calls in the
> way that it requires setting CR SO bit which cannot be done in C.
> Therefore, entry/exit needs to be performed in ASM.
>
> Implementing __arch_get_vdso_data() would clobber the link register,
> requiring the caller to save it. As the ASM calling function already
> has to set a stack frame and saves the link register before calling
> the C vdso function, retrieving the vdso data pointer there is lighter.
...

> diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
> b/arch/powerpc/include/asm/vdso/gettimeofday.h
> new file mode 100644
> index ..4452897f9bd8
> --- /dev/null
> +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
> @@ -0,0 +1,175 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_VDSO_GETTIMEOFDAY_H
> +#define __ASM_VDSO_GETTIMEOFDAY_H
> +
> +#include 
> +
> +#ifdef __ASSEMBLY__
> +
> +.macro cvdso_call funct
> +  .cfi_startproc
> + PPC_STLUr1, -STACK_FRAME_OVERHEAD(r1)
> + mflrr0
> +  .cfi_register lr, r0
> + PPC_STL r0, STACK_FRAME_OVERHEAD + PPC_LR_STKOFF(r1)

This doesn't work for me on ppc64(le) with glibc.

glibc doesn't create a stack frame before making the VDSO call, so the
store of r0 (LR) goes into the caller's frame, corrupting the saved LR,
leading to an infinite loop.

This is an example from a statically built program that calls
clock_gettime():

10030cb0 <__clock_gettime>:
10030cb0:   0e 10 40 3c     lis     r2,4110
10030cb4:   00 7a 42 38     addi    r2,r2,31232
10030cb8:   a6 02 08 7c     mflr    r0
10030cbc:   ff ff 22 3d     addis   r9,r2,-1
10030cc0:   58 6d 29 39     addi    r9,r9,27992
10030cc4:   f0 ff c1 fb     std     r30,-16(r1)     <-- redzone store
10030cc8:   78 23 9e 7c     mr      r30,r4
10030ccc:   f8 ff e1 fb     std     r31,-8(r1)      <-- redzone store
10030cd0:   78 1b 7f 7c     mr      r31,r3
10030cd4:   10 00 01 f8     std     r0,16(r1)       <-- save LR to caller's frame
10030cd8:   00 00 09 e8     ld      r0,0(r9)
10030cdc:   00 00 20 2c     cmpdi   r0,0
10030ce0:   50 00 82 41     beq     10030d30 <__clock_gettime+0x80>
10030ce4:   a6 03 09 7c     mtctr   r0
10030ce8:   21 04 80 4e     bctrl                   <-- vdso call
10030cec:   26 00 00 7c     mfcr    r0
10030cf0:   00 10 09 74     andis.  r9,r0,4096
10030cf4:   78 1b 69 7c     mr      r9,r3
10030cf8:   28 00 82 40     bne     10030d20 <__clock_gettime+0x70>
10030cfc:   b4 07 23 7d     extsw   r3,r9
10030d00:   10 00 01 e8     ld      r0,16(r1)       <-- load saved LR, since clobbered by the VDSO
10030d04:   f0 ff c1 eb     ld      r30,-16(r1)
10030d08:   f8 ff e1 eb     ld      r31,-8(r1)
10030d0c:   a6 03 08 7c     mtlr    r0              <-- restore LR
10030d10:   20 00 80 4e     blr                     <-- jumps to 10030cec


I'm kind of confused how it worked for you on 32-bit.

There's also no code to load/restore the TOC pointer on BE, which I
think we'll need to handle.

cheers


Re: [PATCH v3 1/9] powerpc/watchpoint: Fix 512 byte boundary limit

2020-07-14 Thread Jordan Niethe
On Wed, Jul 8, 2020 at 2:53 PM Ravi Bangoria
 wrote:
>
> Milton Miller reported that we are aligning start and end address to
> wrong size SZ_512M. It should be SZ_512. Fix that.
>
> While doing this change I also found a case where ALIGN() comparison
> fails. Within a given aligned range, ALIGN() of two addresses does not
> match when start address is pointing to the first byte and end address
> is pointing to any other byte except the first one. But that's not true
> for ALIGN_DOWN(). ALIGN_DOWN() of any two addresses within that range
> will always point to the first byte. So use ALIGN_DOWN() instead of
> ALIGN().
>
> Fixes: e68ef121c1f4 ("powerpc/watchpoint: Use builtin ALIGN*() macros")
> Reported-by: Milton Miller 
> Signed-off-by: Ravi Bangoria 
I tested this with the ptrace-hwbreak selftest. I can confirm that
without the change to ALIGN_DOWN() these tests fail.
Tested-by: Jordan Niethe 
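
As a quick worked example of the difference (values picked purely for
illustration): an 8-byte watchpoint starting at address 512 sits entirely
within one 512-byte block and should be accepted.

  start_addr = 512, end_addr = 520

  ALIGN(start_addr, SZ_512)         /* = 512  */
  ALIGN(end_addr - 1, SZ_512)       /* = 1024 -> mismatch, wrongly rejected */

  ALIGN_DOWN(start_addr, SZ_512)    /* = 512  */
  ALIGN_DOWN(end_addr - 1, SZ_512)  /* = 512  -> match, correctly accepted  */
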
> ---
>  arch/powerpc/kernel/hw_breakpoint.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
> b/arch/powerpc/kernel/hw_breakpoint.c
> index daf0e1da..031e6defc08e 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -419,7 +419,7 @@ static int hw_breakpoint_validate_len(struct 
> arch_hw_breakpoint *hw)
> if (dawr_enabled()) {
> max_len = DAWR_MAX_LEN;
> /* DAWR region can't cross 512 bytes boundary */
> -   if (ALIGN(start_addr, SZ_512M) != ALIGN(end_addr - 1, 
> SZ_512M))
> +   if (ALIGN_DOWN(start_addr, SZ_512) != ALIGN_DOWN(end_addr - 
> 1, SZ_512))
> return -EINVAL;
> } else if (IS_ENABLED(CONFIG_PPC_8xx)) {
> /* 8xx can setup a range without limitation */
> --
> 2.26.2
>


Re: [PATCH 07/15] powerpc/powernv/sriov: Rename truncate_iov

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> This prevents SR-IOV being used by making the SR-IOV BAR resources
> unallocatable. Rename it to reflect what it actually does.
> 
> Signed-off-by: Oliver O'Halloran 


Reviewed-by: Alexey Kardashevskiy 


> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index f4c74ab1284d..216ceeff69b0 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -155,7 +155,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>  
>   iov = kzalloc(sizeof(*iov), GFP_KERNEL);
>   if (!iov)
> - goto truncate_iov;
> + goto disable_iov;
>   pdev->dev.archdata.iov_data = iov;
>  
>   total_vfs = pci_sriov_get_totalvfs(pdev);
> @@ -170,7 +170,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>   dev_warn(&pdev->dev, "Don't support SR-IOV with"
>   " non M64 VF BAR%d: %pR. \n",
>i, res);
> - goto truncate_iov;
> + goto disable_iov;
>   }
>  
>   total_vf_bar_sz += pci_iov_resource_size(pdev,
> @@ -209,7 +209,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>* mode is 32MB.
>*/
>   if (iov->m64_single_mode && (size < SZ_32M))
> - goto truncate_iov;
> + goto disable_iov;
> +
>   dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
>   res->end = res->start + size * mul - 1;
>   dev_dbg(&pdev->dev, "   %pR\n", res);
> @@ -220,8 +221,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>  
>   return;
>  
> -truncate_iov:
> - /* To save MMIO space, IOV BAR is truncated. */
> +disable_iov:
> + /* Save ourselves some MMIO space by disabling the unusable BARs */
>   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
>   res->flags = 0;
> 

-- 
Alexey


Re: [PATCH 06/15] powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV

2020-07-14 Thread Alexey Kardashevskiy



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
> how anyone is supposed to know how it works except through a lot of
> stuffering. Write up some docs about the overall story to help out
> the next sucker^Wperson who needs to tinker with it.


Sounds about right :)

Reviewed-by: Alexey Kardashevskiy 



> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 130 +
>  1 file changed, 130 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 080ea39f5a83..f4c74ab1284d 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -12,6 +12,136 @@
>  /* for pci_dev_is_added() */
>  #include "../../../../drivers/pci/pci.h"
>  
> +/*
> + * The majority of the complexity in supporting SR-IOV on PowerNV comes from
> + * the need to put the MMIO space for each VF into a separate PE. Internally
> + * the PHB maps MMIO addresses to a specific PE using the "Memory BAR Table".
> + * The MBT historically only applied to the 64bit MMIO window of the PHB
> + * so it's common to see it referred to as the "M64BT".
> + *
> + * An MBT entry stores the mapped range as an <base>,<mask> pair. This forces
> + * the address range that we want to map to be power-of-two sized and 
> aligned.
> + * For conventional PCI devices this isn't really an issue since PCI device 
> BARs
> + * have the same requirement.
> + *
> + * For a SR-IOV BAR things are a little more awkward since size and alignment
> + * are not coupled. The alignment is set based on the per-VF BAR size, but
> but
> + * the total BAR area is: number-of-vfs * per-vf-size. The number of VFs
> + * isn't necessarily a power of two, so neither is the total size. To fix 
> that
> + * we need to finesse (read: hack) the Linux BAR allocator so that it will
> + * allocate the SR-IOV BARs in a way that lets us map them using the MBT.
> + *
> + * The changes to size and alignment that we need to do depend on the "mode"
> + * of MBT entry that we use. We only support SR-IOV on PHB3 (IODA2) and 
> above,
> + * so as a baseline we can assume that we have the following BAR modes
> + * available:
> + *
> + *   NB: $PE_COUNT is the number of PEs that the PHB supports.
> + *
> + * a) A segmented BAR that splits the mapped range into $PE_COUNT equally 
> sized
> + *segments. The n'th segment is mapped to the n'th PE.
> + * b) An un-segmented BAR that maps the whole address range to a specific PE.
> + *
> + *
> + * We prefer to use mode a) since it only requires one MBT entry per SR-IOV 
> BAR
> + * For comparison b) requires one entry per-VF per-BAR, or:
> + * (num-vfs * num-sriov-bars) in total. To use a) we need the size of each 
> segment
> + * to equal the size of the per-VF BAR area. So:
> + *
> + *   new_size = per-vf-size * number-of-PEs
> + *
> + * The alignment for the SR-IOV BAR also needs to be changed from per-vf-size
> + * to "new_size", calculated above. Implementing this is a convoluted process
> + * which requires several hooks in the PCI core:
> + *
> + * 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
> + *
> + *At this point the device has been probed and the device's BARs are 
> sized,
> + *but no resource allocations have been done. The SR-IOV BARs are sized
> + *based on the maximum number of VFs supported by the device and we need
> + *to increase that to new_size.
> + *
> + * 2. Later, when Linux actually assigns resources it tries to make the 
> resource
> + *allocations for each PCI bus as compact as possible. As a part of that 
> it
> + *sorts the BARs on a bus by their required alignment, which is 
> calculated
> + *using pci_resource_alignment().
> + *
> + *For IOV resources this goes:
> + *pci_resource_alignment()
> + *pci_sriov_resource_alignment()
> + *pcibios_sriov_resource_alignment()
> + *pnv_pci_iov_resource_alignment()
> + *
> + *Our hook overrides the default alignment, equal to the per-vf-size, 
> with
> + *new_size computed above.
> + *
> + * 3. When userspace enables VFs for a device:
> + *
> + *sriov_enable()
> + *   pcibios_sriov_enable()
> + *   pnv_pcibios_sriov_enable()
> + *
> + *This is where we actually allocate PE numbers for each VF and setup the
> + *MBT mapping for each SR-IOV BAR. In steps 1) and 2) we setup an "arena"
> + *where each MBT segment is equal in size to the VF BAR so we can shift
> + *around the actual SR-IOV BAR location within this arena. We need this
> + *ability because the PE space is shared by all devices on the same PHB.
> + *When using mode a) described above segment 0 maps to PE#0 which might
> might
> + *be already being used by another device on the PHB.
> + *
> + *As a result we need to allocate a contiguous range of PE numbers, 

Re: [PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

2020-07-14 Thread Alexey Kardashevskiy



On 14/07/2020 17:21, Alexey Kardashevskiy wrote:
> 
> 
> On 14/07/2020 15:58, Oliver O'Halloran wrote:
>> On Tue, Jul 14, 2020 at 3:37 PM Alexey Kardashevskiy  wrote:
>>>
>>> On 10/07/2020 15:23, Oliver O'Halloran wrote:
 There's an optimisation in the PE setup which skips performing DMA
 setup for a PE if we only have bridges in a PE. The assumption being
 that only "real" devices will DMA to system memory, which is probably
 fair. However, if we start off with only bridge devices in a PE then
 add a non-bridge device the new device won't be able to use DMA  because
 we never configured it.

 Fix this (admittedly pretty weird) edge case by tracking whether we've done
 the DMA setup for the PE or not. If a non-bridge device is added to the PE
 (via rescan or hotplug, or whatever) we can set up DMA on demand.
>>>
>>> So hotplug does not work on powernv then, right? I thought you tested it
>>> a while ago, or this patch is the result of that attempt? If it is, then
>>
>> It mostly works. Just the really niche case of hot plugging a bridge,
>> then later on hot plugging a device into the same bus which wouldn't
>> work.
> 
> Do not you have to have a slot (which is a bridge) for hotplug in the
> first place, to hotplug the bridge?


As discussed elsewhere, I missed that it is a non-bridge device on the
same bus as a previously plugged bridge. Now it all makes sense and


Reviewed-by: Alexey Kardashevskiy 


-- 
Alexey


[PATCH] pseries: Fix 64 bit logical memory block panic

2020-07-14 Thread Anton Blanchard
Booting with a 4GB LMB size causes us to panic:

  qemu-system-ppc64: OS terminated: OS panic:
  Memory block size not suitable: 0x0

Fix pseries_memory_block_size() to handle 64 bit LMBs.

Cc: sta...@vger.kernel.org
Signed-off-by: Anton Blanchard 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5ace2f9a277e..6574ac33e887 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -27,7 +27,7 @@ static bool rtas_hp_event;
 unsigned long pseries_memory_block_size(void)
 {
struct device_node *np;
-   unsigned int memblock_size = MIN_MEMORY_BLOCK_SIZE;
+   uint64_t memblock_size = MIN_MEMORY_BLOCK_SIZE;
struct resource r;
 
np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
-- 
2.26.2
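
For reference, a minimal illustration (not part of the original patch) of
the truncation being fixed: assigning a 4GB LMB size to a 32-bit variable
yields zero, which is the 0x0 shown in the panic message above.

  unsigned int size32 = 0x100000000ULL;  /* truncated to 0 on assignment */
  uint64_t     size64 = 0x100000000ULL;  /* keeps the full 4GB value     */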



Re: [PATCH v3 03/12] powerpc/kexec_file: add helper functions for getting memory ranges

2020-07-14 Thread Thiago Jung Bauermann


Hello Hari,

Hari Bathini  writes:

> In kexec case, the kernel to be loaded uses the same memory layout as
> the running kernel. So, passing on the DT of the running kernel would
> be good enough.
>
> But in case of kdump, different memory ranges are needed to manage
> loading the kdump kernel, booting into it and exporting the elfcore
> of the crashing kernel. The ranges are exlude memory ranges, usable

s/exlude/exclude/

> memory ranges, reserved memory ranges and crash memory ranges.
>
> Exclude memory ranges specify the list of memory ranges to avoid while
> loading kdump segments. Usable memory ranges list the memory ranges
> that could be used for booting kdump kernel. Reserved memory ranges
> list the memory regions for the loading kernel's reserve map. Crash
> memory ranges list the memory ranges to be exported as the crashing
> kernel's elfcore.
>
> Add helper functions for setting up the above mentioned memory ranges.
> These helpers facilitate understanding the subsequent changes better
> and make it easy to setup the different memory ranges listed above, as
> and when appropriate.
>
> Signed-off-by: Hari Bathini 
> Tested-by: Pingfan Liu 



> +/**
> + * get_mem_rngs_size - Get the allocated size of mrngs based on
> + * max_nr_ranges and chunk size.
> + * @mrngs: Memory ranges.
> + *
> + * Returns the maximum no. of ranges.

This isn't correct. It returns the maximum size of @mrngs.

> + */
> +static inline size_t get_mem_rngs_size(struct crash_mem *mrngs)
> +{
> + size_t size;
> +
> + if (!mrngs)
> + return 0;
> +
> + size = (sizeof(struct crash_mem) +
> + (mrngs->max_nr_ranges * sizeof(struct crash_mem_range)));
> +
> + /*
> +  * Memory is allocated in size multiple of MEM_RANGE_CHUNK_SZ.
> +  * So, align to get the actual length.
> +  */
> + return ALIGN(size, MEM_RANGE_CHUNK_SZ);
> +}



> +/**
> + * add_tce_mem_ranges - Adds tce-table range to the given memory ranges list.
> + * @mem_ranges: Range list to add the memory range(s) to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_tce_mem_ranges(struct crash_mem **mem_ranges)
> +{
> + struct device_node *dn;
> + int ret;
> +
> + for_each_node_by_type(dn, "pci") {
> + u64 base;
> + u32 size;
> +
> + ret = of_property_read_u64(dn, "linux,tce-base", &base);
> + ret |= of_property_read_u32(dn, "linux,tce-size", &size);
> + if (!ret)

Shouldn't the condition be `ret` instead of `!ret`?

> + continue;
> +
> + ret = add_mem_range(mem_ranges, base, size);
> + if (ret)
> + break;
> + }
> +
> + return ret;
> +}
> +
> +/**
> + * add_initrd_mem_range - Adds initrd range to the given memory ranges list,
> + *if the initrd was retained.
> + * @mem_ranges:   Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_initrd_mem_range(struct crash_mem **mem_ranges)
> +{
> + u64 base, end;
> + int ret = 0;
> + char *str;
> +
> + /* This range means something only if initrd was retained */
> + str = strstr(saved_command_line, "retain_initrd");
> + if (!str)
> + return 0;
> +
> + ret = of_property_read_u64(of_chosen, "linux,initrd-start", &base);
> + ret |= of_property_read_u64(of_chosen, "linux,initrd-end", &end);
> + if (!ret)
> + ret = add_mem_range(mem_ranges, base, end - base + 1);
> + return ret;
> +}
> +
> +/**
> + * add_htab_mem_range - Adds htab range to the given memory ranges list,
> + *  if it exists
> + * @mem_ranges: Range list to add the memory range to.
> + *
> + * Returns 0 on success, negative errno on error.
> + */
> +int add_htab_mem_range(struct crash_mem **mem_ranges)
> +{
> +#ifdef CONFIG_PPC_BOOK3S_64
> + int ret;
> +
> + if (!htab_address)
> + return 0;
> +
> + ret = add_mem_range(mem_ranges, __pa(htab_address), htab_size_bytes);
> + return ret;
> +#else
> + return 0;
> +#endif
> +}

If I'm not mistaken, this is not the preferred way of having alternative
implementations of a function. The "Conditional Compilation" section of
the coding style document doesn't mention this directly, but does say
that it's better to put the conditionals in a header file.

In this case, I would do this in 

#ifdef CONFIG_PPC_BOOK3S_64
int add_htab_mem_range(struct crash_mem **mem_ranges);
#else
static inline int add_htab_mem_range(struct crash_mem **mem_ranges)
{
return 0;
}
#endif

And in ranges.c just surround the add_htab_mem_range() definition with
#ifdef CONFIG_PPC_BOOK3S_64 and #endif

Also, there's no need for the ret variable. You can just
`return add_mem_range(...)` directly.
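
Putting both suggestions together, the ranges.c side could look roughly
like this (sketch only, based on the code quoted below):

  #ifdef CONFIG_PPC_BOOK3S_64
  int add_htab_mem_range(struct crash_mem **mem_ranges)
  {
          if (!htab_address)
                  return 0;

          return add_mem_range(mem_ranges, __pa(htab_address), htab_size_bytes);
  }
  #endif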

> +
> +/**
> + * add_kernel_mem_range - Adds kernel text region to the given
> + *  

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Bjorn Helgaas
[+cc Kjetil]

On Wed, Jul 15, 2020 at 12:01:56AM +0200, Arnd Bergmann wrote:
> On Tue, Jul 14, 2020 at 8:45 PM Bjorn Helgaas  wrote:
> > On Mon, Jul 13, 2020 at 05:08:10PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jul 13, 2020 at 3:22 PM Saheed O. Bolarinwa
> > > Starting with a), my first question is whether any high-level
> > > drivers even need to care about errors from these functions. I see
> > > 4913 callers that ignore the return code, and 576 that actually
> > > check it, and almost none care about the specific error (as you
> > > found as well). Unless we conclude that most PCI drivers are wrong,
> > > could we just change the return type to 'void' and assume they never
> > > fail for valid arguments on a valid pci_device* ?
> >
> > I really like this idea.
> >
> > pci_write_config_*() has one return value, and only 100ish of 2500
> > callers check for errors.  It's sometimes possible for config
> > accessors to detect PCI errors and return failure, e.g., device was
> > removed or didn't respond, but most of them don't, and detecting these
> > errors is not really that valuable.
> >
> > pci_read_config_*() is much more interesting because it returns two
> > things, the function return value and the value read from the PCI
> > device, and it's complicated to check both.
> >
> > Again it's sometimes possible for config read accessors to detect PCI
> > errors, but in most cases a PCI error means the accessor returns
> > success and the value from PCI is ~0.
> >
> > Checking the function return value catches programming errors (bad
> > alignment, etc) but misses most of the interesting errors (device was
> > unplugged or reported a PCI error).
> 
> My thinking was more that most of the time the error checking may
> be completely bogus to start with, and I would just not check for
> errors at all.

Yes.  I have no problem with that.  There are a few cases where it's
important to check for errors, e.g., we read a status register and do
something based on a bit being set.  A failure will return all bits
set, and we may do the wrong thing.  But most of the errors we care
about will be on MMIO reads, not config reads, so we can probably
ignore most config read errors.
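
A minimal sketch of that failure mode (illustrative only; pdev and
handle_parity_error() are placeholders, the register macros are the usual
ones from pci_regs.h):

  u16 status;

  pci_read_config_word(pdev, PCI_STATUS, &status);  /* return value ignored */
  if (status & PCI_STATUS_DETECTED_PARITY)          /* also true for 0xffff */
          handle_parity_error(pdev);                /* wrong thing on a failed read */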

> > Checking the value returned from PCI is tricky because ~0 is a valid
> > value for some config registers, and only the driver knows for sure.
> > If the driver knows that ~0 is a possible value, it would have to do
> > something else, e.g., another config read of a register that *cannot*
> > be ~0, to see whether it's really an error.
> >
> > I suspect that if we had a single value to look at it would be easier
> > to get right.  Error checking with current interface would look like
> > this:
> >
> >   err = pci_read_config_word(dev, addr, &val);
> >   if (err)
> > return -EINVAL;
> >
> >   if (PCI_POSSIBLE_ERROR(val)) {
> > /* if driver knows ~0 is invalid */
> > return -EINVAL;
> >
> > /* if ~0 is potentially a valid value */
> > err = pci_read_config_word(dev, PCI_VENDOR_ID, &val2);
> > if (err)
> >   return -EINVAL;
> >
> > if (PCI_POSSIBLE_ERROR(val2))
> >   return -EINVAL;
> >   }
> >
> > Error checking with a possible interface that returned only a single
> > value could look like this:
> >
> >   val = pci_config_read_word(dev, addr);
> >   if (PCI_POSSIBLE_ERROR(val)) {
> > /* if driver knows ~0 is invalid */
> > return -EINVAL;
> >
> > /* if ~0 is potentially a valid value */
> > val2 = pci_config_read_word(dev, PCI_VENDOR_ID);
> > if (PCI_POSSIBLE_ERROR(val2))
> >   return -EINVAL;
> >   }
> >
> > Am I understanding you correctly?
> 
> That would require changing all callers of the function, which
> I think would involve changing some 700 files. 

Yeah, that would be a disaster.  So something like:

  void pci_read_config_word(struct pci_dev *dev, int where, u16 *val)

and where we used to return anything non-zero, we just set *val = ~0
instead?  I think we do that already in most, maybe all, cases.
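
A minimal sketch of such an accessor (the name and shape are hypothetical,
just to illustrate "set *val = ~0 on any failure"):

  static void pci_read_config_word_noerr(struct pci_dev *dev, int where, u16 *val)
  {
          if (pci_read_config_word(dev, where, val))
                  *val = ~0;
  }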

> What I was suggesting was to only change the return type to void and
> categorize all drivers that today check it as either
> 
> a) checking the return code is not helpful, or possibly even
> wrong, so we just stop doing it. I expect those to be the
> vast majority of callers, but that could be wrong.
> 
> b) Code that legitimately check the error code and need to
>take an appropriate action. These could be changed to
>calling a different interface such as 'pci_bus_read_config_word'
>or a new 'pci_device_last_error()' function.

Yep, makes sense.

> The reasons I suspect that most callers don't actually need
> to check for errors are:
> 
> - Most error checking is static: PCIBIOS_BAD_REGISTER_NUMBER
>   only happens if you pass an invalid register number, but most
>   callers pass a compile-time constant register number that is
>   known to be correct, or the driver would never work. Similarly,
>   PCIBIOS_DEVICE_NOT_FOUND wouldn't normally happen
>  

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Rob Herring
On Tue, Jul 14, 2020 at 12:45 PM Bjorn Helgaas  wrote:
>
> [trimmed the cc list; it's still too large but maybe arch folks care]
>
> On Mon, Jul 13, 2020 at 05:08:10PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 13, 2020 at 3:22 PM Saheed O. Bolarinwa
> >  wrote:
> > > This goal of these series is to move the definition of *all*
> > > PCIBIOS* from include/linux/pci.h to arch/x86 and limit their use
> > > within there.  All other tree specific definition will be left for
> > > intact. Maybe they can be renamed.
> > >
> > > PCIBIOS* is an x86 concept as defined by the PCI spec. The
> > > returned error codes of PCIBIOS* are positive values and this
> > > introduces some complexities which other archs need not incur.
> >
> > I think the intention is good, but I find the series in its current
> > form very hard to review, in particular the way you touch some
> > functions three times with trivial changes. Instead of
> >
> > 1) replace PCIBIOS_SUCCESSFUL with 0
> > 2) drop pointless 0-comparison
> > 3) reformat whitespace
> >
> > I would suggest to combine the first two steps into one patch per
> > subsystem and drop the third step.
>
> I agree.  BUT please don't just run out and post new patches to do
> this.  Let's talk about Arnd's further ideas below first.
>
> > ...
> > Maybe the work can be split up differently, with a similar end
> > result but fewer and easier reviewed patches. The way I'd look at
> > the problem, there are three main areas that can be dealt with one
> > at a time:
> >
> > a) callers of the high-level config space accessors
> >pci_{write,read}_config_{byte,word,dword}, mostly in device
> >drivers.
> > b) low-level implementation of the config space accessors
> > through struct pci_ops
> > c) all other occurrences of these constants
> >
> > Starting with a), my first question is whether any high-level
> > drivers even need to care about errors from these functions. I see
> > 4913 callers that ignore the return code, and 576 that actually
> > check it, and almost none care about the specific error (as you
> > found as well). Unless we conclude that most PCI drivers are wrong,
> > could we just change the return type to 'void' and assume they never
> > fail for valid arguments on a valid pci_device* ?
>
> I really like this idea.
>
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting these
> errors is not really that valuable.
>
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both.
>
> Again it's sometimes possible for config read accessors to detect PCI
> errors, but in most cases a PCI error means the accessor returns
> success and the value from PCI is ~0.
>
> Checking the function return value catches programming errors (bad
> alignment, etc) but misses most of the interesting errors (device was
> unplugged or reported a PCI error).
>
> Checking the value returned from PCI is tricky because ~0 is a valid
> value for some config registers, and only the driver knows for sure.
> If the driver knows that ~0 is a possible value, it would have to do
> something else, e.g., another config read of a register that *cannot*
> be ~0, to see whether it's really an error.
>
> I suspect that if we had a single value to look at it would be easier
> to get right.  Error checking with current interface would look like
> this:
>
>   err = pci_read_config_word(dev, addr, &val);
>   if (err)
> return -EINVAL;
>
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> err = pci_read_config_word(dev, PCI_VENDOR_ID, &val2);
> if (err)
>   return -EINVAL;
>
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Error checking with a possible interface that returned only a single
> value could look like this:
>
>   val = pci_config_read_word(dev, addr);
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> val2 = pci_config_read_word(dev, PCI_VENDOR_ID);
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Am I understanding you correctly?
>
> > For b), it might be nice to also change other aspects of the
> > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > instead of a pci_bus pointer, or having the callback in the
> > pci_host_bridge structure.
>
> I like this idea a lot, too.  I think the fact that
> pci_bus_read_config_word() requires a pci_bus * complicates things in
> a few places.

I've been looking at the various host implementations of config
accessors as well as probe functions. Needing the

[Bug 208181] BUG: KASAN: stack-out-of-bounds in strcmp+0x58/0xd8

2020-07-14 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=208181

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #289937|0   |1
is obsolete||

--- Comment #16 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290285
  --> https://bugzilla.kernel.org/attachment.cgi?id=290285&action=edit
kernel .config (5.8-rc5, PowerMac G4 DP)

Did some additional test-runs, seems there are still problems with stack usage
when running (inline) KASAN:

5.8-rc3 + the 2 patches applied:
Instruction dump:
usercopy: Kernel memory overwrite attemp detected to kernel text (offset 5432,
size 4)!
[ cut here ]
kernel BUG at mm/usercopy.c:99!
Oops: Exeption in kernel mode, sig:5 [#6]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc
b43legacy input_leds joydev mac80211 hid_generic g4_windtunnel sungem
sungem_phy btrfs ohci_pci xor lzo_compress lzo_decompress zlib_deflate raid6_pq
zlib_inflate firewire_ohci hcd soundcore ssb pcmcia usbcore uninorth_agp
pcmcia_core agpart usb_common
CPU: 1 PID: 5250 Comm: mount.nfs Tainted: G   W   
5.8.0-rc3-PowerMacG4+ #8
NIP: c04d654c LR: c04d654c CTR: 
REGS: c0001198 TRAP: 0700  Tainted: G   W(5.8.0-rc3-PowerMacG4+)
MSR:  00021031  CR: 24028822 XER: 

GPR00: c04d654c c0001498 e719b980 0058 c01899f4  0027 e8dc4e0c
GPR08: 0023   c0001498 44028822 0061bff4 f80002s9 0003
GPR16: c115a340 f80002d7 c00016b8 c00016c8 c04d654c c115a260 c04d651c f80002d5
GPR24: c00016ac 180002d5 e8dda024 c000 c000153c  0004 c0001538
NIP [c04d654c] usercopy_abort+0x68/0x78
LR [c04d654c] usercopy_abort+0x68/0x78
Call Trace:
Instruction dump:
usercopy: Kernel memory overwrite attemp detected to kernel text (offset 4848,
size 4)!
[ cut here ]
kernel BUG at mm/usercopy.c:99!
Oops: Exeption in kernel mode, sig:5 [#7]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc
b43legacy input_leds joydev mac80211 hid_generic g4_windtunnel sungem
sungem_phy btrfs ohci_pci xor lzo_compress lzo_decompress zlib_deflate raid6_pq
zlib_inflate firewire_ohci hcd soundcore ssb pcmcia usbcore uninorth_agp
pcmcia_core agpart usb_common
CPU: 1 PID: 5250 Comm: mount.nfs Tainted: G   W   
5.8.0-rc3-PowerMacG4+ #8
NIP: c04d654c LR: c04d654c CTR: 
REGS: c0001198 TRAP: 0700  Tainted: G   W(5.8.0-rc3-PowerMacG4+)
MSR:  00021031  CR: 24028822 XER: 

GPR00: c04d654c c0001250 e719b980 0058 c01899f4  0027 e8dc4e0c
GPR08: 0023   c0001250 44028822 0061bff4 f8000290 0003
GPR16: c115a340 f800028e c0001470 c0001480 c04d654c c115a260 c04d651c f800028c
GPR24: c0001464 1800028c e8dda024 c000 c00012f4  0004 c00012f0
NIP [c04d654c] usercopy_abort+0x68/0x78
Unrecoverable FP Unavailable Exception 801 at 908
LR [c04d654c] usercopy_abort+0x68/0x78
Call Trace:


5.8-rc5 + the 2 patches applied:
do_IRQ: stack overflow: 1984
CPU: 1 PID: 347 Comm: gzip Tainted: G   W5.8.0-rc5-PowerMacG4+ #1
Call Trace:

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Arnd Bergmann
On Tue, Jul 14, 2020 at 8:45 PM Bjorn Helgaas  wrote:
> On Mon, Jul 13, 2020 at 05:08:10PM +0200, Arnd Bergmann wrote:
> > On Mon, Jul 13, 2020 at 3:22 PM Saheed O. Bolarinwa
> > Starting with a), my first question is whether any high-level
> > drivers even need to care about errors from these functions. I see
> > 4913 callers that ignore the return code, and 576 that actually
> > check it, and almost none care about the specific error (as you
> > found as well). Unless we conclude that most PCI drivers are wrong,
> > could we just change the return type to 'void' and assume they never
> > fail for valid arguments on a valid pci_device* ?
>
> I really like this idea.
>
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting these
> errors is not really that valuable.
>
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both.
>
> Again it's sometimes possible for config read accessors to detect PCI
> errors, but in most cases a PCI error means the accessor returns
> success and the value from PCI is ~0.
>
> Checking the function return value catches programming errors (bad
> alignment, etc) but misses most of the interesting errors (device was
> unplugged or reported a PCI error).

My thinking was more that most of the time the error checking may
be completely bogus to start with, and I would just not check for
errors at all.

> Checking the value returned from PCI is tricky because ~0 is a valid
> value for some config registers, and only the driver knows for sure.
> If the driver knows that ~0 is a possible value, it would have to do
> something else, e.g., another config read of a register that *cannot*
> be ~0, to see whether it's really an error.
>
> I suspect that if we had a single value to look at it would be easier
> to get right.  Error checking with current interface would look like
> this:
>
>   err = pci_read_config_word(dev, addr, &val);
>   if (err)
> return -EINVAL;
>
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> err = pci_read_config_word(dev, PCI_VENDOR_ID, &val2);
> if (err)
>   return -EINVAL;
>
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Error checking with a possible interface that returned only a single
> value could look like this:
>
>   val = pci_config_read_word(dev, addr);
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> val2 = pci_config_read_word(dev, PCI_VENDOR_ID);
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Am I understanding you correctly?

That would require changing all callers of the function, which
I think would involve changing some 700 files. What I was
suggesting was to only change the return type to void and
categorize all drivers that today check it as either

a) checking the return code is not helpful, or possibly even
wrong, so we just stop doing it. I expect those to be the
vast majority of callers, but that could be wrong.

b) Code that legitimately check the error code and need to
   take an appropriate action. These could be changed to
   calling a different interface such as 'pci_bus_read_config_word'
   or a new 'pci_device_last_error()' function.

The reasons I suspect that most callers don't actually need
to check for errors are:

- Most error checking is static: PCIBIOS_BAD_REGISTER_NUMBER
  only happens if you pass an invalid register number, but most
  callers pass a compile-time constant register number that is
  known to be correct, or the driver would never work. Similarly,
  PCIBIOS_DEVICE_NOT_FOUND wouldn't normally happen
  since you pass a valid pci_device pointer that was already
  probed.

- config space accesses are very rare compared to memory
  space access and on the hardware side the error handling
  would be similar, but readl/writel don't return errors, they just
  access wrong registers or return 0xffffffff.
  arch/powerpc/kernel/eeh.c has a ton of extra code written to
  deal with it, but no other architectures do.

- If we add code to detect errors in pci_read_config_*
  and do some of the stuff from powerpc's
  eeh_dev_check_failure(), we are more likely to catch
  intermittent failures when drivers don't check, or bugs
  with invalid arguments in device drivers than relying on
  drivers to get their error handling right when those code
  paths don't ever get covered in normal testing.

Looking at a couple of random drivers that do check the
return codes, I find:

drivers/edac/amd8131_edac.c: prints the register number,
th

Re: [PATCH v2 5/5] powerpc: Add LKDTM test to hijack a patch mapping

2020-07-14 Thread Kees Cook
On Wed, Jul 08, 2020 at 11:03:16PM -0500, Christopher M. Riedl wrote:
> When live patching with STRICT_KERNEL_RWX, the CPU doing the patching
> must use a temporary mapping which allows for writing to kernel text.
> During the entire window of time when this temporary mapping is in use,
> another CPU could write to the same mapping and maliciously alter kernel
> text. Implement an LKDTM test to attempt to exploit such an opening when
> a CPU is patching under STRICT_KERNEL_RWX. The test is only implemented
> on powerpc for now.
> 
> The LKDTM "hijack" test works as follows:
> 
>   1. A CPU executes an infinite loop to patch an instruction.
>  This is the "patching" CPU.
>   2. Another CPU attempts to write to the address of the temporary
>  mapping used by the "patching" CPU. This other CPU is the
>  "hijacker" CPU. The hijack either fails with a segfault or
>  succeeds, in which case some kernel text is now overwritten.
> 
> How to run the test:
> 
>   mount -t debugfs none /sys/kernel/debug
>   (echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)
> 
> Signed-off-by: Christopher M. Riedl 
> ---
>  drivers/misc/lkdtm/core.c  |  1 +
>  drivers/misc/lkdtm/lkdtm.h |  1 +
>  drivers/misc/lkdtm/perms.c | 99 ++
>  3 files changed, 101 insertions(+)
> 
> diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
> index a5e344df9166..482e72f6a1e1 100644
> --- a/drivers/misc/lkdtm/core.c
> +++ b/drivers/misc/lkdtm/core.c
> @@ -145,6 +145,7 @@ static const struct crashtype crashtypes[] = {
>   CRASHTYPE(WRITE_RO),
>   CRASHTYPE(WRITE_RO_AFTER_INIT),
>   CRASHTYPE(WRITE_KERN),
> + CRASHTYPE(HIJACK_PATCH),
>   CRASHTYPE(REFCOUNT_INC_OVERFLOW),
>   CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
>   CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
> diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
> index 601a2156a0d4..bfcf3542370d 100644
> --- a/drivers/misc/lkdtm/lkdtm.h
> +++ b/drivers/misc/lkdtm/lkdtm.h
> @@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
>  void lkdtm_EXEC_NULL(void);
>  void lkdtm_ACCESS_USERSPACE(void);
>  void lkdtm_ACCESS_NULL(void);
> +void lkdtm_HIJACK_PATCH(void);
>  
>  /* lkdtm_refcount.c */
>  void lkdtm_REFCOUNT_INC_OVERFLOW(void);
> diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
> index 62f76d506f04..b7149daaeb6f 100644
> --- a/drivers/misc/lkdtm/perms.c
> +++ b/drivers/misc/lkdtm/perms.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /* Whether or not to fill the target memory area with do_nothing(). */
> @@ -213,6 +214,104 @@ void lkdtm_ACCESS_NULL(void)
>   *ptr = tmp;
>  }
>  
> +#if defined(CONFIG_PPC) && defined(CONFIG_STRICT_KERNEL_RWX)
> +#include 
> +
> +static struct ppc_inst * const patch_site = (struct ppc_inst *)&do_nothing;

While this is probably fine, I'd prefer a new target instead of re-using
do_nothing.
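
Something like this, perhaps (sketch only, the name is made up):

  /* Dedicated patching target so the test doesn't depend on do_nothing(). */
  static noinline void lkdtm_patch_target(void)
  {
  }

  static struct ppc_inst * const patch_site =
          (struct ppc_inst *)&lkdtm_patch_target;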

> +
> +static int lkdtm_patching_cpu(void *data)
> +{
> + int err = 0;
> + struct ppc_inst insn = ppc_inst(0xdeadbeef);
> +
> + pr_info("starting patching_cpu=%d\n", smp_processor_id());
> + do {
> + err = patch_instruction(patch_site, insn);
> + } while (ppc_inst_equal(ppc_inst_read(READ_ONCE(patch_site)), insn) &&
> + !err && !kthread_should_stop());
> +
> + if (err)
> + pr_warn("patch_instruction returned error: %d\n", err);
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> + while (!kthread_should_stop()) {
> + schedule();
> + set_current_state(TASK_INTERRUPTIBLE);
> + }
> +
> + return err;
> +}
> +
> +void lkdtm_HIJACK_PATCH(void)
> +{
> + struct task_struct *patching_kthrd;
> + struct ppc_inst original_insn;
> + int patching_cpu, hijacker_cpu, attempts;
> + unsigned long addr;
> + bool hijacked;
> +
> + if (num_online_cpus() < 2) {
> + pr_warn("need at least two cpus\n");
> + return;
> + }
> +
> + original_insn = ppc_inst_read(READ_ONCE(patch_site));
> +
> + hijacker_cpu = smp_processor_id();
> + patching_cpu = cpumask_any_but(cpu_online_mask, hijacker_cpu);
> +
> + patching_kthrd = kthread_create_on_node(&lkdtm_patching_cpu, NULL,
> + cpu_to_node(patching_cpu),
> + "lkdtm_patching_cpu");
> + kthread_bind(patching_kthrd, patching_cpu);
> + wake_up_process(patching_kthrd);
> +
> + addr = offset_in_page(patch_site) | 
> read_cpu_patching_addr(patching_cpu);
> +
> + pr_info("starting hijacker_cpu=%d\n", hijacker_cpu);
> + for (attempts = 0; attempts < 10; ++attempts) {
> + /* Use __put_user to catch faults without an Oops */
> + hijacked = !__put_user(0xbad00bad, (unsigned int *)addr);
> +
> + if (hijacked) {
> + if (kthread_stop(patc

Re: [PATCH 3/3] ASoC: fsl-asoc-card: Support Headphone and Microphone Jack detection

2020-07-14 Thread Nicolin Chen
Hi Shengjiu,

The whole series looks good to me. Just a couple of small
questions inline:

On Tue, Jul 14, 2020 at 05:05:36PM +0800, Shengjiu Wang wrote:
> Use asoc_simple_init_jack function from simple card to implement
> the Headphone and Microphone detection.
> Register notifier to disable Speaker when Headphone is plugged in
> and enable Speaker when Headphone is unplugged.
> Register notifier to disable Digital Microphone when Analog Microphone
> is plugged in and enable DMIC when Analog Microphone is unplugged.
> 
> Signed-off-by: Shengjiu Wang 
> ---
>  sound/soc/fsl/Kconfig |  1 +
>  sound/soc/fsl/fsl-asoc-card.c | 69 ++-
>  2 files changed, 68 insertions(+), 2 deletions(-)

>  static int fsl_asoc_card_late_probe(struct snd_soc_card *card)
>  {
>   struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(card);
> @@ -745,8 +789,29 @@ static int fsl_asoc_card_probe(struct platform_device 
> *pdev)
>   snd_soc_card_set_drvdata(&priv->card, priv);
>  
>   ret = devm_snd_soc_register_card(&pdev->dev, &priv->card);
> - if (ret && ret != -EPROBE_DEFER)
> - dev_err(&pdev->dev, "snd_soc_register_card failed (%d)\n", ret);
> + if (ret) {
> + if (ret != -EPROBE_DEFER)
> + dev_err(&pdev->dev, "snd_soc_register_card failed 
> (%d)\n", ret);

I think we may move this EPROBE_DEFER to the asrc_fail label.

> + goto asrc_fail;
> + }
> +
> + if (of_property_read_bool(np, "hp-det-gpio")) {

Could we move this check inside asoc_simple_init_jack()? There's no
problem with doing it here, though it confused me at first: I thought
it was a boolean property, which would be against the DT bindings,
until I saw that asoc_simple_init_jack() uses the same string to get
the GPIO. It would probably be a bit tricky though, since we need it
to be optional here.

Otherwise, I think we may add a line of comments to indicate that
the API would use the same string to get the GPIO.
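
For example, something along these lines (wording is only a suggestion):

  /* asoc_simple_init_jack() looks up the GPIO using this same property name */
  if (of_property_read_bool(np, "hp-det-gpio")) {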

> + ret = asoc_simple_init_jack(&priv->card, &priv->hp_jack,
> + 1, NULL, "Headphone Jack");
> + if (ret)
> + goto asrc_fail;
> +
> + snd_soc_jack_notifier_register(&priv->hp_jack.jack, 
> &hp_jack_nb);
> + }


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Kjetil Oftedal
On 14/07/2020, Bjorn Helgaas  wrote:

>>
>> a) callers of the high-level config space accessors
>>pci_{write,read}_config_{byte,word,dword}, mostly in device
>>drivers.
>> b) low-level implementation of the config space accessors
>> through struct pci_ops
>> c) all other occurrences of these constants
>>
>> Starting with a), my first question is whether any high-level
>> drivers even need to care about errors from these functions. I see
>> 4913 callers that ignore the return code, and 576 that actually
>> check it, and almost none care about the specific error (as you
>> found as well). Unless we conclude that most PCI drivers are wrong,
>> could we just change the return type to 'void' and assume they never
>> fail for valid arguments on a valid pci_device* ?
>
> I really like this idea.
>
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting these
> errors is not really that valuable.
>
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both.
>
> Again it's sometimes possible for config read accessors to detect PCI
> errors, but in most cases a PCI error means the accessor returns
> success and the value from PCI is ~0.
>
> Checking the function return value catches programming errors (bad
> alignment, etc) but misses most of the interesting errors (device was
> unplugged or reported a PCI error).
>
> Checking the value returned from PCI is tricky because ~0 is a valid
> value for some config registers, and only the driver knows for sure.
> If the driver knows that ~0 is a possible value, it would have to do
> something else, e.g., another config read of a register that *cannot*
> be ~0, to see whether it's really an error.
>
> I suspect that if we had a single value to look at it would be easier
> to get right.  Error checking with current interface would look like
> this:
>
>   err = pci_read_config_word(dev, addr, &val);
>   if (err)
> return -EINVAL;
>
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> err = pci_read_config_word(dev, PCI_VENDOR_ID, &val2);
> if (err)
>   return -EINVAL;
>
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Error checking with a possible interface that returned only a single
> value could look like this:
>
>   val = pci_config_read_word(dev, addr);
>   if (PCI_POSSIBLE_ERROR(val)) {
> /* if driver knows ~0 is invalid */
> return -EINVAL;
>
> /* if ~0 is potentially a valid value */
> val2 = pci_config_read_word(dev, PCI_VENDOR_ID);
> if (PCI_POSSIBLE_ERROR(val2))
>   return -EINVAL;
>   }
>
> Am I understanding you correctly?

Let us not do this. Reading config space is really expensive on some
architectures. Requiring a driver to do it twice for some values does
not improve that situation, and it is quite redundant if the Root
Complex driver already knows that the first access failed.

Additionally, since multiple concurrent config accesses to the same
device are not allowed by the spec, the hardware must block and wait
for a timeout if a config access does not get a response (which can
happen if an intermediate link between the RC and the endpoint has to
retrain). Having to block twice is very much not ideal. And in the
retraining case the second access might even succeed, since the link
might recover between reading the first config word and reading
PCI_VENDOR_ID, allowing the driver to accept invalid data from the
device.

>
>> For b), it might be nice to also change other aspects of the
>> interface, e.g. passing a pci_host_bridge pointer plus bus number
>> instead of a pci_bus pointer, or having the callback in the
>> pci_host_bridge structure.
>
> I like this idea a lot, too.  I think the fact that
> pci_bus_read_config_word() requires a pci_bus * complicates things in
> a few places.
>
> I think it's completely separate, as you say, and we should defer it
> for now because even part a) is a lot of work.  I added it to my list
> of possible future projects.
>

What about strange PCI devices such as Non-Transparent bridges?
They will require their own PCI config space accessors that are not
connected to a host bridge if one wants to do some sort of
punch-through enumeration.
I guess the kernel doesn't care much about them?

Best regards,
Kjetil Oftedal


Re: [PATCH v3 01/12] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-14 Thread Thiago Jung Bauermann


Hari Bathini  writes:

> Some architectures may have special memory regions, within the given
> memory range, which can't be used for the buffer in a kexec segment.
> Implement weak arch_kexec_locate_mem_hole() definition which arch code
> may override, to take care of special regions, while trying to locate
> a memory hole.
>
> Also, add the missing declarations for arch overridable functions and
> drop the __weak descriptors in the declarations to prevent non-weak
> definitions from becoming weak.
>
> Reported-by: kernel test robot 
> [lkp: In v1, arch_kimage_file_post_load_cleanup() declaration was missing]
> Signed-off-by: Hari Bathini 
> Acked-by: Dave Young 
> Tested-by: Pingfan Liu 

Reviewed-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH v2 3/5] powerpc/lib: Use a temporary mm for code patching

2020-07-14 Thread Christopher M. Riedl
On Thu Jul 9, 2020 at 4:02 AM CDT, Christophe Leroy wrote:
>
>
> Le 09/07/2020 à 06:03, Christopher M. Riedl a écrit :
> > Currently, code patching a STRICT_KERNEL_RWX exposes the temporary
> > mappings to other CPUs. These mappings should be kept local to the CPU
> > doing the patching. Use the pre-initialized temporary mm and patching
> > address for this purpose. Also add a check after patching to ensure the
> > patch succeeded.
>
> While trying the LKDTM test, I realised that this is useless for
> non-SMP.
> Is it worth applying that change to non-SMP systems?
>
> Christophe
>

Hmm, I would say it's probably preferable to maintain a single
implementation of code-patching under STRICT_KERNEL_RWX instead of
two versions for SMP and non-SMP.

> > 
> > Use the KUAP functions on non-BOOKS3_64 platforms since the temporary
> > mapping for patching uses a userspace address (to keep the mapping
> > local). On BOOKS3_64 platforms hash does not implement KUAP and on radix
> > the use of PAGE_KERNEL sets EAA[0] for the PTE which means the AMR
> > (KUAP) protection is ignored (see PowerISA v3.0b, Fig. 35).
> > 
> > Based on x86 implementation:
> > 
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/lib/code-patching.c | 152 +++
> >   1 file changed, 54 insertions(+), 98 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 8ae1a9e5fe6e..80fe3864f377 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -19,6 +19,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   
> >   static int __patch_instruction(struct ppc_inst *exec_addr, struct 
> > ppc_inst instr,
> >struct ppc_inst *patch_addr)
> > @@ -77,106 +78,57 @@ void __init poking_init(void)
> > pte_unmap_unlock(ptep, ptl);
> >   }
> >   
> > -static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > -
> > -static int text_area_cpu_up(unsigned int cpu)
> > -{
> > -   struct vm_struct *area;
> > -
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > -
> > -   return 0;
> > -}
> > -
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > -}
> > -
> > -/*
> > - * Run as a late init call. This allows all the boot time patching to be 
> > done
> > - * simply by patching the code, and then we're called here prior to
> > - * mark_rodata_ro(), which happens after all init calls are run. Although
> > - * BUG_ON() is rude, in this case it should only happen if ENOMEM, and we 
> > judge
> > - * it as being preferable to a kernel that will crash later when someone 
> > tries
> > - * to use patch_instruction().
> > - */
> > -static int __init setup_text_poke_area(void)
> > -{
> > -   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
> > -   "powerpc/text_poke:online", text_area_cpu_up,
> > -   text_area_cpu_down));
> > -
> > -   return 0;
> > -}
> > -late_initcall(setup_text_poke_area);
> > +struct patch_mapping {
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   pte_t *ptep;
> > +   struct temp_mm temp_mm;
> > +};
> >   
> >   /*
> >* This can be called for kernel text or a module.
> >*/
> > -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> > +static int map_patch(const void *addr, struct patch_mapping *patch_mapping)
> >   {
> > -   unsigned long pfn;
> > -   int err;
> > +   struct page *page;
> > +   pte_t pte;
> > +   pgprot_t pgprot;
> >   
> > if (is_vmalloc_addr(addr))
> > -   pfn = vmalloc_to_pfn(addr);
> > +   page = vmalloc_to_page(addr);
> > else
> > -   pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> > +   page = virt_to_page(addr);
> >   
> > -   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
> > +   if (radix_enabled())
> > +   pgprot = PAGE_KERNEL;
> > +   else
> > +   pgprot = PAGE_SHARED;
> >   
> > -   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
> > -   if (err)
> > +   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +&patch_mapping->ptl);
> > +   if (unlikely(!patch_mapping->ptep)) {
> > +   pr_warn("map patch: failed to allocate pte for patching\n");
> > return -1;
> > +   }
> > +
> > +   pte = mk_pte(page, pgprot);
> > +   pte = pte_mkdirty(pte);
> > +   set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
> > +
> > +   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > +   use_temporary_mm(&patch_mapping->temp_mm);
> >   
> 

Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 07:31:03PM +0300, Jarkko Sakkinen wrote:
> On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> > to help with text_alloc() usage in generic code, but I think
> > fundamentally, there's only these two options.
> 
> There is one arch (nios2), which uses a regular kzalloc(). This means
> that both text_alloc() and text_memfree() need to be weak symbols and
> nios2 needs its own overriding text.c to do its thing.

IMO nios2 is just broken.


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Russell King - ARM Linux admin
On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> > So perhaps the answer is to have text_alloc() not with a 'where'
> > argument but with a 'why' argument. Or more simply, just have separate
> > alloc/free APIs for each case, with generic versions that can be
> > overridden by the architecture.
> 
> Well, there only seem to be 2 cases here, either the pointer needs to
> fit in some immediate displacement, or not.
> 
> On x86 we seem have the advantage of a fairly large immediate
> displacement as compared to many other architectures (due to our
> variable sized instructions). And thus have been fairly liberal with our
> usage of it (also our indirect jmps/calls suck, double so with
> RETCH-POLINE).
> 
> Still, the indirect jump, as mentioned by Russell, should work for
> arbitrarily placed code for us too.
> 
> 
> So I'm thinking that something like:
> 
> enum ptr_type {
>   immediate_displacement,
>   absolute,
> };
> 
> void *text_alloc(unsigned long size, enum ptr_type type)
> {
>   unsigned long vstart = VMALLOC_START;
>   unsigned long vend   = VMALLOC_END;
> 
>   if (type == immediate_displacement) {
>   vstart = MODULES_VADDR;
>   vend   = MODULES_END;
>   }
> 
>   return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
>   GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
>   NUMA_NO_NODE, _RET_IP_);
> }
> 
> void text_free(void *ptr)
> {
>   vfree(ptr);
> }

Beware, however, that on 32-bit ARM, if module PLTs are enabled,
we will try to place the module in the module region (which gives
best performance) but if that allocation fails, we will fall back
to placing it in the vmalloc region and using PLTs.

So, for a module allocation, we would need to make up to two calls
to text_alloc(), once with "immediate_displacement", and if that
fails and we have PLT support enabled, again with "absolute".

Hence, as other people have said, why module_alloc() would need to
stay separate.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 03:56:52PM +0200, Jessica Yu wrote:
> +++ Jarkko Sakkinen [14/07/20 12:45 +0300]:
> > Rename module_alloc() to text_alloc() and module_memfree() to
> > text_memfree(), and move them to kernel/text.c, which is unconditionally
> > compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> > allocate space for executable code without requiring to compile the modules
> > support (CONFIG_MODULES=y) in.
> > 
> > Cc: Andi Kleen 
> > Suggested-by: Peter Zijlstra 
> > Signed-off-by: Jarkko Sakkinen 
> 
> As Ard and Will have already explained, the major issue I'm having
> with this is that we're taking module_alloc(), an allocator that was
> originally specific to module loading, and turning it into a generic
> interface to be used by other subsystems. You're pulling in all the
> module loading semantics that vary by architecture and expecting it to
> work as a generic text allocator. I'm not against the existence of a
> generic text_alloc() but I would very much rather that module_alloc()
> be left alone to the module loader and instead work on introducing a
> *separate* generic text_alloc() interface that would work for its
> intended users (kprobes, bpf, etc) and have existing users of
> module_alloc() switch to that instead.
> 
> Jessica

This is the kind of patch set where you hardly have any chance other
than getting it wrong the first time, so I just did something that
flies in my environment. At the same time I believe that decoupling
tracing from module loading is a generally accepted idea.

I'm refining the next version with CONFIG_HAS_TEXT_ALLOC, which I
explained in more details in my earlier response to this thread.

/Jarkko


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> > So perhaps the answer is to have text_alloc() not with a 'where'
> > argument but with a 'why' argument. Or more simply, just have separate
> > alloc/free APIs for each case, with generic versions that can be
> > overridden by the architecture.
> 
> Well, there only seem to be 2 cases here, either the pointer needs to
> fit in some immediate displacement, or not.
> 
> On x86 we seem have the advantage of a fairly large immediate
> displacement as compared to many other architectures (due to our
> variable sized instructions). And thus have been fairly liberal with our
> usage of it (also our indirect jmps/calls suck, double so with
> RETCH-POLINE).
> 
> Still, the indirect jump, as mentioned by Russell, should work for
> arbitrarily placed code for us too.
> 
> 
> So I'm thinking that something like:
> 
> enum ptr_type {
>   immediate_displacement,
>   absolute,
> };
> 
> void *text_alloc(unsigned long size, enum ptr_type type)
> {
>   unsigned long vstart = VMALLOC_START;
>   unsigned long vend   = VMALLOC_END;
> 
>   if (type == immediate_displacement) {
>   vstart = MODULES_VADDR;
>   vend   = MODULES_END;
>   }
> 
>   return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
>   GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
>   NUMA_NO_NODE, _RET_IP_);
> }
> 
> void text_free(void *ptr)
> {
>   vfree(ptr);
> }
> 
> Should work for all cases. Yes, we might then want something like a per
> arch:
> 
>   {BPF,FTRACE,KPROBE}_TEXT_TYPE
> 
> to help with text_alloc() usage in generic code, but I think
> fundamentally, there's only these two options.

There is one arch (nios2), which uses a regular kzalloc(). This means
that both text_alloc() and text_memfree() need to be weak symbols and
nios2 needs its own overriding text.c to do its thing.

/Jarkko


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Steven Rostedt
On Tue, 14 Jul 2020 15:56:52 +0200
Jessica Yu  wrote:

> As Ard and Will have already explained, the major issue I'm having
> with this is that we're taking module_alloc(), an allocator that was
> originally specific to module loading, and turning it into a generic
> interface to be used by other subsystems. You're pulling in all the
> module loading semantics that vary by architecture and expecting it to
> work as a generic text allocator. I'm not against the existence of a
> generic text_alloc() but I would very much rather that module_alloc()
> be left alone to the module loader and instead work on introducing a
> *separate* generic text_alloc() interface that would work for its
> intended users (kprobes, bpf, etc) and have existing users of
> module_alloc() switch to that instead.

Looks like the consensus is to create a separate text_alloc() that can
be used when modules are not configured in for things like ftrace
dynamic trampolines, and keep module_alloc() untouched.

For those concerned about added unused code for architectures that
don't need text_alloc(), we can always create a config called:
CONFIG_ARCH_NEED_TEXT_ALLOC, and the arch can add that to its list of
kconfigs if it intends to use text_alloc().

-- Steve


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Steven Rostedt
On Tue, 14 Jul 2020 14:33:14 +0100
Mark Rutland  wrote:

> > Should work for all cases. Yes, we might then want something like a per
> > arch:
> > 
> > {BPF,FTRACE,KPROBE}_TEXT_TYPE  
> 
> ... at that point why not:
> 
>   text_alloc_ftrace();
>   text_alloc_module();
>   text_alloc_bpf();
>   text_alloc_kprobe();

I don't know about bpf and kprobes, but for ftrace, the only place that
it allocates text happens to be in arch specific code.

If you want something special for ftrace, you could just add your own
function. But for x86, a text_alloc_immediate() would work.
(BTW, I like the function names over the enums)

-- Steve


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jessica Yu

+++ Jarkko Sakkinen [14/07/20 12:45 +0300]:

Rename module_alloc() to text_alloc() and module_memfree() to
text_memfree(), and move them to kernel/text.c, which is unconditionally
compiled to the kernel proper. This allows kprobes, ftrace and bpf to
allocate space for executable code without requiring to compile the modules
support (CONFIG_MODULES=y) in.

Cc: Andi Kleen 
Suggested-by: Peter Zijlstra 
Signed-off-by: Jarkko Sakkinen 


As Ard and Will have already explained, the major issue I'm having
with this is that we're taking module_alloc(), an allocator that was
originally specific to module loading, and turning it into a generic
interface to be used by other subsystems. You're pulling in all the
module loading semantics that vary by architecture and expecting it to
work as a generic text allocator. I'm not against the existence of a
generic text_alloc() but I would very much rather that module_alloc()
be left alone to the module loader and instead work on introducing a
*separate* generic text_alloc() interface that would work for its
intended users (kprobes, bpf, etc) and have existing users of
module_alloc() switch to that instead.

Jessica


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Ard Biesheuvel
On Tue, 14 Jul 2020 at 16:33, Mark Rutland  wrote:
>
> On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> > > So perhaps the answer is to have text_alloc() not with a 'where'
> > > argument but with a 'why' argument. Or more simply, just have separate
> > > alloc/free APIs for each case, with generic versions that can be
> > > overridden by the architecture.
> >
> > Well, there only seem to be 2 cases here, either the pointer needs to
> > fit in some immediate displacement, or not.
>
> On some arches you have a few choices for immediates depending on
> compiler options, e.g. on arm64:
>
> * +/- 128M with B
> * +/-4G with ADRP+ADD+BR
> * +/- 48/64 bits with a series of MOVK* + BR
>
> ... and you might build core kernel one way and modules another, and
> either could depend on configuration.
>
> > On x86 we seem have the advantage of a fairly large immediate
> > displacement as compared to many other architectures (due to our
> > variable sized instructions). And thus have been fairly liberal with our
> > usage of it (also our indirect jmps/calls suck, double so with
> > RETCH-POLINE).
> >
> > Still, the indirect jump, as mentioned by Russell, should work for
> > arbitrarily placed code for us too.
> >
> >
> > So I'm thinking that something like:
> >
> > enum ptr_type {
> >   immediate_displacement,
> >   absolute,
> > };
> >
> > void *text_alloc(unsigned long size, enum ptr_type type)
> > {
> >   unsigned long vstart = VMALLOC_START;
> >   unsigned long vend   = VMALLOC_END;
> >
> >   if (type == immediate_displacement) {
> >   vstart = MODULES_VADDR;
> >   vend   = MODULES_END;
> >   }
> >
> >   return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
> >   GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
> >   NUMA_NO_NODE, _RET_IP_);
> > }
> >
> > void text_free(void *ptr)
> > {
> >   vfree(ptr);
> > }
>
> I think it'd be easier to read with separate functions, e.g.
>
>   text_alloc_imm_offset(unsigned long size);
>   text_alloc_absolute(unsigned long size);
>

On arm64, we have a 128M window close to the core kernel for modules,
and a separate 128M window for bpf programs, which are kept within
relative branching range of each other but could be far away from
kernel+modules, so having 'close' and 'far' as the only distinction
is insufficient.
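
As an illustration of that kind of dedicated window, a sketch of what
an allocator like arm64's bpf_jit_alloc_exec() can look like, assuming
region macros along the lines of BPF_JIT_REGION_START/END:

void *bpf_jit_alloc_exec(unsigned long size)
{
	/* allocate from the dedicated BPF window, not module space */
	return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
				    BPF_JIT_REGION_END, GFP_KERNEL,
				    PAGE_KERNEL, 0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}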

> > Should work for all cases. Yes, we might then want something like a per
> > arch:
> >
> >   {BPF,FTRACE,KPROBE}_TEXT_TYPE
>
> ... at that point why not:
>
>   text_alloc_ftrace();
>   text_alloc_module();
>   text_alloc_bpf();
>   text_alloc_kprobe();
>
> ... etc which an arch can alias however it wants? e.g. x86 can have
> those all go to a common text_alloc_generic(), and that could even be a
> generic option for arches that don't care to distinguish these cases.
>

That is basically what i meant with separate alloc/free APIs, which i
think is the sanest approach here.

> Then if there are new places that want to allocate text we have to
> consider their requirements when adding them, too.
>
> Thanks,
> Mark.


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Mark Rutland
On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> > So perhaps the answer is to have text_alloc() not with a 'where'
> > argument but with a 'why' argument. Or more simply, just have separate
> > alloc/free APIs for each case, with generic versions that can be
> > overridden by the architecture.
> 
> Well, there only seem to be 2 cases here, either the pointer needs to
> fit in some immediate displacement, or not.

On some arches you have a few choices for immediates depending on
compiler options, e.g. on arm64:

* +/- 128M with B
* +/-4G with ADRP+ADD+BR
* +/- 48/64 bits with a series of MOVK* + BR

... and you might build core kernel one way and modules another, and
either could depend on configuration.

> On x86 we seem have the advantage of a fairly large immediate
> displacement as compared to many other architectures (due to our
> variable sized instructions). And thus have been fairly liberal with our
> usage of it (also our indirect jmps/calls suck, double so with
> RETCH-POLINE).
> 
> Still, the indirect jump, as mentioned by Russell, should work for
> arbitrarily placed code for us too.
> 
> 
> So I'm thinking that something like:
> 
> enum ptr_type {
>   immediate_displacement,
>   absolute,
> };
> 
> void *text_alloc(unsigned long size, enum ptr_type type)
> {
>   unsigned long vstart = VMALLOC_START;
>   unsigned long vend   = VMALLOC_END;
> 
>   if (type == immediate_displacement) {
>   vstart = MODULES_VADDR;
>   vend   = MODULES_END;
>   }
> 
>   return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
>   GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
>   NUMA_NO_NODE, _RET_IP_);
> }
> 
> void text_free(void *ptr)
> {
>   vfree(ptr);
> }

I think it'd be easier to read with separate functions, e.g.

  text_alloc_imm_offset(unsigned long size);
  text_alloc_absolute(unsigned long size);

> Should work for all cases. Yes, we might then want something like a per
> arch:
> 
>   {BPF,FTRACE,KPROBE}_TEXT_TYPE

... at that point why not:

  text_alloc_ftrace();
  text_alloc_module();
  text_alloc_bpf();
  text_alloc_kprobe();

... etc which an arch can alias however it wants? e.g. x86 can have
those all go to a common text_alloc_generic(), and that could even be a
generic option for arches that don't care to distinguish these cases.
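
For such an arch, that could be as simple as (sketch, generic vmalloc
parameters assumed):

static void *text_alloc_generic(unsigned long size)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}

void *text_alloc_ftrace(unsigned long size) { return text_alloc_generic(size); }
void *text_alloc_kprobe(unsigned long size) { return text_alloc_generic(size); }
void *text_alloc_bpf(unsigned long size)    { return text_alloc_generic(size); }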

Then if there are new places that want to allocate text we have to
consider their requirements when adding them, too.

Thanks,
Mark.


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> So perhaps the answer is to have text_alloc() not with a 'where'
> argument but with a 'why' argument. Or more simply, just have separate
> alloc/free APIs for each case, with generic versions that can be
> overridden by the architecture.

Well, there only seem to be 2 cases here, either the pointer needs to
fit in some immediate displacement, or not.

On x86 we seem to have the advantage of a fairly large immediate
displacement as compared to many other architectures (due to our
variable sized instructions). And thus have been fairly liberal with our
usage of it (also our indirect jmps/calls suck, double so with
RETCH-POLINE).

Still, the indirect jump, as mentioned by Russell, should work for
arbitrarily placed code for us too.


So I'm thinking that something like:

enum ptr_type {
	immediate_displacement,
	absolute,
};

void *text_alloc(unsigned long size, enum ptr_type type)
{
	unsigned long vstart = VMALLOC_START;
	unsigned long vend   = VMALLOC_END;

	if (type == immediate_displacement) {
		vstart = MODULES_VADDR;
		vend   = MODULES_END;
	}

	return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
				    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE, _RET_IP_);
}

void text_free(void *ptr)
{
	vfree(ptr);
}

Should work for all cases. Yes, we might then want something like a per
arch:

{BPF,FTRACE,KPROBE}_TEXT_TYPE

to help with text_alloc() usage in generic code, but I think
fundamentally, there's only these two options.
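
As a sketch of how generic code might then use such a per-arch type,
with kprobes' alloc_insn_page() as the example caller:

#ifndef KPROBE_TEXT_TYPE
#define KPROBE_TEXT_TYPE	absolute	/* arch may override */
#endif

void *alloc_insn_page(void)
{
	return text_alloc(PAGE_SIZE, KPROBE_TEXT_TYPE);
}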


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Bjorn Helgaas
[trimmed the cc list; it's still too large but maybe arch folks care]

On Mon, Jul 13, 2020 at 05:08:10PM +0200, Arnd Bergmann wrote:
> On Mon, Jul 13, 2020 at 3:22 PM Saheed O. Bolarinwa
>  wrote:
> > This goal of these series is to move the definition of *all*
> > PCIBIOS* from include/linux/pci.h to arch/x86 and limit their use
> > within there.  All other tree specific definition will be left for
> > intact. Maybe they can be renamed.
> >
> > PCIBIOS* is an x86 concept as defined by the PCI spec. The
> > returned error codes of PCIBIOS* are positive values and this
> > introduces some complexities which other archs need not incur.
> 
> I think the intention is good, but I find the series in its current
> form very hard to review, in particular the way you touch some
> functions three times with trivial changes. Instead of
> 
> 1) replace PCIBIOS_SUCCESSFUL with 0
> 2) drop pointless 0-comparison
> 3) reformat whitespace
> 
> I would suggest to combine the first two steps into one patch per
> subsystem and drop the third step.

I agree.  BUT please don't just run out and post new patches to do
this.  Let's talk about Arnd's further ideas below first.

> ...
> Maybe the work can be split up differently, with a similar end
> result but fewer and easier reviewed patches. The way I'd look at
> the problem, there are three main areas that can be dealt with one
> at a time:
> 
> a) callers of the high-level config space accessors
>pci_{write,read}_config_{byte,word,dword}, mostly in device
>drivers.
> b) low-level implementation of the config space accessors
> through struct pci_ops
> c) all other occurrences of these constants
> 
> Starting with a), my first question is whether any high-level
> drivers even need to care about errors from these functions. I see
> 4913 callers that ignore the return code, and 576 that actually
> check it, and almost none care about the specific error (as you
> found as well). Unless we conclude that most PCI drivers are wrong,
> could we just change the return type to 'void' and assume they never
> fail for valid arguments on a valid pci_device* ?

I really like this idea.

pci_write_config_*() has one return value, and only 100ish of 2500
callers check for errors.  It's sometimes possible for config
accessors to detect PCI errors and return failure, e.g., device was
removed or didn't respond, but most of them don't, and detecting these
errors is not really that valuable.

pci_read_config_*() is much more interesting because it returns two
things, the function return value and the value read from the PCI
device, and it's complicated to check both. 

Again it's sometimes possible for config read accessors to detect PCI
errors, but in most cases a PCI error means the accessor returns
success and the value from PCI is ~0.

Checking the function return value catches programming errors (bad
alignment, etc) but misses most of the interesting errors (device was
unplugged or reported a PCI error).

Checking the value returned from PCI is tricky because ~0 is a valid
value for some config registers, and only the driver knows for sure.
If the driver knows that ~0 is a possible value, it would have to do
something else, e.g., another config read of a register that *cannot*
be ~0, to see whether it's really an error.

I suspect that if we had a single value to look at it would be easier
to get right.  Error checking with current interface would look like
this:

  err = pci_read_config_word(dev, addr, &val);
  if (err)
return -EINVAL;

  if (PCI_POSSIBLE_ERROR(val)) {
/* if driver knows ~0 is invalid */
return -EINVAL;

/* if ~0 is potentially a valid value */
err = pci_read_config_word(dev, PCI_VENDOR_ID, &val2);
if (err)
  return -EINVAL;

if (PCI_POSSIBLE_ERROR(val2))
  return -EINVAL;
  }

Error checking with a possible interface that returned only a single
value could look like this:

  val = pci_config_read_word(dev, addr);
  if (PCI_POSSIBLE_ERROR(val)) {
/* if driver knows ~0 is invalid */
return -EINVAL;

/* if ~0 is potentially a valid value */
val2 = pci_config_read_word(dev, PCI_VENDOR_ID);
if (PCI_POSSIBLE_ERROR(val2))
  return -EINVAL;
  }

Am I understanding you correctly?
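
(For completeness, the PCI_POSSIBLE_ERROR() helper assumed in the
sketches above could be as trivial as a check against all-ones:

  #define PCI_POSSIBLE_ERROR(val)	((val) == ((typeof(val))~0))

with the caveat, as noted, that only the driver knows whether ~0 is
also a legitimate register value.)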

> For b), it might be nice to also change other aspects of the
> interface, e.g. passing a pci_host_bridge pointer plus bus number
> instead of a pci_bus pointer, or having the callback in the
> pci_host_bridge structure.

I like this idea a lot, too.  I think the fact that
pci_bus_read_config_word() requires a pci_bus * complicates things in
a few places.

I think it's completely separate, as you say, and we should defer it
for now because even part a) is a lot of work.  I added it to my list
of possible future projects.

Bjorn


Re: [PATCH v2] powerpc/pseries: detect secure and trusted boot state of the system.

2020-07-14 Thread Mimi Zohar
On Tue, 2020-07-14 at 16:38 +1000, Daniel Axtens wrote:
> Hi Nayna,
> 
> Thanks! Would you be able to fold in some of the information from my
> reply to v1 into the changelog? Until we have public PAPR release with
> it, that information is the extent of the public documentation. It would
> be good to get it into the git log rather than just floating around in
> the mail archives!
> 
> A couple of small nits:
> 
> > +   if (enabled)
> > +   goto out;
> > +
> > +   if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
> > +   if (secureboot)
> > +   enabled = (secureboot > 1) ? true : false;
> 
> Your tests double up here - you don't need both the 'if' statement and
> the 'secureboot > 1' ternary operator.
> 
> Just
> 
> + if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
> + enabled = (secureboot > 1) ? true : false;
> 
> or even
> 
> + if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
> + enabled = (secureboot > 1);
> 
> would work.

I haven't been following this thread, which might be the reason I'm
missing something here.  The patch description should explain why the
test is for "(secureboot > 1)", rather than a fixed number.

thanks,

Mimi


[PATCH 07/13] cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues

2020-07-14 Thread Lee Jones
Repair problems with formatting and missing attributes/parameters, and
demote header comments which do not meet the required standards
applicable to kerneldoc.

Fixes the following W=1 kernel build warning(s):

 drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
'last_lpstate_idx' not described in 'global_pstate_info'
 drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
'last_gpstate_idx' not described in 'global_pstate_info'
 drivers/cpufreq/powernv-cpufreq.c:84: warning: Function parameter or member 
'policy' not described in 'global_pstate_info'
 drivers/cpufreq/powernv-cpufreq.c:182: warning: Function parameter or member 
'i' not described in 'idx_to_pstate'
 drivers/cpufreq/powernv-cpufreq.c:201: warning: Function parameter or member 
'pstate' not described in 'pstate_to_idx'
 drivers/cpufreq/powernv-cpufreq.c:670: warning: Function parameter or member 
't' not described in 'gpstate_timer_handler'
 drivers/cpufreq/powernv-cpufreq.c:670: warning: Excess function parameter 
'data' description in 'gpstate_timer_handler'

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/cpufreq/powernv-cpufreq.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 068cc53abe320..2e5a8b8a4abaa 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -64,13 +64,14 @@
  * highest_lpstate_idx
  * @last_sampled_time: Time from boot in ms when global pstates were
  * last set
- * @last_lpstate_idx,  Last set value of local pstate and global
- * last_gpstate_idxpstate in terms of cpufreq table index
+ * @last_lpstate_idx:  Last set value of local pstate and global
+ * @last_gpstate_idx:  pstate in terms of cpufreq table index
  * @timer: Is used for ramping down if cpu goes idle for
  * a long time with global pstate held high
  * @gpstate_lock:  A spinlock to maintain synchronization between
  * routines called by the timer handler and
  * governer's target_index calls
+ * @policy:Associated CPUFreq policy
  */
 struct global_pstate_info {
int highest_lpstate_idx;
@@ -170,7 +171,7 @@ static inline u8 extract_pstate(u64 pmsr_val, unsigned int 
shift)
 
 /* Use following functions for conversions between pstate_id and index */
 
-/**
+/*
  * idx_to_pstate : Returns the pstate id corresponding to the
  *frequency in the cpufreq frequency table
  *powernv_freqs indexed by @i.
@@ -188,7 +189,7 @@ static inline u8 idx_to_pstate(unsigned int i)
return powernv_freqs[i].driver_data;
 }
 
-/**
+/*
  * pstate_to_idx : Returns the index in the cpufreq frequencytable
  *powernv_freqs for the frequency whose corresponding
  *pstate id is @pstate.
@@ -660,7 +661,7 @@ static inline void  queue_gpstate_timer(struct 
global_pstate_info *gpstates)
 /**
  * gpstate_timer_handler
  *
- * @data: pointer to cpufreq_policy on which timer was queued
+ * @t: Timer context used to fetch global pstate info struct
  *
  * This handler brings down the global pstate closer to the local pstate
  * according quadratic equation. Queues a new timer if it is still not equal
-- 
2.25.1



[PATCH 05/13] cpufreq/arch: powerpc: pasemi: Move prototypes to shared header

2020-07-14 Thread Lee Jones
If function callers and providers do not share the same prototypes the
compiler complains of missing prototypes.  Fix this by moving the
already existing prototypes out to a mutually convenient location.

Fixes the following W=1 kernel build warning(s):

 drivers/cpufreq/pasemi-cpufreq.c:109:5: warning: no previous prototype for 
‘check_astate’ [-Wmissing-prototypes]
 109 | int check_astate(void)
 | ^~~~
 drivers/cpufreq/pasemi-cpufreq.c:114:6: warning: no previous prototype for 
‘restore_astate’ [-Wmissing-prototypes]
 114 | void restore_astate(int cpu)
 | ^~

Cc: Olof Johansson 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 arch/powerpc/platforms/pasemi/pasemi.h| 15 
 arch/powerpc/platforms/pasemi/powersave.S |  2 ++
 drivers/cpufreq/pasemi-cpufreq.c  |  1 +
 include/linux/platform_data/pasemi.h  | 28 +++
 4 files changed, 31 insertions(+), 15 deletions(-)
 create mode 100644 include/linux/platform_data/pasemi.h

diff --git a/arch/powerpc/platforms/pasemi/pasemi.h 
b/arch/powerpc/platforms/pasemi/pasemi.h
index 70b56048ed1be..528d81ef748ad 100644
--- a/arch/powerpc/platforms/pasemi/pasemi.h
+++ b/arch/powerpc/platforms/pasemi/pasemi.h
@@ -15,21 +15,6 @@ extern void __init pasemi_map_registers(void);
 extern void idle_spin(void);
 extern void idle_doze(void);
 
-/* Restore astate to last set */
-#ifdef CONFIG_PPC_PASEMI_CPUFREQ
-extern int check_astate(void);
-extern void restore_astate(int cpu);
-#else
-static inline int check_astate(void)
-{
-   /* Always return >0 so we never power save */
-   return 1;
-}
-static inline void restore_astate(int cpu)
-{
-}
-#endif
-
 extern struct pci_controller_ops pasemi_pci_controller_ops;
 
 #endif /* _PASEMI_PASEMI_H */
diff --git a/arch/powerpc/platforms/pasemi/powersave.S 
b/arch/powerpc/platforms/pasemi/powersave.S
index d0215d5329ca7..7747b48963286 100644
--- a/arch/powerpc/platforms/pasemi/powersave.S
+++ b/arch/powerpc/platforms/pasemi/powersave.S
@@ -5,6 +5,8 @@
  * Maintained by: Olof Johansson 
  */
 
+#include 
+
 #include 
 #include 
 #include 
diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index c66f566a854cb..c6bb3ecc90ef3 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/include/linux/platform_data/pasemi.h 
b/include/linux/platform_data/pasemi.h
new file mode 100644
index 0..3fed0687fcc9a
--- /dev/null
+++ b/include/linux/platform_data/pasemi.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020 Linaro Ltd.
+ *
+ * Author: Lee Jones 
+ */
+
+#ifndef _LINUX_PLATFORM_DATA_PASEMI_H
+#define _LINUX_PLATFORM_DATA_PASEMI_H
+
+/* Restore astate to last set */
+#ifdef CONFIG_PPC_PASEMI_CPUFREQ
+int check_astate(void);
+void restore_astate(int cpu);
+#else
+static inline int check_astate(void)
+{
+   /* Always return >0 so we never power save */
+   return 1;
+}
+static inline void restore_astate(int cpu)
+{
+}
+#endif
+
+#endif /* _LINUX_PLATFORM_DATA_PASEMI_H */
+
+
-- 
2.25.1



[PATCH 06/13] cpufreq: powernv-cpufreq: Functions only used in call-backs should be static

2020-07-14 Thread Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/cpufreq/powernv-cpufreq.c:669:6: warning: no previous prototype for 
‘gpstate_timer_handler’ [-Wmissing-prototypes]
 drivers/cpufreq/powernv-cpufreq.c:902:6: warning: no previous prototype for 
‘powernv_cpufreq_work_fn’ [-Wmissing-prototypes]

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/cpufreq/powernv-cpufreq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 8646eb197cd96..068cc53abe320 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -666,7 +666,7 @@ static inline void  queue_gpstate_timer(struct 
global_pstate_info *gpstates)
  * according quadratic equation. Queues a new timer if it is still not equal
  * to local pstate
  */
-void gpstate_timer_handler(struct timer_list *t)
+static void gpstate_timer_handler(struct timer_list *t)
 {
struct global_pstate_info *gpstates = from_timer(gpstates, t, timer);
struct cpufreq_policy *policy = gpstates->policy;
@@ -899,7 +899,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
 };
 
-void powernv_cpufreq_work_fn(struct work_struct *work)
+static void powernv_cpufreq_work_fn(struct work_struct *work)
 {
struct chip *chip = container_of(work, struct chip, throttle);
struct cpufreq_policy *policy;
-- 
2.25.1



[PATCH -next] cpuidle/pseries: Make symbol 'pseries_idle_driver' static

2020-07-14 Thread Wei Yongjun
The sparse tool complains as follows:

drivers/cpuidle/cpuidle-pseries.c:25:23: warning:
 symbol 'pseries_idle_driver' was not declared. Should it be static?
 
 'pseries_idle_driver' is not used outside of this file, so mark
it static.

Reported-by: Hulk Robot 
Signed-off-by: Wei Yongjun 
---
 drivers/cpuidle/cpuidle-pseries.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 6513ef2af66a..3e058ad2bb51 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -22,7 +22,7 @@
 #include 
 #include 
 
-struct cpuidle_driver pseries_idle_driver = {
+static struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
.owner= THIS_MODULE,
 };



[PATCH -next] cpufreq: powernv: Make some symbols static

2020-07-14 Thread Wei Yongjun
The sparse tool complains as follows:

drivers/cpufreq/powernv-cpufreq.c:88:1: warning:
 symbol 'pstate_revmap' was not declared. Should it be static?
drivers/cpufreq/powernv-cpufreq.c:383:18: warning:
 symbol 'cpufreq_freq_attr_cpuinfo_nominal_freq' was not declared. Should it be 
static?
drivers/cpufreq/powernv-cpufreq.c:669:6: warning:
 symbol 'gpstate_timer_handler' was not declared. Should it be static?
drivers/cpufreq/powernv-cpufreq.c:902:6: warning:
 symbol 'powernv_cpufreq_work_fn' was not declared. Should it be static?

Those symbols are not used outside of this file, so mark
them static.

Reported-by: Hulk Robot 
Signed-off-by: Wei Yongjun 
---
 drivers/cpufreq/powernv-cpufreq.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 8646eb197cd9..cf118263ec65 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -85,7 +85,7 @@ struct global_pstate_info {
 
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
 
-DEFINE_HASHTABLE(pstate_revmap, POWERNV_MAX_PSTATES_ORDER);
+static DEFINE_HASHTABLE(pstate_revmap, POWERNV_MAX_PSTATES_ORDER);
 /**
  * struct pstate_idx_revmap_data: Entry in the hashmap pstate_revmap
  *   indexed by a function of pstate id.
@@ -380,7 +380,7 @@ static ssize_t cpuinfo_nominal_freq_show(struct 
cpufreq_policy *policy,
powernv_freqs[powernv_pstate_info.nominal].frequency);
 }
 
-struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq =
+static struct freq_attr cpufreq_freq_attr_cpuinfo_nominal_freq =
__ATTR_RO(cpuinfo_nominal_freq);
 
 #define SCALING_BOOST_FREQS_ATTR_INDEX 2
@@ -666,7 +666,7 @@ static inline void  queue_gpstate_timer(struct 
global_pstate_info *gpstates)
  * according quadratic equation. Queues a new timer if it is still not equal
  * to local pstate
  */
-void gpstate_timer_handler(struct timer_list *t)
+static void gpstate_timer_handler(struct timer_list *t)
 {
struct global_pstate_info *gpstates = from_timer(gpstates, t, timer);
struct cpufreq_policy *policy = gpstates->policy;
@@ -899,7 +899,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
 };
 
-void powernv_cpufreq_work_fn(struct work_struct *work)
+static void powernv_cpufreq_work_fn(struct work_struct *work)
 {
struct chip *chip = container_of(work, struct chip, throttle);
struct cpufreq_policy *policy;



Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote:
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
> line :)

I've seen patches for a 'sparse' bitmap to solve related problems.

It's basically the same code, except it multiplies everything (size,
bit-nr) by a constant to reduce the number of active bits per line.

This sadly doesn't take topology into account, but reducing contention
is still good of course.
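
A minimal sketch of that idea (hypothetical names; a real version
would also scale the allocation size by the same constant):

/* spread active bits so fewer of them share a cache line */
#define BIT_SPREAD	8	/* assumed spacing constant */

static inline void sparse_cpumask_set_cpu(int cpu, unsigned long *mask)
{
	set_bit(cpu * BIT_SPREAD, mask);	/* mask is BIT_SPREAD times larger */
}

static inline bool sparse_cpumask_test_cpu(int cpu, const unsigned long *mask)
{
	return test_bit(cpu * BIT_SPREAD, mask);
}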


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Andy Lutomirski



> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>> 
 On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
 
 Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>> 
>> On big systems, the mm refcount can become highly contented when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
>> 
> 
> I read the patch a couple of times, and I have a suggestion that could
> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
> refcount.  You're saying "hey, this mm has no more references, but it
> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
> those references too."  I'm wondering whether you actually need the
> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
> for real?  Roughly, in __mmdrop(), you would only free the page tables
> if mm_cpumask() is empty.  And, in the code that removes a CPU from
> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
> you just removed the last bit from mm_cpumask and potentially free the
> mm.
> 
> Getting the locking right here could be a bit tricky -- you need to
> avoid two CPUs simultaneously exiting lazy TLB and thinking they
> should free the mm, and you also need to avoid an mm with mm_users
> hitting zero concurrently with the last remote CPU using it lazily
> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
> mm" and mm_count == 0 meaning "now it's dead" and using some careful
> cmpxchg or dec_return to make sure that only one CPU frees it.
> 
> Or maybe you'd need a lock or RCU for this, but the idea would be to
> only ever take the lock after mm_users goes to zero.
 
 I don't think it's nonsense, it could be a good way to avoid IPIs.
 
 I haven't seen much problem here that made me too concerned about IPIs 
 yet, so I think the simple patch may be good enough to start with
 for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
 unlazying with the exit TLB flush without doing anything fancy with
 ref counting, but we'll see.
>>> 
>>> I would be cautious with benchmarking here. I would expect that the
>>> nasty cases may affect power consumption more than performance — the 
>>> specific issue is IPIs hitting idle cores, and the main effects are to 
>>> slow down exit() a bit but also to kick the idle core out of idle. 
>>> Although, if the idle core is in a deep sleep, that IPI could be 
>>> *very* slow.
>> 
>> It will tend to be self-limiting to some degree (deeper idle cores
>> would tend to have less chance of IPI) but we have bigger issues on
>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>> management. Power hasn't really shown up as an issue but powerpc
>> CPUs may have their own requirements and issues there, shall we say.
>> 
>>> So I think it’s worth at least giving this a try.
>> 
>> To be clear it's not a complete solution itself. The problem is of 
>> course that mm cpumask gives you false negatives, so the bits
>> won't always clean up after themselves as CPUs switch away from their
>> lazy tlb mms.
> 
> ^^
> 
> False positives: CPU is in the mm_cpumask, but is not using the mm
> as a lazy tlb. So there can be bits left and never freed.
> 
> If you closed the false positives, you're back to a shared mm cache
> line on lazy mm context switches.

x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
line :)

Can you share your benchmark?

> 
> Thanks,
> Nick


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 01:29:27PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2020 at 11:28:27AM +0100, Will Deacon wrote:
> 
> > As Ard says, module_alloc() _is_ special, in the sense that the virtual
> > memory it allocates wants to be close to the kernel text, whereas the
> > concept of allocating executable memory is broader and doesn't have these
> > restrictions. So, while I'm in favour of having a text_alloc() interface
> > that can be used by callers which only require an executable mapping, I'd
> > much prefer for the module_alloc() code to remain for, err, modules.
> 
> So on x86 all those things (kprobes, bpf, ftrace) require that same
> closeness.
> 
> An interface like the late vmalloc_exec() will simply not work for us.
> 
> We recently talked about arm64-kprobes and how you're not doing any of
> the optimizations and fully rely on the exception return. And I see
> you're one of the few archs that has bpf_jit_alloc_exec() (also,
> shouldn't you be using PAGE_KERNEL_EXEC there?). But the BPF core seems
> to use module_alloc() as a default means of allocating text.
> 
> 
> So what should this look like? Have a text_alloc() with an argument that
> indicates where? But then I suppose we also need a means to manage PLT
> entries. Otherwise I don't exactly see how you're going to call BPF
> code, or how that BPF stuff is going to call back into its helpers.

To cause less havoc for arch maintainers, what if:

void * __weak module_alloc(unsigned long size)
{
	if (IS_ENABLED(CONFIG_HAS_TEXT_ALLOC))
		return text_alloc(size);

	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}

Then in arch/x86/Kconfig I could put:

config HAS_TEXT_ALLOC
def_bool y

This would scale the patch set down to just adding kernel/text.c and
arch/x86/kernel/text.c, and would allow other arches to migrate
gradually.
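
On the x86 side, arch/x86/kernel/text.c could presumably mirror the
range used by the existing x86 module_alloc() (sketch only, leaving
out the KASAN and load-offset handling):

void *text_alloc(unsigned long size)
{
	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	return __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				    MODULES_END, GFP_KERNEL, PAGE_KERNEL,
				    0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}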

/Jarkko


Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 11:33:33AM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Jul 14, 2020 at 01:17:22PM +0300, Ard Biesheuvel wrote:
> > On Tue, 14 Jul 2020 at 12:53, Jarkko Sakkinen
> >  wrote:
> > >
> > > On Mon, Jul 13, 2020 at 10:49:48PM +0300, Ard Biesheuvel wrote:
> > > > This patch suggests that there are other reasons why conflating
> > > > allocation of module space and allocating  text pages for other uses
> > > > is a bad idea, but switching all users to text_alloc() is a step in
> > > > the wrong direction. It would be better to stop using module_alloc()
> > > > in core code except in the module loader, and have a generic
> > > > text_alloc() that can be overridden by the arch if necessary. Note
> > > > that x86  and s390 are the only architectures that use module_alloc()
> > > > in ftrace code.
> > >
> > > This series essentially does this: introduces text_alloc() and
> > > text_memfree(), which have generic implementations in kernel/text.c.
> > > Those can be overriddent by arch specific implementations.
> > >
> > > What you think should be done differently than in my patch set?
> > >
> > 
> > On arm64, module_alloc is only used by the module loader, and so
> > pulling it out and renaming it will cause unused code to be
> > incorporated into the kernel when building without module support,
> > which is the use case you claim to be addressing.
> > 
> > Module_alloc has semantics that are intimately tied to the module
> > loader, but over the years, it ended up being (ab)used by other
> > subsystems, which don't require those semantics but just need n pages
> > of vmalloc space with executable permissions.
> > 
> > So the correct approach is to make text_alloc() implement just that,
> > generically, and switch bpf etc to use it. Then, only on architectures
> > that need it, override it with an implementation that has the required
> > additional semantics.
> > 
> > Refactoring 10+ architectures like this without any regard for how
> > text_alloc() deviates from module_alloc() just creates a lot of churn
> > that others will have to clean up after you.
> 
> For 32-bit ARM, our bpf code uses "blx/bx" (or equivalent code
> sequences) rather than encoding a "bl" or "b", so BPF there doesn't
> care where the executable memory is mapped, and doesn't need any
> PLTs.  Given that, should bpf always allocate from the vmalloc()
> region to preserve the module space for modules?

Most of the allocators use __vmalloc_node_range() but arch/nios2
uses just plain kmalloc():

/*
 * Modules should NOT be allocated with kmalloc for (obvious) reasons.
 * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
 * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns
 * addresses in 0xc000)
 */
void *module_alloc(unsigned long size)
{
	if (size == 0)
		return NULL;
	return kmalloc(size, GFP_KERNEL);
}

Also consider arch/x86 module_alloc():

void *module_alloc(unsigned long size)
{
	void *p;

	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	p = __vmalloc_node_range(size, MODULE_ALIGN,
				 MODULES_VADDR + get_module_load_offset(),
				 MODULES_END, GFP_KERNEL,
				 PAGE_KERNEL, 0, NUMA_NO_NODE,
				 __builtin_return_address(0));
	if (p && (kasan_module_alloc(p, size) < 0)) {
		vfree(p);
		return NULL;
	}

	return p;
}

The generic version is

void * __weak module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}

There is quite a lot of divergence from the generic version.

However, in other arch's it's mostly just divergence in vmalloc()
parameters and not as radical as in x86.

I could probably limit the total havoc to just nios2 and x86 if there
is a set of vmalloc parameters that work for all arch's. Then there
could be kernel/text.c and re-implementations for x86 and nios2.

I'm all for having separate text_alloc() and text_memfree() if these
issues can be somehow sorted out.
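
The nios2 re-implementation could then presumably just wrap
kmalloc()/kfree(), mirroring what its module_alloc() does today
(sketch; the exact file location is only a guess):

void *text_alloc(unsigned long size)
{
	if (size == 0)
		return NULL;
	return kmalloc(size, GFP_KERNEL);
}

void text_memfree(void *ptr)
{
	kfree(ptr);
}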

/Jarkko


Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 01:17:22PM +0300, Ard Biesheuvel wrote:
> > This series essentially does this: introduces text_alloc() and
> > text_memfree(), which have generic implementations in kernel/text.c.
> > Those can be overriddent by arch specific implementations.
> >
> > What you think should be done differently than in my patch set?
> >
> 
> On arm64, module_alloc is only used by the module loader, and so
> pulling it out and renaming it will cause unused code to be
> incorporated into the kernel when building without module support,
> which is the use case you claim to be addressing.

It certainly does not cause the full module loader to be bundled in,
only the allocator.

> Module_alloc has semantics that are intimately tied to the module
> loader, but over the years, it ended up being (ab)used by other
> subsystems, which don't require those semantics but just need n pages
> of vmalloc space with executable permissions.
> 
> So the correct approach is to make text_alloc() implement just that,
> generically, and switch bpf etc to use it. Then, only on architectures
> that need it, override it with an implementation that has the required
> additional semantics.
> 
> Refactoring 10+ architectures like this without any regard for how
> text_alloc() deviates from module_alloc() just creates a lot of churn
> that others will have to clean up after you.

Using a generic text_alloc() in kernel/kprobes.c would make it behave
differently on arches that reimplement module_alloc(). That's the main
driver for my approach.
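
For reference, the kprobes hook in question is the __weak insn-slot
allocator, which today falls back to the module allocator; the switch is
mechanical (a sketch of the idea, not the exact diff from the series):

/*
 * kernel/kprobes.c currently has, roughly:
 *
 *        void __weak *alloc_insn_page(void)
 *        {
 *                return module_alloc(PAGE_SIZE);
 *        }
 *
 * With the allocator moved into the kernel proper this would become:
 */
void __weak *alloc_insn_page(void)
{
        return text_alloc(PAGE_SIZE);
}

void __weak free_insn_page(void *page)
{
        text_memfree(page);
}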

/Jarkko


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Ard Biesheuvel
On Tue, 14 Jul 2020 at 14:31, Peter Zijlstra  wrote:
>
> On Tue, Jul 14, 2020 at 11:28:27AM +0100, Will Deacon wrote:
>
> > As Ard says, module_alloc() _is_ special, in the sense that the virtual
> > memory it allocates wants to be close to the kernel text, whereas the
> > concept of allocating executable memory is broader and doesn't have these
> > restrictions. So, while I'm in favour of having a text_alloc() interface
> > that can be used by callers which only require an executable mapping, I'd
> > much prefer for the module_alloc() code to remain for, err, modules.
>
> So on x86 all those things (kprobes, bpf, ftrace) require that same
> closeness.
>
> An interface like the late vmalloc_exec() will simply not work for us.
>

Fair enough. So for x86, implementing text_alloc() as an alias of
module_alloc() makes sense. But that is not the case in general.
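
(Concretely, "alias" on x86 could be as thin as the sketch below; note the
wrappers only make sense once the allocator itself no longer depends on
CONFIG_MODULES, which is exactly the point being debated:)

/* x86 sketch only - not from the posted series; reuses the module area
 * placement and KASAN handling for other text users. */
void *text_alloc(unsigned long size)
{
        return module_alloc(size);
}

void text_memfree(void *region)
{
        module_memfree(region);
}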

> We recently talked about arm64-kprobes and how you're not doing any of
> the optimizations and fully rely on the exception return. And I see
> you're one of the few archs that has bpf_jit_alloc_exec() (also,
> shouldn't you be using PAGE_KERNEL_EXEC there?). But the BPF core seems
> to use module_alloc() as a default means of allocating text.
>

Indeed. Which means it uses up module space which may be scarce,
especially on 32-bit ARM, and gets backed by kasan shadow pages, which
only makes sense for modules (if CONFIG_KASAN=y)

>
> So what should this look like? Have a text_alloc() with an argument that
> indicates where? But then I suppose we also need a means to manage PLT
> entries. Otherwise I don't exactly see how you're going to call BPF
> code, or how that BPF stuff is going to call back into its helpers.
>

If x86 chooses to back its implementation of text_alloc() by
module_alloc(), that is absolutely fine. But arm64 has no use for
text_alloc() at all today (bpf and kprobes don't use module_alloc(),
and ftrace does not implement dynamic trampoline allocation), and in
the general case, bpf, kprobes, ftrace and the module loader all have
different requirements that deviate subtly between architectures.

So perhaps the answer is to have text_alloc() not with a 'where'
argument but with a 'why' argument. Or more simply, just have separate
alloc/free APIs for each case, with generic versions that can be
overridden by the architecture.
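
To illustrate the difference between the two shapes, a "why"-style interface
could be declared along these lines (the names here are purely hypothetical):

/* Hypothetical sketch - none of these identifiers exist in the tree. */
enum text_alloc_user {
        TEXT_USER_MODULE,        /* wants module-area placement, KASAN shadow */
        TEXT_USER_KPROBES,
        TEXT_USER_BPF,
        TEXT_USER_FTRACE,
};

void *text_alloc(unsigned long size, enum text_alloc_user user);
void text_memfree(void *region);

The per-user alternative is essentially what bpf already does with its __weak
bpf_jit_alloc_exec()/bpf_jit_free_exec() pair, extended to kprobes and ftrace.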


Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:33:33AM +0100, Russell King - ARM Linux admin wrote:
> For 32-bit ARM, our bpf code uses "blx/bx" (or equivalent code
> sequences) rather than encoding a "bl" or "b", so BPF there doesn't
> care where the executable memory is mapped, and doesn't need any
> PLTs.  Given that, should bpf always allocate from the vmalloc()
> region to preserve the module space for modules?

Ah, okay, then I suspect arm64 does something similar there. Thanks!

> I'm more concerned about ftrace though, but only because I don't
> have the understanding of that subsystem to really say whether there
> are any side effects from having the allocations potentially be out
> of range of a "bl" or "b" instruction.
> 
> If ftrace jumps both to and from the allocated page using a "load
> address to register, branch to register" approach like BPF does, then
> ftrace should be safe - and again, raises the issue that maybe it
> should always come from vmalloc() space.

I think the problem with ftrace is patching multiple instructions,
because it sounds like you'd need something to load the absolute address
into a register and then jump to that. And where it's relatively easy to
replace a single instruction, replacing multiple instructions gets real
tricky real quick.

Which then leads to you being stuck with that 26-bit displacement, IIRC.
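
(For reference: a 26-bit signed byte displacement - the 24-bit branch
immediate shifted left by two - reaches roughly +/-32 MB (2^25 bytes) around
the patched instruction, assuming the classic 32-bit ARM b/bl encodings,
which is why the target allocation has to stay close to the kernel text
unless a register-indirect sequence is used.)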



Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Jarkko Sakkinen
On Tue, Jul 14, 2020 at 10:53:46AM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Jul 14, 2020 at 12:49:28PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Jul 13, 2020 at 07:34:10PM +0100, Russell King - ARM Linux admin 
> > wrote:
> > > On Mon, Jul 13, 2020 at 09:19:37PM +0300, Jarkko Sakkinen wrote:
> > > > Rename module_alloc() to text_alloc() and module_memfree() to
> > > > text_memfree(), and move them to kernel/text.c, which is unconditionally
> > > > compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> > > > allocate space for executable code without requiring module support
> > > > (CONFIG_MODULES=y) to be compiled in.
> > > 
> > > I'm not sure this is a good idea for 32-bit ARM.  The code you are
> > > moving for 32-bit ARM is quite specific to module use in that it also
> > > supports allocating from the vmalloc area, where the module code
> > > knows to add PLT entries.
> > > 
> > > If the other proposed users of this text_alloc() do not have the logic
> > > to add PLT entries when branches between kernel code and this
> > > allocation are not reachable by, e.g. a 26-bit signed offset for 32-bit
> > > ARM code, then this code is not suitable for that use.
> > 
> > My intention is to use this in kprobes code in place of
> > module_alloc().  I'm not sure why moving this code out of the module
> > subsystem could possibly break anything.  Unfortunately I forgot to add
> > a cover letter to my series. I'm sending v2 with one to explain my use
> > case for this.
> 
> Ah, so you're merely renaming module_alloc() to text_alloc() everywhere?
> It sounded from the initial patch like you were also converting other
> users to use module_alloc().

Yes, exactly. I'm making the allocators unconditionally part of the
kernel proper.

My application for this is test kernels. I never compile anything as
modules when I test a release.

Also, I could imagine that, especially in small-scale embedded devices,
it could sometimes be useful to have tracing support without module
support.

/Jarkko


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:28:27AM +0100, Will Deacon wrote:

> As Ard says, module_alloc() _is_ special, in the sense that the virtual
> memory it allocates wants to be close to the kernel text, whereas the
> concept of allocating executable memory is broader and doesn't have these
> restrictions. So, while I'm in favour of having a text_alloc() interface
> that can be used by callers which only require an executable mapping, I'd
> much prefer for the module_alloc() code to remain for, err, modules.

So on x86 all those things (kprobes, bpf, ftrace) require that same
closeness.

An interface like the late vmalloc_exec() will simply not work for us.

We recently talked about arm64-kprobes and how you're not doing any of
the optimizations and fully rely on the exception return. And I see
you're one of the few archs that has bpf_jit_alloc_exec() (also,
shouldn't you be using PAGE_KERNEL_EXEC there?). But the BPF core seems
to use module_alloc() as a default means of allocating text.
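
(That default is the pair of __weak hooks in kernel/bpf/core.c, roughly:)

/* kernel/bpf/core.c - the __weak defaults that arches may override */
void *__weak bpf_jit_alloc_exec(unsigned long size)
{
        return module_alloc(size);
}

void __weak bpf_jit_free_exec(void *addr)
{
        module_memfree(addr);
}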


So what should this look like? Have a text_alloc() with an argument that
indicates where? But then I suppose we also need a means to manage PLT
entries. Otherwise I don't exactly see how you're going to call BPF
code, or how that BPF stuff is going to call back into its helpers.




Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Russell King - ARM Linux admin
On Tue, Jul 14, 2020 at 01:17:22PM +0300, Ard Biesheuvel wrote:
> On Tue, 14 Jul 2020 at 12:53, Jarkko Sakkinen  wrote:
> >
> > On Mon, Jul 13, 2020 at 10:49:48PM +0300, Ard Biesheuvel wrote:
> > > This patch suggests that there are other reasons why conflating
> > > allocation of module space and allocating  text pages for other uses
> > > is a bad idea, but switching all users to text_alloc() is a step in
> > > the wrong direction. It would be better to stop using module_alloc()
> > > in core code except in the module loader, and have a generic
> > > text_alloc() that can be overridden by the arch if necessary. Note
> > > that x86  and s390 are the only architectures that use module_alloc()
> > > in ftrace code.
> >
> > This series essentially does this: introduces text_alloc() and
> > text_memfree(), which have generic implementations in kernel/text.c.
> > Those can be overridden by arch specific implementations.
> >
> > What do you think should be done differently than in my patch set?
> >
> 
> On arm64, module_alloc is only used by the module loader, and so
> pulling it out and renaming it will cause unused code to be
> incorporated into the kernel when building without module support,
> which is the use case you claim to be addressing.
> 
> Module_alloc has semantics that are intimately tied to the module
> loader, but over the years, it ended up being (ab)used by other
> subsystems, which don't require those semantics but just need n pages
> of vmalloc space with executable permissions.
> 
> So the correct approach is to make text_alloc() implement just that,
> generically, and switch bpf etc to use it. Then, only on architectures
> that need it, override it with an implementation that has the required
> additional semantics.
> 
> Refactoring 10+ architectures like this without any regard for how
> text_alloc() deviates from module_alloc() just creates a lot of churn
> that others will have to clean up after you.

For 32-bit ARM, our bpf code uses "blx/bx" (or equivalent code
sequences) rather than encoding a "bl" or "b", so BPF there doesn't
care where the executable memory is mapped, and doesn't need any
PLTs.  Given that, should bpf always allocate from the vmalloc()
region to preserve the module space for modules?

I'm more concerned about ftrace though, but only because I don't
have the understanding of that subsystem to really say whether there
are any side effects from having the allocations potentially be out
of range of a "bl" or "b" instruction.

If ftrace jumps both to and from the allocated page using a "load
address to register, branch to register" approach like BPF does, then
ftrace should be safe - and again, raises the issue that maybe it
should always come from vmalloc() space.
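
(For reference, the pattern the 32-bit ARM BPF JIT emits for calls is
roughly "load the absolute target into a scratch register - movw/movt or a
literal-pool load - then blx that register", so the generated code can live
anywhere in the address space; a plain bl, by contrast, only works when
caller and callee are within branch range.)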

So, I think we need to keep module_alloc() for allocating memory for
modules, and provide a new text_alloc() for these other cases.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Will Deacon
On Tue, Jul 14, 2020 at 12:45:36PM +0300, Jarkko Sakkinen wrote:
> Rename module_alloc() to text_alloc() and module_memfree() to
> text_memfree(), and move them to kernel/text.c, which is unconditionally
> compiled to the kernel proper. This allows kprobes, ftrace and bpf to
> allocate space for executable code without requiring module support
> (CONFIG_MODULES=y) to be compiled in.
> 
> Cc: Andi Kleen 
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Jarkko Sakkinen 

[...]

> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 1cd1a4d0ed30..adde022f703c 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -20,48 +20,6 @@
>  #include 
>  #include 
>  
> -void *module_alloc(unsigned long size)
> -{
> -        u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> -        gfp_t gfp_mask = GFP_KERNEL;
> -        void *p;
> -
> -        /* Silence the initial allocation */
> -        if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -                gfp_mask |= __GFP_NOWARN;
> -
> -        if (IS_ENABLED(CONFIG_KASAN))
> -                /* don't exceed the static module region - see below */
> -                module_alloc_end = MODULES_END;
> -
> -        p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -                                 module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
> -                                 NUMA_NO_NODE, __builtin_return_address(0));
> -
> -        if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> -            !IS_ENABLED(CONFIG_KASAN))
> -                /*
> -                 * KASAN can only deal with module allocations being served
> -                 * from the reserved module region, since the remainder of
> -                 * the vmalloc region is already backed by zero shadow pages,
> -                 * and punching holes into it is non-trivial. Since the module
> -                 * region is not randomized when KASAN is enabled, it is even
> -                 * less likely that the module region gets exhausted, so we
> -                 * can simply omit this fallback in that case.
> -                 */
> -                p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -                                         module_alloc_base + SZ_2G, GFP_KERNEL,
> -                                         PAGE_KERNEL, 0, NUMA_NO_NODE,
> -                                         __builtin_return_address(0));
> -
> -        if (p && (kasan_module_alloc(p, size) < 0)) {
> -                vfree(p);
> -                return NULL;
> -        }
> -
> -        return p;
> -}
> -
>  enum aarch64_reloc_op {
>          RELOC_OP_NONE,
>          RELOC_OP_ABS,
> diff --git a/arch/arm64/kernel/text.c b/arch/arm64/kernel/text.c
> new file mode 100644
> index ..64fc7e2d85df
> --- /dev/null
> +++ b/arch/arm64/kernel/text.c
> @@ -0,0 +1,54 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * AArch64 loadable module support.
> + *
> + * Copyright (C) 2012 ARM Limited
> + *
> + * Author: Will Deacon 
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +void *text_alloc(unsigned long size)
> +{
> +        u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> +        gfp_t gfp_mask = GFP_KERNEL;
> +        void *p;
> +
> +        /* Silence the initial allocation */
> +        if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> +                gfp_mask |= __GFP_NOWARN;
> +
> +        if (IS_ENABLED(CONFIG_KASAN))
> +                /* don't exceed the static module region - see below */
> +                module_alloc_end = MODULES_END;
> +
> +        p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> +                                 module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
> +                                 NUMA_NO_NODE, __builtin_return_address(0));
> +
> +        if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> +            !IS_ENABLED(CONFIG_KASAN))
> +                /*
> +                 * KASAN can only deal with module allocations being served
> +                 * from the reserved module region, since the remainder of
> +                 * the vmalloc region is already backed by zero shadow pages,
> +                 * and punching holes into it is non-trivial. Since the module
> +                 * region is not randomized when KASAN is enabled, it is even
> +                 * less likely that the module region gets exhausted, so we
> +                 * can simply omit this fallback in that case.
> +                 */
> +                p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> +                                         module_alloc_base + SZ_2G, GFP_KERNEL,
> +                                         PAGE_KERNEL, 0, NUMA_NO_NODE,
> +                                         __builtin_return_address(0));
> +
> +        if (p && (kasan_module_alloc(p, size) < 0)) {
> +                vfree(p);
> +                return NULL;
> +        }
> +
> +        return p;
> +}

I'm not especially keen on this approach.

As Ard says, module_alloc() _is_ special, in the sense that the virtual
memory it allocates wants to be close to the kernel text, whereas the
concept of allocating executable memory is broader and doesn't have these
restrictions. So, while I'm in favour of having a text_alloc() interface
that can be used by callers which only require an executable mapping, I'd
much prefer for the module_alloc() code to remain for, err, modules.

Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Ard Biesheuvel
On Tue, 14 Jul 2020 at 12:53, Jarkko Sakkinen  wrote:
>
> On Mon, Jul 13, 2020 at 10:49:48PM +0300, Ard Biesheuvel wrote:
> > This patch suggests that there are other reasons why conflating
> > allocation of module space and allocating  text pages for other uses
> > is a bad idea, but switching all users to text_alloc() is a step in
> > the wrong direction. It would be better to stop using module_alloc()
> > in core code except in the module loader, and have a generic
> > text_alloc() that can be overridden by the arch if necessary. Note
> > that x86  and s390 are the only architectures that use module_alloc()
> > in ftrace code.
>
> This series essentially does this: introduces text_alloc() and
> text_memfree(), which have generic implementations in kernel/text.c.
> Those can be overridden by arch specific implementations.
>
> What do you think should be done differently than in my patch set?
>

On arm64, module_alloc is only used by the module loader, and so
pulling it out and renaming it will cause unused code to be
incorporated into the kernel when building without module support,
which is the use case you claim to be addressing.

Module_alloc has semantics that are intimately tied to the module
loader, but over the years, it ended up being (ab)used by other
subsystems, which don't require those semantics but just need n pages
of vmalloc space with executable permissions.

So the correct approach is to make text_alloc() implement just that,
generically, and switch bpf etc to use it. Then, only on architectures
that need it, override it with an implementation that has the required
additional semantics.

Refactoring 10+ architectures like this without any regard for how
text_alloc() deviates from module_alloc() just creates a lot of churn
that others will have to clean up after you.

