Re: [PATCH 06/12] powerpc/xive: Native exploitation of the XIVE interrupt controller

2017-03-23 Thread Paul Mackerras
On Mon, Mar 20, 2017 at 05:49:08PM +1100, Benjamin Herrenschmidt wrote:
> The XIVE interrupt controller is the new interrupt controller
> found in POWER9. It supports advanced virtualization capabilities
> among other things.
> 
> Currently we use a set of firmware calls that simulate the old
> "XICS" interrupt controller but this is fairly inefficient.
> 
> This adds the framework for using XIVE along with a native
> backend which OPAL for configuration. Later, a backend allowing
> the use in a KVM or PowerVM guest will also be provided.
> 
> This disables some fast path for interrupts in KVM when XIVE is
> enabled as these rely on the firmware emulation code which is no
> longer available when the XIVE is used natively by Linux.
> 
> A latter patch will make KVM also directly exploit the XIVE, thus
> recovering the lost performance (and more).
> 
> Signed-off-by: Benjamin Herrenschmidt 

Mostly looks fine, a few minor comments below...

> ---
>  arch/powerpc/include/asm/xive.h  |  116 +++
>  arch/powerpc/include/asm/xmon.h  |2 +
>  arch/powerpc/platforms/powernv/Kconfig   |2 +
>  arch/powerpc/platforms/powernv/setup.c   |   15 +-
>  arch/powerpc/platforms/powernv/smp.c |   39 +-
>  arch/powerpc/sysdev/Kconfig  |1 +
>  arch/powerpc/sysdev/Makefile |1 +
>  arch/powerpc/sysdev/xive/Kconfig |7 +
>  arch/powerpc/sysdev/xive/Makefile|4 +
>  arch/powerpc/sysdev/xive/common.c| 1175 
> ++
>  arch/powerpc/sysdev/xive/native.c|  604 +++
>  arch/powerpc/sysdev/xive/xive-internal.h |   51 ++
>  arch/powerpc/sysdev/xive/xive-regs.h |   88 +++
>  arch/powerpc/xmon/xmon.c |   93 ++-
>  14 files changed, 2186 insertions(+), 12 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/xive.h
>  create mode 100644 arch/powerpc/sysdev/xive/Kconfig
>  create mode 100644 arch/powerpc/sysdev/xive/Makefile
>  create mode 100644 arch/powerpc/sysdev/xive/common.c
>  create mode 100644 arch/powerpc/sysdev/xive/native.c
>  create mode 100644 arch/powerpc/sysdev/xive/xive-internal.h
>  create mode 100644 arch/powerpc/sysdev/xive/xive-regs.h
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> new file mode 100644
> index 000..b1604b73
> --- /dev/null
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -0,0 +1,116 @@
> +#ifndef _ASM_POWERPC_XIVE_H
> +#define _ASM_POWERPC_XIVE_H
> +
> +#define XIVE_INVALID_VP  0x
> +
> +#ifdef CONFIG_PPC_XIVE
> +
> +extern void __iomem *xive_tm_area;
> +extern u32 xive_tm_offset;
> +
> +/*
> + * Per-irq data (irq_get_handler_data for normal IRQs), IPIs
> + * have it stored in the xive_cpu structure. We also cache
> + * for normal interrupts the current target CPU.
> + */
> +struct xive_irq_data {
> + /* Setup by backend */
> + u64 flags;
> +#define XIVE_IRQ_FLAG_STORE_EOI  0x01
> +#define XIVE_IRQ_FLAG_LSI0x02
> +#define XIVE_IRQ_FLAG_SHIFT_BUG  0x04
> +#define XIVE_IRQ_FLAG_MASK_FW0x08
> +#define XIVE_IRQ_FLAG_EOI_FW 0x10
> + u64 eoi_page;
> + void __iomem *eoi_mmio;
> + u64 trig_page;
> + void __iomem *trig_mmio;
> + u32 esb_shift;
> + int src_chip;
> +
> + /* Setup/used by frontend */
> + int target;
> + bool saved_p;
> +};
> +#define XIVE_INVALID_CHIP_ID -1
> +
> +/* A queue tracking structure in a CPU */
> +struct xive_q {
> + __be32  *qpage;
> + u32 msk;
> + u32 idx;
> + u32 toggle;
> + u64 eoi_phys;
> + void __iomem*eoi_mmio;
> + u32 esc_irq;
> + atomic_tcount;
> + atomic_tpending_count;
> +};
> +
> +/*
> + * "magic" ESB MMIO offsets
> + */
> +#define XIVE_ESB_GET 0x800
> +#define XIVE_ESB_SET_PQ_00   0xc00
> +#define XIVE_ESB_SET_PQ_01   0xd00
> +#define XIVE_ESB_SET_PQ_10   0xe00
> +#define XIVE_ESB_SET_PQ_11   0xf00
> +#define XIVE_ESB_MASKXIVE_ESB_SET_PQ_01
> +
> +extern bool __xive_enabled;
> +
> +static inline bool xive_enabled(void) { return __xive_enabled; }
> +
> +extern bool xive_native_init(void);
> +extern void xive_smp_probe(void);
> +extern int  xive_smp_prepare_cpu(unsigned int cpu);
> +extern void xive_smp_setup_cpu(void);
> +extern void xive_smp_disable_cpu(void);
> +extern void xive_kexec_teardown_cpu(int secondary);
> +extern void xive_shutdown(void);
> +extern void xive_flush_interrupt(void);
> +
> +/* xmon hook */
> +extern void xmon_xive_do_dump(int cpu);
> +
> +/* APIs used by KVM */
> +extern u32 xive_native_default_eq_shift(void);
> +extern u32 xive_native_alloc_vp_block(u32 max_vcpus);
> +extern void xive_native_free_vp_block(u32 vp_base);
> +extern int xive_native_populate_irq_data(u32 hw_irq,
> +  struct xive_irq_data *data);

Re: [PATCH v3 05/10] VAS: Define helpers to init window context

2017-03-23 Thread Michael Neuling
On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> Define helpers to initialize window context registers of the VAS
> hardware. These will be used in follow-on patches when opening/closing
> VAS windows.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
> Changelog[v3]
>   - Have caller, rather than init_xlate_regs() reset window regs
>     so we don't reset any settings caller may already have set.
>   - Translation mode should be 0x3 (0b11) not 0x11.
>   - Skip initilaizing read-only registers NX_UTIL and NX_UTIL_SE
>   - Skip initializing adder registers from UWC - they are already
>     initialized from the HVWC.
>   - Check winctx->user_win when setting translation registers
> ---
>  drivers/misc/vas/vas-internal.h |  59 ++-
>  drivers/misc/vas/vas-window.c   | 334
> 
>  2 files changed, 390 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
> index 15b62e0..8e721df 100644
> --- a/drivers/misc/vas/vas-internal.h
> +++ b/drivers/misc/vas/vas-internal.h
> @@ -11,6 +11,7 @@
>  #define VAS_INTERNAL_H
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #ifdef CONFIG_PPC_4K_PAGES
> @@ -336,9 +337,6 @@ struct vas_window {
>   /* Feilds applicable only to receive windows */
>   enum vas_cop_type cop;
>   atomic_t num_txwins;
> -
> - int32_t hwirq;
> - uint64_t irq_port;

We are removing things already? :-)

>  };
>  
>  /*
> @@ -392,4 +390,59 @@ struct vas_winctx {
>  extern int vas_initialized;
>  extern int vas_window_reset(struct vas_instance *vinst, int winid);
>  extern struct vas_instance *find_vas_instance(int vasid);
> +
> +/*
> + * VREG(x):
> + * Expand a register's short name (eg: LPID) into two parameters:
> + *   - the register's short name in string form ("LPID"), and
> + *   - the name of the macro (eg: VAS_LPID_OFFSET), defining the
> + *     register's offset in the window context
> + */
> +#define VREG_SFX(n, s)   __stringify(n), VAS_##n##s
> +#define VREG(r)  VREG_SFX(r, _OFFSET)
> +
> +#ifndef vas_debug
> +static inline void vas_log_write(struct vas_window *win, char *name,
> + void *regptr, uint64_t val)
> +{
> + if (val)
> + pr_err("%swin #%d: %s reg %p, val 0x%llx\n",
> + win->tx_win ? "Tx" : "Rx", win->winid, name,
> + regptr, val);
> +}
> +
> +#else/* vas_debug */
> +
> +#define vas_log_write(win, name, reg, val)
> +
> +#endif   /* vas_debug */
> +
> +static inline void write_uwc_reg(struct vas_window *win, char *name,
> + int32_t reg, uint64_t val)
> +{
> + void *regptr;
> +
> + regptr = win->uwc_map + reg;
> + vas_log_write(win, name, regptr, val);
> +
> + out_be64(regptr, val);
> +}
> +
> +static inline void write_hvwc_reg(struct vas_window *win, char *name,
> + int32_t reg, uint64_t val)
> +{
> + void *regptr;
> +
> + regptr = win->hvwc_map + reg;
> + vas_log_write(win, name, regptr, val);
> +
> + out_be64(regptr, val);
> +}
> +
> +static inline uint64_t read_hvwc_reg(struct vas_window *win,
> + char *name __maybe_unused, int32_t reg)
> +{
> + return in_be64(win->hvwc_map+reg);
> +}
> +
>  #endif
> diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
> index 32dd1d0..edf5c9f 100644
> --- a/drivers/misc/vas/vas-window.c
> +++ b/drivers/misc/vas/vas-window.c
> @@ -14,6 +14,8 @@
>  #include 
>  #include "vas-internal.h"
>  
> +static int fault_winid;
> +
>  /*
>   * Compute the paste address region for the window @window using the
>   * ->win_base_addr and ->win_id_shift we got from device tree.
> @@ -138,6 +140,338 @@ int map_wc_mmio_bars(struct vas_window *window)
>   return 0;
>  }
>  
> +/*
> + * Reset all valid registers in the HV and OS/User Window Contexts for
> + * the window identified by @window.
> + *
> + * NOTE: We cannot really use a for loop to reset window context. Not all
> + *    offsets in a window context are valid registers and the valid
> + *    registers are not sequential. And, we can only write to offsets
> + *    with valid registers (or is that only in Simics?).
> + */
> +void reset_window_regs(struct vas_window *window)
> +{
> + write_hvwc_reg(window, VREG(LPID), 0ULL);
> + write_hvwc_reg(window, VREG(PID), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_MSR), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_LPCR), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(AMR), 0ULL);
> + write_hvwc_reg(window, VREG(SEIDR), 0ULL);
> + write_hvwc_reg(window, VREG(FAULT_TX_WIN), 0ULL);
> + write_hvwc_reg(window, VREG(OSU_INTR_SRC_RA), 0ULL);
> + write_hvwc_reg(window, VREG(HV_INTR_SRC_RA), 0ULL);
> + write_hvwc_reg(window, VREG(PSWID), 0ULL);
> + 

Re: [PATCH] ASoC: WM8962: Let codec driver enable/disable its MCLK

2017-03-23 Thread Nicolin Chen
On Thu, Mar 23, 2017 at 02:01:50PM +0200, Daniel Baluta wrote:
> From: Nicolin Chen 
> 
> WM8962 needs its MCLK when powerup in wm8962_resume(). Thus it's better
> to control the MCLK in codec driver. Thus remove the clock enable in
> machine dirver accordingly.
> 
> While at it, get rid of imx_wm8962_remove function since it is now
> empty.
> 
> Signed-off-by: Nicolin Chen 

Hmm...it'd probably be better to let yourself be the author and remove
my signed-off here. Just got an email deliver failure since that email
address is apparently not available any more.

Thanks
Nicolin

> Signed-off-by: Daniel Baluta 
> ---
>  sound/soc/fsl/imx-wm8962.c | 40 
>  1 file changed, 8 insertions(+), 32 deletions(-)
> 
> diff --git a/sound/soc/fsl/imx-wm8962.c b/sound/soc/fsl/imx-wm8962.c
> index 1b60958..3d894d9 100644
> --- a/sound/soc/fsl/imx-wm8962.c
> +++ b/sound/soc/fsl/imx-wm8962.c
> @@ -33,7 +33,6 @@ struct imx_wm8962_data {
>   struct snd_soc_card card;
>   char codec_dai_name[DAI_NAME_SIZE];
>   char platform_name[DAI_NAME_SIZE];
> - struct clk *codec_clk;
>   unsigned int clk_frequency;
>  };
>  
> @@ -163,6 +162,7 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   struct imx_priv *priv = _priv;
>   struct i2c_client *codec_dev;
>   struct imx_wm8962_data *data;
> + struct clk *codec_clk;
>   int int_port, ext_port;
>   int ret;
>  
> @@ -231,19 +231,14 @@ static int imx_wm8962_probe(struct platform_device 
> *pdev)
>   goto fail;
>   }
>  
> - data->codec_clk = devm_clk_get(_dev->dev, NULL);
> - if (IS_ERR(data->codec_clk)) {
> - ret = PTR_ERR(data->codec_clk);
> + codec_clk = devm_clk_get(_dev->dev, NULL);
> + if (IS_ERR(codec_clk)) {
> + ret = PTR_ERR(codec_clk);
>   dev_err(_dev->dev, "failed to get codec clk: %d\n", ret);
>   goto fail;
>   }
>  
> - data->clk_frequency = clk_get_rate(data->codec_clk);
> - ret = clk_prepare_enable(data->codec_clk);
> - if (ret) {
> - dev_err(_dev->dev, "failed to enable codec clk: %d\n", 
> ret);
> - goto fail;
> - }
> + data->clk_frequency = clk_get_rate(codec_clk);
>  
>   data->dai.name = "HiFi";
>   data->dai.stream_name = "HiFi";
> @@ -258,10 +253,10 @@ static int imx_wm8962_probe(struct platform_device 
> *pdev)
>   data->card.dev = >dev;
>   ret = snd_soc_of_parse_card_name(>card, "model");
>   if (ret)
> - goto clk_fail;
> + goto fail;
>   ret = snd_soc_of_parse_audio_routing(>card, "audio-routing");
>   if (ret)
> - goto clk_fail;
> + goto fail;
>   data->card.num_links = 1;
>   data->card.owner = THIS_MODULE;
>   data->card.dai_link = >dai;
> @@ -277,16 +272,9 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   ret = devm_snd_soc_register_card(>dev, >card);
>   if (ret) {
>   dev_err(>dev, "snd_soc_register_card failed (%d)\n", ret);
> - goto clk_fail;
> + goto fail;
>   }
>  
> - of_node_put(ssi_np);
> - of_node_put(codec_np);
> -
> - return 0;
> -
> -clk_fail:
> - clk_disable_unprepare(data->codec_clk);
>  fail:
>   of_node_put(ssi_np);
>   of_node_put(codec_np);
> @@ -294,17 +282,6 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   return ret;
>  }
>  
> -static int imx_wm8962_remove(struct platform_device *pdev)
> -{
> - struct snd_soc_card *card = platform_get_drvdata(pdev);
> - struct imx_wm8962_data *data = snd_soc_card_get_drvdata(card);
> -
> - if (!IS_ERR(data->codec_clk))
> - clk_disable_unprepare(data->codec_clk);
> -
> - return 0;
> -}
> -
>  static const struct of_device_id imx_wm8962_dt_ids[] = {
>   { .compatible = "fsl,imx-audio-wm8962", },
>   { /* sentinel */ }
> @@ -318,7 +295,6 @@ static struct platform_driver imx_wm8962_driver = {
>   .of_match_table = imx_wm8962_dt_ids,
>   },
>   .probe = imx_wm8962_probe,
> - .remove = imx_wm8962_remove,
>  };
>  module_platform_driver(imx_wm8962_driver);
>  
> -- 
> 2.7.4
> 


Re: [PATCH v3 04/10] VAS: Define helpers for access MMIO regions

2017-03-23 Thread Michael Neuling
On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> Define some helper functions to access the MMIO regions. We use these
> in a follow-on patches to read/write VAS hardware registers. These
> helpers are also used to later issue 'paste' instructions to submit
> requests to the NX hardware engines.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
> Changelog [v3]:
>   - Minor reorg/cleanup of map/unmap functions
> 
> Changelog [v2]:
>   - Get HVWC, UWC and paste addresses from window->vinst (i.e DT)
>     rather than kernel macros.
> ---
>  drivers/misc/vas/vas-window.c | 126
> ++
>  1 file changed, 126 insertions(+)
> 
> diff --git a/drivers/misc/vas/vas-window.c b/drivers/misc/vas/vas-window.c
> index 468f3bf..32dd1d0 100644
> --- a/drivers/misc/vas/vas-window.c
> +++ b/drivers/misc/vas/vas-window.c
> @@ -9,9 +9,135 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include "vas-internal.h"
>  
> +/*
> + * Compute the paste address region for the window @window using the
> + * ->win_base_addr and ->win_id_shift we got from device tree.
> + */
> +void compute_paste_address(struct vas_window *window, uint64_t *addr, int
> *len)
> +{
> + uint64_t base, shift;
> + int winid;
> +
> + base = window->vinst->win_base_addr;
> + shift = window->vinst->win_id_shift;
> + winid = window->winid;
> +
> + *addr  = base + (winid << shift);
> + *len = PAGE_SIZE;
> +
> + pr_debug("Txwin #%d: Paste addr 0x%llx\n", winid, *addr);
> +}
> +
> +static inline void get_hvwc_mmio_bar(struct vas_window *window,
> + uint64_t *start, int *len)
> +{
> + uint64_t pbaddr;
> +
> + pbaddr = window->vinst->hvwc_bar_start;
> + *start = pbaddr + window->winid * VAS_HVWC_SIZE;
> + *len = VAS_HVWC_SIZE;
> +}
> +
> +static inline void get_uwc_mmio_bar(struct vas_window *window,
> + uint64_t *start, int *len)
> +{
> + uint64_t pbaddr;
> +
> + pbaddr = window->vinst->uwc_bar_start;
> + *start = pbaddr + window->winid * VAS_UWC_SIZE;
> + *len = VAS_UWC_SIZE;

I'm not sure this works for 4K pages since VAS_UWC_SIZE = PAGE_SIZE but in
reality I think it's always 64K.  Right?

Seem like we are mixing pages sizes and hardware sizes here.

(I realise 4K isn't supported yet, but)

> +}
> +
> +static void *map_mmio_region(char *name, uint64_t start, int len)
> +{
> + void *map;
> +
> + if (!request_mem_region(start, len, name)) {
> + pr_devel("%s(): request_mem_region(0x%llx, %d) failed\n",
> + __func__, start, len);
> + return NULL;
> + }
> +
> + map = __ioremap(start, len, pgprot_val(pgprot_cached(__pgprot(0;
> + if (!map) {
> + pr_devel("%s(): ioremap(0x%llx, %d) failed\n", __func__,
> start,
> + len);
> + return NULL;
> + }
> +
> + return map;
> +}
> +
> +/*
> + * Unmap the MMIO regions for a window.
> + */
> +static void unmap_wc_paste_kaddr(struct vas_window *window)
> +{
> + int len;
> + uint64_t busaddr_start;
> +
> + if (window->paste_kaddr) {
> + iounmap(window->paste_kaddr);
> + compute_paste_address(window, _start, );
> + release_mem_region((phys_addr_t)busaddr_start, len);
> + window->paste_kaddr = NULL;
> + }
> +
> +}
> +
> +static void unmap_wc_mmio_bars(struct vas_window *window)
> +{
> + int len;
> + uint64_t busaddr_start;
> +
> + unmap_wc_paste_kaddr(window);
> +
> + if (window->hvwc_map) {
> + iounmap(window->hvwc_map);
> + get_hvwc_mmio_bar(window, _start, );
> + release_mem_region((phys_addr_t)busaddr_start, len);
> + window->hvwc_map = NULL;
> + }
> +
> + if (window->uwc_map) {
> + iounmap(window->uwc_map);
> + get_uwc_mmio_bar(window, _start, );
> + release_mem_region((phys_addr_t)busaddr_start, len);
> + window->uwc_map = NULL;
> + }
> +}
> +
> +/*
> + * Find the Hypervisor Window Context (HVWC) MMIO Base Address Region and the
> + * OS/User Window Context (UWC) MMIO Base Address Region for the given
> window.
> + * Map these bus addresses and save the mapped kernel addresses in @window.
> + */
> +int map_wc_mmio_bars(struct vas_window *window)
> +{
> + int len;
> + uint64_t start;
> +
> + window->paste_kaddr = window->hvwc_map = window->uwc_map = NULL;
> +
> + get_hvwc_mmio_bar(window, , );
> + window->hvwc_map = map_mmio_region("HVWCM_Window", start, len);
> +
> + get_uwc_mmio_bar(window, , );
> + window->uwc_map = map_mmio_region("UWCM_Window", start, len);
> +
> + if (!window->hvwc_map || !window->uwc_map) {
> + unmap_wc_mmio_bars(window);
> + return -1;
> + }
> +
> + return 0;
> +}
> +
>  /* stub for now */
>  int 

Re: [PATCH] ASoC: WM8962: Let codec driver enable/disable its MCLK

2017-03-23 Thread Nicolin Chen
Hi,

Could you please revise the subject a bit? I forgot what's the
original one I submitted (internally) years ago. But the patch
is for imx-wm8962 instead of WM8962.

So please use
  ASoC: fsl: imx-wm8962: 
or at least
  ASoC: imx-wm8962: 

The change itself looks fine though.

Thanks
Nic


On Thu, Mar 23, 2017 at 02:01:50PM +0200, Daniel Baluta wrote:
> From: Nicolin Chen 
> 
> WM8962 needs its MCLK when powerup in wm8962_resume(). Thus it's better
> to control the MCLK in codec driver. Thus remove the clock enable in
> machine dirver accordingly.
> 
> While at it, get rid of imx_wm8962_remove function since it is now
> empty.
> 
> Signed-off-by: Nicolin Chen 
> Signed-off-by: Daniel Baluta 
> ---
>  sound/soc/fsl/imx-wm8962.c | 40 
>  1 file changed, 8 insertions(+), 32 deletions(-)
> 
> diff --git a/sound/soc/fsl/imx-wm8962.c b/sound/soc/fsl/imx-wm8962.c
> index 1b60958..3d894d9 100644
> --- a/sound/soc/fsl/imx-wm8962.c
> +++ b/sound/soc/fsl/imx-wm8962.c
> @@ -33,7 +33,6 @@ struct imx_wm8962_data {
>   struct snd_soc_card card;
>   char codec_dai_name[DAI_NAME_SIZE];
>   char platform_name[DAI_NAME_SIZE];
> - struct clk *codec_clk;
>   unsigned int clk_frequency;
>  };
>  
> @@ -163,6 +162,7 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   struct imx_priv *priv = _priv;
>   struct i2c_client *codec_dev;
>   struct imx_wm8962_data *data;
> + struct clk *codec_clk;
>   int int_port, ext_port;
>   int ret;
>  
> @@ -231,19 +231,14 @@ static int imx_wm8962_probe(struct platform_device 
> *pdev)
>   goto fail;
>   }
>  
> - data->codec_clk = devm_clk_get(_dev->dev, NULL);
> - if (IS_ERR(data->codec_clk)) {
> - ret = PTR_ERR(data->codec_clk);
> + codec_clk = devm_clk_get(_dev->dev, NULL);
> + if (IS_ERR(codec_clk)) {
> + ret = PTR_ERR(codec_clk);
>   dev_err(_dev->dev, "failed to get codec clk: %d\n", ret);
>   goto fail;
>   }
>  
> - data->clk_frequency = clk_get_rate(data->codec_clk);
> - ret = clk_prepare_enable(data->codec_clk);
> - if (ret) {
> - dev_err(_dev->dev, "failed to enable codec clk: %d\n", 
> ret);
> - goto fail;
> - }
> + data->clk_frequency = clk_get_rate(codec_clk);
>  
>   data->dai.name = "HiFi";
>   data->dai.stream_name = "HiFi";
> @@ -258,10 +253,10 @@ static int imx_wm8962_probe(struct platform_device 
> *pdev)
>   data->card.dev = >dev;
>   ret = snd_soc_of_parse_card_name(>card, "model");
>   if (ret)
> - goto clk_fail;
> + goto fail;
>   ret = snd_soc_of_parse_audio_routing(>card, "audio-routing");
>   if (ret)
> - goto clk_fail;
> + goto fail;
>   data->card.num_links = 1;
>   data->card.owner = THIS_MODULE;
>   data->card.dai_link = >dai;
> @@ -277,16 +272,9 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   ret = devm_snd_soc_register_card(>dev, >card);
>   if (ret) {
>   dev_err(>dev, "snd_soc_register_card failed (%d)\n", ret);
> - goto clk_fail;
> + goto fail;
>   }
>  
> - of_node_put(ssi_np);
> - of_node_put(codec_np);
> -
> - return 0;
> -
> -clk_fail:
> - clk_disable_unprepare(data->codec_clk);
>  fail:
>   of_node_put(ssi_np);
>   of_node_put(codec_np);
> @@ -294,17 +282,6 @@ static int imx_wm8962_probe(struct platform_device *pdev)
>   return ret;
>  }
>  
> -static int imx_wm8962_remove(struct platform_device *pdev)
> -{
> - struct snd_soc_card *card = platform_get_drvdata(pdev);
> - struct imx_wm8962_data *data = snd_soc_card_get_drvdata(card);
> -
> - if (!IS_ERR(data->codec_clk))
> - clk_disable_unprepare(data->codec_clk);
> -
> - return 0;
> -}
> -
>  static const struct of_device_id imx_wm8962_dt_ids[] = {
>   { .compatible = "fsl,imx-audio-wm8962", },
>   { /* sentinel */ }
> @@ -318,7 +295,6 @@ static struct platform_driver imx_wm8962_driver = {
>   .of_match_table = imx_wm8962_dt_ids,
>   },
>   .probe = imx_wm8962_probe,
> - .remove = imx_wm8962_remove,
>  };
>  module_platform_driver(imx_wm8962_driver);
>  
> -- 
> 2.7.4
> 


Re: [PATCH v3 03/10] VAS: Define vas_init() and vas_exit()

2017-03-23 Thread Michael Neuling
On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> Implement vas_init() and vas_exit() functions for a new VAS module.
> This VAS module is essentially a library for other device drivers
> and kernel users of the NX coprocessors like NX-842 and NX-GZIP.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
> Changelog[v3]:
>   - Zero vas_instances memory on allocation
>   - [Haren Myneni] Fix description in Kconfig
> Changelog[v2]:
>   - Get HVWC, UWC and window address parameters from device tree.
> ---
>  MAINTAINERS |   8 ++-
>  arch/powerpc/include/asm/reg.h  |   1 +
>  drivers/misc/Kconfig|   1 +
>  drivers/misc/Makefile   |   1 +
>  drivers/misc/vas/Kconfig|  21 ++
>  drivers/misc/vas/Makefile   |   3 +
>  drivers/misc/vas/vas-internal.h |   3 +
>  drivers/misc/vas/vas-window.c   |  19 +
>  drivers/misc/vas/vas.c  | 155
> 
>  9 files changed, 210 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/misc/vas/Kconfig
>  create mode 100644 drivers/misc/vas/Makefile
>  create mode 100644 drivers/misc/vas/vas-window.c
>  create mode 100644 drivers/misc/vas/vas.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2a910c9..4037252 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3673,8 +3673,6 @@ F:  arch/powerpc/platforms/powernv/pci-cxl.c
>  F:   drivers/misc/cxl/
>  F:   include/misc/cxl*
>  F:   include/uapi/misc/cxl.h
> -F:   Documentation/powerpc/cxl.txt
> -F:   Documentation/ABI/testing/sysfs-class-cxl

err?

>  CXLFLASH (IBM Coherent Accelerator Processor Interface CAPI Flash) SCSI
> DRIVER
>  M:   Manoj N. Kumar 
> @@ -3686,6 +3684,12 @@ F: drivers/scsi/cxlflash/
>  F:   include/uapi/scsi/cxlflash_ioctls.h
>  F:   Documentation/powerpc/cxlflash.txt
>  
> +VAS (IBM Virtual Accelerator Switch) DRIVER
> +M:   Sukadev Bhattiprolu 
> +L:   linuxppc-dev@lists.ozlabs.org
> +S:   Supported
> +F:   drivers/misc/vas/
> +

This was already added in patch 1.

>  STMMAC ETHERNET DRIVER
>  M:   Giuseppe Cavallaro 
>  M:   Alexandre Torgue 
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index fc879fd..7a45ff7 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1225,6 +1225,7 @@
>  #define PVR_POWER8E  0x004B
>  #define PVR_POWER8NVL0x004C
>  #define PVR_POWER8   0x004D
> +#define PVR_POWER9   0x004E

Can you send this up separately?  

>  #define PVR_BE   0x0070
>  #define PVR_PA6T 0x0090
>  
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index c290990..97d652e 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -783,4 +783,5 @@ source "drivers/misc/mic/Kconfig"
>  source "drivers/misc/genwqe/Kconfig"
>  source "drivers/misc/echo/Kconfig"
>  source "drivers/misc/cxl/Kconfig"
> +source "drivers/misc/vas/Kconfig"
>  endmenu
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index 7a3ea89..5201ffd 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -53,6 +53,7 @@ obj-$(CONFIG_GENWQE)+= genwqe/
>  obj-$(CONFIG_ECHO)   += echo/
>  obj-$(CONFIG_VEXPRESS_SYSCFG)+= vexpress-syscfg.o
>  obj-$(CONFIG_CXL_BASE)   += cxl/
> +obj-$(CONFIG_VAS)+= vas/
>  obj-$(CONFIG_PANEL) += panel.o
>  
>  lkdtm-$(CONFIG_LKDTM)+= lkdtm_core.o
> diff --git a/drivers/misc/vas/Kconfig b/drivers/misc/vas/Kconfig
> new file mode 100644
> index 000..43cedda
> --- /dev/null
> +++ b/drivers/misc/vas/Kconfig
> @@ -0,0 +1,21 @@
> +#
> +# IBM Virtual Accelarator Switchboard (VAS) compatible devices
> +#depends on PPC_POWERNV && PCI_MSI && EEH
> +#
> +
> +config VAS
> + tristate "Support for IBM Virtual Accelerator Switchboard (VAS)"
> + depends on PPC_POWERNV
> + default n
> + help
> +   Select this option to enable driver support for IBM Virtual
> +   Accelerator Switchboard (VAS).
> +
> +   VAS allows accelerators in co processors like NX-842 to be
> +   directly available to a user process. This driver enables
> +   userspace programs to access these accelerators via device
> +   nodes like /dev/crypto/nx-gzip.

I though this was kernel only users for now?

> +
> +   VAS adapters are found in POWER9 based systems.
> +
> +   If unsure, say N.
> diff --git a/drivers/misc/vas/Makefile b/drivers/misc/vas/Makefile
> new file mode 100644
> index 000..7dd7139
> --- /dev/null
> +++ b/drivers/misc/vas/Makefile
> @@ -0,0 +1,3 @@
> +ccflags-y:= $(call cc-disable-warning, unused-const-
> variable)
> +ccflags-$(CONFIG_PPC_WERROR) += -Werror
> +obj-$(CONFIG_VAS)+= vas.o vas-window.o
> diff --git a/drivers/misc/vas/vas-internal.h b/drivers/misc/vas/vas-internal.h
> 

Re: [PATCH v3 03/10] VAS: Define vas_init() and vas_exit()

2017-03-23 Thread Sukadev Bhattiprolu
Michael Neuling [michael.neul...@au1.ibm.com] wrote:
> On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3673,8 +3673,6 @@ F:arch/powerpc/platforms/powernv/pci-cxl.c
> >  F: drivers/misc/cxl/
> >  F: include/misc/cxl*
> >  F: include/uapi/misc/cxl.h
> > -F: Documentation/powerpc/cxl.txt
> > -F: Documentation/ABI/testing/sysfs-class-cxl
> > 
> 
> This shouldn't be deleted.

Yes, I will fix it.

Thanks

Sukadev
> 
> Mikey



Re: [PATCH v3 01/10] VAS: Define macros, register fields and structures

2017-03-23 Thread Michael Neuling
On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> Define macros for the VAS hardware registers and bit-fields as well
> as couple of data structures needed by the VAS driver.
> 
> > Signed-off-by: Sukadev Bhattiprolu 
> ---
> Changelog[v3]
>   - Rename winctx->pid to winctx->pidr to reflect that its a value
>     from the PID register (SPRN_PID), not the linux process id.
>   - Make it easier to split header into kernel/user parts
>   - To keep user interface simple, use macros rather than enum for
>     the threshold-control modes.
>   - Add a pid field to struct vas_window - needed for user space
>     send windows.
> 
> Changelog[v2]
>   - Add an overview of VAS in vas-internal.h
>   - Get window context parameters from device tree and drop
>     unnecessary macros.
> ---
>  MAINTAINERS |   6 +
>  arch/powerpc/include/asm/vas.h  |  43 +
>  drivers/misc/vas/vas-internal.h | 392 
> 

This is going to have to go through gregkh/lkml if it's drivers/misc.  you'll at
least need gregkh's ack/ok before mpe will take them (which is what we did for
CAPI).

We might want to keep this in arch/powerpc but I'm not sure.

>  3 files changed, 441 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/vas.h
>  create mode 100644 drivers/misc/vas/vas-internal.h
> 

> +
> +/*
> + * Overview of Virtual Accelerator Switchboard (VAS).
> + *
> + * VAS is a hardware "switchboard" that allows senders and receivers to
> + * exchange messages with _minimal_ kernel involvment. The receivers are
> + * typically NX coprocessor engines that perform compression or encryption
> + * in hardware, but receivers can also be other software threads.
> + *
> + * Senders are user/kernel threads that submit compression/encryption or
> + * other requests to the receivers. Senders must format their messages as
> + * Coprocessor Request Blocks (CRB)s and submit them using the instructions
> + * "copy" and "paste" which were introduced in Power9.
> + *
> + * A Power node can have (upto?) 8 Power chips. There is one instance of
> + * VAS in each Power9 chip. Each instance of VAS has 64K windows or ports,
> + * Senders and receivers must each connect to a separate window before they
> + * can exchange messages through the switchboard.
> + *
> + * Each window is described by two types of window contexts:
> + *
> > + * Hypervisor Window Context (HVWC) of size VAS_HVWC_SIZE bytes
> + *
> > + * OS/User Window Context (UWC) of size VAS_UWC_SIZE bytes.
> + *
> + * A window context can be viewed as a set of 64-bit registers. The settings
> + * in these registers configure/control/determine the behavior of the VAS
> + * hardware when messages are sent/received through the window. The registers
> + * in the HVWC are configured by the kernel while the registers in the UWC 
> can
> + * be configured by the kernel or by the user space application that is using
> + * the window.
> + *
> + * The HVWCs for all windows on a specific instance of VAS are in a 
> contiguous
> + * range of hardware addresses or Base address region (BAR) referred to as 
> the
> + * HVWC BAR for the instance. Similarly the UWCs for all windows on an 
> instance
> + * are referred to as the UWC BAR for the instance. The two BARs for each
> + * instance are defined Power9 MMIO Ranges spreadsheet and available to the
> + * kernel the device tree as follows:
> + *
> > + * /proc/device-tree/xscom@.../vas@.../hvwc-bar-start
> > + * /proc/device-tree/xscom@.../vas@.../hvwc-bar-size
> > + * /proc/device-tree/xscom@.../vas@.../uwc-bar-start
> + *   /proc/device-tree/xscom@.../vas@.../uwc-bar-size

should these just be reg properties?

> + *
> + * The kernel maps these two hardware address regions into the kernel address
> + * space (hvwc_map and uwc_map) and accesses the window contexts of a 
> specific
> + * window using:
> + *
> > + *  hvwc = hvwc_map + winid * VAS_HVWC_SIZE.
> > + *  uwc = uwc_map + winid * VAS_UWC_SIZE.
> + *
> + * where winid is the window index (0..64K).
> + *
> + * Note that the window contexts are used to "configure" the windows. In
> + * addition to this configuration address, each _send_ window also has a
> + * unique hardware address, referred to as the "paste-address" to which the
> + * sender must "paste" the message (CRB) they wish to submit. This hardware
> + * paste address for window can be computed from the following nodes in the
> + * device tree:
> + *
> > + * /proc/device-tree/xscom@.../vas@.../window-base
> + *   /proc/device-tree/xscom@.../vas@.../window-shift

Same here with reg properties.

 
> +struct vas_winctx {

> > +   int lpid;
> + int pidr;   /* value from SPRN_PID, not linux pid */

I'm surprised we have a copy of these here.  They should be accessed from the
context we are attaching to, rather than copied here... but I've not looked at
the rest of the code yet...



Re: [PATCH v3 03/10] VAS: Define vas_init() and vas_exit()

2017-03-23 Thread Michael Neuling
On Thu, 2017-03-16 at 20:33 -0700, Sukadev Bhattiprolu wrote:
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3673,8 +3673,6 @@ F:arch/powerpc/platforms/powernv/pci-cxl.c
>  F: drivers/misc/cxl/
>  F: include/misc/cxl*
>  F: include/uapi/misc/cxl.h
> -F: Documentation/powerpc/cxl.txt
> -F: Documentation/ABI/testing/sysfs-class-cxl
> 

This shouldn't be deleted.

Mikey


[PATCH] powerpc/powernv: Fix XSCOM address mangling for form 1 indirect

2017-03-23 Thread Michael Neuling
POWER9 adds form 1 scoms. The form of the indirection is specified in
the top nibble of the scom address.

Currently we do some (ugly) bit mangling so that we can fit a 64 bit
scom address into the debugfs interface. The current code only shifts
the top bit (indirect bit).

This patch changes it to shift the whole top nibble so that the form
of the indirection is also shifted.

This patch is backwards compatible with older scoms.

(This change isn't required in the
arch/powerpc/platformspowernv/opal-prd.c scom interface as it passes
the whole 64bit scom address without any bit mangling)

Signed-off-by: Michael Neuling 
---
(resending to correct linuxppc address)

 arch/powerpc/platforms/powernv/opal-xscom.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-xscom.c
b/arch/powerpc/platforms/powernv/opal-xscom.c
index d0ac535cf5..28651fb254 100644
--- a/arch/powerpc/platforms/powernv/opal-xscom.c
+++ b/arch/powerpc/platforms/powernv/opal-xscom.c
@@ -73,25 +73,32 @@ static int opal_xscom_err_xlate(int64_t rc)
 
 static u64 opal_scom_unmangle(u64 addr)
 {
+   u64 tmp;
+
    /*
-    * XSCOM indirect addresses have the top bit set. Additionally
-    * the rest of the top 3 nibbles is always 0.
+    * XSCOM addresses use the top nibble to set indirect mode and
+    * its form.  Bits 4-11 are always 0.
     *
     * Because the debugfs interface uses signed offsets and shifts
     * the address left by 3, we basically cannot use the top 4 bits
     * of the 64-bit address, and thus cannot use the indirect bit.
     *
-    * To deal with that, we support the indirect bit being in bit
-    * 4 (IBM notation) instead of bit 0 in this API, we do the
-    * conversion here. To leave room for further xscom address
-    * expansion, we only clear out the top byte
+    * To deal with that, we support the indirect bits being in
+    * bits 4-7 (IBM notation) instead of bit 0-3 in this API, we
+    * do the conversion here.
     *
-    * For in-kernel use, we also support the real indirect bit, so
-    * we test for any of the top 5 bits
+    * For in-kernel use, we don't need to do this mangling.  In
+    * kernel won't have bits 4-7 set.
     *
+    * So:
+    *   debugfs will always   set 0-3 = 0 and clear 4-7
+    *kernel will always clear 0-3 = 0 and   set 4-7
     */
-   if (addr & (0x1full << 59))
-   addr = (addr & ~(0xffull << 56)) | (1ull << 63);
+   tmp = addr;
+   tmp  &= 0x0f00;
+   addr &= 0xf0ff;
+   addr |= tmp << 4;
+
    return addr;
 }
 


Re: [v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread Pasha Tatashin


On 03/23/2017 07:47 PM, Pasha Tatashin wrote:


How long does it take if we just don't zero this memory at all?  We seem
to be initialising most of struct page in __init_single_page(), so it
seems like a lot of additional complexity to conditionally zero the rest
of struct page.


Alternatively, just zero out the entire vmemmap area when it is setup
in the kernel page tables.


Hi Dave,

I can do this, either way is fine with me. It would be a little slower
compared to the current approach where we benefit from having memset()
to work as prefetch. But that would become negligible, once in the
future we will increase the granularity of multi-threading, currently it
is only one thread per-mnode to multithread vmemamp. Your call.

Thank  you,
Pasha


Hi Dave and Matthew,

I've been thinking about it some more, and figured that the current 
approach is better:


1. Most importantly: Part of the vmemmap is initialized early during 
boot to support Linux to get to the multi-CPU environment. This means 
that we would need to figure out what part of vmemmap will need to be 
zeroed before hand in single thread, than zero the rest in multi-thread. 
This will be very awkward architecturally and error prone.


2. As I already showed, the current approach is significantly faster. 
So, perhaps it should be the default behavior even for non-deferred 
"struct page" initialization: unconditionally do not zero vmemmap in 
memblock allocator, and always zero in __init_single_page(). But, I am 
afraid it could cause boot time regressions on some platforms where 
memset() is not optimized, so I would not do it in this patchset. But, 
hopefully, gradually more platforms will support deferred struct page 
initialization, and this would become the default behavior.


3. By zeroing "struct page" in  __init_single_page(), we set every byte 
of "struct page" in one place instead of scattering it across different 
places. So, it could help in the future when we will multi-thread 
addition of hotplugged memory.


Thank you,
Pasha


Re: [v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread Pasha Tatashin



On 03/23/2017 07:35 PM, David Miller wrote:

From: Matthew Wilcox 
Date: Thu, 23 Mar 2017 16:26:38 -0700


On Thu, Mar 23, 2017 at 07:01:48PM -0400, Pavel Tatashin wrote:

When deferred struct page initialization feature is enabled, we get a
performance gain of initializing vmemmap in parallel after other CPUs are
started. However, we still zero the memory for vmemmap using one boot CPU.
This patch-set fixes the memset-zeroing limitation by deferring it as well.

Here is example performance gain on SPARC with 32T:
base
https://hastebin.com/ozanelatat.go

fix
https://hastebin.com/utonawukof.go

As you can see without the fix it takes: 97.89s to boot
With the fix it takes: 46.91 to boot.


How long does it take if we just don't zero this memory at all?  We seem
to be initialising most of struct page in __init_single_page(), so it
seems like a lot of additional complexity to conditionally zero the rest
of struct page.


Alternatively, just zero out the entire vmemmap area when it is setup
in the kernel page tables.


Hi Dave,

I can do this, either way is fine with me. It would be a little slower 
compared to the current approach where we benefit from having memset() 
to work as prefetch. But that would become negligible, once in the 
future we will increase the granularity of multi-threading, currently it 
is only one thread per-mnode to multithread vmemamp. Your call.


Thank  you,
Pasha


Re: [v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread Pasha Tatashin

Hi Matthew,

Thank you for your comment. If you look at the data, having memset() 
actually benefits initializing data.


With base it takes:
[   66.148867] node 0 initialised, 128312523 pages in 7200ms

With fix:
[   15.260634] node 0 initialised, 128312523 pages in 4190ms

So 4.19s vs 7.2s for the same number of "struct page". This is because 
memset() actually brings "struct page" into cache with efficient  block 
initializing store instruction. I have not tested if there is the same 
effect on Intel.


Pasha

On 03/23/2017 07:26 PM, Matthew Wilcox wrote:

On Thu, Mar 23, 2017 at 07:01:48PM -0400, Pavel Tatashin wrote:

When deferred struct page initialization feature is enabled, we get a
performance gain of initializing vmemmap in parallel after other CPUs are
started. However, we still zero the memory for vmemmap using one boot CPU.
This patch-set fixes the memset-zeroing limitation by deferring it as well.

Here is example performance gain on SPARC with 32T:
base
https://hastebin.com/ozanelatat.go

fix
https://hastebin.com/utonawukof.go

As you can see without the fix it takes: 97.89s to boot
With the fix it takes: 46.91 to boot.


How long does it take if we just don't zero this memory at all?  We seem
to be initialising most of struct page in __init_single_page(), so it
seems like a lot of additional complexity to conditionally zero the rest
of struct page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: mailto:"d...@kvack.org;> em...@kvack.org 



[v1 3/5] mm: add "zero" argument to vmemmap allocators

2017-03-23 Thread Pavel Tatashin
Allow clients to request non-zeroed memory from vmemmap allocator.
The following two public function have a new boolean argument called zero:

__vmemmap_alloc_block_buf()
vmemmap_alloc_block()

When zero is true, memory that is allocated by memblock allocator is zeroed
(the current behavior), when argument is false, the memory is not zeroed.

This change allows for optimizations where client knows when it is better
to zero memory: may be later when other CPUs are started, or may be client
is going to set every byte in the allocated memory, so no need to zero
memory beforehand.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 
---
 arch/powerpc/mm/init_64.c |4 +-
 arch/s390/mm/vmem.c   |5 ++-
 arch/sparc/mm/init_64.c   |3 +-
 arch/x86/mm/init_64.c |3 +-
 include/linux/mm.h|6 ++--
 mm/sparse-vmemmap.c   |   48 +---
 6 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 9be9920..eb4c270 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -133,7 +133,7 @@ static int __meminit vmemmap_populated(unsigned long start, 
int page_size)
 
/* allocate a page when required and hand out chunks */
if (!num_left) {
-   next = vmemmap_alloc_block(PAGE_SIZE, node);
+   next = vmemmap_alloc_block(PAGE_SIZE, node, true);
if (unlikely(!next)) {
WARN_ON(1);
return NULL;
@@ -181,7 +181,7 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
if (vmemmap_populated(start, page_size))
continue;
 
-   p = vmemmap_alloc_block(page_size, node);
+   p = vmemmap_alloc_block(page_size, node, true);
if (!p)
return -ENOMEM;
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 60d3899..9c75214 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -251,7 +251,8 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
if (MACHINE_HAS_EDAT1) {
void *new_page;
 
-   new_page = vmemmap_alloc_block(PMD_SIZE, node);
+   new_page = vmemmap_alloc_block(PMD_SIZE, node,
+  true);
if (!new_page)
goto out;
pmd_val(*pm_dir) = __pa(new_page) | sgt_prot;
@@ -271,7 +272,7 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
if (pte_none(*pt_dir)) {
void *new_page;
 
-   new_page = vmemmap_alloc_block(PAGE_SIZE, node);
+   new_page = vmemmap_alloc_block(PAGE_SIZE, node, true);
if (!new_page)
goto out;
pte_val(*pt_dir) = __pa(new_page) | pgt_prot;
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 01eccab..d91e462 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2541,7 +2541,8 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
pmd = pmd_offset(pud, vstart);
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
-   void *block = vmemmap_alloc_block(PMD_SIZE, node);
+   void *block = vmemmap_alloc_block(PMD_SIZE, node,
+ true);
 
if (!block)
return -ENOMEM;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 15173d3..46101b6 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1176,7 +1176,8 @@ static int __meminit vmemmap_populate_hugepages(unsigned 
long start,
if (pmd_none(*pmd)) {
void *p;
 
-   p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
+   p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap,
+ true);
if (p) {
pte_t entry;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f01c88..54df194 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2410,13 +2410,13 @@ void sparse_mem_maps_populate_node(struct page 
**map_map,
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
-void *vmemmap_alloc_block(unsigned long size, int node);

[v1 2/5] mm: defining memblock_virt_alloc_try_nid_raw

2017-03-23 Thread Pavel Tatashin
A new version of memblock_virt_alloc_* allocations:
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 
---
 include/linux/bootmem.h |3 +++
 mm/memblock.c   |   46 +++---
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index dbaf312..b61ea10 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern int reserve_bootmem_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 696f06d..7fdc555 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1271,7 +1271,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, phys_addr_t align, i
 static void * __init memblock_virt_alloc_internal(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr,
-   int nid)
+   int nid, bool zero)
 {
phys_addr_t alloc;
void *ptr;
@@ -1322,7 +1322,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, phys_addr_t align, i
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
+   if (zero)
+   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1336,6 +1337,37 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, phys_addr_t align, i
 }
 
 /**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated memory, does not panic if request
+ * cannot be satisfied.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr,
+   int nid)
+{
+   memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx 
max_addr=0x%llx %pF\n",
+__func__, (u64)size, (u64)align, nid, (u64)min_addr,
+(u64)max_addr, (void *)_RET_IP_);
+   return memblock_virt_alloc_internal(size, align,
+  min_addr, max_addr, nid, false);
+}
+
+/**
  * memblock_virt_alloc_try_nid_nopanic - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
  * @align: alignment of the region and block's size
@@ -1346,8 +1378,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, phys_addr_t align, i
  *   allocate only from memory limited by memblock.current_limit value
  * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
  *
- * Public version of _memblock_virt_alloc_try_nid_nopanic() which provides
- * additional debug information (including caller info), if enabled.
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. This function zeroes the allocated memory.
  *
  * RETURNS:
  * Virtual address of allocated memory block on success, NULL on failure.
@@ -1361,7 +1393,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, phys_addr_t align, i
 __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 (u64)max_addr, (void *)_RET_IP_);
return memblock_virt_alloc_internal(size, align, min_addr,
-max_addr, nid);
+max_addr, nid, true);
 }
 
 /**
@@ -1375,7 +1407,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t 
size, 

[v1 4/5] mm: zero struct pages during initialization

2017-03-23 Thread Pavel Tatashin
When deferred struct page initialization is enabled, do not expect that
the memory that was allocated for struct pages was zeroed by the
allocator. Zero it when "struct pages" are initialized.

Also, a defined boolean VMEMMAP_ZERO is provided to tell platforms whether
they should zero memory or can deffer it.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 
---
 include/linux/mm.h |9 +
 mm/page_alloc.c|3 +++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 54df194..eb052f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2427,6 +2427,15 @@ int vmemmap_populate_basepages(unsigned long start, 
unsigned long end,
 #ifdef CONFIG_MEMORY_HOTPLUG
 void vmemmap_free(unsigned long start, unsigned long end);
 #endif
+/*
+ * Don't zero "struct page"es during early boot, and zero only when they are
+ * initialized in parallel.
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+#define VMEMMAP_ZERO   false
+#else
+#define VMEMMAP_ZERO   true
+#endif
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
  unsigned long size);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f202f8b..02945e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,9 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+   memset(page, 0, sizeof(struct page));
+#endif
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
-- 
1.7.1



[v1 1/5] sparc64: simplify vmemmap_populate

2017-03-23 Thread Pavel Tatashin
Remove duplicating code, by using common functions
vmemmap_pud_populate and vmemmap_pgd_populate functions.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 
---
 arch/sparc/mm/init_64.c |   23 ++-
 1 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 2c0cb2a..01eccab 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2526,30 +2526,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
1.7.1



[v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread Pavel Tatashin
When deferred struct page initialization feature is enabled, we get a
performance gain of initializing vmemmap in parallel after other CPUs are
started. However, we still zero the memory for vmemmap using one boot CPU.
This patch-set fixes the memset-zeroing limitation by deferring it as well.

Here is example performance gain on SPARC with 32T:
base
https://hastebin.com/ozanelatat.go

fix
https://hastebin.com/utonawukof.go

As you can see without the fix it takes: 97.89s to boot
With the fix it takes: 46.91 to boot.

On x86 time saving is going to be even greater (proportionally to memory size)
because there are twice as many "struct page"es for the same amount of memory,
as base pages are twice smaller.


Pavel Tatashin (5):
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: add "zero" argument to vmemmap allocators
  mm: zero struct pages during initialization
  mm: teach platforms not to zero struct pages memory

 arch/powerpc/mm/init_64.c |4 +-
 arch/s390/mm/vmem.c   |5 ++-
 arch/sparc/mm/init_64.c   |   26 +++
 arch/x86/mm/init_64.c |3 +-
 include/linux/bootmem.h   |3 ++
 include/linux/mm.h|   15 +++--
 mm/memblock.c |   46 --
 mm/page_alloc.c   |3 ++
 mm/sparse-vmemmap.c   |   48 +---
 9 files changed, 103 insertions(+), 50 deletions(-)


[v1 5/5] mm: teach platforms not to zero struct pages memory

2017-03-23 Thread Pavel Tatashin
If we are using deferred struct page initialization feature, most of
"struct page"es are getting initialized after other CPUs are started, and
hence we are benefiting from doing this job in parallel. However, we are
still zeroing all the memory that is allocated for "struct pages" using the
boot CPU.  This patch solves this problem, by deferring zeroing "struct
pages" to only when they are initialized.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Shannon Nelson 
---
 arch/powerpc/mm/init_64.c |2 +-
 arch/sparc/mm/init_64.c   |2 +-
 arch/x86/mm/init_64.c |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index eb4c270..24faf2d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -181,7 +181,7 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node)
if (vmemmap_populated(start, page_size))
continue;
 
-   p = vmemmap_alloc_block(page_size, node, true);
+   p = vmemmap_alloc_block(page_size, node, VMEMMAP_ZERO);
if (!p)
return -ENOMEM;
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index d91e462..280834e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2542,7 +2542,7 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node,
- true);
+ VMEMMAP_ZERO);
 
if (!block)
return -ENOMEM;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 46101b6..9d8c72c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1177,7 +1177,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned 
long start,
void *p;
 
p = __vmemmap_alloc_block_buf(PMD_SIZE, node, altmap,
- true);
+ VMEMMAP_ZERO);
if (p) {
pte_t entry;
 
-- 
1.7.1



Re: [v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread David Miller
From: Matthew Wilcox 
Date: Thu, 23 Mar 2017 16:26:38 -0700

> On Thu, Mar 23, 2017 at 07:01:48PM -0400, Pavel Tatashin wrote:
>> When deferred struct page initialization feature is enabled, we get a
>> performance gain of initializing vmemmap in parallel after other CPUs are
>> started. However, we still zero the memory for vmemmap using one boot CPU.
>> This patch-set fixes the memset-zeroing limitation by deferring it as well.
>> 
>> Here is example performance gain on SPARC with 32T:
>> base
>> https://hastebin.com/ozanelatat.go
>> 
>> fix
>> https://hastebin.com/utonawukof.go
>> 
>> As you can see without the fix it takes: 97.89s to boot
>> With the fix it takes: 46.91 to boot.
> 
> How long does it take if we just don't zero this memory at all?  We seem
> to be initialising most of struct page in __init_single_page(), so it
> seems like a lot of additional complexity to conditionally zero the rest
> of struct page.

Alternatively, just zero out the entire vmemmap area when it is setup
in the kernel page tables.


Re: [v1 0/5] parallelized "struct page" zeroing

2017-03-23 Thread Matthew Wilcox
On Thu, Mar 23, 2017 at 07:01:48PM -0400, Pavel Tatashin wrote:
> When deferred struct page initialization feature is enabled, we get a
> performance gain of initializing vmemmap in parallel after other CPUs are
> started. However, we still zero the memory for vmemmap using one boot CPU.
> This patch-set fixes the memset-zeroing limitation by deferring it as well.
> 
> Here is example performance gain on SPARC with 32T:
> base
> https://hastebin.com/ozanelatat.go
> 
> fix
> https://hastebin.com/utonawukof.go
> 
> As you can see without the fix it takes: 97.89s to boot
> With the fix it takes: 46.91 to boot.

How long does it take if we just don't zero this memory at all?  We seem
to be initialising most of struct page in __init_single_page(), so it
seems like a lot of additional complexity to conditionally zero the rest
of struct page.


Re: [PATCH v9 2/3] PCI: Add a macro to set default alignment for all PCI devices

2017-03-23 Thread Bjorn Helgaas
On Thu, Mar 23, 2017 at 03:53:42PM -0500, Bjorn Helgaas wrote:
> Hi Yongji,
> 
> On Wed, Feb 15, 2017 at 02:45:05PM +0800, Yongji Xie wrote:
> > When vfio passthroughs a PCI device of which MMIO BARs are
> > smaller than PAGE_SIZE, guest will not handle the mmio
> > accesses to the BARs which leads to mmio emulations in host.
> > 
> > This is because vfio will not allow to passthrough one BAR's
> > mmio page which may be shared with other BARs. Otherwise,
> > there will be a backdoor that guest can use to access BARs
> > of other guest.
> 
> Please include a pointer to the VFIO code that enforces this.  It's
> not obvious to me how it would do this.  This doesn't change the
> *size* of the resource, only the alignment.  So if VFIO sees a BAR
> like [mem 0x8000-0x80ff], it knows the BAR is aligned enough
> that it *could* be the only thing on a page, but I don't know how it
> could know that there will never be another BAR at 0x8100.  Even
> if there's no other BAR on that page *now*, it would have to know that
> no hot-added device will ever be placed on that page.

Never mind, I found it.  I updated the changelog like this; please
correct anything I got wrong:

When VFIO passes through a PCI device to a guest, it does not
allow the guest to mmap BARs that are smaller than PAGE_SIZE
unless it can reserve the rest of the page (see
vfio_pci_probe_mmaps()).  This is because a page might contain
several small BARs for unrelated devices and a guest should not be
able to access all of them.

VFIO emulates guest accesses to non-mappable BARs, which is
functional but slow.  On systems with large page sizes, e.g.,
PowerNV with 64K pages, BARs are more likely to share a page and
performance is more likely to be a problem.

Add a macro to set default alignment for all PCI devices.  An arch
can set this to PAGE_SIZE to force the PCI core to place memory
BARs on their own pages.


Re: [PATCH v9 3/3] PCI: Don't extend device's size when using default alignment for all devices

2017-03-23 Thread Bjorn Helgaas
On Wed, Feb 15, 2017 at 02:45:06PM +0800, Yongji Xie wrote:
> Currently we reassign the alignment by extending resources' size in
> pci_reassigndev_resource_alignment(). This could potentially break
> some drivers when the driver uses the size to locate register
> whose length is related to the size. Some examples as below:

I understand the motivation to stop changing the resource size.  That
makes good sense.

> - misc\Hpilo.c:

This isn't Windows; please use regular Linux forward slashes here.

> off = pci_resource_len(pdev, bar) - 0x2000;
> 
> - net\ethernet\chelsio\cxgb4\cxgb4_uld.h:
> (pci_resource_len((pdev), 2) - roundup_pow_of_two((vres)->ocq.size))
> 
> - infiniband\hw\nes\Nes_hw.c:
> num_pds = pci_resource_len(nesdev->pcidev, BAR_1) >> PAGE_SHIFT;
> 
> This risk could be easily prevented before because we only had one way
> (kernel parameter resource_alignment) to touch those codes. And even
> some users may be happy to see the extended size.
> 
> But now we define a default alignment for all devices which would also
> touch those codes. It would be hard to prevent the risk in this case. So
> this patch tries to use START_ALIGNMENT to identify the resource's
> alignment without extending the size when the alignment reassigning is
> caused by the default alignment.

But I don't understand the new START_ALIGNMENT case.

> Signed-off-by: Yongji Xie 
> ---
>  drivers/pci/pci.c |   34 --
>  1 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2622e9b..0017e5e 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4958,7 +4958,8 @@ void pci_ignore_hotplug(struct pci_dev *dev)
>   * RETURNS: Resource alignment if it is specified.
>   *  Zero if it is not specified.

Since there's function doc, please update it to reflect the new
"resize" return parameter.

What does "resize" mean?  I can't figure it out from your code.

>   */
> -static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev)
> +static resource_size_t pci_specified_resource_alignment(struct pci_dev *dev,
> + bool *resize)
>  {
>   int seg, bus, slot, func, align_order, count;
>   unsigned short vendor, device, subsystem_vendor, subsystem_device;
> @@ -5002,6 +5003,7 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev)
>   (!device || (device == dev->device)) &&
>   (!subsystem_vendor || (subsystem_vendor == 
> dev->subsystem_vendor)) &&
>   (!subsystem_device || (subsystem_device == 
> dev->subsystem_device))) {
> + *resize = true;
>   if (align_order == -1)
>   align = PAGE_SIZE;
>   else
> @@ -5027,6 +5029,7 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev)
>   bus == dev->bus->number &&
>   slot == PCI_SLOT(dev->devfn) &&
>   func == PCI_FUNC(dev->devfn)) {
> + *resize = true;
>   if (align_order == -1)
>   align = PAGE_SIZE;
>   else
> @@ -5059,6 +5062,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
> *dev)
>   struct resource *r;
>   resource_size_t align, size;
>   u16 command;
> + bool resize = false;
>  
>   /*
>* VF BARs are read-only zero according to SR-IOV spec r1.1, sec
> @@ -5070,7 +5074,7 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
> *dev)
>   return;
>  
>   /* check if specified PCI is target device to reassign */
> - align = pci_specified_resource_alignment(dev);
> + align = pci_specified_resource_alignment(dev, );
>   if (!align)
>   return;
>  
> @@ -5098,15 +5102,25 @@ void pci_reassigndev_resource_alignment(struct 
> pci_dev *dev)
>   }
>  
>   size = resource_size(r);
> - if (size < align) {
> - size = align;
> - dev_info(>dev,
> - "Rounding up size of resource #%d to %#llx.\n",
> - i, (unsigned long long)size);
> + if (resize) {
> + if (size < align) {
> + size = align;
> + dev_info(>dev,
> + "Rounding up size of resource #%d to 
> %#llx.\n",
> + i, (unsigned long long)size);
> + }
> + r->flags |= IORESOURCE_UNSET;
> + r->end = size - 1;
> + r->start = 0;
> + } else {
> 

Re: [PATCH v9 2/3] PCI: Add a macro to set default alignment for all PCI devices

2017-03-23 Thread Bjorn Helgaas
Hi Yongji,

On Wed, Feb 15, 2017 at 02:45:05PM +0800, Yongji Xie wrote:
> When vfio passthroughs a PCI device of which MMIO BARs are
> smaller than PAGE_SIZE, guest will not handle the mmio
> accesses to the BARs which leads to mmio emulations in host.
> 
> This is because vfio will not allow to passthrough one BAR's
> mmio page which may be shared with other BARs. Otherwise,
> there will be a backdoor that guest can use to access BARs
> of other guest.

Please include a pointer to the VFIO code that enforces this.  It's
not obvious to me how it would do this.  This doesn't change the
*size* of the resource, only the alignment.  So if VFIO sees a BAR
like [mem 0x8000-0x80ff], it knows the BAR is aligned enough
that it *could* be the only thing on a page, but I don't know how it
could know that there will never be another BAR at 0x8100.  Even
if there's no other BAR on that page *now*, it would have to know that
no hot-added device will ever be placed on that page.

> This patch adds a macro to set default alignment for all
> PCI devices. Then we could solve this issue on some platforms
> which would easily hit this issue because of their 64K page
> such as PowerNV platform by defining this macro as PAGE_SIZE.
> 
> Signed-off-by: Yongji Xie 
> ---
>  arch/powerpc/include/asm/pci.h |4 
>  drivers/pci/pci.c  |3 +++
>  2 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index e9bd6cf..5e31bc2 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -28,6 +28,10 @@
>  #define PCIBIOS_MIN_IO   0x1000
>  #define PCIBIOS_MIN_MEM  0x1000
>  
> +#ifdef CONFIG_PPC_POWERNV
> +#define PCIBIOS_DEFAULT_ALIGNMENTPAGE_SIZE
> +#endif
> +
>  struct pci_dev;
>  
>  /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a881c0d..2622e9b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4965,6 +4965,9 @@ static resource_size_t 
> pci_specified_resource_alignment(struct pci_dev *dev)
>   resource_size_t align = 0;
>   char *p;
>  
> +#ifdef PCIBIOS_DEFAULT_ALIGNMENT
> + align = PCIBIOS_DEFAULT_ALIGNMENT;
> +#endif

I'd prefer to move this #ifdef out of the code path, as in the patch
below.

I don't know how powerpc kernels are built: can you build a kernel
that will run on PowerNV as well as on different powerpc systems?  If
so, is it acceptable to force that kernel to user 64K alignment even
when it's running on non-PowerNV systems?  Or do you need a function
call here instead of a #define?

>   spin_lock(_alignment_lock);
>   p = resource_alignment_param;
>   if (!*p)


commit 60106ae302a264f9be15fa736dae2ea51140b988
Author: Yongji Xie 
Date:   Wed Feb 15 14:45:05 2017 +0800

PCI: Add PCIBIOS_DEFAULT_ALIGNMENT for arch-specific alignment control

When VFIO passes through a PCI device to a guest, it does not allow the
guest direct access to BARs that are smaller than PAGE_SIZE.  This is
because a page might contain several small BARs for unrelated devices and
a guest should not be able to access all of them.

VFIO emulates guest accesses to these small BARs, which is functional but
slow.  On systems with large page sizes, e.g., PowerNV with 64K pages, BARs
are more likely to share a page and performance is more of a problem.

Add a macro to set default alignment for all PCI devices.  An arch can set
this to PAGE_SIZE to prevent memory BARs from sharing a page.

[bhelgaas: add default PCIBIOS_DEFAULT_ALIGNMENT definition, changelog]
Signed-off-by: Yongji Xie 
Signed-off-by: Bjorn Helgaas 

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 93eded8d3843..dc3c81ab4cc4 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -28,6 +28,10 @@
 #define PCIBIOS_MIN_IO 0x1000
 #define PCIBIOS_MIN_MEM0x1000
 
+#ifdef CONFIG_PPC_POWERNV
+#define PCIBIOS_DEFAULT_ALIGNMENT  PAGE_SIZE
+#endif
+
 struct pci_dev;
 
 /* Values for the `which' argument to sys_pciconfig_iobase syscall.  */
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7904d02ffdb9..e9a079063fd0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4962,7 +4962,7 @@ static resource_size_t 
pci_specified_resource_alignment(struct pci_dev *dev)
 {
int seg, bus, slot, func, align_order, count;
unsigned short vendor, device, subsystem_vendor, subsystem_device;
-   resource_size_t align = 0;
+   resource_size_t align = PCIBIOS_DEFAULT_ALIGNMENT;
char *p;
 
spin_lock(_alignment_lock);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index eb3da1a04e6c..618beb5809d1 100644
--- 

Re: [PATCH 12/13] powerpc/perf: Thread imc cpuhotplug support

2017-03-23 Thread Gautham R Shenoy
Hi Maddy, Anju,

On Thu, Mar 16, 2017 at 01:05:06PM +0530, Madhavan Srinivasan wrote:
> From: Anju T Sudhakar 
> 
> This patch adds support for thread IMC on cpuhotplug.
> 
> When a cpu goes offline, the LDBAR for that cpu is disabled, and when it comes
> back online the previous ldbar value is written back to the LDBAR for that 
> cpu.
> 
> To register the hotplug functions for thread_imc, a new state
> CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE is added to the list of existing
> states.
> 
> Cc: Gautham R. Shenoy 
> Cc: Balbir Singh 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Anton Blanchard 
> Cc: Sukadev Bhattiprolu 
> Cc: Michael Neuling 
> Cc: Stewart Smith 
> Cc: Daniel Axtens 
> Cc: Stephane Eranian 
> Signed-off-by: Anju T Sudhakar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/perf/imc-pmu.c | 33 -
>  include/linux/cpuhotplug.h  |  1 +
>  2 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index 6802960db51c..2ff39fe2a5ce 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -687,6 +687,16 @@ static void cleanup_all_thread_imc_memory(void)
>   on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
>  }
> 
> +static void thread_imc_update_ldbar(unsigned int cpu_id)
> +{
> + u64 ldbar_addr, ldbar_value;
> +
> + ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
> + ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
> + (u64)THREAD_IMC_ENABLE;
> + mtspr(SPRN_LDBAR, ldbar_value);
> +}
> +
>  /*
>   * Allocates a page of memory for each of the online cpus, and, writes the
>   * physical base address of that page to the LDBAR for that cpu. This starts
> @@ -694,20 +704,33 @@ static void cleanup_all_thread_imc_memory(void)
>   */
>  static void thread_imc_mem_alloc(void *dummy)
>  {
> - u64 ldbar_addr, ldbar_value;
>   int cpu_id = smp_processor_id();
> 
>   per_cpu_add[cpu_id] = (u64)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>   0);
> - ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
> - ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
> - (u64)THREAD_IMC_ENABLE;
> - mtspr(SPRN_LDBAR, ldbar_value);
> + thread_imc_update_ldbar(cpu_id);
> +}
> +
> +static int ppc_thread_imc_cpu_online(unsigned int cpu)
> +{
> + thread_imc_update_ldbar(cpu);
> + return 0;
> +
> +}
> +
> +static int ppc_thread_imc_cpu_offline(unsigned int cpu)
> +{
> + mtspr(SPRN_LDBAR, 0);
> + return 0;
>  }

This patch looks ok to me.

So it appears that in case of a full-core deep stop entry/exit you
will need to save/restore LDBAR as well. But I will take it up for the
next set of stop cleanups.

For this patch,
Reviewed-by: Gautham R. Shenoy 

> 
>  void thread_imc_cpu_init(void)
>  {
>   on_each_cpu(thread_imc_mem_alloc, NULL, 1);
> + cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
> +   "POWER_THREAD_IMC_ONLINE",
> +ppc_thread_imc_cpu_online,
> +ppc_thread_imc_cpu_offline);
>  }
> 
>  static void thread_imc_ldbar_disable(void *dummy)
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index abde85d9511a..724df46b2c3c 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -138,6 +138,7 @@ enum cpuhp_state {
>   CPUHP_AP_PERF_ARM_L2X0_ONLINE,
>   CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
>   CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
> + CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
>   CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
>   CPUHP_AP_WORKQUEUE_ONLINE,
>   CPUHP_AP_RCUTREE_ONLINE,
> -- 
> 2.7.4
> 
--
Thanks and Regards
gautham.



Re: [PATCH v4 1/5] crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE

2017-03-23 Thread Hari Bathini

Hi Michael,

It's been a while since this patchset is Ack'ed.
Should this go through powerpc-tree or some other?

Thanks
Hari

On Thursday 05 January 2017 10:59 PM, Hari Bathini wrote:

Traditionally, kdump is used to save vmcore in case of a crash. Some
architectures like powerpc can save vmcore using architecture specific
support instead of kexec/kdump mechanism. Such architecture specific
support also needs to reserve memory, to be used by dump capture kernel.
crashkernel parameter can be a reused, for memory reservation, by such
architecture specific infrastructure.

But currently, code related to vmcoreinfo and parsing of crashkernel
parameter is built under CONFIG_KEXEC_CORE. This patch introduces
CONFIG_CRASH_CORE and moves the above mentioned code under this config,
allowing code reuse without dependency on CONFIG_KEXEC. There is no
functional change with this patch.

Signed-off-by: Hari Bathini 
---

Changes from v3:
* Renamed log_buf_kexec_setup()to log_buf_vmcoreinfo_setup() instead of
   log_buf_crash_setup().

Changes from v2:
* Used CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE at
   appropriate places in printk and ksysfs.


  arch/Kconfig   |4
  include/linux/crash_core.h |   65 ++
  include/linux/kexec.h  |   57 --
  include/linux/printk.h |4
  kernel/Makefile|1
  kernel/crash_core.c|  445 
  kernel/kexec_core.c|  404 
  kernel/ksysfs.c|8 +
  kernel/printk/printk.c |6 -
  9 files changed, 531 insertions(+), 463 deletions(-)
  create mode 100644 include/linux/crash_core.h
  create mode 100644 kernel/crash_core.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 99839c2..82e6f99 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -2,7 +2,11 @@
  # General architecture dependent options
  #

+config CRASH_CORE
+   bool
+
  config KEXEC_CORE
+   select CRASH_CORE
bool

  config HAVE_IMA_KEXEC
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
new file mode 100644
index 000..18d0f94
--- /dev/null
+++ b/include/linux/crash_core.h
@@ -0,0 +1,65 @@
+#ifndef LINUX_CRASH_CORE_H
+#define LINUX_CRASH_CORE_H
+
+#include 
+#include 
+#include 
+
+#define CRASH_CORE_NOTE_NAME  "CORE"
+#define CRASH_CORE_NOTE_HEAD_BYTES ALIGN(sizeof(struct elf_note), 4)
+#define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
+#define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
+
+#define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
+CRASH_CORE_NOTE_NAME_BYTES +   \
+CRASH_CORE_NOTE_DESC_BYTES)
+
+#define VMCOREINFO_BYTES  (4096)
+#define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
+#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
+#define VMCOREINFO_NOTE_SIZE  ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
+VMCOREINFO_NOTE_NAME_BYTES +   \
+VMCOREINFO_BYTES)
+
+typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
+
+void crash_save_vmcoreinfo(void);
+void arch_crash_save_vmcoreinfo(void);
+__printf(1, 2)
+void vmcoreinfo_append_str(const char *fmt, ...);
+phys_addr_t paddr_vmcoreinfo_note(void);
+
+#define VMCOREINFO_OSRELEASE(value) \
+   vmcoreinfo_append_str("OSRELEASE=%s\n", value)
+#define VMCOREINFO_PAGESIZE(value) \
+   vmcoreinfo_append_str("PAGESIZE=%ld\n", value)
+#define VMCOREINFO_SYMBOL(name) \
+   vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long))
+#define VMCOREINFO_SIZE(name) \
+   vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
+ (unsigned long)sizeof(name))
+#define VMCOREINFO_STRUCT_SIZE(name) \
+   vmcoreinfo_append_str("SIZE(%s)=%lu\n", #name, \
+ (unsigned long)sizeof(struct name))
+#define VMCOREINFO_OFFSET(name, field) \
+   vmcoreinfo_append_str("OFFSET(%s.%s)=%lu\n", #name, #field, \
+ (unsigned long)offsetof(struct name, field))
+#define VMCOREINFO_LENGTH(name, value) \
+   vmcoreinfo_append_str("LENGTH(%s)=%lu\n", #name, (unsigned long)value)
+#define VMCOREINFO_NUMBER(name) \
+   vmcoreinfo_append_str("NUMBER(%s)=%ld\n", #name, (long)name)
+#define VMCOREINFO_CONFIG(name) \
+   vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
+
+extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+extern size_t vmcoreinfo_size;
+extern size_t vmcoreinfo_max_size;
+
+int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
+   unsigned long long *crash_size, unsigned long long *crash_base);
+int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
+   unsigned long long *crash_size, unsigned long long *crash_base);
+int parse_crashkernel_low(char *cmdline, 

Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki  [2017-03-23 16:28:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
>  wrote:
> > drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> > On PowerNV platform cpu_present could be less than cpu_possible in cases
> > where firmware detects the cpu, but it is not available to the OS.  When
> > CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
> > we skip creating cpu_device.
> >
> > This breaks cpuidle on powernv where register_cpu() is not called for
> > cpus in cpu_possible_mask that cannot be hot-added at runtime.
> >
> > Trying cpuidle_register_device() on cpu without cpu_device will cause
> > crash like this:
> >
> > cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
> > pc: c022c8bc: string+0x34/0x60
> > lr: c022ed78: vsnprintf+0x284/0x42c
> > sp: c00ff1503710
> >msr: 90009033
> >dar: 60006000
> >   current = 0xc00ff148
> >   paca= 0xcfe82d00   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/8
> > Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
> > (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
> > enter ? for help
> > [link register   ] c022ed78 vsnprintf+0x284/0x42c
> > [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
> > [c00ff1503800] c022ef40 vscnprintf+0x20/0x44
> > [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
> > [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
> > [c00ff15038c0] c0619694 printk+0x38/0x4c
> > [c00ff15038e0] c0224950 kobject_get+0x40/0x60
> > [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
> > [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
> > [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
> > [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
> > [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
> > [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
> > [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
> > [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
> > [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
> > [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78
> >
> > This patch fixes the bug by passing correct cpumask from
> > powernv-cpuidle driver.
> >
> > Signed-off-by: Vaidyanathan Srinivasan 
> 
> That needs to be ACKed by someone familiar with powernv.

Previous version at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-March/155587.html

I had not CCed linux-pm in the first post.

Michael and Mikey have reviewed the previous version.  Let me get an
ack for you to proceed with the merge.

Thanks,
Vaidy



Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
* Rafael J. Wysocki  [2017-03-23 16:27:31]:

> On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
>  wrote:
> > If a given cpu is not in cpu_present and cpu hotplug
> > is disabled, arch can skip setting up the cpu_dev.
> >
> > Arch cpuidle driver should pass correct cpu mask
> > for registration, but failing to do so by the driver
> > causes error to propagate and crash like this:
> >
> > [   30.076045] Unable to handle kernel paging request for
> > data at address 0x0048
> > [   30.076100] Faulting instruction address: 0xc07b2f30
> > cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
> > pc: c07b2f30: kobject_get+0x20/0x70
> > lr: c07b3c94: kobject_add_internal+0x54/0x3f0
> > sp: c03feb18b8f0
> >msr: 90009033
> >dar: 48
> >  dsisr: 4000
> >   current = 0xc03fd2ed8300
> >   paca= 0xcfbab500   softe: 0irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
> > 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
> > enter ? for help
> > [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
> > [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
> > [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
> > [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
> > [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
> > [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
> > [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
> > [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
> > [c03feb18bdc0] c000d864 kernel_init+0x24/0x160
> > [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74
> >
> > Validating cpu_dev fixes the crash and reports correct error message like:
> >
> > [   30.163506] Failed to register cpuidle device for cpu136
> > [   30.173329] Registration of powernv driver failed.
> >
> > Signed-off-by: Vaidyanathan Srinivasan 
> 
> The previous version is in linux-next already and I'm going to push it
> for merging shortly.

Thanks Rafael.  The previous version is good for merge.

--Vaidy



Re: [PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Rafael J. Wysocki
On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
 wrote:
> drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
> On PowerNV platform cpu_present could be less than cpu_possible in cases
> where firmware detects the cpu, but it is not available to the OS.  When
> CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
> we skip creating cpu_device.
>
> This breaks cpuidle on powernv where register_cpu() is not called for
> cpus in cpu_possible_mask that cannot be hot-added at runtime.
>
> Trying cpuidle_register_device() on cpu without cpu_device will cause
> crash like this:
>
> cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
> pc: c022c8bc: string+0x34/0x60
> lr: c022ed78: vsnprintf+0x284/0x42c
> sp: c00ff1503710
>msr: 90009033
>dar: 60006000
>   current = 0xc00ff148
>   paca= 0xcfe82d00   softe: 0irq_happened: 0x01
> pid   = 1, comm = swapper/8
> Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
> (Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
> enter ? for help
> [link register   ] c022ed78 vsnprintf+0x284/0x42c
> [c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
> [c00ff1503800] c022ef40 vscnprintf+0x20/0x44
> [c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
> [c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
> [c00ff15038c0] c0619694 printk+0x38/0x4c
> [c00ff15038e0] c0224950 kobject_get+0x40/0x60
> [c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
> [c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
> [c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
> [c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
> [c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
> [c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
> [c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
> [c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
> [c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
> [c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78
>
> This patch fixes the bug by passing correct cpumask from
> powernv-cpuidle driver.
>
> Signed-off-by: Vaidyanathan Srinivasan 

That needs to be ACKed by someone familiar with powernv.

> ---
>  drivers/cpuidle/cpuidle-powernv.c | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c 
> b/drivers/cpuidle/cpuidle-powernv.c
> index a06df51..82f7b33 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
> drv->state_count += 1;
> }
>
> +   /*
> +* On PowerNV platform cpu_present may be less that cpu_possible in
> +* cases where firmware detects the cpu, but it is not available to 
> the
> +* OS.  If CONFIG_HOTPLUG_CPU=n then such CPUs are not hotplugable at
> +* runtime and hence cpu_devices are not created for those cpus by
> +* generic topology_init().
> +*
> +* drv->cpumask defaults to cpu_possible_mask in
> +* __cpuidle_driver_init().  This breaks cpuidle on powernv where
> +* cpu_devices are not created for cpus in cpu_possible_mask that
> +* cannot be hot-added later at runtime.
> +*
> +* Trying cpuidle_register_device() on a cpu without cpu_devices is
> +* incorrect. Hence pass correct cpu mask to generic cpuidle driver.
> +*/
> +
> +   drv->cpumask = (struct cpumask *)cpu_present_mask;
> +
> return 0;
>  }
>
> --
> 2.9.3
>


Re: [PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Rafael J. Wysocki
On Thu, Mar 23, 2017 at 4:22 PM, Vaidyanathan Srinivasan
 wrote:
> If a given cpu is not in cpu_present and cpu hotplug
> is disabled, arch can skip setting up the cpu_dev.
>
> Arch cpuidle driver should pass correct cpu mask
> for registration, but failing to do so by the driver
> causes error to propagate and crash like this:
>
> [   30.076045] Unable to handle kernel paging request for
> data at address 0x0048
> [   30.076100] Faulting instruction address: 0xc07b2f30
> cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
> pc: c07b2f30: kobject_get+0x20/0x70
> lr: c07b3c94: kobject_add_internal+0x54/0x3f0
> sp: c03feb18b8f0
>msr: 90009033
>dar: 48
>  dsisr: 4000
>   current = 0xc03fd2ed8300
>   paca= 0xcfbab500   softe: 0irq_happened: 0x01
> pid   = 1, comm = swapper/0
> Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
> 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
> enter ? for help
> [c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
> [c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
> [c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
> [c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
> [c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
> [c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
> [c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
> [c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
> [c03feb18bdc0] c000d864 kernel_init+0x24/0x160
> [c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74
>
> Validating cpu_dev fixes the crash and reports correct error message like:
>
> [   30.163506] Failed to register cpuidle device for cpu136
> [   30.173329] Registration of powernv driver failed.
>
> Signed-off-by: Vaidyanathan Srinivasan 

The previous version is in linux-next already and I'm going to push it
for merging shortly.

> ---
>  drivers/cpuidle/sysfs.c | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
> index c5adc8c..f2c3bce 100644
> --- a/drivers/cpuidle/sysfs.c
> +++ b/drivers/cpuidle/sysfs.c
> @@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
> struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
> int error;
>
> +   /*
> +* Return error if cpu_device is not setup for this cpu.  This
> +* could happen if arch did not setup cpu_device since this
> +* cpu is not in cpu_present mask and the driver did not send
> +* correct cpu mask at registration.  Without this check we
> +* would end up passing bogus value for _dev->kobj in
> +* kobject_init_and_add().
> +*/
> +
> +   if (!cpu_dev)
> +   return -ENODEV;
> +
> kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
> if (!kdev)
> return -ENOMEM;
> --
> 2.9.3
>


[PATCH v2 2/2] cpuidle: Validate cpu_dev in cpuidle_add_sysfs

2017-03-23 Thread Vaidyanathan Srinivasan
If a given cpu is not in cpu_present and cpu hotplug
is disabled, arch can skip setting up the cpu_dev.

Arch cpuidle driver should pass correct cpu mask
for registration, but failing to do so by the driver
causes error to propagate and crash like this:

[   30.076045] Unable to handle kernel paging request for
data at address 0x0048
[   30.076100] Faulting instruction address: 0xc07b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c03feb18b670]
pc: c07b2f30: kobject_get+0x20/0x70
lr: c07b3c94: kobject_add_internal+0x54/0x3f0
sp: c03feb18b8f0
   msr: 90009033
   dar: 48
 dsisr: 4000
  current = 0xc03fd2ed8300
  paca= 0xcfbab500   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c03feb18b960] c07b3c94 kobject_add_internal+0x54/0x3f0
[c03feb18b9f0] c07b43a4 kobject_init_and_add+0x64/0xa0
[c03feb18ba70] c0e284f4 cpuidle_add_sysfs+0xb4/0x130
[c03feb18baf0] c0e26038 cpuidle_register_device+0x118/0x1c0
[c03feb18bb30] c0e26c48 cpuidle_register+0x78/0x120
[c03feb18bbc0] c168fd9c powernv_processor_idle_init+0x110/0x1c4
[c03feb18bc40] c000cff8 do_one_initcall+0x68/0x1d0
[c03feb18bd00] c16242f4 kernel_init_freeable+0x280/0x360
[c03feb18bdc0] c000d864 kernel_init+0x24/0x160
[c03feb18be30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

Validating cpu_dev fixes the crash and reports correct error message like:

[   30.163506] Failed to register cpuidle device for cpu136
[   30.173329] Registration of powernv driver failed.

Signed-off-by: Vaidyanathan Srinivasan 
---
 drivers/cpuidle/sysfs.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index c5adc8c..f2c3bce 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -615,6 +615,18 @@ int cpuidle_add_sysfs(struct cpuidle_device *dev)
struct device *cpu_dev = get_cpu_device((unsigned long)dev->cpu);
int error;
 
+   /*
+* Return error if cpu_device is not setup for this cpu.  This
+* could happen if arch did not setup cpu_device since this
+* cpu is not in cpu_present mask and the driver did not send
+* correct cpu mask at registration.  Without this check we
+* would end up passing bogus value for _dev->kobj in
+* kobject_init_and_add().
+*/
+
+   if (!cpu_dev)
+   return -ENODEV;
+
kdev = kzalloc(sizeof(*kdev), GFP_KERNEL);
if (!kdev)
return -ENOMEM;
-- 
2.9.3



[PATCH v2 1/2] powerpc/powernv/cpuidle: Pass correct drv->cpumask for registration

2017-03-23 Thread Vaidyanathan Srinivasan
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
On PowerNV platform cpu_present could be less than cpu_possible in cases
where firmware detects the cpu, but it is not available to the OS.  When
CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
we skip creating cpu_device.

This breaks cpuidle on powernv where register_cpu() is not called for
cpus in cpu_possible_mask that cannot be hot-added at runtime.

Trying cpuidle_register_device() on cpu without cpu_device will cause
crash like this:

cpu 0xf: Vector: 380 (Data SLB Access) at [c00ff1503490]
pc: c022c8bc: string+0x34/0x60
lr: c022ed78: vsnprintf+0x284/0x42c
sp: c00ff1503710
   msr: 90009033
   dar: 60006000
  current = 0xc00ff148
  paca= 0xcfe82d00   softe: 0irq_happened: 0x01
pid   = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
(Buildroot 2017.02-4-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register   ] c022ed78 vsnprintf+0x284/0x42c
[c00ff1503710] c022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c00ff1503800] c022ef40 vscnprintf+0x20/0x44
[c00ff1503830] c00ab61c vprintk_emit+0x94/0x2cc
[c00ff15038a0] c00acc9c vprintk_func+0x60/0x74
[c00ff15038c0] c0619694 printk+0x38/0x4c
[c00ff15038e0] c0224950 kobject_get+0x40/0x60
[c00ff1503950] c022507c kobject_add_internal+0x60/0x2c4
[c00ff15039e0] c0225350 kobject_init_and_add+0x70/0x78
[c00ff1503a60] c053c288 cpuidle_add_sysfs+0x9c/0xe0
[c00ff1503ae0] c053aeac cpuidle_register_device+0xd4/0x12c
[c00ff1503b30] c053b108 cpuidle_register+0x98/0xcc
[c00ff1503bc0] c085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c00ff1503c60] c000cd60 do_one_initcall+0xc0/0x15c
[c00ff1503d20] c0833e84 kernel_init_freeable+0x1a0/0x25c
[c00ff1503dc0] c000d478 kernel_init+0x24/0x12c
[c00ff1503e30] c000b564 ret_from_kernel_thread+0x5c/0x78

This patch fixes the bug by passing correct cpumask from
powernv-cpuidle driver.

Signed-off-by: Vaidyanathan Srinivasan 
---
 drivers/cpuidle/cpuidle-powernv.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c 
b/drivers/cpuidle/cpuidle-powernv.c
index a06df51..82f7b33 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -175,6 +175,24 @@ static int powernv_cpuidle_driver_init(void)
drv->state_count += 1;
}
 
+   /*
+* On PowerNV platform cpu_present may be less that cpu_possible in
+* cases where firmware detects the cpu, but it is not available to the
+* OS.  If CONFIG_HOTPLUG_CPU=n then such CPUs are not hotplugable at
+* runtime and hence cpu_devices are not created for those cpus by
+* generic topology_init().
+*
+* drv->cpumask defaults to cpu_possible_mask in
+* __cpuidle_driver_init().  This breaks cpuidle on powernv where
+* cpu_devices are not created for cpus in cpu_possible_mask that
+* cannot be hot-added later at runtime.
+*
+* Trying cpuidle_register_device() on a cpu without cpu_devices is
+* incorrect. Hence pass correct cpu mask to generic cpuidle driver.
+*/
+
+   drv->cpumask = (struct cpumask *)cpu_present_mask;
+
return 0;
 }
 
-- 
2.9.3



[PATCH v1 0/2] cpuidle: Fixes in cpuidle driver

2017-03-23 Thread Vaidyanathan Srinivasan
When CONFIG_HOTPLUG_CPU=n and cpu_present is less than cpu_possible,
then cpuidle-powernv not passing an explicit drv->cpu_mask allows
generic cpuidle driver to try create sysfs objects for cpus that does
not have cpu_devices created by calling register_cpu().

This caused kernel to access incorrect address and crash. The
following patch series fixes the cpuidle-powernv driver and also adds
additional checks in cpuidle_add_sysfs()

This patch set is against v4.11-rc3.

Changed from v1: Updated commit message and comments.

Signed-off-by: Vaidyanathan Srinivasan 



Re: [PATCH v5 08/13] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-03-23 Thread Gautham R Shenoy
Hi Maddy, Hemant, Anju,

On Thu, Mar 16, 2017 at 01:05:02PM +0530, Madhavan Srinivasan wrote:

[..snip..]

> +
> +static void core_imc_change_cpu_context(int old_cpu, int new_cpu)
> +{
> + if (!core_imc_pmu)
> + return;
> + perf_pmu_migrate_context(_imc_pmu->pmu, old_cpu, new_cpu);
> +}
> +
> +
> +static int ppc_core_imc_cpu_online(unsigned int cpu)
> +{
> + int ret;
> +
> + /* If a cpu for this core is already set, then, don't do anything */
> + ret = cpumask_any_and(_imc_cpumask,
> +  cpu_sibling_mask(cpu));
> + if (ret < nr_cpu_ids)
> + return 0;
> +
> + /* Else, set the cpu in the mask, and change the context */
> + cpumask_set_cpu(cpu, _imc_cpumask);
> + core_imc_change_cpu_context(-1, cpu);

So, in the core case, we are ok as long as any cpu in the core is
present in the imc_cpumask. It need not have to be the smallest online
cpu in the core.

Can the same logic be applied to the earlier nest case ?

We can have a single function for cpu_offline and cpu_online which
implements these checks and sets the cpu bit if required.

ppc_entity_imc_cpu_offline(unsigned int cpu, cpumask_t
   entity_imc_mask,
   entity_imc_change_cpu_context_fn)
{
.
.
.

}


static ppc_nest_imc_cpu_offline(unsigned int cpu)
{
return ppc_entity_imc_cpu_offline(cpu, nest_imc_mask,
  nest_imc_change_cpu_context);
}

And similar ones for core imc and thread imc.

Does this sound reasonable ?

> + return 0;
> +}
> +
> +static int ppc_core_imc_cpu_offline(unsigned int cpu)
> +{
> + int target;
> + unsigned int ncpu;
> +
> + /*
> +  * clear this cpu out of the mask, if not present in the mask,
> +  * don't bother doing anything.
> +  */
> + if (!cpumask_test_and_clear_cpu(cpu, _imc_cpumask))
> + return 0;
> +
> + /* Find any online cpu in that core except the current "cpu" */
> + ncpu = cpumask_any_but(cpu_sibling_mask(cpu), cpu);
> +
> + if (ncpu < nr_cpu_ids) {
> + target = ncpu;
> + cpumask_set_cpu(target, _imc_cpumask);
> + } else
> + target = -1;
> +
> + /* migrate the context */
> + core_imc_change_cpu_context(cpu, target);
> +
> + return 0;
> +}
> +

--
Thanks and Regards
gautham.



[PATCH] ASoC: WM8962: Let codec driver enable/disable its MCLK

2017-03-23 Thread Daniel Baluta
From: Nicolin Chen 

WM8962 needs its MCLK when powerup in wm8962_resume(). Thus it's better
to control the MCLK in codec driver. Thus remove the clock enable in
machine dirver accordingly.

While at it, get rid of imx_wm8962_remove function since it is now
empty.

Signed-off-by: Nicolin Chen 
Signed-off-by: Daniel Baluta 
---
 sound/soc/fsl/imx-wm8962.c | 40 
 1 file changed, 8 insertions(+), 32 deletions(-)

diff --git a/sound/soc/fsl/imx-wm8962.c b/sound/soc/fsl/imx-wm8962.c
index 1b60958..3d894d9 100644
--- a/sound/soc/fsl/imx-wm8962.c
+++ b/sound/soc/fsl/imx-wm8962.c
@@ -33,7 +33,6 @@ struct imx_wm8962_data {
struct snd_soc_card card;
char codec_dai_name[DAI_NAME_SIZE];
char platform_name[DAI_NAME_SIZE];
-   struct clk *codec_clk;
unsigned int clk_frequency;
 };
 
@@ -163,6 +162,7 @@ static int imx_wm8962_probe(struct platform_device *pdev)
struct imx_priv *priv = _priv;
struct i2c_client *codec_dev;
struct imx_wm8962_data *data;
+   struct clk *codec_clk;
int int_port, ext_port;
int ret;
 
@@ -231,19 +231,14 @@ static int imx_wm8962_probe(struct platform_device *pdev)
goto fail;
}
 
-   data->codec_clk = devm_clk_get(_dev->dev, NULL);
-   if (IS_ERR(data->codec_clk)) {
-   ret = PTR_ERR(data->codec_clk);
+   codec_clk = devm_clk_get(_dev->dev, NULL);
+   if (IS_ERR(codec_clk)) {
+   ret = PTR_ERR(codec_clk);
dev_err(_dev->dev, "failed to get codec clk: %d\n", ret);
goto fail;
}
 
-   data->clk_frequency = clk_get_rate(data->codec_clk);
-   ret = clk_prepare_enable(data->codec_clk);
-   if (ret) {
-   dev_err(_dev->dev, "failed to enable codec clk: %d\n", 
ret);
-   goto fail;
-   }
+   data->clk_frequency = clk_get_rate(codec_clk);
 
data->dai.name = "HiFi";
data->dai.stream_name = "HiFi";
@@ -258,10 +253,10 @@ static int imx_wm8962_probe(struct platform_device *pdev)
data->card.dev = >dev;
ret = snd_soc_of_parse_card_name(>card, "model");
if (ret)
-   goto clk_fail;
+   goto fail;
ret = snd_soc_of_parse_audio_routing(>card, "audio-routing");
if (ret)
-   goto clk_fail;
+   goto fail;
data->card.num_links = 1;
data->card.owner = THIS_MODULE;
data->card.dai_link = >dai;
@@ -277,16 +272,9 @@ static int imx_wm8962_probe(struct platform_device *pdev)
ret = devm_snd_soc_register_card(>dev, >card);
if (ret) {
dev_err(>dev, "snd_soc_register_card failed (%d)\n", ret);
-   goto clk_fail;
+   goto fail;
}
 
-   of_node_put(ssi_np);
-   of_node_put(codec_np);
-
-   return 0;
-
-clk_fail:
-   clk_disable_unprepare(data->codec_clk);
 fail:
of_node_put(ssi_np);
of_node_put(codec_np);
@@ -294,17 +282,6 @@ static int imx_wm8962_probe(struct platform_device *pdev)
return ret;
 }
 
-static int imx_wm8962_remove(struct platform_device *pdev)
-{
-   struct snd_soc_card *card = platform_get_drvdata(pdev);
-   struct imx_wm8962_data *data = snd_soc_card_get_drvdata(card);
-
-   if (!IS_ERR(data->codec_clk))
-   clk_disable_unprepare(data->codec_clk);
-
-   return 0;
-}
-
 static const struct of_device_id imx_wm8962_dt_ids[] = {
{ .compatible = "fsl,imx-audio-wm8962", },
{ /* sentinel */ }
@@ -318,7 +295,6 @@ static struct platform_driver imx_wm8962_driver = {
.of_match_table = imx_wm8962_dt_ids,
},
.probe = imx_wm8962_probe,
-   .remove = imx_wm8962_remove,
 };
 module_platform_driver(imx_wm8962_driver);
 
-- 
2.7.4



Re: [PATCH v5 06/13] powerpc/perf: IMC pmu cpumask and cpu hotplug support

2017-03-23 Thread Gautham R Shenoy
Hi Hemant, Maddy, 

On Thu, Mar 16, 2017 at 01:05:00PM +0530, Madhavan Srinivasan wrote:
> From: Hemant Kumar 
> 
> Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
> online CPU) from each chip for nest PMUs is designated to read counters.
> 
> On CPU hotplug, dying CPU is checked to see whether it is one of the
> designated cpus, if yes, next online cpu from the same chip (for nest
> units) is designated as new cpu to read counters. For this purpose, we
> introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
> 
> Cc: Gautham R. Shenoy 
> Cc: Balbir Singh 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Anton Blanchard 
> Cc: Sukadev Bhattiprolu 
> Cc: Michael Neuling 
> Cc: Stewart Smith 
> Cc: Daniel Axtens 
> Cc: Stephane Eranian 
> Cc: Anju T Sudhakar 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/include/asm/opal-api.h|   3 +-
>  arch/powerpc/include/asm/opal.h|   3 +
>  arch/powerpc/perf/imc-pmu.c| 163 
> -
>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>  include/linux/cpuhotplug.h |   1 +
>  5 files changed, 169 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index a0aa285869b5..e1c3d4837857 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -168,7 +168,8 @@
>  #define OPAL_INT_SET_MFRR125
>  #define OPAL_PCI_TCE_KILL126
>  #define OPAL_NMMU_SET_PTCR   127
> -#define OPAL_LAST127
> +#define OPAL_NEST_IMC_COUNTERS_CONTROL   145
> +#define OPAL_LAST145
> 
>  /* Device tree flags */
> 
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 1ff03a6da76e..d93d08204243 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t 
> kill_type,
> uint64_t dma_addr, uint32_t npages);
>  int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
> 
> +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
> + uint64_t value2, uint64_t value3);
> +
>  /* Internal functions */
>  extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
>  int depth, void *data);
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index f6f1ef9f56af..e46ff6d2a584 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -16,6 +16,7 @@
> +static int ppc_nest_imc_cpu_online(unsigned int cpu)
> +{

I take it that 'cpu' is coming online.

> + int nid, fcpu, ncpu;
> + struct cpumask *l_cpumask, tmp_mask;
> +
> + /* Fint the cpumask of this node */
> + nid = cpu_to_node(cpu);
> + l_cpumask = cpumask_of_node(nid);
> +
> + /*
> +  * If any of the cpu from this node is already present in the mask,
> +  * just return, if not, then set this cpu in the mask.
> +  */
> + if (!cpumask_and(_mask, l_cpumask, _imc_cpumask)) {

In this case, none of the cpus in the node are in the mask. So we set
and this cpu in the imc cpumask and return.

> + cpumask_set_cpu(cpu, _imc_cpumask);
> + return 0;
> + }

But this case implies that there is already a CPU from the node which
is in the imc_cpumask. As per the comment above, we are supposed to
just return. So why are we doing the following ?

Either the comment above is incorrect or I am missing something here.

> +
> + fcpu = cpumask_first(l_cpumask);
> + ncpu = cpumask_next(cpu, l_cpumask);
> + if (cpu == fcpu) {
> + if (cpumask_test_and_clear_cpu(ncpu, _imc_cpumask)) {
> + cpumask_set_cpu(cpu, _imc_cpumask);
> + nest_change_cpu_context(ncpu, cpu);
> + }
> + }

It seems that we want to set only the smallest online cpu in the node
in the nest_imc_cpumask. So, if the newly onlined cpu is the smallest,
we replace the previous representative with cpu.

So, the comment above needs to be fixed.

> +
> + return 0;
> +}
> +
> +static int ppc_nest_imc_cpu_offline(unsigned int cpu)
> +{
> + int nid, target = -1;
> + struct cpumask *l_cpumask;
> +
> + /*
> +  * Check in the designated list for this cpu. Dont bother
> +  * if not one of them.
> +  */
> + if (!cpumask_test_and_clear_cpu(cpu, 

[PATCH v2] powerpc/powernv: de-deuplicate OPAL call wrappers

2017-03-23 Thread Oliver O'Halloran
Currently the code to perform an OPAL call is duplicated between the
normal path and path taken when tracepoints are enabled. There's no
real need for this and combining them makes opal_tracepoint_entry
considerably easier to understand.

Signed-off-by: Oliver O'Halloran 
---
v1 -> v2:
slight rework due to the real mode opal call changes
---
 arch/powerpc/platforms/powernv/opal-wrappers.S | 53 +++---
 1 file changed, 22 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
b/arch/powerpc/platforms/powernv/opal-wrappers.S
index da8a0f7a035c..ebf6719d241a 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -50,21 +50,13 @@ END_FTR_SECTION(0, 1);  
\
 #define OPAL_BRANCH(LABEL)
 #endif
 
-/* TODO:
- *
- * - Trace irqs in/off (needs saving/restoring all args, argh...)
- * - Get r11 feed up by Dave so I can have better register usage
+/*
+ * DO_OPAL_CALL assumes:
+ * r0  = opal call token
+ * r12 = msr
+ * LR has been saved
  */
-
-#define OPAL_CALL(name, token) \
- _GLOBAL_TOC(name);\
-   mfmsr   r12;\
-   mflrr0; \
-   andi.   r11,r12,MSR_IR|MSR_DR;  \
-   std r0,PPC_LR_STKOFF(r1);   \
-   li  r0,token;   \
-   beq opal_real_call; \
-   OPAL_BRANCH(opal_tracepoint_entry) \
+#define DO_OPAL_CALL() \
mfcrr11;\
stw r11,8(r1);  \
li  r11,0;  \
@@ -83,6 +75,18 @@ END_FTR_SECTION(0, 1);   
\
mtspr   SPRN_HSRR0,r12; \
hrfid
 
+#define OPAL_CALL(name, token) \
+ _GLOBAL_TOC(name);\
+   mfmsr   r12;\
+   mflrr0; \
+   andi.   r11,r12,MSR_IR|MSR_DR;  \
+   std r0,PPC_LR_STKOFF(r1);   \
+   li  r0,token;   \
+   beq opal_real_call; \
+   OPAL_BRANCH(opal_tracepoint_entry) \
+   DO_OPAL_CALL()
+
+
 opal_return:
/*
 * Fixup endian on OPAL return... we should be able to simplify
@@ -148,26 +152,13 @@ opal_tracepoint_entry:
ld  r8,STK_REG(R29)(r1)
ld  r9,STK_REG(R30)(r1)
ld  r10,STK_REG(R31)(r1)
+
+   /* setup LR so we return via tracepoint_return */
LOAD_REG_ADDR(r11,opal_tracepoint_return)
-   mfcrr12
std r11,16(r1)
-   stw r12,8(r1)
-   li  r11,0
+
mfmsr   r12
-   ori r11,r11,MSR_EE
-   std r12,PACASAVEDMSR(r13)
-   andcr12,r12,r11
-   mtmsrd  r12,1
-   LOAD_REG_ADDR(r11,opal_return)
-   mtlrr11
-   li  r11,MSR_DR|MSR_IR|MSR_LE
-   andcr12,r12,r11
-   mtspr   SPRN_HSRR1,r12
-   LOAD_REG_ADDR(r11,opal)
-   ld  r12,8(r11)
-   ld  r2,0(r11)
-   mtspr   SPRN_HSRR0,r12
-   hrfid
+   DO_OPAL_CALL()
 
 opal_tracepoint_return:
std r3,STK_REG(R31)(r1)
-- 
2.9.3



Re: [Skiboot] [PATCH][OPAL] cpufeatures: add base and POWER8, POWER9 /cpus/features dt

2017-03-23 Thread Stewart Smith
Nicholas Piggin  writes:
> On Tue, 21 Mar 2017 15:42:22 +1100
> Stewart Smith  wrote:
>
>> Nicholas Piggin  writes:
>> > With this patch and the Linux one, I can boot (in mambo) a POWER8 or
>> > POWER9 without looking up any cpu tables, and mainly looking at PVR
>> > for MCE and PMU.  
>> 
>> That's pretty neat. I now await the day where somebody produces a chip
>> with uneven feature sets between cores.
>> 
>> > Machine and ISA speicfic features that are not abstracted by firmware
>> > and not captured here will have to be handled on a case by case basis,
>> > using PVR if necessary. Major ones that remain are PMU and machine
>> > check.  
>> 
>> At least for machine check we could probably get something "good enough"
>> without too much effort.
>
> Machine check I'd like to put it in opal with a special case entry from
> the hypervisor. At the moment it just has a lot of implementation specific
> decoding of bits, and we can't really call opal to do anything useful
> with normal call because it re-enters the stack.

Ahh yep. Something along the lines of a machine check specific stack in
OPAL could do it, and we queue up calls etc etc.

It turns out my "not too much effort" metric is possibly different from
other people's :)

>> For PMU, I wonder if we could get something that's suitably common with
>> the IMC work being done so that we could "just work" on new PMUs (albeit
>> calling into OPAL for enable/disable)
>
> I don't know the PMU code, but if we could have some kind of firmware
> fallback like that it would be perfect.
>
> For machine specific implementations that are faster or more capable,
> I guess looking at PVR is not taboo as such. Just that it shouldn't
> be needed to boot and get some reasonable baseline.

Yeah, I'm not that familiar with it either. I need to go look over it
all in my copious amount of free time.

>> > Open question is where and how to develop and document these features?
>> > Not the dt representation created by skiboot, but the exact nature of
>> > each feature. What exact behaviour does a particular feature imply,
>> > etc.  
>> 
>> Part of me seems to think this could be something for the Architecture,
>> and then we just have a 1-to-1 mapping of the arch bits and we're all
>> one big happy family
>> 
>> Benh/Paulus can probably laugh at me suitably hard for suggesting such a
>> thing being possible though :)
>
> I think to a degree we might be moving in that direction. Probably
> can't say much more in public at the moment, but at least this dt
> implementation must be flexible enough to cope with a range of
> approaches we might decide to take (fine/coarse grained, fscr bits
> or not, etc).

Yeah, I think so too.

>> > --- a/hdata/cpu-common.c
>> > +++ b/hdata/cpu-common.c  
>> 
>> > +  { "vector-crypto",
>> > +  CPU_ALL,
>> > +  ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
>> > +  HV_SUPPORT_NONE, OS_SUPPORT_NONE,
>> > +  -1, -1, 38,
>> > +  "vector", },  
>> 
>> Did we want to break this down at all? Specifically maybe just the AES
>> instructions?
>> 
>> AFAIK it's been the only set where there was some amount of discussion
>> about somebody possibly not wanting to include.
>
> I don't have much opinion on it. Userspace has only the one feature bit
> there, so missing part of the instructions would have to disable the
> entire thing.
>
> That's not to say we couldn't add another bit and start getting userspace
> to use it. So if there is some need or strong potential future need,
> we should consider it.

Hopefully we're right on ISA3.00 for new instructions that random folk
may say could be disabled if they fab'd their own chip :)

-- 
Stewart Smith
OPAL Architect, IBM.