Re: [PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Jerome Glisse
On Mon, Apr 09, 2018 at 04:07:21PM +0200, Michal Hocko wrote:
> On Mon 09-04-18 15:57:06, Laurent Dufour wrote:
> > The per architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
> > per architecture header files. This doesn't allow making other
> > configuration options dependent on it.
> > 
> > This series moves __HAVE_ARCH_PTE_SPECIAL into the Kconfig files,
> > setting it automatically for the architectures that were already setting
> > it in their header file.
> > 
> > There is no functional change introduced by this series.
> 
> I would just fold all three patches into a single one. It is much easier
> to review that those selects are done properly when you can see that the
> define is set for the same architecture.
> 
> In general, I like the patch. It is always quite painful to track per
> arch defines.

You can also add Reviewed-by: Jérôme Glisse ; my grep-fu
showed no place that was forgotten.

Cheers,
Jérôme


Re: [PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Vineet Gupta

On 04/09/2018 06:57 AM, Laurent Dufour wrote:

The per architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
per architecture header files. This doesn't allow making other
configuration options dependent on it.


So I understand this series has more "readability" value and I'm fine with this
change, but I wonder if you really would want to make something depend on it or
make this de-configurable. PTE special is really a fundamental construct - e.g. it
is used for anon mapped pages where the zero page has been wired up, etc.


-Vineet






ppc compat v4.16 regression: sending SIGTRAP or SIGFPE via kill() returns wrong values in si_pid and si_uid

2018-04-09 Thread Dmitry V. Levin
Hi,

There seems to be a regression in v4.16 on ppc compat very similar
to the sparc compat regression reported earlier at
https://marc.info/?l=linux-sparc&m=151501500704383 .

The symptoms are exactly the same: the same signal_receive test from
the strace test suite fails with the same diagnostics:
https://build.opensuse.org/public/build/home:ldv_alt/openSUSE_Factory_PowerPC/ppc/strace/_log

Unfortunately, I do not have any means to investigate further,
so just passing this information on to those who care.


-- 
ldv




Re: [PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Laurent Dufour
On 09/04/2018 18:03, Vineet Gupta wrote:
> On 04/09/2018 06:57 AM, Laurent Dufour wrote:
>> The per architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
>> per architecture header files. This doesn't allow making other
>> configuration options dependent on it.
> 
> So I understand this series has more "readability" value and I'm fine with this
> change, but I wonder if you really would want to make something depend on it or
> make this de-configurable. PTE special is really a fundamental construct - e.g.
> it is used for anon mapped pages where the zero page has been wired up, etc.

I don't want it to be de-configurable. This is almost like
ARCH_SUPPORTS_MEMORY_FAILURE, ARCH_USES_HIGH_VMA_FLAGS, ARCH_HAS_HMM...

These values are selected by per architecture Kconfig files and are not exposed
through the configuration menu.

Concerning making something depend on it, I will probably make
CONFIG_SPECULATIVE_PAGE_FAULT introduced by the SPF series dependent on it.
For details, please see https://lkml.org/lkml/2018/3/13/1143

Thanks,
Laurent.
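[Editorial note: for readers unfamiliar with the pattern Laurent describes, a rough sketch of the Kconfig side follows. This is an illustration of the approach, not the literal patch content; exact file placement and symbol names are per the actual series.]

```
# mm/Kconfig (sketch): a hidden symbol -- no prompt string, so it never
# appears in the configuration menu and cannot be toggled by the user.
config ARCH_HAS_PTE_SPECIAL
	bool

# arch/*/Kconfig (sketch): an architecture that already defined
# __HAVE_ARCH_PTE_SPECIAL in its headers now selects the symbol instead.
config X86
	select ARCH_HAS_PTE_SPECIAL
```

Other features can then say `depends on ARCH_HAS_PTE_SPECIAL`, which is exactly what a static #define in an arch header could not support.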



[PATCH 4/6] powerpc/powernv: move opal console flushing to udbg

2018-04-09 Thread Nicholas Piggin
OPAL console writes do not have to synchronously flush firmware /
hardware buffers unless they are going through the udbg path.

Remove the unconditional flushing from opal_put_chars. Flush if
there was no space in the buffer as an optimisation (callers loop
waiting for success in that case). udbg flushing is moved to
udbg_opal_putc.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal.c | 12 +++++++-----
 drivers/tty/hvc/hvc_opal.c            |  5 +++++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index a045c446a910..b05500a70f58 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -400,12 +400,14 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
 out:
   spin_unlock_irqrestore(&opal_write_lock, flags);
 
-   /* This is a bit nasty but we need that for the console to
-* flush when there aren't any interrupts. We will clean
-* things a bit later to limit that to synchronous path
-* such as the kernel console and xmon/udbg
+   /* In the -EAGAIN case, callers loop, so we have to flush the console
+* here in case they have interrupts off (and we don't want to wait
+* for async flushing if we can make immediate progress here). If
+* necessary the API could be made entirely non-flushing if the
+* callers had a ->flush API to use.
 */
-   opal_flush_console(vtermno);
+   if (written == -EAGAIN)
+   opal_flush_console(vtermno);
 
return written;
 }
diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index 2ed07ca6389e..af122ad7f06d 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -275,6 +275,11 @@ static void udbg_opal_putc(char c)
   count = hvc_opal_hvsi_put_chars(termno, &c, 1);
break;
}
+
+   /* This is needed for the console to flush
+    * when there aren't any interrupts.
+    */
+   opal_flush_console(termno);
} while(count == 0 || count == -EAGAIN);
 }
 
-- 
2.17.0



Re: [PATCH 5/6] powerpc/powernv: implement opal_put_chars_nonatomic

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> The RAW console does not need writes to be atomic, so implement a
> _nonatomic variant which does not take a spinlock. This API is used
> in xmon, so the less locking thta's used, the better chance there is
> that a crash can be debugged.

I find the term "nonatomic" confusing... don't we have a problem if we
start hitting OPAL without a lock, where we can't trust
opal_console_write_buffer_space anymore? I think we need to handle
partial writes in that case. Maybe we should return how much was
written and leave the caller to deal with it.

I was hoping (but that isn't the case) that by nonatomic you actually
meant calls that could be done in a non-atomic context, where we can do
msleep instead of mdelay. That would be handy for the console coming
from the hvc thread (the tty one).

Cheers,
Ben.

> 
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/opal.h       |  1 +
>  arch/powerpc/platforms/powernv/opal.c | 35 +++++++++++++++++++++++++----------
>  drivers/tty/hvc/hvc_opal.c            |  4 ++--
>  3 files changed, 28 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index bbff49fab0e5..66954d671831 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -303,6 +303,7 @@ extern void opal_configure_cores(void);
>  
>  extern int opal_get_chars(uint32_t vtermno, char *buf, int count);
>  extern int opal_put_chars(uint32_t vtermno, const char *buf, int total_len);
> +extern int opal_put_chars_nonatomic(uint32_t vtermno, const char *buf, int total_len);
>  extern int opal_flush_console(uint32_t vtermno);
>  
>  extern void hvc_opal_init_early(void);
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index b05500a70f58..dc77fc57d1e9 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -344,9 +344,9 @@ int opal_get_chars(uint32_t vtermno, char *buf, int count)
>   return 0;
>  }
>  
> -int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
> +static int __opal_put_chars(uint32_t vtermno, const char *data, int total_len, bool atomic)
>  {
> - unsigned long flags;
> + unsigned long flags = 0 /* shut up gcc */;
>   int written;
>   __be64 olen;
>   s64 rc;
> @@ -354,11 +354,8 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
>   if (!opal.entry)
>   return -ENODEV;
>  
> - /* We want put_chars to be atomic to avoid mangling of hvsi
> -  * packets. To do that, we first test for room and return
> -  * -EAGAIN if there isn't enough.
> -  */
> - spin_lock_irqsave(&opal_write_lock, flags);
> + if (atomic)
> + spin_lock_irqsave(&opal_write_lock, flags);
>   rc = opal_console_write_buffer_space(vtermno, &olen);
>   if (rc || be64_to_cpu(olen) < total_len) {
>   /* Closed -> drop characters */
> @@ -391,14 +388,18 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
>  
>   written = be64_to_cpu(olen);
>   if (written < total_len) {
> - /* Should not happen */
> - pr_warn("atomic console write returned partial len=%d written=%d\n", total_len, written);
> + if (atomic) {
> + /* Should not happen */
> + pr_warn("atomic console write returned partial "
> + "len=%d written=%d\n", total_len, written);
> + }
>   if (!written)
>   written = -EAGAIN;
>   }
>  
>  out:
> - spin_unlock_irqrestore(&opal_write_lock, flags);
> + if (atomic)
> + spin_unlock_irqrestore(&opal_write_lock, flags);
>  
>   /* In the -EAGAIN case, callers loop, so we have to flush the console
>* here in case they have interrupts off (and we don't want to wait
> @@ -412,6 +413,20 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
>   return written;
>  }
>  
> +int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
> +{
> + /* We want put_chars to be atomic to avoid mangling of hvsi
> +  * packets. To do that, we first test for room and return
> +  * -EAGAIN if there isn't enough.
> +  */
> + return __opal_put_chars(vtermno, data, total_len, true);
> +}
> +
> +int opal_put_chars_nonatomic(uint32_t vtermno, const char *data, int total_len)
> +{
> + return __opal_put_chars(vtermno, data, total_len, false);
> +}
> +
>  int opal_flush_console(uint32_t vtermno)
>  {
>   s64 rc;
> diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
> index af122ad7f06d..e151cfacf2a7 100644
> --- a/drivers/tty/hvc/hvc_opal.c
> +++ b/drivers/tty/hvc/hvc_opal.c
> @@ -51,7 +51,7 @@ static u32 hvc_opal_boot_termno;
>  
>  

Re: [PATCH 6/6] drivers/tty/hvc: remove unexplained "just in case" spin delay

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> This delay was in the very first OPAL console commit 6.5 years ago.
> The firmware console has hardened sufficiently to remove it.
> 

Reviewed-by: Benjamin Herrenschmidt 
> Signed-off-by: Nicholas Piggin 
> ---
>  drivers/tty/hvc/hvc_opal.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
> index e151cfacf2a7..436b98258e60 100644
> --- a/drivers/tty/hvc/hvc_opal.c
> +++ b/drivers/tty/hvc/hvc_opal.c
> @@ -307,14 +307,8 @@ static int udbg_opal_getc(void)
>   int ch;
>   for (;;) {
>   ch = udbg_opal_getc_poll();
> - if (ch == -1) {
> - /* This shouldn't be needed...but... */
> - volatile unsigned long delay;
> - for (delay=0; delay < 200; delay++)
> - ;
> - } else {
> + if (ch != -1)
>   return ch;
> - }
>   }
>  }
>  


[PATCH 6/6] drivers/tty/hvc: remove unexplained "just in case" spin delay

2018-04-09 Thread Nicholas Piggin
This delay was in the very first OPAL console commit 6.5 years ago.
The firmware console has hardened sufficiently to remove it.

Cc: Benjamin Herrenschmidt 
Signed-off-by: Nicholas Piggin 
---
 drivers/tty/hvc/hvc_opal.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index e151cfacf2a7..436b98258e60 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -307,14 +307,8 @@ static int udbg_opal_getc(void)
int ch;
for (;;) {
ch = udbg_opal_getc_poll();
-   if (ch == -1) {
-   /* This shouldn't be needed...but... */
-   volatile unsigned long delay;
-   for (delay=0; delay < 200; delay++)
-   ;
-   } else {
+   if (ch != -1)
return ch;
-   }
}
 }
 
-- 
2.17.0



Re: [PATCH v2 4/9] powerpc/powernv: OPAL console standardise OPAL_BUSY loops

2018-04-09 Thread Nicholas Piggin
On Mon, 09 Apr 2018 15:53:33 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2018-04-09 at 15:24 +1000, Nicholas Piggin wrote:
> > Convert to using the standard delay poll/delay form.
> > 
> > The console code:
> > 
> > - Did not previously delay or sleep in its busy loop.
> > 
> > Cc: Benjamin Herrenschmidt 
> > Signed-off-by: Nicholas Piggin   
> 
> Does it help with anything? We don't technically *have* to delay or
> wait, I thought it would be good to try to hit the console as fast as
> possible in that case...

We can always make exceptions to the standard form, but in those
cases I would like to document it in the OPAL API and comment for
the Linux side.

My thinking in this case is that it reduces time in firmware and
in particular holding console locks. Is it likely / possible that
we don't have enough buffering or some other issue makes it worth
retrying so quickly?

Thanks,
Nick


Re: [PATCH 5/6] powerpc/powernv: implement opal_put_chars_nonatomic

2018-04-09 Thread Nicholas Piggin
On Mon, 09 Apr 2018 15:57:55 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> > The RAW console does not need writes to be atomic, so implement a
> > _nonatomic variant which does not take a spinlock. This API is used
> > in xmon, so the less locking that's used, the better chance there is
> > that a crash can be debugged.  
> 
> I find the term "nonatomic" confusing...

I guess it is to go with the "atomic" comment for the hvsi console
case -- all characters must get to the console together or not at
all.

> don't we have a problem if we
> start hitting OPAL without a lock where we can't trust
> opal_console_write_buffer_space anymore ? I think we need to handle
> partial writes in that case. Maybe we should return how much was
> written and leave the caller to deal with it.

Yes, the _nonatomic variant doesn't use opal_console_write_buffer_space
and it does handle partial writes by returning written bytes (although
callers generally tend to loop at the moment, we might do something
smarter with them later).

> I was hoping (but that isn't the case) that by nonatomic you actually
> meant calls that could be done in a non-atomic context, where we can do
> msleep instead of mdelay. That would be handy for the console coming
> from the hvc thread (the tty one).

Ah right, no. However we no longer loop until everything is written, so
the hvc console driver (or the console layer) should be able to deal with
that with sleeping. I don't think we need to put it at this level of the
driver, but I don't know much about the console code.

Thanks,
Nick


Re: [PATCH v2 3/9] powerpc/powernv: opal_put_chars partial write fix

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 15:24 +1000, Nicholas Piggin wrote:
> The intention here is to consume and discard the remaining buffer
> upon error. This works if there has not been a previous partial write.
> If there has been, then total_len is no longer total number of bytes
> to copy. total_len is always "bytes left to copy", so it should be
> added to written bytes.
> 
> This code may not be exercised any more if partial writes will not be
> hit, but this is a small bugfix before a larger change.
> 

Reviewed-by: Benjamin Herrenschmidt 

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/powernv/opal.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 516e23de5a3d..87d4c0aa7f64 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -388,7 +388,7 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
>   /* Closed or other error drop */
>   if (rc != OPAL_SUCCESS && rc != OPAL_BUSY &&
>   rc != OPAL_BUSY_EVENT) {
> - written = total_len;
> + written += total_len;
>   break;
>   }
>   if (rc == OPAL_SUCCESS) {


Re: [PATCH v2 4/9] powerpc/powernv: OPAL console standardise OPAL_BUSY loops

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 15:24 +1000, Nicholas Piggin wrote:
> Convert to using the standard delay poll/delay form.
> 
> The console code:
> 
> - Did not previously delay or sleep in its busy loop.
> 
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Nicholas Piggin 

Does it help with anything? We don't technically *have* to delay or
wait, I thought it would be good to try to hit the console as fast as
possible in that case...

Ben.

> ---
>  arch/powerpc/platforms/powernv/opal.c | 38 +++++++++++++++++++++++---------------
>  1 file changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index 87d4c0aa7f64..473c8ce14a34 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -378,33 +378,41 @@ int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
>   /* We still try to handle partial completions, though they
>* should no longer happen.
>*/
> - rc = OPAL_BUSY;
> - while(total_len > 0 && (rc == OPAL_BUSY ||
> - rc == OPAL_BUSY_EVENT || rc == OPAL_SUCCESS)) {
> +
> + while (total_len > 0) {
>   olen = cpu_to_be64(total_len);
> - rc = opal_console_write(vtermno, &olen, data);
> +
> + rc = OPAL_BUSY;
> + while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
> + rc = opal_console_write(vtermno, &olen, data);
> + if (rc == OPAL_BUSY_EVENT) {
> + mdelay(OPAL_BUSY_DELAY_MS);
> + opal_poll_events(NULL);
> + } else if (rc == OPAL_BUSY) {
> + mdelay(OPAL_BUSY_DELAY_MS);
> + }
> + }
> +
>   len = be64_to_cpu(olen);
>  
>   /* Closed or other error drop */
> - if (rc != OPAL_SUCCESS && rc != OPAL_BUSY &&
> - rc != OPAL_BUSY_EVENT) {
> - written += total_len;
> + if (rc != OPAL_SUCCESS) {
> + written += total_len; /* drop remaining chars */
>   break;
>   }
> - if (rc == OPAL_SUCCESS) {
> - total_len -= len;
> - data += len;
> - written += len;
> - }
> +
> + total_len -= len;
> + data += len;
> + written += len;
> +
>   /* This is a bit nasty but we need that for the console to
>* flush when there aren't any interrupts. We will clean
>* things a bit later to limit that to synchronous path
>* such as the kernel console and xmon/udbg
>*/
> - do
> + do {
>   opal_poll_events(&evt);
> - while(rc == OPAL_SUCCESS &&
> - (be64_to_cpu(evt) & OPAL_EVENT_CONSOLE_OUTPUT));
> + } while (be64_to_cpu(evt) & OPAL_EVENT_CONSOLE_OUTPUT);
>   }
>   spin_unlock_irqrestore(_write_lock, flags);
>   return written;


Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread Christoph Hellwig
> -#ifdef __HAVE_ARCH_PTE_SPECIAL
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>  # define HAVE_PTE_SPECIAL 1
>  #else
>  # define HAVE_PTE_SPECIAL 0

I'd say kill this odd indirection and just use the
CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.



Re: [PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Christoph Hellwig
On Mon, Apr 09, 2018 at 05:46:04PM +1000, Oliver O'Halloran wrote:
> Some OF platforms (pseries and some SPARC systems) have their own
> implementations of NUMA affinity detection rather than using the generic
> OF_NUMA driver, which mainly exists for arm64. For other platforms one
> of two fallbacks provided by the base OF driver is used depending on
> CONFIG_NUMA.
> 
> In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
> In the =y case the fallback is a real function which is defined as a
> weak symbol so that it may be overwritten by the architecture if desired.
> 
> The problem with this arrangement is that the real implementations all
> export of_node_to_nid(). Unfortunately it's not possible to export the
> fallback since it would clash with the non-weak version. As a result
> we get build failures when:
> 
> a) CONFIG_NUMA=y && CONFIG_OF=y, and
> b) The platform doesn't implement of_node_to_nid(), and
> c) A module uses of_node_to_nid()
> 
> Given b) will be true for most platforms this is fairly easy to hit
> and has been observed on ia64 and x86.
> 
> This patch remedies the problem by introducing the ARCH_HAS_OWN_OF_NUMA
> Kconfig option which is selected if an architecture provides an
> implementation of of_node_to_nid(). If a platform does not use its own,
> or the generic OF_NUMA, then always use the inline fallback in of.h so
> we don't need to futz around with exports.

I'd rather have a specific kconfig symbol for the 'generic'
implementation, especially given that it doesn't appear to be all
that generic.


Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread David Rientjes
On Mon, 9 Apr 2018, Christoph Hellwig wrote:

> > -#ifdef __HAVE_ARCH_PTE_SPECIAL
> > +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> >  # define HAVE_PTE_SPECIAL 1
> >  #else
> >  # define HAVE_PTE_SPECIAL 0
> 
> I'd say kill this odd indirection and just use the
> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.
> 
> 

Agree, and I think it would be easier to audit/review if patches 1 and 3 
were folded together to see the relationship between the newly added 
selects and the #defines they replace.  Otherwise, looks good!


Re: [PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Rob Herring
On Mon, Apr 9, 2018 at 2:46 AM, Oliver O'Halloran  wrote:
> Some OF platforms (pseries and some SPARC systems) have their own
> implementations of NUMA affinity detection rather than using the generic
> OF_NUMA driver, which mainly exists for arm64. For other platforms one
> of two fallbacks provided by the base OF driver is used depending on
> CONFIG_NUMA.
>
> In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
> In the =y case the fallback is a real function which is defined as a
> weak symbol so that it may be overwritten by the architecture if desired.
>
> The problem with this arrangement is that the real implementations all
> export of_node_to_nid(). Unfortunately it's not possible to export the
> fallback since it would clash with the non-weak version. As a result
> we get build failures when:
>
> a) CONFIG_NUMA=y && CONFIG_OF=y, and
> b) The platform doesn't implement of_node_to_nid(), and
> c) A module uses of_node_to_nid()
>
> Given b) will be true for most platforms this is fairly easy to hit
> and has been observed on ia64 and x86.

How specifically do we hit this? The only module I see using
of_node_to_nid in mainline is Cell EDAC driver.

> This patch remedies the problem by introducing the ARCH_HAS_OWN_OF_NUMA
> Kconfig option which is selected if an architecture provides an
> implementation of of_node_to_nid(). If a platform does not use its own,
> or the generic OF_NUMA, then always use the inline fallback in of.h so
> we don't need to futz around with exports.

I'm more inclined to figure out how to remove the export and provide a
non DT specific function if drivers need to know this.

Rob


Re: [PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Dan Williams
On Mon, Apr 9, 2018 at 1:52 PM, Rob Herring  wrote:
> On Mon, Apr 9, 2018 at 2:46 AM, Oliver O'Halloran  wrote:
>> Some OF platforms (pseries and some SPARC systems) have their own
>> implementations of NUMA affinity detection rather than using the generic
>> OF_NUMA driver, which mainly exists for arm64. For other platforms one
>> of two fallbacks provided by the base OF driver is used depending on
>> CONFIG_NUMA.
>>
>> In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
>> In the =y case the fallback is a real function which is defined as a
>> weak symbol so that it may be overwritten by the architecture if desired.
>>
>> The problem with this arrangement is that the real implementations all
>> export of_node_to_nid(). Unfortunately it's not possible to export the
>> fallback since it would clash with the non-weak version. As a result
>> we get build failures when:
>>
>> a) CONFIG_NUMA=y && CONFIG_OF=y, and
>> b) The platform doesn't implement of_node_to_nid(), and
>> c) A module uses of_node_to_nid()
>>
>> Given b) will be true for most platforms this is fairly easy to hit
>> and has been observed on ia64 and x86.
>
> How specifically do we hit this? The only module I see using
> of_node_to_nid in mainline is Cell EDAC driver.

The of_pmem driver is using it; that is currently pending for a 4.17 pull
request. Stephen hit the compile failure in -next.


Re: [PATCH 1/2] powerpc/mm: Flush cache on memory hot(un)plug

2018-04-09 Thread Reza Arbab

On Fri, Apr 06, 2018 at 03:24:23PM +1000, Balbir Singh wrote:

This patch adds support for flushing potentially dirty
cache lines when memory is hot-plugged/hot-un-plugged.


Acked-by: Reza Arbab 

--
Reza Arbab



Re: [PATCH] ASoC: fsl_esai: Fix divisor calculation failure at lower ratio

2018-04-09 Thread Marek Vasut
On 04/09/2018 01:57 AM, Nicolin Chen wrote:
> When the desired ratio is less than 256, the savesub (tolerance)
> in the calculation would become 0. This will then fail the loop-
> search immediately without reporting any errors.
> 
> But if the ratio is small enough, there is no need to calculate
> the tolerance because PM divisor alone is enough to get the ratio.
> 
> So a simple fix could be just to set PM directly instead of going
> into the loop-search.
> 
> Reported-by: Marek Vasut 
> Signed-off-by: Nicolin Chen 
> Cc: Marek Vasut 

On i.MX6Q with TI PCM1808
Tested-by: Marek Vasut 

> ---
>  sound/soc/fsl/fsl_esai.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/sound/soc/fsl/fsl_esai.c b/sound/soc/fsl/fsl_esai.c
> index 40a7004..da8fd98 100644
> --- a/sound/soc/fsl/fsl_esai.c
> +++ b/sound/soc/fsl/fsl_esai.c
> @@ -144,6 +144,13 @@ static int fsl_esai_divisor_cal(struct snd_soc_dai *dai, bool tx, u32 ratio,
>  
>   psr = ratio <= 256 * maxfp ? ESAI_xCCR_xPSR_BYPASS : ESAI_xCCR_xPSR_DIV8;
>  
> + /* Do not loop-search if PM (1 ~ 256) alone can serve the ratio */
> + if (ratio <= 256) {
> + pm = ratio;
> + fp = 1;
> + goto out;
> + }
> +
>   /* Set the max fluctuation -- 0.1% of the max divisor */
>   savesub = (psr ? 1 : 8) * 256 * maxfp / 1000;
>  
> 


-- 
Best regards,
Marek Vasut


Re: [PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Rob Herring
On Mon, Apr 9, 2018 at 4:05 PM, Dan Williams  wrote:
> On Mon, Apr 9, 2018 at 1:52 PM, Rob Herring  wrote:
>> On Mon, Apr 9, 2018 at 2:46 AM, Oliver O'Halloran  wrote:
>>> Some OF platforms (pseries and some SPARC systems) have their own
>>> implementations of NUMA affinity detection rather than using the generic
>>> OF_NUMA driver, which mainly exists for arm64. For other platforms one
>>> of two fallbacks provided by the base OF driver is used depending on
>>> CONFIG_NUMA.
>>>
>>> In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
>>> In the =y case the fallback is a real function which is defined as a
>>> weak symbol so that it may be overwritten by the architecture if desired.
>>>
>>> The problem with this arrangement is that the real implementations all
>>> export of_node_to_nid(). Unfortunately it's not possible to export the
>>> fallback since it would clash with the non-weak version. As a result
>>> we get build failures when:
>>>
>>> a) CONFIG_NUMA=y && CONFIG_OF=y, and
>>> b) The platform doesn't implement of_node_to_nid(), and
>>> c) A module uses of_node_to_nid()
>>>
>>> Given b) will be true for most platforms this is fairly easy to hit
>>> and has been observed on ia64 and x86.
>>
>> How specifically do we hit this? The only module I see using
>> of_node_to_nid in mainline is Cell EDAC driver.
>
> The of_pmem driver is using it; that is currently pending for a 4.17 pull
> request. Stephen hit the compile failure in -next.

You mean the stuff reviewed last week in the middle of the merge
window? Sounds like 4.18 material to me.

Rob


[PATCH] powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic

2018-04-09 Thread Michael Ellerman
If you build the kernel with CONFIG_RELOCATABLE=n, then install the
modules, rebuild the kernel with CONFIG_RELOCATABLE=y and leave the
old modules installed, we crash something like:

  Unable to handle kernel paging request for data at address 0xd00018d66cef
  Faulting instruction address: 0xc21ddd08
  Oops: Kernel access of bad area, sig: 11 [#1]
  Modules linked in: x_tables autofs4
  CPU: 2 PID: 1 Comm: systemd Not tainted 4.16.0-rc6-gcc_ubuntu_le-g99fec39 #1
  ...
  NIP check_version.isra.22+0x118/0x170
  Call Trace:
    __ksymtab_xt_unregister_table+0x58/0xfcb8 [x_tables] (unreliable)
resolve_symbol+0xb4/0x150
load_module+0x10e8/0x29a0
SyS_finit_module+0x110/0x140
system_call+0x58/0x6c

This happens because since commit 71810db27c1c ("modversions: treat
symbol CRCs as 32 bit quantities"), a relocatable kernel encodes and
handles symbol CRCs differently from a non-relocatable kernel.

Although it's possible we could try and detect this situation and
handle it, it's much more robust to simply make the state of
CONFIG_RELOCATABLE part of the module vermagic.

Fixes: 71810db27c1c ("modversions: treat symbol CRCs as 32 bit quantities")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/module.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/module.h b/arch/powerpc/include/asm/module.h
index 7e28442827f1..4f6573934792 100644
--- a/arch/powerpc/include/asm/module.h
+++ b/arch/powerpc/include/asm/module.h
@@ -15,9 +15,19 @@
 
 
 #ifdef CC_USING_MPROFILE_KERNEL
-#define MODULE_ARCH_VERMAGIC   "mprofile-kernel"
+#define MODULE_ARCH_VERMAGIC_FTRACE	"mprofile-kernel "
+#else
+#define MODULE_ARCH_VERMAGIC_FTRACE	""
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+#define MODULE_ARCH_VERMAGIC_RELOCATABLE   "relocatable "
+#else
+#define MODULE_ARCH_VERMAGIC_RELOCATABLE   ""
+#endif
+
+#define MODULE_ARCH_VERMAGIC MODULE_ARCH_VERMAGIC_FTRACE MODULE_ARCH_VERMAGIC_RELOCATABLE
+
 #ifndef __powerpc64__
 /*
  * Thanks to Paul M for explaining this.
-- 
2.14.1



Occasionally losing the tick_sched_timer

2018-04-09 Thread Nicholas Piggin
We are seeing rare hard lockup watchdog timeouts: a CPU seems to have no
more timers scheduled, even though the hard and soft lockup watchdogs should
have their heartbeat timers, and probably many others.

The reproducer we have is running a KVM workload. The lockup is in the
host kernel, quite rare but we may be able to slowly test things.

I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has
stopped for a long time and no hrtimer active. Included CPU4 for what the
other CPUs look like.

Thomas, do you have any ideas on what we might look for, or whether we can
add some BUG_ON()s to catch this at its source?

- CPU3 is sitting in its cpuidle loop (polling idle with all other idle
  states disabled).

- `taskset -c 3 ls` basically revived the CPU and got timers running again.

- May not be a new bug; we have only enabled the hard lockup detector by
  default in the past few releases.

- KVM is being used. This switches various registers like timebase and
  decrementer. Possibly that's involved, but we can't say the bug does
  not happen without KVM.

cpu: 3
 clock 0:
  .base:   df30f5ab
  .index:  0
  .resolution: 1 nsecs
  .get_time:   
ktime_get

  .offset: 0 nsecs
active timers:
 clock 1:
  .base:   520cc304
  .index:  1
  .resolution: 1 nsecs
  .get_time:   
ktime_get_real

  .offset: 1523263049759155857 nsecs
active timers:
 clock 2:
  .base:   706e6277
  .index:  2
  .resolution: 1 nsecs
  .get_time:   
ktime_get_boottime

  .offset: 0 nsecs
active timers:
 clock 3:
  .base:   e2ae2811
  .index:  3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
 clock 4:
  .base:   c93e2f8e
  .index:  4
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 clock 5:
  .base:   7b726c6a
  .index:  5
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 6:
  .base:   f17c2d4f
  .index:  6
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 7:
  .base:   6c57ef89
  .index:  7
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
  .expires_next   : 9223372036854775807 nsecs
  .hres_active: 1
  .nr_events  : 1446533
  .nr_retries : 1434
  .nr_hangs   : 0
  .max_hang_time  : 0
  .nohz_mode  : 2
  .last_tick  : 1776312000 nsecs
  .tick_stopped   : 1
  .idle_jiffies   : 4296713609
  .idle_calls : 2573133
  .idle_sleeps: 1957794
  .idle_entrytime : 1776312625 nsecs
  .idle_waketime  : 59550238131639 nsecs
  .idle_exittime  : 17763110009176 nsecs
  .idle_sleeptime : 17504617295679 nsecs
  .iowait_sleeptime: 719978688 nsecs
  .last_jiffies   : 4296713608
  .next_timer : 1776313000
  .idle_expires   : 1776313000 nsecs
jiffies: 4300892324

cpu: 4
 clock 0:
  .base:   07d8226b
  .index:  0
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 #0: , tick_sched_timer, S:01
 # expires at 5955295000-5955295000 nsecs [in 2685654802 to 2685654802 nsecs]
 #1: <9b4a3b88>, hrtimer_wakeup, S:01
 # expires at 59602585423025-59602642458243 nsecs [in 52321077827 to 52378113045 nsecs]
 clock 1:
  .base:   d2ae50c4
  .index:  1
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 2:
  .base:   1a80e123
  .index:  2
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 3:
  .base:   5c97ab69
  .index:  3
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
 clock 4:
  .base:   75ac8f03
  .index:  4
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset: 0 nsecs
active timers:
 clock 5:
  .base:   db06f6ce
  .index:  5
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset: 1523263049759155857 nsecs
active timers:
 clock 6:
  .base:   fa63fbce
  .index:  6
  .resolution: 1 nsecs
  .get_time:   ktime_get_boottime
  .offset: 0 nsecs
active timers:
 clock 7:
  .base:   41de439c
  .index:  7
  .resolution: 1 nsecs
  .get_time:   ktime_get_clocktai
  .offset: 1523263049759155857 nsecs
active timers:
  .expires_next   : 5955295000 nsecs
  .hres_active: 1
  .nr_events  : 294282
  .nr_retries : 16138
  .nr_hangs   : 0
  .max_hang_time  : 0
  .nohz_mode  : 2
  .last_tick  : 5954562000 nsecs
  .tick_stopped   : 1
  .idle_jiffies   : 4300891859
  .idle_calls : 553259
  .idle_sleeps: 536396
  .idle_entrytime : 59547990019145 nsecs

Re: [PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Dan Williams
On Mon, Apr 9, 2018 at 6:02 PM, Rob Herring  wrote:
> On Mon, Apr 9, 2018 at 4:05 PM, Dan Williams  wrote:
>> On Mon, Apr 9, 2018 at 1:52 PM, Rob Herring  wrote:
>>> On Mon, Apr 9, 2018 at 2:46 AM, Oliver O'Halloran  wrote:
 Some OF platforms (pseries and some SPARC systems) have their own
 implementations of NUMA affinity detection rather than using the generic
 OF_NUMA driver, which mainly exists for arm64. For other platforms one
 of two fallbacks provided by the base OF driver are used depending on
 CONFIG_NUMA.

 In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
 In the =y case the fallback is a real function which is defined as a
 weak symbol so that it may be overridden by the architecture if desired.

 The problem with this arrangement is that the real implementations all
 export of_node_to_nid(). Unfortunately it's not possible to export the
 fallback since it would clash with the non-weak version. As a result
 we get build failures when:

 a) CONFIG_NUMA=y && CONFIG_OF=y, and
 b) The platform doesn't implement of_node_to_nid(), and
 c) A module uses of_node_to_nid()

 Given b) will be true for most platforms this is fairly easy to hit
 and has been observed on ia64 and x86.
>>>
>>> How specifically do we hit this? The only module I see using
>>> of_node_to_nid in mainline is Cell EDAC driver.
>>
>> The of_pmem driver is using it currently pending for a 4.17 pull
>> request. Stephen hit the compile failure in -next.
>
> You mean the stuff reviewed last week in the middle of the merge
> window? Sounds like 4.18 material to me.

It was originally posted for 4.16. The reposting and review came in
late this cycle, but outside of a critical issue I'd rather not delay
it again. The build error issue is resolved by not allowing modular
builds of this driver for now.

>
> Rob


Re: [PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols

2018-04-09 Thread Michael Ellerman
Mauricio Faria de Oliveira  writes:
> Some of the boot code located at the start of kernel text is "init"
> class, in that it only runs at boot time, however marking it as normal
> init code is problematic because that puts it into a different section
> located at the very end of kernel text.
>
> e.g., in case the TOC is not set up, we may not be able to tolerate a
> branch trampoline to reach the init function.
>
> Credits: code and message are based on 2016 patch by Nicholas Piggin,
> and slightly modified so not to rename the powerpc code/symbol names.
>
> Subject: [PATCH] powerpc/64: quieten section mismatch warnings
> From: Nicholas Piggin 
> Date: Fri Dec 23 00:14:19 AEDT 2016
>
> This resolves the following section mismatch warnings:
>
> WARNING: vmlinux.o(.text+0x2fa8): Section mismatch in reference from the 
> variable __boot_from_prom to the function .init.text:prom_init()
> The function __boot_from_prom() references
> the function __init prom_init().
> This is often because __boot_from_prom lacks a __init
> annotation or the annotation of prom_init is wrong.
>
> WARNING: vmlinux.o(.text+0x3238): Section mismatch in reference from the 
> variable start_here_multiplatform to the function .init.text:early_setup()
> The function start_here_multiplatform() references
> the function __init early_setup().
> This is often because start_here_multiplatform lacks a __init
> annotation or the annotation of early_setup is wrong.
>
> WARNING: vmlinux.o(.text+0x326c): Section mismatch in reference from the 
> variable start_here_common to the function .init.text:start_kernel()
> The function start_here_common() references
> the function __init start_kernel().
> This is often because start_here_common lacks a __init
> annotation or the annotation of start_kernel is wrong.

Thanks for picking this one up.

I hate to be a pain ... but before we merge this and proliferate these
names, I'd like to change the names of some of these early asm
functions. They're terribly named due to historical reasons.

I haven't actually thought of good names yet though :)

I'll try and come up with some and post a patch doing the renames.

cheers


Re: [PATCH 1/2] KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode

2018-04-09 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Sun, 8 Apr 2018 20:17:47 +1000
> Balbir Singh  wrote:
>
>> On Fri, Apr 6, 2018 at 3:56 AM, Nicholas Piggin  wrote:
>> > This crashes with a "Bad real address for load" attempting to load
>> > from the vmalloc region in realmode (faulting address is in DAR).
>> >
>> >   Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
>> >   LE SMP NR_CPUS=2048 NUMA PowerNV
>> >   CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 
>> > 4.16.0-01530-g43d1859f0994
>> >   NIP:  c00155ac LR: c00c2430 CTR: c0015580
>> >   REGS: c00fff76dd80 TRAP: 0200   Not tainted  
>> > (4.16.0-01530-g43d1859f0994)
>> >   MSR:  90201003   CR: 4808  XER: 
>> >   CFAR: 000102900ef0 DAR: d00017fffd941a28 DSISR: 0040 SOFTE: 3
>> >   NIP [c00155ac] perf_trace_tlbie+0x2c/0x1a0
>> >   LR [c00c2430] do_tlbies+0x230/0x2f0
>> >
>> > I suspect the reason is the per-cpu data is not in the linear chunk.
>> > This could be restored if that was able to be fixed, but for now,
>> > just remove the tracepoints.  
>> 
>> Could you share the stack trace as well? I've not observed this in my 
>> testing.
>
> I can't seem to find it, I can try reproduce tomorrow. It was coming
> from h_remove hcall from the guest. It's 176 logical CPUs.
>
>> Maybe I don't have as many CPUs. I presume you're talking about the per-cpu
>> data offsets for per-cpu trace data?
>
> It looked like it was dereferencing virtually mapped per-cpu data, yes.
> Probably the perf_events deref.

Naveen has posted a series to (hopefully) fix this, which just missed
the merge window:

  https://patchwork.ozlabs.org/patch/894757/


cheers


Re: [PATCH v2 9/9] powerpc/powernv: opal-kmsg standardise OPAL_BUSY handling

2018-04-09 Thread Russell Currey
On Mon, 2018-04-09 at 15:24 +1000, Nicholas Piggin wrote:
> OPAL_CONSOLE_FLUSH is documented as being able to return OPAL_BUSY,
> so implement the standard OPAL_BUSY handling for it.
> 
> Cc: Russell Currey 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Russell Currey 


Re: [PATCH 1/6] powerpc/powernv: opal-kmsg use flush fallback from console code

2018-04-09 Thread Russell Currey
On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> Use the more refined and tested event polling loop from
> opal_put_chars
> as the fallback console flush in the opal-kmsg path. This loop is
> used
> by the console driver today, whereas the opal-kmsg fallback is not
> likely to have been used for years.
> 
> Use WARN_ONCE rather than a printk when the fallback is invoked to
> prepare for moving the console flush into a common function.
> 
> Cc: Russell Currey 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Russell Currey 


Re: [PATCH 2/6] powerpc/powernv: Implement and use opal_flush_console

2018-04-09 Thread Russell Currey
On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> A new console flushing firmware API was introduced to replace event
> polling loops, and implemented in opal-kmsg with affddff69c55e
> ("powerpc/powernv: Add a kmsg_dumper that flushes console output on
> panic"), to flush the console in the panic path.
> 
> The OPAL console driver has other situations where interrupts are off
> and it needs to flush the console synchronously. These still use a
> polling loop.
> 
> So move the opal-kmsg flush code to opal_flush_console, and use the
> new function in opal-kmsg and opal_put_chars.
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Russell Currey 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Russell Currey 


Re: [PATCH 1/2] KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode

2018-04-09 Thread Naveen N. Rao

Michael Ellerman wrote:

Nicholas Piggin  writes:


On Sun, 8 Apr 2018 20:17:47 +1000
Balbir Singh  wrote:


On Fri, Apr 6, 2018 at 3:56 AM, Nicholas Piggin  wrote:
> This crashes with a "Bad real address for load" attempting to load
> from the vmalloc region in realmode (faulting address is in DAR).
>
>   Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
>   LE SMP NR_CPUS=2048 NUMA PowerNV
>   CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 
4.16.0-01530-g43d1859f0994
>   NIP:  c00155ac LR: c00c2430 CTR: c0015580
>   REGS: c00fff76dd80 TRAP: 0200   Not tainted  
(4.16.0-01530-g43d1859f0994)
>   MSR:  90201003   CR: 4808  XER: 
>   CFAR: 000102900ef0 DAR: d00017fffd941a28 DSISR: 0040 SOFTE: 3
>   NIP [c00155ac] perf_trace_tlbie+0x2c/0x1a0
>   LR [c00c2430] do_tlbies+0x230/0x2f0
>
> I suspect the reason is the per-cpu data is not in the linear chunk.
> This could be restored if that was able to be fixed, but for now,
> just remove the tracepoints.  


Could you share the stack trace as well? I've not observed this in my testing.


I can't seem to find it, I can try reproduce tomorrow. It was coming
from h_remove hcall from the guest. It's 176 logical CPUs.


Maybe I don't have as many CPUs. I presume you're talking about the per-cpu
data offsets for per-cpu trace data?


It looked like it was dereferencing virtually mapped per-cpu data, yes.
Probably the perf_events deref.


Naveen has posted a series to (hopefully) fix this, which just missed
the merge window:

  https://patchwork.ozlabs.org/patch/894757/


I'm afraid that won't actually help here :(
That series is specific to the function tracer, while this is using 
static tracepoints.


We could convert trace_tlbie() to a TRACE_EVENT_CONDITION() and guard it 
within a check for paca->ftrace_enabled, but that would only be useful 
if the below callsites can ever be hit outside of KVM guest mode.


- Naveen




[PATCH 1/4] powerpc/perf: Rearrange memory freeing in imc init

2018-04-09 Thread Anju T Sudhakar
When any of the IMC (In-Memory Collection counter) devices fails to
initialize, imc_common_mem_free() frees the allocated memory. In doing so,
the pmu_ptr pointer is also freed. But pmu_ptr is then used in a subsequent
function (imc_common_cpuhp_mem_free()), which is wrong. This patch reorders
the code to avoid such access.

Also free the memory that is dynamically allocated during imc initialization,
wherever required.

Signed-off-by: Anju T Sudhakar 
---
test matrix and static checker run details are updated in the cover letter
patch is based on https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git (branch: merge)

 arch/powerpc/perf/imc-pmu.c   | 32 ---
 arch/powerpc/platforms/powernv/opal-imc.c | 13 ++---
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d7532e7..258b0f4 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1153,7 +1153,7 @@ static void cleanup_all_core_imc_memory(void)
/* mem_info will never be NULL */
for (i = 0; i < nr_cores; i++) {
if (ptr[i].vbase)
-   free_pages((u64)ptr->vbase, get_order(size));
+   free_pages((u64)ptr[i].vbase, get_order(size));
}
 
kfree(ptr);
@@ -1191,7 +1191,6 @@ static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
if (pmu_ptr->attr_groups[IMC_EVENT_ATTR])
kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]->attrs);
kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]);
-   kfree(pmu_ptr);
 }
 
 /*
@@ -1208,6 +1207,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)

cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
+   per_nest_pmu_arr = NULL;
}
 
if (nest_pmus > 0)
@@ -1319,10 +1319,8 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
int ret;
 
ret = imc_mem_init(pmu_ptr, parent, pmu_idx);
-   if (ret) {
-   imc_common_mem_free(pmu_ptr);
-   return ret;
-   }
+   if (ret)
+   goto err_free_mem;
 
switch (pmu_ptr->domain) {
case IMC_DOMAIN_NEST:
@@ -1337,7 +1335,9 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
ret = init_nest_pmu_ref();
if (ret) {
mutex_unlock(&nest_init_lock);
-   goto err_free;
+   kfree(per_nest_pmu_arr);
+   per_nest_pmu_arr = NULL;
+   goto err_free_mem;
}
/* Register for cpu hotplug notification. */
ret = nest_pmu_cpumask_init();
@@ -1345,7 +1345,8 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
mutex_unlock(&nest_init_lock);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
-   goto err_free;
+   per_nest_pmu_arr = NULL;
+   goto err_free_mem;
}
}
nest_pmus++;
@@ -1355,7 +1356,7 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
ret = core_imc_pmu_cpumask_init();
if (ret) {
cleanup_all_core_imc_memory();
-   return ret;
+   goto err_free_mem;
}
 
break;
@@ -1363,7 +1364,7 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
ret = thread_imc_cpu_init();
if (ret) {
cleanup_all_thread_imc_memory();
-   return ret;
+   goto err_free_mem;
}
 
break;
@@ -1373,23 +1374,24 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
 
ret = update_events_in_group(parent, pmu_ptr);
if (ret)
-   goto err_free;
+   goto err_free_cpuhp_mem;
 
ret = update_pmu_ops(pmu_ptr);
if (ret)
-   goto err_free;
+   goto err_free_cpuhp_mem;
 
ret = perf_pmu_register(&pmu_ptr->pmu, pmu_ptr->pmu.name, -1);
if (ret)
-   goto err_free;
+   goto err_free_cpuhp_mem;
 
pr_info("%s performance monitor hardware support registered\n",
pmu_ptr->pmu.name);
 
return 0;
 
-err_free:
-   

[PATCH 4/4] powerpc/perf: Unregister thread-imc if core-imc not supported

2018-04-09 Thread Anju T Sudhakar
Enable thread-imc in the kernel only if core-imc is registered.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/include/asm/imc-pmu.h|  1 +
 arch/powerpc/perf/imc-pmu.c   | 12 
 arch/powerpc/platforms/powernv/opal-imc.c |  9 +
 3 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index d76cb11..69f516e 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -128,4 +128,5 @@ extern int init_imc_pmu(struct device_node *parent,
struct imc_pmu *pmu_ptr, int pmu_id);
 extern void thread_imc_disable(void);
 extern int get_max_nest_dev(void);
+extern void unregister_thread_imc(void);
 #endif /* __ASM_POWERPC_IMC_PMU_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 4b4ca83..fa88785 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -40,6 +40,7 @@ static struct imc_pmu *core_imc_pmu;
 /* Thread IMC data structures and variables */
 
 static DEFINE_PER_CPU(u64 *, thread_imc_mem);
+static struct imc_pmu *thread_imc_pmu;
 static int thread_imc_mem_size;
 
 struct imc_pmu *imc_event_to_pmu(struct perf_event *event)
@@ -1228,6 +1229,16 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)
}
 }
 
+/*
+ * Function to unregister thread-imc if core-imc
+ * is not registered.
+ */
+void unregister_thread_imc(void)
+{
+   imc_common_cpuhp_mem_free(thread_imc_pmu);
+   imc_common_mem_free(thread_imc_pmu);
+   perf_pmu_unregister(&thread_imc_pmu->pmu);
+}
 
 /*
  * imc_mem_init : Function to support memory allocation for core imc.
@@ -1296,6 +1307,7 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
}
}
 
+   thread_imc_pmu = pmu_ptr;
break;
default:
return -EINVAL;
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index 490bb72..58a0794 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -255,6 +255,7 @@ static int opal_imc_counters_probe(struct platform_device 
*pdev)
 {
struct device_node *imc_dev = pdev->dev.of_node;
int pmu_count = 0, domain;
+   bool core_imc_reg = false, thread_imc_reg = false;
u32 type;
 
/*
@@ -292,6 +293,10 @@ static int opal_imc_counters_probe(struct platform_device 
*pdev)
if (!imc_pmu_create(imc_dev, pmu_count, domain)) {
if (domain == IMC_DOMAIN_NEST)
pmu_count++;
+   if (domain == IMC_DOMAIN_CORE)
+   core_imc_reg = true;
+   if (domain == IMC_DOMAIN_THREAD)
+   thread_imc_reg = true;
}
}
 
@@ -299,6 +304,10 @@ static int opal_imc_counters_probe(struct platform_device 
*pdev)
if (pmu_count == 0)
debugfs_remove_recursive(imc_debugfs_parent);
 
+   /* If core imc is not registered, unregister thread-imc */
+   if (!core_imc_reg && thread_imc_reg)
+   unregister_thread_imc();
+
return 0;
 }
 
-- 
2.7.4



[PATCH 3/4] powerpc/perf: Return appropriate value for unknown domain

2018-04-09 Thread Anju T Sudhakar
Return a proper error code for an unknown domain during IMC initialization.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 1b285cd..4b4ca83 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1371,7 +1371,7 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
 
break;
default:
-   return  -1; /* Unknown domain */
+   return  -EINVAL;/* Unknown domain */
}
 
ret = update_events_in_group(parent, pmu_ptr);
-- 
2.7.4



Re: [PATCH] ASoC: fsl_ssi: Fix mode setting when changing channel number

2018-04-09 Thread Mika Penttilä
On 04/08/2018 07:40 AM, Nicolin Chen wrote:
> This is a partial revert (in a cleaner way) of commit ebf08ae3bc90
> ("ASoC: fsl_ssi: Keep ssi->i2s_net updated") to fix a regression
> at test cases when switching between mono and stereo audio.
> 
> The problem is that ssi->i2s_net is initialized in set_dai_fmt()
> only, while this set_dai_fmt() is only called during the dai-link
> probe(). The original patch assumed set_dai_fmt() would be called
> during every playback instance, so it failed at the overriding use
> cases.
> 
> This patch adds the local variable i2s_net back to let regular use
> cases still follow the mode settings from the set_dai_fmt().
> 
> Meanwhile, the original commit of keeping ssi->i2s_net updated was
> to make set_tdm_slot() clean by checking the ssi->i2s_net directly
> instead of reading SCR register. However, the change itself is not
> necessary (or even harmful) because the set_tdm_slot() might fail
> to check the slot number for Normal-Mode-None-Net settings while
> mono audio cases still need 2 slots. So this patch can also fix it.
> And it adds an extra line of comments to declare ssi->i2s_net does
> not reflect the register value but merely the initial setting from
> the set_dai_fmt().
> 
> Reported-by: Mika Penttilä 
> Signed-off-by: Nicolin Chen 
> Cc: Mika Penttilä 
> ---
>  sound/soc/fsl/fsl_ssi.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index 0823b08..89df2d9 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -217,6 +217,7 @@ struct fsl_ssi_soc_data {
>   * @dai_fmt: DAI configuration this device is currently used with
>   * @streams: Mask of current active streams: BIT(TX) and BIT(RX)
>   * @i2s_net: I2S and Network mode configurations of SCR register
> + *   (this is the initial settings based on the DAI format)
>   * @synchronous: Use synchronous mode - both of TX and RX use STCK and SFCK
>   * @use_dma: DMA is used or FIQ with stream filter
>   * @use_dual_fifo: DMA with support for dual FIFO mode
> @@ -829,16 +830,23 @@ static int fsl_ssi_hw_params(struct snd_pcm_substream 
> *substream,
>   }
>  
>   if (!fsl_ssi_is_ac97(ssi)) {
> + /*
> +  * Keep the ssi->i2s_net intact while having a local variable
> +  * to override settings for special use cases. Otherwise, the
> +  * ssi->i2s_net will lose the settings for regular use cases.
> +  */
> + u8 i2s_net = ssi->i2s_net;
> +
>   /* Normal + Network mode to send 16-bit data in 32-bit frames */
>   if (fsl_ssi_is_i2s_cbm_cfs(ssi) && sample_size == 16)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL | SSI_SCR_NET;
>  
>   /* Use Normal mode to send mono data at 1st slot of 2 slots */
>   if (channels == 1)
> - ssi->i2s_net = SSI_SCR_I2S_MODE_NORMAL;
> + i2s_net = SSI_SCR_I2S_MODE_NORMAL;
>  
>   regmap_update_bits(regs, REG_SSI_SCR,
> -SSI_SCR_I2S_NET_MASK, ssi->i2s_net);
> +SSI_SCR_I2S_NET_MASK, i2s_net);
>   }
>  
>   /* In synchronous mode, the SSI uses STCCR for capture */
> 

This patch fixes my problems, so: 

Tested-by: Mika Penttilä 


--Mika


[PATCH 0/4] powerpc/perf: IMC Cleanups

2018-04-09 Thread Anju T Sudhakar
This patch series includes some cleanups and unregistration of the
thread-imc PMU if the kernel does not have core-imc registered.

The entire patch set has been verified using the static checker smatch. 
Command used:   
$ make ARCH=powerpc CHECK="/smatch -p=kernel"  C=1 vmlinux | tee warns.txt

Tests Done: 

* Fail core-imc at init:
nest-imc - working  
cpuhotplug - works as expected  
thread-imc - not registered 

* Fail thread-imc at init:  
nest-imc - works
core-imc - works
cpuhotplug - works  

* Fail nest-imc at init 

core-imc - works
thread-imc -works   
cpuhotplug - works  

* Fail only one nest unit (say for mcs23)   

Other nest-units - works
core-imc - works
thread-imc - works  
cpuhotplug - works. 


* Kexec works   

The first three patches in this series addresses the comments by Dan Carpenter. 


Anju T Sudhakar (4):
  powerpc/perf: Rearrange memory freeing in imc init
  powerpc/perf: Replace the direct return with goto statement
  powerpc/perf: Return appropriate value for unknown domain
  powerpc/perf: Unregister thread-imc if core-imc not supported

 arch/powerpc/include/asm/imc-pmu.h|  1 +
 arch/powerpc/perf/imc-pmu.c   | 64 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 22 +--
 3 files changed, 60 insertions(+), 27 deletions(-)

-- 
2.7.4



[PATCH 2/4] powerpc/perf: Replace the direct return with goto statement

2018-04-09 Thread Anju T Sudhakar
Replace the direct return statement in imc_mem_init() with goto,
to adhere to the kernel coding style.

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 258b0f4..1b285cd 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1236,7 +1236,7 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
int pmu_index)
 {
const char *s;
-   int nr_cores, cpu, res;
+   int nr_cores, cpu, res = -ENOMEM;
 
if (of_property_read_string(parent, "name", &s))
return -ENODEV;
@@ -1246,7 +1246,7 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
/* Update the pmu name */
pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s_imc", "nest_", s);
if (!pmu_ptr->pmu.name)
-   return -ENOMEM;
+   goto err;
 
/* Needed for hotplug/migration */
if (!per_nest_pmu_arr) {
@@ -1254,7 +1254,7 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
sizeof(struct imc_pmu *),
GFP_KERNEL);
if (!per_nest_pmu_arr)
-   return -ENOMEM;
+   goto err;
}
per_nest_pmu_arr[pmu_index] = pmu_ptr;
break;
@@ -1262,21 +1262,21 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
/* Update the pmu name */
pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s", s, "_imc");
if (!pmu_ptr->pmu.name)
-   return -ENOMEM;
+   goto err;
 
nr_cores = DIV_ROUND_UP(num_present_cpus(), threads_per_core);
pmu_ptr->mem_info = kcalloc(nr_cores, sizeof(struct imc_mem_info), GFP_KERNEL);
 
if (!pmu_ptr->mem_info)
-   return -ENOMEM;
+   goto err;
 
core_imc_refc = kcalloc(nr_cores, sizeof(struct imc_pmu_ref), GFP_KERNEL);
 
if (!core_imc_refc) {
kfree(pmu_ptr->mem_info);
-   return -ENOMEM;
+   goto err;
}
 
core_imc_pmu = pmu_ptr;
@@ -1285,14 +1285,14 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
/* Update the pmu name */
pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s", s, "_imc");
if (!pmu_ptr->pmu.name)
-   return -ENOMEM;
+   goto err;
 
thread_imc_mem_size = pmu_ptr->counter_mem_size;
for_each_online_cpu(cpu) {
res = thread_imc_mem_alloc(cpu, pmu_ptr->counter_mem_size);
if (res) {
cleanup_all_thread_imc_memory();
-   return res;
+   goto err;
}
}
 
@@ -1302,6 +1302,8 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,
}
 
return 0;
+err:
+   return res;
 }
 
 /*
-- 
2.7.4



Re: [PATCH 5/6] powerpc/powernv: implement opal_put_chars_nonatomic

2018-04-09 Thread Nicholas Piggin
On Mon, 09 Apr 2018 18:24:46 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2018-04-09 at 16:23 +1000, Nicholas Piggin wrote:
> > On Mon, 09 Apr 2018 15:57:55 +1000
> > Benjamin Herrenschmidt  wrote:
> >   
> > > On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:  
> > > > The RAW console does not need writes to be atomic, so implement a
> > > > _nonatomic variant which does not take a spinlock. This API is used
> > > in xmon, so the less locking that's used, the better chance there is
> > > > that a crash can be debugged.
> > > 
> > > I find the term "nonatomic" confusing...  
> > 
> > I guess it is to go with the "atomic" comment for the hvsi console
> > case -- all characters must get to the console together or not at
> > all.  
> 
> Yeah ok, it's just that in Linux "atomic" usually means something else
> :-) Why not just call it "unlocked" which is what it's about and
> matches existing practices throughout the kernel?

Sure, I'll change it.
 
> > > don't we have a problem if we
> > > start hitting OPAL without a lock where we can't trust
> > > opal_console_write_buffer_space anymore ? I think we need to handle
> > > partial writes in that case. Maybe we should return how much was
> > > written and leave the caller to deal with it.  
> > 
> > Yes, the _nonatomic variant doesn't use opal_console_write_buffer_space
> > and it does handle partial writes by returning written bytes (although
> > callers generally tend to loop at the moment, we might do something
> > smarter with them later).
> >   
> > > I was hoping (but that isn't the case) that by nonatomic you actually
> > > meant calls that could be done in a non-atomic context, where we can do
> > > msleep instead of mdelay. That would be handy for the console coming
> > > from the hvc thread (the tty one).  
> > 
> > Ah right, no. However we no longer loop until everything is written, so
> > the hvc console driver (or the console layer) should be able to deal with
> > that with sleeping. I don't think we need to put it at this level of the
> > driver, but I don't know much about the console code.  
> 
> Ok, so hopefully we shouldn't be hitting the delay..

I *think* so. It may actually hit it once on the way out, but I don't
know if it's worth adding a new API to avoid it. Probably warrants
somebody to take a look and measure things though.

Actually I did just take a look. The bigger problem actually is because
we do the console flush here and even the "good" hvc path holds an irq
lock in this case:

   dmesg-8334   12d...0us : _raw_spin_lock_irqsave
   dmesg-8334   12d...0us : hvc_push <-hvc_write
   dmesg-8334   12d...1us : opal_put_chars_nonatomic <-hvc_push
   dmesg-8334   12d...1us : __opal_put_chars <-hvc_push
   dmesg-8334   12d...2us : opal_flush_console <-__opal_put_chars
   dmesg-8334   12d...4us!: udelay <-opal_flush_console
   dmesg-8334   12d...  787us : soft_nmi_interrupt <-soft_nmi_common
   dmesg-8334   12d...  787us : printk_nmi_enter <-soft_nmi_interrupt
   dmesg-8334   12d.Z.  788us : rcu_nmi_enter <-soft_nmi_interrupt
   dmesg-8334   12d.Z.  788us : rcu_nmi_exit <-soft_nmi_interrupt
   dmesg-8334   12d...  788us#: printk_nmi_exit <-soft_nmi_interrupt
   dmesg-8334   12d... 10005us*: udelay <-opal_flush_console
   dmesg-8334   12d... 20007us*: udelay <-opal_flush_console
   dmesg-8334   12d... 30020us*: udelay <-opal_flush_console
   dmesg-8334   12d... 40022us*: udelay <-opal_flush_console
   dmesg-8334   12d... 50023us*: udelay <-opal_flush_console
   dmesg-8334   12d... 60024us : opal_error_code <-opal_flush_console
   dmesg-8334   12d... 60025us : _raw_spin_unlock_irqrestore <-hvc_write
   dmesg-8334   12d... 60025us : _raw_spin_unlock_irqrestore
   dmesg-8334   12d... 60025us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
   dmesg-8334   12d... 60027us : 

60ms interrupt off latency waiting for the console to flush (just
running `dmesg` from OpenBMC console).

That requires some reworking of the hvc code, we can't fix it in
the OPAL driver alone.

Thanks,
Nick


Re: [PATCH 5/6] powerpc/powernv: implement opal_put_chars_nonatomic

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 16:23 +1000, Nicholas Piggin wrote:
> On Mon, 09 Apr 2018 15:57:55 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Mon, 2018-04-09 at 15:40 +1000, Nicholas Piggin wrote:
> > > The RAW console does not need writes to be atomic, so implement a
> > > _nonatomic variant which does not take a spinlock. This API is used
> > > in xmon, so the less locking that's used, the better chance there is
> > > that a crash can be debugged.  
> > 
> > I find the term "nonatomic" confusing...
> 
> I guess it is to go with the "atomic" comment for the hvsi console
> case -- all characters must get to the console together or not at
> all.

Yeah ok, it's just that in Linux "atomic" usually means something else
:-) Why not just call it "unlocked" which is what it's about and
matches existing practice throughout the kernel?

> > don't we have a problem if we
> > start hitting OPAL without a lock where we can't trust
> > opal_console_write_buffer_space anymore ? I think we need to handle
> > partial writes in that case. Maybe we should return how much was
> > written and leave the caller to deal with it.
> 
> Yes, the _nonatomic variant doesn't use opal_console_write_buffer_space
> and it does handle partial writes by returning written bytes (although
> callers generally tend to loop at the moment, we might do something
> smarter with them later).
> 
> > I was hoping (but that isn't the case) that by nonatomic you actually
> > meant calls that could be done in a non-atomic context, where we can do
> > msleep instead of mdelay. That would be handy for the console coming
> > from the hvc thread (the tty one).
> 
> Ah right, no. However we no longer loop until everything is written, so
> the hvc console driver (or the console layer) should be able to deal with
> that with sleeping. I don't think we need to put it at this level of the
> driver, but I don't know much about the console code.

Ok, so hopefully we shouldn't be hitting the delay..

Cheers,
Ben.



Re: [PATCH V6 2/4] powerpc/mm: Add support for handling > 512TB address in SLB miss

2018-04-09 Thread Christophe LEROY



Le 09/04/2018 à 10:33, Aneesh Kumar K.V a écrit :

On 04/09/2018 12:49 PM, Christophe LEROY wrote:



Le 26/03/2018 à 12:04, Aneesh Kumar K.V a écrit :

For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.

The mmu_context_t is also updated to track the new extended_ids. To
support up to 4PB we need a total of 8 contexts.

Signed-off-by: Aneesh Kumar K.V 
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
   in get_ea_context().]
Signed-off-by: Michael Ellerman 


Compilation fails on mpc885_ads_defconfig + CONFIG_HUGETLBFS :

   CC  arch/powerpc/mm/slice.o
arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area':
arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 
'need_extra_context' [-Werror=implicit-function-declaration]
arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 
'alloc_extended_context' [-Werror=implicit-function-declaration]

cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/mm/slice.o] Error 1
make: *** [arch/powerpc/mm] Error 2



something like below?

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 9cd87d1..205fe55 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include 

  static DEFINE_SPINLOCK(slice_convert_lock);


PPC64 was including that header via include/linux/pkeys.h


Yes compilation OK now.

Christophe



-aneesh


Re: [PATCH V6 2/4] powerpc/mm: Add support for handling > 512TB address in SLB miss

2018-04-09 Thread Christophe LEROY



Le 26/03/2018 à 12:04, Aneesh Kumar K.V a écrit :

For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.

The mmu_context_t is also updated to track the new extended_ids. To
support up to 4PB we need a total of 8 contexts.

Signed-off-by: Aneesh Kumar K.V 
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
   in get_ea_context().]
Signed-off-by: Michael Ellerman 


Compilation fails on mpc885_ads_defconfig + CONFIG_HUGETLBFS :

  CC  arch/powerpc/mm/slice.o
arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area':
arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 
'need_extra_context' [-Werror=implicit-function-declaration]
arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 
'alloc_extended_context' [-Werror=implicit-function-declaration]

cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/mm/slice.o] Error 1
make: *** [arch/powerpc/mm] Error 2

Christophe


---
  arch/powerpc/include/asm/book3s/64/hash-4k.h  |   6 ++
  arch/powerpc/include/asm/book3s/64/hash-64k.h |   6 ++
  arch/powerpc/include/asm/book3s/64/mmu.h  |  33 +++-
  arch/powerpc/include/asm/mmu_context.h|  39 ++
  arch/powerpc/include/asm/processor.h  |   6 ++
  arch/powerpc/kernel/exceptions-64s.S  |  11 ++-
  arch/powerpc/kernel/traps.c   |  12 ---
  arch/powerpc/mm/copro_fault.c |   2 +-
  arch/powerpc/mm/hash_utils_64.c   |   4 +-
  arch/powerpc/mm/mmu_context_book3s64.c|  15 +++-
  arch/powerpc/mm/pgtable-hash64.c  |   2 +-
  arch/powerpc/mm/slb.c | 108 ++
  arch/powerpc/mm/slb_low.S |  11 ++-
  arch/powerpc/mm/slice.c   |  15 +++-
  arch/powerpc/mm/tlb_hash64.c  |   2 +-
  15 files changed, 245 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 67c5475311ee..1a35eb944481 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -11,6 +11,12 @@
  #define H_PUD_INDEX_SIZE  9
  #define H_PGD_INDEX_SIZE  9
  
+/*

+ * Each context is 512TB. But on 4k we restrict our max TASK size to 64TB
+ * Hence also limit max EA bits to 64TB.
+ */
+#define MAX_EA_BITS_PER_CONTEXT46
+
  #ifndef __ASSEMBLY__
  #define H_PTE_TABLE_SIZE  (sizeof(pte_t) << H_PTE_INDEX_SIZE)
  #define H_PMD_TABLE_SIZE  (sizeof(pmd_t) << H_PMD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 3bcf269f8f55..8d0cbbb31023 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -7,6 +7,12 @@
  #define H_PUD_INDEX_SIZE  7
  #define H_PGD_INDEX_SIZE  8
  
+/*

+ * Each context is 512TB size. SLB miss for first context/default context
+ * is handled in the hotpath.
+ */
+#define MAX_EA_BITS_PER_CONTEXT49
+
  /*
   * 64k aligned address free up few of the lower bits of RPN for us
   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index c8c836e8ad1b..5094696eecd6 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -91,7 +91,18 @@ struct slice_mask {
  };
  
  typedef struct {

-   mm_context_id_t id;
+   union {
+   /*
+* We use id as the PIDR content for radix. On hash we can use
+* more than one id. The extended ids are used when we start
+* having address above 512TB. We allocate one extended id
+* for each 512TB. The new id is then used with the 49 bit
+* EA to build a new VA. We always use ESID_BITS_1T_MASK bits
+* from EA and new context ids to build the new VAs.
+*/
+   mm_context_id_t id;
+   mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE];
+   };
u16 user_psize; /* page size index */
  
  	/* Number of bits in the mm_cpumask */

@@ -196,5 +207,25 @@ extern void radix_init_pseries(void);
  static inline void radix_init_pseries(void) { };
  #endif
  
+static inline int get_ea_context(mm_context_t *ctx, unsigned long ea)

+{
+   int index = ea >> MAX_EA_BITS_PER_CONTEXT;
+
+   if (likely(index < ARRAY_SIZE(ctx->extended_id)))
+   return ctx->extended_id[index];
+
+   /* should never happen */
+   WARN_ON(1);
+   return 0;
+}
+
+static inline unsigned long get_user_vsid(mm_context_t *ctx,
+ 

Re: [PATCH v2 4/9] powerpc/powernv: OPAL console standardise OPAL_BUSY loops

2018-04-09 Thread Benjamin Herrenschmidt
On Mon, 2018-04-09 at 16:13 +1000, Nicholas Piggin wrote:
> We can always make exceptions to the standard form, but in those
> cases I would like to document it in the OPAL API and comment for
> the Linux side.
> 
> My thinking in this case is that it reduces time in firmware and
> in particular holding console locks. Is it likely / possible that
> we don't have enough buffering or some other issue makes it worth
> retrying so quickly?

Not sure to be honest, but yeah limiting the lock contention inside
OPAL is probably not a bad idea.

Cheers,
Ben.



Re: [PATCH V6 2/4] powerpc/mm: Add support for handling > 512TB address in SLB miss

2018-04-09 Thread Aneesh Kumar K.V

On 04/09/2018 12:49 PM, Christophe LEROY wrote:



Le 26/03/2018 à 12:04, Aneesh Kumar K.V a écrit :

For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.

The mmu_context_t is also updated to track the new extended_ids. To
support up to 4PB we need a total of 8 contexts.

Signed-off-by: Aneesh Kumar K.V 
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
   in get_ea_context().]
Signed-off-by: Michael Ellerman 


Compilation fails on mpc885_ads_defconfig + CONFIG_HUGETLBFS :

   CC  arch/powerpc/mm/slice.o
arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area':
arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 
'need_extra_context' [-Werror=implicit-function-declaration]
arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 
'alloc_extended_context' [-Werror=implicit-function-declaration]

cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/mm/slice.o] Error 1
make: *** [arch/powerpc/mm] Error 2



something like below?

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 9cd87d1..205fe55 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 

 static DEFINE_SPINLOCK(slice_convert_lock);


PPC64 was including that header via include/linux/pkeys.h

-aneesh



[PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Oliver O'Halloran
Some OF platforms (pseries and some SPARC systems) have their own
implementations of NUMA affinity detection rather than using the generic
OF_NUMA driver, which mainly exists for arm64. For other platforms, one
of two fallbacks provided by the base OF driver is used, depending on
CONFIG_NUMA.

In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
In the =y case the fallback is a real function which is defined as a
weak symbol so that it may be overridden by the architecture if desired.

The problem with this arrangement is that the real implementations all
export of_node_to_nid(). Unfortunately it's not possible to export the
fallback since it would clash with the non-weak version. As a result
we get build failures when:

a) CONFIG_NUMA=y && CONFIG_OF=y, and
b) The platform doesn't implement of_node_to_nid(), and
c) A module uses of_node_to_nid()

Given b) will be true for most platforms this is fairly easy to hit
and has been observed on ia64 and x86.

This patch remedies the problem by introducing the ARCH_HAS_OWN_OF_NUMA
Kconfig option, which is selected if an architecture provides its own
implementation of of_node_to_nid(). If a platform uses neither its own
implementation nor the generic OF_NUMA driver, the inline fallback in
of.h is always used, so we don't need to futz around with exports.

Cc: devicet...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: 298535c00a2c ("of, numa: Add NUMA of binding implementation.")
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 drivers/of/Kconfig   | 3 +++
 drivers/of/base.c| 7 ---
 include/linux/of.h   | 2 +-
 5 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..74ce5f3564ae 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -625,6 +625,7 @@ config NUMA
bool "NUMA support"
depends on PPC64
default y if SMP && PPC_PSERIES
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..f8071f1c3edb 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -299,6 +299,7 @@ config GENERIC_LOCKBREAK
 config NUMA
bool "NUMA support"
depends on SPARC64 && SMP
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index ad3fcad4d75b..01c62b747b25 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -103,4 +103,7 @@ config OF_OVERLAY
 config OF_NUMA
bool
 
+config ARCH_HAS_OWN_OF_NUMA
+   bool
+
 endif # OF
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 848f549164cd..82a9584bb0e2 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -84,13 +84,6 @@ int of_n_size_cells(struct device_node *np)
 }
 EXPORT_SYMBOL(of_n_size_cells);
 
-#ifdef CONFIG_NUMA
-int __weak of_node_to_nid(struct device_node *np)
-{
-   return NUMA_NO_NODE;
-}
-#endif
-
 static struct device_node **phandle_cache;
 static u32 phandle_cache_mask;
 
diff --git a/include/linux/of.h b/include/linux/of.h
index 4d25e4f952d9..9bb42dac5e65 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -942,7 +942,7 @@ static inline int of_cpu_node_to_id(struct device_node *np)
 #define of_node_cmp(s1, s2)strcasecmp((s1), (s2))
 #endif
 
-#if defined(CONFIG_OF) && defined(CONFIG_NUMA)
+#if defined(CONFIG_OF_NUMA) || defined(CONFIG_ARCH_HAS_OWN_OF_NUMA)
 extern int of_node_to_nid(struct device_node *np);
 #else
 static inline int of_node_to_nid(struct device_node *device)
-- 
2.9.5



Re: [PATCH] powerpc/64: irq_work avoid immediate interrupt when raised with hard irqs enabled

2018-04-09 Thread Benjamin Herrenschmidt
On Fri, 2018-04-06 at 00:31 +1000, Nicholas Piggin wrote:
> irq_work_raise should not schedule the hardware decrementer interrupt
> unless it is called from NMI context. Doing so often just results in an
> immediate masked decrementer interrupt:
> 
><...>-55090d...4us : update_curr_rt <-dequeue_task_rt
><...>-55090d...5us : dbs_update_util_handler <-update_curr_rt
><...>-55090d...6us : arch_irq_work_raise <-irq_work_queue
><...>-55090d...7us : soft_nmi_interrupt <-soft_nmi_common
><...>-55090d...7us : printk_nmi_enter <-soft_nmi_interrupt
><...>-55090d.Z.8us : rcu_nmi_enter <-soft_nmi_interrupt
><...>-55090d.Z.9us : rcu_nmi_exit <-soft_nmi_interrupt
><...>-55090d...9us : printk_nmi_exit <-soft_nmi_interrupt
><...>-55090d...   10us : cpuacct_charge <-update_curr_rt
> 
> Set the decrementer pending in the irq_happened mask directly, rather
> than having the masked decrementer handler do it.

Setting the paca field needs hard irqs off... also preempt_disable
doesn't look necessary if IRQs are off.

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/time.c | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index a32823dcd9a4..9d1cc183c974 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -510,6 +510,35 @@ static inline void clear_irq_work_pending(void)
>   "i" (offsetof(struct paca_struct, irq_work_pending)));
>  }
>  
> +void arch_irq_work_raise(void)
> +{
> + WARN_ON(!irqs_disabled());
> +
> + preempt_disable();
> + set_irq_work_pending_flag();
> + /*
> +  * Regular interrupts will check pending irq_happened as they return,
> +  * or process context when it next enables interrupts, so the
> +  * decrementer can be scheduled there.
> +  *
> +  * NMI interrupts do not, so setting the decrementer hardware
> +  * interrupt to fire ensures the work runs upon RI (if it's to a
> +  * MSR[EE]=1 context). We do not want to do this in other contexts
> +  * because if interrupts are hard enabled, the decrementer will
> +  * fire immediately here and just go to the masked handler to be
> +  * recorded in irq_happened.
> +  *
> +  * BookE does not support this yet; it must first be audited that
> +  * all NMI interrupt handlers call nmi_enter().
> +  */
> + if (IS_ENABLED(CONFIG_BOOKE) || in_nmi()) {
> + set_dec(1);
> + } else {
> + local_paca->irq_happened |= PACA_IRQ_DEC;
> + }
> + preempt_enable();
> +}
> +
>  #else /* 32-bit */
>  
>  DEFINE_PER_CPU(u8, irq_work_pending);
> @@ -518,16 +547,18 @@ DEFINE_PER_CPU(u8, irq_work_pending);
>  #define test_irq_work_pending()  
> __this_cpu_read(irq_work_pending)
>  #define clear_irq_work_pending() __this_cpu_write(irq_work_pending, 0)
>  
> -#endif /* 32 vs 64 bit */
> -
>  void arch_irq_work_raise(void)
>  {
> + WARN_ON(!irqs_disabled());
> +
>   preempt_disable();
>   set_irq_work_pending_flag();
>   set_dec(1);
>   preempt_enable();
>  }
>  
> +#endif /* 32 vs 64 bit */
> +
>  #else  /* CONFIG_IRQ_WORK */
>  
>  #define test_irq_work_pending()  0


Re: [PATCH] powerpc/64: irq_work avoid immediate interrupt when raised with hard irqs enabled

2018-04-09 Thread Nicholas Piggin
On Mon, 09 Apr 2018 18:46:29 +1000
Benjamin Herrenschmidt  wrote:

> On Fri, 2018-04-06 at 00:31 +1000, Nicholas Piggin wrote:
> > irq_work_raise should not schedule the hardware decrementer interrupt
> > unless it is called from NMI context. Doing so often just results in an
> > immediate masked decrementer interrupt:
> > 
> ><...>-55090d...4us : update_curr_rt <-dequeue_task_rt
> ><...>-55090d...5us : dbs_update_util_handler <-update_curr_rt
> ><...>-55090d...6us : arch_irq_work_raise <-irq_work_queue
> ><...>-55090d...7us : soft_nmi_interrupt <-soft_nmi_common
> ><...>-55090d...7us : printk_nmi_enter <-soft_nmi_interrupt
> ><...>-55090d.Z.8us : rcu_nmi_enter <-soft_nmi_interrupt
> ><...>-55090d.Z.9us : rcu_nmi_exit <-soft_nmi_interrupt
> ><...>-55090d...9us : printk_nmi_exit <-soft_nmi_interrupt
> ><...>-55090d...   10us : cpuacct_charge <-update_curr_rt
> > 
> > Set the decrementer pending in the irq_happened mask directly, rather
> > than having the masked decrementer handler do it.  
> 
> Setting the paca field needs hard irqs off...

Doh! Good catch, I should have noticed that :)

> also preempt_disable
> doesn't look necessary if IRQs are off.

True, just copied from existing code.

Thanks,
Nick


[PATCH RFC] hvc: provide a flush operation, implement for opal console, and use in hvc console

2018-04-09 Thread Nicholas Piggin
This patch is not quite polished and needs to be split out, but the
idea is to move busy waits in hvc console drivers out from under locks
and into the main hvc driver where it can sleep.

The flush op allows a 0 timeout, which reverts to spin-wait
behaviour to cater for polling cases. A default operation is
used for drivers that don't supply one, and they should get
some benefit too just from sleeping outside locks.

This applies on top of the recent series of opal console patches.

Before this patch it is possible to see large interrupts-off delays
caused by hvc_push->opal_put_chars_nonatomic->opal_flush_console, even
though the previous series moved the flush out from under the opal
driver's output lock, because hvc_push is called with the hvc lock held.

This is an irqsoff trace collected after running `dmesg` from the
OpenBMC console:

   dmesg-8334   12d...0us : _raw_spin_lock_irqsave
   dmesg-8334   12d...0us : hvc_push <-hvc_write
   dmesg-8334   12d...1us : opal_put_chars_nonatomic <-hvc_push
   dmesg-8334   12d...1us : __opal_put_chars <-hvc_push
   dmesg-8334   12d...2us : opal_flush_console <-__opal_put_chars
   dmesg-8334   12d...4us!: udelay <-opal_flush_console
   dmesg-8334   12d...  787us : soft_nmi_interrupt <-soft_nmi_common
   dmesg-8334   12d...  787us : printk_nmi_enter <-soft_nmi_interrupt
   dmesg-8334   12d.Z.  788us : rcu_nmi_enter <-soft_nmi_interrupt
   dmesg-8334   12d.Z.  788us : rcu_nmi_exit <-soft_nmi_interrupt
   dmesg-8334   12d...  788us#: printk_nmi_exit <-soft_nmi_interrupt
   dmesg-8334   12d... 10005us*: udelay <-opal_flush_console
   dmesg-8334   12d... 20007us*: udelay <-opal_flush_console
   dmesg-8334   12d... 30020us*: udelay <-opal_flush_console
   dmesg-8334   12d... 40022us*: udelay <-opal_flush_console
   dmesg-8334   12d... 50023us*: udelay <-opal_flush_console
   dmesg-8334   12d... 60024us : opal_error_code <-opal_flush_console
   dmesg-8334   12d... 60025us : _raw_spin_unlock_irqrestore <-hvc_write
   dmesg-8334   12d... 60025us : _raw_spin_unlock_irqrestore
   dmesg-8334   12d... 60025us : trace_hardirqs_on <-_raw_spin_unlock_irqrestore
   dmesg-8334   12d... 60027us : 

After this patch, the same operation does not show latency above
the noise of an idle system (~400us).

With my 200-400ms ping to the BMC it is impossible to tell any
responsiveness difference; it's not noticeably worse.

---
 arch/powerpc/include/asm/opal.h   |   1 +
 arch/powerpc/platforms/powernv/opal.c |  74 ++
 drivers/tty/hvc/hvc_console.c | 106 --
 drivers/tty/hvc/hvc_console.h |   1 +
 drivers/tty/hvc/hvc_opal.c|   2 +
 5 files changed, 129 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 66954d671831..115c8a5a0bfd 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -304,6 +304,7 @@ extern void opal_configure_cores(void);
 extern int opal_get_chars(uint32_t vtermno, char *buf, int count);
 extern int opal_put_chars(uint32_t vtermno, const char *buf, int total_len);
 extern int opal_put_chars_nonatomic(uint32_t vtermno, const char *buf, int 
total_len);
+extern int opal_flush_chars(uint32_t vtermno, long timeout);
 extern int opal_flush_console(uint32_t vtermno);
 
 extern void hvc_opal_init_early(void);
diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index 5e0f6b1bb4ba..fe71dee729ea 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -370,12 +370,8 @@ static int __opal_put_chars(uint32_t vtermno, const char 
*data, int total_len, b
olen = cpu_to_be64(total_len);
rc = opal_console_write(vtermno, , data);
if (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
-   if (rc == OPAL_BUSY_EVENT) {
-   mdelay(OPAL_BUSY_DELAY_MS);
+   if (rc == OPAL_BUSY_EVENT)
opal_poll_events(NULL);
-   } else if (rc == OPAL_BUSY_EVENT) {
-   mdelay(OPAL_BUSY_DELAY_MS);
-   }
written = -EAGAIN;
goto out;
}
@@ -401,15 +397,6 @@ static int __opal_put_chars(uint32_t vtermno, const char 
*data, int total_len, b
if (atomic)
spin_unlock_irqrestore(_write_lock, flags);
 
-   /* In the -EAGAIN case, callers loop, so we have to flush the console
-* here in case they have interrupts off (and we don't want to wait
-* for async flushing if we can make immediate progress here). If
-* necessary the API could be made entirely non-flushing if the
-* callers had a ->flush API to use.
-*/
-   if (written == -EAGAIN)
-   opal_flush_console(vtermno);
-
return written;
 }
 
@@ -427,40 +414,63 @@ int opal_put_chars_nonatomic(uint32_t vtermno, const char 
*data, int total_len)
return 

[PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread Laurent Dufour
Replace __HAVE_ARCH_PTE_SPECIAL by the new configuration variable
CONFIG_ARCH_HAS_PTE_SPECIAL.

Signed-off-by: Laurent Dufour 
---
 Documentation/features/vm/pte_special/arch-support.txt | 2 +-
 include/linux/pfn_t.h  | 4 ++--
 mm/gup.c   | 4 ++--
 mm/memory.c| 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/features/vm/pte_special/arch-support.txt 
b/Documentation/features/vm/pte_special/arch-support.txt
index 055004f467d2..cd05924ea875 100644
--- a/Documentation/features/vm/pte_special/arch-support.txt
+++ b/Documentation/features/vm/pte_special/arch-support.txt
@@ -1,6 +1,6 @@
 #
 # Feature name:  pte_special
-# Kconfig:   __HAVE_ARCH_PTE_SPECIAL
+# Kconfig:   ARCH_HAS_PTE_SPECIAL
 # description:   arch supports the pte_special()/pte_mkspecial() VM 
APIs
 #
 ---
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index a03c2642a87c..21713dc14ce2 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -122,7 +122,7 @@ pud_t pud_mkdevmap(pud_t pud);
 #endif
 #endif /* __HAVE_ARCH_PTE_DEVMAP */
 
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static inline bool pfn_t_special(pfn_t pfn)
 {
return (pfn.val & PFN_SPECIAL) == PFN_SPECIAL;
@@ -132,5 +132,5 @@ static inline bool pfn_t_special(pfn_t pfn)
 {
return false;
 }
-#endif /* __HAVE_ARCH_PTE_SPECIAL */
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 #endif /* _LINUX_PFN_T_H_ */
diff --git a/mm/gup.c b/mm/gup.c
index 2e2df7f3e92d..9e6a4f70deab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1354,7 +1354,7 @@ static void undo_dev_pagemap(int *nr, int nr_start, 
struct page **pages)
}
 }
 
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 int write, struct page **pages, int *nr)
 {
@@ -1430,7 +1430,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
 {
return 0;
 }
-#endif /* __HAVE_ARCH_PTE_SPECIAL */
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
 #if defined(__HAVE_ARCH_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 1bb725631ded..6fc7b9edc18f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,7 +817,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-#ifdef __HAVE_ARCH_PTE_SPECIAL
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 # define HAVE_PTE_SPECIAL 1
 #else
 # define HAVE_PTE_SPECIAL 0
-- 
2.7.4



[PATCH 1/3] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-09 Thread Laurent Dufour
Currently, PTE special support is turned on in per-architecture header
files. Most of the time, it is defined in arch/*/include/asm/pgtable.h,
depending (or not) on some other per-architecture static definition.

This patch introduces a new configuration variable to manage this directly
in the Kconfig files. It will later replace __HAVE_ARCH_PTE_SPECIAL.

Here are notes for some architectures where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:

arm
__HAVE_ARCH_PTE_SPECIAL is currently defined in
arch/arm/include/asm/pgtable-3level.h, which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
 - arch/powerpc/include/asm/book3s/64/pgtable.h
 - arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.

sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__), which are defined through the compiler in
sparc/Makefile if !SPARC32, which I assume means SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64.
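With the symbol in Kconfig rather than a header, another option can now simply
depend on it. A hypothetical consumer (the option name here is made up for
illustration) would look like:

```kconfig
config MY_FEATURE
	bool "Example feature that needs pte_special()/pte_mkspecial()"
	depends on ARCH_HAS_PTE_SPECIAL
```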

Suggested-by: Jerome Glisse 
Signed-off-by: Laurent Dufour 
---
 arch/arc/Kconfig | 1 +
 arch/arm/Kconfig | 1 +
 arch/arm64/Kconfig   | 1 +
 arch/powerpc/Kconfig | 1 +
 arch/riscv/Kconfig   | 1 +
 arch/s390/Kconfig| 1 +
 arch/sh/Kconfig  | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 3 +++
 10 files changed, 12 insertions(+)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index d76bf4a83740..8516e2b0239a 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -44,6 +44,7 @@ config ARC
select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
+   select ARCH_HAS_PTE_SPECIAL
 
 config MIGHT_HAVE_PCI
bool
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 1878083771af..a67973cb041c 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_DEBUG_VIRTUAL if MMU
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
+   select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 276e96ceaf27..7ae3c09921fb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -17,6 +17,7 @@ config ARM64
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_SYNC_CORE
+   select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..f7415fe25c07 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -141,6 +141,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 148865de1692..b0a8404bf684 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -34,6 +34,7 @@ config RISCV
select THREAD_INFO_IN_TASK
select RISCV_TIMER
select GENERIC_IRQ_MULTI_HANDLER
+   select ARCH_HAS_PTE_SPECIAL
 
 config MMU
def_bool y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 32a0d5b958bf..5f1f4997e7e9 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -72,6 +72,7 @@ config S390
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
+   select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 97fe29316476..a6c75b6806d2 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -50,6 +50,7 @@ config SUPERH
select HAVE_ARCH_AUDITSYSCALL
select HAVE_FUTEX_CMPXCHG if FUTEX
select HAVE_NMI
+   select ARCH_HAS_PTE_SPECIAL
help
  The SuperH is a RISC processor targeted for use in embedded systems
  and consumer electronics; it was also used in the Sega Dreamcast
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..6b5a4f05dcb2 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -86,6 +86,7 @@ config SPARC64
select ARCH_USE_QUEUED_SPINLOCKS
select GENERIC_TIME_VSYSCALL
select 

[PATCH 3/3] mm: remove __HAVE_ARCH_PTE_SPECIAL

2018-04-09 Thread Laurent Dufour
It is now replaced by the Kconfig variable CONFIG_ARCH_HAS_PTE_SPECIAL.

Signed-off-by: Laurent Dufour 
---
 arch/arc/include/asm/pgtable.h   | 2 --
 arch/arm/include/asm/pgtable-3level.h| 1 -
 arch/arm64/include/asm/pgtable.h | 2 --
 arch/powerpc/include/asm/book3s/64/pgtable.h | 3 ---
 arch/powerpc/include/asm/pte-common.h| 3 ---
 arch/s390/include/asm/pgtable.h  | 1 -
 arch/sh/include/asm/pgtable.h| 2 --
 arch/sparc/include/asm/pgtable_64.h  | 3 ---
 arch/x86/include/asm/pgtable_types.h | 1 -
 9 files changed, 18 deletions(-)

diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 08fe33830d4b..8ec5599a0957 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -320,8 +320,6 @@ PTE_BIT_FUNC(mkexec,|= (_PAGE_EXECUTE));
 PTE_BIT_FUNC(mkspecial,|= (_PAGE_SPECIAL));
 PTE_BIT_FUNC(mkhuge,   |= (_PAGE_HW_SZ));
 
-#define __HAVE_ARCH_PTE_SPECIAL
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index 2a4836087358..6d50a11d7793 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -219,7 +219,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
pte_val(pte) |= L_PTE_SPECIAL;
return pte;
 }
-#define__HAVE_ARCH_PTE_SPECIAL
 
 #define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY))
 #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY))
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7e2c27e63cd8..b96c8a186908 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -306,8 +306,6 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
 #define HPAGE_MASK (~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 
-#define __HAVE_ARCH_PTE_SPECIAL
-
 static inline pte_t pgd_pte(pgd_t pgd)
 {
return __pte(pgd_val(pgd));
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a6b9f1d74600..f12d148eccbe 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -338,9 +338,6 @@ extern unsigned long pci_io_base;
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
 
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/powerpc/include/asm/pte-common.h 
b/arch/powerpc/include/asm/pte-common.h
index c4a72c7a8c83..03dfddb1f49a 100644
--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -216,9 +216,6 @@ static inline bool pte_user(pte_t pte)
 #define PAGE_AGP   (PAGE_KERNEL_NC)
 #define HAVE_PAGE_AGP
 
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
 #ifndef _PAGE_READ
 /* if not defined, we should not find _PAGE_WRITE too */
 #define _PAGE_READ 0
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2d24d33bf188..9809694e1389 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -171,7 +171,6 @@ static inline int is_module_addr(void *addr)
#define _PAGE_WRITE	0x020	/* SW pte write bit */
 #define _PAGE_SPECIAL  0x040   /* SW associated with special page */
 #define _PAGE_UNUSED   0x080   /* SW bit for pgste usage state */
-#define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 #define _PAGE_SOFT_DIRTY 0x002 /* SW pte soft dirty bit */
diff --git a/arch/sh/include/asm/pgtable.h b/arch/sh/include/asm/pgtable.h
index 89c513a982fc..f6abfe2bca93 100644
--- a/arch/sh/include/asm/pgtable.h
+++ b/arch/sh/include/asm/pgtable.h
@@ -156,8 +156,6 @@ extern void page_table_range_init(unsigned long start, 
unsigned long end,
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 
-#define __HAVE_ARCH_PTE_SPECIAL
-
 #include 
 
 #endif /* __ASM_SH_PGTABLE_H */
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 44d6ac47e035..1393a8ac596b 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -117,9 +117,6 @@ bool kern_addr_valid(unsigned long addr);
#define _PAGE_PMD_HUGE	_AC(0x0100,UL) /* Huge page*/
#define _PAGE_PUD_HUGE	_PAGE_PMD_HUGE
 
-/* Advertise support for _PAGE_SPECIAL */
-#define __HAVE_ARCH_PTE_SPECIAL
-
 /* SUN4U pte bits... */
 #define _PAGE_SZ4MB_4U   _AC(0x6000,UL) /* 4MB Page */
 #define _PAGE_SZ512K_4U  _AC(0x4000,UL) /* 512K Page   
 */
diff --git a/arch/x86/include/asm/pgtable_types.h 

Re: [PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Michal Hocko
On Mon 09-04-18 15:57:06, Laurent Dufour wrote:
> The per-architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
> per-architecture header files. This doesn't allow other configuration
> options to depend on it.
> 
> This series moves __HAVE_ARCH_PTE_SPECIAL into the Kconfig files,
> setting it automatically for architectures that were already setting it
> in their header file.
> 
> There is no functional change introduced by this series.

I would just fold all three patches into a single one. It is much easier
to review that those selects are done properly when you can see that the
define is set for the same architecture.

In general, I like the patch. It is always quite painful to track per
arch defines.

> Laurent Dufour (3):
>   mm: introduce ARCH_HAS_PTE_SPECIAL
>   mm: replace __HAVE_ARCH_PTE_SPECIAL
>   mm: remove __HAVE_ARCH_PTE_SPECIAL
> 
>  Documentation/features/vm/pte_special/arch-support.txt | 2 +-
>  arch/arc/Kconfig   | 1 +
>  arch/arc/include/asm/pgtable.h | 2 --
>  arch/arm/Kconfig   | 1 +
>  arch/arm/include/asm/pgtable-3level.h  | 1 -
>  arch/arm64/Kconfig | 1 +
>  arch/arm64/include/asm/pgtable.h   | 2 --
>  arch/powerpc/Kconfig   | 1 +
>  arch/powerpc/include/asm/book3s/64/pgtable.h   | 3 ---
>  arch/powerpc/include/asm/pte-common.h  | 3 ---
>  arch/riscv/Kconfig | 1 +
>  arch/s390/Kconfig  | 1 +
>  arch/s390/include/asm/pgtable.h| 1 -
>  arch/sh/Kconfig| 1 +
>  arch/sh/include/asm/pgtable.h  | 2 --
>  arch/sparc/Kconfig | 1 +
>  arch/sparc/include/asm/pgtable_64.h| 3 ---
>  arch/x86/Kconfig   | 1 +
>  arch/x86/include/asm/pgtable_types.h   | 1 -
>  include/linux/pfn_t.h  | 4 ++--
>  mm/Kconfig | 3 +++
>  mm/gup.c   | 4 ++--
>  mm/memory.c| 2 +-
>  23 files changed, 18 insertions(+), 24 deletions(-)
> 
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


[PATCH v2 2/2] powerpc/time: Only set ARCH_HAS_SCALED_CPUTIME on PPC64

2018-04-09 Thread Christophe Leroy
Scaled cputime is only meaningful when the processor has
SPURR and/or PURR, which means only on PPC64.

Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:

Before:
0114 l F .text  00a8 vtime_delta
04c0 g F .text  0100 vtime_account_system
05c0 g F .text  0048 vtime_account_idle

After:
(vtime_delta gets inlined in the two functions)
0418 g F .text  00a0 vtime_account_system
04b8 g F .text  0054 vtime_account_idle

In terms of performance, we also get an approximately 5% improvement on task switches:
The following small benchmark app is run with perf stat:

#include <stdlib.h>
#include <pthread.h>

void *thread(void *arg)
{
	int i;

	for (i = 0; i < atoi((char*)arg); i++)
		pthread_yield();

	return NULL;
}

int main(int argc, char **argv)
{
pthread_t th1, th2;

	pthread_create(&th1, NULL, thread, argv[1]);
	pthread_create(&th2, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);

return 0;
}

Before the patch:

~# perf stat chrt -f 98 ./sched 10

 Performance counter stats for 'chrt -f 98 ./sched 10':

   8622.166272  task-clock (msec) #0.955 CPUs utilized  
200027  context-switches  #0.023 M/sec  

After the patch:

~# perf stat chrt -f 98 ./sched 10

 Performance counter stats for 'chrt -f 98 ./sched 10':

   8207.090048  task-clock (msec) #0.958 CPUs utilized  
200025  context-switches  #0.024 M/sec  

Signed-off-by: Christophe Leroy 
---
 v2: added ifdefs in xmon to fix a compilation error

 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/accounting.h |  4 
 arch/powerpc/include/asm/cputime.h|  2 ++
 arch/powerpc/kernel/time.c| 29 +++--
 arch/powerpc/xmon/xmon.c  |  4 
 5 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0c76d93d5da5..8c9f54779ff1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -142,7 +142,7 @@ config PPC
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
select ARCH_HAS_MEMBARRIER_CALLBACKS
-   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
+   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/accounting.h 
b/arch/powerpc/include/asm/accounting.h
index 3abcf98ed2e0..f1096d4cc658 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -15,8 +15,10 @@ struct cpu_accounting_data {
/* Accumulated cputime values to flush on ticks*/
unsigned long utime;
unsigned long stime;
+#ifdef ARCH_HAS_SCALED_CPUTIME
unsigned long utime_scaled;
unsigned long stime_scaled;
+#endif
unsigned long gtime;
unsigned long hardirq_time;
unsigned long softirq_time;
@@ -25,8 +27,10 @@ struct cpu_accounting_data {
/* Internal counters */
unsigned long starttime;/* TB value snapshot */
unsigned long starttime_user;   /* TB value on exit to usermode */
+#ifdef ARCH_HAS_SCALED_CPUTIME
unsigned long startspurr;   /* SPURR value snapshot */
unsigned long utime_sspurr; /* ->user_time when ->startspurr set */
+#endif
 };
 
 #endif
diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index bc4903badb3f..8fd3c1338822 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -62,7 +62,9 @@ static inline void arch_vtime_task_switch(struct task_struct 
*prev)
struct cpu_accounting_data *acct0 = get_accounting(prev);
 
acct->starttime = acct0->starttime;
+#ifdef ARCH_HAS_SCALED_CPUTIME
acct->startspurr = acct0->startspurr;
+#endif
 }
 #endif
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index a3ed2eb99d88..7d6040233003 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -175,6 +175,7 @@ static void calc_cputime_factors(void)
  * Read the SPURR on systems that have it, otherwise the PURR,
  * or if that doesn't exist return the timebase value passed in.
  */
+#ifdef ARCH_HAS_SCALED_CPUTIME
 static unsigned long read_spurr(unsigned long tb)
 {
if (cpu_has_feature(CPU_FTR_SPURR))
@@ -183,6 +184,7 @@ static unsigned long read_spurr(unsigned long tb)
return mfspr(SPRN_PURR);
return tb;
 }
+#endif
 
 #ifdef CONFIG_PPC_SPLPAR
 
@@ -285,22 +287,28 @@ static 

[PATCH v2 1/2] powerpc/time: inline arch_vtime_task_switch()

2018-04-09 Thread Christophe Leroy
arch_vtime_task_switch() is a small function which is called
only from vtime_common_task_switch(), so it is worth inlining it.

Signed-off-by: Christophe Leroy 
---
 v2: added a local pointer for get_accounting(prev) to keep GCC from
reading it twice

 arch/powerpc/include/asm/cputime.h | 16 +++-
 arch/powerpc/kernel/time.c | 21 -
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index 99b541865d8d..bc4903badb3f 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -47,9 +47,23 @@ static inline unsigned long cputime_to_usecs(const cputime_t 
ct)
  * has to be populated in the new task
  */
 #ifdef CONFIG_PPC64
+#define get_accounting(tsk)	(&get_paca()->accounting)
 static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
 #else
-void arch_vtime_task_switch(struct task_struct *tsk);
+#define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
+/*
+ * Called from the context switch with interrupts disabled, to charge all
+ * accumulated times to the current process, and to prepare accounting on
+ * the next process.
+ */
+static inline void arch_vtime_task_switch(struct task_struct *prev)
+{
+   struct cpu_accounting_data *acct = get_accounting(current);
+   struct cpu_accounting_data *acct0 = get_accounting(prev);
+
+   acct->starttime = acct0->starttime;
+   acct->startspurr = acct0->startspurr;
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 360e71d455cc..a3ed2eb99d88 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -163,12 +163,6 @@ EXPORT_SYMBOL(__cputime_usec_factor);
 void (*dtl_consumer)(struct dtl_entry *, u64);
 #endif
 
-#ifdef CONFIG_PPC64
-#define get_accounting(tsk)	(&get_paca()->accounting)
-#else
-#define get_accounting(tsk)	(&task_thread_info(tsk)->accounting)
-#endif
-
 static void calc_cputime_factors(void)
 {
struct div_result res;
@@ -421,21 +415,6 @@ void vtime_flush(struct task_struct *tsk)
acct->softirq_time = 0;
 }
 
-#ifdef CONFIG_PPC32
-/*
- * Called from the context switch with interrupts disabled, to charge all
- * accumulated times to the current process, and to prepare accounting on
- * the next process.
- */
-void arch_vtime_task_switch(struct task_struct *prev)
-{
-   struct cpu_accounting_data *acct = get_accounting(current);
-
-   acct->starttime = get_accounting(prev)->starttime;
-   acct->startspurr = get_accounting(prev)->startspurr;
-}
-#endif /* CONFIG_PPC32 */
-
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #define calc_cputime_factors()
 #endif
-- 
2.13.3



[PATCH 5/6] powerpc/powernv: implement opal_put_chars_nonatomic

2018-04-09 Thread Nicholas Piggin
The RAW console does not need writes to be atomic, so implement a
_nonatomic variant which does not take a spinlock. This API is used
in xmon, so the less locking that's used, the better the chance
that a crash can be debugged.

Cc: Benjamin Herrenschmidt 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/opal.h   |  1 +
 arch/powerpc/platforms/powernv/opal.c | 35 +++
 drivers/tty/hvc/hvc_opal.c|  4 +--
 3 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index bbff49fab0e5..66954d671831 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -303,6 +303,7 @@ extern void opal_configure_cores(void);
 
 extern int opal_get_chars(uint32_t vtermno, char *buf, int count);
 extern int opal_put_chars(uint32_t vtermno, const char *buf, int total_len);
+extern int opal_put_chars_nonatomic(uint32_t vtermno, const char *buf, int 
total_len);
 extern int opal_flush_console(uint32_t vtermno);
 
 extern void hvc_opal_init_early(void);
diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index b05500a70f58..dc77fc57d1e9 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -344,9 +344,9 @@ int opal_get_chars(uint32_t vtermno, char *buf, int count)
return 0;
 }
 
-int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
+static int __opal_put_chars(uint32_t vtermno, const char *data, int total_len, 
bool atomic)
 {
-   unsigned long flags;
+   unsigned long flags = 0 /* shut up gcc */;
int written;
__be64 olen;
s64 rc;
@@ -354,11 +354,8 @@ int opal_put_chars(uint32_t vtermno, const char *data, int 
total_len)
if (!opal.entry)
return -ENODEV;
 
-   /* We want put_chars to be atomic to avoid mangling of hvsi
-* packets. To do that, we first test for room and return
-* -EAGAIN if there isn't enough.
-*/
-	spin_lock_irqsave(&opal_write_lock, flags);
+   if (atomic)
+		spin_lock_irqsave(&opal_write_lock, flags);
	rc = opal_console_write_buffer_space(vtermno, &olen);
if (rc || be64_to_cpu(olen) < total_len) {
/* Closed -> drop characters */
@@ -391,14 +388,18 @@ int opal_put_chars(uint32_t vtermno, const char *data, 
int total_len)
 
written = be64_to_cpu(olen);
if (written < total_len) {
-   /* Should not happen */
-   pr_warn("atomic console write returned partial len=%d 
written=%d\n", total_len, written);
+   if (atomic) {
+   /* Should not happen */
+   pr_warn("atomic console write returned partial "
+   "len=%d written=%d\n", total_len, written);
+   }
if (!written)
written = -EAGAIN;
}
 
 out:
-	spin_unlock_irqrestore(&opal_write_lock, flags);
+   if (atomic)
+		spin_unlock_irqrestore(&opal_write_lock, flags);
 
/* In the -EAGAIN case, callers loop, so we have to flush the console
 * here in case they have interrupts off (and we don't want to wait
@@ -412,6 +413,20 @@ int opal_put_chars(uint32_t vtermno, const char *data, int 
total_len)
return written;
 }
 
+int opal_put_chars(uint32_t vtermno, const char *data, int total_len)
+{
+   /* We want put_chars to be atomic to avoid mangling of hvsi
+* packets. To do that, we first test for room and return
+* -EAGAIN if there isn't enough.
+*/
+   return __opal_put_chars(vtermno, data, total_len, true);
+}
+
+int opal_put_chars_nonatomic(uint32_t vtermno, const char *data, int total_len)
+{
+   return __opal_put_chars(vtermno, data, total_len, false);
+}
+
 int opal_flush_console(uint32_t vtermno)
 {
s64 rc;
diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index af122ad7f06d..e151cfacf2a7 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -51,7 +51,7 @@ static u32 hvc_opal_boot_termno;
 
 static const struct hv_ops hvc_opal_raw_ops = {
.get_chars = opal_get_chars,
-   .put_chars = opal_put_chars,
+   .put_chars = opal_put_chars_nonatomic,
.notifier_add = notifier_add_irq,
.notifier_del = notifier_del_irq,
.notifier_hangup = notifier_hangup_irq,
@@ -269,7 +269,7 @@ static void udbg_opal_putc(char c)
do {
switch(hvc_opal_boot_priv.proto) {
case HV_PROTOCOL_RAW:
-			count = opal_put_chars(termno, &c, 1);
+			count = opal_put_chars_nonatomic(termno, &c, 1);
break;
case HV_PROTOCOL_HVSI:
			count = hvc_opal_hvsi_put_chars(termno, &c, 1);
-- 
2.17.0



[PATCH 0/3] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-09 Thread Laurent Dufour
The per-architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
per-architecture header files. This doesn't allow other configuration
options to depend on it.

This series moves __HAVE_ARCH_PTE_SPECIAL into the Kconfig files,
setting it automatically for architectures that were already setting it
in their header file.

There is no functional change introduced by this series.

Laurent Dufour (3):
  mm: introduce ARCH_HAS_PTE_SPECIAL
  mm: replace __HAVE_ARCH_PTE_SPECIAL
  mm: remove __HAVE_ARCH_PTE_SPECIAL

 Documentation/features/vm/pte_special/arch-support.txt | 2 +-
 arch/arc/Kconfig   | 1 +
 arch/arc/include/asm/pgtable.h | 2 --
 arch/arm/Kconfig   | 1 +
 arch/arm/include/asm/pgtable-3level.h  | 1 -
 arch/arm64/Kconfig | 1 +
 arch/arm64/include/asm/pgtable.h   | 2 --
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h   | 3 ---
 arch/powerpc/include/asm/pte-common.h  | 3 ---
 arch/riscv/Kconfig | 1 +
 arch/s390/Kconfig  | 1 +
 arch/s390/include/asm/pgtable.h| 1 -
 arch/sh/Kconfig| 1 +
 arch/sh/include/asm/pgtable.h  | 2 --
 arch/sparc/Kconfig | 1 +
 arch/sparc/include/asm/pgtable_64.h| 3 ---
 arch/x86/Kconfig   | 1 +
 arch/x86/include/asm/pgtable_types.h   | 1 -
 include/linux/pfn_t.h  | 4 ++--
 mm/Kconfig | 3 +++
 mm/gup.c   | 4 ++--
 mm/memory.c| 2 +-
 23 files changed, 18 insertions(+), 24 deletions(-)

-- 
2.7.4
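For readers skimming the diffstat, the shape of the change can be sketched as the following Kconfig fragment. This is a hedged reconstruction from the series description, not the exact patch text; see the individual patches for the real hunks.

```kconfig
# mm/Kconfig: new symbol that generic mm code can depend on
config ARCH_HAS_PTE_SPECIAL
	bool

# arch/<arch>/Kconfig: each architecture that used to define
# __HAVE_ARCH_PTE_SPECIAL in its headers now selects the symbol, e.g.:
config X86
	...
	select ARCH_HAS_PTE_SPECIAL
```

Generic code (e.g. mm/gup.c, include/linux/pfn_t.h per the diffstat) then tests CONFIG_ARCH_HAS_PTE_SPECIAL instead of the per-architecture header define.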