date:20170519

[PATCH] powerpc: add PPC_FEATURE userspace bits for SCV and DARN instructions

2017-05-19 Thread Nicholas Piggin

Providing "scv" support to userspace requires kernel support, so it
must be advertised as independently to the base ISA 3 instruction set.

The darn instruction relies on firmware enablement, so it has been
decided to split this out from the core ISA 3 feature as well.

Signed-off-by: Nicholas Piggin 
---
These uapi changes have been agreed by powerpc toolchain and firmware
teams. I believe this completes our anticipated requirements for user
feature advertisement for ISA v3.0B.

 arch/powerpc/include/uapi/asm/cputable.h | 2 ++
 arch/powerpc/kernel/cputable.c   | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/cputable.h 
b/arch/powerpc/include/uapi/asm/cputable.h
index 3e7ce86d5c13..4d877144f377 100644
--- a/arch/powerpc/include/uapi/asm/cputable.h
+++ b/arch/powerpc/include/uapi/asm/cputable.h
@@ -46,6 +46,8 @@
 #define PPC_FEATURE2_HTM_NOSC  0x0100
 #define PPC_FEATURE2_ARCH_3_00 0x0080 /* ISA 3.00 */
 #define PPC_FEATURE2_HAS_IEEE128   0x0040 /* VSX IEEE Binary Float 
128-bit */
+#define PPC_FEATURE2_DARN  0x0020 /* darn random number insn */
+#define PPC_FEATURE2_SCV   0x0010 /* scv syscall */
 
 /*
  * IMPORTANT!
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 9b3e88b1a9c8..6f849832a669 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -124,7 +124,8 @@ extern void __restore_cpu_e6500(void);
 #define COMMON_USER_POWER9 COMMON_USER_POWER8
 #define COMMON_USER2_POWER9(COMMON_USER2_POWER8 | \
 PPC_FEATURE2_ARCH_3_00 | \
-PPC_FEATURE2_HAS_IEEE128)
+PPC_FEATURE2_HAS_IEEE128 | \
+PPC_FEATURE2_DARN )
 
 #ifdef CONFIG_PPC_BOOK3E_64
 #define COMMON_USER_BOOKE  (COMMON_USER_PPC64 | PPC_FEATURE_BOOKE)
-- 
2.11.0

Re: [RFC] arch hardlockup detector interfaces improvement

2017-05-19 Thread Nicholas Piggin

On Fri, 19 May 2017 15:43:45 -0400
Don Zickus  wrote:

> On Sat, May 20, 2017 at 12:53:06AM +1000, Nicholas Piggin wrote:
> > > I am curious to know what IBM thinks there.  Currently the HARDLOCKUP
> > > detector sits on top of perf.  I get the impression, you are removing that
> > > dependency.  Is that a permanent thing or are you thinking of switching 
> > > back
> > > and forth depending on if SOFTLOCKUP is enabled or not?  
> > 
> > We want to get away from perf permanently.
> > 
> > The PMU interrupts are not specially non-maskable from a hardware
> > POV, everything gets masked when you turn off interrupts in hardware.
> > powerpc arch code implements a software disable layer, and PMU
> > interrupts are differentiated there by being allowed to run even
> > under local_irq_disable();
> > 
> > We have a few issues with using perf for it. We disable it by
> > default because using it for that breaks another PMU feature.
> > 
> > But PMU interupts are not special, so it would be possible to e.g.,
> > take the timer interrupt before soft disable and have it touch the
> > watchdog if it fires while under local_irq_disable(). That give
> > exact same kind of pseudo-NMI as perf interrupts, without using PMU.
> > 
> > Further, we now want to introduce a local_pmu_disable() type of
> > interface that extends this soft disable layer to perf interrupts
> > as well for some cases. Once we start doing that, more code will
> > be exempt from the hardlockup watchdog, whereas a watchdog specific
> > hook from the timer interrupt would still cover it.  
> 
> Interesting.  Thanks for that info.
> 
> Do you think you can split the patch in half for me?  You are shuffling code
> around and making subtle changes in the shuffled code.  It is hard to double
> check everything.  Namely watchdog_nmi_reconfigure and watchdog_update_cpus
> suddenly appear.
> 
> The first patch would be straight code movement, the second the real
> changes.  I almost don't care if you throw in the LOCKUP->SOFTLOCKUP changes
> in the first patch either.
> 
> I am really interested in the subtle changes to make sure you covered the
> various race conditions that we have uncovered over the years.

Yes that makes sense, I'll do that.

> I also would like to see a sample of the watchdog_nmi_reconfigure() you had
> in mind.  Usually we hide all the nmi stuff underneath the watchdog
> functions.  But those are threaded, which is what you are trying to avoid,
> so the placement of the function makes sense but looks odd.

I'm just calling it after any sysctl changes or suspend/resume etc, and
the lockup detector I have basically just shuts down everything and
then re-starts with the new parameters (could be made much fancier but
I didn't see a need.

I will split out this patch and re-post to you with the powerpc patch in
the series that you can take a look at.

Thanks,
Nick

[PATCH] powernv/npu-dma.c: Fix opal_npu_destroy_context call

2017-05-19 Thread Alistair Popple

opal_npu_destroy_context() should be called with the NPU PHB, not the
PCIe PHB.

Signed-off-by: Alistair Popple 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 067defe..78fa939 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -714,7 +714,7 @@ static void pnv_npu2_release_context(struct kref *kref)
 void pnv_npu2_destroy_context(struct npu_context *npu_context,
struct pci_dev *gpdev)
 {
-   struct pnv_phb *nphb, *phb;
+   struct pnv_phb *nphb;
struct npu *npu;
struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
struct device_node *nvlink_dn;
@@ -728,13 +728,12 @@ void pnv_npu2_destroy_context(struct npu_context 
*npu_context,
 
nphb = pci_bus_to_host(npdev->bus)->private_data;
npu = >npu;
-   phb = pci_bus_to_host(gpdev->bus)->private_data;
nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
_index)))
return;
npu_context->npdev[npu->index][nvlink_index] = NULL;
-   opal_npu_destroy_context(phb->opal_id, npu_context->mm->context.id,
+   opal_npu_destroy_context(nphb->opal_id, npu_context->mm->context.id,
PCI_DEVID(gpdev->bus->number, gpdev->devfn));
kref_put(_context->kref, pnv_npu2_release_context);
 }
-- 
2.1.4

Re: [RFC] arch hardlockup detector interfaces improvement

2017-05-19 Thread Don Zickus

On Sat, May 20, 2017 at 12:53:06AM +1000, Nicholas Piggin wrote:
> > I am curious to know what IBM thinks there.  Currently the HARDLOCKUP
> > detector sits on top of perf.  I get the impression, you are removing that
> > dependency.  Is that a permanent thing or are you thinking of switching back
> > and forth depending on if SOFTLOCKUP is enabled or not?
> 
> We want to get away from perf permanently.
> 
> The PMU interrupts are not specially non-maskable from a hardware
> POV, everything gets masked when you turn off interrupts in hardware.
> powerpc arch code implements a software disable layer, and PMU
> interrupts are differentiated there by being allowed to run even
> under local_irq_disable();
> 
> We have a few issues with using perf for it. We disable it by
> default because using it for that breaks another PMU feature.
> 
> But PMU interupts are not special, so it would be possible to e.g.,
> take the timer interrupt before soft disable and have it touch the
> watchdog if it fires while under local_irq_disable(). That give
> exact same kind of pseudo-NMI as perf interrupts, without using PMU.
> 
> Further, we now want to introduce a local_pmu_disable() type of
> interface that extends this soft disable layer to perf interrupts
> as well for some cases. Once we start doing that, more code will
> be exempt from the hardlockup watchdog, whereas a watchdog specific
> hook from the timer interrupt would still cover it.

Interesting.  Thanks for that info.

Do you think you can split the patch in half for me?  You are shuffling code
around and making subtle changes in the shuffled code.  It is hard to double
check everything.  Namely watchdog_nmi_reconfigure and watchdog_update_cpus
suddenly appear.

The first patch would be straight code movement, the second the real
changes.  I almost don't care if you throw in the LOCKUP->SOFTLOCKUP changes
in the first patch either.

I am really interested in the subtle changes to make sure you covered the
various race conditions that we have uncovered over the years.

I also would like to see a sample of the watchdog_nmi_reconfigure() you had
in mind.  Usually we hide all the nmi stuff underneath the watchdog
functions.  But those are threaded, which is what you are trying to avoid,
so the placement of the function makes sense but looks odd.

Thanks!

Cheers,
Don

[PATCH] drivers/rtc/interface.c: Validate alarm-time before handling rollover

2017-05-19 Thread Vaibhav Jain

In function __rtc_read_alarm() its possible for an alarm time-stamp to
be invalid even after replacing missing components with current
time-stamp. The condition 'alarm->time.tm_year < 70' will trigger this
case and will cause the call to 'rtc_tm_to_time64(>time)'
return a negative value for variable t_alm.

While handling alarm rollover this negative t_alm (assumed to seconds
offset from '1970-01-01 00:00:00') is converted back to rtc_time via
rtc_time64_to_tm() which results in this error log with seemingly
garbage values:

"rtc rtc0: invalid alarm value: -2-1--1041528741
200557:71582844:32"

This error was generated when the rtc driver (rtc-opal in this case)
returned an alarm time-stamp of '00-00-00 00:00:00' to indicate that
the alarm is disabled. Though I have submitted a separate fix for the
rtc-opal driver, this issue may potentially impact other
existing/future rtc drivers.

To fix this issue the patch validates the alarm time-stamp just after
filling up the missing datetime components and if rtc_valid_tm() still
reports it to be invalid then bails out of the function without
handling the rollover.

Reported-by: Steve Best 
Signed-off-by: Vaibhav Jain 
---
 drivers/rtc/interface.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index fc0fa75..8cec9a0 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -227,6 +227,13 @@ int __rtc_read_alarm(struct rtc_device *rtc, struct 
rtc_wkalrm *alarm)
missing = year;
}
 
+   /* Can't proceed if alarm is still invalid after replacing
+* missing fields.
+*/
+   err = rtc_valid_tm(>time);
+   if (err)
+   goto done;
+
/* with luck, no rollover is needed */
t_now = rtc_tm_to_time64();
t_alm = rtc_tm_to_time64(>time);
@@ -278,9 +285,9 @@ int __rtc_read_alarm(struct rtc_device *rtc, struct 
rtc_wkalrm *alarm)
dev_warn(>dev, "alarm rollover not handled\n");
}
 
-done:
err = rtc_valid_tm(>time);
 
+done:
if (err) {
dev_warn(>dev, "invalid alarm value: %d-%d-%d %d:%d:%d\n",
alarm->time.tm_year + 1900, alarm->time.tm_mon + 1,
-- 
2.9.3

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

2017-05-19 Thread Ivan Mikhaylov

Hi Michael,

>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>> index 2b33cfa..f75e512 100644
>> --- a/arch/powerpc/kernel/time.c
>> +++ b/arch/powerpc/kernel/time.c
>> @@ -738,12 +738,28 @@ static int __init get_freq(char *name, int cells, 
>> unsigned long *val)
>>  
>>  static void start_cpu_decrementer(void)
>>  {
>> +unsigned int tcr;
>>  #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
>>  /* Clear any pending timer interrupts */
>>  mtspr(SPRN_TSR, TSR_ENW | TSR_WIS | TSR_DIS | TSR_FIS);
>>  
>> +#ifdef CONFIG_FSP2
>> +/*
>> + * Prevent a kernel panic caused by unintentionally clearing TCR
>> + * watchdog bits.  At this point in the kernel boot, the watchdog has
>> + * already been enabled by u-boot.  The original code's attempt to
>
> Don't refer to "the original code", as it doesn't exist anymore now that
> we've patched it. Just say something like ".. so we must not clear the
> watchdog configuration bits".
Ok, got it.

>That breaks the build for other platforms:
>
> arch/powerpc/kernel/time.c: In function ‘start_cpu_decrementer’:
> arch/powerpc/kernel/time.c:741:15: error: unused variable ‘tcr’ 
> [-Werror=unused-variable]
>
Oops, didn't notice, my fault. 

>Or you could possibly just always leave TCR[WP], is there any case where
>it would be correct to clear that?
>
>cheers

>From my point of view it's possible. I've checked docu and on idea
it should be possible cause WP is only affecting watchdog ping time.
Which in case of '00' is very small, around ~5 ms. 
Ben also in next message said about get rid of ifdef for FSP2.

And now patch looks like this, What do you think Michael, Ben?

 arch/powerpc/kernel/time.c |   15 +--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index bc2e08d..2411c49 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -718,11 +718,22 @@ static int __init get_freq(char *name, int cells, 
unsigned long *val)
 static void start_cpu_decrementer(void)
 {
 #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
+   unsigned int tcr;
/* Clear any pending timer interrupts */
mtspr(SPRN_TSR, TSR_ENW | TSR_WIS | TSR_DIS | TSR_FIS);
 
-   /* Enable decrementer interrupt */
-   mtspr(SPRN_TCR, TCR_DIE);
+   tcr = mfspr(SPRN_TCR);
+   /*
+* At this point in the kernel boot, the watchdog has already
+* been enabled by u-boot. If we set it this to '00' it may
+* trigger watchdog earlier than needed which will cause
+* inattentional kernel panic. In this case we leaving TCR[WP]
+* bit setting from uboot/bootstrap.
+*/
+   tcr &= TCR_WP_MASK; /* clear all bits except for TCR[WP] */
+   tcr |= TCR_DIE; /* enable decrementer */
+   mtspr(SPRN_TCR, tcr);
+
 #endif /* defined(CONFIG_BOOKE) || defined(CONFIG_40x) */
 }
 
-- 
1.7.1

Re: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

2017-05-19 Thread Arnd Bergmann

On Fri, May 19, 2017 at 4:01 PM, David Laight  wrote:
> From: Arnd Bergmann
>> Sent: 17 May 2017 22:40
>>
>> On Wed, May 17, 2017 at 11:16 PM, Chris Packham
>>  wrote:
>> > On 18/05/17 06:18, Borislav Petkov wrote:
>> > One thing I would like confirmation on is is in_le32 -> ioread32 the
>> > correct change? I tossed up between ioread32 and readl. Looking at
>> > mv643xx_eth.c which supports both the MV643XX and Orion it's using readl
>> > so perhaps I should be using that.
>>
>> There is no easy answer: on powerpc, readl is used for PCI,
>> while in_le32 is used for on-chip devices, so in_le32 is the
>> right one in principle. The main difference is that readl can
>> work with CONFIG_EEH on pseries, but in_le32 is cheaper.
>>
>> On ARM and most other architectures, readl is used for both
>> PCI and on-chip devices, so that's what portable code tends
>> to use.
>>
>> ioread32 is required to behave the same way as readl
>> on all __iomem pointers returned from ioremap(), but
>> is an extern function on powerpc and can be more
>> expensive when CONFIG_GENERIC_IOMAP is set.
>
> What about x86?
> Isn't ioread32() an extern function that checks for 'io' addresses
> than need 'inb' (etc) instructions rather than memory ones.

Right, x86 uses CONFIG_GENERIC_IOMAP and has that
extra check.

> If we know a PCI slave isn't 'io' should be be using ioread32() or readl()?

Generally speaking yes, though on most architectures that don't
use CONFIG_GENERIC_IOMAP there is no practical difference.

> Don't some architectures have different enforced barriers in both these?

I'm not aware of any architecture that always adds barriers to ioread32
compared to readl, though I guess that may be a reasonable
implementation depending on the architecture. PowerPC doesn't do
it, though one could argue that it would be required to guarantee
non-posted I/O accesses on PCI I/O space that gets mapped with
pci_iomap(). On ARM, no extra barrier is needed because the difference
is in the page table flags.

   Arnd

Re: [RFC] arch hardlockup detector interfaces improvement

2017-05-19 Thread Nicholas Piggin

On Fri, 19 May 2017 09:17:53 -0400
Don Zickus  wrote:

> On Fri, May 19, 2017 at 09:07:31AM +1000, Nicholas Piggin wrote:
> > On Thu, 18 May 2017 12:30:28 -0400
> > Don Zickus  wrote:
> >   
> > > (adding Uli)
> > > 
> > > On Fri, May 19, 2017 at 01:50:26AM +1000, Nicholas Piggin wrote:  
> > > > I'd like to make it easier for architectures that have their own NMI /
> > > > hard lockup detector to reuse various configuration interfaces that are
> > > > provided by generic detectors (cmdline, sysctl, suspend/resume calls).
> > > > 
> > > > I'd also like to remove the dependency of arch hard lockup detectors
> > > > on the softlockup detector. The reason being these watchdogs can be
> > > > very small (sparc's is like a page of core code that does not use any
> > > > big subsystem like kthreads or timers).
> > > > 
> > > > So I do this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR, and
> > > > juggling around what goes under config options. HAVE_NMI_WATCHDOG
> > > > continues to be the config for arch to override the hard lockup
> > > > detector, which is expanded to cover a few more cases.
> > > 
> > > Basically you are trying to remove the heavy HARDLOCKUP pieces to minimize
> > > the SOFTLOCKUP piece and use your own NMI detector, right?
> > > 
> > > I am guessing you would then disable SOFTLOCKUP to remove all the kthread
> > > and timer stuff but continue to use the generic infrastructure to help
> > > manager your own NMI detector?  
> > 
> > Yes that's right.
> >   
> > > A lot of the code is just re-organizing things and adding an explicit
> > > ifdef on SOFTLOCKUP, which seems fine to me.
> > > 
> > > I just need to spend some time on some of your #else clauses to see what
> > > functionality is dropped when you use your approach.  
> > 
> > Okay, appreciated. I can trim down cc lists and send you my powerpc
> > WIP if you'd like to have a look.  
> 
> I am curious to know what IBM thinks there.  Currently the HARDLOCKUP
> detector sits on top of perf.  I get the impression, you are removing that
> dependency.  Is that a permanent thing or are you thinking of switching back
> and forth depending on if SOFTLOCKUP is enabled or not?

We want to get away from perf permanently.

The PMU interrupts are not specially non-maskable from a hardware
POV, everything gets masked when you turn off interrupts in hardware.
powerpc arch code implements a software disable layer, and PMU
interrupts are differentiated there by being allowed to run even
under local_irq_disable();

We have a few issues with using perf for it. We disable it by
default because using it for that breaks another PMU feature.

But PMU interupts are not special, so it would be possible to e.g.,
take the timer interrupt before soft disable and have it touch the
watchdog if it fires while under local_irq_disable(). That give
exact same kind of pseudo-NMI as perf interrupts, without using PMU.

Further, we now want to introduce a local_pmu_disable() type of
interface that extends this soft disable layer to perf interrupts
as well for some cases. Once we start doing that, more code will
be exempt from the hardlockup watchdog, whereas a watchdog specific
hook from the timer interrupt would still cover it.

Thanks,
Nick

[PATCH] powerpc/64: Use tick accounting by default

2017-05-19 Thread Anton Blanchard

From: Anton Blanchard 

ppc64 is the only architecture that turns on VIRT_CPU_ACCOUNTING_NATIVE
by default. The overhead of this option is extremely high - a context
switch microbenchmark using sched_yield() is almost 20% slower.

To get finer grained user/hardirq/softirq statitics, the
IRQ_TIME_ACCOUNTING option can be used instead, which has much lower
overhead.

As such, disable this option by default. If a user really wants it,
they can still enable it manually.

Signed-off-by: Anton Blanchard 
---
 init/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 1d3475fc9496..a5c30acc1ede 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -342,8 +342,7 @@ config VIRT_CPU_ACCOUNTING
 
 choice
prompt "Cputime accounting"
-   default TICK_CPU_ACCOUNTING if !PPC64
-   default VIRT_CPU_ACCOUNTING_NATIVE if PPC64
+   default TICK_CPU_ACCOUNTING
 
 # Kind of a stub config for the pure tick based cputime accounting
 config TICK_CPU_ACCOUNTING
-- 
2.11.0

RE: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

2017-05-19 Thread David Laight

From: Arnd Bergmann
> Sent: 17 May 2017 22:40
> 
> On Wed, May 17, 2017 at 11:16 PM, Chris Packham
>  wrote:
> > On 18/05/17 06:18, Borislav Petkov wrote:
> > One thing I would like confirmation on is is in_le32 -> ioread32 the
> > correct change? I tossed up between ioread32 and readl. Looking at
> > mv643xx_eth.c which supports both the MV643XX and Orion it's using readl
> > so perhaps I should be using that.
> 
> There is no easy answer: on powerpc, readl is used for PCI,
> while in_le32 is used for on-chip devices, so in_le32 is the
> right one in principle. The main difference is that readl can
> work with CONFIG_EEH on pseries, but in_le32 is cheaper.
> 
> On ARM and most other architectures, readl is used for both
> PCI and on-chip devices, so that's what portable code tends
> to use.
> 
> ioread32 is required to behave the same way as readl
> on all __iomem pointers returned from ioremap(), but
> is an extern function on powerpc and can be more
> expensive when CONFIG_GENERIC_IOMAP is set.

What about x86?
Isn't ioread32() an extern function that checks for 'io' addresses
than need 'inb' (etc) instructions rather than memory ones.
If we know a PCI slave isn't 'io' should be be using ioread32() or readl()?
Don't some architectures have different enforced barriers in both these?

David

Re: [RFC] arch hardlockup detector interfaces improvement

2017-05-19 Thread Don Zickus

On Fri, May 19, 2017 at 09:07:31AM +1000, Nicholas Piggin wrote:
> On Thu, 18 May 2017 12:30:28 -0400
> Don Zickus  wrote:
> 
> > (adding Uli)
> > 
> > On Fri, May 19, 2017 at 01:50:26AM +1000, Nicholas Piggin wrote:
> > > I'd like to make it easier for architectures that have their own NMI /
> > > hard lockup detector to reuse various configuration interfaces that are
> > > provided by generic detectors (cmdline, sysctl, suspend/resume calls).
> > > 
> > > I'd also like to remove the dependency of arch hard lockup detectors
> > > on the softlockup detector. The reason being these watchdogs can be
> > > very small (sparc's is like a page of core code that does not use any
> > > big subsystem like kthreads or timers).
> > > 
> > > So I do this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR, and
> > > juggling around what goes under config options. HAVE_NMI_WATCHDOG
> > > continues to be the config for arch to override the hard lockup
> > > detector, which is expanded to cover a few more cases.  
> > 
> > Basically you are trying to remove the heavy HARDLOCKUP pieces to minimize
> > the SOFTLOCKUP piece and use your own NMI detector, right?
> > 
> > I am guessing you would then disable SOFTLOCKUP to remove all the kthread
> > and timer stuff but continue to use the generic infrastructure to help
> > manager your own NMI detector?
> 
> Yes that's right.
> 
> > A lot of the code is just re-organizing things and adding an explicit
> > ifdef on SOFTLOCKUP, which seems fine to me.
> > 
> > I just need to spend some time on some of your #else clauses to see what
> > functionality is dropped when you use your approach.
> 
> Okay, appreciated. I can trim down cc lists and send you my powerpc
> WIP if you'd like to have a look.

I am curious to know what IBM thinks there.  Currently the HARDLOCKUP
detector sits on top of perf.  I get the impression, you are removing that
dependency.  Is that a permanent thing or are you thinking of switching back
and forth depending on if SOFTLOCKUP is enabled or not?

Cheers,
Don

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

2017-05-19 Thread Pavel Machek

Hi!

> >>> > > Signed-off-by: Pavel Machek 
> >>> > 
> >>> > Ping? Two partitions at same place are bad news...
> >>> 
> >>> Please expand on "bad news"?  What are the runtime effects of this
> >>> change?  Decisions about which kernel(s) to patch depend on this info.
> >>
> >> Well... two partitions at same place. If you use one, you will corrupt
> >> information on the other one.
> >>
> >> OTOH this moves partition around (so that they don't overlap) so it is
> >> probably not stable candidate.
> >>
> >> I guess this is not huge issue; people using these boards probably
> >> have custom dts changes, anyway...
> >
> > Or no one's even using it anymore.
> >
> > I can take this via powerpc, I won't mark it for stable etc.
> 
> It became:
> 
> Author: Pavel Machek 
> 
> powerpc/sequoia: Fix NAND partitions not to overlap
> 
> Currently the DTS defines two partitions at the same addresses, if you
> use one, you will corrupt information on the other one. Fix it by
> shifting the second partition.
> 
> Signed-off-by: Pavel Machek 
> [mpe: Reconstruct change log from email thread]
> Signed-off-by: Michael Ellerman 
> 
> 
> Minor nit, your From: address doesn't match your Signed-off-by: address,
> which trips my "check patch is signed-off-by author" script. I can
> just ignore it, but I think it's preferable if they match.

Sorry about that (I have automatic scripts changing From:). It is
still me, but I guess denx should get the credit here, so you can use.

Signed-off-by: Pavel Machek 

Thanks,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature

RE: [PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-19 Thread David Laight

From: Christoph Hellwig
> Sent: 16 May 2017 12:48
>
> The new callback gets a pointer to the timer_list itself, which can
> then be used to get the containing structure using container_of
> instead of casting from and to unsigned long all the time.

What about sensible drivers that put some other value in the 'data'
field?

Perhaps it ought to have been 'void *data'.

Seems retrograde to be passing the address of the timer structure
(which, in principle, the callers no nothing about).

So I wouldn't call it 'modern', just different.

David

Memory hotplug broken on powerkvm/pseries with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

2017-05-19 Thread Bharata B Rao

Hi,

Memory hotplug for a pseries guest on PowerKVM fails with the following
error when CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y in the guest kernel. (I
am on commit b23afd384801)

[root@localhost ~]#  pseries-hotplug-mem: Attempting to hot-add 4 LMB(s) at
index 8020
removing memory fails, because memory
[0x0002-0x00020fff] is onlined
[ cut here ]
kernel BUG at mm/memory_hotplug.c:2189!
Oops: Exception in kernel mode, sig: 5 [#1]
Call Trace:
remove_memory+0x70/0x100 (unreliable)
dlpar_add_lmb+0x370/0x3f0
dlpar_memory+0x7bc/0xcd0
handle_dlpar_errorlog+0xf8/0x160
pseries_hp_work_fn+0x94/0xa0
process_one_work+0x240/0x510
worker_thread+0x9c/0x550
kthread+0x164/0x1b0
ret_from_kernel_thread+0x5c/0x7c

This failure occurs because the current code results in double onlining as
memhp_auto_online gets set with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.

It appears that by reverting commit 943db62c316c5 (powerpc/pseries: Revert
'Auto-online hotplugged memory'), hotplug is working again.

Apparently that commit itself is a revert of other two commits which caused
"the pseries memory DLPAR code to fail when trying to remove an LMB that
was previously removed and added back". However I am able to do hotplug,
unplug, plug and unplug of the same DIMM device (set of LMBs) most of the
times with the above commit (943db62c316c5) reverted. The only occasional
failure to unplug occurs when guest refuses to offline an LMB (with -EBUSY)
possibly because it is in use, which I think is a valid failure.

Nathan - Could you please let me know the exact way to reproduce the
problem that led to this commit (943db62c316c5) ?

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

2017-05-19 Thread Benjamin Herrenschmidt

On Mon, 2017-05-15 at 16:07 +0300, Ivan Mikhaylov wrote:
> +#ifdef CONFIG_FSP2
> +   /*
> +    * Prevent a kernel panic caused by unintentionally clearing TCR
> +    * watchdog bits.  At this point in the kernel boot, the watchdog has
> +    * already been enabled by u-boot.  The original code's attempt to
> +    * write to the TCR register results in an inadvertent clearing of the
> +    * watchdog configuration bits, causing the 440 to reset.
> +    */
> +   tcr = mfspr(SPRN_TCR);
> +   tcr &= TCR_WP_MASK; /* clear all bits except for TCR[WP] */
> +   tcr |= TCR_DIE; /* enable decrementer */
> +   mtspr(SPRN_TCR, tcr);
> +#else

This should be a runtime test, not a compile time option.

Cheers,
Ben.

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

2017-05-19 Thread Michael Ellerman

Hi Ivan,

Ivan Mikhaylov  writes:
> Prevent a kernel panic caused by unintentionally clearing TCR
> watchdog bits.  At this point in the kernel boot, the watchdog has
> already been enabled by u-boot.  The original code's attempt to
> write to the TCR register results in an inadvertent clearing of the
> watchdog configuration bits, causing the 476 to reset.
> Panic happens in case of error as silently reboot without any outputs
> on serial.

That sounds reasonable.

> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 2b33cfa..f75e512 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -738,12 +738,28 @@ static int __init get_freq(char *name, int cells, 
> unsigned long *val)
>  
>  static void start_cpu_decrementer(void)
>  {
> + unsigned int tcr;
>  #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
>   /* Clear any pending timer interrupts */
>   mtspr(SPRN_TSR, TSR_ENW | TSR_WIS | TSR_DIS | TSR_FIS);
>  
> +#ifdef CONFIG_FSP2
> + /*
> +  * Prevent a kernel panic caused by unintentionally clearing TCR
> +  * watchdog bits.  At this point in the kernel boot, the watchdog has
> +  * already been enabled by u-boot.  The original code's attempt to

Don't refer to "the original code", as it doesn't exist anymore now that
we've patched it. Just say something like ".. so we must not clear the
watchdog configuration bits".

> +  * write to the TCR register results in an inadvertent clearing of the
> +  * watchdog configuration bits, causing the 440 to reset.
> +  */
> + tcr = mfspr(SPRN_TCR);
> + tcr &= TCR_WP_MASK; /* clear all bits except for TCR[WP] */
> + tcr |= TCR_DIE; /* enable decrementer */
> + mtspr(SPRN_TCR, tcr);
> +#else
>   /* Enable decrementer interrupt */
>   mtspr(SPRN_TCR, TCR_DIE);
> +#endif
> +
>  #endif /* defined(CONFIG_BOOKE) || defined(CONFIG_40x) */

That breaks the build for other platforms:

  arch/powerpc/kernel/time.c: In function ‘start_cpu_decrementer’:
  arch/powerpc/kernel/time.c:741:15: error: unused variable ‘tcr’ 
[-Werror=unused-variable]

You can just do something like:

  #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
unsigned int tcr;

  #ifdef CONFIG_FSP2
tcr = mfspr(SPRN_TCR);
tcr &= TCR_WP_MASK; /* clear all bits except for TCR[WP] */
  #else
tcr = 0;
  #endif
tcr |= TCR_DIE; /* enable decrementer */
mtspr(SPRN_TCR, TCR_DIE);
  #endif /* defined(CONFIG_BOOKE) || defined(CONFIG_40x) */


Or you could possibly just always leave TCR[WP], is there any case where
it would be correct to clear that?

cheers

[PATCH] rtc/tpo: Implement rtc_class_ops.alarm_irq_enable callback

2017-05-19 Thread Vaibhav Jain

Provide an implementation of the callback
rtc_class_ops.alarm_irq_enable callback for rtc-opal driver. This
callback is called when the wake alarm is disabled via the command:

'echo 0 > /sys/class/rtc/rtc0/wakealarm'

Without this the Timed-Power-On(TPO) config remains set even when its
disabled by the above command and the FSP will force machine boot at
previously configured time.

The callback is implemented as function opal_tpo_alarm_irq_enable()
which calls opal_set_tpo_time() with alarm.enabled == 0. A branch has
been added to opal_set_tpo_time() to handle this case by passing y_m_d
== h_m_s_ms == 0 to opal as arguments for opal_tpo_write() call.

Patch-based-on: https://patchwork.ozlabs.org/patch/764559/
Signed-off-by: Vaibhav Jain 
---
Corresponding opal side changes posted at
https://patchwork.ozlabs.org/patch/764555/ "fsp/tpo: Provide support
for disabling TPO alarm"
---
 drivers/rtc/rtc-opal.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
index 2d84e2a..e2a946c 100644
--- a/drivers/rtc/rtc-opal.c
+++ b/drivers/rtc/rtc-opal.c
@@ -167,7 +167,14 @@ static int opal_set_tpo_time(struct device *dev, struct 
rtc_wkalrm *alarm)
u32 y_m_d = 0;
int token, rc;
 
-   tm_to_opal(>time, _m_d, _m_s_ms);
+   /* if alarm is enabled */
+   if (alarm->enabled) {
+   tm_to_opal(>time, _m_d, _m_s_ms);
+   pr_debug("Alarm set to %x %llx\n", y_m_d, h_m_s_ms);
+
+   } else {
+   pr_debug("Alarm getting disabled\n");
+   }
 
token = opal_async_get_token_interruptible();
if (token < 0) {
@@ -200,6 +207,18 @@ static int opal_set_tpo_time(struct device *dev, struct 
rtc_wkalrm *alarm)
return rc;
 }
 
+int opal_tpo_alarm_irq_enable(struct device *dev, unsigned int enabled)
+{
+   struct rtc_wkalrm alarm = { .enabled = 0 };
+
+   /*
+* TPO is automatically enabled when opal_set_tpo_time() is called with
+* non-zero rtc-time. We only handle disable case which needs to be
+* explicitly told to opal.
+*/
+   return enabled ? 0 : opal_set_tpo_time(dev, );
+}
+
 static struct rtc_class_ops opal_rtc_ops = {
.read_time  = opal_get_rtc_time,
.set_time   = opal_set_rtc_time,
@@ -215,6 +234,7 @@ static int opal_rtc_probe(struct platform_device *pdev)
device_set_wakeup_capable(>dev, true);
opal_rtc_ops.read_alarm = opal_get_tpo_time;
opal_rtc_ops.set_alarm = opal_set_tpo_time;
+   opal_rtc_ops.alarm_irq_enable = opal_tpo_alarm_irq_enable;
}
 
rtc = devm_rtc_device_register(>dev, DRVNAME, _rtc_ops,
-- 
2.9.3

[PATCH] rtc/tpo: Handle disabled TPO in opal_get_tpo_time()

2017-05-19 Thread Vaibhav Jain

On PowerNV platform when Timed-Power-On(TPO) is disabled, read of
stored TPO yields value with all date components set to '0' inside
opal_get_tpo_time(). The function opal_to_tm() then converts it to an
offset from year 1900 yielding alarm-time == "1900-00-01
00:00:00". This causes problems with __rtc_read_alarm() that
expecting an offset from "1970-00-01 00:00:00" and returned alarm-time
results in a -ve value for time64_t. Which ultimately results in this
error reported in kernel logs with a seemingly garbage value:

"rtc rtc0: invalid alarm value: -2-1--1041528741
200557:71582844:32"

We fix this by explicitly handling the case of all alarm date-time
components being '0' inside opal_get_tpo_time() and returning -ENOENT
in such a case. This signals generic rtc that no alarm is set and it
bails out from the alarm initialization flow without reporting the
above error.

Signed-off-by: Vaibhav Jain 
Reported-by: Steve Best 
---
 drivers/rtc/rtc-opal.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
index ea20f62..2d84e2a 100644
--- a/drivers/rtc/rtc-opal.c
+++ b/drivers/rtc/rtc-opal.c
@@ -142,6 +142,16 @@ static int opal_get_tpo_time(struct device *dev, struct 
rtc_wkalrm *alarm)
 
y_m_d = be32_to_cpu(__y_m_d);
h_m_s_ms = ((u64)be32_to_cpu(__h_m) << 32);
+
+   /* check if no alarm is set */
+   if (y_m_d == 0 && h_m_s_ms == 0) {
+   pr_debug("No alarm is set\n");
+   rc = -ENOENT;
+   goto exit;
+   } else {
+   pr_debug("Alarm set to %x %llx\n", y_m_d, h_m_s_ms);
+   }
+
opal_to_tm(y_m_d, h_m_s_ms, >time);
 
 exit:
-- 
2.9.3

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

2017-05-19 Thread Michael Ellerman

Michael Ellerman  writes:

> Pavel Machek  writes:
>
>> On Wed 2017-05-17 14:37:17, Andrew Morton wrote:
>>> On Wed, 17 May 2017 14:06:13 +0200 Pavel Machek  wrote:
>>> 
>>> > On Sun 2017-04-02 12:05:36, Pavel Machek wrote:
>>> > > Fix overlapping NAND partitions.
>>> > > 
>>> > > Signed-off-by: Pavel Machek 
>>> > 
>>> > Ping? Two partitions at same place are bad news...
>>> 
>>> Please expand on "bad news"?  What are the runtime effects of this
>>> change?  Decisions about which kernel(s) to patch depend on this info.
>>
>> Well... two partitions at same place. If you use one, you will corrupt
>> information on the other one.
>>
>> OTOH this moves partition around (so that they don't overlap) so it is
>> probably not stable candidate.
>>
>> I guess this is not huge issue; people using these boards probably
>> have custom dts changes, anyway...
>
> Or no one's even using it anymore.
>
> I can take this via powerpc, I won't mark it for stable etc.

It became:

Author: Pavel Machek 

powerpc/sequoia: Fix NAND partitions not to overlap

Currently the DTS defines two partitions at the same addresses, if you
use one, you will corrupt information on the other one. Fix it by
shifting the second partition.

Signed-off-by: Pavel Machek 
[mpe: Reconstruct change log from email thread]
Signed-off-by: Michael Ellerman 


Minor nit, your From: address doesn't match your Signed-off-by: address,
which trips my "check patch is signed-off-by author" script. I can
just ignore it, but I think it's preferable if they match.

cheers

Re: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

2017-05-19 Thread Michael Ellerman

Arnd Bergmann  writes:

> On Thu, May 18, 2017 at 7:36 AM, Michael Ellerman  wrote:
>> Borislav Petkov  writes:
>>
>>> Top-posting so that the PPC list can see the whole patch below.
>>>
>>> Since I don't know PPC, let me add PPC ML to CC for a confirmation this
>>> change is correct.
>>>
>>> Which brings me to the tangential: this driver is from 2006-ish and
>>> is for some "Marvell MV64x60 Memory Controller kernel module for PPC
>>> platforms". If you're going to touch it, then you should test on the PPC
>>> hardware too, so that you don't break the driver there.
>>>
>>> Unless that hardware is obsolete now and we don't care and, and ..?
>>>
>>> Maybe someone has some insights...
>>
>> Not really sorry.
>>
>> I don't have one of those boards, so I can't test. Maybe someone else on
>> the list does?
>>
>> I'd err on the side of the PPC hardware being obsolete and no one really
>> caring. If the driver is helpful on ARM then we may as well use it
>> there, if we can also avoid breaking it on PPC then great.
>
> I never had one myself, but tried to figure out what is still there to be
> supported. In 2014, we removed one platform (PrPMC2800) that was
> obsolete. There were eight boards that didn't make the cut from
> arch/ppc32 to arch/powerpc. The C2K is the last one and it was
> added in 2008 with this comment:
>
> Support for the C2K cPCI Single Board Computer from GEFanuc
> (PowerPC MPC7448 with a Marvell MV64460 chipset).
> All features of the board are not supported yet, but the board
> boots, flash works, all Ethernet ports are working and PCI
> devices are all found (USB and SATA on PCI1 do not work yet).
>
> Part 3 of 5: driver for the board.  At this time it is very generic
> and similar to its original, the driver for the prpmc2800.
>
> The original submitter never followed up on it and neither the
> board code not the DTS was ever updated to include additional
> features, so I assume it only got worse from there.
>
> According to https://www.abaco.com/products/c2ka-compactpci-sbc
> the end-of-life date for the product was in 2015, presumably 10 years
> after it got introduced.
>
> This is definitely obsolete by now, and given the missing features
> I would assume that nobody is running mainline kernels on it and
> it can be removed.

Great, thanks for doing all that digging.

I'll cook up a patch to remove c2k next week.

Then we can drop CONFIG_MV64X60 from PPC and see where that leaves
things, depending on what's useful on ARM.

cheers

Re: powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash

2017-05-19 Thread Michael Ellerman

On Thu, 2017-05-18 at 10:37:31 UTC, Michael Ellerman wrote:
> virt_addr_valid() is supposed to tell you if it's OK to call virt_to_page() on
> an address. What this means in practice is that it should only return true for
> addresses in the linear mapping which are backed by a valid PFN.
> 
> We are failing to properly check that the address is in the linear mapping,
> because virt_to_pfn() will return a valid looking PFN for more or less any
> address. That bug is actually caused by __pa(), used in virt_to_pfn().
> 
> eg: __pa(0xc001) = 0x1  # Good
> __pa(0xd001) = 0x1  # Bad!
> __pa(0x0001) = 0x1  # Bad!
> 
> This started happening after commit bdbc29c19b26 ("powerpc: Work around gcc
> miscompilation of __pa() on 64-bit") (Aug 2013), where we changed the 
> definition
> of __pa() to work around a GCC bug. Prior to that we subtracted PAGE_OFFSET 
> from
> the value passed to __pa(), meaning __pa() of a 0xd or 0x0 address would give
> you something bogus back.
> 
> Until we can verify if that GCC bug is no longer an issue, or come up with
> another solution, this commit does the minimal fix to make virt_addr_valid()
> work, by explicitly checking that the address is in the linear mapping region.
> 
> Fixes: bdbc29c19b26 ("powerpc: Work around gcc miscompilation of __pa() on 
> 64-bit")
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Paul Mackerras 
> Reviewed-by: Balbir Singh 
> Tested-by: Breno Leitao 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/e41e53cd4fe331d0d1f06f8e4ed7e2

cheers

[GIT PULL] Please pull powerpc/linux.git powerpc-4.12-3 tag

2017-05-19 Thread Michael Ellerman

Hi Linus,

Please pull some powerpc fixes for 4.12:

The following changes since commit 2ea659a9ef488125eb46da6eb571de5eae5c43f6:

  Linux 4.12-rc1 (2017-05-13 13:19:49 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.12-3

for you to fetch changes up to e41e53cd4fe331d0d1f06f8e4ed7e2cc63ee2c34:

  powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash (2017-05-19 13:04:35 
+1000)


powerpc fixes for 4.12 #3

The headline is a fix for FP/VMX register corruption when using transactional
memory, and a new selftest to go with it.

Then there's the virt_addr_valid() fix, currently HARDENDED_USERCOPY is tripping
on that causing some machines to crash.

A few other fairly minor fixes for long tail things, and a couple of fixes for
code we just merged.

Thanks to:
  Breno Leitao, Gautham R. Shenoy, Michael Neuling, Naveen N. Rao. Nicholas
  Piggin, Paul Mackerras.


Gautham R. Shenoy (1):
  powerpc/powernv: Set NAPSTATELOST after recovering paca on P9 DD1

Michael Ellerman (3):
  powerpc/modules: If mprofile-kernel is enabled add it to vermagic
  powerpc/mm: Fix crash in page table dump with huge pages
  powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash

Michael Neuling (2):
  powerpc/tm: Fix FP and VMX register corruption
  selftests/powerpc: Test TM and VMX register state

Naveen N. Rao (1):
  powerpc/kprobes: Fix handling of instruction emulation on probe re-entry

 arch/powerpc/include/asm/module.h  |   4 +
 arch/powerpc/include/asm/page.h|  12 +++
 arch/powerpc/kernel/idle_book3s.S  |   2 +-
 arch/powerpc/kernel/kprobes.c  |   3 +-
 arch/powerpc/kernel/process.c  |  19 
 arch/powerpc/mm/dump_linuxpagetables.c |   7 +-
 tools/testing/selftests/powerpc/tm/.gitignore  |   1 +
 tools/testing/selftests/powerpc/tm/Makefile|   4 +-
 .../testing/selftests/powerpc/tm/tm-vmx-unavail.c  | 118 +
 9 files changed, 164 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/tm/tm-vmx-unavail.c


signature.asc
Description: PGP signature

[PATCH net-next v2] ibmveth: Support to enable LSO/CSO for Trunk VEA.

2017-05-19 Thread Sivakumar Krishnasamy

Current largesend and checksum offload feature in ibmveth driver,
 - Source VM sends the TCP packets with ip_summed field set as
   CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
   checksum field
 - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
   "no checksum" and "checksum good" bits in transmit buffer descriptor
   before the packet is delivered to pseries PowerVM Hypervisor
 - If ibmveth has largesend capability enabled, transmit buffer descriptors
   are market accordingly before packet is delivered to Hypervisor
   (along with mss value for packets with length > MSS)
 - Destination VM's ibmveth driver receives the packet with "checksum good"
   bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
 - If "largesend" bit was on, mss value is copied from receive descriptor
   into SKB's gso_size and other flags are appropriately set for
   packets > MSS size
 - The packet is now successfully delivered up the stack in destination VM

The offloads described above works fine for TCP communication among VMs in
the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )

We are now enabling support for OVS in pseries PowerVM environment. One of
our requirements is to have ibmveth driver configured in "Trunk" mode, when
they are used with OVS. This is because, PowerVM Hypervisor will no more
bridge the packets between VMs, instead the packets are delivered to
IO Server which hosts OVS to bridge them between VMs or to external
networks (flow shown below),
  VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
   <=> VM B
In "IO server" the packet is received by inbound Trunk ibmveth and then
delivered to OVS, which is then bridged to outbound Trunk ibmveth (shown
below),
Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth

In this model, we hit the following issues which impacted the VM
communication performance,

 - Issue 1: ibmveth doesn't support largesend and checksum offload features
   when configured as "Trunk". Driver has explicit checks to prevent
   enabling these offloads.

 - Issue 2: SYN packet drops seen at destination VM. When the packet
   originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
   IO server's inbound Trunk ibmveth, on validating "checksum good" bits
   in ibmveth receive routine, SKB's ip_summed field is set with
   CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
   Bridge) and delivered to outbound Trunk ibmveth. At this point the
   outbound ibmveth transmit routine will not set "no checksum" and
   "checksum good" bits in transmit buffer descriptor, as it does so only
   when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
   delivered to destination VM, TCP layer receives the packet with checksum
   value of 0 and with no checksum related flags in ip_summed field. This
   leads to packet drops. So, TCP connections never goes through fine.

 - Issue 3: First packet of a TCP connection will be dropped, if there is
   no OVS flow cached in datapath. OVS while trying to identify the flow,
   computes the checksum. The computed checksum will be invalid at the
   receiving end, as ibmveth transmit routine zeroes out the pseudo
   checksum value in the packet. This leads to packet drop.

 - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
   When Physical NIC has GRO enabled and when OVS bridges these packets,
   OVS vport send code will end up calling dev_queue_xmit, which in turn
   calls validate_xmit_skb.
   In validate_xmit_skb routine, the larger packets will get segmented into
   MSS sized segments, if SKB has a frag_list and if the driver to which
   they are delivered to doesn't support NETIF_F_FRAGLIST feature.

This patch addresses the above four issues, thereby enabling end to end
largesend and checksum offload support for better performance.

 - Fix for Issue 1 : Remove checks which prevent enabling TCP largesend and
   checksum offloads.
 - Fix for Issue 2 : When ibmveth receives a packet with "checksum good"
   bit set and if its configured in Trunk mode, set appropriate SKB fields
   using skb_partial_csum_set (ip_summed field is set with
   CHECKSUM_PARTIAL)
 - Fix for Issue 3: Recompute the pseudo header checksum before sending the
   SKB up the stack.
 - Fix for Issue 4: Linearize the SKBs with frag_list. Though we end up
   allocating buffers and copying data, this fix gives
   upto 4X throughput increase.

Note: All these fixes need to be dropped together as fixing just one of
them will lead to other issues immediately (especially for Issues 1,2 & 3).

Signed-off-by: Sivakumar Krishnasamy 
---
 drivers/net/ethernet/ibm/ibmveth.c | 107 ++---
 drivers/net/ethernet/ibm/ibmveth.h |   1 +
 2 files changed, 90 insertions(+), 18 deletions(-)

diff --git

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

2017-05-19 Thread Pavel Machek

Hi!

> > Well... two partitions at same place. If you use one, you will corrupt
> > information on the other one.
> >
> > OTOH this moves partition around (so that they don't overlap) so it is
> > probably not stable candidate.
> >
> > I guess this is not huge issue; people using these boards probably
> > have custom dts changes, anyway...
> 
> Or no one's even using it anymore.
> 
> I can take this via powerpc, I won't mark it for stable etc.

That sounds good.

Thanks,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature

Re: [PATCH] perf: libdw support for powerpc

2017-05-19 Thread Paolo Bonzini



On 18/05/2017 20:48, Paolo Bonzini wrote:
> 
> 
> On 18/05/2017 20:19, Naveen N. Rao wrote:
>> Paolo Bonzini wrote:
>>> The ARM and x86 architectures already use libdw, and it is useful to
>>> have as much common code for the unwinder as possible.  Porting PPC
>>> to libdw only needs an architecture-specific hook to move the register
>>> state from perf to libdw.
>>
>> Thanks. Ravi has had a similar patch locally, but from what I
>> understand, there are issues with libdw based unwinding on powerpc64. I
>> gave this a quick test and I don't see the user-space call trace being
>> unwinded properly with libdw.
> 
> I don't see that problem:
> 
> -   99,98% 0,00%  dd   libc-2.17.so[.] 
> generic_start_main.isra.0
>  generic_start_main.isra.0
>- main
>   - 99,97% iread
>  - 97,82% sys_read
> - 96,97% extract_entropy_user
>  89,44% powernv_get_random_long
>  4,63% sha_transform
>  2,07% extract_buf
>1,15% _raw_spin_lock_irqsave
>0,51% extract_buf
> 
> (This is "perf record dd if=/dev/urandom of=/dev/null bs=512").
> 
> Can you copy the contents of tools/perf/.config-detected here?

FWIW it works fine for me here with latest Linus tree and your suggested
change to fix compilation:

--- a/tools/perf/arch/powerpc/util/unwind-libdw.c
+++ b/tools/perf/arch/powerpc/util/unwind-libdw.c
@@ -1,7 +1,7 @@
 #include 
-#include "../../util/util.h"
 #include "../../util/unwind-libdw.h"
 #include "../../util/perf_regs.h"
+#include "../../util/event.h"

 /* See backends/ppc_initreg.c and backends/ppc_regs.c in elfutils.  */
 static const int special_regs[3][2] = {

Thanks,

Paolo

[PATCH] powerpc: add PPC_FEATURE userspace bits for SCV and DARN instructions

Re: [RFC] arch hardlockup detector interfaces improvement

[PATCH] powernv/npu-dma.c: Fix opal_npu_destroy_context call

Re: [RFC] arch hardlockup detector interfaces improvement

[PATCH] drivers/rtc/interface.c: Validate alarm-time before handling rollover

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

Re: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

Re: [RFC] arch hardlockup detector interfaces improvement

[PATCH] powerpc/64: Use tick accounting by default

RE: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

Re: [RFC] arch hardlockup detector interfaces improvement

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

RE: [PATCH 2/9] timers: provide a "modern" variant of timers

Memory hotplug broken on powerkvm/pseries with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

[PATCH] rtc/tpo: Implement rtc_class_ops.alarm_irq_enable callback

[PATCH] rtc/tpo: Handle disabled TPO in opal_get_tpo_time()

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

Re: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

Re: powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash

[GIT PULL] Please pull powerpc/linux.git powerpc-4.12-3 tag

[PATCH net-next v2] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Re: [PATCH] powerpc/sequoia: fix NAND partitions not to overlap

Re: [PATCH] perf: libdw support for powerpc

25 matches

Site Navigation

Mail list logo

Footer information