Re: [PATCH 10/13] timer: Remove expires and data arguments from DEFINE_TIMER

2017-10-04 Thread Greg Kroah-Hartman
On Wed, Oct 04, 2017 at 04:27:04PM -0700, Kees Cook wrote:
> Drop the arguments from the macro and adjust all callers with the
> following script:
> 
>   perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
> $(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)
> 
> Signed-off-by: Kees Cook 
> Acked-by: Geert Uytterhoeven  # for m68k parts
> ---
>  arch/arm/mach-ixp4xx/dsmg600-setup.c  | 2 +-
>  arch/arm/mach-ixp4xx/nas100d-setup.c  | 2 +-
>  arch/m68k/amiga/amisound.c| 2 +-
>  arch/m68k/mac/macboing.c  | 2 +-
>  arch/mips/mti-malta/malta-display.c   | 2 +-
>  arch/parisc/kernel/pdc_cons.c | 2 +-
>  arch/s390/mm/cmm.c| 2 +-
>  drivers/atm/idt77105.c| 4 ++--
>  drivers/atm/iphase.c  | 2 +-
>  drivers/block/ataflop.c   | 8 
>  drivers/char/dtlk.c   | 2 +-
>  drivers/char/hangcheck-timer.c| 2 +-
>  drivers/char/nwbutton.c   | 2 +-
>  drivers/char/rtc.c| 2 +-
>  drivers/input/touchscreen/s3c2410_ts.c| 2 +-
>  drivers/net/cris/eth_v10.c| 6 +++---
>  drivers/net/hamradio/yam.c| 2 +-
>  drivers/net/wireless/atmel/at76c50x-usb.c | 2 +-
>  drivers/staging/speakup/main.c| 2 +-
>  drivers/staging/speakup/synth.c   | 2 +-
>  drivers/tty/cyclades.c| 2 +-
>  drivers/tty/isicom.c  | 2 +-
>  drivers/tty/moxa.c| 2 +-
>  drivers/tty/rocket.c  | 2 +-
>  drivers/tty/vt/keyboard.c | 2 +-
>  drivers/tty/vt/vt.c   | 2 +-
>  drivers/watchdog/alim7101_wdt.c   | 2 +-
>  drivers/watchdog/machzwd.c| 2 +-
>  drivers/watchdog/mixcomwd.c   | 2 +-
>  drivers/watchdog/sbc60xxwdt.c | 2 +-
>  drivers/watchdog/sc520_wdt.c  | 2 +-
>  drivers/watchdog/via_wdt.c| 2 +-
>  drivers/watchdog/w83877f_wdt.c| 2 +-
>  drivers/xen/grant-table.c | 2 +-
>  fs/pstore/platform.c  | 2 +-
>  include/linux/timer.h | 4 ++--
>  kernel/irq/spurious.c | 2 +-
>  lib/random32.c| 2 +-
>  net/atm/mpc.c | 2 +-
>  net/decnet/dn_route.c | 2 +-
>  net/ipv6/ip6_flowlabel.c  | 2 +-
>  net/netrom/nr_loopback.c  | 2 +-
>  security/keys/gc.c| 2 +-
>  sound/oss/midibuf.c   | 2 +-
>  sound/oss/soundcard.c | 2 +-
>  sound/oss/sys_timer.c | 2 +-
>  sound/oss/uart6850.c  | 2 +-
>  47 files changed, 54 insertions(+), 54 deletions(-)

Acked-by: Greg Kroah-Hartman 


[PATCH] powerpc/tm: P9 disabled suspend mode workaround

2017-10-04 Thread Michael Neuling
Each POWER9 core is made of two super slices. Each super slice can
only have one thread at a time in TM suspend mode. The super slice
restricts ever entering a state where both threads are in suspend by
aborting transactions on tsuspend or exceptions into the kernel.

Unfortunately for context switch we need trechkpt which forces suspend
mode. If a thread is already in suspend and a second thread needs to
be restored that was suspended, the trechkpt must be executed.
Currently the trechkpt will hang in this case until the other thread
exits suspend. This causes problems for Linux resulting in hang and
RCU stall detectors going off.

To workaround this, we disable suspend in the core. This is done via a
firmware change which stops the hardware ever getting into suspend.
The hardware will always rollback a transaction on any tsuspend or
entry into the kernel.

Unfortunately userspace can construct a sigcontext which enables
suspend. Thus userspace can force Linux into a path where trechkpt is
executed.

This patch blocks this from happening on POWER9 but sanity checking
sigcontexts passed in.

ptrace doesn't have this problem as only MSR SE and BE can be changed
via ptrace.

This patch also adds a number of WARN_ON() in case we every enter
suspend when we shouldn't. This should catch systems that don't have
the firmware change and are running TM.

A future firmware change will allow suspend mode on POWER9 but that is
going to require additional Linux changes to support. In the interim,
this allows TM to continue to (partially) work while stopping
userspace from crashing Linux.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/include/asm/cputable.h | 1 +
 arch/powerpc/kernel/cputable.c  | 7 +++
 arch/powerpc/kernel/process.c   | 2 ++
 arch/powerpc/kernel/signal_32.c | 4 
 arch/powerpc/kernel/signal_64.c | 5 +
 5 files changed, 19 insertions(+)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index a9bf921f4e..477e00b6a7 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -123,6 +123,7 @@ extern struct cpu_spec *identify_cpu(unsigned long offset, 
unsigned int pvr);
 extern void identify_cpu_name(unsigned int pvr);
 extern void do_feature_fixups(unsigned long value, void *fixup_start,
  void *fixup_end);
+extern bool tm_suspend_supported(void);
 
 extern const char *powerpc_base_platform;
 
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 7608729160..70252843ff 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -2301,6 +2301,13 @@ void __init identify_cpu_name(unsigned int pvr)
}
 }
 
+bool tm_suspend_supported(void)
+{
+   if (pvr_version_is(PVR_POWER9))
+   return false;
+   return true;
+}
+
 
 #ifdef CONFIG_JUMP_LABEL_FEATURE_CHECKS
 struct static_key_true cpu_feature_keys[NUM_CPU_FTR_KEYS] = {
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a0c74bbf34..5b81673c50 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -903,6 +903,8 @@ static inline void tm_reclaim_task(struct task_struct *tsk)
if (!MSR_TM_ACTIVE(thr->regs->msr))
goto out_and_saveregs;
 
+   WARN_ON(!tm_suspend_supported());
+
TM_DEBUG("--- tm_reclaim on pid %d (NIP=%lx, "
 "ccr=%lx, msr=%lx, trap=%lx)\n",
 tsk->pid, thr->regs->nip,
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 92fb1c8dbb..9eac0131c0 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -519,6 +519,8 @@ static int save_tm_user_regs(struct pt_regs *regs,
 {
unsigned long msr = regs->msr;
 
+   WARN_ON(!tm_suspend_supported());
+
/* Remove TM bits from thread's MSR.  The MSR in the sigcontext
 * just indicates to userland that we were doing a transaction, but we
 * don't want to return in transactional state.  This also ensures
@@ -769,6 +771,8 @@ static long restore_tm_user_regs(struct pt_regs *regs,
int i;
 #endif
 
+   if (!tm_suspend_supported())
+   return 1;
/*
 * restore general registers but not including MSR or SOFTE. Also
 * take care of keeping r2 (TLS) intact if not a signal.
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115858..6d28caf849 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -214,6 +214,8 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 
BUG_ON(!MSR_TM_ACTIVE(regs->msr));
 
+   WARN_ON(!tm_suspend_supported());
+
/* Remove TM bits from thread's MSR.  The MSR in the sigcontext
 * just indicates to userland that we were doing a transaction, but we
 * don't want to return in transactional 

Re: [PATCH tip/core/rcu 1/3] membarrier: Provide register expedited private command

2017-10-04 Thread Nicholas Piggin
On Wed,  4 Oct 2017 14:37:53 -0700
"Paul E. McKenney"  wrote:

> From: Mathieu Desnoyers 
> 
> Provide a new command allowing processes to register their intent to use
> the private expedited command.
> 
> This allows PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use expedited private.
> 
> Processes are now required to register before using
> MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
> 
> Changes since v1:
> - Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
>   powerpc membarrier_arch_sched_in(), given that we want to specifically
>   check the next thread state.
> - Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
> - Use task_thread_info() to pass thread_info from task to
>   *_ti_thread_flag().
> 
> Changes since v2:
> - Move membarrier_arch_sched_in() call to finish_task_switch().
> - Check for NULL t->mm in membarrier_arch_fork().
> - Use membarrier_sched_in() in generic code, which invokes the
>   arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
>   build on PowerPC.
> - Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
>   allnoconfig build on PowerPC.
> - Build and runtime tested on PowerPC.
> 
> Changes since v3:
> - Simply rely on copy_mm() to copy the membarrier_private_expedited mm
>   field on fork.
> - powerpc: test thread flag instead of reading
>   membarrier_private_expedited in membarrier_arch_fork().
> - powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
>   from kernel thread, since mmdrop() implies a full barrier.
> - Set membarrier_private_expedited to 1 only after arch registration
>   code, thus eliminating a race where concurrent commands could succeed
>   when they should fail if issued concurrently with process
>   registration.
> - Use READ_ONCE() for membarrier_private_expedited field access in
>   membarrier_private_expedited. Matches WRITE_ONCE() performed in
>   process registration.
> 
> Changes since v4:
> - Move powerpc hook from sched_in() to switch_mm(), based on feedback
>   from Nicholas Piggin.

For now, the powerpc approach is okay by me. I plan to test
others (e.g., taking runqueue locks) on larger systems, but that can
be sent as an incremental patch at a later time.

The main thing I would like is for people to review the userspace API.


> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 3a19c253bdb1..4af1b719c65f 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -205,4 +205,54 @@ static inline void memalloc_noreclaim_restore(unsigned 
> int flags)
>   current->flags = (current->flags & ~PF_MEMALLOC) | flags;
>  }
>  
> +#ifdef CONFIG_MEMBARRIER
> +
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
> +#include 
> +#else
> +static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> + struct mm_struct *next, struct task_struct *tsk)
> +{
> +}

This is no longer required in architecture independent code, is it?


Re: [v2,1/2] powerpc/xive: fix IPI reset

2017-10-04 Thread Michael Ellerman
On Wed, 2017-10-04 at 09:15:04 UTC, =?utf-8?q?C=C3=A9dric_Le_Goater?= wrote:
> When resetting an IPI, hw_ipi should also be set to zero.
> 
> Signed-off-by: Cédric Le Goater 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/74f1282114acc7d67e25745efe200f

cheers


Re: [v3] powerpc: fix compile error on 64K pages on 40x, 44x

2017-10-04 Thread Michael Ellerman
On Sun, 2017-10-01 at 14:33:03 UTC, Christian Lamparter wrote:
> The mmu context on the 40x, 44x does not define pte_frag
> entry. This causes gcc abort the compilation due to:
> 
> setup-common.c: In function ‘setup_arch’:
> setup-common.c:908: error: ‘mm_context_t’ has no ‘pte_frag’
> 
> This patch fixes the issue by removing the pte_frag
> initialization in setup-common.c.
> 
> This is possible, because the compiler will do the
> initialization, since the mm_context is a sub struct of
> init_mm. init_mm is declared in mm_types.h as external linkage.
> according to C99 6.2.4.3:
> "An object whose identifier is declared with external linkage
> [...] has static storage duration."
> 
> C99 defines in 6.7.8.10 that: "
> If an object that has static storage duration is not
> initialized explicitly, then:
> - if it has pointer type, it is initialized to a null pointer
> [...]
> "
> 
> Signed-off-by: Christian Lamparter 
> Reviewed-by: Christophe Leroy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/070e004912fed099263408bf2ff1bb

cheers


Re: powerpc: Fix action argument for cpufeatures-based TLB flush

2017-10-04 Thread Michael Ellerman
On Wed, 2017-09-27 at 04:55:51 UTC, Jeremy Kerr wrote:
> Commit 41d0c2ecde introduced calls to __flush_tlb_power[89] from the
> cpufeatures code, specifying the number of sets to flush.
> 
> However, these functions take an action argument, not a number of sets.
> This means we hit the BUG() in __flush_tlb_{206,300} when using
> cpufeatures-style configuration.
> 
> This change passes TLB_INVAL_SCOPE_GLOBAL instead.
> 
> Signed-off-by: Jeremy Kerr 
> CC: Nicholas Piggin 
> Reviewed-by: Nicholas Piggin 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/3b7af5c0fd9631762d1c4d7b4cee76

cheers


Re: powerpc/mm: Call flush_tlb_kernel_range with interrupts enabled

2017-10-04 Thread Michael Ellerman
On Sun, 2017-09-24 at 17:30:43 UTC, Guenter Roeck wrote:
> flush_tlb_kernel_range() may call smp_call_function_many() which expects
> interrupts to be enabled. This results in a traceback.
> 
> WARNING: CPU: 0 PID: 1 at kernel/smp.c:416 smp_call_function_many+0xcc/0x2fc
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc1-9-g0666f56 #1
> task: cf83 task.stack: cf82e000
> NIP:  c00a93c8 LR: c00a9634 CTR: 0001
> REGS: cf82fde0 TRAP: 0700   Not tainted  (4.14.0-rc1-9-g0666f56)
> MSR:  00021000   CR: 2482  XER: 
> 
> GPR00: c00a9634 cf82fe90 cf83 c050ad3c c0015a54  0001 0001
> GPR08: 0001   cf82e000 2484  c0003150 
> GPR16:      0001  c051
> GPR24:  c0015a54  c050ad3c c051823c c050ad3c 0025 
> NIP [c00a93c8] smp_call_function_many+0xcc/0x2fc
> LR [c00a9634] smp_call_function+0x3c/0x50
> Call Trace:
> [cf82fe90] [0010] 0x10 (unreliable)
> [cf82fed0] [c00a9634] smp_call_function+0x3c/0x50
> [cf82fee0] [c0015d2c] flush_tlb_kernel_range+0x20/0x38
> [cf82fef0] [c001524c] mark_initmem_nx+0x154/0x16c
> [cf82ff20] [c001484c] free_initmem+0x20/0x4c
> [cf82ff30] [c000316c] kernel_init+0x1c/0x108
> [cf82ff40] [c000f3a8] ret_from_kernel_thread+0x5c/0x64
> Instruction dump:
> 7c0803a6 7d808120 38210040 4e800020 3d20c052 812981a0 2f89 40beffac
> 3d20c051 8929ac64 2f89 40beff9c <0fe0> 4b94 7fc3f378 7f64db78
> 
> Fixes: 3184cc4b6f6a ("powerpc/mm: Fix kernel RAM protection after freeing 
> ...")
> Fixes: e611939fc8ec ("powerpc/mm: Ensure change_page_attr() doesn't ...")
> Cc: Christophe Leroy 
> Signed-off-by: Guenter Roeck 
> Reviewed-by: Christophe Leroy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7c6a4f3b1641195119ddbb531200f4

cheers


Re: [v2] powerpc: configs: Add Skiroot defconfig

2017-10-04 Thread Michael Ellerman
On Wed, 2017-10-04 at 04:23:24 UTC, Joel Stanley wrote:
> This configuration is used by the OpenPower firmware for it's
> Linux-as-bootloader implementation. Also known as the Petitboot kernel,
> this configuration broke in 4.12 (CPU_HOTPLUG=n), so add it to the
> upstream tree in order to get better coverage.
> 
> Signed-off-by: Joel Stanley 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c3dda4b0db9cf5044010e4de88e5e5

cheers


Re: [1/3] powerpc/lib/sstep: Add XER bits introduced in POWER ISA v3.0

2017-10-04 Thread Michael Ellerman
On Fri, 2017-09-29 at 05:44:08 UTC, Sandipan Das wrote:
> This adds definitions for the OV32 and CA32 bits of XER that
> were introduced in POWER ISA v3.0. There are some existing
> instructions that currently set the OV and CA bits based on
> certain conditions.
> 
> The emulation behaviour of all these instructions needs to
> be updated to set these new bits accordingly.
> 
> Signed-off-by: Sandipan Das 
> Acked-by: Naveen N. Rao 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/924c8feb041c3ef22d46ac2e746816

cheers


Re: [v2, 1/6] powerpc/watchdog: do not panic from locked CPU's IPI handler

2017-10-04 Thread Michael Ellerman
On Fri, 2017-09-29 at 03:29:37 UTC, Nicholas Piggin wrote:
> The SMP watchdog will detect locked CPUs and IPI them to print a
> backtrace and registers. If panic on hard lockup is enabled, do
> not panic from this handler, because that can cause recursion into
> the IPI layer during the panic.
> 
> The caller already panics in this case.
> 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/842dc1dbabb5e874550b52d896851e

cheers


Re: powerpc/powernv: Use early_radix_enabled in POWER9 tlb flush

2017-10-04 Thread Michael Ellerman
On Wed, 2017-09-27 at 05:45:58 UTC, Nicholas Piggin wrote:
> This code is used at boot and machine checks, so it should be using
> early_radix_enabled() (which is usable any time).
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/969a86a2855d484a00205a424df1c6

cheers


Re: [1/3] powerpc: oprofile: use setup_timer() helper.

2017-10-04 Thread Michael Ellerman
On Fri, 2017-09-22 at 11:34:58 UTC, Allen Pais wrote:
> Use setup_timer function instead of initializing timer with the
>function and data fields.
> 
> Signed-off-by: Allen Pais 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/83ad1e6a1dc049dd04be4564125a7f

cheers


Re: [v2] cxl: Set the valid bit in PE for dedicated mode

2017-10-04 Thread Michael Ellerman
On Mon, 2017-09-04 at 08:48:25 UTC, Vaibhav Jain wrote:
> Make sure to set the valid-bit in software-state field of the
> populated PE. This was earlier missing for dedicated mode AFUs, hence
> was causing a PSL freeze when the AFU was activated.
> 
> Signed-off-by: Vaibhav Jain 
> Acked-by: Frederic Barrat 
> Acked-by: Andrew Donnellan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8512bffd6226fea259f94fd23fc3b6

cheers


Re: [v3,1/2] powerpc/mm: Export flush_all_mm()

2017-10-04 Thread Michael Ellerman
On Sun, 2017-09-03 at 18:15:12 UTC, Frederic Barrat wrote:
> With the optimizations introduced by commit a46cc7a90fd8
> ("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no
> longer flushes the page walk cache with radix. This patch introduces
> flush_all_mm(), which flushes everything, tlb and pwc, for a given mm.
> 
> Signed-off-by: Frederic Barrat 
> Reviewed-By: Alistair Popple 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6110236b9bbd177debc045c5fc2922

cheers


[PATCH] powerpc: Default to enabling STRICT_KERNEL_RWX

2017-10-04 Thread Kees Cook
When available, CONFIG_KERNEL_RWX should be default-enabled.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Kees Cook 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468edab1..9a549bbfc278 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -178,6 +178,7 @@ config PPC
select HAVE_ARCH_TRACEHOOK
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
+   select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
select HAVE_CBPF_JITif !PPC64
select HAVE_CONTEXT_TRACKINGif PPC64
select HAVE_DEBUG_KMEMLEAK
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-04 Thread Nikunj A Dadhania
David Gibson  writes:

> On Tue, Oct 03, 2017 at 08:24:07AM +0200, Cédric Le Goater wrote:
>> On 10/03/2017 05:36 AM, David Gibson wrote:
>> > On Mon, Oct 02, 2017 at 06:27:20PM +0200, Cédric Le Goater wrote:
>> >> On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
>> >>> Hi,
>> >>>
>> >>> Here are a couple of small fixes to support CPU hot unplug. There are
>> >>> still some issues to be investigated as, in some occasions, after a
>> >>> couple of plug and unplug, the cpu which was removed receives a 'lost'
>> >>> interrupt. This showed to be the decrementer under QEMU.
>> >>
>> >> So this seems to be a QEMU issue only which can be solved by 
>> >> removing the DEE bit from the LPCR on P9 processor when the CPU 
>> >> is stopped in rtas. PECE3 bit on P8 processors. 
>> >>
>> >> I think these patches are valuable fixes for 4.14. The first 
>> >> is trivial and the second touches the common xive part but it
>> >> is only called on the pseries platform.  
>> >>
>> >> Could you please take a look ?
>> > 
>> > Sorry, I think I've missed something here.
>> > 
>> > Is there a qemu bug involved in this?  Has there been a patch sent
>> > that I didn't spot?
>> 
>> 
>> No, not yet, but I will today probably. something like below to stop
>> the decrementer when a CPU is stopped:
>> 
>>  --- qemu.git.orig/hw/ppc/spapr_rtas.c
>>  +++ qemu.git/hw/ppc/spapr_rtas.c
>>  @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *c
>>   kvm_cpu_synchronize_state(cs);
>>   
>>   env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>>  +
>>  +/* Enable DECR interrupt */
>>  +if (env->mmu_model == POWERPC_MMU_3_00) {
>>  +env->spr[SPR_LPCR] |= LPCR_DEE;
>>  +} else {
>>  +/* P7 and P8 both have same bit for DECR */
>>  +env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
>>  +}
>>  +
>>   env->nip = start;
>>   env->gpr[3] = r3;
>>   cs->halted = 0;
>>  @@ -210,6 +219,13 @@ static void rtas_stop_self(PowerPCCPU *c
>>* no need to bother with specific bits, we just clear it.
>>*/
>>   env->msr = 0;
>>  +
>>  +if (env->mmu_model == POWERPC_MMU_3_00) {
>>  +env->spr[SPR_LPCR] &= ~LPCR_DEE;
>>  +} else {
>>  +/* P7 and P8 both have same bit for DECR */
>>  +env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
>>  +}
>>   }
>>   
>>   static inline int sysparm_st(target_ulong addr, target_ulong len,
>>  
>> I haven't yet because I fail to understand why the decrementer is not 
>> interrupting the dying CPU under xics as it is the case under XIVE.
>
> Oh.. ok.  This sounds very similar to the problem Nikunj hit under TCG
> with decrementer interrupts waking up a supposedly dead CPU.  He had a
> couple of proposed fixes, but we got bogged down trying to work out
> why  (with TCG at least).

Yeah, I wasnt able to get to the exact reason for that.

Regards
Nikunj



Re: [PATCH 05/13] timer: Remove init_timer_deferrable() in favor of timer_setup()

2017-10-04 Thread Sebastian Reichel
Hi,

On Wed, Oct 04, 2017 at 04:26:59PM -0700, Kees Cook wrote:
> This refactors the only users of init_timer_deferrable() to use
> the new timer_setup() and from_timer(). Removes definition of
> init_timer_deferrable().

[...]

>  drivers/hsi/clients/ssi_protocol.c   | 32 
> 

Acked-by: Sebastian Reichel 

-- Sebastian


signature.asc
Description: PGP signature


Re: [PATCH 04/13] timer: Remove init_timer_pinned() in favor of timer_setup()

2017-10-04 Thread David Miller
From: Kees Cook 
Date: Wed,  4 Oct 2017 16:26:58 -0700

> This refactors the only users of init_timer_pinned() to use
> the new timer_setup() and from_timer(). Drops the definition of
> init_timer_pinned().
> 
> Cc: Chris Metcalf 
> Cc: Thomas Gleixner 
> Cc: net...@vger.kernel.org
> Signed-off-by: Kees Cook 

For networking:

Acked-by: David S. Miller 


Re: [PATCH 05/13] timer: Remove init_timer_deferrable() in favor of timer_setup()

2017-10-04 Thread David Miller
From: Kees Cook 
Date: Wed,  4 Oct 2017 16:26:59 -0700

> This refactors the only users of init_timer_deferrable() to use
> the new timer_setup() and from_timer(). Removes definition of
> init_timer_deferrable().
> 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Cc: Sebastian Reichel 
> Cc: Harish Patil 
> Cc: Manish Chopra 
> Cc: Kalle Valo 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: net...@vger.kernel.org
> Cc: linux-wirel...@vger.kernel.org
> Signed-off-by: Kees Cook 

For networking:

Acked-by: David S. Miller 


Re: [PATCH 10/13] timer: Remove expires and data arguments from DEFINE_TIMER

2017-10-04 Thread David Miller
From: Kees Cook 
Date: Wed,  4 Oct 2017 16:27:04 -0700

> Drop the arguments from the macro and adjust all callers with the
> following script:
> 
>   perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
> $(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)
> 
> Signed-off-by: Kees Cook 
> Acked-by: Geert Uytterhoeven  # for m68k parts

For networking:

Acked-by: David S. Miller 


Re: [PATCH 10/13] timer: Remove expires and data arguments from DEFINE_TIMER

2017-10-04 Thread Guenter Roeck

On 10/04/2017 04:27 PM, Kees Cook wrote:

Drop the arguments from the macro and adjust all callers with the
following script:

   perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
 $(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)

Signed-off-by: Kees Cook 
Acked-by: Geert Uytterhoeven  # for m68k parts


For watchdog:

Acked-by: Guenter Roeck 


---
  arch/arm/mach-ixp4xx/dsmg600-setup.c  | 2 +-
  arch/arm/mach-ixp4xx/nas100d-setup.c  | 2 +-
  arch/m68k/amiga/amisound.c| 2 +-
  arch/m68k/mac/macboing.c  | 2 +-
  arch/mips/mti-malta/malta-display.c   | 2 +-
  arch/parisc/kernel/pdc_cons.c | 2 +-
  arch/s390/mm/cmm.c| 2 +-
  drivers/atm/idt77105.c| 4 ++--
  drivers/atm/iphase.c  | 2 +-
  drivers/block/ataflop.c   | 8 
  drivers/char/dtlk.c   | 2 +-
  drivers/char/hangcheck-timer.c| 2 +-
  drivers/char/nwbutton.c   | 2 +-
  drivers/char/rtc.c| 2 +-
  drivers/input/touchscreen/s3c2410_ts.c| 2 +-
  drivers/net/cris/eth_v10.c| 6 +++---
  drivers/net/hamradio/yam.c| 2 +-
  drivers/net/wireless/atmel/at76c50x-usb.c | 2 +-
  drivers/staging/speakup/main.c| 2 +-
  drivers/staging/speakup/synth.c   | 2 +-
  drivers/tty/cyclades.c| 2 +-
  drivers/tty/isicom.c  | 2 +-
  drivers/tty/moxa.c| 2 +-
  drivers/tty/rocket.c  | 2 +-
  drivers/tty/vt/keyboard.c | 2 +-
  drivers/tty/vt/vt.c   | 2 +-
  drivers/watchdog/alim7101_wdt.c   | 2 +-
  drivers/watchdog/machzwd.c| 2 +-
  drivers/watchdog/mixcomwd.c   | 2 +-
  drivers/watchdog/sbc60xxwdt.c | 2 +-
  drivers/watchdog/sc520_wdt.c  | 2 +-
  drivers/watchdog/via_wdt.c| 2 +-
  drivers/watchdog/w83877f_wdt.c| 2 +-
  drivers/xen/grant-table.c | 2 +-
  fs/pstore/platform.c  | 2 +-
  include/linux/timer.h | 4 ++--
  kernel/irq/spurious.c | 2 +-
  lib/random32.c| 2 +-
  net/atm/mpc.c | 2 +-
  net/decnet/dn_route.c | 2 +-
  net/ipv6/ip6_flowlabel.c  | 2 +-
  net/netrom/nr_loopback.c  | 2 +-
  security/keys/gc.c| 2 +-
  sound/oss/midibuf.c   | 2 +-
  sound/oss/soundcard.c | 2 +-
  sound/oss/sys_timer.c | 2 +-
  sound/oss/uart6850.c  | 2 +-
  47 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/arch/arm/mach-ixp4xx/dsmg600-setup.c 
b/arch/arm/mach-ixp4xx/dsmg600-setup.c
index b3bd0e137f6d..b3689a141ec6 100644
--- a/arch/arm/mach-ixp4xx/dsmg600-setup.c
+++ b/arch/arm/mach-ixp4xx/dsmg600-setup.c
@@ -174,7 +174,7 @@ static int power_button_countdown;
  #define PBUTTON_HOLDDOWN_COUNT 4 /* 2 secs */
  
  static void dsmg600_power_handler(unsigned long data);

-static DEFINE_TIMER(dsmg600_power_timer, dsmg600_power_handler, 0, 0);
+static DEFINE_TIMER(dsmg600_power_timer, dsmg600_power_handler);
  
  static void dsmg600_power_handler(unsigned long data)

  {
diff --git a/arch/arm/mach-ixp4xx/nas100d-setup.c 
b/arch/arm/mach-ixp4xx/nas100d-setup.c
index 4e0f762bc651..562d05f9888e 100644
--- a/arch/arm/mach-ixp4xx/nas100d-setup.c
+++ b/arch/arm/mach-ixp4xx/nas100d-setup.c
@@ -197,7 +197,7 @@ static int power_button_countdown;
  #define PBUTTON_HOLDDOWN_COUNT 4 /* 2 secs */
  
  static void nas100d_power_handler(unsigned long data);

-static DEFINE_TIMER(nas100d_power_timer, nas100d_power_handler, 0, 0);
+static DEFINE_TIMER(nas100d_power_timer, nas100d_power_handler);
  
  static void nas100d_power_handler(unsigned long data)

  {
diff --git a/arch/m68k/amiga/amisound.c b/arch/m68k/amiga/amisound.c
index 90a60d758f8b..a23f48181fd6 100644
--- a/arch/m68k/amiga/amisound.c
+++ b/arch/m68k/amiga/amisound.c
@@ -66,7 +66,7 @@ void __init amiga_init_sound(void)
  }
  
  static void nosound( unsigned long ignored );

-static DEFINE_TIMER(sound_timer, nosound, 0, 0);
+static DEFINE_TIMER(sound_timer, nosound);
  
  void amiga_mksound( unsigned int hz, unsigned int ticks )

  {
diff --git a/arch/m68k/mac/macboing.c b/arch/m68k/mac/macboing.c
index ffaa1f6439ae..9a52aff183d0 100644
--- a/arch/m68k/mac/macboing.c
+++ b/arch/m68k/mac/macboing.c
@@ -56,7 +56,7 @@ static void ( *mac_special_bell )( unsigned int, unsigned 
int, unsigned int );
  /*
   * our timer to start/continue/stop the bell
   */
-static DEFINE_TIMER(mac_sound_timer, mac_nosound, 0, 0);
+static DEFINE_TIMER(mac_sound_timer, mac_nosound);
  
  /*

   * Sort of initialize 

Re: [PATCH 09/13] timer: Remove users of expire and data arguments to DEFINE_TIMER

2017-10-04 Thread Guenter Roeck

On 10/04/2017 04:27 PM, Kees Cook wrote:

The expire and data arguments of DEFINE_TIMER are only used in two places
and are ignored by the code (malta-display.c only uses mod_timer(),
never add_timer(), so the preset expires value is ignored). Set both
sets of arguments to zero.

Cc: Ralf Baechle 
Cc: Wim Van Sebroeck 
Cc: Guenter Roeck 
Cc: Geert Uytterhoeven 
Cc: linux-m...@linux-mips.org
Cc: linux-watch...@vger.kernel.org
Signed-off-by: Kees Cook 


For watchdog:

Acked-by: Guenter Roeck 


---
  arch/mips/mti-malta/malta-display.c | 6 +++---
  drivers/watchdog/alim7101_wdt.c | 4 ++--
  2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/mips/mti-malta/malta-display.c 
b/arch/mips/mti-malta/malta-display.c
index d4f807191ecd..ac813158b9b8 100644
--- a/arch/mips/mti-malta/malta-display.c
+++ b/arch/mips/mti-malta/malta-display.c
@@ -36,10 +36,10 @@ void mips_display_message(const char *str)
}
  }
  
-static void scroll_display_message(unsigned long data);

-static DEFINE_TIMER(mips_scroll_timer, scroll_display_message, HZ, 0);
+static void scroll_display_message(unsigned long unused);
+static DEFINE_TIMER(mips_scroll_timer, scroll_display_message, 0, 0);
  
-static void scroll_display_message(unsigned long data)

+static void scroll_display_message(unsigned long unused)
  {
mips_display_message(_string[display_count++]);
if (display_count == max_display_count)
diff --git a/drivers/watchdog/alim7101_wdt.c b/drivers/watchdog/alim7101_wdt.c
index 665e0e7dfe1e..3c1f6ac68ea9 100644
--- a/drivers/watchdog/alim7101_wdt.c
+++ b/drivers/watchdog/alim7101_wdt.c
@@ -71,7 +71,7 @@ MODULE_PARM_DESC(use_gpio,
"Use the gpio watchdog (required by old cobalt boards).");
  
  static void wdt_timer_ping(unsigned long);

-static DEFINE_TIMER(timer, wdt_timer_ping, 0, 1);
+static DEFINE_TIMER(timer, wdt_timer_ping, 0, 0);
  static unsigned long next_heartbeat;
  static unsigned long wdt_is_open;
  static char wdt_expect_close;
@@ -87,7 +87,7 @@ MODULE_PARM_DESC(nowayout,
   *Whack the dog
   */
  
-static void wdt_timer_ping(unsigned long data)

+static void wdt_timer_ping(unsigned long unused)
  {
/* If we got a heartbeat pulse within the WDT_US_INTERVAL
 * we agree to ping the WDT





[PATCH] powerpc: Drop lockdep_assert_cpus_held call from arch_update_cpu_topology

2017-10-04 Thread Thiago Jung Bauermann
It turns out that not all paths calling arch_update_cpu_topology hold
cpu_hotplug_lock, but that's ok because those paths aren't supposed to race
with any concurrent hotplug events.

Callers of arch_update_cpu_topology are expected to know what they are
doing when they call the function without holding the lock, so remove the
lockdep warning.

Here are two reported splats that turned out to be harmless (as far as
I know, at least):

[  255.654622] [ cut here ]
[  255.654658] WARNING: CPU: 1 PID: 14 at kernel/cpu.c:240 
lockdep_assert_cpus_held+0x54/0x80
[  255.654661] Modules linked in: bridge stp llc binfmt_misc kvm_hv kvm 
leds_powernv vmx_crypto powernv_rng rng_core led_class powernv_op_panel autofs4 
xfs btrfs lzo_compress raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c multipath mlx4_en raid10 
crc32c_vpmsum lpfc be2net crc_t10dif crct10dif_generic crct10dif_common 
mlx4_core
[  255.654734] CPU: 1 PID: 14 Comm: cpuhp/1 Tainted: GW   4.13.0 #1
[  255.654737] task: c01ff25fa100 task.stack: c01ff2578000
[  255.654740] NIP: c00f8624 LR: c00f8618 CTR: c01763e0
[  255.654742] REGS: c01ff257b780 TRAP: 0700   Tainted: GW
(4.13.0)
[  255.654744] MSR: 90029033 
[  255.654764]   CR: 24200422  XER: 
[  255.654766] CFAR: c01783d0 SOFTE: 1
   GPR00: c00f8618 c01ff257ba00 c1042100 

   GPR04:  0001  

   GPR08: 001ffe49   
c07fee782f50
   GPR12:  cfd80500 c012c878 
c01ff6209180
   GPR16:   c0ef2728 

   GPR20:   0001 

   GPR24:  c1087d50 c0fe0e7d 
c01448c0
   GPR28: 001ffe49  0002 
c1088050
[  255.654842] NIP [c00f8624] lockdep_assert_cpus_held+0x54/0x80
[  255.654845] LR [c00f8618] lockdep_assert_cpus_held+0x48/0x80
[  255.654847] Call Trace:
[  255.654851] [c01ff257ba00] [c00f8618] 
lockdep_assert_cpus_held+0x48/0x80 (unreliable)
[  255.654858] [c01ff257ba20] [c007a848] 
arch_update_cpu_topology+0x18/0x30
[  255.654864] [c01ff257ba40] [c016bd0c] 
partition_sched_domains+0x8c/0x4e0
[  255.654870] [c01ff257baf0] [c020a914] 
cpuset_update_active_cpus+0x24/0x60
[  255.654876] [c01ff257bb10] [c01449bc] 
sched_cpu_deactivate+0xfc/0x1a0
[  255.654882] [c01ff257bc20] [c00f5a8c] 
cpuhp_invoke_callback+0x19c/0xe00
[  255.654888] [c01ff257bcb0] [c00f6868] 
cpuhp_down_callbacks+0x78/0xf0
[  255.654893] [c01ff257bd00] [c00f6c1c] 
cpuhp_thread_fun+0x1fc/0x210
[  255.654899] [c01ff257bd50] [c0132f4c] 
smpboot_thread_fn+0x2fc/0x370
[  255.654905] [c01ff257bdc0] [c012ca24] kthread+0x1b4/0x1c0
[  255.654911] [c01ff257be30] [c000bcec] 
ret_from_kernel_thread+0x5c/0x70
[  255.654914] Instruction dump:
[  255.654919] 4e800020 6000 6042 7c0802a6 3c62ffeb 3880 38639280 
f8010030
[  255.654934] 4807fd45 6000 2fa3 409e0020 <0fe0> e8010030 38210020 
7c0803a6
[  255.654950] ---[ end trace 06efa323f571f14b ]---
[  255.894356] [ cut here ]

and:

[2.862745] [ cut here ]
[2.862772] WARNING: CPU: 0 PID: 1 at kernel/cpu.c:240 
lockdep_assert_cpus_held+0x5c/0x70
[2.862784] Modules linked in:
[2.862819] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-00083-gb38923a 
#1
[2.862835] task: c03ff328 task.stack: c03ff6104000
[2.862847] NIP:  c010ac7c LR: c010ac70 CTR: 
[2.862862] REGS: c03ff61078d0 TRAP: 0700   Not tainted  
(4.13.0-00083-gb38923a)
[2.862874] MSR:  92029033   CR: 
2422  XER: 
[2.863046] CFAR: c0190bbc SOFTE: 1
   GPR00: c010ac70 c03ff6107b50 c1150500 

   GPR04: c0fdbe60  c123c5f8 

   GPR08:    

   GPR12: 2482 cfd4 c000dc08 

   GPR16:    

   GPR20:    

   GPR24:  c0eb392c  
c0f2bdb0
   GPR28:   c1199050 
c123c4f8
[2.863597] 

[PATCH 12/13] kthread: Convert callback to use from_timer()

2017-10-04 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch kthread to use from_timer() and pass the
timer pointer explicitly.

Cc: Andrew Morton 
Cc: Petr Mladek 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Oleg Nesterov 
Signed-off-by: Kees Cook 
---
 include/linux/kthread.h | 10 +-
 kernel/kthread.c| 10 --
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0d622b350d3f..35cbe3b0ce5b 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -75,7 +75,7 @@ extern int tsk_fork_get_node(struct task_struct *tsk);
  */
 struct kthread_work;
 typedef void (*kthread_work_func_t)(struct kthread_work *work);
-void kthread_delayed_work_timer_fn(unsigned long __data);
+void kthread_delayed_work_timer_fn(struct timer_list *t);
 
 enum {
KTW_FREEZABLE   = 1 << 0,   /* freeze during suspend */
@@ -116,8 +116,8 @@ struct kthread_delayed_work {
 
 #define KTHREAD_DELAYED_WORK_INIT(dwork, fn) { \
.work = KTHREAD_WORK_INIT((dwork).work, (fn)),  \
-   .timer = __TIMER_INITIALIZER(kthread_delayed_work_timer_fn, \
-(unsigned long)&(dwork),   \
+   .timer = 
__TIMER_INITIALIZER((TIMER_FUNC_TYPE)kthread_delayed_work_timer_fn,\
+(TIMER_DATA_TYPE)&(dwork.timer),   \
 TIMER_IRQSAFE),\
}
 
@@ -164,8 +164,8 @@ extern void __kthread_init_worker(struct kthread_worker 
*worker,
do {\
kthread_init_work(&(dwork)->work, (fn));\
__setup_timer(&(dwork)->timer,  \
- kthread_delayed_work_timer_fn,\
- (unsigned long)(dwork),   \
+ (TIMER_FUNC_TYPE)kthread_delayed_work_timer_fn,\
+ (TIMER_DATA_TYPE)&(dwork)->timer, \
  TIMER_IRQSAFE);   \
} while (0)
 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1c19edf82427..ba3992c8c375 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -798,15 +798,14 @@ EXPORT_SYMBOL_GPL(kthread_queue_work);
 /**
  * kthread_delayed_work_timer_fn - callback that queues the associated kthread
  * delayed work when the timer expires.
- * @__data: pointer to the data associated with the timer
+ * @t: pointer to the expired timer
  *
  * The format of the function is defined by struct timer_list.
  * It should have been called from irqsafe timer with irq already off.
  */
-void kthread_delayed_work_timer_fn(unsigned long __data)
+void kthread_delayed_work_timer_fn(struct timer_list *t)
 {
-   struct kthread_delayed_work *dwork =
-   (struct kthread_delayed_work *)__data;
+   struct kthread_delayed_work *dwork = from_timer(dwork, t, timer);
struct kthread_work *work = >work;
struct kthread_worker *worker = work->worker;
 
@@ -837,8 +836,7 @@ void __kthread_queue_delayed_work(struct kthread_worker 
*worker,
struct timer_list *timer = >timer;
struct kthread_work *work = >work;
 
-   WARN_ON_ONCE(timer->function != kthread_delayed_work_timer_fn ||
-timer->data != (unsigned long)dwork);
+   WARN_ON_ONCE(timer->function != 
(TIMER_FUNC_TYPE)kthread_delayed_work_timer_fn);
 
/*
 * If @delay is 0, queue @dwork->work immediately.  This is for
-- 
2.7.4



[PATCH 13/13] workqueue: Convert callback to use from_timer()

2017-10-04 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch workqueue to use from_timer() and pass the
timer pointer explicitly.

Cc: Tejun Heo 
Cc: Lai Jiangshan 
Signed-off-by: Kees Cook 
---
 include/linux/workqueue.h | 15 ---
 kernel/workqueue.c|  7 +++
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index f4960260feaf..f3c47a05fd06 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -17,7 +17,7 @@ struct workqueue_struct;
 
 struct work_struct;
 typedef void (*work_func_t)(struct work_struct *work);
-void delayed_work_timer_fn(unsigned long __data);
+void delayed_work_timer_fn(struct timer_list *t);
 
 /*
  * The first word is the work queue pointer and the flags rolled into
@@ -175,8 +175,8 @@ struct execute_work {
 
 #define __DELAYED_WORK_INITIALIZER(n, f, tflags) { \
.work = __WORK_INITIALIZER((n).work, (f)),  \
-   .timer = __TIMER_INITIALIZER(delayed_work_timer_fn, \
-(unsigned long)&(n),   \
+   .timer = __TIMER_INITIALIZER((TIMER_FUNC_TYPE)delayed_work_timer_fn,\
+(TIMER_DATA_TYPE)&(n.timer),   \
 (tflags) | TIMER_IRQSAFE), \
}
 
@@ -241,8 +241,9 @@ static inline unsigned int work_static(struct work_struct 
*work) { return 0; }
 #define __INIT_DELAYED_WORK(_work, _func, _tflags) \
do {\
INIT_WORK(&(_work)->work, (_func)); \
-   __setup_timer(&(_work)->timer, delayed_work_timer_fn,   \
- (unsigned long)(_work),   \
+   __setup_timer(&(_work)->timer,  \
+ (TIMER_FUNC_TYPE)delayed_work_timer_fn,   \
+ (TIMER_DATA_TYPE)&(_work)->timer, \
  (_tflags) | TIMER_IRQSAFE);   \
} while (0)
 
@@ -250,8 +251,8 @@ static inline unsigned int work_static(struct work_struct 
*work) { return 0; }
do {\
INIT_WORK_ONSTACK(&(_work)->work, (_func)); \
__setup_timer_on_stack(&(_work)->timer, \
-  delayed_work_timer_fn,   \
-  (unsigned long)(_work),  \
+  (TIMER_FUNC_TYPE)delayed_work_timer_fn,\
+  (TIMER_DATA_TYPE)&(_work)->timer,\
   (_tflags) | TIMER_IRQSAFE);  \
} while (0)
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a5361fc6215d..c77fdf6bf24f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1492,9 +1492,9 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL(queue_work_on);
 
-void delayed_work_timer_fn(unsigned long __data)
+void delayed_work_timer_fn(struct timer_list *t)
 {
-   struct delayed_work *dwork = (struct delayed_work *)__data;
+   struct delayed_work *dwork = from_timer(dwork, t, timer);
 
/* should have been called from irqsafe timer with irq already off */
__queue_work(dwork->cpu, dwork->wq, >work);
@@ -1508,8 +1508,7 @@ static void __queue_delayed_work(int cpu, struct 
workqueue_struct *wq,
struct work_struct *work = >work;
 
WARN_ON_ONCE(!wq);
-   WARN_ON_ONCE(timer->function != delayed_work_timer_fn ||
-timer->data != (unsigned long)dwork);
+   WARN_ON_ONCE(timer->function != (TIMER_FUNC_TYPE)delayed_work_timer_fn);
WARN_ON_ONCE(timer_pending(timer));
WARN_ON_ONCE(!list_empty(>entry));
 
-- 
2.7.4



[PATCH 11/13] timer: Remove expires argument from __TIMER_INITIALIZER()

2017-10-04 Thread Kees Cook
The expires field is normally initialized during the first mod_timer()
call. It was unused by all callers, so remove it from the macro.

Signed-off-by: Kees Cook 
---
 include/linux/kthread.h   | 2 +-
 include/linux/timer.h | 5 ++---
 include/linux/workqueue.h | 2 +-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 82e197eeac91..0d622b350d3f 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -117,7 +117,7 @@ struct kthread_delayed_work {
 #define KTHREAD_DELAYED_WORK_INIT(dwork, fn) { \
.work = KTHREAD_WORK_INIT((dwork).work, (fn)),  \
.timer = __TIMER_INITIALIZER(kthread_delayed_work_timer_fn, \
-0, (unsigned long)&(dwork),\
+(unsigned long)&(dwork),   \
 TIMER_IRQSAFE),\
}
 
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 91e5a2cc81b5..10685c33e679 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -63,10 +63,9 @@ struct timer_list {
 
 #define TIMER_TRACE_FLAGMASK   (TIMER_MIGRATING | TIMER_DEFERRABLE | 
TIMER_PINNED | TIMER_IRQSAFE)
 
-#define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
+#define __TIMER_INITIALIZER(_function, _data, _flags) {\
.entry = { .next = TIMER_ENTRY_STATIC },\
.function = (_function),\
-   .expires = (_expires),  \
.data = (_data),\
.flags = (_flags),  \
__TIMER_LOCKDEP_MAP_INITIALIZER(\
@@ -75,7 +74,7 @@ struct timer_list {
 
 #define DEFINE_TIMER(_name, _function) \
struct timer_list _name =   \
-   __TIMER_INITIALIZER(_function, 0, 0, 0)
+   __TIMER_INITIALIZER(_function, 0, 0)
 
 void init_timer_key(struct timer_list *timer, unsigned int flags,
const char *name, struct lock_class_key *key);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 1c49431f3121..f4960260feaf 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -176,7 +176,7 @@ struct execute_work {
 #define __DELAYED_WORK_INITIALIZER(n, f, tflags) { \
.work = __WORK_INITIALIZER((n).work, (f)),  \
.timer = __TIMER_INITIALIZER(delayed_work_timer_fn, \
-0, (unsigned long)&(n),\
+(unsigned long)&(n),   \
 (tflags) | TIMER_IRQSAFE), \
}
 
-- 
2.7.4



[PATCH 09/13] timer: Remove users of expire and data arguments to DEFINE_TIMER

2017-10-04 Thread Kees Cook
The expire and data arguments of DEFINE_TIMER are only used in two places
and are ignored by the code (malta-display.c only uses mod_timer(),
never add_timer(), so the preset expires value is ignored). Set both
sets of arguments to zero.

Cc: Ralf Baechle 
Cc: Wim Van Sebroeck 
Cc: Guenter Roeck 
Cc: Geert Uytterhoeven 
Cc: linux-m...@linux-mips.org
Cc: linux-watch...@vger.kernel.org
Signed-off-by: Kees Cook 
---
 arch/mips/mti-malta/malta-display.c | 6 +++---
 drivers/watchdog/alim7101_wdt.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/mips/mti-malta/malta-display.c 
b/arch/mips/mti-malta/malta-display.c
index d4f807191ecd..ac813158b9b8 100644
--- a/arch/mips/mti-malta/malta-display.c
+++ b/arch/mips/mti-malta/malta-display.c
@@ -36,10 +36,10 @@ void mips_display_message(const char *str)
}
 }
 
-static void scroll_display_message(unsigned long data);
-static DEFINE_TIMER(mips_scroll_timer, scroll_display_message, HZ, 0);
+static void scroll_display_message(unsigned long unused);
+static DEFINE_TIMER(mips_scroll_timer, scroll_display_message, 0, 0);
 
-static void scroll_display_message(unsigned long data)
+static void scroll_display_message(unsigned long unused)
 {
mips_display_message(_string[display_count++]);
if (display_count == max_display_count)
diff --git a/drivers/watchdog/alim7101_wdt.c b/drivers/watchdog/alim7101_wdt.c
index 665e0e7dfe1e..3c1f6ac68ea9 100644
--- a/drivers/watchdog/alim7101_wdt.c
+++ b/drivers/watchdog/alim7101_wdt.c
@@ -71,7 +71,7 @@ MODULE_PARM_DESC(use_gpio,
"Use the gpio watchdog (required by old cobalt boards).");
 
 static void wdt_timer_ping(unsigned long);
-static DEFINE_TIMER(timer, wdt_timer_ping, 0, 1);
+static DEFINE_TIMER(timer, wdt_timer_ping, 0, 0);
 static unsigned long next_heartbeat;
 static unsigned long wdt_is_open;
 static char wdt_expect_close;
@@ -87,7 +87,7 @@ MODULE_PARM_DESC(nowayout,
  * Whack the dog
  */
 
-static void wdt_timer_ping(unsigned long data)
+static void wdt_timer_ping(unsigned long unused)
 {
/* If we got a heartbeat pulse within the WDT_US_INTERVAL
 * we agree to ping the WDT
-- 
2.7.4



[PATCH 10/13] timer: Remove expires and data arguments from DEFINE_TIMER

2017-10-04 Thread Kees Cook
Drop the arguments from the macro and adjust all callers with the
following script:

  perl -pi -e 's/DEFINE_TIMER\((.*), 0, 0\);/DEFINE_TIMER($1);/g;' \
$(git grep DEFINE_TIMER | cut -d: -f1 | sort -u | grep -v timer.h)

Signed-off-by: Kees Cook 
Acked-by: Geert Uytterhoeven  # for m68k parts
---
 arch/arm/mach-ixp4xx/dsmg600-setup.c  | 2 +-
 arch/arm/mach-ixp4xx/nas100d-setup.c  | 2 +-
 arch/m68k/amiga/amisound.c| 2 +-
 arch/m68k/mac/macboing.c  | 2 +-
 arch/mips/mti-malta/malta-display.c   | 2 +-
 arch/parisc/kernel/pdc_cons.c | 2 +-
 arch/s390/mm/cmm.c| 2 +-
 drivers/atm/idt77105.c| 4 ++--
 drivers/atm/iphase.c  | 2 +-
 drivers/block/ataflop.c   | 8 
 drivers/char/dtlk.c   | 2 +-
 drivers/char/hangcheck-timer.c| 2 +-
 drivers/char/nwbutton.c   | 2 +-
 drivers/char/rtc.c| 2 +-
 drivers/input/touchscreen/s3c2410_ts.c| 2 +-
 drivers/net/cris/eth_v10.c| 6 +++---
 drivers/net/hamradio/yam.c| 2 +-
 drivers/net/wireless/atmel/at76c50x-usb.c | 2 +-
 drivers/staging/speakup/main.c| 2 +-
 drivers/staging/speakup/synth.c   | 2 +-
 drivers/tty/cyclades.c| 2 +-
 drivers/tty/isicom.c  | 2 +-
 drivers/tty/moxa.c| 2 +-
 drivers/tty/rocket.c  | 2 +-
 drivers/tty/vt/keyboard.c | 2 +-
 drivers/tty/vt/vt.c   | 2 +-
 drivers/watchdog/alim7101_wdt.c   | 2 +-
 drivers/watchdog/machzwd.c| 2 +-
 drivers/watchdog/mixcomwd.c   | 2 +-
 drivers/watchdog/sbc60xxwdt.c | 2 +-
 drivers/watchdog/sc520_wdt.c  | 2 +-
 drivers/watchdog/via_wdt.c| 2 +-
 drivers/watchdog/w83877f_wdt.c| 2 +-
 drivers/xen/grant-table.c | 2 +-
 fs/pstore/platform.c  | 2 +-
 include/linux/timer.h | 4 ++--
 kernel/irq/spurious.c | 2 +-
 lib/random32.c| 2 +-
 net/atm/mpc.c | 2 +-
 net/decnet/dn_route.c | 2 +-
 net/ipv6/ip6_flowlabel.c  | 2 +-
 net/netrom/nr_loopback.c  | 2 +-
 security/keys/gc.c| 2 +-
 sound/oss/midibuf.c   | 2 +-
 sound/oss/soundcard.c | 2 +-
 sound/oss/sys_timer.c | 2 +-
 sound/oss/uart6850.c  | 2 +-
 47 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/arch/arm/mach-ixp4xx/dsmg600-setup.c 
b/arch/arm/mach-ixp4xx/dsmg600-setup.c
index b3bd0e137f6d..b3689a141ec6 100644
--- a/arch/arm/mach-ixp4xx/dsmg600-setup.c
+++ b/arch/arm/mach-ixp4xx/dsmg600-setup.c
@@ -174,7 +174,7 @@ static int power_button_countdown;
 #define PBUTTON_HOLDDOWN_COUNT 4 /* 2 secs */
 
 static void dsmg600_power_handler(unsigned long data);
-static DEFINE_TIMER(dsmg600_power_timer, dsmg600_power_handler, 0, 0);
+static DEFINE_TIMER(dsmg600_power_timer, dsmg600_power_handler);
 
 static void dsmg600_power_handler(unsigned long data)
 {
diff --git a/arch/arm/mach-ixp4xx/nas100d-setup.c 
b/arch/arm/mach-ixp4xx/nas100d-setup.c
index 4e0f762bc651..562d05f9888e 100644
--- a/arch/arm/mach-ixp4xx/nas100d-setup.c
+++ b/arch/arm/mach-ixp4xx/nas100d-setup.c
@@ -197,7 +197,7 @@ static int power_button_countdown;
 #define PBUTTON_HOLDDOWN_COUNT 4 /* 2 secs */
 
 static void nas100d_power_handler(unsigned long data);
-static DEFINE_TIMER(nas100d_power_timer, nas100d_power_handler, 0, 0);
+static DEFINE_TIMER(nas100d_power_timer, nas100d_power_handler);
 
 static void nas100d_power_handler(unsigned long data)
 {
diff --git a/arch/m68k/amiga/amisound.c b/arch/m68k/amiga/amisound.c
index 90a60d758f8b..a23f48181fd6 100644
--- a/arch/m68k/amiga/amisound.c
+++ b/arch/m68k/amiga/amisound.c
@@ -66,7 +66,7 @@ void __init amiga_init_sound(void)
 }
 
 static void nosound( unsigned long ignored );
-static DEFINE_TIMER(sound_timer, nosound, 0, 0);
+static DEFINE_TIMER(sound_timer, nosound);
 
 void amiga_mksound( unsigned int hz, unsigned int ticks )
 {
diff --git a/arch/m68k/mac/macboing.c b/arch/m68k/mac/macboing.c
index ffaa1f6439ae..9a52aff183d0 100644
--- a/arch/m68k/mac/macboing.c
+++ b/arch/m68k/mac/macboing.c
@@ -56,7 +56,7 @@ static void ( *mac_special_bell )( unsigned int, unsigned 
int, unsigned int );
 /*
  * our timer to start/continue/stop the bell
  */
-static DEFINE_TIMER(mac_sound_timer, mac_nosound, 0, 0);
+static DEFINE_TIMER(mac_sound_timer, mac_nosound);
 
 /*
  * Sort of initialize the sound chip (called from mac_mksound on the first
diff --git a/arch/mips/mti-malta/malta-display.c 
b/arch/mips/mti-malta/malta-display.c
index ac813158b9b8..063de44675ce 100644
--- 

[PATCH 07/13] timer: Remove last user of TIMER_INITIALIZER

2017-10-04 Thread Kees Cook
Drops the last user of TIMER_INITIALIZER and adapts timer.h to use the
internal version.

Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Mark Gross 
Cc: Thomas Gleixner 
Signed-off-by: Kees Cook 
---
 drivers/char/tlclk.c  | 12 +---
 include/linux/timer.h |  2 +-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/char/tlclk.c b/drivers/char/tlclk.c
index 6210bff46341..8eeb4190207d 100644
--- a/drivers/char/tlclk.c
+++ b/drivers/char/tlclk.c
@@ -184,9 +184,8 @@ static unsigned int telclk_interrupt;
 static int int_events; /* Event that generate a interrupt */
 static int got_event;  /* if events processing have been done */
 
-static void switchover_timeout(unsigned long data);
-static struct timer_list switchover_timer =
-   TIMER_INITIALIZER(switchover_timeout , 0, 0);
+static void switchover_timeout(struct timer_list *t);
+static struct timer_list switchover_timer;
 static unsigned long tlclk_timer_data;
 
 static struct tlclk_alarms *alarm_events;
@@ -805,7 +804,7 @@ static int __init tlclk_init(void)
goto out3;
}
 
-   init_timer(_timer);
+   timer_setup(_timer, switchover_timeout, 0);
 
ret = misc_register(_miscdev);
if (ret < 0) {
@@ -855,9 +854,9 @@ static void __exit tlclk_cleanup(void)
 
 }
 
-static void switchover_timeout(unsigned long data)
+static void switchover_timeout(struct timer_list *unused)
 {
-   unsigned long flags = *(unsigned long *) data;
+   unsigned long flags = tlclk_timer_data;
 
if ((flags & 1)) {
if ((inb(TLCLK_REG1) & 0x08) != (flags & 0x08))
@@ -922,7 +921,6 @@ static irqreturn_t tlclk_interrupt(int irq, void *dev_id)
/* TIMEOUT in ~10ms */
switchover_timer.expires = jiffies + msecs_to_jiffies(10);
tlclk_timer_data = inb(TLCLK_REG1);
-   switchover_timer.data = (unsigned long) _timer_data;
mod_timer(_timer, switchover_timer.expires);
} else {
got_event = 1;
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 10cc45ca5803..4f7476e4a727 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -87,7 +87,7 @@ struct timer_list {
 
 #define DEFINE_TIMER(_name, _function, _expires, _data)\
struct timer_list _name =   \
-   TIMER_INITIALIZER(_function, _expires, _data)
+   __TIMER_INITIALIZER(_function, _expires, _data, 0)
 
 void init_timer_key(struct timer_list *timer, unsigned int flags,
const char *name, struct lock_class_key *key);
-- 
2.7.4



[PATCH 06/13] timer: Remove users of TIMER_DEFERRED_INITIALIZER

2017-10-04 Thread Kees Cook
This removes uses of TIMER_DEFERRED_INITIALIZER and chooses a location
to call timer_setup() from before add_timer() or mod_timer() is called.
Adjusts callbacks to use from_timer() as needed.

Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: linux-s...@vger.kernel.org
Signed-off-by: Kees Cook 
---
 arch/s390/kernel/lgr.c  | 6 +++---
 arch/s390/kernel/topology.c | 6 +++---
 kernel/workqueue.c  | 8 +++-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/s390/kernel/lgr.c b/arch/s390/kernel/lgr.c
index ae7dff110054..bf9622f0e6b1 100644
--- a/arch/s390/kernel/lgr.c
+++ b/arch/s390/kernel/lgr.c
@@ -153,14 +153,13 @@ static void lgr_timer_set(void);
 /*
  * LGR timer callback
  */
-static void lgr_timer_fn(unsigned long ignored)
+static void lgr_timer_fn(struct timer_list *unused)
 {
lgr_info_log();
lgr_timer_set();
 }
 
-static struct timer_list lgr_timer =
-   TIMER_DEFERRED_INITIALIZER(lgr_timer_fn, 0, 0);
+static struct timer_list lgr_timer;
 
 /*
  * Setup next LGR timer
@@ -181,6 +180,7 @@ static int __init lgr_init(void)
debug_register_view(lgr_dbf, _hex_ascii_view);
lgr_info_get(_info_last);
debug_event(lgr_dbf, 1, _info_last, sizeof(lgr_info_last));
+   timer_setup(_timer, lgr_timer_fn, TIMER_DEFERRABLE);
lgr_timer_set();
return 0;
 }
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index ed0bdd220e1a..d7ece9888c29 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -320,15 +320,14 @@ static void topology_flush_work(void)
flush_work(_work);
 }
 
-static void topology_timer_fn(unsigned long ignored)
+static void topology_timer_fn(struct timer_list *unused)
 {
if (ptf(PTF_CHECK))
topology_schedule_update();
set_topology_timer();
 }
 
-static struct timer_list topology_timer =
-   TIMER_DEFERRED_INITIALIZER(topology_timer_fn, 0, 0);
+static struct timer_list topology_timer;
 
 static atomic_t topology_poll = ATOMIC_INIT(0);
 
@@ -597,6 +596,7 @@ static struct ctl_table topology_dir_table[] = {
 
 static int __init topology_init(void)
 {
+   timer_setup(_timer, topology_timer_fn, TIMER_DEFERRABLE);
if (MACHINE_HAS_TOPOLOGY)
set_topology_timer();
else
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 64d0edf428f8..a5361fc6215d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5390,11 +5390,8 @@ static void workqueue_sysfs_unregister(struct 
workqueue_struct *wq)  { }
  */
 #ifdef CONFIG_WQ_WATCHDOG
 
-static void wq_watchdog_timer_fn(unsigned long data);
-
 static unsigned long wq_watchdog_thresh = 30;
-static struct timer_list wq_watchdog_timer =
-   TIMER_DEFERRED_INITIALIZER(wq_watchdog_timer_fn, 0, 0);
+static struct timer_list wq_watchdog_timer;
 
 static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
 static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = 
INITIAL_JIFFIES;
@@ -5408,7 +5405,7 @@ static void wq_watchdog_reset_touched(void)
per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
 }
 
-static void wq_watchdog_timer_fn(unsigned long data)
+static void wq_watchdog_timer_fn(struct timer_list *unused)
 {
unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
bool lockup_detected = false;
@@ -5510,6 +5507,7 @@ module_param_cb(watchdog_thresh, _watchdog_thresh_ops, 
_watchdog_thresh,
 
 static void wq_watchdog_init(void)
 {
+   timer_setup(_watchdog_timer, wq_watchdog_timer_fn, TIMER_DEFERRABLE);
wq_watchdog_set_thresh(wq_watchdog_thresh);
 }
 
-- 
2.7.4



[PATCH 08/13] timer: Remove unused static initializer macros

2017-10-04 Thread Kees Cook
This removes the now unused TIMER_*INITIALIZER macros:

TIMER_INITIALIZER
TIMER_PINNED_INITIALIZER
TIMER_DEFERRED_INITIALIZER
TIMER_PINNED_DEFERRED_INITIALIZER

Signed-off-by: Kees Cook 
---
 include/linux/timer.h | 12 
 1 file changed, 12 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 4f7476e4a727..a33220311361 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -73,18 +73,6 @@ struct timer_list {
__FILE__ ":" __stringify(__LINE__)) \
}
 
-#define TIMER_INITIALIZER(_function, _expires, _data)  \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), 0)
-
-#define TIMER_PINNED_INITIALIZER(_function, _expires, _data)   \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_PINNED)
-
-#define TIMER_DEFERRED_INITIALIZER(_function, _expires, _data) \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE)
-
-#define TIMER_PINNED_DEFERRED_INITIALIZER(_function, _expires, _data)  \
-   __TIMER_INITIALIZER((_function), (_expires), (_data), TIMER_DEFERRABLE 
| TIMER_PINNED)
-
 #define DEFINE_TIMER(_name, _function, _expires, _data)\
struct timer_list _name =   \
__TIMER_INITIALIZER(_function, _expires, _data, 0)
-- 
2.7.4



[PATCH 05/13] timer: Remove init_timer_deferrable() in favor of timer_setup()

2017-10-04 Thread Kees Cook
This refactors the only users of init_timer_deferrable() to use
the new timer_setup() and from_timer(). Removes definition of
init_timer_deferrable().

Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Sebastian Reichel 
Cc: Harish Patil 
Cc: Manish Chopra 
Cc: Kalle Valo 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: net...@vger.kernel.org
Cc: linux-wirel...@vger.kernel.org
Signed-off-by: Kees Cook 
---
 arch/powerpc/mm/numa.c   | 12 +--
 drivers/hsi/clients/ssi_protocol.c   | 32 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c | 11 --
 drivers/net/vxlan.c  |  8 +++
 drivers/net/wireless/ath/ath6kl/recovery.c   |  9 
 include/linux/timer.h|  2 --
 6 files changed, 34 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b95c584ce19d..f9b6107d6854 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1453,7 +1453,7 @@ static void topology_schedule_update(void)
schedule_work(_work);
 }
 
-static void topology_timer_fn(unsigned long ignored)
+static void topology_timer_fn(struct timer_list *unused)
 {
if (prrn_enabled && cpumask_weight(_associativity_changes_mask))
topology_schedule_update();
@@ -1463,14 +1463,11 @@ static void topology_timer_fn(unsigned long ignored)
reset_topology_timer();
}
 }
-static struct timer_list topology_timer =
-   TIMER_INITIALIZER(topology_timer_fn, 0, 0);
+static struct timer_list topology_timer;
 
 static void reset_topology_timer(void)
 {
-   topology_timer.data = 0;
-   topology_timer.expires = jiffies + 60 * HZ;
-   mod_timer(_timer, topology_timer.expires);
+   mod_timer(_timer, jiffies + 60 * HZ);
 }
 
 #ifdef CONFIG_SMP
@@ -1530,7 +1527,8 @@ int start_topology_update(void)
prrn_enabled = 0;
vphn_enabled = 1;
setup_cpu_associativity_change_counters();
-   init_timer_deferrable(_timer);
+   timer_setup(_timer, topology_timer_fn,
+   TIMER_DEFERRABLE);
reset_topology_timer();
}
}
diff --git a/drivers/hsi/clients/ssi_protocol.c 
b/drivers/hsi/clients/ssi_protocol.c
index 93d28c0ec8bf..67af03d3aeb3 100644
--- a/drivers/hsi/clients/ssi_protocol.c
+++ b/drivers/hsi/clients/ssi_protocol.c
@@ -464,10 +464,10 @@ static void ssip_error(struct hsi_client *cl)
hsi_async_read(cl, msg);
 }
 
-static void ssip_keep_alive(unsigned long data)
+static void ssip_keep_alive(struct timer_list *t)
 {
-   struct hsi_client *cl = (struct hsi_client *)data;
-   struct ssi_protocol *ssi = hsi_client_drvdata(cl);
+   struct ssi_protocol *ssi = from_timer(ssi, t, keep_alive);
+   struct hsi_client *cl = ssi->cl;
 
dev_dbg(>device, "Keep alive kick in: m(%d) r(%d) s(%d)\n",
ssi->main_state, ssi->recv_state, ssi->send_state);
@@ -490,9 +490,19 @@ static void ssip_keep_alive(unsigned long data)
spin_unlock(>lock);
 }
 
-static void ssip_wd(unsigned long data)
+static void ssip_rx_wd(struct timer_list *t)
+{
+   struct ssi_protocol *ssi = from_timer(ssi, t, rx_wd);
+   struct hsi_client *cl = ssi->cl;
+
+   dev_err(>device, "Watchdog trigerred\n");
+   ssip_error(cl);
+}
+
+static void ssip_tx_wd(unsigned long data)
 {
-   struct hsi_client *cl = (struct hsi_client *)data;
+   struct ssi_protocol *ssi = from_timer(ssi, t, tx_wd);
+   struct hsi_client *cl = ssi->cl;
 
dev_err(>device, "Watchdog trigerred\n");
ssip_error(cl);
@@ -1084,15 +1094,9 @@ static int ssi_protocol_probe(struct device *dev)
}
 
spin_lock_init(>lock);
-   init_timer_deferrable(>rx_wd);
-   init_timer_deferrable(>tx_wd);
-   init_timer(>keep_alive);
-   ssi->rx_wd.data = (unsigned long)cl;
-   ssi->rx_wd.function = ssip_wd;
-   ssi->tx_wd.data = (unsigned long)cl;
-   ssi->tx_wd.function = ssip_wd;
-   ssi->keep_alive.data = (unsigned long)cl;
-   ssi->keep_alive.function = ssip_keep_alive;
+   timer_setup(>rx_wd, ssip_rx_wd, TIMER_DEFERRABLE);
+   timer_setup(>tx_wd, ssip_tx_wd, TIMER_DEFERRABLE);
+   timer_setup(>keep_alive, ssip_keep_alive, 0);
INIT_LIST_HEAD(>txqueue);
INIT_LIST_HEAD(>cmdqueue);
atomic_set(>tx_usecnt, 0);
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 9feec7009443..29fea74bff2e 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4725,9 +4725,9 @@ static const struct net_device_ops qlge_netdev_ops = 

[PATCH 00/13] timer: Start conversion to timer_setup()

2017-10-04 Thread Kees Cook
Hi,

This is the first of many timer infrastructure cleanups to simplify the
timer API[1]. All of these patches are expected to land via the timer
tree, so Acks (or corrections) appreciated.

These patches refactor various users of timer API that are NOT just using
init_timer() or setup_timer() (which is the vast majority of users,
and are being converted separately). These changes are focused on the
lesser-used init_timer_*(), TIMER_*INITIALIZER(), and DEFINE_TIMER()
methods of preparing a timer.

Thanks!

-Kees

[1] https://git.kernel.org/linus/686fef928bba6be13cabe639f154af7d72b63120



[PATCH 03/13] timer: Remove init_timer_on_stack() in favor of timer_setup_on_stack()

2017-10-04 Thread Kees Cook
Remove uses of init_timer_on_stack() with open-coded function and data
assignments that could be expressed using timer_setup_on_stack(). Several
were removed from the stack entirely since there was a one-to-one mapping
of parent structure to timer, those are switched to using timer_setup()
instead. All related callbacks were adjusted to use from_timer().

Cc: "Rafael J. Wysocki" 
Cc: Pavel Machek 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Stefan Richter 
Cc: Sudip Mukherjee 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Julian Wiedmann 
Cc: Ursula Braun 
Cc: Michael Reed 
Cc: "James E.J. Bottomley" 
Cc: "Martin K. Petersen" 
Cc: Thomas Gleixner 
Cc: linux...@vger.kernel.org
Cc: linux1394-de...@lists.sourceforge.net
Cc: linux-s...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Signed-off-by: Kees Cook 
---
 drivers/base/power/main.c   |  8 +++-
 drivers/firewire/core-transaction.c | 10 +-
 drivers/parport/ieee1284.c  | 21 +++--
 drivers/s390/char/tape.h|  1 +
 drivers/s390/char/tape_std.c| 18 ++
 drivers/s390/net/lcs.c  | 16 ++--
 drivers/s390/net/lcs.h  |  1 +
 drivers/scsi/qla1280.c  | 14 +-
 drivers/scsi/qla1280.h  |  1 +
 include/linux/parport.h |  1 +
 include/linux/timer.h   |  2 --
 11 files changed, 36 insertions(+), 57 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 770b1539a083..ae47b2ec84b4 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -478,9 +478,9 @@ struct dpm_watchdog {
  * There's not much we can do here to recover so panic() to
  * capture a crash-dump in pstore.
  */
-static void dpm_watchdog_handler(unsigned long data)
+static void dpm_watchdog_handler(struct timer_list *t)
 {
-   struct dpm_watchdog *wd = (void *)data;
+   struct dpm_watchdog *wd = from_timer(wd, t, timer);
 
dev_emerg(wd->dev, " DPM device timeout \n");
show_stack(wd->tsk, NULL);
@@ -500,11 +500,9 @@ static void dpm_watchdog_set(struct dpm_watchdog *wd, 
struct device *dev)
wd->dev = dev;
wd->tsk = current;
 
-   init_timer_on_stack(timer);
+   timer_setup_on_stack(timer, dpm_watchdog_handler, 0);
/* use same timeout value for both suspend and resume */
timer->expires = jiffies + HZ * CONFIG_DPM_WATCHDOG_TIMEOUT;
-   timer->function = dpm_watchdog_handler;
-   timer->data = (unsigned long)wd;
add_timer(timer);
 }
 
diff --git a/drivers/firewire/core-transaction.c 
b/drivers/firewire/core-transaction.c
index d6a09b9cd8cc..4372f9e4b0da 100644
--- a/drivers/firewire/core-transaction.c
+++ b/drivers/firewire/core-transaction.c
@@ -137,9 +137,9 @@ int fw_cancel_transaction(struct fw_card *card,
 }
 EXPORT_SYMBOL(fw_cancel_transaction);
 
-static void split_transaction_timeout_callback(unsigned long data)
+static void split_transaction_timeout_callback(struct timer_list *timer)
 {
-   struct fw_transaction *t = (struct fw_transaction *)data;
+   struct fw_transaction *t = from_timer(t, timer, split_timeout_timer);
struct fw_card *card = t->card;
unsigned long flags;
 
@@ -373,8 +373,8 @@ void fw_send_request(struct fw_card *card, struct 
fw_transaction *t, int tcode,
t->tlabel = tlabel;
t->card = card;
t->is_split_transaction = false;
-   setup_timer(>split_timeout_timer,
-   split_transaction_timeout_callback, (unsigned long)t);
+   timer_setup(>split_timeout_timer,
+   split_transaction_timeout_callback, 0);
t->callback = callback;
t->callback_data = callback_data;
 
@@ -423,7 +423,7 @@ int fw_run_transaction(struct fw_card *card, int tcode, int 
destination_id,
struct transaction_callback_data d;
struct fw_transaction t;
 
-   init_timer_on_stack(_timeout_timer);
+   timer_setup_on_stack(_timeout_timer, NULL, 0);
init_completion();
d.payload = payload;
fw_send_request(card, , tcode, destination_id, generation, speed,
diff --git a/drivers/parport/ieee1284.c b/drivers/parport/ieee1284.c
index 74cc6dd982d2..2d1a5c737c6e 100644
--- a/drivers/parport/ieee1284.c
+++ b/drivers/parport/ieee1284.c
@@ -44,10 +44,11 @@ static void parport_ieee1284_wakeup (struct parport *port)
up (>physport->ieee1284.irq);
 }
 
-static struct parport *port_from_cookie[PARPORT_MAX];
-static void timeout_waiting_on_port (unsigned long cookie)
+static void timeout_waiting_on_port (struct timer_list *t)
 {
-   

[PATCH 04/13] timer: Remove init_timer_pinned() in favor of timer_setup()

2017-10-04 Thread Kees Cook
This refactors the only users of init_timer_pinned() to use
the new timer_setup() and from_timer(). Drops the definition of
init_timer_pinned().

Cc: Chris Metcalf 
Cc: Thomas Gleixner 
Cc: net...@vger.kernel.org
Signed-off-by: Kees Cook 
---
 drivers/net/ethernet/tile/tilepro.c | 9 -
 include/linux/timer.h   | 2 --
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/tile/tilepro.c 
b/drivers/net/ethernet/tile/tilepro.c
index 49ccee4b9aec..56d06282fbde 100644
--- a/drivers/net/ethernet/tile/tilepro.c
+++ b/drivers/net/ethernet/tile/tilepro.c
@@ -608,9 +608,9 @@ static void tile_net_schedule_egress_timer(struct 
tile_net_cpu *info)
  * ISSUE: Maybe instead track number of expected completions, and free
  * only that many, resetting to zero if "pending" is ever false.
  */
-static void tile_net_handle_egress_timer(unsigned long arg)
+static void tile_net_handle_egress_timer(struct timer_list *t)
 {
-   struct tile_net_cpu *info = (struct tile_net_cpu *)arg;
+   struct tile_net_cpu *info = from_timer(info, t, egress_timer);
struct net_device *dev = info->napi.dev;
 
/* The timer is no longer scheduled. */
@@ -1004,9 +1004,8 @@ static void tile_net_register(void *dev_ptr)
BUG();
 
/* Initialize the egress timer. */
-   init_timer_pinned(>egress_timer);
-   info->egress_timer.data = (long)info;
-   info->egress_timer.function = tile_net_handle_egress_timer;
+   timer_setup(>egress_timer, tile_net_handle_egress_timer,
+   TIMER_PINNED);
 
u64_stats_init(>stats.syncp);
 
diff --git a/include/linux/timer.h b/include/linux/timer.h
index b10c4bdc6fbd..9da903562ed4 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -128,8 +128,6 @@ static inline void init_timer_on_stack_key(struct 
timer_list *timer,
 
 #define init_timer(timer)  \
__init_timer((timer), 0)
-#define init_timer_pinned(timer)   \
-   __init_timer((timer), TIMER_PINNED)
 #define init_timer_deferrable(timer)   \
__init_timer((timer), TIMER_DEFERRABLE)
 
-- 
2.7.4



[PATCH 01/13] timer: Convert schedule_timeout() to use from_timer()

2017-10-04 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new from_timer() helper and passing
the timer pointer explicitly. Since this special timer is on the stack, it
needs to have a wrapper structure to carry state once .data is eliminated.

Cc: John Stultz 
Cc: Thomas Gleixner 
Cc: Stephen Boyd 
Signed-off-by: Kees Cook 
---
 include/linux/timer.h |  8 
 kernel/time/timer.c   | 26 +++---
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 6383c528b148..5ef5c9e41a09 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -179,6 +179,14 @@ static inline void timer_setup(struct timer_list *timer,
  (TIMER_DATA_TYPE)timer, flags);
 }
 
+static inline void timer_setup_on_stack(struct timer_list *timer,
+  void (*callback)(struct timer_list *),
+  unsigned int flags)
+{
+   __setup_timer_on_stack(timer, (TIMER_FUNC_TYPE)callback,
+  (TIMER_DATA_TYPE)timer, flags);
+}
+
 #define from_timer(var, callback_timer, timer_fieldname) \
container_of(callback_timer, typeof(*var), timer_fieldname)
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index f2674a056c26..38613ced2324 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1668,9 +1668,20 @@ void run_local_timers(void)
raise_softirq(TIMER_SOFTIRQ);
 }
 
-static void process_timeout(unsigned long __data)
+/*
+ * Since schedule_timeout()'s timer is defined on the stack, it must store
+ * the target task on the stack as well.
+ */
+struct process_timer {
+   struct timer_list timer;
+   struct task_struct *task;
+};
+
+static void process_timeout(struct timer_list *t)
 {
-   wake_up_process((struct task_struct *)__data);
+   struct process_timer *timeout = from_timer(timeout, t, timer);
+
+   wake_up_process(timeout->task);
 }
 
 /**
@@ -1704,7 +1715,7 @@ static void process_timeout(unsigned long __data)
  */
 signed long __sched schedule_timeout(signed long timeout)
 {
-   struct timer_list timer;
+   struct process_timer timer;
unsigned long expire;
 
switch (timeout)
@@ -1738,13 +1749,14 @@ signed long __sched schedule_timeout(signed long 
timeout)
 
expire = timeout + jiffies;
 
-   setup_timer_on_stack(, process_timeout, (unsigned long)current);
-   __mod_timer(, expire, false);
+   timer.task = current;
+   timer_setup_on_stack(, process_timeout, 0);
+   __mod_timer(, expire, false);
schedule();
-   del_singleshot_timer_sync();
+   del_singleshot_timer_sync();
 
/* Remove the timer from the object tracker */
-   destroy_timer_on_stack();
+   destroy_timer_on_stack();
 
timeout = expire - jiffies;
 
-- 
2.7.4



[PATCH 02/13] timer: Remove init_timer_pinned_deferrable() in favor of timer_setup()

2017-10-04 Thread Kees Cook
This refactors the only user of init_timer_pinned_deferrable() to use the
new timer_setup() and from_timer(). Adds a pointer back to the policy,
and drops the definition of init_timer_pinned_deferrable().

Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Thomas Gleixner 
Cc: linux...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Kees Cook 
---
 drivers/cpufreq/powernv-cpufreq.c | 13 +++--
 include/linux/timer.h |  2 --
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 3ff5160451b4..b6d7c4c98d0a 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -90,6 +90,7 @@ struct global_pstate_info {
int last_gpstate_idx;
spinlock_t gpstate_lock;
struct timer_list timer;
+   struct cpufreq_policy *policy;
 };
 
 static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
@@ -625,10 +626,10 @@ static inline void  queue_gpstate_timer(struct 
global_pstate_info *gpstates)
  * according quadratic equation. Queues a new timer if it is still not equal
  * to local pstate
  */
-void gpstate_timer_handler(unsigned long data)
+void gpstate_timer_handler(struct timer_list *t)
 {
-   struct cpufreq_policy *policy = (struct cpufreq_policy *)data;
-   struct global_pstate_info *gpstates = policy->driver_data;
+   struct global_pstate_info *gpstates = from_timer(gpstates, t, timer);
+   struct cpufreq_policy *policy = gpstates->policy;
int gpstate_idx, lpstate_idx;
unsigned long val;
unsigned int time_diff = jiffies_to_msecs(jiffies)
@@ -800,9 +801,9 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->driver_data = gpstates;
 
/* initialize timer */
-   init_timer_pinned_deferrable(>timer);
-   gpstates->timer.data = (unsigned long)policy;
-   gpstates->timer.function = gpstate_timer_handler;
+   gpstates->policy = policy;
+   timer_setup(>timer, gpstate_timer_handler,
+   TIMER_PINNED | TIMER_DEFERRABLE);
gpstates->timer.expires = jiffies +
msecs_to_jiffies(GPSTATE_TIMER_INTERVAL);
spin_lock_init(>gpstate_lock);
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 5ef5c9e41a09..d11e819a86e2 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -132,8 +132,6 @@ static inline void init_timer_on_stack_key(struct 
timer_list *timer,
__init_timer((timer), TIMER_PINNED)
 #define init_timer_deferrable(timer)   \
__init_timer((timer), TIMER_DEFERRABLE)
-#define init_timer_pinned_deferrable(timer)\
-   __init_timer((timer), TIMER_DEFERRABLE | TIMER_PINNED)
 #define init_timer_on_stack(timer) \
__init_timer_on_stack((timer), 0)
 
-- 
2.7.4



[PATCH tip/core/rcu 1/3] membarrier: Provide register expedited private command

2017-10-04 Thread Paul E. McKenney
From: Mathieu Desnoyers 

Provide a new command allowing processes to register their intent to use
the private expedited command.

This allows PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Signed-off-by: Mathieu Desnoyers 
CC: Peter Zijlstra 
CC: Paul E. McKenney 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: gro...@google.com
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: Nicholas Piggin 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney 
---
 MAINTAINERS|  2 ++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/membarrier.h  | 43 +
 arch/powerpc/include/asm/thread_info.h |  3 ++
 arch/powerpc/kernel/Makefile   |  2 ++
 arch/powerpc/kernel/membarrier.c   | 45 ++
 arch/powerpc/mm/mmu_context.c  |  7 +
 fs/exec.c  |  1 +
 include/linux/mm_types.h   |  3 ++
 include/linux/sched/mm.h   | 50 ++
 include/uapi/linux/membarrier.h| 23 +++-
 init/Kconfig   |  3 ++
 kernel/fork.c  |  2 ++
 kernel/sched/core.c| 10 ---
 kernel/sched/membarrier.c  | 25 ++---
 15 files changed, 199 insertions(+), 21 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 arch/powerpc/kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 65b0c88d5ee0..c5296d7f447b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8822,6 +8822,8 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/kernel/membarrier.c
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468edab1..6f44c5f74f71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -138,6 +138,7 @@ config PPC
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_MEMBARRIER_HOOKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 

Possible LMB hot unplug bug in 4.13+ kernels

2017-10-04 Thread Daniel Henrique Barboza

Hi,

I've stumbled in a LMB hot unplug problem when running a guest with 
4.13+ kernel using QEMU 2.10. When trying to hot unplug a recently 
hotplugged LMB this is what I got, using an upstream kernel:


---
QEMU cmd line: sudo ./qemu-system-ppc64 -name migrate_qemu -boot 
strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0xf -device 
spapr-vscsi,id=scsi0,reg=0x2000 -smp 
32,maxcpus=32,sockets=32,cores=1,threads=1 --machine 
pseries,accel=kvm,kvm-type=HV,usb=off,dump-guest-core=off -m 
4G,slots=32,maxmem=32G -drive 
file=/home/danielhb/vm_imgs/f26.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
-nographic



Last login: Wed Oct  4 12:28:25 on hvc0
[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    4161728 kB
MemFree: 3204352 kB
MemAvailable:    3558336 kB
[danielhb@localhost ~]$ QEMU 2.10.50 monitor - type 'help' for more 
information

(qemu)
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4254656 kB
MemAvailable:    4598144 kB
[danielhb@localhost ~]$ (qemu)
(qemu) device_del dimm0
(qemu) [  136.058727] pseries-hotplug-mem: Memory indexed-count-remove 
failed, adding any removed LMBs


(qemu)
[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4253696 kB
MemAvailable:    4597184 kB
[danielhb@localhost ~]$

-

The LMBs are about to be unplugged, them something happens and they 
ended up being hotplugged back.


This isn't reproducible with 4.11 guests. I can reliably reproduce it in 
4.13+. Haven't tried 4.12.


Changing QEMU snapshots or even the hypervisor kernel/OS didn't affect 
the result.



In an attempt to better understand the issue I did the following changes 
in upstream kernel:


diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c

index 1d48ab424bd9..37550833cdb0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -433,8 +433,10 @@ static bool lmb_is_removable(struct of_drconf_cell 
*lmb)

    unsigned long pfn, block_sz;
    u64 phys_addr;

-   if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
+   if (!(lmb->flags & DRCONF_MEM_ASSIGNED)) {
+   pr_err("lmb is not assigned \n");
    return false;
+   }

    block_sz = memory_block_size_bytes();
    scns_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
@@ -442,8 +444,10 @@ static bool lmb_is_removable(struct of_drconf_cell 
*lmb)


 #ifdef CONFIG_FA_DUMP
    /* Don't hot-remove memory that falls in fadump boot memory area */
-   if (is_fadump_boot_memory_area(phys_addr, block_sz))
+   if (is_fadump_boot_memory_area(phys_addr, block_sz)) {
+   pr_err("lmb belongs to fadump boot memory area\n");
    return false;
+   }
 #endif

    for (i = 0; i < scns_per_block; i++) {
@@ -454,7 +458,9 @@ static bool lmb_is_removable(struct of_drconf_cell *lmb)
    rc &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
    phys_addr += MIN_MEMORY_BLOCK_SIZE;
    }
-
+   if (!rc) {
+   pr_err("is_mem_section_removable returned false \n");
+   }
    return rc ? true : false;
 }

@@ -465,12 +471,16 @@ static int dlpar_remove_lmb(struct of_drconf_cell 
*lmb)

    unsigned long block_sz;
    int nid, rc;

-   if (!lmb_is_removable(lmb))
+   if (!lmb_is_removable(lmb)) {
+   pr_err("dlpar_remove_lmb: lmb is not removable! \n");
    return -EINVAL;
+   }

    rc = dlpar_offline_lmb(lmb);
-   if (rc)
+   if (rc) {
+   pr_err("dlpar_remove_lmb: offline_lmb returned not zero 
\n");

    return rc;
+   }

    block_sz = pseries_memory_block_size();
    nid = memory_add_physaddr_to_nid(lmb->base_addr);

And this is the output:

-

[danielhb@localhost ~]$ QEMU 2.10.50 monitor - type 'help' for more 
information

(qemu)
(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=ram0
(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4254656 kB
MemAvailable:    4598144 kB
[danielhb@localhost ~]$ (qemu)
(qemu) device_del dimm0
(qemu) [  136.058473] pseries-hotplug-mem: is_mem_section_removable 
returned false

[  136.058607] pseries-hotplug-mem: dlpar_remove_lmb: lmb is not removable!
[  136.058727] pseries-hotplug-mem: Memory indexed-count-remove failed, 
adding any removed LMBs


(qemu)

[danielhb@localhost ~]$ grep Mem /proc/meminfo
MemTotal:    5210304 kB
MemFree: 4253696 kB
MemAvailable:    4597184 kB
[danielhb@localhost ~]$

---

It appears that the hot 

Re: [bug report] out of bounds read parsing vmode commandline option

2017-10-04 Thread Geert Uytterhoeven
Hi Dan,

On Wed, Oct 4, 2017 at 2:50 PM, Dan Carpenter  wrote:
> This bug predates git but it looks like it might be simple to fix if the
> right person looked at the code.
>
> drivers/video/fbdev/controlfb.c:560 control_setup()
> error: buffer overflow 'control_mac_modes' 20 <= 21
>
> drivers/video/fbdev/controlfb.c
>549  static void __init control_setup(char *options)
>550  {
>551  char *this_opt;
>552
>553  if (!options || !*options)
>554  return;
>555
>556  while ((this_opt = strsep(, ",")) != NULL) {
>557  if (!strncmp(this_opt, "vmode:", 6)) {
>558  int vmode = simple_strtoul(this_opt+6, NULL, 
> 0);
> ^
> We get vmode from the command line.
>
>559  if (vmode > 0 && vmode <= VMODE_MAX &&
>   ^
> We check that it's <= 22.
>
>560  control_mac_modes[vmode - 1].m[1] >= 0)
> ^
> But the problem is that control_mac_modes[] only has 20 elements so the
> highest valid index is 19.  vmode - 1 can be up to 21.

Nice catch!

The bug was introduced in v2.4.5.6, when 2 new modes were added to
macmodes.h, but control_mac_modes[] wasn't updated:

https://kernel.opensuse.org/cgit/kernel/diff/include/video/macmodes.h?h=v2.5.2=29f279c764808560eaceb88fef36cbc35c529aad

A simple fix is to check against ARRAY_SIZE(control_mac_modes) instead.

A better fix is to add the missing entries to control_mac_modes[], cfr. the
(gmail-whitespace-damaged) patch below:

--- a/drivers/video/fbdev/controlfb.h
+++ b/drivers/video/fbdev/controlfb.h
@@ -141,5 +141,7 @@ static struct max_cmodes control_mac_modes[] = {
{{ 1, 2}},  /* 1152x870, 75Hz */
{{ 0, 1}},  /* 1280x960, 75Hz */
{{ 0, 1}},  /* 1280x1024, 75Hz */
+   {{ 1, 2}},  /* 1152x768, 60Hz */
+   {{ 0, 1}},  /* 1600x1024, 60Hz */
 };

(this array lists the maximum color mode (8, 16, or 32 bpp) for each
 video mode given RAM restrictions (2 or 4 MiB)).

The 1152x768 mode is probably OK. Given the 1600x1024 mode has a lower
dotclock (112 MHz) than the supported 1280x960 mode, it's probably OK, too.

platinum_reg_init[] and valkyrie_reg_init[] seem to be handled fine.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH] mm: deferred_init_memmap improvements

2017-10-04 Thread Pavel Tatashin
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this function,
and also fixes a couple issues (described below).

The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed, and
the only way flags are not zero if it went through __init_single_page()
before.  But, once we change the current behavior and won't zero the memory
in memblock allocator, we cannot trust anything inside "struct page"es
until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

Note, this first issue is relevant only when the following change is merged:

mm: stop zeroing memory during allocation in vmemmap

=

This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn)
goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
Thus means if deferred struct pages are enabled on systems with these kind
of holes, linux would get memory corruptions. I have fixed this issue by
defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
---
 mm/page_alloc.c | 168 
 1 file changed, 85 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..dcfd657cfd4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,89 @@ static inline void __init 
pgdat_init_report_one_done(void)
complete(_init_all_done_comp);
 }
 
+/*
+ * Helper for deferred_init_range, free the given range, reset the counters, 
and
+ * return number of pages freed.
+ */
+static inline unsigned long __def_free(unsigned long *nr_free,
+  unsigned long *free_base_pfn,
+  struct page **page)
+{
+   unsigned long nr = *nr_free;
+
+   deferred_free_range(*free_base_pfn, nr);
+   *free_base_pfn = 0;
+   *nr_free = 0;
+   *page = NULL;
+
+   return nr;
+}
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+   nr_pages 

Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-04 Thread Cédric Le Goater
On 10/03/2017 08:58 AM, David Gibson wrote:
> On Tue, Oct 03, 2017 at 08:24:07AM +0200, Cédric Le Goater wrote:
>> On 10/03/2017 05:36 AM, David Gibson wrote:
>>> On Mon, Oct 02, 2017 at 06:27:20PM +0200, Cédric Le Goater wrote:
 On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
> Hi,
>
> Here are a couple of small fixes to support CPU hot unplug. There are
> still some issues to be investigated as, in some occasions, after a
> couple of plug and unplug, the cpu which was removed receives a 'lost'
> interrupt. This showed to be the decrementer under QEMU.

 So this seems to be a QEMU issue only which can be solved by 
 removing the DEE bit from the LPCR on P9 processor when the CPU 
 is stopped in rtas. PECE3 bit on P8 processors. 

 I think these patches are valuable fixes for 4.14. The first 
 is trivial and the second touches the common xive part but it
 is only called on the pseries platform.  

 Could you please take a look ?
>>>
>>> Sorry, I think I've missed something here.
>>>
>>> Is there a qemu bug involved in this?  Has there been a patch sent
>>> that I didn't spot?
>>
>>
>> No, not yet, but I will today probably. something like below to stop
>> the decrementer when a CPU is stopped:
>>
>>  --- qemu.git.orig/hw/ppc/spapr_rtas.c
>>  +++ qemu.git/hw/ppc/spapr_rtas.c
>>  @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *c
>>   kvm_cpu_synchronize_state(cs);
>>   
>>   env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>>  +
>>  +/* Enable DECR interrupt */
>>  +if (env->mmu_model == POWERPC_MMU_3_00) {
>>  +env->spr[SPR_LPCR] |= LPCR_DEE;
>>  +} else {
>>  +/* P7 and P8 both have same bit for DECR */
>>  +env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
>>  +}
>>  +
>>   env->nip = start;
>>   env->gpr[3] = r3;
>>   cs->halted = 0;
>>  @@ -210,6 +219,13 @@ static void rtas_stop_self(PowerPCCPU *c
>>* no need to bother with specific bits, we just clear it.
>>*/
>>   env->msr = 0;
>>  +
>>  +if (env->mmu_model == POWERPC_MMU_3_00) {
>>  +env->spr[SPR_LPCR] &= ~LPCR_DEE;
>>  +} else {
>>  +/* P7 and P8 both have same bit for DECR */
>>  +env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
>>  +}
>>   }
>>   
>>   static inline int sysparm_st(target_ulong addr, target_ulong len,
>>  
>> I haven't yet because I fail to understand why the decrementer is not 
>> interrupting the dying CPU under xics as it is the case under XIVE.
> 
> Oh.. ok.  This sounds very similar to the problem Nikunj hit under TCG
> with decrementer interrupts waking up a supposedly dead CPU.  He had a
> couple of proposed fixes, but we got bogged down trying to work out
> why  (with TCG at least) it only seemed to bite after a system_reset,
> and not on initial boot up.

yes. It would be nice to fix the reset under TCG though. May be this is
related. 

>> Also I am not sure this hack is of any use :
>>
>> /*
>>  * While stopping a CPU, the guest calls H_CPPR which
>>  * effectively disables interrupts on XICS level.
>>  * However decrementer interrupts in TCG can still
>>  * wake the CPU up so here we disable interrupts in MSR
>>  * as well.
>>  * As rtas_start_cpu() resets the whole MSR anyway, there is
>>  * no need to bother with specific bits, we just clear it.
>>  */
>> env->msr = 0;
> 
> Ok.. why do you think this isn't of use?  I'm pretty sure this is
> necessary for the TCG case, since MSR is checked in cpu_has_work(),
> which could otherwise wake up the "dead" cpu.

well, no, when the CPU is stopped with the 'stop-self' RTAS call, one of 
the CPU states is switched to 1 (cs->halted=1). In cpu_has_work(), this 
is a branch in which we don't check the MSR, only pending hardware 
interrupts are checked with their LPCR enablement bit.

So if the DECR timer fires after 'stop-self' is called (cs->halted=1) and 
before it is really stopped (cs->stop=1), the nearly-dead CPU will have 
some work to do and the guest will crash. This case happens very frequently 
when the P9 XIVE exploitation mode is activated but it does not without, 
when using the XICS mode. In XICS mode, the DECR is occasionally fired but 
after cs->stop=1, so no work is to be done.
 
The patch above fixes the problem but I don't understand why this works 
with XICS. My feeling is that there is a race somewhere and 

env->msr = 0;

is just a useless workaround, in this case at least. 

C.


> 
>> and the different CPU states are confusing. Nikunj already to a look
>> at this when trying to fix the TCG reboot. Anyway, the QEMU patch 
>> should (re)start a thread. This is not the place to discuss.
>>
>> Thanks,
>>
>> C.  
>>
>>
> 



[PATCH v2] kernel/module_64.c: Add REL24 relocation support of livepatch symbols

2017-10-04 Thread Kamalesh Babulal
With commit 425595a7fc20 ("livepatch: reuse module loader code to
write relocations") livepatch uses module loader to write relocations
of livepatch symbols, instead of managing them by arch-dependent
klp_write_module_reloc() function.

livepatch module managed relocation entries are written to sections
marked with SHF_RELA_LIVEPATCH flag and livepatch symbols within the
section are marked with SHN_LIVEPATCH symbol section index. When the
livepatching module is loaded, the livepatch symbols are resolved
before calling apply_relocate_add() to apply the relocations.

R_PPC64_REL24 relocation type resolves to a function address, those may
be local to the livepatch module or available in kernel/other modules.
For every such non-local function, apply_relocate_add() constructs a
stub (a.k.a trampoline) to branch to a function. Stub code is
responsible to save toc onto the stack, before calling the function via
the global entry point. A NOP instruction is expected after every non
local function branch, i.e. after the REL24 relocation. Which in-turn
is replaced by toc restore instruction by apply_relocate_add().

Functions those were local to livepatched function previously, may have
become global now or they might be out of range with current TOC base.
During module load, apply_relocate_add() fails for such global
functions, as it expect's a nop after a branch. Which does't exist for a
non-local function accessed via local entry point.

For example, consider the following livepatch relocations (the example
is from livepatch module generated by kpatch tool):

Relocation section '.klp.rela.vmlinux..text.meminfo_proc_show' at offset
0x84530 contains 44 entries:

Offset Info Type  Symbol's Value   Symbol's Name + Addend
......  R_PPC64_REL24 0x0 .klp.sym.vmlinux.si_swapinfo,0 + 0
......  R_PPC64_REL24 0x0 .klp.sym.vmlinux.total_swapcache_pages,0 + 0
......  R_PPC64_REL24 0x0 .klp.sym.vmlinux.show_val_kb,1 + 0
[...]

1. .klp.sym.vmlinux.si_swapinfo and .klp.sym.vmlinux.total_swapcache_pages
   are not available within the livepatch module TOC range.

2. .klp.sym.vmlinux.show_val_kb livepatch symbol was previously local
   but now global w.r.t module call fs/proc/meminfo.c::meminfo_proc_show()

While the livepatch module is loaded the livepatch symbols mentioned in
case 1 will fail with an error:
module_64: kpatch_meminfo: REL24 -1152921504751525976 out of range!

and livepatch symbols mentioned in case 2 with fail with an error:
module_64: kpatch_meminfo: Expect noop after relocate, got 3d22

Both the failures with REL24 livepatch symbols relocation, can be
resolved by constructing a new livepatch stub. The newly setup klp_stub
mimics the functionality of entry_64.S::livepatch_handler introduced by
commit 85baa095497f ("powerpc/livepatch: Add live patching support on
ppc64le").

Which introduces a "livepatch stack" growing upwards from the base of
the regular stack and is used to store/restore TOC/LR values, other than
the stub setup and branch. The additional instructions sequences to
handle klp_stub increases the stub size and current ppc64_stub_insn[]
is not sufficient to hold them. This patch also introduces new
ppc64le_klp_stub_entry[], along with the helpers to find/allocate
livepatch stub.

Signed-off-by: Kamalesh Babulal 
Cc: Balbir Singh 
Cc: Naveen N. Rao 
Cc: Josh Poimboeuf 
Cc: Jessica Yu 
Cc: Ananth N Mavinakayanahalli 
Cc: Aravinda Prasad 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: live-patch...@vger.kernel.org
---
This patch applies on top of livepatch_handler fix posted at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-September/163824.html

v2:
 - Changed klp_stub construction to re-use livepatch_handler and
   additional patch code required for klp_stub, instead of duplicating it.
 - Minor comments and commit body edits.

 arch/powerpc/include/asm/module.h  |   4 +
 arch/powerpc/kernel/module_64.c| 135 -
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  31 ++
 3 files changed, 167 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/module.h 
b/arch/powerpc/include/asm/module.h
index 6c0132c7212f..de22c4c7aebc 100644
--- a/arch/powerpc/include/asm/module.h
+++ b/arch/powerpc/include/asm/module.h
@@ -44,6 +44,10 @@ struct mod_arch_specific {
unsigned long toc;
unsigned long tramp;
 #endif
+#ifdef CONFIG_LIVEPATCH
+   /* Count of kernel livepatch relocations */
+   unsigned long klp_relocs;
+#endif
 
 #else /* powerpc64 */
/* Indices of PLT sections within module. */
diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 0b0f89685b67..5be955e59162 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -140,6 +140,24 @@ static u32 

Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin



On 10/04/2017 10:04 AM, Michal Hocko wrote:

On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:



I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone.


According to commit comments first 4K belongs to BIOS, so I think the memory
exists but BIOS may or may not report it to Linux. So, reserve it to make
sure we never touch it.


Yes and that memory should be outside of any zones, no?


I am not totally sure, I think some x86 expert could help us here. But, 
in either case this issue can be fixed separately from the rest of the 
series.





I am worried that this patch adds a code which
is not really used and it will just stay that way for ever because
nobody will dare to change it as it is too obscure and not explained
very well.


I could explain mine code better. Perhaps add more comments, and explain
when it can be removed?


More explanation would be definitely helpful


trim_low_memory_range is a good example of this. Why do we
even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?



Since it is not enforced in memblock that everything in reserved list must
be part of memory list, we can have it, and we need to make sure kernel does
not panic. Otherwise, it is very hard to detect such bugs.


So, should we report such a memblock reservation API (ab)use to the log?
Are you actually sure that trim_low_memory_range is doing a sane and
really needed thing? In other words do we have a zone which contains
this no-memory backed pfns?



And, this patch reports it already:

+   pr_info("Reserved but unavailable: %lld pages", pgcnt);

I could add a comment above this print call, explain that such memory is 
probably bogus and must be studied/fixed. Also, add that this code can 
be removed once memblock is changed to allow reserve only memory that is 
backed by physical memory i.e. in "memory" list.


Pasha


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Michal Hocko
On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:
> 
> > I am not really familiar with the trim_low_memory_range code path. I am
> > not even sure we have to care about it because nobody should be walking
> > pfns outside of any zone.
> 
> According to commit comments first 4K belongs to BIOS, so I think the memory
> exists but BIOS may or may not report it to Linux. So, reserve it to make
> sure we never touch it.

Yes and that memory should be outside of any zones, no?

> > I am worried that this patch adds a code which
> > is not really used and it will just stay that way for ever because
> > nobody will dare to change it as it is too obscure and not explained
> > very well.
> 
> I could explain mine code better. Perhaps add more comments, and explain
> when it can be removed?

More explanation would be definitely helpful

> > trim_low_memory_range is a good example of this. Why do we
> > even reserve this range from the memory block allocator? The memory
> > shouldn't be backed by any real memory and thus not in the allocator in
> > the first place, no?
> > 
> 
> Since it is not enforced in memblock that everything in reserved list must
> be part of memory list, we can have it, and we need to make sure kernel does
> not panic. Otherwise, it is very hard to detect such bugs.

So, should we report such a memblock reservation API (ab)use to the log?
Are you actually sure that trim_low_memory_range is doing a sane and
really needed thing? In other words do we have a zone which contains
this no-memory backed pfns?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin



I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone.


According to commit comments first 4K belongs to BIOS, so I think the 
memory exists but BIOS may or may not report it to Linux. So, reserve it 
to make sure we never touch it.


 I am worried that this patch adds a code which

is not really used and it will just stay that way for ever because
nobody will dare to change it as it is too obscure and not explained
very well.


I could explain mine code better. Perhaps add more comments, and explain 
when it can be removed?


 trim_low_memory_range is a good example of this. Why do we

even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?



Since it is not enforced in memblock that everything in reserved list 
must be part of memory list, we can have it, and we need to make sure 
kernel does not panic. Otherwise, it is very hard to detect such bugs.


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Michal Hocko
On Wed 04-10-17 08:40:11, Pasha Tatashin wrote:
> > > > Could you be more specific where is such a memory reserved?
> > > > 
> > > 
> > > I know of one example: trim_low_memory_range() unconditionally reserves 
> > > from
> > > pfn 0, but e820__memblock_setup() might provide the exiting memory from 
> > > pfn
> > > 1 (i.e. KVM).
> > 
> > Then just initialize struct pages for that mapping rigth there where a
> > special API is used.
> > 
> > > But, there could be more based on this comment from linux/page-flags.h:
> > > 
> > >   19  * PG_reserved is set for special pages, which can never be swapped 
> > > out.
> > > Some
> > >   20  * of them might not even exist (eg empty_bad_page)...
> > 
> > I have no idea wht empty_bad_page is but a quick grep shows that this is
> > never used. I might be wrong here but if somebody is reserving a memory
> > in a special way then we should handle the initialization right there.
> > E.g. create an API for special memblock reservations.
> > 
> 
> Hi Michal,
> 
> The reservations happen before struct pages are allocated and mapped. So, it
> is not always possible to do it at call sites.

OK, I didn't realize that.
 
> Previously, I have solved this problem like this:
> 
> https://patchwork.kernel.org/patch/9886163
> 
> But, I was not too happy with that approach, so I replaced it with the
> current approach as it is more generic, and solves similar issues if they
> happen in other places. Also, the comment in page-flags got me scared that
> there are probably other places perhaps on other architectures that can have
> the similar issue.

I believe the comment is just stale. I have looked into empty_bad_page
and it is just a relict. I plan to post a patch soon.

> In addition, I did not like my solution, I was simply shrinking the low
> reservation from:
> [0 - reserve_low) to [min_pfn - reserve_low), but if min_pfn > reserve_low
> can we skip low reservation entirely? I was not sure.
> 
> The current approach notifies us if there are such pages, and we can
> fix/remove them in the future without crashing kernel in the meantime.

I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone. I am worried that this patch adds a code which
is not really used and it will just stay that way for ever because
nobody will dare to change it as it is too obscure and not explained
very well. trim_low_memory_range is a good example of this. Why do we
even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Pasha Tatashin

Could you be more specific where is such a memory reserved?



I know of one example: trim_low_memory_range() unconditionally reserves from
pfn 0, but e820__memblock_setup() might provide the exiting memory from pfn
1 (i.e. KVM).


Then just initialize struct pages for that mapping rigth there where a
special API is used.


But, there could be more based on this comment from linux/page-flags.h:

  19  * PG_reserved is set for special pages, which can never be swapped out.
Some
  20  * of them might not even exist (eg empty_bad_page)...


I have no idea wht empty_bad_page is but a quick grep shows that this is
never used. I might be wrong here but if somebody is reserving a memory
in a special way then we should handle the initialization right there.
E.g. create an API for special memblock reservations.



Hi Michal,

The reservations happen before struct pages are allocated and mapped. 
So, it is not always possible to do it at call sites.


Previously, I have solved this problem like this:

https://patchwork.kernel.org/patch/9886163

But, I was not too happy with that approach, so I replaced it with the 
current approach as it is more generic, and solves similar issues if 
they happen in other places. Also, the comment in page-flags got me 
scared that there are probably other places perhaps on other 
architectures that can have the similar issue.


In addition, I did not like my solution, I was simply shrinking the low 
reservation from:
[0 - reserve_low) to [min_pfn - reserve_low), but if min_pfn > 
reserve_low can we skip low reservation entirely? I was not sure.


The current approach notifies us if there are such pages, and we can 
fix/remove them in the future without crashing kernel in the meantime.


Pasha


Re: [PATCH v2 4/5] powerpc: pseries: only store the device node basename in full_name

2017-10-04 Thread Michael Ellerman
Rob Herring  writes:
> On Tue, Oct 3, 2017 at 4:26 AM, Michael Ellerman  wrote:
...
>>
>> The obvious fix is just to allocate it separately as before, eg ~=:
>
> Yes, I'll go back to doing 2 allocs like v1, but using kstrdup as was
> also pointed out.

Yeah that would be better still I guess.

cheers


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-04 Thread Pasha Tatashin

On 10/04/2017 04:45 AM, Michal Hocko wrote:

On Tue 03-10-17 11:22:35, Pasha Tatashin wrote:



On 10/03/2017 09:08 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

  BASEFIX
sparse_init 11.244671836s   0.007199623s
zone_sizes_init  4.879775891s   8.355182299s
--
Total   16.124447727s   8.362381922s


Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these number depend on the last patch in the series?


Correct, without the last patch sparse_init time won't change.


THen this is just misleading.



OK, I will re-arrange patches the way you suggested earlier.

Pasha


Re: [PATCHv3 7/7] checkpatch: add pF/pf deprecation warning

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:19, Sergey Senozhatsky wrote:
> We deprecated '%pF/%pf' printk specifiers, since '%pS/%ps' is now smart
> enough to handle function pointer dereference on platforms where such
> dereference is required.
> 
> checkpatch warning example:
> 
> WARNING: Deprecated vsprintf pointer extension '%pF' - use %pS instead
> 
> Signed-off-by: Sergey Senozhatsky 
> Signed-off-by: Joe Perches 
> Cc: Andy Whitcroft 
> Tested-by: Helge Deller  # parisc64
> Tested-by: Santosh Sivaraj  # powerpc64
> Acked-by: Michael Ellerman  # powerpc64
> Tested-by: Tony Luck  # ia64

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCHv3 6/7] symbol lookup: use new kernel and module dereference functions

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:18, Sergey Senozhatsky wrote:
> Call appropriate function descriptor dereference ARCH callbacks:
> - dereference_kernel_function_descriptor() if the pointer is a
>   kernel symbol;
> 
> - dereference_module_function_descriptor() if the pointer is a
>   module symbol.
> 
> This patch also removes dereference_function_descriptor() from
> '%pF/%pf' vsprintf handler, because it has the same behavior with
> '%pS/%ps' now.

The description is pretty criptic. It should explain why
the dereference was moved from vsprintf to the symbol lookup
and if it is safe.

Note that kallsyms_lookup() and module_address_lookup() is used
in many other situations.

Also I would not be afraid to repeat description of the big picture
from the 2nd patch.

> Signed-off-by: Sergey Senozhatsky 
> Tested-by: Helge Deller  # parisc64
> Tested-by: Santosh Sivaraj  # powerpc64
> Acked-by: Michael Ellerman  # powerpc64
> Tested-by: Tony Luck  # ia64
> ---
>  Documentation/printk-formats.txt | 20 ++--
>  kernel/kallsyms.c|  1 +
>  kernel/module.c  |  1 +
>  lib/vsprintf.c   |  5 +
>  4 files changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/printk-formats.txt 
> b/Documentation/printk-formats.txt
> index 361789df51ec..3adbc4fdd482 100644
> --- a/Documentation/printk-formats.txt
> +++ b/Documentation/printk-formats.txt
> @@ -50,26 +50,28 @@ Symbols/Function Pointers
>  
>  ::
>  
> + %pS versatile_init+0x0/0x110
> + %ps versatile_init
>   %pF versatile_init+0x0/0x110
>   %pf versatile_init
> - %pS versatile_init+0x0/0x110
>   %pSRversatile_init+0x9/0x110
>   (with __builtin_extract_return_addr() translation)
> - %ps versatile_init
>   %pB prev_fn_of_versatile_init+0x88/0x88
>  
> -The ``F`` and ``f`` specifiers are for printing function pointers,
> -for example, f->func,  They have the same result as
> -``S`` and ``s`` specifiers. But they do an extra conversion on
> -ia64, ppc64 and parisc64 architectures where the function pointers
> -are actually function descriptors.
> -
>  The ``S`` and ``s`` specifiers can be used for printing symbols
>  from direct addresses, for example, __builtin_return_address(0),
>  (void *)regs->ip. They result in the symbol name with (``S``) or
>  without (``s``) offsets. If KALLSYMS are disabled then the symbol
>  address is printed instead.

This paragraph makes the feeling that ``S`` is still only for direct
adresses. We should update it as well.


> +Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``)
> +and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and
> +parisc64 function pointers are indirect and, in fact, are function
> +descriptors, which require additional dereferencing before we can lookup
> +the symbol. As of now, ``S`` and ``s`` perform dereferencing on those
> +platforms (when needed), so ``F`` and ``f`` exist for compatibility
> +reasons only.
> +
>  The ``B`` specifier results in the symbol name with offsets and should be
>  used when printing stack backtraces. The specifier takes into
>  consideration the effect of compiler optimisations which may occur
> @@ -77,8 +79,6 @@ when tail-call``s are used and marked with the noreturn GCC 
> attribute.
>  
>  Examples::
>  
> - printk("Going to call: %pF\n", gettimeofday);
> - printk("Going to call: %pF\n", p->func);
>   printk("%s: called from %pS\n", __func__, (void *)_RET_IP_);
>   printk("%s: called from %pS\n", __func__,
>   (void *)__builtin_return_address(0));

We should either replace %pF with %pS or remove all examples.
It is strange to keep only half of them.


> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index 127e7cfafa55..e2fc09ea9509 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -322,6 +322,7 @@ const char *kallsyms_lookup(unsigned long addr,
>   if (is_ksym_addr(addr)) {

is_ksym_addr() ignores the special .opd elf sections if
CONFIG_KALLSYMS_ALL is disabled. We should dereference before
this call.

>   unsigned long pos;
>  
> + addr = dereference_kernel_function_descriptor(addr);
>   pos = get_symbol_pos(addr, symbolsize, offset);

I still wonder if doing the dereference in the widely used kallsyms
might cause any regression.

One possible problem is that this function returns "offset".
One might expect that it is offset against "addr" but
it is not if the dereference happens here.

Also get_symbol_pos() is called in several other helpers
but the dereference is done only here. It would be
confusing if for example kallsyms_lookup_size_offset()
and kallsyms_lookup() give different result.

I would feel much more comfortable if we keep the derefenrece
only in vsprintf.


In each case, we 

Re: [PATCH v2] powerpc: Blacklist GCC 5.4 6.1 and 6.2

2017-10-04 Thread Michael Ellerman
Michal Suchánek  writes:

> On Mon, 13 Feb 2017 19:49:16 -0600
> Segher Boessenkool  wrote:
>
>> On Tue, Feb 14, 2017 at 11:25:43AM +1100, Cyril Bur wrote:
>> > > > At the time of writing GCC 5.4 is the most recent and is
>> > > > affected. GCC 6.3 contains the backported fix, has been tested
>> > > > and appears safe to use.  
>> > > 
>> > > 6.3 is (of course) the newer release; 5.4 is a maintenance
>> > > release of a compiler that is a year older.  
>> > 
>> > Yes. I think the point I was trying to make is that since they
>> > backported the fix to 5.x and 6.x then I expect that 5.5 will have
>> > the fix but since it doesn't exist yet, I can't be sure. I'll add
>> > something to that effect.  
>> 
>> The patch has been backported to the GCC 5 branch; it will be part of
>> any future GCC 5 release (5.5 and later, if any later will happen; 5.5
>> will).
>> 
>> Don't be so unsure about these things, we aren't *that*
>> incompetent ;-)
>
> Nonetheless the message is factually correct.
>
>> 
>> > > Please mention the GCC PR # somewhere in the code, too?  
>> > 
>> > Sure.  
>
> It seems this has never happened?

>From memory we discovered that there were patched versions of the
compilers in distros, which still reported those version numbers. ie.
the patch would break working compilers.

So Cyril was going to write a test that actually built some code and
checked the generated asm. But I suspect he never got around to it.

cheers


Re: [PATCH 2/2] powerpc/strict_rwx: fixup region splitting

2017-10-04 Thread Michael Ellerman
Balbir Singh  writes:

> We were aggressive with splitting regions and missed the case
> when _stext and __init_begin completely overlap addr and addr+mapping.
>
> This patch fixes that case and allows us to keep the largest possible
> mapping through the overlap. The patch also simplies the check
> into a function

Please do the fix in one patch, which we can backport if necessary, and
then move it into a function in a separate patch.

cheers


Re: [PATCHv3 4/7] powerpc64: Add .opd based function descriptor dereference

2017-10-04 Thread Michael Ellerman
Petr Mladek  writes:
> On Sat 2017-09-30 11:53:16, Sergey Senozhatsky wrote:
>> diff --git a/arch/powerpc/kernel/module_64.c 
>> b/arch/powerpc/kernel/module_64.c
>> index 0b0f89685b67..94caec045a90 100644
>> --- a/arch/powerpc/kernel/module_64.c
>> +++ b/arch/powerpc/kernel/module_64.c
>> @@ -712,6 +717,17 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
>>  return 0;
>>  }
>>  
>> +#ifdef PPC64_ELF_ABI_v1
>> +unsigned long dereference_module_function_descriptor(struct module *mod,
>> + unsigned long addr)
>> +{
>> +if (addr < mod->arch.start_opd || addr >= mod->arch.end_opd)
>> +return addr;
>> +
>> +return dereference_function_descriptor(addr);
>> +}
>> +#endif /* PPC64_ELF_ABI_v1 */
>
> I would personally move this up in the source file. It is related to
> the definition of func_desc() and other functions that are
> also PPC_ELF_ABI-specific.

Yeah that would be neater. There's already a PPC64_ELF_ABI_v2 block, you
could put this in the else case of that.

But we can do that later if you're not respinning otherwise.

cheers


Re: [PATCHv3 5/7] parisc64: Add .opd based function descriptor dereference

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:17, Sergey Senozhatsky wrote:
> We are moving towards separate kernel and module function descriptor
> dereference callbacks. This patch enables it for parisc64.
> 
> For pointers that belong to the kernel
> -  Added __start_opd and __end_opd pointers, to track the kernel
>.opd section address range;
> 
> -  Added dereference_kernel_function_descriptor(). Now we
>will dereference only function pointers that are within
>[__start_opd, __end_opd];
> 
> For pointers that belong to a module
> -  Added dereference_module_function_descriptor() to handle module
>function descriptor dereference. Now we will dereference only
>pointers that are within [module->opd.start, module->opd.end].

> diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
> index f1a76935a314..28f89b3dcc11 100644
> --- a/arch/parisc/kernel/module.c
> +++ b/arch/parisc/kernel/module.c
> @@ -954,3 +955,19 @@ void module_arch_cleanup(struct module *mod)
>  {
>   deregister_unwind_table(mod);
>  }
> +
> +#ifdef CONFIG_64BIT
> +unsigned long dereference_module_function_descriptor(struct module *mod,
> +  unsigned long addr)
> +{
> + unsigned long start_opd = (Elf64_Addr)mod->core_layout.base +
> +mod->arch.fdesc_offset;
> + unsigned long end_opd = start_opd +
> + mod->arch.fdesc_count * sizeof(Elf64_Fdesc);

I know that this is used in rather slow paths. But it still might
make sense to have these section borders pre-computed and
stored in struct mod_arch_specific. I mean to do similar
thing that we do on powerpc.

Well, we could do this in a followup patch if parisc people
wanted it.


> + if (addr < start_opd || addr >= end_opd)
> + return addr;
> +
> + return dereference_function_descriptor(addr);
> +}
> +#endif

Otherwise the patch looks fine to me.

Reviewed-by: Petr Mladek 

Best Regards,
Petr


[PATCH v2 2/2] powerpc/xive: clear XIVE internal structures when a CPU is removed

2017-10-04 Thread Cédric Le Goater
Commit eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE
interrupt controller") introduced support for the XIVE exploitation
mode of the P9 interrupt controller on the pseries platform. 

At that time, support for CPU removal was not complete on PowerVM and
CPU hot unplug remained untested. It appears that some cleanups of the
XIVE internal structures are required before releasing the CPU,
without which the kernel crashes in a rtas call doing the CPU
isolation.

These changes fix the crash by deconfiguring the IPI interrupt source
and clearing the event queues of the CPU when it is removed.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 1c087ed7427f..b9440ac7a3c1 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1398,6 +1398,14 @@ void xive_teardown_cpu(void)
 
if (xive_ops->teardown_cpu)
xive_ops->teardown_cpu(cpu, xc);
+
+#ifdef CONFIG_SMP
+   /* Get rid of IPI */
+   xive_cleanup_cpu_ipi(cpu, xc);
+#endif
+
+   /* Disable and free the queues */
+   xive_cleanup_cpu_queues(cpu, xc);
 }
 
 void xive_kexec_teardown_cpu(int secondary)
-- 
2.13.5



[PATCH v2 1/2] powerpc/xive: fix IPI reset

2017-10-04 Thread Cédric Le Goater
When resetting an IPI, hw_ipi should also be set to zero.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/spapr.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index f24a70bc6855..d9c4c9366049 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -431,7 +431,11 @@ static int xive_spapr_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 
 static void xive_spapr_put_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
+   if (!xc->hw_ipi)
+   return;
+
xive_irq_bitmap_free(xc->hw_ipi);
+   xc->hw_ipi = 0;
 }
 #endif /* CONFIG_SMP */
 
-- 
2.13.5



[PATCH v2 0/2] powerpc/xive: fix CPU hot unplug

2017-10-04 Thread Cédric Le Goater
Hi,

Recent commit eac1e731b59e ("powerpc/xive: guest exploitation of the
XIVE interrupt controller") introduced support for the XIVE
exploitation mode of the P9 interrupt controller on the pseries
platform.

At the time the patchset was sent, support for CPU removal was not
complete on PowerVM and only CPU hotplug was tested. To also support
CPU hot unplug, some cleanups of the XIVE internal structures are
needed. These two patches include the missing bits without which the
kernel crashes in a rtas call when a CPU is released.

There are still some corner cases to address when stressing the lpar
with plug/unplug loops. These are not necessarily Linux issues. Under
QEMU, it showed to be the decrementer waking the dying CPU. Under
phyp, it is much more rare and investigation is under progress.

Tested under a phyp and a XIVE QEMU model for pseries.

These changes should only impact the pseries platform. 

Thanks,

C.

Changes since v2 :

 - clarified commit log. 

Cédric Le Goater (2):
  powerpc/xive: fix IPI reset
  powerpc/xive: fix cpu removal

 arch/powerpc/sysdev/xive/common.c | 8 
 arch/powerpc/sysdev/xive/spapr.c  | 4 
 2 files changed, 12 insertions(+)

-- 
2.13.5



Re: [PATCHv3 4/7] powerpc64: Add .opd based function descriptor dereference

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:16, Sergey Senozhatsky wrote:
> We are moving towards separate kernel and module function descriptor
> dereference callbacks. This patch enables it for powerpc64.
> 
> For pointers that belong to the kernel
> -  Added __start_opd and __end_opd pointers, to track the kernel
>.opd section address range;
> 
> -  Added dereference_kernel_function_descriptor(). Now we
>will dereference only function pointers that are within
>[__start_opd, __end_opd];
> 
> For pointers that belong to a module
> -  Added dereference_module_function_descriptor() to handle module
>function descriptor dereference. Now we will dereference only
>pointers that are within [module->opd.start, module->opd.end].
> 
> Signed-off-by: Sergey Senozhatsky 
> Tested-by: Helge Deller  # parisc64
> Tested-by: Santosh Sivaraj  # powerpc64
> Acked-by: Michael Ellerman  # powerpc64
> Tested-by: Tony Luck  # ia64
> ---
>  arch/powerpc/include/asm/module.h   |  3 +++
>  arch/powerpc/include/asm/sections.h | 11 +++
>  arch/powerpc/kernel/module_64.c | 16 
>  arch/powerpc/kernel/vmlinux.lds.S   |  2 ++
>  4 files changed, 32 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/module.h 
> b/arch/powerpc/include/asm/module.h
> index 6c0132c7212f..7e28442827f1 100644
> --- a/arch/powerpc/include/asm/module.h
> +++ b/arch/powerpc/include/asm/module.h
> @@ -45,6 +45,9 @@ struct mod_arch_specific {
>   unsigned long tramp;
>  #endif
>  
> + /* For module function descriptor dereference */
> + unsigned long start_opd;
> + unsigned long end_opd;
>  #else /* powerpc64 */
>   /* Indices of PLT sections within module. */
>   unsigned int core_plt_section;
> diff --git a/arch/powerpc/include/asm/sections.h 
> b/arch/powerpc/include/asm/sections.h
> index 67379b8945e8..6b4ee0d1645f 100644
> --- a/arch/powerpc/include/asm/sections.h
> +++ b/arch/powerpc/include/asm/sections.h
> @@ -75,6 +75,17 @@ static inline unsigned long 
> dereference_function_descriptor(unsigned long ptr)
>   ptr = (unsigned long)p;
>   return ptr;
>  }
> +
> +#undef dereference_kernel_function_descriptor
> +static inline unsigned long
> +dereference_kernel_function_descriptor(unsigned long addr)
> +{
> + if (addr < (unsigned long)__start_opd ||
> + addr >= (unsigned long)__end_opd)
> + return addr;
> +
> + return dereference_function_descriptor(addr);
> +}
>  #endif /* PPC64_ELF_ABI_v1 */
>  
>  #endif
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 0b0f89685b67..94caec045a90 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -344,6 +344,11 @@ int module_frob_arch_sections(Elf64_Ehdr *hdr,
>   else if (strcmp(secstrings+sechdrs[i].sh_name,"__versions")==0)
>   dedotify_versions((void *)hdr + sechdrs[i].sh_offset,
> sechdrs[i].sh_size);
> + else if (!strcmp(secstrings + sechdrs[i].sh_name, ".opd")) {
> + me->arch.start_opd = sechdrs[i].sh_addr;
> + me->arch.end_opd = sechdrs[i].sh_addr +
> +sechdrs[i].sh_size;
> + }
>  
>   /* We don't handle .init for the moment: rename to _init */
>   while ((p = strstr(secstrings + sechdrs[i].sh_name, ".init")))
> @@ -712,6 +717,17 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
>   return 0;
>  }
>  
> +#ifdef PPC64_ELF_ABI_v1
> +unsigned long dereference_module_function_descriptor(struct module *mod,
> +  unsigned long addr)
> +{
> + if (addr < mod->arch.start_opd || addr >= mod->arch.end_opd)
> + return addr;
> +
> + return dereference_function_descriptor(addr);
> +}
> +#endif /* PPC64_ELF_ABI_v1 */

I would personally move this up in the source file. It is related to
the definition of func_desc() and other functions that are
also PPC_ELF_ABI-specific.

Otherwise, it looks good to me.

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-04 Thread 'Christoph Hellwig'
On Wed, Oct 04, 2017 at 08:59:24AM +, David Laight wrote:
> Are you sure about this one?

Yes.  And if you are not you haven't read either of the cover letter,
the DMA-API documentation, or the reply from Thomas to your mail
yesterday.


Re: [PATCHv3 3/7] ia64: Add .opd based function descriptor dereference

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:15, Sergey Senozhatsky wrote:
> We are moving towards separate kernel and module function descriptor
> dereference callbacks. This patch enables it for IA64.
> 
> For pointers that belong to the kernel
> -  Added __start_opd and __end_opd pointers, to track the kernel
>.opd section address range;
> 
> -  Added dereference_kernel_function_descriptor(). Now we
>will dereference only function pointers that are within
>[__start_opd, __end_opd];
> 
> For pointers that belong to a module
> -  Added dereference_module_function_descriptor() to handle module
>function descriptor dereference. Now we will dereference only
>pointers that are within [module->opd.start, module->opd.end].
> 
> Signed-off-by: Sergey Senozhatsky 
> Tested-by: Helge Deller  # parisc64
> Tested-by: Santosh Sivaraj  # powerpc64
> Acked-by: Michael Ellerman  # powerpc64
> Tested-by: Tony Luck  # ia64

Looks good to me.

Reviewed-by: Petr Mladek 

Best Regards,
Petr


Re: [PATCH 1/2] powerpc/strict_rwx: quirks for STRICT_RWX patches

2017-10-04 Thread Michael Ellerman
Hi Balbir,

Mainly I think we need to add a check in mark_initmem_nx() too don't we?
Or should we put it somewhere common to both?

But seeing as I'm replying here are some more comments.

> Subject: [PATCH 1/2] powerpc/strict_rwx: quirks for STRICT_RWX patches

This misses three key things, 1) it's a bug fix 2) it's for Power9 DD1,
and 3) it only applies to radix.

So how about:

powerpc/64s: Fix crashes on Power9 DD1 with radix MMU and STRICT_RWX


Balbir Singh  writes:
> This patch disables STRICT_RWX for power9 DD1 machines
> where due to some limitations with the way we do tlb
> updates, we clear the TLB entry of the text that's doing
> the update to kernel text and that does lead to a
> crash.

This mostly describes the patch, but I would prefer more detail, eg:

  When using the radix MMU on Power9 DD1, to work around a hardware
  problem, radix__pte_update() is required to do a two stage update of
  the PTE. First we write a zero value into the PTE, then we flush the
  TLB, and then we write the new PTE value.
  
  In the normal case that works OK, but it does not work if we're
  updating the PTE that maps the code we're executing, because the
  mapping is removed by the TLB flush and we can no longer execute from
  it. Unfortunately the STRICT_RWX code needs to do exactly that.
  
  The exact symptoms when we hit this case vary, sometimes we print an
  oops and then get stuck after that, but I've also seen a machine just
  get stuck continually page faulting with no oops printed. The variance
  is presumably due to the exact layout of the text and the page size
  used for the mappings. In all cases we are unable to boot to a shell.
  
  There are possible solutions such as creating a second mapping of the
  TLB flush code, executing from that, and then jumping back to the
  original. However we don't want to add that level of complexity for a
  DD1 work around.
  
  So just detect that we're running on Power9 DD1 and refrain from
  changing the permissions, effectively disabling STRICT_RWX on Power9
  DD1.


> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 39c252b..c2a2b46 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -169,6 +169,18 @@ void radix__mark_rodata_ro(void)
>  {
>   unsigned long start, end;
>  
> + /*
> +  * mark_rodata_ro() will mark itself as !writable at some point
   ^
   .
> +  * due to workaround in radix__pte_update(), we'll end up with
   ^ ^
   D the DD1
> +  * an invalid pte and the system will crash quite severly.

> +  * The alternatives are quite cumbersome and leaving out
> +  * the page containing the flush code is not very nice.

Just drop the last sentence, it doesn't have enough detail to be useful
and we chose not to implement those alternatives, so better not to
confuse the reader by talking about them.

> +  */
> + if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
> + pr_warn("Warning: Unable to mark rodata read only on P9 DD1\n");
> + return;
> + }

Don't we also need to do the check in radix__mark_initmem_nx() ?

cheers


Re: [PATCHv3 2/7] sections: split dereference_function_descriptor()

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:14, Sergey Senozhatsky wrote:
> There are two format specifiers to print out a pointer in symbolic
> format: '%pS/%ps' and '%pF/%pf'. On most architectures, the two
> mean exactly the same thing, but some architectures (ia64, ppc64,
> parisc64) use an indirect pointer for C function pointers, where
> the function pointer points to a function descriptor (which in
> turn contains the actual pointer to the code). The '%pF/%pf, when
> used appropriately, automatically does the appropriate function
> descriptor dereference on such architectures.
> 
> The "when used appropriately" part is tricky. Basically this is
> a subtle ABI detail, specific to some platforms, that made it to
> the API level and people can be unaware of it and miss the whole
> "we need to dereference the function" business out. [1] proves
> that point (note that it fixes only '%pF' and '%pS', there might
> be '%pf' and '%ps' cases as well).
> 
> It appears that we can handle everything within the affected
> arches and make '%pS/%ps' smart enough to retire '%pF/%pf'.
> Function descriptors live in .opd elf section and all affected
> arches (ia64, ppc64, parisc64) handle it properly for kernel
> and modules. So we, technically, can decide if the dereference
> is needed by simply looking at the pointer: if it belongs to
> .opd section then we need to dereference it.

Great catch. I really like this approach!

> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index e5da44eddd2f..387f22c41e0d 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -29,6 +29,7 @@
>   *   __ctors_start, __ctors_end
>   *   __irqentry_text_start, __irqentry_text_end
>   *   __softirqentry_text_start, __softirqentry_text_end
> + *   __start_opd, __end_opd
>   */
>  extern char _text[], _stext[], _etext[];
>  extern char _data[], _sdata[], _edata[];
> @@ -47,12 +48,15 @@ extern char __softirqentry_text_start[], 
> __softirqentry_text_end[];
>  /* Start and end of .ctors section - used for constructor calls. */
>  extern char __ctors_start[], __ctors_end[];
>  
> +/* Start and end of .opd section - used for function descriptors. */
> +extern char __start_opd[], __end_opd[];
> +
>  extern __visible const void __nosave_begin, __nosave_end;
>  
> -/* function descriptor handling (if any).  Override
> - * in asm/sections.h */
> +/* Function descriptor handling (if any).  Override in asm/sections.h */
>  #ifndef dereference_function_descriptor
>  #define dereference_function_descriptor(p) (p)
> +#define dereference_kernel_function_descriptor(p) (p)
>  #endif
>  
>  /* random extra sections (if any).  Override
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 4d0cb9bba93e..172904e9cded 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -85,6 +85,10 @@ void module_arch_cleanup(struct module *mod);
>  /* Any cleanup before freeing mod->module_init */
>  void module_arch_freeing_init(struct module *mod);
>  
> +/* Dereference module function descriptor */
> +unsigned long dereference_module_function_descriptor(struct module *mod,
> +  unsigned long addr);
> +

The function is used when the module is already loaded. IMHO,
include/linux/module.h would be a better place.

One advantage would be that we could use the same trick
as in include/asm-generic/sections.h. I mean:

#define dereference_module_function_descriptor(mod, addr) (addr)

and redefine it in the three affected
arch//include/asm/module.h headers. Then it might be completely
optimized out on all architectures.

Anyway, we need to get ack from Jessica for this change.

Best Regards,
Petr


RE: [PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-04 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
>
> ia64 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/ia64/include/asm/dma-mapping.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 3ce5ab4339f3..99dfc1aa9d3c 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -48,11 +48,6 @@ static inline void
>  dma_cache_sync (struct device *dev, void *vaddr, size_t size,
>   enum dma_data_direction dir)
>  {
> - /*
> -  * IA-64 is cache-coherent, so this is mostly a no-op.  However, we do 
> need to
> -  * ensure that dma_cache_sync() enforces order, hence the mb().
> -  */
> - mb();
>  }

Are you sure about this one?
It looks as though you are doing a mechanical change for all architectures.
Some of them are probably stranger than you realise.

Even with cache coherent memory any cpu 'store/write buffer' may not
be snooped by dma reads.

Something needs to flush the store buffer between the last cpu write
to the dma buffer and the write (probably a device register) that
tells the device it can read the memory.

My guess from the comment is that dma_cache_synch() is expected to
include that barrier - and it might not be anywhere else.

David



Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-04 Thread Michal Hocko
On Tue 03-10-17 11:29:16, Pasha Tatashin wrote:
> On 10/03/2017 09:18 AM, Michal Hocko wrote:
> > On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> > > Some memory is reserved but unavailable: not present in memblock.memory
> > > (because not backed by physical pages), but present in memblock.reserved.
> > > Such memory has backing struct pages, but they are not initialized by 
> > > going
> > > through __init_single_page().
> > 
> > Could you be more specific where is such a memory reserved?
> > 
> 
> I know of one example: trim_low_memory_range() unconditionally reserves from
> pfn 0, but e820__memblock_setup() might provide the exiting memory from pfn
> 1 (i.e. KVM).

Then just initialize struct pages for that mapping rigth there where a
special API is used.

> But, there could be more based on this comment from linux/page-flags.h:
> 
>  19  * PG_reserved is set for special pages, which can never be swapped out.
> Some
>  20  * of them might not even exist (eg empty_bad_page)...

I have no idea wht empty_bad_page is but a quick grep shows that this is
never used. I might be wrong here but if somebody is reserving a memory
in a special way then we should handle the initialization right there.
E.g. create an API for special memblock reservations.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-04 Thread Michal Hocko
On Tue 03-10-17 12:01:08, Pasha Tatashin wrote:
> Hi Michal,
> 
> Are you OK, if I replace DEFERRED_FREE() macro with a function like this:
> 
> /*
>  * Helper for deferred_init_range, free the given range, and reset the
>  * counters
>  */
> static inline unsigned long __def_free(unsigned long *nr_free,
>unsigned long *free_base_pfn,
>struct page **page)
> {
> unsigned long nr = *nr_free;
> 
> deferred_free_range(*free_base_pfn, nr);
> *free_base_pfn = 0;
> *nr_free = 0;
> *page = NULL;
> 
> return nr;
> }
> 
> Since it is inline, and we operate with non-volatile counters, compiler will
> be smart enough to remove all the unnecessary de-references. As a plus, we
> won't be adding any new branches, and the code is still going to stay
> compact.

OK. It is a bit clunky but we are holding too much state there. I
haven't checked whether that can be simplified but this can be always
done later.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-04 Thread Michal Hocko
On Tue 03-10-17 11:22:35, Pasha Tatashin wrote:
> 
> 
> On 10/03/2017 09:08 AM, Michal Hocko wrote:
> > On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
> > > Add struct page zeroing as a part of initialization of other fields in
> > > __init_single_page().
> > > 
> > > This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
> > > v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
> > > 
> > >  BASEFIX
> > > sparse_init 11.244671836s   0.007199623s
> > > zone_sizes_init  4.879775891s   8.355182299s
> > >--
> > > Total   16.124447727s   8.362381922s
> > 
> > Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
> > anymore, right? So these number depend on the last patch in the series?
> 
> Correct, without the last patch sparse_init time won't change.

THen this is just misleading.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-04 Thread Michal Hocko
On Tue 03-10-17 16:26:51, Pasha Tatashin wrote:
> Hi Michal,
> 
> I decided not to merge these two patches, because in addition to sparc
> optimization move, we have this dependancies:

optimizations can and should go on top of the core patch.

> mm: zero reserved and unavailable struct pages
> 
> must be before
> 
> mm: stop zeroing memory during allocation in vmemmap.
> 
> Otherwise, we can end-up with struct pages that are not zeroed properly.

Right and you can deal with it easily. Just introduce the
mm_zero_struct_page earlier along with its user in "stop zeroing ..."

I think that moving the zeroying in one go is more reasonable than
adding it to __init_single_page with misleading numbers and later
dropping the zeroying from the memmap path.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 07/11] powerpc: make dma_cache_sync a no-op

2017-10-04 Thread Christophe LEROY



Le 03/10/2017 à 13:43, Christoph Hellwig a écrit :

On Tue, Oct 03, 2017 at 01:24:57PM +0200, Christophe LEROY wrote:

powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

What about arch/powerpc/mm/dma-noncoherent.c ?

Powerpc 8xx doesn't have coherent memory.


It doesn't implement the DMA_ATTR_NON_CONSISTENT interface either,
so if it really doesn't have a way to provide dma coherent allocation
(although the code in __dma_alloc_coherent suggests it does provide
dma coherent allocations) I have no idea how it could ever have
worked.



Yes indeed it provides coherent memory by allocation non cached memory.

And drivers aiming at using non coherent memory do it by using kmalloc() 
with GFP_DMA then dma_map_single().
Then they use dma_sync_single_for_xxx(), which calls __dma_sync() on the 
8xx and is a nop on other powerpcs.


Christophe


Re: [PATCHv3 1/7] switch dereference_function_descriptor() to `unsigned long'

2017-10-04 Thread Petr Mladek
On Sat 2017-09-30 11:53:13, Sergey Senozhatsky wrote:
> Convert dereference_function_descriptor() to accept and return
> `unsigned long'. There will be two new ARCH function for kernel
> and module function pointer dereference, which will work with
> `unsigned long', so the patch unifies interfaces.
> 
> Besides, dereference_function_descriptor() mostly work with
> `unsigned long':
> 
> Convert dereference_function_descriptor() users tree-wide.

I am not sure if this is a real win. If I count correctly,
the net result is 6 additional casts in this patch. Many
casts are still needed also in the other patches.


> diff --git a/arch/ia64/include/asm/sections.h 
> b/arch/ia64/include/asm/sections.h
> index 2ab2003698ef..de6bfa1ef8fb 100644
> --- a/arch/ia64/include/asm/sections.h
> +++ b/arch/ia64/include/asm/sections.h
> @@ -27,13 +27,13 @@ extern char __start_unwind[], __end_unwind[];
>  extern char __start_ivt_text[], __end_ivt_text[];
>  
>  #undef dereference_function_descriptor
> -static inline void *dereference_function_descriptor(void *ptr)
> +static inline unsigned long dereference_function_descriptor(unsigned long 
> ptr)

I would also expect that a function called "dereference*" would work with
a pointer. The parameter is called ptr but it is unsigned long.

>  {
> - struct fdesc *desc = ptr;
> + struct fdesc *desc = (struct fdesc *)ptr;
>   void *p;
>  
>   if (!probe_kernel_address(>ip, p))
> - ptr = p;
> + ptr = (unsigned long)p;
>   return ptr;
>  }
>  
> diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
> index 1ca9a2b4239f..06e1b79e2946 100644
> --- a/arch/parisc/mm/init.c
> +++ b/arch/parisc/mm/init.c
> @@ -389,10 +389,10 @@ static void __init setup_bootmem(void)
>  static int __init parisc_text_address(unsigned long vaddr)
>  {
>   static unsigned long head_ptr __initdata;
> + unsigned long addr = (unsigned long)_kernel_start;
>  
>   if (!head_ptr)
> - head_ptr = PAGE_MASK & (unsigned long)
> - dereference_function_descriptor(_kernel_start);
> + head_ptr = PAGE_MASK & dereference_function_descriptor(addr);

IMHO, this is harder to read than the original. You need to
search the definition of "addr" and check its manipulation
to understand the meaning.

>  
>   return core_kernel_text(vaddr) || vaddr == head_ptr;
>  }


To make it clear. All these comments are not a big deal and I do
not want to invalidate all the acked-by and tested-by just because
of them.

But please, consider removing this change if we need to do
another iteration of this patchset. IMHO, there is no real win
and it would just pollute the git history.

Best Regards,
Petr


Re: [PATCH] kernel/module_64.c: Add REL24 relocation support of livepatch symbols

2017-10-04 Thread Kamalesh Babulal
On Tuesday 03 October 2017 03:00 PM, Naveen N . Rao wrote:

Hi Naveen,

[snip]
> 
> A few small nits focusing on just the trampoline...
> 
>> diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S 
>> b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
>> index c98e90b..708a96d 100644
>> --- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
>> +++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
>> @@ -249,6 +249,83 @@ livepatch_handler:
>>
>>  /* Return to original caller of live patched function */
>>  blr
>> +
>> +/*
>> + * This livepatch stub code, called from livepatch module to jump into
>> + * kernel or other modules. It replicates the livepatch_handler code,
>> + * with an expection of jumping to the trampoline instead of patched
>> + * function.
>> + */
>> +.global klp_stub_insn
>> +klp_stub_insn:
>> +CURRENT_THREAD_INFO(r12, r1)
>> +
>> +/* Allocate 3 x 8 bytes */
>> +ld  r11, TI_livepatch_sp(r12)
>> +addir11, r11, 24
>> +std r11, TI_livepatch_sp(r12)
>> +
>> +/* Save toc & real LR on livepatch stack */
>> +std r2,  -24(r11)
>> +mflrr12
>> +std r12, -16(r11)
>> +
>> +/* Store stack end marker */
>> +lis r12, STACK_END_MAGIC@h
>> +ori r12, r12, STACK_END_MAGIC@l
>> +std r12, -8(r11)
> 
> Seeing as this is the same as livepatch_handler() except for this part 
> in the middle, does it make sense to reuse livepatch_handler() with 
> appropriate labels added for your use?  You could patch in the below 5 
> instructions using the macros from ppc-opcode.h...

Thanks for the review. The current upstream livepatch_handler code
is a bit different. I have posted a bug fix to 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-September/163824.html
which alters the livepatch_handler to have similar code. It's
a good idea to re-use the livepatch_handler code. I will re-spin
v2 with based on top of bug fix posted earlier.

> 
>> +
>> +/*
>> + * Stub memory is allocated dynamically, during the module load.
>> + * Load TOC relative address into r11. module_64.c::klp_stub_for_addr()
>> + * identifies the available free stub slot and loads the address into
>> + * r11 with two instructions.
>> + *
>> + * addis r11, r2, stub_address@ha
>> + * addi  r11, r11, stub_address@l
>> + */
>> +.global klp_stub_entry
>> +klp_stub_entry:
>> +addis   r11, r2, 0
>> +addir11, r11, 0
>> +
>> +/* Load the r12 with called function address from entry->funcdata */
>> +ld  r12, 128(r11)
>> +
>> +/* Move r12 into ctr for global entry and branch there */
>> +mtctr   r12
>> +bctrl
>> +
>> +/*
>> + * Now we are returning to the patched function. We are free to
>> + * use r11, r12 and we can use r2 until we restore it.
>> + */
>> +CURRENT_THREAD_INFO(r12, r1)
>> +
>> +ld  r11, TI_livepatch_sp(r12)
>> +
>> +/* Check stack marker hasn't been trashed */
>> +lis r2,  STACK_END_MAGIC@h
>> +ori r2,  r2, STACK_END_MAGIC@l
>> +ld  r12, -8(r11)
>> +2:  tdner12, r2
>> +EMIT_BUG_ENTRY 2b, __FILE__, __LINE__ - 1, 0
> 
> If you plan to keep this trampoline separate from livepatch_handler(), 
> note that the above bug entry is not required since you copy only the 
> text of this trampoline elsewhere and you won't have an associated bug 
> entry for that new stub address.
> 

Agree, klp_stub_entry trampoline will go away in v2. 

-- 
cheers,
Kamalesh.



Re: [PATCH v3 00/20] Speculative page faults

2017-10-04 Thread Laurent Dufour
On 25/09/2017 18:27, Alexei Starovoitov wrote:
> On Mon, Sep 18, 2017 at 12:15 AM, Laurent Dufour
>  wrote:
>> Despite the unprovable lockdep warning raised by Sergey, I didn't get any
>> feedback on this series.
>>
>> Is there a chance to get it moved upstream ?
> 
> what is the status ?
> We're eagerly looking forward for this set to land,
> since we have several use cases for tracing that
> will build on top of this set as discussed at Plumbers.

Hi Alexei,

Based on Plumber's note [1], it sounds that the use case is tied to the BPF
tracing where a call tp find_vma() call will be made on a process's context
to fetch user space's symbols.

Am I right ?
Is the find_vma() call made in the context of the process owning the mm
struct ?

Thanks,
Laurent.

[1] https://etherpad.openstack.org/p/LPC2017_Tracing)



[PATCH] powerpc/perf: Fix for core/nest imc call trace on cpuhotplug

2017-10-04 Thread Anju T Sudhakar
Nest/core pmu units are enabled only when it is used. A reference count is  
maintained for the events which uses the nest/core pmu units. Currently in  
*_imc_counters_release function a WARN() is used for notification of any
underflow of ref count. 

The case where event ref count hit a negative value is, when perf session is
started, followed by offlining of all cpus in a given core. 
i.e. in cpuhotplug offline path ppc_core_imc_cpu_offline() function set the 
ref->count to zero, if the current cpu which is about to offline is the last
cpu in a given core and make an OPAL call to disable the engine in that core.   
And on perf session termination, perf->destroy (core_imc_counters_release) will 
first decrement the ref->count for this core and based on the ref->count value  
an opal call is made to disable the core-imc engine.
Now, since cpuhotplug path already clears the ref->count for core and disabled  
the engine, perf->destroy() decrementing again at event termination make it 
negative which in turn fires the WARN_ON. The same happens for nest units.  

Add a check to see if the reference count is alreday zero, before decrementing  
the count, so that the ref count will not hit a negative value. 

Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 9ccac86f3463..e3a1f65933b5 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -399,6 +399,20 @@ static void nest_imc_counters_release(struct perf_event 
*event)
 
/* Take the mutex lock for this node and then decrement the reference 
count */
mutex_lock(>lock);
+   if (ref->refc == 0) {
+   /*
+* The scenario where this is true is, when perf session is
+* started, followed by offlining of all cpus in a given node.
+*
+* In the cpuhotplug offline path, ppc_nest_imc_cpu_offline()
+* function set the ref->count to zero, if the cpu which is
+* about to offline is the last cpu in a given node and make
+* an OPAL call to disable the engine in that node.
+*
+*/
+   mutex_unlock(>lock);
+   return;
+   }
ref->refc--;
if (ref->refc == 0) {
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST,
@@ -646,6 +660,20 @@ static void core_imc_counters_release(struct perf_event 
*event)
return;
 
mutex_lock(>lock);
+   if (ref->refc == 0) {
+   /*
+* The scenario where this is true is, when perf session is
+* started, followed by offlining of all cpus in a given core.
+*
+* In the cpuhotplug offline path, ppc_core_imc_cpu_offline()
+* function set the ref->count to zero, if the cpu which is
+* about to offline is the last cpu in a given core and make
+* an OPAL call to disable the engine in that core.
+*
+*/
+   mutex_unlock(>lock);
+   return;
+   }
ref->refc--;
if (ref->refc == 0) {
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
-- 
2.14.1