[RFC PATCH 0/2] KVM: PPC: Book3S HV: Transactional memory bug workarounds for POWER9

2017-12-07 Thread Paul Mackerras
This series adds emulation for some transactional memory instructions
that will be used on POWER9 DD2.2 processors as part of the
workarounds for hardware bugs.  The basic idea is that because the CPU
hardware cannot maintain a full checkpoint of the architected state
for all four threads, it needs to be able to escape from suspended
state at times.  (The CPU can escape from transactional state simply
by aborting the transaction and rolling back, but in suspended state
it can't roll back until the transaction is resumed.)  This is
implemented by having a "fake suspend" state where the MSR indicates
suspended state but there is no checkpoint.  This is differentiated
from real suspended state by a new bit in the PSSCR register.
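
Roughly, the distinction the hypervisor has to make looks like this (a
conceptual sketch only; the flag name is illustrative, not the actual KVM
code):

        /* MSR[TS] says "suspended" in both cases; only the new PSSCR bit,
         * mirrored in a flag on the hypervisor side, says whether a
         * checkpoint actually exists. */
        static bool vcpu_in_fake_suspend(struct kvm_vcpu *vcpu)
        {
                if (!MSR_TM_SUSPENDED(vcpu->arch.shregs.msr))
                        return false;
                return vcpu->arch.fake_suspend;  /* illustrative flag */
        }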

In addition to these two patches, there will need to be another patch
which handles HMIs (hypervisor maintenance interrupts) which the
hardware will generate on threads that are in real suspend state at
the point where another thread that was in a stop instruction wants to
wake up.  The process of taking the HMI and exiting the guest to host
context will do a treclaim which will eliminate the checkpointed state
from the CPU hardware, so the only thing remaining will be to clear
the HMER bit that indicates the cause of the HMI.
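
That last step is small; something like the following, where the bit name is
a placeholder for whichever HMER bit the hardware sets for this event:

        /* after the exit path's treclaim has discarded the checkpoint */
        unsigned long hmer = mfspr(SPRN_HMER);

        mtspr(SPRN_HMER, hmer & ~HMER_WAKEUP_CAUSE);  /* placeholder bit name */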

The series is against v4.14.

Paul.


[RFC PATCH 2/2] KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9

2017-12-07 Thread Paul Mackerras
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode).  Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads.  The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems.  This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.

The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional.  The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but in which the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated.  The trechkpt
instruction also causes a soft patch interrupt.

On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present.  The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state.  Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
reads back as 0.
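
In outline, the DD2.2 entry path therefore does something like this
(pseudocode only; the real code lives in the assembly entry path and the
helper names here are made up):

        /* never recreate a checkpoint on DD2.2 */
        if (MSR_TM_TRANSACTIONAL(guest_msr)) {
                /* roll the guest back to its pre-transactional state */
                abort_and_rollback(vcpu);
        } else if (MSR_TM_SUSPENDED(guest_msr)) {
                /* no checkpoint: set the PACA flag and the new PSSCR bit */
                enter_fake_suspend(vcpu);
        }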

On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.

Emulation of the instructions that cause a soft patch interrupt is handled
in two paths.  If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state.  This is called before we do
the treclaim in the guest exit path; because we haven't done treclaim,
we can get back to the guest with the transaction still active.
If the instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on.  This
handles all the cases including the cases that generate program
interrupts (illegal instruction or TM Bad Thing) and facility
unavailable interrupts.
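
Put together, the dispatch on a soft patch interrupt looks roughly like this
(paraphrased from the description above, not lifted from the patch; the
fake-suspend flag name is illustrative):

        if (!vcpu->arch.fake_suspend) {
                /* real suspend: try the lightweight path before treclaim,
                 * so the transaction can be kept alive */
                if (kvmhv_p9_tm_emulation_early(vcpu))
                        return RESUME_GUEST;
        }
        /* fake suspend, or a case the early path does not handle: take the
         * full exit and emulate in host context with the MMU on */
        return kvmhv_p9_tm_emulation(vcpu);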

The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0.  The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_asm.h|   2 +
 arch/powerpc/include/asm/kvm_book3s.h |   4 +
 arch/powerpc/include/asm/kvm_book3s_64.h  |  41 ++
 arch/powerpc/include/asm/kvm_book3s_asm.h |   1 +
 arch/powerpc/include/asm/kvm_host.h   |   1 +
 arch/powerpc/include/asm/ppc-opcode.h |   4 +
 arch/powerpc/include/asm/reg.h|   6 +
 arch/powerpc/kernel/asm-offsets.c |   2 +
 arch/powerpc/kernel/exceptions-64s.S  |   4 +-
 arch/powerpc/kvm/Makefile |   2 +
 arch/powerpc/kvm/book3s_hv.c  |  12 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  93 -
 arch/powerpc/kvm/book3s_hv_tm.c   | 217 ++
 arch/powerpc/kvm/book3s_hv_tm_builtin.c   | 109 +++
 14 files changed, 495 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_tm.c
 create mode 100644 arch/powerpc/kvm/book3s_hv_tm_builtin.c

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index 09a802bb702f..a790d5cf6ea3 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -108,6 +108,8 @@
 
 /* book3s_hv */
 
+#define BOOK3S_INTERRUPT_HV_SOFTPATCH  0x1500
+
 /*
  * Special trap used to indicate to host that this is a
  * passthrough interrupt that could not be handled
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index b8d5b8e35244..d302f4ed8385 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -240,6 +240,10 @@ extern void kvmppc_update_lpcr(struct kvm *kvm, unsigned 
long lpcr,
unsigned long mask);
 extern void kvmppc_set_fscr(struct kvm_vcpu *vcpu, u64 fscr);
 
+extern int kvmhv_p9_tm_emulation_early(struct kvm_vcpu *vcpu);
+extern int kvmhv_p9_tm_emulati

[RFC PATCH 1/2] powerpc: Add a CPU feature bit for TM bug workarounds on POWER9 DD2.2

2017-12-07 Thread Paul Mackerras
This adds a CPU feature bit which is set for POWER9 DD2.2 processors
which will be used to enable software emulation for some transactional
memory instructions, in order to work around hardware bugs.
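
As a minimal illustration, follow-on code can then gate the workaround on the
new bit with the usual accessor (the surrounding function here is made up for
the example):

        static bool p9_tm_emulation_wanted(void)
        {
                return cpu_has_feature(CPU_FTR_P9_TM_EMUL);
        }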

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/cputable.h |  5 -
 arch/powerpc/kernel/cputable.c  | 20 
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 53b31c2bcdf4..70cee46c046c 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -215,6 +215,7 @@ enum {
 #define CPU_FTR_DAWR   LONG_ASM_CONST(0x0400)
 #define CPU_FTR_DABRX  LONG_ASM_CONST(0x0800)
 #define CPU_FTR_PMAO_BUG   LONG_ASM_CONST(0x1000)
+#define CPU_FTR_P9_TM_EMUL LONG_ASM_CONST(0x2000)
 #define CPU_FTR_POWER9_DD1 LONG_ASM_CONST(0x4000)
 
 #ifndef __ASSEMBLY__
@@ -478,6 +479,7 @@ enum {
CPU_FTR_ARCH_207S | CPU_FTR_TM_COMP | CPU_FTR_ARCH_300)
 #define CPU_FTRS_POWER9_DD1 ((CPU_FTRS_POWER9 | CPU_FTR_POWER9_DD1) & \
 (~CPU_FTR_SAO))
+#define CPU_FTRS_POWER9_DD2_2 (CPU_FTRS_POWER9 | CPU_FTR_P9_TM_EMUL)
 #define CPU_FTRS_CELL  (CPU_FTR_USE_TB | CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
@@ -496,7 +498,8 @@ enum {
(CPU_FTRS_POWER4 | CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
 CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
 CPU_FTRS_POWER8 | CPU_FTRS_POWER8_DD1 | CPU_FTRS_CELL | \
-CPU_FTRS_PA6T | CPU_FTR_VSX | CPU_FTRS_POWER9 | 
CPU_FTRS_POWER9_DD1)
+CPU_FTRS_PA6T | CPU_FTR_VSX | CPU_FTRS_POWER9 | \
+CPU_FTRS_POWER9_DD1 | CPU_FTRS_POWER9_DD2_2)
 #endif
 #else
 enum {
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 760872916013..bb94bb6d9e4d 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -547,6 +547,26 @@ static struct cpu_spec __initdata cpu_specs[] = {
.machine_check_early= __machine_check_early_realmode_p9,
.platform   = "power9",
},
+   {   /* Power9 DD2.2 */
+   .pvr_mask   = 0x,
+   .pvr_value  = 0x004e1202,
+   .cpu_name   = "POWER9 (raw)",
+   .cpu_features   = CPU_FTRS_POWER9_DD2_2,
+   .cpu_user_features  = COMMON_USER_POWER9,
+   .cpu_user_features2 = COMMON_USER2_POWER9,
+   .mmu_features   = MMU_FTRS_POWER9,
+   .icache_bsize   = 128,
+   .dcache_bsize   = 128,
+   .num_pmcs   = 6,
+   .pmc_type   = PPC_PMC_IBM,
+   .oprofile_cpu_type  = "ppc64/power9",
+   .oprofile_type  = PPC_OPROFILE_INVALID,
+   .cpu_setup  = __setup_cpu_power9,
+   .cpu_restore= __restore_cpu_power9,
+   .flush_tlb  = __flush_tlb_power9,
+   .machine_check_early= __machine_check_early_realmode_p9,
+   .platform   = "power9",
+   },
{   /* Power9 */
.pvr_mask   = 0x,
.pvr_value  = 0x004e,
-- 
2.14.1



Re: [PATCH v5 0/3] Preparation for SR-IOV PowerVM Enablement

2017-12-07 Thread Alexey Kardashevskiy
On 06/12/17 02:13, Bryant G. Ly wrote:
> 
> 
> On 12/4/17 7:24 PM, Alexey Kardashevskiy wrote:
>> On 05/12/17 02:08, Bryant G. Ly wrote:
>>>
>>> On 12/2/17 7:45 PM, Alexey Kardashevskiy wrote:
 On 10/11/17 01:00, Bryant G. Ly wrote:
> v1 - Initial patch
> v2 - Addressed Bjorn's comment on creating a highly platform
>  dependent global exported symbol.
> v3 - Based patch off linux-ppc/master
> v4 - Using the sriov-drivers_autoprobe mechanism per Bjorn's request
> v5 - Fixed comments and commit message
 What is this made against of? I'd like to give it a try but it does not
 apply to Linus'es tree or powerpc/next. Thanks.

>>> This was made against powerpc/next back when it was still under 4.14-rc6.
>>> It has been in review for a while...
>> Sure, sha1 or github tree would be enough though to try.
> 
> 6cff0a118f23b98c604a3604ea9de11338e24fbe
> 
> or if you want to just use the github tree its:
> 
> https://github.com/powervm/sriov-ppc/tree/upstream


Thanks! Looks good, works fine (tried MLX5 with powernv - works, pseries
under KVM - prints a nice error instead of crashing).

Reviewed-by: Alexey Kardashevskiy 




> 
>>
>>> -Bryant
>>>
> Bryant G. Ly (3):
>   powerpc/kernel: Separate SR-IOV Calls
>   pseries: Add PSeries SR-IOV Machine dependent calls
>   PCI/IOV: Add pci_vf_drivers_autoprobe() interface
>
>  arch/powerpc/include/asm/machdep.h   |  7 ++
>  arch/powerpc/include/asm/pci-bridge.h|  4 +---
>  arch/powerpc/kernel/eeh_driver.c |  4 ++--
>  arch/powerpc/kernel/pci-common.c | 23 +++
>  arch/powerpc/kernel/pci_dn.c |  6 -
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 33 
> ++--
>  arch/powerpc/platforms/powernv/pci-ioda.c|  6 +++--
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 24 
>  arch/powerpc/platforms/pseries/pci.c | 31 
> ++
>  drivers/pci/iov.c| 11 ++
>  include/linux/pci.h  |  2 ++
>  11 files changed, 118 insertions(+), 33 deletions(-)
>
>>
> 


-- 
Alexey


Re: [resend-without-rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's

2017-12-07 Thread Nicholas Piggin
On Fri,  8 Dec 2017 14:35:33 +1100
Balbir Singh  wrote:

> Certain HMIs such as malfunction errors propagate through
> all threads/cores on the system. If a thread was offline
> prior to us crashing the system and jumping to the kdump
> kernel, bad things happen when it wakes up due to an HMI
> in the kdump kernel.
> 
> There are several possible ways to solve this problem
> 
> 1. Put the offline cores in a state such that they are
> not woken up for machine check and HMI errors. This
> does not work, since we might need to wake up offline
> threads occasionally to handle TB errors
> 2. Ignore HMI errors, set up HMEER to mask HMI errors,
> but this still leaves the window open for any MCEs
> and masking them for the duration of the dump might
> be a concern
> 3. Wake up offline CPUs, as in send them to crash_ipi_callback
> (not wake them up as in mark them online as seen by
> the scheduler). kexec does a wake_online_cpus() call,
> this patch does something similar, but instead sends
> an IPI and forces them to crash_ipi_callback
> 
> Care is taken to enable this only for powernv platforms
> via crash_wake_offline (a global value set at setup
> time). The crash code sends out IPI's to all CPU's
> which then move to crash_ipi_callback and kexec_smp_wait().
> We don't grab the pt_regs for offline CPU's.
> 
> Signed-off-by: Balbir Singh 
> ---
> 
> Nick reviewed the patches and asked if
> 
> 1. We need to do anything on the other side of the kernel?
> The answer is not clear at this point, but I don't want
> to block this patch as it fixes a critical problem with
> kdump in SMT=2/1 mode
> 2. We should do this for other platforms
> The answer is the same as above; other platforms require testing
> and I can selectively enable them as needed as I test them

Yeah I didn't intend those as a nack for the patch... It's
a bit annoying to have these selections between online cpus
and present cpus depending on kdump.

We don't want to do a full CPU online in the kdump path of
course, but what if the crash code has a call that can IPI
offline CPUs to get them into the crash callback, rather than
put it in the general NMI IPI code?
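
Something like this, perhaps (a sketch of the idea only, not tested;
do_message_pass() stands in here for whatever low-level hook the SMP code
would need to export for it):

        /* crash.c: get present-but-offline CPUs into crash_ipi_callback()
         * without touching the generic NMI IPI path */
        void crash_wake_offline_cpus(void)
        {
                int cpu;

                for_each_present_cpu(cpu) {
                        if (cpu_online(cpu) || cpu == raw_smp_processor_id())
                                continue;
                        do_message_pass(cpu, PPC_MSG_NMI_IPI);
                }
        }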


> @@ -187,6 +188,14 @@ static void pnv_smp_cpu_kill_self(void)
>   WARN_ON(lazy_irq_pending());
>  
>   /*
> +  * For kdump kernels, we process the ipi and jump to
> +  * crash_ipi_callback. For more details see the description
> +  * at crash_wake_offline
> +  */
> + if (kdump_in_progress())
> + crash_ipi_callback(NULL);
> +
> + /*
>* If the SRR1 value indicates that we woke up due to
>* an external interrupt, then clear the interrupt.
>* We clear the interrupt before checking for the

I think you need to do this _after_ clearing the interrupt,
otherwise you get a lost wakeup window, don't you?

Thanks,
Nick


Re: [v2 PATCH] cpufreq: powernv: Correctly parse the sign of pstates on POWER8 vs POWER9

2017-12-07 Thread Balbir Singh
On Thu, Dec 7, 2017 at 4:59 PM, Gautham R. Shenoy
 wrote:
> From: "Gautham R. Shenoy" 
>
> On POWERNV platform, Pstates are 8-bit values. On POWER8 they are
> negatively numbered while on POWER9 they are positively
> numbered. Thus, on POWER9, the maximum number of pstates could be as
> high as 256.
>
> The current code interprets pstates as a signed 8-bit value. This
> causes a problem on POWER9 platforms which have more than 128 pstates.
> On such systems, on a CPU that is in a lower pstate whose number is
> greater than 128, querying the current pstate returns a "pstate X is
> out of bound" error message and the current pstate is reported as the
> nominal pstate.
>
> This patch fixes the aforementioned issue by correctly differentiating
> the sign whenever a pstate value is read, depending on whether the
> pstates are positively numbered or negatively numbered.

Yikes! Is there no better way of fixing this?

>
> Fixes: commit 09ca4c9b5958 ("cpufreq: powernv: Replacing pstate_id with 
> frequency table index")
> Cc:  #v4.8
> Signed-off-by: Gautham R. Shenoy 
> Tested-and-reviewed-by: Shilpasri G Bhat 
> Acked-by: Viresh Kumar 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 43 
> ++-
>  1 file changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index b6d7c4c..bb7586e 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -41,11 +41,14 @@
>  #define POWERNV_MAX_PSTATES256
>  #define PMSR_PSAFE_ENABLE  (1UL << 30)
>  #define PMSR_SPR_EM_DISABLE(1UL << 31)
> -#define PMSR_MAX(x)((x >> 32) & 0xFF)
> +#define EXTRACT_BYTE(x, shift) (((x) >> shift) & 0xFF)
> +#define MAX_SHIFT  32
>  #define LPSTATE_SHIFT  48
>  #define GPSTATE_SHIFT  56
> -#define GET_LPSTATE(x) (((x) >> LPSTATE_SHIFT) & 0xFF)
> -#define GET_GPSTATE(x) (((x) >> GPSTATE_SHIFT) & 0xFF)
> +#define GET_PMSR_MAX(x)EXTRACT_BYTE(x, MAX_SHIFT)
> +#define GET_LPSTATE(x) EXTRACT_BYTE(x, LPSTATE_SHIFT)
> +#define GET_GPSTATE(x) EXTRACT_BYTE(x, GPSTATE_SHIFT)
> +

Can you hide all of this in pstate_to_idx() and do the casting inside?
I was reviewing this code earlier before being distracted by something
else; this did come across as strange, and I was looking at using abs
values to simplify the code, but I did not get to it.
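
Something along these lines, perhaps (a sketch of the suggestion only;
pos_pstates and EXTRACT_BYTE are from the patch above, the helper name is
made up):

        /* Centralize the signed-vs-unsigned interpretation so callers such
         * as pstate_to_idx() and powernv_read_cpu_freq() never have to cast.
         */
        static inline int extract_pstate(u64 pmsr_val, unsigned int shift)
        {
                if (pos_pstates)
                        return (u8)EXTRACT_BYTE(pmsr_val, shift);

                return (s8)EXTRACT_BYTE(pmsr_val, shift);
        }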

Balbir Singh.


Re: [PATCH] powerpc/xmon: use ARRAY_SIZE on various array sizing calculations

2017-12-07 Thread Balbir Singh
On Thu, Dec 7, 2017 at 10:01 PM, Colin King  wrote:
> From: Colin Ian King 
>
> Use the ARRAY_SIZE macro on several arrays to determine their size.
> Improvement suggested by coccinelle.
>

This file is taken from binutils and re-licensed. Keeping the file
as-is helps apply newer patches easily on top as opposed to redoing
the changes. I would prefer not to move to ARRAY_SIZE and stick to
what's already in the file.

Balbir Singh.


[resend-without-rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's

2017-12-07 Thread Balbir Singh
Certain HMIs such as malfunction errors propagate through
all threads/cores on the system. If a thread was offline
prior to us crashing the system and jumping to the kdump
kernel, bad things happen when it wakes up due to an HMI
in the kdump kernel.

There are several possible ways to solve this problem

1. Put the offline cores in a state such that they are
not woken up for machine check and HMI errors. This
does not work, since we might need to wake up offline
threads occasionally to handle TB errors
2. Ignore HMI errors, set up HMEER to mask HMI errors,
but this still leaves the window open for any MCEs
and masking them for the duration of the dump might
be a concern
3. Wake up offline CPUs, as in send them to crash_ipi_callback
(not wake them up as in mark them online as seen by
the scheduler). kexec does a wake_online_cpus() call,
this patch does something similar, but instead sends
an IPI and forces them to crash_ipi_callback

Care is taken to enable this only for powernv platforms
via crash_wake_offline (a global value set at setup
time). The crash code sends out IPI's to all CPU's
which then move to crash_ipi_callback and kexec_smp_wait().
We don't grab the pt_regs for offline CPU's.

Signed-off-by: Balbir Singh 
---

Nick reviewed the patches and asked if

1. We need to do anything on the other side of the kernel?
The answer is not clear at this point, but I don't want
to block this patch as it fixes a critical problem with
kdump in SMT=2/1 mode
2. We should do this for other platforms
The answer is the same as above; other platforms require testing
and I can selectively enable them as needed as I test them

 arch/powerpc/include/asm/kexec.h |  2 ++
 arch/powerpc/kernel/crash.c  | 18 +-
 arch/powerpc/kernel/smp.c| 11 ---
 arch/powerpc/platforms/powernv/smp.c | 23 +++
 4 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 4419d435639a..9dcbfa6bbb91 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -73,6 +73,8 @@ extern void kexec_smp_wait(void); /* get and clear naca 
physid, wait for
  master to copy new code to 0 */
 extern int crashing_cpu;
 extern void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *));
+extern void crash_ipi_callback(struct pt_regs *);
+extern int crash_wake_offline;
 
 struct kimage;
 struct pt_regs;
diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/crash.c
index cbabb5adccd9..7e2ddfa9213e 100644
--- a/arch/powerpc/kernel/crash.c
+++ b/arch/powerpc/kernel/crash.c
@@ -44,6 +44,14 @@
 #define REAL_MODE_TIMEOUT  1
 
 static int time_to_dump;
+/*
+ * crash_wake_offline should be set to 1 by platforms that intend to wake
+ * up offline cpus prior to jumping to a kdump kernel. Currently powernv
+ * sets it to 1, since we want to avoid things from happening when an
+ * offline CPU wakes up due to something like an HMI (malfunction error),
+ * which propagates to all threads.
+ */
+int crash_wake_offline;
 
 #define CRASH_HANDLER_MAX 3
 /* List of shutdown handles */
@@ -63,17 +71,14 @@ static int handle_fault(struct pt_regs *regs)
 #ifdef CONFIG_SMP
 
 static atomic_t cpus_in_crash;
-static void crash_ipi_callback(struct pt_regs *regs)
+void crash_ipi_callback(struct pt_regs *regs)
 {
static cpumask_t cpus_state_saved = CPU_MASK_NONE;
 
int cpu = smp_processor_id();
 
-   if (!cpu_online(cpu))
-   return;
-
hard_irq_disable();
-   if (!cpumask_test_cpu(cpu, &cpus_state_saved)) {
+   if (cpu_online(cpu) && !cpumask_test_cpu(cpu, &cpus_state_saved)) {
crash_save_cpu(regs, cpu);
cpumask_set_cpu(cpu, &cpus_state_saved);
}
@@ -109,6 +114,9 @@ static void crash_kexec_prepare_cpus(int cpu)
 
printk(KERN_EMERG "Sending IPI to other CPUs\n");
 
+   if (crash_wake_offline)
+   ncpus = num_present_cpus() - 1;
+
crash_send_ipi(crash_ipi_callback);
smp_wmb();
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index e0a4c1f82e25..f485db54c2f9 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -429,10 +429,12 @@ static void do_smp_send_nmi_ipi(int cpu)
} else {
int c;
 
-   for_each_online_cpu(c) {
+   for_each_present_cpu(c) {
if (c == raw_smp_processor_id())
continue;
-   do_message_pass(c, PPC_MSG_NMI_IPI);
+   if (cpu_online(c) ||
+   (kdump_in_progress() && crash_wake_offline))
+   do_message_pass(c, PPC_MSG_NMI_IPI);
}
}
 }
@@ -485,7 +487,10 @@ int smp_send_nmi_ipi(int cpu, void (*fn)(struct pt_regs 
*), u64 delay_us)
 
if (cpu < 0) {
   

Re: [v2 PATCH] cpufreq: powernv: Correctly parse the sign of pstates on POWER8 vs POWER9

2017-12-07 Thread Rafael J. Wysocki
On Thu, Dec 7, 2017 at 6:59 AM, Gautham R. Shenoy
 wrote:
> From: "Gautham R. Shenoy" 
>
> On POWERNV platform, Pstates are 8-bit values. On POWER8 they are
> negatively numbered while on POWER9 they are positively
> numbered. Thus, on POWER9, the maximum number of pstates could be as
> high as 256.
>
> The current code interprets pstates as a signed 8-bit value. This
> causes a problem on POWER9 platforms which have more than 128 pstates.
> On such systems, on a CPU that is in a lower pstate whose number is
> greater than 128, querying the current pstate returns a "pstate X is
> out of bound" error message and the current pstate is reported as the
> nominal pstate.
>
> This patch fixes the aforementioned issue by correctly differentiating
> the sign whenever a pstate value is read, depending on whether the
> pstates are positively numbered or negatively numbered.
>
> Fixes: commit 09ca4c9b5958 ("cpufreq: powernv: Replacing pstate_id with 
> frequency table index")
> Cc:  #v4.8
> Signed-off-by: Gautham R. Shenoy 
> Tested-and-reviewed-by: Shilpasri G Bhat 
> Acked-by: Viresh Kumar 

I'm going to apply this, or please let me know if you want to route it
differently.

> ---
>  drivers/cpufreq/powernv-cpufreq.c | 43 
> ++-
>  1 file changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index b6d7c4c..bb7586e 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -41,11 +41,14 @@
>  #define POWERNV_MAX_PSTATES256
>  #define PMSR_PSAFE_ENABLE  (1UL << 30)
>  #define PMSR_SPR_EM_DISABLE(1UL << 31)
> -#define PMSR_MAX(x)((x >> 32) & 0xFF)
> +#define EXTRACT_BYTE(x, shift) (((x) >> shift) & 0xFF)
> +#define MAX_SHIFT  32
>  #define LPSTATE_SHIFT  48
>  #define GPSTATE_SHIFT  56
> -#define GET_LPSTATE(x) (((x) >> LPSTATE_SHIFT) & 0xFF)
> -#define GET_GPSTATE(x) (((x) >> GPSTATE_SHIFT) & 0xFF)
> +#define GET_PMSR_MAX(x)EXTRACT_BYTE(x, MAX_SHIFT)
> +#define GET_LPSTATE(x) EXTRACT_BYTE(x, LPSTATE_SHIFT)
> +#define GET_GPSTATE(x) EXTRACT_BYTE(x, GPSTATE_SHIFT)
> +
>
>  #define MAX_RAMP_DOWN_TIME 5120
>  /*
> @@ -64,6 +67,12 @@
>
>  /* Interval after which the timer is queued to bring down global pstate */
>  #define GPSTATE_TIMER_INTERVAL 2000
> +/*
> + * On POWER8 the pstates are negatively numbered. On POWER9, they are
> + * positively numbered.  Use this flag to track whether we have
> + * positive or negative numbered pstates.
> + */
> +static bool pos_pstates;
>
>  /**
>   * struct global_pstate_info - Per policy data structure to maintain history 
> of
> @@ -164,7 +173,7 @@ static inline unsigned int pstate_to_idx(int pstate)
> int min = powernv_freqs[powernv_pstate_info.min].driver_data;
> int max = powernv_freqs[powernv_pstate_info.max].driver_data;
>
> -   if (min > 0) {
> +   if (pos_pstates) {
> if (unlikely((pstate < max) || (pstate > min))) {
> pr_warn_once("pstate %d is out of bound\n", pstate);
> return powernv_pstate_info.nominal;
> @@ -301,6 +310,9 @@ static int init_powernv_pstates(void)
> }
> }
>
> +   if ((int)pstate_min > 0)
> +   pos_pstates = true;
> +
> /* End of list marker entry */
> powernv_freqs[i].frequency = CPUFREQ_TABLE_END;
> return 0;
> @@ -438,7 +450,6 @@ struct powernv_smp_call_data {
>  static void powernv_read_cpu_freq(void *arg)
>  {
> unsigned long pmspr_val;
> -   s8 local_pstate_id;
> struct powernv_smp_call_data *freq_data = arg;
>
> pmspr_val = get_pmspr(SPRN_PMSR);
> @@ -447,8 +458,11 @@ static void powernv_read_cpu_freq(void *arg)
>  * The local pstate id corresponds bits 48..55 in the PMSR.
>  * Note: Watch out for the sign!
>  */
> -   local_pstate_id = (pmspr_val >> 48) & 0xFF;
> -   freq_data->pstate_id = local_pstate_id;
> +   if (pos_pstates)
> +   freq_data->pstate_id = (u8)GET_LPSTATE(pmspr_val);
> +   else
> +   freq_data->pstate_id = (s8)GET_LPSTATE(pmspr_val);
> +
> freq_data->freq = pstate_id_to_freq(freq_data->pstate_id);
>
> pr_debug("cpu %d pmsr %016lX pstate_id %d frequency %d kHz\n",
> @@ -522,7 +536,10 @@ static void powernv_cpufreq_throttle_check(void *data)
> chip = this_cpu_read(chip_info);
>
> /* Check for Pmax Capping */
> -   pmsr_pmax = (s8)PMSR_MAX(pmsr);
> +   if (pos_pstates)
> +   pmsr_pmax = (u8)GET_PMSR_MAX(pmsr);
> +   else
> +   pmsr_pmax = (s8)GET_PMSR_MAX(pmsr);
> pmsr_pmax_idx = pstate_to_idx(pmsr_pmax);
> if (pmsr_pmax_idx != powernv_pstate_info.max) {
> if (chip->throttled)
> @@ -645,8 +662,14 @@ v

[PATCH] powerpc/perf: Fix kfree memory allocated for nest pmus

2017-12-07 Thread Anju T Sudhakar
imc_common_cpuhp_mem_free() is the common function for all IMC (In-Memory
Collection counters) domains to unregister the cpuhotplug callback and free
memory.  Since the kfree of the memory allocated for nest-imc
(per_nest_pmu_arr) sits in this common code, any domain (core/nest/thread)
can end up doing that kfree in its failure path.

This can produce a call trace like the one below, where core (or thread, or
nest) imc pmu initialization fails and, in its failure path,
imc_common_cpuhp_mem_free() frees the memory (per_nest_pmu_arr) that was
allocated by the nest units which registered successfully.


The call trace is generated in a scenario where core-imc initialization is
made to fail and a CPU hotplug is then performed on a P9 system.
During the hotplug, ppc_nest_imc_cpu_offline() tries to access per_nest_pmu_arr,
which has already been freed by core-imc.

[  136.563618] NIP [c0cb6a94] mutex_lock+0x34/0x90
[  136.563653] LR [c0cb6a88] mutex_lock+0x28/0x90
[  136.563687] Call Trace:
[  136.563707] [c016b7a93b90] [c0cb6a88] mutex_lock+0x28/0x90 
(unreliable)
[  136.563762] [c016b7a93bc0] [c02bc720] 
perf_pmu_migrate_context+0x90/0x3a0
[  136.563814] [c016b7a93c60] [c00f7a40] 
ppc_nest_imc_cpu_offline+0x190/0x1f0
[  136.563867] [c016b7a93cb0] [c0108140] 
cpuhp_invoke_callback+0x160/0x820
[  136.563918] [c016b7a93d30] [c010939c] 
cpuhp_thread_fun+0x1bc/0x270
[  136.563970] [c016b7a93d60] [c013d2b0] 
smpboot_thread_fn+0x250/0x290
[  136.564022] [c016b7a93dc0] [c0136f18] kthread+0x1a8/0x1b0
[  136.564067] [c016b7a93e30] [c000b4e8] 
ret_from_kernel_thread+0x5c/0x74

To address this scenario, do the kfree(per_nest_pmu_arr) only in the case of
nest-imc initialization failure, and only when no other nest units are
registered.


Fixes: 73ce9aec65b1 ("powerpc/perf: Fix IMC_MAX_PMU macro")
Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/perf/imc-pmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 0ead3cd..4eb9e2b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1171,6 +1171,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)
if (nest_pmus == 1) {

cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
+   kfree(per_nest_pmu_arr);
}
 
if (nest_pmus > 0)
@@ -1195,7 +1196,6 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)
kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]->attrs);
kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]);
kfree(pmu_ptr);
-   kfree(per_nest_pmu_arr);
return;
 }
 
@@ -1309,6 +1309,8 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
ret = nest_pmu_cpumask_init();
if (ret) {
mutex_unlock(&nest_init_lock);
+   kfree(nest_imc_refc);
+   kfree(per_nest_pmu_arr);
goto err_free;
}
}
-- 
2.7.4



Re: [PATCH] watchdog: core: make sure the watchdog_worker always works

2017-12-07 Thread Guenter Roeck

On 12/07/2017 02:38 AM, Christophe Leroy wrote:

When running a command like 'chrt -f 99 dd if=/dev/zero of=/dev/null',
the watchdog_worker fails to service the HW watchdog and the
HW watchdog fires long before the watchdog soft timeout.

At the moment, the watchdog_worker is invoked as a delayed work.
Delayed works are handled by non realtime kernel threads. The
WQ_HIGHPRI flag only increases the niceness of that threads.

This patchs replaces the delayed work logic by hrtimer, in order to


s/patchs/patch/


ensure that the watchdog_worker will already have priority.


always?



Signed-off-by: Christophe Leroy 
---
  drivers/watchdog/watchdog_dev.c | 87 +++--
  1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index 1e971a50d7fb..e9b234c4ff67 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -36,7 +36,7 @@
  #include /* For the -ENODEV/... values */
  #include/* For file operations */
  #include  /* For __init/__exit/... */
-#include/* For timeout functions */
+#include 
  #include/* For printk/panic/... */
  #include  /* For data references */
  #include/* For handling misc devices */
@@ -46,7 +46,6 @@
  #include  /* For memory functions */
  #include /* For standard types (like size_t) */
  #include  /* For watchdog specific items */
-#include  /* For workqueue */
  #include   /* For copy_to_user/put_user/... */
  
  #include "watchdog_core.h"

@@ -65,9 +64,9 @@ struct watchdog_core_data {
struct cdev cdev;
struct watchdog_device *wdd;
struct mutex lock;
-   unsigned long last_keepalive;
-   unsigned long last_hw_keepalive;
-   struct delayed_work work;
+   ktime_t last_keepalive;
+   ktime_t last_hw_keepalive;
+   struct hrtimer timer;
unsigned long status;   /* Internal status bits */
  #define _WDOG_DEV_OPEN0   /* Opened ? */
  #define _WDOG_ALLOW_RELEASE   1   /* Did we receive the magic char ? */
@@ -79,8 +78,6 @@ static dev_t watchdog_devt;
  /* Reference to watchdog device behind /dev/watchdog */
  static struct watchdog_core_data *old_wd_data;
  
-static struct workqueue_struct *watchdog_wq;

-
  static bool handle_boot_enabled =
IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED);
  
@@ -107,18 +104,19 @@ static inline bool watchdog_need_worker(struct watchdog_device *wdd)

(t && !watchdog_active(wdd) && watchdog_hw_running(wdd));
  }
  
-static long watchdog_next_keepalive(struct watchdog_device *wdd)

+static ktime_t watchdog_next_keepalive(struct watchdog_device *wdd)
  {
struct watchdog_core_data *wd_data = wdd->wd_data;
unsigned int timeout_ms = wdd->timeout * 1000;
-   unsigned long keepalive_interval;
-   unsigned long last_heartbeat;
-   unsigned long virt_timeout;
+   ktime_t keepalive_interval;
+   ktime_t last_heartbeat, latest_heartbeat;
+   ktime_t virt_timeout;
unsigned int hw_heartbeat_ms;
  
-	virt_timeout = wd_data->last_keepalive + msecs_to_jiffies(timeout_ms);

+   virt_timeout = ktime_add(wd_data->last_keepalive,
+ms_to_ktime(timeout_ms));
hw_heartbeat_ms = min_not_zero(timeout_ms, wdd->max_hw_heartbeat_ms);
-   keepalive_interval = msecs_to_jiffies(hw_heartbeat_ms / 2);
+   keepalive_interval = ms_to_ktime(hw_heartbeat_ms / 2);
  
  	if (!watchdog_active(wdd))

return keepalive_interval;
@@ -128,8 +126,11 @@ static long watchdog_next_keepalive(struct watchdog_device 
*wdd)
 * after the most recent ping from userspace, the last
 * worker ping has to come in hw_heartbeat_ms before this timeout.
 */
-   last_heartbeat = virt_timeout - msecs_to_jiffies(hw_heartbeat_ms);
-   return min_t(long, last_heartbeat - jiffies, keepalive_interval);
+   last_heartbeat = ktime_sub(virt_timeout, ms_to_ktime(hw_heartbeat_ms));
+   latest_heartbeat = ktime_sub(last_heartbeat, ktime_get());
+   if (ktime_before(latest_heartbeat, keepalive_interval))
+   return latest_heartbeat;
+   return keepalive_interval;
  }
  
  static inline void watchdog_update_worker(struct watchdog_device *wdd)

@@ -137,29 +138,33 @@ static inline void watchdog_update_worker(struct 
watchdog_device *wdd)
struct watchdog_core_data *wd_data = wdd->wd_data;
  
  	if (watchdog_need_worker(wdd)) {

-   long t = watchdog_next_keepalive(wdd);
+   ktime_t t = watchdog_next_keepalive(wdd);
  
  		if (t > 0)

-   mod_delayed_work(watchdog_wq, &wd_data->work, t);
+   hrtimer_start(&wd_data->timer, t, HRTIMER_MODE_REL);
} else {
-   cancel_delayed_work(&wd_data->work);
+   hrtimer_cancel(&wd_data->timer);

Re: [PATCH] [powerpc-next] Fix powerpc64 alignment of .toc section in kernel modules

2017-12-07 Thread Desnes Augusto Nunes do Rosário

Hello Michael,

On 12/07/2017 10:25 AM, Michael Ellerman wrote:

Hi Desnes,

Am I right that Alan largely wrote this patch?

If so it should probably be From: him, so that he is the author in the
git log.


Yes, Alan Modra is the main author and I am just committing it with 
minor changes. Thus, the author change is necessary.





Desnes Augusto Nunes do Rosario  writes:

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 1381693..c472f5b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -63,6 +63,7 @@ UTS_MACHINE := $(subst $(space),,$(machine-y))
  ifdef CONFIG_PPC32
  KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
  else
+KBUILD_LDFLAGS_MODULE += -T arch/powerpc/kernel/module.lds


This needs to be:

KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/powerpc/kernel/module.lds

Otherwise building with O=../build fails with:

   ld: cannot open linker script file arch/powerpc/kernel/module.lds: No such 
file or directory

I'll fix it up.


Indeed; this change is necessary to avoid any path errors.




diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 759104b..9b2c5c1 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -374,11 +377,13 @@ int module_frob_arch_sections(Elf64_Ehdr *hdr,
  }
  
  /* r2 is the TOC pointer: it actually points 0x8000 into the TOC (this

-   gives the value maximum span in an instruction which uses a signed
-   offset) */
+ * gives the value maximum span in an instruction which uses a signed
+ * offset).  Round down to a 256 byte boundary for the odd case where
+ * we are setting up r2 without a .toc section.
+ */
  static inline unsigned long my_r2(const Elf64_Shdr *sechdrs, struct module 
*me)
  {
-   return sechdrs[me->arch.toc_section].sh_addr + 0x8000;
+   return (sechdrs[me->arch.toc_section].sh_addr & -256) + 0x8000;


I think it's more typical in the kernel to write -256 as ~0xff.

Again I can fix it up.


Good to know!



cheers



Lastly, will you fix it up or do you want me to send a second version 
then? Whatever is best for you.


Thank you for the review.

--
Desnes Augusto Nunes do Rosário
--

Linux Developer - IBM / Brazil
M.Sc. in Electrical and Computer Engineering - UFRN

(11) 9595-30-900
desn...@br.ibm.com



Re: [PATCH] [powerpc-next] Fix powerpc64 alignment of .toc section in kernel modules

2017-12-07 Thread Michael Ellerman
Hi Desnes,

Am I right that Alan largely wrote this patch?

If so it should probably be From: him, so that he is the author in the
git log.
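
For reference, a line like the following at the very top of the mail body is
what git am uses to record Alan as the author (address left out here, as
everywhere in this archive); your own Signed-off-by still goes at the end of
the changelog:

    From: Alan Modra 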


Desnes Augusto Nunes do Rosario  writes:
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 1381693..c472f5b 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -63,6 +63,7 @@ UTS_MACHINE := $(subst $(space),,$(machine-y))
>  ifdef CONFIG_PPC32
>  KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
>  else
> +KBUILD_LDFLAGS_MODULE += -T arch/powerpc/kernel/module.lds

This needs to be:

KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/powerpc/kernel/module.lds

Otherwise building with O=../build fails with:

  ld: cannot open linker script file arch/powerpc/kernel/module.lds: No such 
file or directory

I'll fix it up.

> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 759104b..9b2c5c1 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -374,11 +377,13 @@ int module_frob_arch_sections(Elf64_Ehdr *hdr,
>  }
>  
>  /* r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
> -   gives the value maximum span in an instruction which uses a signed
> -   offset) */
> + * gives the value maximum span in an instruction which uses a signed
> + * offset).  Round down to a 256 byte boundary for the odd case where
> + * we are setting up r2 without a .toc section.
> + */
>  static inline unsigned long my_r2(const Elf64_Shdr *sechdrs, struct module 
> *me)
>  {
> - return sechdrs[me->arch.toc_section].sh_addr + 0x8000;
> + return (sechdrs[me->arch.toc_section].sh_addr & -256) + 0x8000;

I think it's more typical in the kernel to write -256 as ~0xff.

Again I can fix it up.

cheers


[PATCH] powerpc/xmon: use ARRAY_SIZE on various array sizing calculations

2017-12-07 Thread Colin King
From: Colin Ian King 

Use the ARRAY_SIZE macro on several arrays to determine their size.
Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King 
---
 arch/powerpc/xmon/ppc-opc.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/xmon/ppc-opc.c b/arch/powerpc/xmon/ppc-opc.c
index ac2b55b1332e..f3f57a12d43b 100644
--- a/arch/powerpc/xmon/ppc-opc.c
+++ b/arch/powerpc/xmon/ppc-opc.c
@@ -966,8 +966,7 @@ const struct powerpc_operand powerpc_operands[] =
   { 0xff, 11, NULL, NULL, PPC_OPERAND_SIGNOPT },
 };
 
-const unsigned int num_powerpc_operands = (sizeof (powerpc_operands)
-  / sizeof (powerpc_operands[0]));
+const unsigned int num_powerpc_operands = ARRAY_SIZE(powerpc_operands);
 
 /* The functions used to insert and extract complicated operands.  */
 
@@ -6980,8 +6979,7 @@ const struct powerpc_opcode powerpc_opcodes[] = {
 {"fcfidu.",XRC(63,974,1),  XRA_MASK, POWER7|PPCA2, PPCVLE, {FRT, 
FRB}},
 };
 
-const int powerpc_num_opcodes =
-  sizeof (powerpc_opcodes) / sizeof (powerpc_opcodes[0]);
+const int powerpc_num_opcodes = ARRAY_SIZE(powerpc_opcodes);
 
 /* The VLE opcode table.
 
@@ -7219,8 +7217,7 @@ const struct powerpc_opcode vle_opcodes[] = {
 {"se_bl",  BD8(58,0,1),BD8_MASK,   PPCVLE, 0,  {B8}},
 };
 
-const int vle_num_opcodes =
-  sizeof (vle_opcodes) / sizeof (vle_opcodes[0]);
+const int vle_num_opcodes = ARRAY_SIZE(vle_opcodes);
 
 /* The macro table.  This is only used by the assembler.  */
 
@@ -7288,5 +7285,4 @@ const struct powerpc_macro powerpc_macros[] = {
 {"e_clrlslwi",4, PPCVLE, "e_rlwinm %0,%1,%3,(%2)-(%3),31-(%3)"},
 };
 
-const int powerpc_num_macros =
-  sizeof (powerpc_macros) / sizeof (powerpc_macros[0]);
+const int powerpc_num_macros = ARRAY_SIZE(powerpc_macros);
-- 
2.14.1



[PATCH] watchdog: core: make sure the watchdog_worker always works

2017-12-07 Thread Christophe Leroy
When running a command like 'chrt -f 99 dd if=/dev/zero of=/dev/null',
the watchdog_worker fails to service the HW watchdog and the
HW watchdog fires long before the watchdog soft timeout.

At the moment, the watchdog_worker is invoked as a delayed work.
Delayed works are handled by non realtime kernel threads. The
WQ_HIGHPRI flag only increases the niceness of that threads.

This patchs replaces the delayed work logic by hrtimer, in order to
ensure that the watchdog_worker will already have priority.
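
A rough sketch of the kind of expiry handler this implies (illustrative names
only; since a driver's ping may sleep, the sketch just wakes a dedicated
high-priority thread from the timer rather than pinging in interrupt
context):

        static enum hrtimer_restart watchdog_timer_expired(struct hrtimer *timer)
        {
                struct watchdog_core_data *wd_data =
                        container_of(timer, struct watchdog_core_data, timer);

                /* hand the actual ping off to a high-priority kthread worker;
                 * watchdog_kworker and ping_work are illustrative names */
                kthread_queue_work(watchdog_kworker, &wd_data->ping_work);
                return HRTIMER_NORESTART;
        }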

Signed-off-by: Christophe Leroy 
---
 drivers/watchdog/watchdog_dev.c | 87 +++--
 1 file changed, 41 insertions(+), 46 deletions(-)

diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index 1e971a50d7fb..e9b234c4ff67 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -36,7 +36,7 @@
 #include/* For the -ENODEV/... values */
 #include   /* For file operations */
 #include /* For __init/__exit/... */
-#include  /* For timeout functions */
+#include 
 #include   /* For printk/panic/... */
 #include /* For data references */
 #include   /* For handling misc devices */
@@ -46,7 +46,6 @@
 #include /* For memory functions */
 #include/* For standard types (like size_t) */
 #include /* For watchdog specific items */
-#include/* For workqueue */
 #include  /* For copy_to_user/put_user/... */
 
 #include "watchdog_core.h"
@@ -65,9 +64,9 @@ struct watchdog_core_data {
struct cdev cdev;
struct watchdog_device *wdd;
struct mutex lock;
-   unsigned long last_keepalive;
-   unsigned long last_hw_keepalive;
-   struct delayed_work work;
+   ktime_t last_keepalive;
+   ktime_t last_hw_keepalive;
+   struct hrtimer timer;
unsigned long status;   /* Internal status bits */
 #define _WDOG_DEV_OPEN 0   /* Opened ? */
 #define _WDOG_ALLOW_RELEASE1   /* Did we receive the magic char ? */
@@ -79,8 +78,6 @@ static dev_t watchdog_devt;
 /* Reference to watchdog device behind /dev/watchdog */
 static struct watchdog_core_data *old_wd_data;
 
-static struct workqueue_struct *watchdog_wq;
-
 static bool handle_boot_enabled =
IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED);
 
@@ -107,18 +104,19 @@ static inline bool watchdog_need_worker(struct 
watchdog_device *wdd)
(t && !watchdog_active(wdd) && watchdog_hw_running(wdd));
 }
 
-static long watchdog_next_keepalive(struct watchdog_device *wdd)
+static ktime_t watchdog_next_keepalive(struct watchdog_device *wdd)
 {
struct watchdog_core_data *wd_data = wdd->wd_data;
unsigned int timeout_ms = wdd->timeout * 1000;
-   unsigned long keepalive_interval;
-   unsigned long last_heartbeat;
-   unsigned long virt_timeout;
+   ktime_t keepalive_interval;
+   ktime_t last_heartbeat, latest_heartbeat;
+   ktime_t virt_timeout;
unsigned int hw_heartbeat_ms;
 
-   virt_timeout = wd_data->last_keepalive + msecs_to_jiffies(timeout_ms);
+   virt_timeout = ktime_add(wd_data->last_keepalive,
+ms_to_ktime(timeout_ms));
hw_heartbeat_ms = min_not_zero(timeout_ms, wdd->max_hw_heartbeat_ms);
-   keepalive_interval = msecs_to_jiffies(hw_heartbeat_ms / 2);
+   keepalive_interval = ms_to_ktime(hw_heartbeat_ms / 2);
 
if (!watchdog_active(wdd))
return keepalive_interval;
@@ -128,8 +126,11 @@ static long watchdog_next_keepalive(struct watchdog_device 
*wdd)
 * after the most recent ping from userspace, the last
 * worker ping has to come in hw_heartbeat_ms before this timeout.
 */
-   last_heartbeat = virt_timeout - msecs_to_jiffies(hw_heartbeat_ms);
-   return min_t(long, last_heartbeat - jiffies, keepalive_interval);
+   last_heartbeat = ktime_sub(virt_timeout, ms_to_ktime(hw_heartbeat_ms));
+   latest_heartbeat = ktime_sub(last_heartbeat, ktime_get());
+   if (ktime_before(latest_heartbeat, keepalive_interval))
+   return latest_heartbeat;
+   return keepalive_interval;
 }
 
 static inline void watchdog_update_worker(struct watchdog_device *wdd)
@@ -137,29 +138,33 @@ static inline void watchdog_update_worker(struct 
watchdog_device *wdd)
struct watchdog_core_data *wd_data = wdd->wd_data;
 
if (watchdog_need_worker(wdd)) {
-   long t = watchdog_next_keepalive(wdd);
+   ktime_t t = watchdog_next_keepalive(wdd);
 
if (t > 0)
-   mod_delayed_work(watchdog_wq, &wd_data->work, t);
+   hrtimer_start(&wd_data->timer, t, HRTIMER_MODE_REL);
} else {
-   cancel_delayed_work(&wd_data->work);
+   hrtimer_cancel(&wd_data->timer);
}
 }
 
 static int __watchdog_ping(struct watchdog_device *wdd)
 {
struct watchdog