Re: [PATCH v2 7/9] powerpc/64s: do not allocate lppaca if we are not virtualized

2017-10-13 Thread Nicholas Piggin
On Sat, 14 Oct 2017 09:47:59 +1100
Paul Mackerras  wrote:

> On Sun, Aug 13, 2017 at 11:33:44AM +1000, Nicholas Piggin wrote:
> > The "lppaca" is a structure registered with the hypervisor. This
> > is unnecessary when running on non-virtualised platforms. One field
> > from the lppaca (pmcregs_in_use) is also used by the host, so move
> > the host part out into the paca (lppaca field is still updated in
> > guest mode).  
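For context, a minimal C sketch of the split described above, assuming a
helper along the lines of ppc_set_pmu_inuse() (the name and config guards
are assumptions; the actual patch may differ):

    static inline void ppc_set_pmu_inuse(int inuse)
    {
    	/* guest-visible copy, still registered with the hypervisor */
    	if (firmware_has_feature(FW_FEATURE_LPAR))
    		get_lppaca()->pmcregs_in_use = inuse;
    	/* host-side copy, now kept in the paca instead of the lppaca */
    	get_paca()->pmcregs_in_use = inuse;
    }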
> 
> There is an error in the patch, see below...
> 
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index c52184a8efdf..b838348e3a2b 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -99,8 +99,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
> > mtspr   SPRN_SPRG_VDSO_WRITE,r3
> >  
> > /* Reload the host's PMU registers */
> > -   ld  r3, PACALPPACAPTR(r13)  /* is the host using the PMU? */
> > -   lbz r4, LPPACA_PMCINUSE(r3)
> > +   lbz r4, PACA_PMCINUSE(r13) /* is the host using the PMU? */
> > cmpwi   r4, 0
> > beq 23f /* skip if not */
> >  BEGIN_FTR_SECTION
> > @@ -1671,7 +1670,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
> > mtspr   SPRN_MMCRA, r7
> > isync
> > beq 21f /* if no VPA, save PMU stuff anyway */
> > -   lbz r7, LPPACA_PMCINUSE(r8)
> > +   lbz r7, PACA_PMCINUSE(r13)  
> 
> We really do need to check the guest's flag not the host's here, since
> we're deciding whether to save the PMU state to the vcpu struct.

Okay I'll fix that up.

Thanks,
Nick


Re: [PATCH v2] KVM: PPC: Book3S PR: only install valid SLBs during KVM_SET_SREGS

2017-10-13 Thread Paul Mackerras
On Mon, Oct 02, 2017 at 10:40:22AM +0200, Greg Kurz wrote:
> Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
> some of which are valid (ie, SLB_ESID_V is set) and the rest are
> likely all-zeroes (with QEMU at least).
> 
> Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
> expects to find the SLB index in the 3 lower bits of its rb argument.
> When passed zeroed arguments, it happily overwrites the 0th SLB entry
> with zeroes. This is exactly what happens while doing live migration
> with QEMU when the destination pushes the incoming SLB descriptors to
> KVM PR. When reloading the SLBs at the next synchronization, QEMU first
> clears its SLB array and only restores valid ones, but the 0th one is
> now gone and we cannot access the corresponding memory anymore:
> 
> (qemu) x/x $pc
> c00b742c: Cannot access memory
> 
> To avoid this, let's filter out non-valid SLB entries. While here, we
> also force a full SLB flush before installing new entries.
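A minimal C sketch of the approach Greg describes, using the existing KVM
PR mmu hooks (the exact placement inside kvm_arch_vcpu_ioctl_set_sregs_pr()
is an assumption; the real patch may differ):

    /* flush all SLB entries first, then install only the valid ones */
    vcpu->arch.mmu.slbia(vcpu);

    for (i = 0; i < ARRAY_SIZE(sregs->u.s.ppc64.slb); i++) {
        u64 rb = sregs->u.s.ppc64.slb[i].slbe;
        u64 rs = sregs->u.s.ppc64.slb[i].slbv;

        if (rb & SLB_ESID_V)    /* skip the all-zero descriptors */
            vcpu->arch.mmu.slbmte(vcpu, rs, rb);
    }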

With this, a 32-bit powermac config with PR KVM enabled fails to build:

  CC [M]  arch/powerpc/kvm/book3s_pr.o
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr.c: In function 
‘kvm_arch_vcpu_ioctl_set_sregs_pr’:
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr.c:1337:13: error: 
‘SLB_ESID_V’ undeclared (first use in this function)
if (rb & SLB_ESID_V)
 ^
/home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr.c:1337:13: note: each 
undeclared identifier is reported only once for each function it appears in
/home/paulus/kernel/kvm/scripts/Makefile.build:313: recipe for target 
'arch/powerpc/kvm/book3s_pr.o' failed
make[3]: *** [arch/powerpc/kvm/book3s_pr.o] Error 1

Paul.


Re: [PATCH kernel] KVM: PPC: Protect kvmppc_gpa_to_ua() with srcu

2017-10-13 Thread Paul Mackerras
On Wed, Oct 11, 2017 at 04:00:34PM +1100, Alexey Kardashevskiy wrote:
> kvmppc_gpa_to_ua() accesses the KVM memory slot array via
> srcu_dereference_check(), and this produces warnings from RCU like the
> one below.
> 
> This extends the existing srcu_read_lock/unlock section to cover
> kvmppc_gpa_to_ua() as well.
> 
> We did not hit this before, as the lock is not needed for the realmode
> handlers and hash guests use the realmode path all the time; however,
> radix guests are always redirected to the virtual mode handlers, hence
> the warning.
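A minimal sketch of the fix, assuming the virtual-mode H_PUT_TCE handler;
the point is simply that kvmppc_gpa_to_ua() must run inside the SRCU
read-side section that protects the memslot array:

    idx = srcu_read_lock(&vcpu->kvm->srcu);
    ret = kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL);
    /* ... translate and use ua while the SRCU read lock is held ... */
    srcu_read_unlock(&vcpu->kvm->srcu, idx);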
> 
> [   68.253798] ./include/linux/kvm_host.h:575 suspicious 
> rcu_dereference_check() usage!
> [   68.253799]
>other info that might help us debug this:
> 
> [   68.253802]
>rcu_scheduler_active = 2, debug_locks = 1
> [   68.253804] 1 lock held by qemu-system-ppc/6413:
> [   68.253806]  #0:  (&vcpu->mutex){+.+.}, at: [] 
> vcpu_load+0x3c/0xc0 [kvm]
> [   68.253826]
>stack backtrace:
> [   68.253830] CPU: 92 PID: 6413 Comm: qemu-system-ppc Tainted: GW
>4.14.0-rc3-00553-g432dcba58e9c-dirty #72
> [   68.253833] Call Trace:
> [   68.253839] [c00fd3d9f790] [c0b7fcc8] dump_stack+0xe8/0x160 
> (unreliable)
> [   68.253845] [c00fd3d9f7d0] [c01924c0] 
> lockdep_rcu_suspicious+0x110/0x180
> [   68.253851] [c00fd3d9f850] [c00e825c] 
> kvmppc_gpa_to_ua+0x26c/0x2b0
> [   68.253858] [c00fd3d9f8b0] [c0080e3e1984] 
> kvmppc_h_put_tce+0x12c/0x2a0 [kvm]
> 
> Signed-off-by: Alexey Kardashevskiy 

Thanks, applied to my kvm-ppc-fixes branch.

Paul.


Re: [PATCH] KVM: PPC: Book3S HV: POWER9 more doorbell fixes

2017-10-13 Thread Paul Mackerras
On Tue, Oct 10, 2017 at 08:18:28PM +1000, Nicholas Piggin wrote:
> - Add another case where msgsync is required.
> - Required barrier sequence for global doorbells is msgsync ; lwsync
> - POWER9 DD1 has a different barrier sequence that we don't implement,
>   so remove
> 
> When msgsnd is used for IPIs to other cores, msgsync must be executed by
> the target to order stores performed on the source before its msgsnd
> (provided the source executes the appropriate sync).
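A hedged C sketch of the receive-side ordering described above (the kernel
itself spells the instruction through a PPC_MSGSYNC macro for the sake of
older assemblers; the helper name here is hypothetical):

    static inline void doorbell_receiver_barrier(void)
    {
        /* order the stores the msgsnd source made before its msgsnd ... */
        asm volatile("msgsync" : : : "memory");
        /* ... then the required load/store barrier before consuming data */
        asm volatile("lwsync" : : : "memory");
    }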
> 
> Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores 
> on POWER9")
> Signed-off-by: Nicholas Piggin 

Thanks, applied to my kvm-ppc-fixes branch (minus the comment about DD1).

Paul.


Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM

2017-10-13 Thread Paul Mackerras
On Thu, Sep 14, 2017 at 11:56:25PM +0200, Greg Kurz wrote:
> The following program causes a kernel oops:
> 
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/ioctl.h>
> #include <linux/kvm.h>
> 
> main()
> {
> int fd = open("/dev/kvm", O_RDWR);
> ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
> }
> 
> This happens because when using the global KVM fd with
> KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
> called with a NULL kvm argument, which gets dereferenced
> in is_kvmppc_hv_enabled(). Spotted while reading the code.
> 
> Let's use the hv_enabled fallback variable, like everywhere
> else in this function.
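A minimal sketch of the resulting check in kvm_vm_ioctl_check_extension(),
assuming the existing TM feature test is kept as-is; the key point is that
hv_enabled is safe to use when kvm is NULL:

    case KVM_CAP_PPC_HTM:
        r = hv_enabled && cpu_has_feature(CPU_FTR_TM_COMP);
        break;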
> 
> Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Greg Kurz 

Thanks, applied to my kvm-ppc-fixes branch.

Paul.


Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM

2017-10-13 Thread Paul Mackerras
On Fri, Oct 13, 2017 at 06:14:00PM +0200, Paolo Bonzini wrote:
> On 13/10/2017 01:16, Greg Kurz wrote:
> > Ping ?
> 
> When is Paul back from vacation? :)

Now. :)

Paul.


Re: [PATCH v2 7/9] powerpc/64s: do not allocate lppaca if we are not virtualized

2017-10-13 Thread Paul Mackerras
On Sun, Aug 13, 2017 at 11:33:44AM +1000, Nicholas Piggin wrote:
> The "lppaca" is a structure registered with the hypervisor. This
> is unnecessary when running on non-virtualised platforms. One field
> from the lppaca (pmcregs_in_use) is also used by the host, so move
> the host part out into the paca (lppaca field is still updated in
> guest mode).

There is an error in the patch, see below...

> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index c52184a8efdf..b838348e3a2b 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -99,8 +99,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>   mtspr   SPRN_SPRG_VDSO_WRITE,r3
>  
>   /* Reload the host's PMU registers */
> - ld  r3, PACALPPACAPTR(r13)  /* is the host using the PMU? */
> - lbz r4, LPPACA_PMCINUSE(r3)
> + lbz r4, PACA_PMCINUSE(r13) /* is the host using the PMU? */
>   cmpwi   r4, 0
>   beq 23f /* skip if not */
>  BEGIN_FTR_SECTION
> @@ -1671,7 +1670,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   mtspr   SPRN_MMCRA, r7
>   isync
>   beq 21f /* if no VPA, save PMU stuff anyway */
> - lbz r7, LPPACA_PMCINUSE(r8)
> + lbz r7, PACA_PMCINUSE(r13)

We really do need to check the guest's flag not the host's here, since
we're deciding whether to save the PMU state to the vcpu struct.

Paul.


Re: [PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-10-13 Thread Yury Norov
Hi James, all,

(adding linux-...@vger.kernel.org as this is user-visible, plus
Catalin Marinas and Arnd Bergmann)

On Thu, Jun 29, 2017 at 05:26:37PM +0100, James Morse wrote:
> compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
> instead using those in ptrace_request(). The compat variant should
> read a compat_sigset_t from userspace instead of ptrace_request()'s
> sigset_t.
> 
> While compat_sigset_t is the same size as sigset_t, it is defined as
> 2xu32, instead of a single u64. On a big-endian CPU this means that
> compat_sigset_t is passed to user-space using middle-endianness,
> where the least-significant u32 is written most significant byte
> first.
> 
> If ptrace_request()'s code is used, userspace will read the most
> significant u32 where it expected the least significant.
> 
> Instead of duplicating ptrace_request()s code as a special case in
> the arch code, handle it here.
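A minimal C sketch of the conversion the compat handler needs for
PTRACE_SETSIGMASK, assuming _NSIG == 64; it mirrors what
sigset_from_compat() does, so the least-significant compat word lands in
the low half of the native word on both endiannesses:

    compat_sigset_t cset;
    sigset_t new_set;

    if (copy_from_user(&cset, datap, sizeof(cset)))
        return -EFAULT;
    new_set.sig[0] = cset.sig[0] | (((unsigned long)cset.sig[1]) << 32);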
> 
> CC: Yury Norov 
> CC: Andrey Vagin 
> Reported-by: Zhou Chengming 
> Signed-off-by: James Morse 
> Fixes: 29000caecbe87 ("ptrace: add ability to get/set signal-blocked mask")
> ---
> LTP test case here:
> https://lists.linux.it/pipermail/ltp/2017-June/004932.html

This patch relies on sigset_{to,from}_compat(), which it was recently
proposed to remove from the kernel. That change is in linux-next, and it
breaks the build of the kernel with this patch. Below is the updated
version.

I'd like to ask here again: do we need this change? The patch is
correct, but it changes the ptrace API for compat big-endian
architectures. That would normally stop us from pulling it, but there
are seemingly no users of the API in the wild, so it will break
nothing.

The problem was originally reported by Zhou Chengming for BE arm64/ilp32.
I would like to see arm64/ilp32 working correctly in this case, and
developers of other new architectures probably would too.

Regarding arm64/ilp32, we have an agreed ABI, and the 4.12 and 4.13
kernels have this change:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=staging/ilp32-4.12
https://github.com/norov/linux/tree/ilp32-4.13

So I see 3 ways to proceed with this:
1. Drop the patch and remove it from arm64/ilp32;
2. Apply the patch as is;
3. Introduce new config option like ARCH_PTRACE_COMPAT_BE_SWAP_SIGMASK,
   make it enabled by default and disable explicitly for existing
   compat BE architectures.

I would choose 2 or 3 depending on what maintainers of existing
architectures think.

Yury

Signed-off-by: Yury Norov 
---
 kernel/ptrace.c | 52 
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 84b1367935e4..1af47a33768e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -880,6 +880,22 @@ static int ptrace_regset(struct task_struct *task, int 
req, unsigned int type,
 EXPORT_SYMBOL_GPL(task_user_regset_view);
 #endif
 
+static int ptrace_setsigmask(struct task_struct *child, sigset_t *new_set)
+{
+   sigdelsetmask(new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
+
+   /*
+* Every thread does recalc_sigpending() after resume, so
+* retarget_shared_pending() and recalc_sigpending() are not
+* called here.
+*/
+   spin_lock_irq(&child->sighand->siglock);
+   child->blocked = *new_set;
+   spin_unlock_irq(&child->sighand->siglock);
+
+   return 0;
+}
+
 int ptrace_request(struct task_struct *child, long request,
   unsigned long addr, unsigned long data)
 {
@@ -951,18 +967,7 @@ int ptrace_request(struct task_struct *child, long request,
break;
}
 
-   sigdelsetmask(&new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
-
-   /*
-* Every thread does recalc_sigpending() after resume, so
-* retarget_shared_pending() and recalc_sigpending() are not
-* called here.
-*/
-   spin_lock_irq(&child->sighand->siglock);
-   child->blocked = new_set;
-   spin_unlock_irq(&child->sighand->siglock);
-
-   ret = 0;
+   ret = ptrace_setsigmask(child, &new_set);
break;
}
 
@@ -1192,6 +1197,7 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
 {
compat_ulong_t __user *datap = compat_ptr(data);
compat_ulong_t word;
+   sigset_t new_set;
siginfo_t siginfo;
int ret;
 
@@ -1233,6 +1239,28 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
else
ret = ptrace_setsiginfo(child, &siginfo);
break;
+   case PTRACE_GETSIGMASK:
+   if (addr != sizeof(compat_sigset_t))
+   return -EINVAL;
+
+   ret = put_compat_sigset((compat_sigset_t __user *) datap,
+  

Re: [PATCH v3 2/2] pseries/eeh: Add Pseries pcibios_bus_add_device

2017-10-13 Thread Bryant G. Ly



On 10/13/17 1:05 PM, Alex Williamson wrote:

On Fri, 13 Oct 2017 07:01:48 -0500
Steven Royer  wrote:


On 2017-10-13 06:53, Steven Royer wrote:

On 2017-10-12 22:34, Bjorn Helgaas wrote:

[+cc Alex, Bodong, Eli, Saeed]

On Thu, Oct 12, 2017 at 02:59:23PM -0500, Bryant G. Ly wrote:

On 10/12/17 1:29 PM, Bjorn Helgaas wrote:

On Thu, Oct 12, 2017 at 03:09:53PM +1100, Michael Ellerman wrote:

Bjorn Helgaas  writes:
  

On Fri, Sep 22, 2017 at 09:19:28AM -0500, Bryant G. Ly wrote:

This patch adds the machine dependent call for
pcibios_bus_add_device, since the previous patch
separated the calls out between the PowerNV and PowerVM.

The difference here is that for the PowerVM environment
we do not want match_driver set, because in this environment
we do not want the VF device drivers to load immediately, due to
firmware loading the device node when the VF device is assigned to the
logical partition.

This patch will depend on the patch linked below, which is under
review.

https://patchwork.kernel.org/patch/9882915/

Signed-off-by: Bryant G. Ly 
Signed-off-by: Juan J. Alvarez 
---
  arch/powerpc/platforms/pseries/eeh_pseries.c | 24 
  1 file changed, 24 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 6b812ad990e4..45946ee90985 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -64,6 +64,27 @@ static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX];
  static DEFINE_SPINLOCK(slot_errbuf_lock);
  static int eeh_error_buf_size;
+void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
+{
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+
+   if (!pdev->is_virtfn)
+   return;
+
+   pdn->device_id  =  pdev->device;
+   pdn->vendor_id  =  pdev->vendor;
+   pdn->class_code =  pdev->class;
+
+   /*
+* The following operations will fail if VF's sysfs files
+* aren't created or its resources aren't finalized.
+*/
+   eeh_add_device_early(pdn);
+   eeh_add_device_late(pdev);
+   eeh_sysfs_add_device(pdev);
+   pdev->match_driver = -1;

match_driver is a bool, which should be assigned "true" or "false".

Above he mentioned a dependency on:

   [04/10] PCI: extend pci device match_driver state
   https://patchwork.kernel.org/patch/9882915/


Which makes it an int.

Oh, right, I missed that, thanks.
  

Or has that patch been rejected or something?

I haven't *rejected* it, but it's low on my priority list, so you
shouldn't depend on it unless it adds functionality you really need.
If I did apply that particular patch, I would want some rework because
it currently obfuscates the match_driver logic.  There's no clue when
reading the code what -1/0/1 mean.

So do you prefer enums? If so, I can make a change for that.
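A hedged sketch of what such an enum could look like (the names are
hypothetical; nothing like this exists in the PCI core today):

    enum pci_match_driver {
        PCI_MATCH_NEVER   = -1, /* core never binds a driver */
        PCI_MATCH_PENDING =  0, /* not yet eligible for matching */
        PCI_MATCH_OK      =  1, /* normal driver matching */
    };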

Apparently here you *do* want the "-1 means the PCI core will never
set match_driver to 1" functionality, so maybe you do depend on it.

We depend on the patch because we want that ability to never set
match_driver,
for SRIOV on PowerVM.

Is this really new PowerVM-specific functionality?  ISTR recent
discussions
about inhibiting driver binding in a generic way, e.g.,
http://lkml.kernel.org/r/1490022874-54718-1-git-send-email-bod...@mellanox.com
   

If that's the case, how do you ever bind a driver to these VFs?  The
changelog says you don't want VF drivers to load *immediately*, so I
assume you do want them to load eventually.
  

The VFs that get dynamically created within the configure SR-IOV
call, on the pSeries platform, won't be matched with a driver. We
do not want them to match.

The Power Hypervisor will load the VFs. The VFs will get
assigned (by the user) via the HMC or Novalink in this environment,
which will then trigger PHYP to load the VF device node into the
device tree.

I don't know what it means for the Hypervisor to "load the VFs."  Can
you explain that in PCI-speak?

The things I know about are:

   - we set PCI_SRIOV_CTRL_VFE in the PF, which enables VFs
   - now the VFs respond to config accesses
   - the PCI core enumerates the VFs by reading their config space
   - the PCI core builds pci_dev structs for the VFs
   - the PCI core adds these pci_devs to the bus
   - we try to bind drivers to the VFs
   - the VF driver probe function may read VF config space and VF BARs
   - the VF may be assigned to a guest VM
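For reference, a minimal sketch of the PF-side step that starts this
sequence, using the standard PCI core entry point a PF driver calls
(error handling trimmed):

    /* sets VF Enable (PCI_SRIOV_CTRL_VFE) and enumerates the VFs */
    ret = pci_enable_sriov(pf_dev, num_vfs);
    if (ret)
        dev_err(&pf_dev->dev, "SR-IOV enable failed: %d\n", ret);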

Where does "loading the VFs" fit in?  I don't know what HMC, Novalink,
or PHYP are.  I don't *need* to know what they are, as long as you can
explain what's happening in terms of the PCI concepts and generic
Linux VMs
and device assignment.

Bjorn

The VFs will be hotplugged into the VM separately from the enable
SR-IOV, so the driver will load as part of the hotplug operation.

Steve

One more point of clarification: when the hotplug happens, the VF will


[PATCH v12 10/11] sparc64: optimized struct page zeroing

2017-10-13 Thread Pavel Tatashin
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores, based on the size of
struct page. The compiler optimizes out the switch() conditions.

SPARC-M6 with 15T of memory, single thread performance:

                          BASE            FIX            OPTIMIZED_FIX
bootmem_init               28.440467985s   2.305674818s   2.305161615s
free_area_init_nodes      202.845901673s 225.343084508s 172.556506560s
                          ---------------------------------------------
Total                     231.286369658s 227.648759326s 174.861668175s

BASE:  current linux
FIX:   This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: David S. Miller 
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)   (mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea is that the compiler optimizes out the switch() statement and
+ * leaves only clrx instructions.
+ */
+#define mm_zero_struct_page(pp) do {                                \
+   unsigned long *_pp = (void *)(pp);  \
+   \
+/* Check that struct page is either 64, 72, or 80 bytes */ \
+   BUILD_BUG_ON(sizeof(struct page) & 7);  \
+   BUILD_BUG_ON(sizeof(struct page) < 64); \
+   BUILD_BUG_ON(sizeof(struct page) > 80); \
+   \
+   switch (sizeof(struct page)) {  \
+   case 80:\
+   _pp[9] = 0; /* fallthrough */   \
+   case 72:\
+   _pp[8] = 0; /* fallthrough */   \
+   default:\
+   _pp[7] = 0; \
+   _pp[6] = 0; \
+   _pp[5] = 0; \
+   _pp[4] = 0; \
+   _pp[3] = 0; \
+   _pp[2] = 0; \
+   _pp[1] = 0; \
+   _pp[0] = 0; \
+   }   \
+} while (0)
+
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.2



[PATCH v12 05/11] mm: defining memblock_virt_alloc_try_nid_raw

2017-10-13 Thread Pavel Tatashin
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
- Does not zero the allocated memory
- Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with the HASH_ZERO flag to
specify that the memory allocated for a system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If the memory does not need to be zeroed, call the new
memblock_virt_alloc_raw() interface, and thus improve boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones, to ensure that no
places expect zeroed memory.
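A minimal usage sketch under the rules above (my_early_table_init() is a
hypothetical caller-side initializer; the point is that a client which
fully initializes the buffer itself can skip the memset() in memblock):

    table = memblock_virt_alloc_raw(size, SMP_CACHE_BYTES);
    if (table)
        my_early_table_init(table, size);   /* writes every byte */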

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: Michal Hocko 
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c   | 60 +++--
 mm/page_alloc.c | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE (~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+ phys_addr_t min_addr,
+ phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
phys_addr_t align, phys_addr_t min_addr,
phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+   BOOTMEM_ALLOC_ACCESSIBLE,
+   NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+   phys_addr_t size,  phys_addr_t align)
+{
+   if (!align)
+   align = SMP_CACHE_BYTES;
+   return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init 
memblock_virt_alloc_try_nid(phys_addr_t size,
  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+   phys_addr_t size, phys_addr_t align,
+   phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+   return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+   min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
phys_addr_t size, phys_addr_t align,
phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
return NULL;
 done:
ptr = phys_to_virt(alloc);
-   memset(ptr, 0, size);
 
/*
 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *   is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *   is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *   allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * 

[PATCH v12 00/11] complete deferred page initialization

2017-10-13 Thread Pavel Tatashin
Changelog:
v12 - v11
- Improved comments for mm: zero reserved and unavailable struct pages
- Added back patch: mm: deferred_init_memmap improvements
- Added patch from Will Deacon: arm64: kasan: Avoid using
  vmemmap_populate to initialise shadow

v11 - v10
- Moved the kasan_map_populate() implementation from common code into
  arch-specific code, as discussed with Will Deacon. We do not need
  "mm/kasan: kasan specific map populate function" anymore, so only
  9 patches are left.

v10 - v9
- Addressed new comments from Michal Hocko.
- Sent "mm: deferred_init_memmap improvements" as a separate patch as
  it is also fixing existing problem.
- Merged "mm: stop zeroing memory during allocation in vmemmap" with
  "mm: zero struct pages during initialization".
- Added more comments "mm: zero reserved and unavailable struct pages"

v9 - v8
- Addressed comments raised by Mark Rutland and Ard Biesheuvel: changed
  kasan implementation. Added a new function: kasan_map_populate() that
  zeroes the allocated and mapped memory

v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed a bug reported by the kbuild test robot; new patch:
  mm: zero reserved and unavailable struct pages
- Removed patch
  x86/mm: reserve only exiting low pages
  As it is not needed anymore because of the previous fix.
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found few more issues, that I fixed in this
  iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations
v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as
  suggested by Michal Hocko
- Added code to handle issues related to accessing struct page
  memory before they are initialized.

v3 - v2
- Addressed David Miller comments about one change per patch:
* Splited changes to platforms into 4 patches
* Made "do not zero vmemmap_buf" as a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86, which proves the importance of
  keeping memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only.  Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read a struct page until it has been initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization


==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:
                         TIME            SPEED UP
base no deferred:        95.796233s
fix no deferred:         79.978956s      19.77%

base deferred:           77.254713s
fix deferred:            55.050509s      40.34%
==
SPARC M6 3600 MHz with 15T of memory
                         TIME            SPEED UP
base no deferred:        358.335727s
fix no deferred:         302.320936s     18.52%

base deferred:           237.534603s
fix deferred:            182.103003s     30.44%
==
Raw dmesg output with timestamps:
x86 base no deferred:https://hastebin.com/ofunepurit.scala
x86 base deferred:   https://hastebin.com/ifazegeyas.scala
x86 fix no deferred: https://hastebin.com/pegocohevo.scala
x86 fix deferred:https://hastebin.com/ofupevikuk.scala
sparc base no deferred:  https://hastebin.com/ibobeteken.go
sparc base deferred: https://hastebin.com/fariqimiyu.go
sparc fix no deferred:   https://hastebin.com/muhegoheyi.go
sparc fix deferred:  https://hastebin.com/xadinobutu.go

Pavel Tatashin (10):
  mm: deferred_init_memmap improvements
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  sparc64: simplify 

[PATCH v12 09/11] mm: stop zeroing memory during allocation in vmemmap

2017-10-13 Thread Pavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zeroed by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                   BASE            FIX
sparse_init        11.244671836s   0.007199623s
zone_sizes_init     4.879775891s   8.355182299s
                   ------------------------------
Total              16.124447727s   8.362381922s

sparse_init is where memory for struct pages is zeroed, and the zeroing
part is moved later in this patch into __init_single_page(), which is
called from zone_sizes_init().

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: Michal Hocko 
---
 include/linux/mm.h  | 11 +++
 mm/page_alloc.c |  1 +
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c |  6 +++---
 4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04c8b2e5aff4..fd045a3b243a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
long size, int node)
return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+   void *p = vmemmap_alloc_block(size, node);
+
+   if (!p)
+   return NULL;
+   memset(p, 0, size);
+
+   return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
   int node);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54e0fa12e7ff..eb2ac79926e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
 {
+   mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
unsigned long align,
unsigned long goal)
 {
-   return memblock_virt_alloc_try_nid(size, align, goal,
+   return memblock_virt_alloc_try_nid_raw(size, align, goal,
BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int 
node)
if (slab_is_available()) {
struct page *page;
 
-   page = alloc_pages_node(node,
-   GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-   get_order(size));
+   page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+   get_order(size));
if (page)
return page_address(page);
return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned 
long addr, int node)
 {
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned 
long addr, int node)
 {
pud_t *pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned 
long addr, int node)
 {
p4d_t *p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
-   void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+   void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
p4d_populate(&init_mm, 

[PATCH v12 02/11] x86/mm: setting fields in deferred pages

2017-10-13 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled, however, we set fields in
register_page_bootmem_info that are subsequently clobbered right after in
free_all_bootmem:

mem_init() {
register_page_bootmem_info();
free_all_bootmem();
...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

  mem_init
   register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

  free_all_bootmem()
   free_low_memory_core_early()
for_each_reserved_mem_region()
 reserve_bootmem_region()
  init_reserved_page() <- Only if this is deferred reserved page
   __init_single_pfn()
__init_single_page()
memset(0) <-- Lose the set fields here

We end up with an issue where we currently do not observe a problem, as
memory is explicitly zeroed. But if flag asserts are changed, we can start
hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: Michal Hocko 
---
 arch/x86/mm/init_64.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..8822523fdcd7 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,18 @@ void __init mem_init(void)
 
/* clear_bss() already clear the empty_zero_page */
 
-   register_page_bootmem_info();
-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/* Register memory areas for /proc/kcore */
kclist_add(_vsyscall, (void *)VSYSCALL_ADDR,
 PAGE_SIZE, KCORE_OTHER);
-- 
2.14.2



[PATCH v12 08/11] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

Because of this patch series, vmemmap memory won't be zeroed during
allocation, but kasan expects that memory to be zeroed. Therefore, we
must use a new interface that allocates and maps kasan shadow memory and
also zeroes it for us; we add kasan_map_populate() to resolve this
difference.

Signed-off-by: Pavel Tatashin 
---
 arch/arm64/mm/kasan_init.c | 72 ++
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..cb4af2951c90 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -28,6 +28,66 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_sect(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_sect(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be 
used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -161,11 +221,11 @@ void __init kasan_init(void)
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-   vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-pfn_to_nid(virt_to_pfn(lm_alias(_text;
+   kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
+  pfn_to_nid(virt_to_pfn(lm_alias(_text;
 
/*
-* vmemmap_populate() has populated the shadow region that covers the
+* kasan_map_populate() has populated the shadow region that covers the
 * kernel image with SWAPPER_BLOCK_SIZE mappings, so we have to round
 * the start and end addresses to SWAPPER_BLOCK_SIZE as well, to prevent
 * kasan_populate_zero_shadow() from replacing the page table entries
@@ -191,9 +251,9 @@ void __init kasan_init(void)
if (start >= end)
break;
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-   (unsigned long)kasan_mem_to_shadow(end),
-   pfn_to_nid(virt_to_pfn(start)));
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(start),
+  (unsigned long)kasan_mem_to_shadow(end),
+  pfn_to_nid(virt_to_pfn(start)));
}
 
/*
-- 
2.14.2



[PATCH v12 11/11] arm64: kasan: Avoid using vmemmap_populate to initialise shadow

2017-10-13 Thread Pavel Tatashin
From: Will Deacon 

The kasan shadow is currently mapped using vmemmap_populate since that
provides a semi-convenient way to map pages into swapper. However, since
that no longer zeroes the mapped pages, it is not suitable for kasan,
which requires that the shadow is zeroed in order to avoid false
positives.

This patch removes our reliance on vmemmap_populate and reuses the
existing kasan page table code, which is already required for creating
the early shadow.

Signed-off-by: Will Deacon 
Signed-off-by: Pavel Tatashin 
---
 arch/arm64/Kconfig |   2 +-
 arch/arm64/mm/kasan_init.c | 180 +++--
 2 files changed, 76 insertions(+), 106 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..888580b9036e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -68,7 +68,7 @@ config ARM64
select HAVE_ARCH_BITREVERSE
select HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
-   select HAVE_ARCH_KASAN if SPARSEMEM_VMEMMAP && !(ARM64_16K_PAGES && 
ARM64_VA_BITS_48)
+   select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index cb4af2951c90..acba49fb5aac 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -11,6 +11,7 @@
  */
 
 #define pr_fmt(fmt) "kasan: " fmt
+#include 
 #include 
 #include 
 #include 
@@ -28,66 +29,6 @@
 
 static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
 
-/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
-static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
-   int node)
-{
-   unsigned long addr, pfn, next;
-   unsigned long long size;
-   pgd_t *pgd;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-   int ret;
-
-   ret = vmemmap_populate(start, end, node);
-   /*
-* We might have partially populated memory, so check for no entries,
-* and zero only those that actually exist.
-*/
-   for (addr = start; addr < end; addr = next) {
-   pgd = pgd_offset_k(addr);
-   if (pgd_none(*pgd)) {
-   next = pgd_addr_end(addr, end);
-   continue;
-   }
-
-   pud = pud_offset(pgd, addr);
-   if (pud_none(*pud)) {
-   next = pud_addr_end(addr, end);
-   continue;
-   }
-   if (pud_sect(*pud)) {
-   /* This is PUD size page */
-   next = pud_addr_end(addr, end);
-   size = PUD_SIZE;
-   pfn = pud_pfn(*pud);
-   } else {
-   pmd = pmd_offset(pud, addr);
-   if (pmd_none(*pmd)) {
-   next = pmd_addr_end(addr, end);
-   continue;
-   }
-   if (pmd_sect(*pmd)) {
-   /* This is PMD size page */
-   next = pmd_addr_end(addr, end);
-   size = PMD_SIZE;
-   pfn = pmd_pfn(*pmd);
-   } else {
-   pte = pte_offset_kernel(pmd, addr);
-   next = addr + PAGE_SIZE;
-   if (pte_none(*pte))
-   continue;
-   /* This is base size page */
-   size = PAGE_SIZE;
-   pfn = pte_pfn(*pte);
-   }
-   }
-   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
-   }
-   return ret;
-}
-
 /*
  * The p*d_populate functions call virt_to_phys implicitly so they can't be 
used
  * directly on kernel symbols (bm_p*d). All the early functions are called too
@@ -95,77 +36,117 @@ static int __meminit kasan_map_populate(unsigned long 
start, unsigned long end,
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
-   unsigned long end)
+static phys_addr_t __init kasan_alloc_zeroed_page(int node)
 {
-   pte_t *pte;
-   unsigned long next;
+   void *p = memblock_virt_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
+ __pa(MAX_DMA_ADDRESS),
+ MEMBLOCK_ALLOC_ACCESSIBLE, node);
+   return __pa(p);
+}
 
-   if (pmd_none(*pmd))
-   __pmd_populate(pmd, __pa_symbol(kasan_zero_pte), 
PMD_TYPE_TABLE);

[PATCH v12 04/11] sparc64: simplify vmemmap_populate

2017-10-13 Thread Pavel Tatashin
Remove duplicated code by using the common functions
vmemmap_pud_populate and vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: David S. Miller 
Acked-by: Michal Hocko 
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index caed495544e9..6839db3ffe1d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2652,30 +2652,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
unsigned long vend,
vstart = vstart & PMD_MASK;
vend = ALIGN(vend, PMD_SIZE);
for (; vstart < vend; vstart += PMD_SIZE) {
-   pgd_t *pgd = pgd_offset_k(vstart);
+   pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
unsigned long pte;
pud_t *pud;
pmd_t *pmd;
 
-   if (pgd_none(*pgd)) {
-   pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+   if (!pgd)
+   return -ENOMEM;
 
-   if (!new)
-   return -ENOMEM;
-   pgd_populate(&init_mm, pgd, new);
-   }
-
-   pud = pud_offset(pgd, vstart);
-   if (pud_none(*pud)) {
-   pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-   if (!new)
-   return -ENOMEM;
-   pud_populate(&init_mm, pud, new);
-   }
+   pud = vmemmap_pud_populate(pgd, vstart, node);
+   if (!pud)
+   return -ENOMEM;
 
pmd = pmd_offset(pud, vstart);
-
pte = pmd_val(*pmd);
if (!(pte & _PAGE_VALID)) {
void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.14.2



[PATCH v12 06/11] mm: zero reserved and unavailable struct pages

2017-10-13 Thread Pavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is page_to_pfn() might access page->flags if this is
where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the existing
memory from pfn 1 (i.e. KVM).

Since struct pages are zeroed in __init_single_page(), and not at
allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)

Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
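A minimal sketch of how the new iterator can be used (the patch's actual
helper in mm/page_alloc.c may differ in details; PFN_DOWN/PFN_UP round the
physical range to page frames):

    phys_addr_t start, end;
    u64 i;

    for_each_resv_unavail_range(i, &start, &end) {
        unsigned long pfn;

        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
            mm_zero_struct_page(pfn_to_page(pfn));
    }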

===

Here is more detailed example of problem that this patch is addressing:

Run tested on qemu with the following arguments:

-enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.

e820__memblock_setup() reports to Linux that the following physical
ranges are available:
[1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, lets check what we have in zone 0: [1, 131039]

pfn 0, is not part of the zone, but pfns [1, 158], are.

However, the bigger problem we have if we do not initialize these struct
pages is with memory hotplug, because that path operates at 2M boundaries
(section_nr) and checks whether a 2M range of pages is hot removable. It
starts with the first pfn from the zone, rounds it down to a 2M boundary
(struct pages are allocated at 2M boundaries when vmemmap is created), and
checks if that section is hot removable. In this case we start with pfn 1
and round it down to pfn 0. Later the pfn is converted to a struct page,
and some fields are checked. Now, if we do not zero struct pages, we get
unpredictable results.

In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all vmemmap
memory to ones, the following panic is observed with a kernel test without
this patch applied:

BUG: unable to handle kernel NULL pointer dereference at  (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops:  [#1] PREEMPT
...
task: 88001f4e2900 task.stack: c9314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
RSP: 0018:c9317d60 EFLAGS: 00010202
RAX:  RBX: 88001d92b000 RCX: 
RDX:  RSI: 0020 RDI: 88001d92b000
RBP: c9317d80 R08: 10c8 R09: 
R10:  R11:  R12: 88001db2b000
R13: 81af6d00 R14: 88001f7d5000 R15: 82a1b6c0
FS:  7f4eb857f7c0() GS:81c27000() knlGS:0
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 1f4e6000 CR4: 06b0
Call Trace:
 ? is_mem_section_removable+0x5a/0xd0
 show_mem_removable+0x6b/0xa0
 dev_attr_show+0x1b/0x50
 sysfs_kf_seq_show+0xa1/0x100
 kernfs_seq_show+0x22/0x30
 seq_read+0x1ac/0x3a0
 kernfs_fop_read+0x36/0x190
 ? security_file_permission+0x90/0xb0
 __vfs_read+0x16/0x30
 vfs_read+0x81/0x130
 SyS_read+0x44/0xa0
 entry_SYSCALL_64_fastpath+0x1f/0xbd

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: Michal Hocko 
---
 include/linux/memblock.h | 16 
 include/linux/mm.h   | 15 +++
 mm/page_alloc.c  | 40 
 3 files changed, 71 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..ce8bfa5f3e9b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, 
unsigned long max_pfn);
for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
   nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable 
memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid 

[PATCH v12 03/11] sparc64/mm: setting fields in deferred pages

2017-10-13 Thread Pavel Tatashin
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in struct pages are never changed prior to first
initializing the struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set some
fields prior to initializing:

mem_init() {
 register_page_bootmem_info();
 free_all_bootmem();
 ...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
__init_single_pfn()
 __init_single_page()
  memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch: currently we do
not observe a problem, as memory is zeroed. But if flag asserts are
changed, we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
Acked-by: David S. Miller 
Acked-by: Michal Hocko 
---
 arch/sparc/mm/init_64.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 6034569e2c0d..caed495544e9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2548,9 +2548,16 @@ void __init mem_init(void)
 {
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-   register_page_bootmem_info();
free_all_bootmem();
 
+   /*
+* Must be done after boot memory is put on freelist, because here we
+* might set fields in deferred struct pages that have not yet been
+* initialized, and free_all_bootmem() initializes all the reserved
+* deferred pages for us.
+*/
+   register_page_bootmem_info();
+
/*
 * Set up the zero page, mark it reserved, so that page count
 * is not manipulated when freeing the page from user ptes.
-- 
2.14.2



[PATCH v12 01/11] mm: deferred_init_memmap improvements

2017-10-13 Thread Pavel Tatashin
deferred_init_memmap() is called when struct pages are initialized later
in boot by slave CPUs. This patch simplifies and optimizes this function,
and also fixes a couple of issues (described below).

The main change is that now we are iterating through free memblock areas
instead of all configured memory. Thus, we do not have to check if the
struct page has already been initialized.

=
In deferred_init_memmap() where all deferred struct pages are initialized
we have a check like this:

if (page->flags) {
VM_BUG_ON(page_zone(page) != zone);
goto free_range;
}

This way we are checking if the current deferred page has already been
initialized. It works because memory for struct pages has been zeroed, and
the only way flags can be non-zero is if the page already went through
__init_single_page(). But once we change the current behavior and no longer
zero the memory in the memblock allocator, we cannot trust anything inside
"struct page"es
until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.
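
In outline, the rewritten walk looks like this (a sketch only; the hunk with the actual loop is truncated below, and the clamping bounds are assumptions):

	/* Walk only free memblock ranges, clamped to the zone being initialized. */
	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
		nr_pages += deferred_init_range(nid, zid, spfn, epfn);
	}

Here i is a u64 cursor and spa/epa are the phys_addr_t bounds of each free range.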

Note, this first issue is relevant only when the following change is
merged:

=
This patch fixes another existing issue on systems that have holes in
zones i.e CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

if (!pfn_valid_within(pfn))
goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with this kind
of hole, Linux would get memory corruption. I have fixed this issue by
defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin 
Reviewed-by: Steven Sistare 
Reviewed-by: Daniel Jordan 
Reviewed-by: Bob Picco 
---
 mm/page_alloc.c | 168 
 1 file changed, 85 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..cdbd14829fd3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-   unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+  unsigned long nr_pages)
 {
-   int i;
+   struct page *page;
+   unsigned long i;
 
-   if (!page)
+   if (!nr_pages)
return;
 
+   page = pfn_to_page(pfn);
+
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == pageblock_nr_pages &&
(pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,89 @@ static inline void __init 
pgdat_init_report_one_done(void)
	complete(&pgdat_init_all_done_comp);
 }
 
+/*
+ * Helper for deferred_init_range, free the given range, reset the counters, and
+ * return number of pages freed.
+ */
+static inline unsigned long __def_free(unsigned long *nr_free,
+  unsigned long *free_base_pfn,
+  struct page **page)
+{
+   unsigned long nr = *nr_free;
+
+   deferred_free_range(*free_base_pfn, nr);
+   *free_base_pfn = 0;
+   *nr_free = 0;
+   *page = NULL;
+
+   return nr;
+}
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+unsigned long end_pfn)
+{
+   struct mminit_pfnnid_cache nid_init_state = { };
+   unsigned long nr_pgmask = pageblock_nr_pages - 1;
+   unsigned long free_base_pfn = 0;
+   unsigned long nr_pages = 0;
+   unsigned long nr_free = 0;
+   struct page *page = NULL;
+
+   for (; pfn < end_pfn; pfn++) {
+   /*
+* First we check if pfn is valid on architectures where it is
+* possible to have holes within pageblock_nr_pages. On systems
+* where it is not possible, this function is optimized out.
+*
+* Then, we check if a current large page is valid by only
+* checking the validity of the head pfn.
+*
+* meminit_pfn_in_nid is checked on systems where pfns can
+* interleave within a node: a pfn is between start and end
+* of a node, but does not belong to this memory node.
+*
+* Finally, we minimize pfn page lookups and scheduler checks by
+* performing it only once every pageblock_nr_pages.
+*/
+   if (!pfn_valid_within(pfn)) {
+			nr_pages += __def_free(&nr_free, &free_base_pfn, &page);
+   } else if 

[PATCH v12 07/11] x86/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
During early boot, kasan uses vmemmap_populate() to establish its shadow
memory. But that interface is intended for struct page use.

With this series, memory coming from the memblock allocator won't be zeroed
during allocation, while kasan expects its shadow memory to be zeroed.

Therefore, add a new kasan_map_populate() interface that allocates and maps
the kasan shadow memory, and also zeroes it for us.

Signed-off-by: Pavel Tatashin 
---
 arch/x86/mm/kasan_init_64.c | 75 ++---
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..9778fec8a5dc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,73 @@
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
+/* Creates mappings for kasan during early boot. The mapped memory is zeroed */
+static int __meminit kasan_map_populate(unsigned long start, unsigned long end,
+   int node)
+{
+   unsigned long addr, pfn, next;
+   unsigned long long size;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   int ret;
+
+   ret = vmemmap_populate(start, end, node);
+   /*
+* We might have partially populated memory, so check for no entries,
+* and zero only those that actually exist.
+*/
+   for (addr = start; addr < end; addr = next) {
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd)) {
+   next = pgd_addr_end(addr, end);
+   continue;
+   }
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d)) {
+   next = p4d_addr_end(addr, end);
+   continue;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud)) {
+   next = pud_addr_end(addr, end);
+   continue;
+   }
+   if (pud_large(*pud)) {
+   /* This is PUD size page */
+   next = pud_addr_end(addr, end);
+   size = PUD_SIZE;
+   pfn = pud_pfn(*pud);
+   } else {
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   next = pmd_addr_end(addr, end);
+   continue;
+   }
+   if (pmd_large(*pmd)) {
+   /* This is PMD size page */
+   next = pmd_addr_end(addr, end);
+   size = PMD_SIZE;
+   pfn = pmd_pfn(*pmd);
+   } else {
+   pte = pte_offset_kernel(pmd, addr);
+   next = addr + PAGE_SIZE;
+   if (pte_none(*pte))
+   continue;
+   /* This is base size page */
+   size = PAGE_SIZE;
+   pfn = pte_pfn(*pte);
+   }
+   }
+   memset(phys_to_virt(PFN_PHYS(pfn)), 0, size);
+   }
+   return ret;
+}
+
 static int __init map_range(struct range *range)
 {
unsigned long start;
@@ -23,7 +90,7 @@ static int __init map_range(struct range *range)
start = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->start));
end = (unsigned long)kasan_mem_to_shadow(pfn_to_kaddr(range->end));
 
-   return vmemmap_populate(start, end, NUMA_NO_NODE);
+   return kasan_map_populate(start, end, NUMA_NO_NODE);
 }
 
 static void __init clear_pgds(unsigned long start,
@@ -136,9 +203,9 @@ void __init kasan_init(void)
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
kasan_mem_to_shadow((void *)__START_KERNEL_map));
 
-   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
-   (unsigned long)kasan_mem_to_shadow(_end),
-   NUMA_NO_NODE);
+   kasan_map_populate((unsigned long)kasan_mem_to_shadow(_stext),
+  (unsigned long)kasan_mem_to_shadow(_end),
+  NUMA_NO_NODE);
 
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);
-- 
2.14.2



[PATCH v1] powerpc/pci: convert to use for_each_pci_bridge() helper

2017-10-13 Thread Andy Shevchenko
...which makes code slightly cleaner.

Requires: d43f59ce6c50 ("PCI: Add for_each_pci_bridge() helper")
Signed-off-by: Andy Shevchenko 
---
 arch/powerpc/kernel/pci-hotplug.c | 7 ++-
 arch/powerpc/kernel/pci_of_scan.c | 7 ++-
 2 files changed, 4 insertions(+), 10 deletions(-)
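
For reference, the helper added by d43f59ce6c50 is essentially the following wrapper (the exact form is an assumption, reconstructed rather than copied):

	#define for_each_pci_bridge(dev, bus)				\
		list_for_each_entry(dev, &bus->devices, bus_list)	\
			if (!pci_is_bridge(dev)) {} else

The dangling-else trick keeps it usable as a single-statement loop, which is what lets the conversions below drop a nesting level.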

diff --git a/arch/powerpc/kernel/pci-hotplug.c 
b/arch/powerpc/kernel/pci-hotplug.c
index 2d71269e7dc1..741f47295188 100644
--- a/arch/powerpc/kernel/pci-hotplug.c
+++ b/arch/powerpc/kernel/pci-hotplug.c
@@ -134,11 +134,8 @@ void pci_hp_add_devices(struct pci_bus *bus)
pcibios_setup_bus_devices(bus);
max = bus->busn_res.start;
for (pass = 0; pass < 2; pass++) {
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   if (pci_is_bridge(dev))
-   max = pci_scan_bridge(bus, dev,
- max, pass);
-   }
+   for_each_pci_bridge(dev, bus)
+   max = pci_scan_bridge(bus, dev, max, pass);
}
}
pcibios_finish_adding_to_bus(bus);
diff --git a/arch/powerpc/kernel/pci_of_scan.c 
b/arch/powerpc/kernel/pci_of_scan.c
index 0d790f8432d2..8bdaa2a6fa62 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -369,11 +369,8 @@ static void __of_scan_bus(struct device_node *node, struct 
pci_bus *bus,
pcibios_setup_bus_devices(bus);
 
/* Now scan child busses */
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   if (pci_is_bridge(dev)) {
-   of_scan_pci_bridge(dev);
-   }
-   }
+   for_each_pci_bridge(dev, bus)
+   of_scan_pci_bridge(dev);
 }
 
 /**
-- 
2.14.2



Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Will Deacon
On Fri, Oct 13, 2017 at 12:00:27PM -0400, Pavel Tatashin wrote:
> BTW, don't we need the same alignments inside the for_each_memblock() loop?

Hmm, yes actually, given that we shift them right for the shadow address.

> How about changing kasan_map_populate() to accept regular VA start and end
> addresses, and convert them internally after aligning to PAGE_SIZE?

That's what my original patch did, but it doesn't help on its own because
kasan_populate_zero_shadow would need the same change.

Will


Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM

2017-10-13 Thread Paolo Bonzini
On 13/10/2017 01:16, Greg Kurz wrote:
> Ping ?

When is Paul back from vacation? :)

Paolo

> On Thu, 14 Sep 2017 23:56:25 +0200
> Greg Kurz  wrote:
> 
>> The following program causes a kernel oops:
>>
>> #include <stdio.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <sys/ioctl.h>
>> #include <linux/kvm.h>
>>
>> main()
>> {
>> int fd = open("/dev/kvm", O_RDWR);
>> ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
>> }
>>
>> This happens because when using the global KVM fd with
>> KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
>> called with a NULL kvm argument, which gets dereferenced
>> in is_kvmppc_hv_enabled(). Spotted while reading the code.
>>
>> Let's use the hv_enabled fallback variable, like everywhere
>> else in this function.
>>
>> Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
>> Cc: sta...@vger.kernel.org # v4.7+
>> Signed-off-by: Greg Kurz 
>> ---
>>  arch/powerpc/kvm/powerpc.c |3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index 3480faaf1ef8..ee279c7f4802 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
>> ext)
>>  break;
>>  #endif
>>  case KVM_CAP_PPC_HTM:
>> -r = cpu_has_feature(CPU_FTR_TM_COMP) &&
>> -is_kvmppc_hv_enabled(kvm);
>> +r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
>>  break;
>>  default:
>>  r = 0;
>>
> 



[PATCH tip/sched/membarrier 5/5] Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on powerpc

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers 

Threads targeting the same VM but which belong to different thread
groups are a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.
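
A sketch of that check as it would sit in the registration path (the synchronize_sched() placement follows the description above; surrounding locking is omitted):

	/*
	 * Single-threaded, single-user mm: no other task can be running
	 * on this mm, so the scheduler synchronization can be skipped.
	 */
	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1))
		synchronize_sched();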

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_SWITCH_MM flag, issue synchronize_sched(), and only
then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows private
expedited membarrier commands to succeed. membarrier_arch_switch_mm()
now tests for the MEMBARRIER_STATE_SWITCH_MM flag.

Changes since v1:
- Remove membarrier thread flag on powerpc (now unused).

Reported-by: Peter Zijlstra 
Signed-off-by: Mathieu Desnoyers 
CC: Paul E. McKenney 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: gro...@google.com
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: Nicholas Piggin 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney 
---
 arch/powerpc/include/asm/membarrier.h  | 21 ++---
 arch/powerpc/include/asm/thread_info.h |  3 ---
 arch/powerpc/kernel/membarrier.c   | 17 -
 include/linux/mm_types.h   |  2 +-
 include/linux/sched/mm.h   | 28 ++--
 kernel/fork.c  |  2 --
 kernel/sched/membarrier.c  | 16 +---
 7 files changed, 26 insertions(+), 63 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
index 61152a7a3cf9..0951646253d9 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -11,8 +11,8 @@ static inline void membarrier_arch_switch_mm(struct mm_struct 
*prev,
 * when switching from userspace to kernel is not needed after
 * store to rq->curr.
 */
-   if (likely(!test_ti_thread_flag(task_thread_info(tsk),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED) || !prev))
+   if (likely(!(atomic_read(&next->membarrier_state)
+   & MEMBARRIER_STATE_SWITCH_MM) || !prev))
return;
 
/*
@@ -21,23 +21,6 @@ static inline void membarrier_arch_switch_mm(struct 
mm_struct *prev,
 */
smp_mb();
 }
-static inline void membarrier_arch_fork(struct task_struct *t,
-   unsigned long clone_flags)
-{
-   /*
-* Coherence of TIF_MEMBARRIER_PRIVATE_EXPEDITED against thread
-* fork is protected by siglock. membarrier_arch_fork is called
-* with siglock held.
-*/
-   if (test_thread_flag(TIF_MEMBARRIER_PRIVATE_EXPEDITED))
-   set_ti_thread_flag(task_thread_info(t),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED);
-}
-static inline void membarrier_arch_execve(struct task_struct *t)
-{
-   clear_ti_thread_flag(task_thread_info(t),
-   TIF_MEMBARRIER_PRIVATE_EXPEDITED);
-}
 void membarrier_arch_register_private_expedited(struct task_struct *t);
 
 #endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 2a208487724b..a941cc6fc3e9 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -100,7 +100,6 @@ static inline struct thread_info *current_thread_info(void)
 #if defined(CONFIG_PPC64)
 #define TIF_ELF2ABI18  /* function descriptors must die! */
 #endif
-#define TIF_MEMBARRIER_PRIVATE_EXPEDITED   19  /* membarrier */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)

[PATCH tip/sched/membarrier 1/5] membarrier: Provide register expedited private command

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers 

Provide a new command allowing processes to register their intent to use
the private expedited command.

This allows PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private commands.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
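
From userspace the expected sequence becomes something like the following illustration (the raw syscall wrapper is only for the sketch):

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int membarrier(int cmd, int flags)
	{
		return syscall(__NR_membarrier, cmd, flags);
	}

	/* Register once per process... */
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
	/* ...after which the expedited command no longer returns EPERM. */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);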

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Signed-off-by: Mathieu Desnoyers 
CC: Peter Zijlstra 
CC: Paul E. McKenney 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: gro...@google.com
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: Nicholas Piggin 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney 
---
 MAINTAINERS|  2 ++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/membarrier.h  | 43 +
 arch/powerpc/include/asm/thread_info.h |  3 ++
 arch/powerpc/kernel/Makefile   |  2 ++
 arch/powerpc/kernel/membarrier.c   | 45 ++
 arch/powerpc/mm/mmu_context.c  |  7 +
 fs/exec.c  |  1 +
 include/linux/mm_types.h   |  3 ++
 include/linux/sched/mm.h   | 50 ++
 include/uapi/linux/membarrier.h| 23 +++-
 init/Kconfig   |  3 ++
 kernel/fork.c  |  2 ++
 kernel/sched/core.c| 10 ---
 kernel/sched/membarrier.c  | 25 ++---
 15 files changed, 199 insertions(+), 21 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h
 create mode 100644 arch/powerpc/kernel/membarrier.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 2d3d750b19c0..f0bc68b2d221 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8829,6 +8829,8 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/kernel/membarrier.c
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468edab1..6f44c5f74f71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -138,6 +138,7 @@ config PPC
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_HAS_MEMBARRIER_HOOKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 

[PATCH tip/sched/membarrier 4/5] membarrier: Remove unused code for architectures without membarrier hooks

2017-10-13 Thread Paul E. McKenney
From: Mathieu Desnoyers 

Architectures without membarrier hooks don't need to emit the
empty membarrier_arch_switch_mm() static inline when
CONFIG_MEMBARRIER=y.

Adapt the CONFIG_MEMBARRIER=n counterpart to only emit the empty
membarrier_arch_switch_mm() for architectures with membarrier hooks.

Reported-by: Nicholas Piggin 
Signed-off-by: Mathieu Desnoyers 
CC: Peter Zijlstra 
CC: Paul E. McKenney 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: gro...@google.com
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
Signed-off-by: Paul E. McKenney 
---
 include/linux/sched/mm.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e4955d293687..40379edac388 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -221,10 +221,6 @@ static inline void memalloc_noreclaim_restore(unsigned int 
flags)
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
 #include <asm/membarrier.h>
 #else
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-   struct mm_struct *next, struct task_struct *tsk)
-{
-}
 static inline void membarrier_arch_fork(struct task_struct *t,
unsigned long clone_flags)
 {
@@ -253,10 +249,12 @@ static inline void membarrier_execve(struct task_struct 
*t)
membarrier_arch_execve(t);
 }
 #else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
struct mm_struct *next, struct task_struct *tsk)
 {
 }
+#endif
 static inline void membarrier_fork(struct task_struct *t,
unsigned long clone_flags)
 {
-- 
2.5.2



Re: [PATCH] powerpc/eeh: make eeh_ops structures _ro_after_init

2017-10-13 Thread Bhumika Goyal
On Fri, Oct 13, 2017 at 6:08 PM, Julia Lawall  wrote:
>
>
> On Fri, 13 Oct 2017, Bhumika Goyal wrote:
>
>> These structures are passed to the eeh_ops_register function during the
>> initialization phase. There they get stored in a structure variable
>> which only makes function calls through function pointers. There is no
>> other usage of these eeh_ops structures and their fields are never
>> modified after init phase. So, make them __ro_after_init.
>
> I think they could be const.
>

Yes. I will send a patch for const.

> julia
>
>> Signed-off-by: Bhumika Goyal 
>> ---
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>>  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
>> b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> index 4650fb2..d2a53df 100644
>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> @@ -1731,7 +1731,7 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
>>   return 0;
>>  }
>>
>> -static struct eeh_ops pnv_eeh_ops = {
>> +static struct eeh_ops pnv_eeh_ops __ro_after_init = {
>>   .name   = "powernv",
>>   .init   = pnv_eeh_init,
>>   .probe  = pnv_eeh_probe,
>> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
>> b/arch/powerpc/platforms/pseries/eeh_pseries.c
>> index 6b812ad..6fedfc9 100644
>> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
>> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
>> @@ -684,7 +684,7 @@ static int pseries_eeh_write_config(struct pci_dn *pdn, 
>> int where, int size, u32
>>   return rtas_write_config(pdn, where, size, val);
>>  }
>>
>> -static struct eeh_ops pseries_eeh_ops = {
>> +static struct eeh_ops pseries_eeh_ops __ro_after_init = {
>>   .name   = "pseries",
>>   .init   = pseries_eeh_init,
>>   .probe  = pseries_eeh_probe,
>> --
>> 1.9.1
>>
>>
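
The const variant would look along these lines (a sketch; it assumes eeh_ops_register() and the variable it stores into are constified as well):

	static const struct eeh_ops pnv_eeh_ops = {
		.name	= "powernv",
		.init	= pnv_eeh_init,
		.probe	= pnv_eeh_probe,
		/* ... remaining callbacks unchanged ... */
	};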


Re: [PATCH] powerpc/eeh: make eeh_ops structures _ro_after_init

2017-10-13 Thread Julia Lawall


On Fri, 13 Oct 2017, Bhumika Goyal wrote:

> These structures are passed to the eeh_ops_register function during the
> initialization phase. There they get stored in a structure variable
> which only makes function calls through function pointers. There is no
> other usage of these eeh_ops structures and their fields are never
> modified after init phase. So, make them __ro_after_init.

I think they could be const.

julia

> Signed-off-by: Bhumika Goyal 
> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 4650fb2..d2a53df 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -1731,7 +1731,7 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
>   return 0;
>  }
>
> -static struct eeh_ops pnv_eeh_ops = {
> +static struct eeh_ops pnv_eeh_ops __ro_after_init = {
>   .name   = "powernv",
>   .init   = pnv_eeh_init,
>   .probe  = pnv_eeh_probe,
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
> b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index 6b812ad..6fedfc9 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -684,7 +684,7 @@ static int pseries_eeh_write_config(struct pci_dn *pdn, 
> int where, int size, u32
>   return rtas_write_config(pdn, where, size, val);
>  }
>
> -static struct eeh_ops pseries_eeh_ops = {
> +static struct eeh_ops pseries_eeh_ops __ro_after_init = {
>   .name   = "pseries",
>   .init   = pseries_eeh_init,
>   .probe  = pseries_eeh_probe,
> --
> 1.9.1
>
>


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
BTW, don't we need the same alignments inside the for_each_memblock() loop?

How about changing kasan_map_populate() to accept regular VA start and end
addresses, and convert them internally after aligning to PAGE_SIZE?

Thank you,
Pavel


On Fri, Oct 13, 2017 at 11:54 AM, Pavel Tatashin
 wrote:
>> Thanks for sharing the .config and tree. It looks like the problem is that
>> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
>> them up in kasan_map_populate, they remain unaligned when passed to
>> kasan_populate_zero_shadow, which confuses the loop termination conditions
>> in e.g. zero_pte_populate and the shadow isn't configured properly.
>
> This makes sense. Thank you. I will insert these changes into your
> patch, and send out a new series soon after sanity checking it.
>
> Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> Thanks for sharing the .config and tree. It looks like the problem is that
> kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
> them up in kasan_map_populate, they remain unaligned when passed to
> kasan_populate_zero_shadow, which confuses the loop termination conditions
> in e.g. zero_pte_populate and the shadow isn't configured properly.

This makes sense. Thank you. I will insert these changes into your
patch, and send out a new series soon after sanity checking it.

Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Will Deacon
Hi Pavel,

On Fri, Oct 13, 2017 at 11:09:41AM -0400, Pavel Tatashin wrote:
> > It shouldn't be difficult to use section mappings with my patch, I just
> > don't really see the need to try to optimise TLB pressure when you're
> > running with KASAN enabled which already has something like a 3x slowdown
> > afaik. If it ends up being a big deal, we can always do that later, but
> > my main aim here is to divorce kasan from vmemmap because they should be
> > completely unrelated.
> 
> Yes, I understand that kasan makes system slow, but my point is why
> make it even slower? However, I am OK adding your patch to the series,
> BTW, symmetric changes will be needed for x86 as well sometime later.
> 
> >
> > This certainly doesn't sound right; mapping the shadow with pages shouldn't
> > lead to problems. I also can't seem to reproduce this myself -- could you
> > share your full .config and a pointer to the git tree that you're using,
> > please?
> 
> Config is attached. I am using my patch series + your patch + today's
> clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Great, I hit the same problem with your .config. It might actually be
CONFIG_DEBUG_MEMORY_INIT which does it.

> Also, in a separate e-mail i sent out the qemu arguments.
> 
> >
> >> I feel, this patch requires more work, and I am troubled with using
> >> base pages instead of large pages.
> >
> > I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> > is the right thing to do here.
> 
> Thank you very much.

Thanks for sharing the .config and tree. It looks like the problem is that
kimg_shadow_start and kimg_shadow_end are not page-aligned. Whilst I fix
them up in kasan_map_populate, they remain unaligned when passed to
kasan_populate_zero_shadow, which confuses the loop termination conditions
in e.g. zero_pte_populate and the shadow isn't configured properly.

Fixup diff below; please merge in with my original patch.

Will

--->8

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index b922826d9908..207b1acb823a 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -146,7 +146,7 @@ asmlinkage void __init kasan_early_init(void)
 static void __init kasan_map_populate(unsigned long start, unsigned long end,
  int node)
 {
-   kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
+   kasan_pgd_populate(start, end, node, false);
 }
 
 /*
@@ -183,8 +183,8 @@ void __init kasan_init(void)
struct memblock_region *reg;
int i;
 
-   kimg_shadow_start = (u64)kasan_mem_to_shadow(_text);
-   kimg_shadow_end = (u64)kasan_mem_to_shadow(_end);
+   kimg_shadow_start = (u64)kasan_mem_to_shadow(_text) & PAGE_MASK;
+   kimg_shadow_end = PAGE_ALIGN((u64)kasan_mem_to_shadow(_end));
 
mod_shadow_start = (u64)kasan_mem_to_shadow((void *)MODULES_VADDR);
mod_shadow_end = (u64)kasan_mem_to_shadow((void *)MODULES_END);




Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
Here is a simplified qemu command:

qemu-system-aarch64 \
  -display none \
  -kernel ./arch/arm64/boot/Image  \
  -M virt -cpu cortex-a57 -s -S

In a separate terminal start arm64 cross debugger:

$ aarch64-unknown-linux-gnu-gdb ./vmlinux
...
Reading symbols from ./vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
0x4000 in ?? ()
(gdb) c
Continuing.
^C
(gdb) lx-dmesg
[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.14.0-rc4_pt_study-00136-gbed2c89768ba
(soleen@xakep) (gcc version 7.1.0 (crosstool-NG
crosstool-ng-1.23.0-90-g81327dd9)) #1 SMP PREEMPT Fri Oct 13 11:24:46
EDT 2017
... until the panic message is printed ...

Thank you,
Pavel


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> It shouldn't be difficult to use section mappings with my patch, I just
> don't really see the need to try to optimise TLB pressure when you're
> running with KASAN enabled which already has something like a 3x slowdown
> afaik. If it ends up being a big deal, we can always do that later, but
> my main aim here is to divorce kasan from vmemmap because they should be
> completely unrelated.

Yes, I understand that kasan makes system slow, but my point is why
make it even slower? However, I am OK adding your patch to the series,
BTW, symmetric changes will be needed for x86 as well sometime later.

>
> This certainly doesn't sound right; mapping the shadow with pages shouldn't
> lead to problems. I also can't seem to reproduce this myself -- could you
> share your full .config and a pointer to the git tree that you're using,
> please?

Config is attached. I am using my patch series + your patch + today's
clone from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Also, in a separate e-mail i sent out the qemu arguments.

>
>> I feel, this patch requires more work, and I am troubled with using
>> base pages instead of large pages.
>
> I'm happy to try fixing this, because I think splitting up kasan and vmemmap
> is the right thing to do here.

Thank you very much.

Pavel


config.gz
Description: GNU Zip compressed data


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
> Do you know what your physical memory layout looks like?

[0.00] Memory: 34960K/131072K available (16316K kernel code,
6716K rwdata, 7996K rodata, 1472K init, 8837K bss, 79728K reserved,
16384K cma-reserved)
[0.00] Virtual kernel memory layout:
[0.00] kasan   : 0x - 0x2000
( 32768 GB)
[0.00] modules : 0x2000 - 0x2800
(   128 MB)
[0.00] vmalloc : 0x2800 - 0x7dffbfff
( 96254 GB)
[0.00]   .text : 0x2808 - 0x2907
( 16320 KB)
[0.00] .rodata : 0x2907 - 0x2985
(  8064 KB)
[0.00]   .init : 0x2985 - 0x299c
(  1472 KB)
[0.00]   .data : 0x299c - 0x2a04f200
(  6717 KB)
[0.00].bss : 0x2a04f200 - 0x2a8f09e0
(  8838 KB)
[0.00] fixed   : 0x7dfffe7fd000 - 0x7dfffec0
(  4108 KB)
[0.00] PCI I/O : 0x7dfffee0 - 0x7de0
(16 MB)
[0.00] vmemmap : 0x7e00 - 0x8000
(  2048 GB maximum)
[0.00]   0x7e00 - 0x7e20
( 2 MB actual)
[0.00] memory  : 0x8000 - 0x8800
(   128 MB)

>
> Knowing that would tell us where shadow memory *should* be.
>
> Can you share the command line you're using to launch the VM?
>

virtme-run --kdir . --arch aarch64 --qemu-opts -s -S

and get messages from a connected gdb session via the lx-dmesg command.

The actual qemu arguments are these:

qemu-system-aarch64 -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-device,fsdev=virtfs1,mount_tag=/dev/root -fsdev
local,id=virtfs5,path=/usr/share/virtme-guest-0,security_model=none,readonly
-device virtio-9p-device,fsdev=virtfs5,mount_tag=virtme.guesttools -M
virt -cpu cortex-a57 -parallel none -net none -echr 1 -serial none
-chardev stdio,id=console,signal=off,mux=on -serial chardev:console
-mon chardev=console -vga none -display none -kernel
./arch/arm64/boot/Image -append 'earlyprintk=serial,ttyAMA0,115200
console=ttyAMA0 psmouse.proto=exps "virtme_stty_con=rows 57 cols 105
iutf8" TERM=screen-256color-bce rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro init=/bin/sh -- -c "mount -t tmpfs run /run;mkdir -p
/run/virtme/guesttools;/bin/mount -n -t 9p -o
ro,version=9p2000.L,trans=virtio,access=any virtme.guesttools
/run/virtme/guesttools;exec /run/virtme/guesttools/virtme-init"' -s -S


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Mark Rutland
Hi,

On Fri, Oct 13, 2017 at 03:43:19PM +0100, Will Deacon wrote:
> On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> > I am getting the following panic during boot:
> > 
> > [0.012637] pid_max: default: 32768 minimum: 301
> > [0.016037] Security Framework initialized
> > [0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 
> > bytes)
> > [0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> > [0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> > [0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 
> > bytes)
> > [0.055337] Unable to handle kernel paging request at virtual
> > address 0400010065af
> > [0.055422] Mem abort info:
> > [0.055518]   Exception class = DABT (current EL), IL = 32 bits
> > [0.055579]   SET = 0, FnV = 0
> > [0.055640]   EA = 0, S1PTW = 0
> > [0.055699] Data abort info:
> > [0.055762]   ISV = 0, ISS = 0x0007
> > [0.055822]   CM = 0, WnR = 0
> > [0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = 2a8f4000
> > [0.056047] [0400010065af] *pgd=46fe7003,
> > *pud=46fe6003, *pmd=46fe5003, *pte=
> > [0.056436] Internal error: Oops: 9607 [#1] PREEMPT SMP
> > [0.056701] Modules linked in:
> > [0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> > 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> > [0.057001] Hardware name: linux,dummy-virt (DT)
> > [0.057084] task: 299d9000 task.stack: 299c
> > [0.057275] PC is at __asan_load8+0x34/0xb0
> > [0.057375] LR is at __d_rehash+0xf0/0x240

Do you know what your physical memory layout looks like? 

Knowing that would tell us where shadow memory *should* be.

Can you share the command line you're using to launch the VM?

Thanks,
Mark.


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Will Deacon
Hi Pavel,

On Fri, Oct 13, 2017 at 10:10:09AM -0400, Pavel Tatashin wrote:
> I have a couple of concerns about your patch:
> 
> One of the reasons (and actually, the main reason) why I preferred to
> keep vmemmap_populate(), instead of implementing kasan's own variant
> (which, btw, can be done in common code similarly to
> vmemmap_populate_basepages()), is that vmemmap_populate() uses large
> pages when available. I think it is a considerable downgrade to go
> back to base pages when we already have large page support available
> to us.

It shouldn't be difficult to use section mappings with my patch, I just
don't really see the need to try to optimise TLB pressure when you're
running with KASAN enabled which already has something like a 3x slowdown
afaik. If it ends up being a big deal, we can always do that later, but
my main aim here is to divorce kasan from vmemmap because they should be
completely unrelated.

> The kasan shadow tree is large, up to 1/8th of system memory, so
> even on moderate size servers the shadow tree is going to be multiple
> gigabytes.
> 
> The second concern is that there is an existing bug associated with
> your patch that I am not sure how to solve:
> 
> Try building your patch with CONFIG_DEBUG_VM. This config makes
> memblock_virt_alloc_try_nid_raw() do memset(0xff) on all allocated
> memory.
> 
> I am getting the following panic during boot:
> 
> [0.012637] pid_max: default: 32768 minimum: 301
> [0.016037] Security Framework initialized
> [0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
> [0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
> [0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> [0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
> [0.055337] Unable to handle kernel paging request at virtual
> address 0400010065af
> [0.055422] Mem abort info:
> [0.055518]   Exception class = DABT (current EL), IL = 32 bits
> [0.055579]   SET = 0, FnV = 0
> [0.055640]   EA = 0, S1PTW = 0
> [0.055699] Data abort info:
> [0.055762]   ISV = 0, ISS = 0x0007
> [0.055822]   CM = 0, WnR = 0
> [0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = 2a8f4000
> [0.056047] [0400010065af] *pgd=46fe7003,
> *pud=46fe6003, *pmd=46fe5003, *pte=
> [0.056436] Internal error: Oops: 9607 [#1] PREEMPT SMP
> [0.056701] Modules linked in:
> [0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
> [0.057001] Hardware name: linux,dummy-virt (DT)
> [0.057084] task: 299d9000 task.stack: 299c
> [0.057275] PC is at __asan_load8+0x34/0xb0
> [0.057375] LR is at __d_rehash+0xf0/0x240

[...]

> So, I've been trying to root cause it, and here is what I've got:
> 
> First, I went back to my version of kasan_map_populate() and replaced
> vmemmap_populate() with vmemmap_populate_basepages(), which
> behavior-wise made it very similar to your patch. After doing this I
> got the same panic. So, I figured it must have something to do with
> the difference that regular vmemmap is allocated with a granularity of
> SWAPPER_BLOCK_SIZE while kasan's is allocated with PAGE_SIZE granularity.
> 
> So, I made the following modification to your patch:
> 
> static void __init kasan_map_populate(unsigned long start, unsigned long end,
>   int node)
> {
> +   start = round_down(start, SWAPPER_BLOCK_SIZE);
> +   end = round_up(end, SWAPPER_BLOCK_SIZE);
> kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
> }
> 
> This is basically makes shadow tree ranges to be SWAPPER_BLOCK_SIZE
> aligned. After, this modification everything is working.  However, I
> am not sure if this is a proper fix.

This certainly doesn't sound right; mapping the shadow with pages shouldn't
lead to problems. I also can't seem to reproduce this myself -- could you
share your full .config and a pointer to the git tree that you're using,
please?

> I feel, this patch requires more work, and I am troubled with using
> base pages instead of large pages.

I'm happy to try fixing this, because I think splitting up kasan and vmemmap
is the right thing to do here.

Will


Re: [PATCH v11 7/9] arm64/kasan: add and use kasan_map_populate()

2017-10-13 Thread Pavel Tatashin
Hi Will,

I have a couple of concerns about your patch:

One of the reasons (and actually, the main reason) why I preferred to
keep vmemmap_populate(), instead of implementing kasan's own variant
(which, btw, can be done in common code similarly to
vmemmap_populate_basepages()), is that vmemmap_populate() uses large
pages when available. I think it is a considerable downgrade to go
back to base pages when we already have large page support available
to us.

The kasan shadow tree is large, up to 1/8th of system memory, so
even on moderate size servers the shadow tree is going to be multiple
gigabytes.

The second concern is that there is an existing bug associated with
your patch that I am not sure how to solve:

Try building your patch with CONFIG_DEBUG_VM. This config makes
memblock_virt_alloc_try_nid_raw() do memset(0xff) on all allocated
memory.

I am getting the following panic during boot:

[0.012637] pid_max: default: 32768 minimum: 301
[0.016037] Security Framework initialized
[0.018389] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[0.019559] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[0.020409] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[0.020721] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[0.055337] Unable to handle kernel paging request at virtual
address 0400010065af
[0.055422] Mem abort info:
[0.055518]   Exception class = DABT (current EL), IL = 32 bits
[0.055579]   SET = 0, FnV = 0
[0.055640]   EA = 0, S1PTW = 0
[0.055699] Data abort info:
[0.055762]   ISV = 0, ISS = 0x0007
[0.055822]   CM = 0, WnR = 0
[0.055966] swapper pgtable: 4k pages, 48-bit VAs, pgd = 2a8f4000
[0.056047] [0400010065af] *pgd=46fe7003,
*pud=46fe6003, *pmd=46fe5003, *pte=
[0.056436] Internal error: Oops: 9607 [#1] PREEMPT SMP
[0.056701] Modules linked in:
[0.056939] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.14.0-rc4_pt_memset12-00096-gfca5985f860e-dirty #16
[0.057001] Hardware name: linux,dummy-virt (DT)
[0.057084] task: 299d9000 task.stack: 299c
[0.057275] PC is at __asan_load8+0x34/0xb0
[0.057375] LR is at __d_rehash+0xf0/0x240
[0.057460] pc : [] lr : []
pstate: 6045
[0.057522] sp : 299c6a60
[0.057590] x29: 299c6a60 x28: 299d9010
[0.057733] x27: 0004 x26: 28031000
[0.057846] x25: 299d9000 x24: 83c06410
---Type <return> to continue, or q <return> to quit---
[0.057954] x23: 03af x22: 83c06400
[0.058065] x21: 1fffe40001338d5a x20: 28032d78
[0.058175] x19: 83c06408 x18: 
[0.058311] x17: 0009 x16: 7fff
[0.058417] x15: 002a x14: 280ef374
[0.058528] x13: 28126648 x12: 28411a7c
[0.058638] x11: 28392358 x10: 28392184
[0.058770] x9 : 2835aad8 x8 : 29850e90
[0.058883] x7 : 2904b23c x6 : f2f2f200
[0.058990] x5 :  x4 : 28032d78
[0.059097] x3 :  x2 : dfff2000
[0.059206] x1 : 0007 x0 : 1fffe400010065af
[0.059372] Process swapper/0 (pid: 0, stack limit = 0x299c)
[0.059442] Call trace:
[0.059603] Exception stack(0x299c6920 to 0x299c6a60)
[0.059771] 6920: 1fffe400010065af 0007
dfff2000 
[0.059877] 6940: 28032d78 
f2f2f200 2904b23c
[0.059973] 6960: 29850e90 2835aad8
28392184 28392358
[0.060066] 6980: 28411a7c 28126648
280ef374 002a
[0.060154] 69a0: 7fff 0009
 83c06408
[0.060246] 69c0: 28032d78 1fffe40001338d5a
83c06400 03af
[0.060338] 69e0: 83c06410 299d9000
28031000 0004
[0.060432] 6a00: 299d9010 299c6a60
2837e168 299c6a60
[0.060525] 6a20: 28317d7c 6045
28392358 28411a7c
[0.060620] 6a40:  280ef374
299c6a60 28317d7c
[0.060762] [] __asan_load8+0x34/0xb0
[0.060856] [] __d_rehash+0xf0/0x240
[0.060944] [] d_add+0x288/0x3f0
[0.061041] [] proc_setup_self+0x110/0x198
[0.061139] [] proc_fill_super+0x13c/0x198
[0.061234] [] mount_ns+0x98/0x148
[0.061328] [] proc_mount+0x5c/0x70
[0.061422] [] mount_fs+0x50/0x1a8
[0.061515] [] vfs_kern_mount.part.7+0x9c/0x218
[0.061602] [] kern_mount_data+0x38/0x70
[0.061699] [] pid_ns_prepare_proc+0x24/0x50
[0.061796] [] alloc_pid+0x6e8/0x730
[0.061891] [] copy_process.isra.6.part.7+0x11cc/0x2cb8
[0.061978] [] _do_fork+0x14c/0x4c0
[0.062065] [] 

Re: [PATCH] powerpc/powernv: Enable reset_devices parameter to issue a PHB reset

2017-10-13 Thread Guilherme G. Piccoli
On 10/13/2017 05:37 AM, Michael Ellerman wrote:
> 
> I really dislike this.
> 
> You're basically saying the kernel can't work out how to get a device
> working, so let's leave it up to the user.

Oh, it was never my intention to say such blasphemy :)
It was meant to be just a debug option to help users, specifically the
ones debugging drivers, to try using a hammer to recover bad devices! For
the issue that I mentioned as an example, the fix specifically goes in
the FW of the adapter.

Anyway, since you really dislike it, let's drop it, no big deal!
Cheers,


Guilherme

> 
> The driver should be fixed to detect that the device is not responding
> and request a reset.
> 
> cheers
> 



[PATCH] powerpc/eeh: make eeh_ops structures _ro_after_init

2017-10-13 Thread Bhumika Goyal
These structures are passed to the eeh_ops_register function during the
initialization phase. There they get stored in a structure variable
which only makes function calls through function pointers. There is no
other usage of these eeh_ops structures and their fields are never
modified after init phase. So, make them __ro_after_init.

Signed-off-by: Bhumika Goyal 
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
 arch/powerpc/platforms/pseries/eeh_pseries.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 4650fb2..d2a53df 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1731,7 +1731,7 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
return 0;
 }
 
-static struct eeh_ops pnv_eeh_ops = {
+static struct eeh_ops pnv_eeh_ops __ro_after_init = {
.name   = "powernv",
.init   = pnv_eeh_init,
.probe  = pnv_eeh_probe,
diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 6b812ad..6fedfc9 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -684,7 +684,7 @@ static int pseries_eeh_write_config(struct pci_dn *pdn, int 
where, int size, u32
return rtas_write_config(pdn, where, size, val);
 }
 
-static struct eeh_ops pseries_eeh_ops = {
+static struct eeh_ops pseries_eeh_ops __ro_after_init = {
.name   = "pseries",
.init   = pseries_eeh_init,
.probe  = pseries_eeh_probe,
-- 
1.9.1



[PATCH 1/1] KVM: PPC: Book3S: Add MMIO emulation for VMX instructions

2017-10-13 Thread Jose Ricardo Ziviani
This patch provides the MMIO load/store vector indexed
X-Form emulation.

Instructions implemented: lvx, stvx

Signed-off-by: Jose Ricardo Ziviani 
---
 arch/powerpc/include/asm/kvm_host.h   |   2 +
 arch/powerpc/include/asm/kvm_ppc.h|   4 +
 arch/powerpc/include/asm/ppc-opcode.h |   6 ++
 arch/powerpc/kvm/emulate_loadstore.c  |  32 +++
 arch/powerpc/kvm/powerpc.c| 162 ++
 5 files changed, 189 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index e372ed871c51..a28922c4a2c7 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -692,6 +692,7 @@ struct kvm_vcpu_arch {
u8 mmio_vsx_offset;
u8 mmio_vsx_copy_type;
u8 mmio_vsx_tx_sx_enabled;
+   u8 mmio_vmx_copy_nums;
u8 osi_needed;
u8 osi_enabled;
u8 papr_enabled;
@@ -802,6 +803,7 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_QPR   0x0040
 #define KVM_MMIO_REG_FQPR  0x0060
 #define KVM_MMIO_REG_VSX   0x0080
+#define KVM_MMIO_REG_VMX   0x00a0
 
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index ba5fadd6f3c9..c444d1614b9c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -81,6 +81,10 @@ extern int kvmppc_handle_loads(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
 extern int kvmppc_handle_vsx_load(struct kvm_run *run, struct kvm_vcpu *vcpu,
unsigned int rt, unsigned int bytes,
int is_default_endian, int mmio_sign_extend);
+extern int kvmppc_handle_load128_by2x64(struct kvm_run *run,
+   struct kvm_vcpu *vcpu, unsigned int rt, int is_default_endian);
+extern int kvmppc_handle_store128_by2x64(struct kvm_run *run,
+   struct kvm_vcpu *vcpu, unsigned int rs, int is_default_endian);
 extern int kvmppc_handle_store(struct kvm_run *run, struct kvm_vcpu *vcpu,
   u64 val, unsigned int bytes,
   int is_default_endian);
diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index ce0930d68857..a51febca08c5 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -156,6 +156,12 @@
 #define OP_31_XOP_LFDX  599
 #define OP_31_XOP_LFDUX631
 
+/* VMX Vector Load Instructions */
+#define OP_31_XOP_LVX   103
+
+/* VMX Vector Store Instructions */
+#define OP_31_XOP_STVX  231
+
 #define OP_LWZ  32
 #define OP_STFS 52
 #define OP_STFSU 53
diff --git a/arch/powerpc/kvm/emulate_loadstore.c 
b/arch/powerpc/kvm/emulate_loadstore.c
index af833531af31..40fbc14809cb 100644
--- a/arch/powerpc/kvm/emulate_loadstore.c
+++ b/arch/powerpc/kvm/emulate_loadstore.c
@@ -58,6 +58,18 @@ static bool kvmppc_check_vsx_disabled(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_VSX */
 
+#ifdef CONFIG_ALTIVEC
+static bool kvmppc_check_altivec_disabled(struct kvm_vcpu *vcpu)
+{
+   if (!(kvmppc_get_msr(vcpu) & MSR_VEC)) {
+   kvmppc_core_queue_vec_unavail(vcpu);
+   return true;
+   }
+
+   return false;
+}
+#endif /* CONFIG_ALTIVEC */
+
 /*
  * XXX to do:
  * lfiwax, lfiwzx
@@ -98,6 +110,7 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
vcpu->arch.mmio_vsx_copy_type = KVMPPC_VSX_COPY_NONE;
vcpu->arch.mmio_sp64_extend = 0;
vcpu->arch.mmio_sign_extend = 0;
+   vcpu->arch.mmio_vmx_copy_nums = 0;
 
switch (get_op(inst)) {
case 31:
@@ -459,6 +472,25 @@ int kvmppc_emulate_loadstore(struct kvm_vcpu *vcpu)
 rs, 4, 1);
break;
 #endif /* CONFIG_VSX */
+
+#ifdef CONFIG_ALTIVEC
+   case OP_31_XOP_LVX:
+   if (kvmppc_check_altivec_disabled(vcpu))
+   return EMULATE_DONE;
+   vcpu->arch.mmio_vmx_copy_nums = 2;
+   emulated = kvmppc_handle_load128_by2x64(run, vcpu,
+   KVM_MMIO_REG_VMX|rt, 1);
+   break;
+
+   case OP_31_XOP_STVX:
+   if (kvmppc_check_altivec_disabled(vcpu))
+   return EMULATE_DONE;
+   vcpu->arch.mmio_vmx_copy_nums = 2;
+   emulated = kvmppc_handle_store128_by2x64(run, vcpu,
+   rs, 1);
+   break;
+#endif /* CONFIG_ALTIVEC */
+
default:
emulated = EMULATE_FAIL;
break;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3480faaf1ef8..6f3b49cd6634 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ 
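
The powerpc.c side (truncated above) pairs each 128-bit VMX access with two 64-bit MMIO operations, counted down via mmio_vmx_copy_nums. In sketch form (only the declarations above come from the patch; this body is an assumption):

	int kvmppc_handle_load128_by2x64(struct kvm_run *run, struct kvm_vcpu *vcpu,
					 unsigned int rt, int is_default_endian)
	{
		enum emulation_result emulated = EMULATE_DONE;

		while (vcpu->arch.mmio_vmx_copy_nums) {
			/* Emulate one 8-byte half of the vector per pass. */
			emulated = kvmppc_handle_load(run, vcpu, rt, 8,
						      is_default_endian);
			if (emulated != EMULATE_DONE)
				break;
			vcpu->arch.paddr_accessed += 8;
			vcpu->arch.mmio_vmx_copy_nums--;
		}
		return emulated;
	}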

[PATCH 0/1] powerpc: Implements MMIO emulation for lvx/stvx instructions

2017-10-13 Thread Jose Ricardo Ziviani
Hello!

This patch implements MMIO emulation for two instructions: lvx and stvx.

Thank you!

Jose Ricardo Ziviani (1):
  KVM: PPC: Book3S: Add MMIO emulation for VMX instructions

 arch/powerpc/include/asm/kvm_host.h   |   2 +
 arch/powerpc/include/asm/kvm_ppc.h|   4 +
 arch/powerpc/include/asm/ppc-opcode.h |   6 ++
 arch/powerpc/kvm/emulate_loadstore.c  |  32 +++
 arch/powerpc/kvm/powerpc.c| 162 ++
 5 files changed, 189 insertions(+), 17 deletions(-)

-- 
2.11.0



Re: [V2] powerpc/perf: Fix IMC initialization crash

2017-10-13 Thread Michael Ellerman
On Fri, 2017-10-13 at 05:59:41 UTC, Anju T Sudhakar wrote:
> Call trace observed with latest firmware, and upstream kernel.
> 
> [   14.499938] NIP [c00f318c] init_imc_pmu+0x8c/0xcf0
> [   14.499973] LR [c00f33f8] init_imc_pmu+0x2f8/0xcf0
> [   14.57] Call Trace:
> [   14.500027] [c03fed18f710] [c00f33c8] init_imc_pmu+0x2c8/0xcf0 
> (unreliable)
> [   14.500080] [c03fed18f800] [c00b5ec0] 
> opal_imc_counters_probe+0x300/0x400
> [   14.500132] [c03fed18f900] [c0807ef4] 
> platform_drv_probe+0x64/0x110
> [   14.500185] [c03fed18f980] [c0804b58] 
> driver_probe_device+0x3d8/0x580
> [   14.500236] [c03fed18fa10] [c0804e4c] 
> __driver_attach+0x14c/0x1a0
> [   14.500302] [c03fed18fa90] [c080156c] 
> bus_for_each_dev+0x8c/0xf0
> [   14.500353] [c03fed18fae0] [c0803fa4] driver_attach+0x34/0x50
> [   14.500397] [c03fed18fb00] [c0803688] 
> bus_add_driver+0x298/0x350
> [   14.500449] [c03fed18fb90] [c080605c] 
> driver_register+0x9c/0x180
> [   14.500500] [c03fed18fc00] [c0807dec] 
> __platform_driver_register+0x5c/0x70
> [   14.500552] [c03fed18fc20] [c101cee0] 
> opal_imc_driver_init+0x2c/0x40
> [   14.500603] [c03fed18fc40] [c000d084] 
> do_one_initcall+0x64/0x1d0
> [   14.500654] [c03fed18fd00] [c100434c] 
> kernel_init_freeable+0x280/0x374
> [   14.500705] [c03fed18fdc0] [c000d314] kernel_init+0x24/0x160
> [   14.500750] [c03fed18fe30] [c000b4e8] 
> ret_from_kernel_thread+0x5c/0x74
> [   14.500799] Instruction dump:
> [   14.500827] 4082024c 2f890002 419e054c 2e890003 41960094 2e890001 3ba0ffea 
> 419602d8 
> [   14.500884] 419e0290 2f890003 419e02a8 e93e0118  2fa3 
> 419e0010 4827ba41 
> [   14.500945] ---[ end trace 27b734ad26f1add4 ]---
> [   15.908719] 
> [   16.908869] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x0007
> [   16.908869] 
> [   18.125813] ---[ end Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x0007]
> 
> While registering nest imc at init, cpu-hotplug callback 
> `nest_pmu_cpumask_init()`
> makes an OPAL call to stop the engine. And if the OPAL call fails,
> imc_common_cpuhp_mem_free() is invoked to clean up memory and the cpuhotplug setup.
> 
> But when cleaning up the attribute group, we were dereferencing the attribute
> element array without checking whether the backing element is not NULL. This
> causes the kernel panic.
> 
> Add a check for the backing element prior to dereferencing the attribute 
> element,
> to handle the failing case gracefully.
> 
> Signed-off-by: Anju T Sudhakar 
> Reported-by: Pridhiviraj Paidipeddi 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/0d8ba16278ec30a262d931875018ab

cheers


Re: powerpc/perf: Add ___GFP_NOWARN flag to alloc_pages_node()

2017-10-13 Thread Michael Ellerman
On Wed, 2017-10-11 at 12:57:39 UTC, Anju T Sudhakar wrote:
> Stack trace output during a stress test:
>  [4.310049] Freeing initrd memory: 22592K
> [4.310646] rtas_flash: no firmware flash support
> [4.313341] cpuhp/64: page allocation failure: order:0, 
> mode:0x14480c0(GFP_KERNEL|__GFP_ZERO|__GFP_THISNODE), nodemask=(null)
> [4.313465] cpuhp/64 cpuset=/ mems_allowed=0
> [4.313521] CPU: 64 PID: 392 Comm: cpuhp/64 Not tainted 
> 4.11.0-39.el7a.ppc64le #1
> [4.313588] Call Trace:
> [4.313622] [c00f1fb1b8e0] [c0c09388] dump_stack+0xb0/0xf0 
> (unreliable)
> [4.313694] [c00f1fb1b920] [c030ef6c] warn_alloc+0x12c/0x1c0
> [4.313753] [c00f1fb1b9c0] [c030ff68] 
> __alloc_pages_nodemask+0xea8/0x1000
> [4.313823] [c00f1fb1bbb0] [c0113a8c] 
> core_imc_mem_init+0xbc/0x1c0
> [4.313892] [c00f1fb1bc00] [c0113cdc] 
> ppc_core_imc_cpu_online+0x14c/0x170
> [4.313962] [c00f1fb1bc90] [c0125758] 
> cpuhp_invoke_callback+0x198/0x5d0
> [4.314031] [c00f1fb1bd00] [c012782c] 
> cpuhp_thread_fun+0x8c/0x3d0
> [4.314101] [c00f1fb1bd60] [c01678d0] 
> smpboot_thread_fn+0x290/0x2a0
> [4.314169] [c00f1fb1bdc0] [c015ee78] kthread+0x168/0x1b0
> [4.314229] [c00f1fb1be30] [c000b368] 
> ret_from_kernel_thread+0x5c/0x74
> [4.314313] Mem-Info:
> [4.314356] active_anon:0 inactive_anon:0 isolated_anon:0
> 
> core_imc_mem_init() uses alloc_pages_node() at system boot to get memory,
> and alloc_pages_node() throws this stack dump when it tries to allocate
> memory from a node which has no memory behind it. Add a ___GFP_NOWARN
> flag to the allocation request as a fix.
> 
> Signed-off-by: Anju T Sudhakar 
> Reported-by: Michael Ellerman 
> Reported-by: Venkat R.B 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/cd4f2b30e5ef7d4bde61eb515372d9

cheers
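
For reference, the shape of the fix is just the extra GFP flag on the
per-node allocation. A minimal sketch, assuming an illustrative helper
rather than the exact imc-pmu.c code:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * __GFP_NOWARN suppresses the "page allocation failure" splat when
     * __GFP_THISNODE cannot be satisfied on a memoryless node; the
     * caller still has to handle the NULL return itself.
     */
    static void *imc_alloc_counter_mem(int nid, size_t size)
    {
        struct page *page;

        page = alloc_pages_node(nid,
                                GFP_KERNEL | __GFP_ZERO |
                                __GFP_THISNODE | __GFP_NOWARN,
                                get_order(size));
        if (!page)
            return NULL;

        return page_address(page);
    }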


Re: powerpc/perf: Fix for core/nest imc call trace on cpuhotplug

2017-10-13 Thread Michael Ellerman
On Wed, 2017-10-04 at 06:50:52 UTC, Anju T Sudhakar wrote:
> Nest/core pmu units are enabled only when they are used. A reference
> count is maintained for the events which use the nest/core pmu units.
> Currently the *_imc_counters_release functions use a WARN() to flag any
> underflow of the ref count.
>
> The event ref count hits a negative value when a perf session is started
> and then all cpus in a given core are offlined: in the cpuhotplug offline
> path, ppc_core_imc_cpu_offline() sets ref->count to zero if the cpu about
> to go offline is the last one in its core, and makes an OPAL call to
> disable the engine in that core. On perf session termination,
> perf->destroy (core_imc_counters_release) then decrements ref->count for
> this core again and, based on the value, makes an OPAL call to disable
> the core-imc engine.
>
> Since the cpuhotplug path has already cleared ref->count for the core and
> disabled the engine, the second decrement at event termination makes the
> count negative, which in turn fires the WARN_ON. The same happens for the
> nest units.
>
> Add a check to see if the reference count is already zero before
> decrementing it, so that the ref count never hits a negative value.
>
> Signed-off-by: Anju T Sudhakar 
> Reviewed-by: Santosh Sivaraj 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/0d923820c6db1644c27c2d0a5af892

cheers
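
The shape of that fix, as a minimal self-contained sketch (the struct is an
illustrative stand-in for the per-core/per-nest reference, not the exact
imc-pmu.c types):

    #include <linux/kernel.h>
    #include <linux/mutex.h>

    struct imc_ref {            /* illustrative stand-in */
        struct mutex lock;
        int refc;
    };

    /*
     * Release path: if the hotplug-offline path already dropped the
     * count to zero and disabled the engine, there is nothing left to
     * do; decrementing again would go negative and fire the WARN.
     */
    static void imc_counters_release(struct imc_ref *ref)
    {
        mutex_lock(&ref->lock);
        if (ref->refc == 0) {
            mutex_unlock(&ref->lock);
            return;
        }
        ref->refc--;
        if (ref->refc == 0) {
            /* last user gone: OPAL call to stop the engine goes here */
        }
        WARN_ON(ref->refc < 0);
        mutex_unlock(&ref->lock);
    }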


[GIT PULL] Please pull powerpc/linux.git powerpc-4.14-5 tag

2017-10-13 Thread Michael Ellerman
Hi Linus,

Please pull a few more powerpc fixes for 4.14:

The following changes since commit 53ecde0b9126ff140abe3aefd7f0ec64d6fa36b0:

  powerpc/powernv: Increase memory block size to 1GB on radix (2017-10-06 
15:50:45 +1100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.14-5

for you to fetch changes up to 0d8ba16278ec30a262d931875018abee332f926f:

  powerpc/perf: Fix IMC initialization crash (2017-10-13 20:08:40 +1100)


powerpc fixes for 4.14 #5

A fix for a bad bug (written by me) in our livepatch handler. Removal of an
over-zealous lockdep_assert_cpus_held() in our topology code. A fix to the
recently added emulation of cntlz[wd]. And three small fixes to the recently
added IMC PMU driver.

Thanks to:
  Anju T Sudhakar, Balbir Singh, Kamalesh Babulal, Naveen N. Rao, Sandipan Das,
  Santosh Sivaraj, Thiago Jung Bauermann.


Anju T Sudhakar (3):
  powerpc/perf: Fix for core/nest imc call trace on cpuhotplug
  powerpc/perf: Add ___GFP_NOWARN flag to alloc_pages_node()
  powerpc/perf: Fix IMC initialization crash

Kamalesh Babulal (1):
  powerpc/livepatch: Fix livepatch stack access

Sandipan Das (1):
  powerpc/lib/sstep: Fix count leading zeros instructions

Thiago Jung Bauermann (1):
  powerpc: Don't call lockdep_assert_cpus_held() from 
arch_update_cpu_topology()

 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 45 +-
 arch/powerpc/lib/sstep.c   |  6 ++--
 arch/powerpc/mm/numa.c |  1 -
 arch/powerpc/perf/imc-pmu.c| 39 +++---
 4 files changed, 53 insertions(+), 38 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH v3 2/2] pseries/eeh: Add Pseries pcibios_bus_add_device

2017-10-13 Thread Steven Royer

On 2017-10-13 06:53, Steven Royer wrote:

On 2017-10-12 22:34, Bjorn Helgaas wrote:

[+cc Alex, Bodong, Eli, Saeed]

On Thu, Oct 12, 2017 at 02:59:23PM -0500, Bryant G. Ly wrote:

On 10/12/17 1:29 PM, Bjorn Helgaas wrote:
>On Thu, Oct 12, 2017 at 03:09:53PM +1100, Michael Ellerman wrote:
>>Bjorn Helgaas  writes:
>>
>>>On Fri, Sep 22, 2017 at 09:19:28AM -0500, Bryant G. Ly wrote:
This patch adds the machine dependent call for
pcibios_bus_add_device, since the previous patch
separated the calls out between the PowerNV and PowerVM.

The difference here is that for the PowerVM environment
we do not want match_driver set because in this environment
we do not want the VF device drivers to load immediately, due to
firmware loading the device node when the VF device is assigned to the
logical partition.

This patch will depend on the patch linked below, which is under
review.

https://patchwork.kernel.org/patch/9882915/

Signed-off-by: Bryant G. Ly 
Signed-off-by: Juan J. Alvarez 
---
  arch/powerpc/platforms/pseries/eeh_pseries.c | 24 
  1 file changed, 24 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c 
b/arch/powerpc/platforms/pseries/eeh_pseries.c
index 6b812ad990e4..45946ee90985 100644
--- a/arch/powerpc/platforms/pseries/eeh_pseries.c
+++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
@@ -64,6 +64,27 @@ static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX];
  static DEFINE_SPINLOCK(slot_errbuf_lock);
  static int eeh_error_buf_size;
+void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
+{
+   struct pci_dn *pdn = pci_get_pdn(pdev);
+
+   if (!pdev->is_virtfn)
+   return;
+
+   pdn->device_id  =  pdev->device;
+   pdn->vendor_id  =  pdev->vendor;
+   pdn->class_code =  pdev->class;
+
+   /*
+* The following operations will fail if VF's sysfs files
+* aren't created or its resources aren't finalized.
+*/
+   eeh_add_device_early(pdn);
+   eeh_add_device_late(pdev);
+   eeh_sysfs_add_device(pdev);
+   pdev->match_driver = -1;
>>>match_driver is a bool, which should be assigned "true" or "false".
>>Above he mentioned a dependency on:
>>
>>   [04/10] PCI: extend pci device match_driver state
>>   https://patchwork.kernel.org/patch/9882915/
>>
>>
>>Which makes it an int.
>Oh, right, I missed that, thanks.
>
>>Or has that patch been rejected or something?
>I haven't *rejected* it, but it's low on my priority list, so you
>shouldn't depend on it unless it adds functionality you really need.
>If I did apply that particular patch, I would want some rework because
>it currently obfuscates the match_driver logic.  There's no clue when
>reading the code what -1/0/1 mean.
So do you prefer enums? If so, I can make a change for that.
>Apparently here you *do* want the "-1 means the PCI core will never
>set match_driver to 1" functionality, so maybe you do depend on it.
We depend on the patch because we want that ability to never set
match_driver,
for SRIOV on PowerVM.


Is this really new PowerVM-specific functionality?  ISTR recent 
discussions

about inhibiting driver binding in a generic way, e.g.,
http://lkml.kernel.org/r/1490022874-54718-1-git-send-email-bod...@mellanox.com


>If that's the case, how do you ever bind a driver to these VFs?  The
>changelog says you don't want VF drivers to load *immediately*, so I
>assume you do want them to load eventually.
>
The VFs that get dynamically created within the configure SR-IOV
call, on the Pseries platform, won't be matched with a driver - we
do not want them to match.

The Power Hypervisor will load the VFs. The VFs will get
assigned (by the user) via the HMC or Novalink in this environment,
which will then trigger PHYP to load the VF device node into the
device tree.


I don't know what it means for the Hypervisor to "load the VFs."  Can
you explain that in PCI-speak?

The things I know about are:

  - we set PCI_SRIOV_CTRL_VFE in the PF, which enables VFs
  - now the VFs respond to config accesses
  - the PCI core enumerates the VFs by reading their config space
  - the PCI core builds pci_dev structs for the VFs
  - the PCI core adds these pci_devs to the bus
  - we try to bind drivers to the VFs
  - the VF driver probe function may read VF config space and VF BARs
  - the VF may be assigned to a guest VM

Where does "loading the VFs" fit in?  I don't know what HMC, Novalink,
or PHYP are.  I don't *need* to know what they are, as long as you can
explain what's happening in terms of the PCI concepts and generic 
Linux VMs

and device assignment.

Bjorn


The VFs will be hotplugged into the VM separately from the enable
SR-IOV, so the driver will load as part of the hotplug operation.

Steve
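
To put the sequence Bjorn lists above in code terms: a PF driver kicks it
all off with pci_enable_sriov(). A minimal sketch (the wrapper name is
invented for illustration):

    #include <linux/pci.h>

    /*
     * pci_enable_sriov() sets PCI_SRIOV_CTRL_VFE in the PF, after which
     * the VFs respond to config accesses, the core builds pci_dev
     * structs for them, and driver matching is attempted, unless it is
     * inhibited as proposed above for PowerVM.
     */
    static int example_pf_enable_vfs(struct pci_dev *pf, int num_vfs)
    {
        return pci_enable_sriov(pf, num_vfs);
    }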




Re: [PATCH] powerpc/powernv: Enable reset_devices parameter to issue a PHB reset

2017-10-13 Thread Michael Ellerman
"Guilherme G. Piccoli"  writes:

> During a kdump kernel boot on PowerPC, we ask the FW to reset the PHBs.
> This makes sense: if we are booting a kdump kernel, we had some trouble
> before and cannot rely on the adapters' health; they could be in a bad
> state, hence the reset is needed.
>
> But a kdump kernel is not the only place this reset is useful - there
> are situations, especially when debugging drivers, where we can break an
> adapter in a way that requires such a reset. One could just go ahead and
> reboot the machine, but kexec is often much faster, and so preferable to
> a full power cycle. Also, we can have situations in which adapters are
> in a bad state due to an adapter FW issue, and only a PHB Fundamental
> Reset can revive them.
>
> This patch enables the reset_devices parameter to perform such a reset.
> The parameter is barely used - only a few drivers make use of it.
> This is a PowerPC-only change.
>
> Signed-off-by: Guilherme G. Piccoli 
> ---
> This patch was built/tested against powerpc/next branch.
>
> We recently had a situation in which the i40e driver couldn't start,
> even after a full power cycle, due to a bug in its FW triggered
> by a DCB condition in the switch (thanks Mauro for narrowing this down).
> This patch enabled us to revive the adapter and use the network
> while debugging.

I really dislike this.

You're basically saying the kernel can't work out how to get a device
working, so let's leave it up to the user.

The driver should be fixed to detect that the device is not responding
and request a reset.

cheers
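
For context, the proposal being debated is roughly this shape - widening
the existing kdump-only PHB reset so it also honours reset_devices (the
helper name is invented for the sketch):

    #include <linux/crash_dump.h>   /* is_kdump_kernel() */
    #include <linux/reboot.h>       /* reset_devices */

    /*
     * Previously only the kdump path asked firmware for a PHB
     * fundamental reset; the patch extends the gating condition to
     * the reset_devices kernel parameter as well.
     */
    static bool pnv_want_phb_reset(void)
    {
        return is_kdump_kernel() || reset_devices;
    }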


Re: KASan for powerpc

2017-10-13 Thread KHUSHAL GUMGAONKAR
Hi Balbir, sorry for not mentioning details. Below are the details:

1. What machine you tried this on?
   - I am using a PowerPC e500mc processor and it's customised; I can't
     share many details.
2. What MMU mode?
   - Not sure about this.
3. What kernel?
   - Kernel version 4.1.35.
4. What gcc version, flags for KASAN?
   - gcc version 4.9.2 and the KASAN_MINIMAL flag.
5. What was the KASAN design - how much shadow memory to real memory?
   - We have a contiguous 2 GB of low memory and the shadow size is 256 MB.
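
Those numbers are consistent with generic KASAN's 8-to-1 scaling, where
every 8 bytes of memory get 1 shadow byte. A quick standalone sanity check
(the shadow offset below is purely illustrative):

    #include <stdio.h>

    #define KASAN_SHADOW_SCALE_SHIFT 3            /* 8 bytes -> 1 shadow byte */
    #define KASAN_SHADOW_OFFSET      0xe0000000UL /* illustrative value only */

    /* Generic KASAN: shadow = (addr >> scale_shift) + offset */
    static unsigned long kasan_mem_to_shadow(unsigned long addr)
    {
        return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
    }

    int main(void)
    {
        unsigned long lowmem = 2UL << 30;   /* 2 GB of low memory */

        /* 2 GB / 8 = 256 MB, matching the shadow size quoted above */
        printf("shadow size: %lu MB\n",
               (lowmem >> KASAN_SHADOW_SCALE_SHIFT) >> 20);
        printf("shadow(0xc0000000) = 0x%lx\n",
               kasan_mem_to_shadow(0xc0000000UL));
        return 0;
    }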


THANKS AND REGARDS,
KHUSHAL K. GUMGAONKAR
M.TECH IN VISUAL INFORMATION AND EMBEDDED SYSTEMS, IIT KHARAGPUR

On Friday, 13 October 2017 4:45 AM, Balbir Singh  
wrote:
 

 On Tue, Oct 10, 2017 at 11:51 PM, KHUSHAL GUMGAONKAR
 wrote:
>  Hi Aneesh,
>
> I am facing an unknown error in printk during boot, and after that the
> kernel hangs on powerpc with the KASan changes.
>

This is probably the worst way to reach out, the reason being:

You describe the crash you're seeing, but there is no description of:

1. What machine you tried this on?
2. What MMU mode?
3. What kernel?
4. What gcc version, flags for KASAN
5. What was the KASAN design - how much shadow memory to real memory?

Unless you can work out those details, this is just noise on the list.
I posted an RFC for radix -
http://linuxppc.10917.n7.nabble.com/RFC-PATCH-v1-powerpc-radix-kasan-KASAN-support-for-Radix-td125864.html

It needs work, please read the implementations so far, discuss your
design/understanding. Then we can discuss what issues you are running
into, otherwise it's just noise without effort.

Balbir Singh.


   

Re: [PATCH 1/2] vgaarb: Select a default VGA device even if there's no legacy VGA

2017-10-13 Thread Julien Thierry



On 12/10/17 13:05, Lothar Waßmann wrote:

Hi,

On Thu, 12 Oct 2017 12:24:10 +0100 Julien Thierry wrote:

Hi Bjorn,

On 06/10/17 23:24, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Daniel Axtens reported that on the HiSilicon D05 board, the VGA device is
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so the VGA arbiter
never selects it as the default, which means Xorg auto-detection doesn't
work.

VGA is a legacy PCI feature: a VGA device can respond to addresses, e.g.,
[mem 0xa-0xb], [io 0x3b0-0x3bb], [io 0x3c0-0x3df], etc., that are
not configurable by BARs.  Consequently, multiple VGA devices can conflict
with each other.  The VGA arbiter avoids conflicts by ensuring that those
legacy resources are only routed to one VGA device at a time.

The arbiter identifies the "default VGA" device, i.e., a legacy VGA device
that was used by boot firmware.  It selects the first device that:

- is of PCI_CLASS_DISPLAY_VGA,
- has both PCI_COMMAND_IO and PCI_COMMAND_MEMORY enabled, and
- has PCI_BRIDGE_CTL_VGA set in all upstream bridges.

Some systems don't have such a device.  For example, if a host bridge
doesn't support I/O space, PCI_COMMAND_IO probably won't be enabled for any
devices below it.  Or, as on the HiSilicon D05, the VGA device may be
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so accesses to the
legacy VGA resources will never reach the device.

This patch extends the arbiter so that if it doesn't find a device that
meets all the above criteria, it selects the first device that:

- is of PCI_CLASS_DISPLAY_VGA and
- has PCI_COMMAND_IO or PCI_COMMAND_MEMORY enabled

If it doesn't find even that, it selects the first device that:

- is of class PCI_CLASS_DISPLAY_VGA.

Such a device may not be able to use the legacy VGA resources, but most
drivers can operate the device without those.  Setting it as the default
device means its "boot_vga" sysfs file will contain "1", which Xorg (via
libpciaccess) uses to help select its default output device.

This fixes Xorg auto-detection on some arm64 systems (HiSilicon D05 in
particular; see the link below).

It also replaces the powerpc fixup_vga() quirk, albeit with slightly
different semantics: the quirk selected the first VGA device we found, and
overrode that selection with any enabled VGA device we found.  If there
were several enabled VGA devices, the *last* one we found would become the
default.

The code here instead selects the *first* enabled VGA device we find, and
if none are enabled, the first VGA device we find.

Link: http://lkml.kernel.org/r/20170901072744.2409-1-...@axtens.net
Tested-by: Daniel Axtens # arm64, ppc64-qemu-tcg
Signed-off-by: Bjorn Helgaas 
---
   arch/powerpc/kernel/pci-common.c |   12 
   drivers/gpu/vga/vgaarb.c |   25 +
   2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 02831a396419..0ac7aa346c69 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1740,15 +1740,3 @@ static void fixup_hide_host_resource_fsl(struct pci_dev 
*dev)
   }
   DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MOTOROLA, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
   DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
-
-static void fixup_vga(struct pci_dev *pdev)
-{
-   u16 cmd;
-
-   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
-   if ((cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) || 
!vga_default_device())
-   vga_set_default_device(pdev);
-
-}
-DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_ANY_ID, PCI_ANY_ID,
- PCI_CLASS_DISPLAY_VGA, 8, fixup_vga);
diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 76875f6299b8..aeb41f793ed4 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -1468,6 +1468,31 @@ static int __init vga_arb_device_init(void)
vgaarb_info(dev, "no bridge control possible\n");
}
   
+	if (!vga_default_device()) {

+   list_for_each_entry(vgadev, &vga_list, list) {
+   struct device *dev = &vgadev->pdev->dev;
+   u16 cmd;
+
+   pdev = vgadev->pdev;
+   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+   if (cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+   vgaarb_info(dev, "setting as boot device (VGA legacy 
resources not available)\n");
+   vga_set_default_device(pdev);
+   break;
+   }
+   }
+   }
+
+   if (!vga_default_device()) {
+   vgadev = list_first_entry_or_null(&vga_list,
+ struct vga_device, list);
+   if (vgadev) {
+ 

Re: [PATCH 1/2] vgaarb: Select a default VGA device even if there's no legacy VGA

2017-10-13 Thread Julien Thierry

Hi Bjorn,

On 06/10/17 23:24, Bjorn Helgaas wrote:

From: Bjorn Helgaas 

Daniel Axtens reported that on the HiSilicon D05 board, the VGA device is
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so the VGA arbiter
never selects it as the default, which means Xorg auto-detection doesn't
work.

VGA is a legacy PCI feature: a VGA device can respond to addresses, e.g.,
[mem 0xa-0xb], [io 0x3b0-0x3bb], [io 0x3c0-0x3df], etc., that are
not configurable by BARs.  Consequently, multiple VGA devices can conflict
with each other.  The VGA arbiter avoids conflicts by ensuring that those
legacy resources are only routed to one VGA device at a time.

The arbiter identifies the "default VGA" device, i.e., a legacy VGA device
that was used by boot firmware.  It selects the first device that:

   - is of PCI_CLASS_DISPLAY_VGA,
   - has both PCI_COMMAND_IO and PCI_COMMAND_MEMORY enabled, and
   - has PCI_BRIDGE_CTL_VGA set in all upstream bridges.

Some systems don't have such a device.  For example, if a host bridge
doesn't support I/O space, PCI_COMMAND_IO probably won't be enabled for any
devices below it.  Or, as on the HiSilicon D05, the VGA device may be
behind a bridge that doesn't support PCI_BRIDGE_CTL_VGA, so accesses to the
legacy VGA resources will never reach the device.

This patch extends the arbiter so that if it doesn't find a device that
meets all the above criteria, it selects the first device that:

   - is of PCI_CLASS_DISPLAY_VGA and
   - has PCI_COMMAND_IO or PCI_COMMAND_MEMORY enabled

If it doesn't find even that, it selects the first device that:

   - is of class PCI_CLASS_DISPLAY_VGA.

Such a device may not be able to use the legacy VGA resources, but most
drivers can operate the device without those.  Setting it as the default
device means its "boot_vga" sysfs file will contain "1", which Xorg (via
libpciaccess) uses to help select its default output device.

This fixes Xorg auto-detection on some arm64 systems (HiSilicon D05 in
particular; see the link below).

It also replaces the powerpc fixup_vga() quirk, albeit with slightly
different semantics: the quirk selected the first VGA device we found, and
overrode that selection with any enabled VGA device we found.  If there
were several enabled VGA devices, the *last* one we found would become the
default.

The code here instead selects the *first* enabled VGA device we find, and
if none are enabled, the first VGA device we find.

Link: http://lkml.kernel.org/r/20170901072744.2409-1-...@axtens.net
Tested-by: Daniel Axtens # arm64, ppc64-qemu-tcg
Signed-off-by: Bjorn Helgaas 
---
  arch/powerpc/kernel/pci-common.c |   12 
  drivers/gpu/vga/vgaarb.c |   25 +
  2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 02831a396419..0ac7aa346c69 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1740,15 +1740,3 @@ static void fixup_hide_host_resource_fsl(struct pci_dev 
*dev)
  }
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MOTOROLA, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_FREESCALE, PCI_ANY_ID, 
fixup_hide_host_resource_fsl);
-
-static void fixup_vga(struct pci_dev *pdev)
-{
-   u16 cmd;
-
-   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
-   if ((cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) || 
!vga_default_device())
-   vga_set_default_device(pdev);
-
-}
-DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_ANY_ID, PCI_ANY_ID,
- PCI_CLASS_DISPLAY_VGA, 8, fixup_vga);
diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 76875f6299b8..aeb41f793ed4 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -1468,6 +1468,31 @@ static int __init vga_arb_device_init(void)
vgaarb_info(dev, "no bridge control possible\n");
}
  
+	if (!vga_default_device()) {

+   list_for_each_entry(vgadev, &vga_list, list) {
+   struct device *dev = &vgadev->pdev->dev;
+   u16 cmd;
+
+   pdev = vgadev->pdev;
+   pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+   if (cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+   vgaarb_info(dev, "setting as boot device (VGA legacy 
resources not available)\n");
+   vga_set_default_device(pdev);
+   break;
+   }
+   }
+   }
+
+   if (!vga_default_device()) {
+   vgadev = list_first_entry_or_null(&vga_list,
+ struct vga_device, list);
+   if (vgadev) {
+   struct device *dev = &vgadev->pdev->dev;
+   vgaarb_info(dev, "setting as boot device (VGA 

Re: [PATCH v2] KVM: PPC: Book3S PR: only install valid SLBs during KVM_SET_SREGS

2017-10-13 Thread Greg Kurz
Ping?

On Mon, 02 Oct 2017 10:40:22 +0200
Greg Kurz  wrote:

> Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
> some of which are valid (ie, SLB_ESID_V is set) and the rest are
> likely all-zeroes (with QEMU at least).
> 
> Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
> expects to find the SLB index in the 3 lower bits of its rb argument.
> When passed zeroed arguments, it happily overwrites the 0th SLB entry
> with zeroes. This is exactly what happens while doing live migration
> with QEMU when the destination pushes the incoming SLB descriptors to
> KVM PR. When reloading the SLBs at the next synchronization, QEMU first
> clears its SLB array and only restores valid ones, but the 0th one is
> now gone and we cannot access the corresponding memory anymore:
> 
> (qemu) x/x $pc
> c00b742c: Cannot access memory
> 
> To avoid this, let's filter out non-valid SLB entries. While here, we
> also force a full SLB flush before installing new entries.
> 
> Signed-off-by: Greg Kurz 
> ---
> v2: - flush SLB before installing new entries
> ---
>  arch/powerpc/kvm/book3s_pr.c |   10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 3beb4ff469d1..7cce08d610ae 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -1327,9 +1327,15 @@ static int kvm_arch_vcpu_ioctl_set_sregs_pr(struct 
> kvm_vcpu *vcpu,
>  
>   vcpu3s->sdr1 = sregs->u.s.sdr1;
>   if (vcpu->arch.hflags & BOOK3S_HFLAG_SLB) {
> + /* Flush all SLB entries */
> + vcpu->arch.mmu.slbmte(vcpu, 0, 0);
> + vcpu->arch.mmu.slbia(vcpu);
> +
>   for (i = 0; i < 64; i++) {
> - vcpu->arch.mmu.slbmte(vcpu, 
> sregs->u.s.ppc64.slb[i].slbv,
> - 
> sregs->u.s.ppc64.slb[i].slbe);
> + u64 rb = sregs->u.s.ppc64.slb[i].slbe;
> + u64 rs = sregs->u.s.ppc64.slb[i].slbv;
> + if (rb & SLB_ESID_V)
> + vcpu->arch.mmu.slbmte(vcpu, rs, rb);
>   }
>   } else {
>   for (i = 0; i < 16; i++) {
> 
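
For illustration, the failure mode follows from how the slbmte path
decodes its rb operand; a minimal sketch of the valid-bit filter (the
decoding comment paraphrases the commit message, and SLB_ESID_V is the
usual powerpc definition):

    #define SLB_ESID_V  0x0000000008000000UL    /* ESID valid bit */

    /*
     * An all-zero descriptor has SLB_ESID_V clear and its low bits
     * decode to index 0, so blindly installing it wipes SLB slot 0.
     * Filtering on SLB_ESID_V avoids touching any slot for an entry
     * userland never meant to install.
     */
    static int slb_entry_is_valid(unsigned long rb)
    {
        return (rb & SLB_ESID_V) != 0;
    }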



[PATCH V2] powerpc/perf: Fix IMC initialization crash

2017-10-13 Thread Anju T Sudhakar
Call trace observed with the latest firmware and an upstream kernel.

[   14.499938] NIP [c00f318c] init_imc_pmu+0x8c/0xcf0
[   14.499973] LR [c00f33f8] init_imc_pmu+0x2f8/0xcf0
[   14.57] Call Trace:
[   14.500027] [c03fed18f710] [c00f33c8] init_imc_pmu+0x2c8/0xcf0 
(unreliable)
[   14.500080] [c03fed18f800] [c00b5ec0] 
opal_imc_counters_probe+0x300/0x400
[   14.500132] [c03fed18f900] [c0807ef4] 
platform_drv_probe+0x64/0x110
[   14.500185] [c03fed18f980] [c0804b58] 
driver_probe_device+0x3d8/0x580
[   14.500236] [c03fed18fa10] [c0804e4c] __driver_attach+0x14c/0x1a0
[   14.500302] [c03fed18fa90] [c080156c] bus_for_each_dev+0x8c/0xf0
[   14.500353] [c03fed18fae0] [c0803fa4] driver_attach+0x34/0x50
[   14.500397] [c03fed18fb00] [c0803688] bus_add_driver+0x298/0x350
[   14.500449] [c03fed18fb90] [c080605c] driver_register+0x9c/0x180
[   14.500500] [c03fed18fc00] [c0807dec] 
__platform_driver_register+0x5c/0x70
[   14.500552] [c03fed18fc20] [c101cee0] 
opal_imc_driver_init+0x2c/0x40
[   14.500603] [c03fed18fc40] [c000d084] do_one_initcall+0x64/0x1d0
[   14.500654] [c03fed18fd00] [c100434c] 
kernel_init_freeable+0x280/0x374
[   14.500705] [c03fed18fdc0] [c000d314] kernel_init+0x24/0x160
[   14.500750] [c03fed18fe30] [c000b4e8] 
ret_from_kernel_thread+0x5c/0x74
[   14.500799] Instruction dump:
[   14.500827] 4082024c 2f890002 419e054c 2e890003 41960094 2e890001 3ba0ffea 
419602d8 
[   14.500884] 419e0290 2f890003 419e02a8 e93e0118  2fa3 419e0010 
4827ba41 
[   14.500945] ---[ end trace 27b734ad26f1add4 ]---
[   15.908719] 
[   16.908869] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0007
[   16.908869] 
[   18.125813] ---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0007]

While registering nest imc at init, the cpu-hotplug callback
`nest_pmu_cpumask_init()` makes an OPAL call to stop the engine. If the
OPAL call fails, imc_common_cpuhp_mem_free() is invoked to clean up the
memory and cpu-hotplug setup.

But when cleaning up the attribute group, we were dereferencing the
attribute element array without checking whether the backing element is
NULL. This causes the kernel panic.

Add a check for the backing element prior to dereferencing the attribute
element, to handle the failing case gracefully.

Signed-off-by: Anju T Sudhakar 
Reported-by: Pridhiviraj Paidipeddi 
---
 arch/powerpc/perf/imc-pmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 9ccac86f3463..001504b0e800 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1148,7 +1148,8 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)
}
 
/* Only free the attr_groups which are dynamically allocated  */
-   kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]->attrs);
+   if (pmu_ptr->attr_groups[IMC_EVENT_ATTR])
+   kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]->attrs);
kfree(pmu_ptr->attr_groups[IMC_EVENT_ATTR]);
kfree(pmu_ptr);
return;
-- 
2.14.1