Re: [PATCH 4/6] powerpc/corenet: Create the dts components for the DPAA FMan

2014-05-06 Thread Emil Medve
Hello Scott,


On 05/05/2014 06:25 PM, Scott Wood wrote:
 On Sat, 2014-05-03 at 05:02 -0500, Emil Medve wrote:
 Hello Scott,


 On 04/21/2014 05:11 PM, Scott Wood wrote:
 On Fri, 2014-04-18 at 07:21 -0500, Shruti Kanetkar wrote:
+fman@400000 {
+  mdio@f1000 {
+  #address-cells = <1>;
+  #size-cells = <0>;
+  compatible = "fsl,fman-xmdio";
+  reg = <0xf1000 0x1000>;
+  };
+};

 I'd like to see a complete fman binding before we start adding pieces.

 The driver for the FMan 10 Gb/s MDIO was upstreamed a couple of years
 ago: '9f35a73 net/fsl: introduce Freescale 10G MDIO driver', granted
 without a binding writeup.
 
 Pushing driver code through the netdev tree does not establish device
 tree ABI.  Binding documents and dts files do.

Sure, ideally and formally. But upstreaming a driver represents, if
nothing else, a statement of intent to observe a device tree ABI. Via
the SDK, FSL customers are using the device tree ABI the driver de facto
establishes. I guess a driver that makes it upstream can establish a
device tree ABI

We'll re-spin adding the binding document

 This patch series should probably include a
 binding blurb. However, let's not gate this patchset on a complete
 binding for the FMan
 
 I at least want to see enough of the FMan binding to have confidence
 that what we're adding now is correct.

I'm not sure what you're looking for. The nodes we're adding are
describing a very common CCSR space interface for quite common device blocks

 As you know we don't own the FMan work and the FMan work is... not ready
 for upstreaming.
 
 I'm not asking for a driver, just a binding that describes hardware.  Is
 there any reason why the fman node needs to be anywhere near as
 complicated as it is in the SDK, if we're limiting it to actual hardware
 description?

Is this a trick question? :-) Of course it doesn't need to be more
complicated than actual hardware. But, to repeat myself, said
description is not... ready and I don't know when it will be. Somebody
else owns pushing the bulk of FMan upstream and I'd rather not step on
their turf quite like this

 Do we really need to have nodes for all the sub-blocks?

Definitely no, and internally I'm pushing to clean that up. However, you
surely remember we've been pushing from the early days of P4080 and it's
been, to put it optimistically, slow

 In an attempt to make some sort of progress we've
 decided to upstream the pieces that are less controversial and MDIO is
 an obvious candidate

+fman@400000 {
+  mdio0: mdio@e1120 {
+  #address-cells = <1>;
+  #size-cells = <0>;
+  compatible = "fsl,fman-mdio";
+  reg = <0xe1120 0xee0>;
+  };
+};

 What is the difference between fsl,fman-mdio and fsl,fman-xmdio?  I
 don't see the latter on the list of compatibles in patch 3/6.

 'fsl,fman-mdio' is the 1 Gb/s MDIO (Clause 22 only). 'fsl,fman-xmdio' is
 the 10 Gb/s MDIO (Clause 45 only). We can respin this patch wi

 
 respin this patch wi...?

Not sure where the end of that sentence went. I meant we'll re-spin with
a binding for the 10 Gb/s MDIO block

 I believe 'fsl,fman-mdio' (and others on that list) was added
 gratuitously as the FMan MDIO is completely compatible with the
 eTSEC/gianfar MDIO driver, but we can deal with that later
 
 It's still good to identify the specific device, even if it's believed
 to be 100% compatible.

You're suggesting we create new compatibles for every instance/integration
of a hardware block even though it is identical to an earlier hardware
integration? Well, I guess that's been done and now we have about 8
different compatibles that convey no real difference at all

 Plus, IIRC there's been enough badness in the
 eTSEC MDIO binding that it'd be good to steer clear of it.

Hmm... I guess we can leave things as they are. I wasn't going to touch
this just now anyway

 Within each category, is the exact fman version discoverable from the
 mdio registers?

 No, but that's irrelevant as that's not the difference between the two
 compatibles
 
 It's relevant because it means the compatible string should have a block
 version number in it, or at least some other way in the MDIO node to
 indicate the block version.

The 1 Gb/s MDIO block doesn't track a version of its own, and from a
programming interface perspective it has had no visible change since
eTSEC. The 10 Gb/s MDIO doesn't track a version of its own either, and
across the existing FMan versions it is identical from a programming
interface perspective

I guess we can append a 'v1.0' to the MDIO compatible(s). However, given
the SDK we'll have to support the compatibles the (already upstream)
drivers support. Dealing with all that legacy is going to be so tedious

+fman@500000 {
+  #address-cells = <1>;
+  #size-cells = <1>;
+  compatible = "simple-bus";

 Why is this simple-bus?

 Because that's the translation type for the FMan sub-nodes.
 
 What do you mean by translation type?

I mean address translation across 

Re: [PATCH 5/6] powerpc/corenet: Add DPAA FMan support to the SoC device tree(s)

2014-05-06 Thread Emil Medve
Hello Scott,


On 05/05/2014 06:34 PM, Scott Wood wrote:
 On Sun, 2014-05-04 at 05:59 -0500, Emil Medve wrote:
 Hello Scott,


 On 04/21/2014 05:14 PM, Scott Wood wrote:
 On Fri, 2014-04-18 at 07:21 -0500, Shruti Kanetkar wrote:
 FMan 1 Gb/s MACs (dTSEC and mEMAC) have support for SGMII PHYs.
 Add support for the internal SerDes TBI PHYs

 Based on prior work by Andy Fleming aflem...@gmail.com

 Signed-off-by: Shruti Kanetkar shr...@freescale.com
 ---
  arch/powerpc/boot/dts/fsl/b4860si-post.dtsi |  28 +
  arch/powerpc/boot/dts/fsl/b4si-post.dtsi|  51 +
  arch/powerpc/boot/dts/fsl/p1023si-post.dtsi |  14 +++
  arch/powerpc/boot/dts/fsl/p2041si-post.dtsi |  64 
  arch/powerpc/boot/dts/fsl/p3041si-post.dtsi |  64 
  arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 104 +++
  arch/powerpc/boot/dts/fsl/p5020si-post.dtsi |  64 
  arch/powerpc/boot/dts/fsl/p5040si-post.dtsi | 128 +++
  arch/powerpc/boot/dts/fsl/t4240si-post.dtsi | 154 
 
  9 files changed, 671 insertions(+)

 diff --git a/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi 
 b/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
 index cbc354b..45b0ff5 100644
 --- a/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
 +++ b/arch/powerpc/boot/dts/fsl/b4860si-post.dtsi
 @@ -172,6 +172,34 @@
	compatible = "fsl,b4860-rcpm", "fsl,qoriq-rcpm-2.0";
};
  
 +/include/ "qoriq-fman3-0-1g-4.dtsi"
 +/include/ "qoriq-fman3-0-1g-5.dtsi"
 +/include/ "qoriq-fman3-0-10g-0.dtsi"
 +/include/ "qoriq-fman3-0-10g-1.dtsi"
 +  fman@400000 {
 +  ethernet@e8000 {
 +  tbi-handle = <&tbi4>;
 +  };

 Binding needed

 Where is the reg for these unit addresses?

 As I said, the bulk of the FMan work comes from another team. Here we
 need just enough to hook up the MDIO and PHY nodes.
 
 Unit addresses must match reg.  No reg, no unit address.

We can add a 'reg' property, but we really don't want to clash with the
team that is working on upstreaming the FMan/MAC bindings and drivers

 I'd really like to be able to make progress on this without waiting for
 the moment in time when we can get the entire FMan binding in place
 
 Why is the fman binding such a big deal?
 
 +  mdio@e9000 {
 +  tbi4: tbi-phy@8 {
 +  reg = <0x8>;
 +  device_type = "tbi-phy";
 +  };
 +  };

 Binding needed for tbi-phy device_type

 I guess that's fair (BTW, you accepted tbi-phy nodes/device-type before
 without a binding)
 
 It's existing practice on eTSEC.  FMan seemed like an opportunity to
 avoid carrying cruft forward.

The 1 Gb/s MDIO block is not FMan specific. As I said, it is the same block
as in eTSEC. That's part of the reason we're trying to upstream this
independently of the FMan stuff. So, don't think FMan, think MDIO

 Why are we using device_type at all for this?

 That's what the upstream driver is looking for.
 
 Drivers should look for what the binding says -- not the other way
 around.

Yeah yeah. Nobody likes it, but the driver is/describes the de facto binding

On a constructive note, the Ethernet PHY code doesn't do device tree
based probing so no compatibles are used at all. So device_type is used
to convey a TBI PHY

 Anyway, most days PHYs can be discovered so they don't use/need
 compatible properties. That's, I guess, part of the reason we don't have
 bindings for the PHY nodes
 
 I don't see why there couldn't be a compatible that describes the
 standard programming interface.

Because it can be detected at runtime and I guess stuff like that should
stay out of the device tree. I'm using PCI as an analogy here

 However, what you can't discover is how they are wired to the MAC(s) so
 we still need some nodes in the device tree to convey that. Also, when
 looking for a specific kind of PHY, such as TBI, device_type works more
 easily than parsing compatibles from various vendors
 
 Don't you find the TBI by following the tbi-handle property?

When the MAC attaches to the PHY the tbi-handle is followed. But the
MDIO/PHY code/driver(s) doesn't quite see the tbi-handle as it's
outside the MDIO/PHY nodes
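
(For illustration only: on the MAC side, following that phandle is a
one-liner via the generic OF API; the helper name below is made up)

	#include <linux/of.h>

	/* tbi-handle points from the MAC node to the TBI PHY node;
	 * the caller must of_node_put() the result when done */
	static struct device_node *find_tbi_node(struct device_node *mac_node)
	{
		return of_parse_phandle(mac_node, "tbi-handle", 0);
	}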

 That said,
 I don't object to having a way to label a PHY as attached via TBI if
 that's useful.  I'm giving a mild, non-nacking (given the history)
 objection to using device_type for that (given other history).

Personally, I think that TBI PHY support is a bit messy but I don't have
the bandwidth to deal with that. The TBI PHY should be handled as a regular
PHY; right now it is a special case


Cheers,

Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-06 Thread Alexander Graf


On 06.05.14 06:26, Gavin Shan wrote:

On Mon, May 05, 2014 at 08:00:12AM -0600, Alex Williamson wrote:

On Mon, 2014-05-05 at 13:56 +0200, Alexander Graf wrote:

On 05/05/2014 03:27 AM, Gavin Shan wrote:

The series of patches intends to support EEH for PCI devices which have been
passed through to a PowerKVM based guest via VFIO. The implementation is
straightforward based on the issues or problems we have to resolve to support
EEH for PowerKVM based guests.

- Emulation for EEH RTAS requests. Thankfully, we already have infrastructure
to emulate XICS. Without introducing a new mechanism, we just extend that
existing infrastructure to support EEH RTAS emulation. EEH RTAS requests
initiated from the guest are posted to the host where the requests get handled
or delivered to the underlying firmware for further handling. For that, the
host kernel has to maintain the PCI address (host domain/bus/slot/function to
guest's PHB BUID/bus/slot/function) mapping via the KVM VFIO device. The
address mapping will be built when initializing the VFIO device in QEMU and
destroyed when the VFIO device in QEMU goes offline, or the VM is destroyed.

Do you also expose all those interfaces to user space? VFIO is as much
about user space device drivers as it is about device assignment.


Yep, all the interfaces are exported to user space.


I would like to first see an implementation that doesn't touch KVM
emulation code at all but instead routes everything through QEMU. As a
second step we can then accelerate performance critical paths inside of KVM.


Ok. I'll change the implementation. However, QEMU still has to
poll/push information from/to the host kernel. So the best place for that
would be tce_iommu_driver_ops::ioctl as EEH is a Power specific feature.

For the error injection, I guess I have to put the logic token management
into QEMU and error injection request will be handled by QEMU and then
routed to host kernel via additional syscall as we did for pSeries.


Yes, start off without in-kernel XICS so everything simply lives in 
QEMU. Then add callbacks into the in-kernel XICS to inject these 
interrupts if we don't have wide enough interfaces already.




Alex


Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-06 Thread Alexander Graf


On 06.05.14 02:41, Paul Mackerras wrote:

On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:

On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:

+#ifdef CONFIG_PPC_BOOK3S_64
+   return vcpu->arch.fault_dar;

How about PA6T and G5s?

G5 sets DAR on an alignment interrupt.

As for PA6T, I don't know for sure, but if it doesn't, ordinary
alignment interrupts wouldn't be handled properly, since the code in
arch/powerpc/kernel/align.c assumes DAR contains the address being
accessed on all PowerPC CPUs.


Now that's a good point. If we simply behave like Linux, I'm fine. This 
definitely deserves a comment on the #ifdef in the code.



Alex


Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Alexander Graf


On 06.05.14 02:06, Benjamin Herrenschmidt wrote:

On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:

Isn't this a greater problem? We should start swapping before we hit
the point where non movable kernel allocation fails, no?

Possibly but the fact remains, this can be avoided by making sure that
if we create a CMA reserve for KVM, then it uses it rather than using
the rest of main memory for hash tables.


So why were we preferring non-CMA memory before? Considering that Aneesh 
introduced that logic in fa61a4e3 I suppose this was just a mistake?



The fact that KVM uses a good number of normal kernel pages is maybe
suboptimal, but shouldn't be a critical problem.

The point is that we explicitly reserve those pages in CMA for use
by KVM for that specific purpose, but the current code tries first
to get them out of the normal pool.

This is not an optimal behaviour and is what Aneesh patches are
trying to fix.


I agree, and I agree that it's worth it to make better use of our 
resources. But we still shouldn't crash.


However, reading through this thread I think I've slowly grasped what 
the problem is. The hugetlbfs size calculation.


I guess something in your stack overreserves huge pages because it 
doesn't account for the fact that some part of system memory is already 
reserved for CMA.


So the underlying problem is something completely orthogonal. The patch 
body as is is fine, but the patch description should simply say that we 
should prefer the CMA region because it's already reserved for us for 
this purpose and we make better use of our available resources that way.


All the bits about pinning, numa, libvirt and whatnot don't really 
matter and are just details that led Aneesh to find this non-optimal 
allocation.



Alex


Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 08:56 +0200, Alexander Graf wrote:
  For the error injection, I guess I have to put the logic token management
  into QEMU and error injection request will be handled by QEMU and then
  routed to host kernel via additional syscall as we did for pSeries.
 
 Yes, start off without in-kernel XICS so everything simply lives in 
 QEMU. Then add callbacks into the in-kernel XICS to inject these 
 interrupts if we don't have wide enough interfaces already.

It's got nothing to do with XICS ... :-)

But yes, we can route everything via qemu for now, then we'll need
at least one of the calls to have a direct path, but we should probably
strive to even make it real mode if that's possible; it's the one that
Linux will call whenever an MMIO returns all f's to check if the
underlying PE is frozen.
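
Roughly this pattern, i.e. something like the below (all names here are
placeholders just to illustrate the check, not a real API):

	/* sketch only: in_be32() is the usual powerpc MMIO accessor,
	 * eeh_pe_is_frozen() is a made-up stand-in for the real query */
	static bool mmio_looks_frozen(void __iomem *reg, struct eeh_pe *pe)
	{
		u32 val = in_be32(reg);

		/* an all-f's readback is the hint that the PE may be frozen */
		return val == 0xffffffff && eeh_pe_is_frozen(pe);
	}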

But we can do that as a second stage.

In fact going via VFIO ioctl's does make the whole security and
translation model much simpler initially.

Cheers,
Ben.



Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
 On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
  On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
  Isn't this a greater problem? We should start swapping before we hit
  the point where non movable kernel allocation fails, no?
  Possibly but the fact remains, this can be avoided by making sure that
  if we create a CMA reserve for KVM, then it uses it rather than using
  the rest of main memory for hash tables.
 
 So why were we preferring non-CMA memory before? Considering that Aneesh 
 introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

  The fact that KVM uses a good number of normal kernel pages is maybe
  suboptimal, but shouldn't be a critical problem.
  The point is that we explicitly reserve those pages in CMA for use
  by KVM for that specific purpose, but the current code tries first
  to get them out of the normal pool.
 
  This is not an optimal behaviour and is what Aneesh patches are
  trying to fix.
 
 I agree, and I agree that it's worth it to make better use of our 
 resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)

 However, reading through this thread I think I've slowly grasped what 
 the problem is. The hugetlbfs size calculation.

Not really.

 I guess something in your stack overreserves huge pages because it 
 doesn't account for the fact that some part of system memory is already 
 reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OOM situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...
 
 So the underlying problem is something completely orthogonal. The patch 
 body as is is fine, but the patch description should simply say that we 
 should prefer the CMA region because it's already reserved for us for 
 this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh's latest patch, when creating guests, KVM starts allocating
its hash tables from the latter instead of CMA (we never allocate from
the hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get Linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
its own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)

 All the bits about pinning, numa, libvirt and whatnot don't really 
 matter and are just details that led Aneesh to find this non-optimal 
 allocation.

Cheers,
Ben.



Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Alexander Graf


On 06.05.14 09:19, Benjamin Herrenschmidt wrote:

On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:

On 06.05.14 02:06, Benjamin Herrenschmidt wrote:

On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:

Isn't this a greater problem? We should start swapping before we hit
the point where non movable kernel allocation fails, no?

Possibly but the fact remains, this can be avoided by making sure that
if we create a CMA reserve for KVM, then it uses it rather than using
the rest of main memory for hash tables.

So why were we preferring non-CMA memory before? Considering that Aneesh
introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.


The fact that KVM uses a good number of normal kernel pages is maybe
suboptimal, but shouldn't be a critical problem.

The point is that we explicitly reserve those pages in CMA for use
by KVM for that specific purpose, but the current code tries first
to get them out of the normal pool.

This is not an optimal behaviour and is what Aneesh patches are
trying to fix.

I agree, and I agree that it's worth it to make better use of our
resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)


However, reading through this thread I think I've slowly grasped what
the problem is. The hugetlbfs size calculation.

Not really.


I guess something in your stack overreserves huge pages because it
doesn't account for the fact that some part of system memory is already
reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OOM situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...
  

So the underlying problem is something completely orthogonal. The patch
body as is is fine, but the patch description should simply say that we
should prefer the CMA region because it's already reserved for us for
this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh's latest patch, when creating guests, KVM starts allocating
its hash tables from the latter instead of CMA (we never allocate from
the hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get Linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
its own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)


Yes. Write that in the patch description and I'm happy ;).


Alex


Re: [PATCH 5/6] powerpc/corenet: Add DPAA FMan support to the SoC device tree(s)

2014-05-06 Thread Joakim Tjernlund
Linuxppc-dev 
linuxppc-dev-bounces+joakim.tjernlund=transmode...@lists.ozlabs.org 
wrote on 2014/05/06 08:28:42:
 

.

 
  That said,
  I don't object to having a way to label a PHY as attached via TBI if
  that's useful.  I'm giving a mild, non-nacking (given the history)
  objection to using device_type for that (given other history).
 
 Personally, I think that TBI PHY support is a bit messy but I don't have
 the bandwidth to deal with that. The TBI PHY should be handled as a regular
 PHY; right now it is a special case

Yes please! We will use the TBI as the only PHY in 1000BASE-X mode so
naturally we want to see the TBI as its own PHY and monitor its link 
status, AN etc.

 Jocke

Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Alexander Graf

On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
  arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++-
  arch/powerpc/kvm/book3s_hv.c |   7 ++
  2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 51388befeddb..f03ea8f90576 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, 
unsigned long bits)
return old == 0;
  }
  
+static inline int __hpte_actual_psize(unsigned int lp, int psize)

+{
+   int i, shift;
+   unsigned int mask;
+
+   /* start from 1 ignoring MMU_PAGE_4K */
+   for (i = 1; i < MMU_PAGE_COUNT; i++) {
+
+   /* invalid penc */
+   if (mmu_psize_defs[psize].penc[i] == -1)
+   continue;
+   /*
+* encoding bits per actual page size
+*PTE LP actual page size
+* rrrz >=8KB
+* rrzz >=16KB
+* rzzz >=32KB
+* zzzz >=64KB
+* ...
+*/
+   shift = mmu_psize_defs[i].shift - LP_SHIFT;
+   if (shift > LP_BITS)
+   shift = LP_BITS;
+   mask = (1 << shift) - 1;
+   if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+   return i;
+   }
+   return -1;
+}
+
  static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
  {
-   unsigned long rb, va_low;
+   int b_size, a_size;
+   unsigned int penc;
+   unsigned long rb = 0, va_low, sllp;
+   unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+   if (!(v & HPTE_V_LARGE)) {
+   /* both base and actual psize is 4k */
+   b_size = MMU_PAGE_4K;
+   a_size = MMU_PAGE_4K;
+   } else {
+   for (b_size = 0; b_size < MMU_PAGE_COUNT; b_size++) {
+
+   /* valid entries have a shift value */
+   if (!mmu_psize_defs[b_size].shift)
+   continue;
  
+			a_size = __hpte_actual_psize(lp, b_size);

+   if (a_size != -1)
+   break;
+   }
+   }
+   /*
+* Ignore the top 14 bits of va
+* v have top two bits covering segment size, hence move
+* by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
+* AVA field in v also have the lower 23 bits ignored.
+* For base page size 4K we need 14 .. 65 bits (so need to
+* collect extra 11 bits)
+* For others we need 14..14+i
+*/
+   /* This covers 14..54 bits of va*/
	rb = (v & ~0x7fUL) << 16; /* AVA field */
+   /*
+* AVA in v had cleared lower 23 bits. We need to derive
+* that from pteg index
+*/
	va_low = pte_index >> 3;
	if (v & HPTE_V_SECONDARY)
va_low = ~va_low;
-   /* xor vsid from AVA */
+   /*
+* get the vpn bits from va_low using reverse of hashing.
+* In v we have va with 23 bits dropped and then left shifted
+* HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
+* right shift it with (SID_SHIFT - (23 - 7))
+*/
if (!(v  HPTE_V_1TB_SEG))
-   va_low ^= v >> 12;
+   va_low ^= v >> (SID_SHIFT - 16);
	else
-   va_low ^= v >> 24;
+   va_low ^= v >> (SID_SHIFT_1T - 16);
	va_low &= 0x7ff;
-   if (v & HPTE_V_LARGE) {
-   rb |= 1;/* L field */
-   if (cpu_has_feature(CPU_FTR_ARCH_206) &&
-   (r & 0xff000)) {
-   /* non-16MB large page, must be 64k */
-   /* (masks depend on page size) */
-   rb |= 0x1000;   /* page encoding in LP field */
-   rb |= (va_low & 0x7f) << 16; /* 7b of VA in AVA/LP field */
-   rb |= ((va_low << 4) & 0xf0); /* AVAL field (P7 doesn't seem to care) */
-   }
-   } else {
-   /* 4kB page */
-   rb |= (va_low & 0x7ff) << 12; /* remaining 11b of VA */
+
+   switch (b_size) {
+   case MMU_PAGE_4K:
+   sllp = ((mmu_psize_defs[a_size].sllp & SLB_VSID_L) >> 6) |
+   ((mmu_psize_defs[a_size].sllp & SLB_VSID_LP) >> 4);
+   rb |= sllp << 5;  /*  AP field */
+   rb |= (va_low & 0x7ff) << 12; /* remaining 11 bits of AVA */
+   break;
+   default:
+   {
+   int aval_shift;
+   /*
+  

Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

 So if I understand this patch correctly, it simply introduces logic to 
 handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
 size field in the HPTE. Mind to explain why exactly that enables us to 
 use THP?

 What exactly is the flow if the pages are not backed by huge pages? What 
 is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different than the encoding for a 16M in a 16M segment, this is done so
that the encoding carries both information, which allows broadcast
tlbie to properly find the right set in the TLB for invalidations among
others.
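
As a toy standalone illustration of that LP matching for one fixed base
page size (the penc values below are made up; the real ones come from
ibm,segment-page-sizes, and the matching mirrors __hpte_actual_psize()
in the patch):

	#include <stdio.h>

	#define LP_SHIFT 12
	#define LP_BITS  8

	struct psize { const char *name; int shift; unsigned int penc; };

	/* hypothetical penc values, for illustration only */
	static const struct psize actual[] = {
		{ "64K", 16, 0x5 },
		{ "16M", 24, 0x38 },
	};

	/* an actual size of shift S consumes min(S - LP_SHIFT, LP_BITS)
	 * low bits of the LP field; match them against penc */
	static const char *actual_size(unsigned int lp)
	{
		unsigned int i;

		for (i = 0; i < sizeof(actual) / sizeof(actual[0]); i++) {
			int shift = actual[i].shift - LP_SHIFT;
			unsigned int mask;

			if (shift > LP_BITS)
				shift = LP_BITS;
			mask = (1u << shift) - 1;
			if ((lp & mask) == actual[i].penc)
				return actual[i].name;
		}
		return "unknown";
	}

	int main(void)
	{
		printf("LP 0x05 -> %s actual page\n", actual_size(0x05));
		printf("LP 0x38 -> %s actual page\n", actual_size(0x38));
		return 0;
	}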

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care here is that this
is MPSS, another guest than Linux might exploit that differently).

What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
evictions.

  +   if (a_size != -1)
  +   return 1ul << mmu_psize_defs[a_size].shift;
  +   }
  +
  +   }
  +   return 0;
}

static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long 
  psize)
  diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
  index 8227dba5af0f..a38d3289320a 100644
  --- a/arch/powerpc/kvm/book3s_hv.c
  +++ b/arch/powerpc/kvm/book3s_hv.c
  @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct 
  kvm_ppc_one_seg_page_size **sps,
   * support pte_enc here
   */
	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
  +   /*
  +* Add 16MB MPSS support
  +*/
  +   if (linux_psize != MMU_PAGE_16M) {
+   (*sps)->enc[1].page_shift = 24;
+   (*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
  +   }
 
 So this basically indicates that every segment (except for the 16MB one) 
 can also handle 16MB MPSS page sizes? I suppose you want to remove the 
 comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.

 Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.

Cheers,
Ben.
 


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Alexander Graf

On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:


So if I understand this patch correctly, it simply introduces logic to
handle page sizes other than 4k, 64k, 16M by analyzing the actual page
size field in the HPTE. Mind to explain why exactly that enables us to
use THP?

What exactly is the flow if the pages are not backed by huge pages? What
is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different than the encoding for a 16M in a 16M segment, this is done so
that the encoding carries both information, which allows broadcast
tlbie to properly find the right set in the TLB for invalidations among
others.

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care here is that this
is MPSS, another guest than Linux might exploit that differently).


Ugh. So we're just talking about a guest using MPSS here? Not about the 
host doing THP? I must've missed that part.




What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
evictions.


Yes. That makes a lot of sense. So this patch really is all about 
enabling MPSS support for 16MB pages. No more, no less.





+   if (a_size != -1)
+   return 1ul << mmu_psize_defs[a_size].shift;
+   }
+
+   }
+   return 0;
   }
   
   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8227dba5af0f..a38d3289320a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct 
kvm_ppc_one_seg_page_size **sps,
 * support pte_enc here
 */
	(*sps)->enc[0].pte_enc = def->penc[linux_psize];
+   /*
+* Add 16MB MPSS support
+*/
+   if (linux_psize != MMU_PAGE_16M) {
+   (*sps)->enc[1].page_shift = 24;
+   (*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
+   }

So this basically indicates that every segment (except for the 16MB one)
can also handle 16MB MPSS page sizes? I suppose you want to remove the
comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.


Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.


yes. When / if people can easily get their hands on p7/p8 bare metal 
systems I'll be more than happy to remove 970 support as well, but for 
now it's probably good to keep in.



Alex


Re: [PATCH V3 2/2] powerpc/pseries: init fault_around_order for pseries

2014-05-06 Thread Ingo Molnar

* Rusty Russell ru...@rustcorp.com.au wrote:

 Ingo Molnar mi...@kernel.org writes:
  * Madhavan Srinivasan ma...@linux.vnet.ibm.com wrote:
 
  Performance data for different FAULT_AROUND_ORDER values from 4 socket
  Power7 system (128 Threads and 128GB memory). perf stat with repeat of 5
  is used to get the stddev values. Test ran in v3.14 kernel (Baseline) and
  v3.15-rc1 for different fault around order values.
  
  FAULT_AROUND_ORDER     Baseline        1               3               4               5               8

  Linux build (make -j64)
  minor-faults           47,437,359      35,279,286      25,425,347      23,461,275      22,002,189      21,435,836
  times in seconds       347.302528420   344.061588460   340.974022391   348.193508116   348.673900158   350.986543618
  stddev for time        ( +- 1.50% )    ( +- 0.73% )    ( +- 1.13% )    ( +- 1.01% )    ( +- 1.89% )    ( +- 1.55% )
  %chg time to baseline                  -0.9%           -1.8%           0.2%            0.39%           1.06%
 
  Probably too noisy.
 
 A little, but 3 still looks like the winner.
 
  Linux rebuild (make -j64)
  minor-faults           941,552        718,319        486,625        440,124        410,510        397,416
  times in seconds       30.569834718   31.219637539   31.319370649   31.434285472   31.972367174   31.443043580
  stddev for time        ( +- 1.07% )   ( +- 0.13% )   ( +- 0.43% )   ( +- 0.18% )   ( +- 0.95% )   ( +- 0.58% )
  %chg time to baseline                 2.1%           2.4%           2.8%           4.58%          2.85%
 
  Here it looks like a speedup. Optimal value: 5+.
 
 No, lower time is better.  Baseline (no faultaround) wins.
 
 
 etc.

ah, yeah, you are right. Brainfart of the week...

 It's not a huge surprise that a 64k page arch wants a smaller value 
 than a 4k system.  But I agree: I don't see much upside for FAO > 0, 
 but I do see downside.
 
 Most extreme results:
 Order 1: 2% loss on recompile.  10% win 4% loss on seq.  9% loss random.
 Order 3: 2% loss on recompile.  6% win 5% loss on seq.  14% loss on random.
 Order 4: 2.8% loss on recompile. 10% win 7% loss on seq.  9% loss on random.
 
  I'm starting to suspect that maybe workloads ought to be given a 
  choice in this matter, via madvise() or such.
 
 I really don't think they'll be able to use it; it'll change far too 
 much with machine and kernel updates. [...]

Do we know that?

 [...] I think we should apply patch
 #1 (with fixes) to make it a variable, then set it to 0 for PPC.

Ok, agreed - at least until contrary data comes around.

Thanks,

Ingo

Re: Build regressions/improvements in v3.15-rc4

2014-05-06 Thread Geert Uytterhoeven
On Tue, May 6, 2014 at 2:04 PM, Geert Uytterhoeven ge...@linux-m68k.org wrote:
 JFYI, when comparing v3.15-rc4[1]  to v3.15-rc3[3], the summaries are:
   - build errors: +7/-1

  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values  CC  drivers/hwmon/smsc47m192.o:
=> 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values  CC [M]  drivers/usb/gadget/f_rndis.o:
=> 51:2
  + /scratch/kisskb/src/arch/powerpc/include/asm/fixmap.h: error:
overflow in enumeration values:  => 51:2

powerpc-randconfig (looks scary, is CONFIG_HIGHMEM=y broken on ppc?)

  + /scratch/kisskb/src/arch/powerpc/kernel/head_44x.S: Error: invalid
operands (*ABS* and *UND* sections) for `|':  => 686, 603
  + /scratch/kisskb/src/arch/powerpc/mm/tlb_nohash_low.S: Error:
unsupported relocation against PPC47x_TLBE_SIZE:  => 113

powerpc-randconfig

  + /scratch/kisskb/src/arch/powerpc/platforms/powernv/setup.c: error:
implicit declaration of function 'get_hard_smp_processor_id'
[-Werror=implicit-function-declaration]:  => 179:4

ppc64_defconfig+UP

Lemme guess: If CONFIG_SMP=n, linux/smp.h does not include asm/smp.h,
so it needs an explicit #include <asm/smp.h>?
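
i.e. something like the below in setup.c, assuming that's indeed the
missing include (untested):

	#include <asm/smp.h>	/* get_hard_smp_processor_id() when CONFIG_SMP=n */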

  + error: initramfs.c: undefined reference to `__stack_chk_guard':
=> .init.text+0x19dc)

x86_64-randconfig

 [1] http://kisskb.ellerman.id.au/kisskb/head/7449/ (all 119 configs)
 [3] http://kisskb.ellerman.id.au/kisskb/head/7427/ (all 119 configs)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say programmer or something like that.
-- Linus Torvalds

Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 06.05.14 02:41, Paul Mackerras wrote:
 On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
 On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
 +#ifdef CONFIG_PPC_BOOK3S_64
 +  return vcpu->arch.fault_dar;
 How about PA6T and G5s?
 G5 sets DAR on an alignment interrupt.

 As for PA6T, I don't know for sure, but if it doesn't, ordinary
 alignment interrupts wouldn't be handled properly, since the code in
 arch/powerpc/kernel/align.c assumes DAR contains the address being
 accessed on all PowerPC CPUs.

 Now that's a good point. If we simply behave like Linux, I'm fine. This 
 definitely deserves a comment on the #ifdef in the code.


Will update and send V5

-aneesh


Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 06.05.14 02:41, Paul Mackerras wrote:
 On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:
 On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:
 +#ifdef CONFIG_PPC_BOOK3S_64
 +  return vcpu->arch.fault_dar;
 How about PA6T and G5s?
 G5 sets DAR on an alignment interrupt.

 As for PA6T, I don't know for sure, but if it doesn't, ordinary
 alignment interrupts wouldn't be handled properly, since the code in
 arch/powerpc/kernel/align.c assumes DAR contains the address being
 accessed on all PowerPC CPUs.

 Now that's a good point. If we simply behave like Linux, I'm fine. This 
 definitely deserves a comment on the #ifdef in the code.


How about this?

#ifdef CONFIG_PPC_BOOK3S_64
/*
 * Linux always expects a valid DAR as per the alignment
 * interrupt handling code (fix_alignment()). Don't compute the DAR
 * value here; instead use the saved DAR value. Right now we restrict
 * this to BOOK3S-64.
 */
return vcpu->arch.fault_dar;
#else


-aneesh


Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
 On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
 On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
 On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
 Isn't this a greater problem? We should start swapping before we hit
 the point where non movable kernel allocation fails, no?
 Possibly but the fact remains, this can be avoided by making sure that
 if we create a CMA reserve for KVM, then it uses it rather than using
 the rest of main memory for hash tables.
 So why were we preferring non-CMA memory before? Considering that Aneesh
 introduced that logic in fa61a4e3 I suppose this was just a mistake?
 I assume so.


...


 Whatever remains is split between CMA and the normal page allocator.

 Without Aneesh's latest patch, when creating guests, KVM starts allocating
 its hash tables from the latter instead of CMA (we never allocate from
 the hugetlb pool afaik, only guest pages do that, not hash tables).

 So we exhaust the page allocator and get Linux into OOM conditions
 while there's plenty of space in CMA. But the kernel cannot use CMA for
 its own allocations, only to back user pages, which we don't care about
 because our guest pages are covered by our hugetlb reserve :-)

 Yes. Write that in the patch description and I'm happy ;).


How about the below:

Current KVM code first try to allocate hash page table from the normal
page allocator before falling back to the CMA reserve region. One of the
side effects of that is, we could exhaust the page allocator and get
linux into OOM conditions while we still have plenty of space in CMA. 

Fix this by trying the CMA reserve region first and then falling back
to normal page allocator if we fail to get enough memory from CMA
reserve area.
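
The reordered allocation path would look roughly like the below (a
sketch modeled on the 3.15-era kvmppc_alloc_hpt(); treat the helper
names as illustrative):

	/* sketch: try the CMA reserve first ... */
	kvm->arch.hpt_cma_alloc = 0;
	page = kvm_alloc_hpt(1ul << (order - PAGE_SHIFT));
	if (page) {
		hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
		kvm->arch.hpt_cma_alloc = 1;
	}

	/* ... then fall back to the page allocator, shrinking the order */
	while (!hpt && order > PPC_MIN_HPT_ORDER) {
		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
				       __GFP_NOWARN, order - PAGE_SHIFT);
		if (!hpt)
			--order;
	}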

-aneesh


Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-06 Thread Alexander Graf

On 05/06/2014 04:12 PM, Aneesh Kumar K.V wrote:

Alexander Graf ag...@suse.de writes:


On 06.05.14 02:41, Paul Mackerras wrote:

On Mon, May 05, 2014 at 01:19:30PM +0200, Alexander Graf wrote:

On 05/04/2014 07:21 PM, Aneesh Kumar K.V wrote:

+#ifdef CONFIG_PPC_BOOK3S_64
+   return vcpu->arch.fault_dar;

How about PA6T and G5s?

G5 sets DAR on an alignment interrupt.

As for PA6T, I don't know for sure, but if it doesn't, ordinary
alignment interrupts wouldn't be handled properly, since the code in
arch/powerpc/kernel/align.c assumes DAR contains the address being
accessed on all PowerPC CPUs.

Now that's a good point. If we simply behave like Linux, I'm fine. This
definitely deserves a comment on the #ifdef in the code.


How about this?

#ifdef CONFIG_PPC_BOOK3S_64
/*
 * Linux always expects a valid DAR as per the alignment
 * interrupt handling code (fix_alignment()). Don't compute the DAR
 * value here; instead use the saved DAR value. Right now we restrict
 * this to BOOK3S-64.
 */


/* Linux's fix_alignment() assumes that DAR is valid, so can we */


Alex


return vcpu->arch.fault_dar;
#else


-aneesh




Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
 Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com




   static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
   {
 +int size, a_size;
 +/* Look at the 8 bit LP value */
 +unsigned int lp = (l >> LP_SHIFT) & ((1 << LP_BITS) - 1);
 +
  /* only handle 4k, 64k and 16M pages for now */
  if (!(h & HPTE_V_LARGE))
 -return 1ul << 12;   /* 4k page */
 -if ((l & 0xf000) == 0x1000 && cpu_has_feature(CPU_FTR_ARCH_206))
 -return 1ul << 16;   /* 64k page */
 -if ((l & 0xff000) == 0)
 -return 1ul << 24;   /* 16M page */
 -return 0;   /* error */
 +return 1ul << 12;
 +else {
 +for (size = 0; size < MMU_PAGE_COUNT; size++) {
 +/* valid entries have a shift value */
 +if (!mmu_psize_defs[size].shift)
 +continue;
 +
 +a_size = __hpte_actual_psize(lp, size);

 a_size as psize is probably a slightly confusing name. Just call it 
 a_psize.

Will update.


 So if I understand this patch correctly, it simply introduces logic to 
 handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
 size field in the HPTE. Mind to explain why exactly that enables us to 
 use THP?

 What exactly is the flow if the pages are not backed by huge pages? What 
 is the flow when they start to get backed by huge pages?

 +if (a_size != -1)
 +return 1ul << mmu_psize_defs[a_size].shift;
 +}
 +
 +}
 +return 0;
   }
   
   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long 
 psize)
 diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
 index 8227dba5af0f..a38d3289320a 100644
 --- a/arch/powerpc/kvm/book3s_hv.c
 +++ b/arch/powerpc/kvm/book3s_hv.c
 @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct 
 kvm_ppc_one_seg_page_size **sps,
   * support pte_enc here
   */
 (*sps)->enc[0].pte_enc = def->penc[linux_psize];
 +/*
 + * Add 16MB MPSS support
 + */
 +if (linux_psize != MMU_PAGE_16M) {
 +(*sps)->enc[1].page_shift = 24;
 +(*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
 +}

 So this basically indicates that every segment (except for the 16MB one) 
 can also handle 16MB MPSS page sizes? I suppose you want to remove the 
 comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS
 here.

Will do


 Can we also ensure that every system we run on can do MPSS?


Will do

-aneesh


Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Alexander Graf

On 05/06/2014 04:20 PM, Aneesh Kumar K.V wrote:

Alexander Graf ag...@suse.de writes:


On 06.05.14 09:19, Benjamin Herrenschmidt wrote:

On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:

On 06.05.14 02:06, Benjamin Herrenschmidt wrote:

On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:

Isn't this a greater problem? We should start swapping before we hit
the point where non movable kernel allocation fails, no?

Possibly but the fact remains, this can be avoided by making sure that
if we create a CMA reserve for KVM, then it uses it rather than using
the rest of main memory for hash tables.

So why were we preferring non-CMA memory before? Considering that Aneesh
introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.


...


Whatever remains is split between CMA and the normal page allocator.

Without Aneesh's latest patch, when creating guests, KVM starts allocating
its hash tables from the latter instead of CMA (we never allocate from
the hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get Linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
its own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)

Yes. Write that in the patch description and I'm happy ;).


How about the below:

Current KVM code first try to allocate hash page table from the normal
page allocator before falling back to the CMA reserve region. One of the
side effects of that is, we could exhaust the page allocator and get
linux into OOM conditions while we still have plenty of space in CMA.

Fix this by trying the CMA reserve region first and then falling back
to normal page allocator if we fail to get enough memory from CMA
reserve area.


Fix the grammar (I've spotted a good number of mistakes), then this 
should do. Please also improve the headline.



Alex


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Mon, May 05, 2014 at 08:17:00PM +0530, Aneesh Kumar K.V wrote:
 Alexander Graf ag...@suse.de writes:
 
  On 05/04/2014 07:30 PM, Aneesh Kumar K.V wrote:
  Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 
  No patch description, no proper explanations anywhere why you're doing 
  what. All of that in a pretty sensitive piece of code. There's no way 
  this patch can go upstream in its current form.
 
 
 Sorry about being vague. Will add a better commit message. The goal is
 to export MPSS support to the guest if the host supports the same. MPSS
 support is exported via the penc encoding in ibm,segment-page-sizes. The
 actual format can be found at htab_dt_scan_page_sizes. When the guest
 memory is backed by hugetlbfs we expose the penc encoding the host
 supports to the guest via kvmppc_add_seg_page_size. 

 In a case like this it's good to assume the reader doesn't know very
 much about Power CPUs, and probably isn't familiar with acronyms such
 as MPSS.  The patch needs an introductory paragraph explaining that on
 recent IBM Power CPUs, while the hashed page table is looked up using
 the page size from the segmentation hardware (i.e. the SLB), it is
 possible to have the HPT entry indicate a larger page size.  Thus for
 example it is possible to put a 16MB page in a 64kB segment, but since
 the hash lookup is done using a 64kB page size, it may be necessary to
 put multiple entries in the HPT for a single 16MB page.  This
 capability is called mixed page-size segment (MPSS).  With MPSS,
 there are two relevant page sizes: the base page size, which is the
 size used in searching the HPT, and the actual page size, which is the
 size indicated in the HPT entry.  Note that the actual page size is
 always >= base page size.

I will update the commit message with the above details

-aneesh


[PATCH V5] KVM: PPC: BOOK3S: Use the saved dar value and generic make_dsisr

2014-05-06 Thread Aneesh Kumar K.V
Although it's optional, IBM POWER CPUs have always had the DAR value set on
alignment interrupts, so don't try to compute these values.

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
* Changes from V4
 * Update comments around using fault_dar

 arch/powerpc/include/asm/disassemble.h | 34 +
 arch/powerpc/kernel/align.c| 34 +
 arch/powerpc/kvm/book3s_emulate.c  | 46 ++
 3 files changed, 43 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/disassemble.h 
b/arch/powerpc/include/asm/disassemble.h
index 856f8deb557a..6330a61b875a 100644
--- a/arch/powerpc/include/asm/disassemble.h
+++ b/arch/powerpc/include/asm/disassemble.h
@@ -81,4 +81,38 @@ static inline unsigned int get_oc(u32 inst)
 {
	return (inst >> 11) & 0x7fff;
 }
+
+#define IS_XFORM(inst) (get_op(inst) == 31)
+#define IS_DSFORM(inst)    (get_op(inst) >= 56)
+
+/*
+ * Create a DSISR value from the instruction
+ */
+static inline unsigned make_dsisr(unsigned instr)
+{
+   unsigned dsisr;
+
+
+   /* bits  6:15 --> 22:31 */
+   dsisr = (instr & 0x03ff0000) >> 16;
+
+   if (IS_XFORM(instr)) {
+   /* bits 29:30 --> 15:16 */
+   dsisr |= (instr & 0x00000006) << 14;
+   /* bit 25 -->    17 */
+   dsisr |= (instr & 0x00000040) << 8;
+   /* bits 21:24 --> 18:21 */
+   dsisr |= (instr & 0x00000780) << 3;
+   } else {
+   /* bit  5 -->    17 */
+   dsisr |= (instr & 0x04000000) >> 12;
+   /* bits  1: 4 --> 18:21 */
+   dsisr |= (instr & 0x78000000) >> 17;
+   /* bits 30:31 --> 12:13 */
+   if (IS_DSFORM(instr))
+   dsisr |= (instr & 0x00000003) << 18;
+   }
+
+   return dsisr;
+}
 #endif /* __ASM_PPC_DISASSEMBLE_H__ */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index 94908af308d8..34f55524d456 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -25,14 +25,13 @@
 #include <asm/cputable.h>
 #include <asm/emulated_ops.h>
 #include <asm/switch_to.h>
+#include <asm/disassemble.h>
 
 struct aligninfo {
unsigned char len;
unsigned char flags;
 };
 
-#define IS_XFORM(inst) (((inst) >> 26) == 31)
-#define IS_DSFORM(inst)    (((inst) >> 26) >= 56)
 
 #define INVALID{ 0, 0 }
 
@@ -192,37 +191,6 @@ static struct aligninfo aligninfo[128] = {
 };
 
 /*
- * Create a DSISR value from the instruction
- */
-static inline unsigned make_dsisr(unsigned instr)
-{
-   unsigned dsisr;
-
-
-   /* bits  6:15 --> 22:31 */
-   dsisr = (instr & 0x03ff0000) >> 16;
-
-   if (IS_XFORM(instr)) {
-   /* bits 29:30 --> 15:16 */
-   dsisr |= (instr & 0x00000006) << 14;
-   /* bit 25 -->    17 */
-   dsisr |= (instr & 0x00000040) << 8;
-   /* bits 21:24 --> 18:21 */
-   dsisr |= (instr & 0x00000780) << 3;
-   } else {
-   /* bit  5 -->    17 */
-   dsisr |= (instr & 0x04000000) >> 12;
-   /* bits  1: 4 --> 18:21 */
-   dsisr |= (instr & 0x78000000) >> 17;
-   /* bits 30:31 --> 12:13 */
-   if (IS_DSFORM(instr))
-   dsisr |= (instr & 0x00000003) << 18;
-   }
-
-   return dsisr;
-}
-
-/*
  * The dcbz (data cache block zero) instruction
  * gives an alignment fault if used on non-cacheable
  * memory.  We handle the fault mainly for the
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 99d40f8977e8..6bbdb3d1ec77 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -569,48 +569,17 @@ unprivileged:
 
 u32 kvmppc_alignment_dsisr(struct kvm_vcpu *vcpu, unsigned int inst)
 {
-   u32 dsisr = 0;
-
-   /*
-* This is what the spec says about DSISR bits (not mentioned = 0):
-*
-* 12:13[DS]Set to bits 30:31
-* 15:16[X] Set to bits 29:30
-* 17   [X] Set to bit 25
-*  [D/DS]  Set to bit 5
-* 18:21[X] Set to bits 21:24
-*  [D/DS]  Set to bits 1:4
-* 22:26Set to bits 6:10 (RT/RS/FRT/FRS)
-* 27:31Set to bits 11:15 (RA)
-*/
-
-   switch (get_op(inst)) {
-   /* D-form */
-   case OP_LFS:
-   case OP_LFD:
-   case OP_STFD:
-   case OP_STFS:
-   dsisr |= (inst >> 12) & 0x4000; /* bit 17 */
-   dsisr |= (inst >> 17) & 0x3c00; /* bits 18:21 */
-   break;
-   /* X-form */
-   case 31:
-   dsisr |= (inst << 14) & 0x18000; /* bits 15:16 */
-   dsisr |= (inst << 8)  & 0x04000; /* bit 17 */
-   dsisr |= (inst << 3)  & 0x03c00; /* bits 18:21 */
-   break;
-   

[PATCH] powerpc/fsl: Updated corenet-cf compatible string for corenet1-cf chips

2014-05-06 Thread Diana Craciun
From: Diana Craciun diana.crac...@freescale.com

Updated the device trees according to the corenet-cf
binding definition.

Signed-off-by: Diana Craciun diana.crac...@freescale.com
---
 arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p3041si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 2 +-
 arch/powerpc/boot/dts/fsl/p5020si-post.dtsi | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
index e2987a3..b5daa4c 100644
--- a/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p2041si-post.dtsi
@@ -246,7 +246,7 @@
};
 
corenet-cf@18000 {
-   compatible = "fsl,corenet-cf";
+   compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
	reg = <0x18000 0x1000>;
	interrupts = <16 2 1 31>;
	fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
index 7af6d45..5abd1fc 100644
--- a/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p3041si-post.dtsi
@@ -273,7 +273,7 @@
};
 
corenet-cf@18000 {
-   compatible = "fsl,corenet-cf";
+   compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
	reg = <0x18000 0x1000>;
	interrupts = <16 2 1 31>;
	fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
index 2415e1f..bf0e7c9 100644
--- a/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p4080si-post.dtsi
@@ -281,7 +281,7 @@
};
 
corenet-cf@18000 {
-   compatible = "fsl,corenet-cf";
+   compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
	reg = <0x18000 0x1000>;
	interrupts = <16 2 1 31>;
	fsl,ccf-num-csdids = <32>;
diff --git a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi 
b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
index 2985de4..f7ca9f4 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
@@ -278,7 +278,7 @@
};
 
corenet-cf@18000 {
-   compatible = "fsl,corenet-cf";
+   compatible = "fsl,corenet1-cf", "fsl,corenet-cf";
	reg = <0x18000 0x1000>;
	interrupts = <16 2 1 31>;
	fsl,ccf-num-csdids = <32>;
-- 
1.7.11.7


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:


.


I updated the commit message as below. Let me know if this is ok.

KVM: PPC: BOOK3S: HV: THP support for guest

On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We advertise MPSS feature to guest only if the host CPU supports the
same. We use ibm,segment-page-sizes device tree node to advertise
the MPSS support. The penc encoding indicate whether we support
a specific combination of base page size and actual page size
in the same segment. It is also the value used in the L|LP encoding
of HPTE entry.

In-order to support MPSS in guest, KVM need to handle the below details
* advertise MPSS via ibm,segment-page-sizes
* Decode the base and actual page size correctly from the HPTE entry
  so that we know what we are dealing with in H_ENTER and and can do
  the appropriate TLB invalidation in H_REMOVE and evictions.
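
Editor's aside: a quick check of the "multiple entries" point above. With a
64kB base page size and a 16MB actual page, the HPT needs one entry per
base-page-sized piece of the large page (standalone illustration, not
kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int base_shift = 16;	/* 64kB base page size */
	unsigned int actual_shift = 24;	/* 16MB actual page size */

	/* 2^(24 - 16) = 256 HPT entries for one 16MB page */
	printf("%u HPT entries\n", 1u << (actual_shift - base_shift));
	return 0;
}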




 yes. When / if people can easily get their hands on p7/p8 bare metal 
 systems I'll be more than happy to remove 970 support as well, but for 
 now it's probably good to keep in.


This should handle that.

+   /*
+* Add 16MB MPSS support if host supports it
+*/
+   if (linux_psize != MMU_PAGE_16M && def->penc[MMU_PAGE_16M] != -1) {
+   (*sps)->enc[1].page_shift = 24;
+   (*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
+   }
(*sps)++;

-aneesh


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Alexander Graf

On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:

Alexander Graf ag...@suse.de writes:


On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:


.


I updated the commit message as below. Let me know if this is ok.

 KVM: PPC: BOOK3S: HV: THP support for guest


This has nothing to do with THP.

 
 On recent IBM Power CPUs, while the hashed page table is looked up using

 the page size from the segmentation hardware (i.e. the SLB), it is
 possible to have the HPT entry indicate a larger page size.  Thus for
 example it is possible to put a 16MB page in a 64kB segment, but since
 the hash lookup is done using a 64kB page size, it may be necessary to
 put multiple entries in the HPT for a single 16MB page.  This
 capability is called mixed page-size segment (MPSS).  With MPSS,
 there are two relevant page sizes: the base page size, which is the
 size used in searching the HPT, and the actual page size, which is the
 size indicated in the HPT entry. [ Note that the actual page size is
 always >= base page size ].
 
 We advertise MPSS feature to guest only if the host CPU supports the

 same. We use ibm,segment-page-sizes device tree node to advertise
 the MPSS support. The penc encoding indicate whether we support
 a specific combination of base page size and actual page size
 in the same segment. It is also the value used in the L|LP encoding
 of HPTE entry.
 
 In-order to support MPSS in guest, KVM need to handle the below details

 * advertise MPSS via ibm,segment-page-sizes
 * Decode the base and actual page size correctly from the HPTE entry
   so that we know what we are dealing with in H_ENTER and and can do


Which code path exactly changes for H_ENTER?


   the appropriate TLB invalidation in H_REMOVE and evictions.


Apart from the grammar (which is pretty broken for the part that is not 
copied from Paul) and the subject line this sounds quite reasonable.



Alex


RE: [PATCH v2 1/4] KVM: PPC: e500mc: Revert add load inst fixup

2014-05-06 Thread mihai.cara...@freescale.com
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Sunday, May 04, 2014 1:14 AM
 To: Caraman Mihai Claudiu-B02008
 Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
 d...@lists.ozlabs.org
 Subject: Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert add load inst
 fixup
 
 
 
 Am 03.05.2014 um 01:14 schrieb mihai.cara...@freescale.com
 mihai.cara...@freescale.com:
 
  From: Alexander Graf ag...@suse.de
  Sent: Friday, May 2, 2014 12:24 PM

  This was the first idea that sprang to my mind inspired from how DO_KVM
  is hooked on PR. I actually did a simple POC for e500mc/e5500, but this
 will
  not work on e6500 which has shared IVORs between HW threads.
 
 What if we combine the ideas? On read we flip the IVOR to a separate
 handler that checks for a field in the PACA. Only if that field is set,
 we treat the fault as kvm fault, otherwise we jump into the normal
 handler.
 
 I suppose we'd have to also take a lock to make sure we don't race with
 the other thread when it wants to also read a guest instruction, but you
 get the idea.

This might be a solution for TLB eviction but not for execute-but-not-read
entries, which require access from host context.

 
 I have no idea whether this would be any faster, it's more of a
 brainstorming thing really. But regardless this patch set would be a move
 into the right direction.
 
 Btw, do we have any guarantees that we don't get scheduled away before we
 run kvmppc_get_last_inst()? If we run on a different core we can't read
 the inst anymore. Hrm.

It was your suggestion to move the logic from kvmppc_handle_exit() irq
disabled area to kvmppc_get_last_inst():

http://git.freescale.com/git/cgit.cgi/ppc/sdk/linux.git/tree/arch/powerpc/kvm/booke.c

Still, what is wrong if we get scheduled on another core? We will emulate
again and the guest will populate the TLB on the new core.

-Mike

Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert add load inst fixup

2014-05-06 Thread Alexander Graf

On 05/06/2014 05:48 PM, mihai.cara...@freescale.com wrote:

-Original Message-
From: Alexander Graf [mailto:ag...@suse.de]
Sent: Sunday, May 04, 2014 1:14 AM
To: Caraman Mihai Claudiu-B02008
Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
d...@lists.ozlabs.org
Subject: Re: [PATCH v2 1/4] KVM: PPC: e500mc: Revert add load inst
fixup



Am 03.05.2014 um 01:14 schrieb mihai.cara...@freescale.com
mihai.cara...@freescale.com:


From: Alexander Graf ag...@suse.de
Sent: Friday, May 2, 2014 12:24 PM

This was the first idea that sprang to my mind inspired from how DO_KVM
is hooked on PR. I actually did a simple POC for e500mc/e5500, but this

will

not work on e6500 which has shared IVORs between HW threads.

What if we combine the ideas? On read we flip the IVOR to a separate
handler that checks for a field in the PACA. Only if that field is set,
we treat the fault as kvm fault, otherwise we jump into the normal
handler.

I suppose we'd have to also take a lock to make sure we don't race with
the other thread when it wants to also read a guest instruction, but you
get the idea.

This might be a solution for TLB eviction but not for execute-but-not-read
entries, which require access from host context.


Good point :).




I have no idea whether this would be any faster, it's more of a
brainstorming thing really. But regardless this patch set would be a move
into the right direction.

Btw, do we have any guarantees that we don't get scheduled away before we
run kvmppc_get_last_inst()? If we run on a different core we can't read
the inst anymore. Hrm.

It was your suggestion to move the logic from kvmppc_handle_exit() irq
disabled area to kvmppc_get_last_inst():

http://git.freescale.com/git/cgit.cgi/ppc/sdk/linux.git/tree/arch/powerpc/kvm/booke.c

Still, what is wrong if we get scheduled on another core? We will emulate
again and the guest will populate the TLB on the new core.


Yes, it means we have to get the EMULATE_AGAIN code paths correct :). It 
also means we lose some performance with preemptive kernel configurations.



Alex


[PATCH V2] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation

2014-05-06 Thread Aneesh Kumar K.V
Today when KVM tries to reserve memory for the hash page table it
allocates from the normal page allocator first. If that fails it
falls back to CMA's reserved region. One of the side effects of
this is that we could end up exhausting the page allocator and
getting Linux into OOM conditions while we still have plenty of space
available in CMA.

This patch addresses this issue by first trying hash page table
allocation from CMA's reserved region before falling back to the normal
page allocator. So if we run out of memory, we really are out of memory.
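
Editor's sketch of the resulting policy, paraphrasing this patch plus the
unchanged tail of kvmppc_alloc_hpt() (simplified; PPC_MIN_HPT_ORDER and the
shrinking loop come from the existing code below the hunk):

	/* 1. Try the CMA reserved region first */
	page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
	if (page)
		hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));

	/* 2. Only then fall back to the buddy allocator, at ever
	 *    smaller orders, so OOM really means out of memory */
	while (!hpt && order > PPC_MIN_HPT_ORDER) {
		hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN,
				       order - PAGE_SHIFT);
		if (!hpt)
			--order;
	}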

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
* Changes from V1
  * Update commit message 

 arch/powerpc/kvm/book3s_64_mmu_hv.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index fb25ebc0af0c..f32896ffd784 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -52,7 +52,7 @@ static void kvmppc_rmap_reset(struct kvm *kvm);
 
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
-   unsigned long hpt;
+   unsigned long hpt = 0;
struct revmap_entry *rev;
struct page *page = NULL;
long order = KVM_DEFAULT_HPT_ORDER;
@@ -64,22 +64,11 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
}
 
	kvm->arch.hpt_cma_alloc = 0;
-   /*
-* try first to allocate it from the kernel page allocator.
-* We keep the CMA reserved for failed allocation.
-*/
-   hpt = __get_free_pages(GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT |
-  __GFP_NOWARN, order - PAGE_SHIFT);
-
-   /* Next try to allocate from the preallocated pool */
-   if (!hpt) {
-   VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
-   page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
-   if (page) {
-   hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
-   kvm->arch.hpt_cma_alloc = 1;
-   } else
-   --order;
+   VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
+   page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
+   if (page) {
+   hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+   kvm->arch.hpt_cma_alloc = 1;
}
 
/* Lastly try successively smaller sizes from the page allocator */
-- 
1.9.1


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:
 Alexander Graf ag...@suse.de writes:

 On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:
 On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

 .


 I updated the commit message as below. Let me know if this is ok.

  KVM: PPC: BOOK3S: HV: THP support for guest

 This has nothing to do with THP.

THP support in guest depends on KVM advertising MPSS feature. We already
have rest of the changes needed to support transparent huge pages
upstream. (We do support THP with PowerVM LPAR already). The primary
motivation of this patch is to enable THP in powerkvm guest. 


  
  On recent IBM Power CPUs, while the hashed page table is looked up using
  the page size from the segmentation hardware (i.e. the SLB), it is
  possible to have the HPT entry indicate a larger page size.  Thus for
  example it is possible to put a 16MB page in a 64kB segment, but since
  the hash lookup is done using a 64kB page size, it may be necessary to
  put multiple entries in the HPT for a single 16MB page.  This
  capability is called mixed page-size segment (MPSS).  With MPSS,
  there are two relevant page sizes: the base page size, which is the
  size used in searching the HPT, and the actual page size, which is the
  size indicated in the HPT entry. [ Note that the actual page size is
  always >= base page size ].
  
  We advertise MPSS feature to guest only if the host CPU supports the
  same. We use ibm,segment-page-sizes device tree node to advertise
  the MPSS support. The penc encoding indicate whether we support
  a specific combination of base page size and actual page size
  in the same segment. It is also the value used in the L|LP encoding
  of HPTE entry.
  
  In-order to support MPSS in guest, KVM need to handle the below details
  * advertise MPSS via ibm,segment-page-sizes
  * Decode the base and actual page size correctly from the HPTE entry
so that we know what we are dealing with in H_ENTER and and can do

 Which code path exactly changes for H_ENTER?

There are no real code path changes. Any code path that uses
hpte_page_size() is impacted. We return the actual page size there. 


the appropriate TLB invalidation in H_REMOVE and evictions.

 Apart from the grammar (which is pretty broken for the part that is not 
 copied from Paul) and the subject line this sounds quite reasonable.


Will try to fix.

-aneesh


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Alexander Graf

On 05/06/2014 06:08 PM, Aneesh Kumar K.V wrote:

Alexander Graf ag...@suse.de writes:


On 05/06/2014 05:06 PM, Aneesh Kumar K.V wrote:

Alexander Graf ag...@suse.de writes:


On 05/06/2014 11:26 AM, Benjamin Herrenschmidt wrote:

On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:


.


I updated the commit message as below. Let me know if this is ok.

  KVM: PPC: BOOK3S: HV: THP support for guest

This has nothing to do with THP.

THP support in guest depends on KVM advertising MPSS feature. We already
have rest of the changes needed to support transparent huge pages
upstream. (We do support THP with PowerVM LPAR already). The primary
motivation of this patch is to enable THP in powerkvm guest.


But KVM doesn't care. KVM cares about MPSS. It's like saying Support 
fork() in a subject line while your patch implements page faults.




  
  On recent IBM Power CPUs, while the hashed page table is looked up using

  the page size from the segmentation hardware (i.e. the SLB), it is
  possible to have the HPT entry indicate a larger page size.  Thus for
  example it is possible to put a 16MB page in a 64kB segment, but since
  the hash lookup is done using a 64kB page size, it may be necessary to
  put multiple entries in the HPT for a single 16MB page.  This
  capability is called mixed page-size segment (MPSS).  With MPSS,
  there are two relevant page sizes: the base page size, which is the
  size used in searching the HPT, and the actual page size, which is the
  size indicated in the HPT entry. [ Note that the actual page size is
  always >= base page size ].
  
  We advertise MPSS feature to guest only if the host CPU supports the

  same. We use ibm,segment-page-sizes device tree node to advertise
  the MPSS support. The penc encoding indicate whether we support
  a specific combination of base page size and actual page size
  in the same segment. It is also the value used in the L|LP encoding
  of HPTE entry.
  
  In-order to support MPSS in guest, KVM need to handle the below details

  * advertise MPSS via ibm,segment-page-sizes
  * Decode the base and actual page size correctly from the HPTE entry
so that we know what we are dealing with in H_ENTER and and can do

Which code path exactly changes for H_ENTER?

There are no real code path changes. Any code path that uses
hpte_page_size() is impacted. We return the actual page size there.


Ah, I see :).




the appropriate TLB invalidation in H_REMOVE and evictions.

Apart from the grammar (which is pretty broken for the part that is not
copied from Paul) and the subject line this sounds quite reasonable.


Will try to fix.


Awesome. Thanks a lot!


Alex


Re: [PATCH] powerpc/fsl: Updated corenet-cf compatible string for corenet1-cf chips

2014-05-06 Thread Scott Wood
On Tue, 2014-05-06 at 17:56 +0300, Diana Craciun wrote:
 From: Diana Craciun diana.crac...@freescale.com
 
 Updated the device trees according to the corenet-cf
 binding definition.
 
 Signed-off-by: Diana Craciun diana.crac...@freescale.com
 ---
  arch/powerpc/boot/dts/fsl/p2041si-post.dtsi | 2 +-
  arch/powerpc/boot/dts/fsl/p3041si-post.dtsi | 2 +-
  arch/powerpc/boot/dts/fsl/p4080si-post.dtsi | 2 +-
  arch/powerpc/boot/dts/fsl/p5020si-post.dtsi | 2 +-
  4 files changed, 4 insertions(+), 4 deletions(-)

Oops, I meant to include this in the patch I sent, but forgot to squash
the two patches together. :-P

Where's p5040?

-Scott



[PATCH V2] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest

2014-05-06 Thread Aneesh Kumar K.V
On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size.  Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page.  This
capability is called mixed page-size segment (MPSS).  With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].

We use ibm,segment-page-sizes device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.

This patch exposes MPSS support to KVM guest by advertising the
feature via ibm,segment-page-sizes. It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.
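
Editor's illustration: the advertisement itself is a flat cell list whose
layout matches what htab_dt_scan_page_sizes() parses -- base-page shift,
SLB encoding, count, then actual-shift/penc pairs. A POWER7-style example
(values illustrative):

	ibm,segment-page-sizes = <0xc  0x0   0x3  0xc  0x0  0x10 0x7  0x18 0x38
				  0x10 0x110 0x2  0x10 0x1  0x18 0x8
				  0x18 0x100 0x1  0x18 0x0>;

Here the 64kB-base entry (shift 0x10) lists a 16MB actual size (shift 0x18,
penc 0x8), which is exactly the MPSS combination a THP-enabled guest needs.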

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
Changes from V1:
* Update commit message
* Rename variables as per review feedback

 arch/powerpc/include/asm/kvm_book3s_64.h | 146 ++-
 arch/powerpc/kvm/book3s_hv.c |   7 ++
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 51388befeddb..fddb72b48ce9 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -77,34 +77,122 @@ static inline long try_lock_hpte(unsigned long *hpte, 
unsigned long bits)
return old == 0;
 }
 
+static inline int __hpte_actual_psize(unsigned int lp, int psize)
+{
+   int i, shift;
+   unsigned int mask;
+
+   /* start from 1 ignoring MMU_PAGE_4K */
+   for (i = 1; i < MMU_PAGE_COUNT; i++) {
+
+   /* invalid penc */
+   if (mmu_psize_defs[psize].penc[i] == -1)
+   continue;
+   /*
+* encoding bits per actual page size
+*	  PTE LP	actual page size
+* rrrz		>=8KB
+* rrzz		>=16KB
+* rzzz		>=32KB
+* zzzz		>=64KB
+* ...
+*/
+   shift = mmu_psize_defs[i].shift - LP_SHIFT;
+   if (shift > LP_BITS)
+   shift = LP_BITS;
+   mask = (1 << shift) - 1;
+   if ((lp & mask) == mmu_psize_defs[psize].penc[i])
+   return i;
+   }
+   return -1;
+}
+
 static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
 unsigned long pte_index)
 {
-   unsigned long rb, va_low;
+   int b_psize, a_psize;
+   unsigned int penc;
+   unsigned long rb = 0, va_low, sllp;
+   unsigned int lp = (r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
+
+   if (!(v & HPTE_V_LARGE)) {
+   /* both base and actual psize is 4k */
+   b_psize = MMU_PAGE_4K;
+   a_psize = MMU_PAGE_4K;
+   } else {
+   for (b_psize = 0; b_psize < MMU_PAGE_COUNT; b_psize++) {
+
+   /* valid entries have a shift value */
+   if (!mmu_psize_defs[b_psize].shift)
+   continue;
 
+   a_psize = __hpte_actual_psize(lp, b_psize);
+   if (a_psize != -1)
+   break;
+   }
+   }
+   /*
+* Ignore the top 14 bits of va
+* v have top two bits covering segment size, hence move
+* by 16 bits, Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
+* AVA field in v also have the lower 23 bits ignored.
+* For base page size 4K we need 14 .. 65 bits (so need to
+* collect extra 11 bits)
+* For others we need 14..14+i
+*/
+   /* This covers 14..54 bits of va*/
rb = (v & ~0x7fUL) << 16;   /* AVA field */
+   /*
+* AVA in v had cleared lower 23 bits. We need to derive
+* that from pteg index
+*/
va_low = pte_index >> 3;
if (v & HPTE_V_SECONDARY)
va_low = ~va_low;
-   /* xor vsid from AVA */
+   /*
+* get the vpn bits from va_low using reverse of hashing.
+* In v we have va with 23 bits dropped and then left shifted
+* HPTE_V_AVPN_SHIFT (7) bits. Now to find vsid we need
+* right shift it with (SID_SHIFT - (23 - 7))
+*/
if (!(v & HPTE_V_1TB_SEG))
-   va_low ^= v >> 12;
+   

RE: [PATCH v2 3/4] KVM: PPC: Allow kvmppc_get_last_inst() to fail

2014-05-06 Thread mihai.cara...@freescale.com
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Friday, May 02, 2014 12:55 PM
 To: Caraman Mihai Claudiu-B02008
 Cc: kvm-...@vger.kernel.org; k...@vger.kernel.org; linuxppc-
 d...@lists.ozlabs.org
 Subject: Re: [PATCH v2 3/4] KVM: PPC: Allow kvmppc_get_last_inst() to fail
 
 On 05/01/2014 02:45 AM, Mihai Caraman wrote:
...
  diff --git a/arch/powerpc/include/asm/kvm_ppc.h
 b/arch/powerpc/include/asm/kvm_ppc.h
  index 4096f16..6e7c358 100644
  --- a/arch/powerpc/include/asm/kvm_ppc.h
  +++ b/arch/powerpc/include/asm/kvm_ppc.h
  @@ -72,6 +72,8 @@ extern int kvmppc_sanity_check(struct kvm_vcpu
 *vcpu);
extern int kvmppc_subarch_vcpu_init(struct kvm_vcpu *vcpu);
extern void kvmppc_subarch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
  +extern int kvmppc_get_last_inst(struct kvm_vcpu *vcpu, u32 *inst);
 
 Phew. Moving this into a separate function sure has some performance
 implications. Was there no way to keep it in a header?
 
 You could just move it into its own .h file which we include after
 kvm_ppc.h. That way everything's available. That would also help me a
 lot with the little endian port where I'm also struggling with header
 file inclusion order and kvmppc_need_byteswap().

Great, I will do this.
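
A minimal sketch of that header-based approach (hypothetical file name and
body; it assumes the cached vcpu->arch.last_inst and the
kvmppc_need_byteswap() helper from the little-endian work mentioned above):

/* kvm_ppc_inst.h (hypothetical), included after kvm_ppc.h */
static inline int kvmppc_get_last_inst(struct kvm_vcpu *vcpu, u32 *inst)
{
	u32 fetched = vcpu->arch.last_inst;

	if (fetched == KVM_INST_FETCH_FAILED)
		return -ENOENT;	/* or an EMULATE_* code, per this thread */

	*inst = kvmppc_need_byteswap(vcpu) ? swab32(fetched) : fetched;
	return 0;
}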

  diff --git a/arch/powerpc/kvm/book3s_pr.c
 b/arch/powerpc/kvm/book3s_pr.c
  index c5c052a..b7fffd1 100644
  --- a/arch/powerpc/kvm/book3s_pr.c
  +++ b/arch/powerpc/kvm/book3s_pr.c
  @@ -608,12 +608,9 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu,
 ulong msr)
 
static int kvmppc_read_inst(struct kvm_vcpu *vcpu)
{
  -   ulong srr0 = kvmppc_get_pc(vcpu);
  -   u32 last_inst = kvmppc_get_last_inst(vcpu);
  -   int ret;
  +   u32 last_inst;
 
  -   ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false);
  -   if (ret == -ENOENT) {
  +   if (kvmppc_get_last_inst(vcpu, &last_inst) == -ENOENT) {
 
 ENOENT?

You have to tell us :) Why does kvmppc_ld() mix emulation_result
enumeration with generic errors? Do you want to change that and
use EMULATE_FAIL instead?

 
 ulong msr = vcpu->arch.shared->msr;
 
  msr = kvmppc_set_field(msr, 33, 33, 1);
  @@ -867,15 +864,18 @@ int kvmppc_handle_exit_pr(struct kvm_run *run,
 struct kvm_vcpu *vcpu,
  {
  enum emulation_result er;
  ulong flags;
  +   u32 last_inst;
 
program_interrupt:
   flags = vcpu->arch.shadow_srr1 & 0x1full;
   +   kvmppc_get_last_inst(vcpu, &last_inst);
 
 No check for the return value?

Should we queue a program exception and resume guest?

 
 
   if (vcpu->arch.shared->msr & MSR_PR) {
 #ifdef EXIT_DEBUG
   -   printk(KERN_INFO "Userspace triggered 0x700 exception
  at 0x%lx (0x%x)\n", kvmppc_get_pc(vcpu), kvmppc_get_last_inst(vcpu));
   +   pr_info("Userspace triggered 0x700 exception at\n"
   +   "0x%lx (0x%x)\n", kvmppc_get_pc(vcpu), last_inst);
 #endif
   -   if ((kvmppc_get_last_inst(vcpu) & 0xff0007ff) !=
   +   if ((last_inst & 0xff0007ff) !=
   (INS_DCBZ & 0xfffffff7)) {
  kvmppc_core_queue_program(vcpu, flags);
  r = RESUME_GUEST;
  @@ -894,7 +894,7 @@ program_interrupt:
  break;
  case EMULATE_FAIL:
   printk(KERN_CRIT "%s: emulation at %lx failed (%08x)\n",
   -  __func__, kvmppc_get_pc(vcpu), kvmppc_get_last_inst(vcpu));
   +  __func__, kvmppc_get_pc(vcpu), last_inst);
  kvmppc_core_queue_program(vcpu, flags);
  r = RESUME_GUEST;
  break;
  @@ -911,8 +911,12 @@ program_interrupt:
  break;
  }
  case BOOK3S_INTERRUPT_SYSCALL:
  +   {
  +   u32 last_sc;
  +
   +   kvmppc_get_last_sc(vcpu, &last_sc);
 
 No check for the return value?

The existing code does not handle KVM_INST_FETCH_FAILED. 
How should we continue if papr is enabled and last_sc fails?

 
   if (vcpu->arch.papr_enabled &&
   -   (kvmppc_get_last_sc(vcpu) == 0x44000022) &&
   +   (last_sc == 0x44000022) &&
   !(vcpu->arch.shared->msr & MSR_PR)) {
  /* SC 1 papr hypercalls */
  ulong cmd = kvmppc_get_gpr(vcpu, 3);
  @@ -957,6 +961,7 @@ program_interrupt:
  r = RESUME_GUEST;
  }
  break;
  +   }
  case BOOK3S_INTERRUPT_FP_UNAVAIL:
  case BOOK3S_INTERRUPT_ALTIVEC:
  case BOOK3S_INTERRUPT_VSX:
  @@ -985,15 +990,20 @@ program_interrupt:
  break;
  }
  case BOOK3S_INTERRUPT_ALIGNMENT:
  +   {
  +   u32 last_inst;
  +
  if (kvmppc_read_inst(vcpu) == EMULATE_DONE) {
  -   vcpu->arch.shared->dsisr = kvmppc_alignment_dsisr(vcpu,
  -   kvmppc_get_last_inst(vcpu));
  -   vcpu->arch.shared->dar = 

[PATCH 1/2 v2] bootmem/powerpc: Unify bootmem initialization

2014-05-06 Thread Emil Medve
Unify the low/highmem code paths in do_init_bootmem() by using the
lowmem-related variables/parameters even when the low/highmem split
is not needed (64-bit) or not configured. In such cases the lowmem
variables/parameters keep their meaning by referring to the memory
directly mapped by the kernel.

Signed-off-by: Emil Medve emilian.me...@freescale.com
---

v2: Rebased, no changes

 arch/powerpc/mm/mem.c | 36 
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 32202c9..eaf5d1d8 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -188,27 +188,31 @@ EXPORT_SYMBOL_GPL(walk_system_ram_range);
 void __init do_init_bootmem(void)
 {
unsigned long start, bootmap_pages;
-   unsigned long total_pages;
struct memblock_region *reg;
int boot_mapsize;
+   phys_addr_t _total_lowmem;
+   phys_addr_t _lowmem_end_addr;
 
-   max_low_pfn = max_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
-   total_pages = (memblock_end_of_DRAM() - memstart_addr) >> PAGE_SHIFT;
-#ifdef CONFIG_HIGHMEM
-   total_pages = total_lowmem >> PAGE_SHIFT;
-   max_low_pfn = lowmem_end_addr >> PAGE_SHIFT;
+#ifndef CONFIG_HIGHMEM
+   _lowmem_end_addr = memblock_end_of_DRAM();
+#else
+   _lowmem_end_addr = lowmem_end_addr;
 #endif
 
+   max_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
+   max_low_pfn = _lowmem_end_addr >> PAGE_SHIFT;
+   min_low_pfn = MEMORY_START >> PAGE_SHIFT;
+
/*
 * Find an area to use for the bootmem bitmap.  Calculate the size of
 * bitmap required as (Total Memory) / PAGE_SIZE / BITS_PER_BYTE.
 * Add 1 additional page in case the address isn't page-aligned.
 */
-   bootmap_pages = bootmem_bootmap_pages(total_pages);
+   _total_lowmem = _lowmem_end_addr - memstart_addr;
+   bootmap_pages = bootmem_bootmap_pages(_total_lowmem >> PAGE_SHIFT);
 
start = memblock_alloc(bootmap_pages << PAGE_SHIFT, PAGE_SIZE);
 
-   min_low_pfn = MEMORY_START >> PAGE_SHIFT;
boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT, 
min_low_pfn, max_low_pfn);
 
/* Place all memblock_regions in the same node and merge contiguous
@@ -219,26 +223,18 @@ void __init do_init_bootmem(void)
/* Add all physical memory to the bootmem map, mark each area
 * present.
 */
-#ifdef CONFIG_HIGHMEM
-   free_bootmem_with_active_regions(0, lowmem_end_addr >> PAGE_SHIFT);
+   free_bootmem_with_active_regions(0, max_low_pfn);
 
/* reserve the sections we're already using */
for_each_memblock(reserved, reg) {
-   unsigned long top = reg->base + reg->size - 1;
-   if (top < lowmem_end_addr)
+   if (reg->base + reg->size - 1 < _lowmem_end_addr)
reserve_bootmem(reg->base, reg->size, BOOTMEM_DEFAULT);
-   else if (reg->base < lowmem_end_addr) {
-   unsigned long trunc_size = lowmem_end_addr - reg->base;
+   else if (reg->base < _lowmem_end_addr) {
+   unsigned long trunc_size = _lowmem_end_addr - reg->base;
reserve_bootmem(reg->base, trunc_size, BOOTMEM_DEFAULT);
}
}
-#else
-   free_bootmem_with_active_regions(0, max_pfn);
 
-   /* reserve the sections we're already using */
-   for_each_memblock(reserved, reg)
-   reserve_bootmem(reg->base, reg->size, BOOTMEM_DEFAULT);
-#endif
/* XXX need to clip this if using highmem? */
sparse_memory_present_with_active_regions(0);
 
-- 
1.9.2

[PATCH 2/2 v2] powerpc: Enable NO_BOOTMEM

2014-05-06 Thread Emil Medve
Currently bootmem is just a wrapper around memblock. This gets rid of
the wrapper code, just as other arch(es) did: x86, arm, etc.

For now this only covers !NUMA systems/builds.

Signed-off-by: Emil Medve emilian.me...@freescale.com
---

v2: Acknowledge that NUMA systems/builds are not covered by this patch

 arch/powerpc/Kconfig  | 3 +++
 arch/powerpc/mm/mem.c | 8 
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e099899..07b164b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -475,6 +475,9 @@ config SYS_SUPPORTS_HUGETLBFS
 
 source mm/Kconfig
 
+config NO_BOOTMEM
+   def_bool !NUMA
+
 config ARCH_MEMORY_PROBE
def_bool y
depends on MEMORY_HOTPLUG
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index eaf5d1d8..d3e1d5f 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -187,10 +187,12 @@ EXPORT_SYMBOL_GPL(walk_system_ram_range);
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 void __init do_init_bootmem(void)
 {
+#ifndef CONFIG_NO_BOOTMEM
unsigned long start, bootmap_pages;
struct memblock_region *reg;
int boot_mapsize;
phys_addr_t _total_lowmem;
+#endif
phys_addr_t _lowmem_end_addr;
 
 #ifndef CONFIG_HIGHMEM
@@ -203,6 +205,7 @@ void __init do_init_bootmem(void)
max_low_pfn = _lowmem_end_addr >> PAGE_SHIFT;
min_low_pfn = MEMORY_START >> PAGE_SHIFT;
 
+#ifndef CONFIG_NO_BOOTMEM
/*
 * Find an area to use for the bootmem bitmap.  Calculate the size of
 * bitmap required as (Total Memory) / PAGE_SIZE / BITS_PER_BYTE.
@@ -214,12 +217,14 @@ void __init do_init_bootmem(void)
start = memblock_alloc(bootmap_pages << PAGE_SHIFT, PAGE_SIZE);
 
boot_mapsize = init_bootmem_node(NODE_DATA(0), start >> PAGE_SHIFT, 
min_low_pfn, max_low_pfn);
+#endif
 
/* Place all memblock_regions in the same node and merge contiguous
 * memblock_regions
 */
memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
 
+#ifndef CONFIG_NO_BOOTMEM
/* Add all physical memory to the bootmem map, mark each area
 * present.
 */
@@ -234,11 +239,14 @@ void __init do_init_bootmem(void)
reserve_bootmem(reg->base, trunc_size, BOOTMEM_DEFAULT);
}
}
+#endif
 
/* XXX need to clip this if using highmem? */
sparse_memory_present_with_active_regions(0);
 
+#ifndef CONFIG_NO_BOOTMEM
init_bootmem_done = 1;
+#endif
 }
 
 /* mark pages that don't exist as nosave */
-- 
1.9.2

[PATCH v2] powerpc: Use PFN_PHYS() to avoid truncating the physical address

2014-05-06 Thread Emil Medve
Signed-off-by: Emil Medve emilian.me...@freescale.com
---

v2: Rebased and updated due to upstream changes since v1

 arch/powerpc/include/asm/io.h  |  2 +-
 arch/powerpc/include/asm/page.h|  2 +-
 arch/powerpc/include/asm/pgalloc-32.h  |  2 +-
 arch/powerpc/include/asm/rtas.h|  3 ++-
 arch/powerpc/kernel/crash_dump.c   |  2 +-
 arch/powerpc/kernel/eeh.c  |  4 +---
 arch/powerpc/kernel/io-workarounds.c   |  2 +-
 arch/powerpc/kernel/pci-common.c   |  2 +-
 arch/powerpc/kernel/vdso.c |  6 +++---
 arch/powerpc/kvm/book3s_64_mmu_host.c  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  5 ++---
 arch/powerpc/kvm/book3s_hv.c   | 10 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c|  4 ++--
 arch/powerpc/kvm/e500_mmu_host.c   |  5 ++---
 arch/powerpc/mm/hugepage-hash64.c  |  2 +-
 arch/powerpc/mm/hugetlbpage-book3e.c   |  2 +-
 arch/powerpc/mm/hugetlbpage-hash64.c   |  2 +-
 arch/powerpc/mm/mem.c  |  9 -
 arch/powerpc/mm/numa.c | 13 +++--
 arch/powerpc/platforms/powernv/opal.c  |  2 +-
 arch/powerpc/platforms/pseries/iommu.c |  8 
 21 files changed, 43 insertions(+), 46 deletions(-)
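
Editor's note on the motivation: linux/pfn.h defines PFN_PHYS() as
((phys_addr_t)(x) << PAGE_SHIFT). On 32-bit parts with wider physical
addressing (e.g. e500 with CONFIG_PHYS_64BIT), shifting a pfn held in an
unsigned long overflows before it is ever widened to phys_addr_t
(hypothetical helper, illustration only):

static phys_addr_t pfn_shift_example(unsigned long pfn) /* e.g. 0x500000 */
{
	phys_addr_t bad  = pfn << PAGE_SHIFT;	/* 32-bit shift: truncates to 0 */
	phys_addr_t good = PFN_PHYS(pfn);	/* cast first: 0x5_0000_0000 */

	return good - bad;			/* non-zero for pfns above 4GB */
}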

diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 97d3869..8f7af05 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -790,7 +790,7 @@ static inline void * phys_to_virt(unsigned long address)
 /*
  * Change struct page to physical address.
  */
-#define page_to_phys(page)	((phys_addr_t)page_to_pfn(page) << PAGE_SHIFT)
+#define page_to_phys(page)	PFN_PHYS(page_to_pfn(page))
 
 /*
  * 32 bits still uses virt_to_bus() for it's implementation of DMA
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 32e4e21..7193d45 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -131,7 +131,7 @@ extern long long virt_phys_offset;
 #endif
 
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
-#define pfn_to_kaddr(pfn)	__va((pfn) << PAGE_SHIFT)
+#define pfn_to_kaddr(pfn)	__va(PFN_PHYS(pfn))
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
 /*
diff --git a/arch/powerpc/include/asm/pgalloc-32.h 
b/arch/powerpc/include/asm/pgalloc-32.h
index 842846c..3d19a8e 100644
--- a/arch/powerpc/include/asm/pgalloc-32.h
+++ b/arch/powerpc/include/asm/pgalloc-32.h
@@ -24,7 +24,7 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 #define pmd_populate_kernel(mm, pmd, pte)  \
(pmd_val(*(pmd)) = __pa(pte) | _PMD_PRESENT)
 #define pmd_populate(mm, pmd, pte) \
-   (pmd_val(*(pmd)) = (page_to_pfn(pte) << PAGE_SHIFT) | _PMD_PRESENT)
+   (pmd_val(*(pmd)) = PFN_PHYS(page_to_pfn(pte)) | _PMD_PRESENT)
 #define pmd_pgtable(pmd) pmd_page(pmd)
 #else
 #define pmd_populate_kernel(mm, pmd, pte)  \
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index b390f55..c19bd9f 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -3,6 +3,7 @@
 #ifdef __KERNEL__
 
 #include <linux/spinlock.h>
+#include <linux/pfn.h>
 #include <asm/page.h>
 
 /*
@@ -418,7 +419,7 @@ extern void rtas_take_timebase(void);
 #ifdef CONFIG_PPC_RTAS
 static inline int page_is_rtas_user_buf(unsigned long pfn)
 {
-   unsigned long paddr = (pfn << PAGE_SHIFT);
+   unsigned long paddr = PFN_PHYS(pfn);
if (paddr >= rtas_rmo_buf && paddr < (rtas_rmo_buf + RTAS_RMOBUF_MAX))
return 1;
return 0;
diff --git a/arch/powerpc/kernel/crash_dump.c b/arch/powerpc/kernel/crash_dump.c
index 7a13f37..a46a9c2 100644
--- a/arch/powerpc/kernel/crash_dump.c
+++ b/arch/powerpc/kernel/crash_dump.c
@@ -104,7 +104,7 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
return 0;
 
csize = min_t(size_t, csize, PAGE_SIZE);
-   paddr = pfn << PAGE_SHIFT;
+   paddr = PFN_PHYS(pfn);
 
if (memblock_is_region_memory(paddr, csize)) {
vaddr = __va(paddr);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 3764fb7..7f2ba3d 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -271,7 +271,6 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
 static inline unsigned long eeh_token_to_phys(unsigned long token)
 {
pte_t *ptep;
-   unsigned long pa;
int hugepage_shift;
 
/*
@@ -281,9 +280,8 @@ static inline unsigned long eeh_token_to_phys(unsigned long 
token)
if (!ptep)
return token;
WARN_ON(hugepage_shift);
-   pa = pte_pfn(*ptep) << PAGE_SHIFT;
 
-   return pa | (token & (PAGE_SIZE-1));
+   return PFN_PHYS(pte_pfn(*ptep)) | (token & (PAGE_SIZE-1));
 }
 
 /*
diff --git a/arch/powerpc/kernel/io-workarounds.c 
b/arch/powerpc/kernel/io-workarounds.c
index 24b968f..dd9a4a2 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ 

Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 21:38 +0530, Aneesh Kumar K.V wrote:

  I updated the commit message as below. Let me know if this is ok.
 
   KVM: PPC: BOOK3S: HV: THP support for guest
 
  This has nothing to do with THP.
 
 THP support in guest depends on KVM advertising MPSS feature. We already
 have rest of the changes needed to support transparent huge pages
 upstream. (We do support THP with PowerVM LPAR already). The primary
 motivation of this patch is to enable THP in powerkvm guest. 

I would argue (nit picking, I know ... :-) that the subject should be
Enable MPSS support for guests, and the description can then explain
that this allows Linux guests to use THP.

Cheers,
Ben.

 
   
   On recent IBM Power CPUs, while the hashed page table is looked up 
  using
   the page size from the segmentation hardware (i.e. the SLB), it is
   possible to have the HPT entry indicate a larger page size.  Thus for
   example it is possible to put a 16MB page in a 64kB segment, but since
   the hash lookup is done using a 64kB page size, it may be necessary to
   put multiple entries in the HPT for a single 16MB page.  This
   capability is called mixed page-size segment (MPSS).  With MPSS,
   there are two relevant page sizes: the base page size, which is the
   size used in searching the HPT, and the actual page size, which is the
   size indicated in the HPT entry. [ Note that the actual page size is
  always >= base page size ].
   
   We advertise MPSS feature to guest only if the host CPU supports the
   same. We use ibm,segment-page-sizes device tree node to advertise
   the MPSS support. The penc encoding indicate whether we support
   a specific combination of base page size and actual page size
   in the same segment. It is also the value used in the L|LP encoding
   of HPTE entry.
   
   In-order to support MPSS in guest, KVM need to handle the below 
  details
   * advertise MPSS via ibm,segment-page-sizes
   * Decode the base and actual page size correctly from the HPTE entry
 so that we know what we are dealing with in H_ENTER and and can do
 
  Which code path exactly changes for H_ENTER?
 
 There are no real code path changes. Any code path that uses
 hpte_page_size() is impacted. We return the actual page size there. 
 
 
 the appropriate TLB invalidation in H_REMOVE and evictions.
 
  Apart from the grammar (which is pretty broken for the part that is not 
  copied from Paul) and the subject line this sounds quite reasonable.
 
 
 Will try to fix.
 
 -aneesh



Re: [PATCH 2/2 v2] powerpc: Enable NO_BOOTMEM

2014-05-06 Thread Scott Wood
On Tue, 2014-05-06 at 13:48 -0500, Emil Medve wrote:
 Currently bootmem is just a wrapper around memblock. This gets rid of
 the wrapper code, just as other arch(es) did: x86, arm, etc.
 
 For now this only covers !NUMA systems/builds.
 
 Signed-off-by: Emil Medve emilian.me...@freescale.com
 ---
 
 v2: Acknowledge that NUMA systems/builds are not covered by this patch
 
  arch/powerpc/Kconfig  | 3 +++
  arch/powerpc/mm/mem.c | 8 
  2 files changed, 11 insertions(+)
 
 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index e099899..07b164b 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -475,6 +475,9 @@ config SYS_SUPPORTS_HUGETLBFS
  
  source mm/Kconfig
  
 +config NO_BOOTMEM
 + def_bool !NUMA

This will allow a user to manually turn on CONFIG_NO_BOOTMEM in the
presence of NUMA.  From the changelog it sounds like this is not what
you intended.

What are the issues with NUMA?  As is, you're not getting rid of wrapper
code -- only adding ifdefs.

-Scott



Re: [PATCH 2/2 v2] powerpc: Enable NO_BOOTMEM

2014-05-06 Thread Scott Wood
On Tue, 2014-05-06 at 16:49 -0500, Scott Wood wrote:
 On Tue, 2014-05-06 at 13:48 -0500, Emil Medve wrote:
  Currently bootmem is just a wrapper around memblock. This gets rid of
  the wrapper code, just as other arch(es) did: x86, arm, etc.
  
  For now this only covers !NUMA systems/builds.
  
  Signed-off-by: Emil Medve emilian.me...@freescale.com
  ---
  
  v2: Acknowledge that NUMA systems/builds are not covered by this patch
  
   arch/powerpc/Kconfig  | 3 +++
   arch/powerpc/mm/mem.c | 8 
   2 files changed, 11 insertions(+)
  
  diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
  index e099899..07b164b 100644
  --- a/arch/powerpc/Kconfig
  +++ b/arch/powerpc/Kconfig
  @@ -475,6 +475,9 @@ config SYS_SUPPORTS_HUGETLBFS
   
   source mm/Kconfig
   
  +config NO_BOOTMEM
  +   def_bool !NUMA
 
 This will allow a user to manually turn on CONFIG_NO_BOOTMEM in the
 presence of NUMA.  From the changelog it sounds like this is not what
 you intended.

Ignore this part -- I see it doesn't have an option string for it to
show up to the user.
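
For reference, the distinction being made: a Kconfig symbol is only
user-settable when it carries a prompt string (EXAMPLE_FEATURE below is
made up, for contrast):

config NO_BOOTMEM
	def_bool !NUMA			# no prompt: derived, not user-visible

config EXAMPLE_FEATURE
	bool "Enable example feature"	# prompt: shows up in menuconfig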

-Scott



Re: [PATCH 2/2 v2] powerpc: Enable NO_BOOTMEM

2014-05-06 Thread Emil Medve
Hello Scott,


On 05/06/2014 04:49 PM, Scott Wood wrote:
 On Tue, 2014-05-06 at 13:48 -0500, Emil Medve wrote:
 Currently bootmem is just a wrapper around memblock. This gets rid of
 the wrapper code, just as other arch(es) did: x86, arm, etc.

 For now this only covers !NUMA systems/builds.

 Signed-off-by: Emil Medve emilian.me...@freescale.com
 ---

 v2: Acknowledge that NUMA systems/builds are not covered by this patch

  arch/powerpc/Kconfig  | 3 +++
  arch/powerpc/mm/mem.c | 8 
  2 files changed, 11 insertions(+)

 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index e099899..07b164b 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -475,6 +475,9 @@ config SYS_SUPPORTS_HUGETLBFS
  
  source mm/Kconfig
  
 +config NO_BOOTMEM
 +def_bool !NUMA
 
 This will allow a user to manually turn on CONFIG_NO_BOOTMEM in the
 presence of NUMA.  From the changelog it sounds like this is not what
 you intended.
 
 What are the issues with NUMA?

Well, I don't have access to a NUMA box/board. I could enable NUMA for a
!NUMA board but I'd feel better if I could actually test/debug on a
relevant system

 As is, you're not getting rid of wrapper code -- only adding ifdefs.

First, you're talking about the bootmem initialization wrapper code for
powerpc. The actual bootmem code is in include/linux/bootmem.h and
mm/bootmem.c. We can't remove those files as they are still used by
other arches. Also, the word wrapper is somewhat imprecise as in powerpc
land bootmem sort of runs on top of memblock

When NO_BOOTMEM is configured, mm/nobootmem.c is used instead; it is the
bootmem API re-implemented on top of memblock. The bootmem API is
used in various places in the arch-independent code

This patch wants to isolate for removal the bootmem initialization code
for powerpc and to exclude mm/bootmem.c from being built. This being the
first step I didn't want to actually remove the code, so it will be easy
to debug if some issues crop up. Also, people that want to use the
bootmem code for some reason can easily do that. Once this change spends
some time in the tree, we can actually remove the bootmem initialization
code


Cheers,

Re: [PATCH 2/2 v2] powerpc: Enable NO_BOOTMEM

2014-05-06 Thread Scott Wood
On Tue, 2014-05-06 at 19:16 -0500, Emil Medve wrote:
 Hello Scott,
 
 
 On 05/06/2014 04:49 PM, Scott Wood wrote:
  On Tue, 2014-05-06 at 13:48 -0500, Emil Medve wrote:
  Currently bootmem is just a wrapper around memblock. This gets rid of
  the wrapper code, just as other arch(es) did: x86, arm, etc.
 
  For now this only covers !NUMA systems/builds.
 
  Signed-off-by: Emil Medve emilian.me...@freescale.com
  ---
 
  v2: Acknowledge that NUMA systems/builds are not covered by this patch
 
   arch/powerpc/Kconfig  | 3 +++
   arch/powerpc/mm/mem.c | 8 
   2 files changed, 11 insertions(+)
 
  diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
  index e099899..07b164b 100644
  --- a/arch/powerpc/Kconfig
  +++ b/arch/powerpc/Kconfig
  @@ -475,6 +475,9 @@ config SYS_SUPPORTS_HUGETLBFS
   
   source mm/Kconfig
   
  +config NO_BOOTMEM
  +  def_bool !NUMA
  
  This will allow a user to manually turn on CONFIG_NO_BOOTMEM in the
  presence of NUMA.  From the changelog it sounds like this is not what
  you intended.
  
  What are the issues with NUMA?
 
 Well, I don't have access to a NUMA box/board. I could enable NUMA for a
 !NUMA board but I'd feel better if I could actually test/debug on a
 relevant system

You could first test with NUMA on a non-NUMA board, and then if that
works ask the list for help testing on NUMA hardware (and various
non-Freescale non-NUMA hardware, for that matter).

Is there a specific issue that would need to be addressed to make it
work on NUMA?

  As is, you're not getting rid of wrapper code -- only adding ifdefs.
 
 First, you're talking about the bootmem initialization wrapper code for
 powerpc. The actual bootmem code is in include/linux/bootmem.h and
 mm/bootmem.c. We can't remove those files as they are still used by
 other arches. Also, the word wrapper is somewhat imprecise as in powerpc
 land bootmem sort of runs on top of memblock

My point was just that the changelog says This gets rid of wrapper
code when it actually removes no source code, and adds configuration
complexity.

 When NO_BOOTMEM is configured the mm/nobootmem.c is used that is the
 bootmem API actually re-implemented with memblock. The bootmem API is
 used in various places in the arch independent code
 
 This patch wants to isolate for removal the bootmem initialization code
 for powerpc and to exclude mm/bootmem.c from being built. This being the
 first step I didn't want to actually remove the code, so it will be easy
 to debug if some issues crop up. Also, people that want the use the
 bootmem code for some reason can easily do that. Once this change spends
 some time in the tree, we can actually remove the bootmem initialization
 code

Is there a plausible reason someone would want to use the bootmem
code?

While the ifdef it for a while approach is sometimes sensible, usually
it's better to just make the change rather than ifdef it.  Consider what
the code would look like if there were ifdefs for a ton of random
changes, half of which nobody ever bothered to go back and clean up
after the change got widespread testing.  Why is this patch risky enough
to warrant such an approach?  Shouldn't boot-time issues be pretty
obvious?

-Scott



[PATCH 1/1] booke/watchdog: refine and clean up the codes

2014-05-06 Thread Tang Yuantian
From: Tang Yuantian yuantian.t...@freescale.com

Basically, this patch does the following:
1. Move the code that parses the boot parameters from setup-common.c
   into the driver. This way, readers can see directly that there
   are boot parameters that can change the timeout.
2. Make the boot parameter 'booke_wdt_period' effective (see the
   usage example below). Currently, when the driver is loaded, the
   default timeout is always used instead of booke_wdt_period.
3. Wrap the watchdog timeout in the device struct and clean up
   unnecessary code.
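
For example, after this patch both parameters behave as documented when
passed on the kernel command line (illustrative values; note wdt_period
takes the raw TCR watchdog-period encoding, not seconds):

	wdt=1 wdt_period=30

With that, a later load of booke_wdt starts from the boot-time period
instead of CONFIG_BOOKE_WDT_DEFAULT_TIMEOUT.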

Signed-off-by: Tang Yuantian yuantian.t...@freescale.com
---
 arch/powerpc/kernel/setup-common.c | 27 
 drivers/watchdog/booke_wdt.c   | 51 --
 2 files changed, 33 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index bc76cc6..5874aef 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -715,33 +715,6 @@ static int powerpc_debugfs_init(void)
 arch_initcall(powerpc_debugfs_init);
 #endif
 
-#ifdef CONFIG_BOOKE_WDT
-extern u32 booke_wdt_enabled;
-extern u32 booke_wdt_period;
-
-/* Checks wdt=x and wdt_period=xx command-line option */
-notrace int __init early_parse_wdt(char *p)
-{
-   if (p && strncmp(p, "0", 1) != 0)
-   booke_wdt_enabled = 1;
-
-   return 0;
-}
-early_param("wdt", early_parse_wdt);
-
-int __init early_parse_wdt_period(char *p)
-{
-   unsigned long ret;
-   if (p) {
-   if (!kstrtol(p, 0, &ret))
-   booke_wdt_period = ret;
-   }
-
-   return 0;
-}
-early_param("wdt_period", early_parse_wdt_period);
-#endif /* CONFIG_BOOKE_WDT */
-
 void ppc_printk_progress(char *s, unsigned short hex)
 {
pr_info("%s\n", s);
diff --git a/drivers/watchdog/booke_wdt.c b/drivers/watchdog/booke_wdt.c
index a8dbceb3..08a7853 100644
--- a/drivers/watchdog/booke_wdt.c
+++ b/drivers/watchdog/booke_wdt.c
@@ -41,6 +41,28 @@ u32 booke_wdt_period = CONFIG_BOOKE_WDT_DEFAULT_TIMEOUT;
 #define WDTP_MASK  (TCR_WP_MASK)
 #endif
 
+/* Checks wdt=x and wdt_period=xx command-line option */
+notrace int __init early_parse_wdt(char *p)
+{
+   if (p && strncmp(p, "0", 1) != 0)
+   booke_wdt_enabled = 1;
+
+   return 0;
+}
+early_param("wdt", early_parse_wdt);
+
+int __init early_parse_wdt_period(char *p)
+{
+   unsigned long ret;
+   if (p) {
+   if (!kstrtol(p, 0, &ret))
+   booke_wdt_period = ret;
+   }
+
+   return 0;
+}
+early_param("wdt_period", early_parse_wdt_period);
+
 #ifdef CONFIG_PPC_FSL_BOOK3E
 
 /* For the specified period, determine the number of seconds
@@ -103,17 +125,18 @@ static unsigned int sec_to_period(unsigned int secs)
 static void __booke_wdt_set(void *data)
 {
u32 val;
+   struct watchdog_device *wdog = data;
 
val = mfspr(SPRN_TCR);
val &= ~WDTP_MASK;
-   val |= WDTP(booke_wdt_period);
+   val |= WDTP(sec_to_period(wdog->timeout));
 
mtspr(SPRN_TCR, val);
 }
 
-static void booke_wdt_set(void)
+static void booke_wdt_set(void *data)
 {
-   on_each_cpu(__booke_wdt_set, NULL, 0);
+   on_each_cpu(__booke_wdt_set, data, 0);
 }
 
 static void __booke_wdt_ping(void *data)
@@ -131,12 +154,13 @@ static int booke_wdt_ping(struct watchdog_device *wdog)
 static void __booke_wdt_enable(void *data)
 {
u32 val;
+   struct watchdog_device *wdog = data;
 
/* clear status before enabling watchdog */
__booke_wdt_ping(NULL);
val = mfspr(SPRN_TCR);
val &= ~WDTP_MASK;
-   val |= (TCR_WIE|TCR_WRC(WRC_CHIP)|WDTP(booke_wdt_period));
+   val |= (TCR_WIE|TCR_WRC(WRC_CHIP)|WDTP(sec_to_period(wdog->timeout)));
 
mtspr(SPRN_TCR, val);
 }
@@ -162,25 +186,17 @@ static void __booke_wdt_disable(void *data)
 
 }
 
-static void __booke_wdt_start(struct watchdog_device *wdog)
+static int booke_wdt_start(struct watchdog_device *wdog)
 {
-   on_each_cpu(__booke_wdt_enable, NULL, 0);
+   on_each_cpu(__booke_wdt_enable, wdog, 0);
pr_debug("watchdog enabled (timeout = %u sec)\n", wdog->timeout);
-}
 
-static int booke_wdt_start(struct watchdog_device *wdog)
-{
-   if (booke_wdt_enabled == 0) {
-   booke_wdt_enabled = 1;
-   __booke_wdt_start(wdog);
-   }
return 0;
 }
 
 static int booke_wdt_stop(struct watchdog_device *wdog)
 {
on_each_cpu(__booke_wdt_disable, NULL, 0);
-   booke_wdt_enabled = 0;
pr_debug("watchdog disabled\n");
 
return 0;
@@ -191,9 +207,8 @@ static int booke_wdt_set_timeout(struct watchdog_device 
*wdt_dev,
 {
if (timeout > MAX_WDT_TIMEOUT)
return -EINVAL;
-   booke_wdt_period = sec_to_period(timeout);
wdt_dev->timeout = timeout;
-   booke_wdt_set();
+   booke_wdt_set(wdt_dev);
 
return 0;
 }
@@ -231,10 +246,10 @@ static int __init booke_wdt_init(void)
pr_info(powerpc book-e watchdog 

Re: [PATCH] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on

2014-05-06 Thread Paul Mackerras
On Sun, May 04, 2014 at 10:56:08PM +0530, Aneesh Kumar K.V wrote:
 With debug option sleep inside atomic section checking enabled we get
 the below WARN_ON during a PR KVM boot. This is because upstream now
 have PREEMPT_COUNT enabled even if we have preempt disabled. Fix the
 warning by adding preempt_disable/enable around floating point and altivec
 enable.

This worries me a bit.  In this code:

   if (msr & MSR_FP) {
 + preempt_disable();
   enable_kernel_fp();
   load_fp_state(&vcpu->arch.fp);
   t->fp_save_area = &vcpu->arch.fp;
 + preempt_enable();

What would happen if we actually did get preempted at this point?
Wouldn't we lose the FP state we just loaded?

In other words, how come we're not already preempt-disabled at this
point?

Paul.