Re: [RFC PATCH v2 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2016-01-04 Thread Benjamin Herrenschmidt
On Mon, 2016-01-04 at 14:07 -0700, Alex Williamson wrote:
> On Thu, 2015-12-31 at 16:50 +0800, Yongji Xie wrote:
> > Current vfio-pci implementation disallows to mmap MSI-X
> > table in case that user get to touch this directly.
> > 
> > However, EEH mechanism can ensure that a given pci device
> > can only shoot the MSIs assigned for its PE. So we think
> > it's safe to expose the MSI-X table to userspace because
> > the exposed MSI-X table can't be used to do harm to other
> > memory space.
> > 
> > And with MSI-X table mmapped, some performance issues which
> > are caused when PCI adapters have critical registers in the
> > same page as the MSI-X table also can be resolved.
> > 
> > So this patch adds a Kconfig option, VFIO_PCI_MMAP_MSIX,
> > to support for mmapping MSI-X table.
> > 
> > Signed-off-by: Yongji Xie 
> > ---
> >  drivers/vfio/pci/Kconfig|4 
> >  drivers/vfio/pci/vfio_pci.c |6 --
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> > index 02912f1..67b0a2c 100644
> > --- a/drivers/vfio/pci/Kconfig
> > +++ b/drivers/vfio/pci/Kconfig
> > @@ -23,6 +23,10 @@ config VFIO_PCI_MMAP
> >     depends on VFIO_PCI
> >     def_bool y if !S390
> >  
> > +config VFIO_PCI_MMAP_MSIX
> > +   depends on VFIO_PCI_MMAP
> > +   def_bool y if EEH
> 
> Does CONFIG_EEH necessarily mean the EEH is enabled?  Could the
> system
> not support EEH or could EEH be disabled via kernel commandline
> options?

EEH is definitely the wrong thing to test here anyway. What needs to be
tested is that the PCI Host bridge supports filtering of MSIs, so
ideally this should be some kind of host bridge attribute set by the
architecture backend.

This can happen with or without CONFIG_EEH and you are right,
CONFIG_EEH can be enabled and the machine not support it.

Any IODA bridge will support this.

Cheers,
Ben.

> > +
> >  config VFIO_PCI_INTX
> >     depends on VFIO_PCI
> >     def_bool y if !S390
> > diff --git a/drivers/vfio/pci/vfio_pci.c
> > b/drivers/vfio/pci/vfio_pci.c
> > index 09b3805..d536985 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -555,7 +555,8 @@ static long vfio_pci_ioctl(void *device_data,
> >     IORESOURCE_MEM && (info.size >=
> > PAGE_SIZE ||
> >     pci_resource_page_aligned)) {
> >     info.flags |=
> > VFIO_REGION_INFO_FLAG_MMAP;
> > -   if (info.index == vdev->msix_bar)
> > {
> > +   if
> > (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> > +   info.index == vdev->msix_bar)
> > {
> >     ret =
> > msix_sparse_mmap_cap(vdev, &caps);
> >     if (ret)
> >     return ret;
> > @@ -967,7 +968,8 @@ static int vfio_pci_mmap(void *device_data,
> > struct vm_area_struct *vma)
> >     if (phys_len < PAGE_SIZE || req_start + req_len >
> > phys_len)
> >     return -EINVAL;
> >  
> > -   if (index == vdev->msix_bar) {
> > +   if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> > +   index == vdev->msix_bar) {
> >     /*
> >      * Disallow mmaps overlapping the MSI-X table;
> > users
> > don't
> >      * get to touch this directly.  We could find
> > somewhere
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2015-12-17 Thread Benjamin Herrenschmidt
On Thu, 2015-12-17 at 14:41 -0700, Alex Williamson wrote:
> > So I think it is safe to mmap/passthrough MSI-X table on PPC64
> > platform.
> > And I'm not sure whether other architectures can ensure these two 
> > points. 
> 
> There is another consideration, which is the API exposed to the user.
>  vfio currently enforces interrupt setup through ioctls by making the
> PCI mechanisms for interrupt programming inaccessible through the
> device regions.  Ignoring that you are only focused on PPC64 with QEMU,
> does it make sense for the vfio API to allow a user to manipulate
> interrupt programming in a way that not only will not work, but in a
> way that we expect to fail and require error isolation to recover from?
>  I can't say I'm fully convinced that a footnote in the documentation
> is sufficient for that.  Thanks,

Well, one could argue that the "isolation" provided by qemu here is
fairly weak anyway ;-)

I mean. .. how do you know the device doesn't have a backdoor path into
that table via some other MMIO registers anyway ?

In any case, the HW isolation on platforms like pseries means that the
worst the guest can do si shoot itself in the foot. Big deal. On the
other hand, not bothering with intercepting the table has benefits,
such as reducing the memory region clutter, but also removing all the
massive performacne problems we see because adapters have critical
registers in the same 64K page as the MSI-X table.

So I don't think there is any question here that we *need* that
functionality in power. The filtering of the table by Qemu doesn't
provide any practical benefit, it just gets in the way.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-11-19 Thread Benjamin Herrenschmidt
On Thu, 2015-11-19 at 23:38 +, David Woodhouse wrote:
> 
> I understand that POWER and other platforms don't currently have a
> clean way to indicate that certain device don't have translation. And I
> understand that we may end up with a *quirk* which ensures that the DMA
> API does the right thing (i.e. nothing) in certain cases.
> 
> But we should *NOT* be involving the virtio device drivers in that
> quirk, in any way. And putting a feature bit in the virtio device
> itself doesn't seem at all sane either.
> 
> Bear in mind that qemu-system-x86_64 currently has the *same* problem
> with assigned physical devices. It's claiming they're translated, and
> they're not.

It's not that clear but yeah ... as I mentioned, I can't find a
way to do that quirk that won't break when we want to actually use
the iommu... 

Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 20:46 -0800, Andy Lutomirski wrote:
> Me neither.  At least it wouldn't be a regression, but it's still
> crappy.
> 
> I think that arm is fine, at least.  I was unable to find an arm QEMU
> config that has any problems with my patches.

Ok, give me a few days for my headache & fever to subside see if I can
find something better. David, no idea from your side ? :-)

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> 
> > What about partition <-> partition virtio such as what we could do on
> > PAPR systems. That would have the weak barrier bit.
> >
> 
> Is it partition <-> partition, bypassing IOMMU?

No.

> I think I'd settle for just something that doesn't regress
> non-experimental setups that actually work today and that allow new
> setups (x86 with fixed QEMU and maybe something more complicated on
> powerpc and/or sparc) to work in all cases.
> 
> We could certainly just make powerpc and sparc continue bypassing the
> IOMMU until someone comes up with a way to fix it.  I'll send out some
> patches that do that, and maybe that'll help this make progress.

But we haven't found a solution that works. All we have come up with is
a quirk that will force bypass on virtio always and will not allow us
to operate non-bypassing devices on either of those architectures in
the future.

I'm not too happy about this.

Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
> 
> Does that work on powerpc on existing kernels?
> 
> Anyway, here's another crazy idea: make the quirk assume that the
> IOMMU is bypasses if and only if the weak barriers bit is set on
> systems that are missing the new DT binding.

"New DT bindings" doesn't mean much ... how do we change DT bindings on
existing machines with a FW in flash ?

What about partition <-> partition virtio such as what we could do on
PAPR systems. That would have the weak barrier bit.

Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 14:43 +0200, Michael S. Tsirkin wrote:
> But not virtio-pci I think - that's broken for that usecase since we use
> weaker barriers than required for real IO, as these have measureable
> overhead.  We could have a feature "is a real PCI device",
> that's completely reasonable.

Do we use weaker barriers on the Linux driver side ? I didn't think so
... 

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 11:27 +0100, Joerg Roedel wrote:
> 
> You have the same problem when real PCIe devices appear that speak
> virtio. I think the only real (still not very nice) solution is to add a
> quirk to powerpc platform code that sets noop dma-ops for the existing
> virtio vendor/device-ids and add a DT property to opt-out of that quirk.
>
> New vendor/device-ids (as for real devices) would just not be covered by
> the quirk and existing emulated devices continue to work.

Why woud real devices use new vendor/device IDs ? Also there are other
cases such as using virtio between 2 partitions, which we could do
under PowerVM ... that would require proper iommu usage with existing
IDs.

> The absence of the property just means that the quirk is in place and
> the system assumes no translation for virtio devices.

The only way that works forward for me (and possibly sparc & others,
what about ARM ?) is if we *change* something in virtio qemu at the
same time as we add some kind of property. For example the ProgIf field
or revision ID field.

That way I can key on that change.

It's still tricky because I would have to somewhat tell my various firmwares
(SLOF, OpenBIOS, OPAL, ...) so they can create the appropriate property, it's
still hacky, but it would be workable.

Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Tue, 2015-11-10 at 10:45 +0100, Knut Omang wrote:
> Can something be done by means of PCIe capabilities?
> ATS (Address Translation Support) seems like a natural choice?

Euh no... ATS is something else completely

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> 
> We could do it the other way around: on powerpc, if a PCI device is in
> that range and doesn't have the "bypass" property at all, then it's
> assumed to bypass the IOMMU.  This means that everything that
> currently works continues working.  If someone builds a physical
> virtio device or uses another system in PCIe target mode speaking
> virtio, then it won't work until they upgrade their firmware to set
> bypass=0.  Meanwhile everyone using hypothetical new QEMU also gets
> bypass=0 and no ambiguity.
>
> vfio will presumably notice the bypass and correctly refuse to map any
> current virtio devices.
> 
> Would that work?

That would be extremely strange from a platform perspective. Any device
in that vendor/device range would bypass the iommu unless some new
property "actually-works-like-a-real-pci-device" happens to exist in
the device-tree, which we would then need to define somewhere and
handle accross at least 3 different platforms who get their device-tree 
from widly different places.

Also if tomorrow I create a PCI device that implements virtio-net and
put it in a machine running IBM proprietary firmware (or Apple's or
Sun's), it won't have that property...

This is not hypothetical. People are using virtio to do point-to-point
communication between machines via PCIe today.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU.  Can we keep this specific to powerpc and
> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
> 
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

You don't like performances ? :-)

Cheers,
Ben.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
> 
> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF.
> */
> static const struct pci_device_id virtio_pci_id_table[] = {
>     { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>     { 0 }
> };
> 
> Can we match on that range?

We can, but the problem remains, how do we differenciate an existing
device that does bypass vs. a newer one that needs the IOMMU and thus
doesn't have the new "bypass" property in the device-tree.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
> The problem here is that in some of the problematic cases the virtio
> driver may not even be loaded.  If someone runs an L1 guest with an
> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
> *boom* L1 crashes.  (Same if, say, DPDK gets used, I think.)
> 
> >
> > The only way out of this while keeping the "platform" stuff would be to
> > also bump some kind of version in the virtio config (or PCI header). I
> > have no other way to differenciate between "this is an old qemu that
> > doesn't do the 'bypass property' yet" from "this is a virtio device
> > that doesn't bypass".
> >
> > Any better idea ?
> 
> I'd suggest that, in the absence of the new DT binding, we assume that
> any PCI device with the virtio vendor ID is passthrough on powerpc.  I
> can do this in the virtio driver, but if it's in the platform code
> then vfio gets it right too (i.e. fails to load).

The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor
ID which will be used by more than just virtio, so we need to
specifically list the devices.

Additionally, that still means that once we have a virtio device that
actually uses the iommu, powerpc will not work since the "workaround"
above will kick in.

The "in absence of the new DT binding" doesn't make that much sense.

Those platforms use device-trees defined since the dawn of ages by
actual open firmware implementations, they either have no iommu
representation in there (Macs, the platform code hooks it all up) or
have various properties related to the iommu but no concept of "bypass"
in there.

We can *add* a new property under some circumstances that indicates a
bypass on a per-device basis, however that doesn't completely solve it:

  - As I said above, what does the absence of that property mean ? An
old qemu that does bypass on all virtio or a new qemu trying to tell
you that the virtio device actually does use the iommu (or some other
environment that isn't qemu) ?

  - On things like macs, the device-tree is generated by openbios, it
would have to have some added logic to try to figure that out, which
means it needs to know *via different means* that some or all virtio
devices bypass the iommu.

I thus go back to my original statement, it's a LOT easier to handle if
the device itself is self describing, indicating whether it is set to
bypass a host iommu or not. For L1->L2, well, that wouldn't be the
first time qemu/VFIO plays tricks with the passed through device
configuration space...

Note that the above can be solved via some kind of compromise: The
device self describes the ability to honor the iommu, along with the
property (or ACPI table entry) that indicates whether or not it does.

IE. We could use the revision or ProgIf field of the config space for
example. Or something in virtio config. If it's an "old" device, we
know it always bypass. If it's a new device, we know it only bypasses
if the corresponding property is in. I still would have to sort out the
openbios case for mac among others but it's at least a workable
direction.

BTW. Don't you have a similar problem on x86 that today qemu claims
that everything honors the iommu in ACPI ?

Unless somebody can come up with a better idea...

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-09 Thread Benjamin Herrenschmidt
So ...

I've finally tried to sort that out for powerpc and I can't find a way
to make that work that isn't a complete pile of stinking shit.

I'm very tempted to go back to my original idea: virtio itself should
indicate it's "bypassing ability" via the virtio config space or some
other bit (like the ProgIf of the PCI header etc...).

I don't see how I can make it work otherwise.

The problem with the statement "it's a platform matter" is that:

  - It's not entirely correct. It's actually a feature of a specific
virtio implementation (qemu's) that it bypasses all the platform IOMMU
facilities.

  - The platforms (on powerpc there's at least 3 or 4 that have qemu
emulation and support some form of PCI) don't have an existing way to
convey the information that a device bypasses the IOMMU (if any).

  - Even if we add such a mechanism (new property in the device-tree),
we end up with a big turd: Because we need to be compatible with older
qemus, we essentially need a quirk that will make all virtio devices
assume that property is present. That will of course break whenever we
try to use another implementation of virtio on powerpc which doesn't
bypass the iommu.

We don't have a way to differenciate a virtio device that does the
bypass from one that doesn't and the backward compatibility requirement
forces us to treat all existing virtio devices as doing bypass.

The only way out of this while keeping the "platform" stuff would be to
also bump some kind of version in the virtio config (or PCI header). I
have no other way to differenciate between "this is an old qemu that
doesn't do the 'bypass property' yet" from "this is a virtio device
that doesn't bypass".

Any better idea ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-03 Thread Benjamin Herrenschmidt
On Tue, 2015-11-03 at 14:11 +0100, Christoph Hellwig wrote:
> > xHCI for example, vs. something like 10G ethernet... but yes I agree it
> > sucks. I don't like that sort of policy anywhere in drivers. On the
> > other hand the platform doesn't have much information to make that sort
> > of decision either.
> 
> Mabye because it should simply use what's optimal?  E.g. passthrough
> whenever possible, where arguments against possible are:  dma_mask, vfio
> requirements, kernel command line option. 

Right this is what I do today on powerpc with the exception of
the command line option.

>  This is what a lot of
> architectures already do, I remember the SGI Origin / Altix code has the
> same behavior as well.  Those IOMMUs already had the 64 bit passthrough
> and 32-bit sliding window in addition to the real IOMMU 10 years ago.
> --
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional
> > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > achieved via using a bypass or other means in the backend not the
> > business of the driver.
> > 
> 
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes I agree it
sucks. I don't like that sort of policy anywhere in drivers. On the
other hand the platform doesn't have much information to make that sort
of decision either.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 22:48 +, David Woodhouse wrote:
> 
> In the Intel case, the mapping setup is entirely per-device (except for
> crap devices and devices behind a PCIe-PCI bridge, etc.).
> 
> So you can happily have a passthrough mapping for *one* device, without
> making that same mapping available to another device. You can make that
> trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device ...",
the problem reamains that it doesn't buy you much to do it *per
-mapping* which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough
> for *all* devices (except those whose dma_mask can't reach all of
> physical memory). But I could live with something like Shamir is
> proposing, which lets us do the bypass only for performance-sensitive
> devices which actually *ask* for it (and of course we'd have a system-
> wide mode which declines that request and does normal mappings anyway).
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> wrote:
> > 
> > Chosing on a per-mapping basis *in the back end* might still make
> > some
> 
> In my case, choosing mapping based on the hardware that will use this
> mappings makes more sense. Most hardware are not that performance 
> sensitive as the Infiniband hardware.

 ...

> The driver know for what hardware it is mapping the memory so it know 
> if the memory will be used by performance sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional
hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
achieved via using a bypass or other means in the backend not the
business of the driver.

> In your case, what will give the better performance - 1:1 mapping or
> IOMMU
> mapping? When you say 'relaxing the protection' you refer to 1:1
> mapping?

> Also, how this 1:1 window address the security concerns that other
> raised
> by other here?

It will partially only but it's just an example of another way the
bakcend could provide some improved performances without a bypass.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote:
> To summary -
> 
> 1. The whole point of the IOMMU pass through was to get bigger address space
> and faster map/unmap operations for performance critical hardware
> 2. SPARC IOMMU in particular has the ability to DVMA which adress all the 
> protection concerns raised above. Not sure what will be the 
> performance
> impact though. This still need a lot of work before we could test 
> this.
> 3. On x86 we use IOMMU in pass through mode so all the above concerns are 
> valid
> 
> The question are -
> 
> 1. Does partial use of IOMMU while the pass through window is enabled add some
> protection?
> 2. Do we rather the x86 way of doing this which is enable / disable IOMMU 
> translations at kernel level?
> 
> I think that I can live with option (2) till I have DVMA if there is strong
> disagree on the need for per allocation IOMMU bypass.

Chosing on a per-mapping basis *in the back end* might still make some
amount of sense. What I don't completely grasp is what does it give
you to expose that choice to the *driver* all the way up the chain. Why
does the driver knows better whether something should use the bypass or
not ?

I can imagine some in-between setups, for example, on POWER (and
probably x86), I could setup a window that is TCE-mapped (TCEs are our
iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
map to the same physical page. I could then use map/unmap to adjust the
protection, the idea being that only "relaxing" the protection requires
flushing the IO TLB, ie, we could delay most flushes.

But that sort of optimization only makes sense in the back-end.

So what was your original idea where you thought the driver was the
right one to decide whether to use the bypass or not for a given map
operation ? That's what I don't grasp... you might have a valid case
that I just fail to see.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Benjamin Herrenschmidt
On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> Not sure this use case is possible for Infiniband where application hold
> the data buffers and there is no way to force application to re use the 
> buffer as suggested.
> 
> This is why I think there will be no easy way to bypass the DMA mapping cost
> for all use cases unless we allow applications to request bypass/pass through
> DMA mapping (implicitly or explicitly).

But but but ...

What I don't understand is how that brings you any safety.

Basically, either your bridge has a bypass window, or it doesn't. (Or
it has one and it's enabled or not, same thing).

If you request the bypass on a per-mapping basis, you basically have to
keep the window always enabled, unless you do some nasty refcounting of
how many people have a bypass mapping in flight, but that doesn't seem
useful.

Thus you have already lost all protection from the device, since your
entire memory is accessible via the bypass mapping.

Which means what is the point of then having non-bypass mappings for
other things ? Just to make things slower ?

I really don't see what this whole "bypass on a per-mapping basis" buys
you.

Note that we implicitly already do that on powerpc, but not for those
reasons, we do it based on the DMA mask, so that if your coherent mask
is 32-bit but your dma mask is 64-bit (which is not an uncommon
combination), we service the "coherent" requests (basically the long
lifed dma allocs) from the remapped region and the "stream" requests
(ie, map_page/map_sg) from the bypass.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-30 Thread Benjamin Herrenschmidt
On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> > 
> > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes
> > > across
> > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have
> > > a
> > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > -paranoid mode where it's not honoured).
> > 
> > Possibly, though ideally that would be a user policy but of course
> > by
> > the time you get to userspace it's generally too late.
> 
> IIRC, we have an 'iommu=force' command line switch for this, to
> ensure
> that no device can use a linear mapping and everything goes th ough
> the page tables. This is often useful for both debugging and as a
> security measure when dealing with unpriviledged DMA access (virtual
> machines, vfio, ...).

That was used to force-enable the iommu on platforms like G5s where we
would otherwise only do so if the memory was larger than 32-bit but we
never implemented using it to prevent the bypass region.

> If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly
> document
> which osed to force-enable the iommu on HGthe two we expect to take
> priority in cases where we have a
> choice.
>
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

The interesting thing, if we can make it work, is the bypass attribute
being per mapping... 

Ben. 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Benjamin Herrenschmidt
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
>  wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top us setup as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

Possibly, the pools we have already limit the lock contention but we
still have the map/unmap overhead which under a hypervisor can be quite
high. I'm not necessarily against changing the way we do things but it
would have to be backed up with numbers.

Cheers,
Ben.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Benjamin Herrenschmidt
On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> 
> > On Power, I generally have 2 IOMMU windows for a device, one at the
> > bottom is remapped, and is generally used for 32-bit devices and the
> > one at the top us setup as a bypass 
> 
> So in the normal case of decent 64-bit devices (and not in a VM),
> they'll *already* be using the bypass region and have full access to
> all of memory, all of the time? And you have no protection against
> driver and firmware bugs causing stray DMA?

Correct, we chose to do that for performance reasons.

> > I don't see how thata ttribute would work for us.
> 
> Because you're already doing it anyway without being asked :)
> 
> If SPARC and POWER are both doing that, perhaps we should change the
> default for Intel too?
> 
> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

Correct, the problem is that the cost of doing map/unmap from a guest
is really a huge hit on things like network devices.

Another problem is that the failure mode isn't great if you don't pin. 

IE. You have to pin pages as they get mapped into the iommu by the
guest, but you don't know in advance how much and you may hit the
process ulimit on pinned pages half way through.

We tried to address that in various ways but it always ended up horrid.

> Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> sane meaning in the paranoid mode (and perhaps we'd want an ultra
> -paranoid mode where it's not honoured).

Possibly, though ideally that would be a user policy but of course by
the time you get to userspace it's generally too late.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Benjamin Herrenschmidt
On Wed, 2015-10-28 at 22:31 +0900, David Woodhouse wrote:
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis;
> we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

On Power, I generally have 2 IOMMU windows for a device, one at the
bottom is remapped, and is generally used for 32-bit devices and the
one at the top us setup as a bypass (or in the case of KVM, as a linear
mapping of guest memory which looks the same as a bypass to the guest).

The DMA ops will automatically hit the appropriate one based on the
DMA mask.

I don't see how that attribute would work for us.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Benjamin Herrenschmidt
On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> We have discussed that at kernel summit. I will try to implement a dummy 
> dma_ops for
> s390 that does 1:1 mapping and Ben will look into doing some quirk to handle 
> "old"
> code in addition to also make it possible to mark devices as iommu bypass 
> (IIRC,
> via device tree, Ben?)

Something like that yes. I'll look into it when I'm back home.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM memory slots limit on powerpc

2015-09-04 Thread Benjamin Herrenschmidt
On Fri, 2015-09-04 at 12:28 +0200, Thomas Huth wrote:

> > Maybe some rcu protected scheme that doubles the amount of memslots
> > for
> > each overrun? Yes, that would be good and even reduce the footprint
> > for
> > systems with only a small number of memslots.
> 
> Seems like Alex Williamson already posted a patchset for growable
> memslots a couple of years ago:
> 
>  http://www.spinics.net/lists/kvm/msg50491.html
> 
> But I didn't quite spot the result in that thread why it never has 
> been
> included upstream. Alex (W.), do you remember the outcome?

Isn't the memslot array *already* protected by RCU anyway ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt
On Wed, 2015-09-02 at 08:45 +1000, Paul Mackerras wrote:
> On Wed, Sep 02, 2015 at 08:25:05AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote:
> > > The size of the Problem State Priority Boost Register is only
> > > 32 bits, so let's change the type of the corresponding variable
> > > accordingly to avoid future trouble.
> > 
> > It's not future trouble, it's broken today for LE and this should
> > fix
> > it BUT 
> 
> No, it's broken today for BE hosts, which will always see 0 for the
> PSPB register value.  LE hosts are fine.

0 or PSPB << 32 ?

> > The asm accesses it using lwz/stw and C accesses it as a ulong. On
> > LE
> > that will mean that userspace will see the value << 32
> 
> No, that will happen on BE, and since KVM_REG_PPC_PSPB says it's a
> 32-bit register, we'll just pass 0 back to userspace when it reads
> it.

Ah ok, I missed that bit about KVM_REG_PPC_PSPB

> > Now "fixing" it might break migration if that field is already
> > stored/loaded in its "broken" form. We may have to keep the
> > "broken"
> > behaviour and document that qemu sees a value shifted by 32.
> 
> It will be being set to 0 on BE hosts across migration today
> (fortunately 0 is a benign value for PSPB).  If we fix this on both
> the source and destination host, then the value will get migrated
> across correctly.

Ok, I missed the part where KVM_REG_PPC_PSPB passed it down as a 32
-bit. That means Thomas patch should work indeed.

> I think Thomas's patch is fine, it just needs a stronger patch
> description saying that it fixes an actual bug.

Right.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt
On Wed, 2015-09-02 at 08:24 +1000, Paul Mackerras wrote:
> On Tue, Sep 01, 2015 at 11:41:18PM +0200, Thomas Huth wrote:
> > The size of the Problem State Priority Boost Register is only
> > 32 bits, so let's change the type of the corresponding variable
> > accordingly to avoid future trouble.
> 
> Since we're already using lwz/stw in the assembly code in
> book3s_hv_rmhandlers.S, this is actually a bug fix, isn't it?
> How did you find it?  Did you observe a failure of some kind, or did
> you just find it by code inspection?

Won't the fix break migration ? Unless qemu doens't migrate it ...

> Paul.
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: ppc: Fix size of the PSPB register

2015-09-01 Thread Benjamin Herrenschmidt
On Tue, 2015-09-01 at 23:41 +0200, Thomas Huth wrote:
> The size of the Problem State Priority Boost Register is only
> 32 bits, so let's change the type of the corresponding variable
> accordingly to avoid future trouble.

It's not future trouble, it's broken today for LE and this should fix
it BUT 

The asm accesses it using lwz/stw and C accesses it as a ulong. On LE
that will mean that userspace will see the value << 32

Now "fixing" it might break migration if that field is already
stored/loaded in its "broken" form. We may have to keep the "broken"
behaviour and document that qemu sees a value shifted by 32.

Cheers,
Ben.

> Signed-off-by: Thomas Huth 
> ---
>  arch/powerpc/include/asm/kvm_host.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index d91f65b..c825f3a 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -473,7 +473,7 @@ struct kvm_vcpu_arch {
>   ulong ciabr;
>   ulong cfar;
>   ulong ppr;
> - ulong pspb;
> + u32 pspb;
>   ulong fscr;
>   ulong shadow_fscr;
>   ulong ebbhr;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/25] powerpc: Use bool function return values of true/false not 1/0

2015-03-30 Thread Benjamin Herrenschmidt
On Mon, 2015-03-30 at 16:46 -0700, Joe Perches wrote:
> Use the normal return values for bool functions

Acked-by: Benjamin Herrenschmidt 

Should we merge it or will you ?

Cheers,
Ben.

> Signed-off-by: Joe Perches 
> ---
>  arch/powerpc/include/asm/dcr-native.h| 2 +-
>  arch/powerpc/include/asm/dma-mapping.h   | 4 ++--
>  arch/powerpc/include/asm/kvm_book3s_64.h | 4 ++--
>  arch/powerpc/sysdev/dcr.c| 2 +-
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dcr-native.h 
> b/arch/powerpc/include/asm/dcr-native.h
> index 7d2e623..4efc11d 100644
> --- a/arch/powerpc/include/asm/dcr-native.h
> +++ b/arch/powerpc/include/asm/dcr-native.h
> @@ -31,7 +31,7 @@ typedef struct {
>  
>  static inline bool dcr_map_ok_native(dcr_host_native_t host)
>  {
> - return 1;
> + return true;
>  }
>  
>  #define dcr_map_native(dev, dcr_n, dcr_c) \
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 894d538..9103687 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -191,11 +191,11 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>   struct dev_archdata *sd = &dev->archdata;
>  
>   if (sd->max_direct_dma_addr && addr + size > sd->max_direct_dma_addr)
> - return 0;
> + return false;
>  #endif
>  
>   if (!dev->dma_mask)
> - return 0;
> + return false;
>  
>   return addr + size - 1 <= *dev->dma_mask;
>  }
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 2d81e20..2a244bf 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -335,7 +335,7 @@ static inline bool hpte_read_permission(unsigned long pp, 
> unsigned long key)
>  {
>   if (key)
>   return PP_RWRX <= pp && pp <= PP_RXRX;
> - return 1;
> + return true;
>  }
>  
>  static inline bool hpte_write_permission(unsigned long pp, unsigned long key)
> @@ -373,7 +373,7 @@ static inline bool slot_is_aligned(struct kvm_memory_slot 
> *memslot,
>   unsigned long mask = (pagesize >> PAGE_SHIFT) - 1;
>  
>   if (pagesize <= PAGE_SIZE)
> - return 1;
> + return true;
>   return !(memslot->base_gfn & mask) && !(memslot->npages & mask);
>  }
>  
> diff --git a/arch/powerpc/sysdev/dcr.c b/arch/powerpc/sysdev/dcr.c
> index 2d8a101..121e26f 100644
> --- a/arch/powerpc/sysdev/dcr.c
> +++ b/arch/powerpc/sysdev/dcr.c
> @@ -54,7 +54,7 @@ bool dcr_map_ok_generic(dcr_host_t host)
>   else if (host.type == DCR_HOST_MMIO)
>   return dcr_map_ok_mmio(host.host.mmio);
>   else
> - return 0;
> + return false;
>  }
>  EXPORT_SYMBOL_GPL(dcr_map_ok_generic);
>  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: PPC: Remove page table walk helpers

2015-03-29 Thread Benjamin Herrenschmidt
On Mon, 2015-03-30 at 10:39 +0530, Aneesh Kumar K.V wrote:
> This patch remove helpers which we had used only once in the code.
> Limiting page table walk variants help in ensuring that we won't
> end up with code walking page table with wrong assumptions.
> 
> Signed-off-by: Aneesh Kumar K.V 

Alex, it would be preferable to have this (and the previous one) in the
powerpc tree due to dependencies with further fixes to our page table
walking vs. THP, so if you're happy, just give us an ack.

Cheers,
Ben.

> ---
>  arch/powerpc/include/asm/pgtable.h  | 21 -
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c | 62 
> -
>  arch/powerpc/kvm/e500_mmu_host.c|  2 +-
>  3 files changed, 28 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 9835ac4173b7..92fe01c355a9 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -249,27 +249,6 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>  #endif
>  pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>unsigned *shift);
> -
> -static inline pte_t *lookup_linux_ptep(pgd_t *pgdir, unsigned long hva,
> -  unsigned long *pte_sizep)
> -{
> - pte_t *ptep;
> - unsigned long ps = *pte_sizep;
> - unsigned int shift;
> -
> - ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
> - if (!ptep)
> - return NULL;
> - if (shift)
> - *pte_sizep = 1ul << shift;
> - else
> - *pte_sizep = PAGE_SIZE;
> -
> - if (ps > *pte_sizep)
> - return NULL;
> -
> - return ptep;
> -}
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
> b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 625407e4d3b0..73e083cb9f7e 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -131,25 +131,6 @@ static void remove_revmap_chain(struct kvm *kvm, long 
> pte_index,
>   unlock_rmap(rmap);
>  }
>  
> -static pte_t lookup_linux_pte_and_update(pgd_t *pgdir, unsigned long hva,
> -   int writing, unsigned long *pte_sizep)
> -{
> - pte_t *ptep;
> - unsigned long ps = *pte_sizep;
> - unsigned int hugepage_shift;
> -
> - ptep = find_linux_pte_or_hugepte(pgdir, hva, &hugepage_shift);
> - if (!ptep)
> - return __pte(0);
> - if (hugepage_shift)
> - *pte_sizep = 1ul << hugepage_shift;
> - else
> - *pte_sizep = PAGE_SIZE;
> - if (ps > *pte_sizep)
> - return __pte(0);
> - return kvmppc_read_update_linux_pte(ptep, writing, hugepage_shift);
> -}
> -
>  static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v)
>  {
>   asm volatile(PPC_RELEASE_BARRIER "" : : : "memory");
> @@ -166,10 +147,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
> flags,
>   struct revmap_entry *rev;
>   unsigned long g_ptel;
>   struct kvm_memory_slot *memslot;
> - unsigned long pte_size;
> + unsigned hpage_shift;
>   unsigned long is_io;
>   unsigned long *rmap;
> - pte_t pte;
> + pte_t *ptep;
>   unsigned int writing;
>   unsigned long mmu_seq;
>   unsigned long rcbits;
> @@ -208,22 +189,33 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long 
> flags,
>  
>   /* Translate to host virtual address */
>   hva = __gfn_to_hva_memslot(memslot, gfn);
> + ptep = find_linux_pte_or_hugepte(pgdir, hva, &hpage_shift);
> + if (ptep) {
> + pte_t pte;
> + unsigned int host_pte_size;
>  
> - /* Look up the Linux PTE for the backing page */
> - pte_size = psize;
> - pte = lookup_linux_pte_and_update(pgdir, hva, writing, &pte_size);
> - if (pte_present(pte) && !pte_protnone(pte)) {
> - if (writing && !pte_write(pte))
> - /* make the actual HPTE be read-only */
> - ptel = hpte_make_readonly(ptel);
> - is_io = hpte_cache_bits(pte_val(pte));
> - pa = pte_pfn(pte) << PAGE_SHIFT;
> - pa |= hva & (pte_size - 1);
> - pa |= gpa & ~PAGE_MASK;
> - }
> + if (hpage_shift)
> + host_pte_size = 1ul << hpage_shift;
> + else
> + host_pte_size = PAGE_SIZE;
> + /*
> +  * We should always find the guest page size
> +  * to <= host page size, if host is using hugepage
> +  */
> + if (host_pte_size < psize)
> + return H_PARAMETER;
>  
> - if (pte_size < psize)
> - return H_PARAMETER;
> + pte = kvmppc_read_update_linux_pte(ptep, writing, hpage_shift);
> + if (pte_present(pte) && !pte_protnone(pte)) {
> + i

Re: [PATCH v5 25/29] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks

2015-03-11 Thread Benjamin Herrenschmidt
On Wed, 2015-03-11 at 19:54 +1100, Alexey Kardashevskiy wrote:

> > +/* Page size flags for ibm,query-pe-dma-window */
> > +#define DDW_PGSIZE_4K   0x01
> > +#define DDW_PGSIZE_64K  0x02
> > +#define DDW_PGSIZE_16M  0x04
> > +#define DDW_PGSIZE_32M  0x08
> > +#define DDW_PGSIZE_64M  0x10
> > +#define DDW_PGSIZE_128M 0x20
> > +#define DDW_PGSIZE_256M 0x40
> > +#define DDW_PGSIZE_16G  0x80
> > +#define DDW_PGSIZE_MASK 0xFF
> > +
> >   struct iommu_table_group {
> >   #ifdef CONFIG_IOMMU_API
> > struct iommu_group *group;
> >   #endif
> > +   /* Some key properties of IOMMU */
> > +   __u32 tce32_start;
> > +   __u32 tce32_size;
> > +   __u32 max_dynamic_windows_supported;
> > +   __u32 max_levels;
> > +   __u32 flags;
> 
> Just realized that due to their static nature, they are better to be in 
> iommu_table_group_ops, will fix it in v6.

Ugh ? I dislike mixing function pointers and other fields, even if
statis. If you *really* want to separate them make them a struct
iommu_table_info and declare a const member. Otherwise don't bother and
leave them where they are.

> 
> > +
> > struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
> > struct iommu_table_group_ops *ops;
> >   };
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> > b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index ed60b38..07857c4 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -48,6 +48,7 @@
> >   #include "pci.h"
> >
> >   #define POWERNV_IOMMU_DEFAULT_LEVELS  1
> > +#define POWERNV_IOMMU_MAX_LEVELS   5
> >
> >   extern void ioda_eeh_tvt_print(struct pnv_phb *phb);
> >
> > @@ -1155,11 +1156,14 @@ static void pnv_ioda1_tce_free_vm(struct 
> > iommu_table *tbl, long index,
> > pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
> >   }
> >
> > +static void pnv_pci_free_table(struct iommu_table *tbl);
> > +
> >   struct iommu_table_ops pnv_ioda1_iommu_ops = {
> > .set = pnv_ioda1_tce_build_vm,
> > .exchange = pnv_ioda1_tce_xchg_vm,
> > .clear = pnv_ioda1_tce_free_vm,
> > .get = pnv_tce_get,
> > +   .free = pnv_pci_free_table
> >   };
> >
> >   static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
> > @@ -1317,6 +1321,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
> > *phb,
> >  TCE_PCI_SWINV_PAIR);
> > }
> > tbl->it_ops = &pnv_ioda1_iommu_ops;
> > +   pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
> > +   pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
> > +   pe->table_group.max_dynamic_windows_supported = 0;
> > +   pe->table_group.max_levels = 0;
> > +   pe->table_group.flags = 0;
> > iommu_init_table(tbl, phb->hose->node);
> > iommu_register_group(&pe->table_group, phb->hose->global_number,
> > pe->pe_number);
> > @@ -1401,7 +1410,7 @@ static __be64 *pnv_alloc_tce_table(int nid,
> >   }
> >
> >   static long pnv_pci_ioda2_create_table(struct iommu_table_group 
> > *table_group,
> > -   __u32 page_shift, __u64 window_size, __u32 levels,
> > +   int num, __u32 page_shift, __u64 window_size, __u32 levels,
> > struct iommu_table *tbl)
> >   {
> > struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> > @@ -1428,8 +1437,8 @@ static long pnv_pci_ioda2_create_table(struct 
> > iommu_table_group *table_group,
> > shift = ROUND_UP(ilog2(window_size) - page_shift, levels) / levels;
> > shift += 3;
> > shift = max_t(unsigned, shift, IOMMU_PAGE_SHIFT_4K);
> > -   pr_info("Creating TCE table %08llx, %d levels, TCE table size = %lx\n",
> > -   window_size, levels, 1UL << shift);
> > +   pr_info("Creating TCE table #%d %08llx, %d levels, TCE table size = 
> > %lx\n",
> > +   num, window_size, levels, 1UL << shift);
> >
> > tbl->it_level_size = 1ULL << (shift - 3);
> > left = tce_table_size;
> > @@ -1440,11 +1449,10 @@ static long pnv_pci_ioda2_create_table(struct 
> > iommu_table_group *table_group,
> > tbl->it_indirect_levels = levels - 1;
> >
> > /* Setup linux iommu table */
> > -   pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
> > -   page_shift);
> > +   pnv_pci_setup_iommu_table(tbl, addr, tce_table_size,
> > +   num ? pe->tce_bypass_base : 0, page_shift);
> >
> > tbl->it_ops = &pnv_ioda2_iommu_ops;
> > -   iommu_init_table(tbl, nid);
> >
> > return 0;
> >   }
> > @@ -1461,8 +1469,21 @@ static void pnv_pci_free_table(struct iommu_table 
> > *tbl)
> > iommu_reset_table(tbl, "ioda2");
> >   }
> >
> > +static inline void pnv_pci_ioda2_tvt_invalidate(unsigned int pe_number,
> > +   unsigned long it_index)
> > +{
> > +   __be64 __iomem *invalidate = (__be64 __iomem *)it_index;
> > +   /* 01xb - invalidate TCEs that match the specified PE# */
> > +   unsig

Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Benjamin Herrenschmidt
On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:
> > > return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);
> > 
> > This won't be "bool" though.
> 
> Yes, it will.

Don't you have your parenthesis in the wrong place, Alex ? :-)

> >  This will (I'll do this)
> > 
> > shift = PAGE_SHIFT + compound_order(compound_head(page));
> > return (shift >= page_shift);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support

2014-10-22 Thread Benjamin Herrenschmidt
On Wed, 2014-10-22 at 16:17 +0200, Jan Kiszka wrote:
> I thought about this again, and I'm not sure anymore if we can use
> ACPI
> to "black-list" the incompatible virtio devices. Reason: hotplug. To
> my
> understanding, the ACPI DRHD tables won't change during runtime when a
> device shows up or disappears. We would have to isolate virtio devices
> from the rest of the system by using separate buses for it (and avoid
> listing those in any DRHD table) and enforce that they only get
> plugged
> into those buses. I suppose that is not desirable.
> 
> Maybe it's better to fix virtio /wrt IOMMUs.

I always go back to my initial proposal which is to define that current
virtio always bypass any iommu (which is what it does really) and have
it expose via a new capability if that isn't the case. That means fixing
that Xen thingy to allow qemu to know what to expose I assume but that
seems to be the less bad approach.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] powerpc/kvm: support to handle sw breakpoint

2014-08-11 Thread Benjamin Herrenschmidt
On Mon, 2014-08-11 at 09:26 +0200, Alexander Graf wrote:
> > diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> > index da86d9b..d95014e 100644
> > --- a/arch/powerpc/kvm/emulate.c
> > +++ b/arch/powerpc/kvm/emulate.c
> 
> This should be book3s_emulate.c.

Any reason we can't make that 0000 opcode as breakpoint common to
all powerpc variants ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 2/5] powerpc/eeh: Add warning message in eeh_dev_open()

2014-08-07 Thread Benjamin Herrenschmidt
On Thu, 2014-08-07 at 12:47 +1000, Gavin Shan wrote:
> The patch adds one warning message in eeh_dev_open() in case the
> PCI device can't be marked as passed through.
> 
> Suggested-by: Alexey Kardashevskiy 
> Signed-off-by: Gavin Shan 
> ---

Acked-by: Benjamin Herrenschmidt 

Alex, are you taking this or should I ?

>  arch/powerpc/kernel/eeh.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 59a64f8..5d73a49 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1162,8 +1162,11 @@ int eeh_dev_open(struct pci_dev *pdev)
>  
>   /* No EEH device or PE ? */
>   edev = pci_dev_to_eeh_dev(pdev);
> - if (!edev || !edev->pe)
> + if (!edev || !edev->pe) {
> + pr_warn_once("%s: PCI device %s not supported\n",
> +  __func__, pci_name(pdev));
>   goto out;
> + }
>  
>   /* Increase PE's pass through count */
>   atomic_inc(&edev->pe->pass_dev_cnt);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 1/5] powerpc/eeh: Export eeh_iommu_group_to_pe()

2014-08-07 Thread Benjamin Herrenschmidt
On Thu, 2014-08-07 at 12:47 +1000, Gavin Shan wrote:
> The function is used by VFIO driver, which might be built as a
> dynamic module. So it should be exported.
> 
> Signed-off-by: Gavin Shan 

Acked-by: Benjamin Herrenschmidt 

Alex, are you taking this or should I ?

> ---
>  arch/powerpc/kernel/eeh.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 6043879..59a64f8 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1254,6 +1254,7 @@ struct eeh_pe *eeh_iommu_group_to_pe(struct iommu_group 
> *group)
>  
>   return edev->pe;
>  }
> +EXPORT_SYMBOL_GPL(eeh_iommu_group_to_pe);
>  
>  #endif /* CONFIG_IOMMU_API */
>  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/4] vfio_pci: spapr: Enable VFIO if EEH is not supported

2014-08-05 Thread Benjamin Herrenschmidt
On Tue, 2014-08-05 at 21:44 -0600, Alex Williamson wrote:
> >   ret = vfio_spapr_pci_eeh_open(vdev->pdev);
> > - if (ret) {
> > - vfio_pci_disable(vdev);
> > - goto error;
> > - }
> > + if (ret)
> > + pr_warn_once("EEH is not supported\n");
> >   }
> >  
> >   return 0;
> 
> Now the next question, what's the point of vfio_spapr_pci_eeh_open()
> returning a value?  Couldn't it return void now and this warning can
> go into eeh specific code?  Thanks,

In order to call vfio_pci_disable() when that happens ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL 16/63] PPC: Add asm helpers for BE 32bit load/store

2014-08-01 Thread Benjamin Herrenschmidt
On Fri, 2014-08-01 at 11:17 +0200, Alexander Graf wrote:
> >From assembly code we might not only have to explicitly BE access 64bit 
> >values,
> but sometimes also 32bit ones. Add helpers that allow for easy use of 
> lwzx/stwx
> in their respective byte-reverse or native form.
> 
> Signed-off-by: Alexander Graf 

Acked-by: Benjamin Herrenschmidt 
> ---
>  arch/powerpc/include/asm/asm-compat.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/asm-compat.h 
> b/arch/powerpc/include/asm/asm-compat.h
> index 4b237aa..21be8ae 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -34,10 +34,14 @@
>  #define PPC_MIN_STKFRM   112
>  
>  #ifdef __BIG_ENDIAN__
> +#define LWZX_BE  stringify_in_c(lwzx)
>  #define LDX_BE   stringify_in_c(ldx)
> +#define STWX_BE  stringify_in_c(stwx)
>  #define STDX_BE  stringify_in_c(stdx)
>  #else
> +#define LWZX_BE  stringify_in_c(lwbrx)
>  #define LDX_BE   stringify_in_c(ldbrx)
> +#define STWX_BE  stringify_in_c(stwbrx)
>  #define STDX_BE  stringify_in_c(stdbrx)
>  #endif
>  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Use virtual page class key protection mechanism for speeding up guest page fault

2014-06-29 Thread Benjamin Herrenschmidt
On Sun, 2014-06-29 at 16:47 +0530, Aneesh Kumar K.V wrote:

> To achieve the above we use virtual page calss protection mechanism for
> covering (2) and (3). For both the above case we mark the hpte
> valid, but associate the page with virtual page class index 30 and 31.
> The authority mask register is configured such that class index 30 and 31
> will have read/write denied. The above change results in a key fault
> for (2) and (3). This allows us to forward a NO_HPTE fault directly to guest
> without doing the expensive hash pagetable lookup.

So we have a measurable performance benefit (about half a second out of
8) but you didn't explain the drawback here which is to essentially make
it impossible for guests to exploit virtual page class keys, or did you
find a way to still make that possible ?

As it-is, it's not a huge issue for Linux but we might have to care with
other OSes that do care...

Do we have a way in PAPR to signify to the guest that the keys are not
available ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Benjamin Herrenschmidt
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
> 
> I do not understand why @val is considered LE here and need to be
> converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

No. Both are slightly wrong semantically but le32_to_cpu() is less
wrong :-)

iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
value. So if anything, you need something that converts to a "cpu" value
before you call iowrite32.

Now it's still slightly wrong because the "input" to le32_to_cpu() is
supposed to be an "LE" value but of course here it's not, it's a "cpu"
value too :-)

But yes, I agree with aw here, either do nothing or stick a set of
iowriteXX_native or something equivalent in the generic iomap header,
define them in term of iowrite32be/iowrite32 based on the compile time
endian of the arch.

Hitting asm-generic/iomap.h I think will cover all archs except ARM.
For ARM, just hack arch/arm/asm/io.h

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Benjamin Herrenschmidt
On Tue, 2014-06-24 at 12:41 +0200, Alexander Graf wrote:
> Is there actually any difference in generated code with this patch 
> applied and without? I would hope that iowrite..() is inlined and 
> cancels out the cpu_to_le..() calls that are also inlined?

No, the former uses byteswapping asm, the compiler can't "cancel" it
out, but the overhead of the additional byteswap might not be
measurable.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-20 Thread Benjamin Herrenschmidt
On Sat, 2014-06-21 at 00:14 +1000, Alexey Kardashevskiy wrote:

> We can still use __raw_writel&co, would that be ok?

No unless you understand precisely what kind of memory barriers each
platform require for these.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-20 Thread Benjamin Herrenschmidt
On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:

> Working on big endian being an accident may be a matter of perspective

 :-)

> The comment remains that this patch doesn't actually fix anything except
> the overhead on big endian systems doing redundant byte swapping and
> maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver which is the only one to know
whether a given BAR location is a a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.

> I'm still not a fan of iowrite vs iowritebe, there must be something we
> can use that doesn't have an implicit swap.

Sadly there isn't ... In the old day we didn't even have the "be"
variant and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel but here the semantics are much more than
just "don't swap", they also don't have memory barriers (which means
they are essentially useless to most drivers unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer which we know never have side
effects). 

>  Calling it iowrite*_native is also an abuse of the namespace.


>  Next thing we know some common code
> will legitimately use that name. 

I might make sense to those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIOs iirc (though don't ask me for examples, I can't really
remember).

>  If we do need to define an alias
> (which I'd like to avoid) it should be something like vfio_iowrite32.
> Thanks,

Cheers,
Ben.

> Alex
> 
> > > ===
> > > 
> > > any better?
> > > 
> > > 
> > > 
> > > 
> > >>>> Suggested-by: Benjamin Herrenschmidt 
> > >>>> Signed-off-by: Alexey Kardashevskiy 
> > >>>> ---
> > >>>>  drivers/vfio/pci/vfio_pci_rdwr.c | 20 
> > >>>>  1 file changed, 16 insertions(+), 4 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c 
> > >>>> b/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> index 210db24..f363b5a 100644
> > >>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> > >>>> @@ -21,6 +21,18 @@
> > >>>>  
> > >>>>  #include "vfio_pci_private.h"
> > >>>>  
> > >>>> +#ifdef __BIG_ENDIAN__
> > >>>> +#define ioread16_native   ioread16be
> > >>>> +#define ioread32_native   ioread32be
> > >>>> +#define iowrite16_native  iowrite16be
> > >>>> +#define iowrite32_native  iowrite32be
> > >>>> +#else
> > >>>> +#define ioread16_native   ioread16
> > >>>> +#define ioread32_native   ioread32
> > >>>> +#define iowrite16_native  iowrite16
> > >>>> +#define iowrite32_native  iowrite32
> > >>>> +#endif
> > >>>> +
> > >>>>  /*
> > >>>>   * Read or write from an __iomem region (MMIO or I/O port) with an 
> > >>>> excluded
> > >>>>   * range which is inaccessible.  The excluded range drops writes and 
> > >>>> fills
> > >>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char 
> > >>>> __user *buf,
> > >>>>if (copy_from_user(&val, buf, 4))
> > >>>>return -EFAULT;
> > >>>>  
> > >>>> -  iowrite32(le32_to_cpu(val), io + off);
> > >>>> +  iowrite32_native(val, io + off);
> > >>>>} else {
> > >>>> -  val = cpu_to_le32(ioread32(io + off));
> > >>>> +  val = ioread32_native(io + off);
> > >>>>  
> > >>>>  

Re: [PATCH] powerpc/kvm: support to handle sw breakpoint

2014-06-17 Thread Benjamin Herrenschmidt
On Tue, 2014-06-17 at 11:25 +0200, Alexander Graf wrote:
> On 17.06.14 11:22, Benjamin Herrenschmidt wrote:
> > On Tue, 2014-06-17 at 10:54 +0200, Alexander Graf wrote:
> >> Also, why don't we use twi always or something else that actually is
> >> defined as illegal instruction? I would like to see this shared with
> >> book3s_32 PR.
> > twi will be directed to the guest on HV no ? We want a real illegal
> > because those go to the host (for potential emulation by the HV).
> 
> Ah, good point. I guess we need different one for PR and HV then to 
> ensure compatibility with older ISAs on PR.

Well, we also need to be careful with what happens if a PR guest puts
that instruction in, do that stop its HV guest/host ?

What if it's done in userspace ? Do that stop the kernel ? :-)

Maddy, I haven't checked, does your patch ensure that we only ever stop
if the instruction is at a recorded bkpt address ? It still means that a
userspace process can practically DOS its kernel by issuing a lot of
these causing a crapload of exits.

Cheers,
Ben.

> Alex
> 
> > I'm
> > trying to see if I can get the architect to set one in stone in a future
> > proof way.
> >
> > Cheers,
> > Ben.
> >
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc/kvm: support to handle sw breakpoint

2014-06-17 Thread Benjamin Herrenschmidt
On Tue, 2014-06-17 at 10:54 +0200, Alexander Graf wrote:
> 
> Also, why don't we use twi always or something else that actually is 
> defined as illegal instruction? I would like to see this shared with 
> book3s_32 PR.

twi will be directed to the guest on HV no ? We want a real illegal
because those go to the host (for potential emulation by the HV). I'm
trying to see if I can get the architect to set one in stone in a future
proof way.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote:
> What if we ask user space to give us a pointer to user space allocated 
> memory along with the TCE registration? We would still ask user space to 
> only use the returned fd for TCE modifications, but would have some 
> nicely swappable memory we can store the TCE entries in.

That isn't going to work terribly well for VFIO :-) But yes, for
emulated devices, we could improve things a bit, including for
the 32-bit TCE tables.

For emulated, the real mode path could walk the page tables and fallback
to virtual mode & get_user if the page isn't present, thus operating
directly on qemu memory TCE tables instead of the current pinned stuff.

However that has a cost in performance, but since that's really only
used for emulated devices and PAPR VIOs, it might not be a huge issue.

But for VFIO we don't have much choice, we need to create something the
HW can access.

> In fact, the code as is today can allocate an arbitrary amount of pinned 
> kernel memory from within user space without any checks.

Right. We should at least account it in the locked limit.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 19:26 +1000, Alexey Kardashevskiy wrote:
> 
> No trees yet. For 64GB window we need (64<<30)/(16<<20)*8 = 32K TCE table.
> Do we really need trees?

The above is assuming hugetlbfs backed guests. These are the least of my worry
indeed. But we need to deal with 4k and 64k guests.

Cheers,
Ben


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 17:25 +1000, Alexey Kardashevskiy wrote:
> +This creates a virtual TCE (translation control entry) table, which
> +is an IOMMU for PAPR-style virtual I/O.  It is used to translate
> +logical addresses used in virtual I/O into guest physical addresses,
> +and provides a scatter/gather capability for PAPR virtual I/O.
> +
> +/* for KVM_CAP_SPAPR_TCE_64 */
> +struct kvm_create_spapr_tce_64 {
> +   __u64 liobn;
> +   __u64 window_size;
> +   __u64 bus_offset;
> +   __u32 page_shift;
> +   __u32 flags;
> +};
> +
> +The liobn field gives the logical IO bus number for which to create a
> +TCE table. The window_size field specifies the size of the DMA window
> +which this TCE table will translate - the table will contain one 64
> +bit TCE entry for every IOMMU page. The bus_offset field tells where
> +this window is mapped on the IO bus. 

Hrm, the bus_offset cannot be set arbitrarily, it has some pretty strong
HW limits depending on the type of bridge & architecture version...

Do you plan to have that knowledge in qemu ? Or do you have some other
mechanism to query it ? (I might be missing a piece of the puzzle here).

Also one thing I've been pondering ...

We'll end up wasting a ton of memory with those TCE tables. If you have
3 PEs mapped into a guest, it will try to create 3 DDW's mapping the
entire guest memory and so 3 TCE tables large enough for that ... and
which will contain exactly the same entries !

We really want to look into extending PAPR to allow the creation of
table "aliases" so that the guest can essentially create one table and
associate it with multiple PEs. We might still decide to do multiple
copies for NUMA reasons but no more than one per node for example... at
least we can have the policy in qemu/kvm.

Also, do you currently require allocating a single physically contiguous
table or do you support TCE trees in your implementation ?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: powerpc/pseries: Use new defines when calling h_set_mode

2014-05-29 Thread Benjamin Herrenschmidt
On Thu, 2014-05-29 at 23:27 +0200, Alexander Graf wrote:
> On 29.05.14 09:45, Michael Neuling wrote:
> >>> +/* Values for 2nd argument to H_SET_MODE */
> >>> +#define H_SET_MODE_RESOURCE_SET_CIABR1
> >>> +#define H_SET_MODE_RESOURCE_SET_DAWR2
> >>> +#define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3
> >>> +#define H_SET_MODE_RESOURCE_LE4
> >>
> >> Much better, but I think you want to make use of these in non-kvm code too,
> >> no? At least the LE one is definitely already implemented as call :)
> > Sure but that's a different patch below.
> 
> Ben, how would you like to handle these 2 patches? If you give me an ack 
> I can just put this patch into my kvm queue. Alternatively we could both 
> carry a patch that adds the H_SET_MODE header bits only and whoever hits 
> Linus' tree first wins ;).

No biggie. Worst case it's a trivial conflict.

Cheers,
Ben.

> 
> Alex
> 
> >
> > Mikey
> >
> >
> > powerpc/pseries: Use new defines when calling h_set_mode
> >
> > Now that we define these in the KVM code, use these defines when we call
> > h_set_mode.  No functional change.
> >
> > Signed-off-by: Michael Neuling 
> > --
> > This depends on the KVM h_set_mode patches.
> >
> > diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
> > b/arch/powerpc/include/asm/plpar_wrappers.h
> > index 12c32c5..67859ed 100644
> > --- a/arch/powerpc/include/asm/plpar_wrappers.h
> > +++ b/arch/powerpc/include/asm/plpar_wrappers.h
> > @@ -273,7 +273,7 @@ static inline long plpar_set_mode(unsigned long mflags, 
> > unsigned long resource,
> >   static inline long enable_reloc_on_exceptions(void)
> >   {
> > /* mflags = 3: Exceptions at 0xC0004000 */
> > -   return plpar_set_mode(3, 3, 0, 0);
> > +   return plpar_set_mode(3, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
> >   }
> >   
> >   /*
> > @@ -284,7 +284,7 @@ static inline long enable_reloc_on_exceptions(void)
> >* returns H_SUCCESS.
> >*/
> >   static inline long disable_reloc_on_exceptions(void) {
> > -   return plpar_set_mode(0, 3, 0, 0);
> > +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_ADDR_TRANS_MODE, 0, 0);
> >   }
> >   
> >   /*
> > @@ -297,7 +297,7 @@ static inline long disable_reloc_on_exceptions(void) {
> >   static inline long enable_big_endian_exceptions(void)
> >   {
> > /* mflags = 0: big endian exceptions */
> > -   return plpar_set_mode(0, 4, 0, 0);
> > +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_LE, 0, 0);
> >   }
> >   
> >   /*
> > @@ -310,17 +310,17 @@ static inline long enable_big_endian_exceptions(void)
> >   static inline long enable_little_endian_exceptions(void)
> >   {
> > /* mflags = 1: little endian exceptions */
> > -   return plpar_set_mode(1, 4, 0, 0);
> > +   return plpar_set_mode(1, H_SET_MODE_RESOURCE_LE, 0, 0);
> >   }
> >   
> >   static inline long plapr_set_ciabr(unsigned long ciabr)
> >   {
> > -   return plpar_set_mode(0, 1, ciabr, 0);
> > +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_CIABR, ciabr, 0);
> >   }
> >   
> >   static inline long plapr_set_watchpoint0(unsigned long dawr0, unsigned 
> > long dawrx0)
> >   {
> > -   return plpar_set_mode(0, 2, dawr0, dawrx0);
> > +   return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0);
> >   }
> >   
> >   #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
> >


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 21:38 +0530, Aneesh Kumar K.V wrote:

> >> I updated the commit message as below. Let me know if this is ok.
> >>
> >>  KVM: PPC: BOOK3S: HV: THP support for guest
> >
> > This has nothing to do with THP.
> 
> THP support in guest depend on KVM advertising MPSS feature. We already
> have rest of the changes needed to support transparent huge pages
> upstream. (We do support THP with PowerVM LPAR already). The primary
> motivation of this patch is to enable THP in powerkvm guest. 

I would argue (nit picking, I know ... :-) that the subject should be
"Enable MPSS support for guests", and the description can then explain
that this allows Linux guests to use THP.

Cheers,
Ben.

> >
> >>  
> >>  On recent IBM Power CPUs, while the hashed page table is looked up 
> >> using
> >>  the page size from the segmentation hardware (i.e. the SLB), it is
> >>  possible to have the HPT entry indicate a larger page size.  Thus for
> >>  example it is possible to put a 16MB page in a 64kB segment, but since
> >>  the hash lookup is done using a 64kB page size, it may be necessary to
> >>  put multiple entries in the HPT for a single 16MB page.  This
> >>  capability is called mixed page-size segment (MPSS).  With MPSS,
> >>  there are two relevant page sizes: the base page size, which is the
> >>  size used in searching the HPT, and the actual page size, which is the
> >>  size indicated in the HPT entry. [ Note that the actual page size is
> >>  always >= base page size ].
> >>  
> >>  We advertise MPSS feature to guest only if the host CPU supports the
> >>  same. We use "ibm,segment-page-sizes" device tree node to advertise
> >>  the MPSS support. The penc encoding indicate whether we support
> >>  a specific combination of base page size and actual page size
> >>  in the same segment. It is also the value used in the L|LP encoding
> >>  of HPTE entry.
> >>  
> >>  In-order to support MPSS in guest, KVM need to handle the below 
> >> details
> >>  * advertise MPSS via ibm,segment-page-sizes
> >>  * Decode the base and actual page size correctly from the HPTE entry
> >>so that we know what we are dealing with in H_ENTER and and can do
> >
> > Which code path exactly changes for H_ENTER?
> 
> There is no real code path changes. Any code path that use
> hpte_page_size() is impacted. We return actual page size there. 
> 
> >
> >>the appropriate TLB invalidation in H_REMOVE and evictions.
> >
> > Apart from the grammar (which is pretty broken for the part that is not 
> > copied from Paul) and the subject line this sounds quite reasonable.
> >
> 
> Wll try to fix.
> 
> -aneesh


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] KVM: PPC: BOOK3S: HV: THP support for guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 11:12 +0200, Alexander Graf wrote:

> So if I understand this patch correctly, it simply introduces logic to 
> handle page sizes other than 4k, 64k, 16M by analyzing the actual page 
> size field in the HPTE. Mind to explain why exactly that enables us to 
> use THP?
>
> What exactly is the flow if the pages are not backed by huge pages? What 
> is the flow when they start to get backed by huge pages?

The hypervisor doesn't care about segments ... but it needs to properly
decode the page size requested by the guest, if anything, to issue the
right form of tlbie instruction.

The encoding in the HPTE for a 16M page inside a 64K segment is
different than the encoding for a 16M in a 16M segment, this is done so
that the encoding carries both information, which allows broadcast
tlbie to properly find the right set in the TLB for invalidations among
others.

So from a KVM perspective, we don't know whether the guest is doing THP
or something else (Linux calls it THP but all we care here is that this
is MPSS, another guest than Linux might exploit that differently).

What we do know is that if we advertise MPSS, we need to decode the page
sizes encoded in the HPTE so that we know what we are dealing with in
H_ENTER and can do the appropriate TLB invalidations in H_REMOVE &
evictions.

> > +   if (a_size != -1)
> > +   return 1ul << mmu_psize_defs[a_size].shift;
> > +   }
> > +
> > +   }
> > +   return 0;
> >   }
> >   
> >   static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long 
> > psize)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 8227dba5af0f..a38d3289320a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -1949,6 +1949,13 @@ static void kvmppc_add_seg_page_size(struct 
> > kvm_ppc_one_seg_page_size **sps,
> >  * support pte_enc here
> >  */
> > (*sps)->enc[0].pte_enc = def->penc[linux_psize];
> > +   /*
> > +* Add 16MB MPSS support
> > +*/
> > +   if (linux_psize != MMU_PAGE_16M) {
> > +   (*sps)->enc[1].page_shift = 24;
> > +   (*sps)->enc[1].pte_enc = def->penc[MMU_PAGE_16M];
> > +   }
> 
> So this basically indicates that every segment (except for the 16MB one) 
> can also handle 16MB MPSS page sizes? I suppose you want to remove the 
> comment in kvm_vm_ioctl_get_smmu_info_hv() that says we don't do MPSS here.

I haven't reviewed the code there, make sure it will indeed do a
different encoding for every combination of segment/actual page size.

> Can we also ensure that every system we run on can do MPSS?

P7 and P8 are identical in that regard. However 970 doesn't do MPSS so
let's make sure we get that right.

Cheers,
Ben.
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> >> Isn't this a greater problem? We should start swapping before we hit
> >> the point where non movable kernel allocation fails, no?
> > Possibly but the fact remains, this can be avoided by making sure that
> > if we create a CMA reserve for KVM, then it uses it rather than using
> > the rest of main memory for hash tables.
> 
> So why were we preferring non-CMA memory before? Considering that Aneesh 
> introduced that logic in fa61a4e3 I suppose this was just a mistake?

I assume so.

> >> The fact that KVM uses a good number of normal kernel pages is maybe
> >> suboptimal, but shouldn't be a critical problem.
> > The point is that we explicitly reserve those pages in CMA for use
> > by KVM for that specific purpose, but the current code tries first
> > to get them out of the normal pool.
> >
> > This is not an optimal behaviour and is what Aneesh patches are
> > trying to fix.
> 
> I agree, and I agree that it's worth it to make better use of our 
> resources. But we still shouldn't crash.

Well, Linux hitting out of memory conditions has never been a happy
story :-)

> However, reading through this thread I think I've slowly grasped what 
> the problem is. The hugetlbfs size calculation.

Not really.

> I guess something in your stack overreserves huge pages because it 
> doesn't account for the fact that some part of system memory is already 
> reserved for CMA.

Either that or simply Linux runs out because we dirty too fast...
really, Linux has never been good at dealing with OO situations,
especially when things like network drivers and filesystems try to do
ATOMIC or NOIO allocs...
 
> So the underlying problem is something completely orthogonal. The patch 
> body as is is fine, but the patch description should simply say that we 
> should prefer the CMA region because it's already reserved for us for 
> this purpose and we make better use of our available resources that way.

No.

We give a chunk of memory to hugetlbfs, it's all good and fine.

Whatever remains is split between CMA and the normal page allocator.

Without Aneesh latest patch, when creating guests, KVM starts allocating
it's hash tables from the latter instead of CMA (we never allocate from
hugetlb pool afaik, only guest pages do that, not hash tables).

So we exhaust the page allocator and get linux into OOM conditions
while there's plenty of space in CMA. But the kernel cannot use CMA for
it's own allocations, only to back user pages, which we don't care about
because our guest pages are covered by our hugetlb reserve :-)

> All the bits about pinning, numa, libvirt and whatnot don't really 
> matter and are just details that led Aneesh to find this non-optimal 
> allocation.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/22] EEH Support for VFIO PCI devices on PowerKVM guest

2014-05-06 Thread Benjamin Herrenschmidt
On Tue, 2014-05-06 at 08:56 +0200, Alexander Graf wrote:
> > For the error injection, I guess I have to put the logic token
> management
> > into QEMU and error injection request will be handled by QEMU and
> then
> > routed to host kernel via additional syscall as we did for pSeries.
> 
> Yes, start off without in-kernel XICS so everything simply lives in 
> QEMU. Then add callbacks into the in-kernel XICS to inject these 
> interrupts if we don't have wide enough interfaces already.

It's got nothing to do with XICS ... :-)

But yes, we can route everything via qemu for now, then we'll need
at least one of the call to have a "direct" path but we should probably
strive to even make it real mode if that's possible, it's the one that
Linux will call whenever an MMIO returns all f's to check if the
underlying PE is frozen.

But we can do that as a second stage.

In fact going via VFIO ioctl's does make the whole security and
translation model much simpler initially.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-05 Thread Benjamin Herrenschmidt
On Mon, 2014-05-05 at 16:43 +0200, Alexander Graf wrote:
> > Paul mentioned that BOOK3S always had DAR value set on alignment
> > interrupt. And the patch is to enable/collect correct DAR value when
> > running with Little Endian PR guest. Now to limit the impact and to
> > enable Little Endian PR guest, I ended up doing the conditional code
> > only for book3s 64 for which we know for sure that we set DAR value.
> 
> Yes, and I'm asking whether we know that this statement holds true for 
> PA6T and G5 chips which I wouldn't consider IBM POWER. Since the G5 is 
> at least developed by IBM, I'd assume its semantics here are similar to 
> POWER4, but for PA6T I wouldn't be so sure.

I am not aware of any PowerPC processor that does not set DAR on
alignment interrupts. Paul, are you ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.

2014-05-05 Thread Benjamin Herrenschmidt
On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
> Isn't this a greater problem? We should start swapping before we hit
> the point where non movable kernel allocation fails, no?

Possibly but the fact remains, this can be avoided by making sure that
if we create a CMA reserve for KVM, then it uses it rather than using
the rest of main memory for hash tables.

> The fact that KVM uses a good number of normal kernel pages is maybe
> suboptimal, but shouldn't be a critical problem.

The point is that we explicitly reserve those pages in CMA for use
by KVM for that specific purpose, but the current code tries first
to get them out of the normal pool.

This is not an optimal behaviour and is what Aneesh patches are
trying to fix.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V4] POWERPC: BOOK3S: KVM: Use the saved dar value and generic make_dsisr

2014-05-05 Thread Benjamin Herrenschmidt
On Mon, 2014-05-05 at 19:56 +0530, Aneesh Kumar K.V wrote:
> 
> Paul mentioned that BOOK3S always had DAR value set on alignment
> interrupt. And the patch is to enable/collect correct DAR value when
> running with Little Endian PR guest. Now to limit the impact and to
> enable Little Endian PR guest, I ended up doing the conditional code
> only for book3s 64 for which we know for sure that we set DAR value.

Only BookS ? Afaik, the kernel align.c unconditionally uses DAR on
every processor type. It's DSISR that may or may not be populated
but afaik DAR always is.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost

2014-03-02 Thread Benjamin Herrenschmidt
On Sun, 2014-03-02 at 21:49 -0700, Alex Williamson wrote:

> I understand from talking to Alexey that the BARs are reset by this
> BIST, but what about the MSIX capability?  If that gets reset to be
> disabled, where does it get re-enabled?

The guest will do pci_save/restore_state iirc which will attempt
to save and restore everything, but qemu might get in the way here...

I'm not sure whether the IPR BIST clears the config space but we
probably should account for the general case of the driver in the guest
doing some kind of reset (BIST or otherwise) and trying to save and
restore state itself.

>   QEMU won't know about the BIST,
> so probably assumes nothing changed when the guest writes the enable
> bit.  VFIO doesn't allow userspace writes to the MSIX capability.  So it
> seems like the above would restore the table entry, but not necessarily
> whether MSIX is enabled on the device.

Maybe qemu shouldn't trust its cache and upon guest writes, do a read
first to check the HW state ? It's not like any of this is performance
sensitive anyway...

> When I talked with Alexey I noted that we already have a BAR restore
> function, vfio_bar_restore(), that tries to do some fixup when backdoor
> resets are noticed.  If we were to trigger that upon noting the user
> writing BAR values to what we think they're already set to, we could
> extend it to trigger an interrupt restore that could fixup both the
> capability and the table entries.  Thanks,

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost

2014-03-02 Thread Benjamin Herrenschmidt
On Mon, 2014-03-03 at 15:43 +1100, Alexey Kardashevskiy wrote:
> While it works for our particular problem and seems correct, it has one
> flaw - hw/pci/msix.c will not generate this backtrace if masking bit does
> not change which can happen in general:
> ===
> static void msix_handle_mask_update(PCIDevice *dev, int vector, bool
> was_masked)
> {
> bool is_masked = msix_is_masked(dev, vector);
> 
> if (is_masked == was_masked) {
> return;
> }
> ===
> 
> Or if masking bit is the same, nothing bad is expected?...

Hrm ok, so it will work in this specific case but might not in the general
case of a driver triggering some kind of local reset on the device requiring
the MSI-X to be restored. The guest will write but qemu will swallow them ...

I think that needs to be fixed but it might be hard without introducing
a new ioctl from what I can see of the way the code is structured... Unless
qemu turns that into a disable/enable pair I suppose.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] drivers/vfio/pci: Fix MSIx message lost

2014-03-02 Thread Benjamin Herrenschmidt
On Mon, 2014-03-03 at 11:24 +0800, Gavin Shan wrote:
> The problem is specific to the case of BIST issued applied to IPR
> adapter on the guest side. After BIST reset, we lose everything
> in MSIx table and we never have chance update MSIx messages for
> those enabled interrupts to MSIx table.
> 
> The patch fixes it by writing MSIx message to MSIx table before
> reenabling them.

That needs a better explanation... the guest does try to restore the
MSI-X (in a way similar to resuming from suspend) by writing the
message value back into the table.

My understanding is that we trap that and turn that into a call to
vfio_msi_set_vector_signal, is that correct ?

In that case, it does indeed make sense to have that function rewrite
the cached message.

(Just trying to understand how this patch fixes the problem we saw)

I suppose other drivers would have similar issues if they have some
kind of internal "reset" path and try to save and restore the MSI-X
table.

Cheers,
Ben.

> Reported-by: Wen Xiong 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c |   19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 2103576..279ebd0 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -517,6 +518,7 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   struct pci_dev *pdev = vdev->pdev;
>   int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
>   char *name = msix ? "vfio-msix" : "vfio-msi";
> + struct msi_msg msg;
>   struct eventfd_ctx *trigger;
>   int ret;
>  
> @@ -544,6 +546,23 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   return PTR_ERR(trigger);
>   }
>  
> + /* We possiblly lose the MSI/MSIx message in some cases.
> +  * For example, BIST reset on IPR adapter. The MSIx table
> +  * is cleaned out. However, we never get chance to put
> +  * MSIx messages to MSIx table because all MSIx stuff is
> +  * being cached in QEMU. Here, we had the trick to put the
> +  * MSI/MSIx message back.
> +  *
> +  * Basically, we needn't worry about MSI messages. However,
> +  * it's not harmful and there might be cases of PCI config data
> +  * lost because of cached PCI config data in QEMU again.
> +  *
> +  * Note that we should flash the message prior to enabling
> +  * the corresponding interrupt by request_irq().
> +  */
> +  get_cached_msi_msg(irq, &msg);
> +  write_msi_msg(irq, &msg);
> +
>   ret = request_irq(irq, vfio_msihandler, 0,
> vdev->ctx[vector].name, trigger);
>   if (ret) {


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] PPC: KVM: Introduce hypervisor call H_GET_TCE

2014-02-21 Thread Benjamin Herrenschmidt
On Fri, 2014-02-21 at 16:31 +0100, Laurent Dufour wrote:
> This fix introduces the H_GET_TCE hypervisor call which is basically the
> reverse of H_PUT_TCE, as defined in the Power Architecture Platform
> Requirements (PAPR).
> 
> The hcall H_GET_TCE is required by the kdump kernel which is calling it to
> retrieve the TCE set up by the panicing kernel.

Alexey, will that work for VFIO ? Or are those patches *still* not
upstream ?

> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/include/asm/kvm_ppc.h  |2 ++
>  arch/powerpc/kvm/book3s_64_vio_hv.c |   28 
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |2 +-
>  3 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index fcd53f0..4096f16 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -129,6 +129,8 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   struct kvm_create_spapr_tce *args);
>  extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>unsigned long ioba, unsigned long tce);
> +extern long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> +  unsigned long ioba);
>  extern struct kvm_rma_info *kvm_alloc_rma(void);
>  extern void kvm_release_rma(struct kvm_rma_info *ri);
>  extern struct page *kvm_alloc_hpt(unsigned long nr_pages);
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c 
> b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 2c25f54..89e96b3 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -75,3 +75,31 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long 
> liobn,
>   return H_TOO_HARD;
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_h_put_tce);
> +
> +long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
> +   unsigned long ioba)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct kvmppc_spapr_tce_table *stt;
> +
> + list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> + if (stt->liobn == liobn) {
> + unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
> + struct page *page;
> + u64 *tbl;
> +
> + if (ioba >= stt->window_size)
> + return H_PARAMETER;
> +
> + page = stt->pages[idx / TCES_PER_PAGE];
> + tbl = (u64 *)page_address(page);
> +
> + vcpu->arch.gpr[4] = tbl[idx % TCES_PER_PAGE];
> + return H_SUCCESS;
> + }
> + }
> +
> + /* Didn't find the liobn, punt it to userspace */
> + return H_TOO_HARD;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_h_get_tce);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index e66d4ec..7d4fe2a 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1758,7 +1758,7 @@ hcall_real_table:
>   .long   0   /* 0x10 - H_CLEAR_MOD */
>   .long   0   /* 0x14 - H_CLEAR_REF */
>   .long   .kvmppc_h_protect - hcall_real_table
> - .long   0   /* 0x1c - H_GET_TCE */
> + .long   .kvmppc_h_get_tce - hcall_real_table
>   .long   .kvmppc_h_put_tce - hcall_real_table
>   .long   0   /* 0x24 - H_SET_SPRG0 */
>   .long   .kvmppc_h_set_dabr - hcall_real_table


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 02/10] KVM: PPC: BOOK3S: PR: Emulate virtual timebase register

2014-01-29 Thread Benjamin Herrenschmidt
On Thu, 2014-01-30 at 09:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2014-01-29 at 17:39 +0100, Alexander Graf wrote:
> > static inline mfvtb(unsigned long)
> > {
> > #ifdef CONFIG_PPC_BOOK3S_64
> >  return mfspr(SPRN_VTB);
> > #else
> >  BUG();
> > #endif
> > }
> > 
> > is a lot easier to read and get right. But reg.h is Ben's call.
> 
> Agreed.

I mean I agree with Alex, his version is nicer :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 02/10] KVM: PPC: BOOK3S: PR: Emulate virtual timebase register

2014-01-29 Thread Benjamin Herrenschmidt
On Wed, 2014-01-29 at 17:39 +0100, Alexander Graf wrote:
> static inline mfvtb(unsigned long)
> {
> #ifdef CONFIG_PPC_BOOK3S_64
>  return mfspr(SPRN_VTB);
> #else
>  BUG();
> #endif
> }
> 
> is a lot easier to read and get right. But reg.h is Ben's call.

Agreed.

> Also could you please give me a pointer to the specification for it? I 
> tried to look up vtb in the 2.06 ISA and couldn't find it. Is it a CPU 
> specific register?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Mon, 2014-01-27 at 16:44 -0800, Christoffer Dall wrote:

> I'm loosing track of this discussion, Ben, can you explain a bit?  You
> wrote:
> 
>   Having a byte array coming in that represents what the CPU does in its
>   current byte order means you do *NOT* need to query the endianness of
>   the guest CPU from userspace.
> 
> What does "a byte array that represents what the CPU does in its current
> byte order" mean in this context.  Do you mean the VCPU or the physical
> CPU when you say CPU.

It doesn't matter once it's a byte array in address order. Again this is
the *right* abstraction for the kernel ABI, because you do not care
about the endianness of either side, guest or host.

It makes no sense to treat a modern CPU data bus as having an MSB and an
LSB (even if they have it sometimes on the block diagram). Only when
*interpreting a value* on that bus, such as an *address* does the
endianness become of use.

Treat the bus instead as an ordered sequence of bytes in ascending
address order and most of the complexity goes away.

>From there, for a given device, it all depends which bytes *that device*
choses to consider as being the MSB vs. LSB. It's not even a bus thing,
though of course some busses suggest an endianness, and some like PCI
mandates it for configuration space. 

But it remains a device-side choice.

> I read your text as saying "just do a store of the register into the
> data pointer and don't worry about endianness", but somebody, somewhere,
> has to check the VCPU endianness setting.
> 
> I'm probably wrong, and you are probably the right person to clear this
> up, but can you formulate exactly what you think the KVM ABI is and how
> you would put it in Documentation/virtual/kvm/api.txt?
> 
> My point of view is that it is KVM that needs to do this, and it should
> "emulate the CPU" by performing a byteswap in the case where the CPU
> E-bit is set on ARM, but this is an ARM-centric way of looking at
> things.

The ABI going to qemu should be (and inside qemu from TCG to the
emulation) that the CPU did an access of N bytes wide at address A
whose value is the byte array data[] in ascending address order.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/73] tree-wide: clean up some no longer required #include

2014-01-27 Thread Benjamin Herrenschmidt
On Wed, 2014-01-22 at 19:38 -0500, Paul Gortmaker wrote:

> Thanks, it was a great help as it uncovered a few issues in fringe arch
> that I didn't have toolchains for, and I've fixed all of those up.
> 
> I've noticed that powerpc has been un-buildable for a while now; I have
> used this hack patch locally so I could run the ppc defconfigs to check
> that I didn't break anything.  Maybe useful for linux-next in the
> interim?  It is a hack patch -- Not-Signed-off-by: Paul Gortmaker.  :)

Can you and/or Aneesh submit that as a proper patch (with S-O-B
etc...) ?

Thanks !

Cheers,
Ben.

> Paul.
> --
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
> b/arch/powerpc/include/asm/pgtable-ppc64.h
> index d27960c89a71..d0f070a2b395 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -560,9 +560,9 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, 
> unsigned long address,
>   pmd_t *pmdp);
>  
>  #define pmd_move_must_withdraw pmd_move_must_withdraw
> -typedef struct spinlock spinlock_t;
> -static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
> -  spinlock_t *old_pmd_ptl)
> +struct spinlock;
> +static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
> +  struct spinlock *old_pmd_ptl)
>  {
>   /*
>* Archs like ppc64 use pgtable to store per pmd
> 
> ___
> Linuxppc-dev mailing list
> linuxppc-...@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Mon, 2014-01-27 at 23:49 +, Peter Maydell wrote:
> 
> Er, what? If we make the array be guest's current order
> then by definition userspace has to look at the guest's
> current endianness. I agree that would be bad. Either
> of the two current proposals (host kernel order; guest
> CPU's native/natural/default-byte-order) avoid it, though.

No, this has nothing to do with the guest endianness, and
all to do about the (hopefully) byte-address invariant bus we have
on the processor.

Anyway, the existing crap is ABI so I suspect we have to stick with it,
just maybe document it better.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Thu, 2014-01-23 at 20:11 -0800, Victor Kamensky wrote:
> > I would take 50 byteswaps with a clear ABI any day over an obscure
> > standard that can avoid a single hardware-on-register instruction.
> This
> > is about designing a clean software interface, not about building an
> > optimized integrated stack.
> >
> > Unfortunately, this is going nowhere, so I think we need to stop
> this
> > thread.  As you can see I have sent a patch as a clarification to
> the
> > ABI, if it's merged we can move on with more important tasks.
> 
> OK, that is fine. I still believe is not the best choice,
> but I agree that we need to move on. I will respin my
> V7 KVM BE patches according to this new semantics, I will
> integrate comments that you (thanks!) and others gave me
> over mailing list and post my series again when it is ready.

Right, the whole "host endian" is a horrible choice from every way you
look at it, but I'm afraid it's unfixable since it's already ABI :-(

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt

> The point is simple, and Peter has made it over and over:
> Any consumer of a memory operation sees "value, len, address".
> 
> This is what KVM_EXIT_MMIO emulates.  So just by knowing the ABI
> definition and having a pointer to the structure you need to be able to
> tell me "value, len, address".

But that's useless because it doesn't tell you the byte order
of the value which is critical for emulation unless you *defined* in
your ABI the byte order of the value, and thus include an artificial
swap when the guest is in a different endian mode than the host.


My understanding is that ARM is byte-address invariant, as is powerpc,
so it makes a LOT more sense to carry a sequence of address ordered
bytes instead which will correspond to what the guest code thinks
it's writing, and have the device respond appropriately based on
the endianness of the bus it sits on or the device itself.


> > >  (2) the API between kernel and userspace needs to define
> > >  the semantics of mmio.data, ie how to map between
> > >  "x byte wide transaction with value v" and the array,
> > >  and that is primarily what this conversation is about
> > >  (3) the only choice which is both (a) sensible and (b)
> > >  not breaking existing usage is to say "the array is
> > >  in host-kernel-byte-order"
> > >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> > >  the same, because in the ARM case it is doing an
> > >  internal-to-CPU byteswap, and in the PPC case it is not

I very much doubt that there is a difference here, I'm not sure about
that business with "internal byteswap".

The order in which the bytes of a word are presented on the bus changes
depending on the core endianness. If that's what you call a "byteswap"
then both ARM and PPC do it when they are in their "other" endian,
but that confusion comes from assuming that a data bus has an endianness
at all to begin with.

I was hoping that by 2014, such ideas were things of the past.

> > That is one of the key disconnects. I'll go find real examples
> > in ARM LE, ARM BE, and PPC BE Linux kernel. Just for
> > everybody sake's here is summary of the disconnect:
> > 
> > If we have the same h/w connected to memory bus in ARM
> > and PPC systems and we have the following three pieces
> > of code that work with r0 having same device same
> > register address:
> > 
> > 1. ARM LE word write of  0x04030201:
> > setend le
> > mov r1, #0x04030201
> > str r1, [r0]
> > 
> > 2. ARM BE word write of 0x01020304:
> > setend be
> > mov r1, #0x01020304
> > str r1, [r0]
> > 
> > 3. PPC BE word write of 0x01020304:
> > lis r1,0x102
> > ori r1,r1,0x304
> > stwr1,0(r0)
> > 
> > I claim that h/w will see the same data on bus lines in all
> > three cases, and h/w would acts the same in all three
> > cases. Peter says that ARM BE and PPC BE case h/w
> > would act differently.
> > 
> > If anyone else can offer opinion on that while I am looking
> > for real examples that would be great.
> > 
> 
> I really don't think listing all these examples help.  You need to focus
> on the key points that Peter listed in his previous mail.
> 
> I tried in our chat to ask you this questions:
> 
> vcpu_data_host_to_guest() is handling a read from an emulated device.
> All the info you have is:
> (1) len of memory access
> (2) mmio.data pointer
> (3) destination register
> (4) host CPU endianness
> (5) guest CPU endianness
> 
> Based on this information alone, you need to decide whether you do a
> byteswap or not before loading the hardware register upon returning to
> the guest.
> 
> You will find it impossible to answer, because you don't know the layout
> of mmio.data, and that is the thing we are trying to solve.
> 
> If you cannot reply to this point in less than 50 lines or mention
> anything about devices being LE or BE or come with examples, I am
> probably not going to read your reply, sorry.
> 
> -Christoffer


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Thu, 2014-01-23 at 15:33 +, Peter Maydell wrote:
>  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
>  the same, because in the ARM case it is doing an
>  internal-to-CPU byteswap, and in the PPC case it is not

Aren't they both byte-order invariant ?

In that case they are the same.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Tue, 2014-01-28 at 11:07 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2014-01-23 at 15:33 +, Peter Maydell wrote:
> >  (4) PPC CPUs in BE mode and ARM CPUs in BE mode are not
> >  the same, because in the ARM case it is doing an
> >  internal-to-CPU byteswap, and in the PPC case it is not
> 
> Aren't they both byte-order invariant ?

I meant byte-address...

> In that case they are the same.
> 
> Ben.
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Wed, 2014-01-22 at 20:02 +, Peter Maydell wrote:
> 
> Defining it as being always guest-order would mean that
> userspace had to continually look at the guest CPU
> endianness bit, which is annoying and awkward.
> 
> Defining it as always host-endian order is the most
> reasonable option available. It also happens to work
> for the current QEMU code, which is nice.

No.

Having a byte array coming in that represents what the CPU does in its
current byte order means you do *NOT* need to query the endianness of
the guest CPU from userspace.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Wed, 2014-01-22 at 11:29 -0800, Victor Kamensky wrote:
> I don't see why you so attached to desire to describe
> data part of memory transaction as just one of int
> types. If we are talking about bunch of hypothetical
> cases imagine such bus that allow transaction with
> size of 6 bytes. How do you describe such data in
> your ints speak? What endianity you can assign to
> sequence of 6 bytes? While note that description of
> such transaction as set of 6 byte values at address
> $whatever makes perfect sense.

Absolutely. For example, the "real" bus out of a POWER8 core is
something like 128 bytes wide though I wouldn't be surprised if it was
serialized, I don't actually know the details, it's all inside the chip.
The interconnect between chip is a multi-lane elastic interface whose
width has nothing to do with the payload size. Same goes with PCIe.

The only thing that can more/less sanely represent all of these things
is a series of bytes ordered by address, with attributes such as the
access size (or byte enables if that makes more sense or we want to
emulate really funky stuff) and possibly other decoration that some
architectures might want to have (such as caching/combining attributes
etc... which *might* be useful under some circumstances).

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

2014-01-27 Thread Benjamin Herrenschmidt
On Wed, 2014-01-22 at 17:29 +, Peter Maydell wrote:
> 
> > Basically if it would be on real bus, get byte value
> > that corresponds to phys_addr + 0 address place
> > it into data[0], get byte value that corresponds to
> > phys_addr + 1 address place it into data[1], etc.
> 
> This just isn't how real buses work.

Actually it can be :-)

>  There is no
> "address + 1, address + 2". There is a single address
> for the memory transaction and a set of data on
> data lines and some separate size information.
> How the device at the far end of the bus chooses
> to respond to 32 bit accesses to address X versus
> 8 bit accesses to addresses X through X+3 is entirely
> its own business and unrelated to the CPU.

However the bus has a definition of what byte lane is the lowest in
address order. Byte order invariance is an important function of
all busses.

I think that trying to treat it any differently than an address
ordered series of bytes is going to turn into a complete and
inextricable mess.

>  (It would
> be perfectly possible to have a device which when
> you read from address X as 32 bits returned 0x12345678,
> when you read from address X as 16 bits returned
> 0x9abc, returned 0x42 for an 8 bit read from X+1,
> and so on. Having byte reads from X..X+3 return
> values corresponding to parts of the 32 bit access
> is purely a convention.)

Right, it's possible. It's also stupid and not how most modern devices
and busses work. Besides there is no reason why that can't be
implemented with Victor proposal anyway.

Ben.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Use schedule instead of cond_resched

2013-12-10 Thread Benjamin Herrenschmidt
On Tue, 2013-12-10 at 23:48 +0100, Peter Zijlstra wrote:
> 
> Yeah, I went on holidays and the patch just sat there. I'll prod Ingo
> into merging it.

Thanks !

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Use schedule instead of cond_resched

2013-12-10 Thread Benjamin Herrenschmidt
On Tue, 2013-12-10 at 15:40 +0100, Alexander Graf wrote:
> On 10.12.2013, at 15:21, Aneesh Kumar K.V  
> wrote:
> 
> > From: "Aneesh Kumar K.V" 
> > 
> > We already checked need_resched. So we can call schedule directly
> > 
> > Signed-off-by: Aneesh Kumar K.V 
> 
> The real fix for the issue you're seeing is
> 
>   https://lkml.org/lkml/2013/11/28/241
> 

And is still not upstream Peter ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Error in frreing hugepages with preemption enabled

2013-12-03 Thread Benjamin Herrenschmidt
On Tue, 2013-12-03 at 23:21 +0100, Andrea Arcangeli wrote:
> #ifdef CONFIG_PPC_FSL_BOOK3E
> hugepd_free(tlb, hugepte);
 ^^

This is the culprit

(Alex, you didn't specify this was embedded or did I miss it ?)

> #else
> pgtable_free_tlb(tlb, hugepte, pdshift - shift);
> #endif
> }

That function does:

batchp = &__get_cpu_var(hugepd_freelist_cur);

IE, it tries to use a per-CPU batch. Basically, it's duplicating the
logic in mm/memory.c for RCU freeing using a per-cpu freelist. I suppose
it assumes being called under something like the page table lock ?

This code also never "flushes" the batch, which is a concern...

Alex, this is Freescale stuff, can you followup with them ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL 34/51] powerpc: move debug registers in a structure

2013-11-03 Thread Benjamin Herrenschmidt
On Mon, 2013-11-04 at 07:43 +0100, Alexander Graf wrote:
> Yeah, it's what Ben and me agreed on after I waited forever to get a
> topic branch created. Oh well, I guess this time we just have to
> manually resolve the conflicts and do a better job at communicating
> next time.

That specific one was my fault. Too much stuff on my plate and I
completely forgot about the topic branch, anyway, it's ok now.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL 34/51] powerpc: move debug registers in a structure

2013-11-03 Thread Benjamin Herrenschmidt
On Sun, 2013-11-03 at 16:30 +0200, Gleb Natapov wrote:
> On Thu, Oct 31, 2013 at 10:18:19PM +0100, Alexander Graf wrote:
> > From: Bharat Bhushan 
> > 
> > This way we can use same data type struct with KVM and
> > also help in using other debug related function.
> > 
> > Signed-off-by: Bharat Bhushan 
> > Signed-off-by: Alexander Graf 
> It would be nice to have PPC maintainers (CCed) ACKs here. This patch
> also has merging conflicts with cbc9565e (should be easy to resolve)

Is it easier if I put it in powerpc -next and resolve the conflicts ?

I was ok with the patch but yes, I should probably have put it in a
topic branch in the first place and merge it both in my tree and Alex.

Cheers,
Ben.

> > ---
> >  arch/powerpc/include/asm/processor.h |  38 +
> >  arch/powerpc/include/asm/reg_booke.h |   8 +-
> >  arch/powerpc/kernel/asm-offsets.c|   2 +-
> >  arch/powerpc/kernel/process.c|  42 +-
> >  arch/powerpc/kernel/ptrace.c | 154 
> > +--
> >  arch/powerpc/kernel/ptrace32.c   |   2 +-
> >  arch/powerpc/kernel/signal_32.c  |   6 +-
> >  arch/powerpc/kernel/traps.c  |  35 
> >  8 files changed, 147 insertions(+), 140 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/processor.h 
> > b/arch/powerpc/include/asm/processor.h
> > index e378ccc..b438444 100644
> > --- a/arch/powerpc/include/asm/processor.h
> > +++ b/arch/powerpc/include/asm/processor.h
> > @@ -147,22 +147,7 @@ typedef struct {
> >  #define TS_FPR(i) fpr[i][TS_FPROFFSET]
> >  #define TS_TRANS_FPR(i) transact_fpr[i][TS_FPROFFSET]
> >  
> > -struct thread_struct {
> > -   unsigned long   ksp;/* Kernel stack pointer */
> > -   unsigned long   ksp_limit;  /* if ksp <= ksp_limit stack overflow */
> > -
> > -#ifdef CONFIG_PPC64
> > -   unsigned long   ksp_vsid;
> > -#endif
> > -   struct pt_regs  *regs;  /* Pointer to saved register state */
> > -   mm_segment_tfs; /* for get_fs() validation */
> > -#ifdef CONFIG_BOOKE
> > -   /* BookE base exception scratch space; align on cacheline */
> > -   unsigned long   normsave[8] cacheline_aligned;
> > -#endif
> > -#ifdef CONFIG_PPC32
> > -   void*pgdir; /* root of page-table tree */
> > -#endif
> > +struct debug_reg {
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> > /*
> >  * The following help to manage the use of Debug Control Registers
> > @@ -199,6 +184,27 @@ struct thread_struct {
> > unsigned long   dvc2;
> >  #endif
> >  #endif
> > +};
> > +
> > +struct thread_struct {
> > +   unsigned long   ksp;/* Kernel stack pointer */
> > +   unsigned long   ksp_limit;  /* if ksp <= ksp_limit stack overflow */
> > +
> > +#ifdef CONFIG_PPC64
> > +   unsigned long   ksp_vsid;
> > +#endif
> > +   struct pt_regs  *regs;  /* Pointer to saved register state */
> > +   mm_segment_tfs; /* for get_fs() validation */
> > +#ifdef CONFIG_BOOKE
> > +   /* BookE base exception scratch space; align on cacheline */
> > +   unsigned long   normsave[8] cacheline_aligned;
> > +#endif
> > +#ifdef CONFIG_PPC32
> > +   void*pgdir; /* root of page-table tree */
> > +#endif
> > +   /* Debug Registers */
> > +   struct debug_reg debug;
> > +
> > /* FP and VSX 0-31 register set */
> > double  fpr[32][TS_FPRWIDTH] __attribute__((aligned(16)));
> > struct {
> > diff --git a/arch/powerpc/include/asm/reg_booke.h 
> > b/arch/powerpc/include/asm/reg_booke.h
> > index ed8f836..2e31aac 100644
> > --- a/arch/powerpc/include/asm/reg_booke.h
> > +++ b/arch/powerpc/include/asm/reg_booke.h
> > @@ -381,7 +381,7 @@
> >  #define DBCR0_IA34T0x4000  /* Instr Addr 3-4 range Toggle 
> > */
> >  #define DBCR0_FT   0x0001  /* Freeze Timers on debug event */
> >  
> > -#define dbcr_iac_range(task)   ((task)->thread.dbcr0)
> > +#define dbcr_iac_range(task)   ((task)->thread.debug.dbcr0)
> >  #define DBCR_IAC12IDBCR0_IA12  /* Range 
> > Inclusive */
> >  #define DBCR_IAC12X(DBCR0_IA12 | DBCR0_IA12X)  /* Range 
> > Exclusive */
> >  #define DBCR_IAC12MODE (DBCR0_IA12 | DBCR0_IA12X)  /* IAC 1-2 Mode 
> > Bits */
> > @@ -395,7 +395,7 @@
> >  #define DBCR1_DAC1W0x2000  /* DAC1 Write Debug Event */
> >  #define DBCR1_DAC2W0x1000  /* DAC2 Write Debug Event */
> >  
> > -#define dbcr_dac(task) ((task)->thread.dbcr1)
> > +#define dbcr_dac(task) ((task)->thread.debug.dbcr1)
> >  #define DBCR_DAC1R DBCR1_DAC1R
> >  #define DBCR_DAC1W DBCR1_DAC1W
> >  #define DBCR_DAC2R DBCR1_DAC2R
> > @@ -441,7 +441,7 @@
> >  #define DBCR0_CRET 0x0020  /* Critical Return Debug Event */
> >  #define DBCR0_FT   0x0001  /* Freeze Timers on debug event */
> >  
> > -#define dbcr_dac(task) ((task)->thread.dbcr0)
> > +#define dbcr_dac(task) ((task)->thread.debug.dbcr0)
> >  #define DBCR_DAC1R DBCR0_DAC1R

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-03 Thread Benjamin Herrenschmidt
On Thu, 2013-10-03 at 08:43 +0300, Gleb Natapov wrote:
> Why it can be a bad idea? User can drain hwrng continuously making other
> users of it much slower, or even worse, making them fall back to another
> much less reliable, source of entropy.

Not in a very significant way, we generate entropy at 1Mhz after all, which
is pretty good.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 17:37 +0300, Gleb Natapov wrote:
> On Wed, Oct 02, 2013 at 04:33:18PM +0200, Paolo Bonzini wrote:
> > Il 02/10/2013 16:08, Alexander Graf ha scritto:
> > > > The hwrng is accessible by host userspace via /dev/mem.
> > > 
> > > A guest should live on the same permission level as a user space
> > > application. If you run QEMU as UID 1000 without access to /dev/mem, why
> > > should the guest suddenly be able to directly access a memory location
> > > (MMIO) it couldn't access directly through a normal user space interface.
> > > 
> > > It's basically a layering violation.
> > 
> > With Michael's earlier patch in this series, the hwrng is accessible by
> > host userspace via /dev/hwrng, no?
> > 
> Access to which can be controlled by its permission. Permission of
> /dev/kvm may be different. If we route hypercall via userspace and
> configure qemu to get entropy from /dev/hwrng everything will fall
> nicely together (except performance).

Yes, except abysmall performance and a lot more code for something
completely and utterly pointless  nice.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 17:10 +0300, Gleb Natapov wrote:
> > The hwrng is accessible by host userspace via /dev/mem.
> > 
> Regular user has no access to /dev/mem, but he can start kvm guest and
> gain access to the device.

Seriously. You guys are really trying hard to make our life hell or
what ? That discussion about access permissions makes no sense
whatsoever. Please stop.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 16:08 +0200, Alexander Graf wrote:
> A guest should live on the same permission level as a user space
> application. If you run QEMU as UID 1000 without access to /dev/mem,
> why should the guest suddenly be able to directly access a memory
> location (MMIO) it couldn't access directly through a normal user
> space interface.
> 
> It's basically a layering violation.

Guys, please stop with the academic non-sense !

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 13:02 +0300, Gleb Natapov wrote:

> Yes, I alluded to it in my email to Paul and Paolo asked also. How this
> interface is disabled? Also hwrnd is MMIO in a host why guest needs to
> use hypercall instead of emulating the device (in kernel or somewhere
> else?). Another things is that on a host hwrnd is protected from
> direct userspace access by virtue of been a device, but guest code (event
> kernel mode) is userspace as far as hosts security model goes, so by
> implementing this hypercall in a way that directly access hwrnd you
> expose hwrnd to a userspace unconditionally. Why is this a good idea? 

BTW. Is this always going to be like this ?

Every *single* architectural or design decision we make for our
architecture has to be justified 30 times over, every piece of code bike
shedded to oblivion for month, etc... ?

Do we always have to finally get to some kind of agreement on design, go
to the 6 month bike-shedding phase, just to have somebody else come up
and start re-questioning the whole original design (without any
understanding of our specific constraints of course) ?

You guys are the most horrendous community I have ever got to work with.
It's simply impossible to get anything done in any reasonable time
frame .

At this stage, it would have taken us an order of magnitude less time to
simply rewrite an entire hypervisor from scratch.

This is sad.

Ben.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 13:02 +0300, Gleb Natapov wrote:

> Yes, I alluded to it in my email to Paul and Paolo asked also. How this
> interface is disabled? Also hwrnd is MMIO in a host why guest needs to
> use hypercall instead of emulating the device (in kernel or somewhere
> else?).

Migration will have to be dealt with one way or another, I suppose we
will indeed need a qemu fallback.

As for why hypercall instead of MMIO, well, you'd have to ask the folks
who wrote the PAPR spec :-) It's specified as a hypercall and
implemented as such in pHyp (PowerVM). The existing guests expect it
that way.

It might have to do with the required whitening done by the hypervisor
(H_RANDOM output is supposed to be clean). It also abstracts us from the
underlying HW implementation which could in theory change.
 
>  Another things is that on a host hwrnd is protected from
> direct userspace access by virtue of been a device, but guest code (event
> kernel mode) is userspace as far as hosts security model goes, so by
> implementing this hypercall in a way that directly access hwrnd you
> expose hwrnd to a userspace unconditionally. Why is this a good idea? 

Why would this be a bad idea ?

Ben.

> --
>   Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 11:11 +0200, Alexander Graf wrote:

> Right, and the difference for the patch in question is really whether
> we handle in in kernel virtual mode or in QEMU, so the bulk of the
> overhead (kicking threads out of  guest context, switching MMU
> context, etc) happens either way.
> 
> So the additional overhead when handling it in QEMU here really boils
> down to the user space roundtrip (plus another random number read
> roundtrip).

Hrm, Michael had a real mode version ... not sure where it went then.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-02 Thread Benjamin Herrenschmidt
On Wed, 2013-10-02 at 10:46 +0200, Paolo Bonzini wrote:

> 
> Thanks.  Any chance you can give some numbers of a kernel hypercall and
> a userspace hypercall on Power, so we have actual data?  For example a
> hypercall that returns H_PARAMETER as soon as possible.

I don't have (yet) numbers at hand but we have basically 3 places where
we can handle hypercalls:

 - Kernel real mode. This is where most of our MMU stuff goes for
example unless it needs to trigger a page fault in Linux. This is
executed with translation disabled and the MMU still in guest context.
This is the fastest path since we don't take out the other threads nor
perform any expensive context change. This is where we put the
"accelerated" H_RANDOM as well.

 - Kernel virtual mode. That's a full exit, so all threads are out and
MMU switched back to host Linux. Things like vhost MMIO emulation goes
there, page faults, etc...

 - Qemu. This adds the round trip to userspace on top of the above.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Benjamin Herrenschmidt
On Tue, 2013-10-01 at 13:19 +0200, Paolo Bonzini wrote:
> Il 01/10/2013 11:38, Benjamin Herrenschmidt ha scritto:
> > So for the sake of that dogma you are going to make us do something that
> > is about 100 times slower ? (and possibly involves more lines of code)
> 
> If it's 100 times slower there is something else that's wrong.  It's
> most likely not 100 times slower, and this makes me wonder if you or
> Michael actually timed the code at all.

We haven't but it's pretty obvious:

 - The KVM real mode implementation: guest issues the hcall, we remain
in real mode, within the MMU context of the guest, all secondary threads
on the core are still running in the guest, and we do an MMIO & return.

 - The qemu variant: guest issues the hcall we need to exit the guest,
which means bring *all* threads on the core out of KVM, switch the full
MMU context back to host (which among others involves flushing the ERAT,
aka level 1 TLB), while sending the secondary threads into idle loops.
Then we return to qemu user context, which will then use /dev/random ->
back into the kernel and out, at which point we can return to the guest,
so back into the kernel, back into run which means IPI the secondary
threads on the core, switch the MMU context again until we can finally
go back to executing guest instructions.

So no we haven't measured. But it is going to be VERY VERY VERY much
slower. Our exit latencies are bad with our current MMU *and* any exit
is going to cause all secondary threads on the core to have to exit as
well (remember P7 is 4 threads, P8 is 8)

> > It's not just speed ... H_RANDOM is going to be called by the guest
> > kernel. A round trip to qemu is going to introduce a kernel jitter
> > (complete stop of operations of the kernel on that virtual processor) of
> > a full exit + round trip to qemu + back to the kernel to get to some
> > source of random number ...  this is going to be in the dozens of ns at
> > least.
> 
> I guess you mean dozens of *micro*seconds, which is somewhat exaggerated
> but not too much.  On x86 some reasonable timings are:

Yes.

>   100 cyclesbare metal rdrand
>   2000 cycles   guest->hypervisor->guest
>   15000 cycles  guest->userspace->guest
> 
> (100 cycles = 40 ns = 200 MB/sec; 2000 cycles = ~1 microseconds; 15000
> cycles = ~7.5 microseconds).  Even on 5 year old hardware, a userspace
> roundtrip is around a dozen microseconds.

So in your case going to qemu to "emulate" rdrand would indeed be 150
times slower, I don't see in what universe that would be considered a
good idea.

> Anyhow, I would like to know more about this hwrng and hypercall.
> 
> Does the hwrng return random numbers (like rdrand) or real entropy (like
> rdseed that Intel will add in Broadwell)?

It's a random number obtained from sampling a set of oscillators. It's
slightly biased but we have very simple code (I believe shared with the
host kernel implementation) for whitening it as is required by PAPR.
 
>   What about the hypercall?
> For example virtio-rng is specified to return actual entropy, it doesn't
> matter if it is from hardware or software.
> 
> In either case, the patches have problems.
> 
> 1) If the hwrng returns random numbers, the whitening you're doing is
> totally insufficient and patch 2 is forging entropy that doesn't exist.

I will let Paul to comment on the whitening, it passes all the tests
we've been running it through.

> 2) If the hwrng returns entropy, a read from the hwrng is going to even
> more expensive than an x86 rdrand (perhaps ~2000 cycles).

Depends how often you read, the HW I think is sampling asynchronously so
you only block on the MMIO if you already consumed the previous sample
but I'll let Paulus provide more details here.

>   Hence, doing
> the emulation in the kernel is even less necessary.  Also, if the hwrng
> returns entropy patch 1 is unnecessary: you do not need to waste
> precious entropy bits by passing them to arch_get_random_long; just run
> rngd in the host as that will put the entropy to much better use.
>
> 3) If the hypercall returns random numbers, then it is a pretty
> braindead interface since returning 8 bytes at a time limits the
> throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
>  But more important: in this case drivers/char/hw_random/pseries-rng.c
> is completely broken and insecure, just like patch 2 in case (1) above.

How so ?

> 4) If the hypercall returns entropy (same as virtio-rng), the same
> considerations on speed apply.  If you can only produce entropy at say 1
> MB/s (so reading 8 bytes take 8 microseconds---which is actually very
> fast), it doesn't matter that much to spend 7 microseco

Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems

2013-10-01 Thread Benjamin Herrenschmidt
On Tue, 2013-10-01 at 11:39 +0300, Gleb Natapov wrote:
> On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
> > On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> > > Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > > > Some powernv systems include a hwrng. Guests can access it via the
> > > > H_RANDOM hcall.
> > > 
> > > Is there any reason to do this in the kernel?  
> > 
> > It's less code, and it's faster :)
> > 
> > > It does not have to be a particularly fast path;
> > 
> > Sure, but do we have to make it slow on purpose?
> > 
> We do not put non performance critical devices into the kernel.

So for the sake of that dogma you are going to make us do something that
is about 100 times slower ? (and possibly involves more lines of code)

It's not just speed ... H_RANDOM is going to be called by the guest
kernel. A round trip to qemu is going to introduce a kernel jitter
(complete stop of operations of the kernel on that virtual processor) of
a full exit + round trip to qemu + back to the kernel to get to some
source of random number ...  this is going to be in the dozens of ns at
least.

This makes no sense.

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] hwrng: Add a driver for the hwrng found in power7+ systems

2013-09-26 Thread Benjamin Herrenschmidt
On Thu, 2013-09-26 at 16:31 +1000, Michael Ellerman wrote:

> +   pr_info("registered powernv hwrng.\n");

First letter of a line should get a capital :-) Also since
it's per-device, at least indicate the OF path or the chip number or
something ...

Cheers,
Ben.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] powerpc: Implement arch_get_random_long/int() for powernv

2013-09-26 Thread Benjamin Herrenschmidt
On Thu, 2013-09-26 at 16:31 +1000, Michael Ellerman wrote:
> +   pr_info_once("registering arch random hook\n");

Either pr_debug or make it nicer looking :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/11] KVM: PPC: Book3S HV: Add support for guest Program Priority Register

2013-09-16 Thread Benjamin Herrenschmidt
On Fri, 2013-09-06 at 13:22 +1000, Paul Mackerras wrote:
> POWER7 and later IBM server processors have a register called the
> Program Priority Register (PPR), which controls the priority of
> each hardware CPU SMT thread, and affects how fast it runs compared
> to other SMT threads.  This priority can be controlled by writing to
> the PPR or by use of a set of instructions of the form or rN,rN,rN
> which are otherwise no-ops but have been defined to set the priority
> to particular levels.
> 
> This adds code to context switch the PPR when entering and exiting
> guests and to make the PPR value accessible through the SET/GET_ONE_REG
> interface.  When entering the guest, we set the PPR as late as
> possible, because if we are setting a low thread priority it will
> make the code run slowly from that point on.  Similarly, the
> first-level interrupt handlers save the PPR value in the PACA very
> early on, and set the thread priority to the medium level, so that
> the interrupt handling code runs at a reasonable speed.
> 
> Signed-off-by: Paul Mackerras 

Acked-by: Benjamin Herrenschmidt 

Alex, can you take this via your tree ?

Cheers,
Ben.

> ---
>  Documentation/virtual/kvm/api.txt |  1 +
>  arch/powerpc/include/asm/exception-64s.h  |  8 
>  arch/powerpc/include/asm/kvm_book3s_asm.h |  1 +
>  arch/powerpc/include/asm/kvm_host.h   |  1 +
>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>  arch/powerpc/kernel/asm-offsets.c |  2 ++
>  arch/powerpc/kvm/book3s_hv.c  |  6 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 12 +++-
>  8 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 1030ac9..34a32b6 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1836,6 +1836,7 @@ registers, find a list below:
>PPC   | KVM_REG_PPC_ACOP   | 64
>PPC   | KVM_REG_PPC_VRSAVE | 32
>PPC   | KVM_REG_PPC_LPCR   | 64
> +  PPC   | KVM_REG_PPC_PPR| 64
>PPC   | KVM_REG_PPC_TM_GPR0| 64
>...
>PPC   | KVM_REG_PPC_TM_GPR31   | 64
> diff --git a/arch/powerpc/include/asm/exception-64s.h 
> b/arch/powerpc/include/asm/exception-64s.h
> index 07ca627..b86c4db 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -203,6 +203,10 @@ do_kvm_##n:  
> \
>   ld  r10,area+EX_CFAR(r13);  \
>   std r10,HSTATE_CFAR(r13);   \
>   END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947);  \
> + BEGIN_FTR_SECTION_NESTED(948)   \
> + ld  r10,area+EX_PPR(r13);   \
> + std r10,HSTATE_PPR(r13);\
> + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
>   ld  r10,area+EX_R10(r13);   \
>   stw r9,HSTATE_SCRATCH1(r13);\
>   ld  r9,area+EX_R9(r13); \
> @@ -216,6 +220,10 @@ do_kvm_##n:  
> \
>   ld  r10,area+EX_R10(r13);   \
>   beq 89f;\
>   stw r9,HSTATE_SCRATCH1(r13);\
> + BEGIN_FTR_SECTION_NESTED(948)   \
> + ld  r9,area+EX_PPR(r13);\
> + std r9,HSTATE_PPR(r13); \
> + END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948);\
>   ld  r9,area+EX_R9(r13); \
>   std r12,HSTATE_SCRATCH0(r13);   \
>   li  r12,n;  \
> diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> b/arch/powerpc/include/asm/kvm_book3s_asm.h
> index 9039d3c..22f4606 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> @@ -101,6 +101,7 @@ struct kvmppc_host_state {
>  #endif
>  #ifdef CONFIG_PPC_BOOK3S_64
>   u64 cfar;
> + u64 ppr;
>  #endif
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 9741bf0..b0dcd18 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -464,6 +464,7 @@ struct kvm_vcpu_arch {
>  

Re: [PATCH 19/23] KVM: PPC: Book3S: Select PR vs HV separately for each guest

2013-09-12 Thread Benjamin Herrenschmidt
On Fri, 2013-09-13 at 10:17 +1000, Paul Mackerras wrote:

> Aneesh and I are currently investigating an alternative approach,
> which is much more like the x86 way of doing things.  We are looking
> at splitting the code into three modules: a kvm_pr.ko module with the
> PR-specific bits, a kvm_hv.ko module with the HV-specific bits, and a
> core kvm.ko module with the generic bits (basically book3s.c,
> powerpc.c, stuff from virt/kvm/, plus the stuff that both PR and HV
> use).  Basically the core module would have a pointer to a struct
> full of function pointers for the various ops that book3s_pr.c and
> book3s_hv.c both provide.  You would only be able to have one of
> kvm_pr and kvm_hv loaded at any one time.  If they were built in, you
> could have them both built in but only one could register its function
> pointer struct with the core.  Obviously the kvm_hv module would only
> load and register its struct on a machine that had hypervisor mode
> available.  If they were both built in I would think we would give HV
> the first chance to register itself, and let PR register if we can't
> do HV.
> 
> How does that sound?

As long as we can force-load the PR one on a machine that normally runs
HV for the sake of testing ...

Also, all those KVM modules ... they don't auto-load do they ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-09-04 Thread Benjamin Herrenschmidt
On Tue, 2013-09-03 at 13:53 +0300, Gleb Natapov wrote:
> > Or supporting all IOMMU links (and leaving emulated stuff as is) in on
> > "device" is the last thing I have to do and then you'll ack the patch?
> > 
> I am concerned more about API here. Internal implementation details I
> leave to powerpc experts :)

So Gleb, I want to step in for a bit here.

While I understand that the new KVM device API is all nice and shiny and that 
this
whole thing should probably have been KVM devices in the first place (had they
existed or had we been told back then), the point is, the API for handling
HW IOMMUs that Alexey is trying to add is an extension of an existing mechanism
used for emulated IOMMUs.

The internal data structure is shared, and fundamentally, by forcing him to
use that new KVM device for the "new stuff", we create a oddball API with
an ioctl for one type of iommu and a KVM device for the other, which makes
the implementation a complete mess in the kernel (and you should care :-)

So for something completely new, I would tend to agree with you. However, I
still think that for this specific case, we should just plonk-in the original
ioctl proposed by Alexey and be done with it.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg

2013-08-30 Thread Benjamin Herrenschmidt
On Fri, 2013-08-30 at 16:01 +0200, Alexander Graf wrote:
> > 
> > - The TM state is offset bu 0x1000.  Other than being bigger than
> the
> >  SPR space, it's fairly arbitrarily chose. 

Make it higher, just in case

Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-08-27 Thread Benjamin Herrenschmidt
On Wed, 2013-08-28 at 10:51 +1000, Alexey Kardashevskiy wrote:
> The ioctl I made up is basically a copy of KVM_CREATE_SPAPR_TCE which does
> the same thing for emulated devices and it is there for quite a while but
> it is not really extensible. And these two ioctls share some bits of code.
> Now we will have 2 pieces of code which do almost the same thing but in a
> different way. Kinda sucks :(

Right. Thus the question, Gleb, we can either:

 - Keep Alexey patch as-is allowing us to *finally* merge that stuff
that's been around for monthes

 - Convert *both* existing TCE objects to the new 
KVM_CREATE_DEVICE, and have some backward compat code for the old one.

I don't think it makes sense to have the "emulated TCE" and "IOMMU TCE"
objects use a fundamentally different API and infrastructure.

> >> So my stuff is not going to upstream again. Heh. Ok. I'll implement it.
> >>
> > Thanks! Should I keep KVM_CAP_SPAPR_MULTITCE capability patch or can I
> > drop it for now?
> 
> Please keep it, it is unrelated to the IOMMU-VFIO thing.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-27 Thread Benjamin Herrenschmidt
On Tue, 2013-08-27 at 09:41 +0300, Gleb Natapov wrote:
> > Oh and Alexey mentions that there are two capabilities and you only
> > applied one :-)
> > 
> Another one is:
>  [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for
> realmode VFIO
> ?

Yes, thanks !

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-27 Thread Benjamin Herrenschmidt
On Tue, 2013-08-27 at 09:40 +0300, Gleb Natapov wrote:
> > Thanks. Since it's not in a topic branch that I can pull, I'm going to
> > just cherry-pick them. However, they are in your "queue" branch, not
> > "next" branch. Should I still assume this is a stable branch and that
> > the numbers aren't going to change ?
> > 
> Queue will become next after I will test it and if test will fail the
> commit hash may change, but since you are going to cherry-pick and this
> does not preserve commit hash it does not matter.

Right, as long as the actual cap and ioctl numbers remain stable :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Benjamin Herrenschmidt
On Tue, 2013-08-27 at 14:19 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-08-26 at 15:37 +0300, Gleb Natapov wrote:
> > > Gleb, any chance you can put this (and the next one) into a tree to
> > > "lock in" the numbers ?
> > > 
> > Applied it. Sorry for slow response, was on vocation and still go
> > through the email backlog.
> 
> Thanks. Since it's not in a topic branch that I can pull, I'm going to
> just cherry-pick them. However, they are in your "queue" branch, not
> "next" branch. Should I still assume this is a stable branch and that
> the numbers aren't going to change ?

Oh and Alexey mentions that there are two capabilities and you only
applied one :-)

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Benjamin Herrenschmidt
On Mon, 2013-08-26 at 15:37 +0300, Gleb Natapov wrote:
> > Gleb, any chance you can put this (and the next one) into a tree to
> > "lock in" the numbers ?
> > 
> Applied it. Sorry for slow response, was on vocation and still go
> through the email backlog.

Thanks. Since it's not in a topic branch that I can pull, I'm going to
just cherry-pick them. However, they are in your "queue" branch, not
"next" branch. Should I still assume this is a stable branch and that
the numbers aren't going to change ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio-pci: PCI hot reset interface

2013-08-19 Thread Benjamin Herrenschmidt
On Mon, 2013-08-19 at 16:59 -0600, Alex Williamson wrote:
> On Tue, 2013-08-20 at 08:42 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-08-19 at 14:02 -0600, Bjorn Helgaas wrote:
> > > I guess.  And supply the pci_slot rather than the pci_dev?  I'm a
> > > little bit worried because the idea of a "slot" is not well-defined in
> > > the spec, and we have sort of an ad hoc method of discovering and
> > > managing them, e.g., acpiphp and pciehp might discover the same slot.
> > > But I guess that's no reason to bury generic code in vfio.
> > 
> > And I don't have pci_slot's at all yet on powerpc "powernv" (the host
> > platform for KVM) since at this stage we don't support physical hotplug
> > on the target machines...
> > 
> > Alex, why specifically looking for "slots" here ? I don't quite
> > understand. It makes sense to be able to reset individual devices
> > whether they are on the otherboard, behind extension chassis or directly
> > on slots...
> 
> a) resetting a slot may have a smaller footprint than resetting a bus,

*May* ... at least on PCIe there is no difference. I suppose PCI pre-E
slots might have individual reset controls though the way to get
them is fairly platform specific.

> b) hotplug controllers sometimes need to be involved in a bus reset.
> For b) I have a specific example where my Lenovo S20 workstation has an
> onboard tg3 NIC attached to a root port supporting pciehp (go figure
> since the tg3 is soldered onto the motherboard) and doing a secondary
> bus reset at the root port triggers a presence detection change and
> therefore tries to do a surprise removal.  By doing a "slot" reset, I
> have the hotplug controller code manage the bus reset by disabling
> presence detection around the bus reset.  If you don't have slots and
> you don't need anything special around a secondary bus reset, you're
> fine.  It's just an opportunity to provide a hook for the hotplug
> controller to participate.  Thanks,

Yuck, junk HW again ... oh well, I suppose that's never going to end...

As long as the code works without the slots I'm fine :-) As I mentioned,
we might have to do a whole different infrastructure for EEH anyway
(which sucks but we have little choice in the matter).

Cheers,
Ben.

> Alex
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   >