Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Wed, 2018-08-08 at 05:30 -0700, Christoph Hellwig wrote:
> On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> > Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> > is not set (default) but there's nothing in the device-tree to tell the
> > guest about this since it's a violation of our pseries architecture, so
> > we just rely on Linux virtio "knowing" that it happens. It's a bit
> > yucky but that's now history...
> 
> That is ugly as hell, but it is how virtio works everywhere, so nothing
> special so far.

Yup.

> > Essentially pseries "architecturally" does not have the concept of not
> > having an iommu in the way and qemu violates that architecture today.
> > 
> > (Remember it comes from pHyp, our proprietary HV, which we are somewhat
> > mimicking here).
> 
> It shouldn't be too hard to have a dt property that communicates this,
> should it?

We could invent something, I suppose. The additional problem then (yeah,
I know ... what a mess) is that qemu doesn't create the DT for PCI
devices; the firmware (SLOF) inside the guest does, using normal PCI
probing.

That said, that FW could know about all the virtio vendor/device IDs,
check the VIRTIO_F_IOMMU_PLATFORM and set that property accordingly...
messy but doable. It's not a bus property (see my other reply below as
this could complicate things with your bus mask).
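
As a purely illustrative sketch (not from this thread) of the kind of check
such firmware or early platform code could do: virtio_has_feature() and
VIRTIO_F_IOMMU_PLATFORM are the names Linux uses for the feature test, while
mark_dt_node_iommu_bypass() is a hypothetical helper standing in for "set
that property".

#include <linux/of.h>
#include <linux/virtio_config.h>

/* Hypothetical helper: record in the DT that this device bypasses the iommu. */
extern void mark_dt_node_iommu_bypass(struct device_node *np);

static void virtio_dt_fixup(struct virtio_device *vdev, struct device_node *np)
{
        /* Device promises to honour the platform iommu: nothing to do. */
        if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
                return;

        /* Legacy behaviour: the device uses guest physical addresses. */
        mark_dt_node_iommu_bypass(np);
}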

But we are drifting from the problem at hand :-) You propose we do set
VIRTIO_F_IOMMU_PLATFORM so we aren't in the above case, and the bypass
stuff works, so no need to touch it.

See my recap at the end of the email to make sure I understand fully
what you suggest.

> > So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> > through that iommu and performance will suffer (esp vhost I suspect),
> > especially since adding/removing translations in the iommu is a
> > hypercall.

> Well, we'd need to make sure that for this particular bus we skip the
> actual iommu.

It's not a bus property. Qemu will happily mix up everything on the
same bus, that includes emulated devices that go through the emulated
iommu, real VFIO devices that go through an actual HW iommu and virtio
that bypasses everything.

This makes things tricky in general (not just in my powerpc secure VM
case) since, at least on powerpc but I suppose elsewhere too, iommu
related properties tend to be per "bus" while here, qemu will mix and
match.

But again, I think we are drifting away from the topic, see below

> > > It would not be the same effect.  The problem with that is that you must
> > > now assume that your qemu knows that, for example, you might be passing
> > > a dma offset if the bus otherwise requires it.
> > 
> > I would assume that arch_virtio_wants_dma_ops() only returns true when
> > no such offsets are involved, at least in our case that would be what
> > happens.
> 
> That would work, but we're really piling hacks on top of hacks here.

Sort-of :-) At least none of what we are discussing now involves
touching the dma_ops themselves so we are not in the way of your big
cleanup operation here. But yeah, let's continue discussing your other
solution below.
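
To make the arch_virtio_wants_dma_ops() idea above concrete, here is a rough
sketch; the function name comes from this discussion, but the body is only an
assumption (is_secure_guest() is a hypothetical platform predicate, and the
"no DMA offset involved" condition is implied rather than checked explicitly).

#include <linux/virtio.h>

/* Hypothetical platform predicate: has this guest turned itself secure? */
extern bool is_secure_guest(void);

/*
 * Opt virtio into the DMA API only when the platform actually needs it
 * (here: a secure guest that must bounce I/O through shared memory),
 * and only in configurations where no DMA offset would change the
 * addresses the host sees.
 */
bool arch_virtio_wants_dma_ops(struct virtio_device *vdev)
{
        return is_secure_guest();
}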

> > >  Or in other words:
> > > you potentially break the contract between qemu and the guest of always
> > > passing down physical addresses.  If we explicitly change that contract
> > > through using a flag that says you pass bus address everything is fine.
> > 
> > For us a "bus address" is behind the iommu so that's what
> > VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> > bus address that is different. I suppose it's an ARMism to have DMA
> > offsets that are separate from iommus ? 
> 
> No, a lot of platforms support a bus address that has an offset from
> the physical address, including a lot of power platforms:

Ok, just talking past each other :-) For all the powerpc ones, these
*do* go through the iommu, which is what I meant. It's just a window of
the iommu that provides some kind of direct mapping of memory.

For pseries, there is no such thing however. What we do to avoid
constant map/unmap of iommu PTEs in pseries guests is that we use
hypercalls to create a 64-bit window and populate all its PTEs with an
identity mapping. But that's not as efficient as a real bypass.
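
Conceptually, that setup looks something like the sketch below. This is not
pseries code: hv_put_tce(), TCE_READ and TCE_WRITE are illustrative stand-ins
for the real H_PUT_TCE-style hypercall interface, and window creation and
liobn handling are omitted entirely.

/* Illustrative stand-ins for the hypervisor TCE interface. */
#define TCE_READ   0x1UL
#define TCE_WRITE  0x2UL
extern void hv_put_tce(unsigned long liobn, unsigned long ioba,
                       unsigned long tce);

/*
 * Populate a 64-bit DMA window with an identity mapping: every I/O bus
 * address translates to the same guest physical address.  One hypercall
 * per IOMMU page at setup time, no map/unmap cost afterwards.
 */
static void populate_identity_window(unsigned long liobn,
                                     unsigned long window_size,
                                     unsigned int page_shift)
{
        unsigned long ioba;

        for (ioba = 0; ioba < window_size; ioba += 1UL << page_shift)
                hv_put_tce(liobn, ioba, ioba | TCE_READ | TCE_WRITE);
}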

There are good historical reasons for that: since pseries is a guest
platform, its memory is never really where the guest thinks it is, so
you always need an iommu to remap. Even for virtual devices: for most
of them, in the "IBM" pHyp model, the "peer" is actually another
partition, so the virtual iommu handles translating across the two
partitions.

Same goes for Cell in HW: no real bypass, just the iommu being
configured with very large pages and a fixed mapping.

powernv has a separate physical window that can be configured as a real
bypass though, so does the U4 DART. Not sure about the FSL one.

But yeah, your point stands, this is just implementation details.

> arch/powerpc/kernel

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Tue, Aug 07, 2018 at 05:56:59AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-08-06 at 16:46 +0300, Michael S. Tsirkin wrote:
> > 
> > > Right, we'll need some quirk to disable balloons  in the guest I
> > > suppose.
> > > 
> > > Passing something from libvirt is cumbersome because the end user may
> > > not even need to know about secure VMs. There are use cases where the
> > > security is a contract down to some special application running inside
> > > the secure VM, the sysadmin knows nothing about.
> > > 
> > > Also there's repercussions all the way to admin tools, web UIs etc...
> > > so it's fairly wide ranging.
> > > 
> > > So as long as we only need to quirk a couple of devices, it's much
> > > better contained that way.
> > 
> > So just the balloon thing already means that yes, management all the
> > way up to the user tools must know this is going on. Otherwise the
> > user will try to inflate the balloon and wonder why this does not work.
> 
> There are *dozens* of management systems out there, not even all open
> source; we won't ever see the end of the tunnel if we need to teach
> every single one of them, including end users, about platform-specific
> new VM flags like that.
> 
> .../...

In the end I suspect you will find you have to.

> > Here's another example: you can't migrate a secure vm to a hypervisor
> > which doesn't support this feature. Again, management tools above libvirt
> > need to know, otherwise they will try.
> 
> There will have to be a new machine type for that I suppose, yes,
> though it's not just the hypervisor that needs to know about the
> modified migration stream, it's also the need to have a compatible
> ultravisor with the right keys on the other side.
> 
> So migration is going to be special and require extra admin work in all
> cases yes. But not all secure VMs are meant to be migratable.
> 
> In any case, back to the problem at hand. What a qemu flag gives us is
> just a way to force iommu at VM creation time.

I don't think a qemu flag is strictly required for the problem at hand.

> This is rather sub-optimal, we don't really want the iommu in the way,
> so it's at best a "workaround", and it's not really solving the real
> problem.

This specific problem, I think I agree.

> As I said replying to Christoph, we are "leaking" into the interface
> something that is really about what the VM is doing to itself, which
> is to stash its memory away in an inaccessible place.
> 
> Cheers,
> Ben.

I think Christoph merely objects to the specific implementation.  If
instead you do something like tweak dev->bus_dma_mask for the virtio
device I think he won't object.

-- 
MST
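
A minimal sketch of the kind of tweak suggested above, assuming platform code
can hook device registration on the virtio bus: is_secure_guest() and
SECURE_GUEST_DMA_LIMIT are hypothetical, and dev->bus_dma_mask is the field
named in the message. Whether a plain mask is sufficient for the secure-VM
case is, of course, what the rest of the thread debates.

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/notifier.h>

/* Hypothetical: predicate and bus-address limit for a secure guest. */
extern bool is_secure_guest(void);
#define SECURE_GUEST_DMA_LIMIT  DMA_BIT_MASK(30)

static int virtio_secure_dma_notify(struct notifier_block *nb,
                                    unsigned long action, void *data)
{
        struct device *dev = data;

        /*
         * Constrain the bus mask so the generic DMA code bounces buffers
         * (via swiotlb) instead of handing out arbitrary guest physical
         * addresses the host cannot touch.
         */
        if (action == BUS_NOTIFY_ADD_DEVICE && is_secure_guest())
                dev->bus_dma_mask = SECURE_GUEST_DMA_LIMIT;

        return NOTIFY_DONE;
}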


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 07:13:32PM +0300, Michael S. Tsirkin wrote:
> Oh that makes sense then. Could you post a pointer pls so
> this patchset is rebased on top (there are things to
> change about 4/4 but 1-3 could go in if they don't add
> overhead)?

The dma mapping direct calls will need major rework vs what I posted.
I plan to start that work in about two weeks, once I return from my
vacation.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Michael S. Tsirkin
On Mon, Aug 06, 2018 at 08:24:06AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > > >> Please go through these patches and review whether this approach
> > > >> broadly makes sense. I will appreciate suggestions, inputs, comments
> > > >> regarding the patches or the approach in general. Thank you.
> > > >
> > > > Jason did some work on profiling this. Unfortunately he reports
> > > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > > 
> > >  The test is rather simple, just run pktgen (pktgen_sample01_simple.sh)
> > >  in guest and measure PPS on tap on host.
> > > 
> > >  Thanks
> > > >>>
> > > >>> Could you supply host configuration involved please?
> > > >>
> > > >> I wonder how much of that could be caused by Spectre mitigations
> > > >> blowing up indirect function calls...
> > > >>
> > > >> Cheers,
> > > >> Ben.
> > > > 
> > > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > > 
> > > Did we get better results (lower regression due to indirect calls) with
> > > the suggested mitigation ? Just curious.
> > 
> > I'm referring to this:
> > I wonder whether we can support map_sg and friends being NULL, then use
> > that when mapping is an identity. A conditional branch there is likely
> > very cheap.
> > 
> > I don't think anyone tried implementing this yet.
> 
> I've done something very similar in the thread I posted a few years
> ago.

Right, so that was before Spectre, when a virtual call was cheaper :(

> I plan to get a version of that upstream for 4.20, but it won't
> cover the virtio case, just the real direct mapping.

I guess this RFC will have to be reworked on top and performance retested.

Thanks,

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-06 Thread Christoph Hellwig
On Mon, Aug 06, 2018 at 04:36:43PM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 06, 2018 at 02:32:28PM +0530, Anshuman Khandual wrote:
> > On 08/05/2018 05:54 AM, Michael S. Tsirkin wrote:
> > > On Fri, Aug 03, 2018 at 08:21:26PM -0500, Benjamin Herrenschmidt wrote:
> > >> On Fri, 2018-08-03 at 22:08 +0300, Michael S. Tsirkin wrote:
> > >> Please go through these patches and review whether this approach
> > >> broadly makes sense. I will appreciate suggestions, inputs, comments
> > >> regarding the patches or the approach in general. Thank you.
> > >
> > > Jason did some work on profiling this. Unfortunately he reports
> > > about 4% extra overhead from this switch on x86 with no vIOMMU.
> > 
> >  The test is rather simple, just run pktgen (pktgen_sample01_simple.sh)
> >  in guest and measure PPS on tap on host.
> > 
> >  Thanks
> > >>>
> > >>> Could you supply host configuration involved please?
> > >>
> > >> I wonder how much of that could be caused by Spectre mitigations
> > >> blowing up indirect function calls...
> > >>
> > >> Cheers,
> > >> Ben.
> > > 
> > > I won't be surprised. If yes I suggested a way to mitigate the overhead.
> > 
> > Did we get better results (lower regression due to indirect calls) with
> > the suggested mitigation ? Just curious.
> 
> I'm referring to this:
>   I wonder whether we can support map_sg and friends being NULL, then use
>   that when mapping is an identity. A conditional branch there is likely
>   very cheap.
> 
> I don't think anyone tried implementing this yet.

I've done something very similar in the thread I posted a few years
ago. I plan to get a version of that upstream for 4.20, but it won't
cover the virtio case, just the real direct mapping.
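
For illustration, the NULL-ops fast path discussed above could look roughly
like this. It is only a sketch of the idea, not the code that later went
upstream, and the helper name dma_map_page_fast is made up.

#include <linux/dma-mapping.h>

static inline dma_addr_t dma_map_page_fast(struct device *dev,
                                           struct page *page,
                                           unsigned long offset, size_t size,
                                           enum dma_data_direction dir)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);

        /*
         * NULL ops == identity mapping: a predictable conditional branch
         * instead of a (post-Spectre, retpolined) indirect call.
         */
        if (!ops)
                return page_to_phys(page) + offset;

        return ops->map_page(dev, page, offset, size, dir, 0);
}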


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-04 Thread Michael S. Tsirkin
On Fri, Aug 03, 2018 at 08:16:21PM -0500, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-03 at 22:07 +0300, Michael S. Tsirkin wrote:
> > On Fri, Aug 03, 2018 at 10:58:36AM -0500, Benjamin Herrenschmidt wrote:
> > > On Fri, 2018-08-03 at 00:05 -0700, Christoph Hellwig wrote:
> > > > >   2- Make virtio use the DMA API with our custom platform-provided
> > > > > swiotlb callbacks when needed, that is when not using IOMMU *and*
> > > > > running on a secure VM in our case.
> > > > 
> > > > And total NAK the custom platform-provided part of this.  We need
> > > > a flag passed in from the hypervisor that says the device needs all
> > > > bus-specific dma api treatment, and then just use the normal platform
> > > > dma mapping setup.
> > > 
> > > Christoph, as I have explained already, we do NOT have a way to provide
> > > such a flag as neither the hypervisor nor qemu knows anything about
> > > this when the VM is created.
> > 
> > I think the fact you can't add flags from the hypervisor is
> > a sign of a problematic architecture; you should look at
> > adding that down the road - you will likely need it at some point.
> 
> Well, we can later in the boot process. At VM creation time, it's just
> a normal VM. The VM firmware, bootloader etc... are just operating
> normally etc...

I see the allure of this, but I think down the road you will
discover passing a flag in libvirt XML saying
"please use a secure mode" or whatever is a good idea.

Even though it is probably not required to address this
specific issue.

For example, I don't think ballooning works in secure mode; with such a
flag you would be able to teach libvirt not to try to add a
balloon to the guest.

> Later on, (we may have even already run Linux at that point,
> unsecurely, as we can use Linux as a bootloader under some
> circumstances), we start a "secure image".
> 
> This is a kernel zImage that includes a "ticket" that has the
> appropriate signature etc... so that when that kernel starts, it can
> authenticate with the ultravisor, be verified (along with its ramdisk)
> etc... and copied (by the UV) into secure memory & run from there.
> 
> At that point, the hypervisor is informed that the VM has become
> secure.
> 
> So at that point, we could exit to qemu to inform it of the change,

That's probably a good idea too.

> and
> have it walk the qtree and "Switch" all the virtio devices to use the
> IOMMU I suppose, but it feels a lot grosser to me.

That part feels gross, yes.

> That's the only other option I can think of.
> 
> > However in this specific case, the flag does not need to come from the
> > hypervisor, it can be set by arch boot code I think.
> > Christoph do you see a problem with that?
> 
> The above could do that, yes. Another approach would be to do it from a
> small virtio "quirk" that pokes a bit in the device to force it to
> iommu mode when it detects that we are running in a secure VM. That's a
> bit warty on the virtio side but probably not as much as having a qemu
> one that walks all of the virtio devices to change how they behave.
> 
> What do you reckon ?

I think you are right that for the dma limit the hypervisor doesn't seem
to need to know.

> What we want to avoid is to expose any of this to the *end user* or
> libvirt or any other higher level of the management stack. We really
> want that stuff to remain contained between the VM itself, KVM and
> maybe qemu.
>
> We will need some other qemu changes for migration so that's ok. But
> the minute you start touching libvirt and the higher levels it becomes
> a nightmare.
> 
> Cheers,
> Ben.

I don't believe you'll be able to avoid that entirely. The split between
libvirt and qemu is more about community than about code; random bits of
functionality tend to land on random sides of that fence.  Better to add a
tag in domain XML early is my advice. Having said that, it's your
hypervisor. I'm just suggesting that when the hypervisor does somehow need
to care, I suspect most people won't be receptive to the argument
that changing libvirt is a nightmare.

> > > >  To get swiotlb you'll need to then use the DT/ACPI
> > > > dma-range property to limit the addressable range, and a swiotlb
> > > > capable platform will use swiotlb automatically.
> > > 
> > > This cannot be done as you describe it.
> > > 
> > > The VM is created as a *normal* VM. The DT stuff is generated by qemu
> > > at a point where it has *no idea* that the VM will later become secure
> > > and thus will have to restrict which pages can be used for "DMA".
> > > 
> > > The VM will *at runtime* turn itself into a secure VM via interactions
> > > with the security HW and the Ultravisor layer (which sits below the
> > > HV). This happens way after the DT has been created and consumed, the
> > > qemu devices instantiated etc...
> > > 
> > > Only the guest kernel knows because it initiates the transition. When
> > > that happens, the virtio devices have already been used by the guest
> > > firmware, bootloader, possibly another kernel t

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-03 Thread Christoph Hellwig
On Thu, Aug 02, 2018 at 11:53:08PM +0300, Michael S. Tsirkin wrote:
> > We don't need cache flushing tricks.
> 
> You don't but do real devices on same platform need them?

IBM's power platforms are always cache coherent.  There are some powerpc
platforms that do not have cache coherent DMA, but I guess this scheme isn't
intended for them.


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-07-27 Thread Anshuman Khandual
On 07/27/2018 03:28 PM, Will Deacon wrote:
> Hi Anshuman,
> 
> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
>> This patch series is the follow up on the discussions we had before about
>> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
>> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
>> were suggestions about doing away with two different paths of transactions
>> with the host/QEMU, first being the direct GPA and the other being the DMA
>> API based translations.
>>
>> First patch attempts to create a direct GPA mapping based DMA operations
>> structure called 'virtio_direct_dma_ops' with exact same implementation
>> of the direct GPA path which virtio core currently has but just wrapped in
>> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
>> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
>> existing semantics. The second patch does exactly that inside the function
>> virtio_finalize_features(). The third patch removes the default direct GPA
>> path from virtio core forcing it to use DMA API callbacks for all devices.
>> Now with that change, every device must have a DMA operations structure
>> associated with it. The fourth patch adds an additional hook which gives
>> the platform an opportunity to do yet another override if required. This
>> platform hook can be used on POWER Ultravisor based protected guests to
>> load up SWIOTLB DMA callbacks to do the required (as discussed previously
>> in the above mentioned thread how host is allowed to access only parts of
>> the guest GPA range) bounce buffering into the shared memory for all I/O
>> scatter gather buffers to be consumed on the host side.
>>
>> Please go through these patches and review whether this approach broadly
>> makes sense. I will appreciate suggestions, inputs, comments regarding
>> the patches or the approach in general. Thank you.
> I just wanted to say that this patch series provides a means for us to
> force the coherent DMA ops for legacy virtio devices on arm64, which in turn
> means that we can enable the SMMU with legacy devices in our fastmodel
> emulation platform (which is slowly being upgraded to virtio 1.0) without
> hanging during boot. Patch below.
> 
> So:
> 
> Acked-by: Will Deacon 
> Tested-by: Will Deacon 

Thanks Will.
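
For readers without the patches at hand, here is a minimal sketch of what the
'virtio_direct_dma_ops' described in the cover letter could look like. It is
reconstructed from the description above, not taken from the posted series:
every "mapping" is just the buffer's guest physical address, with no-op
unmap/sync on a cache-coherent platform.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
                                         unsigned long offset, size_t size,
                                         enum dma_data_direction dir,
                                         unsigned long attrs)
{
        /* Direct GPA path: hand the device the guest physical address. */
        return page_to_phys(page) + offset;
}

static int virtio_direct_map_sg(struct device *dev, struct scatterlist *sgl,
                                int nents, enum dma_data_direction dir,
                                unsigned long attrs)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                sg_dma_address(sg) = sg_phys(sg);
                sg_dma_len(sg) = sg->length;
        }
        return nents;
}

static const struct dma_map_ops virtio_direct_dma_ops = {
        .map_page = virtio_direct_map_page,
        .map_sg   = virtio_direct_map_sg,
        /* unmap/sync callbacks left NULL: nothing to undo or flush here. */
};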



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-07-24 Thread Anshuman Khandual
On 07/23/2018 02:38 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 23, 2018 at 11:58:23AM +0530, Anshuman Khandual wrote:
>> On 07/20/2018 06:46 PM, Michael S. Tsirkin wrote:
>>> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
 This patch series is the follow up on the discussions we had before about
 the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
 for virito devices (https://patchwork.kernel.org/patch/10417371/). There
 were suggestions about doing away with two different paths of transactions
 with the host/QEMU, first being the direct GPA and the other being the DMA
 API based translations.

 First patch attempts to create a direct GPA mapping based DMA operations
 structure called 'virtio_direct_dma_ops' with exact same implementation
 of the direct GPA path which virtio core currently has but just wrapped in
 a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
 the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
 existing semantics. The second patch does exactly that inside the function
 virtio_finalize_features(). The third patch removes the default direct GPA
 path from virtio core forcing it to use DMA API callbacks for all devices.
 Now with that change, every device must have a DMA operations structure
 associated with it. The fourth patch adds an additional hook which gives
 the platform an opportunity to do yet another override if required. This
 platform hook can be used on POWER Ultravisor based protected guests to
 load up SWIOTLB DMA callbacks to do the required (as discussed previously
 in the above mentioned thread how host is allowed to access only parts of
 the guest GPA range) bounce buffering into the shared memory for all I/O
 scatter gather buffers to be consumed on the host side.

 Please go through these patches and review whether this approach broadly
 makes sense. I will appreciate suggestions, inputs, comments regarding
 the patches or the approach in general. Thank you.
>>> I like how patches 1-3 look. Could you test performance
>>> with/without to see whether the extra indirection through
>>> use of DMA ops causes a measurable slow-down?
>>
>> I ran this simple DD command 10 times where /dev/vda is a virtio block
>> device of 10GB size.
>>
>> dd if=/dev/zero of=/dev/vda bs=8M count=1024 oflag=direct
>>
>> With and without the patches, the bandwidth (which has a fairly wide
>> range) does not look that different.
>>
>> Without patches
>> ===
>>
>> -- 1 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.95557 s, 4.4 GB/s
>> -- 2 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.05176 s, 4.2 GB/s
>> -- 3 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.88314 s, 4.6 GB/s
>> -- 4 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.84899 s, 4.6 GB/s
>> -- 5 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.37184 s, 1.6 GB/s
>> -- 6 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.9205 s, 4.5 GB/s
>> -- 7 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.85166 s, 1.3 GB/s
>> -- 8 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.74049 s, 4.9 GB/s
>> -- 9 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.31699 s, 1.4 GB/s
>> -- 10 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.47057 s, 3.5 GB/s
>>
>>
>> With patches
>> 
>>
>> -- 1 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.25993 s, 3.8 GB/s
>> -- 2 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.82438 s, 4.7 GB/s
>> -- 3 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.93856 s, 4.4 GB/s
>> -- 4 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.83405 s, 4.7 GB/s
>> -- 5 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 7.50199 s, 1.1 GB/s
>> -- 6 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.28742 s, 3.8 GB/s
>> -- 7 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.74958 s, 1.5 GB/s
>> -- 8 -
>> 1024+0