Re: [RFC 4/4] virtio: Add platform specific DMA API translation for virito devices

2018-07-24 Thread Anshuman Khandual
On 07/23/2018 07:46 AM, Anshuman Khandual wrote:
> On 07/20/2018 06:45 PM, Michael S. Tsirkin wrote:
>> On Fri, Jul 20, 2018 at 09:29:41AM +0530, Anshuman Khandual wrote:
>>> Subject: Re: [RFC 4/4] virtio: Add platform specific DMA API translation for
>>> virito devices
>>
>> s/virito/virtio/
> 
> Oops, will fix it. Thanks for pointing out.
> 
>>
>>> This adds a hook which a platform can define in order to allow it to
>>> override a virtio device's DMA ops irrespective of whether the device
>>> has the VIRTIO_F_IOMMU_PLATFORM flag set or not. We want to use this to do
>>> bounce-buffering of data on the new secure pSeries platform, currently
>>> under development, where a KVM host cannot access all of the memory
>>> space of a secure KVM guest.  The host can only access the pages which
>>> the guest has explicitly requested to be shared with the host, thus
>>> the virtio implementation in the guest has to copy data to and from
>>> shared pages.
>>>
>>> With this hook, the platform code in the secure guest can force the
>>> use of swiotlb for virtio buffers, with a back-end for swiotlb which
>>> will use a pool of pre-allocated shared pages.  Thus all data being
>>> sent or received by virtio devices will be copied through pages which
>>> the host has access to.
>>>
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>>  arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
>>>  arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
>>>  drivers/virtio/virtio.c                | 7 +++++++
>>>  3 files changed, 19 insertions(+)
>>>
>>> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
>>> b/arch/powerpc/include/asm/dma-mapping.h
>>> index 8fa3945..bc5a9d3 100644
>>> --- a/arch/powerpc/include/asm/dma-mapping.h
>>> +++ b/arch/powerpc/include/asm/dma-mapping.h
>>> @@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev);
>>>  
>>>  #endif /* __KERNEL__ */
>>>  #endif /* _ASM_DMA_MAPPING_H */
>>> +
>>> +#define platform_override_dma_ops platform_override_dma_ops
>>> +
>>> +struct virtio_device;
>>> +
>>> +extern void platform_override_dma_ops(struct virtio_device *vdev);
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 06f0296..5773bc7 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -38,6 +38,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str)
>>>  __setup("multitce=", disable_multitce);
>>>  
>>>  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
>>> +
>>> +void platform_override_dma_ops(struct virtio_device *vdev)
>>> +{
>>> +   /* Override the dma_ops of vdev->dev.parent if required */
>>> +}
>>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>>> index 6b13987..432c332 100644
>>> --- a/drivers/virtio/virtio.c
>>> +++ b/drivers/virtio/virtio.c
>>> @@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status);
>>>  
>>>  const struct dma_map_ops virtio_direct_dma_ops;
>>>  
>>> +#ifndef platform_override_dma_ops
>>> +static inline void platform_override_dma_ops(struct virtio_device *vdev)
>>> +{
>>> +}
>>> +#endif
>>> +
>>>  int virtio_finalize_features(struct virtio_device *dev)
>>>  {
>>> int ret = dev->config->finalize_features(dev);
>>> @@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev)
>>> if (virtio_has_iommu_quirk(dev))
>>> set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
>>>  
>>> +   platform_override_dma_ops(dev);
>>
>> Is there a single place where virtio_has_iommu_quirk is called now?
> 
> No, this is the only place. But in the proposed implementation of
> platform_override_dma_ops on powerpc, we will check virtio_has_iommu_quirk
> again before overriding the DMA ops with SWIOTLB.
> 
> void platform_override_dma_ops(struct virtio_device *vdev)
> {
> if (is_ultravisor_platform() && virtio_has_iommu_quirk(vdev))
> set_dma_ops(vdev->dev.parent, &swiotlb_dma_ops);
> }
> 
>> If so, we could put this into virtio_has_iommu_quirk then.
> 
> Did you mean platform_override_dma_ops instead? If so, yes that
> is possible. The default implementation of platform_override_dma_ops
> should just check the VIRTIO_F_IOMMU_PLATFORM feature and override
> with virtio_direct_dma_ops, but an arch implementation can check
> whatever else it would like and override appropriately.
> 
> The default platform_override_dma_ops would look like this:
> 
> #ifndef platform_override_dma_ops
> static inline void platform_override_dma_ops(struct virtio_device *vdev)
> {
>   if (!virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
>           set_dma_ops(vdev->dev.parent, &virtio_direct_dma_ops);
> }
> #endif
> 
> The proposed powerpc implementation would look like this instead:
> 
> void platform_override_dma_ops(struct virtio_device *vdev)
> {
>   if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
>   return;
> 

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-07-24 Thread Anshuman Khandual
On 07/23/2018 02:38 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 23, 2018 at 11:58:23AM +0530, Anshuman Khandual wrote:
>> On 07/20/2018 06:46 PM, Michael S. Tsirkin wrote:
>>> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
 This patch series is the follow-up to the discussions we had before about
 the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
 for virito devices (https://patchwork.kernel.org/patch/10417371/). There
 were suggestions about doing away with the two different transaction paths
 with the host/QEMU, the first being direct GPA and the other being DMA
 API based translations.

 The first patch creates a direct GPA mapping based DMA operations structure
 called 'virtio_direct_dma_ops' with the exact same implementation of the
 direct GPA path which virtio core currently has, just wrapped in a DMA API
 format. Virtio core must use 'virtio_direct_dma_ops' instead of the arch
 default in the absence of the VIRTIO_F_IOMMU_PLATFORM flag to preserve the
 existing semantics. The second patch does exactly that inside the function
 virtio_finalize_features(). The third patch removes the default direct GPA
 path from virtio core, forcing it to use DMA API callbacks for all devices.
 With that change, every device must have a DMA operations structure
 associated with it. The fourth patch adds an additional hook which gives
 the platform an opportunity to do yet another override if required. This
 platform hook can be used on POWER Ultravisor based protected guests to
 load up SWIOTLB DMA callbacks which do the required bounce buffering into
 shared memory for all I/O scatter-gather buffers consumed on the host side
 (as discussed previously in the above mentioned thread, the host is allowed
 to access only parts of the guest GPA range).
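
 For readers not familiar with the DMA API plumbing, here is a minimal sketch
 (not the actual patch 1; the function bodies and member selection are only
 illustrative) of how the legacy "DMA address == guest physical address"
 behaviour can be expressed as a dma_map_ops structure, assuming the usual
 <linux/dma-mapping.h> definitions:

     static dma_addr_t virtio_direct_map_page(struct device *dev,
                                              struct page *page,
                                              unsigned long offset, size_t size,
                                              enum dma_data_direction dir,
                                              unsigned long attrs)
     {
             /* Legacy virtio behaviour: hand the device the GPA directly. */
             return page_to_phys(page) + offset;
     }

     static void virtio_direct_unmap_page(struct device *dev, dma_addr_t addr,
                                          size_t size,
                                          enum dma_data_direction dir,
                                          unsigned long attrs)
     {
             /* Nothing to undo for a direct GPA mapping. */
     }

     const struct dma_map_ops virtio_direct_dma_ops = {
             .map_page       = virtio_direct_map_page,
             .unmap_page     = virtio_direct_unmap_page,
     };

 With such a structure installed via set_dma_ops(), the existing
 dma_map_page()/dma_unmap_page() calls resolve to the same direct GPA
 translation as before, which is what allows the third patch to drop the
 separate non-DMA path from virtio core.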

 Please go through these patches and review whether this approach broadly
 makes sense. I will appreciate suggestions, inputs, comments regarding
 the patches or the approach in general. Thank you.
>>> I like how patches 1-3 look. Could you test performance
>>> with/without to see whether the extra indirection through
>>> use of DMA ops causes a measurable slow-down?
>>
>> I ran this simple dd command 10 times, where /dev/vda is a virtio block
>> device of 10GB size.
>>
>> dd if=/dev/zero of=/dev/vda bs=8M count=1024 oflag=direct
>>
>> With and without the patches, the bandwidth (which varies over a fairly
>> wide range from run to run) does not look that different.
>>
>> Without patches
>> ===
>>
>> -- 1 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.95557 s, 4.4 GB/s
>> -- 2 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.05176 s, 4.2 GB/s
>> -- 3 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.88314 s, 4.6 GB/s
>> -- 4 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.84899 s, 4.6 GB/s
>> -- 5 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.37184 s, 1.6 GB/s
>> -- 6 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.9205 s, 4.5 GB/s
>> -- 7 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.85166 s, 1.3 GB/s
>> -- 8 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.74049 s, 4.9 GB/s
>> -- 9 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 6.31699 s, 1.4 GB/s
>> -- 10 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.47057 s, 3.5 GB/s
>>
>>
>> With patches
>> 
>>
>> -- 1 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.25993 s, 3.8 GB/s
>> -- 2 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.82438 s, 4.7 GB/s
>> -- 3 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.93856 s, 4.4 GB/s
>> -- 4 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 1.83405 s, 4.7 GB/s
>> -- 5 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 7.50199 s, 1.1 GB/s
>> -- 6 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 2.28742 s, 3.8 GB/s
>> -- 7 -
>> 1024+0 records in
>> 1024+0 records out
>> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 5.74958 s, 1.5 GB/s
>> -- 8 -
>> 

Re: [PATCH net-next 0/6] virtio_net: Add ethtool stat items

2018-07-24 Thread David Miller
From: Toshiaki Makita 
Date: Mon, 23 Jul 2018 23:36:03 +0900

> From: Toshiaki Makita 
> 
> Add some ethtool stat items useful for performance analysis.
> 
> Signed-off-by: Toshiaki Makita 

Michael and Jason, any objections to these new stats?


Re: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-24 Thread Wei Wang

On 07/23/2018 10:36 PM, Dr. David Alan Gilbert wrote:

* Michael S. Tsirkin (m...@redhat.com) wrote:

On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction to this usage:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. In the 1st round, all of the VM's memory
is transferred. From the 2nd round onwards, only the pieces of memory that
were written by the guest (after the 1st round) are transferred. One method
commonly used by the hypervisor to track which parts of memory have been
written is to write-protect all the guest memory.

This feature enables that optimization by skipping the transfer of guest
free pages during VM live migration. It does not matter if the pages are
reused after they have been reported to the hypervisor as free page hints,
because the hypervisor keeps tracking them and transfers them in a
subsequent round if they are written.
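
As a toy illustration of the hypervisor side (a hypothetical helper, not the
real QEMU/KVM interface): on receiving a free page hint, the hypervisor can
simply clear the hinted pages from the bitmap of pages still queued for the
current round; a later guest write is caught by the normal dirty tracking,
so such a page is transferred again in a subsequent round.

    #include <limits.h>

    /* Toy sketch only: drop hinted-free pages from the set still to be
     * sent in the current migration round. */
    static void handle_free_page_hint(unsigned long *to_send_bitmap,
                                      unsigned long first_pfn,
                                      unsigned long nr_pages)
    {
            const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
            unsigned long pfn;

            for (pfn = first_pfn; pfn < first_pfn + nr_pages; pfn++)
                    to_send_bitmap[pfn / bits] &= ~(1UL << (pfn % bits));
    }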

* Tests
- Test Environment
 Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
 Guest: 8G RAM, 4 vCPU
 Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
  - Idle Guest Live Migration Time (results are averaged over 10 runs):
    - Optimization vs. Legacy = 409ms vs. 1757ms --> ~77% reduction
      (setting page poisoning to zero and enabling ksm don't affect the
      comparison result)
  - Guest with Linux Compilation Workload (make bzImage -j4):
    - Live Migration Time (average)
      Optimization vs. Legacy = 1407ms vs. 2528ms --> ~44% reduction
    - Linux Compilation Time
      Optimization vs. Legacy = 5min4s vs. 5min12s
      --> no obvious difference

I'd like to see dgilbert's take on whether this kind of gain
justifies adding a PV interface, and what kind of guest workload
is appropriate.

Cc'd.

Well, 44% is great ... although the measurement is a bit weird.

a) A 2 second downtime is very large; 300-500ms is more normal


No problem, I will set downtime to 400ms for the tests.


b) I'm not sure what the 'average' is - is that just over a bunch of
repeated migrations?


Yes, we just repeat the migration ("source <-> destination") a number of
times and report the averaged result.




c) What load was running in the guest during the live migration?


The first one above just uses a guest that is not running any specific
workload (i.e. an idle guest).

The second one uses a guest with the Linux compilation workload running.



An interesting measurement to add would be the same test but with a VM
that has a lot more RAM and the same load; you'd hope the gain would be
even better.
That case is especially relevant because the interested users are people
creating VMs with lots of extra memory allocated (for the worst case) but,
most of the time, migrating them when they're fairly idle.


OK. I will add tests of a guest with larger memory.

Best,
Wei