Re: [Qemu-devel] [RFC v4 00/16] VIRTIO-IOMMU device
Hi Eric, Out of curiosity, I compared your SMMUv3 full emulation against virtio-iommu. For virtio-net device behind the virtio-iommu I get 50% performance of what I can get for SMMUv3 emulation. Do you have similar observations? Since there is no need to emulate HW behaviour in QEMU I expected virtio-iommu to be faster. I will run some more benchmarks. Thanks, Tomasz On 19.09.2017 09:46, Eric Auger wrote: This series implements the virtio-iommu device. This v4 is an upgrade to v0.4 spec [1] and applies on QEMU v2.10.0. - probe request support although no reserved region is returned at the moment - unmap semantics less strict, as specified in v0.4 - device registration, attach/detach revisited - split into smaller patches to ease review - propose a way to inform the IOMMU mr about the page_size_mask of underlying HW IOMMU, if any - remove warning associated with the translation of the MSI doorbell The device gets instantiated using the "-device virtio-iommu-device" option. It currently works with ARM virt machine only, as the machine must handle the dt binding between the virtio-mmio "iommu" node and the PCI host bridge node. The associated VHOST/VFIO adaptation is available on the branch below but is not officially delivered as part of this series. Best Regards Eric This series can be found at: https://github.com/eauger/qemu/tree/v2.10.0-virtio-iommu-v4 References: [1] [RFC] virtio-iommu version 0.4 git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 Testing: - guest kernel with v0.4 virtio-iommu driver (4kB page) - tested with guest using virtio-pci-net and vhost-net (,vhost=on/off,iommu_platform,disable-modern=off,disable-legacy=on ) - tested with VFIO and guest assigned with 2 VFs - tested with DPDK on guest with 2 assigned VFs Not tested: - hot-plug/hot-unplug of EP: not implemented History: v2-> v4: - see above v2 -> v3: - rebase on top of 2.10-rc0 and especially [PATCH qemu v9 0/2] memory/iommu: QOM'fy IOMMU MemoryRegion - add mutex init - fix as->mappings deletion using g_tree_ref/unref - when a dev is attached whereas it is already attached to another address space, first detach it - fix some error values - page_sizes = TARGET_PAGE_MASK; - I haven't changed the unmap() semantics yet, waiting for the next virtio-iommu spec revision. v1 -> v2: - fix redifinition of viommu_as typedef Eric Auger (16): update-linux-headers: import virtio_iommu.h linux-headers: Update for virtio-iommu virtio-iommu: add skeleton virtio-iommu: Decode the command payload virtio-iommu: Add the iommu regions virtio-iommu: Register attached devices virtio-iommu: Implement attach/detach command virtio-iommu: Implement map/unmap virtio-iommu: Implement translate virtio-iommu: Implement probe request hw/arm/virt: Add 2.11 machine type hw/arm/virt: Add virtio-iommu to the virt board memory.h: Add set_page_size_mask IOMMUMemoryRegion callback hw/vfio/common: Set the IOMMUMemoryRegion supported page sizes virtio-iommu: Implement set_page_size_mask hw/vfio/common: Do not print error when viommu translates into an mmio region hw/arm/virt.c | 115 +++- hw/vfio/common.c | 7 +- hw/virtio/Makefile.objs | 1 + hw/virtio/trace-events| 24 + hw/virtio/virtio-iommu.c | 923 ++ include/exec/memory.h | 4 + include/hw/arm/virt.h | 5 + include/hw/vfio/vfio-common.h | 1 + include/hw/virtio/virtio-iommu.h | 61 ++ include/standard-headers/linux/virtio_ids.h | 1 + include/standard-headers/linux/virtio_iommu.h | 177 + linux-headers/linux/virtio_iommu.h| 1 + scripts/update-linux-headers.sh | 3 + 13 files changed, 1312 insertions(+), 11 deletions(-) create mode 100644 hw/virtio/virtio-iommu.c create mode 100644 include/hw/virtio/virtio-iommu.h create mode 100644 include/standard-headers/linux/virtio_iommu.h create mode 100644 linux-headers/linux/virtio_iommu.h
Re: [Qemu-devel] [RFC v4 10/16] virtio-iommu: Implement probe request
Hi Eric, On 19.09.2017 09:46, Eric Auger wrote: This patch implements the PROBE request. At the moment, no reserved regions are returned. At the moment reserved regions are stored per device. Signed-off-by: Eric Auger--- [...] + +static int virtio_iommu_fill_property(int devid, int type, + viommu_property_buffer *bufstate) +{ +int ret = -ENOSPC; + +if (bufstate->filled + 4 >= VIOMMU_PROBE_SIZE) { +bufstate->error = true; +goto out; +} + +switch (type) { +case VIRTIO_IOMMU_PROBE_T_NONE: +ret = virtio_iommu_fill_none_prop(bufstate); +break; +case VIRTIO_IOMMU_PROBE_T_RESV_MEM: +{ +viommu_dev *dev = bufstate->dev; + +g_tree_foreach(dev->reserved_regions, + virtio_iommu_fill_resv_mem_prop, + bufstate); +if (!bufstate->error) { +ret = 0; +} +break; +} +default: +ret = -ENOENT; +break; +} +out: +if (ret) { +error_report("%s property of type=%d could not be filled (%d)," + " remaining size = 0x%lx", + __func__, type, ret, bufstate->filled); +} +return ret; +} + +static int virtio_iommu_probe(VirtIOIOMMU *s, + struct virtio_iommu_req_probe *req, + uint8_t *buf) +{ +uint32_t devid = le32_to_cpu(req->device); +int16_t prop_types = SUPPORTED_PROBE_PROPERTIES, type; +viommu_property_buffer bufstate; +viommu_dev *dev; +int ret; + +dev = g_tree_lookup(s->devices, GUINT_TO_POINTER(devid)); +if (!dev) { +return -EINVAL; +} + +bufstate.start = buf; +bufstate.filled = 0; +bufstate.dev = dev; bufstate.error is not initialized which may cause false alarm in virtio_iommu_fill_property() + +while ((type = ctz32(prop_types)) != 32) { +ret = virtio_iommu_fill_property(devid, 1 << type, ); +if (ret) { +break; +} +prop_types &= ~(1 << type); +} +virtio_iommu_fill_property(devid, VIRTIO_IOMMU_PROBE_T_NONE, ); + +return VIRTIO_IOMMU_S_OK; +} + #define get_payload_size(req) (\ sizeof((req)) - sizeof(struct virtio_iommu_req_tail)) @@ -433,6 +567,24 @@ static int virtio_iommu_handle_unmap(VirtIOIOMMU *s, return virtio_iommu_unmap(s, ); } Thanks, Tomasz
Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback
Hi Eric, On 15.09.2017 16:50, Auger Eric wrote: Hi, On 15/09/2017 12:42, tn wrote: Hi Eric, On 15.09.2017 09:30, Auger Eric wrote: Hi Tomasz, On 14/09/2017 16:43, Tomasz Nowicki wrote: On 14.09.2017 16:31, Tomasz Nowicki wrote: Hi Eric, On 14.09.2017 11:27, Linu Cherian wrote: Hi Eric, On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote: memory_region_iommu_replay() is used for VFIO integration. However its default implementation is not adapted to SMMUv3 IOMMU memory region. Indeed the input address range is too huge and its execution is too slow as it calls the translate() callback on each granule. Let's implement the replay callback which hierarchically walk over the page table structure and notify only the segments that are populated with valid entries. Signed-off-by: Eric Auger <eric.au...@redhat.com> --- hw/arm/smmuv3.c | 36 hw/arm/trace-events | 1 + 2 files changed, 37 insertions(+) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 8e7d10d..c43bd93 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry *entry, void *private) return 0; } +/* Unmap the whole notifier's range */ +static void smmuv3_unmap_notifier_range(IOMMUNotifier *n) +{ +IOMMUTLBEntry entry; +hwaddr size = n->end - n->start + 1; + +entry.target_as = _space_memory; +entry.iova = n->start & ~(size - 1); +entry.perm = IOMMU_NONE; +entry.addr_mask = size - 1; + +memory_region_notify_one(n, ); +} + +static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n) +{ +SMMUTransCfg cfg = {}; +int ret; + +trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end); +smmuv3_unmap_notifier_range(n); + +ret = smmuv3_decode_config(mr, ); +if (ret) { +error_report("%s error decoding the configuration for iommu mr=%s", + __func__, mr->parent_obj.name); +} On an invalid config being found, shouldnt we return rather than proceeding with page table walk. For example on an invalid Stream table entry. Indeed, without return here vhost case is not working for me. I was just lucky one time. return here has no influence. Vhost still not working. Sorry for noise. As far as I understand the replay() callback only is called in VFIO use case. So this shouldn't impact vhost. I can't reproduce your vhost issue on my side. I will review the invalidate code again and compare against the last version. What is the page size used by your guest? 64K page size for guest as well as for host. However, I've just checked 4K page size for guest and then vhost is working fine. So the bug stems from the incorrect target page size used on SMMU_CMD_TLBI_NH_VA invalidation. This is now corrected by using the actual config granule size and this fixes the issue with vhost use case and 64KB page guest. I will release this fix early next week. Sorry for the inconvenience. Yes, that was it. I will provide my t-b for the next series version. Thanks, Tomasz
Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback
On 14.09.2017 16:31, Tomasz Nowicki wrote: Hi Eric, On 14.09.2017 11:27, Linu Cherian wrote: Hi Eric, On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote: memory_region_iommu_replay() is used for VFIO integration. However its default implementation is not adapted to SMMUv3 IOMMU memory region. Indeed the input address range is too huge and its execution is too slow as it calls the translate() callback on each granule. Let's implement the replay callback which hierarchically walk over the page table structure and notify only the segments that are populated with valid entries. Signed-off-by: Eric Auger <eric.au...@redhat.com> --- hw/arm/smmuv3.c | 36 hw/arm/trace-events | 1 + 2 files changed, 37 insertions(+) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 8e7d10d..c43bd93 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry *entry, void *private) return 0; } +/* Unmap the whole notifier's range */ +static void smmuv3_unmap_notifier_range(IOMMUNotifier *n) +{ +IOMMUTLBEntry entry; +hwaddr size = n->end - n->start + 1; + +entry.target_as = _space_memory; +entry.iova = n->start & ~(size - 1); +entry.perm = IOMMU_NONE; +entry.addr_mask = size - 1; + +memory_region_notify_one(n, ); +} + +static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n) +{ +SMMUTransCfg cfg = {}; +int ret; + +trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end); +smmuv3_unmap_notifier_range(n); + +ret = smmuv3_decode_config(mr, ); +if (ret) { +error_report("%s error decoding the configuration for iommu mr=%s", + __func__, mr->parent_obj.name); +} On an invalid config being found, shouldnt we return rather than proceeding with page table walk. For example on an invalid Stream table entry. Indeed, without return here vhost case is not working for me. I was just lucky one time. return here has no influence. Vhost still not working. Sorry for noise. Tomasz Thanks, Tomasz + +if (cfg.disabled || cfg.bypassed) { +return; +} +/* walk the page tables and replay valid entries */ +smmu_page_walk(, 0, (1ULL << (64 - cfg.tsz)) - 1, false, + smmuv3_notify_entry, n); +} static void smmuv3_notify_iova_range(IOMMUMemoryRegion *mr, IOMMUNotifier *n, uint64_t iova, size_t size) { @@ -1095,6 +1130,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass, imrc->translate = smmuv3_translate; imrc->notify_flag_changed = smmuv3_notify_flag_changed; +imrc->replay = smmuv3_replay; } static const TypeInfo smmuv3_type_info = { diff --git a/hw/arm/trace-events b/hw/arm/trace-events index 4ac264d..15f84d6 100644 --- a/hw/arm/trace-events +++ b/hw/arm/trace-events @@ -46,5 +46,6 @@ smmuv3_cfg_stage(int s, uint32_t oas, uint32_t tsz, uint64_t ttbr, bool aa64, ui smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s" smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s" smmuv3_replay_mr(const char *name) "iommu mr=%s" +smmuv3_replay(const char *name, void *n, hwaddr start, hwaddr end) "iommu mr=%s notifier=%p [0x%"PRIx64",0x%"PRIx64"]" smmuv3_notify_entry(hwaddr iova, hwaddr pa, hwaddr mask, int perm) "iova=0x%"PRIx64" pa=0x%" PRIx64" mask=0x%"PRIx64" perm=%d" smmuv3_notify_iova_range(const char *name, uint64_t iova, size_t size, void *n) "iommu mr=%s iova=0x%"PRIx64" size=0x%lx n=%p" -- 2.5.5
Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback
Hi Eric, On 14.09.2017 11:27, Linu Cherian wrote: Hi Eric, On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote: memory_region_iommu_replay() is used for VFIO integration. However its default implementation is not adapted to SMMUv3 IOMMU memory region. Indeed the input address range is too huge and its execution is too slow as it calls the translate() callback on each granule. Let's implement the replay callback which hierarchically walk over the page table structure and notify only the segments that are populated with valid entries. Signed-off-by: Eric Auger--- hw/arm/smmuv3.c | 36 hw/arm/trace-events | 1 + 2 files changed, 37 insertions(+) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 8e7d10d..c43bd93 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry *entry, void *private) return 0; } +/* Unmap the whole notifier's range */ +static void smmuv3_unmap_notifier_range(IOMMUNotifier *n) +{ +IOMMUTLBEntry entry; +hwaddr size = n->end - n->start + 1; + +entry.target_as = _space_memory; +entry.iova = n->start & ~(size - 1); +entry.perm = IOMMU_NONE; +entry.addr_mask = size - 1; + +memory_region_notify_one(n, ); +} + +static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n) +{ +SMMUTransCfg cfg = {}; +int ret; + +trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end); +smmuv3_unmap_notifier_range(n); + +ret = smmuv3_decode_config(mr, ); +if (ret) { +error_report("%s error decoding the configuration for iommu mr=%s", + __func__, mr->parent_obj.name); +} On an invalid config being found, shouldnt we return rather than proceeding with page table walk. For example on an invalid Stream table entry. Indeed, without return here vhost case is not working for me. Thanks, Tomasz + +if (cfg.disabled || cfg.bypassed) { +return; +} +/* walk the page tables and replay valid entries */ +smmu_page_walk(, 0, (1ULL << (64 - cfg.tsz)) - 1, false, + smmuv3_notify_entry, n); +} static void smmuv3_notify_iova_range(IOMMUMemoryRegion *mr, IOMMUNotifier *n, uint64_t iova, size_t size) { @@ -1095,6 +1130,7 @@ static void smmuv3_iommu_memory_region_class_init(ObjectClass *klass, imrc->translate = smmuv3_translate; imrc->notify_flag_changed = smmuv3_notify_flag_changed; +imrc->replay = smmuv3_replay; } static const TypeInfo smmuv3_type_info = { diff --git a/hw/arm/trace-events b/hw/arm/trace-events index 4ac264d..15f84d6 100644 --- a/hw/arm/trace-events +++ b/hw/arm/trace-events @@ -46,5 +46,6 @@ smmuv3_cfg_stage(int s, uint32_t oas, uint32_t tsz, uint64_t ttbr, bool aa64, ui smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s" smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s" smmuv3_replay_mr(const char *name) "iommu mr=%s" +smmuv3_replay(const char *name, void *n, hwaddr start, hwaddr end) "iommu mr=%s notifier=%p [0x%"PRIx64",0x%"PRIx64"]" smmuv3_notify_entry(hwaddr iova, hwaddr pa, hwaddr mask, int perm) "iova=0x%"PRIx64" pa=0x%" PRIx64" mask=0x%"PRIx64" perm=%d" smmuv3_notify_iova_range(const char *name, uint64_t iova, size_t size, void *n) "iommu mr=%s iova=0x%"PRIx64" size=0x%lx n=%p" -- 2.5.5
Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support
Hi Eric, On 01.08.2017 15:07, Auger Eric wrote: Hi Tomasz, On 01/08/2017 13:01, Tomasz Nowicki wrote: Hi Eric, Just letting you know that I am facing another issue with the following setup: 1. host (4.12 kernel & 64K page) and VM (4.12 kernel & 64K page) 2. QEMU + -netdev type=tap,ifname=tap,id=net0 -device virtio-net-pci,netdev=net0,iommu_platform,disable-modern=off,disable-legacy=on 2. On VM, I allocate some huge pages and run DPDK testpmd app: # echo 4 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages # ./dpdk/usertools/dpdk-devbind.py -b vfio-pci :00:02.0 # ./dpdk/build/app/testpmd -l 0-13 -n 4 -w :00:02.0 -- --disable-hw-vlan-filter --disable-rss -i EAL: Detected 14 lcore(s) EAL: Probing VFIO support... EAL: VFIO support initialized EAL: PCI device :00:02.0 on NUMA socket -1 EAL: probe driver: 1af4:1041 net_virtio EAL: using IOMMU type 1 (Type 1) EAL: iommu_map_dma vaddr 2000 size 8000 iova 12000 EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (0) EAL: Can't write to PCI bar (0) : offset (4) EAL: Can't write to PCI bar (0) : offset (14) EAL: Can't write to PCI bar (0) : offset (e) EAL: Can't read from PCI bar (0) : offset (c) EAL: Requested device :00:02.0 cannot be used EAL: No probed ethernet devices Interactive-mode selected USER1: create a new mbuf pool : n=251456, size=2176, socket=0 When VM uses *4K pages* the same setup works fine. I will work on this but please let me know in case you already know what is going on. No I did not face that one. I was able to launch testpmd without such early message. However I assigned an igbvf device to the guest and then to DPDK. I've never tested your config. However as stated in my cover letter at the moment DPDK is not working for me because of storms of tlbi-on-maps. I intend to work on this as soon as get some bandwidth, sorry. I found what was the reason of failure. QEMU creates BARs for VIRTIO PCI device. The size of it depends on what is necessary for VIRTIO protocol. In my case the BAR is 16K size which is too small to be mmapable for kernel with 64K pages: vfio_pci_enable() -> vfio_pci_probe_mmaps() -> here guest kernel checks that BAR size is smaller than current PAGE_SIZE and clears VFIO_REGION_INFO_FLAG_MMAP flag which prevents BAR from being mmapped later on. I added -device virtio-net-pci,...,page-per-vq=on to enlarge BAR size to 8M and now testpmd works fine. I wonder how the same setup is working with e.g. Intel or AMD IOMMU. Thanks, Tomasz
Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support
Hi Eric, Just letting you know that I am facing another issue with the following setup: 1. host (4.12 kernel & 64K page) and VM (4.12 kernel & 64K page) 2. QEMU + -netdev type=tap,ifname=tap,id=net0 -device virtio-net-pci,netdev=net0,iommu_platform,disable-modern=off,disable-legacy=on 2. On VM, I allocate some huge pages and run DPDK testpmd app: # echo 4 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages # ./dpdk/usertools/dpdk-devbind.py -b vfio-pci :00:02.0 # ./dpdk/build/app/testpmd -l 0-13 -n 4 -w :00:02.0 -- --disable-hw-vlan-filter --disable-rss -i EAL: Detected 14 lcore(s) EAL: Probing VFIO support... EAL: VFIO support initialized EAL: PCI device :00:02.0 on NUMA socket -1 EAL: probe driver: 1af4:1041 net_virtio EAL: using IOMMU type 1 (Type 1) EAL: iommu_map_dma vaddr 2000 size 8000 iova 12000 EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (12) EAL: Can't write to PCI bar (0) : offset (12) EAL: Can't read from PCI bar (0) : offset (0) EAL: Can't write to PCI bar (0) : offset (4) EAL: Can't write to PCI bar (0) : offset (14) EAL: Can't write to PCI bar (0) : offset (e) EAL: Can't read from PCI bar (0) : offset (c) EAL: Requested device :00:02.0 cannot be used EAL: No probed ethernet devices Interactive-mode selected USER1: create a new mbuf pool : n=251456, size=2176, socket=0 When VM uses *4K pages* the same setup works fine. I will work on this but please let me know in case you already know what is going on. Thanks, Tomasz On 09.07.2017 22:51, Eric Auger wrote: This series implements the emulation code for ARM SMMUv3. This is the continuation of Prem's work [1]. This v5 mainly brings VFIO integration in DT mode. On guest kernel side, this requires a quirk [1] to force TLB invalidation on map. The following changes also are noticeable: - fix SMMU_CMDQ_CONS offset - adds dma-coherent dt property which fixes the unhandled command opcode bug. - implements block PTE The smmu is instantiated when passing the smmu option to machvirt: "-M virt-2.10,smmu" As I haven't split the code yet so that it can be easily reviewable I don't expect deep reviews at this stage. Also the implementation may be largely sub-optimal. Tested Use Cases: - booted a guest in dt and acpi mode with an iommu_platform virtio-net-pci device (using dma ops). Tested with the following guest combinations: 4K page - 39 bit VA, 4K - 48b, 64K - 39b, 64K - 48b. - booted a guest (featuring [1]) with PCIe passthrough'ed PCIe devices: - AMD Overdrive and igbvf passthrough (using gsi direct mapping) - Cavium ThunderX and ixgbevf passthrough (using KVM MSI routing) Unfortunately I have not been able to run DPDK testpmd yet on guest side. The problem I see is the user space driver dma-maps a huge area and this causes plenty of CMDQ_OP_TLBI_NH_VA commands to be sent (tlbi-on-map) which are sent for each page whereas the dma-map covers a huge page. I will work on this issue for next version. Known limitations: - no VMSAv8-32 suport - no nested stage support (S1 + S2) - no support for HYP mappings - register fine emulation, commands, interrupts and errors were not accurately tested. Handling is sufficient to run use cases described hereafter though. Best Regards Eric This series can be found at: v5: https://github.com/eauger/qemu/tree/v2.9-SMMU-v5 v4: https://github.com/eauger/qemu/tree/v2.9-SMMU-v4 References: [1] [RFC 0/2] arm-smmu-v3 tlbi-on-map option [2] Prem's last iteration: - https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03531.html History: v4 -> v5: - initial_level now part of SMMUTransCfg - smmu_page_walk_64 takes into account the max input size - implement sys->iommu_ops.replay and sys->iommu_ops.notify_flag_changed - smmuv3_translate: bug fix: don't walk on bypass - smmu_update_qreg: fix PROD index update - I did not yet address Peter's comments as the code is not mature enough to be split into sub patches. v3 -> v4 [Eric]: - page table walk rewritten to allow scan of the page table within a range of IOVA. This prepares for VFIO integration and replay. - configuration parsing partially reworked. - do not advertise unsupported/untested features: S2, S1 + S2, HYP, PRI, ATS, .. - added ACPI table generation - migrated to dynamic traces - mingw compilation fix v2 -> v3 [Eric]: - rebased on 2.9 - mostly code and patch reorganization to ease the review process - optional patches removed. They may be handled separately. I am currently working on ACPI enablement. - optional instantiation of the smmu in mach-virt - removed [2/9] (fdt functions) since not mandated - start splitting main patch into base and derived object - no new function feature added v1 -> v2 [Prem]: - Adopted review comments from Eric Auger - Make SMMU_DPRINTF to
Re: [Qemu-devel] [RFC v5 1/8] hw/arm/smmu-common: smmu base class
Hi Eric, I found out what is going on regarding vhost-net outgoing packet's payload corruption. My packets were corrupted because of inconsistent IOVA to HVA translation in IOTLB. Please see below. On 09.07.2017 22:51, Eric Auger wrote: Introduces the base device and class for the ARM smmu. Implements VMSAv8-64 table lookup and translation. VMSAv8-32 is not implemented. Signed-off-by: Eric AugerSigned-off-by: Prem Mallappa --- [...] + +/** + * smmu_page_walk_level_64 - Walk an IOVA range from a specific level + * @baseaddr: table base address corresponding to @level + * @level: level + * @cfg: translation config + * @start: end of the IOVA range + * @end: end of the IOVA range + * @hook_fn: the hook that to be called for each detected area + * @private: private data for the hook function + * @read: whether parent level has read permission + * @write: whether parent level has write permission + * @nofail: indicates whether each iova of the range + * must be translated or whether failure is allowed + * @notify_unmap: whether we should notify invalid entries + * + * Return 0 on success, < 0 on errors not related to translation + * process, > 1 on errors related to translation process (only + * if nofail is set) + */ +static int +smmu_page_walk_level_64(dma_addr_t baseaddr, int level, +SMMUTransCfg *cfg, uint64_t start, uint64_t end, +smmu_page_walk_hook hook_fn, void *private, +bool read, bool write, bool nofail, +bool notify_unmap) +{ +uint64_t subpage_size, subpage_mask, pte, iova = start; +bool read_cur, write_cur, entry_valid; +int ret, granule_sz, stage; +IOMMUTLBEntry entry; + +granule_sz = cfg->granule_sz; +stage = cfg->stage; +subpage_size = 1ULL << level_shift(level, granule_sz); +subpage_mask = level_page_mask(level, granule_sz); + +trace_smmu_page_walk_level_in(level, baseaddr, granule_sz, + start, end, subpage_size); + +while (iova < end) { +dma_addr_t next_table_baseaddr; +uint64_t iova_next, pte_addr; +uint32_t offset; + +iova_next = (iova & subpage_mask) + subpage_size; +offset = iova_level_offset(iova, level, granule_sz); +pte_addr = baseaddr + offset * sizeof(pte); +pte = get_pte(baseaddr, offset); + +trace_smmu_page_walk_level(level, iova, subpage_size, + baseaddr, offset, pte); + +if (pte == (uint64_t)-1) { +if (nofail) { +return SMMU_TRANS_ERR_WALK_EXT_ABRT; +} +goto next; +} +if (is_invalid_pte(pte) || is_reserved_pte(pte, level)) { +trace_smmu_page_walk_level_res_invalid_pte(stage, level, baseaddr, + pte_addr, offset, pte); +if (nofail) { +return SMMU_TRANS_ERR_WALK_EXT_ABRT; +} +goto next; +} vhost maintains its IOTLB cache and when there is no IOVA->HVA translation, it asks QEMU for help. However, IOTLB entries invalidations are guest initiative so for any DMA unmap at guest side we trap to SMMU emulation code and call: smmu_notify_all -> smmuv3_replay_single -> smmu_page_walk_64 -> smmu_page_walk_level_64 -> smmuv3_replay_hook -> vhost_iommu_unmap_notify The thing is that smmuv3_replay_hook() is never called because guest zeros PTE before we trap to QEMU so that smmu_page_walk_level_64() fails on ^^^ is_invalid_pte(pte) check. This way we keep old IOTLB entry in vhost and subsequent translations may be broken. + +read_cur = read; /* TODO */ +write_cur = write; /* TODO */ +entry_valid = read_cur | write_cur; /* TODO */ + +if (is_page_pte(pte, level)) { +uint64_t gpa = get_page_pte_address(pte, granule_sz); +int perm = IOMMU_ACCESS_FLAG(read_cur, write_cur); + +trace_smmu_page_walk_level_page_pte(stage, level, entry.iova, +baseaddr, pte_addr, pte, gpa); +if (!entry_valid && !notify_unmap) { +printf("%s entry_valid=%d notify_unmap=%d\n", __func__, + entry_valid, notify_unmap); +goto next; +} +ret = call_entry_hook(iova, subpage_mask, gpa, perm, + hook_fn, private); +if (ret) { +return ret; +} +goto next; +} +if (is_block_pte(pte, level)) { +uint64_t block_size; +hwaddr gpa = get_block_pte_address(pte, level, granule_sz, + _size); +int perm = IOMMU_ACCESS_FLAG(read_cur, write_cur); + +if (gpa == -1) { +if (nofail) { +
Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support
Hi Eric, With fixes in comments that I made, I was able to run VM with virtio-blk-pci and virtio-net-pci devices. I have tried vhost-net as well but I am seeing outgoing packets payload corrupted from host perspective, tcpdump on host tun i/f shows zeroes in packet payload. However, tcpdump in VM shows that packets are fine. Have you seen anything like that? Packets incoming to VM are fine though. I will keep debugging on my side too. Thanks, Tomasz On 09.07.2017 22:51, Eric Auger wrote: This series implements the emulation code for ARM SMMUv3. This is the continuation of Prem's work [1]. This v5 mainly brings VFIO integration in DT mode. On guest kernel side, this requires a quirk [1] to force TLB invalidation on map. The following changes also are noticeable: - fix SMMU_CMDQ_CONS offset - adds dma-coherent dt property which fixes the unhandled command opcode bug. - implements block PTE The smmu is instantiated when passing the smmu option to machvirt: "-M virt-2.10,smmu" As I haven't split the code yet so that it can be easily reviewable I don't expect deep reviews at this stage. Also the implementation may be largely sub-optimal. Tested Use Cases: - booted a guest in dt and acpi mode with an iommu_platform virtio-net-pci device (using dma ops). Tested with the following guest combinations: 4K page - 39 bit VA, 4K - 48b, 64K - 39b, 64K - 48b. - booted a guest (featuring [1]) with PCIe passthrough'ed PCIe devices: - AMD Overdrive and igbvf passthrough (using gsi direct mapping) - Cavium ThunderX and ixgbevf passthrough (using KVM MSI routing) Unfortunately I have not been able to run DPDK testpmd yet on guest side. The problem I see is the user space driver dma-maps a huge area and this causes plenty of CMDQ_OP_TLBI_NH_VA commands to be sent (tlbi-on-map) which are sent for each page whereas the dma-map covers a huge page. I will work on this issue for next version. Known limitations: - no VMSAv8-32 suport - no nested stage support (S1 + S2) - no support for HYP mappings - register fine emulation, commands, interrupts and errors were not accurately tested. Handling is sufficient to run use cases described hereafter though. Best Regards Eric This series can be found at: v5: https://github.com/eauger/qemu/tree/v2.9-SMMU-v5 v4: https://github.com/eauger/qemu/tree/v2.9-SMMU-v4 References: [1] [RFC 0/2] arm-smmu-v3 tlbi-on-map option [2] Prem's last iteration: - https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03531.html History: v4 -> v5: - initial_level now part of SMMUTransCfg - smmu_page_walk_64 takes into account the max input size - implement sys->iommu_ops.replay and sys->iommu_ops.notify_flag_changed - smmuv3_translate: bug fix: don't walk on bypass - smmu_update_qreg: fix PROD index update - I did not yet address Peter's comments as the code is not mature enough to be split into sub patches. v3 -> v4 [Eric]: - page table walk rewritten to allow scan of the page table within a range of IOVA. This prepares for VFIO integration and replay. - configuration parsing partially reworked. - do not advertise unsupported/untested features: S2, S1 + S2, HYP, PRI, ATS, .. - added ACPI table generation - migrated to dynamic traces - mingw compilation fix v2 -> v3 [Eric]: - rebased on 2.9 - mostly code and patch reorganization to ease the review process - optional patches removed. They may be handled separately. I am currently working on ACPI enablement. - optional instantiation of the smmu in mach-virt - removed [2/9] (fdt functions) since not mandated - start splitting main patch into base and derived object - no new function feature added v1 -> v2 [Prem]: - Adopted review comments from Eric Auger - Make SMMU_DPRINTF to internally call qemu_log (since translation requests are too many, we need control on the type of log we want) - SMMUTransCfg modified to suite simplicity - Change RegInfo to uint64 register array - Code cleanup - Test cleanups - Reshuffled patches v0 -> v1 [Prem]: - As per SMMUv3 spec 16.0 (only is_ste_consistant() is noticeable) - Reworked register access/update logic - Factored out translation code for - single point bug fix - sharing/removal in future - (optional) Unit tests added, with PCI test device - S1 with 4k/64k, S1+S2 with 4k/64k - (S1 or S2) only can be verified by Linux 4.7 driver - (optional) Priliminary ACPI support v0 [Prem]: - Implements SMMUv3 spec 11.0 - Supported for PCIe devices, - Command Queue and Event Queue supported - LPAE only, S1 is supported and Tested, S2 not tested - BE mode Translation not supported - IRQ support (legacy, no MSI) - Tested with DPDK and e1000 Eric Auger (5): hw/arm/smmu-common: smmu base class hw/arm/virt: Add 2.10 machine type hw/arm/virt: Add tlbi-on-map property to the smmuv3 node target/arm/kvm: Translate the MSI doorbell in
Re: [Qemu-devel] [RFC v5 2/8] hw/arm/smmuv3: smmuv3 emulation model
Hi Eric, On 09.07.2017 22:51, Eric Auger wrote: From: Prem MallappaIntroduces the SMMUv3 derived model. This is based on System MMUv3 specification (v17). Signed-off-by: Prem Mallappa Signed-off-by: Eric Auger --- v4 -> v5: - change smmuv3_translate proto (IOMMUAccessFlags flag) - has_stagex replaced by is_ste_stagex - smmu_cfg_populate removed - added smmuv3_decode_config and reworked error management - remwork the naming of IOMMU mrs - fix SMMU_CMDQ_CONS offset [...] + +static void smmu_update_qreg(SMMUV3State *s, SMMUQueue *q, hwaddr reg, + uint32_t off, uint64_t val, unsigned size) +{ + if (size == 8 && off == 0) { +smmu_write64_reg(s, reg, val); Based on my observation we never get here. If I read the code correctly, memory_region_dispatch_{write|read}()->memory_region_{write|read}_accessor() will cut all 8-bytes accesses into 4-bytes slices. However, this makes my SMMUv3 register handling happy: static const MemoryRegionOps smmu_mem_ops = { .read = smmu_read_mmio, .write = smmu_write_mmio, .endianness = DEVICE_LITTLE_ENDIAN, .valid = { .min_access_size = 4, .max_access_size = 8, }, +.impl = { +.min_access_size = 4, + .max_access_size = 8, +}, }; Thanks, Tomasz
Re: [Qemu-devel] [RFC v5 2/8] hw/arm/smmuv3: smmuv3 emulation model
Hi Eric, On 09.07.2017 22:51, Eric Auger wrote: From: Prem MallappaIntroduces the SMMUv3 derived model. This is based on System MMUv3 specification (v17). Signed-off-by: Prem Mallappa Signed-off-by: Eric Auger --- v4 -> v5: - change smmuv3_translate proto (IOMMUAccessFlags flag) - has_stagex replaced by is_ste_stagex - smmu_cfg_populate removed - added smmuv3_decode_config and reworked error management - remwork the naming of IOMMU mrs - fix SMMU_CMDQ_CONS offset [...] + +/* + * Register Access Primitives + */ + +static inline void smmu_write64_reg(SMMUV3State *s, uint32_t addr, uint64_t val) +{ +addr >>= 2; +s->regs[addr] = val & 0xULL; +s->regs[addr + 1] = val & ~0xULL; +} + +static inline void smmu_write_reg(SMMUV3State *s, uint32_t addr, uint64_t val) +{ +s->regs[addr >> 2] = val; +} + +static inline uint32_t smmu_read_reg(SMMUV3State *s, uint32_t addr) +{ +return s->regs[addr >> 2]; +} + +static inline uint64_t smmu_read64_reg(SMMUV3State *s, uint32_t addr) +{ +addr >>= 2; +return s->regs[addr] | (s->regs[addr + 1] << 32); To be consistent with smmu_write64_reg() we should not shift here second half of register, instead simply: return s->regs[addr] | s->regs[addr + 1]; Thanks, Tomasz