Re: [Qemu-devel] [RFC v4 00/16] VIRTIO-IOMMU device

2017-09-27 Thread Tomasz Nowicki

Hi Eric,

Out of curiosity, I compared your SMMUv3 full emulation against 
virtio-iommu. For virtio-net device behind the virtio-iommu I get 50% 
performance of what I can get for SMMUv3 emulation.


Do you have similar observations? Since there is no need to emulate HW 
behaviour in QEMU I expected virtio-iommu to be faster. I will run some 
more benchmarks.


Thanks,
Tomasz

On 19.09.2017 09:46, Eric Auger wrote:

This series implements the virtio-iommu device.

This v4 is an upgrade to v0.4 spec [1] and applies on QEMU v2.10.0.
- probe request support although no reserved region is returned at
   the moment
- unmap semantics less strict, as specified in v0.4
- device registration, attach/detach revisited
- split into smaller patches to ease review
- propose a way to inform the IOMMU mr about the page_size_mask
   of underlying HW IOMMU, if any
- remove warning associated with the translation of the MSI doorbell

The device gets instantiated using the "-device virtio-iommu-device"
option. It currently works with ARM virt machine only, as the machine
must handle the dt binding between the virtio-mmio "iommu" node and
the PCI host bridge node.

The associated VHOST/VFIO adaptation is available on the branch
below but is not officially delivered as part of this series.

Best Regards

Eric

This series can be found at:
https://github.com/eauger/qemu/tree/v2.10.0-virtio-iommu-v4

References:
[1] [RFC] virtio-iommu version 0.4
 git://linux-arm.org/virtio-iommu.git branch viommu/v0.4

Testing:
- guest kernel with v0.4 virtio-iommu driver (4kB page)
- tested with guest using virtio-pci-net and vhost-net
   (,vhost=on/off,iommu_platform,disable-modern=off,disable-legacy=on  )
- tested with VFIO and guest assigned with 2 VFs
- tested with DPDK on guest with 2 assigned VFs

Not tested:
- hot-plug/hot-unplug of EP: not implemented

History:
v2-> v4:
- see above

v2 -> v3:
- rebase on top of 2.10-rc0 and especially
   [PATCH qemu v9 0/2] memory/iommu: QOM'fy IOMMU MemoryRegion
- add mutex init
- fix as->mappings deletion using g_tree_ref/unref
- when a dev is attached whereas it is already attached to
   another address space, first detach it
- fix some error values
- page_sizes = TARGET_PAGE_MASK;
- I haven't changed the unmap() semantics yet, waiting for the
   next virtio-iommu spec revision.

v1 -> v2:
- fix redifinition of viommu_as typedef

Eric Auger (16):
   update-linux-headers: import virtio_iommu.h
   linux-headers: Update for virtio-iommu
   virtio-iommu: add skeleton
   virtio-iommu: Decode the command payload
   virtio-iommu: Add the iommu regions
   virtio-iommu: Register attached devices
   virtio-iommu: Implement attach/detach command
   virtio-iommu: Implement map/unmap
   virtio-iommu: Implement translate
   virtio-iommu: Implement probe request
   hw/arm/virt: Add 2.11 machine type
   hw/arm/virt: Add virtio-iommu to the virt board
   memory.h: Add set_page_size_mask IOMMUMemoryRegion callback
   hw/vfio/common: Set the IOMMUMemoryRegion supported page sizes
   virtio-iommu: Implement set_page_size_mask
   hw/vfio/common: Do not print error when viommu translates into an mmio
 region

  hw/arm/virt.c | 115 +++-
  hw/vfio/common.c  |   7 +-
  hw/virtio/Makefile.objs   |   1 +
  hw/virtio/trace-events|  24 +
  hw/virtio/virtio-iommu.c  | 923 ++
  include/exec/memory.h |   4 +
  include/hw/arm/virt.h |   5 +
  include/hw/vfio/vfio-common.h |   1 +
  include/hw/virtio/virtio-iommu.h  |  61 ++
  include/standard-headers/linux/virtio_ids.h   |   1 +
  include/standard-headers/linux/virtio_iommu.h | 177 +
  linux-headers/linux/virtio_iommu.h|   1 +
  scripts/update-linux-headers.sh   |   3 +
  13 files changed, 1312 insertions(+), 11 deletions(-)
  create mode 100644 hw/virtio/virtio-iommu.c
  create mode 100644 include/hw/virtio/virtio-iommu.h
  create mode 100644 include/standard-headers/linux/virtio_iommu.h
  create mode 100644 linux-headers/linux/virtio_iommu.h





Re: [Qemu-devel] [RFC v4 10/16] virtio-iommu: Implement probe request

2017-09-27 Thread Tomasz Nowicki

Hi Eric,

On 19.09.2017 09:46, Eric Auger wrote:

This patch implements the PROBE request. At the moment,
no reserved regions are returned.

At the moment reserved regions are stored per device.

Signed-off-by: Eric Auger 

---



[...]


+
+static int virtio_iommu_fill_property(int devid, int type,
+  viommu_property_buffer *bufstate)
+{
+int ret = -ENOSPC;
+
+if (bufstate->filled + 4 >= VIOMMU_PROBE_SIZE) {
+bufstate->error = true;
+goto out;
+}
+
+switch (type) {
+case VIRTIO_IOMMU_PROBE_T_NONE:
+ret = virtio_iommu_fill_none_prop(bufstate);
+break;
+case VIRTIO_IOMMU_PROBE_T_RESV_MEM:
+{
+viommu_dev *dev = bufstate->dev;
+
+g_tree_foreach(dev->reserved_regions,
+   virtio_iommu_fill_resv_mem_prop,
+   bufstate);
+if (!bufstate->error) {
+ret = 0;
+}
+break;
+}
+default:
+ret = -ENOENT;
+break;
+}
+out:
+if (ret) {
+error_report("%s property of type=%d could not be filled (%d),"
+ " remaining size = 0x%lx",
+ __func__, type, ret, bufstate->filled);
+}
+return ret;
+}
+
+static int virtio_iommu_probe(VirtIOIOMMU *s,
+  struct virtio_iommu_req_probe *req,
+  uint8_t *buf)
+{
+uint32_t devid = le32_to_cpu(req->device);
+int16_t prop_types = SUPPORTED_PROBE_PROPERTIES, type;
+viommu_property_buffer bufstate;
+viommu_dev *dev;
+int ret;
+
+dev = g_tree_lookup(s->devices, GUINT_TO_POINTER(devid));
+if (!dev) {
+return -EINVAL;
+}
+
+bufstate.start = buf;
+bufstate.filled = 0;
+bufstate.dev = dev;


bufstate.error is not initialized which may cause false alarm in 
virtio_iommu_fill_property()



+
+while ((type = ctz32(prop_types)) != 32) {
+ret = virtio_iommu_fill_property(devid, 1 << type, );
+if (ret) {
+break;
+}
+prop_types &= ~(1 << type);
+}
+virtio_iommu_fill_property(devid, VIRTIO_IOMMU_PROBE_T_NONE, );
+
+return VIRTIO_IOMMU_S_OK;
+}
+
  #define get_payload_size(req) (\
  sizeof((req)) - sizeof(struct virtio_iommu_req_tail))
  
@@ -433,6 +567,24 @@ static int virtio_iommu_handle_unmap(VirtIOIOMMU *s,

  return virtio_iommu_unmap(s, );
  }


Thanks,
Tomasz



Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback

2017-09-18 Thread Tomasz Nowicki

Hi Eric,

On 15.09.2017 16:50, Auger Eric wrote:

Hi,

On 15/09/2017 12:42, tn wrote:

Hi Eric,

On 15.09.2017 09:30, Auger Eric wrote:

Hi Tomasz,

On 14/09/2017 16:43, Tomasz Nowicki wrote:

On 14.09.2017 16:31, Tomasz Nowicki wrote:

Hi Eric,

On 14.09.2017 11:27, Linu Cherian wrote:

Hi Eric,

On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote:

memory_region_iommu_replay() is used for VFIO integration.

However its default implementation is not adapted to SMMUv3
IOMMU memory region. Indeed the input address range is too
huge and its execution is too slow as it calls the translate()
callback on each granule.

Let's implement the replay callback which hierarchically walk
over the page table structure and notify only the segments
that are populated with valid entries.

Signed-off-by: Eric Auger <eric.au...@redhat.com>
---
hw/arm/smmuv3.c | 36 
hw/arm/trace-events |  1 +
2 files changed, 37 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 8e7d10d..c43bd93 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry
*entry, void *private)
return 0;
}
+/* Unmap the whole notifier's range */
+static void smmuv3_unmap_notifier_range(IOMMUNotifier *n)
+{
+IOMMUTLBEntry entry;
+hwaddr size = n->end - n->start + 1;
+
+entry.target_as = _space_memory;
+entry.iova = n->start & ~(size - 1);
+entry.perm = IOMMU_NONE;
+entry.addr_mask = size - 1;
+
+memory_region_notify_one(n, );
+}
+
+static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n)
+{
+SMMUTransCfg cfg = {};
+int ret;
+
+trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end);
+smmuv3_unmap_notifier_range(n);
+
+ret = smmuv3_decode_config(mr, );
+if (ret) {
+error_report("%s error decoding the configuration for iommu
mr=%s",
+ __func__, mr->parent_obj.name);
+}



On an invalid config being found, shouldnt we return rather than
proceeding with
page table walk. For example on an invalid Stream table entry.


Indeed, without return here vhost case is not working for me.


I was just lucky one time. return here has no influence. Vhost still not
working. Sorry for noise.


As far as I understand the replay() callback only is called in VFIO use
case. So this shouldn't impact vhost.

I can't reproduce your vhost issue on my side. I will review the
invalidate code again and compare against the last version.

What is the page size used by your guest?


64K page size for guest as well as for host.

However, I've just checked 4K page size for guest and then vhost is
working fine.

So the bug stems from the incorrect target page size used on
SMMU_CMD_TLBI_NH_VA invalidation. This is now corrected by using the
actual config granule size and this fixes the issue with vhost use case
and 64KB page guest. I will release this fix early next week. Sorry for
the inconvenience.



Yes, that was it. I will provide my t-b for the next series version.

Thanks,
Tomasz



Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback

2017-09-14 Thread Tomasz Nowicki

On 14.09.2017 16:31, Tomasz Nowicki wrote:

Hi Eric,

On 14.09.2017 11:27, Linu Cherian wrote:

Hi Eric,

On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote:

memory_region_iommu_replay() is used for VFIO integration.

However its default implementation is not adapted to SMMUv3
IOMMU memory region. Indeed the input address range is too
huge and its execution is too slow as it calls the translate()
callback on each granule.

Let's implement the replay callback which hierarchically walk
over the page table structure and notify only the segments
that are populated with valid entries.

Signed-off-by: Eric Auger <eric.au...@redhat.com>
---
  hw/arm/smmuv3.c | 36 
  hw/arm/trace-events |  1 +
  2 files changed, 37 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 8e7d10d..c43bd93 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry 
*entry, void *private)

  return 0;
  }
+/* Unmap the whole notifier's range */
+static void smmuv3_unmap_notifier_range(IOMMUNotifier *n)
+{
+IOMMUTLBEntry entry;
+hwaddr size = n->end - n->start + 1;
+
+entry.target_as = _space_memory;
+entry.iova = n->start & ~(size - 1);
+entry.perm = IOMMU_NONE;
+entry.addr_mask = size - 1;
+
+memory_region_notify_one(n, );
+}
+
+static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n)
+{
+SMMUTransCfg cfg = {};
+int ret;
+
+trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end);
+smmuv3_unmap_notifier_range(n);
+
+ret = smmuv3_decode_config(mr, );
+if (ret) {
+error_report("%s error decoding the configuration for iommu 
mr=%s",

+ __func__, mr->parent_obj.name);
+}



On an invalid config being found, shouldnt we return rather than 
proceeding with

page table walk. For example on an invalid Stream table entry.


Indeed, without return here vhost case is not working for me.


I was just lucky one time. return here has no influence. Vhost still not 
working. Sorry for noise.


Tomasz



Thanks,
Tomasz



+

+if (cfg.disabled || cfg.bypassed) {
+return;
+}
+/* walk the page tables and replay valid entries */
+smmu_page_walk(, 0, (1ULL << (64 - cfg.tsz)) - 1, false,
+   smmuv3_notify_entry, n);
+}
  static void smmuv3_notify_iova_range(IOMMUMemoryRegion *mr, 
IOMMUNotifier *n,

   uint64_t iova, size_t size)
  {
@@ -1095,6 +1130,7 @@ static void 
smmuv3_iommu_memory_region_class_init(ObjectClass *klass,

  imrc->translate = smmuv3_translate;
  imrc->notify_flag_changed = smmuv3_notify_flag_changed;
+imrc->replay = smmuv3_replay;
  }
  static const TypeInfo smmuv3_type_info = {
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 4ac264d..15f84d6 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -46,5 +46,6 @@ smmuv3_cfg_stage(int s, uint32_t oas, uint32_t tsz, 
uint64_t ttbr, bool aa64, ui
  smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node 
for iommu mr=%s"
  smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node 
for iommu mr=%s"

  smmuv3_replay_mr(const char *name) "iommu mr=%s"
+smmuv3_replay(const char *name, void *n, hwaddr start, hwaddr end) 
"iommu mr=%s notifier=%p [0x%"PRIx64",0x%"PRIx64"]"
  smmuv3_notify_entry(hwaddr iova, hwaddr pa, hwaddr mask, int perm) 
"iova=0x%"PRIx64" pa=0x%" PRIx64" mask=0x%"PRIx64" perm=%d"
  smmuv3_notify_iova_range(const char *name, uint64_t iova, size_t 
size, void *n) "iommu mr=%s iova=0x%"PRIx64" size=0x%lx n=%p"

--
2.5.5








Re: [Qemu-devel] [Qemu-arm] [PATCH v7 13/20] hw/arm/smmuv3: Implement IOMMU memory region replay callback

2017-09-14 Thread Tomasz Nowicki

Hi Eric,

On 14.09.2017 11:27, Linu Cherian wrote:

Hi Eric,

On Fri Sep 01, 2017 at 07:21:16PM +0200, Eric Auger wrote:

memory_region_iommu_replay() is used for VFIO integration.

However its default implementation is not adapted to SMMUv3
IOMMU memory region. Indeed the input address range is too
huge and its execution is too slow as it calls the translate()
callback on each granule.

Let's implement the replay callback which hierarchically walk
over the page table structure and notify only the segments
that are populated with valid entries.

Signed-off-by: Eric Auger 
---
  hw/arm/smmuv3.c | 36 
  hw/arm/trace-events |  1 +
  2 files changed, 37 insertions(+)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 8e7d10d..c43bd93 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -657,6 +657,41 @@ static int smmuv3_notify_entry(IOMMUTLBEntry *entry, void 
*private)
  return 0;
  }
  
+/* Unmap the whole notifier's range */

+static void smmuv3_unmap_notifier_range(IOMMUNotifier *n)
+{
+IOMMUTLBEntry entry;
+hwaddr size = n->end - n->start + 1;
+
+entry.target_as = _space_memory;
+entry.iova = n->start & ~(size - 1);
+entry.perm = IOMMU_NONE;
+entry.addr_mask = size - 1;
+
+memory_region_notify_one(n, );
+}
+
+static void smmuv3_replay(IOMMUMemoryRegion *mr, IOMMUNotifier *n)
+{
+SMMUTransCfg cfg = {};
+int ret;
+
+trace_smmuv3_replay(mr->parent_obj.name, n, n->start, n->end);
+smmuv3_unmap_notifier_range(n);
+
+ret = smmuv3_decode_config(mr, );
+if (ret) {
+error_report("%s error decoding the configuration for iommu mr=%s",
+ __func__, mr->parent_obj.name);
+}



On an invalid config being found, shouldnt we return rather than proceeding with
page table walk. For example on an invalid Stream table entry.


Indeed, without return here vhost case is not working for me.

Thanks,
Tomasz



+

+if (cfg.disabled || cfg.bypassed) {
+return;
+}
+/* walk the page tables and replay valid entries */
+smmu_page_walk(, 0, (1ULL << (64 - cfg.tsz)) - 1, false,
+   smmuv3_notify_entry, n);
+}
  static void smmuv3_notify_iova_range(IOMMUMemoryRegion *mr, IOMMUNotifier *n,
   uint64_t iova, size_t size)
  {
@@ -1095,6 +1130,7 @@ static void 
smmuv3_iommu_memory_region_class_init(ObjectClass *klass,
  
  imrc->translate = smmuv3_translate;

  imrc->notify_flag_changed = smmuv3_notify_flag_changed;
+imrc->replay = smmuv3_replay;
  }
  
  static const TypeInfo smmuv3_type_info = {

diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 4ac264d..15f84d6 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -46,5 +46,6 @@ smmuv3_cfg_stage(int s, uint32_t oas, uint32_t tsz, uint64_t 
ttbr, bool aa64, ui
  smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu 
mr=%s"
  smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu 
mr=%s"
  smmuv3_replay_mr(const char *name) "iommu mr=%s"
+smmuv3_replay(const char *name, void *n, hwaddr start, hwaddr end) "iommu mr=%s notifier=%p 
[0x%"PRIx64",0x%"PRIx64"]"
  smmuv3_notify_entry(hwaddr iova, hwaddr pa, hwaddr mask, int perm) "iova=0x%"PRIx64" pa=0x%" 
PRIx64" mask=0x%"PRIx64" perm=%d"
  smmuv3_notify_iova_range(const char *name, uint64_t iova, size_t size, void *n) "iommu mr=%s 
iova=0x%"PRIx64" size=0x%lx n=%p"
--
2.5.5








Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support

2017-08-03 Thread Tomasz Nowicki

Hi Eric,

On 01.08.2017 15:07, Auger Eric wrote:

Hi Tomasz,
On 01/08/2017 13:01, Tomasz Nowicki wrote:

Hi Eric,

Just letting you know that I am facing another issue with the following
setup:
1. host (4.12 kernel & 64K page) and VM (4.12 kernel & 64K page)
2. QEMU + -netdev type=tap,ifname=tap,id=net0 -device
virtio-net-pci,netdev=net0,iommu_platform,disable-modern=off,disable-legacy=on

2. On VM, I allocate some huge pages and run DPDK testpmd app:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages
# ./dpdk/usertools/dpdk-devbind.py -b vfio-pci  :00:02.0
# ./dpdk/build/app/testpmd -l 0-13 -n 4 -w :00:02.0 --
--disable-hw-vlan-filter --disable-rss -i
EAL: Detected 14 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device :00:02.0 on NUMA socket -1
EAL:   probe driver: 1af4:1041 net_virtio
EAL:   using IOMMU type 1 (Type 1)
EAL: iommu_map_dma vaddr 2000 size 8000 iova 12000
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (0)
EAL: Can't write to PCI bar (0) : offset (4)
EAL: Can't write to PCI bar (0) : offset (14)
EAL: Can't write to PCI bar (0) : offset (e)
EAL: Can't read from PCI bar (0) : offset (c)
EAL: Requested device :00:02.0 cannot be used
EAL: No probed ethernet devices
Interactive-mode selected
USER1: create a new mbuf pool : n=251456, size=2176,
socket=0

When VM uses *4K pages* the same setup works fine. I will work on this
but please let me know in case you already know what is going on.


No I did not face that one. I was able to launch testpmd without such
early message. However I assigned an igbvf device to the guest and then
to DPDK. I've never tested your config.

However as stated in my cover letter at the moment DPDK is not working
for me because of storms of tlbi-on-maps. I intend to work on this as
soon as get some bandwidth, sorry.


I found what was the reason of failure.

QEMU creates BARs for VIRTIO PCI device. The size of it depends on what 
is necessary for VIRTIO protocol. In my case the BAR is 16K size which 
is too small to be mmapable for kernel with 64K pages:

vfio_pci_enable() -> vfio_pci_probe_mmaps() ->
here guest kernel checks that BAR size is smaller than current PAGE_SIZE 
and clears VFIO_REGION_INFO_FLAG_MMAP flag which prevents BAR from being 
mmapped later on. I added -device virtio-net-pci,...,page-per-vq=on to 
enlarge BAR size to 8M and now testpmd works fine. I wonder how the same 
setup is working with e.g. Intel or AMD IOMMU.


Thanks,
Tomasz



Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support

2017-08-01 Thread Tomasz Nowicki

Hi Eric,

Just letting you know that I am facing another issue with the following 
setup:

1. host (4.12 kernel & 64K page) and VM (4.12 kernel & 64K page)
2. QEMU + -netdev type=tap,ifname=tap,id=net0 -device 
virtio-net-pci,netdev=net0,iommu_platform,disable-modern=off,disable-legacy=on

2. On VM, I allocate some huge pages and run DPDK testpmd app:
# echo 4 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages
# ./dpdk/usertools/dpdk-devbind.py -b vfio-pci  :00:02.0
# ./dpdk/build/app/testpmd -l 0-13 -n 4 -w :00:02.0 -- 
--disable-hw-vlan-filter --disable-rss -i

EAL: Detected 14 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device :00:02.0 on NUMA socket -1
EAL:   probe driver: 1af4:1041 net_virtio
EAL:   using IOMMU type 1 (Type 1)
EAL: iommu_map_dma vaddr 2000 size 8000 iova 12000
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (12)
EAL: Can't write to PCI bar (0) : offset (12)
EAL: Can't read from PCI bar (0) : offset (0)
EAL: Can't write to PCI bar (0) : offset (4)
EAL: Can't write to PCI bar (0) : offset (14)
EAL: Can't write to PCI bar (0) : offset (e)
EAL: Can't read from PCI bar (0) : offset (c)
EAL: Requested device :00:02.0 cannot be used
EAL: No probed ethernet devices
Interactive-mode selected
USER1: create a new mbuf pool : n=251456, size=2176, 
socket=0


When VM uses *4K pages* the same setup works fine. I will work on this 
but please let me know in case you already know what is going on.


Thanks,
Tomasz


On 09.07.2017 22:51, Eric Auger wrote:

This series implements the emulation code for ARM SMMUv3.
This is the continuation of Prem's work [1].

This v5 mainly brings VFIO integration in DT mode. On guest kernel
side, this requires a quirk [1] to force TLB invalidation on map.

The following changes also are noticeable:
- fix SMMU_CMDQ_CONS offset
- adds dma-coherent dt property which fixes the unhandled command
   opcode bug.
- implements block PTE

The smmu is instantiated when passing the smmu option to machvirt:
"-M virt-2.10,smmu"

As I haven't split the code yet so that it can be easily reviewable
I don't expect deep reviews at this stage. Also the implementation may
be largely sub-optimal.

Tested Use Cases:
- booted a guest in dt and acpi mode with an iommu_platform
   virtio-net-pci device (using dma ops). Tested with the following
   guest combinations: 4K page - 39 bit VA, 4K - 48b, 64K - 39b,
   64K - 48b.
- booted a guest (featuring [1]) with PCIe passthrough'ed PCIe devices:
   - AMD Overdrive and igbvf passthrough (using gsi direct mapping)
   - Cavium ThunderX and ixgbevf passthrough (using KVM MSI routing)

Unfortunately I have not been able to run DPDK testpmd yet on guest side.
The problem I see is the user space driver dma-maps a huge area
and this causes plenty of CMDQ_OP_TLBI_NH_VA commands to be sent
(tlbi-on-map) which are sent for each page whereas the dma-map covers a
huge page. I will work on this issue for next version.

Known limitations:
- no VMSAv8-32 suport
- no nested stage support (S1 + S2)
- no support for HYP mappings
- register fine emulation, commands, interrupts and errors were
   not accurately tested. Handling is sufficient to run use cases
   described hereafter though.

Best Regards

Eric

This series can be found at:
v5: https://github.com/eauger/qemu/tree/v2.9-SMMU-v5
v4: https://github.com/eauger/qemu/tree/v2.9-SMMU-v4

References:
[1] [RFC 0/2] arm-smmu-v3 tlbi-on-map option
[2] Prem's last iteration:
- https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03531.html

History:
v4 -> v5:
- initial_level now part of SMMUTransCfg
- smmu_page_walk_64 takes into account the max input size
- implement sys->iommu_ops.replay and sys->iommu_ops.notify_flag_changed
- smmuv3_translate: bug fix: don't walk on bypass
- smmu_update_qreg: fix PROD index update
- I did not yet address Peter's comments as the code is not mature enough
   to be split into sub patches.

v3 -> v4 [Eric]:
- page table walk rewritten to allow scan of the page table within a
   range of IOVA. This prepares for VFIO integration and replay.
- configuration parsing partially reworked.
- do not advertise unsupported/untested features: S2, S1 + S2, HYP,
   PRI, ATS, ..
- added ACPI table generation
- migrated to dynamic traces
- mingw compilation fix

v2 -> v3 [Eric]:
- rebased on 2.9
- mostly code and patch reorganization to ease the review process
- optional patches removed. They may be handled separately. I am currently
   working on ACPI enablement.
- optional instantiation of the smmu in mach-virt
- removed [2/9] (fdt functions) since not mandated
- start splitting main patch into base and derived object
- no new function feature added

v1 -> v2 [Prem]:
- Adopted review comments from Eric Auger
 - Make SMMU_DPRINTF to 

Re: [Qemu-devel] [RFC v5 1/8] hw/arm/smmu-common: smmu base class

2017-07-25 Thread Tomasz Nowicki

Hi Eric,

I found out what is going on regarding vhost-net outgoing packet's 
payload corruption. My packets were corrupted because of inconsistent 
IOVA to HVA translation in IOTLB. Please see below.


On 09.07.2017 22:51, Eric Auger wrote:

Introduces the base device and class for the ARM smmu.
Implements VMSAv8-64 table lookup and translation. VMSAv8-32
is not implemented.

Signed-off-by: Eric Auger 
Signed-off-by: Prem Mallappa 

---


[...]


+
+/**
+ * smmu_page_walk_level_64 - Walk an IOVA range from a specific level
+ * @baseaddr: table base address corresponding to @level
+ * @level: level
+ * @cfg: translation config
+ * @start: end of the IOVA range
+ * @end: end of the IOVA range
+ * @hook_fn: the hook that to be called for each detected area
+ * @private: private data for the hook function
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @nofail: indicates whether each iova of the range
+ *  must be translated or whether failure is allowed
+ * @notify_unmap: whether we should notify invalid entries
+ *
+ * Return 0 on success, < 0 on errors not related to translation
+ * process, > 1 on errors related to translation process (only
+ * if nofail is set)
+ */
+static int
+smmu_page_walk_level_64(dma_addr_t baseaddr, int level,
+SMMUTransCfg *cfg, uint64_t start, uint64_t end,
+smmu_page_walk_hook hook_fn, void *private,
+bool read, bool write, bool nofail,
+bool notify_unmap)
+{
+uint64_t subpage_size, subpage_mask, pte, iova = start;
+bool read_cur, write_cur, entry_valid;
+int ret, granule_sz, stage;
+IOMMUTLBEntry entry;
+
+granule_sz = cfg->granule_sz;
+stage = cfg->stage;
+subpage_size = 1ULL << level_shift(level, granule_sz);
+subpage_mask = level_page_mask(level, granule_sz);
+
+trace_smmu_page_walk_level_in(level, baseaddr, granule_sz,
+  start, end, subpage_size);
+
+while (iova < end) {
+dma_addr_t next_table_baseaddr;
+uint64_t iova_next, pte_addr;
+uint32_t offset;
+
+iova_next = (iova & subpage_mask) + subpage_size;
+offset = iova_level_offset(iova, level, granule_sz);
+pte_addr = baseaddr + offset * sizeof(pte);
+pte = get_pte(baseaddr, offset);
+
+trace_smmu_page_walk_level(level, iova, subpage_size,
+   baseaddr, offset, pte);
+
+if (pte == (uint64_t)-1) {
+if (nofail) {
+return SMMU_TRANS_ERR_WALK_EXT_ABRT;
+}
+goto next;
+}
+if (is_invalid_pte(pte) || is_reserved_pte(pte, level)) {
+trace_smmu_page_walk_level_res_invalid_pte(stage, level, baseaddr,
+   pte_addr, offset, pte);
+if (nofail) {
+return SMMU_TRANS_ERR_WALK_EXT_ABRT;
+}
+goto next;
+}



vhost maintains its IOTLB cache and when there is no IOVA->HVA 
translation, it asks QEMU for help. However, IOTLB entries invalidations 
are guest initiative so for any DMA unmap at guest side we trap to SMMU 
emulation code and call:
smmu_notify_all -> smmuv3_replay_single -> smmu_page_walk_64 -> 
smmu_page_walk_level_64 -> smmuv3_replay_hook -> vhost_iommu_unmap_notify


The thing is that smmuv3_replay_hook() is never called because guest 
zeros PTE before we trap to QEMU so that smmu_page_walk_level_64() fails 
on ^^^ is_invalid_pte(pte) check. This way we keep old IOTLB entry in 
vhost and subsequent translations may be broken.



+
+read_cur = read; /* TODO */
+write_cur = write; /* TODO */
+entry_valid = read_cur | write_cur; /* TODO */
+
+if (is_page_pte(pte, level)) {
+uint64_t gpa = get_page_pte_address(pte, granule_sz);
+int perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+
+trace_smmu_page_walk_level_page_pte(stage, level, entry.iova,
+baseaddr, pte_addr, pte, gpa);
+if (!entry_valid && !notify_unmap) {
+printf("%s entry_valid=%d notify_unmap=%d\n", __func__,
+   entry_valid, notify_unmap);
+goto next;
+}
+ret = call_entry_hook(iova, subpage_mask, gpa, perm,
+  hook_fn, private);
+if (ret) {
+return ret;
+}
+goto next;
+}
+if (is_block_pte(pte, level)) {
+uint64_t block_size;
+hwaddr gpa = get_block_pte_address(pte, level, granule_sz,
+   _size);
+int perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+
+if (gpa == -1) {
+if (nofail) {
+

Re: [Qemu-devel] [RFC v5 0/8] ARM SMMUv3 Emulation Support

2017-07-14 Thread Tomasz Nowicki

Hi Eric,

With fixes in comments that I made, I was able to run VM with 
virtio-blk-pci and virtio-net-pci devices.


I have tried vhost-net as well but I am seeing outgoing packets payload 
corrupted from host perspective, tcpdump on host tun i/f shows zeroes in 
packet payload. However, tcpdump in VM shows that packets are fine. Have 
you seen anything like that? Packets incoming to VM are fine though. I 
will keep debugging on my side too.


Thanks,
Tomasz

On 09.07.2017 22:51, Eric Auger wrote:

This series implements the emulation code for ARM SMMUv3.
This is the continuation of Prem's work [1].

This v5 mainly brings VFIO integration in DT mode. On guest kernel
side, this requires a quirk [1] to force TLB invalidation on map.

The following changes also are noticeable:
- fix SMMU_CMDQ_CONS offset
- adds dma-coherent dt property which fixes the unhandled command
   opcode bug.
- implements block PTE

The smmu is instantiated when passing the smmu option to machvirt:
"-M virt-2.10,smmu"

As I haven't split the code yet so that it can be easily reviewable
I don't expect deep reviews at this stage. Also the implementation may
be largely sub-optimal.

Tested Use Cases:
- booted a guest in dt and acpi mode with an iommu_platform
   virtio-net-pci device (using dma ops). Tested with the following
   guest combinations: 4K page - 39 bit VA, 4K - 48b, 64K - 39b,
   64K - 48b.
- booted a guest (featuring [1]) with PCIe passthrough'ed PCIe devices:
   - AMD Overdrive and igbvf passthrough (using gsi direct mapping)
   - Cavium ThunderX and ixgbevf passthrough (using KVM MSI routing)

Unfortunately I have not been able to run DPDK testpmd yet on guest side.
The problem I see is the user space driver dma-maps a huge area
and this causes plenty of CMDQ_OP_TLBI_NH_VA commands to be sent
(tlbi-on-map) which are sent for each page whereas the dma-map covers a
huge page. I will work on this issue for next version.

Known limitations:
- no VMSAv8-32 suport
- no nested stage support (S1 + S2)
- no support for HYP mappings
- register fine emulation, commands, interrupts and errors were
   not accurately tested. Handling is sufficient to run use cases
   described hereafter though.

Best Regards

Eric

This series can be found at:
v5: https://github.com/eauger/qemu/tree/v2.9-SMMU-v5
v4: https://github.com/eauger/qemu/tree/v2.9-SMMU-v4

References:
[1] [RFC 0/2] arm-smmu-v3 tlbi-on-map option
[2] Prem's last iteration:
- https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03531.html

History:
v4 -> v5:
- initial_level now part of SMMUTransCfg
- smmu_page_walk_64 takes into account the max input size
- implement sys->iommu_ops.replay and sys->iommu_ops.notify_flag_changed
- smmuv3_translate: bug fix: don't walk on bypass
- smmu_update_qreg: fix PROD index update
- I did not yet address Peter's comments as the code is not mature enough
   to be split into sub patches.

v3 -> v4 [Eric]:
- page table walk rewritten to allow scan of the page table within a
   range of IOVA. This prepares for VFIO integration and replay.
- configuration parsing partially reworked.
- do not advertise unsupported/untested features: S2, S1 + S2, HYP,
   PRI, ATS, ..
- added ACPI table generation
- migrated to dynamic traces
- mingw compilation fix

v2 -> v3 [Eric]:
- rebased on 2.9
- mostly code and patch reorganization to ease the review process
- optional patches removed. They may be handled separately. I am currently
   working on ACPI enablement.
- optional instantiation of the smmu in mach-virt
- removed [2/9] (fdt functions) since not mandated
- start splitting main patch into base and derived object
- no new function feature added

v1 -> v2 [Prem]:
- Adopted review comments from Eric Auger
 - Make SMMU_DPRINTF to internally call qemu_log
 (since translation requests are too many, we need control
  on the type of log we want)
 - SMMUTransCfg modified to suite simplicity
 - Change RegInfo to uint64 register array
 - Code cleanup
 - Test cleanups
- Reshuffled patches

v0 -> v1 [Prem]:
- As per SMMUv3 spec 16.0 (only is_ste_consistant() is noticeable)
- Reworked register access/update logic
- Factored out translation code for
 - single point bug fix
 - sharing/removal in future
- (optional) Unit tests added, with PCI test device
 - S1 with 4k/64k, S1+S2 with 4k/64k
 - (S1 or S2) only can be verified by Linux 4.7 driver
 - (optional) Priliminary ACPI support

v0 [Prem]:
- Implements SMMUv3 spec 11.0
- Supported for PCIe devices,
- Command Queue and Event Queue supported
- LPAE only, S1 is supported and Tested, S2 not tested
- BE mode Translation not supported
- IRQ support (legacy, no MSI)
- Tested with DPDK and e1000


Eric Auger (5):
   hw/arm/smmu-common: smmu base class
   hw/arm/virt: Add 2.10 machine type
   hw/arm/virt: Add tlbi-on-map property to the smmuv3 node
   target/arm/kvm: Translate the MSI doorbell in 

Re: [Qemu-devel] [RFC v5 2/8] hw/arm/smmuv3: smmuv3 emulation model

2017-07-13 Thread Tomasz Nowicki

Hi Eric,

On 09.07.2017 22:51, Eric Auger wrote:

From: Prem Mallappa 

Introduces the SMMUv3 derived model. This is based on
System MMUv3 specification (v17).

Signed-off-by: Prem Mallappa 
Signed-off-by: Eric Auger 

---
v4 -> v5:
- change smmuv3_translate proto (IOMMUAccessFlags flag)
- has_stagex replaced by is_ste_stagex
- smmu_cfg_populate removed
- added smmuv3_decode_config and reworked error management
- remwork the naming of IOMMU mrs
- fix SMMU_CMDQ_CONS offset



[...]


+
+static void smmu_update_qreg(SMMUV3State *s, SMMUQueue *q, hwaddr reg,
+ uint32_t off, uint64_t val, unsigned size)
+{
+   if (size == 8 && off == 0) {
+smmu_write64_reg(s, reg, val);


Based on my observation we never get here.

If I read the code correctly, 
memory_region_dispatch_{write|read}()->memory_region_{write|read}_accessor() 
will cut all 8-bytes accesses into 4-bytes slices. However, this makes 
my SMMUv3 register handling happy:


 static const MemoryRegionOps smmu_mem_ops = {
 .read = smmu_read_mmio,
 .write = smmu_write_mmio,
 .endianness = DEVICE_LITTLE_ENDIAN,
 .valid = {
 .min_access_size = 4,
 .max_access_size = 8,
 },
+.impl = {
+.min_access_size = 4,
+   .max_access_size = 8,
+},
 };

Thanks,
Tomasz



Re: [Qemu-devel] [RFC v5 2/8] hw/arm/smmuv3: smmuv3 emulation model

2017-07-13 Thread Tomasz Nowicki

Hi Eric,

On 09.07.2017 22:51, Eric Auger wrote:

From: Prem Mallappa 

Introduces the SMMUv3 derived model. This is based on
System MMUv3 specification (v17).

Signed-off-by: Prem Mallappa 
Signed-off-by: Eric Auger 

---
v4 -> v5:
- change smmuv3_translate proto (IOMMUAccessFlags flag)
- has_stagex replaced by is_ste_stagex
- smmu_cfg_populate removed
- added smmuv3_decode_config and reworked error management
- remwork the naming of IOMMU mrs
- fix SMMU_CMDQ_CONS offset



[...]


+
+/*
+ *  Register Access Primitives
+ */
+
+static inline void smmu_write64_reg(SMMUV3State *s, uint32_t addr, uint64_t 
val)
+{
+addr >>= 2;
+s->regs[addr] = val & 0xULL;
+s->regs[addr + 1] = val & ~0xULL;
+}
+
+static inline void smmu_write_reg(SMMUV3State *s, uint32_t addr, uint64_t val)
+{
+s->regs[addr >> 2] = val;
+}
+
+static inline uint32_t smmu_read_reg(SMMUV3State *s, uint32_t addr)
+{
+return s->regs[addr >> 2];
+}
+
+static inline uint64_t smmu_read64_reg(SMMUV3State *s, uint32_t addr)
+{
+addr >>= 2;
+return s->regs[addr] | (s->regs[addr + 1] << 32);


To be consistent with smmu_write64_reg() we should not shift here second 
half of register, instead simply:


return s->regs[addr] | s->regs[addr + 1];

Thanks,
Tomasz