Re: [virtio-dev] [RFC] virtio-iommu version 0.6
On 22/03/18 09:44, Tian, Kevin wrote: >> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] >> Sent: Wednesday, March 21, 2018 9:14 PM >> >> Hi Kevin, >> >> Thanks for the comments >> >> On 19/03/18 10:03, Tian, Kevin wrote: >>> BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it >>> intended? >> >> In my opinion BYPASS is a bit different from the other features: while the >> others are needed for correctness, this one is optional and even if the >> guest supports BYPASS, it should be allowed not to accept it. For security >> reasons it may not want to let endpoints access the whole guest-physical >> address space. > > ok, possibly because I'm not familiar with virtio spec convention. > My original feeling is that each feature bit will have a behavior > description regarding to whether device reports and whether > driver accepts. If no need to cover optional feature, then it's fine > to me. :-) I think the virtio spec allows the reader to choose how to implement any behavior that isn't explicitly described. Whenever a SHOULD or MUST isn't stated in a requirement section, the spec implies "feature X MAY be enabled if offered". Looking at the virtio-net specification, optional features such as VIRTIO_NET_F_VLAN aren't mentioned in driver requirements. On the other hand, features that the device needs in order to function properly are explicitly stated, for example VIRTIO_NET_F_MAC. >> [...] >>> Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI. >>> I know there were quite some discussion around this flag before, >>> but my mental picture now still is a bit difficult to understand its >>> usage based on examples in implementation notes: >>> >>> - for x86, you describe it as indicating MSI bypass for >>> oxfeex. However guest doesn't need to know this fact. Only >>> requirement is to treat it as reserved range (as on bare metal) >>> then T_RESERVED is sufficient for this purpose> >>> - for ARM, either let guest or let host to choose a virtual >>> address for mapping to MSI doorbell. the former doesn't require >>> a reserved range. for the latter also T_RESERVED is enough as >>> the example in hardware device assignment section. >> >> It might be nicer to have the host decide it, but when the physical IOMMU >> is ARM SMMU, nesting translation complicates things, because the guest >> *has* to create a mapping: > > confirm one thing first. v0.6 doesn't support binding to guest IOVA > page table yet. So based on current map/unmap interface, there is > no such complicity right? stage-1 is just a shadowed translation (IOVA > ->HPA) to guest side (IOVA->GPA) with nesting disabled. In that case > the default behavior is host-reserved style. Yes, with v0.6 the host chooses whether to translate or bypass the MSIs. > Then comes nested scenario: > >> >> * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the >> MSI doorbell. The GPA is mapped to the physical MSI doorbell at >> stage-2 by the host. > > Is it a must that above GPA is mapped to physical MSI doorbell? > > Ideally: > > 1) Host reserves some IOVA range for mapping MSI doorbell > 2) the range is reported to user space > 3) Qemu reported the range as reserved on endpoints, marked > as T_IDENTITY (a new type to be introduced), meaning guest > needs setup identity mapping in stage-1 I think it works, and is saner than my idea. I had one concern (thought I had more but can't find any in my notes; they might resurface when prototyping): the MSI doorbell has to be mapped with device attributes (MMIO) to ensure proper interrupt delivery. I think it's fine because the host maps the doorbell with the right attributes at stage-2, which will take precedence over what the guest uses for stage-1 according to the Arm architecture. So I don't think T_IDENTITY requires arch-specific attributes for now but there should definitely be space for extension. I also noticed that Intel and AMD drivers in Linux don't add any special attribute or protection to identity mappings (RESV_DIRECT in Linux), apart from READ/WRITE. For reference the previous discussion about T_IDENTITY is here: https://www.spinics.net/lists/kvm/msg155240.html One problem with my idea was that if the device is behind multiple IRQ chips, the host cannot know which IRQ chip the IOVA corresponds to, when reading the MSI-X tables. Then again I don't know why multiple IRQ chips per device would be desirable, but it's allowed. > 4) Then device should be able to ring physical MSI doorbell > 5) I'm not sure whether guest still needs to allocate its own > IOVA and mapping to vGIC doorbell in such case... I don't think it matters, because the host MSI infrastructure is decoupled from the guest's. The host programs the physical MSI-X tables and only needs the guest to create a stage-1 mappings for those, which is done with T_IDENTITY. The guest doesn't know what the T_IDENTITY mapping is for. Then QEMU can choose whether to map or bypass
RE: [virtio-dev] [RFC] virtio-iommu version 0.6
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] > Sent: Wednesday, March 21, 2018 9:14 PM > > Hi Kevin, > > Thanks for the comments > > On 19/03/18 10:03, Tian, Kevin wrote: > > BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it > > intended? > > In my opinion BYPASS is a bit different from the other features: while the > others are needed for correctness, this one is optional and even if the > guest supports BYPASS, it should be allowed not to accept it. For security > reasons it may not want to let endpoints access the whole guest-physical > address space. ok, possibly because I'm not familiar with virtio spec convention. My original feeling is that each feature bit will have a behavior description regarding to whether device reports and whether driver accepts. If no need to cover optional feature, then it's fine to me. :-) > > > --2.6.2 Device Requirements: Device operations-- > > > > "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, > > the device MUST truncate the range described by virt_start > > and virt_end in requests to ft in the range described by > > input_range." > > > > "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device > > MUST ignore bits above domain_bits in field domain of requests." > > > > shouldn't device return error upon above situation instead > > of continuing operation with modified value? > > The intent was to allow device and driver to pass metadata in the top bits > of pointers and domain IDs, perhaps for a future extension or for > debugging. I'm not convinced it's useful anymore (and do wonder what my > initial reasoning was...) Such extension would be determined by a new > feature bit and the device would know that the fields contain additional > info, if the driver accepted that feature. > > > --2.6.4 DETACH request-- > > > > " Detach an endpoint from its domain. When this request > > completes, the endpoint cannot access any mapping from that > > domain anymore " > > > > Does it make sense to mention BYPASS situation upon this request? > > Sure, I'll add something to clarify that situation > > > --2.6.8.2 Property RESV_MEM -- > > > > "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to > > virtual addresses in this region are not translated by the device. > > They may either be aborted by the device (or the underlying > > IOMMU), bypass it, or never even reach it" > > > > in 3.3 hardware device assignment you described an example > > that a reserved range is stolen by host to map the MSI > > doorbell... such map behavior is not described above. > > Right, we can say that accesses to the region may be used for host > translation > > > Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI. > > I know there were quite some discussion around this flag before, > > but my mental picture now still is a bit difficult to understand its > > usage based on examples in implementation notes: > > > > - for x86, you describe it as indicating MSI bypass for > > oxfeex. However guest doesn't need to know this fact. Only > > requirement is to treat it as reserved range (as on bare metal) > > then T_RESERVED is sufficient for this purpose> > > - for ARM, either let guest or let host to choose a virtual > > address for mapping to MSI doorbell. the former doesn't require > > a reserved range. for the latter also T_RESERVED is enough as > > the example in hardware device assignment section. > > It might be nicer to have the host decide it, but when the physical IOMMU > is ARM SMMU, nesting translation complicates things, because the guest > *has* to create a mapping: confirm one thing first. v0.6 doesn't support binding to guest IOVA page table yet. So based on current map/unmap interface, there is no such complicity right? stage-1 is just a shadowed translation (IOVA ->HPA) to guest side (IOVA->GPA) with nesting disabled. In that case the default behavior is host-reserved style. Then comes nested scenario: > > * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the > MSI doorbell. The GPA is mapped to the physical MSI doorbell at > stage-2 by the host. Is it a must that above GPA is mapped to physical MSI doorbell? Ideally: 1) Host reserves some IOVA range for mapping MSI doorbell 2) the range is reported to user space 3) Qemu reported the range as reserved on endpoints, marked as T_IDENTITY (a new type to be introduced), meaning guest needs setup identity mapping in stage-1 4) Then device should be able to ring physical MSI doorbell 5) I'm not sure whether guest still needs to allocate its own IOVA and mapping to vGIC doorbell in such case... > > * when it writes that IOVA in the virtual MSI-X tables, the host updates > the physical MSI-X tables using an ioctl (it can't map the physical > tables into the guest, because MSI-X vector data also contain physical > device info that shouldn't be known by the guest.) > > So we need support for software-mapped MSIs in the guest anyway. In >
Re: [virtio-dev] [RFC] virtio-iommu version 0.6
Hi Kevin, Thanks for the comments On 19/03/18 10:03, Tian, Kevin wrote: > BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it > intended? In my opinion BYPASS is a bit different from the other features: while the others are needed for correctness, this one is optional and even if the guest supports BYPASS, it should be allowed not to accept it. For security reasons it may not want to let endpoints access the whole guest-physical address space. > --2.6.2 Device Requirements: Device operations-- > > "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, > the device MUST truncate the range described by virt_start > and virt_end in requests to ft in the range described by > input_range." > > "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device > MUST ignore bits above domain_bits in field domain of requests." > > shouldn't device return error upon above situation instead > of continuing operation with modified value? The intent was to allow device and driver to pass metadata in the top bits of pointers and domain IDs, perhaps for a future extension or for debugging. I'm not convinced it's useful anymore (and do wonder what my initial reasoning was...) Such extension would be determined by a new feature bit and the device would know that the fields contain additional info, if the driver accepted that feature. > --2.6.4 DETACH request-- > > " Detach an endpoint from its domain. When this request > completes, the endpoint cannot access any mapping from that > domain anymore " > > Does it make sense to mention BYPASS situation upon this request? Sure, I'll add something to clarify that situation > --2.6.8.2 Property RESV_MEM -- > > "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to > virtual addresses in this region are not translated by the device. > They may either be aborted by the device (or the underlying > IOMMU), bypass it, or never even reach it" > > in 3.3 hardware device assignment you described an example > that a reserved range is stolen by host to map the MSI > doorbell... such map behavior is not described above. Right, we can say that accesses to the region may be used for host translation > Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI. > I know there were quite some discussion around this flag before, > but my mental picture now still is a bit difficult to understand its > usage based on examples in implementation notes: > > - for x86, you describe it as indicating MSI bypass for > oxfeex. However guest doesn't need to know this fact. Only > requirement is to treat it as reserved range (as on bare metal) > then T_RESERVED is sufficient for this purpose> > - for ARM, either let guest or let host to choose a virtual > address for mapping to MSI doorbell. the former doesn't require > a reserved range. for the latter also T_RESERVED is enough as > the example in hardware device assignment section. It might be nicer to have the host decide it, but when the physical IOMMU is ARM SMMU, nesting translation complicates things, because the guest *has* to create a mapping: * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the MSI doorbell. The GPA is mapped to the physical MSI doorbell at stage-2 by the host. * when it writes that IOVA in the virtual MSI-X tables, the host updates the physical MSI-X tables using an ioctl (it can't map the physical tables into the guest, because MSI-X vector data also contain physical device info that shouldn't be known by the guest.) So we need support for software-mapped MSIs in the guest anyway. In Linux the virtio-iommu driver can use existing plumbing for that, no change needed in the guest. > Then what's the real point of differentiating MSI bypass from > normal reserved range? Is there other example where guest > may use such info to do special thing? The presence of an MSI bypass regions allows the driver and the DMA layer (in Linux iommu_dma_map_msi_msg) to know whether it needs to create a mapping or if it's bypassed. When writing the entry into the MSI-X table, if the address in the MSI vector already corresponds to an MSI bypass region, then there is no need to create a mapping. Making the bypass MSI region a normal RESV range hides that information. Another way of choosing would be with #ifdef CONFIG_ARM64, CONFIG_X86 etc, but I find it nasty, and I personally prefer using MSI bypass for ARM when possible. > -- 3.2 Message Signaled Interrupts -- > > " Section 3.2.2 describes the added complexity (from the host > point of view) of translating the IRQ chip doorbell " > > however later 3.2.2 only says one sentence about host > complexity - " However, this mode of operations may add > significant complexity in the host implementation". If you plan > to elaborate that part in the future, please add a note. :-) Oh right, I don't really want to elaborate, I'll try to reword 3.2. The kvmtool patches have support for both MSI bypass and translation (selectable
RE: [virtio-dev] [RFC] virtio-iommu version 0.6
> From: Jean-Philippe Brucker > Sent: Wednesday, February 7, 2018 2:11 AM > > Please find version 0.6 of the virtio-iommu specification at the following > locations. > > Document: http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.pdf > http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.html > > Sources: git://linux-arm.org/virtio-iommu.git viommu/v0.6 > > I didn't receive any comment for v0.5 [1], so this update is fairly light: > * Changed range definition in map and unmap requests > * Use virtio-iommu device ID > * Update IORT node ID and simplify layout > > The following document shows the difference between v0.5 and v0.6: > http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-pdf-diff-v0.5- > v0.6.pdf > > Next version will hopefully add vSVA (PASID+PRI) and/or architectural > optimizations, but I can't provide a timeline yet. I'll probably send a > small draft first. > > I will send the Linux driver soon. In the meantime you can fetch it on my > development branches, based on next-20180206. > > git://linux-arm.org/linux-jpb.git virtio-iommu/devel > git://linux-arm.org/kvmtool-jpb.git virtio-iommu/devel > > Any comments welcome! > > Thanks, > Jean > > [1] https://www.spinics.net/lists/kvm/msg157402.html some comments here: -- BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it intended? --2.6.2 Device Requirements: Device operations-- "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST truncate the range described by virt_start and virt_end in requests to ft in the range described by input_range." "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device MUST ignore bits above domain_bits in field domain of requests." shouldn't device return error upon above situation instead of continuing operation with modified value? --2.6.4 DETACH request-- " Detach an endpoint from its domain. When this request completes, the endpoint cannot access any mapping from that domain anymore " Does it make sense to mention BYPASS situation upon this request? --2.6.8.2 Property RESV_MEM -- "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to virtual addresses in this region are not translated by the device. They may either be aborted by the device (or the underlying IOMMU), bypass it, or never even reach it" in 3.3 hardware device assignment you described an example that a reserved range is stolen by host to map the MSI doorbell... such map behavior is not described above. Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI. I know there were quite some discussion around this flag before, but my mental picture now still is a bit difficult to understand its usage based on examples in implementation notes: - for x86, you describe it as indicating MSI bypass for oxfeex. However guest doesn't need to know this fact. Only requirement is to treat it as reserved range (as on bare metal) then T_RESERVED is sufficient for this purpose - for ARM, either let guest or let host to choose a virtual address for mapping to MSI doorbell. the former doesn't require a reserved range. for the latter also T_RESERVED is enough as the example in hardware device assignment section. Then what's the real point of differentiating MSI bypass from normal reserved range? Is there other example where guest may use such info to do special thing? -- 3.2 Message Signaled Interrupts -- " Section 3.2.2 describes the added complexity (from the host point of view) of translating the IRQ chip doorbell " however later 3.2.2 only says one sentence about host complexity - " However, this mode of operations may add significant complexity in the host implementation". If you plan to elaborate that part in the future, please add a note. :-) Thanks Kevin