Hello Julien, Volodymyr
On 8/27/25 01:28, Volodymyr Babchuk wrote:
Hi Milan,
Milan Djokic <milan_djo...@epam.com> writes:
Hello Julien,
On 8/13/25 14:11, Julien Grall wrote:
On 13/08/2025 11:04, Milan Djokic wrote:
Hello Julien,
Hi Milan,
We have prepared a design document and it will be part of the updated
patch series (added in docs/design). I'll also extend the cover letter
with details on the implementation structure to make review easier.
I would suggest to just iterate on the design document for now.
Following is the design document content which will be provided in
updated patch series:
Design Proposal: Add SMMUv3 Stage-1 Support for Xen Guests
==========================================================
Author: Milan Djokic <milan_djo...@epam.com>
Date: 2025-08-07
Status: Draft
Introduction
------------
The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the OS.
Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.
Motivation
----------
ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.
Can you clarify what you mean by "correct"? DMA would still work
without
stage-1.
Correct in terms of working with guest-managed I/O space. I'll
rephrase this statement, it seems ambiguous.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation
Design Overview
---------------
These changes provide emulated SMMUv3 support:
- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
So what are you planning to expose to a guest? Is it one vIOMMU per
pIOMMU? Or a single one?
Single vIOMMU model is used in this design.
Have you considered the pros/cons for both?
- Register/Command Emulation: SMMUv3 register emulation and command
queue handling
That's a point for consideration.
A single vIOMMU prevails in terms of a less complex implementation and a
simple guest IOMMU model: single vIOMMU node, one interrupt path, one
event queue, a single set of trap handlers for emulation, etc.
Cons of a single vIOMMU model are a less accurate hardware
representation and a potential bottleneck with one emulated queue and
interrupt path.
On the other hand, vIOMMU per pIOMMU provides more accurate hardware
modeling and offers better scalability in case of many IOMMUs in the
system, but this comes with more complex emulation logic and device
tree, and also handling multiple vIOMMUs on the guest side.
IMO, the single vIOMMU model seems like a better option mostly because
it's less complex, easier to maintain and debug. Of course, this
decision can and should be discussed.
Well, I am not sure that this is possible, because of StreamID
allocation. The biggest offender is of course PCI, as each root PCI
bridge will require its own SMMU instance with its own StreamID space.
But even without PCI you'll need some mechanism to map vStreamID to
<pSMMU, pStreamID>, because there will be overlaps in SID space.
Actually, PCI/vPCI with vSMMU is its own can of worms...
For each pSMMU, we have a single command queue that will receive
commands from all the guests. How do you plan to prevent a guest from
hogging the command queue?
In addition to that, AFAIU, the size of the virtual command queue is
fixed by the guest rather than Xen. If a guest is filling up the queue
with commands before notifying Xen, how do you plan to ensure we don't
spend too much time in Xen (which is not preemptible)?
We'll have to do a detailed analysis of these scenarios; they are not
covered by the design (as are some others, which is clear after your
comments). I'll come back with an updated design.
I think that can be handled akin to hypercall continuation, which is
used in similar places, like P2M code
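Something along these lines, as a very rough sketch (all names are
hypothetical, only to illustrate bounding the work done per pass):

/* Rough sketch only: bound the work done per trap and resume later,
 * akin to hypercall continuation. All names are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

#define CMDS_PER_PASS 32            /* budget before yielding */

struct vsmmu_cmdq {
    uint32_t cons, prod;            /* consumer/producer indices */
    uint32_t q_mask;                /* queue size - 1 (power of two) */
};

/* Hypothetical helper: fetch, validate and emulate one command. */
bool vsmmu_handle_one_cmd(struct vsmmu_cmdq *q, uint32_t idx);

/* Returns true when the queue is drained, false when processing
 * should be resumed later (softirq / next guest entry). */
bool vsmmu_process_cmdq(struct vsmmu_cmdq *q)
{
    unsigned int budget = CMDS_PER_PASS;

    while ( q->cons != q->prod )
    {
        if ( budget-- == 0 )
            return false;           /* continuation point */
        if ( !vsmmu_handle_one_cmd(q, q->cons & q->q_mask) )
            break;                  /* bad command: stop, raise error */
        q->cons++;
    }
    return true;
}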
[...]
I have updated the vIOMMU design document with additional security
topics covered and performance impact results. I also added some
additional explanations for the vIOMMU components following your
comments.
Updated document content:
==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for Xen Guests
==========================================================
:Author: Milan Djokic <milan_djo...@epam.com>
:Date: 2025-08-07
:Status: Draft
Introduction
============
The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2,
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the
OS. Xen already supports Stage 2 translation but there is no support for
Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support
in Xen for ARM guests.
Motivation
==========
ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure secure DMA and guest-managed I/O memory mappings.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation
Design Overview
===============
These changes provide emulated SMMUv3 support:
- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
dynamic enablement.
The vIOMMU is exposed to the guest as a single device with a predefined
set of capabilities and supported commands. The single vIOMMU model
abstracts the details of the actual IOMMU hardware, simplifying usage
from the guest's point of view. The guest OS handles only a single
IOMMU, even if multiple IOMMU units are available on the host system.
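Because a single vIOMMU may front several physical SMMUs, guest-visible
StreamIDs have to be translated to a <pSMMU, pStreamID> pair before any
command or configuration reaches the hardware, as noted in the
discussion above. The following is a minimal sketch of one possible
per-domain lookup structure; all names are placeholders, not the actual
data structures of this series.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    struct psmmu;                  /* physical SMMUv3 instance (opaque) */

    struct vsid_entry {
        uint32_t vsid;             /* StreamID as seen by the guest */
        struct psmmu *psmmu;       /* owning physical SMMU instance */
        uint32_t psid;             /* StreamID on that instance */
    };

    struct vsid_map {
        struct vsid_entry *entries; /* sorted by vsid */
        size_t nr;
    };

    /* Binary search; returns NULL for an unknown vSID so the caller
     * rejects the command instead of touching the wrong pSMMU. */
    const struct vsid_entry *vsid_lookup(const struct vsid_map *m,
                                         uint32_t vsid)
    {
        size_t lo = 0, hi = m->nr;

        while ( lo < hi )
        {
            size_t mid = lo + (hi - lo) / 2;

            if ( m->entries[mid].vsid == vsid )
                return &m->entries[mid];
            if ( m->entries[mid].vsid < vsid )
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }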
Security Considerations
=======================
**vIOMMU security benefits:**
- Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
- Emulated IOMMU removes the guest dependency on IOMMU hardware while
maintaining domain isolation.
1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`) and logic to write both Stage-1 and
Stage-2 entries in the Stream Table Entry (STE), including an `abort`
field to handle partial configuration states.
**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an inconsistent state, potentially enabling
unauthorized access or causing cross-domain interference.
**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to
the STE and manages the `abort` field, only honouring the Stage-1
configuration once it is fully attached. This ensures incomplete or
invalid guest configurations are safely ignored by the hypervisor.
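A minimal sketch of this rule; structures and names are illustrative,
not the actual driver code:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct s1_cfg {
        bool attached;           /* guest Stage-1 tables fully set up */
    };

    struct s2_cfg {
        uint64_t s2ttb;          /* Stage-2 table base (illustrative) */
    };

    struct ste_state {
        struct s1_cfg *s1;       /* NULL: guest has no Stage-1 config */
        struct s2_cfg *s2;
        bool abort;              /* when set, the STE blocks transactions */
    };

    void ste_update(struct ste_state *ste)
    {
        /* A partially applied Stage-1 configuration is never written
         * out: either the STE carries a complete S1+S2 configuration,
         * or transactions are aborted. */
        if ( ste->s1 && !ste->s1->attached )
        {
            ste->abort = true;   /* incomplete guest config: fail safe */
            return;
        }
        ste->abort = false;
        /* ... write S1 (if any) and S2 fields to the hardware STE ... */
    }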
2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidations need to be
forwarded to the SMMUv3 hardware to maintain coherence.
**Risk:**
Failing to propagate cache invalidations could leave stale mappings in
place, enabling access through old mappings and possibly data leakage or
misrouting.
**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.
3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command queues,
event queues, domain management, and Device Tree modifications (e.g.,
`iommus` nodes and `libxl` integration).
**Risk:**
Large feature expansions increase the attack surface: potential race
conditions, unchecked command inputs, or Device Tree-based
misconfigurations.
**Mitigation:**
- Sanity checks and error-handling improvements have been introduced in
this feature.
- Further audits have to be performed for this feature and its
dependencies in this area. Currently, the feature is marked as *Tech
Preview* and is self-contained, reducing the risk to unrelated components.
4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured command queues and commands
(e.g., `CMD_CFGI_STE`) and event notifications.
**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state, or cause Dom0 instability.
**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations are applied. This is supported via
additions in `vsmmuv3` and `cmdqueue` handling code.
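As an illustration of the kind of whitelist this implies, a sketch
follows; the opcode values are those of the Arm SMMUv3 command set,
while the structures and helpers are hypothetical:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define CMD_CFGI_STE     0x03  /* configure/invalidate one STE */
    #define CMD_TLBI_NH_ASID 0x11  /* Stage-1 TLB invalidate by ASID */
    #define CMD_SYNC         0x46  /* command queue synchronisation */

    struct vsmmu_cmd {
        uint8_t opcode;
        uint32_t vsid;             /* virtual StreamID, if applicable */
    };

    /* Hypothetical helper: is the device assigned to this guest? */
    bool vsid_is_assigned(uint32_t vsid);

    /* Only whitelisted commands are emulated; a guest may only touch
     * STEs of devices actually assigned to it. */
    bool vsmmu_check_cmd(const struct vsmmu_cmd *cmd)
    {
        switch ( cmd->opcode )
        {
        case CMD_CFGI_STE:
            return vsid_is_assigned(cmd->vsid);
        case CMD_TLBI_NH_ASID:
        case CMD_SYNC:
            return true;
        default:
            return false;          /* unknown/privileged command: drop */
        }
    }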
5. Observation:
---------------
Device Tree modifications enable device assignment and configuration:
guest DT fragments (e.g., `iommus`) are added via `libxl`.
**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized hardware.
**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
guest Device Tree (DT) fragments.
6. Observation:
---------------
Introducing optional per-guest enabled features (the `viommu` argument
in the xl guest config) means some guests may opt out.
**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues. Additional audits on emulation
paths and domain interference need to be performed in a multi-guest
environment.
7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial
guest may issue a high volume of commands in rapid succession.
**Risk:**
Excessive command requests can cause high hypervisor CPU consumption
and disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.
**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, providing
basic guest rate limiting.
- Batch multiple commands of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation (see the sketch after this list).
- Implement vIOMMU command execution restart and continuation support.
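A rough sketch of the batching idea above, where consecutive
invalidations for the same ASID collapse into one forwarded hardware
command; the layout and helpers are hypothetical:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    struct vsmmu_cmd {
        uint16_t asid;             /* simplified: TLBI-by-ASID only */
    };

    /* Hypothetical helper: forward one invalidation to the pIOMMU. */
    void psmmu_tlbi_asid(uint16_t asid);

    /* Coalesce runs of invalidations targeting the same ASID so that
     * N identical guest commands cost one hardware command. */
    void vsmmu_flush_tlbi_batch(const struct vsmmu_cmd *cmds, size_t n)
    {
        size_t i = 0;

        while ( i < n )
        {
            uint16_t asid = cmds[i].asid;

            while ( i < n && cmds[i].asid == asid )
                i++;               /* swallow the whole run */
            psmmu_tlbi_asid(asid);
        }
    }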
8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the
pIOMMU command queue (e.g. TLB invalidate). For each pIOMMU, only one
command queue is available for all domains.
**Risk:**
Excessive command requests from an abusive guest can flood the physical
IOMMU command queue, degrading pIOMMU responsiveness to commands issued
by other guests.
**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, providing
basic guest rate limiting.
- Batch commands which should be propagated towards the pIOMMU command
queue and support pausing/continuing batch execution.
- If possible, implement domain penalization by adding a per-domain cost
counter for vIOMMU/pIOMMU usage (see the sketch after this list).
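For the penalization idea, a minimal sketch of a per-domain cost
counter; entirely illustrative:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct viommu_account {
        uint64_t spent;      /* cost charged in the current period */
        uint64_t budget;     /* allowed cost per period */
    };

    /* Charge a command's cost; refuse (defer) once the domain is
     * over budget for the current accounting period. */
    bool viommu_charge(struct viommu_account *acc, uint64_t cost)
    {
        if ( acc->spent + cost > acc->budget )
            return false;
        acc->spent += cost;
        return true;
    }

    /* Called once per accounting period, e.g. from a periodic timer. */
    void viommu_replenish(struct viommu_account *acc)
    {
        acc->spent = 0;
    }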
9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors). A malicious guest can misconfigure its SMMU state or
intentionally trigger faults at high frequency.
**Risk:**
A high frequency of IOMMU events can cause Xen to flood the event queue
and disrupt scheduling through high hypervisor CPU load for event
handling.
**Mitigation:**
- Implement a fail-safe state by disabling event forwarding when faults
occur at high frequency and are not processed by the guest (see the
sketch after this list).
- Batch multiple events of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests.
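A sketch of such a fail-safe; the threshold, time source and names are
illustrative:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define EVT_RATE_LIMIT 256             /* max events per window */
    #define EVT_WINDOW_NS  1000000000ULL   /* 1s accounting window */

    struct vsmmu_evtq {
        uint64_t window_start;   /* start of current window (ns) */
        unsigned int count;      /* events forwarded this window */
        bool disabled;           /* fail-safe tripped */
    };

    /* Returns true if the event may be injected into the guest's
     * event queue; trips the fail-safe on a fault flood. */
    bool vsmmu_forward_event(struct vsmmu_evtq *q, uint64_t now_ns)
    {
        if ( q->disabled )
            return false;        /* guest must drain and re-arm */

        if ( now_ns - q->window_start > EVT_WINDOW_NS )
        {
            q->window_start = now_ns;      /* new window */
            q->count = 0;
        }

        if ( ++q->count > EVT_RATE_LIMIT )
        {
            q->disabled = true;  /* flood detected */
            return false;
        }
        return true;
    }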
Performance Impact
==================
With IOMMU Stage-1 and nested translation included, performance
overhead is introduced compared to the existing, Stage-2-only usage in
Xen. Once mappings are established, translations should not introduce
significant overhead.
Emulated paths may introduce moderate overhead, primarily affecting
device initialization and event handling.
The performance impact highly depends on the target CPU capabilities.
Testing was performed on a Cortex-A53 based platform.
Performance is mostly impacted by emulated vIOMMU operations; results
are shown in the following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation | Execution time in guest |
+===============================+=================================+
| Reg read | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB | median: 90μs, worst-case: 1ms+ |
+-------------------------------+---------------------------------+
| Invalidate STE | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+
With the vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure Stage-1 mappings for devices attached to it.
The following table shows the initialization stages which impact
Stage-1-enabled guest boot time and compares them with a Stage-1
disabled guest.
NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it a
suitable candidate for benchmarking Stage-1 vIOMMU behavior.
+---------------------+-----------------------+------------------------+
| Stage | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init | ~25ms | / |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms | ~200ms |
+---------------------+-----------------------+------------------------+
For devices configured with dynamic DMA mappings, the performance of DMA
allocate/map/unmap operations is also impacted on Stage-1 enabled
guests. A dynamic DMA mapping operation triggers emulated IOMMU accesses
such as MMIO reads/writes and TLB invalidations.
As a reference, the following table shows performance results for
runtime DMA operations on a virtio-gpu device.
+---------------+-------------------------+----------------------------+
| DMA Op | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+===============+=========================+============================+
| dma_alloc | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+
Testing
============
- QEMU-based ARM system tests for Stage-1 translation and nested
virtualization.
- Actual hardware validation on platforms such as Renesas to ensure
compatibility with real SMMUv3 implementations.
- Unit/functional tests validating correct translations (not yet implemented).
Migration and Compatibility
===========================
This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.
References
==========
- Original feature implemented by Rahul Singh:
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.si...@arm.com/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns