Re: [RFC] iommu/vt-d: Use the SIRTP when enabling remapping
On 2014-09-09 23:11, Nathan Zimmer wrote:

The previous change (iommu/vt-d: Don't store SIRTP request) to this area caused a crash in our simulator. In particular it seems that by the time a UART interrupt is sent through the system, interrupt remapping has not yet been enabled. So the interrupt does not get translated to a logical interrupt and crashes. OR'ing in the SIRTP request makes sure it is seen but hopefully not sticky. This seems like a clean fix, at least on our simulator; if you don't agree, our simulator guy will take a closer look at our iommu model.

Check the VT-d spec (6.7, "Set Interrupt Remapping Table Pointer Operation"):

"Software must always follow the interrupt-remapping table pointer set operation with a global invalidate of the IEC to ensure hardware references the new structures *before* enabling interrupt remapping."

There is also this in the command register description:

"If multiple control fields in this register need to be modified, software must serialize the modifications through multiple writes to this register."

This, in combination with the command register's description of bits 24 and 25, strongly suggests that your model is broken.

Found testing on our simulator, not real hardware.

Funnily, the original issue was found by the QEMU model of VT-d, which used to take the last cited sentence very strictly (I asked to remove it due to the preexisting Linux versions).

Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v6 15/20] vfio/platform: support for maskable and automasked interrupts
On Tue, Sep 02, 2014 at 06:06:17PM +0200, Antonios Motakis wrote: On Sun, Jun 8, 2014 at 12:17 PM, Christoffer Dall christoffer.d...@linaro.org wrote: On Thu, Jun 05, 2014 at 07:03:23PM +0200, Antonios Motakis wrote:

Adds support to mask interrupts, and also for automasked interrupts. Level sensitive interrupts are exposed as automasked interrupts and are masked and disabled automatically when they fire.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_irq.c | 112 --
 drivers/vfio/platform/vfio_platform_private.h | 2 +
 2 files changed, 109 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c
index d79f5af..10dfbf0 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -51,9 +51,17 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev)
 		if (hwirq < 0)
 			goto err;

-		vdev->irq[i].flags = VFIO_IRQ_INFO_EVENTFD;
+		spin_lock_init(&vdev->irq[i].lock);
+
+		vdev->irq[i].flags = VFIO_IRQ_INFO_EVENTFD
+					| VFIO_IRQ_INFO_MASKABLE;
+
+		if (irq_get_trigger_type(hwirq) & IRQ_TYPE_LEVEL_MASK)
+			vdev->irq[i].flags |= VFIO_IRQ_INFO_AUTOMASKED;

This seems to rely on the fact that you had actually loaded a driver for your device to set the right type. Is this assumption always correct? It seems to me that this configuration bit should now be up to your user space driver, which is the best candidate to know such details about your device at this point.

Hm, I see this type being set usually either in a device tree source, or in the support code for a specific platform. Are there any situations where this is actually set by the driver? If I understand right this is not the case for the PL330, but if it is possible that it is the case for another device then I need to rethink this. Though as far as I understand this should not be the case.
Wow, it has been an incredibly long time since I looked at this code, so I'm not sure I remember my original reasoning anymore. However, while device properties are set in the DT, they would only be available to this code if you actually loaded a device driver for that device, right? I'm just not sure that assumption always holds for devices used by VFIO, but I'm really not sure anymore. Maybe I'm rambling.

+		vdev->irq[i].count = 1;
 		vdev->irq[i].hwirq = hwirq;
+		vdev->irq[i].masked = false;
 	}

 	vdev->num_irqs = cnt;

@@ -77,11 +85,27 @@ void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)

 static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
 {
-	struct eventfd_ctx *trigger = dev_id;
+	struct vfio_platform_irq *irq_ctx = dev_id;
+	unsigned long flags;
+	int ret = IRQ_NONE;
+
+	spin_lock_irqsave(&irq_ctx->lock, flags);
+
+	if (!irq_ctx->masked) {
+		ret = IRQ_HANDLED;
+
+		if (irq_ctx->flags & VFIO_IRQ_INFO_AUTOMASKED) {
+			disable_irq_nosync(irq_ctx->hwirq);
+			irq_ctx->masked = true;
+		}
+	}

-	eventfd_signal(trigger, 1);
+	spin_unlock_irqrestore(&irq_ctx->lock, flags);

-	return IRQ_HANDLED;
+	if (ret == IRQ_HANDLED)
+		eventfd_signal(irq_ctx->trigger, 1);
+
+	return ret;
 }

 static int vfio_set_trigger(struct vfio_platform_device *vdev,

@@ -162,6 +186,82 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
 	return -EFAULT;
 }

+static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
+					unsigned index, unsigned start,
+					unsigned count, uint32_t flags, void *data)
+{
+	uint8_t arr;

arr? arr for array! As in, the VFIO API allows an array of IRQs. However for platform devices we don't use this; each IRQ is exposed independently as an array of 1 IRQ. But it's not an array. If it contains IRQs, call it irqs.
Unless this is referring specifically to a field *named* array. I don't remember the API at present, but reading the code along, it didn't make sense to me to have a uint8_t called arr, and code should make as much sense independently as possible. This reminds me of people writing code like: String str; yuck.

+	if (start != 0 || count != 1)
+		return -EINVAL;
+
+	switch (flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
+	case VFIO_IRQ_SET_DATA_BOOL:
+		if (copy_from_user(&arr, data, sizeof(uint8_t)))
+			return -EFAULT;
+
+		if (arr != 0x1)
+			return
Re: [RFC PATCH v2 3/7] iommu: add new iommu_ops callback for adding an OF device
Hello,

On 2014-09-02 19:56, Will Deacon wrote:

This patch adds a new function to the iommu_ops structure to allow an OF device to be added to a specific IOMMU instance using the recently merged generic devicetree binding for IOMMUs. The callback (of_xlate) takes a struct device representing the master and an of_phandle_args representing the IOMMU and the corresponding IDs for the new master.

Signed-off-by: Will Deacon will.dea...@arm.com
---
 include/linux/iommu.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index fdddb14cd8f5..3e766b85daa3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -21,6 +21,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
+#include <linux/of.h>
 #include <linux/types.h>
 #include <trace/events/iommu.h>
@@ -136,6 +137,10 @@ struct iommu_ops {
 	/* Get the number of windows per domain */
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
+#ifdef CONFIG_OF_IOMMU
+	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);

If I understand correctly how it is designed to work, then it should be:

	struct iommu_data *(*of_xlate)(struct device *dev, struct of_phandle_args *args);

+#endif
+
 	unsigned long pgsize_bitmap;
 };

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Re: [RFC PATCH v2 3/7] iommu: add new iommu_ops callback for adding an OF device
On Wed, Sep 10, 2014 at 12:16:06PM +0100, Marek Szyprowski wrote: On 2014-09-02 19:56, Will Deacon wrote:

This patch adds a new function to the iommu_ops structure to allow an OF device to be added to a specific IOMMU instance using the recently merged generic devicetree binding for IOMMUs. The callback (of_xlate) takes a struct device representing the master and an of_phandle_args representing the IOMMU and the corresponding IDs for the new master.

Signed-off-by: Will Deacon will.dea...@arm.com
---
 include/linux/iommu.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index fdddb14cd8f5..3e766b85daa3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -21,6 +21,7 @@
 #include <linux/errno.h>
 #include <linux/err.h>
+#include <linux/of.h>
 #include <linux/types.h>
 #include <trace/events/iommu.h>
@@ -136,6 +137,10 @@ struct iommu_ops {
 	/* Get the number of windows per domain */
 	u32 (*domain_get_windows)(struct iommu_domain *domain);
+#ifdef CONFIG_OF_IOMMU
+	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);

If I understand correctly how it is designed to work, then it should be:

	struct iommu_data *(*of_xlate)(struct device *dev, struct of_phandle_args *args);

Yeah, that might be cleaner than the of_iommu_get_data call in of_dma_configure. I'm currently cooking a v3, so I'll include this change.

Will
Re: [RFC PATCH v2 3/7] iommu: add new iommu_ops callback for adding an OF device
On Wed, Sep 10, 2014 at 12:22:13PM +0100, Will Deacon wrote: On Wed, Sep 10, 2014 at 12:16:06PM +0100, Marek Szyprowski wrote: On 2014-09-02 19:56, Will Deacon wrote:

+#ifdef CONFIG_OF_IOMMU
+	int (*of_xlate)(struct device *dev, struct of_phandle_args *args);

If I understand correctly how it is designed to work, then it should be:

	struct iommu_data *(*of_xlate)(struct device *dev, struct of_phandle_args *args);

Yeah, that might be cleaner than the of_iommu_get_data call in of_dma_configure. I'm currently cooking a v3, so I'll include this change.

Actually, I spoke too soon. If we make of_xlate return the iommu_data, then we have to continue getting the iommu_ops from the bus_type, otherwise we have no way to call the right of_xlate. So I'd rather leave of_dma_configure using of_iommu_get_data (I'm currently ripping out the bus code in there) so that we can embed the iommu_ops in the iommu_data instead.

Will
Re: [Xen-devel] [PATCH v1 04/21] x86/xen/MSI: Eliminate arch_msix_mask_irq() and arch_msi_mask_irq()
On 05/09/14 11:09, Yijing Wang wrote:

Commit 0e4ccb150 added two __weak arch functions, arch_msix_mask_irq() and arch_msi_mask_irq(), to fix a bug found when running Xen on x86. Introducing these two functions made the MSI code complex. Also, mask/unmask are irq actions belonging to the interrupt controller, and weak arch functions should not be used to override the mask/unmask interfaces. This patch reverts commit 0e4ccb150, exports struct irq_chip msi_chip, and modifies msi_chip->irq_mask()/irq_unmask() in the Xen init functions to fix this bug, for simplicity. This is also preparation for using struct msi_chip instead of weak arch MSI functions on all platforms.

Acked-by: David Vrabel david.vra...@citrix.com

But I wonder if it would be better for the Xen subsystem to provide its own struct irq_chip instead of adjusting the fields in the generic x86 one.

David
Re: [Xen-devel] [PATCH v1 08/21] x86/xen/MSI: Use MSI chip framework to configure MSI/MSI-X irq
On 09/09/14 03:06, Yijing Wang wrote: On 2014/9/5 22:29, David Vrabel wrote: On 05/09/14 11:09, Yijing Wang wrote:

Use the MSI chip framework instead of arch MSI functions to configure MSI/MSI-X irqs, so we can manage MSI/MSI-X irqs in a unified framework.

[...]

--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
[...]
@@ -418,9 +430,9 @@ int __init pci_xen_init(void)
 #endif
 #ifdef CONFIG_PCI_MSI
-	x86_msi.setup_msi_irqs = xen_setup_msi_irqs;
-	x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
-	x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs;
+	xen_msi_chip.setup_irqs = xen_setup_msi_irqs;
+	xen_msi_chip.teardown_irqs = xen_teardown_msi_irqs;
+	x86_msi_chip = &xen_msi_chip;
 	msi_chip.irq_mask = xen_nop_msi_mask;
 	msi_chip.irq_unmask = xen_nop_msi_mask;

Why have these not been changed to set the x86_msi_chip.mask/unmask fields instead?

Hi David, x86_msi_chip here is of struct msi_chip type, used to configure MSI/MSI-X irqs. msi_chip above is of struct irq_chip type, and represents the MSI irq controller. They are not the same object. Their names easily confuse people.

Ok, it all makes sense now.

Acked-by: David Vrabel david.vra...@citrix.com

David
Re: [RFC PATCH v2 4/7] iommu: provide helper function to configure an IOMMU for an of master
Hi Will,

Thank you for the patch.

On Tuesday 02 September 2014 18:56:24 Will Deacon wrote:

The generic IOMMU device-tree bindings can be used to add arbitrary OF masters to an IOMMU with a compliant binding. This patch introduces of_iommu_configure, which does exactly that.

Signed-off-by: Will Deacon will.dea...@arm.com
---
 drivers/iommu/Kconfig       |  2 +-
 drivers/iommu/of_iommu.c    | 52 ++
 include/linux/dma-mapping.h |  7 ++
 include/linux/of_iommu.h    |  8 +++
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dd5112265cc9..6d13f962f156 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -15,7 +15,7 @@ if IOMMU_SUPPORT
 config OF_IOMMU
 	def_bool y
-	depends on OF
+	depends on OF && IOMMU_API
 config FSL_PAMU
 	bool "Freescale IOMMU support"
diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index f9209666157c..1188c929ffa7 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -18,6 +18,7 @@
  */
 #include <linux/export.h>
+#include <linux/iommu.h>
 #include <linux/limits.h>
 #include <linux/of.h>
 #include <linux/of_iommu.h>
@@ -90,6 +91,57 @@ int of_get_dma_window(struct device_node *dn, const char *prefix, int index,
 }
 EXPORT_SYMBOL_GPL(of_get_dma_window);
+struct iommu_dma_mapping *of_iommu_configure(struct device *dev)
+{
+	struct of_phandle_args iommu_spec;
+	struct iommu_dma_mapping *mapping;
+	struct bus_type *bus = dev->bus;
+	const struct iommu_ops *ops = bus->iommu_ops;
+	struct iommu_data *iommu = NULL;
+	int idx = 0;
+
+	if (!iommu_present(bus) || !ops->of_xlate)
+		return NULL;
+
+	/*
+	 * We don't currently walk up the tree looking for a parent IOMMU.
+	 * See the `Notes:' section of
+	 * Documentation/devicetree/bindings/iommu/iommu.txt
+	 */
+	while (!of_parse_phandle_with_args(dev->of_node, "iommus",
+					   "#iommu-cells", idx,
+					   &iommu_spec)) {
+		struct device_node *np = iommu_spec.np;
+		struct iommu_data *data = of_iommu_get_data(np);
+
+		if (!iommu) {
+			if (!ops->of_xlate(dev, &iommu_spec))
+				iommu = data;

If I understand the code correctly, this will call of_xlate for the first IOMMU reference only and silently ignore all subsequent references if they point to the same IOMMU device but with different stream/requester IDs. Is that the desired behaviour?

+		} else if (iommu != data) {
+			pr_warn("Rejecting device %s with multiple IOMMU instances\n",
+				dev_name(dev));
+			iommu = NULL;
+		}
+
+		of_node_put(np);
+
+		if (!iommu)
+			break;
+
+		idx++;
+	}
+
+	if (!iommu)
+		return NULL;
+
+	mapping = devm_kzalloc(dev, sizeof(*mapping), GFP_KERNEL);
+	if (!mapping)
+		return NULL;
+
+	mapping->iommu = iommu;
+	return mapping;
+}
+
 void __init of_iommu_init(void)
 {
 	struct device_node *np;
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0f7f7b68b0db..16bf92f71c3e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -62,6 +62,13 @@ struct dma_map_ops {
 	int is_phys;
 };
+struct iommu_data;
+
+struct iommu_dma_mapping {
+	struct iommu_data *iommu;
+	struct list_head node;
+};
+
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))
 #define DMA_MASK_NONE	0x0ULL
diff --git a/include/linux/of_iommu.h b/include/linux/of_iommu.h
index a39e2d78f735..f2d3b1e780d7 100644
--- a/include/linux/of_iommu.h
+++ b/include/linux/of_iommu.h
@@ -1,9 +1,12 @@
 #ifndef __OF_IOMMU_H
 #define __OF_IOMMU_H
+#include <linux/device.h>
 #include <linux/iommu.h>
 #include <linux/of.h>
+struct iommu_dma_mapping;
+
 #ifdef CONFIG_OF_IOMMU
 extern int of_get_dma_window(struct device_node *dn, const char *prefix,
@@ -11,6 +14,7 @@ extern int of_get_dma_window(struct device_node *dn, const char *prefix,
 			     size_t *size);
 extern void of_iommu_init(void);
+extern struct iommu_dma_mapping *of_iommu_configure(struct device *dev);
 #else
@@ -22,6 +26,10 @@ static inline int of_get_dma_window(struct device_node *dn, const char *prefix,
 }
 static inline void of_iommu_init(void) { }
+static inline struct iommu_dma_mapping *of_iommu_configure(struct device *dev)
+{
+	return NULL;
+}
 #endif /* CONFIG_OF_IOMMU */

--
Regards,
Laurent Pinchart
Re: [RFC PATCH v2 4/7] iommu: provide helper function to configure an IOMMU for an of master
Hi Laurent,

Cheers for the review.

On Wed, Sep 10, 2014 at 02:01:34PM +0100, Laurent Pinchart wrote: On Tuesday 02 September 2014 18:56:24 Will Deacon wrote:

+struct iommu_dma_mapping *of_iommu_configure(struct device *dev)
+{
+	struct of_phandle_args iommu_spec;
+	struct iommu_dma_mapping *mapping;
+	struct bus_type *bus = dev->bus;
+	const struct iommu_ops *ops = bus->iommu_ops;
+	struct iommu_data *iommu = NULL;
+	int idx = 0;
+
+	if (!iommu_present(bus) || !ops->of_xlate)
+		return NULL;
+
+	/*
+	 * We don't currently walk up the tree looking for a parent IOMMU.
+	 * See the `Notes:' section of
+	 * Documentation/devicetree/bindings/iommu/iommu.txt
+	 */
+	while (!of_parse_phandle_with_args(dev->of_node, "iommus",
+					   "#iommu-cells", idx,
+					   &iommu_spec)) {
+		struct device_node *np = iommu_spec.np;
+		struct iommu_data *data = of_iommu_get_data(np);
+
+		if (!iommu) {
+			if (!ops->of_xlate(dev, &iommu_spec))
+				iommu = data;

If I understand the code correctly, this will call of_xlate for the first IOMMU reference only and silently ignore all subsequent references if they point to the same IOMMU device but with different stream/requester IDs. Is that the desired behaviour?

No; I just fixed that, actually. I'll send a v3 soon which has a bunch of fixes like this.

Will
Re: [RFC] iommu/vt-d: Use the SIRTP when enabling remapping
On 9/10/2014 1:56 AM, Jan Kiszka wrote: On 2014-09-09 23:11, Nathan Zimmer wrote:

The previous change (iommu/vt-d: Don't store SIRTP request) to this area caused a crash in our simulator. In particular it seems that by the time a UART interrupt is sent through the system, interrupt remapping has not yet been enabled. So the interrupt does not get translated to a logical interrupt and crashes. OR'ing in the SIRTP request makes sure it is seen but hopefully not sticky. This seems like a clean fix, at least on our simulator; if you don't agree, our simulator guy will take a closer look at our iommu model.

Check the VT-d spec (6.7, "Set Interrupt Remapping Table Pointer Operation"): "Software must always follow the interrupt-remapping table pointer set operation with a global invalidate of the IEC to ensure hardware references the new structures *before* enabling interrupt remapping." There is also this in the command register description: "If multiple control fields in this register need to be modified, software must serialize the modifications through multiple writes to this register." This, in combination with the command register's description of bits 24 and 25, strongly suggests that your model is broken.

Thanks, Jan. I'll read up on it. Still working out some kinks with the relatively new IOMMU/interrupt-remapping logic which we added over the last 18 months or so. Sorry for any distraction, and thanks for the help/info.

Gary

Found testing on our simulator, not real hardware.

Funnily, the original issue was found by the QEMU model of VT-d, which used to take the last cited sentence very strictly (I asked to remove it due to the preexisting Linux versions).

Jan
Re: [Xen-devel] [PATCH v1 08/21] x86/xen/MSI: Use MSI chip framework to configure MSI/MSI-X irq
On Wed, Sep 10, 2014 at 01:38:25PM +0100, David Vrabel wrote: On 09/09/14 03:06, Yijing Wang wrote: On 2014/9/5 22:29, David Vrabel wrote: On 05/09/14 11:09, Yijing Wang wrote:

Use the MSI chip framework instead of arch MSI functions to configure MSI/MSI-X irqs, so we can manage MSI/MSI-X irqs in a unified framework.

[...]

--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
[...]
@@ -418,9 +430,9 @@ int __init pci_xen_init(void)
 #endif
 #ifdef CONFIG_PCI_MSI
-	x86_msi.setup_msi_irqs = xen_setup_msi_irqs;
-	x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
-	x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs;
+	xen_msi_chip.setup_irqs = xen_setup_msi_irqs;
+	xen_msi_chip.teardown_irqs = xen_teardown_msi_irqs;
+	x86_msi_chip = &xen_msi_chip;
 	msi_chip.irq_mask = xen_nop_msi_mask;
 	msi_chip.irq_unmask = xen_nop_msi_mask;

Why have these not been changed to set the x86_msi_chip.mask/unmask fields instead?

Hi David, x86_msi_chip here is of struct msi_chip type, used to configure MSI/MSI-X irqs. msi_chip above is of struct irq_chip type, and represents the MSI irq controller. They are not the same object. Their names easily confuse people.

Ok, it all makes sense now.

Acked-by: David Vrabel david.vra...@citrix.com

You can also add 'Tested-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com' on the whole series - I ran it through on Xen and on baremetal with a mix of 32/64 builds. Oh, and also Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com for the Xen parts.

David

___
Xen-devel mailing list
xen-de...@lists.xen.org
http://lists.xen.org/xen-devel
Re: [PATCH 1/6] iommu/arm-smmu: add support for specifying clocks
On Wed, Sep 10, 2014 at 02:29:42AM +0100, Mitchel Humpherys wrote: On Tue, Aug 26 2014 at 07:27:58 AM, Will Deacon will.dea...@arm.com wrote: On Tue, Aug 19, 2014 at 08:03:09PM +0100, Olav Haugan wrote:

Clients of the SMMU driver are required to vote for clocks and power when they know they need to use the SMMU. However, the clocks and power that need to be on for the SMMU to service bus masters aren't necessarily the same as the ones needed to read/write registers... See below.

The case I'm thinking of is where a device masters through the IOMMU, but doesn't make use of any translations. In this case, its transactions will bypass the SMMU and I want to ensure that continues to happen, regardless of the power state of the SMMU.

Then I assume the driver for such a device wouldn't be attaching to (or detaching from) the IOMMU, so we won't be touching it at all either way. Or am I missing something?

As long as it's only the register file that gets powered down, then there's no issue. However, that's making assumptions about what these clocks are controlling. Is there a way for the driver to know which aspects of the device are controlled by which clock? What stops these from racing with each other when there are multiple clocks? I also assume that the clk API ignores calls to clk_prepare_enable for a clk that's already enabled? I couldn't find that code...

All the clock APIs are reference counted, yes. Not sure what you mean by racing with each other? When you call to enable a clock, the call does not return until the clock is already ON (or OFF).

I was thinking of an interrupt handler racing with normal code, but actually you balance the clk enable/disable in the interrupt handlers. However, it's not safe to call these clk functions from irq context anyway, since clk_prepare may sleep.

Ah yes. You okay with moving to a threaded IRQ?

A threaded IRQ already makes sense for context interrupts (if anybody has a platform that can do stalls properly), but it seems a bit weird for the global fault handler. Is there no way to fix the race instead?

@@ -2061,12 +2164,16 @@ static int arm_smmu_device_dt_probe(struct platform_device *pdev)
 	spin_unlock(&arm_smmu_devices_lock);

 	arm_smmu_device_reset(smmu);
+	arm_smmu_disable_clocks(smmu);

I wonder if this is really the right thing to do. Rather than the fine-grained clock enable/disable you have, why don't we just enable in domain_init and disable in domain_destroy, with refcounting for the clocks?

So the whole point of all of this is that we try to save power. As Mitch wrote in the commit text, we want to only leave the clock and power on for as short a period of time as possible.

Understood, but if the clocks are going up and down like yo-yos, then it's not obvious that you end up saving any power at all. Have you tried measuring the power consumption with different granularities for the clocks?

This has been profiled extensively and for some use cases it's a huge win. Unfortunately we don't have any numbers for public sharing :( but you can imagine a use case where some multimedia framework maps a bunch of buffers into an SMMU at the beginning of some interactive user session and doesn't unmap them until the (human) user decides they are done. This could be a long time, all the while these clocks could be off, saving power.

Ok, I can see that. I wonder whether we could wrap all of the iommu_ops functions with the clock enable/disable code, instead of putting it directly into the drivers? The code you're proposing seems to take the approach of `we're going to access registers so enable the clocks, access the registers, then disable the clocks', which is simple but may not be particularly effective.

Yes, that's a good summary of the approach here. It has been effective in saving power for us in the past...
Mike, do you have any experience with this sort of stuff?

Will
Re: [PATCH 1/6] iommu/arm-smmu: add support for specifying clocks
On Wed, Sep 10 2014 at 11:27:39 AM, Will Deacon will.dea...@arm.com wrote: On Wed, Sep 10, 2014 at 02:29:42AM +0100, Mitchel Humpherys wrote: On Tue, Aug 26 2014 at 07:27:58 AM, Will Deacon will.dea...@arm.com wrote: On Tue, Aug 19, 2014 at 08:03:09PM +0100, Olav Haugan wrote:

Clients of the SMMU driver are required to vote for clocks and power when they know they need to use the SMMU. However, the clocks and power that need to be on for the SMMU to service bus masters aren't necessarily the same as the ones needed to read/write registers... See below.

The case I'm thinking of is where a device masters through the IOMMU, but doesn't make use of any translations. In this case, its transactions will bypass the SMMU and I want to ensure that continues to happen, regardless of the power state of the SMMU.

Then I assume the driver for such a device wouldn't be attaching to (or detaching from) the IOMMU, so we won't be touching it at all either way. Or am I missing something?

As long as it's only the register file that gets powered down, then there's no issue. However, that's making assumptions about what these clocks are controlling. Is there a way for the driver to know which aspects of the device are controlled by which clock?

Yes, folks should only be putting config clocks here. In our system, at least, the clocks for configuring the SMMU are different from those for using it. Maybe I should make a note in the bindings about what kinds of clocks can be specified here (i.e. only those that can be safely disabled while still allowing translations to occur)?

What stops these from racing with each other when there are multiple clocks? I also assume that the clk API ignores calls to clk_prepare_enable for a clk that's already enabled? I couldn't find that code...

All the clock APIs are reference counted, yes. Not sure what you mean by racing with each other? When you call to enable a clock, the call does not return until the clock is already ON (or OFF).

I was thinking of an interrupt handler racing with normal code, but actually you balance the clk enable/disable in the interrupt handlers. However, it's not safe to call these clk functions from irq context anyway, since clk_prepare may sleep.

Ah yes. You okay with moving to a threaded IRQ?

A threaded IRQ already makes sense for context interrupts (if anybody has a platform that can do stalls properly), but it seems a bit weird for the global fault handler. Is there no way to fix the race instead?

Are you referring to the scenario where someone might be disabling clocks at the same time? This isn't a problem since the clocks are refcounted. I believe the main problem here is calling clk_enable from atomic context, since it might sleep. For my own edification, why would it be weird to move to a threaded IRQ here? We're not really doing any important work here (just printing an informational message), so moving to a threaded IRQ actually seems like the courteous thing to do...

@@ -2061,12 +2164,16 @@ static int arm_smmu_device_dt_probe(struct platform_device *pdev)
 	spin_unlock(&arm_smmu_devices_lock);

 	arm_smmu_device_reset(smmu);
+	arm_smmu_disable_clocks(smmu);

I wonder if this is really the right thing to do. Rather than the fine-grained clock enable/disable you have, why don't we just enable in domain_init and disable in domain_destroy, with refcounting for the clocks?

So the whole point of all of this is that we try to save power. As Mitch wrote in the commit text, we want to only leave the clock and power on for as short a period of time as possible.

Understood, but if the clocks are going up and down like yo-yos, then it's not obvious that you end up saving any power at all. Have you tried measuring the power consumption with different granularities for the clocks?

This has been profiled extensively and for some use cases it's a huge win. Unfortunately we don't have any numbers for public sharing :( but you can imagine a use case where some multimedia framework maps a bunch of buffers into an SMMU at the beginning of some interactive user session and doesn't unmap them until the (human) user decides they are done. This could be a long time, all the while these clocks could be off, saving power.

Ok, I can see that. I wonder whether we could wrap all of the iommu_ops functions with the clock enable/disable code, instead of putting it directly into the drivers?

Interesting idea... I'm up for it if the IOMMU core and driver maintainers are okay with it.

The code you're proposing seems to take the approach of `we're going to access registers so enable the clocks, access the registers, then disable the clocks', which is simple but may not be particularly effective.

Yes, that's a good summary of the approach here. It has been effective in saving power for us in the past...

Mike, do you
Re: [RFC] iommu/vt-d: Use the SIRTP when enabling remapping
I'm not really a kernel developer and don't know the proper etiquette for this, but just wanted to say thanks to Jan for the guidance. With your help, I was able to track our problem down to a much more fundamental problem of an incorrect bit position of the fields in our VT-d GLBCMD register, attributable to our now-manual process of generating register information from Intel socket documentation. (We used to use Intel's SoftSDV to determine register fields, but that's no longer available and we have to use fallible human eyes! Thousands of registers in each new socket...)

Gary

On 09/10/2014 02:58 AM, Gary Kroening wrote: On 9/10/2014 1:56 AM, Jan Kiszka wrote: On 2014-09-09 23:11, Nathan Zimmer wrote:

The previous change (iommu/vt-d: Don't store SIRTP request) to this area caused a crash in our simulator. In particular it seems that by the time a UART interrupt is sent through the system, interrupt remapping has not yet been enabled. So the interrupt does not get translated to a logical interrupt and crashes. OR'ing in the SIRTP request makes sure it is seen but hopefully not sticky. This seems like a clean fix, at least on our simulator; if you don't agree, our simulator guy will take a closer look at our iommu model.

Check the VT-d spec (6.7, "Set Interrupt Remapping Table Pointer Operation"): "Software must always follow the interrupt-remapping table pointer set operation with a global invalidate of the IEC to ensure hardware references the new structures *before* enabling interrupt remapping." There is also this in the command register description: "If multiple control fields in this register need to be modified, software must serialize the modifications through multiple writes to this register." This, in combination with the command register's description of bits 24 and 25, strongly suggests that your model is broken.

Thanks, Jan. I'll read up on it. Still working out some kinks with the relatively new IOMMU/interrupt-remapping logic which we added over the last 18 months or so. Sorry for any distraction, and thanks for the help/info.

Gary

Found testing on our simulator, not real hardware.

Funnily, the original issue was found by the QEMU model of VT-d, which used to take the last cited sentence very strictly (I asked to remove it due to the preexisting Linux versions).

Jan
Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
On Tue, 9 Sep 2014 17:43:51 +0200 Joerg Roedel j...@8bytes.org wrote: here is a patch-set to extend the mmu_notifiers in the Linux kernel to allow managing CPU external TLBs. Those TLBs may be implemented in IOMMUs or any other external device, e.g. ATS/PRI capable PCI devices. The problem with managing these TLBs is the semantics of the invalidate_range_start/end call-backs currently available. Currently the subsystem using mmu_notifiers has to guarantee that no new TLB entries are established between invalidate_range_start/end. Furthermore the invalidate_range_start() function is called when all pages are still mapped and invalidate_range_end() when the pages are unmapped and already freed. So both call-backs can't be used to safely flush any non-CPU TLB because _start() is called too early and _end() too late. There's a lot of missing information here. Why don't the existing callbacks suit non-CPU TLBs? What is different about them? Please update the changelog to contain all this context. In the AMD IOMMUv2 driver this is currently implemented by assigning an empty page-table to the external device between _start() and _end(). But as tests have shown this doesn't work as external devices don't re-fault infinitely but enter a failure state after some time. More missing info. Why are these faults occurring? Is there some device activity which is trying to fault in pages, but the CPU is executing code between _start() and _end() so the driver must refuse to instantiate a page to satisfy the fault? That's just a guess, and I shouldn't be guessing. Please update the changelog to fully describe the dynamic activity which is causing this. Next problem with this solution is that it causes an interrupt storm for IO page faults to be handled when an empty page-table is assigned. Also too skimpy. I *think* this is a variant of the problem in the preceding paragraph. We get a fault storm (which is problem 2) and sometimes the faulting device gives up (which is problem 1). 
Or something. Please de-fog all of this. Furthermore the _start()/end() notifiers only catch the moment when page mappings are released, but not page-table pages. But this is necessary for managing external TLBs when the page-table is shared with the CPU. How come? To solve this situation I wrote a patch-set to introduce a new notifier call-back: mmu_notifier_invalidate_range(). This notifier lifts the strict requirements that no new references are taken in the range between _start() and _end(). When the subsystem can't guarantee that no new references are taken, it has to provide the invalidate_range() call-back to clear any new references in there. It is called between invalidate_range_start() and _end() every time the VMM has to wipe out any references to a couple of pages. These are usually the places where the CPU TLBs are flushed too and where it's important that this happens before invalidate_range_end() is called. Any comments and review appreciated! The patchset looks decent, although I find it hard to review because I just wasn't provided with enough of the thinking that went into it. I have enough info to look at the C code, but not enough info to identify and evaluate alternative implementation approaches, to identify possible future extensions, etc. The patchset does appear to add significant additional overhead to hot code paths when mm_has_notifiers(mm). Please let's update the changelog to address this rather important concern. How significant is the impact on such mm's, how common are such mm's now and in the future, should we (for example) look at short-circuiting __mmu_notifier_invalidate_range() if none of the registered notifiers implement ->invalidate_range(), etc.
Re: [PATCH 0/3 v3] mmu_notifier: Allow to manage CPU external TLBs
On Wed, Sep 10, 2014 at 03:01:25PM -0700, Andrew Morton wrote: On Tue, 9 Sep 2014 17:43:51 +0200 Joerg Roedel j...@8bytes.org wrote: here is a patch-set to extend the mmu_notifiers in the Linux kernel to allow managing CPU external TLBs. Those TLBs may be implemented in IOMMUs or any other external device, e.g. ATS/PRI capable PCI devices. The problem with managing these TLBs is the semantics of the invalidate_range_start/end call-backs currently available. Currently the subsystem using mmu_notifiers has to guarantee that no new TLB entries are established between invalidate_range_start/end. Furthermore the invalidate_range_start() function is called when all pages are still mapped and invalidate_range_end() when the pages are unmapped and already freed. So both call-backs can't be used to safely flush any non-CPU TLB because _start() is called too early and _end() too late. There's a lot of missing information here. Why don't the existing callbacks suit non-CPU TLBs? What is different about them? Please update the changelog to contain all this context. So unlike KVM or any current user of the mmu_notifier API, any PCIe ATS/PASID-capable hardware (IOMMUv2 for AMD; I forget the marketing name for Intel, probably VT-d) walks the CPU page table on its own and has its own TLB cache. In fact, not only the IOMMU can have a TLB cache; any single PCIe device can implement its own local TLB cache. So if we flush the IOMMU and device TLB inside the range_start callback there is a chance that the hw will just rewalk the CPU page table and repopulate its TLB before the CPU page table is actually updated. Now if we shoot down the TLB inside the range_end callback, then we are too late, i.e. the CPU page table is already populated with the new entry and all the TLBs in the IOMMU and in devices might be pointing to the old pages. So the aim of this callback is to happen right after the CPU page table is updated but before the old page is freed or recycled. 
Note that it is also safe for COW and other transitions, like from read-only to read-write or the other way around. In the AMD IOMMUv2 driver this is currently implemented by assigning an empty page-table to the external device between _start() and _end(). But as tests have shown this doesn't work as external devices don't re-fault infinitely but enter a failure state after some time. More missing info. Why are these faults occurring? Is there some device activity which is trying to fault in pages, but the CPU is executing code between _start() and _end() so the driver must refuse to instantiate a page to satisfy the fault? That's just a guess, and I shouldn't be guessing. Please update the changelog to fully describe the dynamic activity which is causing this. The hack that was used prior to this patch was to point the IOMMU to an empty page table (a zero page) inside the range_start() callback and shoot down the TLB, but this meant that the device might enter a storm of page faults. GPUs can have thousands of threads, and because the empty page table is in use during invalidation, they all start triggering page faults even if they were not trying to access the range being invalidated. It turns out that when this happens, current hw like AMD GPUs actually stops working after a while, i.e. the hw stumbles because there are too many faults going on. Next problem with this solution is that it causes an interrupt storm for IO page faults to be handled when an empty page-table is assigned. Also too skimpy. I *think* this is a variant of the problem in the preceding paragraph. We get a fault storm (which is problem 2) and sometimes the faulting device gives up (which is problem 1). Or something. Please de-fog all of this. Does the above explanation help to understand the issue? Note that on each device page fault an IRQ is triggered (well, the way the hw works is a bit more complex than that). 
Furthermore the _start()/end() notifiers only catch the moment when page mappings are released, but not page-table pages. But this is necessary for managing external TLBs when the page-table is shared with the CPU. How come? As explained above, end() might happen after pages that were previously mapped are freed or recycled. To solve this situation I wrote a patch-set to introduce a new notifier call-back: mmu_notifier_invalidate_range(). This notifier lifts the strict requirements that no new references are taken in the range between _start() and _end(). When the subsystem can't guarantee that no new references are taken, it has to provide the invalidate_range() call-back to clear any new references in there. It is called between invalidate_range_start() and _end() every time the VMM has to wipe out any references to a couple of pages. These are usually the places where the CPU TLBs are flushed too and where it's important that this happens before invalidate_range_end() is called.
Re: [Xen-devel] [PATCH v1 04/21] x86/xen/MSI: Eliminate arch_msix_mask_irq() and arch_msi_mask_irq()
On 2014/9/10 20:36, David Vrabel wrote: On 05/09/14 11:09, Yijing Wang wrote: Commit 0e4ccb150 added two __weak arch functions arch_msix_mask_irq() and arch_msi_mask_irq() to fix a bug found when running Xen on x86. Introducing these two functions made the MSI code complex. Mask/unmask are irq actions related to the interrupt controller, and weak arch functions should not be used to override the mask/unmask interfaces. This patch reverts commit 0e4ccb150, exports struct irq_chip msi_chip, and modifies msi_chip->irq_mask/irq_unmask() in the Xen init functions to fix this bug for simplicity. This is also preparation for using struct msi_chip instead of weak arch MSI functions on all platforms. Acked-by: David Vrabel david.vra...@citrix.com But I wonder if it would be better for the Xen subsystem to provide its own struct irq_chip instead of adjusting the fields in the generic x86 one. Thanks! Currently, Xen and the bare x86 system differ only in irq_chip->irq_mask/irq_unmask. So I chose to override those two ops of the bare x86 irq_chip in Xen. Konrad Rzeszutek Wilk has tested it OK on his platform, so I think we could give Xen its own irq_chip later if the difference becomes large. Thanks! Yijing. David . -- Thanks! Yijing
Re: [Xen-devel] [PATCH v1 08/21] x86/xen/MSI: Use MSI chip framework to configure MSI/MSI-X irq
On 2014/9/10 20:38, David Vrabel wrote: On 09/09/14 03:06, Yijing Wang wrote: On 2014/9/5 22:29, David Vrabel wrote: On 05/09/14 11:09, Yijing Wang wrote: Use MSI chip framework instead of arch MSI functions to configure MSI/MSI-X irq. So we can manage MSI/MSI-X irq in a unified framework. [...] --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c [...] @@ -418,9 +430,9 @@ int __init pci_xen_init(void) #endif #ifdef CONFIG_PCI_MSI - x86_msi.setup_msi_irqs = xen_setup_msi_irqs; - x86_msi.teardown_msi_irq = xen_teardown_msi_irq; - x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs; + xen_msi_chip.setup_irqs = xen_setup_msi_irqs; + xen_msi_chip.teardown_irqs = xen_teardown_msi_irqs; + x86_msi_chip = xen_msi_chip; msi_chip.irq_mask = xen_nop_msi_mask; msi_chip.irq_unmask = xen_nop_msi_mask; Why have these not been changed to set the x86_msi_chip.mask/unmask fields instead? Hi David, x86_msi_chip here is struct msi_chip data type, used to configure MSI/MSI-X irq. msi_chip above is struct irq_chip data type, representing the MSI irq controller. They are not the same object. Their names easily confuse people. Ok, it all makes sense now. Acked-by: David Vrabel david.vra...@citrix.com Thanks! David . -- Thanks! Yijing
Re: [Xen-devel] [PATCH v1 08/21] x86/xen/MSI: Use MSI chip framework to configure MSI/MSI-X irq
On 2014/9/10 22:59, Konrad Rzeszutek Wilk wrote: On Wed, Sep 10, 2014 at 01:38:25PM +0100, David Vrabel wrote: On 09/09/14 03:06, Yijing Wang wrote: On 2014/9/5 22:29, David Vrabel wrote: On 05/09/14 11:09, Yijing Wang wrote: Use MSI chip framework instead of arch MSI functions to configure MSI/MSI-X irq. So we can manage MSI/MSI-X irq in a unified framework. [...] --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c [...] @@ -418,9 +430,9 @@ int __init pci_xen_init(void) #endif #ifdef CONFIG_PCI_MSI - x86_msi.setup_msi_irqs = xen_setup_msi_irqs; - x86_msi.teardown_msi_irq = xen_teardown_msi_irq; - x86_msi.teardown_msi_irqs = xen_teardown_msi_irqs; + xen_msi_chip.setup_irqs = xen_setup_msi_irqs; + xen_msi_chip.teardown_irqs = xen_teardown_msi_irqs; + x86_msi_chip = xen_msi_chip; msi_chip.irq_mask = xen_nop_msi_mask; msi_chip.irq_unmask = xen_nop_msi_mask; Why have these not been changed to set the x86_msi_chip.mask/unmask fields instead? Hi David, x86_msi_chip here is struct msi_chip data type, used to configure MSI/MSI-X irq. msi_chip above is struct irq_chip data type, representing the MSI irq controller. They are not the same object. Their names easily confuse people. Ok, it all makes sense now. Acked-by: David Vrabel david.vra...@citrix.com You can also add 'Tested-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com' on the whole series - I ran it through on Xen and on baremetal with a mix of 32/64 builds. Oh, and also Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com for the Xen parts. Thanks very much for your test and review! Thanks! Yijing. David ___ Xen-devel mailing list xen-de...@lists.xen.org http://lists.xen.org/xen-devel . -- Thanks! Yijing