[RFC PATCH 0/2] Fix rockchip IOMMU driver vs PM issues
On Rockchip SoCs, the IOMMUs are located inside the same power domains as IP blocks using them. This means that as soon as we runtime suspend such IP block, causing the power domain to be powered off, the IOMMU would also be powered off, losing all its state. This means that whenever the power domain is being powered off, the IOMMU driver needs to be able to deinitialize the hardware and whenever the domain is being powered on, it needs to restore all the state, so the consumer device is able to perform memory transactions. The solution proposed here revives the idea of PM domain notifiers submitted originally by Samsung's Sylwester Nawrocki and Marek Szyprowski in [1]. The main benefit of this idea that it is very simple, adding just 84 lines of code, but effective and also useful for other purposes, what can be seen in thread [2] and [3]. Moreover, it lets us avoid adding device specific code (in this case specific to single IOMMU type) to power domain drivers, which can be used for more devices than just IOMMU and which may vary between platforms using the same IOMMU driver. The IOMMU-specific part of the solution simply uses the pre-power-off and post-power-on notifications to do the necessary setup without the need of any action from inside drivers of devices behind the IOMMU, so all the code specific to this particular IOMMU type stays outside other drivers that should not rely on being run with a particular type of IOMMU (and often even SoC). Tested with a custom board based on RK3288 SoC. [1] http://thread.gmane.org/gmane.linux.kernel.samsung-soc/36079/focus=346956 [2] http://thread.gmane.org/gmane.linux.power-management.general/51993/focus=39158 [3] http://marc.info/?l=linux-pmm=136448756308203w=2 Sylwester Nawrocki (1): pm: Add PM domain notifications Tomasz Figa (1): iommu: rockchip: Handle system-wide and runtime PM Documentation/power/notifiers.txt | 14 +++ drivers/base/power/domain.c | 50 + drivers/iommu/rockchip-iommu.c| 208 +++--- include/linux/pm_domain.h | 20 4 files changed, 256 insertions(+), 36 deletions(-) -- 2.2.0.rc0.207.ga3a616c ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 1/2] pm: Add PM domain notifications
From: Sylwester Nawrocki s.nawro...@samsung.com This patch adds notifiers to the runtime PM/genpd subsystem. It is now possible to register a notifier, which will be called before and after the generic power domain subsystem calls the power domain's power_on and power_off callbacks. Signed-off-by: Sylwester Nawrocki s.nawro...@samsung.com [tf...@chromium.org: rebased] Signed-off-by: Tomasz Figa tf...@chromium.org --- Documentation/power/notifiers.txt | 14 +++ drivers/base/power/domain.c | 50 +++ include/linux/pm_domain.h | 20 3 files changed, 84 insertions(+) diff --git a/Documentation/power/notifiers.txt b/Documentation/power/notifiers.txt index a81fa25..62303f6 100644 --- a/Documentation/power/notifiers.txt +++ b/Documentation/power/notifiers.txt @@ -53,3 +53,17 @@ NULL). To register and/or unregister a suspend notifier use the functions register_pm_notifier() and unregister_pm_notifier(), respectively, defined in include/linux/suspend.h . If you don't need to unregister the notifier, you can also use the pm_notifier() macro defined in include/linux/suspend.h . + +Power Domain notifiers +-- + +The power domain notifiers allow subsystems or drivers to register for power +domain on/off notifications, should they need to perform any actions right +before or right after the power domain on/off. The device must be already +added to a power domain before its subsystem or driver registers the notifier. +Following events are supported: + +PM_GENPD_POWER_ON_PREPARE The power domain is about to turn on. +PM_GENPD_POST_POWER_ON The power domain has just turned on. +PM_GENPD_POWER_OFF_PREPARE The power domain is about to turn off. +PM_GENPD_POST_POWER_OFFThe power domain has just turned off. diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c index 6a103a3..0d6f84e 100644 --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -68,6 +68,45 @@ static struct generic_pm_domain *pm_genpd_lookup_name(const char *domain_name) return genpd; } +int pm_genpd_register_notifier(struct device *dev, struct notifier_block *nb) +{ + struct pm_domain_data *pdd; + int ret = -EINVAL; + + spin_lock_irq(dev-power.lock); + if (dev-power.subsys_data) { + pdd = dev-power.subsys_data-domain_data; + ret = blocking_notifier_chain_register(pdd-notify_chain_head, + nb); + } + spin_unlock_irq(dev-power.lock); + return ret; +} +EXPORT_SYMBOL_GPL(pm_genpd_register_notifier); + +void pm_genpd_unregister_notifier(struct device *dev, struct notifier_block *nb) +{ + struct pm_domain_data *pdd; + + spin_lock_irq(dev-power.lock); + if (dev-power.subsys_data) { + pdd = dev-power.subsys_data-domain_data; + blocking_notifier_chain_unregister(pdd-notify_chain_head, nb); + } + spin_unlock_irq(dev-power.lock); +} +EXPORT_SYMBOL_GPL(pm_genpd_unregister_notifier); + +static void pm_genpd_notifier_call(unsigned long event, + struct generic_pm_domain *genpd) +{ + struct pm_domain_data *pdd; + + list_for_each_entry(pdd, genpd-dev_list, list_node) + blocking_notifier_call_chain(pdd-notify_chain_head, +event, pdd-dev); +} + struct generic_pm_domain *dev_to_genpd(struct device *dev) { if (IS_ERR_OR_NULL(dev-pm_domain)) @@ -161,12 +200,17 @@ static int genpd_power_on(struct generic_pm_domain *genpd) if (!genpd-power_on) return 0; + pm_genpd_notifier_call(PM_GENPD_POWER_ON_PREPARE, genpd); + time_start = ktime_get(); ret = genpd-power_on(genpd); if (ret) return ret; elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), time_start)); + + pm_genpd_notifier_call(PM_GENPD_POST_POWER_ON, genpd); + if (elapsed_ns = genpd-power_on_latency_ns) return ret; @@ -188,12 +232,17 @@ static int genpd_power_off(struct generic_pm_domain *genpd) if (!genpd-power_off) return 0; + pm_genpd_notifier_call(PM_GENPD_POWER_OFF_PREPARE, genpd); + time_start = ktime_get(); ret = genpd-power_off(genpd); if (ret == -EBUSY) return ret; elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), time_start)); + + pm_genpd_notifier_call(PM_GENPD_POST_POWER_OFF, genpd); + if (elapsed_ns = genpd-power_off_latency_ns) return ret; @@ -1466,6 +1515,7 @@ int __pm_genpd_add_device(struct generic_pm_domain *genpd, struct device *dev, genpd-attach_dev(genpd, dev); mutex_lock(gpd_data-lock); + BLOCKING_INIT_NOTIFIER_HEAD(gpd_data-base.notify_chain_head); gpd_data-base.dev = dev;
Re: [RFC PATCH 1/2] pm: Add PM domain notifications
Hi Sylwester, On Thu, Dec 11, 2014 at 7:36 PM, Sylwester Nawrocki s.nawro...@samsung.com wrote: Hi Tomasz, On 11/12/14 09:26, Tomasz Figa wrote: From: Sylwester Nawrocki s.nawro...@samsung.com This patch adds notifiers to the runtime PM/genpd subsystem. It is now possible to register a notifier, which will be called before and after the generic power domain subsystem calls the power domain's power_on and power_off callbacks. Signed-off-by: Sylwester Nawrocki s.nawro...@samsung.com [tf...@chromium.org: rebased] Signed-off-by: Tomasz Figa tf...@chromium.org Not sure if you've noticed it, I posted an updated version of this patch recently [1]. The notifiers list is moved to struct generic_pm_domain there and it also allows to register a notifier for selected power domain by name. [snip] [1] http://www.spinics.net/lists/linux-samsung-soc/msg38549.html Ah, haven't noticed, sorry. The API using devices looks the same, so I guess we can simply have patch 2/2 of this series applied on top of your patch. By the way, look-up by name (presumably hardcoded somewhere?) sounds a bit strange to me. What was the reason for it to be added? Best regards, Tomasz ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
This patch modifies the rockchip-iommu driver to consider state of the power domain the IOMMU is located in. When the power domain is powered off, the IOMMU cannot be accessed and register programming must be deferred until the power domain becomes enabled. The solution implemented in this patch uses power domain notifications to perform necessary IOMMU initialization. Race with runtime PM core is avoided by protecting code accessing the hardware, including startup and shutdown functions, with a spinlock with a check for power state inside. Signed-off-by: Tomasz Figa tf...@chromium.org --- drivers/iommu/rockchip-iommu.c | 208 ++--- 1 file changed, 172 insertions(+), 36 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index b2023af..9120655 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -20,6 +20,8 @@ #include linux/of.h #include linux/of_platform.h #include linux/platform_device.h +#include linux/pm_domain.h +#include linux/pm_runtime.h #include linux/slab.h #include linux/spinlock.h @@ -88,6 +90,9 @@ struct rk_iommu { int irq; struct list_head node; /* entry in rk_iommu_domain.iommus */ struct iommu_domain *domain; /* domain to which iommu is attached */ + struct notifier_block genpd_nb; + spinlock_t hw_lock; /* lock for register access */ + bool is_powered; /* power domain is on and register clock is enabled */ }; static inline void rk_table_flush(u32 *va, unsigned int count) @@ -283,6 +288,9 @@ static void rk_iommu_zap_lines(struct rk_iommu *iommu, dma_addr_t iova, size_t size) { dma_addr_t iova_end = iova + size; + + assert_spin_locked(iommu-hw_lock); + /* * TODO(djkurtz): Figure out when it is more efficient to shootdown the * entire iotlb rather than iterate over individual iovas. @@ -293,11 +301,15 @@ static void rk_iommu_zap_lines(struct rk_iommu *iommu, dma_addr_t iova, static bool rk_iommu_is_stall_active(struct rk_iommu *iommu) { + assert_spin_locked(iommu-hw_lock); + return rk_iommu_read(iommu, RK_MMU_STATUS) RK_MMU_STATUS_STALL_ACTIVE; } static bool rk_iommu_is_paging_enabled(struct rk_iommu *iommu) { + assert_spin_locked(iommu-hw_lock); + return rk_iommu_read(iommu, RK_MMU_STATUS) RK_MMU_STATUS_PAGING_ENABLED; } @@ -306,6 +318,8 @@ static int rk_iommu_enable_stall(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (rk_iommu_is_stall_active(iommu)) return 0; @@ -327,6 +341,8 @@ static int rk_iommu_disable_stall(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (!rk_iommu_is_stall_active(iommu)) return 0; @@ -344,6 +360,8 @@ static int rk_iommu_enable_paging(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (rk_iommu_is_paging_enabled(iommu)) return 0; @@ -361,6 +379,8 @@ static int rk_iommu_disable_paging(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (!rk_iommu_is_paging_enabled(iommu)) return 0; @@ -379,6 +399,8 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) int ret; u32 dte_addr; + assert_spin_locked(iommu-hw_lock); + /* * Check if register DTE_ADDR is working by writing DTE_ADDR_DUMMY * and verifying that upper 5 nybbles are read back. @@ -401,6 +423,50 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) return ret; } +static int rk_iommu_startup(struct rk_iommu *iommu) +{ + struct iommu_domain *domain = iommu-domain; + struct rk_iommu_domain *rk_domain; + phys_addr_t dte_addr; + int ret; + + assert_spin_locked(iommu-hw_lock); + + ret = rk_iommu_enable_stall(iommu); + if (ret) + return ret; + + ret = rk_iommu_force_reset(iommu); + if (ret) + return ret; + + rk_domain = domain-priv; + dte_addr = virt_to_phys(rk_domain-dt); + rk_iommu_write(iommu, RK_MMU_DTE_ADDR, dte_addr); + rk_iommu_command(iommu, RK_MMU_CMD_ZAP_CACHE); + rk_iommu_write(iommu, RK_MMU_INT_MASK, RK_MMU_IRQ_MASK); + + ret = rk_iommu_enable_paging(iommu); + if (ret) + return ret; + + rk_iommu_disable_stall(iommu); + + return ret; +} + +static void rk_iommu_shutdown(struct rk_iommu *iommu) +{ + assert_spin_locked(iommu-hw_lock); + + /* Ignore error while disabling, just keep going */ + rk_iommu_enable_stall(iommu); + rk_iommu_disable_paging(iommu); + rk_iommu_write(iommu, RK_MMU_INT_MASK, 0); + rk_iommu_write(iommu, RK_MMU_DTE_ADDR, 0); + rk_iommu_disable_stall(iommu); +} +
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
Hi Ulf, On Thu, Dec 11, 2014 at 8:58 PM, Ulf Hansson ulf.hans...@linaro.org wrote: On 11 December 2014 at 09:26, Tomasz Figa tf...@chromium.org wrote: This patch modifies the rockchip-iommu driver to consider state of the power domain the IOMMU is located in. When the power domain is powered off, the IOMMU cannot be accessed and register programming must be deferred until the power domain becomes enabled. [snip] @@ -988,11 +1107,28 @@ static int rk_iommu_probe(struct platform_device *pdev) return -ENXIO; } + pm_runtime_no_callbacks(dev); + pm_runtime_enable(dev); + + /* Synchronize state of the domain with driver data. */ + pm_runtime_get_sync(dev); + iommu-is_powered = true; Doesn't the runtime PM status reflect the value of is_powered, thus why do you need to have a copy of it? Could it perpahps be that you try to cope with the case when CONFIG_PM is unset? It's worth noting that this driver fully relies on status of other devices in the power domain the IOMMU is in and does not enforce the status on its own. So in general, as far as my understanding of PM runtime subsystem, the status of the IOMMU device will be always suspended, because nobody will call pm_runtime_get() on it (except the get and put pair in probe). So is_powered is here to track status of the domain, not the device. Feel free to suggest a better way, though. Best regards, Tomasz ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
On 11 December 2014 at 09:26, Tomasz Figa tf...@chromium.org wrote: This patch modifies the rockchip-iommu driver to consider state of the power domain the IOMMU is located in. When the power domain is powered off, the IOMMU cannot be accessed and register programming must be deferred until the power domain becomes enabled. The solution implemented in this patch uses power domain notifications to perform necessary IOMMU initialization. Race with runtime PM core is avoided by protecting code accessing the hardware, including startup and shutdown functions, with a spinlock with a check for power state inside. Signed-off-by: Tomasz Figa tf...@chromium.org --- drivers/iommu/rockchip-iommu.c | 208 ++--- 1 file changed, 172 insertions(+), 36 deletions(-) diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c index b2023af..9120655 100644 --- a/drivers/iommu/rockchip-iommu.c +++ b/drivers/iommu/rockchip-iommu.c @@ -20,6 +20,8 @@ #include linux/of.h #include linux/of_platform.h #include linux/platform_device.h +#include linux/pm_domain.h +#include linux/pm_runtime.h #include linux/slab.h #include linux/spinlock.h @@ -88,6 +90,9 @@ struct rk_iommu { int irq; struct list_head node; /* entry in rk_iommu_domain.iommus */ struct iommu_domain *domain; /* domain to which iommu is attached */ + struct notifier_block genpd_nb; + spinlock_t hw_lock; /* lock for register access */ + bool is_powered; /* power domain is on and register clock is enabled */ }; static inline void rk_table_flush(u32 *va, unsigned int count) @@ -283,6 +288,9 @@ static void rk_iommu_zap_lines(struct rk_iommu *iommu, dma_addr_t iova, size_t size) { dma_addr_t iova_end = iova + size; + + assert_spin_locked(iommu-hw_lock); + /* * TODO(djkurtz): Figure out when it is more efficient to shootdown the * entire iotlb rather than iterate over individual iovas. @@ -293,11 +301,15 @@ static void rk_iommu_zap_lines(struct rk_iommu *iommu, dma_addr_t iova, static bool rk_iommu_is_stall_active(struct rk_iommu *iommu) { + assert_spin_locked(iommu-hw_lock); + return rk_iommu_read(iommu, RK_MMU_STATUS) RK_MMU_STATUS_STALL_ACTIVE; } static bool rk_iommu_is_paging_enabled(struct rk_iommu *iommu) { + assert_spin_locked(iommu-hw_lock); + return rk_iommu_read(iommu, RK_MMU_STATUS) RK_MMU_STATUS_PAGING_ENABLED; } @@ -306,6 +318,8 @@ static int rk_iommu_enable_stall(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (rk_iommu_is_stall_active(iommu)) return 0; @@ -327,6 +341,8 @@ static int rk_iommu_disable_stall(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (!rk_iommu_is_stall_active(iommu)) return 0; @@ -344,6 +360,8 @@ static int rk_iommu_enable_paging(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (rk_iommu_is_paging_enabled(iommu)) return 0; @@ -361,6 +379,8 @@ static int rk_iommu_disable_paging(struct rk_iommu *iommu) { int ret; + assert_spin_locked(iommu-hw_lock); + if (!rk_iommu_is_paging_enabled(iommu)) return 0; @@ -379,6 +399,8 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) int ret; u32 dte_addr; + assert_spin_locked(iommu-hw_lock); + /* * Check if register DTE_ADDR is working by writing DTE_ADDR_DUMMY * and verifying that upper 5 nybbles are read back. @@ -401,6 +423,50 @@ static int rk_iommu_force_reset(struct rk_iommu *iommu) return ret; } +static int rk_iommu_startup(struct rk_iommu *iommu) +{ + struct iommu_domain *domain = iommu-domain; + struct rk_iommu_domain *rk_domain; + phys_addr_t dte_addr; + int ret; + + assert_spin_locked(iommu-hw_lock); + + ret = rk_iommu_enable_stall(iommu); + if (ret) + return ret; + + ret = rk_iommu_force_reset(iommu); + if (ret) + return ret; + + rk_domain = domain-priv; + dte_addr = virt_to_phys(rk_domain-dt); + rk_iommu_write(iommu, RK_MMU_DTE_ADDR, dte_addr); + rk_iommu_command(iommu, RK_MMU_CMD_ZAP_CACHE); + rk_iommu_write(iommu, RK_MMU_INT_MASK, RK_MMU_IRQ_MASK); + + ret = rk_iommu_enable_paging(iommu); + if (ret) + return ret; + + rk_iommu_disable_stall(iommu); + + return ret; +} + +static void rk_iommu_shutdown(struct rk_iommu *iommu) +{ + assert_spin_locked(iommu-hw_lock); + + /* Ignore error while disabling, just keep going */ +
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
On 11 December 2014 at 13:42, Tomasz Figa tf...@chromium.org wrote: Hi Ulf, On Thu, Dec 11, 2014 at 8:58 PM, Ulf Hansson ulf.hans...@linaro.org wrote: On 11 December 2014 at 09:26, Tomasz Figa tf...@chromium.org wrote: This patch modifies the rockchip-iommu driver to consider state of the power domain the IOMMU is located in. When the power domain is powered off, the IOMMU cannot be accessed and register programming must be deferred until the power domain becomes enabled. [snip] @@ -988,11 +1107,28 @@ static int rk_iommu_probe(struct platform_device *pdev) return -ENXIO; } + pm_runtime_no_callbacks(dev); + pm_runtime_enable(dev); + + /* Synchronize state of the domain with driver data. */ + pm_runtime_get_sync(dev); + iommu-is_powered = true; Doesn't the runtime PM status reflect the value of is_powered, thus why do you need to have a copy of it? Could it perpahps be that you try to cope with the case when CONFIG_PM is unset? It's worth noting that this driver fully relies on status of other devices in the power domain the IOMMU is in and does not enforce the status on its own. So in general, as far as my understanding of PM runtime subsystem, the status of the IOMMU device will be always suspended, because nobody will call pm_runtime_get() on it (except the get and put pair in probe). So is_powered is here to track status of the domain, not the device. Feel free to suggest a better way, though. I see, thanks for clarifying. I think it makes sense as is, I have no better suggestion. Kind regards Uffe ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 1/2] pm: Add PM domain notifications
On 11 December 2014 at 14:54, Sylwester Nawrocki s.nawro...@samsung.com wrote: On 11/12/14 12:04, Tomasz Figa wrote: ... On 11/12/14 09:26, Tomasz Figa wrote: From: Sylwester Nawrocki s.nawro...@samsung.com This patch adds notifiers to the runtime PM/genpd subsystem. It is now possible to register a notifier, which will be called before and after the generic power domain subsystem calls the power domain's power_on and power_off callbacks. Signed-off-by: Sylwester Nawrocki s.nawro...@samsung.com [tf...@chromium.org: rebased] Signed-off-by: Tomasz Figa tf...@chromium.org Not sure if you've noticed it, I posted an updated version of this patch recently [1]. The notifiers list is moved to struct generic_pm_domain there and it also allows to register a notifier for selected power domain by name. [snip] [1] http://www.spinics.net/lists/linux-samsung-soc/msg38549.html Ah, haven't noticed, sorry. The API using devices looks the same, so I guess we can simply have patch 2/2 of this series applied on top of your patch. Yes, that should work. By the way, look-up by name (presumably hardcoded somewhere?) sounds a bit strange to me. What was the reason for it to be added? Yes, that might not be a very elegant approach. We initially used it to implement power domain on/off sequence per specific domain and SoC, since it appeared resistant to generalize. I.e. the control register write sequences are different per domain and per SoC (exynos). So we named the domains in the device tree in that way: pm_domains: pm-domains@10024000 { compatible = samsung,exynos4415-pd; reg-names = cam, tv, mfc, g3d, lcd0, isp0, isp1; reg = 0x10024000 0x20, 0x10024020 0x20, 0x10024040 0x20, 0x10024060 0x20, 0x10024080 0x20, 0x100240A0 0x20, 0x100240E0 0x20; #power-domain-cells = 1; }; and then, for example, in the exynos CMU_ISP{0, 1} (clock controller) driver registered for notification on isp0 and isp1 power domains (isp1 is a sub-domain of isp0 and the consumer devices are normally attached to isp1). We have been investigating if we could do without the notification at the clocks driver side, then the all SoC/power domain specific code would end up in the exynos power domain driver. But I'm afraid it's not going to work for all SoCs. Anyway lookup by name might be not needed. Regarding lookup by name, let's please move away from those APIs. I am planning to remove all name based API from the genpd API as soon as I can. If you need to fetch domains, this might help you: http://www.spinics.net/lists/arm-kernel/msg383743.html Kind regards Uffe ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [v2 17/25] KVM: kvm-vfio: User API for VT-d Posted-Interrupts
On Thu, 2014-12-11 at 05:55 +, Wu, Feng wrote: -Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Monday, December 08, 2014 1:21 PM To: Wu, Feng Cc: Eric Auger; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; j...@8bytes.org; jiang@linux.intel.com; linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; k...@vger.kernel.org Subject: Re: [v2 17/25] KVM: kvm-vfio: User API for VT-d Posted-Interrupts On Mon, 2014-12-08 at 04:58 +, Wu, Feng wrote: -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Eric Auger Sent: Thursday, December 04, 2014 10:05 PM To: Wu, Feng; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; x...@kernel.org; g...@kernel.org; pbonz...@redhat.com; dw...@infradead.org; j...@8bytes.org; alex.william...@redhat.com; jiang@linux.intel.com Cc: linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org; k...@vger.kernel.org Subject: Re: [v2 17/25] KVM: kvm-vfio: User API for VT-d Posted-Interrupts Hi Feng, On 12/03/2014 08:39 AM, Feng Wu wrote: This patch adds and documents a new attribute KVM_DEV_VFIO_DEVICE_POSTING_IRQ in KVM_DEV_VFIO_DEVICE group. This new attribute is used for VT-d Posted-Interrupts. When guest OS changes the interrupt configuration for an assigned device, such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute to tell KVM to update the related IRTE according the VT-d Posted-Interrrupts Specification, such as, the guest vector should be updated in the related IRTE. Signed-off-by: Feng Wu feng...@intel.com --- Documentation/virtual/kvm/devices/vfio.txt |9 + include/uapi/linux/kvm.h | 10 ++ 2 files changed, 19 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt index f7aff29..41e12b7 100644 --- a/Documentation/virtual/kvm/devices/vfio.txt +++ b/Documentation/virtual/kvm/devices/vfio.txt @@ -42,3 +42,12 @@ activated before VFIO_DEVICE_SET_IRQS has been called to trigger the IRQ or associate an eventfd to it. Unforwarding can only be called while the signaling has been disabled with VFIO_DEVICE_SET_IRQS. If this condition is not satisfied, the command returns an -EBUSY. + + KVM_DEV_VFIO_DEVICE_POSTING_IRQ: Use posted interrtups mechanism to post typo + the IRQ to guests. +For this attribute, kvm_device_attr.addr points to a kvm_vfio_dev_irq struct. + +When guest OS changes the interrupt configuration for an assigned device, +such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute +to tell KVM to update the related IRTE according the VT-d Posted-Interrrupts +Specification, such as, the guest vector should be updated in the related IRTE. For my curiosity are there any restrictions about the instant at which the change can be done? I do not get here how you deactivate the posting? The current method is if the hardware supports interrupts posting, we will use it instead of interrupts remapping, since it has good performance. Why do I need deactivate interrupts posting? Here is the reply to Alex for the same question: In fact, I don't think we need to stop the posted-interrupts. For setting posted interrupts, we update the related IRTE according to the new format. If the guest reboots, or unload the drivers, or some other operations, the msi/msix will be disabled first, in this path, the irq will be disabled the related IRTE is not used anymore. Right, and I'm still not sure I agree with that reasoning. We need to build the kernel interface to be generic, not tailored for a specific userspace. I don't really feel comfortable having something that can't be disabled via a similar path to it being enabled. For instance, what about a dynamic debug interface where we want to enable tracing and see each interrupt injected into the guest. At that point we'd want to disabled posted interrupts and direct KVM injection and route via QEMU. Thanks, Alex I am not quite understand why we need to debug the software delivery path for interrupt when PI is used, in this case, the software injection code will have no chance to execute. If we don't want the use PI, we can disable it from kernel command line. Well, first off, I think it's just good interface design that if we introduce the ability to enable something we also provide the ability to disable it without making assumption about how a specific userspace is
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
On 11 December 2014 at 16:31, Kevin Hilman khil...@kernel.org wrote: [+ Laurent Pinchart] Tomasz Figa tf...@chromium.org writes: On Thu, Dec 11, 2014 at 8:58 PM, Ulf Hansson ulf.hans...@linaro.org wrote: [...] @@ -988,11 +1107,28 @@ static int rk_iommu_probe(struct platform_device *pdev) return -ENXIO; } + pm_runtime_no_callbacks(dev); + pm_runtime_enable(dev); + + /* Synchronize state of the domain with driver data. */ + pm_runtime_get_sync(dev); + iommu-is_powered = true; Doesn't the runtime PM status reflect the value of is_powered, thus why do you need to have a copy of it? Could it perpahps be that you try to cope with the case when CONFIG_PM is unset? It's worth noting that this driver fully relies on status of other devices in the power domain the IOMMU is in and does not enforce the status on its own. So in general, as far as my understanding of PM runtime subsystem, the status of the IOMMU device will be always suspended, because nobody will call pm_runtime_get() on it (except the get and put pair in probe). So is_powered is here to track status of the domain, not the device. Feel free to suggest a better way, though. I still don't like these notifiers. I think they add ways to bypass having proper runtime PM implemented for devices/subsystems. I do agree, but I haven't found another good solution to the problem. From a high-level, the IOMMU is just another device inside the PM domain, so ideally it should be doing it's own _get() and _put() calls so the PM domain code would just do the right thing without the need for notifiers. As I understand it, the IOMMU (or for other similar cases) shouldn't be doing any get() and put() at all because there are no IO API to serve request from. In principle we could consider these kind devices as parent devices to those other devices that needs them. Then runtime PM core would take care of things for us, right!? Now, I am not so sure using the parent approach is actually viable, since it will likely have other complications, but I haven't thoroughly thought it though yet. No knowing a lot about the IOMMU API, I'm guessing the reason you're not doing that is because the IOMMU API currently doesn't have an easy way to keep track of *active* users so it's not obvious where to put those _get and _put calls. If that doesn't exist, perhaps a simple iommu_get() and iommu_put() API needs to be introduced (which inside the IOMMU core would just do runtime PM calls) so that users of the IOMMU could inform the subsystem that the IOMMU is needed and it should not be powered off. I Cc'd Laurent because I know he's thought about this before from the IOMMU side, and not sure if he came up with a solution. Cool, let's see then. Kind regards Uffe ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/vt-d: Only remove domain when device is removed
On Tue, Dec 09, 2014 at 01:15:25PM +0100, Joerg Roedel wrote: On Thu, Nov 06, 2014 at 09:16:05AM -0700, Alex Williamson wrote: But the domains are unlinked from device_domain_list using unlink_domain_info() which is called from both domain_remove_dev_info() and domain_remove_one_dev_info() which are both part of that more likely, unlikely branch in intel_iommu_attach_device(). So it seems like any time we switch a device from the DMA-API to the IOMMU-API, we lose the reference to the domain. Is that incorrect? I'll try to test. Okay, I thought a while about that and it looks like a real fix needs a rewrite of the domain handling code in the VT-d driver to better handle domain lifetime. We'll get this for free when we add default domains and more domain handling logic to the iommu core, so I think we don't need to start rewriting the VT-d driver for this. But for the time being, here is a simple fix for the leak in iommu_attach_domain: Joerg, This patch doesn't seem to be fixing the memory leak. I am testing with a 3.18.0-rc7 kernel applied to a RHEL 7.0 system. I added printk in free_domain_mem and alloc_domain to first reproduce Alex's observation. I created a VM and assigned a PCI NIC w/ no associated RMRR to the VM. The pattern I see is when starting the VM is: alloc_domain - X When powering off the VM: free_domain_mem(X) alloc_domain - Y I then applied the patch below and I still see the same pattern of two alloc_domain and one free_domain_mem when powering on/off the VM. I added some additional instrumentation and I do not see the new call to domain_exit being executed. (See inline comments below.) Jerry From d65b236d0f27fe3ef7ac4d12cceb0da67aec86ce Mon Sep 17 00:00:00 2001 From: Joerg Roedel jroe...@suse.de Date: Tue, 9 Dec 2014 12:56:45 +0100 Subject: [PATCH] iommu/vt-d: Fix dmar_domain leak in iommu_attach_device Since commit 1196c2f a domain is only destroyed in the notifier path if it is hot-unplugged. This caused a domain leakage in iommu_attach_device when a driver was unbound from the device and bound to VFIO. In this case the device is attached to a new domain and unlinked from the old domain. At this point nothing points to the old domain anymore and its memory is leaked. Fix this by explicitly freeing the old domain in iommu_attach_domain. Fixes: 1196c2f 'iommu/vt-d: Only remove domain when device is removed' Signed-off-by: Joerg Roedel jroe...@suse.de --- drivers/iommu/intel-iommu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 1232336..9ef8e89 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -4424,10 +4424,13 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, old_domain = find_domain(dev); if (old_domain) { - if (domain_type_is_vm_or_si(dmar_domain)) + if (domain_type_is_vm_or_si(dmar_domain)) { JAH This path is executed when starting the VM. domain_remove_one_dev_info(old_domain, dev); - else + } else { JAH I don't see this path being executed. domain_remove_dev_info(old_domain); + if (list_empty(old_domain-devices)) + domain_exit(old_domain); + } } } -- 1.8.4.5 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu -- Jerry HoemannSoftware Engineer Hewlett-Packard 3404 E Harmony Rd. MS 36phone: (970) 898-1022 Ft. Collins, CO 80528 FAX:(970) 898-0707 email: jerry.hoem...@hp.com ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
On Thursday, December 11, 2014 04:51:37 PM Ulf Hansson wrote: On 11 December 2014 at 16:31, Kevin Hilman khil...@kernel.org wrote: [+ Laurent Pinchart] Tomasz Figa tf...@chromium.org writes: On Thu, Dec 11, 2014 at 8:58 PM, Ulf Hansson ulf.hans...@linaro.org wrote: [...] @@ -988,11 +1107,28 @@ static int rk_iommu_probe(struct platform_device *pdev) return -ENXIO; } + pm_runtime_no_callbacks(dev); + pm_runtime_enable(dev); + + /* Synchronize state of the domain with driver data. */ + pm_runtime_get_sync(dev); + iommu-is_powered = true; Doesn't the runtime PM status reflect the value of is_powered, thus why do you need to have a copy of it? Could it perpahps be that you try to cope with the case when CONFIG_PM is unset? It's worth noting that this driver fully relies on status of other devices in the power domain the IOMMU is in and does not enforce the status on its own. So in general, as far as my understanding of PM runtime subsystem, the status of the IOMMU device will be always suspended, because nobody will call pm_runtime_get() on it (except the get and put pair in probe). So is_powered is here to track status of the domain, not the device. Feel free to suggest a better way, though. I still don't like these notifiers. I think they add ways to bypass having proper runtime PM implemented for devices/subsystems. I do agree, but I haven't found another good solution to the problem. For the record, I'm not liking this mostly because it fixes a generic problem in a way that's hidden in the genpd code and very indirect. From a high-level, the IOMMU is just another device inside the PM domain, so ideally it should be doing it's own _get() and _put() calls so the PM domain code would just do the right thing without the need for notifiers. As I understand it, the IOMMU (or for other similar cases) shouldn't be doing any get() and put() at all because there are no IO API to serve request from. In principle we could consider these kind devices as parent devices to those other devices that needs them. Then runtime PM core would take care of things for us, right!? Now, I am not so sure using the parent approach is actually viable, since it will likely have other complications, but I haven't thoroughly thought it though yet. That actually need not be a parent. What's needed in this case is to do a pm_runtime_get_sync() on a device depended on every time a dependent device is runtime-resumed (and analogously for suspending). The core doesn't have a way to do that, but it looks like we'll need to add it anyway for various reasons (ACPI _DEP is one of them as I mentioned some time ago, but people dismissed it basically as not their problem). -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/5] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
Sorry I have no plan yet. Could you send me your logs on your AMD system? Thanks Zhenhua On 12/10/2014 04:46 PM, Baoquan He wrote: Hi Joerg, ZhenHua, This issue happens on AMD iommu too, do you have any plans or thoughts on that? Thanks Baoquan On 11/17/14 at 02:38pm, Joerg Roedel wrote: On Fri, Nov 14, 2014 at 02:27:44PM +0800, Li, ZhenHua wrote: I am working following your directions: 1. If the VT-d driver finds the IOMMU enabled, it reuses its root entry table, and do NOT disable-enable iommu. Other data will be copied. 2. When a device driver issues the first dma_map command for a device, we assign a new and empty page-table, thus removing all mappings from the old kernel for the device. Please let me know if I get something wrong. Yes, this sounds right. Happily waiting for patches :) Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 2/2] iommu: rockchip: Handle system-wide and runtime PM
Hi Rafael, On Fri, Dec 12, 2014 at 5:48 AM, Rafael J. Wysocki r...@rjwysocki.net wrote: On Thursday, December 11, 2014 04:51:37 PM Ulf Hansson wrote: On 11 December 2014 at 16:31, Kevin Hilman khil...@kernel.org wrote: [+ Laurent Pinchart] Tomasz Figa tf...@chromium.org writes: On Thu, Dec 11, 2014 at 8:58 PM, Ulf Hansson ulf.hans...@linaro.org wrote: [...] @@ -988,11 +1107,28 @@ static int rk_iommu_probe(struct platform_device *pdev) return -ENXIO; } + pm_runtime_no_callbacks(dev); + pm_runtime_enable(dev); + + /* Synchronize state of the domain with driver data. */ + pm_runtime_get_sync(dev); + iommu-is_powered = true; Doesn't the runtime PM status reflect the value of is_powered, thus why do you need to have a copy of it? Could it perpahps be that you try to cope with the case when CONFIG_PM is unset? It's worth noting that this driver fully relies on status of other devices in the power domain the IOMMU is in and does not enforce the status on its own. So in general, as far as my understanding of PM runtime subsystem, the status of the IOMMU device will be always suspended, because nobody will call pm_runtime_get() on it (except the get and put pair in probe). So is_powered is here to track status of the domain, not the device. Feel free to suggest a better way, though. I still don't like these notifiers. I think they add ways to bypass having proper runtime PM implemented for devices/subsystems. I do agree, but I haven't found another good solution to the problem. For the record, I'm not liking this mostly because it fixes a generic problem in a way that's hidden in the genpd code and very indirect. Well, that's true. This is indeed a generic problem of PM dependencies between devices (other than those represented by parent-child relation), which in fact doesn't have much to do with genpd, but rather with those devices directly. It is just that genpd is the most convenient location to solve this in current code and in a simple way. In other words, I see this solution as a reasonable way to get the problem solved quickly for now, so that we can start thinking about a more elegant solution. From a high-level, the IOMMU is just another device inside the PM domain, so ideally it should be doing it's own _get() and _put() calls so the PM domain code would just do the right thing without the need for notifiers. As I understand it, the IOMMU (or for other similar cases) shouldn't be doing any get() and put() at all because there are no IO API to serve request from. In principle we could consider these kind devices as parent devices to those other devices that needs them. Then runtime PM core would take care of things for us, right!? Now, I am not so sure using the parent approach is actually viable, since it will likely have other complications, but I haven't thoroughly thought it though yet. That actually need not be a parent. What's needed in this case is to do a pm_runtime_get_sync() on a device depended on every time a dependent device is runtime-resumed (and analogously for suspending). The core doesn't have a way to do that, but it looks like we'll need to add it anyway for various reasons (ACPI _DEP is one of them as I mentioned some time ago, but people dismissed it basically as not their problem). Let me show you our exact use case. We have a power domain, which contains an IOMMU and an IP block, which can do bus transactions through that IOMMU. Of course the IP block is not aware of the IOMMU, because this is just an integration detail and on other platforms using the same IP block the IOMMU might not be there. This implies that the driver for this IP block should not be aware of the IOMMU either, which, on the buffer allocation and mapping side, is achieved by DMA mapping subsystem. We would also want the IOMMU to be fully transparent to that driver on PM side. Now, for PM related details, they are located in the same power domain, which means that whenever the power domain is turned off, the CPU can't access their registers and all the hardware state is gone. To make the case more interesting, there is no point in programming the IOMMU unless the device using it is powered on. Similarly, there is no point in powering the domain on to just access the IOMMU. This implies that the power domain should be fully controlled by the stand-alone IP block, while the peripheral IOMMU shouldn't affect its state and its driver only respond to operations performed by driver of that stand-alone IP block. A solution like below could work for the above: - There is a per-device list of peripheral devices, which need to be powered on for this device to work. - Whenever the IOMMU subsystem/driver binds an IOMMU to a device, it adds the IOMMU device to the list of peripheral devices of that device (and similarly for removal). - A runtime PM