Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> On 2019/9/25 18:41, Peter Zijlstra wrote:
>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>> From the discussion above, it seems making node_to_cpumask_map()
>>> NUMA_NO_NODE aware is the most feasible way to move forward.
>>
>> That's still wrong.
>
> Hi, Peter
>
> It seems this has become trapped in a vicious circle.
>
> From my understanding, NUMA_NO_NODE, which means no NUMA node
> preference, is the state that describes the node of a virtual device,
> or of a physical device that has equal distance to all CPUs.
>
> We can be stricter if the device does have a nearer node, but we cannot
> deny that a device may have no NUMA node preference or node affinity,
> which also means the control or data buffer can be allocated at the
> node where the process is running.
>
> As you have proposed, making it -2 and having dev_to_node() warn if the
> device does have a nearer node that was not set by the fw is a way to
> be stricter.
>
> But I think maybe being stricter is not really relevant to
> NUMA_NO_NODE, because we do need a state to describe a device that has
> equal distance to all nodes, even if it is not physically scalable.
>
> Any better suggestion to move this forward?

FWIW (since this is in my inbox), it sounds like the fundamental issue is
that NUMA_NO_NODE is conflated for at least two different purposes, so
trying to sort that out would be a good first step. AFAICS we have
genuine "don't care" cases like alloc_pages_node(), where if the producer
says it doesn't matter then the consumer is free to make its own
judgement on what to do, and fundamentally different "we expect this
thing to have an affinity but it doesn't, so we can't say what's
appropriate" cases which could really do with some separate indicator
like "NUMA_INVALID_NODE".

The tricky part is then bestowed on the producers to decide whether they
can downgrade "invalid" to "don't care". You can technically build 'a
device' whose internal logic is distributed between nodes and thus
appears to have equal affinity - interrupt controllers, for example, may
have per-CPU or per-node interfaces that end up looking like that - so
although it's unlikely it's not outright nonsensical. Similarly a
'device' that's actually emulated behind a firmware call interface may
well effectively have no real affinity.

Robin.
Re: [PATCH 01/23] dma-mapping: provide a generic DMA_MAPPING_ERROR
On 30/11/2018 13:22, Christoph Hellwig wrote:
> Error handling of the dma_map_single and dma_map_page APIs is a little
> problematic at the moment, in that we use different encodings in the
> returned dma_addr_t to indicate an error. That means we require an
> additional indirect call to figure out if a dma mapping call returned
> an error, and a lot of boilerplate code to implement these semantics.
>
> Instead return the maximum addressable value as the error. As long
> as we don't allow mapping single-byte ranges with single-byte alignment
> this value can never be a valid return. Additionally if drivers do
> not check the return value from the dma_map* routines this value means
> they will generally not be pointed to actual memory.
>
> Once the default value is added here we can start removing the
> various mapping_error methods and just rely on this generic check.
>
> Signed-off-by: Christoph Hellwig
> ---
>  include/linux/dma-mapping.h | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0f81c713f6e9..46bd612d929e 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -133,6 +133,8 @@ struct dma_map_ops {
>  	u64 (*get_required_mask)(struct device *dev);
>  };
>
> +#define DMA_MAPPING_ERROR (~(dma_addr_t)0)
> +
>  extern const struct dma_map_ops dma_direct_ops;
>  extern const struct dma_map_ops dma_virt_ops;
>
> @@ -576,6 +578,10 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
>  	const struct dma_map_ops *ops = get_dma_ops(dev);
>
>  	debug_dma_mapping_error(dev, dma_addr);
> +
> +	if (dma_addr == DMA_MAPPING_ERROR)
> +		return 1;
> +
>  	if (ops->mapping_error)
>  		return ops->mapping_error(dev, dma_addr);
>  	return 0;

I'd have been inclined to put the default check here, i.e.

-	return 0
+	return dma_addr == DMA_MAPPING_ERROR

such that the callback retains full precedence and we don't have to deal
with the non-trivial removals immediately if it comes to it.

Not that it makes a vast difference though, so either way,

Reviewed-by: Robin Murphy
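
For illustration, the two orderings discussed in this review can be modelled in userspace C. This is only a sketch under stated assumptions: `dma_addr_t` is taken to be 64 bits, the `dma_map_ops` structure is reduced to a single callback without the `struct device` argument, and `legacy_zero_is_error()` is a hypothetical implementation invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel type (assumed 64-bit here). */
typedef uint64_t dma_addr_t;

/* Same encoding as the patch: the all-ones address is the error value. */
#define DMA_MAPPING_ERROR (~(dma_addr_t)0)

/* Reduced model of dma_map_ops: only the optional legacy error check. */
struct dma_map_ops {
	int (*mapping_error)(dma_addr_t dma_addr);
};

/* Ordering in the patch: the generic sentinel is tested first. */
static int mapping_error_generic_first(const struct dma_map_ops *ops,
				       dma_addr_t addr)
{
	if (addr == DMA_MAPPING_ERROR)
		return 1;
	if (ops->mapping_error)
		return ops->mapping_error(addr);
	return 0;
}

/*
 * Ordering suggested in the review: a callback, if present, keeps full
 * precedence and the sentinel is only the fallback.
 */
static int mapping_error_callback_first(const struct dma_map_ops *ops,
					dma_addr_t addr)
{
	if (ops->mapping_error)
		return ops->mapping_error(addr);
	return addr == DMA_MAPPING_ERROR;
}

/* Hypothetical legacy implementation that reports failure as address 0. */
static int legacy_zero_is_error(dma_addr_t addr)
{
	return addr == 0;
}
```

The two variants agree whenever no callback is installed. They differ only for an implementation that still has its own `mapping_error`: with the callback-first ordering, that implementation alone decides whether the all-ones value is an error.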
Re: [PATCH v2 03/20] perf/core: add PERF_PMU_CAP_EXCLUDE for exclusion capable PMUs
Hi Andrew,

On 26/11/2018 11:12, Andrew Murray wrote:
> Many PMU drivers do not have the capability to exclude counting events
> that occur in specific contexts such as idle, kernel, guest, etc. These
> drivers indicate this by returning an error in their event_init upon
> testing the events attribute flags. This approach is error prone and
> often inconsistent.
>
> Let's instead allow PMU drivers to advertise their ability to exclude
> based on context via a new capability: PERF_PMU_CAP_EXCLUDE. This
> allows the perf core to reject requests for exclusion events where
> there is no support in the PMU.
>
> Signed-off-by: Andrew Murray
> ---
>  include/linux/perf_event.h | 1 +
>  kernel/events/core.c | 9 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b2e806f..69b3d65 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -244,6 +244,7 @@ struct perf_event;
>  #define PERF_PMU_CAP_EXCLUSIVE			0x10
>  #define PERF_PMU_CAP_ITRACE			0x20
>  #define PERF_PMU_CAP_HETEROGENEOUS_CPUS		0x40
> +#define PERF_PMU_CAP_EXCLUDE			0x80
>
>  /**
>   * struct pmu - generic performance monitoring unit
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5a97f34..9afb33c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9743,6 +9743,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
>  	if (ctx)
>  		perf_event_ctx_unlock(event->group_leader, ctx);
>
> +	if (!ret) {
> +		if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUDE) &&
> +		    event_has_any_exclude_flag(event)) {

Technically this is a bisection-breaker, since no driver has this
capability yet - ideally, this patch should come after all the ones
introducing it to the relevant drivers (with the removal of the
now-redundant code from the other drivers at the end).

Alternatively, since we already have several other negative
capabilities, unless there's a strong feeling against adding any more
then it might work out simpler to flip it to PERF_PMU_CAP_NO_EXCLUDE,
such that we only need to introduce the core check then directly replace
the open-coded event checks with the capability in the appropriate
drivers, and need not touch the exclusion-supporting ones at all.

Robin.

> +			if (event->destroy)
> +				event->destroy(event);
> +			ret = -EINVAL;
> +		}
> +	}
> +
>  	if (ret)
>  		module_put(pmu->module);
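
The bisection concern and the suggested negative capability can be shown with a small userspace model. This is a sketch only: `struct model_pmu`, `struct model_event`, the `MODEL_CAP_*` values and both check functions are hypothetical names for this example, not the perf core's definitions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen for this model only. */
#define MODEL_CAP_EXCLUDE	0x80	/* positive: "I can exclude" */
#define MODEL_CAP_NO_EXCLUDE	0x100	/* negative: "I cannot exclude" */

struct model_pmu {
	unsigned int capabilities;
};

struct model_event {
	bool exclude_user;
	bool exclude_kernel;
	bool exclude_idle;
};

static bool model_has_any_exclude_flag(const struct model_event *e)
{
	return e->exclude_user || e->exclude_kernel || e->exclude_idle;
}

/*
 * Positive capability: every PMU that does NOT set the flag is rejected,
 * including drivers that support exclusion but have not been converted.
 */
static int check_positive(const struct model_pmu *pmu,
			  const struct model_event *e)
{
	if (!(pmu->capabilities & MODEL_CAP_EXCLUDE) &&
	    model_has_any_exclude_flag(e))
		return -EINVAL;
	return 0;
}

/*
 * Negative capability: only PMUs that explicitly opt out are rejected,
 * so unconverted drivers keep their current behaviour.
 */
static int check_negative(const struct model_pmu *pmu,
			  const struct model_event *e)
{
	if ((pmu->capabilities & MODEL_CAP_NO_EXCLUDE) &&
	    model_has_any_exclude_flag(e))
		return -EINVAL;
	return 0;
}
```

With `capabilities == 0`, which is what every driver looks like at the point the core check is introduced, `check_positive()` rejects an event with any exclude flag set while `check_negative()` accepts it.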
Re: [PATCH 1/2] dma-mapping: remove ->mapping_error
On 09/11/2018 08:46, Christoph Hellwig wrote:
[...]
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 1167ff0416cf..cfb422e17049 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -55,8 +55,6 @@
>  #include "amd_iommu_types.h"
>  #include "irq_remapping.h"
>
> -#define AMD_IOMMU_MAPPING_ERROR	0
> -
>  #define CMD_SET_TYPE(cmd, t) ((cmd)->data[1] |= ((t) << 28))
>
>  #define LOOP_TIMEOUT	10
> @@ -2339,7 +2337,7 @@ static dma_addr_t __map_single(struct device *dev,
>  	paddr &= PAGE_MASK;
>
>  	address = dma_ops_alloc_iova(dev, dma_dom, pages, dma_mask);
> -	if (address == AMD_IOMMU_MAPPING_ERROR)
> +	if (address == DMA_MAPPING_ERROR)

This for one is clearly broken, because the IOVA allocator still returns
0 on failure here...

>  		goto out;
>
>  	prot = dir2prot(direction);
> @@ -2376,7 +2374,7 @@ static dma_addr_t __map_single(struct device *dev,
>
>  	dma_ops_free_iova(dma_dom, address, pages);
>
> -	return AMD_IOMMU_MAPPING_ERROR;
> +	return DMA_MAPPING_ERROR;
>  }
>
>  /*
> @@ -2427,7 +2425,7 @@ static dma_addr_t map_page(struct device *dev, struct page *page,
>  	if (PTR_ERR(domain) == -EINVAL)
>  		return (dma_addr_t)paddr;
>  	else if (IS_ERR(domain))
> -		return AMD_IOMMU_MAPPING_ERROR;
> +		return DMA_MAPPING_ERROR;
>
>  	dma_mask = *dev->dma_mask;
>  	dma_dom = to_dma_ops_domain(domain);
> @@ -2504,7 +2502,7 @@ static int map_sg(struct device *dev, struct scatterlist *sglist,
>  	npages = sg_num_pages(dev, sglist, nelems);
>
>  	address = dma_ops_alloc_iova(dev, dma_dom, npages, dma_mask);
> -	if (address == AMD_IOMMU_MAPPING_ERROR)
> +	if (address == DMA_MAPPING_ERROR)

..and here.

I very much agree with the concept, but I think the way to go about it
is to convert the implementations which need it to the standardised
*_MAPPING_ERROR value one-by-one, and only then do the big sweep to
remove them all. That has more of a chance of getting worthwhile review
and testing from the respective relevant parties (I'll confess I came
looking for this bug specifically, since I happened to recall amd_iommu
having a tricky implicit reliance on the old DMA_ERROR_CODE being 0 on
x86).

In terms of really minimising the error-checking overhead it's a bit of
a shame that DMA_MAPPING_ERROR = 0 doesn't seem viable as the thing to
standardise on, since that has advantages at the micro-optimisation
level for many ISAs - fixing up the legacy IOMMU code doesn't seem
insurmountable, but I suspect there may well be non-IOMMU platforms
where DMA to physical address 0 is a thing :(

(yeah, I know saving a couple of instructions and potential register
allocations is down in the noise when we're already going from an
indirect call to an inline comparison; I'm mostly just thinking out loud
there)

Robin.

>  		goto out_err;
>
>  	prot = dir2prot(direction);
> @@ -2627,7 +2625,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
>  	*dma_addr = __map_single(dev, dma_dom, page_to_phys(page),
>  				 size, DMA_BIDIRECTIONAL, dma_mask);
>
> -	if (*dma_addr == AMD_IOMMU_MAPPING_ERROR)
> +	if (*dma_addr == DMA_MAPPING_ERROR)
>  		goto out_free;
>
>  	return page_address(page);
> @@ -2678,11 +2676,6 @@ static int amd_iommu_dma_supported(struct device *dev, u64 mask)
>  	return check_device(dev);
>  }
>
> -static int amd_iommu_mapping_error(struct device *dev, dma_addr_t dma_addr)
> -{
> -	return dma_addr == AMD_IOMMU_MAPPING_ERROR;
> -}
> -
>  static const struct dma_map_ops amd_iommu_dma_ops = {
>  	.alloc = alloc_coherent,
>  	.free = free_coherent,
> @@ -2691,7 +2684,6 @@ static const struct dma_map_ops amd_iommu_dma_ops = {
>  	.map_sg = map_sg,
>  	.unmap_sg = unmap_sg,
>  	.dma_supported = amd_iommu_dma_supported,
> -	.mapping_error = amd_iommu_mapping_error,
>  };
>
>  static int init_reserved_iova_ranges(void)
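
The bug pointed out in this review comes down to a mismatch between what the allocator returns on failure (0) and what the converted caller compares against (all-ones). A minimal userspace sketch, assuming a 64-bit `dma_addr_t`; `model_alloc_iova()`, `map_buggy()` and `map_converted()` are hypothetical helpers for this example and do not reproduce the amd_iommu code:

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* userspace stand-in, assumed 64-bit */

#define DMA_MAPPING_ERROR (~(dma_addr_t)0)

/*
 * Hypothetical allocator modelled on the behaviour described in the
 * review: it reports failure by returning 0.
 */
static dma_addr_t model_alloc_iova(int fail)
{
	return fail ? 0 : 0x100000;
}

/*
 * Mechanical conversion: the comparison was switched to the new sentinel
 * but the allocator was not, so the failure branch can never be taken.
 */
static dma_addr_t map_buggy(int fail)
{
	dma_addr_t address = model_alloc_iova(fail);

	if (address == DMA_MAPPING_ERROR)
		return DMA_MAPPING_ERROR;
	return address;		/* on failure this leaks 0 to the caller */
}

/*
 * One possible conversion: keep testing the allocator's own failure
 * value and translate it to the standard sentinel at the boundary.
 */
static dma_addr_t map_converted(int fail)
{
	dma_addr_t address = model_alloc_iova(fail);

	if (!address)
		return DMA_MAPPING_ERROR;
	return address;
}
```

In the buggy variant a failed allocation is returned as address 0, which a generic `dma_addr == DMA_MAPPING_ERROR` check then treats as a successful mapping.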
Re: [PATCH 3/3] dma-debug: unexport dma_debug_resize_entries and debug_dma_dump_mappings
On 24/04/18 15:02, Christoph Hellwig wrote:
> Only used by the AMD GART driver, which must be built in.

FWIW debug_dma_dump_mappings() is also called by the Intel VT-d driver,
but the same reasoning still applies.

This does rather beg the question of whether it's right to have bits of
low-level dma-debug internals *only* called by a couple of IOMMU drivers,
but that can wait for another day.

Reviewed-by: Robin Murphy <robin.mur...@arm.com>

> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---
>  lib/dma-debug.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 075253cb613b..6a1ebaa83623 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -444,7 +444,6 @@ void debug_dma_dump_mappings(struct device *dev)
>  		spin_unlock_irqrestore(&bucket->lock, flags);
>  	}
>  }
> -EXPORT_SYMBOL(debug_dma_dump_mappings);
>
>  /*
>   * For each mapping (initial cacheline in the case of
> @@ -753,7 +752,6 @@ int dma_debug_resize_entries(u32 num_entries)
>
>  	return ret;
>  }
> -EXPORT_SYMBOL(dma_debug_resize_entries);
>
>  /*
>   * DMA-API debugging init code
--
To unsubscribe from this list: send the line "unsubscribe linux-alpha" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] dma-debug: simplify counting of preallocated requests
On 24/04/18 15:02, Christoph Hellwig wrote:
> Just keep a single variable with a descriptive name instead of two
> with confusing names.

Reviewed-by: Robin Murphy <robin.mur...@arm.com>

> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---
>  lib/dma-debug.c | 20
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 712a897174e4..075253cb613b 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -132,7 +132,7 @@ static u32 min_free_entries;
>  static u32 nr_total_entries;
>
>  /* number of preallocated entries requested by kernel cmdline */
> -static u32 req_entries;
> +static u32 nr_prealloc_entries = PREALLOC_DMA_DEBUG_ENTRIES;
>
>  /* debugfs dentry's for the stuff above */
>  static struct dentry *dma_debug_dent __read_mostly;
> @@ -1011,7 +1011,6 @@ void dma_debug_add_bus(struct bus_type *bus)
>
>  static int dma_debug_init(void)
>  {
> -	u32 num_entries;
>  	int i;
>
>  	/* Do not use dma_debug_initialized here, since we really want to be
> @@ -1032,12 +1031,7 @@ static int dma_debug_init(void)
>  		return 0;
>  	}
>
> -	if (req_entries)
> -		num_entries = req_entries;
> -	else
> -		num_entries = PREALLOC_DMA_DEBUG_ENTRIES;
> -
> -	if (prealloc_memory(num_entries) != 0) {
> +	if (prealloc_memory(nr_prealloc_entries) != 0) {
>  		pr_err("DMA-API: debugging out of memory error - disabled\n");
>  		global_disable = true;
>
> @@ -1068,16 +1062,10 @@ static __init int dma_debug_cmdline(char *str)
>  static __init int dma_debug_entries_cmdline(char *str)
>  {
> -	int res;
> -
>  	if (!str)
>  		return -EINVAL;
> -
> -	res = get_option(&str, &req_entries);
> -
> -	if (!res)
> -		req_entries = 0;
> -
> +	if (!get_option(&str, &nr_prealloc_entries))
> +		nr_prealloc_entries = PREALLOC_DMA_DEBUG_ENTRIES;
>  	return 0;
>  }
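
The single-variable pattern in this patch (a default baked into the initialiser, overwritten only when the command line supplies a usable number) can be tried out in userspace. This is a rough sketch: the kernel's `get_option()` is replaced by the standard `strtoul()`, which is an assumption of the example and does not parse exactly the same syntax, and `model_entries_cmdline()` is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Default taken from the series; overridable at build time as there. */
#ifndef PREALLOC_DMA_DEBUG_ENTRIES
#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
#endif

/* One variable holds both the default and any command-line override. */
static uint32_t nr_prealloc_entries = PREALLOC_DMA_DEBUG_ENTRIES;

/*
 * Model of the option handler: strtoul() stands in for the kernel's
 * get_option(). If no number can be parsed, restore the default.
 */
static int model_entries_cmdline(const char *str)
{
	char *end;
	unsigned long val;

	if (!str)
		return -EINVAL;

	val = strtoul(str, &end, 0);
	if (end == str)		/* nothing parsed */
		nr_prealloc_entries = PREALLOC_DMA_DEBUG_ENTRIES;
	else
		nr_prealloc_entries = (uint32_t)val;
	return 0;
}
```

Because the variable always holds a usable count, the init path can pass it straight to the preallocation routine without a separate "was it requested?" branch.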
Re: [PATCH 01/22] dma-debug: move initialization to common code
Hi Christoph,

Nice cleanup! Looks good overall, just a couple of nits.

On 20/04/18 09:02, Christoph Hellwig wrote:
[...]
> diff --git a/lib/dma-debug.c b/lib/dma-debug.c
> index 7f5cdc1e6b29..712a897174e4 100644
> --- a/lib/dma-debug.c
> +++ b/lib/dma-debug.c
> @@ -41,6 +41,11 @@
>  #define HASH_FN_SHIFT 13
>  #define HASH_FN_MASK (HASH_SIZE - 1)
>
> +/* allow architectures to override this if absolutely required */
> +#ifndef PREALLOC_DMA_DEBUG_ENTRIES
> +#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
> +#endif
> +
>  enum {
>  	dma_debug_single,
>  	dma_debug_page,
> @@ -1004,18 +1009,16 @@ void dma_debug_add_bus(struct bus_type *bus)
>  	bus_register_notifier(bus, nb);
>  }
>
> -/*
> - * Let the architectures decide how many entries should be preallocated.
> - */
> -void dma_debug_init(u32 num_entries)
> +static int dma_debug_init(void)
>  {
> +	u32 num_entries;

Maybe initialise it to PREALLOC_DMA_DEBUG_ENTRIES?

>  	int i;
>
>  	/* Do not use dma_debug_initialized here, since we really want to be
>  	 * called to set dma_debug_initialized
>  	 */
>  	if (global_disable)
> -		return;
> +		return 0;
>
>  	for (i = 0; i < HASH_SIZE; ++i) {
>  		INIT_LIST_HEAD(&dma_entry_hash[i].list);
> @@ -1026,17 +1029,19 @@ void dma_debug_init(u32 num_entries)
>  		pr_err("DMA-API: error creating debugfs entries - disabling\n");
>  		global_disable = true;
>
> -		return;
> +		return 0;
>  	}
>
>  	if (req_entries)
>  		num_entries = req_entries;
> +	else
> +		num_entries = PREALLOC_DMA_DEBUG_ENTRIES;
>
>  	if (prealloc_memory(num_entries) != 0) {
>  		pr_err("DMA-API: debugging out of memory error - disabled\n");
>  		global_disable = true;
>
> -		return;
> +		return 0;
>  	}
>
>  	nr_total_entries = num_free_entries;
> @@ -1044,7 +1049,9 @@ void dma_debug_init(u32 num_entries)
>  	dma_debug_initialized = true;
>
>  	pr_info("DMA-API: debugging enabled by kernel config\n");
> +	return 0;
>  }
> +core_initcall(dma_debug_init);

I think it's worth noting that for most users this now happens much
earlier than before. In general that's probably good (e.g. on arm64 it
should prevent false-positives from the Arm SMMU drivers under ACPI),
and I can't imagine it's high-risk, but it is a behaviour change.

Robin.

>
>  static __init int dma_debug_cmdline(char *str)
>  {
Re: [PATCH 31/33] dma-direct: reject too small dma masks
On 10/01/18 15:32, Christoph Hellwig wrote:
> On Wed, Jan 10, 2018 at 11:49:34AM +, Robin Murphy wrote:
>>> +#ifdef CONFIG_ZONE_DMA
>>> +	if (mask < DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
>>> +		return 0;
>>> +#else
>>> +	/*
>>> +	 * Because 32-bit DMA masks are so common we expect every architecture
>>> +	 * to be able to satisfy them - either by not supporting more physical
>>> +	 * memory, or by providing a ZONE_DMA32. If neither is the case, the
>>> +	 * architecture needs to use an IOMMU instead of the direct mapping.
>>> +	 */
>>> +	if (mask < DMA_BIT_MASK(32))
>>> +		return 0;
>>
>> Do you think it's worth the effort to be a little more accommodating
>> here? i.e.:
>>
>> 	return dma_max_pfn(dev) >= max_pfn;
>>
>> We seem to have a fair few 28-31 bit masks for older hardware which
>> probably associates with host systems packing equivalently small
>> amounts of RAM.
>
> And those devices don't have a ZONE_DMA? I think we could do something
> like that, but I'd rather have it as a separate commit with a good
> explanation. Maybe you can just send one on top of the series?

Good point - other than the IXP4xx platform and possibly the Broadcom
network drivers, it's probably only x86-relevant stuff where the concern
is moot. Let's just keep the simple assumption then, until actually
proven otherwise.

Robin.
Re: [PATCH 27/33] dma-direct: use node local allocations for coherent memory
On 10/01/18 15:30, Christoph Hellwig wrote:
> On Wed, Jan 10, 2018 at 12:06:22PM +, Robin Murphy wrote:
>> On 10/01/18 08:00, Christoph Hellwig wrote:
>>> To preserve the x86 behavior.
>>
>> And combined with patch 10/22 of the SWIOTLB refactoring, this means
>> SWIOTLB allocations will also end up NUMA-aware, right? Great, that's
>> what we want on arm64 too :)
>
> Well, only for swiotlb allocations that can be satisfied by
> dma_direct_alloc. If we actually have to fall back to the swiotlb
> buffers there is no node affinity yet.

Yeah, when I looked into it I reached the conclusion that per-node bounce
buffers probably weren't worth it - if you have to bounce you've already
pretty much lost the performance game, and if the CPU doing the bouncing
happens to be on a different node from the device you've certainly lost
either way. Per-node CMA zones we definitely *would* like, but that's a
future problem (it looks technically feasible without huge infrastructure
changes, but fiddly).

Robin.
Re: [PATCH 11/33] dma-mapping: move swiotlb arch helpers to a new header
On 10/01/18 08:00, Christoph Hellwig wrote:
> phys_to_dma, dma_to_phys and dma_capable are helpers published by
> architecture code for use of swiotlb and xen-swiotlb only. Drivers are
> not supposed to use these directly, but use the DMA API instead.
>
> Move these to a new asm/dma-direct.h helper, included by a
> linux/dma-direct.h wrapper that provides the default linear mapping
> unless the architecture wants to override it.
>
> Signed-off-by: Christoph Hellwig
> ---
[...]
>  drivers/crypto/marvell/cesa.c | 1 +
>  drivers/mtd/nand/qcom_nandc.c | 1 +

I took a look at these, and it seems their phys_to_dma() usage is doing
the thing which we subsequently formalised as dma_map_resource(). I've
had a crack at a quick patch to update the CESA driver; qcom_nandc looks
slightly more complex in that the changes probably need to span the BAM
dmaengine driver as well.

In the process, though, I stumbled across gen_pool_dma_alloc() - yuck,
something needs doing there, for sure...

Robin.
Re: [PATCH 31/33] dma-direct: reject too small dma masks
On 10/01/18 08:00, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---
>  include/linux/dma-direct.h | 1 +
>  lib/dma-direct.c | 19 +++
>  2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 4788bf0bf683..bcdb1a3e4b1f 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -42,5 +42,6 @@ void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
>  		gfp_t gfp, unsigned long attrs);
>  void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
>  		dma_addr_t dma_addr, unsigned long attrs);
> +int dma_direct_supported(struct device *dev, u64 mask);
>
>  #endif /* _LINUX_DMA_DIRECT_H */
> diff --git a/lib/dma-direct.c b/lib/dma-direct.c
> index 784a68dfdbe3..40b1f92f2214 100644
> --- a/lib/dma-direct.c
> +++ b/lib/dma-direct.c
> @@ -122,6 +122,24 @@ static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
>  	return nents;
>  }
>
> +int dma_direct_supported(struct device *dev, u64 mask)
> +{
> +#ifdef CONFIG_ZONE_DMA
> +	if (mask < DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> +		return 0;
> +#else
> +	/*
> +	 * Because 32-bit DMA masks are so common we expect every architecture
> +	 * to be able to satisfy them - either by not supporting more physical
> +	 * memory, or by providing a ZONE_DMA32. If neither is the case, the
> +	 * architecture needs to use an IOMMU instead of the direct mapping.
> +	 */
> +	if (mask < DMA_BIT_MASK(32))
> +		return 0;

Do you think it's worth the effort to be a little more accommodating
here? i.e.:

	return dma_max_pfn(dev) >= max_pfn;

We seem to have a fair few 28-31 bit masks for older hardware which
probably associates with host systems packing equivalently small amounts
of RAM.

Otherwise though,

Reviewed-by: Robin Murphy <robin.mur...@arm.com>

Robin.

> +#endif
> +	return 1;
> +}
> +
>  static int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)
>  {
>  	return dma_addr == DIRECT_MAPPING_ERROR;
> @@ -132,6 +150,7 @@ const struct dma_map_ops dma_direct_ops = {
>  	.free = dma_direct_free,
>  	.map_page = dma_direct_map_page,
>  	.map_sg = dma_direct_map_sg,
> +	.dma_supported = dma_direct_supported,
>  	.mapping_error = dma_direct_mapping_error,
>  };
>  EXPORT_SYMBOL(dma_direct_ops);
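
To make the mask comparison concrete, here is a small userspace model of the check in the patch next to the more accommodating alternative raised in the review. It is a sketch under assumptions: the `#ifdef CONFIG_ZONE_DMA` branch is modelled by a run-time `zone_dma_bits` parameter, and the pfn comparison is expressed in byte addresses via a hypothetical `max_phys_addr` argument. Both function names are invented for the example.

```c
#include <stdint.h>

/* Same definition as the kernel macro of this name. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Model of the check in the patch. zone_dma_bits > 0 stands in for
 * CONFIG_ZONE_DMA with ARCH_ZONE_DMA_BITS; 0 models the #else branch.
 */
static int model_direct_supported(uint64_t mask, unsigned int zone_dma_bits)
{
	if (zone_dma_bits > 0) {
		if (mask < DMA_BIT_MASK(zone_dma_bits))
			return 0;
	} else {
		if (mask < DMA_BIT_MASK(32))
			return 0;
	}
	return 1;
}

/*
 * Model of the alternative from the review: accept any mask that covers
 * all of the RAM actually present. max_phys_addr is the highest physical
 * byte address in the system (an assumption of this sketch).
 */
static int model_supported_by_ram(uint64_t mask, uint64_t max_phys_addr)
{
	return mask >= max_phys_addr;
}
```

For a device with a 30-bit mask in a machine with 512 MiB of RAM and no ZONE_DMA, the first function rejects the mask while the second accepts it, which is the case the review comment has in mind.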